The open-source AI ecosystem is increasingly constrained by licensing restrictions on training data. FlexOlmo, presented by Sewon Min in ScaleML Lecture 71, addresses this directly: a modular pretraining framework that lets model builders choose exactly which data sources to incorporate, without retraining from scratch. This is part of the ScaleML Series within GPU Mode, focused on machine learning at scale.
The data licensing problem in open LLMs
Most large language models are trained on massive web crawls that mix data under incompatible licenses: commercial, research-only, and public domain sources all blended together. For users who need a model trained exclusively on commercially permissive data, or who want to exclude copyrighted material for legal reasons, the only option today is to retrain from scratch. This creates three compounding problems:

- Cost: retraining a competitive LLM costs millions of dollars in compute
- Reproducibility: different data mixes produce models that are hard to compare fairly
- Flexibility: a single pretraining run bakes in one fixed data policy for the model’s lifetime
FlexOlmo is built on top of OLMo, AI2’s fully open language model. OLMo releases its model weights, training code, training data, and evaluation harness, which makes it well suited as a foundation for this kind of modular research.
What FlexOlmo is
FlexOlmo decomposes pretraining by data source. Instead of training a single model on a fixed mixture, FlexOlmo trains a set of expert modules, one per data source or data group. Each module specializes on its own slice of data, and the modules can be combined post-hoc without any further training.
Modular pretraining
Each data source (e.g., web, books, code, scientific papers) trains its own module independently. Modules share a base architecture but have separate weights.
Mix-and-match
At deployment time, a user selects which modules to include. The combined model reflects only the data policy of the chosen sources — no retraining required.
Flexible licensing
A research lab can ship a model trained on all data, while an enterprise user can request a version that excludes any non-commercial sources.
Open infrastructure
Built on the fully open OLMo stack: weights, data, and training code are all publicly released alongside FlexOlmo.
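The mix-and-match idea above can be sketched as a simple policy filter over a catalogue of modules. This is only an illustration: `ExpertModule`, `CATALOGUE`, `select_modules`, and the license labels are hypothetical names invented for this sketch, not part of FlexOlmo or OLMo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertModule:
    """Hypothetical record describing one independently trained module."""
    name: str         # data source the module was trained on
    license_tag: str  # coarse license label for that source (assumed taxonomy)


# Illustrative catalogue; sources and license labels are placeholders.
CATALOGUE = [
    ExpertModule("web", "permissive"),
    ExpertModule("code", "permissive"),
    ExpertModule("books", "non-commercial"),
    ExpertModule("papers", "research-only"),
]


def select_modules(catalogue, allowed_licenses):
    """Return the modules whose license tag is permitted by the data policy."""
    return [m for m in catalogue if m.license_tag in allowed_licenses]


# A research build may use everything; an enterprise build keeps only
# commercially permissive sources.
research_build = select_modules(
    CATALOGUE, {"permissive", "non-commercial", "research-only"}
)
enterprise_build = select_modules(CATALOGUE, {"permissive"})

print([m.name for m in research_build])    # ['web', 'code', 'books', 'papers']
print([m.name for m in enterprise_build])  # ['web', 'code']
```

In this sketch the data policy is just a set of allowed license labels, so serving a new compliance profile means changing the set rather than retraining anything.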
Mix-and-match: combining expert modules
The core technical challenge in FlexOlmo is how to combine independently trained modules into a single coherent model. The lecture covers two main approaches:
Model merging
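As a rough illustration of weight merging, the sketch below linearly interpolates parameters that share names and shapes. It uses plain Python dicts of lists as a stand-in for real checkpoints; `merge_linear` and the toy parameter names are assumptions made for this example, not code from the lecture.

```python
def merge_linear(state_dicts, coefficients):
    """Weighted average of parameter dicts that share keys and shapes."""
    if len(state_dicts) != len(coefficients):
        raise ValueError("need one coefficient per model")
    total = sum(coefficients)
    if total <= 0:
        raise ValueError("coefficients must sum to a positive value")
    weights = [c / total for c in coefficients]  # normalise so they sum to 1

    merged = {}
    for key in state_dicts[0]:
        # Walk the same parameter across all models, element by element.
        columns = zip(*(sd[key] for sd in state_dicts))
        merged[key] = [
            sum(w * v for w, v in zip(weights, col)) for col in columns
        ]
    return merged


# Toy "checkpoints" with identical structure (placeholder values).
web_expert = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
code_expert = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}

merged = merge_linear([web_expert, code_expert], [0.5, 0.5])
print(merged)  # {'layer.weight': [2.0, 4.0], 'layer.bias': [0.5]}
```

Dropping a data source in this scheme amounts to leaving its checkpoint out of the list before merging.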
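Model merging averages or interpolates the weights of independently trained models. The simplest form is linear interpolation of the weights; task arithmetic is a closely related variant that adds scaled weight differences from the shared base.
Mixture of experts (MoE) composition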
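A minimal sketch of token routing over a restricted set of experts, assuming the router scores are already computed. The function names, expert names, and scores are hypothetical; the point is only that experts excluded by the data policy never receive gate weight.

```python
import math


def route_token(router_logits, active_experts, top_k=1):
    """Pick the top-k active experts for one token and return gate weights.

    router_logits: dict of expert name -> raw score from a (hypothetical) router.
    active_experts: names of the experts this deployment is allowed to use.
    """
    candidates = {n: s for n, s in router_logits.items() if n in active_experts}
    if not candidates:
        raise ValueError("no active expert available for routing")
    chosen = sorted(candidates, key=candidates.get, reverse=True)[:top_k]
    # Softmax over the chosen experts only (subtract the max for stability).
    peak = max(candidates[n] for n in chosen)
    exps = {n: math.exp(candidates[n] - peak) for n in chosen}
    norm = sum(exps.values())
    return {n: e / norm for n, e in exps.items()}


def combine(expert_outputs, gates):
    """Gate-weighted sum of per-expert output vectors."""
    dim = len(next(iter(expert_outputs.values())))
    return [
        sum(gates[n] * expert_outputs[n][i] for n in gates) for i in range(dim)
    ]


# Placeholder scores and outputs; "books" is excluded by the data policy.
logits = {"web": 0.2, "code": 1.5, "books": 2.0}
gates = route_token(logits, active_experts={"web", "code"}, top_k=2)
outputs = {"web": [1.0, 0.0], "code": [0.0, 1.0], "books": [5.0, 5.0]}
mixed = combine(outputs, gates)
print(gates, mixed)
```

Even though "books" has the highest raw score here, it is filtered out before the softmax, so it contributes nothing to the combined output.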
A more expressive alternative routes each token through a learned gating function that selects which expert modules to activate. This preserves more of each expert’s specialization but adds inference complexity.
Technical approach: the FlexOlmo training recipe
Train a shared base model
All expert modules begin from the same pretrained checkpoint. This shared initialization ensures the modules speak the same “language” and can be merged later without catastrophic interference.
Partition data by source
Training data is split into discrete groups by provenance: web crawl, books, code repositories, scientific papers, etc. Each group has a known license profile.
Continue pretraining per source
Each module continues training from the shared base using only its designated data source. This is a standard continued pretraining run — no architectural changes required.
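The steps above can be sketched as a small planning script: group documents by provenance, then describe one continued-pretraining job per group, each starting from the same base checkpoint. The document schema, the `plan_runs` helper, and the checkpoint name are assumptions for illustration; no actual training is performed.

```python
from collections import defaultdict


def partition_by_source(documents):
    """Group documents by their 'source' field (provenance label)."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["source"]].append(doc)
    return dict(groups)


def plan_runs(groups, base_checkpoint):
    """Describe one continued-pretraining job per data source."""
    return [
        {"module": source, "init_from": base_checkpoint, "num_docs": len(docs)}
        for source, docs in sorted(groups.items())
    ]


# Toy corpus; the text fields are placeholders.
corpus = [
    {"source": "web", "text": "..."},
    {"source": "code", "text": "..."},
    {"source": "web", "text": "..."},
    {"source": "papers", "text": "..."},
]

# "base-checkpoint" is a hypothetical identifier for the shared base model.
runs = plan_runs(partition_by_source(corpus), base_checkpoint="base-checkpoint")
for run in runs:
    print(run)
```

Every job lists the same `init_from` value, which mirrors the shared-initialization requirement in the first step.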
Evaluation: flexible data policies vs. downstream tasks
A key question is whether mix-and-match degradation is acceptable: does selectively excluding data sources meaningfully hurt model quality? The lecture presents evaluations across:

| Benchmark | What it tests |
|---|---|
| MMLU | Broad world knowledge across 57 subjects |
| HellaSwag | Commonsense reasoning |
| ARC-Challenge | Science QA requiring multi-step reasoning |
| TruthfulQA | Tendency to generate false but plausible statements |
| WinoGrande | Pronoun resolution / commonsense |
Results show that FlexOlmo models that exclude certain data sources do lose some capability on tasks correlated with those sources (e.g., excluding scientific papers slightly reduces MMLU science scores). However, the degradation is substantially smaller than the loss incurred by training from scratch on only the permitted data.
Implications for the open-source AI ecosystem
FlexOlmo changes the economics of data compliance in open models:

- One training run, many data policies: organizations can serve multiple compliance profiles from a single modular training run
- Auditable provenance: because each module is tied to a specific data source, capability claims are falsifiable
- Community contributions: third parties can train and release expert modules for new data sources that plug into the existing base
Connection to OLMo and AI2’s open models
FlexOlmo is a direct extension of the OLMo project at the Allen Institute for AI (AI2). OLMo distinguishes itself from other open-weight models by releasing:

- Full training code (not just inference code)
- Complete training data (Dolma dataset)
- Intermediate checkpoints saved at regular intervals throughout training
- A standardized evaluation harness (OLMES)
Lecture references
Lecture 71 slides
ScaleML Lecture 71 slides by Sewon Min (PDF in the lecture_071 folder)
Sewon Min
Speaker homepage — research on language models, retrieval, and data
OLMo at AI2
The open language model project that FlexOlmo extends
GPU Mode YouTube
Full lecture recordings on the GPU Mode YouTube channel