The open-source AI ecosystem is increasingly constrained by licensing restrictions on training data. FlexOlmo, presented by Sewon Min in ScaleML Lecture 71, addresses this directly: a modular pretraining framework that lets model builders choose exactly which data sources to incorporate, without retraining from scratch. The lecture is part of the ScaleML Series within GPU Mode, which focuses on machine learning at scale.

The data licensing problem in open LLMs

Most large language models are trained on massive web crawls that mix data under incompatible licenses — commercial, research-only, and public domain sources all blended together. For users who need a model trained exclusively on commercially permissive data, or who want to exclude copyrighted material for legal reasons, the only option today is to retrain from scratch. This creates three compounding problems:
  • Cost: retraining a competitive LLM costs millions of dollars in compute
  • Reproducibility: different data mixes produce models that are hard to compare fairly
  • Flexibility: a single pretraining run bakes in one fixed data policy for the model’s lifetime

FlexOlmo is built on top of OLMo, AI2’s fully open language model. OLMo releases model weights, training code, training data, and an evaluation harness, which makes it well suited as a foundation for this kind of modular research.

What FlexOlmo is

FlexOlmo decomposes pretraining by data source. Instead of training a single model on a fixed mixture, FlexOlmo trains a set of expert modules — one per data source or data group. Each module specializes on its own slice of data, and the modules can be combined post-hoc without any further training.

Modular pretraining

Each data source (e.g., web, books, code, scientific papers) trains its own module independently. Modules share a base architecture but have separate weights.

Mix-and-match

At deployment time, a user selects which modules to include. The combined model reflects only the data policy of the chosen sources — no retraining required.

Flexible licensing

A research lab can ship a model trained on all data, while an enterprise user can request a version that excludes any non-commercial sources.

Open infrastructure

Built on the fully open OLMo stack: weights, data, and training code are all publicly released alongside FlexOlmo.
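
The mix-and-match idea above can be pictured as filtering a registry of modules by license before anything is combined. The following is a minimal sketch under stated assumptions: `ExpertModule`, `license_tag`, `select_modules`, and the tag values are hypothetical names introduced for illustration, not part of any FlexOlmo release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertModule:
    name: str         # data source the module was trained on
    license_tag: str  # hypothetical label for that source's license profile


def select_modules(modules, allowed_licenses):
    """Return the modules whose license tag is permitted by the data policy."""
    allowed = set(allowed_licenses)
    return [m for m in modules if m.license_tag in allowed]


# Hypothetical registry; names and tags are illustrative only.
registry = [
    ExpertModule("web", "permissive"),
    ExpertModule("books", "noncommercial"),
    ExpertModule("code", "permissive"),
    ExpertModule("papers", "research-only"),
]

# A research lab keeps every module; an enterprise user keeps permissive ones only.
research_build = select_modules(
    registry, {"permissive", "noncommercial", "research-only"}
)
enterprise_build = select_modules(registry, {"permissive"})
print([m.name for m in enterprise_build])
```

The same registry serves both users; only the policy passed to `select_modules` differs.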

Mix-and-match: combining expert modules

The core technical challenge in FlexOlmo is how to combine independently trained modules into a single coherent model. The lecture covers two main approaches:

Model merging

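Model merging averages or interpolates the weights of independently trained models. The simplest form adds a weighted sum of task vectors (each expert's delta from the shared base) back onto the base. When the coefficients sum to 1, this is equivalent to linear interpolation between the experts:
# Merge any number of expert models via a weighted sum of task vectors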
import torch

def merge_models(base_state_dict, expert_state_dicts, weights):
    """
    base_state_dict: shared initialization weights
    expert_state_dicts: list of expert module state dicts
    weights: interpolation coefficients (sum to 1.0)
    """
    merged = {}
    for key in base_state_dict:
        # Compute task vectors (delta from base)
        task_vectors = [
            expert[key] - base_state_dict[key]
            for expert in expert_state_dicts
        ]
        # Weighted sum of task vectors
        combined = sum(w * tv for w, tv in zip(weights, task_vectors))
        merged[key] = base_state_dict[key] + combined
    return merged
Model merging works best when individual experts are initialized from the same base checkpoint. FlexOlmo starts all experts from a shared pretrained base, then continues training each on its designated data source.
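
To make the role of the coefficients concrete, the update performed by `merge_models` can be written out for a single parameter tensor. The symbols are notation introduced here, not taken from the lecture: \theta_0 is the shared base, \theta_i is expert i, and w_i is its coefficient.

```latex
\begin{aligned}
\theta_{\text{merged}}
  &= \theta_0 + \sum_{i=1}^{n} w_i \,(\theta_i - \theta_0) \\
  &= \Bigl(1 - \sum_{i=1}^{n} w_i\Bigr)\,\theta_0 + \sum_{i=1}^{n} w_i\,\theta_i
\end{aligned}
```

When the coefficients sum to 1, the base term vanishes and the result is a weighted average of the experts. With any other total, some multiple of the base remains in the mix, which is the more general task-arithmetic setting.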

Mixture of experts (MoE) composition

A more expressive alternative routes each token through a learned gating function that selects which expert modules to activate. This preserves more of each expert’s specialization but adds inference complexity:
import torch
import torch.nn as nn

class FlexRouter(nn.Module):
    """Learned router that selects which expert modules to activate."""
    def __init__(self, hidden_dim, num_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        # x: [batch, seq_len, hidden_dim]
        logits = self.gate(x)  # [batch, seq_len, num_experts]
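        # Keep the top_k highest-scoring experts for each token
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        # Normalize over the selected experts only, so each token's weights sum to 1
        weights = torch.softmax(weights, dim=-1)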
        return weights, indices
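
The router only returns weights and indices; the expert outputs still have to be combined. The sketch below shows that step for a single token using plain Python lists instead of tensors, so it runs without PyTorch. The `active` argument, which restricts routing to the modules allowed by a data policy, is an assumption made for illustration; the source does not say how excluded experts are handled at the router.

```python
import math


def route_token(logits, top_k=2, active=None):
    """Pick the top_k experts for one token and softmax-normalize their logits.

    logits: one router score per expert
    active: optional set of expert indices permitted by the data policy
            (hypothetical mechanism, not confirmed by the lecture)
    """
    candidates = [i for i in range(len(logits)) if active is None or i in active]
    chosen = sorted(candidates, key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


def combine(expert_outputs, chosen, weights):
    """Weighted sum of the chosen experts' output vectors."""
    dim = len(expert_outputs[chosen[0]])
    return [
        sum(w * expert_outputs[i][d] for i, w in zip(chosen, weights))
        for d in range(dim)
    ]


# Toy router scores and expert outputs for one token (illustrative values).
logits = [2.0, 0.5, 2.0, -1.0]
outputs = [[1.0, 0.0], [0.0, 1.0], [3.0, 2.0], [5.0, 5.0]]

chosen, weights = route_token(logits, top_k=2)
mixed = combine(outputs, chosen, weights)

# Same token with expert 2 excluded by the data policy.
chosen_r, weights_r = route_token(logits, top_k=2, active={0, 1, 3})
mixed_r = combine(outputs, chosen_r, weights_r)
```

Because the softmax is taken over the selected experts only, the weights still sum to 1 after an expert is excluded.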

Technical approach: the FlexOlmo training recipe

1. Train a shared base model

All expert modules begin from the same pretrained checkpoint. Because every module starts from the same weights, the modules stay in a common parameter space, which is what allows them to be merged later without severe interference.

2. Partition data by source

Training data is split into discrete groups by provenance: web crawl, books, code repositories, scientific papers, etc. Each group has a known license profile.

3. Continue pretraining per source

Each module continues training from the shared base using only its designated data source. This is a standard continued pretraining run, with no architectural changes required.

4. Merge at deployment time

A user specifies their desired data policy (e.g., “only commercially permissive sources”). FlexOlmo merges the corresponding modules using linear interpolation or learned routing and produces a deployable model.
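
Two of the steps above are simple bookkeeping and can be sketched directly: grouping documents by provenance (step 2) and choosing interpolation coefficients over the selected modules so that they sum to 1 (step 4). This is a minimal sketch. The record fields `source` and `text`, the function names, and the option of weighting by token count are all assumptions made for illustration, not details from the lecture.

```python
from collections import defaultdict


def partition_by_source(documents):
    """Group document texts by their 'source' field (step 2)."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["source"]].append(doc["text"])
    return dict(groups)


def merge_coefficients(selected, token_counts=None):
    """Coefficients over the selected modules that sum to 1.0 (step 4).

    Uniform by default; proportional to token_counts when given.
    The proportional option is a hypothetical choice, not the lecture's recipe.
    """
    if not selected:
        raise ValueError("select at least one module")
    if token_counts is None:
        return {name: 1.0 / len(selected) for name in selected}
    total = sum(token_counts[name] for name in selected)
    return {name: token_counts[name] / total for name in selected}


# Toy corpus; sources and texts are illustrative only.
corpus = [
    {"source": "web", "text": "web page 1"},
    {"source": "code", "text": "def f(): pass"},
    {"source": "web", "text": "web page 2"},
    {"source": "books", "text": "chapter one"},
]
groups = partition_by_source(corpus)

# The policy keeps only "web" and "code"; "books" is excluded.
uniform = merge_coefficients(["web", "code"])
by_size = merge_coefficients(
    ["web", "code"], {"web": 300, "code": 100, "books": 50}
)
```

The coefficients are recomputed over the selected modules only, so dropping a source never leaves the weights summing to less than 1.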

Evaluation: flexible data policies vs. downstream tasks

A key question is whether mix-and-match degradation is acceptable — does selectively excluding data sources meaningfully hurt model quality? The lecture presents evaluations across:
| Benchmark | What it tests |
| --- | --- |
| MMLU | Broad world knowledge across 57 subjects |
| HellaSwag | Commonsense reasoning |
| ARC-Challenge | Science QA requiring multi-step reasoning |
| TruthfulQA | Tendency to generate false but plausible statements |
| WinoGrande | Pronoun resolution / commonsense |
Results show that FlexOlmo models excluding certain data sources do lose some capability on tasks correlated with that source (e.g., excluding scientific papers slightly reduces MMLU science scores). However, this drop is substantially smaller than the drop from training from scratch on only the permitted data.
The key finding is that merging preserves most of the capability of the full-data model even when several sources are excluded: each expert fine-tunes the shared base rather than overwriting it, so the base remains a strong foundation.
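
The comparison described here has three parties: the full-data model, a merged model with some sources excluded, and a model trained from scratch on the permitted data only. A minimal sketch of that comparison follows. The task names and scores are placeholders chosen for easy arithmetic and are not results from the lecture; `score_gaps` is a hypothetical helper.

```python
def score_gaps(reference, candidate):
    """Per-task drop relative to the reference model (positive means worse)."""
    return {
        task: reference[task] - candidate[task]
        for task in reference
        if task in candidate
    }


# Placeholder scores for illustration only; NOT results from the lecture.
full_data = {"task_a": 0.75, "task_b": 0.5}
merged_subset = {"task_a": 0.75, "task_b": 0.375}
scratch_subset = {"task_a": 0.5, "task_b": 0.25}

merged_gap = score_gaps(full_data, merged_subset)
scratch_gap = score_gaps(full_data, scratch_subset)

# The claim under test: merging loses no more than retraining from scratch.
merging_helps = all(merged_gap[t] <= scratch_gap[t] for t in merged_gap)
```

Reporting the gap per task, not as a single average, makes it visible when an excluded source hurts one benchmark far more than the others.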

Implications for the open-source AI ecosystem

FlexOlmo changes the economics of data compliance in open models:
  • One training run, many data policies: organizations can serve multiple compliance profiles from a single modular training run
  • Auditable provenance: because each module is tied to a specific data source, a claim about where a capability comes from can be tested by including or excluding that module
  • Community contributions: third parties can train and release expert modules for new data sources that plug into the existing base
Model merging is not a substitute for careful data governance. FlexOlmo reduces the cost of flexibility, but verifying that a module was actually trained only on its claimed data still requires external audit of the training pipeline.

Connection to OLMo and AI2’s open models

FlexOlmo is a direct extension of the OLMo project at the Allen Institute for AI (AI2). OLMo distinguishes itself from other open-weight models by releasing:
  • Full training code (not just inference code)
  • Complete training data (Dolma dataset)
  • Intermediate checkpoints saved throughout training
  • A standardized evaluation harness (OLMES)
This openness makes OLMo the right substrate for FlexOlmo: you can verify which data went into each module because the full data pipeline is public.

Lecture references

Lecture 71 slides

ScaleML Lecture 71 slides by Sewon Min (PDF in the lecture_071 folder)

Sewon Min

Speaker homepage — research on language models, retrieval, and data

OLMo at AI2

The open language model project that FlexOlmo extends

GPU Mode YouTube

Full lecture recordings on the GPU Mode YouTube channel
