Lineage

The Trustworthy Model Registry tracks model lineage—the parent-child relationships between models—to provide transparency into model provenance and enable dependency analysis.

What is lineage?

Lineage captures the ancestry of a model by identifying which base models it was fine-tuned from. For example, in the chain bert-base-uncased → bert-large-cased → your-custom-bert:

bert-base-uncased is the parent of bert-large-cased
your-custom-bert is a child of bert-large-cased
bert-base-uncased is a grandparent of your-custom-bert

Lineage extraction

Source: src/metrics/treescore.py:190-256

Parent model detection

The registry extracts parent models from config.json metadata:

# From src/metrics/treescore.py:190-256
def extract_parents_from_resource(resource: Dict[str, Any]) -> List[str]:
    parents: Set[str] = set()
    
    # 1. PRIMARY SOURCE: config.json (structured metadata)
    cfg = resource.get("config") or {}
    candidate_keys = (
        "base_model",
        "teacher_model",
        "parent_model",
        "source_model",
        "original_model",
        "pretrained_model_name_or_path",
    )
    
    for key in candidate_keys:
        val = cfg.get(key)
        if isinstance(val, str) and "/" in val:
            parents.add(normalize_hf_id(val))
    
    # 2. SECONDARY SOURCE: HF metadata tags
    hf_meta = resource.get("hf_metadata") or {}
    tags = hf_meta.get("tags") or []
    for t in tags:
        if "/" in t and not t.startswith(("task:", "pipeline:", "license:")):
            parents.add(normalize_hf_id(t))
    
    return sorted(parents)
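The excerpt above calls normalize_hf_id, which is not shown on this page. The following is a minimal sketch of the config.json step only, with a hypothetical stand-in for normalize_hf_id (assumed here to strip a huggingface.co URL prefix and stray slashes; the real helper may behave differently). The function name parents_from_config is also made up for illustration.

```python
from typing import Any, Dict, List, Set

CANDIDATE_KEYS = (
    "base_model",
    "teacher_model",
    "parent_model",
    "source_model",
    "original_model",
    "pretrained_model_name_or_path",
)


def normalize_hf_id(value: str) -> str:
    """Hypothetical stand-in for the registry's helper (an assumption, not its real logic)."""
    value = value.strip()
    for prefix in ("https://huggingface.co/", "http://huggingface.co/"):
        if value.startswith(prefix):
            value = value[len(prefix):]
    return value.strip("/")


def parents_from_config(cfg: Dict[str, Any]) -> List[str]:
    """Collect unique parent IDs from the candidate keys of a config dict."""
    parents: Set[str] = set()
    for key in CANDIDATE_KEYS:
        val = cfg.get(key)
        # Same acceptance test as the excerpt: a string that contains "/"
        if isinstance(val, str) and "/" in val:
            parents.add(normalize_hf_id(val))
    return sorted(parents)


cfg = {
    "base_model": "https://huggingface.co/google-bert/bert-base-uncased",
    "pretrained_model_name_or_path": "google-bert/bert-base-uncased",
    "teacher_model": "bert-large-uncased",  # no "/", so the check above skips it
    "num_labels": 2,                        # not a lineage key
}
print(parents_from_config(cfg))  # ['google-bert/bert-base-uncased']
```

Because the results are collected in a set, two keys that normalize to the same ID yield a single parent.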

Extraction priority

config.json (highest priority)

The config.json file contains structured metadata set during model training:

{
  "base_model": "bert-base-uncased",
  "pretrained_model_name_or_path": "google-bert/bert-base-uncased"
}

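HF metadata tags (secondary source)

The extractor also scans Hugging Face tags. In the excerpt above this happens regardless of what config.json yielded, and the results are merged. A tag counts as a parent only if it contains "/" and does not start with task:, pipeline:, or license:, so tags like these are ignored: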

tags = ["bert-base", "pytorch", "fill-mask"]
# "bert-base" has no "/" → ignored
# "pytorch" → ignored (not a model ID)
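The tag filter from the extraction excerpt can be tried in isolation. This is a small sketch under the same rule (contains "/" and does not start with one of the three prefixes); the function name parent_like_tags and the sample tags are made up for illustration.

```python
from typing import Iterable, List

IGNORED_PREFIXES = ("task:", "pipeline:", "license:")


def parent_like_tags(tags: Iterable[str]) -> List[str]:
    """Return the tags that the filter would treat as parent model IDs."""
    return sorted(
        t for t in tags
        if isinstance(t, str) and "/" in t and not t.startswith(IGNORED_PREFIXES)
    )


tags = [
    "bert-base",                      # no "/", ignored
    "pytorch",                        # no "/", ignored
    "license:cc/by",                  # made-up tag: has "/", but the prefix excludes it
    "google-bert/bert-base-uncased",  # accepted
]
print(parent_like_tags(tags))  # ['google-bert/bert-base-uncased']
```

The filter is purely syntactic: any tag that contains "/" and lacks one of the three prefixes is accepted, whether or not it names a real model.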

Self-reference filtering

The model itself is excluded from its parent list:

# From src/metrics/treescore.py:252-253
parents.discard(name)  # Don't list model as its own parent

Config.json keys checked

The extractor checks each of these keys:

Key	Description	Example
`base_model`	Base model identifier	`"bert-base-uncased"`
`teacher_model`	For distillation	`"bert-large-uncased"`
`parent_model`	Explicit parent	`"gpt2"`
`source_model`	Source checkpoint	`"roberta-base"`
`original_model`	Original pretrained model	`"t5-small"`
`pretrained_model_name_or_path`	Transformers standard	`"facebook/bart-large"`

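The extractor checks all keys and returns all unique parent references found, so the order of the keys does not affect the result. In the excerpt shown earlier, a value is accepted only if it is a string containing "/", which is why full namespaced IDs are the safer choice.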

Lineage graph construction

Source: src/services/registry.py:150-221

Graph structure

Lineage graphs use a node-and-edge format:

{
  "nodes": [
    {
      "artifact_id": "1",
      "name": "your-custom-bert",
      "source": "config_json",
      "metadata": {}
    },
    {
      "artifact_id": "external:bert-base-uncased",
      "name": "bert-base-uncased",
      "source": "config_json",
      "metadata": {"external": true}
    }
  ],
  "edges": [
    {
      "from_node_artifact_id": "external:bert-base-uncased",
      "to_node_artifact_id": "1",
      "relationship": "base_model"
    }
  ]
}
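The shape above can be written down as Python types. This is a sketch, not the registry's own schema: the class names and the dangling_edges helper are hypothetical, and the field names are taken from the JSON example.

```python
from typing import Any, Dict, List, TypedDict


class LineageNode(TypedDict):
    artifact_id: str
    name: str
    source: str
    metadata: Dict[str, Any]


class LineageEdge(TypedDict):
    from_node_artifact_id: str  # the parent
    to_node_artifact_id: str    # the child
    relationship: str


class LineageGraph(TypedDict):
    nodes: List[LineageNode]
    edges: List[LineageEdge]


def dangling_edges(graph: LineageGraph) -> List[LineageEdge]:
    """Return edges whose endpoints are missing from graph["nodes"]."""
    known = {n["artifact_id"] for n in graph["nodes"]}
    return [
        e for e in graph["edges"]
        if e["from_node_artifact_id"] not in known
        or e["to_node_artifact_id"] not in known
    ]


graph: LineageGraph = {
    "nodes": [
        {"artifact_id": "1", "name": "your-custom-bert",
         "source": "config_json", "metadata": {}},
        {"artifact_id": "external:bert-base-uncased", "name": "bert-base-uncased",
         "source": "config_json", "metadata": {"external": True}},
    ],
    "edges": [
        {"from_node_artifact_id": "external:bert-base-uncased",
         "to_node_artifact_id": "1", "relationship": "base_model"},
    ],
}
print(dangling_edges(graph))  # []
```

Edges point from parent to child, so from_node_artifact_id is always the ancestor.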

Graph building algorithm

# From src/services/registry.py:150-221
def get_lineage_graph(self, id_: str) -> Dict[str, Any]:
    root = self.get(id_)
    nodes: Dict[str, Dict[str, Any]] = {}
    edges: List[Dict[str, str]] = []
    
    # 1. Always include root node first
    nodes[root["id"]] = {
        "artifact_id": root["id"],
        "name": root["name"],
        "source": "config_json",
        "metadata": {}
    }
    
    # 2. Extract parent from metadata
    meta = root.get("metadata") or {}
    hf_parent = meta.get("base_model_name_or_path")
    
    # 3. Check if parent exists in registry
    if hf_parent:
        parent_model = None
        for m in self._models:
            if m.get("name") == hf_parent:
                parent_model = m
                break
        
        if parent_model:
            # Internal parent (already in registry)
            pid = str(parent_model["id"])
            nodes[pid] = {
                "artifact_id": pid,
                "name": parent_model["name"],
                "source": "config_json",
                "metadata": {}
            }
            edges.append({
                "from_node_artifact_id": pid,
                "to_node_artifact_id": root["id"],
                "relationship": "base_model"
            })
        else:
            # External parent (not in registry)
            external_id = f"external:{hf_parent}"
            nodes[external_id] = {
                "artifact_id": external_id,
                "name": hf_parent,
                "source": "config_json",
                "metadata": {"external": True}
            }
            edges.append({
                "from_node_artifact_id": external_id,
                "to_node_artifact_id": root["id"],
                "relationship": "base_model"
            })
    
    return {"nodes": list(nodes.values()), "edges": edges}

Internal vs external parents

Internal parent

When the parent model exists in the registry:

{
  "artifact_id": "2",
  "name": "bert-base-uncased",
  "source": "config_json",
  "metadata": {}
}

Uses the registry’s assigned ID.

External parent

When the parent model is not in the registry:

{
  "artifact_id": "external:bert-base-uncased",
  "name": "bert-base-uncased",
  "source": "config_json",
  "metadata": {"external": true}
}

Uses a synthetic ID with external: prefix.

External parents allow the graph to reference models not yet ingested into the registry.
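The internal/external decision can be isolated into one small function. This sketch mirrors the lookup in the graph-building excerpt; the names parent_node and is_external and the sample registry contents are hypothetical.

```python
from typing import Any, Dict, List


def parent_node(models: List[Dict[str, Any]], parent_name: str) -> Dict[str, Any]:
    """Build the node for a parent: internal if the name is registered, else external."""
    for m in models:
        if m.get("name") == parent_name:
            return {"artifact_id": str(m["id"]), "name": m["name"],
                    "source": "config_json", "metadata": {}}
    return {"artifact_id": f"external:{parent_name}", "name": parent_name,
            "source": "config_json", "metadata": {"external": True}}


def is_external(node: Dict[str, Any]) -> bool:
    """True when the node carries the external marker in its metadata."""
    return bool(node.get("metadata", {}).get("external"))


registry_models = [{"id": 2, "name": "bert-base-uncased"}]  # hypothetical contents

internal = parent_node(registry_models, "bert-base-uncased")
external = parent_node(registry_models, "roberta-base")
print(internal["artifact_id"], is_external(internal))  # 2 False
print(external["artifact_id"], is_external(external))  # external:roberta-base True
```

Matching is by exact name, so a parent registered under a different spelling of its ID would show up as external.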

Tree score calculation

Source: src/metrics/treescore.py

The tree score evaluates a model in the context of its ancestry:

Calculation algorithm

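# From src/metrics/treescore.py:258-283
import time

def metric(resource: Dict[str, Any]) -> Tuple[float, int]:
    # NOTE: the timing lines are an assumption. The excerpt returned `latency`
    # without showing how it is measured; milliseconds are assumed here.
    start = time.perf_counter()
    repo = normalize_hf_id(resource.get("name"))
    
    # 1. Compute model's own net score
    own = _net(repo)
    
    # 2. Walk lineage tree to get ancestor scores
    ancestor_sum, ancestor_count = _walk(repo, seen=set())
    
    # 3. Combine scores
    if ancestor_count == 0:
        score = own  # No parents, use own score
    else:
        avg_ancestor = ancestor_sum / ancestor_count
        score = (own + avg_ancestor) / 2.0
    
    # Clamp to [0, 1] and report elapsed time
    latency = int((time.perf_counter() - start) * 1000)
    return float(max(0, min(1, score))), latency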
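The combination step is easy to check by hand once it is separated from the scoring and walking code. The following sketch takes the ancestor scores as a plain list; combine_tree_score is a hypothetical name, and the arithmetic follows the excerpt above.

```python
from typing import Sequence


def combine_tree_score(own: float, ancestor_scores: Sequence[float]) -> float:
    """Average the model's own score with the mean ancestor score, clamped to [0, 1]."""
    if not ancestor_scores:
        score = own  # no ancestors: the tree score is the model's own net score
    else:
        avg_ancestor = sum(ancestor_scores) / len(ancestor_scores)
        score = (own + avg_ancestor) / 2.0
    return float(max(0.0, min(1.0, score)))


print(combine_tree_score(0.75, []))          # own score only
print(combine_tree_score(0.75, [0.9]))       # one parent
print(combine_tree_score(0.75, [0.8, 0.9]))  # parent and grandparent
```

The model's own score always carries half the weight, however many ancestors there are; the other half is shared equally among all ancestors found by the walk.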

Recursive tree walking

# From src/metrics/treescore.py:169-188
def _walk(repo: str, seen: Set[str]) -> Tuple[float, int]:
    if repo in seen:
        return 0.0, 0  # Cycle detection
    seen.add(repo)
    
    parents = _parents(repo)
    total = 0.0
    count = 0
    
    for p in parents:
        # Score this parent
        pn = _net(p)
        total += pn
        count += 1
        
        # Recursively score parent's ancestors
        ssum, scnt = _walk(p, seen)
        total += ssum
        count += scnt
    
    return total, count
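To see what the walk returns, it helps to run it against fixed data. In this sketch the PARENTS and NET dictionaries are hypothetical stand-ins for _parents() and _net(); the traversal itself follows the excerpt.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lineage and net scores, standing in for _parents() and _net()
PARENTS: Dict[str, List[str]] = {
    "task-specific-bert": ["domain-bert"],
    "domain-bert": ["bert-base-uncased"],
    "bert-base-uncased": [],
}
NET: Dict[str, float] = {
    "task-specific-bert": 0.75,
    "domain-bert": 0.8,
    "bert-base-uncased": 0.9,
}


def walk(repo: str, seen: Set[str]) -> Tuple[float, int]:
    """Return (sum of ancestor scores, number of ancestors scored)."""
    if repo in seen:
        return 0.0, 0
    seen.add(repo)
    total, count = 0.0, 0
    for parent in PARENTS.get(repo, []):
        # Score the direct parent, then everything above it
        total += NET[parent]
        count += 1
        sub_total, sub_count = walk(parent, seen)
        total += sub_total
        count += sub_count
    return total, count


ancestor_sum, ancestor_count = walk("task-specific-bert", set())
print(ancestor_sum, ancestor_count)
```

The starting model's own score is never part of the sum; only its parents and their ancestors are counted.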

Net score computation

# From src/metrics/treescore.py:101-132
def _net(repo: str) -> float:
    metrics = _load_metrics()  # All other metrics except treescore
    resource = {
        "name": repo,
        "url": f"https://huggingface.co/{repo}",
        "category": "MODEL",
    }
    
    scores = []
    for name, fn in metrics.items():
        try:
            s, _lat = fn(resource)
            sc = _scalar(name, s)
            if sc is not None:
                scores.append(sc)
        except Exception:
            # Skip a failing metric rather than failing the whole net score
            pass
    
    return float(sum(scores)/len(scores)) if scores else 0.0

The net score is computed from all metrics except treescore to avoid circular dependencies.
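The averaging behaviour, including what happens when a metric raises, can be sketched with stub metrics. Everything named here (METRICS, flaky_metric, net_score) is hypothetical; the real metric set comes from _load_metrics(), and the _scalar conversion step from the excerpt is left out.

```python
from typing import Any, Callable, Dict, Tuple

Metric = Callable[[Dict[str, Any]], Tuple[float, int]]


def flaky_metric(resource: Dict[str, Any]) -> Tuple[float, int]:
    raise RuntimeError("upstream API unavailable")


# Hypothetical metrics, each returning (score, latency)
METRICS: Dict[str, Metric] = {
    "license": lambda r: (1.0, 3),
    "ramp_up": lambda r: (0.5, 7),
    "flaky": flaky_metric,
}


def net_score(resource: Dict[str, Any], metrics: Dict[str, Metric]) -> float:
    """Average the scores of all metrics that ran successfully."""
    scores = []
    for name, fn in metrics.items():
        try:
            score, _latency = fn(resource)
        except Exception:
            continue  # a failing metric is skipped, not counted as 0
        scores.append(float(score))
    return sum(scores) / len(scores) if scores else 0.0


print(net_score({"name": "org/model"}, METRICS))  # 0.75
```

Skipping a failed metric shrinks the denominator, so a model is not penalized for a metric that could not be computed. If every metric fails, the result is 0.0.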

Lineage API

Endpoint: GET /artifact/model/{id}/lineage

Source: src/api/routers/models.py:714-731

Request

GET /artifact/model/3/lineage

Response

{
  "nodes": [
    {
      "artifact_id": "3",
      "name": "my-fine-tuned-model",
      "source": "config_json",
      "metadata": {}
    },
    {
      "artifact_id": "2",
      "name": "bert-base-uncased",
      "source": "config_json",
      "metadata": {}
    },
    {
      "artifact_id": "external:roberta-base",
      "name": "roberta-base",
      "source": "config_json",
      "metadata": {"external": true}
    }
  ],
  "edges": [
    {
      "from_node_artifact_id": "2",
      "to_node_artifact_id": "3",
      "relationship": "base_model"
    },
    {
      "from_node_artifact_id": "external:roberta-base",
      "to_node_artifact_id": "2",
      "relationship": "base_model"
    }
  ]
}
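A client usually wants to answer "who are the parents of X?" from this payload. The sketch below parses a response body shaped like the one above (trimmed to the fields it needs) and groups parent names by child ID; parents_by_child is a hypothetical helper, not part of the API.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List

response_body = """
{
  "nodes": [
    {"artifact_id": "3", "name": "my-fine-tuned-model"},
    {"artifact_id": "2", "name": "bert-base-uncased"},
    {"artifact_id": "external:roberta-base", "name": "roberta-base"}
  ],
  "edges": [
    {"from_node_artifact_id": "2", "to_node_artifact_id": "3"},
    {"from_node_artifact_id": "external:roberta-base", "to_node_artifact_id": "2"}
  ]
}
"""


def parents_by_child(graph: Dict[str, List[Dict[str, Any]]]) -> Dict[str, List[str]]:
    """Map each child artifact_id to the names of its direct parents."""
    names = {n["artifact_id"]: n["name"] for n in graph["nodes"]}
    result: Dict[str, List[str]] = defaultdict(list)
    for e in graph["edges"]:
        result[e["to_node_artifact_id"]].append(names[e["from_node_artifact_id"]])
    return dict(result)


graph = json.loads(response_body)
print(parents_by_child(graph))
# {'3': ['bert-base-uncased'], '2': ['roberta-base']}
```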

Error handling

# From src/api/routers/models.py:718-725
item = _registry.get(id)
if not item:
    raise HTTPException(status_code=404, detail="Artifact does not exist.")

try:
    g = _registry.get_lineage_graph(item["id"])
except KeyError:
    raise HTTPException(status_code=404, detail="Artifact does not exist.")

Lineage examples

Example 1: Simple fine-tuning

Scenario: You fine-tune BERT for sentiment analysis

// your-sentiment-model/config.json
{
  "base_model": "bert-base-uncased",
  "num_labels": 2
}

Extracted lineage:

parents = ["bert-base-uncased"]

Tree score calculation:

own_score = 0.75  # Your model's metrics
parent_score = 0.9  # bert-base-uncased metrics
tree_score = (own_score + parent_score) / 2  # (0.75 + 0.9) / 2 = 0.825

Example 2: Distillation

Scenario: You distill a large model to a smaller one

// your-distilled-model/config.json
{
  "teacher_model": "bert-large-uncased",
  "student_model": "distilbert-base-uncased"
}

Extracted lineage:

parents = ["bert-large-uncased"]  # student_model is not among the keys checked, so it is not extracted

Tree score calculation:

own_score = 0.7
parent_score = 0.85  # bert-large-uncased
tree_score = (own_score + parent_score) / 2  # (0.7 + 0.85) / 2 = 0.775

Example 3: Multi-generation lineage

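Scenario: Fine-tune a fine-tuned model

Tree score for task-specific-bert, whose parent is domain-bert and whose grandparent is bert-base-uncased:

own_score = 0.75
parent_score = 0.8  # domain-bert
grandparent_score = 0.9  # bert-base-uncased

# Walk returns both ancestors
ancestor_sum = parent_score + grandparent_score  # 0.8 + 0.9 = 1.7
ancestor_count = 2
avg_ancestor = ancestor_sum / ancestor_count  # 1.7 / 2 = 0.85

tree_score = (own_score + avg_ancestor) / 2  # (0.75 + 0.85) / 2 = 0.8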

Tree score rewards models built on high-quality foundations. A model with mediocre metrics but excellent parents can still achieve a good tree score.

Cycle detection

Source: src/metrics/treescore.py:169-173

The tree walker detects and breaks cycles:

# From src/metrics/treescore.py:169-173
def _walk(repo: str, seen: Set[str]) -> Tuple[float, int]:
    if repo in seen:
        return 0.0, 0  # Cycle detected, stop recursion
    seen.add(repo)
    # ... continue walking

Why cycles occur

Incorrect metadata in config.json
Models listing themselves as parents
Circular fine-tuning chains (A → B → A)

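Cycles are silently broken to prevent infinite loops: when the walker reaches a model it has already visited, it returns (0.0, 0) and does not descend further, so that visit adds nothing to the ancestor sum or count.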
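A two-model cycle shows the effect concretely. The sketch below reuses the walk logic from the recursive tree walking excerpt with hypothetical data (PARENTS and NET stand in for _parents() and _net()).

```python
from typing import Dict, List, Set, Tuple

# Hypothetical circular metadata: org/a lists org/b as parent and vice versa
PARENTS: Dict[str, List[str]] = {"org/a": ["org/b"], "org/b": ["org/a"]}
NET: Dict[str, float] = {"org/a": 0.6, "org/b": 0.8}


def walk(repo: str, seen: Set[str]) -> Tuple[float, int]:
    if repo in seen:
        return 0.0, 0  # already visited: stop here
    seen.add(repo)
    total, count = 0.0, 0
    for parent in PARENTS.get(repo, []):
        total += NET[parent]  # the parent is scored before the seen check runs on it
        count += 1
        sub_total, sub_count = walk(parent, seen)
        total += sub_total
        count += sub_count
    return total, count


visited: Set[str] = set()
total, count = walk("org/a", visited)
print(total, count, sorted(visited))
```

The walk terminates after visiting each model once. Under this logic the recursion below a revisited model is cut off, but the model that closes the cycle (org/a here) is still scored once as a parent of org/b, so the count is 2 rather than 1.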

Limitations

Shallow graphs

The current implementation only builds one level of lineage (direct parents). Multi-generation graphs require recursive API calls.
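One way to assemble a multi-generation graph on the client side is to call the lineage endpoint repeatedly and merge the one-level results. The sketch below takes the fetching step as a callable so it can be run without a server; expand_lineage, the helper functions, and the fake responses are all hypothetical. In real use, fetch would wrap GET /artifact/model/{id}/lineage.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

Graph = Dict[str, List[Dict[str, Any]]]


def expand_lineage(root_id: str, fetch: Callable[[str], Graph]) -> Graph:
    """Merge one-level lineage graphs by following internal parents upward."""
    nodes: Dict[str, Dict[str, Any]] = {}
    edges: List[Dict[str, Any]] = []
    seen_edges: Set[Tuple[str, str]] = set()
    visited: Set[str] = set()
    queue = [root_id]
    while queue:
        current = queue.pop()
        if current in visited:
            continue  # also guards against cycles
        visited.add(current)
        graph = fetch(current)
        for node in graph["nodes"]:
            nodes.setdefault(node["artifact_id"], node)
        for edge in graph["edges"]:
            key = (edge["from_node_artifact_id"], edge["to_node_artifact_id"])
            if key not in seen_edges:
                seen_edges.add(key)
                edges.append(edge)
            parent_id = edge["from_node_artifact_id"]
            # External parents are not in the registry, so they cannot be fetched
            if not parent_id.startswith("external:"):
                queue.append(parent_id)
    return {"nodes": list(nodes.values()), "edges": edges}


def _node(artifact_id: str, name: str) -> Dict[str, Any]:
    external = artifact_id.startswith("external:")
    return {"artifact_id": artifact_id, "name": name, "source": "config_json",
            "metadata": {"external": True} if external else {}}


def _edge(parent_id: str, child_id: str) -> Dict[str, Any]:
    return {"from_node_artifact_id": parent_id, "to_node_artifact_id": child_id,
            "relationship": "base_model"}


# Fake one-level responses keyed by artifact ID
FAKE_RESPONSES: Dict[str, Graph] = {
    "3": {"nodes": [_node("3", "my-fine-tuned-model"), _node("2", "bert-base-uncased")],
          "edges": [_edge("2", "3")]},
    "2": {"nodes": [_node("2", "bert-base-uncased"),
                    _node("external:roberta-base", "roberta-base")],
          "edges": [_edge("external:roberta-base", "2")]},
}

full = expand_lineage("3", FAKE_RESPONSES.__getitem__)
print([n["artifact_id"] for n in full["nodes"]])  # ['3', '2', 'external:roberta-base']
print(len(full["edges"]))  # 2
```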

Config.json dependency

Lineage extraction requires config.json to be present. Models without it show no parents.

No sibling detection

The graph doesn’t identify sibling models (models sharing the same parent).
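Siblings can be derived on the client side from a collection of lineage edges. This is a sketch with a hypothetical siblings function; each edge is reduced to a (parent_id, child_id) pair, and the sample IDs are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def siblings(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each child to the other children that share at least one parent."""
    children_of: Dict[str, Set[str]] = defaultdict(set)
    for parent, child in edges:
        children_of[parent].add(child)
    result: Dict[str, Set[str]] = defaultdict(set)
    for kids in children_of.values():
        for kid in kids:
            result[kid] |= kids - {kid}
    # Drop children that have no siblings
    return {kid: sibs for kid, sibs in result.items() if sibs}


edges = [
    ("2", "3"),                      # models 3 and 4 share parent 2
    ("2", "4"),
    ("external:roberta-base", "5"),  # model 5 has no siblings
]
print(siblings(edges))
```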

External parents not scored

External parents (not in registry) don’t contribute to tree score calculation.

Best practices

Always set base_model

When fine-tuning, set base_model in your config.json:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)

# Ensure config includes the base model as a full, namespaced ID
model.config.base_model = "google-bert/bert-base-uncased"
model.save_pretrained("./my-model")

Normalize parent IDs

Use full Hugging Face IDs:

✅ "google-bert/bert-base-uncased"
❌ "bert-base"
❌ "BERT"

Ingest parents first

For accurate tree scores, ingest parent models before child models:

# 1. Ingest base model
POST /artifact/model {"url": "https://huggingface.co/bert-base-uncased"}

# 2. Ingest fine-tuned model
POST /artifact/model {"url": "https://huggingface.co/your-bert-model"}
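With more than two models, a parents-first order can be computed instead of worked out by hand. The sketch below uses the standard library's graphlib; the depends_on mapping and the your-org model names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical mapping: each model -> the set of parents it depends on
depends_on = {
    "your-org/task-specific-bert": {"your-org/domain-bert"},
    "your-org/domain-bert": {"google-bert/bert-base-uncased"},
    "google-bert/bert-base-uncased": set(),
}

# static_order() yields every model after all of its parents
ingest_order = list(TopologicalSorter(depends_on).static_order())

for repo in ingest_order:
    print(f'POST /artifact/model {{"url": "https://huggingface.co/{repo}"}}')
```

TopologicalSorter raises graphlib.CycleError if the mapping contains a cycle, which also makes it a cheap check for circular metadata before ingesting.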

Visualize lineage

Use the lineage API to visualize model ancestry:

const response = await fetch('/artifact/model/3/lineage');
const graph = await response.json();
renderGraph(graph.nodes, graph.edges);
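If no front-end renderer is at hand, the same nodes and edges can be turned into Graphviz DOT text and rendered with any DOT viewer. This is a sketch: to_dot is a hypothetical helper, and drawing external parents with a dashed outline is just one possible convention.

```python
from typing import Any, Dict, List


def to_dot(graph: Dict[str, List[Dict[str, Any]]]) -> str:
    """Render a lineage graph as Graphviz DOT text (parents point to children)."""
    lines = ["digraph lineage {"]
    for n in graph["nodes"]:
        # External parents get a dashed outline
        style = ', style="dashed"' if n.get("metadata", {}).get("external") else ""
        lines.append(f'  "{n["artifact_id"]}" [label="{n["name"]}"{style}];')
    for e in graph["edges"]:
        lines.append(
            f'  "{e["from_node_artifact_id"]}" -> "{e["to_node_artifact_id"]}" '
            f'[label="{e["relationship"]}"];'
        )
    lines.append("}")
    return "\n".join(lines)


graph = {
    "nodes": [
        {"artifact_id": "1", "name": "your-custom-bert",
         "source": "config_json", "metadata": {}},
        {"artifact_id": "external:bert-base-uncased", "name": "bert-base-uncased",
         "source": "config_json", "metadata": {"external": True}},
    ],
    "edges": [
        {"from_node_artifact_id": "external:bert-base-uncased",
         "to_node_artifact_id": "1", "relationship": "base_model"},
    ],
}
dot = to_dot(graph)
print(dot)
```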
