Stage 2 is where the council holds each other accountable. Every council model reads all of the Stage 1 responses — but without knowing who wrote them — and produces a written evaluation followed by a strict ranked list. The anonymization is what makes the peer review meaningful: a model cannot give a favorable review to a peer it recognizes or a poor review to a competitor it dislikes.

Anonymization: Assigning Letter Labels

At the start of stage2_collect_rankings, the backend assigns a sequential letter label to each Stage 1 response:
labels = [chr(65 + i) for i in range(len(stage1_results))]  # A, B, C, ...

label_to_model = {
    f"Response {label}": result['model']
    for label, result in zip(labels, stage1_results)
}
chr(65) is 'A', so four council models produce labels A, B, C, D. The resulting label_to_model dictionary might look like:
{
    "Response A": "openai/gpt-5.1",
    "Response B": "google/gemini-3-pro-preview",
    "Response C": "anthropic/claude-sonnet-4.5",
    "Response D": "x-ai/grok-4"
}
This mapping is stored in metadata and returned by the API, but it is never sent to the ranking models. Each evaluating model only ever sees the letter labels.
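The labelling step can be tried in isolation. The sketch below uses made-up answers and assumes each Stage 1 result carries its text under a `response` key and that the anonymized block is built by joining `Response X:` headers with the answer text; both are assumptions for illustration, not confirmed details of the backend.

```python
# Sample Stage 1 output; the 'response' key is an assumption for this sketch.
stage1_results = [
    {"model": "openai/gpt-5.1", "response": "Answer from the first model."},
    {"model": "google/gemini-3-pro-preview", "response": "Answer from the second model."},
    {"model": "anthropic/claude-sonnet-4.5", "response": "Answer from the third model."},
]

labels = [chr(65 + i) for i in range(len(stage1_results))]  # A, B, C

# Kept server-side: maps each label back to its author.
label_to_model = {
    f"Response {label}": result["model"]
    for label, result in zip(labels, stage1_results)
}

# What an evaluator would see: labels and answer text only (assumed layout).
responses_text = "\n\n".join(
    f"Response {label}:\n{result['response']}"
    for label, result in zip(labels, stage1_results)
)

# Collect any model identifier that leaked into the evaluator-facing text.
leaked = [model for model in label_to_model.values() if model in responses_text]

print(responses_text)
print(leaked)
```

Because the labels come from `chr(65 + i)`, a council of more than 26 models would produce characters past `Z`, which the `Response [A-Z]` pattern used later for parsing would not match.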

The Ranking Prompt

All anonymized responses are concatenated into a single block and embedded in a prompt that instructs the evaluating model to:
  1. Evaluate each response individually — noting strengths and weaknesses.
  2. Conclude with a FINAL RANKING: section that follows a precise numbered-list format.
ranking_prompt = f"""You are evaluating different responses to the following question:

Question: {user_query}

Here are the responses from different models (anonymized):

{responses_text}

Your task:
1. First, evaluate each response individually. For each response, explain what it does well and what it does poorly.
2. Then, at the very end of your response, provide a final ranking.

IMPORTANT: Your final ranking MUST be formatted EXACTLY as follows:
- Start with the line "FINAL RANKING:" (all caps, with colon)
- Then list the responses from best to worst as a numbered list
- Each line should be: number, period, space, then ONLY the response label (e.g., "1. Response A")
- Do not add any other text or explanations in the ranking section

Example of the correct format for your ENTIRE response:

Response A provides good detail on X but misses Y...
Response B is accurate but lacks depth on Z...
Response C offers the most comprehensive answer...

FINAL RANKING:
1. Response C
2. Response A
3. Response B

Now provide your evaluation and ranking:"""
The same prompt is sent to all council models in parallel using query_models_parallel, exactly as in Stage 1.
The strict format requirement (FINAL RANKING: in all caps, one label per numbered line, no trailing commentary) is not a stylistic preference. It is a contract that the parser depends on. If a model deviates from the format, the fallback regexes still attempt an extraction, but reliable extraction requires adherence to the template.
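One way to make the contract explicit is to check a response against it before trusting the parsed result. The helper below, `follows_ranking_contract`, is hypothetical and not part of the project; it is a minimal sketch that accepts a response only if it has exactly one FINAL RANKING: header followed by consecutively numbered lines that name each expected label once.

```python
import re
from typing import List


def follows_ranking_contract(text: str, expected_labels: List[str]) -> bool:
    """Hypothetical check: does `text` end with a well-formed FINAL RANKING block?"""
    if text.count("FINAL RANKING:") != 1:
        return False

    section = text.split("FINAL RANKING:")[1]
    lines = [line.strip() for line in section.splitlines() if line.strip()]

    found = []
    for number, line in enumerate(lines, start=1):
        # Each line must be exactly "<n>. Response <letter>" with nothing after it.
        match = re.fullmatch(r"(\d+)\. (Response [A-Z])", line)
        if match is None or int(match.group(1)) != number:
            return False
        found.append(match.group(2))

    # Every expected label must appear exactly once.
    return sorted(found) == sorted(expected_labels)


expected = ["Response A", "Response B", "Response C"]
good = "A is fine.\n\nFINAL RANKING:\n1. Response C\n2. Response A\n3. Response B"
chatty = "A is fine.\n\nFINAL RANKING:\n1. Response C (best)\n2. Response A\n3. Response B"

print(follows_ranking_contract(good, expected))
print(follows_ranking_contract(chatty, expected))
```

A check like this could be used to flag evaluations whose parsed ranking came from a fallback path and deserves a manual look.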

Parsing Rankings from Free Text

parse_ranking_from_text in council.py extracts the structured ranking from each model’s free-form evaluation:
def parse_ranking_from_text(ranking_text: str) -> List[str]:
    import re

    if "FINAL RANKING:" in ranking_text:
        parts = ranking_text.split("FINAL RANKING:")
        if len(parts) >= 2:
            ranking_section = parts[1]

            # Primary: numbered list format — "1. Response A"
            numbered_matches = re.findall(r'\d+\.\s*Response [A-Z]', ranking_section)
            if numbered_matches:
                return [re.search(r'Response [A-Z]', m).group() for m in numbered_matches]

            # Fallback: any "Response X" in order
            matches = re.findall(r'Response [A-Z]', ranking_section)
            return matches

    # Last resort: scan entire text for "Response X" patterns
    matches = re.findall(r'Response [A-Z]', ranking_text)
    return matches
The function first isolates the ranking section, then tries three extraction strategies in order:
  1. Locate the FINAL RANKING section. Split on "FINAL RANKING:" and work only with the text that follows. This discards the evaluation prose, which may contain incidental mentions of letter labels.
  2. Primary regex (numbered list). \d+\.\s*Response [A-Z] matches lines like 1. Response C. If any matches are found, the Response [A-Z] portion is extracted from each match and returned in order.
  3. Fallback regex (bare labels). If no numbered lines are found, Response [A-Z] is extracted from anywhere in the ranking section, in the order the labels appear.
  4. Last-resort scan. If "FINAL RANKING:" is absent altogether, the regex runs over the entire response text. The labels then come back in the order the prose happens to mention them, possibly with repeats, so the result is rarely a reliable ordering, but it prevents a complete parse failure.
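To see each path in action, the function can be run on three small inputs. The sketch below repeats the function as shown above (with the imports moved to module level) and feeds it made-up evaluation texts; the sample strings are illustrative, not real model output.

```python
import re
from typing import List


def parse_ranking_from_text(ranking_text: str) -> List[str]:
    if "FINAL RANKING:" in ranking_text:
        parts = ranking_text.split("FINAL RANKING:")
        if len(parts) >= 2:
            ranking_section = parts[1]
            # Primary: numbered list format, e.g. "1. Response A"
            numbered = re.findall(r"\d+\.\s*Response [A-Z]", ranking_section)
            if numbered:
                return [re.search(r"Response [A-Z]", m).group() for m in numbered]
            # Fallback: any "Response X" in the ranking section, in order
            return re.findall(r"Response [A-Z]", ranking_section)
    # Last resort: scan the entire text
    return re.findall(r"Response [A-Z]", ranking_text)


# Follows the template: handled by the primary regex.
on_format = "Response A is solid.\n\nFINAL RANKING:\n1. Response C\n2. Response A\n3. Response B"

# Header present but no numbered lines: handled by the fallback regex.
bare = "Response B is weakest.\n\nFINAL RANKING:\nResponse A > Response C > Response B"

# No header at all: labels come back in the order the prose mentions them.
no_header = "Response B is weakest. Response A is best, ahead of Response C."

print(parse_ranking_from_text(on_format))   # ['Response C', 'Response A', 'Response B']
print(parse_ranking_from_text(bare))        # ['Response A', 'Response C', 'Response B']
print(parse_ranking_from_text(no_header))   # ['Response B', 'Response A', 'Response C']
```

The third result illustrates why the last-resort scan is unreliable: the text says Response A is best, yet Response B comes first in the returned list because it was mentioned first.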
Each Stage 2 result stored in stage2_results contains both the raw full text and the parsed list:
{
    "model": "x-ai/grok-4",
    "ranking": "Response A provides …\n\nFINAL RANKING:\n1. Response C\n2. Response A\n…",
    "parsed_ranking": ["Response C", "Response A", "Response B", "Response D"]
}

Calculating Aggregate Rankings

Once every model has submitted a ranking, calculate_aggregate_rankings converts individual votes into a single leaderboard:
def calculate_aggregate_rankings(
    stage2_results: List[Dict[str, Any]],
    label_to_model: Dict[str, str]
) -> List[Dict[str, Any]]:
    from collections import defaultdict

    model_positions = defaultdict(list)

    for ranking in stage2_results:
        parsed_ranking = parse_ranking_from_text(ranking['ranking'])
        for position, label in enumerate(parsed_ranking, start=1):
            if label in label_to_model:
                model_name = label_to_model[label]
                model_positions[model_name].append(position)

    aggregate = []
    for model, positions in model_positions.items():
        if positions:
            avg_rank = sum(positions) / len(positions)
            aggregate.append({
                "model": model,
                "average_rank": round(avg_rank, 2),
                "rankings_count": len(positions)
            })

    aggregate.sort(key=lambda x: x['average_rank'])
    return aggregate
Each model’s average rank position is computed across all evaluators (lower is better — a score of 1.0 means every evaluator placed this model first). The list is sorted ascending by average_rank so index 0 is the overall winner.
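A small worked example may make the arithmetic concrete. The sketch below applies the same averaging to three already-parsed rankings; the model names `model-a`, `model-b`, and `model-c` are placeholders, and working from pre-parsed lists (rather than re-parsing raw text as the function above does) is a simplification for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Placeholder names, not real council members.
label_to_model = {
    "Response A": "model-a",
    "Response B": "model-b",
    "Response C": "model-c",
}

# One parsed ranking per evaluator, best first.
parsed_rankings = [
    ["Response C", "Response A", "Response B"],
    ["Response C", "Response B", "Response A"],
    ["Response A", "Response C", "Response B"],
]

# Collect the position each model received from each evaluator.
positions: Dict[str, List[int]] = defaultdict(list)
for ranking in parsed_rankings:
    for position, label in enumerate(ranking, start=1):
        if label in label_to_model:
            positions[label_to_model[label]].append(position)

# Average the positions and sort ascending, so index 0 is the winner.
aggregate = sorted(
    (
        {
            "model": model,
            "average_rank": round(sum(p) / len(p), 2),
            "rankings_count": len(p),
        }
        for model, p in positions.items()
    ),
    key=lambda row: row["average_rank"],
)

for row in aggregate:
    print(row)
```

Here model-c receives positions 1, 1, 2 (average 1.33), model-a receives 2, 3, 1 (average 2.0), and model-b receives 3, 2, 3 (average 2.67), so model-c tops the leaderboard.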

Frontend Display

The Stage2 React component renders two sections: a tabbed view of the raw evaluations and an aggregate rankings leaderboard.

Raw evaluations tab view. One tab per council model shows that model’s full written evaluation. De-anonymization happens client-side via the deAnonymizeText helper, which replaces every "Response X" occurrence with the real model’s short name in bold:
function deAnonymizeText(text, labelToModel) {
  let result = text;
  Object.entries(labelToModel).forEach(([label, model]) => {
    const modelShortName = model.split('/')[1] || model;
    result = result.replace(new RegExp(label, 'g'), `**${modelShortName}**`);
  });
  return result;
}
An explanatory note directly in the UI reminds users: “model names are shown in bold for readability, but the original evaluation used anonymous labels.”

Extracted Ranking. Below each evaluation’s raw text, the component renders the parsed_ranking list as an ordered HTML list. This lets users immediately see what the parser extracted and compare it against the raw text.

Aggregate Rankings leaderboard. Below the tab view, all models are listed in order of their average_rank score along with the number of votes counted. The frontend labels this section “Aggregate Rankings (Street Cred).”
Compare the Extracted Ranking list to the raw evaluation text to verify the parser worked correctly. If a model went off-format and the ranking looks wrong, the raw text will show you what it actually wrote — useful for diagnosing edge cases.
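The label substitution performed by the frontend helper can also be sketched in Python, which may be handy when inspecting API responses outside the React UI. The function name `de_anonymize_text` is hypothetical; the logic mirrors the JavaScript helper shown earlier under the assumption that model identifiers look like `provider/name`.

```python
from typing import Dict


def de_anonymize_text(text: str, label_to_model: Dict[str, str]) -> str:
    """Hypothetical Python counterpart of the frontend's deAnonymizeText."""
    result = text
    for label, model in label_to_model.items():
        parts = model.split("/")
        # Use the part after the first slash, falling back to the full identifier.
        short_name = parts[1] if len(parts) > 1 and parts[1] else model
        # str.replace substitutes every literal occurrence of the label.
        result = result.replace(label, f"**{short_name}**")
    return result


mapping = {"Response A": "openai/gpt-5.1", "Response B": "x-ai/grok-4"}
print(de_anonymize_text("Response A beats Response B.", mapping))
# **gpt-5.1** beats **grok-4**.
```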
