The Frontier-CS benchmark contains 172 competitive programming problems that test algorithm design and optimization. SkyDiscover evolves C++ solutions that are evaluated using a Docker-based judge.

Overview

Frontier-CS is a benchmark from Meta Research testing algorithmic problem-solving capabilities. Problems cover:
  • Graph algorithms (shortest paths, flows, matchings)
  • Dynamic programming
  • Greedy algorithms
  • Data structures (trees, heaps, segment trees)
  • Computational geometry
  • Number theory
  • Combinatorics
Note: Unlike Python-based benchmarks, Frontier-CS evolves C++ code. The evaluator compiles and tests solutions against hidden test cases.

Setup

Frontier-CS requires Docker for the judge server:
1. Clone Frontier-CS

cd benchmarks/frontier-cs-eval
git clone https://github.com/FrontierCS/Frontier-CS.git
2. Start Judge Server

cd Frontier-CS/algorithmic
docker compose up -d
Once the containers are up, the judge is available at http://localhost:8081.
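Before moving on, it can help to confirm that something is actually listening on the judge port. The sketch below is a minimal reachability check using only the Python standard library. The name `judge_is_reachable` is hypothetical, and the function only tests that a TCP connection can be opened; it does not call any judge API or confirm that the judge is healthy.

```python
import socket
from urllib.parse import urlparse


def judge_is_reachable(url: str = "http://localhost:8081", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper: it only checks that the port accepts connections.
    """
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `judge_is_reachable()` with no arguments checks the default address from this step; a `False` result suggests the containers are not running yet.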
3. Install Dependencies

cd ../../..
uv sync --extra frontier-cs
4. Set API Key

export OPENAI_API_KEY="sk-..."

Initial Program

The seed program is a minimal C++ skeleton:
#include <bits/stdc++.h>
using namespace std;

int main(){
    std::cout << "Hello, World!" << std::endl;
    return 0;
}
Evolution will replace this with a complete solution for the specified problem.

Evaluator

The evaluator submits C++ code to the Frontier-CS judge:
import os
import random
from pathlib import Path
from frontier_cs.single_evaluator import SingleEvaluator
from frontier_cs.runner.base import EvaluationStatus

# Support multiple judge servers for load balancing
DEFAULT_JUDGE_URL = "http://localhost:8081"
JUDGE_URLS = os.environ.get("JUDGE_URLS", DEFAULT_JUDGE_URL).split(",")

def get_judge_url():
    """Random selection for load balancing"""
    return random.choice(JUDGE_URLS)

def evaluate(program_path: str, problem_id: str = None):
    """
    Evaluate C++ solution for a Frontier-CS problem.
    
    Args:
        program_path: Path to C++ solution file
        problem_id: Problem ID (0-171) or from FRONTIER_CS_PROBLEM env var
    
    Returns:
        dict with combined_score and evaluation metadata
    """
    # Get problem ID from parameter or environment
    if problem_id is None:
        problem_id = os.environ.get('FRONTIER_CS_PROBLEM', '0')
    
    # Initialize judge
    evaluator = SingleEvaluator(
        backend="docker",
        judge_url=get_judge_url(),
        register_cleanup=False
    )
    
    # Read solution code
    solution_path = Path(program_path)
    if not solution_path.exists():
        return {
            "combined_score": 0.0,
            "status": "error",
            "message": f"File not found: {program_path}"
        }
    
    code = solution_path.read_text()
    
    # Submit to judge
    result = evaluator.evaluate(
        track="algorithmic",
        problem_id=problem_id,
        code=code,
        backend="docker"
    )
    
    # Process result
    if result.status == EvaluationStatus.SUCCESS:
        score = result.score
        # Also report the unbounded score (can exceed 100 if beating reference);
        # combined_score below still uses the regular score
        score_unbounded = result.metadata.get(
            'scoreUnbounded', score
        ) if result.metadata else score
        
        return {
            "combined_score": float(score),
            "score_unbounded": score_unbounded,
            "status": "success",
            "problem_id": problem_id,
            "duration_seconds": result.duration_seconds
        }
    
    elif result.status == EvaluationStatus.TIMEOUT:
        return {
            "combined_score": 0.0,
            "status": "timeout",
            "message": result.message
        }
    
    else:  # ERROR
        return {
            "combined_score": 0.0,
            "status": "error",
            "message": result.message,
            "logs": result.logs
        }
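The dictionary returned by `evaluate()` has a different shape for each status, which is easy to mishandle when logging. As a rough sketch, the hypothetical helper below (not part of SkyDiscover or Frontier-CS) turns any of the three shapes into a one-line summary, using only the keys that appear in the evaluator above.

```python
def describe_result(result: dict) -> str:
    """Format an evaluator result dict as a one-line summary.

    Hypothetical helper that relies only on the keys evaluate() returns.
    """
    status = result.get("status", "unknown")
    score = result.get("combined_score", 0.0)
    if status == "success":
        # score_unbounded is optional, so fall back to the regular score.
        unbounded = result.get("score_unbounded", score)
        return (
            f"problem {result.get('problem_id')}: "
            f"score={score:.2f} (unbounded={unbounded:.2f})"
        )
    # Timeouts and errors carry a message; errors may also carry judge logs.
    message = result.get("message") or "no message"
    return f"{status}: {message}"
```

For example, passing `{"combined_score": 0.0, "status": "timeout", "message": "time limit"}` (made-up values) yields `timeout: time limit`.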

Running Single Problem

Specify which problem to solve with the FRONTIER_CS_PROBLEM environment variable:
cd benchmarks/frontier-cs-eval

FRONTIER_CS_PROBLEM=0 uv run skydiscover-run \
  initial_program.cpp \
  evaluator.py \
  -c config.yaml \
  -s adaevolve \
  -i 50
Problem IDs range from 0 to 171. Start with simpler problems (lower IDs) to test your setup.
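When launching runs from Python instead of the shell, the same environment variable can be set programmatically. The sketch below assumes a hypothetical helper, `problem_env`, that validates the ID against the 0 to 171 range stated above before anything is launched.

```python
import os

NUM_PROBLEMS = 172  # problem IDs run from 0 to 171


def problem_env(problem_id: int, base_env=None) -> dict:
    """Return a copy of the environment with FRONTIER_CS_PROBLEM set.

    Hypothetical helper: rejects IDs outside 0-171 up front.
    """
    if not 0 <= problem_id < NUM_PROBLEMS:
        raise ValueError(
            f"problem_id must be in 0..{NUM_PROBLEMS - 1}, got {problem_id}"
        )
    # Copy so the caller's environment is never modified in place.
    env = dict(os.environ if base_env is None else base_env)
    env["FRONTIER_CS_PROBLEM"] = str(problem_id)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when invoking `skydiscover-run`.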

Running All Problems in Parallel

The benchmark includes a script to evolve solutions for all 172 problems:
uv run python run_all_frontiercs.py \
  --search adaevolve \
  --iterations 50 \
  --workers 6
The script, run_all_frontiercs.py:
#!/usr/bin/env python3
import argparse
import os
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def run_single_problem(problem_id, search_algo, iterations):
    """Run evolution for a single problem"""
    cmd = [
        "skydiscover-run",
        "initial_program.cpp",
        "evaluator.py",
        "-c", "config.yaml",
        "-s", search_algo,
        "-i", str(iterations)
    ]
    
    env = os.environ.copy()
    env["FRONTIER_CS_PROBLEM"] = str(problem_id)
    
    result = subprocess.run(cmd, env=env, capture_output=True)
    return problem_id, result.returncode

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--search", default="adaevolve")
    parser.add_argument("--iterations", type=int, default=50)
    parser.add_argument("--workers", type=int, default=6)
    args = parser.parse_args()
    
    # Run all 172 problems in parallel
    with ProcessPoolExecutor(max_workers=args.workers) as executor:
        futures = [
            executor.submit(
                run_single_problem, 
                problem_id, 
                args.search, 
                args.iterations
            )
            for problem_id in range(172)
        ]
        
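        # Results are printed in submission order, so a slow early problem
        # delays the output for later ones.
        for future in futures:
            problem_id, status = future.result()
            print(f"Problem {problem_id}: {'✓' if status == 0 else '✗'}")

if __name__ == "__main__":
    main()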
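The script above reports results in submission order. One possible variation, sketched below, reports each problem as soon as it finishes and returns the list of failed IDs. This is not the script's actual behavior: it swaps in `ThreadPoolExecutor` and `as_completed` (threads are sufficient when each worker only waits on a subprocess), and `run_fn` is a hypothetical stand-in for `run_single_problem` with the search algorithm and iteration count already bound.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(problem_ids, run_fn, workers=6):
    """Run run_fn(problem_id) for every ID and collect return codes.

    run_fn must return a (problem_id, returncode) tuple.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(run_fn, pid) for pid in problem_ids]
        # as_completed yields each future as soon as it finishes,
        # so progress is reported without waiting for earlier IDs.
        for future in as_completed(futures):
            problem_id, returncode = future.result()
            results[problem_id] = returncode
            print(f"Problem {problem_id}: {'ok' if returncode == 0 else 'failed'}")
    failed = sorted(pid for pid, code in results.items() if code != 0)
    return results, failed
```

The `failed` list makes it straightforward to rerun only the problems whose process exited with a non-zero code.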

Evaluating Best Programs

After evolution, re-evaluate the best solutions on test sets:
uv run python run_best_programs_frontiercs.py
This reads the best program from each problem’s evolution directory and runs it through the judge again.

Analyzing Results

Combine the training and testing scores into a CSV file:
uv run python combine_results.py
Generate plots and statistics:
uv run python analyze_results.py

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| OPENAI_API_KEY | (required) | Your API key |
| FRONTIER_CS_PROBLEM | 0 | Problem ID to evolve (0-171) |
| JUDGE_URLS | http://localhost:8081 | Comma-separated judge URLs for load balancing |
Load Balancing: If running many problems in parallel, you can start multiple judge servers and specify all URLs:
export JUDGE_URLS="http://localhost:8081,http://localhost:8082,http://localhost:8083"
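Note that the evaluator shown earlier splits `JUDGE_URLS` on commas without trimming, so the value should contain no spaces. The sketch below shows a slightly more forgiving parse that could be used instead; `parse_judge_urls` and `pick_judge_url` are hypothetical names, and the random choice mirrors what `get_judge_url()` does.

```python
import random

DEFAULT_JUDGE_URL = "http://localhost:8081"


def parse_judge_urls(raw):
    """Split a comma-separated JUDGE_URLS value into a clean list of URLs."""
    # Trim whitespace and drop empty entries such as a trailing comma.
    urls = [u.strip() for u in (raw or "").split(",") if u.strip()]
    # Fall back to the single default judge when nothing usable was given.
    return urls or [DEFAULT_JUDGE_URL]


def pick_judge_url(urls, rng=random):
    """Pick one URL uniformly at random from the list."""
    return rng.choice(urls)
```

Random selection spreads submissions evenly only on average; it does not track which judge is currently busy.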

Configuration

The config.yaml specifies C++ as the language:
system_prompt: |
  You are solving a competitive programming problem.
  Write efficient C++ code that passes all test cases.
  
language: cpp
diff_based_generation: true

search_algorithm:
  population_size: 20
  tournament_size: 3

Tips for Algorithm Benchmarks

Start Small

Test on a few problems first. Some are significantly harder than others.

Use Load Balancing

Run multiple judge servers if evolving many problems in parallel.

Check Logs

Judge logs show compilation errors and runtime failures.

Unbounded Scores

Solutions can score >100 if they beat the reference implementation.
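Because the evaluator reports both `combined_score` and `score_unbounded`, candidates for the same problem can tie on the former while differing on the latter. As an illustration only, the hypothetical helpers below rank result dictionaries by the unbounded score when it is present and fall back to `combined_score` otherwise.

```python
def ranking_score(result: dict) -> float:
    """Score used for ranking: unbounded if present, else combined_score."""
    return float(result.get("score_unbounded", result.get("combined_score", 0.0)))


def rank_results(results):
    """Sort result dicts from best to worst by ranking_score."""
    return sorted(results, key=ranking_score, reverse=True)


# Made-up values for illustration, not real benchmark output.
candidates = [
    {"name": "a", "combined_score": 100.0, "score_unbounded": 100.0},
    {"name": "b", "combined_score": 100.0, "score_unbounded": 112.5},
    {"name": "c", "combined_score": 0.0, "status": "error"},
]
best = rank_results(candidates)[0]
```

Here `best` is candidate `b`, since its unbounded score is the highest of the three.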

Common Issues

If the evaluator cannot reach the judge, verify that Docker is running:
docker ps | grep judge
Restart if needed:
cd Frontier-CS/algorithmic
docker compose restart
If a solution fails to compile or crashes, check the judge logs for detailed error messages:
docker compose logs -f
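On errors, the evaluator also returns the judge logs under the logs key of the result dictionary.
If evaluations time out, remember that solutions must complete within the judge's time limit; reduce the algorithmic complexity of the solution.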
If the Frontier-CS directory is missing, ensure you cloned the repository:
cd benchmarks/frontier-cs-eval
git clone https://github.com/FrontierCS/Frontier-CS.git

Supported Search Algorithms

  • adaevolve (recommended)
  • evox
  • openevolve
  • gepa
  • shinkaevolve
All require the --extra external installation:
uv sync --extra external
