Algorithm Design Examples

The Frontier-CS benchmark contains 172 competitive programming problems that test algorithm design and optimization. SkyDiscover evolves C++ solutions that are evaluated using a Docker-based judge.

Overview

Frontier-CS is a benchmark from Meta Research testing algorithmic problem-solving capabilities. Problems cover:

Graph algorithms (shortest paths, flows, matchings)
Dynamic programming
Greedy algorithms
Data structures (trees, heaps, segment trees)
Computational geometry
Number theory
Combinatorics

Note: Unlike Python-based benchmarks, Frontier-CS evolves C++ code. The evaluator compiles and tests solutions against hidden test cases.

Setup

Frontier-CS requires Docker for the judge server:

Clone Frontier-CS

cd benchmarks/frontier-cs-eval
git clone https://github.com/FrontierCS/Frontier-CS.git

Start Judge Server

cd Frontier-CS/algorithmic
docker compose up -d

The judge will run on http://localhost:8081
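
Before starting a long run, it can help to confirm that something is actually listening on the judge port. The following is a minimal sketch that only tests TCP reachability; the helper name `judge_port_open` is hypothetical, and no judge HTTP endpoint is assumed because the source does not document any.

```python
import socket

def judge_port_open(host: str = "localhost", port: int = 8081,
                    timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    This does not prove the judge is healthy, only that the port is open.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling `judge_port_open()` with no arguments checks the default address above; a `False` result suggests the container is not up yet or is mapped to a different port.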

Install Dependencies

cd ../../../..  # back to the repository root
uv sync --extra frontier-cs

Set API Key

export OPENAI_API_KEY="sk-..."

Initial Program

The seed program is a minimal C++ skeleton:

#include <bits/stdc++.h>
using namespace std;

int main(){
    std::cout << "Hello, World!" << std::endl;
    return 0;
}

Evolution will replace this with a complete solution for the specified problem.

Evaluator

The evaluator submits C++ code to the Frontier-CS judge:

import os
import random
from pathlib import Path
from frontier_cs.single_evaluator import SingleEvaluator
from frontier_cs.runner.base import EvaluationStatus

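# Support multiple judge servers for load balancing
DEFAULT_JUDGE_URL = "http://localhost:8081"
# Strip whitespace and skip empty entries so "a, b" and trailing commas work
JUDGE_URLS = [
    url.strip()
    for url in os.environ.get("JUDGE_URLS", DEFAULT_JUDGE_URL).split(",")
    if url.strip()
] or [DEFAULT_JUDGE_URL]

def get_judge_url():
    """Pick a judge URL at random to spread load across servers."""
    return random.choice(JUDGE_URLS)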

def evaluate(program_path: str, problem_id: str = None):
    """
    Evaluate C++ solution for a Frontier-CS problem.
    
    Args:
        program_path: Path to C++ solution file
        problem_id: Problem ID (0-171) or from FRONTIER_CS_PROBLEM env var
    
    Returns:
        dict with combined_score and evaluation metadata
    """
    # Get problem ID from parameter or environment
    if problem_id is None:
        problem_id = os.environ.get('FRONTIER_CS_PROBLEM', '0')
    
    # Initialize judge
    evaluator = SingleEvaluator(
        backend="docker",
        judge_url=get_judge_url(),
        register_cleanup=False
    )
    
    # Read solution code
    solution_path = Path(program_path)
    if not solution_path.exists():
        return {
            "combined_score": 0.0,
            "status": "error",
            "message": f"File not found: {program_path}"
        }
    
    code = solution_path.read_text()
    
    # Submit to judge
    result = evaluator.evaluate(
        track="algorithmic",
        problem_id=problem_id,
        code=code,
        backend="docker"
    )
    
    # Process result
    if result.status == EvaluationStatus.SUCCESS:
        score = result.score
        # Also report the unbounded score (can exceed 100 if beating reference)
        score_unbounded = result.metadata.get(
            'scoreUnbounded', score
        ) if result.metadata else score
        
        return {
            "combined_score": float(score),
            "score_unbounded": score_unbounded,
            "status": "success",
            "problem_id": problem_id,
            "duration_seconds": result.duration_seconds
        }
    
    elif result.status == EvaluationStatus.TIMEOUT:
        return {
            "combined_score": 0.0,
            "status": "timeout",
            "message": result.message
        }
    
    else:  # ERROR
        return {
            "combined_score": 0.0,
            "status": "error",
            "message": result.message,
            "logs": result.logs
        }
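
The evaluator returns one of three dictionary shapes (success, timeout, error). As a rough illustration of how calling code might consume them, here is a small sketch; `describe_result` is a hypothetical helper, not part of SkyDiscover or Frontier-CS, and the sample values are made up.

```python
def describe_result(result: dict) -> str:
    """Turn an evaluator result dict into a one-line summary."""
    status = result.get("status", "error")
    if status == "success":
        bounded = float(result["combined_score"])
        # Fall back to the bounded score when no unbounded value is present.
        unbounded = float(result.get("score_unbounded", bounded))
        return (f"problem {result.get('problem_id')}: "
                f"score {bounded:.1f} (unbounded {unbounded:.1f})")
    # Timeout and error results carry a message instead of a score breakdown.
    return f"{status}: {result.get('message', 'no message')}"

# Illustrative inputs only.
ok = {"combined_score": 100.0, "score_unbounded": 112.5,
      "status": "success", "problem_id": "3", "duration_seconds": 4.2}
late = {"combined_score": 0.0, "status": "timeout", "message": "time limit exceeded"}

print(describe_result(ok))
print(describe_result(late))
```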

Running Single Problem

Specify which problem to solve with the FRONTIER_CS_PROBLEM environment variable:

cd benchmarks/frontier-cs-eval

FRONTIER_CS_PROBLEM=0 uv run skydiscover-run \
  initial_program.cpp \
  evaluator.py \
  -c config.yaml \
  -s adaevolve \
  -i 50

Problem IDs range from 0 to 171. Start with simpler problems (lower IDs) to test your setup.
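
Because the problem ID travels through an environment variable as a string, a typo only surfaces once the judge rejects it. A minimal sketch of an up-front check follows; `read_problem_id` and `NUM_PROBLEMS` are hypothetical names, and the range 0 to 171 is taken from the text above.

```python
import os

NUM_PROBLEMS = 172  # problem IDs are 0..171

def read_problem_id(env=None) -> str:
    """Read FRONTIER_CS_PROBLEM (default "0") and check it is in range."""
    env = os.environ if env is None else env
    raw = env.get("FRONTIER_CS_PROBLEM", "0")
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"FRONTIER_CS_PROBLEM must be an integer, got {raw!r}"
        ) from None
    if not 0 <= value < NUM_PROBLEMS:
        raise ValueError(
            f"FRONTIER_CS_PROBLEM must be between 0 and {NUM_PROBLEMS - 1}, got {value}"
        )
    # The evaluator passes the ID along as a string, so return one.
    return str(value)
```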

Running All Problems in Parallel

The benchmark includes a script to evolve solutions for all 172 problems:

uv run python run_all_frontiercs.py \
  --search adaevolve \
  --iterations 50 \
  --workers 6

#!/usr/bin/env python3
import argparse
import os
import subprocess
from concurrent.futures import ProcessPoolExecutor

def run_single_problem(problem_id, search_algo, iterations):
    """Run evolution for a single problem"""
    cmd = [
        "skydiscover-run",
        "initial_program.cpp",
        "evaluator.py",
        "-c", "config.yaml",
        "-s", search_algo,
        "-i", str(iterations)
    ]
    
    # Each worker selects its problem through the environment
    env = os.environ.copy()
    env["FRONTIER_CS_PROBLEM"] = str(problem_id)
    
    result = subprocess.run(cmd, env=env, capture_output=True)
    return problem_id, result.returncode

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--search", default="adaevolve")
    parser.add_argument("--iterations", type=int, default=50)
    parser.add_argument("--workers", type=int, default=6)
    args = parser.parse_args()
    
    # Run all 172 problems in parallel
    with ProcessPoolExecutor(max_workers=args.workers) as executor:
        futures = [
            executor.submit(
                run_single_problem, 
                problem_id, 
                args.search, 
                args.iterations
            )
            for problem_id in range(172)
        ]
        
        # Results are reported in submission order
        for future in futures:
            problem_id, status = future.result()
            print(f"Problem {problem_id}: {'✓' if status == 0 else '✗'}")

if __name__ == "__main__":
    main()
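
The listing above waits on futures in submission order, so one slow problem delays the report for everything queued behind it. As a variation (not the benchmark's own script), the sketch below collects results as they finish using `as_completed`. It uses threads, which is usually sufficient when each worker only waits on a subprocess; `run_all` and `fake_worker` are hypothetical names, and `fake_worker` stands in for the real `skydiscover-run` call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(problem_ids, worker, max_workers=6):
    """Run worker(problem_id) concurrently; return {problem_id: return_code}."""
    codes = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, pid): pid for pid in problem_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                codes[pid] = future.result()
            except Exception:
                codes[pid] = -1  # treat a crashed worker as a failure
    return codes

def fake_worker(problem_id):
    """Stand-in for the subprocess call; fails on every fifth problem."""
    return 1 if problem_id % 5 == 0 else 0

codes = run_all(range(10), fake_worker, max_workers=4)
failed = sorted(pid for pid, code in codes.items() if code != 0)
print(f"{len(codes) - len(failed)} succeeded, failed: {failed}")
```

Collecting failures into a list at the end makes it easy to re-run only those problem IDs.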

Evaluating Best Programs

After evolution, re-evaluate the best solutions on test sets:

uv run python run_best_programs_frontiercs.py

This reads the best program from each problem’s evolution directory and runs it through the judge again.

Analyzing Results

Combine training and testing scores into a single CSV file:

uv run python combine_results.py
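
The source does not show what `combine_results.py` writes, so the following is only a sketch of the general idea: joining per-problem training and testing scores into one CSV. The function name, the column names, and the sample scores are all hypothetical.

```python
import csv
import io

def combine_scores(train: dict, test: dict) -> str:
    """Join two {problem_id: score} maps into CSV text, one row per problem."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["problem_id", "train_score", "test_score"])  # assumed columns
    for pid in sorted(set(train) | set(test)):
        # Leave a cell empty when a problem has no score on one side.
        writer.writerow([pid, train.get(pid, ""), test.get(pid, "")])
    return buf.getvalue()

# Illustrative values only.
print(combine_scores({0: 50.0, 1: 80.0}, {0: 45.0}))
```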

Generate plots and statistics:

uv run python analyze_results.py

Environment Variables

Variable	Default	Description
`OPENAI_API_KEY`	(required)	Your API key
`FRONTIER_CS_PROBLEM`	`0`	Problem ID to evolve (0-171)
`JUDGE_URLS`	`http://localhost:8081`	Comma-separated judge URLs for load balancing

Load Balancing: If running many problems in parallel, you can start multiple judge servers and specify all URLs:

export JUDGE_URLS="http://localhost:8081,http://localhost:8082,http://localhost:8083"
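
The evaluator splits `JUDGE_URLS` on commas and picks one URL at random for each evaluation. A self-contained sketch of that idea is shown here; `parse_judge_urls` is a hypothetical helper that additionally tolerates spaces and empty entries.

```python
import random

DEFAULT_JUDGE_URL = "http://localhost:8081"

def parse_judge_urls(raw):
    """Split a comma-separated URL list, ignoring blanks and extra spaces."""
    urls = [u.strip() for u in (raw or "").split(",") if u.strip()]
    return urls or [DEFAULT_JUDGE_URL]

urls = parse_judge_urls(
    "http://localhost:8081, http://localhost:8082,http://localhost:8083"
)
# Random choice spreads load roughly evenly over many evaluations.
picks = [random.choice(urls) for _ in range(1000)]
for url in urls:
    print(url, picks.count(url))
```

Random selection needs no shared state between worker processes, which is convenient when many problems are evolved in parallel.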

Configuration

The config.yaml specifies C++ as the language:

system_prompt: |
  You are solving a competitive programming problem.
  Write efficient C++ code that passes all test cases.
  
language: cpp
diff_based_generation: true

search_algorithm:
  population_size: 20
  tournament_size: 3
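
The `tournament_size` setting suggests tournament selection, although the source does not describe how SkyDiscover's search algorithms actually use it. Assuming the textbook scheme, a parent is chosen by sampling `tournament_size` candidates from the population and keeping the highest-scoring one. The sketch below illustrates that generic technique with made-up program names and scores.

```python
import random

def tournament_select(population, scores, tournament_size, rng):
    """Sample `tournament_size` distinct candidates and return the best one."""
    contenders = rng.sample(range(len(population)), tournament_size)
    best = max(contenders, key=lambda i: scores[i])
    return population[best]

rng = random.Random(0)
population = [f"program_{i}" for i in range(20)]  # population_size: 20
scores = [float(i) for i in range(20)]            # illustrative scores
parent = tournament_select(population, scores, 3, rng)  # tournament_size: 3
print(parent)
```

A larger tournament favors the current best programs more strongly, while a smaller one keeps more diversity in the parents.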

Tips for Algorithm Benchmarks

Start Small

Test on a few problems first. Some are significantly harder than others.

Use Load Balancing

Run multiple judge servers if evolving many problems in parallel.

Check Logs

Judge logs show compilation errors and runtime failures.

Unbounded Scores

Solutions can score >100 if they beat the reference implementation.

Common Issues

Judge server not responding

Verify Docker is running:

docker ps | grep judge

Restart if needed:

cd Frontier-CS/algorithmic
docker compose restart

Compilation errors

Check the judge logs for detailed error messages:

docker compose logs -f

When evaluation ends in an error, the evaluator also returns the judge logs under the "logs" key of the result dictionary.

Timeouts

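Solutions must complete within the judge’s time limit. A solution that exceeds it is reported with status "timeout" and a combined_score of 0.0, so reduce its algorithmic complexity rather than micro-optimizing.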

Missing Frontier-CS

Ensure you cloned the repository:

cd benchmarks/frontier-cs-eval
git clone https://github.com/FrontierCS/Frontier-CS.git

Supported Search Algorithms

adaevolve (recommended)
evox
openevolve
gepa
shinkaevolve

All require the --extra external installation:

uv sync --extra external

Next Steps

Math Examples

Explore math benchmarks

Systems Examples

See systems optimization

Create Custom

Build your own benchmark
