The Frontier-CS benchmark contains 172 competitive programming problems that test algorithm design and optimization. SkyDiscover evolves C++ solutions that are evaluated using a Docker-based judge.
Overview
Frontier-CS is a benchmark from Meta Research testing algorithmic problem-solving capabilities. Problems cover:
Graph algorithms (shortest paths, flows, matchings)
Dynamic programming
Greedy algorithms
Data structures (trees, heaps, segment trees)
Computational geometry
Number theory
Combinatorics
Note: Unlike Python-based benchmarks, Frontier-CS evolves C++ code. The evaluator compiles and tests solutions against hidden test cases.
Setup
Frontier-CS requires Docker for the judge server:
Clone Frontier-CS
cd benchmarks/frontier-cs-eval
git clone https://github.com/FrontierCS/Frontier-CS.git
Start Judge Server
cd Frontier-CS/algorithmic
docker compose up -d
The judge will run on http://localhost:8081
Install Dependencies
cd ../../..
uv sync --extra frontier-cs
Set API Key
export OPENAI_API_KEY="sk-..."
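Once the setup steps above are done, it can help to confirm that something is actually listening on the judge port before starting a long run. The sketch below only tests TCP reachability with the standard library; it does not call any judge endpoint, and the helper name `is_port_open` is hypothetical (not part of Frontier-CS or SkyDiscover).

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8081, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open()` with the defaults probes the address the judge is documented to use, http://localhost:8081. A True result only means the port accepts connections, not that the judge is healthy.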
Initial Program
The seed program is a minimal C++ skeleton:
#include <bits/stdc++.h>
using namespace std;

int main() {
    std::cout << "Hello, World!" << std::endl;
    return 0;
}
Evolution will replace this with a complete solution for the specified problem.
Evaluator
The evaluator submits C++ code to the Frontier-CS judge:
import os
import random
from pathlib import Path

from frontier_cs.single_evaluator import SingleEvaluator
from frontier_cs.runner.base import EvaluationStatus

# Support multiple judge servers for load balancing
DEFAULT_JUDGE_URL = "http://localhost:8081"
JUDGE_URLS = os.environ.get("JUDGE_URLS", DEFAULT_JUDGE_URL).split(",")


def get_judge_url():
    """Random selection for load balancing"""
    return random.choice(JUDGE_URLS)


def evaluate(program_path: str, problem_id: str = None):
    """
    Evaluate C++ solution for a Frontier-CS problem.

    Args:
        program_path: Path to C++ solution file
        problem_id: Problem ID (0-171) or from FRONTIER_CS_PROBLEM env var

    Returns:
        dict with combined_score and evaluation metadata
    """
    # Get problem ID from parameter or environment
    if problem_id is None:
        problem_id = os.environ.get("FRONTIER_CS_PROBLEM", "0")

    # Initialize judge
    evaluator = SingleEvaluator(
        backend="docker",
        judge_url=get_judge_url(),
        register_cleanup=False
    )

    # Read solution code
    solution_path = Path(program_path)
    if not solution_path.exists():
        return {
            "combined_score": 0.0,
            "status": "error",
            "message": f"File not found: {program_path}"
        }
    code = solution_path.read_text()

    # Submit to judge
    result = evaluator.evaluate(
        track="algorithmic",
        problem_id=problem_id,
        code=code,
        backend="docker"
    )

    # Process result
    if result.status == EvaluationStatus.SUCCESS:
        score = result.score
        # Use unbounded score (can exceed 100 if beating reference)
        score_unbounded = result.metadata.get(
            "scoreUnbounded", score
        ) if result.metadata else score
        return {
            "combined_score": float(score),
            "score_unbounded": score_unbounded,
            "status": "success",
            "problem_id": problem_id,
            "duration_seconds": result.duration_seconds
        }
    elif result.status == EvaluationStatus.TIMEOUT:
        return {
            "combined_score": 0.0,
            "status": "timeout",
            "message": result.message
        }
    else:  # ERROR
        return {
            "combined_score": 0.0,
            "status": "error",
            "message": result.message,
            "logs": result.logs
        }
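The evaluator returns one of three dictionary shapes (success, timeout, error). As a minimal sketch of how a caller might consume them, the helper below turns a result dictionary into a one-line summary. The function name `describe_result` is hypothetical, and it assumes `score_unbounded` is numeric, as the evaluator's comment about scores exceeding 100 suggests.

```python
def describe_result(result: dict) -> str:
    """Summarize a result dict shaped like the ones evaluate() returns."""
    status = result.get("status", "unknown")
    score = float(result.get("combined_score", 0.0))
    if status == "success":
        # Fall back to the bounded score when no unbounded score is present
        unbounded = float(result.get("score_unbounded", score))
        note = " (beats reference)" if unbounded > 100 else ""
        return (
            f"problem {result.get('problem_id')}: "
            f"{score:.1f} (unbounded {unbounded:.1f}){note}"
        )
    # Timeout and error results carry a message instead of a score
    return f"{status}: {result.get('message', 'no message')}"
```

Keeping the summary logic separate from the evaluator means it can be exercised with hand-written dictionaries, without a running judge.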
Running Single Problem
Specify which problem to solve with the FRONTIER_CS_PROBLEM environment variable:
cd benchmarks/frontier-cs-eval
FRONTIER_CS_PROBLEM=0 uv run skydiscover-run \
  initial_program.cpp \
  evaluator.py \
  -c config.yaml \
  -s adaevolve \
  -i 50
Problem IDs range from 0 to 171. Start with simpler problems (lower IDs) to test your setup.
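Because the problem ID arrives through an environment variable, a typo only surfaces once the judge rejects the submission. The following is a small sketch of validating the value up front. It assumes IDs are plain integers in the stated range 0 to 171; `parse_problem_id` and `NUM_PROBLEMS` are hypothetical names, not part of the benchmark code.

```python
NUM_PROBLEMS = 172  # the text states that IDs run from 0 to 171


def parse_problem_id(raw: str) -> int:
    """Convert a FRONTIER_CS_PROBLEM value to an int, rejecting out-of-range input."""
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"FRONTIER_CS_PROBLEM must be an integer, got {raw!r}") from None
    if not 0 <= value < NUM_PROBLEMS:
        raise ValueError(
            f"FRONTIER_CS_PROBLEM must be between 0 and {NUM_PROBLEMS - 1}, got {value}"
        )
    return value
```

A launcher script could call `parse_problem_id(os.environ.get("FRONTIER_CS_PROBLEM", "0"))` before invoking skydiscover-run, so that a bad ID fails immediately.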
Running All Problems in Parallel
The benchmark includes a script to evolve solutions for all 172 problems:
uv run python run_all_frontiercs.py \
--search adaevolve \
--iterations 50 \
--workers 6
run_all_frontiercs.py (excerpt)
#!/usr/bin/env python3
import argparse
import os
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def run_single_problem(problem_id, search_algo, iterations):
    """Run evolution for a single problem"""
    cmd = [
        "skydiscover-run",
        "initial_program.cpp",
        "evaluator.py",
        "-c", "config.yaml",
        "-s", search_algo,
        "-i", str(iterations)
    ]
    # Pass the problem ID to the evaluator through the environment
    env = os.environ.copy()
    env["FRONTIER_CS_PROBLEM"] = str(problem_id)
    result = subprocess.run(cmd, env=env, capture_output=True)
    return problem_id, result.returncode


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--search", default="adaevolve")
    parser.add_argument("--iterations", type=int, default=50)
    parser.add_argument("--workers", type=int, default=6)
    args = parser.parse_args()

    # Run all 172 problems in parallel
    with ProcessPoolExecutor(max_workers=args.workers) as executor:
        futures = [
            executor.submit(
                run_single_problem,
                problem_id,
                args.search,
                args.iterations
            )
            for problem_id in range(172)
        ]
        # Results are reported in submission order, not completion order
        for future in futures:
            problem_id, status = future.result()
            print(f"Problem {problem_id}: {'✓' if status == 0 else '✗'}")


if __name__ == "__main__":
    main()
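The excerpt reports results in submission order, so one slow problem delays the output for every problem queued after it. The sketch below is one possible variation, not the script's actual code: it collects exit codes as tasks finish using `as_completed`. It uses threads instead of processes, on the assumption that each task mostly waits on a subprocess; `run_all` and `failed_ids` are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List


def run_all(task: Callable[[int], int], problem_ids: Iterable[int], workers: int = 6) -> Dict[int, int]:
    """Run task(problem_id) for every ID and return {problem_id: exit_code}."""
    results: Dict[int, int] = {}
    with ThreadPoolExecutor(max_workers=workers) as executor:
        future_to_id = {executor.submit(task, pid): pid for pid in problem_ids}
        # as_completed yields each future as soon as it finishes
        for future in as_completed(future_to_id):
            pid = future_to_id[future]
            try:
                results[pid] = future.result()
            except Exception:
                results[pid] = -1  # treat a crashed task as a failure
    return results


def failed_ids(results: Dict[int, int]) -> List[int]:
    """Return the sorted IDs whose exit code was non-zero."""
    return sorted(pid for pid, code in results.items() if code != 0)
```

Here `task` would wrap the `subprocess.run` call from the excerpt and return its return code. The list from `failed_ids` gives a quick set of problems to re-run.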
Evaluating Best Programs
After evolution, re-evaluate the best solutions on test sets:
uv run python run_best_programs_frontiercs.py
This reads the best program from each problem’s evolution directory and runs it through the judge again.
Analyzing Results
Combine training and testing scores into CSV:
uv run python combine_results.py
Generate plots and statistics:
uv run python analyze_results.py
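The internals of combine_results.py are not shown here. Purely as an illustration of what combining two score sets into CSV can look like, the sketch below merges per-problem training and testing scores into one table. The column names and the `combine_scores` helper are assumptions, not the script's real output format.

```python
import csv
import io
from typing import Dict


def combine_scores(train: Dict[int, float], test: Dict[int, float]) -> str:
    """Merge two {problem_id: score} mappings into CSV text, one row per problem."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["problem_id", "train_score", "test_score"])
    # Include every problem seen in either mapping; missing scores stay blank
    for pid in sorted(set(train) | set(test)):
        writer.writerow([pid, train.get(pid, ""), test.get(pid, "")])
    return buffer.getvalue()
```

Leaving missing scores blank rather than writing 0 keeps "not evaluated" distinguishable from "scored zero" in later analysis.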
Environment Variables
Variable | Default | Description
OPENAI_API_KEY | (required) | Your API key
FRONTIER_CS_PROBLEM | 0 | Problem ID to evolve (0-171)
JUDGE_URLS | http://localhost:8081 | Comma-separated judge URLs for load balancing
Load Balancing: If running many problems in parallel, you can start multiple judge servers and specify all URLs:
export JUDGE_URLS="http://localhost:8081,http://localhost:8082,http://localhost:8083"
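The evaluator splits JUDGE_URLS on commas as-is, so a stray space or trailing comma in the variable ends up inside a URL or produces an empty entry. A more forgiving parse might look like the sketch below; `parse_judge_urls` is a hypothetical helper, not part of the benchmark code.

```python
import random
from typing import List

DEFAULT_JUDGE_URL = "http://localhost:8081"


def parse_judge_urls(raw: str) -> List[str]:
    """Split a comma-separated URL list, dropping whitespace and empty entries."""
    urls = [part.strip() for part in raw.split(",")]
    urls = [url for url in urls if url]
    # Fall back to the documented default when nothing usable remains
    return urls or [DEFAULT_JUDGE_URL]


def pick_judge_url(raw: str) -> str:
    """Pick one judge at random, mirroring the evaluator's load-balancing choice."""
    return random.choice(parse_judge_urls(raw))
```

For example, `parse_judge_urls("http://localhost:8081, http://localhost:8082,")` yields two clean URLs, where a plain `split(",")` would yield three entries, one with a leading space and one empty.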
Configuration
The config.yaml specifies C++ as the language:
system_prompt: |
  You are solving a competitive programming problem.
  Write efficient C++ code that passes all test cases.
language: cpp
diff_based_generation: true
search_algorithm:
  population_size: 20
  tournament_size: 3
Tips for Algorithm Benchmarks
Start Small: Test on a few problems first. Some are significantly harder than others.
Use Load Balancing: Run multiple judge servers if evolving many problems in parallel.
Check Logs: Judge logs show compilation errors and runtime failures.
Unbounded Scores: Solutions can score >100 if they beat the reference implementation.
Common Issues
Judge server not responding
Verify that Docker is running. Restart the judge if needed:
cd Frontier-CS/algorithmic
docker compose restart
Compilation or runtime failures
Check the judge logs for detailed error messages. For error results, the evaluator returns the logs in the result dictionary.
Timeouts
Solutions must complete within the judge’s time limit. Optimize algorithmic complexity.
Frontier-CS directory missing
Ensure you cloned the repository:
cd benchmarks/frontier-cs-eval
git clone https://github.com/FrontierCS/Frontier-CS.git
Supported Search Algorithms
adaevolve (recommended)
evox
openevolve
gepa
shinkaevolve
All require the --extra external installation.
Next Steps
Math Examples: Explore math benchmarks
Systems Examples: See systems optimization
Create Custom: Build your own benchmark