Overview

SkyDiscover includes comprehensive benchmarks across multiple domains. Use these to:
  • Test search algorithms on real problems
  • Learn evaluator patterns from working examples
  • Benchmark LLM performance on hard optimization tasks
  • Reproduce published results from research papers

  • Math: 14 tasks
  • Systems: 5 tasks (ADRS)
  • GPU Kernels: 4 tasks (Triton)
  • Algorithms: 172 tasks (Frontier-CS)
  • Reasoning: ARC-AGI tasks
  • Creative: Image generation

Quick Start

Installation

# Base installation
uv sync

# Add domain-specific dependencies
uv sync --extra math                # Math benchmarks
uv sync --extra adrs                # Systems benchmarks
uv sync --extra external            # OpenEvolve/GEPA backends
uv sync --extra frontier-cs         # Competitive programming
uv sync --extra prompt-optimization # Prompt evolution

Running a Benchmark

export OPENAI_API_KEY="sk-..."

# Run circle packing benchmark
uv run skydiscover-run \
  benchmarks/math/circle_packing/initial_program.py \
  benchmarks/math/circle_packing/evaluator.py \
  -c benchmarks/math/circle_packing/config.yaml \
  -s adaevolve \
  -i 100

Benchmark Catalog

Math Benchmarks

Path: benchmarks/math/circle_packing/
Problem: Pack 26 circles in a unit square to maximize the sum of radii.
Target: 2.635 (AlphaEvolve result)
Run:
uv run skydiscover-run \
  benchmarks/math/circle_packing/initial_program.py \
  benchmarks/math/circle_packing/evaluator.py \
  -c benchmarks/math/circle_packing/config.yaml \
  -s adaevolve -i 100
Evaluator excerpt:
def evaluate(program_path):
    centers, radii, sum_radii = run_packing()
    valid = validate_packing(centers, radii)
    target_ratio = sum_radii / 2.635 if valid else 0.0
    return {"combined_score": target_ratio, "sum_radii": sum_radii}
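The excerpt above calls validate_packing without showing its body. A minimal sketch of what such a check might look like, assuming centers is a sequence of (x, y) pairs and radii a sequence of floats. The body and the tolerance are illustrative assumptions, not the benchmark's actual code:

```python
import itertools
import math


def validate_packing(centers, radii, tol=1e-9):
    """Hypothetical check: circles lie inside the unit square and do not overlap."""
    if len(centers) != len(radii):
        return False
    for (x, y), r in zip(centers, radii):
        # Radii must be non-negative and each circle must stay within [0, 1] x [0, 1].
        if r < 0 or x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return False
    for i, j in itertools.combinations(range(len(radii)), 2):
        # Two circles overlap when their centers are closer than the sum of their radii.
        if math.dist(centers[i], centers[j]) < radii[i] + radii[j] - tol:
            return False
    return True
```

The small tolerance keeps tangent circles and circles touching the boundary from being rejected because of floating-point rounding.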
Path: benchmarks/math/heilbronn_triangle/
Problem: Place N points in a unit square to maximize the minimum triangle area.
Run:
uv run skydiscover-run \
  benchmarks/math/heilbronn_triangle/initial_program.py \
  benchmarks/math/heilbronn_triangle/evaluator.py \
  -s adaevolve -i 100
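The objective for this problem can be computed by brute force over all point triples. A minimal sketch, assuming points are (x, y) tuples. The function name is hypothetical, and the benchmark's evaluator may validate or normalize the result differently:

```python
import itertools


def min_triangle_area(points):
    """Smallest area among all triangles formed by three of the given points."""

    def area(p, q, r):
        # Half the absolute cross product of the edge vectors q - p and r - p.
        return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2.0

    return min(area(p, q, r) for p, q, r in itertools.combinations(points, 3))
```

For the four corners of the unit square every triple forms a triangle of area 0.5, so the function returns 0.5. Three collinear points give 0.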
Path: benchmarks/math/erdos_min_overlap/
Problem: Construct sets with minimal overlap satisfying Erdős constraints.
Paths:
  • benchmarks/math/first_autocorr_ineq/
  • benchmarks/math/second_autocorr_ineq/
  • benchmarks/math/third_autocorr_ineq/
Problem: Find binary sequences minimizing autocorrelation merit factor.
  • Hexagon Packing: benchmarks/math/hexagon_packing/
  • Heilbronn Convex: benchmarks/math/heilbronn_convex/
  • Signal Processing: benchmarks/math/signal_processing/
  • Matrix Multiplication: benchmarks/math/matmul/
  • Min-Max Distance: benchmarks/math/minimizing_max_min_dist/

ADRS (Systems Benchmarks)

Path: benchmarks/ADRS/cloudcast/
Problem: Schedule cloud VMs to minimize cost while meeting performance targets.
Dependencies:
uv sync --extra adrs
Run:
uv run skydiscover-run \
  benchmarks/ADRS/cloudcast/initial_program.py \
  benchmarks/ADRS/cloudcast/evaluator.py \
  -s adaevolve -i 50
Path: benchmarks/ADRS/eplb/
Problem: Balance load across a mixture-of-experts model to minimize latency.
Path: benchmarks/ADRS/prism/
Problem: Place ML models on heterogeneous devices for optimal throughput.
Path: benchmarks/ADRS/txn_scheduling/
Problem: Schedule database transactions to maximize concurrency.
Path: benchmarks/ADRS/llm_sql/
Problem: Optimize SQL queries for LLM-powered database systems.

GPU Kernels

Paths:
  • benchmarks/gpu_mode/vecadd/ - Vector addition
  • benchmarks/gpu_mode/grayscale/ - Image grayscale conversion
  • benchmarks/gpu_mode/trimul/ - Triangle multiplicative update
  • benchmarks/gpu_mode/mla_decode/ - Multi-head latent attention decode
Problem: Optimize Triton GPU kernels for performance.
Requirements: CUDA-capable GPU
Run:
uv run skydiscover-run \
  benchmarks/gpu_mode/vecadd/initial_program.py \
  benchmarks/gpu_mode/vecadd/evaluator.py \
  -s adaevolve -i 50

Competitive Programming

Path: benchmarks/frontier-cs-eval/
Problem: Solve competitive programming problems (ICPC, Codeforces, AtCoder).
Setup:
uv sync --extra frontier-cs
cd benchmarks/frontier-cs-eval
python run_all_frontiercs.py --model gpt-5 --search adaevolve
Features:
  • Docker-based judge for secure execution
  • 172 problems from Frontier-CS benchmark
  • Automated testing and scoring
Path: benchmarks/ale_bench/
Problem: AtCoder Heuristic Contest problems (C++).
Examples:
  • ale_bench/ale-bench-lite-problems/ahc046/
  • ale_bench/ale-bench-lite-problems/ahc039/
  • And 8 more…

Reasoning

Path: benchmarks/arc_benchmark/
Problem: Abstract reasoning tasks (visual pattern completion).
Description: Generate Python code to solve ARC-AGI visual reasoning puzzles.
Run:
uv run skydiscover-run \
  benchmarks/arc_benchmark/evaluator.py \
  -c benchmarks/arc_benchmark/config.yaml \
  -s adaevolve -i 100

Creative Tasks

Path: benchmarks/image_gen/sky_festival/
Problem: Evolve DALL-E/Stable Diffusion prompts for a “sky festival” image.
Run:
uv run skydiscover-run \
  benchmarks/image_gen/sky_festival/initial_prompt.txt \
  benchmarks/image_gen/sky_festival/evaluator.py \
  -c benchmarks/image_gen/sky_festival/config_adaevolve.yaml \
  -s adaevolve -i 50
Note: Requires image generation API credentials.

Prompt Optimization

Path: benchmarks/prompt_optimization/hotpot_qa/
Problem: Evolve natural-language prompts (not code) for question-answering.
Setup:
uv sync --extra prompt-optimization
Run:
uv run skydiscover-run \
  benchmarks/prompt_optimization/hotpot_qa/initial_prompt.txt \
  benchmarks/prompt_optimization/hotpot_qa/evaluator.py \
  -c benchmarks/prompt_optimization/hotpot_qa/config.yaml \
  -s adaevolve -i 50
Config excerpt:
language: text
diff_based_generation: false
file_suffix: ".txt"

Benchmark Structure

Every benchmark follows this pattern:
<benchmark_name>/
├── initial_program.py      # Starting solution (contains EVOLVE-BLOCK)
├── evaluator.py           # Scoring function (returns combined_score)
├── config.yaml            # System prompt + search/evaluator settings
├── README.md              # Problem description and setup
└── requirements.txt       # (optional) Additional dependencies

EVOLVE-BLOCK Markers

Mark the region for SkyDiscover to evolve:
initial_program.py
# EVOLVE-BLOCK-START
def solve(input_data):
    # LLM will improve this function
    return simple_solution(input_data)
# EVOLVE-BLOCK-END

# Code outside the block remains unchanged
def helper_function():
    pass
For prompt optimization tasks (.txt files), the entire file is evolved — no markers needed.
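To make the marker convention concrete, the following sketch splits a source file into the text before, inside, and after the marked region. It is only an illustration of the idea. The function name is hypothetical and this is not SkyDiscover's own parsing code:

```python
START = "# EVOLVE-BLOCK-START"
END = "# EVOLVE-BLOCK-END"


def split_evolve_block(source: str):
    """Return (before, block, after) around the first marked region.

    Raises ValueError if either marker is missing.
    """
    lines = source.splitlines()
    stripped = [line.strip() for line in lines]
    start = stripped.index(START)
    end = stripped.index(END, start + 1)
    before = "\n".join(lines[: start + 1])   # everything up to and including the start marker
    block = "\n".join(lines[start + 1 : end])  # the region open to evolution
    after = "\n".join(lines[end:])           # the end marker and the fixed code after it
    return before, block, after
```

A new candidate could then be assembled as before + new block + after, which leaves helper code outside the markers untouched.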

Creating Your Own Benchmark

1. Write an Evaluator

evaluator.py
def evaluate(program_path: str) -> dict:
    # Load and run the program
    result = run_program(program_path)
    
    # Compute score
    score = compute_score(result)
    
    return {
        "combined_score": score,  # Required
        "custom_metric": 0.95,    # Optional
    }
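The template above leaves run_program and compute_score undefined. A runnable sketch of one way to fill them in, assuming the candidate file defines a solve function as in the EVOLVE-BLOCK example. The sorting task, the test cases, and the cases_passed metric are invented for illustration:

```python
import importlib.util


def load_solve(program_path: str):
    """Import the candidate file as a module and return its solve function."""
    spec = importlib.util.spec_from_file_location("candidate", program_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.solve


def evaluate(program_path: str) -> dict:
    try:
        solve = load_solve(program_path)
        # Hypothetical task: sort a list. The score is the fraction of cases solved.
        cases = [[3, 1, 2], [5, 4], []]
        passed = sum(solve(list(case)) == sorted(case) for case in cases)
    except Exception:
        # A candidate that fails to load or crashes gets the minimum score.
        return {"combined_score": 0.0}
    return {
        "combined_score": passed / len(cases),  # Required
        "cases_passed": float(passed),          # Optional
    }
```

Catching exceptions inside evaluate means one broken candidate yields a score of 0 rather than an error in the evaluator itself.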
2. (Optional) Create Initial Program

initial_program.py
# EVOLVE-BLOCK-START
def solve(input_data):
    return naive_solution(input_data)
# EVOLVE-BLOCK-END
Or start from scratch by omitting this file.
3. Write Config

config.yaml
max_iterations: 100

llm:
  models:
    - name: "gpt-5"
      weight: 1.0

search:
  type: "adaevolve"

prompt:
  system_message: |
    You are an expert in [domain].
    Improve the given function to maximize [objective].

evaluator:
  timeout: 360
4. Test Locally

uv run skydiscover-run \
  initial_program.py \
  evaluator.py \
  -c config.yaml \
  -s adaevolve \
  -i 10
See Writing Evaluators for detailed guidance.

Benchmark Best Practices

Keep combined_score in the [0, 1] range:
BEST_KNOWN = 2.635
score = min(sum_radii / BEST_KNOWN, 1.0)
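The clamp above only bounds the score from above. A slightly more defensive sketch also handles invalid, negative, and non-finite values. The function name and the valid flag are assumptions made for illustration:

```python
import math

BEST_KNOWN = 2.635  # target used by the circle packing benchmark


def normalized_score(value: float, valid: bool = True) -> float:
    """Map a raw objective to [0, 1]; invalid or non-finite results score 0."""
    if not valid or not math.isfinite(value):
        return 0.0
    return max(0.0, min(value / BEST_KNOWN, 1.0))
```

Clamping at 1.0 hides any improvement beyond the best known value, so it may be worth returning the raw objective as a separate metric, as the circle packing excerpt does with sum_radii.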
Prevent slow programs from blocking discovery:
evaluator:
  timeout: 60  # Kill after 60 seconds
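The timeout setting is applied by SkyDiscover, and how it is enforced is not described here. If an evaluator launches candidate code in a child process itself, a similar guard can be sketched with the standard library. The helper name and the shape of the returned dict are assumptions:

```python
import subprocess
import sys


def run_candidate(program_path: str, timeout_s: float) -> dict:
    """Run a script in a child process and report whether it hit the time limit."""
    try:
        proc = subprocess.run(
            [sys.executable, program_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising.
        return {"timed_out": True, "returncode": None, "stdout": ""}
    return {"timed_out": False, "returncode": proc.returncode, "stdout": proc.stdout}
```

An evaluator built on this helper could return a combined_score of 0.0 whenever timed_out is true.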
Log multiple metrics for analysis:
return {
    "combined_score": 0.87,
    "accuracy": 0.92,
    "speed": 1.3,
    "memory": 512,
    "validity": 1.0,
}
A reasonable starting point helps algorithms converge faster:
# Don't start with a no-op
def solve(x):
    return x  # Too simple

# Do provide a working baseline
def solve(x):
    return simple_heuristic(x)  # Good starting point
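As a concrete example of a working baseline, here is a simple grid layout for the circle packing problem. It is an illustrative starting point and not the benchmark's actual initial_program.py. The function name is hypothetical, and the return shape mirrors the (centers, radii, sum_radii) tuple used in the evaluator excerpt:

```python
import math


def grid_packing(n: int = 26):
    """Place n equal circles on a k x k grid inside the unit square."""
    k = math.ceil(math.sqrt(n))  # smallest square grid with at least n cells
    r = 1.0 / (2 * k)            # each circle exactly fills one cell of side 1/k
    # Cell i sits in column i % k and row i // k; centers are at odd multiples of r.
    centers = [((2 * (i % k) + 1) * r, (2 * (i // k) + 1) * r) for i in range(n)]
    radii = [r] * n
    return centers, radii, sum(radii)
```

For n = 26 this gives a 6 x 6 grid with radius 1/12, so the sum of radii is 26/12 ≈ 2.17, roughly 0.82 of the 2.635 target. That is valid but clearly improvable, which is what a search algorithm needs from a starting point.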

Reproducing Published Results

AlphaEvolve (Circle Packing)

uv run skydiscover-run \
  benchmarks/math/circle_packing/initial_program.py \
  benchmarks/math/circle_packing/evaluator.py \
  -c benchmarks/math/circle_packing/config.yaml \
  -s adaevolve \
  -i 200 \
  -m gpt-5
Expected: combined_score ≥ 0.95, which corresponds to a sum of radii of about 2.50 against the 2.635 target.

Frontier-CS Benchmark

cd benchmarks/frontier-cs-eval
python run_all_frontiercs.py \
  --model gpt-5 \
  --search adaevolve \
  --iterations 100
Expected: Solve 60-80% of problems depending on difficulty tier.

Performance Comparison

Here are typical results across search algorithms (averaged over 10 math benchmarks):
Algorithm   | Mean Score | Best Score | Runtime (min)
topk        | 0.65       | 0.78       | 15
beam_search | 0.71       | 0.83       | 22
adaevolve   | 0.82       | 0.91       | 35
evox        | 0.79       | 0.89       | 40
gepa        | 0.84       | 0.93       | 38
openevolve  | 0.86       | 0.95       | 45
Results vary by problem, model, and random seed. Run your own experiments!

Benchmark Categories Summary

Category    | # Tasks  | Avg Runtime    | Dependencies
Math        | 14       | 20-40 min      | --extra math
ADRS        | 5        | 30-60 min      | --extra adrs
GPU         | 4        | 10-30 min      | CUDA GPU
Frontier-CS | 172      | 5-20 min each  | --extra frontier-cs
ARC-AGI     | Multiple | 40-80 min      | Base install
ALE-Bench   | 10       | 30-60 min      | C++ compiler
Image Gen   | 1        | 40-60 min      | Image API
Prompts     | 1        | 20-40 min      | --extra prompt-optimization

Next Steps

  • Writing Evaluators: Learn from benchmark evaluators
  • Configuration: Understand benchmark configs
  • Running Discovery: Run your first benchmark
  • GitHub Repository: Browse all benchmarks on GitHub
