Overview
SkyDiscover includes comprehensive benchmarks across multiple domains. Use these to:
- Test search algorithms on real problems
- Learn evaluator patterns from working examples
- Benchmark LLM performance on hard optimization tasks
- Reproduce published results from research papers
- Math: 14 tasks
- Systems: 5 tasks (ADRS)
- GPU Kernels: 4 tasks (Triton)
- Algorithms: 172 tasks (Frontier-CS)
- Reasoning: ARC-AGI tasks
- Creative: Image generation
Quick Start
Installation
Running a Benchmark
Benchmark Catalog
Math Benchmarks
Circle Packing
Path: benchmarks/math/circle_packing/
Problem: Pack 26 circles in a unit square to maximize the sum of radii.
Target: 2.635 (AlphaEvolve result)
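The core check a circle-packing evaluator needs can be sketched as follows. This is a minimal illustration, not the benchmark's actual evaluator: the function name `sum_of_radii_if_valid`, the `(x, y, r)` tuple layout, and the tolerance are all assumptions made for the example.

```python
import math
from itertools import combinations

def sum_of_radii_if_valid(circles, tol=1e-9):
    """Return the sum of radii, or 0.0 if the packing is invalid.

    circles: iterable of (x, y, r) tuples (hypothetical layout).
    """
    circles = list(circles)
    for x, y, r in circles:
        # Each circle needs a positive radius and must lie inside [0, 1]^2.
        if r <= 0:
            return 0.0
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return 0.0
    for (x1, y1, r1), (x2, y2, r2) in combinations(circles, 2):
        # Two circles overlap when their centers are closer than r1 + r2.
        if math.hypot(x1 - x2, y1 - y2) < r1 + r2 - tol:
            return 0.0
    return sum(r for _, _, r in circles)
```

Returning 0.0 for any invalid packing keeps infeasible candidates from outscoring feasible ones; a real evaluator might instead report the violation as a separate metric.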
Heilbronn Triangle
Path: benchmarks/math/heilbronn_triangle/
Problem: Place N points in a unit square to maximize the minimum triangle area.
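The objective for this problem, the smallest area over all triangles formed by three of the points, can be computed by brute force. The sketch below assumes points are `(x, y)` tuples and uses a hypothetical function name; it is not taken from the benchmark's evaluator.

```python
from itertools import combinations

def min_triangle_area(points):
    """Smallest area among all triangles formed by three of the points."""
    def area(a, b, c):
        # Half the absolute cross product of (b - a) and (c - a).
        cross = (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
        return abs(cross) / 2.0

    # O(N^3) enumeration; fine for the small N typical of this problem.
    return min(area(a, b, c) for a, b, c in combinations(points, 3))
```

Three collinear points give an area of 0, so any configuration containing them scores 0 regardless of how well the other points are spread. The function needs at least three points.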
Erdős Minimum Overlap
Path: benchmarks/math/erdos_min_overlap/
Problem: Construct sets with minimal overlap satisfying Erdős constraints.
Autocorrelation Inequalities
Paths:
- benchmarks/math/first_autocorr_ineq/
- benchmarks/math/second_autocorr_ineq/
- benchmarks/math/third_autocorr_ineq/
Other Math Tasks
- Hexagon Packing: benchmarks/math/hexagon_packing/
- Heilbronn Convex: benchmarks/math/heilbronn_convex/
- Signal Processing: benchmarks/math/signal_processing/
- Matrix Multiplication: benchmarks/math/matmul/
- Min-Max Distance: benchmarks/math/minimizing_max_min_dist/
ADRS (Systems Benchmarks)
CloudCast (Cloud Scheduling)
Path: benchmarks/ADRS/cloudcast/
Problem: Schedule cloud VMs to minimize cost while meeting performance targets.
Dependencies: see the ADRS row in the Benchmark Categories Summary (--extra adrs).
EPLB (MoE Load Balancing)
Path: benchmarks/ADRS/eplb/
Problem: Balance load across the experts of a mixture-of-experts model to minimize latency.
Prism (Model Placement)
Path: benchmarks/ADRS/prism/
Problem: Place ML models on heterogeneous devices for optimal throughput.
Transaction Scheduling
Path: benchmarks/ADRS/txn_scheduling/
Problem: Schedule database transactions to maximize concurrency.
LLM-SQL (Query Optimization)
Path: benchmarks/ADRS/llm_sql/
Problem: Optimize SQL queries for LLM-powered database systems.
GPU Kernels
Triton Kernel Optimization
Paths:
- benchmarks/gpu_mode/vecadd/ - Vector addition
- benchmarks/gpu_mode/grayscale/ - Image grayscale conversion
- benchmarks/gpu_mode/trimul/ - Matrix multiplication
- benchmarks/gpu_mode/mla_decode/ - Multi-head latent attention decode
Competitive Programming
Frontier-CS Eval (172 Problems)
Path: benchmarks/frontier-cs-eval/
Problem: Solve competitive programming problems (ICPC, Codeforces, AtCoder).
Features:
- Docker-based judge for secure execution
- 172 problems from the Frontier-CS benchmark
- Automated testing and scoring
ALE-Bench (10 Problems)
Path: benchmarks/ale_bench/
Problem: AtCoder Heuristic Contest problems (C++).
Examples:
- ale_bench/ale-bench-lite-problems/ahc046/
- ale_bench/ale-bench-lite-problems/ahc039/
- And 8 more…
Reasoning
ARC-AGI
Path: benchmarks/arc_benchmark/
Problem: Abstract reasoning tasks (visual pattern completion).
Description: Generate Python code to solve ARC-AGI visual reasoning puzzles.
Creative Tasks
AI Image Generation
Path: benchmarks/image_gen/sky_festival/
Problem: Evolve DALL-E/Stable Diffusion prompts for a “sky festival” image.
Note: Requires image generation API credentials.
Prompt Optimization
HotPotQA
Path: benchmarks/prompt_optimization/hotpot_qa/
Problem: Evolve natural-language prompts (not code) for question answering.
Benchmark Structure
Every benchmark follows this pattern:
EVOLVE-BLOCK Markers
Mark the region for SkyDiscover to evolve:
initial_program.py
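A marked file might look like the sketch below. The exact marker spelling is an assumption here: it follows the `# EVOLVE-BLOCK-START` / `# EVOLVE-BLOCK-END` comment convention used by OpenEvolve, so check an existing benchmark's initial_program.py for the form SkyDiscover expects. The functions `helper` and `solve` are hypothetical.

```python
# initial_program.py (illustrative; marker spelling is an assumption)

def helper(values):
    # Code outside the markers is meant to stay fixed during the search.
    return sorted(values)

# EVOLVE-BLOCK-START
def solve(values):
    """Baseline to be improved: return the largest element."""
    return helper(values)[-1]
# EVOLVE-BLOCK-END

if __name__ == "__main__":
    print(solve([3, 1, 2]))
```

Keeping input parsing, helpers, and the entry point outside the marked region limits what a candidate can break, so the evaluator can always call the same interface.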
For prompt optimization tasks (.txt files), the entire file is evolved, so no markers are needed.
Creating Your Own Benchmark
Benchmark Best Practices
Normalize Scores
Keep combined_score in the [0, 1] range:
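One simple way to do this, assuming the raw objective is to be maximized and a positive target value is known, is to divide by the target and clamp. The helper name `normalize_score` is illustrative, not part of SkyDiscover.

```python
def normalize_score(raw, target):
    """Map a raw objective value to [0, 1] relative to a positive target."""
    if target <= 0:
        raise ValueError("target must be positive")
    # Clamp so that negative or above-target values stay inside [0, 1].
    return max(0.0, min(1.0, raw / target))
```

For the circle packing target of 2.635, a sum of radii of 2.50 gives `normalize_score(2.50, 2.635)`, which is about 0.949. Clamping at 1.0 hides results that beat the target, so it can be worth reporting the raw value as a separate metric.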
Use Timeouts
Prevent slow programs from blocking discovery:
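A minimal sketch of one approach, using only the Python standard library: run the candidate in a subprocess and give up after a fixed time. The function name `run_with_timeout` and the 30-second default are assumptions for illustration, not SkyDiscover settings.

```python
import subprocess
import sys

def run_with_timeout(program_path, timeout_s=30.0):
    """Run a candidate program in a subprocess; return (ok, stdout)."""
    try:
        result = subprocess.run(
            [sys.executable, program_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child once the timeout expires.
        return False, ""
    # A non-zero exit code (crash, uncaught exception) also counts as failure.
    return result.returncode == 0, result.stdout
```

Running the candidate in a separate process also isolates crashes and infinite loops from the evaluator itself. An evaluator would typically map a failed run to a score of 0.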
Return Rich Metrics
Log multiple metrics for analysis:
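As an illustration, an evaluator could return a dictionary with one headline score plus diagnostic values. Only combined_score appears in this documentation; the other keys and the function name `build_metrics` are hypothetical.

```python
def build_metrics(sum_radii, target, runtime_s, valid):
    """Assemble a metrics dict: one headline score plus diagnostics."""
    # Invalid solutions score 0; valid ones are scaled into [0, 1].
    score = max(0.0, min(1.0, sum_radii / target)) if valid else 0.0
    return {
        "combined_score": score,  # headline metric, kept in [0, 1]
        "sum_radii": sum_radii,   # raw objective value (hypothetical key)
        "valid": float(valid),    # 1.0 if constraints hold (hypothetical key)
        "runtime_s": runtime_s,   # wall-clock seconds (hypothetical key)
    }
```

Keeping the raw objective and a validity flag next to the normalized score makes it easier to tell whether a low score came from a weak solution or from a constraint violation.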
Provide Good Initial Program
A reasonable starting point helps algorithms converge faster:
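For the circle packing task, for example, a valid but unoptimized baseline is to place equal circles on a regular grid. This sketch is an assumption about what such a starting point could look like; the benchmark's actual initial_program.py may differ, and `grid_packing` is a hypothetical name.

```python
import math

def grid_packing(n=26):
    """Place n equal circles on a regular grid inside the unit square.

    Returns a list of (x, y, r) tuples.
    """
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    # Radius is limited by the tighter of the two grid spacings.
    radius = 1.0 / (2 * max(cols, rows))
    circles = []
    for i in range(n):
        row, col = divmod(i, cols)
        # Center each circle in its grid cell.
        circles.append(((col + 0.5) / cols, (row + 0.5) / rows, radius))
    return circles
```

With n = 26 this uses a 6 x 5 grid and radius 1/12, for a sum of radii of about 2.17. That is well short of the 2.635 target, but every candidate derived from it starts from a feasible layout.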
Reproducing Published Results
AlphaEvolve (Circle Packing)
combined_score ≥ 0.95, which corresponds to a sum of radii of at least 2.50 against the 2.635 target (2.50 / 2.635 ≈ 0.95)
Frontier-CS Benchmark
Performance Comparison
Here are typical results across search algorithms (averaged over 10 math benchmarks):
| Algorithm | Mean Score | Best Score | Runtime (min) |
|---|---|---|---|
| topk | 0.65 | 0.78 | 15 |
| beam_search | 0.71 | 0.83 | 22 |
| adaevolve | 0.82 | 0.91 | 35 |
| evox | 0.79 | 0.89 | 40 |
| gepa | 0.84 | 0.93 | 38 |
| openevolve | 0.86 | 0.95 | 45 |
Results vary by problem, model, and random seed. Run your own experiments!
Benchmark Categories Summary
| Category | # Tasks | Avg Runtime | Dependencies |
|---|---|---|---|
| Math | 14 | 20-40 min | --extra math |
| ADRS | 5 | 30-60 min | --extra adrs |
| GPU | 4 | 10-30 min | CUDA GPU |
| Frontier-CS | 172 | 5-20 min each | --extra frontier-cs |
| ARC-AGI | Multiple | 40-80 min | Base install |
| ALE-Bench | 10 | 30-60 min | C++ compiler |
| Image Gen | 1 | 40-60 min | Image API |
| Prompts | 1 | 20-40 min | --extra prompt-optimization |
Next Steps
- Writing Evaluators: Learn from benchmark evaluators
- Configuration: Understand benchmark configs
- Running Discovery: Run your first benchmark
- GitHub Repository: Browse all benchmarks on GitHub