Overview
The scheduler acts as the central coordinator between build clients and worker nodes.

Responsibilities:
- Accept execution requests from clients
- Queue actions awaiting workers
- Match actions to capable workers based on platform properties
- Monitor worker health and remove dead workers
- Handle action timeouts and retries
- Stream execution updates to clients
Scheduler Types
NativeLink provides four scheduler implementations that can be composed for different deployment patterns.

Simple Scheduler
The core scheduler implementation that manages worker pools and action execution.

Platform Properties
Defines how worker capabilities are matched to action requirements:
- exact
- minimum
- priority
- ignore
The exact strategy requires an exact string match between the action and worker property.

Example:
- Action: cpu_arch: "arm64"
- Worker: cpu_arch: "arm64" ✓
- Worker: cpu_arch: "x86_64" ✗
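As a sketch of how this looks in a NativeLink-style JSON5 config (the scheduler name is hypothetical and exact field spellings may differ across versions), matching strategies are declared per property under supported_platform_properties:

```json5
{
  "schedulers": {
    // "MAIN_SCHEDULER" is an illustrative name.
    "MAIN_SCHEDULER": {
      "simple": {
        "supported_platform_properties": {
          // Action and worker values must match exactly.
          "cpu_arch": "exact",
          // Worker's value must be at least the action's requested value.
          "memory_mb": "minimum",
          // Property is not used for matching.
          "OSFamily": "ignore"
        }
      }
    }
  }
}
```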
Allocation Strategies
Determines which worker is selected when multiple workers match:
- least_recently_used: Assign the matching worker that has been idle longest, spreading load evenly (a good default for most workloads)
- most_recently_used: Reuse the most recently active matching worker, which can improve warm-cache hit rates on that worker
Timeout Configuration
- Worker Timeout: How long a worker may go without responding before it is removed from the pool
- Client Action Timeout: How long a client waits on an action before giving up
- Max Action Executing Timeout: Upper bound on a single action's execution time; set to 0 to rely only on worker keepalives
- Retain Completed: How long completed results are kept so late WaitExecution calls can still retrieve them. Default: 60 seconds

Retry Configuration
Infrastructure-level failures are retried automatically:
- Worker disconnection
- Internal server errors
- Network timeouts
- CAS upload/download failures

Failures caused by the action itself are not retried:
- Compilation errors (exit code 1)
- Test failures (exit code != 0)
- Missing input files
- Invalid action configuration
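A minimal sketch of how the timeout and retry knobs above might sit together in a simple scheduler's JSON5 config (illustrative values; field names such as worker_timeout_s, retain_completed_for_s, and max_job_retries follow the terms used on this page and should be checked against your version's schema):

```json5
{
  "schedulers": {
    // Hypothetical scheduler name.
    "MAIN_SCHEDULER": {
      "simple": {
        // Remove workers that go 5s without a keepalive.
        "worker_timeout_s": 5,
        // Keep completed results for 60s so late WaitExecution
        // calls can still fetch them (the default noted above).
        "retain_completed_for_s": 60,
        // Retry infrastructure failures up to 3 times.
        "max_job_retries": 3
      }
    }
  }
}
```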
Backend Storage
Scheduler state can be persisted for high availability:
- Memory (Default): Stores all state in memory. Pros: fast and simple. Cons: state is lost on restart
- Redis (Experimental): Persists scheduler state externally so it survives restarts and can be shared across scheduler instances
Cache Lookup Scheduler
Wraps another scheduler with Action Cache checking:
- Check the Action Cache for an existing result
- If cache hit: return the cached result immediately
- If cache miss: forward to the nested scheduler for execution
- After execution: store the result in the Action Cache
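The steps above can be sketched as a cache_lookup wrapper around a simple scheduler (names are illustrative; AC_MAIN is a hypothetical store assumed to be defined elsewhere in the config):

```json5
{
  "schedulers": {
    "CACHED_SCHEDULER": {
      "cache_lookup": {
        // Store consulted for Action Cache hits.
        "ac_store": "AC_MAIN",
        // Nested scheduler that executes on a cache miss.
        "scheduler": {
          "simple": {
            "supported_platform_properties": { "cpu_arch": "exact" }
          }
        }
      }
    }
  }
}
```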
Recommendation: Use CompletenessCheckingSpec for the ac_store to ensure cached results reference existing CAS objects.

Property Modifier Scheduler
Modifies action platform properties before forwarding to the nested scheduler:
- Add: Add a new property to all actions
- Remove: Remove a property from all actions
- Replace: Replace the value of an existing property

Use Cases:
- Route to specific worker pools
- Add default properties
- Tag actions for monitoring
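For example, routing to a dedicated worker pool might look like the following property_modifier sketch (the pool property, gpu value, and scheduler name are hypothetical; verify the modification syntax against your version's schema):

```json5
{
  "schedulers": {
    "GPU_ROUTER": {
      "property_modifier": {
        "modifications": [
          // Tag every action so only gpu-pool workers match it.
          { "add": { "name": "pool", "value": "gpu" } }
        ],
        // Nested scheduler that performs the actual matching.
        "scheduler": {
          "simple": {
            "supported_platform_properties": { "pool": "exact" }
          }
        }
      }
    }
  }
}
```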
GRPC Scheduler
Forwards all requests to a remote scheduler via gRPC.
- endpoint: Remote scheduler address and connection settings
- connections_per_endpoint: TCP connection pooling
- max_concurrent_requests: Limit in-flight requests
- retry: Retry behavior for transient failures
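Putting those four settings together, a forwarding scheduler might be sketched as follows (the address and all values are illustrative, and the retry sub-fields vary by version):

```json5
{
  "schedulers": {
    "REMOTE_FORWARDER": {
      "grpc_scheduler": {
        // Hypothetical remote scheduler address.
        "endpoint": { "address": "grpc://scheduler.example.com:50051" },
        // Pool several TCP connections to the endpoint.
        "connections_per_endpoint": 4,
        // Cap in-flight requests to the remote scheduler.
        "max_concurrent_requests": 1024,
        // Retry transient failures a few times.
        "retry": { "max_retries": 3 }
      }
    }
  }
}
```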
Hybrid Deployments
Local CAS caching with remote execution cluster. Clients upload to local CAS, scheduler forwards execution to remote cluster.
Multi-Region
Regional schedulers forward to a global scheduler. Reduces latency while maintaining a centralized worker pool.
Development
Local developer builds use a remote shared scheduler. Developers get remote execution without running a full cluster.
Federation
Multiple independent clusters with cross-cluster fallback. The primary cluster handles most work, with overflow to the secondary.
Scheduler Composition
Schedulers can be nested to create sophisticated routing and caching strategies.

Example: Full-Featured Scheduler
- Cache Lookup: Check AC for cached result
- Property Modifier: Add cluster tag
- Simple Scheduler: Match to workers and execute
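The three layers above nest directly in configuration; a sketch with all names and values illustrative:

```json5
{
  "schedulers": {
    "MAIN_SCHEDULER": {
      // Layer 1: consult the Action Cache first.
      "cache_lookup": {
        "ac_store": "AC_MAIN",
        "scheduler": {
          // Layer 2: tag actions with a cluster property.
          "property_modifier": {
            "modifications": [
              { "add": { "name": "cluster", "value": "us-east" } }
            ],
            "scheduler": {
              // Layer 3: match to workers and execute.
              "simple": {
                "supported_platform_properties": {
                  "cpu_arch": "exact",
                  "cluster": "exact"
                }
              }
            }
          }
        }
      }
    }
  }
}
```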
Worker Management
Worker Registration
Workers connect to the scheduler and register their capabilities.

Worker Health Monitoring
Scheduler monitors worker health via:
- Keepalive Messages: Workers send periodic heartbeats
- Timeout Detection: Workers not responding within worker_timeout_s are removed
- Backpressure: Workers can signal they're full (paused state)
- Draining: Workers can request graceful shutdown
Worker Capacity
Workers declare maximum concurrent actions:
- Running Actions: Currently executing
- Available Slots: max_inflight_tasks - running_actions
- Paused State: No available slots (backpressure)
Monitoring and Debugging
Logging
Control scheduler logging verbosity:
- > 0: Log worker matching events every N seconds
- -1: Disable worker matching logs

Log messages to look for:
- "Worker busy": All capable workers are at capacity
- "Can't find any worker": No workers match the platform properties
- "Action assigned": Successful worker assignment
Metrics
Scheduler exposes Prometheus metrics:
- Actions queued: Number of actions awaiting workers
- Actions executing: Number of actions currently running
- Actions completed: Total completed actions
- Workers connected: Number of active workers
- Worker timeouts: Workers removed due to timeout
- Action retries: Number of retried actions
Tracing
OpenTelemetry traces provide visibility into:
- Action queuing duration
- Worker matching time
- Execution duration
- Result upload time
Best Practices
- Always use the cache_lookup scheduler in production to leverage the Action Cache
- Configure platform properties to match your worker heterogeneity
- Set appropriate timeouts based on expected action duration
- Use LRU allocation for most workloads unless you have specific caching needs
- Enable the Redis backend for multi-scheduler deployments or HA requirements
- Monitor worker health and adjust worker_timeout_s for network conditions
- Tune max_job_retries based on infrastructure reliability
Troubleshooting
Common Issues
Next Steps
- Workers: Configure and manage worker nodes
- Remote Execution: Understand the execution flow
- Architecture: See how schedulers fit in the system