Schedulers are the orchestration layer in NativeLink that manage the lifecycle of build actions, match them to appropriate workers, and handle failures and retries. Understanding scheduler configuration is crucial for optimizing remote execution performance.

Overview

The scheduler acts as the central coordinator between build clients and worker nodes. Its responsibilities:
  • Accept execution requests from clients
  • Queue actions awaiting workers
  • Match actions to capable workers based on platform properties
  • Monitor worker health and remove dead workers
  • Handle action timeouts and retries
  • Stream execution updates to clients

Scheduler Types

NativeLink provides four scheduler implementations that can be composed for different deployment patterns.

Simple Scheduler

The core scheduler implementation that manages worker pools and action execution.
{
  "simple": {
    "supported_platform_properties": {
      "cpu_arch": "exact",
      "OSFamily": "exact",
      "cpu_count": "minimum",
      "memory_gb": "minimum"
    },
    "allocation_strategy": "least_recently_used",
    "retain_completed_for_s": 60,
    "client_action_timeout_s": 60,
    "worker_timeout_s": 5,
    "max_action_executing_timeout_s": 300,
    "max_job_retries": 3,
    "worker_match_logging_interval_s": 10
  }
}

Platform Properties

Defines how worker capabilities are matched to action requirements:
exact requires an exact string match between the action's requested value and the worker's advertised value. Example, with an action requesting cpu_arch: "arm64":
  • Worker with cpu_arch: "arm64" → matches
  • Worker with cpu_arch: "x86_64" → does not match
Use For: OS family, CPU architecture, environment type
Configuration Example:
{
  "supported_platform_properties": {
    "cpu_arch": "exact",        // Must match exactly
    "OSFamily": "exact",        // Must match exactly
    "cpu_count": "minimum",     // Worker must have >= requested
    "memory_gb": "minimum",     // Worker must have >= requested
    "pool": "priority",         // Informational only
    "optional_gpu": "ignore"    // Actions can request, workers needn't have
  }
}
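The matching semantics above can be summarized in a short sketch. This is a Python illustration, not NativeLink's actual (Rust) implementation; the function name and signature are hypothetical:

```python
# Illustrative sketch of platform-property matching semantics.
# "exact" must match the string; "minimum" must meet or exceed a numeric
# threshold; "priority" and "ignore" properties don't restrict matching.

def worker_matches(action_props, worker_props, supported):
    """Return True if a worker satisfies an action's platform properties."""
    for name, required in action_props.items():
        kind = supported.get(name)
        offered = worker_props.get(name)
        if kind == "exact":
            # Worker must advertise exactly the requested value.
            if offered != required:
                return False
        elif kind == "minimum":
            # Worker must offer at least the requested numeric amount.
            if offered is None or int(offered) < int(required):
                return False
        # "priority" / "ignore": informational only, no restriction here.
    return True

supported = {"cpu_arch": "exact", "cpu_count": "minimum", "optional_gpu": "ignore"}
action = {"cpu_arch": "arm64", "cpu_count": "8"}
print(worker_matches(action, {"cpu_arch": "arm64", "cpu_count": "16"}, supported))   # True
print(worker_matches(action, {"cpu_arch": "x86_64", "cpu_count": "16"}, supported))  # False
```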

Allocation Strategies

Determines which worker is selected when multiple workers match an action:
  • least_recently_used (the value used throughout this page): Prefer the worker that has been idle longest, spreading load evenly across the pool.
  • most_recently_used: Prefer the most recently used worker, concentrating load so that idle workers can be drained or scaled down.
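As a rough illustration (Python sketch, not the actual Rust implementation), least-recently-used selection just picks the matching worker whose last assignment is oldest:

```python
# Illustrative LRU worker selection among matching workers.
# last_assigned maps worker id -> timestamp of its last assignment.

def pick_worker_lru(candidates, last_assigned):
    """Pick the matching worker that was assigned work longest ago."""
    return min(candidates, key=lambda w: last_assigned.get(w, 0.0))

last_assigned = {"worker-a": 100.0, "worker-b": 50.0, "worker-c": 75.0}
print(pick_worker_lru(["worker-a", "worker-b", "worker-c"], last_assigned))  # worker-b
```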

Timeout Configuration

Worker Timeout

{
  "worker_timeout_s": 5
}
Remove workers that haven't sent a keepalive within this duration. Default: 5 seconds

Client Action Timeout

{
  "client_action_timeout_s": 60
}
Mark actions as failed if the client doesn't update them within this duration. Default: 60 seconds

Max Action Executing Timeout

{
  "max_action_executing_timeout_s": 300
}
Timeout actions that execute without progress for this duration. Default: 0 (disabled)
Set to 0 to rely only on worker keepalives.

Retain Completed

{
  "retain_completed_for_s": 60
}
Keep completed actions in memory for late WaitExecution calls. Default: 60 seconds

Retry Configuration

{
  "max_job_retries": 3
}
Actions that fail with internal errors or timeouts are automatically retried up to this limit on different workers.
Actions that fail due to user errors (non-zero exit code) are NOT retried. Only infrastructure failures trigger retries.
Retryable Failures:
  • Worker disconnection
  • Internal server errors
  • Network timeouts
  • CAS upload/download failures
Non-Retryable Failures:
  • Compilation errors (exit code 1)
  • Test failures (exit code != 0)
  • Missing input files
  • Invalid action configuration
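The retry policy above reduces to a simple decision: user errors never retry, infrastructure failures retry up to max_job_retries. A hedged Python sketch (failure-kind names are illustrative, not NativeLink identifiers):

```python
# Illustrative retry decision: only infrastructure failures are retried,
# and only while the attempt count is below max_job_retries.

RETRYABLE = {
    "worker_disconnected",
    "internal_error",
    "network_timeout",
    "cas_transfer_failed",
}

def should_retry(failure_kind, exit_code, attempts, max_job_retries=3):
    if exit_code is not None and exit_code != 0:
        return False                       # user error (e.g. compile failure): never retried
    if failure_kind not in RETRYABLE:
        return False                       # not an infrastructure failure
    return attempts < max_job_retries      # retry until the limit is exhausted

print(should_retry("worker_disconnected", None, 1))  # True
print(should_retry(None, 1, 0))                      # False: compilation error
print(should_retry("network_timeout", None, 3))      # False: retries exhausted
```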

Backend Storage

Scheduler state can be persisted for high availability. The default backend stores all state in process memory:
{
  "experimental_backend": null
}
Pros: Fast, simple. Cons: State is lost on restart.

Cache Lookup Scheduler

Wraps another scheduler with Action Cache checking.
{
  "cache_lookup": {
    "ac_store": "AC_MAIN_STORE",
    "scheduler": {
      "simple": { ... }
    }
  }
}
Behavior:
  1. Check Action Cache for existing result
  2. If cache hit: Return cached result immediately
  3. If cache miss: Forward to nested scheduler for execution
  4. After execution: Store result in Action Cache
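The four-step behavior above is essentially a caching decorator around the nested scheduler. A minimal Python sketch (the real scheduler streams updates; class and method names here are illustrative):

```python
# Simplified control flow of the cache-lookup wrapper: check the Action
# Cache, execute on a miss, then store the result for future lookups.

class CacheLookupScheduler:
    def __init__(self, ac_store, inner):
        self.ac_store = ac_store   # action cache: digest -> result
        self.inner = inner         # nested scheduler exposing execute()

    def execute(self, action_digest, action):
        cached = self.ac_store.get(action_digest)
        if cached is not None:
            return cached                              # cache hit: return immediately
        result = self.inner.execute(action_digest, action)  # cache miss: run it
        self.ac_store[action_digest] = result          # store for later WaitExecution/lookups
        return result

class FakeInner:
    """Stand-in for a nested simple scheduler; counts real executions."""
    def __init__(self):
        self.calls = 0
    def execute(self, digest, action):
        self.calls += 1
        return f"result-for-{digest}"

inner = FakeInner()
sched = CacheLookupScheduler({}, inner)
sched.execute("d1", {})
sched.execute("d1", {})     # second call is served from the cache
print(inner.calls)          # 1
```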
Recommendation: Use CompletenessCheckingSpec for the ac_store to ensure cached results reference existing CAS objects.

Property Modifier Scheduler

Modifies action platform properties before forwarding to nested scheduler.
{
  "property_modifier": {
    "modifications": [
      {
        "add": {
          "name": "pool",
          "value": "production"
        }
      },
      {
        "remove": "legacy_flag"
      },
      {
        "replace": {
          "name": "cpu_arch",
          "value": "amd64",
          "new_name": "cpu_arch",
          "new_value": "x86_64"
        }
      }
    ],
    "scheduler": {
      "simple": { ... }
    }
  }
}
Modification Types:
Add a new property to all actions.
{
  "add": {
    "name": "environment",
    "value": "production"
  }
}
Use Cases:
  • Route to specific worker pools
  • Add default properties
  • Tag actions for monitoring
Modification Order: Modifications are applied in declaration order, so a later modification can override the result of an earlier one.
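In-order application can be sketched as a fold over the property map. This Python sketch follows the add/remove/replace shapes shown in the config example above; the "replace" semantics (match name and value, then substitute the new pair) are my reading of that example:

```python
# Illustrative in-order application of property modifications.

def apply_modifications(props, modifications):
    props = dict(props)  # don't mutate the caller's map
    for mod in modifications:
        if "add" in mod:
            props[mod["add"]["name"]] = mod["add"]["value"]
        elif "remove" in mod:
            props.pop(mod["remove"], None)
        elif "replace" in mod:
            r = mod["replace"]
            if props.get(r["name"]) == r["value"]:     # only replace on an exact match
                props.pop(r["name"], None)
                props[r["new_name"]] = r["new_value"]
    return props

mods = [
    {"add": {"name": "pool", "value": "production"}},
    {"remove": "legacy_flag"},
    {"replace": {"name": "cpu_arch", "value": "amd64",
                 "new_name": "cpu_arch", "new_value": "x86_64"}},
]
print(apply_modifications({"legacy_flag": "1", "cpu_arch": "amd64"}, mods))
# {'cpu_arch': 'x86_64', 'pool': 'production'}
```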

GRPC Scheduler

Forwards all requests to a remote scheduler via gRPC.
{
  "grpc": {
    "endpoint": {
      "address": "grpc://remote-scheduler.example.com:50051",
      "concurrency_limit": 100,
      "connect_timeout_s": 30,
      "tcp_keepalive_s": 30,
      "http2_keepalive_interval_s": 30,
      "http2_keepalive_timeout_s": 20
    },
    "connections_per_endpoint": 5,
    "max_concurrent_requests": 1000,
    "retry": {
      "max_retries": 6,
      "delay": 0.3,
      "jitter": 0.5
    }
  }
}
Configuration:
  • endpoint: Remote scheduler address and connection settings
  • connections_per_endpoint: TCP connection pooling
  • max_concurrent_requests: Limit in-flight requests
  • retry: Retry behavior for transient failures
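One common interpretation of a retry block like the one above is exponential backoff from the base delay with a jitter fraction; check NativeLink's configuration reference for the exact semantics. An illustrative sketch under that assumption:

```python
import random

# Illustrative exponential backoff with jitter (assumed semantics, not
# necessarily NativeLink's exact formula): delay * 2^attempt, +/- jitter.

def retry_delay(attempt, delay=0.3, jitter=0.5, rng=random.random):
    base = delay * (2 ** attempt)                    # 0.3s, 0.6s, 1.2s, ...
    return base * (1.0 + jitter * (2 * rng() - 1))   # spread by +/- jitter fraction

for attempt in range(3):
    # rng fixed at the midpoint -> zero jitter, so the bare backoff is visible
    print(round(retry_delay(attempt, rng=lambda: 0.5), 2))
# 0.3, 0.6, 1.2
```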
Use Cases:

Hybrid Deployments

Local CAS caching with a remote execution cluster. Clients upload to the local CAS, and the scheduler forwards execution to the remote cluster.

Multi-Region

Regional schedulers forward to a global scheduler. Reduces latency while maintaining a centralized worker pool.

Development

Local developer builds use a shared remote scheduler. Developers get remote execution without running a full cluster.

Federation

Multiple independent clusters with cross-cluster fallback. The primary cluster handles most work, with overflow going to the secondary.

Scheduler Composition

Schedulers can be nested to create sophisticated routing and caching strategies:
{
  "cache_lookup": {
    "ac_store": "AC_MAIN",
    "scheduler": {
      "property_modifier": {
        "modifications": [
          {
            "add": {
              "name": "cluster",
              "value": "prod-us-west"
            }
          }
        ],
        "scheduler": {
          "simple": {
            "supported_platform_properties": {
              "cpu_arch": "exact",
              "OSFamily": "exact",
              "cpu_count": "minimum"
            },
            "allocation_strategy": "least_recently_used",
            "max_job_retries": 3
          }
        }
      }
    }
  }
}
Flow:
  1. Cache Lookup: Check AC for cached result
  2. Property Modifier: Add cluster tag
  3. Simple Scheduler: Match to workers and execute

Worker Management

Worker Registration

Workers connect to the scheduler and register their capabilities:
message ConnectWorkerRequest {
  string worker_id = 1;
  repeated Platform.Property platform_properties = 2;
}
Platform Properties advertised by worker:
{
  "cpu_arch": "x86_64",
  "OSFamily": "linux",
  "cpu_count": "16",
  "memory_gb": "64",
  "pool": "production"
}

Worker Health Monitoring

Scheduler monitors worker health via:
  1. Keepalive Messages: Workers send periodic heartbeats
  2. Timeout Detection: Workers not responding within worker_timeout_s are removed
  3. Backpressure: Workers can signal they’re full (paused state)
  4. Draining: Workers can request graceful shutdown

Worker Capacity

Workers declare maximum concurrent actions:
{
  "max_inflight_tasks": 8
}
Scheduler tracks:
  • Running Actions: Currently executing
  • Available Slots: max_inflight_tasks - running_actions
  • Paused State: No available slots (backpressure)
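The bookkeeping above is a simple subtraction; a tiny sketch (illustrative names, not NativeLink's internal fields):

```python
# Illustrative worker-capacity tracking: available slots and backpressure.

def worker_state(max_inflight_tasks, running_actions):
    available = max_inflight_tasks - running_actions
    return {"available_slots": available, "paused": available <= 0}

print(worker_state(8, 3))  # {'available_slots': 5, 'paused': False}
print(worker_state(8, 8))  # {'available_slots': 0, 'paused': True}
```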

Monitoring and Debugging

Logging

Control scheduler logging verbosity:
{
  "worker_match_logging_interval_s": 10
}
  • > 0: Log worker matching events every N seconds
  • -1: Disable worker matching logs
Logs include:
  • “Worker busy” - All capable workers at capacity
  • “Can’t find any worker” - No workers match platform properties
  • “Action assigned” - Successful worker assignment

Metrics

Scheduler exposes Prometheus metrics:
  • Actions queued: Number of actions awaiting workers
  • Actions executing: Number of actions currently running
  • Actions completed: Total completed actions
  • Workers connected: Number of active workers
  • Worker timeouts: Workers removed due to timeout
  • Action retries: Number of retried actions

Tracing

OpenTelemetry traces provide visibility into:
  • Action queuing duration
  • Worker matching time
  • Execution duration
  • Result upload time

Best Practices

  1. Always use cache_lookup scheduler in production to leverage Action Cache
  2. Configure platform properties to match your worker heterogeneity
  3. Set appropriate timeouts based on expected action duration
  4. Use LRU allocation for most workloads unless you have specific caching needs
  5. Enable Redis backend for multi-scheduler deployments or HA requirements
  6. Monitor worker health and adjust worker_timeout_s for network conditions
  7. Tune max_job_retries based on infrastructure reliability

Troubleshooting

Next Steps

Workers

Configure and manage worker nodes

Remote Execution

Understand the execution flow

Architecture

See how schedulers fit in the system
