Schedulers are the orchestration layer in NativeLink that manage the lifecycle of build actions, match them to appropriate workers, and handle failures and retries. Understanding scheduler configuration is crucial for optimizing remote execution performance.

Overview

The scheduler acts as the central coordinator between build clients and worker nodes. Its responsibilities:
  • Accept execution requests from clients
  • Queue actions awaiting workers
  • Match actions to capable workers based on platform properties
  • Monitor worker health and remove dead workers
  • Handle action timeouts and retries
  • Stream execution updates to clients

Scheduler Types

NativeLink provides four scheduler implementations that can be composed for different deployment patterns.

Simple Scheduler

The core scheduler implementation that manages worker pools and action execution.
{
  "simple": {
    "supported_platform_properties": {
      "cpu_arch": "exact",
      "OSFamily": "exact",
      "cpu_count": "minimum",
      "memory_gb": "minimum"
    },
    "allocation_strategy": "least_recently_used",
    "retain_completed_for_s": 60,
    "client_action_timeout_s": 60,
    "worker_timeout_s": 5,
    "max_action_executing_timeout_s": 300,
    "max_job_retries": 3,
    "worker_match_logging_interval_s": 10
  }
}

Platform Properties

Defines how worker capabilities are matched to action requirements:
exact requires an exact string match between the action's requested value and the worker's advertised value. Example, with an action requesting cpu_arch: "arm64":
  • Worker with cpu_arch: "arm64" → matches
  • Worker with cpu_arch: "x86_64" → does not match
Use For: OS family, CPU architecture, environment type
Configuration Example:
{
  "supported_platform_properties": {
    "cpu_arch": "exact",        // Must match exactly
    "OSFamily": "exact",        // Must match exactly
    "cpu_count": "minimum",     // Worker must have >= requested
    "memory_gb": "minimum",     // Worker must have >= requested
    "pool": "priority",         // Informational only
    "optional_gpu": "ignore"    // Actions can request, workers needn't have
  }
}
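The matching semantics above can be summarized in a short sketch. This is a Python illustration, not NativeLink's actual (Rust) implementation; the function name and signature are hypothetical:

```python
# Illustrative sketch of platform-property matching semantics.
# "exact" must match the string; "minimum" must meet or exceed a numeric
# threshold; "priority" and "ignore" properties don't restrict matching.

def worker_matches(action_props, worker_props, supported):
    """Return True if a worker satisfies an action's platform properties."""
    for name, required in action_props.items():
        kind = supported.get(name)
        offered = worker_props.get(name)
        if kind == "exact":
            # Worker must advertise exactly the requested value.
            if offered != required:
                return False
        elif kind == "minimum":
            # Worker must offer at least the requested numeric amount.
            if offered is None or int(offered) < int(required):
                return False
        # "priority" / "ignore": informational only, no restriction here.
    return True

supported = {"cpu_arch": "exact", "cpu_count": "minimum", "optional_gpu": "ignore"}
action = {"cpu_arch": "arm64", "cpu_count": "8"}
print(worker_matches(action, {"cpu_arch": "arm64", "cpu_count": "16"}, supported))   # True
print(worker_matches(action, {"cpu_arch": "x86_64", "cpu_count": "16"}, supported))  # False
```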

Allocation Strategies

Determines which worker is selected when multiple workers match an action:
  • least_recently_used (the value used throughout this page): Prefer the worker that has been idle longest, spreading load evenly across the pool.
  • most_recently_used: Prefer the most recently used worker, concentrating load so that idle workers can be drained or scaled down.
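As a rough illustration (Python sketch, not the actual Rust implementation), least-recently-used selection just picks the matching worker whose last assignment is oldest:

```python
# Illustrative LRU worker selection among matching workers.
# last_assigned maps worker id -> timestamp of its last assignment.

def pick_worker_lru(candidates, last_assigned):
    """Pick the matching worker that was assigned work longest ago."""
    return min(candidates, key=lambda w: last_assigned.get(w, 0.0))

last_assigned = {"worker-a": 100.0, "worker-b": 50.0, "worker-c": 75.0}
print(pick_worker_lru(["worker-a", "worker-b", "worker-c"], last_assigned))  # worker-b
```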

Timeout Configuration

Worker Timeout

{
  "worker_timeout_s": 5
}
Remove workers that haven't sent a keepalive within this duration. Default: 5 seconds

Client Action Timeout

{
  "client_action_timeout_s": 60
}
Mark actions as failed if the client doesn't update them within this duration. Default: 60 seconds

Max Action Executing Timeout

{
  "max_action_executing_timeout_s": 300
}
Timeout actions that execute without progress for this duration. Default: 0 (disabled)
Set to 0 to rely only on worker keepalives.

Retain Completed

{
  "retain_completed_for_s": 60
}
Keep completed actions in memory for late WaitExecution calls. Default: 60 seconds

Retry Configuration

{
  "max_job_retries": 3
}
Actions that fail with internal errors or timeouts are automatically retried up to this limit on different workers.
Actions that fail due to user errors (non-zero exit code) are NOT retried. Only infrastructure failures trigger retries.
Retryable Failures:
  • Worker disconnection
  • Internal server errors
  • Network timeouts
  • CAS upload/download failures
Non-Retryable Failures:
  • Compilation errors (exit code 1)
  • Test failures (exit code != 0)
  • Missing input files
  • Invalid action configuration
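The retry policy above reduces to a simple decision: user errors never retry, infrastructure failures retry up to max_job_retries. A hedged Python sketch (failure-kind names are illustrative, not NativeLink identifiers):

```python
# Illustrative retry decision: only infrastructure failures are retried,
# and only while the attempt count is below max_job_retries.

RETRYABLE = {
    "worker_disconnected",
    "internal_error",
    "network_timeout",
    "cas_transfer_failed",
}

def should_retry(failure_kind, exit_code, attempts, max_job_retries=3):
    if exit_code is not None and exit_code != 0:
        return False                       # user error (e.g. compile failure): never retried
    if failure_kind not in RETRYABLE:
        return False                       # not an infrastructure failure
    return attempts < max_job_retries      # retry until the limit is exhausted

print(should_retry("worker_disconnected", None, 1))  # True
print(should_retry(None, 1, 0))                      # False: compilation error
print(should_retry("network_timeout", None, 3))      # False: retries exhausted
```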

Backend Storage

Scheduler state can be persisted for high availability. The default backend stores all state in process memory:
{
  "experimental_backend": null
}
Pros: Fast, simple. Cons: State is lost on restart.

Cache Lookup Scheduler

Wraps another scheduler with Action Cache checking.
{
  "cache_lookup": {
    "ac_store": "AC_MAIN_STORE",
    "scheduler": {
      "simple": { ... }
    }
  }
}
Behavior:
  1. Check Action Cache for existing result
  2. If cache hit: Return cached result immediately
  3. If cache miss: Forward to nested scheduler for execution
  4. After execution: Store result in Action Cache
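The four-step behavior above is essentially a caching decorator around the nested scheduler. A minimal Python sketch (the real scheduler streams updates; class and method names here are illustrative):

```python
# Simplified control flow of the cache-lookup wrapper: check the Action
# Cache, execute on a miss, then store the result for future lookups.

class CacheLookupScheduler:
    def __init__(self, ac_store, inner):
        self.ac_store = ac_store   # action cache: digest -> result
        self.inner = inner         # nested scheduler exposing execute()

    def execute(self, action_digest, action):
        cached = self.ac_store.get(action_digest)
        if cached is not None:
            return cached                              # cache hit: return immediately
        result = self.inner.execute(action_digest, action)  # cache miss: run it
        self.ac_store[action_digest] = result          # store for later WaitExecution/lookups
        return result

class FakeInner:
    """Stand-in for a nested simple scheduler; counts real executions."""
    def __init__(self):
        self.calls = 0
    def execute(self, digest, action):
        self.calls += 1
        return f"result-for-{digest}"

inner = FakeInner()
sched = CacheLookupScheduler({}, inner)
sched.execute("d1", {})
sched.execute("d1", {})     # second call is served from the cache
print(inner.calls)          # 1
```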
Recommendation: Use CompletenessCheckingSpec for the ac_store to ensure cached results reference existing CAS objects.

Property Modifier Scheduler

Modifies action platform properties before forwarding to nested scheduler.
{
  "property_modifier": {
    "modifications": [
      {
        "add": {
          "name": "pool",
          "value": "production"
        }
      },
      {
        "remove": "legacy_flag"
      },
      {
        "replace": {
          "name": "cpu_arch",
          "value": "amd64",
          "new_name": "cpu_arch",
          "new_value": "x86_64"
        }
      }
    ],
    "scheduler": {
      "simple": { ... }
    }
  }
}
Modification Types:
Add a new property to all actions.
{
  "add": {
    "name": "environment",
    "value": "production"
  }
}
Use Cases:
  • Route to specific worker pools
  • Add default properties
  • Tag actions for monitoring
Modification Order: Modifications are applied in declaration order, so a later modification can override the result of an earlier one.
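In-order application can be sketched as a fold over the property map. This Python sketch follows the add/remove/replace shapes shown in the config example above; the "replace" semantics (match name and value, then substitute the new pair) are my reading of that example:

```python
# Illustrative in-order application of property modifications.

def apply_modifications(props, modifications):
    props = dict(props)  # don't mutate the caller's map
    for mod in modifications:
        if "add" in mod:
            props[mod["add"]["name"]] = mod["add"]["value"]
        elif "remove" in mod:
            props.pop(mod["remove"], None)
        elif "replace" in mod:
            r = mod["replace"]
            if props.get(r["name"]) == r["value"]:     # only replace on an exact match
                props.pop(r["name"], None)
                props[r["new_name"]] = r["new_value"]
    return props

mods = [
    {"add": {"name": "pool", "value": "production"}},
    {"remove": "legacy_flag"},
    {"replace": {"name": "cpu_arch", "value": "amd64",
                 "new_name": "cpu_arch", "new_value": "x86_64"}},
]
print(apply_modifications({"legacy_flag": "1", "cpu_arch": "amd64"}, mods))
# {'cpu_arch': 'x86_64', 'pool': 'production'}
```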

GRPC Scheduler

Forwards all requests to a remote scheduler via gRPC.
{
  "grpc": {
    "endpoint": {
      "address": "grpc://remote-scheduler.example.com:50051",
      "concurrency_limit": 100,
      "connect_timeout_s": 30,
      "tcp_keepalive_s": 30,
      "http2_keepalive_interval_s": 30,
      "http2_keepalive_timeout_s": 20
    },
    "connections_per_endpoint": 5,
    "max_concurrent_requests": 1000,
    "retry": {
      "max_retries": 6,
      "delay": 0.3,
      "jitter": 0.5
    }
  }
}
Configuration:
  • endpoint: Remote scheduler address and connection settings
  • connections_per_endpoint: TCP connection pooling
  • max_concurrent_requests: Limit in-flight requests
  • retry: Retry behavior for transient failures
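One common interpretation of a retry block like the one above is exponential backoff from the base delay with a jitter fraction; check NativeLink's configuration reference for the exact semantics. An illustrative sketch under that assumption:

```python
import random

# Illustrative exponential backoff with jitter (assumed semantics, not
# necessarily NativeLink's exact formula): delay * 2^attempt, +/- jitter.

def retry_delay(attempt, delay=0.3, jitter=0.5, rng=random.random):
    base = delay * (2 ** attempt)                    # 0.3s, 0.6s, 1.2s, ...
    return base * (1.0 + jitter * (2 * rng() - 1))   # spread by +/- jitter fraction

for attempt in range(3):
    # rng fixed at the midpoint -> zero jitter, so the bare backoff is visible
    print(round(retry_delay(attempt, rng=lambda: 0.5), 2))
# 0.3, 0.6, 1.2
```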
Use Cases:

Hybrid Deployments

Local CAS caching with a remote execution cluster. Clients upload to the local CAS, and the scheduler forwards execution to the remote cluster.

Multi-Region

Regional schedulers forward to a global scheduler. Reduces latency while maintaining a centralized worker pool.

Development

Local developer builds use a shared remote scheduler. Developers get remote execution without running a full cluster.

Federation

Multiple independent clusters with cross-cluster fallback. The primary cluster handles most work, with overflow going to the secondary.

Scheduler Composition

Schedulers can be nested to create sophisticated routing and caching strategies:
{
  "cache_lookup": {
    "ac_store": "AC_MAIN",
    "scheduler": {
      "property_modifier": {
        "modifications": [
          {
            "add": {
              "name": "cluster",
              "value": "prod-us-west"
            }
          }
        ],
        "scheduler": {
          "simple": {
            "supported_platform_properties": {
              "cpu_arch": "exact",
              "OSFamily": "exact",
              "cpu_count": "minimum"
            },
            "allocation_strategy": "least_recently_used",
            "max_job_retries": 3
          }
        }
      }
    }
  }
}
Flow:
  1. Cache Lookup: Check AC for cached result
  2. Property Modifier: Add cluster tag
  3. Simple Scheduler: Match to workers and execute

Worker Management

Worker Registration

Workers connect to the scheduler and register their capabilities:
message ConnectWorkerRequest {
  string worker_id = 1;
  repeated Platform.Property platform_properties = 2;
}
Platform Properties advertised by worker:
{
  "cpu_arch": "x86_64",
  "OSFamily": "linux",
  "cpu_count": "16",
  "memory_gb": "64",
  "pool": "production"
}

Worker Health Monitoring

Scheduler monitors worker health via:
  1. Keepalive Messages: Workers send periodic heartbeats
  2. Timeout Detection: Workers not responding within worker_timeout_s are removed
  3. Backpressure: Workers can signal they’re full (paused state)
  4. Draining: Workers can request graceful shutdown

Worker Capacity

Workers declare maximum concurrent actions:
{
  "max_inflight_tasks": 8
}
Scheduler tracks:
  • Running Actions: Currently executing
  • Available Slots: max_inflight_tasks - running_actions
  • Paused State: No available slots (backpressure)
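The bookkeeping above is a simple subtraction; a tiny sketch (illustrative names, not NativeLink's internal fields):

```python
# Illustrative worker-capacity tracking: available slots and backpressure.

def worker_state(max_inflight_tasks, running_actions):
    available = max_inflight_tasks - running_actions
    return {"available_slots": available, "paused": available <= 0}

print(worker_state(8, 3))  # {'available_slots': 5, 'paused': False}
print(worker_state(8, 8))  # {'available_slots': 0, 'paused': True}
```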

Monitoring and Debugging

Logging

Control scheduler logging verbosity:
{
  "worker_match_logging_interval_s": 10
}
  • > 0: Log worker matching events every N seconds
  • -1: Disable worker matching logs
Logs include:
  • “Worker busy” - All capable workers at capacity
  • “Can’t find any worker” - No workers match platform properties
  • “Action assigned” - Successful worker assignment

Metrics

Scheduler exposes Prometheus metrics:
  • Actions queued: Number of actions awaiting workers
  • Actions executing: Number of actions currently running
  • Actions completed: Total completed actions
  • Workers connected: Number of active workers
  • Worker timeouts: Workers removed due to timeout
  • Action retries: Number of retried actions

Tracing

OpenTelemetry traces provide visibility into:
  • Action queuing duration
  • Worker matching time
  • Execution duration
  • Result upload time

Best Practices

  1. Always use cache_lookup scheduler in production to leverage Action Cache
  2. Configure platform properties to match your worker heterogeneity
  3. Set appropriate timeouts based on expected action duration
  4. Use LRU allocation for most workloads unless you have specific caching needs
  5. Enable Redis backend for multi-scheduler deployments or HA requirements
  6. Monitor worker health and adjust worker_timeout_s for network conditions
  7. Tune max_job_retries based on infrastructure reliability

Troubleshooting

Next Steps

Workers

Configure and manage worker nodes

Remote Execution

Understand the execution flow

Architecture

See how schedulers fit in the system
