Remote Execution is NativeLink’s ability to distribute build and test tasks across a pool of worker machines, enabling massive parallelization and offloading computational burden from developer workstations.

Overview

Instead of running all build tasks locally, remote execution allows build tools to:
  1. Upload inputs to a shared Content Addressable Storage (CAS)
  2. Submit actions to a scheduler
  3. Execute on workers with matching capabilities
  4. Download outputs from CAS
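Steps 1 and 4 rely on content addressing: identical bytes always produce the same digest, so a blob is uploaded and stored at most once. A minimal sketch of that idea with a toy in-memory store (cas_digest and upload are illustrative helpers, not NativeLink APIs):

```python
import hashlib

def cas_digest(blob: bytes) -> str:
    """Content-addressable key: a "hash/size" pair, as in the Remote Execution API."""
    return f"{hashlib.sha256(blob).hexdigest()}/{len(blob)}"

# Toy in-memory CAS: identical content always maps to the same key,
# so each blob is stored exactly once no matter how often it is uploaded.
cas: dict[str, bytes] = {}

def upload(blob: bytes) -> str:
    key = cas_digest(blob)
    cas.setdefault(key, blob)
    return key

source = b"int main() { return 0; }"
key = upload(source)
```

Because keys are derived from content, re-uploading the same input is a no-op and downloads can be verified against the digest.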

Benefits of Remote Execution

Massive Parallelism

Execute hundreds or thousands of tasks simultaneously across a worker pool.

Consistent Environments

All builds run in controlled, hermetic environments ensuring reproducibility.

Resource Offloading

Free up local CPU, memory, and disk for other tasks.

Faster Iteration

Large builds that take hours locally can complete in minutes.

Execution Lifecycle

1. Action Submission

The build client creates an Action protobuf message:
message Action {
  Digest command_digest = 1;        // Hash of Command proto
  Digest input_root_digest = 2;     // Root directory of inputs
  Duration timeout = 3;              // Max execution time
  bool do_not_cache = 4;            // Skip action cache
  repeated Platform.Property platform = 5;  // Execution requirements
}
Key Components:
  • Command: What to execute (program, arguments, environment)
  • Input Root: Merkle tree of all input files/directories
  • Platform Properties: Worker requirements (OS, CPU architecture, etc.)
  • Timeout: Maximum allowed execution time
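Because the Action embeds only digests, the action's own digest serves as a stable cache key: the same command, input root, and platform always hash to the same value. A sketch of that property, using canonical JSON in place of deterministic protobuf serialization (digest and action_digest are hypothetical helpers):

```python
import hashlib
import json

def digest(blob: bytes) -> str:
    return f"{hashlib.sha256(blob).hexdigest()}/{len(blob)}"

def action_digest(command: dict, input_root_digest: str, platform: dict) -> str:
    # Canonical JSON stands in for deterministic proto serialization.
    command_digest = digest(json.dumps(command, sort_keys=True).encode())
    action = {
        "command_digest": command_digest,
        "input_root_digest": input_root_digest,
        "platform": sorted(platform.items()),
    }
    return digest(json.dumps(action, sort_keys=True).encode())

a = action_digest({"arguments": ["gcc", "-c", "main.c"]}, "abc123/64",
                  {"cpu_arch": "x86_64", "OSFamily": "linux"})
b = action_digest({"arguments": ["gcc", "-c", "main.c"]}, "abc123/64",
                  {"OSFamily": "linux", "cpu_arch": "x86_64"})
```

Property order does not matter (a == b above), but any change to the command, inputs, or platform yields a new digest and therefore a cache miss.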

2. Scheduler Queueing

The scheduler receives the action and:
  1. Validates the action is well-formed
  2. Checks Action Cache (if caching enabled)
  3. Queues the action awaiting a suitable worker
  4. Monitors progress and handles timeouts
Actions are queued in the order received, but workers may pull from the queue based on availability and platform matching.

3. Worker Matching

The scheduler matches actions to workers based on platform properties. Matching rules are configured in the scheduler; under the exact rule, a worker must advertise the exact property value:
{
  "supported_platform_properties": {
    "cpu_arch": "exact",
    "OSFamily": "exact"
  }
}
Example: an action with cpu_arch: arm64 runs only on workers that advertise cpu_arch: arm64.
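The exact rule can be sketched as a predicate over property maps (worker_matches is an illustrative function, not the scheduler's actual code):

```python
def worker_matches(action_props: dict, worker_props: dict, rules: dict) -> bool:
    """A worker is eligible only if every property under an 'exact' rule
    matches the action's required value verbatim."""
    for name, value in action_props.items():
        if rules.get(name) == "exact" and worker_props.get(name) != value:
            return False
    return True

rules = {"cpu_arch": "exact", "OSFamily": "exact"}
action = {"cpu_arch": "arm64", "OSFamily": "linux"}
```

A worker advertising cpu_arch: x86_64 fails the predicate for this action; only arm64 Linux workers pass.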

4. Worker Execution

Once assigned, the worker executes the action through the following steps:
  1. Precondition Check: Run optional script to verify worker capabilities
  2. Input Download: Fetch all input files from CAS into working directory
  3. Environment Setup: Set environment variables, create output directories
  4. Command Execution: Run the command with timeout and resource monitoring
  5. Output Capture: Collect stdout, stderr, and exit code
  6. Output Upload: Hash and upload all output files to CAS
  7. Result Reporting: Send ActionResult back to scheduler
Each action executes in a clean, isolated directory:
  • No leftover files from previous actions
  • Inputs materialized from CAS
  • Outputs collected after execution
  • Directory deleted after completion
This ensures hermetic execution where actions cannot interfere with each other.
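The download-execute-cleanup cycle above can be sketched with a fresh scratch directory per action (a simplified illustration; real workers stream inputs from CAS and enforce resource limits):

```python
import os
import shutil
import subprocess
import tempfile

def execute(argv, inputs, timeout_s=60):
    """Run one action hermetically: clean directory in, nothing left behind."""
    workdir = tempfile.mkdtemp(prefix="action-")
    try:
        for rel_path, blob in inputs.items():        # materialize inputs from "CAS"
            dest = os.path.join(workdir, rel_path)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            with open(dest, "wb") as f:
                f.write(blob)
        proc = subprocess.run(argv, cwd=workdir, capture_output=True,
                              timeout=timeout_s)     # capture stdout/stderr/exit code
        return proc.returncode, proc.stdout, proc.stderr
    finally:
        shutil.rmtree(workdir)                       # directory deleted after completion

code, out, err = execute(["cat", "src/hello.txt"], {"src/hello.txt": b"hello\n"})
```

The finally block is what makes execution hermetic here: the working directory is deleted whether the command succeeds, fails, or times out.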

5. Result Collection

The scheduler receives the ActionResult:
message ActionResult {
  repeated OutputFile output_files = 1;
  repeated OutputDirectory output_directories = 2;
  int32 exit_code = 3;
  Digest stdout_digest = 4;
  Digest stderr_digest = 5;
  ExecutedActionMetadata execution_metadata = 6;
}
The result is:
  • Stored in Action Cache (if caching enabled)
  • Returned to client via gRPC stream
  • Used to update operation status for WaitExecution subscribers

Worker Configuration

Workers are configured to advertise their capabilities and resource limits.

Platform Properties

Declare worker capabilities:
{
  "platform_properties": {
    "cpu_arch": "x86_64",
    "OSFamily": "linux",
    "cpu_count": "16",
    "memory_gb": "64",
    "pool": "production"
  }
}

Resource Limits

{
  "max_inflight_tasks": 8,           // Max concurrent actions
  "timeout": "1200s",                 // Default action timeout
  "upload_timeout": "600s",          // Max time to upload outputs
  "temp_path": "/tmp/nativelink"     // Scratch space
}
Set max_inflight_tasks based on CPU cores and memory to avoid resource exhaustion.
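One way to pick the value is to cap concurrency by whichever resource runs out first (a heuristic sketch; gb_per_action is an assumed per-action memory budget, not a NativeLink setting):

```python
def suggested_max_inflight(cpu_count: int, memory_gb: int,
                           gb_per_action: int = 4) -> int:
    """Cap concurrency by the scarcer resource: cores or memory."""
    return max(1, min(cpu_count, memory_gb // gb_per_action))
```

For the 16-core, 64 GB worker shown above this suggests 16 concurrent actions; a 16-core machine with only 16 GB would be memory-bound at 4.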

Precondition Scripts

Dynamically check worker readiness:
{
  "precondition_script": "/usr/local/bin/check-resources.sh"
}
Use Cases:
  • Check available disk space
  • Verify required tools are installed
  • Ensure GPU is available
  • Confirm license server connectivity
Precondition scripts run before accepting each action. If the script fails (non-zero exit), the worker rejects the action.

Scheduler Configuration

Schedulers manage the action queue and worker pool.

Simple Scheduler

The primary scheduler implementation:
{
  "simple": {
    "supported_platform_properties": {
      "cpu_arch": "exact",
      "OSFamily": "exact",
      "cpu_count": "minimum"
    },
    "allocation_strategy": "least_recently_used",
    "worker_timeout_s": 5,
    "client_action_timeout_s": 60,
    "max_job_retries": 3
  }
}
Configuration Options:

Worker Timeout

{
  "worker_timeout_s": 5
}
Remove a worker from the pool if no keepalive is received within this duration.
Workers send periodic keepalive messages to signal they’re still alive. If a worker crashes or loses network connectivity, it’s automatically removed after the timeout.
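The keepalive bookkeeping amounts to a map from worker ID to last-seen time (WorkerPool is illustrative, not NativeLink's implementation; timestamps are passed explicitly to keep the sketch deterministic):

```python
class WorkerPool:
    """Track the last keepalive per worker; evict after worker_timeout_s."""

    def __init__(self, worker_timeout_s: float = 5):
        self.worker_timeout_s = worker_timeout_s
        self.last_seen: dict[str, float] = {}

    def keepalive(self, worker_id: str, now: float) -> None:
        self.last_seen[worker_id] = now

    def evict_stale(self, now: float) -> list[str]:
        stale = [w for w, t in self.last_seen.items()
                 if now - t > self.worker_timeout_s]
        for w in stale:
            del self.last_seen[w]    # crashed or partitioned worker leaves the pool
        return stale

pool = WorkerPool(worker_timeout_s=5)
pool.keepalive("worker-a", now=0)
pool.keepalive("worker-b", now=3)
evicted = pool.evict_stale(now=6)    # worker-a missed its keepalive window
```

Actions assigned to an evicted worker become eligible for retry on another worker.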

Action Timeout

{
  "client_action_timeout_s": 60,
  "max_action_executing_timeout_s": 300
}
  • client_action_timeout_s: Mark the action failed if the client sends no update within this window (for multi-stage operations)
  • max_action_executing_timeout_s: Maximum time an action may execute without sending progress updates

Retry Logic

{
  "max_job_retries": 3
}
If an action fails with internal errors or timeouts, retry up to this many times on different workers.
Actions that fail due to user errors (non-zero exit code) are not retried. Only infrastructure failures trigger retries.
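The distinction can be sketched as a retry loop that catches only infrastructure errors (InfraError and run_with_retries are illustrative, not NativeLink types):

```python
class InfraError(Exception):
    """Worker crash, network loss, or other infrastructure failure."""

def run_with_retries(execute, max_job_retries: int = 3) -> int:
    """Retry infrastructure failures only; a non-zero exit code is a user
    error and is returned as-is, never retried."""
    attempts = 0
    while True:
        try:
            return execute()              # returns the action's exit code
        except InfraError:
            attempts += 1
            if attempts > max_job_retries:
                raise

calls = 0
def flaky() -> int:
    global calls
    calls += 1
    if calls < 3:
        raise InfraError("worker disconnected")
    return 1                              # e.g. a compiler error: reported, not retried

result = run_with_retries(flaky)
```

Here the action survives two simulated worker failures, then its non-zero exit code is handed straight back to the client.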

Advanced Features

Cache Lookup Integration

Wrap the scheduler with a cache lookup layer:
{
  "cache_lookup": {
    "ac_store": "AC_STORE",
    "scheduler": {
      "simple": { ... }
    }
  }
}
Actions are first checked against the Action Cache before being queued for execution.
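The wrapper's behavior amounts to a lookup before enqueueing (submit is a sketch of the idea, not the cache_lookup implementation):

```python
def submit(action_digest: str, action_cache: dict, queue: list):
    """Cache-lookup wrapper: serve a cached ActionResult when possible,
    otherwise enqueue the action for execution."""
    if action_digest in action_cache:
        return action_cache[action_digest]   # hit: skip execution entirely
    queue.append(action_digest)              # miss: fall through to the scheduler
    return None

action_cache = {"abc123/42": {"exit_code": 0}}
queue: list[str] = []
hit = submit("abc123/42", action_cache, queue)
miss = submit("def456/7", action_cache, queue)
```

A hit returns immediately without touching the worker pool; only misses consume execution capacity.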

Property Modification

Modify action properties before execution:
{
  "property_modifier": {
    "modifications": [
      {
        "add": {
          "name": "pool",
          "value": "experimental"
        }
      },
      {
        "remove": "legacy_flag"
      }
    ],
    "scheduler": { ... }
  }
}
Use Cases:
  • Route actions to specific worker pools
  • Add default properties
  • Remove incompatible properties
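The add/remove rules from the config above can be sketched as a simple transform over the action's property map (apply_modifications is illustrative):

```python
def apply_modifications(properties: dict, modifications: list) -> dict:
    """Apply the add/remove rules of a property_modifier config."""
    props = dict(properties)                 # leave the original untouched
    for mod in modifications:
        if "add" in mod:
            props[mod["add"]["name"]] = mod["add"]["value"]
        elif "remove" in mod:
            props.pop(mod["remove"], None)   # removing a missing key is a no-op
    return props

modifications = [
    {"add": {"name": "pool", "value": "experimental"}},
    {"remove": "legacy_flag"},
]
result = apply_modifications({"cpu_arch": "x86_64", "legacy_flag": "1"},
                             modifications)
```

The modified property map is what the wrapped scheduler sees, so the added pool property routes the action to the experimental worker pool.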

Multi-Scheduler Federation

Forward actions to remote schedulers:
{
  "grpc": {
    "endpoint": {
      "address": "grpc://remote-scheduler:50051"
    }
  }
}
Scenario: Local CAS cache with remote execution cluster.

Monitoring Execution

Clients can monitor execution progress:

WaitExecution

rpc WaitExecution(WaitExecutionRequest) returns (stream Operation);
Streams operation state updates:
  1. Queued: Action accepted, waiting for worker
  2. Executing: Worker is running the action
  3. Completed: Action finished (success or failure)

Operation Metadata

message ExecuteOperationMetadata {
  ActionState stage = 1;
  Digest action_digest = 2;
  string worker_id = 3;
  google.protobuf.Timestamp queued_timestamp = 4;
  google.protobuf.Timestamp worker_start_timestamp = 5;
}
Provides visibility into:
  • Current execution stage
  • Assigned worker ID
  • Queue and execution timestamps

Performance Optimization

Input Minimization

Include only necessary inputs in input_root_digest:
  • Reduces upload/download time
  • Decreases storage usage
  • Improves cache hit rates

Output Locality

Use output paths to avoid downloading unnecessary outputs:
message Command {
  repeated string output_files = 2;
  repeated string output_directories = 3;
}
Only specified outputs are uploaded to CAS.

Worker Affinity

With most_recently_used (MRU) allocation, actions repeatedly scheduled on the same worker can reuse:
  • Downloaded inputs (if still in local cache)
  • Compiled headers and intermediate files

Next Steps

Schedulers

Configure scheduler behavior

Workers

Set up and manage worker nodes

Stores

Optimize CAS storage backends
