Workers are the execution engines of NativeLink that run build and test actions submitted by clients through the scheduler. They download inputs from CAS, execute commands in isolated environments, and upload outputs back to CAS.

Overview

Workers form a pool of computational resources that:
  • Connect to the scheduler and advertise their capabilities
  • Receive action assignments based on platform property matching
  • Download input files from Content Addressable Storage (CAS)
  • Execute commands in clean, isolated working directories
  • Upload output artifacts back to CAS
  • Report execution results to the scheduler

Worker Lifecycle

1. Connection and Registration

When a worker starts, it connects to the scheduler and registers itself:
message ConnectWorkerRequest {
  string worker_id = 1;  // Unique identifier for this worker
  repeated Platform.Property platform_properties = 2;
}
Platform Properties advertise the worker’s capabilities:
{
  "cpu_arch": "x86_64",
  "OSFamily": "linux",
  "cpu_count": "16",
  "memory_gb": "64",
  "disk_gb": "500",
  "pool": "production",
  "container-image": "docker://ubuntu:22.04"
}
Platform properties must match the scheduler’s supported_platform_properties configuration for the worker to receive matching actions.
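Conceptually, the scheduler only routes an action to a worker whose advertised properties satisfy the action's requirements. A minimal sketch of that matching rule in Python (the function name and exact semantics are illustrative, not NativeLink's implementation):

```python
def properties_match(worker_props: dict, action_props: dict) -> bool:
    """An action matches a worker when every property the action
    requires is advertised by the worker with the same value."""
    return all(worker_props.get(k) == v for k, v in action_props.items())

worker = {"cpu_arch": "x86_64", "OSFamily": "linux", "cpu_count": "16"}
assert properties_match(worker, {"cpu_arch": "x86_64"})       # routed here
assert not properties_match(worker, {"cpu_arch": "aarch64"})  # skipped
```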

2. Receiving Actions

The scheduler assigns actions to workers via the bidirectional stream:
message UpdateForWorker {
  oneof update {
    StartExecute start_action = 1;
    GoingAwayRequest going_away = 2;
  }
}
StartExecute contains:
  • operation_id: Unique identifier for this execution
  • action_digest: Hash of the Action proto
  • action_info: Expanded action details (command, inputs, timeout)

3. Action Execution

The Running Actions Manager handles concurrent action execution:

Precondition Checks

Before accepting an action, workers can run a precondition script:
{
  "precondition_script": "/usr/local/bin/check-resources.sh"
}
Purpose:
  • Verify sufficient disk space
  • Check required tools are installed
  • Confirm GPU availability
  • Validate license server connectivity
Behavior:
  • Exit 0: Accept the action
  • Non-zero exit: Reject the action (worker signals backpressure)
Precondition scripts run before each action. Keep them fast (< 1 second) to avoid delaying execution.
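As an example, a disk-space precondition check might look like the following hypothetical script (the path and 10 GiB threshold are illustrative; any language that sets an exit code works):

```python
import shutil

def has_free_space(path: str, min_free_bytes: int) -> bool:
    """Return True when path's filesystem has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes

# A real precondition script would end with:
#   sys.exit(0 if has_free_space(work_dir, threshold) else 1)
# since exit 0 accepts the action and any non-zero exit rejects it.
ok = has_free_space("/tmp", 10 * 1024**3)  # 10 GiB threshold, illustrative
```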

Working Directory Setup

Each action executes in a clean, isolated directory:
  1. Create temporary working directory (e.g., /tmp/nativelink/<operation_id>)
  2. Download input root from CAS
  3. Materialize directory tree structure
  4. Set environment variables
  5. Execute command
  6. Capture stdout/stderr and exit code
  7. Upload outputs to CAS
  8. Delete working directory
This ensures hermetic execution: actions cannot interfere with each other or be affected by previous executions.
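The steps above can be sketched as a single run loop. Everything here (function names, the placement of CAS download/upload) is a simplified illustration of the lifecycle, not NativeLink's actual Rust implementation:

```python
import shutil
import subprocess
from pathlib import Path

def run_action(arguments, env, operation_id, base="/tmp/nativelink"):
    """Hermetic execution sketch: fresh directory in, outputs out,
    and the directory is deleted regardless of success or failure."""
    work = Path(base) / operation_id
    work.mkdir(parents=True, exist_ok=False)   # clean, isolated directory
    try:
        # (A real worker downloads the input root from CAS here.)
        proc = subprocess.run(arguments, cwd=work, env=env,
                              capture_output=True, text=True)
        # (A real worker uploads declared outputs to CAS here.)
        return proc.returncode, proc.stdout, proc.stderr
    finally:
        shutil.rmtree(work, ignore_errors=True)  # never reuse state
```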

Command Execution

The worker executes the command specified in the Action:
message Command {
  repeated string arguments = 1;              // e.g., ["gcc", "-c", "main.c"]
  repeated EnvironmentVariable env = 2;       // Environment variables
  repeated string output_files = 3;           // Expected output files
  repeated string output_directories = 4;     // Expected output directories
  Platform platform = 5;                      // Platform properties
  string working_directory = 6;               // Working directory (relative to root)
}
Execution:
  • Spawns process with specified arguments
  • Sets environment variables
  • Captures stdout and stderr
  • Enforces timeout
  • Monitors for completion
Timeouts:
  • Action timeout: Specified in Action proto or worker default
  • Upload timeout: Maximum time to upload outputs
If an action exceeds its timeout, the process is killed and the action is marked as failed.
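Timeout enforcement maps onto a kill-on-deadline pattern. A minimal Python illustration of the behavior (NativeLink itself is implemented in Rust, so this only demonstrates the semantics):

```python
import subprocess

def execute_with_timeout(arguments, timeout_s):
    """Run a command; if it exceeds the deadline, the process is
    killed and the action is reported as timed out."""
    try:
        proc = subprocess.run(arguments, capture_output=True,
                              text=True, timeout=timeout_s)
        return {"exit_code": proc.returncode, "timed_out": False}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return {"exit_code": None, "timed_out": True}
```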

Output Collection

After successful execution:
  1. Identify output files/directories specified in Command
  2. Hash each output file (compute digest)
  3. Upload outputs to CAS
  4. Create ActionResult proto with output digests
  5. Upload stdout/stderr to CAS
  6. Report result to scheduler
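The digest computation in step 2 is a hash-plus-size pair over the file bytes. A sketch using SHA-256 (a common REAPI digest function; the dictionary field names here are simplified stand-ins for the proto fields):

```python
import hashlib
from pathlib import Path

def file_digest(path: str) -> dict:
    """REAPI-style digest: hash of the content plus its size in bytes."""
    data = Path(path).read_bytes()
    return {"hash": hashlib.sha256(data).hexdigest(), "size_bytes": len(data)}
```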

4. Keepalive and Health

Workers send periodic keepalive messages:
message KeepAliveRequest {
  string worker_id = 1;
}
Purpose:
  • Signal the worker is still alive
  • Prevent scheduler from timing out the worker
  • Update last-seen timestamp
Frequency: Every few seconds (configurable)
If the scheduler doesn’t receive a keepalive within worker_timeout_s, the worker is removed from the pool.
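The scheduler-side liveness rule described above reduces to a last-seen timestamp comparison; a sketch (names and the 30-second value are illustrative):

```python
WORKER_TIMEOUT_S = 30.0  # illustrative value for worker_timeout_s

def is_alive(last_seen: float, now: float,
             timeout_s: float = WORKER_TIMEOUT_S) -> bool:
    """A worker stays in the pool while its most recent keepalive
    arrived within timeout_s seconds."""
    return (now - last_seen) <= timeout_s
```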

5. Graceful Shutdown

Workers can gracefully drain:
  1. Worker receives shutdown signal (SIGTERM)
  2. Worker sends GoingAway to scheduler
  3. Scheduler stops assigning new actions
  4. Worker completes running actions
  5. Worker disconnects
message GoingAwayRequest {
  string worker_id = 1;
}
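The drain sequence can be modeled as a small state machine. This hypothetical class (not NativeLink's API) shows why new assignments are refused while in-flight actions run to completion:

```python
class DrainingWorker:
    """Sketch of graceful shutdown: refuse new actions, finish running ones."""

    def __init__(self):
        self.draining = False
        self.running = set()

    def start_action(self, operation_id: str) -> bool:
        if self.draining:          # after GoingAway, reject new assignments
            return False
        self.running.add(operation_id)
        return True

    def going_away(self):
        self.draining = True       # corresponds to sending GoingAwayRequest

    def finish_action(self, operation_id: str):
        self.running.discard(operation_id)

    def can_disconnect(self) -> bool:
        # Disconnect only once draining and all in-flight actions are done.
        return self.draining and not self.running
```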

Worker Configuration

Basic Configuration

{
  "worker_api_endpoint": {
    "address": "grpc://scheduler.example.com:50051"
  },
  "cas_stores": {
    "CAS_MAIN": {
      "grpc": {
        "instance_name": "main",
        "endpoints": [{
          "address": "grpc://cas.example.com:50051"
        }],
        "store_type": "cas"
      }
    }
  },
  "platform_properties": {
    "cpu_arch": "x86_64",
    "OSFamily": "linux",
    "cpu_count": "16",
    "memory_gb": "64"
  },
  "max_inflight_tasks": 8,
  "timeout": "1200s",
  "upload_timeout": "600s",
  "work_directory": "/tmp/nativelink"
}

Configuration Options

Advanced Features

Multi-Worker CAS

Workers can share a local CAS to reduce redundant downloads:
{
  "cas_stores": {
    "SHARED_LOCAL_CAS": {
      "fast_slow": {
        "fast": {
          "filesystem": {
            "content_path": "/shared/nativelink/cas",
            "temp_path": "/shared/nativelink/tmp",
            "eviction_policy": { "max_bytes": "200gb" }
          }
        },
        "slow": {
          "grpc": { ... }
        }
      }
    }
  }
}
Benefits:
  • Multiple workers on the same machine share cached inputs
  • Reduces network traffic to remote CAS
  • Faster action startup (inputs already local)

Directory Caching

Workers maintain a cache of downloaded directory trees to avoid re-downloading:
  • Digest-based cache: Directories indexed by digest
  • LRU eviction: Old directories removed when cache is full
  • Atomic updates: Directories fully downloaded before use
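A digest-indexed LRU cache like the one described above can be sketched with an ordered map (illustrative only; the real worker caches on-disk directory trees rather than in-memory entries):

```python
from collections import OrderedDict

class DirectoryCache:
    """Directories keyed by digest; the least-recently-used entry is
    evicted once the configured capacity is exceeded."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.entries: OrderedDict[str, str] = OrderedDict()  # digest -> path

    def get(self, digest: str):
        if digest in self.entries:
            self.entries.move_to_end(digest)   # mark as recently used
            return self.entries[digest]
        return None                            # cache miss: download from CAS

    def put(self, digest: str, path: str):
        # "Atomic" from the cache's view: only a fully materialized
        # directory tree is ever inserted.
        self.entries[digest] = path
        self.entries.move_to_end(digest)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)   # evict least recently used
```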

Resource Monitoring

Workers can monitor resource usage and reject actions when resources are constrained. Metrics:
  • CPU usage
  • Memory usage
  • Disk space
  • Network I/O
Integration: Use precondition scripts to check resources before accepting actions.

Running Workers

Standalone Worker

nativelink \
  --config worker-config.json \
  --worker

Worker Pool (Multiple Workers on One Machine)

# Start 4 workers with different IDs
for i in {1..4}; do
  nativelink \
    --config worker-config.json \
    --worker \
    --worker-id "worker-$HOSTNAME-$i" &
done

Systemd Service

[Unit]
Description=NativeLink Worker
After=network.target

[Service]
Type=simple
User=nativelink
Group=nativelink
ExecStart=/usr/local/bin/nativelink /etc/nativelink/worker.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Docker Container

FROM ghcr.io/tracemachina/nativelink:latest

COPY worker-config.json /config.json

CMD ["nativelink", "/config.json"]

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: nativelink-worker
  template:
    metadata:
      labels:
        app: nativelink-worker
    spec:
      containers:
      - name: worker
        image: ghcr.io/tracemachina/nativelink:latest
        args: ["/config/worker.json"]
        volumeMounts:
        - name: config
          mountPath: /config
        - name: cache
          mountPath: /var/cache/nativelink
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
          limits:
            cpu: "16"
            memory: "64Gi"
      volumes:
      - name: config
        configMap:
          name: worker-config
      - name: cache
        emptyDir:
          sizeLimit: 100Gi

Monitoring Workers

Metrics

Workers expose Prometheus metrics:
  • Actions completed: Total actions executed
  • Actions failed: Failed action count
  • Action duration: Execution time histogram
  • Download bytes: Total input download volume
  • Upload bytes: Total output upload volume
  • Working directory size: Current disk usage

Logging

Workers log execution details:
  • Action received and started
  • Input download progress
  • Command execution (stdout/stderr)
  • Output upload progress
  • Execution result (success/failure)
  • Errors and warnings
Log Levels:
  • ERROR: Critical failures
  • WARN: Retryable issues (network errors, timeouts)
  • INFO: Action lifecycle events
  • DEBUG: Detailed execution traces
  • TRACE: Low-level protocol details

Troubleshooting

Best Practices

  1. Size worker pool based on expected workload and machine resources
  2. Use local CAS cache (filesystem or memory) to reduce network traffic
  3. Configure precondition scripts for dynamic resource checks
  4. Set appropriate timeouts based on typical action duration
  5. Monitor metrics to track worker health and performance
  6. Use graceful shutdown to avoid killing in-progress actions
  7. Allocate sufficient disk space for working directories
  8. Run workers on fast storage (SSDs) for better I/O performance

Next Steps

Schedulers

Configure scheduler to manage workers

Remote Execution

Understand the execution flow

Stores

Optimize CAS configuration
