Workers are the execution engines of NativeLink that run build and test actions submitted by clients through the scheduler. They download inputs from CAS, execute commands in isolated environments, and upload outputs back to CAS.

Overview

Workers form a pool of computational resources that:
  • Connect to the scheduler and advertise their capabilities
  • Receive action assignments based on platform property matching
  • Download input files from Content Addressable Storage (CAS)
  • Execute commands in clean, isolated working directories
  • Upload output artifacts back to CAS
  • Report execution results to the scheduler

Worker Lifecycle

1. Connection and Registration

When a worker starts, it connects to the scheduler and registers itself:
message ConnectWorkerRequest {
  string worker_id = 1;  // Unique identifier for this worker
  repeated Platform.Property platform_properties = 2;
}
Platform Properties advertise the worker’s capabilities:
{
  "cpu_arch": "x86_64",
  "OSFamily": "linux",
  "cpu_count": "16",
  "memory_gb": "64",
  "disk_gb": "500",
  "pool": "production",
  "container-image": "docker://ubuntu:22.04"
}
Platform properties must match the scheduler’s supported_platform_properties configuration for the worker to receive matching actions.
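Conceptually, the scheduler only routes an action to a worker whose advertised properties satisfy the action's requirements. A minimal sketch of that matching rule in Python (the function name and exact semantics are illustrative, not NativeLink's implementation):

```python
def properties_match(worker_props: dict, action_props: dict) -> bool:
    """An action matches a worker when every property the action
    requires is advertised by the worker with the same value."""
    return all(worker_props.get(k) == v for k, v in action_props.items())

worker = {"cpu_arch": "x86_64", "OSFamily": "linux", "cpu_count": "16"}
assert properties_match(worker, {"cpu_arch": "x86_64"})       # routed here
assert not properties_match(worker, {"cpu_arch": "aarch64"})  # skipped
```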

2. Receiving Actions

The scheduler assigns actions to workers via the bidirectional stream:
message UpdateForWorker {
  oneof update {
    StartExecute start_action = 1;
    GoingAwayRequest going_away = 2;
  }
}
StartExecute contains:
  • operation_id: Unique identifier for this execution
  • action_digest: Hash of the Action proto
  • action_info: Expanded action details (command, inputs, timeout)

3. Action Execution

The Running Actions Manager handles concurrent action execution:

Precondition Checks

Before accepting an action, workers can run a precondition script:
{
  "precondition_script": "/usr/local/bin/check-resources.sh"
}
Purpose:
  • Verify sufficient disk space
  • Check required tools are installed
  • Confirm GPU availability
  • Validate license server connectivity
Behavior:
  • Exit 0: Accept the action
  • Non-zero exit: Reject the action (worker signals backpressure)
Precondition scripts run before each action. Keep them fast (< 1 second) to avoid delaying execution.
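As an example, a disk-space precondition check might look like the following hypothetical script (the path and 10 GiB threshold are illustrative; any language that sets an exit code works):

```python
import shutil

def has_free_space(path: str, min_free_bytes: int) -> bool:
    """Return True when path's filesystem has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes

# A real precondition script would end with:
#   sys.exit(0 if has_free_space(work_dir, threshold) else 1)
# since exit 0 accepts the action and any non-zero exit rejects it.
ok = has_free_space("/tmp", 10 * 1024**3)  # 10 GiB threshold, illustrative
```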

Working Directory Setup

Each action executes in a clean, isolated directory:
  1. Create temporary working directory (e.g., /tmp/nativelink/<operation_id>)
  2. Download input root from CAS
  3. Materialize directory tree structure
  4. Set environment variables
  5. Execute command
  6. Capture stdout/stderr and exit code
  7. Upload outputs to CAS
  8. Delete working directory
This ensures hermetic execution: actions cannot interfere with each other or be affected by previous executions.
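The steps above can be sketched as a single run loop. Everything here (function names, the placement of CAS download/upload) is a simplified illustration of the lifecycle, not NativeLink's actual Rust implementation:

```python
import shutil
import subprocess
from pathlib import Path

def run_action(arguments, env, operation_id, base="/tmp/nativelink"):
    """Hermetic execution sketch: fresh directory in, outputs out,
    and the directory is deleted regardless of success or failure."""
    work = Path(base) / operation_id
    work.mkdir(parents=True, exist_ok=False)   # clean, isolated directory
    try:
        # (A real worker downloads the input root from CAS here.)
        proc = subprocess.run(arguments, cwd=work, env=env,
                              capture_output=True, text=True)
        # (A real worker uploads declared outputs to CAS here.)
        return proc.returncode, proc.stdout, proc.stderr
    finally:
        shutil.rmtree(work, ignore_errors=True)  # never reuse state
```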

Command Execution

The worker executes the command specified in the Action:
message Command {
  repeated string arguments = 1;              // e.g., ["gcc", "-c", "main.c"]
  repeated EnvironmentVariable env = 2;       // Environment variables
  repeated string output_files = 3;           // Expected output files
  repeated string output_directories = 4;     // Expected output directories
  Platform platform = 5;                      // Platform properties
  string working_directory = 6;               // Working directory (relative to root)
}
Execution:
  • Spawns process with specified arguments
  • Sets environment variables
  • Captures stdout and stderr
  • Enforces timeout
  • Monitors for completion
Timeouts:
  • Action timeout: Specified in Action proto or worker default
  • Upload timeout: Maximum time to upload outputs
If an action exceeds its timeout, the process is killed and the action is marked as failed.
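Timeout enforcement maps onto a kill-on-deadline pattern. A minimal Python illustration of the behavior (NativeLink itself is implemented in Rust, so this only demonstrates the semantics):

```python
import subprocess

def execute_with_timeout(arguments, timeout_s):
    """Run a command; if it exceeds the deadline, the process is
    killed and the action is reported as timed out."""
    try:
        proc = subprocess.run(arguments, capture_output=True,
                              text=True, timeout=timeout_s)
        return {"exit_code": proc.returncode, "timed_out": False}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return {"exit_code": None, "timed_out": True}
```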

Output Collection

After successful execution:
  1. Identify output files/directories specified in Command
  2. Hash each output file (compute digest)
  3. Upload outputs to CAS
  4. Create ActionResult proto with output digests
  5. Upload stdout/stderr to CAS
  6. Report result to scheduler
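The digest computation in step 2 is a hash-plus-size pair over the file bytes. A sketch using SHA-256 (a common REAPI digest function; the dictionary field names here are simplified stand-ins for the proto fields):

```python
import hashlib
from pathlib import Path

def file_digest(path: str) -> dict:
    """REAPI-style digest: hash of the content plus its size in bytes."""
    data = Path(path).read_bytes()
    return {"hash": hashlib.sha256(data).hexdigest(), "size_bytes": len(data)}
```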

4. Keepalive and Health

Workers send periodic keepalive messages:
message KeepAliveRequest {
  string worker_id = 1;
}
Purpose:
  • Signal the worker is still alive
  • Prevent scheduler from timing out the worker
  • Update last-seen timestamp
Frequency: Every few seconds (configurable)
If the scheduler doesn’t receive a keepalive within worker_timeout_s, the worker is removed from the pool.
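The scheduler-side liveness rule described above reduces to a last-seen timestamp comparison; a sketch (names and the 30-second value are illustrative):

```python
WORKER_TIMEOUT_S = 30.0  # illustrative value for worker_timeout_s

def is_alive(last_seen: float, now: float,
             timeout_s: float = WORKER_TIMEOUT_S) -> bool:
    """A worker stays in the pool while its most recent keepalive
    arrived within timeout_s seconds."""
    return (now - last_seen) <= timeout_s
```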

5. Graceful Shutdown

Workers can gracefully drain:
  1. Worker receives shutdown signal (SIGTERM)
  2. Worker sends GoingAway to scheduler
  3. Scheduler stops assigning new actions
  4. Worker completes running actions
  5. Worker disconnects
message GoingAwayRequest {
  string worker_id = 1;
}
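The drain sequence can be modeled as a small state machine. This hypothetical class (not NativeLink's API) shows why new assignments are refused while in-flight actions run to completion:

```python
class DrainingWorker:
    """Sketch of graceful shutdown: refuse new actions, finish running ones."""

    def __init__(self):
        self.draining = False
        self.running = set()

    def start_action(self, operation_id: str) -> bool:
        if self.draining:          # after GoingAway, reject new assignments
            return False
        self.running.add(operation_id)
        return True

    def going_away(self):
        self.draining = True       # corresponds to sending GoingAwayRequest

    def finish_action(self, operation_id: str):
        self.running.discard(operation_id)

    def can_disconnect(self) -> bool:
        # Disconnect only once draining and all in-flight actions are done.
        return self.draining and not self.running
```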

Worker Configuration

Basic Configuration

{
  "worker_api_endpoint": {
    "address": "grpc://scheduler.example.com:50051"
  },
  "cas_stores": {
    "CAS_MAIN": {
      "grpc": {
        "instance_name": "main",
        "endpoints": [{
          "address": "grpc://cas.example.com:50051"
        }],
        "store_type": "cas"
      }
    }
  },
  "platform_properties": {
    "cpu_arch": "x86_64",
    "OSFamily": "linux",
    "cpu_count": "16",
    "memory_gb": "64"
  },
  "max_inflight_tasks": 8,
  "timeout": "1200s",
  "upload_timeout": "600s",
  "work_directory": "/tmp/nativelink"
}

Configuration Options

Advanced Features

Multi-Worker CAS

Workers can share a local CAS to reduce redundant downloads:
{
  "cas_stores": {
    "SHARED_LOCAL_CAS": {
      "fast_slow": {
        "fast": {
          "filesystem": {
            "content_path": "/shared/nativelink/cas",
            "temp_path": "/shared/nativelink/tmp",
            "eviction_policy": { "max_bytes": "200gb" }
          }
        },
        "slow": {
          "grpc": { ... }
        }
      }
    }
  }
}
Benefits:
  • Multiple workers on the same machine share cached inputs
  • Reduces network traffic to remote CAS
  • Faster action startup (inputs already local)

Directory Caching

Workers maintain a cache of downloaded directory trees to avoid re-downloading:
  • Digest-based cache: Directories indexed by digest
  • LRU eviction: Old directories removed when cache is full
  • Atomic updates: Directories fully downloaded before use
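A digest-indexed LRU cache like the one described above can be sketched with an ordered map (illustrative only; the real worker caches on-disk directory trees rather than in-memory entries):

```python
from collections import OrderedDict

class DirectoryCache:
    """Directories keyed by digest; the least-recently-used entry is
    evicted once the configured capacity is exceeded."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.entries: OrderedDict[str, str] = OrderedDict()  # digest -> path

    def get(self, digest: str):
        if digest in self.entries:
            self.entries.move_to_end(digest)   # mark as recently used
            return self.entries[digest]
        return None                            # cache miss: download from CAS

    def put(self, digest: str, path: str):
        # "Atomic" from the cache's view: only a fully materialized
        # directory tree is ever inserted.
        self.entries[digest] = path
        self.entries.move_to_end(digest)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)   # evict least recently used
```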

Resource Monitoring

Workers can monitor resource usage and reject actions when resources are constrained. Metrics:
  • CPU usage
  • Memory usage
  • Disk space
  • Network I/O
Integration: Use precondition scripts to check resources before accepting actions.

Running Workers

Standalone Worker

nativelink \
  --config worker-config.json \
  --worker

Worker Pool (Multiple Workers on One Machine)

# Start 4 workers with different IDs
for i in {1..4}; do
  nativelink \
    --config worker-config.json \
    --worker \
    --worker-id "worker-$HOSTNAME-$i" &
done

Systemd Service

[Unit]
Description=NativeLink Worker
After=network.target

[Service]
Type=simple
User=nativelink
Group=nativelink
ExecStart=/usr/local/bin/nativelink /etc/nativelink/worker.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Docker Container

FROM ghcr.io/tracemachina/nativelink:latest

COPY worker-config.json /config.json

CMD ["nativelink", "/config.json"]

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: nativelink-worker
  template:
    metadata:
      labels:
        app: nativelink-worker
    spec:
      containers:
      - name: worker
        image: ghcr.io/tracemachina/nativelink:latest
        args: ["/config/worker.json"]
        volumeMounts:
        - name: config
          mountPath: /config
        - name: cache
          mountPath: /var/cache/nativelink
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
          limits:
            cpu: "16"
            memory: "64Gi"
      volumes:
      - name: config
        configMap:
          name: worker-config
      - name: cache
        emptyDir:
          sizeLimit: 100Gi

Monitoring Workers

Metrics

Workers expose Prometheus metrics:
  • Actions completed: Total actions executed
  • Actions failed: Failed action count
  • Action duration: Execution time histogram
  • Download bytes: Total input download volume
  • Upload bytes: Total output upload volume
  • Working directory size: Current disk usage

Logging

Workers log execution details:
  • Action received and started
  • Input download progress
  • Command execution (stdout/stderr)
  • Output upload progress
  • Execution result (success/failure)
  • Errors and warnings
Log Levels:
  • ERROR: Critical failures
  • WARN: Retryable issues (network errors, timeouts)
  • INFO: Action lifecycle events
  • DEBUG: Detailed execution traces
  • TRACE: Low-level protocol details

Troubleshooting

Best Practices

  1. Size worker pool based on expected workload and machine resources
  2. Use local CAS cache (filesystem or memory) to reduce network traffic
  3. Configure precondition scripts for dynamic resource checks
  4. Set appropriate timeouts based on typical action duration
  5. Monitor metrics to track worker health and performance
  6. Use graceful shutdown to avoid killing in-progress actions
  7. Allocate sufficient disk space for working directories
  8. Run workers on fast storage (SSDs) for better I/O performance

Next Steps

Schedulers

Configure scheduler to manage workers

Remote Execution

Understand the execution flow

Stores

Optimize CAS configuration
