Overview
Workers form a pool of computational resources that:
- Connect to the scheduler and advertise their capabilities
- Receive action assignments based on platform property matching
- Download input files from Content Addressable Storage (CAS)
- Execute commands in clean, isolated working directories
- Upload output artifacts back to CAS
- Report execution results to the scheduler
Worker Lifecycle
1. Connection and Registration
When a worker starts, it connects to the scheduler and registers itself. Platform properties must match the scheduler’s supported_platform_properties configuration for the worker to receive matching actions.
2. Receiving Actions
The scheduler assigns actions to workers via the bidirectional stream. Each assignment includes:
- operation_id: Unique identifier for this execution
- action_digest: Hash of the Action proto
- action_info: Expanded action details (command, inputs, timeout)
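The assignment fields above can be pictured as a small record. This is a hypothetical Python sketch for illustration only; the real messages are protobufs defined by the Remote Execution API and NativeLink's worker protocol:

```python
from dataclasses import dataclass

@dataclass
class ActionAssignment:
    """Illustrative shape of a scheduler assignment (not the real proto)."""
    operation_id: str    # unique identifier for this execution
    action_digest: str   # hash of the Action proto, e.g. "<hex>/<size>"
    command: list        # expanded command line from action_info
    timeout_s: float     # execution timeout from action_info

# Example of what a worker might receive:
assignment = ActionAssignment(
    operation_id="op-1234",
    action_digest="abc123/142",
    command=["gcc", "-c", "main.c"],
    timeout_s=600.0,
)
```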
3. Action Execution
The Running Actions Manager handles concurrent action execution.
Precondition Checks
Before accepting an action, workers can run a precondition script to:
- Verify sufficient disk space
- Check required tools are installed
- Confirm GPU availability
- Validate license server connectivity
The script’s exit code determines the outcome:
- Exit 0: Accept the action
- Non-zero exit: Reject the action (worker signals backpressure)
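A precondition script can be any executable whose exit code signals accept or reject. A minimal sketch in Python, checking free disk space (the threshold and path are arbitrary assumptions):

```python
import shutil

# Require at least 1 GiB free in the working-directory filesystem
# (threshold and path are arbitrary assumptions for this sketch).
MIN_FREE_BYTES = 1 * 1024**3

def precondition_ok(path: str = "/tmp") -> bool:
    """Return True if the worker should accept new actions."""
    return shutil.disk_usage(path).free >= MIN_FREE_BYTES

# As a standalone script the worker would end with:
#   sys.exit(0 if precondition_ok() else 1)
# Exit 0 accepts the action; non-zero rejects it (backpressure).
accept = precondition_ok("/")
```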
Working Directory Setup
Each action executes in a clean, isolated directory:
- Create temporary working directory (e.g., /tmp/nativelink/<operation_id>)
- Download input root from CAS
- Materialize directory tree structure
- Set environment variables
- Execute command
- Capture stdout/stderr and exit code
- Upload outputs to CAS
- Delete working directory
This ensures hermetic execution: actions cannot interfere with each other or be affected by previous executions.
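The setup and teardown steps above can be sketched as follows. This is a simplified illustration: a real worker materializes the input root from CAS and uploads outputs before cleanup.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

def run_in_clean_dir(operation_id: str, argv: list) -> int:
    """Execute argv in a fresh working directory, then delete it."""
    work_dir = Path(tempfile.mkdtemp(prefix=f"nativelink-{operation_id}-"))
    try:
        # In a real worker: download the input root from CAS and
        # materialize the directory tree here before executing.
        result = subprocess.run(argv, cwd=work_dir, capture_output=True)
        # In a real worker: upload declared outputs and stdout/stderr to CAS.
        return result.returncode
    finally:
        # Hermetic cleanup: the directory never survives to the next action.
        shutil.rmtree(work_dir, ignore_errors=True)

code = run_in_clean_dir("op-1", [sys.executable, "-c", "print('hello')"])
```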
Command Execution
The worker executes the command specified in the Action:
- Spawns process with specified arguments
- Sets environment variables
- Captures stdout and stderr
- Enforces timeout
- Monitors for completion
Two timeouts apply:
- Action timeout: Specified in the Action proto or the worker default
- Upload timeout: Maximum time to upload outputs
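Capturing output and enforcing the action timeout can be sketched with subprocess (illustrative only; the real worker enforces the two timeouts separately and reports details to the scheduler):

```python
import subprocess
import sys

def execute(argv: list, timeout_s: float):
    """Run argv, capture stdout/stderr, and enforce the action timeout."""
    try:
        p = subprocess.run(argv, capture_output=True, timeout=timeout_s)
        return p.returncode, p.stdout, p.stderr
    except subprocess.TimeoutExpired:
        # Timed-out actions are reported to the scheduler as failures.
        return -1, b"", b"action timed out"

rc, out, err = execute([sys.executable, "-c", "print('ok')"], timeout_s=30.0)
```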
Output Collection
After successful execution:
- Identify output files/directories specified in Command
- Hash each output file (compute digest)
- Upload outputs to CAS
- Create ActionResult proto with output digests
- Upload stdout/stderr to CAS
- Report result to scheduler
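Hashing an output file into a CAS digest can be sketched as below. In the Remote Execution API a digest pairs a content hash with the size in bytes; this sketch assumes SHA-256 and a `<hex>/<size>` string form for readability.

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Compute a CAS-style digest string: '<sha256-hex>/<size-bytes>'."""
    h = hashlib.sha256()
    size = 0
    with path.open("rb") as f:
        # Stream in chunks so large outputs don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
            size += len(chunk)
    return f"{h.hexdigest()}/{size}"

# Example: digest of a small output file
p = Path("out.txt")
p.write_bytes(b"hello\n")
digest = file_digest(p)
p.unlink()  # clean up the example file
```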
4. Keepalive and Health
Workers send periodic keepalive messages to:
- Signal the worker is still alive
- Prevent scheduler from timing out the worker
- Update last-seen timestamp
If the scheduler doesn’t receive a keepalive within worker_timeout_s, the worker is removed from the pool.
5. Graceful Shutdown
Workers can gracefully drain:
- Worker receives shutdown signal (SIGTERM)
- Worker sends GoingAway to scheduler
- Scheduler stops assigning new actions
- Worker completes running actions
- Worker disconnects
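The drain sequence can be sketched with a stop flag (simplified: a real worker also sends GoingAway over the scheduler stream and waits for in-flight actions):

```python
import signal
import threading

draining = threading.Event()

def handle_sigterm(signum, frame):
    # Stop accepting new actions; running actions finish normally.
    draining.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def accept_action() -> bool:
    """Workers reject new assignments once draining has begun."""
    return not draining.is_set()

# Simulate receiving the shutdown signal:
handle_sigterm(signal.SIGTERM, None)
```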
Worker Configuration
Basic Configuration
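NativeLink workers are configured through a JSON file. The sketch below is illustrative only: the field names and layout are assumptions, so consult the NativeLink configuration reference for the exact schema.

```json
{
  "workers": [{
    "local": {
      "worker_api_endpoint": { "uri": "grpc://scheduler:50061" },
      "work_directory": "/tmp/nativelink/work",
      "platform_properties": {
        "cpu_count": { "values": ["16"] },
        "OSFamily": { "values": ["linux"] }
      }
    }
  }]
}
```

The worker is then typically started by pointing the nativelink binary at this file, e.g. `nativelink worker.json`.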
Configuration Options
Worker Settings
Advanced Features
Multi-Worker CAS
Workers can share a local CAS to reduce redundant downloads:
- Multiple workers on the same machine share cached inputs
- Reduces network traffic to remote CAS
- Faster action startup (inputs already local)
Directory Caching
Workers maintain a cache of downloaded directory trees to avoid re-downloading:
- Digest-based cache: Directories indexed by digest
- LRU eviction: Old directories removed when cache is full
- Atomic updates: Directories fully downloaded before use
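The caching policy above can be sketched with an ordered dict keyed by digest (illustrative of the LRU policy, not NativeLink's implementation):

```python
from collections import OrderedDict
from typing import Optional

class DirectoryCache:
    """LRU cache mapping directory digests to local paths."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, digest: str) -> Optional[str]:
        path = self._entries.get(digest)
        if path is not None:
            self._entries.move_to_end(digest)  # mark as recently used
        return path

    def put(self, digest: str, path: str) -> None:
        # Insert only after the directory is fully downloaded (atomic update).
        self._entries[digest] = path
        self._entries.move_to_end(digest)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

cache = DirectoryCache(max_entries=2)
cache.put("d1", "/cache/d1")
cache.put("d2", "/cache/d2")
cache.get("d1")              # touch d1, so d2 becomes the eviction candidate
cache.put("d3", "/cache/d3") # evicts d2
```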
Resource Monitoring
Workers can monitor resource usage and reject actions when resources are constrained. Metrics:
- CPU usage
- Memory usage
- Disk space
- Network I/O
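A coarse resource snapshot a worker might consult before accepting work could look like this (a stdlib-only sketch for POSIX systems; real workers may use richer metrics sources):

```python
import os
import shutil

def resource_snapshot(work_path: str = "/tmp") -> dict:
    """Collect coarse resource metrics used for accept/reject decisions."""
    disk = shutil.disk_usage(work_path)
    load1, _, _ = os.getloadavg()  # 1-minute load average (POSIX only)
    return {
        "disk_free_bytes": float(disk.free),
        "disk_used_fraction": disk.used / disk.total,
        "load_per_cpu": load1 / (os.cpu_count() or 1),
    }

snap = resource_snapshot("/")
```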
Running Workers
Standalone Worker
Worker Pool (Multiple Workers on One Machine)
Systemd Service
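A minimal systemd unit for a worker might look like the following; the paths, user, and nativelink invocation are assumptions to adapt to your installation:

```ini
[Unit]
Description=NativeLink worker
After=network-online.target
Wants=network-online.target

[Service]
User=nativelink
ExecStart=/usr/local/bin/nativelink /etc/nativelink/worker.json
Restart=on-failure
# Give running actions time to finish on shutdown (graceful drain)
TimeoutStopSec=300
KillSignal=SIGTERM

[Install]
WantedBy=multi-user.target
```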
Docker Container
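A containerized worker can be described with a compose file along these lines; the image name, tag, and mount paths are assumptions, so check the project's registry and documentation for current values:

```yaml
# docker-compose.yml sketch (image name and paths are assumptions)
services:
  worker:
    image: ghcr.io/tracemachina/nativelink:latest
    command: ["/etc/nativelink/worker.json"]
    volumes:
      - ./worker.json:/etc/nativelink/worker.json:ro
      - worker-work:/tmp/nativelink
    stop_grace_period: 5m   # allow in-progress actions to drain
volumes:
  worker-work:
```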
Kubernetes Deployment
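On Kubernetes, a worker pool maps naturally to a Deployment. The manifest below is a hedged sketch: the image, ConfigMap name, and invocation are assumptions to adapt:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nativelink-worker
spec:
  replicas: 4
  selector:
    matchLabels: { app: nativelink-worker }
  template:
    metadata:
      labels: { app: nativelink-worker }
    spec:
      # Allow graceful drain before SIGKILL
      terminationGracePeriodSeconds: 300
      containers:
        - name: worker
          image: ghcr.io/tracemachina/nativelink:latest  # tag is an assumption
          args: ["/etc/nativelink/worker.json"]
          volumeMounts:
            - { name: config, mountPath: /etc/nativelink }
      volumes:
        - name: config
          configMap: { name: nativelink-worker-config }
```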
Monitoring Workers
Metrics
Workers expose Prometheus metrics:
- Actions completed: Total actions executed
- Actions failed: Failed action count
- Action duration: Execution time histogram
- Download bytes: Total input download volume
- Upload bytes: Total output upload volume
- Working directory size: Current disk usage
Logging
Workers log execution details:
- Action received and started
- Input download progress
- Command execution (stdout/stderr)
- Output upload progress
- Execution result (success/failure)
- Errors and warnings
Log levels:
- ERROR: Critical failures
- WARN: Retryable issues (network errors, timeouts)
- INFO: Action lifecycle events
- DEBUG: Detailed execution traces
- TRACE: Low-level protocol details
Troubleshooting
Common Issues
Best Practices
- Size worker pool based on expected workload and machine resources
- Use local CAS cache (filesystem or memory) to reduce network traffic
- Configure precondition scripts for dynamic resource checks
- Set appropriate timeouts based on typical action duration
- Monitor metrics to track worker health and performance
- Use graceful shutdown to avoid killing in-progress actions
- Allocate sufficient disk space for working directories
- Run workers on fast storage (SSDs) for better I/O performance
Next Steps
- Schedulers: Configure the scheduler to manage workers
- Remote Execution: Understand the execution flow
- Stores: Optimize CAS configuration