This document describes the architecture and data flow for BuildBuddy’s remote execution service, which allows Bazel to execute build actions on remote worker machines instead of locally.
Architecture Diagram
Overview
BuildBuddy implements the Remote Execution API, allowing Bazel to offload build action execution to a pool of remote workers. This provides massive parallelism, consistent build environments, and efficient resource utilization.
Components Involved
Bazel Client
The build tool requesting remote execution:
- Analyzes build graph locally
- Uploads inputs to CAS
- Sends execution requests
- Downloads outputs from cache
- Monitors execution progress
Execution Service (Scheduler)
Orchestrates remote execution:
- Receives Execute requests
- Validates action requirements
- Matches actions to executors
- Manages execution queue
- Tracks action lifecycle
- Returns execution results
Executor Pool
Worker machines that run actions:
- Registers with scheduler
- Declares capabilities (platform properties)
- Receives action assignments
- Executes commands in isolation
- Uploads outputs to cache
- Reports execution status
Redis Queue
Manages task distribution:
- Queues pending actions
- Schedules actions by priority
- Ensures fair distribution
- Handles executor failures
Content Addressable Storage (CAS)
Stores action inputs and outputs:
- Input files downloaded by executors
- Output files uploaded after execution
- Digest-based addressing
- Shared across all executors
Action Cache
Stores execution results:
- Checks for cached results before execution
- Stores results after execution
- Enables build action reuse
Execution Flow
Step 1: Action Preparation
- Local Analysis:
  - Bazel analyzes build graph
  - Identifies actions to execute
  - Determines action inputs and commands
- Input Upload:
  - Bazel computes input file digests
  - Checks which inputs are missing from CAS
  - Uploads missing inputs to BuildBuddy
  - Creates input root directory structure
- Action Digest Computation:
  - Computes action hash from:
    - Command line and arguments
    - Input file digests
    - Environment variables
    - Platform properties
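The digest computation can be sketched as follows. This is a minimal illustration, not BuildBuddy's or Bazel's implementation: it assumes SHA-256 and uses canonical JSON as a stand-in for the serialized action message, and the function name `action_digest` and the sample property values are hypothetical.

```python
import hashlib
import json


def action_digest(argv, input_digests, env, platform):
    """Hash everything that determines an action's result.

    Assumption: canonical JSON stands in for the real serialized action.
    """
    payload = {
        "argv": list(argv),              # argument order is significant
        "inputs": dict(input_digests),   # path -> content digest
        "env": dict(env),
        "platform": dict(platform),
    }
    # sort_keys makes the encoding independent of dict insertion order.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


base = action_digest(
    ["gcc", "-c", "a.c"], {"a.c": "abc123"}, {"PATH": "/usr/bin"}, {"OSFamily": "linux"}
)
same = action_digest(
    ["gcc", "-c", "a.c"], {"a.c": "abc123"}, {"PATH": "/usr/bin"}, {"OSFamily": "linux"}
)
changed_input = action_digest(
    ["gcc", "-c", "a.c"], {"a.c": "def456"}, {"PATH": "/usr/bin"}, {"OSFamily": "linux"}
)
```

Identical commands, inputs, environment, and platform yield the same digest, while changing any one of them (here, the content digest of `a.c`) yields a different digest and therefore a different cache key.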
Step 2: Cache Check
- GetActionResult Request:
  - Bazel checks Action Cache first
  - Sends action digest to BuildBuddy
- Cache Hit Path:
  - If cached result exists:
    - Return ActionResult immediately
    - Bazel skips execution
    - Downloads outputs from CAS
    - Continues to next action
- Cache Miss Path:
  - If no cached result:
    - Proceed to remote execution
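The hit and miss paths above amount to a lookup keyed by the action digest. The sketch below assumes an in-memory dictionary as the Action Cache; `ActionResult`, `run_action`, and `execute_remotely` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ActionResult:
    exit_code: int
    output_digests: dict = field(default_factory=dict)  # path -> digest


def run_action(digest, action_cache, execute_remotely):
    """Return (result, cache_hit). Executes only when the cache has no entry."""
    cached = action_cache.get(digest)
    if cached is not None:
        return cached, True                 # cache hit: skip execution
    return execute_remotely(digest), False  # cache miss: run remotely


executed = []


def execute_remotely(digest):
    # Stand-in for the Execute RPC; records that execution happened.
    executed.append(digest)
    return ActionResult(exit_code=0, output_digests={"a.o": "0f0f"})


cache = {"digest-1": ActionResult(exit_code=0, output_digests={"b.o": "1e1e"})}
hit_result, was_hit = run_action("digest-1", cache, execute_remotely)
miss_result, was_miss_hit = run_action("digest-2", cache, execute_remotely)
```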
Step 3: Execute Request
- Bazel Sends Execute RPC:
  - Request identifies the action by its digest
- Scheduler Receives Request:
  - Validates action format
  - Authenticates request
  - Extracts platform requirements
  - Assigns unique task ID
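A rough sketch of the scheduler's intake steps follows. It assumes a hex-encoded SHA-256 action digest and an API-key check; `Task`, `accept_execute_request`, and the key set are hypothetical and do not reflect BuildBuddy's actual request handling.

```python
import re
import uuid
from dataclasses import dataclass

# Assumption: digests are 64 lowercase hex characters (SHA-256).
DIGEST_RE = re.compile(r"^[0-9a-f]{64}$")


@dataclass
class Task:
    task_id: str
    action_digest: str
    platform: dict


def accept_execute_request(action_digest, platform, api_key, valid_keys):
    """Validate, authenticate, and wrap a request in a uniquely identified task."""
    if not DIGEST_RE.match(action_digest):
        raise ValueError("malformed action digest")
    if api_key not in valid_keys:
        raise PermissionError("unauthenticated request")
    return Task(
        task_id=uuid.uuid4().hex,   # unique task ID
        action_digest=action_digest,
        platform=dict(platform),    # platform requirements used for matching
    )


task = accept_execute_request("a" * 64, {"OSFamily": "linux"}, "key-1", {"key-1"})
other = accept_execute_request("a" * 64, {"OSFamily": "linux"}, "key-1", {"key-1"})
```

Two requests for the same action receive different task IDs, since the ID identifies the scheduling attempt rather than the action.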
Step 4: Task Scheduling
- Queue Action:
  - Add to Redis queue
  - Priority based on:
    - User priority settings
    - Action size/complexity
    - Queue time (fairness)
- Executor Matching:
  - Find executor with matching platform
  - Check executor capacity
  - Consider executor health/performance
  - Assign task to executor
- Task Assignment:
  - Notify executor of new task
  - Executor claims task
  - Update task status to RUNNING
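Queueing and matching can be illustrated with an in-process priority queue. This sketch replaces the Redis queue with `heapq`, assumes that a lower number means higher priority, and breaks ties by arrival order for fairness; `TaskQueue`, `pick_executor`, and the executor records are hypothetical.

```python
import heapq
import itertools


class TaskQueue:
    """Priority queue; ties are broken by arrival order (FIFO)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def push(self, task_id, required_platform, priority=0):
        # The sequence number keeps ordering stable and avoids comparing dicts.
        heapq.heappush(self._heap, (priority, next(self._seq), task_id, required_platform))

    def pop(self):
        _, _, task_id, required_platform = heapq.heappop(self._heap)
        return task_id, required_platform


def pick_executor(required, executors):
    """Return the matching executor with the most free slots, or None."""
    candidates = [
        e for e in executors
        if e["free_slots"] > 0
        and all(e["platform"].get(k) == v for k, v in required.items())
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e["free_slots"])["name"]


queue = TaskQueue()
queue.push("lint", {"OSFamily": "linux"}, priority=5)
queue.push("compile", {"OSFamily": "linux"}, priority=1)
queue.push("link", {"OSFamily": "linux"}, priority=1)

executors = [
    {"name": "mac-1", "platform": {"OSFamily": "darwin"}, "free_slots": 4},
    {"name": "linux-1", "platform": {"OSFamily": "linux"}, "free_slots": 0},
    {"name": "linux-2", "platform": {"OSFamily": "linux"}, "free_slots": 2},
]

first_task, first_required = queue.pop()
assigned = pick_executor(first_required, executors)
```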
Step 5: Action Execution
- Input Preparation:
  - Executor downloads input root from CAS
  - Reconstructs directory structure
  - Downloads all input files
  - Verifies input digests
- Environment Setup:
  - Create isolated execution environment:
    - Docker container, or
    - Podman container, or
    - Firecracker VM, or
    - Bare metal with sandbox
  - Set environment variables
  - Configure working directory
- Command Execution:
  - Run the command (e.g., compiler, linker)
  - Capture stdout and stderr
  - Monitor resource usage
  - Enforce timeout
  - Record exit code
- Output Collection:
  - Identify output files
  - Compute output digests
  - Prepare ActionResult
- Output Upload:
  - Upload output files to CAS
  - Upload stdout/stderr if requested
  - Ensure all outputs uploaded before completing
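The command execution and output collection steps can be sketched with the standard library. This example deliberately omits isolation (no container or VM) and CAS upload; it only shows running a command with a timeout, capturing its streams and exit code, and digesting declared outputs. The `execute` function and its result dictionary are hypothetical.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def execute(argv, workdir, output_paths, timeout_s):
    """Run argv in workdir and return exit code, streams, and output digests."""
    try:
        proc = subprocess.run(argv, cwd=workdir, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return {"status": "DEADLINE_EXCEEDED", "exit_code": None,
                "stdout": b"", "stderr": b"", "outputs": {}}
    outputs = {}
    for rel in output_paths:
        path = Path(workdir) / rel
        if path.is_file():  # a failed command may not produce every output
            outputs[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"status": "OK", "exit_code": proc.returncode,
            "stdout": proc.stdout, "stderr": proc.stderr, "outputs": outputs}


with tempfile.TemporaryDirectory() as workdir:
    # The "compiler" here is a one-line Python script that writes one file.
    cmd = [sys.executable, "-c", "open('out.txt', 'w').write('hello')"]
    result = execute(cmd, workdir, ["out.txt"], timeout_s=30)

    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    timed_out = execute(slow, workdir, [], timeout_s=0.5)
```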
Step 6: Result Reporting
- Update Action Cache:
  - Store action digest → ActionResult mapping
  - Unless do_not_cache=true
  - Future executions of the same action hit the cache
- Send ExecuteResponse:
  - Returns the ActionResult and final status to Bazel
- Bazel Receives Result:
  - Checks exit code
  - Downloads output files from CAS
  - Continues build with outputs
  - Or reports action failure
Step 7: Output Download
- Bazel receives output file digests
- Downloads outputs from CAS
- Places files in local build directory
- Proceeds to dependent actions
Executor Management
Executor Registration
- Executor Startup:
  - Executor process starts on worker machine
  - Connects to BuildBuddy scheduler
  - Registers capabilities:
    - Platform properties (OS, arch, etc.)
    - Resource capacity (CPU, memory, disk)
    - Container/VM support
- Health Monitoring:
  - Periodic heartbeats to scheduler
  - Reports current load and availability
  - Updates capability changes
- Deregistration:
  - Graceful shutdown drains tasks
  - Notifies scheduler of unavailability
  - Scheduler reassigns pending tasks
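The registration, heartbeat, and deregistration lifecycle can be modeled as a small registry. This is a sketch under stated assumptions: the 10-second heartbeat timeout, the `ExecutorRegistry` class, and the property names are illustrative, and the clock is injectable so the example runs without waiting.

```python
import time


class ExecutorRegistry:
    """Tracks registered executors and when each was last heard from."""

    def __init__(self, heartbeat_timeout_s=10.0, clock=time.monotonic):
        self._timeout_s = heartbeat_timeout_s
        self._clock = clock
        self._executors = {}

    def register(self, executor_id, platform, capacity):
        self._executors[executor_id] = {
            "platform": dict(platform),
            "capacity": dict(capacity),
            "load": 0.0,
            "last_seen": self._clock(),
        }

    def heartbeat(self, executor_id, load):
        entry = self._executors[executor_id]  # KeyError if never registered
        entry["load"] = load
        entry["last_seen"] = self._clock()

    def deregister(self, executor_id):
        self._executors.pop(executor_id, None)

    def healthy(self):
        """IDs of executors whose last heartbeat is within the timeout."""
        now = self._clock()
        return sorted(
            eid for eid, e in self._executors.items()
            if now - e["last_seen"] <= self._timeout_s
        )


# Fake clock so the timeline below is deterministic.
now = [0.0]
registry = ExecutorRegistry(heartbeat_timeout_s=10.0, clock=lambda: now[0])
registry.register("exec-a", {"OSFamily": "linux", "Arch": "amd64"}, {"cpu": 8})
registry.register("exec-b", {"OSFamily": "linux", "Arch": "arm64"}, {"cpu": 4})

now[0] = 8.0
registry.heartbeat("exec-a", load=0.5)   # exec-b stays silent

now[0] = 15.0
healthy_now = registry.healthy()
```

At the final point in the timeline only `exec-a` has reported within the timeout, so a scheduler built on this registry would stop assigning work to `exec-b`.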
Platform Properties
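Executors advertise their capabilities as platform properties (for example OS, architecture, and container or VM support), which the scheduler matches against each action's platform requirements.
Isolation Mechanisms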
Docker Containers:
- Each action runs in fresh container
- Specified by container-image property
- Provides filesystem isolation
- Manages resource limits
Podman Containers:
- Rootless container execution
- Better security isolation
- Compatible with Docker images
Firecracker VMs:
- Lightweight microVMs
- Stronger isolation than containers
- Fast startup (sub-second)
- Used for untrusted code
Bare Metal:
- Sandboxing without containers
- Faster for trusted code
- Limited isolation
Performance Optimizations
Input Deduplication
- Content addressing eliminates duplicate uploads
- Common inputs (toolchains, SDKs) uploaded once
- Executors cache frequently used inputs locally
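Deduplication follows directly from content addressing: a blob is keyed by the hash of its bytes, so identical content is stored and uploaded once. The sketch below assumes SHA-256 and models the CAS as a dictionary; `upload_missing` is a hypothetical helper.

```python
import hashlib


def upload_missing(files, cas):
    """Upload only blobs the CAS lacks. Returns the number of blobs uploaded.

    files: path -> bytes, cas: digest -> bytes (in-memory stand-in).
    """
    # Keying by digest collapses identical files at different paths.
    by_digest = {hashlib.sha256(data).hexdigest(): data for data in files.values()}
    missing = [digest for digest in by_digest if digest not in cas]
    for digest in missing:
        cas[digest] = by_digest[digest]
    return len(missing)


cas = {}
inputs = {
    "sdk/include/a.h": b"#pragma once\n",
    "vendor/copy/a.h": b"#pragma once\n",   # same bytes, different path
    "src/main.c": b"int main(void) { return 0; }\n",
}
first_upload = upload_missing(inputs, cas)
second_upload = upload_missing(inputs, cas)
```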
Persistent Workers
For JVM-based tools (Java, Kotlin, Scala):
- Keep compiler process running between actions
- Avoid JVM startup overhead
- Warm JIT compilation
- Significant speedup for incremental builds
Local Execution Cache
Executor maintains local cache:
- Input files cached on disk
- Container images cached
- Avoids repeated CAS downloads
- LRU eviction when disk fills
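LRU eviction by total size can be sketched with an ordered dictionary. This example tracks only digests and sizes rather than real files, and the `LruFileCache` class and its byte budget are hypothetical.

```python
from collections import OrderedDict


class LruFileCache:
    """Tracks cached blobs by digest and evicts least recently used first."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # digest -> size, oldest first

    def get(self, digest):
        """Return True on a hit and mark the entry as recently used."""
        if digest not in self._entries:
            return False
        self._entries.move_to_end(digest)
        return True

    def put(self, digest, size):
        if digest in self._entries:
            self._entries.move_to_end(digest)
            return
        self._entries[digest] = size
        self.used_bytes += size
        # Evict from the oldest end until the cache fits its budget again.
        while self.used_bytes > self.max_bytes and self._entries:
            _, evicted_size = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size


cache = LruFileCache(max_bytes=100)
cache.put("toolchain", 40)
cache.put("sdk", 40)
cache.get("toolchain")      # touch: "sdk" is now the least recently used
cache.put("sources", 40)    # exceeds 100 bytes, so "sdk" is evicted
```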
Action Prioritization
- Critical path actions prioritized
- Large actions scheduled early
- Fair queuing prevents starvation
- Priority can be set per user/org
Speculative Execution
For slow actions:
- Execute same action on multiple executors
- Use result from first to complete
- Cancel redundant executions
- Reduces tail latency
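The "first result wins" pattern can be illustrated with threads standing in for executors. This is a sketch only: `speculate` and `fake_run` are hypothetical, cancellation is cooperative through a shared event, and error handling for a failing first attempt is omitted.

```python
import concurrent.futures
import threading


def speculate(run, executors):
    """Run the same action on every executor and return the first result.

    run(executor, cancelled) must return promptly once cancelled is set.
    """
    cancelled = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(executors)) as pool:
        futures = [pool.submit(run, executor, cancelled) for executor in executors]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        cancelled.set()  # tell the redundant attempts to stop
        return next(iter(done)).result()


def fake_run(executor, cancelled):
    # Simulated work: wait for the executor's delay unless cancelled first.
    delay_s = {"fast": 0.01, "slow": 5.0}[executor]
    if cancelled.wait(timeout=delay_s):
        return None  # cancelled before finishing
    return f"result from {executor}"


winner = speculate(fake_run, ["slow", "fast"])
```

The slow attempt is released as soon as the fast one completes, so the call returns in roughly the fast executor's time rather than the slow one's.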
Failure Handling
Executor Failures
- Executor Crash:
  - Heartbeat timeout detected
  - Scheduler marks executor unhealthy
  - Reschedules in-progress actions
- Network Partition:
  - Executor isolated from scheduler
  - Actions eventually time out
  - Executor re-registers on reconnect
Action Failures
- Command Failure (non-zero exit code):
  - Result returned with exit code
  - Bazel handles as normal build failure
  - Logs available for debugging
- Timeout:
  - Action exceeds timeout
  - Executor kills process
  - Returns DEADLINE_EXCEEDED error
- Resource Exhaustion:
  - Out of memory or disk space
  - Executor fails action
  - May retry on different executor
Retries
- Transient errors (network, executor failure) retried automatically
- Configurable retry limits
- Exponential backoff
- Non-transient errors (command failure) not retried
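The retry policy above can be sketched as a loop that retries only errors classified as transient and doubles the delay after each failure. The attempt limit, base delay, and the `TransientError` class are assumptions for illustration; the sleep function is injectable so the example runs instantly.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (network, lost executor)."""


def run_with_retries(attempt, max_attempts=4, base_delay_s=0.1, sleep=time.sleep):
    """Call attempt() until it succeeds or the retry limit is reached."""
    for n in range(max_attempts):
        try:
            return attempt()
        except TransientError:
            if n == max_attempts - 1:
                raise  # retry budget exhausted
            sleep(base_delay_s * (2 ** n))  # 0.1s, 0.2s, 0.4s, ...
        # Any other exception (e.g. a command failure) propagates immediately.


delays = []
calls = {"count": 0}


def flaky():
    # Fails twice with a transient error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("executor lost")
    return "ok"


outcome = run_with_retries(flaky, sleep=delays.append)
```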
Monitoring and Metrics
Execution Metrics
- Actions queued, running, completed
- Queue time (time waiting for executor)
- Execution time (time running on executor)
- Upload/download time and bytes
- Cache hit rate (action cache)
- Executor utilization
Performance Metrics
- End-to-end execution latency (p50, p95, p99)
- Input download time
- Output upload time
- Scheduler overhead
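Latency percentiles such as p50, p95, and p99 can be computed from recorded samples. The sketch below uses the nearest-rank definition, which is one of several common conventions; the `percentile` helper and the sample data are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic end-to-end latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

Tail percentiles matter more than the mean here, because a single slow action on the critical path delays every action that depends on it.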
Reliability Metrics
- Action failure rate (by type)
- Executor failure rate
- Retry rate
- Timeout rate