The Health Check service provides HTTP endpoints for monitoring NativeLink’s operational status.
## Overview

The Health service:

- Exposes an HTTP endpoint for health checks
- Reports the status of all registered components
- Returns a JSON status for each subsystem
- Supports custom timeout configuration

Key features:

- Non-gRPC HTTP endpoint (easy integration with load balancers)
- Component-level health reporting
- Configurable timeouts
- Service-unavailable status on failures
## Configuration

The health service accepts two settings:

- `path`: HTTP path for the health check endpoint
- `timeout_seconds`: Timeout for health check queries

```json5
{
  services: {
    health: {
      path: "/status",  // Default: /status
      timeout_seconds: 5
    }
  }
}
```

The default path is `/status`, as documented in the source code.
## HTTP Endpoint

### GET /status

Query the health status of all components.

Response:

- HTTP status code:
  - `200 OK`: all components healthy
  - `503 SERVICE_UNAVAILABLE`: one or more components failed
- Body: a JSON array of component health descriptions

Example response (healthy):

```json
[
  {
    "namespace": "nativelink",
    "component": "stores/MAIN_CAS",
    "status": {
      "Ok": {
        "message": "Store operational"
      }
    }
  },
  {
    "namespace": "nativelink",
    "component": "scheduler/MAIN_SCHEDULER",
    "status": {
      "Ok": {
        "message": "Scheduler running"
      }
    }
  }
]
```

Example response (unhealthy):

```json
[
  {
    "namespace": "nativelink",
    "component": "stores/S3_STORE",
    "status": {
      "Failed": {
        "message": "Connection timeout",
        "error": "Failed to connect to S3"
      }
    }
  }
]
```
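Given the response shape above, a monitoring script can identify unhealthy components directly from the JSON body rather than relying on the HTTP status code alone. A minimal sketch (the `failing_components` helper is illustrative, not part of NativeLink):

```python
import json

def failing_components(components: list) -> list:
    """Return "namespace/component" names whose status is not Ok."""
    return [
        f'{c["namespace"]}/{c["component"]}'
        for c in components
        if "Ok" not in c["status"]
    ]

# The two example responses from this page:
healthy = json.loads("""[
  {"namespace": "nativelink", "component": "stores/MAIN_CAS",
   "status": {"Ok": {"message": "Store operational"}}},
  {"namespace": "nativelink", "component": "scheduler/MAIN_SCHEDULER",
   "status": {"Ok": {"message": "Scheduler running"}}}
]""")
unhealthy = json.loads("""[
  {"namespace": "nativelink", "component": "stores/S3_STORE",
   "status": {"Failed": {"message": "Connection timeout",
                         "error": "Failed to connect to S3"}}}
]""")
```

An empty result means every component reported `Ok`; otherwise the returned names point at the failing subsystems.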
## Health Status Types

Components can report these statuses:

- **Ok**: component is healthy and operational:

  ```json
  {"Ok": {"message": "Service running"}}
  ```

- **Failed**: component has failed:

  ```json
  {
    "Failed": {
      "message": "Database unreachable",
      "error": "Connection refused"
    }
  }
  ```

- **Timeout**: health check exceeded the timeout:

  ```json
  {
    "Timeout": {
      "duration_seconds": 5.2
    }
  }
  ```
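Note that each status object has a single key naming its variant, which is consistent with serde's default externally tagged enum encoding in Rust. A one-line helper (hypothetical, for illustration) can extract the variant name:

```python
def status_variant(status: dict) -> str:
    """Return the variant name ("Ok", "Failed", or "Timeout") of a status
    object; each status is a single-key JSON object keyed by its variant."""
    return next(iter(status))
```

For example, `status_variant({"Timeout": {"duration_seconds": 5.2}})` yields `"Timeout"`.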
## Registered Components
NativeLink automatically registers health checks for:
- Stores: All configured stores (CAS, AC, etc.)
- Schedulers: Task schedulers
- Workers: Local workers (if configured)
- Backend connections: S3, Redis, database connections
Components are registered during service initialization, based on your configuration.
## Kubernetes Integration

Use the health endpoint for liveness and readiness probes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nativelink
spec:
  containers:
    - name: nativelink
      image: ghcr.io/tracemachina/nativelink:latest
      livenessProbe:
        httpGet:
          path: /status
          port: 50051
        initialDelaySeconds: 10
        periodSeconds: 30
      readinessProbe:
        httpGet:
          path: /status
          port: 50051
        initialDelaySeconds: 5
        periodSeconds: 10
```
## Load Balancer Integration

Configure health checks in your load balancer.

AWS ALB:

```yaml
HealthCheckPath: /status
HealthCheckIntervalSeconds: 30
HealthCheckTimeoutSeconds: 5
HealthyThresholdCount: 2
UnhealthyThresholdCount: 3
```

nginx:

```nginx
upstream nativelink {
  server nativelink1:50051 max_fails=3 fail_timeout=30s;
  server nativelink2:50051 max_fails=3 fail_timeout=30s;
}

server {
  location /status {
    proxy_pass http://nativelink/status;
    proxy_connect_timeout 5s;
  }
}
```
## Timeout Configuration

The `timeout_seconds` setting controls how long to wait for each component:

```json5
{
  services: {
    health: {
      timeout_seconds: 10  // Increase for slower backends
    }
  }
}
```

Components that exceed the timeout are marked with the `Timeout` status and contribute to a 503 response.
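The per-component deadline behavior can be modeled as wrapping each check in a timeout and reporting `Timeout` when the deadline is missed. This is only a sketch of the semantics described above (the component names and check functions are invented, and NativeLink's actual implementation is in Rust):

```python
import asyncio

async def check_with_timeout(name: str, check, timeout_seconds: float) -> dict:
    """Run one component's health check under a deadline; a check that
    exceeds timeout_seconds is reported with a Timeout status."""
    try:
        message = await asyncio.wait_for(check(), timeout_seconds)
        return {"component": name, "status": {"Ok": {"message": message}}}
    except asyncio.TimeoutError:
        return {"component": name,
                "status": {"Timeout": {"duration_seconds": timeout_seconds}}}

async def run_checks():
    async def fast_check():
        return "operational"

    async def slow_check():
        await asyncio.sleep(1.0)  # slower than the 0.2s deadline below
        return "operational"

    # Checks run concurrently, so one slow backend does not delay the rest.
    return await asyncio.gather(
        check_with_timeout("stores/FAST_STORE", fast_check, 0.2),
        check_with_timeout("stores/SLOW_STORE", slow_check, 0.2),
    )

results = asyncio.run(run_checks())
```

In this model the fast check reports `Ok` while the slow one reports `Timeout`, and any non-`Ok` result would drive the overall 503 response.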
## Monitoring Best Practices

- **Set appropriate timeouts.** Configure `timeout_seconds` based on your slowest backend (S3, database, etc.).
- **Monitor health check latency.** Track how long health checks take; increases indicate problems.
- **Alert on 503 responses.** Set up alerts for when the health endpoint returns `SERVICE_UNAVAILABLE`.
- **Review component failures.** Parse the JSON response to identify which specific component failed.
## Custom Health Path

Change the health check path if it conflicts with other endpoints:

```json5
{
  services: {
    health: {
      path: "/health/check"  // Custom path
    }
  }
}
```
## Implementation Details

From `nativelink-service/src/health_server.rs`:

```rust
pub struct HealthServer {
    health_registry: HealthRegistry,
    timeout: Duration,
}
```
The health server queries all registered components in parallel and aggregates their status.
## Error Handling

- `200 OK`: all components healthy, or no components registered
- `503 SERVICE_UNAVAILABLE`: one or more components failed or timed out
- `500 INTERNAL_SERVER_ERROR`: failed to serialize health status (rare)

Use the health endpoint in your CI/CD pipeline to verify deployments before routing traffic.
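A deployment gate along those lines can be sketched as a script that polls the endpoint until it returns 200 or gives up. The helper name and polling parameters are assumptions, and the demo below substitutes a local stub server for a real NativeLink deployment:

```python
import time
import urllib.error
import urllib.request

def wait_until_healthy(url: str, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll a health endpoint until it returns 200 OK, or give up."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # 503, connection refused, timeout: retry
        time.sleep(delay)
    return False

# Demo against a local stub that always reports healthy (a real pipeline
# would point at the deployed NativeLink /status endpoint instead).
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

class OkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"[]"  # empty component list: healthy
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), OkHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
ok = wait_until_healthy(f"http://127.0.0.1:{server.server_port}/status",
                        attempts=3, delay=0.1)
server.shutdown()
```

Only after the gate reports healthy should the pipeline shift traffic to the new deployment.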