The Health Check service provides HTTP endpoints for monitoring NativeLink’s operational status.

Overview

The Health service:
  • Exposes an HTTP endpoint for health checks
  • Reports status of all registered components
  • Returns JSON status for each subsystem
  • Supports custom timeout configuration
Key features:
  • Non-gRPC HTTP endpoint (easy integration with load balancers)
  • Component-level health reporting
  • Configurable timeouts
  • Service-unavailable status on failures

Configuration

path (string, default: "/status")
HTTP path for the health check endpoint

timeout_seconds (uint64, default: 5)
Timeout, in seconds, for health check queries
{
  services: {
    health: {
      path: "/status",  // Default: /status
      timeout_seconds: 5
    }
  }
}
The default path is /status, as documented in the source code.

HTTP Endpoint

GET /status

Query health status of all components. Response:
status (int)
HTTP status code:
  • 200 OK: All components healthy
  • 503 SERVICE_UNAVAILABLE: One or more components failed

body (JSON)
Array of component health descriptions
Example response (healthy):
[
  {
    "namespace": "nativelink",
    "component": "stores/MAIN_CAS",
    "status": {
      "Ok": {
        "message": "Store operational"
      }
    }
  },
  {
    "namespace": "nativelink",
    "component": "scheduler/MAIN_SCHEDULER",
    "status": {
      "Ok": {
        "message": "Scheduler running"
      }
    }
  }
]
Example response (unhealthy):
[
  {
    "namespace": "nativelink",
    "component": "stores/S3_STORE",
    "status": {
      "Failed": {
        "message": "Connection timeout",
        "error": "Failed to connect to S3"
      }
    }
  }
]
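A response like the ones above can be classified programmatically, for example in a monitoring script. The sketch below is illustrative rather than part of NativeLink: it parses the JSON body, treats any non-Ok status as unhealthy, and derives the HTTP status code you should expect alongside the failing component names:

```python
import json

def classify_health(body: str) -> tuple[int, list[str]]:
    """Parse a /status response body and return the expected HTTP
    status code plus the names of any unhealthy components."""
    components = json.loads(body)
    unhealthy = [
        c["component"]
        for c in components
        if "Ok" not in c["status"]  # any status other than Ok counts as unhealthy
    ]
    # 200 when every component is Ok (or none are registered), else 503
    return (200 if not unhealthy else 503), unhealthy

# Using the unhealthy example response from above:
body = '''[{"namespace": "nativelink",
            "component": "stores/S3_STORE",
            "status": {"Failed": {"message": "Connection timeout",
                                  "error": "Failed to connect to S3"}}}]'''
status, failed = classify_health(body)
print(status, failed)  # 503 ['stores/S3_STORE']
```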

Health Status Types

Components can report these statuses:
  • Ok: Component is healthy and operational
    {"Ok": {"message": "Service running"}}
  • Failed: Component reported a failure, with a message and error detail (see the unhealthy response example above)
  • Timeout: Component did not respond within timeout_seconds

Registered Components

NativeLink automatically registers health checks for:
  • Stores: All configured stores (CAS, AC, etc.)
  • Schedulers: Task schedulers
  • Workers: Local workers (if configured)
  • Backend connections: S3, Redis, database connections
Components are registered during service initialization based on your configuration.

Kubernetes Integration

Use the health endpoint for liveness and readiness probes:
apiVersion: v1
kind: Pod
metadata:
  name: nativelink
spec:
  containers:
  - name: nativelink
    image: ghcr.io/tracemachina/nativelink:latest
    livenessProbe:
      httpGet:
        path: /status
        port: 50051
      initialDelaySeconds: 10
      periodSeconds: 30
    readinessProbe:
      httpGet:
        path: /status
        port: 50051
      initialDelaySeconds: 5
      periodSeconds: 10

Load Balancer Integration

Configure health checks in your load balancer.

AWS ALB:
HealthCheckPath: /status
HealthCheckIntervalSeconds: 30
HealthCheckTimeoutSeconds: 5
HealthyThresholdCount: 2
UnhealthyThresholdCount: 3
nginx:
upstream nativelink {
    server nativelink1:50051 max_fails=3 fail_timeout=30s;
    server nativelink2:50051 max_fails=3 fail_timeout=30s;
}

server {
    location /status {
        proxy_pass http://nativelink/status;
        proxy_connect_timeout 5s;
    }
}

Timeout Configuration

The timeout_seconds setting controls how long to wait for each component:
{
  services: {
    health: {
      timeout_seconds: 10  // Increase for slower backends
    }
  }
}
Components that exceed the timeout are marked with a Timeout status and contribute to a 503 response.

Monitoring Best Practices

1. Set appropriate timeouts: Configure timeout_seconds based on your slowest backend (S3, database, etc.)
2. Monitor health check latency: Track how long health checks take; increases indicate problems.
3. Alert on 503 responses: Set up alerts for when the health endpoint returns SERVICE_UNAVAILABLE.
4. Review component failures: Parse the JSON response to identify which specific component failed.

Custom Health Path

Change the health check path if it conflicts with other endpoints:
{
  services: {
    health: {
      path: "/health/check"  // Custom path
    }
  }
}

Implementation Details

From nativelink-service/src/health_server.rs:
pub struct HealthServer {
    health_registry: HealthRegistry,
    timeout: Duration,
}
The health server queries all registered components in parallel and aggregates their status.
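The parallel-query-and-aggregate pattern can be sketched as follows. This is an illustration of the pattern in Python, not NativeLink's actual Rust implementation, and the {"Timeout": {}} payload shape is an assumption for demonstration: every check runs concurrently under a shared per-component timeout, and a check that exceeds it is marked as Timeout rather than blocking the others.

```python
import asyncio

async def run_checks(checks: dict, timeout: float) -> dict:
    """Query all registered checks in parallel; a check that exceeds
    `timeout` seconds is reported with a Timeout status."""
    async def guarded(name, check):
        try:
            return name, await asyncio.wait_for(check(), timeout)
        except asyncio.TimeoutError:
            # Hypothetical payload shape; the real service may differ.
            return name, {"Timeout": {}}
    results = await asyncio.gather(*(guarded(n, c) for n, c in checks.items()))
    return dict(results)

async def fast_check():
    return {"Ok": {"message": "Store operational"}}

async def slow_check():
    await asyncio.sleep(10)  # never finishes within the timeout below
    return {"Ok": {"message": "unreachable"}}

statuses = asyncio.run(run_checks(
    {"stores/MAIN_CAS": fast_check, "stores/S3_STORE": slow_check},
    timeout=0.1,
))
print(statuses)
```

Because the checks run concurrently, the slow component costs at most the timeout rather than delaying the whole response.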

Error Handling

  • 200 OK: All components healthy or no components registered
  • 503 SERVICE_UNAVAILABLE: One or more components failed or timed out
  • 500 INTERNAL_SERVER_ERROR: Failed to serialize health status (rare)
Use the health endpoint in your CI/CD pipeline to verify deployments before routing traffic.
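A deployment gate might, for instance, poll the endpoint until it returns 200 or a retry budget is exhausted. A minimal sketch, with the HTTP call injected as a callable so any client works; the retry count and delay are arbitrary assumptions:

```python
import time

def wait_healthy(fetch_status, retries: int = 10, delay: float = 3.0) -> bool:
    """Poll until fetch_status() returns HTTP 200, retrying up to
    `retries` times with `delay` seconds between attempts."""
    for attempt in range(retries):
        if fetch_status() == 200:
            return True
        if attempt < retries - 1:
            time.sleep(delay)
    return False

# Stand-in fetcher that becomes healthy on the third poll:
responses = iter([503, 503, 200])
assert wait_healthy(lambda: next(responses), retries=5, delay=0) is True
```

In real use, fetch_status could be something like `lambda: requests.get("http://nativelink:50051/status").status_code`, matching your configured path and port.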
