iii provides comprehensive observability through OpenTelemetry integration, metrics collection, distributed tracing, and structured logging.

Configuration

Configure observability in config.yaml:
modules:
  - class: modules::observability::OtelModule
    config:
      # Tracing configuration
      enabled: true
      service_name: my-service
      service_version: 1.0.0
      service_namespace: production
      
      # Exporter: otlp, memory, or both
      exporter: otlp
      endpoint: http://localhost:4317
      
      # Sampling (0.0 to 1.0)
      sampling_ratio: 1.0
      
      # Memory storage (for 'memory' or 'both' exporters)
      memory_max_spans: 1000
      
      # Metrics configuration
      metrics_enabled: true
      metrics_exporter: otlp  # or 'memory'
      metrics_retention_seconds: 3600
      metrics_max_count: 10000
      
      # Logs configuration
      logs_enabled: true
      logs_exporter: memory  # or 'otlp', 'both'
      logs_retention_seconds: 3600
      logs_max_count: 10000

OpenTelemetry Traces

Exporters

iii supports multiple trace exporters:

OTLP (Production)

Export traces to OpenTelemetry collectors (Jaeger, Grafana Tempo, etc.).
exporter: otlp
endpoint: http://localhost:4317
Collector setup (Docker Compose):
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP

Memory (Development)

Store traces in-memory for API querying.
exporter: memory
memory_max_spans: 1000
Query via REST API:
curl http://localhost:3111/api/traces/list

Both (Hybrid)

Export to OTLP and store in memory (enables trace-based triggers).
exporter: both
endpoint: http://localhost:4317
memory_max_spans: 1000

Distributed Tracing

iii automatically propagates W3C trace context across function invocations. Trace context format:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             │  │                                │                │
             │  └─ trace-id (128-bit)            │                └─ flags
             │                                   └─ parent-id (64-bit)
             └─ version
Manual trace injection:
import { init } from 'iii-sdk';

const iii = init('ws://localhost:49134');

await iii.invoke('my.function', 
  { data: 'value' },
  {
    traceparent: '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01',
    baggage: 'user_id=123,session_id=abc'
  }
);
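The traceparent header decomposes into the four fields shown above. As a hedged sketch (not part of iii-sdk; parseTraceparent is a hypothetical helper), it can be parsed with a single regular expression:

```typescript
// Parse a W3C traceparent header into its four fields.
// Hypothetical helper -- not part of iii-sdk.
interface TraceContext {
  version: string;  // 2 hex chars
  traceId: string;  // 32 hex chars (128-bit)
  parentId: string; // 16 hex chars (64-bit)
  flags: string;    // 2 hex chars (01 = sampled)
}

function parseTraceparent(header: string): TraceContext {
  const m = header.match(
    /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/
  );
  if (!m) throw new Error(`malformed traceparent: ${header}`);
  const [, version, traceId, parentId, flags] = m;
  return { version, traceId, parentId, flags };
}

const ctx = parseTraceparent(
  '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'
);
// ctx.traceId === '0af7651916cd43dd8448eb211c80319c', ctx.flags === '01'
```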

Trace API

List traces:
curl -X POST http://localhost:3111/api/traces/list \
  -H "Content-Type: application/json" \
  -d '{
    "service_name": "my-service",
    "min_duration_ms": 100,
    "limit": 50
  }'
Filter options:
interface TracesListInput {
  trace_id?: string;              // Specific trace ID
  offset?: number;                // Pagination offset
  limit?: number;                 // Pagination limit (default: 100)
  service_name?: string;          // Filter by service
  name?: string;                  // Filter by span name
  status?: string;                // Filter by status
  min_duration_ms?: number;       // Minimum duration
  max_duration_ms?: number;       // Maximum duration
  start_time?: number;            // Unix timestamp (ms)
  end_time?: number;              // Unix timestamp (ms)
  sort_by?: "duration" | "start_time" | "name";
  sort_order?: "asc" | "desc";
  attributes?: [string, string][]; // Exact attribute matches
  include_internal?: boolean;     // Include engine.* traces
}
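A filter payload can be assembled in TypeScript and posted with fetch. This is a sketch: the field names come from the interface above, but buildSlowTraceQuery and findSlowTraces are hypothetical helpers, not SDK functions:

```typescript
// Build a TracesListInput payload for the /api/traces/list endpoint.
// Hypothetical helpers; only the field names from the interface are assumed.
function buildSlowTraceQuery(service: string, minMs: number) {
  return {
    service_name: service,
    min_duration_ms: minMs,
    sort_by: 'duration' as const,
    sort_order: 'desc' as const,
    limit: 50,
  };
}

async function findSlowTraces(baseUrl: string, service: string, minMs: number) {
  const res = await fetch(`${baseUrl}/api/traces/list`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildSlowTraceQuery(service, minMs)),
  });
  return res.json();
}
```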
Get trace tree:
curl -X POST http://localhost:3111/api/traces/tree \
  -H "Content-Type: application/json" \
  -d '{"trace_id": "0af7651916cd43dd8448eb211c80319c"}'

Metrics

Metrics Exporters

Memory (Default)

Store metrics in-memory for API querying.
metrics_enabled: true
metrics_exporter: memory
metrics_retention_seconds: 3600
metrics_max_count: 10000

OTLP (Production)

Export metrics to OpenTelemetry collectors.
metrics_enabled: true
metrics_exporter: otlp
endpoint: http://localhost:4317

Prometheus Metrics

iii exposes Prometheus-compatible metrics on port 9464.
curl http://localhost:9464/metrics
Key metrics:
# Invocation metrics
iii_invocations_total{function_id="math.add"} 1234
iii_invocation_duration_seconds{function_id="math.add"} 0.042
iii_invocation_errors_total{function_id="math.add"} 5

# Worker metrics
iii_workers_active 3
iii_workers_spawns_total 10
iii_workers_deaths_total 2
iii_workers_by_status{status="connected"} 3

# Worker resource metrics
iii_worker_memory_heap_bytes{worker_id="w1"} 45678912
iii_worker_memory_rss_bytes{worker_id="w1"} 89123456
iii_worker_cpu_percent{worker_id="w1"} 12.5
iii_worker_event_loop_lag_ms{worker_id="w1"} 2.3
iii_worker_uptime_seconds{worker_id="w1"} 3600
See src/modules/observability/metrics.rs:244 for full metric definitions.

Metrics API

List metrics:
curl -X POST http://localhost:3111/api/metrics/list \
  -H "Content-Type: application/json" \
  -d '{
    "metric_name": "iii.invocations.total",
    "start_time": 1640000000000,
    "end_time": 1640100000000,
    "aggregate_interval": 300
  }'
Query options:
interface MetricsListInput {
  start_time?: number;        // Unix timestamp (ms)
  end_time?: number;          // Unix timestamp (ms)
  metric_name?: string;       // Filter by metric name
  aggregate_interval?: number; // Aggregation interval (seconds)
}

Worker Metrics

Workers automatically report resource metrics:
// Node.js SDK auto-reports these metrics
interface WorkerMetrics {
  memory_heap_used: number;    // Heap memory used (bytes)
  memory_heap_total: number;   // Total heap (bytes)
  memory_rss: number;          // Resident set size (bytes)
  memory_external: number;     // External memory (bytes)
  
  cpu_user_micros: number;     // User CPU time (μs)
  cpu_system_micros: number;   // System CPU time (μs)
  cpu_percent: number;         // Current CPU %
  
  event_loop_lag_ms: number;   // Event loop lag (ms)
  uptime_seconds: number;      // Worker uptime (s)
  
  timestamp_ms: number;        // Metric timestamp
  runtime: string;             // "node", "rust", "python"
}
Query worker metrics:
curl http://localhost:3111/api/workers/list
Response includes latest_metrics for each worker.
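Since cpu_user_micros and cpu_system_micros are cumulative counters, an instantaneous CPU percentage falls out of two successive samples. The formula below is an assumption about how the reported fields relate, not taken from the SDK source:

```typescript
// Derive CPU % from two successive worker metric samples.
// The counters are cumulative, so the rate is the CPU-time delta
// divided by the wall-clock time elapsed between samples.
interface CpuSample {
  cpu_user_micros: number;
  cpu_system_micros: number;
  timestamp_ms: number;
}

function cpuPercent(prev: CpuSample, curr: CpuSample): number {
  const cpuMicros =
    curr.cpu_user_micros - prev.cpu_user_micros +
    (curr.cpu_system_micros - prev.cpu_system_micros);
  const wallMicros = (curr.timestamp_ms - prev.timestamp_ms) * 1000;
  return wallMicros > 0 ? (100 * cpuMicros) / wallMicros : 0;
}

// 100ms of CPU time over a 1s window -> 10%
// cpuPercent({cpu_user_micros: 0, cpu_system_micros: 0, timestamp_ms: 0},
//            {cpu_user_micros: 50_000, cpu_system_micros: 50_000, timestamp_ms: 1000})
```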

Structured Logging

Log Levels

Configure via environment or YAML:
modules:
  - class: modules::observability::LoggingModule
    config:
      level: info  # trace, debug, info, warn, error
      format: json # or 'pretty'

Log Exporters

Memory

Store logs in-memory for querying.
logs_enabled: true
logs_exporter: memory
logs_retention_seconds: 3600
logs_max_count: 10000

OTLP

Export logs to OpenTelemetry collectors.
logs_enabled: true
logs_exporter: otlp
endpoint: http://localhost:4317

Logs API

List logs:
curl -X POST http://localhost:3111/api/logs/list \
  -H "Content-Type: application/json" \
  -d '{
    "level": "error",
    "limit": 100
  }'

Alerting

Configure metric-based alerts:
modules:
  - class: modules::observability::OtelModule
    config:
      alerts:
        - name: high_error_rate
          metric: iii.invocations.error
          threshold: 10
          operator: ">"
          window_seconds: 60
          cooldown_seconds: 300
          action:
            type: webhook
            url: https://hooks.slack.com/...
        
        - name: worker_memory_high
          metric: worker.memory.rss
          threshold: 1073741824  # 1GB
          operator: ">="
          window_seconds: 30
          action:
            type: function
            path: alerts.worker_memory
Alert operators:
  • > (greater than)
  • >= (greater than or equal)
  • < (less than)
  • <= (less than or equal)
  • == (equal)
  • != (not equal)
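The operator semantics above can be captured in a small evaluator. A hedged sketch, independent of the engine's actual alert implementation:

```typescript
// Evaluate an alert condition: does `value` cross `threshold`
// under the configured comparison operator?
type AlertOperator = '>' | '>=' | '<' | '<=' | '==' | '!=';

function alertFires(value: number, op: AlertOperator, threshold: number): boolean {
  switch (op) {
    case '>':  return value > threshold;
    case '>=': return value >= threshold;
    case '<':  return value < threshold;
    case '<=': return value <= threshold;
    case '==': return value === threshold;
    case '!=': return value !== threshold;
  }
}

// high_error_rate example: 12 errors in the window, threshold 10, operator '>'
// alertFires(12, '>', 10) -> true; alertFires(10, '>', 10) -> false
```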
Alert actions:
# Log alert (default)
action:
  type: log

# Webhook notification
action:
  type: webhook
  url: https://example.com/webhook

# Function invocation
action:
  type: function
  path: my.alert.handler

Advanced Sampling

Configure sampling strategies to reduce trace volume:
modules:
  - class: modules::observability::OtelModule
    config:
      sampling:
        # Default sampling ratio
        default: 0.1  # Sample 10% of traces
        
        # Per-operation sampling
        rules:
          - operation: "api.health"
            ratio: 0.01  # Sample 1% of health checks
          
          - operation: "auth.*"
            ratio: 1.0   # Sample 100% of auth operations
          
          - service: "critical-service"
            ratio: 1.0   # Sample all traces from this service
        
        # Rate limiting (traces per second)
        rate_limit: 100
See src/modules/observability/config.rs:142 for sampling configuration.
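One way to read the rules above: match the operation name against each rule in order, treating a trailing * as a prefix wildcard, and fall back to the default ratio. A hedged sketch of that lookup, not the engine's actual matcher:

```typescript
// Resolve the sampling ratio for an operation name.
// A trailing '*' in a rule is treated as a prefix wildcard.
interface SamplingRule { operation: string; ratio: number; }

function samplingRatio(
  operation: string,
  rules: SamplingRule[],
  defaultRatio: number
): number {
  for (const rule of rules) {
    const matches = rule.operation.endsWith('*')
      ? operation.startsWith(rule.operation.slice(0, -1))
      : operation === rule.operation;
    if (matches) return rule.ratio; // first matching rule wins
  }
  return defaultRatio;
}

const rules: SamplingRule[] = [
  { operation: 'api.health', ratio: 0.01 },
  { operation: 'auth.*', ratio: 1.0 },
];
// samplingRatio('auth.login', rules, 0.1)    -> 1.0
// samplingRatio('orders.create', rules, 0.1) -> 0.1 (default)
```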

Production Setup

Full Observability Stack

docker-compose.yml:
services:
  iii:
    image: iiidev/iii:latest
    ports:
      - "3111:3111"
      - "49134:49134"
      - "9464:9464"
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://tempo:4317
    volumes:
      - ./config.yaml:/app/config.yaml:ro
  
  # Traces: Grafana Tempo
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    ports:
      - "4317:4317"   # OTLP gRPC
      - "3200:3200"   # Tempo
  
  # Metrics: Prometheus
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
  
  # Visualization: Grafana
  grafana:
    image: grafana/grafana:latest
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"

volumes:
  tempo-data:
  prometheus-data:
  grafana-data:
prometheus.yml:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'iii'
    static_configs:
      - targets: ['iii:9464']

Environment Variables

Override configuration via environment:
# Tracing
export OTEL_ENABLED=true
export OTEL_SERVICE_NAME=my-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_TRACES_SAMPLER_ARG=0.1

# Metrics
export OTEL_METRICS_ENABLED=true
export OTEL_METRICS_EXPORTER=otlp
export OTEL_METRICS_RETENTION_SECONDS=7200
export OTEL_METRICS_MAX_COUNT=50000

Best Practices

Sampling Strategy

  • Use lower sampling for high-traffic endpoints (health checks, static assets)
  • Use 100% sampling for critical operations (auth, payments)
  • Implement rate limiting to prevent trace storms
  • Monitor sampling effectiveness in production

Metric Cardinality

  • Limit unique label combinations to avoid cardinality explosion
  • Use metric aggregation for high-cardinality data
  • Set appropriate retention periods
  • Monitor memory usage for in-memory storage

Trace Context Propagation

  • Always propagate traceparent and baggage headers
  • Use baggage for cross-cutting concerns (user_id, request_id)
  • Avoid large baggage payloads (max ~8KB)
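Baggage is a comma-separated list of key=value pairs (W3C Baggage format). A hedged sketch of parsing it on the receiving side, with an assumed size guard mirroring the ~8KB guidance above:

```typescript
// Parse a W3C baggage header ("k1=v1,k2=v2") into a map.
// The 8192-byte guard reflects the ~8KB guidance; the exact
// limit enforced by any given propagator may differ.
function parseBaggage(header: string): Map<string, string> {
  if (header.length > 8192) {
    throw new Error('baggage exceeds ~8KB limit');
  }
  const entries = new Map<string, string>();
  for (const pair of header.split(',')) {
    const eq = pair.indexOf('=');
    if (eq === -1) continue; // skip malformed entries
    entries.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
  }
  return entries;
}

const bag = parseBaggage('user_id=123,session_id=abc');
// bag.get('user_id') === '123', bag.get('session_id') === 'abc'
```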

Performance

  • Use OTLP exporter for production (lower overhead than memory)
  • Batch metrics exports (default: 60s interval)
  • Configure appropriate buffer sizes
  • Monitor exporter queue depth

Security

  • Use TLS for OTLP endpoints in production
  • Sanitize sensitive data from traces/logs
  • Implement access controls for observability APIs
  • Rotate service credentials regularly

Troubleshooting

High Memory Usage

# Reduce in-memory retention
metrics_max_count: 5000
memory_max_spans: 500
logs_max_count: 5000

Missing Traces

  • Check sampling ratio (may be too low)
  • Verify OTLP endpoint connectivity
  • Check for trace export errors in logs

Prometheus Scrape Failures

# Verify metrics endpoint
curl http://localhost:9464/metrics

# Check Prometheus targets
curl http://localhost:9090/targets

High Cardinality

If the engine logs the warning "Metric cardinality limit reached, new metric names will be dropped.", reduce the number of unique label combinations or raise the limit:
// In custom adapter
max_unique_names: 50000

References

  • OpenTelemetry implementation: src/modules/observability/otel.rs
  • Metrics implementation: src/modules/observability/metrics.rs
  • Sampling strategies: src/modules/observability/sampler.rs
  • Configuration: src/modules/observability/config.rs
