Monitoring Health - Conway Automaton

Health Status Overview

Your automaton continuously monitors its own health and reports status through multiple channels.

Quick Status Check

automaton-cli status

Output:

Name: MyAutomaton
Address: 0x742d35Cc6634C0532925a3b844Bc9e7595f0bEb1
State: running
Credits: $12.45
USDC Balance: $50.00
Survival Tier: normal
Uptime: 3h 24m
Last Heartbeat: 2 minutes ago
Turns: 147

Agent States

Your automaton transitions through different operational states:

export type AgentState =
  | "setup"         // Initial configuration
  | "waking"        // Starting up
  | "running"       // Normal operation
  | "sleeping"      // Idle, waiting for input
  | "low_compute"   // Low credits, reduced activity
  | "critical"      // Zero credits, survival mode
  | "dead";         // Non-operational

State Transitions

┌──────┐
│ setup│
└───┬──┘
    ▼
┌───────┐
│waking │
└───┬───┘
    ▼
┌────────┐    ┌──────────┐
│running │◄──►│ sleeping │
└───┬────┘    └──────────┘
    │
    ▼ (credits low)
┌─────────────┐
│ low_compute │
└──────┬──────┘
       │
       ▼ (credits exhausted)
   ┌──────────┐
   │ critical │
   └─────┬────┘
         │
         ▼ (negative balance)
     ┌──────┐
     │ dead │
     └──────┘
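
The diagram can be read as a lookup table of allowed moves. The sketch below is illustrative only: the table is transcribed from the arrows drawn above, the names ALLOWED_TRANSITIONS and canTransition are hypothetical, and any recovery paths the diagram does not show (for example, leaving low_compute after a top-up) are deliberately left out.

```typescript
// Hypothetical sketch: the table is read off the diagram above,
// not taken from the automaton source code.
type AgentState =
  | "setup"
  | "waking"
  | "running"
  | "sleeping"
  | "low_compute"
  | "critical"
  | "dead";

// Each state maps to the states the diagram draws an arrow to.
const ALLOWED_TRANSITIONS: Record<AgentState, readonly AgentState[]> = {
  setup: ["waking"],
  waking: ["running"],
  running: ["sleeping", "low_compute"],
  sleeping: ["running"],
  low_compute: ["critical"],
  critical: ["dead"],
  dead: [],
};

// True when the diagram shows a direct arrow from `from` to `to`.
function canTransition(from: AgentState, to: AgentState): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

const wakeToSleep = canTransition("running", "sleeping"); // true
const sleepToLow = canTransition("sleeping", "low_compute"); // false
```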

Observability System

The automaton includes a comprehensive observability stack with structured logging, metrics, and alerts.

Structured Logging

All logs are JSON-formatted for easy parsing:

{
  "timestamp": "2026-03-03T10:15:30.123Z",
  "level": "info",
  "module": "heartbeat.scheduler",
  "message": "Heartbeat task completed",
  "context": {
    "taskName": "check_inbox",
    "durationMs": 234
  }
}

Log Levels

debug: Detailed diagnostic information
info: General informational messages
warn: Warning messages (non-critical issues)
error: Error messages (operation failed)
fatal: Fatal errors (system shutdown)
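
Because each entry is JSON, logs can be filtered programmatically. The sketch below assumes one JSON object per line and treats the five levels as an ascending severity order; the helper name filterLogs, the ranking table, and the second sample entry are made up for illustration.

```typescript
// Field names follow the sample entry above. The severity ranking and
// helper names are assumptions made for this sketch.
type LogLevel = "debug" | "info" | "warn" | "error" | "fatal";

interface LogEntry {
  timestamp: string;
  level: LogLevel;
  module: string;
  message: string;
  context?: Record<string, unknown>;
}

const LEVEL_RANK: Record<LogLevel, number> = {
  debug: 0,
  info: 1,
  warn: 2,
  error: 3,
  fatal: 4,
};

// Parse newline-delimited JSON and keep entries at or above minLevel.
function filterLogs(ndjson: string, minLevel: LogLevel): LogEntry[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LogEntry)
    .filter((entry) => LEVEL_RANK[entry.level] >= LEVEL_RANK[minLevel]);
}

// Illustrative input: the first entry is the sample above, the second is invented.
const sampleLogs = [
  '{"timestamp":"2026-03-03T10:15:30.123Z","level":"info","module":"heartbeat.scheduler","message":"Heartbeat task completed","context":{"taskName":"check_inbox","durationMs":234}}',
  '{"timestamp":"2026-03-03T10:15:31.000Z","level":"error","module":"heartbeat.scheduler","message":"Heartbeat task failed"}',
].join("\n");

const problems = filterLogs(sampleLogs, "warn");
const problemMessages = problems.map((entry) => entry.message); // ["Heartbeat task failed"]
```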

Viewing Logs

# All logs
automaton-cli logs

# Filter by level
automaton-cli logs --level error

# Follow live
automaton-cli logs --follow

# Search content
automaton-cli logs --grep "credit"

# Last 100 lines
automaton-cli logs --tail 100

Metrics Collection

The automaton tracks three types of metrics:

Counters

Monotonically increasing values:

turns_total: Total turns executed
inference_cost_cents: Cumulative inference cost
heartbeat_task_successes_total: Successful heartbeat tasks
heartbeat_task_failures_total: Failed heartbeat tasks
policy_decisions_total: Total policy evaluations
policy_denies_total: Denied policy decisions

Gauges

Point-in-time measurements:

balance_cents: Current credit balance
usdc_balance: Current USDC balance
context_tokens_total: Current context size
turns_last_hour: Turns in last hour (windowed)
unhealthy_child_count: Number of unhealthy child agents

Histograms

Distribution of values over time:

turn_duration_ms: Turn execution time
inference_latency_ms: Model API latency
tool_duration_ms: Tool execution time
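
A histogram is usually read through summary statistics rather than raw samples. The following is a minimal sketch of summarizing a list of observations such as turn_duration_ms with the nearest-rank percentile method. The function names and sample values are hypothetical, and the automaton may bucket or summarize its histograms differently.

```typescript
// Hypothetical helper for summarizing raw histogram samples.
// Uses nearest-rank percentiles; not taken from the automaton source.
interface HistogramSummary {
  count: number;
  min: number;
  max: number;
  mean: number;
  p50: number;
  p95: number;
}

// Nearest rank: the smallest sample with at least p percent of samples at or below it.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function summarize(samples: number[]): HistogramSummary | null {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, value) => acc + value, 0);
  return {
    count: sorted.length,
    min: sorted[0],
    max: sorted[sorted.length - 1],
    mean: sum / sorted.length,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
  };
}

// Illustrative turn durations in milliseconds.
const turnSummary = summarize([120, 234, 180, 950, 210]);
// turnSummary.p50 is 210 and turnSummary.p95 is 950
```

A single slow turn (950 ms here) barely moves the median but dominates the 95th percentile, which is why both are worth tracking.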

Metrics API

From src/observability/metrics.ts:

import { getMetrics } from "./observability/metrics.js";

const metrics = getMetrics();

// Increment counter
metrics.increment("turns_total", { state: "running" });

// Set gauge
metrics.gauge("balance_cents", creditsCents);

// Record histogram value
metrics.histogram("turn_duration_ms", durationMs);

// Query metrics
const turnCount = metrics.getCounter("turns_total");
const balance = metrics.getGauge("balance_cents");

Viewing Metrics

# Current metrics snapshot
automaton-cli metrics

# Specific metric
automaton-cli metrics --name balance_cents

# Export to JSON
automaton-cli metrics --format json > metrics.json

Alert System

The automaton evaluates alert rules against metric snapshots and triggers notifications.

Built-in Alert Rules

From src/observability/alerts.ts:

1. Balance Below Reserve

{
  name: "balance_below_reserve",
  severity: "critical",
  message: "Balance is below minimum reserve (1000 cents)",
  cooldownMs: 5 * 60 * 1000,
  condition: (metrics) => {
    const balance = metrics.gauges.get("balance_cents") ?? Infinity;
    return balance < 1000;
  },
}

2. High Heartbeat Failure Rate

{
  name: "heartbeat_high_failure_rate",
  severity: "warning",
  message: "Heartbeat task failure rate exceeds 20%",
  cooldownMs: 15 * 60 * 1000,
  condition: (metrics) => {
    const failures = metrics.counters.get("heartbeat_task_failures_total") ?? 0;
    const successes = metrics.counters.get("heartbeat_task_successes_total") ?? 0;
    const total = failures + successes;
    if (total === 0) return false;
    return failures / total > 0.2;
  },
}

3. Context Near Capacity

{
  name: "context_near_capacity",
  severity: "warning",
  message: "Context token usage above 90% of budget",
  cooldownMs: 10 * 60 * 1000,
  condition: (metrics) => {
    const tokens = metrics.gauges.get("context_tokens_total") ?? 0;
    return tokens > 90_000; // 100k default budget
  },
}

4. Zero Turns Last Hour

{
  name: "zero_turns_last_hour",
  severity: "critical",
  message: "No successful turns in the last hour",
  cooldownMs: 60 * 60 * 1000,
  condition: (metrics) => {
    const turnsLastHour = metrics.gauges.get("turns_last_hour") ?? -1;
    if (turnsLastHour >= 0) return turnsLastHour === 0;
    return false;
  },
}

Alert Cooldowns

Each alert has a cooldown period to prevent alert storms. Once an alert fires, it won’t fire again until the cooldown expires.
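
The cooldown behavior can be sketched as a small evaluator that remembers when each rule last fired. The rule shape follows the built-in rules above; the MetricSnapshot interface, the AlertEvaluator class, and the explicit `now` parameter are assumptions for illustration and may not match src/observability/alerts.ts.

```typescript
// Sketch under assumptions: the rule shape mirrors the built-in rules above,
// while the evaluator class and its names are hypothetical.
interface MetricSnapshot {
  counters: Map<string, number>;
  gauges: Map<string, number>;
}

interface AlertRule {
  name: string;
  severity: "warning" | "critical";
  message: string;
  cooldownMs: number;
  condition: (metrics: MetricSnapshot) => boolean;
}

class AlertEvaluator {
  private rules: AlertRule[];
  private lastFired = new Map<string, number>();

  constructor(rules: AlertRule[]) {
    this.rules = rules;
  }

  // Returns the rules that fire at time `now` (milliseconds since the epoch).
  evaluate(snapshot: MetricSnapshot, now: number): AlertRule[] {
    const fired: AlertRule[] = [];
    for (const rule of this.rules) {
      if (!rule.condition(snapshot)) continue;
      const last = this.lastFired.get(rule.name);
      // Skip rules that fired less than cooldownMs ago.
      if (last !== undefined && now - last < rule.cooldownMs) continue;
      this.lastFired.set(rule.name, now);
      fired.push(rule);
    }
    return fired;
  }
}

const balanceRule: AlertRule = {
  name: "balance_below_reserve",
  severity: "critical",
  message: "Balance is below minimum reserve (1000 cents)",
  cooldownMs: 5 * 60 * 1000,
  condition: (m) => (m.gauges.get("balance_cents") ?? Infinity) < 1000,
};

const evaluator = new AlertEvaluator([balanceRule]);
const lowBalance: MetricSnapshot = {
  counters: new Map(),
  gauges: new Map([["balance_cents", 800]]),
};

const t0 = Date.parse("2026-03-03T10:00:00Z");
const firstPass = evaluator.evaluate(lowBalance, t0); // fires
const secondPass = evaluator.evaluate(lowBalance, t0 + 60_000); // suppressed: 1 min < 5 min
const thirdPass = evaluator.evaluate(lowBalance, t0 + 5 * 60_000); // fires again
```

Passing `now` explicitly rather than reading the clock inside the evaluator keeps the cooldown logic easy to test.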

Viewing Alerts

# Active alerts
automaton-cli alerts

# Alert history
automaton-cli alerts --history

Heartbeat Health

The heartbeat system provides autonomous health monitoring.

Heartbeat Tasks

From ~/.automaton/heartbeat.yml:

entries:
  - name: check_balance
    schedule: "0 */5 * * * *"  # Every 5 minutes
    task: check_balance
    enabled: true

  - name: check_inbox
    schedule: "0 */2 * * * *"  # Every 2 minutes
    task: check_inbox
    enabled: true

  - name: self_reflect
    schedule: "0 0 */6 * * *"  # Every 6 hours
    task: self_reflect
    enabled: true
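
The schedules above have six fields. Judging by the inline comments, the leading field is seconds, followed by the usual five cron fields. The sketch below splits a schedule string on that assumption; the helper name splitSchedule and the field names are hypothetical.

```typescript
// Assumes a six-field cron expression with seconds first, as the
// comments in heartbeat.yml suggest. Names are hypothetical.
interface CronFields {
  second: string;
  minute: string;
  hour: string;
  dayOfMonth: string;
  month: string;
  dayOfWeek: string;
}

function splitSchedule(schedule: string): CronFields {
  const parts = schedule.trim().split(/\s+/);
  if (parts.length !== 6) {
    throw new Error(`expected 6 cron fields, got ${parts.length}`);
  }
  const [second, minute, hour, dayOfMonth, month, dayOfWeek] = parts;
  return { second, minute, hour, dayOfMonth, month, dayOfWeek };
}

const balanceSchedule = splitSchedule("0 */5 * * * *");
// balanceSchedule.second is "0" and balanceSchedule.minute is "*/5"

const reflectSchedule = splitSchedule("0 0 */6 * * *");
// reflectSchedule.minute is "0" and reflectSchedule.hour is "*/6"
```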

Heartbeat Status

automaton-cli heartbeat status

Output:

Task               Schedule        Last Run         Next Run         Status
check_balance      */5 * * * *     2 minutes ago    3 minutes        OK
check_inbox        */2 * * * *     1 minute ago     1 minute         OK
self_reflect       0 */6 * * *     4 hours ago      2 hours          OK

Child Agent Health Monitoring

If your automaton has spawned child agents, it also monitors their health.

Health Checks

From src/orchestration/health-monitor.ts:

export interface AgentHealthStatus {
  address: string;
  name: string;
  status: string;
  healthy: boolean;
  lastHeartbeat: string | null;
  currentTaskId: string | null;
  creditBalance: number | null;
  errorRate: number;
  issues: string[];
}

Health Issues

The system detects:

heartbeat_missing: No recent heartbeat
heartbeat_stale: Heartbeat older than 15 minutes
process_crashed: No response for 45 minutes
stuck_on_task: Task execution exceeding timeout
out_of_credits: Balance below minimum (10 cents)
error_loop: Error rate exceeds 60%
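
The thresholds above can be turned into a simple classifier. This is a sketch, not the logic in src/orchestration/health-monitor.ts. It assumes that a null lastHeartbeat means heartbeat_missing, that "no response" is measured by heartbeat age, that errorRate is a fraction between 0 and 1, and that creditBalance is in cents. stuck_on_task is left out because the task timeout is not stated here.

```typescript
// Thresholds come from the list above. Function and type names are
// hypothetical, and stuck_on_task is omitted (timeout value not given).
type DetectedIssue =
  | "heartbeat_missing"
  | "heartbeat_stale"
  | "process_crashed"
  | "out_of_credits"
  | "error_loop";

interface ChildSnapshot {
  lastHeartbeat: string | null; // ISO 8601 timestamp
  creditBalance: number | null; // cents (assumed unit)
  errorRate: number; // fraction between 0 and 1 (assumed scale)
}

const STALE_MS = 15 * 60 * 1000;
const CRASHED_MS = 45 * 60 * 1000;
const MIN_CREDITS_CENTS = 10;
const ERROR_LOOP_RATE = 0.6;

function detectIssues(child: ChildSnapshot, now: Date): DetectedIssue[] {
  const issues: DetectedIssue[] = [];
  if (child.lastHeartbeat === null) {
    issues.push("heartbeat_missing");
  } else {
    const ageMs = now.getTime() - Date.parse(child.lastHeartbeat);
    // Report only the more severe of the two age-based issues.
    if (ageMs > CRASHED_MS) issues.push("process_crashed");
    else if (ageMs > STALE_MS) issues.push("heartbeat_stale");
  }
  if (child.creditBalance !== null && child.creditBalance < MIN_CREDITS_CENTS) {
    issues.push("out_of_credits");
  }
  if (child.errorRate > ERROR_LOOP_RATE) issues.push("error_loop");
  return issues;
}

const checkedAt = new Date("2026-03-03T11:00:00Z");
const childIssues = detectIssues(
  { lastHeartbeat: "2026-03-03T10:40:00Z", creditBalance: 5, errorRate: 0.1 },
  checkedAt,
);
// childIssues is ["heartbeat_stale", "out_of_credits"]
```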

Auto-Healing

The health monitor can automatically:

Fund agents low on credits
Restart crashed agents
Reassign stuck tasks
Stop agents in error loops

# View child health
automaton-cli children health

# Trigger auto-heal
automaton-cli children heal
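
One plausible way to pair the detected issues with the healing actions listed above is a lookup table. The pairing below is an inference from the two lists, not documented behavior, and the names HEAL_ACTION and planHealing are hypothetical. Heartbeat issues map to no action because none is listed for them.

```typescript
// Hypothetical mapping inferred from the issue and action lists above.
type HealthIssue =
  | "heartbeat_missing"
  | "heartbeat_stale"
  | "process_crashed"
  | "stuck_on_task"
  | "out_of_credits"
  | "error_loop";

type HealAction = "fund" | "restart" | "reassign_task" | "stop" | "none";

const HEAL_ACTION: Record<HealthIssue, HealAction> = {
  out_of_credits: "fund",
  process_crashed: "restart",
  stuck_on_task: "reassign_task",
  error_loop: "stop",
  heartbeat_missing: "none",
  heartbeat_stale: "none",
};

// Returns the distinct actions for a set of issues, in first-seen order.
function planHealing(issues: HealthIssue[]): HealAction[] {
  const actions = issues
    .map((issue) => HEAL_ACTION[issue])
    .filter((action) => action !== "none");
  return Array.from(new Set(actions));
}

const healingPlan = planHealing(["out_of_credits", "heartbeat_stale", "error_loop"]);
// healingPlan is ["fund", "stop"]
```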

Performance Metrics

Turn Performance

Track agent turn execution:

SELECT 
  AVG(cost_cents) as avg_cost,
  AVG(json_extract(token_usage, '$.totalTokens')) as avg_tokens,
  COUNT(*) as turn_count
FROM turns
WHERE timestamp > datetime('now', '-1 day');

Inference Costs

Query inference spending:

SELECT 
  model,
  SUM(cost_cents) as total_cost,
  SUM(input_tokens) as total_input,
  SUM(output_tokens) as total_output,
  AVG(latency_ms) as avg_latency
FROM inference_costs
GROUP BY model;

Tool Usage

Analyze tool call patterns:

SELECT 
  name,
  COUNT(*) as call_count,
  AVG(duration_ms) as avg_duration,
  SUM(CASE WHEN error IS NOT NULL THEN 1 ELSE 0 END) as error_count
FROM tool_calls
GROUP BY name
ORDER BY call_count DESC;

Database Health

Monitor the SQLite database:

# Database size
ls -lh ~/.automaton/state.db

# Integrity check
sqlite3 ~/.automaton/state.db "PRAGMA integrity_check;"

# Table sizes in bytes (requires an SQLite build with the dbstat virtual table)
sqlite3 ~/.automaton/state.db "SELECT name, SUM(pgsize) as size FROM dbstat GROUP BY name;"

Vacuum and Optimize

# Compact database
sqlite3 ~/.automaton/state.db "VACUUM;"

# Refresh query planner statistics
sqlite3 ~/.automaton/state.db "ANALYZE;"

System Resource Monitoring

Sandbox Resources

Check Conway sandbox usage:

automaton-cli sandbox stats

Memory Usage

# Process memory (the [a] keeps grep from matching its own process)
ps aux | grep '[a]utomaton'

# Disk usage of the automaton data directory
du -sh ~/.automaton/

Diagnostic Tools

Export State

Export full automaton state for debugging:

automaton-cli export --output state-dump.json

Health Report

Generate a comprehensive health report:

automaton-cli health-report

Includes:

Current state and uptime
Credit and USDC balances
Recent turns and costs
Active alerts
Heartbeat status
Child agent health
Database stats

Best Practices

Proactive Monitoring

Set up alerts: Configure notifications for critical alerts
Review daily: Check status and metrics daily
Trend analysis: Track spending and performance trends
Capacity planning: Monitor context usage and database growth
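
As one example of trend analysis, a credit runway can be estimated from two or more balance readings. This is a rough sketch that assumes a steady burn rate between the first and last sample. The BalanceSample shape, the function name, and the sample values are hypothetical, and the readings would come from wherever you record balance_cents.

```typescript
// Hypothetical runway estimate from recorded balance readings.
interface BalanceSample {
  timestamp: string; // ISO 8601
  balanceCents: number;
}

// Estimated hours until the balance reaches zero, or null when there are
// fewer than two samples or the balance is not decreasing.
function estimateRunwayHours(samples: BalanceSample[]): number | null {
  if (samples.length < 2) return null;
  const first = samples[0];
  const last = samples[samples.length - 1];
  const elapsedHours =
    (Date.parse(last.timestamp) - Date.parse(first.timestamp)) / 3_600_000;
  const spentCents = first.balanceCents - last.balanceCents;
  if (elapsedHours <= 0 || spentCents <= 0) return null;
  const burnPerHour = spentCents / elapsedHours;
  return last.balanceCents / burnPerHour;
}

// Illustrative readings: 255 cents spent over 3 hours is 85 cents per hour.
const runwayHours = estimateRunwayHours([
  { timestamp: "2026-03-03T10:00:00Z", balanceCents: 1500 },
  { timestamp: "2026-03-03T13:00:00Z", balanceCents: 1245 },
]);
// runwayHours is 1245 / 85, roughly 14.6 hours
```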

Performance Tuning

Optimize context: Prune old turns to reduce context size
Tune heartbeat: Balance responsiveness vs cost
Index optimization: Add database indexes for common queries
Model selection: Use appropriate models for task complexity

Debugging

Enable debug logs: Set logLevel: "debug" temporarily
Increase verbosity: Add context to critical operations
Trace tool calls: Monitor tool execution and errors
Profile turns: Measure turn duration and token usage

Troubleshooting Common Issues

High error rate

Check logs: automaton-cli logs --level error
Review failed tool calls
Verify API keys and network connectivity
Check treasury policy for denied operations

Stuck in sleeping state

Verify heartbeat is running
Check inbox for unprocessed messages
Review wake conditions
Manually trigger: automaton-cli wake

High costs

Review inference costs by model
Check for inefficient tool usage
Optimize heartbeat frequency
Switch to cheaper models for routine tasks

Database corruption

Run integrity check
Restore from backup
Check disk space
Review logs for write errors
