Overview

Jobs are individual execution instances of workflows. When a workflow’s scheduled time arrives, Chronoverse creates a job to execute the workflow’s defined task. Each job has its own lifecycle, status, and log history.

Job Properties

Every job has the following properties:
  • id (string, required): Unique identifier (UUID) for the job
  • workflow_id (string, required): ID of the parent workflow this job belongs to
  • user_id (string, required): ID of the user who owns the workflow
  • status (string, required): Current job status (see Job Statuses below)
  • trigger (string, required): How the job was triggered:
    • AUTOMATIC: Scheduled automatically by Chronoverse
    • MANUAL: Triggered manually by the user via the API
  • container_id (string): Docker container ID (CONTAINER workflows only)
  • scheduled_at (timestamp, required): When the job was scheduled to run (RFC3339 format)
  • started_at (timestamp): When job execution actually started
  • completed_at (timestamp): When job execution finished
  • created_at (timestamp): When the job record was created
  • updated_at (timestamp): Last update timestamp
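A job record with these fields might look like the following. This is a minimal sketch: the field values are invented, and `parse_ts` is an illustrative helper, not part of the Chronoverse API.

```python
from datetime import datetime

# Illustrative job record; field names match the table above, values are invented.
job = {
    "id": "3f2b9c70-1a4e-4d2b-9c1f-7e5a6b8d9e01",
    "workflow_id": "0a1b2c3d-4e5f-6071-8293-a4b5c6d7e8f9",
    "status": "COMPLETED",
    "trigger": "AUTOMATIC",
    "scheduled_at": "2024-01-01T10:00:00Z",
    "started_at": "2024-01-01T10:00:05Z",
    "completed_at": "2024-01-01T10:00:45Z",
}

def parse_ts(value):
    """Parse an RFC3339 timestamp string into a timezone-aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))
```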

Job Statuses

Jobs progress through several states during their lifecycle:

PENDING

Initial State: Job has been created but not yet queued for execution.
  • Job record exists in database
  • Waiting to be picked up by workers
  • Typically very short duration (milliseconds)

QUEUED

Queued for Execution: Job has been sent to the execution queue.
  • Event published to Kafka
  • Waiting for worker to process
  • Duration depends on worker availability

RUNNING

Active Execution: Job is currently executing.
  • Container/process is running
  • Logs are being captured in real-time
  • started_at timestamp is set
  • Duration varies by workflow complexity

COMPLETED

Successful Completion: Job finished successfully.
  • Exit code 0 (or equivalent success indicator)
  • All logs captured and stored
  • completed_at timestamp is set
  • Consecutive failure counter is reset

FAILED

Failed Execution: Job encountered an error.
  • Non-zero exit code or exception
  • Error logs captured
  • completed_at timestamp is set
  • Consecutive failure counter incremented

CANCELED

Canceled Execution: Job was canceled before or during execution.
  • Manually canceled by user
  • Or workflow terminated mid-execution
  • Resources cleaned up
  • completed_at timestamp is set
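The lifecycle described above can be summarized as a transition table. This is a sketch of the documented state machine, not the service's actual implementation:

```python
# Valid status transitions implied by the lifecycle above.
# Terminal states (COMPLETED, FAILED, CANCELED) have no outgoing transitions.
VALID_TRANSITIONS = {
    "PENDING": {"QUEUED", "CANCELED"},
    "QUEUED": {"RUNNING", "CANCELED"},
    "RUNNING": {"COMPLETED", "FAILED", "CANCELED"},
    "COMPLETED": set(),
    "FAILED": set(),
    "CANCELED": set(),
}

def can_transition(current, new):
    """Return True if a job may move from `current` to `new`."""
    return new in VALID_TRANSITIONS.get(current, set())
```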

Job Lifecycle

1. Scheduling Phase

Trigger: Scheduling Worker

Every poll interval (configurable), the Scheduling Worker:
  1. Queries PostgreSQL for workflows due for execution
  2. Checks workflow status and build readiness
  3. Creates job records with PENDING status
  4. Sets scheduled_at to current time
  5. Publishes events to Kafka
```sql
SELECT * FROM workflows
WHERE terminated_at IS NULL
  AND build_status = 'COMPLETED'
  AND (
    last_execution_time IS NULL
    OR last_execution_time + (interval * INTERVAL '1 minute') <= NOW()
  )
```
Automatic Trigger: Jobs created by the Scheduling Worker have trigger = AUTOMATIC.
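The due-for-execution predicate in the SQL above can be mirrored in application code. A minimal sketch (the function name and signature are illustrative, not Chronoverse's actual code):

```python
from datetime import datetime, timedelta, timezone

def is_due(last_execution_time, interval_minutes, now=None):
    """Mirror the SQL predicate: a workflow is due if it has never run,
    or if last_execution_time + interval has already passed."""
    now = now or datetime.now(timezone.utc)
    if last_execution_time is None:
        return True
    return last_execution_time + timedelta(minutes=interval_minutes) <= now
```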

2. Queue Phase

Status: PENDING → QUEUED

Once events are published to Kafka:
  1. Job status updates to QUEUED
  2. Events wait in Kafka topics based on workflow type:
    • HEARTBEAT workflows → jobs topic
    • CONTAINER workflows → workflows topic (build first)
  3. Consumer groups ensure at-least-once delivery
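The topic routing in step 2 can be sketched as a simple lookup (topic names are from the list above; the function itself is illustrative):

```python
def topic_for(workflow_type):
    """Route a scheduled job's event to the appropriate Kafka topic.
    HEARTBEAT jobs go straight to execution; CONTAINER jobs pass
    through the build/preparation phase first."""
    return {"HEARTBEAT": "jobs", "CONTAINER": "workflows"}[workflow_type]
```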

3. Execution Phase

Status: QUEUED → RUNNING

For HEARTBEAT Workflows

Worker: Execution Worker
  1. Consumes job event from jobs topic
  2. Updates status to RUNNING
  3. Sets started_at timestamp
  4. Executes health check logic
  5. Captures result (success/failure)

For CONTAINER Workflows

Worker: Workflow Worker → Execution Worker

Two-phase execution:

Phase 1: Workflow Worker
  1. Consumes workflow event from workflows topic
  2. Retrieves workflow configuration from Redis
  3. Validates Docker image availability
  4. Prepares execution environment
  5. Publishes job event to jobs topic
Phase 2: Execution Worker
  1. Consumes job event from jobs topic
  2. Updates status to RUNNING
  3. Sets started_at timestamp
  4. Creates Docker container with configuration
  5. Starts container execution
  6. Streams stdout/stderr to Kafka’s job_logs topic
  7. Monitors container lifecycle
Container logs are streamed in real-time via Redis for live viewing while also being published to Kafka for persistence.

4. Completion Phase

Status: RUNNING → COMPLETED/FAILED/CANCELED

When execution finishes:
  1. Container exits (or heartbeat completes)
  2. Exit code determines outcome:
    • Exit code 0 → COMPLETED
    • Exit code ≠ 0 → FAILED
    • Canceled by user → CANCELED
  3. Sets completed_at timestamp
  4. Updates job status in database
  5. Publishes analytics events to Kafka
  6. Cleans up resources (container removal, cache cleanup)
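The outcome mapping in step 2 is easy to express directly. A minimal sketch (function name is illustrative):

```python
def final_status(exit_code, canceled=False):
    """Map an execution outcome to a terminal job status, per step 2 above:
    user cancellation wins, exit code 0 means success, anything else fails."""
    if canceled:
        return "CANCELED"
    return "COMPLETED" if exit_code == 0 else "FAILED"
```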

5. Post-Processing Phase

Workers: JobLogs Processor & Analytics Processor

JobLogs Processor

  • Consumes log events from Kafka
  • Performs batch insertion to ClickHouse
  • Indexes logs in MeiliSearch
  • Enables efficient log querying and search

Analytics Processor

  • Consumes analytics events from Kafka
  • Aggregates job metrics (duration, success rate, etc.)
  • Stores in PostgreSQL for reporting
  • Enables trend analysis and dashboards
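The kind of aggregation the Analytics Processor performs can be sketched as follows. The input shape (dicts with `status` and `duration_s` keys) is assumed for illustration and is not the service's actual schema:

```python
def aggregate(jobs):
    """Compute per-workflow metrics of the kind described above:
    success rate and average duration over finished jobs."""
    finished = [j for j in jobs if j["status"] in ("COMPLETED", "FAILED")]
    if not finished:
        return {"success_rate": None, "avg_duration_s": None}
    completed = [j for j in finished if j["status"] == "COMPLETED"]
    return {
        "success_rate": len(completed) / len(finished),
        "avg_duration_s": sum(j["duration_s"] for j in finished) / len(finished),
    }
```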

Job Triggers

AUTOMATIC Trigger

Jobs created by the Scheduling Worker based on the workflow interval.

Characteristics:
  • Regular, predictable scheduling
  • Respects workflow interval
  • Skipped if previous job still running
  • Default trigger type
Example Timeline:
Interval: 10 minutes

10:00:00 - Job 1 starts (AUTOMATIC)
10:00:45 - Job 1 completes
10:10:00 - Job 2 starts (AUTOMATIC)
10:10:30 - Job 2 completes
10:20:00 - Job 3 starts (AUTOMATIC)
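The timeline above is just the start time plus multiples of the interval. A small sketch (function name is illustrative):

```python
from datetime import datetime, timedelta

def schedule_times(start, interval_minutes, count):
    """Generate the AUTOMATIC trigger times: start, start + interval,
    start + 2 * interval, and so on."""
    return [start + timedelta(minutes=interval_minutes * i) for i in range(count)]
```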

MANUAL Trigger

Jobs created by an explicit user request via the API.

Characteristics:
  • On-demand execution
  • Ignores workflow interval
  • Can run even if automatic job is running
  • Useful for testing and debugging
Use Cases:
  • Test workflow configuration
  • Run job outside regular schedule
  • Retry failed execution
  • Debug workflow issues
Manual jobs count toward consecutive failure tracking. A failed manual job will increment the failure counter.

Job Logs

Log Capture

During execution, Chronoverse captures:
  • stdout: Standard output stream
  • stderr: Standard error stream
  • Timestamp: Nanosecond precision
  • Sequence Number: Order preservation

Log Storage

Real-time (Redis):
  • Temporary storage during execution
  • Enables live log streaming
  • TTL-based expiration
Long-term (ClickHouse):
  • Permanent storage for all logs
  • Optimized for time-series queries
  • Compressed for efficient storage
Search (MeiliSearch):
  • Full-text search capability
  • Fast log filtering
  • Message content indexing

Log Retrieval

Retrieve logs for a completed job with pagination:
GET /api/v1/workflows/{workflow_id}/jobs/{job_id}/logs?cursor={cursor}
Supports filtering by stream (stdout, stderr, all).
Stream logs for a running job using Server-Sent Events:
GET /api/v1/workflows/{workflow_id}/jobs/{job_id}/logs/stream
Automatically falls back to static logs when job completes.
Full-text search across job logs:
GET /api/v1/workflows/{workflow_id}/jobs/{job_id}/logs/search?message=error
Powered by MeiliSearch for fast results.
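A client draining the paginated logs endpoint loops until no cursor is returned. In this sketch, `fetch_page(cursor)` stands in for the `GET .../logs?cursor=...` call, and the assumed response shape of `(entries, next_cursor)` with `None` on the last page is illustrative, not the documented wire format:

```python
def fetch_all_logs(fetch_page):
    """Collect every log entry by following pagination cursors.
    `fetch_page(cursor)` returns (entries, next_cursor); next_cursor
    is None once the last page has been reached."""
    logs, cursor = [], None
    while True:
        entries, cursor = fetch_page(cursor)
        logs.extend(entries)
        if cursor is None:
            return logs
```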

Job Monitoring

Status Tracking

Monitor job progress through:
  1. API Polling: GET /api/v1/workflows/{workflow_id}/jobs/{job_id}
  2. Real-time Notifications: Subscribe to SSE endpoint for status updates
  3. Dashboard: Visual workflow and job status

Performance Metrics

Available metrics for each job:
  • Execution Duration: completed_at - started_at
  • Queue Time: started_at - scheduled_at
  • Total Time: completed_at - scheduled_at
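These three metrics fall directly out of a job's timestamps. A minimal sketch, assuming RFC3339 strings as in the job properties above (function name is illustrative):

```python
from datetime import datetime

def job_metrics(job):
    """Derive execution, queue, and total time (in seconds) from a job's
    scheduled_at / started_at / completed_at RFC3339 timestamps."""
    ts = {k: datetime.fromisoformat(job[k].replace("Z", "+00:00"))
          for k in ("scheduled_at", "started_at", "completed_at")}
    return {
        "execution_s": (ts["completed_at"] - ts["started_at"]).total_seconds(),
        "queue_s": (ts["started_at"] - ts["scheduled_at"]).total_seconds(),
        "total_s": (ts["completed_at"] - ts["scheduled_at"]).total_seconds(),
    }
```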
Access aggregated metrics via the Analytics Service for trend analysis.

Failure Analysis

When a job fails:
  1. Check job status for current state
  2. Review stderr logs for error messages
  3. Examine exit code or error details
  4. Verify workflow configuration
  5. Check resource availability (memory, disk, network)

Job Management

List Jobs

Retrieve all jobs for a workflow:
GET /api/v1/workflows/{workflow_id}/jobs?status=FAILED&trigger=AUTOMATIC
Filters:
  • status: Filter by job status
  • trigger: Filter by trigger type (AUTOMATIC/MANUAL)
  • cursor: Pagination cursor

Schedule Manual Job

Trigger a workflow manually:
POST /api/v1/workflows/{workflow_id}/jobs
Creates a job with trigger = MANUAL that executes immediately.

View Job Details

Get complete job information:
GET /api/v1/workflows/{workflow_id}/jobs/{job_id}
Includes all timestamps, status, and configuration.

Best Practices

Interval Management

1. Start Conservative: Begin with longer intervals and decrease as needed. This prevents resource exhaustion during initial testing.

2. Consider Execution Time: Ensure the interval exceeds the average execution time. If jobs take 5 minutes, use at least 10-minute intervals.

3. Account for Failures: Longer intervals allow time for failure investigation without rapid consecutive failures.

Log Management

  • Keep Logs Concise: Excessive logging impacts performance
  • Use Structured Output: JSON logs are easier to parse and search
  • Log Errors to stderr: Helps with filtering and analysis
  • Include Context: Timestamps, request IDs, relevant data

Error Handling

Use appropriate exit codes in your containerized applications:
  • 0: Success
  • 1: General error
  • 2: Misuse/invalid arguments
  • 130: Terminated by SIGINT (Ctrl+C)
  • Custom codes: Document them in your workflow

Handle transient failures gracefully:
  • Implement retries within your application code for transient failures
  • Use exponential backoff for external API calls
  • Set reasonable timeouts
  • Log retry attempts for debugging

Manage resources carefully:
  • Set memory limits to prevent OOM errors
  • Monitor execution time and optimize long-running jobs
  • Clean up temporary files and connections
  • Handle signals for graceful shutdown
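The retry-with-exponential-backoff advice above can be sketched as a small helper for your own application code (this is an illustrative pattern, not a Chronoverse API; `sleep` is injectable so tests can stub it):

```python
import time

def with_retries(call, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run `call`, retrying transient failures with exponential backoff.
    Delays double each attempt: base_delay, 2*base_delay, 4*base_delay, ...
    Re-raises the last exception if all attempts fail."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```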

Troubleshooting

Job Stuck in QUEUED

Possible Causes:
  • Workers are down or overloaded
  • Kafka consumer lag
  • Network connectivity issues
Solutions:
  • Check worker health and logs
  • Monitor Kafka consumer lag metrics
  • Scale workers horizontally if needed

Job Failing Consistently

Possible Causes:
  • Invalid workflow configuration
  • Missing dependencies in container
  • External service unavailability
  • Resource constraints
Solutions:
  • Review job logs for error messages
  • Test container locally with same configuration
  • Verify external dependencies
  • Check resource limits and utilization

Logs Not Appearing

Possible Causes:
  • JobLogs Processor is down
  • Kafka connection issues
  • ClickHouse write errors
Solutions:
  • Check JobLogs Processor status
  • Verify Kafka connectivity
  • Review ClickHouse logs for errors

Next Steps

Schedule a Job

Learn how to schedule manual jobs

Workers

Understand how workers execute jobs

Workflows

Learn about workflow configuration

API Reference

Complete jobs API documentation