Documentation Index

Fetch the complete documentation index at: https://mintlify.com/VisualGraphxLLC/API-HUB/llms.txt

Use this file to discover all available pages before exploring further.

API-HUB exposes purpose-built health and observability endpoints rather than bolting on an external APM tool. The sync pipeline records granular per-product outcomes in a JSONB errors column, per-supplier health is aggregated on demand, and n8n workflows act as the alerting layer — polling job status until a terminal state and posting to Slack on failure. This page covers everything you need to understand the operational state of the system at a glance and diagnose problems when they occur.

Health Endpoints

Three unauthenticated endpoints give you an immediate view of system liveness and data volumes.
Method + Path                Response
GET /health                  {"status": "ok", "service": "api-hub"}
GET /api/stats               {"suppliers": N, "products": N, "variants": N}
GET /api/sync-jobs/health    Array of SupplierSyncHealth objects — one per active supplier
/health is intentionally shallow — it confirms the process is running but does not probe the database. Use /api/stats as a deeper liveness check; it executes three COUNT(*) queries and will fail if the database connection pool is exhausted or the schema is broken.
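A deeper probe can combine a fetch of /api/stats with a sanity check on the payload. The sketch below is illustrative, not part of the API: BASE_URL and the validation rules are assumptions you would adapt to your deployment.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: point at your API-HUB instance

def stats_look_sane(stats: dict) -> bool:
    """Interpret an /api/stats payload as a liveness signal.

    The endpoint runs three COUNT(*) queries, so a well-formed response
    with non-negative integer counts implies the database is reachable
    and the schema is intact.
    """
    return all(
        isinstance(stats.get(key), int) and stats[key] >= 0
        for key in ("suppliers", "products", "variants")
    )

def deep_health_check() -> bool:
    """Raises on connection failure; returns False on a malformed payload."""
    with urlopen(f"{BASE_URL}/api/stats") as resp:
        return stats_look_sane(json.load(resp))
```

Wiring `deep_health_check` into a load balancer or uptime monitor gives you database-aware liveness without touching authenticated routes.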

Supplier Sync Health

GET /api/sync-jobs/health returns one record per supplier. Each record surfaces the fields you need to assess whether a supplier’s sync pipeline is healthy without querying the jobs table directly.
Field                     Type                 Meaning
supplier_id               UUID                 Supplier's primary key
supplier_name             String               Human-readable supplier name
is_active                 Boolean              Whether the supplier is enabled
last_full_sync            Timestamp or null    Last completed full-catalog sync
last_delta_sync           Timestamp or null    Last completed delta sync
last_sync_status          String               success / partial_success / failed
last_sync_completed_at    Timestamp or null    Wall-clock time the last job reached a terminal state
recent_error_count        Integer              Error count across the last 10 jobs
consecutive_failures      Integer              Number of consecutive failed jobs since the last success

Sync Job Monitoring

Individual sync jobs are recorded in the sync_jobs table and exposed via the admin API.
GET /api/sync-jobs
Returns all sync jobs with optional status filtering. Use this to scan for jobs stuck in pending or running states.

Reading the errors JSONB Field

When a job completes with partial_success or failed, the errors field contains structured per-product failure data. Query it to find which products failed and why.
{
  "status": "partial_success",
  "total_products": 412,
  "success_count": 409,
  "failed_count": 3,
  "errors": [
    {
      "product_id": "ABC-1234",
      "error_type": "SupplierError",
      "message": "SOAP fault: Product discontinued"
    },
    {
      "product_id": "XYZ-5678",
      "error_type": "TransientError",
      "message": "Connection timeout after 3 retries"
    }
  ]
}
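To spot patterns across one or more jobs' errors arrays — the dominant error_type, or product IDs that keep failing run after run — helpers like these work over the payload shape shown above (a sketch; the function names are not part of the API):

```python
from collections import Counter

def summarize_errors(job: dict) -> Counter:
    """Count a job's per-product failures by error_type."""
    return Counter(err["error_type"] for err in job.get("errors", []))

def recurring_products(jobs: list[dict], min_runs: int = 2) -> set[str]:
    """Product IDs that failed in at least `min_runs` of the given jobs.

    Recurring IDs suggest discontinued or malformed products at the
    supplier; non-recurring failures point at transient API instability.
    """
    seen = Counter()
    for job in jobs:
        # de-duplicate within a job so each run counts at most once
        for pid in {err["product_id"] for err in job.get("errors", [])}:
            seen[pid] += 1
    return {pid for pid, runs in seen.items() if runs >= min_runs}
```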

Error Classification and Retry Behaviour

The sync pipeline classifies every error into one of three types, each with different retry and recording behaviour.

TransientError

Network errors, connection timeouts, and HTTP 5xx responses from supplier APIs. Retried 3 times with exponential backoff: 2^(2 - retries) seconds between attempts (4s, 2s, 1s). If all retries are exhausted, the product is counted in failed_count and recorded in errors.

AuthError

SOAP fault codes 100, 104, or 110 — invalid credentials, expired token, or access denied. These are fatal: no retry is attempted, the entire job is marked failed, and the error is surfaced immediately. Check suppliers.auth_config via the Suppliers UI and re-run the sync after updating credentials.

SupplierError

Per-product failures such as malformed SOAP responses, discontinued products, or normalisation errors. The product is skipped, logged to errors, and counted in failed_count. The job continues processing remaining products and completes as partial_success if at least one product succeeded.
The backoff formula is 2 ** (2 - retries), where retries counts up from 0 across the three attempts. This gives delays of 4 seconds on the first retry, 2 seconds on the second, and 1 second on the third — front-loaded to absorb brief supplier hiccups without excessive wait time.
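A retry loop built on that formula might look like the following. This is a sketch, not the pipeline's implementation: `attempt` stands in for the real per-product sync call, and TimeoutError stands in for the TransientError classification.

```python
import time

def backoff_delay(retries: int) -> int:
    """Delay before the next attempt: 4s, 2s, 1s as retries counts 0, 1, 2."""
    return 2 ** (2 - retries)

def sync_with_retries(attempt, max_retries: int = 3, sleep=time.sleep):
    """Run `attempt`, retrying up to `max_retries` times on transient errors.

    One initial attempt plus three retries, with 4s / 2s / 1s waits
    between them; the final failure propagates to the caller.
    """
    for retries in range(max_retries):
        try:
            return attempt()
        except TimeoutError:  # stand-in for a TransientError
            sleep(backoff_delay(retries))
    return attempt()  # last attempt: let any exception escape
```

Injecting `sleep` as a parameter keeps the backoff schedule testable without real waits.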

Interpreting Health Signals

Use these thresholds when reviewing /api/sync-jobs/health output or building alerting rules around it.
A streak of more than three consecutive failed jobs almost always indicates a persistent problem rather than transient supplier instability: expired credentials, a changed SOAP endpoint URL, an IP allowlist block, or a supplier outage. Check last_sync_status and the errors field of the most recent job. If error_type is AuthError, update the supplier’s credentials in the Suppliers UI.
A high error count across the last 10 jobs with last_sync_status: partial_success typically indicates intermittent supplier API instability — individual products failing while the overall sync succeeds. Review the errors JSONB for patterns: if the same product IDs recur across jobs, the products may be discontinued or malformed at the supplier. If the product IDs vary, the supplier API is likely rate-limiting or timing out under load.
A null last_full_sync means no full sync has ever completed for this supplier — verify the adapter class is set correctly in the Suppliers UI and trigger a manual import with mode=full_sellable. A timestamp older than your expected full-sync cadence (weekly by default) means the scheduled workflow is not triggering or is failing before a job record is created.
A last_sync_status of partial_success is the most common non-green status and is not necessarily alarming. A supplier with 3 failed products out of 400 is partial_success. The threshold that warrants investigation is when failed_count / total_products is high (>5%) or when the same products fail repeatedly across multiple runs.

Audit Log

Every authenticated API operation is recorded by AuditLogMiddleware, which is registered globally on the FastAPI app (main.py:214).
GET /api/audit-log
Each audit record captures:
  • user — the authenticated user who made the request
  • action — HTTP method and route pattern
  • resource — the specific resource ID affected (supplier, product, customer, etc.)
  • timestamp — wall-clock time of the request
The audit log is useful for tracing configuration changes — for example, identifying which user updated a supplier’s credentials before a sync started failing.

Background Scheduler

The backend runs a lightweight background scheduler implemented as an asyncio.Task that triggers a full sync across all active suppliers every 24 hours.
1. Startup

The scheduler task is created in the FastAPI lifespan handler (main.py:178) immediately after the database is ready. It sleeps first, so no sync is triggered on startup — only after the full interval has elapsed.
_scheduler_task = asyncio.create_task(start_scheduler(interval_hours=24))
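The loop behind that task can be sketched as follows. The body is an assumption reconstructed from the behaviour described on this page (sleep-first, honours DISABLE_SCHEDULER, a failed sync must not kill the loop); `sync_all` stands in for the real all-supplier sync, and the seconds-based signature is for illustration — the real entry point takes interval_hours.

```python
import asyncio
import os

async def start_scheduler(sync_all, interval_seconds: float = 24 * 3600):
    """Sleep-first scheduler loop: nothing runs at startup."""
    if os.getenv("DISABLE_SCHEDULER", "").lower() == "true":
        return  # suppressed: the task exits without doing any work
    while True:
        await asyncio.sleep(interval_seconds)  # sleep first, then sync
        try:
            await sync_all()
        except asyncio.CancelledError:
            raise  # let shutdown cancellation propagate
        except Exception as exc:
            # one failed sync must not kill the scheduler loop
            print(f"scheduled sync failed: {exc!r}")
```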
2. Suppress the scheduler

Set DISABLE_SCHEDULER=true to prevent the task from doing any work. This is useful in environments where syncs are driven exclusively by n8n cron workflows, or when debugging background activity.
DISABLE_SCHEDULER=true
3. Graceful shutdown

On process termination, the lifespan handler cancels the scheduler task before closing the database connection pool. This ensures in-flight sync operations are not killed mid-transaction and the pool is returned cleanly.
_scheduler_task.cancel()
try:
    await _scheduler_task
except asyncio.CancelledError:
    pass
await engine.dispose()

n8n Workflow Alerting

The n8n cron workflows act as the primary alerting mechanism for sync failures. After triggering a sync job via POST /api/suppliers/{id}/import, each workflow polls GET /api/sync-jobs/{id} on a configurable interval until the job reaches a terminal status (success, partial_success, or failed). On failed, the workflow posts a Slack alert containing the supplier name, job ID, and error summary.
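The same poll-until-terminal step is easy to reproduce outside n8n, e.g. in an ad-hoc script. A sketch: `fetch_job` stands in for a GET /api/sync-jobs/{id} call, and the interval and poll cap are assumptions, not workflow defaults.

```python
import time

TERMINAL_STATUSES = {"success", "partial_success", "failed"}

def poll_until_terminal(fetch_job, job_id, interval_s=30, max_polls=240,
                        sleep=time.sleep):
    """Poll a sync job until it reaches a terminal status.

    `fetch_job(job_id)` must return the job as a dict with a `status`
    key. Raises TimeoutError if the cap is hit first, which usually
    means the job is stuck in pending or running.
    """
    for _ in range(max_polls):
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job  # caller decides whether failed warrants an alert
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not reach a terminal status")
```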
Workflow                 Schedule               Mode
catalog-sync-weekly      Sunday 1 AM            full_sellable
pricing-sync-daily       Daily                  delta
inventory-sync-hourly    Hourly                 delta
closeouts-monthly        1st of month           closeouts
ops-push                 Triggered on demand
All n8n cron workflows ship with active: false. You must activate them manually in the n8n editor after binding the OnPrintShop OAuth2 credentials for each customer. Activating a workflow before credentials are set will cause it to fail on every run.
