Draft Thinker is configured via a YAML file. Pass the path with the --config flag when starting the gateway (defaults to config.yaml).
./draft-thinker --config config.yaml

Default config.yaml

server:
  port: 8080
  read_timeout: 30
  write_timeout: 120
  idle_timeout: 60

drafter:
  provider: openai
  base_url: https://api.openai.com/v1
  model: gpt-4.1-nano
  timeout: 30

heavyweight:
  provider: openai
  base_url: https://api.openai.com/v1
  model: gpt-4.1
  timeout: 60

entropy:
  threshold: 2.0
  window_size: 10
  early_exit_count: 10
  top_logprobs: 5

speculative:
  enabled: true
  soft_threshold_mult: 0.8

cache:
  enabled: true
  similarity_threshold: 0.95
  ttl_seconds: 3600
  embedding_model: text-embedding-3-small
  embedding_dimensions: 1536
  qdrant_collection: draftthinker_cache

metrics:
  enabled: true
  path: /metrics

server

HTTP server settings.
server.port
number
default:"8080"
Port the gateway listens on. The gateway exposes POST /v1/chat/completions on this port.
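For illustration, a minimal client call against a locally running gateway might look like the sketch below. The host, port, and request body follow the defaults above; whether the gateway expects an Authorization header of its own is not documented here, so none is shown.

```python
import json
import urllib.request

# Build an OpenAI-compatible chat completion request (assumes the
# default port 8080 from config.yaml; adjust the host if yours differs).
payload = {
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# resp = urllib.request.urlopen(req)  # uncomment with a running gateway
```

The model field is omitted on purpose: the gateway decides between the drafter and the heavyweight itself based on entropy.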
server.read_timeout
number
default:"30"
Maximum duration in seconds for reading the full request, including the body. Connections that take longer are closed.
server.write_timeout
number
default:"120"
Maximum duration in seconds for writing the response. Set this higher than the heavyweight model’s timeout to avoid cutting off slow responses.
server.idle_timeout
number
default:"60"
Maximum duration in seconds to wait for the next request on a keep-alive connection before closing it.

drafter

Configuration for the fast, cheap model that handles all requests first.
drafter.provider
string
default:"openai"
Model provider. Currently openai is supported. The client uses the OpenAI-compatible API format.
drafter.base_url
string
default:"https://api.openai.com/v1"
Base URL for the drafter’s API. Override to point at a compatible local or third-party endpoint.
drafter.model
string
default:"gpt-4.1-nano"
Model name sent in requests to the drafter. Must support logprobs and top_logprobs parameters for entropy analysis to function.
drafter.timeout
number
default:"30"
Per-request timeout in seconds for the drafter. If the drafter does not complete within this duration, the request is escalated.

heavyweight

Configuration for the frontier model used when the drafter’s entropy exceeds the threshold.
heavyweight.provider
string
default:"openai"
Model provider. Currently openai is supported.
heavyweight.base_url
string
default:"https://api.openai.com/v1"
Base URL for the heavyweight model’s API.
heavyweight.model
string
default:"gpt-4.1"
Model name sent in requests to the heavyweight. This is the escalation target — choose a model with strong reasoning capabilities.
heavyweight.timeout
number
default:"60"
Per-request timeout in seconds for the heavyweight. This should be larger than drafter.timeout because frontier models have higher latency.

entropy

Controls the Shannon entropy algorithm that drives routing decisions.
entropy.threshold
number
default:"2.0"
Calibrated entropy threshold T in bits. If windowed entropy exceeds this value during drafter generation, the request is escalated to the heavyweight. The value 2.0 was determined by sweeping a 518-prompt benchmark dataset and finding the knee of the accuracy-cost curve. Lower values escalate more aggressively (higher accuracy, higher cost); higher values escalate less (lower accuracy, lower cost).
entropy.window_size
number
default:"10"
Number of tokens in the sliding window used to compute windowed average entropy. A window smooths noise from individual uncertain tokens (for example, rare proper nouns) that do not indicate reasoning failure.
entropy.early_exit_count
number
default:"10"
Number of initial tokens evaluated for the early-exit check. If entropy over the first early_exit_count tokens exceeds the threshold, the draft is aborted immediately and the request is escalated without completing the draft, avoiding wasted compute.
entropy.top_logprobs
number
default:"5"
Number of top token candidates (with log-probabilities) to request from the drafter per token. Used to compute per-token Shannon entropy: H = -Σ p(x) log₂ p(x). The OpenAI API supports values 0–20.
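The per-token and windowed entropy computations described above can be sketched as follows. The function names are illustrative, not the gateway's actual internals; note that restricting to the top-k candidates ignores probability mass outside the top-k, so this is an estimate of the true entropy.

```python
import math

def token_entropy(top_logprobs):
    """Shannon entropy H = -sum p(x) * log2 p(x) over the top candidates.

    top_logprobs: natural-log probabilities, as returned in the OpenAI
    API's top_logprobs field for a single token position.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def windowed_entropy(per_token_entropies, window_size=10):
    """Average entropy over the trailing window_size tokens."""
    window = per_token_entropies[-window_size:]
    return sum(window) / len(window)

# A uniform distribution over 4 candidates has entropy log2(4) = 2 bits,
# which sits exactly at the default threshold of 2.0.
uniform = [math.log(0.25)] * 4
assert abs(token_entropy(uniform) - 2.0) < 1e-9
```

A near-certain token (one candidate with probability close to 1) contributes close to 0 bits, which is why windowing matters: a single rare proper noun cannot trigger escalation on its own.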

speculative

Controls speculative execution — the parallel heavyweight pre-fetch that reduces latency on escalated requests.
speculative.enabled
boolean
default:"true"
Enable or disable speculative execution. When enabled, a parallel heavyweight request is fired as soon as early tokens indicate elevated uncertainty. When disabled, draft-then-verify is strictly serial.
speculative.soft_threshold_mult
number
default:"0.8"
Multiplier applied to entropy.threshold to compute the soft threshold. When windowed entropy exceeds soft_threshold_mult × threshold (default: 0.8 × 2.0 = 1.6 bits) during the first tokens, the gateway fires a speculative parallel call to the heavyweight model. If the drafter’s entropy subsequently drops below threshold, the heavyweight call is canceled and the draft is accepted. If entropy stays elevated, the heavyweight response is used and the additional latency is heavyweight_total - drafter_abort_time rather than the full heavyweight latency.
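The two-threshold routing decision described above can be sketched as a small classifier. This is an illustration of the rule, not the gateway's source; the return labels are invented for clarity.

```python
def routing_decision(windowed_h, threshold=2.0, soft_mult=0.8):
    """Classify the current windowed entropy against the two thresholds.

    Returns one of:
      "escalate"  - hard threshold exceeded: route to the heavyweight
      "speculate" - soft threshold exceeded: fire a parallel heavyweight call
      "draft"     - entropy is low: keep drafting
    """
    if windowed_h > threshold:
        return "escalate"
    if windowed_h > soft_mult * threshold:  # default: 0.8 * 2.0 = 1.6 bits
        return "speculate"
    return "draft"

assert routing_decision(1.0) == "draft"
assert routing_decision(1.7) == "speculate"
assert routing_decision(2.3) == "escalate"
```

A "speculate" result is not final: if entropy later falls back below the hard threshold, the speculative heavyweight call is canceled and the draft is accepted.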

cache

Controls the semantic cache backed by Qdrant (vector store) and Redis (metadata/TTLs).
The cache requires REDIS_URL and QDRANT_URL environment variables at runtime. They default to localhost:6379 and http://localhost:6333 respectively if not set. Only draft-accepted responses are cached — escalated responses indicate drafter uncertainty and are not stored.
cache.enabled
boolean
default:"true"
Enable or disable the semantic cache. When disabled, every request goes through the full draft-verify pipeline.
cache.similarity_threshold
number
default:"0.95"
Minimum cosine similarity between the incoming prompt’s embedding and a cached entry for a cache hit to be returned. The value 0.95 is intentionally conservative to avoid serving stale or semantically drifted responses.
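In practice Qdrant performs this similarity search server-side; the sketch below only illustrates the hit criterion itself, with illustrative function names.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(query_vec, cached_vec, threshold=0.95):
    """A cached entry is served only above the similarity threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold

assert is_cache_hit([1.0, 0.0], [1.0, 0.0])       # identical prompts
assert not is_cache_hit([1.0, 0.0], [0.0, 1.0])   # unrelated prompts
```

Lowering the threshold raises the hit rate but increases the risk of serving a response to a prompt that is only superficially similar.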
cache.ttl_seconds
number
default:"3600"
Time-to-live in seconds for cached entries. After this duration, entries expire and subsequent similar prompts will go through the draft-verify pipeline again.
cache.embedding_model
string
default:"text-embedding-3-small"
OpenAI embedding model used to convert prompts to vectors. This model is called on every request (for both cache lookup and cache population). Must be consistent with the embedding_dimensions value.
cache.embedding_dimensions
number
default:"1536"
Dimensionality of the embedding vectors. Must match the output dimensions of embedding_model. For text-embedding-3-small, this is 1536.
cache.qdrant_collection
string
default:"draftthinker_cache"
Name of the Qdrant collection used to store and query cached embeddings. The collection is created automatically on first startup if it does not exist.

metrics

Controls Prometheus metrics exposure.
metrics.enabled
boolean
default:"true"
Enable or disable the Prometheus metrics endpoint. When disabled, a no-op recorder is used internally and no metrics are exported.
metrics.path
string
default:"/metrics"
HTTP path on which Prometheus metrics are served. Prometheus is configured by default to scrape this endpoint every 15 seconds.

Environment variables

The following environment variables are read at startup and are not part of config.yaml:
OPENAI_API_KEY
required: yes
API key for OpenAI. Used for both model calls and embeddings. The gateway exits immediately if this is not set.
REDIS_URL
required: no
default:"localhost:6379"
Address of the Redis instance used for cache metadata and TTLs.
QDRANT_URL
required: no
default:"http://localhost:6333"
Base URL of the Qdrant instance used for vector similarity search.
