Feature Engineering, Versioning, and the Feature Store

Features are the foundational inputs to every ML model in the Hedge Fund Backend. Rather than re-computing indicators on every training or backtest run, the platform separates a feature definition (the plugin key and parameters) from a feature dataset (the actual computed values for a specific symbol, timeframe, and date range). The Feature Engine manages this split: it generates datasets on demand, content-addresses them with a SHA-256 hash, persists them as Parquet files in S3/MinIO, and records metadata in Postgres — so the same computation is never run twice unless the underlying data changes.

Feature Definitions

A Feature row is a definition — it tells the engine which plugin to run and with what parameters. It does not store the actual time-series values.

{
  "id": "f1000000-0000-0000-0000-000000000001",
  "name": "RSI-14",
  "type": "technical",
  "description": "14-period Relative Strength Index computed via pandas-ta",
  "plugin_key": "technical.rsi",
  "parameters": { "length": 14 },
  "storage_uri": "s3://feature-store/technical.rsi/",
  "version": 1,
  "created_at": "2024-01-15T09:00:00Z",
  "updated_at": "2024-01-15T09:00:00Z"
}

Definition Fields

| Field | Type | Description |
| --- | --- | --- |
| `name` | `string` | Human-readable label for the feature |
| `type` | `string` | Feature family; see Feature Types below |
| `plugin_key` | `string` | Registry key that resolves to a `BaseFeature` subclass |
| `parameters` | `object` | Plugin-specific hyperparameters (e.g. `{"length": 14}`) |
| `storage_uri` | `string \| null` | S3/MinIO key prefix where generated datasets are stored |
| `version` | `integer` | Optimistic-lock version; increments on every update |
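
The optimistic-lock role of `version` can be illustrated with a small sketch. The `FeatureDefinition` dataclass, `StaleVersionError`, and `apply_update` are hypothetical names chosen for illustration, not the platform's actual models. The sketch assumes only that an update must carry the version it was based on and that a successful update increments it.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Optional


class StaleVersionError(Exception):
    """Raised when an update was based on an outdated version (hypothetical)."""


@dataclass(frozen=True)
class FeatureDefinition:
    # Field names mirror the Definition Fields table; the class is illustrative.
    name: str
    type: str
    plugin_key: str
    parameters: dict = field(default_factory=dict)
    storage_uri: Optional[str] = None
    version: int = 1


def apply_update(
    current: FeatureDefinition, expected_version: int, **changes: Any
) -> FeatureDefinition:
    """Return an updated copy, rejecting writes based on a stale version."""
    if expected_version != current.version:
        raise StaleVersionError(
            f"expected version {expected_version}, found {current.version}"
        )
    return replace(current, version=current.version + 1, **changes)


rsi = FeatureDefinition(
    name="RSI-14",
    type="technical",
    plugin_key="technical.rsi",
    parameters={"length": 14},
    storage_uri="s3://feature-store/technical.rsi/",
)
rsi_v2 = apply_update(rsi, expected_version=1, parameters={"length": 21})
```

A second writer that still holds version 1 would get `StaleVersionError` and would need to re-read the definition before retrying.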

Feature Types

The type field categorises the data source the plugin consumes:

technical

Computed from OHLCV price/volume data — RSI, ATR, Bollinger Bands, MACD, etc. Implemented via pandas-ta.

statistical

Derived from statistical transforms of price history — autocorrelations, rolling moments, structural breaks. Implemented via tsfresh.

automated

Auto-extracted by tsfresh’s extract_features across hundreds of time-series statistics at once.

news

Sentiment scores and entity counts from news feeds — bullish/bearish polarity per symbol per day.

fundamental

Balance-sheet and income-statement ratios (P/E, P/B, ROE, etc.) from quarterly filings.

macro

Macroeconomic indicators — yield curve slope, VIX level, PMI, CPI surprise.

Feature Generation Pipeline

When you call POST /api/features/{id}/generate, the Feature Engine runs the following pipeline:

Market/Alt Data
     │
     ▼
SHA-256 Hash              ← version_hash = hash(plugin_key, params, symbol,
     │                        timeframe, date_range, source_fingerprint)
     │
     ├── Cache hit?  ──── YES ──► Return existing FeatureDataset from Postgres
     │
     NO
     │
     ▼
Plugin.compute()          ← BaseFeature subclass resolved via plugin_key
     │
     ▼
Persist Parquet           ← s3://feature-store/{plugin_key}/{symbol}/{hash}.parquet
     │
     ▼
Write FeatureDataset row  ← Postgres: feature_id, symbol, timeframe, version_hash,
     │                        storage_uri, row_count, columns, source_fingerprint
     ▼
Return FeatureGenerateResponse

The generation request specifies the symbol, timeframe, and date window:

{
  "symbol": "AAPL",
  "timeframe": "1d",
  "start_date": "2020-01-01T00:00:00Z",
  "end_date": "2024-01-01T00:00:00Z"
}
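
A client call might look like the following sketch. It only builds the request with the Python standard library and does not send it. The base URL `http://localhost:8000` and the absence of authentication headers are assumptions, so adjust both for your deployment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with your deployment's host
FEATURE_ID = "f1000000-0000-0000-0000-000000000001"


def build_generate_request(feature_id: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /api/features/{id}/generate request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/features/{feature_id}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request(
    FEATURE_ID,
    {
        "symbol": "AAPL",
        "timeframe": "1d",
        "start_date": "2020-01-01T00:00:00Z",
        "end_date": "2024-01-01T00:00:00Z",
    },
)
# To send it, pass `request` to urllib.request.urlopen() and parse the JSON response.
```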

Content-Hash Versioning

The most important property of the Feature Store is content-addressed deduplication. The version_hash for a dataset is a SHA-256 digest of all its inputs:

import hashlib
import json

# Every input that determines the dataset's contents goes into the payload.
payload = {
    "plugin_key": "technical.rsi",
    "params": {"length": 14},
    "symbol": "AAPL",
    "timeframe": "1d",
    "start_date": "2020-01-01T00:00:00",
    "end_date":   "2024-01-01T00:00:00",
    "source_fingerprint": "<sha256 of raw OHLCV bytes>",
}
# Sorted keys and fixed separators make the JSON byte-stable across runs.
stable = json.dumps(payload, sort_keys=True, default=str, separators=(",", ":"))
version_hash = hashlib.sha256(stable.encode("utf-8")).hexdigest()

This gives three guarantees:

Reproducibility

Identical inputs always produce the same hash, so a given version_hash identifies exactly one dataset.

Cache Hits

If the hash already exists in feature_datasets, the engine skips computation entirely and returns the stored dataset.

Automatic Invalidation

If upstream market data is revised (corporate actions, late-arriving prints), the source_fingerprint changes, producing a new hash and triggering automatic regeneration.

The source_fingerprint is a SHA-256 of the raw OHLCV bytes fed to the plugin, so a backfill or data vendor correction invalidates cached features without any manual intervention.
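
The three guarantees can be seen together in a minimal sketch. The helper names (`source_fingerprint`, `version_hash`, `generate`), the in-memory `datasets` dict standing in for the `feature_datasets` table, and the made-up CSV bytes are all assumptions for illustration; the real engine's internals may differ.

```python
import hashlib
import json
from typing import Dict


def source_fingerprint(raw_ohlcv: bytes) -> str:
    """SHA-256 of the raw bytes fed to the plugin."""
    return hashlib.sha256(raw_ohlcv).hexdigest()


def version_hash(plugin_key, params, symbol, timeframe, start, end, fingerprint) -> str:
    """SHA-256 over a byte-stable JSON encoding of every input."""
    payload = {
        "plugin_key": plugin_key,
        "params": params,
        "symbol": symbol,
        "timeframe": timeframe,
        "start_date": start,
        "end_date": end,
        "source_fingerprint": fingerprint,
    }
    stable = json.dumps(payload, sort_keys=True, default=str, separators=(",", ":"))
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()


datasets: Dict[str, dict] = {}  # stand-in for the feature_datasets table
compute_calls = 0               # counts how often the "plugin" actually runs


def generate(raw_ohlcv: bytes) -> dict:
    """Return the cached dataset for these inputs, computing it only on a miss."""
    global compute_calls
    h = version_hash(
        "technical.rsi", {"length": 14}, "AAPL", "1d",
        "2020-01-01T00:00:00", "2024-01-01T00:00:00",
        source_fingerprint(raw_ohlcv),
    )
    if h in datasets:       # cache hit: skip computation
        return datasets[h]
    compute_calls += 1      # placeholder for plugin.compute() and the Parquet write
    datasets[h] = {"version_hash": h}
    return datasets[h]


# Made-up rows; the revised copy differs in a single close price.
original = b"ts,open,high,low,close,volume\n2020-01-02,100.0,101.0,99.5,100.5,1000\n"
revised = b"ts,open,high,low,close,volume\n2020-01-02,100.0,101.0,99.5,100.6,1000\n"

first = generate(original)   # miss: computes
second = generate(original)  # hit: same hash, no recomputation
third = generate(revised)    # miss: fingerprint changed, so the hash changed
```

Here `first` and `second` share one hash and one computation, while the single revised price in `revised` is enough to produce a different hash and a second computation.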

FeatureDataset

A FeatureDataset row is one generated instance of a feature definition. Multiple datasets can exist for the same feature definition (different symbols, different date ranges, or different source data versions).

{
  "id": "d1000000-0000-0000-0000-000000000001",
  "feature_id": "f1000000-0000-0000-0000-000000000001",
  "symbol": "AAPL",
  "timeframe": "1d",
  "start_date": "2020-01-01T00:00:00Z",
  "end_date": "2024-01-01T00:00:00Z",
  "version_hash": "a3f5c2b8d9e1f04762890abc1234567890abcdef1234567890abcdef12345678",
  "storage_uri": "s3://feature-store/technical.rsi/AAPL/a3f5c2b8.parquet",
  "row_count": 1006,
  "columns": ["timestamp", "rsi_14"],
  "created_at": "2024-01-15T09:05:00Z"
}

The feature_datasets table has a composite index on (feature_id, symbol, timeframe, version_hash), so cache lookups resolve through the index rather than a table scan as the table grows.
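
The lookup this index serves can be sketched as follows. SQLite from the Python standard library stands in for Postgres so the example runs anywhere. The reduced column set, the index name `ix_feature_datasets_lookup`, and the `find_cached` helper are assumptions, not the platform's schema.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE feature_datasets (
        id           TEXT PRIMARY KEY,
        feature_id   TEXT NOT NULL,
        symbol       TEXT NOT NULL,
        timeframe    TEXT NOT NULL,
        version_hash TEXT NOT NULL,
        storage_uri  TEXT NOT NULL
    );
    CREATE INDEX ix_feature_datasets_lookup
        ON feature_datasets (feature_id, symbol, timeframe, version_hash);
    """
)

FEATURE_ID = "f1000000-0000-0000-0000-000000000001"
HASH = "a3f5c2b8d9e1f04762890abc1234567890abcdef1234567890abcdef12345678"
URI = "s3://feature-store/technical.rsi/AAPL/a3f5c2b8.parquet"

conn.execute(
    "INSERT INTO feature_datasets VALUES (?, ?, ?, ?, ?, ?)",
    ("d1000000-0000-0000-0000-000000000001", FEATURE_ID, "AAPL", "1d", HASH, URI),
)


def find_cached(feature_id: str, symbol: str, timeframe: str, version_hash: str) -> Optional[str]:
    """Return the stored dataset's URI, or None when no matching row exists."""
    row = conn.execute(
        "SELECT storage_uri FROM feature_datasets "
        "WHERE feature_id = ? AND symbol = ? AND timeframe = ? AND version_hash = ?",
        (feature_id, symbol, timeframe, version_hash),
    ).fetchone()
    return row[0] if row else None


hit = find_cached(FEATURE_ID, "AAPL", "1d", HASH)       # existing dataset
miss = find_cached(FEATURE_ID, "AAPL", "1d", "0" * 64)  # unknown hash
```

Because the query filters on exactly the indexed columns, every predicate is an equality match against the index.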

Built-in Plugins

technical.rsi

RSI — Relative Strength Index. Parameter: length (default 14). Output column: rsi_{length}.

technical.atr

ATR — Average True Range. Parameter: length (default 14). Output column: atr_{length}.

statistical.tsfresh

tsfresh — Extracts a configurable subset of the tsfresh feature library (autocorrelations, entropy, linear trend coefficients, etc.).

news.sentiment

News Sentiment — Aggregates intraday news polarity scores into a daily bullish/bearish score per symbol. Output columns: sentiment_score, sentiment_count.

All plugins implement the BaseFeature interface from app/plugins/base.py. You can add custom plugins by subclassing BaseFeature, setting a unique key, and registering it in the feature plugin registry — without touching any engine code.
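
A custom plugin might be structured like the sketch below. Since app/plugins/base.py is not reproduced here, the sketch defines its own stand-in `BaseFeature`, a `FEATURE_REGISTRY` dict, and a `register` decorator. All three are hypothetical, and the real interface may differ (for instance, it may operate on DataFrames rather than lists). The `custom.sma` plugin is likewise an invented example.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Type


class BaseFeature(ABC):
    """Stand-in for the real base class; its actual interface may differ."""

    key: str = ""

    def __init__(self, **parameters):
        self.parameters = parameters

    @abstractmethod
    def compute(self, closes: List[float]) -> Dict[str, List[Optional[float]]]:
        """Return output columns keyed by column name."""


FEATURE_REGISTRY: Dict[str, Type[BaseFeature]] = {}


def register(cls: Type[BaseFeature]) -> Type[BaseFeature]:
    """Add a plugin class to the registry under its unique key."""
    if cls.key in FEATURE_REGISTRY:
        raise ValueError(f"duplicate plugin key: {cls.key}")
    FEATURE_REGISTRY[cls.key] = cls
    return cls


@register
class SimpleMovingAverage(BaseFeature):
    key = "custom.sma"

    def compute(self, closes):
        length = self.parameters.get("length", 14)
        values: List[Optional[float]] = []
        for i in range(len(closes)):
            if i + 1 < length:
                values.append(None)  # not enough history for a full window
            else:
                window = closes[i + 1 - length : i + 1]
                values.append(sum(window) / length)
        # Column naming follows the built-in plugins' {name}_{length} pattern.
        return {f"sma_{length}": values}


# Resolve the plugin by key, as the engine would via plugin_key.
plugin = FEATURE_REGISTRY["custom.sma"](length=3)
result = plugin.compute([1.0, 2.0, 3.0, 4.0, 5.0])
```

Resolving the class through the registry key is what lets a new plugin be added without changes to the engine itself.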

Regeneration

To force recomputation regardless of cache state — for example, after changing plugin parameters or fixing a data source — call:

POST /api/features/{id}/regenerate

with the same FeatureGenerateRequest body. The engine computes a new source_fingerprint from the current market data slice, derives a new version_hash, and writes a new FeatureDataset row pointing at a freshly generated Parquet file. The old dataset row is retained for historical reproducibility.

Redis Caching

In addition to the Parquet + Postgres persistence layer, frequently accessed feature datasets are cached in Redis. On a cache hit the engine deserialises the dataset directly from Redis without an S3 round-trip. Cache entries carry a configurable TTL and are evicted when a new FeatureDataset version is written for the same (feature_id, symbol, timeframe) tuple.
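
The TTL and eviction behaviour can be sketched with an in-memory stand-in for Redis. The `DatasetCache` class, its method names, and the injected clock are assumptions for illustration; a real deployment would issue the equivalent commands through a Redis client.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (feature_id, symbol, timeframe)


class DatasetCache:
    """In-memory stand-in for the Redis layer (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[Key, Tuple[str, bytes, float]] = {}

    def put(self, key: Key, version_hash: str, payload: bytes) -> None:
        # Writing a new version replaces any older entry for the same key.
        expires_at = self.clock() + self.ttl_seconds
        self._entries[key] = (version_hash, payload, expires_at)

    def get(self, key: Key, version_hash: str) -> Optional[bytes]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        cached_hash, payload, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # TTL elapsed
            return None
        # Only serve the payload if it belongs to the requested version.
        return payload if cached_hash == version_hash else None


now = [0.0]  # fake clock so the example does not need to sleep
cache = DatasetCache(ttl_seconds=60, clock=lambda: now[0])
key = ("f1000000-0000-0000-0000-000000000001", "AAPL", "1d")

cache.put(key, "hash-v1", b"dataset-bytes-v1")
hit = cache.get(key, "hash-v1")      # served from the cache

cache.put(key, "hash-v2", b"dataset-bytes-v2")
stale = cache.get(key, "hash-v1")    # old version was evicted by the new write
fresh = cache.get(key, "hash-v2")

now[0] = 61.0
expired = cache.get(key, "hash-v2")  # TTL elapsed, falls back to S3
```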

API Reference

For full endpoint documentation — including listing datasets, previewing generated data, and bulk generation across a symbol universe — see the Features API.
