Features are the foundational inputs to every ML model in the Hedge Fund Backend. Rather than re-computing indicators on every training or backtest run, the platform separates a feature definition (the plugin key and parameters) from a feature dataset (the actual computed values for a specific symbol, timeframe, and date range). The Feature Engine manages this split: it generates datasets on demand, content-addresses them with a SHA-256 hash, persists them as Parquet files in S3/MinIO, and records metadata in Postgres — so the same computation is never run twice unless the underlying data changes.

Feature Definitions

A Feature row is a definition — it tells the engine which plugin to run and with what parameters. It does not store the actual time-series values.
{
  "id": "f1000000-0000-0000-0000-000000000001",
  "name": "RSI-14",
  "type": "technical",
  "description": "14-period Relative Strength Index computed via pandas-ta",
  "plugin_key": "technical.rsi",
  "parameters": { "length": 14 },
  "storage_uri": "s3://feature-store/technical.rsi/",
  "version": 1,
  "created_at": "2024-01-15T09:00:00Z",
  "updated_at": "2024-01-15T09:00:00Z"
}

Definition Fields

| Field | Type | Description |
| --- | --- | --- |
| name | string | Human-readable label for the feature |
| type | string | Feature family (see Feature Types below) |
| plugin_key | string | Registry key that resolves to a BaseFeature subclass |
| parameters | object | Plugin-specific hyperparameters (e.g. {"length": 14}) |
| storage_uri | string or null | S3/MinIO key prefix where generated datasets are stored |
| version | integer | Optimistic-lock version; increments on every update |
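
The optimistic-lock behaviour of the version field can be sketched as follows. This is a minimal illustration, not the backend's code: the FeatureDefinition dataclass, StaleVersionError, and apply_update are hypothetical names, and the only behaviour taken from the table is that version increments on every update.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Optional


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical mirror of the definition fields listed in the table."""
    name: str
    type: str
    plugin_key: str
    parameters: dict[str, Any] = field(default_factory=dict)
    storage_uri: Optional[str] = None
    version: int = 1


class StaleVersionError(Exception):
    """Raised when the caller's copy is older than the stored row."""


def apply_update(stored: FeatureDefinition, expected_version: int,
                 **changes: Any) -> FeatureDefinition:
    # Reject the write if another writer updated the row in the meantime.
    if stored.version != expected_version:
        raise StaleVersionError(
            f"expected v{expected_version}, found v{stored.version}"
        )
    # Every accepted update bumps the version by one.
    return replace(stored, **changes, version=stored.version + 1)


rsi = FeatureDefinition(
    name="RSI-14",
    type="technical",
    plugin_key="technical.rsi",
    parameters={"length": 14},
)
rsi_v2 = apply_update(rsi, expected_version=1, parameters={"length": 21})
print(rsi_v2.version)  # 2
```

A second writer still holding version 1 would get StaleVersionError from apply_update(rsi_v2, expected_version=1, ...) and would need to re-read the definition before retrying.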

Feature Types

The type field categorises the data source the plugin consumes:

technical

Computed from OHLCV price/volume data — RSI, ATR, Bollinger Bands, MACD, etc. Implemented via pandas-ta.

statistical

Derived from statistical transforms of price history — autocorrelations, rolling moments, structural breaks. Implemented via tsfresh.

automated

Auto-extracted by tsfresh’s extract_features across hundreds of time-series statistics at once.

news

Sentiment scores and entity counts from news feeds — bullish/bearish polarity per symbol per day.

fundamental

Balance-sheet and income-statement ratios (P/E, P/B, ROE, etc.) from quarterly filings.

macro

Macroeconomic indicators — yield curve slope, VIX level, PMI, CPI surprise.

Feature Generation Pipeline

When you call POST /api/features/{id}/generate, the Feature Engine runs the following pipeline:
Market/Alt Data
     │
     ▼
Plugin.compute()          ← BaseFeature subclass resolved via plugin_key
     │
     ▼
SHA-256 Hash              ← version_hash = hash(plugin_key, params, symbol,
     │                        timeframe, date_range, source_fingerprint)
     │
     ├── Cache hit?  ──── YES ──► Return existing FeatureDataset from Postgres
     │
     NO
     │
     ▼
Persist Parquet           ← s3://feature-store/{plugin_key}/{symbol}/{hash}.parquet
     │
     ▼
Write FeatureDataset row  ← Postgres: feature_id, symbol, timeframe, version_hash,
     │                        storage_uri, row_count, columns, source_fingerprint
     ▼
Return FeatureGenerateResponse
The generation request specifies the symbol, timeframe, and date window:
{
  "symbol": "AAPL",
  "timeframe": "1d",
  "start_date": "2020-01-01T00:00:00Z",
  "end_date": "2024-01-01T00:00:00Z"
}
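
As a rough client-side sketch, the request above could be assembled with the Python standard library. BASE_URL and build_generate_request are assumptions made for illustration; the page does not state the host, port, or authentication scheme, so the sketch only builds the request and leaves sending it to the reader.

```python
import json
import urllib.request

# Assumption: host and port are placeholders for your own deployment.
BASE_URL = "http://localhost:8000"


def build_generate_request(feature_id: str, symbol: str, timeframe: str,
                           start_date: str, end_date: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/features/{id}/generate."""
    body = {
        "symbol": symbol,
        "timeframe": timeframe,
        "start_date": start_date,
        "end_date": end_date,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/features/{feature_id}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request(
    "f1000000-0000-0000-0000-000000000001",
    "AAPL",
    "1d",
    "2020-01-01T00:00:00Z",
    "2024-01-01T00:00:00Z",
)
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would send it, provided a server is running and any required credentials have been added to the headers.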

Content-Hash Versioning

The most important property of the Feature Store is content-addressed deduplication. The version_hash for a dataset is a SHA-256 digest of all its inputs:
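import hashlib
import json

# Every input that can change the computed values goes into the payload.
payload = {
    "plugin_key": "technical.rsi",
    "params": {"length": 14},
    "symbol": "AAPL",
    "timeframe": "1d",
    "start_date": "2020-01-01T00:00:00",
    "end_date":   "2024-01-01T00:00:00",
    "source_fingerprint": "<sha256 of raw OHLCV bytes>",
}

# Serialise deterministically: sorted keys and fixed separators mean the same
# inputs always yield byte-identical JSON, whatever the dict insertion order.
stable = json.dumps(payload, sort_keys=True, default=str, separators=(",", ":"))
version_hash = hashlib.sha256(stable.encode("utf-8")).hexdigest()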
This gives three guarantees:

Reproducibility

Re-running the identical pipeline always produces the same hash — and therefore the same dataset — with no ambiguity.

Cache Hits

If the hash already exists in feature_datasets, the engine skips computation entirely and returns the stored dataset.

Automatic Invalidation

If upstream market data is revised (corporate actions, late-arriving prints), the source_fingerprint changes, producing a new hash and triggering automatic regeneration.
The source_fingerprint is a SHA-256 of the raw OHLCV bytes fed to the plugin. This means a backfill or data vendor correction automatically invalidates cached features without any manual intervention.
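
The invalidation behaviour can be demonstrated with a small sketch. It reuses the hashing recipe shown earlier in this section; the helper names and the CSV-style byte layout of the bars are assumptions, and the bars themselves are synthetic values rather than real market data.

```python
import hashlib
import json


def source_fingerprint(raw_ohlcv: bytes) -> str:
    """SHA-256 of the raw bytes handed to the plugin."""
    return hashlib.sha256(raw_ohlcv).hexdigest()


def version_hash(fingerprint: str) -> str:
    """Hash all inputs; only the fingerprint varies in this demo."""
    payload = {
        "plugin_key": "technical.rsi",
        "params": {"length": 14},
        "symbol": "AAPL",
        "timeframe": "1d",
        "start_date": "2020-01-01T00:00:00",
        "end_date": "2024-01-01T00:00:00",
        "source_fingerprint": fingerprint,
    }
    stable = json.dumps(payload, sort_keys=True, default=str, separators=(",", ":"))
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()


# Synthetic bars (timestamp,open,high,low,close,volume); not real market data.
original_bars = b"2020-01-02,100.0,101.0,99.0,100.5,1000\n"
revised_bars = b"2020-01-02,100.0,101.0,99.0,100.6,1000\n"  # corrected close

hash_before = version_hash(source_fingerprint(original_bars))
hash_replay = version_hash(source_fingerprint(original_bars))
hash_after = version_hash(source_fingerprint(revised_bars))

print(hash_before == hash_replay)  # True: identical inputs, identical hash
print(hash_before == hash_after)   # False: one revised value changes the hash
```

Because the fingerprint covers the raw bytes, even a one-tick correction to a single bar yields a different version_hash, which is what makes the stale dataset miss the cache.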

FeatureDataset

A FeatureDataset row is one generated instance of a feature definition. Multiple datasets can exist for the same feature definition (different symbols, different date ranges, or different source data versions).
{
  "id": "d1000000-0000-0000-0000-000000000001",
  "feature_id": "f1000000-0000-0000-0000-000000000001",
  "symbol": "AAPL",
  "timeframe": "1d",
  "start_date": "2020-01-01T00:00:00Z",
  "end_date": "2024-01-01T00:00:00Z",
  "version_hash": "a3f5c2b8d9e1f04762890abc1234567890abcdef1234567890abcdef12345678",
  "storage_uri": "s3://feature-store/technical.rsi/AAPL/a3f5c2b8.parquet",
  "row_count": 1006,
  "columns": ["timestamp", "rsi_14"],
  "created_at": "2024-01-15T09:05:00Z"
}
The feature_datasets table has a composite index on (feature_id, symbol, timeframe, version_hash), so a cache lookup resolves through the index rather than a table scan as the number of datasets grows.
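
The lookup keyed on that four-column tuple can be sketched with an in-memory stand-in for the table. InMemoryDatasetIndex and get_or_generate are hypothetical names, and a plain dict replaces Postgres here; the point is only to show the check-then-generate flow described under Cache Hits.

```python
from typing import Callable

# (feature_id, symbol, timeframe, version_hash), matching the composite index.
DatasetKey = tuple[str, str, str, str]


class InMemoryDatasetIndex:
    """Dict-backed stand-in for the feature_datasets table."""

    def __init__(self) -> None:
        self._rows: dict[DatasetKey, dict] = {}

    def get_or_generate(self, key: DatasetKey,
                        generate: Callable[[], dict]) -> tuple[dict, bool]:
        """Return (row, cache_hit); call generate() only on a miss."""
        row = self._rows.get(key)
        if row is not None:
            return row, True
        row = generate()
        self._rows[key] = row
        return row, False


calls = {"count": 0}


def fake_generate() -> dict:
    # Placeholder for the plugin run, Parquet write, and row insert.
    calls["count"] += 1
    return {"row_count": 1006, "columns": ["timestamp", "rsi_14"]}


index = InMemoryDatasetIndex()
key = ("f1000000-0000-0000-0000-000000000001", "AAPL", "1d", "a3f5c2b8")

first_row, first_hit = index.get_or_generate(key, fake_generate)
second_row, second_hit = index.get_or_generate(key, fake_generate)
print(first_hit, second_hit, calls["count"])  # False True 1
```

The second call returns the stored row without invoking the generator again, which is the behaviour the version_hash lookup is meant to provide.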

Built-in Plugins

technical.rsi

RSI — Relative Strength Index. Parameter: length (default 14). Output column: rsi_{length}.

technical.atr

ATR — Average True Range. Parameter: length (default 14). Output column: atr_{length}.

statistical.tsfresh

tsfresh — Extracts a configurable subset of the tsfresh feature library (autocorrelations, entropy, linear trend coefficients, etc.).

news.sentiment

News Sentiment — Aggregates intraday news polarity scores into a daily bullish/bearish score per symbol. Output columns: sentiment_score, sentiment_count.
All plugins implement the BaseFeature interface from app/plugins/base.py. You can add custom plugins by subclassing BaseFeature, setting a unique key, and registering it in the feature plugin registry — without touching any engine code.
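
A custom plugin might look roughly like the sketch below. The real BaseFeature lives in app/plugins/base.py and is not reproduced on this page, so the base class, the compute signature, the PLUGIN_REGISTRY dict, and the register decorator shown here are all hypothetical stand-ins. The example plugin, technical.sma, is also invented for illustration and works on plain lists of dicts to stay dependency-free.

```python
from abc import ABC, abstractmethod


class BaseFeature(ABC):
    """Hypothetical stand-in for the interface in app/plugins/base.py."""

    key: str  # unique registry key, e.g. "technical.rsi"

    def __init__(self, **parameters) -> None:
        self.parameters = parameters

    @abstractmethod
    def compute(self, rows: list[dict]) -> list[dict]:
        """Turn input bars into rows of feature values."""


# Hypothetical registry: maps plugin_key to the class that implements it.
PLUGIN_REGISTRY: dict[str, type[BaseFeature]] = {}


def register(cls: type[BaseFeature]) -> type[BaseFeature]:
    if cls.key in PLUGIN_REGISTRY:
        raise ValueError(f"duplicate plugin key: {cls.key}")
    PLUGIN_REGISTRY[cls.key] = cls
    return cls


@register
class SimpleMovingAverage(BaseFeature):
    key = "technical.sma"  # invented key, not one of the built-ins

    def compute(self, rows: list[dict]) -> list[dict]:
        length = int(self.parameters.get("length", 3))
        closes = [row["close"] for row in rows]
        out = []
        for i, row in enumerate(rows):
            window = closes[max(0, i - length + 1): i + 1]
            # Emit None until a full window of closes is available.
            value = sum(window) / length if len(window) == length else None
            out.append({"timestamp": row["timestamp"], f"sma_{length}": value})
        return out


plugin = PLUGIN_REGISTRY["technical.sma"](length=3)
bars = [{"timestamp": i, "close": float(c)} for i, c in enumerate([10, 11, 12, 13])]
result = plugin.compute(bars)
print([row["sma_3"] for row in result])  # [None, None, 11.0, 12.0]
```

Resolving the class through the registry by key, rather than importing it directly, is what lets the engine stay unchanged when a new plugin is added.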

Regeneration

To force recomputation regardless of cache state — for example, after changing plugin parameters or fixing a data source — call:
POST /api/features/{id}/regenerate
with the same FeatureGenerateRequest body. The engine computes a new source_fingerprint from the current market data slice, derives a new version_hash, and writes a new FeatureDataset row pointing at a freshly generated Parquet file. The old dataset row is retained for historical reproducibility.

Redis Caching

In addition to the Parquet + Postgres persistence layer, frequently accessed feature datasets are cached in Redis. On a cache hit the engine deserialises the dataset directly from Redis without an S3 round-trip. Cache entries carry a configurable TTL and are evicted when a new FeatureDataset version is written for the same (feature_id, symbol, timeframe) tuple.
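
The TTL and eviction rules can be sketched with an in-memory stand-in for Redis. TTLDatasetCache and its methods are hypothetical, and the injected clock exists only so the example runs without waiting; with a real Redis client the equivalent operations would be a SET with an expiry and a DEL on the same key.

```python
import time
from typing import Callable, Optional

CacheKey = tuple[str, str, str]  # (feature_id, symbol, timeframe)


class TTLDatasetCache:
    """In-memory stand-in for the Redis layer."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[CacheKey, tuple[float, bytes]] = {}

    def put(self, key: CacheKey, payload: bytes) -> None:
        self._entries[key] = (self._clock() + self._ttl, payload)

    def get(self, key: CacheKey) -> Optional[bytes]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, payload = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return payload

    def evict(self, key: CacheKey) -> None:
        # Called when a new FeatureDataset version is written for this key.
        self._entries.pop(key, None)


now = [0.0]  # fake clock so the demo does not need to sleep
cache = TTLDatasetCache(ttl_seconds=60, clock=lambda: now[0])
key = ("f1000000-0000-0000-0000-000000000001", "AAPL", "1d")

cache.put(key, b"dataset-v1")
hit = cache.get(key)

cache.evict(key)  # a newer dataset version was written
after_evict = cache.get(key)

cache.put(key, b"dataset-v2")
now[0] = 61.0  # advance past the TTL
after_ttl = cache.get(key)

print(hit, after_evict, after_ttl)  # b'dataset-v1' None None
```

A miss in this layer would fall through to the Parquet file referenced by the FeatureDataset row, so the cache only affects latency and never which dataset version is returned.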

API Reference

For full endpoint documentation — including listing datasets, previewing generated data, and bulk generation across a symbol universe — see the Features API.
