Running the Full Pipeline

Overview

The full pipeline executes 16 scripts in strict dependency order to produce all_stocks_fundamental_analysis.json.gz - a comprehensive dataset of 2,775+ Indian stocks with 86 fields per stock covering fundamentals, technicals, events, and sentiment.

Expected runtime: ~4 minutes (without OHLCV) | ~34 minutes (with OHLCV first-time fetch)

Quick Start

Navigate to pipeline directory

cd ~/workspace/source/DO\ NOT\ DELETE\ EDL\ PIPELINE/

Run the pipeline

python3 run_full_pipeline.py

The script will automatically:

Fetch data from all sources (Dhan ScanX, NSE)
Build the master JSON structure
Enrich with technical indicators, events, and news
Compress output to .json.gz format
Clean up intermediate files

Verify output

Check for the final compressed file:

ls -lh all_stocks_fundamental_analysis.json.gz

Expected size: ~2 MB (compressed from ~50-60 MB raw JSON)
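The same check can be scripted. The following is a minimal sketch, not part of the pipeline itself: verify_output is a hypothetical helper, and the 2,000-record floor is an assumed sanity threshold chosen for illustration.

```python
import gzip
import json
from pathlib import Path


def verify_output(path, min_records=2000):
    """Return the record count if the file is a gzip-compressed JSON array.

    min_records is an assumed sanity floor, not a value used by the pipeline.
    """
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"{file} does not exist")
    # Text mode so json.load receives decoded UTF-8 characters.
    with gzip.open(file, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    if len(data) < min_records:
        raise ValueError(f"only {len(data)} records, expected at least {min_records}")
    return len(data)
```

Calling verify_output("all_stocks_fundamental_analysis.json.gz") from the pipeline directory either returns the number of stocks or raises with a message describing what is wrong.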

Configuration Options

Edit run_full_pipeline.py to customize behavior:

OHLCV Data Fetching

FETCH_OHLCV = True  # Default: True

When True:

First run: Downloads complete OHLCV history (~30 min for all stocks)
Subsequent runs: Incremental update only (~2-5 min)
Enables: ADR, RVOL, ATH, % from ATH, returns calculations

When False:

Skips OHLCV entirely
ADR, RVOL, ATH fields will be 0
Runtime: ~4 minutes

Optional Standalone Data

FETCH_OPTIONAL = False  # Default: False

When True: Also fetches (not included in master JSON):

all_indices_list.json - 194 market indices
etf_data_response.json - 361 ETFs

Auto-Cleanup

CLEANUP_INTERMEDIATE = True  # Default: True

When True: Removes intermediate files after successful completion, keeping only:

all_stocks_fundamental_analysis.json.gz
sector_analytics.json.gz
market_breadth.json.gz
ohlcv_data/ directory (if FETCH_OHLCV=True)

When False: Retains all intermediate JSON files for debugging
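As a rough illustration of how the flags relate to what is left on disk, the sketch below restates the defaults and the Auto-Cleanup keep list in code. The flag names, defaults, and file names come from this page; kept_after_cleanup is a hypothetical helper, not a function in run_full_pipeline.py.

```python
# Flag names and defaults as documented for run_full_pipeline.py.
FETCH_OHLCV = True
FETCH_OPTIONAL = False
CLEANUP_INTERMEDIATE = True


def kept_after_cleanup(fetch_ohlcv):
    """List what auto-cleanup keeps, per the Auto-Cleanup section."""
    kept = [
        "all_stocks_fundamental_analysis.json.gz",
        "sector_analytics.json.gz",
        "market_breadth.json.gz",
    ]
    if fetch_ohlcv:
        # The OHLCV directory survives cleanup so later runs stay incremental.
        kept.append("ohlcv_data/")
    return kept
```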

Pipeline Phases

The pipeline executes in strict order:

Phase 1: Core Data (Foundation)

fetch_dhan_data.py          → dhan_data_response.json + master_isin_map.json
fetch_fundamental_data.py   → fundamental_data.json
NSE CSV download            → nse_equity_list.csv (listing dates)

Critical: fetch_dhan_data.py must succeed - it creates master_isin_map.json which all other scripts need.

Phase 2: Data Enrichment (Fetching)

fetch_company_filings.py         → company_filings/*.json
fetch_new_announcements.py       → all_company_announcements.json
fetch_advanced_indicators.py     → advanced_indicator_data.json
fetch_market_news.py             → market_news/*.json
fetch_corporate_actions.py       → upcoming/history_corporate_actions.json
fetch_surveillance_lists.py      → nse_asm_list.json, nse_gsm_list.json
fetch_circuit_stocks.py          → upper/lower_circuit_stocks.json
fetch_bulk_block_deals.py        → bulk_block_deals.json
fetch_incremental_price_bands.py → incremental_price_bands.json
fetch_complete_price_bands.py    → complete_price_bands.json
fetch_all_indices.py             → all_indices_list.json

Phase 2.5: OHLCV History (Smart Incremental)

fetch_all_ohlcv.py         → ohlcv_data/*.csv
fetch_indices_ohlcv.py     → (indices OHLCV)

Smart Incremental Logic:

Checks existing CSV files in ohlcv_data/
Only fetches missing dates since last update
First run: Fetches up to 2 years of history per stock
Daily updates: Only fetches 1-2 days of new data
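The incremental logic can be sketched as follows. This is an illustration under stated assumptions, not the code in fetch_all_ohlcv.py: next_fetch_start is a hypothetical helper, the CSV is assumed to have a "Date" column holding ISO dates (YYYY-MM-DD), and 730 days stands in for "up to 2 years".

```python
import csv
from datetime import date, timedelta
from pathlib import Path


def next_fetch_start(csv_path, today, history_days=730):
    """Return the first date that still needs fetching, or None if up to date.

    Assumes a "Date" column with ISO dates; the real files may differ.
    """
    path = Path(csv_path)
    if not path.is_file():
        # No history yet: go back roughly two years, as on a first run.
        return today - timedelta(days=history_days)
    with path.open(newline="", encoding="utf-8") as fh:
        dates = [date.fromisoformat(row["Date"]) for row in csv.DictReader(fh)]
    if not dates:
        return today - timedelta(days=history_days)
    # Resume on the day after the most recent stored row.
    start = max(dates) + timedelta(days=1)
    return start if start <= today else None
```

With a file whose last row is dated yesterday, the function returns today's date, which corresponds to the small daily fetch described above.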

Phase 3: Base Analysis

bulk_market_analyzer.py    → all_stocks_fundamental_analysis.json (BASE)

Creates the master JSON structure with fundamental data for all stocks.

Phase 4: Enrichment (Order Matters!)

advanced_metrics_processor.py   → Adds ADR, RVOL, ATH, Turnover
process_earnings_performance.py → Adds post-earnings returns
enrich_fno_data.py              → Adds F&O flag, Lot Size, Next Expiry
process_market_breadth.py       → Generates sector analytics
process_historical_market_breadth.py → Generates breadth charts
add_corporate_events.py         → Adds Events, Announcements, News (LAST!)

Critical: add_corporate_events.py MUST run last as it performs final JSON injection.
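If the Phase 4 list is ever edited, a small guard can catch an ordering mistake before anything runs. The sketch below is a hypothetical check, not code from run_full_pipeline.py; the script names are copied from the list above.

```python
# Phase 4 scripts in the order listed above.
PHASE_4 = [
    "advanced_metrics_processor.py",
    "process_earnings_performance.py",
    "enrich_fno_data.py",
    "process_market_breadth.py",
    "process_historical_market_breadth.py",
    "add_corporate_events.py",
]


def check_phase_4_order(scripts):
    """Raise ValueError unless add_corporate_events.py is the final script."""
    if not scripts or scripts[-1] != "add_corporate_events.py":
        raise ValueError("add_corporate_events.py must run last")
    return True
```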

Phase 5: Compression

Compress all output files:
- all_stocks_fundamental_analysis.json → .json.gz
- sector_analytics.json → .json.gz
- market_breadth.csv → .json.gz

Compression ratio: ~90-95% size reduction
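For reference, gzip compression of a single file and the size-reduction figure can be sketched with the standard library. compress_json is a hypothetical helper written for this page; the pipeline's own compression step may be implemented differently.

```python
import gzip
import shutil
from pathlib import Path


def compress_json(src, dst=None):
    """Gzip src and return (raw_bytes, compressed_bytes, percent_reduction)."""
    src = Path(src)
    dst = Path(dst) if dst else src.with_name(src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream the copy so large files are not loaded into memory at once.
        shutil.copyfileobj(fin, fout)
    raw, packed = src.stat().st_size, dst.stat().st_size
    reduction = 100 * (1 - packed / raw) if raw else 0.0
    return raw, packed, reduction
```

The third return value is the percentage by which the file shrank, the same kind of figure quoted as the compression ratio above.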

Output Files

Primary Output

Location: ~/workspace/source/DO NOT DELETE EDL PIPELINE/all_stocks_fundamental_analysis.json.gz

Format: Gzip-compressed JSON array

Structure:

[
  {
    "Symbol": "RELIANCE",
    "Name": "Reliance Industries Limited",
    "Market Cap(Cr.)": 1850000,
    "Stock Price(₹)": 2734.50,
    "P/E": 28.5,
    "ROE(%)": 15.2,
    "Latest Quarter": "Dec 2025",
    "Net Profit Latest": 18200,
    "QoQ % Net Profit Latest": 5.3,
    "YoY % Net Profit Latest": 12.7,
    "RSI (14)": 62.5,
    "Event Markers": "💸: Dividend (15-Mar)",
    "Recent Announcements": [...],
    "News Feed": [...]
    // ... 86 total fields
  },
  // ... 2,775+ stocks
]

Decompression:

import gzip
import json

with gzip.open('all_stocks_fundamental_analysis.json.gz', 'rb') as f:
    data = json.load(f)

print(f"Total stocks: {len(data)}")
print(f"Fields per stock: {len(data[0])}")
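Because the output is a flat array, looking up one stock means scanning the list. A small index keyed on "Symbol" avoids that. Both helpers below are hypothetical conveniences, and they assume every record carries a "Symbol" field as in the structure shown above.

```python
def index_by_symbol(records):
    """Map each record's "Symbol" value to the record itself."""
    return {rec["Symbol"]: rec for rec in records if "Symbol" in rec}


def pick_fields(record, fields):
    """Return only the requested fields that are present in the record."""
    return {name: record[name] for name in fields if name in record}
```

With data loaded as above, index_by_symbol(data)["RELIANCE"] returns that stock's full record, and pick_fields narrows it to the columns of interest.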

Secondary Outputs

File                       Size      Description
sector_analytics.json.gz   ~500 KB   Sector-wise aggregated metrics
market_breadth.json.gz     ~8 MB     Historical market breadth data
ohlcv_data/*.csv           ~200 MB   Individual stock OHLCV history
all_indices_list.json      ~85 KB    Market indices data (if FETCH_OPTIONAL=True)

Runtime Breakdown

First-Time Execution (with OHLCV)

Phase 1: Core Data                    ~30s
Phase 2: Data Enrichment              ~90s
Phase 2.5: OHLCV History (first)      ~30 min
Phase 3: Base Analysis                ~20s
Phase 4: Enrichment                   ~45s
Phase 5: Compression                  ~15s
─────────────────────────────────────────
Total:                                ~34 min

Daily Update (with incremental OHLCV)

Phase 1: Core Data                    ~30s
Phase 2: Data Enrichment              ~90s
Phase 2.5: OHLCV Incremental          ~2-5 min
Phase 3: Base Analysis                ~20s
Phase 4: Enrichment                   ~45s
Phase 5: Compression                  ~15s
─────────────────────────────────────────
Total:                                ~6-9 min

Without OHLCV

Phase 1: Core Data                    ~30s
Phase 2: Data Enrichment              ~90s
Phase 3: Base Analysis                ~20s
Phase 4: Enrichment                   ~30s
Phase 5: Compression                  ~15s
─────────────────────────────────────────
Total:                                ~4 min

Console Output Example

════════════════════════════════════════════════════════════
  EDL PIPELINE - FULL DATA REFRESH
════════════════════════════════════════════════════════════

📦 PHASE 1: Core Data (Foundation)
────────────────────────────────────────
  ▶ Running fetch_dhan_data.py...
  ✅ fetch_dhan_data.py (12.3s)
  ▶ Running fetch_fundamental_data.py...
  ✅ fetch_fundamental_data.py (18.7s)
  ▶ Downloading NSE Listing Dates...
  ✅ NSE Listing Dates downloaded.

📡 PHASE 2: Data Enrichment (Fetching)
────────────────────────────────────────
  ▶ Running fetch_company_filings.py...
  ✅ fetch_company_filings.py (45.2s)
  ...

📊 PHASE 2.5: OHLCV History (Smart Incremental)
────────────────────────────────────────
  ▶ Running fetch_all_ohlcv.py...
  ✅ fetch_all_ohlcv.py (142.5s)

🔬 PHASE 3: Base Analysis (Building Master JSON)
────────────────────────────────────────
  ▶ Running bulk_market_analyzer.py...
  ✅ bulk_market_analyzer.py (19.8s)

✨ PHASE 4: Enrichment (Injecting into Master JSON)
────────────────────────────────────────
  ▶ Running advanced_metrics_processor.py...
  ✅ advanced_metrics_processor.py (8.2s)
  ...

📦 PHASE 5: Compression (.json → .json.gz)
────────────────────────────────────────
  📦 Compressed: 58.3 MB → 2.1 MB (96% reduction)

🧹 CLEANUP: Removing intermediate files...
────────────────────────────────────────
  🗑️  Cleaned: 13 files + 2 dirs (56.2 MB freed)

════════════════════════════════════════════════════════════
  PIPELINE COMPLETE
════════════════════════════════════════════════════════════
  Total Time:  245.7s (4.1 min)
  Successful:  22/22
  Failed:      0/22

  📄 Output: all_stocks_fundamental_analysis.json.gz (2.1 MB)
  📦 Compression: 58.3 MB → 2.1 MB (96% smaller)
  🧹 Only .json.gz + ohlcv_data/ remain. All intermediate data purged.
════════════════════════════════════════════════════════════

Troubleshooting

Pipeline Fails at fetch_dhan_data.py

Error: CRITICAL: fetch_dhan_data.py failed. Cannot continue.

Cause: This script fetches the master stock list and creates master_isin_map.json, which all other scripts need.

Solutions:

Check internet connectivity
Verify Dhan API endpoint is accessible
Check if rate-limited (wait 5 minutes and retry)
Inspect error message in console output

OHLCV Fetch Takes Too Long

Symptom: Phase 2.5 exceeds 30 minutes

Solutions:

First run is expected to take ~30 min for full history
Reduce thread count: Edit fetch_all_ohlcv.py, set MAX_THREADS = 10 (line 14)
For faster daily updates, keep existing ohlcv_data/ directory - it will only fetch new dates
If not needed immediately, set FETCH_OHLCV = False and run later

Script Times Out

Error: ⏰ {script_name} TIMED OUT (>30 min)

Cause: The timeout for each individual script is set to 30 minutes (1800 seconds).

Solutions:

Check network stability
Increase timeout in run_full_pipeline.py line 117: timeout=3600 (1 hour)
Run the individual script manually to see detailed error
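To make the timeout behavior concrete, here is a minimal sketch of running one script under a time limit with subprocess. It is not the code in run_full_pipeline.py: run_script and its return values are hypothetical, and only the 1800-second default comes from this page.

```python
import subprocess
import sys


def run_script(script, timeout=1800):
    """Run one script; return "ok", "failed", or "timeout"."""
    try:
        result = subprocess.run([sys.executable, script], timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"
```

Raising the limit to one hour is then a matter of passing timeout=3600.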

Compression Fails

Error: Files to compress not found

Cause: Phase 3 or Phase 4 failed to produce the expected output files.

Solutions:

Check console for which Phase 4 script failed
Run pipeline with CLEANUP_INTERMEDIATE = False to inspect intermediate files
Verify all_stocks_fundamental_analysis.json exists before compression

Memory Issues

Symptom: Process killed or out-of-memory errors

Solutions:

Free up system RAM (close other applications)
Reduce parallelization: Lower thread counts in fetcher scripts
Process in batches: Set FETCH_OPTIONAL = False
Pipeline requires ~2-4 GB RAM for full execution

Partial Data in Output

Symptom: Some stocks have missing fields or empty values

Cause: Non-critical enrichment scripts failed but the pipeline continued.

Solutions:

Check console output for failed scripts (marked with ❌)
Pipeline continues even if enrichment fails (line 126: return True)
Re-run pipeline to retry failed fetches
Some data sources may be temporarily unavailable (ASM/GSM lists, news feed)
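To see which stocks are affected, the decompressed records can be scanned for missing or empty fields. The helper below is a hypothetical sketch; which fields count as required is left to the caller, and a value of 0 is deliberately not treated as empty because some fields are legitimately 0.

```python
def find_incomplete(records, required_fields):
    """Return {symbol: [missing or empty fields]} for incomplete records."""
    report = {}
    for rec in records:
        # A field counts as incomplete if absent, None, "" or an empty list.
        missing = [f for f in required_fields if rec.get(f) in (None, "", [])]
        if missing:
            report[rec.get("Symbol", "<unknown>")] = missing
    return report
```

For example, find_incomplete(data, ["P/E", "News Feed"]) lists every symbol lacking either field after loading the output as shown in the decompression snippet.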

Manual Script Execution

If you need to run individual scripts for debugging:

cd ~/workspace/source/DO\ NOT\ DELETE\ EDL\ PIPELINE/

# Core data (must run first)
python3 fetch_dhan_data.py
python3 fetch_fundamental_data.py

# Any enrichment script (requires master_isin_map.json)
python3 fetch_company_filings.py
python3 fetch_market_news.py

# OHLCV (requires dhan_data_response.json)
python3 fetch_all_ohlcv.py

# Base analysis (requires all fetched data)
python3 bulk_market_analyzer.py

# Enrichment (requires all_stocks_fundamental_analysis.json to exist)
python3 advanced_metrics_processor.py
python3 add_corporate_events.py  # Must be last!
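Before running a script by hand, it can help to confirm that its input files exist. The sketch below encodes only the prerequisites stated in the comments above; missing_prerequisites is a hypothetical helper, and scripts not listed in the mapping are treated as having no known prerequisites.

```python
from pathlib import Path

# Prerequisites taken from the comments in the command list above.
PREREQUISITES = {
    "fetch_company_filings.py": ["master_isin_map.json"],
    "fetch_market_news.py": ["master_isin_map.json"],
    "fetch_all_ohlcv.py": ["dhan_data_response.json"],
    "advanced_metrics_processor.py": ["all_stocks_fundamental_analysis.json"],
    "add_corporate_events.py": ["all_stocks_fundamental_analysis.json"],
}


def missing_prerequisites(script, directory="."):
    """Return the prerequisite files for script that are absent from directory."""
    base = Path(directory)
    return [
        name
        for name in PREREQUISITES.get(script, [])
        if not (base / name).is_file()
    ]
```

An empty list means the script's documented inputs are in place.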

Best Practices

Daily Updates

Run once per day after market close (after 3:30 PM IST)
Keep FETCH_OHLCV = True for incremental updates
OHLCV incremental fetch only takes 2-5 minutes
Set up a cron job for automated daily execution:

# Run at 4 PM on weekdays (Mon-Fri); cron uses the system time zone, so this assumes the server is set to IST
0 16 * * 1-5 cd ~/workspace/source/DO\ NOT\ DELETE\ EDL\ PIPELINE/ && python3 run_full_pipeline.py >> pipeline.log 2>&1

First-Time Setup

Allow 30-40 minutes for first run with OHLCV
Verify output file exists and is properly formatted
Test decompression with a JSON parser
Keep intermediate files for first run (CLEANUP_INTERMEDIATE = False)

Production Environment

Monitor disk space (OHLCV data grows to ~200 MB)
Archive old .json.gz files with timestamps
Set up error alerting for pipeline failures
Keep logs of each run for debugging
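Archiving with timestamps can be sketched as a copy into a separate directory with the run time inserted before the extension. archive_output, the archive directory, and the YYYYMMDD_HHMMSS stamp format are all assumptions made for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_output(src, archive_dir, now=None):
    """Copy src into archive_dir with a timestamp before its extension."""
    src = Path(src)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    # Treat ".json.gz" as one extension so the stamp lands before it.
    suffix = "".join(src.suffixes[-2:])
    stem = src.name[: len(src.name) - len(suffix)]
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{stem}_{stamp}{suffix}"
    shutil.copy2(src, dest)
    return dest
```

Copying rather than moving leaves the current file in place for anything that reads it from the pipeline directory.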

Next Steps

Incremental Updates - Run daily updates efficiently
Single Stock Analysis - Analyze individual stocks
API Reference - Detailed endpoint documentation
