The EDL Pipeline can automatically clean up intermediate files after successful completion, keeping only the final compressed outputs and essential data.
Overview
Intermediate cleanup is controlled by the CLEANUP_INTERMEDIATE flag in run_full_pipeline.py (line 71):
# Auto-delete intermediate files after pipeline succeeds
# Keeps: all_stocks_fundamental_analysis.json.gz + ohlcv_data/
CLEANUP_INTERMEDIATE = True
Default: True in production environments, to minimize storage usage.
What Gets Deleted
The cleanup process removes the 16 intermediate files (15 JSON files and one CSV) and 2 directories listed below, which are only needed between pipeline stages.
INTERMEDIATE_FILES = [
    "master_isin_map.json",                  # Stock symbol → ISIN mapping
    "dhan_data_response.json",               # Raw Dhan market data
    "fundamental_data.json",                 # Raw fundamental data (35 MB)
    "advanced_indicator_data.json",          # Technical indicators
    "all_company_announcements.json",        # Corporate announcements
    "upcoming_corporate_actions.json",       # Upcoming corp actions
    "history_corporate_actions.json",        # Historical corp actions
    "nse_asm_list.json",                     # ASM surveillance list
    "nse_gsm_list.json",                     # GSM surveillance list
    "bulk_block_deals.json",                 # Bulk/block deals
    "upper_circuit_stocks.json",             # Upper circuit stocks
    "lower_circuit_stocks.json",             # Lower circuit stocks
    "incremental_price_bands.json",          # Daily price band changes
    "complete_price_bands.json",             # All price bands
    "nse_equity_list.csv",                   # NSE listing dates
    "all_stocks_fundamental_analysis.json",  # Raw JSON (before .gz)
]

INTERMEDIATE_DIRS = [
    "company_filings",  # ~2,775 per-stock filing JSON files
    "market_news",      # ~2,775 per-stock news JSON files
]
What Gets Preserved
The cleanup process preserves these essential outputs:
✅ all_stocks_fundamental_analysis.json.gz (~2 MB) - Final compressed output
✅ sector_analytics.json.gz - Sector performance data
✅ market_breadth.json.gz - Market breadth metrics
✅ ohlcv_data/ directory - Historical OHLCV CSV files (~200 MB)
✅ indices_ohlcv_data/ directory - Indices OHLCV data
The ohlcv_data/ directory is preserved because re-fetching it takes 25-35 minutes. The smart incremental updater needs existing data to calculate date ranges.
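After a run with cleanup enabled, a quick check can confirm that the preserved outputs are still in place. The following is a minimal sketch, not part of run_full_pipeline.py: the helper name `missing_outputs` and its `base_dir` parameter are hypothetical, and the file names are simply copied from the list above.

```python
import os

# Names copied from the "What Gets Preserved" list.
PRESERVED_FILES = [
    "all_stocks_fundamental_analysis.json.gz",
    "sector_analytics.json.gz",
    "market_breadth.json.gz",
]
PRESERVED_DIRS = ["ohlcv_data", "indices_ohlcv_data"]


def missing_outputs(base_dir):
    """Return the preserved files/directories that are absent from base_dir.

    Hypothetical helper for illustration; the pipeline does not ship it.
    """
    missing = [f for f in PRESERVED_FILES
               if not os.path.isfile(os.path.join(base_dir, f))]
    missing += [d + "/" for d in PRESERVED_DIRS
                if not os.path.isdir(os.path.join(base_dir, d))]
    return missing
```

Calling `missing_outputs(".")` from the pipeline directory returns an empty list when every preserved output is present.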
Cleanup Implementation
The cleanup logic is in run_full_pipeline.py (lines 169-192):
def cleanup_intermediate():
    """Delete all intermediate files and directories, keeping only .json.gz + ohlcv_data/."""
    removed_files = 0
    removed_dirs = 0
    freed_bytes = 0

    # Remove intermediate files
    for f in INTERMEDIATE_FILES:
        fp = os.path.join(BASE_DIR, f)
        if os.path.exists(fp):
            freed_bytes += os.path.getsize(fp)
            os.remove(fp)
            removed_files += 1

    # Remove intermediate directories
    for d in INTERMEDIATE_DIRS:
        dp = os.path.join(BASE_DIR, d)
        if os.path.exists(dp):
            # Sum file sizes before deleting so the freed total is accurate
            for root, dirs, files in os.walk(dp):
                for file in files:
                    freed_bytes += os.path.getsize(os.path.join(root, file))
            shutil.rmtree(dp)
            removed_dirs += 1

    freed_mb = freed_bytes / (1024 * 1024)
    print(f"🗑️ Cleaned: {removed_files} files + {removed_dirs} dirs ({freed_mb:.1f} MB freed)")
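Because the deletion is irreversible, it can be useful to see what a cleanup would remove before running it. The sketch below mirrors the size accounting of the function above but deletes nothing. `preview_cleanup` is a hypothetical helper, not something the pipeline provides, and it takes the base directory and name lists as arguments instead of reading the module-level constants.

```python
import os


def preview_cleanup(base_dir, files, dirs):
    """Return (paths, total_bytes) that a cleanup would remove, without deleting.

    Hypothetical dry-run helper; not part of run_full_pipeline.py.
    """
    paths = []
    total_bytes = 0
    for name in files:
        fp = os.path.join(base_dir, name)
        if os.path.isfile(fp):
            paths.append(fp)
            total_bytes += os.path.getsize(fp)
    for name in dirs:
        dp = os.path.join(base_dir, name)
        if os.path.isdir(dp):
            paths.append(dp)
            # Walk the tree to add up every file the directory contains
            for root, _subdirs, filenames in os.walk(dp):
                for filename in filenames:
                    total_bytes += os.path.getsize(os.path.join(root, filename))
    return paths, total_bytes
```

Passing `INTERMEDIATE_FILES` and `INTERMEDIATE_DIRS` as the last two arguments would give the same totals that the real cleanup reports, assuming nothing changes on disk in between.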
Space Savings
Typical cleanup results:
| Category | Size | Count | Total |
| --- | --- | --- | --- |
| JSON files | ~38 MB | 15 | 38 MB |
| company_filings/ | ~5 KB/file | 2,775 | ~13 MB |
| market_news/ | ~3 KB/file | 2,775 | ~8 MB |
| Total Freed | | | ~59 MB |
fundamental_data.json: ~35 MB (largest file)
dhan_data_response.json: ~2 MB
advanced_indicator_data.json: ~8 MB
all_stocks_fundamental_analysis.json: ~50 MB → deleted after .gz created
Other JSONs: ~1 MB total
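The per-directory totals in the table follow from the per-file averages. As a rough check, the arithmetic can be reproduced as below; the constants are the approximate figures from the table, and the variable names are chosen only for this illustration.

```python
# Approximate figures copied from the table above.
JSON_FILES_MB = 38
FILES_PER_DIR = 2775
FILING_KB_PER_FILE = 5
NEWS_KB_PER_FILE = 3

filings_mb = FILES_PER_DIR * FILING_KB_PER_FILE / 1024
news_mb = FILES_PER_DIR * NEWS_KB_PER_FILE / 1024
total_mb = JSON_FILES_MB + filings_mb + news_mb

print(f"company_filings/: ~{filings_mb:.1f} MB")  # ~13.5 MB
print(f"market_news/:     ~{news_mb:.1f} MB")     # ~8.1 MB
print(f"Total freed:      ~{total_mb:.1f} MB")    # ~59.7 MB
```

The result agrees with the table's ~59 MB once each row is rounded down to whole megabytes.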
Execution Timing
Cleanup happens automatically in the pipeline:
PHASE 1-4: Data fetching & processing (3-34 min)
PHASE 5: Compression (2 sec)
🧹 CLEANUP: Removing intermediate files... (1 sec)
🗑️ Cleaned: 15 files + 2 dirs (59 MB freed)
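The ordering above implies that cleanup runs only when the flag is enabled and the earlier phases succeeded. The following sketch shows one way such a gate could look. It is an assumption about the control flow, not the actual code in run_full_pipeline.py: `finish_run`, `pipeline_succeeded`, and `cleanup_fn` are hypothetical names.

```python
CLEANUP_INTERMEDIATE = True  # same flag name as in run_full_pipeline.py


def finish_run(pipeline_succeeded, cleanup_fn, enabled=CLEANUP_INTERMEDIATE):
    """Call cleanup_fn only when cleanup is enabled and the run succeeded.

    Hypothetical gate for illustration; the real pipeline may differ.
    """
    if not enabled:
        return "skipped: cleanup disabled"
    if not pipeline_succeeded:
        return "skipped: pipeline failed"
    cleanup_fn()
    return "cleaned"
```

Skipping cleanup on failure leaves the intermediate files on disk, which is usually what you want when diagnosing a failed phase.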
Configuration Options
Production Mode (Default)
CLEANUP_INTERMEDIATE = True
✅ Minimizes disk usage
✅ Keeps only final outputs
❌ Cannot inspect intermediate files for debugging
Development Mode
CLEANUP_INTERMEDIATE = False
✅ Preserves all intermediate files for inspection
✅ Easier debugging of individual pipeline stages
❌ Uses ~59 MB extra disk space
Manual Cleanup
If you run the pipeline with CLEANUP_INTERMEDIATE = False, you can manually clean up later:
# Navigate to pipeline directory
cd "DO NOT DELETE EDL PIPELINE/"
# Remove intermediate JSON files
rm master_isin_map.json dhan_data_response.json fundamental_data.json \
advanced_indicator_data.json all_company_announcements.json \
upcoming_corporate_actions.json history_corporate_actions.json \
nse_asm_list.json nse_gsm_list.json bulk_block_deals.json \
upper_circuit_stocks.json lower_circuit_stocks.json \
incremental_price_bands.json complete_price_bands.json \
nse_equity_list.csv all_stocks_fundamental_analysis.json
# Remove intermediate directories
rm -rf company_filings/ market_news/
# Check space freed
du -sh .
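On systems without `du`, the space check can be approximated in Python. This is a minimal sketch with a hypothetical helper name, `dir_size_mb`. It sums file sizes rather than allocated disk blocks, so its result can differ slightly from what `du` reports.

```python
import os


def dir_size_mb(path):
    """Sum the sizes of all regular files under path, in MB."""
    total = 0
    for root, _subdirs, filenames in os.walk(path):
        for filename in filenames:
            fp = os.path.join(root, filename)
            if not os.path.islink(fp):  # do not count symlink targets
                total += os.path.getsize(fp)
    return total / (1024 * 1024)
```

Calling `dir_size_mb(".")` before and after the manual cleanup and subtracting the two values gives the space freed.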
Selective Preservation
To preserve specific intermediate files for debugging:
Edit run_full_pipeline.py (lines 76-93) and comment out files you want to keep:
INTERMEDIATE_FILES = [
    "master_isin_map.json",
    # "fundamental_data.json",  # Keep for debugging
    "advanced_indicator_data.json",
    # ... rest of files
]
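An alternative to commenting entries in and out is to filter the list against a set of files to keep. The sketch below illustrates the idea under stated assumptions: `KEEP_FOR_DEBUG` and `files_to_delete` are hypothetical and do not exist in run_full_pipeline.py, and the file list is abbreviated.

```python
INTERMEDIATE_FILES = [
    "master_isin_map.json",
    "fundamental_data.json",
    "advanced_indicator_data.json",
    # ... rest of files
]

# Hypothetical setting, not present in run_full_pipeline.py.
KEEP_FOR_DEBUG = {"fundamental_data.json"}


def files_to_delete(all_files, keep):
    """Return all_files minus anything listed in keep, preserving order."""
    return [f for f in all_files if f not in keep]


print(files_to_delete(INTERMEDIATE_FILES, KEEP_FOR_DEBUG))
# ['master_isin_map.json', 'advanced_indicator_data.json']
```

This keeps the full list intact in one place, so re-enabling deletion for a file only means removing it from the keep set.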
Recovery from Accidental Deletion
If you accidentally delete intermediate files:
Re-run the full pipeline:
python3 run_full_pipeline.py
This will regenerate all files from scratch.
Restore from backup (if available):
cp backup/dhan_data_response.json .
There is no built-in recovery mechanism for deleted intermediate files. Unless you have a backup, the pipeline must be re-run to regenerate them (~4-34 min depending on the OHLCV setting).
Best Practices
Production: Enable cleanup
Set CLEANUP_INTERMEDIATE = True for daily automated runs to save disk space.
Development: Disable cleanup
Set CLEANUP_INTERMEDIATE = False when debugging or inspecting pipeline stages.
Archive final outputs
Back up the .json.gz files before each run to maintain historical snapshots.
Monitor disk usage
Check ohlcv_data/ size periodically (~200 MB). This directory is never auto-deleted.
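The "Archive final outputs" practice can be automated with a small script that runs before the pipeline. The following is a minimal sketch: `archive_outputs`, the dated folder layout, and the `archive_root` location are all assumptions made for illustration, not features of the pipeline.

```python
import glob
import os
import shutil
from datetime import date


def archive_outputs(base_dir, archive_root, stamp=None):
    """Copy every *.json.gz in base_dir into archive_root/<YYYY-MM-DD>/.

    Hypothetical helper; returns the names of the files it copied.
    """
    stamp = stamp or date.today().isoformat()
    dest = os.path.join(archive_root, stamp)
    os.makedirs(dest, exist_ok=True)
    copied = []
    # glob.escape guards against special characters in the directory name
    pattern = os.path.join(glob.escape(base_dir), "*.json.gz")
    for src in sorted(glob.glob(pattern)):
        shutil.copy2(src, dest)  # copy2 also preserves file timestamps
        copied.append(os.path.basename(src))
    return copied
```

Running it once per day produces one snapshot folder per date. A second run on the same day overwrites that day's copies.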
Next Steps
Compression: Learn how final outputs are compressed to .json.gz
Working with Output: Parse and analyze the compressed output files