Data Flow - tif1

Overview

tif1 moves data through five layers: user, lazy loading, cache, network, and processing. Lazy loading and two-level caching keep access fast, while validation and retries keep the data reliable.

Architecture Overview

Data Flow Layers

1. User Layer

The entry point where users request data:

import tif1

# Step 1: Create session (instant - no data loaded)
session = tif1.get_session(2024, "Monaco", "Race")

# Step 2: Access property (triggers data flow)
laps = session.laps  # Data flow starts here

2. Lazy Loading Layer

Data is only loaded when accessed:

# Creating a session doesn't load any data
session = tif1.get_session(2024, "Monaco", "Race")  # Instant

# Data is loaded on first access
laps = session.laps  # Triggers: check cache → fetch from CDN → process
weather = session.weather  # Separate data flow for weather
tel = laps.iloc[0].telemetry  # Separate data flow for telemetry

From core.py:3491-3547 - the laps property:

@property
def laps(self) -> DataFrame:
    """Get all laps data for the session (auto-async for 4-5x faster loading)."""
    if self._laps is None:
        # Check in-memory cache
        cache_key = f"{self.year}_{self.gp}_{self.session}_laps"
        lap_cache = _get_backend_lap_cache(self.lib) if self.enable_cache else None
        if lap_cache is not None:
            cached_laps = lap_cache.get(cache_key)
            if cached_laps is not None:
                logger.info(f"Lap cache hit ({self.lib}): {cache_key}")
                self._laps = cached_laps
                return self._laps

        # Cache miss - load async
        logger.info(f"Loading laps async ({self.lib}): {cache_key}")
        laps_df = asyncio.run(self.laps_async())
        self._laps = Laps(laps_df)
        self._laps.session = self
        
        # Store in cache
        if lap_cache is not None:
            lap_cache.set(cache_key, self._laps)

    return self._laps
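The lazy-loading pattern behind this property can be shown in isolation. The sketch below is a toy stand-in, not tif1 code: `LazySession`, `_load_laps`, and `load_count` are hypothetical names, and the loader returns a hard-coded list instead of checking caches or fetching from the CDN.

```python
class LazySession:
    """Toy stand-in for a session: construction is cheap, loading is deferred."""

    def __init__(self, year: int, gp: str, session: str):
        self.year = year
        self.gp = gp
        self.session = session
        self._laps = None      # nothing loaded yet
        self.load_count = 0    # counts how often the expensive path runs

    def _load_laps(self) -> list[dict]:
        # Placeholder for the real cache check, CDN fetch, and processing.
        self.load_count += 1
        return [{"Driver": "VER", "LapNumber": 1}]

    @property
    def laps(self) -> list[dict]:
        # Load on first access only; later accesses reuse the stored result.
        if self._laps is None:
            self._laps = self._load_laps()
        return self._laps


session = LazySession(2024, "Monaco", "Race")   # no load yet
first = session.laps                             # triggers the load
second = session.laps                            # reuses the stored result
```

The expensive path runs once per session object, which is why repeated `session.laps` accesses are cheap.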

3. Cache Layer

Two-level caching system:

Level 1: In-Memory Cache (LRU)

Fast, process-local cache using an LRU (Least Recently Used) strategy:

# From core.py:935-962
class LRUCache:
    """Thread-safe LRU cache with size limit."""
    def __init__(self, maxsize: int = MAX_CACHE_SIZE):
        self.cache = OrderedDict()
        self.maxsize = maxsize
        self.lock = threading.Lock()

Performance:

Speed: Instant (memory access)
Scope: Current process only
Capacity: Limited by MAX_CACHE_SIZE (default: 1000 items)
Persistence: Lost when process exits
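The excerpt above shows only the constructor. As a rough illustration of how an LRU cache with this shape typically behaves, here is a minimal sketch built on `OrderedDict`. The class name `SimpleLRUCache` and its `get`/`set` bodies are assumptions for illustration and may differ from the real `LRUCache` in core.py.

```python
import threading
from collections import OrderedDict
from typing import Any, Hashable, Optional


class SimpleLRUCache:
    """Illustrative thread-safe LRU cache (not the tif1 implementation)."""

    def __init__(self, maxsize: int = 1000):
        self.cache: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.maxsize = maxsize
        self.lock = threading.Lock()

    def get(self, key: Hashable) -> Optional[Any]:
        with self.lock:
            if key not in self.cache:
                return None
            # Mark the entry as most recently used.
            self.cache.move_to_end(key)
            return self.cache[key]

    def set(self, key: Hashable, value: Any) -> None:
        with self.lock:
            self.cache[key] = value
            self.cache.move_to_end(key)
            # Evict least recently used entries once over capacity.
            while len(self.cache) > self.maxsize:
                self.cache.popitem(last=False)


cache = SimpleLRUCache(maxsize=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")        # "a" becomes the most recently used entry
cache.set("c", 3)     # evicts "b", the least recently used entry
```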

Level 2: SQLite Cache

Persistent cache stored on disk:

-- Cache structure (SQLite schema)

-- Session-level data cache
CREATE TABLE cache (
    key TEXT PRIMARY KEY,          -- e.g., "2024/Monaco/Race/drivers.json"
    data TEXT                      -- JSON string
);

-- Telemetry-specific cache
CREATE TABLE telemetry_cache (
    year INTEGER,
    gp TEXT,
    session TEXT,
    driver TEXT,
    lap INTEGER,
    data TEXT,                     -- JSON string
    PRIMARY KEY (year, gp, session, driver, lap)
);

Performance:

Speed: Fast (disk I/O, ~1-10ms)
Scope: Persistent across sessions
Capacity: Limited by disk space
Persistence: Survives process restarts
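To make the session-level table concrete, the following sketch reads and writes JSON payloads with Python's standard `sqlite3` module. The helper names `cache_set` and `cache_get` are hypothetical, and an in-memory database is used so the example leaves no files behind. The actual tif1 cache code may be organized differently.

```python
import json
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")  # use a file path for a persistent cache
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, data TEXT)")


def cache_set(key: str, payload: dict) -> None:
    # INSERT OR REPLACE overwrites any existing row with the same key.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO cache (key, data) VALUES (?, ?)",
            (key, json.dumps(payload)),
        )


def cache_get(key: str) -> Optional[dict]:
    row = conn.execute("SELECT data FROM cache WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row is not None else None


cache_set("2024/Monaco/Race/drivers.json", {"drivers": ["VER", "LEC"]})
```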

Cache Hit Performance:

In-Memory: Instant (< 1ms)
SQLite: Fast (~1-10ms)
CDN: Slow (~200-1000ms)

Based on the latencies above, a cache hit is roughly 20-1000x faster than a CDN fetch.

4. Network Layer

CDN Data Source

Tif1 fetches data from TracingInsights GitHub repositories served through jsDelivr CDN:

# CDN URL structure
https://cdn.jsdelivr.net/gh/tracinginsights/{year}@main/
    {gp}/{session}/{path}

# Example URLs:
# Drivers data:
https://cdn.jsdelivr.net/gh/tracinginsights/2024@main/
    Monaco_Grand_Prix/Race/drivers.json

# Lap times for VER:
https://cdn.jsdelivr.net/gh/tracinginsights/2024@main/
    Monaco_Grand_Prix/Race/VER/laptimes.json

# Telemetry for VER lap 45:
https://cdn.jsdelivr.net/gh/tracinginsights/2024@main/
    Monaco_Grand_Prix/Race/VER/45_tel.json
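The URL pattern above can be expressed as a small helper. `build_cdn_url` and `CDN_BASE` are hypothetical names used only for this sketch; tif1 may assemble its URLs differently.

```python
CDN_BASE = "https://cdn.jsdelivr.net/gh/tracinginsights"


def build_cdn_url(year: int, gp: str, session: str, path: str) -> str:
    """Join the URL parts following the pattern shown above."""
    return f"{CDN_BASE}/{year}@main/{gp}/{session}/{path}"


drivers_url = build_cdn_url(2024, "Monaco_Grand_Prix", "Race", "drivers.json")
telemetry_url = build_cdn_url(2024, "Monaco_Grand_Prix", "Race", "VER/45_tel.json")
```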

Async HTTP Fetching

Tif1 uses parallel async fetching with HTTP/2 via niquests:

# From async_fetch.py
async def fetch_multiple_async(
    requests: list[tuple[int, str, str, str]],
    use_cache: bool = True,
    write_cache: bool = True,
    max_concurrent_requests: int = 10
) -> list[dict | None]:
    """Fetch multiple JSON files in parallel."""
    # Parallel HTTP requests with connection pooling
    async with niquests.AsyncSession() as session:
        tasks = [fetch_one(session, year, gp, sess, path) 
                 for year, gp, sess, path in requests]
        results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

Performance:

HTTP/2: Multiplexing, header compression
Connection Pooling: Reuse TCP connections
Parallel Fetching: Load multiple files simultaneously
Result: 4-5x faster than sequential loading
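The excerpt accepts a `max_concurrent_requests` parameter but does not show how the limit is applied. One common way to bound concurrency is an `asyncio.Semaphore`, sketched below. This is an assumption about the technique, not a copy of async_fetch.py; `fake_fetch` and `fetch_all` are hypothetical, and the fetch sleeps instead of touching the network.

```python
import asyncio


async def fake_fetch(path: str, delay: float = 0.01) -> dict:
    # Stand-in for an HTTP request; sleeps instead of using the network.
    await asyncio.sleep(delay)
    return {"path": path}


async def fetch_all(paths: list[str], max_concurrent_requests: int = 10) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def bounded(path: str) -> dict:
        # At most max_concurrent_requests coroutines hold the semaphore at once.
        async with semaphore:
            return await fake_fetch(path)

    # gather() returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(p) for p in paths))


paths = [f"driver_{i}/laptimes.json" for i in range(20)]
results = asyncio.run(fetch_all(paths))
```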

Async Loading Example:

import asyncio

import tif1

async def load_session():
    session = tif1.get_session(2024, "Monaco", "Race")
    laps = await session.laps_async()  # Parallel loading
    return laps

laps = asyncio.run(load_session())

For 20 drivers:

Sequential: ~10 seconds
Async parallel: ~0.4 seconds (25x faster)

Retry Logic

Automatic retry with exponential backoff:

# From retry.py
@retry_with_backoff(
    max_retries=3,
    backoff_factor=2.0,
    jitter=True,
    exceptions=(niquests.RequestException,)
)
def fetch_from_url(url: str) -> dict:
    """Fetch with automatic retry."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return parse_response_json(response)

Retry Strategy:

Attempt 1: Immediate
Attempt 2: Wait ~2 seconds
Attempt 3: Wait ~4 seconds
Attempt 4: Fail with exception
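The schedule above can be reproduced with a small decorator. This is a sketch under stated assumptions, not the code in retry.py: it treats `max_retries` as the total number of attempts, computes each wait as `backoff_factor ** attempt`, and adds an injectable `sleep` parameter (hypothetical) so the delays can be observed without waiting.

```python
import functools
import random
import time


def retry_with_backoff(max_retries=3, backoff_factor=2.0, jitter=True,
                       exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call, waiting backoff_factor ** attempt between tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries:
                        raise  # out of attempts: surface the last error
                    delay = backoff_factor ** attempt  # 2s, then 4s for factor 2.0
                    if jitter:
                        delay += random.uniform(0, 0.1 * delay)
                    sleep(delay)
        return wrapper
    return decorator


delays = []           # records the waits instead of actually sleeping
calls = {"n": 0}


@retry_with_backoff(max_retries=3, jitter=False,
                    exceptions=(ConnectionError,), sleep=delays.append)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky()
```

Jitter spreads out retries from many clients so they do not all hit the server at the same instant. It is disabled in the usage example only to keep the recorded delays deterministic.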

5. Processing Layer

JSON Parsing and Validation

# From core_utils/json_utils.py
def parse_response_json(response) -> dict:
    """Parse JSON with validation."""
    import orjson  # Fast JSON parser
    
    # Parse JSON (orjson is 2-3x faster than stdlib json)
    data = orjson.loads(response.content)
    
    # Validate structure
    if not isinstance(data, dict):
        raise InvalidDataError("Expected dict payload")
    
    return data

DataFrame Construction

Transform JSON to optimized DataFrames:

# From core.py:1086-1148
def _create_lap_df(lap_data: dict, driver: str, team: str, lib: str) -> DataFrame:
    """Create lap DataFrame with driver and team info (zero-copy optimized)."""
    
    # Create DataFrame (zero-copy when possible)
    if lib == 'polars':
        lap_df = pl.DataFrame(lap_data, strict=False)
        lap_df = lap_df.with_columns([
            pl.lit(driver).alias('Driver'),
            pl.lit(team).alias('Team')
        ])
    else:
        lap_df = pd.DataFrame(lap_data, copy=False)  # Zero-copy
        lap_df['Driver'] = driver
        lap_df['Team'] = team
    
    return lap_df

Type Optimization

Optimize memory usage with proper dtypes:

# From core.py:1167-1249
def _apply_laps_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Enforce dtype contract on pandas laps DataFrame."""
    
    # Timedelta columns (lap times, sector times)
    for col in ('LapTime', 'Sector1Time', 'Sector2Time', 'Sector3Time'):
        if col in df.columns:
            df[col] = pd.to_timedelta(df[col], unit='s')
    
    # Float64 columns (lap number, position, speeds)
    for col in ('LapNumber', 'Position', 'SpeedI1', 'SpeedI2'):
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce').astype('float64')
    
    # Categorical columns (driver, team, compound) - 50% memory reduction
    for col in ('Driver', 'Team', 'Compound'):
        if col in df.columns:
            df[col] = df[col].astype('category')
    
    return df

Type Optimization Benefits:

Categorical types: 50% memory reduction for repeated strings
Proper numeric types: Faster computations
Timedelta types: Native time operations

Data Flow Scenarios

Scenario 1: Cold Start (First Load)

No cached data available:

# User code
session = tif1.get_session(2024, "Monaco", "Race")
laps = session.laps  # First access

# Internal flow:
# 1. Check in-memory cache → MISS
# 2. Check SQLite cache → MISS
# 3. Build async requests for all drivers
# 4. Fetch 20 driver laptime files in parallel (~0.4s)
# 5. Parse and validate JSON
# 6. Construct DataFrame with proper dtypes
# 7. Apply categorical types
# 8. Store in both caches
# 9. Return to user

Performance: ~0.4-1.0 seconds for 20 drivers

Scenario 2: Warm Start (SQLite Cache)

Data exists in SQLite cache:

# User code (different process, same day)
session = tif1.get_session(2024, "Monaco", "Race")
laps = session.laps

# Internal flow:
# 1. Check in-memory cache → MISS (different process)
# 2. Check SQLite cache → HIT
# 3. Deserialize from SQLite (~10ms)
# 4. Store in in-memory cache
# 5. Return to user

Performance: ~10-50ms

Scenario 3: Hot Start (In-Memory Cache)

Data already loaded in current process:

# User code (same process)
session = tif1.get_session(2024, "Monaco", "Race")
laps1 = session.laps  # First access - loads data
laps2 = session.laps  # Second access

# Internal flow (second access):
# 1. Check in-memory cache → HIT
# 2. Return cached DataFrame immediately

Performance: < 1ms (instant)
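The cold, warm, and hot paths from Scenarios 1-3 can be condensed into one lookup function. In this sketch two plain dictionaries stand in for the in-memory and SQLite caches, and `load` and its `fetch` callback are hypothetical names. It illustrates the lookup order and cache promotion only.

```python
from typing import Callable

memory_cache: dict[str, dict] = {}   # level 1: process-local
disk_cache: dict[str, dict] = {}     # level 2: stands in for SQLite


def load(key: str, fetch: Callable[[str], dict]) -> tuple[dict, str]:
    """Return (data, source), where source is 'memory', 'disk', or 'cdn'."""
    if key in memory_cache:                      # hot start
        return memory_cache[key], "memory"
    if key in disk_cache:                        # warm start
        memory_cache[key] = disk_cache[key]      # promote to level 1
        return memory_cache[key], "disk"
    data = fetch(key)                            # cold start
    disk_cache[key] = data
    memory_cache[key] = data
    return data, "cdn"


key = "2024_Monaco_Race_laps"
_, first = load(key, lambda k: {"rows": 3})
_, second = load(key, lambda k: {"rows": 3})
memory_cache.clear()                             # simulate a new process
_, third = load(key, lambda k: {"rows": 3})
```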

Scenario 4: Telemetry Loading

Telemetry has a more granular flow:

# User code
lap = session.laps.pick_fastest()
tel = lap.telemetry

# Internal flow:
# 1. Identify driver and lap number
# 2. Check in-memory telemetry cache → MISS
# 3. Check SQLite telemetry_cache → MISS
# 4. Fetch telemetry JSON from CDN
#    URL: {year}/{gp}/{session}/{driver}/{lap}_tel.json
# 5. Parse telemetry data (arrays of sensor values)
# 6. Create telemetry DataFrame
# 7. Add metadata columns (Driver, LapNumber)
# 8. Store in both caches
# 9. Return to user

Performance:

Cold: ~200-500ms per lap
Cached: ~1-10ms

Scenario 5: Batch Telemetry Loading

Optimized parallel loading:

# User code
fastest_tels = session.get_fastest_laps_tels(by_driver=True)

# Internal flow:
# 1. Get fastest lap for each driver (from laps DataFrame)
# 2. Check which telemetry is cached
# 3. Build list of missing telemetry files
# 4. Fetch ALL missing telemetry in parallel
#    - 20 drivers, 20 parallel requests
#    - Uses asyncio.gather() for concurrency
# 5. Process all telemetry DataFrames
# 6. Concatenate into single DataFrame
# 7. Store each in cache
# 8. Return combined DataFrame

Performance:

Cold: ~0.4s for 20 drivers (parallel)
Sequential would be: ~10s (25x slower)
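Steps 2-7 of the batch flow amount to "fetch only what is missing, all at once". The sketch below illustrates that idea with a dictionary cache and a simulated fetch. `telemetry_cache`, `fetch_telemetry`, and `load_batch` are hypothetical names, and the driver and lap values are placeholders rather than real session data.

```python
import asyncio

# One entry is pre-cached so the example exercises both paths.
telemetry_cache: dict[tuple[str, int], dict] = {("VER", 45): {"cached": True}}


async def fetch_telemetry(driver: str, lap: int) -> dict:
    await asyncio.sleep(0.01)  # stand-in for a CDN request
    return {"driver": driver, "lap": lap}


async def load_batch(fastest: dict[str, int]) -> dict[str, dict]:
    keys = [(driver, lap) for driver, lap in fastest.items()]
    missing = [k for k in keys if k not in telemetry_cache]
    # Fetch only the uncached laps, concurrently.
    fetched = await asyncio.gather(*(fetch_telemetry(d, lap) for d, lap in missing))
    for k, data in zip(missing, fetched):
        telemetry_cache[k] = data
    return {driver: telemetry_cache[(driver, lap)] for driver, lap in keys}


batch = asyncio.run(load_batch({"VER": 45, "LEC": 52, "NOR": 50}))
```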

Ultra-Cold Mode

For maximum performance on first load, tif1 offers ultra-cold mode:

# From config.py
config = {
    'ultra_cold_start': True,
    'ultra_cold_skip_retries': True,
    'ultra_cold_background_cache_fill': True
}

Ultra-Cold Optimizations:

Skip Validation: Parse JSON without schema validation
Skip Retries: Fail fast on errors
Background Caching: Fetch data, return immediately, cache in background

# With ultra-cold mode
session = tif1.get_session(2024, "Monaco", "Race")
laps = session.laps  # Returns immediately with data
# Cache is filled in background thread

Performance: Can be 2-3x faster on first load

Ultra-cold mode trades reliability for speed. Use only when:

You need maximum performance
You can tolerate occasional errors
Data source is trusted
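Background cache filling can be pictured as returning the fetched data right away and handing the cache write to another thread. The sketch below shows that pattern with the standard `threading` module. `load_ultra_cold` and `fetch` are hypothetical, and how tif1 schedules its background writes is not shown in the source.

```python
import threading

cache: dict[str, dict] = {}


def fetch(key: str) -> dict:
    return {"key": key}  # stand-in for the network fetch


def load_ultra_cold(key: str) -> tuple[dict, threading.Thread]:
    data = fetch(key)
    # Write to the cache off the calling thread so the caller is not blocked.
    writer = threading.Thread(target=cache.__setitem__, args=(key, data), daemon=True)
    writer.start()
    return data, writer


data, writer = load_ultra_cold("2024_Monaco_Race_laps")
writer.join()  # only needed here so the example can inspect the cache
```

The trade-off is that a failed or interrupted background write goes unnoticed by the caller, which matches the reliability caveat above.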

Data Flow Diagrams

Complete Data Flow

Telemetry-Specific Flow

Performance Characteristics

Operation Latencies

Operation	Cold (No Cache)	Warm (SQLite)	Hot (Memory)
Load laps (20 drivers)	400-1000ms	10-50ms	<1ms
Load single telemetry	200-500ms	1-10ms	<1ms
Load 20 telemetry (parallel)	400-1000ms	10-50ms	<1ms
Load 20 telemetry (sequential)	10-15s	200-500ms	<1ms
Load weather	200-400ms	1-5ms	<1ms
Load race control	200-400ms	1-5ms	<1ms

Cache Effectiveness

# Typical cache hit rates in a workflow
import tif1

# First run (cold start)
session1 = tif1.get_session(2024, "Monaco", "Race")
laps1 = session1.laps  # 400ms - CDN fetch

# Second run (same process)
laps2 = session1.laps  # <1ms - memory cache hit

# Third run (different process, same machine)
session2 = tif1.get_session(2024, "Monaco", "Race")
laps3 = session2.laps  # 10ms - SQLite cache hit

Cache Hit Rate Expectations:

Development: 90-95% (iterating on same data)
Production: 60-80% (varied data access)
CI/CD: 0-20% (fresh environments)

Configuration

Cache Configuration

import tif1

# Disable all caching
session = tif1.get_session(2024, "Monaco", "Race", enable_cache=False)

# Custom cache directory
import os
os.environ['TIF1_CACHE_DIR'] = '/path/to/cache'

# Clear cache
from tif1.cache import get_cache
cache = get_cache()
cache.clear()

Network Configuration

import os

# Set request timeout (seconds)
os.environ['TIF1_TIMEOUT'] = '60'

# Set max retries
os.environ['TIF1_MAX_RETRIES'] = '5'

# Set retry backoff factor
os.environ['TIF1_RETRY_BACKOFF_FACTOR'] = '1.5'
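Environment variables are always strings, so a library has to parse them and fall back to defaults when they are unset. The sketch below shows one way this could look. `NetworkSettings` and `read_network_settings` are hypothetical, the timeout and retry defaults are taken from the values mentioned on this page, and the backoff default of 2.0 is assumed from the retry excerpt.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkSettings:
    timeout: float
    max_retries: int
    backoff_factor: float


def read_network_settings(env=os.environ) -> NetworkSettings:
    # Parse each variable, falling back to a default when it is not set.
    return NetworkSettings(
        timeout=float(env.get("TIF1_TIMEOUT", "30")),
        max_retries=int(env.get("TIF1_MAX_RETRIES", "3")),
        backoff_factor=float(env.get("TIF1_RETRY_BACKOFF_FACTOR", "2.0")),
    )


# A plain dict is passed here so the example does not modify os.environ.
settings = read_network_settings({"TIF1_TIMEOUT": "60", "TIF1_MAX_RETRIES": "5"})
```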

Troubleshooting

Slow Initial Load

Problem: First data access takes too long

Solutions:

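import asyncio
import os

# 1. Use async loading (await must run inside a coroutine)
async def load_laps(session):
    return await session.laps_async()  # Parallel loading

laps = asyncio.run(load_laps(session))

# 2. Enable ultra-cold mode
os.environ['TIF1_ULTRA_COLD_START'] = 'true'

# 3. Load only what you need
session.load(laps=True, telemetry=False, weather=False, messages=False)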

Cache Issues

Problem: Cache not being used

Check:

# Verify cache is enabled
print(session.enable_cache)  # Should be True

# Check cache directory
from tif1.cache import get_cache
cache = get_cache()
print(cache.cache_dir)  # Verify directory exists and is writable

# Clear corrupted cache
cache.clear()

Network Errors

Problem: Frequent network failures

Solutions:

import os

# Increase timeout
os.environ['TIF1_TIMEOUT'] = '60'  # Default: 30

# Increase retries
os.environ['TIF1_MAX_RETRIES'] = '5'  # Default: 3

# Check CDN status
# Visit: https://www.jsdelivr.com/

Related Topics

Sessions: Understanding session objects and lazy loading
Laps and Telemetry: Working with lap and telemetry data structures
Caching: Detailed cache configuration and management
Performance: Performance optimization tips and benchmarks
