Overview
AQI prediction is fundamentally a time series forecasting problem with multivariate inputs. This page explores the machine learning architectures supported by AQI Predictor and the rationale behind each approach.

No single architecture is universally best. The optimal choice depends on your data characteristics, prediction horizon, computational resources, and accuracy requirements.
Model Types
Recurrent Neural Networks (RNNs)
Recurrent networks are designed to process sequential data by maintaining internal state (memory) across time steps.
- LSTM
- GRU
- Bidirectional RNNs
Long Short-Term Memory Networks
LSTMs are the most popular architecture for time series prediction, including AQI forecasting. A code sketch follows the lists below.
Key Components:
- Forget Gate: Decides what information to discard
- Input Gate: Decides what new information to store
- Output Gate: Decides what to output based on cell state
- Cell State: Long-term memory carrier
Strengths:
- Captures long-term dependencies (days, weeks)
- Handles variable-length sequences
- Well-suited for multiple time horizons
- Robust to gradient vanishing problem
Best For:
- Medium to long-term predictions (6-48 hours)
- When long-term patterns matter
- Multiple pollutants with complex interactions
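As a concrete starting point, here is a minimal Keras sketch of a stacked LSTM forecaster. The framework choice, layer sizes, and the 24-hour lookback and horizon are illustrative assumptions, not settings prescribed by AQI Predictor.

```python
import tensorflow as tf

LOOKBACK, N_FEATURES, HORIZON = 24, 6, 24  # illustrative sizes, not project defaults

model = tf.keras.Sequential([
    tf.keras.Input(shape=(LOOKBACK, N_FEATURES)),
    tf.keras.layers.LSTM(128, return_sequences=True),  # first layer passes the full sequence on
    tf.keras.layers.LSTM(64),                          # second layer summarizes the sequence
    tf.keras.layers.Dense(HORIZON),                    # one output per forecast step
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```

The (128, 64) stack matches the "Standard production" row in the selection guide at the end of this page.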
Transformer-Based Models
Transformers use self-attention mechanisms to capture relationships across time steps without recurrence.
Temporal Fusion Transformer (TFT)
A state-of-the-art architecture for time series forecasting, TFT combines several advanced components (a configuration sketch follows the lists below).
Key Components:
- Variable Selection: Learns which features are most relevant
- Static Covariates: Incorporates time-invariant features
- Multi-Horizon: Predicts multiple time steps simultaneously
- Attention: Interprets which past time steps influence predictions
Advantages:
- Superior accuracy on complex datasets
- Interpretable attention weights
- Handles multiple types of inputs naturally
- Built-in uncertainty quantification
Requirements:
- Large datasets (2+ years recommended)
- More computational resources
- Longer training time
- Hyperparameter tuning is critical
Best For:
- Production systems with high accuracy needs
- When interpretability matters
- Multi-step ahead predictions
- Rich feature sets with multiple data types
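For illustration only, here is how a TFT might be configured with the pytorch-forecasting library, assuming that library is in use; the dataframe, column names, and hyperparameters below are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# Hypothetical hourly frame: one station, AQI target plus covariates.
n = 24 * 365 * 2  # roughly the 2+ years of data recommended above
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "station": "st_01",
    "time_idx": np.arange(n),
    "aqi": rng.uniform(20, 180, n),
    "pm25": rng.uniform(5, 120, n),
    "temp": rng.uniform(-5, 35, n),
})

training = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="aqi",
    group_ids=["station"],
    max_encoder_length=48,       # lookback window
    max_prediction_length=24,    # multi-horizon output
    time_varying_unknown_reals=["aqi", "pm25", "temp"],
)
tft = TemporalFusionTransformer.from_dataset(
    training,
    hidden_size=32,
    attention_head_size=4,
    dropout=0.1,
    loss=QuantileLoss(),  # quantile outputs give the built-in uncertainty estimates
)
```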
Standard Transformer
A pure transformer architecture adapted for time series, using attention-based sequence modeling (sketched after the lists below).
Key Mechanisms:
- Self-Attention: Captures relationships between all time steps
- Positional Encoding: Injects temporal order information
- Multi-Head Attention: Learns several attention patterns in parallel
Advantages:
- Parallelizable (faster training than RNNs)
- No vanishing gradient problems
- Can capture long-range dependencies
Limitations:
- Requires more data than RNNs
- Can overfit on smaller datasets
- Less inductive bias for temporal structure
Best For:
- Very large datasets
- When training time matters
- Long sequences (> 48 hours lookback)
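The following Keras sketch shows the three mechanisms above in their simplest form: a feature projection, additive sinusoidal positional encoding, and one self-attention block with residual connections. All sizes are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

LOOKBACK, N_FEATURES, D_MODEL, HORIZON = 48, 6, 64, 24  # illustrative sizes

def positional_encoding(length, depth):
    """Sinusoidal encoding that injects temporal order information."""
    positions = np.arange(length)[:, None].astype("float64")
    dims = np.arange(depth)[None, :]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / depth)
    angles = positions * angle_rates
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return angles.astype("float32")

inputs = tf.keras.Input(shape=(LOOKBACK, N_FEATURES))
x = tf.keras.layers.Dense(D_MODEL)(inputs)      # project features to the model dimension
x = x + positional_encoding(LOOKBACK, D_MODEL)  # add temporal order information
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=D_MODEL // 4)(x, x)
x = tf.keras.layers.LayerNormalization()(x + attn)  # residual connection + norm
ff = tf.keras.layers.Dense(D_MODEL, activation="relu")(x)
x = tf.keras.layers.LayerNormalization()(x + ff)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(HORIZON)(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```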
Hybrid Architectures
Combining multiple architectural components often yields the best results.
CNN-LSTM
Convolutional layers extract local patterns; the LSTM captures temporal dependencies.
Use Case:
- Extract local temporal patterns (hourly cycles)
- Good for multi-sensor or spatial data
- Reduces sequence length for LSTM
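A minimal Keras sketch of the pattern, with illustrative sizes: a causal Conv1D picks up local patterns, pooling halves the sequence, and the LSTM models what remains.

```python
import tensorflow as tf

LOOKBACK, N_FEATURES, HORIZON = 48, 6, 24  # illustrative sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(LOOKBACK, N_FEATURES)),
    tf.keras.layers.Conv1D(64, kernel_size=3, padding="causal", activation="relu"),  # local patterns
    tf.keras.layers.MaxPooling1D(pool_size=2),  # halves the sequence length fed to the LSTM
    tf.keras.layers.LSTM(64),                   # temporal dependencies on the shortened sequence
    tf.keras.layers.Dense(HORIZON),
])
model.compile(optimizer="adam", loss="mse")
```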
Encoder-Decoder
Separate encoding and decoding phases.
Use Case:
- Multi-step ahead prediction
- Sequence-to-sequence mapping
- When output length differs from input
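A classic Keras realization of this idea, with illustrative sizes; note that the output horizon (12 steps) deliberately differs from the input window (24 steps).

```python
import tensorflow as tf

LOOKBACK, N_FEATURES, HORIZON = 24, 6, 12  # output length differs from input

model = tf.keras.Sequential([
    tf.keras.Input(shape=(LOOKBACK, N_FEATURES)),
    tf.keras.layers.LSTM(64),                         # encoder: compress history into a vector
    tf.keras.layers.RepeatVector(HORIZON),            # repeat the context once per output step
    tf.keras.layers.LSTM(64, return_sequences=True),  # decoder: unroll over the horizon
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),  # one AQI value per step
])
model.compile(optimizer="adam", loss="mse")
```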
Attention + LSTM
An LSTM with an attention mechanism. The attention layer helps the model focus on the most relevant past time steps.
Use Case:
- Improved accuracy over plain LSTM
- Interpretable predictions
- Long sequences
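One way to sketch this in Keras (illustrative sizes): keep every LSTM hidden state with return_sequences=True, then let a self-attention layer weight those states before the output head.

```python
import tensorflow as tf

LOOKBACK, N_FEATURES, HORIZON = 48, 6, 24  # illustrative sizes

inputs = tf.keras.Input(shape=(LOOKBACK, N_FEATURES))
seq = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)  # keep all time steps
attn = tf.keras.layers.Attention()([seq, seq])                 # self-attention over LSTM states
ctx = tf.keras.layers.GlobalAveragePooling1D()(attn)           # pool the attended states
outputs = tf.keras.layers.Dense(HORIZON)(ctx)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```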
Ensemble Models
Combines predictions from multiple models, via an average or weighted combination of LSTM, GRU, and Transformer outputs.
Use Case:
- Maximum accuracy
- Reduce model variance
- Production systems
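The combination step itself is framework-agnostic. A small NumPy sketch, assuming you already have prediction arrays from the trained models (the variable names are hypothetical):

```python
import numpy as np

def ensemble(preds, weights=None):
    """Average (or weighted-average) predictions from several models."""
    preds = np.stack(preds)  # (n_models, n_samples, horizon)
    if weights is None:
        return preds.mean(axis=0)  # simple average
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), preds, axes=1)  # weighted combination

# Hypothetical usage, weighting models by validation performance:
# final = ensemble([preds_lstm, preds_gru, preds_tft], weights=[0.5, 0.3, 0.2])
```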
Training Considerations
Input/Output Configuration
- Univariate vs Multivariate
- Single-step vs Multi-step
- Point vs Probabilistic
Univariate Prediction
- Predict one pollutant based on its history
- Simpler, faster training
- Limited by single variable view
Multivariate Prediction
- Use multiple variables to predict the target
- Captures cross-pollutant relationships
- Better accuracy but more complex
- Recommended approach
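Whichever architecture you choose, multivariate training data has to be shaped into windows. The sketch below shows one common NumPy approach: every feature enters the input window, but only the target column (assumed here to be column 0) appears in the output. The function name is hypothetical.

```python
import numpy as np

def make_windows(data, lookback, horizon, target_col=0):
    """Slice a (time, features) array into supervised windows.

    Returns X of shape (samples, lookback, features) and
    y of shape (samples, horizon) holding future target values.
    """
    X, y = [], []
    for i in range(len(data) - lookback - horizon + 1):
        X.append(data[i : i + lookback])  # all features as input
        y.append(data[i + lookback : i + lookback + horizon, target_col])  # future target only
    return np.array(X), np.array(y)

# Example: 1000 hourly rows, 6 features (target first), 24h lookback, 24h horizon.
data = np.random.rand(1000, 6)
X, y = make_windows(data, lookback=24, horizon=24)
print(X.shape, y.shape)  # (953, 24, 6) (953, 24)
```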
Loss Functions
Mean Squared Error (MSE)
The standard regression loss.
Characteristics:
- Penalizes large errors heavily (quadratic)
- Sensitive to outliers
- Most common choice
Mean Absolute Error (MAE)
Robust to outliers.
Characteristics:
- Linear penalty
- More robust to outliers
- All errors weighted equally
Huber Loss
A hybrid of MSE and MAE.
Characteristics:
- Quadratic for small errors, linear for large ones
- Robust but still sensitive to large errors
- Requires delta parameter tuning
Quantile Loss
For probabilistic predictions.
Characteristics:
- Predicts specific quantiles (e.g., P10, P50, P90)
- Asymmetric penalty
- Produces prediction intervals
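Quantile (pinball) loss is the least standard of the four, so a small NumPy sketch may help; the sample values are made up. Under-prediction is penalized q/(1-q) times more than over-prediction, which is what pushes the model output toward the q-th quantile.

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Pinball loss for quantile q: asymmetric penalty around the target."""
    err = y_true - y_pred
    return np.mean(np.maximum(q * err, (q - 1) * err))

y_true = np.array([50.0, 80.0, 120.0])
y_pred = np.array([55.0, 75.0, 110.0])
for q in (0.1, 0.5, 0.9):  # P10, P50, P90 as in the bullet above
    print(q, quantile_loss(y_true, y_pred, q))
```

Training heads on q = 0.1, 0.5, 0.9 yields a P10-P90 prediction interval around the median forecast.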
Regularization Techniques
Dropout
Randomly deactivate neurons during training to prevent co-adaptation.
Typical rates: 0.2-0.5
Apply to: Dense layers, recurrent connections
Recurrent Dropout
Dropout applied to recurrent connections in LSTM/GRU.
Typical rates: 0.1-0.3
Careful: Too high degrades temporal learning
L1/L2 Regularization
Penalize large weights through a term added to the loss function.
L2 (Ridge): Smooth weight decay
L1 (Lasso): Encourages sparse weights
Typical values: 1e-5 to 1e-3
Early Stopping
Stop training when validation loss stops improving.
Patience: 10-20 epochs
Often the single most effective regularization technique.
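The sketch below combines all four techniques in one Keras model; the rates and the patience value simply follow the ranges quoted above, and the framework choice is an assumption.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(24, 6)),  # illustrative lookback and feature count
    tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.1),  # input + recurrent dropout
    tf.keras.layers.Dense(
        24,
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # L2 penalty on the weights
    ),
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=15,                 # within the 10-20 epoch range above
    restore_best_weights=True,   # roll back to the best validation epoch
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, callbacks=[early_stop])
```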
Model Evaluation
Metrics
Use multiple metrics to get a complete picture of model performance. Different metrics emphasize different aspects.
| Metric | Formula | Interpretation | Use Case |
|---|---|---|---|
| RMSE | √(MSE) | Same units as target | Overall accuracy, penalizes large errors |
| MAE | Mean(\|y - ŷ\|) | Same units, robust to outliers | Typical prediction error |
| MAPE | Mean(\|y - ŷ\|/\|y\|) × 100 | Percentage error (undefined when y = 0) | Relative accuracy across scales |
| R² | 1 - (SS_res/SS_tot) | Variance explained (1 is perfect; can be negative) | Model quality vs baseline |
| IA | 1 - Σ(y - ŷ)²/Σ(\|ŷ - ȳ\| + \|y - ȳ\|)² | 0-1, how well model matches observations | Overall performance |
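All five metrics are straightforward to compute directly; the NumPy sketch below uses Willmott's index of agreement for IA and assumes no zero observations for MAPE. The sample arrays are made up.

```python
import numpy as np

def evaluate(y, yhat):
    """Compute the table's metrics (IA is Willmott's index of agreement)."""
    err = y - yhat
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err) / np.abs(y)) * 100  # assumes y is never zero
    r2 = 1 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    ia = 1 - np.sum(err ** 2) / np.sum(
        (np.abs(yhat - y.mean()) + np.abs(y - y.mean())) ** 2
    )
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2, "IA": ia}

y = np.array([50.0, 80.0, 120.0, 90.0])
yhat = np.array([55.0, 75.0, 110.0, 95.0])
print(evaluate(y, yhat))
```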
Validation Strategy
- Time Series Split
- Walk-Forward Validation
- Holdout Test Set
Forward-chaining validation
- Respect temporal order (no future data in training)
- Split chronologically
- Always use this for time series
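scikit-learn's TimeSeriesSplit implements exactly this forward-chaining scheme: each fold trains on an expanding past window and validates on the block immediately after it. A quick sketch with stand-in data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # stand-in for windowed features

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Validation indices always come after the training indices.
    print(f"fold {fold}: train [0..{train_idx[-1]}], val [{val_idx[0]}..{val_idx[-1]}]")
```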
Architecture Selection Guide
Start simple, add complexity only if needed. An LSTM or GRU is sufficient for most AQI prediction tasks.
Quick Recommendations
| Scenario | Recommended Architecture | Lookback | Horizon |
|---|---|---|---|
| Quick prototype | GRU (64 units) | 12h | 6h |
| Standard production | LSTM (128, 64 units) | 24h | 24h |
| High-accuracy system | TFT or Ensemble | 48h | 48h |
| Limited compute | GRU (single layer) | 6h | 3h |
| Research / state-of-the-art | Transformer + Attention | 72h | 72h |
Next Steps: With this understanding of architectures, you’re ready to explore the Quick Start Guide to begin training your first model.