Overview
AQI prediction is fundamentally a time series forecasting problem with multivariate inputs. This page explores the machine learning architectures supported by AQI Predictor and the rationale behind each approach.

No single architecture is universally best. The optimal choice depends on your data characteristics, prediction horizon, computational resources, and accuracy requirements.
Model Types
Recurrent Neural Networks (RNNs)
Recurrent networks are designed to process sequential data by maintaining internal state (memory) across time steps.
- LSTM
- GRU
- Bidirectional RNNs
Long Short-Term Memory Networks
LSTMs are the most popular architecture for time series prediction, including AQI forecasting. A code sketch follows the lists below.
Key Components:
- Forget Gate: Decides what information to discard
- Input Gate: Decides what new information to store
- Output Gate: Decides what to output based on cell state
- Cell State: Long-term memory carrier
Strengths:
- Captures long-term dependencies (days, weeks)
- Handles variable-length sequences
- Well-suited for multiple time horizons
- Robust to gradient vanishing problem
Best For:
- Medium to long-term predictions (6-48 hours)
- When long-term patterns matter
- Multiple pollutants with complex interactions
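As a concrete starting point, here is a minimal Keras sketch of a stacked LSTM forecaster. The framework choice, layer sizes, and the 24-hour lookback and horizon are illustrative assumptions, not settings prescribed by AQI Predictor.

```python
import tensorflow as tf

LOOKBACK, N_FEATURES, HORIZON = 24, 6, 24  # illustrative sizes, not project defaults

model = tf.keras.Sequential([
    tf.keras.Input(shape=(LOOKBACK, N_FEATURES)),
    tf.keras.layers.LSTM(128, return_sequences=True),  # first layer passes the full sequence on
    tf.keras.layers.LSTM(64),                          # second layer summarizes the sequence
    tf.keras.layers.Dense(HORIZON),                    # one output per forecast step
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```

The (128, 64) stack matches the "Standard production" row in the selection guide at the end of this page.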
Transformer-Based Models
Transformers use self-attention mechanisms to capture relationships across time steps without recurrence.
Temporal Fusion Transformer (TFT)
A state-of-the-art architecture for time series forecasting, TFT combines several advanced components (a configuration sketch follows the lists below).
Key Components:
- Variable Selection: Learns which features are most relevant
- Static Covariates: Incorporates time-invariant features
- Multi-Horizon: Predicts multiple time steps simultaneously
- Attention: Interprets which past time steps influence predictions
Advantages:
- Superior accuracy on complex datasets
- Interpretable attention weights
- Handles multiple types of inputs naturally
- Built-in uncertainty quantification
Requirements:
- Large datasets (2+ years recommended)
- More computational resources
- Longer training time
- Hyperparameter tuning is critical
Best For:
- Production systems with high accuracy needs
- When interpretability matters
- Multi-step ahead predictions
- Rich feature sets with multiple data types
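For illustration only, here is how a TFT might be configured with the pytorch-forecasting library, assuming that library is in use; the dataframe, column names, and hyperparameters below are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# Hypothetical hourly frame: one station, AQI target plus covariates.
n = 24 * 365 * 2  # roughly the 2+ years of data recommended above
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "station": "st_01",
    "time_idx": np.arange(n),
    "aqi": rng.uniform(20, 180, n),
    "pm25": rng.uniform(5, 120, n),
    "temp": rng.uniform(-5, 35, n),
})

training = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="aqi",
    group_ids=["station"],
    max_encoder_length=48,       # lookback window
    max_prediction_length=24,    # multi-horizon output
    time_varying_unknown_reals=["aqi", "pm25", "temp"],
)
tft = TemporalFusionTransformer.from_dataset(
    training,
    hidden_size=32,
    attention_head_size=4,
    dropout=0.1,
    loss=QuantileLoss(),  # quantile outputs give the built-in uncertainty estimates
)
```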
Standard Transformer
A pure transformer architecture adapted for time series, using attention-based sequence modeling (sketched after the lists below).
Key Mechanisms:
- Self-Attention: Captures relationships between all time steps
- Positional Encoding: Injects temporal order information
- Multi-Head Attention: Learns several attention patterns in parallel
Advantages:
- Parallelizable (faster training than RNNs)
- No vanishing gradient problems
- Can capture long-range dependencies
Limitations:
- Requires more data than RNNs
- Can overfit on smaller datasets
- Less inductive bias for temporal structure
Best For:
- Very large datasets
- When training time matters
- Long sequences (> 48 hours lookback)
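The following Keras sketch shows the three mechanisms above in their simplest form: a feature projection, additive sinusoidal positional encoding, and one self-attention block with residual connections. All sizes are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

LOOKBACK, N_FEATURES, D_MODEL, HORIZON = 48, 6, 64, 24  # illustrative sizes

def positional_encoding(length, depth):
    """Sinusoidal encoding that injects temporal order information."""
    positions = np.arange(length)[:, None].astype("float64")
    dims = np.arange(depth)[None, :]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / depth)
    angles = positions * angle_rates
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return angles.astype("float32")

inputs = tf.keras.Input(shape=(LOOKBACK, N_FEATURES))
x = tf.keras.layers.Dense(D_MODEL)(inputs)      # project features to the model dimension
x = x + positional_encoding(LOOKBACK, D_MODEL)  # add temporal order information
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=D_MODEL // 4)(x, x)
x = tf.keras.layers.LayerNormalization()(x + attn)  # residual connection + norm
ff = tf.keras.layers.Dense(D_MODEL, activation="relu")(x)
x = tf.keras.layers.LayerNormalization()(x + ff)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(HORIZON)(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```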
Hybrid Architectures
Combining multiple architectural components often yields the best results.
CNN-LSTM
Convolutional layers extract local patterns; the LSTM captures temporal dependencies.
Use Case:
- Extract local temporal patterns (hourly cycles)
- Good for multi-sensor or spatial data
- Reduces sequence length for LSTM
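A minimal Keras sketch of the pattern, with illustrative sizes: a causal Conv1D picks up local patterns, pooling halves the sequence, and the LSTM models what remains.

```python
import tensorflow as tf

LOOKBACK, N_FEATURES, HORIZON = 48, 6, 24  # illustrative sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(LOOKBACK, N_FEATURES)),
    tf.keras.layers.Conv1D(64, kernel_size=3, padding="causal", activation="relu"),  # local patterns
    tf.keras.layers.MaxPooling1D(pool_size=2),  # halves the sequence length fed to the LSTM
    tf.keras.layers.LSTM(64),                   # temporal dependencies on the shortened sequence
    tf.keras.layers.Dense(HORIZON),
])
model.compile(optimizer="adam", loss="mse")
```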
Encoder-Decoder
Separate encoding and decoding phases.
Use Case:
- Multi-step ahead prediction
- Sequence-to-sequence mapping
- When output length differs from input
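A classic Keras realization of this idea, with illustrative sizes; note that the output horizon (12 steps) deliberately differs from the input window (24 steps).

```python
import tensorflow as tf

LOOKBACK, N_FEATURES, HORIZON = 24, 6, 12  # output length differs from input

model = tf.keras.Sequential([
    tf.keras.Input(shape=(LOOKBACK, N_FEATURES)),
    tf.keras.layers.LSTM(64),                         # encoder: compress history into a vector
    tf.keras.layers.RepeatVector(HORIZON),            # repeat the context once per output step
    tf.keras.layers.LSTM(64, return_sequences=True),  # decoder: unroll over the horizon
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),  # one AQI value per step
])
model.compile(optimizer="adam", loss="mse")
```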
Attention + LSTM
An LSTM with an attention mechanism. The attention layer helps the model focus on the most relevant past time steps.
Use Case:
- Improved accuracy over plain LSTM
- Interpretable predictions
- Long sequences
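One way to sketch this in Keras (illustrative sizes): keep every LSTM hidden state with return_sequences=True, then let a self-attention layer weight those states before the output head.

```python
import tensorflow as tf

LOOKBACK, N_FEATURES, HORIZON = 48, 6, 24  # illustrative sizes

inputs = tf.keras.Input(shape=(LOOKBACK, N_FEATURES))
seq = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)  # keep all time steps
attn = tf.keras.layers.Attention()([seq, seq])                 # self-attention over LSTM states
ctx = tf.keras.layers.GlobalAveragePooling1D()(attn)           # pool the attended states
outputs = tf.keras.layers.Dense(HORIZON)(ctx)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```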
Ensemble Models
Combines predictions from multiple models, via an average or weighted combination of LSTM, GRU, and Transformer outputs.
Use Case:
- Maximum accuracy
- Reduce model variance
- Production systems
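The combination step itself is framework-agnostic. A small NumPy sketch, assuming you already have prediction arrays from the trained models (the variable names are hypothetical):

```python
import numpy as np

def ensemble(preds, weights=None):
    """Average (or weighted-average) predictions from several models."""
    preds = np.stack(preds)  # (n_models, n_samples, horizon)
    if weights is None:
        return preds.mean(axis=0)  # simple average
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), preds, axes=1)  # weighted combination

# Hypothetical usage, weighting models by validation performance:
# final = ensemble([preds_lstm, preds_gru, preds_tft], weights=[0.5, 0.3, 0.2])
```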
Training Considerations
Input/Output Configuration
- Univariate vs Multivariate
- Single-step vs Multi-step
- Point vs Probabilistic
Univariate Prediction
- Predict one pollutant based on its history
- Simpler, faster training
- Limited by single variable view
Multivariate Prediction
- Use multiple variables to predict the target
- Captures cross-pollutant relationships
- Better accuracy but more complex
- Recommended approach
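Whichever architecture you choose, multivariate training data has to be shaped into windows. The sketch below shows one common NumPy approach: every feature enters the input window, but only the target column (assumed here to be column 0) appears in the output. The function name is hypothetical.

```python
import numpy as np

def make_windows(data, lookback, horizon, target_col=0):
    """Slice a (time, features) array into supervised windows.

    Returns X of shape (samples, lookback, features) and
    y of shape (samples, horizon) holding future target values.
    """
    X, y = [], []
    for i in range(len(data) - lookback - horizon + 1):
        X.append(data[i : i + lookback])  # all features as input
        y.append(data[i + lookback : i + lookback + horizon, target_col])  # future target only
    return np.array(X), np.array(y)

# Example: 1000 hourly rows, 6 features (target first), 24h lookback, 24h horizon.
data = np.random.rand(1000, 6)
X, y = make_windows(data, lookback=24, horizon=24)
print(X.shape, y.shape)  # (953, 24, 6) (953, 24)
```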
Loss Functions
Mean Squared Error (MSE)
The standard regression loss.
Characteristics:
- Penalizes large errors heavily (quadratic)
- Sensitive to outliers
- Most common choice
Mean Absolute Error (MAE)
Robust to outliers.
Characteristics:
- Linear penalty
- More robust to outliers
- All errors weighted equally
Huber Loss
A hybrid of MSE and MAE.
Characteristics:
- Quadratic for small errors, linear for large ones
- Robust but still sensitive to large errors
- Requires delta parameter tuning
Quantile Loss
For probabilistic predictions.
Characteristics:
- Predicts specific quantiles (e.g., P10, P50, P90)
- Asymmetric penalty
- Produces prediction intervals
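Quantile (pinball) loss is the least standard of the four, so a small NumPy sketch may help; the sample values are made up. Under-prediction is penalized q/(1-q) times more than over-prediction, which is what pushes the model output toward the q-th quantile.

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Pinball loss for quantile q: asymmetric penalty around the target."""
    err = y_true - y_pred
    return np.mean(np.maximum(q * err, (q - 1) * err))

y_true = np.array([50.0, 80.0, 120.0])
y_pred = np.array([55.0, 75.0, 110.0])
for q in (0.1, 0.5, 0.9):  # P10, P50, P90 as in the bullet above
    print(q, quantile_loss(y_true, y_pred, q))
```

Training heads on q = 0.1, 0.5, 0.9 yields a P10-P90 prediction interval around the median forecast.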
Regularization Techniques
Dropout
Randomly deactivate neurons during training to prevent co-adaptation.
Typical rates: 0.2-0.5
Apply to: Dense layers, recurrent connections
Recurrent Dropout
Dropout applied to recurrent connections in LSTM/GRU.
Typical rates: 0.1-0.3
Careful: Too high degrades temporal learning
L1/L2 Regularization
Penalize large weights through a term added to the loss function.
L2 (Ridge): Smooth weight decay
L1 (Lasso): Encourages sparse weights
Typical values: 1e-5 to 1e-3
Early Stopping
Stop training when validation loss stops improving.
Patience: 10-20 epochs
Often the single most effective regularization technique.
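The sketch below combines all four techniques in one Keras model; the rates and the patience value simply follow the ranges quoted above, and the framework choice is an assumption.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(24, 6)),  # illustrative lookback and feature count
    tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.1),  # input + recurrent dropout
    tf.keras.layers.Dense(
        24,
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # L2 penalty on the weights
    ),
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=15,                 # within the 10-20 epoch range above
    restore_best_weights=True,   # roll back to the best validation epoch
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, callbacks=[early_stop])
```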
Model Evaluation
Metrics
Use multiple metrics to get a complete picture of model performance. Different metrics emphasize different aspects.
| Metric | Formula | Interpretation | Use Case |
|---|---|---|---|
| RMSE | √(MSE) | Same units as target | Overall accuracy, penalizes large errors |
| MAE | Mean(\|y - ŷ\|) | Same units, robust to outliers | Typical prediction error |
| MAPE | Mean(\|y - ŷ\|/\|y\|) × 100 | Percentage error (undefined when y = 0) | Relative accuracy across scales |
| R² | 1 - (SS_res/SS_tot) | Variance explained (1 is perfect; can be negative) | Model quality vs baseline |
| IA | 1 - Σ(y - ŷ)²/Σ(\|ŷ - ȳ\| + \|y - ȳ\|)² | 0-1, how well model matches observations | Overall performance |
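All five metrics are straightforward to compute directly; the NumPy sketch below uses Willmott's index of agreement for IA and assumes no zero observations for MAPE. The sample arrays are made up.

```python
import numpy as np

def evaluate(y, yhat):
    """Compute the table's metrics (IA is Willmott's index of agreement)."""
    err = y - yhat
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err) / np.abs(y)) * 100  # assumes y is never zero
    r2 = 1 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    ia = 1 - np.sum(err ** 2) / np.sum(
        (np.abs(yhat - y.mean()) + np.abs(y - y.mean())) ** 2
    )
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2, "IA": ia}

y = np.array([50.0, 80.0, 120.0, 90.0])
yhat = np.array([55.0, 75.0, 110.0, 95.0])
print(evaluate(y, yhat))
```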
Validation Strategy
- Time Series Split
- Walk-Forward Validation
- Holdout Test Set
Forward-chaining validation
- Respect temporal order (no future data in training)
- Split chronologically
- Always use this for time series
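scikit-learn's TimeSeriesSplit implements exactly this forward-chaining scheme: each fold trains on an expanding past window and validates on the block immediately after it. A quick sketch with stand-in data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # stand-in for windowed features

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Validation indices always come after the training indices.
    print(f"fold {fold}: train [0..{train_idx[-1]}], val [{val_idx[0]}..{val_idx[-1]}]")
```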
Architecture Selection Guide
Start simple, add complexity only if needed. An LSTM or GRU is sufficient for most AQI prediction tasks.
Quick Recommendations
| Scenario | Recommended Architecture | Lookback | Horizon |
|---|---|---|---|
| Quick prototype | GRU (64 units) | 12h | 6h |
| Standard production | LSTM (128, 64 units) | 24h | 24h |
| High-accuracy system | TFT or Ensemble | 48h | 48h |
| Limited compute | GRU (single layer) | 6h | 3h |
| Research / state-of-the-art | Transformer + Attention | 72h | 72h |
Next Steps: With this understanding of architectures, you’re ready to explore the Quick Start Guide to begin training your first model.