Before diving into the preprocessing pipeline, it helps to understand the physical and mathematical concepts that underpin the quality-control filters and evaluation metrics used throughout this workshop. This page covers the essentials: what GHI is, how solar geometry defines its physical limits, how the clear-sky model works, and what makes time-series data different from ordinary tabular data for Machine Learning.
Global Horizontal Irradiance (GHI)
GHI (Global Horizontal Irradiance) is the total solar radiation received on a horizontal surface at ground level, expressed in W/m². It combines:

- Direct normal irradiance (DNI): the beam component coming straight from the solar disk, projected onto the horizontal plane.
- Diffuse horizontal irradiance (DHI): the scattered component arriving from the rest of the sky dome.

In terms of the solar zenith angle θz, `GHI = DNI × cos(θz) + DHI`.

GHI data is commonly used for:

- Solar energy estimation: calculating how much energy a location receives over a day, month, or year.
- Photovoltaic (PV) system sizing: determining how large a panel array needs to be to meet a given energy demand.
- Climate and atmospheric analysis: tracking cloud cover, aerosol loading, and surface radiation budgets.
Top of Atmosphere (TOA) Irradiance
The Top of Atmosphere irradiance (TOA) is the solar irradiance that would reach a horizontal surface if there were no atmosphere at all. It is the hard physical upper bound for any GHI measurement at ground level: no real observation can exceed it. TOA is computed in `Geo.py` using the formula:

`TOA = 1361 × E0 × cos(θz)`

- `1361` W/m² is the solar constant, the mean solar irradiance at the top of the atmosphere.
- `E0` is the orbital correction factor, which accounts for the slight variation in Earth–Sun distance throughout the year. It is computed from the day-of-year ordinal `N` as `1 + 0.033 × cos(2π × N / 365)`.
- `cos(θz)` is the cosine of the Solar Zenith Angle (stored as `CTZ` in the code, short for "cosine of theta z"). When the sun is directly overhead, `cos(θz) = 1` and TOA reaches its maximum. When the sun is at the horizon, `cos(θz) → 0` and TOA → 0.
- When `CTZ < 0` (sun below the horizon), TOA is set to zero.
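The formula and its terms can be put together in a few lines. The following is a minimal sketch, not the code from `Geo.py`; the function names `orbital_correction` and `toa_irradiance` are hypothetical, and only the constants and the zero clamp come from the text above.

```python
import math

SOLAR_CONSTANT = 1361.0  # W/m², value quoted in the text


def orbital_correction(day_of_year: int) -> float:
    """Earth-Sun distance correction factor E0 for day-of-year N."""
    return 1.0 + 0.033 * math.cos(2.0 * math.pi * day_of_year / 365.0)


def toa_irradiance(day_of_year: int, ctz: float) -> float:
    """TOA irradiance on a horizontal surface, in W/m².

    ctz is cos(θz). Negative values mean the sun is below the horizon,
    in which case TOA is set to zero.
    """
    if ctz < 0.0:
        return 0.0
    return SOLAR_CONSTANT * orbital_correction(day_of_year) * ctz


print(toa_irradiance(1, 1.0))     # overhead sun in early January
print(toa_irradiance(172, 0.5))   # sun 60° from the zenith in late June
print(toa_irradiance(172, -0.2))  # sun below the horizon: 0.0
```

Because `E0` peaks in early January, the same zenith angle yields a slightly higher TOA in January than in late June.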
Solar Zenith Angle (SZA)
The Solar Zenith Angle (SZA) is the angle between the sun's position and the vertical (zenith). It is the complement of the solar elevation angle.

SZA is computed in `Geo.py` from the cosine of the zenith angle, `CTZ`. `CTZ` itself is calculated from the solar declination δ, the latitude φ, and the hour angle ω.
Key SZA thresholds:
- SZA = 0°: sun directly overhead (maximum possible irradiance for the given date and location).
- SZA = 90°: sun exactly at the horizon. At this angle, `cos(θz) = 0` and TOA = 0.
- SZA > 90°: the sun is below the horizon. Any non-zero GHI reading at SZA > 90° is physically impossible and must be treated as noise, a sensor offset, or an electronic artefact.
This check is implemented in `QualityControl.py`.
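The geometry in this section can be sketched with the standard textbook relation `cos(θz) = sin(φ)·sin(δ) + cos(φ)·cos(δ)·cos(ω)`. This is an assumption about what `Geo.py` computes, since the text only names the three inputs; the function names below are hypothetical.

```python
import math


def cos_zenith(declination_deg: float, latitude_deg: float, hour_angle_deg: float) -> float:
    """Cosine of the solar zenith angle (CTZ) from δ, φ and ω, all in degrees."""
    d = math.radians(declination_deg)
    p = math.radians(latitude_deg)
    w = math.radians(hour_angle_deg)
    return math.sin(p) * math.sin(d) + math.cos(p) * math.cos(d) * math.cos(w)


def solar_zenith_angle(ctz: float) -> float:
    """SZA in degrees; ctz is clamped to [-1, 1] to absorb rounding error."""
    return math.degrees(math.acos(max(-1.0, min(1.0, ctz))))


def sun_below_horizon(ctz: float) -> bool:
    """True when SZA > 90°, i.e. any non-zero GHI reading would be spurious."""
    return ctz < 0.0


# Equator at an equinox (δ = 0°, φ = 0°)
print(solar_zenith_angle(cos_zenith(0.0, 0.0, 0.0)))    # solar noon: 0.0
print(solar_zenith_angle(cos_zenith(0.0, 0.0, 90.0)))   # six hours later: about 90
print(sun_below_horizon(cos_zenith(0.0, 0.0, 180.0)))   # solar midnight: True
```

Working with `CTZ` directly avoids an `acos` call when only the sign matters, as in the night-time check.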
Clear-Sky Model: ARGP
The clear-sky model estimates what GHI would be on a perfectly clear day: no clouds, and no aerosols beyond a climatological baseline. It provides a physically grounded reference against which measured values can be compared. The workshop uses the ARGP (Argentine Radiation Parametric) model, implemented in `Geo.py`. The ARGP formula is:

`GHI_clear = TOA × ktrp ^ (AM ^ 0.678)`

- `ktrp` is the clear-sky transmission parameter. It depends on altitude (`alt`) and is computed as:
  - If altitude > 1000 m: `ktrp = 0.7 + 1.6391 × 10⁻³ × alt ^ 0.55`
  - Otherwise: `ktrp = 0.7570 + 1.0112 × 10⁻⁵ × alt ^ 1.1067`

  High-altitude sites have a higher `ktrp` than sea-level sites, meaning the atmosphere is thinner and transmits more radiation, which is why high-altitude measurements can be surprisingly large.
- `AM` is the Kasten air mass: how much atmosphere the sunlight travels through relative to a vertical path. It is computed from `CTZ` and the zenith angle `TZ` (in radians), and is pressure-corrected for altitude. The pressure correction reduces the effective air mass at high-altitude sites.
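A minimal sketch of the ARGP formula follows, using only the coefficients quoted above. It takes the air mass as an input rather than guessing at the Kasten expression used in `Geo.py`, assumes altitude in metres and non-negative, and uses hypothetical function names.

```python
def ktrp(alt_m: float) -> float:
    """Clear-sky transmission parameter as a function of altitude (m, >= 0)."""
    if alt_m > 1000.0:
        return 0.7 + 1.6391e-3 * alt_m ** 0.55
    return 0.7570 + 1.0112e-5 * alt_m ** 1.1067


def argp_clear_sky(toa: float, air_mass: float, alt_m: float) -> float:
    """Clear-sky GHI in W/m²: TOA × ktrp ^ (AM ^ 0.678)."""
    return toa * ktrp(alt_m) ** (air_mass ** 0.678)


print(ktrp(0.0))                           # sea level: 0.757
print(ktrp(3000.0))                        # high-altitude site: larger than at sea level
print(argp_clear_sky(1000.0, 1.0, 0.0))    # about 757 W/m² for a vertical path
print(argp_clear_sky(1000.0, 2.0, 0.0))    # longer path through the atmosphere: lower
```

Since `ktrp` is below 1, raising it to a larger exponent shrinks the result, so clear-sky GHI falls as air mass grows.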
Physical Upper Bound: The QC Limit
Beyond the TOA limit, `QualityControl.py` applies a tighter, empirically derived upper boundary for daytime GHI that accounts for rare but physically possible super-clear-sky conditions (e.g. cloud-enhancement effects).
Clearness Index (kt)
The clearness index `kt` is the ratio of measured GHI to TOA irradiance. It quantifies how much of the extraterrestrial solar radiation actually reaches the surface:

`kt = GHI / TOA`

- A value near 1 indicates a very clear sky.
- A value near 0 indicates heavy cloud cover or that the sun is near the horizon.
- A value greater than 1 is physically impossible under normal conditions, because it would mean the surface receives more radiation than the top of the atmosphere. It can occur briefly during circumsolar cloud enhancement, so the QC filter caps `kt` at 1.4.
- Negative `kt` values mean the GHI reading is negative, which is a sensor artefact.
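As a rough illustration, the ratio and the 1.4 cap could be checked as follows. The function names are hypothetical, and returning `None` when TOA is zero or negative is an assumption made here to avoid dividing by zero at night; the text does not say how the workshop code handles that case.

```python
from typing import Optional


def clearness_index(ghi: float, toa: float) -> Optional[float]:
    """kt = GHI / TOA, or None when TOA is not positive (sun at or below the horizon)."""
    if toa <= 0.0:
        return None
    return ghi / toa


def passes_kt_check(ghi: float, toa: float, kt_max: float = 1.4) -> bool:
    """True when kt is strictly between 0 and kt_max."""
    kt = clearness_index(ghi, toa)
    return kt is not None and 0.0 < kt < kt_max


print(passes_kt_check(600.0, 1000.0))    # True: ordinary daytime reading
print(passes_kt_check(1200.0, 1000.0))   # True: kt = 1.2, within the cloud-enhancement cap
print(passes_kt_check(1500.0, 1000.0))   # False: kt = 1.5 exceeds the cap
print(passes_kt_check(-5.0, 1000.0))     # False: negative reading
```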
The QC filter requires `0 < kt < 1.4` for a reading to be accepted.

Why Time-Series ML Is Different
Preprocessing a time-series dataset for Machine Learning is not the same as preparing a standard tabular dataset. Three properties of time-series data demand special treatment.

Temporal Ordering Must Be Preserved
In a conventional tabular dataset, rows are exchangeable: you can shuffle them freely for cross-validation. In a time series, each observation depends on what came before. Shuffling destroys this structure and introduces data leakage: the model would "see" future values during training and appear to perform well while actually being useless in production. The correct approach is a chronological train/test split, in which all training data precedes all test data in time.

Autocorrelation
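Because neighbouring readings are so closely related, splitting by time rather than by random shuffle keeps near-identical samples from landing on both sides of the train/test boundary. The sketch below shows one minimal way to do that in plain Python; `chronological_split` and the 80/20 fraction are illustrative assumptions, not part of the workshop code.

```python
from datetime import datetime, timedelta


def chronological_split(records, train_fraction=0.8):
    """Split (timestamp, value) records so every training row precedes every test row."""
    ordered = sorted(records, key=lambda r: r[0])  # sort by timestamp, never shuffle
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten one-minute readings with dummy values
start = datetime(2024, 1, 1, 12, 0)
records = [(start + timedelta(minutes=i), float(i)) for i in range(10)]

train, test = chronological_split(records)
print(len(train), len(test))       # 8 2
print(train[-1][0] < test[0][0])   # True: the last training row is earlier than the first test row
```

A random split of the same records would place adjacent minutes in different sets, which is the leakage pattern to avoid.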
Consecutive GHI readings are highly correlated: a sunny minute is likely to be followed by another sunny minute. This autocorrelation means that a model trained naively on randomly split data would exploit temporal proximity between train and test samples, inflating apparent accuracy.

Seasonality
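One common way to hand daily and yearly cycles to a model is to encode hour of day and day of year as sine/cosine pairs, so that 23:59 and 00:00 end up close together. The sketch below is written under that assumption: the workshop text names hour of day and day of year as candidate features but does not specify an encoding, and `cyclical_time_features` is a hypothetical helper.

```python
import math
from datetime import datetime


def cyclical_time_features(ts: datetime) -> dict:
    """Encode hour of day and day of year as points on a unit circle."""
    hour = ts.hour + ts.minute / 60.0
    day_of_year = ts.timetuple().tm_yday
    hour_angle = 2.0 * math.pi * hour / 24.0
    year_angle = 2.0 * math.pi * day_of_year / 365.0  # leap days wrap slightly past one cycle
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "doy_sin": math.sin(year_angle),
        "doy_cos": math.cos(year_angle),
    }


late = cyclical_time_features(datetime(2024, 3, 1, 23, 59))
early = cyclical_time_features(datetime(2024, 3, 2, 0, 0))
# Raw hours differ by almost 24, but the encoded values are nearly identical.
print(abs(late["hour_cos"] - early["hour_cos"]) < 0.01)  # True
```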
GHI exhibits strong diurnal (within a day) and annual (across the year) cycles. These cycles must be either explicitly provided to the model as temporal features (hour of day, day of year, etc.) or removed through appropriate normalisation. Otherwise the model must infer them implicitly, which is less efficient and less reliable.

Evaluation Metrics

The workshop uses several metrics to compare preprocessed model outputs against reference measurements. All are implemented in `helpers/Metrics.py`.
Point-to-Point Metrics
These compare predicted and observed values at each timestamp:

| Metric | Formula | What it measures |
|---|---|---|
| MBE (Mean Bias Error) | `sum(pred - true) / n` | Systematic over- or under-prediction (bias direction) |
| MAE (Mean Absolute Error) | `sum(abs(pred - true)) / n` | Average magnitude of error, in the same units as GHI (W/m²) |
| RMSD (Root Mean Square Deviation) | `sqrt(sum((pred - true)²) / n)` | Error magnitude with extra weight on large deviations |
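The three formulas in the table translate directly into code. This is a plain-Python sketch for illustration; it is not the contents of `helpers/Metrics.py`, and the function names are assumptions.

```python
import math


def mbe(pred, true):
    """Mean Bias Error: positive means over-prediction on average."""
    return sum(p - t for p, t in zip(pred, true)) / len(true)


def mae(pred, true):
    """Mean Absolute Error, in the units of the inputs."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rmsd(pred, true):
    """Root Mean Square Deviation: penalises large errors more heavily."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


pred = [100.0, 210.0, 290.0]
true = [110.0, 200.0, 300.0]
print(mbe(pred, true))   # about -3.33: slight under-prediction overall
print(mae(pred, true))   # 10.0
print(rmsd(pred, true))  # 10.0
```

In this example the errors partly cancel in MBE but not in MAE or RMSD, which is why bias and error magnitude are reported separately.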
Distribution-Based Metrics
These compare the statistical distributions of two series rather than individual paired values, which is useful when evaluating whether a preprocessed or imputed dataset preserves the statistical character of the original.

KSI / OVER (based on the Kolmogorov-Smirnov statistic):

- KSI measures the total area between the two empirical CDFs. A value of zero means the distributions are identical.
- OVER measures only the area where the CDF difference exceeds the critical threshold `Vc = 1.63 / sqrt(n)`. It captures statistically significant distributional differences.
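The two definitions can be approximated numerically by evaluating both empirical CDFs on a shared grid and integrating their absolute difference. The sketch below makes several assumptions that may differ from `helpers/Metrics.py`: a 100-point grid, trapezoidal integration, `n` taken as the length of the reference series, and OVER computed as the integral of the excess above `Vc`. The name `ksi_over` is hypothetical.

```python
import math
from bisect import bisect_right


def ecdf(sorted_sample, x):
    """Fraction of sample values that are <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ksi_over(pred, true, grid_points=100):
    """Return (KSI, OVER) in the units of the data, integrated over the pooled range."""
    a, b = sorted(pred), sorted(true)
    lo, hi = min(a[0], b[0]), max(a[-1], b[-1])
    if hi == lo:
        return 0.0, 0.0
    step = (hi - lo) / (grid_points - 1)
    vc = 1.63 / math.sqrt(len(b))  # critical threshold, n = size of the reference series
    ksi = over = 0.0
    prev = None
    for i in range(grid_points):
        x = lo + i * step
        d = abs(ecdf(a, x) - ecdf(b, x))  # CDF difference at this grid point
        if prev is not None:
            ksi += 0.5 * (d + prev) * step
            over += 0.5 * (max(d - vc, 0.0) + max(prev - vc, 0.0)) * step
        prev = d
    return ksi, over


reference = [float(x) for x in range(0, 1000, 10)]
print(ksi_over(reference, reference))                     # (0.0, 0.0): identical distributions
print(ksi_over([x + 500.0 for x in reference], reference))  # both positive for a large shift
```

OVER can never exceed KSI, because it integrates only the part of the difference that lies above the threshold.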