Loading the Dataset
The raw GHI data for station LQ is spread across four annual CSV files (GHI_LQ2020.csv through GHI_LQ2023.csv). This guide walks through every step required to combine them into a single, clean, fully time-indexed DataFrame ready for preprocessing.
Import pandas and the Site helper
Begin by importing the required libraries. The Site class from helpers/Sites.py provides programmatic access to station metadata (latitude, longitude, altitude).
Instantiate a Site object
Create a Site object for station LQ. This gives you the geographic coordinates needed later for solar geometry calculations.
Site('LQ') looks up the station code in the master list defined in helpers/Sites.py and exposes .lat, .long, and .alt as attributes.
Concatenate the four annual CSV files
Use pd.concat with a list comprehension to load and stack all four years in a single call. The usecols=[4, 3] argument selects only the two columns we need:
- Index 4 → Fecha (the timestamp string)
- Index 3 → IRRADIANCIA (W/m2) (the GHI reading)
After this step df has approximately 1,764,898 rows × 2 columns (before deduplication) with the original column names IRRADIANCIA (W/m2) and Fecha.
Parse the datetime column
Convert the datetime column from its raw string representation to proper pandas.Timestamp objects so that all time-based operations work correctly.
Drop duplicate timestamps
The sensor can occasionally record the same minute twice. Remove duplicates while keeping the first occurrence:
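The parsing and deduplication steps can be sketched together as follows. This is a minimal illustration on a tiny hand-made frame, not the project's actual code: the column names `datetime` and `ghi` are assumed from the names the text uses, and the sample values are made up.

```python
import pandas as pd

# Made-up sample standing in for the concatenated CSV data.
# The 06:01 timestamp appears twice, as a duplicated sensor record would.
df = pd.DataFrame({
    "ghi": [0.0, 1.5, 1.7, 3.2],
    "datetime": [
        "2020-01-01 06:00:00",
        "2020-01-01 06:01:00",
        "2020-01-01 06:01:00",
        "2020-01-01 06:02:00",
    ],
})

# Parse the raw strings into pandas.Timestamp values.
df["datetime"] = pd.to_datetime(df["datetime"])

# Drop repeated timestamps, keeping the first reading for each one.
df = df.drop_duplicates(subset="datetime", keep="first")
```

With `keep="first"`, the 1.5 reading survives and the later 1.7 reading for the same minute is discarded.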
Resample to a regular 1-minute grid
Even after deduplication the timestamps may not fall on exact minute boundaries. Resample to a strict 1-minute rule and take the mean within each bin to align everything to a uniform grid:
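A small sketch of this step, assuming the frame has a parsed `datetime` column and a `ghi` column (names inferred from the text; the readings are made up):

```python
import pandas as pd

# Made-up readings whose timestamps do not sit on exact minute boundaries.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-01-01 06:00:05",
        "2020-01-01 06:00:40",
        "2020-01-01 06:01:10",
        "2020-01-01 06:02:20",
    ]),
    "ghi": [10.0, 20.0, 30.0, 50.0],
})

# Index by time, then average all readings that fall within each minute.
df = df.set_index("datetime").resample("1min").mean()
```

Each row is now labelled with the start of its minute, and the two readings inside 06:00 are averaged into a single value.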
Reindex to a full date range to make gaps explicit
Construct a complete minute-by-minute date range covering the entire measurement period and reindex the DataFrame against it. Any minute for which the sensor has no reading will appear as an explicit NaN row.
After .reindex(ranges), every gap in the sensor record is visible as a NaN value in the ghi column rather than a missing row that would be invisible to downstream analysis.
Complete Loading Code
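The steps above could be combined roughly as in the sketch below. It is a reconstruction under stated assumptions rather than the project's own code: the function name `load_ghi` and its parameters are hypothetical, the rename to `ghi` and `datetime` is inferred from the column names the text uses, and the `Site` lookup from helpers/Sites.py is left out because it is project-local and not needed for loading itself.

```python
import pandas as pd


def load_ghi(paths, start, end):
    """Combine annual GHI CSV files into one minute-indexed DataFrame.

    Hypothetical helper: `paths` is a list of CSV files (or file-like
    objects), `start` and `end` bound the full minute range.
    """
    # Read only columns 3 and 4 from each file and stack them.
    # The order inside usecols does not change the column order of the result.
    df = pd.concat(
        [pd.read_csv(p, usecols=[4, 3]) for p in paths],
        ignore_index=True,
    )

    # Assumed rename, inferred from the names used in the text.
    df = df.rename(columns={"IRRADIANCIA (W/m2)": "ghi", "Fecha": "datetime"})

    # Parse timestamps and drop repeated ones, keeping the first reading.
    df["datetime"] = pd.to_datetime(df["datetime"])
    df = df.drop_duplicates(subset="datetime", keep="first")

    # Align to a regular 1-minute grid, averaging within each minute.
    df = df.set_index("datetime").resample("1min").mean()

    # Reindex against the full range so every gap is an explicit NaN row.
    ranges = pd.date_range(start=start, end=end, freq="1min")
    return df.reindex(ranges)
```

For station LQ this might be called as `load_ghi([f"GHI_LQ{y}.csv" for y in range(2020, 2024)], "2020-01-01 00:00", "2023-12-31 23:59")`. The start and end timestamps here are assumptions; use the actual bounds of the measurement period.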
Why .reindex() Matters
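After loading and deduplication, the DataFrame only contains rows for timestamps that actually exist in the raw CSV files. If the sensor was offline for an hour, those 60 minutes are simply absent from the data: they are invisible gaps. Resampling fills empty bins with NaN only between the first and last reading, so it cannot guarantee coverage of the full measurement period.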
Calling .reindex(ranges) against the full date range inserts NaN values for every missing timestamp. This makes every gap explicit and detectable by any subsequent step (outlier detection, visualisation, imputation, etc.) instead of being silently skipped.
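A toy example makes the effect concrete. The data and the six-minute range below are made up for illustration; only the `ghi` column name and the `ranges` variable name follow the text.

```python
import pandas as pd

# Made-up minute data with gaps: 00:02, 00:03 and 00:05 have no reading.
idx = pd.to_datetime([
    "2020-01-01 00:00",
    "2020-01-01 00:01",
    "2020-01-01 00:04",
])
df = pd.DataFrame({"ghi": [0.0, 0.0, 1.2]}, index=idx)

# Full minute-by-minute range for the period of interest.
ranges = pd.date_range("2020-01-01 00:00", "2020-01-01 00:05", freq="1min")

# Missing minutes become rows whose ghi value is NaN.
full = df.reindex(ranges)

# Gaps can now be counted instead of going unnoticed.
n_missing = int(full["ghi"].isna().sum())
```

Before reindexing, `len(df)` gives no hint that anything is missing. Afterwards, `full["ghi"].isna()` marks each absent minute directly.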