The first cleaning step after loading is checking for duplicate timestamps. Raw sensor data often contains multiple readings for the same minute due to logging retries, communication errors, or overlapping exports.

Detecting duplicates

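Compare the number of rows with no missing values against the number of unique timestamps; the difference is the number of duplicate rows. Because dropna() discards every row that contains a missing value, this difference matches the true duplicate count only when no such rows are present: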
# Total rows vs. unique timestamps
print(len(df.dropna()))          # e.g. 1764898
print(len(df.datetime.unique()))  # e.g. 1752120
print(len(df.dropna()) - len(df.datetime.unique()))  # difference = duplicates
In the workshop dataset (four years of 1-minute GHI readings from site LQ), this reveals 12,778 duplicate entries.
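As a cross-check that does not depend on dropna(), duplicates can also be counted directly from the timestamp column. The following is a minimal sketch on a hypothetical toy frame (the `ghi` column name and the values are assumptions, not the workshop data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the workshop data
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:01", "2020-01-01 00:01",
        "2020-01-01 00:02", "2020-01-01 00:02", "2020-01-01 00:02",
    ]),
    "ghi": [0.0, 1.0, 1.5, 2.0, 2.0, 2.5],
})

# Rows whose timestamp already appeared earlier (extras beyond the first)
n_extra = int(df["datetime"].duplicated().sum())

# Every row that shares its timestamp with at least one other row
dupes = df[df["datetime"].duplicated(keep=False)]

print(n_extra)     # 3
print(len(dupes))  # 5
```

Inspecting `dupes` before removing anything shows whether repeated timestamps carry identical or conflicting readings, which helps decide between keeping the first occurrence and averaging.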

Removing duplicates

The simplest fix is to drop all but the first occurrence of each duplicated timestamp:
df.drop_duplicates(subset='datetime', inplace=True)
# Verify
print(len(df.dropna()) - len(df.datetime.unique()))  # should be 0
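drop_duplicates keeps the first occurrence by default (keep='first'). If you need to average duplicates instead, resample to one-minute bins. resample returns a new DataFrame, so assign the result:
df = df.resample(on='datetime', rule='1min').mean().reset_index()
Unlike drop_duplicates, this also creates a NaN row for every minute that has no reading.
Always verify deduplication afterwards. After drop_duplicates, confirm that len(df.dropna()) == len(df.datetime.unique()); in either case, df.datetime.is_unique should be True.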
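Resampling emits a row for every one-minute bin between the first and last timestamp, including empty ones. If the goal is only to average readings that share a timestamp while leaving gaps untouched, grouping by the timestamp is one option. This is a sketch on hypothetical data (the `ghi` column and the `averaged` name are assumptions), not the workshop's own method:

```python
import pandas as pd

# Hypothetical toy frame: 00:01 is duplicated, 00:02 is missing
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:01", "2020-01-01 00:01",
        "2020-01-01 00:03",
    ]),
    "ghi": [0.0, 1.0, 2.0, 4.0],
})

# Average readings that share a timestamp; no row is created for 00:02
averaged = df.groupby("datetime", as_index=False)["ghi"].mean()

# Verify: each timestamp now appears exactly once
assert averaged["datetime"].is_unique
print(averaged)
```

The `is_unique` check works regardless of missing values, so it can serve as the verification step for either deduplication approach.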

What comes next

Even after deduplication there may be missing timestamps (gaps in the time series). Those are handled by reindexing to a complete date range — see the Loading Data page for details.
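For orientation, reindexing to a complete date range might look roughly like the sketch below. The toy data, the `ghi` column, and the choice of the first and last timestamps as range bounds are assumptions; the Loading Data page remains the reference for the actual procedure.

```python
import pandas as pd

# Hypothetical deduplicated frame with a gap at 00:02
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:01", "2020-01-01 00:03",
    ]),
    "ghi": [0.0, 1.0, 4.0],
})

# Complete 1-minute range from the first to the last timestamp
full_index = pd.date_range(df["datetime"].min(), df["datetime"].max(), freq="1min")

# reindex requires a unique index, which is why deduplication comes first
complete = df.set_index("datetime").reindex(full_index)

# Missing minutes appear as NaN rows
n_missing = int(complete["ghi"].isna().sum())
print(n_missing)  # 1
```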
