Removing Duplicate Timestamps from GHI Time Series


The first cleaning step after loading is checking for duplicate timestamps. Raw sensor data often contains multiple readings for the same minute due to logging retries, communication errors, or overlapping exports.

Detecting duplicates

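Compare the total number of non-null rows against the number of unique timestamps to find how many duplicates exist. The count is exact only when dropna() removes no rows; every row it drops lowers the result by one: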

# Total rows vs. unique timestamps
print(len(df.dropna()))          # e.g. 1764898
print(len(df.datetime.unique()))  # e.g. 1752120
print(len(df.dropna()) - len(df.datetime.unique()))  # difference = duplicates

In the workshop dataset (four years of 1-minute GHI readings from site LQ), this reveals 12,778 duplicate entries.
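The count above says how many duplicates exist but not which rows they are. A minimal sketch of inspecting them with pandas' duplicated method follows; the toy frame and the ghi column name are assumptions for illustration, not the workshop data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the workshop data; the column
# names 'datetime' and 'ghi' are assumptions for illustration.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:01", "2020-01-01 00:01",
        "2020-01-01 00:02", "2020-01-01 00:02", "2020-01-01 00:02",
    ]),
    "ghi": [0.0, 1.0, 1.5, 2.0, 2.0, 2.5],
})

# Rows that repeat an earlier timestamp (the first occurrence is not flagged).
n_extra = int(df.duplicated(subset="datetime").sum())

# Every row whose timestamp occurs more than once, for side-by-side inspection.
all_copies = df[df.duplicated(subset="datetime", keep=False)]

print(n_extra)
print(all_copies)
```

Looking at all_copies before dropping anything shows whether the repeated readings agree with each other, which helps decide between keeping the first occurrence and averaging.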

Removing duplicates

The simplest fix is to drop all but the first occurrence of each duplicated timestamp:

df.drop_duplicates(subset='datetime', inplace=True)
# Verify
print(len(df.dropna()) - len(df.datetime.unique()))  # should be 0

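drop_duplicates keeps the first occurrence by default (keep='first'). If you need to average duplicates instead, resample to the 1-minute grid and assign the result, because resample returns a new frame rather than modifying df:

# The result is indexed by datetime; minutes with no readings become NaN rows
df = df.resample(rule='1min', on='datetime').mean()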
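Resampling averages duplicates but also regularizes the time grid, so it creates a row for every minute between the first and last reading. If only the duplicates should be merged and gaps left alone, grouping by timestamp is one alternative. The sketch below compares the two on a toy frame; the ghi column name and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical toy frame: one duplicated minute (00:01) and a gap (00:02 to 00:04).
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:01", "2020-01-01 00:01",
        "2020-01-01 00:05",
    ]),
    "ghi": [0.0, 1.0, 2.0, 5.0],
})

# Average rows that share a timestamp; 'datetime' stays a regular column
# and no rows are created for missing minutes.
averaged = df.groupby("datetime", as_index=False)["ghi"].mean()

# Resampling also averages the duplicates, but returns a Series indexed by
# datetime with one entry per minute, NaN where there were no readings.
resampled = df.resample("1min", on="datetime")["ghi"].mean()

print(averaged)
print(resampled)
```

Here averaged has three rows, while resampled has six, three of which are NaN placeholders for the missing minutes.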

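Always verify the result. After drop_duplicates, len(df.dropna()) == len(df.datetime.unique()) should hold for data with no null rows; df.datetime.is_unique is a more direct check that works either way. A resampled frame carries the timestamps in its index, so check df.index.is_unique there instead.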

What comes next

Even after deduplication, there may be missing timestamps (gaps in the time series). Those are handled by reindexing to a complete date range; see the Loading Data page for details.
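As a rough illustration of why deduplication comes first, here is a sketch of reindexing to a complete 1-minute range. It is not necessarily the approach used on the Loading Data page; the toy frame and the ghi column name are assumptions.

```python
import pandas as pd

# Hypothetical deduplicated frame with a two-minute gap (00:02 and 00:03).
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:01", "2020-01-01 00:04",
    ]),
    "ghi": [0.0, 1.0, 4.0],
})

indexed = df.set_index("datetime")

# Complete 1-minute grid spanning the first to the last reading.
full_index = pd.date_range(indexed.index.min(), indexed.index.max(), freq="1min")

# Timestamps that are absent from the data.
missing = full_index.difference(indexed.index)

# reindex raises an error if the index contains duplicate labels,
# which is why duplicates have to be removed beforehand.
complete = indexed.reindex(full_index)

print(len(missing))
print(complete)
```

The missing minutes appear in complete as NaN rows, which makes the gaps explicit for later steps.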
