BaseDataset Interface
All datasets in Samay inherit from the BaseDataset class (src/samay/dataset.py:54), which provides a consistent interface for data loading and preprocessing.
Common Parameters
All dataset classes share these core parameters:
name
Type: str (optional)
Description: Dataset identifier. When provided, triggers automatic download from HuggingFace datasets.
datetime_col
Type: str
Default: Varies by model ("ds" for TimesFM, "date" for others)
Description: Name of the datetime column in your CSV file. This column is typically dropped during preprocessing.
path
Type: str (optional)
Description: Path to your CSV file. If omitted, name is used to download data.
batchsize
Type: int
Default: Varies by model (4-128)
Description: Batch size for DataLoader.
mode
Type: str
Options: "train", "test", "val"
Description: Determines which data split to use based on boundaries.
boundaries
Type: list[int] or tuple[int]
Default: [0, 0, 0] (auto-computed as 50%/20%/30% split)
Description: Indices defining train/val/test splits: [train_end, val_end, test_end]
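The default split can be sketched as follows. The actual computation lives in src/samay/dataset.py; the function name here is illustrative, assuming only the documented 50%/20%/30% ratios.

```python
# Illustrative sketch of the auto-computed boundaries when [0, 0, 0] is
# passed: 50% train, 20% val, 30% test. Not the library's exact code.

def default_boundaries(n_rows: int) -> list[int]:
    """Return [train_end, val_end, test_end] for a series of n_rows points."""
    train_end = int(n_rows * 0.5)
    val_end = int(n_rows * 0.7)  # 50% train + 20% val
    return [train_end, val_end, n_rows]

print(default_boundaries(1000))  # -> [500, 700, 1000]
```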
Dataset Structure
Data Format
Datasets expect CSV files with:
- One datetime column (specified by datetime_col)
- One or more value columns (time series channels)
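A minimal file in the expected shape looks like this; the column names are illustrative, with "date" standing in for datetime_col.

```python
# Minimal CSV in the expected layout: one datetime column plus one or
# more value columns (channels). Parsed with the stdlib for clarity.
import csv
import io

raw = """date,channel_0,channel_1
2024-01-01 00:00:00,1.2,0.5
2024-01-01 01:00:00,1.4,0.6
2024-01-01 02:00:00,1.3,0.4
"""

rows = list(csv.DictReader(io.StringIO(raw)))
value_cols = [c for c in rows[0] if c != "date"]  # datetime_col is dropped
print(value_cols)  # -> ['channel_0', 'channel_1']
```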
Windowing and Chunking
Datasets create sliding windows from the time series:
- context_len / seq_len - Historical context window size
- horizon_len / forecast_horizon - Forecast horizon size
- stride - Step size between windows (default 10)
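The interaction of these three parameters can be sketched on a plain list; the windowing inside the dataset classes works on tensors, so this is a simplified illustration.

```python
# Sketch of sliding-window creation over a 1-D series, showing how
# context_len, horizon_len, and stride interact. Illustrative only.

def make_windows(series, context_len, horizon_len, stride=10):
    windows = []
    last_start = len(series) - (context_len + horizon_len)
    for start in range(0, last_start + 1, stride):
        context = series[start : start + context_len]
        horizon = series[start + context_len : start + context_len + horizon_len]
        windows.append((context, horizon))
    return windows

series = list(range(100))
pairs = make_windows(series, context_len=32, horizon_len=8, stride=10)
print(len(pairs))  # windows start at 0, 10, ..., 60 -> 7 windows
```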
Multi-channel Handling
For datasets with many channels, data is split into chunks:
- ChronosDataset: 16 channels per chunk
- ChronosBoltDataset: 64 channels per chunk
- MomentDataset: 64 channels per chunk
- TimesfmDataset: No chunking (processes all channels together)
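Chunking amounts to grouping the value columns into blocks of at most the per-dataset chunk size; a sketch, with the helper name chosen here for illustration:

```python
# Sketch of channel chunking: C channels are split into groups of at
# most max_col_num channels (e.g. 16 for ChronosDataset, 64 for
# ChronosBoltDataset and MomentDataset). Illustrative, not library code.

def chunk_channels(columns, max_col_num):
    return [columns[i : i + max_col_num] for i in range(0, len(columns), max_col_num)]

cols = [f"ch_{i}" for i in range(40)]
chunks = chunk_channels(cols, max_col_num=16)
print([len(c) for c in chunks])  # -> [16, 16, 8]
```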
Data Preprocessing
Padding
Short time series are automatically padded to the required context length.
Normalization
Some datasets support optional normalization (e.g. TimesfmDataset via its normalize flag).
Tokenization (Chronos)
Chronos datasets apply tokenization to prepare input for transformer models.
DataLoader Integration
All datasets provide a get_data_loader() method that returns batches for training and evaluation.
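A hedged sketch of how the loader is typically consumed; the dataset class, CSV path, and parameter values below are assumptions drawn from the parameter tables in this page, not verified calls.

```python
# Hedged usage sketch for get_data_loader(). The path and the exact
# batch contents are assumptions; consult the class docstrings for the
# authoritative signatures.
params = {"datetime_col": "date", "batchsize": 32, "mode": "train"}

try:
    from samay.dataset import TimesfmDataset  # requires samay installed

    dataset = TimesfmDataset(path="data/my_series.csv", **params)
    loader = dataset.get_data_loader()
    for batch in loader:
        ...  # each batch is shaped for the matching model's forward pass
        break
except ImportError:
    pass  # samay not installed; the call shape above is the point
```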
Model-Specific Datasets
ChronosDataset - Tokenized transformer input
Key Parameters:
- context_len (int, default 512) - Context window size
- horizon_len (int, default 64) - Forecast horizon
- stride (int, default 10) - Window stride
- config (ChronosConfig) - Model configuration
- tokenizer_class (str) - Tokenizer type
TimesfmDataset - Patch-based input
Key Parameters:
- context_len (int, default 128) - Historical window
- horizon_len (int, default 32) - Forecast window
- freq (str, default "h") - Data frequency
- normalize (bool, default False) - Apply normalization
- stride (int, default 10) - Window stride
MomentDataset - Multi-task dataset
Key Parameters:
- task_name (str) - "forecasting", "imputation", "detection", "classification"
- seq_len (int, default 512) - Sequence length
- horizon_len (int, default 0) - Forecast horizon
- label_col (str) - Label column for classification
ChronosBoltDataset - Simplified Chronos variant
No tokenization - returns raw sequences.
Key Parameters:
- context_len (int, default 512)
- horizon_len (int, default 64)
- stride (int, default 10)
- max_col_num (int, default 64) - Channels per chunk
Dataset Examples
Basic Usage
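A hedged basic-usage sketch assembled from the parameters documented above; the CSV path and the chosen values are assumptions for illustration.

```python
# Hedged basic-usage sketch. Parameter names come from the tables above;
# the file path and values here are assumptions.
settings = {
    "datetime_col": "date",
    "mode": "train",
    "context_len": 128,
    "horizon_len": 32,
}

try:
    from samay.dataset import TimesfmDataset  # requires samay installed

    train_ds = TimesfmDataset(path="data/my_series.csv", **settings)
    train_loader = train_ds.get_data_loader()
except ImportError:
    pass  # samay not installed; the call shape is what matters here
```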
Multi-task Dataset (MOMENT)
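For MOMENT, the same dataset class serves several tasks via task_name, with label_col only needed for classification. A hedged sketch; the file paths and the forecasting horizon_len value are assumptions.

```python
# Hedged MOMENT sketch: one dataset class, multiple tasks selected via
# task_name. Paths and the horizon_len value are illustrative.
forecast_cfg = {"task_name": "forecasting", "seq_len": 512, "horizon_len": 96}
classify_cfg = {"task_name": "classification", "seq_len": 512, "label_col": "label"}

try:
    from samay.dataset import MomentDataset  # requires samay installed

    forecast_ds = MomentDataset(path="data/series.csv", **forecast_cfg)
    classify_ds = MomentDataset(path="data/labeled.csv", **classify_cfg)
except ImportError:
    pass  # samay not installed; the configs show the intended call shape
```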
Next Steps
Models
Learn how to use datasets with models
Evaluation
Understand metrics and evaluation