Overview

The DatasetHub module loads datasets from URLs or local file paths. It supports the formats listed under Supported Formats, caches downloaded files, and can apply an optional transform to each sample on access.

DatasetFormat Enum

class DatasetFormat(Enum):
    CSV = auto()
    JSON = auto()
    NUMPY = auto()
    PICKLE = auto()
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()
    VIDEO = auto()
    SQL = auto()
    CUSTOM = auto()

Methods

from_extension

@classmethod
def from_extension(cls, extension: str) -> DatasetFormat
Determine format from file extension.
extension
str
required
File extension (e.g., 'csv', 'json', 'npy').
return
DatasetFormat
The corresponding dataset format.
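
The extension lookup described above can be pictured as a small table from suffix to enum member. The following is a minimal standalone sketch, not the library's implementation: `DatasetFormatSketch`, the specific extensions in the table, and the fallback to `CUSTOM` for unknown extensions are all assumptions made for illustration.

```python
from enum import Enum, auto


class DatasetFormatSketch(Enum):
    """Reduced stand-in for DatasetFormat (hypothetical, for illustration)."""

    CSV = auto()
    JSON = auto()
    NUMPY = auto()
    PICKLE = auto()
    TEXT = auto()
    CUSTOM = auto()

    @classmethod
    def from_extension(cls, extension: str) -> "DatasetFormatSketch":
        # Assumed mapping; the real table may cover more or different suffixes.
        mapping = {
            "csv": cls.CSV,
            "json": cls.JSON,
            "jsonl": cls.JSON,
            "npy": cls.NUMPY,
            "npz": cls.NUMPY,
            "pkl": cls.PICKLE,
            "txt": cls.TEXT,
        }
        # Normalise ".CSV" and "csv" to the same key.
        key = extension.lower().lstrip(".")
        # Assumption: unknown extensions fall back to CUSTOM instead of raising.
        return mapping.get(key, cls.CUSTOM)


fmt = DatasetFormatSketch.from_extension(".CSV")
```

Normalising case and the leading dot keeps callers from having to clean up the result of `Path.suffix` themselves.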

Dataset Class

class Dataset:
    def __init__(
        self,
        data: Any,
        format: DatasetFormat,
        name: str = None,
        metadata: Dict = None,
        transform: Callable = None
    )

Parameters

data
Any
required
The dataset content.
format
DatasetFormat
required
Format of the dataset.
name
str
Name of the dataset.
metadata
Dict
Additional information about the dataset.
transform
Callable
Function to transform data samples.

Methods

__len__

def __len__(self) -> int
Return the number of samples in the dataset.

__getitem__

def __getitem__(self, idx: Union[int, slice]) -> Any
Get a sample or batch from the dataset.
idx
Union[int, slice]
required
Index or slice to retrieve.
return
Any
The sample or batch at the specified index.
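
To make the indexing contract concrete, here is a minimal sketch of a container that supports both integer and slice access and applies a transform on the way out. `DatasetSketch` is a hypothetical stand-in, and applying the transform to each sample of a slice individually is an assumption; the real class may transform batches differently.

```python
from typing import Any, Callable, Optional, Sequence, Union


class DatasetSketch:
    """Hypothetical stand-in for Dataset, backed by any sequence."""

    def __init__(
        self,
        data: Sequence[Any],
        transform: Optional[Callable[[Any], Any]] = None,
    ) -> None:
        self.data = data
        self.transform = transform

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: Union[int, slice]) -> Any:
        if isinstance(idx, slice):
            # Assumption: a slice yields a list with the transform applied per sample.
            items = list(self.data[idx])
            if self.transform is None:
                return items
            return [self.transform(item) for item in items]
        item = self.data[idx]
        return item if self.transform is None else self.transform(item)


ds = DatasetSketch([1.0, 2.0, 3.0], transform=lambda x: x * 2.0)
first = ds[0]        # single transformed sample
batch = ds[0:2]      # list of transformed samples
```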

to_tensor

def to_tensor(self, framework: str = 'auto') -> Any
Convert the dataset to a tensor.
framework
str
default:"auto"
The framework to use ('torch', 'tensorflow', 'numpy', or 'auto').
return
Any
Tensor representation of the dataset.
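
The reference does not say how 'auto' chooses a framework. One plausible reading is that it picks the first supported framework that is installed. The sketch below illustrates that idea only; `resolve_framework`, the priority order, and the error behaviour are assumptions, not documented behaviour.

```python
import importlib.util

# Assumed priority order for 'auto'; the library may use a different one.
SUPPORTED_FRAMEWORKS = ("torch", "tensorflow", "numpy")


def resolve_framework(framework: str = "auto") -> str:
    """Return the concrete framework name to use (hypothetical helper)."""
    if framework != "auto":
        if framework not in SUPPORTED_FRAMEWORKS:
            raise ValueError(
                f"Unsupported framework {framework!r}; "
                f"expected one of {SUPPORTED_FRAMEWORKS} or 'auto'"
            )
        return framework
    for name in SUPPORTED_FRAMEWORKS:
        # find_spec checks whether the package is installed without importing it.
        if importlib.util.find_spec(name) is not None:
            return name
    raise ImportError("None of the supported frameworks is installed")
```

Passing an explicit framework name avoids depending on whichever package happens to be installed in the environment.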

DatasetHub Class

class DatasetHub:
    def __init__(self, cache_dir: Optional[str] = None)
Main class for loading and managing datasets.

Parameters

cache_dir
Optional[str]
Directory to cache downloaded datasets. If None, uses default cache directory.

Methods

load_dataset

def load_dataset(
    self,
    source: Union[str, Path],
    format: Optional[DatasetFormat] = None,
    download: bool = True,
    **kwargs
) -> Dataset
Load a dataset from a file path or URL.
source
Union[str, Path]
required
File path or URL to load the dataset from.
format
Optional[DatasetFormat]
Format of the dataset. If None, inferred from file extension.
download
bool
default:"True"
Whether to download the dataset if it’s a URL.
return
Dataset
The loaded dataset.
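
Two decisions are implied by the parameters above: whether `source` is a URL (and so subject to `download`), and which extension to use when `format` is None. The sketch below shows one way to make both decisions with the standard library. The helper names `is_url` and `source_extension` are hypothetical, and treating only http and https as URLs is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    # Assumption: only http(s) sources count as downloadable URLs.
    return urlparse(source).scheme in ("http", "https")


def source_extension(source: str) -> str:
    """Return the lowercase extension without the dot, or '' if there is none."""
    # For URLs, look only at the path so query strings do not leak into the suffix.
    path = urlparse(source).path if is_url(source) else source
    return PurePosixPath(path).suffix.lstrip(".").lower()


remote_ext = source_extension("https://example.com/data.csv?token=abc")
local_ext = source_extension("data.npy")
```

A source with no extension, such as a directory like "images/", yields an empty string here, which is the case where passing `format` explicitly is needed.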

register_dataset

def register_dataset(
    self,
    name: str,
    source: Union[str, Path],
    format: DatasetFormat,
    metadata: Optional[Dict] = None
) -> None
Register a dataset for easy loading by name.
name
str
required
Name to register the dataset under.
source
Union[str, Path]
required
File path or URL of the dataset.
format
DatasetFormat
required
Format of the dataset.
metadata
Optional[Dict]
Additional metadata about the dataset.
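
Registration can be thought of as a name-to-entry dictionary that is consulted before a source is treated as a path or URL. The following is a minimal sketch under that assumption; `RegistrySketch` and its `resolve` method are hypothetical and do not describe the library's internals.

```python
from typing import Any, Dict, Optional, Tuple


class RegistrySketch:
    """Hypothetical name registry illustrating register-then-load-by-name."""

    def __init__(self) -> None:
        self._registry: Dict[str, Dict[str, Any]] = {}

    def register_dataset(
        self,
        name: str,
        source: str,
        format: str,
        metadata: Optional[Dict[str, Any]] = None,
    ) -> None:
        # Copy metadata so later changes by the caller do not alter the entry.
        self._registry[name] = {
            "source": source,
            "format": format,
            "metadata": dict(metadata or {}),
        }

    def resolve(
        self, source_or_name: str, format: Optional[str] = None
    ) -> Tuple[str, Optional[str]]:
        """Map a registered name to its source and format; pass others through."""
        entry = self._registry.get(source_or_name)
        if entry is None:
            return source_or_name, format
        # An explicit format argument wins over the registered one (assumption).
        return entry["source"], format or entry["format"]


registry = RegistrySketch()
registry.register_dataset("my_dataset", "https://example.com/dataset.json", "json")
resolved = registry.resolve("my_dataset")
```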

Convenience Functions

load_dataset

def load_dataset(*args, **kwargs) -> Dataset
Convenience function to load a dataset using the default DatasetHub instance.

register_dataset

def register_dataset(*args, **kwargs)
Convenience function to register a dataset using the default DatasetHub instance.
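
Module-level wrappers of this kind usually forward to a single shared instance that is created on first use. The sketch below shows that pattern in isolation. `HubSketch`, `get_default_hub`, and the simplified method signatures are hypothetical; whether the library creates its default instance lazily is an assumption.

```python
from typing import Dict, Optional


class HubSketch:
    """Hypothetical stand-in for DatasetHub with a trivial registry."""

    def __init__(self, cache_dir: Optional[str] = None) -> None:
        self.cache_dir = cache_dir
        self._registered: Dict[str, str] = {}

    def register_dataset(self, name: str, source: str) -> None:
        self._registered[name] = source

    def load_dataset(self, source: str) -> str:
        # The sketch returns the resolved source instead of loading real data.
        return self._registered.get(source, source)


_default_hub: Optional[HubSketch] = None


def get_default_hub() -> HubSketch:
    """Create the shared hub on first use and reuse it afterwards."""
    global _default_hub
    if _default_hub is None:
        _default_hub = HubSketch()
    return _default_hub


def register_dataset(*args, **kwargs) -> None:
    return get_default_hub().register_dataset(*args, **kwargs)


def load_dataset(*args, **kwargs) -> str:
    return get_default_hub().load_dataset(*args, **kwargs)


register_dataset("demo", "data.csv")
loaded = load_dataset("demo")
```

Under this pattern, a dataset registered on a separately constructed hub would not be visible through the module-level functions, since each hub keeps its own registry.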

Example Usage

import neurenix as nx
from neurenix.data import DatasetHub, Dataset, DatasetFormat, load_dataset

# Load a CSV dataset
dataset = load_dataset(
    "https://example.com/data.csv",
    format=DatasetFormat.CSV
)

print(f"Dataset size: {len(dataset)}")

# Access individual samples
sample = dataset[0]
print(f"First sample: {sample}")

# Access a batch
batch = dataset[0:10]
print(f"Batch size: {len(batch)}")

# Load a NumPy dataset
numpy_dataset = load_dataset(
    "data.npy",
    format=DatasetFormat.NUMPY
)

# Convert to tensor
tensor_data = numpy_dataset.to_tensor(framework='numpy')

# Create DatasetHub with custom cache directory
hub = DatasetHub(cache_dir="./my_cache")

# Register a dataset
hub.register_dataset(
    name="my_dataset",
    source="https://example.com/dataset.json",
    format=DatasetFormat.JSON,
    metadata={"version": "1.0", "author": "Neurenix Team"}
)

# Load registered dataset by name
registered_dataset = hub.load_dataset("my_dataset")

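# Wrap in-memory data with a custom transformation
import numpy as np

def preprocess(sample):
    # Custom preprocessing logic: scale each sample by 2
    return sample * 2.0

my_data = np.arange(10, dtype=np.float32)  # placeholder data for illustration

dataset = Dataset(
    data=my_data,
    format=DatasetFormat.NUMPY,
    transform=preprocess
)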

# Transformed data is returned when accessing samples
transformed_sample = dataset[0]

# Load image dataset
image_dataset = load_dataset(
    "images/",
    format=DatasetFormat.IMAGE
)

# Load from Parquet (for large datasets). DatasetFormat has no Parquet
# member, so CUSTOM is passed explicitly.
parquet_dataset = load_dataset(
    "large_dataset.parquet",
    format=DatasetFormat.CUSTOM
)

# Use with DataLoader for training
from neurenix.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)

for batch in train_loader:
    # Training logic
    pass

Supported Formats

CSV: Comma-separated values files
JSON: JSON and JSONL formats
NumPy: .npy and .npz arrays
Pickle: Python pickle files
Text: Plain text files
Images: JPG, PNG, BMP, GIF
Audio: WAV, MP3, OGG, FLAC
Video: MP4, AVI, MOV, MKV
SQL: SQLite databases

Best Practices

Caching: DatasetHub automatically caches downloaded datasets. Pass a persistent cache_dir so repeated runs reuse the downloaded files instead of fetching them again.
Transformations: Pass a transform to the Dataset constructor so preprocessing is applied automatically whenever samples are accessed.
Large datasets: For datasets that don't fit in memory, use lazy loading or a format that can be read in chunks, such as Parquet.
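
Two of these practices can be sketched with the standard library alone: deriving a stable cache file name from a URL, and consuming samples lazily in fixed-size batches. Both helpers below, `cache_path` and `iter_batches`, are hypothetical illustrations; the hashing scheme and file layout are assumptions and need not match what DatasetHub does.

```python
import hashlib
from itertools import islice
from pathlib import Path
from typing import Any, Iterable, Iterator, List


def cache_path(url: str, cache_dir: str) -> Path:
    """Map a URL to a deterministic file path inside cache_dir."""
    # Hashing the URL gives a stable name that is safe to use on any file system.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    # Keep the original extension (ignoring any query string) for format inference.
    suffix = Path(url.split("?", 1)[0]).suffix
    return Path(cache_dir) / f"{digest}{suffix}"


def iter_batches(samples: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield lists of up to batch_size samples without materialising the input."""
    iterator = iter(samples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


target = cache_path("https://example.com/data.csv", "./my_cache")
batches = list(iter_batches(range(5), 2))
```

Because the path depends only on the URL, a second run with the same cache directory can check `target.exists()` and skip the download. `iter_batches` accepts any iterable, including a generator that reads records from disk one at a time.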
