Overview
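The DatasetHub module provides utilities for loading datasets from URLs or local file paths. It handles formats such as CSV, JSON, NumPy, and pickle, caches downloads, and lets you attach a transform that is applied to samples as they are accessed.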
from enum import Enum, auto

class DatasetFormat(Enum):
    CSV = auto()
    JSON = auto()
    NUMPY = auto()
    PICKLE = auto()
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()
    VIDEO = auto()
    SQL = auto()
    CUSTOM = auto()
Methods
from_extension
@classmethod
def from_extension(cls, extension: str) -> DatasetFormat
Determine format from file extension.
extension (str, required): File extension (e.g., 'csv', 'json', 'npy').
Returns: The corresponding dataset format.
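To illustrate how extension-based inference might work, here is a minimal sketch. The enum name `Format`, the mapping table, the normalization of case and leading dots, and the fallback to `CUSTOM` are all assumptions made for this example; the page does not state how Neurenix implements `from_extension`.

```python
from enum import Enum, auto


class Format(Enum):
    """Hypothetical stand-in for DatasetFormat (subset of members)."""

    CSV = auto()
    JSON = auto()
    NUMPY = auto()
    PICKLE = auto()
    TEXT = auto()
    CUSTOM = auto()

    @classmethod
    def from_extension(cls, extension: str) -> "Format":
        # Normalize so 'CSV', '.csv', and 'csv' resolve the same way.
        ext = extension.lower().lstrip(".")
        # Assumed mapping; the real library may recognize other extensions.
        mapping = {
            "csv": cls.CSV,
            "json": cls.JSON,
            "jsonl": cls.JSON,
            "npy": cls.NUMPY,
            "npz": cls.NUMPY,
            "pkl": cls.PICKLE,
            "txt": cls.TEXT,
        }
        # Assumption: unknown extensions fall back to CUSTOM.
        return mapping.get(ext, cls.CUSTOM)
```

With this sketch, `Format.from_extension(".NPY")` resolves to `Format.NUMPY`, and an unrecognized extension such as `"parquet"` resolves to `Format.CUSTOM`.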
Dataset Class
class Dataset:
    def __init__(
        self,
        data: Any,
        format: DatasetFormat,
        name: str = None,
        metadata: Dict = None,
        transform: Callable = None
    )
Parameters
metadata (Dict, optional): Additional information about the dataset.
transform (Callable, optional): Function applied to data samples when they are accessed.
Methods
__len__
def __len__(self) -> int
Return the number of samples in the dataset.
__getitem__
def __getitem__(self, idx: Union[int, slice]) -> Any
Get a sample or batch from the dataset.
idx (Union[int, slice], required): Index or slice to retrieve.
Returns: The sample or batch at the specified index.
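The indexing behavior described above can be sketched as follows. `SketchDataset` is a hypothetical class written for illustration, not the Neurenix `Dataset`; in particular, applying the transform to each sample of a slice and returning the batch as a list are assumptions.

```python
from typing import Any, Callable, Optional, Sequence, Union


class SketchDataset:
    """Hypothetical illustration; not the Neurenix Dataset class."""

    def __init__(
        self,
        data: Sequence[Any],
        transform: Optional[Callable[[Any], Any]] = None,
    ) -> None:
        self.data = data
        self.transform = transform

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: Union[int, slice]) -> Any:
        if isinstance(idx, slice):
            # A slice yields a batch; the transform (if any) runs per sample.
            batch = self.data[idx]
            if self.transform is None:
                return list(batch)
            return [self.transform(sample) for sample in batch]
        # An integer index yields a single, optionally transformed, sample.
        sample = self.data[idx]
        return sample if self.transform is None else self.transform(sample)


ds = SketchDataset([1.0, 2.0, 3.0, 4.0], transform=lambda x: x * 2.0)
first = ds[0]    # single transformed sample
batch = ds[1:3]  # list of transformed samples
```

Keeping the transform inside `__getitem__` means the stored data stays untouched and preprocessing cost is paid only for the samples that are actually read.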
to_tensor
def to_tensor(self, framework: str = 'auto') -> Any
Convert the dataset to a tensor.
framework (str, default 'auto'): The framework to use ('torch', 'tensorflow', 'numpy', or 'auto').
Returns: Tensor representation of the dataset.
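One plausible reading of `framework='auto'` is "use the first supported framework that is installed". The sketch below shows that idea only; the function name `resolve_framework`, the preference order, and the error behavior are assumptions, since this page does not describe how Neurenix resolves `'auto'`.

```python
import importlib.util
from typing import Callable, Sequence


def _installed(name: str) -> bool:
    # find_spec returns None when a top-level module cannot be located.
    return importlib.util.find_spec(name) is not None


def resolve_framework(
    framework: str = "auto",
    candidates: Sequence[str] = ("torch", "tensorflow", "numpy"),
    is_available: Callable[[str], bool] = _installed,
) -> str:
    """Hypothetical helper: pick a concrete framework name."""
    if framework != "auto":
        # An explicit choice is validated but not probed for installation.
        if framework not in candidates:
            raise ValueError(f"Unsupported framework: {framework!r}")
        return framework
    # Assumed preference order: first installed candidate wins.
    for name in candidates:
        if is_available(name):
            return name
    raise RuntimeError("No supported tensor framework is installed")
```

The `is_available` parameter exists so the selection logic can be exercised without installing any of the frameworks, for example `resolve_framework("auto", is_available=lambda n: n == "numpy")`.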
DatasetHub Class
class DatasetHub:
    def __init__(self, cache_dir: Optional[str] = None)
Main class for loading and managing datasets.
Parameters
cache_dir (Optional[str]): Directory in which downloaded datasets are cached. If None, the default cache directory is used.
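For intuition about what a download cache involves, here is a minimal sketch of mapping a URL to a stable file inside `cache_dir`. The helper name `cache_path_for`, the default location, and the hash-based naming scheme are assumptions for illustration; the page does not document Neurenix's actual cache layout.

```python
import hashlib
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def cache_path_for(url: str, cache_dir: Optional[str] = None) -> Path:
    """Hypothetical helper: derive a deterministic cache file path for a URL."""
    # Assumed default location; the library's real default is not stated here.
    if cache_dir is not None:
        root = Path(cache_dir)
    else:
        root = Path.home() / ".cache" / "datasets"
    # Hash the full URL so two URLs with the same file name do not collide.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    # Keep the original suffix so extension-based format inference still works.
    suffix = Path(urlparse(url).path).suffix
    return root / f"{digest}{suffix}"
```

Because the path depends only on the URL and the cache directory, a later run can check whether the file already exists and skip the download.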
Methods
load_dataset
def load_dataset(
    self,
    source: Union[str, Path],
    format: Optional[DatasetFormat] = None,
    download: bool = True,
    **kwargs
) -> Dataset
Load a dataset from a file path or URL.
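source (Union[str, Path], required): File path or URL to load the dataset from.
format (Optional[DatasetFormat]): Format of the dataset. If None, it is inferred from the file extension.
download (bool, default True): Whether to download the dataset when the source is a URL.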
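Loading from "a file path or URL" implies two preliminary decisions: whether the source is remote, and which extension to use for format inference. A minimal sketch of both follows; the helper names `is_url` and `infer_extension`, and the choice to treat only `http` and `https` as remote, are assumptions rather than documented Neurenix behavior.

```python
from pathlib import Path
from typing import Union
from urllib.parse import urlparse


def is_url(source: Union[str, Path]) -> bool:
    """Hypothetical helper: treat only http(s) sources as remote."""
    scheme = urlparse(str(source)).scheme.lower()
    return scheme in ("http", "https")


def infer_extension(source: Union[str, Path]) -> str:
    """Hypothetical helper: lower-cased extension without the leading dot."""
    text = str(source)
    # For URLs, look only at the path so query strings are ignored.
    path = urlparse(text).path if is_url(text) else text
    return Path(path).suffix.lstrip(".").lower()
```

Checking the scheme explicitly avoids misclassifying Windows paths such as `C:\data\x.csv`, whose drive letter parses as a one-letter scheme.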
register_dataset
def register_dataset(
    self,
    name: str,
    source: Union[str, Path],
    format: DatasetFormat,
    metadata: Optional[Dict] = None
) -> None
Register a dataset for easy loading by name.
name (str, required): Name to register the dataset under.
source (Union[str, Path], required): File path or URL of the dataset.
metadata (Optional[Dict]): Additional metadata about the dataset.
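Registration by name can be pictured as a dictionary lookup that runs before a source is treated as a path or URL. The sketch below uses hypothetical names (`RegistryEntry`, `SketchRegistry`, `resolve`) and stores the format as a plain string to stay self-contained; it is an illustration of the idea, not the library's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RegistryEntry:
    """Hypothetical record kept for each registered dataset."""

    source: str
    format: str
    metadata: Dict[str, str] = field(default_factory=dict)


class SketchRegistry:
    """Hypothetical name-to-source registry."""

    def __init__(self) -> None:
        self._entries: Dict[str, RegistryEntry] = {}

    def register(
        self,
        name: str,
        source: str,
        format: str,
        metadata: Optional[Dict[str, str]] = None,
    ) -> None:
        # Copy the metadata so later caller-side mutation does not leak in.
        self._entries[name] = RegistryEntry(source, format, dict(metadata or {}))

    def resolve(self, name_or_source: str) -> str:
        # Assumption: registered names take precedence over paths and URLs.
        entry = self._entries.get(name_or_source)
        return entry.source if entry is not None else name_or_source


registry = SketchRegistry()
registry.register("my_dataset", "https://example.com/dataset.json", "json")
```

Here `registry.resolve("my_dataset")` returns the registered URL, while an unregistered string such as `"data.npy"` is passed through unchanged.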
Convenience Functions
load_dataset
def load_dataset(*args, **kwargs) -> Dataset
Convenience function to load a dataset using the default DatasetHub instance.
register_dataset
def register_dataset(*args, **kwargs)
Convenience function to register a dataset using the default DatasetHub instance.
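Module-level functions that delegate to a "default instance" are commonly built on a lazily created singleton. The following sketch shows that pattern with a hypothetical `SketchHub` and `get_default_hub`; whether Neurenix creates its default `DatasetHub` lazily or at import time is not stated on this page.

```python
from typing import Any, Optional


class SketchHub:
    """Hypothetical stand-in for DatasetHub."""

    def __init__(self, cache_dir: Optional[str] = None) -> None:
        self.cache_dir = cache_dir

    def load_dataset(self, source: str, **kwargs: Any) -> str:
        # Placeholder result so the delegation is observable.
        return f"loaded:{source}"


_default_hub: Optional[SketchHub] = None


def get_default_hub() -> SketchHub:
    """Create the shared hub on first use, then reuse it."""
    global _default_hub
    if _default_hub is None:
        _default_hub = SketchHub()
    return _default_hub


def load_dataset(*args: Any, **kwargs: Any) -> str:
    # Forward all arguments unchanged to the shared instance.
    return get_default_hub().load_dataset(*args, **kwargs)
```

Under this pattern, datasets registered through the module-level function live on the shared instance and are not visible to a separately constructed hub.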
Example Usage
import neurenix as nx
from neurenix.data import DatasetHub, Dataset, DatasetFormat, load_dataset
# Load a CSV dataset
dataset = load_dataset(
    "https://example.com/data.csv",
    format=DatasetFormat.CSV
)
print(f"Dataset size: {len(dataset)}")
# Access individual samples
sample = dataset[0]
print(f"First sample: {sample}")
# Access a batch
batch = dataset[0:10]
print(f"Batch size: {len(batch)}")
# Load a NumPy dataset
numpy_dataset = load_dataset(
    "data.npy",
    format=DatasetFormat.NUMPY
)
# Convert to tensor
tensor_data = numpy_dataset.to_tensor(framework='numpy')
# Create DatasetHub with custom cache directory
hub = DatasetHub(cache_dir="./my_cache")
# Register a dataset
hub.register_dataset(
    name="my_dataset",
    source="https://example.com/dataset.json",
    format=DatasetFormat.JSON,
    metadata={"version": "1.0", "author": "Neurenix Team"}
)
# Load registered dataset by name
registered_dataset = hub.load_dataset("my_dataset")
# Load with custom transformations
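import numpy as np
# Placeholder data for illustration; substitute your own array-like data
my_data = np.arange(10, dtype=np.float32)
def preprocess(sample):
    # Custom preprocessing logic: scale each sample by 2
    return sample * 2.0
dataset = Dataset(
    data=my_data,
    format=DatasetFormat.NUMPY,
    transform=preprocess
)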
# Transformed data is returned when accessing samples
transformed_sample = dataset[0]
# Load image dataset
image_dataset = load_dataset(
    "images/",
    format=DatasetFormat.IMAGE
)
# Load from Parquet (for large datasets)
parquet_dataset = load_dataset(
    "large_dataset.parquet",
    format=DatasetFormat.CUSTOM
)
# Use with DataLoader for training
from neurenix.data import DataLoader
train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)
for batch in train_loader:
    # Training logic
    pass
Supported Formats
CSV: Comma-separated values files
JSON: JSON and JSONL formats
NumPy: .npy and .npz arrays
Pickle: Python pickle files
Best Practices
Caching: DatasetHub automatically caches downloaded datasets. Point cache_dir at a persistent directory so that downloads are reused across runs instead of being fetched again.
Transformations: Pass a transform to the Dataset constructor so that preprocessing is applied automatically whenever samples are accessed.
Large datasets: For datasets that don't fit in memory, prefer lazy loading or a format that can be read in chunks, such as Parquet.
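To make the lazy-loading advice concrete, the sketch below reads a CSV source in fixed-size batches using only the standard library, so that at most one batch is held in memory at a time. The function name `iter_csv_batches` is hypothetical and this is a generic Python technique, not a Neurenix API.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def iter_csv_batches(
    handle: TextIO, batch_size: int
) -> Iterator[List[Dict[str, str]]]:
    """Hypothetical helper: yield rows from a CSV stream in batches."""
    reader = csv.DictReader(handle)
    while True:
        # islice pulls at most batch_size rows without reading the rest.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# In-memory text stands in for a large file opened with open(path, newline="").
text = "x,y\n1,2\n3,4\n5,6\n"
batches = list(iter_csv_batches(io.StringIO(text), batch_size=2))
```

The three data rows above are split into one batch of two rows and a final batch of one row; values remain strings, so any numeric conversion would belong in a transform.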