The dataset command manages datasets from the command line: listing, downloading, registering, inspecting, splitting, and converting them.

Usage

neurenix dataset <action> [options]

Actions

Action      Description
list        List available registered datasets
download    Download a dataset from URL or registry
register    Register a dataset in the local registry
info        Get detailed information about a dataset
split       Split a dataset into train/val/test sets
convert     Convert dataset to different format

Available Formats

  • csv - Comma-separated values
  • json - JSON format
  • npy - NumPy binary format
  • hdf5 - HDF5 format
  • tfrecord - TensorFlow record format
  • parquet - Apache Parquet format

Examples

List registered datasets

neurenix dataset list
Available datasets:

cifar10:
  URL: https://datasets.neurenix.ai/cifar10
  Format: auto-detect
  Metadata: {"classes": 10, "size": 60000}

imagenet:
  URL: https://datasets.neurenix.ai/imagenet
  Format: auto-detect
  Metadata: {"classes": 1000, "size": 1281167}

List in JSON format

neurenix dataset list --format json
{
  "cifar10": {
    "url": "https://datasets.neurenix.ai/cifar10",
    "format": "auto-detect",
    "metadata": {"classes": 10, "size": 60000}
  },
  "imagenet": {
    "url": "https://datasets.neurenix.ai/imagenet",
    "format": "auto-detect",
    "metadata": {"classes": 1000, "size": 1281167}
  }
}

Download a dataset

neurenix dataset download cifar10
Downloading dataset from cifar10...
Saving dataset to data/cifar10.csv...
Dataset downloaded and saved to data/cifar10.csv

Download to specific location

neurenix dataset download cifar10 --output datasets/cifar10.csv
Downloading dataset from cifar10...
Saving dataset to datasets/cifar10.csv...
Dataset downloaded and saved to datasets/cifar10.csv

Download from URL

neurenix dataset download https://example.com/data.csv --output data/external.csv
Downloading dataset from https://example.com/data.csv...
Saving dataset to data/external.csv...
Dataset downloaded and saved to data/external.csv

Register a dataset

neurenix dataset register my_dataset https://example.com/dataset.csv
Dataset 'my_dataset' registered successfully.

Register with format

neurenix dataset register my_images /path/to/images --format hdf5
Dataset 'my_images' registered successfully.

Register with metadata

neurenix dataset register custom_data data.csv \
  --metadata '{"classes": 5, "samples": 10000}'
Dataset 'custom_data' registered successfully.
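
Because --metadata takes a JSON string, quoting gets awkward once the values come from shell variables. The sketch below builds that string with printf. It assumes integer-valued fields only, and make_metadata is a hypothetical helper, not part of the neurenix CLI.

```shell
# Hypothetical helper (not part of neurenix): build the JSON string
# passed to --metadata from two integer values.
make_metadata() {
  # %d makes printf report an error if an argument is not a number
  printf '{"classes": %d, "samples": %d}\n' "$1" "$2"
}

make_metadata 5 10000
```

The result can then be passed as --metadata "$(make_metadata 5 10000)" in the register command shown above.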

Get dataset info

neurenix dataset info cifar10
Dataset: cifar10
URL: https://datasets.neurenix.ai/cifar10
Format: auto-detect
Metadata: {"classes": 10, "size": 60000}

Get info for local dataset

neurenix dataset info data/my_dataset.csv
Loading dataset from data/my_dataset.csv...
Dataset: my_dataset.csv
Path: /home/user/project/data/my_dataset.csv
Format: csv
Size: 5000
Metadata: {}

Split a dataset

neurenix dataset split data/full_dataset.csv --ratio 0.7,0.15,0.15
Loading dataset from data/full_dataset.csv...
Splitting dataset with ratio 0.7,0.15,0.15...
Saving train split (700 samples) to data/train/train_data.csv...
Saving val split (150 samples) to data/val/val_data.csv...
Saving test split (150 samples) to data/test/test_data.csv...
Dataset split successfully. Results saved to data

Split with shuffling

neurenix dataset split data/dataset.csv \
  --ratio 0.8,0.2 \
  --shuffle \
  --seed 42
Loading dataset from data/dataset.csv...
Splitting dataset with ratio 0.8,0.2...
Saving train split (800 samples) to data/train/train_data.csv...
Saving val split (200 samples) to data/val/val_data.csv...
Dataset split successfully. Results saved to data
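
To illustrate why a fixed seed makes a shuffled split repeatable, here is a small seeded shuffle written in awk. It sketches the general idea only: how neurenix shuffles internally is not described on this page, and seeded_shuffle is a hypothetical name.

```shell
# Hypothetical illustration: shuffle the lines on stdin with a fixed seed.
seeded_shuffle() {
  awk -v seed="$1" '
    BEGIN { srand(seed) }            # same seed, same random sequence
    { lines[NR] = $0 }               # read every line into memory
    END {
      # Fisher-Yates shuffle: swap each position with a random earlier one
      for (i = NR; i > 1; i--) {
        j = int(rand() * i) + 1
        tmp = lines[i]; lines[i] = lines[j]; lines[j] = tmp
      }
      for (i = 1; i <= NR; i++) print lines[i]
    }'
}

first=$(printf '%s\n' a b c d e | seeded_shuffle 42)
second=$(printf '%s\n' a b c d e | seeded_shuffle 42)
[ "$first" = "$second" ] && echo "same order for the same seed"
```

The exact order depends on the awk implementation, so the example compares two runs instead of showing a specific permutation.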

Convert dataset format

neurenix dataset convert data.csv data.json --output-format json
Loading dataset from data.csv...
Converting dataset to json format...
Dataset converted and saved to data.json

Convert with explicit input format

neurenix dataset convert input.txt output.csv \
  --input-format csv \
  --output-format csv
Loading dataset from input.txt...
Converting dataset to csv format...
Dataset converted and saved to output.csv

Action Details

list

List all registered datasets in the local registry.

Options:
  • --format: Output format (text, json)

neurenix dataset list [--format text|json]

download

Download a dataset from a URL or registered name.

Arguments:
  • source: Dataset URL or registered name

Options:
  • --output: Output directory or file (default: data)
  • --format: Dataset format (auto-detected if not specified)

neurenix dataset download <source> [--output <path>] [--format <format>]

register

Register a dataset in the local registry for easy access.

Arguments:
  • name: Dataset name
  • url: Dataset URL or file path

Options:
  • --format: Dataset format
  • --metadata: Metadata (JSON string or file path)

neurenix dataset register <name> <url> [--format <format>] [--metadata <json>]

info

Get detailed information about a dataset.

Arguments:
  • name: Dataset name or path

Options:
  • --format: Output format (text, json)

neurenix dataset info <name> [--format text|json]

split

Split a dataset into train/validation/test sets.

Arguments:
  • input: Input dataset file or directory

Options:
  • --output: Output directory (default: data)
  • --ratio: Split ratios (default: 0.8,0.2)
  • --shuffle: Shuffle data before splitting
  • --seed: Random seed for reproducibility

neurenix dataset split <input> [--output <dir>] [--ratio <ratios>] [--shuffle] [--seed <int>]
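
As a rough check on what a given --ratio produces, the sample counts in the split examples earlier on this page match ratio × total (1000 samples at 0.7,0.15,0.15 gives 700, 150, and 150). The sketch below reproduces that arithmetic. It assumes rounding to the nearest whole sample, which may differ from the tool's actual rule, and split_sizes is a hypothetical helper.

```shell
# Hypothetical helper: estimate per-split sample counts from a total
# and a comma-separated ratio string such as 0.7,0.15,0.15.
split_sizes() {
  awk -v total="$1" -v ratios="$2" 'BEGIN {
    n = split(ratios, r, ",")
    for (i = 1; i <= n; i++) {
      # round each share to the nearest whole sample (assumed rule)
      printf "%s%d", (i > 1 ? " " : ""), int(total * r[i] + 0.5)
    }
    printf "\n"
  }'
}

split_sizes 1000 0.7,0.15,0.15   # prints: 700 150 150
```

With totals that do not divide evenly, independently rounded shares may not add up to the total, so treat the output as an estimate.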

convert

Convert a dataset to a different format.

Arguments:
  • input: Input dataset file or directory
  • output: Output file or directory

Options:
  • --input-format: Input format (auto-detected if not specified)
  • --output-format: Output format (required)

neurenix dataset convert <input> <output> [--input-format <format>] --output-format <format>
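
In scripts it can be safer to pass the format explicitly than to rely on auto-detection. The sketch below maps a file extension to one of the names listed under Available Formats. The mapping is an assumption for illustration (the CLI's real detection logic is not described here), and guess_format is a hypothetical helper.

```shell
# Hypothetical helper: map a file extension to a format name from the
# Available Formats list. Not the CLI's own detection logic.
guess_format() {
  case "$1" in
    *.csv)        echo csv ;;
    *.json)       echo json ;;
    *.npy)        echo npy ;;
    *.h5|*.hdf5)  echo hdf5 ;;
    *.tfrecord)   echo tfrecord ;;
    *.parquet)    echo parquet ;;
    *)            echo "unknown format: $1" >&2; return 1 ;;
  esac
}

guess_format data.h5   # prints: hdf5
```

The result can be supplied as --input-format "$(guess_format data.h5)". A file such as input.txt falls through to the error branch, which is the case where the explicit flag is needed anyway.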

Error Handling

Dataset not found

neurenix dataset download unknown_dataset
Error managing dataset: Dataset 'unknown_dataset' not found in registry

Invalid split ratio

neurenix dataset split data.csv --ratio 0.5,0.3
Error: Invalid split ratio: Split ratios must sum to 1.0
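
A script can catch this before calling the CLI by checking that the ratios sum to 1.0. This is a minimal sketch: check_ratio is a hypothetical helper, and the 1e-9 tolerance is an assumption, not the tool's documented behavior.

```shell
# Hypothetical pre-flight check: succeed only if the comma-separated
# ratios sum to 1.0 within a small tolerance.
check_ratio() {
  awk -v ratios="$1" 'BEGIN {
    n = split(ratios, r, ",")
    sum = 0
    for (i = 1; i <= n; i++) sum += r[i]
    diff = sum - 1.0
    if (diff < 0) diff = -diff
    # exit status 0 when the sum is close enough to 1.0, else 1
    exit (diff < 1e-9 ? 0 : 1)
  }'
}

check_ratio 0.7,0.15,0.15 && echo "ratio ok"
check_ratio 0.5,0.3 || echo "ratio must sum to 1.0"
```

The tolerance matters because sums such as 0.7 + 0.15 + 0.15 are not always exactly 1.0 in floating point.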

File not found

neurenix dataset split missing.csv
Error: Input dataset 'missing.csv' not found.

Use Cases

1. Download and prepare dataset

# Download
neurenix dataset download cifar10 --output data/cifar10.csv

# Split into train/val/test
neurenix dataset split data/cifar10.csv --ratio 0.7,0.15,0.15 --shuffle

2. Register custom dataset

# Register
neurenix dataset register my_data /path/to/data.csv \
  --metadata '{"description": "Custom dataset", "version": "1.0"}'

# Download later
neurenix dataset download my_data

3. Convert dataset format

# CSV to JSON
neurenix dataset convert data.csv data.json --output-format json

# CSV to NumPy
neurenix dataset convert data.csv data.npy --output-format npy

4. Create reproducible splits

neurenix dataset split full_data.csv \
  --output splits \
  --ratio 0.8,0.1,0.1 \
  --shuffle \
  --seed 42

5. Manage multiple datasets

# Register datasets
neurenix dataset register train_data data/train.csv
neurenix dataset register test_data data/test.csv

# List all
neurenix dataset list

# Get info
neurenix dataset info train_data

Best Practices

1. Shuffle when splitting, with a fixed seed

neurenix dataset split data.csv --ratio 0.8,0.2 --shuffle --seed 42

2. Use consistent split ratios

Standard splits:
  • 70/15/15: Balanced three-way split
  • 80/20: Simple train/val split
  • 80/10/10: More training data

neurenix dataset split data.csv --ratio 0.7,0.15,0.15

3. Register datasets with metadata

neurenix dataset register my_dataset data.csv \
  --metadata '{"version": "1.0", "date": "2024-01-15", "samples": 10000}'

4. Convert to efficient formats

For large datasets, prefer a binary format such as Parquet or HDF5 over CSV:

# Convert to Parquet for efficient storage
neurenix dataset convert large_data.csv large_data.parquet \
  --output-format parquet

# Convert to HDF5 for numerical data
neurenix dataset convert data.csv data.h5 --output-format hdf5

5. Organize dataset directories

mkdir -p datasets/{raw,processed,splits}

# Download to raw
neurenix dataset download cifar10 --output datasets/raw/cifar10.csv

# Split into datasets/splits
neurenix dataset split datasets/raw/cifar10.csv \
  --output datasets/splits \
  --ratio 0.8,0.1,0.1

Integration Example

Complete dataset preparation pipeline:

#!/bin/bash
# Stop at the first failed step so later steps never run on missing files
set -euo pipefail

# 1. Register dataset
neurenix dataset register my_data https://example.com/data.csv

# 2. Download
neurenix dataset download my_data --output data/raw/data.csv

# 3. Convert to efficient format
neurenix dataset convert \
  data/raw/data.csv \
  data/processed/data.parquet \
  --output-format parquet

# 4. Split for training
neurenix dataset split \
  data/processed/data.parquet \
  --output data/splits \
  --ratio 0.7,0.15,0.15 \
  --shuffle \
  --seed 42

# 5. Preprocess
neurenix preprocess \
  --input data/splits/train \
  --output data/ready/train \
  --normalize

echo "Dataset prepared and ready for training!"
