The dataloaders module provides a unified interface for loading benchmark datasets, normalizing their formats, and converting them to framework-specific document objects.
Overview
Dataloaders abstract away dataset-specific formats and provide:
Catalog-based loader creation - Factory pattern for consistent instantiation
Normalized record format - All datasets produce DatasetRecord objects
Framework conversion - Automatic conversion to Haystack or LangChain documents
Evaluation queries - Ground-truth QA pairs for retrieval benchmarking
Streaming support - Memory-efficient iteration over large datasets
Supported datasets
| Dataset | Type | Records | Queries | Description |
| --- | --- | --- | --- | --- |
| TriviaQA (triviaqa) | Open-domain QA | ~500 index | ~100 eval | Trivia questions with evidence documents |
| ARC (arc) | Science QA | ~1000 index | ~200 eval | AI2 Reasoning Challenge questions |
| PopQA (popqa) | Entity-centric QA | ~500 index | ~100 eval | Entity-focused questions from Wikipedia |
| FActScore (factscore) | Factuality QA | ~500 index | ~100 eval | Factuality-focused evaluation dataset |
| Earnings Calls (earnings_calls) | Financial QA | ~300 index | ~50 eval | Financial QA from earnings call transcripts |
Architecture
Core components
dataloaders/
├── catalog.py       # Factory for creating loaders
├── base.py          # Abstract base class defining the contract
├── types.py         # Shared types (DatasetRecord, EvaluationQuery)
├── converters.py    # Framework document conversion
├── dataset.py       # LoadedDataset wrapper
├── evaluation.py    # Evaluation query extraction
└── datasets/        # Per-dataset implementations
    ├── triviaqa.py
    ├── arc.py
    ├── popqa.py
    ├── factscore.py
    └── earnings_calls.py
Class hierarchy
BaseDatasetLoader (ABC)
├── TriviaQALoader
├── ARCLoader
├── PopQALoader
├── FactScoreLoader
└── EarningsCallsLoader
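The internals of catalog.py are not shown on this page. As a minimal sketch of the factory pattern it describes, assuming a plain name-to-class registry, the idea looks roughly like this. `SketchLoader`, `SketchCatalog`, and `_REGISTRY` are hypothetical stand-ins, not the vectordb API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchLoader:
    """Hypothetical stand-in for a concrete loader class."""

    split: str
    limit: Optional[int] = None


class TriviaQASketchLoader(SketchLoader):
    pass


class ARCSketchLoader(SketchLoader):
    pass


# Hypothetical registry: dataset identifier -> loader class.
_REGISTRY = {
    "triviaqa": TriviaQASketchLoader,
    "arc": ARCSketchLoader,
}


class SketchCatalog:
    @staticmethod
    def create(name: str, split: str, limit: Optional[int] = None) -> SketchLoader:
        """Look up the loader class by name and instantiate it."""
        try:
            loader_cls = _REGISTRY[name]
        except KeyError:
            supported = ", ".join(sorted(_REGISTRY))
            raise ValueError(
                f"Unsupported dataset {name!r}; expected one of: {supported}"
            ) from None
        return loader_cls(split=split, limit=limit)


loader = SketchCatalog.create("triviaqa", split="test", limit=500)
```

Keeping the lookup in one place means callers depend only on the dataset identifier, and an unknown name fails early with a message listing the valid ones.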
Basic usage
Creating a loader
Use the DataloaderCatalog factory to create loaders:
from vectordb.dataloaders import DataloaderCatalog

loader = DataloaderCatalog.create(
    name="triviaqa",
    split="test",
    limit=500,
)
Loading datasets
# Load normalized records
dataset = loader.load()
print(f"Loaded {len(dataset.records)} records")
print(f"Dataset type: {dataset.dataset_type}")
Converting to framework documents
from vectordb.dataloaders.converters import DocumentConverter, records_to_items
# Convert records to normalized items
items = records_to_items(dataset.records)
# Convert to Haystack documents
haystack_docs = DocumentConverter.to_haystack(items)
# Convert to LangChain documents
langchain_docs = DocumentConverter.to_langchain(items)
Data structures
DatasetRecord
Normalized document record with text and metadata:
@dataclass(frozen=True, slots=True)
class DatasetRecord:
    text: str                 # Document content to index
    metadata: dict[str, Any]  # Dataset-specific metadata
Example:
DatasetRecord(
    text="The Great Wall of China was built over several centuries...",
    metadata={
        "id": "doc_001",
        "source": "triviaqa",
        "title": "Great Wall of China",
    },
)
EvaluationQuery
Evaluation query with ground-truth answers and relevant document IDs:
@dataclass(frozen=True, slots=True)
class EvaluationQuery:
    query: str                   # User/evaluation question
    answers: list[str]           # Ground-truth answers
    relevant_doc_ids: list[str]  # IDs of known relevant docs
    metadata: dict[str, Any]     # Additional metadata
Example:
EvaluationQuery(
    query="When was the Great Wall of China built?",
    answers=["over several centuries", "7th century BC"],
    relevant_doc_ids=["doc_001", "doc_045"],
    metadata={"difficulty": "easy", "category": "history"},
)
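This page does not say which metrics the evaluation queries feed into. Recall@k is one common choice for retrieval benchmarking, and the sketch below shows how `relevant_doc_ids` could be scored against a retriever's ranked output. The `recall_at_k` helper and the retrieved IDs are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(relevant & top_k) / len(relevant)


# relevant_doc_ids taken from the example query above.
relevant = ["doc_001", "doc_045"]
# Hypothetical ranked output from a retriever, best match first.
retrieved = ["doc_007", "doc_001", "doc_300", "doc_045"]

score_at_2 = recall_at_k(retrieved, relevant, k=2)  # only doc_001 is in the top 2
score_at_4 = recall_at_k(retrieved, relevant, k=4)  # both relevant docs are found
```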
LoadedDataset
Wrapper containing loaded records and metadata:
class LoadedDataset:
    dataset_type: DatasetType      # "triviaqa", "arc", etc.
    records: list[DatasetRecord]   # Normalized documents
Dataset implementations
TriviaQA
Dataset ID: triviaqa
HuggingFace: trivia_qa
Structure: Questions with multiple evidence documents
loader = DataloaderCatalog.create("triviaqa", split="test", limit=500)
dataset = loader.load()
Record format:
text: Evidence document content
metadata.id: Document identifier
metadata.title: Document title
metadata.question: Associated question
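Because one question can carry several evidence documents, a single raw row may produce several records. The sketch below illustrates that fan-out with plain dicts. The row layout (`question_id`, `evidence`, and the keys inside each evidence entry) is a hypothetical simplification, not the actual trivia_qa schema or the loader's real parsing code.

```python
from collections.abc import Mapping
from typing import Any


def parse_row(row: Mapping[str, Any]) -> list[dict[str, Any]]:
    """Fan one question row out into one record per non-empty evidence document."""
    records = []
    for index, doc in enumerate(row["evidence"]):  # "evidence" is a hypothetical field
        text = doc["text"].strip()
        if not text:
            continue  # skip evidence entries with no usable text
        records.append(
            {
                "text": text,
                "metadata": {
                    "id": f"{row['question_id']}_{index}",
                    "title": doc["title"],
                    "question": row["question"],
                },
            }
        )
    return records


row = {
    "question_id": "q1",
    "question": "When was the Great Wall of China built?",
    "evidence": [
        {"title": "Great Wall of China", "text": "Built over several centuries..."},
        {"title": "Empty page", "text": "   "},
        {"title": "Qin dynasty", "text": "Early walls were joined together..."},
    ],
}
records = parse_row(row)
```

Deriving the record ID from the question ID plus the evidence position keeps IDs stable across runs, which matters when evaluation queries refer back to documents by ID.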
ARC (AI2 Reasoning Challenge)
Dataset ID: arc
HuggingFace: ai2_arc
Structure: Science questions with multiple-choice answers
loader = DataloaderCatalog.create("arc", split="test", limit=1000)
dataset = loader.load()
Record format:
text: Question + answer choices
metadata.id: Question identifier
metadata.question: Question text
metadata.answerKey: Correct answer
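The exact layout of the "Question + answer choices" text is not specified here. As one plausible sketch, assuming each row carries a `question`, a `choices` mapping with parallel `label` and `text` lists, and an `answerKey`, the text could be assembled as follows. `format_arc_text` and the sample row are hypothetical.

```python
from collections.abc import Mapping
from typing import Any


def format_arc_text(row: Mapping[str, Any]) -> str:
    """Join the question and its labeled choices into one indexable string."""
    choices = row["choices"]
    lines = [row["question"]]
    for label, choice in zip(choices["label"], choices["text"]):
        lines.append(f"{label}. {choice}")
    return "\n".join(lines)


row = {
    "id": "q_001",  # hypothetical identifier
    "question": "Which gas do plants absorb from the air?",
    "choices": {
        "label": ["A", "B", "C"],
        "text": ["Oxygen", "Carbon dioxide", "Nitrogen"],
    },
    "answerKey": "B",
}
text = format_arc_text(row)
```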
PopQA
Dataset ID: popqa
HuggingFace: akariasai/PopQA
Structure: Entity-centric questions from Wikipedia
loader = DataloaderCatalog.create("popqa", split="test", limit=500)
dataset = loader.load()
Record format:
text: Wikipedia passage
metadata.id: Passage identifier
metadata.entity: Entity mention
metadata.question: Associated question
FActScore
Dataset ID: factscore
HuggingFace: dskar/FActScore
Structure: Factuality-focused QA pairs
loader = DataloaderCatalog.create("factscore", split="test", limit=500)
dataset = loader.load()
Record format:
text: Factual statement
metadata.id: Statement identifier
metadata.topic: Topic/category
Earnings Calls
Dataset ID: earnings_calls
HuggingFace: lamini/earnings-calls-qa
Structure: Financial QA from earnings call transcripts
loader = DataloaderCatalog.create("earnings_calls", split="train", limit=300)
dataset = loader.load()
Record format:
text: Transcript excerpt
metadata.id: Excerpt identifier
metadata.company: Company name
metadata.quarter: Reporting quarter
Base loader interface
All loaders implement the BaseDatasetLoader abstract class:
class BaseDatasetLoader(ABC):
    def __init__(
        self,
        dataset_name: str,
        split: str,
        limit: int | None = None,
        streaming: bool = True,
    ) -> None:
        """Initialize the loader with dataset configuration."""

    @property
    @abstractmethod
    def dataset_type(self) -> DatasetType:
        """Return the supported dataset type identifier."""

    @abstractmethod
    def _load_dataset_iterable(self) -> Iterable[Mapping[str, Any]]:
        """Return the raw dataset rows as an iterable."""

    @abstractmethod
    def _parse_row(self, row: Mapping[str, Any]) -> list[DatasetRecord]:
        """Parse a dataset row into normalized records."""

    def load(self) -> LoadedDataset:
        """Load the dataset and return normalized records."""
Document conversion
The DocumentConverter class provides framework-specific conversion:
Haystack conversion
from vectordb.dataloaders.converters import DocumentConverter
from haystack import Document
items = [{"text": "content", "metadata": {"id": "1"}}]
haystack_docs = DocumentConverter.to_haystack(items)
# Result: List[Document]
# Document(content="content", meta={"id": "1"})
LangChain conversion
from vectordb.dataloaders.converters import DocumentConverter
from langchain_core.documents import Document
items = [{"text": "content", "metadata": {"id": "1"}}]
langchain_docs = DocumentConverter.to_langchain(items)
# Result: List[Document]
# Document(page_content="content", metadata={"id": "1"})
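The two conversions differ mainly in field names. The sketch below shows that mapping with plain dicts so it runs without Haystack or LangChain installed. `records_to_items_sketch`, `haystack_kwargs`, and `langchain_kwargs` are hypothetical helpers, and the local `DatasetRecord` is a stand-in; the real converters may do more, for example validation.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class DatasetRecord:  # local stand-in mirroring the definition on this page
    text: str
    metadata: dict[str, Any]


def records_to_items_sketch(records: list[DatasetRecord]) -> list[dict[str, Any]]:
    """Turn records into {"text", "metadata"} dicts, copying each metadata dict."""
    return [{"text": r.text, "metadata": dict(r.metadata)} for r in records]


def haystack_kwargs(item: dict[str, Any]) -> dict[str, Any]:
    # Haystack documents name the fields `content` and `meta`.
    return {"content": item["text"], "meta": item["metadata"]}


def langchain_kwargs(item: dict[str, Any]) -> dict[str, Any]:
    # LangChain documents name the fields `page_content` and `metadata`.
    return {"page_content": item["text"], "metadata": item["metadata"]}


items = records_to_items_sketch([DatasetRecord("content", {"id": "1"})])
haystack_fields = haystack_kwargs(items[0])
langchain_fields = langchain_kwargs(items[0])
```

With the frameworks installed, each kwargs dict could be unpacked into the corresponding `Document(...)` constructor.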
Configuration
Dataloaders integrate with YAML configuration:
dataloader:
  dataset: "triviaqa"  # Dataset identifier
  split: "test"        # Dataset split
  limit: 500           # Optional record limit
Load the configuration and create the loader:
from vectordb.utils.config import load_config
from vectordb.dataloaders import DataloaderCatalog

config = load_config("config.yaml")
dl_config = config["dataloader"]

loader = DataloaderCatalog.create(
    name=dl_config["dataset"],
    split=dl_config["split"],
    limit=dl_config.get("limit"),  # None when the key is omitted
)
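A missing `split` key or a non-numeric `limit` would otherwise surface only when the loader is created. As a minimal sketch, assuming the config has already been parsed into a dict, a small check can turn the `dataloader` section into keyword arguments and fail early. `loader_kwargs_from_config` is a hypothetical helper, not part of vectordb.

```python
from collections.abc import Mapping
from typing import Any


def loader_kwargs_from_config(config: Mapping[str, Any]) -> dict[str, Any]:
    """Validate the `dataloader` section and return keyword arguments for create()."""
    section = config.get("dataloader")
    if not isinstance(section, Mapping):
        raise ValueError("config is missing a 'dataloader' section")

    missing = [key for key in ("dataset", "split") if key not in section]
    if missing:
        raise ValueError(f"dataloader config is missing: {', '.join(missing)}")

    limit = section.get("limit")
    # bool is a subclass of int, so reject it explicitly.
    if limit is not None and (
        isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0
    ):
        raise ValueError("limit must be a positive integer or omitted")

    return {"name": section["dataset"], "split": section["split"], "limit": limit}


kwargs = loader_kwargs_from_config(
    {"dataloader": {"dataset": "triviaqa", "split": "test", "limit": 500}}
)
```

The result could then be passed on as `DataloaderCatalog.create(**kwargs)`.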
Streaming mode
By default, loaders use streaming to handle large datasets efficiently:
loader = DataloaderCatalog.create(
    name="triviaqa",
    split="test",
    limit=None,  # No limit
)
# Streams dataset without loading all records into memory
dataset = loader.load()
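To illustrate why streaming pairs well with a limit, the sketch below uses a counting iterable as a hypothetical stand-in for a streamed dataset. It shows that taking the first few rows with `itertools.islice` never produces the rest.

```python
from itertools import islice


class CountingSource:
    """Hypothetical stand-in for a streamed dataset that counts rows it produces."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.produced = 0

    def __iter__(self):
        for i in range(self.total):
            self.produced += 1
            yield {"id": str(i), "content": f"row {i}"}


source = CountingSource(total=1_000_000)

# Take only the first three rows; the source is not advanced past them.
first_rows = list(islice(source, 3))
rows_produced = source.produced
```

If `load()` collects its output into a list, as the `records` field of LoadedDataset suggests, then with `limit=None` every parsed record still ends up in memory. In that case a limit is the main way to bound memory use.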
Custom loaders
Implement custom loaders by extending BaseDatasetLoader:
from vectordb.dataloaders.base import BaseDatasetLoader
from vectordb.dataloaders.types import DatasetRecord, DatasetType


class CustomLoader(BaseDatasetLoader):
    @property
    def dataset_type(self) -> DatasetType:
        return "custom"

    def _load_dataset_iterable(self):
        # Load raw dataset rows
        from datasets import load_dataset

        dataset = load_dataset(
            self.dataset_name,
            split=self.split,
            streaming=self.streaming,
        )
        return dataset

    def _parse_row(self, row):
        # Parse row into DatasetRecord objects
        return [
            DatasetRecord(
                text=row["content"],
                metadata={"id": row["id"]},
            )
        ]
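A custom loader supplies only the two hooks, and the inherited `load()` is what ties them together. The base implementation is not shown on this page, so the following is a sketch of one way `load()` could drive the hooks, assuming `limit` caps the number of records rather than rows. `SketchBaseLoader` and `InMemoryLoader` are hypothetical, and records are plain dicts here.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable, Mapping
from itertools import islice
from typing import Any, Optional


class SketchBaseLoader(ABC):
    """Hypothetical stand-in for BaseDatasetLoader."""

    def __init__(self, limit: Optional[int] = None) -> None:
        self.limit = limit

    @abstractmethod
    def _load_dataset_iterable(self) -> Iterable[Mapping[str, Any]]:
        """Return the raw rows."""

    @abstractmethod
    def _parse_row(self, row: Mapping[str, Any]) -> list[dict[str, Any]]:
        """Turn one raw row into zero or more records."""

    def load(self) -> list[dict[str, Any]]:
        # Flatten rows into records lazily, then stop after `limit` records.
        parsed = (
            record
            for row in self._load_dataset_iterable()
            for record in self._parse_row(row)
        )
        return list(islice(parsed, self.limit))  # islice(..., None) takes all


class InMemoryLoader(SketchBaseLoader):
    def _load_dataset_iterable(self):
        return [{"id": str(i), "content": f"row {i}"} for i in range(5)]

    def _parse_row(self, row):
        return [{"text": row["content"], "metadata": {"id": row["id"]}}]


records = InMemoryLoader(limit=3).load()
```

Because the rows are consumed through a generator, parsing stops as soon as the limit is reached instead of processing every row first.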
Error handling
Dataloaders raise specific exceptions for different error conditions:
from vectordb.dataloaders import DataloaderCatalog
from vectordb.dataloaders.types import (
    DatasetLoadError,
    DatasetValidationError,
    UnsupportedDatasetError,
)

try:
    # Creation and loading share one try block, so load() is never
    # called on a loader that failed to be created.
    loader = DataloaderCatalog.create("invalid_dataset", split="test")
    dataset = loader.load()
except UnsupportedDatasetError as e:
    print(f"Dataset not supported: {e}")
except DatasetLoadError as e:
    print(f"Failed to load dataset: {e}")
except DatasetValidationError as e:
    print(f"Dataset validation failed: {e}")
Best practices
Use the catalog: Always use DataloaderCatalog.create() instead of instantiating loaders directly, so every loader is configured the same way.
Set limits during development: Pass the limit parameter while prototyping to avoid loading full datasets.
Convert once: Convert to framework documents once during indexing, not repeatedly during queries.
Handle exceptions: Catch DatasetLoadError and DatasetValidationError for robust pipelines.