The indexing pipeline prepares your dataset for retrieval by loading raw documents from HuggingFace, enriching each document with structured metadata extracted by an LLM, encoding the texts into dense vector embeddings using a SentenceTransformer model, and writing the resulting objects into a Weaviate collection. Every subsequent RAG pipeline query runs against this pre-built collection, so indexing is a required one-time setup step before running any optimizer or evaluation script.
Prerequisites
Before running the indexing script, ensure the following are in place:
- A Weaviate Cloud cluster is running and accessible.
- A .env file exists in the project root with the required credentials: WEAVIATE_URL, WEAVIATE_API_KEY, and GROQ_API_KEY.
- Project dependencies are installed with uv sync --all-extras --dev and the virtual environment is activated.
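A quick pre-flight check can catch a missing credential before the script tries to connect to anything. The sketch below is not part of the project: missing_credentials is a hypothetical helper, and it assumes that the three variables the indexing script reads (WEAVIATE_URL, WEAVIATE_API_KEY, GROQ_API_KEY) are the full required set.

```python
import os
from collections.abc import Mapping

# Variable names as read by the indexing script; treating them as the
# complete required set is an assumption of this sketch.
REQUIRED_VARS = ("WEAVIATE_URL", "WEAVIATE_API_KEY", "GROQ_API_KEY")


def missing_credentials(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling missing_credentials() once the .env file has been loaded, and stopping when the returned list is non-empty, tends to give a clearer failure than a connection or authentication error later in the run.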
Indexing Workflow
Load the YAML config
The script opens <dataset>_indexing_config.yml from the current directory and parses all settings: dataset coordinates, LLM model name, embedding model, metadata schema, collection name, and encoding batch size.
Load environment variables
load_dotenv() is called immediately after the config is parsed. This reads WEAVIATE_URL, WEAVIATE_API_KEY, and GROQ_API_KEY from the .env file in the project root and makes them available via os.getenv() for the rest of the script.
Load the dataset from HuggingFace
The dataset is fetched via the datasets library using the name, subset, and split fields from the config. For FreshQA, this retrieves the longseal subset of vtllms/sealqa.
Extract structured metadata with an LLM
A MetadataExtractor instance is created with the configured LLM (e.g. groq/llama-3.3-70b-versatile). It calls transform_documents() to iterate over every document, run structured-output generation against the metadata schema, and attach validated fields to each dspy.Example. Only fields that are successfully extracted and non-null are stored; missing values are silently dropped rather than filled with placeholders.
Encode documents with SentenceTransformer
All document texts are encoded into dense vectors in batches. The batch_size and show_progress_bar values come from the document_encoding section of the config.
Create the Weaviate collection
The script connects to Weaviate Cloud using WEAVIATE_URL and WEAVIATE_API_KEY from environment variables. It checks whether the target collection already exists, deletes it if so, and creates a fresh collection with a self_provided vector configuration (since embeddings are generated externally by SentenceTransformer).
Run the Indexing Script
Each dataset has its own indexing script. Run the script from inside the dataset directory so that the config YAML is resolved relative to the current working directory:
Indexing Config Reference
The indexing config file controls every aspect of the pipeline. Below is the complete freshqa_indexing_config.yml with inline annotations:
Config Sections
| Section | Key fields | Purpose |
|---|---|---|
| dataset | name, subset, split | Identifies which HuggingFace dataset and split to load |
| extractor_llm | model | LLM used by MetadataExtractor for structured-output generation |
| embedding | embedding_model, tokenizer_kwargs | SentenceTransformer model and tokenizer settings |
| metadata_schema | properties | JSON schema that drives metadata extraction per document |
| collection_name | (none) | Weaviate collection target; will be recreated if it exists |
| document_encoding | batch_size, show_progress_bar | Batching settings for the embedding encoding loop |
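It can help to check the shape of the parsed config before any expensive step runs. The following is a minimal sketch, not the project's own validation: check_config and REQUIRED_SECTIONS are hypothetical names, the section and field names are copied from the table above, and it assumes every listed key field is required and that collection_name is a plain value. The sample values for split, embedding_model, collection_name, and batch_size are placeholders rather than values from the real file.

```python
from collections.abc import Mapping
from typing import Any

# Section -> key fields, copied from the table above. An empty tuple means
# the section is assumed to be a plain value with no sub-fields.
REQUIRED_SECTIONS: dict[str, tuple[str, ...]] = {
    "dataset": ("name", "subset", "split"),
    "extractor_llm": ("model",),
    "embedding": ("embedding_model", "tokenizer_kwargs"),
    "metadata_schema": ("properties",),
    "collection_name": (),
    "document_encoding": ("batch_size", "show_progress_bar"),
}


def check_config(config: Mapping[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: list[str] = []
    for section, fields in REQUIRED_SECTIONS.items():
        if section not in config:
            problems.append(f"missing section: {section}")
            continue
        if not fields:
            continue  # plain value, nothing further to check
        value = config[section]
        if not isinstance(value, Mapping):
            problems.append(f"section {section} should be a mapping")
            continue
        for field in fields:
            if field not in value:
                problems.append(f"missing field: {section}.{field}")
    return problems


# A config as it might look after YAML parsing (placeholder values noted).
sample = {
    "dataset": {"name": "vtllms/sealqa", "subset": "longseal", "split": "test"},  # split: placeholder
    "extractor_llm": {"model": "groq/llama-3.3-70b-versatile"},
    "embedding": {"embedding_model": "placeholder-embedding-model", "tokenizer_kwargs": {}},
    "metadata_schema": {"properties": {}},
    "collection_name": "PlaceholderCollection",
    "document_encoding": {"batch_size": 32, "show_progress_bar": True},
}

print(check_config(sample))  # []
broken = {k: v for k, v in sample.items() if k != "collection_name"}
print(check_config(broken))  # ['missing section: collection_name']
```

Returning a list of problems instead of raising on the first one lets all gaps in a config be reported in a single pass.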