MetadataExtractor is a dspy.Module that reads arbitrary text and produces a structured dictionary of metadata fields according to a user-supplied JSON schema. The extracted metadata can annotate a document corpus for downstream filtering in WeaviateRetriever, or it can annotate query context so that retrieval is scoped to documents with matching attributes. The module uses a dedicated extractor_llm instance — separate from the main answer LLM — to keep extraction concerns isolated from the generation context.
A separate extractor_llm is passed at construction time rather than inherited from dspy.settings. This prevents the metadata extraction prompts and their outputs from appearing in the main LLM’s context window, which could otherwise bias answer generation.
Signature
The module is driven by ExtractMetadataSignature:
| Field | Type | Description |
|---|---|---|
| text | InputField | The input text to extract metadata from. The LLM must omit any field not explicitly mentioned in the text: no placeholders like "Unknown" or "N/A", and no invented values. Only facts directly stated in the input are allowed. |
| metadata_schema | InputField | JSON schema string defining the expected metadata structure. |
| metadata | OutputField | JSON object containing only successfully extracted, non-null fields. |
Schema format
The schema must be a Python dict with a top-level "properties" key. Each property entry contains a "type" and an optional "description".
Allowed types
| JSON type | Python equivalent | Notes |
|---|---|---|
| "string" | str | Supports an optional "enum" list of valid values |
| "number" | float / int | Numeric values only |
| "boolean" | bool | true / false only |
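As an illustration of the format and the three allowed types, a schema might look like the sketch below. The field names (`author`, `category`, `year`, `peer_reviewed`) are invented for this example and are not part of the library. Serializing with `json.dumps` is an assumption based on the signature describing `metadata_schema` as a JSON schema string.

```python
import json

# Hypothetical schema: every field name here is illustrative only.
example_schema = {
    "properties": {
        "author": {
            "type": "string",
            "description": "Name of the document's author",
        },
        "category": {
            "type": "string",
            "description": "Topic area of the document",
            # "enum" is only allowed on string properties.
            "enum": ["finance", "health", "technology"],
        },
        "year": {"type": "number", "description": "Publication year"},
        "peer_reviewed": {
            "type": "boolean",
            "description": "Whether the document was peer reviewed",
        },
    }
}

# Assumption: the dict is serialized to a JSON string before it is
# passed to the signature's metadata_schema input field.
schema_json = json.dumps(example_schema, indent=2)
```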
Constructor
extractor_llm: A fully initialized dspy.LM instance dedicated to metadata extraction. This LM is used inside a dspy.context(lm=self.extractor_llm) block so that extraction calls never touch dspy.settings.lm.
Methods
forward
Calls validate_schema first; a malformed schema raises ValueError immediately, before any LLM call. For failures that occur after validation (JSON parse errors, LLM errors), forward returns an empty {} so the pipeline continues without crashing. Only fields with non-null values are included in the returned dictionary.
The raw text from which metadata should be extracted.
A schema dict conforming to the format described above. Validated by validate_schema before the LLM call.
Returns Dict[str, Any]: a flat dictionary of successfully extracted, non-null metadata fields. Fields absent from the text are not included.
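The post-validation behaviour described for forward (an empty dict on failure, null fields dropped) can be sketched in plain Python. `parse_extracted_metadata` is a hypothetical helper written for this illustration, not a function exposed by the module, and the real implementation may differ.

```python
import json
from typing import Any, Dict


def parse_extracted_metadata(raw_output: str) -> Dict[str, Any]:
    """Hypothetical helper: turn raw LLM output into a metadata dict."""
    try:
        parsed = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        # Failures after schema validation fall back to an empty dict
        # so the surrounding pipeline keeps running.
        return {}
    if not isinstance(parsed, dict):
        # Valid JSON that is not an object carries no usable metadata.
        return {}
    # Keep only fields whose values are non-null.
    return {key: value for key, value in parsed.items() if value is not None}


print(parse_extracted_metadata('{"author": "Ada", "year": null}'))  # {'author': 'Ada'}
print(parse_extracted_metadata("not json"))  # {}
```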
transform_documents
Applies metadata extraction to a list of dspy.Example objects. For each document, the extracted metadata is merged into the existing example.metadata dict (extracted fields take precedence on key collisions). Returns a new list of dspy.Example objects with updated metadata and "text" as the designated input field.
List of dspy.Example objects, each expected to have a text attribute and a metadata dict attribute.
Schema applied uniformly to every document in the list.
Returns List[dspy.Example]: a new list of examples with enriched metadata.
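The merge precedence can be illustrated with plain dicts. This is a minimal sketch: `merge_metadata` is a hypothetical helper, and the real method operates on dspy.Example objects rather than bare dictionaries.

```python
from typing import Any, Dict


def merge_metadata(existing: Dict[str, Any], extracted: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: merge extracted fields over existing metadata."""
    # In dict unpacking, later keys win, so extracted values override
    # existing ones on collision. A new dict is returned; inputs are untouched.
    return {**existing, **extracted}


existing = {"source": "arxiv", "year": 2020}
extracted = {"year": 2023, "author": "Ada Lovelace"}
merged = merge_metadata(existing, extracted)
print(merged)  # {'source': 'arxiv', 'year': 2023, 'author': 'Ada Lovelace'}
```

Here `year` appears in both dicts, and the extracted value 2023 replaces the existing 2020, while `source` is carried over unchanged.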
validate_schema
Raises ValueError with a descriptive message on any of the following violations:
- Missing top-level "properties" key
- A property uses a type other than "string", "number", or "boolean"
- A non-string property includes an "enum" constraint
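The three checks listed above could be written roughly as follows. This is a sketch under assumptions, not the library's code: `validate_schema_sketch`, `ALLOWED_TYPES`, and the error messages are all hypothetical.

```python
from typing import Any, Dict

# The three JSON types the documentation lists as allowed.
ALLOWED_TYPES = {"string", "number", "boolean"}


def validate_schema_sketch(schema: Dict[str, Any]) -> None:
    """Hypothetical re-implementation of the three documented checks."""
    # Check 1: the top-level "properties" key must be present.
    if not isinstance(schema, dict) or "properties" not in schema:
        raise ValueError("Schema must contain a top-level 'properties' key")
    for name, spec in schema["properties"].items():
        prop_type = spec.get("type")
        # Check 2: only string, number, and boolean are accepted.
        if prop_type not in ALLOWED_TYPES:
            raise ValueError(f"Property '{name}' has unsupported type {prop_type!r}")
        # Check 3: "enum" is only valid on string properties.
        if "enum" in spec and prop_type != "string":
            raise ValueError(f"Property '{name}' uses 'enum' but is not a string")


# A valid schema passes silently; an invalid one raises ValueError.
validate_schema_sketch({"properties": {"lang": {"type": "string", "enum": ["en", "de"]}}})
try:
    validate_schema_sketch({"properties": {"tags": {"type": "array"}}})
except ValueError as err:
    print(f"Rejected: {err}")
```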