VuFind includes a full OAI-PMH harvesting stack built on the VuFindHarvest library. The harvester downloads metadata records from remote repositories and saves them to disk; a set of batch import scripts then indexes those records into Solr. This page explains how to configure sources, run the harvester, process harvested files, and set up automated scheduling.
The harvest/ directory
The harvest/ directory at the root of the VuFind installation contains all harvesting-related scripts and the default oai.ini configuration file.
Harvested records are written under $VUFIND_LOCAL_DIR/harvest/ (or $VUFIND_HOME/harvest/ if VUFIND_LOCAL_DIR is not set), with one subdirectory per configured OAI source.
Configuring OAI-PMH sources
All harvest sources are defined in oai.ini. Each source is an INI section whose name becomes the subdirectory into which records are written.
Core settings
| Key | Default | Description |
|---|---|---|
| url | (required) | Base URL of the OAI-PMH endpoint |
| set | (harvest all) | setSpec value to restrict harvesting to a single set. Repeat with set[] = x to harvest multiple sets. |
| metadataPrefix | oai_dc | Metadata format to request (e.g. marc21, oai_marc, oai_dc) |
| timeout | 60 | HTTP request timeout in seconds |
| dateGranularity | auto | Date format used by the server: YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ, or auto |
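To show how these settings reach the remote server, here is a minimal Python sketch that builds an OAI-PMH ListRecords request from them. It illustrates the protocol only and is not VuFindHarvest's own code; the function name list_records_url and the example.org endpoint are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper, illustration only)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # with the verb only, without metadataPrefix or set.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            # Mirrors the `set` key: restrict the harvest to a single set.
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


print(list_records_url("https://example.org/oai", "marc21", "theses"))
# https://example.org/oai?verb=ListRecords&metadataPrefix=marc21&set=theses
```

Leaving set_spec empty corresponds to the default of harvesting everything the endpoint exposes.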
Record combination
By default the harvester writes one file per OAI record. Set combineRecords = true to merge a server’s response chunk into a single file. This is required when the downstream XSLT transformation expects a <collection> wrapper.
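As a rough picture of what combining means, the Python sketch below wraps several record fragments in a single <collection> element. It is an illustration under simplifying assumptions (no namespaces, records already well-formed), not the harvester's implementation; combine_records is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def combine_records(record_xml_strings):
    """Wrap individual record documents in one <collection> root element."""
    collection = ET.Element("collection")
    for xml_text in record_xml_strings:
        # Parse each record and attach it as a child of the wrapper.
        collection.append(ET.fromstring(xml_text))
    return ET.tostring(collection, encoding="unicode")


chunk = ["<record><id>1</id></record>", "<record><id>2</id></record>"]
print(combine_records(chunk))
```

An XSLT stylesheet that matches on the wrapper element can then process the whole chunk in one pass instead of one file at a time.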
ID manipulation
Use idSearch[] and idReplace[] pairs (PHP preg_replace syntax) to normalise OAI identifiers before they are written to filenames and injected into records.
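The effect of a search/replace pair list can be previewed outside VuFind. The Python sketch below applies each pattern in order, which is how paired regex substitutions are usually chained; it is an approximation, not the harvester's code. The identifier, the patterns, and the name normalise_id are hypothetical, and Python patterns omit the delimiters (such as /.../) that PHP preg_replace requires.

```python
import re


def normalise_id(identifier, searches, replacements):
    """Apply each (search, replace) pair in order to an OAI identifier."""
    for pattern, replacement in zip(searches, replacements):
        identifier = re.sub(pattern, replacement, identifier)
    return identifier


# Hypothetical pairs: strip the repository prefix, then make the ID filename-safe.
searches = [r"^oai:example\.org:", r"/"]
replacements = ["", "_"]

print(normalise_id("oai:example.org:rec/42", searches, replacements))
# rec_42
```

Testing patterns this way before a harvest helps confirm that the resulting IDs match those already in the index.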
Authentication and proxies
SSL settings
XML sanitisation
Some servers return XML with characters outside the legal XML 1.0 range. Enable the sanitize option to strip them automatically.
Running the harvester
The harvester is invoked through VuFind’s Symfony Console interface.
Set the VUFIND_LOCAL_DIR environment variable before running the harvester. Without it, VuFind cannot locate your local configuration overrides, and harvested records will be written to the base installation directory.
Resumption tokens
The harvester automatically handles OAI-PMH resumption tokens. If a harvest is interrupted, re-run the same command; it will pick up from the last successfully fetched token recorded in last_state.txt inside the harvest subdirectory.
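The resume behaviour can be pictured with a small Python sketch. It assumes a state file that holds only the most recent token, which is a simplification: the actual contents of last_state.txt are not described here. The names harvest_with_resume and fetch_page are hypothetical.

```python
from pathlib import Path


def harvest_with_resume(fetch_page, state_file):
    """Fetch pages until the server stops returning a resumption token.

    fetch_page(token) must return (records, next_token); next_token is
    None on the final page. The latest token is persisted after every
    successful page so an interrupted run can continue from it.
    """
    state_file = Path(state_file)
    # Resume from a saved token if a previous run was interrupted.
    token = (state_file.read_text().strip() or None) if state_file.exists() else None
    records = []
    while True:
        page, token = fetch_page(token)
        records.extend(page)
        if token is None:
            # Harvest complete: clear the saved state.
            state_file.unlink(missing_ok=True)
            return records
        state_file.write_text(token)
```

If fetch_page raises partway through, the state file still holds the last token that was returned, so calling the function again continues from that page rather than from the start.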
Processing harvested records
After harvesting, files in each harvest subdirectory must be indexed into Solr. Use the batch scripts in harvest/ that match the record format.
MARC records
- Finds all .mrc, .xml, and .marc files in the target directory
- Sends them to import-marc.sh in batches (default: 10 files per batch)
- Moves processed files into a processed/ subdirectory
- Writes per-batch logs under log/
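The grouping step in the list above can be sketched in Python as follows. This only models how files might be collected and split into batches of ten; it is not the batch script itself, and plan_batches is a hypothetical name.

```python
from pathlib import Path

# File extensions named in the list above.
MARC_EXTENSIONS = {".mrc", ".xml", ".marc"}


def plan_batches(directory, batch_size=10):
    """Return the importable files in `directory`, grouped into batches."""
    files = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in MARC_EXTENSIONS
    )
    # Slice the sorted file list into consecutive groups of batch_size.
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
```

With 23 matching files this yields three batches of 10, 10, and 3 files. Files already moved into processed/ are not picked up again because only the top level of the directory is scanned.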
Authority MARC records
XSL-transformed records (Dublin Core, etc.)
Processing delete lists
When an OAI-PMH server reports deleted records, the harvester writes their IDs to a delete list. Process it with the batch-delete.sh script in harvest/.
Scheduling automated harvesting
A typical cron setup runs the harvest followed immediately by the import.
Troubleshooting harvest failures
Harvest stops early with a resumption token error
The OAI server returned an expired or invalid resumption token. Delete last_state.txt from the harvest subdirectory to force a full re-harvest from the beginning, then investigate whether the server has a shorter token expiry than the harvest duration.
SSL certificate verification fails
Add the server’s CA certificate path to oai.ini. As a temporary diagnostic measure only, you can disable peer verification with sslverifypeer = false. Do not leave this in production.
Records contain illegal XML characters
Enable the sanitize option in oai.ini for the affected source. Use badXMLLog to keep a copy of the raw bad XML for inspection.
Harvest hangs on large responses
Increase the timeout value for the source. For very large repositories, also consider setting stopAfter to a small number during initial testing to confirm connectivity before running a full harvest.
Records are harvested but not appearing in search results
The most common cause is that the batch import step did not run, or failed silently. Check the log files under harvest/MyRepository/log/. Also verify that Solr is running and that VUFIND_LOCAL_DIR points to the correct local directory containing your import configuration.
Duplicate records after re-harvest
Ensure that the idSearch[] / idReplace[] patterns produce the same IDs that were indexed previously. A change in the ID normalisation pattern will cause SolrMarc to create new documents instead of updating existing ones.