Multi-hop reasoning requires a model to synthesize information from multiple sources before producing an answer. This module fine-tunes
`unsloth/Llama-3.2-3B-Instruct` using QLoRA and GRPO on three multi-hop QA datasets. Eight reward functions provide dense supervision at every training step: four measure correctness through LLM judges and four enforce structured output format.
## Datasets
| Dataset | Examples | Description |
|---|---|---|
| HotpotQA | 90,447 | Multi-hop questions requiring reasoning over two Wikipedia paragraphs; includes a distractor configuration with irrelevant paragraphs mixed in |
| FreshQA | 254 | Search-augmented QA benchmark for long-context, multi-document reasoning under noisy retrieval |
| MuSiQue | 19,938 | Multi-hop questions with explicit supporting facts and compositional reasoning structure |
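For orientation, the HotpotQA distractor configuration is also available directly from the Hugging Face Hub, independent of the repo's loaders. A minimal sketch (whether your installed `datasets` version still resolves the `hotpot_qa` dataset id is an assumption):

```python
from datasets import load_dataset

# HotpotQA "distractor": each example pairs 2 gold paragraphs with 8 distractors
hotpot = load_dataset("hotpot_qa", "distractor", split="train")
print(hotpot[0]["question"])
print(hotpot[0]["answer"])
```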
## Reward functions
### Correctness rewards
These four functions evaluate the quality and accuracy of the model’s answer by calling an LLM judge (via DeepEval or Evidently AI).

| Reward function | Framework | Description |
|---|---|---|
| `DeepEvalGEvalRAGReward` | DeepEval | GEval metric with a custom LLM-as-a-Judge instruction tailored for RAG evaluation; assesses whether the answer correctly addresses the multi-hop question |
| `DeepEvalSummarizationReward` | DeepEval | Summarization metric measuring faithfulness to the provided source context |
| `DeepEvalAnswerRelevancyReward` | DeepEval | Answer Relevancy metric measuring how relevant the response is to the question |
| `EvidentlyCorrectnessLLMReward` | Evidently AI | `CorrectnessLLMEval` metric for answer accuracy scoring against the ground truth |
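As a sketch of how a DeepEval-backed correctness reward can plug into TRL's reward-function interface, the wrapper below wires up a GEval metric. The function name and criteria string are assumptions, not the repo's exact code, and DeepEval's `GEval` needs a judge model configured (e.g., via `OPENAI_API_KEY`):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def geval_rag_reward(prompts, completions, **kwargs):
    """Hypothetical wrapper mirroring DeepEvalGEvalRAGReward.

    Scores each completion in [0, 1] with an LLM judge; assumes
    plain-text (non-conversational) prompts and completions.
    """
    metric = GEval(
        name="Multi-hop RAG correctness",
        criteria="Does the answer correctly resolve the multi-hop question?",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    scores = []
    for prompt, completion in zip(prompts, completions):
        test_case = LLMTestCase(input=prompt, actual_output=completion)
        metric.measure(test_case)       # calls the LLM judge
        scores.append(metric.score)     # GEval scores fall in [0, 1]
    return scores
```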
### Format rewards
These four functions score the structural quality of the response without calling an external API.

| Reward function | Description |
|---|---|
| `ReasoningTagsReward` | Validates presence and proper nesting of `<reasoning>` and `<answer>` tags |
| `MultilineComplianceReward` | Rewards multi-line structured responses with sufficient depth |
| `StructureValidationReward` | Validates overall response structure against the expected schema |
| `ResponseFormatReward` | Enforces consistent formatting conventions across the response |
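Because format rewards need no external calls, they reduce to string checks. A minimal sketch of a tag-validation reward in the TRL reward-function style (the actual `ReasoningTagsReward` may award partial credit differently):

```python
import re

# Accept a <reasoning> block followed by an <answer> block, nothing else.
TAG_PATTERN = re.compile(
    r"^\s*<reasoning>.+?</reasoning>\s*<answer>.+?</answer>\s*$",
    re.DOTALL,
)

def reasoning_tags_reward(completions, **kwargs):
    """Return 1.0 for a properly nested <reasoning>/<answer> response,
    else 0.0; assumes plain-text completions."""
    return [1.0 if TAG_PATTERN.match(c) else 0.0 for c in completions]
```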
## Running the pipelines
HotpotQA has two configurations: `fullwiki` (full Wikipedia as evidence) and `distractor` (two gold paragraphs plus eight distractors). The default dataset subset used by the `HotpotQALoader` is `distractor`, which is the standard evaluation setting for multi-hop retrieval robustness. Override it in `config.yaml` with `dataset_subset: "fullwiki"` if needed.

### Configuration
Each pipeline uses an identical config structure. The `max_prompt_length` of 512 tokens accommodates multi-hop context passages. Increase it if your dataset examples include longer supporting documents, and reduce `per_device_train_batch_size` accordingly to manage VRAM.
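The HotpotQA `config.yaml` might look like the following sketch. Only the fields discussed on this page are grounded; the remaining names and values are illustrative assumptions:

```yaml
# config.yaml (sketch) — field names beyond those documented here are assumptions
model_name: "unsloth/Llama-3.2-3B-Instruct"
dataset_subset: "distractor"       # or "fullwiki"
max_prompt_length: 512             # fits multi-hop context passages
per_device_train_batch_size: 2     # illustrative value
gradient_accumulation_steps: 4     # illustrative value
output_dir: "./outputs/multi_hop_question_answering/grpo/hotpotqa"
```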
### Training script pattern
The HotpotQA script (`hotpotqa/train.py`) initializes the model with Unsloth’s 4-bit QLoRA, loads the dataset, and passes all eight reward functions to `GRPOTrainer`. The pattern looks roughly like this (the repo’s exact import paths, loader API, and LoRA hyperparameters are assumptions):
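```python
# Sketch of the hotpotqa/train.py pattern; import paths are hypothetical.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

from loaders import HotpotQALoader                       # hypothetical path
from rewards import (                                    # hypothetical path
    DeepEvalGEvalRAGReward, DeepEvalSummarizationReward,
    DeepEvalAnswerRelevancyReward, EvidentlyCorrectnessLLMReward,
    ReasoningTagsReward, MultilineComplianceReward,
    StructureValidationReward, ResponseFormatReward,
)

# Load the base model in 4-bit and attach LoRA adapters (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=16)

train_dataset = HotpotQALoader(dataset_subset="distractor").load()

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        DeepEvalGEvalRAGReward(), DeepEvalSummarizationReward(),
        DeepEvalAnswerRelevancyReward(), EvidentlyCorrectnessLLMReward(),
        ReasoningTagsReward(), MultilineComplianceReward(),
        StructureValidationReward(), ResponseFormatReward(),
    ],
    args=GRPOConfig(
        max_prompt_length=512,
        output_dir="./outputs/multi_hop_question_answering/grpo/hotpotqa",
    ),
    train_dataset=train_dataset,
)
trainer.train()
```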
### Output
Trained adapter weights and the tokenizer are saved to `./outputs/multi_hop_question_answering/grpo/<dataset>/` by default. Change `output_dir` in `config.yaml` to save elsewhere.
### GPU memory
GRPO with QLoRA on a 3B model requires approximately 12–16 GB of VRAM. Reduce `per_device_train_batch_size` to 1 and increase `gradient_accumulation_steps` to compensate on smaller GPUs, for example with the override sketched below.
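A low-VRAM override in `config.yaml` might look like this (the accumulation value is illustrative):

```yaml
per_device_train_batch_size: 1
gradient_accumulation_steps: 8   # preserves the effective batch size
```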