Medical and biomedical QA demands precise factual grounding and clinical reasoning that general-purpose reward signals struggle to capture. This module fine-tunes
unsloth/Llama-3.2-3B-Instruct with QLoRA and GRPO on three biomedical datasets, using eight reward functions tuned for medical evaluation.
Datasets
| Dataset | Size | Description |
|---|---|---|
| MedQA | 10,178 | USMLE-style multiple-choice medical questions covering clinical diagnosis, treatment, and pharmacology |
| BioASQ | 4,012 | Free-text biomedical questions sourced from PubMed literature (public substitute dataset) |
| PubMedQA | 211,269 | Research article QA derived from PubMed abstracts; questions ask whether a finding supports, contradicts, or is inconclusive |
The BioASQ pipeline uses the publicly available
enelpol/rag-mini-bioasq dataset as a substitute for the official BioASQ benchmark, which requires registration. The subset covers the same biomedical QA format and can be swapped with the official data by overriding dataset_id in config.yaml.
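To sanity-check the substitute dataset before pointing the pipeline at it, it can be loaded directly with the Hugging Face datasets library. The subset name and column layout in this sketch are assumptions, so check the dataset card on the Hub for the exact schema.

```python
from datasets import load_dataset

# Load the public BioASQ substitute used by this pipeline.
# NOTE: the subset name "question-answer-passages" and the split name are
# assumptions; consult the dataset card for the actual configuration.
dataset = load_dataset("enelpol/rag-mini-bioasq", "question-answer-passages", split="train")

print(dataset)     # number of rows and column names
print(dataset[0])  # one question/answer example
```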
Reward functions
The eight reward functions are identical in structure to the multi-hop QA module but are re-instantiated from the medical_question_answering module so that future domain-specific tuning remains isolated.
Correctness rewards
| Reward function | Framework | Description |
|---|---|---|
| DeepEvalGEvalRAGReward | DeepEval | GEval with a biomedical LLM-as-a-Judge criterion: evaluates whether the response correctly addresses the medical question given the provided context |
| DeepEvalSummarizationReward | DeepEval | Faithfulness metric measuring how well the answer captures the key facts from the source context |
| DeepEvalAnswerRelevancyReward | DeepEval | Measures relevance of the response to the clinical or research question |
| EvidentlyCorrectnessLLMReward | Evidently AI | CorrectnessLLMEval scoring answer accuracy against ground-truth medical answers |
Format rewards
| Reward function | Description |
|---|---|
| ReasoningTagsReward | Validates presence and nesting of `<reasoning>` and `<answer>` tags (see the sketch after this table) |
| MultilineComplianceReward | Rewards multi-line responses with structured depth |
| StructureValidationReward | Validates overall response structure against the expected schema |
| ResponseFormatReward | Enforces consistent formatting conventions |
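The format rewards are deterministic checks. As an illustration of the kind of logic involved, here is a minimal sketch of a tag-validation reward in the style of ReasoningTagsReward; the exact implementation, reward scale, and any partial-credit behavior in the medical_question_answering module may differ.

```python
import re

# The whole completion must be a <reasoning> block followed by an
# <answer> block, matching the tag format rewarded by ReasoningTagsReward.
_TAG_PATTERN = re.compile(
    r"^\s*<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$",
    re.DOTALL,
)

def reasoning_tags_reward(completions, **kwargs):
    """Illustrative sketch only: the repo's ReasoningTagsReward may use a
    different scale or grant partial credit. Follows TRL's reward-function
    convention of taking completions and returning one float per sample."""
    rewards = []
    for completion in completions:
        # TRL passes plain strings for standard datasets, message lists otherwise.
        text = completion if isinstance(completion, str) else completion[-1]["content"]
        rewards.append(1.0 if _TAG_PATTERN.match(text) else 0.0)
    return rewards
```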
The DeepEvalGEvalRAGReward GEval criterion reads “correctly addresses the medical question” rather than the generic RAG-evaluation phrasing, making the judge sensitive to clinical correctness.
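As a rough sketch of how such a judge-based reward can be wired up with DeepEval's GEval metric, the snippet below scores each completion against the question and retrieved context. The criterion wording, the reward scale, and the "context" column name are assumptions; only the general GEval API usage is shown, and the repo's DeepEvalGEvalRAGReward class may expose a different interface.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Illustrative criterion; the wording used by the repo's reward class may differ.
medical_geval = GEval(
    name="Medical correctness",
    criteria=(
        "Evaluate whether the response correctly addresses the medical "
        "question given the provided context."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
)

def _as_text(sample):
    # TRL passes plain strings for standard datasets and message lists for
    # conversational ones; normalize both to a string.
    return sample if isinstance(sample, str) else sample[-1]["content"]

def geval_rag_reward(prompts, completions, context, **kwargs):
    """Sketch of a GRPO reward wrapping DeepEval's GEval judge.
    The 'context' argument name is an assumption about the dataset schema."""
    rewards = []
    for prompt, completion, ctx in zip(prompts, completions, context):
        test_case = LLMTestCase(
            input=_as_text(prompt),
            actual_output=_as_text(completion),
            retrieval_context=ctx if isinstance(ctx, list) else [str(ctx)],
        )
        medical_geval.measure(test_case)
        rewards.append(float(medical_geval.score))  # GEval scores fall in [0, 1]
    return rewards
```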
Running the pipelines
Configuration
All three pipelines share the same hyperparameter structure, driven by a per-dataset config.yaml. Override dataset_id and dataset_split in config.yaml to point to a different biomedical dataset without modifying any Python code.
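A rough sketch of what such a config.yaml might contain is shown below. Only dataset_id, dataset_split, output_dir, per_device_train_batch_size, and gradient_accumulation_steps are keys referenced on this page; every other key and all of the values are illustrative assumptions, so refer to the config.yaml shipped with the repository for the actual MedQA settings.

```yaml
# Illustrative sketch of a MedQA config.yaml; values are assumptions and only
# the keys mentioned elsewhere on this page are taken from the documentation.
model_id: unsloth/Llama-3.2-3B-Instruct
dataset_id: GBaker/MedQA-USMLE-4-options   # assumption; the repo may use a different MedQA source
dataset_split: train
output_dir: ./outputs/medical_question_answering/medqa

# GRPO / QLoRA training hyperparameters (illustrative values)
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 5.0e-6
max_steps: 500
```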
Training script pattern
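The training scripts follow a common pattern across the three datasets. A condensed sketch is shown below; the config keys mirror the sketch above, the PEFT hyperparameters are illustrative, and the repo's scripts additionally build prompt columns per dataset and wire up the eight reward functions rather than the stub used here.

```python
from unsloth import FastLanguageModel  # import unsloth first so its patches apply

import yaml
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Load the per-dataset configuration (keys follow the sketch above; assumption).
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Load the 3B instruct model in 4-bit precision (QLoRA) via Unsloth.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# NOTE: GRPOTrainer expects a "prompt" column; the repo's scripts map each
# dataset's question/context fields into prompts before training.
dataset = load_dataset(cfg["dataset_id"], split=cfg["dataset_split"])

# Stand-in reward; the real scripts pass the eight medical reward functions.
def reward_stub(completions, **kwargs):
    return [0.0 for _ in completions]

training_args = GRPOConfig(
    output_dir=cfg["output_dir"],
    per_device_train_batch_size=cfg["per_device_train_batch_size"],
    gradient_accumulation_steps=cfg["gradient_accumulation_steps"],
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reward_stub],   # replace with the eight rewards listed above
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Persist the LoRA adapter weights and tokenizer (see Output below).
trainer.save_model(cfg["output_dir"])
tokenizer.save_pretrained(cfg["output_dir"])
```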
Output
Adapter weights and the tokenizer are saved to ./outputs/medical_question_answering/<dataset>/ by default. Change output_dir in config.yaml to save elsewhere.
GPU memory
GRPO with QLoRA on a 3B model requires approximately 12–16 GB of VRAM. On smaller GPUs, reduce per_device_train_batch_size to 1 and increase gradient_accumulation_steps to compensate, keeping the effective batch size unchanged.
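For example, a low-memory override in config.yaml might look like the following; the values are illustrative and preserve the effective batch size (per-device batch size times accumulation steps) of the config sketch above.

```yaml
# Low-VRAM override (illustrative): same effective batch size of
# 1 x 8 = 8 samples per optimizer step, but lower peak memory.
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
```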