Hypothetical Document Generation: Use an LLM (gpt-3.5-turbo with temperature=0.7) to generate a detailed, hypothetical medical document that would answer the question
Semantic Retrieval: Embed the hypothetical document and retrieve real documents similar to it
Answer Generation: Use a more powerful LLM (gpt-4o) to generate the final answer from real retrieved documents
HyDE uses two LLM calls: one creative call (temperature=0.7) to generate the hypothetical document, and one precise call (temperature=0) to generate the final answer.
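The three stages and two LLM calls described above can be sketched as a single function. This is a minimal illustration, not the project's implementation: `hyde_answer` and the three `fake_*` components are hypothetical names, and the stand-ins replace the real LLM and vector-store calls so the sketch runs without API access.

```python
from typing import Callable, List


def hyde_answer(
    question: str,
    generate_hypothetical: Callable[[str], str],
    retrieve_similar: Callable[[str], List[str]],
    generate_answer: Callable[[str, List[str]], str],
) -> str:
    """Run the three HyDE stages with caller-supplied components."""
    # Stage 1: the creative call drafts a hypothetical document for the question.
    hypothetical_doc = generate_hypothetical(question)
    # Stage 2: retrieval is keyed on the hypothetical document, not the question.
    real_docs = retrieve_similar(hypothetical_doc)
    # Stage 3: the precise call answers from the real documents only;
    # the hypothetical document is not passed along.
    return generate_answer(question, real_docs)


# Stand-in components (assumptions for illustration only).
def fake_hyde_llm(question: str) -> str:
    return f"Hypothetical guide section answering: {question}"


def fake_retriever(text: str) -> List[str]:
    return ["Real guide passage A", "Real guide passage B"]


def fake_answer_llm(question: str, docs: List[str]) -> str:
    return f"Answer to '{question}' based on {len(docs)} retrieved documents"


result = hyde_answer(
    "¿Qué es la preeclampsia?", fake_hyde_llm, fake_retriever, fake_answer_llm
)
print(result)
```

Passing the components in as callables keeps the stage boundaries explicit: the hypothetical document only ever reaches the retriever, which is why its inaccuracies can influence which documents are found but are not handed to the answer model.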
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Creative model for hypothetical document generation
llm_hyde = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

# Precise model for final answer generation
llm_answer = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Embeddings for retrieval
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
hyde_prompt_template = """You are a medical expert writing a detailed section for a medical guide on pregnancy and childbirth.

Based on this question: {question}

Write a detailed and comprehensive medical document that would perfectly answer this question.

The document should include:
- Accurate medical information on the topic
- Relevant clinical details
- Appropriate medical recommendations
- Important considerations for maternal health
- Practical information and advice

Write the document as if it were part of an official medical guide on pregnancy and childbirth.
Be specific, detailed, and use appropriate medical terminology.

HYPOTHETICAL DOCUMENT:"""
For the query “¿Qué es la preeclampsia?”, HyDE might generate:
La preeclampsia es una complicación grave del embarazo caracterizada por hipertensión arterial y daño a órganos, típicamente después de la semana 20 de gestación. Se manifiesta con presión arterial superior a 140/90 mmHg y proteinuria. Los síntomas incluyen dolores de cabeza severos, cambios en la visión, dolor abdominal superior y edema significativo. Los factores de riesgo incluyen primer embarazo, embarazo múltiple, hipertensión crónica, diabetes, obesidad, y antecedentes familiares. El tratamiento requiere monitoreo cercano de la presión arterial, pruebas de función renal y hepática, y evaluación fetal frecuente. En casos graves, el único tratamiento definitivo es el parto, que puede necesitar inducirse prematuramente. Las complicaciones potenciales incluyen eclampsia, síndrome HELLP, desprendimiento de placenta, y restricción del crecimiento fetal...
This rich document is then embedded and used for retrieval.
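To make the retrieval step concrete, the sketch below ranks a small corpus by cosine similarity to a hypothetical document. It is an illustration under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model such as `text-embedding-3-small`, and the corpus sentences are invented for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, int]:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(hypothetical_doc: str, corpus: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k corpus documents most similar to the hypothetical document."""
    query_vec = toy_embed(hypothetical_doc)
    scored = [(cosine(query_vec, toy_embed(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Invented example corpus (assumption for illustration only).
corpus = [
    "la preeclampsia causa hipertensión arterial y proteinuria en el embarazo",
    "la lactancia materna aporta nutrientes al recién nacido",
    "la diabetes gestacional eleva la glucosa durante el embarazo",
]
hypothetical = (
    "La preeclampsia es una complicación del embarazo "
    "con hipertensión arterial y proteinuria"
)

top = retrieve(hypothetical, corpus)
for score, doc in top:
    print(f"{score:.3f}  {doc}")
```

Because the hypothetical document carries many more topic-specific terms than the short question, it overlaps more strongly with the relevant passage; with a real embedding model the same effect shows up as closer vectors rather than shared tokens.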
from src.rag.hyde import query_for_evaluation

# Basic usage with default models
result = query_for_evaluation(
    question="¿Cuáles son los síntomas del parto prematuro?"
)

# With custom models for each stage
result = query_for_evaluation(
    question="¿Qué es la diabetes gestacional?",
    hyde_model="gpt-3.5-turbo",   # Hypothetical doc generation
    answer_model="gpt-4o-mini"    # Final answer generation
)

# With custom LLM instances
from langchain_openai import ChatOpenAI

hyde_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.8)
answer_llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

result = query_for_evaluation(
    question="¿Qué cuidados necesito en el embarazo?",
    custom_hyde_llm=hyde_llm,
    custom_answer_llm=answer_llm
)
Hallucination risk: The hypothetical document may include incorrect information
Retrieval bias: Retrieves documents similar to what the LLM thinks the answer should be
No keyword precision: Pure semantic search may miss exact term matches
HyDE’s hypothetical document is generated by an LLM and may contain hallucinations or inaccuracies. These are not passed to the final answer step, which is grounded in the real retrieved documents, but they can bias retrieval toward documents that resemble the inaccurate content.
hyde_prompt = """You are a medical expert writing a detailed section for a medical guide.

Write a detailed and comprehensive medical document that would perfectly answer this question: {question}

Include medical information, clinical details, recommendations, and practical advice. Use appropriate medical terminology."""
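As a small illustration of how the `{question}` placeholder in a template like the one above gets filled before the LLM call, the sketch below uses plain `str.format`. The shortened template and the `build_hyde_prompt` helper are assumptions for the example; the project may well use a prompt-template class instead.

```python
# Shortened stand-in for the prompt template (assumption for illustration).
hyde_prompt = (
    "You are a medical expert writing a detailed section for a medical guide.\n"
    "Write a detailed and comprehensive medical document that would perfectly "
    "answer this question: {question}\n"
)


def build_hyde_prompt(question: str) -> str:
    """Substitute the user's question into the template (hypothetical helper)."""
    return hyde_prompt.format(question=question)


prompt = build_hyde_prompt("¿Qué es la diabetes gestacional?")
print(prompt)
```

The resulting string is what would be sent to the creative model in the first stage.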