from ai_infra.eval import RAGFaithfulness

Evaluates whether an answer is grounded in the provided context. Uses an LLM judge to verify that the generated answer is faithful to the retrieved context and does not contain hallucinations.
Args:
    llm_judge: Model to use for judging (e.g., "gpt-4o-mini"). If None, uses the default from the environment.
    provider: LLM provider (openai, anthropic, google, etc.).
    context_key: Metadata key containing the context/retrieved docs. Default: "context".
    strict: If True, requires exact grounding; if False, allows reasonable inferences. Default: False.
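The defaults can be overridden at construction time. A minimal sketch based on the parameters above; the provider string and the alternate metadata key are illustrative, not required values:

    >>> evaluator = RAGFaithfulness(
    ...     llm_judge="gpt-4o-mini",
    ...     provider="openai",
    ...     context_key="retrieved_docs",  # read context from this metadata key instead of "context"
    ...     strict=True,                   # require exact grounding, no inferences
    ... )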
Example:
    >>> from ai_infra.eval.evaluators import RAGFaithfulness
    >>> from pydantic_evals import Case, Dataset
    >>>
    >>> dataset = Dataset(
    ...     cases=[
    ...         Case(
    ...             inputs="What is the refund policy?",
    ...             metadata={"context": "Refunds are available within 30 days."},
    ...         ),
    ...     ],
    ...     evaluators=[RAGFaithfulness(llm_judge="gpt-4o-mini")],
    ... )
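To actually run the evaluation, the dataset is executed against the system under test. A minimal sketch, assuming pydantic_evals' Dataset.evaluate_sync and report.print; answer_question is a hypothetical stand-in for a real RAG pipeline:

    >>> def answer_question(question: str) -> str:
    ...     # Hypothetical task: a real pipeline would retrieve context and
    ...     # generate a grounded answer here.
    ...     return "Refunds are available within 30 days of purchase."
    >>>
    >>> report = dataset.evaluate_sync(answer_question)  # runs each case and its evaluators
    >>> report.print()  # per-case results, including the faithfulness score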
Returns:
    EvaluationReason with:
    - value: float (faithfulness score from 0.0 to 1.0)
    - reason: Explanation from the LLM judge
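For illustration, a returned result might be consumed like this. A sketch with made-up values, assuming EvaluationReason is the pydantic_evals result type whose value and reason fields match the description above:

    >>> from pydantic_evals.evaluators import EvaluationReason
    >>> result = EvaluationReason(
    ...     value=0.9,
    ...     reason="Every claim in the answer is supported by the retrieved context.",
    ... )
    >>> result.value >= 0.8  # e.g. gate a regression check on a minimum score
    True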