ClinHallu: A New Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning
ClinHallu is a specialized benchmark for medical Multimodal Large Language Models (MLLMs). Rather than only detecting that a hallucination occurred, it pinpoints the stage of the reasoning process where the error originates.
Addressing the Trust Gap in Medical MLLMs
Deploying MLLMs for clinical decision support demands very high reliability. Existing benchmarks have largely focused on collecting data to detect whether a model hallucinates, and they often treat hallucination as a binary outcome without examining the mechanism behind the failure. This lack of granularity makes it hard for researchers to systematically debug medical AI systems and improve their reliability.
Stage-Wise Hallucination Diagnosis
The authors of the ClinHallu benchmark argue that hallucinations in medical contexts are not monolithic; rather, they stem from distinct failure points within the model's processing pipeline. ClinHallu is designed to diagnose these "source-level" hallucinations by categorizing errors into three primary stages:
- Visual Misrecognition: Errors occurring during the initial perception phase, where the model fails to correctly identify or interpret visual cues from medical imaging.
- Incorrect Medical Knowledge Recall: Failures in recalling internal knowledge, where the model retrieves inaccurate medical facts or associations.
- Flawed Reasoning Integration: Errors that occur during the synthesis phase, where the model possesses correct visual information and correct medical knowledge but fails to integrate them logically to reach a correct conclusion.
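The three-stage taxonomy above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Stage` enum, the `StageJudgement` record, and the "earliest failing stage" rule are hypothetical and are not taken from the ClinHallu paper, whose actual annotation schema and attribution procedure are not described in this summary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    """The three hallucination sources named in the text."""
    VISUAL = "visual misrecognition"
    KNOWLEDGE = "incorrect medical knowledge recall"
    REASONING = "flawed reasoning integration"


@dataclass
class StageJudgement:
    """Hypothetical per-sample annotation (not the benchmark's real schema)."""
    visual_ok: bool      # did the model read the image correctly?
    knowledge_ok: bool   # did it recall the right medical facts?
    answer_ok: bool      # was the final conclusion correct?


def attribute_stage(j: StageJudgement) -> Optional[Stage]:
    """Return the earliest failing stage for a wrong answer, else None.

    Assumed rule: a wrong answer with correct perception and correct
    knowledge is attributed to reasoning integration.
    """
    if j.answer_ok:
        return None
    if not j.visual_ok:
        return Stage.VISUAL
    if not j.knowledge_ok:
        return Stage.KNOWLEDGE
    return Stage.REASONING
```

Attributing each wrong answer to the earliest failing stage is only one possible convention; it is chosen here because a perception error usually contaminates everything downstream of it.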
Improving Clinical Decision Support
By isolating these stages, ClinHallu allows developers to determine whether a model's failure is a vision problem, a knowledge problem, or a reasoning problem. This diagnostic approach provides a more precise roadmap for optimization, whether through improved vision encoders, specialized medical fine-tuning, or enhanced chain-of-thought prompting strategies.
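As a rough sketch of how such a diagnosis might guide optimization, the Python snippet below counts stage labels over a set of failed samples and maps the most frequent one to the remedies mentioned above. The label strings, the `REMEDIES` table, and the `suggest_focus` helper are hypothetical names introduced for illustration, not part of ClinHallu itself.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Remedies as listed in the text, keyed by assumed stage labels.
REMEDIES = {
    "visual": "improve the vision encoder",
    "knowledge": "specialized medical fine-tuning",
    "reasoning": "enhanced chain-of-thought prompting",
}


def suggest_focus(failure_stages: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (dominant stage, suggested remedy), or None if no failures.

    Raises KeyError if a label is not one of the three assumed stages.
    """
    counts = Counter(failure_stages)
    if not counts:
        return None
    stage, _ = counts.most_common(1)[0]
    return stage, REMEDIES[stage]


if __name__ == "__main__":
    # Toy labels, not real benchmark results.
    print(suggest_focus(["visual", "reasoning", "visual"]))
```

In practice the full per-stage distribution is more informative than the single dominant stage, so a real analysis would likely report all three counts.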