Non-factuality long-form hallucination benchmark
Abstract
Prerequisites:
1. E: familiar with the basics of LLMs
2. D: familiar with AI hallucination evaluation and detection
Background
As large language models (LLMs) are increasingly deployed in real-world applications, their tendency to produce factually incorrect or fabricated information, known as hallucination, has become a major concern. To systematically measure and compare this behavior, researchers have developed hallucination benchmarks that evaluate models' factual accuracy, grounding, and faithfulness across tasks such as question answering, summarization, and dialogue. These benchmarks and evaluation protocols, including TruthfulQA, FActScore, and HaluEval, provide standardized settings to quantify how and when models deviate from reliable information. Establishing robust hallucination benchmarks is essential for tracking progress, guiding model improvement, and ensuring the development of trustworthy AI systems.
Focus
While most existing studies evaluate language models primarily through factuality, i.e., how closely their outputs align with external facts, hallucination is a broader phenomenon that extends beyond factual correctness. A model may produce information that is plausible yet unsupported or ungrounded in its input, even when it is not strictly false. This project focuses on measuring such non-factuality hallucinations: content generated without sufficient grounding or evidence in the provided context. By distinguishing hallucination from simple factual errors, the project aims to develop more comprehensive evaluation methods that capture the full range of unfaithful or unsupported model behaviors.
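
To make the distinction concrete, the minimal sketch below (hypothetical names and labels, not part of any existing benchmark) scores each atomic claim in a long-form output along two independent axes: whether it is grounded in the provided context, and whether it is externally true, false, or unverifiable. A non-factuality hallucination is then a claim that is ungrounded in the context even though it is not externally false, or cannot be checked externally at all.

from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Support(Enum):
    """Grounding of a claim with respect to the provided context."""
    SUPPORTED = "supported"        # entailed by the context
    UNSUPPORTED = "unsupported"    # plausible but not grounded in the context
    CONTRADICTED = "contradicted"  # conflicts with the context


@dataclass
class Claim:
    text: str
    support: Support                 # grounding axis (context-level)
    externally_true: Optional[bool]  # factuality axis; None = unverifiable


def non_factuality_hallucination_rate(claims: Iterable[Claim]) -> float:
    """Fraction of claims that are ungrounded in the context while not being
    externally false, i.e. claims a pure factuality check would not flag."""
    claims = list(claims)
    if not claims:
        return 0.0
    flagged = [
        c for c in claims
        if c.support is not Support.SUPPORTED and c.externally_true is not False
    ]
    return len(flagged) / len(claims)


# Toy usage: the second claim is not factually wrong, but nothing in the
# context supports it, so it counts as a non-factuality hallucination.
claims = [
    Claim("The report covers Q3 2023.", Support.SUPPORTED, True),
    Claim("Revenue grew due to a new product line.", Support.UNSUPPORTED, None),
]
print(non_factuality_hallucination_rate(claims))  # 0.5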
Method
This project proposes the design of a new benchmark to evaluate hallucinations in long-form generation while clearly distinguishing them from factuality errors. Unlike factual inaccuracies, these hallucinations cannot be validated or refuted using external knowledge bases or web search, which makes them more subtle and challenging to detect. Candidate tasks and scenarios for the benchmark include data analysis, Deep Research, and similar long-form generation settings.
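
As one possible starting point, the evaluation skeleton below is a sketch rather than a fixed design: the function names and the two pluggable components, a claim extractor and a context-only grounding judge (for instance an NLI model or an LLM judge), are assumptions for illustration. The key property is that grounding is judged against the task context alone, with no external knowledge base or web search.

from typing import Callable, List, Sequence, Tuple

# Pluggable components (assumptions for illustration):
#   - a claim extractor that splits a long-form output into atomic claims
#   - a grounding judge that checks a single claim against the task context only
ClaimExtractor = Callable[[str], List[str]]
GroundingJudge = Callable[[str, str], bool]   # (claim, context) -> grounded?


def evaluate_sample(output: str, context: str,
                    extract: ClaimExtractor,
                    is_grounded: GroundingJudge) -> dict:
    """Score one long-form generation against its own task context.
    No external knowledge base or web search is consulted, mirroring the
    focus on hallucinations that cannot be validated or refuted externally."""
    claims = extract(output)
    if not claims:
        return {"n_claims": 0, "hallucination_rate": 0.0}
    grounded = sum(is_grounded(c, context) for c in claims)
    return {
        "n_claims": len(claims),
        "hallucination_rate": 1.0 - grounded / len(claims),
    }


def evaluate_benchmark(samples: Sequence[Tuple[str, str]],
                       extract: ClaimExtractor,
                       is_grounded: GroundingJudge) -> float:
    """Average hallucination rate over (output, context) pairs."""
    results = [evaluate_sample(o, c, extract, is_grounded) for o, c in samples]
    scored = [r for r in results if r["n_claims"] > 0]
    return (sum(r["hallucination_rate"] for r in scored) / len(scored)
            if scored else 0.0)

For tasks such as data analysis or Deep Research, the context would be the input tables or retrieved documents, so the grounding judge can operate without any external lookup.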