Non-factuality long-form hallucination benchmark

Supervisors

Suitable for

MSc in Advanced Computer Science

Abstract

Prerequisites:
1. Essential: familiarity with the basics of LLMs
2. Desirable: familiarity with AI hallucination evaluation and detection

Background
As large language models (LLMs) are increasingly deployed in real-world applications, their tendency to produce factually incorrect or fabricated information—known as hallucination—has become a major concern. To
systematically measure and compare this behavior, researchers have developed hallucination benchmarks that evaluate models’ factual accuracy, grounding, and faithfulness across tasks such as question answering,
summarization, and dialogue. These benchmarks, including datasets like TruthfulQA, FActScore, and HaluEval, provide standardized settings to quantify how and when models deviate from reliable information. Establishing
robust hallucination benchmarks is essential for tracking progress, guiding model improvement, and ensuring the development of trustworthy AI systems.

Focus
While most existing studies evaluate language models primarily through factuality—how closely their outputs align with external facts—hallucination is a broader phenomenon that extends beyond factual correctness. A model may
produce information that is plausible yet unsupported or ungrounded in its input, even when not strictly false. This project focuses on measuring such non-factuality hallucinations—content generated without sufficient grounding
or evidence in the provided context. By distinguishing hallucination from simple factual errors, the project aims to develop more comprehensive evaluation methods that capture the full range of unfaithful or unsupported model
behaviors.

Method
This project proposes the design of a new benchmark to evaluate hallucinations in long-form generation while clearly distinguishing them from factuality errors. Unlike factual inaccuracies, such hallucinations cannot be validated or refuted against external knowledge bases or web search, which makes them subtler and harder to detect. Potential tasks and scenarios for the benchmark include data analysis, Deep Research, and other long-form generation settings.