
Rank-Calibrating Large Language Models (LLMs) for Improved Uncertainty Estimation

Supervisor

Suitable for

MSc in Advanced Computer Science
Computer Science, Part C

Abstract

Large language models (LLMs) have made significant progress in natural language generation, but their tendency to produce incorrect or hallucinated outputs necessitates robust uncertainty quantification methods. Various uncertainty measures, such as semantic entropy [1,2] and verbalised confidence scores [3], have been proposed to estimate confidence and uncertainty in model-generated responses. However, these measures vary in scale and interpretability, making it difficult to compare their effectiveness. A promising direction for addressing this issue is Rank-Calibration [4], a framework that assesses whether higher uncertainty (or lower confidence) consistently corresponds to lower-quality generations. Rank-Calibration provides a principled alternative to binary correctness thresholding, offering a finer-grained evaluation of uncertainty measures.
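To make the rank-calibration criterion concrete, the sketch below implements a simplified empirical check, not the exact estimator of [4]: sort responses by uncertainty, split them into equal-mass bins, and count how often mean generation quality fails to decrease as uncertainty rises. The array names, bin count, and toy data are all illustrative assumptions.

```python
import numpy as np

def rank_monotonicity_violations(uncertainty, quality, n_bins=10):
    """Simplified empirical rank-calibration check (not the estimator of [4]):
    sort by uncertainty, form equal-mass bins, and measure how often mean
    quality *increases* from one bin to the next."""
    order = np.argsort(uncertainty)
    bins = np.array_split(np.asarray(quality)[order], n_bins)  # low -> high uncertainty
    mean_quality = np.array([b.mean() for b in bins])
    violations = np.sum(np.diff(mean_quality) > 0)  # quality should fall, not rise
    return violations / (n_bins - 1)

# Toy usage: quality (e.g. a ROUGE or LLM-judge score in [0, 1]) is a noisy
# decreasing function of uncertainty, so violations should be rare.
rng = np.random.default_rng(0)
u = rng.random(1000)
q = np.clip(1.0 - u + 0.3 * rng.standard_normal(1000), 0.0, 1.0)
print(f"fraction of adjacent-bin violations: {rank_monotonicity_violations(u, q):.2f}")
```

A perfectly rank-calibrated measure would drive this fraction to zero, and because the check depends only on rankings, it can compare measures that live on entirely different scales.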

This project focuses on two key challenges in Rank-Calibration. First, we investigate how to better rank-calibrate existing uncertainty quantification techniques using non-linear recalibration methods, such as histogram binning, to ensure that higher uncertainty consistently corresponds to lower generation quality. Second, we explore whether it is possible to develop uncertainty quantification techniques that inherently guarantee Rank-Calibration. By addressing these challenges, this work aims to improve the reliability and interpretability of uncertainty estimates in generative language models.
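As a concrete illustration of the first direction, the following sketch applies histogram binning as a non-linear recalibration map: equal-mass bins are fitted over raw uncertainty scores on a held-out calibration split, and each bin is assigned its empirical error rate (one minus mean quality). The function names, bin count, and the assumption of continuous scores (so no bin is empty) are ours, not taken from [4].

```python
import numpy as np

def fit_histogram_binning(u_cal, q_cal, n_bins=10):
    """Fit a histogram-binning recalibrator: equal-mass bins over raw
    uncertainty, each mapped to the bin's mean error rate (1 - quality).
    Assumes continuous scores so every bin is non-empty."""
    order = np.argsort(u_cal)
    u_sorted, q_sorted = np.asarray(u_cal)[order], np.asarray(q_cal)[order]
    # Interior bin edges at equal-mass quantiles of the calibration scores.
    edges = np.quantile(u_sorted, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_ids = np.digitize(u_sorted, edges)  # indices in 0 .. n_bins - 1
    bin_values = np.array([1.0 - q_sorted[bin_ids == b].mean() for b in range(n_bins)])
    return edges, bin_values

def apply_histogram_binning(u_new, edges, bin_values):
    """Map new raw uncertainties to their bin's calibrated value."""
    return bin_values[np.digitize(u_new, edges)]

# Usage: fit on a calibration split, then transform test-time scores.
rng = np.random.default_rng(1)
u_cal = rng.random(500)
q_cal = np.clip(1.0 - u_cal + 0.2 * rng.standard_normal(500), 0.0, 1.0)
edges, vals = fit_histogram_binning(u_cal, q_cal)
print(apply_histogram_binning(rng.random(3), edges, vals))
```

Note that histogram binning alone does not force the recalibrated values to be monotone in the raw score; adding an isotonic-regression step (pooling adjacent violators) is one standard way to supply that guarantee, which connects to the project's second question of measures that are rank-calibrated by construction.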

[1] Kuhn, Lorenz, Yarin Gal, and Sebastian Farquhar. "Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation." The Eleventh International Conference on Learning Representations (ICLR), 2023.

[2] Nikitin, Alexander, et al. "Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities." arXiv preprint arXiv:2405.20003, 2024.

[3] Xiong, Miao, et al. "Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs." The Twelfth International Conference on Learning Representations (ICLR), 2024.

[4] Huang, Xinmeng, et al. "Uncertainty in Language Models: Assessment through Rank-Calibration." arXiv preprint arXiv:2404.03163, 2024.