Bayesian Experimental Design with LLMs: Enabling Probabilistic Conditioning via In-Context Updates
Prerequisites:
• Familiarity with large language models (LLMs), prompting, and in-context learning
• Knowledge of probabilistic machine learning (Bayesian inference, conditioning, information gain)
• Some familiarity with experimental design / active learning (helpful but not required)
• Programming experience in Python (PyTorch + Hugging Face helpful), plus basic experiment tracking (Weights & Biases or similar)
Background
Bayesian Experimental Design (BED) formalizes how to choose experiments (queries, interventions, observations) to maximally reduce uncertainty about latent hypotheses or parameters. Classical BED relies on probabilistic conditioning: after observing data, the posterior is updated via Bayes’ rule, and the next experiment is chosen to optimize a utility such as expected information gain or reduction in posterior entropy.
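To make the classical criterion concrete, the following is a minimal sketch (the toy model and all function names are illustrative, not part of the project) of computing expected information gain exactly for a discrete hypothesis space with known likelihoods:

```python
import numpy as np

def posterior(prior, lik):
    """Bayes' rule on a discrete hypothesis space: p(h|y) ∝ p(y|h) p(h)."""
    post = prior * lik
    return post / post.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expected_information_gain(prior, lik_table):
    """EIG of an experiment whose outcome y has likelihood
    lik_table[h, y] = p(y | h) under hypothesis h."""
    marginal = prior @ lik_table  # p(y) = sum_h p(h) p(y|h)
    eig = entropy(prior)
    for y, p_y in enumerate(marginal):
        if p_y > 0:
            eig -= p_y * entropy(posterior(prior, lik_table[:, y]))
    return eig

# Toy setting: two hypotheses and a binary-outcome experiment.
prior = np.array([0.5, 0.5])
informative = np.array([[0.9, 0.1],    # p(y | h=0)
                        [0.1, 0.9]])   # p(y | h=1)
uninformative = np.full((2, 2), 0.5)   # outcome carries no evidence
print(expected_information_gain(prior, informative))    # ≈ 0.368 nats
print(expected_information_gain(prior, uninformative))  # ≈ 0.0 nats
```

With exact EIG available, the Bayes-optimal experiment is simply the candidate that maximizes this quantity.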
Large language models (LLMs) exhibit in-context learning: they change behavior after receiving examples or evidence in the prompt. However, these “in-context updates” are not guaranteed to correspond to probabilistic conditioning. They may be miscalibrated, sensitive to prompt formatting, or fail to preserve coherent uncertainty across alternative hypotheses; these limitations prevent LLMs from behaving like Bayesian agents [1].
If LLM in-context updates were close to true Bayesian conditioning, then LLMs could serve as approximate Bayesian agents and approach Bayesian Optimal Experimental Design. However, it is currently unclear:
• When do LLM in-context updates approximate Bayes’ rule?
• When do they systematically deviate?
• Can we measure and reduce this deviation?
[1] Choudhury et al. BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design. ICLR, 2026.
Focus
The central goal of this project is to investigate whether and when LLM in-context updates are Bayesian, to quantify their deviation from a Bayesian Oracle, and to design interventions that make these updates more Bayesian-like.
The project will address three core questions:
• Diagnosis: Under what task structures do LLM in-context updates deviate from Bayes’ rule?
• Quantification: How large is the “conditioning gap” between LLM belief updates and Bayesian posterior updates?
• Intervention: Can we modify prompting, inference procedures, or training objectives so that LLM updates more closely match Bayesian conditioning?
Method
To achieve these goals, the project will:
• Construct a Bayesian Oracle: Design controlled generative tasks with known priors and likelihoods, where exact or high-precision posterior updates can be computed. This defines the gold-standard Bayesian update.
• Elicit LLM Belief Updates: Present priors and sequential evidence in-context and extract the LLM’s implied beliefs (either explicitly as distributions or implicitly via predictive probabilities).
• Quantify the Conditioning Gap: Compare LLM updates to the Oracle posterior using metrics such as KL divergence, calibration error, likelihood sensitivity, and invariance to evidence ordering.
• Characterize Systematic Deviations: Identify structured failure modes (e.g., under/over-updating, recency bias, prompt sensitivity, incoherent probability mass allocation).
• Design Interventions: Develop inference-time (structured prompting, belief tracking, self-consistency) and training-based (distillation from the Oracle, auxiliary consistency losses, RL-style objectives) methods to encourage Bayesian-faithful updates.
• Evaluate Impact on Experimental Design: Measure whether reducing the conditioning gap improves downstream performance in Bayesian Experimental Design tasks (e.g., regret relative to Bayes-optimal experiment selection).
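The Oracle, elicitation, and gap-measurement steps could be prototyped as follows; this is a minimal sketch in which a hypothetical "tempered" updater stands in for elicited LLM beliefs (a real study would replace it with distributions read out of the model):

```python
import numpy as np

def oracle_posterior(prior, evidence_liks):
    """Gold-standard update: multiply in each evidence likelihood
    vector p(e_t | h) sequentially and renormalize (exact Bayes)."""
    post = prior.copy()
    for lik in evidence_liks:
        post = post * lik
        post /= post.sum()
    return post

def tempered_update(prior, evidence_liks, lam):
    """Hypothetical stand-in for elicited LLM beliefs: an agent that
    under-updates (lam < 1) or over-updates (lam > 1) on evidence,
    p(h | e) ∝ p(h) p(e | h)**lam."""
    post = prior.copy()
    for lik in evidence_liks:
        post = post * lik ** lam
        post /= post.sum()
    return post

def kl(p, q, eps=1e-12):
    """Conditioning gap KL(oracle || beliefs), in nats."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float((p * np.log(p / q)).sum())

prior = np.array([0.5, 0.3, 0.2])
evidence = [np.array([0.8, 0.1, 0.1]), np.array([0.6, 0.3, 0.1])]
oracle = oracle_posterior(prior, evidence)
beliefs = tempered_update(prior, evidence, lam=0.5)  # under-updating agent
print(kl(oracle, oracle))   # 0.0: no gap against itself
print(kl(oracle, beliefs))  # > 0: the conditioning gap
```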
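For characterizing under/over-updating specifically, one option is to fit an "update temperature" to observed beliefs; the sketch below assumes deviations are well described by a tempered likelihood (all names are illustrative, and the "observed" beliefs are simulated rather than taken from any real model):

```python
import numpy as np

def fit_update_temperature(prior, lik, observed, grid=None):
    """Fit lam in p(h|e) ∝ p(h) p(e|h)**lam to an observed belief
    vector by grid search over KL(observed || tempered posterior).
    lam < 1 indicates under-updating; lam > 1, over-updating."""
    if grid is None:
        grid = np.linspace(0.0, 3.0, 301)
    def tempered(lam):
        p = prior * lik ** lam
        return p / p.sum()
    def kl(p, q, eps=1e-12):
        p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
        return (p * np.log(p / q)).sum()
    return float(min(grid, key=lambda lam: kl(observed, tempered(lam))))

# Simulate an agent whose true temperature is 0.5 (under-updating).
prior = np.array([0.5, 0.5])
lik = np.array([0.9, 0.1])
obs = prior * lik ** 0.5
obs /= obs.sum()
print(fit_update_temperature(prior, lik, obs))  # ≈ 0.5
```

The recovered temperature gives a single interpretable number per task, which makes it easy to compare deviation patterns across task structures and prompt formats.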
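For the downstream evaluation, regret relative to Bayes-optimal experiment selection can be measured in EIG terms; a self-contained sketch with a toy candidate set (the two-candidate setup is purely illustrative):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def eig(prior, lik_table):
    """Expected information gain of one candidate experiment,
    with lik_table[h, y] = p(y | h)."""
    marginal = prior @ lik_table
    gain = entropy(prior)
    for y, p_y in enumerate(marginal):
        if p_y > 0:
            post = prior * lik_table[:, y]
            gain -= p_y * entropy(post / post.sum())
    return gain

def eig_regret(prior, candidates, chosen_idx):
    """Regret (in nats of EIG) of a choice, e.g. one made by an
    LLM agent, relative to the Bayes-optimal candidate."""
    gains = [eig(prior, c) for c in candidates]
    return max(gains) - gains[chosen_idx]

prior = np.array([0.5, 0.5])
candidates = [
    np.array([[0.9, 0.1], [0.1, 0.9]]),  # informative test
    np.array([[0.5, 0.5], [0.5, 0.5]]),  # pure coin flip
]
print(eig_regret(prior, candidates, chosen_idx=0))  # 0.0 (Bayes-optimal)
print(eig_regret(prior, candidates, chosen_idx=1))  # ≈ 0.368
```

A natural experiment is then to check whether interventions that shrink the conditioning gap also shrink this regret when the LLM itself picks the experiments.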