Bayesian Experimental Design with LLMs: Enabling Probabilistic Conditioning via In-Context Updates
Prerequisites:
• Familiarity with large language models (LLMs), prompting, and in-context learning
• Knowledge of probabilistic machine learning (Bayesian inference, conditioning, information gain)
• Some familiarity with experimental design / active learning (helpful but not required)
• Programming experience in Python (PyTorch + Hugging Face helpful), plus basic experiment tracking (Weights & Biases or similar)
Background
Bayesian Experimental Design (BED) formalizes how to choose experiments (queries, interventions, observations) to maximally reduce uncertainty about latent hypotheses or parameters. Classical BED relies on probabilistic conditioning: after observing data, the posterior is updated via Bayes’ rule, and the next experiment is chosen to optimize a utility such as expected information gain or reduction in posterior entropy.
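To make the classical criterion concrete, the following is a minimal sketch (the toy model and all function names are illustrative, not part of the project) of computing expected information gain exactly for a discrete hypothesis space with known likelihoods:

```python
import numpy as np

def posterior(prior, lik):
    """Bayes' rule on a discrete hypothesis space: p(h|y) ∝ p(y|h) p(h)."""
    post = prior * lik
    return post / post.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expected_information_gain(prior, lik_table):
    """EIG of an experiment whose outcome y has likelihood
    lik_table[h, y] = p(y | h) under hypothesis h."""
    marginal = prior @ lik_table  # p(y) = sum_h p(h) p(y|h)
    eig = entropy(prior)
    for y, p_y in enumerate(marginal):
        if p_y > 0:
            eig -= p_y * entropy(posterior(prior, lik_table[:, y]))
    return eig

# Toy setting: two hypotheses and a binary-outcome experiment.
prior = np.array([0.5, 0.5])
informative = np.array([[0.9, 0.1],    # p(y | h=0)
                        [0.1, 0.9]])   # p(y | h=1)
uninformative = np.full((2, 2), 0.5)   # outcome carries no evidence
print(expected_information_gain(prior, informative))    # ≈ 0.368 nats
print(expected_information_gain(prior, uninformative))  # ≈ 0.0 nats
```

With exact EIG available, the Bayes-optimal experiment is simply the candidate that maximizes this quantity.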
Large language models (LLMs) exhibit in-context learning: they change behavior after receiving examples or evidence in the prompt. However, these “in-context updates” are not guaranteed to correspond to probabilistic conditioning. They may be miscalibrated, sensitive to prompt formatting, or fail to preserve coherent uncertainty across alternative hypotheses; these limitations prevent LLMs from behaving like Bayesian agents [1].
If LLM in-context updates were close to true Bayesian conditioning, then LLMs could serve as approximate Bayesian agents and approach Bayesian Optimal Experimental Design. However, it is currently unclear:
• When do LLM in-context updates approximate Bayes’ rule?
• When do they systematically deviate?
• Can we measure and reduce this deviation?
[1] Choudhury et al. BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design. ICLR, 2026.
Focus
The central goal of this project is to investigate whether and when LLM in-context updates are Bayesian, to quantify their deviation from a Bayesian Oracle, and to design interventions that make these updates more Bayesian-like.
The project will address three core questions:
• Diagnosis: Under what task structures do LLM in-context updates deviate from Bayes’ rule?
• Quantification: How large is the “conditioning gap” between LLM belief updates and Bayesian posterior updates?
• Intervention: Can we modify prompting, inference procedures, or training objectives so that LLM updates more closely match Bayesian conditioning?
Method
To achieve these goals, the project will:
• Construct a Bayesian Oracle: Design controlled generative tasks with known priors and likelihoods, where exact or high-precision posterior updates can be computed. This defines the gold-standard Bayesian update.
• Elicit LLM Belief Updates: Present priors and sequential evidence in-context and extract the LLM’s implied beliefs (either explicitly as distributions or implicitly via predictive probabilities).
• Quantify the Conditioning Gap: Compare LLM updates to the Oracle posterior using metrics such as KL divergence, calibration error, likelihood sensitivity, and invariance to evidence ordering.
• Characterize Systematic Deviations: Identify structured failure modes (e.g., under/over-updating, recency bias, prompt sensitivity, incoherent probability mass allocation).
• Design Interventions: Develop inference-time (structured prompting, belief tracking, self-consistency) and training-based (distillation from the Oracle, auxiliary consistency losses, RL-style objectives) methods to encourage Bayesian-faithful updates.
• Evaluate Impact on Experimental Design: Measure whether reducing the conditioning gap improves downstream performance in Bayesian Experimental Design tasks (e.g., regret relative to Bayes-optimal experiment selection).
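The Oracle, elicitation, and gap-measurement steps could be prototyped as follows; this is a minimal sketch in which a hypothetical "tempered" updater stands in for elicited LLM beliefs (a real study would replace it with distributions read out of the model):

```python
import numpy as np

def oracle_posterior(prior, evidence_liks):
    """Gold-standard update: multiply in each evidence likelihood
    vector p(e_t | h) sequentially and renormalize (exact Bayes)."""
    post = prior.copy()
    for lik in evidence_liks:
        post = post * lik
        post /= post.sum()
    return post

def tempered_update(prior, evidence_liks, lam):
    """Hypothetical stand-in for elicited LLM beliefs: an agent that
    under-updates (lam < 1) or over-updates (lam > 1) on evidence,
    p(h | e) ∝ p(h) p(e | h)**lam."""
    post = prior.copy()
    for lik in evidence_liks:
        post = post * lik ** lam
        post /= post.sum()
    return post

def kl(p, q, eps=1e-12):
    """Conditioning gap KL(oracle || beliefs), in nats."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float((p * np.log(p / q)).sum())

prior = np.array([0.5, 0.3, 0.2])
evidence = [np.array([0.8, 0.1, 0.1]), np.array([0.6, 0.3, 0.1])]
oracle = oracle_posterior(prior, evidence)
beliefs = tempered_update(prior, evidence, lam=0.5)  # under-updating agent
print(kl(oracle, oracle))   # 0.0: no gap against itself
print(kl(oracle, beliefs))  # > 0: the conditioning gap
```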
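For characterizing under/over-updating specifically, one option is to fit an "update temperature" to observed beliefs; the sketch below assumes deviations are well described by a tempered likelihood (all names are illustrative, and the "observed" beliefs are simulated rather than taken from any real model):

```python
import numpy as np

def fit_update_temperature(prior, lik, observed, grid=None):
    """Fit lam in p(h|e) ∝ p(h) p(e|h)**lam to an observed belief
    vector by grid search over KL(observed || tempered posterior).
    lam < 1 indicates under-updating; lam > 1, over-updating."""
    if grid is None:
        grid = np.linspace(0.0, 3.0, 301)
    def tempered(lam):
        p = prior * lik ** lam
        return p / p.sum()
    def kl(p, q, eps=1e-12):
        p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
        return (p * np.log(p / q)).sum()
    return float(min(grid, key=lambda lam: kl(observed, tempered(lam))))

# Simulate an agent whose true temperature is 0.5 (under-updating).
prior = np.array([0.5, 0.5])
lik = np.array([0.9, 0.1])
obs = prior * lik ** 0.5
obs /= obs.sum()
print(fit_update_temperature(prior, lik, obs))  # ≈ 0.5
```

The recovered temperature gives a single interpretable number per task, which makes it easy to compare deviation patterns across task structures and prompt formats.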
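For the downstream evaluation, regret relative to Bayes-optimal experiment selection can be measured in EIG terms; a self-contained sketch with a toy candidate set (the two-candidate setup is purely illustrative):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def eig(prior, lik_table):
    """Expected information gain of one candidate experiment,
    with lik_table[h, y] = p(y | h)."""
    marginal = prior @ lik_table
    gain = entropy(prior)
    for y, p_y in enumerate(marginal):
        if p_y > 0:
            post = prior * lik_table[:, y]
            gain -= p_y * entropy(post / post.sum())
    return gain

def eig_regret(prior, candidates, chosen_idx):
    """Regret (in nats of EIG) of a choice, e.g. one made by an
    LLM agent, relative to the Bayes-optimal candidate."""
    gains = [eig(prior, c) for c in candidates]
    return max(gains) - gains[chosen_idx]

prior = np.array([0.5, 0.5])
candidates = [
    np.array([[0.9, 0.1], [0.1, 0.9]]),  # informative test
    np.array([[0.5, 0.5], [0.5, 0.5]]),  # pure coin flip
]
print(eig_regret(prior, candidates, chosen_idx=0))  # 0.0 (Bayes-optimal)
print(eig_regret(prior, candidates, chosen_idx=1))  # ≈ 0.368
```

A natural experiment is then to check whether interventions that shrink the conditioning gap also shrink this regret when the LLM itself picks the experiments.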