
Stealing reasoning from a closed source language model

Supervisors

Suitable for

MSc in Advanced Computer Science

Abstract

Prerequisites: Completion of the Uncertainty in Deep Learning course and strong mathematical foundations

Background
Knowledge distillation is traditionally a technique in which a large trained “teacher” model is used to train a much smaller “student” more effectively. It is now applied far more broadly, and recent studies have found that various forms of distillation can transfer behaviours between models. Examples include a non-reasoning language model learning to imitate a reasoning model, and quirks such as a distilled model inheriting its teacher’s fondness for owls, even though the distillation data makes no mention of owls and the distillation is done at the token level rather than the logit level.

This last form of distillation has been referred to as “subliminal learning”.
Relevant literature: https://arxiv.org/pdf/2509.23886, https://owls.baulab.info/
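
To make concrete what “token-level” distillation means here, the minimal sketch below fine-tunes a student only on text sampled from a teacher, never on the teacher’s logits. All model names, the prompt, and the hyperparameters are placeholders chosen for illustration, with a small open model standing in for the closed-source teacher.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder models: an open teacher stands in for the closed-source model.
teacher_name = "gpt2-large"
student_name = "gpt2"
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(teacher_name)  # gpt2 family shares one tokenizer
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).to(device).eval()
student = AutoModelForCausalLM.from_pretrained(student_name).to(device)

prompt = "Q: What is 17 * 23? Think step by step.\nA:"

# 1) Sample a completion from the teacher; only the generated text is kept
#    (token-level distillation), the teacher's logits are never used.
with torch.no_grad():
    enc = tok(prompt, return_tensors="pt").to(device)
    out = teacher.generate(**enc, max_new_tokens=64, do_sample=True, top_p=0.95)
text = tok.decode(out[0], skip_special_tokens=True)

# 2) Fine-tune the student with ordinary next-token cross-entropy on the sample.
optim = torch.optim.AdamW(student.parameters(), lr=1e-5)
student.train()
batch = tok(text, return_tensors="pt").to(device)
loss = student(**batch, labels=batch["input_ids"]).loss
loss.backward()
optim.step()
optim.zero_grad()
print(f"distillation loss on one teacher sample: {loss.item():.3f}")

In practice the prompt set would be much larger and the teacher would only expose text (or partial summaries of its reasoning) through an API, which is what makes the attack setting interesting.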
Research Questions
● What attack could be used to steal reasoning capabilities from a closed-source model?
● How can this be done so that the recovered reasoning traces remain legible?
Method
● To be developed, with a focus on simple baselines. LLM inversion may be useful for recovering full reasoning traces from partial summaries (see the sketch below).
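
The inversion idea could, for instance, be prototyped along the lines sketched here. This is only an illustrative sketch under assumed choices: every model name, prompt format, and example pair is hypothetical rather than part of the proposed method.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical (partial summary, full trace) pairs. In practice they would be
# generated with an open reasoning model whose full traces are visible; the
# trained inverter is then applied to the summaries a closed model exposes.
pairs = [
    ("Split 23 into 20 + 3 and multiplied each part by 17.",
     "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391."),
]

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder inverter model
inverter = AutoModelForCausalLM.from_pretrained("gpt2")
optim = torch.optim.AdamW(inverter.parameters(), lr=1e-5)

# Fine-tune the inverter to map a partial summary back to a full reasoning trace.
inverter.train()
for summary, trace in pairs:
    text = f"Summary: {summary}\nFull reasoning: {trace}{tok.eos_token}"
    batch = tok(text, return_tensors="pt")
    loss = inverter(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optim.step()
    optim.zero_grad()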