Do Punctuation Tokens Act as Sinks, Summaries, and Anchors in Transformers?
Supervisor
Andrey Kravchenko (andrey.kravchenko@cs.ox.ac.uk)

Suitable for

Expected background of students and CS techniques that will be applied
The student should be able to work with, or quickly learn, PyTorch, Hugging Face, and IPython notebooks, and should have a solid mathematical background.

Abstract
Transformers often allocate unusually high attention to punctuation (especially periods and commas), hinting that these tokens play a structural role in how context is organized. This project will test three hypotheses: (1) punctuation acts as an attention "sink" that reliably attracts attention mass; (2) punctuation positions encode compressed summaries of the preceding span; and (3) these tokens serve as anchors that help the model predict what comes next.
Methodologically, you'll quantify attention to punctuation across families of pre-trained generative models (GPT, Llama), controlling for token frequency and position. You'll then run causal interventions, removing, shuffling, or inserting punctuation, and measure the resulting changes in loss, attention redistribution, and representation drift. Finally, you'll probe hidden states at punctuation positions to test whether they can reconstruct the preceding content, and evaluate anchor effects on next-token prediction.
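As a minimal sketch of the first step, the quantity of interest is the attention mass each head sends to punctuation positions. The function below is illustrative only: in the project the attention tensor would come from a Hugging Face model run with `output_attentions=True`, and the punctuation mask from its tokenizer; the names and the toy uniform-attention demo here are assumptions, not a prescribed implementation.

```python
import numpy as np

def attention_to_punct(attn: np.ndarray, punct_mask: np.ndarray) -> np.ndarray:
    """Average attention mass flowing to punctuation positions.

    attn:       (heads, seq, seq) row-stochastic attention weights for one
                layer (e.g. from a model run with output_attentions=True).
    punct_mask: (seq,) boolean, True where the token is punctuation.
    Returns:    (heads,) mean over query positions of the attention mass
                sent to punctuation keys.
    """
    mass = attn[:, :, punct_mask].sum(axis=-1)  # (heads, seq): mass per query
    return mass.mean(axis=-1)                   # average over query positions

# Tiny demo: uniform attention over 8 tokens, 2 of which are punctuation.
heads, seq = 4, 8
attn = np.full((heads, seq, seq), 1.0 / seq)
punct_mask = np.zeros(seq, dtype=bool)
punct_mask[[3, 7]] = True  # pretend positions 3 and 7 hold "." and ","

print(attention_to_punct(attn, punct_mask))  # uniform baseline: 2/8 = 0.25 per head
```

A head whose measured mass sits well above this uniform baseline (after the frequency and position controls described above) would be a candidate "sink" head.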
Outcomes include a comparative map of punctuation-related attention by layer/head, causal evidence for or against sink/summary/anchor roles, and practical guidance for punctuation-aware decoding or caching in long-context settings.
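The causal-intervention step can be sketched as plain sequence edits; the token ids below are arbitrary placeholders (in practice they would come from the model's tokenizer), and each edited variant would then be rescored with the model's loss and compared against the original. This is a sketch under those assumptions, not a fixed recipe.

```python
import random

def remove_punct(ids, punct_ids):
    """Ablation: drop punctuation tokens entirely."""
    return [t for t in ids if t not in punct_ids]

def shuffle_punct(ids, punct_ids, seed=0):
    """Control: keep the same tokens, but reinsert punctuation at random
    positions, destroying its structural placement while preserving the
    token multiset (so frequency effects stay matched)."""
    rng = random.Random(seed)
    punct = [t for t in ids if t in punct_ids]
    rest = [t for t in ids if t not in punct_ids]
    out = list(rest)
    for t in punct:
        out.insert(rng.randrange(len(out) + 1), t)
    return out

# Placeholder ids; suppose id 9 stands in for ".".
ids = [5, 9, 1, 9, 2, 7]
print(remove_punct(ids, {9}))   # punctuation ablated: [5, 1, 2, 7]
print(shuffle_punct(ids, {9}))  # same tokens, punctuation repositioned
```

Loss on each variant (e.g. mean next-token NLL) relative to the original text then gives the causal evidence the abstract calls for.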