Divide-and-Conquer Context: Do many short-context agents beat one long-prompt agent?
Supervisor
Suitable for
Abstract
Supervised by Andrey Kravchenko andrey.kravchenko@cs.ox.ac.uk
Expected background of students and CS techniques that will be applied
The student should know how to work with PyTorch, Hugging Face, and IPython notebooks, or be able to learn them quickly. They should also have a solid mathematical background.
Many studies show that LLMs do not robustly use very long prompts: performance drops as context grows and when key facts sit mid-prompt (“lost in the middle”). Synthetic and realistic long-context suites (BABILong, RULER, LongBench/v2) likewise report sharp degradation beyond a few thousand tokens on tasks that require multi-hop aggregation rather than simple retrieval. This motivates testing whether teams of short-context, task-specialized agents can beat a single long-prompt agent under the same token budget.
The project compares (i) a single long-prompt baseline that ingests the entire context with (ii) a multi-agent setup that shards the corpus into slices of at most 1–5k tokens each, routes them to specialists (retriever, summarizer, evidence-checker), and aggregates their outputs with a judge or vote, or with a Mixture-of-Agents layer. AutoGen (or an alternative framework) can be used to build the orchestration variants. Both setups are evaluated on BABILong (reasoning-in-a-haystack) and LongBench/v2 (long-doc QA/summary/code). The metrics tracked are accuracy/F1, position robustness, wall-clock time, total tokens, and an overhead ratio (coordination tokens ÷ reading tokens), measured while sweeping team size, shard overlap, and aggregator type.
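The sharding, voting, and overhead-ratio steps described above could be sketched roughly as follows. This is only an illustrative sketch, not the project's implementation: the function names (shard_tokens, majority_vote, overhead_ratio) are hypothetical, tokens are assumed to be an already-tokenized list (a real run would use the model's own tokenizer), and a plain majority vote stands in for the judge or Mixture-of-Agents aggregators.

```python
from collections import Counter
from typing import List, Sequence


def shard_tokens(tokens: Sequence[str], shard_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into shards of at most `shard_size` tokens.

    Consecutive shards share `overlap` tokens (the "shard overlap" knob).
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    if not 0 <= overlap < shard_size:
        raise ValueError("overlap must satisfy 0 <= overlap < shard_size")
    step = shard_size - overlap
    shards: List[List[str]] = []
    for start in range(0, len(tokens), step):
        shards.append(list(tokens[start:start + shard_size]))
        if start + shard_size >= len(tokens):
            break  # this shard already reaches the end of the corpus
    return shards


def majority_vote(answers: Sequence[str]) -> str:
    """Hypothetical simplest aggregator: return the most common answer.

    Ties are resolved in favour of the answer seen first.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    return Counter(answers).most_common(1)[0][0]


def overhead_ratio(coordination_tokens: int, reading_tokens: int) -> float:
    """Overhead ratio = coordination tokens / reading tokens."""
    if reading_tokens <= 0:
        raise ValueError("reading_tokens must be positive")
    return coordination_tokens / reading_tokens
```

Here reading tokens are assumed to be the tokens agents spend ingesting shards, and coordination tokens everything exchanged between agents and the aggregator; how exactly to draw that line is a design decision for the project.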
Outcomes & next step. Deliver (i) breakpoints where short-context teams reliably outperform long prompts, (ii) a “law of diminishing returns” curve for coordination overhead, and (iii) design guidance for team size, routing, and aggregation. A potential next step is to build a Prompt-to-Team compiler that automatically decomposes long prompts into agent roles and shards under a token budget, validated on the same suites.
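One possible way to make the notion of a breakpoint concrete is sketched below. It assumes, purely for illustration, that a breakpoint is the smallest measured context length from which the team's score exceeds the baseline's by more than a chosen margin at that length and at every larger measured length. The function name find_breakpoint and the margin parameter are hypothetical, and the project may well settle on a different definition (for example, one based on confidence intervals).

```python
from typing import Optional, Sequence


def find_breakpoint(
    context_lengths: Sequence[int],
    team_scores: Sequence[float],
    baseline_scores: Sequence[float],
    margin: float = 0.0,
) -> Optional[int]:
    """Return the smallest context length from which the team stays ahead.

    "Ahead" means team score minus baseline score is greater than `margin`
    at that length and at all larger measured lengths. Returns None if the
    team is not ahead at the largest measured length.
    """
    if not (len(context_lengths) == len(team_scores) == len(baseline_scores)):
        raise ValueError("all inputs must have the same length")
    rows = sorted(zip(context_lengths, team_scores, baseline_scores))
    breakpoint_length: Optional[int] = None
    for length, team, base in rows:
        if team - base > margin:
            # Start of a (possibly continuing) run where the team is ahead.
            if breakpoint_length is None:
                breakpoint_length = length
        else:
            # The team fell behind again, so any earlier candidate is void.
            breakpoint_length = None
    return breakpoint_length
```

The inputs would come from the evaluation sweeps, with one score per context length for each setup; a larger margin gives a more conservative breakpoint.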