Graph-Memory Agents for Self-Evolving Web and SWE Tasks
Supervisor
Supervised by Andrey Kravchenko (andrey.kravchenko@cs.ox.ac.uk)

Suitable for

Expected background of students and CS techniques that will be applied
The student should either know how to work with PyTorch, Hugging Face, and IPython notebooks, or be able to learn them quickly. They should also have a solid mathematical background.

Abstract
Do structured world models improve self-evolving agents compared to reasoning-memory banks? This project will adapt the AriGraph architecture—an evolving semantic-episodic knowledge graph for agent memory—to the ReasoningBank task suite (WebArena, Mind2Web, SWE-Bench-Verified). The goal is to test whether graph-based memory can match or surpass ReasoningBank’s distilled “reasoning memory” for long-horizon, cross-task learning, and to explore hybrid designs that combine both.
Methodologically, you’ll implement an AriGraph-based agent for the ReasoningBank environments. AriGraph will encode web or code entities (e.g., pages, DOM elements, repositories, tests, patches) as nodes and link them via semantic and episodic relations (“contains,” “links_to,” “fails_on,” “fixed_by”). As the agent acts, it will extract triplets and events using an LLM parser, incrementally updating the graph to reflect new discoveries or corrections. For each new task, you’ll retrieve relevant subgraphs—filtered by recency and degree—to guide reasoning, replacing or augmenting ReasoningBank’s textual memory items.
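The memory described above—typed triplets over web/code entities, retrieved by recency and degree—can be sketched as a small class. This is a hypothetical illustration, not AriGraph's actual implementation: the relation names follow the project description, while the flat triplet store, the scoring formula, and the `alpha` mixing weight are assumptions for the sketch.

```python
import time
from collections import defaultdict

class GraphMemory:
    """Minimal AriGraph-style memory sketch: semantic triplets plus
    episodic timestamps, retrieved by a mix of degree and recency."""

    def __init__(self):
        self.edges = []                 # (head, relation, tail, timestamp)
        self.degree = defaultdict(int)  # node -> number of incident edges
        self.last_seen = {}             # node -> most recent timestamp

    def add_triplet(self, head, relation, tail, t=None):
        """Insert one LLM-extracted triplet, e.g. ("test:login", "fixed_by", "patch:42")."""
        t = time.time() if t is None else t
        self.edges.append((head, relation, tail, t))
        for node in (head, tail):
            self.degree[node] += 1
            self.last_seen[node] = max(t, self.last_seen.get(node, t))

    def retrieve(self, query_nodes, k=5, now=None, alpha=0.5):
        """Return the top-k edges touching query_nodes, scored by node
        degree (structural importance) blended with recency."""
        now = time.time() if now is None else now
        scored = []
        for h, r, tl, t in self.edges:
            if h in query_nodes or tl in query_nodes:
                recency = 1.0 / (1.0 + (now - t))
                structure = self.degree[h] + self.degree[tl]
                scored.append((alpha * structure + (1 - alpha) * recency, (h, r, tl)))
        return [edge for _, edge in sorted(scored, reverse=True)[:k]]
```

In the full agent, `add_triplet` would be fed by the LLM parser after each action, and `retrieve` would supply the subgraph context for the next task; contradiction handling (overwriting stale edges) would sit on top of this store.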
You’ll benchmark this AriGraph agent against the ReasoningBank + MaTTS baseline, as well as simple retrieval and trajectory-memory controls. Metrics include success rate, steps per task, and computational efficiency across WebArena, Mind2Web, and SWE-Bench-Verified. Causal analyses will involve interventions on the graph (e.g., pruning nodes, corrupting edges, disabling contradiction handling) and patching tests (e.g., removing specific strategy memories) to measure effects on loss, reasoning stability, and generalization. Representation analyses will inspect how retrieved subgraphs influence hidden states and whether AriGraph supports smoother reasoning transitions than textual memory.
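One of the graph interventions mentioned above—corrupting edges—can be sketched as a helper that rewires a fraction of triplet tails before re-running the benchmark; the success-rate delta then gives the causal effect. The edge format and function name are assumptions for this sketch, not part of AriGraph or ReasoningBank.

```python
import random

def corrupt_edges(edges, frac, seed=0):
    """Causal intervention sketch: rewire the tail of a random fraction of
    (head, relation, tail) edges, preserving heads and relations so that
    graph size and relation statistics stay fixed."""
    rng = random.Random(seed)
    edges = list(edges)
    tails = [tl for _, _, tl in edges]          # pool of original tails
    n_corrupt = int(len(edges) * frac)
    for i in rng.sample(range(len(edges)), n_corrupt):
        h, r, _ = edges[i]
        edges[i] = (h, r, rng.choice(tails))    # swap in a random tail
    return edges
```

The same pattern generalizes to the other interventions: node pruning drops all edges incident to sampled nodes, and disabling contradiction handling skips the update step that overwrites stale edges.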
Outcomes include: (i) a comparative map of graph vs. reasoning-memory performance by domain and model family; (ii) causal evidence for when structured world knowledge or distilled strategies drive improvement; and (iii) design insights for hybrid graph-reasoning memory, suggesting where to store world state versus reusable reasoning. The project may also extend AriGraph with temporal decay, graph compression, or multi-task schemas, offering practical guidance for scalable, structured, self-evolving agents in web and software reasoning settings.
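The temporal-decay and graph-compression extensions mentioned above could be prototyped together: weight each edge by an exponential decay of its age, then prune edges whose weight falls below a threshold. The half-life parameterization and threshold are illustrative assumptions, not a specified design.

```python
def decayed_weight(edge_time, now, half_life=3600.0):
    """Temporal decay sketch: an edge's relevance halves every
    half_life seconds (half_life is an assumed tunable)."""
    return 0.5 ** ((now - edge_time) / half_life)

def compress(edges_with_time, now, threshold=0.6, half_life=3600.0):
    """Graph compression sketch: drop edges whose decayed weight has
    fallen below threshold, keeping the memory bounded over time."""
    return [(edge, t) for edge, t in edges_with_time
            if decayed_weight(t, now, half_life) >= threshold]
```

A multi-task schema would extend this by decaying task-specific edges faster than schema-level edges shared across tasks.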