Graph-Memory Agents for Self-Evolving Web and SWE Tasks

Supervisor

Andrey Kravchenko

Suitable for

MSc in Advanced Computer Science
Computer Science, Part B
Computer Science, Part C

Abstract

Supervised by Andrey Kravchenko andrey.kravchenko@cs.ox.ac.uk

Expected background of students and CS techniques that will be applied

The student should be able to work with, or quickly learn, PyTorch, Hugging Face, and IPython notebooks. They should also have a solid mathematical background.

Do structured world models improve self-evolving agents compared to reasoning-memory banks? This project will adapt the AriGraph architecture—an evolving semantic-episodic knowledge graph for agent memory—to the ReasoningBank task suite (WebArena, Mind2Web, SWE-Bench-Verified).

The goal is to test whether graph-based memory can match or surpass ReasoningBank’s distilled “reasoning memory” for long-horizon, cross-task learning, and to explore hybrid designs that combine both.

Methodologically, you’ll implement an AriGraph-based agent for the ReasoningBank environments. AriGraph will encode web or code entities (e.g., pages, DOM elements, repositories, tests, patches) as nodes and link them via semantic and episodic relations (“contains,” “links_to,” “fails_on,” “fixed_by”). As the agent acts, it will extract triplets and events using an LLM parser, incrementally updating the graph to reflect new discoveries or corrections. For each new task, you’ll retrieve relevant subgraphs—filtered by recency and degree—to guide reasoning, replacing or augmenting ReasoningBank’s textual memory items.
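As a concrete sketch of the kind of memory structure involved, the following minimal Python illustrates semantic triplets with episodic timestamps and a retrieval step filtered by recency and degree. All names, relations, and the retrieval heuristic here are illustrative assumptions, not AriGraph's actual implementation:

```python
from collections import defaultdict

class GraphMemory:
    """Toy semantic-episodic graph memory (illustrative, not AriGraph itself)."""

    def __init__(self):
        self.edges = defaultdict(list)   # node -> [(relation, object_node, step)]
        self.last_seen = {}              # node -> most recent step it was observed

    def add_triplet(self, subj, rel, obj, step):
        # In the real agent an LLM parser would emit these triplets from
        # observations; here they are supplied directly.
        self.edges[subj].append((rel, obj, step))
        self.last_seen[subj] = max(self.last_seen.get(subj, -1), step)
        self.last_seen[obj] = max(self.last_seen.get(obj, -1), step)

    def retrieve(self, seeds, min_step=0, min_degree=1):
        # Return edges from seed nodes that are recent enough (episodic
        # filter) and whose seed is sufficiently connected (degree filter).
        subgraph = []
        for s in seeds:
            if len(self.edges[s]) < min_degree:
                continue
            for rel, obj, step in self.edges[s]:
                if step >= min_step:
                    subgraph.append((s, rel, obj))
        return subgraph

mem = GraphMemory()
mem.add_triplet("repo:foo", "contains", "tests/test_api.py", step=1)
mem.add_triplet("tests/test_api.py", "fails_on", "patch#42", step=2)
mem.add_triplet("patch#42", "fixed_by", "patch#43", step=3)
print(mem.retrieve(["tests/test_api.py"], min_step=2))
# -> [('tests/test_api.py', 'fails_on', 'patch#42')]
```

In the project, the retrieved subgraph would be serialized into the agent's prompt in place of (or alongside) ReasoningBank's textual memory items.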

You’ll benchmark this AriGraph agent against the ReasoningBank + MaTTS baseline and against simple retrieval or trajectory-memory controls. Metrics include success rate, steps per task, and computational efficiency across WebArena, Mind2Web, and SWE-Bench-Verified. Causal analyses will involve interventions on the graph (e.g., pruning nodes, corrupting edges, disabling contradiction handling) and patching tests (e.g., removing specific strategy memories) to measure effects on loss, reasoning stability, and generalization. Representation analyses will inspect how retrieved subgraphs influence hidden states and whether AriGraph supports smoother reasoning transitions than textual memory.
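The graph interventions can be sketched as simple ablation functions applied to the memory before a task run; one then compares success rates with and without the intervention. The graph layout and relation names below are illustrative assumptions:

```python
import random

def prune_nodes(graph, nodes_to_drop):
    # Remove the given nodes and any edges pointing at them.
    return {
        n: [(r, o, t) for (r, o, t) in es if o not in nodes_to_drop]
        for n, es in graph.items()
        if n not in nodes_to_drop
    }

def corrupt_edges(graph, fraction, rng):
    # Randomly relabel a fraction of relations to probe how sensitive the
    # agent is to edge semantics (vs. mere connectivity).
    relations = ["contains", "links_to", "fails_on", "fixed_by"]
    return {
        n: [(rng.choice(relations) if rng.random() < fraction else r, o, t)
            for (r, o, t) in es]
        for n, es in graph.items()
    }

graph = {
    "repo:foo": [("contains", "tests/test_api.py", 1)],
    "tests/test_api.py": [("fails_on", "patch#42", 2)],
}
pruned = prune_nodes(graph, {"patch#42"})
print(pruned["tests/test_api.py"])  # edges into the pruned node are gone
corrupted = corrupt_edges(graph, fraction=0.5, rng=random.Random(0))
```

Each intervention would be paired with a matched control run so that performance drops can be attributed to the specific piece of structured knowledge removed.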

Outcomes include: (i) a comparative map of graph vs. reasoning-memory performance by domain and model family; (ii) causal evidence for when structured world knowledge or distilled strategies drive improvement; and (iii) design insights for hybrid graph-reasoning memory, suggesting where to store world state versus reusable reasoning. The project may also extend AriGraph with temporal decay, graph compression, or multi-task schemas, offering practical guidance for scalable, structured, self-evolving agents in web and software reasoning settings.