Transformer Models for Logic-Guided Reinforcement Learning
Supervisors
Suitable for
Abstract
Prerequisites: CAFV, basic AI/ML courses
Background
Reinforcement Learning (RL) is a framework in which agents learn decision-making strategies through interaction with an environment. While modern RL algorithms have achieved impressive results, they often struggle with tasks that require structured reasoning, long-term temporal dependencies, or guarantees about correctness and safety.
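The core interaction loop is small; the sketch below uses the Gymnasium API, with CartPole-v1 and a random action choice standing in as placeholders for an environment and a learned policy.

import gymnasium as gym

# Agent-environment loop: observe a state, pick an action, receive a reward
# and the next state. CartPole-v1 and the random policy are illustrative only.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
episode_return = 0.0
for _ in range(200):
    action = env.action_space.sample()                  # a learned policy would use obs here
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:
        break
env.close()
print(f"episode return: {episode_return}")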
Logic-guided RL addresses these limitations by incorporating formal specifications—such as Linear Temporal Logic (LTL)—into the learning process. This allows tasks to be described in a precise, high-level way and provides a mechanism for ensuring that the learned behaviours satisfy complex temporal requirements. Approaches like Logically Constrained Reinforcement Learning (LCRL) demonstrate how logical structure can guide reward design and support more reliable policy synthesis.
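As a minimal illustration of how a temporal-logic specification can induce a reward, the sketch below hand-codes a small deterministic automaton for the reach-avoid formula !hazard U goal (reach the goal without first entering a hazard) and pays a reward the first time the automaton accepts. LCRL-style approaches construct such automata (for example, limit-deterministic Büchi automata) automatically from the LTL formula; the class, the labelling convention, and the specific formula here are illustrative assumptions.

class SpecDFA:
    """Monitor automaton for the illustrative formula '!hazard U goal'."""
    ACCEPT, SINK = 1, 2            # 1: specification satisfied, 2: violated

    def __init__(self):
        self.state = 0             # 0: goal not yet reached, no hazard seen

    def step(self, labels):
        """Advance on the set of atomic propositions true in the current MDP state."""
        if self.state == 0:
            if "hazard" in labels:
                self.state = self.SINK
            elif "goal" in labels:
                self.state = self.ACCEPT
        return self.state          # ACCEPT and SINK are absorbing


def logic_reward(dfa, labels):
    """Specification-derived reward: +1 the first time the automaton accepts, else 0."""
    before = dfa.state
    after = dfa.step(labels)
    return 1.0 if before != dfa.ACCEPT and after == dfa.ACCEPT else 0.0


# Labels would come from a labelling function L(s) mapping environment states
# to atomic propositions; here they are supplied by hand.
dfa = SpecDFA()
print(logic_reward(dfa, set()))        # 0.0: nothing relevant observed yet
print(logic_reward(dfa, {"goal"}))     # 1.0: the specification has just been satisfied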
In parallel, recent advances in machine learning have explored sequence-modelling methods, particularly Transformer architectures. These models, which underpin modern large language models, excel at capturing long-range dependencies in sequential data. When applied to reinforcement learning, they enable decision-making to be formulated as a sequence-prediction problem, allowing policies to be learned through supervised modelling of offline trajectories rather than through traditional value-based or policy-gradient methods.
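A minimal sketch of such a sequence model is shown below. It follows the Decision Transformer idea of interleaving return-to-go, state, and action tokens and predicting the next action with a causally masked Transformer; PyTorch, a discrete action space, and the hyperparameter values are assumptions made only for illustration.

import torch
import torch.nn as nn

class TrajectoryTransformer(nn.Module):
    """Decision-Transformer-style model over interleaved (return-to-go, state, action) tokens."""

    def __init__(self, state_dim, n_actions, d_model=64, n_layers=2, n_heads=4, max_len=64):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Embedding(n_actions, d_model)
        self.embed_pos = nn.Embedding(3 * max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T) integer ids
        B, T, _ = states.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_action(actions)],
            dim=2,
        ).reshape(B, 3 * T, -1)                          # interleave r_t, s_t, a_t per step
        tokens = tokens + self.embed_pos(torch.arange(3 * T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(3 * T).to(tokens.device)
        h = self.encoder(tokens, mask=mask)              # causal self-attention
        return self.action_head(h[:, 1::3])              # action logits read off the state tokens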
Focus
The project explores whether logic-guided reinforcement learning tasks can be approached effectively using sequence-modelling techniques. The key idea is to train a Transformer model on offline trajectories whose reward signal is derived from temporal-logic specifications, rather than learning a policy through direct online interaction. Action selection is then framed as an autoregressive sequence-prediction problem based on past states, actions, and returns.
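At evaluation time this becomes a simple autoregressive loop: condition on a desired (specification-derived) return, predict an action from the trajectory so far, execute it, and append the outcome to the context. The sketch below assumes the TrajectoryTransformer sketched earlier, an environment following the Gymnasium step/reset interface with a horizon no longer than the model's context, and a hypothetical spec_reward function returning the logic-derived reward for a newly reached state.

import torch

@torch.no_grad()
def rollout(model, env, spec_reward, target_return, horizon):
    """Autoregressive action selection conditioned on a target return-to-go."""
    obs, _ = env.reset()
    states = [torch.as_tensor(obs, dtype=torch.float32)]
    actions, rtg = [0], [float(target_return)]             # placeholder action token for step 0
    for _ in range(horizon):
        logits = model(
            torch.tensor(rtg).view(1, -1, 1),
            torch.stack(states).unsqueeze(0),
            torch.tensor(actions).unsqueeze(0),
        )
        action = int(logits[0, -1].argmax())                # greedy choice at the latest state token
        obs, _, terminated, truncated, _ = env.step(action)
        reward = spec_reward(obs)                           # reward comes from the specification
        actions[-1] = action                                # replace the placeholder
        states.append(torch.as_tensor(obs, dtype=torch.float32))
        actions.append(0)                                   # placeholder for the next prediction
        rtg.append(rtg[-1] - reward)                        # return-to-go shrinks as reward arrives
        if terminated or truncated:
            break
    return states, actions, rtg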
Core research questions include:
Can Transformer models learn behaviours that respect temporal-logic constraints?
How well do sequence-modelling methods capture long-horizon, structured decision dependencies?
What are the benefits and limitations of using offline, data-driven models for logic-constrained tasks?
The expected outcome is a deeper understanding of how Transformer-based models behave on logically structured tasks and whether this approach offers advantages over traditional RL.
Method
The project will:
Use environments with temporal-logic task specifications.
Generate or use offline datasets of trajectories annotated with rewards derived from these specifications (a data-collection sketch follows this list).
Train a Transformer-based sequence model to predict actions conditioned on past trajectories and desired outcomes.
Compare performance to conventional RL baselines on logic-guided tasks.
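For the dataset step, trajectories can be collected by rolling out any behaviour policy and relabelling each transition with the specification-derived reward, as sketched below; the Gymnasium-style environment and the spec_reward and behaviour_policy arguments are assumptions rather than fixed choices.

def collect_dataset(env, spec_reward, behaviour_policy, n_episodes, horizon):
    """Offline trajectories annotated with logic-derived rewards and returns-to-go."""
    dataset = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        states, actions, rewards = [], [], []
        for _ in range(horizon):
            action = behaviour_policy(obs)
            next_obs, _, terminated, truncated, _ = env.step(action)
            states.append(obs)
            actions.append(action)
            rewards.append(spec_reward(next_obs))    # the native environment reward is ignored
            obs = next_obs
            if terminated or truncated:
                break
        # Return-to-go at step t is the sum of specification rewards from t onwards.
        rtg, running = [], 0.0
        for r in reversed(rewards):
            running += r
            rtg.append(running)
        rtg.reverse()
        dataset.append({"states": states, "actions": actions,
                        "rewards": rewards, "returns_to_go": rtg})
    return dataset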
Essential goals:
Implement a sequence-modelling RL pipeline using Transformer architectures.
Evaluate whether the learned behaviours satisfy the logical specifications.
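For the second essential goal, satisfaction can be measured by running an automaton monitor over the state traces produced by the learned model and reporting the fraction of accepted episodes. The sketch below reuses the illustrative SpecDFA monitor from the Background section and assumes a labelling function label_fn mapping states to atomic propositions.

def satisfaction_rate(traces, label_fn):
    """Fraction of evaluation traces accepted by the specification automaton."""
    satisfied = 0
    for trace in traces:
        dfa = SpecDFA()
        for state in trace:
            dfa.step(label_fn(state))
        satisfied += int(dfa.state == dfa.ACCEPT)
    return satisfied / len(traces)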
Stretch goals:
Explore reward-shaping techniques to mitigate sparsity (a shaping sketch follows this list).
Investigate alternative conditioning strategies (e.g., goal-based conditioning).
Test generalisation across multiple specification classes or task domains.
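One standard option for the reward-shaping stretch goal is potential-based shaping over automaton states, which densifies the sparse specification reward without changing which policies are optimal (Ng et al., 1999). The sketch below uses the illustrative SpecDFA states from the Background section; the potential values are arbitrary examples.

def shaped_reward(base_reward, dfa_state_before, dfa_state_after, gamma=0.99):
    """Add gamma * Phi(q') - Phi(q), with Phi ranking automaton states by progress."""
    potential = {0: 0.0, SpecDFA.ACCEPT: 1.0, SpecDFA.SINK: -1.0}   # illustrative values
    return base_reward + gamma * potential[dfa_state_after] - potential[dfa_state_before]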
Further reading:
Decision Transformer: Reinforcement Learning via Sequence Modeling (Chen et al., 2021)
Logically-Constrained Reinforcement Learning (Hasanbeig et al., 2018)
Reinforcement Learning Under Temporal Logic Constraints as a Sequence Modeling Problem (Tian et al., 2023)