Understanding and Evaluating Reasoning in Large Language Models and AI Agents
Supervisors
Suitable for
Abstract
Prerequisites: Background in machine learning; familiarity with PyTorch and exposure to Large Language Models. Experience with LLM APIs, agent frameworks, or training pipelines is a plus.
Background
● LLMs and agent-based systems achieve strong performance on tasks labeled as “reasoning,” but it remains unclear what these benchmarks truly measure and what factors drive observed improvements. Performance gains may reflect reasoning ability, scale, data, system design, or evaluation artifacts. Clarifying these issues is essential for reliable AI evaluation and interpretation.
Focus
● This project examines both the validity of reasoning benchmarks and the drivers of reasoning performance in LLMs and AI agents. Key questions include what current evaluations actually measure and which factors contribute most to reasoning success.
Method
The project will build on existing reasoning benchmarks and evaluation frameworks used in recent AI research (e.g. FrontierScience, ARC-AGI, ARC (AI2), GSM-Symbolic [1], LogicBench [2], GenBench). Students will survey relevant literature on LLM and agentic reasoning, and perform empirical analyses on selected benchmarks using modern LLMs and agentic systems; a minimal evaluation-harness sketch follows the references below.
[1] Mirzadeh, Seyed Iman, et al. "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models." The Thirteenth International Conference on Learning Representations. 2025.
[2] Parmar, Mihir, et al. "LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
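To make the empirical component concrete, the following is a minimal sketch of a benchmark evaluation harness in Python. It is illustrative only: the JSONL schema ('question'/'answer' fields), the query_model() wrapper, and the answer-extraction heuristic are all assumptions to be replaced by each benchmark's actual format and the project's chosen LLM API.

```python
import json
import re

def query_model(prompt: str) -> str:
    # Placeholder (assumption): wrap whichever LLM API or agent framework
    # the project settles on and return the raw text of the model's reply.
    raise NotImplementedError

def extract_final_answer(text: str) -> str:
    # Naive heuristic: take the last number in the reply. Real harnesses
    # need benchmark-specific parsing; such parsing choices are themselves
    # a potential evaluation artifact worth studying.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else ""

def evaluate(path: str) -> float:
    # Assumes a JSONL file with 'question' and 'answer' fields per line
    # (a hypothetical format; adapt to each benchmark's actual schema).
    correct = total = 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            reply = query_model(item["question"])
            correct += extract_final_answer(reply) == str(item["answer"])
            total += 1
    return correct / max(total, 1)
```

Running the same evaluate() loop across several models and system configurations yields the comparative accuracy tables at the core of the empirical analysis.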
Goals
● Essential: Review and analyze existing reasoning benchmarks for LLMs and agents; identify patterns, assumptions, and limitations.
● Essential: Empirically evaluate model behavior across models, tasks, and system configurations.
● Stretch: Propose or prototype alternative evaluation perspectives or diagnostic analyses for reasoning (see the perturbation sketch below).
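One diagnostic direction, in the spirit of GSM-Symbolic [1], is to perturb the surface form of benchmark items (names, numbers) while holding the underlying reasoning fixed, and to measure how much accuracy drops. The sketch below is a hypothetical toy example; the template, names, and number ranges are invented for illustration.

```python
import random

# GSM-Symbolic-style perturbation (toy example): instantiate a templated
# word problem with fresh names and numbers so that a model cannot rely
# on memorized surface forms of the original benchmark item.
TEMPLATE = ("{name} has {a} apples and buys {b} more. "
            "How many apples does {name} have now?")
NAMES = ["Ava", "Noah", "Mia", "Liam"]

def make_variant(rng: random.Random) -> tuple[str, int]:
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b)
    return question, a + b  # question text and gold answer

rng = random.Random(0)
for _ in range(3):
    question, gold = make_variant(rng)
    print(question, "->", gold)
```

A gap between accuracy on original items and accuracy averaged over such variants is one signal that a score reflects memorization or evaluation artifacts rather than reasoning.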