
Understanding and Evaluating Reasoning in Large Language Models and AI Agents

Supervisors

Suitable for

MSc in Advanced Computer Science

Abstract

Prerequisites: Background in machine learning; familiarity with PyTorch and
exposure to Large Language Models. Experience with LLM APIs, agent
frameworks, or training pipelines is a plus.

Background
● LLMs and agent-based systems achieve strong performance on tasks labeled as “reasoning,” but
it remains unclear what these benchmarks truly measure and what factors drive observed
improvements. Performance gains may reflect reasoning ability, scale, data, system design, or
evaluation artifacts. Clarifying these issues is essential for reliable AI evaluation and
interpretation.

Focus
● This project examines both the validity of reasoning benchmarks and the drivers of reasoning
performance in LLMs and AI agents. Key questions include what current evaluations actually
measure and which factors most contribute to reasoning success.

Method
The project will build on existing reasoning benchmarks and evaluation frameworks used in recent AI
research (e.g., FrontierScience, ARC-AGI, ARC (AI2), GSM-Symbolic, LogicBench, GenBench). Students
will survey relevant literature on LLM and agentic reasoning, and perform empirical analyses on selected
benchmarks using modern LLMs and agentic systems.
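To make the empirical component concrete, the minimal Python sketch below shows one way such an evaluation loop could look. It assumes the OpenAI Python client and an OPENAI_API_KEY environment variable; the model name, the two prompting configurations, and the hand-written GSM-style items are illustrative placeholders rather than benchmark data. A real study would instead load an actual benchmark such as GSM-Symbolic [1] or LogicBench [2].

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Placeholder items; a real study would load benchmark data (e.g. GSM-Symbolic) instead.
    ITEMS = [
        {"question": "A box holds 12 pens. How many pens are in 4 boxes?", "answer": "48"},
        {"question": "Tom had 15 apples and gave away 6. How many apples remain?", "answer": "9"},
    ]

    # Illustrative configurations; model names and prompting styles are assumptions.
    CONFIGS = [
        {"model": "gpt-4o-mini", "system": "Answer with only the final number."},
        {"model": "gpt-4o-mini", "system": "Think step by step, then give the final number on the last line."},
    ]

    def query(model: str, system: str, question: str) -> str:
        """Send one item to the model and return the raw text response."""
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": question},
            ],
            temperature=0,
        )
        return resp.choices[0].message.content or ""

    def is_correct(response: str, answer: str) -> bool:
        """Crude check: does the expected answer appear in the last line of the response?"""
        lines = response.strip().splitlines()
        return bool(lines) and answer in lines[-1]

    for cfg in CONFIGS:
        correct = sum(
            is_correct(query(cfg["model"], cfg["system"], item["question"]), item["answer"])
            for item in ITEMS
        )
        print(f"{cfg['model']} | {cfg['system'][:40]!r}: {correct}/{len(ITEMS)} correct")

Comparing accuracy across such configurations (different models, prompting styles, or agent scaffolding) is one way to separate the contribution of the underlying model from that of the surrounding system.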

[1] Mirzadeh, Seyed Iman, et al. "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in
Large Language Models." The Thirteenth International Conference on Learning Representations. 2025.

[2] Parmar, Mihir, et al. "LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large
Language Models." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers). 2024.

Goals
● Essential: Review and analyze existing reasoning benchmarks for LLMs and agents; identify
patterns, assumptions, and limitations.
● Essential: Empirically evaluate model behavior across models, tasks, and system
configurations.
● Stretch: Propose or prototype alternative evaluation perspectives or diagnostic analyses for
reasoning.
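One possible diagnostic analysis in the spirit of GSM-Symbolic [1] is to regenerate an item from a template while varying names and numbers, then compare accuracy on the original and perturbed wordings. The Python sketch below illustrates only the variant-generation step; the template and sampling ranges are invented for illustration and are not taken from the benchmark.

    import random

    # Illustrative template; GSM-Symbolic builds its templates from GSM8K-style problems.
    TEMPLATE = (
        "{name} has {a} marbles and buys {b} more bags with {c} marbles each. "
        "How many marbles does {name} have now?"
    )

    def make_variant(rng: random.Random) -> tuple[str, int]:
        """Sample one surface variant of the item and compute its ground-truth answer."""
        name = rng.choice(["Ava", "Noah", "Mia", "Liam"])
        a, b, c = rng.randint(2, 20), rng.randint(2, 9), rng.randint(2, 12)
        question = TEMPLATE.format(name=name, a=a, b=b, c=c)
        return question, a + b * c

    rng = random.Random(0)
    for question, answer in (make_variant(rng) for _ in range(3)):
        print(answer, "|", question)

A drop in accuracy on such variants, relative to the original wording, would suggest sensitivity to surface form rather than robust reasoning, which is exactly the kind of evaluation artifact this project aims to characterize.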