Human-centred benchmarking of Large Language Models

Supervisor

Suitable for

Abstract

Large language models (LLMs) are increasingly positioned as personal advisors or companions for many of us. However, their reliability is unclear. For example, their suitability for supporting children’s critical thinking and curiosity, or for supporting parents gaining advice on children’s emotional regulations, sleeping patterns, or screen time management. Evaluating LLMs for such use cases is critical, yet existing benchmarks rarely address multi-dimensional qualities beyond factual correctness. This project aims to create a systematic evaluation framework to assess LLM responses for human-centred scenarios across five key dimensions: accuracy, safety, actionability, empathy/tone, and clarity.

Objectives

Design a synthetic dataset of realistic LLM in use scenarios (e.g., helping children navigate online safety and friendship, or helping parents discussing data privacy with children).
Define evaluation rubrics for accuracy, safety, actionability, empathy/tone, and clarity, drawing on child-centred research and digital literacy guidelines.
Benchmark multiple LLMs (e.g., GPT, Claude, LLaMA) on these scenarios.
Produce a reproducible evaluation pipeline that others can extend to new models and domains.

Methodology

Scenario Development: Generate a corpus of LLM prompt dialogues using expert-informed templates.
Rubric Design: Operationalize dimensions into computational checks (e.g., factual consistency for accuracy, harmful content detection for safety, presence of concrete steps for actionability, sentiment analysis for empathy, readability metrics for clarity).
Model Testing: Collect responses from multiple LLMs to each scenario.
Evaluation: Score outputs using a mix of automated methods and rubric-based annotation.

Expected Contributions

A synthetic benchmark for evaluating LLMs in the real applications
A multi-dimensional evaluation framework extending beyond accuracy to include social and communicative qualities.
Comparative insights into strengths and weaknesses of LLMs in providing human-centred support
Resources for researchers and policymakers working on responsible AI in family and education contexts.

References

Human-centred benchmarking of Large Language Models

Supervisor

Suitable for

Abstract

Student Space