Towards Trustworthy AI Agents for Information Veracity
In an era where disinformation is already recognised as a top global risk, the rapid advancement of AI capabilities threatens to magnify the danger. AI systems increasingly surpass humans in persuasiveness across a range of settings, making them powerful tools for manipulation. This raises serious concerns about how to protect society from large-scale harmful manipulation by AI agents. This project aims to address that challenge by leveraging AI itself to create a protective shield against harmful manipulation.
There is promising evidence that AI can be used positively to combat disinformation. Recent studies have shown that AI can persuade people out of entrenched conspiracy beliefs and can outperform human fact-checkers in accuracy and helpfulness; in debates between AI agents, more persuasive agents have been found to help non-expert judges identify the truth. The challenge, however, lies in determining which AI models and responses are trustworthy. There is a substantial risk of creating models that are biased or that spread misinformation, and this risk grows with larger, more capable models.
Our solution focuses on creating a virtuous circle of credibility assessment and veracity evaluation. By grounding AI responses in rigorously vetted evidence and continuously improving credibility assessment, we aim to build trustworthy AI Stewards for navigating the information space. These Stewards will help individuals assess the veracity of information and avoid manipulative and fabricated content.
Our approach combines Large Language Models (LLMs) using Retrieval-Augmented Generation (RAG) with temporal graphs of web domains, allowing dynamic credibility assessments that keep pace with the evolving internet landscape. The system will weight retrieved information by the credibility of its source, scrape hyperlinks to extend the graph, and include a module that identifies whether sources support or oppose a given claim. Together, these components help the LLM weigh the full array of evidence and generate signals about the accuracy of individual sources, as sketched below.
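One way the credibility weighting and the supporting-versus-opposing module could fit together is sketched below. This is a minimal illustration, not the project's settled design: the Evidence structure, the weigh_evidence aggregation rule, and the prompt format are hypothetical names we introduce here, and in practice the stance label would be produced by a stance-classification or NLI model rather than supplied by hand.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    domain: str          # source web domain
    snippet: str         # retrieved text passage
    credibility: float   # credibility score in [0, 1] from the temporal graph
    stance: str          # "supports" or "opposes" the claim under evaluation

def weigh_evidence(evidence: list[Evidence]) -> float:
    """Aggregate stance-labelled, credibility-weighted evidence into one
    veracity signal in [-1, 1] (negative = claim likely false)."""
    total = sum(e.credibility for e in evidence)
    if total == 0:
        return 0.0
    signed = sum(
        e.credibility * (1.0 if e.stance == "supports" else -1.0)
        for e in evidence
    )
    return signed / total

def build_prompt(claim: str, evidence: list[Evidence]) -> str:
    """Format credibility-weighted evidence for the generator LLM so it can
    condition on source trustworthiness, not just retrieval rank."""
    lines = [f"Claim: {claim}", "Evidence (with source credibility):"]
    for e in sorted(evidence, key=lambda e: e.credibility, reverse=True):
        lines.append(f"- [{e.domain}, credibility={e.credibility:.2f}, "
                     f"{e.stance}] {e.snippet}")
    lines.append(f"Aggregate signal: {weigh_evidence(evidence):+.2f}")
    return "\n".join(lines)

if __name__ == "__main__":
    ev = [
        Evidence("reuters.com", "Officials confirmed the figure.", 0.95, "supports"),
        Evidence("example-blog.net", "The figure is a hoax.", 0.20, "opposes"),
    ]
    print(build_prompt("The reported figure is accurate.", ev))
```

The key design point the sketch captures is that a low-credibility source contributes little to the aggregate signal even when its stance is emphatic, so retrieval rank alone cannot dominate the veracity judgement.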
A temporal graph will be built in which nodes represent web domains and edges represent hyperlinks. Each node carries a credibility attribute that is updated by combining evidence-accuracy signals with graph topology. The system will be seeded with aggregated ratings of popular news domains from multiple nonpartisan assessment organisations and will operate inductively, adding new nodes as they appear in open web searches; a sketch of this update rule follows.
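The following is a minimal sketch of the seeding, inductive growth, and credibility update steps, using the networkx library. The neutral prior of 0.5 for newly discovered domains, the blending weight alpha between direct evidence signals and topology, and the choice to hold seeded ratings fixed are all illustrative assumptions rather than committed design decisions.

```python
import networkx as nx

def seed_graph(seed_ratings: dict[str, float]) -> nx.DiGraph:
    """Create the domain graph, seeded with aggregated ratings (in [0, 1])
    from nonpartisan assessment organisations."""
    g = nx.DiGraph()
    for domain, rating in seed_ratings.items():
        g.add_node(domain, credibility=rating, seeded=True)
    return g

def add_hyperlink(g: nx.DiGraph, src: str, dst: str, timestamp: float) -> None:
    """Inductively add domains discovered in open web searches; unseen
    domains start at an assumed neutral prior of 0.5."""
    for d in (src, dst):
        if d not in g:
            g.add_node(d, credibility=0.5, seeded=False)
    g.add_edge(src, dst, t=timestamp)

def update_credibility(g: nx.DiGraph, accuracy: dict[str, float],
                       alpha: float = 0.6) -> None:
    """One synchronous update step: blend direct evidence-accuracy signals
    with a topology signal (mean credibility of domains linking to a node).
    Seeded nodes keep their vetted ratings fixed."""
    new = {}
    for node, attrs in g.nodes(data=True):
        if attrs["seeded"]:
            continue
        inlinks = [g.nodes[u]["credibility"] for u, _ in g.in_edges(node)]
        topo = sum(inlinks) / len(inlinks) if inlinks else attrs["credibility"]
        direct = accuracy.get(node, attrs["credibility"])
        new[node] = alpha * direct + (1 - alpha) * topo
    for node, c in new.items():
        g.nodes[node]["credibility"] = c
```

Repeated calls to update_credibility propagate the vetted seed ratings outward along hyperlinks, while direct evidence-accuracy measurements pull individual domains up or down as the system fact-checks their content over time.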
This project will engage stakeholders from academia, industry, non-profit organisations, and government to produce immediately actionable progress on one of the most critical threats society faces today. By creating a scalable, continuously improving system for evaluating and leveraging credibility, we aim to build a robust defence against AI-driven disinformation.