Our methodology

Forum AI convenes leading experts to evaluate AI on the topics that matter.

Experts identify the test cases that matter most: the prompts where issues are likeliest to surface. Our judges are then calibrated to high agreement with expert consensus before any model is scored.
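One way to picture the calibration check is as an agreement rate between a judge's verdicts and the expert majority on each test case. The sketch below is illustrative only: the `consensus` and `agreement_rate` helpers, the "pass"/"fail" labels, and the strict-majority rule are assumptions made for the example, not Forum AI's published procedure.

```python
from collections import Counter
from typing import Optional


def consensus(labels: list[str]) -> Optional[str]:
    """Return the strict-majority label among experts, or None if there is none."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def agreement_rate(judge: list[str], expert_panels: list[list[str]]) -> float:
    """Share of cases with an expert consensus where the judge matches it."""
    matched = 0
    scored = 0
    for verdict, panel in zip(judge, expert_panels):
        target = consensus(panel)
        if target is None:
            continue  # assumption: cases without a clear expert majority are skipped
        scored += 1
        if verdict == target:
            matched += 1
    return matched / scored if scored else 0.0


# Hypothetical data: four test cases, each labeled by a small expert panel.
panels = [
    ["pass", "pass", "fail"],
    ["fail", "fail", "fail"],
    ["pass", "fail"],          # tie, so no consensus
    ["pass", "pass", "pass"],
]
judge_verdicts = ["pass", "pass", "fail", "pass"]
rate = agreement_rate(judge_verdicts, panels)
print(f"{rate:.2f}")
```

In this toy data the judge matches the expert majority on two of the three cases that have one. A real calibration would likely use many more cases and possibly a chance-corrected statistic.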

Subscribe to our newsletter to stay on top of the latest from Forum AI.

Subscribe

v1.0 · Updated May 2026

Neutrality Leaderboard

Do AI systems present all sides of the story?

Political and social debates rarely have a single correct answer, yet AI systems are increasingly asked to navigate them. We evaluate whether models present relevant perspectives fairly, without favoring one side, using loaded language, or embedding assumptions in the framing.

Overall Neutrality score

Ideological lean

When models fail Neutrality, they often lean left or right politically. We assess whether those non-neutral responses use language, framing, or conclusions that align with U.S. left-leaning views, U.S. right-leaning views, or other ideological perspectives.

Major findings

v1.0 · Updated May 2026

Source Quality Leaderboard

Are AI systems using reliable sources?

An AI model's answer is only as credible as the sources it draws from. We evaluate whether models rely on high-quality information such as primary sources, peer-reviewed research, and reputable journalism. We also flag paid content and government-controlled media.

Average Source Quality score

Source tier breakdown

Major findings

v1.0 · Updated May 2026

Accuracy Leaderboard

Are AI systems covering the news accurately?

Factual errors in news contexts can mislead voters, spread misinformation, and undermine trust. We evaluate how accurately models represent verifiable claims, whether they hallucinate information, and how well they distinguish established facts from contested assertions.

Overall Accuracy score

Claim accuracy breakdown

Responses with at least one false claim

Share of model responses that contained one or more false claims — the breadth of factual errors across answers.

False-claim rate

Share of individual claims (across all responses) that were false — the density of factual errors within answers.
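The two measures above differ only in their denominator: one counts responses, the other counts individual claims. A minimal sketch of the arithmetic follows, assuming each response has already been reduced to a list of per-claim verdicts. The `Response` class and both function names are hypothetical, and the sample data is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One model response as per-claim verdicts (True means the claim was judged false)."""
    claim_is_false: list[bool]


def responses_with_false_claim(responses: list[Response]) -> float:
    """Share of responses containing one or more false claims (breadth)."""
    if not responses:
        return 0.0
    flagged = sum(1 for r in responses if any(r.claim_is_false))
    return flagged / len(responses)


def false_claim_rate(responses: list[Response]) -> float:
    """Share of all individual claims that were false (density)."""
    total = sum(len(r.claim_is_false) for r in responses)
    if total == 0:
        return 0.0
    false_count = sum(sum(r.claim_is_false) for r in responses)
    return false_count / total


# Invented sample: 4 responses containing 12 claims in total, 3 of them false.
sample = [
    Response([False, False, True, True]),
    Response([False, False, False, False]),
    Response([False, False]),
    Response([True, False]),
]
print(responses_with_false_claim(sample))  # 0.5
print(false_claim_rate(sample))            # 0.25
```

Here half of the responses contain an error, yet only a quarter of all claims are false, which is why the two numbers are reported separately.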

Major findings

v1.0 · Updated May 2026

Judge Health
This page is for internal use only and is not visible to the public.
Full Benchmark
Neutrality
Factuality
Source Quality
Lean Detection Rate