Google's New Benchmark Reveals Wide Gaps in AI Factual Accuracy — and Shows Search Tools Help
The best large language models get facts wrong about one-third of the time
Welcome to AI Papers Explained, an experiment in using AI to help translate the latest AI research into plain language for journalists and technologists (we're getting meta). We're scanning for papers on arXiv, an open-access repository where researchers share preprints — papers that haven't yet gone through formal peer review. These summaries are AI-generated and lightly edited, and may contain errors or omissions.
Paper: The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
Authors: Aileen Cheng, Alon Jacovi and 63 co-authors (Google DeepMind & Google Research)
Published: December 11, 2025
One in three. That's roughly how often even the best AI models get facts wrong, according to a new benchmark — and the error patterns vary dramatically depending on what you're asking.
Previous benchmarks like TruthfulQA and FActScore have measured aspects of AI factuality, but they've tended to focus on narrow slices of the problem. This week Google released the FACTS Leaderboard, a more comprehensive benchmark that evaluates AI models across four distinct dimensions of factual accuracy. The accompanying paper offers insights that should inform how newsrooms think about deploying these tools — though it also comes with some important caveats about who's doing the measuring.
What FACTS Actually Measures
The benchmark breaks factuality into four components, each testing a different capability:
Grounding: Can the model accurately summarize or respond to questions about a document without making things up? This is the "don't hallucinate from your source material" test — essential for any AI-assisted reporting or research workflow.
Parametric Knowledge: Does the model actually "know" facts from its training, or will it confidently fabricate answers? These are closed-book factoid questions with no search allowed.
Search: When given access to web search tools, can the model effectively find and synthesize accurate information? This tests the increasingly common "AI + search" workflow.
Multimodal: Can the model accurately describe what's in an image without inventing details that aren't there?
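To make that structure concrete, here's a minimal sketch of how per-dimension scores might be tallied. The data layout, field names and scoring stub are our own illustrative assumptions; the actual benchmark uses its own datasets and automated judging pipelines.

```python
from dataclasses import dataclass

# The four FACTS dimensions described above. Everything else here (the data
# layout, field names and scoring) is an illustrative assumption, not the
# benchmark's actual code.
DIMENSIONS = ("grounding", "parametric", "search", "multimodal")

@dataclass
class JudgedResponse:
    dimension: str    # which of the four tasks this example belongs to
    prompt: str       # document question, factoid, search query, or image question
    response: str     # the model's answer
    is_factual: bool  # verdict from a judge (automated or human)

def per_dimension_accuracy(results: list[JudgedResponse]) -> dict[str, float]:
    """Fraction of responses judged factual, broken out by dimension."""
    accuracy = {}
    for dim in DIMENSIONS:
        items = [r for r in results if r.dimension == dim]
        if items:
            accuracy[dim] = sum(r.is_factual for r in items) / len(items)
    return accuracy
```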
The Numbers That Matter
The top-performing model (Gemini 3 Pro) achieves just 68.8% overall accuracy. That means even the best available AI gets facts wrong roughly one-third of the time across these tasks.
But the more interesting findings are in the details:
| Model | Grounding | Parametric | Search | Multimodal |
|---|---|---|---|---|
| Gemini 3 Pro | 69.0% | 76.4% | 83.8% | 46.1% |
| GPT-5 | 69.6% | 55.8% | 77.7% | 44.1% |
| Claude 4.5 Opus | 62.1% | 30.6% | 73.2% | 39.2% |
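The headline number appears to be a straight average of the four columns. Here's a quick check using Gemini 3 Pro's row; the unweighted mean is our assumption about how the overall score is formed, and the paper may aggregate differently.

```python
# Gemini 3 Pro's four dimension scores from the table above.
scores = [69.0, 76.4, 83.8, 46.1]

# Unweighted mean, assuming that's how the overall score is formed.
overall = sum(scores) / len(scores)
print(round(overall, 1))  # 68.8, matching the reported overall accuracy
```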
Search-augmented tasks work best. All models perform significantly better when they can search the web (60-84% accuracy) versus relying on internal knowledge alone. This suggests newsroom workflows should probably default to search-enabled AI rather than expecting models to know things from memory.
Multimodal accuracy is concerning. The best models hit only about 46% accuracy on image-based factuality tasks. If you're considering AI for visual journalism applications, human oversight isn't optional; it's essential.
Grounding varies widely. The ability to stay faithful to source documents ranges from roughly 62% to 74% across the models on the leaderboard. For document summarization or extraction tasks, this variance matters.
The Hedging Tradeoff
One of the most interesting findings involves how models handle uncertainty. Claude models hedge significantly more than others — Claude 4 Sonnet declines to answer 45% of parametric knowledge questions rather than risk being wrong.
Is that good or bad? It depends on your use case.
For journalism applications where being wrong carries real consequences, a model that says "I don't know" might be preferable to one that confidently fabricates. GPT-5 achieved 64.3% accuracy on the questions it attempted versus 55.7% overall, because it declined to answer roughly 13% of questions rather than guess.
The researchers capture this tradeoff with an "attempted accuracy" metric alongside raw accuracy. Models that guess more often may score higher on raw accuracy but lower on attempted accuracy.
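A quick back-of-the-envelope sketch, using GPT-5's parametric figures quoted above, shows how the two metrics relate. The formula is our reading of "attempted accuracy" (correct answers divided by questions actually attempted), not the paper's exact definition.

```python
# GPT-5 parametric-knowledge figures cited above (approximate).
raw_accuracy = 0.557  # correct / all questions, with declined answers counted as wrong
hedge_rate = 0.13     # share of questions the model declined to answer

# Attempted accuracy: correct answers divided by questions actually attempted.
# This is our reading of the metric, not the paper's exact definition.
attempted_accuracy = raw_accuracy / (1 - hedge_rate)
print(round(attempted_accuracy, 3))  # 0.64, in line with the 64.3% cited above
```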
Implications for Newsrooms
A few takeaways for journalists and technologists working with AI:
Match the model to the task. Gemini models tend toward higher coverage (more comprehensive answers) while GPT models prioritize precision (fewer errors but sometimes less complete). Claude models are more cautious overall. There's no single "best" model — it depends on whether you'd rather have an AI that occasionally misses things or one that occasionally makes things up.
Search-enabled workflows are more reliable. The substantial performance gap between parametric and search tasks suggests that AI tools with web access are meaningfully more trustworthy than those relying on training data alone.
Build in verification for visual content. Sub-50% accuracy on multimodal tasks means AI-generated image descriptions need human review as a default, not an exception.
Consider the confidence question. A model that hedges more might actually be more useful in journalism contexts than one that always produces an answer. The benchmark's "hedging rate" metric is worth paying attention to.
What's Still Missing
The benchmark doesn't cover everything. It doesn't test video understanding, rapidly changing information or more complex reasoning tasks. It also uses Wikipedia as the ground truth for parametric knowledge, which has its own limitations.
And crucially, these are synthetic benchmark conditions. Real-world newsroom tasks involve messier inputs, time pressure and domain-specific knowledge that general benchmarks can't fully capture.
Also, Google created this benchmark ... and Google's model tops the leaderboard.
That's worth scrutinizing. The AI industry has a history of companies designing benchmarks that favor their own models, or optimizing specifically for benchmark performance in ways that don't translate to real-world use. When the referee is also a competitor, healthy skepticism is warranted.
To their credit, the researchers did several things to mitigate this: the benchmark uses external data sources (Wikipedia, web search via Brave), includes a private evaluation set to prevent overfitting and will accept third-party model submissions. The methodology is documented in detail and the leaderboard is public.
But the fundamental tension remains. Independent replication and evaluation by researchers without ties to any AI company would strengthen confidence in these results. For now, treat the specific rankings with appropriate caution while still finding value in the methodology and the broader patterns the benchmark reveals.
See For Yourself
The FACTS Leaderboard is live on Kaggle and will be actively maintained with new model submissions. The public portion of the benchmark is available for testing, while a private set guards against overfitting.
For those building AI tools for journalism, these metrics offer a starting point for evaluation. For those using AI tools in newsrooms, they provide useful context for understanding the current state of the technology — and its limitations.