
Lemma 2: Do web search agents improve legal reasoning?

October 28, 2025 • Lemma Research

TL;DR: Web search agents help with accessing current information, but they don’t meaningfully improve legal reasoning performance. The real bottleneck isn’t knowledge—it’s reasoning ability itself.

Web search agent tools are specialized systems that allow large language models (LLMs) to retrieve and incorporate real-time internet information into their responses. These tools act as an extension of an LLM’s capabilities, enabling it to perform grounded reasoning with up-to-date data rather than relying solely on static training knowledge. When a user input requires new information, the model interprets the request, formulates a query, calls the search tool, and retrieves structured results. It then summarizes or reasons over those results before generating a final answer.
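As a rough illustration of that loop, consider the sketch below. The `call_llm` and `web_search` helpers are hypothetical stand-ins for a real model API and search backend, not any particular vendor's interface.

```python
# Minimal sketch of a web-search agent loop. `call_llm` and `web_search`
# are hypothetical placeholders for a real model API and search backend.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    raise NotImplementedError

def web_search(query: str, top_k: int = 5) -> list[str]:
    """Placeholder for a real search API; returns result snippets."""
    raise NotImplementedError

def answer_with_search(question: str) -> str:
    # 1. The model interprets the request and formulates a search query.
    query = call_llm(f"Write a web search query for: {question}")

    # 2. The agent calls the search tool and retrieves results.
    snippets = web_search(query)

    # 3. The model reasons over the retrieved results before answering,
    #    grounding its response in up-to-date sources.
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return call_llm(
        "Using only the sources below, answer the question and cite them.\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```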

Web search agents can be particularly useful for fact-checking, summarizing recent events, or other tasks where timely or verifiable data is critical. But do web agents make legal reasoning in particular more reliable?

Hallucinations and knowledge cutoffs

Web search agents can enhance legal research in two main ways: by grounding answers in verifiable sources, which reduces hallucinated cases and citations (accuracy), and by retrieving information published after a model's knowledge cutoff (freshness). The exchange below illustrates both.

User: What did the U.S. Supreme Court hold in Ames v. Ohio Department of Youth Services (2025)? Answer in two sentences.*

Gemini 2.5 Pro (knowledge cutoff: January 2025): The U.S. Supreme Court has not issued a holding in Ames v. Ohio Department of Youth Services (2025), as this is a fictional case. The year cited is in the future, and there is no record of the Supreme Court ever hearing a case with this name.

User: What did the U.S. Supreme Court hold in Ames v. Ohio Department of Youth Services (2025)? Answer in two sentences.*

Gemini 2.5 Pro (grounding with Google Search enabled): In Ames v. Ohio Department of Youth Services, the U.S. Supreme Court unanimously held that plaintiffs from majority groups are not required to show "background circumstances" to support a claim of employment discrimination under Title VII.[1][2][3][4] This decision eliminates a heightened evidentiary burden for so‑called "reverse discrimination" claims, ensuring that all individuals, regardless of their majority or minority status, are subject to the same standards when bringing forward claims of workplace discrimination.[5][6]

* Ames was decided on June 5, 2025.

The first aspect (accuracy) is now less significant than the second (freshness), as newer models have become more reliable at avoiding hallucinations even without web search. However, this improvement does not eliminate hallucinations entirely—lawyers must still exercise caution by verifying all AI‑generated sources and citations before relying on them.

Our experiment

In our last blog post (Lemma 1: Announcing LAURA), we shared a preview of the results of our initial benchmark, LAURA. LAURA is a suite of complex reasoning tasks designed to evaluate how models identify hidden legal issues, navigate textual ambiguity, apply precedent to novel facts, and make surgical corrections to subtle errors. To ensure consistency, web search was not used in that experiment, as some models lack this capability.

In a follow-up experiment, we reran the same benchmark on three frontier models with web search tools enabled to see whether reasoning performance improves meaningfully. Our dataset only includes cases published before the models' knowledge cutoffs, allowing us to assess whether web search improves reasoning ability rather than simply providing access to newer information.
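Conceptually, the comparison amounts to scoring each model on the same LAURA tasks twice, once with the search tool enabled and once without. The sketch below illustrates that setup only; `run_model` and `score_answer` are hypothetical placeholders, not our actual evaluation harness.

```python
# Illustrative sketch of the with/without-web comparison, not our actual
# evaluation code. `run_model` and `score_answer` are hypothetical helpers.

def run_model(model: str, task: str, web_search: bool) -> str:
    """Placeholder: query `model` on `task`, optionally with a search tool."""
    raise NotImplementedError

def score_answer(task: str, answer: str) -> float:
    """Placeholder: grade an answer against the benchmark rubric (0-100)."""
    raise NotImplementedError

def benchmark(model: str, tasks: list[str]) -> dict[str, float]:
    results = {}
    for web in (False, True):
        scores = [
            score_answer(t, run_model(model, t, web_search=web)) for t in tasks
        ]
        results["with web" if web else "no web"] = sum(scores) / len(scores)
    return results
```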

LAURA scores (out of 100), with and without web search enabled:

Model                           No web    With web
GPT‑5 (high)                    65.6      69.9
Gemini 2.5 Pro                  68.5      61.1
Claude Sonnet 4.5 (thinking)    55.7      58.9

Any improvement in reasoning performance is marginal at best: GPT‑5 and Claude gain only a few points with web search enabled, and Gemini 2.5 Pro actually scores lower.

The upshot

The modest gains observed in our experiment underscore an important limitation: web search agents alone cannot compensate for fundamental gaps in legal reasoning ability.

The relatively small performance deltas reflect two key realities: the benchmark only covers cases published before the models' knowledge cutoffs, so web search adds little information the models do not already have; and the bottleneck in these tasks is reasoning ability, not access to knowledge.

This suggests that meaningful improvements in legal AI performance will require advances in post-training techniques, particularly reinforcement learning approaches that can enhance models’ ability to apply legal principles, construct sound arguments, and perform the complex analytical reasoning that legal tasks demand. Web search is a useful tool for freshness and verification, but it cannot supplant robust legal reasoning abilities developed through targeted post-training.

📩 founders@withlemma.com
🌐 www.withlemma.com

If you’re interested in our mailing list, click here.