Lemma 2: Do web search agents improve legal reasoning?
TL;DR: Web search agents help with accessing current information, but they don’t meaningfully improve legal reasoning performance. The real bottleneck isn’t knowledge—it’s reasoning ability itself.
Web search agent tools are specialized systems that allow large language models (LLMs) to retrieve and incorporate real-time internet information into their responses. These tools act as an extension of an LLM’s capabilities, enabling it to perform grounded reasoning with up-to-date data rather than relying solely on static training knowledge. When a user input requires new information, the model interprets the request, formulates a query, calls the search tool, and retrieves structured results. It then summarizes or reasons over those results before generating a final answer.
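To make this concrete, here is a minimal sketch of that loop in Python. The function names (`call_llm`, `web_search`) are placeholders for whichever chat-completion API and search backend you use, not any particular vendor's interface:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion API call."""
    raise NotImplementedError("wire this to your LLM provider")

def web_search(query: str, k: int = 5) -> list[dict]:
    """Placeholder for a search backend; returns [{'title', 'url', 'snippet'}, ...]."""
    raise NotImplementedError("wire this to your search provider")

def answer_with_search(user_question: str) -> str:
    # 1. The model interprets the request and formulates a search query.
    query = call_llm(
        "Write a single web search query that would help answer the question below.\n"
        f"Question: {user_question}\nQuery:"
    )

    # 2. The agent calls the search tool and retrieves structured results.
    results = web_search(query.strip())

    # 3. The retrieved snippets are passed back to the model as grounding context.
    context = "\n".join(f"- {r['title']} ({r['url']}): {r['snippet']}" for r in results)

    # 4. The model reasons over the results before producing the final answer.
    return call_llm(
        "Answer the question using only the sources below, and cite the URLs you rely on.\n"
        f"Sources:\n{context}\n\nQuestion: {user_question}\nAnswer:"
    )
```

Production agents layer more onto this loop (multiple search rounds, result ranking, structured tool-call schemas), but the interpret → search → ground → answer shape stays the same.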
Web search agents can be particularly useful for fact-checking, summarizing recent events, or other tasks where timely or verifiable data is critical. But do web agents make legal reasoning in particular more reliable?
Hallucinations and knowledge cutoffs
Web search agents can enhance legal research in two main ways:
- Accuracy: Mitigating hallucinations.
  - AI hallucinations occur when a model produces false or fabricated information that appears factually correct.
  - In the legal field, this is a serious problem because AI-generated hallucinations, such as citing nonexistent cases or statutes, have led to court sanctions, ethical violations, and professional misconduct findings against attorneys who failed to verify AI outputs before filing them in court.
  - In theory, grounding LLMs with web search reduces hallucinations in legal research by connecting model outputs to real, verifiable legal sources such as statutes, case law, and official databases in real time. This retrieval-based approach helps ensure that legal citations and conclusions are supported by existing authority rather than fabricated text.
- Freshness: “Extending” the knowledge cutoff.
  - A knowledge cutoff is the specific date after which an AI model has not been trained on any new data, meaning it lacks information about events or developments that occurred after that point.
  - Grounding with web search enables legal AI to access real-time information about cases decided after the model’s knowledge cutoff. This dynamic connection allows AI to provide accurate, up-to-date answers grounded in current statutes and rulings rather than relying on outdated or incomplete training data.
  - For example, compare the following pair of model responses to a simple prompt about a case decided after the model’s knowledge cutoff:
The first aspect (accuracy) is now less significant than the second (freshness), as newer models have become more reliable at avoiding hallucinations even without web search. However, this improvement does not eliminate hallucinations entirely—lawyers must still exercise caution by verifying all AI‑generated sources and citations before relying on them.
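One lightweight aid for that verification step is a purely mechanical check: extract the citations from a draft and flag any that never appear in the retrieved source text. The sketch below assumes a crude U.S. reporter-citation regex and plain-text sources; it only confirms that a citation string shows up somewhere, not that the cited authority actually supports the argument, so it supplements rather than replaces human review.

```python
import re

# A rough heuristic for U.S. reporter citations (e.g. "576 U.S. 644", "123 F.3d 456").
# Real citation parsing is far more involved; this pattern is illustrative only.
CITATION_PATTERN = re.compile(r"\b\d{1,4}\s+[A-Z][A-Za-z.]+(?:\s?\d[a-z]{1,2})?\s+\d{1,4}\b")

def unverified_citations(draft: str, retrieved_sources: list[str]) -> list[str]:
    """Return citations in the draft that never appear in any retrieved source text."""
    combined_sources = "\n".join(retrieved_sources)
    cited = set(CITATION_PATTERN.findall(draft))
    return sorted(c for c in cited if c not in combined_sources)

# Example: flag a citation the retrieved sources do not support.
draft = "Under 576 U.S. 644 and 999 F.3d 111, the motion should be granted."
sources = ["Obergefell v. Hodges, 576 U.S. 644 (2015), held that ..."]
print(unverified_citations(draft, sources))  # ['999 F.3d 111']
```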
Our experiment
In our last blog post (Lemma 1: Announcing LAURA), we previewed results from our initial benchmark, LAURA: a suite of complex reasoning tasks designed to evaluate how models identify hidden legal issues, navigate textual ambiguity, apply precedent to novel facts, and make surgical corrections to subtle errors. To keep the comparison consistent, web search was disabled in that experiment, since some models lack the capability.
In a follow-up experiment, we reran the same benchmark on three frontier models with web search tools enabled to see whether web search meaningfully improves reasoning performance. Note that our dataset only includes cases published before the models’ knowledge cutoffs, which lets us assess whether web search improves reasoning ability rather than simply providing access to newer information.
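As a rough illustration of the design, here is one way such a paired comparison can be summarized: score the same tasks for each model with and without web search and look at the per-model delta. The scores below are placeholders for illustration, not LAURA results.

```python
from statistics import mean

def summarize_deltas(scores: dict[str, dict[str, list[float]]]) -> None:
    """Print the mean score per condition and the with-search delta for each model."""
    for model, conditions in scores.items():
        baseline = mean(conditions["no_search"])
        with_search = mean(conditions["web_search"])
        print(f"{model}: no_search={baseline:.3f} "
              f"web_search={with_search:.3f} delta={with_search - baseline:+.3f}")

# Placeholder per-task scores in [0, 1], illustration only.
summarize_deltas({
    "model_a": {"no_search": [0.62, 0.55, 0.70], "web_search": [0.63, 0.56, 0.71]},
})
```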
The improvement in reasoning performance is marginal, if present at all.
The upshot
The modest gains observed in our experiment underscore an important limitation: web search agents alone cannot compensate for fundamental gaps in legal reasoning ability.
The relatively small performance deltas reflect two key realities:
- First, modern frontier models have become substantially better at avoiding hallucinations even without external grounding, reducing one of the primary historical benefits of web search.
- Second, and more significantly, our benchmark was deliberately designed to test reasoning over cases within each model’s knowledge cutoff—meaning the models already possessed the requisite legal knowledge from pretraining. The performance gaps that remain, therefore, stem not from insufficient access to information, but from limitations in reasoning capability itself.
This suggests that meaningful improvements in legal AI performance will require advances in post-training techniques, particularly reinforcement learning approaches that can enhance models’ ability to apply legal principles, construct sound arguments, and perform the complex analytical reasoning that legal tasks demand. Web search is a useful tool for freshness and verification, but it cannot supplant robust legal reasoning abilities developed through targeted post-training.
📩 founders@withlemma.com
🌐 www.withlemma.com
If you’re interested in our mailing list, click here.
