Optimizing RAG for Contract Analysis: Our Research Findings

Introduction: Why Contracts Need RAG

At Robin AI, we make contracts simple. But the reality is that contracts are complex documents filled with specialized language, intricate clauses, and important details buried within pages of text. When lawyers and business professionals need to extract specific information from contracts, they don't need to read the entire document; they need precise answers from relevant sections.

This is where Retrieval Augmented Generation (RAG) comes in. RAG is a powerful AI approach that combines the strengths of information retrieval with generative AI to provide accurate, contextually relevant responses based on specific documents. Instead of feeding entire contracts into a Large Language Model, which is expensive, inefficient, and potentially error-prone, RAG first identifies the most relevant portions of text and then generates answers based only on those sections.

Implementing RAG effectively for legal documents isn't straightforward. Contract language is specialized, formatting can be inconsistent, and many contracts can contain noise stemming from, for example, imperfect Optical Character Recognition (OCR) conversion. To address these challenges, we conducted extensive experiments to develop the optimal RAG approach for contract analysis.

Understanding RAG for Contracts

Before diving into our research, let's quickly explain the three stages of RAG in the context of contract analysis:

  1. Indexing Phase: Contracts are processed, divided into chunks (“chunked”), embedded (converted to numerical vectors that capture meaning), and stored in a vector database.
  2. Retrieval Phase: When a user asks a question, the query is embedded in the same way, and the system retrieves the most semantically similar contract sections.
  3. Generation Phase: The retrieved sections, along with the original query, are sent to an LLM (in our case, Anthropic’s Claude models) to generate a comprehensive answer.

This approach combines the benefits of retrieval systems (accessing specific information from a large corpus) with the natural language understanding and generation capabilities of modern LLMs.
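
As a rough illustration of these three stages, the sketch below wires them together using an off-the-shelf open-source embedding model (sentence-transformers) purely as a stand-in; the chunks, query, and model choice are illustrative and do not reflect our production pipeline:

```python
# Minimal sketch of the three RAG stages: indexing, retrieval, generation.
# The embedding model here is a generic open-source stand-in, not the one we use in production.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Indexing: chunk the contract and embed each chunk.
chunks = [
    "Termination: Either party may terminate with 30 days written notice.",
    "Payment: Invoices are due within 45 days of receipt.",
    "Confidentiality: Each party shall protect the other's confidential information.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# 2. Retrieval: embed the query the same way and rank chunks by cosine similarity.
query = "How much notice is required to end the agreement?"
query_vector = model.encode([query], normalize_embeddings=True)[0]
scores = chunk_vectors @ query_vector          # cosine similarity (vectors are normalized)
top_k = np.argsort(scores)[::-1][:2]           # keep the 2 most relevant chunks

# 3. Generation: send only the retrieved chunks plus the query to the LLM.
context = "\n\n".join(chunks[i] for i in top_k)
prompt = f"Answer using only the contract excerpts below.\n\n{context}\n\nQuestion: {query}"
print(prompt)  # in production, this prompt is sent to Claude
```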

Research Methodology: Finding the Optimal RAG Configuration

To determine the most effective RAG implementation for contracts, we systematically evaluated two key components:

  1. Embedding Models: We tested a dozen different embedding models, including both commercial and open-source options.
  2. Metadata Augmentation: We explored various types of metadata that could enhance the contract text, including clause labels, clause summaries, document types and more.

Our evaluation used an internally collected dataset covering thousands of questions over various contracts and contract types, with associated expert-validated relevant sections and answers.

For retrieval, we evaluated two metrics (a sketch of how they can be computed follows this list):

  • Recall@k: how many contract segments (k) we need to retrieve, on average, to achieve a given recall score.
  • Recall@p: what percentage (p) of contract segments we need to retrieve, on average, to achieve a given recall score.
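
As a concrete illustration, here is a minimal sketch of how these metrics can be computed for a single query; the chunk IDs, gold annotations, and function names are hypothetical, and in practice the values are averaged over the full evaluation set:

```python
# Sketch of the retrieval metrics for one query.
# `ranked` holds chunk IDs ordered by similarity; `relevant` holds the expert-annotated gold IDs.
def recall_at_k(ranked, relevant, k):
    """Fraction of gold chunks found within the top-k retrieved chunks."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def min_k_for_recall(ranked, relevant, target=0.9):
    """Smallest k whose Recall@k reaches the target (None if never reached)."""
    for k in range(1, len(ranked) + 1):
        if recall_at_k(ranked, relevant, k) >= target:
            return k
    return None

# Toy example: a contract with 10 chunks, 2 of which are the gold answers.
ranked_ids = [4, 7, 1, 9, 0, 3, 2, 8, 5, 6]
gold_ids = [7, 9]
k_needed = min_k_for_recall(ranked_ids, gold_ids, target=0.9)
p_needed = k_needed / len(ranked_ids)          # the percentage-based variant
print(k_needed, f"{p_needed:.0%}")             # -> 4, 40%
```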

Metadata Magic: Enhancing Contract Chunks with Context

One of our most significant findings was the substantial impact of metadata on retrieval performance. Rather than using raw contract text alone, we experimented with augmenting text chunks with various types of metadata.

Our experiments using a standard embedding provider revealed that carefully designed clause labels (one- or two-word phrases that describe the content of a clause) and cross-clause summaries consistently delivered the most significant performance improvements. For example, adding metadata like "Termination Clause - Summary: Outlines conditions under which either party may terminate the agreement and the required notice period" to contract chunks improved retrieval accuracy by up to 6% compared to using the raw text alone.
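
To make this concrete, here is a minimal sketch of what prepending such metadata to a chunk before embedding can look like; the ContractChunk structure and field names are hypothetical, and the labels and summaries would be produced by a separate annotation step:

```python
# Illustrative sketch of prepending metadata to a chunk before embedding.
# The data structure and field names are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class ContractChunk:
    text: str
    clause_label: str      # one- or two-word clause description
    clause_summary: str    # short clause summary

def augment_for_embedding(chunk: ContractChunk) -> str:
    """Combine metadata and raw text into the string that gets embedded."""
    return f"{chunk.clause_label} - Summary: {chunk.clause_summary}\n{chunk.text}"

chunk = ContractChunk(
    text="Either party may terminate this Agreement upon ninety (90) days prior written notice...",
    clause_label="Termination Clause",
    clause_summary=(
        "Outlines conditions under which either party may terminate the "
        "agreement and the required notice period"
    ),
)
print(augment_for_embedding(chunk))   # this augmented string, not the raw text, is embedded
```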

Other metadata elements we tested included:

  • Named Entities.
  • Manually designed labels.
  • Specialised tags for certain content.

While all metadata types provided some benefit, clause labels and summaries delivered the greatest impact with the least additional token overhead.

With our final metadata combination, we were able to surpass 90% recall on an internal dataset using standard embeddings.

A Voyage into Embedding Space: Comparing Model Performance

Embedding models are the backbone of RAG systems because they translate text into numerical vectors that capture semantic meaning. The quality of these embeddings directly impacts retrieval accuracy.

We evaluated embedding models from OpenAI, Cohere, Voyage, Amazon, and open-source models like Qwen2, ME5, and Snowflake. We measured Min p@90 and Min k@90: the minimum percentage p, or total number k, of retrieved contract paragraphs needed to reach a recall score of at least 90%.

Our testing revealed a clear winner: Voyage 3 Large by VoyageAI significantly outperformed all other models.

The performance gap was particularly pronounced for complex legal queries that required understanding specialised terminology and contextual nuances within contracts. Voyage 3 Large demonstrated superior ability to match user questions with relevant contract sections, even when the language used in the query differed from the exact wording in the contract.

About Voyage 3 Large

What makes Voyage 3 Large stand out isn't just its state-of-the-art retrieval quality across diverse domains like law and finance, but its remarkable efficiency. Enabled by techniques like Matryoshka learning and quantization-aware training, it supports various embedding dimensions (down to 256) and even int8 or binary quantization. This dramatically reduces vector database storage costs – potentially up to 200x less than competitors like OpenAI’s text-embedding-3-large – with minimal impact on retrieval quality. Combined with a generous 32K-token context length (compared to 8K for OpenAI and 512 for Cohere) and strong multilingual capabilities, Voyage 3 Large establishes a new frontier for balancing top-tier performance with practical cost-effectiveness.
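
As a hedged sketch, requesting reduced-dimension or quantized embeddings through VoyageAI's Python client might look roughly like the snippet below; the output_dimension and output_dtype parameters reflect our reading of the voyage-3-large documentation and should be checked against the current API reference:

```python
# Hedged sketch of requesting lower-dimension / quantized embeddings via the
# VoyageAI Python client. Exact parameter names may differ from the current docs.
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

chunks = [
    "Termination Clause - Summary: ...",
    "Payment Terms - Summary: ...",
]

result = vo.embed(
    chunks,
    model="voyage-3-large",
    input_type="document",     # queries use input_type="query" at retrieval time
    output_dimension=256,      # Matryoshka-style reduced dimension (assumed parameter)
    output_dtype="int8",       # quantized output to cut storage costs (assumed parameter)
)
vectors = result.embeddings    # one vector per chunk
```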

The 85% Efficiency Gain: Less is More

Perhaps the most compelling result from our research was the dramatic reduction in required text processing. When we established a minimum "satisfactory" retrieval recall threshold of 90% (meaning the system successfully retrieves the relevant information at least 90% of the time), we found that:

On average, with our best performing metadata and embeddings, we only needed to retrieve and process 15% or less of the full contract text to achieve 90% recall.

This represents an 85% reduction in text that needs to be processed by the downstream generation model.

This efficiency gain translates directly into:

  • Faster response times
  • Lower API costs
  • Reduced token consumption
  • Ability to process more complex queries within token limits

The savings were particularly dramatic for longer contracts, where the relevant information might be buried within dozens of pages of text.

Answer Generation with Claude: Quality Without Compromise

With our optimized retrieval system in place, the final component was generating high-quality answers using Claude by Anthropic. We conducted a comparative evaluation of two approaches:

  1. Full-Contract Processing: Feeding the entire contract into Claude along with the user query.
  2. RAG-Powered Generation: Feeding only the retrieved relevant sections into Claude.

The results were clear: Overall, RAG-powered generation performed on par with full-contract processing in terms of answer accuracy and completeness.
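
For illustration, the RAG-powered generation step can be sketched with Anthropic's Python SDK roughly as follows; the model name, prompt wording, and retrieved sections are placeholders rather than our production configuration:

```python
# Sketch of the RAG-powered generation step using Anthropic's Python SDK.
# The retrieved sections, prompt, and model name are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

retrieved_sections = [
    "Delivery Timeline: Vendor shall deliver all goods within 30 days of order...",
    "Remedies for Delay: If Vendor fails to deliver on time, Customer may...",
]
question = "What happens if the vendor misses the delivery deadline?"

context = "\n\n---\n\n".join(retrieved_sections)
message = client.messages.create(
    model="claude-3-5-sonnet-latest",   # illustrative model name
    max_tokens=1024,
    system="Answer strictly from the contract excerpts provided. "
           "If the excerpts do not contain the answer, say so.",
    messages=[{
        "role": "user",
        "content": f"Contract excerpts:\n\n{context}\n\nQuestion: {question}",
    }],
)
print(message.content[0].text)
```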

These results held over a wide range of tested contract types, ranging from relatively short and straightforward Non-Disclosure Agreements (NDAs) to intricate Master Service Agreements (MSAs), Limited Partnership Agreements (LPAs), and Side Letters.

In fact, across our full internal dataset, RAG was able to slightly outperform the full-contract based answers. We hypothesize that this is due to noise reduction introduced (almost as a side effect) by only providing the pertinent contract clauses to the generation model. This means Claude cannot be distracted by irrelevant contract language, making the overall answer generation task easier.

This quality preservation, combined with the 85% reduction in processed text, represents a breakthrough in efficient contract analysis.

Example: RAG in Action for Contract Analysis

Let's look at a concrete example of how our optimised RAG system works, where a customer has a specific question about a long supply agreement contract:

User Query: "What happens if the vendor misses the delivery deadline in this supply agreement?"

Traditional Approach: Process all 30 pages (15,000+ tokens) of the supply agreement.

RAG Approach:

  1. The system retrieves two relevant sections:
    • "Delivery Timeline" clause (300 tokens)
    • "Remedies for Delay" clause (450 tokens)
    • (It also retrieves a number of irrelevant clauses totalling roughly 1,500 tokens; this is acceptable and expected)
  2. Only these retrieved sections, roughly 2,250 tokens in total (about 15% of the contract), are sent to Claude
  3. Claude focuses on the truly relevant retrieved clauses and generates a comprehensive answer about the consequences of late delivery

Result: The user receives the same high-quality answer in less time, with 85% fewer tokens processed.

Conclusion: The Future of Contract Analysis

Our research confirms that properly implemented RAG can transform contract analysis, making it faster, more efficient, and more cost-effective without sacrificing accuracy. The combination of Voyage 3 Large embeddings and strategic metadata augmentation enables us to achieve our mission of making contracts simple.

The 85% reduction in text processing requirements opens up exciting possibilities for handling even the most complex contracts and queries within reasonable token limits and response times.

At Robin AI, we're continuing to explore new frontiers in AI-powered contract analysis, including:

  • Multi-contract RAG for comparing terms across documents
  • Temporal analysis for tracking how contract language evolves
  • Fine-tuning embedding models specifically for legal language

We believe RAG represents a fundamental shift in how AI can interact with legal documents, moving from brute force approaches to intelligent, targeted analysis that focuses on exactly what matters in each specific context.

Want to learn more about how Robin AI can simplify contract analysis for your organisation? Contact our team for a demonstration of our RAG-powered contract solutions.

Robin AI is hiring across a variety of roles. Check our open positions and apply!
