Anthropic’s Claude 87% better at legal tasks

Introduction

Robin AI is excited to welcome the new Claude 3.7 Sonnet model from our partner Anthropic. Claude 3.7 Sonnet excelled across our benchmarks, and Robin was hugely impressed by Claude 3.7 Sonnet's ability to accurately find and extract critical data from legal contracts - a prerequisite for almost all legal tasks.

Specifically, the release of Claude 3.7 Sonnet’s ‘thinking mode’ represents a significant step forward in LLM capabilities for legal reasoning and demonstrates Anthropic’s success in advancing frontier model performance in specific domains.

Anthropic's Claude 3.7 Sonnet model has surpassed all other LLMs in our benchmarks for identifying legal concepts and clauses in Vendor Agreements. Claude 3.7 Sonnet shows an 8% performance improvement over previous Anthropic models, which were already the market leaders for legal applications.

Since March 2024, we've seen an 87.5% increase in the performance of Anthropic models at locating content in contracts, and today we're excited to announce that Claude 3.7 Sonnet is now live in Robin AI's Legal AI Assistant.

Data at Robin AI

Robin has been at the forefront of utilising its proprietary data to refine and customise its products and features for specialised legal workflows.

Robin's library of labelled contracts builds on top of Anthropic's Claude 3.7 Sonnet model, enabling our models to predict clause types with greater accuracy and helping users complete their legal tasks faster.

Effective legal applications depend on accurately identifying the right data. Whether the task is highlighting critical clauses, or producing clear and concise summaries, precise clause extraction is essential to core legal processes.

For example, in the event that you suffer a data breach, you will need to know the notification requirements that apply to each of your customers. Robin’s models can accelerate your response to an incident because they are trained with relevant legal data. In this scenario, our models can quickly find which customers you need to notify about a data breach and when. And all this information can be pulled into a single report - in minutes.

The safest way to use Legal AI for these types of tasks is to keep a human in the loop at each step of the process. To make that manageable, users need simple ways to trust and verify AI-generated outputs. We achieve this by using clickable citations which take users directly to the relevant contract clause text for verification - something made possible by Anthropic's improving family of models.

Benchmarking Model Performance

In order to continuously benchmark and refine our models, Robin AI maintains a subset of rigorously labelled "gold standard" contracts. This allows us to assess where our models perform well and where specialised data curation is needed to handle our customers' workflows. These documents are only ever used for model testing, ensuring consistent, unbiased evaluation. The dataset consists of a sample of gated legal contracts, including samples which represent some of the most complex and nuanced legal language.

Every contract in the gold standard dataset has been carefully annotated by human legal experts who specialise in annotating legal data for use in technology. These annotations — or labels — are applied to sentences and paragraphs when the team determines that a specific section corresponds to a particular legal concept.
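
As an illustration of what these annotations look like in practice, here is a minimal sketch of how a labelled span might be represented. The field names and example values are invented for this post and are not Robin AI's internal schema.

```python
from dataclasses import dataclass

# Illustrative only: the structure and field names are assumptions made for
# this post, not Robin AI's internal annotation schema.
@dataclass
class LabelledSpan:
    contract_id: str   # which gold-standard contract the span comes from
    start_char: int    # character offset where the span begins
    end_char: int      # character offset where the span ends
    clause_type: str   # legal concept the annotators assigned, e.g. "liquidated_damages"
    text: str          # the annotated sentence or paragraph

# A single paragraph can carry several labels if the annotators decide it
# expresses more than one legal concept.
example = LabelledSpan(
    contract_id="vendor-agreement-001",
    start_char=10_432,
    end_char=10_780,
    clause_type="liquidated_damages",
    text="If Supplier fails to deliver by the agreed date, Supplier shall pay...",
)
```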

In order to benchmark and evaluate different models, Robin AI takes these contracts and asks each model to apply the relevant labels to the relevant spans of text. We then compare the model-generated labels to our gold standard and compute an "F1" score to represent model performance.

The F1 score measures a model's performance by balancing precision (how many of the predicted positives are actually correct) and recall (how many of the actual positives are correctly identified). A higher F1 score, which ranges from 0 to 1, indicates better overall accuracy. For example, if a model correctly identifies an "automatic renewal" provision but also adds an incorrect label like "governing law" to the same passage, the model would in this case have high recall but low precision. The F1 score combines both metrics to provide a more balanced view of performance, highlighting the trade-off between identifying the right cases and avoiding false positives.

An F1 score of 1 would indicate that the model perfectly separates the needles from the haystack, significantly streamlining legal workflows by finding only the relevant information for a user’s query.
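
To make the calculation concrete, here is a minimal sketch of how an F1 score could be computed for a single passage by comparing predicted labels against the gold standard. The label sets below reproduce the invented "automatic renewal" example above; they are not real benchmark data, and this is not necessarily how Robin AI aggregates scores across a whole dataset.

```python
def f1_score(predicted: set[str], gold: set[str]) -> float:
    """F1 for one passage, treating each label on the passage as a prediction."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)  # share of predicted labels that are correct
    recall = true_positives / len(gold)          # share of gold labels that were found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The example from above: the model finds the correct "automatic renewal" label
# but also adds an incorrect "governing law" label to the same passage.
predicted = {"automatic_renewal", "governing_law"}
gold = {"automatic_renewal"}

print(round(f1_score(predicted, gold), 3))  # 0.667: recall is 1.0 but precision is only 0.5
```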

Performance

In this blog, we have focussed on how the upgraded Claude 3.7 Sonnet model performs in identifying key clauses in Vendor Agreements (VAs) - a contract type that every business deals with. Whether it's a software subscription that silently auto-renews, a manufacturing deal that stipulates complex delivery schedules, or details on who needs to be notified in a data breach, it's vital that businesses are able to understand the rights and obligations in their contracts.

Our research concludes that Anthropic's Claude 3.7 Sonnet model has outperformed all other LLMs we benchmarked at finding legal concepts and clauses in VAs. And we are excited to report that Anthropic's new reasoning parameters, enabled by configuring the model to "thinking mode", appear to represent a significant step forward in legal reasoning. According to our evaluations, using Claude 3.7 Sonnet's thinking mode resulted in an average 8% increase in F1 over our previous LLM benchmarks for identifying and classifying legal data in VAs.
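
For readers curious about what configuring the model to "thinking mode" involves, the sketch below shows extended thinking being enabled through Anthropic's Python SDK. The model ID, token budget and prompt are illustrative assumptions and do not reflect Robin AI's production configuration.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Extended thinking is switched on by passing a `thinking` block with a token
# budget; the budget must be smaller than max_tokens. Values here are illustrative.
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # assumed model ID for Claude 3.7 Sonnet
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[
        {
            "role": "user",
            "content": (
                "Quote and label the liquidated damages clause in the vendor "
                "agreement below.\n\n<contract text goes here>"
            ),
        }
    ],
)

# The response interleaves "thinking" blocks with the final "text" answer;
# only the text blocks would be shown to an end user.
for block in response.content:
    if block.type == "text":
        print(block.text)
```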

Claude's baseline quality for legal tasks is rising significantly in some cases. Among the top-performing examples, Claude 3.7 Sonnet achieved an F1 score of 0.826 in accurately locating liquidated damages clauses in VAs, compared to an F1 of 0.368 for Anthropic's previous model (Sonnet 3.5 v2).

Similarly, when looking at indemnification provisions - such as events, procedures and remedies - we observe an increase in F1 from 0.500 to 0.846, showing how Claude 3.7 Sonnet can find legal concepts within contracts out of the box with significantly improved accuracy.

Product Impact

With the significant advances in model capabilities, we’re seeing a substantial improvement in the ability to quickly and accurately find and extract relevant data from legal contracts.

This acceleration means we are able to regularly improve our Reports feature and Legal AI Assistant, and Robin’s legal experts now have more time to focus on curating the right data for the more complex, corner-case scenarios that pre-trained LLMs cannot address out of the box.

The result is that Robin's research and engineering teams can develop more customised solutions for anything from managing customer fallout from a data breach to speeding up due diligence for a deal.

Claude 3.7 Sonnet is now available to try for free via Robin AI’s Legal AI Assistant.
