Results on the up, but human supervision still required
Linklaters has set an exam on English law for a series of AI models in order to assess their accuracy.
‘The LinksAI English law benchmark’ comprises 50 questions drawn from 10 legal practice areas, with each response rated and given a score.
The AI models receive a maximum of 10 marks for each question: five for substance (whether the answer is right or not), three for citations (including issues of hallucination) and two for clarity.
The questions are described as “hard”, pitched at a level that would require a competent mid-level lawyer, roughly two years post-qualification and specialised in the relevant practice area, to answer.
The practice areas covered are contract, IP, tax, privacy, employment, corporate, dispute resolution, real estate, competition and banking.
The last benchmarking exercise took place in October 2023 and tested GPT-2, GPT-3, GPT-4 and Bard. Whilst Bard came out on top with an average score of 4.4 out of 10, all of the tools “were often wrong and the citations sometimes fictional”, the report states.
The new round of testing looked at Gemini 2.0, which scored 6 out of 10, and OpenAI o1, which achieved the top score of 6.4 out of 10. In both cases the improvement was driven by higher substance and citation scores.
One significant shift came in the prevalence of hallucinations, where the AI invents cases and statutes to support its answers. Whilst almost a third of the older models’ answers contained hallucinations (47 out of 150), the two newer tools cut this rate to 9%. This figure does not, however, capture citations that are real but inaccurate.
Commenting on the improved results, the report notes that “despite the significant improvement in Gemini 2.0 and OpenAI o1, we recommend they should not be used for English law legal advice without expert human supervision. They are still not always right and lack nuance.”
“However, if that expert supervision is available, they are getting to the stage where they could be useful, for example by creating a first draft or as a cross-check. This is particularly the case for tasks that involve summarising relatively well-known areas of law.”
Looking ahead, the report goes on to say that “whether this rate of progression will continue is less clear.”
“It is possible there are inherent limitations to LLMs — which are partly stochastic parrots regurgitating the internet (and other learned text) on demand,” it continues. “For example, they all suffer from the embodiment problem; they will never experience the beauty of a summer cricket game or the physical revulsion at finding a snail in a bottle of ginger beer. However, the fine tuning of this technology is likely to deliver performance improvements for years to come.”
“In any event, we will reapply the LinksAI English law benchmark to future iterations of this technology and update this report,” it adds.