NLP LLP: Talking to Your Case Files

There are two things that sci-fi movies in the 80s promised we would have by now: 1) flying cars, and 2) robot helpers who can talk to us.

Though we’re still waiting on flying cars, we have made progress on the robot front. Artificial intelligence (AI) assistants like Siri and Alexa can (most of the time) figure out what we’re trying to say and can play songs, provide directions, and check the weather at the drop of a hat. That’s cool and all, but these tasks are simple and straightforward. Where is C-3PO, helping us instantaneously translate different languages? Or J.A.R.V.I.S., pulling up archives of research based on our voice commands?

The answer is: closer than we think.

In 2018, Google released BERT (short for “Bidirectional Encoder Representations from Transformers”). It was trained on roughly 3.3 billion words of English text (English Wikipedia plus a large collection of books) and was a major step toward models that capture the meaning of what they read. It can represent degrees of difference, like how butterflies are similar to caterpillars but more different from dogs. It can also tell that the word “bank” in “river bank” and “bank account” has different meanings. Armed with this knowledge, it can perform better on the English portion of college admissions exams than most prospective college students (its correct answers regularly land it in the 90th percentile). Models built on the same transformer architecture can even be taught to write comprehensible and meaningful short stories.

BERT belongs to a branch of AI called natural language processing (NLP), the area of machine learning concerned with teaching computers to understand language. The word “natural” refers to how a sentence is normally constructed by a typical speaker; the machine understands the language by “processing” it. The study of NLP has been around for decades, and its earlier advancements (e.g., text indexing) have already been widely adopted by the legal world. To understand linguistic concepts, BERT uses neural networks, which are named after the neurons of the human brain. Neural networks have become prevalent in the last decade due to their ability to recognize patterns as we can, only faster.

In NLP LLP—a series on our own blog, which you can find here—we’re exploring different NLP models and techniques and their potential for use in the legal profession. We will cover the most groundbreaking discoveries in the NLP AI technology space and discuss how they relate to the legal world. Specifically, we’ll explore how our RelativityOne app—Fileread—can help.

This post discusses an emerging alternative to keyword searching in discovery: machine learning question answering.

MLQA: Your Case Files Are Your Best Witness

Imagine being on the Enron case. You’re tasked with finding evidence to support your arguments. Before you, in physical and virtual copies, are millions of documents that were seized from Enron’s email servers, databases, and corporate laptops.

With a sigh, you type into the computer’s search bar: “stock.”

After about ten minutes of searching, the computer pulls up thousands of documents. It’s up to you to cull through junk emails—like “A new item is in stock! Buy it now!”—to find meaningful results. And that’s assuming your evidence isn’t an email that actually says, “Congratulations, you just sold 10,000 ENRN on the market,” and so doesn’t show up in your search for “stock” at all.
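
To make that miss concrete, here is a minimal Python sketch of a case-insensitive keyword search. The two quoted emails come from the scenario above; the third email and the function name `keyword_search` are made up for illustration, and a real review platform would search an index rather than a list.

```python
# Hypothetical mini-corpus. The first two emails are quoted from the
# scenario above; the third is invented filler.
emails = [
    "A new item is in stock! Buy it now!",
    "Congratulations, you just sold 10,000 ENRN on the market",
    "Lunch on Friday?",
]


def keyword_search(documents, keyword):
    """Return every document that contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [doc for doc in documents if needle in doc.lower()]


hits = keyword_search(emails, "stock")
missed = [doc for doc in emails if doc not in hits]

print("Hits:", hits)      # only the junk email contains the word "stock"
print("Missed:", missed)  # the sale confirmation never surfaces
```

The search does exactly what it was told: it matches the letters s-t-o-c-k, not the idea of a stock transaction.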

Those are the limitations of keyword searching. Today we can do better, because we have NLP on our side. We can talk to our machines and have them talk back. So what are our options?

For RelativityOne users who leverage Fileread, the answer is machine learning question answering, or MLQA.

With this feature, we can ask the software plain-English questions: “Fileread, what stock transactions were completed?”

A Fileread model trained on 10 million pages of financial text would highlight the following in response to the above question:

If the documents had held no answer to the question, Fileread would not have highlighted anything. Because of its neural network and the knowledge it gained from training on financial data, it knows that AAPL is a stock ticker and that “order to buy” is a common phrase in the world of finance.
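
How might a model decide what to highlight, or to highlight nothing? One common approach in BERT-style extractive question answering is for the model to give every word a "start" score and an "end" score, choose the span with the highest combined score, and decline to answer when that score does not beat a "no answer" score. The sketch below shows only that selection step. This post does not say whether Fileread works this way; the email text, the scores, and the function name `best_span` are all hypothetical, and in a real system the scores would come from the neural network.

```python
def best_span(tokens, start_scores, end_scores, null_score, max_len=8):
    """Pick the highest-scoring answer span, or None if "no answer" wins.

    A span from token i to token j (i <= j) scores
    start_scores[i] + end_scores[j].
    """
    best, best_score = None, float("-inf")
    for i in range(len(tokens)):
        for j in range(i, min(i + max_len, len(tokens))):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    if best is None or best_score <= null_score:
        return None  # nothing worth highlighting
    i, j = best
    return " ".join(tokens[i:j + 1])


# Hypothetical email text and hand-picked scores, for illustration only.
tokens = "We received your order to buy 100 shares of AAPL".split()
start_scores = [0.1, 0.2, 0.1, 4.0, 0.3, 1.0, 0.8, 0.2, 0.1, 0.5]
end_scores = [0.1, 0.1, 0.1, 0.3, 0.2, 0.9, 0.4, 1.2, 0.1, 4.5]

# Strong start and end scores: a span is returned for highlighting.
print(best_span(tokens, start_scores, end_scores, null_score=2.0))

# Uniformly weak scores: "no answer" wins and nothing is highlighted.
print(best_span(tokens, [0.1] * 10, [0.1] * 10, null_score=2.0))
```

The first call returns the span running from "order" to "AAPL"; the second returns None, which is the code equivalent of leaving the page unhighlighted.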

How Does It Work?

Fileread begins with transformer models, which are the foundation on which MLQA rests. Transformer models are “pre-trained” on large volumes of English text, from which they learn how words are associated with each other. At the end of this process, the model has several neural network layers that can turn English phrases into mathematical representations. For a transformer model to become an MLQA model, it needs to be “fine-tuned,” which is the process of training it for a particular purpose.

To picture “pre-training” and “fine-tuning,” imagine the MLQA model as a kid. This kid learns the basics of English by reading different books (pre-training). Then, she learns to pass the SAT’s English comprehension section by practicing on mock exams and studying the rules of a “good” exam answer (fine-tuning). The foundations of her knowledge are cultivated through reading, but she becomes better at passing the SAT by practicing that particular skill.
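
What does a “mathematical representation” of a word look like? In practice it is a list of numbers (a vector), and words with related meanings end up with vectors that point in similar directions. A standard way to compare two vectors is cosine similarity. The sketch below uses three-number vectors picked by hand purely for illustration; a real transformer produces vectors with hundreds of dimensions, and these values come from no actual model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-picked toy vectors (hypothetical, not output from any model).
vectors = {
    "butterfly": [0.9, 0.8, 0.1],
    "caterpillar": [0.8, 0.9, 0.2],
    "dog": [0.1, 0.2, 0.9],
}

insect_pair = cosine_similarity(vectors["butterfly"], vectors["caterpillar"])
cross_pair = cosine_similarity(vectors["butterfly"], vectors["dog"])

print(f"butterfly vs caterpillar: {insect_pair:.2f}")
print(f"butterfly vs dog:         {cross_pair:.2f}")
```

With these toy numbers the butterfly and caterpillar score close to 1 while the butterfly and dog score much lower, which is the kind of “degree of difference” a trained model learns from text on its own.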

The Importance of Innovation

It’s no secret that discovery is a critical piece of the litigation process. According to Duke University, firms typically allocate 20 to 50 percent of their litigation budget to discovery. At the same time, the volume of data grows about 8 percent every two years, making e-discovery harder to manage and slower to process manually.

According to a RAND report, more than 70 percent of discovery resources are dedicated to manually reviewing documents. The report warns that, without more advanced technology, human review productivity will hit an upper bound and results will become inconsistent. By using a solution built on neural networks deeply trained in English, however, firms can empower their teams to become more productive and accurate. Even a 10 percent productivity improvement in review could mean millions of dollars saved overall. AI can give legal teams the edge in this data arms race.
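
For a rough sense of scale, the figures above can be multiplied together. The sketch below assumes a hypothetical litigation budget of $100 million, combines the Duke range (20 to 50 percent on discovery) with the RAND figure (70 percent of discovery on manual review, used here as a lower bound), and treats a 10 percent productivity improvement as a 10 percent cut in review cost. Those simplifications are ours, not the reports', so read the result as a back-of-the-envelope estimate.

```python
def review_savings(litigation_budget, discovery_share, review_share, improvement):
    """Estimate dollars saved if manual review cost falls by `improvement`."""
    return litigation_budget * discovery_share * review_share * improvement


budget = 100_000_000  # hypothetical litigation budget, in dollars

low = review_savings(budget, 0.20, 0.70, 0.10)   # discovery at 20% of budget
high = review_savings(budget, 0.50, 0.70, 0.10)  # discovery at 50% of budget

print(f"Estimated savings: ${low:,.0f} to ${high:,.0f}")
```

Under these assumptions the savings land between about $1.4 million and $3.5 million, or 1.4 to 3.5 percent of the whole litigation budget.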

Conclusion

BERT has paved the way for countless opportunities in many different industries, including legal. In the field of e-discovery, it offers a new search workflow that demands less exact wording and less tedium, and could potentially save an organization tremendous amounts of time and money.

Emily is the head of marketing at Fileread, a company dedicated to applying the latest innovations of artificial intelligence to accelerate the gathering of truth for the legal industry. With her data science background from the University of Southern California, Emily seeks to bring a mathematical approach to marketing.