Validation in Review 101: Statistical Concepts for Evaluating Review Accuracy and AI

Editor's Note: With so much chatter around AI happening in the legal world and beyond, we’re receiving a lot of requests to get back to basics. What are the foundational validation techniques teams use to ensure artificial intelligence is delivering accurate, defensible results—in addition to faster ones? In today’s article, Grace Shao, one of Relativity’s applied science interns, provides a playful example that illustrates these statistical validation techniques in more straightforward terms. Give it a read to build (or reinforce) your understanding of these concepts and their importance in your next project.

The use of artificial intelligence in legal review empowers case teams to prioritize documents predicted to be most relevant. Such prioritization lets teams work through large data sets more quickly and get ahead on case strategy by seeing the most important documents earlier in the review. As data volumes grow, prioritizing documents and streamlining manual work have become major motivators for the development of emerging legal technologies.

In the past, developments in search and information retrieval (such as Boolean search, keyword search, and active learning) empowered case teams to search through documents more effectively. Today, with powerful new developments in artificial intelligence, there is a lot of excitement surrounding possible methods to further streamline the review process. Practitioners see much of AI’s potential benefit in review in its ability to find documents faster, reducing the number of documents that must be labelled by humans.

However, this reduction creates tension with the caution many legal teams feel toward AI and with their established trust in expert human reviewers. Encouraging adoption of new tools thus requires resolving this tension. Although it may seem that the objective quality of a review is fundamentally tied to the tool used for the review, it is actually more closely tied to the validation process used to evaluate it. In this article, we aim to provide an intuitive explanation of that relationship.

Saba and Her Candy

Say our friend Saba has accumulated a large pile of 10,000 wrapped gift boxes over the years. Each contains either candy or coal and is labelled with a note. She wants to find at least 90% of the candy (call this her recall target) while unwrapping the minimum number of boxes.

The percentage of boxes containing candy out of the total pile is called the richness of her data set.

Validation graphic — Saba has a lot of boxes to get through on her search for candy over coal.

Richness, Randomness, and Sample Size

Before she starts, Saba first wants to know how much candy she’s looking for. To estimate how many candy boxes there are among the total, Saba begins by taking a sample of the pile. She knows that a good sample—a subset of the data in her collection—must be random and sufficiently sized, so that the richness of the sample is likely to be representative of the entire pile.

The richness of a biased sample (one chosen based on some criterion rather than truly at random) would not be representative of the pile, since the probability of sampling a box might be related to its probability of containing candy. For instance, if she looked through only the lightest few boxes, the portion of candy boxes in that sample would overestimate the richness of the entire pile, since candy is lighter than coal. Ordering by other variables that are indirectly related to the probability of candy also leads to an unrepresentative statistic: if she pulled sample boxes in order of delivery date, she might sample boxes that arrived around her birthday or another holiday, which might also influence their probability of containing candy.

In a completely random sample, each box has an equal probability of being picked, which eliminates the relationship between the probability of sampling a box and the probability of it containing candy. By ensuring the independence of these two variables, Saba ensures that her sample is more representative.
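The contrast between a random and a biased sample can be sketched with a small simulation. Everything in it is a hypothetical stand-in: the box weights are made-up numbers, chosen only so that candy boxes tend to be lighter than coal boxes, and the pile is built with exactly 20% candy.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical pile: 2,000 candy boxes (lighter) and 8,000 coal boxes (heavier).
# The weights in grams are invented for illustration only.
boxes = [("candy", random.gauss(400, 100)) for _ in range(2000)]
boxes += [("coal", random.gauss(700, 100)) for _ in range(8000)]


def richness(sample):
    """Fraction of boxes in the sample that contain candy."""
    return sum(1 for contents, _ in sample if contents == "candy") / len(sample)


true_richness = richness(boxes)  # 0.20 by construction

# Random sample: every box is equally likely to be picked.
random_sample = random.sample(boxes, 1000)

# Biased sample: the 1,000 lightest boxes, so selection depends on weight.
lightest_sample = sorted(boxes, key=lambda box: box[1])[:1000]

print(f"true richness:          {true_richness:.1%}")
print(f"random sample richness: {richness(random_sample):.1%}")
print(f"lightest-boxes sample:  {richness(lightest_sample):.1%}")
```

Under these assumptions, the random sample should land near the true richness, while the lightest-boxes sample should be made up almost entirely of candy and badly overestimate it.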

The probability of a representative sample also increases with sample size, because the sample statistic varies less. At one extreme, if Saba sampled only one box from the pile, her richness estimate would be either 0% or 100%, and it would vary wildly with repetition, leaving her little confidence in the estimate. At the other extreme, if Saba sampled 9,000 boxes, her richness estimate would not vary much, and she would be very confident in her estimate of the population richness, but it would require a lot of manual checking.

In between these extremes, her confidence level increases with her sample size. However, this is not a linear relationship: confidence comes from sample size, not proportion of the population. For example, we can get a good estimate for the average apple weight from a sample of 1,000 apples, which is not even close to 1% of all existing apples. With increasing sample size, the law of diminishing returns applies. The difference in confidence between sampling 50 and 100 apples is much larger than the difference between 5,000 and 10,000.
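The diminishing returns can be made concrete with a rough calculation. The sketch below uses the textbook normal-approximation margin of error for a proportion, which is an assumption of this example rather than a method named in the article. It also uses a proportion of 20% instead of apple weights, purely for illustration. Note that the population size does not appear in the formula at all.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n items."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (50, 100, 5000, 10000):
    print(f"n = {n:>6}: +/- {margin_of_error(0.2, n):.2%}")

# How much each doubling of the sample narrows the margin:
gain_small = margin_of_error(0.2, 50) - margin_of_error(0.2, 100)
gain_large = margin_of_error(0.2, 5000) - margin_of_error(0.2, 10000)
print(f"gain from 50 -> 100:       {gain_small:.2%}")
print(f"gain from 5,000 -> 10,000: {gain_large:.2%}")
```

Doubling the sample from 50 to 100 tightens the margin by roughly ten times as much as doubling it from 5,000 to 10,000, even though the second doubling costs 5,000 extra boxes.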

Confidence intervals graphic: Confidence intervals narrow as sample size increases, along a nonlinear curve.

Ultimately, Saba chooses a sample size of 1,000 and finds 200 candy boxes in her sample. As a result, she estimates her richness to be 20.0%, and she is 95% confident the true richness is between 17.6% and 22.6%.
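The article does not say which method produced this interval. One common choice is the exact (Clopper-Pearson) binomial interval, and the sketch below assumes it. The function names are hypothetical, and the interval is found with plain bisection so that only the standard library is needed.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def _root(increasing_func, iterations=60):
    """Bisection on [0, 1] for a function that rises through zero."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if increasing_func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(successes, n, confidence=0.95):
    """Clopper-Pearson confidence interval for a proportion."""
    tail = (1 - confidence) / 2
    # Lower bound: the p at which seeing this many successes or more has probability `tail`.
    lower = 0.0 if successes == 0 else _root(
        lambda p: (1 - binom_cdf(successes - 1, n, p)) - tail)
    # Upper bound: the p at which seeing this many successes or fewer has probability `tail`.
    upper = 1.0 if successes == n else _root(
        lambda p: tail - binom_cdf(successes, n, p))
    return lower, upper


low, high = exact_interval(200, 1000)
print(f"richness 95% interval: {low:.1%} to {high:.1%}")
```

Other interval methods (for example, the Wilson score interval) give slightly different endpoints for the same counts, so small differences from the quoted figures would not be surprising.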

Saba’s First Review Tool: A Scale

To sort through the pile more efficiently, Saba builds a scale that sorts the boxes by weight. Since coal is heavier than candy, she guesses that the lightest 20% of boxes (2,000 boxes) are likely to have candy, which sets the cutoff at 500 grams. She unwraps the boxes in the “reviewed set,” which weigh less than 500 grams, and ignores the boxes in the “discard set,” which weigh more than 500 grams. In this way, her scale helps her reduce the number of boxes she must look at.

To understand how well her strategy works, Saba wants to measure two things: the recall, which is the portion of all candy boxes that are captured in the reviewed set, and the elusion rate, which is the portion of boxes in the discard set that contain candy.

Estimating Recall and Elusion Rate

Since Saba does not know the actual contents of all the boxes, the recall and elusion rate must be estimated using a sample. As with the sample for estimating richness, randomness and sample size are important for estimating these statistics with a good level of confidence.

To estimate recall, Saba looks at the 200 candy boxes in the sample of 1,000 she used to estimate richness. Out of these 200 boxes, she counts that 120 were predicted to have candy, so she estimates the recall to be 60.0%, and she’s 95% confident the true recall is between 52.9% and 66.8%.

However, it’s important to note that confidence in this estimate depends partly on richness: recall is estimated only from the target items in the sample, so when richness is very low, a larger overall sample is needed to capture enough of them. In Saba’s case, if the sample contains fewer candy boxes, the estimated proportion of candy boxes that were predicted to have candy is much more variable.
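A rough calculation shows how strongly richness drives the sample size. The sketch below assumes the normal-approximation margin of error, treats the number of target items in the sample as its expected value (richness times sample size), and fixes the true recall at 60%. All three are simplifying assumptions, and the function names are hypothetical.

```python
import math


def recall_margin(richness, sample_size, recall=0.6, z=1.96):
    """Approximate 95% margin of error for recall estimated from a richness sample."""
    expected_targets = richness * sample_size  # e.g. 0.20 * 1,000 = 200 candy boxes
    return z * math.sqrt(recall * (1 - recall) / expected_targets)


def sample_size_for_margin(richness, margin, recall=0.6, z=1.96):
    """Approximate total sample size needed to reach a given recall margin of error."""
    targets_needed = z ** 2 * recall * (1 - recall) / margin ** 2
    return math.ceil(targets_needed / richness)


for richness in (0.20, 0.05, 0.01):
    print(f"richness {richness:.0%}: "
          f"margin with 1,000 sampled = +/- {recall_margin(richness, 1000):.1%}, "
          f"sample needed for +/- 5% = {sample_size_for_margin(richness, 0.05):,}")
```

The required number of target items stays the same in every case. What changes is how many boxes must be opened to find that many targets, which grows in inverse proportion to richness.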

Estimating recall & elusion rate graphic — The width of the confidence interval for the estimation of model recall decreases with sample size. Lower richness requires a larger sample to reach a given width. This is at a true recall of 60%.

To estimate her elusion rate, Saba takes a random sample of 500 boxes from her discard pile. In this sample, she finds 50 candy boxes, so she estimates her elusion rate to be 10%, and she’s 95% confident the true elusion rate is between 7.5% and 13.0%.
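As a sanity check, the elusion and richness estimates can be combined into a second, indirect estimate of recall. This cross-check is not part of Saba's procedure as described. It is a sketch that uses only the point estimates quoted in the article and ignores their uncertainty.

```python
total_boxes = 10_000
reviewed_boxes = 2_000                        # the lightest 20% of the pile
discard_boxes = total_boxes - reviewed_boxes  # 8,000 boxes left unopened

richness_estimate = 200 / 1000  # candy boxes found in the richness sample
elusion_estimate = 50 / 500     # candy boxes found in the discard sample

# Scale both sample rates up to the piles they describe.
estimated_candy_total = richness_estimate * total_boxes    # about 2,000 boxes
estimated_candy_missed = elusion_estimate * discard_boxes  # about 800 boxes

# Recall implied by elusion: the share of all candy that did NOT end up in the discard set.
implied_recall = 1 - estimated_candy_missed / estimated_candy_total
print(f"recall implied by elusion: {implied_recall:.0%}")
```

The implied recall agrees with the 60% that Saba estimated directly from the 200 candy boxes in her richness sample, which suggests the two samples tell a consistent story.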

Thus, Saba is able to get an idea of how well her scale performs. However, although the scale performs better than random chance, she still has a lot more boxes to go through before she can confidently say she has found 90% of all the candy.

Saba’s Second Review Tool: A Linear Classifier

Saba continues receiving these gifts. As her boxes pile up again, the world changes, and new tools are introduced. She finds that some boxes have coal dust, and others have giant gummy bears, so Saba decides sorting by weight is not enough. She builds a robot that also considers the box size, and whether the attached note mentions “candy” or “coal.” With box size and weight, the robot can calculate the density, which is a more accurate predictor of candy or coal. The presence of “candy” or “coal” on the note is also a strong indicator, but it sometimes turns out to be a red herring.

Although her tool has changed, Saba is still able to use the same techniques in evaluating its performance. With a sufficiently large random sample, Saba can be confident that the result of the robot’s sorting and her manual box-opening meets her 90% recall target. Compared to the simple scale, this robot is much better at predicting the presence of candy, and it takes Saba much less time to find at least 90% of the candy.

Using the simple scale, Saba had to exert more effort in order to make up for the scale’s poor performance and reach the target—but with a better tool, she can reach the target with less manual work.

Saba’s Third Review Tool: Artificial Intelligence

The next year, even fancier tools are created. This time, Saba finds a teacher robot capable of using the contents of past boxes to teach a student robot how to predict candy based on the message in the attached notes. The student robot works quite well, finding that notes which praise Saba are highly correlated with boxes that contain candy. It seems to understand the content of the note, dealing with nuances (such as mentions of the opposing opinion) with ease. However, it does not recognize sarcasm, and often predicts boxes of coal attached to sarcastic praise to contain candy.

Although Saba doesn’t completely understand how the student robot works, she knows she can use the same validation techniques to be confident in its results. Thus, although she is less familiar with the means of this tool, she understands the ends. Like her past processes, the student robot returns four categories: candy boxes correctly predicted to have candy, candy boxes incorrectly predicted to have coal, coal boxes correctly predicted to have coal, and coal boxes incorrectly predicted to have candy. Those four categories are all she needs to measure the quality of the new tool’s results.
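The link between the four categories and the statistics used throughout this article can be sketched as follows. The class and field names are hypothetical. The example counts are not measurements of the student robot. They are the counts implied by scaling the scale's point estimates (20% richness, 60% recall, 10% elusion) up to the full pile of 10,000 boxes, used here only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReviewOutcome:
    candy_predicted_candy: int  # candy boxes correctly predicted to have candy
    candy_predicted_coal: int   # candy boxes incorrectly predicted to have coal
    coal_predicted_coal: int    # coal boxes correctly predicted to have coal
    coal_predicted_candy: int   # coal boxes incorrectly predicted to have candy

    @property
    def total(self) -> int:
        return (self.candy_predicted_candy + self.candy_predicted_coal
                + self.coal_predicted_coal + self.coal_predicted_candy)

    @property
    def richness(self) -> float:
        """Share of all boxes that contain candy."""
        return (self.candy_predicted_candy + self.candy_predicted_coal) / self.total

    @property
    def recall(self) -> float:
        """Share of all candy boxes that were predicted to have candy."""
        return self.candy_predicted_candy / (
            self.candy_predicted_candy + self.candy_predicted_coal)

    @property
    def elusion_rate(self) -> float:
        """Share of boxes predicted to have coal (the discard set) that hold candy."""
        return self.candy_predicted_coal / (
            self.candy_predicted_coal + self.coal_predicted_coal)


# Illustrative counts only (see the note above).
outcome = ReviewOutcome(
    candy_predicted_candy=1200,
    candy_predicted_coal=800,
    coal_predicted_coal=7200,
    coal_predicted_candy=800,
)
print(f"richness {outcome.richness:.0%}, recall {outcome.recall:.0%}, "
      f"elusion {outcome.elusion_rate:.0%}")
```

Nothing in this calculation depends on how the predictions were made, which is why the same validation applies to a scale, a linear classifier, or a student robot. In practice the four counts are not known for the whole pile, so each statistic is estimated from a random sample as described earlier.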

Legal Review

In the same manner, fancy new tools in e-discovery might outpace the general understanding of how they’re built, but the validation techniques remain the same. Although Saba’s tools changed, her validation process ensured she reached her target, albeit with different amounts of manual work. For e-discovery, the tools used in legal review only affect the amount of time and money human case teams must spend to reach certain criteria, while confidence in their results comes from validation processes that have previously been deemed defensible in court.

Graphics for this article were created by Natalie Andrews.


Grace Shao is an applied sciences intern at Relativity. She is also a student, teaching assistant, and research assistant at the University of Chicago, where she is studying computing, machine learning, and statistics.