Concerns about bias appear regularly in the world of law; all participants in legal matters raise concerns about actual and perceived bias. Whether the bias lies with the advocate, the jury, the judge, or the prosecutor, the perception of unfairness affects both the conduct and the outcome of legal proceedings. There’s no denying it: bias undermines the principles and application of any ideally impartial legal system.
There is another layer of bias that we don’t talk about as frequently: bias in the use of AI. The world of legal artificial intelligence and machine learning tools is not immune from human-introduced bias and, if anything, may be even more susceptible to bias because of the way we train the models. In legal applications of machine learning, a lawyer’s work product is used to train the models to recognize documents as generally responsive or non-responsive to an issue. The human lawyer’s reactions to the documents can color that determination of responsiveness. The trainer’s relationship to and experience with the data contained within those documents affects the way responsiveness is coded.
For example, I once ran a review involving an alleged corporate fraud committed at a company run by a husband and wife. One of the reviewers had coded many of the emails sent by the wife as not relevant, even though they clearly were. When asked why he had done that despite the fact they were relevant, he said: “I just thought that if they were [the wife’s] emails they weren’t important.” It wasn’t intentionally malicious; his belief was just that her involvement as the “wife” meant that everything went through her husband, so his emails mattered and hers didn’t.
The data we collect and utilize in large-scale legal reviews are often, by their nature, skewed. Documents may be cherry-picked, and vital components of the overall data set may be ignored, because the selector, the trainer, and the algorithm deem the content too small or too insignificant to warrant further review. For example, because there is cost associated with identifying and collecting data potentially relevant to a litigation, lawyers will often do their best to focus the search for relevant documents on those search terms most likely to hit on potentially relevant information. There is, of course, nothing wrong with this—such an exercise supports the pursuit of proportionality in e-discovery. However, the process by which we select those terms can attract bias: consider, for example, a search term list that omits the names of the female site managers.
The examples above are more blatant and more easily identifiable than the subtle, unconscious bias that affects the training of AI models. But they do show that malice isn’t necessary for biased results to emerge from a legal review process.
In 2019, the artist Jake Elwes created a work called “Zizi – Queering the Dataset” to call attention to the difficulty AI tools have with limited or confusing data and nuance. In the work, Elwes took a data set commonly used to train facial recognition systems and added 1,000 images of drag and gender fluid faces found online. The resulting imagery renders faces that could not be found in the original data. Shifted away from normative identities, Elwes coaxes out images that show artificial drag makeup applied to different individuals. The work is meant to challenge the binary way in which many facial recognition tools work, demonstrating that what is and is not included in a data set, and how we code it, has profound downstream effects on the resulting product.
Flowing from this, one might ask: what is the value of embracing the non-binary in legal AI-based analysis? Simply put, the value is in the uncovering of gray areas, the nuance, the unexplored areas of the data set—and recognizing the tool might not always get it exactly right.
The yes/no, positive/negative, binary selection process can often exclude those nuanced components, which touch on the areas of fluidity between relevancy and non-relevancy. In running AI-enabled reviews, we almost always decide to “cut off” a data set below a certain relational score—arguing that things below that score are outside of the model and, as such, are not valuable or material to the case. I have always found it striking that the legal industry fights so hard to maintain an appearance of the black-and-white letter of the law, yet can so easily cast aside the tremendous nuance, flexibility, and, oftentimes, sensitivity that lies in many of our laws—and certainly in most of our legal precedent.
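The “cut off” described above can be pictured as a simple threshold over model scores: everything below the line is cast aside, however close it sits to the boundary. The sketch below is purely illustrative—the scores, field names, and the notion of a “gray band” just under the cutoff are my assumptions, not any review platform’s API:

```python
def partition_by_cutoff(scored_docs, cutoff, gray_band=20):
    """Split model-scored documents into the set kept for review and the
    set cast aside, and flag the 'gray area' sitting just below the cutoff.

    scored_docs: list of dicts with an illustrative 'score' field (0-100).
    """
    kept = [d for d in scored_docs if d["score"] >= cutoff]
    excluded = [d for d in scored_docs if d["score"] < cutoff]
    # Documents just under the line: the nuance a binary cut discards.
    gray = [d for d in excluded if d["score"] >= cutoff - gray_band]
    return kept, excluded, gray
```

The point of the hypothetical `gray` list is that a document scoring one point below the cutoff is treated exactly like one scoring fifty points below it—the binary cut erases that distinction.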
In her book More Than a Glitch, Meredith Broussard suggests that a “regulatory sandbox” could be employed to seek out bias within a particular AI model. The “sandbox” is similar to other types of technology sandboxes: a place where developers and users can build, play with, adjust, and evaluate a model before putting it into wider use. I believe that this is a useful starting point for how we test the implemented AI tools in law. It’s certainly important for developers and builders to test the AI algorithm they have built to ensure it isn’t biased, but how do we users of legal AI ensure that the models we build and train within the AI algorithm don’t succumb to bias?
The solution is to challenge the model throughout the process. Those of us who use AI in legal document reviews, for example, are familiar with the process of quality control and ensuring that our team of reviewers is properly and consistently coding documents. We look for consistency in responsiveness calls, privilege, and so on. But we need to go a step further and step outside of what the model has determined to be responsive based on the training it has received. As I said above, the application of law is rarely black and white (though we might like it to be). By challenging the model and investigating for bias—as Broussard suggests we do in the sandbox during the building phase—we can uncover bias introduced during the model’s training.
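The consistency check described here can begin with something as simple as a pairwise agreement rate between two reviewers coding the same documents. This is a minimal sketch under my own assumptions—the labels are hypothetical, and it is not any vendor’s QC metric:

```python
def agreement_rate(codes_a, codes_b):
    """Fraction of documents that two reviewers coded identically
    (e.g., 'R' for responsive, 'N' for non-responsive)."""
    if len(codes_a) != len(codes_b):
        raise ValueError("reviewers must code the same set of documents")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)
```

A low agreement rate on an overlapping batch is exactly the kind of signal that prompted the follow-up conversation in the husband-and-wife example above: it tells you where to start asking why two people see the same document differently.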
So, how do we do that? Our goal as lawyers and legal professionals is to seek the truth. As such, our focus should be on those documents that our trained model has excluded from the responsive set—those documents, as I said above, that fall outside the cutoff score. Run a statistical sample on that document set to see what the model has excluded. See if there are, in fact, responsive documents, and then use additional analytics tools to find substantially similar documents in that excluded document set; just because a document is below the cutoff score does not mean the model properly excluded it.
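The sampling step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions—the document structure, field names, and cutoff are mine, not any review platform’s—showing a simple random sample of the excluded set and an estimate of how many responsive documents are hiding in it:

```python
import random

def sample_excluded(docs, cutoff, sample_size, seed=42):
    """Draw a simple random sample from the documents the model scored
    below the cutoff (i.e., the set excluded from review)."""
    excluded = [d for d in docs if d["score"] < cutoff]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(excluded, min(sample_size, len(excluded)))

def estimated_miss_rate(reviewed_sample):
    """Estimate the fraction of responsive documents in the excluded set,
    based on a human review of the sample ('responsive' is set by hand)."""
    if not reviewed_sample:
        return 0.0
    hits = sum(1 for d in reviewed_sample if d["responsive"])
    return hits / len(reviewed_sample)
```

If the estimated rate is meaningfully above zero, that is the cue to bring in the additional analytics tools the paragraph mentions—near-duplicate or similarity searches seeded with the responsive documents the sample turned up.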
In undertaking this process, it is valuable to use a team for those quality checks rather than a single reviewer—we don’t want one reviewer’s own biases to undermine the goal of lessening bias in the data set. Further, as with many other parts of our work lives, bringing in a diverse team of reviewers and trainers can also help reduce bias.
By recognizing and valuing the outliers in data sets, we reduce the risk of missing something that the algorithm might not consider valuable. While this might not seem “objective,” non-binary coding actually is, because it recognizes and attempts to control for the biases that exist not only in the data set but in the model itself.
Pushing the process to explore the seemingly unimportant—the less populous, the smaller end of the positive/negative ranking scale—like Elwes’s work, “queers” the entire data analysis and recognizes the value of those documents.