Don't Stare Directly at Your Data: 3 Ways to Use Sampling to Save Time (and Your Eyes)

With the solar eclipse upon us, there are repeated warnings not to stare directly at the sun. Of course, this advice is true all the time, but with the spectacle of an eclipse, the temptation is greater than ever.

This made us realize something: Often when we receive new data, we’re tempted to start looking directly at it as soon as possible. How many times have you been asked to load all the data now so that reviewers can get started? But doing so leaves you blind to what’s actually in your data as you wait for the review to progress.

My colleague Jeff Gilles and I recently took a look at how we could use sampling to make some common workflows more efficient and better understand our case data earlier. Here are three ways you can use sampling to look at your data without hurting your eyes.

1. Organize Your Review Team

A big part of sampling is understanding what kinds of documents—and how many of each type—are in your review set.

For example, if we wanted to know how many French documents we have in our review set, we could use sampling to determine that count and have a good idea how many French-speaking reviewers we may need.

More specifically, suppose we used a tool like Relativity Analytics to run language identification on 100,000 random documents from a five-million-document review set and found that 30,000 (30 percent) of them contained French. Applying that same 30 percent to the full set, we could be reasonably confident that there are approximately 1.5 million French documents in our case. This allows us to make more informed decisions on the cost and time of review, all without looking directly at the data.
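To put a rough range around that estimate, here is a minimal sketch of the arithmetic in Python. It assumes a simple random sample and uses the standard normal-approximation interval for a proportion; the function name `estimate_total` and its parameters are ours for illustration, not part of any Relativity API. It also only accounts for sampling error, not for any mistakes the language identification itself might make.

```python
import math

def estimate_total(sample_size, hits, population_size, z=1.96):
    """Extrapolate a sample proportion to the full set, with an approximate 95% range."""
    p = hits / sample_size
    # Standard error of a proportion under simple random sampling
    se = math.sqrt(p * (1 - p) / sample_size)
    margin = z * se
    low = max(0.0, p - margin)
    high = min(1.0, p + margin)
    return {
        "proportion": p,
        "estimate": round(p * population_size),
        "low": round(low * population_size),
        "high": round(high * population_size),
    }

# Numbers from the French-language example above
result = estimate_total(sample_size=100_000, hits=30_000, population_size=5_000_000)
print(result)
```

With a sample this large, the range works out to roughly 1.49 million to 1.51 million documents, which is plenty precise for staffing decisions.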

2. Strengthen Your Analytics Index

Data sets often include repeated content that you don’t necessarily want to include in your conceptual analytics index—like boilerplate content on the bottom of an email. By using repeated-content identification, you can find this type of content and filter it from the conceptual analytics index to produce stronger results.

Sampling can take this a step further and help you get a feel for how much repeated content you're dealing with. If we find a large amount of repeated content in a sample set, we can expect more instances of that same content in the larger data set we didn't analyze, tipping us off that we may want to do some fine-tuning before building our analytics index.
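As a rough illustration of the idea, the sketch below counts lines that show up in several documents of a small sample. This is a deliberately simple stand-in and not how Relativity's repeated-content identification works; the function `repeated_lines`, its thresholds, and the sample emails are all hypothetical.

```python
from collections import Counter

def repeated_lines(sample_docs, min_docs=3, min_length=20):
    """Return lines that appear in at least `min_docs` of the sampled documents."""
    doc_counts = Counter()
    for text in sample_docs:
        # Use a set so a line repeated inside one document is counted once
        lines = {line.strip() for line in text.splitlines()}
        doc_counts.update(line for line in lines if len(line) >= min_length)
    return {line: n for line, n in doc_counts.items() if n >= min_docs}

# Hypothetical sample: three emails share the same footer
footer = "This email and any attachments are confidential."
sample_docs = [
    "Hi team, see the attached draft.\n" + footer,
    "Lunch on Friday?\n" + footer,
    "Quarterly numbers are in.\n" + footer,
    "No footer in this one.",
]

found = repeated_lines(sample_docs)
# Share of sampled documents containing the most common repeated line
share = max(found.values()) / len(sample_docs) if found else 0.0
print(found, share)
```

If a footer like this turns up in most of the sample, it is a good candidate for filtering before the index is built.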

3. Improve Stratified Sampling in Technology-Assisted Review

Relativity Assisted Review includes the option to perform stratified sampling, a power tool that quickly categorizes documents in an assisted review project. Stratified sampling is the best way to reduce the number of uncategorized documents.

But we wondered if there was a way to make it run even faster. What if we used a large random sample of documents and then ran stratified sampling on just those?

So, we created a 100,000-document random sample of uncategorized documents and ran stratified sampling on just that set. Stratified sampling ran much faster on the smaller set, while our results were similar to those from a stratified sample of a much larger data set.
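The two-step shape of this workflow can be sketched in Python. To be clear about the assumptions: this uses generic, textbook stratified sampling (proportional picks from each group), not the stratified sampling feature in Relativity Assisted Review, and the stratum labels here are made-up cluster IDs. The function `random_then_stratified` is hypothetical.

```python
import random
from collections import defaultdict

def random_then_stratified(doc_strata, random_size, stratified_size, seed=0):
    """Draw a simple random sample, then a proportional stratified sample from it.

    `doc_strata` maps a document ID to a stratum label (hypothetical, e.g. a cluster ID).
    """
    rng = random.Random(seed)
    doc_ids = sorted(doc_strata)
    subset = rng.sample(doc_ids, min(random_size, len(doc_ids)))

    # Group the randomly sampled documents by stratum
    by_stratum = defaultdict(list)
    for doc_id in subset:
        by_stratum[doc_strata[doc_id]].append(doc_id)

    picked = []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        # Proportional allocation, with at least one document per stratum
        quota = max(1, round(stratified_size * len(members) / len(subset)))
        picked.extend(rng.sample(members, min(quota, len(members))))
    return subset, picked

# Hypothetical data: 10,000 documents spread across five clusters
doc_strata = {i: f"cluster-{i % 5}" for i in range(10_000)}
subset, picked = random_then_stratified(doc_strata, random_size=1_000, stratified_size=50)
print(len(subset), len(picked))
```

Because the second step only ever touches the random subset, its cost depends on the size of that subset instead of the full collection.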

Sampling may not be new to e-discovery, but you can always find more ways to incorporate it into common workflows, saving your team time (and keeping their eyes fresh).

How have you used large random sample sets in your workflows? Let us know in the comments below.

Jacob Cross is a member of Relativity’s customer success team, where he helps Relativity users make the most of the platform. He has worked in the e-discovery industry since 2007, helping clients use technology to increase productivity and reduce overall review time.