by Bridgette Harris on July 28, 2016
Clustering is a powerful feature of text analytics for e-discovery. By automatically grouping documents with similar conceptual content—with no user input required—it can be incredibly helpful in quickly identifying important documents or concepts within a case.
There are plenty of ways to make the most of clustering during your e-discovery projects, from beginning to end. Here are some ways we’ve used it to make review faster and more thorough.
1. Examining clusters by custodian for a specific subject matter.
Helps focus on a particular subject matter, even if it’s spread across many custodians.
Review protocols are generally written to give all reviewers some guidance on the subject matter of the review and on how to approach different kinds of documents. However, these instructions are generally written without regard for the unique data collected from each custodian. That isn’t always helpful when you are applying a huge protocol to batches of documents from a wide range of custodians, such as engineers and executives.
A better way to get through documents on a specific subject matter quickly is to apply clustering per custodian. Let’s say you’re working on a class action lawsuit regarding the faulty manufacturing of the steering wheel in a specific car model. You can apply clustering and look at the “Car X Steering Wheel” cluster when reviewing the executives’ data and the car’s main engineers’ data, getting a bird’s-eye view of each custodian’s unique perspective on the issue in very little time.
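As a rough illustration of the per-custodian view, the sketch below assumes each document already carries a cluster label assigned by your review platform. The field names (`doc_id`, `custodian`, `cluster`) and the sample records are hypothetical and do not reflect any particular product's export format.

```python
from collections import defaultdict

# Hypothetical records; a real export would use its own field names.
documents = [
    {"doc_id": "D1", "custodian": "Executive A", "cluster": "Car X Steering Wheel"},
    {"doc_id": "D2", "custodian": "Engineer B", "cluster": "Car X Steering Wheel"},
    {"doc_id": "D3", "custodian": "Engineer B", "cluster": "Car X Steering Wheel"},
    {"doc_id": "D4", "custodian": "Executive A", "cluster": "Board Minutes"},
]


def cluster_by_custodian(docs, cluster_name):
    """Return {custodian: [doc_id, ...]} for a single cluster label."""
    view = defaultdict(list)
    for doc in docs:
        if doc["cluster"] == cluster_name:
            view[doc["custodian"]].append(doc["doc_id"])
    return dict(view)


steering_view = cluster_by_custodian(documents, "Car X Steering Wheel")
for custodian, doc_ids in sorted(steering_view.items()):
    print(custodian, doc_ids)
```

Each custodian's slice of the cluster can then be reviewed on its own, which is the bird's-eye comparison described above.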
2. Clustering to improve your TAR project.
Provides richer data to support the machine-learning process.
Once you’ve used clustering to identify key documents across different custodians, you can add them to the document set used to train the system for a technology-assisted review (TAR) project. Highly relevant training documents give the machine-learning tool a richer set of data, which improves the system’s understanding of your data set and can potentially get you to project completion faster.
3. Batching by clusters for effectiveness and efficiency.
Groups similar documents together to speed up review and improve the quality of output.
Clustering similar documents and batching them out for review can help reviewers specialize in a specific topic or type of document in a case. If you’re reviewing the documents of the executives in the class action lawsuit, for example, you can batch out and assign all the documents regarding board minutes to one reviewer. By focusing on one type of data, this reviewer can more readily spot the differences between sets of board minutes, put the emails regarding the board minutes into context, and quickly identify anything unusual.
It’s also more efficient to have a single reviewer focus on one type of document, rather than have a reviewer look at a random selection of disparate documents on a wide array of subject matters.
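The batching idea can be sketched in a few lines. This is a minimal example under stated assumptions: documents already have a cluster label, the `doc_id` and `cluster` field names are hypothetical, and the batch size is an arbitrary illustrative value. It is not how any specific review platform creates batches.

```python
from collections import defaultdict


def batch_by_cluster(docs, batch_size):
    """Group document IDs by cluster, then split each group into batches.

    A batch never mixes clusters, so each reviewer sees one topic at a time.
    """
    by_cluster = defaultdict(list)
    for doc in docs:
        by_cluster[doc["cluster"]].append(doc["doc_id"])

    batches = []
    for cluster in sorted(by_cluster):
        ids = by_cluster[cluster]
        for start in range(0, len(ids), batch_size):
            batches.append(
                {"cluster": cluster, "doc_ids": ids[start:start + batch_size]}
            )
    return batches


# Hypothetical sample data for illustration only.
sample_docs = [
    {"doc_id": "D1", "cluster": "Board Minutes"},
    {"doc_id": "D2", "cluster": "Car X Steering Wheel"},
    {"doc_id": "D3", "cluster": "Board Minutes"},
    {"doc_id": "D4", "cluster": "Board Minutes"},
    {"doc_id": "D5", "cluster": "Car X Steering Wheel"},
]

batches = batch_by_cluster(sample_docs, batch_size=2)
for batch in batches:
    print(batch["cluster"], batch["doc_ids"])
```

All "Board Minutes" batches could then be assigned to the same reviewer, in line with the example above.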
4. Clustering to remove clutter.
Removes documents that are irrelevant to the review.
One of the most powerful uses for clustering is removing items that have no relevance to a particular matter. Let’s say we are working on the same class action lawsuit, and board minutes are critical. Many of the executives in the case also sit on the boards of nonprofits, so their inboxes contain board minutes from those charities. Unlike the corporate board minutes, these nonprofit board minutes have nothing to do with the case. Clustering can help identify the nonprofit board minutes so you can set them aside or remove them from review, saving reviewer time.
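A minimal sketch of that clean-up step follows, assuming someone has already inspected the clusters and decided which labels are irrelevant. The cluster names and field names are hypothetical. The function separates documents rather than deleting them, so the set-aside pile can still be spot-checked.

```python
def set_aside_clusters(docs, excluded_clusters):
    """Split docs into (keep, set_aside) based on excluded cluster labels."""
    excluded = set(excluded_clusters)
    keep, set_aside = [], []
    for doc in docs:
        if doc["cluster"] in excluded:
            set_aside.append(doc)
        else:
            keep.append(doc)
    return keep, set_aside


# Hypothetical sample data for illustration only.
inbox = [
    {"doc_id": "D1", "cluster": "Corporate Board Minutes"},
    {"doc_id": "D2", "cluster": "Nonprofit Board Minutes"},
    {"doc_id": "D3", "cluster": "Car X Steering Wheel"},
    {"doc_id": "D4", "cluster": "Nonprofit Board Minutes"},
]

keep, set_aside = set_aside_clusters(inbox, ["Nonprofit Board Minutes"])
print("For review:", [d["doc_id"] for d in keep])
print("Set aside:", [d["doc_id"] for d in set_aside])
```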
5. Using “hot” topics to run quality control.
Clusters documents based on critical subject matter to verify coding decisions.
Even if you did not batch your documents out based on their clusters, you can use clustering to make sure you didn’t miss any important items.
If your case deals with faulty steering wheels for Car X, you can use clustering to find documents on that subject matter and see if any relevant documents in the hottest clusters were not identified during review. Items in your clusters, compared against coding decisions, will appear as a heat map, with darker or lighter shades of color indicating whether the groupings share coding values. This allows you to see, at a glance, if any documents that should possibly be marked responsive aren’t coded the way you’d expect.
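The comparison behind such a heat map can be approximated in plain code. The sketch below is an illustrative analogue, not the way any platform renders its visualization: it computes the share of responsive documents per cluster and lists documents in "hot" clusters that were not coded responsive. The `coding` field, the "Responsive" value, and the 0.5 threshold are all assumptions made for this example.

```python
from collections import defaultdict


def responsive_rate_by_cluster(docs):
    """Return {cluster: fraction of its documents coded 'Responsive'}."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [responsive, total]
    for doc in docs:
        counts[doc["cluster"]][1] += 1
        if doc["coding"] == "Responsive":
            counts[doc["cluster"]][0] += 1
    return {cluster: resp / total for cluster, (resp, total) in counts.items()}


def qc_candidates(docs, hot_threshold=0.5):
    """List doc IDs in mostly-responsive clusters that are not coded responsive."""
    rates = responsive_rate_by_cluster(docs)
    return [
        doc["doc_id"]
        for doc in docs
        if rates[doc["cluster"]] >= hot_threshold and doc["coding"] != "Responsive"
    ]


# Hypothetical coded documents for illustration only.
coded_docs = [
    {"doc_id": "D1", "cluster": "Car X Steering Wheel", "coding": "Responsive"},
    {"doc_id": "D2", "cluster": "Car X Steering Wheel", "coding": "Responsive"},
    {"doc_id": "D3", "cluster": "Car X Steering Wheel", "coding": "Not Responsive"},
    {"doc_id": "D4", "cluster": "Car X Steering Wheel", "coding": "Responsive"},
    {"doc_id": "D5", "cluster": "Nonprofit Board Minutes", "coding": "Not Responsive"},
    {"doc_id": "D6", "cluster": "Nonprofit Board Minutes", "coding": "Not Responsive"},
]

rates = responsive_rate_by_cluster(coded_docs)
recheck = qc_candidates(coded_docs)
print(rates)
print("Second look:", recheck)
```

A document flagged this way is only a candidate for a second look; the outlier coding may well be correct.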
Clustering helps to effectively separate different concepts in a way that simple searching doesn’t. By using this tool to sift through large amounts of data, you can ensure a more productive workflow and a more thorough process, and you can weed out irrelevant data.