This post originally appeared on the Relativity blog on August 21, 2015.
It’s 8:00 a.m. on a Monday and you just want your morning coffee, but you find yourself face-to-face with Anna, the e-discovery text analytics guru in your office. She wants to chat with you about the review project your case team got up and running late last week, and the strategy she’ll use in applying text analytics to it.
It’s one of your team’s first big cases with analytics, and Anna is thrilled—she’s got a lot of passion for the technology, and is excited to prove its worth to the rest of the case team. You’ve heard a lot about the time and cost savings analytics can deliver, so you’re cautiously optimistic.
“These timelines would never work without analytics, you know; we can’t manually review millions of documents by the end of the week,” Anna says. “Good thing we can work a little magic!”
“Sure, magic—because those analytics tools can make some of those duplicates we talked about disappear, right?” you joke.
Anna smiles and replies, “In fact, they can. That’s why we need to de-dupe the docs.”
When the data first arrived, Anna explained that there were plenty of textual near-duplicates—documents that are identical in almost every way, like incremental drafts of the same contract—in the collection. Turns out it doesn’t take long to run the process of identifying those dupes, which will ultimately save the effort of re-reviewing the same documents.
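If you're curious what "identical in almost every way" could mean to a computer, here is a minimal sketch. It assumes a simple word-shingle comparison scored with Jaccard similarity; this is an illustrative stand-in, not the method any particular analytics product uses, and the function names and the 0.8 threshold are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in a document."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(text_a, text_b):
    """Share of shingles the two documents have in common (0.0 to 1.0)."""
    a, b = shingles(text_a), shingles(text_b)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8):
    """Return pairs of document ids whose similarity meets the threshold.

    `docs` maps a document id to its extracted text. The threshold is an
    assumed cut-off chosen for illustration only.
    """
    ids = sorted(docs)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            if jaccard(docs[left], docs[right]) >= threshold:
                pairs.append((left, right))
    return pairs
```

Comparing every pair like this is fine for a handful of drafts; a collection of millions of documents would need a more scalable approach.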
A few hours later, Anna swings by your office with another update.
“De-duping the documents really cut down the data set—that’ll help our reviewers a lot,” Anna says.
“Glad to hear it,” you answer. “What’s next?”
“Well, the threading is almost done, too.”
Anna is definitely not sewing in the office.
Email threading organizes emails in the data set by conversation, so you can review them just like you’d see them in Outlook or another email application. In your case brief last week, Anna also explained that, if you run email threading on your documents, you can limit your review to only the documents that are inclusive. An inclusive email contains all the text of the earlier messages in its part of the chain, so once the inclusives are identified, you know they’re all your team needs to review to get the full picture from the email data.
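The idea of an inclusive email can be sketched in a few lines. This is a simplified model under stated assumptions: each email is represented as the set of individual messages it contains (its own new text plus anything quoted below it), and an email counts as inclusive when no other email in the thread contains everything it does and more. The function name and data layout are hypothetical, and real threading tools handle many more details, such as attachments and edited quotes.

```python
def inclusive_emails(thread):
    """Return the ids of emails that are not fully contained in another email.

    `thread` maps an email id to the set of message segments that email
    contains (assumed to be extracted already).
    """
    inclusive = []
    for email_id, segments in thread.items():
        # `<` on sets is a strict-subset test: another email has all of
        # this email's segments plus at least one more.
        covered = any(
            other_id != email_id and segments < other_segments
            for other_id, other_segments in thread.items()
        )
        if not covered:
            inclusive.append(email_id)
    return sorted(inclusive)


# A chain that forks: e3 ends one branch, e4 ends the other.
thread = {
    "e1": {"m1"},
    "e2": {"m1", "m2"},
    "e3": {"m1", "m2", "m3"},
    "e4": {"m1", "m4"},
}
print(inclusive_emails(thread))
```

In this example only the last email on each branch needs review, because each one carries the earlier messages along with it.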
Later that day, Anna lets you know that threading is complete and the review is ready to begin. Your team is well on its way to meeting that tight deadline, and you’ve gotten some valuable insight from Anna on how the data was prepped before reviewers jump in.
The next morning you again find Anna by the coffee maker. Curious to hear how her next project is going, you ask how she’s doing.
“Much better, now that I’ve had a few minutes to think. I’ve got another big case from Bill with crazy deadlines. We’re using TAR, but it’s a low-prevalence case, so I was a little worried about my recall,” Anna explains.
“Well, didn’t Bill give you anything to go on?” you ask. Bill is usually great at briefing the team before kicking off a review.
“He did, and that’s precisely my solution,” Anna exclaims. “He already pointed us in the direction of some responsive documents. I can just use those as pre-coded seeds and jumpstart the TAR process with data I know to be conceptually meaty and relevant. You’re really picking up on this stuff!”
Anna isn’t worried about her memory. Recall refers to the percentage of all responsive documents in the data set that the system successfully finds during a technology-assisted review (TAR) workflow.
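As a quick illustration, recall is a simple ratio. The sketch below assumes you already know, or have estimated, how many responsive documents exist in total; the function name and the figures in the example are hypothetical.

```python
def recall(found_responsive, total_responsive):
    """Fraction of all responsive documents that the review actually found."""
    if total_responsive <= 0:
        raise ValueError("total_responsive must be positive")
    return found_responsive / total_responsive


# Hypothetical case: 1,000 responsive documents exist and the system found 800.
print(f"{recall(800, 1000):.0%}")
```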
Since TAR will help Bill’s team code the data faster based on what they teach it about the case, Anna wants it to learn what responsive means fairly quickly. Low prevalence means responsive documents make up only a small share of the data set, so a random sample may turn up too few of them to train the system effectively. After all, she needs to give it enough data to pick up on the key concepts that drive responsiveness. Fortunately, that’s much easier to do when she can start the process by training on previously coded documents the team has already identified as responsive.
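Some back-of-the-envelope arithmetic shows why low prevalence makes random sampling slow going. The sketch below uses made-up numbers (a million documents, 1 percent of them responsive) and hypothetical function names; it only computes expected counts and says nothing about how a particular TAR engine is trained.

```python
def expected_hits(sample_size, responsive_docs, total_docs):
    """Expected number of responsive documents in a simple random sample."""
    return sample_size * responsive_docs / total_docs


def sample_size_for(target_hits, responsive_docs, total_docs):
    """Smallest sample expected to contain at least `target_hits` responsive docs."""
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-target_hits * total_docs // responsive_docs)


# Assumed figures for illustration: 1,000,000 documents, 10,000 responsive (1%).
TOTAL, RESPONSIVE = 1_000_000, 10_000

print(expected_hits(500, RESPONSIVE, TOTAL))    # responsive docs expected in a 500-doc sample
print(sample_size_for(50, RESPONSIVE, TOTAL))   # sample needed to expect 50 responsive docs
```

With these assumed figures, reviewers would read thousands of randomly chosen documents to collect a few dozen responsive examples, which is the gap that pre-coded seeds help close.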
Chatting with your litigation support team about these updates can be immensely helpful in keeping you apprised of your cases and aware of how technology drives greater efficiency during e-discovery.
What other e-discovery lingo frequently comes up in conversations with your case team? Let us know in the comments.