The Search Technique You Should Be Using (But Probably Aren't)



by Aileen Tien on December 27, 2016

Litigation Support, Product Spotlight, Review & Production

Editor's Note: This simple introduction to a critical search strategy, originally published in December 2015, is among our most-viewed posts on The Relativity Blog. We thought it would make a useful refresher as you start planning your first projects for 2017. You can also dive into this tutorial for more practice.

Regular expressions (RegEx) are a form of advanced searching that looks for specific patterns rather than particular terms and phrases. RegEx can return results that no other type of search can, and it offers e-discovery practitioners big improvements to their typical dtSearch and analytics workflows.

Admittedly, RegEx can seem a little complicated to the unfamiliar, but it’s kind of like learning a foreign language—difficult at first, but useful if you take the time to understand it.

Far fewer litigation support professionals use RegEx than one might expect, given its capabilities. Needless to say, when my colleagues asked me to host not one but two RegEx sessions at Relativity Fest this year, I was a little nervous (I may have even cried a little), fearing the sessions would flop and no one would attend.

The sessions turned out to be two of the most popular of the entire conference. I wanted to share some of the content with a broader audience to hopefully get even more of you aboard the RegEx train.

How It Works

RegEx is different from a Google search or keyword search in that it doesn’t search for literal words and phrases, but instead uses characters with special meaning to the search engine to retrieve specific patterns. For example, if you’re looking for social security numbers, you’re really looking for the following pattern: three numbers, dash, two numbers, dash, four numbers.

Characters called “metacharacters” act as the building blocks of your search string in RegEx. They carry a special meaning and are often what people think of as the “complicated” part of the search; for example, “\d” represents any single digit, 0 through 9. By using metacharacters, users can construct a single search string, rather than multiple literal strings, to return desired results.
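To make the social security number pattern concrete, here is a minimal sketch using Python's built-in re module as a stand-in for the search engine. The sample sentence and the names SSN_PATTERN and find_ssn_like are my own illustrations, and the exact syntax your search tool expects may differ from Python's.

```python
import re

# Pattern from the text: three digits, dash, two digits, dash, four digits.
# \b marks a word boundary so longer digit runs are not partially matched.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text):
    """Return every substring shaped like a social security number."""
    return SSN_PATTERN.findall(text)


# Hypothetical sample text: one SSN-shaped value, one phone number, one form ID.
sample = "Employee 078-05-1120 called from 312-555-0100 about form 12-3456."
matches = find_ssn_like(sample)
print(matches)
```

Only the first number has the 3-2-4 shape, so the phone number and the form ID are left alone. One string covers every value of that shape, which is the point of searching by pattern instead of by literal term.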

So, instead of searching for “dog” or “cat,” for example, RegEx can return all sequences of exactly three letters, depending on how you arrange your metacharacters. For example, the RegEx string “[a-z]{3}” will bring back any three-letter term, such as dog, cat, fox, and car.

Regular characters represent a literal meaning (“d” for the letter “d”) and are another component in a RegEx search. Regular characters and metacharacters can be used together in a RegEx string. Thus, if you would like to find any three-letter terms beginning with the letter “d,” your RegEx would be “d[a-z]{2}”.
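Here is a rough sketch of those two strings in Python's re module, again only as an illustration. The sample sentence is made up, and I have added \b word boundaries, which are my own assumption about how to restrict a match to whole terms in Python; a search index that already works term by term may not need them.

```python
import re

# Hypothetical sample text.
words = "the dog and a cat saw a fox dig under the red door"

# Metacharacters only: any whole term made of exactly three lowercase letters.
three_letter = re.findall(r"\b[a-z]{3}\b", words)

# A regular character plus metacharacters: a literal "d", then two more letters.
d_words = re.findall(r"\bd[a-z]{2}\b", words)

# Without the boundaries, Python also matches inside longer words.
inside_longer_word = re.findall(r"d[a-z]{2}", "door")

print(three_letter)
print(d_words)
print(inside_longer_word)
```

The last line shows why it matters whether a pattern is applied to whole terms or to raw text: against raw text, “d[a-z]{2}” happily matches the first three letters of “door.”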

There are two common instances in which you would use RegEx: dtSearch to find specific patterns, and analytics to filter out extraneous text.

Already familiar with RegEx? Take our quiz to test your skills.

RegEx in dtSearch and Data Grid

Because RegEx is based on pattern searching, you can use it in dtSearch to locate social security numbers, phone numbers, zip codes, email addresses, URLs, and bank account numbers—basically any string that matches a particular pattern. I think that’s something to get really excited about, as it can be used in several situations—and the same opportunities exist in Relativity Data Grid searches as well. 

For example, suppose your case team needs to find documents containing a variety of serial numbers that all match the same pattern, such as five letters, a hyphen, then four digits (e.g., ABCDE-1234). Typing in every possible serial number is clearly an impossible task, since five letters followed by four digits allow roughly 119 billion combinations (26 to the fifth power, times 10,000). Instead, the search would simply look like this: [a-z]{5}-[0-9]{4}. (Note: I'm using lowercase here because characters in a Relativity dtSearch index are normalized to lowercase.)
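As a sketch of the serial number search, the snippet below runs the same string through Python's re module and counts how many serial numbers the pattern covers. Lowercasing the text is my way of imitating a lowercase-normalized index; the function name, the sample sentence, and the \b boundaries are assumptions for illustration, not Relativity behavior.

```python
import re

# Five lowercase letters, a hyphen, then four digits.
SERIAL_PATTERN = re.compile(r"\b[a-z]{5}-[0-9]{4}\b")


def find_serials(text):
    """Lowercase the text (imitating a normalized index), then match."""
    return SERIAL_PATTERN.findall(text.lower())


# Number of distinct serial numbers the single pattern stands in for:
# 26 choices for each of 5 letters, 10 choices for each of 4 digits.
possible_serials = 26**5 * 10**4

# Hypothetical sample text.
sample = "Units ABCDE-1234 and qwert-0001 shipped; ABC-12 did not."
serials = find_serials(sample)

print(serials)
print(f"{possible_serials:,}")
```

The count works out to 118,813,760,000, which is why a single pattern beats a list of literal terms here.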

Bates numbers are another example, in which you can limit your search to hit all terms with the prefix “ABC,” followed by any eight digits, such as ABC12345678 or ABC20802317. Here’s what the string would look like: (abc)[0-9]{8}.
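A quick sketch of the Bates number string in Python follows, with the usual caveat that this is an illustration in Python's re module rather than dtSearch itself. The sample text is invented. One Python-specific wrinkle: because the string contains a parenthesized group, re.findall would return only the grouped “abc,” so the sketch uses finditer and group(0) to get each full match.

```python
import re

# Literal prefix "abc" (grouped, as in the text), then exactly eight digits.
BATES_PATTERN = re.compile(r"\b(abc)[0-9]{8}\b")


def find_bates(text):
    """Return full Bates numbers found in lowercased text."""
    return [m.group(0) for m in BATES_PATTERN.finditer(text.lower())]


# Hypothetical sample text: two valid numbers, one too short, one wrong prefix.
sample = "See ABC12345678 and ABC20802317, but not ABC1234 or XYZ12345678."
bates_hits = find_bates(sample)
print(bates_hits)
```

The parentheses are optional for a plain literal prefix like this one; abc[0-9]{8} matches the same text in Python.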

To learn more about the syntax of these searches, check out this guide on dtSearch indexes, and this one for Data Grid.

Dig into RegEx with a hands-on tutorial.

RegEx in Structured Analytics

Aside from dtSearch and Data Grid, RegEx is also used in structured analytics for filtering out extraneous text. Suppose you receive a production of documents from the other side that contains a lot of emails, and you want to locate the duplicates with email threading. Typically, email threading will identify all emails belonging to one thread, as well as duplicate emails.

However, let’s say your data set has been Bates stamped, so each email has a unique Bates number on it. Because of the unique Bates numbers, the system will not identify any duplicative emails as such, and will instead label each email as an inclusive, non-duplicate spare.

Here’s where RegEx comes into play. By searching for the Bates number pattern used in the production (say, three letters, a dash, and five digits), you can find and eliminate those Bates numbers from every email. When you thread the data again, the system will identify the duplicates correctly.
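The idea behind that filter can be sketched in a few lines of Python. This is not how structured analytics is configured; it only shows the effect of removing a stamp that matches three letters, a dash, and five digits. The two email bodies, the “XYZ” prefix, and the strip_bates helper are all hypothetical.

```python
import re

# Three letters, a dash, then five digits (either case, since this is raw text).
BATES_STAMP = re.compile(r"\b[A-Za-z]{3}-[0-9]{5}\b")


def strip_bates(text):
    """Remove Bates stamps, then tidy up the whitespace they leave behind."""
    without_stamps = BATES_STAMP.sub("", text)
    return " ".join(without_stamps.split())


# Two copies of the same email, each carrying a different Bates stamp.
email_a = "Please review the attached draft. XYZ-00001"
email_b = "Please review the attached draft. XYZ-00417"

print(email_a == email_b)                            # stamps make them differ
print(strip_bates(email_a) == strip_bates(email_b))  # identical once filtered
```

Before filtering, the two texts differ only by their stamps; after filtering, they are identical, which is what lets a duplicate comparison succeed.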

Brushing Up on Your RegEx

If you’re ready to dive into RegEx head first, there are plenty of great resources out there to help you get started, including a free regular expressions testing tool called RegExr, a Regular Expressions for Beginners webinar, and our Searching with Regular Expressions guide, which describes some of the most common RegEx metacharacters and gives examples of the results they would return. Once you dig a little deeper, RegEx won’t look like hieroglyphics anymore.

Have your own tips and tricks for leveraging RegEx? Tell us about them in the comments below.

Aileen Tien is a member of the advice team at kCura. Prior to joining kCura, Aileen worked as a developer, attorney, and litigation support specialist.
