Editor's Note: With summer unofficially over and a new school year starting, it's a great time for a refresher course in key e-discovery concepts. Originally published in August 2015, this top 10 most-viewed post offers a close look at key concepts behind processing—a technical but critical aspect of any e-discovery project.
When client data begins to arrive, processing is necessary to convert the native files into searchable information, ensure the integrity of your data, and prepare it for review. While you will likely spend most of your time in review and might not be as involved during processing, it can be helpful to understand what happens during this phase—especially if you’re getting regular updates from your case team and want to put some context around the steps they’re taking to make your data visible.
At a high level, processing data extracts information for searching and filtering. Processing normalizes files—such as Word documents, spreadsheets, images, and emails—into a standard format to maintain the information of the files as they were stored in the ordinary course of business and make it accessible in a review tool. Opening each document in its native application could risk modifying the data, so processing data helps ensure the original information is captured.
Here are some processing terms that help explain what happens during this phase and how it affects your review.
Organization Is Key to an Efficient Review: Normalizing Data, Flattening Data, and Parent and Child Documents
Normalizing data means extracting the data from all of the files collected. Each file is given a unique identifier, such as a control number, and the associated metadata—such as dates and times documents were accessed and authors who created them—is captured. Users can search, sort, and start a high-level analysis based on this information.
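As a rough illustration of capturing metadata without opening a file in its native application, the following Python sketch reads file system details into a record. The `capture_metadata` function, the field names, and the `DOC0000001` control number are assumptions made for this example, not the behavior of any particular processing tool. Metadata stored inside the file itself, such as the author, would need format-specific handling that is not shown here.

```python
import os
import tempfile
from datetime import datetime, timezone

def capture_metadata(path, control_number):
    """Build a simple record from file system metadata (hypothetical field names)."""
    info = os.stat(path)  # reads metadata only; the file's content is not opened
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "control_number": control_number,
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        "last_modified_utc": modified.isoformat(),
    }

# Create a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "memo.txt")
    with open(path, "wb") as handle:
        handle.write(b"Quarterly memo")
    record = capture_metadata(path, "DOC0000001")

print(record)
```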
Flattening data is the recursive process of extracting embedded file information. For example, if a PowerPoint presentation has an audio file, flattening data extracts that audio file from the parent PowerPoint file so it can be reviewed independently, while retaining the relationship with that presentation.
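The recursive idea can be sketched in a few lines of Python. This example uses nested zip files as a stand-in for any container with embedded files. The `flatten` and `make_zip` helpers and the record layout are assumptions for illustration only; real processing engines handle many more container types.

```python
import io
import zipfile

def flatten(name, data, parent=None, out=None):
    """Recursively extract nested zip containers into one flat list of records."""
    if out is None:
        out = []
    doc_id = len(out) + 1
    # Every item gets its own record, plus a pointer back to its container.
    out.append({"id": doc_id, "name": name, "parent": parent})
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                flatten(member, archive.read(member), parent=doc_id, out=out)
    return out

def make_zip(files):
    """Build an in-memory zip from a {name: bytes} mapping (test helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for file_name, payload in files.items():
            archive.writestr(file_name, payload)
    return buffer.getvalue()

inner = make_zip({"notes.txt": b"meeting notes"})
outer = make_zip({"inner.zip": inner, "memo.txt": b"memo"})
records = flatten("outer.zip", outer)
for record in records:
    print(record)
```

Each extracted item can then be reviewed on its own, while the `parent` value preserves its relationship to the file it came from.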
Parent and child documents make it clear that files are related. If you have a zip file with three documents, the zip file is the parent and the three documents you have inside are the children. This also applies to emails and their attachments. The email is the parent and any documents that are attached to that email are the children.
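For the email case, a minimal Python sketch of the relationship might look like this. It uses the standard library `email` package to build a sample message; the `family` function, the `DOC001` identifier, and the dotted child numbering are hypothetical choices for this example.

```python
from email.message import EmailMessage

# Build a sample email with two attachments so the example is self-contained.
msg = EmailMessage()
msg["Subject"] = "Q3 forecast"
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg.set_content("See attached.")
msg.add_attachment(b"forecast data", maintype="application",
                   subtype="octet-stream", filename="forecast.xlsx")
msg.add_attachment(b"cover letter", maintype="application",
                   subtype="octet-stream", filename="letter.docx")

def family(message, parent_id):
    """Return one parent record for the email and one child record per attachment."""
    records = [{"id": parent_id, "name": message["Subject"], "parent": None}]
    for n, part in enumerate(message.iter_attachments(), start=1):
        records.append({
            "id": f"{parent_id}.{n}",
            "name": part.get_filename(),
            "parent": parent_id,  # links the attachment back to its email
        })
    return records

records = family(msg, "DOC001")
for record in records:
    print(record)
```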
Trim Unnecessary Documents to Reduce Your Review Population: De-duplicating and De-NISTing
De-duplication is the process of removing redundant copies of documents that are exactly the same, so that only one copy remains for review. During processing, a hashing algorithm such as MD5, SHA-1, or SHA-256 is applied to the data in each file, producing a hash value for that file. Hash values are much like the VIN for your car in that, for practical purposes, only one item can have that number. If two files produce the same hash value, they are treated as exact duplicates. A common example of duplicate documents occurs with emails. If you send an email to five colleagues and everyone's data is collected for a particular case, the email would be collected with your data as well as with the data of each colleague you sent it to. During review, a reviewer would have to review that same email once for every custodian whose copy was collected, up to six times in this example. In many cases, it makes sense to de-dupe your data so the reviewer only has to review that one email, one time.
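The hashing step can be sketched with Python's standard `hashlib` module. The `deduplicate` function and the sample data are assumptions for illustration. This version hashes each file's raw bytes; review platforms may instead hash selected email fields, so treat it as a simplified picture rather than how any specific tool works.

```python
import hashlib

def file_hash(data, algorithm="sha256"):
    """Return the hex hash value for a file's bytes."""
    return hashlib.new(algorithm, data).hexdigest()

def deduplicate(files):
    """Keep one entry per hash value and remember every custodian who held a copy."""
    unique = {}
    for custodian, name, data in files:
        digest = file_hash(data)
        entry = unique.setdefault(digest, {"name": name, "custodians": []})
        entry["custodians"].append(custodian)
    return unique

email = b"Subject: Budget\n\nPlease review the attached budget."
collected = [
    ("you", "Budget.eml", email),
    ("colleague_1", "Budget.eml", email),
    ("colleague_2", "Budget.eml", email),
    ("colleague_2", "Draft.docx", b"A different document"),
]

unique = deduplicate(collected)
for digest, entry in unique.items():
    print(digest[:12], entry)
```

Tracking the custodians alongside the surviving copy means the reviewer sees the email once but can still tell whose data contained it.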
De-NISTing is the process of filtering system files out of the document population. The National Institute of Standards and Technology maintains a list of hash values for known files, such as standard software program files. If any files in your data set have these hash values, you know that they’re not user-created data and can be excluded from review, reducing the number of documents in your population.
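Conceptually, De-NISTing is a lookup against a set of known hash values. In this Python sketch, `known_hashes` is a tiny stand-in for the NIST list, and the file names and contents are invented for the example; the choice of SHA-1 here is also just an assumption.

```python
import hashlib

def sha1_hex(data):
    """Return the SHA-1 hash value of a file's bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()

def de_nist(files, known_hashes):
    """Drop any file whose hash value appears in the known-file list."""
    return [(name, data) for name, data in files
            if sha1_hex(data) not in known_hashes]

# Stand-in for the NIST list: in practice this would hold a very large number of hashes.
system_file = b"stand-in for an operating system file"
known_hashes = {sha1_hex(system_file)}

files = [
    ("kernel32.dll", system_file),
    ("contract.docx", b"user-created contract"),
]

remaining = de_nist(files, known_hashes)
print([name for name, _ in remaining])
```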
Standardize to Prep for Review: Control Numbers and Time Zones
A control number is a unique identifier for each document and is assigned during processing. Some organizations assign a specific prefix for each custodian, while others prefer to use one numbering scheme and continue using that same scheme throughout that entire case. These numbers are referenced frequently to identify documents throughout review.
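Both numbering approaches can be sketched with one small Python generator. The `numbering` function, the prefixes, and the seven-digit padding are hypothetical choices for this example; your team's conventions may differ.

```python
from itertools import count

def numbering(prefix, start=1, width=7):
    """Yield sequential, zero-padded control numbers such as ABC0000001."""
    for n in count(start):
        yield f"{prefix}{n:0{width}d}"

# Option 1: a separate prefix for each custodian.
smith = numbering("SMITH")
jones = numbering("JONES")
per_custodian = [next(smith), next(smith), next(jones)]

# Option 2: one scheme that continues across the whole case.
case = numbering("ABC")
single_scheme = [next(case) for _ in range(3)]

print(per_custodian)
print(single_scheme)
```

Zero-padding keeps the numbers sorting in the same order whether they are treated as text or as numbers.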
Time zones can be very important in e-discovery. Dates and times are typically stored in a single standard, Coordinated Universal Time (UTC), but are displayed to users in the time zone that is set on their computer. You can determine where the data came from and apply the time zone accordingly, or you can make a universal decision to process all of the data in UTC for consistency. Setting the appropriate time zone when you process data will ensure that the dates and times of the extracted data are populated into fields in review in a way that reflects how they existed on the original machine. This is an important decision because the time zone you set can affect which documents are included for review if you’re eliminating any data based on date ranges.
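To see how the time zone choice can change a date-range cut, consider this small Python sketch. The timestamp, the August 2015 date range, and the fixed UTC-5 offset (standing in for US Central Daylight Time) are all assumptions chosen for illustration.

```python
from datetime import date, datetime, timedelta, timezone

# Fixed offset used as a stand-in for US Central Daylight Time (UTC-5).
CENTRAL_DAYLIGHT = timezone(timedelta(hours=-5), "CDT")

# An email sent at 2:30 AM UTC on August 1, 2015.
sent_utc = datetime(2015, 8, 1, 2, 30, tzinfo=timezone.utc)
# The same instant as the custodian's machine would have displayed it.
sent_local = sent_utc.astimezone(CENTRAL_DAYLIGHT)

def in_date_range(moment, start, end):
    """Check whether the calendar date of a timestamp falls within the range."""
    return start <= moment.date() <= end

start, end = date(2015, 8, 1), date(2015, 8, 31)
kept_in_utc = in_date_range(sent_utc, start, end)      # True: August 1 in UTC
kept_in_local = in_date_range(sent_local, start, end)  # False: July 31 locally

print(sent_utc.isoformat(), kept_in_utc)
print(sent_local.isoformat(), kept_in_local)
```

The same email is kept when processed in UTC and dropped when processed in the custodian's local time, which is why the setting should be decided before any date filtering is applied.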
Once the data is properly processed and loaded into your review workspace, you can search all the files for the content of the data itself, as well as more detailed information such as when a file was created or last modified.
All of these steps are critical for organizing your case data before review, preserving original metadata, and giving your case team full access to the documents in your e-discovery tool. Understanding the core concepts of the processing phase can help ensure that you’re getting data into review as quickly as possible and staying synced with your litigation support team as they make the data available for you.
What questions do you have as an attorney about this technical step in the e-discovery process? Let us know in the comments.