Demystifying De-NISTing

The following post was first published by Modus, a Relativity Premium Hosting Partner. It provides a great overview of an important step in processing that we thought was worth sharing.

Most e-discovery practitioners have some awareness of de-NISTing as a common step that happens sometime during the processing phase of their projects. Many are aware that it is a step that has to do with removing system files from collected data. Few, however, are familiar with precisely what it is and is not, with why it has become standard practice, and with what else vendors commonly do to supplement it.

De-NISTing is a process for the removal of known, standard system files from collected data. “System files” here refers to the executables, device drivers, initialization files, and other operative components that make an operating system (e.g., Windows) or a software suite (e.g., Microsoft Office) run on your computer. “Known” and “standard” here refer to the fact that the process removes only previously identified system files that have been unaltered from their “factory-original” forms. It doesn’t remove all files of any particular type; it just removes a list of specific files.

This has become an accepted standard practice, first, because known, standard system files are unlikely to have any evidentiary value. If a file is operative rather than a vessel for user-generated content, and if that file is unaltered from its original form by any observable user actions, it cannot have any evidentiary value, unless possession of the software itself is somehow at issue in the matter. Thus, removing the files is low risk for the producing party and of no concern to the requesting party.

The second reason this has become an accepted standard practice is that such files can be numerous and voluminous, particularly when dealing with imaged hard drives. Often, more than half of the data captured in a hard drive image is system and software files. Thus, removal of such files is a significant benefit to the time, cost, and convenience of downstream ECA and review activities.

So, how do files become known, and how are they identified as standard? 

The way files become known is hidden in the name of the process. De-NISTing comes from “NIST,” the acronym for the National Institute of Standards and Technology. Within NIST is a project called the National Software Reference Library, and it is this library of “known, traceable software applications” that powers de-NISTing (so-called, undoubtedly, because de-NSRLing is unpronounceable). The NSRL includes Reference Data Sets that are updated with new known files four times per year.

The NSRL was originally created to assist law enforcement in sifting through collected ESI looking for hidden, illegal materials like evidence of cybercrime or child pornography. In law enforcement applications, as in e-discovery ones, the goal is to quickly and automatically identify and remove as much of what’s obviously irrelevant as possible without wasting valuable man-hours evaluating it. This is accomplished in much the same way as de-duplication: by using hashing algorithms and hash values.

As known files are identified for addition to the Reference Data Sets, those files are run through standard hashing algorithms (MD5 and SHA-1), and the hash values are recorded in the Reference Data Sets. When law enforcement or e-discovery practitioners de-NIST a data collection, they generate hash values for the collected data and compare the generated hash values to the list of known hash values. Any match indicates a known system file, unaltered from its factory-original form, and safe to remove.
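The hash-and-compare step described above can be sketched in a few lines. This is a minimal illustration, not a production tool: it assumes the known hashes have already been loaded into a Python set (real NSRL Reference Data Sets are large distributed files with their own formats), and the function names are ours.

```python
import hashlib


def file_hashes(path):
    """Compute MD5 and SHA-1 of a file in one pass over its bytes."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest().upper(), sha1.hexdigest().upper()


def de_nist(paths, known_sha1s):
    """Keep only files whose SHA-1 does NOT match a known system file.

    known_sha1s: a set of uppercase hex SHA-1 values for known,
    standard files (in practice, drawn from the NSRL RDS).
    """
    return [p for p in paths if file_hashes(p)[1] not in known_sha1s]
```

Because a hash match requires the file to be byte-for-byte identical to the reference copy, any file a user has modified in any way falls through to review rather than being removed.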

The strength of this process is its reliability. It checks file-by-file and removes only known, unaltered files, eliminating any risk of evidence loss. This reliability is also the process’s weakness for e-discovery, however, as it leads to significant incompleteness. There are many, many operating system and software suite files that are not included in the NSRL’s Reference Data Sets – some because they are too new, others because they are not common enough. To address this incompleteness, many vendors supplement de-NISTing with some form of “stop” or “go” filtering, to either keep some things out or only let some things through:

  • Stop Filters
    • Stop filters are lists of file types that are filtered out and stopped from being included in subsequent processing or review activities: only things on the list are excluded; everything not on the list gets through. Typically, these would focus on file types like executables and initialization files, which are not likely to contain user-generated content.
  • Go Filters
    • Go filters are lists of file types that alone are allowed to proceed to subsequent processing and review activities: only things on the list get through; everything not on the list gets excluded. Typically, these would focus on file types like document, message, and media files, which are likely to contain user-generated content.
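The difference between the two approaches comes down to which side of the list is the default. A minimal sketch, with hypothetical (and deliberately short) type lists for illustration — real vendor lists are far longer:

```python
from pathlib import Path

# Hypothetical lists for illustration only.
STOP_TYPES = {".exe", ".dll", ".sys", ".ini"}           # operative files: exclude
GO_TYPES = {".docx", ".xlsx", ".pdf", ".msg", ".jpg"}   # user content: admit


def stop_filter(paths):
    """Exclude only listed types; anything unanticipated passes through."""
    return [p for p in paths if Path(p).suffix.lower() not in STOP_TYPES]


def go_filter(paths):
    """Admit only listed types; anything unanticipated is excluded."""
    return [p for p in paths if Path(p).suffix.lower() in GO_TYPES]
```

Note how a file with an unrecognized extension survives the stop filter but not the go filter — which is exactly why stop filters are the lower-risk choice.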

In either case, identification of file types can be done by file extension but is more commonly achieved by checking file headers (in case a user has altered a file’s extension in an attempt to hide something).
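Header checking works because most file formats begin with a fixed byte signature (a "magic number") that survives any renaming. A minimal sketch with a handful of well-known signatures (real identification tools check many more):

```python
# A few well-known file signatures; real tools recognize hundreds.
MAGIC = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",          # also the container for .docx/.xlsx
    b"\x89PNG\r\n\x1a\n": "png",
    b"MZ": "exe",                  # Windows executable
}


def sniff_type(path):
    """Identify a file by its leading bytes, ignoring its extension."""
    with open(path, "rb") as f:
        head = f.read(8)
    for signature, kind in MAGIC.items():
        if head.startswith(signature):
            return kind
    return "unknown"
```

A PDF renamed to `vacation.txt` would still sniff as a PDF, so extension-based trickery does not move it past a header-aware filter.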

Of the two approaches, stop filters are lower risk, because anything you had not anticipated (the unknown unknowns) flows through rather than getting excluded. Go filters are more efficient, because they winnow the collection to only reviewable materials, obviating the need to do so during ECA, before review. Either approach may be acceptable if reasonably calculated to find all the responsive materials, given the specific circumstances of the case and the data. As in so many areas of the law, the standard for e-discovery efforts is reasonableness, not perfection.

Matthew Verga is director of content marketing and e-discovery strategy at Modus.