An Introduction to Stratified Sampling in Relativity Assisted Review

by Jay Leib on June 06, 2014

Analytics & Assisted Review, Product Spotlight

Released last week, Assisted Review for Relativity 8.2 includes a new sampling methodology for building training sets: stratified sampling. Let’s take a look at how this approach can help you train Relativity faster and more holistically.

What is stratified sampling?

Stratified sampling can jumpstart your Assisted Review project by offering better training sets. Unlike truly random sampling, stratified sampling selects documents based on their level of conceptual similarity to other documents.

When you use stratified sampling, Relativity analyzes the document collection and identifies a subset of example documents that will categorize the vast majority of the overall document set. The system does this by ensuring that documents which are conceptually similar to a significant number of other documents are included in the sample set.

By default, a stratified sample set includes only documents that have a minimum seed influence of 25, meaning that the document must be able to categorize at least 25 other documents. This default value provides a good balance between ensuring categorization effectiveness and offering the right number of documents in the sample.
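To make the idea of seed influence concrete, here is a minimal sketch in Python. It is not Relativity's implementation: the cosine measure, the 0.8 similarity threshold, and the function names seed_influence and eligible_seeds are all assumptions chosen for illustration. It simply counts, for each document, how many other documents sit close enough to it in a concept space, then keeps the documents that reach the minimum.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def seed_influence(vectors, threshold=0.8):
    """Count how many OTHER documents each document is similar to.

    `vectors` maps a document ID to its concept vector. The threshold
    is a hypothetical cutoff, not a documented Relativity setting.
    """
    influence = {}
    for doc_id, vec in vectors.items():
        influence[doc_id] = sum(
            1
            for other_id, other in vectors.items()
            if other_id != doc_id and cosine(vec, other) >= threshold
        )
    return influence


def eligible_seeds(influence, min_seed_influence=25):
    """Keep only documents that meet the minimum seed influence."""
    return [doc_id for doc_id, n in influence.items() if n >= min_seed_influence]
```

With the default of 25, a document that is similar to only a handful of others would be left out of the candidate pool, which is the behavior described above. This brute-force version compares every pair of documents, so it is only practical for small toy collections.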

Relativity adds documents to the stratified sample set until the sample can categorize approximately 90 percent of the total collection, or until no remaining documents meet the minimum seed influence. A stratified sample typically consists of less than 1 percent of the entire document population. You can also set a maximum size for the sample; in that case, the most influential documents are added until the maximum size is reached.
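The stopping rules above can be sketched as a simple greedy loop. This is an illustration under stated assumptions, not the product's actual algorithm: the function name stratified_sample, the neighbors input (a map from each document ID to the set of documents it could categorize), and the choice to count only documents not yet covered by earlier picks are all hypothetical.

```python
def stratified_sample(neighbors, min_seed_influence=25,
                      target_coverage=0.90, max_size=None):
    """Greedily pick seed documents until one of the stopping rules fires.

    `neighbors` maps each document ID to the set of other document IDs
    it is assumed to be able to categorize.
    Returns (sample, coverage), where coverage is the fraction of the
    collection that is either a seed or categorized by a seed.
    """
    total = len(neighbors)
    sample, covered = [], set()
    remaining = set(neighbors)

    while remaining and len(covered) < target_coverage * total:
        # Optional cap on the number of documents in the sample.
        if max_size is not None and len(sample) >= max_size:
            break

        def gain(doc_id):
            # Assumption: only documents not already covered count.
            return neighbors[doc_id] - covered - {doc_id}

        # Most influential candidate first; ties broken by ID.
        best = min(remaining, key=lambda d: (-len(gain(d)), d))
        new_docs = gain(best)

        # No remaining document meets the minimum seed influence.
        if len(new_docs) < min_seed_influence:
            break

        sample.append(best)
        covered |= new_docs | {best}
        remaining.discard(best)

    coverage = len(covered) / total if total else 0.0
    return sample, coverage
```

Counting only uncovered documents is one way to keep conceptual near-duplicates of an already selected seed out of the sample, since a near-duplicate adds little new coverage. Whether Relativity measures influence this way is not stated here, so treat that detail as a guess.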

What are the benefits?

In supporting our users’ projects, we’ve found that random sample sets can include documents that are not very useful for training the system. For example, a document may be a conceptual near-duplicate of another document already in the subset—meaning the engine wouldn’t learn much new information from it. As another example, a document in a random sample may not be conceptually related to many other documents. These isolated documents have limited value as examples because they don’t train the system on very much of the overall document set. On the flip side, random sampling may not capture some key concepts in your data if the numbers of documents that discuss them are small, as those rare documents may be missed by a random sample.

Stratified sampling, on the other hand, is built to include only documents that are conceptually unique, and collectively similar to a significant number of other documents in the overall collection. As a result, example documents in a stratified sample have a large impact on machine learning, and train the system more effectively and rapidly than a random sample set of the equivalent size.

Starting with your first training round, you’ll likely see a higher percentage of categorized documents. For your team, that means you’re accelerating the project’s progress and reducing the number of training rounds needed—ultimately saving time and money for your clients.

To illustrate the potential benefits of this approach to sampling, the chart below compares the percentage of documents Assisted Review was able to categorize round-over-round during training for a project we performed on test data.

[Chart: Percent of Documents Categorized by Round: Random vs. Stratified Sampling]

When should I use it?

Stratified sampling is a great option for maximizing the effectiveness of your training rounds. It is especially useful where pre-coded seeds are not available, or where you know a limited amount of relevant information exists in a large document collection.

One thing to note is that stratified sampling is not available as a sampling option when creating control sets or conducting quality control rounds. At those stages of your workflow, random selection is required to statistically measure and validate your results.

We hope this information will help you get the most out of your Assisted Review projects in Relativity 8.2 and, as always, we would love to hear about your experiences using the feature. Remember: if you’re working on an Assisted Review project and could use a second set of eyes to help evaluate your strategy, you can always reach out. We’re happy to help.

Posted by Constantine Pappas.

 
