by Evan McAlpine on February 20, 2015
This week, my colleague Ryan Hynes and I took the virtual stage for a KMWorld webinar, presenting “Text Analytics in Action: Finding the Killer Patent.” During the live session, we used Relativity to pounce on a single patent from the U.S. Patent & Trademark Office’s database: a cat stroller idea Ryan pretends could make him millions. With his fictitious cat as co-pilot, the journey was all about giving viewers an in-depth look at text analytics.
Our stroller search was run in a real environment our legal team uses each week to search the U.S. patent database for new intellectual property that’s relevant to our business. Built on Relativity Analytics, it helps us identify records of interest much more quickly than keyword searching alone in the USPTO’s online database.
In addition to several cat puns, the webinar yielded some great questions about text analytics from the audience. Here’s some insight we didn’t have a chance to share live:
What are some best practices when choosing documents to include in an analytics index?
In general, you want to train the system on documents that have an adequate amount of extracted text and were created by humans, as opposed to machine-generated log files. Good text is important because the concepts it discusses are what teach analytics the language of your data set.
When using the clustering feature of Relativity Analytics, how much text can it handle from a single document?
When indexing documents, we require that you don’t include anything greater than 30 megabytes in size. To give you an idea of the scale, the average novel is typically 1-2 MB of text. Files that are greater than 30 MB are often machine-generated files with a lot of nonsensical content—think of computer log files or something similar. These files don’t really express any valuable concepts to index, so we exclude them.
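To make the cutoff concrete, here is a minimal sketch of how you might screen documents by extracted-text size before adding them to an index. It is only an illustration, not how Relativity applies the limit internally: the function name, the input format (a dictionary of document IDs to extracted text), and the choice to count UTF-8 bytes with 1 MB = 1024 × 1024 bytes are all assumptions.

```python
# Hypothetical pre-indexing screen; names and byte-counting rule are assumptions.
MAX_TEXT_BYTES = 30 * 1024 * 1024  # the 30 MB cutoff described above


def split_by_text_size(docs, limit=MAX_TEXT_BYTES):
    """Split documents into those at or under the size limit and those over it.

    docs: dict mapping a document ID to its extracted text.
    Returns a tuple (included_ids, excluded_ids).
    """
    included, excluded = [], []
    for doc_id, text in docs.items():
        size = len(text.encode("utf-8"))  # size of the extracted text in bytes
        if size <= limit:
            included.append(doc_id)
        else:
            excluded.append(doc_id)
    return included, excluded
```

Running this over a export of extracted text gives you a list of oversized documents to review separately, which is usually a quick way to confirm they really are log files or similar machine output.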
Does concept searching work off the extracted text of the documents, or metadata?
That depends on how you’ve built your analytics index. In our demo, the text analytics tools we used were running on an index built only on the extracted text for the patents. This field contained all the relevant language and claims from the patents themselves, and had the most valuable concepts. When we dove into the documents ourselves after identifying the most important ones, we simply used the metadata to get a bit more context on each record.
While you could index any long text metadata field, you probably wouldn’t get very good results because there’s just not a lot of concept-rich text in those fields to analyze.
Is the Find Similar Documents feature based on a clustering structure?
Our analytics tool’s ability to detect similar documents does not depend on clusters, but both features share the same underlying structure: the analytics index. The index represents the conceptual space of the documents, so all of the conceptual options, such as clustering, categorization, and concept searching, work off that same map of the data.
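If it helps to picture the idea of one index feeding several features, here is a toy sketch. It is not Relativity Analytics' implementation: plain word counts stand in for a real conceptual index, cosine similarity stands in for conceptual closeness, and every function name and the 0.5 threshold are assumptions made for illustration. The point is only that similar-document lookup and clustering are two independent readers of the same structure.

```python
import math
from collections import Counter


def build_index(docs):
    """Map each document ID to a word-count vector (a stand-in for a real index)."""
    return {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def find_similar(index, doc_id, top_n=3):
    """Rank the other documents by similarity to doc_id; no clusters involved."""
    query = index[doc_id]
    scores = [(other, cosine(query, vec)) for other, vec in index.items() if other != doc_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


def cluster(index, threshold=0.5):
    """Greedy grouping: a document joins the first group whose first member is similar enough."""
    groups = []
    for doc_id, vec in index.items():
        for group in groups:
            if cosine(vec, index[group[0]]) >= threshold:
                group.append(doc_id)
                break
        else:
            groups.append([doc_id])
    return groups
```

With three made-up abstracts, say two about pet strollers and one about battery charging, `find_similar` and `cluster` both pair up the stroller documents, even though neither function calls the other.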
Is the application you used during the demo a standalone product? Can you tell us more about it?
We built everything within the platform, whose open framework comes with any Relativity subscription. That framework helped us ingest data from the USPTO and create an active patent database we use for IP work, as well as a repeatable workflow in the software, but we didn’t make any modifications to Relativity Analytics at all. If you want more info on building your own applications, contact email@example.com. You can also check out our developer documentation.
What are all the tools in text analytics? Is there a fixed order in which they need to be performed during a project?
The analytics features we discussed in the webinar were clustering, concept searching, and similar document detection. There are other features that we didn’t discuss, though, including categorization and computer-assisted review, which allows you to code a small subset of documents and train the system to amplify those decisions across the rest of the data set. There’s also keyword expansion, which can expand upon a single search term to find conceptually related terms.
Once you’ve built an analytics index, all of these tools can be used in any order you choose. We do have some recommended workflows, so be sure to let us know if you have any questions about the most effective way to work with analytics during your projects.
If you’re ready to learn more about what each of the text analytics features can do, check out our new e-book for a high-level look.
Let us know in the comments what challenges—in e-discovery and beyond—your team wants to tackle with text analytics.