In this second installment of his interview series, Jay Leib—kCura’s chief strategy officer and resident computer-assisted review expert—talks about information retrieval with Dr. David Grossman, an adjunct professor of computer science at the Illinois Institute of Technology.
Jay Leib: Your background is in information retrieval. So what is the definition of information retrieval and how is it different from other computer science disciplines?
David Grossman: Much of computer science focuses on obtaining the right answer to various problems, quickly and accurately, every time. With information retrieval, on the other hand, it’s much less defined. The search may mean different things to different people, and the “right” answer may be a matter of opinion. Information retrieval is defined as the study of algorithms and heuristics that enable people to find the information they need—and only the information they need—as quickly as possible. When I was a database systems programmer in 1986, search was more definitive, getting the right answers back from a database of values. Since I began more formally working on research problems and publishing papers in information retrieval in 1992, the field has become a more distinct discipline.
What tools do information retrieval scientists use to determine the strength of a search engine and the accuracy of a query?
We use ground truth collections where the documents do not change and a standard set of test queries. Results are then manually assessed and are deemed relevant or not. This enables us to compute precision (the ratio of relevant retrieved documents to total retrieved documents) and recall (the ratio of relevant retrieved documents to total relevant documents). We then say that one approach is better than another if the precision or recall is higher.
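As a rough illustration of the evaluation described above, the following Python sketch scores two systems against the same manually assessed judgments and averages precision and recall across the test queries. The function names, query IDs, and document IDs are all hypothetical, and averaging per-query scores is only one common convention, not necessarily the one used in any particular evaluation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document IDs returned by the system
    relevant:  document IDs judged relevant by human assessors
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_scores(run, judgments):
    """Average precision and recall over every judged query."""
    scores = [precision_recall(run.get(q, []), rel) for q, rel in judgments.items()]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n


# Hypothetical ground truth: query ID -> documents judged relevant.
judgments = {"q1": {"d1", "d2", "d3", "d4"}, "q2": {"d5", "d6"}}

# Hypothetical results from two systems run on the same fixed collection.
system_a = {"q1": ["d1", "d2", "d9"], "q2": ["d5", "d8"]}
system_b = {"q1": ["d1", "d9", "d10", "d11"], "q2": ["d7"]}

mean_p_a, mean_r_a = mean_scores(system_a, judgments)
mean_p_b, mean_r_b = mean_scores(system_b, judgments)
print(f"A: precision={mean_p_a:.2f} recall={mean_r_a:.2f}")
print(f"B: precision={mean_p_b:.2f} recall={mean_r_b:.2f}")
```

In this made-up data, system A scores higher than system B on both measures, which is the kind of comparison the fixed collection and fixed query set make possible.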
Can you easily define precision and recall, and why are they important?
Precision tells you how many documents are correct or relevant in a set of results. For example, if a system retrieves 10 documents and five are relevant, it gets 50 percent precision. Recall considers how many relevant documents were out there to be found. Fifty percent may sound unimpressive, but what if only five documents were relevant in the whole collection? Recall measures this at 5/5, which would be 100 percent for the same query.
Precision is important because it tells you something about how much of the user’s time you are going to waste by making them read documents that are not relevant. Recall tells you how much might be missed by a given system. I can avoid wasting your time by retrieving only one document and frequently that document is correct, giving me 100 percent precision. However, if there are 1,000 relevant documents throughout the collection, my recall is only 0.1 percent; in other words, you just saved that time by missing lots of relevant documents. I would suggest that, for web search, recall may not be very important. People use Google to find the closest Pizza Hut—not every last Pizza Hut. For enterprise search or e-discovery, or newspapers or magazines, people want to see every relevant document. They also want to do this without sifting through a lot of noise, so both of these values are important within these domains.
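The trade-off described above can be sketched in a few lines of Python. The example assumes a synthetic collection with 1,000 relevant documents and a hypothetical ranked result list that alternates relevant and non-relevant documents; the function name and the data are illustrative only. Cutting the list off after one document reproduces the 100 percent precision and 0.1 percent recall case, while reading further down the list raises recall at the cost of precision.

```python
def precision_recall_at_k(ranking, relevant, k):
    """Precision and recall when only the top k ranked documents are returned."""
    top_k = ranking[:k]
    if not top_k or not relevant:
        return 0.0, 0.0
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(top_k), hits / len(relevant)


# Synthetic collection: document IDs 0..999 are relevant, 1000..1999 are not.
relevant = set(range(1000))

# Hypothetical ranking that alternates relevant and non-relevant documents,
# starting with a relevant one.
ranking = [doc for pair in zip(range(1000), range(1000, 2000)) for doc in pair]

for k in (1, 10, 100, 2000):
    p, r = precision_recall_at_k(ranking, relevant, k)
    print(f"top {k:>4}: precision={p:.1%} recall={r:.1%}")
```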
In our industry, lawyers and investigators are now using sophisticated search technology. Do you think the tools produce reliable enough results to attest to their accuracy?
I hate that we consider machine learning algorithms that were developed 20 or 30 years ago as very sophisticated. They have been around, they work, and it just so happens that people are now discovering that they work. I think the machine learning world has had a marketing problem, but that seems to be going away. I think it depends on the application as to whether or not these tools are “reliable enough.” Also, it depends on how much data is involved. For a very small amount of data, which would not cost much to review, I think it’s fine for a human to review it. As electronically stored information continues to grow, I suspect a human review of information may no longer be feasible given the volumes that are out there.
What is a common search mistake that a layperson may make?
When we worked with AOL query logs, we frequently saw searches that left out quotes to indicate phrases. Quotes are used by just about everyone now, so putting quotes around a phrase like "New York" really helps out a search engine. Otherwise you are searching for all documents that have the word "New" and all documents that have the word "York."
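To make the difference concrete, here is a minimal Python sketch of the two query behaviors. It assumes a toy engine in which an unquoted query matches any document containing either word, while a quoted query requires the words to appear next to each other in order. Real search engines vary in how they treat unquoted terms, and the tokenizer, function names, and sample documents are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_any_term(doc, query):
    """Unquoted query (assumed behavior): match if any query word appears."""
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))


def matches_phrase(doc, phrase):
    """Quoted query: match only if the words appear adjacent and in order."""
    doc_tokens = tokenize(doc)
    phrase_tokens = tokenize(phrase)
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


docs = [
    "Flights to New York are delayed",
    "A new bakery opened in York, England",
    "The Duke of York bought a new hat",
    "Pizza delivery in Chicago",
]

unquoted = [d for d in docs if matches_any_term(d, "New York")]
quoted = [d for d in docs if matches_phrase(d, "New York")]

print("Without quotes:", unquoted)
print("With quotes:   ", quoted)
```

With these sample documents, the unquoted query matches the first three documents, but only the first one actually mentions New York; the quoted query returns that document alone.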
What can lawyers and investigators look forward to in the future for this technology?
Basic machine learning algorithms are now being established. The next step will be to use known structured data to help build better queries. I suspect most of these cases have a lot of known structured data. With tobacco litigation, for example, much was known about the risks of tobacco, and statistics were available that were not included in the basic document collection. Leveraging these structured databases to improve search is an area that I think will make a big difference. When you think about it, you often know various tidbits of information that can help with a search, but you can rarely articulate them in a small query.