The thing about working in e-discovery is that you feel, deeply, both the excitement of new technology and the downstream complications it will inevitably introduce.
For example, legal tech chatter is nonstop about how generative AI can reshape our everyday work—in a very resource-intensive space!—by boosting efficiency, cutting costs, and saving time. On the other hand, the data produced by generative AI has created a new and complex fire swamp for legal professionals. What’s an e-discovery team to do?
Case law precedent and guidelines around this type of data are starting to look inevitable, but waiting around for them isn’t an option. In-house legal teams who haven’t already thought about integrating this emerging data source into their defensible disposition and preservation practices need to start now.
Embarking on this new frontier might seem overwhelming, but at Relativity Fest Chicago 2024, an esteemed panel did a great job of dispelling the fear and simplifying the intricacies of managing the data created by generative AI. Speakers in this session—“Scaling the Cliffs of AI Insanity: Managing Data Created by Generative AI”—included:
- Aron Ahmadia, Senior Director, Applied Science, Relativity
- E.J. Bastien, Sr. Director, Discovery Programs, Microsoft
- Todd Itami, Director of Artificial Intelligence and E-Discovery Solutions, Of Counsel, Covington & Burling
- Ben Sexton, Vice President of eDiscovery, JND
- Ashley Picker Dubin, Counsel, Day Pitney LLP
We’re covering this session in two parts. To begin, read on for a quick look at the panelists’ insights on how generative AI data poses unique—and not so unique—questions for e-discovery teams.
What Is and Isn’t Different About AI-Generated Data Discovery
First things first: what’s unique about AI-generated data and how it’s handled during e-discovery?
The biggest questions center on retention and discoverability. Where does this data fall on the spectrum of retention scheduling? Who owns it? Where does it live, and is it a useful reference months or more after its creation?
“These are all classic information governance problems we’ve had for years and years with productivity tools. There’s some danger in viewing software as so novel. Can you get in trouble with it? Yeah, but mostly all the same trouble as before,” Todd Itami shared with Fest attendees. He suggested a straightforward approach to these questions: “Let’s apply the IG frameworks we know how to work with, and the ethical frameworks we’ve worked with the whole time—that’s the approach.”
Ashley Picker Dubin noted that retention guidelines are sometimes set by other entities.
“Some of our clients are obligated to retain certain data, and a regulatory entity with a retention policy, like the SEC, might not be okay with a zero-retention policy on these tools,” she cautioned.
Still, for those not under such requirements, there is some valid conversation to be had around what sort of retention policy makes sense for AI-generated data. The answers depend on how teams use generative AI tools.
For example, some users like to refer back to their AI chat history.
“A lot of people want those records to exist so they can go back and understand how they got to where they are,” EJ Bastien said. “I like to refer back to them; it’s helpful for me.”
For others, looking back isn’t always a part of their everyday AI workflows—or, more importantly, the appeal of privacy outweighs the need for lengthy recordkeeping.
“I like the idea that we’re interacting with these and don’t like the idea of creating logs for all the Google searches we’re doing and what’s in our hearts,” Todd said.
Another interesting question was around custodians. Who owns AI-generated data? Human custodians, of course, create the prompts—but is the AI itself a custodian of its responses?
“As a party, at what point do we consider AI to be a custodian? I don’t think I’m ready to contemplate that. There’s not a point at which I wouldn’t consider it a non-custodial data source,” Ashley said in response to an audience member’s question. “That’s just what it is. A specific individual interacting with the chatbot—we’re after them and what their data is. Unless the chatbot is actually being sued, in which case we have a whole other problem.”
Persistent, Imperfect, Out of Context
While panelists tended to agree with Todd’s assessment that people can get in “mostly all the same trouble as before” with generative AI, there remain some unique considerations to collecting and assessing AI-generated data during discovery.
“Have you ever noticed how a picture sent and sent and forwarded around gets so pixelated it looks like it was shot with a potato and not a camera? That’s an artifact of how compression and data transfer work. You lose some information,” Aron Ahmadia said.
“So you can imagine, in a future world where all AI is a form of compression or translation of information, there’s loss in there. A conversation gets recorded, transcribed, summarized—maybe we can ask AI to make some audio that reflects a summarization. We’re going to see a lot of that,” he predicted. “But it’s important to remember that loss. That’s why we have experts, technologists, attorneys, to think about what that means from an evidentiary perspective.”
The group agreed: generative AI can teach and produce a lot for us, but it’s not an easy button for all things at all times. For Legal Data Intelligence professionals coming at it from an outside perspective, looking to understand it in the context of a project or matter down the road, it’s essential to remember a few things: the data AI creates is persistent; it’s not always perfect and, like human-generated content, can’t be taken for certifiable truth; and reading it out of context is going to come with some challenges.
On the subject of persistent documents, the panelists advised attendees to understand what that means in the context of each unique tool.
“A lot of generative AI capabilities are leveraging data that already exists—they’re just giving you an efficient way of surfacing insights. But they’re also creating more content that we have to deal with: prompts are persistent, responses are persistent,” EJ explained. “Some people are chagrined by that, but it’s also very helpful in collecting prompts and referencing files and expediting our understanding.”
“Sometimes I find myself a bit lost or afraid when talking about persistent documents. Looking at a tool and wondering what’s happening here, I operationalize that analysis by sitting and drawing a picture of inputs, outputs, and logging that’s happening on the server and with metadata,” Todd shared. “What happens if there’s a debugging problem? What happens if I call the help desk? Do they have the encryption key to see all the weird stuff I’ve been asking? These are things you need to know. Too many people don’t think this is a big deal—we are the ones who have to really care about the details, ask the questions, and understand the logging. Going through those things is really worth it.”
EJ agreed: “The number one concept I’ve pushed is if the data persists, it should be discoverable and respect preservation settings. It should be returned in search and replicate the original experience. That’s been built into the substrate to make sure things just work in that regard.”
The technology should just work, but understanding how it works and how it affects discovery workflows is important for e-discovery practitioners and in-house data teams.
“You really should lean in and understand how these things are going to work. Understanding that data persists is one thing, but where it exists is different, and it’s sometimes in fragmented pieces—transcripts live separately from the chat and the files shared during meetings,” EJ continued. “Pulling it all together doesn’t have an easy button.”
Automated transcripts are a perfect example of the potentially imperfect, and context-specific, nature of AI-generated data.
“A concern is that we’re, more and more, using transcripts as part of the day-to-day of our business. We’re saving those transcriptions, sharing via emails or storing them to reference later, often without making further edits. And some tools note that these transcripts have been generated by AI, but not all do,” Ashley noted during the Fest panel. “If you’re coming back to one of these two years later, no one is going to recall who said what and when—so it’s hard to know if the transcripts are accurate, which they rarely are about the real subject matter of a conversation.”
Case teams and document reviewers, therefore, must be highly aware of their own automation biases when reading and strategizing around AI-generated data during e-discovery. It isn’t all a verbatim record of truth and reality—and even if it were, it’s often missing a lot of the nuance that can significantly influence a matter.
In the second part of our coverage of this session, we’ll take a closer look at the panelists’ insights on how to get started with AI-generated data discovery. Stay tuned!
(Psst: if you simply can’t wait, watch the recording of this session on-demand right here!)
Graphics for this article were created by Sarah Vachlon.
