What's Hiding in That Massive, Unstructured Data Set?

Editor's Note: This article was first published on the Text IQ blog.

Unstructured data, by definition, is information that lacks a pre-defined data model or organization. Ryan Zilm, director of information and life cycle management at USAA, expands on this definition, noting that unstructured data “can live on your files, shares, your personal drives. It could be [found in an] inbox, potentially SharePoint, all of those other types of repositories.”

Unstructured, or semi-structured, data is growing exponentially within enterprises. In fact, up to 80 percent of a company’s data is unstructured. During a recent discussion, “The Elephant in the Room: What’s Hiding in That Massive Unstructured Data Set?”, Justin Van Alstyne, senior corporate counsel at T-Mobile, added that enterprise data “sits on a spectrum. And, it isn't a one or zero sort of situation. You don't have one side structured and on the other side, unstructured.” This only adds to the complexity of the “elephant in the room.”

An Expanding Problem

With 2.5 quintillion bytes of data created every day, data governance is becoming increasingly challenging for enterprises. When polled, 57 percent of webinar attendees answered that storing too much unneeded data, finding what they need in their data when they need it, and organizing their data are all top challenges at their companies.

Russell Densmore, global data protection leader at Raytheon Technologies, agrees. “With almost zero governance around it, finding sensitive information in data repositories during litigation can be very challenging.” And, just because data is “technically structured, that doesn't necessarily mean that you have an idea of what you have or even they have an idea of what they have,” adds Justin. Relying on institutional knowledge of what’s in your data is no longer sufficient—or acceptable.

Unstructured data brings with it a myriad of unknowns. Without good data governance or classification, it is almost impossible to know what data your company is holding onto. Creating unstructured data is also very simple. It is all too easy for employees to add unnecessary data fields to surveys or marketing campaigns, for example, leaving behind a trail of unused, sensitive consumer data. Russell points out that “data minimization is huge. Why are you collecting data if you don't need it?”

Regulatory compliance adds to the complexity. That massive unstructured data set? It’s still subject to CPRA, GDPR, and other privacy regulations. Without any classification or minimization guidelines, complying with regulations in a timely fashion can be a tall order for many companies.

The global pandemic only underscored the need for information governance practices by creating brand-new data challenges for enterprises. Justin shared his experience at T-Mobile:

“The pandemic [has] driven a lot of behaviors that created new data challenges. For example, recording meetings. That used to be something that happened, at least in my world, once in a blue moon. And now you see it all the time because people are in different time zones, dealing with their kids’ school, or whatever, and it's becoming a much more common thing. So it's like, that's the definition of unstructured data. How do you deal with that? The aftermath of that.”

Weapons in Your Arsenal

So how should enterprises slay their massive unstructured data sets and bring them down to a manageable size? Our webinar panelists offered their advice:

Triage your data. Ryan Zilm suggests tackling your higher-risk data sets first. “When you can start focusing on that higher risk, you're going to reduce the risk across the organization. You’ve got to triage it. Ask: do you need to classify it? Is it sensitive data? Is there PII or other things that are embedded in there? You really have to understand that.”

Know the boundaries of your data. “When you're dealing with a second request and one of the first things, in my experience, they're going to ask, is for a systems inventory,” shared Justin. “It’s heartburn-inducing to tell the DOJ that this is the universe of databases that my company uses because it's a very difficult thing to pin down.” Adding to that, Russell explained that in his experience, most large enterprises don’t know the true boundaries of their networks, which is truly problematic.

Start small. “You go for your high-risk stuff first, right? You take small bites, just start taking bites. Don't try to eat the whole elephant,” advised Russell. Look for any existing classification, and check for metadata or markers.

Create processes. Data governance should start with clear processes. Set retention limits for each platform. Don’t give people an option to hoard data. “You can start at some level by implementing those platform-based retentions to manage the content,” Ryan advised. “And, at least you have something consistent, and that makes it a little bit more defensible.” And Justin agreed: “If you can go platform-specific, it puts a timer on people and that tends to drive action in my experience.”
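For readers who want to see what a platform-based retention rule might look like in practice, here is a minimal sketch in Python. It is not something the panelists described: the platform names, the retention periods, and the is_past_retention helper are all hypothetical, chosen only to show the idea of one consistent timer per platform.

```python
from datetime import date, timedelta

# Hypothetical retention limits, in days, per platform (illustrative values only).
RETENTION_DAYS = {
    "email": 365,
    "chat": 90,
    "meeting_recordings": 30,
}

def is_past_retention(platform, last_modified, today):
    """Return True when an item is older than its platform's retention limit."""
    limit = RETENTION_DAYS.get(platform)
    if limit is None:
        # Unknown platform: never flag automatically; it needs manual review.
        return False
    return today - last_modified > timedelta(days=limit)

today = date(2021, 6, 1)
print(is_past_retention("chat", date(2021, 1, 1), today))   # True: 151 days exceeds 90
print(is_past_retention("email", date(2021, 1, 1), today))  # False: 151 days is under 365
```

Keeping the limits in one table per platform, rather than per user or per folder, is what makes the rule consistent and easier to defend.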

While not an exhaustive list, these tips will get you started in exploring and governing your unstructured data. Still, our panelists agree that it is a huge undertaking, and they warn that the process can take a couple of years or more.
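The advice to triage and start with high-risk data can also be made concrete with a small sketch. The Python example below ranks repositories by how many PII-like strings they contain, so the riskiest one is reviewed first. The two patterns, the sample repository names, and the count_pii and triage helpers are assumptions made for illustration; real PII detection needs far broader coverage than two regular expressions.

```python
import re

# Illustrative patterns only; these are assumptions, not a complete PII detector.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def count_pii(text):
    """Count matches of each illustrative PII pattern in a piece of text."""
    return {name: len(pattern.findall(text)) for name, pattern in PII_PATTERNS.items()}

def triage(repositories):
    """Rank repositories by total PII hits, highest first."""
    scores = {
        name: sum(sum(count_pii(doc).values()) for doc in docs)
        for name, docs in repositories.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample data standing in for two repositories.
sample = {
    "shared_drive": ["Meeting notes, no personal data."],
    "survey_exports": ["jane@example.com, 123-45-6789", "bob@example.com"],
}
print(triage(sample))  # [('survey_exports', 3), ('shared_drive', 0)]
```

A ranking like this only suggests where to take the first bite; deciding what to classify, keep, or purge still depends on the governance processes described above.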

Technology to the Rescue

While technology like AI can be an effective tool in gaining control of your unstructured data, our panelists stressed that it is still just a tool and needs to be used alongside data governance principles and protocols. This should include regularly updating your data map and purging and decommissioning data as part of its normal life cycle.

Our viewer poll uncovered that a lack of executive sponsorship (27 percent) and the need for a proof of concept, or pilot (36 percent), were the two biggest hurdles in adopting AI or machine learning at their companies. However, according to a 2020 Fortune survey, 57 percent of companies have AI pilots underway or full-scale deployments, which suggests these hurdles are not insurmountable.

To gain an executive sponsor, Justin recommends “making sure that you get in front of them and explain the situation and try to get them to understand the risk associated with these large data sets.”

Ryan adds that it’s important to bring to the conversation the ROI, the roadmap, and your plan. Cutting to the chase, Ryan shares a method he has found to be effective: “I like to take a lot of case law and say, ‘Hey, here's the case law. Here's how much it cost this company. This is one of our competitors. Do you want to have a fine of $550 million? No, you don't. So give me at least five and I'll do X with it.’”

Justin looks at an investment in AI almost as insurance to reduce risk in enterprise data. For instance, with a data breach, “just notification plus incident response—not even the claims that are going to come, or could potentially come out of it. Even with that AI is a very small investment compared to the size of the risk that you're talking about.”

Secrets Unveiled

Panelists from the webinar “The Elephant in the Room: What’s Hiding in That Massive Unstructured Data Set?” did answer the title question: within enterprise data lie risk, challenges, questions, nuances, confusion, and complexity. Not the most reassuring answer, but through their experiences, they also provided solutions to help tame and add structure to those “massive, unstructured data sets.” You can catch their entire discussion on demand here.

Artwork for this article was created by Natalie Andrews.

Daniel Chapman is a strategic account executive at Relativity.