by Stephen Rylander on December 08, 2015
Over the last decade, the data technology world has yielded many new and interesting innovations—so many that some businesses have struggled to identify the gaps they currently have in-house versus the emergent needs to keep up with growing data volumes. In particular, the notion of “horizontal storage” for data access has taken root.
Horizontal storage, in its simplest, most salient definition, means distributing data across many servers of the same spec, rather than continuously upgrading to another larger, faster server. That is, you have many houses as opposed to one very large house.
It’s not a novel solution, but its application in today’s landscape is key. Traditionally, relational databases—such as SQL Server—access stored data vertically, meaning they can only use as much storage as is physically attached to each server. This creates a limit to the amount of data you can store per server.
Storing data horizontally has one key strategic advantage: it allows a business to better deal with big data. Given the timeframes required by the e-discovery industry, as well as others such as consumer internet products or finance, it’s simply not realistic to handle enormous data sets only in traditional data storage environments.
Reason 1: Vertical Limits
Traditional relational databases access stored data from local disks, arrays of disks, or a storage area network (SAN) of disks, then read data off of them and load the contents into memory.
The catch? There’s a limit to the amount of physical storage that can be pushed into these disk setups. Local disks hold a finite amount of data; if you run out of space, the next step is to buy a SAN, which can store considerably more—depending on the cash laid out for the setup. If you need more space, you buy more SAN.
It’s an expensive proposition that often pushes the limit of physical infrastructure—not to mention an organization’s budget.
Horizontal storage, on the other hand, divides and distributes. You divide large data set ABC, moving A to server A, B to server B, and C to server C. As part B grows, additional division and distribution (often called “sharding”) occurs, resulting in a data set B1 and server B1. A router or controller server coordinates reading across many servers and writing to the group. As the data set grows, servers can be added to handle increasing volumes, and the system knows how to shard and distribute the data in the most efficient manner.
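To make the divide-and-distribute idea concrete, here is a minimal sketch of range-based sharding in Python. It is an illustration under stated assumptions, not how any particular product works: the `ShardRouter` class, its `max_keys_per_shard` threshold, and the use of in-memory dictionaries to stand in for servers are all hypothetical.

```python
from bisect import bisect_right


class ShardRouter:
    """Hypothetical router: each dict in `shards` stands in for one server."""

    def __init__(self, max_keys_per_shard=4):
        self.max_keys = max_keys_per_shard
        self.boundaries = []  # sorted; boundaries[i] is the lowest key of shard i + 1
        self.shards = [{}]    # start with a single "server"

    def _shard_index(self, key):
        # Count how many boundaries are <= key to find the owning shard.
        return bisect_right(self.boundaries, key)

    def put(self, key, value):
        i = self._shard_index(key)
        self.shards[i][key] = value
        if len(self.shards[i]) > self.max_keys:
            self._split(i)

    def get(self, key):
        return self.shards[self._shard_index(key)].get(key)

    def _split(self, i):
        # Divide an overfull shard at its median key and hand the upper
        # half to a newly added "server" (like B growing into B and B1).
        keys = sorted(self.shards[i])
        mid = keys[len(keys) // 2]
        lower = {k: v for k, v in self.shards[i].items() if k < mid}
        upper = {k: v for k, v in self.shards[i].items() if k >= mid}
        self.shards[i] = lower
        self.shards.insert(i + 1, upper)
        self.boundaries.insert(i, mid)


router = ShardRouter(max_keys_per_shard=4)
for n in range(10):
    router.put(f"doc-{n:02d}", n)
print(len(router.shards), "shards now hold the data")
```

Callers only ever talk to the router, so the number of shards can grow without changing how data is read or written. Real systems add rebalancing, hashing strategies, and persistence that this sketch leaves out.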
Reason 2: Less Data Movement Means Faster Processing
In traditional data processing, a given data set is copied to a program that then computes a result. For instance, in financial market calculations, a handful of raw values, such as end-of-day market prices and total return indexes, are read from a database and copied to a software process that executes an algorithm and calculates a new value.
With the ever-growing array of disks in vertical database storage, large data sets have to be loaded into memory, requiring a lot of seeking and scanning of data, index building, and memory.
This concept of moving data to process it has been the industry norm for a long time. But horizontal storage is flipping the script—and it’s a really big deal.
In the horizontal world, the servers that store data also process the data, which means less data movement and faster calculations and responses. In general, it’s much more time-intensive to copy 10 MB of data across a network than it is to ask the server that holds the 10 MB to process it. The gap grows tremendously when dealing with data sets in the hundreds of gigabytes or in terabytes, and the time savings become substantial.
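The difference between moving data and moving the computation can be sketched in a few lines of Python. This is a toy model under assumed names: `StorageNode`, `fetch_all`, and `local_sum_and_count` are hypothetical, and "values transferred" is just a count standing in for network traffic.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StorageNode:
    """Hypothetical server that both stores and can process its own data."""
    name: str
    prices: List[float]

    def fetch_all(self) -> List[float]:
        # Traditional approach: ship every stored value to the caller.
        return list(self.prices)

    def local_sum_and_count(self) -> Tuple[float, int]:
        # Horizontal approach: compute next to the data, ship two numbers.
        return sum(self.prices), len(self.prices)


def average_by_moving_data(nodes: List[StorageNode]) -> Tuple[float, int]:
    """Return (average, values transferred) when all data is copied out."""
    all_prices: List[float] = []
    for node in nodes:
        all_prices.extend(node.fetch_all())
    return sum(all_prices) / len(all_prices), len(all_prices)


def average_by_moving_compute(nodes: List[StorageNode]) -> Tuple[float, int]:
    """Return (average, values transferred) when each node pre-aggregates."""
    partials = [node.local_sum_and_count() for node in nodes]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count, 2 * len(partials)


nodes = [StorageNode(f"node-{i}", [float(v) for v in range(1000)]) for i in range(3)]
print(average_by_moving_data(nodes))
print(average_by_moving_compute(nodes))
```

Both functions arrive at the same average, but the second one transfers only a sum and a count per node regardless of how much data each node holds. That is the effect the paragraph above describes.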
Reason 3: Resiliency
When you combine horizontal storage’s scaling and processing capabilities, you get yet another advantage: high levels of resiliency.
When we say “resiliency,” we’re referring to both the consistent availability of data and the ability to survive failures. Technically, those are different concepts and are measured separately, but summing them up as resiliency gets to the heart of the matter.
When requests come in to a distributed database using horizontal storage, they are routed to the appropriate server or group of servers for processing, and a technique called replication is applied.
Replication takes a shard (say, set A of ABC) and stores it on two of the servers in a group of servers, which is called a cluster. So, if we have three servers in a cluster called Cluster-1, A is stored as a primary on Server-1 and as a replica on Server-3. Data set B is stored on Server-2 and Server-3, and so forth across the cluster. If any one of the three servers goes down, the others will still serve the data, keeping it available and allowing the cluster to survive the failure. Thus, every server in the cluster is capable of serving requests, and the cluster as a whole is resilient.
There are limits to this resiliency: if too many nodes go down, some data may be offline and inaccessible, preventing the remaining nodes from processing the request. However, this is a step up from vertical storage, where this kind of replication across nodes is not built into the storage model. Replication across nodes can happen over inexpensive commodity hardware, inside of data centers, or on public clouds.
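The replication layout described above, and its limits, can be sketched in Python as follows. The placement of A and B follows the Cluster-1 example; the placement of C is an assumption made here to complete the picture, and the `route_read` and `available_shards` helpers are hypothetical.

```python
# Shard -> [primary, replica]. A and B follow the Cluster-1 example in the
# text; the placement of C is an assumption made for this illustration.
PLACEMENT = {
    "A": ["Server-1", "Server-3"],
    "B": ["Server-2", "Server-3"],
    "C": ["Server-1", "Server-2"],  # assumed, not stated in the article
}


def route_read(shard, down=frozenset()):
    """Return the first live server holding `shard`, or None if all are down."""
    for server in PLACEMENT[shard]:
        if server not in down:
            return server
    return None


def available_shards(down):
    """Return the set of shards that can still be served."""
    return {shard for shard in PLACEMENT if route_read(shard, down) is not None}


# One server down: every shard still has a live copy somewhere.
print(route_read("A", down={"Server-1"}))
print(sorted(available_shards({"Server-1"})))

# Two servers down: the shard stored only on those two becomes unavailable.
print(sorted(available_shards({"Server-1", "Server-3"})))
```

With two copies of each shard, the cluster tolerates any single server failure. Losing both servers that hold a given shard takes that shard offline, which is the limit noted above.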
There are plenty of examples of distributed databases using horizontal storage, including Hadoop, Cassandra, and CouchDB. Each of these has its specialties and specific nuances, yet all share similar concepts of scale, minimizing data movement, and resiliency to failure.
We take this move toward horizontal storage seriously, and it’s why we’ve spent so much energy engineering Data Grid—Relativity’s horizontal data store. Scalable, fast, and resilient data is the present and future of this industry.
Stephen Rylander is a senior manager of software engineering at kCura, specializing in Relativity Data Grid.