by Mike Kolek on September 19, 2014
When a Relativity workspace gets to be very large, you, as an infrastructure manager, might need to decide if you're going to cache the database in memory or provide extremely high I/O throughput. Neither solution is simple, so we’d like to dig deeper into options for handling big data in your Relativity workspaces—and offer advice on identifying which route to take.
Caching in SQL Server Memory
In an ideal world, you’ll always cache the database in memory for large workspaces, as an optimized SQL Server can cache the entire document table in each active Relativity workspace. Each document table stores a row for every record and includes the majority of metadata regularly queried and reviewed by users. On average, the document table consumes 50 GB of disk space for every million records. For this reason, we would not recommend hosting workspaces larger than two to three million records on SQL Server 2012 Standard Edition, as this edition of SQL Server has a memory limitation of 64 GB RAM. SQL Server 2014 Standard Edition supports up to 128GB RAM.
The document table for a 25 million record workspace including extracted text might consume more than one TB of disk space. Even if you’re unable to cache the entire table in memory, the closer you can get to doing so, the better your performance will be. This is because all of the table indexes will likely be cached in memory, and SQL Server is only going to have to pull from storage for a percentage of the data during certain operations.
Using SQL Server Fast Track
If we can't cache the data, we need to design I/O subsystems that can read all of the data quickly—which means a lot of connectivity between the storage and the server, and a lot of hard drives.
The Fast Track Data Warehouse reference architecture, developed by Microsoft, is designed so that your SQL Server doesn't need enough memory to cache an entire dataset. If you have a 10 TB data warehouse, you can contact your hardware vendor—e.g. Dell, HP, Cisco, or IBM—and ask for a Fast Track Data Warehouse appliance rated for 10 TB. The hardware can be expensive and cumbersome, but as long as it is configured according to the very detailed instructions provided by Microsoft, success is almost guaranteed. The Fast Track reference specifications include data sizes and hardware specifications. These can be referenced by organizations to build their own Fast Track infrastructures with their existing Microsoft licensing.
The Fast Track infrastructure requirements include multiple processors and sufficient RAM, multiple PCI-Express slots for host bus adapters and network connectivity, no blade servers, and high per-processor core I/O sequential read throughput—approximately 400 MB per second. The Fast Track mass consumption rate (MCR) provides a per-core I/O throughput value in megabytes or gigabytes per second. Achieving these potential speeds is challenging. If you have a server with eight processor cores, you would need about 3,200 MB per second sequential read throughput—an extremely high figure.
Joining the Flash Revolution
More recent developments in flash and solid state storage technologies enable organizations to significantly improve Relativity performance across very large databases. Assuming that all of the SQL Server data is not already stored on an entirely flash-based array, it is becoming common for clients to store the tempdb system database for each SQL instance on local PCI-Express flash or solid state drive instead of traditional, spinning media. Microsoft officially supports local placement of the tempdb system database on non-shared storage in SQL Server 2012 and later Windows Failover Cluster deployments. This was one of the topics we highlighted in our “Optimizing and Monitoring Performance” session at Relativity Fest in 2013.
Many server-side flash vendors now offer more reasonably priced, intelligent, and automatic caching of only the hot data that traverses the storage fabric. They do not require any changes to the applications that they host, including Relativity. Consider the use of server-side flash as an extension of memory that accelerates your largest Relativity workspaces. This is especially applicable to organizations where hybrid disk-and-flash or all-flash arrays might be cost prohibitive, available solutions can’t accommodate high availability and disaster recovery requirements, or implementation of new storage subsystems would present significant challenges to those who manage the infrastructure.
One drawback of installing flash in the servers is management overhead in allocating flash appropriately and accommodating failover. For this reason, we’re seeing that SSD hybrid or entirely flash-based arrays as Storage Area Networks work well for Relativity infrastructures. Most of these vendors now offer compression, de-duplication, snapshots, and replication to help maximize available space, reduce costs, and provide options for disaster recovery.
As always, feel free to contact us if you have any questions about optimizing your Relativity infrastructure.