by Scott Ellis on June 16, 2014
As your business and your Relativity instance grow, it’s normal to feel some disk input/output (I/O) pressure—especially in your database management system (DBMS)—as you strive to accommodate database growth and the volume of transactions. Often, infrastructure managers first experience this pressure when their database backup and consistency check (DBCC) window begins to spill into peak production hours and impacts system performance.
As an immediate workaround, the backup may be paused or a DBCC run stopped. If this becomes the standard response to disk pressure, backups and consistency checks end up delayed or incomplete. Over time, if the I/O pressure remains unresolved, this erosion of the backup and DBCC schedule undermines your business continuity capabilities. In the event of a data loss, recovery point and recovery time objectives are compromised, along with your organization’s peace of mind. Unfortunately, it’s easy to lose data, and almost impossible to rebuild a lost database if backups don’t exist, are taken using the wrong technology, or aren’t maintained properly.
As infrastructure managers, we’re challenged with keeping our systems up and running, and this means preventing any erosion to the backup policy—both over the long term and the short term. We need to ensure that backup capabilities never become degraded. To do that, we need to build an infrastructure that can keep up with critical business continuity standards. The disk subsystems must be robust, and capable of handling the demands of both users and the backup schedule.
So what happens when lost data needs to be recovered?
When backups fall behind because of disk pressure, continued underinvestment in disk subsystems puts the business at risk of losing hundreds or thousands of hours of coding decisions. Fortunately, permanent loss of data is a completely preventable disaster: all it takes is a practiced backup and recovery strategy.
Our team has developed a Backup and Data Management Best Practices guide designed to help you and your organization get comfortable with your backup procedures, and avoid the financial and reputation risks of losing data.
Getting your team on board with a solid backup plan—and prepared to execute it—means having a solid plan from the beginning. In this post, you’ll find some details on how to take the first steps toward developing and implementing a repeatable business continuity process.
Step 1: Have a plan that is short, sweet, and easy to understand. It should include a recovery time objective (RTO) and a recovery point objective (RPO).
Without a written RTO/RPO plan and report, stakeholders may assume that, in the event of a data loss, all data can be recovered up to the moment the loss occurred, and that the recovery can be completed very quickly.
To help prepare your team and set expectations for others, your RTO/RPO plan needs to address several scenarios—including, say, an accidental deletion of records from one database, all the way to a complete outage that has brought your system down and resulted in an inaccessible data center. Some businesses choose to break out that kind of system-wide failure into a larger disaster recovery plan; whether or not that’s right for you depends on your business model and mission. If you’d like more information on that, contact our support team for a disaster recovery setup guide to help you get started.
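One way to keep the plan short is to write each scenario down next to its two objectives. The following is a minimal sketch in Python, not a prescribed format: the class name, the scenario wording, and every number are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    """RTO/RPO targets for one failure scenario (all values are placeholders)."""
    scenario: str
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of lost data


# Hypothetical plan: replace the scenarios and numbers with your own.
PLAN = [
    Objective("accidental deletion in one database", timedelta(hours=4), timedelta(minutes=15)),
    Objective("single server outage", timedelta(hours=12), timedelta(hours=1)),
    Objective("data center inaccessible", timedelta(days=3), timedelta(hours=24)),
]


def summarize(plan):
    """Return one plain-text line per scenario for the written plan."""
    return [f"{o.scenario}: RTO {o.rto}, RPO {o.rpo}" for o in plan]


for line in summarize(PLAN):
    print(line)
```

Keeping the objectives in one small table like this makes it harder for a scenario to go unaddressed, and gives stakeholders a single page to review.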
Once you’ve created a plan, reviewed it, and assembled a step-by-step process for executing a complete system backup, you’ll need to test it.
Step 2: Test your backup plan.
The best way of doing this is to take a backup and see if it can be restored to a server with a fresh install of Relativity. To measure success for this test, you’ll want to ask some questions:
1. Was the restore successful?
2. Did the database attach successfully?
3. How long did it take to get Relativity up and running again?
4. Did the restored database pass a database consistency check (DBCC)?
If the answer to any of the above questions is not acceptable, you will need to revise your plan until it is. Always feel free to reach out to our team if you need assistance with this process. Ultimately, if you identify any unresolvable challenges that are inherent to your systems, move on to step 3 and prepare to communicate the issues clearly.
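The four questions above can be recorded and checked mechanically after each trial restore. This is a rough sketch under stated assumptions: the field names are invented for illustration, and "acceptable" for question 3 is taken to mean "within the RTO from your plan."

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RestoreTest:
    """Outcome of one trial restore (field names are illustrative)."""
    restore_succeeded: bool
    database_attached: bool
    time_to_running: timedelta  # from start of restore until Relativity is usable
    dbcc_passed: bool


def failed_checks(test, rto):
    """Return a description of each question whose answer is not acceptable."""
    problems = []
    if not test.restore_succeeded:
        problems.append("restore did not succeed")
    if not test.database_attached:
        problems.append("database did not attach")
    if test.time_to_running > rto:
        problems.append(f"took {test.time_to_running}, objective is {rto}")
    if not test.dbcc_passed:
        problems.append("consistency check failed")
    return problems


# Hypothetical result: everything worked, but the restore overran a 6-hour RTO.
result = RestoreTest(True, True, timedelta(hours=7, minutes=30), True)
print(failed_checks(result, rto=timedelta(hours=6)))
```

An empty list means the test met the plan; anything else is a reason to revise the plan or to raise the issue in Step 3.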
Remember: Microsoft SQL Server databases are not like your typical system or user files, which can often be copied and restored with little fear of a consistency problem. Good consistency means the file that you think you backed up actually matches the file that sits on disk, and is in a usable state for restoration.
Many backup utilities are not designed to handle the complexities of backing up databases. Unless it is specifically designed to address SQL database data, a given utility may not capture a consistent backup. Consequently, when a restore is attempted, one or all of the databases could fail to reattach, and the failure could be unrecoverable. Databases that were in use when an improper tool was taking the backup are less likely to restore properly.
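SQL Server's own backup commands are one example of a tool built for database data. As a hedged illustration, the Python sketch below only builds the T-SQL text for a checksummed native backup, a verify pass, and a consistency check; it does not connect to a server. The database name and path are hypothetical, and your own tooling or options may differ.

```python
def quote_identifier(name):
    """Bracket-quote a SQL Server identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(text):
    """Quote a Unicode string literal, escaping single quotes."""
    return "N'" + text.replace("'", "''") + "'"


def backup_test_statements(database, backup_path):
    """Build T-SQL for a checksummed backup, a verify pass, and a consistency check."""
    db = quote_identifier(database)
    path = quote_literal(backup_path)
    return [
        f"BACKUP DATABASE {db} TO DISK = {path} WITH CHECKSUM;",
        f"RESTORE VERIFYONLY FROM DISK = {path} WITH CHECKSUM;",
        # Intended to be run against the restored copy on the test server.
        f"DBCC CHECKDB ({db}) WITH NO_INFOMSGS, ALL_ERRORMSGS;",
    ]


# Hypothetical database name and path, for illustration only.
for statement in backup_test_statements("ExampleWorkspace", r"D:\Backups\ExampleWorkspace.bak"):
    print(statement)
```

Note that `RESTORE VERIFYONLY` checks that the backup set is complete and readable; it is not a substitute for the full trial restore described in Step 2.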
Step 3: Develop and communicate a repeatable RTO/RPO testing plan.
In the absence of a well-communicated plan that includes clearly articulated test results on a regular basis, other business stakeholders may make the simple assumption that everything is just fine.
If the previous two steps are not executed, then stakeholders have no information regarding your business continuity capabilities. You can easily set backup and recovery expectations in your organization by communicating the results of your periodic testing in a written report or an actual meeting.
Once you have successfully developed a backup plan, communicate how often and when it will be tested. Identify stakeholders, and hold regular meetings with them. This kind of open communication will improve transparency among your team and stakeholders, so everyone is aware of the risks, benefits, and effectiveness involved in your organization’s backup and recovery preparedness. With that transparency in mind, decision makers are better equipped to make the right allocation decisions to support business continuity. Be sure to keep them apprised of any issues that could result in a data loss or in an outage that exceeds your published standard.
Your regular report should include a chart that shows, over the past year, the change in duration of backup and restore times. You should be able to project forward to the time when you will no longer meet your objectives and will need more hardware. Demonstrate that your backup architecture scales side-by-side with Relativity, and identify when I/O required by litigation services will begin to encroach on the disk and network I/O needed to take, test, and restore backups within published objectives. Additionally, this report should include the following details:
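The forward projection can be as simple as a straight line fitted to past backup durations. The sketch below assumes one duration sample per month and a linear growth trend, which is a simplification; the sample data and the 8-hour window are made up.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns (slope, intercept)."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until_breach(durations_hours, window_hours):
    """Estimate months from the latest sample until the trend crosses the window.

    Returns 0.0 if the fitted value for the latest month already meets or
    exceeds the window, and None if durations are not growing.
    """
    slope, intercept = fit_line(durations_hours)
    latest = len(durations_hours) - 1
    if intercept + slope * latest >= window_hours:
        return 0.0
    if slope <= 0:
        return None
    return (window_hours - intercept) / slope - latest


# Made-up monthly backup durations (hours) over the past year; 8-hour window.
history = [4.0, 4.2, 4.4, 4.6, 4.8, 5.0, 5.2, 5.4, 5.6, 5.8, 6.0, 6.2]
print(months_until_breach(history, window_hours=8.0))
```

With this sample data the trend grows by about 0.2 hours per month, so the estimate comes out at roughly nine months. Real growth is rarely this tidy, so treat the number as a prompt for a hardware conversation rather than a deadline.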
• RTO/RPO status (failed completely, partial failure, infrastructure failure)
• Type of test scenario (total disaster, single server outage, single database outage)
• Desired outcome (RTO/RPO)
• Actual outcome
Such a report, when contrasted with the accepted RTO/RPO plan, should put management on notice if backups are not in a ready state, and make clear that, as a result, the business could experience a disaster from which it may not be able to recover. Setting realistic expectations and clearly communicating this business need will help your organization prioritize, establish, and evolve your RTO/RPO plan as needed.
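The report details listed above fit naturally into one record per test. Here is a minimal sketch of such a record; the class and field names are assumptions made for illustration, and the sample values are invented.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ReportEntry:
    """One row of the periodic RTO/RPO report (field names are illustrative)."""
    status: str    # e.g. "partial failure", "failed completely"
    scenario: str  # e.g. "single database outage"
    desired_rto: timedelta
    desired_rpo: timedelta
    actual_restore_time: timedelta
    actual_data_loss: timedelta

    def met_objectives(self):
        """True only when restore time and data loss are both within target."""
        return (self.actual_restore_time <= self.desired_rto
                and self.actual_data_loss <= self.desired_rpo)

    def render(self):
        """Format the entry as a single plain-text line for the report."""
        flag = "OK" if self.met_objectives() else "ATTENTION"
        return (f"[{flag}] {self.scenario} | status: {self.status} | "
                f"RTO {self.desired_rto} (actual {self.actual_restore_time}) | "
                f"RPO {self.desired_rpo} (actual {self.actual_data_loss})")


# Hypothetical test: data loss was within target, but the restore ran an hour over.
entry = ReportEntry(
    status="partial failure",
    scenario="single database outage",
    desired_rto=timedelta(hours=4),
    desired_rpo=timedelta(minutes=15),
    actual_restore_time=timedelta(hours=5),
    actual_data_loss=timedelta(minutes=10),
)
print(entry.render())
```

Showing the desired and actual outcomes side by side on one line keeps the comparison with the accepted plan hard to miss.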
The inability to complete backups on time is one of the very first warning signs that your infrastructure is underpowered. If your backup picture is not complete, everything else on your plate should be deprioritized. Your backups don’t have to be fancy—they just have to be there when you need them.
I hope this information is helpful. As always, if you have any questions or would like assistance with your RTO/RPO planning, please don’t hesitate to contact us.