Everyday Risk Management – Data and Business Continuity

Most people we meet who have little IT experience think that an IT outage or incident wouldn’t have a major impact on their business. They often believe that doing backups will get them back up and running in no time, but in reality, it’s not enough. In fact, business continuity relies on several operational layers, any one of which could fail at a particular point in time. In this post, we discuss how you can overcome these problems.

Let’s consider the following stack:

[Diagram: the stack – user data, software, hardware, and the people who operate them]

When you say backups, people tend to think of the original content they create, which is irreplaceable without a copy. In this diagram, this is represented by User Data. However, this content is frequently created, and subsequently accessed, using some kind of software – such as Microsoft Office, a database, or a business application. These run on physical hardware, be it your desktop, a dedicated server, or even the cloud, which is ultimately just a set of servers.

Here, hardware refers to the server components: power supply, networking, and storage. RAID (a redundant array of disks) was created to reduce the impact of storage failures. However, it doesn’t help against environmental disasters or the operator going bankrupt, and companies rarely have redundant power supply lines or dual internet uplinks. What’s more, having a second server in a High Availability setup, just waiting for the first to fail, doubles the price and incurs additional administration costs, since everything needs to be kept in sync and up to date. This is where the Cloud approach really shines – none of these problems are yours; you just pay for the service and expect the service level agreement (SLA) to be kept.
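
To put an SLA figure into perspective, here is a quick illustrative calculation – not any particular provider’s commitment – of how much downtime a given availability percentage actually permits:

    # Illustrative only: convert an SLA availability percentage into the
    # maximum downtime it permits. The percentages are example values,
    # not any specific provider's contract.

    HOURS_PER_YEAR = 365 * 24             # 8,760 hours
    HOURS_PER_MONTH = HOURS_PER_YEAR / 12

    def allowed_downtime(availability_pct: float) -> tuple[float, float]:
        """Return (hours per year, minutes per month) of permitted downtime."""
        down = 1 - availability_pct / 100
        return down * HOURS_PER_YEAR, down * HOURS_PER_MONTH * 60

    for sla in (99.0, 99.9, 99.99):
        per_year, per_month = allowed_downtime(sla)
        print(f"{sla}% uptime -> {per_year:.1f} h/year, {per_month:.1f} min/month")

Even a 99.9% SLA still allows roughly 44 minutes of downtime per month, which is worth knowing before you rely on it.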

The software layer is usually focused on a business application – server software, connected to a database, that supports your core operations. This kind of software is never bug-free, and updates sometimes break it or cause compatibility issues. On the other hand, avoiding updates to ensure stability can result in the tool quickly becoming outdated, leading to security issues and lost productivity. In this situation, you’re reliant on operating system stability, antivirus protection (hoping for zero false positives), constant supervision, and complex installation and configuration procedures.

Virtual machine snapshots have been developed to protect your software layer in case of misconfiguration or malfunction, but restoring them means you could lose some of your work. A more robust and sophisticated method for quickly restoring your software is to use an Infrastructure as Code tool. This involves describing the steps required to create a fully functional environment in a computer-executable script, instead of relying on manual administrator actions.
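
As a minimal sketch of that idea – not a production Infrastructure as Code tool – the script below describes an environment as repeatable steps that can be replayed on a fresh machine; the package names and paths are hypothetical examples:

    # Minimal Infrastructure-as-Code sketch: the desired environment is
    # described as idempotent steps instead of manual admin actions.
    # Package names and paths are hypothetical examples (Debian/Ubuntu style).
    import subprocess
    from pathlib import Path

    def ensure_package(name: str) -> None:
        """Install a package only if it is not already present."""
        installed = subprocess.run(["dpkg", "-s", name],
                                   capture_output=True).returncode == 0
        if not installed:
            subprocess.run(["apt-get", "install", "-y", name], check=True)

    def ensure_directory(path: str) -> None:
        Path(path).mkdir(parents=True, exist_ok=True)

    def provision() -> None:
        # Every step can be re-run safely; a new server converges to the same state.
        ensure_package("postgresql")
        ensure_package("nginx")
        ensure_directory("/srv/app/data")

    if __name__ == "__main__":
        provision()

Because each step checks before it acts, the same script can rebuild a replacement server or bring an existing one back to a known state.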

Reliable content backup is not simple, as you have to account for human error. Simply copying all files to a separate drive every night does not let you restore a corrupted or mistakenly overwritten file when the mistake is noticed two days after the backup – by then, the good version has already been overwritten. A frequent source of data loss is encrypting malware, or ransomware, which can have a devastating effect if it reaches the storage you depend on. In addition, not only does pushing files to cloud storage have a cost, but you must also consider the cost and time required to download those files when needed. A terabyte of data can take three hours to transfer over a gigabit connection, and not all of us have such bandwidth available to the nearest cloud provider.
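
The transfer-time figure is easy to check against your own data volume and link speed; the sketch below is a rough estimate that ignores protocol overhead (which is why the three-hour figure above sits a little over the raw result for a gigabit line):

    # Rough estimate of how long a restore download takes.
    # Ignores protocol overhead, throttling and shared bandwidth.

    def transfer_hours(size_tb: float, link_mbit_per_s: float) -> float:
        bits = size_tb * 1e12 * 8                     # terabytes -> bits
        return bits / (link_mbit_per_s * 1e6) / 3600

    for link in (1000, 100):                          # gigabit vs. 100 Mbit/s uplink
        print(f"1 TB over {link} Mbit/s: ~{transfer_hours(1, link):.1f} h")

On a 100 Mbit/s uplink the same terabyte takes closer to a full day, which matters when you plan your recovery time.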

Two approaches can improve the recoverability of user-generated data:

  • Tiering: dividing your content (automatically if possible) into a limited-size set of current working files and an archive. Archived files are those which aren’t immediately necessary for your daily business activities, and which can be transferred and restored later, allowing you to focus on the current working set.
  • Snapshots and incremental backup: a tool that archives only the differences between the current state and the last backup lets you keep several consistent, readily available snapshots of your storage over the last few days, making a partial restore or a comparison between versions possible (see the sketch after this list).
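
As a simplified sketch of the incremental idea – real backup tools such as rsync or restic also handle deletions, permissions and verification – the script below copies only files modified since the last run into a new dated snapshot folder; the paths are hypothetical examples:

    # Simplified incremental backup sketch: copy only files modified since the
    # last run into a dated snapshot folder. Paths are hypothetical examples.
    import shutil
    import time
    from pathlib import Path

    SOURCE = Path("/home/office/documents")
    BACKUP_ROOT = Path("/mnt/backup")
    STATE_FILE = BACKUP_ROOT / "last_run_timestamp"

    def incremental_backup() -> None:
        BACKUP_ROOT.mkdir(parents=True, exist_ok=True)
        last_run = float(STATE_FILE.read_text()) if STATE_FILE.exists() else 0.0
        snapshot = BACKUP_ROOT / time.strftime("%Y-%m-%d_%H%M")
        for file in SOURCE.rglob("*"):
            if file.is_file() and file.stat().st_mtime > last_run:
                target = snapshot / file.relative_to(SOURCE)
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(file, target)            # copy2 preserves timestamps
        STATE_FILE.write_text(str(time.time()))

    if __name__ == "__main__":
        incremental_backup()

Each run produces its own folder, so a file overwritten by mistake can still be recovered from an earlier snapshot.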

Finally, all of the above is operated by people, your employees or third-party service providers. These people need to follow certain procedures, know the access details and credentials, and ultimately, to be immediately available when you need them. In reality, procedures are often neglected, outdated, and sometimes stored on the very system you need to recover. Employees are hard to replace, and they often take their know-how with them when they leave a company. So you need to have a backup here too – a single person operating all of your critical systems is a significant risk to your business continuity.

Your fancy Infrastructure as Code deployed on a highly available server cluster will be of little use if there is no one to operate it. Human factors can add further delay to the restoration process, yet, conversely, rushing unnecessarily can create problems of its own.

Software as a Service is a popular solution to this problem. You let a dedicated team, with the necessary experience and staff resources, provide you with a service, while you focus on your core business.

In conclusion, you should be aiming for reproducible restorability, not just standard backups. Take a look at all the layers and test them regularly. It’s the same as with software – if it’s untested it will be buggy, and untested recovery is prone to fail when you need it most.
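
One way to make “test them regularly” concrete is a periodic restore drill: restore a backup to a scratch location and verify that the files actually match the originals. The sketch below does a simple checksum comparison; the paths are hypothetical examples:

    # Illustrative restore drill: after restoring a backup to a scratch folder,
    # compare checksums against the live data. Paths are hypothetical examples.
    import hashlib
    from pathlib import Path

    LIVE = Path("/home/office/documents")
    RESTORED = Path("/tmp/restore-test")      # where the test restore was extracted

    def checksum(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def verify_restore() -> bool:
        ok = True
        for original in LIVE.rglob("*"):
            if not original.is_file():
                continue
            restored = RESTORED / original.relative_to(LIVE)
            if not restored.exists() or checksum(original) != checksum(restored):
                print(f"MISMATCH: {original}")
                ok = False
        return ok

    if __name__ == "__main__":
        print("Restore test passed" if verify_restore() else "Restore test FAILED")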

Marcin Jakubowski
Maintenance Manager
Systems administrator, security expert, problem solver, and architect behind the onCloud service, Marcin has been working at XTRF for 11 years now.
Outside of work, he’s a wine enthusiast.
XTRF