Tuesday, 9 June 2009

The 5 Restoration Phases of a Secure and Dependable System

We all want our systems to be secure and dependable; indeed, the two topics are interlinked. Dependability requires high availability management, which has several aspects to it. We can try to achieve Fault Avoidance, through fault prevention and fault removal, but this isn't actually possible in all cases. For example, hard disk drives wear out physically because of their moving parts, power supplies do not run indefinitely, etc. Therefore, we move towards Fault Acceptance. Fault acceptance relies on fault forecasting, to try to determine the most likely causes of faults, and on fault tolerance, to enable the system to continue functioning in the event of a fault. With fault tolerance we build redundancy into the system so that faults do not result in system failures. However, there are times when even our most fault tolerant systems will fail. What do we do then? Well, obviously we need to recover as quickly as possible.

The 5 restoration phases of a system are as follows:
  1. Diagnostic Phase - find the fault, diagnose the problem and determine the appropriate course of action to recover
  2. Procurement Phase - identify, locate, transport and physically assemble replacement hardware, software and backup media
  3. Base Provisioning Phase - configure the system hardware and install the base Operating System (OS)
  4. Restoration Phase - restore the entire system from the backup media, including system files and user data
  5. Verification Phase - verify the correct functionality of the entire system as well as the integrity of the user data

It can sometimes be quite hard to diagnose the actual root cause of a fault, as certain faults will sometimes show up in confusing ways. A few months ago I was investigating a problem with a machine that I was told was an OS problem - "Windows keeps blue screening with errors; bl***y Microsoft!" However, on further investigation, it had absolutely nothing to do with the OS and therefore Microsoft. Actually, there was a memory fault. One bank of RAM was faulty and was causing so many errors that the OS couldn't recover. Simply changing the pair of RAM modules sorted the problem out and the machine has been running reliably since. The problem with this type of fault is that it manifests in such a way as to look like a different fault.

The procurement phase can be tricky, as it can take a long time to get components delivered. This is where fault forecasting comes in. If we know the most likely faults, then we can keep stock of those hardware components and make sure that any system recovery media is at hand. Obviously, we need backup media to be stored off site as well, but we will need copies on site for quick restoration when the building hasn't suffered damage. Of course, many hardware vendors will offer Service Level Agreements (SLAs) as to how quickly they can replace or repair your hardware, but some things you may want to deal with in-house as it will still be quicker than the standard 4-hour fix (or longer).

The final three phases are all centred around restoration of data from backup media. This brings about the point that you must back up your system, not just your data. How long will it take you to install the OS from scratch, install all the additional services and make all the configuration changes? This will take too long. If you have backed up your system state, then this can be restored onto a base OS much more quickly and without mistakes or omissions. Another aspect to think about is how you have backed your system up. You need to choose a backup scenario that suits the amount of data and the speed at which you need to recover. Remember that a server with 200 GB of data backed up onto a DLT tape drive that supports an average transfer rate of 5 MB/s will take over 11 hours to restore, assuming that you restore from the full backup only. Any incremental backups taken since the last full backup will also have to be restored. OK, we can go faster than this, even with tape backup, but the solution needs to fit your system and a live backup solution may be required. This is where virtualisation of servers can help dramatically. Virtualisation is more often sold as 'green IT', being more efficient and cheaper. However, a major benefit is the ability to snapshot running machines and redeploy them quickly. If one hardware box fails, you can migrate the virtual machine onto another box until the first is fixed. This can take just a few seconds or minutes, depending on the architecture of your solution.
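The tape restore estimate above is simple arithmetic, and it is worth redoing for your own data volumes. The following is a minimal Python sketch, assuming a sustained transfer rate and decimal units (1 GB = 1000 MB); the function name and the incremental backup sizes are hypothetical and only there for illustration.

```python
def restore_hours(data_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to read data_gb from backup media at a sustained rate.

    Assumes decimal units (1 GB = 1000 MB) and ignores tape changes,
    seeking and verification, so treat the result as a lower bound.
    """
    seconds = data_gb * 1000 / rate_mb_per_s
    return seconds / 3600


# The example from the text: 200 GB full backup at 5 MB/s.
full = restore_hours(200, 5)
print(f"Full backup only: {full:.1f} hours")

# Hypothetical incrementals taken since the last full backup (sizes in GB).
incrementals_gb = [10, 10, 10]
total = full + sum(restore_hours(gb, 5) for gb in incrementals_gb)
print(f"Full plus incrementals: {total:.1f} hours")
```

Plugging in the transfer rate of a faster drive or a disk-based backup shows quickly whether a given solution fits the recovery time you need.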

Availability falls as the total downtime in the period covered rises, and is usually expressed as a percentage, e.g. 99.9% availability, which equates to around 8 hours 45 minutes downtime per annum. The downtime is the sum of all outages in that period. Therefore, we need to decrease the frequency and length of those outages. The frequency is reduced by building fault tolerant systems and the length by having good restoration policies and practices. It is vital that IT staff know the restoration policy and have practised it. You need to set out a clear timeline of what has to happen during restoration. Certain systems and services will need to be restored first. Your database server may be the most important server to you, but without the network, DNS, DHCP and directory services no other machines will be able to connect to your server anyway, and it may rely on some of those services during start-up. Also, don't leave it until you are having to restore a failed system to see if the policy works. You must practise and test the policy to make sure that it does.
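To make the availability figures concrete, here is a rough Python sketch of the calculation, assuming a 365-day year and taking availability as the fraction of the period that is not downtime. The function names and the sample outage lengths are my own illustrations, not measurements from a real system.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget, in hours, for a target availability percentage."""
    return period_hours * (1 - availability_pct / 100)


def achieved_availability_pct(outage_hours: list[float],
                              period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability percentage given the length of each outage in the period."""
    downtime = sum(outage_hours)  # downtime is the sum of all outages
    return 100 * (1 - downtime / period_hours)


for target in (99.0, 99.9, 99.99):
    budget = allowed_downtime_hours(target)
    print(f"{target}% availability allows {budget:.2f} hours of downtime a year")

# Hypothetical year: one long tape restore and one short outage.
print(f"Achieved: {achieved_availability_pct([11.1, 0.5]):.3f}%")
```

Note that a single restore of the length worked out earlier would, on its own, use up more than a 99.9% target allows for the whole year.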


IT Services said...

I also had that problem: hard disk drives wear out physically due to moving parts, and power supplies do not run indefinitely. Now I think this post will help me to sort out my problems next time.

Luke Hebbes said...

Glad it might help. Incidentally, I have another post called ‘How Reliable is RAID?’, which shows you how to calculate the reliability of your RAID solution and predict how many disk failures and system failures you are likely to suffer in a given period.
