
The 5 Restoration Phases of a Secure and Dependable System

We all want our systems to be secure and dependable; indeed, the two topics are interlinked. Dependability requires high availability management, which has several aspects. We can try to achieve Fault Avoidance through fault prevention and fault removal, but this isn't possible in all cases. For example, hard disk drives suffer physical wear due to moving parts, power supplies do not run indefinitely, and so on. Therefore, we move towards Fault Acceptance. Fault acceptance relies on fault forecasting, to determine the most likely causes of faults, and fault tolerance, to enable the system to continue functioning in the event of a fault. With fault tolerance we build redundancy into the system so that faults do not result in system failures. However, there are times when even our most fault tolerant systems will fail. What do we do then? Obviously, we need to recover as quickly as possible.

The 5 restoration phases of a system are as follows:
  1. Diagnostic Phase - find the fault, diagnose the problem and determine the appropriate course of action to recover
  2. Procurement Phase - identify, locate, transport and physically assemble replacement hardware, software and backup media
  3. Base Provisioning Phase - configure the system hardware and install the base Operating System (OS)
  4. Restoration Phase - restore the entire system from the backup media, including system files and user data
  5. Verification Phase - verify the correct functionality of the entire system as well as the integrity of the user data
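To make the phases above concrete, they could be captured as an ordered runbook that staff walk through during a recovery. This is only a minimal sketch; the structure and wording are illustrative, not a prescribed format:

```python
# A minimal sketch of the five restoration phases as an ordered runbook.
# The phase names follow the list above; the data structure is an assumption.
RESTORATION_PHASES = [
    ("Diagnostic", "find and diagnose the fault, decide the recovery action"),
    ("Procurement", "locate and assemble replacement hardware and backup media"),
    ("Base Provisioning", "configure hardware and install the base OS"),
    ("Restoration", "restore system state and user data from backup"),
    ("Verification", "verify system functionality and data integrity"),
]

def print_runbook(phases):
    """Print the phases in order, numbered, so they can be followed as a checklist."""
    for step, (name, action) in enumerate(phases, start=1):
        print(f"{step}. {name} Phase: {action}")

print_runbook(RESTORATION_PHASES)
```

Keeping the runbook in a machine-readable form like this makes it easy to print, version and extend with per-phase time estimates.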
It can sometimes be quite hard to diagnose the actual root cause of a fault, as certain faults will sometimes show up in confusing ways. A few months ago I was investigating a problem with a machine that I was told was an OS problem - "Windows keeps blue screening with errors; bl***y Microsoft!" However, on further investigation, it had absolutely nothing to do with the OS and therefore Microsoft. Actually, there was a memory fault. One bank of RAM was faulty and was causing so many errors that the OS couldn't recover. Simply changing the pair of RAM modules sorted the problem out and the machine has been running reliably since. The problem with this type of fault is that it manifests in such a way as to look like a different fault.
The procurement phase can be tricky, as it can take a long time to get components delivered. This is where fault forecasting comes in. If we know the most likely faults, we can keep stock of those hardware components and make sure that any system recovery media is to hand. Obviously, we need backup media to be stored off site as well, but we will also need copies on site for quick restoration when the building hasn't suffered damage. Of course, many hardware vendors will offer Service Level Agreements (SLAs) specifying how quickly they can replace or repair your hardware, but some things you may want to deal with in-house, as that will still be quicker than the standard 4-hour fix (or longer).
The final three phases are all centred around restoration of data from backup media. This brings about the point that you must back up your system, not just your data. How long would it take you to install the OS from scratch, install all the additional services and make all the configuration changes? Too long. If you have backed up your system state, it can be restored onto a base OS much more quickly and without mistakes or omissions. Another aspect to think about is how you have backed your system up. You need to choose a backup scenario that suits the amount of data and the speed at which you need to recover. Remember that a server with 200GB of data backed up onto a DLT tape drive supporting an average transfer rate of 5MB/s (megabytes per second) will take over 11 hours to restore, assuming that you restore from the full backup only. Any incremental backups taken since the last full backup will also have to be restored.

We can go faster than this, even with tape backup, but the solution needs to fit your system, and a live backup solution may be required. This is where virtualisation of servers can help dramatically. Virtualisation is more often sold as 'green IT', being more efficient and cheaper. However, a major benefit is the ability to snapshot running machines and redeploy them in seconds. If one hardware box fails, you can migrate the virtual machine onto another box until the first is fixed. This can take just a few seconds or minutes, depending on the architecture of your solution.
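The restore-time figure above is simple arithmetic, and it is worth having it as a reusable calculation when sizing a backup solution. A small sketch (treating GB as 1024 MB and ignoring tape changes, seeks and verification passes, so real restores will take longer):

```python
def restore_time_hours(data_gb: float, rate_mb_per_s: float) -> float:
    """Estimate the time to restore a full backup.

    Assumes 1 GB = 1024 MB and a sustained average transfer rate;
    ignores tape changes, seek time and verification passes.
    """
    megabytes = data_gb * 1024
    seconds = megabytes / rate_mb_per_s
    return seconds / 3600

# The example from the post: 200GB at 5MB/s.
print(f"{restore_time_hours(200, 5):.1f} hours")  # roughly 11.4 hours
```

Remember this covers the full backup only; add the restore time of any incremental backups taken since, plus the time for the preceding provisioning phases.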
Availability is inversely proportional to the total downtime in the period covered and is usually expressed as a percentage, e.g. 99.9% availability, which equates to around 8 hours 45 minutes of downtime per annum. The downtime is the sum of all outages in that period, so we need to reduce both the frequency and the length of those outages. The frequency is reduced by building fault tolerant systems and the length by having good restoration policies and practices. It is vital that IT staff know the restoration policy and have practised it. You need to set out a clear timeline of what has to happen during restoration, as certain systems and services will need to be restored first. Your database server may be the most important server to you, but without the network, DNS, DHCP and directory services, no other machines will be able to connect to it anyway, and it may rely on some of those services during start-up. Also, don't wait until you are restoring a failed system to find out whether the policy works. You must practise and test the policy to make sure that it does.
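The availability-to-downtime conversion is easy to get wrong under pressure, so here is a small sketch of the arithmetic (assuming a 365-day year; the function name is my own):

```python
def annual_downtime(availability_pct: float) -> tuple:
    """Convert an availability percentage into allowed downtime per annum.

    Assumes a 365-day (8760-hour) year. Returns (hours, minutes).
    """
    hours_per_year = 365 * 24
    downtime_hours = hours_per_year * (1 - availability_pct / 100)
    hours = int(downtime_hours)
    minutes = round((downtime_hours - hours) * 60)
    return hours, minutes

print(annual_downtime(99.9))   # about 8 hours 45 minutes, as quoted above
print(annual_downtime(99.99))  # under an hour: each extra nine is a tenfold cut
```

Running the budget the other way is just as useful: if a single full restore takes 11 hours, a 99.9% target is already blown by one major failure per year.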

Comments

  1. I also had that problem: hard disk drives wear out physically due to moving parts, and power supplies do not run indefinitely. I think this post will help me sort out my problems next time.

  2. Glad it might help. Incidentally, I have another post called ‘How Reliable is RAID?’, which shows you how to calculate the reliability of your RAID solution and predict how many disk failures and system failures you are likely to suffer in a given period.


