Secrets for Dealing with Downtime

by Chris Heyn 09 Apr, 2013

Downtime falls into two categories: planned and unplanned. Serious downtime can be caused by:

- Network failure, either LAN or WAN

- A server fault that takes the server offline

Sometimes it is necessary to take servers offline for planned events, including hardware maintenance or upgrades to applications or server operating systems. Unplanned downtime can occur at any moment and is beyond the control of IT administrators. Causes range from minor issues, such as a failed hard disk or power supply, to catastrophic events such as a fire, a flood or an earthquake. The important point is that downtime, planned or unplanned, will eventually take place: it is not a question of if, but when.

Locating your servers in a secure setting is of top importance. For example, if your servers are in parts of the U.S. at risk from hurricanes, the premises should be built to withstand them. Comprehensive fire suppression systems should also be installed to protect your servers from that risk. However, you can never be 100 percent sure your premises are invulnerable, so providing backup facilities in a different location makes good sense. Geographic load balancers can then divert traffic to your backup sites should the primary site be taken offline.
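As a rough illustration of the failover idea, the sketch below picks the first healthy site from a priority list. The site names and the health map are hypothetical; a real geographic load balancer runs its own health checks and typically steers traffic through DNS or routing rather than a function call.

```python
from typing import Mapping, Optional, Sequence


def pick_site(sites_by_priority: Sequence[str],
              healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first site, in priority order, whose last health check passed."""
    for site in sites_by_priority:
        # A site with no recorded health check is treated as down.
        if healthy.get(site, False):
            return site
    return None  # every site is down


# Hypothetical site names, primary first.
sites = ["primary-dc", "backup-dc"]

normal = pick_site(sites, {"primary-dc": True, "backup-dc": True})
failover = pick_site(sites, {"primary-dc": False, "backup-dc": True})
print(normal, failover)
```

Traffic stays on the primary site while it is healthy and moves to the backup only when the primary fails its check.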

Regular server maintenance makes sense because it lets you clean up the server and restore it to its original performance levels. If you have installed backup servers and load balancers, and so increased the redundancy in your network, these maintenance outages will have less effect on your users. If maintenance is not performed, minor problems in your servers will eventually grow more serious and the server will stop working. As you plan your backup facilities, consider the cost to the business of unplanned downtime, both in lost business and in damage to the image of the organization.

Don't confuse server uptime with server availability; they are two different things. Your servers could be running fine but still be unavailable to users because a network component such as a router, firewall or WAN equipment has failed, and this counts against server availability. Selecting servers with dual power supplies and multiple network cards increases their reliability. However, to really achieve a high-availability (HA) network, make sure you install two or more load balancers configured in high availability mode.
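The difference between uptime and availability can be put into numbers. The sketch below uses the standard series and parallel availability formulas and assumes that components fail independently, which real networks only approximate. The component figures are illustrative assumptions, not measurements.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Availability of a chain in which every component must be working."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Availability of redundant components where any one is enough."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Illustrative (assumed) availability of each component.
server = 0.9999
router = 0.999
firewall = 0.999

# The user needs the server AND the router AND the firewall.
single_path = series(server, router, firewall)

# The same chain with every component duplicated.
redundant_path = series(
    parallel(server, server),
    parallel(router, router),
    parallel(firewall, firewall),
)

print(f"single path:    {single_path:.5f}")
print(f"redundant path: {redundant_path:.7f}")
```

Under these assumptions a server with 99.99 percent uptime delivers noticeably less than that to users once a single router and firewall sit in front of it, while duplicating each component recovers most of the loss.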

Defining the Downtime Rules

If you ask an IT manager about the level of downtime the organization permits, the reply needs to be more than just a percentage such as 99 percent. Expressed as approximate downtime per year, the common targets work out as follows:

- 99% = 87 hours 36 minutes

- 99.9% = 8 hours 45 minutes

- 99.99% = 52 minutes 35 seconds

- 99.999% = 5 minutes 16 seconds
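Figures like these can be derived for any availability target. The sketch below assumes a 365-day year and rounds to the nearest second, so its results may differ from quoted values by a minute or so depending on the year length and rounding used.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def annual_downtime_seconds(availability: float) -> int:
    """Permitted downtime per year, in whole seconds, for a target such as 0.999."""
    return round(SECONDS_PER_YEAR * (1.0 - availability))


def format_budget(availability: float) -> str:
    """Render the annual downtime budget as hours, minutes and seconds."""
    hours, remainder = divmod(annual_downtime_seconds(availability), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours} h {minutes} min {seconds} s"


for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%}: {format_budget(target)}")
```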

The cost of minimizing your permitted downtime varies from server to server, because different server functions have different levels of criticality. A print server going offline is more likely to be annoying than critical, but it is a different matter if your mission-critical database server fails, as the damage to the business is immediate. Bear these levels of criticality in mind as you estimate the cost of raising the reliability of your systems. If it will cost you $95,000 to raise a server's reliability from 99.99 percent to 99.999 percent, but your business would only lose $1,000 a minute to downtime, the investment does not make a good return: the upgrade removes about 47 minutes of downtime a year, worth roughly $47,000.
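The same comparison can be run for any server. The sketch below is a simplified model that assumes a 365-day year and a constant loss per minute of downtime; the $95,000 and $1,000 figures are the ones from the example above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_minutes(availability: float) -> float:
    """Expected downtime per year, in minutes, at a given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def annual_saving(current: float, target: float, loss_per_minute: float) -> float:
    """Value per year of the downtime avoided by moving from current to target."""
    avoided = downtime_minutes(current) - downtime_minutes(target)
    return avoided * loss_per_minute


upgrade_cost = 95_000
saving = annual_saving(current=0.9999, target=0.99999, loss_per_minute=1_000)
payback_years = upgrade_cost / saving

print(f"Downtime avoided is worth about ${saving:,.0f} per year")
print(f"Years to recover the ${upgrade_cost:,} upgrade: {payback_years:.1f}")
```

A server whose downtime costs more per minute would recover the same upgrade cost sooner, which is why the calculation is worth repeating for each level of criticality.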

Perhaps the most intelligent measure of server performance is not whether it can handle 80, 100 or 200 simultaneous sessions, but the effective time it takes users to complete their transactions. If you run an ecommerce site and too few users can complete their transactions at peak traffic periods, the number to care about, and resolve, is not how many users can connect but how many are unable to complete their purchases. Your servers can still be running while you lose revenue as disappointed potential customers abandon your site.
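One way to track this is to compute a completion rate and a typical completion time from session records. The sketch below is a minimal example; the `Session` record and its fields are hypothetical, and in practice the data would come from your own web analytics or application logs.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional, Sequence


@dataclass
class Session:
    """Hypothetical record of one user session that reached checkout."""
    completed_purchase: bool
    seconds_to_complete: Optional[float]  # None when the user abandoned


def completion_rate(sessions: Sequence[Session]) -> float:
    """Fraction of checkout sessions that ended in a completed purchase."""
    if not sessions:
        return 0.0
    return sum(s.completed_purchase for s in sessions) / len(sessions)


def median_completion_time(sessions: Sequence[Session]) -> Optional[float]:
    """Median seconds to complete, over successful sessions only."""
    times = [s.seconds_to_complete for s in sessions
             if s.completed_purchase and s.seconds_to_complete is not None]
    return median(times) if times else None


# Made-up sample data for a peak traffic period.
peak_sessions = [
    Session(True, 42.0),
    Session(True, 95.0),
    Session(False, None),
    Session(False, None),
]

print(f"completion rate: {completion_rate(peak_sessions):.0%}")
print(f"median time to complete: {median_completion_time(peak_sessions)} s")
```

Comparing these two numbers between peak and quiet periods shows whether users are being lost even though the servers stay up.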

About the Author

Chris Heyn is the General Manager of KEMP Technologies Italy. He lives in a small village called Arcene, about 40 km from Milan. For the past 14 years Chris has been involved in business development for ICT companies looking to expand their activities into Italy, the eastern Mediterranean and the Middle East.