Networks and Servers

High Availability – Solutions

Companies are under increasing pressure to keep their systems up and running and to make data continuously available. Because they are now held to a higher standard for application and data availability, the trick is to design server and storage systems that are highly available and almost bullet-proof against unplanned downtime. To achieve the highest levels of availability, a company has to implement a complete solution that addresses all the possible points of failure. But what are the available options if you want to design a high availability solution?

You can see in the chart the main solutions for the three areas to be addressed (storage, services and networks):

In the following posts I will explain these solutions in further detail. Keep reading, ok?

High Availability – Measurement (II)

Reliability metrics

Failure rate

Reliability can be quantified as MTBF (Mean Time Between Failures) for a repairable product and as MTTF (Mean Time To Failure) for a non-repairable product.

According to the theory behind the statistics of confidence intervals, the statistical average becomes the true average as the number of samples increases. So a power supply with an MTBF of 50,000 hours (about 5.7 years of continuous operation) should not be expected to last for an average of 50,000 hours in every installation: the MTBF of 50,000 hours for one module becomes 50,000/2 hours between failures for two modules and 50,000/4 for four modules. It is only when all the parts fail with the same failure mode that MTBF converges to MTTF.
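The scaling with the number of modules can be sketched in a few lines of Python. This is only an illustration, assuming identical, independent modules with a constant failure rate (so their failure rates simply add); the function name `fleet_mtbf` is made up for this example.

```python
def fleet_mtbf(unit_mtbf_hours: float, units: int) -> float:
    """Expected hours between failures across a population of identical units.

    Assumption: units fail independently at a constant rate, so the
    population failure rate is units / unit_mtbf_hours.
    """
    if units < 1:
        raise ValueError("units must be at least 1")
    return unit_mtbf_hours / units


# The power supply from the text: 50,000 hours MTBF per module.
for n in (1, 2, 4):
    print(n, "module(s):", fleet_mtbf(50_000, n), "hours between failures")
```

The more modules you run, the sooner you should expect to see the first failure somewhere in the population, even though each individual module is no less reliable.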

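If the MTBF is known, one can calculate the failure rate (λ) as the inverse of the MTBF. The formula for λ is:

$$\lambda = \frac{1}{\mathrm{MTBF}}$$

Once an MTBF is calculated, what is the probability that any one particular module will be operational at a time equal to the MTBF? For electronic components, which are usually assumed to have a constant failure rate, we have the following equation for the reliability R(t):

$$R(t) = e^{-\lambda t} = e^{-t/\mathrm{MTBF}}$$

But when t = MTBF:

$$R(\mathrm{MTBF}) = e^{-1} \approx 0.368$$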

This tells us that the probability that any one particular module will survive to its calculated MTBF is only 36.8%, i.e., there is 63.2% probability that a single device will break before the MTBF!
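If you want to check these figures yourself, here is a minimal Python sketch. It assumes the constant failure rate (exponential) model; the function names are my own and do not come from any library.

```python
import math


def failure_rate(mtbf_hours: float) -> float:
    """Failure rate (lambda) as the inverse of the MTBF."""
    return 1.0 / mtbf_hours


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability that a unit is still working at time t (exponential model)."""
    return math.exp(-failure_rate(mtbf_hours) * t_hours)


mtbf = 50_000.0
survive = reliability(mtbf, mtbf)
print(f"P(survive to MTBF) = {survive:.3f}")      # about 0.368
print(f"P(fail before MTBF) = {1 - survive:.3f}")  # about 0.632
```

Note that the result does not depend on the MTBF value itself: under this model, any device has the same 36.8% chance of reaching its own MTBF.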

Bathtub curve

Over many years, and across a wide variety of mechanical and electronic components and systems, people have calculated empirical population failure rates as units age over time and repeatedly obtained a graph such as shown below. Because of the shape of this failure rate curve, it has become widely known as the "Bathtub" curve.

This curve (in blue) is widely used in reliability engineering as describing a particular form of the hazard function which comprises three parts:

The first part is a decreasing failure rate, known as early failures.
The second part is a constant failure rate, known as random failures.
The third part is an increasing failure rate, known as wear-out failures.

The name is derived from the cross-sectional shape of a bathtub. The curve is generated by combining the rate of early failures when the product is first introduced, the constant rate of random failures during its "useful life", and finally the rate of "wear out" failures as the product exceeds its design lifetime.
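One common way to sketch such a curve numerically is to add three hazard terms: a Weibull hazard with shape below 1 (decreasing), a constant, and a Weibull hazard with shape above 1 (increasing). The Python example below does that. All parameter values are invented purely for illustration and are not measurements of any real product.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_hours: float) -> float:
    """Illustrative bathtub hazard: early + random + wear-out failures."""
    early = weibull_hazard(t_hours, shape=0.5, scale=20_000.0)     # decreasing rate
    random_failures = 1.0 / 50_000.0                               # constant rate
    wear_out = weibull_hazard(t_hours, shape=5.0, scale=60_000.0)  # increasing rate
    return early + random_failures + wear_out


for t in (10, 100, 1_000, 10_000, 30_000, 60_000, 80_000):
    print(f"{t:>6} h: {bathtub_hazard(t):.2e} failures per hour")
```

With these made-up parameters the printed rate falls steeply at first, flattens out in mid-life and climbs again towards the end, which is the bathtub shape.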

Reliability examples

Example: Suppose 10 devices are tested for 500 hours. During the test 2 failures occur.
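The estimate of the MTBF is the total operating time (counting each device for the full 500 hours) divided by the number of failures:

$$\mathrm{MTBF} = \frac{10 \times 500}{2} = 2{,}500 \text{ hours}$$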

Whereas the estimate of the MTTF is:

Another example: A router has an MTBF of 100,000 hours; what is the annual reliability? Annual reliability is the reliability for one year, or 8,760 hours:

$$R = e^{-8760/100000} = e^{-0.0876} \approx 0.916$$

This means that the probability of no failure in one year is 91.6%; or, 91.6% of all units will survive one year.
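Both calculations are easy to reproduce in Python. The sketch below assumes that every device in the test accumulates the full test time and that the failure rate is constant; the function names are just for this example.

```python
import math


def mtbf_estimate(units: int, hours_per_unit: float, failures: int) -> float:
    """Estimate MTBF as total operating hours divided by observed failures."""
    if failures < 1:
        raise ValueError("at least one failure is needed for this estimate")
    return units * hours_per_unit / failures


def annual_reliability(mtbf_hours: float, hours_per_year: float = 8_760) -> float:
    """Probability of no failure during one year of continuous operation."""
    return math.exp(-hours_per_year / mtbf_hours)


print(mtbf_estimate(10, 500, 2))              # 2500.0 hours
print(round(annual_reliability(100_000), 3))  # 0.916
```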

High Availability – Measurement (I)

Measuring Availability

The need for availability is governed by the business objectives, and the primary goals of measuring it are:

To provide an availability baseline (and maintain it);
To help identify where to improve the systems;
To monitor and control improvement projects.

As technology has evolved over the years, most systems can now achieve 90% availability with little built-in HA redundancy; reaching this goal requires discipline in the IT department more than any particular hardware or software. But to achieve more than 90% system availability, some special considerations need to be made, and we will look into those in the following posts.

It is important to recognize that numbers like these can be difficult to achieve, since time is needed to recover from outages. The length of recovery time correlates with the following factors:

Complexity of the system: The more complicated the system, the longer it takes to restart it. Hence, outages that require system shutdown and restart can dramatically affect your ability to meet a challenging availability target. For example, applications running on a large server can take up to an hour just to restart when the system has been shut down normally, and longer still if the system was terminated abnormally and data files must be recovered.
Severity of the problem: Usually, the greater the severity of the problem, the more time is needed to fully resolve the problem, including restoring lost data or work done.
Availability of support personnel: Let's say that the outage occurs after office hours. A support person who is called in after hours could easily take an hour or two simply to arrive to diagnose the problem. You must allow for this possibility.
Other factors: Many other factors can prevent the immediate resolution of an outage. Sometimes an application may have an extended outage simply because the system can't be put offline while applications are running. Other cases may involve the lack of replacement hardware by the system supplier, or even lack of support staff.

Availability Metrics

Mean Time to Repair (MTTR)
Impacted User Minutes (IUM)
Defects per Million (DPM)
Mean Time Between Failures (MTBF)
Performance (e.g. latency, drops)

Availability is usually expressed as a percentage of uptime in a given year. Service level agreements often refer to monthly downtime or availability in order to calculate service credits that match monthly billing cycles. The following table shows the translation from a given availability percentage to the corresponding amount of time a system would be unavailable per year, month, or week, presuming that the system is required to operate continuously.
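The translation from an availability percentage to allowed downtime is simple arithmetic, and you can compute it for any target with a short Python sketch like the one below. It assumes continuous operation, a 365-day year and a month of one twelfth of a year (730 hours); the list of percentages is just a typical selection, not taken from any particular SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year

# Period lengths in hours; the month is an assumed 1/12 of a year.
PERIODS = {
    "year": HOURS_PER_YEAR,
    "month": HOURS_PER_YEAR / 12,
    "week": 7 * 24,
}


def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period for a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * period_hours


for target in (90, 99, 99.9, 99.99, 99.999):
    row = ", ".join(
        f"{allowed_downtime_hours(target, hours):.3f} h/{name}"
        for name, hours in PERIODS.items()
    )
    print(f"{target}%: {row}")
```

For instance, 90% availability allows 876 hours (36.5 days) of downtime per year, while 99.9% allows only about 8.76 hours.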