
High Availability – Solutions

Companies are under increasing pressure to keep their systems up and running and to make data continuously available. Held to an ever higher standard for application and data availability, they must design server and storage systems that are highly available and nearly bullet-proof against unplanned downtime. To achieve the highest levels of availability, a company has to implement a complete solution that addresses every possible point of failure. But what are the available options if you want to design a high availability solution?

You can see in the chart below the main solutions for the three areas to be addressed: storage, services and networks:


High Availability Solutions


In the following posts I will explain these solutions in further detail. Keep reading, ok?

High Availability – Measurement (II)

Reliability metrics

Failure rate


Reliability can be quantified as MTBF (Mean Time Between Failures) for a repairable product and as MTTF (Mean Time To Failure) for a non-repairable product.

According to the statistical theory behind confidence intervals, the sample average approaches the true average as the number of samples increases. So a power supply with an MTBF of 50,000 hours should not be expected to last an average of 50,000 hours: that figure applies to a single module, and it becomes 50,000/2 for a system of two modules and 50,000/4 for four modules, because the failure rates of the modules add up. It is only when all the parts fail with the same failure mode that the MTBF converges to the MTTF.
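To make the module-count effect concrete, here is the usual series-system arithmetic (my addition, assuming independent modules and that any module failure counts as a system failure):

λ_system = n × λ_module = n / MTBF_module
MTBF_system = 1 / λ_system = MTBF_module / n

With MTBF_module = 50,000 hours, two modules give 25,000 hours and four modules give 12,500 hours.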

If the MTBF is known, one can calculate the failure rate (λ) as the inverse of the MTBF. The formula for λ is:

λ = 1 / MTBF

Once an MTBF is calculated, what is the probability that any one particular module will still be operational at a time equal to the MTBF? For electronic components, which are assumed to have a constant failure rate, reliability follows the exponential law:

R(t) = e^(−λt) = e^(−t/MTBF)

But when t = MTBF:

R(MTBF) = e^(−1) ≈ 0.368
 
This tells us that the probability that any one particular module will survive to its calculated MTBF is only 36.8%; in other words, there is a 63.2% probability that a single device will fail before reaching its MTBF!
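As a quick sanity check, here is a small Python sketch (mine, not from the original post) that evaluates the exponential reliability function for an arbitrary MTBF:

import math

def reliability(t_hours, mtbf_hours):
    """Probability that a unit with constant failure rate survives to time t."""
    failure_rate = 1.0 / mtbf_hours           # λ = 1 / MTBF
    return math.exp(-failure_rate * t_hours)  # R(t) = e^(-λt)

mtbf = 50_000                    # hours, e.g. the power supply above
print(reliability(mtbf, mtbf))   # ≈ 0.368: only 36.8% survive to the MTBF
print(reliability(8_760, mtbf))  # survival probability over one year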
 

Bathtub curve


Over many years, and across a wide variety of mechanical and electronic components and systems, people have calculated empirical population failure rates as units age over time and have repeatedly obtained a graph like the one shown below. Because of the shape of this failure rate curve, it has become widely known as the "Bathtub" curve.
 
Bathtub Curve
 
This curve (in blue) is widely used in reliability engineering to describe a particular form of the hazard function, which comprises three parts:
  • The first part is a decreasing failure rate, known as early failures.
  • The second part is a constant failure rate, known as random failures.
  • The third part is an increasing failure rate, known as wear-out failures.
The name is derived from the cross-sectional shape of a bathtub: the curve is generated by mapping the rate of early failures when the product is first introduced, the rate of random failures (with constant failure rate) during its "useful life", and finally the rate of "wear-out" failures as the product exceeds its design lifetime.
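The post gives no formula for the curve, but a common textbook way to approximate it is to superpose Weibull hazard rates with shape parameters below, equal to and above 1. The following Python sketch (my own, with purely illustrative parameters) shows the idea:

def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    early = weibull_hazard(t, shape=0.5, scale=1_000)      # early failures (shape < 1, decreasing)
    random = weibull_hazard(t, shape=1.0, scale=10_000)    # random failures (shape = 1, constant)
    wear_out = weibull_hazard(t, shape=5.0, scale=80_000)  # wear-out failures (shape > 1, increasing)
    return early + random + wear_out

for hours in (10, 100, 1_000, 10_000, 50_000, 100_000):
    print(hours, round(bathtub_hazard(hours), 6))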
 

Reliability examples


Example: Suppose 10 devices are tested for 500 hours and, during the test, 2 failures occur.
The estimate of the MTBF is:

MTBF = (10 devices × 500 hours) / 2 failures = 2,500 hours

Whereas the estimate of the MTTF is:

MTTF = (10 devices × 500 hours) / 10 devices = 500 hours
 
Another example: A router has an MTBF of 100,000 hours; what is the annual reliability? Annual reliability is the reliability for one year or 8,760 hours.

 
R = e^(−t/MTBF) = e^(−8,760/100,000) = e^(−0.0876) ≈ 0.916
 
This means that the probability of no failure in one year is 91.6%; or, 91.6% of all units will survive one year.
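For readers who want to reproduce these numbers, here is a small Python sketch (mine, not from the post) that follows the definitions used above:

import math

# Test data from the first example
devices = 10
test_hours = 500
failures = 2
total_hours = devices * test_hours   # 5,000 device-hours

mtbf = total_hours / failures        # 2,500 hours
mttf = total_hours / devices         # 500 hours

# Annual reliability of a router with an MTBF of 100,000 hours
year_hours = 8_760
router_mtbf = 100_000
annual_reliability = math.exp(-year_hours / router_mtbf)   # ≈ 0.916

print(mtbf, mttf, round(annual_reliability, 3))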

High Availability - Measurement (I)

Measuring Availability

The need for availability is governed by the business objectives, and the primary goals of measuring it are:
  • To provide an availability baseline (and maintain it);
  • To help identify where to improve the systems;
  • To monitor and control improvement projects.
As technology has evolved over the years, most systems can achieve 90% availability with little built-in HA redundancy; reaching that level requires discipline in the IT department more than any particular hardware or software. But to achieve more than 90% system availability, some special considerations need to be made, and we will look into them in the following posts.

It is important to recognize that numbers like these can be difficult to achieve, since time is needed to recover from outages. The length of recovery time correlates with the following factors:

  • Complexity of the system: The more complicated the system, the longer it takes to restart it. Hence, outages that require system shutdown and restart can dramatically affect your ability to meet a challenging availability target. For example, applications running on a large server can take up to an hour just to restart when the system has been shut down normally, and longer still if the system was terminated abnormally and data files must be recovered.
  • Severity of the problem: Usually, the greater the severity of the problem, the more time is needed to fully resolve it, including restoring lost data or work.
  • Availability of support personnel: Let's say that the outage occurs after office hours. A support person who is called in after hours could easily take an hour or two simply to arrive and diagnose the problem. You must allow for this possibility.
  • Other factors: Many other factors can prevent the immediate resolution of an outage. Sometimes an application may have an extended outage simply because the system cannot be taken offline while applications are running. Other cases may involve the lack of replacement hardware from the system supplier, or even a lack of support staff.
 


Availability Metrics

  • Mean Time to Repair (MTTR)
  • Impacted User Minutes (IUM)
  • Defects per Million (DPM)
  • Mean Time Between Failures (MTBF) – combined with MTTR in the relation shown below
  • Performance (e.g. latency, drops)
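Although the post does not spell it out here, the standard way MTBF and MTTR combine into an availability figure is the steady-state relation:

Availability = MTBF / (MTBF + MTTR)

For example (numbers of my own choosing), an MTBF of 10,000 hours and an MTTR of 4 hours give 10,000 / 10,004 ≈ 99.96% availability.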
Availability is usually expressed as a percentage of uptime in a given year. Service level agreements often refer to monthly downtime or availability in order to calculate service credits that match monthly billing cycles. The following table shows, presuming the system is required to operate continuously, the translation from a given availability percentage to the corresponding amount of time the system would be allowed to be unavailable per year, month, or week.


Availability
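As a quick way to generate the same translation for any availability percentage, here is a small Python sketch (mine, not from the post):

YEAR_HOURS = 8_760              # 365 days
MONTH_HOURS = YEAR_HOURS / 12
WEEK_HOURS = 168

def downtime_hours(availability_percent, period_hours):
    """Allowed downtime, in hours, for a given availability over a period."""
    return period_hours * (1 - availability_percent / 100)

for nines in (90, 99, 99.9, 99.99, 99.999):
    print(f"{nines}%:",
          f"{downtime_hours(nines, YEAR_HOURS):.2f} h/year,",
          f"{downtime_hours(nines, MONTH_HOURS):.2f} h/month,",
          f"{downtime_hours(nines, WEEK_HOURS):.2f} h/week")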



High Availability - Objectives

The main objective in designing any High Availability (HA) and Disaster Recovery (DR) strategy is Business Continuity. Each business has its own level of tolerance for system failures and outages and, depending upon that tolerance, a suitable strategy can be planned and implemented. If a business can accept 90% system availability, then there is no need to build any HA infrastructure.

Although HA solutions are frequently discussed in business environments, these considerations apply to any type of organization (defense, educational or non-profit) where HA is required. For organizations other than businesses, it is not easy to calculate the costs associated with downtime, so HA and DR systems become more of a requirement from the service standpoint than from the perspective of the cost of downtime.

Availability of the systems should be seen from the end user's perspective. Any time a user cannot use the system counts as downtime, and this does not necessarily mean the main computer system has gone down: in many cases, a poorly performing system is also considered an Unavailable System.

So, High Availability does not really mean building redundancy into only one system, database or application; it is a combination of redundancies built into all areas of the process. For every business or organization the database plays an important role; everything is built around it, hence most of the effort for High Availability is concerned with making the database "Highly Available."
More specifically, a high availability architecture should have the following traits:
  • Tolerate failures such that processing continues with minimal or no interruption;
  • Be transparent to (or tolerant of) system, data, or application changes;
  • Provide built-in preventative measures;
  • Provide proactive monitoring and fast detection of failures;
  • Provide fast recoverability;
  • Automate detection and recovery operations;
  • Protect the data so that there is minimal or no data loss;
  • Implement the operational best practices to manage your environment;
  • Achieve the goals set in SLAs (for example, the RTO and the RPO) for the lowest possible total cost of ownership.
System designers often build reliability into their platforms by building in correction mechanisms for latent faults that concern them. These faults, when correctable, do not produce errors or failures, since they are part of the design margins built into the system. They should still be monitored to measure their occurrence relative to the designers' anticipated frequency, since excessive occurrence of some correctable faults is often an indicator of a more catastrophic underlying latent fault.

Reliability, recoverability, timely error detection, and continuous operations are primary characteristics of a highly available solution.

High Availability - Terminology (II)

Planned outage/downtime


Planned outages include maintenance, offline backups and upgrades. These can often be scheduled outside periods when high availability is required.

 

Unplanned outage/downtime


While planned outages are a necessary evil, an unplanned outage can be a nightmare for a business. Depending on the business in question and the duration of the downtime, an unplanned outage can result in such overwhelming losses that the business is forced to close. Regardless of their nature, outages are something that businesses usually do not tolerate. There is always pressure on IT to eliminate unplanned downtime entirely and to drastically reduce, if not eliminate, planned downtime.

Note that an application or computer system does not have to be totally down for an outage to occur. It is possible that the performance of an application degrades to such a degree that it is unusable. As far as the business or end user is concerned, that application is down, even though it is technically available.

Unplanned shutdowns have various causes. The main reasons can be categorized into:
  • Hardware Failures – Failure of main system components such as CPUs and memory; of peripherals such as disks, disk controllers and network cards; of auxiliary equipment such as power modules and fans; or of network equipment such as switches, hubs and cables.
  • Software Failures – The likelihood of software failure depends mostly on the type of software used. One of the main causes of software failure is applying a patch: if a patch does not match the type of implementation, the application software may start to behave in a strange way, forcing you to bring down the application and reverse the changes, if possible. Sometimes an upgrade may also cause a problem; the main problems with upgrades are performance-related issues or the misbehaving of third-party products that depend upon those upgrades.
  • Human Errors – An accidental action by any user can cause a major failure of the system. Deleting a necessary file, dropping data or a table, and updating the database with a wrong value are a few examples.

High Availability - Terminology (I)

System


A system is composed of a collection of interacting components. A component may itself be a system, or it may be just a singular component. Components are the result of system decomposition, which is chiefly motivated by the need to partition complex systems for technical or, very often, organizational or business reasons. Decomposition of systems into components is a recursive exercise. Components are typically delineated by the careful specification of their inputs and outputs, and a component that is not decomposed further is called an atomic component.

Service


A system provides one or more services to its consumers. A service is the output of a system that meets the specification for which the system was devised, or that agrees with what system users perceive the correct values to be.

Failure


A failure in a system occurs when the consumer (human or non-human) of a service is affected by the fact that the system has not delivered the expected service. Failures are incorrect results with respect to a specification or unexpected behavior perceived by the consumer or user of a service. The cause of a failure is said to be a fault.

Mean time to failure (MTTF)


Hardware reliability can be predicted by statistically analyzing historical data. The longer a component operates, the more likely it is to fail due to aging. The mean time to failure of a component is just that: a statistical forecast of the average time to failure, under the modeling assumption that the failed system is not repaired. The greater the MTTF of a component, the less likely it is to fail. You can use the MTTF of a component (if it is known) as useful information for establishing preventive maintenance procedures, and its value for an overall system can be improved by carefully selecting your hardware and software.
MTTF = total hours of service of all devices / number of devices

In other words, MTTF is the total number of hours of service of all devices divided by the number of devices.

High Availability - Overview

Definition of High Availability



High Availability (HA) is the ability of a system to perform its function continuously (without interruption) for a significantly longer period of time than the reliabilities of its individual components would suggest. HA, then, is a trade-off between the cost of downtime and the cost of the protective measures that are available to avoid or reduce downtime.
The term High Availability, when applied to computer systems, means that the application or service in question is available all the time, regardless of time of day, location and other factors that can influence the availability of such an application.
In general, HA is the ability to continue a service for extremely long durations without any interruptions. Hence, HA is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.
HA systems should protect companies from two possible failures: system failures and site failures. Though true HA solutions should guard against both system and site failures, usually HA systems are regarded as protection from system failure, while site failure is typically protected by a Disaster Recovery (DR) system.

The Future of Computers - Artificial Intelligence

What is Artificial Intelligence?


The term “Artificial Intelligence” was coined in 1956 by John McCarthy, then at Dartmouth College, who defined it as the science and engineering of making intelligent machines.

Nowadays it is a branch of computer science that aims to make computers behave like humans. This field of research is defined as the study and design of intelligent agents, where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success.

This new science was founded on the claim that a central property of humans, intelligence—the sapience of Homo Sapiens—can be so precisely described that it can be simulated by a machine. This raises philosophical issues about the nature of the mind and the ethics of creating artificial beings, issues which have been addressed by myth, fiction and philosophy since antiquity.

Artificial Intelligence includes programming computers to make decisions in real-life situations (e.g. some of these “expert systems” help physicians diagnose diseases based on symptoms), programming computers to understand human languages (natural language processing), programming computers to play games such as chess and checkers (game playing), programming computers to hear, see and react to other sensory stimuli (robotics), and designing systems that mimic human intelligence by attempting to reproduce the types of physical connections between neurons in the human brain (neural networks).