Measuring Availability
The need for availability is governed by business objectives, and measuring it has three primary goals:
- To provide an availability baseline (and maintain it);
- To help identify where to improve the systems;
- To monitor and control improvement projects.
It is important to recognize that demanding availability targets can be difficult to achieve, since time is needed to recover from outages. The length of the recovery time depends on the following factors:
- Complexity of the system: The more complicated the system, the longer it takes to restart. Hence, outages that require a system shutdown and restart can dramatically affect your ability to meet a challenging availability target. For example, applications running on a large server can take up to an hour just to restart when the system has been shut down normally, and longer still if the system was terminated abnormally and data files must be recovered.
- Severity of the problem: Usually, the greater the severity of the problem, the more time is needed to fully resolve it, including restoring lost data or work.
- Availability of support personnel: Say the outage occurs after office hours. A support person who is called in after hours could easily take an hour or two simply to arrive and begin diagnosing the problem. You must allow for this possibility.
- Other factors: Many other factors can prevent the immediate resolution of an outage. Sometimes an application suffers an extended outage simply because the system cannot be taken offline while other applications are running. Other cases may involve a lack of replacement hardware from the system supplier, or even a lack of support staff.
Availability Metrics
- Mean Time to Repair (MTTR)
- Impacted User Minutes (IUM)
- Defects per Million (DPM)
- Mean Time Between Failures (MTBF)
- Performance (e.g. latency, drops)
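As a quick sketch of how these metrics fit together, here is a small Python function (hypothetical, written purely for illustration) that derives them from a list of outage durations:

def availability_metrics(outage_hours, period_hours):
    # outage_hours: duration of each outage in the period, in hours
    # period_hours: length of the reporting period, in hours
    failures = len(outage_hours)
    downtime = sum(outage_hours)
    uptime = period_hours - downtime
    mttr = downtime / failures                  # Mean Time to Repair
    mtbf = uptime / failures                    # Mean Time Between Failures
    availability = uptime / period_hours * 100  # percent
    dpm = downtime / period_hours * 1_000_000   # Defects per Million
    return mttr, mtbf, availability, dpm

# Example: three outages totalling 4 hours in a 720-hour (30-day) month
print(availability_metrics([2.0, 0.5, 1.5], 720))

For this sample data, the function reports an MTTR of about 1.3 hours, an MTBF of about 239 hours, 99.4% availability, and roughly 5,556 DPM.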
Typically, a well-built HA system should achieve greater than 99% availability. However, as the required availability of a system increases beyond 90%, the cost of building it rises dramatically, and not in simple proportion to the gain in availability. For example, an increase in availability from 90% to 95% does not necessarily mean an increase of 5% in the IT budget; the calculation is much more complicated than that.
Similarly, achieving 99.999% availability requires a lot of effort and money. This cost is often more than the cost of the original equipment required for 90% availability. In addition to money and machines, a highly trained and skilled IT staff is also required.
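To put these targets in perspective, over an 8,760-hour year each extra nine shrinks the downtime budget by a factor of ten:
- 99% availability allows 87.6 hours of downtime per year;
- 99.9% allows 8.76 hours;
- 99.99% allows about 53 minutes;
- 99.999% allows just over 5 minutes.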
High Availability Examples
Let's now take a look at some practical examples of availability measurement.
Committed hours of availability (A)
This is usually measured as the number of hours per month, or over any other period suitable to the organization.
Example:
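Suppose, for illustration (these figures are assumed, not taken from a real system), that a system is committed to be available 24 hours a day over a 30-day month:

A = 24 × 30 = 720 committed hours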
Outage hours (B)
This is the number of hours of outage during the committed hours of availability. If a high availability level is desired, consider only the unplanned outages. For continuous operations, consider only the scheduled outages. For continuous availability, consider all outages.
Example:
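Continuing the illustration, suppose the system suffered 8 hours of unplanned outage during those committed hours:

B = 8 outage hours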
Achieved availability
Next, you can calculate the availability achieved as follows:

Achieved availability (%) = ((A − B) / A) × 100

For the statistics in the examples above, here's the calculation:
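With the illustrative figures above (A = 720 hours, B = 8 hours):

Achieved availability = ((720 − 8) / 720) × 100 ≈ 98.9%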
Another example:
What is the availability of a computer with MTBF = 10,000 hrs and MTTR = 12 hrs?

Availability = MTBF / (MTBF + MTTR) = 10,000 / (10,000 + 12) ≈ 99.88%

The annual uptime is: 0.9988 × 8,760 hrs ≈ 8,749.5 hrs
Conversely, the annual downtime is: 8,760 − 8,749.5 ≈ 10.5 hrs
Availability (%) is calculated by tabulating end-user outage time, typically on a monthly basis but sometimes on a yearly basis.
Example: For 98% availability, the annual uptime is 0.98 × 8,760 hrs = 8,584.8 hrs, leaving 175.2 hrs (about 7.3 days) of downtime per year.
Some prefer to use DPM (Defects per Million) to represent system (or network) availability:

DPM = (total downtime / total service time) × 1,000,000

where both times are summed across all users (or customers) in the reporting period.
Another example:
- System has 100 customers
- Time in reporting period is one year, or 24 hours × 365 days = 8,760 hours
- 8 customers have 24 hours of downtime per year
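Plugging these numbers into the DPM formula:

Total downtime = 8 customers × 24 hours = 192 customer-hours
Total service time = 100 customers × 8,760 hours = 876,000 customer-hours
DPM = (192 / 876,000) × 1,000,000 ≈ 219 defects per million

which is equivalent to about 99.978% availability.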
One more example: there are 1,000 users in a company, and last month 30 users were down for 60 minutes:
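Assuming a 30-day month (43,200 minutes), the Impacted User Minutes and the corresponding availability work out to:

IUM = 30 users × 60 minutes = 1,800 impacted user minutes
Total user minutes = 1,000 users × 43,200 minutes = 43,200,000
Availability = (1 − 1,800 / 43,200,000) × 100 ≈ 99.996%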
2 comments:
How would you calculate the availability for n sites where each site has an availability of 3-9s? n=2, n=3, or 4 or 5?
How many resilient (seamless failover) 3-9s sites would you have to have to reach 5-9s availability?
What if any given site required 30 minutes to fail over to another site?
Thanks!
I guess the answer depends on who you are... If you are a tech guy trying to pressure the ones running the budget, then 5 nines will only be real when your whole network is homogeneous, i.e., all your LANs/storage/servers have 5-nines availability.
On the other hand, if you are running the budget yourself, then you can get into complicated mathematics trying to prove otherwise.
In my view, a chain is only as strong as its weakest link. Got it? If you have several points with 3 nines, then the global network is only 3 nines.
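To put rough numbers on that (assuming the sites fail independently): if the sites are in series, i.e., all must be up, the availabilities multiply, so two 3-nines sites give 0.999 × 0.999 ≈ 99.8%. If the sites are truly redundant with seamless failover, i.e., only one must be up, the unavailabilities multiply instead: 1 − (1 − 0.999)^2 = 99.9999%, which already exceeds five nines. A 30-minute failover, however, counts as downtime on every failure, so a real-world figure lands somewhere in between.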
Thanks for reading my stuff :-)