
High Availability Storage (II)


Storage Area Network


A Storage Area Network (SAN) is a dedicated high-performance subnet that provides access to consolidated, block-level data storage. It is primarily used to transfer data between computer systems and storage elements, and among storage elements themselves, making storage devices such as disk arrays, tape libraries, and optical jukeboxes accessible to servers so that the devices appear to the operating system like locally attached devices.

Storage Area Network

A SAN typically has its own communication infrastructure that is generally not accessible through the local area network by other devices. A SAN moves data among various storage devices, allows data to be shared between different servers, and provides a fast connection medium for backing up, restoring, archiving, and retrieving data. SAN devices are usually installed close together in a single room, but they can also be connected over long distances, which makes SANs very useful to large companies.
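As a concrete illustration of the "appears like a locally attached device" property, the short Python sketch below simply lists the block devices a Linux host exposes under /sys/block; SAN LUNs presented over Fibre Channel or iSCSI show up there alongside local disks. This assumes a Linux host and is only an illustrative sketch, not part of any SAN product.

    # Minimal sketch (Linux assumed): SAN LUNs appear under /sys/block just
    # like locally attached disks, which is the property described above.
    from pathlib import Path

    def list_block_devices():
        """Return (name, size in GiB) for every block device the kernel exposes."""
        devices = []
        for dev in sorted(Path("/sys/block").iterdir()):
            sectors = int((dev / "size").read_text())   # size is reported in 512-byte sectors
            devices.append((dev.name, sectors * 512 / 2**30))
        return devices

    if __name__ == "__main__":
        for name, gib in list_block_devices():
            print(f"{name}: {gib:.1f} GiB")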

SAN Benefits


The primary benefits of a SAN are:

  • High Availability: One copy of every piece of data is always accessible to any and all hosts via multiple paths (see the path-failover sketch after this list);
  • Reliability: Dependable data transportation ensures a low error rate and fault-tolerance capabilities;
  • Scalability: Servers and storage devices may be added independently of one another, and independently of any proprietary systems;
  • Performance: Fibre Channel (the standard method for SAN interconnectivity) now provides more than 2000 MB/s of bandwidth with low overhead, and it separates storage I/O from network I/O.
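The toy Python sketch below illustrates the multi-path idea behind the High Availability benefit: the same LUN is reachable through more than one path, and the reader falls back to an alternate path when the preferred one fails. In real deployments this is handled by the operating system's multipath driver rather than application code, and the device names shown are hypothetical.

    # Illustrative only: a toy "multipath" reader that tries each path to the
    # same LUN in turn. Real systems use the OS multipath driver for this.
    class PathFailedError(Exception):
        pass

    def read_block(paths, offset, length):
        """Try every known path to the shared LUN until one succeeds."""
        last_error = None
        for path in paths:
            try:
                with open(path, "rb") as lun:
                    lun.seek(offset)
                    return lun.read(length)
            except OSError as exc:       # this path is down: try the next one
                last_error = exc
        raise PathFailedError(f"all paths failed: {last_error}")

    # Hypothetical device nodes for two HBA ports reaching the same LUN:
    # data = read_block(["/dev/sdc", "/dev/sdf"], offset=0, length=4096)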

High Availability Storage (I)


RAID Concepts


The acronym RAID stands for Redundant Array of Inexpensive Disks and refers to a technology that provides increased storage functionality and reliability through redundancy. It was developed by linking together a large number of low-cost hard drives to form a single large-capacity storage device that offered superior performance, storage capacity, and reliability over older storage systems. This is achieved by combining multiple disk drive components into a logical unit, where data is distributed across the drives in one of several ways called "RAID levels".
This is a concept of storage virtualization. The acronym was first defined as Redundant Arrays of Inexpensive Disks, but the term later evolved into Redundant Array of Independent Disks as a means of dissociating the low-cost expectation from RAID technology.
There are two primary reasons that RAID was implemented:
  • Redundancy: This is the most important factor in the development of RAID for server environments. A typical RAID system assures some level of fault tolerance by providing real-time data recovery with uninterrupted access when a hard drive fails (a minimal parity sketch follows this list);

  • Increased Performance: Increased performance is only obtained with specific RAID levels, and it also depends on the number of drives used in the array and on the controller.
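As a minimal illustration of the redundancy idea, the Python sketch below computes an XOR parity block in the style of parity RAID levels (such as RAID 5) and rebuilds a lost data block from the surviving blocks. It is a toy example under simplified assumptions, not an actual RAID implementation.

    # Parity RAID in miniature: the parity block is the XOR of the data
    # blocks, so any single lost block can be rebuilt from the survivors.
    def xor_blocks(blocks):
        """XOR a list of equally sized byte blocks together."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    data = [b"AAAA", b"BBBB", b"CCCC"]   # stripe units on three data drives
    parity = xor_blocks(data)            # parity unit stored on a fourth drive

    # Simulate losing the second drive and rebuilding its contents:
    recovered = xor_blocks([data[0], data[2], parity])
    assert recovered == data[1]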

Hardware-based RAID


When using hardware RAID controllers, all RAID algorithms are executed on the RAID controller board, thus freeing the server CPU. On a desktop system, a hardware RAID controller may be a PCI or PCIe expansion card or a component integrated into the motherboard. Hardware implementations are more robust and fault tolerant than software RAID, but they require a dedicated RAID controller to work.

Hardware implementations provide guaranteed performance, add no computational overhead to the host computer, and can support many operating systems; the controller simply presents the RAID array as another logical drive.

Software-based RAID


Many operating systems provide functionality for implementing software-based RAID systems, where the OS executes the RAID algorithms on the server CPU. The burden of RAID processing is therefore borne by the host computer's central processing unit rather than by a dedicated RAID controller, which can severely limit RAID performance.

Although cheap to implement, software RAID does not guarantee fault tolerance at the server level: should the server fail, the whole RAID system is lost.
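The toy Python sketch below makes the host-CPU point concrete: a RAID 1 style mirrored write is performed entirely in host code, once per mirror member. The image file names stand in for member disks and are purely illustrative.

    # Toy illustration (not a real driver): in software RAID the host CPU does
    # the RAID work itself, here by duplicating every write to two backing
    # files that stand in for two member disks (assumed to already exist).
    def mirrored_write(block, offset, members=("disk0.img", "disk1.img")):
        """Write the same block to every member, as a RAID 1 mirror would."""
        for member in members:
            with open(member, "r+b") as disk:
                disk.seek(offset)
                disk.write(block)

    # Every write costs the host CPU (and I/O path) once per mirror member,
    # which is the processing burden referred to above.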

Hot spare drive


Both hardware and software RAIDs with redundancy may support the use of hot spare drives: a drive physically installed in the array that remains inactive until an active drive fails. The system then automatically replaces the failed drive with the spare and rebuilds the array with the spare drive included. This reduces the mean time to recovery (MTTR) but does not completely eliminate it: subsequent additional failures in the same RAID redundancy group before the array is fully rebuilt can result in data loss. Rebuilding can take several hours, especially on busy systems.
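The following Python sketch simulates the hot spare workflow for a simple two-way mirror: after a failure, the spare is rebuilt from the surviving copy, and the array remains degraded until the rebuild loop finishes. The drives are modelled as plain dictionaries and all names are hypothetical.

    # Toy simulation of a hot spare in a two-way mirror.
    def rebuild_onto_spare(surviving_drive, spare_drive):
        """Copy every block from the surviving mirror member onto the spare.

        The array stays degraded until this loop finishes; another failure in
        the same redundancy group during that window can cause data loss.
        """
        for block_id, block in surviving_drive.items():
            spare_drive[block_id] = block
        return spare_drive

    drive_a = {0: b"alpha", 1: b"beta"}   # surviving mirror member
    hot_spare = {}                        # inactive until a failure occurs
    rebuild_onto_spare(drive_a, hot_spare)
    assert hot_spare == drive_a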

Failover Clustering (IV)

 

Cluster Node Configurations


The most common size for a high-availability cluster is a two-node cluster, since that is the minimum required to assure redundancy, but many clusters consist of far more nodes, sometimes dozens. Such configurations can be categorized into one of the following models:

Active/Passive Cluster


In an Active/Passive (or asymmetric) configuration, applications run on a primary, or master, server. A dedicated redundant server is present to take over on any failure, but apart from that it is not configured to perform any other functions. Thus, at any time, one of the nodes is active and the other is passive. This configuration provides a fully redundant instance of each node, which is only brought online when its associated primary node fails.
The active/passive cluster generally contains two identical nodes. Single instances of the database applications are installed on both nodes, but the database itself is located on shared storage. During normal operation, the database instance runs only on the active node. In the event of a failure of the currently active primary system, the clustering software transfers control of the disk subsystem to the secondary system. As part of the failover process, the database instance on the secondary node is started, thereby resuming the service.
 
Active/Passive Cluster

This configuration is the simplest and most reliable, but it typically requires the most extra hardware.
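A hedged sketch of this failover sequence is given below in Python: when the primary is unhealthy, control of the shared disk subsystem moves to the standby and the database instance is started there. The classes and names are toy stand-ins, not any particular cluster product's API.

    # Toy active/passive failover: transfer the shared disk and start the
    # database service on the standby node when the primary has failed.
    class Node:
        def __init__(self, name, healthy=True):
            self.name, self.healthy, self.services = name, healthy, []

    def fail_over(primary, standby, shared_disk_owner):
        """Return the node that owns the shared disk after the health check."""
        if primary.healthy:
            return shared_disk_owner                  # nothing to do
        shared_disk_owner = standby                   # clustering software moves the disk subsystem
        standby.services.append("database-instance")  # start the DB instance, resuming the service
        return shared_disk_owner

    node1, node2 = Node("node1", healthy=False), Node("node2")
    owner = fail_over(node1, node2, shared_disk_owner=node1)
    print(owner.name, node2.services)   # -> node2 ['database-instance']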

Failover Clustering (III)


Cluster Quorum Configurations


In simple terms, the quorum for a cluster is the number of elements that must be online for that cluster to continue running. Server clusters require a quorum resource to function; this resource, like any other, can be owned by only one server at a time, and servers can negotiate for its ownership. In effect, each element can cast one "vote" to determine whether the cluster continues running. The voting elements are nodes or, in some cases, a disk witness or file share witness. The quorum resource is used to store the definitive copy of the cluster configuration so that, regardless of any sequence of failures, the cluster configuration always remains consistent. Each voting element (with the exception of a file share witness) contains a copy of the cluster configuration, and the Cluster service works to keep all copies synchronized at all times.

When network problems occur, they can interfere with communication between cluster nodes. A small set of nodes might be able to communicate together across a functioning part of a network but not be able to communicate with a different set of nodes in another part of the network. This can cause serious issues. In this "split" situation, at least one of the sets of nodes must stop running as a cluster.
Negotiating for the quorum resource allows server clusters to avoid "split-brain" situations, in which multiple servers are active and each thinks the others are down.
To prevent the issues that are caused by a split in the cluster, the cluster software requires that any set of nodes running as a cluster use a voting algorithm to determine whether, at a given time, that set has quorum. Because a given cluster has a specific set of nodes and a specific quorum configuration, the cluster knows how many "votes" constitute a majority (that is, a quorum). If the number drops below the majority, the cluster stops running. Nodes will still listen for the presence of other nodes, in case another node appears again on the network, but the nodes will not begin to function as a cluster until quorum exists again.
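The voting rule itself reduces to a majority test, as in the minimal Python sketch below: a set of nodes keeps running as the cluster only if its voting elements (nodes, plus an optional disk or file share witness) hold more than half of the configured votes. The vote counts in the example are illustrative.

    # Majority-vote quorum test: a partition continues as the cluster only if
    # it holds more than half of the total configured votes.
    def has_quorum(votes_in_partition, total_votes):
        """True if this set of nodes may continue running as the cluster."""
        return votes_in_partition > total_votes // 2

    # Example: a 4-node cluster plus a disk witness has 5 total votes.
    total = 5
    print(has_quorum(3, total))   # True  -> this partition keeps running
    print(has_quorum(2, total))   # False -> these nodes stop acting as a cluster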

Failover Clustering (II)


The failover mechanism


In the cluster, shared resources can be seen from all computers, or nodes. Each node automatically senses whether another node in the cluster has failed, and processes running on the failed node continue to run on an operational one. To the user, this failover is usually transparent. Depicted in the next figure is a two-node cluster; both nodes have local disks, and shared resources are available to both computers. The heartbeat connection allows the two computers to communicate and to detect when one fails; if Node 1 fails, the clustering software will seamlessly transfer all services to Node 2.
 
The failover mechanism
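The heartbeat-based detection can be sketched as follows in Python: if the peer's heartbeat has been silent for longer than a timeout, the surviving node takes over the peer's services. The timeout value, service names, and take_over callback are assumptions for illustration only.

    # Toy heartbeat monitor for the two-node layout described above.
    import time

    HEARTBEAT_TIMEOUT = 5.0   # seconds of silence before declaring the peer failed

    def monitor(last_heartbeat, services, take_over):
        """Fail over the peer's services if its last heartbeat is stale."""
        if time.monotonic() - last_heartbeat > HEARTBEAT_TIMEOUT:
            for service in services:
                take_over(service)   # clustering software restarts it on the surviving node

    # Example wiring (all names hypothetical):
    # monitor(last_heartbeat=peer_timestamps["node1"],
    #         services=["file-share", "database"],
    #         take_over=lambda s: print(f"starting {s} on node 2"))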
 
The Windows service associated with the failover cluster named Cluster Service has a few components:
  • Event processor;
  • Database manager;
  • Node manager;
  • Global update manager;
  • Communication manager;
  • Resource (or failover) manager.
The resource manager communicates directly with a resource monitor that talks to a specific application DLL that makes an application cluster-aware. The communication manager talks directly to the Windows Winsock layer (a component of the Windows networking layer).
The failover process is as follows: