The Availability Continuum
- Base-availability systems are ready for immediate use, but will experience both planned and unplanned outages.
- High-availability systems include technologies that sharply reduce the number and duration of unplanned outages. Planned outages still occur, but the servers include facilities that reduce their impact.
- Continuous-operations environments use special technologies to ensure that there are no planned outages for upgrades, backups, or other maintenance activities. Frequently, companies use high-availability servers in these environments to reduce unplanned outages.
- Continuous-availability environments go a step further to ensure that there are no planned or unplanned outages. To achieve this level of availability, companies must use dual servers or clusters of redundant servers in which one server automatically takes over if another server goes down.
- Disaster-tolerance environments require remote systems to take over in the event of a site outage. The distance between systems is very important in order to ensure that no single catastrophic event affects both sites. However, the price for distance is a loss of performance due to the latency the signal incurs in travelling between the sites.
Source: IBM
When the going gets tough, the tough get going.
While this age-old adage may apply to the human heart and mind, the response mechanism in man-made computer systems is not quite as resilient. Unlike the human heart, which is 99.999% available until the first heart attack causes an 'outage' of a few seconds, computer systems are not that infallible.
Systems that run mission-critical applications need to be designed in such a way that they never fail. The domain of computing science that deals with the span of solutions that make systems fail-proof is called High Availability (HA) computing. Ranging from phenomenally expensive fault-tolerant machines and high-availability clustering to a simple 'ready-to-plug' inventory of spares, these solutions offer varying degrees of system availability. System availability is measured as the percentage of time the system is available, typically ranging from 99.9% to 99.999%.
Pumping blood for the body
But the story goes beyond mere system availability. The challenge of putting up a highly available information system lies not only in the procurement of a sophisticated HA server but in extending HA performance to the rest of the technology and support components: the OS, middleware, application software, network, and power supplies, to name a few. Says ICICI Infotech joint president Manoj Kunkalienkar, "The procurement and design of an HA system is a heavy SI (systems integration) activity". ICICI Infotech has designed and runs the complete technology infrastructure for the entire ICICI Group, comprising scores of servers running a variety of mission-critical applications, made available across the country and run from two data centers in Mumbai. The system also has disaster-recovery capabilities.
It is important to understand that not everything has to be designed for 99.99% availability. Says Manoj, "We define HA from the perspective of the application being available to the user". The question business owners need to ask is: how much downtime can I afford? At ICICI, each of the business owners sat down to discuss the downtime affordable in their respective businesses. The availability features of the various applications were then designed according to the uptime required by the respective businesses. Critical applications like stock exchanges and reservation systems, where public transactions are involved, should have the highest levels of availability.
Hardware today is already highly reliable. For a single-CPU system, hardware availability is already at three nines or better, and for CompactPCI (a standard board form factor) systems with hot-swap capability, hardware availability of four nines is easily achievable at reasonable cost. Adds Manoj, "Most hardware today is extremely reliable and most of the servers don't fail". To reach five nines or more of hardware availability generally requires redundant hardware with automatic 'failover' capability. Many CompactPCI systems that support hot swap already have redundant power supplies and fans, and can be configured with redundant I/O and even redundant system-slot CPUs.
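The arithmetic behind that claim is worth making explicit. Below is a minimal sketch using standard independent-failure reliability math; the 0.999 figure is an assumed example value, not a measurement from the article, and real failover is never instantaneous.

```python
# A minimal sketch of the series/parallel availability arithmetic behind
# redundancy. Component availabilities are assumed example values.

def series(*avails):
    """System is up only if every component in the chain is up."""
    result = 1.0
    for a in avails:
        result *= a
    return result

def parallel(a, b):
    """System is up if at least one of a redundant pair is up."""
    return 1.0 - (1.0 - a) * (1.0 - b)

single = 0.999                    # "three nines" single-CPU system
pair = parallel(single, single)   # redundant pair, assuming perfect failover

print(f"single unit:    {single:.6f}")   # 0.999000
print(f"redundant pair: {pair:.6f}")     # 0.999999 in theory

# Because failover takes time, achieved availability lands somewhere
# between these two figures, which is why five nines generally requires
# redundant hardware with automatic failover rather than a single box.
```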
Operation patch-up
The most stringent requirements for availability demand continuous service with no loss of records of current transactions and history. A trading application at a stock exchange like NSE is indeed mission-critical, and such systems need machines that can keep running despite faults occurring. Such systems, from vendors like Compaq/Tandem and Stratus, are known as fault-tolerant systems. They guarantee continuous availability, both during equipment failure and during the subsequent return to service of repaired equipment. So-called fault-tolerant computers that merely add redundant power supplies and harden storage subsystems with various disk array (RAID, Redundant Array of Inexpensive Disks) configurations are neither fault-tolerant nor high-availability computers. Such high levels of continuous availability do not come without a significant price penalty. Fault-tolerant computers are not the norm for everyone who wants continuous availability, because they can turn out to be phenomenally expensive.
Availability Classes
Number of 9s         Downtime per year    Typical Application
3 Nines (99.9%)      ~9 hrs               Typical desktop or server
4 Nines (99.99%)     ~1 hr                Enterprise server
5 Nines (99.999%)    ~5 min               Carrier-class server
6 Nines (99.9999%)   ~31 seconds          Carrier switch equipment
The Classic Availability Equation: Availability = MTTF / (MTTF + MTTR), where MTTF is 'Mean Time to Failure' and MTTR is 'Mean Time to Repair'. This means the key to high availability is to create systems with very high reliability (high MTTF) or systems that can recover from failure very fast (very low MTTR).
Source: IMEX Research
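The table and the equation connect directly: each extra nine cuts annual downtime by a factor of ten. Here is a minimal worked sketch of that arithmetic; the MTTF/MTTR figures are invented for illustration, not IMEX Research data.

```python
# Availability = MTTF / (MTTF + MTTR); downtime per year = (1 - A) * one year.
# The MTTF/MTTR figures below are invented for illustration.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

def annual_downtime_minutes(avail: float) -> float:
    return (1.0 - avail) * MINUTES_PER_YEAR

# A server failing about once a year (MTTF ~8,760 hrs) that takes an hour
# to repair lands close to four nines, roughly an hour of downtime a year:
a = availability(8760, 1)
print(f"availability: {a:.5%}, downtime: {annual_downtime_minutes(a):.0f} min/yr")

# Reproducing the table's downtime column from the number of nines:
for nines in (3, 4, 5, 6):
    avail = 1 - 10 ** -nines
    print(f"{nines} nines -> {annual_downtime_minutes(avail):.1f} min/yr")
```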
Hardware capabilities of fault-managed systems can generally be categorized into three sets: redundancy to allow continued processing after a failure, highly reliable and redundant communication between components, and management of the components, including fault detection, diagnosis, isolation, recovery, and repair. The OS capabilities include functions to isolate, prevent the propagation of, or mask the impact of hardware and software faults. These functions help prevent faulty applications or faulty hardware from pulling the entire system down.
Cluster to cut costs
An option less expensive than a fault-tolerant server is to build an HA cluster. High availability is achieved through redundancy of components. This strategy, sometimes referred to as n+1 sparing, requires that one spare be provisioned for each class of spared components. High-availability (HA) computing, as commonly practiced, utilizes the redundant resources of clustered (two or more) processors. Such solutions address redundancy for all components of a system: processors, main and secondary memory, network interfaces, and power/cooling. While hardware redundancy may be effectively addressed by clustered (redundant) hardware resources, the class of errors detected is less comprehensive, and the time required to recover from errors is much longer than that of true FT machines. Still, fault recovery in tens of seconds from most common equipment failures can be achieved at less than half the cost of traditional FT computer systems, as the sketch below illustrates.
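To make the takeover mechanism concrete, here is a minimal sketch of the heartbeating-and-failover loop that cluster software builds on. It is an illustration only: the class, timeout value, and takeover actions are invented, not taken from any product mentioned in this article.

```python
import time

# Hypothetical two-node HA cluster monitor: the standby node watches
# heartbeats from the active node and takes over the load if they stop.

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before declaring the peer dead

class StandbyNode:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False

    def on_heartbeat(self):
        # Called each time a heartbeat message arrives from the active node.
        self.last_heartbeat = time.monotonic()

    def check(self):
        # Run periodically: fail over once the active node has gone quiet.
        silence = time.monotonic() - self.last_heartbeat
        if not self.active and silence > HEARTBEAT_TIMEOUT:
            self.take_over()

    def take_over(self):
        # A real takeover would claim the service IP address, mount shared
        # storage, and restart the application, which is why cluster
        # recovery takes tens of seconds rather than being instantaneous.
        print("peer silent for too long; standby assuming the workload")
        self.active = True
```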
A common thread in all HA planning is redundancy, and the elimination of single points of failure (SPOFs). For instance, the primary design of Air India's computer center was done way back in the 1980s to ensure high availability. Says MSV Rao, Director-IT, Air India, "We are probably reaching four nines of availability". AI uses systems with dual redundant CPUs, with each file duplexed. But bottlenecks have been found outside the server or system hardware, in the network and communication links. Adds MSV Rao, "We follow the classical definition of having no single point of failure, and on that count I find the condition of network links to be somewhat less reliable from a HA point of view". AI runs on four Unisys mainframes and a collection of Sun SPARC, Sun E3000, IBM RS/6000, and SGI servers. Mainframes typically have HA capabilities built in, or at least concepts that should be borrowed from mainframes onto open systems. Says Manoj of ICICI Infotech, "Mainframes had a nice concept: disks could be partitioned and could be made independently bootable". Redundancies have been built into the Air India systems at various levels: alternate network paths through different modes and service providers, power coming in from two different substations, and the like. Thoughtful planning of system configurations with an eye on eliminating SPOFs can pay off in terms of achieved availability.
If the pump fails…
In planning for HA applications, you must take into account the possibility of a catastrophe. If availability is required even in these circumstances, you must have arrangements with data centers in other parts of the country to provide temporary equipment and services, and your backup and roll-forward plans must be made accordingly. The key is to have a redundant facility that is far enough away not to be susceptible to loss of service from the same event, but close enough for key people to get there quickly if necessary. NSE has shifted its mirror site from Pune to Chennai in order to ensure that the two sites are in different seismic zones.
But low-budget HA options are available too. Advises NSE IT CEO Satish Naralkar, "Have an inventory of standby spare parts. This makes it much simpler administratively. Many of the components are hot swappable, that is, you can replace faulty hardware on the fly". Note, however, that keeping a 100% inventory at all times may work out costlier than having a full standby system. The key is to know the failure history from experience and decide which spares to keep. One needs to plan for HA only as much as the business demands. Sometimes, a combination of approaches may be the solution. NSE could be termed a good case of having a line-up of availability solutions matched to business needs: a Stratus FT machine for its trading systems, clustered HA systems for the back end, less critical applications on cold standby, and critical spares for some applications. HA, however, is not all about shelling out money on redundant systems and recovery centers. Cautions Naralkar, "The key is to allow no SPOF at all points of failure. It is like carrying a spare tire that too is flat!" Incidents such as 9/11 have created interest in HA. Still, for a large number of companies, HA remains on the wish list. As Naralkar points out, "In such a situation people fall back on the belief that such an eventuality may never hit them".
Easwardas Satyan in Mumbai
The ‘Availability’ Jargon Buster
2N: A redundant system in which there is a hot spare for every
(critical) component.
3N: A redundant system in which there are three of every (critical) component, and a voting mechanism to prevent a failure from causing errors (see TMR, triple modular redundancy).
Cluster: A collection of two or more systems, all of which do useful work, in which the remaining system(s) can take over the load if one of the systems fails.
Cold spare: A spare component that is available to replace a failed
component, but is not installed.
Cold restart: Restarting a device from power off or with no state
information.
Fault tolerance: The system attribute of being able to operate
correctly while faults are occurring.
Heartbeating: Sending a signal from one component to the other to show
that the sending unit is still functioning correctly.
Hot restart: Restarting a component while the system is still operational or partially operational.
Hot spare: A spare component that is constantly updated with the state
of the component that it will take over in case of a failure.
Hot swap: A term for removing a failed piece of hardware and replacing
it with a cold spare in a ‘hot’ (operational) system (without bringing it
down).
MTTF: Mean time to failure, which is the average time between one failure and the next, as measured over a large number of failures.
MTTR: Mean time to repair, which is the average time it takes to repair a component or system, as measured over a large number of repairs.
N+M redundancy: A system that has N components working in normal operation and M components in standby. This is frequently reduced to N+1, where there is one standby component for the N operational components.
RAID: Redundant Array of Inexpensive Disks. RAID may be used to provide high availability, larger virtual disks, or both. RAID technology is split into levels (Linear, 0, 1, 2, 3, 4, 5) which define various characteristics.
Reboot: To restart a computer. Cold reboot occurs when the system is
powered off to reboot. Warm reboot occurs when the hardware is left running, but
the OS is restarted.
Rollback: The process of repairing the damage to the state of a system
caused by hardware or software failure.
Rollforward: The process of bringing the state of a system or software
component to a level at which it can be placed into service.
SPOF: Single point of failure. If a component is a SPOF, it is
critical that it be as reliable as possible, since the reliability of the system
can be no better than that of the SPOF component.
Warm spare: A spare component which is powered up and ready to be
configured to take over in case of a failure.