
When the Heart Stops Beating...


The Availability Continuum

  • Base-availability systems are ready for immediate use, but will experience both planned and unplanned outages.
  • High-availability systems include technologies that sharply reduce the number and duration of unplanned outages. Planned outages still occur, but the servers include facilities that reduce their impact.
  • Continuous-operations environments use special technologies to ensure that there are no planned outages for upgrades, backups, or other maintenance activities. Frequently, companies use high-availability servers in these environments to reduce unplanned outages.
  • Continuous-availability environments go a step further to ensure that there are no planned or unplanned outages. To achieve this level of availability, companies must use dual servers or clusters of redundant servers in which one server automatically takes over if another server goes down.
  • Disaster-tolerance environments require remote systems to take over in the event of a site outage. The distance between systems is very important to ensure that no single catastrophic event affects both sites. However, the price of distance is a loss of performance due to the latency of signals travelling between the sites (a rough sketch of this cost follows this list).

Source: IBM
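
To see why distance costs performance, consider the physics alone: a signal in optical fibre travels at roughly 200,000 km per second, so every synchronous write to a remote mirror pays at least one round trip. A back-of-the-envelope sketch of that floor (the fibre speed is an approximation, and real links add switching and routing delays on top):

```python
# Back-of-the-envelope: minimum replication latency vs. site separation.
# Assumes a signal speed in optical fibre of ~200,000 km/s (about 2/3 the
# speed of light in vacuum); real links add switching and routing delays.

FIBRE_SPEED_KM_PER_S = 200_000

def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip time, in milliseconds, over a fibre link."""
    return 2 * distance_km / FIBRE_SPEED_KM_PER_S * 1000

for km in (100, 500, 1000, 5000):
    print(f"{km:>5} km apart -> at least {round_trip_ms(km):5.1f} ms per synchronous write")
```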

When the going gets tough, the tough get going.


While this age-old adage may apply to the human heart and mind, the response mechanism in man-made computer systems is not quite as strong. Unlike the human heart, which is 99.999% available until the first heart attack causes an ‘outage’ of a few seconds, computer systems are not that infallible.

Systems that run mission-critical applications need to be designed so that they never fail. The domain of computing science that deals with the span of solutions that make systems fail-proof is called High Availability (HA) computing. Ranging from phenomenally expensive fault-tolerant machines and high-availability clustering to a simple ‘ready-to-plug’ inventory of spares, these solutions offer system availability. System availability is measured as the percentage of time the system is available, typically ranging from 99.9% to 99.999%.
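
Those percentages translate directly into downtime budgets. A minimal sketch of the arithmetic, assuming a 365-day year:

```python
# Downtime implied by an availability percentage, assuming a 365-day year.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def downtime_per_year_hours(availability_pct: float) -> float:
    """Hours of downtime per year at the given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999):
    hours = downtime_per_year_hours(pct)
    print(f"{pct}% available -> {hours:6.2f} h/year ({hours * 60:6.1f} min)")
```

At five nines, the entire year's downtime budget is barely five minutes.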

Pumping blood for the body



But the story goes beyond mere system availability. The challenge of putting up a highly available information system lies not only in procuring a sophisticated HA server but in extending HA to the rest of the technology and support components: the OS, middleware, application software, network, and power supplies, to name a few. Says ICICI Infotech joint president Manoj Kunkalienkar, "The procurement and design of an HA system is a heavy SI (systems integration) activity". ICICI Infotech has designed and runs the complete technology infrastructure for the entire ICICI Group, comprising scores of servers running a variety of mission-critical applications made available across the country and run from two data centers in Mumbai. The system also has disaster-recovery capabilities.

Advertisment

“If you do not conduct regular fitness tests on your main and stand-by systems, it is tantamount to carrying a spare tire that’s flat!”
Satish Naralkar, NSE IT

It is important to understand that not everything has to be designed for 99.99% availability. Says Manoj, "We define HA from the perspective of the application being available to the user". The question business owners need to ask is: how much downtime can I afford? At ICICI, each of the business owners sat down to discuss the downtime affordable in their respective businesses. The availability features of various applications were then designed according to the uptime required for each business. Critical applications like stock exchanges and reservation systems, where public transactions are involved, should have the highest levels of availability.

Hardware today is already highly reliable. For a single-CPU system, hardware availability is already at three nines or better, and for CompactPCI (a standard board form factor) systems with hot-swap capability, hardware availability of four nines is easily achievable at reasonable cost. Adds Kunkalienkar, "Most hardware today is extremely reliable and most of the servers don’t fail". To reach five nines or more of hardware availability generally requires redundant hardware with automatic ‘failover’ capability. Many CompactPCI systems that support hot swap already have redundant power supplies and fans, and can be configured with redundant I/O and even redundant system-slot CPUs.
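
The arithmetic behind that jump from four to five nines is simple probability. Assuming failures are independent and failover is instant and perfect (neither quite holds in practice, so real-world gains are smaller), a redundant pair is down only when both units are down at the same time:

```python
# How redundancy lifts availability, assuming independent failures and
# instant, perfect failover: a redundant pair is unavailable only when
# both units happen to be down at once.

def redundant_availability(single: float, copies: int = 2) -> float:
    """Availability of N independent redundant units (any one suffices)."""
    return 1 - (1 - single) ** copies

single = 0.999  # a three-nines unit
print(f"one unit:  {single:.6f}")                          # 0.999000
print(f"two units: {redundant_availability(single):.6f}")  # 0.999999
```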

Advertisment

Operation patch-up



The most stringent requirements for availability demand continuous service with no loss of records of current transactions and history. A trading application at a stock exchange like NSE is indeed mission-critical, and such systems need machines that can keep running even while faults occur. Such systems, from vendors like Compaq/Tandem and Stratus, are known as fault-tolerant (FT) systems. They guarantee continuous availability, both during equipment failure and during the subsequent return to service of repaired equipment. So-called fault-tolerant computers that merely add redundant power supplies and harden storage subsystems with various disk-array (RAID, Redundant Array of Inexpensive Disks) configurations are neither fault-tolerant nor high-availability computers. Such high levels of continuous availability do not come without a significant price penalty: fault-tolerant computers are not the norm for everyone who wants continuous availability, because they can turn out to be phenomenally expensive.

Classes of High-Availability Systems

Number of 9s          Downtime per year   Typical application
3 Nines (99.9%)       ~9 hrs              Typical desktop or server
4 Nines (99.99%)      ~1 hr               Enterprise server
5 Nines (99.999%)     ~5 min              Carrier-class server
6 Nines (99.9999%)    ~31 seconds         Carrier switch equipment

The classic availability equation: Availability = MTTF / (MTTF + MTTR), where MTTF is ‘Mean Time to Failure’ and MTTR is ‘Mean Time to Repair’. This means the key to high availability is to create systems with very high reliability (high MTTF) or systems that can recover from failure very fast (very low MTTR).

Source: IMEX Research
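
A worked example of the equation, with illustrative numbers (real MTTF and MTTR figures are measured, not assumed): a server that fails on average once a year and takes four hours to repair.

```python
# The classic availability equation: Availability = MTTF / (MTTF + MTTR).
# The numbers below are purely illustrative.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

# One failure a year on average (MTTF = 8,760 h), repaired in 4 hours:
a = availability(mttf_hours=8760, mttr_hours=4)
print(f"availability = {a:.5f} (~{(1 - a) * 8760:.1f} h downtime/year)")

# Halving MTTR is roughly as effective as doubling MTTF:
print(f"with 2 h repairs: {availability(8760, 2):.5f}")
```

Both levers matter: making failures rarer, and recovering from them faster.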

Hardware capabilities of fault-managed systems can generally be categorized into three sets: redundancy to allow continued processing after failure, highly reliable and redundant communication between components, and management of the components, including fault detection, diagnosis, isolation, recovery, and repair. The OS capabilities include functions to isolate, prevent the propagation of, or mask the impact of hardware and software faults. These functions help prevent faulty applications or faulty hardware from pulling the entire system down.

Advertisment

Cluster to cut costs



An option less expensive than a fault-tolerant server is to build an HA cluster. High availability is achieved through redundancy of components. This strategy, sometimes referred to as N+1 sparing, requires that one spare be provisioned for each class of spared component. High-availability (HA) computing, as commonly practiced, utilizes the redundant resources of clustered (two or more) processors. Such solutions address redundancy for all components of a system: processors, main and secondary memory, network interfaces, and power/cooling. While hardware redundancy may be effectively addressed by clustered (redundant) hardware resources, the class of errors detected is less comprehensive and the time required to recover from errors is much longer than that of true FT machines. Still, fault recovery in tens of seconds from the most common equipment failures can be achieved at less than half the cost of traditional FT computer systems.
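
The takeover mechanism in such a cluster usually rests on heartbeating (see the jargon buster at the end of this article). A toy sketch of the idea, with illustrative interval and threshold values; real cluster software is far more involved:

```python
# Toy sketch of heartbeat-based failover in an HA pair (illustrative only).
# The standby tracks the primary's periodic heartbeat; after enough silence
# it promotes itself and takes over the workload.
import time

HEARTBEAT_INTERVAL_S = 1.0  # how often the primary is expected to beat
MISSED_BEATS_LIMIT = 3      # declare the primary dead after this many misses

class Standby:
    def __init__(self) -> None:
        self.last_beat = time.monotonic()
        self.active = False

    def on_heartbeat(self) -> None:
        """Called whenever a heartbeat arrives from the primary."""
        self.last_beat = time.monotonic()

    def check(self) -> None:
        """Called periodically; promotes the standby if the primary is silent."""
        silence = time.monotonic() - self.last_beat
        if not self.active and silence > MISSED_BEATS_LIMIT * HEARTBEAT_INTERVAL_S:
            self.active = True  # take over the primary's workload
            print("primary silent; standby taking over")

standby = Standby()
standby.on_heartbeat()  # a beat arrives: the primary is alive
standby.check()         # still within the window, so no takeover yet
```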

“The question business owners need to ask is: how much downtime can I afford?”
Manoj Kunkalienkar, ICICI Infotech

A common thread in all HA planning is redundancy, and the elimination of single points of failure (SPOFs). For instance, the primary design of Air India’s computer center was done way back in the 1980s to ensure high availability. Says MSV Rao, Director-IT, Air India, "We are probably reaching four nines of availability". AI uses systems with dual redundant CPUs, with each file getting duplexed. But bottlenecks have been found outside the server or system hardware, in the network and communication links. Adds Rao, "We follow the classical definition of having no single point of failure, and on that count I find the condition of network links to be somewhat less reliable from a HA point of view". AI runs on four Unisys mainframes and a collection of Sun SPARC, Sun E3000, IBM RS/6000, and SGI servers. Mainframes typically have some HA capabilities built in, or at least some concepts worth borrowing onto open systems. Says Manoj of ICICI Infotech, "Mainframes had a nice concept: disks could be partitioned and could be made independently bootable". Redundancies have been built into the Air India systems at various levels: alternate network paths through different modes and service providers, power coming in from two different substations, and the like. Thoughtful planning of system configurations with an eye on eliminating SPOFs can pay off in terms of achieved availability.

Advertisment

If the pump fails…



In planning for HA applications, you must take into account the possibility of a catastrophe. If availability is required in such circumstances, you must have arrangements with data centers in other parts of the country to provide temporary equipment and services, and your backup and roll-forward plans must be made accordingly. The key is to have a redundant facility that is far enough away not to be susceptible to loss of service from the same condition, but close enough for key people to get there quickly if necessary. NSE has shifted its mirror site from Pune to Chennai in order to ensure that the two sites are in different seismic zones.

But low-budget HA options are available too. Advises NSE IT CEO Satish Naralkar, "Have an inventory of standby spare parts. This makes it much simpler administratively. Many of the components are hot-swappable, that is, you can replace faulty hardware on the fly". Note, however, that keeping a 100% inventory at all times may work out costlier than having a full standby system. The key is to know the failure history from experience and decide which spares are worth keeping. One needs to plan for HA only as much as the business demands. Sometimes, a combination of approaches may be the solution. NSE is a good example of a line-up of availability solutions matched to business needs: a Stratus FT machine for its trading systems, clustered HA systems for the back-end, less critical applications on cold standby, and critical spares for some applications. HA, however, is not all about shelling out money on redundant systems and recovery centers. Cautions Naralkar, "The key is to allow no SPOF at all points of failure. It is like carrying a spare tire that too is flat!" Incidents such as 9/11 have created interest in HA. Still, for a large number of companies, HA remains on the wish list. As Naralkar points out, "In such a situation people fall back on the belief that such an eventuality may never hit them".

Easwardas Satyan in Mumbai


The ‘Availability’ Jargon Buster

2N: A redundant system in which there is a hot spare for every (critical) component.

3N: A redundant system in which there are three of every (critical) component, and a voting mechanism to prevent a failure from causing errors (see TMR; a toy sketch of voting follows this glossary).

Cluster: A collection of two or more systems which do useful work, but in which the remaining system(s) can take over the load in case of the failure of one of the systems.

Cold spare: A spare component that is available to replace a failed component, but is not installed.

Cold restart: Restarting a device from power off or with no state information.

Fault tolerance: The system attribute of being able to operate correctly while faults are occurring.

Heartbeating: Sending a signal from one component to another to show that the sending unit is still functioning correctly.

Hot restart: Restarting a component while the system is still operational or partially operational.

Hot spare: A spare component that is constantly updated with the state of the component it will take over from in case of a failure.

Hot swap: Removing a failed piece of hardware and replacing it with a cold spare in a ‘hot’ (operational) system, without bringing it down.

MTTF: Mean time to failure, the average time between one failure and the next, measured over a large number of failures.

MTTR: Mean time to repair, the average time it takes to repair a component or system, measured over a large number of repairs.

N+M redundancy: A system that has N components working in normal operation and M components in standby. This is frequently reduced to N+1, where there is one standby component for every N operational components.

RAID: Redundant Array of Inexpensive Disks. RAID may be used to provide high availability, larger virtual disks, or both. RAID technology is split into levels (Linear, 0, 1, 2, 3, 4, 5) which define various characteristics.

Reboot: To restart a computer. A cold reboot occurs when the system is powered off to reboot; a warm reboot occurs when the hardware is left running but the OS is restarted.

Rollback: The process of repairing the damage to the state of a system caused by hardware or software failure.

Rollforward: The process of bringing the state of a system or software component to a level at which it can be placed into service.

SPOF: Single point of failure. If a component is a SPOF, it is critical that it be as reliable as possible, since the reliability of the system can be no better than that of the SPOF component.

Warm spare: A spare component which is powered up and ready to be configured to take over in case of a failure.
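
For the 3N entry above, here is a toy illustration of the voting idea (TMR, triple modular redundancy): three replicas compute the same result, and the majority value masks a single faulty replica.

```python
# Toy triple-modular-redundancy (TMR) voter: the majority result from
# three redundant replicas wins, masking one faulty replica.
from collections import Counter

def vote(*results):
    """Return the majority result from redundant replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority: replicas disagree")
    return value

print(vote(42, 42, 41))  # the faulty replica's 41 is outvoted -> 42
```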
