One of the earliest organisations to set up a disaster recovery site, in 1998,
the National Stock Exchange has further improved its robustness by implementing
a business continuity plan. The Rs 500,000-crore Mumbai exchange now has a
tested and functional continuity plan to ensure uninterrupted operations in the
event of any dislocating emergency.
Under such circumstances, all operations, including IT support and
infrastructure, business support and administration, would shift to the recovery
site and remain there till the primary site is restored. While the recovery site
was first located at Pune, it has now been shifted to Chennai. Comments C
Kajwadkar, Vice President of NSE.iT, the IT wing of the exchange, "We
wanted to move the site to another seismic zone," and Chennai was as remote
as the exchange could get within the country.
Simulating disaster situations that involve mock trading has been a key initiative for NSE.iT vice-president C Kajwadkar
The exchange’s ability to continue working through a dislocating situation
hinges on two significant capabilities. The first one is dependent on creating
redundant IT and networking infrastructure at a backup recovery site. The second
involves a well-rehearsed continuity plan where employees and other office
support equipment move to the same recovery site. With regard to the first
initiative, the exchange has the distinction of not having compromised on the
level of investments in mission critical servers, whether at the primary or
backup site. It has invested in fault-tolerant Stratus servers running the VOS
operating system, which are considerably more expensive than alternatives such as
Compaq’s highly available NonStop Himalaya servers in use at the Bombay Stock Exchange. Points
out Satish Naralkar, CEO of NSE.iT, "Systems do fail once in two years. The
damage can be very high. Are you going to take a chance there? That’s a call
somebody has to take."
That’s a risk the National Stock Exchange has never been ready to take. It
has exactly replicated at the recovery site the server infrastructure that
supports mission-critical applications at the exchange in Mumbai.
According to Kajwadkar, this is now a policy decision where the merits of
parallel capital investments at both the primary and recovery site are not
debated further. The critical business processes running on Stratus servers at
both locations include trading applications for capital market, wholesale debt
market and derivatives market. Other applications are less critical and are
supported on less expensive platforms. For example, the applications for clearing
and settlement are supported on Compaq Alpha 8000, 4000 and 2000 series and the
market surveillance applications run on Sun UltraSPARC servers. Other back
office applications run on Hewlett Packard 9000 servers. Explains Naralkar,
"We have seven years of experience on actual failure rates of equipment,
not theoretical figures," implying that the exchange’s loyalty to Stratus servers
for supporting core trading applications has been built through experience and
not from vendor sales jargon.
The cost of not having an adequate IT recovery solution in place is a risk that Satish Naralkar, CEO of NSE.iT, has never been ready to take
While investing in the recovery site, NSE.iT also had the option of
collocating these servers with data centre vendors. This would have
been a faster and less expensive option, but it was ruled out. "The exchange
would lose control of the site," explains Kajwadkar. The full capability
of the exchange to ride out a disaster is dependent on effective coordination
between the IT recovery and the continuity plans. With the IT recovery
capability assigned to an external vendor, coordination would have become
another obstacle.
A key challenge in the recovery operations has been to replicate the Gilat
VSAT hub centre at the backup site, an expensive investment. While the number of
active brokers has dropped considerably from its peak of 3,000 in the year 2000,
the profiles of all active brokers also have to be maintained at the
recovery site, and these are updated every day. The transaction data from daily
trading is replicated in batch mode at the recovery site using a combination
of 2Mbps leased lines and backup 64Kbps ISDN and VSAT links. Because replication
is batched, trades made after the last completed batch may not have reached the
recovery site when a disaster strikes. The exchange may
therefore have to decide whether to reconstruct the transactions during that
interval from data logs, if available from the primary site, or roll back
transactions by that interval. At the end of it all, the exchange has a guideline
that decides whether operations should continue or not. Says Naralkar, "If
more than 30% of the brokers are affected for any reason we discontinue
trading." This implies that if a certain percentage of brokers cannot get online,
either under normal circumstances or during a disaster situation, the exchange
first has to ensure their connectivity. Trading can continue only after this
situation has been rectified.
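The two decisions described above, whether to halt trading and how to treat trades that fall in the replication gap, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the action labels and the broker counts are hypothetical, and nothing here is drawn from NSE.iT's actual systems. The only figure taken from the article is the 30% threshold quoted by Naralkar.

```python
# Illustrative sketch only; names and structure are assumptions, not NSE code.

HALT_LIMIT_PERCENT = 30  # "more than 30% of the brokers", per Naralkar's guideline


def should_halt_trading(active_brokers: int, affected_brokers: int) -> bool:
    """Return True when more than 30% of active brokers are affected."""
    if active_brokers <= 0:
        raise ValueError("active_brokers must be positive")
    if not 0 <= affected_brokers <= active_brokers:
        raise ValueError("affected_brokers must be between 0 and active_brokers")
    # Integer arithmetic avoids floating-point rounding at the exact 30% boundary.
    return affected_brokers * 100 > active_brokers * HALT_LIMIT_PERCENT


def gap_recovery_action(primary_logs_available: bool) -> str:
    """Pick how to treat trades made after the last replicated batch."""
    # Hypothetical labels: rebuild from primary-site logs if they survive,
    # otherwise roll back the trades made during the unreplicated interval.
    return "reconstruct_from_logs" if primary_logs_available else "roll_back"
```

With a hypothetical 1,000 active brokers, 300 affected brokers would leave trading open, while 301 would cross the threshold and stop it until connectivity is restored.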
High Points of Continuity at NSE
Policy decision to maintain parallel investments in critical servers and applications at two sites
Investment in expensive fault-tolerant servers
Four years of learning experience in recovery and continuity efforts
Planned exercises to simulate disaster situations with mock trading
Remote location of recovery site ensuring capability to function through regional disasters
The second initiative, which ensures that the exchange can work through a
dislocating situation, centres on continuity operations. At the
exchange there are three teams that help to run operations: the
analyst team or business users, the clearing and settlement team, and the IT
team. Explains Kajwadkar, "All the three teams have individual members and
need to work together." That creates a human resource issue, since lack of
role clarity and interpersonal conflicts can arise in times of operational
dislocation and stress.
The continuity exercises therefore involve simulating a disaster in which members
of all three teams move to the recovery site and attempt to establish mock
trading. These exercises are usually conducted on pre-determined Saturdays,
when brokers are informed beforehand and encouraged to participate in the mock
trading. Says Kajwadkar, "These mock trading sessions are useful to the
brokers as well since they get a feel of what they have to do in the event of a
disaster." This may involve establishing connectivity with the recovery
site using a number of network options. Brokers get familiar with what to do in
the event of failure of a VSAT terminal or any other primary connectivity
device. In recent months there hasn’t been an occasion to test the robustness
of the exchange’s continuity plans. The last flashpoint was in 1999 when the
INSAT satellite providing VSAT connectivity failed. That happened twice, and both
times it was on a weekend. On the other hand, the exchange has been
regularly testing its recovery capability. In late 2001, the exchange rolled out
a plan to shift its primary site within Mumbai. To make this happen it first
shifted its recovery site from Pune to the new primary location within Mumbai.
The equipment was stabilized at this new primary site and then the equipment
from the old primary site was shifted to the new recovery site at Chennai. With
daily transactions hovering at Rs 2,000 crore, an uptime of even half a day is
sufficient payback for all recovery and continuity investments made over the
last seven years. "The biggest risk for an organisation is the belief that
the risk doesn’t exist," is Kajwadkar’s advice to other businesses.
Arun Shankar is a contributor to DQ