With the increasing deployment of
Microsoft Windows NT Server Enterprise Edition software, server clustering is becoming
popular as a means to achieve high availability without the expense of deploying fault
tolerant redundant systems.
The clustered approach differs from
redundancy in that, with clustering, secondary servers (known as "nodes" in a
clustered environment) are in active use performing their own primary tasks, such as
providing email or web access services, while redundant systems simply mirror the
functions of the primary server and perform no additional tasks. Clusters are classified as "highly
available" with an average uptime of about 99%. Redundant systems are classified as
"fault tolerant" with 100% uptime under all but the most cataclysmic
circumstances. With greater backup needs and shrinking backup windows, and given the high
costs of deploying fully redundant systems, many network architectures today are opting
for Microsoft's clustering system and its promise of near-continuous availability.
Clustered servers share a common client group and common data storage media, including
local disk drives on the servers and a shared disk subsystem, predominantly a RAID array
for additional fault resilience. Clustering software gives two or more servers the
capability to manage applications and files for the clustered group. Each node in the
cluster hosts a particular set of applications or services and broadcasts its operational
state by sending out packets, called "heartbeats", to all other nodes. If the
other nodes in the cluster fail to detect the heartbeat for a specified interval,
responsibility for running an application or operation passes to a second server, which
also continues to run its own designated set of operations.
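The heartbeat detection logic described above can be sketched roughly as follows. This is a minimal illustration only; the node names, the `HeartbeatMonitor` class, and the failure interval are hypothetical, not part of Clustering Service itself.

```python
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer node and flags any
    node that has been silent longer than the failure interval (sketch)."""

    def __init__(self, nodes, failure_interval=5.0):
        self.failure_interval = failure_interval
        now = time.monotonic()
        self.last_seen = {node: now for node in nodes}

    def record_heartbeat(self, node):
        # Called each time a heartbeat packet arrives from a peer node.
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self):
        # Any node not heard from within the interval is presumed down.
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > self.failure_interval]

# Example: node B stops sending heartbeats while node A keeps sending them.
monitor = HeartbeatMonitor(["A", "B"], failure_interval=0.05)
time.sleep(0.1)               # B stays silent past the interval
monitor.record_heartbeat("A")  # A is still alive
failed = monitor.failed_nodes()
print(failed)                  # only B has missed its heartbeat window
```

In a real cluster the heartbeat arrives over a dedicated interconnect; the sketch only shows the timeout bookkeeping that turns missed heartbeats into a failure decision.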
Microsoft's Clustering Service provides the framework for determining how the cluster will
handle the failure of any of its components. For example, assume that server A is running
file and print functions while server B is handling database services. If server A fails,
Clustering Service directs server B to assume control over file and print functions for
the cluster while maintaining its primary functions as a database server.
To the client, the server cluster appears as a single group of available services. Users
interact with the cluster as a unified entity. Therefore, if a clustered server fails,
users simply experience a brief interruption in network response, usually 30 seconds or
less, while another server in the cluster restarts the application. Clustering
automatically maintains the IT infrastructure for 24x7 environments, restarting
applications and services without human intervention.
As Clustering Service technology matures, additional benefits accrue to network
administrators. In addition to high availability at reduced cost, server clustering will
promote scalability, enabling applications to grow beyond a single server. Because
applications often comprise many processes that can be set up to execute on multiple
servers, clustering will enable network designers to simply add a new server when an
application outgrows its original host. Clustering Service will handle the process
coordination.
Implications for data protection
While Clustering Service greatly reduces the cost of building highly available systems,
clustering magnifies the complexity of backup and recovery. In a non-clustered
environment, network architects can either deploy a dedicated tape backup system, such as
an automated DLT system or library, for each server, or back up over the network to a
backup server, depending on the backup window and the amount of data.
In a clustered environment, however, data protection issues become considerably more
complicated. The two standard backup methods-backing up over the network or to a directly
attached device-are either impractical or inadequate when safeguarding all information in
a cluster. While backup of smaller amounts of data can be conducted over the network,
standard networks are inadequate for larger backups. To achieve acceptable performance on
large data sets, a network administrator needs to back up the cluster's physical nodes to
a direct-attached tape system. But this approach presents its own set of problems.
Clustered applications are packaged as "groups", which include all the resources
required to run the applications. Included in a group is the virtual server name for the
application, which lets a given application run on different systems at any given moment.
To get a complete system image containing all the information needed to rebuild the
cluster, each physical node must be backed up, not just the virtual nodes that represent
the application groups.
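The distinction between virtual and physical nodes can be made concrete with a small model. The group, virtual server, and node names below are hypothetical; the point is that a restorable image must enumerate the physical nodes, not just the virtual names the groups expose.

```python
# Hypothetical sketch: each group carries a virtual server name and
# currently resides on some physical node.
groups = {
    "SQLGroup":  {"virtual_server": "VSQL1",  "owner": "NodeA"},
    "FileGroup": {"virtual_server": "VFILE1", "owner": "NodeB"},
}
physical_nodes = ["NodeA", "NodeB"]

# Backing up only the virtual servers would follow the groups...
virtual_targets = [g["virtual_server"] for g in groups.values()]

# ...but a complete system image needs every physical node in the cluster.
backup_targets = list(physical_nodes)

print(virtual_targets)  # the names applications are known by
print(backup_targets)   # the machines that must actually be backed up
```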
Because clusters are inherently dynamic, responsibility for managing data and applications
moves from server to server as needed to keep the system running. Such shifting of
responsibility is called "failover", and failovers introduce uncertainty into backup.
A physical node's configuration can change depending on where the virtual node resides.
Therefore, an application could execute on one system at the time of backup and on another
at the time of restoration. And unless the backed up image of the cluster is identical to
the restored image, inconsistencies arise. In the event of a failover, there's no
guarantee that all critical data is backed up and recoverable.
THE PRESENT: Use what's out there now. For networks with smaller storage
requirements, over-the-network backup is probably the best option. Enterprise environments
with large storage requirements, however, should consider directly attached tape backup
devices for each node, since over-the-network backup in these sites consumes an
unacceptable percentage of bandwidth.
In this scenario, each server in the cluster has its own tape backup system.
Administrators can manually back up each node in the cluster, but the backup applications
don't know the node is part of a cluster. Directly attached tape backup devices are only
capable of taking a snapshot of the cluster at the time of backup; they don't
automatically adapt to changes in cluster configuration caused by events such as
failovers.
Clustering Service employs a virtual volume residing on the shared storage device called a
"Quorum Disk". This volume contains the mechanism servers use to communicate
cluster information with each other and must be backed up along with the contents of
server drives and shared storage devices. The Quorum Disk is actively owned by one of the
clustered nodes. Backing up physical nodes in the cluster doesn't necessarily guarantee
that the Quorum Disk is also backed up.
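The gap described above can be illustrated with a small check. The node names and the `quorum_disk` label are hypothetical; the sketch simply shows that a backup plan listing physical nodes may or may not include the node that currently owns the Quorum Disk.

```python
# Hypothetical sketch: the Quorum Disk is owned by exactly one node at a
# time, so a backup plan covers it only if that owner is in the plan.
cluster_state = {
    "NodeA": ["local_disk_A"],
    "NodeB": ["local_disk_B", "quorum_disk"],  # NodeB owns the quorum now
}

def backup_plan_covers_quorum(plan_nodes, cluster_state):
    """True only if some node in the plan currently owns the quorum disk."""
    return any("quorum_disk" in cluster_state[n]
               for n in plan_nodes if n in cluster_state)

print(backup_plan_covers_quorum(["NodeA"], cluster_state))           # False
print(backup_plan_covers_quorum(["NodeA", "NodeB"], cluster_state))  # True
```

Because quorum ownership can move on failover, a plan that covered the Quorum Disk yesterday may miss it today, which is exactly the uncertainty the article describes.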
THE IMMEDIATE FUTURE: Within the next six months, the software that runs tape
backup systems will become "cluster-aware." This means that backup software
understands the dynamics of the clustered environment and makes API calls into the cluster
to determine where data resides and how to back it up. Cluster-aware backup software
understands the relationships of all components in the cluster and can automatically back
up all those components correctly.
Inconsistent images of a cluster's Quorum Disk and physical nodes can lead to trouble at
restoration time. If restoration is needed, cluster-aware backup applications understand
how to interrogate the cluster to determine storage configuration. They also know the
proper order in which to restore the various elements to maintain consistency.
Cluster-aware utilities will be able to interrogate the cluster to get a consistently
clean backup and use that information to properly restore the system with little or no
manual intervention.
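A cluster-aware restore of the kind described might look like the following sketch. The `interrogate_cluster` function stands in for the real cluster API calls, and the restore ordering shown (cluster metadata first, shared storage next, node drives last) is an illustrative assumption, not a documented Clustering Service sequence.

```python
# Hypothetical sketch of a cluster-aware restore: interrogate the cluster
# for current resource placement, then emit restore steps in a fixed order.
def interrogate_cluster():
    # Stands in for real API calls that report where data resides.
    return {
        "quorum_disk": "NodeB",
        "shared_storage": ["SQLData"],
        "nodes": ["NodeA", "NodeB"],
    }

def restore_plan(state):
    """Build an ordered list of restore steps from the interrogated state."""
    steps = [("quorum_disk", state["quorum_disk"])]      # cluster metadata first
    steps += [("shared", vol) for vol in state["shared_storage"]]
    steps += [("node", n) for n in state["nodes"]]       # node drives last
    return steps

plan = restore_plan(interrogate_cluster())
print(plan[0])  # the quorum information is restored before everything else
```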
THE NEXT STEP: Intelligent use of direct-attached backup devices. Choosing among
the pool of tape devices attached to the physical nodes in a cluster, the application
specifies the proper tape device dynamically at time of backup, and simply tells the
device to back up designated files, directories or applications. The tape backup
application, because it has kept track of cluster activity, knows exactly where the
relevant data is located. This scenario provides automatically optimized backup, doesn't
require backup over the network, and frees IS staff from manual intervention in cluster
backup operations.
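The device-selection step above reduces to a lookup: back up each group through the tape device attached to whichever node currently hosts its data. The node, device, and group names below are hypothetical.

```python
# Hypothetical sketch: pick the tape device attached to the node that
# currently owns the data, so backup never crosses the network.
devices = {"NodeA": "DLT-1", "NodeB": "DLT-2"}   # one device per physical node
data_location = {"SQLGroup": "NodeB", "FileGroup": "NodeA"}

def select_device(group):
    """Return the direct-attached device on the node that owns the group."""
    return devices[data_location[group]]

print(select_device("SQLGroup"))   # DLT-2, attached to NodeB
print(select_device("FileGroup"))  # DLT-1, attached to NodeA
```

Because `data_location` changes after a failover, the lookup must be made dynamically at backup time, which is the "intelligent" behavior the article anticipates.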
WITHIN THE NEXT YEAR: The most cost-effective and efficient cluster backup solution
will be achieved by building on prior developments-cluster-aware backup software and
intelligent direct-attached backup devices-to allow the cluster to share a single backup
device, such as a tape library or tape autoloader. And while such a connection could be
achieved over an existing technology like SCSI, future clustered servers are likely to be
attached to each other and to a shared storage area network, or SAN, using new Fibre
Channel technology.
Chief advantages of a SAN include providing shared access to storage; providing a
dedicated high-bandwidth network separate from the production network; and providing the
basis for building additional intelligence into storage.
With the advent of efficient, affordable SAN technologies and cluster-optimized backup
control software, network planners will finally be able to take advantage of the best of
both worlds: clustered servers backing up to a single, network-attached DLT or tape
library system.
Courtesy: Hewlett-Packard