Self-healing, Self-managing Networks: A Networking Nirvana?

Despite what the poets say, it is business
that makes the world go round. And helping business go round better is the business
communications infrastructure: the network. In a relatively short time, computers and
computer networks have revolutionized the way people work and conduct business. Today,
life virtually comes to a standstill if networks fail. And in the case of mission-critical
environments like hospitals and stock exchanges, a few seconds of network downtime could
have disastrous consequences.

As we progress, networks continue to grow larger, more complex, and more sophisticated,
with new servers and nodes being added all the time. Staggering volumes of data now travel
through them, hence the need to manage the data residing in these networks and to ensure
its security. If the data of a hospital, a national defense system, a power plant, or even
a local supermarket is corrupted, the results can only be imagined.

In theory, a computer network is easy enough to understand: you connect any number of
single computers together, assign a "traffic cop" computer to see that the proper data
gets to the proper destination, and sit back and let the data flow. It's easy enough until
you have to find out why everything stopped for no apparent reason. This is where network
management comes in.

A comprehensive set of tools, known as network management applications, is indispensable
to a network administrator who needs to ensure the smooth operation of a network. These
applications provide a way to find out what your network comprises, proactively monitor
it, discover signs of trouble, and rapidly fix any problems that arise.

Common Network Management Functions
Most enterprise Network Management Systems (NMSs) of today provide the following basic
functions:

  • Auto Discovery and Topology, where the NMS
    automatically "discovers" all the elements or nodes in a network and displays them,
    preferably as a graphical map. The map may also show how the elements relate to each
    other, both physically and logically.
  • Configuration Management, which provides a way to
    look at and change the operational parameters of an application, a node, or a group of
    nodes.
  • Fault Management, which provides tools that help detect,
    isolate, and correct problems in a network.

As the industry has matured, network management application providers have made great
strides in making network management easy and accessible. They have recognized that it is
not enough to detect a network failure after the event; one has to anticipate a failure
before it happens and prevent it. Toward this end, there have been efforts to make the
distributed networks of today and tomorrow self-healing and self-managing.

Some additional functions that an NMS might
provide include:

  • Performance Management, which provides
    insight into the performance of the network and offers ways to fine-tune operational
    parameters for optimal performance.
  • Trending and Capacity Planning, which looks at
    historical data of network operation and provides graphs and charts showing network
    traffic trends. These trends can then be extrapolated and used to plan network expansion.
  • Security, by which one can restrict users'
    access to certain resources. This could range from authenticating logins for blanket
    access to assigning different privileges based on the login ID.
  • Accounting and Chargeback, which supports
    environments where the cost of running a network is spread across various departments,
    based on their usage of network bandwidth and server resources.
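
To make the trending and capacity planning function concrete, here is a minimal sketch of
how a traffic trend might be extrapolated. It assumes evenly spaced utilization samples
and a straight-line trend; the function names and the sample figures are hypothetical and
are not taken from any particular NMS.

```python
from statistics import mean


def linear_trend(samples):
    """Least-squares slope and intercept for evenly spaced samples (x = 0, 1, 2, ...)."""
    xs = range(len(samples))
    x_bar, y_bar = mean(xs), mean(samples)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def periods_until(samples, capacity):
    """Periods after the last sample until the trend line reaches capacity.

    Returns None when the trend is flat or declining.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    last_x = len(samples) - 1
    return max(0.0, (capacity - intercept) / slope - last_x)


# Hypothetical monthly link utilization, in percent.
monthly_utilization = [40.0, 45.0, 50.0, 55.0]
print(periods_until(monthly_utilization, 80.0))  # prints 5.0
```

Real traffic is rarely this linear, so a production tool would likely fit seasonal or
non-linear models; the straight line is only the simplest form of the extrapolation
described above.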

The commitment required to realize the
dream of self-healing, self-managing networks is not a small one. New network management
applications need to be implemented along with a comprehensive management strategy. The
strategy should include the following major aspects of network management:

  • Policy-based Distributed Intelligence and
    Embedded Automation
  • Service and Application Management
  • Any time, anywhere access to Network
    Management

Network managers are faced with a
multi-dimensional problem. They are expected to maintain a high level of service while
dealing with a growing number of technologies, products from multiple vendors, and
increasing end-user requirements. Their task is further compounded by having to manage
highly dynamic and geographically dispersed networks. The norm is no longer a centralized,
legacy-based environment, but rather a distributed, client-server model.

There are a number of basic difficulties in managing complex networks. The continual
growth in the number of users and the constant reconfiguration associated with adding or
rearranging nodes are immediate problems. The simultaneous emergence of distributed,
high-bandwidth applications, such as groupware, multimedia, and videoconferencing, is
straining network capacity. Furthermore, the rapid deployment of new high-bandwidth and
switched technologies, such as Fast Ethernet, Asynchronous Transfer Mode (ATM), and LAN
switching, is adding another dimension to network management requirements (although ATM,
and even LAN switching, open the possibility of taking greater control of the network by
controlling the way connections are established).

The traditional network management approach uses intelligent Simple Network Management
Protocol (SNMP) agents in the hub to collect and reduce data, while relying on the NMS for
analysis and control. This approach has some major limitations. For instance, the NMS is a
single point of failure, in terms of both its hardware and its network connectivity. The
NMS is also overburdened with the responsibility of keeping an eye on hundreds or
thousands of devices in the network. As a result, it cannot really respond in real time to
potential problems in the network. Most importantly, the NMS relies on a human to respond
after a problem has already occurred.

The solution to these limitations is not to confine NMS functions to a centralized
platform or workstation, but rather to distribute the proper NMS intelligence throughout
the network. The administrator only needs to distribute the appropriate policies, which
define the ideal behavior of the network, and the network itself will implement those
policies locally and in real time with the help of embedded applications, rather than
relying on the NMS.

Let us look at an example of this approach: an application that uses distributed embedded
intelligence to resolve a potential bottleneck before it develops.

Averting Network Storms
With the rapid increase in the number of nodes on network segments, more data traffic
inevitably competes for each segment's bandwidth, leaving an ever-smaller share for each
node. When data lines get "clogged", data packet collisions occur. When a misconfigured or
faulty node puts out excessive broadcasts, causing what is called a "network storm", every
other node in the network (broadcast domain) is busy processing the spurious broadcasts,
and any useful traffic has to wait until the broadcast storm subsides or is terminated.

The job of detecting and terminating a network storm is difficult, even for an
administrator equipped with advanced tools such as RMON (a remote monitoring protocol)
probes and intelligent hubs. Typically, it will take an administrator 2-4 hours to solve a
broadcast storm problem. During this time the network may be virtually unusable.

One application that automates the detection and termination of network storms is NetStorm
Terminator. It proactively monitors network traffic, compares it to predefined baseline
thresholds, and terminates network storms automatically and quickly (within seconds),
without user intervention.
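
The idea of comparing traffic to a baseline threshold and acting locally can be sketched
in a few lines. This is not NetStorm Terminator's actual algorithm, which the article does
not describe; the class names, the multiplier, and the "three consecutive samples" rule
are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StormPolicy:
    """Policy set centrally and pushed to each device (hypothetical fields)."""
    baseline_pps: float      # normal broadcast packets per second
    multiplier: float = 5.0  # how far above baseline counts as a storm
    sustain: int = 3         # consecutive samples over threshold before acting


class PortMonitor:
    """Runs on the device and enforces the policy without consulting the NMS."""

    def __init__(self, policy):
        self.policy = policy
        self.over = 0        # consecutive samples above the threshold
        self.enabled = True  # whether the port is still forwarding

    def observe(self, broadcast_pps):
        """Feed one traffic sample; return True if the port was just disabled."""
        if not self.enabled:
            return False
        threshold = self.policy.baseline_pps * self.policy.multiplier
        self.over = self.over + 1 if broadcast_pps > threshold else 0
        if self.over >= self.policy.sustain:
            self.enabled = False  # isolate the offending port
            return True
        return False


monitor = PortMonitor(StormPolicy(baseline_pps=100.0))
for sample in [120, 900, 950, 80, 700, 800, 900]:
    if monitor.observe(sample):
        print("storm detected, port disabled at", sample, "pps")
```

Requiring several consecutive samples above the threshold is one simple way to avoid
shutting down a port on a single short burst of legitimate broadcasts.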

Armed with the proper embedded network
management applications, a network administrator can maintain the level and quality of
network service expected, while cutting the administrative overhead required to do so.
This type of network management model, where policies are configured centrally from an
NMS and the implementation is distributed to the intelligent end nodes, is extremely
scalable. It delivers networks that are self-learning, self-healing, and self-managing.

Service and Application Management
Traditionally, network management involves configuring, monitoring, and maintaining a
collection of physical components. This approach has enabled network managers to diagnose
and solve hardware problems when they occur and to keep their networks physically
operational. Users, however, don't know or care about node or port or link status. They
don't know about hubs, switches, or routers.

The typical user only wants access to network resources (application servers such as email
and web servers, as well as file servers) with prompt response times. Users want to get
their work done without having to think about potential network problems. It is with this
practical business view in mind that the NMS needs to manage networks in the context of
what really matters: non-stop, application-level usability.

Monitoring Real-time Server Response
Traditionally, server availability was monitored by sending a "ping" to verify that there
is network connectivity from the source to the destination. A successful "ping", however,
does not guarantee that the application residing on the server machine is alive and well.
What is needed is a way to monitor the real-time health of the actual application.

As an example, let's look at an application
called VitalStat. VitalStat provides intelligent management at the application level,
based upon server response time for a typical transaction between a client and a server.

Given a list of application servers on the
network (this list could potentially be "learnt" automatically), VitalStat automatically
measures the elapsed time for a node to complete a full transaction with an application
server, for example, downloading a web page (HTML page) from a web server (HTTP server).
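
As a rough illustration of timing a full transaction rather than a ping, the sketch below
measures how long one complete page download takes. It is not VitalStat's implementation;
`time_transaction` and `http_get` are hypothetical helper names, and the only library call
used is Python's standard `urllib.request.urlopen`.

```python
import time
import urllib.request


def time_transaction(transaction):
    """Run one client/server transaction; return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        transaction()
        succeeded = True
    except Exception:
        # A failed transaction is still a measurement: the application is
        # unhealthy even if the host would answer a ping.
        succeeded = False
    return succeeded, time.perf_counter() - start


def http_get(url, timeout=5.0):
    """Build a transaction that downloads one web page in full."""
    def transaction():
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    return transaction
```

A monitoring node would call something like
`time_transaction(http_get("http://intranet.example/index.html"))` at regular intervals
(the URL is a placeholder) and record the elapsed times for later comparison.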

VitalStat correlates the actual response time with a previously gathered "baseline", as
well as other performance characteristics, and detects deviations from an acceptable
performance level. When deviations are detected, VitalStat determines whether the cause of
the deviation is application, server, or network related. The application also makes
recommendations on how to fix deviations and prevent future occurrences. The network
administrator's involvement in the scenario is minimal, if it is needed at all.
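
One simple way to detect a deviation from a baseline is to flag any response time that
lies well above the baseline's mean. The sketch below does exactly that and nothing more;
the article does not say how VitalStat defines an acceptable level, so the
three-standard-deviation tolerance and the sample figures are assumptions.

```python
from statistics import mean, stdev


def is_deviation(baseline, observed, tolerance=3.0):
    """True if observed exceeds the baseline mean by more than
    `tolerance` sample standard deviations.

    `baseline` must hold at least two earlier response times.
    """
    threshold = mean(baseline) + tolerance * stdev(baseline)
    return observed > threshold


# Hypothetical response times in seconds gathered during normal operation.
baseline = [0.20, 0.22, 0.19, 0.21, 0.20]
print(is_deviation(baseline, 0.21))  # prints False
print(is_deviation(baseline, 0.50))  # prints True
```

Deciding whether a flagged deviation is application, server, or network related needs
additional measurements, which this sketch does not attempt.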

Virtual Grouping of Users and Quality of Service
In traditional routed networks, users are grouped together based on some physical
attribute (where they reside, where the network connection is) rather than who they are,
what they do, or what network services they need to access. Most router management tools
have a cumbersome box-based approach to management instead of a systems-based approach:
they concern themselves with the tedious and error-prone task of configuring every
individual router and each of its parameters, ports, protocols, subnets, filters, etc.

Network managers have a new solution to these management constraints: they are now able to
identify nodes based on how they use the network. Users can be identified and grouped in
different ways, such as by physical location, by the network IDs of members, or even by
the type of network layer protocol or applications they use.

Whenever a user plugs into the network, the network (armed with embedded automation) is
intelligent enough to determine the appropriate group membership based on the user's
characteristics (network ID, protocol, applications used, etc.).
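
Group assignment of this kind can be pictured as a small set of rules evaluated when a
node appears. The following is only a sketch under assumed names: the `Node` fields, the
group names, and the rules themselves are hypothetical examples, not a description of any
vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    network_id: str          # for example, an IP address
    protocol: str            # network layer protocol in use
    applications: frozenset  # applications observed on the node


# Each rule pairs a group name with a test on the node's characteristics.
RULES = [
    ("engineering", lambda n: n.network_id.startswith("10.1.")),
    ("ipx-users", lambda n: n.protocol == "IPX"),
    ("video", lambda n: "videoconferencing" in n.applications),
]


def groups_for(node, rules=RULES):
    """Return every group whose rule matches the node, in rule order."""
    return [name for name, matches in rules if matches(node)]


laptop = Node("10.1.4.7", "IP", frozenset({"videoconferencing"}))
print(groups_for(laptop))  # prints ['engineering', 'video']
```

Because membership depends on what the node is and does rather than on the port it plugs
into, the same user lands in the same groups from any location.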

Prasad Pammidimukkala, Product Manager, Newbridge Networks.
