By: Srivatsan S, Software Engineer Sr Staff, Juniper Networks IEC
Managing nodes in modern-day networks has become a challenging task because of technological advancements and increased deployment complexity. Post-deployment environment conditions are hard to predict: they can present unique combinations of inputs that exercise untested paths and induce buggy behavior in the nodal element. This creates a need for robust, resilient, state-of-the-art applications on the nodal element to meet high availability requirements.
Network nodes are difficult to debug today because the distributed nature of the processing limits visibility into their health parameters. System health parameters, in a nutshell, give an overall summary of the state of the system at any point in time, such as resource utilization (CPU usage, memory usage), events in various components, etc.
The need of the hour is to build an automated network analytics-based solution that has the following capabilities:
- Gather system parameters from various components in the node
  - This helps in building the knowledge base
- Monitor them continuously from a system perspective for proactive intervention
  - A rule-based expert system with statically defined rules can be used to monitor the network/node for known events
- Correlate events from various domains to derive a holistic interpretation at the nodal/network level and drive triaging/resiliency actions for recovery in real time
  - Build static intelligence to drive end-to-end recovery action
- Build an analytics-based solution for Artificial Intelligence in network and nodal management
  - Apply Machine Learning to dynamically model and fine-tune system performance
  - Feed what is learned in the above step back to fine-tune system parameters, i.e. a closed-loop control system
  - Use the rule-based expert system mentioned above to drive recovery/config actions
- Provide an enhanced user experience
How does the architecture work?
The framework facilitates data collection (which builds the knowledge base), periodic monitoring, diagnosis, correlation, and recovery actions.
1. Rule-based Expert System
The framework will also include a rule-based expert system that makes decisions for known scenarios through static rulesets/policies.
The main components of the rule-based system are:
- Rule Data Base (RDB)
  - Rulesets/policies configured to take critical decisions
- Evidential Reasoning Engine
  - The evidential engine applies rules to the knowledge base to conclude the root cause(s) and identify the action(s) for an event
Some examples of static rules/policies would be to:
- Check for CRC errors and correlate them with operating temperature
- Check for memory errors and compare them with historical data (to look for patterns) before making an informed decision
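The rule examples above can be sketched as a small rule table evaluated against a snapshot of the knowledge base. This is only a minimal illustration, not the framework's actual design: the `Rule` class, the parameter names (`crc_errors`, `temperature_c`, `memory_errors`, `memory_errors_hist_avg`), the thresholds, and the action strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical knowledge-base snapshot: parameter name -> latest value.
KnowledgeBase = Dict[str, float]


@dataclass
class Rule:
    """One static rule in the (hypothetical) Rule Data Base."""
    name: str
    condition: Callable[[KnowledgeBase], bool]  # True when the rule fires
    root_cause: str
    action: str


# Illustrative thresholds only; real values depend on the platform.
RULES: List[Rule] = [
    Rule(
        name="crc-vs-temperature",
        condition=lambda kb: kb.get("crc_errors", 0) > 100
        and kb.get("temperature_c", 0) > 75,
        root_cause="CRC errors correlated with high operating temperature",
        action="raise fan speed and alert operator",
    ),
    Rule(
        name="memory-errors-vs-history",
        # With no historical average available, the rule never fires.
        condition=lambda kb: kb.get("memory_errors", 0)
        > 2 * kb.get("memory_errors_hist_avg", float("inf")),
        root_cause="memory errors well above historical pattern",
        action="schedule memory diagnostics",
    ),
]


def evaluate(kb: KnowledgeBase, rules: List[Rule] = RULES) -> List[Tuple[str, str, str]]:
    """Apply every rule to the knowledge base.

    Returns (rule name, root cause, action) for each rule that fires.
    """
    return [(r.name, r.root_cause, r.action) for r in rules if r.condition(kb)]


snapshot = {
    "crc_errors": 250,
    "temperature_c": 82,
    "memory_errors": 1,
    "memory_errors_hist_avg": 3,
}
for name, cause, action in evaluate(snapshot):
    print(f"{name}: {cause} -> {action}")
```

Keeping the rules as data, separate from the evaluation loop, mirrors the split between the Rule Data Base and the reasoning engine: new policies can be added without touching the engine.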
2. Analytics Solution
The knowledge base will be fed to an ML-based engine that monitors usage and fine-tunes parameters based on the deployment. Some examples are:
- Increased CPU utilization at specific times of the day or week, which could be due to route flaps
- Increased traffic during the holiday season
If any parameters deviate from configured tolerance levels, remedial actions can be taken.
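As a rough illustration of the tolerance check, the sketch below flags parameters whose current value strays too far from their historical baseline. A simple mean and standard deviation stands in here for the ML-based engine, which the article does not specify; the function name `deviations`, the parameter names, and the default tolerance of three standard deviations are assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List


def deviations(
    history: Dict[str, List[float]],
    current: Dict[str, float],
    tolerance: float = 3.0,  # assumed tolerance, in standard deviations
) -> Dict[str, float]:
    """Return parameters outside the tolerance band, with their z-scores.

    A parameter is flagged when its current value lies more than
    `tolerance` standard deviations from its historical mean.
    """
    flagged: Dict[str, float] = {}
    for name, samples in history.items():
        if name not in current or len(samples) < 2:
            continue  # nothing to compare against
        mu, sigma = mean(samples), pstdev(samples)
        if sigma == 0:
            continue  # flat history: no spread to measure against
        z = (current[name] - mu) / sigma
        if abs(z) > tolerance:
            flagged[name] = z
    return flagged


# Hypothetical samples drawn from the knowledge base.
history = {
    "cpu_pct": [40, 42, 38, 41, 39],
    "mem_pct": [60, 61, 59, 60, 60],
}
current = {"cpu_pct": 90, "mem_pct": 60.5}

for name, z in deviations(history, current).items():
    print(f"{name} deviates from baseline (z = {z:.1f}); trigger remedial action")
```

In a closed-loop setup, the flagged parameters could be handed to the rule-based expert system to select a recovery or configuration action.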