Outages Are Not a Question of ‘if’, But ‘when’


By: Edgar Dias, Managing Director, ServiceNow India


In a digital economy, system outages and website glitches are serious business. A service outage can have a cascading negative impact, inconveniencing customers, employees, and other stakeholders. When a business’s IT systems go offline, even for short intervals, the cost can be substantial: even an hour of downtime is expensive. Besides the immediate monetary cost, the firm also loses productivity and suffers damage to its reputation. The recent outages experienced by some of the largest airlines in the world (despite all their investments in disaster recovery) are a testimony to how fragile, complex, and error-prone technological infrastructure is. The result was thousands of flights delayed, millions of passengers stranded and potentially switching to other carriers, increased costs due to insurance claims, and carefully built reputations torn to shreds.

At present, most networks and applications have advance warning systems to flag minor issues before they turn into major problems. In today’s complex world, a service typically consists of numerous interconnected applications; in the banking industry, for example, we have seen a critical service depend on 15 or more applications stitched together. A failure in any one application or piece of infrastructure can bring down the entire service, setting off alarms across the entire spectrum. The main challenge at that point is to identify what triggered the “storm” of alarms and to diagnose the root cause quickly. Depending on the nature of the event and the scope and size of the system, resolving the outage can take longer still.

Real-time monitoring, an accurate CMDB, and predictive intelligence can dramatically improve the quality of shared infrastructure services. This is where IT operations management (ITOM) comes in. Put simply, ITOM is a set of solutions that help keep the lights on or, more precisely, keep the various IT systems, applications, and networks that firms use up and running. ITOM helps IT departments manage every aspect of IT and deliver services at a predictable quality and performance.


Here’s a look at how ITOM can be used to stop service outages before they start.

It all starts with real-time visibility into the enterprise’s infrastructure and services. This comes from Event Management, which is integrated with a Configuration Management Database (CMDB) and IT Service Management (ITSM) processes and aggregates inputs from third-party monitoring solutions.

When events come in, they are analyzed for duplicates or correlations, dramatically reducing “noise” and improving the quality of incidents coming into our Services Reliability Team (SRT). ITOM also automatically prioritizes incidents and intelligently assigns them to the right resource the first time.
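To make the idea concrete, here is a minimal sketch of deduplication and correlation. It is an illustration only, not how any particular Event Management product works: the `Event` fields, the five-minute window, and the node-to-service mapping (standing in for a CMDB lookup) are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str      # monitoring tool that raised the event (hypothetical field)
    node: str        # configuration item the event refers to
    metric: str      # what was measured, e.g. "cpu" or "ping"
    timestamp: int   # seconds since an arbitrary reference point


def deduplicate(events, window=300):
    """Drop repeats of the same (node, metric) within `window` seconds of the last kept event."""
    last_kept = {}
    kept = []
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.node, event.metric)
        if key in last_kept and event.timestamp - last_kept[key] < window:
            continue  # duplicate of an event that was already kept
        last_kept[key] = event.timestamp
        kept.append(event)
    return kept


def correlate(events, node_to_service):
    """Group events by the business service their node supports."""
    groups = defaultdict(list)
    for event in events:
        service = node_to_service.get(event.node, "unmapped")
        groups[service].append(event)
    return dict(groups)


# Five raw events from two hypothetical monitoring tools.
raw = [
    Event("monitor-a", "db01", "cpu", 0),
    Event("monitor-b", "db01", "cpu", 30),
    Event("monitor-a", "db01", "cpu", 90),
    Event("monitor-a", "web01", "ping", 100),
    Event("monitor-a", "db01", "cpu", 400),
]
unique = deduplicate(raw)
alerts = correlate(unique, {"db01": "payments", "web01": "payments"})
```

In this toy run, five raw events collapse to three unique ones, all grouped under a single service, which is the kind of noise reduction described above.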


An Event Management dashboard is an efficient way to keep track of the health of all business services in the firm’s environment. Surfacing any deviation from the regular flow of activities helps the firm stay on top of the situation.

A visual representation of the relationships between services, applications, and devices helps eliminate blind spots and allows one to pinpoint the problem area. This ‘service map’ can show the details the firm needs to identify and fix the root cause of an issue.
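One way to picture how a service map narrows down a root cause is as a dependency graph. The sketch below is a simplified assumption, not a description of any product's algorithm: the map contents and the `probable_root_causes` helper are hypothetical, and the heuristic is simply "an alerting item none of whose direct dependencies is alerting".

```python
# Hypothetical service map: each item lists the items it depends on.
DEPENDS_ON = {
    "online-banking": ["web-app", "auth-service"],
    "web-app": ["app-server"],
    "auth-service": ["app-server", "ldap"],
    "app-server": ["database"],
    "database": [],
    "ldap": [],
}


def probable_root_causes(service, alerting, depends_on):
    """Return alerting items under `service` that have no alerting direct dependency."""
    causes = []
    seen = set()
    stack = [service]
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        children = depends_on.get(item, [])
        # An item is a candidate if it alerts but nothing directly beneath it does.
        if item in alerting and not any(child in alerting for child in children):
            causes.append(item)
        stack.extend(children)
    return sorted(causes)


# A "storm": five items alert at once, but only one is at the bottom of the chain.
alerting = {"online-banking", "web-app", "auth-service", "app-server", "database"}
causes = probable_root_causes("online-banking", alerting, DEPENDS_ON)
```

Here the five simultaneous alarms reduce to a single candidate, the database, because every other alerting item sits above something that is itself alerting.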

The benefits that ITOM delivers consistently and significantly outweigh those of reactive point-product solutions, which are unable to respond fast enough to changing business demands. These include a substantial reduction in “noise” coming into the SRT queue, achieved by correlating events and turning only the most important ones into actionable tickets; hours saved annually through automated alert management; a significant reduction in P1 and P2 incidents via Event Management and process improvements; and cost avoidance through the elimination of manual work.


ITOM also helps optimize key strategic areas by:

  • Increasing velocity – By consolidating a massive number of events into a short list of incidents, and then prioritizing and routing them to the right SRT resource, ITOM can eliminate the time-consuming, back-and-forth communications among teams. This accelerates the time to resolution and improves service availability to the business groups, partners, and customers.

  • Providing actionable visibility – While no organization is immune to outages, it is critical to quickly identify and resolve issues and resume services. With an Event Management dashboard, system admins get instant visibility into the health of the firm’s environment and can respond quickly to alerts. This allows them to pinpoint the device with the issue and address the root cause. The result? Outages can be averted, and the mean time to repair can be significantly reduced through quick identification of fault domains.

  • Improving experience – System administrators no longer waste valuable time on duplicate or false alerts. When issues arise, the information they need is at their fingertips, saving them the time and trouble of manual monitoring. IT managers can rest easier, knowing their teams have the insight they need to do their jobs. End users benefit from the greater uptime that ITOM provides. It is a win-win.
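The prioritizing and routing step mentioned in the first point can be sketched as a simple lookup. Everything here is a hypothetical illustration: the criticality and severity tables, the team names, and the scoring rule are assumptions for this example, and a real deployment would draw them from its own CMDB and assignment rules.

```python
# Hypothetical lookup tables for illustration only.
SERVICE_CRITICALITY = {"payments": 1, "intranet": 3}      # 1 = most critical
SEVERITY_RANK = {"critical": 1, "major": 2, "minor": 3}   # 1 = most severe
OWNING_TEAM = {"database": "dba-team", "network": "network-team"}


def priority(service, severity):
    """Combine service criticality and alert severity into a label from P1 to P4."""
    score = SERVICE_CRITICALITY.get(service, 3) + SEVERITY_RANK.get(severity, 3)
    return "P" + str(min(max(score - 1, 1), 4))


def route(alert):
    """Attach a priority and an assignment group to one correlated alert."""
    return {
        **alert,
        "priority": priority(alert["service"], alert["severity"]),
        "assigned_to": OWNING_TEAM.get(alert["ci_class"], "service-desk"),
    }


alerts = [
    {"service": "intranet", "severity": "minor", "ci_class": "network"},
    {"service": "payments", "severity": "critical", "ci_class": "database"},
]
# Most urgent first: "P1" sorts before "P4".
queue = sorted((route(a) for a in alerts), key=lambda a: a["priority"])
```

With these assumed tables, the critical payments alert lands at the top of the queue as a P1 for the database team, while the minor intranet alert follows as a P4, without anyone forwarding tickets between teams.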

In India, telecom, banking, IT services, and manufacturing companies with dispersed locations are among the sectors that stand to benefit most from adopting ITOM capabilities in place of outdated legacy systems or even manual processes. As companies adopt digital strategies, ensuring the availability of critical infrastructure elements becomes an absolute priority for CIOs. Organizations that wish to stay competitive cannot ignore these technological developments. The benefits go beyond mere support for decision making, toward increasingly driving business processes by proactively flagging issues and recommending the best possible actions in an automated manner.

According to Gartner, more than 90% of organizations will adopt hybrid infrastructure management capabilities by 2020. The traditional data center outsourcing market is shrinking, while cloud compute services are growing. Traditional services will continue to coexist with industrialized and digitalized services, but only with a minority share. This forecast points toward increasing demand for ITOM.

Technology is always evolving, and so is ITOM. The key for CIOs is to identify and estimate the cost of an hour of downtime for “critical applications”, which naturally leads to a discussion of the strategies needed to drive down MTTR (Mean Time To Repair). Someday, hopefully, service outages will be a thing of the past, and by continuing to enhance ITOM with machine intelligence and other advancements, I am confident that day will come sooner rather than later.
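For readers who want to run the numbers, the arithmetic behind MTTR and downtime cost is simple. The sketch below uses made-up figures purely for illustration; the repair times and the hourly cost are assumptions, not data from any real organization.

```python
def mean_time_to_repair(repair_minutes):
    """Average repair time across incidents, in minutes."""
    return sum(repair_minutes) / len(repair_minutes)


def downtime_cost(repair_minutes, cost_per_hour):
    """Total cost of downtime, given an estimated cost per hour of outage."""
    return sum(repair_minutes) / 60 * cost_per_hour


# Hypothetical figures: three incidents on one critical application.
incidents = [30, 90, 60]        # minutes to repair each incident
hourly_cost = 10_000            # assumed cost of one hour of downtime

mttr = mean_time_to_repair(incidents)
cost = downtime_cost(incidents, hourly_cost)
```

Under these assumed figures, MTTR is 60 minutes and the three incidents cost 30,000 in total, so halving MTTR would halve that bill.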
