Every data center outage is expensive, and with the accelerating pace of digital initiatives, the pressure to maintain uptime has never been greater. Given the increased load on data centers, human operators alone can no longer handle the volume of issues arising from growing complexity. Today, more than ever, IT operations teams are being asked to manage complex IT infrastructure, and rising volumes of data make it harder still to manage today's dynamic, constantly changing IT environments. All of this increases the probability of outages.
Despite many technological advances, downtime remains common and is increasing. The Uptime Institute's 2022 Annual Outage Analysis report highlights that one in five organizations report experiencing a "serious" or "severe" outage (involving significant financial losses, reputational damage, compliance breaches and, in some severe cases, loss of life) in the past three years, marking a slight upward trend in the prevalence of major outages. According to Uptime's 2022 Data Center Resiliency Survey, 80% of data center managers and operators have experienced some type of outage in the past three years – a marginal increase over the norm, which has fluctuated between 70% and 80%. Over 60% of failures result in at least $100,000 in total losses, up substantially from 39% in 2019. The share of outages that cost upwards of $1 million increased from 11% to 15% over the same period.
Reasons for data center outages
The reasons for outages vary: network failures, hardware or software malfunctions, power failures, cyberattacks and human error can all bring a data center down.
We take a look at the key reasons for service outages and recommend best practices to mitigate them:
Networking issues: According to Uptime’s 2022 Data Center Resiliency Survey, networking-related problems have been the single biggest cause of all IT service downtime incidents – regardless of severity – over the past three years. Outages attributed to software, network and systems issues are on the rise due to complexities from the increasing use of cloud technologies, software-defined architectures and hybrid, distributed architectures.
Power-related problems: Power-related outages account for 43% of outages that are classified as significant (causing downtime and financial loss). The single biggest cause of power incidents is uninterruptible power supply (UPS) failures, according to the Uptime survey.
Human errors: The same Uptime survey finds that nearly 40% of organizations have suffered a major outage caused by human error over the past three years. Of these incidents, an overwhelming 85% stem from staff failing to follow procedures or from flaws in the processes and procedures themselves.
Ransomware and DDoS: Cyberattacks are another major cause of downtime. Data breaches caused by ransomware and DDoS attacks are common today and can lead to service interruptions. As ransomware becomes increasingly sophisticated and prevalent, it has become a board-level concern at major corporations. A report by NTT Security Holdings states that ransomware is impacting business continuity, with 240% growth in ransomware incident response engagements over the past 24 months.
Best practices for preventing outages
Resilience is a key attribute of data centers, and every enterprise must strive to prevent outages through a series of initiatives. First, organizations must regularly analyze their resiliency across every important component of the data center ecosystem (power, cooling, connectivity, service providers). There is a direct correlation between data center temperature and equipment failure, so monitoring temperature is critical to preventing breakdowns or shutdowns of equipment.
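As a concrete illustration, temperature monitoring can be as simple as comparing sensor readings against inlet-air thresholds and raising alerts before equipment overheats. The sketch below is a minimal, hypothetical Python example; the sensor names and thresholds (loosely based on the commonly cited ASHRAE recommended inlet range) are assumptions, not any vendor's defaults.

```python
# Minimal sketch of threshold-based temperature alerting.
# Sensor names, thresholds and the alert format are illustrative
# assumptions, not part of any specific monitoring product.

RECOMMENDED_MAX_C = 27.0  # widely cited upper bound for inlet air
CRITICAL_MAX_C = 32.0     # hypothetical shutdown-risk threshold

def check_temperatures(readings: dict) -> list:
    """Return alert messages for sensors exceeding thresholds."""
    alerts = []
    for sensor, temp_c in readings.items():
        if temp_c >= CRITICAL_MAX_C:
            alerts.append(f"CRITICAL: {sensor} at {temp_c:.1f}°C – risk of equipment shutdown")
        elif temp_c >= RECOMMENDED_MAX_C:
            alerts.append(f"WARNING: {sensor} at {temp_c:.1f}°C – above recommended range")
    return alerts

print(check_temperatures({"rack-a1-inlet": 24.5,
                          "rack-b3-inlet": 29.1,
                          "crac-2-return": 33.0}))
```

In practice the readings would come from environmental sensors or DCIM tooling, and the alerts would feed a paging system rather than standard output.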
Failure of UPS systems can also lead to downtime. Because most UPS systems are not truly tested until utility power actually fails, consistent remote monitoring of UPS systems provides real-time visibility and alerts administrators to potential problems before they cause downtime.
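A remote monitoring loop might evaluate a UPS status feed against simple health rules. The following is a hedged sketch only: the field names and thresholds are invented for illustration and do not correspond to any real UPS vendor's API.

```python
# Hypothetical sketch of UPS health evaluation; the status fields and
# thresholds below are assumptions, not a real UPS management interface.

def evaluate_ups(status: dict) -> list:
    """Flag conditions that commonly precede a UPS-related outage."""
    issues = []
    if status.get("on_battery"):
        issues.append("UPS running on battery – utility power lost")
    if status.get("battery_charge_pct", 100) < 80:
        issues.append("Battery charge below 80% – check charging circuit")
    if status.get("battery_age_months", 0) > 48:
        issues.append("Battery older than 4 years – schedule replacement")
    if status.get("last_self_test_passed") is False:
        issues.append("Last self-test failed – service required")
    return issues

print(evaluate_ups({"on_battery": False,
                    "battery_charge_pct": 72,
                    "battery_age_months": 50,
                    "last_self_test_passed": True}))
```

Run periodically against each UPS, such checks surface degrading batteries and failed self-tests long before a real power event exposes them.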
Software failures can also lead to outages and downtime, so it is necessary to update software and apply patches regularly. To keep patching consistent, AI can be used to scan for vulnerabilities and apply software updates or patches where required. AI can also proactively identify issues with data center equipment, application performance or security.
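At its simplest, an automated vulnerability scan compares an installed-software inventory against a list of known-vulnerable versions. The toy sketch below illustrates only that core comparison; the package names, versions and advisory data are invented, and a real scanner would pull from a live advisory feed and handle version ranges.

```python
# Toy sketch: flag installed packages whose exact versions appear in a
# known-vulnerable list, as an automated scan might. The inventory and
# advisory data here are invented for illustration.

KNOWN_VULNERABLE = {
    ("openssl", "1.1.1a"),
    ("log4j", "2.14.1"),
}

def scan(inventory: dict) -> list:
    """Return package==version strings that need patching."""
    return [f"{pkg}=={ver}" for pkg, ver in inventory.items()
            if (pkg, ver) in KNOWN_VULNERABLE]

print(scan({"openssl": "1.1.1a", "nginx": "1.24.0", "log4j": "2.17.2"}))
# prints ['openssl==1.1.1a']
```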
Network-related outages can be prevented through a combination of proactive network monitoring and automation that minimizes the possibility of human error. It is also advisable to build in network redundancy, so that if one network fails, an alternative network from a different service provider is available.
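A basic form of proactive monitoring across redundant paths is a reachability probe that prefers the primary provider and falls back to the secondary. The sketch below uses a plain TCP-connect check; the provider names are hypothetical, and the probe addresses come from documentation-only IP ranges.

```python
# Illustrative sketch of reachability checks across redundant network
# paths. Provider names and probe addresses are hypothetical
# (192.0.2.x and 198.51.100.x are reserved documentation ranges).
import socket
from typing import Optional

def path_is_up(host: str, port: int = 443, timeout: float = 1.0) -> bool:
    """TCP-connect probe: a cheap liveness check for one network path."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def select_active_path(paths: list) -> Optional[str]:
    """Return the first healthy path's name (list the primary first)."""
    for name, host, port in paths:
        if path_is_up(host, port):
            return name
    return None  # all paths down: escalate to on-call staff

# Example: prefer provider-a, fall back to provider-b.
active = select_active_path([("provider-a", "192.0.2.1", 443),
                             ("provider-b", "198.51.100.1", 443)])
```

Real deployments would probe continuously, track flap history and drive routing changes (e.g. BGP or SD-WAN policy) rather than a simple selection function, but the failover logic is the same in principle.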
Ideally, engage a third-party service provider that can audit resiliency and give an independent, unbiased benchmark. Choosing the right disaster recovery (DR) process can also help an organization recover quickly from an outage.
To protect against ransomware, enterprises must reduce user privileges, eliminate end-user admin accounts and use multi-factor authentication (MFA), as this significantly limits an attacker's opportunities to move laterally. Network segmentation reduces the attack surface, while endpoint detection and response (EDR) solutions with policy-based isolation can keep malware from spreading.
Research has shown that many data center outages are entirely preventable. If organizations invest in the right tools, technologies and processes, the majority of outages can be avoided.
This article has been written by Vimal Kaw, Colocation Product Head and New Site Selection Lead, NTT Ltd in India