Cloud Outages – More than a Flat Tyre

DQINDIA Online

30 Nov 2023 11:38 IST

From their timing to frequency to bounce-back windows to eventual impact – outages in cloud environments are becoming serious enough to move to the front burner of an enterprise’s security strategy. Let’s watch this pot while it boils.

It may not happen every day. It can be fixed with a spare wheel in no time. Yet a tyre that has gone bust hurts quite badly, especially when the vehicle is a truck carrying a load for a lot of businesses. More so when this truck is dashing down a superhighway, where its slight loss of speed can cause big ripples of confusion and delays for other vehicles in that lane.

The truckload of cloud cannot afford these interruptions. And no matter how big or strong you are, an outage is not something any cloud player or customer can afford to take lightly. In the last few months alone, we have seen Oracle, AWS, Google, Microsoft, IT Glue, Datadog and many other cloud players suffer some glitch or the other.

From multi-day outages at IT Glue and OCI (Oracle Cloud Infrastructure), to a datacentre fire-led greyout at Google, to a halt in services for Azure, Teams and AWS at other points, almost every big and small name in this space has confronted this uninvited beast.

As per IsDown’s Cloud Providers Health Report – July 2023, AWS had two major incidents, with a total outage time of 4.3 hours—showing increased propagation delay and elevated error rates. Azure had three incidents with a total outage time of 26.4 hours showing network connectivity issues and logs data access issues. Its September 2023 data shows that AWS had a total of seven incidents with a total outage time of about 26 hours, with route resolver VPC query logs delayed, increased error rates for cluster upgrades and network connectivity issues. Azure had one incident with a total outage time of 14.5 hours.
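To put those hours in perspective, the sketch below converts the reported outage time into a rough monthly availability figure. It assumes, purely for illustration, that every reported incident hour counts as full downtime. In practice incidents often affect only one service or region, so these numbers are a pessimistic bound rather than a measured uptime. The function name `availability_pct` is hypothetical, and "about 26 hours" is taken as 26.0.

```python
def availability_pct(outage_hours: float, days_in_month: int) -> float:
    """Per cent of the month that was free of reported outage time."""
    total_hours = days_in_month * 24
    return 100.0 * (total_hours - outage_hours) / total_hours


# Outage hours as quoted from the IsDown reports; day counts are calendar facts.
# Assumption: every incident hour is treated as a full outage.
reported = {
    ("AWS", "July 2023"): (4.3, 31),
    ("Azure", "July 2023"): (26.4, 31),
    ("AWS", "September 2023"): (26.0, 30),
    ("Azure", "September 2023"): (14.5, 30),
}

for (provider, month), (hours, days) in reported.items():
    pct = availability_pct(hours, days)
    print(f"{provider} {month}: {pct:.2f}% of the month without a reported incident")
```

Read this way, a month with 26 incident hours sits well below the "three nines" (99.9 per cent) that many buyers expect, which is why the per-service and per-region detail behind each incident matters.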

Between Internet Service Providers (ISPs) and Cloud Service Providers (CSPs), ISPs account for the vast majority of global network disruptions, points out Mike Hicks, Principal Solutions Analyst, Cisco ThousandEyes (a Cisco company that monitors cloud and Internet performance). “However, though fewer in volume, CSP outages can have a significantly higher impact on users, and their users’ users, compared to ISP outages.”

And where does this beast bite? Everywhere – from missed business deliverables to wounded productivity to legal risks to penalties paid to regulators or clients to reputational risks to branding dents—almost everywhere.

Kicking the Tyre—Hurting the Foot

If we just glance at the business hours affected, the cost is as much as US$100,000 per hour, as shared by more than half of the corporate decision-makers in a fresh report from Parametrix that covered cloud downtime in 2022 among the three major providers: Amazon Web Services, Google Cloud, and Microsoft Azure. The report’s survey also found 95 per cent of corporate decision-makers saying that their business is dependent on the cloud, while 82 per cent pointed out that their product or organisation is dependent on the availability of cloud. Almost half called cloud a mission-critical service. About 60 per cent said in this survey that they are very concerned about cloud downtime. For 52 per cent, cloud downtime leads to customer churn, and for 50 per cent it can lead to lost revenue and sales. In fact, for 36 per cent here, cloud downtime is a non-recoverable and uninsured loss.

Bhoopendra Solanki, Chief Information Officer, Sakra World Hospital, shines the torch on the importance of data in healthcare. “It is directly proportional towards human life so the availability of the data or IT services is very crucial in the healthcare segment. If healthcare IT Spectrum is on the cloud then even one second outage will not be acceptable.” In Solanki’s opinion, the entire IT spectrum should be in high-availability (HA) mode from Point A (cloud) to Point B (organisation), and if that is not possible then one should create a local cache system inside the organisation to ride out such outages.
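A minimal sketch of the local cache idea follows, assuming a read-only lookup where a slightly old copy is better than no answer. The class name `CachedReader`, the `fetch_from_cloud` callable and the `max_age_seconds` limit are all hypothetical and are not drawn from any hospital system or vendor product.

```python
import time
from typing import Callable, Dict, Tuple


class CachedReader:
    """Read through to a cloud service, falling back to a local copy on failure."""

    def __init__(self, fetch_from_cloud: Callable[[str], str], max_age_seconds: float):
        self._fetch = fetch_from_cloud      # hypothetical call to the cloud service
        self._max_age = max_age_seconds     # oldest local copy we are willing to serve
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (value, served_from_cache). Re-raises if no usable copy exists."""
        try:
            value = self._fetch(key)
            self._cache[key] = (time.monotonic(), value)  # refresh the local copy
            return value, False
        except ConnectionError:
            if key in self._cache:
                stored_at, value = self._cache[key]
                if time.monotonic() - stored_at <= self._max_age:
                    return value, True      # cloud is unreachable: serve the local copy
            raise                           # nothing safe to serve
```

The `served_from_cache` flag lets the caller warn users that they are looking at a local copy. Writes made during an outage are a harder problem that this sketch does not attempt.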

Manoj Gupta, Associate Vice-President of IT at Restaurant Brands Asia (formerly known as Burger King India), seconds that. “Cloud outages can have severe consequences for a food industry or restaurant enterprise, impacting critical operations such as sales, customer satisfaction, and inventory planning and forecasting.” Gupta points out how cloud outages are directly linked to revenue loss since many ordering systems and payment processing services rely heavily on cloud-based platforms. “This can result in a significant decrease in sales. Also, from a customer experience perspective, cloud outages can lead to delays in order processing, ultimately resulting in a negative impact on customer satisfaction.”

Plus, there is the inventory planning impact. “The food industry/restaurant business operates on a weekly planning basis, as most of the products are perishable and have very little shelf life. Any disruption in the entire process can lead to a significant sales loss. In the worst-case scenario, stores may go offline temporarily.”

Enterprises that suffer disproportionally high downtime whether planned or unplanned will be at a competitive market disadvantage relative to those that have “always on” systems, warns Rajesh Awasthi, Vice President & Global Head of Managed Hosting and Cloud Services, Tata Communications. “As per Gartner, 71 per cent of organisations are poorly positioned in terms of disaster recovery capabilities. Effects of a cloud outage are multifaceted, encompassing the breakdown of business applications for both end customers and business users, eventually resulting in potential revenue loss due to transactional failures. Beyond financial impacts, there’s a risk of compromising customer trust, a loss that can be challenging to recover. Additionally, the potential loss of data during an outage introduces another layer of complexity.”

As seen in the findings of the 2022 annual Outage Analysis report by Uptime Institute, one in five organisations has experienced a ‘serious’ or ‘severe’ outage (involving significant financial losses, reputational damage, compliance breaches and, in some severe cases, loss of life) in the past three years. There is a clear surge in the prevalence of major outages. Third-party, commercial IT operators (including cloud, hosting, colocation, telecommunication providers, etc.) make up 63 per cent of all publicly reported outages tracked by Uptime since 2016. What also sharpens the nail for this tyre is that the gap between the beginning of a major public outage and full recovery has stretched substantially over the last five years.

“The connective tissue in the digital supply chain between user and application is the public Internet. For cloud service providers, this presents a visibility gap as soon as an outage occurs. While it’s technically ‘not their problem’ to solve, increased Internet visibility is a gap they’re actively trying to fill in order to better communicate to customers when one of their services is degraded and why.” - Mike Hicks, Principal Solutions Analyst, Cisco ThousandEyes

It is, indeed, worrisome then to see that all big cloud players have reported hundreds of performance interruptions per year in 2021 and 2022, an average of 25 every month (Parametrix report). Looks like the cloud goes down almost every day! No wonder 41 per cent have said they are more concerned about downtime this year than last year. If just eight hours of cloud downtime during business hours can be catastrophic for 31 per cent of decision-makers, it’s not hard to imagine how much chaos and wear and tear this puncture can cause.

It’s a four-wheel drive, after all!

Well, it’s not a small power-grid hiccup. With worldwide end-user spending on public cloud services rising to the tune of US$591.8 billion by 2023 (as per Gartner), an outage is something that commands serious attention and pre-emptive work. And today’s IT infrastructure, both back-end and front-end, has many strings going in and out of the cloud shoelaces. One loose end and one can fall right into the manhole of an overall outage. The growing incidence of outages and the growing dependencies in cloud add more furrowed brows here.

Recent findings from ThousandEyes Internet and Cloud Intelligence show total observed global network outages continuing to rise in 2023 so far, while the number of distinct application outages observed also seems to be heading north in the first half of 2023. Even if the number of application outages is lower than the number of network outages, the potential user impact of application outages is serious, especially as they entail a degradation at a single point of aggregation or dependency in the service delivery chain, leading to many ripples.

What makes these outages different from previous ones? Hicks explains that cloud outages can be categorised as either network level, infrastructure, or service level outages. “To a large degree, network level cloud outages previously appeared to be more prevalent, oftentimes localised to a region or even a specific data center, meaning the impact appeared to be somewhat contained.”

While we still see those types of cloud outages, Hicks adds, the more recent cloud outages are different as they’ve appeared to occur at the service level, including back-end functionality like authentication and load balancing. “By design, these services are often shared services which means there’s a level of functional interdependence, causing the effects of an outage to have a compounding impact on usability with potentially far greater reach globally.”

Jacks and Wrenches please

There are only two ways to travel then—avoid sharp and round things on the road and pack as many tools and extras in the trunk as you can.

Cloud outages are rare, but they do belong to that category of risk that has a very high impact despite the low probability, warns Dario Maisto, Senior Analyst at Forrester. “Restricting the definition of outages to those caused by events like a fire, a water leakage and other catastrophic events, fixes have to do mainly with having multiple availability zones to failover within the same region. The availability zones should be logically and physically separated. Logically means that they are independent from a power supply and network perspective. Physically means that they are located at a sufficient distance from each other such that natural or other catastrophic events do not bring down all availability zones in one region at the same time.”
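From the application side, failing over between zones can be as simple as trying each zone in turn. The sketch below is an illustration only: the zone names and the `request` callable are hypothetical, and real deployments usually put this logic in a load balancer or DNS health check rather than in client code.

```python
from typing import Callable, Sequence, Tuple


def call_with_failover(zones: Sequence[str],
                       request: Callable[[str], str]) -> Tuple[str, str]:
    """Try each zone in order and return (zone_that_answered, response)."""
    errors = []
    for zone in zones:
        try:
            return zone, request(zone)
        except ConnectionError as exc:
            errors.append(f"{zone}: {exc}")   # remember why this zone was skipped
    raise ConnectionError("all zones failed: " + "; ".join(errors))


# Hypothetical demo: zone-a is down, zone-b answers.
def demo_request(zone: str) -> str:
    if zone == "zone-a":
        raise ConnectionError("unreachable")
    return f"served from {zone}"


print(call_with_failover(["zone-a", "zone-b"], demo_request))
```

This only helps if the zones really are independent in the way Maisto describes. If they share power or network, every entry in the list fails together.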

“Every minute that operations are halted, or critical systems are inaccessible, revenue is lost. That’s where our cloud-based Disaster Recovery as a Service (DRaaS) comes in, offering enterprises the peace of mind and the ultimate protection of their critical applications and data,” Awasthi shares.

Gupta strongly suggests ways to implement redundancy and failover mechanisms. “This can be achieved by utilising multiple cloud providers or data centres to ensure continuous availability.

“Implementing redundancy at various levels, both hardware and software, can help ensure that if one component fails, another can seamlessly take over. Failover mechanisms can reduce downtime. Cloud providers should also think about having redundancy by having data centres in multiple geographical locations to mitigate the impact of regional outages. Also, to protect customer data, financial transactions, and the integrity of digital operations, cloud providers should maintain and update security protocols regularly,” Gupta underlines.

Tickets for Slow Driving?

Perhaps SLAs and penalties in Cloud contracts would help.

“As I said QSRs rely heavily on digital systems for order processing, inventory management, and customer interactions. Cloud service providers should take several proactive measures to minimise outages,” Gupta underlines.

That explains why SLAs are more than nice-to-have frills. “Regarding SLAs and penalties, they can certainly help from a QSR perspective. SLAs that include uptime guarantees and appropriate penalties for downtime can incentivise cloud providers to prioritise service reliability. However, it is critical for QSRs to negotiate SLAs that align with their specific business needs and customer service expectations. Robust SLAs can provide QSRs with a level of assurance and compensation in case of service disruptions, which is vital in the fast-paced and customer-driven QSR industry.”
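How an uptime guarantee turns into a penalty can be shown with a small calculation. The tiers below are invented for illustration and do not reflect any provider’s actual SLA. The function name `service_credit` and the 720-hour month are likewise assumptions.

```python
# Hypothetical credit tiers: (minimum uptime %, credit as % of the monthly fee).
# Checked from the strictest tier down; anything below the last tier earns full credit.
SLA_TIERS = [(99.9, 0.0), (99.0, 10.0), (95.0, 25.0)]


def service_credit(monthly_fee: float, downtime_hours: float,
                   hours_in_month: float = 720.0) -> float:
    """Credit owed to the customer for one month under the hypothetical tiers."""
    uptime_pct = 100.0 * (hours_in_month - downtime_hours) / hours_in_month
    for min_uptime, credit_pct in SLA_TIERS:
        if uptime_pct >= min_uptime:
            return monthly_fee * credit_pct / 100.0
    return monthly_fee  # uptime fell below the lowest tier


# Eight hours of downtime on a hypothetical 10,000 monthly fee.
print(service_credit(10_000, 8))
```

Under these made-up tiers, eight hours of downtime earns a credit of a quarter of the monthly fee. Set that against the US$100,000 per hour that many decision-makers reported losing and the gap between credit and loss is plain, which is why negotiating terms that fit the business matters.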

Solanki also feels that SLAs can be useful. “There can be many factors of outage which cloud players try to adhere to but there is one strong factor: connectivity between cloud and organisation should be part of cloud provider’s SLA. In the connectivity part, 5G can play an important role as a redundancy over wireless mode. Some SLA associated with commercial penalty will work.”

Here’s a less-anxious way to look at this breakdown. Breathe. Inhale. Exhale.

Shahin Khan, Founding Partner and Analyst, OrionX does not see the issue as that serious. “Public cloud systems are highly resilient but they are not fault-tolerant. Improving high availability for applications is quite possible, but it requires non-trivial changes to applications and data management. Cloud systems provide various means of doing so, including multi-zone and even multi-cloud support application architecture, but someone has to do the work to detect and recover from successively more complex classes of faults.”

By 2024, organisations adopting a robust cybersecurity architecture will reduce the financial impact of security incidents by an average of 90 per cent, as predicted by Gartner. Awasthi is also optimistic that providers are continually working on fortifying their infrastructure, but users should also actively engage in contingency planning, data backup, and clear communication strategies to navigate and recover from such service interruptions effectively.

Manish Gupta, Vice President, Infrastructure Solutions Group, Dell Technologies India, gives a glimpse of the attention this issue is receiving on the side of some providers. “Our Zero Trust security architecture is based on three factors: universal continuous authentication of everything; robust authoritative policy driven behavior; and deeply integrated threat management. Dell Technologies also offers a range of security solutions, including firewalls, intrusion detection systems, and encryption technologies, to protect cloud environments from cyber threats and vulnerabilities.” Gupta emphasizes, “Our biggest differentiation lies in the fact that we collaborate with the world’s biggest cloud service providers to ensure that their platforms are secure and resilient, minimizing the risk of outages and security breaches.”

Maisto also brings in the idea of the shared responsibility model—where it is the customer’s responsibility to ensure that the choices that are made in terms of cloud vendors’ infrastructure correctly address the requirements like those belonging to the digital sovereignty domain.

Hicks echoes that in some way when he shares his advice for CIOs and CXOs. “As a result of the increased adoption of cloud services, an organisation’s data now traverses the Internet and other networks that sit beyond their control, moving into, out of, and within the cloud. While they don’t own these environments, IT leaders are still responsible for maintaining the chain of custody of their data and it’s critical to deploy visibility solutions that will allow them to see, end-to-end, into every connection from every user to every application.”

It turns out that both cloud players and customers have to confront this problem in serious, immediate and laser-focused ways. Whether you are at the wheel or a passenger renting a cab, it never hurts to get down on your knee and change a tyre. Just roll up the sleeves.

By Pratima H

pratimah@cybermedia.co.in