The 10 Biggest Cloud Outages Of 2024 (So Far)

by nlqip
August 1, 2024

From AT&T to Salesforce and Microsoft, these are among the biggest cloud outages this year so far.

A major AT&T outage in February. Salesforce service failures in May. And Microsoft solution providers facing downtime in Azure and Microsoft 365 in July.

These are just some of the biggest cloud outages the world has faced so far in 2024.

For the list, CRN focused on cloud issues of particular importance to solution providers, thus skipping outages this year for consumer products including Meta’s Facebook and Instagram and Microsoft’s LinkedIn.

2024 Cloud Outages

This list does not include the July 19 faulty CrowdStrike update incident that downed millions of Windows machines because the fallout from the event is still unfolding.

Although adoption of cloud technologies has transformed organizations and entire industries, even uncommon outages can prove costly.

The Uptime Institute, an advisory organization, reported in March that, according to its research, “significant, serious or severe” outages can cost more than $100,000 and, in some cases, more than $1 million.

“Each year there are, on average, 10 to 20 high-profile IT outages or data center events globally that cause serious or severe financial loss, business and customer disruption, reputational loss and, in extreme cases, loss of life,” according to the organization.

A June report by Cisco’s ThousandEyes subsidiary said that cloud service provider (CSP) outages account for a growing share of outages relative to internet service provider (ISP) outages.

“In January, we noted that ‘the ratio of ISP to CSP outages [had] changed from 89:11 in 2022 to 83:17 in 2023,’” according to the report. “In the first five months of 2024, that ratio rebalance accelerated significantly: to 73:27. … Additionally, when we compare the first five months of 2024 to the five months of 2023, we see a similar upward trend, with the percentage of CSP outages increasing at a greater rate in the H1 2024 period.”
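To make the shift in those ratios concrete, the short sketch below converts each quoted ISP:CSP ratio into a CSP share and prints the change between periods. Only the three ratios come from the ThousandEyes report; the variable and function names are illustrative.

```python
# ISP:CSP outage ratios as quoted in the ThousandEyes report.
ratios = {
    "2022": (89, 11),
    "2023": (83, 17),
    "2024 (Jan-May)": (73, 27),
}


def csp_share(isp: int, csp: int) -> float:
    """Return the CSP share of outages as a percentage of the total."""
    return 100 * csp / (isp + csp)


shares = {period: csp_share(*pair) for period, pair in ratios.items()}

# Percentage-point change in CSP share between consecutive periods.
periods = list(shares)
changes = {
    f"{a} -> {b}": shares[b] - shares[a]
    for a, b in zip(periods, periods[1:])
}

for label, delta in changes.items():
    print(f"{label}: {delta:+.0f} percentage points")
```

On these figures, the CSP share rose by 6 percentage points from 2022 to 2023 and by a further 10 points in the first five months of 2024.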

The vendor also saw application outages grow, with “an 8% increase in the first five months of 2024, compared to the same period in 2023.”

“One reason applications deserve attention and focus is that their architecture makes them less resilient to outage conditions than ISPs and CSPs,” according to ThousandEyes. “Internet and cloud providers have been able to improve redundancy and resilience to outages over time; the number of moving parts in an application that sit outside the app developer’s control makes resiliency much more challenging in that space.”

This list is part of CRN’s 2024 Year In Review (So Far) series, which includes the 10 hottest cloud computing startups of 2024 (so far) and the 10 hottest cloud security startup companies of 2024 (so far).

Read on for more on this year’s biggest cloud outages so far.

OCI Goes Down In January

Oracle co-founder and Chief Technology Officer Larry Ellison likes to say that his cloud services, unlike those of leading cloud vendors such as Amazon Web Services, never go down.

But on Jan. 16, Oracle users faced a “disruption on the company’s network that impacted customers and downstream partners interacting with Oracle Cloud services in multiple regions, including the U.S., Canada, China, Panama, Norway, the Netherlands, India, Germany, Malaysia, Sweden, Czech Republic, and Norway,” according to a Cisco ThousandEyes report Feb. 2.

“ThousandEyes first observed this incident around 1:45 PM (UTC) (8:45 AM [EST]) and appeared to center on Oracle nodes located in various regions worldwide. Thirty-five minutes after first being observed, all the nodes exhibiting outage conditions appeared to clear; however, 10 minutes later, some nodes began exhibiting outage conditions again. The disruption lasted around 40 minutes in total.”

The issue apparently hit “a large number of data center sites and downstream services, such as NetSuite,” according to ThousandEyes, which saw 100 percent packet loss on affected interfaces.

“The incident appeared to coincide with—or occur in reasonably close proximity to—a security patch released by the vendor,” according to the report. “They could be related, but there’s nothing in ThousandEyes observations or data that definitively links the two occurrences.”

Oracle NetSuite is featured in CRN’s 2024 Partner Program Guide.

Database Upgrade Sinks Atlassian Jira In January

Atlassian’s start to the year was less than smooth, with its Jira project management tool giving users 503 Service Unavailable messages and other errors for about four hours starting at 6:52 a.m. UTC on Jan. 18.

ThousandEyes said that Jira services were back to normal operations by 10:30 a.m. UTC. The issues hit Jira Work Management, Jira Software, Jira Product Discovery and other services offered by Australia-based Atlassian, according to a ThousandEyes report Feb. 2.

Atlassian attributed degraded performance for the Jira products family to “a scheduled database upgrade on an internal Atlassian Marketplace service.”

“This degraded performance manifested in increasing response times and eventually time outs,” according to the vendor. “This service degradation then cascaded upstream and resulted in requests timing out across the Jira family of products, impacting product experiences.”
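The cascade Atlassian describes, where one slow internal service drags down the requests that wait on it, can be illustrated with a toy model. This is a sketch under stated assumptions, not Atlassian’s architecture: the timings, the `request_outcome` function and the idea of a per-dependency time budget are all hypothetical.

```python
JIRA_WORK_MS = 200          # hypothetical time a request spends on its own work
REQUEST_TIMEOUT_MS = 5_000  # hypothetical end-to-end timeout seen by users


def request_outcome(dependency_ms, dependency_budget_ms=None):
    """Return 'ok', 'degraded' or 'timeout' for one simulated request."""
    if dependency_budget_ms is not None and dependency_ms > dependency_budget_ms:
        # Give up on the slow dependency early and respond without its data.
        total = JIRA_WORK_MS + dependency_budget_ms
        return "degraded" if total <= REQUEST_TIMEOUT_MS else "timeout"
    # Otherwise the request waits for the dependency in full.
    total = JIRA_WORK_MS + dependency_ms
    return "ok" if total <= REQUEST_TIMEOUT_MS else "timeout"


# Dependency response times growing as the (hypothetical) upgrade proceeds.
latencies = [100, 1_000, 4_000, 8_000, 20_000]

unbounded = [request_outcome(ms) for ms in latencies]
bounded = [request_outcome(ms, dependency_budget_ms=1_500) for ms in latencies]

print("no budget:  ", unbounded)
print("with budget:", bounded)
```

Without a budget, requests begin timing out once the dependency becomes slow enough; with one, they return a degraded response instead. Whether that trade-off is acceptable depends on how essential the dependency’s data is to the response.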

Atlassian is featured in CRN’s 2024 Partner Program Guide.

A Series Of Microsoft Issues In January

Multiple Microsoft services were hit with issues in January, with Azure Resource Manager (ARM) facing degradation on Jan. 21.

Users experienced issues with ARM for about seven hours, from 1:30 to 8:58 UTC, especially those in the Central U.S., East U.S., West Central U.S. and South Central U.S. regions.

The ARM issue affected downstream Azure services including CDN, Virtual Machines, Data Factory, Azure Container Registry and Service Bus, according to Microsoft.

The issue came from a June 2020 ARM private preview integration with Entra Continuous Access Evaluation, according to Microsoft. “Unbeknownst to us, this preview feature of the ARM CAE implementation contained a latent code defect that caused issues when authentication to Entra failed,” according to the vendor. “The defect would cause ARM nodes to fail on startup whenever ARM could not authenticate to an Entra tenant enrolled in the preview.”
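The failure mode Microsoft describes, a node that cannot start because one authentication call fails, can be sketched in a few lines. This is not Microsoft’s code or its fix: the tenant names, `authenticate` and both `start_node_*` functions are hypothetical, and the tolerant variant shows only one common way to keep an optional feature from blocking startup.

```python
class AuthError(Exception):
    """Raised by the hypothetical identity client when authentication fails."""


def authenticate(tenant, reachable):
    """Pretend to fetch a token; fail for tenants that cannot be reached."""
    if tenant not in reachable:
        raise AuthError(f"cannot authenticate to {tenant}")
    return f"token-for-{tenant}"


def start_node_fragile(preview_tenants, reachable):
    # Any single authentication failure propagates and aborts startup.
    tokens = {t: authenticate(t, reachable) for t in preview_tenants}
    return {"status": "running", "tokens": tokens}


def start_node_tolerant(preview_tenants, reachable):
    # Record the failure and keep starting; the preview feature is
    # simply unavailable for the tenants that could not be reached.
    tokens, degraded = {}, []
    for t in preview_tenants:
        try:
            tokens[t] = authenticate(t, reachable)
        except AuthError:
            degraded.append(t)
    return {"status": "running", "tokens": tokens, "degraded": degraded}


tenants = ["tenant-a", "tenant-b"]
reachable = {"tenant-a"}

try:
    start_node_fragile(tenants, reachable)
    fragile_started = True
except AuthError:
    fragile_started = False

tolerant = start_node_tolerant(tenants, reachable)
print(fragile_started, tolerant["status"], tolerant["degraded"])
```

In the fragile variant one unreachable tenant prevents the whole node from starting, which mirrors the behavior the vendor described; the tolerant variant starts anyway and reports which tenants are degraded.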

The Microsoft Key Vault team fixed the code and the vendor promised to improve “monitoring signals on role crashes for reduced time spent on identifying the cause(s), and for earlier detection of availability impact,” among other fixes.

A ThousandEyes report said that “the saving grace for Microsoft was that it occurred on a weekend, reducing the impact on users.”

“However, the critical nature of ARM in Azure operations meant that the users who were impacted could do little but wait for a fix,” according to the report.

On Jan. 26, Microsoft Teams had a widespread outage. Microsoft posted on X at 8:45 a.m. Pacific that “we’re investigating an issue impacting multiple Microsoft Teams features.”

“We’ve identified a networking issue impacting a portion of the Teams service and we’re performing a failover to remediate impact,” the vendor posted on X at 9:17 a.m.

Real-time outage monitor Downdetector reported receiving about 14,500 reports of a Microsoft Teams outage by 10:41 a.m. It received about 600 reports of a Microsoft 365 outage by around that time.

A ThousandEyes report on the incident said that it “observed these failures starting at approximately 4 PM (UTC) (8 AM [PST]), and they persisted for more than 7 hours before the incident appeared to resolve for many users by 11:10 PM (UTC).”

Microsoft has more than 400,000 partners worldwide and is a member of CRN’s 2024 Channel Chiefs.

Google Cloud Regional Metadata Store Issue In February

On Feb. 14, a regional metadata store issue resulted in disruption for Google Cloud us-west1 users, ThousandEyes said in a post March 1.

The incident lasted about two hours and 40 minutes, according to Google. “Our engineering team mitigated the issue by isolating the problematic traffic and have implemented measures to prevent a recurrence,” Google said, attributing issues to its regional metadata store.

The outage hit a variety of Google Cloud products, Vertex AI products and Identity and Access Management (IAM).

Google has more than 100,000 partners worldwide, according to CRN’s 2024 Channel Chiefs.

February AT&T Outage Catches FCC Attention

On Feb. 22, AT&T users reported outages for the telecommunications giant’s services, including internet access. Downdetector.com stated that the highest concentration of outages was in Houston, Chicago, Dallas, San Antonio and Atlanta.

On Feb. 25, AT&T CEO John Stankey said in a statement that the outage appeared “due to the application and execution of an incorrect process used while working to expand our network.” The vendor also offered $5 credits to customers affected by the outage.

Ookla called it “the largest operator outage in the world since 2020,” with its Downdetector subsidiary capturing more than 1.8 million reports related to the nationwide outage.

In July, the FCC issued a report on the incident, attributing the cause to a lack of peer review, failure to adequately test post-installation, insufficient safeguards and controls for approving changes that affect the network, and other factors.

The report noted that AT&T has made changes to prevent the issue from happening again, including “scanning the network for any network elements lacking the controls that would have prevented the outage, and promptly putting those controls in place.” The report said that the incident was referred to the Enforcement Bureau “for potential violations of parts 4 and 9 of the Commission’s rules.”

“This outage illustrates the need for mobile wireless carriers to adhere to best practices, implement adequate controls in their networks to mitigate risks, and be capable of responding quickly to restore service when an outage occurs,” according to the report. “The Bureau plans to release a Public Notice, based on its analysis of this and other recent outages, reminding service providers of the importance of implementing relevant industry-accepted best practices, including those recommended by the Communications Security, Reliability, and Interoperability Council.”

AT&T is a member of CRN’s 2024 Channel Chiefs.

Comcast Network Issues In March

Issues with Comcast’s network appeared to hinder Amazon Web Services (AWS), Salesforce, Cisco Webex and other applications and services on March 5, starting at around 7:45 a.m. UTC and ending at about 21:40 UTC, ThousandEyes said in a report March 15.

“The outage appears to have impacted traffic as it traversed Comcast’s network backbone in Texas, including traffic that originated in other regions like California and Colorado,” according to the report. “The onset was sudden. Traffic traversing the affected infrastructure saw an immediate drop off—100% packet loss—with no apparent ramp that might indicate congestion or other stress conditions in the network.”

Comcast Business is featured in CRN’s 2024 Partner Program Guide.

Cloudflare’s Unpkg Causes Headaches

Starting around 8 UTC on April 12, multiple websites using the Cloudflare-powered Unpkg free content delivery network (CDN) experienced 520 errors, “a general response that typically occurs when the origin server can’t complete a request due to protocol violations, unexpected events, or empty responses,” ThousandEyes said in a report April 26.

“Unpkg came back online around 1:00 PM (UTC) after Fly.io, the service that unpkg’s origin server uses for auto-scaling infrastructure, deployed a fix to recover the affected sites,” according to the report.

The Verge put the number of websites down in the thousands, noting that Unpkg powers more than 4 billion requests a day.

San Francisco-based Cloudflare is a member of CRN’s 2024 Channel Chiefs.

Salesforce Service Failures In May

Intermittent failures from a third-party Domain Name System (DNS) service provider were the culprit for Salesforce service failures that started May 16, according to a report from the customer relationship management (CRM) software and AI tools vendor.

The San Francisco-based vendor put the duration at four hours and 30 minutes, starting at 4 p.m. UTC.

The vendor said that during the incident, customers found that managed packages using Visualforce were unable to contact Salesforce on both first-party and Hyperforce instances.

Salesforce has about 12,000 partners worldwide.

AT&T Equipment Failure in May

The telecom giant dealt with another outage on May 22, smaller than the one in February, that mostly hit Virginia and North Carolina residents. The issue was equipment failure, according to an Ookla report June 11.

The number of self-reported issues peaked at about 1,300 in the morning “before subsiding one hour later,” according to Ookla.

“We worked as quickly as possible to restore service to some customers in the coastal areas of Virginia and North Carolina whose service may have been affected this morning by an equipment failure. We apologize for the inconvenience,” an AT&T spokesperson told The Washington Times at the time.

Microsoft Outages In July

This list doesn’t include the July 19 faulty CrowdStrike update incident, but even if CrowdStrike alone ends up bearing responsibility for it, July gave Microsoft watchers plenty to talk about when it comes to Microsoft cloud resilience.

A July 26 blog post by ThousandEyes reported on an Azure issue earlier in the month that led to service interruptions for Grammarly.

On July 13, “Azure reported that the Azure OpenAI (AOAI) service has an automation system that is implemented regionally but uses a global configuration to manage the lifecycle for certain backend resources,” according to the report. “A change was made to update this configuration to delete unused resources in an AOAI internal subscription. There was a quota on the number of storage accounts on this subscription, which were unused and intended to be cleaned up to prevent storage quota pressure.”

Hours before the CrowdStrike incident, “Microsoft experienced an unrelated (outage) that affected access to various Azure services and customer accounts configured with a single-region service in the Central US region,” according to the ThousandEyes blog post.

“This outage occurred around the same time as the CrowdStrike incident, from 9:56 PM (UTC) on July 18 to 12:15 PM (UTC) on July 19,” according to the post. “The close timing of the two incidents may have caused some confusion and led to the larger global IT outage being mistakenly attributed to Microsoft. Although Microsoft systems were affected during the CrowdStrike incident, it was completely unrelated to the Azure incident.”
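The start and end times quoted above span two calendar days, so the length of the Azure incident is easy to misread. A quick calculation, using only the timestamps from the ThousandEyes post, works it out:

```python
from datetime import datetime, timezone

# Start and end of the Azure Central US incident as quoted by ThousandEyes.
start = datetime(2024, 7, 18, 21, 56, tzinfo=timezone.utc)  # 9:56 PM UTC, July 18
end = datetime(2024, 7, 19, 12, 15, tzinfo=timezone.utc)    # 12:15 PM UTC, July 19

duration = end - start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours {minutes} minutes")
```

That puts the incident at 14 hours and 19 minutes from first impact to the end time given in the post.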

The outage included “failures of service management operations and connectivity or availability of services,” with ThousandEyes saying that “connectivity into the Central US region appeared impaired, with forwarding loss being observed at the ingress points to the affected region … Among those impacted were Confluent, Elastic Cloud, and Microsoft 365.”

“Microsoft’s status update also identified a configuration change as the underlying cause that impacted the connectivity of backend services, specifically storage clusters and compute resources. This then triggered some automated mitigation with services being restarted repeatedly.”

And on July 30, hours before Microsoft reported earnings for its fourth fiscal quarter, the vendor posted on X at 5:48 a.m. Pacific that “we’re currently investigating access issues and degraded performance with multiple Microsoft 365 services and features.”

Microsoft’s online status page for its cloud services was also down. At 7:51 a.m. Pacific, Microsoft added on X: “We’ve applied mitigations and rerouted user requests to provide relief. We’re monitoring the service to confirm resolution.”
