Root Cause Analysis Report Microsoft: This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a "Final" PIR with additional details/learnings.
What happened?
Between 11:45 and 13:58 UTC on 30 July 2024, a subset of customers experienced intermittent connection errors, timeouts, or latency spikes while connecting to Microsoft services that rely on Azure Front Door (AFD) and Azure Content Delivery Network (CDN). The two main impacted services were AFD and Azure CDN themselves, along with downstream services that rely on them, including the Azure portal and a subset of Microsoft 365 and Microsoft Purview services. From 13:58 to 19:43 UTC, a smaller set of customers continued to observe a low rate of connection timeouts.
What went wrong and why?
Azure Front Door (AFD) is Microsoft's scalable platform for web acceleration, global load balancing, and content delivery, operating in nearly 200 locations worldwide, including datacenters within Azure regions and edge sites. AFD and Azure CDN are built with platform defenses against network and application layer Distributed Denial-of-Service (DDoS) attacks. In addition, these services rely on the Azure network DDoS protection service for attacks at the network layer. You can read more about the protection mechanisms at https://learn.microsoft.com/azure/ddos-protection/ddos-protection-overview and https://learn.microsoft.com/azure/frontdoor/front-door-ddos.
Between 10:15 and 10:45 UTC, a volumetric distributed TCP SYN flood DDoS attack occurred at multiple Azure Front Door and CDN sites. This attack was automatically mitigated by the Azure Network DDoS protection service and had minimal customer impact.
At 11:45 UTC, as the Network DDoS protection service was disengaging and resuming default traffic routing to the Azure Front Door service, the network routes could not be updated within one specific site in Europe. This happened because of Network DDoS control plane failures to that specific site, due to a local power outage. Consequently, traffic inside Europe continued to be forwarded to AFD through our DDoS protection services, instead of returning directly to AFD. This event in isolation would not have caused any impact.
However, an unrelated latent network configuration issue caused traffic from outside Europe to be routed to the DDoS protection system within Europe. This led to localized congestion, which caused customers to experience high latency and connectivity failures across multiple regions. The vast majority of the impact was mitigated by 13:58 UTC, around two hours later, when we resolved the routing issue. A small subset of customers without retry logic in their applications may have experienced residual effects until 19:43 UTC.
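The report notes that applications without retry logic may have seen residual effects. As a rough illustration of what such retry logic can look like on the client side, here is a minimal Python sketch of retrying a transient failure with capped exponential backoff and random jitter. The function name call_with_retries, its parameters, and the default values are assumptions made for this example; they are not taken from the report or from any Azure SDK.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying transient failures with capped exponential backoff.

    All names and defaults here are illustrative assumptions, not values
    recommended by the incident report.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Cap the exponential delay, then wait a random time below the cap
            # ("full jitter") so that many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

A caller would wrap the network request in a zero-argument function, for example call_with_retries(lambda: fetch_status()), where fetch_status is a hypothetical function that raises ConnectionError or TimeoutError on a transient failure. Retrying is only safe for operations that can be repeated without side effects.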
How did we respond?
Our internal monitors detected impact on our Europe edge sites at 11:47 UTC, immediately prompting a series of investigations. Once we identified that the network routes could not be updated within that one specific site, we updated the DDoS protection configuration system to avoid traffic congestion. These changes successfully mitigated most of the impact by 13:58 UTC. Availability returned to pre-incident levels by 19:43 UTC once the default network policies were fully restored.
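For readers who want the durations implied by the timestamps above, the following short Python sketch works them out. The variable names are chosen for this example only; the times are the ones stated in the report (all UTC, 30 July 2024).

```python
from datetime import datetime, timezone


def utc(hhmm):
    """Build a timezone-aware UTC datetime on 30 July 2024 from an 'HH:MM' string."""
    parsed = datetime.strptime("2024-07-30 " + hhmm, "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


impact_start = utc("11:45")      # customer impact begins
detected = utc("11:47")          # internal monitors detect impact
mostly_mitigated = utc("13:58")  # most of the impact mitigated
fully_restored = utc("19:43")    # availability back to pre-incident levels

time_to_detect = detected - impact_start
time_to_mitigate = mostly_mitigated - impact_start
residual_window = fully_restored - mostly_mitigated
total_duration = fully_restored - impact_start

print("Time to detect:   ", time_to_detect)    # 0:02:00
print("Time to mitigate: ", time_to_mitigate)  # 2:13:00
print("Residual window:  ", residual_window)   # 5:45:00
print("Total duration:   ", total_duration)    # 7:58:00
```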
How we are making incidents like this less likely or less impactful
The report can be read here: https://azure.status.microsoft/nl-nl/status/history/
Update from Microsoft:
Details Title: Users may be unable to access your environment
User impact: Users may have been unable to access your environment.
Final Status: We completed the failover to healthy redundant infrastructure and, following a period of monitoring service health diagnostics, we confirmed that access to your environment was restored.
A Post Incident Report will be published within five days.
Preliminary Root Cause: A portion of an upstream dependency service experienced an unplanned outage.
An unexpected usage spike resulted in Azure Front Door (AFD) components performing below acceptable thresholds, leading to intermittent errors, timeouts, and latency spikes. We have implemented network configuration changes and have performed failovers to provide alternate network paths for relief.
Next Steps: We're analyzing performance data and trends on the affected systems to prevent this problem from happening again.
We are pleased to inform you that the issue impacting Microsoft Azure appears to be resolved, and all services are in the process of recovering.
Microsoft will publish a Root Cause Analysis (RCA) based on this incident, and we will also publish this on our outage page once available.
We are sorry for the trouble this outage has caused and appreciate your patience and understanding during this time.
Next Update: RCA will be published as soon as it is available
If there are any services not recovering, we will provide a new update.
Thank you for your cooperation,
Microsoft update:
Starting at approximately 12:00 UTC on 30 July 2024, a subset of customers may experience issues connecting to Microsoft services globally.
Current status: We have multiple engineering teams engaged to diagnose and resolve the issue as soon as possible. We've identified multiple workstreams and are working to mitigate impacted workstreams by performing failover operations. More details will be provided as they become available.
Update from Microsoft:
We are investigating reports of issues connecting to Microsoft services globally. Customers may experience timeouts connecting to Azure services. We have multiple engineering teams engaged to diagnose and resolve the issue. More details will be provided as soon as possible.
Update from Microsoft:
Network Infrastructure - Issues accessing a subset of Microsoft services
We are investigating reports of issues connecting to Microsoft services in Europe. More details will be provided as they become available.
We are currently experiencing a worldwide outage that is impacting Microsoft Azure.
What we can see is that existing sessions may keep working. However, please keep in mind that starting new sessions, including opening a new tab or changing roles/profiles, will not work at this time.
We will provide further updates as soon as more information becomes available. Thank you for your patience and understanding during this time.
We are currently experiencing a regional outage that is impacting some of our services. Microsoft is working on a solution. Currently impacted services:
We are in close contact with Microsoft to resolve this issue and are actively analyzing the situation.
We will provide further updates as soon as more information becomes available. Thank you for your patience and understanding during this time.
We are currently experiencing issues with accessing Azure, which is impacting some components of our services, particularly the app platform. This problem originates from an ongoing issue reported by Microsoft:
Azure - Issues accessing Azure We are investigating an issue impacting Azure. More details will be provided as they become available.
Our team is closely monitoring the situation. We understand the importance of these services to your operations and are working diligently to minimize the impact.
We will provide further updates as soon as more information becomes available. Thank you for your patience and understanding during this time.
Next Update: As soon as more details are available
Thank you for your cooperation,