Microsoft Root Cause Analysis – Azure Front Door Connectivity Incident (29–30 October 2025)
On 29–30 October 2025, a global connectivity issue with Azure Front Door caused intermittent service disruptions for Boltrics and other Microsoft customers. Microsoft has published a Root Cause Analysis (RCA) detailing the cause, mitigation, and preventive actions. The summary is included below.
Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident to hear from our engineering leaders and get your questions answered by our experts, or watch a recording of the livestream (available later on YouTube): https://aka.ms/AIR/QNBQ-5W8.
What happened?
Between 15:41 UTC on 29 October and 00:05 UTC on 30 October 2025, customers and Microsoft services leveraging Azure Front Door (AFD) and Azure Content Delivery Network (CDN) experienced connection timeout errors and Domain Name System (DNS) resolution issues. From 18:30 UTC on 29 October 2025, as the system recovered gradually, some customers started to see availability improve – albeit with increased latency – until the system fully stabilized by 00:05 UTC on 30 October 2025.
Affected Azure services included but were not limited to: Azure Active Directory B2C, Azure AI Video Indexer, Azure App Service, Azure Communication Services, Azure Databricks, Azure Healthcare APIs, Azure Maps, Azure Marketplace, Azure Media Services, Azure Portal, Azure Sphere Security Service, Azure SQL Database, and Azure Static Web Apps.
Other Microsoft services were also impacted, including Microsoft 365 (see: MO1181369), the Microsoft Communication Registry website, Microsoft Copilot for Security, Microsoft Defender (External Attack Surface Management), Microsoft Dragon Copilot, Microsoft Dynamics 365 and Power Platform (see: MX1181378), Microsoft Entra ID (Mobility Management Policy Service, Identity & Access Management, and User Management), Microsoft Purview, Microsoft Sentinel (Threat Intelligence), Visual Studio App Center, and customers’ ability to open support cases (both in the Azure Portal and by phone).
Read more about this issue below.
What went wrong and why?
Azure Front Door (AFD) and Azure Content Delivery Network (CDN) route traffic using globally distributed edge sites supporting customers as well as Microsoft services, including various management portals. The AFD control plane generates customer configuration metadata that the data plane consumes for all customer-initiated operations on the AFD platform, including purge and Web Application Firewall (WAF) operations. Since customer applications hosted on AFD and CDN can be accessed by their end users from anywhere in the world, these changes are deployed globally across all edge sites to provide a consistent user experience.
A specific sequence of customer configuration changes, performed across two different control plane build versions, resulted in incompatible customer configuration metadata being generated. The customer configuration changes themselves were valid and non-malicious – however, they produced metadata that, when deployed to edge site servers, exposed a latent bug in the data plane. This incompatibility triggered a crash during asynchronous processing within the data plane service. The defect escaped detection due to a gap in our pre-production validation, since not all features are validated across different control plane build versions.
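To illustrate the class of failure described here, the following minimal Python sketch (our illustration only – not Microsoft's actual schema or code) shows how metadata emitted by two control plane build versions can each be valid on its own, yet expose a latent bug in a data plane consumer that was written against a single metadata shape. All names, fields, and values are hypothetical.

```python
# Hypothetical sketch of the failure class described above. Neither metadata
# fragment is malformed on its own; the data plane consumer simply assumes a
# single shape for "waf_rules" and crashes on the other one.

def control_plane_v1(route_host: str) -> dict:
    # Older build: WAF rules serialized as plain strings.
    return {"route": {"host": route_host}, "waf_rules": ["block-sqli"]}

def control_plane_v2(route_host: str) -> dict:
    # Newer build: WAF rules serialized as structured objects.
    return {"route": {"host": route_host}, "waf_rules": [{"id": "r1", "action": "block"}]}

def data_plane_apply(metadata: dict) -> None:
    # Latent bug: written against the v2 shape only, so a v1-style string
    # rule raises TypeError ("string indices must be integers").
    for rule in metadata["waf_rules"]:
        print("installing WAF rule:", rule["action"])

if __name__ == "__main__":
    data_plane_apply(control_plane_v2("contoso.example"))       # processes cleanly
    try:
        data_plane_apply(control_plane_v1("fabrikam.example"))  # incompatible shape
    except TypeError as exc:
        print("data plane worker crashed:", exc)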
Azure Front Door employs multiple deployment stages and a configuration protection system to ensure safe propagation of customer configurations. This system validates configurations at each deployment stage and advances only after receiving positive health signals from the data plane. Once deployments are rolled out successfully, the configuration propagation system also updates a ‘Last Known Good’ (LKG) snapshot (a periodic snapshot of healthy customer configurations) so that deployments can be automatically rolled back in case of any issues. The configuration protection system waits approximately a minute between stages, completing on average within 5-10 minutes globally.
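The following Python sketch illustrates this staged, health-gated rollout pattern. It is a simplified model under our own assumptions – stage names, timings, and the health-check interface are illustrative, not Microsoft's implementation.

```python
import time

# Illustrative sketch of a staged configuration rollout with health gating and
# a Last Known Good (LKG) snapshot, loosely modelled on the process described
# above. Stage names, timings, and health checks are hypothetical.

STAGES = ["pre-production", "canary", "regional", "global"]
STAGE_BAKE_SECONDS = 1          # the RCA describes roughly a minute per stage

def propagate(config: dict, is_stage_healthy, lkg: dict) -> bool:
    for stage in STAGES:
        print(f"applying config to {stage}")
        time.sleep(STAGE_BAKE_SECONDS)            # wait before sampling health
        if not is_stage_healthy(stage):
            # In the real system the prior LKG snapshot would be redeployed here.
            print(f"unhealthy signal at {stage}; halting rollout")
            return False
    # Only after every stage reports healthy does the snapshot advance.
    lkg["snapshot"] = config
    print("rollout complete; LKG updated")
    return True

if __name__ == "__main__":
    lkg = {"snapshot": {"version": 41}}
    propagate({"version": 42}, is_stage_healthy=lambda stage: True, lkg=lkg)
    print("current LKG:", lkg["snapshot"])
```

Note that in this sketch, as in the process described above, the LKG snapshot only advances once every stage has reported healthy – which is also why a failure that surfaces after the health window, as in this incident, ends up captured in the LKG itself.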
During this incident, the incompatible customer configuration change was made at 15:35 UTC and was applied to the data plane in a pre-production stage at 15:36 UTC. Our configuration propagation monitoring continued to receive healthy signals – although the problematic metadata was present, it had not yet caused any issues. Because the data plane crash surfaced asynchronously, after approximately five minutes, the configuration passed through the protection safeguards and propagated to later stages. This configuration (with the incompatible metadata) completed propagation to a majority of edge sites by 15:39 UTC. Since the incompatible customer configuration metadata was deployed successfully to the majority of the fleet with positive health signals, the LKG was also updated with this configuration.
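The key timing issue here is that the health gate samples the data plane before the deferred failure fires. The following scaled-down Python sketch (an assumption-laden illustration, with the minutes of the incident compressed to seconds) shows how an asynchronous crash can slip past a synchronous health check.

```python
import threading
import time

# Illustration of why an asynchronous failure can slip past a synchronous
# health gate: the gate samples health shortly after the config is applied,
# but the crash only fires later in a background task. In the incident the
# gap was minutes; here it is compressed to seconds.

HEALTH_CHECK_DELAY = 1     # the gate checks health this long after applying
ASYNC_FAILURE_DELAY = 3    # the latent bug only triggers after this long

state = {"healthy": True}

def apply_config() -> None:
    # The problematic metadata is accepted immediately; the failure is
    # deferred to an asynchronous processing task.
    def deferred_processing():
        time.sleep(ASYNC_FAILURE_DELAY)
        state["healthy"] = False          # simulated data plane crash
    threading.Thread(target=deferred_processing, daemon=True).start()

def health_gate() -> bool:
    time.sleep(HEALTH_CHECK_DELAY)
    return state["healthy"]

if __name__ == "__main__":
    apply_config()
    print("gate verdict:", "healthy" if health_gate() else "unhealthy")   # healthy
    time.sleep(ASYNC_FAILURE_DELAY)
    print("actual state later:", "healthy" if state["healthy"] else "crashed")  # crashed
```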
The data plane impact began in phases, starting with our pre-production environment at 15:41 UTC, and replicated across all edge sites globally by 15:45 UTC. As the data plane impact started, the configuration protection system detected this and stopped all new and in-flight customer configuration changes from being propagated at 15:43 UTC. The incompatible customer configuration was processed by edge servers, causing crashes across our various edge sites. This also impacted AFD’s internal DNS service, hosted on the edge sites of Azure Front Door, resulting in intermittent DNS resolution errors for a subset of AFD customer requests. This sequence of events was the trigger for the global impact on the AFD platform.
This AFD incident on 29 October was not directly related to the previous AFD incident, from 9 October. Both incidents were broadly related to configuration propagation risk (inherent to a global Content Delivery Network, in which route/WAF/origin changes must be quickly deployed worldwide) but while the failure mode was similar, the underlying defects were different. Azure Front Door’s configuration protection system is designed to validate configurations and proceed only after receiving positive health signals from the data plane. During the AFD incident on 9 October (Tracking ID: QNBQ-5W8) that protection system worked as intended, but was later bypassed by our engineering team during a manual cleanup operation. During this AFD incident on 29 October (Tracking ID: YKYN-BWZ) the incompatible customer configuration metadata progressed through the protection system, before the delayed asynchronous processing task resulted in the crash. Some of the learnings and repair items from the earlier incident are applicable to this incident as well and are included in the list of repairs below.
How did we respond?
The issue started at 15:41 UTC. By 15:43 UTC the configuration protection system had activated in response to widespread data plane issues, automatically blocking all new and in-flight configuration changes from being deployed worldwide. Our monitoring detected the issue at 15:48 UTC, prompting our investigation.
Since the latest ‘last known good’ (LKG) version had been updated with the conflicting metadata, we chose not to revert to it. To ensure system stability, we also decided not to roll back to prior versions of the LKG. Instead, we opted to edit the latest LKG by manually removing the problematic customer configurations. We also blocked all customer configuration changes from propagating to the data plane at 17:30 UTC so that, as we mitigated, we would not reintroduce the issue. At 17:40 UTC we began deploying the updated LKG configuration across the global fleet. Recovery required reloading all customer configurations at every edge site and rebalancing traffic gradually, to avoid overload conditions as edge sites returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue.
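A gradual, capacity-aware traffic shift is the essence of this phased recovery. The sketch below is a toy model of that idea under our own assumptions – the capacities, step size, and pacing are invented for illustration and are not Microsoft's values.

```python
import time

# Toy model of phased traffic rebalancing: recovered edge sites are brought
# back in small increments so that returning traffic never exceeds the
# capacity that is healthy at that moment. All numbers are illustrative.

def rebalance(total_traffic: float, healthy_capacity: float, step: float = 0.1) -> None:
    shifted = 0.0
    while shifted < total_traffic:
        # Never shift more load than the currently healthy fleet can absorb.
        target = min(shifted + step * total_traffic, healthy_capacity, total_traffic)
        if target <= shifted:
            print("holding: healthy capacity exhausted, waiting for more sites")
        else:
            shifted = target
            print(f"routing {shifted / total_traffic:.0%} of traffic to recovered sites")
        healthy_capacity += 0.15 * total_traffic   # more edge sites come back online
        time.sleep(0.1)                            # pacing between increments

if __name__ == "__main__":
    rebalance(total_traffic=100.0, healthy_capacity=20.0)
```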
Many downstream services that use AFD were able to fail over to prevent further customer impact, including the Microsoft Entra and Intune portals, and Azure Active Directory B2C. In a more complex example, the Azure Portal leveraged its standard recovery process to successfully transition away from AFD during the incident. Users of the Portal would have seen limited impact during this failover process and would then have been able to use the Portal without issue. Unfortunately, some services within the Portal did not have an established fallback strategy, and therefore parts of the Portal continued to experience failures even after the Portal itself had recovered (for example, Marketplace).
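The failover pattern described here can be sketched as a client that tries an AFD-fronted hostname first and falls back to a secondary endpoint when DNS resolution or the connection fails. This is a generic illustration under our own assumptions; both hostnames below are placeholders, not real Microsoft endpoints.

```python
import socket
import urllib.error
import urllib.request

# Generic failover sketch: try the primary (AFD-fronted) endpoint, then fall
# back to a secondary path when DNS resolution, connection, or timeout errors
# occur. Hostnames are placeholders for illustration only.

ENDPOINTS = [
    "https://portal.afd-fronted.example",   # primary, behind Azure Front Door
    "https://portal.fallback.example",      # secondary path that bypasses AFD
]

def fetch_with_failover(path: str = "/", timeout: float = 5.0) -> bytes:
    last_error = None
    for base in ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            # Covers DNS resolution failures, refused connections, and timeouts.
            last_error = exc
            print(f"{base} unreachable ({exc}); trying next endpoint")
    raise RuntimeError(f"all endpoints failed: {last_error}")

if __name__ == "__main__":
    try:
        print(len(fetch_with_failover()), "bytes fetched")
    except RuntimeError as exc:
        print(exc)
```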
Timeline of major incident milestones (all times UTC, 29 October 2025 unless noted):
15:35 – Incompatible customer configuration change made.
15:36 – Configuration applied to the data plane in a pre-production stage.
15:39 – Configuration completed propagation to a majority of edge sites; LKG updated with this configuration.
15:41 – Data plane impact began, starting in the pre-production environment.
15:43 – Configuration protection system blocked all new and in-flight configuration changes.
15:45 – Impact replicated across all edge sites globally.
15:48 – Issue detected by monitoring, prompting investigation.
17:30 – All customer configuration changes blocked from propagating to the data plane.
17:40 – Deployment of the updated LKG configuration began across the global fleet.
18:30 – Gradual recovery began; some customers saw availability improve.
00:05 on 30 October – System fully stabilized and impact mitigated.
After mitigation, we temporarily blocked all AFD customer configuration changes at the Azure Resource Manager (ARM) level to ensure the safety of the data plane. We also implemented additional safeguards, including (i) fixing the control plane and data plane defects, (ii) removing asynchronous processing from the data plane, (iii) introducing an additional ‘pre-canary’ stage to test customer configurations, (iv) extending the bake time during each stage of the configuration propagation, and (v) improving data plane recovery time from approximately 4.5 hours to approximately one hour. We began draining the customer configuration queue from 2 November 2025. Once these safeguards were fully implemented, this restriction was removed on 5 November 2025.
The introduction of new stages in the configuration propagation pipeline, coupled with additional ‘bake time’ between stages, has increased configuration propagation time for all operations on the AFD platform – including create, update, delete, and WAF operations, as well as cache purges. We continue to work on platform enhancements to ensure a robust configuration delivery pipeline and further reduce the propagation time. For more details on these temporary propagation delays, refer to http://aka.ms/AFD_FAQ.
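The effect of adding a stage and lengthening bake times on end-to-end propagation can be illustrated with the hypothetical pipeline below. The "before" timing roughly follows the figures stated earlier in this report (about a minute per stage); the "after" values are invented for illustration, since Microsoft has not published the new durations.

```python
# Hypothetical representation of the hardened propagation pipeline described
# above: an extra 'pre-canary' stage and longer bake times per stage.
# Stage names and durations are illustrative only.

PIPELINE_BEFORE = [
    {"stage": "pre-production", "bake_minutes": 1},
    {"stage": "canary",         "bake_minutes": 1},
    {"stage": "global",         "bake_minutes": 1},
]

PIPELINE_AFTER = [
    {"stage": "pre-production", "bake_minutes": 5},
    {"stage": "pre-canary",     "bake_minutes": 5},   # new stage added post-incident
    {"stage": "canary",         "bake_minutes": 5},
    {"stage": "global",         "bake_minutes": 5},
]

def total_bake_minutes(pipeline) -> int:
    # Total time spent baking across all stages before a config is fully live.
    return sum(stage["bake_minutes"] for stage in pipeline)

print("before:", total_bake_minutes(PIPELINE_BEFORE), "minutes of bake time")
print("after: ", total_bake_minutes(PIPELINE_AFTER), "minutes of bake time")
```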
How are we making incidents like this less likely or less impactful?
To prevent issues like this, and improve deployment safety…
To reduce the blast radius of potential future issues…
To be able to recover more quickly from issues…
To improve our communications and support…
Microsoft will provide a preliminary post-incident report within two business days, followed by a final report five business days after the incident’s closure.
Once we receive these reports, we will share them on our status page.
It appears that all environments have now fully recovered and services are operating normally again.
We will continue to monitor the situation to ensure stability.
We are seeing recovery for some customers, but it may take up to four hours before all environments are fully operational. Unfortunately, we are unable to make any changes on our side to speed up the recovery process.
Microsoft is actively working to restore all affected services as quickly as possible. We will continue to monitor the situation and share updates as they become available.
We are seeing all services recovering, and our app platform is coming back online. It appears that Microsoft’s recent configuration changes are having a positive effect.
We’re continuing to monitor the situation closely. Sorry for the inconvenience caused, and thank you for your patience.
Update from Microsoft: We have initiated the deployment of our 'last known good' configuration. This is expected to be fully deployed in about 30 minutes, at which point customers will start to see initial signs of recovery. Once this is completed, the next stage is to recover nodes while routing traffic through healthy nodes.
We are still working on restoring our app platform. Some services are starting to recover, but full functionality has not yet been restored.
The issue is related to unreachable Microsoft DNS endpoints, and we continue to monitor Microsoft’s progress closely. Updates will follow as soon as we have more information.
We are currently unable to recover our app platform, as Microsoft’s DNS endpoints remain unreachable. This issue is related to the ongoing Microsoft Front Door/DNS incident.
Our team continues to work on restoring full functionality and is monitoring Microsoft’s progress closely. We will provide further updates as soon as possible.
All environments appear to be operational again following the earlier Microsoft Front Door incident. However, our app platform/Web portal is still experiencing issues. Our team is actively working to restore full functionality.
If you are still experiencing any other issues, please create a support ticket so we can assist you further.
All environments appear to be operational again following the earlier Microsoft Front Door incident. Connectivity to Microsoft services, including Dynamics 365 Business Central, has been restored.
If you are still experiencing any issues, please create a support ticket so we can assist you further.
We apologize for the inconvenience caused.
We are seeing environments starting to recover from the Microsoft Front Door issue. Some services, including Dynamics 365 Business Central, are becoming accessible again.
We continue to monitor the situation closely and will provide an update once full recovery is confirmed.
Microsoft Front Door issues are still ongoing. Services like Dynamics 365 Business Central may remain unavailable.
We’re in contact with Microsoft and waiting for their resolution. Updates will follow as soon as we have more information.
It has been confirmed that the issue is related to Microsoft Front Door. Microsoft has identified the problem and is currently implementing a fix, expected within the next few minutes.
We are in contact with Microsoft and are hopeful that this will resolve the ongoing connectivity issues. We will continue to monitor the situation and provide updates as needed.
It appears that multiple Microsoft services are currently experiencing a global outage. This includes services such as Dynamics 365 Business Central and other Microsoft platforms.
We are in contact with Microsoft and are awaiting their recovery of the affected services. We will continue to monitor the situation closely and provide updates as soon as new information becomes available.
We are currently investigating connectivity issues affecting Microsoft Dynamics 365 Business Central. The issue appears to be related to Microsoft Front Door or DNS resolution problems, which are impacting multiple customers globally.
We are in contact with Microsoft to gather more information and will provide updates as soon as they become available.