Microsoft Root Cause Analysis – Azure Front Door Connectivity Incident (29–30 October 2025)
On 29–30 October 2025, a global connectivity issue with Azure Front Door caused intermittent service disruptions for Boltrics and other Microsoft customers. Microsoft has published a Root Cause Analysis (RCA) detailing the cause, mitigation, and preventive actions. The summary is included below.
Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident to hear from our engineering leaders and get your questions answered by our experts, or watch a recording of the livestream (available later on YouTube): https://aka.ms/AIR/QNBQ-5W8.
What happened?
Between 15:41 UTC on 29 October and 00:05 UTC on 30 October 2025, customers and Microsoft services leveraging Azure Front Door (AFD) and Azure Content Delivery Network (CDN) experienced connection timeout errors and Domain Name System (DNS) resolution issues. From 18:30 UTC on 29 October 2025, as the system recovered gradually, some customers started to see availability improve – albeit with increased latency – until the system fully stabilized by 00:05 UTC on 30 October 2025.
Affected Azure services included but were not limited to: Azure Active Directory B2C, Azure AI Video Indexer, Azure App Service, Azure Communication Services, Azure Databricks, Azure Healthcare APIs, Azure Maps, Azure Marketplace, Azure Media Services, Azure Portal, Azure Sphere Security Service, Azure SQL Database, and Azure Static Web Apps.
Other Microsoft services were also impacted, including Microsoft 365 (see: MO1181369), the Microsoft Communication Registry website, Microsoft Copilot for Security, Microsoft Defender (External Attack Surface Management), Microsoft Dragon Copilot, Microsoft Dynamics 365 and Power Platform (see: MX1181378), Microsoft Entra ID (Mobility Management Policy Service, Identity & Access Management, and User Management), Microsoft Purview, Microsoft Sentinel (Threat Intelligence), Visual Studio App Center, and customers’ ability to open support cases (both in the Azure Portal and by phone).
Read more about this issue below.
What went wrong and why?
Azure Front Door (AFD) and Azure Content Delivery Network (CDN) route traffic using globally distributed edge sites supporting customers as well as Microsoft services, including various management portals. The AFD control plane generates customer configuration metadata that the data plane consumes for all customer-initiated operations on the AFD platform, including purge and Web Application Firewall (WAF) operations. Since customer applications hosted on AFD and CDN can be accessed by their end users from anywhere in the world, these changes are deployed globally across all edge sites to provide a consistent user experience.
A specific sequence of customer configuration changes, performed across two different control plane build versions, resulted in incompatible customer configuration metadata being generated. The customer configuration changes themselves were valid and non-malicious – however, they produced metadata that, when deployed to edge site servers, exposed a latent bug in the data plane. This incompatibility triggered a crash during asynchronous processing within the data plane service. The defect escaped detection due to a gap in our pre-production validation, since not all features are validated across different control plane build versions.
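To illustrate the class of failure described here, the following minimal Python sketch (our illustration only – not Microsoft's actual schema or code) shows how metadata emitted by two control plane build versions can each be valid on its own, yet expose a latent bug in a data plane consumer that was written against a single metadata shape. All names, fields, and values are hypothetical.

```python
# Hypothetical sketch of the failure class described above. Neither metadata
# fragment is malformed on its own; the data plane consumer simply assumes a
# single shape for "waf_rules" and crashes on the other one.

def control_plane_v1(route_host: str) -> dict:
    # Older build: WAF rules serialized as plain strings.
    return {"route": {"host": route_host}, "waf_rules": ["block-sqli"]}

def control_plane_v2(route_host: str) -> dict:
    # Newer build: WAF rules serialized as structured objects.
    return {"route": {"host": route_host}, "waf_rules": [{"id": "r1", "action": "block"}]}

def data_plane_apply(metadata: dict) -> None:
    # Latent bug: written against the v2 shape only, so a v1-style string
    # rule raises TypeError ("string indices must be integers").
    for rule in metadata["waf_rules"]:
        print("installing WAF rule:", rule["action"])

if __name__ == "__main__":
    data_plane_apply(control_plane_v2("contoso.example"))       # processes cleanly
    try:
        data_plane_apply(control_plane_v1("fabrikam.example"))  # incompatible shape
    except TypeError as exc:
        print("data plane worker crashed:", exc)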
Azure Front Door employs multiple deployment stages and a configuration protection system to ensure safe propagation of customer configurations. This system validates configurations at each deployment stage and advances only after receiving positive health signals from the data plane. Once deployments are rolled out successfully, the configuration propagation system also updates a ‘Last Known Good’ (LKG) snapshot (a periodic snapshot of healthy customer configurations) so that deployments can be automatically rolled back in case of any issues. The configuration protection system waits approximately a minute between stages, completing on average within 5-10 minutes globally.
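The following Python sketch illustrates this staged, health-gated rollout pattern. It is a simplified model under our own assumptions – stage names, timings, and the health-check interface are illustrative, not Microsoft's implementation.

```python
import time

# Illustrative sketch of a staged configuration rollout with health gating and
# a Last Known Good (LKG) snapshot, loosely modelled on the process described
# above. Stage names, timings, and health checks are hypothetical.

STAGES = ["pre-production", "canary", "regional", "global"]
STAGE_BAKE_SECONDS = 1          # the RCA describes roughly a minute per stage

def propagate(config: dict, is_stage_healthy, lkg: dict) -> bool:
    for stage in STAGES:
        print(f"applying config to {stage}")
        time.sleep(STAGE_BAKE_SECONDS)            # wait before sampling health
        if not is_stage_healthy(stage):
            # In the real system the prior LKG snapshot would be redeployed here.
            print(f"unhealthy signal at {stage}; halting rollout")
            return False
    # Only after every stage reports healthy does the snapshot advance.
    lkg["snapshot"] = config
    print("rollout complete; LKG updated")
    return True

if __name__ == "__main__":
    lkg = {"snapshot": {"version": 41}}
    propagate({"version": 42}, is_stage_healthy=lambda stage: True, lkg=lkg)
    print("current LKG:", lkg["snapshot"])
```

Note that in this sketch, as in the process described above, the LKG snapshot only advances once every stage has reported healthy – which is also why a failure that surfaces after the health window, as in this incident, ends up captured in the LKG itself.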
During this incident, the incompatible customer configuration change was made at 15:35 UTC and was applied to the data plane in a pre-production stage at 15:36 UTC. Our configuration propagation monitoring continued to receive healthy signals – although the problematic metadata was present, it had not yet caused any issues. Because the data plane crash surfaced asynchronously, after approximately five minutes, the configuration passed through the protection safeguards and propagated to later stages. This configuration (with the incompatible metadata) completed propagation to a majority of edge sites by 15:39 UTC. Since the incompatible customer configuration metadata was deployed successfully to the majority of the fleet with positive health signals, the LKG was also updated with this configuration.
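The key timing issue here is that the health gate samples the data plane before the deferred failure fires. The following scaled-down Python sketch (an assumption-laden illustration, with the minutes of the incident compressed to seconds) shows how an asynchronous crash can slip past a synchronous health check.

```python
import threading
import time

# Illustration of why an asynchronous failure can slip past a synchronous
# health gate: the gate samples health shortly after the config is applied,
# but the crash only fires later in a background task. In the incident the
# gap was minutes; here it is compressed to seconds.

HEALTH_CHECK_DELAY = 1     # the gate checks health this long after applying
ASYNC_FAILURE_DELAY = 3    # the latent bug only triggers after this long

state = {"healthy": True}

def apply_config() -> None:
    # The problematic metadata is accepted immediately; the failure is
    # deferred to an asynchronous processing task.
    def deferred_processing():
        time.sleep(ASYNC_FAILURE_DELAY)
        state["healthy"] = False          # simulated data plane crash
    threading.Thread(target=deferred_processing, daemon=True).start()

def health_gate() -> bool:
    time.sleep(HEALTH_CHECK_DELAY)
    return state["healthy"]

if __name__ == "__main__":
    apply_config()
    print("gate verdict:", "healthy" if health_gate() else "unhealthy")   # healthy
    time.sleep(ASYNC_FAILURE_DELAY)
    print("actual state later:", "healthy" if state["healthy"] else "crashed")  # crashed
```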
The data plane impact began in phases, starting with our pre-production environment at 15:41 UTC, and replicated across all edge sites globally by 15:45 UTC. As the data plane impact started, the configuration protection system detected this and stopped all new and in-flight customer configuration changes from being propagated at 15:43 UTC. The incompatible customer configuration was processed by edge servers, causing crashes across our various edge sites. This also impacted AFD’s internal DNS service, hosted on the edge sites of Azure Front Door, resulting in intermittent DNS resolution errors for a subset of AFD customer requests. This sequence of events was the trigger for the global impact on the AFD platform.
This AFD incident on 29 October was not directly related to the previous AFD incident, from 9 October. Both incidents were broadly related to configuration propagation risk (inherent to a global Content Delivery Network, in which route/WAF/origin changes must be quickly deployed worldwide) but while the failure mode was similar, the underlying defects were different. Azure Front Door’s configuration protection system is designed to validate configurations and proceed only after receiving positive health signals from the data plane. During the AFD incident on 9 October (Tracking ID: QNBQ-5W8) that protection system worked as intended, but was later bypassed by our engineering team during a manual cleanup operation. During this AFD incident on 29 October (Tracking ID: YKYN-BWZ) the incompatible customer configuration metadata progressed through the protection system, before the delayed asynchronous processing task resulted in the crash. Some of the learnings and repair items from the earlier incident are applicable to this incident as well and are included in the list of repairs below.
How did we respond?
The issue started at 15:41 UTC. By 15:43 UTC the configuration protection system had activated in response to widespread data plane issues, automatically blocking all new and in-flight configuration changes from being deployed worldwide. Our monitoring detected the issue at 15:48 UTC, prompting our investigation.
Since the latest ‘last known good’ (LKG) version had been updated with the conflicting metadata, we chose not to revert to it. To ensure system stability, we also decided not to roll back to prior versions of the LKG. Instead, we opted to edit the latest LKG by manually removing the problematic customer configurations. We also blocked all customer configuration changes from propagating to the data plane at 17:30 UTC so that, as we mitigated, we would not reintroduce the issue. At 17:40 UTC we began deploying the updated LKG configuration across the global fleet. Recovery required reloading all customer configurations at every edge site and rebalancing traffic gradually, to avoid overload conditions as edge sites returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue.
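A gradual, capacity-aware traffic shift is the essence of this phased recovery. The sketch below is a toy model of that idea under our own assumptions – the capacities, step size, and pacing are invented for illustration and are not Microsoft's values.

```python
import time

# Toy model of phased traffic rebalancing: recovered edge sites are brought
# back in small increments so that returning traffic never exceeds the
# capacity that is healthy at that moment. All numbers are illustrative.

def rebalance(total_traffic: float, healthy_capacity: float, step: float = 0.1) -> None:
    shifted = 0.0
    while shifted < total_traffic:
        # Never shift more load than the currently healthy fleet can absorb.
        target = min(shifted + step * total_traffic, healthy_capacity, total_traffic)
        if target <= shifted:
            print("holding: healthy capacity exhausted, waiting for more sites")
        else:
            shifted = target
            print(f"routing {shifted / total_traffic:.0%} of traffic to recovered sites")
        healthy_capacity += 0.15 * total_traffic   # more edge sites come back online
        time.sleep(0.1)                            # pacing between increments

if __name__ == "__main__":
    rebalance(total_traffic=100.0, healthy_capacity=20.0)
```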
Many downstream services that use AFD were able to fail over to prevent further customer impact, including the Microsoft Entra and Intune portals, and Azure Active Directory B2C. In a more complex example, the Azure Portal leveraged its standard recovery process to successfully transition away from AFD during the incident. Users of the Portal would have seen limited impact during this failover process and would then have been able to use the Portal without issue. Unfortunately, some services within the Portal did not have an established fallback strategy, and therefore parts of the Portal continued to experience failures even after the Portal itself had recovered (for example, Marketplace).
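The failover pattern described here can be sketched as a client that tries an AFD-fronted hostname first and falls back to a secondary endpoint when DNS resolution or the connection fails. This is a generic illustration under our own assumptions; both hostnames below are placeholders, not real Microsoft endpoints.

```python
import socket
import urllib.error
import urllib.request

# Generic failover sketch: try the primary (AFD-fronted) endpoint, then fall
# back to a secondary path when DNS resolution, connection, or timeout errors
# occur. Hostnames are placeholders for illustration only.

ENDPOINTS = [
    "https://portal.afd-fronted.example",   # primary, behind Azure Front Door
    "https://portal.fallback.example",      # secondary path that bypasses AFD
]

def fetch_with_failover(path: str = "/", timeout: float = 5.0) -> bytes:
    last_error = None
    for base in ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            # Covers DNS resolution failures, refused connections, and timeouts.
            last_error = exc
            print(f"{base} unreachable ({exc}); trying next endpoint")
    raise RuntimeError(f"all endpoints failed: {last_error}")

if __name__ == "__main__":
    try:
        print(len(fetch_with_failover()), "bytes fetched")
    except RuntimeError as exc:
        print(exc)
```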
Timeline of major incident milestones (all times UTC, 29 October 2025 unless noted):
15:35 – Incompatible customer configuration change made.
15:36 – Configuration applied to the data plane in a pre-production stage.
15:39 – Configuration completed propagation to a majority of edge sites; LKG updated with this configuration.
15:41 – Data plane impact began, starting in the pre-production environment.
15:43 – Configuration protection system blocked all new and in-flight configuration changes.
15:45 – Impact replicated across all edge sites globally.
15:48 – Issue detected by monitoring, prompting investigation.
17:30 – All customer configuration changes blocked from propagating to the data plane.
17:40 – Deployment of the updated LKG configuration began across the global fleet.
18:30 – Gradual recovery began; some customers saw availability improve.
00:05 on 30 October – System fully stabilized and impact mitigated.
After mitigation, we temporarily blocked all AFD customer configuration changes at the Azure Resource Manager (ARM) level to ensure the safety of the data plane. We also implemented additional safeguards, including (i) fixing the control plane and data plane defects, (ii) removing asynchronous processing from the data plane, (iii) introducing an additional ‘pre-canary’ stage to test customer configurations, (iv) extending the bake time during each stage of the configuration propagation, and (v) improving data plane recovery time from approximately 4.5 hours to approximately one hour. We began draining the customer configuration queue from 2 November 2025. Once these safeguards were fully implemented, this restriction was removed on 5 November 2025.
The introduction of new stages in the configuration propagation pipeline, coupled with additional ‘bake time’ between stages, has increased configuration propagation time for all operations on the AFD platform – including create, update, delete, and WAF operations, as well as cache purges. We continue to work on platform enhancements to ensure a robust configuration delivery pipeline and further reduce the propagation time. For more details on these temporary propagation delays, refer to http://aka.ms/AFD_FAQ.
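The effect of adding a stage and lengthening bake times on end-to-end propagation can be illustrated with the hypothetical pipeline below. The "before" timing roughly follows the figures stated earlier in this report (about a minute per stage); the "after" values are invented for illustration, since Microsoft has not published the new durations.

```python
# Hypothetical representation of the hardened propagation pipeline described
# above: an extra 'pre-canary' stage and longer bake times per stage.
# Stage names and durations are illustrative only.

PIPELINE_BEFORE = [
    {"stage": "pre-production", "bake_minutes": 1},
    {"stage": "canary",         "bake_minutes": 1},
    {"stage": "global",         "bake_minutes": 1},
]

PIPELINE_AFTER = [
    {"stage": "pre-production", "bake_minutes": 5},
    {"stage": "pre-canary",     "bake_minutes": 5},   # new stage added post-incident
    {"stage": "canary",         "bake_minutes": 5},
    {"stage": "global",         "bake_minutes": 5},
]

def total_bake_minutes(pipeline) -> int:
    # Total time spent baking across all stages before a config is fully live.
    return sum(stage["bake_minutes"] for stage in pipeline)

print("before:", total_bake_minutes(PIPELINE_BEFORE), "minutes of bake time")
print("after: ", total_bake_minutes(PIPELINE_AFTER), "minutes of bake time")
```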
How are we making incidents like this less likely or less impactful?
To prevent issues like this, and improve deployment safety…
To reduce the blast radius of potential future issues…
To be able to recover more quickly from issues…
To improve our communications and support…
Microsoft will provide a preliminary post-incident report within two business days, followed by a final report five business days after the incident’s closure.
Once we receive these reports, we will share them on our status page.
It appears that all environments have now fully recovered and services are operating normally again.
We will continue to monitor the situation to ensure stability.
We are seeing recovery for some customers, but it may take up to four hours before all environments are fully operational. Unfortunately, we are unable to make any changes on our side to speed up the recovery process.
Microsoft is actively working to restore all affected services as quickly as possible. We will continue to monitor the situation and share updates as they become available.
We are seeing all services recovering, and our app platform is coming back online. It appears that Microsoft’s recent configuration changes are having a positive effect.
We’re continuing to monitor the situation closely. Sorry for the inconvenience caused, and thank you for your patience.
Update from Microsoft: We have initiated the deployment of our 'last known good' configuration. This is expected to be fully deployed in about 30 minutes, at which point customers will start to see initial signs of recovery. Once this is completed, the next stage is to recover nodes while routing traffic through healthy nodes.
We are still working on restoring our app platform. Some services are starting to recover, but full functionality has not yet been restored.
The issue is related to unreachable Microsoft DNS endpoints, and we continue to monitor Microsoft’s progress closely. Updates will follow as soon as we have more information.
We are currently unable to recover our app platform, as Microsoft’s DNS endpoints remain unreachable. This issue is related to the ongoing Microsoft Front Door/DNS incident.
Our team continues to work on restoring full functionality and is monitoring Microsoft’s progress closely. We will provide further updates as soon as possible.
All environments appear to be operational again following the earlier Microsoft Front Door incident. However, our app platform/Web portal is still experiencing issues. Our team is actively working to restore full functionality.
If you are still experiencing any other issues, please create a support ticket so we can assist you further.
All environments appear to be operational again following the earlier Microsoft Front Door incident. Connectivity to Microsoft services, including Dynamics 365 Business Central, has been restored.
If you are still experiencing any issues, please create a support ticket so we can assist you further.
We apologize for the inconvenience caused.
We are seeing environments starting to recover from the Microsoft Front Door issue. Some services, including Dynamics 365 Business Central, are becoming accessible again.
We continue to monitor the situation closely and will provide an update once full recovery is confirmed.
Microsoft Front Door issues are still ongoing. Services like Dynamics 365 Business Central may remain unavailable.
We’re in contact with Microsoft and waiting for their resolution. Updates will follow as soon as we have more information.
It has been confirmed that the issue is related to Microsoft Front Door. Microsoft has identified the problem and is currently implementing a fix, expected within the next few minutes.
We are in contact with Microsoft and are hopeful that this will resolve the ongoing connectivity issues. We will continue to monitor the situation and provide updates as needed.
It appears that multiple Microsoft services are currently experiencing a global outage. This includes services such as Dynamics 365 Business Central and other Microsoft platforms.
We are in contact with Microsoft and are awaiting their recovery of the affected services. We will continue to monitor the situation closely and provide updates as soon as new information becomes available.
We are currently investigating connectivity issues affecting Microsoft Dynamics 365 Business Central. The issue appears to be related to Microsoft Front Door or DNS resolution problems, which are impacting multiple customers globally.
We are in contact with Microsoft to gather more information and will provide updates as soon as they become available.