Performance degradation in Business Central for some of our customers

Resolved

Dear customer,

First of all, we would like to sincerely apologize once again for the inconvenience caused by the service outage on June 11th. We fully understand the significant impact this has had on your operations and greatly appreciate your patience and understanding.

Microsoft has shared a detailed Root Cause Analysis, confirming that the issue resulted from capacity limitations within one of their hosting clusters (AS4556). While the root cause lies within Microsoft’s infrastructure, we take the consequences for our customers very seriously.

We will continue our discussions with Microsoft to explore how we can help prevent similar issues in the future and improve both response and resolution times. Our goal remains to ensure the stability and performance of the 3PL Dynamics solution for all our customers.

Root Cause Analysis (RCA) from Microsoft:

Summary

Based on our investigation and performance data analysis, the service outage that primarily affected Dynamics 365 Business Central customers in the Netherlands region was caused by cluster resource exhaustion. This resulted from reaching maximum node capacity in the default host group (AS4556, the main affected host group), leading to persistently high CPU utilization and insufficient memory allocation across the cluster nodes.

Technical Root Cause

Primary Issue: Cluster Resource Saturation

The outage was fundamentally caused by the cluster reaching its maximum operational capacity. Our analysis confirmed that the majority of affected tenants were hosted on this specific cluster infrastructure. The cluster had exceeded its designed tenant-to-node ratio, creating a cascading performance degradation scenario.

Resource Utilization Patterns

The performance metrics from the outage demonstrate the resource exhaustion pattern (a sketch of the corresponding saturation check follows this list):

  • CPU Utilization: Nodes consistently operated at maximum capacity (approaching 100% utilization) with multiple production instances showing sustained high processor load
  • Memory Constraints: Available memory dropped to critically low levels (below threshold), forcing the system into resource contention scenarios
  • Query Performance Impact: Long-running SQL queries increased significantly, with execution times extending beyond normal operational parameters
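
To illustrate the pattern, the three symptoms above can be expressed as a single saturation check. The Python sketch below is ours, not Microsoft's tooling: the metric names, threshold values, and the NodeMetrics structure are illustrative assumptions, since the actual operational limits for AS4556 were not disclosed.

    from dataclasses import dataclass

    # Hypothetical thresholds, for illustration only.
    CPU_SATURATION_PCT = 95.0    # sustained CPU near 100% indicates exhaustion
    MIN_FREE_MEMORY_MB = 2048    # below this, memory contention sets in
    MAX_QUERY_TIME_MS = 5000     # SQL durations beyond normal parameters

    @dataclass
    class NodeMetrics:
        node_id: str
        cpu_pct: float             # average CPU utilization over the sample window
        free_memory_mb: float      # available memory on the node
        p95_query_time_ms: float   # 95th-percentile SQL query duration

    def is_saturated(m: NodeMetrics) -> bool:
        """Flag a node showing the exhaustion pattern from this RCA: sustained
        high CPU, critically low memory, and degraded query performance."""
        return (m.cpu_pct >= CPU_SATURATION_PCT
                and m.free_memory_mb <= MIN_FREE_MEMORY_MB
                and m.p95_query_time_ms >= MAX_QUERY_TIME_MS)

    sample = NodeMetrics("node-07", cpu_pct=99.2,
                         free_memory_mb=512, p95_query_time_ms=12000)
    print(is_saturated(sample))   # True: matches the outage pattern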

Infrastructure Scaling Limitations

The cluster architecture had reached its horizontal scaling limits within the default host group configuration. When a cluster approaches maximum node capacity, several performance issues emerge simultaneously, as the model after this list illustrates:

  1. Resource Competition: Multiple tenants competing for limited CPU and memory resources
  2. Memory Pressure: Low memory conditions forcing increased CPU usage for memory management operations
  3. Query Bottlenecks: Database operations experiencing delays due to resource contention
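
A back-of-the-envelope model makes the mechanism concrete: once the tenant count grows past the designed ratio, the average CPU and memory share per tenant shrinks on every axis at once, which is why these three issues appear together rather than in isolation. All numbers below are illustrative assumptions; the actual node sizes and tenant counts for AS4556 were not published.

    def per_tenant_share(cores_per_node: int, memory_gb_per_node: int,
                         node_count: int, tenant_count: int) -> tuple[float, float]:
        """Average CPU cores and memory (GB) available per tenant, assuming
        an even spread across nodes. Purely illustrative."""
        cores = cores_per_node * node_count / tenant_count
        memory = memory_gb_per_node * node_count / tenant_count
        return cores, memory

    # Hypothetical cluster: 16 cores / 64 GB per node, 10 nodes.
    print(per_tenant_share(16, 64, node_count=10, tenant_count=200))  # (0.8, 3.2)
    print(per_tenant_share(16, 64, node_count=10, tenant_count=500))  # (0.32, 1.28)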

Performance Impact Analysis

Login and Authentication Issues

The resource exhaustion directly impacted the authentication infrastructure, causing:

  • Extended login response times
  • Authentication service timeouts
  • Session establishment failures

Application Performance Degradation

Users experienced significant performance issues including:

  • Unworkably slow response times across Business Central operations
  • Extended query execution periods
  • System responsiveness falling below acceptable service levels

Resolution Strategy

Immediate Remediation

We implemented a cluster splitting solution to address the immediate capacity constraints (a simplified redistribution sketch follows these steps):

  1. Cluster Segmentation: The overloaded cluster was divided into two separate clusters
  2. Tenant Redistribution: Existing tenants were redistributed across the newly created cluster infrastructure
  3. Resource Rebalancing: The splitting reduced the tenant-to-node ratio, providing adequate CPU and memory resources per tenant
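
As a simplified illustration of steps 1 and 2, tenant redistribution can be viewed as a balanced partition problem: place each tenant on whichever new cluster currently carries the lower load. The greedy split below is our own sketch under that assumption; Microsoft's actual placement algorithm is not public.

    def split_tenants(tenant_load: dict[str, float]) -> tuple[list[str], list[str]]:
        """Greedy balanced split: assign each tenant (heaviest first) to the
        cluster with the lower accumulated load."""
        cluster_a: list[str] = []
        cluster_b: list[str] = []
        load_a = load_b = 0.0
        for tenant, load in sorted(tenant_load.items(),
                                   key=lambda kv: kv[1], reverse=True):
            if load_a <= load_b:
                cluster_a.append(tenant)
                load_a += load
            else:
                cluster_b.append(tenant)
                load_b += load
        return cluster_a, cluster_b

    # Hypothetical relative load per tenant.
    a, b = split_tenants({"t1": 3.0, "t2": 2.0, "t3": 1.5, "t4": 1.0})
    print(a, b)   # ['t1', 't4'] ['t2', 't3'] -> loads 4.0 vs 3.5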

Technical Benefits of Cluster Splitting

The cluster division approach provided several immediate benefits:

  • Reduced Resource Contention: Lower tenant density per cluster node eliminated resource competition
  • Improved Memory Allocation: Each cluster segment could allocate sufficient memory resources without constraint
  • Enhanced Query Performance: Database operations returned to normal execution timeframes with reduced resource pressure

Preventive Measures

Capacity Monitoring Enhancement

We implemented enhanced monitoring to prevent similar incidents, as sketched after this list:

  • Proactive Threshold Monitoring: Real-time tracking of cluster resource utilization metrics
  • Automated Scaling Triggers: Early warning systems for approaching capacity limits
  • Tenant Distribution Optimization: Improved algorithms for balanced tenant placement across available infrastructure
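
A minimal sketch of what proactive threshold monitoring can look like: fire a warning only when utilization stays above an early-warning level for a sustained window, well before the saturation point is reached. The warning level and window size below are hypothetical values of our own choosing, not Microsoft's configuration.

    from collections import deque

    WARN_CPU_PCT = 75.0     # hypothetical early-warning level, far below saturation
    SUSTAINED_SAMPLES = 6   # e.g. six consecutive 5-minute samples

    class EarlyWarning:
        """Alert on sustained high utilization, filtering out short spikes."""
        def __init__(self) -> None:
            self.window: deque[float] = deque(maxlen=SUSTAINED_SAMPLES)

        def observe(self, cpu_pct: float) -> bool:
            self.window.append(cpu_pct)
            return (len(self.window) == SUSTAINED_SAMPLES
                    and all(s >= WARN_CPU_PCT for s in self.window))

    monitor = EarlyWarning()
    for sample in [70, 78, 80, 82, 85, 88, 90]:
        if monitor.observe(sample):
            print(f"capacity warning at {sample}% CPU")   # fires once, at 90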

Infrastructure Scaling Improvements

The outage highlighted the need for more dynamic scaling capabilities, illustrated in the sketch after this list:

  • Elastic Cluster Management: Enhanced ability to provision additional nodes before reaching capacity limits
  • Resource Pool Expansion: Increased default host group capacity to accommodate growth
  • Performance Baseline Monitoring: Continuous tracking of key performance indicators to identify degradation trends
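
The "provision before reaching capacity limits" idea can be sketched as a simple forward projection: extrapolate the recent utilization trend and add a node while headroom still exists. The linear projection and every parameter below are assumptions for illustration; real capacity planning is considerably more involved.

    def should_provision_node(samples: list[float], capacity_pct: float = 90.0,
                              lookahead: int = 12) -> bool:
        """If the recent linear trend would cross capacity_pct within the next
        `lookahead` samples, provision a node now instead of waiting."""
        if len(samples) < 2:
            return False
        slope = (samples[-1] - samples[0]) / (len(samples) - 1)  # change per sample
        return samples[-1] + slope * lookahead >= capacity_pct

    # Hypothetical hourly cluster utilization trending upward.
    print(should_provision_node([62, 64, 67, 69, 72]))   # True: ~102% projected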

Conclusion

The service outage was a direct result of infrastructure scaling limitations, primarily within the AS4556 cluster, where tenant growth exceeded the cluster's designed capacity. The combination of maximum node utilization, high CPU usage, and memory constraints created an environment in which normal Business Central operations could not function effectively. Our cluster splitting resolution successfully addressed the immediate capacity issues and restored service performance to acceptable levels.

Resolved

Microsoft has completed the mitigation by splitting the overloaded cluster. All environments should now be fully operational again.

We apologize for the inconvenience this incident may have caused.

If you are still experiencing issues with your environment, please contact our support desk during business hours.

Recovering

Microsoft has informed us that the affected tenants (version 25.2) were recently consolidated onto a single cluster in the NL region. This has had a negative effect on performance due to memory pressure, especially following a recent hotfix that required application recompilation.

To mitigate the issue, Microsoft is in the process of splitting this cluster into two and has already restarted services and added additional nodes to increase capacity. This should help alleviate the current performance degradation.

We will share further updates on this page as soon as new information becomes available.

Identified

Microsoft has indicated that they have likely identified the root cause of the issue. Their product team is actively working on it. While we don’t have a confirmed timeline yet, we are hopeful that we will see meaningful improvements within the coming hour.

We will continue to closely monitor the situation and share any updates as soon as they become available.

Investigating

We have not received any new information from Microsoft. However, we are currently testing the update to version V26 with several customers. This appears to be a potential workaround to resolve the issue. Please note that V26 is a major update.

Once multiple customers have confirmed that this workaround is effective, we will offer more customers the opportunity to update to V26. Any updates regarding this will be communicated through this status page.

In parallel, we are actively working with Microsoft to resolve the issue within the current version.

Investigating

Microsoft is still investigating the root cause of the performance degradation. It appears to be related to a general hotfix deployed by Microsoft. However, only customers using the NL Localization are affected.

A new update will be posted within one hour, or sooner if more information becomes available.

Investigating

Microsoft is still investigating the root cause of the performance degradation. They suspect it is related to a hotfix deployed by Microsoft and are actively working on a resolution.

A new update will be posted within one hour, or sooner if more information becomes available.

Investigating

Some customers are currently experiencing performance issues in Business Central. Users of the web client may see prolonged loading times and the message "Working on it". Web services are also affected, resulting in delays on handheld scanners and related processes.

The issue has been reported to Microsoft for these environments: 001|003|018|030|040|060|073|089|103|114|121|122|126|133|140|143|148|168|178|180|183|188|189|207|223|228|374

We are currently awaiting their feedback and will provide updates as soon as more information becomes available.

Began at:

Affected components
  • Environments
    • Microsoft Cloud