Resolved
The system is fully recovered and node addition is operational. Resolution confirmed.
Monitoring
Update: Azure has released a fix and is rolling it out across all impacted regions. On the CAST AI side, we are observing node operations completing successfully, with nodes being added and provisioned as expected.
We will provide another update once Azure confirms full resolution.
Reference: https://azure.status.microsoft/en-us/status
Identified
Update: Azure has reported issues with virtual machine service management across multiple regions following a recent configuration change. We’re actively monitoring the situation and will continue to provide updates as more information becomes available.
Reference: https://azure.status.microsoft/en-us/status
Investigating
Update: Our investigation indicates this may be related to an Azure service outage impacting node creation and addition. We are awaiting further updates from Azure. As a precaution, we recommend disabling node deletion policies and scheduled rebalancing for critical clusters to avoid triggering downscaling. We’ll continue to share updates as more information becomes available.
Reference: https://downdetector.com/status/windows-azure/
Investigating
We’re currently observing an increased rate of node addition failures and long-running node addition operations affecting Azure AKS clusters. Our engineering team is actively investigating and will provide updates as more information becomes available.