Azure · MAJOR

Post Incident Review (PIR) – Virtual Machines, Managed identities for Azure resources, and dependent services – Service management issues in multiple regions

February 2, 2026, 06:03 PM UTC – February 3, 2026, 06:05 AM UTC · Duration: 12h 2min

Affected Services

Azure Virtual Machines (VMs)
Azure Virtual Machine Scale Sets (VMSS)
Azure Kubernetes Service (AKS)
Azure DevOps (ADO)
Other dependent services

Timeline

06:03 PM
Customer impact began, triggered by the start of the periodic remediation workflow.
06:29 PM
Internal service monitoring detected an increasing number of control plane failures in a subset of regions.
07:46 PM
Engineers correlated the issues across multiple regions.
07:55 PM
Service monitoring detected failure rates exceeding defined failure thresholds.
08:10 PM
We began collaborating to devise a mitigation and investigate the underlying cause.
09:15 PM
We applied the primary proposed mitigation to a test instance and validated that it was successful.
09:18 PM
We identified and disabled the remediation workflow and stopped any ongoing activity, so it would not impact additional storage accounts.
09:50 PM
Began applying the broader mitigation to impacted storage accounts. Customers saw improvements as this work progressed.
12:07 AM
Storage accounts for a few high-volume VM extensions that use managed identity finished re-enablement in East US.
12:08 AM
Unexpectedly, customer impact increased, first in East US and then cascading to West US as retries mounted, because a critical managed identity service degraded under the recovery load.
12:14 AM
Automated alerting identified availability impact to Managed Identity services in East US. Engineers quickly recognized that the service was overloaded and began to scale out.
12:30 AM
All extension-hosting storage accounts had been re-enabled, mitigating this impact in all regions other than East US and West US.
12:50 AM
The initial scale-out of managed identity service infrastructure completed, but the new resources were still unable to handle the traffic volume due to the growing backlog of retried requests.
02:00 AM
A second, larger managed identity service scale-out completed. Once again, the added capacity was unable to handle the volume of backlogged and retried requests (a backoff-with-jitter sketch follows this timeline).
02:15 AM
Reviewed additional data and monitored downstream services to ensure that all mitigations were in place for all impacted storage accounts.
03:55 AM
To recover infrastructure capacity for the managed identity service, we began rolling out a change to remove all traffic so that the infrastructure could be repaired without load.
04:25 AM
After infrastructure nodes recovered, we began gradually ramping traffic to them, allowing backlogged identity operations to begin processing safely (a traffic ramp sketch follows this timeline).
06:05 AM
Backlogged operations completed, and services returned to normal operating levels. We concluded our monitoring and confirmed that all customer impact had been mitigated.
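
The 12:08 AM through 02:00 AM entries describe a retry storm: clients retried failed identity requests faster than newly added capacity could absorb them, so each scale-out was consumed by the backlog. The sketch below is illustrative only and is not part of the Azure tooling involved in this incident; it shows capped exponential backoff with full jitter around a hypothetical fetch_token() call, the standard client-side pattern for keeping synchronized retries from amplifying load on a recovering service.

```python
"""Illustrative only: capped exponential backoff with full jitter.

fetch_token() is a hypothetical stand-in for any call that acquires a
managed identity token; it is not an actual Azure SDK or IMDS API.
"""
import random
import time


def fetch_token() -> str:
    # Hypothetical token request; raises to simulate an overloaded endpoint.
    if random.random() < 0.7:
        raise ConnectionError("identity endpoint overloaded")
    return "dummy-token"


def get_token_with_backoff(max_attempts: int = 6,
                           base_delay: float = 1.0,
                           max_delay: float = 60.0) -> str:
    """Retry fetch_token() with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch_token()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap,
            # so retries from many clients do not arrive in lockstep bursts.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
    raise RuntimeError("unreachable")


if __name__ == "__main__":
    print(get_token_with_backoff())
```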
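
The 03:55 AM and 04:25 AM entries describe draining traffic from the managed identity infrastructure, repairing it without load, and then reintroducing traffic gradually so the backlog could drain safely. The sketch below is a hypothetical illustration of that kind of step-wise traffic ramp; set_traffic_weight() and backlog_healthy() are made-up hooks, not actual Azure operations.

```python
"""Illustrative only: a step-wise traffic ramp with health checks."""
import time

RAMP_STEPS = [0.05, 0.10, 0.25, 0.50, 1.00]  # fraction of traffic to admit
SOAK_SECONDS = 2  # shortened for the sketch; a real ramp soaks minutes per step


def set_traffic_weight(fraction: float) -> None:
    # Hypothetical: update load balancer weights for the recovered nodes.
    print(f"admitting {fraction:.0%} of traffic")


def backlog_healthy() -> bool:
    # Hypothetical: check that queue depth and error rates stay within limits.
    return True


def ramp_traffic() -> None:
    """Admit traffic in increasing steps, backing off if health degrades."""
    for fraction in RAMP_STEPS:
        set_traffic_weight(fraction)
        time.sleep(SOAK_SECONDS)
        if not backlog_healthy():
            # Shed traffic rather than letting the backlog grow again.
            set_traffic_weight(fraction / 2)
            return
    print("ramp complete; full traffic restored")


if __name__ == "__main__":
    ramp_traffic()
```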