Azure MINOR

Post Incident Review (PIR) – Network Connectivity – Toronto edge site issues

March 3, 2025 · 04:22 PM UTC – 06:37 PM UTC · Duration: 2h 15min

Affected Services

Network Connectivity

Timeline

04:22 PM
Customer impact began when the redundant device started experiencing a hardware failure while the paired primary device had already been removed from rotation.
04:32 PM
Our automation attempted to bring the paired primary device back into rotation. As traffic was load-balanced following route convergence, the impact was marginally reduced.
04:43 PM
The first automated alert of potential issues was raised; however, the scope of impact was not evident from the alert or from initial investigations.
05:19 PM
Network engineers investigating the issue took steps to remove congestion from the network peer, which reduced impact.
05:27 PM
Additional customer reports were received, following delays in submitting support requests via the Azure Portal.
05:46 PM
Additional engineering teams were engaged to investigate and mitigate.
05:59 PM
We began steps to manually remove the redundant device from rotation.
06:37 PM
We isolated the faulty line card from the device. Network traffic was shifted to healthy routes, connectivity was restored, and we monitored service health to confirm that customer impact had been mitigated.
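
The intervals implied by the timeline above can be checked with a short calculation. The sketch below is illustrative only: the event labels (such as `impact_start` and `first_alert`) and the helper `minutes_between` are hypothetical names chosen for this example, and the timestamps are copied from the timeline, assuming all of them fall on March 3, 2025 (UTC).

```python
from datetime import datetime

# Timestamps copied from the timeline; the keys are hypothetical labels.
EVENTS = {
    "impact_start": "04:22 PM",
    "first_alert": "04:43 PM",
    "manual_removal_started": "05:59 PM",
    "mitigated": "06:37 PM",
}


def parse_utc(clock_time: str) -> datetime:
    """Parse a 12-hour clock time, assuming the date March 3, 2025."""
    return datetime.strptime(f"2025-03-03 {clock_time}", "%Y-%m-%d %I:%M %p")


def minutes_between(start_event: str, end_event: str) -> int:
    """Return whole minutes elapsed between two labelled timeline events."""
    delta = parse_utc(EVENTS[end_event]) - parse_utc(EVENTS[start_event])
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("Impact start to first alert:", minutes_between("impact_start", "first_alert"), "min")
    print("Manual removal to mitigation:", minutes_between("manual_removal_started", "mitigated"), "min")
    print("Total impact duration:", minutes_between("impact_start", "mitigated"), "min")
```

Under these assumptions, the total comes to 135 minutes, matching the stated duration of 2h 15min, and the first automated alert arrives 21 minutes after customer impact began.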