Azure CRITICAL

Post Incident Review (PIR) – Azure Front Door – Connectivity issues in multiple regions

July 30, 2024 · 11:45 AM UTC – 07:43 PM UTC · Duration: 7h 58min

Affected Services

Azure Front Door

Timeline

11:45 AM
Impact started
11:47 AM
Our Azure Portal team detected initial service degradation and began to investigate.
12:10 PM
Our network monitoring correlated this Portal incident with an underlying network issue at one specific site in Europe, and our networking engineers engaged to support the investigation.
12:55 PM
We confirmed localized congestion, so engineers began executing our standard playbook to alleviate it, including rerouting traffic.
01:13 PM
We published communications stating that we were investigating reports of issues connecting to Microsoft services globally, and that customers may experience timeouts connecting to Azure services.
01:58 PM
By this time, the changes to reroute traffic had mitigated most of the impact; the only remaining impact was isolated connection failures.
04:15 PM
While investigating the isolated connection failures, we identified a device within Europe that was not properly obeying commands from the network control plane and was still attracting traffic after it had been told to stop.
04:58 PM
We ordered the network control plane to reissue its commands, but the problematic device remained inaccessible, as described above.
05:50 PM
We started the safe removal of the device from the network and began scanning the network for other potential issues.
07:32 PM
We completed the safe removal of the device from the network.
07:43 PM
Customer impact mitigated, as we confirmed availability returned to pre-incident levels.
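
The elapsed times implied by this timeline can be checked with a short calculation. The following is a minimal sketch, not part of the original review: the dictionary keys and helper names are hypothetical labels chosen for illustration, and the timestamps are copied from the entries above (all UTC, July 30, 2024).

```python
from datetime import datetime

# Timestamps copied from the timeline above (UTC, July 30, 2024).
# The key names are hypothetical labels, not terms used by the review.
TIMELINE = {
    "impact_started": "11:45 AM",
    "detected": "11:47 AM",
    "most_impact_mitigated": "01:58 PM",
    "fully_mitigated": "07:43 PM",
}


def parse_utc(clock: str) -> datetime:
    """Parse a 12-hour clock string from the timeline into a datetime."""
    return datetime.strptime(f"2024-07-30 {clock}", "%Y-%m-%d %I:%M %p")


def elapsed(start_key: str, end_key: str) -> str:
    """Return the time between two timeline entries as 'Hh MMmin'."""
    delta = parse_utc(TIMELINE[end_key]) - parse_utc(TIMELINE[start_key])
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}min"


if __name__ == "__main__":
    print("Time to detect:          ", elapsed("impact_started", "detected"))
    print("Time to partial recovery:", elapsed("impact_started", "most_impact_mitigated"))
    print("Total duration:          ", elapsed("impact_started", "fully_mitigated"))
```

Under these assumptions, the total from 11:45 AM to 07:43 PM matches the stated duration of 7h 58min, and most of the impact was mitigated 2h 13min after it began.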