12:00 AM
Initial customer impact began as cleanup operations started across regions, and gradually increased as more regions became unhealthy.
12:05 AM
Service monitoring detected failure rates exceeding defined thresholds, alerting our engineers, who began an investigation.
12:30 AM
By correlating alerts, service metrics, and telemetry dashboards, we determined that a multi-region issue was unfolding. Each delete took 30-40 minutes to complete, so each region (and even each endpoint within a region) went offline at a different time.
12:45 AM
We identified that backend endpoints were being deleted, but did not yet know how or why. Our on-call engineers were investigating two hypotheses.
12:52 AM
We classified the event as critical, engaging incident response personnel to further investigate, coordinate resources, and drive customer workstreams.
01:00 AM
We identified the automated cleanup operation that was deleting resources. With the cause known, we started mitigation efforts, beginning with stopping the automation that was executing the cleanup.
01:40 AM
The cleanup job stopped on its own before deleting everything it had targeted. However, we continued working to ensure that further deletion was prevented.
02:00 AM
We added resource locks and blocked delete requests at both the Azure Resource Manager (ARM) and Automation levels, to prevent further deletion.
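To illustrate the ARM-level safeguard, here is a minimal sketch of applying a CanNotDelete management lock with the Azure SDK for Python (azure-mgmt-resource). The subscription ID, resource group, and lock name are hypothetical, and this is not necessarily the tooling used during the incident:

```python
# Illustrative only: a CanNotDelete management lock makes ARM reject delete
# requests for everything in the resource group until the lock is removed.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource.locks import ManagementLockClient

credential = DefaultAzureCredential()
lock_client = ManagementLockClient(credential, "<subscription-id>")

# Hypothetical resource group and lock name; reads and updates still succeed,
# only delete operations are blocked.
lock_client.management_locks.create_or_update_at_resource_group_level(
    resource_group_name="aoai-backend-rg",
    lock_name="block-deletes-during-incident",
    parameters={"level": "CanNotDelete"},
)
```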
02:00 AM
Initial recovery efforts began to recreate the deleted resources. A note on recovery time: Azure OpenAI (AOAI) models are very large and can only be transferred via secure channels. Within a model pool (there is a separate pool per region, per model), each deployment is serial, so model copy time is a significant factor; the pools themselves are recovered in parallel.
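To illustrate why recovery times varied so widely, here is a minimal sketch of the serial-within-pool, parallel-across-pools pattern described above. The pool names and helper functions are hypothetical, not the actual AOAI recovery tooling:

```python
# Illustrative sketch: pools recover in parallel, while deployments within
# each pool proceed serially, so the per-deployment model copy time
# dominates a pool's total recovery time.
import time
from concurrent.futures import ThreadPoolExecutor

def copy_model_over_secure_channel(pool: str, endpoint: str) -> None:
    # Placeholder for the secure model transfer; in reality this step is
    # slow because the model weights are very large.
    time.sleep(0.1)
    print(f"[{pool}] copied model to {endpoint}")

def restore_pool(pool: str, endpoints: list[str]) -> None:
    # Serial within a pool: recovery time ~ copy time x endpoint count.
    for endpoint in endpoints:
        copy_model_over_secure_channel(pool, endpoint)

# One pool per region, per model; names are hypothetical.
pools = {
    "gpt-4/eastus2": ["ep-1", "ep-2", "ep-3"],
    "gpt-4o/swedencentral": ["ep-1", "ep-2"],
}

# Parallel across pools.
with ThreadPoolExecutor() as executor:
    futures = [executor.submit(restore_pool, name, eps)
               for name, eps in pools.items()]
    for f in futures:
        f.result()  # surface any per-pool failure
```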
02:22 AM
First communication posted to the Azure Status page. Communications were delayed by initial difficulties in scoping affected customers and by impact-analysis gaps in our communications tooling.
03:35 AM
Customer impact scope determined, and first targeted communications sent to customers via Service Health in the Azure portal.
04:15 AM
GPT-3.5-Turbo started recovering in North Central US.
07:10 AM
GPT-4o recovered in East US 2.
07:10 AM
The majority of regions and models began recovering.
08:35 AM
GPT-4 recovered and began serving traffic in the majority of regions.
10:20 AM
The majority of models and regions recovered, and error rates dropped to normal levels. GPT-4 recovered in all regions except Canada Central, Sweden Central, North Central US, UK South, Central US, and Australia East.
03:40 PM
GPT-4 recovered in North Central US.
05:35 PM
GPT-4o recovered in all regions except Sweden Central.
07:20 PM
DALL-E restored in all regions.
07:30 PM
GPT-4 recovered in UK South.
08:20 PM
GPT-4o recovered in Sweden Central.
11:35 PM
All base models recovered, and service restoration completed across all affected regions.
02:00 PM (following day)
Fine-tuning model recovery progressed across various regions.
04:54 PM
All fine-tuning model deployments restored across affected regions.