12:00 AM
Initial customer impact began as cleanup operations started across regions, and gradually increased as more regions became unhealthy.
12:05 AM
Service monitoring detected failure rates exceeding defined thresholds, alerting our engineers, who began an investigation.
12:30 AM
By correlating alerts, service metrics, and telemetry dashboards, we determined that a multi-region issue was unfolding. Each delete took 30-40 minutes to complete, so each region (and even each endpoint within a region) went offline at a different time.
12:45 AM
We identified that backend endpoints were being deleted, but did not yet know how or why. Our on-call engineers were investigating two hypotheses.
12:52 AM
We classified the event as critical, engaging incident response personnel to further investigate, coordinate resources, and drive customer workstreams.
01:00 AM
We identified the automated cleanup operation that was deleting resources. With the cause known, we started mitigation efforts, beginning with stopping the automation that was executing the cleanup.
01:40 AM
The cleanup job stopped on its own before deleting everything it had targeted. However, we continued working to ensure that further deletion was prevented.
02:00 AM
We added resource locks and blocked delete requests at both the Azure Resource Manager (ARM) and Automation levels, to prevent further deletion.
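To illustrate the ARM-level safeguard, here is a minimal sketch of applying a CanNotDelete management lock with the Azure SDK for Python (azure-mgmt-resource). The subscription ID, resource group, and lock name are hypothetical, and this is not necessarily the tooling used during the incident:

```python
# Illustrative only: a CanNotDelete management lock makes ARM reject delete
# requests for everything in the resource group until the lock is removed.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource.locks import ManagementLockClient

credential = DefaultAzureCredential()
lock_client = ManagementLockClient(credential, "<subscription-id>")

# Hypothetical resource group and lock name; reads and updates still succeed,
# only delete operations are blocked.
lock_client.management_locks.create_or_update_at_resource_group_level(
    resource_group_name="aoai-backend-rg",
    lock_name="block-deletes-during-incident",
    parameters={"level": "CanNotDelete"},
)
```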
02:00 AM
Initial recovery efforts began to recreate the deleted resources. A note on recovery time: Azure OpenAI (AOAI) models are very large and can only be transferred via secure channels. Within a model pool (there is a separate pool per region, per model), each deployment is serial, so model copy time is a significant factor; the pools themselves are recovered in parallel.
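To illustrate why recovery times varied so widely, here is a minimal sketch of the serial-within-pool, parallel-across-pools pattern described above. The pool names and helper functions are hypothetical, not the actual AOAI recovery tooling:

```python
# Illustrative sketch: pools recover in parallel, while deployments within
# each pool proceed serially, so the per-deployment model copy time
# dominates a pool's total recovery time.
import time
from concurrent.futures import ThreadPoolExecutor

def copy_model_over_secure_channel(pool: str, endpoint: str) -> None:
    # Placeholder for the secure model transfer; in reality this step is
    # slow because the model weights are very large.
    time.sleep(0.1)
    print(f"[{pool}] copied model to {endpoint}")

def restore_pool(pool: str, endpoints: list[str]) -> None:
    # Serial within a pool: recovery time ~ copy time x endpoint count.
    for endpoint in endpoints:
        copy_model_over_secure_channel(pool, endpoint)

# One pool per region, per model; names are hypothetical.
pools = {
    "gpt-4/eastus2": ["ep-1", "ep-2", "ep-3"],
    "gpt-4o/swedencentral": ["ep-1", "ep-2"],
}

# Parallel across pools.
with ThreadPoolExecutor() as executor:
    futures = [executor.submit(restore_pool, name, eps)
               for name, eps in pools.items()]
    for f in futures:
        f.result()  # surface any per-pool failure
```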
02:22 AM
First communication posted to the Azure Status page. Communications were delayed by initial difficulties in scoping affected customers and by impact-analysis gaps in our communications tooling.
03:35 AM
Customer impact scope determined, and first targeted communications sent to customers via Service Health in the Azure portal.
04:15 AM
GPT-3.5-Turbo started recovering in North Central US.
07:10 AM
GPT-4o recovered in East US 2.
07:10 AM
The majority of regions and models began recovering.
08:35 AM
GPT-4 recovered and began serving traffic in the majority of regions.
10:20 AM
The majority of models and regions recovered, and error rates dropped to normal levels. GPT-4 recovered in all regions except Canada Central, Sweden Central, North Central US, UK South, Central US, and Australia East.
03:40 PM
GPT-4 recovered in North Central US.
05:35 PM
GPT-4o recovered in all regions except Sweden Central.
07:20 PM
DALL-E restored in all regions.
07:30 PM
GPT-4 recovered in UK South.
08:20 PM
GPT-4o recovered in Sweden Central.
11:35 PM
All base models recovered, and service restoration completed across all affected regions.
02:00 PM (following day)
Fine-tuning model recovery progressed across various regions.
04:54 PM
All fine-tuning model deployments restored across affected regions.