11:20 PM
Customer impact began, triggered by the recent service update.
12:19 AM
We detected the issue via service monitoring. This prompted us to begin our investigation, engage other teams to troubleshoot, and start developing a hotfix.
03:18 AM
We determined that a rollback would mitigate the issue more quickly than a hotfix, so we started the rollback for the impacted model.
10:55 AM
We determined that traffic routes were using incomplete data, due to the aforementioned dependency issue.
12:40 PM
We identified resource constraints and began investigating them.
06:00 PM
Rollback actions completed across all affected regions.
07:30 PM
Full capacity information was restored across regions, and traffic distribution normalized.
07:32 PM
Once monitoring confirmed stable recovery, we determined that the service was fully restored and all customer impact had been mitigated.