09:36 PM
The configuration change deployment was initiated.
10:08 PM
The configuration change was promoted to the deployment ring carrying the heavy workload.
08:05 AM
Initial customer impact began.
08:26 AM
Automated alerts were received for an intermittent, low volume of errors, prompting us to begin investigating.
10:30 AM
We attempted isolated restarts of the impacted database instances in an effort to mitigate the low-level impact.
01:03 PM
Automated monitoring alerted us to elevated error rates and a full incident response was initiated.
01:22 PM
We identified that calls to the database were intermittently timing out. Traffic volume appeared normal, with no significant surge detected; however, we observed a spike in CPU utilization.
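For illustration only, the sketch below shows the kind of client-side call pattern under which intermittent database timeouts surface as elevated error rates. The driver call, exception type, and backoff parameters are all assumptions, not the actual service code.

```python
import random
import time

class DatabaseTimeout(Exception):
    """Raised when a database call exceeds its deadline."""

def query_database(sql: str) -> list:
    # Simulated driver call (assumption): times out intermittently,
    # mimicking the behavior observed during the incident.
    if random.random() < 0.2:
        raise DatabaseTimeout(f"deadline exceeded for {sql!r}")
    return []

def query_with_retry(sql: str, attempts: int = 3) -> list:
    """Retry intermittent timeouts; re-raise if all attempts fail."""
    for attempt in range(1, attempts + 1):
        try:
            return query_database(sql)
        except DatabaseTimeout:
            if attempt == attempts:
                raise
            # Jittered backoff avoids synchronized retry storms that
            # would add load to an already CPU-saturated database.
            time.sleep(random.uniform(0, 0.1 * 2 ** attempt))

rows = query_with_retry("SELECT 1")
```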
01:54 PM
Mitigation efforts began, including scaling out the impacted environments.
03:05 PM
Scale-out efforts were observed to decrease error rates but did not completely eliminate failures. Further instance restarts provided temporary relief.
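As a rough model of what staged scale-out can and cannot do here, the sketch below adds capacity in increments and re-checks the error rate between steps. The environment model, thresholds, and helpers are illustrative assumptions; the error-rate floor mirrors the observation that scaling reduced failures without eliminating them.

```python
import time

TARGET_ERROR_RATE = 0.01   # assumed acceptable error rate
STEP = 2                   # instances added per step
MAX_INSTANCES = 40         # assumed capacity ceiling

# Toy environment (assumption): error rate falls as capacity is added,
# but a floor remains because the root cause is not load-related.
env = {"instances": 10, "baseline_errors": 0.08, "floor": 0.02}

def get_error_rate() -> float:
    scaled = env["baseline_errors"] * 10 / env["instances"]
    return max(scaled, env["floor"])

def scale_out() -> None:
    while get_error_rate() > TARGET_ERROR_RATE:
        if env["instances"] >= MAX_INSTANCES:
            print("capacity ceiling reached; scale-out alone is not mitigating")
            break
        env["instances"] = min(env["instances"] + STEP, MAX_INSTANCES)
        print(f"scaled to {env['instances']} instances, "
              f"error rate now {get_error_rate():.3f}")
        time.sleep(1)  # stand-in for waiting on fresh telemetry

scale_out()
```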
03:25 PM
Scaling efforts continued. We engaged our database engineering team to help investigate.
04:37 PM
While we had not yet correlated the configuration change with this incident, we initiated a rollback of it as a precaution.
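A minimal sketch of such a precautionary rollback, assuming a ring-by-ring deployment model: revert rings in reverse deployment order, verifying health between steps. The ring names, version label, and helper functions are hypothetical, not the actual tooling.

```python
from typing import List

def apply_config(ring: str, version: str) -> None:
    # Stand-in for the real configuration deployment mechanism.
    print(f"{ring}: config set to {version}")

def ring_healthy(ring: str) -> bool:
    # Stand-in for an error-rate / latency health check.
    return True

def rollback(rings: List[str], last_known_good: str) -> None:
    # Revert in reverse deployment order so the most recently changed
    # (and most impacted) ring is restored first.
    for ring in reversed(rings):
        apply_config(ring, last_known_good)
        if not ring_healthy(ring):
            raise RuntimeError(f"{ring} unhealthy after rollback; halting")

rollback(["ring0", "ring1", "heavy-workload-ring"], last_known_good="v41")
```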
05:20 PM
Scale-out efforts completed.
05:45 PM
Service availability telemetry showed improvement, and some customers began to report recovery.
06:30 PM
Customer impact was confirmed as mitigated after the rollback of the configuration change had completed and error rates had returned to normal levels.