Google Cloud CRITICAL

Increased latency and error rates observed on Google App Engine, Cloud Firestore, and Google Cloud Functions gen 1.

September 18, 2024 · 07:34 PM UTC – 10:30 PM UTC · Duration: 2h 56min

Affected Services

Cloud Firestore, Google App Engine, Google Cloud Functions

Timeline

03:11 PM
Incident Report

Summary

On Wednesday, 18 September 2024, Google App Engine, Cloud Firestore, and Google Cloud Run functions (1st gen) experienced increased latency and error rates for 2 hours and 56 minutes in multiple regions. In some regions, customers experienced a complete service outage lasting between 5 and 67 minutes. The issue began on 18 September 2024 at 12:34 US/Pacific and was completely resolved on 18 September 2024 at 15:30 US/Pacific.

To our customers who were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.

Root Cause

The root cause was newly implemented automation code that created a bad traffic routing policy. This policy incorrectly directed our traffic routing control plane to mark all clusters as unavailable to serve traffic for App Engine, Google Cloud Run functions (1st gen)* and dependent services. Google engineers intervened before the policy was rolled out to all clusters, so the result was a partial outage of the service.

Remediation and Prevention

Google engineers were alerted to the issue via internal production monitoring on 18 September 2024 at 13:01 US/Pacific, shortly after customers began experiencing the impact. Engineering teams identified the automation that caused the impact and terminated it at 13:46. However, customer impact was not mitigated until 15:30, after engineers manually directed traffic back to the affected clusters.
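The timestamps quoted above imply a few intervals that the report does not state outright. The short Python sketch below simply does that arithmetic; the variable names are illustrative and not from the report, and all times are taken to be US/Pacific on 18 September 2024, as stated in the text.

```python
from datetime import datetime


def pacific(hour: int, minute: int) -> datetime:
    """Naive datetime for 18 September 2024; every value is US/Pacific."""
    return datetime(2024, 9, 18, hour, minute)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Timestamps as given in the incident report.
impact_start = pacific(12, 34)        # customers begin to see errors
engineers_alerted = pacific(13, 1)    # internal monitoring fires
automation_stopped = pacific(13, 46)  # faulty automation terminated
impact_mitigated = pacific(15, 30)    # traffic manually restored

time_to_detect = minutes_between(impact_start, engineers_alerted)
time_to_stop_automation = minutes_between(impact_start, automation_stopped)
stop_to_mitigation = minutes_between(automation_stopped, impact_mitigated)
total_impact = minutes_between(impact_start, impact_mitigated)

print(f"time to detect:            {time_to_detect} min")
print(f"time to stop automation:   {time_to_stop_automation} min")
print(f"stop to full mitigation:   {stop_to_mitigation} min")
print(f"total customer impact:     {total_impact // 60} h {total_impact % 60} min")
```

The total matches the 2 hours and 56 minutes given in the summary. Most of that window falls after the automation was terminated, which is consistent with the report's note that mitigation required manually redirecting traffic.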
Google is committed to preventing a repeat of this issue and is completing the following actions:

We have removed the automation that caused the outage as a short-term measure.
We are working to implement a more efficient and well-tested traffic routing maintenance process.
We will implement safeguards in the automation pipeline to prevent recurrence of this issue.

Detailed Description of Impact

On Wednesday, 18 September 2024, from 12:34 US/Pacific to 15:30 US/Pacific, Google App Engine, Google Cloud Run functions (1st gen)* and Cloud Firestore experienced elevated error rates and increased latency. Customers reported high latency and 5xx errors with the message “Request was aborted after waiting too long to attempt to service your request.” Customers also experienced high cold starts during this time.

In 13 regions, customers experienced a complete service outage lasting between 5 and 67 minutes:

asia-east2, asia-northeast2, asia-northeast3, asia-south1, asia-southeast1, asia-southeast2, europe-west2, europe-west3, europe-west6, northamerica-northeast1, southamerica-east1, us-central2, us-west1

In 11 other regions, customers may have observed elevated error rates:

asia-east1, asia-northeast1, australia-southeast1, europe-central2, europe-west1, us-central1, us-east1, us-east4, us-west2, us-west3, us-west4

*Cloud Run and Cloud Run functions (gen2) were not affected.
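The report does not say what form the planned safeguards in the automation pipeline will take. Purely as an illustration of the general idea, and not a description of Google's implementation, the Python sketch below rejects a routing policy that would mark too large a share of clusters as unavailable before it is rolled out. `RoutingPolicy`, `PolicyRejected`, `validate_policy`, the 25% threshold, and the cluster names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RoutingPolicy:
    """Hypothetical model: cluster name -> may this cluster serve traffic?"""
    serving: dict[str, bool]


class PolicyRejected(Exception):
    """Raised when a policy fails the pre-rollout sanity check."""


def validate_policy(policy: RoutingPolicy,
                    max_unavailable_fraction: float = 0.25) -> float:
    """Return the fraction of clusters the policy marks unavailable.

    Raises PolicyRejected if the policy is empty or if that fraction
    exceeds the (assumed) threshold.
    """
    total = len(policy.serving)
    if total == 0:
        raise PolicyRejected("policy lists no clusters")
    unavailable = sum(1 for can_serve in policy.serving.values() if not can_serve)
    fraction = unavailable / total
    if fraction > max_unavailable_fraction:
        raise PolicyRejected(
            f"policy marks {unavailable}/{total} clusters unavailable "
            f"(limit {max_unavailable_fraction:.0%})"
        )
    return fraction


# Hypothetical cluster names, for illustration only.
clusters = [f"cluster-{i}" for i in range(8)]

# Routine maintenance: one cluster drained out of eight.
maintenance = RoutingPolicy({name: name != "cluster-0" for name in clusters})

# The failure mode described in the report: every cluster marked unavailable.
drain_everything = RoutingPolicy({name: False for name in clusters})

maintenance_fraction = validate_policy(maintenance)

try:
    validate_policy(drain_everything)
    drain_everything_rejected = False
except PolicyRejected:
    drain_everything_rejected = True
```

A check of this kind runs before rollout, so a policy like the one in this incident would be stopped at validation rather than by engineers intervening partway through deployment.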
05:36 AM
Mini Incident Report

We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 18 September 2024, 13:01
Incident End: 18 September 2024, 15:30
Duration: 2 hours, 29 minutes
Affected Services and Features: Firestore, App Engine, Google Cloud Functions Gen 1
Regions/Zones: Global

Description

Google App Engine, Google Cloud Functions Gen1, and Firestore experienced elevated error rates and increased latency for a period of 2 hours, 29 minutes.

Based on our preliminary analysis, the root cause of the issue was identified as newly implemented automation code that created a bad traffic routing policy. This policy incorrectly directed our traffic routing control plane to mark all clusters as unavailable to serve traffic for App Engine and dependent services. Google engineers intervened before the policy was rolled out to all clusters, so the result was a partial outage of the service.

Google engineers have identified the automation that was responsible for this change and have terminated it until appropriate safeguards are put in place. The impact was mitigated by manually directing traffic back to the affected clusters. There is no risk of a recurrence of this outage at the moment.

Google will complete a full incident report (IR) in the following days that will provide a full root cause.

Customer Impact

Customers experienced elevated latency and error rates for Google App Engine, Google Cloud Functions Gen1, and Firestore services.
Customers in some regions experienced a complete service outage for Google App Engine, Google Cloud Functions Gen1, and Firestore services.
11:50 PM
The issue with Google App Engine, Google Cloud Functions, and Cloud Firestore has been resolved for all affected users as of Wednesday, 2024-09-18 15:30 US/Pacific. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue.
11:19 PM
Summary: Increased latency and error rates observed on Google App Engine, Cloud Firestore, and Google Cloud Functions gen 1.
Description: Mitigation has been successfully applied by our engineering team. We are currently monitoring our environment to ensure stability. We will provide more information by Wednesday, 2024-09-18 16:00 US/Pacific.
Diagnosis: Affected users may encounter elevated latency or an elevated error rate for the impacted products.
Workaround: None at this time.
10:57 PM
Summary: Increased latency and error rates observed on Google App Engine and Google Cloud Functions gen 1.
Description: Mitigation work is currently underway by our engineering team. Based on the investigation thus far, our engineers have identified that Cloud Run is not currently impacted. We do not have an ETA for mitigation at this point. We will provide more information by Wednesday, 2024-09-18 16:00 US/Pacific.
Diagnosis: Affected users may encounter elevated latency or an elevated error rate for the impacted products.
Workaround: None at this time.