Google Cloud CRITICAL
Increased latency and error rates observed on Google App Engine, Cloud Firestore, and Google Cloud Functions gen 1.
September 18, 2024 · 07:34 PM UTC – 10:30 PM UTC · Duration: 2h 56min
Affected Services
Cloud Firestore
Google App Engine
Google Cloud Functions
Timeline
03:11 PM
Incident Report
Summary
On Wednesday, 18 September 2024, Google App Engine, Cloud Firestore, and Google Cloud Run functions (1st gen) experienced increased latency and error rates for 2 hours and 56 minutes in multiple regions. In some regions, customers experienced a complete service outage lasting between 5 and 67 minutes. The issue began on 18 September 2024 at 12:34 US/Pacific and was completely resolved on 18 September 2024 at 15:30 US/Pacific.
To our customers who were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.
Root Cause
The root cause was newly implemented automation code that created a bad traffic routing policy. This policy incorrectly directed our traffic routing control plane to mark all clusters as unavailable to serve traffic for App Engine, Google Cloud Run functions (1st gen)*, and dependent services. Google engineers intervened before the policy was rolled out to all clusters, which limited the impact to a partial outage of the service.
Remediation and Prevention
Google engineers were alerted to the issue via internal production monitoring on 18 September 2024 at 13:01 US/Pacific, shortly after customers began experiencing the impact. Engineering teams identified the automation that caused the impact and terminated it at 13:46. However, customer impact was only mitigated at 15:30, after engineers manually directed traffic back to the affected clusters.
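The intervals between the timestamps above can be worked out with a short calculation. The following is only an illustrative sketch: the variable names are ours, and the timestamps are the US/Pacific times quoted in this report.

```python
from datetime import datetime

def at(hour, minute):
    # All times are US/Pacific on 18 September 2024, as quoted in the report.
    return datetime(2024, 9, 18, hour, minute)

impact_start = at(12, 34)           # customers begin seeing impact
alerted = at(13, 1)                 # internal monitoring alerts engineers
automation_terminated = at(13, 46)  # offending automation is stopped
mitigated = at(15, 30)              # traffic manually redirected, impact ends

time_to_detect = alerted - impact_start
time_to_terminate = automation_terminated - alerted
time_to_mitigate = mitigated - automation_terminated
total_impact = mitigated - impact_start

for label, delta in [
    ("impact start -> alert", time_to_detect),
    ("alert -> automation terminated", time_to_terminate),
    ("automation terminated -> mitigated", time_to_mitigate),
    ("total customer impact", total_impact),
]:
    print(f"{label}: {delta}")
```

The three intervals (27 minutes, 45 minutes, and 1 hour 44 minutes) add up to the 2 hours and 56 minutes of total impact stated in the Summary.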
Google is committed to preventing a repeat of the issue in the future and is completing the following actions:
We have removed the automation which caused the outage as a short-term measure.
We are working to implement a more efficient and well-tested traffic routing maintenance process.
We will implement safeguards in the automation pipeline to prevent recurrence of this issue.
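The report does not describe what these safeguards will look like. As one possible illustration only, a pre-rollout check could refuse any routing policy that marks too many clusters unavailable at once. Everything in this sketch (`RoutingPolicy`, `validate_policy`, `UnsafePolicyError`, and the 25% threshold) is a hypothetical stand-in and does not reflect Google's actual control plane or tooling.

```python
from dataclasses import dataclass

@dataclass
class RoutingPolicy:
    # Hypothetical model: cluster name -> whether it may serve traffic.
    serving: dict

class UnsafePolicyError(Exception):
    """Raised when a policy would remove too much serving capacity."""

def validate_policy(policy, max_unavailable_fraction=0.25):
    # Reject policies that would mark more than the allowed share of
    # clusters as unavailable. The threshold is an assumed example value.
    total = len(policy.serving)
    if total == 0:
        raise UnsafePolicyError("policy lists no clusters")
    unavailable = sum(1 for can_serve in policy.serving.values() if not can_serve)
    fraction = unavailable / total
    if fraction > max_unavailable_fraction:
        raise UnsafePolicyError(
            f"{unavailable}/{total} clusters marked unavailable "
            f"(limit {max_unavailable_fraction:.0%})"
        )

# A policy resembling the one in this incident: every cluster unavailable.
bad_policy = RoutingPolicy({f"cluster-{i}": False for i in range(8)})
try:
    validate_policy(bad_policy)
except UnsafePolicyError as err:
    print("rejected:", err)
```

A check of this kind would sit in front of the rollout step, so that a policy generated by faulty automation is stopped before it reaches any cluster rather than partway through.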
Detailed Description of Impact
On Wednesday, 18 September 2024, from 12:34 US/Pacific to 15:30 US/Pacific, Google App Engine, Google Cloud Run functions (1st gen)*, and Cloud Firestore experienced elevated error rates and increased latency. Customers reported high latency and 5xx errors with the message “Request was aborted after waiting too long to attempt to service your request.” Customers also experienced an increased number of cold starts during this time.
In 13 regions, customers experienced a complete service outage for a period between 5 minutes and 67 minutes.
asia-east2
asia-northeast2
asia-northeast3
asia-south1
asia-southeast1
asia-southeast2
europe-west2
europe-west3
europe-west6
northamerica-northeast1
southamerica-east1
us-central2
us-west1
In 11 other regions, customers may have observed elevated error rates:
asia-east1
asia-northeast1
australia-southeast1
europe-central2
europe-west1
us-central1
us-east1
us-east4
us-west2
us-west3
us-west4
*Cloud Run and Cloud Run functions (gen2) were not affected.
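For customers who want to check their own deployments against the two lists above, the regions can be encoded as a simple lookup. This is a convenience sketch only; the function name `impact_for` and the returned labels are ours, while the region names are copied from this report.

```python
# Regions with a complete service outage (between 5 and 67 minutes).
FULL_OUTAGE_REGIONS = frozenset({
    "asia-east2", "asia-northeast2", "asia-northeast3", "asia-south1",
    "asia-southeast1", "asia-southeast2", "europe-west2", "europe-west3",
    "europe-west6", "northamerica-northeast1", "southamerica-east1",
    "us-central2", "us-west1",
})

# Regions where customers may have observed elevated error rates.
ELEVATED_ERROR_REGIONS = frozenset({
    "asia-east1", "asia-northeast1", "australia-southeast1",
    "europe-central2", "europe-west1", "us-central1", "us-east1",
    "us-east4", "us-west2", "us-west3", "us-west4",
})

def impact_for(region):
    # Classify a region name according to the two lists in the report.
    if region in FULL_OUTAGE_REGIONS:
        return "complete outage"
    if region in ELEVATED_ERROR_REGIONS:
        return "elevated error rates"
    return "not listed in the report"

for region in ("europe-west2", "us-central1", "europe-north1"):
    print(region, "->", impact_for(region))
```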
05:36 AM
Mini Incident Report
We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support
(All Times US/Pacific)
Incident Start 18 September, 2024, 13:01
Incident End 18 September, 2024, 15:30
Duration 2 hours, 29 minutes
Affected Services and Features
Firestore
App Engine
Google Cloud Functions Gen 1
Regions/Zones
Global
Description
Google App Engine, Google Cloud Functions Gen1, and Firestore experienced elevated error rates and increased latency for a period of 2 hours, 29 minutes. Based on our preliminary analysis, the root cause of the issue was newly implemented automation code that created a bad traffic routing policy. This policy incorrectly directed our traffic routing control plane to mark all clusters as unavailable to serve traffic for App Engine and dependent services. Google engineers intervened before the policy was rolled out to all clusters, which limited the impact to a partial outage of the service.
Google engineers have identified the automation that was responsible for this change and have terminated it until appropriate safeguards are put in place. The impact was mitigated by manually directing the traffic back to the affected clusters. There is no risk of a recurrence of this outage at the moment.
Google will complete a full Incident Report in the following days that will provide a detailed root cause.
Customer Impact
Customers experienced elevated latency and error rates for Google App Engine, Google Cloud Functions Gen1 and Firestore services.
Customers in some regions experienced a complete service outage for Google App Engine, Google Cloud Functions Gen1 and Firestore services.
11:50 PM
The issue with Google App Engine, Google Cloud Functions, and Cloud Firestore has been resolved for all affected users as of Wednesday, 2024-09-18 15:30 US/Pacific.
We will publish an analysis of this incident once we have completed our internal investigation.
We thank you for your patience while we worked on resolving the issue.
11:19 PM
Summary: Increased latency and error rates observed on Google App Engine, Cloud Firestore, and Google Cloud Functions gen 1.
Description: Mitigation has been successfully applied by our engineering team. We are currently monitoring our environment to ensure stability.
We will provide more information by Wednesday, 2024-09-18 16:00 US/Pacific.
Diagnosis: Affected users may encounter elevated latency or an elevated error rate for the impacted products.
Workaround: None at this time.
10:57 PM
Summary: Increased latency and error rates observed on Google App Engine and Google Cloud Functions gen 1.
Description: Mitigation work is currently underway by our engineering team. Based on the investigation thus far, our engineers have identified that Cloud Run is not currently impacted.
We do not have an ETA for mitigation at this point.
We will provide more information by Wednesday, 2024-09-18 16:00 US/Pacific.
Diagnosis: Affected users may encounter elevated latency or an elevated error rate for the impacted products.
Workaround: None at this time.