Google Cloud CRITICAL
Global: Cloud Monitoring elevated errors requesting underlying monitoring data
October 19, 2021 · 07:00 PM UTC – 08:45 PM UTC · Duration: 1h 45min
Affected Services
Operations, Cloud Monitoring, Google Cloud Dataflow, Google Cloud Pub/Sub, Cloud NAT, Google Cloud Bigtable
Timeline
10:26 PM
INCIDENT REPORT
Summary
On 19 October 2021 at 11:00 US/Pacific, Cloud Monitoring began returning errors for queries against all monitoring data in the us-central1 region. The errors lasted approximately 1 hour and 45 minutes. We apologize for the inconvenience and are taking steps to prevent a recurrence.
Root Cause
Cloud Monitoring is a global service, but it is subdivided into internal locales, each of which collects the monitoring data generated locally. When users query Cloud Monitoring, each query fans out through a series of nodes (called mixers) within the corresponding locales. The mixers reach out to source nodes to gather the appropriate data, temporarily retaining it in a limited amount of memory.
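The fan-out described above can be illustrated with a minimal sketch. It is not Cloud Monitoring's implementation: the `SOURCES` table, the `fetch` stand-in for an RPC, and the `max_points` budget are all hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "source nodes": each maps a metric name to its points.
SOURCES = {
    "source-a": {"cpu": [0.2, 0.4]},
    "source-b": {"cpu": [0.9]},
    "source-c": {"mem": [0.5]},
}


def fetch(source: str, metric: str) -> list[float]:
    """Stand-in for an RPC from a mixer to one source node."""
    return SOURCES[source].get(metric, [])


def mixer_query(metric: str, max_points: int) -> list[float]:
    """Fan out to every source and merge the results.

    The merged result is held in memory, so the sketch enforces a
    budget of max_points (an assumed, simplified memory limit).
    """
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the order of the input sources.
        parts = list(pool.map(lambda s: fetch(s, metric), sorted(SOURCES)))
    merged: list[float] = []
    for part in parts:
        if len(merged) + len(part) > max_points:
            raise MemoryError("query exceeds the mixer's memory budget")
        merged.extend(part)
    return merged
```

With a smaller budget, the same query fails instead of completing, which is the simplified analogue of a mixer whose memory allocation has been reduced.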
During a recent infrastructure change in the U.S. locale, the amount of memory allocated to mixers in the us-central1 region was inadvertently reduced. This caused mixer tasks to run low on memory. The number of tasks in a low memory state grew over a period of several days as the change was gradually rolled out to production, following Google's standard progressive rollout policies.
The mixer task has safeguards which are designed to detect and reduce the impact of low memory conditions by pausing queries that use significant memory. However, in this case, an existing misconfiguration of this safeguard prevented it from activating correctly. Eventually, tasks which were low on memory failed; enough tasks failed in total to cause widespread failures and service impact.
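A safeguard of the kind described above might look like the following sketch. All names and thresholds here (`MemoryGuard`, `low_memory_fraction`, `heavy_query_bytes`) are assumptions for illustration, and the sketch assumes a paused query releases its buffers; the report does not describe the real mechanism in this detail.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    name: str
    memory_bytes: int
    paused: bool = False


@dataclass
class MemoryGuard:
    limit_bytes: int            # memory allocated to the task (hypothetical)
    low_memory_fraction: float  # guard activates above this usage fraction
    heavy_query_bytes: int      # queries at or above this size may be paused
    queries: list = field(default_factory=list)

    def used_bytes(self) -> int:
        # Assumption: a paused query no longer holds memory.
        return sum(q.memory_bytes for q in self.queries if not q.paused)

    def check(self) -> list:
        """Pause the largest heavy queries until usage is at or below the threshold."""
        threshold = self.limit_bytes * self.low_memory_fraction
        paused = []
        for q in sorted(self.queries, key=lambda q: q.memory_bytes, reverse=True):
            if self.used_bytes() <= threshold:
                break
            if not q.paused and q.memory_bytes >= self.heavy_query_bytes:
                q.paused = True
                paused.append(q.name)
        return paused
```

For example, with a limit of 100 bytes, a threshold of 0.8, and queries using 50, 40, and 5 bytes, `check()` pauses only the 50-byte query, which brings usage from 95 down to 45.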
Remediation and Prevention
Google engineers were alerted to the problem on 19 October 2021 at 11:11 US/Pacific and immediately started an investigation. The root cause, the reduction in memory allocation for mixer nodes, was identified at 11:32. Engineers then identified a mitigation and began rolling it out at 11:50. Restoring the proper memory capacity for mixer nodes fully mitigated the issue at 12:54.
Google is committed to quickly and continually improving our technology and operations to prevent service disruptions.
We are taking the following immediate steps to prevent this or similar issues from happening again:
Fix the misconfiguration so that mixers that are low on memory correctly detect that condition.
Introduce load-shedding, so that mixers that run out of memory reject new queries until memory usage subsides, rather than failing.
Optimize the mixers to reduce the likelihood of out-of-memory scenarios.
Modify Cloud Monitoring's rollout automation so that it automatically detects problems of this type, allowing engineers to be alerted sooner.
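The load-shedding step listed above can be sketched as an admission check: a mixer rejects a new query when admitting it would push memory usage past a threshold, instead of accepting it and failing later. The class name, the `shed_fraction` parameter, and the per-query memory estimate are hypothetical, not details taken from the report.

```python
class Overloaded(Exception):
    """Raised when the mixer sheds a query instead of running out of memory."""


class LoadSheddingMixer:
    def __init__(self, limit_bytes: int, shed_fraction: float = 0.9):
        self.limit_bytes = limit_bytes      # memory allocated to the task
        self.shed_fraction = shed_fraction  # assumed shedding threshold
        self.used_bytes = 0

    def admit(self, estimated_bytes: int) -> None:
        """Admit a query only if it keeps usage at or under the threshold."""
        if self.used_bytes + estimated_bytes > self.limit_bytes * self.shed_fraction:
            raise Overloaded("rejecting query until memory usage subsides")
        self.used_bytes += estimated_bytes

    def release(self, estimated_bytes: int) -> None:
        """Return a finished query's memory to the pool."""
        self.used_bytes = max(0, self.used_bytes - estimated_bytes)
```

A rejected query can be retried by the caller once earlier queries have called `release`, so the task stays alive and only individual requests fail.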
11:48 PM
Mini Incident Report while full Incident Report is prepared
We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Support by opening a case using https://cloud.google.com/support
(All Times US/Pacific)
Incident Start: 19 October 2021 11:00
Incident End: 19 October 2021 12:45
Duration: 1 hour, 45 minutes
Affected Services and Features:
Google Cloud Monitoring, Google Cloud Dataflow, Google Cloud Pub/Sub, Google Cloud NAT, Google Cloud Router, Google Cloud Interconnect, Google Bigtable, Google Cloud Databases
Regions/Zones: us-central1 and us-central2
Description:
Google Cloud Monitoring experienced errors querying monitoring data for approximately 1 hour and 45 minutes. Preliminary analysis indicates that the root cause was resource contention following a recent rollout that included a misconfiguration. Engineers corrected the configuration and restarted the affected instances to resolve the issue.
Customer Impact:
- Customers may have experienced errors or incomplete monitoring data.
- Precomputed data between 11:00 PT and 12:45 PT is expected to be missing, but the underlying data can still be viewed via raw query.
- Customers may also have received false alerts during the impact window, based on the underlying monitoring data.
09:27 PM
The issue with Cloud Monitoring has been resolved for all affected users as of Tuesday, 2021-10-19 12:45 US/Pacific.
We will publish an analysis of this incident once we have completed our internal investigation.
We thank you for your patience while we worked on resolving the issue.
09:07 PM
Summary: Global: Cloud Monitoring elevated errors requesting underlying monitoring data
Description: All customer impact is mitigated as of Tuesday, 2021-10-19 12:45 US/Pacific. Precomputed data between 11:00 and 12:45 is expected to be missing, but the underlying data can still be viewed via a raw query. Users might see lingering impact when running queries with large windows. Users may have received false alerts during that window based on the underlying monitoring data.
We will continue to monitor the situation. We do not have an ETA for full resolution at this point.
We will provide an update by Tuesday, 2021-10-19 14:00 US/Pacific with current details.
Diagnosis: Affected customers may see errors when trying to query their monitoring data.
Workaround: None at this time.
08:51 PM
Summary: Global: Cloud Monitoring elevated errors requesting underlying monitoring data
Description: We believe the issue with Cloud Monitoring is partially resolved and the mitigation is continuing to reduce the error rate. We will continue to monitor the situation.
We do not have an ETA for full resolution at this point.
We will provide an update by Tuesday, 2021-10-19 14:00 US/Pacific with current details.
Diagnosis: Affected customers may see errors when trying to query their monitoring data.
Workaround: None at this time.
08:11 PM
Summary: Global: Cloud Monitoring elevated errors requesting underlying monitoring data
Description: Mitigation work is currently underway by our engineering team.
We do not have an ETA for mitigation at this point.
We will provide more information by Tuesday, 2021-10-19 13:05 US/Pacific.
Diagnosis: Affected customers may see errors when trying to query their monitoring data.
Workaround: None at this time.