S1 - US Outage
Incident Report for Serraview
Postmortem

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.

Description:

On December 7, 2023, our customer experience team identified an issue with an optimization for handling large sets of snapshot data. The issue had not appeared during testing of the change. It caused all US clients to experience a brief outage in their Production environments.

Upon investigation, our engineering team identified the root cause: the script being run reached the space limit of the temporary database (TempDB) used for the task.

Type of Event:

Unplanned production outage for US clients.

Services/Modules Impacted:

All modules and services.

Remediation:

We rebooted the US SQL server used for Production, which killed the task. The task was later rerun in smaller batches to ensure the threshold would not be reached again.
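As an illustration only, running a task in smaller batches can be sketched as below. This is not the script used during the incident. The names `batches`, `run_in_batches`, and `process_batch` are hypothetical, and the sketch assumes the work can be split by a list of record IDs, with each batch handled as its own unit of work so that temporary space can be released between batches.

```python
from typing import Callable, Iterator, Sequence


def batches(ids: Sequence[int], batch_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of ids, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def run_in_batches(
    ids: Sequence[int],
    batch_size: int,
    process_batch: Callable[[Sequence[int]], None],
) -> int:
    """Apply process_batch to each slice in turn; return the number of batches run."""
    count = 0
    for chunk in batches(ids, batch_size):
        # Hypothetical hook: in practice this might run one transaction per chunk.
        process_batch(chunk)
        count += 1
    return count
```

A smaller `batch_size` means more round trips but a lower peak in temporary space per step. The right size depends on the workload and is not stated in this report.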

Timeline (CST):

7th December

08:48 – Issue raised

09:01 – Services restored; incident resolved

Total Duration of Event:

~ 13 minutes.

Root Cause Analysis:

A script to reduce the size of snapshot data for certain clients consumed progressively more space until it reached the maximum threshold on the TempDB drive. This caused all client instances on the US Production SQL server to experience a short outage before the server was restarted.

Preventative Action:  

Alerting thresholds for monitoring TempDB space on all SQL servers will be lowered to allow CloudOps to address such issues earlier.

Scripts that have the potential to affect clients in such a manner will be more closely monitored for TempDB usage and split into smaller scopes of change to minimize impact.
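As an illustration only, a lowered alerting threshold on TempDB space can be thought of as a check like the one below. The function name `tempdb_alert_level` and the percentages (70 and 90) are assumptions made for this sketch. They are not the values or tooling used by CloudOps.

```python
def tempdb_alert_level(
    used_mb: float,
    total_mb: float,
    warn_pct: float = 70.0,      # assumed value, for illustration only
    critical_pct: float = 90.0,  # assumed value, for illustration only
) -> str:
    """Classify TempDB usage as 'ok', 'warning', or 'critical'."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    used_pct = 100.0 * used_mb / total_mb
    if used_pct >= critical_pct:
        return "critical"
    if used_pct >= warn_pct:
        return "warning"
    return "ok"
```

Lowering `warn_pct` raises the alert earlier, which leaves more time to pause or split a long-running script before the drive fills.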

Posted Dec 20, 2023 - 14:31 UTC

Resolved
At 8:48 am CST, users were unable to load Serraview. Internal monitoring detected the outage, and teams brought services back online at 9:01 am CST.
Posted Dec 07, 2023 - 15:16 UTC
This incident affected: Core Services (NA- Core Services).