Serraview Detailed Root Cause Analysis – Severity 2 – February 1, 2024
UAT Instances Inaccessible in US Regions
We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
On February 1, 2024, internal and external customers began to report that UAT instances were inaccessible. When attempting to sign in or navigate UAT, users were presented with an error message stating “Sorry, something has gone wrong. Please try again later”.
Type of Event:
Due to the adverse effects experienced by a growing subset of customers, an incident was initiated, and our internal teams promptly recognized and addressed the issue. It is important to note that incidents arising from UAT environments do not result in any breach of SLA.
Services\Modules Impacted:
UAT
Timeline:
On February 1, 2024, at approximately 8:49am EST, an initial report came into support stating that a UAT instance was inaccessible. When attempting to sign in or navigate UAT, users were presented with an error message stating “Sorry, something has gone wrong. Please try again later”. Support began to investigate and escalated to engineering. By 11:00am EST, additional reports of UAT being inaccessible had come into support. At 11:11am, all customers were notified via the status page of the disruption affecting US customers. At 11:55am, internal teams identified the issue and began working on a resolution. A fix was implemented, and multiple customers confirmed that the resolution was successful and that users could access their UAT instances. At approximately 4:00pm, all customers were notified that the issue was resolved.
Root Cause Analysis:
SQL Server CPU and memory utilization reached around 90%, resulting in performance slowdowns. CPU usage typically spikes to between 95% and 97% from 12:30 UTC to 14:00 UTC due to backup processes, which contributed to the slowness.
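To relate the backup window above (given in UTC) to the timeline (given in EST), the conversion can be sketched in Python. This is an illustrative check only: the function name `in_backup_window` is hypothetical, and the sketch assumes EST is a fixed UTC-5 offset, which holds on February 1.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: EST is a fixed UTC-5 offset (no daylight saving on February 1).
EST = timezone(timedelta(hours=-5), "EST")

# Backup window stated in the root cause analysis, in UTC.
BACKUP_START_UTC = time(12, 30)
BACKUP_END_UTC = time(14, 0)


def in_backup_window(local_dt: datetime) -> bool:
    """Return True if the timestamp falls inside the daily backup window."""
    utc_time = local_dt.astimezone(timezone.utc).time()
    return BACKUP_START_UTC <= utc_time <= BACKUP_END_UTC


# First report from the timeline: 8:49am EST on February 1, 2024.
first_report = datetime(2024, 2, 1, 8, 49, tzinfo=EST)
print(first_report.astimezone(timezone.utc).strftime("%H:%M UTC"))  # 13:49 UTC
print(in_backup_window(first_report))  # True
```

Under these assumptions, the first report at 8:49am EST corresponds to 13:49 UTC, which falls inside the 12:30 UTC to 14:00 UTC backup window.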
Remediation:
The team identified large, long-running queries and concurrent sessions running the same queries. A high number of RESOURCE_SEMAPHORE waits indicated memory pressure. The team eliminated the duplicate sessions and flushed the memory, after which the server returned to normal.
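The duplicate-session check described above can be sketched as follows. This is a minimal illustration, not the tooling the team used: the `Session` record, the `find_duplicate_sessions` function, the 600-second threshold, and the sample queries are all hypothetical, and in practice the session data would come from the database server's own monitoring views.

```python
from collections import defaultdict
from typing import NamedTuple


class Session(NamedTuple):
    # Hypothetical record shape for one active database session.
    session_id: int
    query_text: str
    elapsed_seconds: int


def find_duplicate_sessions(sessions, min_elapsed_seconds=600):
    """Group long-running sessions by query text and return the
    groups in which more than one session runs the same query."""
    groups = defaultdict(list)
    for s in sessions:
        if s.elapsed_seconds >= min_elapsed_seconds:
            # Normalize case and whitespace so trivially different copies match.
            key = " ".join(s.query_text.lower().split())
            groups[key].append(s.session_id)
    return {query: ids for query, ids in groups.items() if len(ids) > 1}


# Hypothetical sample data for illustration only.
sample = [
    Session(51, "SELECT * FROM Spaces", 1800),
    Session(52, "select *  from spaces", 1750),
    Session(53, "SELECT 1", 2),
]
print(find_duplicate_sessions(sample))
# {'select * from spaces': [51, 52]}
```

Sessions 51 and 52 are flagged because they run the same long-running query, while the short query in session 53 is ignored.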
Preventative Action:
The scheduled task for performing backup operations has been rescheduled to a later time.