S2 - US Region - UAT Instances Experiencing Latency or Inaccessibility
Incident Report for Serraview
Postmortem

Serraview Detailed Root Cause Analysis – Severity 2 – February 1, 2024

UAT Instances Inaccessible in US Regions

 

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident. 

 

Description:

On February 1, 2024, internal and external customers began to report that UAT instances were inaccessible. When attempting to sign in or navigate UAT, users were presented with an error message stating “Sorry, something has gone wrong. Please try again later”.

 

Type of Event:

Due to the adverse effects experienced by a growing subset of customers, an incident was initiated, and our internal teams promptly recognized and addressed the issue. It is important to note that incidents arising from UAT environments do not result in any breach of SLA.

 

Services/Modules Impacted:

UAT

 

Timeline:

On February 1, 2024 at approximately 8:49am EST, an initial report came into support stating that a customer’s UAT instance was inaccessible; users attempting to sign in or navigate UAT received the error message “Sorry, something has gone wrong. Please try again later”. Support began investigating and escalated the issue to engineering. By 11:00am EST, additional reports of inaccessible UAT instances had come into support. At 11:11am EST, all customers were notified via the status page of the disruption affecting US customers. At 11:55am EST, internal teams identified the issue and began working on a resolution. A fix was implemented, and multiple customers confirmed that the applied resolution was successful and that users could access their UAT instances. At approximately 4:00pm EST, all customers were notified that the issue was resolved.

 

Root Cause Analysis:

SQL Server CPU and memory utilization were at around 90%, resulting in performance slowdowns. CPU usage typically spikes to 95% to 97% between 12:30 UTC and 14:00 UTC due to backup processes, which contributed to the slowness.
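
For reference, the overlap between the backup window and the CPU spike can be checked against SQL Server’s built-in backup history. The sketch below is illustrative only; it assumes the standard msdb history tables and is not the specific monitoring used during this incident.

    -- Illustrative sketch: list recent backups and their durations to compare
    -- against the reported 12:30-14:00 UTC spike window.
    -- Note: msdb.dbo.backupset records dates in the server's local time.
    SELECT  database_name,
            type AS backup_type,          -- D = full, I = differential, L = log
            backup_start_date,
            backup_finish_date,
            DATEDIFF(MINUTE, backup_start_date, backup_finish_date) AS duration_minutes
    FROM    msdb.dbo.backupset
    WHERE   backup_start_date >= DATEADD(DAY, -7, GETDATE())
    ORDER BY backup_start_date DESC;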

 

Remediation:

The team identified large, long-running queries and concurrent sessions executing the same queries. There were a large number of RESOURCE_SEMAPHORE waits, which indicate memory pressure. The team terminated the duplicated sessions and flushed memory, and the server returned to normal.
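
For reference, the sketch below shows one general way to surface these symptoms with SQL Server’s dynamic management views. It is illustrative only and does not reproduce the exact diagnostics the team ran; the session ID shown is hypothetical.

    -- Illustrative sketch: active requests waiting on RESOURCE_SEMAPHORE
    -- (memory-grant pressure) and the text of the statements involved.
    SELECT  r.session_id,
            r.wait_type,
            r.wait_time AS wait_time_ms,
            r.total_elapsed_time AS elapsed_ms,
            t.text AS sql_text
    FROM    sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE   r.wait_type = N'RESOURCE_SEMAPHORE'
    ORDER BY r.wait_time DESC;

    -- Illustrative sketch: concurrent requests running the same statement
    -- (candidate duplicated sessions), grouped by SQL handle.
    SELECT  r.sql_handle,
            COUNT(*) AS concurrent_sessions
    FROM    sys.dm_exec_requests AS r
    GROUP BY r.sql_handle
    HAVING  COUNT(*) > 1;

    -- A duplicated session would then be terminated explicitly, for example:
    -- KILL 123;   -- the session_id here is hypothetical
    -- "Flushing memory" commonly maps to commands such as DBCC FREEPROCCACHE;
    -- the report does not state the exact command that was used.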

 

Preventative Action:

The scheduled task for performing backup operations has been rescheduled to a later time.
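
For reference, if the backups run under a SQL Server Agent schedule, moving the start time can be done with msdb’s sp_update_schedule. The schedule name and new start time below are hypothetical, and the report does not state which scheduling mechanism is actually in use.

    -- Illustrative sketch: move a SQL Agent schedule's start time.
    -- Both the schedule name and the new time are hypothetical.
    EXEC msdb.dbo.sp_update_schedule
         @name              = N'UAT database backup',  -- hypothetical schedule name
         @active_start_time = 030000;                  -- 03:00:00 server time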

Posted Feb 14, 2024 - 23:20 UTC

Resolved
This incident has been resolved.
Posted Feb 01, 2024 - 20:59 UTC
Update
We are continuing to investigate this issue.
Posted Feb 01, 2024 - 16:41 UTC
Investigating
We are currently investigating this issue.
Posted Feb 01, 2024 - 16:39 UTC