S2 - UAT and UAT2 environments are not responding
Incident Report for Serraview
Postmortem

Serraview Detailed Root Cause Analysis – January 19, 2024 

UAT Inaccessible 

 

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.  

 

Description: 

On January 19, 2024, internal and external customers began to report that they could not access their Serraview UAT instances. When users tried to log in, they were presented with a timeout error or a server error.

 

Type of Event: 

Because a growing subset of customers was adversely affected, an incident was initiated, and our internal teams promptly recognized and addressed the issue. It's important to note that incidents arising from UAT environments do not result in any breach of SLA.

 

Services/Modules Impacted:

UAT 

 

Timeline:

9:19am EST – Internal teams and customers began to report that they could not access their Serraview UAT instances. When users tried to log in, they were presented with a timeout error or a server error. As additional customers reported the issue, the initial investigation ticket was upgraded to high priority, and at approximately 9:46am EST all customers were notified of the incident via the Status Page. At 9:51am EST, our cloud ops team acknowledged the issue and began investigating. The team implemented the fix at approximately 10:15am EST. Customer support continued to monitor the resolution and notified all customers at 1:18pm EST that the issue had been resolved.

 

Root Cause Analysis: 

An internal service was consuming 100% CPU, which caused connections to drop.

 

Remediation: 

After investigating, our internal teams restarted services to resolve the disruption to UAT.

 

Preventative Action:  

Our internal teams continue to enhance monitoring for these internal services.

Posted Feb 02, 2024 - 21:31 UTC

Resolved
This incident has been resolved.
Posted Jan 19, 2024 - 18:18 UTC
Monitoring
Our Engineering team has identified the issue and implemented a fix. We are moving into the Monitoring Phase for the next 2 hours. We appreciate your patience. Please contact support should you require any further assistance.
Posted Jan 19, 2024 - 15:46 UTC
Identified
We are investigating an issue with UAT environments. Our Engineering team is working to determine the cause of the disruption. Our next update will be posted at 1:00 pm CST. Production instances are NOT affected.
Posted Jan 19, 2024 - 15:12 UTC