S2 - Serraview Engage service not accessible for some clients
Incident Report for Serraview
Postmortem

Serraview Detailed Root Cause Analysis

Thursday, January 25, 2024 & Monday, January 29, 2024 

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.  

 

Description: 

On the morning of Thursday, January 25th, a very small group of customers reported that they were unable to access their SV instances. When they tried to access an instance, the following error was presented: “Sorry, something has gone wrong. Please try again later.”

Although internal team members were able to access customer instances, they experienced some latency when navigating within them. The issue was resolved later in the day. However, on Monday, January 29th, the problem recurred and the same errors resurfaced.

 

Type of Event: 

Performance degradation 

 

Services/Modules Impacted:

Production – Engage & Serraview Applications 

 

Timeline: 

Thursday, January 25, 2024 

  • Approximately 9:33am EST: A very small handful of customers reported that they were unable to access their SV instances. When they tried to access an instance, the error “Sorry, something has gone wrong. Please try again later” was presented. Support raised an internal incident and advised customers, via support cases, that "The site is currently experiencing a higher-than-normal amount of load and may be causing pages to be slow or unresponsive. We're investigating the cause and will provide an update as soon as possible."
  • Approximately 4:11pm EST: Internal teams determined that the Booking Service could be the root cause and continued to investigate.
  • 5:33pm EST: Internal teams restarted services, which resolved the issue.
  • Approximately 8:08pm EST: All customers were advised that the issue had been resolved.

Monday, January 29, 2024 

  • Approximately 11:20am EST: Customers reported a recurrence of the incident experienced the prior week. All customers were notified via the status page, and internal teams quickly began investigating.
  • Around 2:33pm EST: The engineering team identified the issue and advised that the resolution was to restart the Booking Services. During the restart, internal teams found that the root cause was an ever-growing list of open connections, which exhausted the capacity of this service. While restarting, the team increased the service's memory so that it could handle more simultaneous connections. After a period of monitoring, customers confirmed that they could connect to the application successfully.
  • 7:39pm EST: All customers were notified via the status page that the incident had been resolved.

 

Total Duration of Event: 

Thursday, January 25: 10hrs 35mins 

Monday, January 29: 8hrs 19mins 
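
The durations above follow from the first customer report and the final customer notification in each timeline. As a minimal sketch of that arithmetic (the `duration` helper is hypothetical and the times are the EST values from the timeline, written in 24-hour form):

```python
from datetime import datetime

def duration(start: str, end: str) -> str:
    """Return the elapsed time between two 'YYYY-MM-DD HH:MM' timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # Split the total seconds into whole hours and leftover minutes.
    hours, remainder = divmod(int(delta.total_seconds()), 3600)
    return f"{hours}hrs {remainder // 60}mins"

# Thursday: first report 9:33am EST, resolution notice 8:08pm EST.
print(duration("2024-01-25 09:33", "2024-01-25 20:08"))  # 10hrs 35mins
# Monday: first report 11:20am EST, resolution notice 7:39pm EST.
print(duration("2024-01-29 11:20", "2024-01-29 19:39"))  # 8hrs 19mins
```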

 

Root Cause Analysis: 

The main issue (general slowness) appears to have been caused by under-provisioning of the Booking Server and of the Booking Service itself.

 

Remediation: 

The Booking Service was restarted to resolve the issue. The following actions were also taken to address the situation:

  • Scaling the RDS (shared booking server)  
  • Adding a replica of the booking service (horizontal scaling to better balance the load) 

 

Preventative Action:  

Our internal teams are currently focused on establishing monitoring and alert systems for the affected infrastructure. Additionally, the engineering team is actively investigating potential measures on the application side to prevent the accumulation of open connections, excessive database requests, and similar issues.
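
As an illustration only, one common application-side measure against accumulating open connections is a bounded pool that always returns a connection after use, combined with a usage threshold that monitoring can alert on. The sketch below is not the Booking Service's actual implementation: `BoundedPool`, `FakeConnection`, and the 80% alert ratio are all hypothetical names and values chosen for the example.

```python
import queue
from contextlib import contextmanager

class FakeConnection:
    """Hypothetical stand-in for a real database connection."""
    def __init__(self, conn_id: int):
        self.conn_id = conn_id

class BoundedPool:
    """Fixed-size pool: callers can never hold more than max_size connections."""
    def __init__(self, max_size: int, alert_ratio: float = 0.8):
        self.max_size = max_size
        self.alert_ratio = alert_ratio  # assumed alerting threshold
        self._idle = queue.Queue(maxsize=max_size)
        for i in range(max_size):
            self._idle.put(FakeConnection(i))

    @property
    def in_use(self) -> int:
        # Connections currently checked out by callers.
        return self.max_size - self._idle.qsize()

    def should_alert(self) -> bool:
        # True once usage reaches the configured fraction of capacity.
        return self.in_use >= self.alert_ratio * self.max_size

    @contextmanager
    def connection(self, timeout: float = 1.0):
        # Raises queue.Empty if no connection frees up within the timeout,
        # so exhaustion fails fast instead of growing without bound.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always return the connection, even if the caller raised.
            self._idle.put(conn)

pool = BoundedPool(max_size=5)
with pool.connection() as conn:
    print(pool.in_use, pool.should_alert())
print(pool.in_use)
```

Because the connection is returned in a `finally` block, an error in the calling code cannot leave it open, and a monitoring job could poll `should_alert()` to raise an alarm before the pool is exhausted.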

Posted Feb 02, 2024 - 18:04 UTC

Resolved
This incident has been resolved.
Posted Jan 29, 2024 - 22:39 UTC
Monitoring
Our Engineering team has identified the issue and implemented a fix. We are moving into the Monitoring Phase for the next 2 hours. We appreciate your patience. Please contact support should you require any further assistance.
Posted Jan 29, 2024 - 20:21 UTC
Identified
The issue with the Engage service has been identified and a fix is being implemented. This will require a restart of Engage backend services. The maintenance will begin at 2:00pm CST and will last approximately 15 minutes. During this time, Engage will be unavailable to all users.

We will post another update after this maintenance is completed.
Posted Jan 29, 2024 - 19:32 UTC
Investigating
The issue with the Engage service not being accessible for some clients has been identified and a fix is being implemented. We will post another update at 2:30 CST.
Posted Jan 29, 2024 - 16:26 UTC
This incident affected: Engage.