We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
On the morning of Thursday, January 25th, a small number of customers reported that they were unable to access their SV instances. When attempting to access an instance, they received the error "Sorry, something has gone wrong. Please try again later."
Although internal team members were able to access customer instances, they experienced some latency when navigating within them. The issue was resolved later in the day. However, on Monday, January 29th, the problem reappeared and the same errors resurfaced.
Type of Event:
Performance degradation
Services/Modules Impacted:
Production – Engage & Serraview Applications
Timeline:
Thursday, January 25, 2024
On the morning of Thursday, January 25, 2024, at approximately 9:33am EST, a small number of customers reported that they were unable to access their SV instances. When attempting to access an instance, they received the error "Sorry, something has gone wrong. Please try again later." Support raised an internal incident, and customers were advised via support cases that "The site is currently experiencing a higher-than-normal amount of load and may be causing pages to be slow or unresponsive. We're investigating the cause and will provide an update as soon as possible." At approximately 4:11pm EST, internal teams confirmed that the root cause might lie in the Booking Service but continued to investigate. At 5:33pm EST, internal teams restarted the services, which resolved the issue. At approximately 8:08pm EST, all customers were advised that the issue had been resolved.
Monday, January 29, 2024
On the morning of Monday, January 29, 2024, at approximately 11:20am EST, customers reported a recurrence of the incident experienced the prior week. All customers were notified via the status page, and internal teams quickly began investigating. Around 2:33pm EST, the engineering team identified the issue and advised that the resolution was to restart the Booking Service. During the restart, internal teams found the root cause to be an ever-growing list of open connections, which maxed out the capacity of the service. While restarting, the team increased the service's memory so that it could handle more simultaneous connections. After a period of monitoring, customers confirmed successful connections to the application. At 7:39pm EST, all customers were notified via the status page that the incident had been resolved.
Total Duration of Event:
Thursday, January 25: 10hrs 35mins
Monday, January 29: 8hrs 19mins
Root Cause Analysis:
The main issue (general slowness) appears to have been caused by under-resourcing and under-provisioning of the Booking Service and its server: an ever-growing list of open connections exhausted the service's capacity to handle new requests.
Remediation:
The Booking Service was restarted to resolve the issue, and its memory allocation was increased so that it could handle more simultaneous connections.
Preventative Action:
Our internal teams are currently focused on establishing monitoring and alerting for the affected infrastructure. Additionally, the engineering team is actively investigating application-side measures to prevent the accumulation of open connections, excessive database requests, and similar issues.