We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
- On October 2, 2023, our customer experience team identified an issue affecting Production instances for all clients.
- All clients experienced system slowness and performance degradation in Serraview modules.
- Over the following days, several root cause issues were identified and addressed.
- Improvements were made to several areas including server processing, database functions and server load.
- Performance improved each day until the issue was fully remediated, and the incident was closed on October 6, 2023.
Type of Event:
Unplanned outage.
Services/Modules Impacted:
All modules.
Remediation:
- Multiple code fixes were implemented to optimize database functions.
- A new SQL server was created for multiple regions, and several clients were migrated to it to balance the load.
- Changes were made to the flow of data between modules to improve performance and reduce data calls.
Timeline (AEST):
2nd October
- 16:45 – Severity 2 incident raised
3rd – 4th October
- Optimized scalar database functions, particularly for Engage.
- Revised the endpoints accessed between modules, particularly for Engage, Serraview Live and Insights.
- Performance improved across all modules but remained notably slow and continued to impact end users.
- Created a new SQL server and commenced migration of clients to it.
5th October
- Completed migration of clients to the new SQL server.
6th October
- 15:45 – Incident resolved
Total Duration of Event:
Approximately 3 days and 23 hours.
Root Cause Analysis:
Initially, a sudden increase in traffic placed the US SQL servers under heavy load during business hours. Scalar database functions were also found to be inefficient across all regions, particularly for Engage presence detection calls.
Preventative Action:
Since implementing several code and infrastructure changes, we have seen considerably improved performance.
We will continue working to prevent such issues by investigating and optimizing the flow of SVLive and Sensor data to Insights and Engage.