S2 - Serraview Instances Slow To Load
Incident Report for Serraview
Postmortem

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident. 

 

Description:  

  • On October 2, 2023, our customer experience team identified an issue affecting Production instances for all clients. 
  • All clients experienced system slowness and performance degradation across Serraview modules. 
  • Over the following days, several root cause issues were identified and addressed. 
  • Improvements were made to several areas including server processing, database functions and server load. 
  • Performance improved each day until the issue was fully remediated and the incident was closed on October 6, 2023. 

 

Type of Event:  

Unplanned outage.  

  

Services/Modules Impacted:  

All modules. 

  

Remediation:  

  • Multiple code fixes were implemented to optimize database functions. 
  • A new SQL server was created for multiple regions and several clients were migrated to balance the load. 
  • Changes were made to the flow of data between modules to improve performance and reduce data calls. 

  

Timeline (AEST):  

2nd October 

  • 16:45 – Severity 2 incident raised 

3rd – 4th October 

  • Optimized scalar database functions, particularly in Engage. 
  • Revised the endpoints accessed between modules, particularly Engage, Serraview Live and Insights. 
  • Performance improved across all modules but remained notably slow, still impacting end users. 
  • Created a new SQL server and commenced migration of clients to it. 

5th October 

  • Completed migration of clients to the new SQL server. 

6th October 

  • 15:45 – Incident Resolved 

  

Total Duration of Event:  

~ 3 days and 23 hours.  
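
The duration above follows from the two timestamps in the timeline. As a minimal sketch of the arithmetic (the variable names are illustrative only, and both times are assumed to be in AEST as stated, so no timezone conversion is applied):

```python
from datetime import datetime

# Timestamps taken from the timeline (both AEST, so they can be subtracted directly).
raised = datetime(2023, 10, 2, 16, 45)    # Severity 2 incident raised
resolved = datetime(2023, 10, 6, 15, 45)  # Incident resolved

duration = resolved - raised
days = duration.days               # whole days elapsed
hours = duration.seconds // 3600   # remaining whole hours

print(f"{days} days and {hours} hours")  # prints "3 days and 23 hours"
```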

  

Root Cause Analysis:  

Initially, the US SQL servers came under heavy load during business hours due to a sudden increase in traffic. Scalar database functions were then found to be inefficient across all regions, particularly in Engage presence detection calls. 

 

Preventative Action:   

Now that we have implemented several code and infrastructure changes, we have seen considerably improved performance.  

We will continue to work on preventing such issues by investigating and optimizing the flow of SVLive & Sensor data to Insights and Engage.

Posted Oct 27, 2023 - 05:32 UTC

Resolved
This incident has been resolved, and a detailed RCA will be posted once investigation has been completed.
Posted Oct 10, 2023 - 05:28 UTC
Monitoring
We are now moving into the Monitoring Phase as we are seeing positive results after our fixes.
Our engineering team have implemented optimisations to prevent this issue in the future.
Posted Oct 06, 2023 - 05:43 UTC
Update
We will make further changes this evening at 7:00 PM AEST to address slowness issues, and we will provide an update on the results within 24 hours.
Please let our Support team know if you continue to experience any slowness after 12 hours.
Posted Oct 05, 2023 - 05:28 UTC
Update
A fix for Engage aimed at addressing slowness issues was pushed last night at 7:00 PM AEST, and we have received reports of improvement. We will implement further improvements to our systems at the same time this evening, targeting the loading times of floors and buildings.
Please let us know if you continue to experience issues in Engage.
Posted Oct 04, 2023 - 03:38 UTC
Update
We have identified an issue with SQL servers being overloaded in both AU and US regions and are working on a fix.

We have also identified responsiveness issues in Engage. We are implementing a fix at 7:00 PM AEST this evening, which will take around 1 hour.
Posted Oct 03, 2023 - 05:26 UTC
Update
We are continuing to work on a fix for this issue.
Posted Oct 03, 2023 - 01:02 UTC
Update
We are continuing to work on a fix for this issue.
Posted Oct 02, 2023 - 18:58 UTC
Identified
An issue causing a small number of clients to experience slowness has been identified and is being investigated.
Posted Oct 02, 2023 - 06:42 UTC
This incident affected: Space Planning (APAC - Space Planning, EMEA - Space Planning, NA - Space Planning) and Engage.