S2 - SVLive 2 Outage
Incident Report for Serraview
Postmortem

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident. 

 

Description:  

On November 30, 2023, our customer experience team identified an issue with Serraview Live 2 causing a delay between receiving data and publishing it to the Serraview Live Servers. 

Upon investigation, our engineering team found that the root cause was that we had reached the maximum number of server shards as our client count grew. 

 

Type of Event:  

Unplanned delay in SVLive 2 data being processed.  

  

Services/Modules Impacted:  

Locator, Engage and reports (including Insights). 

  

Remediation:  

We tripled the maximum number of server shards, enabling us to scale them for each client and effectively handle the increased load. As a result, we could resume processing SVLive 2 data. 

  

Timeline (AEDT):  

30th November 

  • 17:29 – Issue raised 

1st December 

  • 09:00 – Maximum number of server shards increased; data began processing again 

  

Total Duration of Event:  

~ 15 hours and 30 minutes (17:29 on 30th November to 09:00 on 1st December, AEDT).  

  

Root Cause Analysis:  

As the number of clients grew, the number of shards in use reached the configured maximum on the AWS server, and SVLive 2 data processing was delayed until that maximum was increased. 

 

Preventative Action:   

  • Proactively increased the number of shards available in preparation for new clients coming onboard. 
  • Implemented real-time monitoring of shard counts so that we are alerted if we approach the limit again.
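
As an illustration of the shard-count monitoring described above, the check can be sketched as a comparison of shards in use against the configured maximum. This is a minimal sketch and not our production implementation: the names `ShardUsage` and `check_shard_usage`, the 80% warning threshold, and the example numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ShardUsage:
    """Hypothetical snapshot of shard consumption (names are illustrative)."""
    in_use: int   # shards currently allocated
    limit: int    # configured maximum number of shards

    @property
    def utilization(self) -> float:
        # Fraction of the configured maximum that is currently in use.
        return self.in_use / self.limit


def check_shard_usage(usage: ShardUsage, warn_at: float = 0.8) -> str:
    """Return "ok", "warning", or "critical" for a usage snapshot.

    warn_at is an assumed threshold (80% of the limit), not a documented value.
    """
    if usage.limit <= 0:
        raise ValueError("limit must be a positive number of shards")
    if usage.in_use >= usage.limit:
        return "critical"   # limit reached: no headroom for new clients
    if usage.utilization >= warn_at:
        return "warning"    # approaching the limit: raise it before onboarding
    return "ok"


if __name__ == "__main__":
    # Example values only; real shard counts are not stated in this report.
    print(check_shard_usage(ShardUsage(in_use=85, limit=100)))
```

Alerting on a fraction of the limit rather than on the limit itself leaves time to raise the maximum before new clients are onboarded.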
Posted Dec 11, 2023 - 23:04 UTC

Resolved
Posted Nov 30, 2023 - 06:30 UTC