S1 - AU Instance Disruption
Incident Report for Serraview
Postmortem

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident. 

 

Description:  

On October 5, 2023, our customer experience team identified an issue affecting Production instances for all APAC clients. 

Our engineering team restarted the SQL server, and all APAC instances began to come back online. 

 

Type of Event:  

Unplanned outage.  

  

Services/Modules Impacted:  

All modules. 

  

Remediation:  

The SQL server was restarted. 

  

Timeline (AEST):  

5th October 

  • 7:52 – Severity 2 incident raised 
  • 10:17 – SQL server restarted; instances started to come back online  
  • 11:34 – Incident Resolved 

  

Total Duration of Event:  

~ 3 hours and 42 minutes.  

  

Root Cause Analysis:  

We have been working hard to investigate this incident. Our team has thoroughly examined the logs but, unfortunately, has not found any clear indicators of the root cause at this time. We have also been unable to intentionally replicate the issue, which suggests that the existing protection automations are functioning as intended.

 

Preventative action:  

As we are currently unable to replicate the issue and our logs do not point to a clear cause for the incident, we will continue to monitor the health of the SQL server so that any future anomalies are escalated immediately. Rest assured, we are committed to finding a solution and ensuring a smooth experience for you.

Posted Nov 03, 2023 - 04:13 UTC

Resolved
As we have not seen further service disruptions after the fix was implemented, we have moved to the Resolved Phase.
A Preliminary RCA will be posted to this incident in 2 business days. Please stay subscribed to the page to receive the post automatically.
Posted Oct 06, 2023 - 04:44 UTC
Monitoring
The issue causing AU instances to be down or slow has now been resolved.
While we continue to investigate the root cause, this incident has been set to the Monitoring status.
Posted Oct 05, 2023 - 00:34 UTC
Investigating
We are currently investigating an issue that is causing some AU instances to be inaccessible with a 503 error and other instances to experience slowness.
Posted Oct 04, 2023 - 22:32 UTC
This incident affected: Core Services (APAC - Core Services) and Space Planning (APAC - Space Planning).