S1 - US Prod Outage
Incident Report for Serraview
Postmortem

Description:  

On September 26, 2023, our customer experience team identified an issue with US Production instances that prevented users from logging in.

Upon investigation, our engineering team traced the root cause to the connection between the Windows server and the Serraview US Production domain, which had disconnected.

 

Type of Event:  

Unplanned outage for US clients.  

  

Services/Modules Impacted:  

All services and modules in US Production instances. 

  

Remediation:  

The connection between the Windows server and the domain was restored and the service was restarted, after which US Production instances came back online (16:48 AEST).

  

Timeline (AEST):  

September 26, 2023

  • 15:05 – Issue raised 

  • 15:51 – Issue identified – Severity 1 outage incident triggered. 

  • 16:48 – Service restarted – incident resolved. 

 

Total Duration of Event:  

Approximately 1 hour and 43 minutes (15:05 to 16:48 AEST).
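For readers who want to check the arithmetic, the duration can be reproduced from the timeline entries. The snippet below is only an illustrative sketch: the variable names are ours, and AEST is modelled as a fixed UTC+10 offset.

```python
from datetime import datetime, timedelta, timezone

# AEST modelled as a fixed UTC+10 offset (illustrative assumption).
AEST = timezone(timedelta(hours=10), "AEST")

# Timestamps taken from the timeline in this report.
raised = datetime(2023, 9, 26, 15, 5, tzinfo=AEST)
identified = datetime(2023, 9, 26, 15, 51, tzinfo=AEST)
resolved = datetime(2023, 9, 26, 16, 48, tzinfo=AEST)

total_duration = resolved - raised        # issue raised -> resolved
time_to_identify = identified - raised    # issue raised -> Severity 1 triggered
time_to_restore = resolved - identified   # Severity 1 triggered -> resolved

print("Total duration:  ", total_duration)
print("Time to identify:", time_to_identify)
print("Time to restore: ", time_to_restore)
print("Raised (UTC):    ", raised.astimezone(timezone.utc).strftime("%H:%M"))
```

Splitting the total into time-to-identify and time-to-restore shows that the 1 hour 43 minutes is made up of 46 minutes of detection and triage followed by 57 minutes of remediation.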

  

Root Cause Analysis:  

The connection between the Windows server and the domain had disconnected.  

 

Preventative Action:   

Dedicated monitoring and alerting will be set up to detect when the Windows server loses its connection to the domain.
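As a rough illustration of what such a check could look like, the sketch below polls the server's domain connection and raises an alert when it fails. It rests on assumptions this report does not confirm: that the "connection" is the machine's secure channel to an Active Directory domain, and that the PowerShell cmdlet `Test-ComputerSecureChannel` is available on the server. The function names and the alert hook are hypothetical, not Serraview's actual tooling.

```python
import subprocess
from typing import Callable, Sequence

# Assumption: the connection in question is the server's secure channel to its
# Active Directory domain, which Test-ComputerSecureChannel reports as True/False.
CHECK_CMD = ["powershell", "-NoProfile", "-Command", "Test-ComputerSecureChannel"]


def run_command(cmd: Sequence[str]) -> str:
    """Run a command and return its standard output as text."""
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=30, check=False
    )
    return result.stdout


def domain_connection_ok(runner: Callable[[Sequence[str]], str] = run_command) -> bool:
    """Return True only if the check command explicitly reports 'True'."""
    try:
        output = runner(CHECK_CMD)
    except (OSError, subprocess.SubprocessError):
        # The check could not run at all; treat that as unhealthy.
        return False
    return output.strip().lower() == "true"


def check_and_alert(
    runner: Callable[[Sequence[str]], str] = run_command,
    alert: Callable[[str], None] = print,  # hypothetical hook: pager, email, chat
) -> bool:
    """Run one check and call the alert hook if the connection looks broken."""
    ok = domain_connection_ok(runner)
    if not ok:
        alert("ALERT: Windows server has lost its connection to the domain")
    return ok
```

In practice a check like this would be run on a schedule (for example every minute) with `alert` wired to the on-call notification system. Treating any output other than an explicit `True` as a failure errs on the side of alerting.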

Posted Oct 16, 2023 - 23:58 UTC

Resolved
US Prod instances were down, affecting all modules.
Sites went down at approximately 3:30 PM AEST and were back up by approximately 4:45 PM AEST.
A detailed RCA will be made available once further investigation has progressed.
Posted Sep 26, 2023 - 05:30 UTC