ServiceChannel System Performance Degradation

Postmortem

Classification: Infrastructure/hardware instability

Incident Report

Date of Incident: 08/31/2023

Time/Date Incident Started: 08/31/2023, 02:15 pm EDT

Time/Date Stability Restored: 08/31/2023, 02:47 pm EDT

Time/Date Incident Resolved: 08/31/2023, 02:50 pm EDT

  

Users Impacted: All 

Frequency: Intermittent 

Impact: Major  

Incident Description

On August 31st at 02:15 pm EDT, the ServiceChannel Site Reliability Engineering (SRE) team began receiving a large volume of system-generated SQL timeout alerts, followed by reports of dashboard slowness.

   

Root Cause Analysis 

The Database Administration (DBA) team discovered a growing queue of active database queries and increasing resource waits, caused by application functionality that was generating database blocking and high CPU load on the database cluster.
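
For illustration only, the following is a minimal diagnostic sketch of the kind of query a DBA team might use to surface blocked requests and their blockers. It assumes a Microsoft SQL Server cluster (suggested by the stored-procedure recompilation described under Actions Taken); the query is an assumption, not ServiceChannel's actual tooling.

    -- Hypothetical diagnostic query (SQL Server): list requests that are
    -- currently blocked, which session is blocking them, and what they wait on.
    SELECT
        r.session_id,
        r.blocking_session_id,        -- non-zero means this request is blocked
        r.wait_type,                  -- e.g. LCK_M_X for an exclusive-lock wait
        r.wait_time  AS wait_time_ms,
        r.cpu_time   AS cpu_time_ms,
        t.text       AS query_text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.blocking_session_id <> 0  -- show only blocked requests
    ORDER BY r.wait_time DESC;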

Actions Taken 

  1. Investigated system-generated alerts and identified affected platform functionality.  

  2. Recompiled the affected stored procedures and dropped all blocking connections to return the database cluster to a steady state (illustrated in the sketch after this list). 

  3. Compiled incident findings for future remediation by the Application Engineering and SRE teams. 
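
On SQL Server, the recovery described in step 2 is commonly performed with sp_recompile and KILL, roughly as sketched below. The procedure name and session ID are placeholders for illustration, not values from this incident.

    -- Hypothetical recovery commands, assuming SQL Server.
    -- 1. Mark a suspect stored procedure for recompilation so that its next
    --    execution builds a fresh query plan instead of reusing a bad cached one.
    EXEC sp_recompile N'dbo.usp_LoadDashboard';  -- placeholder procedure name

    -- 2. Drop a blocking connection by its session ID (found via the
    --    blocking_session_id column in sys.dm_exec_requests).
    KILL 64;                                     -- placeholder session ID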

Mitigation Measures 

  1. Coordinate with the Application Engineering team to identify and remediate the root causes of the high database CPU load and blocking. 

  2. Identify and implement general performance improvements for database queries to increase overall platform stability. 

  3. Implement infrastructural modifications to distribute database I/O across additional read replicas (see the sketch below).
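
One common way to realize item 3 on SQL Server is read-only routing in an Always On availability group, so that connections that declare read intent are served by a readable secondary. The sketch below is an assumption about how such a change could look, using a hypothetical availability group AG1 with replicas SQL1 (primary) and SQL2 (secondary); none of these names come from the incident report.

    -- Hypothetical configuration, assuming a SQL Server Always On
    -- availability group named AG1 with primary SQL1 and secondary SQL2.

    -- Allow read-intent connections on the secondary replica.
    ALTER AVAILABILITY GROUP AG1
    MODIFY REPLICA ON N'SQL2'
    WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));

    -- Tell the cluster where to route read-intent traffic for SQL2.
    ALTER AVAILABILITY GROUP AG1
    MODIFY REPLICA ON N'SQL2'
    WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SQL2.example.com:1433'));

    -- When SQL1 is primary, send read-intent sessions to SQL2.
    ALTER AVAILABILITY GROUP AG1
    MODIFY REPLICA ON N'SQL1'
    WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'SQL2')));

Clients opt in by adding ApplicationIntent=ReadOnly to their connection strings; those sessions are then routed to the secondary, taking read I/O off the primary.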

Posted Sep 14, 2023 - 14:32 EDT

Resolved
This incident has been resolved. All services are working as expected.
Posted Aug 31, 2023 - 15:14 EDT
Investigating
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Posted Aug 31, 2023 - 14:36 EDT
This incident affected: Service Automation (Work Order Manager, Maps).