Infrastructure/hardware instability
Incident Report
Date of Incident: 08/31/2023
Time/Date Incident Started: 08/31/2023, 02:15 pm EDT
Time/Date Stability Restored: 08/31/2023, 02:47 pm EDT
Time/Date Incident Resolved: 08/31/2023, 02:50 pm EDT
Users Impacted: All
Frequency: Intermittent
Impact: Major
Incident Description
On August 31st at 02:15 pm EDT, the ServiceChannel Site Reliability Engineering (SRE) team began receiving a high volume of SQL timeout alerts, followed by reports of dashboard slowness.
Root Cause Analysis
The Database Administration (DBA) team discovered a growing queue of active database queries and increasing resource waits, caused by application functionality that was generating database blocking and high CPU load on the database cluster.
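The specific diagnostics run by the DBA team are not included in this report. The sketch below illustrates how a blocking chain of this kind could be surfaced, assuming a Microsoft SQL Server cluster (consistent with the stored-procedure recompilation described under Actions Taken); the threshold value is illustrative only.

    -- Snapshot of currently executing requests that are blocked, the
    -- session blocking each one, and what they are waiting on; also
    -- surfaces unusually CPU-heavy requests. Rerun to see whether the
    -- blocking chain is growing.
    SELECT
        r.session_id,
        r.blocking_session_id,      -- 0 means the request is not blocked
        r.wait_type,
        r.wait_time AS wait_time_ms,
        r.cpu_time  AS cpu_time_ms,
        t.text      AS sql_text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.blocking_session_id <> 0
       OR r.cpu_time > 5000         -- ms; illustrative threshold
    ORDER BY r.wait_time DESC;

A growing number of rows with non-zero blocking_session_id values, combined with rising wait times, matches the queueing pattern the DBA team describes.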
Actions Taken
Investigated system-generated alerts and identified affected platform functionality.
Recompiled the affected stored procedures and dropped all blocking connections to return the database cluster to a steady state (see the sketch after this list).
Compiled incident findings for future remediation by the Application Engineering and SRE teams.
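The exact remediation commands are not reproduced in the report; the following is a minimal sketch of the two recovery steps, again assuming SQL Server. The procedure name and session id are placeholders, not values from this incident.

    -- Recompiling a stored procedure discards its cached plan so the
    -- next call compiles a fresh one, which can clear a stale plan
    -- that is driving blocking and CPU load.
    EXEC sp_recompile N'dbo.usp_GetDashboardData';  -- hypothetical name

    -- Dropping a blocking connection by session id (taken from a
    -- diagnostic query like the one above); the session's open
    -- transaction is rolled back.
    KILL 87;  -- placeholder session id

Recompilation addresses the plan itself; killing the head blocker releases its locks so the queued requests can drain.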
Mitigation Measures
Coordinate with the Application Engineering team to identify and remediate the root causes of the high database CPU load and blocking.
Identify and implement general performance improvements for database queries to increase overall platform stability.
Implement infrastructural modifications to distribute database I/O across additional read replicas (see the sketch below).
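The report does not state which replication technology the cluster uses. As one hedged illustration, if the platform runs SQL Server Always On availability groups, read traffic can be steered to secondary replicas with read-only routing; every name below (AG1, sql-primary-01, sql-replica-01) is a placeholder.

    -- Let the secondary serve read-intent connections and advertise a
    -- routing URL for them.
    ALTER AVAILABILITY GROUP [AG1]
    MODIFY REPLICA ON N'sql-replica-01'
    WITH (SECONDARY_ROLE (
        ALLOW_CONNECTIONS = READ_ONLY,
        READ_ONLY_ROUTING_URL = N'TCP://sql-replica-01.example.com:1433'));

    -- Tell the primary where to send read-intent connections.
    ALTER AVAILABILITY GROUP [AG1]
    MODIFY REPLICA ON N'sql-primary-01'
    WITH (PRIMARY_ROLE (
        READ_ONLY_ROUTING_LIST = (N'sql-replica-01')));

Application connection strings would then add ApplicationIntent=ReadOnly to opt read-only workloads into replica routing, moving that I/O off the primary.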