ServiceChannel System Performance Degradation

Incident Report for ServiceChannel

Postmortem

Incident Report: Infrastructure/Hardware Instability

Date of Incident: 09/08/2023

Time/Date Incident Started: 09/08/2023, 04:18 pm EDT 

Time/Date Stability Restored: 09/08/2023, 05:08 pm EDT

Time/Date Incident Resolved: 09/08/2023, 05:15 pm EDT

Users Impacted: All

Frequency: Intermittent

Impact: Major 

Incident description: 

On September 8th at 04:18 pm EDT, the Site Reliability Engineering (SRE) team received an alert for "SQL timeout errors," followed by reports of dashboard slowness. The slowness affected a large number of users and degraded their experience of the platform.

Root Cause Analysis: 

The Database Administration (DBA) team investigated and identified a series of database requests that were blocking other requests and placing a high CPU load on the database replica servers. This, in turn, increased the number of "resource waits." As a corrective measure, the DBA team restarted the SQL service on both database replica servers. After the restart completed successfully, the team monitored the system closely and confirmed that stability had been restored.
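
The blocking pattern described above can be illustrated with a minimal sketch. It is not the DBA team's tooling. It assumes a snapshot of (session_id, blocking_session_id) pairs, similar to what SQL Server exposes through sys.dm_exec_requests; the report does not state which database engine is in use, and the function name and sample data are hypothetical. The sketch walks each blocking chain to its root and counts how many sessions are waiting behind each "head blocker."

```python
from collections import defaultdict


def find_head_blockers(requests):
    """Count blocked sessions behind each head blocker.

    `requests` is an iterable of (session_id, blocking_session_id) pairs.
    A blocking_session_id of 0 (or None) means the session is not blocked.
    This input shape is an assumption made for illustration.
    """
    # Map each blocked session to the session directly blocking it.
    blocked_by = {sid: blocker for sid, blocker in requests if blocker}

    waiting_behind = defaultdict(int)
    for sid in blocked_by:
        seen = {sid}
        current = blocked_by[sid]
        # Follow the chain upward until reaching a session that is not
        # itself blocked. The `seen` set stops the walk if the chain loops
        # (a deadlock), in which case the last session visited is reported.
        while current in blocked_by and current not in seen:
            seen.add(current)
            current = blocked_by[current]
        waiting_behind[current] += 1
    return dict(waiting_behind)


if __name__ == "__main__":
    # Hypothetical snapshot: sessions 52 and 54 wait on 51, 53 waits on 52,
    # and 61 waits on 60.
    snapshot = [(51, 0), (52, 51), (53, 52), (54, 51), (60, 0), (61, 60)]
    for head, count in sorted(find_head_blockers(snapshot).items()):
        print(f"session {head} is blocking {count} session(s)")
```

Ranking sessions by the number of requests queued behind them is one way to decide which queries to examine first when investigating this kind of contention.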

Actions Taken: 

Investigated system-generated alerts and identified affected platform functionality. 
DBA team restarted the SQL service on both database replica servers.

Mitigation Measures:  

In response to this incident, the following mitigation measures are underway:

Ongoing Investigation: The team is continuing to investigate the root cause of the high CPU usage and blocking on the database servers.
Database Query Performance Improvements: The team is working to improve the performance of database queries to strengthen the overall stability of the platform.

Posted Oct 05, 2023 - 09:19 EDT

Resolved

This incident has been resolved. All services are working as expected.

Posted Sep 08, 2023 - 17:41 EDT

Monitoring

System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.

Posted Sep 08, 2023 - 17:16 EDT

Investigating

We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.

Posted Sep 08, 2023 - 17:03 EDT

This incident affected: Service Automation (Asset Manager).