ServiceChannel System Performance Degradation
Incident Report for ServiceChannel
Postmortem

Incident Report: Infrastructure/Hardware Instability 

   

Date of Incident: 09/08/2023

Time/Date Incident Started: 09/08/2023, 04:18 pm EDT

Time/Date Stability Restored: 09/08/2023, 05:08 pm EDT

Time/Date Incident Resolved: 09/08/2023, 05:15 pm EDT

  

Users Impacted: All 

Frequency: Intermittent 

Impact: Major  

    

Incident description:  

On September 8 at 04:18 pm EDT, the Site Reliability Engineering (SRE) team received an alert for "SQL timeout errors" and then reports of dashboard slowness. The slowness was intermittent and degraded the experience for a large number of users until stability was restored at 05:08 pm EDT.

Root Cause Analysis:  

During its investigation, the Database Administration (DBA) team identified a series of database requests that were blocking other requests and placing a high CPU load on the database replica servers. This, in turn, increased the number of "resource waits." As a corrective measure, the DBA team restarted the SQL service on both database replica servers. After the restart completed successfully, system stability was restored and the system was closely monitored.
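The blocking described above can be pictured as chains of sessions waiting on one another. The following is a minimal sketch, not the DBA team's actual tooling: it assumes a snapshot of (session ID, blocking session ID) pairs is already available, with 0 meaning "not blocked", and it finds the sessions at the head of each chain. The function name `find_head_blockers` and the sample rows are hypothetical.

```python
from collections import defaultdict


def find_head_blockers(rows):
    """Return {head_blocker_id: number of sessions waiting behind it}.

    rows: iterable of (session_id, blocking_session_id) pairs taken from a
    hypothetical snapshot; a blocking_session_id of 0 means "not blocked".
    """
    # Map each blocked session to the session it is waiting on.
    blocked_by = {sid: blocker for sid, blocker in rows if blocker}

    # Invert the mapping: blocker -> sessions waiting directly on it.
    waiters = defaultdict(list)
    for sid, blocker in blocked_by.items():
        waiters[blocker].append(sid)

    result = {}
    for head in waiters:
        if head in blocked_by:
            continue  # this blocker is itself waiting, so it is not a head
        # Walk the chain to count every session stuck behind this head.
        seen = set()
        stack = list(waiters[head])
        while stack:
            sid = stack.pop()
            if sid in seen:
                continue
            seen.add(sid)
            stack.extend(waiters.get(sid, []))
        result[head] = len(seen)
    return result


if __name__ == "__main__":
    # Illustrative sample data only, not taken from the incident.
    sample = [(51, 0), (52, 51), (53, 52), (54, 51), (60, 0), (61, 60)]
    for head, count in sorted(find_head_blockers(sample).items()):
        print(f"session {head} is blocking {count} session(s)")
```

Ranking head blockers by how many sessions wait behind them is one way to decide which requests to examine first when blocking and resource waits rise together.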

Actions Taken:  

  1. Investigated system-generated alerts and identified affected platform functionality.  

  2. The DBA team restarted the SQL service on both database replica servers.

Mitigation Measures:    

In response to this incident, the following mitigation measures are under way:

  1. Ongoing Investigation: The team is continuing to investigate the root causes of the high CPU usage and blocking on the database replica servers.

  2. Database Query Performance Improvements: The team is working to improve the performance of database queries to support the overall stability of the platform.

Posted Oct 05, 2023 - 09:19 EDT

Resolved
This incident has been resolved. All services are working as expected.
Posted Sep 08, 2023 - 17:41 EDT
Monitoring
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
Posted Sep 08, 2023 - 17:16 EDT
Investigating
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Posted Sep 08, 2023 - 17:03 EDT
This incident affected: Service Automation (Asset Manager).