Infrastructure/hardware instability
Incident Report
Date of Incident: 08/31/2023
Time/Date Incident Started: 08/31/2023, 02:15 pm EDT
Time/Date Stability Restored: 08/31/2023, 02:47 pm EDT
Time/Date Incident Resolved: 08/31/2023, 02:50 pm EDT
Users Impacted: All
Frequency: Intermittent
Impact: Major
Incident Description
On August 31st at 02:15 pm EDT, the ServiceChannel Site Reliability Engineering (SRE) team began receiving a high volume of SQL timeout alerts, followed by reports of dashboard slowness.
Root Cause Analysis
The Database Administration (DBA) team discovered a growing queue of active database queries and increasing resource waits, caused by application functionality that was generating database blocking and high CPU load on the database cluster.
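The specific diagnostics run by the DBA team are not included in this report. The sketch below illustrates how a blocking chain of this kind could be surfaced, assuming a Microsoft SQL Server cluster (consistent with the stored-procedure recompilation described under Actions Taken); the threshold value is illustrative only.

    -- Snapshot of currently executing requests that are blocked, the
    -- session blocking each one, and what they are waiting on; also
    -- surfaces unusually CPU-heavy requests. Rerun to see whether the
    -- blocking chain is growing.
    SELECT
        r.session_id,
        r.blocking_session_id,      -- 0 means the request is not blocked
        r.wait_type,
        r.wait_time AS wait_time_ms,
        r.cpu_time  AS cpu_time_ms,
        t.text      AS sql_text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.blocking_session_id <> 0
       OR r.cpu_time > 5000         -- ms; illustrative threshold
    ORDER BY r.wait_time DESC;

A growing number of rows with non-zero blocking_session_id values, combined with rising wait times, matches the queueing pattern the DBA team describes.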
Actions Taken
Investigated system-generated alerts and identified affected platform functionality.
Recompiled the affected stored procedures and dropped all blocking connections to return the database cluster to a steady state (see the sketch after this list).
Compiled incident findings for future remediation by the Application Engineering and SRE teams.
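The exact remediation commands are not reproduced in the report; the following is a minimal sketch of the two recovery steps, again assuming SQL Server. The procedure name and session id are placeholders, not values from this incident.

    -- Recompiling a stored procedure discards its cached plan so the
    -- next call compiles a fresh one, which can clear a stale plan
    -- that is driving blocking and CPU load.
    EXEC sp_recompile N'dbo.usp_GetDashboardData';  -- hypothetical name

    -- Dropping a blocking connection by session id (taken from a
    -- diagnostic query like the one above); the session's open
    -- transaction is rolled back.
    KILL 87;  -- placeholder session id

Recompilation addresses the plan itself; killing the head blocker releases its locks so the queued requests can drain.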
Mitigation Measures
Coordinate with the Application Engineering team to identify and remediate the root causes of the high database CPU load and blocking.
Identify and implement general performance improvements for database queries to increase overall platform stability.
Implement infrastructural modifications to distribute database I/O across additional read replicas (see the sketch below).
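The report does not state which replication technology the cluster uses. As one hedged illustration, if the platform runs SQL Server Always On availability groups, read traffic can be steered to secondary replicas with read-only routing; every name below (AG1, sql-primary-01, sql-replica-01) is a placeholder.

    -- Let the secondary serve read-intent connections and advertise a
    -- routing URL for them.
    ALTER AVAILABILITY GROUP [AG1]
    MODIFY REPLICA ON N'sql-replica-01'
    WITH (SECONDARY_ROLE (
        ALLOW_CONNECTIONS = READ_ONLY,
        READ_ONLY_ROUTING_URL = N'TCP://sql-replica-01.example.com:1433'));

    -- Tell the primary where to send read-intent connections.
    ALTER AVAILABILITY GROUP [AG1]
    MODIFY REPLICA ON N'sql-primary-01'
    WITH (PRIMARY_ROLE (
        READ_ONLY_ROUTING_LIST = (N'sql-replica-01')));

Application connection strings would then add ApplicationIntent=ReadOnly to opt read-only workloads into replica routing, moving that I/O off the primary.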