Incident Report: Infrastructure/Hardware Instability
Date of Incident: 09/08/2023
Time/Date Incident Started: 09/08/2023, 04:18 pm EDT
Time/Date Stability Restored: 09/08/2023, 05:08 pm EDT
Time/Date Incident Resolved: 09/08/2023, 05:15 pm EDT
Users Impacted: All
Frequency: Intermittent
Impact: Major
Incident description:
On September 8, 2023, at 04:18 pm EDT, the Site Reliability Engineering (SRE) team received an alert for "SQL timeout errors," followed by reports of dashboard slowness. The slowness was intermittent but affected all users, and it persisted until stability was restored at 05:08 pm EDT.
Root Cause Analysis:
During the investigation, the Database Administration (DBA) team identified a series of database requests that were blocking other requests and imposing a high CPU load on the database replica servers. This led to an increased number of "resource waits," which is consistent with the SQL timeout errors reported in the alert. As a corrective measure, the DBA team restarted the SQL service on both database replica servers. After the restart completed successfully, the system was closely monitored, and stability was restored at 05:08 pm EDT. Why these requests caused blocking and high CPU load has not yet been determined and remains under investigation.
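The blocking described above can be illustrated with a minimal sketch. It assumes a snapshot of active sessions is available as pairs of (session ID, ID of the session blocking it), and it finds the "head blockers" at the top of each blocking chain. The function name, the input shape, and the sample session IDs are hypothetical and are not taken from the DBA team's tooling or from this incident's data.

```python
from collections import defaultdict


def find_head_blockers(sessions):
    """Count how many sessions each head blocker is holding up.

    sessions: iterable of (session_id, blocking_session_id) pairs, where
    blocking_session_id is None when the session is not blocked.
    This input shape is an assumption made for illustration.
    """
    # Map each blocked session to the session that blocks it.
    blocked_by = {sid: blocker for sid, blocker in sessions if blocker is not None}

    counts = defaultdict(int)
    for sid in blocked_by:
        seen = {sid}
        current = blocked_by[sid]
        # Walk up the chain until we reach a session that is not blocked itself.
        while current in blocked_by and current not in seen:
            seen.add(current)
            current = blocked_by[current]
        if current in blocked_by:
            # The chain loops back on itself (a deadlock-style cycle),
            # so there is no single head blocker to report.
            continue
        counts[current] += 1
    return dict(counts)


# Hypothetical snapshot: 52 waits on 51, 53 waits on 52, 55 waits on 54.
sample = [(51, None), (52, 51), (53, 52), (54, None), (55, 54)]
print(find_head_blockers(sample))
```

Ranking head blockers by the number of sessions waiting behind them is one way to decide which requests to examine first; it does not by itself explain why those requests were slow.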
Actions Taken:
Investigated system-generated alerts and identified affected platform functionality.
DBA team restarted the SQL service on both database replica servers.
Mitigation Measures:
In response to this incident, the following mitigation measures are in progress:
Ongoing Investigation: The team is continuing to investigate the root causes of the high CPU usage and blocking on the database replica servers.
Database Query Performance Improvements: Work is under way to improve the performance of database queries, with the aim of reducing blocking and CPU load and keeping the platform stable.