Date of Incident: 02/21/2019
Time/Date Incident Started: 02/21/2019, 10:42am EST
Time/Date Stability Restored: 02/21/2019, 2:42pm EST
Time/Date Incident Resolved: 02/21/2019, 6:50pm EST
Users Impacted: Active users
Performance across all systems degraded sharply, with increased response times and intermittent inability to log in.
Root Cause Analysis:
Investigation via our monitoring systems revealed a significant number of “database waits” tied to a particular API call made by one of the integration clients.
Further investigation revealed that an overnight database update had not been followed by a database statistics refresh. The stale statistics led the optimizer to choose non-optimal execution plans, which in turn produced the massive number of waits and the resulting slow performance.
Resolution:
We initially disabled the API client that was overloading the system.
Database statistics were forcibly recalculated, and tests were performed to validate that queries were no longer overloading the system.
The throttling threshold for that client was adjusted to prevent any further system impact.
The system was closely monitored for an additional 4 hours to ensure stability.
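The per-client throttling adjusted above can be sketched as a simple token bucket. This is illustrative only; the rate and burst values below are hypothetical, not the production thresholds.

```python
# Minimal token-bucket throttle sketch for per-client API limits.
# Rate and burst values are hypothetical, not the production settings.
import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec        # tokens replenished per second
        self.capacity = burst           # maximum burst size
        self.tokens = float(burst)      # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        """Return True if the client may make a request right now."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Tightened limit for the overloading client (hypothetical numbers).
bucket = TokenBucket(rate_per_sec=5, burst=10)
results = [bucket.allow() for _ in range(20)]
print(results.count(True))  # roughly the first `burst` calls pass; the rest are throttled
```

Lowering `rate_per_sec` or `burst` for a specific client caps how much load it can place on the database without affecting other integrations.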
Preventive Measures:
We are re-evaluating the default throttling value for API clients and may lower it to prevent recurrence.
We have implemented a procedural change to forcibly recalculate database statistics after any massive data update, preventing a similar condition in the future.
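A minimal sketch of that procedural change, using SQLite's ANALYZE as a stand-in for the production database's statistics refresh. The table, index, and data here are hypothetical; the point is that the bulk load and the statistics refresh happen in one procedure, so one can never occur without the other.

```python
# Sketch: couple every bulk data update to a statistics refresh.
# Uses SQLite's ANALYZE as an illustrative stand-in; table and column
# names are hypothetical.
import sqlite3

def bulk_update_with_stats_refresh(conn, rows):
    """Apply a bulk data load, then immediately refresh optimizer statistics."""
    with conn:  # one transaction for the bulk load
        conn.executemany(
            "INSERT INTO events (client_id, payload) VALUES (?, ?)", rows
        )
    # The procedural change: a massive data update is never left without
    # a statistics refresh, so the optimizer keeps choosing good plans.
    conn.execute("ANALYZE")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (client_id INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_events_client ON events (client_id)")

bulk_update_with_stats_refresh(conn, [(i % 10, "data") for i in range(1000)])

# After ANALYZE, SQLite records per-index statistics in sqlite_stat1.
stats = conn.execute("SELECT * FROM sqlite_stat1").fetchall()
print(stats)
```

In a production database the equivalent step would be its native statistics command (for example, PostgreSQL's ANALYZE), invoked at the end of the overnight update job.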