System Degradation
Incident Report for ServiceChannel
Postmortem

Date of Incident: 02/21/2019

Time/Date Incident Started: 02/21/2019, 10:42am EST

Time/Date Stability Restored: 02/21/2019, 2:42pm EST

Time/Date Incident Resolved: 02/21/2019, 6:50pm EST

Users Impacted: Active users

Frequency: Permanent

Impact: Major

Incident description:

Performance across all systems was degraded: response times increased sharply and users were intermittently unable to log in.

Root Cause Analysis:

Investigation through our monitoring systems revealed a significant number of “database waits” pointing to a particular API call performed by one of the integration clients.

Further investigation revealed that an overnight database update had not been followed by a database statistics refresh. With stale statistics, the database chose non-optimal execution plans, which produced the large number of waits and, in turn, the slow performance.

Actions Taken:

We initially disabled the API client that was overloading the system.

Database statistics were forcibly recalculated, and tests were performed to validate that the queries were no longer overloading the system.

The throttling threshold was adjusted for that client to prevent any further system impact.

The system was closely monitored for an additional 4 hours to ensure stability.
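
As an illustration of the per-client throttling adjustment described above, the following is a minimal token-bucket sketch. It is not ServiceChannel's implementation; the class names, the client ID "integration-client", and all rate values are hypothetical and chosen only for the example.

```python
import time


class TokenBucket:
    """Allows roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ClientThrottle:
    """Applies a default limit to every API client, with per-client overrides."""

    def __init__(self, default_rate, default_capacity):
        self.default = (default_rate, default_capacity)
        self.overrides = {}  # client_id -> (rate, capacity)
        self.buckets = {}    # client_id -> TokenBucket

    def set_limit(self, client_id, rate, capacity):
        # Tighten (or loosen) the limit for one client; its bucket is rebuilt on next use.
        self.overrides[client_id] = (rate, capacity)
        self.buckets.pop(client_id, None)

    def allow(self, client_id, now=None):
        if client_id not in self.buckets:
            rate, capacity = self.overrides.get(client_id, self.default)
            self.buckets[client_id] = TokenBucket(rate, capacity, now=now)
        return self.buckets[client_id].allow(now)


# Hypothetical values: a default of 10 requests/second, with one noisy client cut to 1.
throttle = ClientThrottle(default_rate=10, default_capacity=10)
throttle.set_limit("integration-client", rate=1, capacity=2)
```

A request handler would call `throttle.allow(client_id)` and reject or delay the call when it returns `False`. Lowering the default, as considered under Mitigation Measures, would amount to changing the two constructor arguments.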

Mitigation Measures:

We are re-evaluating the default throttling value for API clients and may lower it to prevent recurrence.

We have implemented a procedural change to forcibly recalculate database statistics after large data updates, so that a massive update cannot cause a similar condition in the future.
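
The procedural change above, pairing every large data update with a statistics refresh, can be sketched as follows. This report does not name the database engine, so the sketch uses SQLite (through Python's built-in `sqlite3` module) purely as a stand-in, where the refresh command is `ANALYZE`; other engines have their own equivalents, such as `UPDATE STATISTICS` in SQL Server. The `work_orders` table, its columns, and the function names are hypothetical.

```python
import sqlite3


def bulk_update_with_stats_refresh(conn, rows):
    """Apply a large data update, then refresh planner statistics before returning."""
    with conn:  # commit the whole update as a single transaction
        conn.executemany(
            "INSERT INTO work_orders (client_id, status) VALUES (?, ?)", rows
        )
    # Refresh statistics immediately so the query planner does not keep
    # working from pre-update estimates (SQLite stand-in for the real engine).
    conn.execute("ANALYZE")
    conn.commit()


def stats_rows(conn):
    """Return the statistics rows recorded for the hypothetical work_orders table."""
    return conn.execute(
        "SELECT tbl, idx, stat FROM sqlite_stat1 WHERE tbl = 'work_orders'"
    ).fetchall()


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work_orders (id INTEGER PRIMARY KEY, client_id INTEGER, status TEXT)"
)
conn.execute("CREATE INDEX idx_work_orders_client ON work_orders (client_id)")
bulk_update_with_stats_refresh(conn, [(i % 50, "open") for i in range(5000)])
```

Bundling the refresh into the same routine as the update means the step cannot be skipped by accident, which is the failure mode identified in the root cause analysis.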

Posted Feb 22, 2019 - 15:19 EST

Resolved
All services are confirmed running as expected. We consider this incident to be resolved.
Posted Feb 21, 2019 - 18:50 EST
Monitoring
Our engineers have identified and rectified the issue with service performance. We are continuing to monitor, but performance has improved and systems are back to normal.
Posted Feb 21, 2019 - 14:42 EST
Update
Thank you for your patience while we continue to investigate our service performance issues. Our engineers and all available resources are working to resolve this issue as quickly as possible. We will continue to share updates as we have new information.
Posted Feb 21, 2019 - 13:28 EST
Investigating
We are currently experiencing system performance issues. Our development team is actively working on this and we will provide updates as soon as possible.
Posted Feb 21, 2019 - 11:20 EST
This incident affected: Call Center, Fixxbook, IVR, Service Automation, Store Dashboard, and Mobile Applications.