System Performance Degradation: Notification Services Incident Report
Date of Incident: 05/11/2020
Time/Date Incident Started: 05/11/2020, 7:35 AM EDT
Time/Date Stability Restored: 05/11/2020, 10:10 AM EDT
Time/Date Incident Resolved: 05/11/2020, 10:30 AM EDT
Users Impacted: All Users
Frequency: Sustained
Impact: Major
Incident description:
The SRE team discovered that the master database server was exhibiting sustained, substantially higher-than-normal CPU load. This resembled a pattern the Servicechannel system had experienced the week prior, which had originally been believed to be transient.
Upon further investigation, the SRE response team pinpointed the issue to a single external customer performing specific API requests against our XML and Notification integrations. Once those requests were identified, we extended our work order notes cleanup process to include older records.
Root Cause Analysis:
A client integration process was generating duplicate work order notes, which caused excessive API and XML requests for both outgoing and incoming data transfers:
- A large number of duplicate WO notes were created by a client system via XML transactions
- Each duplicate WO note then triggered the Notifications process (events the client had subscribed to be notified about), issuing excessive API calls whose performance degraded at that volume
- The API calls gradually created a backlog of requests, pushing database CPU utilization to 100%
- Upon disabling the Notifications process, the chain of events (notifications triggering API calls) was broken, allowing the system to return to normal
- Narrowing down the cause of the excessive API calls to a specific client's integration and examining their data, we noticed large numbers of duplicate WO notes
- Subsequent investigation determined that these WO notes were generated by the client integration; the duplicates caused the Notifications system to trigger an event for each note, resulting in a large number of slow API calls
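The duplicate detection described above can be sketched with a simple grouping query. This is a minimal illustration using a hypothetical `wo_notes` table in SQLite; the actual ServiceChannel schema, table, and column names differ.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wo_notes "
    "(id INTEGER PRIMARY KEY, wo_id INTEGER, note_text TEXT, created_at TEXT)"
)
conn.executemany("INSERT INTO wo_notes VALUES (?, ?, ?, ?)", [
    (1, 101, "Tech dispatched", "2020-05-10"),
    (2, 101, "Tech dispatched", "2020-05-10"),  # duplicate
    (3, 101, "Tech dispatched", "2020-05-10"),  # duplicate
    (4, 102, "Part ordered",    "2020-05-10"),
])

# Group identical notes per work order; any count > 1 is a duplicate set.
dupes = conn.execute(
    """SELECT wo_id, note_text, COUNT(*) AS n
       FROM wo_notes
       GROUP BY wo_id, note_text
       HAVING n > 1"""
).fetchall()
print(dupes)  # [(101, 'Tech dispatched', 3)]
```

Each row returned represents one set of duplicates; every duplicate in such a set fired its own notification event, which is what amplified the API call volume.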
Actions Taken:
- Identified the specific module that was causing performance issues
- Applied a more aggressive cleanup process for duplicate data with a larger date window (i.e., the last 6 months)
- Ensured that the system stabilized and was operating within the expected baseline
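The widened cleanup step above can be sketched as a purge that keeps the earliest note in each duplicate set and deletes the rest, limited to the enlarged date window. The table, columns, and cutoff value here are hypothetical, not the production job.

```python
import sqlite3

# Minimal sketch of the duplicate-note purge, assuming a hypothetical
# wo_notes table; real table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wo_notes "
    "(id INTEGER PRIMARY KEY, wo_id INTEGER, note_text TEXT, created_at TEXT)"
)
conn.executemany("INSERT INTO wo_notes VALUES (?, ?, ?, ?)", [
    (1, 101, "Tech dispatched", "2020-05-10"),
    (2, 101, "Tech dispatched", "2020-05-10"),  # duplicate, in window -> purged
    (3, 101, "Tech dispatched", "2019-01-01"),  # duplicate, outside window -> kept
    (4, 102, "Part ordered",    "2020-05-10"),
])

# Widened cleanup window: roughly the 6 months before the incident date.
cutoff = "2019-11-11"
conn.execute(
    """DELETE FROM wo_notes
       WHERE created_at >= ?
         AND id NOT IN (SELECT MIN(id) FROM wo_notes
                        GROUP BY wo_id, note_text)""",
    (cutoff,),
)
remaining = [r[0] for r in conn.execute("SELECT id FROM wo_notes ORDER BY id")]
print(remaining)  # [1, 3, 4]
```

Keeping `MIN(id)` per group preserves one canonical note per work order while removing the duplicates that drive notification volume.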
Mitigation Measures:
- Use an aggressive cleanup and purge process for duplicate work order notes.
- Implement a more sophisticated throttling system for API calls to ensure that a single client's calls cannot cause the kind of system degradation experienced in this incident.
- Work with the client to understand what they were attempting to retrieve from our system and optimize their workflow.
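The proposed API throttling could take the shape of a per-client token bucket, which caps sustained request rate while allowing short bursts. This is a minimal sketch under assumed parameters; class and parameter names are illustrative, not an existing ServiceChannel component.

```python
import time

class TokenBucket:
    """Per-client token-bucket rate limiter (illustrative sketch)."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # request admitted
        return False      # request throttled

# Assumed limits: ~5 calls/sec sustained, bursts of up to 10.
bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(12)]
print(results.count(True))  # 10: the burst passes, the excess is rejected
```

Throttled requests would be rejected (e.g., with HTTP 429) instead of queuing, so a single misbehaving integration cannot build the request backlog that drove database CPU to 100% in this incident.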