Intermittent Issues with Notifications to External Systems
Incident Report for ServiceChannel
Postmortem

System Performance Degradation: Notification Services Incident Report

Date of Incident: 05/11/2020

Time/Date Incident Started: 05/11/2020, 7:35 AM EDT

Time/Date Stability Restored: 05/11/2020, 10:10 AM EDT

Time/Date Incident Resolved: 05/11/2020, 10:30 AM EDT

Users Impacted: All Users

Frequency: Sustained

Impact: Major

Incident description:

The SRE team discovered that the master database server was exhibiting sustained, substantially higher-than-normal CPU load. This resembled a pattern the ServiceChannel system had experienced the week prior, which had originally been believed to be transient.

Upon further investigation, the SRE response team was able to trace the issue to a single external customer whose system was performing specific API requests against our XML and Notification integrations. Once the requests were identified, we extended our work order notes cleanup process to include older records.

Root Cause Analysis:

A client integration process was generating duplicate work order (WO) notes, which caused excessive API and XML requests for both outgoing and incoming data transfers.

  • A large number of duplicate WO notes were created by a client system via XML transactions.
  • These duplicate WO notes then triggered the Notifications process (for events the client had subscribed to), which issued an excessive number of API calls that performed poorly at that volume.
  • The API calls gradually built up a backlog of requests, pushing database CPU utilization to 100%.
  • Disabling the Notifications process broke the chain of events (notifications triggering API calls) and allowed the system to return to normal.
  • After narrowing down the cause of the excessive API calls to a specific client’s integration and examining their data, we noticed large numbers of duplicate WO notes.
  • Subsequent investigation determined that these WO notes were generated by the client integration. The duplicate WO notes caused the Notifications system to trigger notifications to process these events, which resulted in a large number of slow API calls.
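
The chain described above (duplicate note, then notification, then API call) can be broken at its first link by skipping notifications for notes that have already been seen. The following is a minimal sketch of that idea only, not a description of ServiceChannel's implementation; the `Note` fields, the `note_key` fingerprint, and the `NoteIngestor` class are all hypothetical names chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    # Hypothetical note shape; real WO notes carry more fields.
    work_order_id: int
    author: str
    text: str


def note_key(note: Note) -> str:
    """Fingerprint a note by work order, author, and normalized text."""
    # Collapse whitespace and case so trivially different copies match.
    normalized = " ".join(note.text.split()).lower()
    raw = f"{note.work_order_id}|{note.author}|{normalized}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class NoteIngestor:
    """Accepts incoming notes and notifies only on the first copy of each."""

    def __init__(self, notify):
        self._seen = set()      # in-memory for the sketch; assumed persistent in practice
        self._notify = notify   # callback standing in for the Notifications process

    def ingest(self, note: Note) -> bool:
        key = note_key(note)
        if key in self._seen:
            return False        # duplicate: no notification is fired
        self._seen.add(key)
        self._notify(note)
        return True
```

With this guard in place, a client that resubmits the same note many times would trigger one notification instead of one per copy.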

Actions Taken:

  1. Identified the specific module that was causing performance issues
  2. Applied a more aggressive cleanup of duplicate data over a wider date range (i.e. the last 6 months)
  3. Ensured that the system stabilized and was working within the expected baseline
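
As an illustration of the cleanup in step 2, the sketch below deletes duplicate notes created within a configurable window while keeping the earliest copy of each. It assumes a hypothetical `notes` table with `id`, `work_order_id`, `body`, and `created_at` (ISO 8601 text) columns and uses SQLite only so the example is self-contained; the actual schema and database engine are not stated in this report.

```python
import sqlite3
from datetime import datetime, timedelta


def purge_duplicate_notes(conn, now, window_days=180):
    """Delete duplicate notes newer than the cutoff; return rows removed.

    A note is a duplicate when another note with the same work order and
    body has a lower id. The lowest-id copy is always kept.
    """
    # ISO 8601 strings sort chronologically, so text comparison is safe here.
    cutoff = (now - timedelta(days=window_days)).isoformat()
    cur = conn.execute(
        """
        DELETE FROM notes
        WHERE created_at >= ?
          AND id NOT IN (
              SELECT MIN(id) FROM notes GROUP BY work_order_id, body
          )
        """,
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Widening `window_days` corresponds to the "wider date range" described above; 180 days is used here as an approximation of six months.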

Mitigation Measures:

  1. Use an aggressive cleanup and purge process for duplicate work order notes.
  2. Implement a more sophisticated throttling system for API calls to ensure that an API call cannot cause the kind of system degradation experienced in this incident.
  3. Work with the client to understand what they were attempting to retrieve from our system and optimize the workflow.
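
One common way to approach the throttling in measure 2 is a per-client token bucket, which lets each client make short bursts of calls but caps its sustained rate so that one integration cannot saturate shared resources. The sketch below is an assumption about how such a limiter could look, not the system ServiceChannel built; `TokenBucket`, `ClientThrottle`, and the `rate` and `capacity` parameters are illustrative names.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class ClientThrottle:
    """Keeps one bucket per client so a noisy client only limits itself."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, client_id):
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()
```

A request handler would call `allow(client_id)` before doing any database work and reject or queue the request when it returns `False`. The injectable `clock` exists only to make the limiter easy to test deterministically.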
Posted Jun 16, 2020 - 13:10 EDT

Resolved
Notification services are working normally. After overnight monitoring of the performance of the notifications system, we consider this issue to be resolved.
Posted May 09, 2020 - 13:08 EDT
Monitoring
Notification services are working normally. We are monitoring as any backlog of notifications clears through the system.
Posted May 08, 2020 - 18:06 EDT
Identified
We have identified the issue that is causing degraded performance in the notifications system. We will provide updates as the investigation continues.
Posted May 07, 2020 - 11:51 EDT
Investigating
ServiceChannel notifications to external systems are experiencing intermittent issues. These sporadic issues may impact autoassignment and connectivity to external systems. We are investigating.
Posted May 06, 2020 - 12:41 EDT
This incident affected: API (API Response, SendXML).