ServiceChannel System Performance Degradation

Incident Report for ServiceChannel

Postmortem

Date of Incident: 06/12/2025

Time/Date Incident Started: 06/12/2025, 9:35 am EDT

Time/Date Stability Restored: 06/12/2025, 2:12 pm EDT

Time/Date Incident Resolved: 06/12/2025, 2:12 pm EDT

  

Users Impacted: All 

Frequency: Intermittent 

Impact: Major 

  

Incident description: 

US clients experienced intermittent issues loading their work order lists due to degraded performance in the underlying services responsible for filtering and NTE (not-to-exceed) calculations.

   

Root Cause Analysis: 

Processing certain very large provider accounts with a high number of serviceable locations triggered significant memory overconsumption in an internal Elasticsearch service, driving its response times above 60 seconds. The system had been functioning normally until these large accounts triggered the condition, at which point the degraded search service cascaded into work order list failures for all users. The intermittent nature of these issues extended the overall duration of the user-facing impact.
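
For illustration only, here is a minimal sketch of why account size mattered, assuming the filtering path expanded every serviceable location for a provider into the search request. The data model, field names, and Python code below are hypothetical, not ServiceChannel internals.

    # Hypothetical illustration of the failure mode: the data model and the
    # "terms" filter construction are assumptions, not ServiceChannel code.
    from dataclasses import dataclass, field


    @dataclass
    class ProviderAccount:
        provider_id: str
        serviceable_location_ids: list[str] = field(default_factory=list)


    def build_location_filter(account: ProviderAccount) -> dict:
        """Build a search filter whose payload grows linearly with the number
        of serviceable locations, so very large accounts produce very large,
        memory-hungry requests."""
        return {"terms": {"location_id": account.serviceable_location_ids}}


    if __name__ == "__main__":
        small = ProviderAccount("p-1", [f"loc-{i}" for i in range(100)])
        huge = ProviderAccount("p-2", [f"loc-{i}" for i in range(500_000)])
        print(len(build_location_filter(small)["terms"]["location_id"]))  # 100
        print(len(build_location_filter(huge)["terms"]["location_id"]))   # 500000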

 

Actions Taken: 

As soon as the issue was identified, our team initiated a series of mitigation steps to restore service as quickly as possible: 

  • Restarted and scaled out the Elasticsearch service to address potential performance or resource bottlenecks. This had a slight positive effect but wasn't enough to stabilize the system. 

  • Rolled back recent changes affecting client interaction with Elasticsearch, resulting in minor improvements. 

  • Removed multiple clients from the Elasticsearch service to reduce load, which slightly decreased strain, but overall service stability was still insufficient. 

  • Rolled back Elasticsearch itself to a previously stable version, reducing some system pressure but not fully resolving the problem. 

  • Finally, rolled back the entire release, leading to system recovery and a return to normal service performance within approximately 30 minutes. 

 

Mitigation Measures: 

  • Implemented code updates to address an edge case affecting providers with extensive serviceable location networks. 

  • Improved system resilience by adding fallback mechanisms that keep work order queries available when Elasticsearch is degraded (a hypothetical sketch of this pattern follows this list).

  • Expanded monitoring capabilities with incident-specific metrics to increase visibility and decrease diagnostic time.
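
For illustration only, here is a minimal, self-contained sketch of the fallback pattern referenced above; the function names, timeout value, and data shapes are assumptions rather than ServiceChannel internals.

    # Hypothetical sketch of an Elasticsearch fallback path; all names and the
    # timeout value are illustrative assumptions.
    import logging

    SEARCH_TIMEOUT_SECONDS = 5.0  # assumed per-request time budget


    class SearchTimeout(Exception):
        """Raised when the search backend exceeds its time budget."""


    def search_elasticsearch(client_id: str, filters: dict) -> list[dict]:
        """Stand-in for the Elasticsearch-backed work order query; real code
        would issue the search with a hard per-request timeout."""
        return [{"client_id": client_id, "source": "elasticsearch", **filters}]


    def search_primary_database(client_id: str, filters: dict) -> list[dict]:
        """Stand-in for a slower but reliable query against the primary store."""
        return [{"client_id": client_id, "source": "database", **filters}]


    def load_work_order_list(client_id: str, filters: dict) -> list[dict]:
        """Serve the work order list from search, but never let a slow or
        failing search take the page down: fall back to the database."""
        try:
            return search_elasticsearch(client_id, filters)
        except (SearchTimeout, ConnectionError):
            logging.warning("Search degraded for %s; using database fallback", client_id)
            return search_primary_database(client_id, filters)


    if __name__ == "__main__":
        print(load_work_order_list("client-123", {"status": "OPEN"}))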

Posted Jun 23, 2025 - 15:41 EDT

Resolved

This incident has been resolved.
Posted Jun 12, 2025 - 16:22 EDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 12, 2025 - 14:44 EDT

Update

We continue to see issues with work order search; the team is actively troubleshooting.
Posted Jun 12, 2025 - 12:24 EDT

Investigating

We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Posted Jun 12, 2025 - 10:14 EDT
This incident affected: Mobile Applications (SC Mobile) and Service Automation (Work Order Manager).