Date of Incident: 06/12/2025
Time/Date Incident Started: 06/12/2025, 9:35 am EDT
Time/Date Stability Restored: 06/12/2025, 2:12 pm EDT
Time/Date Incident Resolved: 06/12/2025, 2:12 pm EDT
Users Impacted: All
Frequency: Intermittent
Impact: Major
Incident description:
US clients experienced intermittent issues loading their work order lists due to degraded performance in the underlying services responsible for filtering and NTE calculations.
Root Cause Analysis:
Processing certain very large provider accounts with a high number of serviceable locations triggered significant memory overconsumption, which caused prolonged response times (>60 seconds) from an internal Elasticsearch service and affected all users. The system had been functioning normally until these large provider accounts triggered the condition, which then cascaded into work order list issues for all users. The intermittent nature of these issues extended the overall duration of the user-facing impact.
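For illustration only, the sketch below shows the kind of query pattern that can produce this behavior: a work order list query that expands a provider's full set of serviceable locations into a single filter. The client library usage is real (@elastic/elasticsearch), but the index, field, and function names are hypothetical and are not taken from the affected service.

```typescript
import { Client } from '@elastic/elasticsearch';

// Hypothetical cluster address, for illustration only.
const es = new Client({ node: 'http://localhost:9200' });

// Hypothetical illustration of the problematic pattern: expanding a provider's
// full set of serviceable locations into a single terms filter. For most
// providers this list is small, but for very large accounts it can contain tens
// of thousands of IDs, inflating the query payload and the work Elasticsearch
// must do per request, which is consistent with the prolonged (>60 second)
// response times described above.
async function loadWorkOrders(serviceableLocationIds: string[]) {
  return es.search({
    index: 'work-orders', // hypothetical index name
    size: 50,
    query: {
      bool: {
        filter: [{ terms: { locationId: serviceableLocationIds } }],
      },
    },
  });
}
```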
Actions Taken:
As soon as the issue was identified, our team initiated a series of mitigation steps to restore service as quickly as possible:
Restarted and scaled out the Elasticsearch service to address potential performance or resource bottlenecks. This had a slight positive effect but was not enough to stabilize the system.
Rolled back recent changes affecting client interaction with Elasticsearch, resulting in minor improvements.
Removed multiple clients from the Elasticsearch service to reduce load; this slightly decreased strain, but the service remained unstable.
Rolled back Elasticsearch itself to a previously stable version, reducing some system pressure but not fully resolving the problem.
Finally, rolled back the entire release, leading to system recovery and a return to normal service performance within approximately 30 minutes.
Mitigation Measures:
Implemented code updates to address an edge case affecting providers with extensive serviceable location networks (a sketch of the batching idea follows below).
Hardened the code with Elasticsearch fallback mechanisms to improve system resilience (see the fallback sketch below).
Expanded monitoring capabilities with incident-specific metrics to increase visibility and decrease diagnostic time.
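A minimal sketch of the batching idea behind the edge-case fix, assuming the work order list is assembled by querying serviceable locations in bounded chunks rather than all at once. The chunk size and helper names are assumptions, and loadWorkOrders refers to the hypothetical helper in the earlier sketch.

```typescript
// Hypothetical sketch: process serviceable locations in bounded chunks instead
// of all at once, so a single very large provider account cannot push query
// size or memory use past safe limits.
const CHUNK_SIZE = 1000; // assumed bound; the real limit would be tuned to measured payloads

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

async function loadWorkOrdersChunked(serviceableLocationIds: string[]) {
  const hits = [];
  for (const ids of chunk(serviceableLocationIds, CHUNK_SIZE)) {
    // Each chunk issues a bounded query (reusing the hypothetical loadWorkOrders
    // helper from the earlier sketch); results are merged afterwards.
    const page = await loadWorkOrders(ids);
    hits.push(...page.hits.hits);
  }
  return hits;
}
```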
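A minimal sketch of what the Elasticsearch fallback could look like, assuming a secondary data path (for example, a direct database query) and a metrics client exist; the timeout budget, metric name, and fallback function are illustrative assumptions, not the actual implementation. It builds on the chunked loader above.

```typescript
// Hypothetical stubs for the secondary data path and the metrics client.
declare function loadWorkOrdersFromDatabase(ids: string[]): Promise<unknown[]>;
declare const metrics: { increment(name: string): void };

// Resolve the promise, or reject once the time budget is exhausted.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms),
    ),
  ]);
}

async function loadWorkOrdersResilient(serviceableLocationIds: string[]) {
  try {
    // Primary path: Elasticsearch via the chunked loader above, bounded by an
    // assumed 5-second budget.
    return await withTimeout(loadWorkOrdersChunked(serviceableLocationIds), 5_000);
  } catch {
    // Fallback path: a hypothetical direct query against the primary datastore,
    // plus an incident-specific metric so the degraded path is visible immediately.
    metrics.increment('work_orders.es_fallback');
    return loadWorkOrdersFromDatabase(serviceableLocationIds);
  }
}
```

Note that Promise.race here only stops waiting; it does not cancel the in-flight Elasticsearch request, which is an acceptable simplification for a sketch of the fallback behavior.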