Date of Incident: 06/12/2025
Time/Date Incident Started: 06/12/2025, 9:35 am EDT
Time/Date Stability Restored: 06/12/2025, 2:12 pm EDT
Time/Date Incident Resolved: 06/12/2025, 2:12 pm EDT
Users Impacted: All
Frequency: Intermittent
Impact: Major
Incident description:
US clients experienced intermittent issues loading their work order lists due to degraded performance in the underlying services responsible for filtering and NTE calculations.
Root Cause Analysis:
Processing certain very large provider accounts with a high number of serviceable locations triggered significant memory overconsumption, which caused prolonged response times (>60 seconds) from an internal Elasticsearch service and affected all users. The system had been functioning normally until these large provider accounts triggered the condition, which then cascaded into work order list issues for all users. The intermittent nature of these issues extended the overall duration of the user-facing impact.
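For illustration only, the sketch below shows the kind of query pattern that can produce this behavior: a work order list query that expands a provider's full set of serviceable locations into a single filter. The client library usage is real (@elastic/elasticsearch), but the index, field, and function names are hypothetical and are not taken from the affected service.

```typescript
import { Client } from '@elastic/elasticsearch';

// Hypothetical cluster address, for illustration only.
const es = new Client({ node: 'http://localhost:9200' });

// Hypothetical illustration of the problematic pattern: expanding a provider's
// full set of serviceable locations into a single terms filter. For most
// providers this list is small, but for very large accounts it can contain tens
// of thousands of IDs, inflating the query payload and the work Elasticsearch
// must do per request, which is consistent with the prolonged (>60 second)
// response times described above.
async function loadWorkOrders(serviceableLocationIds: string[]) {
  return es.search({
    index: 'work-orders', // hypothetical index name
    size: 50,
    query: {
      bool: {
        filter: [{ terms: { locationId: serviceableLocationIds } }],
      },
    },
  });
}
```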
Actions Taken:
As soon as the issue was identified, our team initiated a series of mitigation steps to restore service as quickly as possible:
Restarted and scaled out the Elasticsearch service to address potential performance or resource bottlenecks. This had a slight positive effect but was not enough to stabilize the system.
Rolled back recent changes affecting client interaction with Elasticsearch, resulting in minor improvements.
Removed multiple clients from the Elasticsearch service to reduce load; this slightly decreased strain, but the service remained unstable.
Rolled back Elasticsearch itself to a previously stable version, reducing some system pressure but not fully resolving the problem.
Finally, rolled back the entire release, leading to system recovery and a return to normal service performance within approximately 30 minutes.
Mitigation Measures:
Implemented code updates to address an edge case affecting providers with extensive serviceable location networks (a sketch of the batching idea follows below).
Hardened the code with Elasticsearch fallback mechanisms to improve system resilience (see the fallback sketch below).
Expanded monitoring capabilities with incident-specific metrics to increase visibility and decrease diagnostic time.
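A minimal sketch of the batching idea behind the edge-case fix, assuming the work order list is assembled by querying serviceable locations in bounded chunks rather than all at once. The chunk size and helper names are assumptions, and loadWorkOrders refers to the hypothetical helper in the earlier sketch.

```typescript
// Hypothetical sketch: process serviceable locations in bounded chunks instead
// of all at once, so a single very large provider account cannot push query
// size or memory use past safe limits.
const CHUNK_SIZE = 1000; // assumed bound; the real limit would be tuned to measured payloads

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

async function loadWorkOrdersChunked(serviceableLocationIds: string[]) {
  const hits = [];
  for (const ids of chunk(serviceableLocationIds, CHUNK_SIZE)) {
    // Each chunk issues a bounded query (reusing the hypothetical loadWorkOrders
    // helper from the earlier sketch); results are merged afterwards.
    const page = await loadWorkOrders(ids);
    hits.push(...page.hits.hits);
  }
  return hits;
}
```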
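A minimal sketch of what the Elasticsearch fallback could look like, assuming a secondary data path (for example, a direct database query) and a metrics client exist; the timeout budget, metric name, and fallback function are illustrative assumptions, not the actual implementation. It builds on the chunked loader above.

```typescript
// Hypothetical stubs for the secondary data path and the metrics client.
declare function loadWorkOrdersFromDatabase(ids: string[]): Promise<unknown[]>;
declare const metrics: { increment(name: string): void };

// Resolve the promise, or reject once the time budget is exhausted.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms),
    ),
  ]);
}

async function loadWorkOrdersResilient(serviceableLocationIds: string[]) {
  try {
    // Primary path: Elasticsearch via the chunked loader above, bounded by an
    // assumed 5-second budget.
    return await withTimeout(loadWorkOrdersChunked(serviceableLocationIds), 5_000);
  } catch {
    // Fallback path: a hypothetical direct query against the primary datastore,
    // plus an incident-specific metric so the degraded path is visible immediately.
    metrics.increment('work_orders.es_fallback');
    return loadWorkOrdersFromDatabase(serviceableLocationIds);
  }
}
```

Note that Promise.race here only stops waiting; it does not cancel the in-flight Elasticsearch request, which is an acceptable simplification for a sketch of the fallback behavior.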