System Performance Degradation
Incident Report for ServiceChannel
Postmortem

System Performance Degradation Across Modules - Incident Report

Date of Incident: 05/19/2022

Time/Date Incident Started: 05/19/2022, 08:50 am EDT

Time/Date Stability Restored: 05/19/2022, 09:46 am EDT

Time/Date Incident Resolved: 05/19/2022, 09:58 am EDT

Users Impacted: All users

Frequency: Intermittent

Impact: Major

Incident description:

The ServiceChannel monitoring system detected system performance degradation and increased application latencies. The ServiceChannel Site Reliability Engineering (SRE) team opened an investigation. Impacted customers reported general slowness while using the ServiceChannel platform.

The SRE team first confirmed that the core system components, whose degradation would have a cascading adverse effect on many modules, were in fact healthy. The team then determined that our cloud provider was likely experiencing unreported performance degradation on its backend.

The SRE team engaged our cloud provider’s support engineers to establish a root cause. During a lengthy exploratory conference bridge, the cloud provider’s support engineers found evidence of a networking failure within a primary datacenter, consistent with the timeline of the incident.

Root Cause Analysis:

A transient networking failure at our cloud provider’s datacenter caused slowness and degraded performance for end users of the ServiceChannel platform.

Actions Taken: 

  1. The SRE team’s monitoring tools issued alerts related to increased latency across all ServiceChannel platform modules.
  2. The SRE team conducted an investigation, examining application logs, key infrastructure performance metrics, and other telemetry.
  3. After determining that core system components were in fact healthy, the SRE team engaged our cloud provider’s support engineers.
  4. Our cloud provider was able to provide evidence of networking failures at their datacenter during the time of the incident.

Mitigation Measures:  

  1. The SRE team is exploring alternative deployment patterns to improve resiliency during transient failures at our cloud provider.
Posted Jun 03, 2022 - 12:53 EDT

Resolved
This incident has been resolved. All services are working as expected.
Posted May 19, 2022 - 10:36 EDT
Monitoring
System stability has been restored and services are functioning normally. We will continue to monitor closely for any further issues.
Posted May 19, 2022 - 09:58 EDT
Investigating
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Posted May 19, 2022 - 09:34 EDT
This incident affected: Service Automation (Login, Dashboard, Work Order Manager, Proposal Manager, Invoice Manager) and Provider Automation (Work Order Manager, Proposal Manager).