System Performance Degradation

Incident Report for ServiceChannel

Postmortem

Date of Incident: 07/26/2019

Time/Date Incident Started: 07/26/2019, 09:53 am EDT

Time/Date Stability Restored: 07/26/2019, 10:35 am EDT

Time/Date Incident Resolved: 07/26/2019, 11:29 am EDT

Users Impacted: Some active users

Frequency: Intermittent

Impact: Minor

Incident description:

Degraded system performance caused slow logins, dashboards, and work order lists.

Root Cause Analysis:

An intermittent failure on one read replica in our production database cluster delayed certain application requests, intermittently degrading the performance of login, dashboards, and work order lists. Stability was restored by 10:35 am EDT, and the incident was declared resolved at 11:29 am EDT.

After the affected node was replaced, some users briefly experienced slower performance while database indexes were rebuilt on the new node. The index rebuild completed at 12:30 pm EDT.

Actions Taken:

Removed the failing read replica from rotation on the load balancer.
Replaced the node with a stand-by instance.
Rebuilt indexes on the new node.
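
As an illustration only, the first two actions above can be sketched as operations on a round-robin pool of read replicas. The ReplicaPool class and the node names (replica-1, standby-1, and so on) are hypothetical and do not describe the actual load balancer or cluster configuration.

```python
class ReplicaPool:
    """Hypothetical round-robin pool of read replicas (illustration only)."""

    def __init__(self, replicas):
        self._active = list(replicas)
        self._cursor = 0

    def pick(self):
        """Return the next replica in rotation for a read request."""
        if not self._active:
            raise RuntimeError("no read replicas in rotation")
        replica = self._active[self._cursor % len(self._active)]
        self._cursor += 1
        return replica

    def remove(self, replica):
        """Take a replica out of rotation so no new reads are routed to it."""
        self._active.remove(replica)

    def add(self, replica):
        """Put a replica (for example, a stand-by instance) into rotation."""
        if replica not in self._active:
            self._active.append(replica)


# Hypothetical node names; "replica-2" plays the role of the failing node.
pool = ReplicaPool(["replica-1", "replica-2", "replica-3"])
pool.remove("replica-2")   # remove the failing replica from rotation
pool.add("standby-1")      # replace it with a stand-by instance
picks = [pool.pick() for _ in range(6)]
```

After these two calls, reads are spread across the remaining healthy replicas and the stand-by, and the failing node receives no new requests.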

Mitigation Measures:

Increase monitor sensitivity so that the SRE team is notified of similar intermittent failures on database read replicas.
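
One way such a monitor could work, sketched here under stated assumptions, is to count failed health checks within a sliding window instead of waiting for several consecutive failures, which an intermittent fault may never produce. The class name, window size, and failure threshold below are illustrative assumptions, not the SRE team's actual monitoring configuration.

```python
from collections import deque


class IntermittentFailureMonitor:
    """Hypothetical sliding-window monitor for replica health checks."""

    def __init__(self, window_size=10, max_failures=3):
        # Keep only the most recent `window_size` results (assumed values).
        self._results = deque(maxlen=window_size)
        self._max_failures = max_failures

    def record(self, ok):
        """Record one health-check result; return True if an alert should fire."""
        self._results.append(ok)
        failures = sum(1 for result in self._results if not result)
        return failures >= self._max_failures


# Failures never occur twice in a row, so a "consecutive failures" rule
# would stay silent, while the windowed count crosses the threshold.
monitor = IntermittentFailureMonitor(window_size=10, max_failures=3)
checks = [True, False, True, True, False, True, False, True]
alerts = [monitor.record(ok) for ok in checks]
```

In this sketch the alert first fires on the third failure within the window and stays raised until the older failures age out of it.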

Posted Oct 09, 2019 - 14:35 EDT

Resolved

We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.

...

Our engineering team has identified the issue and services are returning to normal. We are continuing to monitor.

...

All services are confirmed running as expected. We consider this incident to be resolved.

Posted Jul 26, 2019 - 09:53 EDT