Date of Incident: 06/13/2019
Time/Date Incident Started: 06/13/2019, 10:35 am EST
Time/Date Stability Restored: 06/13/2019, 1:12 pm EST
Time/Date Incident Resolved: 06/13/2019, 1:18 pm EST
Users Impacted: Active users
Frequency: Intermittent
Impact: Major
Incident description:
System performance degradation in which users were unable to log in to the system or experienced errors during the login process.
Root Cause Analysis:
We identified an issue in login session management in the classic ASP code. This issue triggered a number of cascading failures, which in turn caused timeouts throughout the ServiceChannel platform.
Actions Taken:
Reverted code from previous release
Restarted Redis Cluster
Mitigation Measures:
Added monitoring to notify the SRE team when Redis cache hits exceed defined thresholds.
Implemented temporary manual stopgap measures; work on a permanent solution is in progress.
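
As an illustration only, the Redis threshold monitoring listed under Mitigation Measures could be sketched as below. This report does not state which metric or threshold values are used, so everything here is an assumption: the sketch reads "cache hits" as the rate of change of the cumulative `keyspace_hits` counter reported by Redis `INFO stats`, and the names `StatsSample`, `hits_per_second`, and `should_alert` are hypothetical. It is not the monitoring that was actually deployed.

```python
from dataclasses import dataclass


@dataclass
class StatsSample:
    """One reading of the Redis INFO stats section (hypothetical helper type)."""
    timestamp: float    # seconds, e.g. from time.time()
    keyspace_hits: int  # cumulative hit counter reported by INFO stats


def hits_per_second(prev: StatsSample, curr: StatsSample) -> float:
    """Return the cache-hit rate between two samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # The counter resets when Redis restarts; treat a negative delta as zero
    # rather than reporting a negative rate.
    delta = max(curr.keyspace_hits - prev.keyspace_hits, 0)
    return delta / elapsed


def should_alert(prev: StatsSample, curr: StatsSample,
                 threshold_per_second: float) -> bool:
    """True when the hit rate is strictly over the assumed threshold."""
    return hits_per_second(prev, curr) > threshold_per_second


if __name__ == "__main__":
    # Illustrative numbers only; not taken from the incident.
    earlier = StatsSample(timestamp=0.0, keyspace_hits=1_000)
    later = StatsSample(timestamp=10.0, keyspace_hits=6_000)
    if should_alert(earlier, later, threshold_per_second=400.0):
        print("notify SRE: Redis cache hit rate over threshold")
```

In a real deployment the samples would come from polling Redis periodically, and the notification would go to whatever paging system the SRE team uses; both are left out here because the report does not describe them.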