Date of Incident: 06/13/2019
Time/Date Incident Started: 06/13/2019, 10:35 am EST
Time/Date Stability Restored: 06/13/2019, 1:12 pm EST
Time/Date Incident Resolved: 06/13/2019, 1:18 pm EST
Users Impacted: Active users
System performance degradation: users were unable to log in to the system or experienced errors during the login process.
Root Cause Analysis:
We have identified an issue related to login session management in classic ASP code. This issue resulted in a number of cascading failures, which in turn created timeouts throughout the ServiceChannel platform.
Reverted code from previous release.
Restarted Redis Cluster.
Added monitoring to notify the SRE team when Redis Cache hits exceed defined thresholds.
Implemented manual temporary stopgap measures; we are currently working on a permanent solution.
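
For illustration only, the threshold monitoring described above could be sketched roughly as follows. This is not the monitoring that was actually deployed. It assumes the hit counter is sampled periodically from the cumulative `keyspace_hits` value that Redis reports in its INFO stats output; the `Sample` type, the function names, and the threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of the cumulative cache-hit counter (hypothetical type)."""
    timestamp: float    # seconds since epoch when the reading was taken
    keyspace_hits: int  # cumulative hit counter, as reported by Redis INFO stats


def hits_per_second(prev: Sample, curr: Sample) -> float:
    """Return the average hit rate between two consecutive samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr.keyspace_hits - prev.keyspace_hits
    if delta < 0:
        # The counter resets when Redis restarts, so count from zero instead.
        delta = curr.keyspace_hits
    return delta / elapsed


def should_notify(prev: Sample, curr: Sample, threshold_per_second: float) -> bool:
    """Return True when the hit rate is over the defined threshold."""
    return hits_per_second(prev, curr) > threshold_per_second


# Hypothetical usage: 60,000 hits in 10 seconds against a 5,000/s threshold.
previous = Sample(timestamp=0.0, keyspace_hits=1_000)
current = Sample(timestamp=10.0, keyspace_hits=61_000)
if should_notify(previous, current, threshold_per_second=5_000):
    print("Redis cache hit rate over threshold; notify SRE team")
```

Comparing a rate rather than the raw counter keeps the check meaningful after a cluster restart, when the cumulative counter starts again from zero.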