US Production App Rollback Incident Report
Incident Report for ServiceChannel
Postmortem


Date of Incident: 08/09/2023

Time/Date Incident Started: 08/09/2023, 10:00 pm EDT

Time/Date Stability Restored: 08/10/2023, 12:00 am EDT

Time/Date Incident Resolved: 08/10/2023, 12:00 am EDT

Users Impacted: All

Frequency: Continuous

Impact: Major

Incident description:

On 8/9/23, the production release of the US application code was rolled back after smoke testing and synthetic monitors detected errors on the ServiceChannel platform.

Root Cause Analysis:

The investigation traced the issue to a recent update to the platform session cookie. After this update, the Component module malfunctioned because it specified an incorrect Redis store for session data.
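
The kind of mismatch described above can be caught before a release with a simple configuration check. The following is a minimal sketch in Python, not the platform's actual code: the function names, the module names, and the example Redis URLs are all hypothetical, and it assumes each module exposes its session store as a standard `redis://host:port/db` URL.

```python
from urllib.parse import urlparse


def redis_target(url: str) -> tuple:
    """Reduce a Redis URL to (host, port, db) so equivalent targets compare equal."""
    parsed = urlparse(url)
    # Assumption: a missing port means the Redis default (6379), a missing db means 0.
    db = parsed.path.lstrip("/") or "0"
    return (parsed.hostname, parsed.port or 6379, db)


def find_mismatched_session_stores(platform_url: str, module_urls: dict) -> dict:
    """Return the modules whose session store differs from the platform's store."""
    expected = redis_target(platform_url)
    return {
        name: url
        for name, url in module_urls.items()
        if redis_target(url) != expected
    }


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    mismatched = find_mismatched_session_stores(
        "redis://sessions.example.internal:6379/0",
        {
            "component": "redis://cache.example.internal:6379/0",
            "web": "redis://sessions.example.internal/0",
        },
    )
    for name, url in mismatched.items():
        print(f"{name} points at a different session store: {url}")
```

Run as a pre-deployment step, a check like this would fail the release if any module reads sessions from a store other than the one the platform writes to.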

Actions Taken:

  1. The team rolled back the application services code to the previous functional version. After the rollback, smoke testing and synthetic monitors confirmed the stability of the web platform.
  2. To address the underlying problem, the Redis connection strings for the component modules were updated. The US Production release was redeployed on 8/10/23 at 10 PM EDT with the correct configuration applied.

Mitigation Measures:

To prevent similar incidents in the future, the following mitigation measures will be implemented:

  1. Ensuring Environment Consistency: A concerted effort will be made to better align production and non-production configurations.
  2. Governance of Production Changes: To maintain greater control over potentially disruptive production changes, any change that can only be applied to the Production environment because of scale considerations will require explicit approval from senior management before implementation.
  3. Monitoring Production-Only Variables: We will implement automated monitoring to regularly check for the presence of "Production Only" configuration values. This practice will provide an additional layer of oversight and help prevent inadvertent changes.
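
As an illustration of measures 1 and 3, the sketch below compares a production configuration with a non-production one and lists any "Production Only" keys that are not on an approved list. It is a hedged example, not the monitoring that will be deployed: the function names, the configuration keys, and the idea of an approved-key allowlist are assumptions made for illustration, and configurations are assumed to be flat key/value mappings.

```python
def compare_configs(prod: dict, nonprod: dict) -> dict:
    """Summarize how two flat configuration mappings differ."""
    shared = set(prod) & set(nonprod)
    return {
        # Keys that exist only in production ("Production Only" values).
        "prod_only": sorted(set(prod) - set(nonprod)),
        # Keys that exist only in the non-production environment.
        "nonprod_only": sorted(set(nonprod) - set(prod)),
        # Keys present in both environments but with different values.
        "differing": sorted(k for k in shared if prod[k] != nonprod[k]),
    }


def unapproved_prod_only(report: dict, approved: set) -> list:
    """Return production-only keys that are missing from the approved list."""
    return [key for key in report["prod_only"] if key not in approved]


if __name__ == "__main__":
    # Hypothetical configuration values for illustration only.
    prod = {"REDIS_SESSION_URL": "redis://prod-sessions/0", "POOL_SIZE": 500, "SHARD_COUNT": 64}
    nonprod = {"REDIS_SESSION_URL": "redis://qa-sessions/0", "POOL_SIZE": 500}

    report = compare_configs(prod, nonprod)
    print("Production only:", report["prod_only"])
    print("Differing values:", report["differing"])
    print("Needs approval:", unapproved_prod_only(report, approved=set()))
```

Some differences between environments are expected, such as hostnames, so a scheduled check of this kind is most useful when it alerts only on keys that appear, disappear, or change without a matching approval.
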
Posted Sep 11, 2023 - 14:35 EDT

Resolved
The production release of the US application code was rolled back after smoke testing and synthetic monitors detected errors on the ServiceChannel platform.
Posted Aug 09, 2023 - 22:00 EDT