Incident Report: Code Release Causes US Environment Outage
Date of Incident: 05/08/2025
Time/Date Incident Started: 05/08/2025, 2:29 AM EDT
Time/Date Stability Restored: 05/08/2025, 4:07 AM EDT
Time/Date Incident Resolved: 05/08/2025, 4:12 AM EDT
Users Impacted: Many
Frequency: Continuous
Impact: Major
Incident description:
During the scheduled US production code release on May 8, 2025, ServiceChannel encountered technical issues that impacted service availability on our platform. Users experienced login difficulties from 2:29 AM to 3:07 AM EDT, while critical dashboard functionality was unavailable from 2:29 AM to 4:12 AM EDT.
Root Cause Analysis:
Login Module Issue: As part of ongoing enhancements to our deployment process, a configuration adjustment was made that worked correctly in our testing environments but behaved differently in production. The issue was identified through our standard troubleshooting procedures and resolved by rolling back the login module to the prior version.
Dashboard Issue: A setting that was correctly configured in our development environments had not been fully synchronized to the production environment. This discrepancy was not detected until the new code attempted to read the setting during the deployment.
Full platform functionality was confirmed restored by 4:12 AM EDT.
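The dashboard failure described above is a case of configuration drift: a setting the new code required was missing in production. As an illustration only, a pre-deployment check along the following lines could surface such a gap before release. This is a minimal sketch, not ServiceChannel's actual tooling; the function name, the setting keys, and the dictionary representation of an environment's configuration are all hypothetical.

```python
from typing import Any, Iterable, List, Mapping


def missing_settings(required_keys: Iterable[str],
                     target_config: Mapping[str, Any]) -> List[str]:
    """Return the required keys that are absent or empty in target_config.

    The result is sorted so that reports are stable between runs.
    """
    return sorted(
        key for key in required_keys
        if key not in target_config or target_config[key] in (None, "")
    )


if __name__ == "__main__":
    # Hypothetical example data: the keys the new release reads, and the
    # settings currently present in the target (production) environment.
    required = {
        "dashboard.widget_endpoint",
        "dashboard.cache_ttl",
        "login.session_timeout",
    }
    production = {
        "dashboard.cache_ttl": 300,
        "login.session_timeout": 900,
    }

    gaps = missing_settings(required, production)
    if gaps:
        print("Blocking release; settings missing in target:", ", ".join(gaps))
    else:
        print("All required settings are present.")
```

Run as a gate before deployment, a non-empty result would stop the release while the gap is still a configuration task rather than an outage.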
Actions Taken:
The SRE team began investigating immediately upon receiving alerts at 2:29 AM EDT indicating issues with two critical systems: dashboard and login.
The CI/CD team rolled back the login module to the prior version, restoring user access by 3:07 AM EDT.
The dashboard remained unavailable after login was restored, so the investigation continued.
Dashboard functionality was restored by ensuring that all required configuration settings were properly applied to production.
Mitigation Measures:
Reviewed and updated existing deployment procedures to include improved configuration validation and rollback protocols, to prevent similar configuration-related issues in the future.
Implemented process improvements for immediate communication with support teams following any service disruption, to ensure proper customer follow-up and transparency.
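As a rough illustration of what an improved rollback protocol might look like, the sketch below runs a health check for each component after a release and rolls back only the components that fail, mirroring how the login module was rolled back independently of the dashboard. It assumes each component exposes a health check and a rollback action as plain callables; the names ReleaseResult and verify_release are hypothetical and do not describe ServiceChannel's actual pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ReleaseResult:
    healthy: bool
    failed_components: List[str] = field(default_factory=list)


def verify_release(checks: Dict[str, Callable[[], bool]],
                   rollback: Callable[[str], None]) -> ReleaseResult:
    """Run each component's health check and roll back the ones that fail.

    A check that raises an exception is treated as a failure.
    """
    failed: List[str] = []
    for component, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(component)

    # Roll back only the failing components, in the order they were checked.
    for component in failed:
        rollback(component)

    return ReleaseResult(healthy=not failed, failed_components=failed)


if __name__ == "__main__":
    # Hypothetical checks standing in for real login and dashboard probes.
    result = verify_release(
        checks={"login": lambda: False, "dashboard": lambda: True},
        rollback=lambda name: print(f"Rolling back {name} to prior version"),
    )
    print("Release healthy:", result.healthy)
```

Rolling back per component keeps a healthy module on the new version while a failing one is reverted; whether that granularity is appropriate depends on how tightly the modules are coupled.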