API System Production Outage
Incident Report for ServiceChannel
Postmortem

Date of Incident: 02/16/2021

Time/Date Incident Started: 02/16/2021, 05:09pm EST

Time/Date Stability Restored: 02/16/2021, 06:06pm EST

Time/Date Incident Resolved: 02/16/2021, 08:08pm EST

Users Impacted: All API Users

Frequency: Sustained

Impact: Outage

 

Incident description: 

A ServiceChannel API failure impacted a broad range of API endpoints.

 

Root Cause Analysis:

Due to an operational oversight, a system service account used by many API endpoints was accidentally deleted.

In the course of testing permissions for a new production service account, a member of the ServiceChannel SRE team referenced an existing production API service account to review roles and permissions for the new test production service account. 

Once testing had concluded, the SRE engineer deleted the test production service account without realizing that the existing production API service account was also selected for deletion. The test and production accounts were deleted together.

Without the production API service account, many API calls were unable to authenticate against ServiceChannel resources, causing a broad range of API requests to fail. Because API calls are used by both customers and the ServiceChannel platform itself, the issue surfaced as a wide range of system problems, all with the same root cause.

Only a small number of people within the ServiceChannel SRE team are granted permissions to create or delete system accounts like the one identified in this incident.

Actions Taken:

  1. SRE team monitors showed a dramatic spike in errors while reporting a significant drop in API activity.
  2. Investigation of system logs identified API authentication failures against the API service account.
  3. Through further investigation, the SRE team discovered that the API service account was missing.
  4. The SRE engineer who deleted the account realized his mistake and reported it to the investigating team. 
  5. An attempt to restore the account was unsuccessful because the account API token and secret are randomly assigned and cannot be manually configured. 
  6. The missing account was recreated with the same permissions, and its new credentials were distributed to the application through an emergency configuration deployment.
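
The detection signal in steps 1 and 2 (errors spiking while API activity drops) can be expressed as a simple check. The following is a minimal sketch only, not ServiceChannel's actual monitoring: the `WindowStats` shape, the function name, and both thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one monitoring window (hypothetical shape)."""
    total_requests: int
    auth_failures: int


def looks_like_auth_outage(baseline: WindowStats,
                           current: WindowStats,
                           min_failure_ratio: float = 0.5,
                           max_traffic_ratio: float = 0.5) -> bool:
    """Flag a window where auth failures dominate AND traffic has dropped.

    Both thresholds are illustrative assumptions, not values from the report.
    """
    if current.total_requests == 0:
        # No traffic at all is suspicious only if the baseline had traffic.
        return baseline.total_requests > 0

    failure_ratio = current.auth_failures / current.total_requests
    if baseline.total_requests:
        traffic_ratio = current.total_requests / baseline.total_requests
    else:
        # Without a baseline there is no drop to measure.
        traffic_ratio = 1.0

    return failure_ratio >= min_failure_ratio and traffic_ratio <= max_traffic_ratio


if __name__ == "__main__":
    baseline = WindowStats(total_requests=1000, auth_failures=5)
    during_incident = WindowStats(total_requests=300, auth_failures=280)
    print(looks_like_auth_outage(baseline, during_incident))
```

Requiring both conditions together helps separate an authentication outage from an ordinary quiet period (low traffic, few errors) or a single misbehaving client (many errors, normal traffic).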
     

Mitigation Measures:  

  1. Review privileged access on the infrastructure side to ensure that creations and deletions require independent review, agreement, and authorization by at least two members of the team.
  2. Reduce the blast radius of a failure to authenticate against a service account by using multiple API service accounts on a component-by-component basis (or similar).
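
The first measure above could be enforced in tooling roughly as follows. This is a hedged sketch under stated assumptions, not the team's implementation: `DeletionRequest`, `confirm_deletion`, and the account names are hypothetical, and "two members" is interpreted here as the requester plus at least one independent reviewer. The check also rejects any selection that contains accounts outside the reviewed request, which is the failure mode described in the root cause.

```python
from dataclasses import dataclass, field


@dataclass
class DeletionRequest:
    """A reviewed request to delete specific service accounts (hypothetical)."""
    account_ids: frozenset
    requested_by: str
    approvals: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The review must be independent of the requester.
        if reviewer == self.requested_by:
            raise PermissionError("requester cannot approve their own request")
        self.approvals.add(reviewer)


def confirm_deletion(request: DeletionRequest, selected_ids, min_people: int = 2) -> set:
    """Return the accounts safe to delete, or raise if a guard fails."""
    selected = set(selected_ids)

    # Guard 1: nothing may be selected beyond what was reviewed.
    extra = selected - request.account_ids
    if extra:
        raise ValueError(f"selection includes unreviewed accounts: {sorted(extra)}")

    # Guard 2: requester plus reviewers must total at least min_people.
    people = {request.requested_by} | request.approvals
    if len(people) < min_people:
        raise PermissionError(f"deletion needs sign-off from {min_people} team members")

    return selected


if __name__ == "__main__":
    req = DeletionRequest(frozenset({"svc-test"}), requested_by="alice")
    req.approve("bob")
    print(confirm_deletion(req, {"svc-test"}))
```

With these guards, a selection of `{"svc-test", "svc-api"}` against a request that covers only `svc-test` is refused before any account is touched.
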
Posted Feb 18, 2021 - 21:31 EST

Resolved
We consider this incident to be resolved. As always, thank you for your patience.
Posted Feb 16, 2021 - 20:38 EST
Monitoring
The SRE Team has brought impacted services back online and is now monitoring for any further issues. Thank you for your continued patience.
Posted Feb 16, 2021 - 19:54 EST
Identified
The ServiceChannel Site Reliability Engineering team has identified the issue and is now bringing impacted services back online.
Posted Feb 16, 2021 - 18:09 EST
Investigating
We are currently investigating an issue that is impacting several services. We will provide an update shortly. Thank you for your patience.
Posted Feb 16, 2021 - 17:24 EST
This incident affected: API (Authentication) and Service Automation (Work Order Manager, Proposal Manager, Compliance Manager).