General Outage - Microsoft Azure Platform Issues
Incident Report for ServiceChannel
Postmortem

Date of Incident: 06/11/2020

Time/Date Incident Started: 06/11/2020, 7:57 AM EDT

Time/Date Stability Restored: 06/11/2020, 9:35 AM EDT

Time/Date Incident Resolved: 06/11/2020, 10:20 AM EDT

Users Impacted: All Users

Frequency: Sustained

Impact: Major

Incident description:

Between 7:57 and 10:20 AM EDT on 11 Jun 2020, Microsoft Azure reported an outage in its US East region. ServiceChannel was identified as a customer in this region whose virtual machines were impacted. The outage left the ServiceChannel platform in an unstable state as key components became unavailable.

After identifying the underlying Microsoft Azure platform issue, the ServiceChannel SRE team initiated its incident response protocol, preparing to transfer platform operations to a warm backup facility in a geographically separate region.

A key piece of response infrastructure, a VPN endpoint used for remote management, was itself impacted by the general outage in the US East region. This delayed the ServiceChannel SRE team's response by about 25 minutes.

Before the warm backup facility was brought online, Microsoft Azure was able to restore operations in the US East region, and the ServiceChannel platform resumed normal operations.

Root Cause Analysis:

Microsoft reported in their own RCA (Tracking ID 9VHK-J80) that an incident during a planned power maintenance activity at the datacenter caused an impact to a storage scale unit, which then became unhealthy. The incident caused power to be lost to a subset of racks comprising 60% of this storage scale unit.

The maintenance activity itself did not impact the storage scale unit, but it caused the scale unit to have reduced redundant power options at the time of the incident. All racks and network devices have two sources of power for redundancy, but it is standard procedure in some types of maintenance to isolate some resources to a single source of power for a short period. After the isolation had been completed on this scale unit, but before maintenance could begin, a distribution breaker in the redundant power source tripped open unexpectedly and the power was lost to the subset of racks.

Microsoft discontinued the maintenance task and restored power to the affected devices, thereby restoring services to the affected node. As this was a datacenter-scale event, it took some time before ServiceChannel platform resources could return to a healthy state.

Actions Taken:

  1. Identified the issue as a Microsoft Azure cloud provider-specific outage.
  2. Initiated the ServiceChannel Business Continuity plan to restore platform availability.

Mitigation Measures:

  1. Review the locations of key response infrastructure (e.g. VPN components, network dependencies) to eliminate single points of failure that might delay our response to underlying platform issues or complicate our Business Continuity procedures.
  2. Review warm backup infrastructure and disaster recovery procedures.
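
As an illustration of the first review item, the check below is a minimal sketch, not ServiceChannel's actual tooling. It assumes a hypothetical inventory that records each component's hosting region and whether the component is needed to carry out a failover. The component names, the region labels ("us-east", "us-west"), and the function find_failover_blockers are all placeholders invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One entry in a hypothetical infrastructure inventory."""
    name: str
    region: str                # placeholder label, not an official region ID
    needed_for_failover: bool  # True if the failover procedure depends on it


def find_failover_blockers(components, primary_region):
    """Return names of failover-critical components hosted in the primary region.

    Any component returned here would be unavailable during an outage of the
    primary region, which is exactly when the failover procedure needs it.
    """
    return [
        c.name
        for c in components
        if c.needed_for_failover and c.region == primary_region
    ]


# Hypothetical inventory for illustration only.
INVENTORY = [
    Component("management-vpn-endpoint", "us-east", True),
    Component("warm-backup-app-tier", "us-west", True),
    Component("production-app-tier", "us-east", False),
]

if __name__ == "__main__":
    for name in find_failover_blockers(INVENTORY, "us-east"):
        print(f"single point of failure in primary region: {name}")
```

A check along these lines could be run whenever the inventory changes, so that a response dependency placed in the primary region is caught before an outage rather than during one.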
Posted Jun 19, 2020 - 14:41 EDT

Resolved
The ServiceChannel platform is back to normal.

We appreciate your patience through this incident, and apologize for any inconvenience.
Posted Jun 11, 2020 - 10:11 EDT
Monitoring
Our engineers are monitoring as services are returning to normal. There may be some platform performance degradation as systems come back online.
Posted Jun 11, 2020 - 09:42 EDT
Investigating
ServiceChannel is impacted by a general outage in the Microsoft Azure platform that hosts the ServiceChannel application. Engineers are aware of this issue and are actively investigating.

We appreciate your patience and will provide an update shortly.
Posted Jun 11, 2020 - 08:22 EDT
This incident affected: Service Automation (Login, Dashboard, Work Order Manager, Proposal Manager, Invoice Manager, Asset Manager, Compliance Manager, Supply Manager, Weather, Project Tracker, Inventory Manager), API (Authentication, API Response, SFTP, SendXML), Mobile Applications (SC Mobile), Provider Automation (Login, Work Order Manager, Proposal Manager, Invoice Manager, IVR), Analytics (Analytics Dashboard), and WorkForce.