General Outage - Microsoft Azure Platform Issues
Date of Incident: 06/11/2020
Time/Date Incident Started: 06/11/2020, 7:57 AM EDT
Time/Date Stability Restored: 06/11/2020, 9:35 AM EDT
Time/Date Incident Resolved: 06/11/2020, 10:20 AM EDT
Users Impacted: All Users
Between 7:57 AM and 10:20 AM EDT on 11 Jun 2020, Microsoft Azure reported an outage in its US East region. ServiceChannel is a customer in this region, and its virtual machines were impacted. The outage left the ServiceChannel platform in an unstable state as key components became unavailable.
After identifying the underlying Microsoft Azure platform issue, the ServiceChannel SRE team initiated its incident response protocol, preparing to transfer platform operations to a warm backup facility in a geographically separate region.
A key piece of response infrastructure, a VPN endpoint used for remote management, was itself impacted by the general outage in the US East region. This delayed the ServiceChannel SRE team's response by about 25 minutes.
Before the warm backup facility was brought online, Microsoft Azure was able to restore operations in the US East region, and the ServiceChannel platform resumed normal operations.
Root Cause Analysis:
Microsoft reported in its own RCA (Tracking ID 9VHK-J80) that an incident during a planned power maintenance activity at the datacenter impacted a storage scale unit, which then became unhealthy. The incident cut power to a subset of racks comprising 60% of this storage scale unit.
The maintenance activity itself did not impact the storage scale unit, but it left the scale unit with reduced power redundancy at the time of the incident. All racks and network devices have two sources of power for redundancy, but it is standard procedure in some types of maintenance to isolate some resources to a single source of power for a short period. After the isolation had been completed on this scale unit, but before maintenance could begin, a distribution breaker in the redundant power source tripped open unexpectedly, and power was lost to the subset of racks.
Microsoft discontinued the maintenance task and restored power to the affected devices, thereby restoring services to the affected node. As this was a datacenter-scale event, ServiceChannel platform resources did not return to a healthy state immediately: stability was restored at 9:35 AM EDT, and the incident was resolved at 10:20 AM EDT.