Date of Incident: 12/19/2019
Time/Date Incident Started: 12/19/2019, 11:06 AM EST
Time/Date Stability Restored: 12/19/2019, 11:18 AM EST
Time/Date Incident Resolved: 12/19/2019, 11:31 PM EST
Users Impacted: All Users
Frequency: Sustained
Impact: Major
Incident description:
ServiceChannel users encountered HTTP 504 errors and other issues while navigating to the ServiceChannel site due to a network connectivity issue between our cloud service providers.
Root Cause Analysis:
Attempts to route traffic to our web cluster suddenly failed due to a general tunnel failure in our core network routing infrastructure.
Error logs indicated an unrecoverable endpoint failure on one side of the VPN tunnels that interconnect our cloud service providers.
Because the network connection between our cloud service providers is critical to the proper functioning of the ServiceChannel application, we maintain multiple redundant VPN connections in an active-active configuration. Despite engineering efforts to ensure multiple redundant network paths, all VPN endpoints at one of the cloud service providers failed simultaneously.
The cloud service vendor whose services were affected has since acknowledged that they placed all of our VPN gateway endpoints in standby mode to perform maintenance.
Our vendor’s support organization did not provide advance warning of this maintenance and performed it on all endpoints simultaneously. Because all endpoints were administratively placed in standby mode, redundancy was lost and no VPN tunnels could be established between cloud service providers. Connectivity was restored when the vendor completed its maintenance.
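The failure mode described above can be illustrated with a minimal sketch. This is not ServiceChannel's or the vendor's actual tooling: the endpoint names, the three states, and the `min_active` threshold are hypothetical and exist only to show why an active-active design still fails when every endpoint on one side is placed in standby at the same time.

```python
from dataclasses import dataclass
from enum import Enum


class TunnelState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"  # administratively parked, e.g. for maintenance
    DOWN = "down"


@dataclass(frozen=True)
class VpnEndpoint:
    name: str           # hypothetical label, not a real gateway ID
    state: TunnelState


def redundancy_status(endpoints, min_active=2):
    """Classify a set of VPN endpoints by how many can carry traffic.

    Only ACTIVE endpoints count as usable paths; STANDBY and DOWN
    endpoints cannot establish a tunnel.
    """
    active = sum(1 for e in endpoints if e.state is TunnelState.ACTIVE)
    if active == 0:
        return "outage"      # no tunnel can be established
    if active < min_active:
        return "degraded"    # traffic flows, but redundancy is lost
    return "redundant"
```

Under these assumptions, taking one of two endpoints into standby yields "degraded", while taking both at once yields "outage", which corresponds to the situation in this incident.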
Actions Taken:
Mitigation Measures: