System Connectivity
Incident Report for ServiceChannel
Postmortem

Date of Incident: 12/19/2019

Time/Date Incident Started: 12/19/2019, 11:06 AM EST

Time/Date Stability Restored: 12/19/2019, 11:18 AM EST

Time/Date Incident Resolved: 12/19/2019, 11:31 AM EST

Users Impacted: All Users

Frequency: Sustained

Impact: Major

Incident description:

ServiceChannel users encountered HTTP 504 errors and other issues while navigating to the ServiceChannel site due to a network connectivity issue between our cloud service providers.

Root Cause Analysis:

Attempts to route traffic to our web cluster suddenly failed due to a general tunnel failure within our core network routing infrastructure.

Error logs indicated an unrecoverable endpoint failure on one side of the VPN tunnels that interconnect our cloud service providers.

Since the network connection between our cloud service providers is critical for the proper functioning of the ServiceChannel application, we maintain multiple redundant VPN connections in an active-active configuration. Despite engineering efforts to ensure multiple redundant network paths, all VPN endpoints at one of the cloud service providers failed simultaneously.

The cloud service vendor whose services were affected has since acknowledged that they placed all of our VPN gateway endpoints in standby mode to perform maintenance.

Our vendor’s support organization did not provide warning about this maintenance, and performed the maintenance on all endpoints simultaneously. Since all endpoints were administratively placed in standby mode, redundancy was lost and no VPN tunnels could be established between cloud service providers. When the vendor completed their maintenance, connectivity was restored.
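The failure mode described above can be illustrated with a minimal sketch. It is a simplified model, not our production tooling: the `Tunnel` type, the endpoint states, the tunnel names, and the status labels are all hypothetical and chosen only for illustration. It shows why an active-active design tolerates maintenance on one endpoint at a time but not on all endpoints at once.

```python
from dataclasses import dataclass
from enum import Enum


class EndpointState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"  # administratively disabled, e.g. for maintenance
    FAILED = "failed"


@dataclass(frozen=True)
class Tunnel:
    name: str                         # hypothetical label for one VPN tunnel
    provider_endpoint: EndpointState  # state of the vendor-managed endpoint


def usable_tunnels(tunnels):
    """Return the tunnels whose vendor-side endpoint can carry traffic."""
    return [t for t in tunnels if t.provider_endpoint is EndpointState.ACTIVE]


def connectivity_status(tunnels):
    """Classify inter-cloud connectivity for a set of redundant tunnels."""
    up = len(usable_tunnels(tunnels))
    if up == 0:
        return "down"      # no path left between clouds
    if up < len(tunnels):
        return "degraded"  # reduced redundancy, traffic still flows
    return "healthy"


if __name__ == "__main__":
    # Staggered maintenance: one endpoint in standby, the other still active.
    staggered = [
        Tunnel("tunnel-a", EndpointState.STANDBY),
        Tunnel("tunnel-b", EndpointState.ACTIVE),
    ]
    # Simultaneous maintenance: every endpoint placed in standby at once.
    simultaneous = [
        Tunnel("tunnel-a", EndpointState.STANDBY),
        Tunnel("tunnel-b", EndpointState.STANDBY),
    ]
    print(connectivity_status(staggered))
    print(connectivity_status(simultaneous))
```

In this model, staggered maintenance only reduces redundancy, while simultaneous maintenance removes every path regardless of how many tunnels are configured.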

Actions Taken:

  1. Identified the loss of network connectivity between cloud providers
  2. Restarted one Cisco CSR VPN appliance to rule it out as the cause of the fault
  3. Reviewed logs to identify that all endpoints managed by one cloud provider had failed
  4. Confirmed that the cloud provider vendor had performed unannounced maintenance

Mitigation Measures:

  1. Rebooted VPN endpoints to force renegotiation of VPN tunnels
  2. Created additional documentation for operational staff on troubleshooting the hosted VPN Gateway
  3. Opened a support case at the highest priority with our vendor and escalated to our dedicated tech lead to prevent unscheduled maintenance
  4. Investigate implementing a dedicated L2 network connection between cloud providers to replace VPN connectivity as the primary data path between clouds
Posted Dec 20, 2019 - 10:30 EST

Resolved
All services are working as expected. We consider this issue to be resolved.
Posted Dec 19, 2019 - 11:31 EST
Monitoring
Our engineering team has identified the issue and services are returning to normal. We are continuing to monitor. Thank you for your patience.
Posted Dec 19, 2019 - 11:18 EST
Investigating
The ServiceChannel operations team is investigating a general network connectivity issue. We will provide an ETA shortly.
Posted Dec 19, 2019 - 11:06 EST
This incident affected: Service Automation (Login, Work Order Manager, Proposal Manager, Invoice Manager, Asset Manager, Compliance Manager, Supply Manager) and Provider Automation (Login, Work Order Manager, Proposal Manager, Invoice Manager, IVR).