SFTP Unavailable
Incident Report for ServiceChannel
Postmortem

Incident Report: SFTP Service Disruption

   

Date of Incident: 12/12/2023

Time/Date Incident Started: 12/11/2023, 05:43 pm EST

Time/Date Stability Restored: 12/12/2023, 01:24 pm EST

Time/Date Incident Resolved: 12/12/2023, 01:54 pm EST

  

Users Impacted: Few

Frequency: Continuous

Impact: Major  

    

Incident description:  

On December 11th, 2023, at 5:43 pm EST, an unexpected disruption occurred in the Production ServiceChannel SFTP service. By the morning of December 12th, 2023, the ServiceChannel Support team had begun to receive customer reports of timeout errors when attempting to connect to the ServiceChannel SFTP server.

Root Cause Analysis:  

A comprehensive investigation by the Site Reliability Engineering (SRE) team revealed no resource contention issues with the affected server instance. Nevertheless, to rule out any hardware bottleneck, the SRE team scaled the server instance up to the next larger instance size. Despite this effort, tests continued to show failures for external connections to port 22, while all internal network tests were successful.
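The contrast between internal and external results can be illustrated with a minimal sketch. It is not the SRE team's actual tooling: the function names and the fault-area labels are hypothetical, and it assumes the same check is run once from inside the private network and once from the public internet.

```python
import socket


def can_connect(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures, and unreachable hosts.
        return False


def diagnose(internal_ok: bool, external_ok: bool) -> str:
    """Map two vantage-point results to a likely fault area (hypothetical labels)."""
    if internal_ok and external_ok:
        return "healthy"
    if internal_ok:
        return "external network path or access policy"
    if external_ok:
        return "internal network path"
    return "server or service"
```

In this incident the pattern was internal success and external failure, which `diagnose(True, False)` attributes to the network path or access policy rather than to the server itself.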

The SRE team shifted their efforts to pinpointing potential network irregularities and found that the security policy governing the SFTP server had been altered to block external access to port 22. Upon further investigation with the Security team, we determined that this change was part of a broad initiative to harden our platform's security posture. Regrettably, this policy update was executed without the normal change management process, and the broader engineering organization was not notified in advance.

This network modification was subsequently reversed, and SFTP functionality was restored.  

Actions Taken:  

  1. The SRE team inspected the SFTP server and confirmed it was operating within defined parameters. The team also proactively scaled up the infrastructure to address the possibility of any system bottlenecks. 
  2. The SRE team identified a suspected change in the security policy, wherein port 22 access was removed for all but private network address spaces. System event logs confirmed that this change was implemented by the Security team. Upon identifying the issue, the SRE team informed the Security team and requested an emergency rollback. 
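Why a rule limited to private address spaces cuts off internet clients can be shown with a small sketch. The rule shape (a list of `(port, cidr)` allow entries), the function names, and the example rules are all hypothetical and do not represent the actual ServiceChannel policy. The sketch treats only the IPv4 RFC 1918 ranges as private.

```python
import ipaddress

# IPv4 private address ranges defined by RFC 1918.
RFC1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_private_only(cidr: str) -> bool:
    """Return True if the IPv4 CIDR lies entirely inside an RFC 1918 range."""
    net = ipaddress.ip_network(cidr, strict=False)
    return net.version == 4 and any(net.subnet_of(p) for p in RFC1918)


def allows_external_sftp(rules) -> bool:
    """Return True if any allow rule opens port 22 to a non-private source."""
    return any(port == 22 and not is_private_only(cidr) for port, cidr in rules)


# Hypothetical rule sets for illustration only.
before_change = [(22, "0.0.0.0/0"), (443, "0.0.0.0/0")]
after_change = [(22, "10.0.0.0/8"), (443, "0.0.0.0/0")]
```

Under these assumptions, `allows_external_sftp(before_change)` is true and `allows_external_sftp(after_change)` is false, which matches the observed symptom: internal tests passed while customers connecting from the internet timed out.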

 

Mitigation Measures:    

In light of this incident, the following preventative measures have been put in place: 

  1. Improvements to internal communications, including ensuring that all network changes are announced to and approved by the wider engineering organization prior to their implementation.  
  2. Ensuring that, going forward, infrastructure changes to the ServiceChannel Platform are made by the SRE team using the normal Infrastructure as Code process. 
  3. Additional monitoring of the SFTP infrastructure, using both network ping tests and end-to-end synthetic transaction tests, has been implemented to test from both internal and external network paths.
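
As an illustration of a check that goes one step beyond a ping, the sketch below connects to port 22 and reads the server's SSH identification line. It is a simplified stand-in, not the monitoring that was actually deployed: the function names are hypothetical, a single `recv` is assumed to return the whole first line, and a full synthetic transaction would also log in and transfer a file, which needs an SFTP client library.

```python
import socket


def parse_banner(data: bytes):
    """Return the first line of the server greeting as text, or None if empty."""
    line = data.split(b"\n", 1)[0].strip().decode("ascii", errors="replace")
    return line or None


def read_ssh_banner(host: str, port: int = 22, timeout: float = 5.0):
    """Connect and return the server's first greeting line, or None on failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            data = conn.recv(255)
    except OSError:
        # Refused, timed out, or unreachable: report the probe as failed.
        return None
    return parse_banner(data)


def sftp_endpoint_healthy(host: str, port: int = 22) -> bool:
    """Treat the endpoint as healthy if it greets with an SSH identification string."""
    banner = read_ssh_banner(host, port)
    return banner is not None and banner.startswith("SSH-")
```

Running such a probe from both an internal host and an external host, and alerting when the two results differ, would surface a blocked external path even while the server itself stays healthy.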
Posted Dec 14, 2023 - 18:44 EST

Resolved
This incident has been resolved.
Posted Dec 12, 2023 - 15:30 EST
Monitoring
A fix has been implemented. We are monitoring the results.
Posted Dec 12, 2023 - 14:07 EST
Investigating
ServiceChannel is currently investigating an issue that prevents users from connecting to our SFTP servers from the internet. We are working to restore service as soon as possible. Thank you for your patience.
Posted Dec 12, 2023 - 10:37 EST
This incident affected: API (SFTP).