Incident Report: SFTP Service Disruption
Date of Incident: 12/12/2023
Time/Date Incident Started: 12/11/2023, 05:43 pm EST
Time/Date Stability Restored: 12/12/2023, 01:24 pm EST
Time/Date Incident Resolved: 12/12/2023, 01:54 pm EST
Users Impacted: Few
Frequency: Continuous
Impact: Major
Incident description:
On December 11th, 2023, at 5:43 pm EST, an unexpected disruption occurred in the Production ServiceChannel SFTP service. On the morning of December 12th, 2023, the ServiceChannel Support team began receiving customer reports of timeout errors when attempting to connect to the ServiceChannel SFTP server.
Root Cause Analysis:
An investigation by the Site Reliability Engineering (SRE) team found no resource contention on the affected server instance. Nevertheless, to rule out a hardware bottleneck, the SRE team scaled the server instance up to the next larger instance size. Despite this effort, tests showed that external connections to port 22 continued to fail, while all internal network tests succeeded.
The SRE team then shifted its focus to potential network irregularities and found that the security policy governing the SFTP server had been altered to exclude access to port 22. Upon further investigation with the Security team, we determined that this change was part of a broad initiative to harden our platform's security posture. Regrettably, this policy update was executed without the normal change management process, and the broader engineering organization was not notified in advance.
This network modification was subsequently reversed, and SFTP functionality was restored.
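For illustration only, the kind of port 22 reachability test described above can be sketched in a few lines of Python. This is not the tooling the SRE team used. The function name `port_is_reachable` and the hostname in the usage comment are hypothetical. Running the same check from a host outside the network and from one inside it helps separate a network policy block from a server-side fault.

```python
import socket


def port_is_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    A timeout or refusal returns False. This checks only the TCP handshake,
    not the SSH/SFTP protocol exchange that follows it.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False


# Hypothetical usage (the hostname is a placeholder, not the real server):
# port_is_reachable("sftp.example.com", 22)
```

A result of False from an external host combined with True from an internal host would match the pattern seen in this incident, where a network policy rather than the server itself was blocking connections.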
Actions Taken:
Mitigation Measures:
In light of this incident, the following preventative measures have been put in place: