Errors connecting to ServiceChannel Platform

Incident Report for ServiceChannel

Postmortem

Incident Report: Login Failures Due to Port Exhaustion

 

Date of Incident: 02/27/2026

Time/Date Incident Started: 02/27, 5:00 PM EST  

Time/Date Stability Restored: 02/27, 5:06 PM EST 

Time/Date Incident Resolved: 02/27, 5:15 PM EST 

  

Users Impacted: Some 

Frequency: Continuous 

Impact: Major 

  

Incident description: 

On February 27, 2026, the ServiceChannel platform experienced an issue with the Login and Authentication services that left some users unable to log in. A subset of users who were already logged in also encountered authentication errors that invalidated their sessions.

Monitoring systems alerted the SRE team at 5:00 PM EST. The SRE team initiated incident triage and identified the underlying issue. Service restarts and connection resets restored stability by 5:06 PM EST. All servers and services were fully verified and confirmed operational by 5:15 PM EST. 

Root Cause Analysis: 

The incident was caused by a sudden increase in concurrent connections to the Login and Authentication services, which exhausted the available ports on the impacted nodes. With no ports left, those nodes could not establish new connections and intermittently failed to maintain existing authentication sessions. As a result, some users could not log in, and a subset of already-authenticated users experienced session validation failures that interrupted their sessions. Restarting the affected nodes released the exhausted ports and restored normal connection handling.
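For illustration only, the port-exhaustion condition described above can be expressed as a simple utilization check. This is a minimal sketch, not the platform's actual monitoring. It assumes that the exhausted ports were in the dynamic (ephemeral) range, which the report does not state. The default range 49152-65535 is the IANA dynamic range and may differ from the operating system's configured range. The names `PortRange`, `port_utilization`, and `classify`, and both thresholds, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PortRange:
    """Inclusive range of local ports available for dynamic allocation."""
    low: int = 49152   # assumption: IANA dynamic range, not confirmed by the report
    high: int = 65535

    @property
    def size(self) -> int:
        return self.high - self.low + 1


def port_utilization(local_ports: Iterable[int], port_range: PortRange) -> float:
    """Return the fraction of the range occupied by distinct local ports."""
    in_range = {p for p in local_ports if port_range.low <= p <= port_range.high}
    return len(in_range) / port_range.size


def classify(utilization: float, warn_at: float = 0.7, critical_at: float = 0.9) -> str:
    """Map a utilization fraction to a hypothetical alert level."""
    if utilization >= critical_at:
        return "critical"
    if utilization >= warn_at:
        return "warning"
    return "ok"
```

In such a setup, a node would be sampled periodically, for example by collecting the local ports of its open connections, and an alert raised before utilization reaches the point where new connections start to fail.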

Actions Taken: 

  • Monitoring alerts were received by the SRE team indicating authentication service degradation. 

  • The SRE team initiated incident triage and investigated the login and authentication failures. 

  • Port exhaustion was identified on impacted nodes during troubleshooting. 

  • Recovery actions, including restarting affected nodes and resetting connections, were performed. 

  • Services returned to normal operation following these recovery actions. 

  • Platform functionality was fully verified after stability was restored.

Posted Mar 11, 2026 - 20:34 EDT

Resolved

This incident has been resolved.
Posted Feb 27, 2026 - 17:22 EST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Feb 27, 2026 - 17:21 EST

Identified

The issue has been identified and a fix is being implemented.
Posted Feb 27, 2026 - 17:16 EST

Investigating

We are currently investigating this issue.
Posted Feb 27, 2026 - 17:16 EST
This incident affected: Service Automation (Login).