ServiceChannel System Performance Degradation
Incident Report for ServiceChannel
Postmortem

CrowdStrike Incident Report

  

Date of Incident:                   07/19/2024 

Time/Date Incident Started: 07/19/2024, 01:10 am EDT 

Time/Date Stability Restored:  07/19/2024, 05:47 am EDT 

Time/Date Incident Resolved: 07/19/2024, 10:00 am EDT 

 

Users Impacted: All users 

Frequency: Continuous 

Impact: Major 
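
As a quick check on the timeline above, the elapsed times can be worked out from the three timestamps. This is only an arithmetic sketch: the variable names are illustrative, and all times are treated as same-day EDT values taken from this report.

```python
from datetime import datetime

# Timestamps copied from the report (all 07/19/2024, EDT), in 24-hour form.
FMT = "%Y-%m-%d %H:%M"
started = datetime.strptime("2024-07-19 01:10", FMT)
stable = datetime.strptime("2024-07-19 05:47", FMT)
resolved = datetime.strptime("2024-07-19 10:00", FMT)

# Elapsed time from incident start to each milestone.
time_to_stability = stable - started
time_to_resolution = resolved - started

print(time_to_stability)   # 4:37:00
print(time_to_resolution)  # 8:50:00
```

Stability was therefore restored about 4 hours 37 minutes after the incident began, and the incident was resolved after 8 hours 50 minutes.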

  

Incident description: 

On 07/19/2024 at 1:10 AM EDT, the ServiceChannel Database Administration (DBA) and Site Reliability Engineering (SRE) teams received alerts from step-based test monitors that multiple ServiceChannel systems were failing their checks. Once alerted, the DBA and SRE teams immediately began investigating the cause.

Root Cause Analysis: 

A global outage caused by CrowdStrike, a third-party vendor providing a security Endpoint Detection and Response (EDR) platform, temporarily impacted the performance of ServiceChannel SaaS applications.

There was no security impact; the degradation of our services was caused by a third-party software component.

 

Actions Taken: 

  1. The DBA and SRE teams, in coordination with ServiceChannel Leadership, activated our business continuity and disaster recovery process, allowing business-critical systems to continue operating.
  2. Analysis determined that the issue lay with the CrowdStrike EDR platform, which ServiceChannel uses to detect cybersecurity events.
  3. Upon further investigation, the SRE team identified a mitigation strategy for each affected asset:  

    1. Take a snapshot of the boot drive for the affected asset. 
    2. Detach the impacted boot drive from the affected asset. 
    3. Attach each impacted boot drive to a recovery workstation. 
    4. Remove the corrupted CrowdStrike update file.
    5. Reattach the boot drive to the asset. 
    6. Restart and monitor for successful return to service.
  4. The ServiceChannel SRE team applied the mitigation across all affected assets. 
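
For illustration only, the per-asset mitigation steps above can be sketched as a small Python routine. This is not ServiceChannel's actual tooling: the `RecoveryOps` interface, its method names, and the `recover_asset` and `recover_all` helpers are all hypothetical, and each method stands in for whatever platform-specific operation performs that step.

```python
from typing import Dict, List, Protocol


class RecoveryOps(Protocol):
    """Hypothetical interface: one method per mitigation step in the report."""

    def snapshot_boot_drive(self, asset: str) -> None: ...
    def detach_boot_drive(self, asset: str) -> str: ...
    def attach_to_workstation(self, drive: str, workstation: str) -> None: ...
    def remove_corrupted_update(self, drive: str, workstation: str) -> None: ...
    def reattach_boot_drive(self, drive: str, asset: str) -> None: ...
    def restart_and_monitor(self, asset: str) -> bool: ...


def recover_asset(ops: RecoveryOps, asset: str, workstation: str) -> bool:
    """Run the six steps in order; return True if the asset returns to service."""
    ops.snapshot_boot_drive(asset)                       # 1. snapshot first, as a safety net
    drive = ops.detach_boot_drive(asset)                 # 2. detach the impacted boot drive
    ops.attach_to_workstation(drive, workstation)        # 3. attach it to a recovery workstation
    ops.remove_corrupted_update(drive, workstation)      # 4. remove the corrupted update file
    ops.reattach_boot_drive(drive, asset)                # 5. reattach the drive to the asset
    return ops.restart_and_monitor(asset)                # 6. restart and monitor


def recover_all(ops: RecoveryOps, assets: List[str], workstation: str) -> Dict[str, bool]:
    """Apply the mitigation to every affected asset and record each outcome."""
    return {asset: recover_asset(ops, asset, workstation) for asset in assets}
```

Taking the snapshot before any drive is detached keeps a restore point available if a later step fails, which is why it comes first in the sequence.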

 

Mitigation Measures:    

  1. Work with CrowdStrike to implement any availability remediation advice related to its EDR platform.
  2. Investigate additional technologies, techniques, and capabilities to improve our DR solution and reduce recovery times for secondary systems.
Posted Jul 26, 2024 - 09:46 EDT

Resolved
This incident has been resolved. All services are working as expected.
Posted Jul 19, 2024 - 10:00 EDT
Monitoring
We have restored functionality to impacted services and are monitoring the results to ensure there are no further issues.
Posted Jul 19, 2024 - 05:47 EDT
Investigating
We are actively investigating degraded system performance. An update will be provided shortly. Thank you for your patience.
Posted Jul 19, 2024 - 04:42 EDT
This incident affected: Analytics (Data Direct, Analytics Dashboard) and API (Universal Connector).