Legacy Jobs Outage (Supernurse) -- June 27, 2024

Production Outage Report

Cloudflare changes to rate limiting rules interrupted cross-service communication, resulting in site outages.

Date and Time of Outage:

Date: 6/27/24

Three total outages:

  • 10:09AM-10:12AM

  • 2:55PM-2:56PM

  • 3:13PM-3:14PM

Services Affected: jobs (supernurse/legacy jobs service only, not the new Nuxt application)

Severity Level:

Critical (Severity 1):
  • Definition: A critical outage in production means a complete service failure or shutdown. The application or service is entirely unavailable, or a critical feature is non-functional, leading to a total disruption of business operations.

  • Impact: Affects all or the majority of users. Business operations are halted or severely impacted.

  • Response: Requires immediate and continuous attention until resolved. Highest priority for the team.

High (Severity 2):
  • Definition: This level indicates a significant degradation in the production service. Major functionalities are impaired, causing a serious impact on business operations and user experience.

  • Impact: Affects a large number of users, though not everyone. Critical functionalities are disrupted but the service is not completely down.

  • Response: High priority for resolution, with swift action required to restore full functionality.

Moderate (Severity 3):
  • Definition: A moderate severity outage involves partial disruption in the production environment. Some features may not work as expected, but core functionalities remain operational.

  • Impact: Affects a moderate number of users. The issue is disruptive but does not critically impair business operations.

  • Response: Should be addressed promptly, but it can be scheduled based on resource availability and other priorities.

Low (Severity 4):
  • Definition: This severity level is for minor issues in the production environment that have a low impact on business processes. These might include non-critical bugs or performance issues.

  • Impact: Affects a small number of users and has minimal impact on overall business operations. Core functionalities are not impacted.

  • Response: These issues can be addressed during regular maintenance windows or in the next scheduled update cycle.

Incident Description:

We saw three brief service interruptions for nurse jobs over the course of the day, each lasting roughly one to three minutes (about five minutes in total). We are not aware of any customer service reports related to these outages.

Console logs showed HTTP status code 429 (Too Many Requests).

Root Cause Analysis:

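Cloudflare rule changes made in preparation for Cloudflare upgrades were only partially completed by Enterprise Cloud Management (ECM). The resulting rate limiting rules rejected cross-service requests with 429 responses, which interrupted the legacy jobs service.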

Resolution and Recovery:

CHM ticket for initial Cloudflare rule changes:

Relevant ticket for recent changes to the Cloudflare rules for managing bot traffic:

@Christian Roberts added a custom user agent to Apollo traffic (see PR here)
@Jason Smith added a rule in Cloudflare to exclude the custom user agent
@Christian Roberts requested that @Jason Smith increase the rate limit from 40 req/s to 100 req/s

Site has remained stable since this final change.
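
To illustrate why the steps above helped, here is a minimal sketch of a rate limiting rule with a user agent exclusion. It is a toy fixed-window counter, not Cloudflare's actual algorithm or configuration. The user agent string, the burst size of 60 requests, and the assumption that the exclusion exempts traffic from rate limiting are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RateLimitRule:
    """Toy fixed-window limiter (illustrative only, not Cloudflare's algorithm)."""
    limit_per_second: int
    exempt_user_agents: frozenset = frozenset()
    # Requests counted per one-second window, shared by all non-exempt traffic.
    _counts: dict = field(init=False, default_factory=lambda: defaultdict(int))

    def check(self, timestamp: float, user_agent: str) -> int:
        """Return the HTTP status a request would receive: 200 or 429."""
        if user_agent in self.exempt_user_agents:
            return 200  # excluded traffic is never counted against the limit
        window = int(timestamp)  # bucket requests by whole second
        self._counts[window] += 1
        return 200 if self._counts[window] <= self.limit_per_second else 429


def rejected_in_burst(rule: RateLimitRule, n_requests: int, user_agent: str) -> int:
    """Send n_requests inside a single second; return how many received 429."""
    statuses = [rule.check(i / n_requests, user_agent) for i in range(n_requests)]
    return statuses.count(429)


# Hypothetical values: the real user agent and burst size are not in this report.
SERVICE_UA = "internal-apollo-client"
BURST = 60

original = RateLimitRule(limit_per_second=40)
with_exclusion = RateLimitRule(limit_per_second=40,
                               exempt_user_agents=frozenset({SERVICE_UA}))
raised_limit = RateLimitRule(limit_per_second=100)

print(rejected_in_burst(original, BURST, SERVICE_UA))        # 20
print(rejected_in_burst(with_exclusion, BURST, SERVICE_UA))  # 0
print(rejected_in_burst(raised_limit, BURST, SERVICE_UA))    # 0
```

Under these assumptions, a burst of service-to-service requests above 40 req/s is partly rejected with 429, while either excluding the identifiable user agent or raising the limit to 100 req/s lets the same burst through.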

Data and Metrics:


image (9).png

Recommendations and Preventative Measures:

Because Jason is now actively involved with Enterprise Cloud Management (ECM), he was added to the appropriate Slack channels for proactive communication about future changes. The incident was discussed with @David Atkinson, along with how future ECM changes should be communicated; he stated he would bring this to his larger team meeting for further process discussion.

Report Prepared By:

@Ashley Edds