Nurse.com Jobs Memory Errors -- June 27, 2024

Production Outage Report

Memory errors in the Jobs Nuxt app caused intermittent failures for users

Date and Time of Outage:

Date: 6/27/24

Time outage was reported: During 2PM release call

Date/Time outage was resolved: 4:49PM

Services Affected:

Nurse.com Nuxt App (Job Board)

Severity Level:

Critical (Severity 1):
  • Definition: A critical outage in production means a complete service failure or shutdown. The application or service is entirely unavailable, or a critical feature is non-functional, leading to a total disruption of business operations.

  • Impact: Affects all or the majority of users. Business operations are halted or severely impacted.

  • Response: Requires immediate and continuous attention until resolved. Highest priority for the team.

High (Severity 2):
  • Definition: This level indicates a significant degradation in the production service. Major functionalities are impaired, causing a serious impact on business operations and user experience.

  • Impact: Affects a large number of users, though not everyone. Critical functionalities are disrupted but the service is not completely down.

  • Response: High priority for resolution, with swift action required to restore full functionality.

Moderate (Severity 3):
  • Definition: A moderate severity outage involves partial disruption in the production environment. Some features may not work as expected, but core functionalities remain operational.

  • Impact: Affects a moderate number of users. The issue is disruptive but does not critically impair business operations.

  • Response: Should be addressed promptly, but it can be scheduled based on resource availability and other priorities.

Low (Severity 4):
  • Definition: This severity level is for minor issues in the production environment that have a low impact on business processes. These might include non-critical bugs or performance issues.

  • Impact: Affects a small number of users and has minimal impact on overall business operations. Core functionalities are not impacted.

  • Response: These issues can be addressed during regular maintenance windows or in the next scheduled update cycle.

Incident Description:

During the release, monitoring alerts flagged a higher-than-expected error rate caused by excessive memory usage, which was producing system failures. We initially increased the memory limits, but the problem persisted. Doubling the original memory allocation temporarily restored functionality, but the CPU limits were then reached instead.

Root Cause Analysis:

We investigated the issue by searching error logs, reviewing git history, and examining machine usage without finding an immediate explanation. However, we identified a sudden spike in requests per second starting around 11:15 PM the previous night. Further analysis revealed that an earlier configuration change in our application caused a Set-Cookie header to be sent to clients, making the pages uncacheable by Cloudflare. This resulted in over 450,000 direct requests to the origin server in the preceding 24 hours.
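The Cloudflare behavior at the heart of this incident can be illustrated with a small sketch: by default, Cloudflare will not serve a cached copy of a response that carries a Set-Cookie header (or one marked private/no-store). The helper below is hypothetical and only models that rule; it is not our production code.

```javascript
// Hypothetical helper modeling Cloudflare's default rule: a response with a
// Set-Cookie header (or private/no-store cache directives) is not cached.
function isCacheableByCloudflare(headers) {
  // HTTP header names are case-insensitive; normalize before checking.
  const normalized = Object.fromEntries(
    Object.entries(headers).map(([name, value]) => [name.toLowerCase(), value])
  );
  if ('set-cookie' in normalized) return false; // cookie forces a cache bypass
  const cc = (normalized['cache-control'] || '').toLowerCase();
  if (cc.includes('private') || cc.includes('no-store')) return false;
  return true;
}

// The misconfigured response: an unconditional cookie made every page a miss.
console.log(isCacheableByCloudflare({
  'Set-Cookie': 'session=abc',
  'Cache-Control': 'public, max-age=300',
})); // false
console.log(isCacheableByCloudflare({
  'Cache-Control': 'public, max-age=300',
})); // true
```

This is why a single configuration change was enough to redirect the entire job-board traffic volume from Cloudflare's edge to our origin.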

Resolution and Recovery:

The excessive load caused instability and downtime, affecting all users and services relying on the server. This issue was exacerbated by the recent success of our SEO efforts, which significantly increased traffic to our site. If this had occurred back in February, when traffic was lower, the impact would have been almost non-existent.

To resolve the issue, we modified the application configuration to stop sending the Set-Cookie header, allowing Cloudflare to cache the pages again. This restored load to normal levels and stabilized the site. The timeline involved initial memory adjustments and investigations, followed by the identification and rectification of the caching issue.
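The effect of the fix can be sketched as follows. Our actual change was made in the Nuxt app's configuration; this hypothetical function only models the outcome, which is that responses for CDN-cacheable pages no longer carry a Set-Cookie header.

```javascript
// Hypothetical sketch of the fix's effect: drop Set-Cookie from responses
// that should be CDN-cacheable, so Cloudflare can cache them again.
// (Illustrative only; the real change lived in the Nuxt app configuration.)
function makeCdnCacheable(headers) {
  const cleaned = {};
  for (const [name, value] of Object.entries(headers)) {
    if (name.toLowerCase() === 'set-cookie') continue; // strip the cookie
    cleaned[name] = value;
  }
  return cleaned;
}

const before = {
  'Content-Type': 'text/html',
  'Set-Cookie': 'session=abc',
  'Cache-Control': 'public, max-age=300',
};
console.log(Object.keys(makeCdnCacheable(before))); // [ 'Content-Type', 'Cache-Control' ]
```

Once the header was gone, Cloudflare resumed absorbing the bulk of the request volume and origin load returned to normal.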

Recommendations and Preventative Measures:

To prevent similar incidents in the future, we plan to implement stricter monitoring and alerts for caching and request spikes, and conduct regular reviews and audits of configuration changes. The incident highlighted the importance of monitoring caching mechanisms and ensuring fallback plans for unexpected spikes in requests.
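One form the proposed monitoring could take is tracking the CDN cache hit ratio and alerting when it collapses, since a sudden drop in cached responses is exactly the signature of this incident. The sketch below is an assumption about how such an alert might work (the 50% threshold and the sampling of Cloudflare's cf-cache-status header are illustrative, not an existing alert):

```javascript
// Hypothetical monitoring sketch: sample the cf-cache-status response header
// and alert when the cache hit ratio falls below a threshold.
// Cloudflare reports HIT for edge-cached responses; values such as
// MISS, BYPASS, and DYNAMIC indicate the request reached the origin.
function cacheHitRatio(samples) {
  if (samples.length === 0) return 0;
  const hits = samples.filter((s) => s.toUpperCase() === 'HIT').length;
  return hits / samples.length;
}

// Threshold of 0.5 is an illustrative assumption, not a tuned value.
function shouldAlert(samples, threshold = 0.5) {
  return cacheHitRatio(samples) < threshold;
}

console.log(shouldAlert(['HIT', 'HIT', 'MISS', 'HIT']));          // false (75% hit)
console.log(shouldAlert(['BYPASS', 'BYPASS', 'BYPASS', 'HIT'])); // true  (25% hit)
```

An alert like this would have fired near 11:15 PM, hours before the memory errors surfaced on the release call.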

Report Prepared By:

@Christian Roberts (@James Tharpe assisted in investigation and resolution)