...
Internal Communication: We should set up an internal communication protocol for whenever downtime is expected. Every team, irrespective of its day-to-day role, should be made aware of when and why the downtime will occur, for example through an email chain or a dedicated Teams channel.
Customer Communication: Clear communication with our users ahead of any planned downtime, delivered via email, pop-up banners or notifications on the website, and our social media channels, can drastically reduce confusion and frustration. An outline of what to expect, together with assurance that we are working to restore the service as soon as possible, can help maintain our customers' trust.
Website Maintenance Planning: Scheduling maintenance work during off-peak hours can drastically reduce the impact of downtime.
CURRENT | Current Process | Owner | Notes | Enhancement |
---|---|---|---|---|
Planned Releases/Outages | | | | |
Release scheduled/approved | Deployment window scheduled around 9pm | EM | With deployments there is generally about 20-30 mins of downtime. Does this outage time vary? | |
During deployment | website inaccessible for users | There should be a website maintenance message that users see. This is from webscale | | |
Pre release communication | Email to stakeholder teams notifying of upcoming deployment and downtime window | PM current | Should this be coming from delivery managers? Who receives today? -marketing? -support? -product leaders? | |
Post release communication | EM replies to distro notifying deployment is complete | see above | | |
Alert/notifications | #alerts channel notifies when sites are down and back up. | EMs or someone from engineering will generally communicate when due to release | | |
Unplanned Outages/downtime | | | | |
Alert/notifications | #alerts in slack notify when community sites are down (and back up again) | who monitors these channels? What is escalation path here when there is outage? | | |
Communication? | Engineering Incident Manager- Teams Channel | seems to be used as a forum for communicating unplanned outages and resolution. | | |
Incident report published | | | | |
Website Unavailable message:
...