Incident Management: Best Practices

There are roughly 5 stages of an incident:

1. Assess impact

2. Inform customers (statuspage)

3. Identify the issue

4. Mitigate the issue

5. Resolve the incident

Then there’s follow-up and further work. It’s also important to note that (2) should be ongoing as you progress through the other stages.

Update the status page at reasonable intervals, e.g. every 15-20 mins unless you specify otherwise. Your users need to know what’s going on, and not updating them frequently about progress is super insulting. It comes across as “Hush, the engineers are talking”.

Often, you wanna update every 15 mins for a while until you get to a point where there’s no new information, and then back off to maybe an hour or more, as long as you _tell_ people what that period is. It sets expectations and means each update is still important.
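The cadence above can be sketched in a few lines of Python. This is only an illustration: the names `ACTIVE_INTERVAL`, `QUIET_INTERVAL`, `next_update_due` and `with_next_update` are made up for this example, and the intervals are simply the 15 mins and one hour mentioned above, not a rule.

```python
from datetime import datetime, timedelta, timezone

# Assumed intervals, taken from the text: every 15 minutes while things are
# changing, then roughly an hour once there is no new information.
ACTIVE_INTERVAL = timedelta(minutes=15)
QUIET_INTERVAL = timedelta(hours=1)


def next_update_due(last_update: datetime, has_new_info: bool) -> datetime:
    """Return the time by which the next status page update is owed."""
    interval = ACTIVE_INTERVAL if has_new_info else QUIET_INTERVAL
    return last_update + interval


def with_next_update(message: str, last_update: datetime, has_new_info: bool) -> str:
    """Append the promised time of the next update to a status message."""
    due = next_update_due(last_update, has_new_info)
    return f"{message} Next update by {due:%H:%M} UTC."


if __name__ == "__main__":
    posted = datetime(2019, 7, 11, 14, 0, tzinfo=timezone.utc)
    print(with_next_update("We are still investigating.", posted, has_new_info=False))
```

The point is the last sentence of each message: readers are told when to expect the next update, so backing off to a longer interval doesn’t read as silence.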

Personally, I’m not a big fan of apologetic language in a statuspage. Maybe in the last update, with the promise of a postmortem, but DEFINITELY NOT all the way through. Apologies are hollow. Write a postmortem, publish it, and confirm follow-ups to prevent the incident from happening again.

“We’re happy to report…” In the words of my manager: “Don’t say that, we’re not cracking open a cold one with the boys while we resolve this”

The statuspage should be blameless. We talk about blameless postmortems but holy shit the number of status updates I see that are like “an engineer pushed a change” or “someone (we won’t say who) made a manual change”. That’s singling someone out. It’s not blameless.

While updating periodically, be as specific as is relevant. What do your users actually care about? What will make sense without detailed knowledge? Timestamp each update with the time the thing it describes actually happened. When you publish it isn’t when it happened, innit.
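One way to keep those two times apart is to store both and only show readers the one they care about. A rough Python sketch, assuming a hypothetical `StatusUpdate` record (not the data model of any real status page product):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StatusUpdate:
    """A single status page entry (hypothetical structure for illustration)."""

    message: str
    occurred_at: datetime   # when the thing being described actually happened
    published_at: datetime  # when the update was posted

    def render(self) -> str:
        # Show readers the event time, not the publish time.
        return f"[{self.occurred_at:%H:%M} UTC] {self.message}"


if __name__ == "__main__":
    update = StatusUpdate(
        message="Error rates returned to normal.",
        occurred_at=datetime(2019, 7, 11, 14, 5, tzinfo=timezone.utc),
        published_at=datetime(2019, 7, 11, 14, 20, tzinfo=timezone.utc),
    )
    print(update.render())
```

Keeping `published_at` around is still useful internally, e.g. for checking that updates went out at the promised intervals.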

When dealing with an incident, being the primary oncall doesn’t mean you have to be Incident Commander. Being part of a good team means understanding people’s strengths, and it’s okay to admit you’re not at your best when in charge of delegation.

(This is taken from a Twitter thread and is copied with the author’s permission: https://twitter.com/TheJokersThief/status/1149408411949379585)

By Nawaz Dhandala

Founder and CEO of HackerBay, a company that builds products like Fyipe and CloudBoost. I'm currently working with the most talented team in Enterprise Software. We're working on products that help enterprises build software faster and help them maintain it.
