
How to create an on-call schedule that doesn’t suck.

Tips on how to build an effective and sustainable on-call schedule that your team loves.

“On-call doesn’t have to be sucking the life out of employees. There’s another side to it as well. A better one.” Picture credit: Unsplash

Many tech companies struggle to create an effective, efficient on-call schedule for their product or service, which leads to much longer downtime when something goes wrong. They often overburden team members with repeated on-call duty, which causes fatigue.

An on-call schedule ensures that someone competent is available to bring services back up if they go down, so that customers can keep using your product or service without trouble. Though on-call isn’t a new concept in the world of DevOps and IT Ops, the execution and roles have evolved greatly over the years. Here’s how to create an on-call schedule that your team might just love.

How has on-call evolved over the years?

Earlier, being on-call and resolving issues as they occurred was the sole responsibility of sysadmins and operations engineers. With the evolution of DevOps, many software developers now take part in an on-call rotation as well, and this has worked well for most companies.

On-call schedules used to be created in spreadsheets (some teams still use them) and handed to the team without checking anyone’s availability, since it was practically impossible to account for everyone. The person on-call had to be available at the assigned time or day. The schedule lacked flexibility, and it was a nightmare to find a replacement if the person on-call had an emergency. It was also a hassle to find someone who could step in if the person on-call couldn’t resolve the issue on their own.

Thanks to ops platforms like Fyipe, which has a built-in on-call scheduling feature, we no longer need to worry about creating schedules in a spreadsheet and alerting the person on-call.

What remains unchanged, however, is the negative attitude towards being on-call. No one wanted to be on-call then and no one wants it now, but it’s an absolute necessity.

Being on-call doesn’t have to suck! An effective on-call schedule goes a long way towards clearing the negative air around being on-call and keeps your engineers happy. A happy on-call team results in happy customers.

The only way to achieve this without burning out your engineers is to make sure the schedule protects their work-life balance and doesn’t exhaust any single person.

Why do you need to have someone on-call?

Putting someone on-call is probably the first step an organization takes towards improving availability and reliability for its customers or users. On-call engineers are the last line of defense against customer-impacting outages and ensure that issues are resolved as quickly as possible. You need to be there when your customers need you. On-call ensures this.

“If the idea of being ‘on-call’ sucks for your team, it means they are responding negatively to a symptom.

The cause is less systemic and more a reflection of the team/org’s basic engineering prowess.

An organization should have a ‘No Downtime’ engineering and ops process in place. Having an on-call schedule for your team is an emergency last line of defense against downtime.”

Who should be on your on-call team?

Here’s an interesting story. About 11 years ago, Google came up with a new strategy for production management. It realized that as R&D pushed more and more features to production, operations engineers were having a tough time keeping production stable. The two teams were pulling in opposite directions, and this led to rising tensions, especially because they had different skill sets, backgrounds, incentives, and metrics.

To bridge the gap between the two groups, Ben Treynor, one of Google’s ops leaders, came up with an innovative solution that led to the creation of a new team at Google called Site Reliability Engineering, or SRE for short. The team was made up of 50% sysadmins and 50% software engineers. This improved operations efficiency many times over.

Many companies have followed similar lines, and we have seen them succeed with this strategy over the years. It makes a lot of sense to include the engineers who worked on the code in the on-call team, for the following reasons:

  1. They have a deeper understanding of the code or feature they worked on, so if an issue is caused by the code, they can fix it much faster.
  2. Engineers get exposure to ops processes as well. This gives them a holistic view of how a given coding practice behaves in the production environment and thus helps them write better code.

How should you create an on-call schedule?

Creating a schedule for on-call rotation primarily depends on the following factors:

  1. Team size
  2. Geographical distribution of the team
  3. Feature-wise or service-wise distribution of teams

Creating rotation plans based on team size

When you are a solo flyer (Team size = 1)

When you are a single-person team, creating an on-call schedule is a no-brainer. You are most likely just starting out as a startup and are the only person responsible for everything that goes on in your company. Hence, you need to be available whenever you are alerted and be on-call 24x7, 365 days a year.

Starting a company is tough, and you will be drained to the core by the time you leave for the day. This might even cause you to miss alerts and calls.

Our advice is to ask an ex-colleague or workmate to act as backup and add them as a secondary on-call person, so that if you don’t wake up and acknowledge the alert, they can jump on a call and fix the issue or notify you immediately. We highly recommend using on-call software, because it keeps calling you in a loop until the issue is acknowledged or resolved.

When Team size = 2

Approach 1: Alternate the primary on-call every other day. Let your peer choose between MWF (Monday, Wednesday and Friday) and TTS (Tuesday, Thursday and Saturday), and split the Sundays of each month so that one of you takes the odd-numbered Sundays and the other takes the even-numbered ones. On any given day, the person who owns that day (Person A) is primary and is alerted first; the other person (Person B) is called when Person A misses notifications or does not acknowledge incidents.

Primary on-call members are the ones who are alerted first. If they do not pick up the call or respond to alerts, the secondary on-call members are alerted.
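
To make Approach 1 concrete, here is a minimal Python sketch of the alternate-day split. It assumes Person A takes Monday, Wednesday and Friday plus the odd-numbered Sundays of each month, and Person B takes the rest. The names `primary_on` and `secondary_on` are made up for illustration and are not part of any on-call tool.

```python
from datetime import date

# Assumed split: "A" picked MWF, "B" picked TTS (hypothetical names).
WEEKDAY_PRIMARY = {
    0: "A",  # Monday
    1: "B",  # Tuesday
    2: "A",  # Wednesday
    3: "B",  # Thursday
    4: "A",  # Friday
    5: "B",  # Saturday
}


def primary_on(day: date) -> str:
    """Return who is primary on-call on the given day."""
    if day.weekday() == 6:  # Sunday
        # 1 for the first Sunday of the month, 2 for the second, and so on.
        sunday_of_month = (day.day - 1) // 7 + 1
        return "A" if sunday_of_month % 2 == 1 else "B"
    return WEEKDAY_PRIMARY[day.weekday()]


def secondary_on(day: date) -> str:
    """With two people, the secondary is simply the other person."""
    return "B" if primary_on(day) == "A" else "A"
```

For example, `primary_on(date(2024, 1, 7))` falls on the first Sunday of January 2024, so under these assumptions it returns `"A"`.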

Create an on-call schedule for primary and secondary rotations

Approach 2: You can also rotate weekly. You are the primary on-call person and your partner acts as secondary on-call for the week. The following week, the roles swap: your partner is primary and you act as the secondary on-call person.

We recommend weekly rotations.

The purpose of having a secondary on-call is to ensure that your partner gets alerted if you miss the alert, so that at least one of you starts working on the issue.

When Team size = 3

Let’s say there are just three people on the on-call team: A, B and C.

Approach: We found that the best approach to on-call rotation here is to do it weekly.

Let’s say that for week 1, A is the primary on-call person, B is the secondary on-call person and C is the backup. When an alert is triggered, A receives it first. The alert can take the form of a call, SMS, email or even a Slack message, depending on the preference set.

Primary, Secondary and Backup Team

Ideally, A should be available to acknowledge the alert and start working on it. If A misses the alert for some reason, it goes to B, who can then start working on the issue. If neither A nor B acknowledges the alert, it finally goes to C, the backup.
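
The escalation order described here can be sketched in a few lines of Python. This is only an illustration: `acknowledges` is a hypothetical callback standing in for "send the alert and wait for an acknowledgement", not a real alerting API.

```python
from typing import Callable, Optional, Sequence


def escalate(chain: Sequence[str],
             acknowledges: Callable[[str], bool]) -> Optional[str]:
    """Alert each person in order and return the first one who acknowledges."""
    for person in chain:
        if acknowledges(person):
            return person
    # Nobody answered; a real on-call tool would typically keep retrying.
    return None


# Week 1 from the example: A is primary, B is secondary, C is backup.
chain = ["A", "B", "C"]
available = {"B", "C"}  # assumption for the demo: A missed the alert
owner = escalate(chain, lambda person: person in available)
print(owner)  # B
```

Keeping the chain as an ordered list means a rotation only has to reorder the names; the escalation logic itself never changes.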

On week 2, B who was the secondary on-call person in week 1 replaces A as the primary on-call person and C replaces B to become the secondary on-call person. Now A becomes the backup on-call person for week 2.

On week 3, C replaces B, who was the primary on-call person for week 2, to become the primary on-call for week 3. A jumps up to become the secondary on-call and B becomes the backup for week 3.

On week 4, A becomes the primary on-call person, B becomes secondary and C becomes back up again and this cycle goes on.
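
The three-person cycle above amounts to rotating the list of names by one position each week. A small sketch, assuming the names A, B and C from the example (the function name `three_person_rotation` is hypothetical):

```python
from collections import deque
from typing import List, Tuple


def three_person_rotation(weeks: int) -> List[Tuple[str, str, str]]:
    """Return (primary, secondary, backup) for each week, starting at week 1."""
    order = deque(["A", "B", "C"])  # primary, secondary, backup in week 1
    schedule = []
    for _ in range(weeks):
        primary, secondary, backup = order
        schedule.append((primary, secondary, backup))
        # Secondary moves up to primary, backup moves up to secondary,
        # and last week's primary drops to backup.
        order.rotate(-1)
    return schedule


for week, roles in enumerate(three_person_rotation(4), start=1):
    print(week, roles)
```

Week 4 comes out the same as week 1, which matches the cycle described in the text.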

When Team size = 4 and above

If your team size is 4 members or above, we have seen that the best strategy is to have weekly rotations.

Let’s say there are 6 people on the team: A, B, C, D, E, and F. The rotation must be fair to everyone while reducing the stress of being available all the time. This can be done by ensuring that everyone works as primary, secondary and backup on-call for the same amount of time.

Approach:

Week 1: Let’s say ‘A’ works as the primary on-call, ‘B’ works as secondary on-call and ‘C’ works as the backup.

Primary on-call: A
Secondary on-call: B
Backup: C

Free from all on-call responsibility: D, E and F

Week 2: ‘A’ is relieved of on-call duty after acting as primary, the role with maximum responsibility, for a week. It’s time for ‘B’ to replace ‘A’ as primary for this week, and ‘C’ replaces ‘B’ as secondary on-call. Since the backup position is now empty and ‘D’ hasn’t had any responsibilities yet, ‘D’ acts as backup for this week.

Primary on-call: B
Secondary on-call: C
Backup: D

Free from all on-call responsibility: A, E and F

Week 3: ‘B’ is relieved of all on-call responsibilities after working as primary on-call for a complete week. ‘C’ becomes primary on-call, ‘D’ becomes secondary on-call and ‘E’ becomes backup.

Primary on-call: C
Secondary on-call: D
Backup: E

Free from all on-call responsibility: A, B and F

Weekly Rotations

Week 4: Now ‘D’ replaces ‘C’ and becomes the primary on-call. ‘E’ becomes secondary and ‘F’ becomes the backup.

Primary on-call: D
Secondary on-call: E
Backup: F

Free from all on-call responsibility: A, B and C

As we can see, every week each person moves up a notch in the responsibility cycle, from backup to secondary and from secondary to primary on-call. As this happens, the person with maximum responsibility moves out of the cycle and stays out until everyone else has taken a turn as backup on-call. After that, the people who exited the cycle first re-enter it first, and so on.
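
The same idea generalizes to any team of three or more: the three roles form a window that slides one position along the team list each week and wraps around at the end. The sketch below assumes a fixed team order and exactly three roles; `week_assignment` is an illustrative name, not a function from any scheduling product.

```python
from typing import Dict, List, Tuple

ROLES = ("primary", "secondary", "backup")


def week_assignment(team: List[str], week: int) -> Tuple[Dict[str, str], List[str]]:
    """Return (role -> person, people free of on-call duty) for a 1-based week."""
    n = len(team)
    if n < len(ROLES):
        raise ValueError("this sketch needs at least three people")
    start = (week - 1) % n  # index of this week's primary
    on_duty = [team[(start + i) % n] for i in range(len(ROLES))]
    roles = dict(zip(ROLES, on_duty))
    free = [person for person in team if person not in on_duty]
    return roles, free


team = ["A", "B", "C", "D", "E", "F"]
for week in range(1, 7):
    roles, free = week_assignment(team, week)
    print(f"Week {week}: {roles} free: {free}")
```

Over one full cycle of six weeks, each person holds each role exactly once, which is the fairness property the rotation is meant to provide.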

Tip: Some teams follow the same rotation but rotate every day instead of every week. This works well when the team is large, say 30 or more, and everyone is aware of the schedule. In smaller teams, however, it creates a lot of tension among engineers, who dislike being interrupted every now and then. It also affects their work-life balance.

With weekly rotation, engineers are mentally and physically prepared to be available for a week and can focus on their own tasks in the other 3 weeks of the month. This has proved to be a better plan for most of our clients with teams of this size.

Feature-wise distribution of teams

It makes sense for a member of the team responsible for rolling out a feature to be responsible for maintaining it as well. An on-call person who already knows how the feature is designed and how the code is written is much faster at resolving an issue and at preventing it in the future.

“Having someone on-call as a backup from the team which has developed the feature helps in faster resolution of issues.”

Hence, an on-call team must always include at least one person from the team that rolled out the feature, as backup or secondary on-call.

Important: Sometimes a team works on multiple features, so on-call schedules based on features might clash. In such a case, it’s important that a single person doesn’t act as primary on-call for two features. Where possible, each of the two on-call schedules should also include at least one person who isn’t part of the other schedule.

This makes the setup fail-safe: even in the worst-case scenario, there will be two different people on-call for two different features.
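
These two rules are easy to check automatically before a schedule is published. The sketch below is one possible reading of them: no shared primary between two features, and each schedule has at least one person the other does not. The data layout and the function name `check_feature_schedules` are assumptions made for the example.

```python
from typing import Dict, List


def check_feature_schedules(schedules: Dict[str, Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the schedules pass.

    `schedules` maps a feature name to its role assignments, for example
    {"search": {"primary": "A", "secondary": "B", "backup": "C"}}.
    """
    problems = []
    features = list(schedules)
    for i, first in enumerate(features):
        for second in features[i + 1:]:
            a, b = schedules[first], schedules[second]
            if a["primary"] == b["primary"]:
                problems.append(
                    f"{a['primary']} is primary for both {first} and {second}"
                )
            people_a, people_b = set(a.values()), set(b.values())
            # Each schedule should have someone who is not on the other one.
            if not (people_a - people_b) or not (people_b - people_a):
                problems.append(
                    f"{first} and {second} have no member unique to each schedule"
                )
    return problems
```

Returning a list of messages rather than a single boolean makes it easier to show the schedule designer exactly which pair of features clashes.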

Accounting for Geographical distribution of teams

Enterprise companies have large teams that may be geographically distributed. Companies of this size typically base their on-call on the follow-the-sun model. This ensures that on-call hours don’t extend beyond office hours and helps maintain work-life balance.

Let’s say a team is split into two subgroups working in two geographical regions, the US and India. Proper on-call scheduling and rotation ensures that the Indian team receives alerts when the US team is outside working hours and, similarly, that the US team receives alerts when the team in India is offline.

Alerts customized like this prevent burnout for members in either region. However, a problem arises when only the members of one region’s team are scheduled to receive alerts and work on issues at a given time of day.

“Alerts customized according to the time zone prevent burnout for members in either of the regions.”

Warning: Sometimes a team needs information, such as logs, that is available only to the team in the other time zone. In such a scenario, it is extremely difficult to get someone from that team back to the office in the middle of the night to provide the information so that the issue can be resolved on time.

Approach: We have found that the best way to avoid such situations is to have someone from the team in the other time zone on call as well. This can be further optimized by deciding the schedule based on the priority of the incident. For low or moderate priority events, the alert to the team in the other time zone can be skipped, and a member in the same time zone acts as backup.

For high priority incidents, alerts can be sent to team members in the other time zone as well.

High priority: critical incidents

Geo distributed teams.

Scenario 1: When it’s day in the US and night in India and an issue occurs.

Primary on-call: Member 1 of the team in the US
Secondary on-call: Member 2 of the team in the US
Backup: Member 3 of the team in India
Secondary Backup: Member 2 of the team in India (in case the backup misses the alert)

Scenario 2: When it’s day in India and night in the US.

Primary on-call: Member 1 of the team in India
Secondary on-call: Member 2 of the team in India
Backup: Member 3 of the team in the US
Secondary Backup: Member 2 of the team in the US (in case the backup misses the alert)

Low priority: low impact incidents

Geo distributed teams.

Scenario 1: When it’s day in the US and night in India and an issue occurs.

Primary on-call: Member 1 of the team in the US
Secondary on-call: Member 2 of the team in the US
Backup: Member 3 of the team in the US

Scenario 2: When it’s day in India and night in the US.

Primary on-call: Member 1 of the team in India
Secondary on-call: Member 2 of the team in India
Backup: Member 3 of the team in India
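
The four scenarios above can be expressed as one small function that picks the alert order from the region currently in working hours and the incident priority. This is a sketch under the assumptions of the example (two regions, three members each); the roster entries and the name `escalation_chain` are placeholders.

```python
from typing import Dict, List

# Placeholder rosters; index 0 is "Member 1", index 1 is "Member 2", and so on.
TEAMS: Dict[str, List[str]] = {
    "US": ["US member 1", "US member 2", "US member 3"],
    "India": ["India member 1", "India member 2", "India member 3"],
}


def escalation_chain(daytime_region: str, priority: str) -> List[str]:
    """Return the alert order for the region that is currently in working hours."""
    other_region = "India" if daytime_region == "US" else "US"
    local, remote = TEAMS[daytime_region], TEAMS[other_region]
    if priority == "high":
        # Primary and secondary are local; the backup and the secondary
        # backup come from the team in the other time zone.
        return [local[0], local[1], remote[2], remote[1]]
    # Low or moderate priority: keep the whole chain in the local time zone.
    return [local[0], local[1], local[2]]


print(escalation_chain("US", "high"))
print(escalation_chain("India", "low"))
```

Only high priority incidents ever reach people in the other time zone, so a low priority alert never wakes anyone outside their working hours.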

Using the above methodology, you should now be able to design an on-call rotation schedule that works best for your team. But there are a few more things to take care of while creating an on-call team.

Tips to build an awesome on-call culture.

  1. Include only people who can independently work on and resolve issues related to code, servers, and the network. Having members in an SRE role, as Google does, is probably a much better idea than having just a sysadmin or a DevOps person on call.
  2. Make sure you poll your team members before finalizing a schedule. It’s always better to find a middle ground that serves both the firm and the engineers well. Even after implementation, regular feedback helps ensure that the team has no trouble following it.
  3. Ensure that the schedule takes care of the work-life balance of your employees. Ensure that they get enough sleep and have a healthy work environment.
  4. Make sure the person on-call isn’t burdened with anything else while they are on call. Extra work reduces their efficiency and is counterproductive.
  5. Help your team develop a culture of empathy. Your team members should care for each other, and they should learn it from you. Sometimes a person who is supposed to be on call has an emergency and can’t be available. In such a scenario, someone from the team should willingly volunteer to shoulder the responsibility and cover for that person. They shouldn’t be forced into it.
  6. Your schedule should be flexible enough to accommodate people who are ill, or who have just had a child or are about to have one. The schedule designer must account for these situations and adapt until the person is fit to start again. It’s always a good idea to keep a list of people who can replace the on-call person in case of emergencies.

We at Fyipe are helping hundreds of businesses across the globe run efficiently and reduce downtime to improve the customer experience every day. Want us to help you? Reach us here at Fyipe.

You can also talk to our engineers about how to create an on-call schedule for your organization. Please send an email to [email protected] and one of our engineers will be right with you.


By Nawaz Dhandala

Founder and CEO of HackerBay, a company that builds products like Fyipe and CloudBoost. I'm currently working with the most talented team in Enterprise Software. We're working on products that help enterprises build software faster and maintain it.
