Next Level Support Engineering (a.k.a. Incidents Management Engineering)

Senior Director, Engineering at Snapdocs

Problem

The role of Support Engineering is, for the most part, misunderstood across the industry. When a startup goes through a rapid growth phase, Support Engineering becomes critical. However, building a next-level Support Engineering function requires a mindset shift. Many companies fail because they cannot set it up correctly, and the right setup is quite different from how most managers would approach it. Most of our managers -- 40 to 50 at various levels -- were perplexed by what we were doing and dismissed the idea of what next-level Support Engineering should look like.

A year ago, I was tasked with restructuring our existing Support Engineering function. Because demand for our software surged following the emergence of Covid-19, our growth went from 2x to 30x. Consequently, the system started to break down -- more customers were onboarded, many more people were using our software, and far more support requests were submitted. Back then, we had a few engineers who also did support. They worked with the internal support team and customers to get issues resolved. That worked well initially, but it ceased to be sustainable once we started to grow rapidly.

Actions taken

I was never particularly pleased with the support function in my company. In my opinion, it was average at best. The first thing I did was step back and evaluate where we were and where we should be. That led me to redefine what Support Engineering was and what this function should do to address the new set of problems that emerged with our unprecedented growth. From mapping out what Support Engineering was supposed to do, I arrived at the kind of people we would need for it. The process had four main steps:

Incident management

For starters, if something unexpected happened, Support Engineering would be the tier-one response. But we also had to redefine what incident management was in the first place. It shouldn’t be reactive; on the contrary, it should be proactive and well planned. To make it that way, we designed end-to-end escalation policies, protocols, and runbooks and documented everything in our wiki. The documentation was shared across the organization, which helped educate other teams on their role in the escalation policies and protocols. For example, we established how Operations should communicate with us when there was a problem. We used Jira and one of the dashboards that allowed us to create custom fields for Operations to fill in, thus providing sufficient information for us to act. As soon as a ticket was completed, someone would immediately get paged. However, putting incident protocols in place was just the tip of the iceberg.
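The intake rule described above (a ticket must carry enough information before anyone is paged) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names in `REQUIRED_FIELDS`, the `Pager` class, and the `submit_ticket` function are hypothetical and do not reflect our actual Jira configuration or paging tool.

```python
from dataclasses import dataclass, field

# Hypothetical required fields; the real Jira custom fields are not listed here.
REQUIRED_FIELDS = ("summary", "affected_service", "customer_impact", "steps_to_reproduce")


@dataclass
class Pager:
    """Stand-in for a real paging integration; it only records pages in memory."""
    pages: list = field(default_factory=list)

    def page(self, on_call: str, ticket: dict) -> None:
        self.pages.append((on_call, ticket["summary"]))


def missing_fields(ticket: dict) -> list:
    """Return the required fields that are absent or blank in the ticket."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = ticket.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


def submit_ticket(ticket: dict, pager: Pager, on_call: str) -> bool:
    """Page the on-call engineer only when every required field is filled in."""
    if missing_fields(ticket):
        return False  # incomplete: send it back to the reporter instead of paging
    pager.page(on_call, ticket)
    return True
```

The design choice worth noting is that validation happens before paging, so the person who gets woken up always has the context they need to act.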

Program management

Once we resolved an incident, the logical next step was to ensure that that kind of incident would never happen again. So we established a program management team whose role was to do exactly that. They wasted no time: they would take the ticket, reach out to a couple of teams to learn who owned the problem, and see whether the owning team could prioritize it. Then they would put a date on it to make sure it was completed.

Since each team has its own charter and its own priorities, the program management team helps them prioritize the problem that was the root cause of the incident. If the team that owns the problem is overwhelmed with work, the program management team may decide to handle the problem themselves, but would still ask for support from the owning team.
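As a rough sketch of that follow-up loop, the snippet below records who owns a root cause, who will actually do the work, and the date it is due. Everything in it is an assumption made for illustration: the `FollowUp` record, the `plan_follow_up` and `overdue` helpers, the `"program-management"` label, and the 14-day default are hypothetical, not a description of our tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class FollowUp:
    incident_id: str
    root_cause: str
    owner_team: str   # team that owns the underlying problem
    handled_by: str   # team that will actually do the fix
    due: date         # date by which the fix should be completed


def plan_follow_up(incident_id: str, root_cause: str, owner_team: str,
                   owner_has_capacity: bool, today: date,
                   days_to_fix: int = 14) -> FollowUp:
    """Assign the fix to the owning team, or to program management if the owner is overloaded."""
    handled_by = owner_team if owner_has_capacity else "program-management"
    return FollowUp(incident_id, root_cause, owner_team, handled_by,
                    today + timedelta(days=days_to_fix))


def overdue(follow_ups: List[FollowUp], today: date) -> List[FollowUp]:
    """Return the follow-ups whose due date has already passed."""
    return [f for f in follow_ups if f.due < today]
```

Keeping `owner_team` separate from `handled_by` mirrors the point above: even when program management takes on the work, the owning team stays on the record and is still asked for support.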

Interception

To make our Support Engineering genuinely next level, we had to take a step back and build software that would allow us to intercept problems before our customers experienced them. For example, we built a service named Snoopy, so called because it snoops on other services; if something went wrong, Snoopy would alert the team that owned the service as well as us. By knowing about a problem before a customer does, we stay one step ahead.
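To make the interception idea concrete, here is a minimal Python sketch of a watcher in the spirit of Snoopy. It is not Snoopy's actual implementation: the `run_checks` function, the `SUPPORT_TEAM` name, the shape of the `checks` mapping, and the `notify` callback are all assumptions made for illustration. Each service is paired with its owning team and a probe that returns `True` when the service looks healthy.

```python
from typing import Callable, Dict, List, Tuple

SUPPORT_TEAM = "support-engineering"  # hypothetical team identifier


def run_checks(checks: Dict[str, Tuple[str, Callable[[], bool]]],
               notify: Callable[[str, str], None]) -> List[str]:
    """Probe every service and alert both the owning team and Support Engineering on failure.

    `checks` maps a service name to (owner_team, probe); `notify(team, message)`
    is whatever alerting hook is available. Returns the names of failing services.
    """
    failing = []
    for service, (owner_team, probe) in checks.items():
        try:
            healthy = probe()
        except Exception:
            # A probe that raises is treated the same as an unhealthy service.
            healthy = False
        if not healthy:
            failing.append(service)
            message = f"{service} failed its health check"
            notify(owner_team, message)
            notify(SUPPORT_TEAM, message)
    return failing
```

In practice a loop like this would run on a schedule, with probes that call a health endpoint or inspect a queue depth; those details are left out here because they depend on the services being watched.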

Building the team

As a team, we have multiple roles -- we are a product team that builds services to prevent issues, but we also take on work from other teams. We augment other teams, and by doing so, we also learn about their problems, which makes us competent in multiple domains. Clearly, we are quite different from an average product team, and we needed a different kind of people than an average product team does.

We looked across the organization, but we didn’t have the people we needed. Most people in product teams are focused on building features, shipping them, and repeating the cycle. With the goals that we set for Support Engineering, we had to bring in top-class people. We needed someone who knew our system well, so we decided to make our staff engineer the tech lead. I reached out to them and explained the charter and why we thought they would be the best choice. They bought in, excited about the uncharted waters they were entering, and we managed to add two more senior engineers. We couldn’t immediately add mid-level engineers because they were still focused on building features. Being an engineer on the support engineering team requires curiosity and enthusiasm to begin with. When debugging, an engineer should be genuinely curious about how something works rather than assume it.

Therefore, we decided to mix our top engineers with juniors. We blended people with knowledge and experience with those who are curious and have nerves of steel. We figured that if something unexpected happened, they would stay calm and be able to sort it out.

The results exceeded our expectations. After a year, those juniors -- some of whom were boot camp graduates -- had grown tremendously by working in close proximity to our top engineers. We also made sure to hire juniors with exceptional communication skills because they had to interact with people from other teams. They ramped up fast in both technical and management skills. We hope to train more juniors whom we can place in other teams, allowing us to grow more sustainably.

Lessons learned

In hindsight, I would start building Support Engineering much earlier, and certainly before the organization experienced rapid growth. There are so many benefits that people shouldn’t waste time overthinking the idea. However, it is a significant investment. Most startups can’t make that investment, which makes their product teams less efficient.
You can start small, though that is not what we did. We originally had four engineers, and we are continually growing. Yet numbers matter less than your decision to come up with the right structure.
Support Engineering is a horizontal function. It liaises between Engineering and Operations while shielding product teams. When product teams are shielded, they are more productive. Product teams are always focused on what is next, but someone also needs to focus on what was built in the past. The support engineering team advocates for those initiatives, and by helping with prioritization, it helps product teams achieve their goals. Consequently, other teams became happier because there is a support function that they can rely on.
Don’t be afraid to make mistakes; be afraid of not making an impact. We made a lot of mistakes, but we iterated fast. As soon as we saw that we were diverging from our plan, we would iterate and adjust.

Cristian Brotto

Senior Director, Engineering at Snapdocs
