Handling Tech Debt: A Story About a Notification System Gone Amiss

Brian Guthrie

VP of Engineering at Meetup

Problem

I joined my current company as Head of Platform Engineering, hired specifically to manage our tech debt. The first problem out of the gate was that no one had defined what exactly the team should be working on or what the strategy for removing or minimizing tech debt should look like.

My organization could be characterized as a target-rich environment from a tech debt perspective, with a myriad of problems to focus on. One problem that stood out and needed to be addressed immediately was our notification system. I didn’t like the idea of taking on the notification system right away and would rather have had a decent PM look at it from a user perspective first. However, I was running a platform team and hadn’t been given any PMs or the budget to hire one, so I was expected to deal with the problem from an operational or software design standpoint.

The problem, in a nutshell, was that emails were taking too long to show up -- the organizer would send a notification and the message would show up hours later to the members.

Actions taken

We decided to track the latency of notifications through the system, but we didn’t have any metrics in place for measuring latency. I tasked my team with putting the tooling in place and then optimizing the system.

Around that time I went on parental leave, and when I came back several months later, my EM -- who was reporting up to me on the team -- explained how they had created a large set of events and could now see every time a notification did something in the system. To my surprise, they said they were still unable to measure latency, because they first had to make sense of all the events in between. The right way to do it would have been to fire off one event at the beginning and one at the end, look at the latency between them, and use those findings to slice the tomato thinner and thinner until we could identify where the problem was. We proceeded to do that over the next few months and were finally able to measure the latency, which turned out to be different for different messages. This was the first warning sign.
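To make the start-and-end approach concrete, here is a minimal sketch in Python. It is not the tooling we built; the event tuple layout and the stage names ("enqueued", "rendered", "delivered") are assumptions chosen only for illustration. The idea is to measure the end-to-end latency first and then reuse the same function on narrower pairs of stages.

```python
from datetime import datetime, timedelta

# Hypothetical event records: (notification_id, stage, timestamp).
# The stage names used below are assumptions for illustration only.
def stage_latency(events, start_stage, end_stage):
    """Return {notification_id: timedelta} between two stages."""
    started, finished = {}, {}
    for notification_id, stage, timestamp in events:
        if stage == start_stage:
            started[notification_id] = timestamp
        elif stage == end_stage:
            finished[notification_id] = timestamp
    # Only report notifications that reached both stages.
    return {
        nid: finished[nid] - started[nid]
        for nid in started
        if nid in finished
    }

t0 = datetime(2020, 1, 1, 12, 0, 0)
events = [
    ("n1", "enqueued", t0),
    ("n1", "rendered", t0 + timedelta(seconds=2)),
    ("n1", "delivered", t0 + timedelta(minutes=5, seconds=3)),
    ("n2", "enqueued", t0),
    ("n2", "delivered", t0 + timedelta(seconds=4)),
]

# Step 1: one event at the beginning, one at the end.
end_to_end = stage_latency(events, "enqueued", "delivered")

# Step 2: slice thinner by measuring a narrower pair of stages.
first_half = stage_latency(events, "enqueued", "rendered")
```

In this made-up data, most of the delay for "n1" sits after the "rendered" stage, which is where the next, thinner slice would go.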

Part of the reason we had to approach this from an observability frame of reference was that the people who built the system no longer worked at the organization. We realized that some notifications had much higher latency than others, and we started looking at our observability data to break this down and find patterns in the delays. We identified some suspicious patterns -- some messages would be delayed for exactly five minutes or exactly one hour. When we finally dug deeper into the source code, we realized that the team that had designed the system had intentionally built the delays into it, and then everyone forgot about them.
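The kind of pattern check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual analysis: the latency values are invented, and the 10-second tolerance is an arbitrary choice. It counts how many latencies land just above a suspiciously round delay such as five minutes or one hour.

```python
from collections import Counter

# Round delays (in seconds) that someone might plausibly hard-code.
ROUND_DELAYS = {"5 minutes": 300, "1 hour": 3600}

def count_round_delays(latencies_seconds, tolerance=10):
    """Count latencies that fall within `tolerance` seconds above a round delay."""
    counts = Counter()
    for latency in latencies_seconds:
        for label, delay in ROUND_DELAYS.items():
            if 0 <= latency - delay <= tolerance:
                counts[label] += 1
                break
        else:
            # No round delay matched this latency.
            counts["other"] += 1
    return counts

# Invented sample latencies, in seconds.
sample = [3, 302, 305, 3604, 47, 301, 3601, 900]
summary = count_round_delays(sample)
```

A heavy cluster right at a round number is a hint that the delay is deliberate rather than the result of load or a slow dependency, and that the source code is worth reading.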

The notifications were delayed because the code explicitly dictated that they should be. There was no underlying operational problem with the system; the problem was one of knowledge transfer. We spent a lot of time on the archeology of a system that at its root was relatively stable but was not serving the purpose that our users had come to expect from a notification system. We would batch and delay notifications because we wanted to send fewer notifications overall. However, that had the net effect of upsetting our users who wanted a more stable experience.

That brought us back to square one -- we needed a dedicated PM to help us understand why this particular system was malfunctioning and to come at it from the standpoint of users’ expectations. At that point, I was able to hand the problem off to the organization and explain that we had a rich understanding of the system’s architecture and metadata for everything all the way through. It was then decided that we would tune a few variables, which made the system much better for the users. It was a lot of work for very few changes in the end.

Lessons learned

It was an interesting mistake from an observability perspective. The other thing we could have done was read some of the more notable sections of the source code and try to infer the behavior from them. However, at the time I inherited the team, it had more people versed in operations than people with software development skills, so it made sense to treat this as an operational issue, which made our journey a bit longer.

In the organization’s effort to understand how to measure the output of the team handling tech debt, they focused on something with a measurable output, but the opportunity cost of that work was that we couldn’t do other important refactoring or cleanup work. It nevertheless underscored the value of documentation and organizational knowledge transfer.
