
Handling Tech Debt: A Story About a Notification System Gone Amiss

Dev Processes
Product

30 September, 2020

Brian Guthrie, VP of Engineering at Meetup, tells the story of the first project he tackled as Head of Platform Engineering -- a notification system gone amiss.

Problem

I joined my current company as Head of Platform Engineering, hired specifically to manage our tech debt. The very first problem out of the gate was that no one had defined what exactly the team should be working on or what the strategy for removing or minimizing tech debt should look like.
 

My organization could be characterized as a target-rich environment from a tech debt perspective, with a myriad of problems to focus on. One that stood out and needed to be addressed immediately was our notification system. I didn’t like the idea of taking it on right away; I would rather have had a decent PM look at it from a user perspective. However, I was running a platform team, hadn’t been given any PMs or budget to hire one, and was expected to deal with the problem from an operational or software design standpoint.
 

The problem, in a nutshell, was that emails were taking too long to arrive -- an organizer would send a notification and the message would show up for members hours later.
 

Actions taken

We decided to track the latency of notifications through the system, but we didn’t have any metrics in place for measuring latency. I tasked my team with putting the tooling in place and then optimizing it.
 

Around that time I went on parental leave, and when I came back several months later, my EM -- who was reporting up to me on the team -- explained how they had created all these events and would be able to see every time a notification did something in the system. To my surprise, they claimed they were still unable to measure latency because they had to figure out all the events in between. In fact, the right way to do it would have been to fire off one event at the beginning and one at the end, look at the latency, and use those findings to slice the tomato thinner and thinner until we could identify where the problem was. We proceeded to do that over the next few months and were finally able to measure the latency, which turned out to differ from message to message. This was the first warning sign.
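That approach can be reduced to a small sketch. The helper names, event stages, and storage below are hypothetical rather than Meetup’s actual tooling, but the idea is the one in the story: bracket the pipeline with one event at the start and one at the end, then subdivide only where the numbers point.

```python
import time

# Hypothetical in-memory event store; in practice this would be a metrics or
# tracing backend. All names here are illustrative, not the real system's.
EVENTS: dict = {}

def emit_event(notification_id: str, stage: str) -> None:
    """Record a timestamp for one notification at a named pipeline stage."""
    EVENTS.setdefault(notification_id, {})[stage] = time.time()

def end_to_end_latency(notification_id: str) -> float:
    """Seconds between the first ('enqueued') and last ('delivered') events."""
    stages = EVENTS[notification_id]
    return stages["delivered"] - stages["enqueued"]

# Start with just two events per notification...
emit_event("n-123", "enqueued")
# ... the notification flows through the pipeline ...
emit_event("n-123", "delivered")
print(end_to_end_latency("n-123"))

# ...and only add intermediate stages ("rendered", "batched", "sent") where the
# end-to-end numbers say to look, slicing the pipeline thinner each pass.
```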
 

Part of the reason we had to approach this from an observability frame of reference was that the people who had built the system no longer worked at the organization. We realized that some notifications had much higher latency than others, and we started looking at our observability data to break this down and find patterns in the delays. We identified some suspicious ones -- some messages would be delayed for exactly five minutes or exactly one hour. When we finally dug deeper into the source code, we realized that the team that had designed the system had intentionally built the delays into it, and then everyone had forgotten about them.
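Spotting that kind of pattern is mostly a matter of grouping latencies by notification type and checking whether they cluster around suspiciously round values. The samples and type names below are made up for illustration; the five-minute and one-hour figures are the ones from the story.

```python
import statistics

# Illustrative latency samples (seconds) per notification type; in the real
# investigation these came from the observability events described above.
latencies_by_type = {
    "group_announcement": [3601.4, 3598.7, 3602.1],
    "rsvp_reminder": [300.5, 299.8, 301.2],
    "direct_message": [1.3, 0.9, 2.2],
}

FIXED_DELAYS = (300, 3600)  # exactly five minutes, exactly one hour

for kind, samples in latencies_by_type.items():
    median = statistics.median(samples)
    # Latency clustering tightly around a round constant suggests a scheduled
    # delay rather than a backed-up queue or a slow downstream service.
    suspicious = any(abs(median - d) < 5 for d in FIXED_DELAYS)
    print(f"{kind}: median={median:.1f}s looks_intentional={suspicious}")
```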
 

The reason the notifications were delayed was that the code explicitly dictated they should be. There was no underlying operational problem with the system; the problem was one of knowledge transfer. We spent a lot of time on the archeology of a system that at its root was relatively stable but was not serving the purpose our users had come to expect from a notification system. We would batch and delay notifications because we wanted to send fewer notifications overall. However, that had the net effect of upsetting our users, who wanted a more stable experience.
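What the team found in the source code would have amounted to something like the fragment below -- a hypothetical reconstruction, not the actual code: per-type batching windows that someone had once set deliberately to cut down on email volume.

```python
from datetime import datetime, timedelta

# Hypothetical per-type batching windows; constants like these are the kind of
# deliberate delay the original authors baked in and everyone later forgot.
BATCH_DELAYS = {
    "group_announcement": timedelta(hours=1),
    "rsvp_reminder": timedelta(minutes=5),
    "direct_message": timedelta(seconds=0),
}

def scheduled_send_time(created_at: datetime, kind: str) -> datetime:
    """Push delivery out by the type's batching window; unknown types go out immediately."""
    return created_at + BATCH_DELAYS.get(kind, timedelta(seconds=0))
```

Seen this way, the eventual fix of "tuning a few variables" plausibly comes down to adjusting constants like these rather than rearchitecting anything.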
 

That brought us back to square one -- we needed a dedicated PM to help us understand why this particular system was malfunctioning and to come at it from the standpoint of users’ expectations. At that point, I was able to hand the problem off to the organization and explain that we now had a rich understanding of the system’s architecture and metadata for everything all the way through. It was then decided that we would tune a few variables, which made the system much better for users. It was a lot of work for very few changes in the end.
 

Lessons learned

  • It was an interesting mistake from an observability perspective. The other thing we could have done would have been to read some of the more notable sections of the source code and try to infer behavior. However, at the time I inherited the team, there were more people versed in operations than in software development, so it made sense to treat this as an operational issue, which made our journey a bit longer.
  • In the organization’s effort to understand how to measure the output of the team handling tech debt, it focused on something with a measurable output, but the opportunity cost of that work was that we couldn’t do other important refactoring or cleanup work. The episode nevertheless underscored the value of documentation and organizational knowledge transfer.
