Migrating a Legacy Service: A Lesson in Eng/Ops Collaboration

Collaboration

18 November, 2020

Jackson Dowell

Engineering Manager at Asana

Jackson Dowell, Engineering Manager at Asana, discusses how he approached legacy service migration by stabilizing the existing stack and getting Engineering and Operations to work together to address the underlying problems.

Problem

I was managing a team at LinkedIn that inherited a set of legacy Python services that were partway through a migration to a new Java microservices stack. Until then, Engineering had focused on building the new stack and running the data migration, while Operations was siloed and responsible for keeping the existing software alive, which involved some wild hacks and constant pages. The existing Python services were causing regular production outages, and the migration plan failed to address the underlying technical issues, so I needed to reprioritize and change how Engineering and Operations worked together.

Actions taken

Our NYC-based team flew to California to do a handoff with the existing development team because the project was in flight. I had my engineers pair with the team that was actively working on the migration, taking small tasks across both systems and getting mentorship so we could learn how to confidently make and release changes.

As part of the handoff, I facilitated meetings between the SRE (Site Reliability Engineering) leads who were transferring ownership of this project. Participating in their exchange helped me compile the full state of the software operating in production, including a script that rebooted hosts stuck on failing jobs and ran at least once a minute across the whole fleet.

During the first week after we took over, there were two outages caused by that piece of software. In the first instance, I had a member of my team pair directly with the engineer who investigated and resolved the incident. In the second, the engineer on my team was able to take the lead and, supported by another person on the team, resolve the problem. I noted the commonalities between the two incidents and the action items we would take to prevent the problem from recurring.

Once I had a full understanding of the existing system and the migration plan, I wrote up an analysis that I shared with my manager and the team of leads in New York to secure buy-in for a process called code yellow: active feature development would pause until operations were stabilized to the point that both SRE and Engineering were satisfied. I defined the exit criteria for code yellow, which included completing the action items from the past incidents, introducing connection pooling and connection limits to the database (to prevent cascading failures), and introducing a paging rotation in Engineering to support future outages. We targeted a two-week window and were able to exit code yellow on time.
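To illustrate why connection pooling with a hard limit helps contain cascading failures, here is a minimal Python sketch. It is not the implementation used in this story, which is not described here. The names `BoundedConnectionPool`, `PoolExhausted`, and `run_demo` are hypothetical, and an in-memory `sqlite3` database stands in for the real one. The idea is that the service can never hold more than a fixed number of database connections, and a caller that cannot get one within a short timeout fails fast instead of piling more load onto a struggling database.

```python
import queue
import sqlite3
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the timeout."""


class BoundedConnectionPool:
    """Hypothetical fixed-size pool: never more than max_connections open."""

    def __init__(self, connect, max_connections, acquire_timeout):
        self._acquire_timeout = acquire_timeout
        self._idle = queue.Queue(maxsize=max_connections)
        # Open every connection up front so the database never sees more
        # than max_connections from this process.
        for _ in range(max_connections):
            self._idle.put(connect())

    @contextmanager
    def connection(self):
        try:
            # Wait briefly for a free connection, then give up (fail fast).
            conn = self._idle.get(timeout=self._acquire_timeout)
        except queue.Empty:
            raise PoolExhausted("no free connection; shedding load") from None
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)


def run_demo():
    # Assumption: sqlite3 in memory stands in for the production database.
    pool = BoundedConnectionPool(
        connect=lambda: sqlite3.connect(":memory:"),
        max_connections=2,
        acquire_timeout=0.05,
    )
    results = []
    with pool.connection() as a, pool.connection() as b:
        results.append(a.execute("SELECT 1").fetchone()[0])
        results.append(b.execute("SELECT 2").fetchone()[0])
        try:
            # Both connections are in use, so a third request is rejected.
            with pool.connection():
                results.append("unexpected third connection")
        except PoolExhausted:
            results.append("rejected")
    # The connections were returned, so the pool is usable again.
    with pool.connection() as c:
        results.append(c.execute("SELECT 3").fetchone()[0])
    return results
```

Calling `run_demo()` returns `[1, 2, "rejected", 3]`: the third concurrent request is turned away quickly, and the pool serves requests again once connections are released. In a real service the limit and timeout would be tuned to what the database can sustain.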

Lastly, I worked with my Tech Lead to revamp the migration plan to address failures in long-running jobs in the first milestone and to improve the safety of the data migration. We communicated the new plan, then held daily standups where engineers and SREs worked together to balance addressing short-term stability issues with investing in the migration to reach a long-term safe and scalable state.

Lessons learned

  • Having Engineering and Operations both take responsibility for solving incidents (including being paged) instilled a culture of caring for the reliability of our software. The engineering team translated that into excitement for discovering and remediating existing bugs and producing high-quality designs for the new system. The SRE team also felt supported and collaborated regularly to define requirements for how the new system would behave and solve current problems.
  • I got positive reinforcement for questioning the existing plan given new information. Even though it was controversial to delay our migration, we were ultimately able to provide better service to our customers and members, and we learned a lot about how to design our system to scale effectively.
