How to Identify Root Cause of an Application Failure

Ownership

1 September, 2020

Murali Bala

Software Engineering Leader at Amazon

Murali Bala, Director, Software Engineering at Capital One, outlines how he applied a root cause analysis to fix a recurring outage of their website.

Problem

Not long after I joined my current company, I was alarmed by a recurring outage of our API services. One of our APIs depended on 25+ downstream services, and if any of those services failed, our application failed too. Given that complexity, the biggest challenge was figuring out which dependency was actually causing an outage, especially for APIs with many downstream dependencies. To identify failing services, find the root cause, and fix problems quickly, we built an application dependency discovery, reporting, and management tool.

Actions taken

I started off by precisely defining the problem and outlining key questions.

  • Which system was failing? - Our API or a downstream API.
  • Why was it failing? - The complete error code and description.
  • Who was the service owner? - The team name.

The tool monitored the application's error rate against a configured threshold and triggered its analyzer service when that threshold was crossed. The analyzer service examined the logs of the application and its dependencies for errors and posted its analysis to pre-configured destination channels such as Slack, SMS, email, and PagerDuty.
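The threshold check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the class name `ErrorThresholdMonitor`, the fixed-size sliding window, and the `on_breach` callback (standing in for the analyzer service) are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque


class ErrorThresholdMonitor:
    """Hypothetical monitor: tracks recent request outcomes and calls
    `on_breach` when the error rate in the window exceeds `threshold`."""

    def __init__(self, threshold: float, window: int,
                 on_breach: Callable[[float], None]) -> None:
        self.threshold = threshold            # e.g. 0.05 for five percent
        self.outcomes: Deque[bool] = deque(maxlen=window)
        self.on_breach = on_breach            # stand-in for the analyzer service

    def error_rate(self) -> float:
        """Fraction of requests in the current window that were errors."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def record(self, is_error: bool) -> None:
        """Record one request outcome and check the threshold."""
        self.outcomes.append(is_error)
        # Wait until the window is full so a single early error does not
        # trigger the analyzer. Note: this fires on every record while the
        # rate stays above the threshold; a real system would likely debounce.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and self.error_rate() > self.threshold:
            self.on_breach(self.error_rate())
```

For example, with `threshold=0.05` and `window=20`, two errors among the last twenty requests (ten percent) would invoke the callback, while a single isolated error in an otherwise empty window would not.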

My journey began with the application logs. I tried to understand how errors in different dependencies were connected by looking at the messages, IDs, host keys, and so on. Following that, we built a small prototype that created a lineage map of all the dependencies. The dependencies were mapped out as a branching tree, and if any branch had a blip, it would affect our API.
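One way to picture the lineage map is as a graph built from caller/callee pairs extracted from the logs. The sketch below assumes such pairs are already available; the function names `build_lineage` and `render_tree` and the service names in the usage note are hypothetical and are not taken from the prototype.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def build_lineage(calls: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build a dependency map from (caller, callee) pairs.

    Every service appears as a key, even if it calls nothing.
    """
    graph: Dict[str, Set[str]] = {}
    for caller, callee in calls:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())
    return graph


def render_tree(graph: Dict[str, Set[str]], root: str, depth: int = 0,
                path: FrozenSet[str] = frozenset()) -> List[str]:
    """Return the dependency tree under `root` as indented lines."""
    indent = "  " * depth
    if root in path:
        # The service is already on the current branch: stop to avoid
        # looping forever on circular dependencies.
        return [indent + root + " (cycle)"]
    lines = [indent + root]
    for child in sorted(graph.get(root, ())):
        lines.extend(render_tree(graph, child, depth + 1, path | {root}))
    return lines
```

Calling `render_tree(build_lineage([("orders-api", "payments"), ("orders-api", "inventory"), ("payments", "ledger")]), "orders-api")` yields one line per service, with each dependency indented under the service that calls it.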

As an application owner, I could onboard my application onto this platform, confirm its dependencies, and set up an error threshold. Doing so activated our monitoring tool, which would then continuously monitor the application logs for failures. If the application failed for any reason and the number of errors crossed the configured threshold (e.g. five percent), our tool would start crawling through the dependencies to figure out why the application was failing. Once the problem was identified, an email, Slack message, and/or PagerDuty alert would be sent out, notifying the responsible person that the application was failing and identifying which branch of the application's dependencies was causing it. At the same time, the owner of the failing dependency would also get an alert. Everyone was informed of what was happening, and the operations center didn't have to investigate the root cause of the failure, which saved a huge amount of time.
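The crawl described above can be approximated as a depth-first walk that follows failing branches until it reaches a failing service with no failing dependencies of its own. The sketch below is an illustration under stated assumptions, not the patented method: it assumes failures propagate upward (so healthy branches are not explored), and `ServiceStatus`, `find_root_causes`, and `build_alerts` are hypothetical names. Each alert answers the three key questions: which system, why, and who owns it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class ServiceStatus:
    owner: str                    # team that owns the service
    error: Optional[str] = None   # error code and description; None if healthy


def find_root_causes(graph: Dict[str, Set[str]],
                     status: Dict[str, ServiceStatus],
                     app: str) -> List[str]:
    """Return failing services reachable from `app` through failing
    branches that have no failing dependencies themselves."""
    roots: List[str] = []
    visited: Set[str] = set()

    def visit(service: str) -> None:
        if service in visited:    # guards against cycles and repeat visits
            return
        visited.add(service)
        failing_deps = [dep for dep in sorted(graph.get(service, ()))
                        if status[dep].error is not None]
        for dep in failing_deps:
            visit(dep)
        # A failing service whose dependencies are all healthy is treated
        # as the likely origin of the failure.
        if status[service].error is not None and not failing_deps:
            roots.append(service)

    visit(app)
    return roots


def build_alerts(graph: Dict[str, Set[str]],
                 status: Dict[str, ServiceStatus],
                 app: str) -> List[str]:
    """Format one alert line per suspected root cause."""
    return [
        f"{name} is failing ({status[name].error}); owner: {status[name].owner}"
        for name in find_root_causes(graph, status, app)
    ]
```

In a real deployment the alert strings would be handed to whatever notification channels are configured; that delivery step is left out here because it depends on the specific integrations in use.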

In addition, we are now working on a machine learning algorithm that looks for recurring patterns in these problems, connects the dots, anticipates potential failures, and sends proactive alerts.

If you are interested in learning more about the tool, please read the patent filing, "Determining problem dependencies in application dependency discovery, reporting, and management tool".

Lessons learned

Challenge yourself and don't accept the status quo. When you detect a problem, challenge the existing solution if you don't find it satisfactory. As leaders, we must be willing to take risks and bold steps. Nothing great is ever achieved by doing things the way they have always been done.
