
How to Identify Root Cause of an Application Failure

Ownership

1 September, 2020

Murali Bala, Director, Software Engineering at Capital One, outlines how he applied a root cause analysis to fix a recurring outage of their website.

Problem

Not long after I joined my current company, I was alarmed by a recurring outage of our API services. One of our APIs depended on 25+ downstream services, and if any of those services failed, our application would experience failures too. Given that complexity, the biggest challenge was figuring out which dependency was actually causing an outage, especially for APIs with many downstream dependencies. To identify failing services, find the root cause, and fix problems quickly, we built an application dependency discovery, reporting, and management tool.

Actions taken

I started off by precisely defining the problem and outlining key questions.

  • Which system was failing? - Our API or a downstream API.
  • Why was it failing? - The complete error code and description.
  • Who was the service owner? - The team name.

The tool essentially monitored the application's error threshold and triggered its analyzer service when errors crossed it. The analyzer service analyzed the logs of the application and its dependencies for errors and posted its analysis to pre-configured destination channels such as Slack, SMS, email, and PagerDuty.
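As a rough illustration of the threshold check described above, here is a minimal sketch in Python. It is not the tool's actual implementation: the class name `ErrorRateMonitor`, the function `watch`, the sliding window of 100 requests, and the minimum sample count are all assumptions made for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks recent request outcomes and reports a threshold breach (hypothetical)."""

    def __init__(self, threshold=0.05, window=100, min_samples=20):
        self.threshold = threshold        # e.g. 0.05 means five percent
        self.min_samples = min_samples    # avoid alerting on a handful of requests
        self.outcomes = deque(maxlen=window)  # True means the request failed

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def breached(self):
        enough = len(self.outcomes) >= self.min_samples
        return enough and self.error_rate() > self.threshold


def watch(outcomes, analyzer, **settings):
    """Feed outcomes to the monitor; call analyzer once, on the first breach."""
    monitor = ErrorRateMonitor(**settings)
    for index, failed in enumerate(outcomes):
        monitor.record(failed)
        if monitor.breached():
            analyzer(index, monitor.error_rate())
            return True
    return False
```

Here `analyzer` stands in for the analyzer service: any callable that receives the position of the request that tipped the error rate over the threshold and the rate at that moment.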

My journey began with investigating the application logs. I tried to understand how the errors across dependencies were interconnected by looking at the messages, IDs, host keys, etc. Following that, we built a small prototype that created a lineage map across the board. The dependencies were mapped out as a branching tree, and if any of those branches had a blip, it would affect our API.
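A lineage map of this kind can be sketched as a simple adjacency structure. The example below assumes, purely for illustration, that each log record names the calling service and the service it called; the field names `caller` and `callee` and the service names in the usage example are hypothetical, not taken from the actual logs.

```python
from collections import defaultdict


def build_lineage(log_records):
    """Return {service: sorted list of its direct downstream services}."""
    edges = defaultdict(set)
    for record in log_records:
        edges[record["caller"]].add(record["callee"])
    return {service: sorted(deps) for service, deps in edges.items()}


def render_tree(lineage, root, path=()):
    """Return the dependency tree under root as indented lines of text."""
    indent = "  " * len(path)
    if root in path:
        # The service already appears on this branch: stop to avoid looping.
        return [indent + root + " (cycle)"]
    lines = [indent + root]
    for dep in lineage.get(root, []):
        lines.extend(render_tree(lineage, dep, path + (root,)))
    return lines
```

For example, with three hypothetical call records:

```python
records = [
    {"caller": "orders-api", "callee": "pricing"},
    {"caller": "orders-api", "callee": "inventory"},
    {"caller": "pricing", "callee": "tax"},
]
print("\n".join(render_tree(build_lineage(records), "orders-api")))
```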

As an application owner, I could onboard my application onto this platform, confirm my application's dependencies, and set up an error threshold. This action would trigger our monitoring tool, which would then continuously monitor the application logs for failures. If the application failed for any reason and the number of errors crossed the configured threshold (e.g., five percent), our tool would start crawling through the dependencies, trying to figure out why the application was failing. Once the problem was identified, an email, Slack message, and/or PagerDuty alert would be sent out, notifying a responsible person that the application was failing and identifying which particular branch of our application's dependencies was causing it. At the same time, the owner of the failing branch would also get an alert. Everyone was informed of what was happening, and the operations center didn’t have to investigate the root cause of the failure, which saved a huge amount of time.
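One simple way to picture the crawl is a depth-first walk that follows only the failing branches and reports the deepest failing services. This is a sketch under stated assumptions, not the patented tool's algorithm: the inputs `lineage`, `error_rates`, and `owners`, and the rule "a failing service with no failing dependency is the suspected cause," are all hypothetical choices for the example.

```python
def find_root_causes(lineage, error_rates, root, threshold=0.05):
    """Return failing services under root that have no failing dependency."""
    causes = []
    visited = set()

    def crawl(service):
        if service in visited:
            return
        visited.add(service)
        if error_rates.get(service, 0.0) <= threshold:
            return  # healthy service: no need to descend further
        failing_deps = [
            dep for dep in lineage.get(service, [])
            if error_rates.get(dep, 0.0) > threshold
        ]
        if not failing_deps:
            causes.append(service)  # failing, and nothing below it is failing
        for dep in failing_deps:
            crawl(dep)

    crawl(root)
    return causes


def build_alerts(root, causes, owners):
    """Format one alert message per suspected cause, naming the owning team."""
    return [
        f"{root} is failing; suspected cause: {svc} "
        f"(owner: {owners.get(svc, 'unknown')})"
        for svc in causes
    ]
```

In a real system the messages returned by `build_alerts` would be handed to whatever channels are configured (email, Slack, PagerDuty); delivery is deliberately left out of the sketch.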

In addition, we are now working on a machine learning algorithm that would look for certain patterns in problems, connect the dots, anticipate potential problems, and send proactive alerts.

If you are interested in learning more about the tool, please read up on the patent filing - Determining problem dependencies in application dependency discovery, reporting, and management tool.

Lessons learned

Challenge yourself and don’t accept the status quo. When you detect a problem, challenge the existing solution if you don’t find it satisfactory. As leaders, we must be willing to take risks and bold steps. Nothing great is ever achieved by doing things the way they have always been done.

