login


Google Sign inLinkedIn Sign in

Don't have an account? 

How to Identify Root Cause of an Application Failure

Dev Processes
Leadership
Productivity

30 June, 2020

Murali Bala, Director, Software Engineering at Capital One, outlines how he applied a root cause analysis to fix a recurring outage of their website.

Problem

Not long after I joined my current company, I was alarmed by the problem of a recurring outage of our API services. One of our API was dependent on 25+ other downstream services and if any of those services failed, our application would experience failures too. Considering the complexity of the situation, the biggest challenge was how do we figure out which dependency is actually causing the outage, especially with APIs that have a lot of downstream dependencies? In order for us to identify failing services, root cause and fix problems quickly, we built an application dependency discovery, reporting, and management tool.

Actions taken

I started off by precisely defining the problem and outlining key questions.

  • Which system was failing? - Your API or other Downstream API.
  • Why was it failing? - Complete Error Code and Description.
  • Who was the Service Owner? - Team Name

The tool essentially monitored the application's error threshold, triggering its analyzer service on error. The analyzer service analyzed logs for application and it’s dependencies for errors and posts analysis to pre-configured destinations channels like slack, SMS, email and pager duty.

My journey began by investigating the application logs. I tried to understand how the errors between dependencies were interconnected by looking at the messages, IDs, host key, etc. Following on that, we built a small prototype that created a lineage map across the board. Those dependencies were mapped out as a branched out tree and if any of those branches were having a blip, that was going to affect our API.

As an application owner, I could onboard my application into this platform, confirm my application dependencies, setup error threshold. The action would trigger our monitoring tool; which would now continuously monitor the application logs for a failure. If the application fails for any reason and the number of errors crosses the configured threshold (e.g. five percent), our tool would start crawling and navigating through the dependencies trying to figure out why the application was failing. Once the problem was identified, an email, slack and/or pager duty would be sent out notifying a responsible person that the application was failing and identifying which particular branch of our application dependency was causing it. At the same time, the original owner of the branch obligation would also get an alert; everyone was informed what was happening and the operation center didn’t have to go and investigate the root cause of the failure, in turn saving a huge amount of time.

In addition, we are now working on a machine learning algorithm that would look at certain patterns of problem, connecting the dots and anticipating potential problems, and sending proactive messages of alert.

If you are interested in learning more about the tool, please read up on the patent filing - Determining problem dependencies in application dependency discovery, reporting, and management tool.

Lessons learned

Challenge yourself and don’t accept the status quo. When you detect a problem, challenge the existing solution if you don’t find it satisfactory. As a leader, we must be willing to risk and take bold steps. Nothing great is ever achieved by doing things the way they have always been done.


Related stories

What We Learned From Running Open Spaces
30 June

Jeff Foster, Head of Product Engineering, highlights key learnings from his experience of running open spaces and if and how it contributed to an increase in innovation.

Company Culture
Productivity
Impact
Jeff Foster

Jeff Foster

Head of Product Engineering at Redgate

Some Ideas for Breaking Down Silos In Your Organization
30 June

Jeff Foster, Head of Product Engineering, shares how he managed to break down silos in his organization by encouraging their employees to choose their own team.

Team reaction
Managing Expectations
Company Culture
Internal Communication
Collaboration
Productivity
Reorganization
Jeff Foster

Jeff Foster

Head of Product Engineering at Redgate

Some Useful Tips for Decoupling Releases and Deployments
30 June

Pierre Bergamin, VP of Engineering at Assignar, outlines some useful tips for decoupling releases from deployment and increasing deployments by a huge factor, speeding up reverts and planning releases in a better way.

Agile / Scrum
Dev Processes
Pierre Bergamin

Pierre Bergamin

VP of Engineering at Assignar

A New Job as a VP of Engineering: A Lesson in Trust and Collaboration
30 June

Pierre Bergamin, VP of Engineering at Assignar, recalls his own transition to a leadership role in a new company and how he made everything smoother by embracing trust, delegating work and encouraging collaboration.

New Manager of Manager
Changing company
Leadership
Pierre Bergamin

Pierre Bergamin

VP of Engineering at Assignar

How to Identify Root Cause of an Application Failure
30 June

Murali Bala, Director, Software Engineering at Capital One, outlines how he applied a root cause analysis to fix a recurring outage of their website.

Dev Processes
Leadership
Productivity
Murali Bala

Murali Bala

Director, Software Engineering at Capital One

You're a great engineer.
Become a great engineering leader.

Plato (platohq.com) is the world's biggest mentorship platform for engineering managers & product managers. We've curated a community of mentors who are the tech industry's best engineering & product leaders from companies like Facebook, Lyft, Slack, Airbnb, Gusto, and more.