How to Identify Root Cause of an Application Failure
1 September, 2020
Not long after I joined my current company, I was alarmed by recurring outages of our API services. One of our APIs depended on more than 25 downstream services, and if any of those services failed, our application would fail too. Given that complexity, the biggest challenge was figuring out which dependency was actually causing an outage, especially for APIs with many downstream dependencies. To identify failing services, find the root cause, and fix problems quickly, we built an application dependency discovery, reporting, and management tool.
I started off by precisely defining the problem and outlining key questions.
- Which system was failing? Your API or a downstream API.
- Why was it failing? The complete error code and description.
- Who was the service owner? The team name.
The tool essentially monitored the application’s errors against a threshold and triggered its analyzer service when they crossed it. The analyzer service analyzed the logs of the application and its dependencies for errors and posted its analysis to preconfigured destination channels such as Slack, SMS, email, and PagerDuty.
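The threshold check described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the `LogRecord` shape, the rule that a status of 500 or higher counts as an error, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    # Hypothetical log shape: the service name and an HTTP-style status code.
    service: str
    status: int


def error_rate(records):
    """Fraction of records in the window that are errors (status >= 500)."""
    if not records:
        return 0.0
    errors = sum(1 for r in records if r.status >= 500)
    return errors / len(records)


def should_trigger_analyzer(records, threshold=0.05):
    """True when the window's error rate exceeds the configured threshold."""
    return error_rate(records) > threshold


# 2 errors out of 20 requests is 10%, which is above a 5% threshold.
window = [LogRecord("orders-api", 200)] * 18 + [LogRecord("orders-api", 503)] * 2
print(should_trigger_analyzer(window))
```

Using a rate over a window of recent records, rather than a raw error count, keeps the check meaningful as traffic volume changes.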
My journey began with investigating the application logs. I tried to understand how the errors between dependencies were interconnected by looking at the messages, IDs, host keys, etc. Following that, we built a small prototype that created a lineage map across the board. The dependencies were mapped out as a branching tree, and if any of those branches had a blip, it would affect our API.
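As a rough sketch of what such a lineage map might look like, the example below builds a dependency tree from caller/callee pairs. It assumes the logs have already been reduced to entries with hypothetical `caller` and `callee` fields; the service names are invented for illustration.

```python
from collections import defaultdict


def build_lineage(log_entries):
    """Map each service to the set of downstream services it called."""
    lineage = defaultdict(set)
    for entry in log_entries:
        lineage[entry["caller"]].add(entry["callee"])
    return dict(lineage)


def render_tree(lineage, root, path=()):
    """Return the dependency tree as indented lines, one service per line."""
    lines = ["  " * len(path) + root]
    if root in path:
        # Already on this branch: stop here to avoid looping on a cycle.
        return lines
    for child in sorted(lineage.get(root, ())):
        lines.extend(render_tree(lineage, child, path + (root,)))
    return lines


# Hypothetical entries extracted from application logs.
entries = [
    {"caller": "orders-api", "callee": "payments"},
    {"caller": "payments", "callee": "fraud-check"},
    {"caller": "orders-api", "callee": "inventory"},
]
lineage = build_lineage(entries)
print("\n".join(render_tree(lineage, "orders-api")))
```

For these entries, the output lists `inventory` and `payments` under `orders-api`, with `fraud-check` nested one level deeper under `payments`.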
As an application owner, I could onboard my application onto this platform, confirm its dependencies, and set up an error threshold. That action would trigger our monitoring tool, which would then continuously monitor the application logs for failures. If the application failed for any reason and the number of errors crossed the configured threshold (e.g., five percent), our tool would start crawling through the dependencies to figure out why the application was failing. Once the problem was identified, an email, Slack message, and/or PagerDuty alert would be sent out, notifying a responsible person that the application was failing and identifying which branch of the application’s dependencies was causing it. At the same time, the owner of the failing branch would also get an alert. Everyone was informed of what was happening, and the operations center didn’t have to investigate the root cause of the failure, which saved a huge amount of time.
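The crawl described above could look something like the sketch below. It is an assumption-laden illustration rather than the patented method: it takes a dependency map, a set of services currently over their error threshold, and an owner lookup (all hypothetical inputs), then follows failing branches down to the deepest failing service on each one.

```python
def find_root_causes(service, deps, failing, path=()):
    """Return failing services that have no failing dependency beneath them."""
    if service in path or service not in failing:
        return []
    causes = []
    for child in deps.get(service, ()):
        causes.extend(find_root_causes(child, deps, failing, path + (service,)))
    # If nothing downstream is failing, this service is the likely origin.
    return causes or [service]


def build_alerts(app, deps, failing, owners):
    """Build one alert message per distinct likely root cause."""
    alerts = []
    for cause in dict.fromkeys(find_root_causes(app, deps, failing)):
        owner = owners.get(cause, "unknown owner")
        alerts.append(f"{app} is failing; likely root cause: {cause} (owner: {owner})")
    return alerts


# Hypothetical dependency map, failing set, and ownership table.
deps = {"orders-api": ["payments", "inventory"], "payments": ["fraud-check"]}
failing = {"orders-api", "payments", "fraud-check"}
owners = {"payments": "Payments Team", "fraud-check": "Risk Team"}

for message in build_alerts("orders-api", deps, failing, owners):
    print(message)
```

In this example the alert names `fraud-check` and its owner rather than `payments`, because `payments` is failing only as a consequence of its own downstream dependency. Delivering the message to Slack, email, or a pager is left out of the sketch.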
In addition, we are now working on a machine learning algorithm that would look for certain patterns of problems, connect the dots, anticipate potential problems, and send proactive alerts.
If you are interested in learning more about the tool, please read the patent filing: “Determining problem dependencies in application dependency discovery, reporting, and management tool.”
Challenge yourself and don’t accept the status quo. When you detect a problem, challenge the existing solution if you don’t find it satisfactory. As leaders, we must be willing to take risks and bold steps. Nothing great is ever achieved by doing things the way they have always been done.