How to Identify Root Cause of an Application Failure
Engineering Manager at Amazon
Not long after I joined my current company, I was alarmed by a recurring outage of our API services. One of our APIs depended on more than 25 downstream services, and if any of those services failed, our application experienced failures too. Given that complexity, the biggest challenge was figuring out which dependency was actually causing an outage, especially for APIs with many downstream dependencies. To identify failing services, find the root cause, and fix problems quickly, we built an application dependency discovery, reporting, and management tool.
I started off by precisely defining the problem and outlining key questions.
- Which system was failing? Our API or a downstream API.
- Why was it failing? The complete error code and description.
- Who was the service owner? The team name.
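The answers to these three questions can be captured in a small structured report. This is a minimal sketch; the field names are illustrative stand-ins, not the tool's actual schema.

```python
from dataclasses import dataclass

@dataclass
class RootCauseReport:
    failing_system: str  # Which system was failing? Our API or a downstream API.
    error_code: str      # Why was it failing? The error code...
    description: str     # ...and its full description.
    owner_team: str      # Who was the service owner?

# Example report for a hypothetical downstream failure.
report = RootCauseReport(
    failing_system="payments-downstream-api",
    error_code="HTTP 503",
    description="Connection pool exhausted",
    owner_team="Payments Platform",
)
```

Keeping the report this small forces every alert to answer exactly the questions an on-call engineer needs first.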
The tool essentially monitored the application's error threshold and triggered its analyzer service when errors occurred. The analyzer examined the logs of the application and its dependencies for errors and posted its analysis to pre-configured destination channels such as Slack, SMS, email, and PagerDuty.
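The fan-out to destination channels can be sketched as a simple dispatch table. The channel handlers below are hypothetical placeholders for real Slack/SMS/email/PagerDuty integrations, not the tool's actual code.

```python
# Placeholder senders; a real system would call the Slack API,
# an SMS gateway, an SMTP server, and the PagerDuty Events API.
def send_slack(msg: str) -> None:
    print(f"[slack] {msg}")

def send_email(msg: str) -> None:
    print(f"[email] {msg}")

def send_pagerduty(msg: str) -> None:
    print(f"[pagerduty] {msg}")

CHANNELS = {
    "slack": send_slack,
    "email": send_email,
    "pagerduty": send_pagerduty,
}

def post_analysis(analysis: str, destinations: list[str]) -> list[str]:
    """Deliver the analyzer's findings to every configured channel."""
    delivered = []
    for name in destinations:
        handler = CHANNELS.get(name)
        if handler:
            handler(analysis)
            delivered.append(name)
    return delivered
```

Because destinations are configured per application, each team can route alerts to whichever channels it actually watches.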
My journey began by investigating the application logs. I tried to understand how errors across dependencies were interconnected by looking at the messages, IDs, host keys, etc. Following that, we built a small prototype that created a lineage map across the board. The dependencies were mapped out as a branching tree, and if any of those branches had a blip, it would affect our API.
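A lineage map like this can be modeled as a tree of dependencies with a depth-first walk that returns the path to the deepest failing branch. This is a minimal sketch under assumed names; a real health check would inspect logs and error metrics rather than a set membership test.

```python
def find_failing_path(service, tree, is_healthy, path=None):
    """Return the dependency path from `service` down to the deepest
    failing node, or None if this subtree is healthy."""
    path = (path or []) + [service]
    if is_healthy(service):
        return None
    # Check children first: a failing dependency explains our own failure.
    for child in tree.get(service, []):
        deeper = find_failing_path(child, tree, is_healthy, path)
        if deeper:
            return deeper
    return path  # No failing child, so this service is the root cause.

# Toy lineage map: our API depends on auth and billing; billing on ledger.
tree = {"our-api": ["auth", "billing"], "billing": ["ledger"]}
unhealthy = {"our-api", "billing", "ledger"}
```

Here `find_failing_path("our-api", tree, lambda s: s not in unhealthy)` walks past the healthy `auth` branch and pins the blame on `ledger`, the deepest failing dependency.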
As an application owner, I could onboard my application onto this platform, confirm its dependencies, and set up error thresholds. This triggered our monitoring tool, which continuously monitored the application logs for failures. If the application failed for any reason and the number of errors crossed the configured threshold (e.g., five percent), the tool would crawl through the dependencies to figure out why the application was failing. Once the problem was identified, an email, Slack message, and/or PagerDuty alert would go out, notifying the responsible person that the application was failing and identifying which branch of the dependency tree was causing it. At the same time, the owner of the failing branch would also get an alert. Everyone knew what was happening, and the operations center didn't have to investigate the root cause of the failure, saving a huge amount of time.
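The trigger itself reduces to comparing a windowed error rate against the configured threshold. A minimal sketch, assuming the five-percent example above; the function name and signature are illustrative, not the tool's real API.

```python
ERROR_THRESHOLD = 0.05  # five percent, per the example configuration

def should_trigger_analysis(errors: int, total: int,
                            threshold: float = ERROR_THRESHOLD) -> bool:
    """True when the error rate over a window of requests crosses the
    configured threshold, signaling that the dependency crawl should start."""
    if total == 0:
        return False  # no traffic, nothing to analyze
    return errors / total > threshold
```

Gating the crawl on a threshold rather than on individual errors keeps the analyzer from firing on transient blips.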
In addition, we are now working on a machine learning algorithm that looks at patterns of problems, connects the dots to anticipate potential issues, and sends proactive alerts.
If you are interested in learning more about the tool, please read the patent filing, "Determining problem dependencies in application dependency discovery, reporting, and management tool."
Challenge yourself and don't accept the status quo. When you detect a problem, challenge the existing solution if you don't find it satisfactory. As leaders, we must be willing to take risks and bold steps. Nothing great is ever achieved by doing things the way they have always been done.