Resolving Our Platform Stability Issues

John Doran

Director of Engineering at Phorest Salon Software

Problem

A couple of years back we started to face some serious scaling issues. As the platform had grown and we had onboarded more and more customers, system response times began to suffer due to server exhaustion and database contention. Our hosting costs were rising much faster than our growth rate, which was clearly unsustainable for the business. Things became very serious when we suffered outages, as various components of the platform started falling over on a regular basis.

Actions taken

We found that we were continuously firefighting to keep the system running smoothly. The support team ended up with a giant red button on their floor that was pressed numerous times per week. Our reputation for delivering amazing customer service was being hurt. The teams who had to deal with frustrated customers weren't able to do their jobs, that is, to help salons utilise our platform to grow their businesses.

We needed to take serious action and fix the stability issues, rethinking our values and decision-making process when an incident happened. We could no longer just "redeploy it" or "make X bigger" to solve production issues. Every incident needed an outage report, which gave us clear actions and solutions for every type of incident. Each outage report had the same format, was written by the software engineers working on the problem, and was published on our wiki. It was shared with everyone in the company, giving transparency and reassurance, and it meant we were putting preventative measures in place.

Outage report format

Description: Digestible one-liner of what happened.
Outage time: hh:mm
Number of support tickets raised: Detailed numbers of impact on the support team
Affected functionality: Description of the functions of the system affected by the outage
Explanation of the problem: A clear technical description of what happened. Note: the report ensures we have a clear understanding of what actually happened.
Investigations: Details of where the engineer looked, how they came to fix the issue, and how long it took, along with screenshots of metrics or logs from the issue.
Preventative measures and actions: What are we going to do to stop this from happening again? Note: the minimum expectation here would be an alert to help us pre-empt the issue. Each action needed to be tracked in Jira.

What we found was:
Weaknesses in automation and deployment procedures
A build process and speed to deploy that were too slow
A lack of monitors and alerts (customers knew about issues before we did)
Server components which struggled to deal with traffic volumes
Outdated versions of libraries and code which had memory leaks

When analysing the data we could clearly see how much this was hindering our product development. Engineers were constantly being pulled in different directions to firefight. That instability in velocity and delivery meant we couldn't accurately predict when new features or improvements could be delivered.

Two of our core values are growth and thinking long term, so we knew it was time to fix these issues and evolve our platform. The effort to fix everything was too large for a small engineering team that was also continuing product development work. We had to make a big decision to halt all product development work and undertake a large piece of engineering effort to fix the problems. This had large knock-on effects, as we had business commitments made and expectations to meet. But the goal was clear: to improve the stability of our system while helping it scale as we grow our customer base.

We called this engineering effort project Darwin, as it was about the evolution of our system. From an engineering side it was extremely difficult to know when we would be done, but we broke it down into small measurable increments. Some of the major pieces of work we took on were:
Starting with test coverage at an API and integration level, so we would know if we broke anything
Writing Gatling performance tests to ensure we could simulate production environments
Dividing a monolithic backend up into separate services (bounded contexts per responsibility)
Migrating from classic EC2 baked AMI deployments to Docker
Making our containers self-healing and load balancing them behind ALBs
Moving our infrastructure to code
Migrating our databases to Amazon's Aurora
Making our services stateless and removing caches
Adding auto scaling capabilities
Fully automating our build process and release process
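
To make the outage report format described earlier in this section more concrete, here is a minimal sketch of how such a report could be captured as a structured record and checked before publishing. It is an illustration only, not the tooling we used: the class names, the `problems` method, the `is_alert` flag and the `OPS-1` ticket key are all hypothetical, and the sample values are made up. The two checks mirror the expectations stated in the format: at least one alert among the preventative measures, and every action tracked in Jira.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PreventativeAction:
    description: str
    jira_key: Optional[str] = None  # hypothetical ticket key such as "OPS-1"
    is_alert: bool = False          # True when the action adds a monitor or alert


@dataclass
class OutageReport:
    description: str             # digestible one-liner of what happened
    outage_time: str             # duration as "hh:mm"
    support_tickets: int         # number of support tickets raised
    affected_functionality: str
    explanation: str             # technical description of what happened
    investigations: str          # where the engineer looked and how it was fixed
    actions: List[PreventativeAction] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return the reasons this report is not ready to publish (empty if none)."""
        found = []
        # Minimum expectation: at least one alert to help pre-empt the issue.
        if not any(action.is_alert for action in self.actions):
            found.append("no alert among the preventative actions")
        # Every action should be tracked in Jira.
        for action in self.actions:
            if not action.jira_key:
                found.append(f"action not tracked in Jira: {action.description}")
        return found


# Made-up data for illustration only; it does not describe a real incident.
report = OutageReport(
    description="Example: appointment screen slow to load",
    outage_time="00:45",
    support_tickets=12,
    affected_functionality="Example: appointment booking",
    explanation="Example: database connection pool exhausted",
    investigations="Example: checked connection metrics, restarted workers",
    actions=[
        PreventativeAction("Alert on connection pool usage", "OPS-1", is_alert=True),
        PreventativeAction("Upgrade the leaking library"),
    ],
)

for problem in report.problems():
    print(problem)
```

Keeping the fields identical for every incident is what makes reports comparable over time, and a check like `problems()` is one possible way to stop a report being published without a tracked follow-up.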

Lessons learned

While it was painful to stop feature development and fix the issues, we can safely say our stability problems are gone. There is no more firefighting and the red button on the support floor is thankfully gathering dust. By using our long-term values as guidance, we took on project Darwin to attain platform stability, fault tolerance and elasticity.
So that we never have to fall back into this big bang approach of needing to fix things, we have adopted a continuous improvement mindset, which is now a core part of our engineering values. We take periodic breaks in our development sprints to work on our technical backlog: fixing niggling issues, upgrading areas of the system, answering the unknowns and always making the system better.
As mentioned, our hosting costs had been unsustainable. Looking back, we now see a lower and non-fluctuating AWS bill.
On a more personal note, this was one of the hardest engineering challenges I have ever faced. It wouldn't have been possible without the talented engineers and the support of the team at Phorest.

Source: https://hackernoon.com/resolving-our-platform-stability-issues-4bf4aeb2e1ca
