Fixing a Broken Development Process

John Doran

Director of Engineering at Phorest Salon Software

Problem

We changed our development process to help our business scale. As we onboarded more customers, our system's performance suffered. Our hosting costs were rising much faster than our growth rate, which was clearly unsustainable for the business. Things became very serious when we began suffering outages as various components of the platform started falling over on a regular basis.

Actions taken

We stopped all engineering work and undertook an effort called Project Darwin: evolving our platform and team. We downed tools to fix the pain, and that meant major adjustments for the business. We did the following to tackle the problem:

Continuous Integration
We began with the foundations and set up continuous integration (CI) servers. The CI server gave us a better workflow and real clarity about the quality of the system: where it had coverage and where it didn't. Further still, the CI server (and some crafty engineering) helped us break the dash repo into multiple application servers, re-architecting the system into smaller pieces.
Moving our deployment and infrastructure to Docker
Migrating to Docker made it easier for us to both develop and deploy using a range of different tech stacks. For us, this was a shift from one person being responsible for deployments to actually developing with Docker and using it in our day-to-day workflow. As we started splitting our system apart, it became more and more distributed, and Docker was great for us in terms of consistency and portability, particularly across those different tech stacks.
Automated testing
Prior to Project Darwin, the test suite took around 35 minutes to run once we had all the tests going. During Project Darwin, we wrote a set of performance tests, particularly around pulling down appointments, creating appointments, and sending large volumes of messages. We knew we couldn't have any regressions there, so we used Gatling for those performance tests. We would run them from the continuous integration server, along with various types of soak testing, to make sure we weren't taking any steps backwards. Moreover, each time we deployed we would run a performance test to ensure that the new deployment wasn't slower and hadn't introduced any problems.
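To make the idea of a per-deployment performance check concrete, here is a minimal sketch of a regression gate in Python. It is not our Gatling setup: it assumes response times in milliseconds have already been collected by whatever load tool is in use, and the function names, the 95th percentile, and the 10% tolerance are all illustrative choices.

```python
import math


def percentile(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_regression(baseline_ms, current_ms, pct=95, tolerance=0.10):
    """Compare the current run against a stored baseline run.

    Returns (passed, baseline_value, current_value). The run passes when the
    current percentile is no more than `tolerance` (a fraction, hypothetical
    default of 10%) above the baseline percentile.
    """
    base = percentile(baseline_ms, pct)
    cur = percentile(current_ms, pct)
    return cur <= base * (1 + tolerance), base, cur
```

A CI job could call `check_regression` after the load test finishes and fail the build when the first element of the result is false, which is one way to stop a slower deployment from going out unnoticed.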
Shifting our development practices to a more collaborative approach
Before, there wasn't really a team effort to ship things. It was more a case of a developer finishing the code on their machine and pushing it off, and maybe at the end of the month we'd ship. The people who wrote the code didn't ship it, which led to all sorts of problems in terms of dependencies, tangles, and knowledge silos. Afterwards, it was about the team: leadership and team members, together, talking through common issues. They meet every two or three weeks, talk about some key metrics in the system (why is this too high? why is this too low?), and work through trigger pairs. They identify the pain points so that efforts are focused.
We make a big effort, particularly for people who are working remotely. We try to get them all in the same room once a quarter. We talk about our challenges, talk about our goals, talk about our values, and make sure we're all on the same page.
Monitoring, tooling and system health
We upgraded our systems and started using New Relic to help find errors. We also used APM. We looked at CloudWatch and reintroduced CloudWatch metrics to help us watch traffic and spot slow transactions. Logentries helped us a lot in terms of spotting anomalies in the logs. Pingdom was a surprisingly good addition to our monitoring. It simply calls any health check endpoint you want and has some nice Slack and messaging integration.
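For readers who haven't built one, a health check endpoint of the kind an external monitor polls can be very small. The sketch below uses only the Python standard library and is an illustration, not our production code: the `/health` path, port 8080, and the two always-passing checks are hypothetical placeholders for real dependency checks.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_status(checks):
    """Run each named check and return (http_status, body_dict)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception:
            # A check that raises is treated the same as one that returns False.
            results[name] = "failing"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body


# Hypothetical checks; real ones would ping the database, queue, and so on.
CHECKS = {"database": lambda: True, "message_queue": lambda: True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_status(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the health check server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the server. Returning 503 when any check fails lets a monitor that only looks at the status code raise an alert without parsing the body.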
Additionally, we wrote some small end-to-end tests that gave us a sort of heartbeat for how the system was running and the confidence that we would know about an issue before a customer did. It allowed us to get rid of that red button.
The work we undertook helped us:
Make our system fault tolerant
Let us become elastically scalable
Make deployments faster and easier
Get our hosting costs under control

Lessons learned

You need to build an environment of trust.
You need to be confident and okay with failure when taking risks, which sometimes means saying no to features and to customers. You also need to be able to push back on leadership and make sure that you're really evolving the system the right way, so that you don't just become a feature factory.
Part of Phorest's core values and mission is to help the salon owner grow their business and use the tools that we provide to do that. We realized that if people were firefighting all of the time and unable to support our customers and boost their revenue, then the company would have been pointless. We weren't fulfilling our mission by coping with outages and constantly panicking.
I would say if we had waited any longer it could've been detrimental to the health of the business. I think that we did a great job in terms of getting to a certain point, but we would've risked technical decay and really done harm to the organization if it had gone any further. I believe Darwin was a lot of work and it could've been easier if we had paid more attention to technical debt and made the right decisions earlier on. For example, maybe saying no to that customer who wanted a bespoke piece of functionality.
It was a pretty hard decision for us to make in terms of the business, because we had a lot of deliverables and commitments to customers and to our sales team. But we made the call and it paid off.
We're really happy with the state of the infrastructure to get us to maybe 8,000 to 10,000 salons, but we need to be really conscious of the company's growth and our goals. We need to make sure that we can scale at a much bigger level, and we also need to make sure that our customers aren't affected by our growth.

Source: https://nothingventured.rocks/fixing-a-broken-development-process-2049cd4ba116
