Ways to Reduce Your Cloud Costs
Problem
Our cloud computing costs were increasing dramatically over time. We were spending too much, and the only data we had was the invoice from our provider, which kept going up.
Actions taken
First and foremost, we had to understand exactly what the problem was, but we lacked detailed data and visibility into our computing costs.
I started by streaming our billing data into BigQuery and building a dashboard in Data Studio to give the organization visibility into our costs. This allowed us to see how much we were spending in different areas and on different computing products, and how this was changing over time. The dashboard was updated automatically. Every morning, I would look at it and post screenshots in Slack highlighting problems or insights, for example: "The cost went up today. Why do you think that happened?"
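The morning check described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual pipeline: the simplified row layout (day, service, cost), the function names, the 20% threshold, and the sample figures are all invented here, and a real billing export has a much richer schema.

```python
from collections import defaultdict
from datetime import date


def daily_cost_by_service(rows):
    """Sum cost per (day, service). Each row is a (day, service, cost) tuple."""
    totals = defaultdict(float)
    for day, service, cost in rows:
        totals[(day, service)] += cost
    return dict(totals)


def flag_increases(totals, today, yesterday, threshold=0.20):
    """Return (service, before, after) for services whose cost rose by more
    than `threshold` (as a fraction) from `yesterday` to `today`."""
    flagged = []
    services = {service for (_, service) in totals}
    for service in sorted(services):
        before = totals.get((yesterday, service), 0.0)
        after = totals.get((today, service), 0.0)
        # Skip services with no spend yesterday to avoid dividing by zero.
        if before > 0 and (after - before) / before > threshold:
            flagged.append((service, before, after))
    return flagged


# Invented sample data: (day, service, cost in dollars).
rows = [
    (date(2024, 1, 1), "compute", 100.0),
    (date(2024, 1, 1), "storage", 40.0),
    (date(2024, 1, 2), "compute", 90.0),
    (date(2024, 1, 2), "compute", 60.0),
    (date(2024, 1, 2), "storage", 41.0),
]

totals = daily_cost_by_service(rows)
flagged = flag_increases(totals, today=date(2024, 1, 2), yesterday=date(2024, 1, 1))
for service, before, after in flagged:
    print(f"{service}: {before:.2f} -> {after:.2f}")
```

A list like `flagged` is the kind of thing that could accompany a dashboard screenshot, so the question "why did this go up?" starts from a specific service rather than from the total invoice.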
I collaborated with others to develop a cost reduction roadmap that would make cost a lasting pillar within our priorities. Within that pillar, we created a prioritized backlog. For each project, we estimated how much we could save. Some projects also had strategic impact around improving performance and reducing technical debt, so we included those factors when prioritizing. Having this backlog in place helped facilitate brainstorming, and ideas on how to further reduce our costs started to pour in.
Within the backlog, there were several small, impactful projects that could be done across the organization. To concentrate our effort, we formed a tiger team of just a few engineers to deliver these projects without disrupting other teams. Because they focused only on cost, they also developed tools and best practices that were distributed to the rest of the organization, such as profiling, a canary release process, and performance analysis tools that compare two versions of a service to establish whether there is a significant difference in latency and cost.
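One way to decide whether two versions of a service differ significantly is a permutation test on latency samples. The article does not say which statistical method the team's tools used, so the test below is a stand-in chosen for illustration; the function name, the sample latencies, and the 0.05 cutoff are assumptions.

```python
import random
from statistics import mean


def permutation_p_value(baseline, candidate, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean latency.

    Returns the estimated probability of seeing a difference at least as
    large as the observed one if both versions behaved the same.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(mean(candidate) - mean(baseline))
    pooled = list(baseline) + list(candidate)
    n_base = len(baseline)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign samples to the two versions.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_base:]) - mean(pooled[:n_base]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Invented latency samples in milliseconds.
baseline_ms = [120, 125, 130, 118, 122, 127, 121, 124]
candidate_ms = [101, 99, 104, 98, 102, 100, 103, 97]

p = permutation_p_value(baseline_ms, candidate_ms)
if p < 0.05:
    print("latency difference looks significant")
else:
    print("no significant latency difference detected")
```

The same function works for cost samples, such as cost per thousand requests for each version. A permutation test makes no assumption about the shape of the latency distribution, which is convenient because latencies are often skewed.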
The cost initiative also influenced departments outside engineering. We had hundreds of customers using our platform, which was built with a multitenant architecture, so it was not easy to identify the amount of resources each customer was using. We created a predictive heuristic model, based on our application logs, to estimate the cost per customer. We turned this into a dashboard, which gave the company visibility into profitability by customer. That led us to redo our pricing model, in collaboration with Product Marketing, Sales, and Finance, by devising a single metric that best correlated with our costs. The Sales and Customer Success teams applied this model to all new customers and to customers approaching renewal, in order to make unprofitable accounts profitable or let them go.
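A heuristic of this kind can be as simple as splitting the total bill in proportion to weighted usage counted from the logs. The sketch below is an assumption about the general shape of such a model, not the one the team built: the operation names, the weights, the customer names, and the dollar amount are all hypothetical.

```python
from collections import defaultdict

# Hypothetical relative cost of each logged operation type.
WEIGHTS = {"api_request": 1.0, "datastore_write": 5.0}


def estimate_cost_per_customer(log_events, total_cost, weights=WEIGHTS):
    """Split `total_cost` across customers in proportion to weighted usage.

    `log_events` is an iterable of (customer, operation) pairs taken from
    application logs. Operations without a weight are ignored.
    """
    usage = defaultdict(float)
    for customer, operation in log_events:
        usage[customer] += weights.get(operation, 0.0)
    total_usage = sum(usage.values())
    if total_usage == 0:
        return {}
    return {c: total_cost * u / total_usage for c, u in usage.items()}


# Invented log events for two customers.
events = [
    ("acme", "api_request"),
    ("acme", "datastore_write"),
    ("globex", "api_request"),
    ("globex", "api_request"),
    ("globex", "api_request"),
    ("globex", "api_request"),
]

estimates = estimate_cost_per_customer(events, total_cost=1000.0)
for customer, cost in sorted(estimates.items()):
    print(f"{customer}: ${cost:.2f}")
```

Subtracting each estimate from that customer's revenue gives a rough profitability figure per customer, which is the view the dashboard described above provided.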
Lessons learned
- We had trouble taking action until we thoroughly analyzed and broke down the problem. Unlike leaders who have a broader perspective on problems, most engineers think at the level of the scope they are involved in and struggle to identify problems that go beyond that scope. Once we got to the level of labeling each service and each resource by team, engineers on those teams could focus on and understand the problems specific to their team and, consequently, define follow-up tasks. If you can't break down a metric to the appropriate level, it's hard to take any action.
- Lead by example! If you want the team to focus on a certain goal, look at the metrics you care about every day. I would pull up the cost dashboard every morning. You will almost certainly find persuasive data, and your commitment will signal to everyone the importance of the problem. If it's important enough for you to spend your time on it, it's important for everyone else. That can also spark productive conversations and lead to a number of great ideas. Posting some of those ideas will draw further attention and increase overall visibility.
- There is a correlation between cost and reliability. We observed that many reliability incidents caused a cascading chain of events. For example, a customer's traffic pattern would hit Datastore too hard, request latency would increase, more instances would start up, and the surge in instances would add pressure to preload caches, hammering Datastore even more. A single incident like this could cost us thousands of dollars! Therefore, increased reliability likely means decreased costs.
- There is also a correlation between cost and performance. Our segmentation system was very costly because of its inefficient design, which copied the user profile rather than sending delta updates. It was also not scaling for our biggest customer. We re-architected this system, which became a win-win situation: reduced costs for us and better service for our customers.
- Cost reduction requires ongoing effort. Earlier this year, we reduced our costs by over 60% in six months, and many team members mistakenly thought we were done. A couple of months later, our costs started to go up again. You have to keep a constant eye on costs. The problem is entropic in its essence: it's easy for an engineer to introduce a change that increases costs, or for a new customer to be onboarded with a unique use case that ends up being costly, and things can spiral from there.
Be notified about next articles from Andrew First