Disclaimer (and a short reminder): this blog post represents my personal observations & opinions. It should not be associated in any way with my prior or current employers and (in particular) cannot be treated as an unofficial or official statement on their behalf.

Public cloud is everywhere these days. That's hardly surprising, given its advantages over traditional infrastructure, especially unconstrained growth capabilities and flexible pay-as-you-go models. But all this new convenience changes the rules of the game in many unexpected ways and, interestingly (from a technical leadership perspective), not all of them are unequivocally positive.

Let's consider the on-premise model with its rigidity and hard constraints. Yes, it requires capacity estimation (never easy, never accurate), and companies typically over-provision (and treat it as a barely visible sunk cost). But the aforementioned constraints act as a kind of safety belt as well. What do I mean by that?

Overgrowing your infra

Typically there are some early warnings you're running out of physical resources:

  • average CPU load peaks at dangerous levels
  • RAM consumption is high enough to force swapping to disk
  • persistent storage quotas are getting dangerously close to 100%
  • even well-optimized, index-hitting queries take unacceptably long to complete
  • Web API HTTP requests that never took more than 20 ms now regularly reach 120 ms or more, etc.

Highly scalable cloud resources put a plethora of quick options within arm's reach to ease such pains. One click, a few seconds, and the problem is gone (for now). Bump the RDBMS instance size, add more processing nodes (in parallel), or set up an auto-scaling group (and forget about the problem altogether). What's more, some resources (especially the serverless ones or the ones that support autoscaling) do not require any manual intervention: you consume more, and they just grow to accommodate the increasing demand.

Virtual infinity. Really?

On the one hand, it's absolutely amazing. The inertia is eliminated, your operations are simplified, and the engineering crew can work on more important (your business-specific) problems.

On the other hand, constraints are a way to enforce frugality and inspire resourcefulness. When engineers knew they had limited resources to do something, that realization spurred their creativity. They tackled problems that looked unfeasible at first glance and questioned their previous assumptions and the typical ('by-the-book') way of doing stuff, hence learning and bringing new options (& opportunities) to the table.

Nothing ramps up a good engineering career like meaningful challenges concluded with fruitful outcomes.

These days, in many cases, engineers are not even aware of why their apps are such resource hogs. Many of them have lost the ability to properly profile & optimize applications in their favorite programming language and development environment. They (especially those who have no 'skin in the game', i.e., are not co-owners of the business they build) treat cloud resources as virtually free and not their concern.

These are not technical or simply skill-related issues. These are serious cultural problems.

Problem? Whose problem?

As a result, when facing the capacity/performance problems in the pay-as-you-go model, engineers tend to favor the easiest option - cover them with a pile of dollars (by increasing the cloud provider's bill).

Unsurprisingly, there's a significant chance that this will not end well. Consumption can quickly get out of control. Suppose your engineers suck not only at profiling and optimization but also at setting up cost monitoring and threshold alarms (yes, that's an entirely new skillset to learn and master!). In that case, you're pretty much screwed: expenditures in the cloud can pile up beyond your yearly budget in just a few days.
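To make the 'threshold alarm' idea a bit more tangible, here's a minimal sketch of a run-rate projection in Python. Everything in it is my assumption for illustration: the function name `project_spend`, the 3-day window, and the numbers are all hypothetical, and in real life you'd feed it from your provider's billing export rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    spent: float      # total spend so far in the period
    projected: float  # naive projection of spend at the end of the period
    breached: bool    # True if the projection exceeds the budget


def project_spend(daily_costs, days_in_period, budget, recent_days=3):
    """Project period-end spend from the average of the most recent days.

    Using only the last few days (instead of the whole period) makes the
    projection react quickly to a sudden jump in consumption.
    """
    if not daily_costs:
        raise ValueError("daily_costs must not be empty")
    spent = sum(daily_costs)
    window = daily_costs[-recent_days:]
    run_rate = sum(window) / len(window)
    remaining_days = max(days_in_period - len(daily_costs), 0)
    projected = spent + run_rate * remaining_days
    return BudgetStatus(spent, projected, projected > budget)


# Hypothetical numbers: a steady 100/day, then a misconfigured job kicks in.
costs = [100, 100, 100, 100, 100, 100, 100, 900, 1000, 1100]
status = project_spend(costs, days_in_period=30, budget=5000)
print(status)
```

With these made-up numbers, the spend so far still sits below the budget, yet the projection already blows far past it. That's exactly the moment you want someone to be paged, not the day the invoice arrives.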

Wasn't that the whole point?

So, does that invalidate the popular statement that the infrastructure is cheaper than people (whose availability is also more limited), so you shouldn't have your most precious 'resources' wasting time on optimizations instead of building features?

As usual, the truth is somewhere in between. I agree that a company should focus not on the commodity but on its competitive advantage (which is rarely related to resource consumption optimization ...). However, as all extremes are bad, look for a proper balance by using tools like the Pareto Principle, the Law of Diminishing Returns, and ROI calculation.

No, I'm not advocating for staying on-premise, quite the opposite :) My intention is to emphasize that the cloud is not just a deployment model. It enables distributing the ownership of much more 'fluid' cloud resources, so organizations need to make sure this ownership is well understood and actually picked up.

What are my other suggested actions? Apart from the obvious ones (use the cost control tooling created by your cloud services provider!):

  1. I keep repeating it as a mantra, but it's essential - having some form of 'skin in the game' can make this problem disappear w/o any additional intervention. People are more likely to be proactive and effective in cost control if the outcome affects what they get out of the business more directly.
  2. Allocate ALL the infrastructure costs - to clients, value streams, products, teams, or even individuals - and turn them into visible & trackable metrics with an automated escalation of anomalies. Whoever a metric is assigned to should be declared its owner - accountable for controlling its future changes.
  3. Trace the link(s) between input (business) and output (infrastructure consumption) metrics: is hardware cost somehow correlated to any of the meaningful business KPIs (e.g., clients acquired, active clients, daily transactions, extra feature consumption)?
  4. If there's indeed such a correlation, what's its nature? How does it scale? Is it linear? Exponential? What's the trend - are we getting more profitable or quite the opposite?
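Points 3 and 4 can be sketched in a few lines of Python. Treat this as an illustration under my own assumptions, not a recipe: the monthly figures are invented, the helper names (`pearson`, `slope`, `scaling_exponent`) are hypothetical, and I've picked a log-log fit (cost ≈ a * kpi^b) as one simple way to ask 'how does it scale?'. An exponent b close to 1 suggests roughly linear scaling, below 1 means economies of scale, and above 1 means each additional client costs more than the previous one.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def slope(xs, ys):
    """Least-squares slope of ys over xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def scaling_exponent(kpi, cost):
    """Fit cost ~ a * kpi**b in log-log space and return the exponent b."""
    return slope([math.log(k) for k in kpi], [math.log(c) for c in cost])


# Hypothetical monthly figures: the business KPI and the allocated infra cost.
active_clients = [100, 200, 400, 800]
infra_cost = [1000, 2828, 8000, 22627]

r = pearson(active_clients, infra_cost)            # is there a link at all?
b = scaling_exponent(active_clients, infra_cost)   # how does it scale?
unit_cost = [c / k for c, k in zip(infra_cost, active_clients)]  # the trend

print(f"correlation: {r:.2f}, scaling exponent: {b:.2f}")
print("cost per active client:", unit_cost)
```

In this made-up dataset the cost is strongly correlated with the number of active clients, but the exponent lands around 1.5 and the cost per client keeps climbing month over month. In other words: growth is making the business less profitable per client, which is precisely the kind of trend the metric owner should be asked about.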