Platform Keepers, Container Herders - how we've started doing SRE

In this article you'll learn about modern Operations (& why DevOps isn't quite the answer), what's missing in the "you build it, you run it" paradigm, why a platform-caretaking team is needed, what "SRE" really means, how we started building an SRE team & what we struggled with.

Good ol' days

In the good old days of computing (in my case that means the 90s & 00s ;>), life was so easy ;P There were development people & there were maintenance people. There were programmers & there were admins + SysOps. They rarely talked to each other, they threw work over the wall to the other dept & everyone was happy ;P

OK, I'm kidding - this was inefficient, annoying & frequently crippled the whole organisation due to local silos of responsibility & limited knowledge sharing. Fortunately those days seem to be over & (almost?) no-one organises their service + infra operations this way any more (at least in theory ...).

The 4th most twisted word ever

(after Agile, Cloud & Big Data)

This change was possible mainly thanks to the Rise of DevOps (and partially due to the supremacy of Cloud). Although very few truly understand what stands behind this mysterious term, its buzzword potential is hardly matched.

Based on my observations, there are 2 common ways organisations follow to "implement DevOps":

  • dedicated "DevOps teams" - usually Infra/Support teams renamed to sound more modern; the major difference is the average skillset, which is nowadays far more balanced between dev & ops specialties: people on these teams usually understand enough of both worlds
  • standard feature teams "peppered" with "DevOps Engineers" - this is obviously far better (in terms of knowledge sharing & building a DevOps culture), but may fall into the trap of sticking to a very local perspective (that of a single team)

You build it, ...

you (should really, really, really) run it, BUT ...

When we were working on a more mature approach to operations & maintenance in Fresha Engineering, we were still beneficiaries of our Heroku heritage: this platform is so simple (some would even say: oversimplified) that we didn't need an "Infra team", a "DevOps team" or a "SysOps team" - we were all DevOps by default, accidentally - just because the PaaS was so straightforward & approachable that all engineers could use/automate/build-upon it.

But this could not last forever (at least not in our case) - as we keep growing rapidly:

  • linearly: when it comes to team headcount, feature base & complexity, etc.
  • exponentially: in terms of the number of customers, volume of traffic, etc.

We've never followed the crazy microservice frenzy, but that doesn't mean we have a single, uniform monolith - we've split the platform into loosely coupled, context-oriented services, separate applications (both server- & client-side), independently scalable components, distinguished APIs, ...

We've also started moving out of Heroku - its simplicity (so helpful at some point) started to be more of a burden than an advantage. We needed more flexibility, more visibility, more control - to apply a more powerful releasing model, to detect forthcoming outages (before they actually happen), to use cloud resources in a more cost-effective way ...

Matter of perspective

(why Platform team is needed)

That's when we came to the conclusion that "you build it, you run it" is a stellar principle when it comes to support & incident/problem handling, BUT there's a shitload of other Platform-related work that someone needs to take care of:

  • end-to-end business service availability for end-users, which spans the borders between technical services/applications
  • general resilience & disaster recovery - not only in a reactive, but rather in a pro-active way (yes, we're entering the realm of chaos engineering here ...)
  • learning (in a scientific, data-driven way) the limitations of our platform - its max capacity, how it will behave in 1, 3, 6, ... months' time
  • providing the best tools, practices & know-how needed to resolve incidents within the platform: monitoring, log management, error handling, graceful degradation
  • introducing & continuously developing the non-functional features of the platform - dedicated tools aimed at reducing risk & shortening the release cycle: zero-downtime deployments, testing in production (canary), seamless deployments (blue/green), automated release verification (smoke tests), automation of the whole Continuous Delivery pipeline ...

And many more.
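To make one of the items above (automated release verification) a bit more tangible, here's a minimal sketch of the idea. All names & thresholds are illustrative assumptions, not our actual setup - the probe is injected as a callable so the gating logic stays independent of the transport (in reality it would be an HTTP check against a canary instance):

```python
import time


def verify_release(probe, attempts=5, max_failures=1, delay=0.0):
    """Smoke-test gate to run right after a deployment.

    probe: zero-argument callable returning True when the service
    responds correctly (e.g. an HTTP GET against a health endpoint
    of the freshly deployed canary - hypothetical here).
    Returns True when the release passes, False when the pipeline
    should trigger a rollback.
    """
    failures = 0
    for _ in range(attempts):
        if not probe():
            failures += 1
            if failures > max_failures:
                # too many failed probes -> signal rollback
                return False
        time.sleep(delay)  # pause between probes
    return True
```

In a real Continuous Delivery pipeline the False branch would automatically roll the traffic back to the previous release, which is exactly the kind of risk-reducing tooling listed above.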

But still - all of that WITHOUT crossing out the principle "you build it, you run it". Feature teams are still supposed to support the platform (e.g. during on-calls), but there's a need for a bunch of specialised engineering people who'll focus on the non-domain-related aspects of the platform & provide the tools + data that help with maintaining it.

SRE to the rescue

I think I first heard about SRE (Site Reliability Engineering) at Google in 2016. Or - to be more precise - I read about it in the 1st of the 2 books Google engineers have dedicated to this topic. It didn't rock my world (my review is here), mainly because I found the form - several articles written by multiple people - too repetitive & "uncombed", but ... I couldn't deny that this way of thinking about platform operations resonated with me & my past experience (aka lessons brutally learned from my own mistakes).

What's SRE then?

I like the definition from Wikipedia:

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to IT operations problems.

Brilliant. It's a development-oriented approach to scalability / availability / resilience / performance / ... problems. It's based on automation, declarative definitions in code (usually some sort of DSL) & clear KPIs & metrics (SLOs, SLIs). Pro-active, not re-active. Aaaand scalable, in the sense that if your infrastructure / traffic grows 10 times, you don't need 10 times more people to keep it running.
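To make the SLO / SLI / error-budget part concrete, here's a toy sketch (the numbers are illustrative, not our actual targets): the SLI is the measured fraction of successful requests, the SLO is the target, and the error budget is how much unreliability you can still "spend" on releases & experiments before violating the SLO:

```python
def availability_sli(successful, total):
    """SLI: measured fraction of successful requests."""
    return successful / total


def error_budget_left(sli, slo):
    """Fraction of the error budget still unspent.

    The budget is (1 - slo); what was burned is (1 - sli).
    1.0 means untouched, 0.0 means fully spent,
    negative means the SLO is violated.
    """
    allowed = 1.0 - slo
    burned = 1.0 - sli
    return (allowed - burned) / allowed


# e.g. 99.95% measured availability against a 99.9% SLO:
sli = availability_sli(999_500, 1_000_000)
budget = error_budget_left(sli, 0.999)  # ~0.5: half the budget left
```

The point of tracking this as a number is exactly the pro-active stance mentioned above: when the budget runs low, you slow down risky changes before the SLO is broken, not after.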

Google has reportedly been doing this (under this very name) since around 2003. Several other high-profile (& not only) companies have picked up the terminology & apply it as well. So it happened in our case: we decided that the SRE mindset (& duty-set) exactly fits the gap that was widening in our company.

SRE - the new dawn

The beginnings are rarely easy. Neither were they for us.

We started in August 2018, with a self-proclaimed Statement of Work for our newly formed SRE team (proudly bearing the name "Undefined"). None of the engineers on board had played exactly such a role before.

We started with 4 people (& expanded to 6 within 3 months) - all of them had the choice to opt in or out (I chose the candidates, but they had the right to decline - some used it):

  • 1 (-> 3) Cloud Engineers - people with vast experience in cloud infrastructure & the Infra-as-Code approach
  • 3 Back-End Engineers - people who knew all the technologies we've used to build the server-side components of our platform (and had the proper attitude - they preferred breaking stuff over building new features ;>)

The name "Site Reliability Engineering" didn't really fit our conditions (we have ... 1 site ;>), so we've renamed it to System Reliability Engineering. At least for now :)

But naming was the least of our issues. We've struggled with:

  • team identity - the guys went through all the necessary "theory" & examples, but they were in fact re-defining the role in our current context - this was not always easy & required a shift in general mindset
  • infra+back-end dualism - none of the 6 was a true SRE at the beginning; we had 3 infra guys & 3 back-end guys, while in fact we needed 6 people "in the middle", understanding enough of both worlds; fortunately I didn't even have to encourage mutual cooperation
  • doing things for the 1st time - even w/ all the knowledge available on the Internet, doing some things for the first time ain't easy, especially when the inertia (of huge volumes) & the stakes (production infra!) are high
  • toil, toil, toil - there's a lot to struggle with, until you manage to automate some of it (which comes with time)
  • WIP (Work in Progress) - we took a lot of stuff onto our plates, namely: platform migration to AWS, progressive containerisation of applications, DB instance version unification, capacity testing, ...
  • knowledge sharing - we decided to work as a small, focused team, to speed up the infancy period of SRE & the before-mentioned projects, but this always constitutes a challenge when it comes to spreading newly acquired knowledge - again: fortunately the guys were mature enough to realise, understand & address the issue straight away

Experiment Continues

We're getting close to 6 months of our New Opening.

It works better than it ever did. We have a very short learning loop & we're not afraid to apply changes (which is apparently a huge struggle for so many companies), but we're still nowhere close to where we want to be.

Fortunately - we believe we'll get there - our progress is measurable & the speed of change is comforting. We're tackling obstacle after obstacle & that's the best fuel we can get to keep motivation high.

Paradoxically, the Feature Teams (the other teams - focused on business features) are taking more & more of the daily operations onto themselves. This was initially my biggest concern (I didn't want to build knowledge silos & invisible walls), but so far none of my fatalistic fears have come true.

According to our predictions, the next 3 months will bring a real breakthrough in terms of what we can do to improve platform operations.

Keep your fingers crossed for us.

Pic: Copyright by National Geographic. All rights reserved.

Sebastian Gebski

About Sebastian Gebski

Geek, agilista, blogger, codefella, serial reader. In the daylight - I lead software delivery. #lean #dotnet #webdev #elixir. I speak here for myself only.
