In this blog post you'll read about why & how Shedul has started the journey towards Continuous Delivery, what it has in common with test automation and why 2 weeks is both a little & a lot ...
Dunno about you, but I love war stories. Theory is great, but even the greatest & most thoroughly thought-through ideas do not guarantee success - it's (almost) all about execution and learning from successes & mistakes (both your own & others'). I have shitloads of my own stories I'd gladly share, but in many cases I simply can't - as a consultant I was contractually obliged to keep a certain level of confidentiality.
Additionally, as a person with limited impact (even in a senior position) on crucial decisions & the direction set by the client, to remain professionally credible I'd in some cases have to bluntly criticise organisations & people who've entrusted me with their confidential details - it just wouldn't be ethically proper.
But the good news is that some time ago I moved from being a consultant to working directly (as an executive) for a product company named Shedul. So now I can wash some of our own dirty laundry in public! Yay, let's bring something meaty to the table :)
Maybe something about ... test automation? A topic everyone's aware of, everyone has done something about, but so very few have anything valuable to show ...
Before we start
No, we're not there yet. By "there" I mean Continuous Delivery (not a subjective interpretation, but the canonical one, proposed by Jez Humble). Let me paraphrase it for you:
Continuous Delivery is when you're ALWAYS ABLE to release the current state of your software (and it's your decision when to do it and when to wait).
As you hopefully already know, there's NO chance to get to CD without a high degree of quality assurance automation. In simple words:
- to have Continuous Delivery you need "quality built in" (all the time)
- to have "quality built in" you need Continuous Integration (on a shared branch, preferably trunk) and automated tests running within a Continuous Testing loop (a minimal sketch of such a loop follows right after this list)
As I said, we're not there yet, but we're very determined to get there.
Fast.
And honestly - there's nothing but our own limitations preventing us from getting there. We have no excuse. And this is a tale (in episodes) about our journey.
Disclaimer: we'll focus on Continuous Testing only. Topics like CI, evolving our branching model, provisioning & deployment automation, a cattle-style approach to infrastructure are all super-interesting & deserve their own series of posts, but let's focus on one problem at a time :)
(sort of) Starting Point
I joined the organisation 4 months ago, so I can't credit myself with any input into what was done before. And in fact, a lot had already been done:
- we were already able to release (deploy to production) every Sprint's increment (reliably)
- Sprints were relatively short (2 weeks each)
- there were no "test Sprints" or "integration Sprints", just "Sprints" that were delivering real value (our Demos were never about slides, we always present running software)
- it was a common practice to release features behind flags (toggles) and e.g. roll them out to particular client/region first (sort-of "canary release")
- test data automation was well dealt with (data seeded, seeds well managed)
- test environment provisioning & deployment was fully automated, hence test conditions were as repeatable as possible (to remove the "works on my machine" syndrome)
- static checks were present for all the languages we use (JavaScript, Ruby, Elixir)
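To illustrate the toggle/canary point above, the sketch below shows the general shape of the pattern only - it is not our actual implementation; the module, the flag and the client attributes are all made up for this post:

```ruby
# A hypothetical feature-toggle check. The point is the pattern: a new feature is
# dark-launched behind a flag and rolled out per client/region before going global.
module FeatureFlags
  CANARY_CLIENTS = ["acme-salon-london"].freeze  # illustrative identifier only

  def self.enabled?(flag, client:)
    case flag
    when :new_calendar_view
      # roll out to canary clients first, then to a single region
      CANARY_CLIENTS.include?(client.slug) || client.region == "IE"
    else
      false  # anything unknown stays off by default
    end
  end
end

# Somewhere in a controller/view (illustrative):
# render "calendar_v2" if FeatureFlags.enabled?(:new_calendar_view, client: current_client)
```

The key property is that the default is "off": unfinished work can be merged & deployed continuously while staying invisible to everyone outside the canary group.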
Not bad. But there were also very clear (visible at first glance) issues:
- due to a complex release process, the Scrum teams' work was not integrated into the 3 parallel (sequential) release branches until the final days of the Sprint (integration was not truly continuous across teams)
- the approach to test automation was over-simplified - either mocked unit tests (created by developers) or end-to-end tests (UI level, created by QA engineers) - nothing in between (see the sketch after this list for the kind of middle layer I mean) ...
- ... the former were not an issue (though frankly - I didn't have enough data to be 100% sure), but the latter crawled along (in terms of automation progress) & the functional coverage of the E2E tests was very low ...
- as we were always fixated on quality, even glitches were considered important & the overall approach to prioritisation got totally out of control - overzealous QA Engineers were chasing visual glitches so hard that they had no time to pursue automation goals
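For the "nothing in between" point above, this is the kind of middle-layer test I have in mind: an API-level request spec that exercises routing, controllers, models & the database without a browser and without mocks. It's only a sketch, assuming a Rails-style stack with RSpec request specs - the endpoint, models and fields are invented for illustration:

```ruby
# spec/requests/appointments_spec.rb - a sketch of the missing middle layer:
# real stack underneath (routing, controllers, models, DB), no browser, no mocks.
require "rails_helper"

RSpec.describe "Appointments API", type: :request do
  it "creates an appointment for an existing client" do
    client = Client.create!(name: "Jane Doe")  # illustrative model

    post "/api/appointments",
         params: { appointment: { client_id: client.id, starts_at: "2018-05-01T10:00:00Z" } }

    expect(response).to have_http_status(:created)
    expect(client.appointments.count).to eq(1)
  end
end
```

Tests at this level are far faster & more stable than UI-driven E2E tests, which is exactly what makes them viable inside a Continuous Testing loop.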
"2W ought to be enough for anybody"
Hmm, but was there really a reason to complain? Many companies would kill to get down to a stable 2-week delivery cadence ... Do end-users really have to get new features more frequently?
In our case a 2-week cadence simply ain't sufficient:
- our business runs around the clock (global in terms of clients & their users), available 24/7 & critical for our clients' businesses - since I joined we've had 0 (zero) releases with service downtime, all deployments are "fluent" & do not cause service interruption - this means that what could be a single release for other organisations in many cases (in our reality) splits into several deployments (backwards- & forward-compatible changes, data back-fills/clean-ups, etc.)
- we co-operate with reference clients & we do canary releases, so we need a fast feedback loop to continuously validate our hypotheses - an actual (measured) change in users' behaviour is always the best validation of a proposed direction
- 2-week Sprints are shorter than they appear -> manual regression testing (even partial) gets more & more repetitive (for QA Engineers) & cumbersome, and can burn individuals out very fast - that could be disastrous for the whole company!
That's why we definitely had to do something about it ...
The 2nd part in the series can be found here.