While working on my last blog post, the one about outsourcing the pain, I came up with one interesting type of pain that software developers tend to forget about. Wait, they don't forget about it; they usually just pretend it does not exist: the pain of massive data inertia.

As close to production ..., right?

One of the main reasons why developers share their development environments (and cripple their development agility) is their insatiable hunger for data :) Everyone would like a full-blown copy of the production database. Sure, in 99% of cases the data is properly scrambled, but the volume persists. And in the vast majority of cases we're talking about GBs or sometimes even TBs.

Who would like to work with such a database? Who really needs such data? (Please keep in mind that I'm not focusing on reproducing particular error conditions now, just on everyday development / testing scenarios.)

Untangling the untangleable

Unfortunately, people do that because they don't have a choice or at least they believe so:

  1. Unexplored data model - data models are huge and no one knows 100% of them, so an overall clean-up would be a joint effort of many people who still may not be aware of all the assumptions, conventions & tech debt. In 99% of legacy apps, data structures are never removed from the data model (just in case), and that obscures reality even more (it's said that in SAP about 65% of data tables are not used at all anymore - dunno whether it's true or not).

  2. Effort of cleaning - even if you collect all the necessary knowledge about the data model, there's still a lot of work to do to perform the actual clean-up in a non-destructive way. Again - in the majority of legacy apps I've seen, there are no integrity constraints declared (I mainly mean FKs), so the database won't validate itself.

  3. Effort of preparing - cleaning is one thing, but you still need some content that's sufficient for all future testing, right? Preparing such a dataset may be a real challenge - time-consuming and demanding the attention of crucial domain experts. But the results may be stunning: I remember a test data design workshop for a retail bank where we managed to cover the full test suite of the transaction system (incl. every credit product lifecycle) using no more than 30 contracts (main entities).

  4. Who will confirm it's sufficient? - the actual work to be done is one thing, but what about verifying whether the data is fine or not? It surely requires a pretty much complete regression re-test of the whole application / system. A thorough regression that digs through all the internals - just because data stripping may have touched some cobwebs that would never have been touched otherwise (but they were gluing things together...).

  5. Coherency between data sources - obviously, in the case of integrated systems, some data is replicated across system boundaries. Sometimes it's just identifiers, sometimes something more. You can easily imagine that breaking such bonds is something that was rarely taken into consideration during development (aka design for failure, level 2).
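Point 2 is easy to illustrate. If the database won't validate itself, the relationships that exist only by convention have to be checked by hand after every clean-up. Below is a minimal sketch of such a check, assuming SQLite and two made-up tables, `customer` and `contract`, linked by an undeclared `contract.customer_id -> customer.id` reference. The table names, the `IMPLICIT_FKS` list and the helper functions are all hypothetical; your own list of implicit links is exactly the knowledge point 1 says nobody fully has.

```python
import sqlite3

# Hypothetical relationships that exist only by convention (no FK declared):
# (child_table, child_column, parent_table, parent_column)
IMPLICIT_FKS = [
    ("contract", "customer_id", "customer", "id"),
]


def find_orphans(conn, child, child_col, parent, parent_col):
    """Return (rowid, reference) of child rows pointing at no existing parent."""
    # Identifiers come from the trusted list above, never from user input.
    query = (
        f"SELECT c.rowid, c.{child_col} FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{child_col} = p.{parent_col} "
        f"WHERE c.{child_col} IS NOT NULL AND p.{parent_col} IS NULL "
        f"ORDER BY c.rowid"
    )
    return conn.execute(query).fetchall()


def build_demo_db():
    """Tiny in-memory database standing in for a stripped-down test dataset."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE contract (id INTEGER PRIMARY KEY, customer_id INTEGER);
        INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
        -- contract 12 references a customer removed by the clean-up
        INSERT INTO contract VALUES (10, 1), (11, 2), (12, 99);
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    for child, child_col, parent, parent_col in IMPLICIT_FKS:
        orphans = find_orphans(conn, child, child_col, parent, parent_col)
        print(f"{child}.{child_col} -> {parent}.{parent_col}: {orphans}")
```

Run against the demo data, this reports contract 12 as the only orphan. It obviously doesn't solve point 4 (nobody confirms the dataset is sufficient just because it is consistent), but it's a cheap safety net to run after each stripping step.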

Exploratory testing ... yeahrightsure

Because of all the points mentioned above, barely anyone makes the effort. People (both developers & testers) just live with the humongous ballast, peacefully accepting the consequences:

  • much, much more challenging test automation
  • relying on so-called exploratory testing & manual testing accuracy
  • each database refresh is time-consuming, destabilizes everything & creates a completely different, unique state
  • acceptance of the ignorance (we have a hell of a database, we don't know what's inside, but we're used to that already)

All of that because we all feel we're responsible for code, but we don't treat data as an artifact:

  • we keep the Single Responsibility Principle in code, but how often do we do the same in data? And if it's lacking, what's the impact on clear responsibility for data?
  • refactoring code is still much, much easier (& faster) than refactoring the data
  • code changes are all there in source control, but in the case of data: many teams do not keep DDL in source control, not to mention any transparent tracking of DML changes
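To make the last bullet more tangible, here's a rough sketch of what treating DDL and DML as versioned artifacts could look like. It assumes SQLite, and the `MIGRATIONS` list, the `schema_version` table and the `migrate` function are names I made up for illustration; in a real repository each migration would rather be a reviewed `.sql` file, and there are mature tools for exactly this job.

```python
import sqlite3

# Hypothetical migrations, ordered by name; normally one .sql file each.
MIGRATIONS = [
    ("001_create_customer",
     "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);"),
    ("002_create_contract",
     "CREATE TABLE contract (id INTEGER PRIMARY KEY, "
     "customer_id INTEGER NOT NULL REFERENCES customer(id));"),
    ("003_seed_minimal_data",
     "INSERT INTO customer VALUES (1, 'Test Customer'); "
     "INSERT INTO contract VALUES (10, 1);"),
]


def migrate(conn):
    """Apply every migration not yet recorded; return the names applied now."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (name TEXT PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM schema_version")}
    applied = []
    for name, sql in MIGRATIONS:
        if name in done:
            continue  # already applied to this database
        conn.executescript(sql)
        conn.execute("INSERT INTO schema_version (name) VALUES (?)", (name,))
        conn.commit()
        applied.append(name)
    return applied


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(migrate(conn))  # first run applies all three migrations
    print(migrate(conn))  # second run finds nothing left to apply
```

The point is not the runner itself, but that both the structure and a small, deliberately designed dataset live next to the code, so a fresh database can be rebuilt from scratch whenever someone breaks theirs.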

Yes, you're stuck

I have bad news: if you don't put effort into proper data preparation & maintenance, there are some borders you'll never cross.

In effect - you're stuck forever with the mess you have.

Stuck with being 100% dependent on manual testing, on the knowledge & motivation of your testers. You'll never be able to achieve sub-month release cycles. The only continuous activity in your development will be continuous concern about quality - slowly deteriorating, impacted by the growing mess, knowledge entropy & increasing technical debt.

Subconsciously you'll avoid some activities, because you'll be afraid of breaking the database (or interfering with other people who share it) and it takes sooo much time to have it restored. You'll keep stretching the periods between database refreshes, because each one will always require some effort (scrambling & creating dedicated users, for instance) that takes a lot of time because of the database's size.

And in the end, you'll keep telling your tester:

Yes, it's broken on the test environment, but most likely because of broken data. But don't worry, we'll have it synced with production in just 2 months' time. Keep calm & waterfall on.