TL;DR Practical usage of database technologies has drifted quite far from academical theory we've (or at least I've) been taught at universities. In my case some lessons learned were pretty much contrary to opposite to what I've believes was correct as a fresh graduate. This blog post is about them - e.g. why redundancy may be advantageous, deleting any data is wrong or updates in place should be replaced with aggregate switching.
I'm one of these people who've built their professional career in software engineering on prior academic knowledge gained during studies on technical university. I'm proud of if, I think my studies have helped me a lot & if I had a chance to change my education path in the past (knowing how it all went), I would not change anything. But to be honest the measure of how closely theoretical knowledge corresponds to everyday practice varies from topic to topic - surprisingly one of the topics with quite a wide gap between the theory & practice is database design, especially for on-line (more or less) transactional systems (the ones I've spend majority of my career building).
Why do I refer to that? Every once in a while I share my experience with some more junior engineers & sometimes they seem a bit shocked with a full certainty - why not make it a blog post then?
That's why I hereby present you my set of key, baseline rules I've worked out (/ learned from someone, adopted/adapted) during all past years. Adhering to them (in an aware, thoughtful way, of course) is for me sort the first step to provide a certain level of quality in a system-under-development.
Few basic disclaimers before we start:
- Majority of the naming (of patterns / practices / categories) is either my own or "stolen" from some co-worker, hence they are "unofficial" & may not conform to some common naming standards (that I may not be aware of).
- These rules are highly (not 100% though) applicable to RDBMSes - typical storage of transactional data in on-line systems (at least in my case) - applying them to various NoSQL DBs will not work in some cases.
- As I've already mentioned, these principles are aimed at on-line, transactional, service-based systems. Not analytics, not batch job-based crunchers, not heavy ETL marts or anything like that. And even in this, narrow, single category system characteristics may differ a lot: some systems are more write-heavy, some are not; some are more about timeliness of data, while some are only about minimum latency. So, please, keep in mind that these principles are not 100% universal - always map them into your context & make your own judgement based on local conditions
- The goal here is NOT to optimise for performance only! It's to keep a reasonable performance, but primarily optimise for maintainability, openness for changes & domain scalability
OK, let's get to business then:
Classify data structures
Once you design a data structure (e.g. table), apart from assigning it to a specific aggregate, always mark it (e.g. in metadata) as particular category of information, e.g.:
- configuration data - business rule parameters, dictionaries, data your processing (globally) depends upon - typical examples: interest rate multipliers, currency exchange rates, system-wide commission split ratios, workflow template metadata
- repositories - core data that represents they key entities in the system (system is "centric" around them) that don't change frequently (so, no workflow data here!) but other structures refer to them commonly - typical examples: user, customer, contract
- transactional data - data produced due to normal, operational activities in your business - typical examples: workflow instance, data put in an request / application / order form, comments, purchases
- "protein-only" data - data that doesn't take any part in processing & is only used by humans as an enrichment of other kinds of data - typical examples: binary attachments, pictures, uploaded documents, archived mails / formatted notifications
- audit data - append-only logs & traces that represent (a selection) of kinda event-sourced history for particular entities / functionality - typical examples: undo information, entity changelog, security audit log, troubleshooting log (for support)
Why do that? There are at least two reasons:
- these conceptual data categories pretty much determine how data should be treated in various situations: some you have to keep on all environments, some are redundant on test / development one; some can be archived, some should not be at all; some don't grow or grow linearly while some tend to grow exponentially (so may require different storage / indexing strategies), etc.
- to make sure you don't mix these responsibilities in 1 particular data structure - as it would mean violating the separation of concerns on DB level
Don't delete any data (almost ever)
My (teams') code almost never deletes any data. If something has appeared in DB structures, it should (physically) remain there for historical traceability & transparency reasons. Of course it doesn't mean that end-users / service consumers should keep seeing it -> having a common convention (bit field named "MarkedAsDeleted") to filter them out works like charm here.
Of course there may be delete-heavy situations where data is highly volatile & past values are not important after certain (short) time (not every quantum of information is business-critical), but otherwise than then, this rule is explicitly applicable in vast majority of cases.
There's also a case of law obligations (data removal requests allowed by law) - but this is completely separate story (& can be handled in a different ways).
For transactional data, don't update either
Proper versioning is the key.
In fact, in-place updates are usually just as bad as deletes - especially for "transactional data" (check the definition in categories above). What usually worked for me best was having a sensible versioning schema on the level of aggregate & switching over between full, internally coherent aggregate versions.
Reasons for avoiding updates are pretty much the same as for deletes - they decrease traceability, transparency, may cause to accidentally losing data (yes, I've heard about backups ;), but re-building broken parts of DB from backups after some time IS cumbersome). Such approach also helps a lot with complex, multi-step modifications ("in progress", not effective yet versions) & their saga-based implementations.
I wish I'd done a talk titled "From CRUD to CRAP--Create Read Append Profit"— Viktor Klang (@viktorklang) April 25, 2017
If you design this mechanism well, adding archiving / history rolling on the top of it will be a piece of cake.
Redundancy is OK ...
... as long as you contain writes in aggregate-level services.
We all seem so tainted with the overzealous tendency of unconditional normalization (mainly due to traditional academic education) that we keep omitting the fact that there's actually nothing wrong with reasonable redundancy of data. What does it mean 'reasonable' (in this particular context)?
- one that serves the optimisation of queries (saves us some excessive joining)
- one that is covered by containing the writes into all the necessary places within one transaction (so write is atomic & leaves aggregate in internally coherent state)
- one that keeps such write operations' scopes within a single aggregate - so it's not half of the application that smears data freely around w/o control -> adhering to exclusivity aspect of SRP (Single Responsibility Principle) is crucial here
This is especially beneficial in (quite common) scenario of apps that serve far more read requests (queries) than write requests (commands).
But isn't it confusing for the other developers in the organisation? How should they know which field to use, if it's duplicated in more than 1 entity? In fact - other developers from other teams should not really care - database schema is just an implementation detail, teams other than the "owners" of the bounded context should use only the abstract public interface on the service level to access the data they need.
Always drag your "ancestors'" IDs ...
...all the way down through entities.
My preferred strategy for treating IDs is a mixture of two points from the list above - the ones about immutability & redundancy:
- one single (conceptual) entity should keep its identifier forever (please keep in mind that this entity may be a version of another entity!), so don't "re-use" identifiers or make identifiers out of business-meaningful quanta of information (only auto-generated, auto-incremented numbers or GUIDs)
- within aggregate, each next "level" of child entities should preferably have identifiers of all its ancestors (parent, parent of parent, etc.)
Yes, identifiers also take some space, but this is a price I'm very wiling to pay (without further thought) for overall simplification & pretty much guaranteed high query performance.
Tame the changes
Contrary to our intuition, DBs are always in motion (as the code is) - structures are the subject of changes (preferably in sync with code that uses them), accompanied by various migrations & data conversions, sometimes there are also needs for ad-hoc modifications (data fixes, rare operations that are not covered in GUIs) - letting any of these to be manually applied (e.g. as SQL scripts sent via mail), out of strict automation & control means that you're pretty much giving up on real, working Continuous Integration loop (due to lack of explicit ordering, repeatability & determinism).
My rule-set here is very simple (which doesn't always mean "easy"):
- all DDLs (schema changes) should be versioned with code (in VCS), idempotent, their ordering has to explicit (so nothing runs out of order), execution should be traceable & synchronised with the code deployment
- same rules (as above) apply to migrations, with some additional rules: their should also be testable (reconciliation) & re-runnable in smart way (only failing parts, explicit subset, etc.)
- it's even more tricky for "custom scripts" (the ones that are not "one-off") - these don't have to be sequential (ordered) of course, but they are supposedly long-living, so they should be as little brittle as possible, hence they should be built upon the level of services, instead of messing directly with DB via dedicated DMLs
In fact it may look like I'm convincing you to treat DB changes just like changes in code, but reality is a bit more complex:
- when you're deploying new code, you have the comfort of over-writing previous version - in theory there may be no continuity between previous & new versions
- when you're deploying a change related to DB (either schema or data), you're always applying a transition from current state to the new state - that means that:
- schema changes & migrations should not use the latest version of service code (as it may be ahead of DB state), but code that is in sync with DB state they are supposed to be executed upon (that's why having them in pure SQL is good enough)
- "custom scripts" should always use the latest version of service code (as there may be executed anytime) (that's why having them implemented as command line apps that refer to business APIs like any other client makes a lot of sense)