Untouchables

Nooo, it’s not about natural interfaces :)

Our client recently reached a very interesting stage in the maturity life-cycle of its IT landscape: the landscape has become unmanageable… Sounds terrifying? Yes, it may look harsh, but fortunately in this particular case it’s still quite easy to fix. Let’s start with some background:

In a complex IT landscape with several interconnected applications, it’s absolutely critical to know:

the availability window for every application (when the application HAS to be available)
how you can reduce the severity of a particular application’s unavailability when it has to go down outside its service window
who should be informed about the partial / complete unavailability of a particular service
what the partial / complete unavailability of a particular service means for the organization’s business

Our client had no clue. If there was an urgent need to shut down a particular database server, they didn’t know when to do it, what it would really mean for the organization, or who was responsible for approving the whole operation. What’s even worse, the client’s people were struggling to find out what they could do to improve the situation. Our solution was quite simple:

1.) We divided all the applications into components. The distinguishing criterion was separate availability: if a solution has two elements and it’s possible that at a given moment one of them is down (doesn’t work at all) while the other is up (works normally), they are separate components. Obviously, each component can have a different business criticality.

2.) For each identified component, we asked:

who on the development team (the people who created and maintain this component) is responsible for it
who is the business owner of this component (this person should be readily available and able to make decisions about the component without any additional approval)
what is the “working calendar” (weekly, including weekends) for the component, i.e. when should it be up and available, for instance:
- what are the component’s working hours on weekdays, on Saturday, on Sunday OR
- if it has to be available whenever some other component(s) is available, which component(s) that is
what does the component require to work properly:
- availability of another component? (and what happens if it’s not available)
- availability of a database? (and what happens if it’s not available)
- availability of MOM (message-oriented middleware), such as a WebSphere MQ server? (and what happens if it’s not available)
- etc.
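
To make the questionnaire above a bit more tangible, here is a minimal sketch of how one registry entry could look in Python. Everything in it is an assumption made for illustration: the field names, the `billing` / `billing-db` components, the people, and the hour-based calendar are hypothetical and are not what our client actually used.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dependency:
    target: str          # name of a component, database or MQ server
    kind: str            # e.g. "component", "database", "mom"
    impact_if_down: str  # what happens when the target is unavailable


@dataclass
class Component:
    name: str
    dev_owner: str       # responsible person on the development team
    business_owner: str  # person who can decide without extra approval
    # weekday (0 = Monday ... 6 = Sunday) -> (first hour, end hour exclusive)
    working_hours: dict = field(default_factory=dict)
    # alternative calendar: "has to be up whenever these components are up"
    available_with: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)


def must_be_up(registry, name, weekday, hour, _seen=None):
    """Return True if the component has to be available at the given hour."""
    seen = _seen or set()
    if name in seen:  # guard against circular "available_with" references
        return False
    seen = seen | {name}
    comp = registry[name]
    window = comp.working_hours.get(weekday)
    if window and window[0] <= hour < window[1]:
        return True
    return any(
        must_be_up(registry, other, weekday, hour, seen)
        for other in comp.available_with
    )


# Hypothetical sample data, for illustration only.
registry = {
    "billing": Component(
        name="billing",
        dev_owner="alice",
        business_owner="bob",
        working_hours={day: (8, 18) for day in range(5)},  # Mon-Fri, 8-18
        depends_on=[Dependency("billing-db", "database", "no invoices")],
    ),
    "billing-db": Component(
        name="billing-db",
        dev_owner="carol",
        business_owner="bob",
        available_with=["billing"],  # no calendar of its own
    ),
}
```

With this sample data, `must_be_up(registry, "billing-db", 0, 9)` is true on a Monday morning because `billing` has to be up then, while the same call for Saturday (`weekday=5`) is false.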

3.) If you keep all this data in a structured registry, you’ll be able to answer the following questions:

what happens to the business if component ABC / database DEF / MQ server GHI is not available for X hours on Y-day
who should be informed that particular components are not available and what’s the total impact (including the indirect one, due to cross-component dependencies)
where in the calendar should you plan your next service windows
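
The “total impact, including the indirect one” part is essentially a walk over the dependency graph. Below is a rough sketch of that idea, assuming the registry can be exported as a simple “component -> things it needs” dictionary. The component names and owners are made up for the example.

```python
from collections import deque

# Hypothetical registry extract: component -> what it needs in order to work.
DEPENDS_ON = {
    "web-shop": ["order-service"],
    "order-service": ["db-DEF", "mq-GHI"],
    "reporting": ["db-DEF"],
    "intranet": [],
}

# Hypothetical business owners per component.
BUSINESS_OWNER = {
    "web-shop": "anna",
    "order-service": "anna",
    "reporting": "mark",
    "intranet": "eve",
}


def impacted_by(resource, depends_on):
    """Return every component that directly or indirectly needs `resource`."""
    # Invert the edges: resource -> components that depend on it.
    dependents = {}
    for comp, needs in depends_on.items():
        for need in needs:
            dependents.setdefault(need, set()).add(comp)

    # Breadth-first walk over the inverted graph.
    affected, queue = set(), deque([resource])
    while queue:
        current = queue.popleft()
        for comp in dependents.get(current, ()):
            if comp not in affected:
                affected.add(comp)
                queue.append(comp)
    return affected


def who_to_inform(resource, depends_on, owners):
    """Return the business owners of all components hit by an outage."""
    return {owners[c] for c in impacted_by(resource, depends_on) if c in owners}
```

For the sample data, taking down `mq-GHI` hits `order-service` directly and `web-shop` indirectly, so only `anna` needs a call; taking down `db-DEF` additionally hits `reporting`, so `mark` has to be informed as well.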

In theory, it’s quite easy. But such a map / registry keeps its value only as long as you keep it up to date. If that means tracking all changes in 50+ independently developed applications, it may be harder than expected. In such situations, there are only 2 possible options (excluding failure…):

Make sure that no change can go to production without a careful (manual) examination that leads to an update of your map. You can do this by adjusting the change management process, but it’s time-consuming and complex.
Keep all the communication configuration in a form that allows you to create dumps that automatically refresh your map. This means investing effort in configuration management automation, but in the end it’s very beneficial.
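
To illustrate what the second option could look like in practice, here is a small sketch that compares the map with a configuration dump. The dump format (one `source -> target` line per connection, `#` for comments) is purely an assumption for this example; a real dump would come from whatever your configuration management tooling can export.

```python
def parse_dump(text):
    """Parse a hypothetical dump with one 'source -> target' edge per line."""
    edges = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        source, _, target = line.partition("->")
        if not source.strip() or not target.strip():
            raise ValueError(f"cannot parse dump line: {raw!r}")
        edges.add((source.strip(), target.strip()))
    return edges


def diff_map(recorded, observed):
    """Compare the edges recorded in the map with the edges seen in the dump."""
    return {
        "missing_from_map": sorted(observed - recorded),
        "stale_in_map": sorted(recorded - observed),
    }


# Hypothetical example: what the map says vs. what the dump shows.
recorded = {("web-shop", "order-service"), ("reporting", "db-DEF")}
dump = """
web-shop -> order-service
order-service -> mq-GHI   # connection added last week
"""
report = diff_map(recorded, parse_dump(dump))
```

Here `report` flags `order-service -> mq-GHI` as missing from the map and `reporting -> db-DEF` as stale, which is exactly the kind of drift that otherwise has to be caught by manual review.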

Unfortunately, we were not able to convince our client to follow path #2. We’ll keep observing how they struggle with #1; some interesting observations may follow…