Service Outage Avoidance - The mother of all metrics - Matthew Hooper VigilantGuy Digital Transformentalist - Create High Performing IT

In my role at Vigilant as both a consultant and an executive I have had the opportunity to interview hundreds of operational IT managers and directors. In most cases the number one metric they were managed by was “Availability” or “System up-time”.

What turns into a very interesting dialogue is when you talk to them about how they collect those metrics, report those metrics and respond to those metrics. Here are the shortfalls of taking this approach:

  • Up-time – this is very rarely measured from the End-users standpoint. So you are immediately putting IT on the defensive when you state the system was available on the network, but the end-user is not able to execute business on the system.
  • This reported metric only gives credibility to how quickly IT personnel was able to find and fix the outages. Outages are typically caused by poor release practices or change management, IT functions, anyway.

A new approach that should be considered is how I measured my operation as an IT director and what Vigilant consultants call “Service Outage Avoidance” (Not to be called SOA or real confusion sets in)
This metric is the marrying of component availability to end-user availability. You can accomplish this by monitoring a systems network & server components for availability along with the end-users behavior. When an outage occurs at the component level, yet the service stays up to the end user, due to you superior availability design of the system, you have achieved avoiding a service outage.
Availability metrics then should be broken into the 5 following categories:

  • Network      (Link status, utilization, drop/error rates)
  • Server         (OS stats, CPU, HD, Mem)
  • Application  (DB, J2ee, .Net, etc)
  • Business Logic    (Code interfaces, Connectors, ETL, etc..)
  • Business Process  (Transactions, order counts, etc…)
  • End-User      (real-time screen to screen, refresh, errors, etc..)

“Service Outage Avoidance” metric shows the percentage of downtime of a component where end-user was available.  (i.e.  4months of aggregate downtime of SAN on Email system during 12 months of end user availability)
Your next management report then will show something like this:
Email Services – Service Outage Avoidance: 25%
What this metric means is that we had an impact at a component level of 25%, but due to proper design and management we avoided having a business impact.
In other words, “You know how we weren’t sure if it was worth it to build in all that fail-over and redundancy. Well here is how valuable that decision to spend was.”

If you can equate the up-time value against this, you can calculate the ROI.
i.e. Up-time value of Email for 1mos= $1million dollars.
Cost of redundancy $1M
1 year ROI is 300%  
(4mos *$1M = $4M return – $1M investment = $3M. $3M(return)/$1M(Investment) = 300%)

In my next blog – Don’t underestimate the infrastructure

Written by

Digital Transformentalist Twitter: @VigilantGuy Linkedin: Web: Matt Hooper is an industry advocate for Service Management strategies and best practices around Enterprise Service Management. For over 20 years Matt has instituted methodologies for business intelligence and optimization. Leveraging technology to drive business outcomes, he has built an industry reputation for his highly effective approach to creating value through Service Management. Matt is active on Social Media known as VigilantGuy, and co-hosts the weekly podcast: Hacking Business Technology.

Leave a Reply

Matthew Hooper

Digital Transformentalist