Book review: The Visible Ops Handbook

I first heard of The Visible Ops Handbook during Ben Rockwood’s LISA ’11 keynote. Since Ben seemed so excited about it, I added it to the list of books I should (but probably would never) read. Then Matt Simmons mentioned it in a brief blog post and I decided that if I was ever going to get around to reading it, I needed to stop putting it off. I bought it that afternoon, and a month later I’ve finally had a chance to read it and write a review. Given the short length and high quality of this book, it’s hard to justify such a delay.

Information Technology Infrastructure Library (ITIL) training has been a major push in my organization the past few years. ITIL is a formalized framework for IT service management, but seems to be unfavored in the sysadmin community. After sitting through the foundational training, my opinion was of the “it sounds good, but…” variety. The problem with ITIL training and the official documentation is that you’re told what to do without ever being told how to do it. Kevin Behr, Gene Kim, and George Spafford solve that problem in less than 100 pages.

Based on observations and research of high-performing IT teams, The Visible Ops Handbook assumes that no ITIL practices are being followed. Implementation of the ITIL basics is broken down into four phases. Each phase includes real-world accounts, the benefits, and likely resistance points. This arms the reader with the tools necessary to sell the idea to management and sysadmins alike.

The introduction addresses a very important truism: “Something must need improvement, otherwise why read this?” The authors present a general recap of their findings, including these compelling statistics: 80% of outages are self-inflicted and 80% of mean time to repair (MTTR) is often wasted on non-productive activities (e.g. trying to figure out what changed).

Phase 1 focuses on “stabilizing the patient.” The goal is to reduce unplanned work from 80% of outage time to 25% or less. To do this, triage the most critical systems that generate the most unplanned work. Control when and how changes are made and fence off the systems to prevent unauthorized changes. While exceptions might be tempting, they should be avoided. The authors state that “all high performing IT organizations have only one acceptable number of unauthorized changes: zero.”

After reading Phase 1, I already had an idea to suggest. My group handles change management fairly well, but we don’t track requests for change (RFCs) well. Realizing how important that is, I convinced our groups manager and our best developer that it was a key feature to add to our configuration management database (CMDB) system.

In Phase 2, the reader performs a catch & release program and find “fragile artifacts.” Fragile infrastructure are those systems or services with a low change success rate and high MTTR. After all systems have been “bagged and tagged”, it’s time to make a CMDB and a service catalog. This phase is the next place that my group needs to do work. We have a pretty nice CMDB that’s integrated with our monitoring systems and our job schedulers, but we lack a service catalog. Users can look at the website and see what we offer, but that’s only a subset of the services we run.

Phase 3 focuses on creating a repeatable build library. The best IT organizations make infrastructure easier to build than repair. A definitive software library, containing master images for all software necessary to rebuild systems, is critical. For larger groups, forming a separate release management team to engineer repeatable builds for the different services is helpful. The release management team should be separate from the operational group and consist of generally senior staff.

The final phase discusses continual improvement. If everyone stopped at “best practices”, no one would have a competitive advantage. Suggested metrics for each key process area are listed and explained. After all, you can’t manage what you can’t measure. Finding out what areas are the worst makes it easier to decide what to improve upon.

The last third of the book consists of appendices that serve as useful references for the four phases. One of the appendices includes a suggested table layout for a CMDB system. The whole book is focused on the practical nature of ITIL implementation and guiding organizational learning. At times, it assumes a large staff (especially when discussing separation of duties), so some of the ideas will have to be adapted to meet the needs of smaller groups. Nonetheless, this book is an invaluable resource to anyone involve in IT operations.