service management Archives – Blog Fiasco

Recently, Tom Limoncelli had a post at Everything Sysadmin describing how to move users to a corporate standard. Tom pretty much nailed it (particularly the part about the importance of management support), but I wanted to add my own experiences. Like Tom, I’ve been out of the fleet management business for a while. My perspective comes not from migrating from a wild west scenario to a standard, but from one standard to another.

As an aside, I was an undergraduate when my department left the wild west. The way computing worked on campus, this meant I was mostly unaware except for the weather lab and the one server that was my responsibility. I heard plenty of tales, though, from both the customer and provider side. I got to deal with the residual mistrust from all the things that went wrong. And I saw the graphs of ticket volume. Oh, was it a mess, and one I was glad to have missed.

For my part, the first summer after I became the department’s sysadmin, I decided it was time to upgrade the Linux machines. Our Linux servers were a mix of RHEL 3 and RHEL 4; some of the desktops ran one of those and others ran Fedora Core 1. Fedora Core 1 was well beyond end-of-life by that point, and the packages for RHEL 3 and RHEL 4 were increasingly out of date for the needs of the faculty and students who used them daily.

RHEL 5 had been released a few months prior, so it seemed like a good opportunity to get everything on the same OS. The first thing I did was to put a few spare machines in the computing lab as demo machines. Interested users could sit down and test the software packages they used and report any problems.

Meanwhile, I also surveyed each professor who had Linux machines or who taught in the lab about the software they used. Some packages we weren’t sure were ever used anymore, and it was a good opportunity to find cruft that could be cleaned up.

The next step was to “force” people to start using RHEL 5 machines by upgrading one machine in each lab (most of the faculty who had one Linux machine had several). Starting with the friendliest users, I hit every lab. We found a few problems here and there (a bug fix in tcsh caused one group quite a bit of trouble since they were inadvertently relying on the buggy behavior), but people could see them getting fixed.

The upgrade process got smoother the more times I did it, until we got to the point that I could send our student employees off to get it started. The friendly users helped find the troublesome issues first so that the holdouts had a smooth experience. By the end of the summer all 70 or so machines were on the same OS, which reduced the support effort. Users had newer packages and a better experience. Everyone was happy and had cake (the cake was probably for something else, though).

I recently overheard a conversation among three instructors about their university’s Blackboard learning management system. They were swapping stories of times when the system failed. One of them mentioned that one time during a particularly rocky period in the service’s history, he entered a large number of grades into the system only to find that they weren’t there the next day. As a result, he started keeping grades in a spreadsheet as a backup of sorts. The other two recalled times when the system would repeatedly fail mid-quiz for students. Even if the failures were due to their own errors, the point is that they lost trust in the system.

This got me thinking about “shadow systems.” Shadow systems are hardly new; people have been working around sanctioned IT systems since the first IT system was sanctioned. If a customer doesn’t like your system for whatever reason, they will find their own ways of doing things. This could be the person who brings their own printer in because the managed printer is too far away, or the department that runs their own database server because the central database service costs too much. Even the TA who keeps grades in a spreadsheet in case Blackboard fails is running a shadow system, and even these trivial systems can have a large aggregate cost.

Because my IT service management class recently discussed service metrics, I considered how trust in a system might be measured. My ultimate conclusion: all your metrics are crap. Anything that’s worth measuring can’t be measured. At best, we have proxies.

Think about it. Does a student really care if the learning management system has five nines of uptime if the remaining 0.001% of downtime falls while she’s taking a quiz? Does the instructor care that 999,999 transactions complete successfully when his grade entry is the one that doesn’t?
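To put a number on that, here is a minimal sketch of the arithmetic behind an availability target. The function name and the 365-day year are my own assumptions for illustration; they are not taken from any particular monitoring tool.

```python
# Illustrative arithmetic only: how much downtime an availability target allows.
# MINUTES_PER_YEAR assumes a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes per year a service can be down and still hit the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


five_nines = downtime_minutes_per_year(99.999)
print(f"Five nines allows about {five_nines:.2f} minutes of downtime per year")
```

That works out to a little over five minutes a year, which is comfortably less than the length of a quiz and still enough to ruin one.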

We talk about “operational credibility” using service metrics, but do they really tell us what we want to know? What ultimately matters in preventing shadow systems is whether the user trusts the service. How someone feels about a service is hard to quantify. Quantifying how a whole group feels about a service is even harder. Traditional service metrics are a proxy at their best. At their worst, they completely obscure what we really want to know: does the customer trust the system enough to use it?

There is a whole host of factors that can affect a service’s credibility. Broadly speaking, I place them into four categories:

Technical – Yes, the technical performance of a system does matter. It matters because it’s what you measure, because it’s what you can prove, and because it affects the other categories. The trick is to avoid thinking you’re done because you’ve taken care of technical credibility.
Psychological – Perception is reality and how people perceive things is driven by the inner workings of the human mind. To a large degree, service providers have little control over the psychology of their customers. Perhaps the most important area of control is the proper management of expectations. Incident and problem response, as well as general communication, are also critical factors.
Sociological – One disgruntled person is probably not going to build a very costly shadow system. A whole group of disgruntled people will rack up cost quickly. Some people don’t even know they hate something until the pitchfork brigade rolls along.
Political – You can’t avoid politics. I debated including this in psychological or sociological, but I think it belongs by itself. If someone can keep some of their clout within the organization by liking or disliking a service, you can bet they will. I suspect political factors almost always work against credibility, and are often driven by short-sightedness or fear.

If I had the time and resources, I’d be interested in studying how various factors relate to customer trust in a service. It would be interesting to know, especially for services that don’t have a direct financial impact, what sort of requirements can be relaxed and still meet the level of credibility the customer requires. If you’re a graduate student studying service management, I present this challenge to you: find a derived value that can be tightly correlated to the perceived credibility of a service. I believe it can be done.
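For anyone taking up that challenge, the first check would probably be a plain correlation between a candidate metric and surveyed trust scores. The sketch below is only a starting point under my own assumptions: the `pearson` helper is hand-rolled, and the two lists are made-up placeholder values used to exercise the function, not real measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values only (hypothetical): one candidate metric per service
# and the matching survey-based trust score for that service.
candidate_metric = [1.0, 2.0, 3.0, 4.0, 5.0]
trust_score = [2.0, 4.0, 6.0, 8.0, 10.0]

r = pearson(candidate_metric, trust_score)
print(f"correlation: {r:.3f}")
```

A value near 1 or -1 would suggest the derived metric tracks trust closely; a value near 0 would mean it is one more proxy that tells you little.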


The world's only(?) FOSS/weather/sports/marketing/high-performance computing blog


Upgrading users to new environments

Service credibility: the most important metric