LISA wants you: submit your proposal today

Posted on May 10, 2018 by Ben Cotton

I have the great honor of being on the organizing committee for the LISA conference this year. If you’ve followed me for a while, you know how much I enjoy LISA. It’s a great conference for anyone with a professional interest in sysadmin/DevOps/SRE. This year’s LISA is being held in Nashville, Tennessee, and the committee wants your submission.

As in years past, LISA content is focused on three tracks: architecture, culture, and engineering. There’s great technical content (one year I learned about Linux filesystem tuning from the guy who maintains the ext filesystems), but there’s also great non-technical content. The latter is a feature more conferences need to adopt.

I’d love to see you submit a talk or tutorial about how you solve the everyday (and not-so-everyday) problems in your job. Do you use containers? Databases? Microservices? Cloud? Whatever you do, there’s a space for your proposal.

Submit your talk to https://www.usenix.org/conference/lisa18/call-for-participation by 11:59 PM Pacific on Thursday, May 24. Or talk one of your coworkers into it. Better yet, do both! LISA can only remain a great conference with your participation.

Come see me at these conferences in the next few months

Posted on September 19, 2016 by Ben Cotton

I thought I should share some upcoming conference where I will be speaking or in attendance.

9/16 — Indy DevOps Meetup (Indianapolis, IN) — It’s an informal meetup, but I’m speaking about how Cycle Computing does DevOps in cloud HPC
10/1 — HackLafayette Thunder Talks (Lafayette, IN) — I organize this event, so I’ll be there. There are some great talks lined up.
10/26-27 — All Things Open (Raleigh, NC) — I’m presenting the results of my M.S. thesis. This is a really great conference for open source, so if you can make it, you really should.
11/14-18 — Supercomputing (Salt Lake City, UT) — I’ll be working the Cycle Computing booth most of the week.
12/4-9 — LISA (Boston, MA) — The 30th version of the premier sysadmin conference looks to be a good one. I’m co-chairing the Invited Talks track, and we have a pretty awesome schedule put together if I do say so myself.

DevOps is dead!

Posted on July 25, 2016 by Ben Cotton

“$thing is dead!” is one of the more annoying memes in the world of technology. A Tech Crunch article back in April claimed that managed services (of cloud infrastructure) is the death knell of DevOps. I dislike the article for a variety of reasons, but I want to focus on why the core argument is bunk. Simply put: “the cloud” is not synonymous with “magical pixie dust.” Real hardware and software still exist in order to run these services.

Amazon Web Services (AWS) is the current undisputed leader in the infrastructure-as-a-service (IaaS) space. AWS probably underlies many of the services you use on a daily basis: Slack and Netflix are two prime examples. AWS offers dozens of services for computation, storage, and networking that roll out updates to datacenters across the globe many times a day. DevOps practices are what make that possible.

Oh, but the cloud means you don’t need your internal DevOps team! No. Shut up. “Why not simply teach all developers how to utilize the infrastructure tools in the cloud?” Because being able to spin up servers and being able to effectively manage and scale them are two entirely different concepts. It is true that cloud services can (not “must”!) take the “Ops” out of “DevOps” for development environments. But just as having access to WebMD doesn’t mean I’m going to perform my own surgery, being able to spin up resources doesn’t obviate the need for experienced management.

The author spoke of “managed services provider” as an apparent synonym for “IaaS provider”. He ignored what I think of as “managed services” which is a contracted team to manage a service for you. That’s what I believe to be the more realistic threat to internal DevOps teams. But it’s no different than any other outsourcing effort, and outsourcing is hardly a new concept.

At the end of the article, the author finally gets around to admitting that DevOps is a cultural paradigm, not a position or a particular brand of fairy dust. Cloud services don’t threaten DevOps, they make it easier than ever to practice. Anyone trying to convince you that DevOps is dead is just trying to get you to read their crappy article (yes, I have a well-tuned sense of irony, why do you ask?).

Another great SysAdvent

Posted on December 26, 2014 by Ben Cotton

Once again, a group of volunteer writers and editors came together to put together 25 posts related to systems administration for the SysAdvent blog. Although I have contributed several articles over the years, I much prefer editing. All of this year’s posts are great, but I’m very proud of the posts that I had a hand in editing. As usual, the writers did most of the work, my suggestions were always minor.

Day 6 – Debugging for Systems Engineers by Paul Nasrat
Day 8 – From Operator to Guide: Lessons Learned Moving from Engineering into Management by Mathias Meyer
Day 12 – Ops and Development Teams: Finding a Harmony by Nell Shamrell
Day 19 – Infosec Basics: Reason Behind the Madness by Jan Schaumann
Day 24 – 12 Days of SecDevOps by Jen Andre

Service credibility: the most important metric

Posted on October 2, 2013 by Ben Cotton

I recently overheard a conversation among three instructors about their university’s Blackboard learning management system. They were swapping stories of times when the system failed. One of them mentioned that one time during a particularly rocky period in the service’s history, he entered a large number of grades into the system only to find that they weren’t there the next day. As a result, he started keeping grades in a spreadsheet as a backup of sorts. The other two recalled times when the system would repeatedly fail mid-quiz for students. Even if the failures were due to their own errors, the point is that they lost trust in the system.

This got me thinking about “shadow systems.” Shadow systems are hardly new, people have been working around sanctioned IT systems since the first IT system was sanctioned. If a customer doesn’t like your system for whatever reason, they will find their own ways of doing things. This could be the person who brings their own printer in because the managed printer is too far away or the department that runs their own database server because the central database service costs too much. Even the TA who keeps grades in a spreadsheet in case Blackboard fails is running a shadow system, and even these trivial systems can have a large aggregate cost.

Because my IT service management class recently discussed service metrics, I considered how trust in a system might be measured. My ultimate conclusion: all your metrics are crap. Anything that’s worth measuring can’t be measured. At best, we have proxies.

Think about it. Does a student really care if the learning management system has five nines of uptime if that .001 is while she’s taking a quiz? Does the instructor care that 999,999 transactions complete successfully when his grade entry is the one that doesn’t?

We talk about “operational credibility” using service metrics, but do they really tell us what we want to know? What ultimately matters in preventing shadow systems is if the user trusts the service. How someone feels about a service is hard to quantify. Quantifying how a whole group feels about a service is even harder. Traditional service metrics are a proxy at their best. At their worst, they completely obscure what we really want to know: does the customer trust the system enough to use it?

There are a a whole host of factors that can affect a service’s credibility. Broadly speaking, I place them into four categories:

Technical – Yes, the technical performance of a system does matter. It matters because it’s what you measure, because it’s what you can prove, and because it affects the other categories. The trick is to avoid thinking you’re done because you’ve taken care of technical credibility.
Psychological – Perception is reality and how people perceive things is driven by the inner workings of the human mind. To a large degree, service providers have little control over the psychology of their customers. Perhaps the most important are of control is the proper management of expectations. Incident and problem response, as well as general communication, are also critical factors.
Sociological – One disgruntled person is probably not going to build a very costly shadow system. A whole group of disgruntled people will rack up cost quickly. Some people don’t even know they hate something until the pitchfork brigade rolls along.
Political – You can’t avoid politics. I debated including this in psychological or sociological, but I think it belongs by itself. If someone can keep some of their clout within the organization by liking or disliking a service, you can bet they will. I suspect political factors almost always work against credibility, and are often driven by short-sightedness or fear.

If I had the time and resources, I’d be interested in studying how various factors relate to customer trust in a service. It would be interesting to know, especially for services that don’t have a direct financial impact, what sort of requirements can be relaxed and still meet the level of credibility the customer requires. If you’re a graduate student studying service management, I present this challenge to you: find a derived value that can be tightly correlated to the perceived credibility of a service. I believe it can be done.

I’m famous, sorta

Posted on October 1, 2013 by Ben Cotton

One of my co-workers happens to be a co-host of “Food Fight“, a DevOps podcast. Last week, he asked for someone to join in for a crossover episode with “RCE“. When nobody else volunteered, he roped me into it. It turned out to be pretty awesome, I would have loved to extend the conversation a few more hours. With any luck, I’ll re-appear on one of those shows sometime. As you may already be aware, one of my goals is for Leo Laporte to personally invite me to the TWiT Brickhouse to get drunk with him on an episode of “This Week in Tech.” I feel like I’ve moved a little closer today.

Anyway, here are the links:

Book review: The Visible Ops Handbook

Posted on January 28, 2012 by Ben Cotton

I first heard of The Visible Ops Handbook during Ben Rockwood’s LISA ’11 keynote. Since Ben seemed so excited about it, I added it to the list of books I should (but probably would never) read. Then Matt Simmons mentioned it in a brief blog post and I decided that if I was ever going to get around to reading it, I needed to stop putting it off. I bought it that afternoon, and a month later I’ve finally had a chance to read it and write a review. Given the short length and high quality of this book, it’s hard to justify such a delay.

Information Technology Infrastructure Library (ITIL) training has been a major push in my organization the past few years. ITIL is a formalized framework for IT service management, but seems to be unfavored in the sysadmin community. After sitting through the foundational training, my opinion was of the “it sounds good, but…” variety. The problem with ITIL training and the official documentation is that you’re told what to do without ever being told how to do it. Kevin Behr, Gene Kim, and George Spafford solve that problem in less than 100 pages.

Based on observations and research of high-performing IT teams, The Visible Ops Handbook assumes that no ITIL practices are being followed. Implementation of the ITIL basics is broken down into four phases. Each phase includes real-world accounts, the benefits, and likely resistance points. This arms the reader with the tools necessary to sell the idea to management and sysadmins alike.

The introduction addresses a very important truism: “Something must need improvement, otherwise why read this?” The authors present a general recap of their findings, including these compelling statistics: 80% of outages are self-inflicted and 80% of mean time to repair (MTTR) is often wasted on non-productive activities (e.g. trying to figure out what changed).

Phase 1 focuses on “stabilizing the patient.” The goal is to reduce unplanned work from 80% of outage time to 25% or less. To do this, triage the most critical systems that generate the most unplanned work. Control when and how changes are made and fence off the systems to prevent unauthorized changes. While exceptions might be tempting, they should be avoided. The authors state that “all high performing IT organizations have only one acceptable number of unauthorized changes: zero.”

After reading Phase 1, I already had an idea to suggest. My group handles change management fairly well, but we don’t track requests for change (RFCs) well. Realizing how important that is, I convinced our groups manager and our best developer that it was a key feature to add to our configuration management database (CMDB) system.

In Phase 2, the reader performs a catch & release program and find “fragile artifacts.” Fragile infrastructure are those systems or services with a low change success rate and high MTTR. After all systems have been “bagged and tagged”, it’s time to make a CMDB and a service catalog. This phase is the next place that my group needs to do work. We have a pretty nice CMDB that’s integrated with our monitoring systems and our job schedulers, but we lack a service catalog. Users can look at the website and see what we offer, but that’s only a subset of the services we run.

Phase 3 focuses on creating a repeatable build library. The best IT organizations make infrastructure easier to build than repair. A definitive software library, containing master images for all software necessary to rebuild systems, is critical. For larger groups, forming a separate release management team to engineer repeatable builds for the different services is helpful. The release management team should be separate from the operational group and consist of generally senior staff.

The final phase discusses continual improvement. If everyone stopped at “best practices”, no one would have a competitive advantage. Suggested metrics for each key process area are listed and explained. After all, you can’t manage what you can’t measure. Finding out what areas are the worst makes it easier to decide what to improve upon.

The last third of the book consists of appendices that serve as useful references for the four phases. One of the appendices includes a suggested table layout for a CMDB system. The whole book is focused on the practical nature of ITIL implementation and guiding organizational learning. At times, it assumes a large staff (especially when discussing separation of duties), so some of the ideas will have to be adapted to meet the needs of smaller groups. Nonetheless, this book is an invaluable resource to anyone involve in IT operations.

Blog Fiasco

The world's only(?) FOSS/weather/sports/marketing/high-performance computing blog

Tag Archives: DevOps