Using systemd tmpfiles

As part of my regular duties at my day job, I provide some community support on the HTCondor users’ mailing list. At the beginning of the week, someone came to the list with Fedora RPM troubles. I was able to track it down to the fact that the RPM didn’t correctly create some temporary directories.

But why not? It’s been a while since I did any RPM building, but at first glance, the spec file looked fine. The systemd tmpfiles configuration looked good, too. Then someone else found the cause of the problem: according to the Fedora packaging guidelines, a package should write the config file and create the directory. It turns out that tmpfiles.d is only processed at boot time (or when manually invoked), so a package that needs a file or directory to exist must create it at install time (or require a reboot).
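To make the guideline concrete, here’s a minimal sketch of what a package would ship. The file name, directory, and user below are made up for illustration; the real packages involved would differ.

    # /usr/lib/tmpfiles.d/example.conf
    # Processed at boot (or when systemd-tmpfiles is run), not at package install
    d /run/example 0755 example example -

    # In the spec file, also create the directory at install time so the
    # service works before the next reboot:
    %post
    systemd-tmpfiles --create example.conf || :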

I was going to file an RFE, but I found a related one marked as CLOSED->WONTFIX. I understand the reasoning here, so I won’t reopen it to argue. It’s just a surprising behavior if it’s not something you regularly deal with.

So your upstream went poof?

Sometimes when you rely on a software project, it disappears. For example, when Apple bought FoundationDB last month, some projects found themselves on a suddenly shaky foundation. [ed. note – I have fired myself from this blog] Open source projects are not the only ones that can suddenly disappear; commercial software rarely gets further updates when the company that produces it goes under. Sometimes it’s an inconvenience, sometimes it’s a disaster.

Unless you intend to write everything in-house, at some point you’re going to be relying on third parties. It’s not a reason to panic, but it’s a risk to be mitigated. Sometimes, it’s worth choosing an older-but-widely-used technology instead of the new hotness. (There are additional benefits to that choice as well.) Evaluating the risk here is no different from evaluating any other risk: what’s the likelihood of this project going poof and what’s the impact if it does? Consider, too, what you’re willing to do to mitigate the risk. Will you hire (if you’re a company that has things like money) the lead developer to keep the project going? Will you task your own developers with maintaining a fork after the source is closed? Will you throw up your hands and walk away?

Screenshots in documentation

I’ve spent a lot of time at work lately writing documentation or getting others’ drafts into our preferred format. I’ve written and consumed technical documentation quite a bit over the years, so it’s no surprise that I have a few opinions on the matter. One of my opinions is that screenshots are often used as a crutch to support poorly-written docs.

I shared this thought and got a little bit of pushback, so I thought it was worth expanding on. I’m not opposed to screenshots. In fact, I’m quite in favor of the judicious use of screenshots (if for no other reason than that sometimes it’s nice to break up the wall of text). But showing the same window with and without text entered into a text box is excessive.

Pictures may be worth a thousand words, but they have a lot of downsides. Well-written documentation can be followed without a picture, which is very important for visually impaired readers who rely on text-to-speech tools. Text strings are much easier for translators to handle. That’s not important for all documentation, but it can be very important if you have (or want!) international customers or contributors. Images are also less awesome in version control systems (which you’re totally using, right?).

A friend asked how I felt about videos. If they’re used in concert with text-based documentation, I prefer them to static images. Videos can provide visual and audio cues, which help accessibility. Videos also more obviously show how to manage complex interactions. Of course, the question that arises from this is: if your interactions are so complex as to require video explanation, do you need to fix your UI/UX?

The target audience matters, too. Technical users require fewer visual aids than unsophisticated users. When you write your documentation, use as many screenshots as you need, but no more than that.

Goldilocks process

I was once asked in a job interview: “how do you know if you have too much or too little process?” I didn’t have a good answer at the time (which is probably part of the reason I didn’t get the job), but I’ve since come up with my preferred way of finding the Goldilocks point in the amount of process. Too much process causes pain for your developers. Too little process causes pain for your customers.

This doesn’t seem like a very novel idea, but I think it makes sense. The appropriate amount of process varies from situation to situation. Designing systems for manned spaceflight is much different than designing a Facebook game. Therefore, it makes sense to look at where the pain exists to see where you are on the spectrum.

With too much process, your developers are unhappy because they’re spending too much time doing meta work and not enough time writing code (even though meta work can be critically important, you should always do exactly the right amount). With too little process (particularly around testing and UX), your customers get a buggy, hard-to-use product that they’re constantly having to update.

One important note is that customers can experience pain from too much process, too, particularly if they have to wait three months for a bug fix. By the same token, developers can experience pain from too little process when every commit sets off a flurry of merge conflicts. Arguably, these scenarios are not from the wrong amount of process, but the wrong implementation of process. Go read The Phoenix Project for a much better explanation than I could give.

Product debt

I’ve spent a lot of time in the past two years researching and thinking about technical debt. For most of that time, I’ve thought about it as something that’s an implementation detail of the code. It’s probably fair to keep the term confined to that context, but that leaves a lot of places for debt to be introduced before any code is ever written.

A recent post on the Instigator Blog discussed the concept of “product debt” (also referred to as “design debt” or “user experience debt”). One of the key indicators of this sort of debt is inconsistency between, for example, different forms within a product. I’d argue that the ability to reuse documentation between products (e.g. installation instructions) is an inverse measure of debt.

The post is a great introduction to the concept, but as I re-read it, I realized that I didn’t like using the term “product debt” for what the author describes. Most of the discussion is around user experience design. While UX is obviously very important to successful products, it’s not the only consideration. Similarly, there are other kinds of design apart from UX design.

I have given the matter some thought, and I’ve come up with the start of a taxonomy for debt. “Why create a taxonomy?” you may ask. Firstly, because if we can use a common vocabulary, we can communicate more clearly with each other. Secondly, because having a way to categorize debt can allow project/product managers to manage debt payment. Finally, it could be a launching point for academic studies of debt in order to better guide the debt payments in the second point.

So here’s a first draft of a debt taxonomy. The categories below are all components of a total product debt.

  • Technical debt. This is the debt that occurs in your code as implementation details.
  • Architecture debt. This is the debt you incur from design decisions, including the API.
  • User experience debt. This is basically the debt described in the linked post.
  • Documentation debt. This debt accrues when your documentation is missing, incorrect, or unreadable.

Cores or machines?

Back in February, Pete Cheslock quipped “100,000 cores – cause it sounds more impressive than 2000 servers.” Patrick Cable pointed out that HPC people in particular talk in cores. I told them both that the “user perspective is all about cores. The number of machines it takes to provide them means squat.” Andrew Clay Shafer disagreed, with a link to some performance benchmarks.

He’s technically correct (the best kind of correct), but misses the point. Yes, there are performance impacts when the number of machines change (interestingly, fewer machines is better for parallel jobs, while more machines is better for serial jobs), but that’s not necessarily of concern to the user. Data movement and other constraints can wash out any performance differences the machine count introduces.

But really, the concern with core count is misplaced, too. What should really be of concern to the user is the time-to-results. It’s up to the IT service provider to translate that need into technical requirements (this is more true for operational computing than for research, and it requires the workload to have a fair degree of predictability). The user says “I need to do X amount of computation and get the results in Y amount of time.” Whether that’s done on one huge machine or ten thousand small machines doesn’t really matter. This plays well into cloud environments where you can use a mixture of instance types to get to the size you need.
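As a back-of-the-envelope sketch (the numbers here are invented purely for illustration), the translation from time-to-results to resources is roughly:

    cores needed ≈ total core-hours / deadline in hours
                 ≈ 10,000 core-hours / 2 hours
                 = 5,000 cores

Whether those 5,000 cores come from 50 big machines or 5,000 single-core instances is a detail for the provider to work out.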

Adding new vulnerabilities: a thought on the Germanwings crash

I saw some interesting commentary on last week’s Germanwings crash. According to investigators, when the pilot left the cockpit (presumably to use the lavatory), the first officer locked the cockpit door and began a slow and intentional descent into the side of a mountain. The captain attempted to regain entry to the cockpit, but the first officer never opened the door. Locking cockpit doors, which were a reaction to the September 11 hijackings, have three settings: unlocked, locked, and really locked. The “really locked” setting does not allow even authorized users in unless someone on the inside opens the door.

The person, whose name I failed to note, remarked that the locked door is what enabled this crash to happen. In the wake of 9/11, we protected the flight crews from passengers through allegedly better security screening and locked doors. At the same time, we’ve taken away the ability for passengers to protect themselves from the flight crew. (In the U.S., regulations require someone else, either another pilot or a cabin crew member, to sit in the cockpit while a pilot is out, presumably for this reason.)

150 people were killed because the captain was able to be locked out of the cockpit. If the “really locked” setting weren’t available, it’s hard to say what might have happened. The flight crew may have scuffled with the same end result, but I think the most likely scenario is that the first officer would have never attempted this in the first place.

So where am I going with this? IT seems to have a particular affinity for aviation. We’re both relatively new, both complex, and both generally smooth-running with the occasional spectacular failure (fortunately for IT, we’re far less likely to have our failures result in death, at least currently). Both industries have a love for checklists and trying not to fail in the same way twice.

The Germanwings tragedy can be a lesson for IT. By solving one vulnerability, an entirely new one was introduced. This is hardly a new lesson; we often talk about fixing one bug only to introduce others. The important thing is to consider this not at the code level, but at the design level. When thinking about software or systems, it’s important to consider how a change that seems like a great idea might have really undesirable effects.

systemd and SysV init scripts

Chris Siebenmann wrote earlier this week about how systemd’s support of System V init scripts results in unexpected and undesired behavior. Some init scripts include dependency information, which is an LSB standard that SysV init ignores. The end result is that scripts which have incomplete dependencies specified end up getting started too soon by systemd.
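For reference, the dependency information in question lives in an LSB comment block at the top of the init script, something like this (the service name here is a made-up placeholder):

    ### BEGIN INIT INFO
    # Provides:          exampled
    # Required-Start:    $local_fs $network
    # Required-Stop:     $local_fs $network
    # Default-Start:     2 3 4 5
    # Default-Stop:      0 1 6
    # Short-Description: example daemon
    ### END INIT INFO

Plain SysV init never reads this block, so a Required-Start line that omits a real dependency goes unnoticed. systemd, on the other hand, uses it to decide how early the script can be started.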

I commented that it’s unreasonable to hold systemd responsible for what is effectively a bug in the init script. If you’re going to provide information, provide complete information or don’t expect things to work right. Chris reasonably replied that many of the scripts that include such information are the result of what he called programming through “superstition and mythology.” Others may use the term “cargo cult programming.” Code reuse has both positive and negative aspects, and the slow spread of bad practices via copy/paste is clearly a negative in this case.

I understand that, and Chris makes a valid point. It’s neither realistic nor reasonable to expect everyone to study the specifications for everything they come across. Init scripts, due to their deceptive simplicity, are excellent candidates for “I’ll just copy what worked for me (or someone else), without checking to see if I’m doing something wrong.” But in my opinion, that doesn’t absolve the person who wrote the script from their responsibility if it breaks down the road.

To the user, of course, who is responsible is immaterial. I wholeheartedly agree that breaking things is bad, but avoiding the breakage needs to be the responsibility of the right people. It’s not reasonable for the systemd developers to test every init script out there in every possible combination in order to hit the condition Chris described.

As I see it, there were three options the systemd developers could have taken:

  1. No support for SysV init scripts
  2. Ignore dependency information in SysV init scripts
  3. Use the dependency information in SysV init scripts

Option 1 is clearly a non-starter. systemd adoption would probably never have occurred outside of a few niche cases (e.g. single-purpose appliances) without supporting init scripts. The more vehement systemd detractors would prefer this option, but it would be self-defeating for the developers to choose it.

Option 2 is what Chris would have preferred. He correctly notes that it would have kept the init scripts with incomplete dependencies from starting too soon.

Option 3 is what the developers chose to implement. While it does, in effect, change the behavior of some init scripts in some conditions, it also allows systemd to properly order SysV init scripts with services defined by .service files.

The developers clearly considered the effects of respecting dependency information in init scripts and decided that the ability to order services and build a better dependency graph was more important than keeping certain init scripts from starting too soon under certain conditions. Was that the right decision?

Chris and others think it is not. The developers think it is. I’m a little on the fence to the extent that I don’t know which I’d choose were the decision up to me. If we had a real sense of how many SysV init scripts end up hitting this condition, that would help inform the decision. However, the developers chose option 3, and it’s hard for me to argue against that. Yes, it’s a change in behavior, and perhaps it’s “robot behavior”, but I have a hard time getting too mad at a computer for doing what I told it to do.

Unlimited vacation policies, burnout, etc.

Recently, my company switched from a traditional vacation model to a minimum vacation model. If you’re unfamiliar with the term, it’s essentially the unlimited vacation model practiced by Netflix and others, with the additional requirement of taking a defined minimum of time off each year. It’s been a bit of an adjustment for me, since I’m used to the traditional model. Vacation was something to be carefully rationed (although at my previous employer, I tended to over-ration). Now it’s simply a matter of making sure my work is getting done and that there’s someone to cover for me when I’m out.

I’m writing this at 41,000 feet on my way to present at a conference [ed note: it is being published the day after it was written]. I’m secretly glad that the WiFi apparently does not work over the open ocean (I presume due to political/regulatory reasons). Now, don’t get me wrong, one of my favorite things to do when I fly is to watch myself on FlightAware, but in this case it’s a blessing to be disconnected. If a WiFi connection were available, it would be much harder to avoid checking my work email.

It took me a year and a half at my job before I convinced myself to turn off email sync after hours. Even though I rarely worked on emails that came in after hours, I felt like it was important that I know what was going on. After several weekends of work due to various projects, I’d had enough. The mental strain became too much. At first, I’d still manually check my mail a time or two, but now I don’t even do that much.

This is due in part to the fact that the main project that was keeping me busy has had most of the kinks worked out and is working pretty well. It also helps that there’s another vendor managing the operations, so I only get brought in when there’s an issue with software we support. Still, there are several customers where I’m the main point of contact, and the idea of being away for a week fills me with a sense of “oh god, what will I come back to on Monday?”

I’ve written before about burnout, but I thought it might be time to revisit the topic. When I wrote previously, I was outgrowing my first professional role. In the years since, burnout has taken a new form for me. Since I wrote the last post, two kids have come into my life. In addition, I’ve gone from a slow-paced academic environment to a small private sector company that claims several Fortune 100 companies as clients. Life is different now, and my perception of burnout has changed.

I don’t necessarily mind working long hours on interesting problems. There are still days when it’s hard to put my pencil down and go home (metaphorically, since I work from a spare bedroom in our house). But now that I have kids, I’ve come to realize that when I used to feel burnt out, I was really feeling bored. These days, burnout shows up more as an impact on my family life.

I know I need to take time off, even if it’s just to sit around the house with my family. It’s just hard to do knowing that I’m the first — and sometimes last — line of support. But I’m adjusting (slowly), and I’m part of a great team, so that helps. Maybe one of these days, I’ll be able to check my email at the beginning of the work day without bracing myself.

Lessons and perspective from fast food

I started working in fast food when I turned 16 and needed to fund my habit of aimlessly driving around the backroads. I ended up spending five and a half years in the industry (the last few of which were only when I was home from college), advancing from a meat patty thermodynamic technician to crew trainer to floor supervisor. In the end, I was effectively a shift manager. I supervised the store’s training and safety programs.

It should be no surprise that I learned a lot about myself and about leadership during that time. Even a decade on, stories from my fast food days come up in job interviews and similar conversations. As a result of my time in fast food, I’ve learned to be very patient and forgiving with those who work in the service sector. I’ve also learned some lessons that are more generally applicable.

Problems are often not the fault of the person you’re talking to. This is a lesson for service customers more than service providers, but most providers are customers too. In fast food, when your order is wrong it’s sometimes the fault of the counter staff putting the wrong sandwich in the bag, but sometimes it’s the fault of the grill staff for making it wrong. In either case, it’s rarely the result of malice or even incompetence. People, especially overworked and under-engaged people, will make mistakes. This holds true for customer service staff at most companies, particularly large ones where the tier 1 staff are temps or outsourced. (That doesn’t imply that actual malice or incompetence should be acceptable.)

Sometimes the best way to deal with a poor performer is to give them responsibility. One story I’m fond of telling involves a crew member who was, in many ways, the stereotypical teenage fast food worker. He was smart, but lazy, and didn’t much care for authority. The fact that I was only a year older than him made it particularly hard for me to give him orders. He’d been working for a few months and was good at the job when he applied himself, so the trick was to get him to do that. After a little bit of fumbling around, I found it. I started spending more time away from the grill area and more time up front, and I made it clear that he was in charge of the grill. I gave him some autonomy, but I also held him accountable. Lo and behold, his behavior improved. He also started taking opportunities to teach new employees the tricks of the trade. He could have let the authority go to his head, but instead he acted like an adult because he was being treated like an adult.

Don’t nitpick behavior, but don’t put up with crap. The standing philosophy on my shifts was “don’t make life hard for me and I won’t make life hard for you.” I didn’t like bossing people around (in part because they were either old enough to be my grandmother or because they were 16 and all the bossing in the world wasn’t going to have any effect). If people were doing their jobs well, I put up with some good-natured shenanigans (the exceptions being food safety, physical safety, and customer satisfaction). One time a fellow supervisor had written up a crew member for a really trivial infraction. I went to her file later and tore the write-up up. This person was a good worker, and harping on her for minor issues was only going to drive her away. By the same token, I came in from taking the garbage out one night to find two normally good workers having a fight with the sink sprayer. They were both sent home right away (granted, one of them was off in 10 minutes anyway).

Spread the unpleasant work around, and be willing to do some yourself. Some of the managers and supervisors liked to make the same people do the unpleasant tasks all the time. I hated that approach because it just seemed to reinforce bad behavior. The people who made my life unpleasant certainly came up in the rotation more often, but everyone had to clean restrooms, empty grease traps, etc. Including me. I didn’t try to make a show of it, but I wanted the people who I worked with to see that I wasn’t using my position to avoid doing things I didn’t want to do. And if I’m being completely honest, there was something I enjoyed about emptying the grease traps (except when I’d lose my grip and pour them into my shoe).

Don’t be a jerk. I won’t delude myself into thinking I was universally loved. I know I wasn’t, and I’m okay with that. But for the most part, I think I was pretty well-liked among the crew and management because I wasn’t a jerk. I took my job seriously (perhaps too seriously at times), but I was flexible enough to try to work with people the way they needed to be worked with. I tried to make work fun and cooperative, because working in fast food sucks and anything that can make it less sucky benefits workers and customers alike.