systemd and SysV init scripts

Chris Siebenmann wrote earlier this week about how systemd’s support for System V init scripts results in unexpected and undesired behavior. Some init scripts include dependency information, following an LSB standard that SysV init itself ignores. The end result is that scripts with incomplete dependency information end up being started too soon by systemd.
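
For reference, that dependency information lives in a comment block at the top of the script. Here’s a minimal sketch of an LSB header (the service name and dependencies are hypothetical):

```
### BEGIN INIT INFO
# Provides:          mydaemon
# Required-Start:    $network $remote_fs
# Required-Stop:     $network $remote_fs
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start the (hypothetical) mydaemon service
### END INIT INFO
```

If mydaemon actually needs some other service that isn’t listed in Required-Start, sequential SysV ordering will often paper over the omission, but systemd takes the header at its word and may start the script earlier than the author expected.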

I commented that it’s unreasonable to hold systemd responsible for what is effectively a bug in the init script. If you’re going to provide information, provide complete information or don’t expect things to work right. Chris reasonably replied that many of the scripts that include such information are the result of what he called programming through “superstition and mythology.” Others may use the term “cargo cult programming.” Code reuse has both positive and negative aspects, and the slow spread of bad practices via copy/paste is clearly a negative in this case.

I understand that, and Chris makes a valid point. It’s neither realistic nor reasonable to expect everyone to study the specifications for everything they come across. Init scripts, due to their deceptive simplicity, are excellent candidates for “I’ll just copy what worked for me (or someone else), without checking to see if I’m doing something wrong.” But in my opinion, that doesn’t absolve the person who wrote the script from their responsibility if it breaks down the road.

To the user, of course, who is responsible is immaterial. I wholeheartedly agree that breaking things is bad, but avoiding the breakage needs to be the responsibility of the right people. It’s not reasonable for the systemd developers to test every init script out there in every possible combination in order to hit the condition Chris described.

As I see it, there were three options the systemd developers could have taken:

  1. No support for SysV init scripts
  2. Ignore dependency information in SysV init scripts
  3. Use the dependency information in SysV init scripts

Option 1 is clearly a non-starter. systemd adoption would probably never have occurred outside of a few niche cases (e.g. single-purpose appliances) without supporting init scripts. The more vehement systemd detractors would prefer this option, but it would be self-defeating for the developers to choose it.

Option 2 is what Chris would have preferred. He correctly notes that it would have kept the init scripts with incomplete dependencies from starting too soon.

Option 3 is what the developers chose to implement. While it does, in effect, change the behavior of some init scripts under some conditions, it also allows systemd to properly order SysV init scripts relative to services defined by .service files.
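
This is the payoff of option 3: systemd exposes /etc/init.d/foo as a unit named foo.service, so native units can order themselves against legacy scripts. A hypothetical sketch (foo and bar are made-up names):

```
# /etc/systemd/system/bar.service
# A native unit ordered after the legacy /etc/init.d/foo script,
# which systemd exposes as foo.service
[Unit]
Description=Example service that must start after the foo init script
After=foo.service

[Service]
ExecStart=/usr/bin/bar

[Install]
WantedBy=multi-user.target
```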

The developers clearly considered the effects of respecting dependency information in init scripts and decided that the ability to order services and build a better dependency graph was more important than keeping certain init scripts from starting too soon under certain conditions. Was that the right decision?

Chris and others think it is not. The developers think it is. I’m a little on the fence, to the extent that I don’t know which I’d choose were the decision up to me. If we had a real sense of how many SysV init scripts end up hitting this condition, that would help inform the decision. However, the developers chose option 3, and it’s hard for me to argue against that. Yes, it’s a change in behavior, and perhaps it’s “robot behavior”, but I have a hard time getting too mad at a computer for doing what I told it to do.

Being a good troubleshooter

Not too long ago, my friend Andy said “I think that’s how I convince so many people I’m decent at being a sysadmin, I just look at logs.” It was a statement that really struck a chord with me. On the one hand, I feel like I’ve had a pretty successful career so far, and if I haven’t been excellent, I’ve at least been sufficiently competent. On the other hand, I don’t feel like I have a lot of the experience that “everyone” has. I’ve never run email servers, I’ve only done trivial DNS and DHCP setups, and so on.

In my current job, I work with some really smart people, few of whom have much if any sysadmin experience. Andy and I, because of our sysadmin background, have become the go-to guys for sysadmin questions. It’s a role I enjoy, particularly when I’m able to solve a problem. Sometimes it’s because I have direct experience with the issue, but more often than not, it’s because I know how to poke around until I find the answer.

There are many skills required for successful systems administration. Troubleshooting is high on the list. The ability to troubleshoot new and unusual problems is particularly helpful. I’ve found my own troubleshooting abilities to depend on the ability to build off of previous (if unrelated) experience and being able to piece together disjointed clues. It’s sort of like being a detective and also a human “big data” system.

Fishing for clues in log files is great, too, if you can find the needle in the haystack. But what seems to separate the good troubleshooters I’ve known from the bad ones is intellectual curiosity. Not just asking what happened, but finding out why. Being willing to ask questions and learn about unfamiliar areas, instead of immediately deferring to those more knowledgeable, builds a strong skill base.
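
Tooling can shrink the haystack considerably, too. On a systemd machine, for instance, the journal can be sliced by boot, priority, and service (the unit name here is just an example):

```
# Only error-and-worse messages from the current boot, for one service
journalctl -b -p err -u httpd.service
```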

Of course, without Google, we’d all be out of luck.

Oh no, I can’t scan!

Printing is evil. Less evil but related is scanning. I don’t have to scan documents very often, but I needed to a few weeks ago. I fired up XSane and it immediately quit because it couldn’t find any scanners. I have an all-in-one, and the printing worked, so I felt like it should probably be able to find the scanner. Running `lsusb` showed the device.

“Maybe it’s a problem with XSane”, I said to myself. I tried the `scanimage` command. Still no luck. Then I ran `scanimage` as root. Suddenly, I could scan. My device was at /dev/bus/usb/001/008, so I took a look at the permissions. The device was owned by root:lp, and the permissions were 664. Since I wasn’t in the lp group, I couldn’t use the scanner, so I added myself to the group.
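
For anyone hitting the same wall, the whole investigation fits in a few commands. The bus and device numbers below are from my machine, and the `ls` output is illustrative; check `lsusb` for yours:

```
# Find the scanner's bus and device numbers
lsusb

# Check ownership and permissions on the device node
ls -l /dev/bus/usb/001/008
# crw-rw-r--. 1 root lp 189, 7 Jan  5 19:10 /dev/bus/usb/001/008

# Add yourself to the lp group; takes effect at next login
sudo usermod -aG lp $USER

# After logging back in, the scanner should be visible again
scanimage -L
```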

I’m assuming something changed when I upgraded to Fedora 21, since it had worked previously. Perhaps a udev rule is different? In any case, if you happen upon the same problem, enjoy this shortcut answer so that you don’t spend an hour trying to debug it.
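
If it is a udev rule, the more durable fix would be a local rule that sets the group or mode on the device node. A sketch, assuming the problem is as I described; the vendor and product IDs are placeholders you’d pull from `lsusb`, and the group should be one your account actually belongs to:

```
# /etc/udev/rules.d/99-local-scanner.rules (hypothetical IDs)
SUBSYSTEM=="usb", ATTRS{idVendor}=="04a9", ATTRS{idProduct}=="220d", MODE="0664", GROUP="scanner"
```

Then reload the rules and re-plug the device:

```
sudo udevadm control --reload-rules
sudo udevadm trigger
```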

What’s in a (version) number?

Last week, Linus Torvalds posted a poll on Google+ asking people if the Linux kernel should continue with 3.x release numbers or if it was time for 4.0. I seem to recall Linus saying something to the effect of “it’s just a number” when Linux finally went from 2.6 to 3.0. Of course, it’s not just a number. Versions convey some kind of meaning, although the meaning isn’t always clear.

In general, I’m a fan of following the Semantic Versioning specification. In the projects I work on, both personally and professionally, I use something close to Semantic Versioning. I actually commented to coworkers the other day that I thought we had really screwed up on some product version numbers by not incrementing the major version when we made breaking changes.
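
For anyone who hasn’t read the spec, the whole idea fits in one example (the version numbers are made up):

```
# Semantic Versioning: MAJOR.MINOR.PATCH (semver.org)
1.4.2 -> 1.4.3   # backward-compatible bug fix: bump PATCH
1.4.2 -> 1.5.0   # backward-compatible new feature: bump MINOR, reset PATCH
1.4.2 -> 2.0.0   # breaking change: bump MAJOR, reset the rest
```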

But that’s the limitation of Semantic Versioning, too. Some projects do an excellent job of maintaining compatibility (for example, you can use some pretty old HTCondor versions with the new hotness and things generally work). Do they stick with the same major version forever and end up with very large minor versions? At some point, then, the major version becomes pretty useless.

A colleague remarked that in some cases, the patch number has become almost irrelevant. Back when even the smallest release meant mailing floppy disks or waiting an eternity for an FTP download, patch releases were a big deal. In the modern era of broadband and web-based applications, version numbers themselves start to mean a lot less. If your browser auto-updates, do you even care what the version number is? Does anyone (including Facebook developers) know what version of Facebook you’re using?

Of course, not everything is auto-updating and even fewer things are web or mobile applications. Version numbers will be important in certain areas for the foreseeable future. It’s clear that no versioning scheme will be universally applicable. What’s important is that developers have a versioning scheme, that it’s made clear to users and downstream developers, and that they stick to it. That way, the version numbers mean something.

elementary misses the point

A recent post on the elementary blog about how they ask for payment on download created a bit of a stir this week. One particular sentence struck a nerve (it has since been removed from the post): “We want users to understand that they’re pretty much cheating the system when they choose not to pay for software.”

No, they aren’t. I understand that people want to get paid for their work. It’s only natural. Especially when you’d really like that work to be what puts food on the table and not something you do after you work a full week for someone else. I certainly don’t begrudge developers asking for money. I don’t even begrudge requiring payment before being able to download the software. The developers are absolutely right when they say “elementary is under no obligation to release our compiled operating system for free download.”

Getting paid for developing open source software is not antithetical to open source or free (libre) software principles. Neither the OSI’s Open Source Definition nor the Free Software Foundation’s Free Software Definition necessarily preclude a developer from charging for works. That most software that’s free-as-in-freedom is also free-as-in-beer is true, but irrelevant. Even elementary touts the gratis nature of their work on the front page (talk about mixed messages):

100% free, both in terms of pricing and licensing. But you’re a cheater if you take the free option.

Simply put, the developers cannot choose to offer their work for free and then get mad when people take them up on the offer. Worse, they cannot alienate their community by calling them cheaters. Of the money elementary receives, how much of it goes upstream to the Linux Foundation, the FSF, and the numerous other projects that make elementary possible? Surely they wouldn’t be so hypocritical as to take the work of others for free?

An open source project is more than just coders. It’s more than just coders and funders. A truly healthy project of any appreciable size will have people who contribute in various ways: writing documentation; providing support on mailing lists, fora, etc.; triaging bug reports; filing bug reports; doing design; marketing (including word-of-mouth). This work is important to the project, too, and should be considered an in-kind form of payment.

It’s up to each project to decide what they want in return for the work put in. But it’s also up to each project to accept that people will take whichever of the available options suits them. If that includes “I get it for free”, then the right answer is to find ways for those people to become a part of the community and contribute how they can.

Upgrading users to new environments

Recently, Tom Limoncelli had a post at Everything Sysadmin describing how to move users to a corporate standard. Tom pretty much nailed it (particularly the part about the importance of management support), but I wanted to add my own experiences. Like Tom, I’ve been out of the fleet management business for a while. My perspective comes not from migrating from a wild west scenario to a standard, but from one standard to another.

As an aside, I was an undergraduate when my department left the wild west. The way computing worked on campus, this meant I was mostly unaware except for the weather lab and the one server that was my responsibility. I heard plenty of tales, though, from both the customer and provider side. I got to deal with the residual mistrust from all the things that went wrong. And I saw the graphs of ticket volume. Oh was it a mess I was glad to have missed.

For my part, the first summer after I became the department’s sysadmin, I decided it was time to upgrade the Linux machines. Our Linux servers were a mix of RHEL 3 and RHEL 4, while the desktops ran one of those or Fedora Core 1. Fedora Core 1 was well beyond end-of-life by that point, and packages for RHEL 3 and RHEL 4 were increasingly out of date for the needs of the faculty and students who used them on a daily basis.

RHEL 5 had been released a few months prior, so it seemed like a good opportunity to get everything on the same OS. The first thing I did was to put a few spare machines in the computing lab as demo machines. Interested users could sit down and test the software packages they used and report any problems.

Meanwhile, I also surveyed each professor who had Linux machines or who taught in the lab about the software they used. Some packages we weren’t sure were ever used anymore, and it was a good opportunity to find cruft that could be cleaned up.

The next step was to “force” people to start using RHEL 5 machines by upgrading one machine in each lab (most of the faculty who had one Linux machine had several). Starting with the friendliest users, I hit every lab. We found a few problems here and there (a bug fix in tcsh caused one group quite a bit of trouble, since they had inadvertently been relying on the buggy behavior), but people could see them getting fixed.

The upgrade process got smoother the more times I did it, until we got to the point that I could send our student employees off to get it started. The friendly users helped find the troublesome issues first so that the holdouts had a smooth experience. By the end of the summer, all 70 or so machines were on the same OS, which reduced the support effort. Users had newer packages and a better experience. Everyone was happy and had cake (the cake was probably for something else, though).

Introducing the “Permissive 3000” license

Software licenses aren’t necessarily the easiest texts to understand. This issue is compounded when the person trying to understand the license is in a different jurisdiction or is a non-native speaker of English. A recent thread on the OSI’s license-discuss list brought this issue to light. According to the original poster, a project using the BSD 3-Clause license was used without attribution in a proprietary product. The developer lost the court case because the judge did not understand English well. The poster brought an attempt at a rewrite to the list, but it had some contradictions and other meaningful differences. So I thought I’d give it a try myself.

This weekend, I started from the original BSD 3-Clause license and excised all of the words not on the Oxford 3000™ word list (or reasonably close modifications, e.g. verb tense conjugations). I did make an exception for the word “copyright”, since it seems indispensable to a software license. In all other cases, I used synonyms and circumlocution in order to preserve the meaning while remaining within the constrained word list. This was challenging at times, since circumlocution can end up making the document more difficult to understand than an unknown word might. The difficulty is further compounded by the fact that many words have a distinct legal meaning and a synonym might not have the same weight.

I consoled myself with the fact that software warranties (where most of the real challenge was) are probably not that useful anyway. Furthermore, just because a word has a distinct meaning in American courts, that doesn’t mean that foreign legal systems have the same definitions. Trying to use largely U.S.-centric licenses written in English is a challenge for a global society, but I don’t know that a system of jurisdiction/language-specific licenses would be any better.

In any case, without further ado, I present the Permissive 3000 license. It’s highly experimental and totally unvetted by legal professionals, so nobody should use it for anything except a learning exercise. I’m looking forward to some constructive feedback and hopefully it sparks a discussion about how licenses can be simplified so that they’re more easily understood by judges, developers, and users alike.

Using tracer to point out service restart needs

If you’re seeing this via Fedora Planet, you probably saw Miroslav Suchý’s post from a few days ago about a project called Tracer. Tracer is a friendly tool that tells you which outdated applications and services are running. With the dnf plugin installed, you get a list at the end of the upgrade process.
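
If you want to try it on Fedora, this is roughly what I did. The package names are as shipped at the time of writing; verify with `dnf search tracer` if they’ve moved:

```
# Install tracer along with its dnf plugin
sudo dnf install tracer dnf-plugins-extras-tracer

# Run it standalone, outside of an upgrade, to see what needs restarting now
sudo tracer
```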

For example, right after I installed the plugin and ran an upgrade, I was told that I needed to restart the Samba service. In addition, there were several programs that needed to be manually restarted (KeePassX and SpiderOak, to name two). Plus, one process required a logout, and one required a full system reboot.

I’ve found this to be pretty useful, since I don’t always realize what services need to be restarted after package updates. I have a decade of system administration experience, so it’s not too bad for me. For others, this is a great way to shine light on exactly what needs to be restarted and how.

On Linus Torvalds and communities

This week, the Internet was ablaze with reactions to comments made by Linus Torvalds at linux.conf.au. Unsurprisingly, Torvalds defended the tone he employs on the Linux kernel mailing list, where he pulls no punches. “I’m not a nice person, and I don’t care about you. I care about the technology and the kernel—that’s what’s important to me,” he said (as reported by Ars Technica). He later said “all that [diversity] stuff is just details and not really important.”

The reactions were mixed. Some were upset at the fact that an influential figure like Torvalds didn’t take the opportunity to address what they see as a major issue in the Linux community. Others dismissed those who were upset by pointing to the technical quality of Linux, cultural differences, etc.

I don’t subscribe to the LKML, so most of the posts I’ve seen are generally when someone is trying to point out a specific event (whether a behavior or a technical discussion), and I don’t claim to have a good sense for what that particular mailing list is like. Torvalds and the Linux community have developed a great technical product, but the community needs work.

Speaking to open source communities in general, too many people use the impersonal nature of email to mistake rudeness for directness. Direct and honest technical criticisms are a vital part of any collaborative development. Insults and viciousness are not. Some people thrive in (or at least tolerate) those kinds of environments, but they are incredibly off-putting to everyone else, particularly newcomers.

Open source communities, like any community, need to be welcoming to new members. This allows for the infusion of new ideas and new perspectives: some of which will be obnoxiously naive, some of which will be positively transformative. The naive posts of newcomers can be taxing when you’ve seen the same thing hundreds of times, but everyone has to learn somewhere. Keeping a set of pre-written responses to the most common questions is one way to keep that frustration from leaking into the replies.

Not being a jerk doesn’t just mean tolerating noobs, though. Communities should have an established code of conduct which addresses both annoying and mean actors. When the code of conduct is repeatedly breached, the violator needs to be nudged in the right direction. When a community is welcoming and actively works to remain that way, it thrives. That’s how it can get the diversity of ideas and grow the technical competency that Linus Torvalds so desires.

A lesson in ISO weeks

Last week, users of the Twitter client for Android experienced authentication problems. It was a long and lonely Sunday night for me without my Tweeps. When the issue was fixed, word on the street was that it was due to time travel, in a sense: that Sunday night, the first week of 2015 began if you’re using ISO week numbering.

The next morning, I got my regular weekly email from our time tracking system at work, except it showed I had recorded zero hours in the previous week. Late December tends to be a quiet time, but not that quiet. Then I looked a little closer and noticed that the email was for week 2015-52. Oops!

I thought I’d take a look at the code for the report generator, and my hunch that it was also an ISO week issue was quickly confirmed. In the code, the current date was recorded and split into year and week values. Then the week value was decremented. This seemed silly to me. I changed it to first subtract a week before splitting into the year and week values. This seemed to fix…the glitch.
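
A quick illustration with GNU date, which prints the ISO week-based year as %G and the week number as %V. Splitting first and then decrementing breaks during week 1; subtracting first rolls both fields back together:

```
# January 1, 2015 fell in ISO week 1 of 2015
$ date -d '2015-01-01' +%G-%V
2015-01
# Decrementing the week field of "2015-01" yields "2015-00", which doesn't exist

# Subtracting a week first gives the right answer
$ date -d '2015-01-01 - 7 days' +%G-%V
2014-52
```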

So what’s the lesson in all of this? First, make sure you do the math at the right time. Second, make sure you understand how time works. The ISO week year runs ahead of the calendar year only for a few days at the end of some Decembers. It’s not a scenario that one would think to test (though I expect a lot more tests will include it now).