Linus’s awakening

It may be the biggest story in open source in 2018, a year that also saw Microsoft purchase GitHub: Linus Torvalds replaced the Code of Conflict for the Linux kernel with a Code of Conduct. In a message on the Linux Kernel Mailing List (LKML), Torvalds explained that he was taking time off to examine the way he led the kernel development community.

Torvalds has taken a lot of flak for his style over the years, including on this blog. While he has done an excellent job shepherding the technical development of the Linux kernel, his community management has often — to put it mildly — left something to be desired. Abusive and insulting behavior is corrosive to a community, and Torvalds has spent the better part of the last three decades enabling and partaking in it.

But he has seen the light, it would seem. To an outside observer, this change is rather abrupt, but it is welcome. Reaction to his message has been mixed. Some, like my friend Jono Bacon, have advocated supporting Linus in his awakening. Others take a more cynical approach:

I understand Kelly’s position. It’s frustrating to push for a more welcoming and inclusive community only to be met with insults and then, when someone finally comes around, to watch everyone celebrate. Kelly and others who feel as she does are absolutely justified in their position.

For myself, I like to think of it as a modern parable of the prodigal son. As tempting as it is to reject those who awaken late, a late awakening is better than none at all. If Linus fails to follow through, it would be right to excoriate him. But if he does follow through, it can only improve the community around one of the most important open source projects. And it will set an example for other projects to follow.

I spend a lot of time thinking about community, particularly since I joined Red Hat as the Fedora Program Manager a few months ago. Community members, especially those in a highly visible role, have an obligation to model the kind of behavior the community needs. This sometimes means a patient explanation when an angry rant would feel better. It can be demanding and time-consuming work. But an open source project is more than just the code; it’s also the community. We make technology to serve the people, so if our communities are not healthy, we’re not doing our jobs.

HP laptop keyboard won’t type on Linux

Here’s another story from my “WTF, computer?!” files (and also my “oh I’m dumb” files).

As I regularly do, I recently updated my Fedora machines. This includes the crappy HP 2000-2b30DX Notebook PC that I bought as a refurb in 2013. After dnf finished, I rebooted the laptop and put it away. Then while I was at a conference last week, my wife sent me a text telling me that she couldn’t type on it.

When I got home I took a look. Sure enough, the keyboard didn’t key. But it was weirder than that. I could type in the decryption password for the hard drive at the beginning of the boot process. And when I attached a wireless keyboard, I could type. Knowing the hardware worked, I dropped to runlevel 3. The built-in keyboard worked then.

I tried applying the latest updates, but that didn’t help. Some internet searching led me to Freedesktop.org bug 103561. Running dnf downgrade libinput and rebooting gave me a working keyboard again. The bug is closed as NOTABUG, since the maintainers say it’s an issue in the kernel, which is fixed in the 4.13 kernel release. So I checked to see if Fedora 27, which was released last week, includes the 4.13 kernel. It does, and so does Fedora 26.

That’s when I realized I still had the kernel package excluded from dnf updates on that machine because of a previous issue where a kernel update caused the boot process to hang while/after loading the initrd. I removed the exclusion, updated the kernel, and re-updated libinput. After a reboot, the keyboard still worked. But if you’re using a kernel version from 4.9 to 4.12, libinput 1.9, and an HP device, your keyboard may not work. Update to kernel 4.13 or downgrade libinput. (Or replace your hardware. I would not recommend the HP 2000 Notebook. It is not good.)
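If you want a quick check of whether a machine falls in that range, here is a minimal sketch. The function name kernel_in_affected_range is made up for this example, and it only looks at the kernel version; you still have to confirm the libinput version yourself (rpm -q libinput prints it on Fedora).

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): succeeds if the given
# kernel release string is in the 4.9 through 4.12 range mentioned above.
kernel_in_affected_range() {
    release=$1
    major=${release%%.*}          # text before the first dot, e.g. "4"
    rest=${release#*.}            # text after the first dot, e.g. "12.9-300.fc26"
    minor=${rest%%[!0-9]*}        # leading digits of that, e.g. "12"
    [ "$major" = "4" ] && [ "$minor" -ge 9 ] && [ "$minor" -le 12 ]
}

if kernel_in_affected_range "$(uname -r)"; then
    echo "Kernel $(uname -r) is in the 4.9-4.12 range; check your libinput version."
else
    echo "Kernel $(uname -r) is outside the 4.9-4.12 range."
fi

# Show the installed libinput package, if rpm is available on this system.
if command -v rpm >/dev/null 2>&1; then
    rpm -q libinput || true
fi
```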

Disappearing WiFi with rt2800pci

I recently did a routine package update on my Fedora 24 laptop. I’ve had the laptop for three years and have been running various Fedorae the whole time, so I didn’t think much of it. So it came as some surprise to me when after rebooting I could no longer connect to my WiFi network. In fact, there was no indication that any wireless networks were even available.

Since the update included a new kernel, I thought that might be the issue. Rebooting into the old kernel seemed to fix it (more on that later!), so I filed a bug, excluded kernel packages from future updates, and moved on.
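An exclusion like this is easy to forget about later. Here is a small sketch for listing what is excluded, assuming the exclusion was set with an exclude= or excludepkgs= line in /etc/dnf/dnf.conf (that location is an assumption; excludes can also live in individual repo files or be passed on the command line). The function name list_dnf_excludes is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print any exclude/excludepkgs lines from a dnf config
# file. Defaults to /etc/dnf/dnf.conf, which is an assumption about where
# the exclusion was configured.
list_dnf_excludes() {
    conf=${1:-/etc/dnf/dnf.conf}
    [ -r "$conf" ] || return 0
    grep -E '^[[:space:]]*exclude(pkgs)?[[:space:]]*=' "$conf"
}

list_dnf_excludes || echo "No exclude lines found in /etc/dnf/dnf.conf."
```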

But a few days later, I rebooted and my WiFi was gone again. The kernel hadn’t updated, so what could it be? I spent a lot of time flailing around until I found a “solution”. A four-year-old forum post said not to reboot: booting from a powered-off state, or suspending and resuming the laptop, will cause the wireless to work again.

And it turns out, that “fixed” it for me. A few other posts seemed to suggest power management issues in the rt2800pci driver. I guess that’s what’s going on here, though I can’t figure out why I’m suddenly seeing it after so long. Seems like a weird failure mode for failing hardware.

Here’s what dmesg and the systemd journal reported:

Aug 01 14:54:24 localhost.localdomain kernel: ieee80211 phy0: rt2800_wait_wpdma_ready: Error - WPDMA TX/RX busy [0x00000068]
Aug 01 14:54:24 localhost.localdomain kernel: ieee80211 phy0: rt2800pci_set_device_state: Error - Device failed to enter state 4 (-5)
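To see whether your machine is logging the same thing, you can filter the kernel messages for the current boot. This is a rough sketch: the function name find_rt2800_errors is made up, and the pattern is based only on the two lines above, so other rt2800 failures may look different.

```shell
#!/bin/sh
# Hypothetical filter: keep only lines where an rt2800* function reports
# "Error", like the two journal lines quoted above. Reads from stdin.
find_rt2800_errors() {
    grep -E 'rt2800[a-z0-9_]*: Error' || true
}

# Kernel messages (-k) from the current boot (-b), if journalctl exists.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b --no-pager 2>/dev/null | find_rt2800_errors
fi
```

The same filter works on dmesg output: dmesg | find_rt2800_errors.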

Hopefully, this post saves someone else a little bit of time in trying to figure out what’s going on.

The tricky problem dilemma

A good sysadmin believes in treating the cause, not the symptom. Unfortunately, pragmatism sometimes gets in the way of that. A recent example: we just rolled out a kernel update to a few of our compute clusters. About 3% of the machines ended up in a troubled state. By troubled, I mean that the permissions on a few directories (/bin, /lib, /dev, /etc, /proc, and /sys) were set to 700, making the machine effectively unusable. For the most part, we didn’t notice this on the affected machines until after they did their post-upgrade reboot, but fortunately we were able to catch a few that hadn’t yet rebooted.
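A check along these lines can flag affected nodes before they reboot. This is a sketch, not the script we actually ran: the function name report_restricted_dirs is made up, and it assumes GNU stat (for the -c format option).

```shell
#!/bin/sh
# Hypothetical check: print each given directory whose mode is exactly 700.
# Assumes GNU coreutils stat; -L follows symlinks (e.g. /bin -> usr/bin).
report_restricted_dirs() {
    for d in "$@"; do
        [ -d "$d" ] || continue
        mode=$(stat -L -c '%a' "$d")
        if [ "$mode" = "700" ]; then
            echo "$d"
        fi
    done
}

# The directories that were affected on our machines.
report_restricted_dirs /bin /lib /dev /etc /proc /sys
```

Any output means the machine needs attention; a healthy node prints nothing.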

What we found was that / had a sysroot directory and an init file. These are created by the mkinitrd script, which is called by the new-kernel-pkg script, which is in turn called in the postinstall script of the kernel RPM. The relevant part of the mkinitrd script seems to be

TMPDIR=""
for t in /tmp /var/tmp /root ${PWD}; do
    if [ ! -d $t ]; then continue; fi
    if ! access -w $t ; then continue; fi

    fs=$(df -T $t 2>/dev/null | awk '{line=$1;} END {printf $2;}')
    if [ "$fs" != "tmpfs" ]; then
        TMPDIR=$t
        break
    fi
done

which picks a working directory, normally /tmp. However, something seemed to cause / to be used instead of /tmp. Later in the script, several directories are created in $TMPDIR, which correspond to the wrongly-permissioned directories. There’s not a clear indication of why this happens, and if we clean up and reinstall the updated kernel package it doesn’t necessarily repeat itself. After some soul-searching, we decided that it was more important to return the nodes to service than to try to track down an easily-correctable-but-difficult-to-solve problem. We’ll see if it happens again with the next kernel upgrade.
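The loop does suggest one possible path to /: ${PWD} is the last candidate, so if /tmp, /var/tmp, and /root are all skipped (missing, not writable, or on tmpfs) and the script happens to run with / as its working directory, TMPDIR ends up as /. That is a guess, not something we confirmed. The sketch below mimics the selection logic and reports why each candidate is skipped. The names fs_type and pick_tmpdir are made up, and it uses the shell’s own -w test as a stand-in for the access command the real script calls.

```shell
#!/bin/sh
# Stand-in for the df/awk pipeline in the quoted snippet: print the
# filesystem type of the given path (second field of df's last line).
fs_type() {
    df -PT "$1" 2>/dev/null | awk 'END {print $2}'
}

# Hypothetical re-creation of the selection loop: print the first candidate
# that is a writable, non-tmpfs directory; explain each skip on stderr.
# "[ -w ]" is an assumption standing in for mkinitrd's "access -w".
pick_tmpdir() {
    for t in "$@"; do
        if [ ! -d "$t" ]; then echo "skip $t: not a directory" >&2; continue; fi
        if [ ! -w "$t" ]; then echo "skip $t: not writable" >&2; continue; fi
        fs=$(fs_type "$t")
        if [ "$fs" = "tmpfs" ]; then echo "skip $t: on tmpfs" >&2; continue; fi
        echo "$t"
        return 0
    done
    return 1
}

# Same candidate order as the quoted snippet.
pick_tmpdir /tmp /var/tmp /root "$PWD" || echo "no candidate accepted" >&2
```

Running this from / on a node that hasn’t yet been cleaned up would show whether the earlier candidates are being rejected, and for what reason.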