The tricky problem dilemma

A good sysadmin believes in treating the cause, not the symptom. Unfortunately, pragmatism sometimes gets in the way of that. A recent example: we just rolled out a kernel update to a few of our compute clusters. About 3% of the machines ended up in a troubled state. By troubled, I mean that the permissions on a few directories (/bin, /lib, /dev, /etc, /proc, and /sys) were set to 700, making the machine effectively unusable. For the most part, we didn’t notice this on the affected machines until after they did their post-upgrade reboot, but fortunately we were able to catch a few that hadn’t yet rebooted.

We found that / contained a sysroot directory and an init file. These are created by the mkinitrd script, which is called by the new-kernel-pkg script, which is in turn called in the postinstall script of the kernel RPM. The relevant part of the mkinitrd script seems to be:

    TMPDIR=""
    for t in /tmp /var/tmp /root ${PWD}; do
        if [ ! -d $t ]; then continue; fi
        if ! access -w $t ; then continue; fi

        fs=$(df -T $t 2>/dev/null | awk '{line=$1;} END {printf $2;}')
        if [ "$fs" != "tmpfs" ]; then
            TMPDIR=$t
            break
        fi
    done

which, under normal conditions, selects /tmp as the working directory. On the affected machines, however, something apparently caused / to be used instead (most likely via the ${PWD} entry at the end of the list). Later in the script, several directories are created in $TMPDIR, and these correspond to the wrongly-permissioned directories. There’s no clear indication of why this happens, and if we clean up and reinstall the updated kernel package, the problem doesn’t necessarily recur. After some soul-searching, we decided that it was more important to return the nodes to service than to try to track down an easily-correctable-but-difficult-to-solve problem. We’ll see if it happens again with the next kernel upgrade.
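
For anyone who wants to guard against this failure mode, here is a minimal sketch of the same selection loop wrapped in a function that refuses to hand back /. It is not the mkinitrd code: the names fs_type and pick_tmpdir are made up for illustration, the standard [ -w ] test stands in for the script's access -w, GNU df is assumed, and treating a failed df as "unusable" is a deliberate change from the original loop.

```shell
#!/bin/sh
# Sketch only: fs_type and pick_tmpdir are hypothetical names, not mkinitrd code.

# Print the filesystem type of a directory (assumes GNU df; -P keeps each
# entry on one line so the type is always the second field).
fs_type() {
    df -P -T "$1" 2>/dev/null | awk 'END { printf "%s", $2 }'
}

# Print the first candidate that is a writable, non-tmpfs directory other
# than /; return 1 if none qualifies.
pick_tmpdir() {
    for cand in "$@"; do
        [ -d "$cand" ] || continue
        [ -w "$cand" ] || continue
        # Resolve symlinks and forms like "//" before comparing against /.
        real=$(cd "$cand" 2>/dev/null && pwd -P) || continue
        if [ "$real" = "/" ]; then continue; fi
        fs=$(fs_type "$cand")
        # Unlike the original loop, an empty type (df failed) disqualifies
        # the candidate instead of counting as "not tmpfs".
        if [ -z "$fs" ]; then continue; fi
        if [ "$fs" != "tmpfs" ]; then
            printf '%s\n' "$cand"
            return 0
        fi
    done
    return 1
}

# Example use: stop instead of silently building in the wrong place.
# TMPDIR=$(pick_tmpdir /tmp /var/tmp /root "$PWD") || exit 1
```

The point of the design is that running out of candidates becomes a visible error rather than a quiet fallback to whatever the current directory happens to be.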

3 thoughts on “The tricky problem dilemma”

  1. Pingback: Tweets that mention The tricky problem dilemma « Blog Fiasco -- Topsy.com

  2. That snippet is checking /tmp, /var/tmp, /root, and $PWD (which is probably / during rpm installation) to find something that’s a writeable directory but *not* tmpfs. If your systems are all-tmpfs – or have tmpfs at the three locations mentioned – you could end up using / as TMPDIR. You could probably fix this by keeping something that wasn’t tmpfs mounted at /root, or editing that first “for …” line so it ends with a more sensible fallback dir than ${PWD}.

    As far as I can tell, the current Fedora mkinitrd doesn’t do anything this silly.

  3. Will, none of the filesystems are tmpfs. In a moment of inspiration, I realized that it might be hung NFS mounts causing df to not finish executing. The clusters currently use an NFS mount for the yum repo, and it might be that several thousand nodes trying to grab the package in a relatively short amount of time cause a few hangs here and there. Another possibility is that other NFS mounts (e.g. user homedirs and scratch directories) are causing it. In order to test this hypothesis, the cluster architect is changing the yum repo to use http, which is served off our IPVS cluster.
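
One rough way to look for unresponsive NFS mounts on a node is to run df against each one under a time limit. This is only a sketch of that idea, not something from the cluster's tooling: df_responds and check_nfs_mounts are hypothetical names, timeout is assumed to be the GNU coreutils command, and whether df really hangs during the upgrade is still just the hypothesis being tested. KILL is used because a process blocked on NFS may ignore TERM, and on some kernels and mount options even KILL may not free it.

```shell
#!/bin/sh
# Sketch only: df_responds and check_nfs_mounts are hypothetical helpers.

# Succeed only if df answers for the directory within the time limit
# (default 5 seconds). Assumes timeout(1) from GNU coreutils.
df_responds() {
    dir=$1
    limit=${2:-5}
    timeout -s KILL "$limit" df -P "$dir" >/dev/null 2>&1
}

# For each NFS entry in a mounts table (default /proc/mounts), report
# whether df answered in time.
check_nfs_mounts() {
    # Fields: device, mount point, filesystem type, then the rest.
    while read -r _dev mnt type _rest; do
        case $type in
            nfs|nfs4)
                if df_responds "$mnt" 5; then
                    printf 'ok %s\n' "$mnt"
                else
                    printf 'no answer %s\n' "$mnt"
                fi
                ;;
        esac
    done < "${1:-/proc/mounts}"
}

# Example use on a node:
# check_nfs_mounts
```

A "no answer" line only means df failed or was cut off; it doesn't say why, so it is a starting point for a closer look rather than a diagnosis.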
