A Cfengine learning experience

Note: This post refers to Cfengine 2. The difficulties I had may quite likely be a result of peculiarities in our environment or the limits of my own knowledge.

A few weeks ago, my friends at the University of Nebraska politely asked us to install host certificates on our Condor collectors and submitters so that flocking traffic between our two sites would be encrypted. It seemed like a reasonable request, so after getting certificates for 17-ish hosts from our CA, I set about trying to put them in place. I could have plopped them all in place easily enough using a for loop, but I decided it would make more sense to let Cfengine take care of it. This has the added advantage of making sure the certificate gets put in place automatically when a host gets reinstalled or upgraded.

I thought it would be nice if I tested my Cfengine changes locally first. I know just enough Cfengine to be dangerous, and I didn’t want to spam the rest of the group with mail as I checked in modifications over and over again. So after editing the input file on one of the servers, I ran cfagent -qvk. It didn’t work. The syntax looked correct, but nothing happened. After a bit, I asked my soon-to-be-boss for help.

It turned out that I didn’t quite get the meaning of the -k option. I always used it to run against the local cache of the input files, not realizing that it killed all copy actions. Had I looked at the documentation, I would have figured that out. Like I said, I know just enough to be dangerous.

I didn’t want to generate a bunch of error email, since some hosts wouldn’t be getting host certificates, so I went with an IfFileExists statement to define a group that I could use in the copy: stanza. I committed what I thought were the correct changes and tried running cfagent again. The certificates still weren’t being copied into place. Looking at the output, I saw that it couldn’t find the file. Nonsense. It’s right there on the Cfengine server.

As it turns out, that’s not where IfFileExists looks. It checks the filesystem of the host running cfagent, not the Cfengine server. The file, of course, doesn’t exist locally because Cfengine hasn’t yet copied it. Eventually I surrendered and defined a separate group in cf.groups to reference in the appropriate input file. This makes the process more manual than I would have liked, but it actually works.

Oh, except for one thing. In testing, I had been using $(hostname) in a shellcommand: to make sure that the input file was actually getting read. When I finally got the copy: stanza sorted out, the certificates still weren’t being copied out. The cfagent output said it couldn’t find ‘/masterfiles/tmpl/security/host-certs/$(hostname).pem’. I had thought $(hostname) was a valid Cfengine variable. It isn’t: in the shellcommand: it was passed through untouched to the shell, which ran hostname as a command substitution. The end result there was indistinguishable from what I intended, but it didn’t translate to the copy: stanza, where no shell is involved. The variable I wanted was $(fqhost).
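
The shell half of that confusion is easy to reproduce at a prompt. The sketch below only illustrates shell quoting, not Cfengine itself: inside double quotes the shell runs hostname and substitutes its output, while the single-quoted string stays literal, much like the unexpanded path in the cfagent output. The variable names expanded and literal are made up for the example.

```shell
#!/bin/sh
# Illustration only: how a POSIX shell treats $(hostname).
# The variable names here are hypothetical.

# Double quotes: the shell runs the hostname command and inserts its output.
expanded="/masterfiles/tmpl/security/host-certs/$(hostname).pem"

# Single quotes: nothing is expanded, so the text stays exactly as typed.
# This resembles the path cfagent reported, since Cfengine had no
# variable named hostname to substitute.
literal='/masterfiles/tmpl/security/host-certs/$(hostname).pem'

echo "shell-expanded: $expanded"
echo "left literal:   $literal"
```

The first line printed contains the machine's host name; the second still contains the text $(hostname).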

The tricky problem dilemma

A good sysadmin believes in treating the cause, not the symptom. Unfortunately, pragmatism sometimes gets in the way of that. A recent example: we just rolled out a kernel update to a few of our compute clusters. About 3% of the machines ended up in a troubled state. By troubled, I mean that the permissions on a few directories (/bin, /lib, /dev, /etc, /proc, and /sys) were set to 700, making the machine effectively unusable. For the most part, we didn’t notice this on the affected machines until after they did their post-upgrade reboot, but fortunately we were able to catch a few that hadn’t yet rebooted.

What we found was that / had a sysroot directory and an init file. These are created by the mkinitrd script, which is called by the new-kernel-pkg script, which is in turn called in the postinstall script of the kernel RPM. The relevant part of the mkinitrd script seems to be

    TMPDIR=""
    for t in /tmp /var/tmp /root ${PWD}; do
        if [ ! -d $t ]; then continue; fi
        if ! access -w $t ; then continue; fi

        fs=$(df -T $t 2>/dev/null | awk '{line=$1;} END {printf $2;}')
        if [ "$fs" != "tmpfs" ]; then
            TMPDIR=$t
            break
        fi
    done

which selects /tmp as the working directory under normal conditions. However, something seemed to cause / to be used instead of /tmp. Later in the script, several directories are created in $TMPDIR, and these correspond to the wrongly-permissioned directories. There’s no clear indication of why this happens, and if we clean up and reinstall the updated kernel package it doesn’t necessarily repeat itself. After some soul-searching, we decided that it was more important to return the nodes to service than to try to track down an easily-correctable-but-difficult-to-diagnose problem. We’ll see if it happens again with the next kernel upgrade.
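
To see which directory that loop would pick on a given node, the selection logic can be replayed on its own. This is a sketch, not the mkinitrd code: pick_tmpdir is a hypothetical helper name, and the access -w test is replaced with the shell's own [ -w ] because access is not a generally available command. The last lines show why an empty result would be dangerous, since paths built from an empty variable land directly under /. Whether that is what happened on the affected nodes is an assumption, not something we confirmed.

```shell
#!/bin/sh
# Sketch: replay the temporary-directory selection outside mkinitrd.
# pick_tmpdir is a hypothetical helper name, not part of mkinitrd.
pick_tmpdir() {
    for t in /tmp /var/tmp /root "${PWD}"; do
        [ -d "$t" ] || continue
        # Assumption: [ -w ] stands in for mkinitrd's "access -w".
        [ -w "$t" ] || continue

        # Filesystem type is the second field of the last df -T line.
        fs=$(df -T "$t" 2>/dev/null | awk 'END {printf "%s", $2}')
        if [ "$fs" != "tmpfs" ]; then
            printf '%s\n' "$t"
            return 0
        fi
    done
    return 1
}

# Fall back to an empty string if no directory qualifies.
chosen=$(pick_tmpdir) || chosen=""
echo "selected working directory: '${chosen}'"

# If nothing was selected, paths built from the empty value fall under /.
empty=""
echo "with an empty value: ${empty}/sysroot ${empty}/bin"
```

Running this on a healthy node and on a troubled one before cleanup would at least show whether the two disagree about the working directory.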

N900: a year later

It’s been a little over a year since I first got my Nokia N900.  When I first wrote about this phone, I was pretty excited.  After a year — and several firmware updates — am I still excited?  The answer is mixed.  I still find my phone incredibly useful, but there are a lot of things I find disappointing.  Amazon.com recently listed the N900 as the most gifted phone of 2010, but it appears that it will remain a niche device.

Status of the Internet

I’ve been meaning to do this since the Comcast DNS issues a few weeks ago, but I’ve finally put together a quick page with links to the status pages of various web services and sites.  You can check the status of the Internet at http://funnelfiasco.com/internet.html.  It’s a bit surprising how many sites don’t have obvious status pages.  It makes sense for popular sites to have a separate server for status information that users can find when they’re having problems.  I don’t have one for FunnelFiasco because this isn’t a popular site.  If I ever get popular, wake me up from my faint and I’ll stand a status page up.  If any of my dear readers know of sites that have status pages, send me a link or leave a comment and I’ll get it added.

I also had the chance to improve my CSS for FunnelFiasco while writing this.  Over the summer, I was able to find out how to get rid of tables for pictures.  Today, I mostly copied that implementation for “text tables”, or content where I use a table-like format for displaying the data, but don’t need the rigidity.  There are still a few things to work out, but I’m pretty happy with it so far.  Now to backport those changes into other pages.

Managing to-do lists with TuDu

I have no problem admitting that I’m not very organized.  I often find myself letting tasks drop, especially if they’re not part of my normal routine.  It’s not that I’m lazy (sometimes!), it’s just that I forget what I need to do — or I remember everything at once and get overwhelmed by it all.  I tried using project trackers like Planner and KPlato, but they seemed way too heavy for what I needed.  Fortunately, I recently came upon a small project called TuDu.

TuDu is terminal-based, as are several of my other favorite applications, which means it is unobtrusive and can be left running in a screen session for quick attachment from anywhere.  It supports nested tasks, making it easy to break down larger tasks into manageable sections.  Schedule dates, due dates, and priorities can be used to keep the more important items at the top of the pile, and categories can be used to filter items for the chronically over-burdened.

Since I started using TuDu, my productivity has increased, or at least it seems to have.  There’s a great sense of accomplishment in being able to mark an item as done.  Just remember to hit ‘s’ frequently, as TuDu does not auto-save the XML file.