The strangest bug

Okay, this is probably not the strangest bug that ever existed, but it’s certainly one of the weirdest I’ve ever personally come across. A few weeks ago, a vulnerability in OS X was announced that affected all versions but was only fixed in Yosemite. That was enough to finally get me to upgrade from Mavericks on my work laptop. I discovered post-upgrade that the version of VMWare Fusion I had been running does not work on Yosemite. Since VMWare didn’t offer a free upgrade path, I decided not to spend the company’s money and switched to VirtualBox instead (see sidebar 1).

Fast forward to the beginning of last week when I started working on the next version of my company’s Risk Analysis Pipeline product. One of the executables is a small script that polls CycleServer to count the number of jobs left in a particular submission and blocks the workflow until the count reaches 0. It’s been pretty reliable since I first wrote it a year ago, and hasn’t seen any substantial changes.

Indeed, it saw no changes at all when I picked up development again last week, but I started seeing some unusual behavior. The script would poll successfully six times and then fail every time afterward. After adding some better logging, I saw that it was failing with HTTP 401, which didn’t make sense because it sent the credentials every time (see sidebar 2). I checked the git log to confirm that the file hadn’t changed. I spent some time fruitlessly searching for the error. I threw various strands of spaghetti at the wall. All to no avail.

I knew it had to work generally, because it’s the sort of thing that would be very noticeable to our customers. Particularly the part where this sort of failure would mean the workflow never completed. I wondered if something changed when I switched from VMWare Fusion to VirtualBox. After all, I did change the networking setup a bit when I did that, but I would expect the failure to be consistent in that case. (Well, to always fail, not to work six times before failing.)

So I tried the patch release I had published a few days before. It worked fine, which ruled out my local test server being broken. Then I checked out the git tag of that patch release and recompiled. The rebuild failed in the same way. This was very perplexing, since I had released the patch version after the OS X upgrade and resulting VM infrastructure changes.

Out of ideas, one of my colleagues suggested reinstalling Python. I re-ran the Python installer and built again. Suddenly, it worked. I’m at a loss to explain why. Maybe there was something different enough about the virtualized network devices that caused py2exe to get confused when it built. Maybe there’s some sort of counter in urrlib2 that implements the plannedObsolescence() method. Whatever it was, I decided I don’t really care. I’m just glad it works again.

Sidebar 1

The conversion process was pretty simple. For reasons that I no longer remember, I had my VMWare disk images in 2 GB slices, so I had to combine them first. VirtualBox supports vmdk images, though, so it was quick to get the new VMs up and running. My CentOS VM worked with no effort. My Windows 7 VM was less happy. I ended up having to reinstall the OS in order for it to boot in anything other than recovery mode. It’s possible that I failed to correctly install something at that time, but the timeline doesn’t support that. In any case, I’m always impressed by the way my virtual and physical Linux machines seem to handle arbitrary hardware changes with no problem.

Sidebar 2

I also learned something about the way the HTTP interactions worked. I’ve never had much reason to pay attention before, but it turns out that the call to the rest API is first met with a 401, then it sends the authentication and gets a 200. This probably comes as no surprise to anyone who has dealt with HTTP authentication, but it was a lesson for me. Never stop learning.

Sidebar 3

I didn’t mention this in the text above, so if you made it this far, I applaud your dedication to reading the whole post. The first half of my time spent on this problem was spent ruling out a self-inflicted wound. I had already spent a fair amount of time tracking down a bug I introduced trying to de-lint one of the modules. More on that in a later (and hopefully shorter) post.