Seek first to understand

One of the lessons that I’ve had to repeatedly re-learn over my career is “understand the problem before you fix it.” I try to fix a problem as quickly as I can. It’s a laudable goal, but a fix without understanding may not actually fix the problem. And it may not prevent future occurrences. If you’re particularly unlucky, it will make the problem worse.

I learned this lesson late last week. On Thursday, someone reported HTML markup appearing on translated pages of the Fedora documentation. “Oh! It was probably that PR I merged yesterday,” I thought. So I reverted it.

Then I started digging into it some more, and I realized that it probably wasn’t that change at all. In fact, it worked locally and on the staging server. It was only broken on the production server. It’s not clear to me whether staging and production sync the translation data on the same schedule. (Without getting too sidetracked: the staging environment isn’t really a staging environment. It needs a better name.) But I became convinced that the problem wasn’t in the docs infrastructure, but in the translations. So I reverted my reversion.

This is not the first time I jumped in to fix something before I took a look around to see what’s going on. Unfortunately, it probably won’t be the last.

Here’s the thing: most of the time, a slight delay doesn’t matter. No one’s safety was at risk. We weren’t losing hundreds of thousands of dollars a minute. There was no real harm in spending 10 minutes to figure out what was going on. Perhaps I could try to reproduce it. After all, if you can’t reproduce the error, how do you know you’ve fixed it?

Hopefully the next time I go to fix a problem, I’ll understand the problem first. As astronauts do, I need to work the problem.

How I shot myself in the foot with pylint

I mentioned this in passing in a recent post, but I thought I deserved to make fun of myself more fully here. One of the things I’ve tried to do as I work on code I’ve inherited from predecessors is to clean it up a bit. I’m not a computer science expert, so by “clean up”, I mostly mean style issues as opposed to improving the data structures or anything truly useful like that.

Most of the development I do is on Python code that gets compiled into Windows executables and run as an actuarial workflow. I discovered early on in the process that if I’m working on code that runs toward the end of the workflow, having to wait 20 minutes just to find out that I made some dumb syntax or variable name error is really annoying. I got in the habit of running pylint before I compiled to help catch at least some of the more obvious problems.

Over time, I decided to start taking action on some of the pylint output. Recently, I declared war on variables named “str”. Since str is the name of Python’s built-in string type, pylint rightly complained about it. Since the method that used “str” did string replacement, I opted for the still-not-great-but-at-least-not-terrible “string”. I replaced all of the places “str” appeared as a variable and went about my business.

As I was testing some other changes, I noticed that some of my path replacement was failing (though I didn’t know that’s where it was at first). So I shoved a whole bunch of logger calls into the “prepare” script to see where exactly it was failing. Finally, I found it. Then I shoved more into the module where the failure happened. I had to work down through several method calls before I finally found it.

There was still one instance of “str” there, but now Python thought it was the str() builtin and got really confused. In hindsight, it should have been totally obvious that I had inflicted this pain on myself, but several days had passed and I had forgotten that I had messed around in that function. I should have consulted the revision history sooner.
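For anyone who hasn’t been bitten by this, here is a minimal sketch of the failure mode. The function names and the %ROOT% placeholder are made up for illustration and are not from the real code. The idea is only that once the parameter is renamed, any leftover “str” silently refers to the built-in type instead.

```python
def expand_before(str, root):
    # Original style: the parameter shadows the built-in str type.
    # pylint reports this as redefined-builtin (W0622).
    return str.replace("%ROOT%", root)


def expand_half_renamed(string, root):
    # Incomplete rename: the parameter is now "string", but one "str"
    # was left behind. It resolves to the built-in type, so this calls
    # str.replace("%ROOT%", root) with "%ROOT%" as the string to operate
    # on and too few arguments, which raises a TypeError.
    return str.replace("%ROOT%", root)


def expand_fixed(string, root):
    # Complete rename: every reference uses the new parameter name.
    return string.replace("%ROOT%", root)


if __name__ == "__main__":
    print(expand_before("%ROOT%/input", "/data"))
    print(expand_fixed("%ROOT%/input", "/data"))
    try:
        expand_half_renamed("%ROOT%/input", "/data")
    except TypeError as err:
        print("half-renamed version failed:", err)
```

The nasty part is that the half-renamed version is still valid Python, so nothing complains until that code path actually runs.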

The strangest bug

Okay, this is probably not the strangest bug that ever existed, but it’s certainly one of the weirdest I’ve ever personally come across. A few weeks ago, a vulnerability in OS X was announced that affected all versions but was only fixed in Yosemite. That was enough to finally get me to upgrade from Mavericks on my work laptop. I discovered post-upgrade that the version of VMware Fusion I had been running does not work on Yosemite. Since VMware didn’t offer a free upgrade path, I decided not to spend the company’s money and switched to VirtualBox instead (see sidebar 1).

Fast forward to the beginning of last week when I started working on the next version of my company’s Risk Analysis Pipeline product. One of the executables is a small script that polls CycleServer to count the number of jobs left in a particular submission and blocks the workflow until the count reaches 0. It’s been pretty reliable since I first wrote it a year ago, and hasn’t seen any substantial changes.
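The shape of that script is roughly the following. This is a sketch, not the real thing: the actual CycleServer query is replaced by a hypothetical count_jobs callable, and the function name and 30-second interval are assumptions for illustration.

```python
import time


def wait_for_jobs(count_jobs, poll_interval=30.0, sleep=time.sleep):
    """Block until count_jobs() reports zero jobs remaining.

    count_jobs is a stand-in for whatever asks the server how many jobs
    are left in the submission. Returns the number of polls it took.
    """
    polls = 0
    while True:
        remaining = count_jobs()
        polls += 1
        if remaining == 0:
            return polls
        # Jobs are still running, so wait before asking again.
        sleep(poll_interval)


if __name__ == "__main__":
    # Fake job counts standing in for real server responses.
    fake_counts = iter([3, 1, 0])
    print(wait_for_jobs(lambda: next(fake_counts), poll_interval=0.0))
```

Note that a loop like this has no way out if count_jobs starts raising errors or never reaches zero, which is why a polling failure means the workflow never completes.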

Indeed, it saw no changes at all when I picked up development again last week, but I started seeing some unusual behavior. The script would poll successfully six times and then fail every time afterward. After adding some better logging, I saw that it was failing with HTTP 401, which didn’t make sense because it sent the credentials every time (see sidebar 2). I checked the git log to confirm that the file hadn’t changed. I spent some time fruitlessly searching for the error. I threw various strands of spaghetti at the wall. All to no avail.

I knew it had to work generally, because it’s the sort of thing that would be very noticeable to our customers. Particularly the part where this sort of failure would mean the workflow never completed. I wondered if something changed when I switched from VMware Fusion to VirtualBox. After all, I did change the networking setup a bit when I did that, but I would expect the failure to be consistent in that case. (Well, to always fail, not to work six times before failing.)

So I tried the patch release I had published a few days before. It worked fine, which ruled out my local test server being broken. Then I checked out the git tag of that patch release and recompiled. The rebuild failed in the same way. This was very perplexing, since I had released the patch version after the OS X upgrade and resulting VM infrastructure changes.

I was out of ideas when one of my colleagues suggested reinstalling Python. I re-ran the Python installer and built again. Suddenly, it worked. I’m at a loss to explain why. Maybe something about the virtualized network devices was different enough to confuse py2exe when it built. Maybe there’s some sort of counter in urllib2 that implements the plannedObsolescence() method. Whatever it was, I decided I don’t really care. I’m just glad it works again.

Sidebar 1

The conversion process was pretty simple. For reasons that I no longer remember, I had my VMware disk images in 2 GB slices, so I had to combine them first. VirtualBox supports vmdk images, though, so it was quick to get the new VMs up and running. My CentOS VM worked with no effort. My Windows 7 VM was less happy. I ended up having to reinstall the OS in order for it to boot in anything other than recovery mode. It’s possible that I failed to correctly install something at that time, but the timeline doesn’t support that. In any case, I’m always impressed by the way my virtual and physical Linux machines seem to handle arbitrary hardware changes with no problem.

Sidebar 2

I also learned something about the way the HTTP interactions worked. I’ve never had much reason to pay attention before, but it turns out that the call to the REST API is first met with a 401, then the client sends the authentication and gets a 200. This probably comes as no surprise to anyone who has dealt with HTTP authentication, but it was a lesson for me. Never stop learning.
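If you want to watch that 401-then-200 exchange happen, here is a small self-contained sketch. It uses Python 3’s urllib.request (the successor to urllib2) against a toy server on localhost, not CycleServer. The user name, password, realm, /jobs path, and function name are all invented for the demonstration.

```python
import base64
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Header value the toy server accepts (made-up credentials).
EXPECTED = "Basic " + base64.b64encode(b"user:secret").decode("ascii")
statuses = []  # status codes the toy server sent, in order


class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") == EXPECTED:
            statuses.append(200)
            body = b"3"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # No (or wrong) credentials: challenge the client.
            statuses.append(401)
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="demo"')
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_job_count():
    server = HTTPServer(("127.0.0.1", 0), AuthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/jobs" % server.server_port
        manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        manager.add_password(None, url, "user", "secret")
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),  # ignore any proxy settings
            urllib.request.HTTPBasicAuthHandler(manager),
        )
        # The handler only sends credentials after the server's 401.
        with opener.open(url, timeout=5) as response:
            return response.status, response.read()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_job_count())
    print(statuses)
```

From the caller’s side it looks like a single request. The challenge and the retry only show up in the server’s record of what it sent.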

Sidebar 3

I didn’t mention this in the text above, so if you made it this far, I applaud your dedication to reading the whole post. The first half of my time spent on this problem was spent ruling out a self-inflicted wound. I had already spent a fair amount of time tracking down a bug I introduced trying to de-lint one of the modules. More on that in a later (and hopefully shorter) post.

Perl’s CGI.pm popup_menu cares how you give it data

Last weekend when I was working on the script that mirrors and presents radar data for mobile use, I decided the less work I had to do, the better. To that end, I tried to make heavy use of the CGI.pm Perl module. In addition to handling the CGI input, CGI.pm also prints regular HTML tags, so you can avoid having to throw a bunch of HTML markup in your print statements. This makes for much cleaner code and reduces the chances you’ll make a silly formatting mistake.

Everything was going well until I added the popup menu to select the radar product. Initially I followed the example in the documentation and it worked. As I went on, I decided instead of having two hashes for the product information, it made sense to make my hash include not only the product description, but the URL pattern I’d be using when it came time to mirror the image. Unfortunately, when I tried to make that change, my popup form no longer had the labels I wanted.

I kept poking at it for a while and finally got frustrated to the point where I decided I’d just write a foreach and have that part print the HTML markup instead of using CGI.pm functions. Fortunately, I first talked to my friend Mike about it. I sent him the code and after a little bit of working, he realized what my problem was. CGI.pm’s popup_menu function expects a reference to a hash for labels, not an array (I’m not really sure why, maybe someone can explain it?). Once that was settled, the script worked as expected and the remainder was finished in short order.

Sometimes, it really helps to pay attention to the data type that a function expects.