HTCondor Week 2015

There are many reasons I enjoy the annual gathering of HTCondor users, administrators, and developers. Some of those reasons involve food and alcohol, but mostly it’s about the networking and the knowledge sharing.

Unlike many other conferences, HTCondor Week is nearly devoid of vendors. I gave a presentation on behalf of my company, and AWS was present this year, but it wasn’t a sales pitch in either case. The focus is on how HTCondor enables research. I credit the project’s academic roots.

Every year, themes seem to develop. This year, the themes were cloud and caching. Cloud offerings seem to really be ready to take off in this community, even though Miron would say that the cloud is just a different form of grid computing that’s been done for decades. The ability to scale well beyond internal resources quickly and cheaply has obvious appeal. The limiting factor currently seems to be that university funding rules make it slightly more difficult for academic researchers than just pulling out a credit card.

In the course of one session, three different caching mechanisms were discussed. This was interesting because it’s not a topic that has come up much in the past. It makes sense, though, that caching files common across multiple jobs on a node would be a big improvement in performance. I’m most partial to Zach Miller’s fledgling HTCache work, though the squid cache and CacheD presentations had their own appeal.

Todd Tannenbaum’s “Talk of Lies” spent a lot of time talking about performance improvements that have been made in the past year, but they really need to congratulate themselves more. I’ve seen big improvements from 8.0 to 8.2, and it looks like even more will land in 8.4. There’s some excellent work planned for the coming releases, and I hope it pans out.

After days of presentations and conversations, my brain is full of ideas for improving my company’s products. I’m really motivated to make contributions to HTCondor, too. I’m even considering carving out some time to work on that book I’ve been wanting to write for a few years. Now that would truly be a miracle.

78% of companies “run on open source”

Black Duck Software recently released the results of their annual Future of Open Source survey. On the surface, it looks pretty good. As the title of this post says, 78% of companies “run on open source”. Open source usage has doubled in business IT environments since 2010. Two-thirds consider open source offerings before their proprietary counterparts.

Not only are companies using open source software, they’re contributing, too. Some 64% of companies participate, with nearly 90% expecting to increase their contributions in the next few years. Half of the companies say more than 50% of their engineers are working on open source. Many companies see open source participation as a recruiting tool as well.

But when you dig a little deeper, there are some issues, too. A majority of companies view open source software as having higher quality and security, but most don’t monitor the code for vulnerabilities. Companies lack formalized policies both for consumption and contribution. A lot of the terms are pretty vague, too. “Participation” in open source can take on a variety of meanings, some of which are basically token involvement for PR purposes.

What I found most interesting, though, was the projects listed as “most valuable”: OpenStack and Docker. I may be biased by my day job, but I see that as a sign of the rise of *aaS. Despite the growth that cloud services have already seen, there’s a lot more market out there to be tapped.

Another interesting item was the increase in venture capital investment, both in gross and per-deal measures. Hopefully, this reduces the issues faced by projects such as OpenSSL and PGP, where a lack of funding puts much of the Internet’s secure communication at risk.

Finally, my initial reaction to the headline was “the other 22% do and don’t know it.” As it turns out, I wasn’t that far off. Black Duck reported that 3% of respondents do not use open source software at all. (Where’s the remaining 19%?) I actually wonder if that’s true. It seems like you’d have to try pretty hard to avoid any of it. This will become increasingly true as time goes on, as even historically hostile companies like Microsoft begin open sourcing some of their products.

How I shot myself in the foot with pylint

I mentioned this in passing in a recent post, but I thought I deserved to make fun of myself more fully here. One of the things I’ve tried to do as I work on code I’ve inherited from predecessors is to clean it up a bit. I’m not a computer science expert, so by “clean up”, I mostly mean style issues as opposed to improving the data structures or anything truly useful like that.

Most of the development I do is on Python code that gets compiled into Windows executables and run as an actuarial workflow. I discovered early on in the process that if I’m working on code that runs toward the end of the workflow, having to wait 20 minutes just to find out that I made some dumb syntax or variable name error is really annoying. I got in the habit of running pylint before I compiled to help catch at least some of the more obvious problems.

Over time, I decided to start taking action on some of the pylint output. Recently, I declared war on variables named “str”. Since str() is a Python function, pylint rightly complained about it. Since the method that used “str” did string replacement, I opted for the still-not-great-but-at-least-not-terrible “string”. I replaced all of the places “str” appeared as a variable and went about my business.

As I was testing some other changes, I noticed that some of my path replacement was failing (though I didn’t know that’s where it was at first). So I shoved a whole bunch of logger calls into the “prepare” script to see where exactly it was failing. Once I narrowed it down, I shoved more into the module where the failure happened, working down through several method calls before I finally found it.

There was still one instance of “str” there, but now Python thought it was the str() builtin and got really confused. In hindsight, it should have been totally obvious that I had inflicted this pain on myself, but several days had passed and I had forgotten that I had messed around in that function. I should have consulted the revision history sooner.
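A minimal reconstruction of that failure mode (the function and path names are made up, not my actual code): after the rename misses one spot, a comparison against the leftover “str” now evaluates against the builtin type instead of a string, so it’s silently False and the replacement never fires.

```python
# Before the rename: pylint flags "str" for shadowing the builtin, but the
# code actually works, because the local variable hides the type object.
def fix_path_before(path):
    str = path                    # pylint: redefined-builtin, rightly so
    if str == "C:/old/root":      # compares two strings
        str = "D:/new/root"
    return str

# After the rename, with one instance missed: the comparison now pits the
# *type* str against a string, which is always False -- no error, no
# replacement, just quietly wrong output.
def fix_path_after(path):
    string = path
    if str == "C:/old/root":      # oops: missed during the rename
        string = "D:/new/root"    # never reached
    return string

print(fix_path_before("C:/old/root"))  # -> D:/new/root
print(fix_path_after("C:/old/root"))   # -> C:/old/root, unchanged
```

The nasty part is that nothing raises: the bug only shows up downstream, exactly as described above.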

Building my website with blatter

I recently came across a project called “blatter”. It’s a Python script that uses jinja2’s template engine to build static websites. This is exactly the sort of thing I’d been looking for. I don’t do anything too fancy with FunnelFiasco.com, but every once in a while I want to make a change across all (or at least most) pages. For example, I recently updated the default content license from CC BY-NC-SA 3.0 United States to CC BY-NC-SA 4.0 International. It’s a relatively minor change, but changing it everywhere is a real pain.

Sure, I could switch to a real CMS (heck, I already have WordPress installed!) or re-do the site in PHP, but that sounded too much like effort. I like my static pages that are artisanally hand-crafted slapped together in vi, but I also like being able to make lazy changes. And I really like page-to-page consistency. With blatter, I can create a few small templates and suddenly changes can be made across the whole site in just a few seconds.
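I haven’t reproduced blatter’s actual invocation here (its command-line interface is its own thing), but the jinja2 machinery underneath looks roughly like this: a base template owns the shared bits, so a site-wide change like the license bump lives in exactly one place.

```python
# Sketch of jinja2 template inheritance; the template names and footer text
# are illustrative, not my site's real templates.
from jinja2 import Environment, DictLoader

templates = {
    # The base template holds everything shared across pages.
    "base.html": (
        "<html><body>{% block content %}{% endblock %}"
        "<footer>Licensed under {{ license }}</footer></body></html>"
    ),
    # Each page only supplies its own content block.
    "page.html": (
        '{% extends "base.html" %}'
        "{% block content %}<p>Hello</p>{% endblock %}"
    ),
}

env = Environment(loader=DictLoader(templates))
html = env.get_template("page.html").render(
    license="CC BY-NC-SA 4.0 International"
)
print(html)
```

Change the footer in `base.html`, rebuild, and every page picks it up, which is the whole appeal.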

Blatter smoothly merges static and templated content. The only downside is that because it seems to touch all files every time it builds (blats), pushing the new content to my website becomes a larger task. That’s not a huge concern because of the relatively small size of the content, but it’s something that seems fixable. So pretty much all of the site has been blatterized now. For the most part, you shouldn’t really notice any changes.

Communicating weather safety information

Weather is complicated and hyper-local. The general public often lacks a basic understanding of weather evolution and people are generally bad at risk assessment. These facts combined make it really hard to provide general safety advice. It’s made even harder by the fact that if you give bad advice, you may be responsible for injury or death.

What to do when you’re in a car and a tornado is coming is perhaps the epitome of this issue. The National Weather Service office in Kansas City recently posted a scenario to its Facebook page. I saw some dismay expressed about how many people said they’d keep driving in that scenario. But here’s the kicker: I think that’s (conditionally) the right answer.

In the scenario, you’re smack in the middle of a six-mile stretch of interstate highway that’s expected to be impacted by a tornado in 15 minutes, and you’re at an exit. The overpass is clearly the wrong answer. A very good answer would be to go to one of the gas stations or restaurants in the picture and seek shelter there. A car is about the worst place to be in a tornado, so why did I say “keep driving” is the right answer?

Let’s assume you’re traveling at 60 miles per hour. In three minutes, you’ve reached the edge of the warned area. The tornado won’t reach that area for another 12 minutes. Of course, there’s likely some error in the projection, but even if the forward motion is twice what was stated, you still have a cushion of over three minutes. If, in addition, the danger area is twice as large as stated, you still have 30 seconds. That’s cutting it too close, but we’re being really conservative here.
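The back-of-the-envelope math above, written out:

```python
# How much cushion is left after driving clear of the warned stretch?
# Mid-stretch of a 6-mile warning means 3 miles to the edge.
def cushion_minutes(miles_to_clear, tornado_eta_min, speed_mph=60):
    drive_min = miles_to_clear / (speed_mph / 60.0)  # minutes spent driving
    return tornado_eta_min - drive_min

# As stated: 3 miles at 60 mph, tornado in 15 minutes.
print(cushion_minutes(3, 15))    # 12-minute cushion
# Tornado moving twice as fast as projected (arrives in 7.5 minutes):
print(cushion_minutes(3, 7.5))   # still over three minutes
```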

Now let’s look at all of the underlying assumptions that I made. First, I assume that you can safely travel at normal speed the necessary distance. This means no traffic, accidents, construction zones, or debris from earlier storms. In some places, you’d probably have sufficient visibility to make that determination, but certainly not in all places, and not in the picture shown. Second, I assume that you are just passing through. If you’re 10 minutes from home, it might be tempting to try to get there, but that eats into a lot of your safety buffer. Third, I assume you’re traveling south or that the main part of the supercell (another assumption) does not contain heavy rain or large hail that would slow you down or cause damage/injury on its own.

What would I do in that situation? It would depend on my familiarity with the area, my awareness of the storm type and evolution, and (most importantly), my ability to process it all quickly enough.

What should you do in that situation? See above. The best default answer is to seek shelter in one of the buildings off the exit, but that’s not always the best answer.

Book Review: The Open Organization

Full disclosure: I own a small number of shares in Red Hat.

Three years after Red Hat became the first open source company to reach a billion dollars in annual revenue, CEO Jim Whitehurst published a tell-all book about his company. The Open Organization barely mentions the technology involved in Red Hat’s success, although Whitehurst holds a bachelor’s degree in computer science. The Open Organization, as the title suggests, is about the organizational culture of Red Hat that enables its success.

Whitehurst describes his time at Red Hat as a learning experience that made him a better leader. Previously, he had been the successful Chief Operating Officer at Delta Air Lines, guiding that company through bankruptcy and revival in the wake of the September 11 terror attacks. The organizational structure of Delta is described as “top down”, typical of most large companies.

Such a structure arises from and promotes risk aversion and central control. Red Hat prefers a bottom-up approach where employees are given wide latitude to make decisions. The role of the CEO becomes motivator and context-setter, while accountability is handled by social pressure.

However, the bottom-up approach cannot be truly described as a democracy, a point that Whitehurst emphasizes repeatedly. Red Hat follows a “the best idea wins, no matter where it comes from” policy, but Whitehurst makes it clear that ideas have to be solicited, too. Employees have different preferences about communication, and they need different ways to provide their ideas.

In describing Red Hat’s culture across seven chapters, Whitehurst doesn’t prescribe its specifics for every other organization. In chapter 7, he acknowledges that Red Hat is still a work in progress. Nonetheless, the broader principles are applicable. Whitehurst cites examples from other companies across a variety of industries to demonstrate that it’s not only software companies that can follow Red Hat’s example.

The Open Organization is a well-written book that turns out to be an easy read. Unlike many management books, it focuses on practical effects instead of theory and provides numerous examples. The content is well laid out, establishing the “why” before moving on to the “what” and finally the “how”.

My main complaint is that Whitehurst does not address the potential criticisms of Red Hat’s method. The blunt and argumentative (though generally collegial) nature will not appeal to everyone. Furthermore, the way the company aggressively defends its culture (a phenomenon described in several places) prevents whimsical change, but it could also discourage appropriate changes from the outside.

Nevertheless, The Open Organization is an excellent book for leaders at any level of an organization. I strongly recommend it as a guide to opening up your own organization. Pick and choose what works for you.

The Open Organization is scheduled to be released on June 2. It is published by Harvard Business Review Press.

The strangest bug

Okay, this is probably not the strangest bug that ever existed, but it’s certainly one of the weirdest I’ve ever personally come across. A few weeks ago, a vulnerability in OS X was announced that affected all versions but was only fixed in Yosemite. That was enough to finally get me to upgrade from Mavericks on my work laptop. I discovered post-upgrade that the version of VMWare Fusion I had been running does not work on Yosemite. Since VMWare didn’t offer a free upgrade path, I decided not to spend the company’s money and switched to VirtualBox instead (see sidebar 1).

Fast forward to the beginning of last week when I started working on the next version of my company’s Risk Analysis Pipeline product. One of the executables is a small script that polls CycleServer to count the number of jobs left in a particular submission and blocks the workflow until the count reaches 0. It’s been pretty reliable since I first wrote it a year ago, and hasn’t seen any substantial changes.
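For illustration, here’s roughly the shape of such a poller. The endpoint path, query parameter, and JSON field are hypothetical stand-ins, not CycleServer’s real API; the point is the block-until-zero loop.

```python
# Hypothetical sketch of a job-count poller that blocks a workflow step
# until a submission drains. Endpoint and field names are made up.
import json
import time
import urllib.request


def wait_for_submission(base_url, submission_id, interval=30, opener=None):
    """Poll until the submission's remaining job count reaches 0."""
    opener = opener or urllib.request.build_opener()
    url = "%s/jobs/count?submission=%s" % (base_url, submission_id)
    while True:
        with opener.open(url) as resp:
            remaining = json.load(resp)["count"]
        if remaining == 0:
            return  # workflow may proceed
        time.sleep(interval)
```

The `opener` argument is there so credentials (or, as it turned out below, a stub for testing) can be swapped in.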

Indeed, it saw no changes at all when I picked up development again last week, but I started seeing some unusual behavior. The script would poll successfully six times and then fail every time afterward. After adding some better logging, I saw that it was failing with HTTP 401, which didn’t make sense because it sent the credentials every time (see sidebar 2). I checked the git log to confirm that the file hadn’t changed. I spent some time fruitlessly searching for the error. I threw various strands of spaghetti at the wall. All to no avail.

I knew it had to work generally, because it’s the sort of thing that would be very noticeable to our customers. Particularly the part where this sort of failure would mean the workflow never completed. I wondered if something changed when I switched from VMWare Fusion to VirtualBox. After all, I did change the networking setup a bit when I did that, but I would expect the failure to be consistent in that case. (Well, to always fail, not to work six times before failing.)

So I tried the patch release I had published a few days before. It worked fine, which ruled out my local test server being broken. Then I checked out the git tag of that patch release and recompiled. The rebuild failed in the same way. This was very perplexing, since I had released the patch version after the OS X upgrade and resulting VM infrastructure changes.

Out of ideas, I followed a colleague’s suggestion to reinstall Python. I re-ran the Python installer and built again. Suddenly, it worked. I’m at a loss to explain why. Maybe there was something different enough about the virtualized network devices that caused py2exe to get confused when it built. Maybe there’s some sort of counter in urllib2 that implements the plannedObsolescence() method. Whatever it was, I decided I don’t really care. I’m just glad it works again.

Sidebar 1

The conversion process was pretty simple. For reasons that I no longer remember, I had my VMWare disk images in 2 GB slices, so I had to combine them first. VirtualBox supports vmdk images, though, so it was quick to get the new VMs up and running. My CentOS VM worked with no effort. My Windows 7 VM was less happy. I ended up having to reinstall the OS in order for it to boot in anything other than recovery mode. It’s possible that I failed to correctly install something at that time, but the timeline doesn’t support that. In any case, I’m always impressed by the way my virtual and physical Linux machines seem to handle arbitrary hardware changes with no problem.

Sidebar 2

I also learned something about the way the HTTP interactions worked. I’ve never had much reason to pay attention before, but it turns out that the call to the REST API is first met with a 401, then it sends the authentication and gets a 200. This probably comes as no surprise to anyone who has dealt with HTTP authentication, but it was a lesson for me. Never stop learning.
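Here’s a self-contained sketch of that dance using Python 3’s urllib against a toy local server (my actual script used urllib2, and the server here is purely illustrative): the first request goes out without credentials, draws a 401 challenge, and the auth handler transparently retries with the Authorization header.

```python
# Demonstrates HTTP Basic auth's challenge/response round trip.
import base64
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        expected = "Basic " + base64.b64encode(b"user:secret").decode()
        if self.headers.get("Authorization") != expected:
            # First request arrives bare: answer 401 with a challenge.
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="demo"')
            self.end_headers()
        else:
            # Retry carries credentials: answer 200.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

    def log_message(self, *args):  # keep the demo quiet
        pass


server = HTTPServer(("127.0.0.1", 0), AuthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# The auth handler notices the 401 and re-sends with credentials,
# so the caller only ever sees the final 200.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, "user", "secret")
opener = urllib.request.build_opener(
    urllib.request.HTTPBasicAuthHandler(password_mgr)
)
body = opener.open(url).read()  # two round trips under the hood
server.shutdown()
```

From the caller’s perspective it looks like a single successful request, which is exactly why the 401-first behavior went unnoticed for so long.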

Sidebar 3

I didn’t mention this in the text above, so if you made it this far, I applaud your dedication to reading the whole post. The first half of my time spent on this problem was spent ruling out a self-inflicted wound. I had already spent a fair amount of time tracking down a bug I introduced trying to de-lint one of the modules. More on that in a later (and hopefully shorter) post.

Kids these days

A while ago, I heard about a website called Class 120. The service is designed to allow parents, coaches, and the like to see when college students are going to class. It uses the student’s smartphone location and class schedule to determine if the student attended the class or not. When I first heard of this, I rolled my eyes. I suspect many others have a similar reaction.

But the more I think about it, the less objectionable it seems. Going off to college is often the first time a teenager has real independence. It’s unreasonable to expect them all to do well with no experience. A little bit of passive supervision might be just what it takes to turn a $20,000 waste of a year into something more immediately useful. Sure, it could be a crutch, but sometimes crutches are necessary.

Speaking for myself, something like this might have been really useful in 2001. My first year as an undergraduate was pretty lousy academically, in part because I missed a lot of class. I learned my lesson eventually, but at the cost of effectively a wasted year. People have various ways to motivate themselves, some intrinsic, some extrinsic. Who am I to judge?