HTCondor Week 2015

There are many reasons I enjoy the annual gathering of HTCondor users, administrators, and developers. Some of those reasons involve food and alcohol, but mostly it’s about the networking and the knowledge sharing.

Unlike many other conferences, HTCondor Week is nearly devoid of vendors. I gave a presentation on behalf of my company, and AWS was present this year, but it wasn’t a sales pitch in either case. The focus is on how HTCondor enabled research. I credit the project’s academic roots.

Every year, themes seem to develop. This year, the themes were cloud and caching. Cloud offerings seem to really be ready to take off in this community, even though Miron would say that the cloud is just a different form of grid computing that’s been done for decades. The ability to scale well beyond internal resources quickly and cheaply has obvious appeal. The limiting factor at the moment seems to be that university funding rules make paying for cloud resources more complicated for academic researchers than simply pulling out a credit card.

In the course of one session, three different caching mechanisms were discussed. This was interesting because it is not something that’s been discussed much in the past. It makes sense, though, that caching files common across multiple jobs on a node would be a big improvement in performance. I’m most partial to Zach Miller’s fledgling HTCache work, though the Squid cache and CacheD presentations had their own appeal.

Todd Tannenbaum’s “Talk of Lies” spent a lot of time talking about performance improvements that have been made in the past year, but the developers really ought to congratulate themselves more. I’ve seen big improvements from 8.0 to 8.2, and it looks like even more will land in 8.4. There’s some excellent work planned for the coming releases, and I hope it pans out.

After days of presentations and conversations, my brain is full of ideas for improving my company’s products. I’m really motivated to make contributions to HTCondor, too. I’m even considering carving out some time to work on that book I’ve been wanting to write for a few years. Now that would truly be a miracle.

Cores or machines?

Back in February, Pete Cheslock quipped “100,000 cores – cause it sounds more impressive than 2000 servers.” Patrick Cable pointed out that HPC people in particular talk in cores. I told them both that the “user perspective is all about cores. The number of machines it takes to provide them means squat.” Andrew Clay Shafer disagreed, with a link to some performance benchmarks.

He’s technically correct (the best kind of correct), but misses the point. Yes, there are performance impacts when the number of machines changes (interestingly, fewer machines is better for parallel jobs, while more machines is better for serial jobs), but that’s not necessarily of concern to the user. Data movement and other constraints can wash out any performance differences the machine count introduces.

But really, the concern with core count is misplaced, too. What should really be of concern to the user is the time-to-results. It’s up to the IT service provider to translate that need into technical requirements (this is more true for operational computing than for research, and it depends on the workload having a fair degree of predictability). The user says “I need to do X amount of computation and get the results in Y amount of time.” Whether that happens on one huge machine or on ten thousand small machines doesn’t really matter. This plays well into cloud environments, where you can use a mixture of instance types to get to the size you need.

Cloud detente

Evident.io founder and CEO Tim Prendergast wondered on Twitter why other cloud service providers aren’t taking marketing advantage of the Xen vulnerability that led Amazon and Rackspace to reboot a large number of cloud instances over a few-day period. Digital Ocean, Azure, and Google Compute Engine all use other hypervisors, so isn’t this an opportunity for them to brag about their security? Amazon is the clear market leader, so pointing out this vulnerability is a great differentiator.

Except that it isn’t. It’s a matter of chance that Xen is the hypervisor facing an apparently serious and soon-to-be-public exploit. Next week it could be Microsoft’s Hyper-V. Imagine the PR nightmare if Microsoft bragged about how much more secure Azure is, only to see a major exploit strike Hyper-V a week later. It would be even worse if the exploit were active in the wild before patches could be applied.

“Choose us because of this Xen issue” is the cloud service provider equivalent of an airline running a “don’t fly those guys, they just had a plane crash” ad campaign. Just because your competition was unlucky this time, there’s no guarantee that you won’t be the loser next time.

I’m all for companies touting legitimate security features. Amazon’s handling of this incident seems pretty good, and I think they generally do a good job of giving users the ability to secure their environment. That doesn’t mean someone can’t come along and do it better. If there’s anything 2014 has taught us, it’s that we have a long road ahead of us when it comes to the security of computing.

It’s to the credit of Amazon’s competition that they’ve remained silent. It shows a great degree of professionalism. Digital Ocean’s Chief Technology Evangelist John Edgar had the best explanation for the silence: “because we’re not assholes mostly.”

FAQs are not the place to vent

I’ve spent a lot of my professional life explaining technical concepts to not-necessarily-very-technical people. Most of the time (but sadly not all of it), it’s because the person doesn’t need to fully understand the technology; they just need to know enough to do their job effectively. I understand how frustrating it can be to answer what seems like an obvious question, and how the frustration compounds when the question is repeated. That’s why we maintain FAQ pages: so we can give a consistently friendly answer to a common question.

You can imagine my dismay when my friend Andy shared an FAQ entry he found recently. A quantum chemistry application’s FAQ page includes this question: “How do I choose the number of processors/How do I setup my parallel calculation?” It’s a very reasonable question to ask. Unfortunately, the site answers it thusly: “By asking this question, you demonstrate your lack of basic understanding of how parallel machines work and how parallelism is implemented in Quantum ESPRESSO. Please go back to the previous point.”

The previous question is similar and has an answer of “See Section 3 of the User Guide for an introduction to how parallelism is implemented in Quantum ESPRESSO”. Now that’s a pretty good answer. Depending on the depth of information in Section 3, it might be possible to answer the question directly on the FAQ page with an excerpt, but at least pointing the visitor to the information is a good step.

I don’t understand getting frustrated with a repeated FAQ. If the answers are so similar, copy and paste them. Or combine the questions. FAQs, user guides, and the like are great because you can compose them in a detached manner and edit them to make sure they’re correct, approachable, and not jerkish. FAQs are an opportunity to prevent frustration, not to express it.

How to use HTCondor’s CLAIM_WORKLIFE to optimize cluster throughput

Not only do I write blog posts here, I also occasionally write some for work. The Cycle Computing blog just posted “How to use HTCondor’s CLAIM_WORKLIFE to optimize cluster throughput”. This began as a conversation I had with one of our customers, and I decided it was worth expanding on and sharing with the wider community.
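The short version, for the impatient: CLAIM_WORKLIFE tells a startd how long a claimed slot may keep accepting new jobs from the same schedd before it has to go back through negotiation. A rough sketch of the kind of tuning the post talks about (the value here is purely illustrative, not a recommendation):

# In the execute nodes' condor_config.local; shorter values return slots to the
# negotiator sooner, longer values cut matchmaking overhead for streams of short jobs
CLAIM_WORKLIFE = 600

condor_config_val CLAIM_WORKLIFE   # check what the pool is actually using
condor_reconfig                    # pick up the change without restarting the daemons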

Amazon VPC: A great gotcha

If you’re not familiar with the Amazon Web Services offerings, one feature is the Virtual Private Cloud (VPC). VPC is effectively a way of walling yourself off from all or part of the world. If you’re running a public-facing web server, it might not be so important. If you’re running a compute cluster, it’s a no-brainer. Just be careful about that “no-brainer” part.

While working on a new cluster for a customer today, I was trying to figure out why the HTCondor scheduler wasn’t showing up to the collector. The daemons were all running. HTCondor security policies weren’t getting in the way. I could use condor_config_val from each host to query the other host. I brought in a colleague to double-check me. He couldn’t figure it out either.

After beating our heads against the wall for a while, I finally noticed one tiny detail in the logs. The schedd kept saying it was updating the collector, but the collector never seemed to notice. More specifically, the schedd kept saying it was updating the collector via UDP. How many times had I watched that line go by?

The last time, though, it clicked. And it clicked hard. I had set up a security group to allow all traffic within the VPC. Except I had set it for all TCP traffic, so the UDP packets were being silently dropped. As UDP packets are wont to do. When I changed the security group rule from TCP to all protocols, the scheduler magically appeared in the pool.
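For the curious, the fix boils down to a rule like this (a sketch with a made-up security group ID, not the exact command I ran):

# Allow ALL protocols, not just TCP, between instances in the cluster's security group
aws ec2 authorize-security-group-ingress --group-id sg-12345678 \
    --ip-permissions '[{"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": "sg-12345678"}]}]'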

Once again, the moral of the story is: don’t be stupid.

Parsing SGE’s qacct dates

Recently I was trying to reconstruct a customer’s SGE job queue to understand why our cluster autoscaling wasn’t working quite right. The best way I found was to dump the output of qacct and grep for {qsub,start,end}_time. Several things made this unpleasant. First, the output is not de-duplicated on job id, so jobs that span multiple hosts get listed multiple times. Second, the dates are in a nearly-but-not-quite “normal” format. For example: “Tue Mar 18 13:00:08 2014”.
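If you want to try something similar, the dump amounts to a one-liner along these lines (flags from memory, so double-check them against your qacct man page):

# List accounting records for every job, keeping only the job number and the timestamps of interest
qacct -j | grep -E '^(jobnumber|qsub_time|start_time|end_time)'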

What can you do with that? Not a whole lot. It’s not a format that spreadsheets will readily treat as a date, so if you want to do spreadsheety things, you’re forced to either manually enter them or write a shell function to do it for you:

# Convert a qacct timestamp ($1) into an Excel-friendly date formula
function qacct2excel { echo "=`date -f '%a %b %d %T %Y' -j \"$1\" +%s`/(60*60*24)+\"1/1/1970\""; }

The above works on OS X because it uses a non-GNU date command. On Linux, you’ll need a different set of arguments, which I haven’t bothered to figure out. It’s still not awesome, but it’s slightly less tedious this way. At some point, I might write a parser that does what I want qacct to do, instead of what it does.
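That said, my guess is that GNU date will parse the timestamp directly with -d, so the Linux version is probably something like this (untested):

# Same idea with GNU date: -d parses the qacct timestamp without an explicit format string
function qacct2excel { echo "=$(date -d "$1" +%s)/(60*60*24)+\"1/1/1970\""; }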

It’s entirely possible that there’s a better way to do this. The man page didn’t seem to have any helpful suggestions, though. I hate to say “SGE sucks” because I know very little about it. What I do know is that it’s hard to find good material for learning about SGE. At least HTCondor has thorough documentation and tutorials from HTCondor Week posted online. Perhaps one of these days I’ll learn more about SGE so I can determine whether it sucks or not.

I’m famous, sorta

One of my co-workers happens to be a co-host of “Food Fight”, a DevOps podcast. Last week, he asked for someone to join in for a crossover episode with “RCE”. When nobody else volunteered, he roped me into it. It turned out to be pretty awesome; I would have loved to extend the conversation a few more hours. With any luck, I’ll re-appear on one of those shows sometime. As you may already be aware, one of my goals is for Leo Laporte to personally invite me to the TWiT Brickhouse to get drunk with him on an episode of “This Week in Tech.” I feel like I’ve moved a little closer today.

Anyway, here are the links:

Extending rivalries to HPC

In October, Indiana University announced it would purchase a Cray XK7 named “Big Red II”. With a theoretical peak of just over 1 petaFLOPS, it would be the fastest university-owned (that is, not associated with a national center) cluster. Of course, in-state rivals would never let that stand. In the latest Top 500 list, unveiled at the International Supercomputing Conference, Big Red II ranks a very respectable 46th. Unfortunately for them, Purdue University’s new Conte cluster checked in at 28th. Oops! Let’s compare:

Cluster     | Cost          | Theoretical performance | LINPACK performance | Cost per benchmarked TFLOPS
Big Red II  | $7.5 million  | 1000.6 TFLOPS           | 597.4 TFLOPS        | $12.55k / TFLOPS
Conte       | $4.3 million  | 1341.1 TFLOPS           | 943.4 TFLOPS        | $4.56k / TFLOPS
Comparison  | 57.33%        | 134.03%                 | 157.92%             | 36.33%

It’s clear that Conte is the winner in performance and cost. But what about value? Both of these clusters have accelerators: Big Red II uses Nvidia GPUs and Conte uses Intel’s Phi (which also powers China’s new Tianhe-2, far and away the fastest cluster in the world). Using the GPU requires writing code in the CUDA language, whereas the Phi will run native x86 code. That lowers the barrier to entry for users on the Phi, but GPUs seem to win in most benchmarks. The GPU approach would seem to increase the cost of providing user support, though it may be that IU’s users are already prepared to run on the GPU. All of the performance numbers in the world won’t matter if the clusters aren’t used, and only time will tell which cluster provides a better value.

What may end up being a more interesting result is the political ramifications. Will the statehouse be okay with the two main state universities both running expensive high performance computing resources? If not, who will get to carry on? Both institutions have a record of success. Indiana ranked as high as #23 on the June 2006 list, but Big Red II is the first Top 500 system there since November 2009. Meanwhile, Purdue has had at least one system (and as many as three) on every list since November 2008. With Conte and the additional clusters in operation, Purdue has much greater capacity, but that doesn’t mean that IU’s system is a waste. I suspect that as long as both universities are bringing in enough grant money to justify the cost of their clusters, nobody in Indianapolis will care to put a stop to this rivalry. In the meantime, it appears that Purdue will remain the dominant HPC power in the state, as it is on the football field.

Treating clusters as projects

Last month, HPCWire ran an article about the decommissioning of Los Alamos National Lab’s “Roadrunner” cluster. In this article was a quote from LANL’s Paul Henning: “Rather than think of these machines as physical entities, we think of them as projects.” This struck me as being a very sensible position to take. One of the defining characteristics of a project is that it is of limited duration. Compute clusters have a definite useful life, limited by the vendor’s hardware warranty and the system performance (both in terms of computational ability and power consumption) relative to new systems.

Furthermore, the five PMBOK process groups all come into play. Initiation and Planning happen before the cluster is installed. Execution could largely be considered the building and acceptance-testing portion of a cluster’s life. The operational period is arguably Monitoring and Control. Project closeout, of course, is the decommissioning of the resource. Smaller projects, such as software updates and planned maintenance, occur within the larger project. Perhaps it is better to think of each cluster as a program?

The natural extension of considering a cluster (or any resource, for that matter) to be a project is assigning a project manager to each project. This was a key component of a staffing plan that a former coworker and I came up with for my previous group. With five large compute resources, plus a few smaller resources and the supporting infrastructure, it was very difficult for any one person to know what was going on. Our proposal included having one engineer assigned as the point of contact for a particular resource. This person didn’t have to fix everything on that cluster, but they would know about every issue and all of the unique configuration. That way, time wouldn’t be wasted repeating the same diagnostic steps when a recurring issue pops up three months later.