Why HTCondor is a pretty awesome scheduler

In early March, The Next Platform published an article I wrote about cHPC, a container project aimed at HPC applications. But as I wrote it, I thought about how HTCondor has been addressing a lot of the concerns for a long time. Since I’m in Madison for HTCondor Week right now, I thought this was a good time to explain some of the ways this project is awesome.

No fixed walltime. This is a benefit or a detriment, depending on the circumstances, but most schedulers require the user to define a requested walltime at submission. If the job isn’t done at the end of that time, the scheduler kills it. Sorry about your results, get back in line and ask for more walltime. HTCondor’s flexible configuration allows administrators to enable such a feature if desired. By default users are not forced to make a guess that they’re probably going to get wrong.

Flexible requirements and resource monitoring. HTCondor natively supports user-requestable CPU, memory, and GPU resources. With partitionable slots, resources can be carved up on the fly. And HTCondor has “concurrency limits”, which allow for customizable resource constraints (e.g. software licenses or database connections).

So many platforms. Despite the snobbery of HPC sysadmins, people do real work on Windows. HTCondor has almost-full feature parity on Windows. It also has “universes” for Docker and virtual machines.

Federation. Want to overflow to your friend’s resource? You can do that! You can even submit jobs from HTCondor to other schedulers.

Support for disappearing resources. In the cloud, this is the best feature. HTCondor was designed for resource scavenging on desktops, and it still supports that as a first-class use case. That means machines can come and go without much hassle. Contrast this with other schedulers, where some explicit external action has to happen in order to add or remove a node.

Free as in freedom and free as in beer. Free beer is also the best way to get something from the HTCondor team. But HTCondor is licensed under the Apache 2.0 license, so anyone can use it for any purpose.

HTCondor isn’t perfect, and there are some use cases where it doesn’t make sense (e.g. low-latency), but it’s a pretty awesome project. And it’s been around for over three decades.

Changing how HTCondor is packaged in Fedora

The HTCondor grid scheduler and resource manager follows the old Linux kernel versioning scheme: for release x.y.z, if y is an even number it’s a “stable” series that gets only bugfixes, while behavior changes and major features go in the odd-numbered y series. For a long time, the HTCondor packages in Fedora used the development series. However, this leads to a choice between introducing behavior changes when a new development HTCondor release comes out or pinning a Fedora release to a particular HTCondor release, which means no bugfixes.

This ignores the Fedora Packaging Guidelines, too:

As a result, we should avoid major updates of packages within a stable release. Updates should aim to fix bugs, and not introduce features, particularly when those features would materially affect the user or developer experience. The update rate for any given release should drop off over time, approaching zero near release end-of-life; since updates are primarily bugfixes, fewer and fewer should be needed over time.

Although the HTCondor developers do an excellent job of preserving backward compatibility, behavior changes can happen between x.y.1 and x.y.2. HTCondor is not a major part of Fedora, but we should still attempt to be good citizens.
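The even/odd rule described above is easy to check mechanically. Here is a minimal Python sketch, assuming version strings of the form x.y or x.y.z; the function name is_stable_series is made up for illustration and is not part of HTCondor or any Fedora tooling.

```python
def is_stable_series(version):
    """Return True if the second number of an x.y[.z] version string is even."""
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError("expected a version like x.y or x.y.z, got %r" % version)
    # Under the even/odd scheme, an even y marks a stable series
    # and an odd y marks a development series.
    return int(parts[1]) % 2 == 0


if __name__ == "__main__":
    for v in ("8.6", "8.3.8", "8.5.0"):
        print(v, "stable" if is_stable_series(v) else "development")
```

By this rule, 8.6 falls in a stable series, while 8.3.8 and 8.5.0 are development releases.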

After discussing the matter with upstream and the other co-maintainers, I’ve submitted a self-contained change for Fedora 25 that will

  1. Upgrade the HTCondor version to 8.6
  2. Keep HTCondor in Fedora on the stable release series going forward

Most of the bug reports against the condor-* packages have been packaging issues and not HTCondor bugs, so upstream isn’t losing a massive testing resource here. I think this will be a net benefit to Fedora since it prevents unexpected behavior changes and makes it more likely that I’ll package upstream releases as soon as they come out.

Hints for using HTCondor’s credd and condor_store_cred

HTCondor has the ability to run jobs as either an unprivileged “nobody” user or as the submitting user. On Linux, enabling this is fairly easy: the administrator just sets the UID_DOMAIN configuration to the same value on the submit and execute machines and away you go. On Windows, you need to run the credential daemon (condor_credd) and the user must store credentials using condor_store_cred.

The manual does a pretty good job of describing the basic setup of the credd, though there are some important pieces missing. With help from HTCondor technical lead Todd Tannenbaum, I’ve submitted some improvements to the docs, but in the meantime…

The main thing to consider when configuring your pool to use the credd is that it wants things to be secure. That makes sense, considering its entire job is to securely store and transfer user credentials. The credd will not hand out the password unless the client is authenticated and using a secure connection. The method of authentication is not important (if you really, really trust your network, you can use the CLAIMTOBE method), so long as authentication occurs somehow.

So where do the condor_store_cred hints come in? Often, the credd runs on the same machine as the schedd, and users log in to there to submit jobs. In that case, everything’s probably fine. But if you’re submitting jobs from a machine outside the pool (for example, a user’s workstation), it can get a little hairier.

Before running condor_store_cred, HTCondor needs to be told where to look for the credd, and the client security settings need to meet the credd’s requirements described above. (I’m using CLAIMTOBE here for simplicity.) If the machine the user submits from is not in the pool, condor_store_cred will need to know where to find the collector, too.

CREDD_HOST = scheduler.example.com
COLLECTOR_HOST = centralmanager.example.com
SEC_CLIENT_AUTHENTICATION_METHODS = CLAIMTOBE
SEC_CLIENT_AUTHENTICATION = PREFERRED
SEC_CLIENT_ENCRYPTION = PREFERRED

As of this writing, condor_store_cred gives an unhelpful error message if something goes wrong. It will always say “Make sure your ALLOW_WRITE setting includes this host.”, so if your ALLOW_WRITE setting already includes the host in question, you might get stuck. Use the -debug option to get better output. For example:

02/16/16 12:23:51 STORE_CRED: In mode 'query'
02/16/16 12:23:51 Warning: Collector information was not found in the configuration file. ClassAds will not be sent to the collector and this daemon will not join a larger Condor pool.
02/16/16 12:23:51 STORE_CRED: Failed to start command.
02/16/16 12:23:51 STORE_CRED: Unable to contact the REMOTE schedd.

This tells you that you forgot to set the COLLECTOR_HOST in your configuration.

Another hint is that if your scheduler name is different from the machine name (e.g. if you run multiple condor_schedd processes on a single machine and have Q1@hostname, Q2@hostname, etc.), you might need to include “-name Q1@hostname” in the arguments. Unlike most other HTCondor client commands, you cannot specify a “sinful string” as a target using the “-addr” option.

Hopefully this helps you save a little bit of time getting run_as_owner working on your Windows pool, until such time as I sit down to write that “Administering HTCondor” book that I’ve been meaning to work on for the last 5 years.

HTCondor 8.3.8 in Fedora repos

It’s only been a month-plus since HTCondor 8.3.8 was released, but I finally have the Fedora packages updated. Along the way, I fixed a couple of outstanding bugs in the Fedora package. The builds are in the updates-testing repo, so test away!

As of right now, upstream plans to release HTCondor 8.5.0 early next week, so I got caught up just in time.

HTCondor Week 2015

There are many reasons I enjoy the annual gathering of HTCondor users, administrators, and developers. Some of those reasons involve food and alcohol, but mostly it’s about the networking and the knowledge sharing.

Unlike many other conferences, HTCondor Week is nearly devoid of vendors. I gave a presentation on behalf of my company, and AWS was present this year, but it wasn’t a sales pitch in either case. The focus is on how HTCondor enabled research. I credit the project’s academic roots.

Every year, themes seem to develop. This year, the themes were cloud and caching. Cloud offerings seem to really be ready to take off in this community, even though Miron would say that the cloud is just a different form of grid computing that’s been done for decades. The ability to scale well beyond internal resources quickly and cheaply has obvious appeal. The limiting factor currently seems to be that university funding rules make it slightly more difficult for academic researchers than just pulling out a credit card.

In the course of one session, three different caching mechanisms were discussed. This was interesting because it is not something that’s been discussed much in the past. It makes sense, though, that caching files common across multiple jobs on a node would be a big improvement in performance. I’m most partial to Zach Miller’s fledgling HTCache work, though the squid cache and CacheD presentations had their own appeal.

Todd Tannenbaum’s “Talk of Lies” spent a lot of time talking about performance improvements that have been made in the past year, but they really need to congratulate themselves more. I’ve seen big improvements from 8.0 to 8.2, and it looks like even more will land in 8.4. There’s some excellent work planned for the coming releases, and I hope it pans out.

After days of presentations and conversations, my brain is full of ideas for improving my company’s products. I’m really motivated to make contributions to HTCondor, too. I’m even considering carving out some time to work on that book I’ve been wanting to write for a few years. Now that would truly be a miracle.

How to use HTCondor’s CLAIM_WORKLIFE to optimize cluster throughput

Not only do I write blog posts here, I also occasionally write some for work. The Cycle Computing blog just posted “How to use HTCondor’s CLAIM_WORKLIFE to optimize cluster throughput”. This began as a conversation I had with one of our customers, and I decided it was worth expanding on and sharing with the wider community.

Amazon VPC: A great gotcha

If you’re not familiar with the Amazon Web Services offerings, one feature is the Virtual Private Cloud (VPC). VPC is effectively a way of walling yourself off from all or part of the world. If you’re running a public-facing web server, it might not be so important. If you’re running a compute cluster, it’s a no-brainer. Just be careful about that “no-brainer” part.

While working on a new cluster for a customer today, I was trying to figure out why the HTCondor scheduler wasn’t showing up to the collector. The daemons were all running. HTCondor security policies weren’t getting in the way. I could use condor_config_val from each host to query the other host. I brought in a colleague to double-check me. He couldn’t figure it out either.

After beating our heads against the wall for a while, and finding what looked like absolutely nothing helpful in the logs, I noticed one tiny detail. The schedd kept saying it was updating the collector, but the collector never seemed to notice. The schedd kept saying it was updating the collector via UDP. How many times had I watched that line go by?

The last time, though, it clicked. And it clicked hard. I had set up a security group to allow all traffic within the VPC. Except I had set it for all TCP traffic, so the UDP packets were being silently dropped. As UDP packets are wont to do. When I changed the security group rule from TCP to all protocols, the scheduler magically appeared in the pool.

Once again, the moral of the story is: don’t be stupid.
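In hindsight, a quick UDP round-trip test between the two hosts would have exposed the problem much sooner. Here is a rough Python sketch of that idea, not anything HTCondor provides: udp_echo_once and udp_reachable are names I made up, and the check only works if you run the echo side yourself on the far host, on a free UDP port covered by the same security group rule. The collector itself will not echo anything back.

```python
import socket


def udp_echo_once(sock):
    """Wait for one datagram on an already-bound UDP socket and send it back."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)


def udp_reachable(host, port, timeout=2.0):
    """Send one datagram and report whether the echo returns before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(b"ping", (host, port))
            reply, _ = s.recvfrom(1024)
        except OSError:
            # Covers timeouts (socket.timeout is an OSError subclass) and
            # other socket errors: either way, nothing came back.
            return False
        return reply == b"ping"
```

On the far host, bind a UDP socket to the chosen port and call udp_echo_once on it; on the near host, call udp_reachable with that host and port. A False result while TCP connections succeed points at a rule that only allows TCP.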

Parsing SGE’s qacct dates

Recently I was trying to reconstruct a customer’s SGE job queue to understand why our cluster autoscaling wasn’t working quite right. The best way I found was to dump the output of qacct and grep for {qsub,start,end}_time. Several things made this unpleasant. First, the output is not de-duplicated on job id. Jobs that span multiple hosts get listed multiple times. Another thing is that the dates are in a nearly-but-not-quite “normal” format. For example: “Tue Mar 18 13:00:08 2014”.

What can you do with that? Not a whole lot. It’s not a format that spreadsheets will readily treat as a date, so if you want to do spreadsheety things, you’re forced to either manually enter them or write a shell function to do it for you:

function qacct2excel { echo "=$(date -j -f '%a %b %d %T %Y' "$1" +%s)/(60*60*24)+\"1/1/1970\""; }

The above works on OS X because it uses a non-GNU date command. On Linux, you’ll need a different set of arguments, which I haven’t bothered to figure out. It’s still not awesome, but it’s slightly less tedious this way. At some point, I might write a parser that does what I want qacct to do, instead of what it does.
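For a platform-independent take on the same idea, here is a minimal Python sketch of the kind of parser mentioned above. It rests on assumptions about the qacct output that I have not verified against every SGE version: records are separated by lines of "=" characters, each field is a "name value" line, and the job id field is called jobnumber, with taskid alongside it. The function names are mine. Timestamps are treated as plain wall-clock times with no timezone conversion, and the code assumes an English locale for the day and month names.

```python
from datetime import datetime

QACCT_FORMAT = "%a %b %d %H:%M:%S %Y"   # e.g. "Tue Mar 18 13:00:08 2014"
EXCEL_EPOCH = datetime(1899, 12, 30)     # day zero of the spreadsheet 1900 date system
TIME_FIELDS = ("qsub_time", "start_time", "end_time")


def to_excel_serial(stamp):
    """Convert a qacct-style timestamp into a spreadsheet serial date."""
    delta = datetime.strptime(stamp.strip(), QACCT_FORMAT) - EXCEL_EPOCH
    return delta.total_seconds() / 86400


def parse_qacct(text):
    """Return {(jobnumber, taskid): {field: serial}}, keeping the first record per job."""
    jobs = {}
    record = {}
    # The trailing separator is a sentinel that flushes the final record.
    for line in text.splitlines() + ["===="]:
        if line.startswith("===="):
            key = (record.get("jobnumber"), record.get("taskid"))
            if key[0] is not None and key not in jobs:
                jobs[key] = {f: to_excel_serial(record[f])
                             for f in TIME_FIELDS if f in record}
            record = {}
            continue
        parts = line.split(None, 1)   # assumed "name value" layout
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return jobs
```

Feeding it the text output of qacct gives one entry per job, with times as numbers that a spreadsheet will display as dates once the cells are formatted that way. Records whose time fields are not in the expected format will raise a ValueError, which a real parser would want to handle.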

It’s entirely possible that there’s a better way to do this. The man page didn’t seem to have any helpful suggestions, though. I hate to say “SGE sucks” because I know very little about it. What I do know is that it’s hard to find good material for learning about SGE. At least HTCondor has thorough documentation and tutorials from HTCondor Week posted online. Perhaps one of these days I’ll learn more about SGE so I can determine whether it sucks or not.