A warning about Condor access control lists

Like most sane services, Condor has a notion of access control.  In fact, Condor’s access control lists (ACLs) provide fairly fine-grained control, allowing you to assign a variety of roles based on hostname or IP.  One thing we’re working on at my day job is making it easier for departments across campus to join our Condor pool.  In the face of budget concerns, a recommendation has been drafted which has departments choose between running Condor and powering machines off when not in use.  Given the preference for performing backups and system updates overnight (which means leaving machines powered on anyway), we’re guessing the majority will choose to donate cycles to Condor, so we’re trying to prepare for a large increase in the pool.

Included in that preparation is the switch from default-deny to default-allow-campus-hosts.  Until now, we only allowed specific subdomains on campus, which meant that every time a new department (which effectively means a new subdomain) joined the pool, we had to modify the ACLs.  While that isn’t a big deal, it seems simpler to just allow all of campus except the “scary” subnets (traditionally wireless clients, VPN clients, and the dorms.  Especially the dorms.)  We’ll effectively end up allowing the same hosts either way, and keeping the file more static should make it easier to maintain.
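
For the curious, the new layout looks something like the sketch below.  The knobs are Condor’s ALLOW_WRITE/DENY_WRITE settings (the deny list is the one that bites me shortly), the IP wildcards stand in for the campus address ranges I mention below, and the subdomain names are placeholders I made up rather than our real config:

    ## Old approach: default-deny, list every participating subdomain explicitly
    ALLOW_WRITE = *.physics.purdue.edu, *.stat.purdue.edu

    ## New approach: allow all of campus by name and by address...
    ALLOW_WRITE = *.purdue.edu, 128.10.*, 128.46.*, 128.210.*, 128.211.*
    ## ...then carve out the scary subnets (deny entries take precedence)
    DENY_WRITE = *.wireless.purdue.edu, *.vpn.purdue.edu, *.resnet.purdue.edu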

So on Wednesday, after the security group blessed our idea, I began the process of making the changes.  Let me point out here that you don’t really appreciate how much IP space an institution like Purdue has until you need to start blocking segments of it.  We have all of 128.10.0.0/16, 128.46.0.0/16, 128.210.0.0/16, and 128.211.0.0/16.  That’s a lot of public space, and it doesn’t include the private IP addresses in use.  After combing through the IP space assignments, I finally got the ACLs written, and on Thursday I committed them.  And that’s when all hell broke loose.

Condor uses commas to separate as many hosts as you want, and asterisks can be used as wildcards (Condor does not currently support CIDR notation, but that would be awesome).  The danger is that a comma typed in place of a period splits one pattern into two entries, and if the piece left in front of the comma is a bare *, you have just denied write access to every host.  Obviously, this causes things to break down.  Once people started complaining about Condor not working, I quickly found my mistake and pushed out a correction.  However, Condor does not give up so quickly.  Once a rule is in DENY_WRITE, it will not stop denying that host until the Condor master has been stopped and restarted.  A simple config update won’t change it.
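
To make the failure mode concrete, here is the shape of the typo, with the subdomain names again being made-up placeholders:

    ## What I meant: deny two wildcard patterns
    DENY_WRITE = *.wireless.purdue.edu, *.vpn.purdue.edu

    ## What I committed: the first period became a comma, so Condor sees three
    ## entries, and the first of them is a bare * that matches every host
    DENY_WRITE = *,wireless.purdue.edu, *.vpn.purdue.edu

And because of that sticky deny behavior, pushing the corrected file and poking the daemons with condor_reconfig wasn’t enough for us; the denies stuck around until the master on each machine was actually restarted (condor_restart, or a full stop and start of the service).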

We had to learn that by experimentation, so I spent most of Friday helping my colleagues restart the Condor process everywhere and testing the hell out of my changes.  Fortunately, once everything had been cleaned up, it all worked as expected, and the whole episode gave me a chance to learn more about Condor.  I also learned a very important lesson: test your changes first, dummy.
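
In the same spirit, a ten-second sanity check before committing would have caught this.  Something like the following (assuming the standard condor_config_val tool is in your path) shows the value the daemons will actually parse:

    # Ask a host what it thinks the deny list is
    condor_config_val DENY_WRITE

    # Or dump the whole config and eyeball every write-level ACL knob
    condor_config_val -dump | grep -i _WRITE

It won’t flag a stray comma on its own, but staring at the expanded value is a lot cheaper than restarting the Condor master across an entire campus.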

A 650-node 10Gb computer cluster: easy peasy

At Purdue, we have a long history of being a leader in the field of computing.  (After all, Ctrl+Alt+Del was invented by Purdue alum David Bradley.)  Since we’re a pretty geeky campus anyway, it’s more than a matter of professional pride; there’s street cred on the line too.  After building a large compute cluster last year, the research computing group on campus decided it needed to be one-upped this year.

Once again, volunteers from around Purdue and a few other institutions gathered to set up the cluster in a single day.  Once again, we finished way ahead of schedule.  This year, approximately 650 nodes went from box to OS install in less than three hours.  Jobs were already running by lunch time.

The process wasn’t entirely smooth, though.  For reasons not adequately explained to the volunteers, the 10 gigabit network cards (NICs) had not been installed by the vendor.  That meant every machine first had to be opened up and have a NIC added before it could be installed.  That is what I did for two hours yesterday morning.

The NIC installation process wasn’t too difficult; there were only four screws to contend with.  The organizers had expected each person to install about 15 NICs per shift.  I did 42 in my two-hour shift, and several others installed 50 or more.  At several points, they couldn’t get the machines unboxed and onto our tables fast enough.

Several hundred more nodes will be installed once the external funding is processed, and it is likely that Coates will end up reaching its maximum capacity of just over 1,200 nodes.  That would give it over 10,000 cores, all joined by 10 gigabit Ethernet connections, allowing an obscene amount of data to be processed and transferred, which is very helpful in big-data fields like the atmospheric sciences.

Expectations are high for Coates.  Like Steele before it, it is the largest compute cluster in the Big Ten at the time of its construction.  Coates is expected to rank in the top 50 internationally when the supercomputer rankings come out in November, and it is also expected to be the first academic cluster connected solely with 10Gb Ethernet that is big enough to achieve an international ranking.  Perhaps most importantly, Purdue researchers expect Coates to facilitate some serious science.

Even though my contribution didn’t require much technical skill, I take pride in the fact that a whole rack of nodes can transfer data that fast because of the network cards I installed.  This cluster is a big deal to those who care about clusters, and it is really nice to be a part of something so geekily awesome.  If you’re one of those people who care about clusters, the technical details are at http://www.rcac.purdue.edu/userinfo/resources/coates/