A warning about Condor access control lists

Like most sane services, Condor has a notion of access control.  In fact, Condor’s access control lists (ACLs) provide a very granular level of control, allowing you to set a variety of roles based on hostname/IP.  One thing we’re working on at my day job is making it easier for departments across campus to join our Condor pool.  In the face of budget concerns, a recommendation has been drafted which includes having departments choose between running Condor and powering machines off when not in use.  Given the preference for performing backups and system updates overnight, we’re guessing the majority will choose to donate cycles to Condor, so we’re trying to prepare for a large increase in the pool.

Included in that preparation is the switch from default-deny to default-allow-campus-hosts.  Previously, we only allowed specific subdomains on campus, which meant that every time a new department (which effectively means a new subdomain) joined the pool, we had to modify the ACLs.  While this isn’t a big deal, it seems simpler to just allow all of campus except the “scary” subnets: traditionally wireless clients, VPN clients, and the dorms.  Especially the dorms.  Effectively, we’d end up in the same place anyway, and keeping the file more static should make it easier to maintain.

So on Wednesday, after the security group blessed our idea, I began the process of making the changes.  Let me point out here that you don’t really appreciate how much IP space an institution like Purdue has until you need to start blocking segments of it: all of 128.10.0.0/16, 128.46.0.0/16, 128.210.0.0/16 and 128.211.0.0/16.  That’s a lot of public space, and it doesn’t include the private IP addresses in use.  So after combing through the IP space assignments, I finally got the ACLs written, and on Thursday I committed them.  And that’s when all hell broke loose.
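To put a number on "a lot of public space," here is a minimal Python sketch using the standard ipaddress module.  It counts the addresses in the four blocks above and turns each block into the wildcard style of pattern that Condor host lists use.  The names CAMPUS_BLOCKS, total_addresses and to_wildcard are made up for this illustration, and the 128.10.* output form is my assumption about how you would write such a block in an ACL, so check it against the Condor manual for your version.

```python
import ipaddress

# The four public /16 blocks named in the post.
CAMPUS_BLOCKS = [
    "128.10.0.0/16",
    "128.46.0.0/16",
    "128.210.0.0/16",
    "128.211.0.0/16",
]


def total_addresses(blocks):
    """Return the total number of addresses covered by the CIDR blocks."""
    return sum(ipaddress.ip_network(b).num_addresses for b in blocks)


def to_wildcard(cidr):
    """Convert an octet-aligned IPv4 block (/8, /16, /24) to a pattern like 128.10.*

    Hypothetical helper: anything that does not fall on an octet boundary
    cannot be expressed as a single trailing wildcard, so it is rejected.
    """
    net = ipaddress.ip_network(cidr)
    if net.version != 4 or net.prefixlen not in (8, 16, 24):
        raise ValueError(f"{cidr} is not an octet-aligned IPv4 block")
    kept = net.prefixlen // 8
    octets = str(net.network_address).split(".")[:kept]
    return ".".join(octets) + ".*"


if __name__ == "__main__":
    print(total_addresses(CAMPUS_BLOCKS))  # 262144
    print(", ".join(to_wildcard(b) for b in CAMPUS_BLOCKS))
```

Generating the patterns from a list like this, rather than typing them by hand, is one way to keep stray punctuation out of the ACLs.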

Condor ACL entries are comma-separated lists of as many hosts as you want, and asterisks can be used to wildcard hosts (Condor does not currently support CIDR notation, but that would be awesome).  The danger here is that if you accidentally put a comma in place of a period (say, 128.210,* where you meant 128.210.*), the asterisk becomes its own entry and you end up denying write access to *, which is to say everyone.  Obviously, this causes things to break down.  Once people started complaining about Condor not working, I quickly found my mistake and pushed out a correction.  However, Condor does not give up so quickly.  Once a rule is in DENY_WRITE, it will not stop denying that host until the Condor master has been stopped and restarted.  A simple config update won’t change it.
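A check like the following would have caught my typo before it went out.  This is only a rough sketch in Python, not anything Condor ships: lint_host_list is a hypothetical helper that splits a host list on commas the way I understand Condor to, then flags entries that are nothing but wildcards, entries that look like an IP prefix cut short, and empty entries left by a stray comma.

```python
import re


def lint_host_list(value):
    """Return a list of warnings for a comma-separated host list.

    Hypothetical pre-commit check; the rules here are my own guesses at
    what a dangerous entry looks like, not Condor's actual parser.
    """
    warnings = []
    for raw in value.split(","):
        entry = raw.strip()
        if not entry:
            warnings.append("empty entry (stray comma?)")
        elif set(entry) <= {"*", "."}:
            # Entries such as "*" or "*.*" match every host.
            warnings.append(f"'{entry}' matches every host")
        elif re.fullmatch(r"\d+(\.\d+){0,2}", entry):
            # One to three bare octets with no wildcard, e.g. "128.210".
            warnings.append(
                f"'{entry}' looks like a truncated IP prefix "
                "(comma typed for a period?)"
            )
    return warnings


if __name__ == "__main__":
    # The kind of typo described above: a comma where a period belongs.
    for warning in lint_host_list("128.210,*"):
        print(warning)
    # A well-formed list produces no warnings.
    print(lint_host_list("128.210.*, *.wireless.example.edu"))
```

The subdomain in the second call is a placeholder.  Since a bad deny rule seems to stick until the master is restarted, running something like this before committing is a lot cheaper than cleaning up afterward.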

We had to learn that by experimentation, so I spent most of Friday helping my colleagues restart the Condor process everywhere and testing the hell out of my changes.  Fortunately, once everything had been cleaned up, it worked as expected, and this gave me a chance to learn more about Condor.  And I also learned a very important lesson: test your changes first, dummy.