Cloud detente

Tim Prendergast, founder and CEO of the cloud security company Evident.io, wondered on Twitter why other cloud service providers aren’t taking marketing advantage of the Xen vulnerability that led Amazon and Rackspace to reboot a large number of cloud instances over a few-day period. Digital Ocean, Azure, and Google Compute Engine all use other hypervisors, so isn’t this an opportunity for them to brag about their security? Amazon is the clear market leader, so pointing out this vulnerability would seem like a great differentiator.

Except that it isn’t. It’s a matter of chance that Xen is the hypervisor facing an apparently serious and soon-to-be-public exploit. Next week it could be Microsoft’s Hyper-V. Imagine the PR nightmare if Microsoft bragged about how much more secure Azure is, only to see a major exploit strike Hyper-V the following week. It would be even worse if the exploit were active in the wild before patches could be applied.

“Choose us because of this Xen issue” is the cloud service provider equivalent of an airline running a “don’t fly those guys, they just had a plane crash” ad campaign. Just because your competition was unlucky this time, there’s no guarantee that you won’t be the loser next time.

I’m all for companies touting legitimate security features. Amazon’s handling of this incident seems pretty good, and I think they generally do a good job of giving users the ability to secure their environment. That doesn’t mean someone can’t come along and do it better. If there’s anything 2014 has taught us, it’s that we have a long road ahead of us when it comes to the security of computing.

It’s to the credit of Amazon’s competition that they’ve remained silent. It shows a great degree of professionalism. Digital Ocean’s Chief Technology Evangelist John Edgar had the best explanation for the silence: “because we’re not assholes mostly.”

FAQs are not the place to vent

I’ve spent a lot of my professional life explaining technical concepts to not-necessarily-very-technical people. Most of the time (but sadly not all of it), it’s because the person doesn’t need to fully understand the technology, they just need to know enough to effectively do their job. I understand how frustrating it can be to answer what seems like an obvious question, and how the frustration compounds when the question is repeated. That’s why we maintain FAQ pages, so we can give a consistently friendly answer to a question.

You can imagine my dismay when my friend Andy shared an FAQ entry he found recently. A quantum chemistry application’s FAQ page includes this question: “How do I choose the number of processors/How do I setup my parallel calculation?” It’s a very reasonable question to ask. Unfortunately, the site answers it thusly: “By asking this question, you demonstrate your lack of basic understanding of how parallel machines work and how parallelism is implemented in Quantum ESPRESSO. Please go back to the previous point.”

The previous question is similar and has an answer of “See Section 3 of the User Guide for an introduction to how parallelism is implemented in Quantum ESPRESSO”. Now that’s a pretty good answer. Depending on the depth of information in Section 3, it might be possible to answer the question directly on the FAQ page with an excerpt, but at least pointing the visitor to the information is a good step.

I don’t understand getting frustrated with a repeated FAQ. If the answers are so similar, copy and paste them. Or combine the questions. FAQs, user guides, and the like are great because you can compose them in a detached manner and edit them to make sure they’re correct, approachable, and not jerkish. FAQs are an opportunity to prevent frustration, not to express it.

How to use HTCondor’s CLAIM_WORKLIFE to optimize cluster throughput

Not only do I write blog posts here, I also occasionally write some for work. The Cycle Computing blog just posted “How to use HTCondor’s CLAIM_WORKLIFE to optimize cluster throughput”. This began as a conversation I had with one of our customers, and I decided it was worth expanding on and sharing with the wider community.

Amazon VPC: A great gotcha

If you’re not familiar with the Amazon Web Services offerings, one feature is the Virtual Private Cloud (VPC). VPC is effectively a way of walling yourself off from all or part of the world. If you’re running a public-facing web server, it might not be so important. If you’re running a compute cluster, it’s a no-brainer. Just be careful about that “no-brainer” part.

While working on a new cluster for a customer today, I was trying to figure out why the HTCondor scheduler wasn’t showing up to the collector. The daemons were all running. HTCondor security policies weren’t getting in the way. I could use condor_config_val from each host to query the other host. I brought in a colleague to double-check me. He couldn’t figure it out either.

After beating our heads against the wall for a while and finding nothing obviously wrong, I finally noticed one tiny detail in the logs: the schedd kept saying it was updating the collector via UDP, yet the collector never seemed to notice. How many times had I watched that line go by?

The last time, though, it clicked. And it clicked hard. I had set up a security group to allow all traffic within the VPC. Except I had set it for all TCP traffic, so the UDP packets were being silently dropped. As UDP packets are wont to do. When I changed the security group rule from TCP to all protocols, the scheduler magically appeared in the pool.
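For reference, the fix amounts to one rule change. A sketch with the AWS CLI (the group ID and CIDR block are placeholders for your own VPC):

```shell
# Allow ALL protocols (TCP, UDP, ICMP) within the VPC, not just TCP.
# In the EC2 API, --protocol -1 means "all protocols".
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol -1 \
    --cidr 10.0.0.0/16
```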

Once again, the moral of the story is: don’t be stupid.

Parsing SGE’s qacct dates

Recently I was trying to reconstruct a customer’s SGE job queue to understand why our cluster autoscaling wasn’t working quite right. The best way I found was to dump the output of qacct and grep for {qsub,start,end}_time. Several things made this unpleasant. First, the output is not de-duplicated on job ID: jobs that span multiple hosts get listed multiple times. Second, the dates are in a nearly-but-not-quite “normal” format. For example: “Tue Mar 18 13:00:08 2014”.

What can you do with that? Not a whole lot. It’s not a format that spreadsheets will readily treat as a date, so if you want to do spreadsheety things, you’re forced to either manually enter them or write a shell function to do it for you:

function qacct2excel { echo "=`date -f '%a %b %d %T %Y' -j \"$1\" +%s`/(60*60*24)+\"1/1/1970\""; }

The above works on OS X because it uses a non-GNU date command. On Linux, you’ll need a different set of arguments, which I haven’t bothered to figure out. It’s still not awesome, but it’s slightly less tedious this way. At some point, I might write a parser that does what I want qacct to do, instead of what it does.
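For the curious, GNU date should handle this with its -d flag, which parses the qacct timestamp format directly. A sketch (assumes GNU coreutils and English month names):

```shell
# GNU date: -d parses the timestamp; +%s emits seconds since the Unix epoch.
qacct2excel() {
    echo "=$(date -d "$1" +%s)/(60*60*24)+\"1/1/1970\""
}

qacct2excel 'Tue Mar 18 13:00:08 2014'
```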

It’s entirely possible that there’s a better way to do this. The man page didn’t seem to have any helpful suggestions, though. I hate to say “SGE sucks” because I know very little about it. What I do know is that it’s hard to find good material for learning about SGE. At least HTCondor has thorough documentation and tutorials from HTCondor Week posted online. Perhaps one of these days I’ll learn more about SGE so I can determine whether it sucks or not.
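Incidentally, the de-duplication half of the problem yields to a short awk filter. A sketch against trimmed-down qacct-style records (real output has many more fields; in practice you would pipe qacct in rather than use a sample heredoc):

```shell
# Keep only the first occurrence of each jobnumber, and only the
# fields of interest from the records we keep.
awk '/^jobnumber/ { keep = !seen[$2]++ }
     keep && /^(jobnumber|qsub_time|start_time|end_time)/' <<'EOF'
jobnumber  101
qsub_time  Tue Mar 18 12:59:01 2014
start_time Tue Mar 18 13:00:08 2014
end_time   Tue Mar 18 13:05:40 2014
jobnumber  101
qsub_time  Tue Mar 18 12:59:01 2014
start_time Tue Mar 18 13:00:09 2014
end_time   Tue Mar 18 13:05:41 2014
EOF
```

The first rule flips `keep` off once a job number has been seen; only the first record’s four lines survive.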

I’m famous, sorta

One of my co-workers happens to be a co-host of “Food Fight”, a DevOps podcast. Last week, he asked for someone to join in for a crossover episode with “RCE”. When nobody else volunteered, he roped me into it. It turned out to be pretty awesome; I would have loved to extend the conversation a few more hours. With any luck, I’ll re-appear on one of those shows sometime. As you may already be aware, one of my goals is for Leo Laporte to personally invite me to the TWiT Brickhouse to get drunk with him on an episode of “This Week in Tech.” I feel like I’ve moved a little closer today.

Anyway, here are the links:

Extending rivalries to HPC

In October, Indiana University announced it would purchase a Cray XK7 named “Big Red II”. With a theoretical peak of just over 1 petaFLOPS, it would be the fastest university-owned cluster (that is, one not associated with a national center). Of course, in-state rivals would never let that stand. In the latest Top 500 list, unveiled at the International Supercomputing Conference, Big Red II ranks a very respectable 46th. Unfortunately for IU, Purdue University’s new Conte cluster checked in at 28th. Oops! Let’s compare:

Cluster      Cost          Theoretical performance  LINPACK performance  Cost per benchmarked TFLOPS
Big Red II   $7.5 million  1000.6 TFLOPS            597.4 TFLOPS         $12.55k / TFLOPS
Conte        $4.3 million  1341.1 TFLOPS            943.4 TFLOPS         $4.56k / TFLOPS
Comparison   57.33%        134.03%                  157.92%              36.33%

It’s clear that Conte is the winner in performance and cost. But what about value? Both of these clusters have accelerators: Big Red II uses Nvidia GPUs, while Conte uses Intel’s Xeon Phi (which also powers China’s new Tianhe-2, far and away the fastest cluster in the world). Using the GPU requires writing code in CUDA, whereas the Phi will run native x86 code. This lowers the barrier to entry for users on the Phi, but GPUs seem to win in most benchmarks. The GPU route would seem to increase the cost of providing user support, but it may be that IU’s users are already prepared to run on the GPU. All of the performance numbers in the world won’t matter if the clusters aren’t used, and only time will tell which cluster provides a better value.

What may end up being a more interesting result is the political ramifications. Will the statehouse be okay with the two main state universities both running expensive high performance computing resources? If not, who will get to carry on? Both institutions have a record of success. Indiana ranked as high as #23 on the June 2006 list, but Big Red II is IU’s first Top 500 system since November 2009. Meanwhile, Purdue has had at least one system (and as many as three) on every list since November 2008. With Conte and the additional clusters in operation, Purdue has much greater capacity, but that doesn’t mean that IU’s system is a waste. I suspect that as long as both universities are bringing in enough grant money to justify the cost of their clusters, nobody in Indianapolis will care to put a stop to this rivalry. In the meantime, it appears that Purdue will remain the dominant HPC power in the state, as on the football field.

Treating clusters as projects

Last month, HPCWire ran an article about the decommissioning of Los Alamos National Lab’s “Roadrunner” cluster. In this article was a quote from LANL’s Paul Henning: “Rather than think of these machines as physical entities, we think of them as projects.” This struck me as being a very sensible position to take. One of the defining characteristics of a project is that it is of limited duration. Compute clusters have a definite useful life, limited by the vendor’s hardware warranty and the system performance (both in terms of computational ability and power consumption) relative to new systems.

Furthermore, the five PMBOK process groups all come into play. Initiating and Planning happen before the cluster is installed. Executing could largely be considered the building and acceptance-testing portion of a cluster’s life. The operational period is arguably Monitoring and Controlling. Project closeout, of course, is the decommissioning of the resource. Smaller projects, such as software updates and planned maintenance, occur within the larger one. Perhaps it is better to think of each cluster as a program?

The natural extension of treating a cluster (or any resource, for that matter) as a project is assigning a project manager to each one. This was a key component of a staffing plan that a former coworker and I came up with for my previous group. With five large compute resources, plus a few smaller resources and the supporting infrastructure, it was very difficult for any one person to know what was going on. Our proposal included having one engineer assigned as the point of contact for a particular resource. This person didn’t have to fix everything on that cluster, but they would know about every issue and all of the unique configuration. That way, time wouldn’t be wasted repeating the same diagnostic steps three months apart when a recurring issue struck again.

Monitoring sucks, don’t make it worse

You don’t have to go too far to find someone who thinks monitoring sucks. It’s definitely true that monitoring can be big, ugly, and complicated. I’m convinced that many of the problems in monitoring are not technical, but policy issues. For the sake of clarity (and because I’m like that), let’s start with some definitions. These definitions may or may not have validity outside the scope of this post, but at least they will serve to clarify what I mean when I say things.

  • Monitoring – an automatic process to collect metrics on a system or service
  • Alerting – notification when a critical threshold has been reached

In the rest of this post, I will be throwing some former colleagues under the bus. It’s not personal, and I’m responsible for some of the problem as well. The group in question has a monitoring setup that is dysfunctional to the point of being worthless. Not all of the problems are policy-related, but enough are to prompt this post. It should be noted that I’m not an expert on this subject, just a guy with opinions and a blog.

Perhaps the most important thing that can be done when setting up a monitoring system is coming up with a plan. It sounds obvious, but if you don’t know what you’re monitoring, why you’re monitoring it, and how you’re monitoring it, you’re bound to get it wrong. This is my first rule: in monitoring, failing to plan is planning to not notice failure.

It’s important to distinguish between monitoring and alerting. You can’t alert on what you don’t monitor, but you don’t need to alert on everything you monitor. This is one area where it’s easy to shoot yourself in the foot, especially at a large scale. In the group I mentioned, many of the monitoring checks were added in reaction to something going wrong. As a result, Nagios ended up alerting for things like “a compute node has 95% memory utilization.” For servers, that’s important. For compute nodes, who cares? The point of those machines is to do computation. Sometimes that means chewing up memory.

Which brings me to rule number two: every alert should have a reaction. If you’re not going to do something about an alert, why have it in the first place? It’s okay to monitor without alerting — the information can be important in diagnosing problems or analyzing usage — but if an alert doesn’t result in a human or automated reaction, shut it off.
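In Nagios, for instance, “monitor without alerting” comes down to a single directive. A sketch (the host name and check command here are made up for illustration):

```
define service {
    use                   generic-service
    host_name             node042
    service_description   Memory utilization
    check_command         check_mem!90!95
    notifications_enabled 0    ; collect the data, never page anyone
}
```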

Along that same line, alerts should be a little bit painful. Don’t punish yourself for something failing, but don’t make alerts painless either. Perhaps the biggest problem in the aforementioned group is that most of the admins filtered Nagios messages away. That immediately killed any incentive to improve the setup.

I took the alternate approach and weakly lobbied for all alerts to hit the pager. This probably falls into the “too painful” category. You should use multiple levels of alerts. An email or ticket is fine for something that needs to be acted on but can wait until business hours. A more obnoxious form of alert should be used for the Really Important Things[tm].

The great thing about having a little bit of pain associated with alerts is that it also acts as an incentive to fix false alarms. At one point, I wrote Nagios checks to monitor HTCondor daemons. Unfortunately, due to the load on the Nagios server, the checks would time out and produce alerts. The daemons were fine, and the condor_master process generally does a good job of keeping things under control. So I removed the checks.

The opposite problem is running checks outside the monitoring system. One colleague had a series of cron jobs that checked the batch scheduler. If the checks failed, he would email the group. Don’t work outside the system.

Finally, be sure to consider planned outages. If you can’t suppress alerts when things are broken intentionally, you’re going to have a bad time. As my friend tweeted: “Rough estimates indicate we sent something like 180,000 emails when our clusters went down for maintenance.”
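Nagios, for one, handles planned outages through its external command file. A sketch (the host name is a placeholder, and the command-file path varies by distribution; in real use you would redirect the output into it, e.g. /var/lib/nagios3/rw/nagios.cmd):

```shell
# Build a SCHEDULE_HOST_DOWNTIME command for a fixed two-hour window.
# Fields: host;start;end;fixed;trigger_id;duration;author;comment
now=$(date +%s)
printf '[%d] SCHEDULE_HOST_DOWNTIME;node042;%d;%d;1;0;7200;admin;planned maintenance\n' \
    "$now" "$now" "$((now + 7200))"
```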

Coming up: LISA ’12

It may seem like I’ve not been writing much lately, but nothing could be further from the truth. It’s just that my writing has been for grad school instead of Blog Fiasco. But don’t worry: soon I’ll be blogging like a madman. That’s right: it’s time for LISA ’12. Once again, I have the privilege of being on the conference blog team and learning from some of the TopPeople[tm] in the field. Here’s a quick look at my schedule (subject to change based on level of alertness, addition of BoFs, etc.):

Now I just need to pack my bags and get started on the take-home final that’s due mid-week. Look for posts from me and my team members Matt Simmons and Greg Riedesel on the USENIX Blog.