Blog Fiasco

August 28, 2014

How to use HTCondor’s CLAIM_WORKLIFE to optimize cluster throughput

Filed under: HPC/HTC — Tags: , — bcotton @ 9:10 am

Not only do I write blog posts here, I also occasionally write some for work. The Cycle Computing blog just posted “How to use HTCondor’s CLAIM_WORKLIFE to optimize cluster throughput”. This began as a conversation I had with one of our customers, and I decided it was worth expanding on and sharing with the wider community.

April 2, 2014

Amazon VPC: A great gotcha

Filed under: HPC/HTC,The Internet — Tags: , , , , , — bcotton @ 9:01 pm

If you’re not familiar with the Amazon Web Services offerings, one feature is the Virtual Private Cloud (VPC). VPC is effectively a way of walling yourself off from all or part of the world. If you’re running a public-facing web server, it might not be so important. If you’re running a compute cluster, it’s a no-brainer. Just be careful about that “no-brainer” part.

While working on a new cluster for a customer today, I was trying to figure out why the HTCondor scheduler wasn’t showing up in the collector. The daemons were all running. HTCondor security policies weren’t getting in the way. I could use condor_config_val from each host to query the other host. I brought in a colleague to double-check me. He couldn’t figure it out either.

After beating our heads against the wall for a while, and finding absolutely nothing helpful in the logs, I noticed one tiny detail in the logs. The schedd kept saying it was updating the collector, but the collector never seemed to notice. The schedd kept saying it was updating the collector via UDP. How many times had I watched that line go by?

The last time, though, it clicked. And it clicked hard. I had set up a security group to allow all traffic within the VPC. Except I had set it for all TCP traffic, so the UDP packets were being silently dropped. As UDP packets are wont to do. When I changed the security group rule from TCP to all protocols, the scheduler magically appeared in the pool.
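A check along these lines could catch the mistake before any daemons start. This is only a minimal Python sketch: it assumes the security group's ingress rules are already available as dictionaries shaped like the IpPermissions entries the EC2 API reports (where "-1" means all protocols), and that the collector listens on HTCondor's default port, 9618. The function name and the sample rules are made up for illustration.

```python
# Hypothetical ingress rules, shaped like EC2 "IpPermissions" entries.
# In the EC2 API, an IpProtocol of "-1" means "all protocols".
COLLECTOR_PORT = 9618  # HTCondor's default collector port (assumed unchanged)


def allows_udp(permissions, port):
    """Return True if any ingress rule lets UDP traffic reach `port`."""
    for rule in permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            return True  # all protocols, all ports
        if protocol == "udp":
            # Missing port bounds are treated as "any port" in this sketch.
            if rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535):
                return True
    return False


tcp_only = [{"IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535}]
all_traffic = [{"IpProtocol": "-1"}]

print(allows_udp(tcp_only, COLLECTOR_PORT))     # the "all TCP" rule
print(allows_udp(all_traffic, COLLECTOR_PORT))  # the "all protocols" rule
```

The "all TCP" rule fails the check and the "all protocols" rule passes, which matches what the schedd logs were hinting at.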

Once again, the moral of the story is: don’t be stupid.

March 30, 2014

Parsing SGE’s qacct dates

Filed under: HPC/HTC,mac — Tags: , , , , — bcotton @ 9:19 pm

Recently I was trying to reconstruct a customer’s SGE job queue to understand why our cluster autoscaling wasn’t working quite right. The best way I found was to dump the output of qacct and grep for {qsub,start,end}_time. Several things made this unpleasant. First, the output is not de-duplicated on job id. Jobs that span multiple hosts get listed multiple times. Another thing is that the dates are in a nearly-but-not-quite “normal” format. For example: “Tue Mar 18 13:00:08 2014”.

What can you do with that? Not a whole lot. It’s not a format that spreadsheets will readily treat as a date, so if you want to do spreadsheety things, you’re forced to either manually enter them or write a shell function to do it for you:

function qacct2excel { echo "=$(date -j -f '%a %b %d %T %Y' "$1" +%s)/(60*60*24)+\"1/1/1970\""; }

The above works on OS X because it uses a non-GNU date command. On Linux, you’ll need a different set of arguments, which I haven’t bothered to figure out. It’s still not awesome, but it’s slightly less tedious this way. At some point, I might write a parser that does what I want qacct to do, instead of what it does.
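One way around the differences between date implementations is to do the conversion in Python instead. The following is a minimal sketch, not the shell function's exact equivalent: it assumes an English locale (for the day and month names), keeps the timestamp as wall-clock time rather than converting through epoch seconds, and assumes the spreadsheet uses Excel's default 1900 date system. The function name is made up.

```python
from datetime import datetime

QACCT_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Tue Mar 18 13:00:08 2014"
EXCEL_EPOCH = datetime(1899, 12, 30)    # day zero of Excel's 1900 date system


def qacct_to_excel(stamp):
    """Convert a qacct timestamp to an Excel serial date.

    The integer part is the day and the fractional part is the time of day.
    """
    parsed = datetime.strptime(stamp, QACCT_FORMAT)
    return (parsed - EXCEL_EPOCH).total_seconds() / 86400


serial = qacct_to_excel("Tue Mar 18 13:00:08 2014")
print(serial)
```

Pasting the resulting number into a cell and formatting it as a date should give the original timestamp back.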

It’s entirely possible that there’s a better way to do this. The man page didn’t seem to have any helpful suggestions, though. I hate to say “SGE sucks” because I know very little about it. What I do know is that it’s hard to find good material for learning about SGE. At least HTCondor has thorough documentation and tutorials from HTCondor Week posted online. Perhaps one of these days I’ll learn more about SGE so I can determine whether it sucks or not.

October 1, 2013

I’m famous, sorta

Filed under: HPC/HTC,The Internet — Tags: , , , — bcotton @ 4:55 pm

One of my co-workers happens to be a co-host of “Food Fight”, a DevOps podcast. Last week, he asked for someone to join in for a crossover episode with “RCE”. When nobody else volunteered, he roped me into it. It turned out to be pretty awesome; I would have loved to extend the conversation a few more hours. With any luck, I’ll re-appear on one of those shows sometime. As you may already be aware, one of my goals is for Leo Laporte to personally invite me to the TWiT Brickhouse to get drunk with him on an episode of “This Week in Tech.” I feel like I’ve moved a little closer today.

Anyway, here are the links:

June 18, 2013

Extending rivalries to HPC

Filed under: HPC/HTC — Tags: , , , , — bcotton @ 8:12 am

In October, Indiana University announced it would purchase a Cray XK7 named “Big Red II”. With a theoretical peak of just over 1 petaFLOPS, it would be the fastest university-owned (not associated with a national center) cluster. Of course, in-state rivals would never let that stand. In the latest Top 500 list, unveiled at the International Supercomputing Conference, Big Red II ranks a very respectable 46th. Unfortunately for them, Purdue University’s new Conte cluster checked in at 28. Oops! Let’s compare:

| Cluster | Cost | Theoretical performance | LINPACK performance | Cost per benchmarked TFLOPS |
|---|---|---|---|---|
| Big Red II | $7.5 million | 1000.6 TFLOPS | 597.4 TFLOPS | $12.55k / TFLOPS |
| Conte | $4.3 million | 1341.1 TFLOPS | 943.4 TFLOPS | $4.56k / TFLOPS |
| Comparison | 57.33% | 134.03% | 157.92% | 36.33% |
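For anyone who wants to check the arithmetic, here is a short Python sketch that recomputes the table from the cost and performance figures above. The Comparison row expresses each Conte figure as a percentage of the Big Red II figure. The variable and function names are made up, and the last percentage is computed from the rounded per-TFLOPS costs, which appears to be how the table's 36.33% was obtained (the unrounded ratio comes out slightly lower).

```python
clusters = {
    "Big Red II": {"cost_usd": 7.5e6, "peak_tflops": 1000.6, "linpack_tflops": 597.4},
    "Conte": {"cost_usd": 4.3e6, "peak_tflops": 1341.1, "linpack_tflops": 943.4},
}


def cost_per_tflops_k(cluster):
    """Cost in thousands of dollars per benchmarked (LINPACK) TFLOPS."""
    return cluster["cost_usd"] / cluster["linpack_tflops"] / 1000


def percent(conte_value, big_red_value):
    """Conte's value as a percentage of Big Red II's, to two decimals."""
    return round(100 * conte_value / big_red_value, 2)


big_red, conte = clusters["Big Red II"], clusters["Conte"]
per_tflops = {name: round(cost_per_tflops_k(c), 2) for name, c in clusters.items()}

comparison = {
    "cost": percent(conte["cost_usd"], big_red["cost_usd"]),
    "peak": percent(conte["peak_tflops"], big_red["peak_tflops"]),
    "linpack": percent(conte["linpack_tflops"], big_red["linpack_tflops"]),
    # Computed from the rounded per-TFLOPS costs, as the table seems to do.
    "cost_per_tflops": percent(per_tflops["Conte"], per_tflops["Big Red II"]),
}

print(per_tflops)
print(comparison)
```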

It’s clear that Conte is the winner in performance and cost. But what about value? Both of these clusters have accelerators: Big Red II uses Nvidia GPUs and Conte uses Intel’s Phi (which also powers China’s new Tianhe-2, far and away the fastest cluster in the world). Using the GPU requires writing code in CUDA, whereas Phi will run native x86 code. This lowers the barrier to entry for users on Phi, but GPUs seem to win in most benchmarks. The need for CUDA would seem to increase the cost of providing user support on Big Red II, but it may be that IU’s users are already prepared to run on the GPU. All of the performance numbers in the world won’t matter if the clusters aren’t used, and only time will tell which cluster provides a better value.

What may end up being more interesting are the political ramifications. Will the statehouse be okay with the two main state universities both running expensive high performance computing resources? If not, who will get to carry on? Both institutions have a record of success. Indiana ranked as high as #23 on the June 2006 list, but Big Red II is the first Top 500 system there since November 2009. Meanwhile, Purdue has had at least one system (and as many as three) on every list since November 2008. With Conte and the additional clusters in operation, Purdue has much greater capacity, but that doesn’t mean that IU’s system is a waste.

I suspect that as long as both universities are bringing in enough grant money to justify the cost of their clusters, nobody in Indianapolis will care to put a stop to this rivalry. In the meantime, it appears that Purdue will remain the dominant HPC power in the state, as on the football field.

May 15, 2013

Treating clusters as projects

Filed under: HPC/HTC,Musings,Project Management — Tags: — bcotton @ 9:38 pm

Last month, HPCWire ran an article about the decommissioning of Los Alamos National Lab’s “Roadrunner” cluster. In this article was a quote from LANL’s Paul Henning: “Rather than think of these machines as physical entities, we think of them as projects.” This struck me as being a very sensible position to take. One of the defining characteristics of a project is that it is of limited duration. Compute clusters have a definite useful life, limited by the vendor’s hardware warranty and the system performance (both in terms of computational ability and power consumption) relative to new systems.

Furthermore, the five PMBOK process groups all come into play. Initiating and Planning happen before the cluster is installed. Executing could largely be considered the building and acceptance testing portion of a cluster’s life. The operational time is arguably Monitoring and Controlling. Closing, of course, is the decommissioning of the resource. Smaller projects such as software updates and planned maintenance occur within the larger project. Perhaps it is better to think of each cluster as a program?

The natural extension of considering a cluster (or any resource, for that matter) to be a project is assigning a project manager to each project. This was a key component of a staffing plan that a former coworker and I came up with for my previous group. With five large compute resources, plus a few smaller resources and the supporting infrastructure, it was very difficult for any one person to know what was going on. Our proposal included having one engineer assigned as the point-of-contact for a particular resource. This person didn’t have to fix everything on that cluster, but they would know about every issue and all of the unique configuration. This way, time wouldn’t be wasted doing the same diagnostic steps three months apart when a recurring issue recurs.

April 23, 2013

Monitoring sucks, don’t make it worse

Filed under: HPC/HTC,Linux — Tags: , , — bcotton @ 10:10 pm

You don’t have to go too far to find someone who thinks monitoring sucks. It’s definitely true that monitoring can be big, ugly, and complicated. I’m convinced that many of the problems in monitoring are not technical, but policy issues. For the sake of clarity (and because I’m like that), let’s start with some definitions. These definitions may or may not have validity outside the scope of this post, but at least they will serve to clarify what I mean when I say things.

  • Monitoring – an automatic process to collect metrics on a system or service
  • Alerting – notification when a critical threshold has been reached

In the rest of this post, I will be throwing some former colleagues under the bus. It’s not personal, and I’m responsible for some of the problem as well. The group in question has a monitoring setup that is dysfunctional to the point of being worthless. Not all of the problems are policy-related, but enough are to prompt this post. It should be noted that I’m not an expert on this subject, just a guy with opinions and a blog.

Perhaps the most important thing that can be done when setting up a monitoring system is coming up with a plan. It sounds obvious, but if you don’t know what you’re monitoring, why you’re monitoring it, and how you’re monitoring it, you’re bound to get it wrong. This is my first rule: in monitoring, failing to plan is planning to not notice failure.

It’s important to distinguish between monitoring and alerting. You can’t alert on what you don’t monitor, but you don’t need to alert on everything you monitor. This is one area where it’s easy to shoot yourself in the foot, especially at a large scale. Many of that group’s monitoring checks were added in reaction to something going wrong. As a result, Nagios ended up alerting for things like “a compute node has 95% memory utilization.” For servers, that’s important. For nodes, who cares? The point of the machines is to do computation. Sometimes that means chewing up memory.

Which brings me to rule number two: every alert should have a reaction. If you’re not going to do something about an alert, why have it in the first place? It’s okay to monitor without alerting — the information can be important in diagnosing problems or analyzing usage — but if an alert doesn’t result in a human or automated reaction, shut it off.
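Rule number two is easy to audit mechanically. Here is a minimal Python sketch, assuming the checks can be exported as a list of records; the field names and the sample checks are invented for illustration and do not correspond to any real Nagios configuration.

```python
# Hypothetical check definitions; names and fields are illustrative only.
checks = [
    {"name": "server_memory", "alert": True, "reaction": "page the on-call admin"},
    {"name": "node_memory", "alert": False, "reaction": None},  # monitored only
    {"name": "node_load", "alert": True, "reaction": None},     # alerts, nobody acts
]


def alerts_without_reaction(checks):
    """Return the names of checks that alert but have no defined reaction."""
    return [c["name"] for c in checks if c["alert"] and not c["reaction"]]


print(alerts_without_reaction(checks))
```

Anything this reports is a candidate for either getting a documented reaction or being demoted to monitoring only.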

Along that same line, alerts should be a little bit painful. Don’t punish yourself for something failing, but don’t make alerts painless either. Perhaps the biggest problem in the aforementioned group is that most of the admins filtered Nagios messages away. That immediately killed any incentive to improve the setup.

I took the alternate approach and weakly lobbied for all alerts to hit the pager. This probably falls into the “too painful” category. You should use multiple levels of alerts. An email or ticket is fine for something that needs to be acted on but can wait until business hours. A more obnoxious form of alert should be used for the Really Important Things[tm].
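The multiple levels can be as simple as a table that maps severity to notification channels. This Python sketch shows the idea; the severity names and channel names are assumptions for illustration, not settings from any particular monitoring system.

```python
from enum import Enum


class Severity(Enum):
    INFO = 1      # recorded for later analysis, nobody is notified
    WARNING = 2   # needs action, but can wait until business hours
    CRITICAL = 3  # the Really Important Things


# Hypothetical channel names; substitute whatever the monitoring system supports.
ROUTES = {
    Severity.INFO: [],
    Severity.WARNING: ["ticket"],
    Severity.CRITICAL: ["ticket", "pager"],
}


def route(severity):
    """Return the list of channels that should be notified for this severity."""
    return ROUTES[severity]


print(route(Severity.WARNING))
print(route(Severity.CRITICAL))
```

Only the top level reaches the pager, so the pain stays proportional to the problem.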

The great thing about having a little bit of pain associated with alerts is that it also acts as incentive to fix false alarms. At one point, I wrote Nagios checks to monitor HTCondor daemons. Unfortunately, due to the load on the Nagios server, the checks would time out and produce alerts. The daemons were fine and the condor_master process generally does a good job of keeping things under control. So I removed the checks.

The opposite problem is running checks outside the monitoring system. One colleague had a series of cron jobs that checked the batch scheduler. If the checks failed, he would email the group. Don’t work outside the system.

Finally, be sure to consider planned outages. If you can’t suppress alerts when things are broken intentionally, you’re going to have a bad time. As my friend tweeted: “Rough estimates indicate we sent something like 180,000 emails when our clusters went down for maintenance.”
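Monitoring systems such as Nagios have scheduled downtime for this purpose, but the underlying logic is small enough to sketch. The Python below is a minimal illustration under assumed names: the maintenance calendar, the host name, and the function are all hypothetical.

```python
from datetime import datetime

# Hypothetical maintenance calendar: host -> list of (start, end) windows.
MAINTENANCE = {
    "cluster-a": [(datetime(2013, 4, 20, 8, 0), datetime(2013, 4, 20, 18, 0))],
}


def should_alert(host, when, windows):
    """Return False if `host` is inside a planned maintenance window at `when`."""
    return not any(start <= when < end for start, end in windows.get(host, []))


during = datetime(2013, 4, 20, 12, 0)
after = datetime(2013, 4, 21, 12, 0)

print(should_alert("cluster-a", during, MAINTENANCE))  # inside the window
print(should_alert("cluster-a", after, MAINTENANCE))   # the next day
print(should_alert("cluster-b", during, MAINTENANCE))  # no window scheduled
```

The check is still collected during the outage; only the notification is suppressed.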

December 7, 2012

Coming up: LISA ’12

Filed under: Funnel Fiasco,HPC/HTC,Linux — Tags: , , , — bcotton @ 2:40 pm

It may seem like I’ve not been writing much lately, but nothing could be further from the truth. It’s just that my writing has been for grad school instead of Blog Fiasco. But don’t worry, soon I’ll be blogging like a madman. That’s right: it’s time for LISA ’12. Once again, I have the privilege of being on the conference blog team and learning from some of the TopPeople[tm] in the field. Here’s a quick look at my schedule (subject to change based on level of alertness, addition of BoFs, etc):

Sunday

Monday

Tuesday

Wednesday

Thursday

Friday

Now I just need to pack my bags and get started on the take-home final that’s due mid-week. Look for posts from me and my team members Matt Simmons and Greg Riedesel on the USENIX Blog.

October 12, 2012

A datacenter on the moon

Filed under: HPC/HTC — Tags: , , — bcotton @ 9:35 pm

Last week, Wired ran an article entitled “Why We Need a Datacenter on the Moon”. Surprisingly, it was a serious article, although more wistful than persuasive. The basic premise is that there’s a coming congestion problem for the Deep Space Network, a system of antennas that provides communication support for interplanetary satellites. By placing receivers on the far side of the moon, electromagnetic noise from the earth can be reduced. Presumably, the datacenter would be placed there so that only the “interesting” results would have to be sent back to earth.

All of this depends on two things: getting the equipment to the far side of the moon and getting people to the far side of the moon. These are obviously far from trivial dependencies. There are other technical hurdles as well. For one, communications back to Earth would require either highly elevated antennas or enough cable to reach the near side. Although cooling would be cheap, since the temperature during the lunar night is about -280°F, something has to dissipate the heat. The proposal suggests water cooling, which means the water will likely need to be heated to prevent freezing (there are multiple ways to accomplish this, including housing the equipment in a space designed for human occupancy).

Long before such a datacenter could be powered on, other workarounds will likely be put in place. The Wired article mentions the use of lasers for space-to-earth communication. Still, it’s an interesting idea that may inspire future space exploration efforts. If NASA is ever looking for a sysadmin for their Luna office, you can believe that I’ll have my resume submitted.

January 28, 2012

Book review: The Visible Ops Handbook

Filed under: HPC/HTC,Linux — Tags: , , , , — bcotton @ 2:42 pm

I first heard of The Visible Ops Handbook during Ben Rockwood’s LISA ’11 keynote. Since Ben seemed so excited about it, I added it to the list of books I should (but probably would never) read. Then Matt Simmons mentioned it in a brief blog post and I decided that if I was ever going to get around to reading it, I needed to stop putting it off. I bought it that afternoon, and a month later I’ve finally had a chance to read it and write a review. Given the short length and high quality of this book, it’s hard to justify such a delay.

Information Technology Infrastructure Library (ITIL) training has been a major push in my organization the past few years. ITIL is a formalized framework for IT service management, but it seems to be out of favor in the sysadmin community. After sitting through the foundational training, my opinion was of the “it sounds good, but…” variety. The problem with ITIL training and the official documentation is that you’re told what to do without ever being told how to do it. Kevin Behr, Gene Kim, and George Spafford solve that problem in less than 100 pages.

Based on observations and research of high-performing IT teams, The Visible Ops Handbook assumes that no ITIL practices are being followed. Implementation of the ITIL basics is broken down into four phases. Each phase includes real-world accounts, the benefits, and likely resistance points. This arms the reader with the tools necessary to sell the idea to management and sysadmins alike.

The introduction addresses a very important truism: “Something must need improvement, otherwise why read this?” The authors present a general recap of their findings, including these compelling statistics: 80% of outages are self-inflicted and 80% of mean time to repair (MTTR) is often wasted on non-productive activities (e.g. trying to figure out what changed).

Phase 1 focuses on “stabilizing the patient.” The goal is to reduce unplanned work from 80% of outage time to 25% or less. To do this, triage the most critical systems that generate the most unplanned work. Control when and how changes are made and fence off the systems to prevent unauthorized changes. While exceptions might be tempting, they should be avoided. The authors state that “all high performing IT organizations have only one acceptable number of unauthorized changes: zero.”

After reading Phase 1, I already had an idea to suggest. My group handles change management fairly well, but we don’t track requests for change (RFCs) well. Realizing how important that is, I convinced our group’s manager and our best developer that it was a key feature to add to our configuration management database (CMDB) system.

In Phase 2, the reader performs a catch & release program and finds “fragile artifacts.” Fragile artifacts are those systems or services with a low change success rate and high MTTR. After all systems have been “bagged and tagged”, it’s time to make a CMDB and a service catalog. This phase is the next place that my group needs to do work. We have a pretty nice CMDB that’s integrated with our monitoring systems and our job schedulers, but we lack a service catalog. Users can look at the website and see what we offer, but that’s only a subset of the services we run.
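To make the “fragile artifact” definition concrete, here is a minimal Python sketch that flags systems with a low change success rate and a high MTTR. Everything in it is an assumption for illustration: the history records are invented, and the thresholds (80% success, 4 hours) are placeholders, not figures from the book.

```python
# Hypothetical change and outage history per system; all numbers are invented.
history = {
    "scheduler": {"changes": 20, "failed_changes": 1, "repair_hours": [0.5, 1.0]},
    "legacy-fileserver": {"changes": 10, "failed_changes": 4, "repair_hours": [6.0, 9.0, 3.0]},
}


def change_success_rate(record):
    """Fraction of changes that did not fail."""
    return 1 - record["failed_changes"] / record["changes"]


def mttr_hours(record):
    """Mean time to repair, in hours (0.0 if there were no outages)."""
    repairs = record["repair_hours"]
    return sum(repairs) / len(repairs) if repairs else 0.0


def fragile(history, min_success=0.8, max_mttr=4.0):
    """Names of systems below the success threshold and above the MTTR threshold."""
    return sorted(
        name
        for name, record in history.items()
        if change_success_rate(record) < min_success and mttr_hours(record) > max_mttr
    )


print(fragile(history))
```

A list like this is one possible starting point for deciding which systems to bag and tag first.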

Phase 3 focuses on creating a repeatable build library. The best IT organizations make infrastructure easier to build than repair. A definitive software library, containing master images for all software necessary to rebuild systems, is critical. For larger groups, forming a separate release management team to engineer repeatable builds for the different services is helpful. The release management team should be separate from the operational group and consist of generally senior staff.

The final phase discusses continual improvement. If everyone stopped at “best practices”, no one would have a competitive advantage. Suggested metrics for each key process area are listed and explained. After all, you can’t manage what you can’t measure. Finding out what areas are the worst makes it easier to decide what to improve upon.

The last third of the book consists of appendices that serve as useful references for the four phases. One of the appendices includes a suggested table layout for a CMDB system. The whole book is focused on the practical nature of ITIL implementation and guiding organizational learning. At times, it assumes a large staff (especially when discussing separation of duties), so some of the ideas will have to be adapted to meet the needs of smaller groups. Nonetheless, this book is an invaluable resource to anyone involved in IT operations.
