Extending rivalries to HPC

In October, Indiana University announced it would purchase a Cray XK7 named “Big Red II”. With a theoretical peak of just over 1 petaFLOPS, it would be the fastest university-owned cluster not associated with a national center. Of course, in-state rivals would never let that stand. In the latest Top 500 list, unveiled at the International Supercomputing Conference, Big Red II ranks a very respectable 46th. Unfortunately for IU, Purdue University’s new Conte cluster checked in at 28th. Oops! Let’s compare:

Cluster                 Cost            Theoretical performance    LINPACK performance    Cost per benchmarked TFLOPS
Big Red II              $7.5 million    1000.6 TFLOPS              597.4 TFLOPS           $12.55k / TFLOPS
Conte                   $4.3 million    1341.1 TFLOPS              943.4 TFLOPS           $4.56k / TFLOPS
Conte vs. Big Red II    57.33%          134.03%                    157.92%                36.33%
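
For anyone who wants to check the math, here is a minimal Python sketch showing how the derived figures in the table fall out of the published cost and LINPACK numbers. The inputs are just the values quoted above; nothing else is assumed.

    # Quick sanity check of the derived figures in the table above.
    # The costs and benchmark numbers are the published figures quoted in this post.
    clusters = {
        "Big Red II": {"cost_usd": 7.5e6, "peak_tflops": 1000.6, "linpack_tflops": 597.4},
        "Conte":      {"cost_usd": 4.3e6, "peak_tflops": 1341.1, "linpack_tflops": 943.4},
    }

    for name, c in clusters.items():
        # Cost per benchmarked (LINPACK) TFLOPS, expressed in thousands of dollars
        cost_per_tflops_k = c["cost_usd"] / c["linpack_tflops"] / 1000
        print(f"{name}: ${cost_per_tflops_k:.2f}k per benchmarked TFLOPS")

    big_red, conte = clusters["Big Red II"], clusters["Conte"]
    print(f"Conte cost relative to Big Red II:    {conte['cost_usd'] / big_red['cost_usd']:.2%}")
    print(f"Conte LINPACK relative to Big Red II: {conte['linpack_tflops'] / big_red['linpack_tflops']:.2%}")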

It’s clear that Conte is the winner on both performance and cost. But what about value? Both clusters have accelerators: Big Red II uses Nvidia GPUs, while Conte uses Intel’s Xeon Phi (which also powers China’s new Tianhe-2, far and away the fastest cluster in the world). Using the GPU requires writing code in CUDA, whereas the Phi will run native x86 code. That lowers the barrier to entry for users on the Phi, but GPUs seem to win in most benchmarks. The CUDA requirement would seem to increase the cost of providing user support, though it may be that IU’s users are already prepared to run on the GPU. All of the performance numbers in the world won’t matter if the clusters aren’t used, and only time will tell which cluster provides a better value.

What may end up being a more interesting result is the political ramifications. Will the statehouse be okay with the two main state universities both running expensive high performance computing resources? If not, who will get to carry on? Both institutions have a record of success. Indiana ranked as high as #23 on the June 2006 list, but Big Red II is its first Top 500 system since November 2009. Meanwhile, Purdue has had at least one system (and as many as three) on every list since November 2008. With Conte and its other clusters in operation, Purdue has much greater capacity, but that doesn’t mean IU’s system is a waste. I suspect that as long as both universities bring in enough grant money to justify the cost of their clusters, nobody in Indianapolis will care to put a stop to this rivalry. In the meantime, it appears that Purdue will remain the dominant HPC power in the state, just as it is on the football field.

Treating clusters as projects

Last month, HPCWire ran an article about the decommissioning of Los Alamos National Lab’s “Roadrunner” cluster. The article included a quote from LANL’s Paul Henning: “Rather than think of these machines as physical entities, we think of them as projects.” That struck me as a very sensible position. One of the defining characteristics of a project is that it is of limited duration, and compute clusters have a definite useful life, limited by the vendor’s hardware warranty and by the system’s performance (both computational ability and power consumption) relative to newer systems.

Furthermore, all five PMBOK process groups come into play. Initiating and Planning happen before the cluster is installed. Executing largely covers the building and acceptance testing portion of a cluster’s life. The operational period is arguably Monitoring and Controlling. Project closeout, of course, is the decommissioning of the resource. Smaller projects, such as software updates and planned maintenance, occur within the larger project, so perhaps it is better to think of each cluster as a program?

The natural extension of treating a cluster (or any resource, for that matter) as a project is assigning a project manager to it. This was a key component of a staffing plan that a former coworker and I drew up for my previous group. With five large compute resources, plus a few smaller resources and the supporting infrastructure, it was very difficult for any one person to know what was going on. Our proposal included designating one engineer as the point of contact for each resource. That person didn’t have to fix everything on that cluster, but they would know about every issue and every piece of unique configuration. That way, time wouldn’t be wasted repeating the same diagnostic steps when an issue recurs three months later.

A 650-node 10Gb computer cluster: easy peasy

At Purdue, we have a long history of being a leader in the field of computing.  (After all, Ctrl+Alt+Del was invented by Purdue alum David Bradley.)  Since we’re a pretty geeky campus anyway, it’s more than a matter of professional pride; there’s street cred on the line too.  After building a large compute cluster last year, the research computing group on campus decided that effort needed to be one-upped this year.

Once again, volunteers from around Purdue and a few other institutions gathered to set up the cluster in a single day.  Once again, we finished way ahead of schedule.  This year, approximately 650 nodes went from box to OS install in less than three hours.  Jobs were already running by lunch time.

The process wasn’t entirely smooth, though.  For reasons not adequately explained to the volunteers, the 10 gigabit network cards (NICs) were not installed by the vendor.  That meant each machine first had to be opened up and have a NIC installed by hand.  That is what I did for two hours yesterday morning.

The NIC installation process wasn’t too difficult; there were only four screws to contend with.  The organizers had expected each volunteer to install about 15 NICs per shift.  I did 42 in my two-hour shift, and several others installed 50 or more.  At several points, the machines couldn’t be unboxed and put on our tables fast enough.

Several hundred more nodes will be installed once the external funding is processed, and Coates will likely end up reaching its maximum capacity of just over 1,200 nodes.  That gives it over 10,000 cores, all joined by 10 gigabit Ethernet connections, allowing an obscene amount of data to be processed and transferred, which is very helpful in big-data fields like the atmospheric sciences.

Expectations are high for Coates.  It is, as Steele was when it debuted, the largest compute cluster in the Big Ten.  Coates is expected to rank in the top 50 internationally when the supercomputer rankings come out in November.  It is also expected to be the first academic cluster connected solely with 10 gigabit Ethernet that is big enough to achieve an international ranking.  Perhaps most importantly, Coates is expected by Purdue researchers to facilitate some serious science.

Even though my contribution didn’t require much technical skill, I take pride in the fact that a whole rack of nodes can transfer data quickly because of the network cards that I installed.  This cluster is a big deal to those who care about clusters, and it is really nice to be a part of something so geekily awesome.  If you’re one of those people who care about clusters, the technical details are at http://www.rcac.purdue.edu/userinfo/resources/coates/