You can do real HPC in the cloud

The latest Top 500 list came out earlier this week. I’m generally indifferent to the Top 500 these days (in part because China has two exaflop systems that it didn’t submit benchmarks for). But for better or worse, it’s still an important measure for many HPC practitioners. And that’s why the fact that Microsoft Azure cracked the top 10 is such a big deal.

For years, I heard that the public cloud can’t be used for “real” HPC. Sure, you can do throughput workloads, or run small MPI jobs as a code test, but once it’s time to do the production workload, it has to be bare metal. That claim has never been right. With a public cloud cluster as the 10th most powerful supercomputer* in the world, there’s no question that it can be done.

So the question becomes: should you do “real” HPC in the cloud, whatever “real” means? There are cases where buying hardware and running it yourself makes sense. There are cases where the flexibility of infrastructure-as-a-service wins. The answer has always been, and always will be, to run the workload on the infrastructure that best fits its needs. To dismiss the cloud for all use cases is petty gatekeeping.

I congratulate my friends at Azure for their work in making this happen. I couldn’t be happier for them. Most of the world’s HPC happens in small datacenters, not the large HPC centers that tend to dominate the Top 500. The better public cloud providers can serve the majority of the market, the better it is for us all.

Extending rivalries to HPC

In October, Indiana University announced it would purchase a Cray XK7 named “Big Red II”. With a theoretical peak of just over 1 petaFLOPS, it would be the fastest university-owned cluster (that is, one not associated with a national center). Of course, in-state rivals would never let that stand. In the latest Top 500 list, unveiled at the International Supercomputing Conference, Big Red II ranks a very respectable 46th. Unfortunately for IU, Purdue University’s new Conte cluster checked in at 28th. Oops! Let’s compare:

Cluster      Cost           Theoretical performance   LINPACK performance   Cost per benchmarked TFLOPS
Big Red II   $7.5 million   1000.6 TFLOPS             597.4 TFLOPS          $12.55k / TFLOPS
Conte        $4.3 million   1341.1 TFLOPS             943.4 TFLOPS          $4.56k / TFLOPS
Comparison   57.33%         134.03%                   157.92%               36.33%
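
For anyone who wants to check the arithmetic: the last column is just the purchase price divided by LINPACK performance, and the comparison row expresses Conte’s figure as a percentage of Big Red II’s. A quick sketch in C, using only the numbers from the table above:

```c
#include <stdio.h>

int main(void) {
    /* Figures from the table above: purchase price in dollars and
     * benchmarked (LINPACK) performance in TFLOPS. */
    double br2_cost   = 7.5e6, br2_linpack   = 597.4; /* Big Red II */
    double conte_cost = 4.3e6, conte_linpack = 943.4; /* Conte */

    /* Cost per benchmarked TFLOPS is just price / LINPACK TFLOPS. */
    double br2_per_tflops   = br2_cost / br2_linpack;     /* ~$12.55k */
    double conte_per_tflops = conte_cost / conte_linpack; /* ~$4.56k  */

    printf("Big Red II: $%.2fk per benchmarked TFLOPS\n", br2_per_tflops / 1000.0);
    printf("Conte:      $%.2fk per benchmarked TFLOPS\n", conte_per_tflops / 1000.0);

    /* The comparison row is Conte's number as a percentage of Big Red II's. */
    printf("Cost:     %.2f%%\n", 100.0 * conte_cost / br2_cost);             /* 57.33%  */
    printf("Peak:     %.2f%%\n", 100.0 * 1341.1 / 1000.6);                   /* 134.03% */
    printf("LINPACK:  %.2f%%\n", 100.0 * conte_linpack / br2_linpack);       /* 157.92% */
    printf("$/TFLOPS: %.2f%%\n", 100.0 * conte_per_tflops / br2_per_tflops); /* ~36.3%  */
    return 0;
}
```

(The computed ~36.3% differs slightly from the table’s 36.33% only because the table divides the already-rounded per-TFLOPS figures.)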

It’s clear that Conte is the winner on performance and cost. But what about value? Both clusters have accelerators: Big Red II uses Nvidia GPUs, while Conte uses Intel’s Xeon Phi (which also powers China’s new Tianhe-2, far and away the fastest cluster in the world). Using the GPU requires writing code in CUDA, whereas the Phi will run native x86 code (a minimal sketch of that difference follows at the end of this post). This lowers the barrier to entry for users on the Phi, but GPUs seem to win in most benchmarks. The CUDA requirement would seem to increase the cost of providing user support, but it may be that IU’s users are already prepared to run on the GPU. All of the performance numbers in the world won’t matter if the clusters aren’t used, and only time will tell which cluster provides the better value.

What may end up being a more interesting result is the political ramifications. Will the statehouse be okay with the two main state universities both running expensive high performance computing resources? If not, who will get to carry on? Both institutions have a record of success. Indiana ranked as high as #23 on the June 2006 list, but Big Red II is its first Top 500 system since November 2009. Meanwhile, Purdue has had at least one system (and as many as three) on every list since November 2008.

With Conte and its additional clusters in operation, Purdue has much greater capacity, but that doesn’t mean IU’s system is a waste. I suspect that as long as both universities are bringing in enough grant money to justify the cost of their clusters, nobody in Indianapolis will care to put a stop to this rivalry. In the meantime, it appears that Purdue will remain the dominant HPC power in the state, as on the football field.
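
To make the programming-model point concrete, here is a minimal sketch (not code from either center, just an illustrative example): a plain C/OpenMP loop like the one below runs unchanged on ordinary x86 cores and, rebuilt for Intel’s native MIC target, directly on a Xeon Phi card, while the GPU equivalent would need to be rewritten as a CUDA kernel with explicit host-to-device memory transfers.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

/* A plain OpenMP vector add. The same C source runs on the host's x86
 * cores or, rebuilt with Intel's native MIC target, directly on a Xeon
 * Phi card. A GPU version of this loop would instead be written as a
 * CUDA kernel, with explicit cudaMalloc/cudaMemcpy transfers between
 * host and device memory. */
static double a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
    }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %.1f, computed with %d threads\n",
           c[N - 1], omp_get_max_threads());
    return 0;
}
```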