You can do real HPC in the cloud

The latest Top 500 list came out earlier this week. I’m generally indifferent to the Top 500 these days (in part because China has two exaflop systems that it didn’t submit benchmarks for). But for better or worse, it’s still an important measure for many HPC practitioners. And that’s why the fact that Microsoft Azure cracked the top 10 is such a big deal.

For years, I heard that the public cloud can’t be used for “real” HPC. Sure, you can run throughput workloads, or small MPI jobs as a code test, but once it’s time to run the production workload, it has to be bare metal. That claim was never true. With a public cloud cluster now the 10th most powerful supercomputer in the world, there’s no question that it can be done.

So the question becomes: should you do “real” HPC in the cloud, whatever “real” means? There are cases where buying hardware and running it makes sense. There are cases where the flexibility of infrastructure-as-a-service wins. The answer has always been, and always will be, to run the workload on the infrastructure that best fits its needs. To dismiss cloud for all use cases is petty gatekeeping.

I congratulate my friends at Azure for their work in making this happen. I couldn’t be happier for them. Most of the world’s HPC happens in small datacenters, not the large HPC centers that tend to dominate the Top 500. The better public cloud providers can serve the majority of the market, the better it is for us all.

HPE acquiring Cray

So HPE is acquiring Cray. Given my HPC experience, you may wonder what my take is on this. (You probably don’t, but this is my blog and I’ve only had one post this week, so…) This could either go very well or very badly. If you want deeper insight, read Timothy Prickett Morgan’s article in The Next Platform.

How it could go well

HPE is strong in the high-performance computing (HPC) market. They had nearly 50% of the Top 500 systems 10 years ago, but their share of the list has fallen fairly steadily since, largely due to the growth of “others.” And they’ve never been dominant at the top end. Meanwhile, Cray, a name that was essentially synonymous with “supercomputing” for decades, has been on an upswing.

Cray had 37 systems on the Top 500 list in November 2010 and hasn’t dropped below that number since. From November 2012 through June 2015, Cray took off, peaking at 71 systems; they’ve been on a slow decline from that high point.

But the system count isn’t the whole story. Looking at the share of performance, Cray is consistently one of the top vendors. They currently account for nearly 14% of the list’s performance, and they were in the 20-25% range during their ascent in the early part of this decade.

And while the exascale race is good news for Cray, that revenue is bursty. When cloud providers started taking up some of the low-end HPC workloads, it wasn’t a concern for Cray; they don’t play in that space. But the cloud tide is rising (particularly as Microsoft’s acquisition of Evan Burness starts to pay dividends). When I was at Microsoft, we entered into a partnership with Cray. It was mutually beneficial: Microsoft customers could get a Cray supercomputer without having to find a place in the datacenter for it, and Cray customers could more easily offload their smaller workloads to cloud services.

So all of this is to say there’s opportunity here. HPE can get into the top-end systems, particularly the contracts with the U.S. Departments of Defense and Energy. And Cray can ignore the low-to-mid market because the HPE mothership will cover that. And both sides get additional leverage with component manufacturers. If HPE lets Cray be Cray, this could turn out to be a great success.

How it could go poorly

Well, as a friend said, “HPE is pretty good at ruining everything they buy.” I’ll admit that I don’t have a particularly positive view of HPE, but there’s nothing in particular I can point to as a reason. If HPE tries to absorb Cray into the larger HPE machine, I don’t think it will go well. Let Cray continue to be Cray, with some additional backing and more aggressive vendor relations, and it will do well. Try to make Cray more HPE-like, and it will be a poor way to spend a billion dollars.

The bigger picture

Nvidia gobbled up Mellanox. Xilinx bought Solarflare. Now HPE is acquiring Cray (having bought SGI a few years ago). Long-standing HPC vendors are disappearing into the arms of larger companies. It will be very interesting to see how this plays out in the market over the next few years. Apart from the ways technological diversity helps advance the state of the art, I wonder what this consolidation says about the market generally. Acquisitions like this can often be a way to show growth without having to actually grow anything.

LISA wants you: submit your proposal today

I have the great honor of being on the organizing committee for the LISA conference this year. If you’ve followed me for a while, you know how much I enjoy LISA. It’s a great conference for anyone with a professional interest in sysadmin/DevOps/SRE. This year’s LISA is being held in Nashville, Tennessee, and the committee wants your submission.

As in years past, LISA content is focused on three tracks: architecture, culture, and engineering. There’s great technical content (one year I learned about Linux filesystem tuning from the guy who maintains the ext filesystems), but there’s also great non-technical content. The latter is a feature more conferences need to adopt.

I’d love to see you submit a talk or tutorial about how you solve the everyday (and not-so-everyday) problems in your job. Do you use containers? Databases? Microservices? Cloud? Whatever you do, there’s a space for your proposal.

Submit your talk to https://www.usenix.org/conference/lisa18/call-for-participation by 11:59 PM Pacific on Thursday, May 24. Or talk one of your coworkers into it. Better yet, do both! LISA can only remain a great conference with your participation.

Why HTCondor is a pretty awesome scheduler

In early March, The Next Platform published an article I wrote about cHPC, a container project aimed at HPC applications. But as I wrote it, I thought about how HTCondor has been addressing a lot of the concerns for a long time. Since I’m in Madison for HTCondor Week right now, I thought this was a good time to explain some of the ways this project is awesome.

No fixed walltime. This is a benefit or a detriment, depending on the circumstances, but most schedulers require the user to define a requested walltime at submission. If the job isn’t done at the end of that time, the scheduler kills it. Sorry about your results; get back in line and ask for more walltime. HTCondor’s flexible configuration allows administrators to enable such a feature if desired. By default, users are not forced to make a guess that they’re probably going to get wrong.
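
For sites that do want a cap, here’s a minimal sketch of what that opt-in configuration might look like (the 24-hour limit and the exact expression are invented for illustration):

    # Hypothetical condor_config snippet: remove any job that has been
    # running (JobStatus == 2) for more than 24 hours.
    SYSTEM_PERIODIC_REMOVE = (JobStatus == 2) && \
                             ((time() - EnteredCurrentStatus) > 24 * 60 * 60)

Because SYSTEM_PERIODIC_REMOVE is evaluated against every job in the queue, a site can express the policy it actually wants rather than imposing a one-size-fits-all guess.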

Flexible requirements and resource monitoring. HTCondor natively supports user-requestable CPUs, memory, and GPUs. With partitionable slots, resources can be carved up on the fly. And HTCondor has “concurrency limits”, which allow for customizable resource constraints (software licenses, database connections, and so on).
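
To give a flavor of the submit side, here’s a hypothetical submit description file; the resource numbers and the “matlab” limit name are invented (limit names are whatever the administrator defines):

    # Hypothetical HTCondor submit description file
    universe           = vanilla
    executable         = simulate
    request_cpus       = 4
    request_memory     = 8 GB
    request_gpus       = 1
    # Consume one unit of the administrator-defined "matlab" limit
    concurrency_limits = matlab
    queue

On the pool side, the administrator caps the total with something like MATLAB_LIMIT = 10 in the configuration, and the negotiator ensures no more than ten such jobs run at once.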

So many platforms. Despite the snobbery of HPC sysadmins, people do real work on Windows. HTCondor has almost-full feature parity on Windows. It also has “universes” for Docker and virtual machines.
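
As a sketch of the Docker universe, a submit file needs little more than an image name (the image and command here are invented for the example):

    # Hypothetical Docker universe submit file
    universe      = docker
    docker_image  = debian:stable
    executable    = /bin/echo
    arguments     = "hello from inside a container"
    output        = hello.out
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue

The job runs inside the named container, but matchmaking, queuing, and file transfer work like any other HTCondor job.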

Federation. Want to overflow to your friend’s resource? You can do that! You can even submit jobs from HTCondor to other schedulers.
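
Flocking, the classic mechanism for this, is mostly a configuration matter on both pools. A hedged sketch of the submitting side (the host name is invented):

    # Hypothetical condor_config on the submitting pool: if jobs can't
    # be matched locally, try our friend's pool.
    FLOCK_TO = condor.friend.example.org
    # The friend's pool must also list us in its FLOCK_FROM.

Submitting to entirely different schedulers goes through the grid universe, which can hand jobs off to remote batch systems.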

Support for disappearing resources. In the cloud, this is the best feature. HTCondor was designed for resource scavenging on desktops, and it still supports that as a first-class use case. That means machines can come and go without much hassle. Contrast this to other schedulers where some explicit external action has to happen in order to add or remove a node.
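
That’s what makes elastic cloud pools feel so natural. A new execute node needs only a minimal configuration pointing at the central manager, something like the invented sketch below, and it starts matching jobs as soon as it reports in:

    # Hypothetical condor_config on a dynamically-added execute node
    CONDOR_HOST = central-manager.example.org
    DAEMON_LIST = MASTER, STARTD
    # Accept any job the negotiator matches to this node
    START       = TRUE

When the instance goes away, its slot ads simply expire from the collector; there’s no deregistration step.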

Free as in freedom and free as in beer. Free beer is also the best way to get something from the HTCondor team. And HTCondor is licensed under Apache 2.0, so anyone can use it for any purpose.

HTCondor isn’t perfect, and there are some use cases where it doesn’t make sense (e.g. low-latency), but it’s a pretty awesome project. And it’s been around for over three decades.

Other writing in February 2017

Where have I been writing when I haven’t been writing here?

The Next Platform

I’m freelancing for The Next Platform as a contributing author. Here are the articles I wrote last month:

Opensource.com

Over on Opensource.com, we managed our fifth consecutive million-page-view month, despite the short month. I wrote the articles below.

Also, the 2016 Open Source Yearbook is now available. You can get a free PDF download now or buy the print version at cost. Or you can do both!

Cycle Computing

Meanwhile, I wrote or edited a few things for work, too:

  • HyperXite case study – The HyperXite team used CycleCloud software to run simulations for their hyperloop pod.
  • ALS research case study – A professor at the University of Arizona quickly simulated a million compounds as part of a search for a pharmacological treatment for Lou Gehrig’s disease.
  • Transforming enterprise workloads – A brief look at how some of our customers transform their businesses by using cloud computing.
  • LAMMPS scaling on Microsoft Azure – My coworkers did some benchmarking of the InfiniBand interconnect on Microsoft Azure. I just wrote about it.
  • Various ghost-written pieces. I’ll never tell which ones!

Other writing in January 2017

Where have I been writing when I haven’t been writing here?

The Next Platform

I’m freelancing for The Next Platform as a contributing author. Here are the articles I wrote last month:

Opensource.com

Over on Opensource.com, we had our fourth consecutive month with a million-plus page views and set a record with 1,122,064. I wrote the articles below.

Also, the 2016 Open Source Yearbook is now available. You can get a free PDF download now or wait for the print version to become available. Or you can do both!

Cycle Computing

Meanwhile, I wrote or edited a few things for work, too:

  • Use AWS EBS Snapshots to speed instance setup — Staging reference data can be a time-expensive operation. This post describes one way we cut tens of minutes off of the setup time for a cancer research workload.
  • Various ghost-written pieces. I’ll never tell which ones!

Maybe your tech conference needs less tech

My friend Ed runs a project called “Open Sourcing Mental Illness”, which seeks to change how the tech industry talks about mental health (to the extent we talk about it at all). Part of the work involves the publication of handbooks developed by mental health professionals, but a big part of it is Ed giving talks at conferences. Last month he shared some feedback on Twitter:

So I got feedback from a conf a while back where I did a keynote. A few people said they felt like it wasn’t right for a tech conf. It was the only keynote. Some felt it wasn’t appropriate for a programming conf. Time could’ve been spent on stuff that’d help career. Tonight a guy from a company that sponsored the conf said one of [his] team members is going to seek help for anxiety about work bc of my talk. That’s why I do it. Maybe it didn’t mean much to you, but there are lots of hurting, scared people who need help. Ones you don’t see.

Cate Huston had similar feedback from a talk she gave in 2016:

the speaker kept talking about useless things like feelings

The tech industry as a whole, and some areas more than others, likes to imagine that it is as cool and rational as the computers it works with. Conferences should be full of pure technology. And yet we bemoan the fact that so many of our community are real jerks to work with.

I have a solution: maybe your tech conference needs less technology. After all, the only reason anyone pays us to do this stuff is because it (theoretically) solves problems for human beings. I’m biased, but I think the USENIX LISA conference does a great job of this. LISA has three core areas: architecture, engineering, and culture. You could look at it this way: designing, implementing, and making it so people will help you the next time around.

Culture is more than just sitting around asking “how does this make you feeeeeeeel?” It includes things like how to avoid burnout and how to train the next generation of practitioners. It also, of course, includes how not to be an insensitive jerk who inflicts harm on others with no regard for the impact they cause.

I enjoy good technical content, but I find that over the course of a multi-day conference I don’t retain very much of it. For a few brief hours in 2011, I understood SELinux and was all set to get it going at home and at work. Then I attended a dozen other sessions, and by the time I got home, I had forgotten all of the details. My notes helped, but it wasn’t the same. On the other hand, the cultural talks tend to be the ones that stick with me. I might not remember the details, but the general principles are lasting and actionable.

Every conference is different, but as a general starting point, I like one-third of the content to be non-technical. We’re all humans participating in these communities, and it serves no one to pretend we aren’t.

Other writing in December 2016

Happy new year! Where have I been writing when I haven’t been writing here?

SysAdvent

Once again, SysAdvent was a great success. The large community that has grown up around this project means I do less than in years past. I want to give others the opportunity to get involved, too. This year I edited one article:

The Next Platform

I’m freelancing for The Next Platform as a contributing author. Here are the articles I wrote last month:

Opensource.com

Over on Opensource.com, we hit the million page view mark for the third consecutive month. I wrote the articles below.

Cycle Computing

Meanwhile, I wrote or edited a few things for work, too:

  • LISA 16 Cloud HPC BoF — I summarized a BoF session at the LISA Conference in Boston.
  • Various ghost-written pieces. I’ll never tell which ones!

Other writing in November 2016

Where have I been writing when I haven’t been writing here?

The Next Platform

I’m freelancing for The Next Platform as a contributing author. Much like my role with Opensource.com as a Community Moderator, I look at the other names on the list and just say, “Wow! How did I end up in such good company?” The articles I wrote last month:

  • Advances in in situ processing tie to exascale targets — The growth in FLOPS is outpacing the growth in IOPS. Analyzing simulations as they run is becoming increasingly important for scientists and engineers.
  • Microsoft Research pens Quill for data intensive analysis — Collecting data is only useful to the extent that the data is analyzed. We have more data these days, but no platform that can handle both real-time streaming and post hoc analysis. The Quill project aims to change that.
  • JVM Boost shows warm Java is better than cold — The Java Virtual Machine allows “write once, run anywhere” but it imposes a performance penalty. For short-running jobs, the hit can be significant. The HotTub project speeds up these jobs (up to 30x in some cases!) by reusing JVM processes.

Opensource.com

Over on Opensource.com, I agreed to coordinate the Doc Dish column. I also wrote the articles below. It was a great month for the site. Three times during November, we set a single-day page view record. We also crossed the million page view mark for the second consecutive month and the third time in site history.

Cycle Computing

Meanwhile, I wrote or edited a few things for work, too:

  • Scale in a Cloudy World — I contributed an article to HPC Source about how to scale cloud HPC environments.
  • Various ghost-written pieces. I’ll never tell which ones!

Other writing in October 2016

Where have I been writing when I haven’t been writing here?

Over on Opensource.com, we had our second-ever month with a million page views! While I didn’t have any articles published, I did agree to coordinate the Doc Dish column, so there’s that.

Meanwhile, I wrote or edited a few things for work, too:

I also spoke at the All Things Open conference in Raleigh, NC. It went okay.