You can do real HPC in the cloud

The latest Top 500 list came out earlier this week. I’m generally indifferent to the Top 500 these days (in part because China has two exaflop systems that it didn’t submit benchmarks for). But for better or worse, it’s still an important measure for many HPC practitioners. And that’s why the fact that Microsoft Azure cracked the top 10 is such a big deal.

For years, I heard that the public cloud can’t be used for “real” HPC. Sure, you can do throughput workloads, or small MPI jobs as a code test, but once it’s time to do the production workload, it has to be bare metal. That claim has always been wrong. With a public cloud cluster as the 10th most powerful supercomputer* in the world, there’s no question that it can be done.

So the question becomes: should you do “real” HPC in the cloud? For whatever “real” means. There are cases where buying hardware and running it makes sense. There are cases where the flexibility of infrastructure-as-a-service wins. The answer has always been—and always will be—run the workload on the infrastructure that best fits the needs. To dismiss cloud for all use cases is petty gatekeeping.

I congratulate my friends at Azure for their work in making this happen. I couldn’t be happier for them. Most of the world’s HPC happens in small datacenters, not the large HPC centers that tend to dominate the Top 500. The better public cloud providers can serve the majority of the market, the better it is for us all.

The AWS/VMware partnership

Disclosures: My employer is an AWS partner. This post is solely my personal opinion and does not represent the opinion of my employer or AWS. I have no knowledge of this partnership beyond what has been publicly announced. I also own a small number of shares of Amazon stock.

Last week, Amazon Web Services (AWS) and VMware announced a partnership that would make AWS the preferred cloud solution for VMware. AWS will provide a separate set of hardware running VMware’s software, managed by VMware staff. Customers can then provision a VMware environment from that pool that looks the same as an internal data center.

As others have pointed out, this is essentially a colocation service that just happens to be run by Amazon. I share that view of it, but I don’t take the view that AWS blinked. It’s true that AWS has eschewed hybrid cloud in favor of pure cloud offerings, and they’ve done quite well with that strategy.

I don’t think the market particularly cares about purity, nor do I think the message will get muddled. Here’s how I see this deal: VMware sees people moving stuff to the cloud, and they know that the more that trend continues, the smaller their market becomes. Meanwhile, AWS is printing money but is aware of the opportunity to print more. Microsoft Azure, despite having an easy answer for hybrid, doesn’t seem to be a real threat to AWS at the moment.

But I don’t think AWS leadership is stupid or complacent, and this deal represents a low-risk, high-reward opportunity for them. With this partnership, AWS now has an entry into organizations that have previously been cloud-averse. Organizations can dip their toes into “cloud” without having to re-tool (although this is not the best long-term strategy, as @cloud_opinion points out). As the organization becomes comfortable with the version of the cloud they’re using, it becomes easier for AWS sales reps to talk them into moving various parts to AWS proper.

Now I don’t mean to imply that AWS is a wolf in sheep’s clothing here. This deal seems mutually beneficial. VMware is going to face a shrinking market over time. With this deal, they at least get to buy themselves some time. For AWS, it’s more of a long game, and they can put as much or as little into this partnership as they want. For both companies, it’s a good argument to prevent customers from switching to Microsoft’s offerings.

What will be most interesting to see is whether Google Cloud, the other major infrastructure-as-a-service (IaaS) provider, will respond. Google’s strategy, up until about a year ago, has seemed to be “we’re Google, of course people will use us”. That has worked fairly well for startups, but it has very little traction in the enterprise. Google can continue to be more technically focused, but that will hinder their ability to get into major corporations (especially those outside of the tech industry).

I don’t see that there’s a natural fit at this point (though I also wouldn’t have expected AWS and VMware to pair up, so what do I know?). One interesting option would be for Google to buy Red Hat (disclosure: I also own a few shares of Red Hat) and make OpenShift its hybrid solution. I don’t see that happening, though, as it doesn’t seem like the right move for either company.

The VMware-on-AWS offering will not be generally available until sometime next year, so we have a little bit of time before we can see how it plays out.

Other writings in September 2016

Where have I been writing when I haven’t been writing here?

Over on Opensource.com, we had another 900k+ page views in the month: the fourth time in site history and the second consecutive month. I contributed two articles:

Meanwhile, I wrote a few things for work, too:

  • Cycle Computing: The cloud startup that just keeps kicking — The Next Platform wrote a very nice article about us, so I wrote a blog post talking about how nice it was. (Hey, I’m in marketing now. It’s what we do).
  • Cloud-Agnostic Glossary — Supporting multiple cloud-service providers means having to translate terms between them. I put together a Rosetta Stone to help translate relevant terms between AWS, Azure, and Google Cloud.
  • The question isn’t cost, it’s value — When people talk about the cost of cloud computing, they’re usually looking at the raw dollar value. Since it takes money to make money, that’s not always the right way to look at it. It’s better to consider the value generated.

DevOps is dead!

“$thing is dead!” is one of the more annoying memes in the world of technology. A TechCrunch article back in April claimed that managed services (of cloud infrastructure) are the death knell of DevOps. I dislike the article for a variety of reasons, but I want to focus on why the core argument is bunk. Simply put: “the cloud” is not synonymous with “magical pixie dust.” Real hardware and software still have to exist in order to run these services.

Amazon Web Services (AWS) is the current undisputed leader in the infrastructure-as-a-service (IaaS) space. AWS probably underlies many of the services you use on a daily basis: Slack and Netflix are two prime examples. AWS offers dozens of services for computation, storage, and networking, and it rolls out updates to datacenters across the globe many times a day. DevOps practices are what make that possible.

Oh, but the cloud means you don’t need your internal DevOps team! No. Shut up. “Why not simply teach all developers how to utilize the infrastructure tools in the cloud?” Because being able to spin up servers and being able to effectively manage and scale them are two entirely different concepts. It is true that cloud services can (not “must”!) take the “Ops” out of “DevOps” for development environments. But just as having access to WebMD doesn’t mean I’m going to perform my own surgery, being able to spin up resources doesn’t obviate the need for experienced management.

The author used “managed services provider” as an apparent synonym for “IaaS provider”. He ignored what I think of as “managed services”: a contracted team that manages a service for you. That’s what I believe to be the more realistic threat to internal DevOps teams. But it’s no different than any other outsourcing effort, and outsourcing is hardly a new concept.

At the end of the article, the author finally gets around to admitting that DevOps is a cultural paradigm, not a position or a particular brand of fairy dust. Cloud services don’t threaten DevOps, they make it easier than ever to practice. Anyone trying to convince you that DevOps is dead is just trying to get you to read their crappy article (yes, I have a well-tuned sense of irony, why do you ask?).

Amazon VPC: A great gotcha

If you’re not familiar with the Amazon Web Services offerings, one feature worth knowing about is the Virtual Private Cloud (VPC). VPC is effectively a way of walling yourself off from all or part of the world. If you’re running a public-facing web server, it might not be so important. If you’re running a compute cluster, it’s a no-brainer. Just be careful about that “no-brainer” part.

While working on a new cluster for a customer today, I was trying to figure out why the HTCondor scheduler wasn’t showing up in the collector. The daemons were all running. HTCondor security policies weren’t getting in the way. I could use condor_config_val from each host to query the other host. I brought in a colleague to double-check me. He couldn’t figure it out either.

After beating our heads against the wall for a while and finding nothing obviously helpful, I noticed one tiny detail in the logs: the schedd kept saying it was updating the collector, but the collector never seemed to notice. Specifically, the schedd said it was updating the collector via UDP. How many times had I watched that line go by?

The last time, though, it clicked. And it clicked hard. I had set up a security group to allow all traffic within the VPC. Except I had set it for all TCP traffic, so the UDP packets were being silently dropped. As UDP packets are wont to do. When I changed the security group rule from TCP to all protocols, the scheduler magically appeared in the pool.
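
For reference, here’s a minimal sketch of the fix using boto3. The security group ID, region, and VPC CIDR block below are made-up placeholders; the console works just as well, but this is the gist of the change:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # region is a placeholder

    # Allow *all* protocols (TCP, UDP, ICMP, ...) from anywhere inside the VPC.
    # IpProtocol="-1" is the EC2 shorthand for "all protocols"; the original rule
    # used IpProtocol="tcp", which is why the schedd's UDP updates were dropped.
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",                   # placeholder group ID
        IpPermissions=[
            {
                "IpProtocol": "-1",
                "IpRanges": [{"CidrIp": "10.0.0.0/16"}],  # placeholder VPC CIDR
            }
        ],
    )

Since security group rules are purely additive, adding the all-protocols rule is enough to let the UDP updates through; the old TCP-only rule just becomes redundant and can be cleaned up at your leisure.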

Once again, the moral of the story is: don’t be stupid.

CCA11: Cloud Computing and Its Applications

Earlier this week, I attended CCA11 at Argonne National Laboratory. I was there to present an extended abstract and take in what I could. I’ve never presented at a conference before (unless you count a short talk to kick off the Condor BoF at LISA ’10, which I don’t) and the subject of my abstract was work that we’ve only partially done, so I was a bit nervous. Fortunately, our work was well-received. It was encouraging enough that I might be talked into writing another paper at some point.

One thing I learned from the poster session and the invited talks is that the definition of “cloud” is just as ambiguous as ever. I continue to hate the term, although the field (however you define it) is doing interesting things. There’s a volunteer effort underway at NASA to use MapReduce to generate on-demand product visualization for disasters. An early prototype for Namibian flooding is at http://matsu.opencloudconsortium.org/.

Perhaps one of the largest concerns is the sheer volume of data. For example, the National Institutes of Health have over two petabytes of genomics data available, but how can you transfer that? Obviously, in most cases a user would only request a subset of data, but if there’s a use case that requires the whole data set, then what? One abstract presenter championed the use of sneakernet and argued that network bandwidth is the greatest challenge going forward.
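
To put the bandwidth problem in perspective, here’s a quick back-of-the-envelope calculation. The two-petabyte figure comes from the talk; the link speeds are just round numbers I picked for illustration:

    # Rough transfer times for a 2 PB data set at a few assumed line rates,
    # ignoring protocol overhead and assuming the link stays fully saturated.
    DATA_BYTES = 2 * 10**15  # 2 petabytes
    SECONDS_PER_DAY = 86400

    for label, gbps in [("1 Gbps", 1), ("10 Gbps", 10), ("100 Gbps", 100)]:
        seconds = DATA_BYTES * 8 / (gbps * 10**9)
        print(f"{label}: {seconds / SECONDS_PER_DAY:.1f} days")

    # Prints roughly 185.2, 18.5, and 1.9 days, respectively.

Even on a dedicated 100 Gbps link, that’s the better part of two days of sustained transfer, which is why a box of hard drives on a truck still gets taken seriously.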

One application that wasn’t mentioned is the cloud girlfriend. Maybe next year?

Thanks to Andy Howard and Preston Smith for their previous work and for helping me write the abstract.