EC2 Archives – Blog Fiasco

Evident.io founder and CEO Tim Prendergast wondered on Twitter why other cloud service providers aren’t taking marketing advantage of the Xen vulnerability that led Amazon and Rackspace to reboot a large number of cloud instances over a few-day period. Digital Ocean, Azure, and Google Compute Engine all use other hypervisors, so isn’t this an opportunity for them to brag about their security? Amazon is the clear market leader, so pointing out this vulnerability is a great differentiator.

Except that it isn’t. It’s a matter of chance that Xen is the hypervisor facing an apparently serious and soon-to-be-public exploit. Next week it could be Microsoft’s Hyper-V. Imagine the PR nightmare if Microsoft bragged about how much more secure Azure is only to see a major exploit strike Hyper-V next week. It would be even worse if the exploit were active in the wild before patches could be applied.

“Choose us because of this Xen issue” is the cloud service provider equivalent of an airline running a “don’t fly those guys, they just had a plane crash” ad campaign. Your competition was unlucky this time, but there’s no guarantee that you won’t be the loser next time.

I’m all for companies touting legitimate security features. Amazon’s handling of this incident seems pretty good, and I think they generally do a good job of giving users the ability to secure their environment. That doesn’t mean someone can’t come along and do it better. If there’s anything 2014 has taught us, it’s that we have a long road ahead of us when it comes to the security of computing.

It’s to the credit of Amazon’s competition that they’ve remained silent. It shows a great degree of professionalism. Digital Ocean’s Chief Technology Evangelist John Edgar had the best explanation for the silence: “because we’re not assholes mostly.”

If you’re not familiar with the Amazon Web Services offerings, one feature is the Virtual Private Cloud (VPC). VPC is effectively a way of walling yourself off from all or part of the world. If you’re running a public-facing web server, it might not be so important. If you’re running a compute cluster, it’s a no-brainer. Just be careful about that “no-brainer” part.

While working on a new cluster for a customer today, I was trying to figure out why the HTCondor scheduler wasn’t showing up to the collector. The daemons were all running. HTCondor security policies weren’t getting in the way. I could use condor_config_val from each host to query the other host. I brought in a colleague to double-check me. He couldn’t figure it out either.

After beating our heads against the wall for a while and finding nothing obviously helpful in the logs, I noticed one tiny detail there. The schedd kept saying it was updating the collector, but the collector never seemed to notice. The schedd kept saying it was updating the collector via UDP. How many times had I watched that line go by?

The last time, though, it clicked. And it clicked hard. I had set up a security group to allow all traffic within the VPC. Except I had set it for all TCP traffic, so the UDP packets were being silently dropped. As UDP packets are wont to do. When I changed the security group rule from TCP to all protocols, the scheduler magically appeared in the pool.
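For anyone who wants to sanity-check their own rules, here is a minimal sketch of the mistake in Python. It doesn't talk to AWS at all. The rule dictionaries are modeled on the `IpPermissions` entries EC2 reports for a security group, where `"-1"` means all protocols. The helper names are mine, the source restriction (traffic from within the VPC) is ignored for brevity, and I'm assuming the collector listens on HTCondor's default port, 9618.

```python
# Hypothetical helpers; rule dicts are shaped like EC2 IpPermissions entries.

def rule_allows(rule, protocol, port):
    """Return True if a single ingress rule admits protocol/port."""
    proto = rule["IpProtocol"]
    if proto == "-1":  # EC2's spelling of "all protocols"
        return True
    if proto != protocol:
        return False
    return rule["FromPort"] <= port <= rule["ToPort"]


def group_allows(rules, protocol, port):
    """Security groups are allow-lists: any matching rule is enough."""
    return any(rule_allows(rule, protocol, port) for rule in rules)


COLLECTOR_PORT = 9618  # assumption: HTCondor's default collector port

# What I had: every TCP port open, nothing else.
tcp_only = [{"IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535}]
# What I changed it to: all protocols.
all_protocols = [{"IpProtocol": "-1"}]

print(group_allows(tcp_only, "tcp", COLLECTOR_PORT))       # True
print(group_allows(tcp_only, "udp", COLLECTOR_PORT))       # False
print(group_allows(all_protocols, "udp", COLLECTOR_PORT))  # True
```

The middle line is the whole bug: TCP queries like condor_config_val got through, while the schedd's UDP updates matched no rule and were dropped without a trace.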

Once again, the moral of the story is: don’t be stupid.

The world's only(?) FOSS/weather/sports/marketing/high-performance computing blog

Tag Archives: EC2

Cloud detente