Blog Fiasco

September 30, 2014

Cloud detente

Filed under: HPC/HTC,Linux,The Internet — bcotton @ 8:21 am

Evident.io founder and CEO Tim Prendergast wondered on Twitter why other cloud service providers aren’t taking marketing advantage of the Xen vulnerability that led Amazon and Rackspace to reboot a large number of cloud instances over a few-day period. Digital Ocean, Azure, and Google Compute Engine all use other hypervisors, so isn’t this an opportunity for them to brag about their security? Amazon is the clear market leader, so pointing out this vulnerability is a great differentiator.

Except that it isn’t. It’s a matter of chance that Xen is the hypervisor facing an apparently serious and soon-to-be-public exploit. Next week it could be Microsoft’s Hyper-V. Imagine the PR nightmare if Microsoft bragged about how much more secure Azure is, only to see a major exploit strike Hyper-V days later. It would be even worse if the exploit were active in the wild before patches could be applied.

“Choose us because of this Xen issue” is the cloud service provider equivalent of an airline running a “don’t fly those guys, they just had a plane crash” ad campaign. Just because your competition was unlucky this time, there’s no guarantee that you won’t be the loser next time.

I’m all for companies touting legitimate security features. Amazon’s handling of this incident seems pretty good, and I think they generally do a good job of giving users the ability to secure their environment. That doesn’t mean someone can’t come along and do it better. If there’s anything 2014 has taught us, it’s that we have a long road ahead of us when it comes to the security of computing.

It’s to the credit of Amazon’s competition that they’ve remained silent. It shows a great degree of professionalism. Digital Ocean’s Chief Technology Evangelist John Edgar had the best explanation for the silence: “because we’re not assholes mostly.”

August 16, 2014

Fedora Board proposal

Filed under: Linux — bcotton @ 2:04 pm

In an email to the Fedora community this week, the Fedora Board asked for comments on a proposed change to how Fedora is governed. Although I haven’t been as active in Fedora as I’d like, I still contribute and I still have opinions on the proposal. The following post is the feedback I provided on the board-discuss mailing list. In accordance with the desire to keep discussion from fragmenting, I have disabled comments on this post.

My initial reaction to this proposal was “what did I just read?” At first glance, it looked like a move from a democracy to a dictatorship. I even used the phrase “the Shuttleworthization of Fedora.” Having taken the time to process the proposal, as well as look at the accompanying material, my reaction has shifted. In the process of writing about the parts of the proposal I’d like to keep, I realized that I had essentially come up with the same proposal in different terms. My two point summary:

  • Lengthen board terms to reduce turnover (I’m not necessarily in favor of the indefinite terms as presented, but one year is too short)
  • Change the board from being entirely at-large to being representative of major constituencies

The Fedora Board, at least from the perspective of an irregular contributor, is indeed a very passive organization. To some degree, I find that appropriate for our community, but I can appreciate the arguments that a more active board would benefit the community and the product we labor to produce. The questions that arise are: “how active should the board be?” and “how do we structure the board such that it meets this need?”

My concern is that we’re addressing the second question before addressing the first. We don’t know where we’re going, but we know how we’re going to get there! The thread on board-discuss back in September was unclear about the intended relationship between a re-imagined board and FESCo. The proposal as presented offers no additional clarity. The proposal talks of leading and doing without really talking about the scope of responsibility. Perhaps that’s the main problem with the board as currently constructed?

July 1, 2014

Samba configuration: the ultimate cargo cult

Filed under: Linux — bcotton @ 4:45 pm

Samba is a magical tool that allows *nix and Windows machines to coexist in some forms of peace. It’s particularly helpful when you want to share files across platforms. I’ve maintained Samba servers at work and at home for nearly a decade now and I don’t pretend to understand it.

Over the years, I’ve come to view Samba as the poster child for cargo cult system administration. I suspect most people Google for their problem and apply whatever magic totem fixes it, without really understanding what’s actually going on. They share this knowledge and perpetuate the magical configuration. Allow me to do the same.

For one of the applications we support at my current job, our normal cluster configuration is a Linux file server with Windows execute nodes. The server provides anonymous read/write access to the execute nodes and forces the user server-side. (It’s a closed environment, so this is just a lot simpler.) During a recent project, we were handling a customer’s first foray into the cloud. We started from a configuration that we used for another customer running the same application. Oh, but this customer uses RHEL 6 servers, so we switched the setup from the RHEL 5 images we had been using.

Crap. That broke it. For some reason, the clients couldn’t write to the file server. After a late night of frantic effort (this was a project with a short timeline), we found we needed to add the following lines:

guest account = rap
map to guest = Bad User
valid users = rap, @rap
force group = rap
guest ok = yes

That seemed to solve the problem. Apparently there were some changes between the versions of Samba in RHEL 5 and 6. But then we discovered that hosts would start to write and then become unable to access the share. So we added the following:

writeable = yes
guest only = yes
acl check permissions = False

Oh, but then it turns out that sharing a directory over both Samba and NFS can cause weird timestamp issues. After some experimentation, we found it was necessary to stop using oplocks:

kernel oplocks = no
oplocks = no
level2 oplocks = no

So here’s our final, working config. Cargo cult away!

[global]
workgroup = WORKGROUP
netbios name = Samba
encrypt passwords = yes
security = share
log level = 2
socket options = TCP_NODELAY IPTOS_LOWDELAY SO_KEEPALIVE SO_RCVBUF=8192 SO_SNDBUF=8192
kernel oplocks = no
oplocks = no
level2 oplocks = no
max xmit = 65535
dead time = 15
getwd cache = yes
printcap name = /etc/printcap
use sendfile = yes
guest account = rap
map to guest = Bad User

[rap]
comment = File Share
path=/vol/smb/rap
force user = rap
valid users = rap, @rap
force group = rap
read only = no
writeable = yes
browseable = yes
public = yes
guest ok = yes
guest only = yes
acl check permissions = False
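
For what it’s worth, you can sanity-check a guest setup like this from any Linux client with smbclient before involving the Windows nodes. A minimal example, assuming the server is reachable as fileserver (substitute your own hostname):

# List the shares anonymously; -N skips the password prompt
smbclient -N -L //fileserver

# Connect to the share as a guest and try a write/list/delete round trip
smbclient -N //fileserver/rap -c 'put /etc/hostname test.txt; ls; rm test.txt'

If the put fails here, at least you’ve ruled out the Windows side.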

April 19, 2014

The right way to do release notes

Filed under: Linux,Project Management,The Internet — bcotton @ 8:52 pm

Forever ago (in Internet time), the developer(s?) of Pocket Casts released an update with some really humorous release notes:

Release notes for Pocket Casts 3.6.

As I do, I got to thinking about how I felt about it. While my initial reaction was amusement, I quickly came to find it unhelpful. In fact, most apps have awful release notes. My least favorite phrase, which seems to appear in the release notes of every updated app on my phone, is “and bug fixes.”

Despite the title of this post, there’s no one right way to write release notes. The “right” way depends on what you’re releasing, for one. In a Linux distribution like Fedora, release notes could be composed of the release notes for every component package. However, that would be monumentally unwieldy. Even the Fedora Technical Notes — which report only the changed packages, not the notes for those packages — is not likely to be read by many people. The Release Notes are a condensed view, which highlight prominent features. The Release Announcement is even further condensed, and is useful for media and public announcements. This hierarchy is a good example of the importance of knowing your audience.

I’ve seen arguments that release notes are unnecessary if the source code repository is accessible. Who needs release notes when you can just look at the commit log? This is a pretty lousy argument. A single change may be composed of many commits and a single commit may represent multiple changes (though it shouldn’t). Not to mention that commit messages are often poorly written. I’ve made far too many of those myself. Even if the commit log is a beautiful representation of what happened, it’s a lot to ask a consumer of your software to scour every commit since the last release.

My preference for release notes includes, in no particular order, a list of new features, bugs fixed, and known issues. The HTCondor team does a particularly good job in that regard. One thing I’d add to their release notes is an explicit listing of the subsystem(s) affected for each point. The exact format doesn’t particularly matter. All I’m looking for is an explanation as to why I should or should not care about a particular release. And “fixed some bugs” doesn’t tell me that.
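
To make that concrete, here’s a sketch of the format I’m describing (the version and items are invented for illustration):

Version 2.4.1

New features:
  • (scheduler) Jobs can now be submitted with a per-user priority

Bugs fixed:
  • (web UI) Fixed a crash when sorting an empty job list
  • (logging) Log rotation no longer drops the final line

Known issues:
  • (API) Pagination is ignored for result sets over 10,000 entries

Three headings and a subsystem tag on each item are enough to tell me whether I care about a release.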

January 9, 2014

Online learning: Codecademy

Filed under: Linux,mac,The Internet — bcotton @ 9:05 pm

Last week, faced with a bit of a lull at work and a coming need to do some Python development, I decided to work through the Python lessons on Codecademy. Codecademy is a website that provides free instruction on a variety of programming languages by means of small interactive example exercises.

I had been intending to learn Python for several years. In the past few weeks, I’ve picked up bits and pieces by reading and bugfixing a project at work, but it was hardly enough to claim knowledge of the language.

Much like the “… for Dummies” books, the lessons were humorously written, simple, and practical. Unlike a book, the interactive nature provides immediate feedback and a platform for experimentation. The built-in Q&A forum allows learners to help each other. This was particularly helpful on a few of the exercises where the system itself was buggy.

The content suffered from the issue that plagues any introductory instruction: finding the right balance between too easy and too hard. Many of the exercises were obvious from previous experience. By and large, the content was well-paced and at a reasonable level. The big disappointment for me was the absence of explanation and best practices. I often found myself wondering if the way I solved the problem was the right way.

Still, I was able to apply my newly acquired knowledge right away. I now know enough to be able to understand discussion of best practices and I’ll be able to hone my skills through practice. That makes it worth the time I invested in it. Later on, I’ll work my way through the Ruby (to better work with our Chef cookbooks) and PHP (to do more with dynamic content on this site) modules.

August 7, 2013

When your HP PSC 1200 all-in-one won’t print

Filed under: Linux — bcotton @ 10:50 am

I don’t think I’ve made it any secret that I hate printing. It’s still an inescapable part of my life, though. Last week, I was printing some forms for an event my wife was running the following day. We had just purchased new ink, so of course that was the ideal time for the paper to completely stop feeding. Wheels sounded like they were turning, but the printer would not pull any paper in. If you find yourself in a similar situation, fear not! I can tell you how to fix it. The first step is to go visit HP’s video on how to clean the rollers and whatnot:

http://www8.hp.com/h20621/video-gallery/us/en/customer-care/1245172367001/hp-psc-1200-not-pick-or-feed-paper/video/

Still here? That must mean you followed the steps in the video to no avail. It’s time to take the printer apart. If your printer is still under warranty or you’re skittish about doing this, then stop right here. Before you do any steps in the video above or my description below, make sure the printer is unplugged.

The first step is to remove the four screws at the top of the printer (one in each corner). You’ll need either a T10 Torx screwdriver or an appropriately-sized Allen wrench (I think 1/16″). Once those screws are out, remove the upper body of the printer as shown below. Lift the majority of the body, not just the very top part, or else you’ll just remove the scanner plate. Don’t be too alarmed if the ink access door comes off.

Separating the printer body for removal.

As you lift the body, carefully remove the two ribbons (shown below) by pulling them directly toward you.

The two ribbons to remove.

Give the white wheel on the left side a good shove inward. You may not feel it move, but this is the magic voodoo.

White wheel on the left of the paper roller. Push really hard on this wheel.

Replace the ribbons by pushing them firmly back into their slots. Put the ink access door back in place and set the printer body atop the printer. Tighten the screws. Plug the printer in, turn it on, and “enjoy” printing once again.

April 23, 2013

Monitoring sucks, don’t make it worse

Filed under: HPC/HTC,Linux — bcotton @ 10:10 pm

You don’t have to go too far to find someone who thinks monitoring sucks. It’s definitely true that monitoring can be big, ugly, and complicated. I’m convinced that many of the problems in monitoring are not technical, but policy issues. For the sake of clarity (and because I’m like that), let’s start with some definitions. These definitions may or may not have validity outside the scope of this post, but at least they will serve to clarify what I mean when I say things.

  • Monitoring – an automatic process to collect metrics on a system or service
  • Alerting – notification when a critical threshold has been reached

In the rest of this post, I will be throwing some former colleagues under the bus. It’s not personal, and I’m responsible for some of the problem as well. The group in question has a monitoring setup that is dysfunctional to the point of being worthless. Not all of the problems are policy-related, but enough are to prompt this post. It should be noted that I’m not an expert on this subject, just a guy with opinions and a blog.

Perhaps the most important thing that can be done when setting up a monitoring system is coming up with a plan. It sounds obvious, but if you don’t know what you’re monitoring, why you’re monitoring it, and how you’re monitoring it, you’re bound to get it wrong. This is my first rule: in monitoring, failing to plan is planning to not notice failure.

It’s important to distinguish between monitoring and alerting. You can’t alert on what you don’t monitor, but you don’t need to alert on everything you monitor. This is one area where it’s easy to shoot yourself in the foot, especially at a large scale. In the group I mentioned, many of the monitoring checks were added in reaction to something going wrong. As a result, Nagios ended up alerting for things like “a compute node has 95% memory utilization.” For servers, that’s important. For compute nodes, who cares? The point of the machines is to do computation. Sometimes that means chewing up memory.

Which brings me to rule number two: every alert should have a reaction. If you’re not going to do something about an alert, why have it in the first place? It’s okay to monitor without alerting — the information can be important in diagnosing problems or analyzing usage — but if an alert doesn’t result in a human or automated reaction, shut it off.
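
In Nagios terms, that’s just a matter of keeping the check while turning off the notification. A sketch of what that might look like, where the check_mem command and the compute-nodes hostgroup are hypothetical:

define service {
    use                   generic-service
    hostgroup_name        compute-nodes
    service_description   Memory utilization
    check_command         check_mem!90!95
    notifications_enabled 0    ; keep collecting the data, never page anyone
}

The history is still there when you need to diagnose a problem, and nobody gets woken up because a node is doing its job.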

Along that same line, alerts should be a little bit painful. Don’t punish yourself for something failing, but don’t make alerts painless either. Perhaps the biggest problem in the aforementioned group is that most of the admins filtered Nagios messages away. That immediately killed any incentive to improve the setup.

I took the alternate approach and weakly lobbied for all alerts to hit the pager. This probably falls into the “too painful” category. You should use multiple levels of alerts. An email or ticket is fine for something that needs to be acted on but can wait until business hours. A more obnoxious form of alert should be used for the Really Important Things[tm].

The great thing about having a little bit of pain associated with alerts is that it also acts as an incentive to fix false alarms. At one point, I wrote Nagios checks to monitor HTCondor daemons. Unfortunately, due to the load on the Nagios server, the checks would time out and produce alerts. The daemons were fine, and the condor_master process generally does a good job of keeping things under control. So I removed the checks.

The opposite problem is running checks outside the monitoring system. One colleague had a series of cron jobs that checked the batch scheduler. If the checks failed, he would email the group. Don’t work outside the system.

Finally, be sure to consider planned outages. If you can’t suppress alerts when things are broken intentionally, you’re going to have a bad time. As my friend tweeted: “Rough estimates indicate we sent something like 180,000 emails when our clusters went down for maintenance.”
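
Nagios can at least be told about maintenance ahead of time through its external command file, which suppresses notifications for the window. A rough sketch (the command file path varies by install, and the host name is made up):

# Schedule two hours of downtime for all services on node01
now=$(date +%s)
printf "[%s] SCHEDULE_HOST_SVC_DOWNTIME;node01;%s;%s;1;0;7200;bcotton;planned maintenance\n" \
  "$now" "$now" "$((now + 7200))" > /var/spool/nagios/cmd/nagios.cmd

Bake that into whatever script kicks off the maintenance, and those 180,000 emails never get sent.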

March 14, 2013

So long, Google Reader

Filed under: Linux,The Internet — bcotton @ 3:25 pm

In case you haven’t been paying attention in the past 24 hours, the Pope has killed Google Reader.

What? Oh! Okay, Google is killing Google Reader. On July 1, the best RSS client I’ve ever used will be no more. One of the more interesting aspects of the reaction is seeing how people have used it. I never really got into the sharing feature of Reader, so it didn’t bother me when it was discontinued in favor of Google Plus. For some people, that was apparently the main selling point.

My own use was generally selfish. I just wanted to know when something new was posted to a site. This is especially important for sites that don’t update regularly, as I’m not likely to keep checking a site every day on the off chance it’s been updated. I also don’t want to rely on social media to get updates. If I’ve been offline for a few days, I’m not going to catch up on all of the Twitter, Facebook, and Google+ posts I’ve missed. I will scroll through the entire collection of articles in Google Reader, reading those that seem interesting.

I can buy that RSS has seen a decline in usage (not in utility, but that’s a separate matter). I can understand that Google doesn’t find it worthwhile to keep Reader going. Like Casey Johnston, I suspect that it won’t go away entirely (as you may recall, the real-time editing technology in Google Wave made an excellent addition to Google Docs). But here’s the thing: I don’t really care.

Yes, I use Google Reader on a daily basis. I’m not tied to it, though. Reader doesn’t integrate with any other Google products in a way that’s meaningful for me. So while I have probably spent more time watching this woman’s face than my wife is comfortable with, I’ll make do without Google Reader. I don’t know what I’ll migrate to yet. NewsBlur has been brought up several times, although they currently aren’t allowing new free accounts (presumably due to being crushed by new users in the wake of yesterday’s announcement). I may also go the self-hosting route and set up tt-rss (which may also present an opportunity to run it as a paid service for those who can’t/won’t run it themselves). I still have a few months to figure it out.

February 13, 2013

How do you measure software quality?

Filed under: Linux — bcotton @ 10:30 am

There are two major license types in the free/open source software world: copyleft (e.g. GPL) and permissive (e.g. BSD). Because of the different legal ramifications of the licenses, it’s possible to make theoretical arguments that either license would tend to produce higher quality software. For my master’s thesis, I would like to investigate the quality of projects licensed under these paradigms, and whether there’s a significant difference. In order to do this, I’ll need some objective mechanism for measuring some aspect(s) of software quality. This is where you come in: if you have any suggestions for measures to use, or tools to get these measures, please let me know. It will have to be language-independent and preferably not rely on bug reports or other similar data. Operating on source would be preferable, but I have no objections to building binaries if I have to.

The end goal (apart from graduating) is to provide guidance for license selection in open source projects when philosophical considerations are not a concern. I have no intention or desire to turn this into a philosophical debate on the merits of different license types.

January 15, 2013

Deploying Fedora 18 documentation: learning git the hard way

Filed under: Linux,The Internet — bcotton @ 5:49 pm

If you haven’t heard, the Fedora team released Fedora 18 today. It’s the culmination of many months of effort and some very frustrating schedule delays. I’m sure everyone was relieved to push it out the door, even as some contributors worked to make sure the mirrors were stable and translations were updated. I remembered that I had forgotten to push the Fedora 18 versions of the Live Images Guide and the Burning ISOs Guide, so I quickly did that. Then I noticed that several of the documents that were on the site earlier weren’t anymore. Crap.

Here’s how the Fedora Documentation site works: contributors write guides in DocBook XML, build them with a tool called publican, and then check the built documents into a git repository. Once an hour, the web server clones the git repo to update the content on the site. Looking through the commits, it seemed like a few hours prior, someone had published a document without updating their local copy of the web repo first, which blew away previously-published Fedora 18 docs.

The fix seemed simple enough: I’d just revert to a few commits prior and then we could re-publish the most recent updates. So I did a `git reset --hard` and then tried to push. It was suggested that a `--force` might help, so I did. That’s when I learned that this basically sends the local git repo to the remote as if the remote were empty (someone who understands git better would undoubtedly correct this explanation), which makes sense. For many repos, this probably isn’t too big a deal. For the Docs web repo, which contains many images, PDFs, epubs, etc. and is roughly 8 GB on disk, this can be a slow process. On a residential cable internet connection which throttles uploads to about 250 KiB/s after the first minute, it’s a very slow process.

I sent a note to the docs mailing list letting people know I was cleaning up the repo and that they shouldn’t push any docs to the web. After an hour or so, the push finally finished. It was…a failure? Someone hadn’t seen my email and pushed a new guide shortly after I had started the push-of-doom. Fortunately, I had discovered the `git revert` command in the meantime. Instead of pretending the past never happened, revert creates new commits that back out the changes. After reverting four commits and pushing, we were back to where we were when life was happy. It was simple to re-publish the docs after that, and a reminder was sent to the group to ensure the repo is up-to-date before pushing.
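
For anyone following along at home, here’s roughly what the two approaches look like (branch and remote names are assumed, and the commit count matches our case):

# What I tried first: rewrites history, and in our case the forced
# push meant re-sending essentially the whole 8 GB repo
git reset --hard HEAD~4
git push --force origin master

# What worked: git revert adds new commits that undo the old ones,
# so the push only has to send the small revert commits
git revert --no-edit HEAD~4..HEAD
git push origin master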

The final result is that some documents were unavailable for a few hours. The good news is that I learned a little bit more about git today. The better news is that this should serve as additional motivation to move to Publican 3, which will allow us to publish guides via RPMs instead of an unwieldy git repo.
