Blog Fiasco

August 30, 2010

It’s beginning to look a lot like LISA

Filed under: Uncategorized — bcotton @ 4:32 pm

We’re just over two months from the Large Installation System Administration (LISA) conference, and the website has recently been updated with details. I’ve never been to this conference before, but as a member of the official blog team, I’ll get to spend the week doing nothing but participating in, and writing about, LISA ’10. Can I write two blog posts and countless tweets every day? It will be a challenge, and I’m sure I’ll be tired of writing by the end, but there should be plenty to write about.

With three days of workshops, 48 training courses, and three days of technical sessions, there’s plenty to choose from.  I’m especially interested in the talk “Measuring the Value of System Administration” scheduled for Thursday morning.  Of course, each evening there will be Birds of a Feather (BoF) sessions, which I’m told are the most valuable part of the whole LISA experience.  BoFs are informal meetings of the minds, where admins who do similar work compare notes and pick up new ideas to bring home.  And drink beer.  I’m okay with that.  The BoF schedule is still pretty thin, but no doubt it will fill out as November approaches.

If you’re interested in attending LISA, you can register online at http://www.usenix.org/events/lisa10/registration/.  Registration is available in half-day increments, so you can pay for exactly the amount of conference you want, and if you register by October 18, you get the “early bird discount.”  I hope to see you all in San Jose!

July 12, 2010

There are two kinds of sysadmins in the world

Filed under: Musings — bcotton @ 8:09 am

I mentioned recently that in my experience there are two breeds of sysadmins: the long-hair and the short-hair.  I think we all can picture the long-hair breed.  They’re the stereotypical representation of sysadmins in the media: long hair (duh!), often bearded, generally overweight, sloppily-dressed, anti-social, addicted to caffeine.  Think Comic Book Guy from “The Simpsons”.  The lesser-known breed is the short-hair sysadmin. The short-hair has short hair (duh again!) and generally no facial hair, dresses professionally, and often has military experience.

Although it might seem like these two breeds are polar opposites, they do have some traits in common.  Because they are still sysadmins, both breeds tend to see themselves as the rulers of their domains (interestingly, the short-hairs tend to be more flexible and accommodating to end-users).   Security incidents are seen as an unforgivable personal insult, so paranoia is a desirable trait. Though short-hairs are more likely to have a social life, both breeds are quite geeky and prone to obsess over technical details.

Now I don’t claim to be an expert on the subject, and these are my own personal observations.  Nonetheless, I can’t think of any sysadmins I’ve come across who don’t fit generally into one of the two breeds.  Not everyone fits in precisely, but close enough that there’s no question which breed he or she is.  What’s interesting is that these two breeds don’t seem to clash professionally, perhaps because the easiest way to earn a sysadmin’s respect is to have unquestionable technical skill.

June 14, 2010

Ugly shell commands

Filed under: HPC/HTC, Linux — bcotton @ 6:25 am

Log files can be incredibly helpful, but they can also be really ugly.  Pulling information out programmatically can be a real hassle.  When a program exists to extract useful information (see: logwatch), it’s cause for celebration.  The following is what can happen when a program doesn’t exist (and yes, this code actually worked).

The scenario here is that a user complained that Condor jobs were failing at a higher-than-normal rate.  Our suspicion, based on a quick look at his log files, is that a few nodes are eating most of his jobs.  But how to tell?  I’ll want to create a spreadsheet that has the job ID, the date, the time, and the last execute host for all of the failed jobs.  I could either task a student to manually pull this information out of the log files, or I can pull it out with some shell magic.

The first step was to get the job ID, the date, and the time from the user’s log files:

grep -B 1 Abnormal ~user/condor/t?/log100 | grep "Job terminated" | awk '{print $2 "," $3 "," $4 }' | sed "s/[\(|\)]//g" | sort -n > failedjobs.csv

This searches the log files for the word “Abnormal” and prints one line before each match, because that’s where the information we want is.  To pull that line out, we search for “Job terminated”, then take the second, third, and fourth fields, strip the parentheses from the job ID, sort, and write the result to the file failedjobs.csv.
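To see what each stage contributes, the pipeline can be tried on a throwaway file first.  This is only a sketch: the sample lines, the job IDs, and the `demo` directory are made up for illustration, laid out the way the paragraph above describes (job ID in parentheses as the second field, followed by the date and time).

```shell
# Hypothetical sample data, not taken from a real user's log.
demo=$(mktemp -d)
cat > "$demo/log100" <<'EOF'
005 (102.000.000) 06/10 09:17:40 Job terminated.
    (0) Abnormal termination (signal 11)
...
005 (103.000.000) 06/10 09:20:00 Job terminated.
    (1) Normal termination (return value 0)
...
005 (101.000.000) 06/10 09:15:02 Job terminated.
    (0) Abnormal termination (signal 9)
...
EOF

# Same stages as the one-liner, one per line.  The character class [()]
# is enough to strip the parentheses from the job ID.
grep -B 1 Abnormal "$demo/log100" \
  | grep "Job terminated" \
  | awk '{print $2 "," $3 "," $4}' \
  | sed 's/[()]//g' \
  | sort -n > "$demo/failedjobs.csv"

cat "$demo/failedjobs.csv"
# 101.000.000,06/10,09:15:02
# 102.000.000,06/10,09:17:40
```

The job that terminated normally drops out at the first grep, and the sort puts the remaining two in job ID order.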

The next step is to get the last execute node of the failed jobs from the system logs:

for x in `cat failedjobs.csv | awk -F, '{print $1}'`; do
host=`grep "$x.*Job executing" /var/condor/log/EventLog* | tail -n 1 | sed -r "s/.*<(.*):.*/\1/g"`
echo "`host $host | awk '{print $5}'`" >> failedjobs-2.csv;
done

Wow.  This loop pulls the first field out of the CSV we made in the first step.  The IP address for each failed job is pulled from the Event Logs by searching for the “Job executing” string.  Since a job may execute on several different hosts in its lifetime, we want to only look at the last one (hence the tail command), and we pull out the contents of the angle brackets left of the colon.  This is the IP address of the execute host.
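The tail-and-sed step is easier to trust after watching it work on a couple of lines.  A minimal sketch, assuming a job that ran on two hosts: the addresses, ports, and the `demo` directory are invented, and I’ve used `sed -E`, which does the same thing as `-r` but is accepted by more versions of sed.

```shell
# Hypothetical EventLog excerpt: one job, two execute hosts.
demo=$(mktemp -d)
cat > "$demo/EventLog" <<'EOF'
001 (101.000.000) 06/10 08:55:10 Job executing on host: <192.0.2.15:40112>
001 (101.000.000) 06/10 09:05:31 Job executing on host: <192.0.2.27:40519>
EOF

jobid="101.000.000"

# tail keeps only the most recent "Job executing" line; sed keeps the
# text between "<" and the last ":" on that line.
ip=$(grep "$jobid.*Job executing" "$demo/EventLog" \
  | tail -n 1 \
  | sed -E 's/.*<(.*):.*/\1/')

echo "$ip"
# 192.0.2.27
```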

With that information, we use the host command to look up the hostname that corresponds to that IP address and write it to a file.  Now all that remains is to combine the two files and try to find something useful in the data.  And maybe to write a script to do this, so that it will be a little easier the next time around.
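Combining the files doesn’t need a spreadsheet, either.  Here is a rough sketch of that last step, assuming both files have exactly one line per failed job in the same order (which is how the loop writes them).  The rows, the hostnames, and the `demo` directory are made up for the example.

```shell
# Hypothetical contents of the two files produced by the earlier steps.
demo=$(mktemp -d)
printf '%s\n' \
  '101.000.000,06/10,09:15:02' \
  '102.000.000,06/10,09:17:40' \
  '104.000.000,06/11,14:02:11' > "$demo/failedjobs.csv"
printf '%s\n' \
  'node01.example.edu.' \
  'node02.example.edu.' \
  'node01.example.edu.' > "$demo/failedjobs-2.csv"

# Join the files line by line, with a comma between the two halves.
paste -d, "$demo/failedjobs.csv" "$demo/failedjobs-2.csv" > "$demo/combined.csv"

# Count failed jobs per execute host, busiest host first.
cut -d, -f4 "$demo/combined.csv" | sort | uniq -c | sort -rn > "$demo/by-host.txt"
cat "$demo/by-host.txt"
```

If a few nodes really are eating most of the jobs, they will be at the top of that list.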

May 21, 2010

Why it’s not always done the right way: difficulties with preempting Condor jobs when the disk is nearly full

Filed under: HPC/HTC — bcotton @ 7:38 am

In the IT field, there’s a concept called “best practice”, which is the recommended policy, method, etc. for a particular setting or action.  In a perfect world, every system would conform to the accepted best practices in every respect.  Reality isn’t always perfect, though, and there are often times when a sysadmin has to fall somewhere short of this goal.  Some Internet Tough Guys will insist that their systems are rock-solid and superbly secured. That’s crap; we all have to cut corners.  Sometimes it’s acceptable, sometimes it’s a BadThing™.  This is the story of one of the (hopefully) acceptable times.

(more…)

April 7, 2010

Solving the CUPS “hpcups failed” error

Filed under: Linux — bcotton @ 7:03 am

I thought when I took my new job that my days of dealing with printer headaches were over.  Alas, it was not to be.  A few weeks ago, I needed to print out a form for work.  I tried to print to the shared laser printer down the hall.  Nothing.  So I tried the color printer. Nothing again.  I was perplexed because both printers had worked previously, so being a moderately competent sysadmin, I looked in the CUPS logs.  I saw a line in error_log that read printer-state-message="/usr/lib/cups/filter/hpcups failed". That seemed like it was the problem, so I tried to find a solution and couldn’t come up with anything immediately.

Since a quick fix didn’t seem to be on the horizon, I decided that I had better things to do with my time and I just used my laptop to print.  That worked, so I forgot about the printing issue.  Shortly thereafter, the group that maintains the printers added the ones on our floor to their CUPS server.  I stopped CUPS on my desktop and switched to their server and printing worked again, thus I had even less incentive to track down the problem.

Fast forward to yesterday afternoon, when my wife tried to print a handbill for an event she is organizing in a few weeks.  Since my desktop at home is an x86_64 Fedora 12 system, too, it didn’t surprise me too much when she told me she couldn’t print.  Sure enough, when I checked the logs, I saw the same error.  I tried all of the regular stalling tactics: restarting CUPS, power cycling the printer, just removing the job and trying again.  Nothing worked.

The first site I found was an Ubuntu bug report which seemed to suggest maybe I should update the printer’s firmware.  That seemed like a really unappealing prospect to me, but as I scrolled down I saw comment #8.  This suggested that maybe I was looking in the wrong place for my answer.  A few lines above the hpcups line, there was an error message that read prnt/hpcups/HPCupsFilter.cpp 361: DEBUG: Bad PPD - hpPrinterLanguage not found.
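Asking grep for some context makes it harder to miss a line like that.  This is a sketch on a stand-in file: the two messages are the ones quoted in this post, but the timestamps, job number, line prefixes, and the filler line are my own assumptions, not copied from a real error_log.

```shell
# Hypothetical stand-in for the CUPS error_log; only the two quoted
# messages come from the real incident.
demo=$(mktemp -d)
cat > "$demo/error_log" <<'EOF'
D [06/Apr/2010:18:02:11 -0400] [Job 42] prnt/hpcups/HPCupsFilter.cpp 361: DEBUG: Bad PPD - hpPrinterLanguage not found
D [06/Apr/2010:18:02:11 -0400] [Job 42] (filler line standing in for other output)
E [06/Apr/2010:18:02:12 -0400] [Job 42] printer-state-message="/usr/lib/cups/filter/hpcups failed"
EOF

# Print each match along with the three lines that came before it.
context=$(grep -B 3 'hpcups failed' "$demo/error_log")
printf '%s\n' "$context"
```

With the lines before the match on screen, the “Bad PPD” message shows up right next to the error that looked like the culprit.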

A search for this brought me to a page about the latest version of hplip. Apparently, the new version required updated PPD files, which are the files that describe the printer to the print server.  In this case, updating the PPD file was simple, and didn’t involve having to find it on HP’s or a third-party website.  All I had to do was use the CUPS web interface and modify the printer, keeping everything the same except selecting the hpcups 3.10.2 driver instead of the 3.9.x that it had been using.  As soon as I made that change, printing worked exactly as expected.

The lesson here, besides the ever-present “printing is evil”, is that the error message you think is the clue might not always be.  When you get stuck trying to figure a problem out, look around for other clues.  Tunnel vision only works if you’re on the right track to begin with.

March 15, 2010

A warning about Condor access control lists

Filed under: HPC/HTC, Linux — bcotton @ 9:50 am

Like most sane services, Condor has a notion of access control.  In fact, Condor’s access control lists (ACLs) provide a very granular level of control, allowing you to set a variety of roles based on hostname/IP.  One thing we’re working on at my day job is making it easier for departments across campus to join our Condor pool.  In the face of budget concerns, a recommendation has been drafted which includes having departments choose between running Condor and powering machines off when not in use.  Given the preference for performing backups and system updates overnight, we’re guessing the majority will choose to donate cycles to Condor, so we’re trying to prepare for a large increase in the pool.

Included in that preparation is the switch from default-deny to default-allow-campus-hosts.  Previously, we only allowed specific subdomains on campus, but this means that every time a new department (which effectively means a new subdomain) joins the pool, we have to modify the ACLs.  While this isn’t a big deal, it seems simpler to just allow all of campus except the “scary” subnets (traditionally wireless clients, VPN clients, and the dorms. Especially the dorms.)  Effectively, we’ll end up doing that anyway, and so keeping the file more static should make it easier to maintain.

So on Wednesday, after the security group blessed our idea, I began the process of making the changes.  Let me point out here that you don’t really appreciate how much IP space an institution like Purdue has until you need to start blocking segments of it.  All of 128.10.0.0/16, 128.46.0.0/16, 128.210.0.0/16 and 128.211.0.0/16.  That’s a lot of public space, and it doesn’t include the private IP addresses in use.  So after combing through the IP space assignments, I finally got the ACLs written, and on Thursday I committed them.  And that’s when all hell broke loose.

Condor uses commas to separate as many hosts as you want, and asterisks can be used to wildcard hosts (Condor does not currently support CIDR notation, but that would be awesome).  The danger here is that if you accidentally put a comma in place of a period, you might end up denying write access to *. Obviously, this causes things to break down.  Once people started complaining about Condor not working, I quickly found my mistake and pushed out a correction.  However, Condor does not give up so quickly.  Once a rule is in DENY_WRITE, it will not stop denying that host until the Condor master has been stopped and re-started.  A simple config update won’t change it.

We had to learn that by experimentation, so I spent most of Friday helping my colleagues re-start the Condor process everywhere and testing the hell out of my changes.  Fortunately, once everything had been cleaned up, it worked as expected, and this gave me a chance to learn more about Condor.  And I also learned a very important lesson: test your changes first, dummy.
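Even a crude check before committing would have caught this one.  A minimal sketch, assuming the host list is a single comma-separated string: the function name and the sample entries are hypothetical, and this only looks for a bare asterisk, which is nowhere near a full test of an ACL change.

```shell
# Succeeds (exit status 0) if a comma-separated host list contains an
# entry that is nothing but "*", i.e. one that would match every host.
has_bare_wildcard() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -qx '\*'
}

intended='128.210.5.*, 128.211.7.*'   # hypothetical entries
typo='128.210.5,*, 128.211.7.*'       # a comma where a period was meant

if has_bare_wildcard "$intended"; then echo "intended list matches every host"; fi
if has_bare_wildcard "$typo"; then echo "typo list matches every host"; fi
# typo list matches every host
```

Splitting on commas is the point: it shows the list the way the comma-separated format splits it, so the stray `*` stands alone as its own entry.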

December 30, 2009

The changing role of IT

Filed under: Musings — bcotton @ 10:46 am

A year and a half ago, my director and I were having a discussion about my career plans and the IT landscape in the coming years.  “IT is the next blue-collar industry,” he said.  I agreed, but didn’t give the matter much consideration.  Recent discussions on /. and Standalone Sysadmin have brought the subject back to the front of my mind.

There was a time when computer knowledge was a relatively rare, and thus valuable, asset.  When I was in middle school, I was one of the few people I knew who had AOL. Many had computers, but few were connected to any network.  In shop class, the teacher, a fellow student and I would share our experiences with the performance of the different access numbers.  I don’t claim to be anywhere near the cutting edge, but at the time, I was one in a relatively small club.

According to the Census Bureau, 62% of US households had Internet access in the home in 2007.  This number continues to rise, and it seems most middle-class families are online.  People my age and younger have grown up using computers.  Older people, including my parents, have acquired computer skills at home and/or at work.  As a result, computer knowledge, at least from the desktop perspective, has become a commodity.  There’s a large population that can e-mail, browse the web, manage photos, add printers, etc. — at least to some degree.

Because more users have the ability to manage the routine tasks, the people paid to do these tasks lose stature.  Of course, not all IT staff do these tasks.  What we’ll see is the separation of IT from a monolithic entity into levels of expertise/responsibility.  Just like not everyone in the automotive industry is a mechanic, not everyone in IT is a technician either.

Help desk and other technician type positions will continue to become more blue collar, and I think that it is appropriate.  Systems and network admins, architects, and other higher-level positions are still, in my opinion, professional positions.  In environments where that’s not the case, it is up to these employees and their managers to make the case.

December 21, 2009

Becoming a sysadmin

Filed under: Linux, Musings — bcotton @ 11:01 am

Several months ago, my brother-in-law expressed interest in systems administration. After I couldn’t talk him out of it, he asked “how do I become a sysadmin?”  I gave it some thought and realized I didn’t have a good answer.  So I polled friends and strangers for their stories.  What resulted is yesterday’s post on the SysAdvent blog.

September 4, 2009

Adding disks to a RAID array without being stupid

Filed under: Linux — bcotton @ 6:44 am

It is a common practice among sysadmins to avoid doing actual work on Fridays.  Not because we’re lazy (okay, not entirely because we’re lazy), but because if something goes wrong on Friday, it might not show up until Saturday.  Just because I’m on call doesn’t mean I actually want to go in.  So of course I had reservations about installing new disks on our LDM (weather data) server.  But the server is important to classroom instruction and to my own geekery, and the disks had arrived on Thursday afternoon, so I set aside my normal practice and did the install on Friday morning.

Backstory: our weather data server was set up about four years ago because the machine which had previously been tasked with data ingest could no longer keep up with the load.  We got a grant from Unidata’s Equipment Award program to purchase a Dell PowerEdge 2850 with a 4×300 GB RAID 5, dual 3.6 GHz 64-bit processors, and 4 GB of RAM.  Not exactly a top-of-the-line machine, but a big step up from the desktop-class machine that we had been using.  Fast-forward four years and the size and number of data products have increased.  Now the 500ish GB data partition is no longer sufficient.

The volume of data got to be so much that our scouring routine couldn’t keep enough disk space clear, and eventually the data partition filled.  This caused the decoders that run on the data as it comes in to freak out and dump core.  The core dumps filled up the root partition, which ended up causing the machine to freak out (and also to take down an LDAP server; oops!).  Since there were two more disk slots available, the solution was simple: add more disk. Which brings me to a week ago…

I scheduled a two-hour outage Friday morning, figuring it would be a quick job.  Oh boy was I in for a surprise.  My first discovery was that the machine was apparently configured to use three of the disks and have the fourth just chill, not even as a hot spare.  The second discovery was that the array was generally cranky and felt that one of the disks needed to be rebuilt.  It took a little bit to figure out how to make that happen (the PERC 4 BIOS is not entirely intuitive), but I finally figured it out.  Once the rebuild started, it became clear that it would take hours to finish.  It was very adamant that I not reboot the machine while that was going on, so I was pretty much stuck with it being down for the rest of the day.

Shortly after what should have been time to leave, the rebuild finished.  So I added the two new disks to the RAID 5, bringing it up to 5×300 GB (or 1.2 TB of usable space) and set the 6th disk as a hot spare.  The array needed to do more math to grow itself, and I estimated that it would take about 13 hours to finish, which meant I’d have to come in on Saturday morning.  I should have known better than to start this on a Friday, but the stupid had only just begun.
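For anyone wondering why five 300 GB disks come out to 1.2 TB: RAID 5 spends one disk’s worth of space on parity.  A quick sketch of the arithmetic (the function name is just for illustration):

```shell
# Usable capacity of a RAID 5 array: (number of disks - 1) * disk size.
raid5_usable_gb() {
  disks=$1
  size_gb=$2
  echo $(( (disks - 1) * size_gb ))
}

raid5_usable_gb 3 300   # the three disks the array was actually using: 600
raid5_usable_gb 5 300   # after adding two more: 1200
```

The 6th disk doesn’t count toward either number, since a hot spare just sits idle until a disk fails.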

On my way in to the office on Saturday morning, I had stopped at the local farmers’ market to do some grocery shopping.  It wasn’t until I got to my building that I realized I had forgotten my work keys.  This was a sign of things to come, although I did not know it at the time.  After returning home and then driving back to the office, I was able to get going.  Now comes the easy part, right?

I booted the machine, unmounted the data partition, and fired up fdisk.  I decided to first create a new partition for the LDM user’s home directory, to keep / from getting filled.  Then I made another partition to fill the rest of the disk that would be my larger, more awesome data partition.  Once the partition table was written and the system rebooted, I created a new filesystem on each of the two new partitions and said “done!”  Then I looked at the size of the data partition: 500 GB.  That wasn’t right at all; it should have been more than double that.

Very perplexed, I began looking for an explanation.  fdisk said the disk was 1.2 TB, so why wouldn’t it let me make the last partition bigger?  I checked for limitations, but for the block size I was using, filesystems of 2 TB should be possible.  After beating my head against it for an hour, I finally decided that 500 GB would have to work for now and that I’d just figure it out on Monday.

Over the weekend, I spent some time talking to my friend Randy about it, and he assured me he was just as confused as I was.  Something was imposing an artificial constraint on the size of this partition, but I couldn’t figure out what it was.  Come Monday, I sat down at my desk with a full pot of coffee ready to stab at it until it surrendered.  I’m not sure what made me think of it, but all of a sudden I understood what the problem was.

In Master Boot Record-type partitioning, only four “primary” partitions can exist.  If you want more than four partitions on a disk, numbers 5 and above are logical partitions contained inside an extended partition, which takes up one of the four primary slots (on this machine, the fourth).  It occurred to me that the 4th partition was never grown when I had added the two extra disks to the array.  All I had to do was to delete partition 4 and then re-make it.  I could then re-make the partitions inside it.  Once I did that, I was able to create a 1.1 TB filesystem for data.  Problem solved.

It really bothered me that I hadn’t figured that out on Saturday.  It was a simple, quick solution, and I already knew everything I needed to find it.  Granted, I’d never done something like this before, but it should have occurred to me more quickly.  At least one good thing came of this, though: I’ll never do actual work on a Friday again.

July 31, 2009

Happy Sysadmin Appreciation Day

Filed under: Linux — bcotton @ 7:00 am

Many people refer to their job as a “calling.”  For many system administrators, there’s a calling as well.  Generally at some obscene hour because the monitoring system noticed that a critical server went down.  From your banking to your time wasting on Facebook, life as you know it is made possible by the system admins who work to keep things running.  Let’s face it, you probably never give any thought to those poor men and women who sit in their cramped, dim offices (or cubes — yuck!).  Nobody stops to think about the vast amounts of coffee, Mountain Dew, etc. that go into fueling the labor of the sysadmin.  These brave souls always have their BlackBerry or pager within arm’s reach — not because they want to, but because the SLA stipulates five nines.

Fortunately, today is the day you get to show your appreciation for all the work the sysadmins do for you.  Today is System Administrator Appreciation Day.  On this highest of holy days, show a sysadmin some love.  Buy him a beer.  Offer to take some of the empty soda cans out of her office.  Don’t break things.  If you feel comfortable, offer a platonic hug.  But don’t dawdle.  Even today, the sysadmin has plenty of work to do.
