Adding disks to a RAID array without being stupid

It is a common practice among sysadmins to avoid doing actual work on Fridays.  Not because we're lazy (okay, not entirely because we're lazy), but because if something goes wrong on Friday, it might not show up until Saturday.  Just because I'm on call doesn't mean I actually want to go in.  So of course I had reservations about installing new disks in our LDM (weather data) server, but it is important to classroom instruction and to my own geekery, so since the disks arrived on Thursday afternoon, I set aside my normal practice and did the install on Friday morning.

Backstory: our weather data server was set up about four years ago because the machine which had previously been tasked with data ingest could no longer keep up with the load.  We got a grant from Unidata's Equipment Award program to purchase a Dell PowerEdge 2850 with a 4×300 GB RAID 5, dual 3.6 GHz 64-bit processors, and 4 GB of RAM.  Not exactly a top-of-the-line machine, but a big step up from the desktop-class machine we had been using.  Fast-forward four years, and both the size and the number of data products have increased.  Now the 500-ish GB data partition is no longer sufficient.

The volume of data got to be so much that our scouring routine couldn't keep enough disk space clear, and eventually the data partition filled.  This caused the decoders that run on the data as it comes in to freak out and core dump.  The core dumps filled up the root partition, which ended up bringing the whole machine to its knees (and also taking down an LDAP server. Oops!).  Since there were two more disk slots available, the solution was simple: add more disk.  Which brings me to a week ago…

I scheduled a two-hour outage Friday morning, figuring it would be a quick job.  Oh boy, was I in for a surprise.  My first discovery was that the machine was apparently configured to use only three of the disks, with the fourth just chilling, not even as a hot spare.  The second discovery was that the array was generally cranky and felt that one of the disks needed to be rebuilt.  It took a little while to work out how to make that happen (the PERC 4 BIOS is not entirely intuitive), but I finally managed it.  Once the rebuild started, it became clear that it would take hours to finish.  The controller was very adamant that I not reboot the machine while that was going on, so I was pretty much stuck with it being down for the rest of the day.

Shortly after what should have been time to leave, the rebuild finished.  So I added the two new disks to the RAID 5, bringing it up to 5×300 GB (or 1.2 TB of usable space), and set the sixth disk as a hot spare.  The array needed to do more math to grow itself, and I estimated that it would take about 13 hours to finish, which meant I'd have to come in on Saturday morning.  I should have known better than to start this on a Friday, but the stupid had only just begun.

On my way in to the office on Saturday morning, I stopped at the local farmers' market to do some grocery shopping.  It wasn't until I got to my building that I realized I had forgotten my work keys.  This was a sign of things to come, although I did not know it at the time.  After returning home and then driving back to the office, I was finally able to get going.  Now comes the easy part, right?

I booted the machine, unmounted the data partition, and fired up fdisk.  I decided to first create a new partition for the LDM user's home directory, to keep / from getting filled.  Then I made another partition filling the rest of the disk, which would become my larger, more awesome data partition.  Once the partition table was written and the system rebooted, I created new filesystems on the two new partitions and said "done!"  Then I looked at the size of the data partition: 500 GB.  That wasn't right at all; it should have been more than double that.
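For the curious, the whole dance looked roughly like the sketch below.  The device name, partition numbers, and choice of ext3 are illustrative assumptions, not necessarily what the PERC presented on this particular box.

    # Rough sketch of the Saturday-morning steps.  /dev/sda, the partition
    # numbers, and ext3 are assumptions for illustration only.
    umount /data            # take the data partition offline first
    fdisk /dev/sda          # interactively: n, n to add the two new partitions, then w
    partprobe /dev/sda      # or just reboot, as I did, so the kernel re-reads the table
    mkfs.ext3 /dev/sda5     # new home partition for the ldm user
    mkfs.ext3 /dev/sda6     # the new, supposedly bigger, data partition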

Very perplexed, I began looking for an explanation.  fdisk said the disk was 1.2 TB, so why wouldn't it let me make the last partition any bigger?  I checked for known limitations, but with the block size I was using, filesystems of up to 2 TB should have been possible.  After beating my head against it for an hour, I finally decided that 500 GB would have to do for now and that I'd figure it out on Monday.

Over the weekend, I spent some time talking to my friend Randy about it, and he assured me he was just as confused as I was.  Something was imposing an artificial constraint on the size of this partition, but I couldn’t figure out what it was.  Come Monday, I sat down at my desk with a full pot of coffee ready to stab at it until it surrendered.  I’m not sure what made me think of it, but all of a sudden I understood what the problem was.

With Master Boot Record-style partitioning, only four "primary" partitions can exist.  If you want more than four partitions on a disk, one of those primary slots is made an "extended" partition, and partitions numbered 5 and above live inside it as "logical" partitions.  On this machine the extended partition was number 4, and it occurred to me that it had never been grown when I added the two extra disks to the array, so the logical partitions inside it could never be any bigger than the extended partition itself.  All I had to do was delete partition 4, re-make it to span the rest of the now-larger disk, and then re-make the partitions inside it.  Once I did that, I was able to create a 1.1 TB filesystem for data.  Problem solved.
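In fdisk terms, the fix amounted to something like this (the device name and partition numbers are again only illustrative):

    fdisk /dev/sda
    #   d, 4     delete the extended partition, which drops the logical
    #            partitions numbered 5 and up along with it
    #   n, e     re-create it as an extended partition spanning the rest
    #            of the now-larger disk
    #   n, l     re-create the logical partitions inside it
    #   w        write the new table and exit
    partprobe /dev/sda      # or reboot, then put filesystems on the new logical partitions
    mkfs.ext3 /dev/sda6     # the 1.1 TB data filesystem (ext3 assumed)

One caveat if you ever try this with logical partitions whose contents you want to keep: the re-created partitions have to start exactly where the old ones did, or the filesystems on them won't survive.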

It really bothered me that I hadn't figured that out on Saturday.  It was a very simple, quick fix, and I already knew everything I needed to solve it.  Granted, I'd never done anything quite like this before, but it should have occurred to me more quickly.  At least one good thing came of it, though: I'll never do actual work on a Friday again.

Why disk utilization matters

Here’s a rare weekend post to help make up for my lack of blogging this week.  Once again it is work related.  My life is boring and uneventful otherwise. 🙂

Unless you plan on sitting around babysitting your servers every minute of every day, it is probably a good idea to have a monitoring system like Nagios set up.  My department, eternal mooches that we are, opted not to set one up and instead use the service provided by the college-level IT staff.  It worked great, until one day it didn't anymore.  Some config change hosed the system and the Nagios service no longer ran.  I didn't consider it much of a big deal until about seven days ago.

This time last week, I was enjoying a vacation with my beautiful wife in celebration of our 2nd wedding anniversary.  When I got home Sunday evening, I noticed that several people had sent in e-mails complaining that they couldn’t log in to their Linux machines.  Like a fool, I spent the last few hours of my freedom trying to resolve the issue.  We figured out it was a problem with the LDAP server.  Requests went out, but no answers were ever received.  So after a too-long e-mail exchange, we got a workaround set up and I called it good enough.  I went to bed at one o’clock, thoroughly exhausted.

The next day, we started working on figuring out what the problem was.  At first it seemed like the issue was entirely with the LDAP server, which is run by the central computing group on campus.  I was pleased that it was not one of my systems.  Then they noticed a lot of open connections from two of my servers: one was our weather data website, and the other was our weather data ingest server.  Both machines work pretty hard, and at first I thought maybe one of the image generation processes had just choked and tripped up everything else.

Further investigation showed that the root cause of the issue was probably that the data partition on the ingest server was full.  This caused the LDM processes to freak out, which resulted in a lot more error messages in the log, which then filled up /var.  Now the system was running so slowly that nothing was behaving right, and since the web server is tightly married to the data server, they both ended up going crazy and murdering the LDAP server.

Now, there are scripts that are supposed to scour the data server to keep the disks from filling.  I thought perhaps something had kept them from running.  I looked through logs, through cron e-mails, and then ran some find commands by hand.  Everything suggested that the scouring was working as it should.  The more I looked, the more I realized the radar data had simply outgrown the disk.  I just need to add more disk.
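The by-hand checks were along these lines; the path and the one-day retention window are made-up examples, not our actual layout.

    # Spot-checking that old data really was getting deleted.
    # The path and retention period here are hypothetical.
    find /data/ldm/nexrad -type f -mtime +1 | wc -l    # anything older than a day left behind?
    df -h /data                                        # and yet the partition is still nearly full
    du -sh /data/ldm/*                                 # so where is the space actually going?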

Had I been keeping an eye on the disk usage these past few months, I would have known this sooner and been able to take care of it before critical services got beaten up.  I think on Monday I'll lend a hand getting the Nagios server up and running again.  Learn from my mistakes, readers!
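In the meantime, even a dumb cron job beats nothing.  Here's a minimal sketch of the kind of check I should have had in place; the mount points, the 90% threshold, and the e-mail address are placeholders.

    #!/bin/sh
    # Stopgap disk-usage alert, meant to run hourly from cron.
    # The mount points, threshold, and address below are placeholders.
    THRESHOLD=90
    ALERT=$(df -P /data /var | awk -v t="$THRESHOLD" \
        'NR > 1 { use = $5; sub("%", "", use); if (use + 0 >= t) print $6 " is at " $5 }')
    if [ -n "$ALERT" ]; then
        echo "$ALERT" | mail -s "Disk usage warning on $(hostname)" admin@example.edu
    fi

Once the real Nagios instance is back, its standard check_disk plugin does the same job properly, with separate warning and critical thresholds.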