Why you shouldn’t be stupid

With today being a holiday, I figured it would be a good time to patch our Solaris servers. Like a good admin, I made sure it was announced ahead of time, but that didn’t stop a few users from complaining when their jobs failed. But this post isn’t about users, it’s about me. After the patch cluster finished installing, I rebooted our main file server. Everything seemed to come up okay, but when I tried to access the NFS-exported directories, I couldn’t.

I thought it was rather peculiar, but things happen sometimes. I noticed that several directories were working, all of them the newer ones. Odd. Oh wait! When I was re-working our first array (I’ll write a post about that soon), I commented some directories out of /etc/dfs/dfstab. When I rebooted the server, it didn’t re-share those directories. A quick edit to dfstab and a restart of nfs.server, and we were back in business. Or were we?
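In hindsight, a quick scan of dfstab before the reboot would have caught this. Here is a minimal Python sketch of that idea. The function name and the sample paths are made up for illustration, and it assumes a disabled export is simply a share line with a leading '#'.

```python
def commented_shares(dfstab_text):
    """Return commented-out lines that look like real share commands."""
    hits = []
    for line in dfstab_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith("#"):
            continue  # active lines and blank lines are not of interest
        tokens = stripped.lstrip("#").split()
        # Assumption: a disabled export starts with "share" and names an
        # absolute path; this skips ordinary explanatory comments.
        if tokens[:1] == ["share"] and any(t.startswith("/") for t in tokens[1:]):
            hits.append(" ".join(tokens))
    return hits


# Hypothetical dfstab contents, not the real file from this story.
sample = """\
# place share(1M) commands here for automatic execution
share -F nfs -o rw /export/new
#share -F nfs -o rw /export/old
"""
```

Calling commented_shares() on the contents of /etc/dfs/dfstab before rebooting lists every export that will not come back on its own.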

Three directories still weren’t automounting. One was complaining about permissions, and two complained that there was no such file or directory. By this point, my colleague Randy had popped online and was helping me take a look at this. He noticed that the ownership and permission on the server side were wrong for one of the directories. A typo in an rsync command a few days ago probably caused that problem, but it was easy enough to fix. So now only two directories were being angry.

But wait, the Solaris desktop machines weren’t having any problems; it was just the Linux clients. Well that’s odd. So we tried clearing the nscd cache and restarting autofs. No changes. If the problem was the autofs configuration, everything should be broken, right? Randy found a note in the client’s syslog that it couldn’t find the server. DNS problem perhaps? The server has an entry in /etc/hosts, so it shouldn’t be an issue, but it wouldn’t hurt to look. I ran tcpdump on the client and no traffic passed to the nameservers while I was trying to automount the directory, so it couldn’t be DNS.

I tried manually mounting one of the directories on /mnt. It worked fine. Odd. I ran snort on the server to look for traffic from the client I was testing on. There was absolutely no traffic when I was trying to automount. Well that would explain why it wouldn’t mount the directory, but why wasn’t the traffic getting through? Randy and I were beating our heads against the figurative wall, and the advertised outage window was minutes from closing. Then Randy sent me another snippet from the log file. I immediately saw the problem: the server name was mis-typed in the maps for the autofs configs. ‘server.empolyer.edu’ is not quite the same as ‘server.employer.edu’. A quick change of the config and everything was fixed.
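A check like the following would have found the typo in seconds instead of an hour and a half. It is only a rough Python sketch: the function names and the sample map entries are invented for illustration, and it assumes map entries use the plain host:/path form (for a replicated entry such as host1,host2:/path it only looks at the last host).

```python
import re
import socket

# Matches the host part of a "host:/path" location in an autofs map entry.
LOCATION = re.compile(r"(?<![\w.-])([A-Za-z0-9][A-Za-z0-9.-]*):/")


def hosts_in_map(map_text):
    """Collect every server name that appears as host:/path in the map."""
    hosts = set()
    for line in map_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        hosts.update(LOCATION.findall(line))
    return hosts


def resolves(host):
    """True if the system resolver (hosts file, DNS, ...) knows the name."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def unresolvable_hosts(map_text, resolver=resolves):
    """Return the sorted server names in the map that do not resolve."""
    return sorted(h for h in hosts_in_map(map_text) if not resolver(h))


# Hypothetical map entries; the keys and paths are made up.
sample_map = """\
projects  -rw,hard  server.employer.edu:/export/projects
scratch   -rw,hard  server.empolyer.edu:/export/scratch
"""
```

Running unresolvable_hosts() over the contents of each map file on a client should print nothing; any name it does print is worth a second look.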

An hour and a half of frustrating debugging because I mis-typed a few lines months ago. Oops.