Our core infrastructure is highly redundant and can survive almost any failure. But this (needlessly long) story isn’t about that; no, it’s about an outlying building that houses our execs and their staff. Given the small number of staff and their (supposedly!) non-critical needs, we only have a single small server running VMware there. Recently this server has started rebooting randomly. What makes this really bad is that even though this box isn’t supposed to do anything critical, over time semi-important VMs have crept onto it that need to be running 7:30AM-7PM every day.
We called the vendor (Sun in this case) for troubleshooting. Sun says that it’s likely one of the 2 CPUs but they want to replace both CPUs and the mainboard just to be sure. Wonderful! I point out that we need to do this after hours. They say, paraphrasing: “No problem, sir! We can be there anytime! We have no life and are totally committed to our customers! We can be there in the middle of the night! We’ll leave brides waiting at altars! Or even … Oh, wait, my mistake. It looks like you have silver support, not gold. So we’ll see you sometime during regular business hours, then. Now get off the phone, you filthy peasant.”
I thought this box was on gold? Then I recall that I had dropped gold support on all of our VMware servers as part of cost cutting, because if one died its VMs would just fail over to another host anyway. Apparently I totally forgot that this one is not part of a cluster. Doh! So they’ll be here in the morning, but what to do about the downtime? What if something unexpected happens and the server doesn’t come up at all after the part swaps?
I come up with a crap plan, but it’s the least bad I can do in the timeframe. I’m going to down one of the clustered VMware hosts and reconfigure it for the remote building. Then I’ll vMotion everything away while Sun does their thing and vMotion it back when they’re done. But I’ll need some shared storage because the remote server has everything on internal disk. That made sense at the time but complicates things now. This won’t work without shared storage, so in desperation I do the unthinkable: I root around in our stockpile of retired hardware for some storage.
I come up with Grandpa’s ye olden Fibre Channel array from when disks were 73GB, people ran their SANs at 2Gbit/s, and rode in horse-drawn carriages. Boot it up and it turns out it has one failed disk; I can still remove that disk and set up a RAID5 with 2 hot spares, which should be good for a worst case scenario of needing to be up for 24-48 hours. Remote server doesn’t have an FC HBA; what to do? I have old 2Gbit/s HBAs but they’re PCI-X and the remote box only has PCIe. Fortunately I buy pairs of single-port FC HBAs instead of dual-port cards, so I can steal a PCIe HBA from the cluster host. I’m just killing redundancy all over the place here.
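For anyone wondering how much space a shelf like that actually yields, here is a rough sketch of the RAID5 arithmetic. The 73GB disk size is from the story; the 14-slot shelf is purely an assumption for illustration, since the real disk count isn't stated, and the function name is made up.

```python
def raid5_usable_gb(total_disks, hot_spares, disk_gb):
    """Usable capacity of one RAID5 group, in the same unit as disk_gb.

    Hot spares sit idle until a member fails, so they hold no data.
    RAID5 spends one member disk's worth of space on parity.
    """
    members = total_disks - hot_spares
    if members < 3:
        raise ValueError("RAID5 needs at least 3 member disks")
    return (members - 1) * disk_gb


if __name__ == "__main__":
    SHELF_SLOTS = 14           # hypothetical shelf size, not from the story
    healthy = SHELF_SLOTS - 1  # the one failed disk has been pulled
    print(raid5_usable_gb(healthy, hot_spares=2, disk_gb=73))
```

Under that assumed shelf size, 13 healthy disks minus 2 spares leaves 11 members, so 10 disks' worth of data: 730GB. Small by any modern standard, but plausibly enough for a handful of VMs for a day or two.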
Then I try to set up the array only to find out it’s not responding over the network, and while it has a serial interface the connector is non-standard! I hate this crap! How much money can you even make from proprietary cables? Then I recall that this array is from before we redid all the IP addressing, so I play with nmap to find the IP address in the old scheme. I also learn that I like the old ASCII telnet interface a lot more than the web GUI on our current arrays. Still mad about that serial interface.
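The hunt for the array's old address boils down to "walk the old subnet and see what answers on the telnet port". The real work was done with nmap; the following is only a minimal Python sketch of the same idea, not the actual commands used. The subnet is a hypothetical placeholder, the function names are made up, and it assumes the machine running it already has an address on the old network so the hosts are reachable at all.

```python
import ipaddress
import socket


def candidate_hosts(cidr):
    """Return every usable host address in the network as strings."""
    network = ipaddress.ip_network(cidr, strict=False)
    return [str(ip) for ip in network.hosts()]


def port_open(host, port=23, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not listening".
        return False


def find_telnet_hosts(cidr, port=23, timeout=0.5):
    """Probe each host in the network one by one; slow but simple."""
    return [h for h in candidate_hosts(cidr) if port_open(h, port, timeout)]


if __name__ == "__main__":
    OLD_SUBNET = "192.168.10.0/28"  # hypothetical stand-in for the old scheme
    for host in find_telnet_hosts(OLD_SUBNET):
        print(host)
```

A sequential connect scan like this is fine for a small range; for anything larger, nmap is the better tool since it probes in parallel and handles timeouts far more gracefully.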
Now this remote building looks very nice but apparently was designed without any understanding of IT needs at all (someone from marketing, I’m sure), so we don’t have anywhere to set this up besides a lobby near a janitor closet (!) that I’ve turned into a “server closet”. End up just leaving it all on the dolly. Then I start to cable everything up and find out I’m short one LC-LC fiber cable! I’m too deep into this to back out now. So I complete my transition to the dark side by going into the wiring closet … and … pulling a fiber cable from a redundant trunk between two switches. That’s the advantage of being in charge of IT; you can pull stuff that you would yell at a subordinate for.
Then I Storage vMotion and vMotion everything over and call it a night. At 7:00AM the Sun field tech shows up, generously an hour earlier than the contract specifies, to do the job. I can’t say he’s nonplussed but I’ve certainly seen many instances of greater plussed-ness.
So let’s have some pics of this super professional setup!
Nice! This is the kind of thing that makes IT look good! Right outside the CxO offices, no less! After I’m done here I’m going to get some cinder blocks and a rusted-out ’72 Chevy for the lawn.

Now this is sysadmin-of-the-year-level cabling right here! Surely nothing bad could come of this!

You know what this Sun tech (courtesy crap cell pic) and the salesperson at Tiffany’s in the movie “Breakfast at Tiffany’s” have in common? They’re both way too professional to point out how lame you are.

After the parts swap the server came up and I migrated everything back. So pretty uneventful in the end, but we did avoid an hour of downtime. I’m sure that if we hadn’t made these arrangements and had just told people “we’re down for an hour, deal with it”, something horrible would have happened and we would have been down for like a week.
So there you have it; I took the ghetto’s servers for this incident and they want them back. If you were laboring under the assumption that I have it together then this will disabuse you of the notion.