The SysAdmin Network

No more hiding in the server room

Isaac

Isaac? The ghetto is calling, they want their servers back.

Our core infrastructure is highly redundant and can survive almost any failure. But this (needlessly long) story isn’t about that; no, it’s about an outlying building that houses our execs and their staff. Given the small number of staff and their (supposedly!) non-critical needs, we only have a single small server running VMware. Recently this server has started rebooting randomly. What makes this really bad is that even though this box isn’t supposed to do anything critical, over time semi-important VMs have crept onto it that need to be running 7:30AM-7PM every day.

We called the vendor (Sun in this case) for troubleshooting. Sun says that it’s likely one of the 2 CPUs but they want to replace both CPUs and the mainboard just to be sure. Wonderful! I point out that we need to do this after hours. They say, paraphrasing: “No problem, sir! We can be there anytime! We have no life and are totally committed to our customers! We can be there in the middle of the night! We’ll leave brides waiting at altars! Or even … Oh, wait, my mistake. It looks like you have silver support, not gold. So we’ll see you sometime during regular business hours, then. Now get off the phone, you filthy peasant.”

I thought this box was on gold? I recall that I had dropped gold support on all of our VMware servers as part of cost cutting because if one died it would just fail over to another host anyway. Apparently I totally forgot that this one is not part of a cluster. Doh! So they’ll be here in the morning, but what to do about the downtime? What if something unexpected happens and the server doesn’t come up at all after the part swaps?

I come up with a crap plan but it’s the least bad I can do in the timeframe. I’m going to down one of the clustered VMware hosts and reconfigure it for the remote building. Then I’ll vMotion everything away while Sun does their thing and vMotion it back when they’re done. But I’ll need some shared storage because the remote server has everything on internal disk. That made sense at the time but complicates things now. This won’t work without shared storage so in desperation I do the unthinkable; I root around in our stockpile of retired hardware for some storage.

I come up with Grandpa’s ye olden Fibre Channel array from when disks were 73GB, people ran their SANs at 2Gbit/s, and rode in horse-drawn carriages. Boot it up and it turns out it has one failed disk; I can still remove that disk and set up a RAID5 with 2 hot spares, which should be good for a worst case scenario of needing to be up for 24-48 hours. Remote server doesn’t have an FC HBA; what to do? I have old 2Gbit/s HBAs but they’re PCI-X and the remote box only has PCIe. Fortunately I buy pairs of single port FC HBAs instead of dual port cards so I can steal a PCIe HBA from the cluster host. I’m just killing redundancy all over the place here.

Then I try to set up the array only to find out it’s not responding over the network, and while it has a serial interface the connector is non-standard! I hate this crap! How much money can you even make from proprietary cables? Then I recall that this array is from before we redid all the IP addressing, so I play with nmap to find the IP address in the old scheme. I also learn that I like the old ASCII telnet interface a lot more than the web GUI on our current arrays. Still mad about that serial interface.
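
For anyone stuck hunting a forgotten device in a retired addressing scheme without nmap handy, the same idea can be roughed out in Python. This is only a sketch and not what was run that night: the 192.0.2.0/24 range below is a documentation placeholder rather than the real old subnet, and the function names are made up for illustration. It simply tries a TCP connection to the telnet port on every host address in a range and reports the ones that answer.

```python
import ipaddress
import socket


def candidate_hosts(cidr):
    """Return every usable host address in the given network, as strings."""
    return [str(ip) for ip in ipaddress.ip_network(cidr).hosts()]


def port_open(host, port=23, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all of these as "not there".
        return False


def find_telnet_hosts(cidr, port=23, timeout=0.5):
    """Probe each host in the (assumed) old subnet and list those that answer."""
    return [h for h in candidate_hosts(cidr) if port_open(h, port, timeout)]
```

Calling find_telnet_hosts("192.0.2.0/24") walks the range one address at a time, so a full /24 with the default timeout can take a couple of minutes; the machine doing the probing also needs an address or route in the old range for anything to respond.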

Now this remote building looks very nice but apparently was designed without any understanding of IT needs at all (someone from marketing I’m sure) so we don’t have anywhere to set this up besides a lobby near a janitor closet (!) that I’ve turned into a “server closet”. End up just leaving it all on the dolly. Then I start to cable everything up and find out I’m short one LC-LC fiber cable! I’m too deep into this to back out now. So I complete my transition to the dark side by going into the wiring closet … and … pulling a fiber cable from a redundant trunk between two switches. That’s the advantage of being in charge of IT; you can pull stuff that you would yell at a subordinate for.

Then I SVmotion and vMotion everything over and call it a night. At 7:00AM the Sun field tech shows up, generously an hour earlier than the contract specifies, to do the job. I can’t say he’s nonplussed but I’ve certainly seen many instances of greater plussed-ness.

So let’s have some pics of this super professional setup!

Nice! This is the kind of thing that makes IT look good! Right outside the CxO offices no less! After I’m done here I’m going to get some cinder blocks and a rusted out ‘72 chevy for the lawn.

Now this is sysadmin of the year level cabling right here! Surely nothing bad could come of this!

You know what this Sun tech (courtesy crap cell pic) and the salesperson at Tiffany’s in the movie “Breakfast at Tiffany’s” have in common? They’re both way too professional to point out how lame you are.

After the parts swap the server came up and I migrated everything back. So pretty uneventful in the end, but we did avoid an hour of downtime. I’m sure that if we hadn’t made these arrangements and had just told people we’d be down for an hour and to deal with it, something horrible would have happened and we would have been down for like a week.

So there you have it; I took the ghetto’s servers for this incident and they want them back. If you were laboring under the assumption that I have it together then this will disabuse you of the notion.


Comment by Wesley "Nonapeptide" on November 5, 2009 at 12:32am
A couple of things I noted:

1. I find it amusing that the execs are lodged in the outlying building with sketchy equipment and have "non-critical" needs. In my experience, it's the execs that get direct fiber links to the SAN, their own storage group on the Exchange server and know the number to the secret, red, blinking IT bat phone. =)

2. The "junk" that you have lying around to concoct this hack is cooler and more expensive than the brand new stuff I get to work with. But I'll bet I know more than you about creating wireless backhauls with Campbell's soup cans, a Cap'n Crunch whistle and a Faraday coil! =)

Comment by Isaac on November 5, 2009 at 4:12am
Regarding point 1, you’re right; especially about the building. See, the thing is that the outlying building was our first building and is considered HQ but it’s small; in fact fewer than 20 people work there. Down a couple of blocks we have 3 larger buildings that house everyone else.

They do have a lot of dedicated resources. They have their own Exchange server (running enterprise edition no less!), their own file server, TS server, their own DC, their own app servers, a dedicated internet connection behind an ASA5510, a router for SRST, and more. But clustered VMware seemed just too expensive. However, like I said, unanticipated critical apps have come up on that box and that may change things, especially given this recent situation. We need to do something here; at the very least we need to restore gold support.
They don’t have an IT “red phone” but they have my private cell. :(

Regarding point 2: well, the hardware was respectable back in the day; it’s just old now. The worst thing was bringing up an old disk array for production that had been sitting collecting dust for over a year.

Maybe I’ve just gotten spoiled now that I don’t work at a company with an anemic IT budget; we’re non-profit so we get special pricing on all kinds of shiny toys.

Comment by Wesley "Nonapeptide" on November 5, 2009 at 9:43pm
Ah, okay. That sounds more like the executive treatment that IT people have to adapt to. =) Funny, I've never worked for a for-profit company. Only American 501(c)(3)s (in my case, things like churches), but none of them were terribly lucrative so no VMware clusters for me yet. I do so enjoy asking for non-profit pricing. Makes it slightly less annoying when purchasing Microsoft products. =)


© 2012 Created by Elizabeth Ayer and Michael Francis.
