Well, at some point, when it seemed like this was going to be a short story rather than a novel, I promised a full accounting of the events surrounding our server crash and subsequent downtime. At this point I don't think it's possible to marshal all the necessary details into a concise (or even rambling) explanation, but in an effort to fulfill that promise and satisfy curious minds, I will attempt to point out some of the highlights of the situation without getting too far into the blame game. That said, I should admit up front that the main culprit in the downtime is myself, as noted in the first point below:
Things that went particularly wrong:
After our previous server crash, the mail I sent off to the appropriate IT folk to make sure our backups were being properly conducted had a typo in the address line, so it was never received, much less acted upon. That single point of failure was really the downfall of the whole process, and I have to accept full responsibility for it (especially since I have scolded others in the past for introducing typos into things like email addresses and URLs when it would be far safer to use an address book entry or copy and paste from a known good location).
There was also an unfortunate communication breakdown (for which the fault lies elsewhere) that kept all the data from the old server from being transferred to the new one back when doing so would have headed off this whole problem. The actual server crash was heat related: apparently the cooling failed in the box that had been Frankensteined together, and things sort of melted down. It took a while to determine that the drives needed data recovery, and the recovered data was returned on a drive that was not usable in any system we had on hand (we specifically asked to get the data back on DVD, but that was apparently impractical; don't ask me how we then ended up with incompatible media). Eventually a machine had to be built to accommodate the drive, and this server was then placed in the data center for the transfer. Naturally a host of disk errors ensued, because what fun would disaster recovery be without further disasters?
Things that went particularly right:
The folks at UGO stepped up to the plate big time as far as providing resources: coming up with a brand new server to move the site onto temporarily (and now permanently), coordinating and paying for the data recovery, and then springing for the parts to build a new machine just to read the recovery drive. Additionally, their IT people, in particular Chris, devoted several man-days to this whole process. Bagpuss also deserves thanks for some timely technical assistance, as well as for jumping in and setting up the Chatbear forum we used during the recovery. Furn, a Blammo programmer who is now wrapped up in his studies, also sacrificed a good deal of his valuable spare time to work on various aspects of this.
Last, but far from least, is Frans, who is always a great friend to this site. Frans put in a yeoman effort on this project: helping sort through the aftermath of the crash, devising a plan going forward, and providing moral support when things looked bleak. Then, at the very end, sympathetic to the prospect that I would have to manually enter some 1200(!) stories from the last seven weeks into the database, he wrote a script to automate most of the database update, all while fighting the effects of pneumonia! Friends don't get any better than that, and I can only hope he knows how much it's appreciated. So huzzah and kudos to Frans, apologies for the downtime, and thanks for hanging in there with us.
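For the technically curious, a reload script along those lines might look something like the sketch below. To be clear, the story file format, field names, and SQLite stand-in database here are all invented for illustration; this is the general shape of the idea, not a description of Frans's actual script.

```python
#!/usr/bin/env python3
"""Sketch of a bulk story-reload script. The file format, field names,
and database layout are all hypothetical, chosen for illustration."""

import glob
import sqlite3

DB_PATH = "site.db"                      # stand-in for the real database
STORY_GLOB = "recovered/stories/*.txt"   # hypothetical recovered files

def parse_story(path: str) -> tuple[str, str, str]:
    """Assume each file is: a date line, a title line, a blank line, then the body."""
    with open(path, encoding="utf-8") as f:
        date = f.readline().strip()
        title = f.readline().strip()
        f.readline()                     # skip the blank separator line
        body = f.read().strip()
    return date, title, body

def main() -> None:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS stories
           (posted TEXT, title TEXT, body TEXT)"""
    )
    rows = [parse_story(p) for p in sorted(glob.glob(STORY_GLOB))]
    # One batched insert inside a single transaction beats ~1200
    # hand-typed entries, and it is easy to re-run if something goes wrong.
    with conn:
        conn.executemany("INSERT INTO stories VALUES (?, ?, ?)", rows)
    print(f"Inserted {len(rows)} stories.")

if __name__ == "__main__":
    main()
```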
Oh yeah, we will be implementing a disaster recovery plan that will hopefully be
less of a disaster going forward.
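To give a sense of what that might involve, here is a minimal sketch of an automated backup freshness check. The paths, addresses, and daily schedule are assumptions rather than our actual setup; the key idea is that the alert recipients live in the script instead of being retyped by hand, so a one-off typo can't silently break the loop the way it did this time.

```python
#!/usr/bin/env python3
"""Sketch of a backup freshness check. Paths and addresses are
hypothetical placeholders, not our real configuration."""

import os
import smtplib
import sys
import time
from email.message import EmailMessage

# Recipients live here, not retyped per message, so one typo
# can't silently disable the alerts.
ALERT_RECIPIENTS = ["admin@example.com", "backupwatch@example.com"]
BACKUP_DIR = "/var/backups/site"          # hypothetical backup location
MAX_AGE_SECONDS = 26 * 60 * 60            # daily backups, with some slack

def newest_backup_age(directory: str) -> float:
    """Return the age in seconds of the most recent file in directory."""
    mtimes = [
        os.path.getmtime(os.path.join(directory, name))
        for name in os.listdir(directory)
    ]
    if not mtimes:
        raise FileNotFoundError(f"no backups found in {directory}")
    return time.time() - max(mtimes)

def send_alert(subject: str, body: str) -> None:
    """Mail everyone on the list; a localhost SMTP server is assumed."""
    msg = EmailMessage()
    msg["From"] = "backupcheck@example.com"
    msg["To"] = ", ".join(ALERT_RECIPIENTS)
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    try:
        age = newest_backup_age(BACKUP_DIR)
        if age > MAX_AGE_SECONDS:
            send_alert("Backup is stale",
                       f"Newest backup is {age / 3600:.1f} hours old.")
            sys.exit(1)
    except Exception as exc:
        send_alert("Backup check failed", str(exc))
        sys.exit(1)
```

Run from cron once a day, something like this would have flagged the missing backups within a day or so, instead of after a crash.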