|
Magellan disk crash
Magellan disk crash, loss of /var and /var/imap/user
An air-conditioning unit failure caused a hard disk crash on Magellan. The crash was
catastrophic and required a system rebuild from backups. Physics IT Support staff attended,
and by midnight Thursday all Internet services on Magellan except mail were operational. All
mailboxes were rebuilt by 7am Friday, at which point all services on Magellan had returned to
normal. Our apologies for the downtime.
On Friday morning Magellan was a little busier than usual, so people using Webmail likely
noticed a temporary drop in performance. As is normal, some mail clients did not handle the
mailbox rebuild well (reporting a couple of duplicate messages in some mailboxes, or
incorrect Read/Unread states), while other mail clients were unaffected.
Recovery process
The job took a few hours longer than it otherwise would have because we found we could recover the mailbox state to within minutes of the hard disk crash and corruption (up until midday Thursday 27, when Magellan had to be shut down). We achieved this by performing a non-destructive media analysis on the crashed hard disk and overlaying the recent mailbox state changes on the data restored from tape.
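For readers curious about what "overlaying" means in practice, the following is a minimal sketch only, not the procedure Physics IT Support actually ran. It assumes two directory trees: one restored from tape and one recovered from the crashed disk. It also assumes that a file's modification time is a good enough signal of which copy is more recent. The function name and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def overlay_newer_files(recovered_root: Path, restored_root: Path) -> List[Path]:
    """Copy files from recovered_root onto restored_root when they are new or newer.

    Illustrative assumption: "newer" is decided by file modification time.
    Returns the relative paths that were copied, in sorted order.
    """
    copied = []
    for src in sorted(recovered_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(recovered_root)
        dst = restored_root / rel
        # Overlay only if the tape copy is missing or older than the recovered copy.
        if not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves the modification time
            copied.append(rel)
    return copied
```

Files that exist only in the tape restore, or that are newer there, are left untouched, so the tape data remains the baseline and only more recent changes are layered on top.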
Risk management
We are addressing risk management issues more thoroughly year by year (as funding and time
permit). In line with this we have installed an ATA hard disk in Magellan so we can set up a
mirror arrangement to prevent such prolonged downtime in the future; mirroring with SCSI hard
disks of the kind used previously (i.e. $3.5K per unit, of which we require two) is
prohibitively expensive.
Would you like to know more?
Contact Physics IT Support at help@phys.unsw.edu.au
if you'd like further details.
|