Extended Monday maintenance, Starting already Friday 20th of November 2015



Message boards : Server backend and mirrors : Extended Monday maintenance, Starting already Friday 20th of November 2015

Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4461
Credit: 2,094,806
RAC: 0
Message 14180 - Posted: 20 Nov 2015, 6:53:36 UTC

Another big maintenance session this weekend. Same issue as last weekend: we are trying to fix the storage RAID.

Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4461
Credit: 2,094,806
RAC: 0
Message 14181 - Posted: 23 Nov 2015, 18:23:35 UTC
Last modified: 12 Dec 2015, 13:54:35 UTC

We just came out of this week's "Monday maintenance", which pretty much lasted the full weekend, and it was a great success! (phew!)
According to the disk manager we did lose some data, but only 64 KiB (that's right, 65536 bytes) out of the 13 TiB allocated to us on the defective RAID. We have some data on other RAIDs as well (short-term storage, the BBB3D archives, caching, etc.). Fortunately those were completely unaffected by this incident, but caching has been turned off temporarily because we ran out of disk slots during the rescue operation.
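To put the loss in proportion, here is a small arithmetic sketch. It only uses the two figures quoted above (64 KiB lost, 13 TiB allocated); the variable names are made up for illustration and nothing here comes from the disk manager itself.

```python
# Binary units: 1 KiB = 2**10 bytes, 1 TiB = 2**40 bytes.
KIB = 1024
TIB = 1024 ** 4

lost_bytes = 64 * KIB          # figure reported by the disk manager
allocated_bytes = 13 * TIB     # space allocated on the defective RAID

# Fraction of the allocated space that was lost.
fraction_lost = lost_bytes / allocated_bytes

print(f"lost:      {lost_bytes} bytes")
print(f"allocated: {allocated_bytes} bytes")
print(f"fraction:  {fraction_lost:.3e}")
```

That works out to roughly one byte lost for every 218 million bytes allocated.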

The next few days will be spent running database checks to make sure everything is as super as it sounds. Luckily most of those checks can be run while everything is online. Two databases (the workunit and result databases) require a bit of downtime, so those will be checked during next weekend's maintenance window.

The plan looks something like this now:

1) Get new raid to full sync status while starting up services
2) Run DB checks to make sure we are fully alive
3) Install additional disk slots
4) Re-enable cache filesystem to boost performance
5) Deep filesystem checks on hosted data, comparing the actual files against the MD5 checksums stored in the DB when they were originally written to disk <-- Happening right now
6) Rethink how we handle hardware failures like this, in particular how warranty of old drives is handled. What started this whole ordeal was (also) partly human error.
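
For readers curious what a check like step 5 involves, here is a minimal sketch of the general idea. It is not the actual script used on the server: the function names are hypothetical, and the dict of path-to-checksum pairs is an assumption standing in for whatever the DB query returns.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected):
    """Compare files on disk with stored checksums.

    `expected` maps a file path to the hex MD5 recorded when the file
    was first written (hypothetical stand-in for the DB contents).
    Returns a list of (path, reason) tuples for every problem found.
    """
    problems = []
    for path, stored_md5 in expected.items():
        if not Path(path).is_file():
            problems.append((str(path), "missing"))
        elif md5_of_file(path) != stored_md5.lower():
            problems.append((str(path), "checksum mismatch"))
    return problems
```

An empty list from `find_mismatches` would mean every listed file is present and matches its stored checksum. Reading the files in chunks matters here because a full pass touches every byte on disk, which is also why a check like this takes a long time.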

More info tomorrow

[Edit Monday 19:21 UTC] (1) Most services are up and running. Syncing has just begun on the last pair of disks. 18.5 hours expected
[Edit Tuesday 7:20 UTC] (1) 30% left on the sync
[Edit Thursday 15:25 UTC] (1) is complete. (2) has begun
[Edit Friday 16:25 2015-12-11 UTC] (2) is complete. (3) has begun
[Edit Saturday 13:54 2015-12-12 UTC] (3) and (4) completed, (5) soon to start

Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4461
Credit: 2,094,806
RAC: 0
Message 14205 - Posted: 12 Dec 2015, 14:12:02 UTC
Last modified: 12 Dec 2015, 14:16:16 UTC

Funny little story: During today's follow-up maintenance to install additional disk slots, a couple of old IDE Maxtor DiamondMax 10 300GB drives were removed from the server because they are to be replaced by much larger drives soon.
It turns out that these drives have been churning along since 2005: they were added to the storage server shortly after BURP started and have been running essentially 24/7 since then. That's a bit more than 10 years of non-stop service, and according to their internal diagnostics they are still in pristine condition.

There's something awe-inspiring about hardware that just works!

Profile noderaser
Project donor
Joined: 28 Mar 06
Posts: 506
Credit: 1,547,532
RAC: 35
Message 14206 - Posted: 13 Dec 2015, 7:05:34 UTC
Last modified: 13 Dec 2015, 7:06:46 UTC

Quantum/Maxtor made good drives. I just sent off a 16-year-old PowerMac that is still running on its original Quantum Fireball IDE drive.

