Extended Monday maintenance, Starting already Friday 20th of November 2015
Author | Message |
---|---|
Volunteer moderator, Project administrator. Joined: 16 Jun 04 Posts: 4574 Credit: 2,100,463 RAC: 8 |
Another big maintenance session this weekend. Same issue as last weekend: trying to fix the storage raid. |
Volunteer moderator, Project administrator. Joined: 16 Jun 04 Posts: 4574 Credit: 2,100,463 RAC: 8 |
We just came out of this week's "Monday maintenance", which pretty much lasted the full weekend, and it was a great success! (phew!)

According to the disk manager we did lose some data, but only 64 KiB (that's right, 65536 bytes) out of the 13 TiB allocated to us on the defective raid. We have some data on other raids as well (short-term storage, the BBB3D archives, caching, etc.). Those raids were fortunately completely unaffected by this incident, but caching has been turned off temporarily because we ran out of disk slots during the rescue operation.

The next few days will be spent running some database checks to make sure everything is as super as it sounds. Luckily most of those checks can be run while everything is online. Two databases (namely the workunit and result databases) require a bit of downtime. Those will be checked next weekend during the maintenance window.

The plan looks something like this now:
1) Get new raid to full sync status while starting up services
2) Run DB checks to make sure we are fully alive
3) Install additional disk slots
4) Re-enable cache filesystem to boost performance
5) Deep filesystem checks on hosted data, comparing actual files with the MD5 checksums stored in the DB when they were originally written to disk <-- Happening right now
6) Rethink how we handle hardware failures like this. In particular how warranty of old drives is handled: what started this whole ordeal was (also) partly a human error.

More info tomorrow

[Edit Monday 19:21 UTC] (1) Most services are up and running. Syncing has just begun on the last pair of disks. 18.5 hours expected
[Edit Tuesday 7:20 UTC] (1) 30% left on the sync
[Edit Thursday 15:25 UTC] (1) is complete. (2) has begun
[Edit Friday 16:25 2015-12-11 UTC] (2) is complete. (3) has begun
[Edit Saturday 13:54 2015-12-12 UTC] (3) and (4) completed, (5) soon to start |
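For readers curious what the deep filesystem check in step 5 might look like, here is a minimal Python sketch. It is not the project's actual tooling: the function names, the `expected` dictionary (standing in for whatever checksum table the DB really holds) and the chunk size are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, root):
    """Compare files under `root` against stored checksums.

    `expected` maps a relative path to the MD5 hex string recorded when the
    file was first written (hypothetical stand-in for the DB lookup).
    Returns a list of (relative_path, reason) tuples for every problem found.
    """
    problems = []
    for rel_path, stored in sorted(expected.items()):
        full = Path(root) / rel_path
        if not full.is_file():
            problems.append((rel_path, "missing"))
        elif md5_of_file(full) != stored.lower():
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

Calling `find_mismatches(expected, "/path/to/storage")` with checksums pulled from the database would yield the files worth restoring or re-rendering; an empty list means every recorded file is present and matches. MD5 is fine for spotting accidental corruption like this, though it is not suitable where deliberate tampering is a concern.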
Volunteer moderator, Project administrator. Joined: 16 Jun 04 Posts: 4574 Credit: 2,100,463 RAC: 8 |
Funny little story: During today's follow-up maintenance to install additional disk slots, a couple of old IDE Maxtor DiamondMax 10 300GB drives were removed from the server because they are to be replaced by much larger drives soon. It turns out that these drives have been churning along since 2005. They were added to the storage server shortly after BURP started and have been running essentially 24/7 since then. That's a bit more than 10 years of non-stop service, and apparently they are still in pristine condition according to their internal diagnostics. There's something awe-inspiring about hardware that just works! |
noderaser, Project donor. Joined: 28 Mar 06 Posts: 516 Credit: 1,567,702 RAC: 0 |
Quantum/Maxtor made good drives. I just sent off a 16-year-old PowerMac that is still running on its original Quantum Fireball IDE drive. |