We just came out of this week's "Monday maintenance", which pretty much lasted the whole weekend, and it was a great success! (phew!)
According to the disk manager we did lose some data, but only 64KiB of it (that's right, 65536 bytes) out of the 13TiB allocated to us on the defective RAID. We have some data on other RAIDs as well (short-term storage, the BBB3D archives, caching, etc.) - those were fortunately completely unaffected by this incident, but caching has been turned off temporarily because we ran out of disk slots during the rescue operation.
The next few days will be spent running some database checks to make sure everything is as super as it sounds. Luckily, most of those checks can be run while everything stays online. Two databases (namely the workunit and result databases) require a bit of downtime; those will be checked during next weekend's maintenance window.
The plan looks something like this now:
1) Get the new RAID to full sync status while starting up services
2) Run DB checks to make sure we are fully alive
3) Install additional disk slots
4) Re-enable cache filesystem to boost performance
5) Deep filesystem checks on the hosted data, comparing actual files against the MD5 checksums stored in the DB when they were originally written to disk (see the sketch after this list) <-- Happening right now
6) Rethink how we handle hardware failures like this - in particular how warranty on old drives is handled. What started this whole ordeal was (also) partly human error.
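For those curious what step 5 actually involves: the idea is simply to re-read every hosted file and compare its current MD5 against the checksum recorded in the database when the file was first written. Here is a minimal sketch in Python of that kind of check (the function names and the (path, checksum) layout are made up for illustration, not our actual tooling):

import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    # Stream the file in 1 MiB chunks so large files don't need to fit in RAM.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_files(stored_checksums):
    # stored_checksums: iterable of (path, md5_hex_recorded_at_write_time),
    # e.g. pulled from the DB. Reports files that have gone missing or whose
    # current content no longer matches the checksum recorded at write time.
    for path, expected in stored_checksums:
        if not os.path.exists(path):
            print("MISSING:", path)
        elif md5_of_file(path) != expected.lower():
            print("MISMATCH:", path)
        # otherwise the file still matches its original checksum

Running something along these lines over the whole 13TiB is what takes the time - the check itself is cheap, re-reading every byte on disk is not.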
More info tomorrow
[Edit Monday 19:21 UTC] (1) Most services are up and running. Syncing has just begun on the last pair of disks. 18.5 hours expected
[Edit Tuesday 7:20 UTC] (1) 30% left on the sync
[Edit Thursday 15:25 UTC] (1) is complete. (2) has begun
[Edit Friday 16:25 2015-12-11 UTC] (2) is complete. (3) has begun
[Edit Saturday 13:54 2015-12-12 UTC] (3) and (4) completed, (5) soon to start