Unexpected downtime 1st of August 2014

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4553
Credit: 2,097,282
RAC: 3
Message 13119 - Posted: 1 Aug 2014, 14:33:05 UTC
Last modified: 1 Aug 2014, 14:38:01 UTC

Yesterday we saw increased latency on one of our network cards (up to 450 ms instead of the usual <1 ms). We started diagnostics on the link together with the ISP that provides the fibre-optic channel used for the bulk of the BURP data downloads.

Today the network card and two hard drives (yes, two more... gah) just flat out died. The service is back up after around 1.5 hours of downtime, but work will continue on fixing the remaining hardware issues.

The downtime should have been no more than a few minutes, but it was extended because the replacement card was not compatible with the kernel configuration the server was running. It took some extra time to migrate the settings to a new kernel, compile it and install it.

Tomorrow we will be installing replacement drives, and the service will be slower than usual while the RAID arrays sync up.

To avoid similar issues in the future, there is a plan to migrate to a smarter network setup where the server's 3 network cards act as failovers for each other. The other network hardware at the site supports such a setup, so it is just a matter of actually getting it done.
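
For anyone curious what "failovers for each other" means in practice, here is a rough sketch of the selection logic behind an active-backup setup. It is only an illustration and not the actual configuration: the interface names, the `Link` type, the `pick_active` function and the 10 ms threshold are all made up for the example. The only numbers taken from above are the usual <1 ms and the degraded 450 ms.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold. The post only says <1 ms is normal and 450 ms is bad.
MAX_HEALTHY_LATENCY_MS = 10.0


@dataclass
class Link:
    name: str                    # hypothetical interface name, e.g. "eth0"
    carrier_up: bool             # does the card report a link at all?
    latency_ms: Optional[float]  # last measured latency, None if the probe failed


def is_healthy(link: Link) -> bool:
    """A link is usable if it has carrier and its latency is under the threshold."""
    return (
        link.carrier_up
        and link.latency_ms is not None
        and link.latency_ms <= MAX_HEALTHY_LATENCY_MS
    )


def pick_active(links: List[Link], current: Optional[str] = None) -> Optional[str]:
    """Keep the current link while it is healthy, otherwise fail over to the
    first healthy link in priority (list) order. Returns None if none is usable."""
    by_name = {link.name: link for link in links}
    if current in by_name and is_healthy(by_name[current]):
        return current
    for link in links:
        if is_healthy(link):
            return link.name
    return None


if __name__ == "__main__":
    links = [
        Link("eth0", True, 450.0),  # the degraded card
        Link("eth1", True, 0.4),
        Link("eth2", True, 0.5),
    ]
    print(pick_active(links, current="eth0"))
```

Staying on the current link while it is healthy avoids flapping back and forth between cards. A real setup would normally do this in the kernel or on the switch rather than in a script.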
Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4553
Credit: 2,097,282
RAC: 3
Message 13138 - Posted: 2 Aug 2014, 14:36:58 UTC
Last modified: 2 Aug 2014, 14:37:44 UTC

Replacement drives have arrived (huh what?). They have started self-tests and will go into the RAID in a few hours; expect some slowdowns due to re-syncing.
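
For those who want to follow this kind of re-sync on their own machines: assuming Linux software RAID (md), which is an assumption and not something stated above, the kernel reports rebuild progress in `/proc/mdstat`. The sketch below pulls the percentage, estimated finish time and speed out of one such progress line. The `parse_progress` helper and the sample line are illustrative only, and the numbers in the sample are not from this server.

```python
import re
from typing import Optional

# Matches the progress line that md prints in /proc/mdstat during a rebuild.
_PROGRESS = re.compile(
    r"(?P<kind>resync|recovery)\s*=\s*(?P<percent>\d+(?:\.\d+)?)%"
    r".*?finish=(?P<finish_min>\d+(?:\.\d+)?)min"
    r".*?speed=(?P<speed_k>\d+)K/sec"
)


def parse_progress(line: str) -> Optional[dict]:
    """Return the rebuild progress found in one mdstat line, or None if the
    line is not a resync/recovery progress line."""
    match = _PROGRESS.search(line)
    if match is None:
        return None
    return {
        "kind": match.group("kind"),
        "percent": float(match.group("percent")),
        "finish_min": float(match.group("finish_min")),
        "speed_k_per_sec": int(match.group("speed_k")),
    }


# Illustrative sample line, not real data from the BURP server.
SAMPLE = (
    "      [==>..................]  recovery = 12.6% "
    "(37043392/292945152) finish=127.5min speed=33440K/sec"
)

if __name__ == "__main__":
    print(parse_progress(SAMPLE))
```

On a live system the same function could be applied to each line read from `/proc/mdstat`.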
funkydude

Joined: 23 Dec 13
Posts: 275
Credit: 2,478,281
RAC: 0
Message 13140 - Posted: 2 Aug 2014, 15:57:09 UTC

You certainly don't ride the luck train here.