Raid mirror failure (Jan 12.)

Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7425 - Posted: 13 Jan 2008, 18:12:26 UTC
Last modified: 13 Jan 2008, 18:29:00 UTC

Yesterday at 20:31 the raid controller lost contact with the raid mirror responsible for storing the busy part of the database:
Jan 12 20:31:48 legolas ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan 12 20:31:48 legolas ata6.00: cmd 25/00:08:38:96:f6/00:00:10:00:00/e0 tag 0 cdb 0x0 data 4096 in
Jan 12 20:31:48 legolas res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 12 20:31:53 legolas ata6: port is slow to respond, please be patient (Status 0xd0)
Jan 12 20:31:58 legolas ata6: device not ready (errno=-16), forcing hardreset
Jan 12 20:31:58 legolas ata6: hard resetting port
Jan 12 20:32:03 legolas ata6: port is slow to respond, please be patient (Status 0x80)
Jan 12 20:32:08 legolas ata6: COMRESET failed (errno=-16)
Jan 12 20:32:08 legolas ata6: hard resetting port
Jan 12 20:32:13 legolas ata6: port is slow to respond, please be patient (Status 0x80)
Jan 12 20:32:18 legolas ata6: COMRESET failed (errno=-16)
Jan 12 20:32:18 legolas ata6: hard resetting port
Jan 12 20:32:23 legolas ata6: port is slow to respond, please be patient (Status 0x80)
Jan 12 20:32:53 legolas ata6: COMRESET failed (errno=-16)
Jan 12 20:32:53 legolas ata6: limiting SATA link speed to 1.5 Gbps
Jan 12 20:32:53 legolas ata6: hard resetting port
Jan 12 20:32:58 legolas ata6: COMRESET failed (errno=-16)
Jan 12 20:32:58 legolas ata6: reset failed, giving up
Jan 12 20:32:58 legolas ata6.00: disabled
Jan 12 20:32:58 legolas ata6: EH complete
Jan 12 20:32:58 legolas sd 5:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
Jan 12 20:32:58 legolas end_request: I/O error, dev sdd, sector 284595768
Jan 12 20:32:58 legolas raid1: sdd: rescheduling sector 284595768
Jan 12 20:32:58 legolas sd 5:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
Jan 12 20:32:58 legolas end_request: I/O error, dev sdd, sector 293046528
Jan 12 20:32:58 legolas md: super_written gets error=-5, uptodate=0
Jan 12 20:32:58 legolas raid1: Disk failure on sdd, disabling device.
Jan 12 20:32:58 legolas Operation continuing on 1 devices
Jan 12 20:32:58 legolas RAID1 conf printout:
Jan 12 20:32:58 legolas --- wd:1 rd:2
Jan 12 20:32:58 legolas disk 0, wo:0, o:1, dev:sdc
Jan 12 20:32:58 legolas disk 1, wo:1, o:0, dev:sdd
Jan 12 20:32:58 legolas RAID1 conf printout:
Jan 12 20:32:58 legolas --- wd:1 rd:2
Jan 12 20:32:58 legolas disk 0, wo:0, o:1, dev:sdc
Jan 12 20:32:58 legolas raid1: sdc: redirecting sector 284595768 to another mirror

Any attempt to get the drive back online automatically or through SATA rescans on that or other ports failed.

After a system reboot all drives were again functional and reporting superb S.M.A.R.T. values, which indicates that this may be a software issue, or an issue with the motherboard BIOS, rather than a problem with the hard drive itself.
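
For anyone who wants to run the same kind of check themselves, here is a minimal sketch (assuming smartmontools is installed, and using /dev/sdd from the log above as a default; run it as root and adapt the device name to your own setup):

# Minimal sketch: query a drive's S.M.A.R.T. health and attribute table.
# Assumes smartmontools is installed; /dev/sdd is the mirror that dropped
# out in the log above. Run as root.
import subprocess

def smart_report(device="/dev/sdd"):
    # "smartctl -H" prints the overall health self-assessment (PASSED/FAILED)
    health = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    # "smartctl -A" prints the vendor-specific S.M.A.R.T. attribute table
    attrs = subprocess.run(["smartctl", "-A", device],
                           capture_output=True, text=True)
    return health.stdout, attrs.stdout

if __name__ == "__main__":
    overall, attributes = smart_report()
    print(overall)
    print(attributes)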

The drive has spent the past half hour resyncing with the other hot-storage raid drive.

Fortunately no data was lost, and the project was only down for the two minutes it took to reboot the system (the other raid drive kept everything running, although at half the usual speed).

You can follow the resync progress on the status page.
146523264 blocks [2/1] [U_]
[==========>..........] recovery = 51.0% (74820480/146523264) finish=30.8min speed=38723K/sec


When it says "146523264 blocks [2/2] [UU]" rather than "146523264 blocks [2/1] [U_]" it means that both disks are again working as expected.
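
If you would rather poll this from a script than reload the status page, the following minimal sketch (assuming a Linux host where the array shows up in /proc/mdstat, as in the output above) reports any md array that is still running on fewer mirrors than it should:

# Minimal sketch: read /proc/mdstat and report degraded arrays.
# Assumes a Linux host; the "[2/1] [U_]" notation is the same one quoted above.
import re

def degraded_arrays(path="/proc/mdstat"):
    with open(path) as f:
        text = f.read()
    degraded = []
    # Lines like "146523264 blocks [2/1] [U_]": the second number is how
    # many member disks are actually up.
    for total, up, flags in re.findall(r"\[(\d+)/(\d+)\] \[([U_]+)\]", text):
        if int(up) < int(total):
            degraded.append(flags)
    return degraded

if __name__ == "__main__":
    bad = degraded_arrays()
    print("all arrays healthy" if not bad else "degraded: %s" % bad)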
ID: 7425
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7441 - Posted: 14 Jan 2008, 17:25:39 UTC
Last modified: 15 Jan 2008, 7:46:45 UTC

The raid is fully functional again and the database is running as smoothly as ever. To give an example, it has served an astonishing 90,241,800 requests in the past 23.5 hours, that's about 1,062 requests per second.
ID: 7441
Profile Rytis

Joined: 15 Mar 05
Posts: 77
Credit: 31,788
RAC: 0
Message 7451 - Posted: 14 Jan 2008, 20:32:31 UTC

Seems to be showing [2/1] again :(
PrimeGrid
Administrator
ID: 7451
indigomonkey

Joined: 31 May 06
Posts: 44
Credit: 18,633
RAC: 0
Message 7457 - Posted: 14 Jan 2008, 23:02:37 UTC

Looks like it could be a recurring problem. Mind you, serving over 1000 requests per second - that's pretty darn impressive!
ID: 7457
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7466 - Posted: 15 Jan 2008, 7:59:22 UTC
Last modified: 15 Jan 2008, 8:54:57 UTC

Same port, same drive, seemingly a slightly different issue, though.

Fortunately this time the drive was re-detected when the raid controller did a scan of the disk bus... a really weird problem, this one, since both the controller and the disk seem to be in excellent health.
I'll try the usual stuff (new cables, swapped ports etc.) next week to see if anything interesting pops up.
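
For the record, the manual recovery after such a dropout boils down to a bus rescan followed by re-adding the disk to the mirror. The sketch below only illustrates that general procedure; host5, /dev/sdd and /dev/md0 are assumptions based on the log above, not confirmed details of the actual setup:

# Rough sketch of re-detecting a dropped SATA disk and re-adding it to the
# RAID1 mirror. The names host5, /dev/sdd and /dev/md0 are assumptions taken
# from the log above; run as root and adapt them to your own machine.
import subprocess

def rescan_scsi_host(host="host5"):
    # Writing "- - -" to the scan file asks the kernel to rescan all
    # channels, targets and LUNs on that SCSI/SATA host.
    with open("/sys/class/scsi_host/%s/scan" % host, "w") as f:
        f.write("- - -")

def readd_to_mirror(array="/dev/md0", disk="/dev/sdd"):
    # Once the disk is visible again, hand it back to the md array;
    # md then resyncs it against the surviving mirror.
    subprocess.run(["mdadm", "--manage", array, "--re-add", disk], check=True)

if __name__ == "__main__":
    rescan_scsi_host()
    readd_to_mirror()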
ID: 7466
Achim

Joined: 17 May 05
Posts: 183
Credit: 2,642,713
RAC: 0
Message 7470 - Posted: 15 Jan 2008, 9:10:02 UTC

Are we down due to this problem?
ID: 7470
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7471 - Posted: 15 Jan 2008, 9:24:05 UTC - in response to Message 7470.  
Last modified: 15 Jan 2008, 9:32:48 UTC

Are we down due to this problem?

Yes, we were down for about half an hour while the raid synced back up. Everything should be back online now.

Just after the resync the database started at a sustained 10k requests per second, which is clearly not supposed to happen. The assimilators were all using way, way too much memory (and using it in a very weird way, where they would start at 0MB and then "slowly" climb to 100MB over a period of 0.5 secs).
It appears that the BURP part of the assimilators had been running some pretty nasty database requests, but because the BURP part of the database is quite nicely optimized for fast access it didn't really slow down the system at all. Of course there is no need to use a lot of requests to do something that only needs 2 requests, so a false was flipped to a true and now the assimilators not only generate far fewer DB requests but also use less memory - win-win.
However, this doesn\'t seem to have anything to do with the raid issue.
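
Purely as an illustration of the kind of change described above (this is not the actual assimilator code, and the flag name below is made up), the difference between issuing one query per result and batching them looks roughly like this:

# Hypothetical illustration only - not the real BURP assimilator. It shows the
# "one query per row" versus "one batched query" difference that flipping a
# single boolean can make; BATCH_DB_REQUESTS is a made-up name.
import sqlite3

BATCH_DB_REQUESTS = True

def fetch_results(conn, result_ids):
    if not BATCH_DB_REQUESTS:
        # Old behaviour: one SELECT per result id - many round trips and
        # every row buffered separately.
        rows = []
        for rid in result_ids:
            rows += conn.execute(
                "SELECT id, outcome FROM result WHERE id = ?", (rid,)
            ).fetchall()
        return rows
    # New behaviour: a single SELECT covering the whole batch.
    marks = ",".join("?" * len(result_ids))
    return conn.execute(
        "SELECT id, outcome FROM result WHERE id IN (%s)" % marks,
        result_ids,
    ).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, outcome INT)")
    conn.executemany("INSERT INTO result VALUES (?, ?)",
                     [(i, 1) for i in range(10)])
    print(fetch_results(conn, [1, 2, 3]))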
ID: 7471
Achim

Joined: 17 May 05
Posts: 183
Credit: 2,642,713
RAC: 0
Message 7472 - Posted: 15 Jan 2008, 9:29:57 UTC

Looks like this time it was not so clean (at least for me):
Host Project Date Message
BURP 15/01/2008 10:27:29 Message from server: Resent lost result 731in0.zip__ses0000000731_frm0000010103_prt00000.wu_1
BURP 15/01/2008 10:27:29 Message from server: Completed result 731in0.zip__ses0000000731_frm0000009890_prt00000.wu_2 refused: result already reported as success
BURP 15/01/2008 10:27:29 Message from server: Completed result 731in0.zip__ses0000000731_frm0000009999_prt00000.wu_2 refused: result already reported as success
ID: 7472
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7473 - Posted: 15 Jan 2008, 9:35:40 UTC - in response to Message 7472.  

Looks like this time it was not so clean (at least for me)

Yes, this time the database came to a grinding halt when the raid resync started (which was also why it was decided to take the project offline for a while).
This may have affected any of the 255 hosts that were talking to the server at that point.
ID: 7473
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7475 - Posted: 15 Jan 2008, 10:13:38 UTC
Last modified: 15 Jan 2008, 10:30:42 UTC

Since the server was down for a while some clients ran short of work and are now requesting a lot of it. The feeder had to be restarted to cope with the requests.

[Edit:] And currently you can almost call it real time rendering - the Hanoi session is being assimilated at 2fps at the moment.
ID: 7475
Achim

Joined: 17 May 05
Posts: 183
Credit: 2,642,713
RAC: 0
Message 7476 - Posted: 15 Jan 2008, 11:35:29 UTC - in response to Message 7475.  

And currently you can almost call it real time rendering - the Hanoi session is being assimilated at 2fps at the moment.

So this is another stress test....
ID: 7476
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7477 - Posted: 15 Jan 2008, 11:47:10 UTC - in response to Message 7476.  

And currently you can almost call it real time rendering - the Hanoi session is being assimilated at 2fps at the moment.

So this is another stress test....

Well at least I'm pretty stressed at the moment - I'll have to leave the system running as it is until tomorrow afternoon.
Scheduling some downtime tomorrow to clean up the mess.
ID: 7477
Fischer-Kerli
Project donor

Joined: 24 Mar 05
Posts: 70
Credit: 78,553
RAC: 0
Message 7478 - Posted: 15 Jan 2008, 11:49:46 UTC - in response to Message 7472.  
Last modified: 15 Jan 2008, 12:36:57 UTC

Looks like this time it was not so clean (at least for me):
Host Project Date Message
BURP 15/01/2008 10:27:29 Message from server: Resent lost result 731in0.zip__ses0000000731_frm0000010103_prt00000.wu_1
BURP 15/01/2008 10:27:29 Message from server: Completed result 731in0.zip__ses0000000731_frm0000009890_prt00000.wu_2 refused: result already reported as success
BURP 15/01/2008 10:27:29 Message from server: Completed result 731in0.zip__ses0000000731_frm0000009999_prt00000.wu_2 refused: result already reported as success


I've seen messages like these now and then when the BURP server was under heavy load (typically when a session with very "short" WUs was running), so nothing new. I guess that basically means your BOINC client successfully managed to talk to the server-side scheduler, only the server's response was lost. From the user's perspective, that's more of a cosmetic issue really (except for a time lag between the original send of a new result and the re-send).

EDIT: There should be a message "Couldn't connect to scheduler" (or something like that) in your log, dated a few minutes earlier. At that point, your client did in fact connect to the scheduler, results 9890 and 9999 were successfully reported, and task 10103 was assigned to you. Because the server's response was lost, your client wasn't aware of that until a few minutes later, when communication was working again both ways.
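
That matches how an idempotent report handler behaves. As a purely illustrative sketch (not BOINC's actual scheduler code), server-side handling along these lines produces exactly the messages quoted above when a lost reply makes the client resend a report:

# Illustrative sketch only - not the real BOINC scheduler. It shows why a
# report that already succeeded the first time is refused on resend.
reported = {}  # result name -> outcome the server already recorded

def handle_report(result_name, outcome):
    if result_name in reported:
        # The first report got through even though the client never saw the
        # reply, so the retry is refused.
        return ("Completed result %s refused: result already reported as success"
                % result_name)
    reported[result_name] = outcome
    return "Completed result %s accepted" % result_name

if __name__ == "__main__":
    name = "731in0.zip__ses0000000731_frm0000009890_prt00000.wu_2"
    print(handle_report(name, "success"))  # original report; reply was lost
    print(handle_report(name, "success"))  # retry a few minutes later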
ID: 7478
Achim

Joined: 17 May 05
Posts: 183
Credit: 2,642,713
RAC: 0
Message 7479 - Posted: 15 Jan 2008, 13:00:10 UTC - in response to Message 7478.  

I've seen messages like these now and then when the BURP server was under heavy load (typically when a session with very "short" WUs was running), so nothing new.

I haven't seen these in the last few months.

EDIT: There should be a message "Couldn't connect to scheduler" (or something like that) in your log

BURP 15/01/2008 12:30:50 Scheduler request failed: HTTP service unavailable

Yes, it was.
ID: 7479
Achim

Joined: 17 May 05
Posts: 183
Credit: 2,642,713
RAC: 0
Message 7480 - Posted: 15 Jan 2008, 13:04:33 UTC - in response to Message 7477.  

Well at least I'm pretty stressed at the moment
I can imagine.
Scheduling some downtime tomorrow to clean up the mess.
Maybe you could relax the 2 WUs per CPU before doing so, so we have some work during the downtime.
I assume the result would be an overloaded server as soon as the server is back.
If an average WU is about 10 min, and the downtime is 4 hours, we would need 6 * 4 = 24 WUs per CPU before downtime, which is quite a lot.

ID: 7480
nick

Joined: 23 Aug 06
Posts: 23
Credit: 251,238
RAC: 0
Message 7498 - Posted: 17 Jan 2008, 5:16:28 UTC - in response to Message 7480.  

Well at least I'm pretty stressed at the moment
I can imagine.
Scheduling some downtime tomorrow to clean up the mess.
Maybe you could relax the 2 WUs per CPU before doing so, so we have some work during the downtime.
I assume the result would be an overloaded server as soon as the server is back.
If an average WU is about 10 min, and the downtime is 4 hours, we would need 6 * 4 = 24 WUs per CPU before downtime, which is quite a lot.



Which would be OK on these really long sessions.
ID: 7498
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7502 - Posted: 17 Jan 2008, 20:48:33 UTC

As an experiment the "downtime" was done using a script that can run while the rest of the project stays online. So there was actually no need to bring the scheduler or the database down for this cleanup.

This new script will also help minimize the amount of time that the project is down on Monday mornings.
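
The script itself is not posted here, but the general idea behind cleaning up while everything stays online is to work in small batches so that no single transaction locks the database for long. A hypothetical sketch of that approach (the table and column names below are invented, not the real BURP schema):

# Hypothetical sketch of an "online" cleanup - not the actual script used on
# the BURP server; workunit_archive and create_time are made-up names.
# Deleting in small batches keeps every transaction short, so the scheduler
# and web pages can keep using the database in between.
import sqlite3
import time

BATCH = 1000  # rows per transaction; small enough to avoid long locks

def purge_old_rows(conn, cutoff):
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM workunit_archive WHERE create_time < ? LIMIT ?",
            (cutoff, BATCH))]
        if not ids:
            break
        marks = ",".join("?" * len(ids))
        conn.execute("DELETE FROM workunit_archive WHERE id IN (%s)" % marks, ids)
        conn.commit()
        time.sleep(0.1)  # breathe between batches so other queries get through

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE workunit_archive (id INTEGER PRIMARY KEY, create_time INT)")
    conn.executemany("INSERT INTO workunit_archive VALUES (?, ?)",
                     [(i, i) for i in range(5000)])
    purge_old_rows(conn, cutoff=2500)
    print(conn.execute("SELECT COUNT(*) FROM workunit_archive").fetchone())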
ID: 7502
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7529 - Posted: 20 Jan 2008, 19:28:24 UTC
Last modified: 20 Jan 2008, 19:31:35 UTC

The raid mirror is down again. Waiting until tomorrow's scheduled maintenance to attempt to fix it. It is the same mirror, and the same error as originally posted.

The drives, cables and raid ports will be tested individually in order to locate the issue.
ID: 7529
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7553 - Posted: 22 Jan 2008, 17:16:10 UTC
Last modified: 22 Jan 2008, 17:16:28 UTC

The issue has now been isolated to a drive or cable malfunction. Further testing will be done to completely nail down the problem.
ID: 7553
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7562 - Posted: 23 Jan 2008, 9:31:32 UTC - in response to Message 7553.  

The issue has now been isolated to a drive or cable malfunction. Further testing will be done to completely nail down the problem.

And the fact that the drive still crashes on a different port with different cables narrows the possible causes down to software or the drive itself.
ID: 7562