Week 7 downtime

Message boards : Server backend and mirrors : Week 7 downtime

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7748 - Posted: 16 Feb 2008, 23:13:54 UTC
Last modified: 16 Feb 2008, 23:24:25 UTC

An unwritten rule in computer systems administration seems to be the following: "When you are sitting right next to the system it will work for years without an issue, but the moment you are away from it, it is bound to crash - and there's nothing you can do about it".

What happened during week 7 (or more precisely, from Monday the 11th of Feb at 6:30 UTC to this Saturday morning) was that the server experienced 4 different issues:

1) Monday morning (one day after my departure on a week-long trip to Finland) the network DNS server was down for a few minutes, which caused the script that regularly checks for network connectivity to report a failure. The script then attempted to re-establish the network connection automatically.

2) The script, however, failed to correctly set up the routing information for outbound packets in the case where only the DNS server had been down and not the physical network link. Any attempt to contact the BURP server from the internet resulted in a timeout because all outgoing packets were being dropped.
While I was in Finland there was simply not much I could do about the problem.

3) Because the network interfaces did not have the right setup, the system was left without internet connectivity, which in turn triggered the script again, effectively running it in an infinite loop.

4) At some point during the week the BOINC feeder lost its connection to the database (because no DNS information was available). As a consequence, in the timeframe between Saturday morning and now, the project returned "No work" when clients connected to it.
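
The failure chain in points 1) to 3) comes down to a watchdog that treated a DNS outage as if the physical link had failed. A minimal Python sketch of keeping those two cases apart could look like this. It is not the actual BURP script: the names `Action`, `choose_action`, `probe_link` and `probe_dns` are hypothetical, and so are any addresses or hostnames passed to them.

```python
import socket
from enum import Enum


class Action(Enum):
    NONE = "none"                  # everything works, do nothing
    WAIT_FOR_DNS = "wait_for_dns"  # link is fine, only name resolution fails
    RESET_LINK = "reset_link"      # the link itself appears to be down


def probe_link(ip: str, port: int = 53, timeout: float = 3.0) -> bool:
    """Try a TCP connection to a literal IP address, so no DNS lookup is needed.

    The target address is an assumption supplied by the caller.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_dns(hostname: str) -> bool:
    """Check whether a hostname can currently be resolved."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def choose_action(link_up: bool, dns_up: bool) -> Action:
    """Decide what the watchdog should do from two independent probes."""
    if not link_up:
        return Action.RESET_LINK
    if not dns_up:
        # Only DNS is affected: leave interfaces and routing untouched.
        return Action.WAIT_FOR_DNS
    return Action.NONE
```

A watchdog built this way would call `choose_action(probe_link(...), probe_dns(...))` and only touch the interfaces and routing table when the result is `Action.RESET_LINK`.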

As of the timestamp of this post all of these issues should be fixed and work should once again be flowing steadily.

I'm very sorry about this downtime, especially because the timing was so unfortunate: it allowed a relatively simple problem to make the entire server unavailable for a whole week.
The script has been corrected, which should hopefully prevent a similar thing from happening in the future.
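
One common way to keep a repair script from re-triggering itself in a tight loop, as in point 3), is to space out repeated attempts with an exponential backoff. The Python sketch below illustrates that general technique only. It is an assumption and not a description of the correction actually made to the BURP script; `RepairLimiter` and its parameters are hypothetical.

```python
class RepairLimiter:
    """Limits how often a watchdog may re-run its repair step.

    Hypothetical helper: each failed repair doubles the waiting time
    before the next attempt, up to max_delay seconds.
    """

    def __init__(self, base_delay: float = 60.0, max_delay: float = 3600.0) -> None:
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.failures = 0
        self.next_allowed = 0.0  # earliest time (in seconds) for the next attempt

    def may_repair(self, now: float) -> bool:
        """Return True if a repair attempt is allowed at time `now`."""
        return now >= self.next_allowed

    def record_failure(self, now: float) -> None:
        """Note a failed repair and push the next attempt further out."""
        delay = min(self.base_delay * (2 ** self.failures), self.max_delay)
        self.failures += 1
        self.next_allowed = now + delay

    def record_success(self) -> None:
        """Reset the backoff once connectivity is confirmed again."""
        self.failures = 0
        self.next_allowed = 0.0
```

The caller passes in the current time (for example from `time.monotonic()`), checks `may_repair` before each repair attempt, and records the outcome afterwards.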

If you experience any further issues related to this downtime, please post about them in this thread.
ID: 7748 · Rating: 0
DramaKing
Joined: 4 Jan 08
Posts: 28
Credit: 34,420
RAC: 0
Message 7749 - Posted: 16 Feb 2008, 23:25:36 UTC

Glad to know that the site's working again. And I totally sympathize.
ID: 7749 · Rating: 0
Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7751 - Posted: 16 Feb 2008, 23:41:27 UTC

Another consequence of this is that the counter that keeps track of the number of completed parts per session (for statistical use on the website) recorded completely wrong data for a while.
For this reason it was necessary to reset the counter for the ongoing session 726. During the next few days the counter will be adjusted to the correct value as the system re-assimilates the workunits that were impacted by this.
ID: 7751 · Rating: 0
SenrabYar
Joined: 21 Mar 05
Posts: 5
Credit: 8,272
RAC: 0
Message 7752 - Posted: 16 Feb 2008, 23:44:43 UTC

After failing to get any communication with BURP from the BOINC client, I decided to restart BOINC Manager - everything now works.

Maybe others will find they are stuck unless they stop and restart the BOINC Manager.


ID: 7752 · Rating: 0
Mark Reiss
Joined: 7 Aug 06
Posts: 21
Credit: 15,526
RAC: 0
Message 7753 - Posted: 16 Feb 2008, 23:45:45 UTC


Hi: Will we lose any pending credit for session 726?

Mark Reiss

ID: 7753 · Rating: 0
SenrabYar
Joined: 21 Mar 05
Posts: 5
Credit: 8,272
RAC: 0
Message 7754 - Posted: 16 Feb 2008, 23:52:36 UTC

Just checking - despite the deadline having passed, I've completed the quota for some WUs and got the credit, and I expect the other 2 members of the quorum have too.

Not sure what will happen if a fourth WU was sent because of the apparent failure to reply in time. If that one gets reported back first, quite probably the other good one will become 'not needed' and get nothing - a tad unfair methinks.
ID: 7754 · Rating: 0
AC
Project donor
Joined: 30 Sep 07
Posts: 121
Credit: 143,874
RAC: 0
Message 7756 - Posted: 17 Feb 2008, 0:31:13 UTC - in response to Message 7751.  
Last modified: 17 Feb 2008, 0:31:45 UTC

...During the next few days the counter will be adjusted to the correct value as the system re-assimilates the workunits that were impacted by this.

Awww... I was afraid that the current 1.5 KCPUsec/sec number was too good to be true. >.<

Glad to see things back up and running -- and yet another system glitch fixed!
ID: 7756 · Rating: 0
batan
Joined: 29 Aug 07
Posts: 69
Credit: 39,901
RAC: 0
Message 7757 - Posted: 17 Feb 2008, 2:11:53 UTC - in response to Message 7752.  

After failing to get any communication with BURP from the BOINC client, I decided to restart BOINC Manager - everything now works.

Maybe others will find they are stuck unless they stop and restart the BOINC Manager.

Thanks - that was helpful.
ID: 7757 · Rating: 0
DoctorNow
Project donor
Joined: 11 Apr 05
Posts: 403
Credit: 2,189,214
RAC: 7
Message 7759 - Posted: 17 Feb 2008, 10:25:35 UTC - in response to Message 7753.  

Glad to have BURP back online, Janus! :-)


Hi: Will we lose any pending credit for session 726?

As for my part, none of the pending credits are missing. ;-)
Life is Science, and Science rules. To the universe and beyond
Proud member of BOINC@Heidelberg
My BOINC-Stats
ID: 7759 · Rating: 0
Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 7787 - Posted: 19 Feb 2008, 8:21:03 UTC - in response to Message 7753.  

Hi: Will we lose any pending credit for session 726?

No.
ID: 7787 · Rating: 0