Extended Monday maintenance, Monday 26th of Aug 2013



Message boards : Server backend and mirrors : Extended Monday maintenance, Monday 26th of Aug 2013

Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4461
Credit: 2,094,806
RAC: 0
Message 11983 - Posted: 26 Aug 2013, 21:08:59 UTC
Last modified: 27 Aug 2013, 19:52:01 UTC

As part of the Monday maintenance the system found some inconsistencies in the workunit database table file. This file is particularly large, so the maintenance window will have to be extended to Tuesday in order to properly analyse the issue and recover the database table.

Much of the BOINC side of things will be shut down during this, but the forum should stay up most of the time.

An update will be posted tomorrow.

Profile Janus
Volunteer moderator
Project administrator
Message 11984 - Posted: 26 Aug 2013, 23:40:15 UTC

Update:
Two returned results had to be dropped from the table to restore it to a consistent state. Apart from that, things should be back to normal. A more in-depth analysis will be performed later today, hopefully identifying what caused the issue to begin with.

Profile Janus
Volunteer moderator
Project administrator
Message 11985 - Posted: 27 Aug 2013, 19:51:50 UTC

Short version: Stuff got too hot, things are fine.

Long-winded and technical version:
Turns out that this was caused by an I/O overload at around 18:05 UTC. The server was running a number of things at the same time (due to the maintenance and other things):
- A session finished and went into encode mode. This causes a very fast bulk read of a lot of data followed by a slower trickle and a lot of CPU while the encoding is going on. Sunflower sessions in particular cause extra load because they convert the raw EXR datafiles into PNGs before encoding.
- CATS was moving data to my workstation and offsite for backups. This kept a couple of the network cards pretty busy.
- The database was scanning through results to produce the hourly stats export
- A couple of Google and Baidu bots were scanning the old parts of the gallery
- The RAID storage array was performing a consistency scan
- Normal project operation was running at the same time

Apparently this was a bit too much: the server got too hot and decided to slow down, then took a little break, first for 10 secs, then a longer one of around 20 secs. Eventually a hardware watchdog killed the system and restarted it. This caused the filesystem to drop the last ~14 secs of changes and revert to the last known good state, which is why 2 rows were dropped from the result table.
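For anyone curious how a stall turns into lost rows, here is a rough toy model in Python. It is only a sketch under assumptions: the 15 sec watchdog timeout, the `CheckpointedTable` class and the row names are all made up for illustration, and the real watchdog and filesystem on the server certainly work differently in the details. The idea is simply that a stall longer than the watchdog timeout triggers a hard reset, and anything written since the last durable checkpoint is gone afterwards.

```python
from dataclasses import dataclass, field


def watchdog_fires(stall_seconds: float, timeout_seconds: float) -> bool:
    """A watchdog resets the machine if no heartbeat arrives within the timeout."""
    return stall_seconds > timeout_seconds


@dataclass
class CheckpointedTable:
    """Toy table (hypothetical): rows are durable only after a checkpoint."""
    durable: list = field(default_factory=list)
    pending: list = field(default_factory=list)

    def write(self, row: str) -> None:
        # New rows sit in memory until the next checkpoint.
        self.pending.append(row)

    def checkpoint(self) -> None:
        # Flush everything written so far to durable storage.
        self.durable.extend(self.pending)
        self.pending.clear()

    def hard_reset(self) -> list:
        # Simulate the forced reboot: rows not yet checkpointed are lost.
        lost = list(self.pending)
        self.pending.clear()
        return lost


# Assumption: the real watchdog timeout is not stated anywhere in this thread.
TIMEOUT = 15.0

table = CheckpointedTable()
table.write("result-A")
table.checkpoint()          # last known good state
table.write("result-B")     # written after the checkpoint
table.write("result-C")     # written after the checkpoint

lost_rows = []
for stall in (10.0, 20.0):  # the two stalls mentioned above
    if watchdog_fires(stall, TIMEOUT):
        lost_rows = table.hard_reset()
        break

print("durable:", table.durable)
print("lost:", lost_rows)
```

With these made-up numbers the 10 sec stall is survived, the 20 sec stall trips the watchdog, and the two rows written after the last checkpoint are the ones that disappear.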

At this point the server was simply sitting in failsafe mode waiting for an admin to acknowledge the report about DB inconsistencies. Once I noticed the warning it then took a few hours to post about it on the forums, run the database repairs and then fire BOINC+BURP back up.

The ventilation around the data server has been improved a bit, and the ambient temperature is going to decline over the coming weeks and months, so hopefully this will not repeat next week.

Anyways, there you have it. Long-winded indeed.

