Server issues Dec 06, 2011

Message boards : Server backend and mirrors : Server issues Dec 06, 2011

Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 11122 - Posted: 6 Dec 2011, 6:27:38 UTC

The server has been having some issues recently:
1) There have been some small power spikes and outages in the past few days that triggered the emergency power system a couple of times. Each time the issue was resolved before the battery ran out of power, so there was no downtime. Specifically, this was related to a 15 kV cable dropout at 12:30 on Dec 4, which was resolved at 13:10.
2) Tonight (Dec 5) at 22:30-22:48 and then again at 23:35-06:02 the database server went offline for unknown reasons. According to the logs it basically stalled. This may or may not be related to heat issues and it is being looked into.
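
For reference, the two database outage windows in item 2 can be added up with a small sketch like the one below. The helper name `outage_minutes` is made up for illustration, and the code assumes that an end time earlier than the start time means the outage ran past midnight into the next day.

```python
from datetime import datetime, timedelta


def outage_minutes(start: str, end: str) -> int:
    """Length of an outage in minutes, given HH:MM start and end times.

    Assumption: if the end time is earlier than the start time,
    the outage crossed midnight and ended on the following day.
    """
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        t1 += timedelta(days=1)
    return int((t1 - t0).total_seconds() // 60)


# The two windows reported for the night of Dec 5.
outages = [("22:30", "22:48"), ("23:35", "06:02")]
total = sum(outage_minutes(start, end) for start, end in outages)
print(total)  # 405 minutes, i.e. 6 h 45 min
```

The first window is 18 minutes and the second is 387 minutes, so the database server was offline for roughly 6 hours and 45 minutes in total.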

Until the root cause can be determined we will be running slightly slower than usual. In particular, the assimilator will be kept offline.
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 11123 - Posted: 6 Dec 2011, 7:21:47 UTC - in response to Message 11122.  

The assimilator did not have a backlog at all and was enabled again. The temperature in the server is being lowered by additional cooling now.
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 11124 - Posted: 6 Dec 2011, 19:11:26 UTC
Last modified: 6 Dec 2011, 21:05:22 UTC

The server stalled again shortly after my last post, and this time it took the workunit and result tables down with it. It takes a little while to rebuild these tables...

[Edit:] The tables have been rebuilt. 4 workunits had to be dropped because they were left in an inconsistent state. If your host got one of these it will eventually give up on it because the server doesn't recognize it.
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 11125 - Posted: 6 Dec 2011, 21:16:09 UTC

The usual deadlines will temporarily be extended, and the usual limit of 2 queued workunits per machine will be raised to 5. This should make it easier to handle a possible scenario in which the project is offline for an extended period of time. Hopefully this is an unnecessary precaution.

The cause of the recent stalls has now clearly been linked to a hardware issue which occurs during high system loads.
baracutio
Project donor

Joined: 29 Mar 05
Posts: 96
Credit: 174,604
RAC: 0
Message 11126 - Posted: 6 Dec 2011, 22:48:21 UTC - in response to Message 11125.  
Last modified: 6 Dec 2011, 22:48:29 UTC

thanks for updates on these issues, janus.

greetings
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 11127 - Posted: 10 Dec 2011, 10:15:09 UTC

Replacement parts have arrived and will be installed soon, probably on Monday if everything goes as planned.
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 11135 - Posted: 13 Dec 2011, 20:36:02 UTC

I basically took a deep breath and pulled the plug on the old server, swapped the motherboard, CPU and memory for new parts and asked the system to "go figure it out". On Gentoo Linux, switching from AMD to Intel CPUs means a full system recompile, which had been prepared on a USB stick in advance.
So right now the server is running from a small 8 GB USB stick and most services are back online.

The plan is to get everything up and running (including the secondary network link which is unfortunately still down right now) and then replace the old root file system with the new one from the stick.
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 11136 - Posted: 13 Dec 2011, 23:05:06 UTC

Unfortunately the system couldn't boot from the new root filesystem due to a peculiarity in the hardware setup. Things will have to be shuffled around a bit - this includes moving 100 GB of data twice to make room for the filesystem on another drive.

The reason the secondary link (the one that takes care of all the downloads) is down has been found, and the link will be enabled as soon as possible tomorrow.
jk1swt

Joined: 20 Mar 11
Posts: 20
Credit: 2,187,202
RAC: 24
Message 11137 - Posted: 13 Dec 2011, 23:24:32 UTC

Appreciate you keeping us updated, Janus.
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4574
Credit: 2,100,463
RAC: 8
Message 11141 - Posted: 14 Dec 2011, 17:44:07 UTC
Last modified: 14 Dec 2011, 18:25:27 UTC

Turns out that the reason why the new server could not boot wasn't a hardware issue after all - grub (the bootloader) simply cannot boot from a hard disk with a number higher than 8. And the Linux kernel just happened to be on disk number 14... Suddenly there was a good reason to migrate to grub2, and (surprise surprise) everything booted wonderfully out of the box.

[Edit:] The dev environment is up again.
[Edit2:] The secondary link is up but DNS is wrong; the corrected values will take 1 hour to propagate. In the meantime I will test whether the routing is working properly.
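
A check for whether the corrected DNS values have reached a given resolver could look roughly like the sketch below. This is only an illustration, not the procedure actually used here: the function name `resolves_to`, the hostname `downloads.example.org` and the 192.0.2.x addresses are all placeholders, and the demo uses fake resolvers so that it runs without network access.

```python
import socket
from typing import Callable, List, Optional


def resolves_to(hostname: str, expected_ip: str,
                resolver: Optional[Callable[[str], List[str]]] = None) -> bool:
    """Return True if expected_ip is among the IPv4 addresses for hostname.

    By default the system resolver is used; a custom resolver can be
    passed in, which is handy for testing without network access.
    """
    if resolver is None:
        def resolver(name: str) -> List[str]:
            # gethostbyname_ex returns (canonical_name, aliases, ip_addresses)
            return socket.gethostbyname_ex(name)[2]
    try:
        return expected_ip in resolver(hostname)
    except OSError:
        # Lookup failures (e.g. socket.gaierror) count as "not updated yet".
        return False


# Hypothetical values: a stale record and the corrected one.
def stale(name: str) -> List[str]:
    return ["192.0.2.1"]


def fresh(name: str) -> List[str]:
    return ["192.0.2.10"]


print(resolves_to("downloads.example.org", "192.0.2.10", stale))  # False
print(resolves_to("downloads.example.org", "192.0.2.10", fresh))  # True
```

Calling `resolves_to` without a resolver argument queries the local system resolver, so a result of `False` only says that this particular resolver still has the old record cached.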
