[131-132] Suspend/resume test analysis

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4570
Credit: 2,100,463
RAC: 8
Message 1360 - Posted: 1 Jul 2005, 16:39:13 UTC
Last modified: 1 Jul 2005, 16:52:55 UTC

Introduction:
Sessions 131 and 132 were run to test the recent changes and fixes to suspending and resuming the client.
The main problem to be solved was that previous client versions kept measuring time while suspended. This could cause a workunit to hit the CPU time limit that is set to protect machines from infinite loops. This problem seems to be completely solved with this release. On the positive side, a further set of bugs and problems was discovered, and most of them have been isolated so that they can be fixed. They are all described in more detail below.

Problem: Large and complex sessions use a lot of RAM which can cause hosts with less memory to start swapping.
Effect: Swapping degrades overall responsiveness of the system severely.
Solution: When dealing out work the scheduler must be aware of memory requirements and only deal out work to a particular host if that host can handle it without swapping. This means authors of sessions must specify an approximate amount of memory required. Multi-CPU and hyperthreaded machines must be considered as a special case. This solution is scheduled to be implemented as part of the Alpha stage of the project.
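
The memory check described above could look roughly like the following. This is only a sketch in Python, not the actual BOINC scheduler code. The names `Host`, `can_dispatch` and the `reserve_mb` headroom are hypothetical, and dividing the RAM evenly among logical CPUs is just one assumed way to handle multi-CPU and hyperthreaded hosts.

```python
from dataclasses import dataclass


@dataclass
class Host:
    ram_mb: int        # physical RAM reported by the host
    logical_cpus: int  # logical CPUs (hyperthreads count separately)


def can_dispatch(host: Host, session_mem_mb: int, reserve_mb: int = 128) -> bool:
    """Return True if every logical CPU could run this session without swapping.

    reserve_mb is an assumed allowance for the OS and other programs.
    """
    usable_mb = host.ram_mb - reserve_mb
    # Assume each logical CPU may be running its own workunit at the same time.
    per_cpu_mb = usable_mb / max(host.logical_cpus, 1)
    return per_cpu_mb >= session_mem_mb


# A 512 MB hyperthreaded host versus the same host with HT disabled,
# for a session whose author declared a 300 MB requirement.
print(can_dispatch(Host(ram_mb=512, logical_cpus=2), session_mem_mb=300))  # False
print(can_dispatch(Host(ram_mb=512, logical_cpus=1), session_mem_mb=300))  # True
```

A real scheduler would probably also need to know how many workunits the host actually runs at once, which this sketch simply assumes equals the number of logical CPUs.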

Problem: Responsiveness is low on extremely complex scenes because the client only checks for messages from BOINC when a full line has been rendered (i.e. there can be several minutes between checks).
Effect: Blender "keeps running" when BOINC has been shut down. Suspending sometimes takes a long time. Automatic benchmarks are aborted because Blender is still running.
Solution: Either don't wait for a full line to complete but read messages once every second, or read messages multiple times per line of pixels. These two options will be tested in the next series of client releases. Responsiveness is an important factor in users' view of client stability and should remain the same even for complex sessions. Achieving faster responses is the only remaining task before the suspend/resume feature of the client is acceptable.
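
To illustrate the first option (reading messages once every second regardless of line boundaries), here is a minimal Python sketch. It is not the BURP client code: `render_frame`, `render_pixel` and `poll_messages` are hypothetical names, and the `"quit"` reply is an assumed stand-in for whatever BOINC actually sends.

```python
import time


def render_frame(pixels, render_pixel, poll_messages,
                 poll_interval=1.0, clock=time.monotonic):
    """Render pixels one by one, polling for messages about once per poll_interval.

    Returns the number of pixels rendered before a stop was requested.
    """
    last_poll = clock()
    done = 0
    for pixel in pixels:
        render_pixel(pixel)
        done += 1
        now = clock()
        # Poll based on elapsed time, not on how much of the line is finished.
        if now - last_poll >= poll_interval:
            last_poll = now
            if poll_messages() == "quit":
                break
    return done
```

The clock is passed in as a parameter so the loop can be exercised with a fake clock. The check itself is cheap, so doing it after every pixel should not noticeably slow down rendering, though that would need to be measured on a real scene.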

Problem: Some experimental BOINC clients (4.46-4.48?) have problems with the shared memory segment setup phase.
Effect: Since the shared memory segments are used to communicate with the BOINC client (which is pretty crucial) the client fails with an unrecoverable error 5 (-1073741819) trashing the workunit.
Solution: It is expected to be solved in the next series of BOINC core clients (5.X) and theoretically has nothing to do with BURP in particular.

Problem: Missing heartbeats from the core client
Effect: Workunits are aborted
Solution: Instead of aborting the workunit after only 30 seconds of missing heartbeats, the client could wait 60 seconds or more. However, this is not an optimal solution. Rather than patching around the symptom, it would be more useful to know WHY the heartbeats stop. The problem is especially visible on laptops when resuming work after hibernation or standby, which could indicate a problem in the BOINC core client. Tests have shown that the core client resumes sending heartbeats a little later. More debugging is needed to shed further light on this bug.

Problem: Automatic benchmarks remove the application from memory
Effect: All work is lost and the workunit restarts or fails
Solution: Theoretically the benchmarking code should already take applications like BURP into consideration and only suspend them to memory. This is clearly not the case here. The problem has been narrowed down to be either in the benchmark-preparing code or in the reporting code in the BURP client (which should avoid reporting checkpoints). These two areas will be checked before the next client release and the faulty code corrected.

All in all a successful test, with lots of development work to be done before the next one. Later today you will be able to download the animations resulting from sessions 131 and 132 on their session pages (soon available through the session gallery).
ric

Joined: 3 Mar 05
Posts: 14
Credit: 20,161
RAC: 0
Message 1361 - Posted: 1 Jul 2005, 17:22:47 UTC - in response to Message 1360.  

Thank you, Janus, for this very informative and open test analysis.



Even if some work had to be restarted several times, it doesn't matter.

We are here to support this emerging project, whatever needs to be done.

As an owner of several Intel hosts with HT but only 512 MB, I've learned the
lesson that it is better to run only 1 Blender at a time. Running 2 Blenders (HT mode)
bogs the host down too much (you mentioned it --> swapping) and they take an extremely long time to finish.

It's faster to run BURP with HT set to "disabled" in the BIOS setup.

It looks like the minimum for systems to run "well" is 512 MB per CPU.

Good luck, and keep up the good work!

It's a pleasure to support BURP.

regards

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4570
Credit: 2,100,463
RAC: 8
Message 1362 - Posted: 1 Jul 2005, 19:49:56 UTC

To add a note about the speed of these sessions:
Session 131 would have taken about 150 days to render on my machine but was completed in just 3 using BURP... That's an average frame render time of just 20 mins on BURP versus 12 hours. All this includes the extra time spent due to the above issues - oh, and the network was rendering session 132 simultaneously =)
John McLeod VII

Joined: 3 Mar 05
Posts: 51
Credit: 79,519
RAC: 0
Message 1371 - Posted: 2 Jul 2005, 2:18:29 UTC

I had to abandon several WUs because of the thrashing issue. I am glad that is being addressed. I will wait a few days and then allow work downloads from BURP on those hosts again.


Berserker

Joined: 1 Apr 05
Posts: 6
Credit: 169,668
RAC: 0
Message 1374 - Posted: 2 Jul 2005, 9:11:46 UTC - in response to Message 1360.  

One other problem I've noticed:

http://burp.boinc.dk/show_host_detail.php?hostid=6075

This host (one of mine) downloaded 28 workunits in one day. It appears that on this PC, session 131 workunits had an estimated time to completion of 1 hour 56 minutes, when the actual time is closer to 5 hours 48 minutes. It downloaded far too much work for the deadlines (and is still finishing some of it).
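
To put rough numbers on the mismatch, here is a quick calculation in Python using only the figures quoted above. It assumes the 28 workunits are crunched one at a time around the clock, which may not match how this host actually runs.

```python
from datetime import timedelta

estimated = timedelta(hours=1, minutes=56)  # estimate shown for a session 131 workunit
actual = timedelta(hours=5, minutes=48)     # observed time on this host
workunits = 28

ratio = actual / estimated                  # how far off the estimate was
planned_total = workunits * estimated       # what the client thought it had fetched
actual_total = workunits * actual           # what it really has to crunch

print(ratio)          # 3.0
print(planned_total)  # 2 days, 6:08:00
print(actual_total)   # 6 days, 18:24:00
```

So an estimate that is off by a factor of 3 turns a little over two days of queued work into almost a week.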
Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4570
Credit: 2,100,463
RAC: 8
Message 1385 - Posted: 23 Jul 2005, 10:09:25 UTC - in response to Message 1360.  
Last modified: 23 Jul 2005, 10:18:34 UTC

Problem: Some experimental BOINC clients (4.46-4.48?) have problems with the shared memory segment setup phase.

CVS checkins from 23 July suggest that this is being solved in the next BOINC client release.

Problem: Missing heartbeats from the core client

Enough debugging has been done to fully isolate this problem. It was solved by David Anderson on 23 July 2005 and will affect all BOINC projects that compile their clients using the corrected software. Many thanks to all the users here who helped narrow down the issue:
- API: change heartbeat mechanism so that instead of using time(0),
it uses its own counter (incremented on interrupt).
This avoid specious heartbeat timeout
when user resets system clock.
It should also fix problem where BOINC restarts
after hibernation, app runs before core client,
and it gets heartbeat timeout.
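
The difference between the two mechanisms in the checkin note can be sketched as follows. This is a simplified Python illustration, not the BOINC API code: both class names are hypothetical, the 30 second limit is taken from the analysis earlier in this thread, and a one-second timer tick is an assumption.

```python
class WallClockMonitor:
    """Old approach: compare wall-clock timestamps (as time(0) would give)."""

    def __init__(self, now, timeout=30):
        self.last_heartbeat = now
        self.timeout = timeout

    def on_heartbeat(self, now):
        self.last_heartbeat = now

    def timed_out(self, now):
        return now - self.last_heartbeat > self.timeout


class TickCounterMonitor:
    """New approach: count timer ticks that occur while the app is running."""

    def __init__(self, timeout_ticks=30):
        self.timeout_ticks = timeout_ticks
        self.ticks_since_heartbeat = 0

    def on_timer_tick(self):
        # Assumed to be called once per second by a periodic timer.
        self.ticks_since_heartbeat += 1

    def on_heartbeat(self):
        self.ticks_since_heartbeat = 0

    def timed_out(self):
        return self.ticks_since_heartbeat > self.timeout_ticks


# Simulated hibernation: the wall clock jumps by an hour, but the app only
# runs for 2 seconds before the core client sends its next heartbeat.
wall = WallClockMonitor(now=1000)
ticks = TickCounterMonitor()
for _ in range(2):
    ticks.on_timer_tick()

print(wall.timed_out(now=1000 + 3600 + 2))  # True: spurious timeout
print(ticks.timed_out())                    # False: only 2 ticks have elapsed
```

Because the tick counter never looks at the wall clock, neither hibernation nor a user resetting the system clock can trigger the timeout on its own.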
