About the Distribution of work units

Profile batan

Joined: 29 Aug 07
Posts: 69
Credit: 39,901
RAC: 0
Message 6609 - Posted: 1 Sep 2007, 21:15:05 UTC
Last modified: 1 Sep 2007, 21:54:11 UTC

Hi, I am new to BURP and have only been observing it for a few days, so please forgive any misunderstandings.

Here on the message board some have asked for more work, and after I joined the project I too had nothing to crunch at first, although there was work in progress.
Currently there is a session running which got me work too, but I observed this:

When there is a new session, the work units are distributed very quickly and apparently each host gets a big bunch of WUs (frames or parts). After a very short time all WUs are distributed, and for the rest of the session it is no longer possible to get work from BURP because everything has already been sent out. So now I (for example) have a long render queue which will finish in a few hours, while others (who maybe were offline when the work was distributed) have nothing to crunch.
So maybe it would be better if each host downloaded fewer WUs at a time and instead topped up gradually as it uploads results.
In my opinion - as far as I have understood the principle at this point - that would achieve two goals:
- wider distribution (more crunchers with less work each), and thus
- a shorter time until the session finishes

At the moment, for example, while a session has already been running for half a day, all available WUs were distributed right at the start, so all late-comers are left "empty-handed" while others are overloaded.

Maybe the server software can simply be set to distribute no more than n WUs at a time (let's say 8 instead of 50 or so). When a computer has already crunched 6 of them, it can request the next 8.

This would probably also enhance reliability, since every host only downloads what it expects to actually render (small steps).
ID: 6609 · Rating: 1
Profile Keck_Komputers
Joined: 6 Mar 05
Posts: 94
Credit: 1,384,324
RAC: 10
Message 6611 - Posted: 1 Sep 2007, 23:06:20 UTC

I agree. In my opinion it would be a good idea for any project with an intermittent work supply to limit the number of tasks per RPC, and the frequency of RPCs. The goal would be for it to take at least 4 hours for any batch of work to be distributed. (4 hours is a common deferral period.)
BOINC WIKI

BOINCing since 2002/12/8
ID: 6611 · Rating: 0
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4568
Credit: 2,100,409
RAC: 5
Message 6614 - Posted: 2 Sep 2007, 10:20:57 UTC
Last modified: 2 Sep 2007, 10:23:06 UTC

A much better solution would be if "realtime" projects like BURP could somehow tell BOINC to request only enough work to fill the available CPU slots. Unfortunately, this is not currently possible.

Instead we will be experimenting with different limiter settings over the next few months. At the moment the server will send out at most 1 workunit to your machine every 5 minutes (trying to take multi-core machines into account here).
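
For the technically curious: that limiter boils down to a couple of entries in the project's config.xml, roughly along these lines (the tag names are quoted from memory, so treat this as a sketch rather than a copy of our actual file):

  <config>
    <max_wus_to_send>1</max_wus_to_send>                <!-- at most one result per scheduler request -->
    <min_sendwork_interval>300</min_sendwork_interval>  <!-- at least 5 minutes between requests that get work -->
  </config>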

As Batan correctly states, performance increases as the workunits are spread over a greater number of machines and rendered simultaneously. However, this has to be weighed against the increased data distribution required by that larger number of machines... which at the moment is not an issue thanks to the great support from the mirror contributors.
ID: 6614 · Rating: 0
Profile Keck_Komputers
Joined: 6 Mar 05
Posts: 94
Credit: 1,384,324
RAC: 10
Message 6616 - Posted: 2 Sep 2007, 18:01:08 UTC - in response to Message 6614.  
Last modified: 2 Sep 2007, 18:10:09 UTC

A much better solution would be if "realtime" projects like BURP could somehow tell BOINC to request only enough work to fill the available CPU slots. Unfortunately, this is not currently possible.

Have you tried playing with the max_wus_in_progress tag? It does not take into account the number of CPUs but can be used to limit the queue. Somewhere between 2 and 4 would nearly accomplish what you want for most hosts.

Don't forget to set resend_lost_results if you use this, otherwise the server may not send work to hosts if a result is lost.
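
In config.xml the combination would look roughly like this (the value of 2 is just an example, and I am writing the tags from memory, so double-check them against the documentation):

  <config>
    <max_wus_in_progress>2</max_wus_in_progress>  <!-- cap on tasks in progress per host -->
    <resend_lost_results/>                        <!-- resend results the client lost, so the cap does not starve it -->
  </config>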

edit: Added this as trac ticket #387.
BOINC WIKI

BOINCing since 2002/12/8
ID: 6616 · Rating: 0
[B^S] themerrill
Project donor

Joined: 16 Apr 07
Posts: 3
Credit: 3,619
RAC: 0
Message 6617 - Posted: 2 Sep 2007, 19:49:08 UTC - in response to Message 6616.  

I think the most pressing issue for fairer work-unit distribution would be to provide, or at least try to provide, approximate run times.
When the viper animation came on a few weeks ago, my computer downloaded a whole bunch of units, because the last render session I did took no more than 30 minutes per WU. The viper took 9 hours or more per work unit, so I downloaded WAY more than I could really handle.
That said, the other ideas presented are also good ones. A limiter on the number of WUs would be another way to solve the problem I described.
ID: 6617 · Rating: 0
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4568
Credit: 2,100,409
RAC: 5
Message 6618 - Posted: 2 Sep 2007, 19:54:51 UTC - in response to Message 6616.  
Last modified: 2 Sep 2007, 19:57:44 UTC

Have you tried playing with the max_wus_in_progress tag? It does not take into account the number of CPUs but can be used to limit the queue. Somewhere between 2 and 4 would nearly accomplish what you want for most hosts.

That's right - it would work as a nice temporary solution; and 4- and 8-core systems mostly shouldn't be running 4+ instances of Blender anyway.

Don't forget to set resend_lost_results if you use this, otherwise the server may not send work to hosts if a result is lost.

Yes, I found out the hard way last time I was testing it. However, I forgot to add it to the documentation. Wiki updated now - others may find it useful too, I guess.

edit: Added this as trac ticket #387.

Thanks, I never got around to that either.
ID: 6618 · Rating: 0
Profile Rytis

Joined: 15 Mar 05
Posts: 77
Credit: 31,788
RAC: 0
Message 6619 - Posted: 2 Sep 2007, 20:58:56 UTC - in response to Message 6618.  

Don't forget to set resend_lost_results if you use this, otherwise the server may not send work to hosts if a result is lost.

Yes, I found out the hard way last time I was testing it. However, I forgot to add it to the documentation. Wiki updated now - others may find it useful too, I guess.

This one was fixed in http://boinc.berkeley.edu/trac/changeset/13101.
PrimeGrid
Administrator
ID: 6619 · Rating: 0
[B^S] sTrey
Joined: 23 Mar 05
Posts: 49
Credit: 13,306
RAC: 0
Message 6621 - Posted: 3 Sep 2007, 2:52:11 UTC
Last modified: 3 Sep 2007, 2:54:18 UTC

Thank you, Janus, for making the change mentioned in the news. It's nice to be able to help crunch BURP again!
ID: 6621 · Rating: 0
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4568
Credit: 2,100,409
RAC: 5
Message 6624 - Posted: 3 Sep 2007, 6:11:46 UTC - in response to Message 6619.  
Last modified: 3 Sep 2007, 6:12:41 UTC

Don't forget to set resend_lost_results if you use this, otherwise the server may not send work to hosts if a result is lost.

Yes, I found out the hard way last time I was testing it. However, I forgot to add it to the documentation. Wiki updated now - others may find it useful too, I guess.

This one was fixed in http://boinc.berkeley.edu/trac/changeset/13101.

Good call, completely missed that.
ID: 6624 · Rating: 0
Fischer-Kerli
Project donor

Joined: 24 Mar 05
Posts: 70
Credit: 78,553
RAC: 0
Message 6625 - Posted: 3 Sep 2007, 8:33:44 UTC
Last modified: 3 Sep 2007, 8:41:33 UTC

I like the idea. But with BOINC's silly work fetch behaviour, a lot of user micro-management would be necessary in order to keep your Dual Core running all the time:

03.09.2007 06:22:33|BURP|Sending scheduler request: To fetch work
03.09.2007 06:22:33|BURP|Requesting 143006 seconds of new work
03.09.2007 06:22:38|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:22:38|BURP|Message from server: No work sent
03.09.2007 06:22:38|BURP|Message from server: (reached per-host limit of 2 tasks)
03.09.2007 06:22:38|BURP|Deferring communication for 1 min 0 sec
03.09.2007 06:22:38|BURP|Reason: no work from project
03.09.2007 06:23:39|BURP|Sending scheduler request: To fetch work
03.09.2007 06:23:39|BURP|Requesting 143205 seconds of new work
03.09.2007 06:23:49|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:23:49|BURP|Message from server: Not sending work - last request too recent: 70 sec
03.09.2007 06:23:49|BURP|Deferring communication for 1 min 0 sec
03.09.2007 06:23:49|BURP|Reason: no work from project
03.09.2007 06:24:50|BURP|Sending scheduler request: To fetch work
03.09.2007 06:24:50|BURP|Requesting 143401 seconds of new work
03.09.2007 06:24:55|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:24:55|BURP|Message from server: Not sending work - last request too recent: 70 sec
03.09.2007 06:24:55|BURP|Deferring communication for 1 min 0 sec
03.09.2007 06:24:55|BURP|Reason: no work from project
03.09.2007 06:25:55|BURP|Sending scheduler request: To fetch work
03.09.2007 06:25:55|BURP|Requesting 143746 seconds of new work
03.09.2007 06:26:01|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:26:01|BURP|Message from server: Not sending work - last request too recent: 66 sec
03.09.2007 06:26:01|BURP|Deferring communication for 1 min 0 sec
03.09.2007 06:26:01|BURP|Reason: no work from project
03.09.2007 06:26:14|DepSpid|[checkpoint_debug] result spider_116086_0 checkpointed
03.09.2007 06:27:01|BURP|Sending scheduler request: To fetch work
03.09.2007 06:27:01|BURP|Requesting 143941 seconds of new work
03.09.2007 06:27:06|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:27:06|BURP|Message from server: Not sending work - last request too recent: 66 sec
03.09.2007 06:27:06|BURP|Deferring communication for 1 min 37 sec
03.09.2007 06:27:06|BURP|Reason: no work from project
03.09.2007 06:28:47|BURP|Sending scheduler request: To fetch work
03.09.2007 06:28:47|BURP|Requesting 144137 seconds of new work
03.09.2007 06:28:52|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:28:52|BURP|Message from server: Not sending work - last request too recent: 105 sec
03.09.2007 06:28:52|BURP|Deferring communication for 6 min 10 sec
03.09.2007 06:28:52|BURP|Reason: no work from project
03.09.2007 06:35:07|BURP|Sending scheduler request: To fetch work
03.09.2007 06:35:07|BURP|Requesting 145230 seconds of new work
03.09.2007 06:35:12|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:35:12|BURP|Message from server: No work sent
03.09.2007 06:35:12|BURP|Message from server: (reached per-host limit of 2 tasks)
03.09.2007 06:35:12|BURP|Deferring communication for 14 min 34 sec
03.09.2007 06:35:12|BURP|Reason: no work from project

And so on ... finally, the next RPC gets delayed for so long that chances are high that one or even both of your BURP WUs have completed in the meantime. Franz Kafka would be very fond of that (observed with 5.10.18, but I don't think it depends on the version). It would be very helpful if the BOINC server specified what RPC request interval it would like to see (instead of just saying that the last request was too recent) - and if the BOINC client were able to understand the message. But that's up to the BOINC devs, I guess.

The problem is worsened by the fact that a completed result still counts as one of the two you are allowed to have. Watch this:

03.09.2007 10:27:49|BURP|Computation for task 616in0.zip__ses0000000616_frm0000000408_prt00000.wu_0 finished
03.09.2007 10:28:03||Resuming network activity
(Talk about micro-management!)
03.09.2007 10:28:03|BURP|Sending scheduler request: To fetch work
03.09.2007 10:28:03|BURP|Requesting 171458 seconds of new work
03.09.2007 10:28:04|BURP|[file_xfer] Started upload of file 616in0.zip__ses0000000616_frm0000000408_prt00000.wu_0_0
03.09.2007 10:28:08|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 10:28:08|BURP|Message from server: No work sent
03.09.2007 10:28:08|BURP|Message from server: (reached per-host limit of 2 tasks)
03.09.2007 10:28:08|BURP|Deferring communication for 1 min 0 sec
03.09.2007 10:28:08|BURP|Reason: no work from project
03.09.2007 10:28:10|BURP|[file_xfer] Finished upload of file 616in0.zip__ses0000000616_frm0000000408_prt00000.wu_0_0
03.09.2007 10:28:10|BURP|[file_xfer] Throughput 35701 bytes/sec
03.09.2007 10:28:19||Suspending network activity - user request
ID: 6625 · Rating: 0
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4568
Credit: 2,100,409
RAC: 5
Message 6626 - Posted: 3 Sep 2007, 9:41:55 UTC - in response to Message 6625.  
Last modified: 3 Sep 2007, 9:43:27 UTC

I like the idea. But with BOINC's silly work fetch behaviour, a lot of user micro-management would be necessary in order to keep your Dual Core running all the time [...]

True. Given the max_wus_in_progress limit there's really no need to use the minimum wait time between requests anymore (and personally that feature has always annoyed me when testing the BOINC client).

Trying out: max 3 concurrent WUs (expected CPU count = 2, plus 1 extra WU to make sure work fetch can continue while the others are busy), max 1 sent at a time, no forced delays.
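
In config.xml terms that experiment looks roughly like the following (as always, a sketch with the usual option names, not a copy of our exact file):

  <config>
    <max_wus_in_progress>3</max_wus_in_progress>      <!-- 2 expected cores plus 1 spare so work fetch can continue -->
    <max_wus_to_send>1</max_wus_to_send>              <!-- at most one workunit per scheduler request -->
    <min_sendwork_interval>0</min_sendwork_interval>  <!-- no forced delay between requests -->
  </config>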
ID: 6626 · Rating: 0
Fischer-Kerli
Project donor

Joined: 24 Mar 05
Posts: 70
Credit: 78,553
RAC: 0
Message 6627 - Posted: 3 Sep 2007, 10:25:01 UTC
Last modified: 3 Sep 2007, 10:37:56 UTC

With only one additional WU, there may be a short pause during which one of the cores is idle (see below), but that should be tolerable. Apart from that, the solution works great for session 616 with its half-hour WU length. Thanks for that! Not so sure about future shorter sessions (EDIT: or longer ones, for that matter), though.

03.09.2007 12:04:45|BURP|Sending scheduler request: To fetch work
03.09.2007 12:04:45|BURP|Requesting 168148 seconds of new work
03.09.2007 12:04:51|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 12:04:51|BURP|Message from server: No work sent
03.09.2007 12:04:51|BURP|Message from server: (reached per-host limit of 3 tasks)
03.09.2007 12:04:51|BURP|Deferring communication for 14 min 58 sec
03.09.2007 12:04:51|BURP|Reason: no work from project
03.09.2007 12:14:42|BURP|Computation for task 616in0.zip__ses0000000616_frm0000000489_prt00002.wu_3 finished
03.09.2007 12:14:42|BURP|Starting 616in0.zip__ses0000000616_frm0000000505_prt00000.wu_0
03.09.2007 12:14:42|BURP|[cpu_sched] Starting 616in0.zip__ses0000000616_frm0000000505_prt00000.wu_0 (initial)
03.09.2007 12:14:42|BURP|Starting task 616in0.zip__ses0000000616_frm0000000505_prt00000.wu_0 using blender version 455
03.09.2007 12:14:44|BURP|[file_xfer] Started upload of file 616in0.zip__ses0000000616_frm0000000489_prt00002.wu_3_0
03.09.2007 12:14:50|BURP|[file_xfer] Finished upload of file 616in0.zip__ses0000000616_frm0000000489_prt00002.wu_3_0
03.09.2007 12:14:50|BURP|[file_xfer] Throughput 38018 bytes/sec
03.09.2007 12:16:29|BURP|Computation for task 616in0.zip__ses0000000616_frm0000000474_prt00000.wu_2 finished
03.09.2007 12:16:31|BURP|[file_xfer] Started upload of file 616in0.zip__ses0000000616_frm0000000474_prt00000.wu_2_0
03.09.2007 12:16:36|BURP|[file_xfer] Finished upload of file 616in0.zip__ses0000000616_frm0000000474_prt00000.wu_2_0
03.09.2007 12:16:36|BURP|[file_xfer] Throughput 35463 bytes/sec
03.09.2007 12:19:51|BURP|Sending scheduler request: To fetch work
03.09.2007 12:19:51|BURP|Requesting 171111 seconds of new work, and reporting 2 completed tasks
03.09.2007 12:19:56|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 12:19:58|BURP|Starting 616in0.zip__ses0000000616_frm0000000509_prt00003.wu_1
03.09.2007 12:19:58|BURP|[cpu_sched] Starting 616in0.zip__ses0000000616_frm0000000509_prt00003.wu_1 (initial)
03.09.2007 12:19:58|BURP|Starting task 616in0.zip__ses0000000616_frm0000000509_prt00003.wu_1 using blender version 455
ID: 6627 · Rating: 0
Profile batan

Joined: 29 Aug 07
Posts: 69
Credit: 39,901
RAC: 0
Message 6628 - Posted: 3 Sep 2007, 12:16:08 UTC

The 3-WU setting seems good to me too (better than the 2-WU one). I now always have three WUs: one or two are always active (one core is used by other projects about half of the time), while the others are waiting, either ready to start or pending confirmation.
I guess the overall balance in WU distribution is much better now. It remains to be seen whether this leads to a performance increase. Anyhow, I think this way more crunchers get a share of the work, and only as much as they can actually handle.
ID: 6628 · Rating: 0
noderaser
Project donor
Joined: 28 Mar 06
Posts: 516
Credit: 1,567,702
RAC: 0
Message 6631 - Posted: 3 Sep 2007, 23:59:44 UTC

Would it be OK to make the limit 4, so that quad-cores can fill all their cores with BURP units? Or maybe even 6, to have a little extra work?
ID: 6631 · Rating: 0
Profile batan

Joined: 29 Aug 07
Posts: 69
Credit: 39,901
RAC: 0
Message 6633 - Posted: 4 Sep 2007, 13:57:41 UTC
Last modified: 4 Sep 2007, 14:00:02 UTC

There seems to be another issue with WU distribution where an improvement could possibly be made. Sessions seem to slow down considerably towards the end; sometimes just a few unfinished WUs leave a session hanging at over 99% complete for many hours. This seems to happen when clients that received WUs were shut down before their tasks were completed, or when a lot of BOINC projects are running on the same machine.
Would it make sense to reduce the deadline, so that unfinished WUs are reassigned to "hungry" computers sooner? However, since the deadline must also take slower computers into account, there is probably not much margin to make it shorter.

Would it be OK to make the limit 4, so that quad-cores can fill all their cores with BURP units? Or maybe even 6, to have a little extra work?

You seem to have enough RAM. Anyway, when more WUs are sent out per host, the deadlines should be shorter (to avoid hanging WUs).
ID: 6633 · Rating: 0
[B^S] themerrill
Project donor

Joined: 16 Apr 07
Posts: 3
Credit: 3,619
RAC: 0
Message 6634 - Posted: 5 Sep 2007, 13:32:05 UTC - in response to Message 6633.  

There seems to be another issue with WU distribution where an improvement could possibly be made. Sessions seem to slow down considerably towards the end; sometimes just a few unfinished WUs leave a session hanging at over 99% complete for many hours. This seems to happen when clients that received WUs were shut down before their tasks were completed, or when a lot of BOINC projects are running on the same machine.
Would it make sense to reduce the deadline, so that unfinished WUs are reassigned to "hungry" computers sooner? However, since the deadline must also take slower computers into account, there is probably not much margin to make it shorter.


They recently extended the deadline to about 30 hours. Remember, BOINC enters 'panic crunch' mode at 24 hours; putting the deadline back to 24 should help alleviate the problem. Or you could send out a larger batch of the missing WUs; credit goes to those who get them in before quorum.
ID: 6634 · Rating: 0
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4568
Credit: 2,100,409
RAC: 5
Message 6635 - Posted: 5 Sep 2007, 13:56:41 UTC
Last modified: 9 Sep 2007, 19:54:09 UTC

A system is already in place to boost the final workunits of each session. However, the new workunits are scheduled at the same priority as the original workunits were. Scheduling them at a higher priority and sending them to more reliable hosts is something that I'll look into adding (I just learned about this feature from the WCG people).
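
From what the WCG people described, the feature is driven by the reliable_* options in config.xml - something along these lines, although I have not tried them yet, so both the names and the values below are only my notes, not a tested setup:

  <config>
    <reliable_max_avg_turnaround>86400</reliable_max_avg_turnaround>  <!-- hosts that usually return results within a day count as reliable -->
    <reliable_max_error_rate>0.001</reliable_max_error_rate>          <!-- ...and that almost never return errors -->
    <reliable_reduced_delay_bound>0.5</reliable_reduced_delay_bound>  <!-- give resent workunits half the normal deadline -->
  </config>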
ID: 6635 · Rating: 0
Profile batan

Joined: 29 Aug 07
Posts: 69
Credit: 39,901
RAC: 0
Message 6637 - Posted: 5 Sep 2007, 15:17:44 UTC

I didn't know about BOINC's "panic mode". (For anyone interested - there is information about BOINC's work scheduler here.)
I think it's OK to keep BOINC from entering panic mode, except maybe for long-overdue work units; people don't like BOINC behaving "abnormally".

...sending them to more reliable hosts...

That seems like a good solution. But how do you determine the reliability of hosts? Is there an internal rating system, or is it simply based on recent average credit?
ID: 6637 · Rating: 0
Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4568
Credit: 2,100,409
RAC: 5
Message 6642 - Posted: 9 Sep 2007, 19:32:39 UTC - in response to Message 6637.  
Last modified: 9 Sep 2007, 19:50:29 UTC

The reliability rating is indeed based on RAC and turnaround time for the host's recent workunits (as far as I've understood - I haven't really had time to look into it yet).
ID: 6642 · Rating: 0
Profile Zanthius
Project donor

Joined: 24 Mar 05
Posts: 94
Credit: 1,627,664
RAC: 0
Message 6651 - Posted: 12 Sep 2007, 0:11:24 UTC - in response to Message 6631.  

Would it be OK to make the limit 4, so that quad-cores can fill all their cores with BURP units? Or maybe even 6, to have a little extra work?


Seconded from a fellow quad core cruncher :)
ID: 6651 · Rating: 0