Message boards : Number crunching : About the Distribution of work units
Joined: 29 Aug 07, Posts: 69, Credit: 39,901, RAC: 0
Hi, I am new to BURP and have only observed it for a few days, so please forgive any misunderstandings. Here on the message board some have asked for more work, and I too had nothing to crunch at first after joining the project, although there was work in progress. Currently there is a session running which got me work too. But I observed this: when a new session starts, the work units are distributed very quickly and apparently each host gets a big bunch of WUs (frames or parts). After a very short time all WUs are distributed. For the rest of the session it is no longer possible to get work from BURP, because all WUs have already been sent out. So now I (for example) have a long render queue which will finish in a few hours, while others (who were perhaps offline when the work was distributed) have nothing to crunch.

So maybe it would be better if each host downloaded fewer WUs at a time and instead topped up gradually as it uploads results. In my opinion - as far as I have understood the principle at this point - that would achieve two goals:

- wider distribution (more crunchers with less work each), and thus
- less time until the session finishes

At the moment, for example, with a session that has already been running for half a day, all available WUs were distributed at the beginning, so all late-comers are left "empty-handed" while others are overloaded. Maybe the server software can simply be set to distribute no more than n WUs at a time (say 8, instead of 50 or so). When a computer has already crunched 6 of them, it can request the next 8. This would probably also enhance reliability, since every host only downloads what it expects to actually render (small steps).
Joined: 6 Mar 05, Posts: 94, Credit: 1,384,324, RAC: 7
I agree. In my opinion it would be a good idea for any project with an intermittent work supply to limit the number of tasks per RPC, and the frequency of RPCs. The goal would be for it to take at least 4 hours for any group of work to be distributed. (4 hours is a common deferral period.)

BOINC WIKI · BOINCing since 2002/12/8
Volunteer moderator, Project administrator, Joined: 16 Jun 04, Posts: 4571, Credit: 2,100,463, RAC: 8
A much better solution would be if it was somehow possible for "realtime" projects like BURP to tell BOINC to only request enough work to fill up the available CPU slots. However, this is not the case. Instead we will be experimenting with different limiter settings over the next few months. At the moment the server will only send out 1 workunit to your machine at most every 5 minutes (trying to take multi-core machines into account here).

As Batan correctly states, performance increases as the workunits are spread over a greater number of machines and rendered simultaneously. However, this has to be weighed against the increased amount of data distribution required by that greater number of machines... which at the moment is not an issue, thanks to the great support from the mirror contributors.
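For readers wondering what such a limiter looks like on the server side: throttles of this kind normally live in the project's config.xml. The fragment below is only a sketch with illustrative values matching the behaviour described above; it is not BURP's actual configuration file.

```xml
<!-- BOINC project config.xml fragment (illustrative values, not BURP's real settings) -->
<boinc>
  <config>
    <!-- Hand out at most one result per scheduler request -->
    <max_wus_to_send>1</max_wus_to_send>
    <!-- Refuse further work requests from the same host for 300 seconds (5 minutes) -->
    <min_sendwork_interval>300</min_sendwork_interval>
  </config>
</boinc>
```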
Joined: 6 Mar 05, Posts: 94, Credit: 1,384,324, RAC: 7
> A much better solution would be if it was somehow possible for "realtime" projects like BURP to tell BOINC to only request enough work to fill up the available CPU slots. However, this is not the case.

Have you tried playing with the max_wus_in_progress tag? It does not take into account the number of CPUs, but it can be used to limit the queue. Somewhere between 2 and 4 would nearly accomplish what you want for most hosts. Don't forget to set resend_lost_results if you use this, otherwise the server may not send work to hosts if a result is lost.

edit: Added this as trac ticket #387.

BOINC WIKI · BOINCing since 2002/12/8
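For reference, a minimal sketch of how those two tags sit in config.xml. The value 3 is just an example from the 2-4 range suggested above, and the exact syntax may vary between server versions:

```xml
<boinc>
  <config>
    <!-- Cap on the number of results a single host may have in progress at once -->
    <max_wus_in_progress>3</max_wus_in_progress>
    <!-- Without this, a host whose client loses a result can sit at its in-progress
         cap indefinitely and never be sent new work -->
    <resend_lost_results/>
  </config>
</boinc>
```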
[B^S] themerrill (Project donor), Joined: 16 Apr 07, Posts: 3, Credit: 3,619, RAC: 0
I think the most pressing issue for a fairer work unit distribution would be to provide, or at least try to provide, approximate run times. When that viper animation came on a few weeks ago, my computer downloaded a whole bunch of units, because in the last render session I did, a WU took no more than 30 minutes. The viper was 9 hours or more per work unit, so I downloaded WAY more than I could really handle. That said, the other ideas presented are also good ones. A limiter on the number of WUs would be another way to solve the problem I described.
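One server-side way to give clients a better runtime estimate (so they do not fetch a 9-hour task thinking it is a 30-minute one) is to put a realistic floating-point-operations estimate on each workunit. The fragment below is only an illustration; the file name and the fpops figure are invented, not taken from any BURP session:

```xml
<!-- Workunit input template fragment (file name and fpops value are made up) -->
<workunit>
  <file_ref>
    <file_number>0</file_number>
    <open_name>scene.zip</open_name>
  </file_ref>
  <!-- Estimated floating-point operations per result; the client combines this
       with its benchmark to predict run time and to size its work fetch -->
  <rsc_fpops_est>3.0e13</rsc_fpops_est>
</workunit>
```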
Volunteer moderator, Project administrator, Joined: 16 Jun 04, Posts: 4571, Credit: 2,100,463, RAC: 8
> Have you tried playing with the max_wus_in_progress tag? It does not take into account the number of CPUs, but it can be used to limit the queue. Somewhere between 2 and 4 would nearly accomplish what you want for most hosts.

That's right - it would work as a nice temporary solution. And 4- and 8-core systems mostly shouldn't run 4+ instances of Blender anyway.

> Don't forget to set resend_lost_results if you use this, otherwise the server may not send work to hosts if a result is lost.

Yes, I found that out the hard way last time I was testing it. However, I forgot to add it to the documentation. Wiki updated now - others may find it useful too, I guess.

> edit: Added this as trac ticket #387.

Thanks, I never got around to that either.
Joined: 15 Mar 05, Posts: 77, Credit: 31,788, RAC: 0
> Don't forget to set resend_lost_results if you use this, otherwise the server may not send work to hosts if a result is lost.

This one was fixed in http://boinc.berkeley.edu/trac/changeset/13101.

PrimeGrid Administrator
[B^S] sTrey, Joined: 23 Mar 05, Posts: 49, Credit: 13,306, RAC: 0
Thank you Janus for making the change mentioned in the news. It's nice to be able to help crunch BURP again!
Volunteer moderator, Project administrator, Joined: 16 Jun 04, Posts: 4571, Credit: 2,100,463, RAC: 8
> Don't forget to set resend_lost_results if you use this, otherwise the server may not send work to hosts if a result is lost.

Good call, completely missed that.
Fischer-Kerli (Project donor), Joined: 24 Mar 05, Posts: 70, Credit: 78,553, RAC: 0
I like the idea. But with BOINC's silly work fetch behaviour, a lot of user micro-management would be necessary in order to keep your dual core running all the time:

03.09.2007 06:22:33|BURP|Sending scheduler request: To fetch work
03.09.2007 06:22:33|BURP|Requesting 143006 seconds of new work
03.09.2007 06:22:38|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:22:38|BURP|Message from server: No work sent
03.09.2007 06:22:38|BURP|Message from server: (reached per-host limit of 2 tasks)
03.09.2007 06:22:38|BURP|Deferring communication for 1 min 0 sec
03.09.2007 06:22:38|BURP|Reason: no work from project
03.09.2007 06:23:39|BURP|Sending scheduler request: To fetch work
03.09.2007 06:23:39|BURP|Requesting 143205 seconds of new work
03.09.2007 06:23:49|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:23:49|BURP|Message from server: Not sending work - last request too recent: 70 sec
03.09.2007 06:23:49|BURP|Deferring communication for 1 min 0 sec
03.09.2007 06:23:49|BURP|Reason: no work from project
03.09.2007 06:24:50|BURP|Sending scheduler request: To fetch work
03.09.2007 06:24:50|BURP|Requesting 143401 seconds of new work
03.09.2007 06:24:55|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:24:55|BURP|Message from server: Not sending work - last request too recent: 70 sec
03.09.2007 06:24:55|BURP|Deferring communication for 1 min 0 sec
03.09.2007 06:24:55|BURP|Reason: no work from project
03.09.2007 06:25:55|BURP|Sending scheduler request: To fetch work
03.09.2007 06:25:55|BURP|Requesting 143746 seconds of new work
03.09.2007 06:26:01|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:26:01|BURP|Message from server: Not sending work - last request too recent: 66 sec
03.09.2007 06:26:01|BURP|Deferring communication for 1 min 0 sec
03.09.2007 06:26:01|BURP|Reason: no work from project
03.09.2007 06:26:14|DepSpid|[checkpoint_debug] result spider_116086_0 checkpointed
03.09.2007 06:27:01|BURP|Sending scheduler request: To fetch work
03.09.2007 06:27:01|BURP|Requesting 143941 seconds of new work
03.09.2007 06:27:06|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:27:06|BURP|Message from server: Not sending work - last request too recent: 66 sec
03.09.2007 06:27:06|BURP|Deferring communication for 1 min 37 sec
03.09.2007 06:27:06|BURP|Reason: no work from project
03.09.2007 06:28:47|BURP|Sending scheduler request: To fetch work
03.09.2007 06:28:47|BURP|Requesting 144137 seconds of new work
03.09.2007 06:28:52|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:28:52|BURP|Message from server: Not sending work - last request too recent: 105 sec
03.09.2007 06:28:52|BURP|Deferring communication for 6 min 10 sec
03.09.2007 06:28:52|BURP|Reason: no work from project
03.09.2007 06:35:07|BURP|Sending scheduler request: To fetch work
03.09.2007 06:35:07|BURP|Requesting 145230 seconds of new work
03.09.2007 06:35:12|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 06:35:12|BURP|Message from server: No work sent
03.09.2007 06:35:12|BURP|Message from server: (reached per-host limit of 2 tasks)
03.09.2007 06:35:12|BURP|Deferring communication for 14 min 34 sec
03.09.2007 06:35:12|BURP|Reason: no work from project

And so on... finally, the next RPC gets delayed for so long that chances are high that one or even both of your BURP WUs have completed in the meantime. Franz Kafka would be very fond of that (observed with 5.10.18, but I don't think it depends on the version).

It would be very helpful if the BOINC server specified what RPC request interval it would like to see (as opposed to just saying which one is too short), and if the BOINC client was able to understand that message. But that's up to the BOINC devs, I guess. The problem is worsened by the fact that a completed result still counts as one of the two you are allowed to have. Watch this:

03.09.2007 10:27:49|BURP|Computation for task 616in0.zip__ses0000000616_frm0000000408_prt00000.wu_0 finished
03.09.2007 10:28:03||Resuming network activity (Talk about micro-management!)
03.09.2007 10:28:03|BURP|Sending scheduler request: To fetch work
03.09.2007 10:28:03|BURP|Requesting 171458 seconds of new work
03.09.2007 10:28:04|BURP|[file_xfer] Started upload of file 616in0.zip__ses0000000616_frm0000000408_prt00000.wu_0_0
03.09.2007 10:28:08|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 10:28:08|BURP|Message from server: No work sent
03.09.2007 10:28:08|BURP|Message from server: (reached per-host limit of 2 tasks)
03.09.2007 10:28:08|BURP|Deferring communication for 1 min 0 sec
03.09.2007 10:28:08|BURP|Reason: no work from project
03.09.2007 10:28:10|BURP|[file_xfer] Finished upload of file 616in0.zip__ses0000000616_frm0000000408_prt00000.wu_0_0
03.09.2007 10:28:10|BURP|[file_xfer] Throughput 35701 bytes/sec
03.09.2007 10:28:19||Suspending network activity - user request
Volunteer moderator, Project administrator, Joined: 16 Jun 04, Posts: 4571, Credit: 2,100,463, RAC: 8
> I like the idea. But with BOINC's silly work fetch behaviour, a lot of user micro-management would be necessary in order to keep your dual core running all the time [...]

True. Given max_wus_in_progress there's really no need to use the minimum wait time between requests anymore (and personally that feature has always annoyed me when testing the BOINC client).

Trying out: max 3 concurrent WUs (expected CPU count = 2, plus 1 extra WU to make sure work fetch can continue while the others are busy), max 1 sent at a time, no forced delays.
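Put together, the trial setup described above would look roughly like this in config.xml. Again, this is only a sketch of the settings named in the post, not a copy of the project's actual file:

```xml
<boinc>
  <config>
    <!-- At most 3 results in progress per host: 2 cores plus 1 spare so work fetch can continue -->
    <max_wus_in_progress>3</max_wus_in_progress>
    <!-- Only 1 result handed out per scheduler request -->
    <max_wus_to_send>1</max_wus_to_send>
    <!-- No forced delay between requests (the tag could equally be removed entirely) -->
    <min_sendwork_interval>0</min_sendwork_interval>
  </config>
</boinc>
```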
Fischer-Kerli (Project donor), Joined: 24 Mar 05, Posts: 70, Credit: 78,553, RAC: 0
With only one additional WU, there may be a short pause during which one of the cores is idle (see below), but that should be tolerable. Apart from that, the solution works great for session 616 with its half-hour WU length. Thanks for that! Not so sure about future shorter sessions (EDIT: or longer ones, for that matter), though.

03.09.2007 12:04:45|BURP|Sending scheduler request: To fetch work
03.09.2007 12:04:45|BURP|Requesting 168148 seconds of new work
03.09.2007 12:04:51|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 12:04:51|BURP|Message from server: No work sent
03.09.2007 12:04:51|BURP|Message from server: (reached per-host limit of 3 tasks)
03.09.2007 12:04:51|BURP|Deferring communication for 14 min 58 sec
03.09.2007 12:04:51|BURP|Reason: no work from project
03.09.2007 12:14:42|BURP|Computation for task 616in0.zip__ses0000000616_frm0000000489_prt00002.wu_3 finished
03.09.2007 12:14:42|BURP|Starting 616in0.zip__ses0000000616_frm0000000505_prt00000.wu_0
03.09.2007 12:14:42|BURP|[cpu_sched] Starting 616in0.zip__ses0000000616_frm0000000505_prt00000.wu_0 (initial)
03.09.2007 12:14:42|BURP|Starting task 616in0.zip__ses0000000616_frm0000000505_prt00000.wu_0 using blender version 455
03.09.2007 12:14:44|BURP|[file_xfer] Started upload of file 616in0.zip__ses0000000616_frm0000000489_prt00002.wu_3_0
03.09.2007 12:14:50|BURP|[file_xfer] Finished upload of file 616in0.zip__ses0000000616_frm0000000489_prt00002.wu_3_0
03.09.2007 12:14:50|BURP|[file_xfer] Throughput 38018 bytes/sec
03.09.2007 12:16:29|BURP|Computation for task 616in0.zip__ses0000000616_frm0000000474_prt00000.wu_2 finished
03.09.2007 12:16:31|BURP|[file_xfer] Started upload of file 616in0.zip__ses0000000616_frm0000000474_prt00000.wu_2_0
03.09.2007 12:16:36|BURP|[file_xfer] Finished upload of file 616in0.zip__ses0000000616_frm0000000474_prt00000.wu_2_0
03.09.2007 12:16:36|BURP|[file_xfer] Throughput 35463 bytes/sec
03.09.2007 12:19:51|BURP|Sending scheduler request: To fetch work
03.09.2007 12:19:51|BURP|Requesting 171111 seconds of new work, and reporting 2 completed tasks
03.09.2007 12:19:56|BURP|Scheduler RPC succeeded [server version 510]
03.09.2007 12:19:58|BURP|Starting 616in0.zip__ses0000000616_frm0000000509_prt00003.wu_1
03.09.2007 12:19:58|BURP|[cpu_sched] Starting 616in0.zip__ses0000000616_frm0000000509_prt00003.wu_1 (initial)
03.09.2007 12:19:58|BURP|Starting task 616in0.zip__ses0000000616_frm0000000509_prt00003.wu_1 using blender version 455
Joined: 29 Aug 07, Posts: 69, Credit: 39,901, RAC: 0
The 3-WU setting seems good to me too (better than 2 WUs). I now always have three WUs: one or two are active at any time (one core is used by other projects about half of the time), while the others are waiting, either ready to start or to be confirmed. I guess that the overall balance in WU distribution is much better now. It remains to be seen whether this leads to a performance increase. Anyhow, I think that this way more crunchers get a share of the work, and only as much as they can actually handle.
noderaser (Project donor), Joined: 28 Mar 06, Posts: 516, Credit: 1,567,702, RAC: 0
Would it be OK to raise the limit to 4, so that quad cores can fill all their cores with BURP units? Or maybe even 6, to have a little extra work?
Joined: 29 Aug 07, Posts: 69, Credit: 39,901, RAC: 0
There seems to be another issue with WU distribution where an improvement could possibly be made. Sessions seem to slow down considerably towards the end; sometimes just a few unfinished WUs leave the session hanging at over 99% complete for many hours. This seems to happen when BURP clients that received WUs were shut down before completing the tasks, or when a lot of BOINC projects are running on the same machine. Would it make sense to reduce the deadline, so that unfinished WUs are reassigned to "hungry" computers sooner? However, since the deadline must also accommodate slower computers, there is probably not much margin to make it shorter.

> Would it be OK to raise the limit to 4, so that quad cores can fill all their cores with BURP units? Or maybe even 6, to have a little extra work?

You seem to have enough RAM. Anyway, when more WUs are sent out per host, the deadlines should be shorter (to avoid hanging WUs).
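For context, the deadline being discussed corresponds to the workunit's delay_bound, which the project sets when each workunit is created. Depending on the server version it can be specified in the workunit input template or passed to the work generator; the fragment and the 30-hour value below are purely illustrative:

```xml
<!-- Workunit input template fragment (illustrative; 108000 s = 30 hours) -->
<workunit>
  <!-- Seconds a host is given to return the result before it is considered
       timed out and gets reissued to another machine -->
  <delay_bound>108000</delay_bound>
</workunit>
```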
[B^S] themerrill (Project donor), Joined: 16 Apr 07, Posts: 3, Credit: 3,619, RAC: 0
> There seems to be another issue with WU distribution where an improvement could possibly be made. Sessions seem to slow down considerably towards the end; sometimes just a few unfinished WUs leave the session hanging at over 99% complete for many hours. This seems to happen when BURP clients that received WUs were shut down before completing the tasks, or when a lot of BOINC projects are running on the same machine.

They recently extended the deadline to about 30 hours. Remember, BOINC enters "panic crunch" mode at 24. Putting it back to 24 should help alleviate the problem. Or you could send out a larger batch of the missing WUs; credit goes to those who get them in before quorum.
Volunteer moderator, Project administrator, Joined: 16 Jun 04, Posts: 4571, Credit: 2,100,463, RAC: 8
A system is already in place to boost the final workunits of each session. However, the new workunits are scheduled at the same priority as the original workunits were. Scheduling them at a higher priority and sending them to more reliable hosts is something that I'll look into adding (I just learned about this feature from the WCG people).
Joined: 29 Aug 07, Posts: 69, Credit: 39,901, RAC: 0
I didn't know about BOINC's "panic mode". (For anyone interested, there is information about BOINC's work scheduler here.) I think it's OK to keep BOINC from entering panic mode, except maybe for long-overdue work units; people don't like BOINC behaving abnormally.

> ...sending them to more reliable hosts...

That seems like a good solution. But how do you determine the reliability of hosts? Is there an internal rating system, or is it simply based on recent average credit?
Volunteer moderator, Project administrator, Joined: 16 Jun 04, Posts: 4571, Credit: 2,100,463, RAC: 8
The reliability rating is indeed based on RAC and turnaround time for the host's recent workunits (as far as I've understood - I haven't really had time to look into it yet).
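For those who want to dig further: the scheduler's "reliable host" mechanism is controlled by a handful of thresholds in config.xml. The option names and values below are quoted from memory of the BOINC server documentation and may differ between server versions, so treat this as an assumption to verify rather than a recipe:

```xml
<boinc>
  <config>
    <!-- A host counts as "reliable" if its average turnaround is below this many seconds... -->
    <reliable_max_avg_turnaround>86400</reliable_max_avg_turnaround>
    <!-- ...and its recent error rate is below this fraction -->
    <reliable_max_error_rate>0.05</reliable_max_error_rate>
    <!-- Results whose priority exceeds this value are preferentially sent to reliable hosts -->
    <reliable_priority_on_over>10</reliable_priority_on_over>
    <!-- Optionally shorten the deadline for such retries (fraction of the normal delay_bound) -->
    <reliable_reduced_delay_bound>0.5</reliable_reduced_delay_bound>
  </config>
</boinc>
```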
Project donor, Joined: 24 Mar 05, Posts: 94, Credit: 1,627,664, RAC: 0
> Would it be OK to raise the limit to 4, so that quad cores can fill all their cores with BURP units? Or maybe even 6, to have a little extra work?

Seconded, from a fellow quad-core cruncher :)