farm render speed improvement



Message boards : Number crunching : farm render speed improvement

jk1swt
Joined: 20 Mar 11
Posts: 20
Credit: 2,150,361
RAC: 1
Message 12627 - Posted: 26 Apr 2014, 20:40:21 UTC

From a user's end, I think the speed of the farm is way below its potential considering the number of computers attached. The issue I see is that my computer sits idle a good bit while there are projects still being rendered. From observation: say animation "X" has 3000 workunits but only 15 of them are out in the wild for rendering at any one time, and its render time is relatively short compared to other current projects with long render times. Project "X"'s workunits can get gated (held back) behind the completion of the long workunits (project "Y") on host nodes. The problem is more apparent when the long "Y" units have a high RAM requirement: host nodes that don't even meet the qualifications for the "Y" units sit idle, while the computers that have "X" units in their queue only get to them after finishing with "Y". (I hope this makes sense.)

The other issue is that every time BOINC gets "0 new tasks" from the scheduler, it increases the time until its next request. I've seen as much as 8 hours between requests in my log. So if I don't actively pay attention to my host, it can sit idle for a long time even when there are projects needing to be done. My concern is that I can't be the only one, and keeping all attached computers active would make this farm scream with speed.

jk1swt

funkydude
Joined: 23 Dec 13
Posts: 275
Credit: 2,478,281
RAC: 0
Message 12631 - Posted: 26 Apr 2014, 21:36:13 UTC - in response to Message 12627.

To add an example to this thread: as I type this post, 2 projects have units available. On this laptop I can't process the 11 GB RAM project, but I can process the other. I have to keep hitting update in BOINC, with about a 50% success rate of getting units from the project I can crunch; the other 50% of requests just return errors about the RAM requirement.

I envision this happening to other PCs as they automatically query for units and get rejected due to RAM, even though there's another project they could crunch. It needs to cycle through all available projects and not fail after querying one.

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4461
Credit: 2,094,806
RAC: 0
Message 12633 - Posted: 27 Apr 2014, 9:23:13 UTC
Last modified: 27 Apr 2014, 9:32:17 UTC

The BOINC scheduler is extremely stupid. It has X slots available into which workunits can be loaded. When a client connects, the scheduler goes through those X slots and fails if nothing fits. Normally BOINC loads random workunits into the slots and hopes that something fits. Obviously this breaks down as soon as there is even a small disparity in workunit size (as you both describe), because the small units get picked fast while the larger ones stick around and eventually fill all X slots.

BURP isn't using the standard feeding mechanism of randomly loading in workunits. Instead, every Y seconds it fills a portion of the X slots with units selected from the currently rendering sessions, in such a way that if a session already has a long-running workunit in the slots then no more long-running workunits from that session are added. At first glance this fixes the issue of quick sessions being held back by slow sessions (or of high-memory sessions polluting the entire slot list for low-memory hosts).
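The fill rule described above can be sketched roughly like this. This is a toy illustration, not BURP's actual code, and all names are hypothetical:

```python
def fill_slots(slots, max_slots, pending):
    # Allow at most one long-running workunit per session in the slots,
    # so a single slow session cannot claim the whole feeder.
    long_sessions = {wu["session"] for wu in slots if wu["long"]}
    for wu in pending:
        if len(slots) >= max_slots:
            break
        if wu["long"] and wu["session"] in long_sessions:
            continue  # this session already has a long unit in a slot
        slots.append(wu)
        if wu["long"]:
            long_sessions.add(wu["session"])
    return slots

slots = []
pending = [
    {"session": "slow", "long": True},
    {"session": "slow", "long": True},   # skipped: one long unit already in
    {"session": "quick", "long": False},
    {"session": "quick", "long": False},
]
fill_slots(slots, 3, pending)
```

Only one of the two long "slow" units makes it into the slots; the quick units fill the rest.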
There are two remaining issues:
1) Workunits that fail and must be resent also take up slots in the list of X. Sometimes a lot of workunits fail and pollute the list so that no new work can enter it.
2) The period of Y seconds is sometimes not low enough. Normally it is set to 5 secs, but when there's an extremely fast session (like right now) a bit of work gets created and is immediately fetched within 0.5 secs. Then for the remaining 4.5 secs, until the next cycle starts, clients are faced with the "0 new tasks" message.
Simply keeping Y low at all times is a trade-off between power usage and performance. An interesting future dev task could be to make Y vary with the speed of the currently rendering sessions (or to simply disallow sessions with <10 sec render times, as they are silly anyway).
Example: at Y=5 secs, a 3600-frame session needs around 5 hours just to pass through the feeder, no matter how many hosts are attached.
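The 5-hour figure is just 3600 frames times 5 seconds per feeder cycle. A quick sanity check of that arithmetic, plus one hypothetical shape the suggested adaptive Y policy could take (the scaling constants here are invented for illustration):

```python
def insertion_time_hours(frames, period_secs):
    # One unit enters the feeder per cycle, so a session needs at least
    # frames * period seconds to fully pass through the queue.
    return frames * period_secs / 3600.0

def adaptive_period(avg_render_secs, floor=0.5, ceiling=5.0):
    # Hypothetical policy: feed faster for quick sessions, bounded by a
    # floor (power usage) and a ceiling (no need to hurry slow sessions).
    return min(ceiling, max(floor, avg_render_secs / 10.0))

assert insertion_time_hours(3600, 5) == 5.0  # matches the 5-hour example
assert adaptive_period(2) == 0.5             # very fast session
assert adaptive_period(600) == 5.0           # slow session
```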

This was one side of the problem. The other side is client-side. The same issue can happen on a client when a long-running unit stalls a quick unit waiting in the client queue. This is compounded by the fact that people typically run other projects besides BURP, and those can have long units too, but even long-running BURP workunits may stall quick ones.
First, this is mitigated by keeping the BURP client queue very small compared to other BOINC projects. Also, work fetched from BURP has short deadlines, so the client will prefer to do BURP work in continuous chunks before switching back to other projects.
Last but not least, the server will assign additional client resources to short-deadline units. If one of those extra copies finishes before the long-running workunit on the blocking host, the queued quick unit is aborted on the blocking host, provided it hasn't started yet. Basically the work is re-allocated, and the previously blocking host fetches something else to do once it is done with its long-running unit.
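That re-allocation rule can be sketched in a few lines. This is a hypothetical illustration of the behaviour described above, not the server's real data model:

```python
class Replica:
    """Hypothetical stand-in for one copy of a short-deadline workunit."""
    def __init__(self, host, started=False):
        self.host = host
        self.started = started
        self.aborted = False

def on_quick_unit_finished(finishing_host, replicas):
    # When one copy of a short-deadline unit completes, abort any copy
    # still waiting (unstarted) behind a long unit in another host's
    # queue, so the blocked host can fetch fresh work later instead.
    for rep in replicas:
        if rep.host != finishing_host and not rep.started:
            rep.aborted = True
    return replicas

reps = [Replica("fast-host", started=True),
        Replica("blocked-host")]          # queued behind a long unit
on_quick_unit_finished("fast-host", reps)
assert reps[1].aborted and not reps[0].aborted
```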

BUT! All of this applies to Blender Internal workunits. Cycles workunits (and workunits for any non-deterministic renderer) work differently, because we don't have a proper validation system for them yet. Many of these smart features are disabled for Cycles units.
Currently it is a bit difficult to see whether a unit is BI or Cycles. If you open the workunit details for a rendered unit you can spot it in the first few lines, but there's nothing on the website yet that shows it beforehand.

funkydude
Joined: 23 Dec 13
Posts: 275
Credit: 2,478,281
RAC: 0
Message 12750 - Posted: 26 May 2014, 14:12:21 UTC - in response to Message 12633.

Is it possible to speed up how long it takes to create and distribute work units?

funkydude
Joined: 23 Dec 13
Posts: 275
Credit: 2,478,281
RAC: 0
Message 12751 - Posted: 26 May 2014, 17:01:34 UTC - in response to Message 12750.

Project 2136 seems to take ~10 minutes to create and distribute 1 WU.

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4461
Credit: 2,094,806
RAC: 0
Message 12752 - Posted: 26 May 2014, 17:45:23 UTC

It is a 9.9 GB memory session. If unsent units are not picked up, no more units will be generated for that session until new units are required.
Link to unit overview.

funkydude
Joined: 23 Dec 13
Posts: 275
Credit: 2,478,281
RAC: 0
Message 12753 - Posted: 26 May 2014, 19:43:40 UTC - in response to Message 12752.
Last modified: 26 May 2014, 19:45:32 UTC

It is a 9.9 GB memory session. If unsent units are not picked up, no more units will be generated for that session until new units are required.
Link to unit overview.


I don't think being unsent should prevent more from being created; that doesn't handle a "flood" of requests arriving at the same time.

I think there should be a buffer of ~10 or so. It can be expanded larger as the project grows.

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4461
Credit: 2,094,806
RAC: 0
Message 12754 - Posted: 26 May 2014, 21:34:20 UTC - in response to Message 12753.
Last modified: 26 May 2014, 21:35:09 UTC

There is a buffer of 10. In fact it is 18 right now. The buffer is replenished every 0.5 secs at the moment.

Werinbert
Joined: 9 May 13
Posts: 4
Credit: 10,136
RAC: 0
Message 12768 - Posted: 31 May 2014, 16:41:20 UTC

Janus, another speed issue is the download speed.
For most projects I am able to download at speeds of well over 100 KBps; for BURP, however, it seems to be limited to less than 4 KBps. As far as I can tell this limit is on your end, either your servers or your ISP. Normally this would not really be an issue, but with the large files BURP tries to download it takes hours to complete. In the meantime BOINC thinks there are a bunch of WUs waiting to be crunched and doesn't look for more, yet the BURP WUs are still "downloading" and can't be worked on. This leads to threads running dry of work.

A non crunching computer is a sad computer.

So, Janus, is it possible to increase your bandwidth for data transfer?

funkydude
Joined: 23 Dec 13
Posts: 275
Credit: 2,478,281
RAC: 0
Message 12769 - Posted: 31 May 2014, 17:38:20 UTC - in response to Message 12768.

Werinbert, please see the other thread about download speed. This is a temporary issue.

Janus, are you sure about that? Under the "unsent" section on the server status page, I've never seen it go beyond 2. At the moment I'm frequently getting "no tasks available" responses from the server, even though I'm capable of running all the current projects as of writing.

Werinbert
Joined: 9 May 13
Posts: 4
Credit: 10,136
RAC: 0
Message 12770 - Posted: 31 May 2014, 18:12:08 UTC - in response to Message 12769.

Werinbert, please see the other thread about download speed. This is a temporary issue.

Sorry ... I did eventually see the other thread. The 3.7KBps download speed seems to be affecting my brain as well. ;-)

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4461
Credit: 2,094,806
RAC: 0
Message 12772 - Posted: 31 May 2014, 20:56:22 UTC
Last modified: 31 May 2014, 21:29:55 UTC

2 is definitely a bit low. I'll have a look at increasing the BOINC shared memory segment to allow for more units in the pool.

[Edit: ] Ah there we go, BURP may be updating the memory segment at 0.5sec intervals but BOINC was set to only read it every 5 secs. That explains why there was a discrepancy between what the system was showing in the queue and what was available for clients.
The shared memory segment is now also a factor 10 larger than before. Let's see how it all pans out.

[Edit2: ] Also, it seems every 10th update takes longer. Some timing debug will have to be added to see what is going on here.

funkydude
Joined: 23 Dec 13
Posts: 275
Credit: 2,478,281
RAC: 0
Message 12773 - Posted: 31 May 2014, 21:46:01 UTC - in response to Message 12772.

As always Janus, nice work. It now reads over 100 units :)

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4461
Credit: 2,094,806
RAC: 0
Message 12775 - Posted: 1 Jun 2014, 9:26:23 UTC

Actually it is not quite good enough yet - working on getting the timing info as soon as possible to see what the issue is.

funkydude
Joined: 23 Dec 13
Posts: 275
Credit: 2,478,281
RAC: 0
Message 12776 - Posted: 1 Jun 2014, 11:11:02 UTC - in response to Message 12775.

Actually it is not quite good enough yet - working on getting the timing info as soon as possible to see what the issue is.


It seems to be broken again. Reporting 0 unsent tasks as of writing. Looking at my log, my system has gone for hours without crunching again.

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4461
Credit: 2,094,806
RAC: 0
Message 12777 - Posted: 1 Jun 2014, 11:24:01 UTC
Last modified: 1 Jun 2014, 11:40:43 UTC

This time it is because I'm working on the code that generates units and have had to start/stop it a lot to pinpoint the location of the issue. You have probably spotted the "Project is down for maintenance" log entries in the client log.

The good news: I did find the culprit behind the weird timing issues. In a very special case, where a non-Blender-Internal session in the queue has sent out all its workunits but is not yet finished rendering, and another session already has unsent workunits in the render queue (yes, I told you the context was a bit special), the workunit creation system can enter a state where it creates only 2 workunits and then triggers a fail-safe mode. That mode does a full session list update, which takes around 74 seconds, during which the feeder queue can be depleted.

Essentially what happens is that the system looks at the poor Cycles session, sees that it is starved (no workunits in the queue) and not yet finished, and so the priority selector keeps selecting it over and over even though no new workunits can be created for it. When the other session runs dry too, the priority selector chooses randomly between the two and new work is generated again, but quite quickly it falls back into the previous scenario.

A temporary fix is in place, and a proper fix will be up later today: the priority selector will be given a bit of memory so that it remembers not to select a session for which it recently could not create any more work.
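The "selector with memory" idea could look something like this. A minimal sketch with hypothetical names and an invented cooldown length, not the actual BURP implementation:

```python
import time

class PrioritySelector:
    """Toy version of the fix described above: remember sessions for
    which no new work could recently be created, and skip them for a
    cooldown period instead of selecting them over and over."""

    def __init__(self, cooldown_secs=60):
        self.cooldown = cooldown_secs
        self.starved_until = {}  # session id -> time to skip it until

    def pick(self, sessions, now=None):
        # sessions: {session_id: priority}; highest priority wins
        now = time.time() if now is None else now
        eligible = {s: p for s, p in sessions.items()
                    if self.starved_until.get(s, 0) <= now}
        if not eligible:
            return None
        return max(eligible, key=eligible.get)

    def mark_starved(self, session, now=None):
        # Call when no new workunits could be created for a session.
        now = time.time() if now is None else now
        self.starved_until[session] = now + self.cooldown

sel = PrioritySelector(cooldown_secs=60)
sessions = {"cycles": 10, "other": 5}
assert sel.pick(sessions, now=0) == "cycles"   # highest priority first
sel.mark_starved("cycles", now=0)              # no work could be made
assert sel.pick(sessions, now=1) == "other"    # skipped during cooldown
assert sel.pick(sessions, now=61) == "cycles"  # eligible again
```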

[Edit: ] Oh, and the problematic cycles session just completed rendering so the trigger for this issue is no longer there.

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4461
Credit: 2,094,806
RAC: 0
Message 12778 - Posted: 1 Jun 2014, 13:04:24 UTC

The weird timing issue is fixed. Also, the fail-safe session list update was brought down from a runtime of around 74 secs to around 20 ms on the test input. The next test of whether this really works out comes at the end of the currently running Cycles session.

