The Linux client needs some work.



Message boards : Client : The Linux client needs some work.

Author Message
Profile Stack
Joined: 27 Jul 05
Posts: 5
Credit: 1,678,054
RAC: 2,526
Message 11986 - Posted: 28 Aug 2013, 22:40:41 UTC

So right now, the Linux client is seriously pissing me off. Specifically, this damn session: http://burp.renderfarming.net/result.php?resultid=7445562 So this may get a bit rantish....

I had this session up to over 50 hours' worth of work when I had to shut down the system and replace a faulty hard drive. Did BURP pick up where it left off? Nope. Every other project does! But BURP hasn't supported resuming since...well...ever that I can remember. I have been doing BURP for quite some time and I seem to remember that if the client ever shuts down, the session restarts.

Again, I have known about this for a long time and it hasn't really ever bothered me before, except that I let it run for a few more days after its first shutdown and then my neighbors and I suffered a power outage! Another 30+ hours wasted!

OK no worries, that was last weekend, it will be up and running again and this task will soon be over, right?

Except that I got a phone call from my wife today saying the AC just died. It is only 90+ F outside and my office "is like an oven". Hooray! So I log in remotely and start shutting things down. Just a quick check and I see another 15 hours wasted on this freaking job! Not only that, but can someone please explain to me how I have BOINC set to use 75% of my CPUs (something EVERY OTHER PROJECT can adhere to) but BURP is currently using 100% utilization ON ALL OF THEM!! The job is only supposed to be a 2 processor job! Why the HELL is it using _EIGHT_?

Again, yes. The Linux client suspending problem has been around since the dawn of time, it seems. It rarely bothers me much, except for an occasional lost job because it completed when it wasn't supposed to be running. That sucks but whatever. And this isn't the first time I have noticed BURP using far more resources than it is supposed to.

In fact, I don't think any of these problems are original or unique by any means. I don't post much, but I lurk on the forums from time to time and I know this isn't the first time someone has complained about the Linux client. But are these issues ever going to be addressed?

I am close to twice the length of time it should have taken to complete this job AND I get to start ALL over again when I boot the box after the AC is finished. Grrrrr.....

Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4483
Credit: 2,094,806
RAC: 0
Message 11989 - Posted: 1 Sep 2013, 12:22:48 UTC
Last modified: 1 Sep 2013, 12:25:19 UTC

Yup, BURP is beta - and since there's only one person working on it we have some seriously long release cycles (like, year-long sometimes) and some long-standing bugs. You do seem to be pretty unlucky, though, with the power, the AC and the HDD all failing within such a short timeframe of each other.

As you say, the issues about checkpointing (a Blender thing) and suspend/resume (a Glue3 bug) have already been covered elsewhere. Checkpointing doesn't look like it will ever be implemented, but suspend/resume will probably be fixed in the next release. The current, big, multi-day workunits exaggerate the issues. Whenever possible the workunits are kept to 1-5 hours, which seems to be the sweet spot for performance.

However, you mention that the rendering process uses more CPUs than what was assigned to it. Is this reproducible? Does it always do it or just this once? When you get the AC back can you try it out again to see if it is repeatable?

Profile Stack
Joined: 27 Jul 05
Posts: 5
Credit: 1,678,054
RAC: 2,526
Message 11998 - Posted: 3 Sep 2013, 20:21:28 UTC - in response to Message 11989.

Greetings Janus.
On the multi-threaded jobs, I have noticed the over-allocation of CPUs a number of times. It was very repeatable on the last job. I killed that one because I had just put in another 100+ hours, needed to reboot earlier, and it of course restarted. I pulled another job but it is only single core so it is behaving for the most part. I will try to pull another multi-threaded job and see if it was just that session or not.

Profile Stack
Joined: 27 Jul 05
Posts: 5
Credit: 1,678,054
RAC: 2,526
Message 12004 - Posted: 10 Sep 2013, 3:34:39 UTC - in response to Message 11998.

Greetings,

I do BOINC cause I think it is nifty, but this isn't my area of expertise at all. However, complaining about something without being willing to help solve it is kind of silly. I don't really know what I can provide to help at all, but I finally pulled another multi-core BURP job. As soon as I spotted it, I paused it and I let most of the other work on my system complete. I figured this would narrow down any potential errors in the log files, should you care to view them.

As soon as the job started, it was doing well with the 6 of 8 cores that BOINC is allowed to use. However, only a few seconds went by before it jumped to using all eight cores. Here is a screenshot:


There are nine processes listed, but the top one is just the parent. It ran on eight processors for a few hours then it dropped to four. BOINC is only allowed to use 6 cores, BURP says it is a 6 core job, but this job certainly likes to run at 8 and 4. :-/

So what can I provide to help debug this issue further? This is obviously something I can replicate with different jobs and across multiple systems (though the last system is no longer doing BOINC as it has been re-purposed).

The OS is a fully updated Debian Wheezy (though I noticed this issue in Squeeze too, before the recent update). I run the BOINC app as found in the repos (7.0.27 at this time). It is 64-bit. And err...I am running out of ideas on other bits of useful info that aren't already in the computer listing...

I am just going to let the job run unless I hear otherwise. If there is a log file, or a test/check you want me to run that can help with debugging the issue then just let me know.

Thanks!
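For anyone wanting to collect the same kind of evidence without a screenshot, here is a minimal Python sketch, assuming a Linux system with `/proc` mounted. It reads the `Threads:` field from `/proc/<pid>/status` for every process whose command line contains a given fragment. The function names and the `"blender"` default are illustrative assumptions, not part of BOINC, Glue3, or Blender, and a thread count is only a rough indicator: a process can hold more threads than it keeps busy.

```python
import os


def parse_thread_count(status_text):
    """Return the 'Threads:' value from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None


def thread_counts(name_fragment="blender", proc_root="/proc"):
    """Map pid -> thread count for processes whose command line contains name_fragment.

    name_fragment and proc_root are hypothetical parameters chosen for this sketch.
    """
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                # Arguments in cmdline are separated by NUL bytes.
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if name_fragment not in cmdline:
                continue
            with open(os.path.join(proc_root, entry, "status")) as f:
                counts[int(entry)] = parse_thread_count(f.read())
        except OSError:
            continue  # process exited or is not readable
    return counts
```

Calling `thread_counts()` while the job is running and pasting the resulting dictionary into a post would give the same information as the process list in text form.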

Profile Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4483
Credit: 2,094,806
RAC: 0
Message 12005 - Posted: 10 Sep 2013, 17:26:47 UTC - in response to Message 12004.

Pretty peculiar indeed - especially since the parameter is right there in your screenshot:
./blender_ld blender -noaudio -b in -F EXR -t 6 -f 386 0.0 0.0 1.0 1.0
It means that BOINC told Glue3 about the thread limit and Glue3 started Blender with the right parameters but for some reason Blender still runs too many threads.
Your report should be enough to pinpoint the issue, thanks!
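The check described above (compare the `-t` value on the command line with what the process really uses) can be sketched in a few lines of Python. This is only an illustration: `requested_threads` and `over_limit` are hypothetical helper names, and the observed core count has to come from somewhere else, such as a process monitor.

```python
import shlex


def requested_threads(cmdline):
    """Return the integer following -t/--threads in a command line, or None."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args[:-1]):
        if arg in ("-t", "--threads"):
            return int(args[i + 1])
    return None


def over_limit(cmdline, observed_busy_cores):
    """True if the observed number of busy cores exceeds the requested thread limit."""
    limit = requested_threads(cmdline)
    return limit is not None and observed_busy_cores > limit


# Command line quoted in the post above.
CMD = "./blender_ld blender -noaudio -b in -F EXR -t 6 -f 386 0.0 0.0 1.0 1.0"
```

With the command line from the screenshot, `requested_threads(CMD)` gives 6, so eight busy cores counts as over the limit while four does not.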

Profile Stack
Joined: 27 Jul 05
Posts: 5
Credit: 1,678,054
RAC: 2,526
Message 12008 - Posted: 10 Sep 2013, 23:29:20 UTC - in response to Message 12005.

Well, if you need me to pull anything, just let me know. When BOINC started this BURP job it was estimating a ~4 hour run to completion. It has now been running for 25:37 hrs and estimates 31:38 hrs remaining. So I have a feeling that this job will go on for a while. :-D

Also, when I got home from work today I noticed that it is back up to using all 8 cores.

All the projects are on suspend (except for WUProp) until the job finishes. I still have that PrimeGrid job in the queue, but it hasn't run at all yet since I took that first screenshot.

So if there is any log file or any information I can provide to help out, just let me know.

Thanks again!
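One way to put a number on "using all 8 cores" is to sample the process's accumulated CPU time twice and divide by the elapsed wall time. The following is a rough Python sketch, assuming Linux and the `/proc/<pid>/stat` layout in which fields 14 and 15 are `utime` and `stime` in clock ticks. The function names are made up for this example.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (clock ticks) from one line of /proc/<pid>/stat."""
    # The command name (field 2) is in parentheses and may contain spaces,
    # so split on the last ')' and count fields from field 3 onwards.
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    return utime + stime


def avg_cores(ticks_before, ticks_after, elapsed_seconds, clk_tck):
    """Average number of cores kept busy between two samples."""
    return (ticks_after - ticks_before) / (clk_tck * elapsed_seconds)


def sample(pid, interval=10.0):
    """Measure average busy cores for one pid over `interval` seconds (Linux only)."""
    clk_tck = os.sysconf("SC_CLK_TCK")
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return avg_cores(before, after, interval, clk_tck)
```

A result near 8.0 from `sample(pid)` on a job started with a thread limit of 6 would match the behaviour reported in this thread. The figure covers all threads of the process, but not any separate child processes.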


Profile Janus
Volunteer moderator
Project administrator
Avatar
Send message
Joined: 16 Jun 04
Posts: 4483
Credit: 2,094,806
RAC: 0
Message 12015 - Posted: 16 Sep 2013, 20:11:33 UTC

Good news:
The reason for this has now been narrowed down to a thread-detection patch that didn't apply cleanly to the current Sunflower clients.

"Bad" news:
Unfortunately the Sunflower clients will not be updated anymore because we're soon to reach the end of the Sunflower project. However, once it ends, the changes from Sunflower will be merged into the normal clients, and the fix is included there.

In other words: The threading issue has been isolated and a fix is on its way, it just won't be instant.

