BURP Criu Checkpointing

Janus
Volunteer moderator
Project administrator
Joined: 16 Jun 04
Posts: 4553
Credit: 2,097,282
RAC: 3
Message 15356 - Posted: 7 Mar 2018, 19:10:53 UTC
Last modified: 7 Mar 2018, 19:46:39 UTC

Beating a dead horse here, but BURP doesn't support checkpointing with Blender because Blender itself does not support checkpointing.
For quite a while this was believed to be impossible without adding specific code for it to Blender, but it turns out to be possible, just very, very (very!) difficult.

The OpenVZ people created a little tool called Criu (CRIU, Checkpoint/Restore In Userspace) that can create snapshots of live processes on Linux. Surprisingly, with enough tweaking, it actually works on Blender running inside the Glue3 controller process in our client.

However, getting it working reliably is incredibly complex and the tool has a number of drawbacks:

  • Only works on Linux
  • Only works if BOINC/Glue3 runs as a privileged user or a privileged user has started a Criu service
  • Does not work with CUDA/OpenCL
  • Checkpoints are very large compared to those of other BOINC projects: essentially the RAM usage of the process plus a bit
  • Creating the checkpoint freezes the process for a short while
  • Restoring from a checkpoint is fairly complicated: it involves spawning a new PID namespace so that the restored process IDs do not collide with those of processes already running on the system, which again requires a privileged user
  • Input/Output and log files are more difficult to handle
  • Some data will have to be juggled around, and some changes may have to be rolled back if a stale restore point is ever used (for example after a computer crash, when there was no time to write a fresh restore point while shutting down)
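
To make the dump/restore steps in the list above a bit more concrete, here is a rough Python sketch that only builds the command lines; it does not run anything. This is not the actual Glue3 code. The function names are made up for illustration, and wrapping the restore in `unshare --pid --fork --mount-proc` is just one assumed way of getting a fresh PID namespace. The `criu` options used (`--tree`, `--images-dir`, `--shell-job`, `--leave-running`) are standard ones, but whether this exact combination is enough for Blender is an assumption. `rss_kib` reads the `VmRSS` field from the text of `/proc/<pid>/status` as a rough lower bound on checkpoint size.

```python
def criu_dump_cmd(pid, images_dir, shell_job=False, leave_running=False):
    """Build (but do not run) a `criu dump` command for the process tree at pid."""
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", str(images_dir)]
    if shell_job:
        # Needed when the target is attached to a terminal session.
        cmd.append("--shell-job")
    if leave_running:
        # Keep the process alive after the snapshot instead of killing it.
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(images_dir, shell_job=False, new_pid_namespace=True):
    """Build (but do not run) a `criu restore` command.

    Assumption: a fresh PID namespace is created with util-linux `unshare`,
    so the restored PIDs cannot clash with PIDs already in use on the host.
    """
    cmd = ["criu", "restore", "--images-dir", str(images_dir)]
    if shell_job:
        cmd.append("--shell-job")
    if new_pid_namespace:
        cmd = ["unshare", "--pid", "--fork", "--mount-proc"] + cmd
    return cmd


def rss_kib(status_text):
    """Return VmRSS in KiB from the contents of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None
```

Both commands would have to be run as a privileged user (for example through `subprocess.run(cmd, check=True)`), which is exactly the drawback mentioned in the list.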


Here's an example of some Blender output that was suspended/resumed. Notice how Blender gets a bit confused about the sudden time shift:

...
Fra:1 Mem:204.58M (0.00M, Peak 225.66M) | Time:00:00.82 | Mem:27.12M, Peak:27.12M | Scene, RenderLayer | Updating Lookup Tables
Fra:1 Mem:204.58M (0.00M, Peak 225.66M) | Time:00:00.82 | Mem:27.12M, Peak:27.12M | Scene, RenderLayer | Updating Baking
Fra:1 Mem:204.58M (0.00M, Peak 225.66M) | Time:00:00.82 | Mem:27.12M, Peak:27.12M | Scene, RenderLayer | Updating Device | Writing constant memory
Fra:1 Mem:204.58M (0.00M, Peak 225.66M) | Time:00:00.82 | Mem:27.12M, Peak:27.12M | Scene, RenderLayer | Path Tracing Tile 0/40
Fra:1 Mem:470.08M (0.00M, Peak 483.58M) | Time:00:36.53 | Remaining:03:26.51 | Mem:292.70M, Peak:297.20M | Scene, RenderLayer | Path Tracing Tile 1/40
Fra:1 Mem:470.08M (0.00M, Peak 483.58M) | Time:00:50.17 | Remaining:03:02.22 | Mem:292.70M, Peak:297.20M | Scene, RenderLayer | Path Tracing Tile 2/40
Fra:1 Mem:467.55M (0.00M, Peak 483.58M) | Time:00:52.53 | Remaining:02:58.50 | Mem:290.17M, Peak:297.20M | Scene, RenderLayer | Path Tracing Tile 3/40
Fra:1 Mem:467.55M (0.00M, Peak 483.58M) | Time:01:07.99 | Remaining:02:47.67 | Mem:290.17M, Peak:297.20M | Scene, RenderLayer | Path Tracing Tile 4/40
Fra:1 Mem:465.02M (0.00M, Peak 483.58M) | Time:01:08.55 | Remaining:02:46.24 | Mem:287.63M, Peak:297.20M | Scene, RenderLayer | Path Tracing Tile 5/40
[Criu suspends]
a bit more than half an hour passes while I mess around trying to get it to resume...
[Criu resumes]
Fra:1 Mem:462.48M (0.00M, Peak 483.58M) | Time:36:03.08 | Remaining:01:18:41.27 | Mem:285.10M, Peak:297.20M | Scene, RenderLayer | Path Tracing Tile 6/40
Fra:1 Mem:455.45M (0.00M, Peak 483.58M) | Time:36:15.55 | Remaining:01:05:12.37 | Mem:278.07M, Peak:297.20M | Scene, RenderLayer | Path Tracing Tile 7/40
Fra:1 Mem:448.42M (0.00M, Peak 483.58M) | Time:36:23.59 | Remaining:57:56.10 | Mem:271.04M, Peak:297.20M | Scene, RenderLayer | Path Tracing Tile 8/40
Fra:1 Mem:440.41M (0.00M, Peak 483.58M) | Time:36:23.96 | Remaining:57:39.07 | Mem:263.03M, Peak:297.20M | Scene, RenderLayer | Path Tracing Tile 9/40
Fra:1 Mem:438.44M (0.00M, Peak 483.58M) | Time:36:31.36 | Remaining:52:17.09 | Mem:261.05M, Peak:297.20M | Scene, RenderLayer | Path Tracing Tile 10/40
...
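
The time shift in the output above can also be picked out automatically. The following is a small Python sketch that parses the `Time:` field of each line and reports gaps larger than a threshold. The function names and the 300-second default threshold are arbitrary choices for illustration, not anything Blender or the client provides.

```python
import re

# Matches the elapsed-time field, e.g. "Time:01:08.55" or "Time:36:03.08".
# The word boundary keeps it from matching inside other field names.
TIME_RE = re.compile(r"\bTime:([\d:.]+)")


def to_seconds(stamp):
    """Convert 'MM:SS.ss' or 'HH:MM:SS.ss' to seconds."""
    secs = 0.0
    for part in stamp.split(":"):
        secs = secs * 60 + float(part)
    return secs


def find_time_jumps(lines, threshold=300.0):
    """Yield (previous, current, gap) in seconds for suspicious gaps.

    Lines without a Time: field (such as "[Criu suspends]") are skipped.
    """
    prev = None
    for line in lines:
        match = TIME_RE.search(line)
        if not match:
            continue
        cur = to_seconds(match.group(1))
        if prev is not None and cur - prev > threshold:
            yield prev, cur, cur - prev
        prev = cur
```

Fed the lines above, this reports a single gap between tile 5/40 (01:08.55) and tile 6/40 (36:03.08) of roughly 2094.5 seconds, a little under 35 minutes, which matches the pause while the process was suspended.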


For now I'll just leave it at this proof-of-concept example and note that Criu is a very impressive tool. It even allows resuming Blender on a different computer from the one where it was originally running, which is quite mind-blowing.

So, yes, it is theoretically possible to provide checkpointing under very specific circumstances for some specific kinds of sessions on Linux. The downside is that doing so would add a lot of complexity to the client and would require more privileges than we are comfortable asking people to give it.