Proposed change of Project-->Client download system

Last edited 2007-04-21

This document describes how BOINC could move from the current HTTP-based file distribution system to a distributed one, in which clients help distribute the files.

The current situation

Currently projects and participants suffer under the bandwidth limitations of the project servers. These limitations become particularly clear right after a new science application has been released for distribution to the participants' BOINC clients. At such a time the project servers may become congested. HTTP mirrors could be set up to improve the situation a bit, but bandwidth is expensive no matter how we look at it, so the fewer mirrors we can get away with the better.

In the current system an application's travel from creation to its destination on the client is as follows:

How may this be improved?

Well, since the bottleneck is the project servers, we may as well figure out a way to let clients share the load. There's no need to reinvent the wheel so let's have a look at Bittorrent ([Wikipedia] / [Website]):

Bittorrent is a way to transfer files in a distributed manner. Each client fetches pieces of a file from a seed (a place that already has the entire file). The clients find each other through a central tracker and then help each other gather the entire file by sharing the pieces they hold, using both their inbound and outbound bandwidth.
When a client has the entire file it becomes a seed itself for a while.

In this manner the load on the initial seed is considerably lower than if the seed (the project webserver) had to distribute the file directly to each of the clients. In fact the initial seed may even be offline at times - the network will still be able to distribute the file as long as the clients that are online hold enough pieces between them to recreate the entire file.

Nice! Why didn't we start using this ages ago?

A few years ago, when this idea was first put forward, implementing it would have required not only a change to the BOINC client but also a rewrite of large parts of the serverside scripts, as well as two new serverside service dependencies (the Bittorrent tracker and one or more seed servers).
Luckily, things have changed for the better since then.

Recently the people who develop GetRight (a popular download accelerator) added Bittorrent support to it - and as part of that they proposed an extension to the .torrent format to support what is called "web seeds". Many Bittorrent clients support web seeds now. A web seed is simply an HTTP or FTP server like the ones projects currently use to serve the files - in other words the need for a dedicated seed service went away. We can therefore simply use our current HTTP setup to seed files.

Also, since the Bittorrent protocol has become increasingly popular, PHP scripts now exist that take care of the tracking without the need for a dedicated tracker service. The scripts use only software that is already on the BOINC dependency list (PHP, Apache and, for some of them, MySQL). Such a script could be bundled with the BOINC server framework so that project admins won't even need to know about Bittorrent.

Furthermore (as an alternative to using a PHP tracker) it is now possible to use the Bittorrent extension that enables trackerless operation through a structured P2P network based on Kademlia. This completely eliminates the need for any central additions to the BOINC server structure (apart from whatever script produces the .torrent files).

On the client side things have improved as well (as I'll come back to in the implementation details). Instead of rewriting the download system entirely we may simply add Bittorrent support on top of what we already have.

The improved situation

Instead of being limited by the project's outbound bandwidth alone, the download speed will be limited by the project's outbound bandwidth plus the total available outbound bandwidth of the participants (possibly further limited by their configuration). The more participants, the greater the total bandwidth. It is a win-win situation: the project sends less data out of its pipe (i.e. it is more cost efficient) and participants get the files faster (this makes people happier and lowers computation latency when apps are updated).

An application now travels like this:

Not all files need to be downloaded using this scheme, and clients don't have to use their outbound bandwidth if they don't want to. More about this in the next section.

How do we actually make this work?

During the last few weeks I've been working on the serverside stuff:

On the clientside we will need:

Looking at the serverside everything works as usual except that a new directory "bittorrent_download" exists in parallel to the old "download" dir. For each application file bigger than XX KB in "bittorrent_download" a .torrent-generator will generate a .torrent file. The torrents contain the usual Bittorrent information (tracker URL, hashes of the file data etc.) along with the URLs of the file on the project webserver's download location and on any mirrors.
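To make the contents of such a .torrent a bit more concrete, here is a minimal sketch in Python of what a generator could produce. It is not the actual generator script. The function names, the piece size and the example.org URLs are assumptions made up for illustration; the "url-list" key is the one used by the GetRight-style web seed extension.

```python
import hashlib

def bencode(value):
    """Encode ints, strings, lists and dicts using Bittorrent's bencoding."""
    if isinstance(value, int):
        return b"i" + str(value).encode() + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode() + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))

def make_torrent(name, data, tracker_url, web_seeds, piece_length=262144):
    """Build the bytes of a single-file .torrent (hypothetical helper)."""
    # One 20-byte SHA-1 digest per piece, concatenated.
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {
        "name": name,
        "length": len(data),
        "piece length": piece_length,
        "pieces": pieces,
    }
    meta = {
        "announce": tracker_url,   # the project's tracker
        "info": info,
        "url-list": web_seeds,     # HTTP download location plus mirrors
    }
    return bencode(meta)

if __name__ == "__main__":
    torrent = make_torrent(
        "some_big_application.file",
        b"\0" * 1000000,                      # stand-in for the real file data
        "http://example.org/tracker.php",     # hypothetical tracker URL
        ["http://example.org/download/some_big_application.file"],
    )
    print(len(torrent), "bytes of .torrent data")
```

A real generator would read the file from disk in chunks rather than hold it in memory, and would skip files below the size threshold.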

The server also makes a tracker available on the project URL. This tracker only tracks files in "bittorrent_download"; it refuses to track arbitrary files.
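One way such a restriction could work is sketched below in Python. A tracker identifies a torrent by its info hash (the SHA-1 of the bencoded "info" dictionary), so the tracker can keep a list of the info hashes of the torrents the project has generated and reject everything else. The class and method names are made up for illustration and a real tracker speaks HTTP and returns bencoded replies; this only shows the whitelist idea.

```python
import hashlib

def info_hash(bencoded_info):
    """SHA-1 of the bencoded info dictionary, as used in tracker announces."""
    return hashlib.sha1(bencoded_info).digest()

class WhitelistTracker:
    """Toy in-memory tracker that only tracks known torrents (hypothetical)."""

    def __init__(self, allowed_hashes):
        self.allowed = set(allowed_hashes)
        self.peers = {h: set() for h in self.allowed}

    def announce(self, torrent_hash, peer):
        """Register a peer and return the other known peers.

        Returns None for torrents that are not on the whitelist; a real
        tracker would answer with a "failure reason" instead.
        """
        if torrent_hash not in self.allowed:
            return None
        others = sorted(self.peers[torrent_hash] - {peer})
        self.peers[torrent_hash].add(peer)
        return others

if __name__ == "__main__":
    known = info_hash(b"d6:lengthi10e4:name7:app.bine")  # made-up info dict
    tracker = WhitelistTracker([known])
    print(tracker.announce(known, ("10.0.0.1", 6881)))   # first peer sees nobody
    print(tracker.announce(known, ("10.0.0.2", 6881)))   # second peer sees the first
    print(tracker.announce(info_hash(b"something else"), ("10.0.0.3", 6881)))
```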

On the clientside old clients will simply download all the files over HTTP as they do today, unaware that the .torrent files could have helped them.
New and updated clients will look at the list of files they need to download. Whenever a download is about to start they do a lookup of the file:
http://burp.boinc.dk/torrent_generator.php?file=/some_big_application.file

If the torrent-cache contains information about the given file the client can fetch the .torrent before starting the actual download. Whenever a .torrent has been downloaded the Bittorrent library is asked to start downloading the file that it describes.

New clients will behave exactly like the old ones if a project isn't using the Bittorrent strategy to distribute files (i.e. if the project isn't providing .torrents because it hasn't updated to the new server code yet).
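The decision a new client makes for each file could look roughly like the Python sketch below. The lookup URL follows the pattern of the example above; everything else is an assumption: the function names are hypothetical, http_get is assumed to return None when the server has nothing for that URL, bittorrent_get stands in for whatever Bittorrent library the client ends up using, and the "/download/" path is only a placeholder for the project's normal download location.

```python
from urllib.parse import quote

def torrent_lookup_url(project_url, file_name):
    """Build the .torrent lookup URL for a file (pattern from the example)."""
    return project_url.rstrip("/") + "/torrent_generator.php?file=/" + quote(file_name)

def download_file(project_url, file_name, http_get, bittorrent_get):
    """Fetch a file via Bittorrent if a .torrent exists, else via plain HTTP.

    http_get(url) returns the response body, or None if nothing is there.
    bittorrent_get(torrent_bytes) returns the downloaded file contents.
    Both are supplied by the caller, so this sketch does no network I/O.
    """
    torrent = http_get(torrent_lookup_url(project_url, file_name))
    if torrent is not None:
        # The project offers a .torrent: hand it to the Bittorrent library.
        return bittorrent_get(torrent)
    # No .torrent (old server code or small file): behave like an old client.
    return http_get(project_url.rstrip("/") + "/download/" + quote(file_name))

if __name__ == "__main__":
    def no_torrents(url):
        # Pretend to be a project that has not upgraded its server code.
        return None if "torrent_generator.php" in url else b"file via http"

    data = download_file("http://burp.boinc.dk/", "some_big_application.file",
                         no_torrents, lambda t: b"file via bittorrent")
    print(data)
```

Passing the two transfer functions in as arguments keeps the fallback logic separate from the HTTP and Bittorrent code, which also makes it easy to test.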

When the client is done downloading all the files it starts computing, but it also optionally seeds the files for an additional XX mins, using no more than the outgoing speed selected in the host's settings. Also, users should be able to "opt out" of the upload part of the Bittorrent system if they are on a line where they pay per MB of traffic or if their network setup simply doesn't allow P2P traffic. We don't need everyone to upload as long as some people help out - that's the beauty of Bittorrent.
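As a small illustration of how these settings could fit together, here is a Python sketch. The preference names are hypothetical (BOINC has no such settings today) and the seeding time is left as a parameter since the proposal doesn't fix a value for it.

```python
from dataclasses import dataclass

@dataclass
class SeedPrefs:
    """Hypothetical per-host preferences for the upload part."""
    allow_upload: bool       # False means the user has opted out
    max_upload_kbps: float   # outgoing speed limit from the host's settings
    seed_minutes: float      # how long to keep seeding after the download

def seeding_plan(prefs):
    """Return (seconds_to_seed, upload_cap_kbps), or None if not seeding."""
    if not prefs.allow_upload or prefs.seed_minutes <= 0:
        return None
    return (prefs.seed_minutes * 60, prefs.max_upload_kbps)

if __name__ == "__main__":
    print(seeding_plan(SeedPrefs(allow_upload=True, max_upload_kbps=50, seed_minutes=30)))
    print(seeding_plan(SeedPrefs(allow_upload=False, max_upload_kbps=50, seed_minutes=30)))
```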

Other uses

Distributed file distribution may help in more cases than when a project updates its science application. Some projects also have very big files associated with their workunits - or may use a single large input file for a series of workunits. In those cases generating a .torrent for the file would similarly improve throughput.

Conclusion

It is possible to extend BOINC to do distributed file distribution by adding only 3 things: