Wednesday, October 7, 2009

Xus' file storage service and Git

Xus has a 2-pronged approach for integrating its file storage with Git:
  1. the basic file storage will just use ordinary file storage, with an approach similar to Git's, ensuring that Xus' model is "compatible" with Git's
  2. provide a plugin that uses JGit (http://www.eclipse.org/egit/) for people who want to maintain a local Git repository
Each peer can optionally be Git-enabled, since a peer only ever interacts with its own local Git repository. Xus doesn't use Git's protocol to exchange data, because we want Xus to be able to swarm downloads, but a Git-enabled peer could use git clone to seed its cache. Since the file storage protocol doesn't have to interact with Git, Xus just needs #1 to make sure that a peer can fetch everything it needs over the net. Since Git uses SHA-1 hashes as keys for almost every type of object, this shouldn't bloat the protocol too much. Note that when I say "protocol" here, I'm talking about the file storage service protocol, which rides inside of layer 2 messages, not Xus' layer 1 or layer 2 protocols.
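To make "SHA-1 hashes as keys" concrete, here is a minimal sketch of how Git derives the key for a blob (a file's contents). This is plain Python, not Xus or JGit code, and the function name git_blob_id is made up for the example. The header format ("blob", a space, the size, a NUL byte) is Git's own.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The ID of the empty blob, which is the same in every Git repository.
print(git_blob_id(b""))
```

Because the key depends only on the content, two peers that hold the same file will agree on its key without talking to each other.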

The file storage service will break files into chunks and use the DHT to spread the chunks around. This is how we do it right now in Plexus; the directory and chunk models are ours, and Plexus uses PAST (Pastry's DHT file storage service) to store and retrieve the objects (chunks, file chunk lists, and directories). Xus' DHT file service will be similar to PAST's but not entirely the same. The chunk/chunk list/directory model will be similar to Plexus' but it also needs branches, commits, and tags in order to work with Git.
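The chunk / chunk list idea can be sketched in a few lines. This is only an illustration under assumptions: a plain dict stands in for the DHT, the chunk size is tiny so the example is easy to follow, and the names store_file and load_file are hypothetical, not part of Plexus, PAST, or Xus.

```python
import hashlib

CHUNK_SIZE = 4  # assumed value, far smaller than anything practical

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def store_file(dht: dict, data: bytes) -> str:
    """Split data into chunks, store them, and return the chunk list's key."""
    keys = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        key = sha1_hex(chunk)
        dht[key] = chunk  # identical chunks land on the same key
        keys.append(key)
    # The chunk list is itself an object, stored under its own hash.
    chunk_list = "\n".join(keys).encode("ascii")
    list_key = sha1_hex(chunk_list)
    dht[list_key] = chunk_list
    return list_key

def load_file(dht: dict, list_key: str) -> bytes:
    """Look up the chunk list, then fetch and join the chunks in order."""
    keys = dht[list_key].decode("ascii").splitlines()
    return b"".join(dht[key] for key in keys)

dht = {}
key = store_file(dht, b"hello world")
print(load_file(dht, key))
```

A directory would be one more object of the same kind, mapping names to chunk list keys; branches, commits, and tags are left out of this sketch.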

The Git model has been a milestone on Plexus since last year because I think it's crucial to powering an open, collaborative gaming environment. In a completely open environment where anyone can change anything, version history ensures that people can't destroy data; they can only create new versions without some of the data. Signed commits make spoofing very difficult, allow people to rate authors based on their work, etc.

This also has the side effect of making a p2p version of Git :).

I don't think totally reimplementing PAST is a good idea for small clouds, because I believe data copying would kill the network. If the cloud is small, with peers joining and leaving slowly over time but never growing beyond 5 peers, then each time a new peer joins, it gets a copy of all of the data in storage. PAST does have caching and it works very well for large clouds where storage is built up slowly over time, but I don't think it works well for small rings. For this reason, I'm planning on having only the topic space masters replicate automatically (hopefully starting with a Git-cloned cache seed), with "ordinary" peers "filling in the holes" from the topic space master peers on demand. This will slow down initial requests, but amortize the impact on the network.
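The "fill in the holes" behavior is essentially a lazy cache. A rough sketch, assuming a dict stands in for a topic space master's storage and with a made-up Peer class (none of these names come from Xus):

```python
class Peer:
    """An "ordinary" peer that copies objects from a master only when asked."""

    def __init__(self, master_store: dict):
        self.master_store = master_store  # stand-in for a topic space master
        self.cache = {}                   # starts empty: nothing is pushed on join
        self.remote_fetches = 0           # counts trips over the network

    def get(self, key: str) -> bytes:
        if key not in self.cache:
            # First request for this key: slow path, go to the master.
            self.cache[key] = self.master_store[key]
            self.remote_fetches += 1
        return self.cache[key]

master = {"k1": b"chunk one", "k2": b"chunk two"}
peer = Peer(master)
peer.get("k1")
peer.get("k1")  # served from the local cache this time
print(peer.remote_fetches, sorted(peer.cache))
```

Joining costs nothing up front, the first read of each object pays for the transfer, and objects the peer never asks for are never copied.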

In practice, I think JGit integration will probably be the default, but I still want to provide it as a plugin as a courtesy to projects so they don't have to link in yet another third party library if they don't require it. Also, I want non-Git-enabled peers to be able to work with Git-enabled ones seamlessly. At least this is what seems like a good idea at this time :).

The file storage service implements a form of swarming download, similar in spirit to BitTorrent, but not similar in the way it actually functions; you can simultaneously download chunks of a file from different peers, and several peers will have each chunk, but it's organized quite differently.
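Here is a toy version of pulling one file's chunks from several peers. It is a sketch under assumptions: peers are plain dicts, chunks are keyed by the SHA-1 of their bytes, requests are spread round-robin, and the loop runs sequentially where a real swarm would fetch in parallel. The function swarm_download is hypothetical.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def swarm_download(chunk_keys, peers):
    """Fetch each chunk from one of the peers that holds it.

    Returns the reassembled bytes and the index of the peer used per chunk.
    """
    parts, sources = [], []
    for i, key in enumerate(chunk_keys):
        holders = [n for n, store in enumerate(peers) if key in store]
        if not holders:
            raise KeyError(f"no peer holds chunk {key}")
        chosen = holders[i % len(holders)]  # rotate so the load is spread out
        chunk = peers[chosen][key]
        if sha1_hex(chunk) != key:
            # Content-addressed keys let us verify a chunk from any peer.
            raise ValueError(f"peer {chosen} sent a corrupt chunk")
        parts.append(chunk)
        sources.append(chosen)
    return b"".join(parts), sources

chunks = [b"aa", b"bb", b"cc"]
keys = [sha1_hex(c) for c in chunks]
peers = [
    {keys[0]: chunks[0], keys[1]: chunks[1]},
    {keys[1]: chunks[1], keys[2]: chunks[2]},
    {keys[0]: chunks[0], keys[2]: chunks[2]},
]
data, sources = swarm_download(keys, peers)
print(data, sources)
```

Since the key is the hash of the content, it doesn't matter which peer a chunk came from; a bad copy is detected and could simply be requested from another holder.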

Also :) I'm thinking that it would be a good idea to give each directory (with all of its versions) a separate "view" in Git; i.e. new directories are branched from HEAD in Git (like when you make static web pages for GitHub: http://pages.github.com/). This will give each directory an independent set of branches, allowing people to store many different directories in a single Git repository without having to check out a ton of stuff just to get the one directory they want, AND it also allows them to reuse files from other directories, because they're all in one Git repository.
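The reuse part falls out of content addressing: if every directory's branches point into one shared object store, a file that appears in two directories is stored once. A small model of that, with made-up names (objects, refs, commit_dir) and a "directory/branch" ref naming scheme that is purely an assumption for the example, not Git's or Xus' actual layout:

```python
import hashlib

objects = {}  # one shared object store for the whole repository
refs = {}     # "directory/branch" -> {file name: blob key}

def put_blob(data: bytes) -> str:
    key = hashlib.sha1(data).hexdigest()
    objects[key] = data  # storing the same content twice is a no-op
    return key

def commit_dir(directory: str, branch: str, files: dict) -> None:
    """Record a snapshot of one directory on its own branch."""
    refs[f"{directory}/{branch}"] = {
        name: put_blob(data) for name, data in files.items()
    }

commit_dir("maps", "master", {"tiles.png": b"TILES", "map1.txt": b"MAP"})
commit_dir("mods", "master", {"tiles.png": b"TILES"})  # reuses the same blob

print(len(objects), sorted(refs))
```

Fetching "mods/master" only needs the objects its snapshot names, so a peer can get one directory without pulling the others.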
