Using BitTorrent for initial sync

patmast3r · September 3, 2015, 3:01pm

So the idea to use BitTorrent to distribute the blockchain isn't exactly brand new. I'm not sure whether any crypto project has actually built it into its client yet, though, which is what I'm proposing.
I'm suggesting we use the BitTorrent protocol for the initial sync, i.e. from nemesis to whatever height the chain is at when a node joins.

Sorry if this reads like confused rambling; I'm just trying to write it down to get it out of my head so everyone can start pitching in.

BitTorrent is intended to speed up sharing of huge files by splitting them into smaller pieces and downloading those pieces from multiple peers simultaneously. In our case that would be like downloading blocks 0-400 from node A and 400-800 from node B and so on (this is just to illustrate what it does, not how it would really work).
The nice thing is that once a peer has downloaded one piece of the file, it can immediately share that piece and help distribute the file even though it hasn't finished downloading it.
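
To make the piece idea concrete, here is a minimal Python sketch. It is only an illustration: the piece size, the stand-in data and the function names are all made up for this example, and it ignores networking entirely.

```python
def split_into_pieces(data: bytes, piece_size: int) -> list[bytes]:
    """Cut the data into fixed-size pieces; the last piece may be shorter."""
    if piece_size <= 0:
        raise ValueError("piece_size must be positive")
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def reassemble(received: dict[int, bytes]) -> bytes:
    """Join pieces by index, regardless of the order they arrived in."""
    return b"".join(received[i] for i in sorted(received))


# Stand-in for a db snapshot (hypothetical data, 2560 bytes).
snapshot = bytes(range(256)) * 10
pieces = split_into_pieces(snapshot, piece_size=1024)

# Pretend the pieces arrive out of order from three different peers.
received = {}
for index in (2, 0, 1):
    received[index] = pieces[index]

restored = reassemble(received)
```

Because every piece carries its index, the order of arrival doesn't matter, which is what lets a client pull different pieces from different peers at the same time.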

So this sounds pretty cool, but there are some implementation details that we need to iron out.

Trackers or DHT?
AFAIK there are two ways peers in BitTorrent find out who has a file (and its pieces).
Either a tracker, which is basically a centralized service that keeps track of the peers, tells them, or they look it up in a DHT (Distributed Hash Table) that is maintained by all peers.
Trackers would probably be WAY easier to implement, but the setup wouldn't be completely decentralized.

How to create the file to seed?
So what we would want to seed is the db file where every client keeps the blockchain. Since the file has to be an exact match everywhere, we can't just make a copy of the current db and seed it, because there could be microforks that are yet to be resolved, or nodes could be at different heights.
So we would need to determine WHEN trackers create db snapshots and what should be included.

How to tell peers about new snapshots?
When you download torrents you usually either use .torrent files that include all the information, or magnet links. I think magnet links require DHT, so if we're going for trackers we'd prob want to distribute .torrent files, or rather all the information usually included in those. How?

This is as far as I've gotten with a possible concept.

Well-known peers are used as trackers. Technically we only need one, but I'm not sure how much load acting as a tracker puts on a node, so having some more trackers is prob a good idea.

Every x blocks the trackers make a snapshot of their DB and start seeding it (so they are both trackers and seeders, because someone has to start seeding). It's important that the files they create match 100%, so the snapshot should prob be the db at (current height - rewrite limit), where blocks can no longer be rolled back.
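
One possible way to pick the snapshot height so that every node lands on the same one, sketched in Python. Both constants are assumptions for illustration: the interval stands in for "x", and the rewrite limit value should be taken from the actual NIS constants rather than from this example.

```python
from typing import Optional

REWRITE_LIMIT = 360         # assumed value, check against the real NIS constant
SNAPSHOT_INTERVAL = 10_000  # hypothetical "x" from the proposal


def snapshot_height(current_height: int,
                    interval: int = SNAPSHOT_INTERVAL,
                    rewrite_limit: int = REWRITE_LIMIT) -> Optional[int]:
    """Return the newest snapshot height that is already past the rewrite limit.

    Heights are rounded down to a multiple of `interval`, so two nodes at
    slightly different heights still agree on which snapshot to build.
    Returns None while the chain is too short for any snapshot.
    """
    safe_height = current_height - rewrite_limit
    if safe_height < interval:
        return None
    return (safe_height // interval) * interval
```

With these numbers, a node at height 25000 would seed the snapshot for height 20000, and so would a node that is a few blocks behind or ahead.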

So now people need to be told about those torrents that are waiting to be downloaded.
I'm pretty sure the information needed doesn't fit in a message, so just storing it on the blockchain isn't possible.
I guess peers could somehow push that information over the wire, but I'm not sure whether that would open a DDoS attack vector.
Maybe peers could indicate whether or not they are trackers, and other peers could then ask them for the required information.
This is really something the core devs should come up with.

/edit: I just thought about this. I think in our use case most of the information that is usually in the .torrent files will stay the same. So we could save just the information that does change in the blockchain to distribute it. With it stored in the blockchain, every node would know about it, and new nodes could ask any node for the information needed to start downloading the blockchain via BitTorrent. That way they don't need to know about trackers or anything like that in advance. Nodes that already have the chain will know about it and can tell them.

Once that info is known to the client, it can download the db, place it in the right directory, fire up NIS as usual and start syncing from there. Downloading the db should be way faster since it comes from multiple locations at once instead of from just one node.

I'm sure there are more details to hash out so let's do some brainstorming and maybe we can come up with a basic concept of how this could/should work.

rigel · September 3, 2015, 3:01pm

I don't think we need to reinvent everything from scratch: NIS nodes already know how to find each other and how to exchange chunks of blocks.

The only problem that stops us from downloading chunks in parallel is that every block needs all previous blocks in order to be verified.

The solution could be developing a temporary storage that holds all unverified blocks before they are moved into the blockchain.
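
A rough Python sketch of such a temporary storage, to show the idea only. The `Block` layout, the SHA-256 linking and all names here are invented for the example and are not how NIS stores or verifies blocks; real verification also involves signatures and transactions, which are left out.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    height: int
    prev_hash: bytes
    payload: bytes

    def block_hash(self) -> bytes:
        header = self.height.to_bytes(8, "big") + self.prev_hash
        return hashlib.sha256(header + self.payload).digest()


class StagingBuffer:
    """Holds chunks that arrived early until the chunks before them are committed."""

    def __init__(self, tip_height: int, tip_hash: bytes) -> None:
        self.tip_height = tip_height
        self.tip_hash = tip_hash
        self.pending: dict[int, list[Block]] = {}  # first height -> chunk
        self.chain: list[Block] = []               # verified blocks, in order

    def add_chunk(self, chunk: list[Block]) -> int:
        """Stage a chunk, commit everything that is now contiguous, return the count."""
        if chunk:
            self.pending[chunk[0].height] = chunk
        committed = 0
        while self.tip_height + 1 in self.pending:
            candidate = self.pending.pop(self.tip_height + 1)
            if not self._links_up(candidate):
                break  # bad chunk is dropped; a real client would re-request it
            self.chain.extend(candidate)
            self.tip_height = candidate[-1].height
            self.tip_hash = candidate[-1].block_hash()
            committed += len(candidate)
        return committed

    def _links_up(self, chunk: list[Block]) -> bool:
        """Check heights and prev_hash links against the current tip."""
        height, prev = self.tip_height, self.tip_hash
        for block in chunk:
            if block.height != height + 1 or block.prev_hash != prev:
                return False
            height, prev = block.height, block.block_hash()
        return True


def make_chain(count: int, genesis_hash: bytes) -> list[Block]:
    """Build a small linked test chain starting at height 1."""
    blocks, prev = [], genesis_hash
    for height in range(1, count + 1):
        block = Block(height, prev, f"block {height}".encode())
        blocks.append(block)
        prev = block.block_hash()
    return blocks


GENESIS = b"\x00" * 32
blocks = make_chain(6, GENESIS)
buffer = StagingBuffer(tip_height=0, tip_hash=GENESIS)

first = buffer.add_chunk(blocks[4:6])   # heights 5-6 arrive first, must wait
second = buffer.add_chunk(blocks[2:4])  # heights 3-4, still waiting for 1-2
third = buffer.add_chunk(blocks[0:2])   # heights 1-2 unblock everything
```

Nothing is committed until the chunk that connects to the current tip shows up, at which point all staged chunks behind it are verified and committed in one go.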

Am I missing something?

mixmaster · September 3, 2015, 3:01pm

That is exactly what I was thinking and I suggested this months ago. But I guess there is something we don't see, which makes this complicated to implement.

patmast3r · September 3, 2015, 3:01pm

Blocks are downloaded in chunks of (I think) 400 right now, just not simultaneously from multiple nodes.
The beauty of the BitTorrent solution is that the devs don't need to come up with a solution to the problem of where to download which chunks. With a home-grown approach, peers will have different download speeds, so you'd need to come up with a protocol to sort out the issues that result from that, like when to verify what and how to put the data into the db when it arrives out of order.
So what I'm proposing is using a well-established protocol to download a big amount of data in chunks from multiple sources at once, instead of coming up with a new one. I think using BitTorrent would also provide some integrity checks, since files are AFAIK shared via their hashes, so you can't just provide a bogus db. When you just request chunks from multiple nodes, one of them could provide bs data and you might only find out once you put the chunks together, at which point you'd have to start over. Not sure about this though.
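
For reference, a small Python sketch of the kind of check meant here: BitTorrent metainfo carries a SHA-1 hash per piece, so a single bad piece can be rejected and re-requested on arrival. The piece size and data below are made up. Note that this only proves a piece matches the hash list; it says nothing about whether the hash list itself describes a valid chain.

```python
import hashlib

PIECE_SIZE = 1024  # hypothetical piece size for the example


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> list[bytes]:
    """SHA-1 digest of every piece, as a .torrent file would list them."""
    return [hashlib.sha1(data[i:i + piece_size]).digest()
            for i in range(0, len(data), piece_size)]


def verify_piece(index: int, piece: bytes, expected: list[bytes]) -> bool:
    """Accept a downloaded piece only if it matches the published hash."""
    return 0 <= index < len(expected) and hashlib.sha1(piece).digest() == expected[index]


snapshot = b"nem-db-snapshot" * 200  # stand-in data, 3000 bytes
expected = piece_hashes(snapshot)    # would come from trusted metainfo

good_piece = snapshot[1024:2048]     # what an honest peer sends for piece 1
bad_piece = b"\x00" * 1024           # what a misbehaving peer might send

accepted = verify_piece(1, good_piece, expected)
rejected = not verify_piece(1, bad_piece, expected)
```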

If there is in fact an easier and/or cleaner way to do it I'm all for it. My intention was merely to get a discussion going :)

/edit: Thinking about it, we could just let everyone who wants to seed do so, and reflect that in the output of /node/info or make it another state in /node/peer-list (so it would be active, inactive, seeding or whatever).

So every seeding node would create the snapshot and start seeding.
Every node that joins the network can get a list of seeding nodes and just start downloading.

That's probably not strictly according to the protocol, but if it's possible it would still be nice, I think.

mixmaster · September 3, 2015, 3:01pm

You are right, pat. So the plan is to integrate BitTorrent protocol features into NEM nodes? I first thought this was being suggested as a service on top. If it is integrated and you could e.g. just switch "thisNisIsSeeder = false/true" in NIS config.properties, that would be awesome.

patmast3r · September 3, 2015, 3:01pm

That was the idea, yes.
I have read more about BitTorrent though, and I think I was wrong. I don't think it prevents anyone from seeding a bullshit chain. I'm also not sure how this would be handled by the protocol.
Maybe the idea isn't all that bright after all.

I guess I'll have to dig through the spec to find out how exactly bullshit seeds would be handled.

gimre · September 3, 2015, 3:01pm

The solution could be developing a temporary storage that holds all unverified blocks before they are moved into the blockchain.

That is exactly what I was thinking about.

patmast3r · September 3, 2015, 3:01pm

I think that's what any solution would do, since you're downloading in chunks. Until you have all chunks you'll have to store the rest somewhere. IMHO that's not a solution but only a basic property that any solution needs to provide.
I don't see how that solves the problem of nodes providing bogus chains to screw up the syncing process. Maybe Eigentrust++ solves that already? I mean, there's nothing stopping nodes from providing bs data right now, even without multiple sources.

Care to elaborate, gimre? :)

To clarify: the problem with nodes providing bs data, for me, isn't that I'm afraid they'll take over the network or anything. Of course every node can verify the data. However, it can be a major annoyance if you have to re-download.

/Edit:
Could we have a harvester produce the Merkle hash of the current chain and store that in the blockchain? Then every node should be able to verify chunks, right?
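
A Python sketch of how that could look, purely as an illustration. It assumes the chain is cut into chunks, that a Merkle root over those chunks is what ends up on the blockchain, and that whoever serves a chunk also sends the sibling hashes (the proof). None of this exists in NIS; the hashing scheme and names are invented for the example.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(chunks: list[bytes]) -> list[list[bytes]]:
    """All tree levels, from leaf hashes up to the single root hash."""
    if not chunks:
        raise ValueError("need at least one chunk")
    level = [_h(chunk) for chunk in chunks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate the last node on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes for one chunk, each with a flag 'sibling is on the left'."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_chunk(chunk: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path to the root from a single chunk and its proof."""
    node = _h(chunk)
    for sibling, sibling_is_left in proof:
        node = _h(sibling + node) if sibling_is_left else _h(node + sibling)
    return node == root


chunks = [f"chunk-{i}".encode() for i in range(5)]  # stand-ins for block chunks
levels = merkle_levels(chunks)
root = levels[-1][0]                                # the value stored on chain

proof_for_3 = merkle_proof(levels, 3)
ok = verify_chunk(chunks[3], proof_for_3, root)
tampered_ok = verify_chunk(b"bogus chunk", proof_for_3, root)
```

The point is that a node only needs the root from the chain plus a handful of sibling hashes to check one chunk on its own, so a bogus chunk is caught as soon as it arrives instead of after the whole download.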

jabo38 · September 3, 2015, 3:01pm

If we are sending people to BitTorrent, why not just have them download from our own centralized server? It's much faster and we don't have to worry about people uploading bogus chains. We'd still verify the chain with a hash.
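
Verifying a downloaded db file against a published hash could be as simple as the following Python sketch. SHA-256 and the function names are assumptions here; where the expected hash gets published is left open.

```python
import hashlib


def file_sha256(path: str, buf_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB reads so a large db never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(buf_size), b""):
            digest.update(block)
    return digest.hexdigest()


def snapshot_is_valid(path: str, expected_hex: str) -> bool:
    """Compare the downloaded file's hash with the published one."""
    return file_sha256(path) == expected_hex.strip().lower()
```

A client would call `snapshot_is_valid` on the downloaded file before placing it in the NIS data directory, and discard the download if it returns False.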