The Myth of P2P File Storage

I've seen a lot of talk about P2P file storage. This is a long, old conversation that's been happening for literally decades, even before the likes of KaZaA graced the computers of the world. Ultimately, the proposition is the following:

I'll reserve a chunk of my hard disk space to store the files of others in exchange for storage space in a P2P network.

Unfortunately, there's simply no way that this can ever reliably work. Let's examine why.

The Storage Exchange

Let's say there are three people on a P2P network, and the network allows each person to store 1MB of content. That means the other two members of the network need to hold at least one megabyte between them to keep the network alive. To provide redundancy in this case, however, both of the other members will need to keep a copy of each file.

Now, if user B is offline and user C needs his 1MB, user A can provide a copy for him.

Having a fixed amount of storage, however, isn't reasonable. Let's say user A now wants to upload his entire MP3 collection to the network. He has 1GB of MP3s. User B and user C now each need 1GB of capacity to store those files.

In a larger network, this is less of a problem because you can distribute those files around much more easily. One user does not need to store the entirety of a blob of files; it can be spread evenly over the network. If there are 500 users in the network and we want to maintain a more reasonable level of redundancy (say, four copies of each file), we need to span 4GB over 499 users: each user needs to store an average of about 8MB.
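
To make the arithmetic concrete, here's a quick back-of-the-envelope sketch in Python (the function name and parameters are mine, not part of any real protocol):

    def per_user_burden(total_data_mb, replication, network_size):
        """Average storage each peer must contribute so that every file
        uploaded to the network exists in `replication` copies."""
        # Total data that has to live somewhere on the network.
        replicated_total = total_data_mb * replication
        # The uploader keeps his own copy, so the burden spreads over
        # the remaining peers.
        return replicated_total / (network_size - 1)

    # The example above: 1GB of MP3s, four copies, 500 peers.
    print(per_user_burden(1024, 4, 500))  # ~8.2MB per peer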

Leeches

As with any P2P network, a user can take advantage of the resources of the network without contributing resources back. These users are known as leeches. In our hypothetical storage network, a leech is a user that deposits lots of files, but does not provide any storage capacity to the network.

How can this be prevented? It turns out, this is a hard problem.

The first approach is straightforward on paper: to deposit files, you need to store files. But this runs into a number of obvious questions:

  1. How do you verify that a user is actually storing the files they claim to be storing?
  2. How do you keep track of who is storing what, and how much?

In a traditional P2P network like Gnutella or KaZaA, there is no central directory: when you search for a piece of content, the network does not know in advance whether any of its nodes is hosting a copy; it has to search. A "blockchain" approach can be taken to solve the second question: a master file listing can be kept as a chain of transactions, where the name of the "sender" is recorded with the name of the "receiver". The sender then sends the actual file to the receiver, and a record is available of who stores how much. But that opens up even more questions!
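
As a sketch of what one entry in such a ledger might contain (the field names here are hypothetical, not taken from any real system):

    import hashlib
    from dataclasses import dataclass

    @dataclass
    class StorageTransaction:
        """One entry in the hypothetical storage ledger: a public record
        that `receiver` has agreed to hold a file for `sender`."""
        sender: str      # identity depositing the file
        receiver: str    # identity agreeing to store it
        file_hash: str   # content hash identifying the file
        size_bytes: int  # capacity the receiver is now credited with
        prev_hash: str   # hash of the previous entry, forming the chain

        def entry_hash(self) -> str:
            payload = (self.sender + self.receiver + self.file_hash
                       + str(self.size_bytes) + self.prev_hash)
            return hashlib.sha256(payload.encode()).hexdigest()

Summing size_bytes per receiver yields the "who stores how much" record, but notice that every single file deposit adds an entry, which is why such a chain balloons (see lesson 3 below).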

The first question is very hard. You can use some clever crypto: the sender can send a salt and ask the receiver to hash the salted file, proving that the receiver still holds a copy. But there's nothing stopping the receiver from simply dropping the file as soon as the hash is sent back. The sender can check periodically, but there's no reason the receiver can't just change their identity and never appear on the network for verification again. The sender could have a "supernode" on the network perform the periodic checks, but then the user needs to trust that the supernode never gets hacked and leaks the crypto keys (or colludes with the bad receiver); otherwise anyone holding the keys could respond correctly to the challenge.
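
A minimal sketch of that salted-hash challenge (the protocol framing is assumed; only the hashing is standard):

    import hashlib
    import os

    def make_challenge(file_bytes: bytes):
        """Sender side: pick a fresh salt and precompute the expected answer."""
        salt = os.urandom(16)
        expected = hashlib.sha256(salt + file_bytes).hexdigest()
        return salt, expected

    def answer_challenge(salt: bytes, stored_bytes: bytes) -> str:
        """Receiver side: only computable if the full file is still held."""
        return hashlib.sha256(salt + stored_bytes).hexdigest()

    # This proves the file existed on the receiver at this moment...
    salt, expected = make_challenge(b"the stored file")
    assert answer_challenge(salt, b"the stored file") == expected
    # ...but nothing stops the receiver from deleting the file right after,
    # or from never answering a challenge under this identity again.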

The receiver disappearing is a big problem. You can make entries in the blockchain expire if they're not validated periodically, but then you need to deal with those transactions going away: when one link in the chain becomes invalid, the receiver's storage credit needs to be revoked to account for the capacity they haven't proven they're providing. That will potentially trigger some files in the network to be destroyed. But which ones? And what if the user was just on vacation or something, and now their files are slowly dropping out of the network because they haven't been online for a while?
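
A sketch of the bookkeeping this expiry policy implies (the intervals and entry format are invented for illustration):

    import time

    VALIDATION_INTERVAL = 24 * 60 * 60       # hypothetical: one challenge per day
    GRACE_PERIOD = 3 * VALIDATION_INTERVAL   # hypothetical: three missed checks

    def expire_unvalidated(entries, now=None):
        """Split ledger entries into live ones and ones whose receiver
        hasn't answered a challenge within the grace period."""
        now = now if now is not None else time.time()
        live, revoked = [], []
        for entry in entries:
            if now - entry["last_validated"] > GRACE_PERIOD:
                # The receiver's credit shrinks, and the sender's file
                # has silently lost one of its copies.
                revoked.append(entry)
            else:
                live.append(entry)
        return live, revoked

The loop is trivial; the fallout isn't. Deciding which of the revoked user's own files to delete, and distinguishing a vanished node from one whose owner is merely on vacation, is the hard part.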

Fair Shares

Let's say each user in a 3-person network wants to store 1GB. Each user needs to store at least 2GB of content to ensure that there is suitable redundancy (one gigabyte from each other user). In a network of 5 people, to meet the four-copies requirement from the example above, each user would need to store 4GB. This is a hard requirement: it is foolish to assume that (reliable) users with large amounts of available storage capacity would join the network and generously donate vast amounts of their space.

Google Drive offers 15GB of free storage. To allow each user to store 15GB on our hypothetical network, they would need to store at least 60GB of other users' files. That's not a small ask.
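
Note that the exchange rate is just the replication factor. A sketch (the names are mine):

    def local_burden_gb(network_quota_gb: float, replication: int) -> float:
        """Gigabytes of other users' data you must hold to earn a quota,
        assuming every file must exist in `replication` copies."""
        return network_quota_gb * replication

    print(local_burden_gb(1, 2))   # 3-person network, two copies: 2GB
    print(local_burden_gb(1, 4))   # four-copies requirement: 4GB
    print(local_burden_gb(15, 4))  # matching Google Drive's 15GB: 60GB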

You could make the proposition that the more you offer to store for the network, the more storage you get; this works out fine for users who only store a little. But it does not scale as the network grows: the replication factor fixes the exchange rate, so every gigabyte stored online still costs several gigabytes of local space no matter how many users join.

It is also important to consider a user's maximum storage. If a user must hold more content locally in order to store more on the network, there is a finite amount they can store online before their commitment exceeds the slice of their hard disk reserved for other users' content. How do you cap a user's online storage so that they never take on the burden of storing more files than they are physically able to hold?

A user could, for instance, lie about their local storage capacity. The network would allow them to begin storing files online, but as requests to store content on their machine came in, those requests would start failing with out-of-space errors once the content exceeded the client's actual capacity. And it is not reasonable to ask a user to download other users' files before they are allowed to store their own.
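
There is no trustworthy way to ask the question; the client reports whatever it likes. A sketch of the honest and dishonest cases (the reservation policy is made up):

    import shutil

    def report_capacity(honest: bool = True) -> int:
        """Bytes a client claims to reserve for other users' files."""
        if honest:
            # An honest client reports what its disk can actually spare,
            # say half of the currently free space.
            return shutil.disk_usage("/").free // 2
        # A dishonest client just makes a number up. The network can't
        # tell the difference until writes start failing.
        return 10 * 1024**4  # "I have 10TB to spare, I promise"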

Lessons

  1. It's virtually impossible to ensure that a user is not leeching.
  2. Any sort of leech detection will have false positives, which will negatively impact legitimate users.
  3. Any sort of central tracking mechanism for file storage (like a blockchain) will quickly grow absurdly large, much faster than the Bitcoin blockchain.
  4. Users that store more are not necessarily more reliable. Providing more capacity to the network is essentially useless unless the node can provide redundant capacity, which, for almost every conceivable user, is not reasonable.
  5. There's no way to stop a bad actor from connecting to the network multiple times and manipulating the distribution of files in his favor.
  6. There is no way to stop a user from overcommitting their local storage capacity.

Practicality

True P2P networks have never been very user friendly. Especially on mobile, any loss of connectivity can cause a non-trivial interruption of service. Unlike a centralized system like the web (where the destination server is fixed at a single address), a P2P network suffers from significant disadvantages:

As the network grows and scales, maintaining the location of each and every resource on the network becomes increasingly difficult, especially with many unreliable nodes. If a centralized, monolithic storage system like a cloud storage provider can take many hundreds of milliseconds to find and begin transmitting a resource, it is optimistic to expect a distributed network of unreliable nodes to do the same in anything close to a competitive amount of time.

Security

One last consideration for such a network is its susceptibility to attack. DDoSing a client on this network would not be hard at all: simply flooding a client with download requests could saturate its entire outbound internet connection. This could be used to force a user off the network. If the client queues download requests, a DDoS can prevent nodes that have resources stored on the victim machine from accessing their content: simply filling up the queues means that few legitimate download requests get through.

Such an attack could be used to block a particular user from accessing their content. If a user's public identity on the network is traced to their personal identity, it could be possible for an attacker to extort the user: a ransom could be demanded for access to the user's own files, for instance, by overwhelming the hosts that store the user's files.

Closing Thoughts

I think a global P2P storage network would be cool, and some people have even tried it in various incarnations. It is not, however, a concept that will scale. Users will quickly grow tired of its unreliable and slow behavior, and of the frustrations of managing content between different devices. The death-blow, I believe, is that a user could never get "extra storage": requiring users to hold more of other people's content than they can store on the network makes it useful only for accessing shared content from other users and for keeping backups of very important files. And because of the network's latency, the system could not be used to stream media, eliminating yet another use case.

If anyone ever manages to solve the problems presented above, I'd love to hear your solutions.