The Myth of P2P File Storage
I've seen a lot of talk about P2P file storage. This is a long, old conversation that's been happening for literally decades, even before the likes of KaZaA graced the computers of the world. Ultimately, the proposition is the following:
I'll reserve a chunk of my hard disk space to store the files of others in exchange for storage space in a P2P network.
Unfortunately, there's simply no way that this can ever reliably work. Let's examine why.
The Storage Exchange
Let's say there are three people on a P2P network. The network allows each person to store 1MB of content. That means that the other two members of the network need to store at least one megabyte each to keep the network alive. To provide redundancy in this case, however, both of the other two members will need to keep a copy of each file.
- User A: 1MB from User B, 1MB from User C
- User B: 1MB from User A, 1MB from User C
- User C: 1MB from User A, 1MB from User B
Now, if user B is offline and user C needs his 1MB, user A can provide a copy for him.
Having a fixed amount of storage, however, isn't reasonable. Let's say User A now wants to upload his entire MP3 collection to the network. He has 1GB of MP3s. User B and user C now each need 1GB of capacity to store those files.
In a larger network, this is less of a problem because you can distribute those files around much more easily. One user does not need to store the entirety of a blob of files; it can be evenly spread over the network. If there are 500 users in the network and we want to maintain a more reasonable level of redundancy (say, four copies of each file), we need to span 4GB over 499 users: each user needs to store an average of about 8MB.
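That back-of-the-envelope math is easy to check. Here's a tiny sketch (the function name and parameters are illustrative, not part of any real protocol):

```python
def per_user_burden(total_data_mb, num_users, copies):
    """Average storage each user must contribute when `copies` replicas
    of `total_data_mb` are spread over every user except the uploader."""
    # The uploader's own node is excluded, so the replicated data
    # lands on the remaining num_users - 1 peers.
    return total_data_mb * copies / (num_users - 1)

# 1GB of MP3s, 500 users, four copies of each file:
print(round(per_user_burden(1024, 500, 4), 1))  # about 8.2 MB per user
```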
As with any P2P network, a user can take advantage of the resources of the network without contributing resources back. These users are known as leeches. In our hypothetical storage network, a leech is a user that deposits lots of files, but does not provide any storage capacity to the network.
How can this be prevented? It turns out, this is a hard problem.
The first approach is straightforward on paper: to deposit files, you need to store files. But this runs into a number of obvious questions:
- How do you verify that a user has stored files?
- How do you count how many files the user has stored?
In a traditional P2P network like Gnutella or KaZaA, there is no central directory. When you search for a piece of content, the network does not know in advance whether one of its nodes is hosting a copy of that content or not: it needs to search. A "blockchain" approach can be taken, solving the second question: a master file listing can be added to a chain of transactions, where the name of the "sender" is recorded with the name of the "receiver". The sender then sends the actual file to the receiver. A record is then available of who stores how much. But that opens up even more questions!
- How do we know that the receiver didn't just throw the file away instead of storing it?
- Every user's identity is tied to each file, making the system a privacy hazard. Unlike Bitcoin, you can't use a separate address for each coin you own, because instead of the coin being the resource, you are the resource. The network needs something to tie the file back to you.
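Before tackling those questions, the ledger itself is at least easy to sketch. This is a minimal, hypothetical picture of the record-keeping described above; all names and fields are invented for illustration:

```python
from collections import defaultdict

# Hypothetical ledger entry: sender asks receiver to store some bytes.
ledger = [
    {"sender": "A", "receiver": "B", "file": "song.mp3", "size": 1_000_000},
    {"sender": "A", "receiver": "C", "file": "song.mp3", "size": 1_000_000},
    {"sender": "B", "receiver": "A", "file": "doc.pdf",  "size": 500_000},
]

def stored_by(ledger):
    """Total bytes each node *claims* to be storing for others."""
    totals = defaultdict(int)
    for entry in ledger:
        totals[entry["receiver"]] += entry["size"]
    return dict(totals)

print(stored_by(ledger))  # {'B': 1000000, 'C': 1000000, 'A': 500000}
```

Note that the ledger only records claims; nothing in it proves that the bytes are actually on disk, which is exactly the problem discussed next.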
The first question is very hard. You can use some clever crypto: the sender can send a salt and ask the receiver to hash the salted file to verify that it was received, but there's nothing stopping the receiver from simply dropping the file as soon as the hash is sent back. The sender can check periodically, but there's no reason the receiver can't just change their identity and never appear on the network for verification again. The sender could have a "supernode" on the network perform the periodic checks, but then the user needs to trust that the supernode never gets hacked and leaks the crypto keys (or works in conjunction with the bad receiver); otherwise, anyone could properly respond to the requests for proof.
The receiver disappearing is a big problem. You can make entries in the blockchain expire if they're not validated periodically. Then you need to deal with those transactions going away: when one link in the chain becomes invalid, a user's storage will need to be revoked to compensate for the capacity that they haven't proven that they are providing. That will potentially trigger some files in the network to be destroyed. But which ones? And what if the user was just on vacation or something, and now their files are slowly dropping out of the network because they haven't been online for a while?
Let's say each user in a network wants to store 1GB in a network of 3 people. Each user needs to store at least 2GB of content to ensure that there is suitable redundancy (one gigabyte from each other user). In a network of 5 people, to meet the four-copies requirement from the example above, each user would need to store 4GB. This is a hard requirement: it is foolish to assume that (reliable) users with large amounts of available storage capacity would join the network and generously donate vast amounts of their space.
Google Drive offers 15GB of free storage. To allow each user to store 15GB on our hypothetical network, they would need to store at least 60GB of other users' files. That's not a small ask.
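The 60GB figure falls out of the four-copy redundancy level used in the earlier example. A trivial sketch of that relationship (function name invented for illustration):

```python
def required_contribution(quota_gb, replicas=4):
    """Local space a user must give the network to store `quota_gb` of
    their own data, assuming `replicas` copies of every file survive.
    The default of 4 matches the four-copy example above."""
    return quota_gb * replicas

print(required_contribution(15))   # 60 GB, to match Google Drive's free tier
print(required_contribution(100))  # 400 GB
```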
You could make the proposition that the more you offer to store for the network, the more storage you get. This benefits users who don't store very much. But this does not scale as the network grows:
- If a sender wants to store more files on the network than the network has capacity for (even if the user legitimately provides his or her fair share of capacity), they are unable to do so.
- The capacity that a sender must provide in order to keep the network healthy is a multiple of what they store, and the multiplier is the redundancy factor. For the 15GB example above (four copies), a user must store 60GB. If the user wanted to store 100GB, they would need to store 400GB. A network with higher availability would require greater redundancy, pushing that multiplier even higher.
- As a user provides more and more storage to the network, the network becomes less distributed because more capacity is focused on that single node. If one node provides more storage than the others, it will be more likely that that node will be the receiver for any given storage operation. The user becomes a larger point of failure.
- Bad actors could create multiple connections to the network and request to store the same file multiple times. I.e.: They pretend to be two (or three, or four) different receivers and pretend to store the same file on behalf of a single user under each identity. Behind the scenes, the node would simply keep one copy of a file (storing 1/2, 1/3, or 1/4 of what they should), causing decreased redundancy in the network.
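The multiple-identity trick in the last bullet is simple to model. A toy sketch of a dishonest node that deduplicates by content hash behind several fake identities (all names are invented):

```python
import hashlib

class SybilNode:
    """One physical node masquerading as several network identities.
    It reports one copy per identity but keeps a single deduplicated blob."""

    def __init__(self, identities):
        self.identities = identities
        self.blobs = {}  # content hash -> bytes, stored once

    def store(self, file_bytes):
        key = hashlib.sha256(file_bytes).hexdigest()
        self.blobs[key] = file_bytes  # physically stored exactly once
        # ...but every identity claims to hold a copy:
        return {ident: key for ident in self.identities}

node = SybilNode(["peer-1", "peer-2", "peer-3"])
claims = node.store(b"user A's file")
print(len(claims), "claimed copies,", len(node.blobs), "actually stored")
# prints: 3 claimed copies, 1 actually stored
```

The network believes it has three replicas when it has one, so the real redundancy is silently a third of what the ledger says.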
It is also important to consider a user's maximum storage. If a user is asked to store more content locally in order to store more on the network, there is a finite amount of content a user can store online before they begin dipping into the portion of their hard disk that is reserved for other users' content. How do you cap a user's online storage so that they never accept the burden of storing more files than they are physically able to hold?
A user could, for instance, lie about their local storage capacity. The network would allow them to begin storing files online. As requests to store content on the user's machine come in, they will start failing with out-of-space errors once the content exceeds the client's ability to store it. It is not reasonable to ask a user to download other users' files before they are allowed to store their own.
In short:

- It's virtually impossible to ensure that a user is not leeching.
- Any sort of leech detection will have false positives, which will negatively impact legitimate users.
- Any sort of central tracking mechanism for file storage (like a blockchain) will quickly grow absurdly large, and much faster than the Bitcoin blockchain.
- Users that store more are not necessarily more reliable. Providing more capacity to the network is essentially useless unless the node can provide redundant capacity, which---for almost every conceivable user---is not reasonable.
- There's no way to stop a bad actor from connecting to the network multiple times and manipulating the distribution of files in his favor.
- There is no way to stop a user from overcommitting their local storage capacity.
True P2P networks have never been very user friendly. Especially on mobile, any loss of connectivity can cause a non-trivial interruption of service. Unlike a centralized system like the web (where the destination server is fixed at a single address), a P2P network suffers from significant disadvantages:
- Connecting to any node on the network requires knowledge of at least one existing node on the network. Disconnection for longer than a few hours may mean that all known nodes no longer exist at their IP addresses.
- The location of any resource on the network is not implicitly known:
- In a traditional P2P model, the network must perform a search to identify which hosts have a copy of the requested resource.
- In a blockchain-style network, the public identity of a host storing a particular resource is known, but the address of that host must still be searched for. The public identity may not be the address of the host, as the host's address may have changed.
- A request to determine whether a particular host is connected to the network will always time out rather than fail outright. Unless the network is a complete graph (i.e.: every node is connected to every other node), it is not possible to know for certain whether any particular node exists as part of the network. For example, if user A is searching for user B, user B may have just connected and his presence may not have propagated through the network. When user A sends his query, it's unknown whether user B is online and latency has simply prevented his response from returning, or whether user B is simply not online.
- It can be assumed that some users are simply unreliable. A user on a mobile device, for instance, may frequently disconnect from the network. Some nodes may be bandwidth constrained or have high packet loss. Other nodes may be busy with other tasks (video games, media streaming, etc.). Errors and timeouts would make the network client seem glitchy or slow at best.
- Throughput while transferring content between nodes on the network may be very poor, depending on the location and distance of a target node.
- A network connection that is disproportionately small in comparison to the node's storage could cause significant issues. Consider this: a user seeks to store a lot of content on the network, and so makes a large amount of space available to other users. The network routes inbound files to the user's newly available space. The user's network connection is very poor (perhaps on bad wifi), and the inbound files saturate the user's connection. This may make it impossible for the user to use their internet connection, or for any files to be received before various timeouts occur.
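The timeout ambiguity described a few bullets up can be made concrete with a toy simulation. The timeout value and return strings here are invented purely for illustration:

```python
def query_peer(peer_online, latency_s, timeout_s=2.0):
    """Simulated lookup: from the searcher's side, a timeout is
    indistinguishable from the peer being absent entirely."""
    if peer_online and latency_s <= timeout_s:
        return "found"
    return "timeout"  # Offline? Slow? Newly joined? We cannot tell.

# An offline peer and a merely slow one look identical to the searcher:
print(query_peer(peer_online=False, latency_s=0.1))  # timeout
print(query_peer(peer_online=True,  latency_s=5.0))  # timeout
```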
As the network grows and scales, maintaining the location of each and every resource on the network becomes increasingly difficult, especially with many unreliable nodes. If a centralized, monolithic storage system like a cloud storage provider can take many hundreds of milliseconds to find and begin transmitting a resource, it is wishful thinking to expect a distributed, unreliable network to do the same in even a remotely competitive amount of time.
One last consideration for such a network is the ability for it to be attacked. DDoSing a client on this network would not be hard at all: simply sending requests to a client to download files could completely saturate the client's entire outbound internet connection. This could be used to force a user not to use the network. If the client queues download requests, a DDoS can prevent nodes that have resources stored on the victim machine from accessing their content. Simply filling up the queues would mean that few legitimate download requests could get through.
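The queue-flooding variant of that attack is easy to sketch. A toy bounded queue on a storage node, with an illustrative capacity (no real client works exactly like this):

```python
class DownloadQueue:
    """Bounded download-request queue on a storage node. An attacker
    who fills it starves legitimate requests. Capacity is illustrative."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = []

    def enqueue(self, request):
        if len(self.pending) >= self.capacity:
            return False  # request dropped: queue is saturated
        self.pending.append(request)
        return True

q = DownloadQueue(capacity=100)
# Attacker floods the queue with junk requests...
flooded = sum(q.enqueue(("attacker", i)) for i in range(1000))
# ...so a legitimate request for a stored file is turned away.
legit = q.enqueue(("legitimate-user", "my-file"))
print(flooded, legit)  # 100 False
```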
Such an attack could be used to block a particular user from accessing their content. If a user's public identity on the network is traced to their personal identity, it could be possible for an attacker to extort the user: a ransom could be demanded for access to the user's own files, for instance, by overwhelming the hosts that store the user's files.
I think a global P2P storage network would be cool. And some people have even tried it in various incarnations. It is not, however, a concept that will scale. Users will quickly grow tired of its unreliable and slow behavior, and of the frustrations of managing content between various devices. The death-blow, I believe, is the fact that a user could never get "extra storage": requiring the user to store more content than they can deposit makes the network only useful for accessing shared content from other users and keeping backups of very important files. Because of the network's latency, the system could not be used to stream media, eliminating yet another use case.
If anyone ever manages to solve the problems presented above, I'd love to hear your solutions.