Deduplication by hash: fingerprinting that can reveal whether a file already exists
2011
Dropbox splits files into blocks, hashes each with SHA-256, and stores only one copy of any block it already holds. Researcher Christopher Soghoian warned that this cost-saving design could leak whether a given file already exists on Dropbox's servers.
What happened
To save storage and bandwidth, Dropbox chunks files into roughly 4 MB blocks, computes a SHA-256 hash of each block, and uploads only blocks it has never seen before. The hash acts as a fingerprint and lookup key. Historically this deduplication operated across users: if anyone had already uploaded an identical block, a new user's client could skip the upload entirely.
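The chunk, hash, and skip flow described above can be sketched roughly as follows. This is a minimal illustration, not Dropbox's actual client or protocol: the `BlockStore` class, the `upload` function, and the in-memory dictionary are hypothetical stand-ins, and the fixed 4 MB block size is taken from the description above.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # roughly 4 MB, per the description above


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class BlockStore:
    """Hypothetical in-memory stand-in for server-side block storage."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}  # SHA-256 hex digest -> block

    def missing(self, digests: list[str]) -> list[str]:
        """Return the digests the store has never seen."""
        return [d for d in digests if d not in self.blocks]

    def put(self, block: bytes) -> None:
        """Store a block under its SHA-256 digest (the lookup key)."""
        self.blocks[hashlib.sha256(block).hexdigest()] = block


def upload(store: BlockStore, data: bytes, block_size: int = BLOCK_SIZE) -> int:
    """Send only unseen blocks; return how many blocks were actually sent."""
    blocks = split_blocks(data, block_size)
    # Identical blocks within the same file collapse to one digest.
    by_digest = {hashlib.sha256(b).hexdigest(): b for b in blocks}
    needed = store.missing(list(by_digest))
    for digest in needed:
        store.put(by_digest[digest])
    return len(needed)
```

Calling `upload` twice with the same bytes sends blocks only the first time; the second call finds every digest already present and transfers nothing, which is the storage and bandwidth saving the design is after.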
In April 2011, security researcher Christopher Soghoian argued on his blog, 'slight paranoia', in a post titled 'How Dropbox sacrifices user privacy for cost savings', that cross-user deduplication created a side channel. By observing whether the client was asked to upload a file, one could infer whether that exact file already existed somewhere on Dropbox, which would be useful to investigators, copyright holders, or anyone probing for a known document. It also meant Dropbox could identify users holding a specific file purely from its hash, without reading content. Dropbox later moved deduplication to within a single user's account, reducing the cross-user inference. The episode is distinct from, but related to, the 2011 FTC encryption complaint.
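A toy model may make the inference concrete. The sketch below is an assumption-laden illustration, not Dropbox's real API: `DedupServer`, its `cross_user` flag, and the user names are all hypothetical. It hashes whole files rather than blocks for brevity, and the return value of `store` stands in for what a client could observe (whether it was asked to send the bytes).

```python
import hashlib


class DedupServer:
    """Toy model of a deduplicating storage service (hypothetical)."""

    def __init__(self, cross_user: bool) -> None:
        self.cross_user = cross_user
        # SHA-256 hex digest -> set of users whose accounts hold that content
        self.owners: dict[str, set[str]] = {}

    def needs_upload(self, user: str, digest: str) -> bool:
        holders = self.owners.get(digest, set())
        if self.cross_user:
            # Skip the upload if ANY user already stored this content.
            return not holders
        # Skip only if THIS user already stored it.
        return user not in holders

    def store(self, user: str, content: bytes) -> bool:
        """Store content; return True if the client was asked to upload it."""
        digest = hashlib.sha256(content).hexdigest()
        asked = self.needs_upload(user, digest)
        self.owners.setdefault(digest, set()).add(user)
        return asked

    def holders_of(self, content: bytes) -> set[str]:
        """Server-side view: which users hold this exact content."""
        return set(self.owners.get(hashlib.sha256(content).hexdigest(), set()))
```

With `cross_user=True`, a second user who stores a file someone else already uploaded sees `store` return `False`, which reveals that the file exists elsewhere on the service. With `cross_user=False`, that same call returns `True` and the outside observer learns nothing, although `holders_of` shows the operator can still match users to a known hash in either mode.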
Impact
The deduplication debate established early that file fingerprints alone, not just file contents, carry privacy weight: a hash can confirm possession of a specific document, support DMCA-style matching, and answer 'does Dropbox already have this file?' for outside parties. It shaped later understanding of how hash matching (including CSAM scanning and shared-link enforcement) works on the platform.