Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google’s FLoC and the MinHash Hierarchy System

Authors: Florian Turati, Karel Kubicek karel.kubicek@inf.ethz.ch, Carlos Cotrini, and David Basin

Abstract: Recently proposed systems aim at achieving privacy using locality-sensitive hashing. We show how these approaches fail by presenting attacks against two such systems: Google’s FLoC proposal for privacy-preserving targeted advertising and the MinHash Hierarchy, a system for processing location trajectories in a privacy-preserving way. Our attacks refute the pre-image resistance, anonymity, and privacy guarantees claimed for these systems.

In the case of FLoC, we show how to deanonymize users using Sybil attacks and to reconstruct 10% or more of the browsing history for 30% of its users using Generative Adversarial Networks. We achieve this only analyzing the hashes used by FLoC. For MinHash, we precisely identify the location trajectory of a subset of individuals and, on average, we can limit users’ trajectory to just 10% of the possible geographic area, again using just the hashes. In addition, we refute their differential privacy claims.

BibTex

@article{turati2023locality,
  title={Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google's FLoC and the MinHash Hierarchy System},
  author={Turati, Florian and Kubicek, Karel and Cotrini, Carlos and Basin, David},
  journal={Proceedings on Privacy Enhancing Technologies},
  pages={117-131},
  volume={2023},
  number={4},
  year={2023},
  doi={10.56553/popets-2023-0101}
}

#NoFLoC

Locality-Sensitive Hashing not only in Google FLoC

Google proposed FLoC, a “private targeted advertising” system. On first look, it seemed as a great improvement to users’ privacy, but after a while, we realized that it leaks information where no one expected it to do so - from using Locality-Sensitive Hashing (LSH), which was never meant for privacy applications!

LSH maps similar inputs to similar outputs. It is a neat method for clustering or duplicates detection. Google used one of its flavors in detecting duplicate websites.

After a failed test on tens of millions of users Google discontinued FLoC in 2022, unlucky for our research, but definitely the correct decision for users’ privacy. Yet the reason for doing so was not the information leakage we found but rather the backlash. To that end, LSH is still applied in other privacy projects. We evaluate the information leakage of LSH in systems that use it to establish privacy - FLoC and MinHash Hierarchy systems.

FLoC

The key idea of FLoC is to take the user’s browsing history, and create a 50-bit hash (fingerprint) of it locally in the browser using LSH. This hash is then used to cluster users with similar history, assigning them a cluster ID. Only this ID is then shared with trackers and advertisers, keeping the web content still targeted, but to groups of at least k > 2000 users in cluster, providing k-anonymity.

We propose these attacks on FLoC:

GAN-IP attack The scheme of the GAN-IP attack, extracting private information from FLoC. First, the attacker takes a cohort ID 𝛾 and then uses the GAN-IP attack to create fake browsing histories whose SimHash contains 𝛾 as prefix. These SimHashes make the cohort decomposable, so FLoC’s clustering algorithm divides the cohort into smaller cohorts. The attacker exploits the fake histories to infer websites visited by real users.

Recovered historiesRecovered histories Distribution of overlap between target user’s 30 items history and generated histories by GAN (left) and GAN-IP (right) attacks. While the integer programming reduces the absolute number of recovered history items, the relative portion of history the attack correctly found increases from 8% to 12%. Note that for privacy reasons, we worked with public MovieLens Datasets.

MinHash Hierarchy

MinHash Hierarchy is a system for urban planning. It collects cellular data tracking movement of people in the city to extract statistics about citizens’ behavior. It uses LSH to keep the individual’s data private, even claiming that it achieves differential privacy.

We show that the differential privacy claims are wrong, since some citizens can be completely tracked. We again evaluate the information leakage from recovering hash pre-images, and we find that in a city as Porto, we can track users down to 10% of area.

Recovered trajectoriesRecovered trajectories Four randomly chosen trajectories of taxi drivers in Porto denoted as “Target X,” and the extracted location “Recovered X” from MinHash Hierarchy system. Gray pixels denote where the drivers potentially drove, while black pixels are places where we are certain that the driver passed by. This uses only the LSH hash content, a further post processing on geographical distribution of timestamps would narrow this map significantly more.

Conclusions

We show that using locality-sensitive hashing to establish privacy into a data-collection system is a wrong idea. While hashing is lossy compression, there must remain input leakage in the hash, else LSH would not provide any utility. Obliviousness of this leakage to system designers does not mean there is none, and it resembles security by obscurity principle.

There were other systems that use LSH to establish privacy in the system, such as Apple CSAM (discontinued after attacks were discovered) and Replacement AutoEncoder. We are therefore worried if this idea is not a reappearing idea so we use our findings to discourage any privacy system designers from its usage.

Q&A

Acknowledgement

The authors would like to thank:

Updates