Authors: David Bernhard, Luka Nenadic, Stefan Bechtold, Karel Kubicek (karel.kubicek@inf.ethz.ch)
Abstract: Websites’ privacy policies and terms of service constitute valuable resources for scholars in various disciplines. Nonetheless, there exists no large, multilingual database collecting these documents over the long term. Therefore, researchers spend a lot of valuable time collecting them for individual projects, and these heterogeneous methods impede the reproducibility and comparability of research findings. As a solution, we introduce a long-term scraper of privacy policies and terms supporting 37 languages. We run our scraper on a monthly basis on 800’000 websites, and we publish the dataset for the twelve crawls in 2024. Our manual evaluation of the end-to-end extraction of the documents demonstrates F1 scores of 79% for privacy policies and 75% for terms of service in five sample languages (English, German, French, Italian, and Croatian). We present several broad potential applications of our database for future research.
@inproceedings{bernhard2025multilingual,
title={Multilingual Scraper of Privacy Policies and Terms of Service},
author={Bernhard, David and Nenadic, Luka and Bechtold, Stefan and Kubicek, Karel},
booktitle={Proceedings of the Symposium on Computer Science and Law},
publisher={Association for Computing Machinery},
address={München, Germany},
pages={TBA},
numpages={8},
year={2025},
isbn={979-8-4007-1421-4/25/03},
doi={10.1145/3709025.3712215},
url={https://doi.org/10.1145/3709025.3712215},
series={CSLAW '25},
}
Privacy policies (policies) document how firms collect, store, and use users’ personal data. They are of significant interest in many studies examining data collection practices of websites, mobile apps, and other services. Terms of service (or terms and conditions, terms) are legal contracts between the consumer and the service, and therefore are also a focus of legal studies.
Empirical studies of policies and terms rely on a corpus of policies and terms. While various datasets exist, they typically provide documents from a single timestamp or focus on historical data. Furthermore, existing datasets typically focus on documents in English. We are not aware of any continuously operating project collecting policies and terms in multiple languages at large scale. Consequently, researchers spend a lot of valuable time collecting them for individual projects, and these heterogeneous methods impede the reproducibility and comparability of research findings.
We introduce a long-term scraper of privacy policies and terms supporting 37 languages. We run our scraper on a monthly basis on 800’000 websites, and we publish the dataset for the twelve scrapes in 2024.
Our scraper uses three techniques to detect policies and terms, applied in this order:
Our scraper then classifies the detected documents using multilingual BERT models: one trained on our labeled dataset of 415 positive and 133 negative samples of policies, and another on 273 positive and 810 negative samples of terms, collected from our pilot scrapes. The models achieve 93.2% and 92.3% accuracy for policies and terms on the validation dataset, respectively.
We then extract the documents from web pages using Firefox's Readability library and from PDFs using the pdfminer library. We store both the extracted document in Markdown and the original HTML.
Our manual evaluation of the extraction of the documents demonstrates F1 scores of 79% for privacy policies and 75% for terms of service in five sample languages (English, German, French, Italian, and Croatian). While these scores might seem low, note that, unlike in other works, this evaluation was performed end-to-end, so it accumulates errors from the network, scraper navigation, ML classification of the documents, and the extraction of the full document content from the website or PDF. Similar works typically report only some of these metrics rather than their accumulated error rate, and typically count similar documents, such as cookie policies, as true positives.
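For readers less familiar with the metric, the following is a minimal sketch of how an F1 score is derived from evaluation counts. The function name and the counts are hypothetical and chosen only for illustration; they are not the counts from our manual evaluation. In an end-to-end setting, a document that is lost at any stage (page load, navigation, classification, or extraction) would count as a false negative.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, NOT taken from the paper's evaluation.
precision, recall, f1 = precision_recall_f1(tp=79, fp=21, fn=21)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a weak stage anywhere in the pipeline lowers the end-to-end score.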
Month | Sampled URLs | Page loaded | Collected policies | Policies total | Collected terms | Terms total |
---|---|---|---|---|---|---|
2024-01 | 556281 | 95.30% | 310349 | 55.79% | 219445 | 39.82% |
2024-02 | 800151 | 96.29% | 370918 | 46.36% | 266419 | 33.60% |
2024-03 | 854044 | 97.85% | 399831 | 46.82% | 286679 | 33.95% |
2024-04 | 802174 | 98.51% | 374914 | 46.74% | 268835 | 33.51% |
2024-05 | 796574 | 96.74% | 274336 | 34.44% | 92350 | 11.46% |
2024-06 | 798650 | 96.66% | 347859 | 43.56% | 182436 | 22.54% |
2024-07 | 800237 | 96.34% | 345107 | 43.13% | 187126 | 23.06% |
2024-08 | 801949 | 96.37% | 346897 | 43.26% | 188260 | 23.14% |
2024-09 | 802010 | 96.31% | 350517 | 43.70% | 191367 | 23.50% |
2024-10 | 804693 | 95.93% | 352598 | 43.82% | 192186 | 23.51% |
2024-11 | 815136 | 94.95% | 355119 | 43.57% | 192369 | 23.21% |
2024-12 | 813172 | 95.96% | 350713 | 43.13% | 188839 | 22.81% |
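As a reading aid, the "Policies total" column appears consistent with the number of collected policies divided by the number of sampled URLs. This is an observation drawn from the figures in the table, not a definition stated in the text, and the sketch below checks it only for the policies column of three rows. The helper name `share_percent` is hypothetical.

```python
# (sampled URLs, collected policies) copied from three rows of the table above.
rows = {
    "2024-01": (556281, 310349),
    "2024-05": (796574, 274336),
    "2024-12": (813172, 350713),
}


def share_percent(collected: int, sampled: int) -> float:
    """Return collected/sampled as a percentage rounded to two decimals."""
    return round(100 * collected / sampled, 2)


shares = {month: share_percent(collected, sampled)
          for month, (sampled, collected) in rows.items()}

for month, share in shares.items():
    print(f"{month}: {share:.2f}%")
```

For these three rows, the computed shares match the table values of 55.79%, 34.44%, and 43.13%.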
Month | Sites per minute | Total DB size | Total compressed HTML |
---|---|---|---|
Older | 20 | 14 GB | 32 GB |
2024-01 | 20.6 | 19 GB | 43 GB |
2024-02 | 20.3 | 29 GB | 48 GB |
2024-03 | 29.5 | 38 GB | 83 GB |
2024-04 | 30.6 | 44 GB | 136 GB |
2024 (full) | 30 | 121.24 GB |
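To relate the throughput figures to the monthly schedule, the following back-of-the-envelope sketch estimates how long one crawl takes. It assumes that "sites per minute" is the aggregate throughput of the whole crawl and that the scraper runs continuously; both are assumptions made for this illustration, and the function name `crawl_days` is hypothetical.

```python
MINUTES_PER_DAY = 60 * 24


def crawl_days(num_sites: int, sites_per_minute: float) -> float:
    """Estimate the crawl duration in days at a constant aggregate throughput."""
    return num_sites / sites_per_minute / MINUTES_PER_DAY


# Roughly 800,000 sites per crawl, at the two throughput levels listed above.
for rate in (20, 30):
    print(f"{rate} sites/min: {crawl_days(800_000, rate):.1f} days")
```

Under these assumptions, a crawl of 800,000 sites takes about 27.8 days at 20 sites per minute and about 18.5 days at 30 sites per minute, so both rates fit within a monthly cycle.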
Please note that the scraper was still subject to modifications during the first months of 2024, while we were reaching the final deployment stage. Since May 2024, the scraper has been stable, and such changes should be rare.
The authors would like to thank: