Multilingual Scraper of Privacy Policies and Terms of Service

Authors: David Bernhard, Luka Nenadic, Stefan Bechtold, Karel Kubicek (karel.kubicek@inf.ethz.ch)

Abstract: Websites’ privacy policies and terms of service constitute valuable resources for scholars in various disciplines. Nonetheless, there exists no large, multilingual database collecting these documents over the long term. Therefore, researchers spend a lot of valuable time collecting them for individual projects, and these heterogeneous methods impede the reproducibility and comparability of research findings. As a solution, we introduce a long-term scraper of privacy policies and terms supporting 37 languages. We run our scraper on a monthly basis on 800’000 websites, and we publish the dataset for the twelve crawls in 2024. Our manual evaluation of the end-to-end extraction of the documents demonstrates F1 scores of 79% for privacy policies and 75% for terms of service in five sample languages (English, German, French, Italian, and Croatian). We present several broad potential applications of our database for future research.

BibTeX

@inproceedings{bernhard2025multilingual,
  title={Multilingual Scraper of Privacy Policies and Terms of Service},
  author={Bernhard, David and Nenadic, Luka and Bechtold, Stefan and Kubicek, Karel},
  booktitle={Proceedings of the Symposium on Computer Science and Law},
  publisher={Association for Computing Machinery},
  address={München, Germany},
  pages={TBA},
  numpages={8},
  year={2025},
  isbn={979-8-4007-1421-4/25/03},
  doi={10.1145/3709025.3712215},
  url={https://doi.org/10.1145/3709025.3712215},
  series={CSLAW '25},
}

Policies and terms are useful for research

Privacy policies (policies) document how firms collect, store, and use users’ personal data. They are of significant interest in many studies examining the data collection practices of websites, mobile apps, and other services. Terms of service (also called terms and conditions; terms) are legal contracts between the consumer and the service and are therefore also a focus of legal studies.

Empirical studies of policies and terms rely on a corpus of policies and terms. While various datasets exist, they typically provide documents from a single timestamp or focus on historical data. Furthermore, existing datasets typically focus on documents in English. We are not aware of any continuously operating project collecting policies and terms in multiple languages at large scale. Consequently, researchers spend a lot of valuable time collecting them for individual projects, and these heterogeneous methods impede the reproducibility and comparability of research findings.

Our multilingual scraper

We introduce a long-term scraper of privacy policies and terms supporting 37 languages. We run our scraper on a monthly basis on 800’000 websites, and we publish the dataset for the twelve scrapes in 2024.

Our scraper uses three techniques to detect policies and terms, applied in the following order:

  1. Navigation by searching for keywords in links and buttons (see the sketch after this list). These keywords were translated into most of the 37 languages by native or proficient speakers; only for 15 languages did we use Google Translate.
  2. Search using search engines (Startpage and DuckDuckGo).
  3. Querying common paths such as example.com/policy.
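To illustrate the first technique, here is a minimal sketch in Python, assuming BeautifulSoup for HTML parsing. The keyword sets and the function name are illustrative placeholders, not the scraper's actual per-language keyword lists.

```python
# Sketch of technique 1: scan links and buttons for policy/terms keywords.
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Placeholder keywords; the scraper uses per-language lists translated by native speakers.
POLICY_KEYWORDS = {"privacy policy", "privacy", "datenschutz"}
TERMS_KEYWORDS = {"terms of service", "terms and conditions", "agb"}

def find_candidate_links(html: str, base_url: str) -> dict[str, list[str]]:
    """Return candidate policy/terms URLs found in <a> and <button> elements."""
    soup = BeautifulSoup(html, "html.parser")
    candidates: dict[str, list[str]] = {"policy": [], "terms": []}
    for el in soup.find_all(["a", "button"]):
        text = " ".join(el.get_text(" ", strip=True).lower().split())
        href = el.get("href")
        if not href:
            continue
        url = urljoin(base_url, href)
        if any(kw in text for kw in POLICY_KEYWORDS):
            candidates["policy"].append(url)
        if any(kw in text for kw in TERMS_KEYWORDS):
            candidates["terms"].append(url)
    return candidates
```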

Then, our scraper classifies the detected documents using multilingual BERT models, which we trained on labeled datasets collected from our pilot scrapes: 415 positive and 133 negative samples for policies, and 273 positive and 810 negative samples for terms. The models achieve 93.2% and 92.3% accuracy on the validation dataset for policies and terms, respectively.
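The classification step could look roughly as follows with Hugging Face transformers. The base checkpoint name and the binary label mapping are assumptions for illustration; the models described above are fine-tuned on our labeled samples, which this sketch does not reproduce.

```python
# Illustrative policy/terms classification with a multilingual BERT model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Base multilingual checkpoint; the deployed models are fine-tuned versions (assumption).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # 0 = other document, 1 = policy (assumed labels)
)
model.eval()

def is_policy(text: str) -> bool:
    """Classify a candidate document; long texts are truncated to BERT's 512-token limit."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1
```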

We then extract the documents from the website using the Firefox Readability library, and from PDFs using the pdfminer library. We store both the extracted document (in Markdown) and the original HTML.
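The extraction step can be sketched in Python as follows. Since Firefox's Readability is a JavaScript library, this sketch substitutes readability-lxml for the HTML cleanup and uses html2text for the Markdown conversion, with pdfminer.six for PDFs; these substitutions are assumptions for illustration, not the scraper's exact pipeline.

```python
# Sketch of the extraction step (readability-lxml and html2text as stand-ins
# for Firefox Readability and the scraper's Markdown conversion).
import html2text
from pdfminer.high_level import extract_text as extract_pdf_text
from readability import Document  # pip install readability-lxml

def html_to_markdown(raw_html: str) -> tuple[str, str]:
    """Return (markdown, cleaned_html) for the main content of a page."""
    cleaned_html = Document(raw_html).summary()  # strips navigation and other boilerplate
    converter = html2text.HTML2Text()
    converter.body_width = 0                     # keep lines unwrapped
    return converter.handle(cleaned_html), cleaned_html

def pdf_to_text(path: str) -> str:
    """Extract plain text from a PDF policy or terms document."""
    return extract_pdf_text(path)
```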

Our manual evaluation of the extraction of the documents demonstrates F1 scores of 79% for privacy policies and 75% for terms of service in five sample languages (English, German, French, Italian, and Croatian). While this might seem low, note that, unlike in comparable work, this evaluation was performed end-to-end: it accumulates errors from the network, the scraper's navigation, the ML classification of the documents, and the full-content extraction from the website or PDF. Similar works typically report only some of these metrics rather than their accumulated error rate, and they often count related documents, such as cookie policies, as true positives.
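To make explicit how the end-to-end numbers aggregate these error sources, the following sketch shows the standard precision/recall/F1 computation over manually labeled samples; the counts in the example are made up, not our evaluation data.

```python
# End-to-end F1: a document counts as a true positive only if navigation,
# classification, and extraction all succeed for it.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example with made-up counts:
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=22)
print(f"precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```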

Results statistics

Number of scraped websites and detected documents

| Month | Sampled URLs | Pages loaded (%) | Collected policies | Policies (% of total) | Collected terms | Terms (% of total) |
|---|---|---|---|---|---|---|
| 2024-01 | 556281 | 95.30% | 310349 | 55.79% | 219445 | 39.82% |
| 2024-02 | 800151 | 96.29% | 370918 | 46.36% | 266419 | 33.60% |
| 2024-03 | 854044 | 97.85% | 399831 | 46.82% | 286679 | 33.95% |
| 2024-04 | 802174 | 98.51% | 374914 | 46.74% | 268835 | 33.51% |
| 2024-05 | 796574 | 96.74% | 274336 | 34.44% | 92350 | 11.46% |
| 2024-06 | 798650 | 96.66% | 347859 | 43.56% | 182436 | 22.54% |
| 2024-07 | 800237 | 96.34% | 345107 | 43.13% | 187126 | 23.06% |
| 2024-08 | 801949 | 96.37% | 346897 | 43.26% | 188260 | 23.14% |
| 2024-09 | 802010 | 96.31% | 350517 | 43.70% | 191367 | 23.50% |
| 2024-10 | 804693 | 95.93% | 352598 | 43.82% | 192186 | 23.51% |
| 2024-11 | 815136 | 94.95% | 355119 | 43.57% | 192369 | 23.21% |
| 2024-12 | 813172 | 95.96% | 350713 | 43.13% | 188839 | 22.81% |

Speed and data production

| Month | Sites per minute | Total DB size | Total compressed HTML |
|---|---|---|---|
| Older | 20 | 14 GB | 32 GB |
| 2024-01 | 20.6 | 19 GB | 43 GB |
| 2024-02 | 20.3 | 29 GB | 48 GB |
| 2024-03 | 29.5 | 38 GB | 83 GB |
| 2024-04 | 30.6 | 44 GB | 136 GB |
| 2024 (full) | 30 | 121.24 GB | |

Change log, outages, and outliers

Please note that during the first months of 2024, the scraper was still subject to modifications while we were reaching the final deployment stage. Since May 2024, the scraper has been stable, and such changes should be rare.

Acknowledgement

The authors would like to thank:

Updates