Multilingual Scraper of Privacy Policies and Terms of Service

Authors: David Bernhard, Luka Nenadic, Stefan Bechtold, Karel Kubicek karel.kubicek@inf.ethz.ch

Abstract: Websites’ privacy policies and terms of service constitute valuable resources for scholars in various disciplines. Nonetheless, there exists no large, multilingual database collecting these documents over the long term. Therefore, researchers spend a lot of valuable time collecting them for individual projects, and these heterogeneous methods impede the reproducibility and comparability of research findings. As a solution, we introduce a long-term scraper of privacy policies and terms supporting 37 languages. We run our scraper on a monthly basis on 800’000 websites, and we publish the dataset for the twelve crawls in 2024. Our manual evaluation of the end-to-end extraction of the documents demonstrates F1 scores of 79% for privacy policies and 75% for terms of service in five sample languages (English, German, French, Italian, and Croatian). We present several broad potential applications of our database for future research.

Conference page: CSLAW 2025, paper page
Download authors’ pre-print of the paper: PDF
Anonymized documents from full 2024 year as CSV: Zenodo
Request the datasets (full, unmodified, using DB access), source code, and supplementary material: Google Forms
See presentation: Google Slides

BibTeX

@inproceedings{bernhard2025multilingual,
  author = {Bernhard, David and Nenadic, Luka and Bechtold, Stefan and Kubicek, Karel},
  title = {Multilingual Scraper of Privacy Policies and Terms of Service},
  year = {2025},
  isbn = {9798400714214},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3709025.3712215},
  doi = {10.1145/3709025.3712215},
  booktitle = {Proceedings of the 2025 Symposium on Computer Science and Law},
  pages = {55–63},
  numpages = {9},
  keywords = {Web scraping, dataset, privacy, privacy policies, terms of service},
  location = {Munich, Germany},
  series = {CSLAW '25}
}

Policies and terms are useful for research

Privacy policies (policies) document how firms collect, store, and use users’ personal data. They are of significant interest in many studies examining data collection practices of websites, mobile apps, and other services. Terms of service (or terms and conditions, terms) are legal contracts between the consumer and the service, and therefore are also a focus of legal studies.

Empirical studies of policies and terms rely on a corpus of policies and terms. While various datasets exist, they typically provide documents from a single timestamp or focus on historical data. Furthermore, existing datasets typically focus on documents in English. We are not aware of any continuously operating project collecting policies and terms in multiple languages at large scale. Consequently, researchers spend a lot of valuable time collecting them for individual projects, and these heterogeneous methods impede the reproducibility and comparability of research findings.

Our multilingual scraper

We introduce a long-term scraper of privacy policies and terms supporting 37 languages. We run our scraper on a monthly basis on 800’000 websites, and we publish the dataset for the twelve scrapes in 2024.

Our scraper uses three techniques to detect the policies and terms, they are applied in this order:

Navigation by searching for keywords in links and buttons. These keywords were translated to the majority of the 37 languages by native or proficient speakers, and only for 15 languages we used google translate.
Search using search engines (Startpage and DuckDuckGo).
Querying common paths such as example.com/policy.

Then, our scraper classifies the detected documents using multilingual BERT models, which we trained on our labeled dataset of 415 positive and 133 negative samples of policies, and another on 273 positive and 810 negative samples of terms collected from our pilot scrapes. The models achieve 93.2% and 92.3% accuracy for policies and terms on the validation dataset, respectively.

We then extract the documents from the website using Firefox Readibility library, and from PDF using pdfminer library. We store both the extracted document in Markdown and the original HTML.

Our manual evaluation of the extraction of the documents demonstrates F1 scores of 79% for privacy policies and 75% for terms of service in five sample languages (English, German, French, Italian, and Croatian). While this might seem low, we would like to remind the reader that compared to other works, this evaluation was performed end-to-end, accumulating errors of network, scraper navigation, ML classification of the documents, and the extraction of the document from the website or PDF in its full content. Similar works typically report only some of these metrics and not their accumulated error rate and typically consider similar documents, such as cookie policy, as true positives.

The scraper is based on our earlier publication Automating Website Registration for Studying GDPR Compliance. This reference was accidentally omitted in the camera-ready version, but is now added in the author version here.

Results statistics

Number of scraped websites and detected documents

Month	Sampled URLs	Page loaded	Collected policies	Policies total	Collected terms	Terms total
2024-01	556281	95.30%	310349	55.79%	219445	39.82%
2024-02	800151	96.29%	370918	46.36%	266419	33.60%
2024-03	854044	97.85%	399831	46.82%	286679	33.95%
2024-04	802174	98.51%	374914	46.74%	268835	33.51%
2024-05	796574	96.74%	274336	34.44%	92350	11.46%
2024-06	798650	96.66%	347859	43.56%	182436	22.54%
2024-07	800237	96.34%	345107	43.13%	187126	23.06%
2024-08	801949	96.37%	346897	43.26%	188260	23.14%
2024-09	802010	96.31%	350517	43.70%	191367	23.50%
2024-10	804693	95.93%	352598	43.82%	192186	23.51%
2024-11	815136	94.95%	355119	43.57%	192369	23.21%
2024-12	813172	95.96%	350713	43.13%	188839	22.81%

Speed and data production

Month	Sites per minute	Total DB size	Total compressed HTML
Older	20	14 GB	32 GB
2024-01	20.6	19 GB	43 GB
2024-02	20.3	29 GB	48 GB
2024-03	29.5	38 GB	83 GB
2024-04	30.6	44 GB	136 GB
2024 (full)	30	121.24 GB

Change log, outages, and outliers

Please note that the first months of 2024 were still subject to modifications, while we were reaching the final deployment stage. From May 2024, the scraper is stable and such changes should be rare.

2024-01: 556’281 websites using a different sampling: top 30k websites from each country
- Scraping period: 2023-12-27 - 2024-01-16
- Document detection using search engine (startpage.com) resulted in many documents from PayPal that are linked on each result page. We exclude these manually.
2024-02: 800’151 websites, with 6k websites per bucket
- Scraping period: 2024-01-27 - 2024-02-23
- StartPage search dropped to about 15% of previous month due to bot detection.
2024-03: 854’044 websites, with 6k websites per bucket
- Scraping period: 2024-03-02 - 2024-03-21
- StartPage search back to normal, removal of PayPal from search results - this holds for the future as well.
2024-04: 802’174 websites, we reduced the sampling to 5k websites per bucket - from this point is the sampling process stable, see documentation in the paper
- Scraping period: 2024-04-02 - 2024-03-20
2024-05
- From this month on, we use classification of policies and terms. Older months do not check the documents by the content, but only order available links to documents according to matching. We will retrospectively compute probabilities that collected documents were indeed policy/terms for older months.
- Scraper correctly processes HTTP errors (e.g., 404 Page not found) and excludes them from detected documents. Also, documents shorter than 60 characters are ignored.
- The new strict interpretation of policies and terms reduced detection rates. We however still also store documents that are plausibly policy or terms (with ML classification accompanied).
2024-07
- We deployed the scraper on a new server dedicated to the project for a long term.

Acknowledgement

The authors would like to thank:

Natalie Ashgriz for maintenance and UI development.
Elias Landes provided excellent research assistance with data evaluation.
Luka Lodrant, Patrice Kast, Truong Long for assistance developing the scraper.
Joachim Posegga and the Deutsches Forschungsnetz (DNF) for providing us with university VPN.
François Hublet, Matilda Backendal, Sofia Giampietro, Srđan Krstić, Tobias Klenze, Jure Lodrant, Artem Galushko for translating the crawler keywords to various languages.
Alexander Stremitzer and Jakob Merane for the work on related publication sharing parts of the codebase.
Karel Kubicek’s research is funded by SNSF Postdoc.Mobility grant No. P500PT_225449 and Luka Nenadic’s research is funded by SNSF grant No. 10002634.

Updates

March 25, 2025: New form and minor paper update to address missing citation.
January 29, 2025: Significant update for the CSLAW publication, extended text, results for full 2024
June 13, 2024: The initial version of this page.