Multilingual Scraper of Privacy Policies and Terms of Service

Authors: Karel Kubicek (karel.kubicek@inf.ethz.ch), David Bernhard, Stefan Bechtold

Abstract: We developed a scraper for privacy policies and terms of service, designed for long-term data collection spanning multiple years. The scraper supports 37 languages, addressing the fact that non-English websites are systematically understudied. We crawl nearly 1 million websites monthly, sampled across countries and popularity ranks using the Chrome UX Report list, which closely represents real user behavior and reduces sampling bias. The project provides an unprecedented dataset for empirical research on privacy policies and terms of service, enabling comprehensive studies of legal documents across languages and regions.

Sign up for access and/or project updates

Monthly results

Number of crawled websites and detected documents

Month    Crawled pages  Load ok  Policies crawled  Policies searched  Policies total  Terms crawled  Terms searched  Terms total
2024-01  556’281        95.30%   256’117           9’376              48.17%          202’297        17’148          39.82%
2024-02  800’151        96.29%   366’521           1’616              46.43%          264’758        1’661           33.60%
2024-03  854’044        97.85%   389’358           10’471             47.34%          281’366        5’313           33.95%
2024-04  802’174        98.51%   366’521           7’321              47.18%          264’758        1’661           33.62%
2024-05
2024-06
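
For readers who want to work with these figures programmatically, the following is a minimal sketch of how one row of the table above could be parsed. The helper names (`parse_count`, `parse_percent`, `parse_row`) and the field names are hypothetical and not part of the project's code. The only assumption taken from the table itself is that counts use ’ (U+2019) as a thousands separator and that shares are written with a trailing percent sign.

```python
# Hypothetical helpers for reading one row of the table above;
# these names are illustrative and do not come from the project's code base.
THOUSANDS_SEPARATORS = ("\u2019", "'")  # the table groups digits with ’ (U+2019)

# Assumed field names, in the same order as the table columns after "Month".
FIELD_NAMES = [
    "crawled_pages", "load_ok",
    "policies_crawled", "policies_searched", "policies_total",
    "terms_crawled", "terms_searched", "terms_total",
]

def parse_count(text: str) -> int:
    """Turn a count such as 556’281 into the integer 556281."""
    for sep in THOUSANDS_SEPARATORS:
        text = text.replace(sep, "")
    return int(text)

def parse_percent(text: str) -> float:
    """Turn a share such as 95.30% into the fraction 0.953."""
    return float(text.rstrip("%")) / 100.0

def parse_row(line: str) -> dict:
    """Split a whitespace-separated row into a dict of typed values."""
    month, *values = line.split()
    row = {"month": month}
    for name, value in zip(FIELD_NAMES, values):
        row[name] = parse_percent(value) if value.endswith("%") else parse_count(value)
    return row

row = parse_row("2024-01 556’281 95.30% 256’117 9’376 48.17% 202’297 17’148 39.82%")
print(row["month"], row["crawled_pages"], row["load_ok"])
# Approximate number of pages that loaded successfully in 2024-01.
print(round(row["crawled_pages"] * row["load_ok"]))
```

Rows without data yet (2024-05, 2024-06) yield a dict containing only the month, since `zip` stops at the shorter sequence.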

Speed and data production

Month    Sites per minute  Total DB size  Total compressed HTML
Older    20                14 GB          32 GB
2024-01  20.6              19 GB          43 GB
2024-02  20.3              29 GB          48 GB
2024-03  29.5              38 GB          83 GB
2024-04  30.6              44 GB          136 GB
2024-05
2024-06
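
To relate the crawl speed to the monthly volume, the arithmetic can be sketched as below. This is only an illustration under an assumption the page does not state: that "sites per minute" is an average rate sustained continuously while the crawler is running. The function names are hypothetical.

```python
# A rough back-of-the-envelope sketch; assumes a constant crawl rate,
# which the source does not confirm.
MINUTES_PER_DAY = 24 * 60

def crawl_days(sites: int, sites_per_minute: float) -> float:
    """Days of continuous crawling needed to visit `sites` at the given rate."""
    return sites / (sites_per_minute * MINUTES_PER_DAY)

def capacity(sites_per_minute: float, days: int = 30) -> float:
    """Number of sites a continuous crawl could visit in `days` days."""
    return sites_per_minute * MINUTES_PER_DAY * days

# 2024-03: 854'044 crawled pages at 29.5 sites per minute.
print(f"{crawl_days(854_044, 29.5):.1f} days of continuous crawling")
# 2024-04 rate of 30.6 sites per minute, extrapolated to a 30-day month.
print(f"{capacity(30.6):,.0f} sites per 30-day month")
```

Under this assumption, the March 2024 crawl corresponds to roughly 20 days of uninterrupted operation, and the April 2024 rate would allow about 1.3 million sites in a 30-day month.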

Change log, outages, and outliers

Please note that the first months of 2024 were still subject to modifications while we were reaching the final deployment stage. From June 2024, the crawler should be stable and such changes should be rare.

Acknowledgement

The authors would like to thank:

Updates