Multilingual Scraper of Privacy Policies and Terms of Service
Authors: Karel Kubicek karel.kubicek@inf.ethz.ch, David Bernhard, Stefan Bechtold
Abstract: We developed a scraper for privacy policies and terms of service, designed for long-term data collection over multiple years. The scraper supports 37 languages, addressing the fact that non-English websites are systematically understudied. We crawl nearly 1 million websites monthly, sampled across countries and popularity ranks from the Chrome UX Report list, which closely reflects real user behavior and reduces sampling bias. The project provides an unprecedented dataset for empirical research on privacy policies and terms of service, enabling comprehensive studies of legal documents across languages and regions.
- 2-page project flyer: PDF
Sign up for access and/or project updates
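The abstract describes sampling websites across countries and popularity ranks. A minimal sketch of such per-bucket sampling is given below. It assumes that a bucket is a (country, rank magnitude) pair and that the input is a list of (origin, country, rank) tuples; the function name `sample_per_bucket`, the toy origins, and the fixed seed are hypothetical and not taken from the project's code.

```python
import random
from collections import defaultdict


def sample_per_bucket(sites, per_bucket, seed=0):
    """Draw up to `per_bucket` origins from every (country, rank) bucket.

    `sites` is an iterable of (origin, country, rank) tuples. The bucket
    definition is an assumption made for this sketch.
    """
    buckets = defaultdict(list)
    for origin, country, rank in sites:
        buckets[(country, rank)].append(origin)

    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    sample = {}
    for key in sorted(buckets):
        origins = sorted(buckets[key])
        # Small buckets are taken whole instead of raising an error.
        k = min(per_bucket, len(origins))
        sample[key] = rng.sample(origins, k)
    return sample


# Toy input: (origin, country code, rank magnitude); all values are made up.
toy_sites = [
    ("https://a.example", "ch", 1000),
    ("https://b.example", "ch", 1000),
    ("https://c.example", "ch", 1000),
    ("https://d.example", "ch", 10000),
    ("https://e.example", "de", 1000),
]
picked = sample_per_bucket(toy_sites, per_bucket=2)
for bucket, origins in picked.items():
    print(bucket, origins)
```

Sampling a fixed number of sites per bucket keeps small countries and less popular sites represented, instead of letting the largest markets dominate the sample.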
Monthly results
Number of crawled websites and detected documents
| Month | Crawled pages | Load ok | Policies crawled | Policies searched | Policies total | Terms crawled | Terms searched | Terms total |
|---|---|---|---|---|---|---|---|---|
| 2024-01 | 556’281 | 95.30% | 256’117 | 9’376 | 48.17% | 202’297 | 17’148 | 39.82% |
| 2024-02 | 800’151 | 96.29% | 366’521 | 1’616 | 46.43% | 264’758 | 1’661 | 33.60% |
| 2024-03 | 854’044 | 97.85% | 389’358 | 10’471 | 47.34% | 281’366 | 5’313 | 33.95% |
| 2024-04 | 802’174 | 98.51% | 366’521 | 7’321 | 47.18% | 264’758 | 1’661 | 33.62% |
| 2024-05 | | | | | | | | |
| 2024-06 | | | | | | | | |
Speed and data production
| Month | Sites per minute | Total DB size | Total compressed HTML |
|---|---|---|---|
| Older | 20 | 14 GB | 32 GB |
| 2024-01 | 20.6 | 19 GB | 43 GB |
| 2024-02 | 20.3 | 29 GB | 48 GB |
| 2024-03 | 29.5 | 38 GB | 83 GB |
| 2024-04 | 30.6 | 44 GB | 136 GB |
| 2024-05 | | | |
| 2024-06 | | | |
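As a rough consistency check, the crawl speed can be converted into an expected crawl duration. The sketch below assumes that "Sites per minute" is an average over the whole crawl and that the "Crawled pages" count is the number of sites processed; the helper `crawl_days` is hypothetical, and the figures are copied from the tables on this page.

```python
def crawl_days(sites, sites_per_minute):
    """Days needed to crawl `sites` at a constant average rate."""
    minutes = sites / sites_per_minute
    return minutes / (60 * 24)


# (crawled pages, sites per minute) as reported in the tables on this page
months = {
    "2024-02": (800_151, 20.3),
    "2024-03": (854_044, 29.5),
    "2024-04": (802_174, 30.6),
}

for month, (sites, rate) in months.items():
    print(f"{month}: about {crawl_days(sites, rate):.1f} days")
```

For 2024-02 and 2024-03 the estimates (about 27 and 20 days) are close to the crawl periods listed in the change log.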
Change log, outages, and outliers
Please note that the first months of 2024 were still subject to modifications while we were reaching the final deployment stage. From June 2024, the crawler should be stable and such changes should be rare.
- 2024-01 crawl: 548’583 websites, using a different sampling: the top 30k websites from each country
  - Crawl period: 2023-12-27 - 2024-01-16
  - Document detection via a search engine (startpage.com) returned many documents from PayPal, which are linked on each result page. We exclude these manually.
- 2024-02 crawl: 790’970 websites, with 6k websites per bucket
  - Crawl period: 2024-01-27 - 2024-02-23
  - StartPage search dropped to about 15% of the previous month's level due to bot detection.
- 2024-03 crawl: 835’769 websites
  - Crawl period: 2024-03-02 - 2024-03-21
  - StartPage search is back to normal. PayPal is now removed from search results; this holds for future crawls as well.
- 2024-04 crawl: 792’622 websites; we reduced the sampling to 5k websites per bucket
  - Crawl period: 2024-04-02 - 2024-04-20
- 2024-05 crawl: 796’580 websites
  - Crawl period: 2024-05-07 - TODO
  - From this crawl on, we classify policies and terms by their content. Older crawls do not check document content; they only rank the available links to documents by how well they match. For older crawls, we will retrospectively compute the probability that each collected document is indeed a policy or terms.
  - The crawler now correctly handles HTTP errors (e.g., 404 Page not found) and excludes them from detected documents. Also, documents shorter than 60 characters are ignored.
- 2024-06 crawl: TODO websites
  - Crawl period: TODO - TODO
- 2024-07 crawl: TODO websites
  - Crawl period: TODO - TODO
  - We plan deployment on a new server dedicated to 5 years of support.
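The filtering rules mentioned in the change log (dropping PayPal links from search results, excluding HTTP errors, and ignoring documents shorter than 60 characters) could look roughly like the sketch below. This is not the crawler's actual code: the function names, the domain set, the use of status codes 400 and above as "HTTP errors", and the stripping of surrounding whitespace before measuring length are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

MIN_DOCUMENT_CHARS = 60                    # threshold stated in the change log
EXCLUDED_SEARCH_DOMAINS = {"paypal.com"}   # assumed form of the PayPal exclusion


def is_excluded_search_hit(url):
    """True if a search-result URL points to an excluded domain or subdomain."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d)
               for d in EXCLUDED_SEARCH_DOMAINS)


def keep_document(status_code, text):
    """Keep a fetched document unless it is an HTTP error or too short."""
    if status_code >= 400:                 # e.g. 404 Page not found
        return False
    # Assumption: length is measured after stripping surrounding whitespace.
    return len(text.strip()) >= MIN_DOCUMENT_CHARS


hits = [
    "https://www.paypal.com/legalhub/privacy-full",
    "https://shop.example/privacy",
]
kept_hits = [u for u in hits if not is_excluded_search_hit(u)]
print(kept_hits)
print(keep_document(404, "x" * 500), keep_document(200, "Not found"))
```

Matching on the hostname rather than on a substring of the URL avoids discarding unrelated pages that merely mention PayPal in their path. A real deployment would also need to keep PayPal's documents when PayPal itself is the crawled site, which this sketch does not handle.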
Acknowledgement
The authors would like to thank:
Updates
- June 13, 2024: The initial version of this page.