Automating Website Registration for Studying GDPR Compliance

Authors: Karel Kubicek karel.kubicek@inf.ethz.ch, Jakob Merane, Ahmed Bouhoula, and David Basin

Abstract: Investigating how websites use sensitive user data is an active research area. However, research based on automated measurements has been limited to those websites that do not require user authentication. To overcome this limitation, we developed a crawler that automates website registrations and newsletter subscriptions and detects both security and privacy threats at scale.

We demonstrate our crawler’s capabilities by running it on 660k websites. We use this to identify security and privacy threats and to contextualize them within EU laws, namely the General Data Protection Regulation and ePrivacy Directive. Our methods detect private data collection over insecure HTTP connections and websites sending emails with user-provided passwords. We are also the first to apply machine learning to web forms, assessing violations of marketing consent collection requirements. Overall, we find that 37.2% of websites send marketing emails without proper user consent. This is mostly caused by websites failing both to verify and store consent adequately. Additionally, 1.8% of websites share users’ email addresses with third parties without a transparent disclosure.

Conference page: WWW 2024
Download authors’ pre-print of the paper: PDF
Download authors’ extended version of the paper: PDF
- This version includes our prior work and extends the comparison among automated and manual methods.
Poster: PDF
See presentation: 3 min advertisement, MP4
Request the datasets, source code, and supplementary material: Google Forms
Keep me in the loop: Google Forms

BibTex

@inproceedings{kubicek2024automating,
  title={Automating Website Registration for Studying {GDPR} Compliance},
  author={Karel Kubicek and Jakob Merane and Ahmed Bouhoula and David Basin},
  booktitle={Proceedings of the ACM Web Conference 2024},
  year={2024},
  doi={10.1145/3589334.3645709},
  publisher={ACM},
  address={New York, NY, USA},
  pages={12},
  isbn={9-8-4007-0171-9/24/05}
}

Websites are sending unsolicited emails

To sign up for web services, users typically must provide their email addresses. Unfortunately, this information can be exploited by companies to send unsolicited marketing emails, promoting their products and services. Given the prevalence of this practice, users often do not recall ever registering for a service. To counteract these practices, countries have implemented regulations on privacy (GDPR, ePrivacy Directive) and unfair competition (Unfair Competition Act). In this study, we analyze the effectiveness of these regulations.

Our crawler automates the website registration process, enabling us to scale the analysis to 660k websites, with successful registrations on over 34k sites and the collection of over 2M emails.
We have trained models to predict legal properties of forms and emails, as presented in our PETS’22 publication. Combining these predictions using decision trees allows us to classify form violations.
Additionally, we identify third-party email sharing (with classification of the reason) and security violations under Art. 32 of the GDPR.
Our observations reveal potential violations on over 37% of the 34k websites where the crawler completed registrations.

Automating registration with a crawler

Automated account creation poses a significant challenge due to websites’ efforts to thwart bots engaging in activities like spamming, data mining, and other security concerns. Our crawler employs various techniques, ranging from the Undetected Chromedriver to mimicking genuine browsing, in order to evade bot detection.

Utilizing multi-lingual (37 European languages) keyword matching for navigation and form interaction, the crawler seeks out registration and newsletter subscription forms. If these forms are not directly accessible, it intelligently navigates to relevant pages, such as login pages, that can lead to the desired goal.

The crawler can fill interactive forms protected by honey pots and CAPTCHAs. After form submission, it categorizes the response and verifies the account using activation codes or links sent through double-opt-in emails.

Crawling intermediary results in a Sankey plot Sankey diagram illustrating the intermediary results of our crawler. The numbers represent percentage of the target 660k websites.

Deployed on 660k websites from Tranco Top 1M, our crawler’s outcomes are visualized in the above figure. It identified registration or subscription forms on 33.6% of the websites, aligning with findings from other publications [13], [7]. Subsequently, it successfully submitted almost 20% of these forms, and a total of 34k websites sent us at least one email. The success rate varies based on the website sample, with websites from CrUX nearly doubling the registration rate.

With a registration rate of 5.9%, significantly higher than the previous state-of-the-art 1.6% by Drakonakis et al. [13], our crawler enables large-scale measurements of security and privacy in website sections requiring authentication or dependent on email datasets.

Detecting potential violations using ML

In our prior PETS’22 publication, we introduced 22 legal properties and their relations to detect potential violations of EU privacy regulations. In that study, legal scholars manually classified these 22 properties, creating a dataset of 666 annotated websites. Using machine learning, we trained models to classify these properties based on the code of the forms and emails. However, the models’ performance is not yet optimal, mainly due to the low number of positive samples in the annotated dataset. To minimize false positives, we combine the model’s predictions with the keyword-based detection used in the crawler. The performance of the crawler and the XGBoost ML model is presented below. We also experimented with other ML models, such as TabNet, allowing unsupervised pretraining on all crawled websites, but the results were similar to those of XGBoost at the cost of higher complexity.

Table of ML performance The performance of the ML model (XGBoost) and the crawler (Deterministic). The last column represents the portion of training data that are positive samples.

These legal properties navigate the decision trees below in determining the presence of violations in the form interface.

Decision tree for ePrivacy requirements Decision tree for GDPR requirements On the left is the decision tree capturing ePrivacy Directive requirements (consent), on the right is the tree that captures the nuances of consent defined by the GDPR.

In addition to form interface violations, we also identify insecure forms (using HTTP without TLS) and registrations where the sender shares the user-provided password in an unencrypted email. Both of these constitute violations of Art. 32 of the GDPR.

Security threats results Form violations results On the left are the security threats of registrations on websites, on the right are the violations of the marketing consent requirement in the form.

Finally, we analyze the emails received. We classify the first email from each website as double opt-in (activation emails), marketing, or other servicing email (confirmation emails, policy updates, etc.). Note that double opt-in is an established best practice to satisfy the GDPR requirement of collecting consent from the email recipient, protecting people from unwanted newsletter subscriptions. However, we consider only services that immediately send marketing emails as potentially non-compliant.

To track how websites use email addresses, each registration is performed with a unique email address. Detecting when the website shares the email address with third parties poses a challenge. For example, facebook.com sends emails from facebookmail.com. Hence, the technical definition of a third party (different domain) does not match the legal requirement of a different entity. Our heuristic predicts for each sender domain the following disclosure with 90% accuracy. We take the first of the following outcomes, ordered from the most to the least disclosed. (1) The domain name where we registered and any domains that are similar. (2) The domain of the first received email. (3) Any common email host (e.g., gmail.com) if the name preceding the @ symbol is similar to the registration domain. (4) Any domain declared on the registration page is marked as ‘In form.’ (5) Any common host that was not matched previously as ‘Dis. email host.’ (6) Domains in the privacy policy and terms and conditions are marked as ‘In policy/terms.’ (7) If all these checks fail, the domain is marked as `Undeclared.’

Email opt-in violations Email sharing violations On the left is the statistics of first email classification, on the right are the violations of email sharing.

Future work

We are currently engaged in a legal study that aims to map compliance based on various parameters of websites, including location, industry type, and business size.

Additionally, we are eager to collaborate with data protection authorities on (semi-)automated enforcement of compliance with marketing email requirements. However, our ML-based methods still exhibit significant imprecision. False positives have the potential to harm falsely-accused businesses, while false negatives may lead to automated judgments with biases.

The capabilities of our crawler open up possibilities for several previously unfeasible projects. For instance, we can now analyze whether websites respect the unsubscribe action or study the prevalence of tracking by third parties in areas of websites requiring authentication. If you are interested in accessing the crawler, please request access via this form.

Q&A

Q: I would like to collaborate with you using the crawler for XYZ.

A: We are happy to engage in new prospects building on our research. Please request access to the code or datasets via this form or contact Karel Kubicek by email.
Q: Why there are so few violations and what do you mean by the potential violation?

A: Our study is purposefully conservative in reporting violations to limit false accusations of violations. This still resulted in 37.2% of websites sending emails without valid consent.

We call violations potential for three reasons. First, as a matter of legal formality, only a legal proceeding can determine a violation. Second, while we were conservative in defining the types of potential violations, and our analysis is informed by the relevant statutes, judicial precedent, and articles by legal experts, there remains some legal uncertainty as to how courts will decide specific cases. Third, the ML models are imprecise. While we believe that the aggregated results are still representative, the individual flagged violations are still prone to false positives.
Q: Did you report your findings to regulatory authorities?

A: This is work in progress, we aim to cooperate with regulatory authorities to enforce the privacy that users deserve. If you are a data protection authority interested in our results, we are happy to collaborate.
Q: Can I use your violation detection procedure to scan a website for violations?

A: We will not release the crawler to public due to ethics risks. But we plan to work on a browser extension for auditing the compliance of the registration form using the ML we developed.
Q: My website is having subscription/registration form for B2B (business-to-business). Do these regulations concern me?

A: This aspect is complicated, since ePrivacy Directive differentiates between consumers and businesses, while GDPR does not. Our crawler tries to skip registration to B2B forms, but our manual evaluation still showed some such forms submitted.

Acknowledgement

The authors would like to thank:

Stefan Bechtold and Alexander Stremitzer for their guidance on the legal aspects of this work.
Luka Lodrant, David Bernhard, and Patrice Kast, for assistance developing the crawler.
Joachim Posegga and the Deutsches Forschungsnetz (DNF) for providing us with university VPN.
François Hublet, Matilda Backendal, Sofia Giampietro, Srđan Krstić, Tobias Klenze, Jure Lodrant, Artem Galushko for translating the crawler keywords to various languages.
Vandit Sharma for helpful feedback on the manuscript.

Updates

May 15, 2024: Conference happening, keep me in the loop link
April 30, 2024: Poster
April 11, 2024: Authors’ extended version of the paper.
February 19, 2024: Camera ready version of the paper.
February 7, 2024: The initial version of this page.