Automating Website Registration for Studying GDPR Compliance

Authors: Karel Kubicek (karel.kubicek@inf.ethz.ch), Jakob Merane, Ahmed Bouhoula, and David Basin

Abstract: Investigating how websites use sensitive user data is an active research area. However, research based on automated measurements has been limited to those websites that do not require user authentication. To overcome this limitation, we developed a crawler that automates website registrations and newsletter subscriptions and detects both security and privacy threats at scale.

We demonstrate our crawler’s capabilities by running it on 660k websites. We use this to identify security and privacy threats and to contextualize them within EU laws, namely the General Data Protection Regulation and ePrivacy Directive. Our methods detect private data collection over insecure HTTP connections and websites sending emails with user-provided passwords. We are also the first to apply machine learning to web forms, assessing violations of marketing consent collection requirements. Overall, we find that 37.2% of websites send marketing emails without proper user consent. This is mostly caused by websites failing both to verify and store consent adequately. Additionally, 1.8% of websites share users’ email addresses with third parties without a transparent disclosure.

BibTeX

@inproceedings{kubicek2024automating,
  title={Automating Website Registration for Studying {GDPR} Compliance},
  author={Karel Kubicek and Jakob Merane and Ahmed Bouhoula and David Basin},
  booktitle={Proceedings of the ACM Web Conference 2024},
  year={2024},
  doi={10.1145/3589334.3645709},
  publisher={ACM},
  address={New York, NY, USA},
  pages={12},
  isbn={979-8-4007-0171-9}
}

Websites are sending unsolicited emails

To sign up for web services, users typically must provide their email addresses. Unfortunately, this information can be exploited by companies to send unsolicited marketing emails, promoting their products and services. Given the prevalence of this practice, users often do not recall ever registering for a service. To counteract these practices, countries have implemented regulations on privacy (GDPR, ePrivacy Directive) and unfair competition (Unfair Competition Act). In this study, we analyze the effectiveness of these regulations.

Automating registration with a crawler

Automating account creation is challenging because websites actively try to block bots in order to prevent spam, data mining, and other abuse. To evade bot detection, our crawler employs various techniques, ranging from using the Undetected Chromedriver to mimicking genuine browsing behavior.

Using multilingual keyword matching (covering 37 European languages) for navigation and form interaction, the crawler searches for registration and newsletter subscription forms. If these forms are not directly accessible, it navigates to relevant pages, such as login pages, that can lead to them.
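As a rough illustration of keyword-based navigation, the sketch below ranks the links of a page so that direct registration matches come first and login pages serve as a fallback. The keyword lists, the three languages shown, and the function name `rank_links` are assumptions made for this example and are not taken from the crawler's code.

```python
import re

# Hypothetical keyword lists for illustration; the crawler covers 37 languages.
REGISTRATION_KEYWORDS = {
    "en": ["sign up", "register", "create account"],
    "de": ["registrieren", "konto erstellen"],
    "fr": ["s'inscrire", "créer un compte"],
}
# Pages that often lead to a registration form when no direct link exists.
FALLBACK_KEYWORDS = {
    "en": ["log in", "sign in"],
    "de": ["anmelden"],
    "fr": ["se connecter"],
}


def _matches(text, keywords_by_lang):
    """Case-insensitive substring match against all languages."""
    normalized = re.sub(r"\s+", " ", text).strip().casefold()
    return any(
        keyword in normalized
        for keywords in keywords_by_lang.values()
        for keyword in keywords
    )


def rank_links(links):
    """Take (link text, href) pairs and return candidate hrefs.

    Direct registration matches come first, login-like fallbacks second.
    """
    direct = [href for text, href in links if _matches(text, REGISTRATION_KEYWORDS)]
    fallback = [
        href
        for text, href in links
        if href not in direct and _matches(text, FALLBACK_KEYWORDS)
    ]
    return direct + fallback
```

Plain substring matching is the simplest possible choice here; a production crawler would likely also inspect URLs, button labels, and form attributes.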

The crawler can fill interactive forms protected by honey pots and CAPTCHAs. After form submission, it categorizes the response and verifies the account using activation codes or links sent through double-opt-in emails.
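The account verification step can be pictured as a search for an activation link or code in the received email. The following is a minimal sketch under simplifying assumptions: the hint words are English only, codes are assumed to be 4 to 8 digits, and `find_activation` is a hypothetical helper rather than the crawler's actual function.

```python
import re

# Hypothetical hint words; a multilingual crawler would need far more.
ACTIVATION_HINTS = ("activate", "confirm", "verify", "validate")

URL_RE = re.compile(r"https?://[^\s\"'<>]+")
CODE_RE = re.compile(r"\b\d{4,8}\b")  # assumption: numeric codes of 4-8 digits


def find_activation(body):
    """Return ("link", url), ("code", digits), or None for an email body."""
    # Prefer a URL that itself looks like an activation link.
    for url in URL_RE.findall(body):
        if any(hint in url.lower() for hint in ACTIVATION_HINTS):
            return ("link", url)
    # Otherwise look for a numeric code, but only if the text mentions one.
    lowered = body.lower()
    if "code" in lowered or any(hint in lowered for hint in ACTIVATION_HINTS):
        match = CODE_RE.search(body)
        if match:
            return ("code", match.group())
    return None
```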

Figure: Sankey diagram illustrating the intermediate results of our crawler. The numbers represent percentages of the 660k target websites.

Deployed on 660k websites from Tranco Top 1M, our crawler’s outcomes are visualized in the above figure. It identified registration or subscription forms on 33.6% of the websites, aligning with findings from other publications [13], [7]. Subsequently, it successfully submitted almost 20% of these forms, and a total of 34k websites sent us at least one email. The success rate varies based on the website sample, with websites from CrUX nearly doubling the registration rate.

With a registration rate of 5.9%, significantly higher than the previous state-of-the-art 1.6% by Drakonakis et al. [13], our crawler enables large-scale measurements of security and privacy in website sections requiring authentication or dependent on email datasets.

Detecting potential violations using ML

In our prior PETS’22 publication, we introduced 22 legal properties and their relations to detect potential violations of EU privacy regulations. In that study, legal scholars manually classified these 22 properties, creating a dataset of 666 annotated websites. Using machine learning, we trained models to classify these properties based on the code of the forms and emails. However, the models’ performance is not yet optimal, mainly due to the low number of positive samples in the annotated dataset. To minimize false positives, we combine the model’s predictions with the keyword-based detection used in the crawler. The performance of the crawler and the XGBoost ML model is presented below. We also experimented with other ML models, such as TabNet, allowing unsupervised pretraining on all crawled websites, but the results were similar to those of XGBoost at the cost of higher complexity.
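One plausible way to combine the two detectors so that false positives drop is to report a property only when both agree. The sketch below shows this conjunction; the exact combination rule, the 0.5 threshold, and the names `PropertyEvidence` and `property_detected` are assumptions for illustration, not a description of the published pipeline.

```python
from dataclasses import dataclass


@dataclass
class PropertyEvidence:
    model_probability: float  # score from a classifier such as XGBoost
    keyword_match: bool       # result of the deterministic keyword detector


def property_detected(evidence, threshold=0.5):
    """Report the property only when both detectors agree (assumed rule)."""
    return evidence.keyword_match and evidence.model_probability >= threshold
```

Requiring agreement trades recall for precision: a property missed by either detector goes unreported, which matches the stated goal of minimizing false positives.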

Table: Performance of the ML model (XGBoost) and the crawler (Deterministic). The last column shows the proportion of the training data that are positive samples.

These legal properties serve as inputs to the decision trees below, which determine whether the form interface contains violations.

Figure: Decision trees for the ePrivacy Directive and GDPR requirements. The left tree captures the ePrivacy Directive requirements (consent); the right tree captures the nuances of consent defined by the GDPR.

In addition to form interface violations, we also identify insecure forms (using HTTP without TLS) and registrations where the sender shares the user-provided password in an unencrypted email. Both of these constitute violations of Art. 32 of the GDPR.
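Both security checks can be approximated with a few lines of standard-library Python. This is a sketch under assumptions: the form's submission target is resolved from its `action` attribute relative to the page URL, and each registration uses a unique, sufficiently random password so that a substring match in the email body is unlikely to be a coincidence. The function names are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def form_is_insecure(page_url, form_action):
    """True if the form data would be submitted over plain HTTP."""
    # An empty or missing action submits to the page's own URL.
    target = urljoin(page_url, form_action or "")
    return urlparse(target).scheme == "http"


def email_leaks_password(email_body, password):
    """True if the user-provided password appears verbatim in the email."""
    return bool(password) and password in email_body
```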

Figure: Security threats and form violations. The left panel shows the security threats of registrations on websites; the right panel shows the violations of the marketing consent requirement in the form.

Finally, we analyze the emails received. We classify the first email from each website as double opt-in (activation emails), marketing, or other servicing email (confirmation emails, policy updates, etc.). Note that double opt-in is an established best practice to satisfy the GDPR requirement of collecting consent from the email recipient, protecting people from unwanted newsletter subscriptions. However, we consider only services that immediately send marketing emails as potentially non-compliant.

To track how websites use email addresses, each registration is performed with a unique email address. Detecting when a website shares the email address with third parties is challenging. For example, facebook.com sends emails from facebookmail.com. Hence, the technical definition of a third party (a different domain) does not match the legal notion of a different entity. For each sender domain, our heuristic predicts how the domain was disclosed to the user, with 90% accuracy. We take the first matching outcome from the following list, ordered from the most to the least disclosed. (1) The domain name where we registered and any domains that are similar. (2) The domain of the first received email. (3) Any common email host (e.g., gmail.com) if the name preceding the @ symbol is similar to the registration domain. (4) Any domain declared on the registration page is marked as ‘In form.’ (5) Any common host that was not matched previously is marked as ‘Dis. email host.’ (6) Domains in the privacy policy and terms and conditions are marked as ‘In policy/terms.’ (7) If all these checks fail, the domain is marked as ‘Undeclared.’
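The ordered checks above behave like a first-match cascade, which the following sketch illustrates. Several details are assumptions made for the example: the labels returned for the first three outcomes, the short list of common hosts, the naive extraction of the domain label (a real implementation would consult the Public Suffix List), and the string-similarity measure with its 0.75 threshold.

```python
from difflib import SequenceMatcher

COMMON_HOSTS = {"gmail.com", "outlook.com", "yahoo.com"}  # illustrative subset


def _label(domain):
    """Naive registrable label: 'facebookmail' for 'facebookmail.com'."""
    parts = domain.lower().split(".")
    return parts[-2] if len(parts) >= 2 else parts[0]


def _similar(a, b, threshold=0.75):
    # Assumed similarity measure; the source does not specify one.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def classify_sender(sender, registered_domain, first_email_domain,
                    form_domains=frozenset(), policy_domains=frozenset()):
    """Return the first matching disclosure label for a sender address."""
    local, _, domain = sender.lower().rpartition("@")
    reg_label = _label(registered_domain)
    # (1) Registration domain or a similar one (label name is hypothetical).
    if domain == registered_domain.lower() or _similar(_label(domain), reg_label):
        return "Registration domain"
    # (2) Domain of the first received email (label name is hypothetical).
    if domain == first_email_domain.lower():
        return "First email"
    # (3) Common host whose local part resembles the registration domain.
    if domain in COMMON_HOSTS and _similar(local, reg_label):
        return "Similar email host"
    if domain in form_domains:      # (4)
        return "In form"
    if domain in COMMON_HOSTS:      # (5)
        return "Dis. email host"
    if domain in policy_domains:    # (6)
        return "In policy/terms"
    return "Undeclared"             # (7)
```

With these assumptions, a sender at facebookmail.com is attributed to a registration at facebook.com by the first check, rather than being flagged as a third party.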

Figure: Email opt-in and email sharing violations. The left panel shows the statistics of the first email classification; the right panel shows the violations of email sharing.

Future work

We are currently engaged in a legal study that aims to map compliance based on various parameters of websites, including location, industry type, and business size.

Additionally, we are eager to collaborate with data protection authorities on (semi-)automated enforcement of compliance with marketing email requirements. However, our ML-based methods still exhibit significant imprecision. False positives have the potential to harm falsely-accused businesses, while false negatives may lead to automated judgments with biases.

The capabilities of our crawler open up possibilities for several previously unfeasible projects. For instance, we can now analyze whether websites respect the unsubscribe action or study the prevalence of tracking by third parties in areas of websites requiring authentication. If you are interested in accessing the crawler, please request access via this form.

Q&A

Acknowledgement

The authors would like to thank:

Updates