Authors: Karel Kubicek karel.kubicek@inf.ethz.ch, Jakob Merane, Ahmed Bouhoula, and David Basin
Abstract: Investigating how websites use sensitive user data is an active research area. However, research based on automated measurements has been limited to those websites that do not require user authentication. To overcome this limitation, we developed a crawler that automates website registrations and newsletter subscriptions and detects both security and privacy threats at scale.
We demonstrate our crawler’s capabilities by running it on 660k websites. We use this to identify security and privacy threats and to contextualize them within EU laws, namely the General Data Protection Regulation and ePrivacy Directive. Our methods detect private data collection over insecure HTTP connections and websites sending emails with user-provided passwords. We are also the first to apply machine learning to web forms, assessing violations of marketing consent collection requirements. Overall, we find that 37.2% of websites send marketing emails without proper user consent. This is mostly caused by websites failing both to verify and store consent adequately. Additionally, 1.8% of websites share users’ email addresses with third parties without a transparent disclosure.
@inproceedings{kubicek2024automating,
title={Automating Website Registration for Studying {GDPR} Compliance},
author={Karel Kubicek and Jakob Merane and Ahmed Bouhoula and David Basin},
booktitle={Proceedings of the ACM Web Conference 2024},
year={2024},
doi={10.1145/3589334.3645709},
publisher={ACM},
address={New York, NY, USA},
pages={12},
isbn={9-8-4007-0171-9/24/05}
}
To sign up for web services, users typically must provide their email addresses. Unfortunately, this information can be exploited by companies to send unsolicited marketing emails, promoting their products and services. Given the prevalence of this practice, users often do not recall ever registering for a service. To counteract these practices, countries have implemented regulations on privacy (GDPR, ePrivacy Directive) and unfair competition (Unfair Competition Act). In this study, we analyze the effectiveness of these regulations.
Automated account creation poses a significant challenge due to websites’ efforts to thwart bots engaging in activities like spamming, data mining, and other security concerns. Our crawler employs various techniques, ranging from the Undetected Chromedriver to mimicking genuine browsing, in order to evade bot detection.
Utilizing multi-lingual (37 European languages) keyword matching for navigation and form interaction, the crawler seeks out registration and newsletter subscription forms. If these forms are not directly accessible, it intelligently navigates to relevant pages, such as login pages, that can lead to the desired goal.
The crawler can fill interactive forms protected by honey pots and CAPTCHAs. After form submission, it categorizes the response and verifies the account using activation codes or links sent through double-opt-in emails.
Sankey diagram illustrating the intermediary results of our crawler. The numbers represent percentage of the target 660k websites.
Deployed on 660k websites from Tranco Top 1M, our crawler’s outcomes are visualized in the above figure. It identified registration or subscription forms on 33.6% of the websites, aligning with findings from other publications [13], [7]. Subsequently, it successfully submitted almost 20% of these forms, and a total of 34k websites sent us at least one email. The success rate varies based on the website sample, with websites from CrUX nearly doubling the registration rate.
With a registration rate of 5.9%, significantly higher than the previous state-of-the-art 1.6% by Drakonakis et al. [13], our crawler enables large-scale measurements of security and privacy in website sections requiring authentication or dependent on email datasets.
In our prior PETS’22 publication, we introduced 22 legal properties and their relations to detect potential violations of EU privacy regulations. In that study, legal scholars manually classified these 22 properties, creating a dataset of 666 annotated websites. Using machine learning, we trained models to classify these properties based on the code of the forms and emails. However, the models’ performance is not yet optimal, mainly due to the low number of positive samples in the annotated dataset. To minimize false positives, we combine the model’s predictions with the keyword-based detection used in the crawler. The performance of the crawler and the XGBoost ML model is presented below. We also experimented with other ML models, such as TabNet, allowing unsupervised pretraining on all crawled websites, but the results were similar to those of XGBoost at the cost of higher complexity.
The performance of the ML model (XGBoost) and the crawler (Deterministic). The last column represents the portion of training data that are positive samples.
These legal properties navigate the decision trees below in determining the presence of violations in the form interface.
On the left is the decision tree capturing ePrivacy Directive requirements (consent), on the right is the tree that captures the nuances of consent defined by the GDPR.
In addition to form interface violations, we also identify insecure forms (using HTTP without TLS) and registrations where the sender shares the user-provided password in an unencrypted email. Both of these constitute violations of Art. 32 of the GDPR.
On the left are the security threats of registrations on websites, on the right are the violations of the marketing consent requirement in the form.
Finally, we analyze the emails received. We classify the first email from each website as double opt-in (activation emails), marketing, or other servicing email (confirmation emails, policy updates, etc.). Note that double opt-in is an established best practice to satisfy the GDPR requirement of collecting consent from the email recipient, protecting people from unwanted newsletter subscriptions. However, we consider only services that immediately send marketing emails as potentially non-compliant.
To track how websites use email addresses, each registration is performed with a unique email address. Detecting when the website shares the email address with third parties poses a challenge. For example, facebook.com sends emails from facebookmail.com. Hence, the technical definition of a third party (different domain) does not match the legal requirement of a different entity. Our heuristic predicts for each sender domain the following disclosure with 90% accuracy. We take the first of the following outcomes, ordered from the most to the least disclosed. (1) The domain name where we registered and any domains that are similar. (2) The domain of the first received email. (3) Any common email host (e.g., gmail.com) if the name preceding the @ symbol is similar to the registration domain. (4) Any domain declared on the registration page is marked as ‘In form.’ (5) Any common host that was not matched previously as ‘Dis. email host.’ (6) Domains in the privacy policy and terms and conditions are marked as ‘In policy/terms.’ (7) If all these checks fail, the domain is marked as `Undeclared.’
On the left is the statistics of first email classification, on the right are the violations of email sharing.
We are currently engaged in a legal study that aims to map compliance based on various parameters of websites, including location, industry type, and business size.
Additionally, we are eager to collaborate with data protection authorities on (semi-)automated enforcement of compliance with marketing email requirements. However, our ML-based methods still exhibit significant imprecision. False positives have the potential to harm falsely-accused businesses, while false negatives may lead to automated judgments with biases.
The capabilities of our crawler open up possibilities for several previously unfeasible projects. For instance, we can now analyze whether websites respect the unsubscribe action or study the prevalence of tracking by third parties in areas of websites requiring authentication. If you are interested in accessing the crawler, please request access via this form.
Q: I would like to collaborate with you using the crawler for XYZ.
A: We are happy to engage in new prospects building on our research. Please request access to the code or datasets via this form or contact Karel Kubicek by email.
Q: Why there are so few violations and what do you mean by the potential violation?
A: Our study is purposefully conservative in reporting violations to limit false accusations of violations. This still resulted in 37.2% of websites sending emails without valid consent.
We call violations potential for three reasons. First, as a matter of legal formality, only a legal proceeding can determine a violation. Second, while we were conservative in defining the types of potential violations, and our analysis is informed by the relevant statutes, judicial precedent, and articles by legal experts, there remains some legal uncertainty as to how courts will decide specific cases. Third, the ML models are imprecise. While we believe that the aggregated results are still representative, the individual flagged violations are still prone to false positives.
Q: Did you report your findings to regulatory authorities?
A: This is work in progress, we aim to cooperate with regulatory authorities to enforce the privacy that users deserve. If you are a data protection authority interested in our results, we are happy to collaborate.
Q: Can I use your violation detection procedure to scan a website for violations?
A: We will not release the crawler to public due to ethics risks. But we plan to work on a browser extension for auditing the compliance of the registration form using the ML we developed.
Q: My website is having subscription/registration form for B2B (business-to-business). Do these regulations concern me?
A: This aspect is complicated, since ePrivacy Directive differentiates between consumers and businesses, while GDPR does not. Our crawler tries to skip registration to B2B forms, but our manual evaluation still showed some such forms submitted.
The authors would like to thank: