Posts for my papers

Multilingual Scraper of Privacy Policies and Terms of Service
- Authors: David Bernhard, Luka Nenadic, Stefan Bechtold, Karel Kubicek
- CSLAW
- Abstract: We developed a scraper for privacy policies and terms of service, designed for long-term data collection over a multi-year period. This scraper supports 37 languages, addressing the systematic understudy of non-English websites. We crawl nearly 1 million websites monthly, sampled from various countries and popularity ranks using the Chrome UX Report list, which closely represents real user behavior and reduces sampling bias. This project provides an unprecedented dataset for empirical research in privacy policies and terms of service, enabling comprehensive studies of legal documents across different languages and regions.
Automating Website Registration for Studying GDPR Compliance
- Authors: Karel Kubicek, Jakob Merane, Ahmed Bouhoula, David Basin
- The Web Conference, WWW, 2024.
- Abstract: Investigating how websites use sensitive user data is an active research area. However, research based on automated measurements has been limited to those websites that do not require user authentication. To overcome this limitation, we developed a crawler that automates website registrations and newsletter subscriptions and detects both security and privacy threats at scale. We demonstrate our crawler's capabilities by running it on 660k websites. We use this to identify security and privacy threats and to contextualize them within EU laws, namely the General Data Protection Regulation and ePrivacy Directive. Our methods detect private data collection over insecure HTTP connections and websites sending emails with user-provided passwords. We are also the first to apply machine learning to web forms, assessing violations of marketing consent collection requirements. Overall, we find that 37.2% of websites send marketing emails without proper user consent. This is mostly caused by websites failing both to verify and store consent adequately. Additionally, 1.8% of websites share users' email addresses with third parties without a transparent disclosure.
PhD thesis: Automated Analysis and Enforcement of Consent Compliance
- Authors: Karel Kubicek
- ETH Zurich
- Abstract: The collection and processing of personal data by websites have, due to their ubiquity, become subject to privacy regulations. In the European Union (EU), the ePrivacy Directive mandates that any data collection not strictly necessary for service provision must be consented to by users. The General Data Protection Regulation (GDPR) further stipulates that such consent must be freely given, specific, informed, and unambiguous. Personal data is any information identifying individuals, yet our primary focus lies on email addresses and browsing behavior collected by cookies. We evaluate the extent to which the EU’s privacy regulations protect users from unsolicited emails. First, we define the legal properties of registration and newsletter sign-up processes. The combination of these properties forms a decision tree for predicting potential legal violations. We evaluate these violations on a dataset of 1000 websites that we annotated with these legal properties. Second, we train machine learning models on this annotated dataset, and together with our crawler automating the sign-up process, we scale our analysis to cover 660 202 websites. We report observations from both the manual and automated analyses, identifying websites sending marketing emails without obtaining valid consent during registration or sharing users’ email addresses with third parties. We also examine how websites request consent for non-essential cookies. By investigating a sample of 29 398 websites featuring specific cookie notices, we identify eight potential consent violations in a surprising 94.7% of these websites. These violations stem from discrepancies between the information declared in the consent notice and the actual usage of cookies on the website or from the usage of tracking cookies prior to or despite the user’s negative consent. Given the high prevalence of privacy violations, we propose automated methods for enforcing privacy compliance. We developed a browser extension, CookieBlock, which employs machine learning for client-side enforcement of cookie consent. Using an XGBoost model which attains competitive accuracy compared to domain experts, CookieBlock categorizes cookies by their usage purpose and filters them according to user preferences. Finally, we suggest utilizing the automated violation detection methods from both the email and cookie projects for notification campaigns to enhance website operators’ awareness of privacy regulations, or for direct enforcement by data protection authorities.
Block Cookies, Not Websites: Analysing Mental Models and Usability of the Privacy-Preserving Browser Extension CookieBlock
- Authors: Lorin Schöni, Karel Kubicek, Verena Zimmermann
- To appear in Proceedings on Privacy Enhancing Technologies, PETS, 2024.
- Abstract: In the modern web, users are confronted with a plethora of complex privacy-related decisions about cookies and consent, often compounded by misleading policies and deceptive patterns. Past efforts to enhance online privacy have failed due to their dependence on website compliance. A solution to this lies in privacy-enhancing tools that are directly controlled by the user. However, challenges related to the usability and flawed understanding of the tools' functionality hinder their widespread adoption. To address this problem, we evaluated the browser extension CookieBlock as an example of a current tool, which supports users by blocking tracking cookies independent of website compliance. We used a complementary approach consisting of an expert evaluation of CookieBlock and the related tools NoScript and Ghostery, and a laboratory user study focusing on the unique details of how users interact with CookieBlock specifically. The laboratory study with 42 participants investigated usage, mental models, and usability of CookieBlock based on eye tracking, interaction, and self-report data. While CookieBlock received good usability ratings, 18 participants were unable to solve a website breakage caused by cookie misclassification on their own. Overall, the results revealed flawed mental models of CookieBlock's functionality and resulting challenges in making the connection between website breakage and cookie misclassification. Implications for CookieBlock and related applications include interface design recommendations supporting accurate mental models and the proposal of improved heuristics to better guide users and warn them about potential identified website breakage.
Locality-Sensitive Hashing Does Not Guarantee Privacy! Attacks on Google’s FLoC and the MinHash Hierarchy System
- Authors: Florian Turati, Karel Kubicek, Carlos Cotrini, David Basin
- Submitted to Proceedings on Privacy Enhancing Technologies, PETS, 2023.
- Abstract: Recently proposed systems aim at achieving privacy using locality-sensitive hashing. We show how these approaches fail by presenting attacks against two such systems: Google’s FLoC proposal for privacy-preserving targeted advertising and the MinHash Hierarchy, a system for processing location trajectories in a privacy-preserving way. Our attacks refute the pre-image resistance, anonymity, and privacy guarantees claimed for these systems. In the case of FLoC, we show how to deanonymize users using Sybil attacks and to reconstruct 10% or more of the browsing history for 30% of its users using Generative Adversarial Networks. We achieve this by analyzing only the hashes used by FLoC. For MinHash, we precisely identify the location trajectory of a subset of individuals and, on average, we can limit users’ trajectory to just 10% of the possible geographic area, again using just the hashes. In addition, we refute their differential privacy claims.
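For readers unfamiliar with locality-sensitive hashing, the following is a minimal sketch of plain MinHash, one textbook scheme of this kind. It is not the paper's attack and not the exact hashing used by FLoC or the MinHash Hierarchy. The function names, the SHA-256-based hash family, and the sample domain names are assumptions chosen for illustration. The sketch only shows the property that makes such hashes informative: similar inputs yield similar signatures, so a signature reveals something about its input.

```python
import hashlib


def minhash_signature(items, num_hashes=64):
    """Return a MinHash signature of a non-empty set of strings.

    For each of `num_hashes` seeded hash functions (an illustrative
    SHA-256-based family), keep the minimum hash value over the set.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical browsing histories (placeholder domains).
    alice = {"news.example", "shop.example", "mail.example", "bank.example"}
    bob = {"news.example", "shop.example", "mail.example", "games.example"}
    carol = {"a.test", "b.test", "c.test", "d.test"}

    sig_alice = minhash_signature(alice)
    sig_bob = minhash_signature(bob)
    sig_carol = minhash_signature(carol)

    print("exact  alice/bob:  ", jaccard(alice, bob))
    print("approx alice/bob:  ", estimated_jaccard(sig_alice, sig_bob))
    print("approx alice/carol:", estimated_jaccard(sig_alice, sig_carol))
```

Because signatures of overlapping sets agree in many positions, an observer who sees only the signatures can already tell which inputs are alike. That similarity-preserving behavior is by design, and it is the reason a locality-sensitive hash should not be assumed to hide its input the way a cryptographic hash is expected to.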
Checking Websites' GDPR Consent Compliance for Marketing Emails
- Authors: Karel Kubicek, Jakob Merane, Carlos Cotrini, Alexander Stremitzer, Stefan Bechtold, David Basin
- Proceedings on Privacy Enhancing Technologies, PETS, 2022.
- Abstract: The sending of marketing emails is regulated to protect users from unsolicited emails. For instance, the European Union's ePrivacy Directive states that marketers must obtain users' prior consent, and the General Data Protection Regulation (GDPR) specifies further that such consent must be freely given, specific, informed, and unambiguous. Based on these requirements, we design a labeling of legal characteristics for websites and emails. This leads to a simple decision procedure that detects potential legal violations. Using our procedure, we evaluated 1000 websites and the 5000 emails resulting from registering to these websites. Both datasets and evaluations are available upon request. We find that 21.9% of the websites contain potential violations of privacy and unfair competition rules, either in the registration process (17.3%) or email communication (17.7%). We demonstrate with a statistical analysis the possibility of automatically detecting such potential violations.
Automating Cookie Consent and GDPR Violation Detection
- Authors: Dino Bollinger, Karel Kubicek, Carlos Cotrini, David Basin
- 31st USENIX Security Symposium (UsenixSec'2022), USENIX, 2022.
- Abstract: The European Union's General Data Protection Regulation (GDPR) requires websites to inform users about personal data collection and request consent for cookies. Yet the majority of websites do not give users any choices, and others attempt to deceive them into accepting all cookies. We document the severity of this situation through an analysis of potential GDPR violations in cookie banners in almost 30k websites. We identify six novel violation types, such as incorrect category assignments and misleading expiration times, and we find at least one potential violation in a surprising 94.7% of the analyzed websites. We address this issue by giving users the power to protect their privacy. We develop a browser extension, called CookieBlock, that uses machine learning to enforce GDPR cookie consent at the client. It automatically categorizes cookies by usage purpose using only the information provided in the cookie itself. At a mean validation accuracy of 84.4%, our model attains a prediction quality competitive with expert knowledge in the field. Additionally, our approach differs from prior work by not relying on the cooperation of websites themselves. We empirically evaluate CookieBlock on a set of 100 randomly sampled websites, on which it filters roughly 90% of the privacy-invasive cookies without significantly impairing website functionality.