DOI: 10.1145/3447535.3462509
research-article

CCCC: Corralling Cookies into Categories with CookieMonster

Published: 22 June 2021

ABSTRACT

Browser cookies are ubiquitous in the web ecosystem today. Although cookies were initially introduced to preserve user-specific state in browsers, they are now used for numerous other purposes, including user profiling and tracking across multiple websites. This paper sets out to understand and quantify the different uses of cookies and, in particular, the extent to which targeting and advertising, performance analytics and other uses that serve only the website rather than the user add to overall cookie volumes. We start with 31 million cookies collected in Cookiepedia, which is currently the most comprehensive database of cookies on the Web. Cookiepedia provides a useful four-part categorisation of cookies into strictly necessary, performance, functionality and targeting/advertising cookies, as suggested by the UK International Chamber of Commerce. Unfortunately, we found that Cookiepedia data can categorise less than 22% of the cookies used by Alexa Top20K websites and less than 15% of the cookies set in the browsers of a set of real users. These results point to an acute problem with the coverage of current cookie categorisation techniques.
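As a rough illustration of what "coverage" means here, the sketch below computes the share of observed cookies that a lookup database can assign to a category. It is only a toy under stated assumptions: matching by exact cookie name, the `coverage` helper, and the cookie names and database entries are all hypothetical and are not taken from the paper or from Cookiepedia.

```python
from typing import Iterable, Mapping


def coverage(observed_names: Iterable[str], database: Mapping[str, str]) -> float:
    """Fraction of observed cookies whose name appears in the database.

    Assumption: a cookie counts as categorised when its exact name is a key
    in the database. The matching rule used in the paper may differ.
    """
    names = list(observed_names)
    if not names:
        return 0.0
    known = sum(1 for name in names if name in database)
    return known / len(names)


# Hypothetical database mapping cookie names to one of the four categories.
toy_db = {
    "session_id": "strictly necessary",
    "analytics_id": "performance",
}

# Hypothetical cookies seen during a crawl; repeats count as separate cookies.
observed = ["session_id", "analytics_id", "xk_91", "promo_tag", "analytics_id"]

print(f"coverage: {coverage(observed, toy_db):.0%}")
```

A low value from a computation of this kind is what the abstract describes as a coverage problem: most cookies seen in practice have no entry to look up.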

Consequently, we developed CookieMonster, a novel machine learning-driven framework which can categorise a cookie into one of the aforementioned four categories with an F1 score above 94% and a latency below 1.5 ms. We demonstrate the utility of our framework by classifying cookies in the wild. Our investigation revealed that, in Alexa Top20K websites, necessary and functional cookies constitute only 13.05% and 9.52% of all cookies respectively. We also apply our framework to quantify the effectiveness of tracking countermeasures such as privacy legislation and ad blockers. Our results identify a way to significantly improve the coverage of cookie classification today, and reveal new patterns in the usage of cookies in the wild.
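For readers unfamiliar with the metric, the following minimal sketch shows how an F1 score can be computed for a four-category classifier. The abstract does not say how the reported score is averaged across categories, so the macro average used here is an assumption, and the labels and predictions are made-up toy data, not results from the paper.

```python
from collections import Counter
from typing import Dict, Sequence

CATEGORIES = (
    "strictly necessary",
    "performance",
    "functionality",
    "targeting/advertising",
)


def f1_per_class(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, float]:
    """Per-class F1 = 2*TP / (2*TP + FP + FN), with 0.0 for an unseen class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> float:
    """Unweighted mean of the per-class F1 scores (an assumed averaging choice)."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy ground truth and predictions, for illustration only.
N, P, F, T = CATEGORIES
y_true = [N, N, P, P, F, T, T, T]
y_pred = [N, P, P, P, F, T, T, F]

print(f1_per_class(y_true, y_pred, CATEGORIES))
print(round(macro_f1(y_true, y_pred, CATEGORIES), 4))
```

Macro averaging weights every category equally, which matters when categories are imbalanced; a weighted or micro average would give a different number on the same predictions.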

Supplemental Material

PS6.1_XuehuiHu_CCCC_CorrallingCookies_intoCategories_with_CookieMonster.mp4 (mp4, 72.1 MB)

Published in

    WebSci '21: Proceedings of the 13th ACM Web Science Conference 2021
    June 2021
    328 pages
    ISBN: 9781450383301
    DOI: 10.1145/3447535

    Copyright © 2021 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall acceptance rate: 218 of 875 submissions, 25%
