ABSTRACT
Browser cookies are ubiquitous in the web ecosystem today. Although cookies were initially introduced to preserve user-specific state in browsers, they are now used for numerous other purposes, including user profiling and tracking across multiple websites. This paper sets out to understand and quantify the different uses of cookies and, in particular, the extent to which targeting and advertising, performance analytics and other uses that serve only the website and not the user add to overall cookie volumes. We start with 31 million cookies collected in Cookiepedia, which is currently the most comprehensive database of cookies on the Web. Cookiepedia provides a useful four-part categorisation of cookies into strictly necessary, performance, functionality and targeting/advertising cookies, as suggested by the UK International Chamber of Commerce. Unfortunately, we found that Cookiepedia data can categorise less than 22% of the cookies used by Alexa Top20K websites and less than 15% of the cookies set in the browsers of a set of real users. These results point to an acute problem with the coverage of current cookie categorisation techniques.
Consequently, we developed CookieMonster, a novel machine learning-driven framework that can categorise a cookie into one of the aforementioned four categories with an F1 score of more than 94% and a latency of less than 1.5 ms. We demonstrate the utility of our framework by classifying cookies in the wild. Our investigation revealed that, in Alexa Top20K websites, necessary and functional cookies constitute only 13.05% and 9.52% of all cookies, respectively. We also apply our framework to quantify the effectiveness of tracking countermeasures such as privacy legislation and ad blockers. Our results identify a way to significantly improve the coverage of cookie classification today and reveal new patterns in the usage of cookies in the wild.