ABSTRACT
The proliferation of rich social media, on-line communities, and collectively produced knowledge resources has accelerated the convergence of technological and social networks, producing environments that reflect both the architecture of the underlying information systems and the social structure of their members. In studying the consequences of these developments, we are faced with the opportunity to analyze social network data at unprecedented levels of scale and temporal resolution; this has led to a growing body of research at the intersection of the computing and social sciences.
We discuss some of the current challenges in the analysis of large-scale social network data, focusing on two themes in particular: the inference of social processes from data, and the problem of maintaining individual privacy in studies of social networks. While early research on this type of data focused on structural questions, recent work has extended this to consider the social processes that unfold within the networks. Particular lines of investigation have focused on processes in on-line social systems related to communication [1, 22], community formation [2, 8, 16, 23], information-seeking and collective problem-solving [18, 20, 21], marketing [12, 19, 24, 28], the spread of news [3, 17], and the dynamics of popularity [29]. There are a number of fundamental issues, however, for which we have relatively little understanding, including the extent to which the outcomes of these types of social processes are predictable from their early stages (see e.g. [29]), the differences between properties of individuals and properties of aggregate populations in these types of data, and the extent to which similar social phenomena in different domains have uniform underlying explanations.
The second theme we pursue is concerned with the problem of privacy. While much of the research on large-scale social systems has been carried out on data that is public, some of the richest emerging sources of social interaction data come from settings such as e-mail, instant messaging, or phone communication in which users have strong expectations of privacy. How can such data be made available to researchers while protecting the privacy of the individuals represented in the data? Many of the standard approaches here are variations on the principle of anonymization: the names of individuals are replaced with meaningless unique identifiers, so that the network structure is maintained while private information is suppressed.
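The anonymization step described above can be sketched in a few lines. This is a minimal illustration rather than any particular released pipeline: the function name `anonymize`, the `n<number>` identifier format, and the edge-list input are all assumptions made for the example.

```python
import random

def anonymize(edges, seed=None):
    """Replace node names with meaningless unique identifiers.

    edges: iterable of (u, v) pairs whose endpoints are name strings.
    Returns (anonymized_edges, mapping). The mapping is what the data
    holder would keep secret; only the anonymized edges are released.
    All names here are hypothetical, chosen for illustration.
    """
    edges = list(edges)
    rng = random.Random(seed)
    names = sorted({name for edge in edges for name in edge})
    ids = list(range(len(names)))
    rng.shuffle(ids)  # identifiers carry no information about the names
    mapping = {name: f"n{i}" for name, i in zip(names, ids)}
    anonymized = [(mapping[u], mapping[v]) for u, v in edges]
    return anonymized, mapping

# Example: a three-person path keeps its shape but loses its names.
released, secret = anonymize([("alice", "bob"), ("bob", "carol")], seed=0)
```

The released edge list has exactly the same structure as the original, which is the property the rest of the abstract examines.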
In recent joint work with Lars Backstrom and Cynthia Dwork, we have identified some fundamental limitations on the power of network anonymization to ensure privacy [7]. In particular, we describe a family of attacks such that even from a single anonymized copy of a social network, it is possible for an adversary to learn whether edges exist or not between specific targeted pairs of nodes. The attacks are based on the uniqueness of small random subgraphs embedded in an arbitrary network, using ideas related to those found in arguments from Ramsey theory [6, 14]. Combined with other recent examples of privacy breaches in data containing rich textual or time-series information [9, 26, 27, 30], these results suggest that anonymization contains pitfalls even in very simple settings. In this way, our approach can be seen as a step toward understanding how techniques of privacy-preserving data mining (see e.g. [4, 5, 10, 11, 13, 15, 25] and the references therein) can inform how we think about the protection of even the most skeletal social network data.
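To give a feel for why structure alone can leak information, here is a toy sketch under strong simplifying assumptions. It is not the construction from [7], which relies on small random subgraphs: this example uses a hand-picked triangle planted in an otherwise triangle-free graph, a brute-force search, and a hypothetical helper named `find_embeddings`. The node numbers stand in for the opaque identifiers of a released graph.

```python
from itertools import combinations, permutations

def find_embeddings(adj, k, pattern):
    """Ordered k-tuples of distinct nodes whose internal edges equal `pattern`.

    adj: dict mapping node -> set of neighbours (undirected graph).
    pattern: set of frozenset({i, j}) pairs over positions 0..k-1.
    Brute force, so only practical for tiny graphs.
    """
    matches = []
    for tup in permutations(adj, k):
        if all((tup[j] in adj[tup[i]]) == (frozenset((i, j)) in pattern)
               for i, j in combinations(range(k), 2)):
            matches.append(tup)
    return matches

# Released graph with opaque IDs. In this toy setup, nodes 7, 8, 9 are
# accounts the adversary created before release (a triangle), with 7
# linked to one target (node 1) and 8 linked to the other (node 2).
edges = [(1, 2), (2, 3), (3, 4), (4, 1),   # pre-existing users: a 4-cycle
         (7, 8), (8, 9), (7, 9),           # planted triangle
         (7, 1), (8, 2)]                   # planted nodes -> targets
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

triangle = {frozenset((0, 1)), frozenset((0, 2)), frozenset((1, 2))}
matches = find_embeddings(adj, 3, triangle)
planted = {frozenset(m) for m in matches}  # collapse the 6 orderings

# The pattern occurs on a single node set, so the adversary has
# located its own accounts inside the anonymized graph.
(found,) = planted
targets = {n for p in found for n in adj[p]} - found
a, b = sorted(targets)
edge_exists = b in adj[a]  # does the targeted pair share an edge?
```

Because the triangle appears nowhere else in the graph, finding it pins down the planted accounts, and their outside neighbours are the targets. A triangle is symmetric, so the adversary cannot tell the two targets apart here, but it can still read off whether they are connected.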
References
- Lada A. Adamic and Eytan Adar. How to search a social network. Social Networks, 27(3):187--203, 2005.
- Lada A. Adamic, Orkut Buyukkokten, and Eytan Adar. A social network caught in the web. First Monday, 8(6), 2003.
- Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan M. Lukose. Implicit structure and the dynamics of blogspace. In Workshop on the Weblogging Ecosystem, 2004.
- Dakshi Agrawal and Charu C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proc. 20th ACM Symposium on Principles of Database Systems, 2001.
- Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In Proc. ACM SIGMOD International Conference on Management of Data, pages 439--450, 2000.
- Noga Alon and Joel Spencer. The Probabilistic Method. John Wiley & Sons, second edition, 2000.
- Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography. In Proc. 16th International World Wide Web Conference, 2007.
- Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: Membership, growth, and evolution. In Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
- Michael Barbaro and Tom Zeller Jr. A face is exposed for AOL searcher no. 4417749. New York Times, 9 August 2006.
- Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In Proc. 24th ACM Symposium on Principles of Database Systems, pages 128--138, 2005.
- Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Proc. 22nd ACM Symposium on Principles of Database Systems, pages 202--210, 2003.
- Pedro Domingos and Matt Richardson. Mining the network value of customers. In Proc. 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 57--66, 2001.
- Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proc. 3rd Theory of Cryptography Conference, pages 265--284, 2006.
- Paul Erdös. Some remarks on the theory of graphs. Bulletin of the AMS, 53:292--294, 1947.
- Alexandre V. Evfimievski, Johannes Gehrke, and Ramakrishnan Srikant. Limiting privacy breaches in privacy preserving data mining. In Proc. 22nd ACM Symposium on Principles of Database Systems, pages 211--222, 2003.
- Scott A. Golder, Dennis Wilkinson, and Bernardo A. Huberman. Rhythms of social interaction: Messaging within a massive online network. In Proc. 3rd International Conference on Communities and Technologies, 2007.
- Daniel Gruhl, David Liben-Nowell, R. V. Guha, and Andrew Tomkins. Information diffusion through blogspace. In Proc. 13th International World Wide Web Conference, 2004.
- Michael Kearns, Siddharth Suri, and Nick Montfort. An experimental study of the coloring problem on human subject networks. Science, 313(5788):824--827, 2006.
- David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence in a social network. In Proc. 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137--146, 2003.
- Jon Kleinberg. Complex networks and decentralized search algorithms. In Proc. International Congress of Mathematicians, 2006.
- Jon Kleinberg and Prabhakar Raghavan. Query incentive networks. In Proc. 46th IEEE Symposium on Foundations of Computer Science, pages 132--141, 2005.
- Gueorgi Kossinets and Duncan Watts. Empirical analysis of an evolving social network. Science, 311:88--90, 2006.
- Ravi Kumar, Jasmine Novak, and Andrew Tomkins. Structure and evolution of online social networks. In Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 611--617, 2006.
- Jure Leskovec, Lada Adamic, and Bernardo Huberman. The dynamics of viral marketing. In Proc. 7th ACM Conference on Electronic Commerce, 2006.
- Nina Mishra and Mark Sandler. Privacy via pseudorandom sketches. In Proc. 25th ACM Symposium on Principles of Database Systems, pages 143--152, 2006.
- Arvind Narayanan and Vitaly Shmatikov. How to break anonymity of the Netflix Prize dataset, October 2006. arxiv cs/0610105.
- Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. Anti-aliasing on the web. In Proc. 13th International World Wide Web Conference, pages 30--39, 2004.
- Matt Richardson and Pedro Domingos. Mining knowledge-sharing sites for viral marketing. In Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 61--70, 2002.
- Matthew Salganik, Peter Dodds, and Duncan Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311:854--856, 2006.
- Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. J. Law Med. Ethics, 25, 1997.
Index Terms
- Challenges in mining social network data: processes, privacy, and paradoxes