Challenges in mining social network data: processes, privacy, and paradoxes

Author:
Jon M. Kleinberg

Cornell University

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2007, Pages 4–5, https://doi.org/10.1145/1281192.1281195

Published: 12 August 2007

ABSTRACT

The proliferation of rich social media, on-line communities, and collectively produced knowledge resources has accelerated the convergence of technological and social networks, producing environments that reflect both the architecture of the underlying information systems and the social structure of their members. In studying the consequences of these developments, we are faced with the opportunity to analyze social network data at unprecedented levels of scale and temporal resolution; this has led to a growing body of research at the intersection of the computing and social sciences.

We discuss some of the current challenges in the analysis of large-scale social network data, focusing on two themes in particular: the inference of social processes from data, and the problem of maintaining individual privacy in studies of social networks. While early research on this type of data focused on structural questions, recent work has extended this to consider the social processes that unfold within the networks. Particular lines of investigation have focused on processes in on-line social systems related to communication [1, 22], community formation [2, 8, 16, 23], information-seeking and collective problem-solving [20, 21, 18], marketing [12, 19, 24, 28], the spread of news [3, 17], and the dynamics of popularity [29]. There are a number of fundamental issues, however, for which we have relatively little understanding, including the extent to which the outcomes of these types of social processes are predictable from their early stages (see e.g. [29]), the differences between properties of individuals and properties of aggregate populations in these types of data, and the extent to which similar social phenomena in different domains have uniform underlying explanations.

The second theme we pursue is concerned with the problem of privacy. While much of the research on large-scale social systems has been carried out on data that is public, some of the richest emerging sources of social interaction data come from settings such as e-mail, instant messaging, or phone communication in which users have strong expectations of privacy. How can such data be made available to researchers while protecting the privacy of the individuals represented in the data? Many of the standard approaches here are variations on the principle of anonymization - the names of individuals are replaced with meaningless unique identifiers, so that the network structure is maintained while private information has been suppressed.
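As a minimal illustration of the anonymization principle described above, the following Python sketch replaces each name with a meaningless unique identifier while leaving the edge structure intact. The function name `anonymize`, the identifier format, and the toy edge list are assumptions made for this example only; they do not come from any dataset or tool discussed here.

```python
import random


def anonymize(edges, seed=None):
    """Replace node names with shuffled opaque ids; keep the edge structure."""
    rng = random.Random(seed)
    names = sorted({name for edge in edges for name in edge})
    ids = [f"n{i}" for i in range(len(names))]
    rng.shuffle(ids)  # ids carry no information about the original names
    pseudonym = dict(zip(names, ids))
    anon_edges = sorted(
        tuple(sorted((pseudonym[u], pseudonym[v]))) for u, v in edges
    )
    return anon_edges, pseudonym


# Hypothetical toy network of four people.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("carol", "dave")]
anon_edges, pseudonym = anonymize(edges, seed=0)
print(anon_edges)
```

The released edge list contains no names, yet every structural property (degrees, triangles, distances) is unchanged, which is what makes the data useful to researchers and, as discussed next, also what an adversary can exploit.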

In recent joint work with Lars Backstrom and Cynthia Dwork, we have identified some fundamental limitations on the power of network anonymization to ensure privacy [7]. In particular, we describe a family of attacks such that even from a single anonymized copy of a social network, it is possible for an adversary to learn whether edges exist or not between specific targeted pairs of nodes. The attacks are based on the uniqueness of small random subgraphs embedded in an arbitrary network, using ideas related to those found in arguments from Ramsey theory [6, 14]. Combined with other recent examples of privacy breaches in data containing rich textual or time-series information [9, 26, 27, 30], these results suggest that anonymization contains pitfalls even in very simple settings. In this way, our approach can be seen as a step toward understanding how techniques of privacy-preserving data mining (see e.g. [4, 5, 10, 11, 13, 15, 25] and the references therein) can inform how we think about the protection of even the most skeletal social network data.

Supplemental Material

p4-kleinberg-200.mov (mov, 125.6 MB)

p4-kleinberg-768.mov (mov, 421 MB)

References

[1] Lada A. Adamic and Eytan Adar. How to search a social network. Social Networks, 27(3):187--203, 2005.
[2] Lada A. Adamic, Orkut Buyukkokten, and Eytan Adar. A social network caught in the web. First Monday, 8(6), 2003.
[3] Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan M. Lukose. Implicit structure and the dynamics of blogspace. In Workshop on the Weblogging Ecosystem, 2004.
[4] Dakshi Agrawal and Charu C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proc. 20th ACM Symposium on Principles of Database Systems, 2001.
[5] Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In Proc. ACM SIGMOD International Conference on Management of Data, pages 439--450, 2000.
[6] Noga Alon and Joel Spencer. The Probabilistic Method. John Wiley & Sons, second edition, 2000.
[7] Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography. In Proc. 16th International World Wide Web Conference, 2007.
[8] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: Membership, growth, and evolution. In Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
[9] Michael Barbaro and Tom Zeller Jr. A face is exposed for AOL searcher no. 4417749. New York Times, 9 August 2006.
[10] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In Proc. 24th ACM Symposium on Principles of Database Systems, pages 128--138, 2005.
[11] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Proc. 22nd ACM Symposium on Principles of Database Systems, pages 202--210, 2003.
[12] Pedro Domingos and Matt Richardson. Mining the network value of customers. In Proc. 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 57--66, 2001.
[13] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proc. 3rd Theory of Cryptography Conference, pages 265--284, 2006.
[14] Paul Erdős. Some remarks on the theory of graphs. Bulletin of the AMS, 53:292--294, 1947.
[15] Alexandre V. Evfimievski, Johannes Gehrke, and Ramakrishnan Srikant. Limiting privacy breaches in privacy preserving data mining. In Proc. 22nd ACM Symposium on Principles of Database Systems, pages 211--222, 2003.
[16] Scott A. Golder, Dennis Wilkinson, and Bernardo A. Huberman. Rhythms of social interaction: Messaging within a massive online network. In Proc. 3rd International Conference on Communities and Technologies, 2007.
[17] Daniel Gruhl, David Liben-Nowell, R. V. Guha, and Andrew Tomkins. Information diffusion through blogspace. In Proc. 13th International World Wide Web Conference, 2004.
[18] Michael Kearns, Siddharth Suri, and Nick Montfort. An experimental study of the coloring problem on human subject networks. Science, 313(5788):824--827, 2006.
[19] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence in a social network. In Proc. 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137--146, 2003.
[20] Jon Kleinberg. Complex networks and decentralized search algorithms. In Proc. International Congress of Mathematicians, 2006.
[21] Jon Kleinberg and Prabhakar Raghavan. Query incentive networks. In Proc. 46th IEEE Symposium on Foundations of Computer Science, pages 132--141, 2005.
[22] Gueorgi Kossinets and Duncan Watts. Empirical analysis of an evolving social network. Science, 311:88--90, 2006.
[23] Ravi Kumar, Jasmine Novak, and Andrew Tomkins. Structure and evolution of online social networks. In Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 611--617, 2006.
[24] Jure Leskovec, Lada Adamic, and Bernardo Huberman. The dynamics of viral marketing. In Proc. 7th ACM Conference on Electronic Commerce, 2006.
[25] Nina Mishra and Mark Sandler. Privacy via pseudorandom sketches. In Proc. 25th ACM Symposium on Principles of Database Systems, pages 143--152, 2006.
[26] Arvind Narayanan and Vitaly Shmatikov. How to break anonymity of the Netflix prize dataset, October 2006. arXiv cs/0610105.
[27] Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. Anti-aliasing on the web. In Proc. 13th International World Wide Web Conference, pages 30--39, 2004.
[28] Matt Richardson and Pedro Domingos. Mining knowledge-sharing sites for viral marketing. In Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 61--70, 2002.
[29] Matthew Salganik, Peter Dodds, and Duncan Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311:854--856, 2006.
[30] Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. J. Law Med. Ethics, 25, 1997.

Index Terms

Challenges in mining social network data: processes, privacy, and paradoxes
1. Theory of computation
  1. Randomness, geometry and discrete structures

Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN:9781595936097
DOI:10.1145/1281192
General Chair: Pavel Berkhin (Yahoo!, USA)
Program Chairs: Rich Caruana (Cornell University, USA), Xindong Wu (University of Vermont, USA)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
anonymization
data mining
diffusion of innovations
privacy in data mining
social networks
Acceptance Rates
KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%. Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%.