research-article

All-distances sketches, revisited: HIP estimators for massive graphs analysis

Author:
Edith Cohen

Microsoft Research, Mountain View, CA, USA

Microsoft Research, Mountain View, CA, USA
View Profile

PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsJune 2014Pages 88–99https://doi.org/10.1145/2594538.2594546

Published:18 June 2014Publication History

PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Pages 88–99

ABSTRACT

Graph datasets with billions of edges, such as social and Web graphs, are prevalent. To be feasible, computation on such large graphs should scale linearly with graph size. All-distances sketches (ADSs) are emerging as a powerful tool for scalable computation of some basic properties of individual nodes or the whole graph.

ADSs were first proposed two decades ago (Cohen 1994) and more recent algorithms include ANF (Palmer, Gibbons, and Faloutsos 2002) and hyperANF (Boldi, Rosa, and Vigna 2011). A sketch of logarithmic size is computed for each node in the graph and the computation in total requires only a near linear number of edge relaxations. From the ADS of a node, we can estimate neighborhood cardinalities (the number of nodes within some query distance) and closeness centrality. More generally we can estimate the distance distribution, effective diameter, similarities, and other parameters of the full graph. We make several contributions which facilitate a more effective use of ADSs for scalable analysis of massive graphs.

We provide, for the first time, a unified exposition of ADS algorithms and applications. We present the Historic Inverse Probability (HIP) estimators which are applied to the ADS of a node to estimate a large natural class of queries including neighborhood cardinalities and closeness centralities. We show that our HIP estimators have at most half the variance of previous neighborhood cardinality estimators and that this is essentially optimal. Moreover, HIP obtains a polynomial improvement over state of the art for more general domain queries and the estimators are simple, flexible, unbiased, and elegant.

The ADS generalizes Min-Hash sketches, used for approximating cardinality (distinct count) on data streams. We obtain lower bounds on Min-Hash cardinality estimation using classic estimation theory. We illustrate the power of HIP, both in terms of ease of application and estimation quality, by comparing it to the HyperLogLog algorithm (Flajolet et al. 2007), demonstrating a significant improvement over this state-of-the-art practical algorithm.

We also study the quality of ADS estimation of distance ranges, generalizing the near-linear time factor-2 approximation of the diameter.

References

D. Aingworth, C. Chekuri, P Indyk, and R. Motwani. Fast estimation of diameter and shortest paths (without matrix multiplication). SIAM J. Comput., 28(4):1167--1181, 1999. Google ScholarDigital Library
T. Akiba, Y. Iwata, and Y. Yoshida. Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In SIGMOD, pages 349--360, 2013. Google ScholarDigital Library
N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58:137--147, 1999. Google ScholarDigital Library
L. Backstrom, P. Boldi, M. Rosa, J. Ugander, and S. Vigna. Four degrees of separation. In WebSci, pages 33--42, 2012. Google ScholarDigital Library
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM. ACM, 2002. Google ScholarDigital Library
A. Bavelas. A mathematical model for small group structures. Human Organization, 7:16--30, 1948.Google ScholarCross Ref
D. Blackwell. Conditional expectation and unbiased sequential estimation. Annals of Mathematical Statistics, 18(1), 1947.Google ScholarCross Ref
P. Boldi, M. Rosa, and S. Vigna. HyperANF: Approximating the neighbourhood function of very large graphs on a budget. In WWW, 2011. Google ScholarDigital Library
P. Boldi, M. Rosa, and S. Vigna. Robustness of social networks: Comparative results based on distance distributions. In SocInfo, pages 8--21, 2011. Google ScholarDigital Library
P. Boldi and S. Vigna. Axioms for centrality. Internet Mathematics, 2014.Google ScholarCross Ref
K. R. W. Brewer, L. J. Early, and S. F. Joyce. Selecting several samples from a single population. Australian Journal of Statistics, 14(3):231--239, 1972.Google ScholarCross Ref
A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21--29. IEEE, 1997. Google ScholarDigital Library
A. Z. Broder. Identifying and filtering near-duplicate documents. In Proc.of the 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848 of LNCS, pages 1--10. Springer, 2000. Google ScholarDigital Library
E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997. Google ScholarDigital Library
E. Cohen, D. Delling, F. Fuchs, A. Goldberg, M. Goldszmidt, and R. Werneck. Scalable similarity estimation in social networks: Closeness, node labels, and random edge lengths. In COSN, 2013. Google ScholarDigital Library
E. Cohen, D. Delling, T. Pajor, and R. Werneck. Influence computation scaled-up in sketch space, 2014. Manuscript.Google Scholar
E. Cohen and H. Kaplan. Spatially-decaying aggregation over a network: model and algorithms. J. Comput. System Sci., 73:265--288, 2007. Full version of a SIGMOD 2004 paper. Google ScholarDigital Library
E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In Proceedings of the ACM PODC'07 Conference, 2007. Google ScholarDigital Library
E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008. Google ScholarDigital Library
E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. J. Algorithms, 59:19--36, 2006. Google ScholarDigital Library
P. Crescenzi, R. Grossi, L. Lanzi, and A. Marino. A comparison of three algorithms for approximating the distance distribution in real-world graphs. In TAPAS, 2011. Google ScholarDigital Library
Ch. Dangalchev. Residual closeness in networks. Phisica A, 365, 2006.Google Scholar
N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007. Google ScholarDigital Library
M. Durand and P. Flajolet. Loglog counting of large cardinalities (extended abstract). In ESA, 2003.Google Scholar
D. Eppstein and J. Wang. Fast approximation of centrality. In SODA, pages 228--229, 2001. Google ScholarDigital Library
W. Feller. An introduction to probability theory and its applications, volume 2. John Wiley & Sons, New York, 1971.Google Scholar
P. Flajolet. Approximate counting: A detailed analysis. BIT, 25, 1985. Google ScholarDigital Library
P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AOFA), 2007.Google ScholarCross Ref
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182--209, 1985. Google ScholarDigital Library
S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013. Google ScholarDigital Library
D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663--685, 1952.Google ScholarCross Ref
D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, 2010. Google ScholarDigital Library
J. C. Kiefer. Introduction to statistical inference. Springer-Verlag, New York, 1987. Google ScholarDigital Library
E. L. Lehmann and H. Scheffé. Completeness, similar regions, and unbiased estimation. Sankhya, 10(4), 1950.Google Scholar
P. Li, , K. W. Church, and T. Hastie. One sketch for all: Theory and application of conditional random sampling. In NIPS, 2008.Google ScholarDigital Library
P. Li, A. B. Owen, and C-H Zhang. One permutation hashing. In NIPS, 2012.Google ScholarDigital Library
M. H. Malewicz, G.and Austern, A.J.C Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD. ACM, 2010. Google ScholarDigital Library
R. Morris. Counting large numbers of events in small registers. Comm. ACM, 21, 1977. Google ScholarDigital Library
E. Ohlsson. Sequential poisson sampling. J. Official Statistics, 14(2):149--162, 1998.Google Scholar
T. Opsahl, F. Agneessens, and J. Skvoretz. Node centrality in weighted networks: Generalizing degreeand shortest paths. Social Networks, 32, 2010.Google Scholar
C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: a fast and scalable tool for data mining in massive graphs. In KDD, 2002. Google ScholarDigital Library
L. Roditty and V. Vassilevska Williams. Fast approximation algorithms for the diameter and radius of sparse graphs. In STOC. ACM, 2013. Google ScholarDigital Library
B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135--158, 1997.Google ScholarCross Ref
M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832, 1956.Google ScholarCross Ref

Index Terms

All-distances sketches, revisited: HIP estimators for massive graphs analysis
1. Information systems
  1. Data management systems
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory

Recommendations

All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis
Graph datasets with billions of edges, such as social and web graphs, are prevalent, and scalability is critical. All-distances sketches (ADS) [Cohen 1997], are a powerful tool for scalable approximation of statistics. The sketch is a small size sample of ...
Read More
Thomassen's Choosability Argument Revisited

Thomassen (J. Combin. Theory Ser. B, 62 (1994), pp. 180-181) proved that every planar graph is 5-choosable. This result was generalized by Škrekovski (Discrete Math., 190 (1998), pp. 223-226) and He, Miao, and Shen (Discrete Math., 308 (2008), pp. 4024-...
Read More
Covariance, subspace, and intrinsic Crame&acute;r-Rao bounds

Crame´r-Rao bounds on estimation accuracy are established for estimation problems on arbitrary manifolds in which no set of intrinsic coordinates exists. The frequently encountered examples of estimating either an unknown subspace or a covariance ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2014
300 pages
ISBN:9781450323758
DOI:10.1145/2594538
General Chair:
Richard Hull
IBM T.J. Watson Research Center, USA
,
Program Chair:
Martin Grohe
RWTH Aachen University, Germany
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 June 2014
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
all-distances sketch
approximate distinct counters
closeness centrality
estimation
min-hash sketch
Qualifiers
- research-article
Conference

Acceptance Rates
PODS '14 Paper Acceptance Rate22of67submissions,33%Overall Acceptance Rate642of2,707submissions,24%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 29
  Total Citations
  View Citations
- 333
  Total Downloads
- Downloads (Last 12 months)28
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

All-distances sketches, revisited: HIP estimators for massive graphs analysis

PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis

Thomassen's Choosability Argument Revisited

Covariance, subspace, and intrinsic Crame&acute;r-Rao bounds