ABSTRACT
Graph datasets with billions of edges, such as social and Web graphs, are prevalent. To be feasible, computation on such large graphs should scale linearly with graph size. All-distances sketches (ADSs) are emerging as a powerful tool for scalable computation of some basic properties of individual nodes or the whole graph.
ADSs were first proposed two decades ago (Cohen 1994) and more recent algorithms include ANF (Palmer, Gibbons, and Faloutsos 2002) and hyperANF (Boldi, Rosa, and Vigna 2011). A sketch of logarithmic size is computed for each node in the graph and the computation in total requires only a near linear number of edge relaxations. From the ADS of a node, we can estimate neighborhood cardinalities (the number of nodes within some query distance) and closeness centrality. More generally we can estimate the distance distribution, effective diameter, similarities, and other parameters of the full graph. We make several contributions which facilitate a more effective use of ADSs for scalable analysis of massive graphs.
We provide, for the first time, a unified exposition of ADS algorithms and applications. We present the Historic Inverse Probability (HIP) estimators which are applied to the ADS of a node to estimate a large natural class of queries including neighborhood cardinalities and closeness centralities. We show that our HIP estimators have at most half the variance of previous neighborhood cardinality estimators and that this is essentially optimal. Moreover, HIP obtains a polynomial improvement over state of the art for more general domain queries and the estimators are simple, flexible, unbiased, and elegant.
The ADS generalizes Min-Hash sketches, used for approximating cardinality (distinct count) on data streams. We obtain lower bounds on Min-Hash cardinality estimation using classic estimation theory. We illustrate the power of HIP, both in terms of ease of application and estimation quality, by comparing it to the HyperLogLog algorithm (Flajolet et al. 2007), demonstrating a significant improvement over this state-of-the-art practical algorithm.
We also study the quality of ADS estimation of distance ranges, generalizing the near-linear time factor-2 approximation of the diameter.
- D. Aingworth, C. Chekuri, P Indyk, and R. Motwani. Fast estimation of diameter and shortest paths (without matrix multiplication). SIAM J. Comput., 28(4):1167--1181, 1999. Google ScholarDigital Library
- T. Akiba, Y. Iwata, and Y. Yoshida. Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In SIGMOD, pages 349--360, 2013. Google ScholarDigital Library
- N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58:137--147, 1999. Google ScholarDigital Library
- L. Backstrom, P. Boldi, M. Rosa, J. Ugander, and S. Vigna. Four degrees of separation. In WebSci, pages 33--42, 2012. Google ScholarDigital Library
- Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM. ACM, 2002. Google ScholarDigital Library
- A. Bavelas. A mathematical model for small group structures. Human Organization, 7:16--30, 1948.Google ScholarCross Ref
- D. Blackwell. Conditional expectation and unbiased sequential estimation. Annals of Mathematical Statistics, 18(1), 1947.Google ScholarCross Ref
- P. Boldi, M. Rosa, and S. Vigna. HyperANF: Approximating the neighbourhood function of very large graphs on a budget. In WWW, 2011. Google ScholarDigital Library
- P. Boldi, M. Rosa, and S. Vigna. Robustness of social networks: Comparative results based on distance distributions. In SocInfo, pages 8--21, 2011. Google ScholarDigital Library
- P. Boldi and S. Vigna. Axioms for centrality. Internet Mathematics, 2014.Google ScholarCross Ref
- K. R. W. Brewer, L. J. Early, and S. F. Joyce. Selecting several samples from a single population. Australian Journal of Statistics, 14(3):231--239, 1972.Google ScholarCross Ref
- A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21--29. IEEE, 1997. Google ScholarDigital Library
- A. Z. Broder. Identifying and filtering near-duplicate documents. In Proc.of the 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848 of LNCS, pages 1--10. Springer, 2000. Google ScholarDigital Library
- E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997. Google ScholarDigital Library
- E. Cohen, D. Delling, F. Fuchs, A. Goldberg, M. Goldszmidt, and R. Werneck. Scalable similarity estimation in social networks: Closeness, node labels, and random edge lengths. In COSN, 2013. Google ScholarDigital Library
- E. Cohen, D. Delling, T. Pajor, and R. Werneck. Influence computation scaled-up in sketch space, 2014. Manuscript.Google Scholar
- E. Cohen and H. Kaplan. Spatially-decaying aggregation over a network: model and algorithms. J. Comput. System Sci., 73:265--288, 2007. Full version of a SIGMOD 2004 paper. Google ScholarDigital Library
- E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In Proceedings of the ACM PODC'07 Conference, 2007. Google ScholarDigital Library
- E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008. Google ScholarDigital Library
- E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. J. Algorithms, 59:19--36, 2006. Google ScholarDigital Library
- P. Crescenzi, R. Grossi, L. Lanzi, and A. Marino. A comparison of three algorithms for approximating the distance distribution in real-world graphs. In TAPAS, 2011. Google ScholarDigital Library
- Ch. Dangalchev. Residual closeness in networks. Phisica A, 365, 2006.Google Scholar
- N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007. Google ScholarDigital Library
- M. Durand and P. Flajolet. Loglog counting of large cardinalities (extended abstract). In ESA, 2003.Google Scholar
- D. Eppstein and J. Wang. Fast approximation of centrality. In SODA, pages 228--229, 2001. Google ScholarDigital Library
- W. Feller. An introduction to probability theory and its applications, volume 2. John Wiley & Sons, New York, 1971.Google Scholar
- P. Flajolet. Approximate counting: A detailed analysis. BIT, 25, 1985. Google ScholarDigital Library
- P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AOFA), 2007.Google ScholarCross Ref
- P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182--209, 1985. Google ScholarDigital Library
- S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013. Google ScholarDigital Library
- D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663--685, 1952.Google ScholarCross Ref
- D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, 2010. Google ScholarDigital Library
- J. C. Kiefer. Introduction to statistical inference. Springer-Verlag, New York, 1987. Google ScholarDigital Library
- E. L. Lehmann and H. Scheffé. Completeness, similar regions, and unbiased estimation. Sankhya, 10(4), 1950.Google Scholar
- P. Li, , K. W. Church, and T. Hastie. One sketch for all: Theory and application of conditional random sampling. In NIPS, 2008.Google ScholarDigital Library
- P. Li, A. B. Owen, and C-H Zhang. One permutation hashing. In NIPS, 2012.Google ScholarDigital Library
- M. H. Malewicz, G.and Austern, A.J.C Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD. ACM, 2010. Google ScholarDigital Library
- R. Morris. Counting large numbers of events in small registers. Comm. ACM, 21, 1977. Google ScholarDigital Library
- E. Ohlsson. Sequential poisson sampling. J. Official Statistics, 14(2):149--162, 1998.Google Scholar
- T. Opsahl, F. Agneessens, and J. Skvoretz. Node centrality in weighted networks: Generalizing degreeand shortest paths. Social Networks, 32, 2010.Google Scholar
- C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: a fast and scalable tool for data mining in massive graphs. In KDD, 2002. Google ScholarDigital Library
- L. Roditty and V. Vassilevska Williams. Fast approximation algorithms for the diameter and radius of sparse graphs. In STOC. ACM, 2013. Google ScholarDigital Library
- B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135--158, 1997.Google ScholarCross Ref
- M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832, 1956.Google ScholarCross Ref
Index Terms
- All-distances sketches, revisited: HIP estimators for massive graphs analysis
Recommendations
All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis
Graph datasets with billions of edges, such as social and web graphs, are prevalent, and scalability is critical. All-distances sketches (ADS) [Cohen 1997], are a powerful tool for scalable approximation of statistics. The sketch is a small size sample of ...
Thomassen's Choosability Argument Revisited
Thomassen (J. Combin. Theory Ser. B, 62 (1994), pp. 180-181) proved that every planar graph is 5-choosable. This result was generalized by Škrekovski (Discrete Math., 190 (1998), pp. 223-226) and He, Miao, and Shen (Discrete Math., 308 (2008), pp. 4024-...
Covariance, subspace, and intrinsic Crame´r-Rao bounds
Crame´r-Rao bounds on estimation accuracy are established for estimation problems on arbitrary manifolds in which no set of intrinsic coordinates exists. The frequently encountered examples of estimating either an unknown subspace or a covariance ...
Comments