skip to main content
10.1145/2594538.2594546acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article

All-distances sketches, revisited: HIP estimators for massive graphs analysis

Published:18 June 2014Publication History

ABSTRACT

Graph datasets with billions of edges, such as social and Web graphs, are prevalent. To be feasible, computation on such large graphs should scale linearly with graph size. All-distances sketches (ADSs) are emerging as a powerful tool for scalable computation of some basic properties of individual nodes or the whole graph.

ADSs were first proposed two decades ago (Cohen 1994) and more recent algorithms include ANF (Palmer, Gibbons, and Faloutsos 2002) and hyperANF (Boldi, Rosa, and Vigna 2011). A sketch of logarithmic size is computed for each node in the graph and the computation in total requires only a near linear number of edge relaxations. From the ADS of a node, we can estimate neighborhood cardinalities (the number of nodes within some query distance) and closeness centrality. More generally we can estimate the distance distribution, effective diameter, similarities, and other parameters of the full graph. We make several contributions which facilitate a more effective use of ADSs for scalable analysis of massive graphs.

We provide, for the first time, a unified exposition of ADS algorithms and applications. We present the Historic Inverse Probability (HIP) estimators which are applied to the ADS of a node to estimate a large natural class of queries including neighborhood cardinalities and closeness centralities. We show that our HIP estimators have at most half the variance of previous neighborhood cardinality estimators and that this is essentially optimal. Moreover, HIP obtains a polynomial improvement over state of the art for more general domain queries and the estimators are simple, flexible, unbiased, and elegant.

The ADS generalizes Min-Hash sketches, used for approximating cardinality (distinct count) on data streams. We obtain lower bounds on Min-Hash cardinality estimation using classic estimation theory. We illustrate the power of HIP, both in terms of ease of application and estimation quality, by comparing it to the HyperLogLog algorithm (Flajolet et al. 2007), demonstrating a significant improvement over this state-of-the-art practical algorithm.

We also study the quality of ADS estimation of distance ranges, generalizing the near-linear time factor-2 approximation of the diameter.

References

  1. D. Aingworth, C. Chekuri, P Indyk, and R. Motwani. Fast estimation of diameter and shortest paths (without matrix multiplication). SIAM J. Comput., 28(4):1167--1181, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. T. Akiba, Y. Iwata, and Y. Yoshida. Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In SIGMOD, pages 349--360, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58:137--147, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. L. Backstrom, P. Boldi, M. Rosa, J. Ugander, and S. Vigna. Four degrees of separation. In WebSci, pages 33--42, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Bavelas. A mathematical model for small group structures. Human Organization, 7:16--30, 1948.Google ScholarGoogle ScholarCross RefCross Ref
  7. D. Blackwell. Conditional expectation and unbiased sequential estimation. Annals of Mathematical Statistics, 18(1), 1947.Google ScholarGoogle ScholarCross RefCross Ref
  8. P. Boldi, M. Rosa, and S. Vigna. HyperANF: Approximating the neighbourhood function of very large graphs on a budget. In WWW, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. Boldi, M. Rosa, and S. Vigna. Robustness of social networks: Comparative results based on distance distributions. In SocInfo, pages 8--21, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Boldi and S. Vigna. Axioms for centrality. Internet Mathematics, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  11. K. R. W. Brewer, L. J. Early, and S. F. Joyce. Selecting several samples from a single population. Australian Journal of Statistics, 14(3):231--239, 1972.Google ScholarGoogle ScholarCross RefCross Ref
  12. A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21--29. IEEE, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. Z. Broder. Identifying and filtering near-duplicate documents. In Proc.of the 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848 of LNCS, pages 1--10. Springer, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. E. Cohen, D. Delling, F. Fuchs, A. Goldberg, M. Goldszmidt, and R. Werneck. Scalable similarity estimation in social networks: Closeness, node labels, and random edge lengths. In COSN, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. E. Cohen, D. Delling, T. Pajor, and R. Werneck. Influence computation scaled-up in sketch space, 2014. Manuscript.Google ScholarGoogle Scholar
  17. E. Cohen and H. Kaplan. Spatially-decaying aggregation over a network: model and algorithms. J. Comput. System Sci., 73:265--288, 2007. Full version of a SIGMOD 2004 paper. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In Proceedings of the ACM PODC'07 Conference, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. J. Algorithms, 59:19--36, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. P. Crescenzi, R. Grossi, L. Lanzi, and A. Marino. A comparison of three algorithms for approximating the distance distribution in real-world graphs. In TAPAS, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Ch. Dangalchev. Residual closeness in networks. Phisica A, 365, 2006.Google ScholarGoogle Scholar
  23. N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. Durand and P. Flajolet. Loglog counting of large cardinalities (extended abstract). In ESA, 2003.Google ScholarGoogle Scholar
  25. D. Eppstein and J. Wang. Fast approximation of centrality. In SODA, pages 228--229, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. W. Feller. An introduction to probability theory and its applications, volume 2. John Wiley & Sons, New York, 1971.Google ScholarGoogle Scholar
  27. P. Flajolet. Approximate counting: A detailed analysis. BIT, 25, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AOFA), 2007.Google ScholarGoogle ScholarCross RefCross Ref
  29. P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182--209, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663--685, 1952.Google ScholarGoogle ScholarCross RefCross Ref
  32. D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. C. Kiefer. Introduction to statistical inference. Springer-Verlag, New York, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. E. L. Lehmann and H. Scheffé. Completeness, similar regions, and unbiased estimation. Sankhya, 10(4), 1950.Google ScholarGoogle Scholar
  35. P. Li, , K. W. Church, and T. Hastie. One sketch for all: Theory and application of conditional random sampling. In NIPS, 2008.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. P. Li, A. B. Owen, and C-H Zhang. One permutation hashing. In NIPS, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. M. H. Malewicz, G.and Austern, A.J.C Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. R. Morris. Counting large numbers of events in small registers. Comm. ACM, 21, 1977. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. E. Ohlsson. Sequential poisson sampling. J. Official Statistics, 14(2):149--162, 1998.Google ScholarGoogle Scholar
  40. T. Opsahl, F. Agneessens, and J. Skvoretz. Node centrality in weighted networks: Generalizing degreeand shortest paths. Social Networks, 32, 2010.Google ScholarGoogle Scholar
  41. C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: a fast and scalable tool for data mining in massive graphs. In KDD, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. L. Roditty and V. Vassilevska Williams. Fast approximation algorithms for the diameter and radius of sparse graphs. In STOC. ACM, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135--158, 1997.Google ScholarGoogle ScholarCross RefCross Ref
  44. M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832, 1956.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. All-distances sketches, revisited: HIP estimators for massive graphs analysis

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
        June 2014
        300 pages
        ISBN:9781450323758
        DOI:10.1145/2594538
        • General Chair:
        • Richard Hull,
        • Program Chair:
        • Martin Grohe

        Copyright © 2014 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 June 2014

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        PODS '14 Paper Acceptance Rate22of67submissions,33%Overall Acceptance Rate642of2,707submissions,24%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader