
Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

Authors Info & Claims
Published:22 July 2015Publication History
Skip Abstract Section

Abstract

An integrated framework for density-based cluster analysis, outlier detection, and data visualization is introduced in this article. The main module consists of an algorithm to compute hierarchical estimates of the level sets of a density, following Hartigan’s classic model of density-contour clusters and trees. Such an algorithm generalizes and improves existing density-based clustering techniques with respect to different aspects. As a result, it provides a complete clustering hierarchy composed of all possible density-based clusters that follow the adopted nonparametric model, for an infinite range of density thresholds. The resulting hierarchy can be easily processed to support multiple forms of data visualization and exploration. It can also be further postprocessed so that: (i) a normalized score of “outlierness” can be assigned to each data object, which unifies both the global and local perspectives of outliers into a single definition; and (ii) a “flat” (i.e., nonhierarchical) clustering solution composed of clusters extracted from local cuts through the cluster tree (possibly corresponding to different density thresholds) can be obtained, either in an unsupervised or in a semisupervised way. In the unsupervised scenario, the algorithm corresponding to this postprocessing module provides a global, optimal solution to the formal problem of maximizing the overall stability of the extracted clusters. If partially labeled objects or instance-level constraints are provided by the user, the algorithm can solve the problem by considering both constraint violations/satisfactions and cluster stability criteria. An asymptotic complexity analysis, in terms of both running time and memory space, is described. Experiments are reported that involve a variety of synthetic and real datasets, including comparisons with state-of-the-art density-based clustering and (global and local) outlier detection methods.
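As a concrete illustration of the workflow the abstract describes (building the density-based cluster hierarchy, extracting a flat clustering from local cuts, and assigning each object an outlierness score), the minimal sketch below uses the third-party Python package hdbscan, which provides an implementation in this spirit. The package name, parameter, and attribute names (min_cluster_size, labels_, outlier_scores_) are assumptions about that library's interface, not notation taken from the article itself.

    # Minimal sketch, assuming the third-party `hdbscan` package is installed
    # (pip install hdbscan); attribute names follow that package, not the article.
    import numpy as np
    import hdbscan

    rng = np.random.default_rng(0)
    # Two Gaussian blobs of different density plus uniform background noise.
    X = np.vstack([
        rng.normal(loc=0.0, scale=0.3, size=(200, 2)),   # dense blob
        rng.normal(loc=3.0, scale=0.8, size=(200, 2)),   # sparser blob
        rng.uniform(low=-2.0, high=5.0, size=(40, 2)),   # background noise
    ])

    clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
    clusterer.fit(X)

    labels = clusterer.labels_           # flat clustering from local cuts; -1 marks noise
    scores = clusterer.outlier_scores_   # normalized per-object "outlierness" score

    print("clusters found:        ", labels.max() + 1)
    print("objects labeled noise: ", int((labels == -1).sum()))
    print("highest outlier scores:", np.sort(scores)[-5:])

The cluster hierarchy behind these results is also exposed by that package (e.g., as a condensed_tree_ attribute with plotting support), which corresponds to the visualization and exploration uses mentioned above; again, this reflects that library's interface rather than anything prescribed by the article.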



• Published in

  ACM Transactions on Knowledge Discovery from Data, Volume 10, Issue 1
  July 2015, 321 pages
  ISSN: 1556-4681
  EISSN: 1556-472X
  DOI: 10.1145/2808688

          Copyright © 2015 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 22 July 2015
          • Accepted: 1 February 2015
          • Revised: 1 December 2014
          • Received: 1 July 2013
Published in TKDD Volume 10, Issue 1


          Qualifiers

          • research-article
          • Research
          • Refereed
