research-article

Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

Authors:
Ricardo J. G. B. Campello

Department of Computer Sciences, University of São Paulo, Brazil

Department of Computer Sciences, University of São Paulo, Brazil
View Profile

,
Davoud Moulavi

Department of Computing Science, University of Alberta, Canada

Department of Computing Science, University of Alberta, Canada
View Profile

,
Arthur Zimek

Ludwig-Maximilians-Universität München, Germany

Ludwig-Maximilians-Universität München, Germany
View Profile

,
Jörg Sander

Department of Computing Science, University of Alberta, Canada

Department of Computing Science, University of Alberta, Canada
View Profile

Authors Info & Claims

ACM Transactions on Knowledge Discovery from Data Volume 10 Issue 1Article No.: 5pp 1–51https://doi.org/10.1145/2733381

Published:22 July 2015Publication History

ACM Transactions on Knowledge Discovery from Data

Abstract

An integrated framework for density-based cluster analysis, outlier detection, and data visualization is introduced in this article. The main module consists of an algorithm to compute hierarchical estimates of the level sets of a density, following Hartigan’s classic model of density-contour clusters and trees. Such an algorithm generalizes and improves existing density-based clustering techniques with respect to different aspects. It provides as a result a complete clustering hierarchy composed of all possible density-based clusters following the nonparametric model adopted, for an infinite range of density thresholds. The resulting hierarchy can be easily processed so as to provide multiple ways for data visualization and exploration. It can also be further postprocessed so that: (i) a normalized score of “outlierness” can be assigned to each data object, which unifies both the global and local perspectives of outliers into a single definition; and (ii) a “flat” (i.e., nonhierarchical) clustering solution composed of clusters extracted from local cuts through the cluster tree (possibly corresponding to different density thresholds) can be obtained, either in an unsupervised or in a semisupervised way. In the unsupervised scenario, the algorithm corresponding to this postprocessing module provides a global, optimal solution to the formal problem of maximizing the overall stability of the extracted clusters. If partially labeled objects or instance-level constraints are provided by the user, the algorithm can solve the problem by considering both constraints violations/satisfactions and cluster stability criteria. An asymptotic complexity analysis, both in terms of running time and memory space, is described. Experiments are reported that involve a variety of synthetic and real datasets, including comparisons with state-of-the-art, density-based clustering and (global and local) outlier detection methods.

References

N. Abe, B. Zadrozny, and J. Langford. 2006. Outlier detection by active learning. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).Philadelphia, PA. 504--509. DOI:http://dx.doi.org/10.1145/1150402.1150459 Google ScholarDigital Library
E. Achtert, H.-P. Kriegel, E. Schubert, and A. Zimek. 2013. Interactive data mining with 3D-parallel-coordinate-trees. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). New York City, NY. 1009--1012. DOI:http://dx.doi.org/10.1145/2463676.2463696 Google ScholarDigital Library
C. C. Aggarwal and P. S. Yu. 2001. Outlier detection for high dimensional data. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Santa Barbara, CA. 37--46. DOI:http://dx.doi.org/10.1145/375663.375668 Google ScholarDigital Library
M. Agyemang, K. Barker, and R. Alhajj. 2006. A comprehensive survey of numeric and symbolic outlier mining techniques. Intelligent Data Analysis 10 (2006), 521--538. Google ScholarCross Ref
F. Angiulli and F. Fassetti. 2009. DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets. ACM Transactions on Knowledge Discovery from Data (TKDD) 3, 1 (2009), 4:1--57. DOI:http://dx.doi.org/10.1145/1497577.1497581 Google ScholarDigital Library
F. Angiulli and C. Pizzuti. 2002. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD). Helsinki, Finland. 15--26. DOI:http://dx.doi.org/10.1007/3-540-45681-3_2 Google ScholarDigital Library
M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. 1999. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Philadelphia, PA. 49--60. DOI:http://dx.doi.org/10.1145/304182.304187. Google ScholarDigital Library
K. Bache and M. Lichman. 2013. UCI Machine Learning Repository. (2013). http://archive.ics.uci.edu/ml.Google Scholar
V. Barnett. 1978. The study of outliers: Purpose and model. Applied Statistics 27, 3 (1978), 242--250.Google ScholarCross Ref
V. Barnett and T. Lewis. 1994. Outliers in Statistical Data (3rd ed.). John Wiley & Sons.Google Scholar
S. Basu, I. Davidson, and K. Wagstaff (Eds.). 2008. Constraint Clustering: Advances in Algorithms, Applications and Theory. CRC Press, Boca Raton, London, New York. Google ScholarDigital Library
S. D. Bay and M. Schwabacher. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Washington, DC. 29--38. DOI:http://dx.doi.org/10.1145/956750.956758 Google ScholarDigital Library
R. J. Beckman and R. D. Cook. 1983. Outlier..........s. Technometrics 25, 2 (1983), 119--149.Google Scholar
M. Bilenko, S. Basu, and R. J. Mooney. 2004. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the 21st International Conference on Machine Learning (ICML). Banff, AB, Canada. Google ScholarDigital Library
C. M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer. Google ScholarDigital Library
S. Brecheisen, H.-P. Kriegel, P. Kröger, and M. Pfeifle. 2004. Visually mining through cluster hierarchies. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM). Lake Buena Vista, FL. 400--412.Google Scholar
M. M. Breunig, H.-P. Kriegel, P. Kröger, and J. Sander. 2001. Data bubbles: Quality preserving performance boosting for hierarchical clustering. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Santa Barbara, CA. 79--90. DOI:http://dx.doi.org/10.1145/375663.375672 Google ScholarDigital Library
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. 2000. LOF: Identifying density-based local outliers. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Dallas, TX. 93--104. DOI:http://dx.doi.org/10.1145/342009.335388 Google ScholarDigital Library
C. Böhm and C. Plant. 2008. HISSCLU: A hierarchical density-based method for semi-supervised clustering. In Proceedings of the 11th International Conference on Extending Database Technology (EDBT). Nantes, France. 440--451. DOI:http://dx.doi.org/10.1145/1353343.1353398 Google ScholarDigital Library
R. J. G. B. Campello, D. Moulavi, and J. Sander. 2013a. Density-based clustering based on hierarchical density estimates. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Gold Coast, Australia. 160--172. DOI:http://dx.doi.org/10.1007/978-3-642-37456-2_14Google Scholar
R. J. G. B. Campello, D. Moulavi, A. Zimek, and J. Sander. 2013b. A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Mining and Knowledge Discovery 27, 3 (2013), 344--371. DOI:http://dx.doi.org/10.1007/s10618-013-0311-4Google ScholarCross Ref
M. H. Chehreghani, H. Abolhassani, and M. H. Chehreghani. 2008. Improving density-based methods for hierarchical clustering of web pages. Data & Knowledge Engineering 67, 1 (2008), 30--50. DOI:http://dx.doi.org/10.1016/j.datak.2008.06.006 Google ScholarDigital Library
D. Comaniciu and P. Meer. 1999. Distribution free decomposition of multivariate data. Pattern Analysis and Applications 2, 1 (1999), 22--30. Google ScholarDigital Library
D. Coomans and D. L. Massart. 1981. Potential methods in pattern recognition: Part 2. CLUPOT -- An unsupervised pattern recognition technique. Analytica Chimica Acta 133, 3 (1981), 225--239.Google ScholarCross Ref
A. Cuevas, M. Febrero, and R. Fraiman. 2000. Estimating the number of clusters. Canadian Journal of Statistics 28, 2 (2000), 367--382.Google ScholarCross Ref
A. Cuevas, M. Febrero, and R. Fraiman. 2001. Cluster analysis: A further approach based on density estimation. Computational Statistics and Data Analysis 36 (2001), 441--459. DOI:http://dx.doi.org/10.1016/j.csda.2005.10.012 Google ScholarDigital Library
X. H. Dang, I. Assent, R. T. Ng, A. Zimek, and E. Schubert. 2014. Discriminative features for identifying and interpreting outliers. In Proceedings of the 30th International Conference on Data Engineering (ICDE), Chicago, IL. 88--99. DOI:http://dx.doi.org/10.1109/ICDE.2014.6816642Google Scholar
M. Daszykowski, B. Walczak, and D. L. Massart. 2001. Looking for natural patterns in data: Part 1. Density-based approach. Chemometrics and Intelligent Laboratory Systems 56, 2 (2001), 83--92.Google ScholarCross Ref
I. Davidson, K. L. Wagstaff, and S. Basu. 2006. Measuring constraint-set utility for partitional clustering algorithms. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). Berlin, Germany. 115--126. Google ScholarDigital Library
T. de Vries, S. Chawla, and M. E. Houle. 2010. Finding local anomalies in very high dimensional space. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM). Sydney, Australia. 128--137. DOI:http://dx.doi.org/10.1109/ICDM.2010.151 Google ScholarDigital Library
T. de Vries, S. Chawla, and M. E. Houle. 2012. Density-preserving projections for large-scale local anomaly detection. Knowledge and Information Systems (KAIS) 32, 1 (2012), 25--52. DOI:http://dx.doi.org/10.1007/s10115-011-0430-4 Google ScholarDigital Library
S. Dhandapani, G. Gupta, and J. Ghosh. 2010. Design and Implementation of Scalable Hierarchical Density Based Clustering. Technical Report. Technical Report IDEAL-2010-06, Dept. Electrical and Computer Eng., Univ. of Texas at Austin.Google Scholar
L. Ertöz, M. Steinbach, and V. Kumar. 2002. A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining. 105--115.Google Scholar
L. Ertöz, M. Steinbach, and V. Kumar. 2003. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the 3rd SIAM International Conference on Data Mining (SDM), San Francisco, CA.Google Scholar
M. Ester. 2009. Density-based Clustering. In Encyclopedia of Database Systems, L. Liu and M. T. Özsu (Eds.). Springer, 795--799. DOI:http://dx.doi.org/10.1007/978-0-387-39940-9_605Google Scholar
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD). Portland, OR. 226--231.Google Scholar
B. S. Everitt, S. Landau, and M. Leese. 2001. Cluster Analysis (4th ed.). Arnold. Google ScholarDigital Library
A. Foss and O. R. Zaïane. 2002. A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM), Maebashi City. Japan. 179--186. Google ScholarDigital Library
A. L. N. Fred and A. K. Jain. 2005. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 6 (2005), 835--850. DOI:http://dx.doi.org/10.1109/TPAMI.2005.113 Google ScholarDigital Library
K. Fukunaga and L. Hostetler. 1975. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21, 1 (1975), 32--40. Google ScholarDigital Library
J. Gao and P.-N. Tan. 2006. Converting output scores from outlier detection algorithms into probability estimates. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM). Hong Kong, China. 212--221. DOI:http://dx.doi.org/10.1109/ICDM.2006.43 Google ScholarDigital Library
J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders. 2005. The amsterdam library of object images. International Journal of Computer Vision 61, 1 (2005), 103--112. DOI:http://dx.doi.org/10.1023/B:VISI.0000042993.50813.60 Google ScholarDigital Library
F. E. Grubbs. 1950. Sample criteria for testing outlying observations. The Annals of Mathematical Statistics 21, 1 (1950), 27--58.Google ScholarCross Ref
G. Gupta, A. Liu, and J. Ghosh. 2006. Hierarchical density shaving: A clustering and visualization framework for large biological datasets. In IEEE ICDM Workshop on Data Mining in Bioinformatics (DMB). Hong Kong, China, 89--93. Google ScholarDigital Library
G. Gupta, A. Liu, and J. Ghosh. 2010. Automated hierarchical density shaving: A robust automated clustering and visualization framework for large biological data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 7, 2 (2010), 223--237. DOI:http://dx.doi.org/10.1145/1791396.1791402 Google ScholarDigital Library
A. S. Hadi, A. H. M. Rahmatullah Imon, and M. Werner. 2009. Detection of outliers. Wiley Interdisciplinary Reviews: Computational Statistics 1, 1 (2009), 57--70.Google ScholarDigital Library
S. Hang, Z. You, and L. Y. Chun. 2009. Incorporating biological knowledge into density-based clustering analysis of gene expression data. In 6th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Vol. 5. Tianjin, China, 52--56. Google ScholarDigital Library
J. A. Hanley and B. J. McNeil. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143 (1982), 29--36.Google ScholarCross Ref
J. A. Hartigan. 1975. Clustering Algorithms. John Wiley & Sons, New York, London, Sydney, Toronto. Google ScholarDigital Library
J. A. Hartigan. 1987. Estimation of a convex density contour in two dimensions. J. Amer. Statist. Assoc. 82, 397 (1987), 267--270.Google ScholarCross Ref
D. Hawkins. 1980. Identification of Outliers. Chapman and Hall.Google Scholar
M. Herbin, N. Bonnet, and P. Vautrot. 2001. Estimation of the number of clusters and influence zones. Pattern Recognition Letters 22, 14 (2001), 1557--1568. DOI:http://dx.doi.org/10.1016/S0167-8655(01)00103-9 Google ScholarDigital Library
A. Hinneburg and H. H. Gabriel. 2007. Denclue 2.0: Fast clustering based on kernel density estimation. In Proceedings of the 7th International Symposium on Intelligent Data Analysis (IDA). Ljubljana, Slovenia. 70--80. DOI:http://dx.doi.org/10.1007/978-3-540-74825-0_7 Google ScholarDigital Library
A. Hinneburg and D. A. Keim. 1998. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 4th ACM International Conference on Knowledge Discovery and Data Mining (KDD). New York City, NY. 58--65.Google Scholar
A. Hinneburg and D. A. Keim. 2003. A general approach to clustering in large databases with noise. Knowledge and Information Systems (KAIS) 5 (2003), 387--415. Issue 4.Google ScholarCross Ref
V. J. Hodge and J. Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22 (2004), 85--126. DOI:http://dx.doi.org/10.1023/B:AIRE.0000045502.10941.a9 Google ScholarDigital Library
D. Horta and R. J. G. B. Campello. 2012. Automatic aspect discrimination in data clustering. Pattern Recognition 45, 12 (2012), 4370--4388. DOI:http://dx.doi.org/10.1016/j.patcog.2012.05.011 Google ScholarDigital Library
L. Hubert and P. Arabie. 1985. Comparing partitions. Journal of Classification 2, 1 (1985), 193--218.Google ScholarCross Ref
J.-N. Hwang, S.-R. Lay, and A. Lippman. 1994. Nonparametric multivariate density estimation: A comparative study. IEEE Transactions on Signal Processing 42, 10 (1994), 2795--2810. Google ScholarDigital Library
A. K. Jain and R. C. Dubes. 1988. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs. Google ScholarDigital Library
W. Jin, A. K. Tung, and J. Han. 2001. Mining top-n local outliers in large databases. In Proceedings of the 7th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). San Francisco, CA. 293--298. DOI:http://dx.doi.org/10.1145/502512.502554 Google ScholarDigital Library
W. Jin, A. K. H. Tung, J. Han, and W. Wang. 2006. Ranking outliers using symmetric neighborhood relationship. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Singapore. 577--593. DOI:http://dx.doi.org/10.1007/11731139_68 Google ScholarDigital Library
S. C. Johnson. 1967. Hierarchical clustering schemes. Psychometrika 32 (1967), 241--254.Google ScholarCross Ref
F. Keller, E. Müller, and K. Böhm. 2012. HiCS: High contrast subspaces for density-based outlier ranking. In Proceedings of the 28th International Conference on Data Engineering (ICDE). Washington, DC. DOI:http://dx.doi.org/10.1109/ICDE.2012.88 Google ScholarDigital Library
E. M. Knorr and R. T. Ng. 1997a. A unified approach for mining outliers. In Proceedings of the conference of the Centre for Advanced Studies on Collaborative research (CASCON). Toronto, ON, Canada. 11--23. DOI:http://dx.doi.org/10.1145/782010.782021 Google ScholarDigital Library
E. M. Knorr and R. T. Ng. 1997b. A unified notion of outliers: Properties and computation. In Proceedings of the 3rd ACM International Conference on Knowledge Discovery and Data Mining (KDD). Newport Beach, CA. 219--222. DOI:http://dx.doi.org/10.1145/782010.782021Google Scholar
E. M. Knorr and R. T. Ng. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB). New York City, NY. 392--403. Google ScholarDigital Library
E. M. Knorr, R. T. Ng, and V. Tucanov. 2000. Distance-based outliers: Algorithms and applications. The VLDB Journal 8, 3--4 (2000), 237--253. DOI:http://dx.doi.org/10.1007/s007780050006 Google ScholarDigital Library
G. Kollios, D. Gunopulos, N. Koudas, and S. Berchthold. 2003. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE Transactions on Knowledge and Data Engineering 15, 5 (2003), 1170--1187. DOI:http://dx.doi.org/10.1109/TKDE.2003.1232271 Google ScholarDigital Library
H.-P. Kriegel, P. Kröger, J. Sander, and A. Zimek. 2011. Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1, 3 (2011), 231--240. DOI:http://dx.doi.org/10.1002/widm.30Google ScholarCross Ref
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2009a. LoOP: Local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM). Hong Kong, China. 1649--1652. DOI:http://dx.doi.org/10.1145/1645953.1646195 Google ScholarDigital Library
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2009b. Outlier detection in axis-parallel subspaces of high dimensional data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Bangkok, Thailand. 831--838. DOI:http://dx.doi.org/10.1007/978-3-642-01307-2_86 Google ScholarDigital Library
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2011a. Interpreting and unifying outlier scores. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM) Mesa, AZ. 13--24.Google Scholar
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2012. Outlier detection in arbitrarily oriented subspaces. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM). Brussels, Belgium. 379--388. DOI:http://dx.doi.org/10.1109/ICDM.2012.21 Google ScholarDigital Library
H.-P. Kriegel, P. Kröger, and A. Zimek. 2010. Outlier Detection Techniques. Tutorial at the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Washington, DC. (2010).Google Scholar
H.-P. Kriegel, M. Schubert, and A. Zimek. 2008. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Las Vegas, NV. 444--452. DOI:http://dx.doi.org/10.1145/1401890.1401946 Google ScholarDigital Library
B. Larsen and C. Aone. 1999. Fast and effective text mining using linear-time document clustering. In Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). San Diego, CA. 16--22. Google ScholarDigital Library
A. Lazarevic and V. Kumar. 2005. Feature bagging for outlier detection. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Chicago, IL. 157--166. DOI:http://dx.doi.org/10.1145/1081870.1081891 Google ScholarDigital Library
L. Lelis and J. Sander. 2009. Semi-supervised density-based clustering. In Proceedings of the 9th IEEE International Conference on Data Mining (ICDM). Miami, FL. 842--847. DOI:http://dx.doi.org/10.1109/ICDM.2009.143 Google ScholarDigital Library
F. T. Liu, K. M. Ting, and Z.-H. Zhou. 2012. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 6, 1 (2012), 3:1--39. Google ScholarDigital Library
P. Liu, D. Zhou, and N. Wu. 2007. VDBSCAN: Varied density based spatial clustering of applications with noise. In International Conference on Service Systems and Service Management (ICSSSM). Chengdu, China. 1--4.Google Scholar
B. Micenková, R. T. Ng, X. H. Dang, and I. Assent. 2013. Explaining outliers by subspace separability. In Proceedings of the 13th IEEE International Conference on Data Mining (ICDM). Dallas, TX. 518--527.Google Scholar
D. W. Muller and G. Sawitzki. 1991. Excess mass estimates and tests for multimodality. J. Amer. Statist. Assoc. 86, 415 (1991), 738--746.Google Scholar
E. Müller, I. Assent, P. Iglesias, Y. Mülle, and K. Böhm. 2012. Outlier ranking via subspace analysis in multiple views of the data. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM). Brussels, Belgium. 529--538. DOI:http://dx.doi.org/10.1109/ICDM.2012.112 Google ScholarDigital Library
E. Müller, I. Assent, U. Steinhausen, and T. Seidl. 2008. OutRank: Ranking outliers in high dimensional data. In Proceedings of the 24th International Conference on Data Engineering (ICDE) Workshop on Ranking in Databases (DBRank). Cancun, Mexico. 600--603. DOI:http://dx.doi.org/10.1109/ICDEW.2008.4498387 Google ScholarDigital Library
E. Müller, M. Schiffer, and T. Seidl. 2010. Adaptive outlierness for subspace outlier ranking. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM). Toronto, ON, Canada. 1629--1632. DOI:http://dx.doi.org/10.1145/1871437.1871690 Google ScholarDigital Library
E. Müller, M. Schiffer, and T. Seidl. 2011. Statistical selection of relevant subspace projections for outlier ranking. In Proceedings of the 27th International Conference on Data Engineering (ICDE). Hannover, Germany. 434--445. DOI:http://dx.doi.org/10.1109/ICDE.2011.5767916 Google ScholarDigital Library
M. C. Naldi, R. J. G. B. Campello, E. R. Hruschka, and A. C. P. L. F. Carvalho. 2011. Efficiency issues of evolutionary k-means. Applied Soft Computing 11, 2 (2011), 1938--1952. Google ScholarDigital Library
H. V. Nguyen, H. H. Ang, and V. Gopalkrishnan. 2010. Mining outliers with ensemble of heterogeneous detectors on random subspaces. In Proceedings of the 15th International Conference on Database Systems for Advanced Applications (DASFAA). Tsukuba, Japan. 368--383. DOI:http://dx.doi.org/10.1007/978-3-642-12026-8_29 Google ScholarDigital Library
H. V. Nguyen and V. Gopalkrishnan. 2009. Efficient pruning schemes for distance-based outlier detection. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD). Bled, Slovenia. 160--175. DOI:http://dx.doi.org/10.1007/978-3-642-04174-7_11 Google ScholarDigital Library
H. V. Nguyen, V. Gopalkrishnan, and I. Assent. 2011. An unbiased distance-based outlier detection approach for high-dimensional data. In Proceedings of the 16th International Conference on Database Systems for Advanced Applications (DASFAA). Hong Kong, China. 138--152. DOI:http://dx.doi.org/10.1007/978-3-642-20149-3_12 Google ScholarDigital Library
G. H. Orair, C. Teixeira, Y. Wang, W. Meira Jr., and S. Parthasarathy. 2010. Distance-based outlier detection: Consolidation and renewed bearing. Proceedings of the VLDB Endowment 3, 2 (2010), 1469--1480. Google ScholarDigital Library
S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. 2003. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering (ICDE). Bangalore, India. 315--326. DOI:http://dx.doi.org/10.1109/ICDE.2003.1260802Google Scholar
D. Pascual, F. Pla, and J. S. Sánchez. 2006. Non parametric local density-based clustering for multimodal overlapping distributions. In Proceedings of the 7th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL). Burgos, Spain. 671--678. Google ScholarDigital Library
E. S. Pearson and C. Chandra Sekar. 1936. The efficiency of statistical tools and a criterion for the rejection of outlying observations. Biometrika 28, 3/4 (1936), 308--320.Google ScholarCross Ref
T. Pei, A. Jasra, D. J. Hand, A.-X. Zhu, and C. Zhou. 2009. DECODE: A new method for discovering clusters of different densities in spatial data. Data Mining and Knowledge Discovery 18, 3 (2009), 337--369. DOI:http://dx.doi.org/10.1007/s10618-008-0120-3 Google ScholarDigital Library
T. Pei, A.-X. Zhu, C. Zhou, B. Li, and C. Qin. 2006. A new approach to the nearest-neighbour method to discover cluster features in overlaid spatial point processes. International Journal of Geographical Information Science 20, 2 (2006), 153--168.Google ScholarCross Ref
N. Pham and R. Pagh. 2012. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Beijing, China. Google ScholarDigital Library
S. Ramaswamy, R. Rastogi, and K. Shim. 2000. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Dallas, TX. 427--438. DOI:http://dx.doi.org/10.1145/342009.335437 Google ScholarDigital Library
C. Ruiz, M. Spiliopoulou, and E. Menasalvas. 2007. C-DBSCAN: Density-based clustering with constraints. In Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, A. An, J. Stefanowski, S. Ramanna, C. Butz, W. Pedrycz, and G. Wang (Eds.). 216--223. Google ScholarDigital Library
C. Ruiz, M. Spiliopoulou, and E. Menasalvas. 2010. Density-based semi-supervised clustering. Data Mining and Knowledge Discovery 21, 3 (2010), 345--370. Google ScholarDigital Library
J. Sander. 2010. Density-based clustering. In Encyclopedia of Machine Learning, C. Sammut and G. I. Webb (Eds.). Springer, US, 270--273.Google Scholar
J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky. 2003. Automatic extraction of clusters from hierarchical clustering representations. In Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Seoul, Korea. 75--87. DOI:http://dx.doi.org/10.1007/3-540-36175-8_8 Google ScholarDigital Library
E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel. 2012. On evaluation of outlier rankings and outlier scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM). Anaheim, CA. 1047--1058.Google ScholarCross Ref
E. Schubert, A. Zimek, and H.-P. Kriegel. 2014a. Generalized outlier detection with flexible kernel density estimates. In Proceedings of the 14th SIAM International Conference on Data Mining (SDM). Philadelphia, PA. 542--550. DOI:http://dx.doi.org/10.1137/1.9781611973440.63Google ScholarCross Ref
E. Schubert, A. Zimek, and H.-P. Kriegel. 2014b. Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery 28, 1 (2014), 190--237. DOI:http://dx.doi.org/10.1007/s10618-012-0300-z Google ScholarDigital Library
B. W. Silverman. 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC.Google Scholar
P. H. A. Sneath. 1957. The application of computers to taxonomy. Journal of General Microbiology 17 (1957), 201--226.Google ScholarCross Ref
T. Soler and M. Chin. 1985. On transformation of covariance matrices between local Cartesian coordinate systems and commutative diagrams. In ASP-ACSM Convention. 393--406.Google Scholar
W. Stuetzle. 2003. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification 20, 1 (2003), 25--47. DOI:http://dx.doi.org/10.1007/s00357-003-0004-6Google ScholarCross Ref
W. Stuetzle and R. Nugent. 2010. A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics 19, 2 (2010), 397--418.Google ScholarCross Ref
H. Sun, J. Huang, J. Han, H. Deng, P. Zhao, and B. Feng. 2010. gSkeletonClu: Density-based network clustering via structure-connected tree division or agglomeration. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM). Sydney, Australia. 481--490. Google ScholarDigital Library
P.-N. Tan, M. Steinbach, and V. Kumar. 2006. Introduction to Data Mining. Addison Wesley. Google ScholarDigital Library
J. Tang, Z. Chen, A. W.-C. Fu, and D. W. Cheung. 2002. Enhancing effectiveness of outlier detections for low density patterns. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Taipei, Taiwan. 535--548. DOI:http://dx.doi.org/10.1007/3-540-47887-6_53 Google ScholarDigital Library
K. L. Wagstaff. 2002. Intelligent Clustering with Instance-Level Constraints. Ph.D. Dissertation. Department of Computer Science, Cornell University. Google ScholarDigital Library
D. Wishart. 1969. Mode analysis: A generalization of nearest neighbor which reduces chaining effects. In Numerical Taxonomy, A. J. Cole (Ed.). 282--311.Google Scholar
M. A. Wong and T. Lane. 1983. A kth nearest neighbour clustering procedure. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 45, 3 (1983), 362--368.Google Scholar
R. Xu and D. Wunsch II. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16 (2005), 645--678. DOI:http://dx.doi.org/10.1109/TNN.2005.845141 Google ScholarDigital Library
R. Xu and D. Wunsch II. 2009. Clustering. IEEE Press. Google ScholarDigital Library
J. Yang, N. Zhong, Y. Yao, and J. Wang. 2008. Local peculiarity factor and its application in outlier detection. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Las Vegas, NV. 776--784. DOI:http://dx.doi.org/10.1145/1401890.1401983 Google ScholarDigital Library
K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo. 2001. Model-based clustering and data transformations for gene expression data. Bioinformatics 17, 10 (2001), 977--987.Google ScholarCross Ref
K. Y. Yeung, M. Medvedovic, and R. E. Bumgarner. 2003. Clustering gene-expression data with repeated measurements. Genome Biology 4, 5 (2003), R34.1--R34.17.Google ScholarCross Ref
K. Zhang, M. Hutter, and H. Jin. 2009. A new local distance-based outlier detection approach for scattered real-world data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Bangkok, Thailand. 813--822. DOI:http://dx.doi.org/10.1007/978-3-642-01307-2_84 Google ScholarDigital Library
T. Zhang, R. Ramakrishnan, and M. Livny. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Montreal, QC, Canada. 103--114. DOI:http://dx.doi.org/10.1145/233269.233324 Google ScholarDigital Library
A. Zimek, R. J. G. B. Campello, and J. Sander. 2013. Ensembles for unsupervised outlier detection: Challenges and research questions. ACM SIGKDD Explorations 15, 1 (2013), 11--22. Google ScholarDigital Library
A. Zimek, R. J. G. B. Campello, and J. Sander. 2014. Data perturbation for outlier detection ensembles. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management (SSDBM). Aalborg, Denmark. 13:1--12. DOI:http://dx.doi.org/10.1145/2618243.2618257 Google ScholarDigital Library
A. Zimek, M. Gaudet, R. J. G. B. Campello, and J. Sander. 2013a. Subsampling for efficient and effective unsupervised outlier detection ensembles. In Proceedings of the 19th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Chicago, IL. 428--436. DOI:http://dx.doi.org/10.1145/2487575.2487676 Google ScholarDigital Library
A. Zimek, E. Schubert, and H.-P. Kriegel. 2012. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining 5, 5 (2012), 363--387. DOI:http://dx.doi.org/10.1002/sam.11161 Google ScholarDigital Library

Index Terms

Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

Density-based hierarchical clustering for streaming data

For streaming data that arrive continuously such as multimedia data and financial transactions, clustering algorithms are typically allowed to scan the data set only once. Existing research in this domain mainly focuses on improving the accuracy of ...
Read More
Initialization of K-modes clustering using outlier detection techniques

We considered the initialization of K-modes clustering from the view of outlier detection.We proposed an initialization algorithm for K-modes clustering via the distance-based outlier detection technique.We presented a partition entropy-based outlier ...
Read More
Density-based semi-supervised clustering

Semi-supervised clustering methods guide the data partitioning and grouping process by exploiting background knowledge, among else in the form of constraints. In this study, we propose a semi-supervised density-based clustering method. Density-based ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Knowledge Discovery from Data Volume 10, Issue 1
July 2015
321 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/2808688
Editor:
Philip S. Yu
University of Illinois at Chicago, USA
Issue’s Table of Contents
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 22 July 2015
- Accepted: 1 February 2015
- Revised: 1 December 2014
- Received: 1 July 2013
Published in tkdd Volume 10, Issue 1

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Density-based clustering
data visualization
global/local outliers
hierarchical and nonhierarchical clustering
outlier detection
unsupervised and semisupervised clustering
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 396
  Total Citations
  View Citations
- 6,274
  Total Downloads
- Downloads (Last 12 months)720
- Downloads (Last 6 weeks)104
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

Density-based hierarchical clustering for streaming data

Initialization of K-modes clustering using outlier detection techniques

Density-based semi-supervised clustering

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

Density-based hierarchical clustering for streaming data

Initialization of K-modes clustering using outlier detection techniques

Density-based semi-supervised clustering

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media