Abstract
Anomalies are data points that are few and different. As a result of these properties, we show that, anomalies are susceptible to a mechanism called isolation. This article proposes a method called Isolation Forest (iForest), which detects anomalies purely based on the concept of isolation without employing any distance or density measure---fundamentally different from all existing methods.
As a result, iForest is able to exploit subsampling (i) to achieve a low linear time-complexity and a small memory-requirement and (ii) to deal with the effects of swamping and masking effectively. Our empirical evaluation shows that iForest outperforms ORCA, one-class SVM, LOF and Random Forests in terms of AUC, processing time, and it is robust against masking and swamping effects. iForest also works well in high dimensional problems containing a large number of irrelevant attributes, and when anomalies are not available in training sample.
- Abe, N., Zadrozny, B., and Langford, J. 2006. Outlier detection by active learning. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 504--509. Google ScholarDigital Library
- Angiulli, F. and Fassetti, F. 2009. Dolphin: An efficient algorithm for mining distance-based outliers in very large datasets. ACM Trans. Knowl. Discov. Data 3, 1, 1--57. Google ScholarDigital Library
- Asuncion, A. and Newman, D. 2007. UCI machine learning repository. http://mlearn.ics.ucl.edu/MLRepository.html.Google Scholar
- Bay, S. D. and Schwabacher, M. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 29--38. Google ScholarDigital Library
- Breiman, L. 2001. Random forests. Mach. Learn. 45, 1, 5--32. Google ScholarDigital Library
- Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. 2000. LOF: Identifying density-based local outliers. ACM SIGMOD Record 29, 2, 93--104. Google ScholarDigital Library
- Caputo, B., Sim, K., Furesjo, F., and Smola, A. 2002. Appearance-based object recognition using SVMs: Which kernel should I use? In Proceedings of the NIPS Workshop on Statitsical Methods for Computational Experiments in Visual Processing and Computer Vision.Google Scholar
- Chandola, V., Banerjee, A., and Kumar, V. 2009. Anomaly detection: A survey. ACM Comput. Surv. 41, 3, 1--58. Google ScholarDigital Library
- Chaudhary, A., Szalay, A. S., Szalay, E. S., and Moore, A. W. 2002. Very fast outlier detection in large multidimensional datasets. In Proceedings of the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery. ACM.Google Scholar
- Fan, H., Zaïane, O. R., Foss, A., and Wu, J. 2006. A nonparametric outlier detection for effectively discovering top-n outliers from engineering data. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Lecture Notes in Computer Science, vol. 3918, Springer, 557--566. Google ScholarDigital Library
- Ghoting, A., Otey, M. E., and Parthasarathy, S. 2004. Loaded: Link-based outlier and anomaly detection in evolving datasets. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM’04). IEEE Computer Society Press, 387--390. Google ScholarDigital Library
- Hand, D. J. and Till, R. J. 2001. A simple generalisation of the area under the roc curve for multiple class classification problems. Mach. Learn. 45, 2, 171--186. Google ScholarDigital Library
- He, Z., Xu, X., and Deng, S. 2003. Discovering cluster-based local outliers. Patt. Recog. Lett. 24, 9--10, 1641--1650. Google ScholarDigital Library
- He, Z., Deng, S., and Xu, X. 2005. A unified subspace outlier ensemble framework for outlier detection. In Proceedings of the International Conference on Web-Age Information Management (WAIM). W. Fan, Z. Wu, and J. Yang Eds., Lecture Notes in Computer Science, vol. 3739, Springer, 632--637. Google ScholarDigital Library
- Joanes, D. N. and Gill, C. A. 1998. Comparing measures of sample skewness and kurtosis. J. Roy. Stat. Soc. (Series D): Statistician 47, 1, 183--189.Google ScholarCross Ref
- Knorr, E. M. and Ng, R. T. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB’98). Morgan Kaufmann, 392--403. Google ScholarDigital Library
- Knorr, E. M., Ng, R. T., and Tucakov, V. 2000. Distance-based outliers: Algorithms and applications. VLDB J. 8, 3--4, 237--253. Google ScholarDigital Library
- Knuth, D. E. 2006. Art of Computer Programming, Volume 4, Fascicle 4: Generating All Trees. Addison-Wesley Professional. Google ScholarDigital Library
- Lazarevic, A. and Kumar, V. 2005. Feature bagging for outlier detection. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD’05). ACM, New York, 157--166. Google ScholarDigital Library
- Liu, F. T., Ting, K. M., and Zhou, Z.-H. 2008a. Isolation forest. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM’08). IEEE Computer Society Press, 413--422. Google ScholarDigital Library
- Liu, F. T., Ting, K. M., and Zhou, Z.-H. 2008b. Spectrum of variable-random trees. J. Artif. Intel. Res. 32, 355--384. Google ScholarDigital Library
- Liu, F. T., Ting, K. M., and Zhou, Z.-H. 2010a. Can isolation-based anomaly detectors handle arbitrary multi-modal patterns in data? Tech. rep. TR2010/2, Monash University.Google Scholar
- Liu, F. T., Ting, K. M., and Zhou, Z.-H. 2010b. On detecting clustered anomalies using SCiForest. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD’10). 274--290. Google ScholarDigital Library
- Liu, R. Y., Parelius, J. M., and Singh, K. 1999. Multivariate analysis by data depth: Descriptive statistics, graphics and inference. Ann. Stat. 27, 3, 783--840.Google ScholarCross Ref
- Murphy, R. B. 1951. On tests for outlying observations. Ph.D. dissertation, Princeton University.Google Scholar
- Papadimitriou, S., Kitagawa, H., Gibbons, P., and Faloutsos, C. 2003. Loci: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering. 315--326.Google Scholar
- Preiss, B. R. 1999. Data Structures and Algorithms with Object-Oriented Design Patterns in Java. Wiley. Google ScholarDigital Library
- Quinlan, J. R. 1993. C4.5: Programs for Machine Learning 1st Ed. (Series in Machine Learning), Morgan Kaufmann. Google ScholarDigital Library
- Ramaswamy, S., Rastogi, R., and Shim, K. 2000. Efficient algorithms for mining outliers from large datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 427--438. Google ScholarDigital Library
- Rocke, D. M. and Woodruff, D. L. 1996. Identification of outliers in multivariate data. J. Amer. Statist. Soc. 91, 435, 1047--1061.Google Scholar
- Rousseeuw, P. J. and Leroy, A. M. 1987. Robust Regression and Outlier Detection. Wiley, New York. Google ScholarDigital Library
- Ruskey, F. 1980. On the average shape of binary trees. SIAM J. Alg. Disc. Meth. 1, 1, 43--50.Google ScholarCross Ref
- Schölkopf, B., Platt, J. C., Shawe-Taylor, J. C., Smola, A. J., and Williamson, R. C. 2001. Estimating the support of a high-dimensional distribution. Neural Comput. 13, 7, 1443--1471. Google ScholarDigital Library
- Shi, T. and Horvath, S. 2006. Unsupervised learning with random forest predictors. J. Computat. Graph. Stat. 15, 1, 118--138.Google ScholarCross Ref
- Sing, T., Sander, O., Beerenwinkel, N., and Lengauer, T. 2005. ROCR: Visualizing classifier performance in r. Bioinformatics 21, 20, 3940--3941. Google ScholarDigital Library
- Song, X., Wu, M., Jermaine, C., and Ranka, S. 2007. Conditional anomaly detection. IEEE Trans. Knowl. Data Eng. 19, 5, 631--645. Google ScholarDigital Library
- Tan, P.-N., Steinbach, M., and Kumar, V. 2005. Introduction to Data Mining. Addison-Wesley. Google ScholarDigital Library
- Tang, J., Chen, Z., Fu, A. W.-C., and Cheung, D. W.-L. 2002. Enhancing effectiveness of outlier detections for low density patterns. In Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD). Springer-Verlag, 535--548. Google ScholarDigital Library
- Tao, Y., Xiao, X., and Zhou, S. 2006. Mining distance-based outliers from large databases in any metric space. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, New York, 394--403. Google ScholarDigital Library
- Tax, D. M. J. and Duin, R. P. W. 2004. Support vector data description. Mach. Learn. 54, 1, 45--66. Google ScholarDigital Library
- Tukey, J. W. 1977. Exploratory Data Analysis. Addison-Wesley.Google Scholar
- Williams, G., Baxter, R., He, H., Hawkins, S., and Gu, L. 2002. A comparative study of rnn for outlier detection in data mining. In Proceedings of the IEEE International Conference on Data Mining (ICDM’02). IEEE Computer Society Press, 709--712. Google ScholarDigital Library
- Wu, M. and Jermaine, C. 2006. Outlier detection by sampling with accuracy guarantees. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 767--772. Google ScholarDigital Library
- Yamanishi, K., Takeuchi, J.-I., Williams, G., and Milne, P. 2000. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 320--324. Google ScholarDigital Library
- Yu, X., Tang, L. A., and Han, J. 2009. Filtering and refinement: A two-stage approach for efficient and effective anomaly detection. In Proceedings of the 9th IEEE International Conference on Data Mining (ICDM). IEEE Computer Society Press, 617--626. Google ScholarDigital Library
Index Terms
- Isolation-Based Anomaly Detection
Recommendations
Fuzzy Isolation Forest for Anomaly Detection
AbstractAnomaly detection is nowadays a key data mining task. Anomaly detection methods generally look for patterns of ”normal” profile and then identify data points that do not match that profile. One outstanding method, Isolation Forest, showed high ...
Sparse random projection isolation forest for outlier detection
Highlights- We analyzed the isolation-forest-based methods’ problem of lacking efficacy in selecting suitable hyperplanes to split data.
Graphical abstractDisplay Omitted
AbstractIsolation Forest has a low computational complexity, hence has been widely applied to detect outliers in large-scale data. However, it suffers from the artifacts caused by the hyperplanes chosen, thereby failing to detect outliers in ...
Isolation Forest
ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data MiningMost existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that ...
Comments