
Isolation-Based Anomaly Detection

Published: 01 March 2012

Abstract

Anomalies are data points that are few and different. We show that, as a result of these properties, anomalies are susceptible to a mechanism called isolation. This article proposes a method called Isolation Forest (iForest), which detects anomalies purely on the basis of the concept of isolation, without employing any distance or density measure---making it fundamentally different from all existing methods.

As a result, iForest is able to exploit subsampling (i) to achieve a low linear time complexity and a small memory requirement, and (ii) to deal effectively with the effects of swamping and masking. Our empirical evaluation shows that iForest outperforms ORCA, one-class SVM, LOF, and Random Forests in terms of AUC and processing time, and that it is robust against masking and swamping effects. iForest also works well in high-dimensional problems containing a large number of irrelevant attributes, and when anomalies are absent from the training sample.
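The isolation mechanism the abstract describes can be sketched in a few lines: each tree partitions a subsample with randomly chosen attributes and split values, and points whose average path length across trees is short score as anomalies. The sketch below is a minimal illustrative version, not the article's reference implementation; the function names, the demo data, and the parameter defaults chosen here (100 trees, a subsample of 64 points) are assumptions made for the example.

```python
import math
import random

def c(n):
    # Average path length of an unsuccessful BST search over n points,
    # approximated with c(n) = 2 H(n-1) - 2(n-1)/n and H(i) ~ ln(i) + 0.5772;
    # used to normalize path lengths.
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_itree(X, depth, max_depth, rng):
    # Recursively partition with a random attribute and a random split value
    # until a point is isolated or the height limit is reached.
    if depth >= max_depth or len(X) <= 1:
        return ("leaf", len(X))
    q = rng.randrange(len(X[0]))                       # random attribute
    lo, hi = min(x[q] for x in X), max(x[q] for x in X)
    if lo == hi:
        return ("leaf", len(X))
    p = rng.uniform(lo, hi)                            # random split value
    left = [x for x in X if x[q] < p]
    right = [x for x in X if x[q] >= p]
    return ("node", q, p,
            build_itree(left, depth + 1, max_depth, rng),
            build_itree(right, depth + 1, max_depth, rng))

def path_length(x, tree, depth=0):
    if tree[0] == "leaf":
        return depth + c(tree[1])                      # adjust for unbuilt subtree
    _, q, p, left, right = tree
    return path_length(x, left if x[q] < p else right, depth + 1)

def iforest_scores(X, n_trees=100, sample_size=64, seed=0):
    rng = random.Random(seed)
    psi = min(sample_size, len(X))
    max_depth = math.ceil(math.log2(psi))              # height limit ~ log2(psi)
    trees = [build_itree(rng.sample(X, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    scores = []
    for x in X:
        avg = sum(path_length(x, t) for t in trees) / n_trees
        scores.append(2.0 ** (-avg / c(psi)))          # score in (0, 1)
    return scores

# Tiny demo: a Gaussian cluster near the origin plus one distant point.
demo_rng = random.Random(42)
data = [[demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)] for _ in range(200)]
data.append([10.0, 10.0])
scores = iforest_scores(data)
print(scores.index(max(scores)))  # index of the highest-scoring point
```

The score 2^(-E[h(x)]/c(psi)) follows the article's definition: values close to 1 indicate likely anomalies, while values well below 0.5 indicate normal points, so no distance or density computation is needed at any stage.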


      • Published in

        ACM Transactions on Knowledge Discovery from Data  Volume 6, Issue 1
        March 2012
        137 pages
        ISSN:1556-4681
        EISSN:1556-472X
        DOI:10.1145/2133360

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 March 2012
        • Accepted: 1 February 2011
        • Revised: 1 August 2010
        • Received: 1 March 2010
Published in TKDD Volume 6, Issue 1


        Qualifiers

        • research-article
        • Research
        • Refereed
