DOI: 10.1145/1134285.1134445
Article

Effective identification of source code authors using byte-level information

Published: 28 May 2006

ABSTRACT

Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of major benefit, such as authorship disputes, proof of authorship in court, and tracing the source of code left in a system after a cyber attack. We present a new approach, called the SCAP (Source Code Author Profiles) approach, which uses byte-level n-gram profiles to represent a source code author's style. Experiments on data sets in different programming languages (Java or C++) and of varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited number of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
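The idea of a byte-level n-gram author profile can be illustrated with a minimal sketch. The abstract does not give the profile size, the value of n, or the similarity measure, so the choices below are assumptions made for illustration only: the profile is taken to be the set of the most frequent byte n-grams, and similarity is taken to be the number of n-grams two profiles share. The function names `ngram_profile`, `profile_similarity`, and `most_likely_author` are hypothetical and do not come from the paper.

```python
from collections import Counter


def ngram_profile(data: bytes, n: int = 3, size: int = 1500) -> set:
    """Return the `size` most frequent byte-level n-grams found in `data`.

    The defaults for `n` and `size` are illustrative assumptions,
    not values taken from the abstract.
    """
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def profile_similarity(a: set, b: set) -> int:
    """Assumed similarity measure: number of n-grams shared by both profiles."""
    return len(a & b)


def most_likely_author(unknown: bytes, training: dict, n: int = 3,
                       size: int = 1500) -> str:
    """Pick the candidate whose profile overlaps most with the unknown program.

    `training` maps an author name to a list of that author's programs (bytes).
    All samples of one author are concatenated before profiling.
    """
    target = ngram_profile(unknown, n, size)
    scores = {
        author: profile_similarity(ngram_profile(b"".join(samples), n, size), target)
        for author, samples in training.items()
    }
    return max(scores, key=scores.get)


# Toy usage: two candidates with different spacing and brace habits.
training = {
    "alice": [b"for (int i = 0; i < n; i++) {\n    sum += a[i];\n}\n"],
    "bob": [b"for(int i=0;i<n;++i)\n{\n\tsum+=a[i];\n}\n"],
}
unknown = b"for(int j=0;j<m;++j)\n{\n\ttotal+=b[j];\n}\n"
print(most_likely_author(unknown, training))
```

Because the profile is built from raw bytes, layout habits such as spacing, tabs, and brace placement contribute to it, and no language-specific parser is needed.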

References

  1. Ding, H., Samadzadeh, M. H., Extraction of Java program fingerprints for software authorship identification, The Journal of Systems and Software, Volume 72, Issue 1, pages 49--57, June 2004.
  2. Frantzeskou, G., Gritzalis, S., MacDonell, S., Source Code Authorship Analysis for supporting the cybercrime investigation process, (ICETE04), Vol 2, pages 85--92, 2004.
  3. Gray, A., Sallis, P., and MacDonell, S., Identified: A dictionary-based system for extracting source code metrics for software forensics, In Proceedings of SE:E&P'98, IEEE Computer Society Press, pages 252--259, 1998.
  4. Gray, A., Sallis, P., and MacDonell, S., Software forensics: Extending authorship analysis techniques to computer programs, In Proc. 3rd Biannual Conf. Int. Assoc. of Forensic Linguists (IAFL'97), pages 1--8, 1997.
  5. Keselj, V., Peng, F., Cercone, N., Thomas, C., N-gram based author profiles for authorship attribution, In Proc. Pacific Association for Computational Linguistics, 2003.
  6. Keselj, V., Perl package Text::N-grams, http://www.cs.dal.ca/~vlado/srcperl/N-grams, 2003.
  7. Kilgour, R. I., Gray, A. R., Sallis, P. J., and MacDonell, S. G., A Fuzzy Logic Approach to Computer Software Source Code Authorship Analysis, Accepted in Proc. of ICONIP'97, Dunedin, New Zealand, 1997.
  8. Krsul, I., and Spafford, E. H., Authorship analysis: Identifying the author of a program, In Proc. 8th National Information Systems Security Conference, pages 514--524, National Institute of Standards and Technology, 1995.
  9. Longstaff, T. A., and Schultz, E. E., Beyond Preliminary Analysis of the WANK and OILZ Worms: A Case Study of Malicious Code, Computers and Security, 12:61--77, 1993.
  10. MacDonell, S. G., and Gray, A. R., Software forensics applied to the task of discriminating between program authors, Journal of Systems Research and Information Systems, 10:113--127, 2001.
  11. Peng, F., Schuurmans, D., and Wang, S., Augmenting naive Bayes classifiers with statistical language models, Information Retrieval Journal, 7(1):317--345, 2004.
  12. Sallis, P., Aakjaer, A., and MacDonell, S., Software Forensics: Old Methods for a New Science, Proceedings of SE:E&P'96, Dunedin, New Zealand, IEEE Computer Society Press, pages 367--371, 1996.
  13. Spafford, E. H., The Internet Worm Program: An Analysis, Computer Communications Review, 19(1):17--49, 1989.
  14. Spafford, E. H., and Weeber, S. A., Software forensics: tracking code to its authors, Computers and Security, 12:585--595, 1993.
  15. Stamatatos, E., Fakotakis, N., and Kokkinakis, G., Automatic text categorisation in terms of genre and author, Computational Linguistics, 26(4):471--495, 2000.

    • Published in

      ICSE '06: Proceedings of the 28th international conference on Software engineering
      May 2006
      1110 pages
      ISBN: 1595933751
      DOI: 10.1145/1134285

      Copyright © 2006 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 276 of 1,856 submissions, 15%
