ABSTRACT
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where such a method could be of major benefit, such as authorship disputes, proof of authorship in court, and tracing the source of code left in a system after a cyber attack. We present a new approach, called SCAP (Source Code Author Profiles), which uses byte-level n-gram profiles to represent a source code author's style. Experiments on data sets in different programming languages (Java or C++) and of varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of source code authors. Moreover, the SCAP approach deals surprisingly well with cases where only a limited number of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
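To make the idea of a byte-level n-gram author profile concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not state the n-gram length, the profile size, or the similarity measure, so the sketch assumes n = 3, a profile made of the 500 most frequent n-grams, and similarity measured as the number of n-grams two profiles share. The function names `ngram_profile`, `profile_overlap`, and `most_likely_author` are hypothetical.

```python
from collections import Counter


def ngram_profile(source: bytes, n: int = 3, size: int = 500) -> set[bytes]:
    """Return the `size` most frequent byte-level n-grams in `source`.

    Assumption: n and size are illustrative defaults, not values taken
    from the abstract.
    """
    counts = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def profile_overlap(a: set[bytes], b: set[bytes]) -> int:
    """Assumed similarity measure: number of n-grams common to both profiles."""
    return len(a & b)


def most_likely_author(
    unknown: bytes, training: dict[str, bytes], n: int = 3, size: int = 500
) -> str:
    """Pick the candidate whose profile shares the most n-grams with `unknown`.

    `training` maps each candidate author to the concatenated bytes of
    that author's known programs.
    """
    target = ngram_profile(unknown, n, size)
    return max(
        training,
        key=lambda author: profile_overlap(
            target, ngram_profile(training[author], n, size)
        ),
    )


if __name__ == "__main__":
    # Toy samples: the two candidates differ mainly in spacing and brace style.
    training = {
        "alice": b"for (int i = 0; i < n; i++) {\n    total += values[i];\n}\n",
        "bob": b"for(int idx=0;idx<n;++idx)\n{\n\ttotal+=values[idx];\n}\n",
    }
    unknown = b"for (int j = 0; j < m; j++) {\n    sum += data[j];\n}\n"
    print(most_likely_author(unknown, training))
```

Because the profile is built from raw bytes, the sketch needs no parser and no language-specific features, which is consistent with the language independence claimed above. Layout habits such as spacing and brace placement also contribute n-grams, so a profile can still be built from code that contains no comments.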
Index Terms
- Effective identification of source code authors using byte-level information