Machine learning algorithm for grading open-ended physics questions in Turkish

Education and Information Technologies

Abstract

Worldwide, open-ended questions that require short answers are used in many science assessments, such as the Programme for International Student Assessment (PISA) and the Trends in International Mathematics and Science Study (TIMSS). In Turkey, however, many national examinations, especially the high school and university entrance exams, rely on multiple-choice questions. This study aims to develop an objective and practical automatic scoring model for open-ended questions using machine learning algorithms. To this end, an automated scoring model was constructed for four physics questions from a university-level course, with the participation of 246 undergraduate students. The short-answer scoring was handled through an approach designed for students’ answers in Turkish. After data preprocessing, machine learning classification techniques such as SVM (Support Vector Machines), Gini decision trees, KNN (k-Nearest Neighbors), bagging, and boosting were applied. Each predictive model was evaluated in terms of accuracy, precision, and F1-score, and the AdaBoost.M1 technique showed the best performance. In this paper, we report on a short-answer grading system for Turkish, based on a machine learning approach and a dataset constructed from a physics course taught in Turkish. This is also the first study in the field of open-ended exam scoring in Turkish.
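As a rough illustration only (this is not the authors’ implementation, which was built around Turkish-language preprocessing), the following Python sketch shows the general shape of such a pipeline: bag-of-words features feed a boosted classifier, and held-out answers are scored with accuracy, precision, and F1. The file name, column names, TF-IDF features, and scikit-learn’s multi-class AdaBoost (SAMME) standing in for AdaBoost.M1 are all assumptions.

```python
# Hedged sketch, not the paper's code: TF-IDF features plus boosted trees
# for grading short answers, evaluated with accuracy, precision, and F1.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical dataset: one row per student answer with a rubric score label.
df = pd.read_csv("question1_answers.csv")  # assumed columns: "answer", "score"

X_train, X_test, y_train, y_test = train_test_split(
    df["answer"], df["score"], test_size=0.3, random_state=42, stratify=df["score"]
)

# Bag-of-words features followed by boosting over shallow decision trees
# (scikit-learn's multi-class SAMME variant, used here in place of AdaBoost.M1).
model = make_pipeline(
    TfidfVectorizer(),
    AdaBoostClassifier(n_estimators=100, random_state=42),
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred, average="macro", zero_division=0))
print("F1-score :", f1_score(y_test, pred, average="macro", zero_division=0))
```

In practice, the feature extraction and boosting parameters would need to be adapted to the Turkish preprocessing and rubric categories described in the paper.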



Author information


Corresponding author

Correspondence to Elif Ince.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Highlights

• In this study, the AdaBoost.M1 algorithm achieved very high performance in scoring four physics questions that were quite different from one another and rather difficult.

• When machine learning algorithms are used to score open-ended questions, the system imitates the domain expert; in this research, the model was constructed with the methods that came closest to human scoring.

• If open-ended questions were included in the national selection and placement exams in Turkey, the AdaBoost.M1 technique could be applied successfully to score them.

Electronic supplementary material

ESM 1

(DOCX 60 kb)

Appendices

Appendix 1

1.1 Open-Ended Questions and an Example from Students’ Handwritten Answers for each Question


Appendix 2

2.1 Accuracy, precision, and F1-score performance measures for each category in the testing data sets, as shown in Figs. 3, 4, 5, and 6

Fig. 3. The comparison of accuracy, precision and F1-score of the algorithms at ten different random iterations for Question 1

Fig. 4. The comparison of accuracy, precision and F1-score of the algorithms at ten different random iterations for Question 2

Fig. 5. The comparison of accuracy, precision and F1-score of the algorithms at ten different random iterations for Question 3

Fig. 6. The comparison of accuracy, precision and F1-score of the algorithms at ten different random iterations for Question 4
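For orientation, the evaluation protocol behind these figures (each algorithm retrained and scored on ten different random train/test splits) could be sketched as follows. The 70/30 split, TF-IDF features, and the specific classifier settings are illustrative assumptions rather than the paper’s exact configuration.

```python
# Hedged sketch of the repeated-split comparison reported in Figs. 3-6:
# each algorithm is retrained on ten random splits, and accuracy, macro
# precision, and macro F1 are recorded per iteration.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed classifier settings; the paper's hyperparameters are not given here.
CLASSIFIERS = {
    "SVM": SVC(kernel="linear"),
    "Gini (decision tree)": DecisionTreeClassifier(criterion="gini"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Bagging": BaggingClassifier(n_estimators=100),
    "AdaBoost.M1 (SAMME)": AdaBoostClassifier(n_estimators=100),
}

def compare_algorithms(answers, scores, n_iterations=10):
    """Yield (algorithm, iteration, accuracy, precision, F1) per random split."""
    for seed in range(n_iterations):
        X_tr, X_te, y_tr, y_te = train_test_split(
            answers, scores, test_size=0.3, random_state=seed, stratify=scores
        )
        for name, clf in CLASSIFIERS.items():
            model = make_pipeline(TfidfVectorizer(), clf)
            model.fit(X_tr, y_tr)
            pred = model.predict(X_te)
            yield (
                name,
                seed,
                accuracy_score(y_te, pred),
                precision_score(y_te, pred, average="macro", zero_division=0),
                f1_score(y_te, pred, average="macro", zero_division=0),
            )
```

Plotting the per-iteration tuples for each algorithm would reproduce the general layout of the comparisons shown in Figs. 3, 4, 5, and 6.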


About this article


Cite this article

Çınar, A., Ince, E., Gezer, M. et al. Machine learning algorithm for grading open-ended physics questions in Turkish. Educ Inf Technol 25, 3821–3844 (2020). https://doi.org/10.1007/s10639-020-10128-0

