Machine learning algorithm for grading open-ended physics questions in Turkish

Education and Information Technologies

Abstract

Worldwide, open-ended questions that require short answers are used in many science assessments, such as the Programme for International Student Assessment (PISA) and the Trends in International Mathematics and Science Study (TIMSS). In Turkey, however, many national examinations, especially the high school and university entrance exams, rely on multiple-choice questions. This study aims to develop an objective and practical automatic scoring model for open-ended questions using machine learning algorithms. To this end, an automated scoring model was constructed for four physics questions from a university-level course, with the participation of 246 undergraduate students. The short-answer scoring was handled through an approach designed for students’ answers in Turkish. After data preprocessing, machine learning classification techniques such as SVM (Support Vector Machines), Gini decision trees, KNN (k-Nearest Neighbors), bagging, and boosting were applied. Each predictive model was evaluated in terms of accuracy, precision, and F1-score, and the AdaBoost.M1 technique showed the best performance. In this paper, we report on a short-answer grading system for Turkish, based on a machine learning approach and a dataset constructed from a physics course taught in Turkish. This is also the first study in the field of open-ended exam scoring in Turkish.
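As a rough illustration only (this is not the authors’ implementation, which was built around Turkish-language preprocessing), the following Python sketch shows the general shape of such a pipeline: bag-of-words features feed a boosted classifier, and held-out answers are scored with accuracy, precision, and F1. The file name, column names, TF-IDF features, and scikit-learn’s multi-class AdaBoost (SAMME) standing in for AdaBoost.M1 are all assumptions.

```python
# Hedged sketch, not the paper's code: TF-IDF features plus boosted trees
# for grading short answers, evaluated with accuracy, precision, and F1.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical dataset: one row per student answer with a rubric score label.
df = pd.read_csv("question1_answers.csv")  # assumed columns: "answer", "score"

X_train, X_test, y_train, y_test = train_test_split(
    df["answer"], df["score"], test_size=0.3, random_state=42, stratify=df["score"]
)

# Bag-of-words features followed by boosting over shallow decision trees
# (scikit-learn's multi-class SAMME variant, used here in place of AdaBoost.M1).
model = make_pipeline(
    TfidfVectorizer(),
    AdaBoostClassifier(n_estimators=100, random_state=42),
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred, average="macro", zero_division=0))
print("F1-score :", f1_score(y_test, pred, average="macro", zero_division=0))
```

In practice, the feature extraction and boosting parameters would need to be adapted to the Turkish preprocessing and rubric categories described in the paper.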



Author information


Corresponding author

Correspondence to Elif Ince.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Highlights

• In this study, the AdaBoost.M1 algorithm achieved very high performance in scoring four physics questions that were quite different from one another and rather difficult.

• When machine learning algorithms are used to score open-ended questions, the system imitates the domain expert; in this research, the model was constructed with the methods that came closest to human scoring.

• If open-ended questions were included in the national selection and placement exams in Turkey, the AdaBoost.M1 technique could be applied successfully to score them.

Electronic supplementary material

ESM 1

(DOCX 60 kb)

Appendices

Appendix 1

1.1 Open-Ended Questions and an Example from Students’ Handwritten Answers for each Question


Appendix 2

2.1 Accuracy, precision, and F1-score performance measures for each category in the testing data sets, as shown in Figs. 3, 4, 5, and 6

Fig. 3. The comparison of accuracy, precision and F1-score of the algorithms at ten different random iterations for Question 1

Fig. 4. The comparison of accuracy, precision and F1-score of the algorithms at ten different random iterations for Question 2

Fig. 5. The comparison of accuracy, precision and F1-score of the algorithms at ten different random iterations for Question 3

Fig. 6. The comparison of accuracy, precision and F1-score of the algorithms at ten different random iterations for Question 4
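For orientation, the evaluation protocol behind these figures (each algorithm retrained and scored on ten different random train/test splits) could be sketched as follows. The 70/30 split, TF-IDF features, and the specific classifier settings are illustrative assumptions rather than the paper’s exact configuration.

```python
# Hedged sketch of the repeated-split comparison reported in Figs. 3-6:
# each algorithm is retrained on ten random splits, and accuracy, macro
# precision, and macro F1 are recorded per iteration.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed classifier settings; the paper's hyperparameters are not given here.
CLASSIFIERS = {
    "SVM": SVC(kernel="linear"),
    "Gini (decision tree)": DecisionTreeClassifier(criterion="gini"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Bagging": BaggingClassifier(n_estimators=100),
    "AdaBoost.M1 (SAMME)": AdaBoostClassifier(n_estimators=100),
}

def compare_algorithms(answers, scores, n_iterations=10):
    """Yield (algorithm, iteration, accuracy, precision, F1) per random split."""
    for seed in range(n_iterations):
        X_tr, X_te, y_tr, y_te = train_test_split(
            answers, scores, test_size=0.3, random_state=seed, stratify=scores
        )
        for name, clf in CLASSIFIERS.items():
            model = make_pipeline(TfidfVectorizer(), clf)
            model.fit(X_tr, y_tr)
            pred = model.predict(X_te)
            yield (
                name,
                seed,
                accuracy_score(y_te, pred),
                precision_score(y_te, pred, average="macro", zero_division=0),
                f1_score(y_te, pred, average="macro", zero_division=0),
            )
```

Plotting the per-iteration tuples for each algorithm would reproduce the general layout of the comparisons shown in Figs. 3, 4, 5, and 6.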


About this article


Cite this article

Çınar, A., Ince, E., Gezer, M. et al. Machine learning algorithm for grading open-ended physics questions in Turkish. Educ Inf Technol 25, 3821–3844 (2020). https://doi.org/10.1007/s10639-020-10128-0

