The Comparison of Reliability and Validity of the Raters' Scoring Criteria with Different Characteristics of Raters of Essay Tests for Measuring Scientific Competence of Grade 9 Students

Phanida Changwa
Prakittiya Tuksino

Abstract

This research aimed 1) to study the inter-rater reliability of the raters' scoring criteria for an essay test measuring scientific competence, using the intraclass correlation coefficient (ICC), under different rater characteristics; 2) to compare the validity of the raters' scoring criteria against the holistic scoring rubric of the essay test, using rater agreement, under different rater characteristics; and 3) to compare the generalizability coefficient (G-coefficient), under different rater characteristics, for the crossed design [p x i x r] and the nested design [p x (i : r)]. The sample was divided into 2 groups: a group of 100 Grade 9 students and a group of raters. The group of raters comprised 3 raters who were science majors and 3 raters who were non-science majors. The research instruments were 1) an essay test measuring the scientific competence of Grade 9 students, built around 3 situations and containing 9 questions; and 2) a holistic scoring rubric. Generalizability coefficients were analyzed with EduG. The research findings were as follows. 1) The reliability of the scoring results for each item, analyzed by the ICC, ranged from low to very good for all raters in the group, both science majors and non-science majors. 2) The validity of the raters' scoring criteria for each item, analyzed by rater agreement between the raters' scores (x) and the standard scores (y), showed that the 3 raters who were science majors had agreement indices from 14 percent to 84 percent, and the 3 raters who were non-science majors had agreement indices from 27 percent to 89 percent. 3) The generalizability coefficients of the p x (i : r) design were higher than those of the p x i x r design for all raters in the group.
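The three statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not say which ICC form, which agreement index, or which G-coefficient (relative or absolute) was used, and the G-coefficients in the study were obtained with EduG. The sketch assumes ICC(2,1) (two-way random effects, absolute agreement, single rater), exact percent agreement with the standard score, and the relative G-coefficient. All function and parameter names are hypothetical.

```python
def percent_agreement(rater_scores, standard_scores):
    """Percent of responses where the rater's score (x) equals the standard score (y).

    Assumption: "agreement" means an exact match on the rubric level.
    """
    if len(rater_scores) != len(standard_scores):
        raise ValueError("both score lists must have the same length")
    matches = sum(1 for x, y in zip(rater_scores, standard_scores) if x == y)
    return 100.0 * matches / len(rater_scores)


def icc_2_1(scores):
    """ICC(2,1) from a students-by-raters table (list of rows, one row per student).

    Assumption: two-way random effects, absolute agreement, single rater.
    Undefined (division by zero) if every score in the table is identical.
    """
    n = len(scores)            # number of students
    k = len(scores[0])         # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-student mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def g_crossed(var_p, var_pi, var_pr, var_pir_e, n_i, n_r):
    """Relative G-coefficient for the crossed design p x i x r.

    Inputs are estimated variance components; n_i items, n_r raters.
    """
    rel_error = var_pi / n_i + var_pr / n_r + var_pir_e / (n_i * n_r)
    return var_p / (var_p + rel_error)


def g_nested(var_p, var_pr, var_pi_r_e, n_i, n_r):
    """Relative G-coefficient for the nested design p x (i : r).

    Items are nested within raters; n_i is the number of items per rater.
    """
    rel_error = var_pr / n_r + var_pi_r_e / (n_i * n_r)
    return var_p / (var_p + rel_error)
```

In this sketch, `icc_2_1` takes one row per student and one column per rater, while `g_crossed` and `g_nested` take variance components that have already been estimated elsewhere. The two designs split the error variance differently, so their coefficients are not directly interchangeable.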

