Document Type: Research Article

Author

Rajab Esfandiari

Associate Professor, Department of English Language, Faculty of Humanities, Imam Khomeini International University, Qazvin, Iran

Abstract

In rater-mediated assessments, the ratings awarded to language learners' written or spoken performances do not necessarily reflect their language abilities, because a number of construct-irrelevant factors may affect the scores they are awarded. Rater subjectivity and rating scales are among the variables that can influence the final results. The purpose of the present study was to examine the extent to which the ratings awarded to university students' essays reflected the effect of these two factors. To that end, 150 Iranian EFL teachers rated ten five-paragraph essays that BA students had written as a course requirement at Imam Khomeini International University. The raters used two rating scales to rate the essays on a number of assessment criteria. The study rested on a partial rating design, and the Rasch-based computer program FACETS was used to analyze the data. Results of the FACETS analyses showed that raters differed considerably in the severity they exercised when rating the essays. The results also revealed rater bias interactions with the holistic rating scale. The implications of the findings for procedures to reduce the effects of such extraneous variables are discussed.
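In the many-facet Rasch model underlying a FACETS analysis of this kind (Eckes, 2015; Linacre, 2007), rater severity is estimated as a separate facet alongside student ability and criterion difficulty. A standard three-facet rating scale formulation, given here as a sketch of the general model rather than the study's exact specification, is

\log \left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = \theta_n - \delta_i - \alpha_j - \tau_k

where P_{nijk} is the probability that student n receives category k rather than category k-1 from rater j on criterion i, \theta_n is the student's ability, \delta_i the criterion's difficulty, \alpha_j the rater's severity, and \tau_k the threshold between categories k-1 and k. Bias interactions of the kind reported above, such as rater-by-scale interactions, are detected by adding an interaction term to this model and flagging combinations whose residuals depart significantly from model expectations.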

Keywords

  • Holistic scale
  • Analytic scale
  • Bias
  • Rater subjectivity
  • Severity

Article Title [Persian]

Using the many-facet Rasch model to examine English BA students' essays in rater-mediated assessments

Author [Persian]

  • Dr. Rajab Esfandiari

Associate Professor, Department of English Language, Faculty of Humanities, Imam Khomeini International University, Qazvin, Iran

Abstract [Persian]

In rater-mediated assessments, the scores awarded to language learners' written or spoken performances do not necessarily reflect their language ability, because other factors can affect the final results. Rater subjectivity and rating scales are among the factors influencing the measurement of learners' ability. The purpose of the present study was to examine these factors in the scores awarded to students' essays. To that end, 150 Iranian raters were asked to rate ten essays that BA students had written for the "Essay Writing" course, using rating scales and assessment criteria. The data were analyzed using the FACETS software, and the results showed that raters exercised different degrees of severity when awarding scores. The results also indicated that raters showed bias toward the holistic rating scale. The pedagogical implications of the findings for reducing rater bias toward rating scales and assessment criteria, with a view to improving scores, are discussed.

Keywords [Persian]

  • Holistic scale
  • Analytic scale
  • Bias
  • Rater subjectivity
  • Severity

References

Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater judgments in a performance test of foreign language speaking. Language Testing, 12(2), 238-257. https://doi.org/10.1177/026553229501200206
Barkaoui, K. (2010). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54-74. https://doi.org/10.1080/15434300903464418
Bond, T., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences. Routledge.
Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89-110. https://doi.org/10.1191/0265532203lt245oa
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). Harper and Row.
Crusan, D. (2010). Assessment in the second language writing classroom. University of Michigan Press.
Crusan, D. (2015). Dance, ten; looks, three: Why rubrics matter [Editorial]. Assessing Writing, 26(1), 1-4. https://doi.org/10.1016/j.asw.2015.08.002
Dempsey, M. S., PytlikZillig, L. M., & Bruning, R. H. (2009). Helping preservice teachers learn to assess writing: Practice and feedback in a Web-based environment. Assessing Writing, 14(1), 38-61. https://doi.org/10.1016/j.asw.2008.12.003
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197-221. https://doi.org/10.1207/s15434311laq0203_2
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155-185. https://doi.org/10.1177/0265532207086780
Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd ed.). Peter Lang.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2(3), 175-196. https://doi.org/10.1207/s15434311laq0203_1
Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93-112. https://doi.org/10.1111/j.1745-3984.1994.tb00436.x
Engelhard, G., & Wind, S. A. (2017). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. Routledge.
Farhady, H., Jafarpour, A., & Birjandi, P. (1994). Testing language skills: From theory to practice. The Organization for Researching and Composing University Textbooks in the Humanities (SAMT).
Ferris, D. R., & Hedgcock, J. S. (2014). Teaching L2 composition: Purpose, process, and practice (3rd ed.). Routledge.
Hamp-Lyons, L. (1991). Second language writing: Assessment issues. In B. Kroll (Ed.), Second language writing: Research insights for the classroom (pp. 69-78). Cambridge University Press.
Hamp-Lyons, L. (2011). Writing assessment: Shifting issues, new tools, enduring questions. Assessing Writing, 16(1), 3-5. https://doi.org/10.1016/j.asw.2010.12.001
Harsch, C., & Martin, G. (2013). Comparing holistic and analytic scoring methods: Issues of validity and reliability. Assessment in Education: Principles, Policy & Practice, 20(3), 281-307. https://doi.org/10.1080/0969594X.2012.742422
Hyland, K., & Anan, E. (2006). Teachers' perceptions of error: The effects of first language and experience. System, 34(4), 509-519. https://doi.org/10.1016/j.system.2006.09.001
Isaacs, T., & Thomson, R. I. (2013). Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions. Language Assessment Quarterly, 10(2), 135-159. https://doi.org/10.1080/15434303.2013.769545
Jacobs, H. L., Zinkgraf, S. A., Wormuth, D. R., Hartfiel, V. F., & Hughey, J. B. (1981). Testing ESL composition: A practical approach. Newbury House.
Kneeland, N. (1929). That lenient tendency in rating. Personnel Journal, 7, 356-366.
Knoch, U. (2011). Rating scales for diagnostic assessment of writing: What should they look like and where should the criteria come from? Assessing Writing, 16(2), 81-96. https://doi.org/10.1016/j.asw.2011.02.003
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12(1), 26-43. https://doi.org/10.1016/j.asw.2007.04.001
Knoch, U., Zhang, B. Y., Elder, C., Flynn, F., Huisman, A., Woodward-Kron, R., Manias, E., & McNamara, T. (2020). I will go to my grave fighting for grammar: Exploring the ability of language-trained raters to implement a professionally-relevant rating scale for writing. Assessing Writing, 46, 1-14. https://doi.org/10.1016/j.asw.2020.100488
Kondo-Brown, K. (2002). A FACETS analysis of rater bias in measuring Japanese second language writing performance. Language Testing, 19(1), 3-31. https://doi.org/10.1191/0265532202lt218oa
Kuiken, F., & Vedder, I. (2014). Rating written performance: What do raters do and why? Language Testing, 31(3), 329-348. https://doi.org/10.1177/0265532214526174
Lee, H. K. (2009). Native and nonnative rater behavior in grading Korean students' English essays. Asia Pacific Education Review, 10(3), 387-397. https://doi.org/10.1007/s12564-009-9030-3
Lim, G. S. (2012). Developing and validating a mark scheme for Writing. Cambridge ESOL: Research Notes, 49, 6–9.
Linacre, J. M. (2004). Optimizing rating scale category effectiveness. In E. V. Smith & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 258-278). JAM Press.
Linacre, J. M. (2007). Facets Rasch measurement computer program (Version 3.64.2) [Computer software]. Winsteps.com.
Linacre, J. M. (2011). FACETS (Version 3.68.1) [Computer software]. MESA Press.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71. https://doi.org/10.1177/026553229501200104
Marefat, F., & Heydari, M. (2016). Native and Iranian teachers' perceptions and evaluation of Iranian students' English essays. Assessing Writing, 27(1), 24-36. https://doi.org/10.1016/j.asw.2015.10.001
McNamara, T. F. (1996). Measuring second language performance. Addison Wesley Longman.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). American Council on Education and Macmillan.
Mousavi, S. A. (2012). An encyclopedic dictionary of language testing. Rahnama Press.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189-227.
North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats (TOEFL Monograph No. 24). Educational Testing Service.
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88(2), 413-428. https://doi.org/10.1037/0033-2909.88.2.413
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22(1), 1-30. https://doi.org/10.1191/0265532205lt295oa
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language speaking ability: Test method and learner discourse. Language Testing, 16(1), 82-111. https://doi.org/10.1177/026553229901600105
Weigle, S. C. (2002). Assessing writing. Cambridge University Press.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Palgrave Macmillan.
White, E. M. (1985). Teaching and assessing writing. Jossey-Bass.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305-335. https://doi.org/10.1177/026553229301000306
Wigglesworth, G. (1994). Patterns of rater behaviour in the assessment of an oral interaction test. Australian Review of Applied Linguistics, 17(2), 77-103. https://doi.org/10.1075/aral.17.2.04wig
Wind, S. A. (2020). Do raters use rating scale categories consistently across analytic rubric domains in writing assessment? Assessing Writing, 43, 1-14. https://doi.org/10.1016/j.asw.2019.100416