Document Type : Research Articles


Associate Professor, Department of English Language, Faculty of Humanities, Imam Khomeini International University, Qazvin, Iran



In rater-mediated assessments, the ratings awarded to language learners’ written or spoken performances do not necessarily reflect their language abilities, because a number of construct-irrelevant factors may affect the ratings they receive. Rater subjectivity and rating scales are among the variables that may influence the final results. The purpose of the present study was to examine the extent to which the ratings awarded to university students’ essays reflected the effects of these two factors. To that end, 150 Iranian EFL teachers rated ten five-paragraph essays that BA students had written as a course requirement at Imam Khomeini International University. The raters used two rating scales to rate the essays on a number of assessment criteria. The study used a partial rating design, and the Rasch-based computer program FACETS was used to analyze the data. The FACETS analyses showed that raters differed considerably in the degree of severity they exercised when rating the essays. The results also showed bias interactions between raters and the holistic rating scale. The implications of the findings for developing procedures to reduce the effects of such extraneous variables are discussed.


Article Title [Persian]

Using the Many-Facet Rasch Model to Examine the Essays of Undergraduate English Students in Rater-Mediated Assessments

Author [Persian]

  • Dr. Rajab Esfandiari

Associate Professor, Department of English Language, Faculty of Humanities, Imam Khomeini International University, Qazvin, Iran

Abstract [Persian]

In rater-mediated assessments, the scores given to language learners’ written or spoken performance do not necessarily reflect their language ability, because other factors can affect the final results. Rater subjectivity and rating scales are among the factors that influence the measurement of learners’ ability. The purpose of the present study was to examine these factors in the scores given to students’ essays. To that end, 150 Iranian raters were asked to evaluate ten essays that undergraduate students had written in an “Essay Writing” course, using rating scales and assessment criteria. The data were analyzed with the FACETS software, and the results showed that raters exercised different degrees of severity when scoring. The results also indicated that raters showed bias toward the holistic rating scale. The pedagogical implications of the findings for reducing rater bias toward rating scales and assessment criteria, with a view to improving scoring, are discussed.

Keywords [Persian]

  • Holistic scale
  • Analytic scale
  • Bias
  • Rater subjectivity
  • Severity