Bachman, L. F., Lynch, B. K., & Mason, M. (1995). Investigating variability in tasks and rater judgments in a performance test of foreign language speaking. Language Testing, 12(2), 238-257. https://doi.org/10.1177/026553229501200206
Bond, T., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences. Routledge.
Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89-110. https://doi.org/10.1191/0265532203lt245oa
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). Harper and Row.
Crusan, D. (2010). Assessment in the second language writing classroom. University of Michigan Press.
Crusan, D. (2015). Dance, ten; looks, three: Why rubrics matter [Editorial]. Assessing Writing, 26(1), 1–4. https://doi.org/10.1016/j.asw.2015.08.002
Dempsey, M. S., PytlikZillig, L. M., & Bruning, R. H. (2009). Helping preservice teachers learn to assess writing: Practice and feedback in a Web-based environment. Assessing Writing, 14(1), 38-61. https://doi.org/10.1016/j.asw.2008.12.003
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly: An International Journal, 2(3), 197-221. https://doi.org/10.1207/s15434311laq0203_2
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155–185. https://doi.org/10.1177/0265532207086780
Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd ed.). Peter Lang.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2(3), 175-196. https://doi.org/10.1207/s15434311laq0203_1
Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31(2), 93-112. https://doi.org/10.1111/j.1745-3984.1994.tb00436.x
Engelhard, G., & Wind, S. A. (2017). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. Routledge.
Farhady, H., Jafarpour, A., & Birjandi, P. (1994). Testing language skills: From theory to practice. The Organization for Researching and Composing University Textbooks in the Humanities (SAMT).
Ferris, D. R., & Hedgcock, J. S. (2014). Teaching L2 composition: Purpose, process, and practice (3rd ed.). Routledge.
Hamp-Lyons, L. (1991). Second language writing: Assessment issues. In B. Kroll (Ed.), Second language writing: Research insights for the classroom (pp. 69-78). Cambridge University Press.
Hamp-Lyons, L. (2011). Writing assessment: Shifting issues, new tools, enduring questions. Assessing Writing, 16(1), 3–5. https://doi.org/10.1016/j.asw.2010.12.001
Harsch, C., & Martin, G. (2013). Comparing holistic and analytic scoring methods: Issues of validity and reliability. Assessment in Education: Principles, Policy & Practice, 20(3), 281-307. https://doi.org/10.1080/0969594X.2012.742422
Isaacs, T., & Thomson, R. I. (2013). Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions. Language Assessment Quarterly, 10(2), 135-159. https://doi.org/10.1080/15434303.2013.769545
Jacobs, H. L., Zinkgraf, S. A., Wormuth, D. R., Hartfiel, V. F., & Hughey, J. B. (1981). Testing ESL composition: A practical approach. Newbury House.
Kneeland, N. (1929). That lenient tendency in rating. Personnel Journal, 7, 356-366.
Knoch, U. (2011). Rating scales for diagnostic assessment of writing: What should they look like and where should the criteria come from? Assessing Writing, 16(2), 81-96. https://doi.org/10.1016/j.asw.2011.02.003
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12(1), 26-43. https://doi.org/10.1016/j.asw.2007.04.001
Knoch, U., Zhang, B. Y., Elder, C., Flynn, F., Huisman, A., Woodward-Kron, R., Manias, E., & McNamara, T. (2020). I will go to my grave fighting for grammar: Exploring the ability of language-trained raters to implement a professionally-relevant rating scale for writing. Assessing Writing, 46, 1-14. https://doi.org/10.1016/j.asw.2020.100488
Kondo-Brown, K. (2002). A FACETS analysis of rater bias in measuring Japanese second language writing performance. Language Testing, 19(1), 3-31. https://doi.org/10.1191/0265532202lt218oa
Kuiken, F., & Vedder, I. (2014). Rating written performance: What do raters do and why? Language Testing, 31(3), 329-348. https://doi.org/10.1177/0265532214526174
Lee, H. K. (2009). Native and nonnative rater behavior in grading Korean students’ English essays. Asia Pacific Education Review, 10(3), 387-397. https://doi.org/10.1007/s12564-009-9030-3
Lim, G. S. (2012). Developing and validating a mark scheme for Writing. Cambridge ESOL: Research Notes, 49, 6–9.
Linacre, J. M. (2004). Optimizing rating scale category effectiveness. In E. V. Smith & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 258–278). JAM Press.
Linacre, J. M. (2007). Facets Rasch measurement computer program (Version 3.64.2) [Computer software]. Winsteps.com.
Linacre, J. M. (2011). FACETS (Version 3.68.1) [Computer software]. MESA Press.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71. https://doi.org/10.1177/026553229501200104
Marefat, F., & Heydari, M. (2016). Native and Iranian teachers’ perceptions and evaluation of Iranian students’ English essays. Assessing Writing, 27(1), 24-36. https://doi.org/10.1016/j.asw.2015.10.001
McNamara, T. F. (1996). Measuring second language performance. Addison Wesley Longman.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). American Council on Education and Macmillan.
Mousavi, S. A. (2012). An encyclopedic dictionary of language testing. Rahnama Press.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189-227.
North, B. (2003). Scales for rating language performance: Descriptive models, formulation styles, and presentation formats. TOEFL Monograph, 24, 1-106.
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88(2), 413-428. https://doi.org/10.1037/0033-2909.88.2.413
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22(1), 1–30. https://doi.org/10.1191/0265532205lt295oa
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language speaking ability: Test method and learner discourse. Language Testing, 16(1), 82–111. https://doi.org/10.1177/026553229901600105
Weigle, S. C. (2002). Assessing writing. Cambridge University Press.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Palgrave Macmillan.
White, E. M. (1985). Teaching and assessing writing. Jossey-Bass.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305-335. https://doi.org/10.1177/026553229301000306
Wigglesworth, G. (1994). Patterns of rater behaviour in the assessment of an oral interaction test. Australian Review of Applied Linguistics, 17(2), 77–103. https://doi.org/10.1075/aral.17.2.04wig
Wind, S. A. (2020). Do raters use rating scale categories consistently across analytic rubric domains in writing assessment? Assessing Writing, 43, 1-14. https://doi.org/10.1016/j.asw.2019.100416