The validity of comparative judgement: A comment on Kelly, Richardson and Isaacs

5 May 2023


Written by Dr Ian Jones and Prof Matthew Inglis. Ian is a Reader in Educational Assessment, and has led a programme of research into comparative judgement over the past decade. This blog post is a summary of the validity and reliability findings from that research. Matthew is a Professor of Mathematical Cognition and the co-director of the CMC. Edited by Bethany Woollacott.

We submitted this short note to Assessment in Education: Principles, Policy and Practice in response to a recent article they had published. Unfortunately the journal informed us that they have a policy against publishing responses to articles appearing in their journal, so we are instead posting it here.

Introduction

Kelly et al. (2022) critiqued what they called the “intrinsic validity” argument for using comparative judgement (CJ) methods in educational assessment. Specifically, they suggested that some researchers believe CJ is “intrinsically” valid because it relies upon the collective view of relevant domain experts. Kelly et al. dispute this view on three grounds:

They point out that what counts as an expert in a relevant domain is vague, and how it is operationalised varies between studies.
They worry that non-experts, such as school students, sometimes produce CJ outcomes that correlate with those derived from experts (Jones & Wheadon, 2015).
They suggest that relying upon holistic expert judgement is in tension with the common practice of removing ‘misfitting’ judges (judges who may have a different conception of the construct being assessed from the other judges involved).

We largely agree with Kelly et al. (2022) when they argue that “intrinsic validity” would be a weak basis upon which to enact significant educational reforms: the validity of any novel measurement method needs to be empirically assessed.

However, we were surprised to be cited as examples of researchers who believe that CJ is intrinsically valid. For instance, Kelly et al. cited one of our studies (Bisson, Gilmore, Inglis & Jones, 2019) as having endorsed the intrinsic validity of CJ, whereas in fact this study was designed precisely to examine empirically whether CJ was valid. We ran a randomised controlled trial using two outcome measures – one based on a CJ process, one based on a traditional standardised test – and compared the results.

Similarly, in another of our papers (Jones & Inglis, 2015), again cited by Kelly et al. as endorsing the “intrinsic validity” of CJ, we aimed to empirically investigate CJ’s validity, by comparing scores obtained from a CJ process with those obtained from a more traditional marking process. It is unclear why we would have conducted these, and other, validation studies if we felt that CJ was “intrinsically” valid.


In our view, there is now growing empirical evidence that CJ assessments often produce valid outcomes. Here we review some of this work from our research group, considering convergent validity, divergent validity, and content validity in turn.

Convergent Validity

A common technique for investigating the validity of a measure is to compare it with another measure of the same construct.
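As a minimal sketch of what such a comparison involves, the snippet below computes a Pearson correlation between CJ scores and scores from a second instrument. The data and variable names are hypothetical and purely illustrative; they are not taken from any of the studies discussed in this post.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one CJ score and one test score per student.
cj_scores = [-1.2, -0.4, 0.1, 0.6, 1.5]
test_scores = [35, 48, 50, 61, 72]

r = pearson_r(cj_scores, test_scores)
print(f"Correlation between CJ and test scores: r = {r:.2f}")
```

A high positive correlation is what one would expect if both measures tap the same construct, although, as noted below, the psychometric quality of the comparison instrument also limits how high the correlation can be.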

We have compared CJ scores with the outcomes of validated instruments for the case of understanding of concepts in calculus (Bisson et al., 2016; Bisson et al., 2019), algebra (Bisson et al., 2016; Jones et al., 2019), statistics (Bisson et al., 2016) and proof comprehension (Davies et al., 2020). These studies reported generally modest correlations between CJ and instrument scores, although a common theme was the poor psychometric performance of the traditional instruments compared with the high reliability of the CJ outcomes.

Elsewhere we have assessed student work using both CJ and specially-designed scoring rubrics in mathematics (Jones & Inglis, 2015) and science (McMahon & Jones, 2014), or generated CJ scores for previously graded exam scripts (Jones et al., 2014; Jones et al., 2016), and in all cases found positive and often high correlations.

We have also used proxy measures of the construct of interest, such as Grade Point Averages (Marshall et al., 2020), teacher estimates of achievement (Jones et al., 2013; Jones & Wheadon, 2015), previous exam grades (Jones & Karadeniz, 2016; Jones & Sirl, 2017) or predicted exam grades (Jones & Wheadon, 2015), again finding the kind of relationships one would expect if CJ scores were valid.

Divergent Validity

Another technique for investigating validity is to compare the measure of interest with a measure of a different construct, with the expectation that the two measures will not be strongly correlated.

For example, we have administered open-ended mathematics tasks that require students to draw on their written communication skills and then evaluated the extent to which CJ scores reflected mathematical understanding over and above communication skills. We have found that secondary students’ numeracy test scores predicted CJ scores, but their written communication scores did not (Bisson et al., 2019; Jones & Karadeniz, 2016). For primary school students, we found that both numeracy and written communication predicted CJ scores, suggesting that in some contexts care must be taken to partial out the variance in CJ scores explained by written communication skills (Jones et al., 2019).
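To illustrate what "over and above" means here, the sketch below computes standardised regression coefficients for a model with two predictors, using the textbook formula based on pairwise correlations. It is an illustration only, not the analysis used in the cited studies; the data are hypothetical and deliberately constructed so that the CJ scores depend on numeracy alone.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def standardised_betas(y, x1, x2):
    """Standardised coefficients for regressing y on x1 and x2 together.

    Each coefficient reflects the contribution of one predictor after
    accounting for the other. Assumes x1 and x2 are not perfectly correlated.
    """
    r_y1 = pearson_r(y, x1)
    r_y2 = pearson_r(y, x2)
    r_12 = pearson_r(x1, x2)
    denom = 1 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2


# Hypothetical data: the two predictors are correlated with each other,
# but the CJ scores are built from numeracy only.
numeracy = [10, 12, 15, 18, 20, 23]
writing = [14, 11, 17, 15, 21, 19]
cj_scores = [0.5 * n - 8 for n in numeracy]

beta_numeracy, beta_writing = standardised_betas(cj_scores, numeracy, writing)
print(f"numeracy: {beta_numeracy:.2f}, writing: {beta_writing:.2f}")
```

In this constructed case, writing is correlated with CJ scores when considered alone, yet its coefficient falls to zero once numeracy is in the model. That is the pattern consistent with divergent validity; non-zero coefficients for both predictors would instead suggest that written communication needs to be partialled out.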

One strand of research has explored the potential of CJ methods to measure conceptual understanding (related to underlying concepts and how they connect together) rather than procedural knowledge (related to computation and factual recall) in mathematics. In one case, we found that overall mathematics achievement, which acted as a proxy measure for conceptual understanding, predicted CJ scores, but a measure of procedural mathematical knowledge did not (Jones et al., 2013).

As noted by Kelly et al. (2022), we have also explored divergent validity by sampling groups of judges from different populations, and comparing the resultant CJ scores. For example, some of our studies (Davies et al., 2021; Jones & Alcock, 2014) compared CJ scores estimated from the judgements of experts (research mathematicians) with those estimated from the judgements of non-experts (postgraduate students who had not studied mathematics since age 16), and found that non-experts’ judgements were unreliable and their scores only modestly correlated with experts’ scores.

Similarly, our research into peer assessment, in which students comparatively judge one another’s work, has shown that peer judges’ CJ scores tend to be more reliable than those of non-experts but less reliable than those of experts, and peers’ scores are also correlated more highly with experts’ scores than are non-experts’ scores (Jones & Alcock, 2014; Jones & Wheadon, 2015; McMahon & Jones, 2015; Sirl & Jones, 2019). These findings suggest that Kelly et al.’s (2022) concern about non-experts’ CJ outcomes may be overstated.


Content Validity

To explore content validity we have drawn on expert review of test materials and of student responses.

For example, in one study (Jones & Inglis, 2015) we used expert review to investigate the content validity of a mathematics exam that was designed to be assessed with CJ. The exam was reviewed online by 94 mathematics teachers who then completed a survey comprising Likert-style and open-text items. In other studies, we have evaluated content validity in terms of students’ responses and CJ scores. Our approach has involved qualitatively coding students’ responses using existing frameworks (Jones & Karadeniz, 2016) or grounded approaches (Davies et al., 2020; Davies et al., 2021), and then regressing CJ scores on the codes to identify the features of high-scoring responses. We found that mathematician judges favoured definitions of mathematical proof that were consistent with characterisations offered by philosophers of mathematics (Davies et al., 2020).

We have further investigated content validity by studying the grounds on which judges made their judgements. This has included post-judging interviews with mathematicians (Davies & Jones, 2022), primary teachers (Hunter & Jones, 2018) and mathematics undergraduates (Jones & Alcock, 2014), as well as administering post-judging surveys across a range of studies (Jones et al., 2014; Jones & Alcock, 2014; Jones & Inglis, 2015; Jones & Sirl, 2017; Marshall et al., 2020). In one case we found that, after judging secondary student responses to a specially-designed summative exam, expert judges rated ‘originality and flair’ and use of ‘formal notation’ as positively influencing their decisions, and ‘errors’ and ‘untidy presentation’ as negatively influencing their decisions (Jones & Inglis, 2015).

Conclusion

In sum, we agree with Kelly et al. (2022) that relying upon “intrinsic validity” would be a poor basis upon which to incorporate CJ into educational assessment, a step that would constitute a significant reform. This is why we have conducted a programme of research to empirically examine the validity of CJ scores in a variety of contexts, using a variety of methods. We believe that the results are promising.

Direct link to Kelly et al.’s (2022) paper: https://doi.org/10.1080/0969594X.2022.2147901

References

Bisson, M.-J., Gilmore, C., Inglis, M., & Jones, I. (2016). Measuring conceptual understanding using comparative judgement. International Journal of Research in Undergraduate Mathematics Education, 2(2), 141–164. https://doi.org/10.1007/s40753-016-0024-3

Bisson, M.-J., Gilmore, C., Inglis, M., & Jones, I. (2019). Teaching using contextualised and decontextualised representations: Examining the case of differential calculus through a comparative judgement technique. Research in Mathematics Education, 22(3), 284–303. https://doi.org/10.1080/14794802.2019.1692060

Davies, B., Alcock, L., & Jones, I. (2020). Comparative judgement, proof summaries and proof comprehension. Educational Studies in Mathematics, 105(2), 181–197. https://doi.org/10.1007/s10649-020-09984-x

Davies, B., Alcock, L., & Jones, I. (2021). What do mathematicians mean by proof? A comparative-judgement study of students’ and mathematicians’ views. The Journal of Mathematical Behavior, 61, 100824. https://doi.org/10.1016/j.jmathb.2020.100824

Davies, B., & Jones, I. (2022). Assessing proof reading comprehension using summaries. International Journal of Research in Undergraduate Mathematics Education. https://doi.org/10.1007/s40753-021-00157-6

Hunter, J., & Jones, I. (2018). Free-response tasks in primary mathematics: A window on students’ thinking. Proceedings of the 41st Annual Conference of the Mathematics Education Research Group of Australasia, 41, 400–407. https://eric.ed.gov/?id=ED592426

Jones, I., & Alcock, L. (2014). Peer assessment without assessment criteria. Studies in Higher Education, 39(10), 1774–1787. https://doi.org/10.1080/03075079.2013.821974

Jones, I., Bisson, M.-J., Gilmore, C., & Inglis, M. (2019). Measuring conceptual understanding in randomised controlled trials: Can comparative judgement help? British Educational Research Journal, 45(3), 662–680. https://doi.org/10.1002/berj.3519

Jones, I., & Inglis, M. (2015). The problem of assessing problem solving: Can comparative judgement help? Educational Studies in Mathematics, 89(3), 337–355. https://doi.org/10.1007/s10649-015-9607-1

Jones, I., Inglis, M., Gilmore, C., & Hodgen, J. (2013). Measuring conceptual understanding: The case of fractions. In A. M. Lindmeier & A. Heinze (Eds.), Proceedings of the 37th Conference of the International Group for the Psychology of Mathematics Education (Vol. 3, pp. 113–120). Kiel, Germany. https://tinyurl.com/v8yn4545

Jones, I., & Karadeniz, I. (2016). An alternative approach to assessing achievement. In C. Csikos, A. Rausch, & J. Szitanyi (Eds.), The 40th Conference of the International Group for the Psychology of Mathematics Education (Vol. 3, pp. 51–58). IGPME. https://tinyurl.com/48vyjy8f

Jones, I., & Sirl, D. (2017). Peer assessment of mathematical understanding using comparative judgement. Nordic Studies in Mathematics Education, 22(4), 147–164. https://tinyurl.com/4p958nh6

Jones, I., & Wheadon, C. (2015). Peer assessment using comparative and absolute judgement. Studies in Educational Evaluation, 47, 93–101. https://doi.org/10.1016/j.stueduc.2015.09.004

Kelly, K. T., Richardson, M., & Isaacs, T. (2022). Critiquing the rationales for using comparative judgement: A call for clarity. Assessment in Education: Principles, Policy & Practice, 1–15. https://doi.org/10.1080/0969594X.2022.2147901

Marshall, N., Shaw, K., Hunter, J., & Jones, I. (2020). Assessment by comparative judgement: An application to secondary statistics and English in New Zealand. New Zealand Journal of Educational Studies, 55(1), 49–71. https://doi.org/10.1007/s40841-020-00163-3
