Proficiency scores in reading, mathematics and science are based on student responses to items that represent the assessment framework for each domain (see Chapter 2). While different students saw different questions, the test design, which ensured a significant overlap of items across different forms, made it possible to construct proficiency scales that are common to all students for each domain. In general, the PISA frameworks assume that a single continuous scale can be used to report overall proficiency in a domain; but this assumption is further verified during scaling (see below).
PISA proficiency scales are constructed using item-response-theory models, in which the likelihood that the test-taker responds correctly to any question is a function of the question’s characteristics (see below) and of the test-taker’s position on the scale. In other words, the test-taker’s proficiency is associated with a particular point on the scale that indicates the likelihood that he or she responds correctly to any question. Higher values on the scale indicate greater proficiency, which is equivalent to a greater likelihood of responding correctly to any question. A description of the modelling technique used to construct proficiency scales can be found in the PISA 2018 Technical Report (OECD, forthcoming[1]).
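To illustrate the relationship between proficiency and the likelihood of a correct response, a generic two-parameter logistic model can be written down (this is a simplified stand-in for the models actually used in PISA, whose exact specification is documented in the Technical Report). It expresses the probability of a correct response as a function of the distance between the test-taker's proficiency $\theta$ and the question's difficulty $b$, scaled by the question's discrimination $a$:

$$P(\text{correct} \mid \theta) \;=\; \frac{1}{1 + \exp\!\left[-a\,(\theta - b)\right]}$$

In this form, the probability equals 0.5 when $\theta = b$, rises towards 1 as $\theta$ moves above $b$, and falls towards 0 as $\theta$ moves below $b$.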
In the item-response-theory models used in PISA, the task characteristics are summarised by two parameters that represent task difficulty and task discrimination. The first parameter, task difficulty, is the point on the scale where there is at least a 50 % probability of a correct response by students who score at or above that point; higher values correspond to more difficult items. For the purpose of describing proficiency levels that represent mastery, PISA often reports the difficulty of a task as the point on the scale where there is at least a 62 % probability of a correct response by students who score at or above that point.1
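Under the generic two-parameter logistic form sketched above (an illustrative simplification, not the exact PISA parameterisation), the point on the scale at which the probability of a correct response reaches 62% can be located by solving $P(\theta) = 0.62$:

$$\frac{1}{1 + \exp\!\left[-a\,(\theta - b)\right]} = 0.62 \quad\Longrightarrow\quad \theta = b + \frac{1}{a}\ln\frac{0.62}{0.38} \;\approx\; b + \frac{0.49}{a}$$

so the difficulty reported under the 62% criterion lies somewhat above the 50% point, by an amount that shrinks as the item's discrimination increases.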
The second parameter, task discrimination, represents the rate at which the proportion of correct responses increases as a function of student proficiency. For an idealised highly discriminating item, close to 0 % of students respond correctly if their proficiency is below the item difficulty, and close to 100 % of students respond correctly as soon as their proficiency is above the item difficulty. In contrast, for weakly discriminating items, the probability of a correct response still increases as a function of student proficiency, but only gradually.
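The contrast between strongly and weakly discriminating items can be made concrete with a short numerical sketch. The snippet below uses the generic two-parameter logistic form introduced above, with made-up parameter values (they are not PISA item estimates), to show how the probability of a correct response changes with the distance between a student's proficiency and an item's difficulty for a highly discriminating item (a = 2.0) and a weakly discriminating one (a = 0.5).

```python
import numpy as np

def p_correct(theta, b, a):
    """Two-parameter logistic item response function: probability of a
    correct response given proficiency theta, item difficulty b and
    item discrimination a (illustrative, not the exact PISA model)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Two hypothetical items with the same difficulty (b = 0) but different
# discrimination; the parameter values are invented for this sketch.
distances = np.linspace(-3, 3, 7)            # proficiency minus item difficulty
p_high = p_correct(distances, b=0.0, a=2.0)  # highly discriminating item
p_low = p_correct(distances, b=0.0, a=0.5)   # weakly discriminating item

for d, hi, lo in zip(distances, p_high, p_low):
    print(f"theta - b = {d:+.1f}:  a=2.0 -> {hi:.2f}   a=0.5 -> {lo:.2f}")
```

Running this sketch shows the probabilities for the highly discriminating item jumping from close to 0 to close to 1 within a narrow band around the item's difficulty, whereas the probabilities for the weakly discriminating item rise only gradually over the same range.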
A single continuous scale can therefore show both the difficulty of questions and the proficiency of test-takers (see Figure I.2.1). By showing the difficulty of each question on this scale, it is possible to locate the level of proficiency in the domain that the question demands. By showing the proficiency of test-takers on the same scale, it is possible to describe each test-taker’s level of skill or literacy by the type of tasks that he or she can perform correctly most of the time.
Estimates of student proficiency are based on the kinds of tasks that students are expected to perform successfully. This means that students are likely to answer correctly questions located at or below the level of difficulty associated with their own position on the scale, and unlikely to answer correctly questions located above that level of difficulty.2
The further a student’s proficiency is located above a given test question, the more likely he or she is to answer the question successfully; the question’s discrimination parameter indicates how quickly this likelihood increases. Conversely, the further the student’s proficiency is located below a given question, the less likely he or she is to answer it successfully; in this case, the discrimination parameter indicates how quickly the likelihood decreases as the distance between the student’s proficiency and the question’s difficulty increases.
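To continue the illustrative sketch above: with a discrimination of 2.0, a student located one unit above the item’s difficulty answers correctly with a probability of about 0.88, and a student one unit below with a probability of about 0.12; with a discrimination of 0.5, the corresponding probabilities are roughly 0.62 and 0.38. These values follow from the simplified logistic form used in the sketch, not from the PISA item parameters themselves.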