The results of the PISA 2022 test are reported on a numerical scale of PISA score points. This section summarises the test-development and scaling procedures used to ensure that PISA score points are comparable across countries and with the results of previous PISA assessments.
Annex A1. The construction of reporting scales and of indices from the student context questionnaire
The construction of reporting scales
Assessment framework and test development
The first step in defining a reporting scale in PISA is developing a framework for each domain assessed. This framework provides a definition of what it means to be proficient in the domain; delimits and organises the domain according to different dimensions; and suggests the kind of test items and tasks that can be used to measure what students can do in the domain within the constraints of the PISA design (OECD, 2023[1]). These frameworks were developed by a group of international experts for each domain and agreed upon by the participating countries.
The second step is the development of the test questions (i.e. items) to assess proficiency in each domain. A consortium of testing organisations under contract to the OECD on behalf of participating governments develops new items and selects items from previous PISA tests (i.e. “trend items”) of the same domain. The expert group that developed the framework reviews these proposed items to confirm that they meet the requirements and specifications of the framework.
The third step is a qualitative review of the testing instruments by all participating countries and economies to ensure the items’ overall quality and appropriateness in their own national context. These ratings are considered when selecting the final pool of items for the assessment. Selected items are then translated and adapted to create national versions of the testing instruments. These national versions are verified by the PISA consortium.
The verified national versions of the items are then presented to a sample of 15-year-old students in all participating countries and economies as part of a field trial. This is to ensure that they meet stringent quantitative standards of technical quality and international comparability. In particular, the field trial serves to verify the psychometric equivalence of items across countries and economies (see Annex A6).
After the field trial, material is considered for rejection, revision or retention in the pool of potential items. The international expert group for each domain then formulates recommendations as to which items should be included in the main assessments. The final set of selected items is also subject to review by all countries and economies. This selection is balanced across the various dimensions specified in the framework and spans various levels of difficulty so that the entire pool of items measures performance across all component skills and a broad range of contexts and student abilities.
Proficiency scales for mathematics, reading, and science
Proficiency scores in mathematics, reading, and science are based on student responses to items that represent the assessment framework for each domain (see section above). While different students saw different questions, the test design, which ensured a significant overlap of items across different forms, made it possible to construct proficiency scales that are common to all students for each domain. In general, the PISA frameworks assume that a single continuous scale can be used to report overall proficiency in a domain but this assumption is further verified during scaling (see section below).
PISA proficiency scales are constructed using item-response-theory models in which the likelihood that the test-taker responds correctly to any question is a function of the question’s characteristics and of the test-taker’s position on the scale. In other words, the test-taker’s proficiency is associated with a particular point on the scale that indicates the likelihood that he or she responds correctly to any question. Higher values on the scale indicate greater proficiency, which is equivalent to a greater likelihood of responding correctly to any question. A description of the modelling technique used to construct proficiency scales can be found in the PISA 2022 Technical Report (OECD, Forthcoming[2]).
In the item-response-theory models used in PISA, the characteristics of test items are summarised by two parameters that represent task difficulty and task discrimination. The first parameter, task difficulty, is the point on the scale where there is at least a 50% probability of a correct response by students who score at or above that point; higher values correspond to more difficult items. For the purpose of describing proficiency levels that represent mastery, PISA often reports the difficulty of a task as the point on the scale where there is at least a 62% probability of a correct response by students who score at or above that point.
The second parameter, task discrimination, represents the rate at which the proportion of correct responses increases as a function of student proficiency. For an idealised highly discriminating item, close to 0% of students respond correctly if their proficiency is below the item difficulty and close to 100% of students respond correctly as soon as their proficiency is above the item difficulty. In contrast, for weakly discriminating items, the probability of a correct response still increases as a function of student proficiency, but only gradually.
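To illustrate how these two parameters shape the probability of a correct response, the sketch below implements a standard two-parameter logistic item response function. It is a simplified illustration of the class of models described above rather than the operational PISA scaling model, and all parameter values are hypothetical.

import math

def p_correct(theta, difficulty, discrimination):
    # Probability of a correct response under a two-parameter logistic model;
    # theta is the student's proficiency, difficulty and discrimination are item parameters.
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# At theta equal to the item difficulty, the probability of success is 50%.
# A highly and a weakly discriminating item of the same difficulty:
for label, a in (("high discrimination", 3.0), ("low discrimination", 0.5)):
    probs = [p_correct(theta, difficulty=0.0, discrimination=a) for theta in (-1.0, 0.0, 1.0)]
    print(label, [round(p, 2) for p in probs])

# Point on the scale where the probability of success reaches 62%,
# used when describing proficiency levels in terms of mastery.
a, b = 1.0, 0.0
theta_rp62 = b + math.log(0.62 / 0.38) / a
print(round(theta_rp62, 2))

For the highly discriminating item, the printed probabilities jump steeply around the difficulty point; for the weakly discriminating item, they increase only gradually, mirroring the description above.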
A single continuous scale can therefore show both the difficulty of questions and the proficiency of test-takers (see Figure I.A1.1). By showing the difficulty of each question on this scale, it is possible to locate the level of proficiency in the domain that the question demands. By showing the proficiency of test-takers on the same scale, it is possible to describe each test-taker’s level of skill or literacy by the type of tasks that he or she can perform correctly most of the time.
Estimates of student proficiency are based on the kinds of tasks students are expected to perform successfully. This means that students are likely to be able to successfully answer questions located at or below the level of difficulty associated with their own position on the scale. Conversely, they are unlikely to be able to successfully answer questions above the level of difficulty associated with their position on the scale.1
The higher a student’s proficiency level is located above a given test question, the more likely he or she can answer the question successfully. The discrimination parameter for this particular test question indicates how quickly the likelihood of a correct response increases. The further the student’s proficiency is located below a given question, the less likely he or she is able to answer the question successfully. In this case, the discrimination parameter indicates how fast this likelihood decreases as the distance between the student’s proficiency and the question’s difficulty increases.
How reporting scales are set and linked across multiple assessments
The reporting scale for each domain was originally established when the domain was the major focus of assessment in PISA for the first time: PISA 2000 for reading, PISA 2003 for mathematics and PISA 2006 for science.
The item-response-theory models used in PISA describe the relationship between student proficiency, item difficulty and item discrimination, but do not set a measurement unit for any of these parameters. In PISA, this measurement unit was chosen the first time a reporting scale was established. The score of “500” on the scale was defined as the average proficiency of students across OECD countries; “100 score points” was defined as the standard deviation (a measure of the variability) of proficiency across OECD countries.2
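As a schematic illustration of how such a metric can be fixed, the sketch below linearly transforms proficiency estimates so that the equally weighted OECD mean becomes 500 and the OECD standard deviation becomes 100. The function name and data are hypothetical; the operational procedure is described in the PISA 2022 Technical Report (OECD, Forthcoming[2]).

import statistics

def to_pisa_scale(thetas, oecd_thetas):
    # Map proficiency estimates onto a metric with OECD mean 500 and standard deviation 100
    mean = statistics.mean(oecd_thetas)
    sd = statistics.pstdev(oecd_thetas)
    return [500 + 100 * (t - mean) / sd for t in thetas]

# Hypothetical proficiency estimates on the latent metric
oecd_sample = [-1.2, -0.4, 0.0, 0.3, 1.1]
print([round(score) for score in to_pisa_scale(oecd_sample, oecd_sample)])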
To enable the measurement of trends, achievement data from successive assessments are reported on the same scale. It is possible to report results from different assessments on the same scale because in each assessment PISA retains a significant number of items from previous PISA assessments. These are known as trend items. All items used to assess reading and science in 2022 and a significant number of items used to assess mathematics (74 out of 234) were developed and already used in earlier assessments. Their difficulty and discrimination parameters were therefore already estimated in previous PISA assessments.
The answers to the trend questions from students in earlier PISA cycles, together with the answers from students in PISA 2022, were both considered when scaling PISA 2022 data to determine student proficiency, item difficulty and item discrimination. In particular, when scaling PISA 2022 data, item parameters for new items were freely estimated, but item parameters for trend items were initially fixed to their PISA 2018 values, which, in turn, were based on a concurrent calibration involving response data from multiple cycles. All constraints on trend item parameters were evaluated and, in some cases, released in order to better describe student-response patterns. See the PISA 2022 Technical Report (OECD, Forthcoming[2]) for details.
The extent to which the item characteristics estimated during the scaling of PISA 2022 data differ from those estimated in previous calibrations is summarised in the “link error”, a quantity (expressed in score points) that reflects the uncertainty in comparing PISA results over time. A link error of zero indicates a perfect match in the parameters across calibrations, while a non-zero link error indicates that the relative difficulty of certain items or the ability of certain items to discriminate between high and low achievers has changed over time, introducing greater uncertainty in trend comparisons.
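When mean scores from two cycles are compared, the link error is combined with the sampling errors of both estimates. The sketch below shows the usual form of this computation for a difference in mean scores; the numerical values are purely illustrative.

import math

def trend_standard_error(se_current, se_previous, link_error):
    # Standard error of a score-point difference between two PISA cycles, combining
    # sampling uncertainty in both cycles with the link error
    return math.sqrt(se_current ** 2 + se_previous ** 2 + link_error ** 2)

difference = 487.0 - 492.0  # illustrative mean scores in two cycles
se = trend_standard_error(se_current=2.4, se_previous=2.6, link_error=3.0)
print(round(difference, 1), round(se, 1), abs(difference) / se > 1.96)  # significant at the 5% level?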
How many scales per domain? Assessing the dimensionality of PISA domains
PISA frameworks for mathematics, reading, and science assume that a single continuous scale can summarise performance in each domain for all countries. This assumption is incorporated in the item-response-theory model used in PISA. Violations of this assumption therefore result in model misfit, and can be assessed by inspecting fit indices.
After the field trial, initial estimates of model fit for each item, and for each country and language group, provide indications about the plausibility of the uni-dimensionality assumption and about the equivalence of scales across countries. These initial estimates are used to refine the item set used in each domain: problematic items are sometimes corrected (e.g. if a translation error is detected); and coding and scoring rules can be amended (e.g. to suppress a partial-credit score that affected coding reliability, or to combine responses to two or more items when the probability of a correct response to one question appears to depend on the correct answer to an earlier question). Items can also be deleted after the field trial. Deletions are carefully balanced so that the set of retained items continues to provide a good balance of all aspects of the framework. After the main study, the estimates of model fit are mainly used to refine the scaling model (some limited changes to the scoring rules and item deletions can also be considered).
Despite the evidence in favour of a uni-dimensional scale for the “major” domain (i.e. mathematics in PISA 2022), PISA nevertheless provides multiple estimates of performance, in addition to the overall scale, through so-called “subscales”. Subscales represent different framework dimensions and provide a more nuanced picture of performance in a domain. Subscales within a domain are usually highly correlated across students (thus supporting the assumption that a coherent overall scale can be formed by combining items across subscales). Despite this high correlation, interesting differences in performance across subscales can often be observed at aggregate levels (across countries, across education systems within countries, or between boys and girls).
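The sketch below illustrates this distinction with hypothetical data: two subscale scores can be strongly correlated at the student level while group averages still differ in interesting ways.

import statistics

# Hypothetical pairs of subscale scores (subscale 1, subscale 2) for students in two groups
group_a = [(480, 470), (510, 498), (545, 532), (430, 425), (600, 585)]
group_b = [(470, 485), (505, 520), (540, 551), (425, 441), (590, 602)]
students = group_a + group_b

scores_1 = [s[0] for s in students]
scores_2 = [s[1] for s in students]
print("student-level correlation:", round(statistics.correlation(scores_1, scores_2), 3))

for name, group in (("group A", group_a), ("group B", group_b)):
    gap = statistics.mean(s[0] for s in group) - statistics.mean(s[1] for s in group)
    print(name, "mean difference between subscales:", round(gap, 1))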
Summary descriptions of the proficiency levels of mathematical subscales
Tables I.A1.1 to I.A1.8 (below) provide summary descriptions of proficiency levels on each mathematical subscale. In some mathematical subscales there were no test items in the PISA 2022 Mathematics assessment to describe skills at levels 1c or 1b.
PISA 2022 results on mathematics subscales are included in Annex B1 (for countries and economies) and Annex B2 (for regions within countries). Results on the percentage of students scoring at each proficiency level in mathematics subscales were estimated only for proficiency levels that had proficiency descriptors (i.e. test items measuring those levels).
Indices from the student context questionnaire
In addition to scale scores representing performance in mathematics, reading and science, this volume uses indices derived from the PISA student questionnaires to contextualise PISA 2022 results or to estimate trends that account for demographic changes over time. The following indices and database variables are used in this report.
The PISA index of economic, social and cultural status (ESCS)
The PISA index of economic, social and cultural status (ESCS) is a composite score derived, as in previous cycles, from three variables related to family background: parents’ highest level of education in years (PAREDINT), parents’ highest occupational status (HISEI), and home possessions (HOMEPOS).
Parents’ highest level of education in years: Students’ responses to questions ST005, ST006, ST007 and ST008 regarding their parents’ education were classified using ISCED-11 (UNESCO, 2012[3]). Indices on parental education were constructed by recoding educational qualifications into the following categories: (1) ISCED Level 02 (pre-primary education), (2) ISCED Level 1 (primary education), (3) ISCED Level 2 (lower secondary), (4) ISCED Level 3.3 (upper secondary education with no direct access to tertiary education), (5) ISCED Level 3.4 (upper secondary education with direct access to tertiary education), (6) ISCED Level 4 (post-secondary non-tertiary), (7) ISCED Level 5 (short-cycle tertiary education), (8) ISCED Level 6 (Bachelor’s or equivalent), (9) ISCED Level 7 (Master’s or equivalent) and (10) ISCED Level 8 (Doctoral or equivalent). Indices with these categories were provided for a student’s mother (MISCED) and father (FISCED). Where a student’s responses to ST005 and ST006 (for mother’s education) or to ST007 and ST008 (for father’s education) conflicted (e.g. if a student indicated in ST006 that a parent had a post-secondary qualification but indicated in ST005 that the parent had not completed lower secondary education), the higher education value provided by the student was used. This differs from the PISA 2018 procedure, in which the lower value was used. In addition, the index of highest education level of parents (HISCED) corresponded to the higher ISCED level of either parent. HISCED was also recoded into an estimated number of years of schooling (PAREDINT). The conversion from ISCED levels to years of education is common to all countries; it was determined using the cumulative years of education assigned in PISA 2018 to each ISCED level. The correspondence is available in the PISA 2022 Technical Report (OECD, Forthcoming[2]).
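A schematic sketch of this derivation is shown below. The logic of taking the higher parental value follows the description above, but the ISCED-to-years conversion values are placeholders; the actual international conversion table is given in the PISA 2022 Technical Report (OECD, Forthcoming[2]).

# Placeholder conversion: ISCED category code (1-10, as listed above) -> cumulative years of education.
# The real values are defined in the PISA 2022 Technical Report.
ISCED_TO_YEARS = {1: 0, 2: 6, 3: 9, 4: 11, 5: 12, 6: 13, 7: 14, 8: 15, 9: 17, 10: 20}

def paredint(misced, fisced):
    # Estimated years of schooling of the more highly educated parent;
    # misced and fisced are ISCED category codes (1-10), or None if missing.
    available = [level for level in (misced, fisced) if level is not None]
    if not available:
        return None
    hisced = max(available)  # highest education level of parents
    return ISCED_TO_YEARS[hisced]

print(paredint(misced=8, fisced=7))  # e.g. a Bachelor's degree and a short-cycle tertiary qualification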
To make PAREDINT scores for PISA 2012, PISA 2015, and PISA 2018 comparable to PAREDINT scores for PISA 2022, new PAREDINT scores were created for each student who participated in previous cycles using the coding scheme used in PISA 2022. These new PAREDINT scores were used in the computation of trend ESCS scores.
Parents’ highest occupational status: Occupational data for both the student’s father and the student’s mother were obtained from responses to open-ended questions. The responses were coded to four-digit ISCO codes (ILO, 2007) and then mapped to the international socio-economic index of occupational status (ISEI) (Ganzeboom and Treiman, 2003[4]). In PISA 2022, the ISCO and ISEI in their 2008 version were used. Three indices were calculated based on this information: father’s occupational status (BFMJ2); mother’s occupational status (BMMJ1); and the highest occupational status of parents (HISEI), which corresponds to the higher ISEI score of either parent or to the only available parent’s ISEI score. For all three indices, higher ISEI scores indicate higher levels of occupational status.
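The derivation of HISEI from the two parental indices can be sketched as follows; the ISEI values shown are hypothetical.

def hisei(bfmj2, bmmj1):
    # Highest occupational status of parents: the higher ISEI score of either parent,
    # or the only available parent's score; None if both are missing
    available = [value for value in (bfmj2, bmmj1) if value is not None]
    return max(available) if available else None

print(hisei(bfmj2=53, bmmj1=71))    # both parents reported
print(hisei(bfmj2=None, bmmj1=44))  # only one parent reported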
Home possessions (HOMEPOS) is a proxy measure for family wealth. In PISA 2022, students reported the availability of household items at home, including books at home and country-specific household items that were seen as appropriate measures of family wealth within the country’s context. HOMEPOS is a summary index of all household and possession items (ST250, ST251, ST253, ST254, ST255, ST256). Some HOMEPOS items used in PISA 2018 were removed in PISA 2022 while new ones were added (e.g. new items developed specifically with low-income countries in mind). Furthermore, some HOMEPOS items that were previously dichotomous (yes/no) were revised to polytomous items (1, 2, 3, etc.), allowing greater variation in responses to be captured.
For the purpose of computing the PISA index of economic, social and cultural status (ESCS), values for students with missing PAREDINT, HISEI or HOMEPOS were imputed with predicted values plus a random component based on a regression on the other two variables. If there were missing data on more than one of the three variables, ESCS was not computed and a missing value was assigned for ESCS.
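A minimal sketch of this imputation rule, assuming an ordinary-least-squares regression and normally distributed residuals, is given below; the operational procedure is documented in the PISA 2022 Technical Report (OECD, Forthcoming[2]), and all data shown are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def impute_component(y, x1, x2):
    # Impute missing values of one ESCS component (y) from the other two (x1, x2):
    # regression prediction plus a random component drawn from the residual spread
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), x1, x2])
    observed = ~np.isnan(y)
    coefficients, *_ = np.linalg.lstsq(X[observed], y[observed], rcond=None)
    residual_sd = np.std(y[observed] - X[observed] @ coefficients)
    imputed = y.copy()
    missing = ~observed
    imputed[missing] = X[missing] @ coefficients + rng.normal(0.0, residual_sd, missing.sum())
    return imputed

# Hypothetical standardised component scores; the third student is missing HOMEPOS
homepos = [0.4, -1.1, np.nan, 0.9, -0.2, 1.5]
paredint = [0.2, -0.8, 0.5, 1.0, -0.1, 1.3]
hisei = [0.1, -1.3, 0.3, 1.2, -0.4, 1.6]
print(np.round(impute_component(homepos, paredint, hisei), 2))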
In PISA 2022, ESCS was computed by attributing equal weight to the three standardised components. The three components were standardised across the OECD countries, with each OECD country contributing equally. The final ESCS variable was transformed so that 0 is the score of an average OECD student and 1 is the standard deviation across equally weighted OECD countries.
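The final steps can be sketched as follows, with hypothetical data standing in for the pooled, equally weighted OECD sample: each component is standardised, the three standardised components are averaged with equal weight, and the resulting composite is standardised once more so that its OECD mean is 0 and its OECD standard deviation is 1.

import statistics

def standardise(values, reference):
    # Standardise values using the mean and standard deviation of a reference sample
    # (here, a stand-in for the equally weighted pooled OECD sample)
    mean, sd = statistics.mean(reference), statistics.pstdev(reference)
    return [(v - mean) / sd for v in values]

# Hypothetical component scores
paredint = [12, 9, 16, 14, 11]
hisei = [53, 31, 76, 64, 40]
homepos = [0.2, -0.9, 1.1, 0.6, -0.3]

components = [standardise(c, c) for c in (paredint, hisei, homepos)]
escs_raw = [sum(triple) / 3 for triple in zip(*components)]  # equal weight to each component
escs = standardise(escs_raw, escs_raw)                       # mean 0 and standard deviation 1
print([round(value, 2) for value in escs])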
Immigrant background (IMMIG)
Information on the country of birth of the students and their parents was collected. Included in the database are three country-specific variables relating to the country of birth of the student, mother and father (ST019). The variables are binary and indicate whether the student, mother and father were born in the country of assessment or elsewhere. The index on immigrant background (IMMIG) is calculated from these variables, and has the following categories: (1) native students (those students who had at least one parent born in the country); (2) second-generation students (those born in the country of assessment but whose parent[s] were born in another country); and (3) first-generation students (those students born outside the country of assessment and whose parents were also born in another country). Students with a missing response for their own country of birth, or with missing responses for both parents, were assigned a missing value for this variable.
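A simplified sketch of this classification rule is shown below, assuming the three binary indicators described above (True meaning born in the country of assessment, None meaning a missing response). Students with one missing parental response are classified here from the reported parent, which is a simplification.

def immig(student_in_country, mother_in_country, father_in_country):
    # Returns 1 (native), 2 (second-generation), 3 (first-generation) or None (missing)
    parents = (mother_in_country, father_in_country)
    if student_in_country is None or all(p is None for p in parents):
        return None
    if any(p is True for p in parents):
        return 1  # native: at least one parent born in the country of assessment
    if student_in_country:
        return 2  # second-generation: student born in the country, parents born abroad
    return 3      # first-generation: student and parents born abroad

print(immig(True, False, False))   # second-generation
print(immig(False, False, False))  # first-generation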
Language spoken at home (ST022)
Students indicated what language they usually spoke at home, and the database includes an internationally comparable variable (ST022Q01TA) that was derived from this information and has the following categories: (1) language at home is same as the language of assessment for that student; (2) language at home is another language.
The mappings of options provided in national versions of the student questionnaire for the two possible values for the “International Language at Home” variable (ST022Q01TA) are the responsibility of national PISA centres. For example, for students in the Flemish Community of Belgium, “Flemish dialect” was considered (together with “Dutch”) as equivalent to the “Language of test”; for students in the French Community and German-speaking Community (respectively), Walloon (a French dialect) and a German dialect were considered to be equivalent to “Another language”.
Mathematics Anxiety (ANXMAT)
The index of mathematics anxiety (ANXMAT) was constructed using student responses to question ST345, in which students reported the extent to which they strongly agreed, agreed, disagreed or strongly disagreed with the following statements when asked to think about studying mathematics: “I often worry that it will be difficult for me in mathematics classes”; “I get very tense when I have to do mathematics homework”; “I get very nervous doing mathematics problems”; “I feel helpless when doing a mathematics problem”; “I worry that I will get poor <grades> in mathematics”; and “I feel anxious about failing in mathematics”.
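As a purely illustrative sketch, the code below codes the responses to the six statements numerically and averages them, with higher values indicating greater mathematics anxiety. This is not the operational derivation of ANXMAT, which follows the scaling procedures described in the PISA 2022 Technical Report (OECD, Forthcoming[2]).

RESPONSE_CODES = {"strongly disagree": 0, "disagree": 1, "agree": 2, "strongly agree": 3}

def simple_anxiety_score(responses):
    # Average the coded responses to the six ST345 statements; higher means more anxious.
    # Returns None if any statement was left unanswered or the number of responses is wrong.
    if len(responses) != 6 or any(r not in RESPONSE_CODES for r in responses):
        return None
    return sum(RESPONSE_CODES[r] for r in responses) / len(responses)

example = ["agree", "disagree", "agree", "strongly agree", "agree", "disagree"]
print(simple_anxiety_score(example))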
In addition to the indices listed above, the following database variables were used in this report.
Student gender (ST004)
Age of arrival in country of test (ST021) (only for students who were born in a country different from the country of test)
Food insecurity (ST258)
References
[4] Ganzeboom, H. and D. Treiman (2003), “Three Internationally Standardised Measures for Comparative Research on Occupational Status”, in Advances in Cross-National Comparison, Springer US, Boston, MA, https://doi.org/10.1007/978-1-4419-9186-7_9.
[1] OECD (2023), PISA 2022 Assessment and Analytical Framework, PISA, OECD Publishing, Paris, https://doi.org/10.1787/dfe0bf9c-en.
[2] OECD (Forthcoming), PISA 2022 Technical Report, PISA, OECD Publishing, Paris.
[3] UNESCO (2012), International Standard Classification of Education ISCED 2011.