Annex A6. Are PISA mathematics scores comparable across countries and languages?

The validity and reliability of PISA scores, and their comparability across countries and languages, are the key concerns that guide the development of the assessment instruments and the selection of the statistical model used to scale students’ responses. The procedures PISA uses to meet these goals include qualitative reviews of the final main study items, conducted by national experts, and statistical analyses of model fit in the context of multi-group item-response-theory models, which indicate the measurement equivalence of each item across groups defined by country and language.
Countries’ preferred items
National mathematics experts conducted qualitative reviews of the full set of items included in the PISA 2022 assessment at different stages of their development. The ratings and comments submitted by national experts informed revisions to items and coding guides for the main study, and guided the final selection of the item pool. In many cases, these revisions mitigated cultural concerns and improved test fairness.
At the end of 2021, the PISA consortium asked national experts to confirm or revise their original ratings with respect to the final instruments. Sixty-eight national centres submitted ratings of the relevance of PISA 2022 mathematics items for measuring students’ “preparedness for life” – a key aspect of the validity of PISA (response options were: “not relevant”, “somewhat relevant”, “highly relevant”). National experts also indicated whether the specific competences addressed by each item were within the scope of official curricula (“not in curriculum”, “in some curricula”, “standard curriculum material”). While PISA does not intend to measure only what students learn as part of the school curriculum, ratings of curriculum coverage for PISA items provide contextual indicators to understand countries’ strengths and weaknesses in the assessment.
On average across countries/economies, 81% of items were rated “highly relevant” to students’ preparedness for life (the highest possible rating); only 2% of items received the lowest rating (“not relevant”).
At the same time, national experts indicated a high degree of overlap between national curricula and the PISA mathematics item set. On average, 86% of items were rated as “standard curriculum material”, and only 3% of items were identified as “not in curriculum”. National experts from five countries – Kazakhstan, Norway, Peru, the Philippines and Thailand – indicated that all items used in PISA could be considered standard curriculum material in their country.
Table I.A6.1 provides a summary of the ratings received from national centres about the PISA 2022 set of mathematics items.
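The percentages cited above amount to averaging, across national centres, the within-country share of items at each rating level. A minimal sketch of that arithmetic follows, assuming a nested mapping from country to per-item rating labels (the data structure and field names are hypothetical, not the format of the actual national-centre submissions):

```python
def average_share(ratings, label):
    """Average, across countries, of the within-country share of
    items that received the given rating label."""
    shares = [
        sum(1 for r in items.values() if r == label) / len(items)
        for items in ratings.values()
    ]
    return sum(shares) / len(shares)

# Made-up example with two countries and three items each:
ratings = {
    "AAA": {"item1": "highly relevant", "item2": "highly relevant",
            "item3": "not relevant"},
    "BBB": {"item1": "highly relevant", "item2": "somewhat relevant",
            "item3": "highly relevant"},
}
print(average_share(ratings, "highly relevant"))  # (2/3 + 2/3) / 2
```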
National item deletions, item misfit, and item-by-country interactions
PISA reporting scales in mathematics, reading and science are linked across countries, survey cycles and delivery modes (paper and computer) through common items whose parameters are constrained to the same values and which can therefore serve as “anchors” on the reporting scale. A large number of anchor items support the validity of cross-country comparisons and trend comparisons.
The unidimensional multi-group item-response-theory (IRT) models used in PISA, with groups defined by language within countries and by cycle, also yield model-fit indices for each item-group combination. These indices can reveal tensions between the model constraints and the response data, a situation known as “misfit” or “differential item functioning” (DIF).
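To make the invariance constraints concrete, the dichotomous case can be written as a multi-group two-parameter logistic (2PL) model. This is a notational sketch, not the report’s own presentation (the group index and the constraint notation are conventions adopted here; polytomous items are scaled with a generalised partial credit model, not shown):

```latex
% Probability that student j in group g (a country-by-language group
% in a given cycle) answers dichotomous item i correctly:
\[
  P\bigl(X_{ijg}=1 \mid \theta_{jg}\bigr)
    = \frac{\exp\bigl(a_{ig}(\theta_{jg}-b_{ig})\bigr)}
           {1+\exp\bigl(a_{ig}(\theta_{jg}-b_{ig})\bigr)}
\]
% Full invariance (an "anchor" item): a_{ig} = a_i and b_{ig} = b_i
% for every group g, so the item carries the common reporting scale.
% Misfit / DIF: the response data contradict these equality
% constraints for some item-group pairs (i, g); releasing them yields
% the "partial invariance" treatment described below.
```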
In cases where the international parameters for a given item did not fit well for a particular country or language group, or for a subset of such groups, PISA allowed for a “partial invariance” solution in which the equality constraints on the item parameters were released and group-specific item parameters were estimated. This approach was favoured over dropping these items’ group-specific responses from the analysis, as it retains the information those responses carry. While items with DIF treated in this way no longer contribute to the international set of comparable responses, they help reduce measurement uncertainty for the specific country-by-language group.
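A minimal sketch of this decision logic, assuming an RMSD-type fit statistic that compares observed and model-implied response probabilities (the function names and the 0.12 threshold are illustrative assumptions for this sketch, not the operational rule documented in the Technical Report):

```python
import numpy as np

def rmsd(observed, expected, density):
    """Root-mean-square deviation between observed proportions correct
    and the model-implied item characteristic curve, integrated over a
    grid of proficiency points weighted by `density` (an illustrative
    group-level fit statistic)."""
    sq_dev = (observed - expected) ** 2
    return float(np.sqrt(np.sum(density * sq_dev) / np.sum(density)))

def parameter_treatment(fit_value, threshold=0.12):
    """Illustrative partial-invariance rule for one item-group pair:
    keep the international parameters if the item fits well enough;
    otherwise release the equality constraint and estimate
    group-specific parameters (the cut-off is an assumption)."""
    if fit_value <= threshold:
        return "international (invariant)"
    return "group-specific (constraint released)"
```

Under a rule of this kind, items flagged as group-specific remain in the analysis for the affected group, which is why they can still reduce measurement uncertainty even though they no longer anchor the international scale.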
In rare instances where the partial invariance model was not sufficient to resolve the tension between students’ responses and the IRT model, the group-specific response data for that particular item were dropped.
An overview of the number of international/common (invariant) item parameters and group-specific item parameters in mathematics for PISA 2022 is given in Figure I.A6.1 and Figure I.A6.2; the corresponding figures for other domains can be found in the PISA 2022 Technical Report (OECD, Forthcoming[1]). Each set of stacked bars in these figures represents a country or economy; countries and economies with multiple language groups have one bar for each country-by-language group.
The bars represent the items used in the country. A colour code indicates whether international item parameters were used in scaling (“invariant items”), or whether, due to misfit when using international parameters, national item parameters were used. For items where international equality constraints were released, a distinction is made between two groups:
• Group-specific new items: items that received unique parameters for the particular group defined by country/language and year. In many cases, equality constraints could still be imposed across a subset of misfitting groups, e.g. across all language groups in a country.
• Group-specific trend items: items whose “non-invariant” parameters used in 2022 could be constrained to the values used in 2018 for the particular country/language group. These items contribute to measurement invariance over time, but not across groups.
For any pair of countries/economies, the larger the number and share of common (“invariant”) item parameters, the more comparable their PISA scores. As the figures show, comparisons between most countries’ results are supported by strong links involving many items: in 115 of 125 country-by-language groups, over 85% of the items use international, invariant item parameters.
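The shares behind these figures amount to a simple tally per country-by-language group. A sketch, assuming a mapping from each group to the parameter status of its items (the group codes, labels and counts are made up for illustration):

```python
from typing import Dict, List

def invariant_shares(status: Dict[str, List[str]]) -> Dict[str, float]:
    """Share of items scaled with international ("invariant")
    parameters in each country-by-language group."""
    return {group: labels.count("invariant") / len(labels)
            for group, labels in status.items()}

# Made-up groups: one strongly linked, one with substantial misfit.
status = {
    "XXX-lang1": ["invariant"] * 60 + ["group-specific new"] * 4,
    "YYY-lang2": ["invariant"] * 50 + ["group-specific new"] * 10
                 + ["group-specific trend"] * 4,
}
shares = invariant_shares(status)
n_strong = sum(share > 0.85 for share in shares.values())
print(shares, n_strong)  # only the first group exceeds the 85% mark
```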
Across every domain, international/common (invariant) item parameters dominate and only a small proportion of the item parameters are group‑specific. The PISA 2022 Technical Report (OECD, Forthcoming[1]) includes an overview of the number of deviations per item across all country-by-language groups.
The country/language group with the largest amount of misfit across items is Viet Nam in reading (this was not the case in mathematics or science): almost 40% of reading items (34 of 87) were assigned unique parameters in Viet Nam. As a result, a strong link between Viet Nam’s reading results and the international PISA scale could not be established.
References
[1] OECD (Forthcoming), PISA 2022 Technical Report, PISA, OECD Publishing, Paris.