Borgonovi and Biecek (2016[8]) developed a country-level measure of “academic endurance” based on comparisons of performance in the first and third quarters of the PISA 2012 test (the rotating booklet design used in PISA 2012 ensured that test content was perfectly balanced across the first and third quarters at the aggregate level). The reasoning behind this measure is that while effort can vary during the test, what students know and can do remains constant: any difference in performance is therefore due to differences in the amount of effort invested.2
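To make the logic concrete, the following is a minimal sketch of how such an endurance indicator could be computed from item-level response data; the file name and column layout are hypothetical, and this is not the authors' exact implementation.

```python
import pandas as pd

# Hypothetical item-level response file: one row per student-item response,
# recording the quarter of the test (1-4) in which the item appeared in that
# student's booklet. Columns: country, student_id, quarter, correct (0/1).
responses = pd.read_csv("pisa2012_responses.csv")

# Percent-correct by country and quarter of the test.
pct_correct = (responses.groupby(["country", "quarter"])["correct"]
               .mean()
               .unstack("quarter") * 100)

# "Endurance": third-quarter minus first-quarter performance. Booklet rotation
# balances content across quarters, so the gap reflects effort rather than
# differences in what students know and can do.
endurance = pct_correct[3] - pct_correct[1]
print(endurance.sort_values())
```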
The original indicator proposed for PISA 2012 can be adapted to the design used in 2022 in two ways.
A first set of indicators compares the performance of students who were administered a given test (e.g. mathematics) in the first hour of testing to the performance of students who were administered the same test in the second hour. These indicators can be based on item-response theory (plausible values) or on classical test theory (percent-correct scores), although comparisons based on the latter are only valid for students (or domains) whose tests were not adaptive and were therefore, under all circumstances, of identical difficulty.
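As an illustration of this first approach, the sketch below computes the second-hour minus first-hour gap in mean mathematics performance, averaged over the ten plausible values. PV1MATH–PV10MATH, W_FSTUWT (final student weight) and CNT (country code) are standard PISA variable names, but the file layout and the hour variable are assumptions: in the actual data, the hour of administration would have to be derived from the test form identifiers.

```python
import pandas as pd

# Hypothetical student-level file with the ten mathematics plausible values,
# the final student weight, and an assumed 'hour' flag (1 or 2).
students = pd.read_csv("pisa2022_students.csv")
pv_cols = [f"PV{i}MATH" for i in range(1, 11)]

def hour_gap(df):
    """Second-hour minus first-hour weighted mean score, averaged over PVs."""
    gaps = []
    for pv in pv_cols:
        # Weighted mean score by hour of administration.
        means = df.groupby("hour").apply(
            lambda g: (g[pv] * g["W_FSTUWT"]).sum() / g["W_FSTUWT"].sum())
        gaps.append(means[2] - means[1])
    return sum(gaps) / len(gaps)

# One gap estimate per country/economy.
print(students.groupby("CNT").apply(hour_gap).sort_values())
```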
A second indicator exploits the test design for mathematics in 2022, which partitions the item pool into three mutually exclusive sets whose position is rotated across students. Items in set A were presented to one-third of students at the beginning of the mathematics test, to one-third in the middle, and to the remaining third at the end; and similarly for sets B and C. By comparing the performance of students whose test was not adaptive (25% of all students who took the mathematics test) across these three positions (beginning, middle and end), it is possible to see how performance varies (and, typically, declines) over the course of the hour-long mathematics test in each country/economy.
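A sketch of this second indicator follows, under the assumption of a response-level file restricted to the non-adaptive mathematics forms; the data layout and column names are hypothetical.

```python
import pandas as pd

# Hypothetical response-level file for the ~25% of students routed to the
# non-adaptive mathematics forms. Columns: country, student_id, item_set
# (A/B/C), position of the set in the student's form (1 = beginning,
# 2 = middle, 3 = end), correct (0/1).
resp = pd.read_csv("pisa2022_math_nonadaptive.csv")

# Each set appears in every position for one-third of students, so item
# difficulty is balanced across positions and percent-correct by position
# isolates the position effect.
by_position = (resp.groupby(["country", "position"])["correct"]
               .mean()
               .unstack("position") * 100)

# Within-test decline over the hour-long mathematics test.
decline = by_position[1] - by_position[3]
print(decline.sort_values(ascending=False))
```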
Student performance by hour of testing
The comparison of students’ performance by hour of testing shows large declines between the first and the second hour of testing in several countries and economies, in particular for reading results.
In reading, on average across OECD countries, students who took the test in the second hour (in most cases, after completing an hour-long mathematics test) scored 14 points lower than students who took the test in the first hour – a large difference. Large performance declines of between 20 and 30 score points were observed in Iceland, Israel, Latvia*, Albania, Qatar, Slovenia, Malta, Argentina and Norway (in descending order of the size of this difference) (Table I.A8.17).
In mathematics, on average across OECD countries, the performance difference between students who took mathematics in the second hour and those who took it in the first hour is only four points. In most countries, the difference is not statistically significant; in Albania and Norway, however, the decline exceeds 10 score points (Table I.A8.14).
In science, results fall between those reported above for mathematics and reading: the average decline between the first and second hour of testing is eight points. In science, where the test was not adaptive, results based on plausible values closely match those based on percent-correct scores (the linear correlation coefficient between the two sets of estimates, a measure of association that varies between -1 and 1, is equal to 0.95) (Table I.A8.11 and Table I.A8.20).
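The correlation reported here is a simple Pearson coefficient over the country-level estimates; a minimal sketch, with a hypothetical input file:

```python
import pandas as pd

# Hypothetical file of country-level decline estimates for science, computed
# once from plausible values and once from percent-correct scores.
est = pd.read_csv("science_decline_estimates.csv")  # country, decline_pv, decline_pct

# Pearson correlation coefficient (between -1 and 1); the report finds 0.95.
r = est["decline_pv"].corr(est["decline_pct"])
print(f"r = {r:.2f}")
```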
Overall, performance declines between the first and second hour of testing for the same country/economy correlate only moderately across subjects. This suggests that these declines reflect both position effects (the effect of taking a test in the second hour, which is present in all subjects) and order effects (the effect of taking a reading test after a mathematics test, for example). Order effects may play out differently across subjects and countries (Tables I.A8.14, I.A8.17 and I.A8.20).
Nevertheless, a few countries/economies figure consistently among those with low “endurance”, meaning that their second-hour results are much lower than their first-hour results regardless of the subject. Countries/economies with low endurance in 2022 include Albania, Malta and Norway (Tables I.A8.14, I.A8.17 and I.A8.20).
The difference between the first and second hour of testing may appear large. However, similarly large declines had already been found in 2018 in most countries. In fact, on average across OECD countries, the difference between the first and second hour of testing even narrowed somewhat between 2018 and 2022, meaning that performance in 2022 was lower than in 2018 throughout the test, but more so at the beginning. The most notable exceptions to this pattern are Albania in reading, and the Dominican Republic and Greece in science, where the performance difference between the first and second hour of testing widened between 2018 and 2022 (Tables I.A8.16, I.A8.19 and I.A8.22).
Performance decline within the hour-long mathematics test
Performance declines for a given student within the hour-long mathematics test are often larger than the differences between students who took the mathematics test in the first hour and those who took it in the second hour. This is because students tend to perform better at the beginning of the second hour of testing (after a break) than at the end of the first hour.
On average across OECD countries, students who were assigned to a non-adaptive test in mathematics answered 47.6% of the questions correctly if they took the test in the first hour and 46.0% if they took the same test in the second hour of testing (Table I.A8.7). At the very beginning of the mathematics test, the percent-correct rate (averaged across first- and second-hour students) was 48.1% but dropped to 47.3% in the middle section, then to 44.2% in the last section – a drop of almost four percentage points (Table I.A8.23).
The largest drop in the mathematics test was observed in Israel: percent-correct rates started at levels close to the OECD average in 2022 but dropped by about seven percentage points in the third (and last) section. In contrast, performance remained at levels close to the OECD average throughout the test in France, for example. Among high-performing countries and economies, Hong Kong (China)*, Korea, Singapore and Chinese Taipei stand out for small differences (two percentage points or less) in performance between the beginning and the end of the testing hour (Table I.A8.23).
These performance declines between the first and third sections of the test can modify country rankings at the margin (for example, Israel would be ranked higher if only performance at the beginning of the mathematics test were considered) but do not affect the main conclusions that can be drawn from comparisons of PISA results across countries. Around the OECD average, a 10-point difference on the PISA mathematics scale corresponds approximately to a difference of four points in the percent-correct metric.3
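A back-of-the-envelope check of this conversion, using the OECD-average figures reported above:

```python
# Near the OECD average, 10 PISA score points correspond to about 4 points in
# the percent-correct metric, i.e. roughly 2.5 score points per percentage point.
POINTS_PER_PCT_POINT = 10 / 4

# The drop from the first to the last section of the mathematics test
# (48.1% -> 44.2% correct, Table I.A8.23) then translates to roughly:
drop_pct_points = 48.1 - 44.2
print(f"about {drop_pct_points * POINTS_PER_PCT_POINT:.0f} PISA score points")  # ~10
```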