Annex A8. How much effort did students invest in the PISA test?

Performance on school tests is the result of the interplay of what students know and can do, how quickly they process information, and how motivated they are to do well on the test. To ensure that students who sit the PISA test engage with the assessment conscientiously and sustain their efforts throughout the test, schools and students selected to participate in PISA are often reminded of the importance of the study for their country. For example, at the beginning of the test session, the test administrator reads a script that includes the following sentence:
“This is an important study because it will tell us about what you have been learning and what school is like for you. Because your answers will help influence future educational policies in <country and/or education system>, we ask you to do the very best you can.”
However, from the perspective of the individual student who takes the test, PISA can be described as a low-stakes assessment: students can refuse to participate without suffering negative consequences, and they do not receive any feedback on their individual performance. If students perceive no personal consequences associated with their test performance, there is a risk that they might not invest adequate effort (Wise and DeMars, 2010[1]).
Several studies in the United States have found that student performance on assessments, such as the United States National Assessment of Educational Progress (NAEP), depends on the conditions of administration. In particular, students performed less well under regular low-stakes conditions than under experimental conditions in which they received financial rewards tied to their performance or were told that their results would count towards their grades (Wise and DeMars, 2005[2]). In contrast, a study in Germany found no difference in effort or performance measures between students who sat a PISA-based mathematics test under the standard PISA test-administration conditions and students who sat the test under alternative conditions that raised the stakes of performing well (Baumert and Demmrich, 2001[3]). In the latter study, the experimental conditions included promising feedback about one’s performance, providing monetary incentives contingent on performance, and letting students know that the test would count towards their grades. The difference in results suggests that students’ motivation to expend effort on a low-stakes test such as PISA may differ significantly across countries. Indeed, the only existing comparative study on the effect of incentives on test performance found that offering students monetary incentives to expend effort on a test such as PISA – something that is not possible within the regular PISA procedures – led to improved performance amongst students in the United States, while students in Shanghai (China) performed equally well with or without such incentives (Gneezy et al., 2017[4]).
These studies suggest that differences in countries’ and economies’ mean scores in PISA may reflect differences not only in what students know and can do, but also in their motivation to do their best. Put differently, PISA does not measure students’ maximum potential, but what students actually do, in situations where their individual performance is monitored only as part of their group’s performance.
A number of indicators have been developed to assess differences between individuals or between groups (e.g. across countries and economies) in students’ motivation in low-stakes tests.
Several scholars have used student self-report measures, collected shortly after the test (Wise and DeMars, 2005[2]; Eklöf, 2007[5]). Typically, students are asked about the effort they invested in the test, and the effort they would have expended in a hypothetical situation, e.g. if the test results counted towards their grades. PISA 2018 also included such questions at the end of both the paper- and computer-based test forms (see Figure I.A8.1).
However, self-report measures have several disadvantages. In particular, it is unclear whether students – especially those who may not have taken the test seriously – respond truthfully when asked how hard they tried on a test they have just taken; and it is unclear to what extent answers provided on subjective response scales can be compared across students, let alone across countries. The comparison between the “actual” and the “hypothetical” effort is also problematic. In the German study discussed above, regardless of the conditions under which they took the test, students said that they would have invested more effort had any of the other three conditions applied; the average difference was particularly marked amongst boys (Baumert and Demmrich, 2001[3]). One explanation for this finding is that students under-report their actual effort, relative to their hypothetical effort, so that wrong answers on the test they sat can be attributed to a lack of effort rather than a lack of ability.
In response to these criticisms, researchers have developed new ways to examine test-taking effort, based on observing students’ behaviour during the test. Wise and Kong (2005[6]) proposed a measure based on response time per item in computer-delivered tests. Their measure, labelled “response-time effort”, is simply the proportion of items, out of the total number of items in a test, on which respondents spent more than a threshold time T (e.g. five seconds, for items based on short texts). Borgonovi and Biecek (2016[7]) developed a country-level measure of “academic endurance” based on comparisons of performance in the first and the third quarters of the PISA 2012 test (the rotated booklet design used in PISA 2012 ensured that test content was perfectly balanced across the first and third quarters). The reasoning behind this measure is that, while effort can vary during the test, what students know and can do remains constant; any difference in performance is therefore due to differences in the amount of effort invested.1 Measures of “straightlining”, i.e. the tendency to use an identical response category for all items in a set (Herzog and Bachman, 1981[8]), may also indicate low test-taking effort.
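As an illustration, response-time effort can be computed directly from item-level timing data. The following is a minimal Python sketch of the Wise and Kong measure; the data layout, column names and thresholds are illustrative assumptions, not the variables of the PISA database.

```python
# A minimal sketch of Wise and Kong's response-time effort (RTE) measure.
# Column names, the data layout and the thresholds are illustrative
# assumptions, not the actual PISA variables.
import pandas as pd

def response_time_effort(times: pd.DataFrame, thresholds: dict) -> pd.Series:
    """Per-student share of administered items on which the response time
    met the item-specific threshold ("solution behaviour").

    times      : one row per student, one column per item, response time in
                 seconds; NaN where the item was not administered.
    thresholds : item id -> minimum time (in seconds) counted as solution
                 behaviour (e.g. five seconds for items based on short texts).
    """
    items = list(thresholds)
    solution = pd.DataFrame({item: times[item] >= thresholds[item] for item in items})
    administered = times[items].notna().sum(axis=1)
    return solution.sum(axis=1) / administered

# Two students, three items, a uniform five-second threshold
times = pd.DataFrame({"M01": [12.0, 2.1], "M02": [48.3, 1.5], "S01": [30.0, None]})
print(response_time_effort(times, {"M01": 5, "M02": 5, "S01": 5}))
# student 0 -> 1.0 (all items >= 5 seconds); student 1 -> 0.0 (only rapid responses)
```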
Building on these measures, this annex presents country-level indicators of student effort and time management in PISA 2018, and compares them, where possible, to the corresponding PISA 2015 measures. The intention is not to suggest adjustments to PISA mean scores or performance distributions, but to provide richer context for interpreting cross-country differences and trends.
Average student effort and motivation
Figure I.A8.2 presents the results from students’ self-reports of effort; Figure I.A8.3 presents, for countries using the computer-based test, the results of Wise and Kong’s (2005[6]) measure of effort based on item-response time.
A majority of students across OECD countries (68 %) reported expending less effort on the PISA test than they would have done on a test that counted towards their grades. On the 1-to-10 scale illustrated in Figure I.A8.1, students reported, on average, an effort of “8” for the PISA test they had just completed. They reported that they would have described their effort as “9” had the test counted towards their marks. Students in Albania, in Beijing, Shanghai, Jiangsu and Zhejiang (China) (hereafter “B-S-J-Z [China]”) and in Viet Nam rated their effort the highest amongst all participating countries/economies, with an average rating of “9”. Only 17 % of students in Albania and 27 % of students in Viet Nam reported that they would have invested more effort had the test counted towards their marks.
At the other extreme, more than three out of four students in Germany, Denmark, Canada, Switzerland, Sweden, Belgium, Luxembourg, Austria, the United Kingdom, the Netherlands and Portugal (in descending order of that share), and 68 % on average across OECD countries, reported that they would have invested more effort if their performance on the test had counted towards their marks (Table I.A8.1). In most countries, as well as on average, boys reported investing less effort in the PISA test than girls did. But the effort that boys reported they would have invested had the test counted towards their marks was also lower than the effort that girls reported under the same hypothetical conditions. When the difference between the two reports is considered, a larger share of girls reported that they would have worked harder on the test if it had counted towards their marks (Table I.A8.2).
Response-time effort, on the other hand, appears unrelated to country-level ratings of self-reported effort (this measure is available only for countries that delivered PISA on computer).2 In fact, most countries/economies show considerable response-time effort. In order to estimate response-time effort, a conservative threshold (i.e. a minimum of five seconds per item) was used to define “solution behaviour” on mathematics and science items; reading and global competence items were excluded from the analysis to ensure comparability across countries.3
The largest shares of students who exhibited genuine response behaviour – i.e. who spent at least five seconds on every mathematics or science item presented to them – were found in Denmark (one of the countries with the largest share of students who reported they would have worked harder on the test if it had counted towards their marks), Finland and Mexico; but many other countries and economies showed levels of response-time effort similar to these three. Only Qatar (response-time effort equal to 91.5 %) has a large share of student responses to items (8.5 %) that may correspond to “rapid skipping” or “rapid guessing” behaviours (i.e. students spent less than five seconds trying to solve the item; rapidly skipped items at the end of each session were considered as non-reached items and did not count towards the measure of response-time effort; Table I.A8.7). The same pattern of low response-time effort in Qatar was already observed in the 2015 data (Table I.A8.9).4
One possibility is that the differences between self-report and response-time measures of effort arise because response-time effort is not sensitive to all types of disengaged response behaviour. Not all students who expend little effort skip questions or give rapid-guess answers; some may go off task, or read through the material without focus, and end up guessing or skipping responses only after considerable time has elapsed. Another possibility is that self-report measures of effort do not reflect the real effort that students put into the test (see above).
The reading-fluency section of the PISA 2018 test offers an opportunity to examine straightlining behaviour in the test. Students were given a series of 21 or 22 items, in rapid sequence, with identical response formats (“yes” or “no”); meaningless sentences (such as “The window sang the song loudly”), calling for a “no” answer, were interspersed amongst sentences that had meaning (such as “The red car has a flat tyre”), calling for a “yes” answer. It is possible that some students did not read the instructions carefully, or that they genuinely considered that the meaningless sentences (which had no grammatical or syntactical flaws) had meaning. However, this response pattern (a series of 21 or 22 “yes” answers) or its opposite (a series of 21 or 22 “no” answers) is unexpected amongst students who demonstrated medium or high reading competence in the main part of the reading test.
Table I.A8.21 shows that, indeed, only 1.5 % of all students, on average across OECD countries, exhibited such patterned responses in reading-fluency tasks. That proportion is even smaller (only 0.5 %) amongst high-performing students, defined here as those students who attained high scores on the first segment of the reading test, after completing reading-fluency tasks.5 However, the proportion of high-performing students who exhibited “straightlining” behaviour on the reading-fluency test is close to 6 % in Kazakhstan, close to 5 % in the Dominican Republic, and exceeds 2 % in Albania, Indonesia, Korea, Peru, Spain, Thailand and Turkey (Table I.A8.21).6 It is possible that the unusual response format of reading-fluency tasks triggered, in some limited cases, disengaged response behaviour, and that these same students did their best in the later parts of the test. It is also possible, however, that these students did not do their best throughout the PISA test, and not only in this initial, three-minute section of the reading test.
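As a simple illustration of how such response patterns can be flagged, the sketch below marks students who used a single response category for every reading-fluency sentence they answered. The 1/0 coding of “yes”/“no” answers and the data layout are assumptions made for the example, not the coding used in the PISA database.

```python
# A minimal sketch of flagging straightlined reading-fluency responses
# (an identical answer to all 21 or 22 sentences). The 1 = "yes" / 0 = "no"
# coding and the column layout are illustrative assumptions.
import pandas as pd

def flag_straightlining(fluency: pd.DataFrame, min_items: int = 21) -> pd.Series:
    """True for students who answered at least `min_items` sentences and
    used a single response category for all of them."""
    n_answered = fluency.notna().sum(axis=1)
    n_distinct = fluency.nunique(axis=1)  # NaN (not administered) is ignored
    return (n_answered >= min_items) & (n_distinct == 1)

# The first student answers "yes" to all 22 sentences; the second alternates
responses = pd.DataFrame([[1] * 22, [1, 0] * 11])
print(flag_straightlining(responses))  # -> True, False
```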
Test fatigue and the ability to sustain motivation
For countries that delivered the test on computer, Figure I.A8.4 presents a measure of test endurance based on Borgonovi and Biecek (2016[7]). This measure compares performance in mathematics and science (domains where the assignment of tasks to students was non-adaptive) between the first and the second test session (each test session corresponds to one hour). In PISA 2018, no student took mathematics and science tasks in both test sessions; the comparison is therefore between equivalent groups of students, as defined by the random assignment of students to test forms. The rotation of items across test forms further ensures a balanced design.
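A minimal sketch of this endurance measure is given below, under the simplifying assumption that item-level results are available in a long format with an indicator of the hour in which each item was administered; the column names are illustrative, not those of the PISA database.

```python
# A minimal sketch of the endurance measure: the change in the percentage of
# correct answers to mathematics and science items between the second and the
# first hour of testing. Column names are illustrative assumptions.
import pandas as pd

def endurance(responses: pd.DataFrame) -> float:
    """responses: one row per (student, item), with 'hour' in {1, 2} and
    'correct' in {0, 1}. Returns the percentage-point difference
    (second hour minus first hour); negative values indicate a decline."""
    pct_correct = responses.groupby("hour")["correct"].mean() * 100
    return pct_correct[2] - pct_correct[1]

df = pd.DataFrame({"hour":    [1, 1, 1, 1, 2, 2, 2, 2],
                   "correct": [1, 1, 0, 1, 1, 0, 0, 1]})
print(endurance(df))  # -> -25.0 percentage points
```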
Amongst countries that delivered the PISA test on computer, only negative or non-significant differences in performance were observed between the second and the first hour of testing. This is consistent with the expectation that such differences mostly reflect fatigue, and it supports the interpretation of these results as indicators of “endurance”. While the differences tended to be small, in general, seven countries/economies showed a decline of more than three percentage points in the percentage of correct answers between the first and the second hour of testing (in ascending order of that decline): Chile, Serbia, Baku (Azerbaijan), Colombia, Australia, Norway and Uruguay (Figure I.A8.4 and Table I.A8.3).
There was hardly any correlation between overall performance and test endurance.7 Some countries with similar performance in science and mathematics showed marked differences in test endurance. For example, amongst high-performing countries, the Netherlands showed a relatively marked drop in performance between the first and the second session, while there was no significant decline in performance in B-S-J-Z (China), Singapore, Macao (China) and Finland (in descending order of the overall percentage of correct responses in mathematics and science). Test endurance was also only weakly related to the share of students in the country who reported expending less effort in the PISA test than they would have had the test counted towards their school grades.8
Countries that delivered the PISA test on paper showed, on average, larger differences in the percentage of correct responses between the first and the second half of the test. This reflects the different design of the test sessions (see Annex A5): in these countries, all domains of testing were bundled in a single booklet, so students could continue to work on the first half of the test during the second hour of testing.
Academic endurance can be computed in much the same way with PISA 2015 data. In order to compare results with PISA 2018, Table I.A8.5 uses only student performance in mathematics and science. Even so, results should not be compared directly with results in Table I.A8.3, because the test content in science and the distribution of science and mathematics questions across test forms differ between 2015 and 2018 (science was the major domain in 2015, and was always assessed over a full hour). Nevertheless, the PISA 2015 measure of academic endurance correlates strongly at the country level with the PISA 2018 measure (the linear correlation coefficient is r = 0.65 across the 53 countries/economies that delivered the test on computer in both 2015 and 2018) (Tables I.A8.3 and I.A8.5). In general, countries/economies where students showed above-average endurance in 2018 (such as Finland, Macao [China], Singapore and Chinese Taipei) had already demonstrated above-average endurance in 2015, and countries with below-average endurance in 2018 (such as Australia, the Netherlands, Norway and Uruguay) tended to show below-average endurance in 2015 as well.
Time management and speed of information processing
Non-reached items at the end of each of the two one-hour test sessions in the computer-based assessment (and at the end of the test booklet, in the paper-based assessment) are defined for each test-taker as omitted responses that are not followed by a valid (correct or incorrect) response before the end of the session/booklet (OECD, forthcoming[9]).
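This rule can be illustrated with a short sketch that scans a student’s responses in presentation order; the outcome codes used here are illustrative, not the codes of the PISA database.

```python
# A minimal sketch of the non-reached-item rule described above: omitted
# responses count as non-reached only if no valid (correct or incorrect)
# response follows them before the end of the session. The outcome codes
# are illustrative assumptions.
from typing import Sequence

def count_non_reached(responses: Sequence[str]) -> int:
    """responses: item outcomes in presentation order, each one of
    'correct', 'incorrect' (valid responses) or 'omitted'."""
    last_valid = -1
    for i, outcome in enumerate(responses):
        if outcome in ("correct", "incorrect"):
            last_valid = i
    # omitted items after the last valid response are non-reached
    return sum(1 for outcome in responses[last_valid + 1:] if outcome == "omitted")

print(count_non_reached(["correct", "omitted", "incorrect", "omitted", "omitted"]))  # -> 2
```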
Figure I.A8.5 shows the average percentage of non-reached items in mathematics and science (reading was not analysed due to the adaptive design, which makes the percentage of non-reached items not comparable across students and countries). On average across OECD countries, 4 % of items were not reached by the end of the test session: 5 % amongst students who were given science or mathematics tests during the first hour, and 3 % amongst students who were given science or mathematics tests during the second hour. This difference between the first and second hour, which can be found in most countries that delivered the test on computer, suggests that students may have become more familiar with the test platform, timing and response formats during the test. However, the percentage of non-reached items is above 15 % in Peru, Panama and Argentina (in descending order of that percentage; the last of these countries delivered the test on paper), and it is between 10 % and 11 % in Brazil, the Dominican Republic and Morocco. The proportion of non-reached items is smallest in Viet Nam (0.1 %), followed by B-S-J-Z (China), Korea and Chinese Taipei (between 1.1 % and 1.3 %) (Figure I.A8.5 and Table I.A8.11).
Between 2015 and 2018, the proportion of non-reached items increased in most countries. In many Latin American countries (Brazil, Colombia, Costa Rica, Mexico, Peru and Uruguay), as well as in Sweden, it increased from less than 3 % in 2015 to more than 8 % in 2018. The most significant exception to this increase is the Dominican Republic, where non-reached items decreased from 13 % to 11 %. Non-reached items also decreased in most countries that transitioned to computer-based testing in 2018 (Figure I.A8.5; Tables I.A8.11 and I.A8.13). The rotation of the major domain, and other changes affecting the length of the test, may have contributed to the increase in non-reached items in countries that delivered the test on computer. As in 2015, non-reached items were considered as “not administered” for the purpose of estimating students’ performance on the PISA scale, and the increase or decrease in non-reached items therefore cannot explain performance changes between 2015 and 2018 (though both changes may be related to the same cause, such as weaker motivation amongst students to try their best).
Figure I.A8.6 presents, for countries that delivered the PISA test on computer, the amount of time students spent on the reading, mathematics and science tests. Students were given a maximum of one hour to complete the mathematics and/or science section of their PISA test (the other hour was used for assessing reading). On average across OECD countries, 50 % of students completed the first test section (either the reading section, or the mathematics and/or science section) in less than 43 minutes (the median total time); 10 % of students took less than 28 minutes (the 10th percentile of total time), and 90 % of students finished within 52 minutes. Students tended to be faster during the second hour, probably because they had become more familiar with the test platform and the different response formats: the median total time was only 39 minutes in the second hour.
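The statistics reported here are simple percentiles of the distribution of total time per student and test hour. A minimal sketch, using an invented array of completion times in minutes:

```python
# A minimal sketch of the timing statistics reported above: the 10th
# percentile, median and 90th percentile of the total time each student
# spent on one hour of testing. The input values are invented.
import numpy as np

total_minutes = np.array([27.5, 31.0, 38.4, 42.6, 43.9, 47.2, 50.8, 53.5])
p10, median, p90 = np.percentile(total_minutes, [10, 50, 90])
print(f"10th percentile: {p10:.1f} min, median: {median:.1f} min, 90th percentile: {p90:.1f} min")
```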
Compared to the OECD average, students were considerably faster at completing the test in Korea (median total time: 33 minutes in the first hour, 30 minutes in the second hour). They were considerably slower in Albania (53 minutes in the first hour, 45 minutes in the second hour) and in Malaysia (47 minutes and 46 minutes). In all countries and economies, the vast majority of students completed the test within the time limit (Table I.A8.15).
These patterns of variation across countries in time spent on the test were similar to those observed in 2015 (Table I.A8.17). Across countries/economies with available data, the median total time in the first hour in 2015 and in 2018 correlates at r = 0.86 at the country level. The median test completion time in 2015 was slightly shorter than in 2018, on average across OECD countries (40 minutes, instead of 43 minutes), suggesting that the PISA 2015 reading and science tests could be completed in less time than the PISA 2018 tests (the same mathematics tests were used in 2018 as in 2015). This is also consistent with the increase in the number of non-reached items since 2015.
Tables available on line
https://doi.org/10.1787/888934029071
Table I.A8.1 Effort invested in the PISA test
Table I.A8.2 Effort invested in the PISA test, by gender
Table I.A8.3 Endurance in the PISA test
Table I.A8.4 Endurance in the PISA test, by gender
Table I.A8.5 Endurance in the PISA 2015 test
Table I.A8.6 Endurance in the PISA 2015 test, by gender
Table I.A8.7 Response-time effort in the PISA test
Table I.A8.8 Response-time effort in the PISA test, by gender
Table I.A8.9 Response-time effort in the PISA 2015 test
Table I.A8.10 Response-time effort in the PISA 2015 test, by gender
Table I.A8.11 Non-reached items in the PISA test
Table I.A8.12 Non-reached items in the PISA test, by gender
Table I.A8.13 Non-reached items in the PISA 2015 test
Table I.A8.14 Non-reached items in the PISA 2015 test, by gender
Table I.A8.15 Response time in the PISA test
Table I.A8.16 Response time in the PISA test, by gender
Table I.A8.17 Response time in the PISA 2015 test
Table I.A8.18 Response time in the PISA 2015 test, by gender
Table I.A8.19 Response time in the PISA reading-fluency test
Table I.A8.20 Response time in the PISA reading-fluency test, by gender
Table I.A8.21 Response accuracy in the PISA reading-fluency test, by reading performance
Table I.A8.22 Response accuracy in the PISA reading-fluency test, by gender
References
[3] Baumert, J. and A. Demmrich (2001), “Test motivation in the assessment of student skills: The effects of incentives on motivation and performance”, European Journal of Psychology of Education, Vol. 16/3, pp. 441-462, http://dx.doi.org/10.1007/bf03173192.
[7] Borgonovi, F. and P. Biecek (2016), “An international comparison of students’ ability to endure fatigue and maintain motivation during a low-stakes test”, Learning and Individual Differences, Vol. 49, pp. 128-137, http://dx.doi.org/10.1016/j.lindif.2016.06.001.
[5] Eklöf, H. (2007), “Test-Taking Motivation and Mathematics Performance in TIMSS 2003”, International Journal of Testing, Vol. 7/3, pp. 311-326, http://dx.doi.org/10.1080/15305050701438074.
[4] Gneezy, U. et al. (2017), Measuring Success in Education: The Role of Effort on the Test Itself, National Bureau of Economic Research, Cambridge, MA, http://dx.doi.org/10.3386/w24004.
[8] Herzog, A. and J. Bachman (1981), “Effects of questionnaire length on response quality”, Public Opinion Quarterly, Vol. 45, pp. 549–559.
[9] OECD (forthcoming), PISA 2018 Technical Report, OECD Publishing, Paris.
[1] Wise, S. and C. DeMars (2010), “Examinee Noneffort and the Validity of Program Assessment Results”, Educational Assessment, Vol. 15/1, pp. 27-41, http://dx.doi.org/10.1080/10627191003673216.
[2] Wise, S. and C. DeMars (2005), “Low Examinee Effort in Low-Stakes Assessment: Problems and Potential Solutions”, Educational Assessment, Vol. 10/1, pp. 1-17, http://dx.doi.org/10.1207/s15326977ea1001_1.
[6] Wise, S. and X. Kong (2005), “Response Time Effort: A New Measure of Examinee Motivation in Computer-Based Tests”, Applied Measurement in Education, Vol. 18/2, pp. 163-183, http://dx.doi.org/10.1207/s15324818ame1802_2.
Notes
1. Speed of information processing, and time management more generally, may also influence performance differences between test sections. To limit the influence of this possible confounder, Borgonovi and Biecek (2016[7]) do not use the last quarter of the test, but the third (second-to-last) quarter. In the computer-based PISA 2015 and PISA 2018 assessments, the test is divided into two halves, each conducted in an hour-long session. Under this design, students’ time management and speed of information processing can be expected to have the same impact on both halves.
2. The linear correlation coefficient between average response-time effort and self-reported effort in the PISA test is weak (r = -0.20, N = 70). The linear correlation between average response-time effort and the share of students reporting that they invested less effort in the PISA test than if their scores were going to be counted in their school marks is r = 0.38 (N = 70), meaning that in countries with greater response-time effort, more students tended to report that they would have worked harder if the test had had higher stakes for them (Tables I.A8.1 and I.A8.7).
3. In particular, reading items were excluded because their assignment to students was, in part, a function of students’ behaviour in prior sections of the test. As a result, each item was assigned to a different proportion of students across countries, limiting comparability of test-wide timing measures. Global competence items were excluded due to the large number of countries that did not participate in the assessment of global competence.
4. More generally, the linear correlation coefficient between response-time effort in 2015 and response-time effort in 2018, at the country level and across the 53 countries/economies that delivered both PISA tests on computer, is r = 0.64.
5. High-performing students correctly answered a sufficient number of automatically scored tasks in the core section of the reading test to be assigned, with 90 % probability, to a “high” stage-1 testlet in the following section of the adaptive reading test. The same cut-off values (specific to each core testlet) were used across all countries to identify high-performing students. This information is available in variable RCORE_PERF in the PISA 2018 cognitive response database.
6. In all countries and economies, the proportion of correct responses to reading-fluency tasks was positively related to the proportion of correct responses in the core stage of the reading assessment.
7. The linear correlation coefficient between average academic endurance and mean performance in the PISA test is only r = 0.10 in reading, r = 0.13 in mathematics and r = 0.12 in science (N = 78) across all countries/economies. When countries that delivered the PISA test on paper are excluded, correlations are r = -0.08 in reading, r = -0.03 in mathematics and r = -0.03 in science (N = 70) (Tables I.B1.4, I.B1.5, I.B1.6 and I.A8.3).
8. When countries that delivered the PISA test on paper are excluded, the linear correlation coefficient between average academic endurance and the percentage of students who reported that they invested less effort in the PISA test than if their scores were going to be counted in their school marks is r = -0.37 (N = 70), meaning that in countries with better student endurance, a smaller proportion of students indicated that they would have worked harder if the stakes had been higher (Tables I.A8.1 and I.A8.3).