Chapter 4. Allocation of time to different items in the Survey of Adult Skills
Abstract
This chapter analyses disaggregated data at the respondent-item level to illustrate how respondents chose to allocate time to different items. Time spent on items was found to be strongly related to intrinsic characteristics of the items, such as difficulty. Respondents devoted considerably less time to items administered in the second half of the assessment. This was accompanied by a decrease in performance (measured by the fraction of items answered correctly) and an increase in the proportion of missing answers. Respondents seem to allocate time to tasks rationally, spending less time on items that are either too difficult or too easy, and more time on challenging items for which the probability of success is close to 50%. Spending more time on an item appears to increase the probability of giving a correct answer, although at declining rates.
Introduction
Chapter 3 analysed timing indicators related to the assessment as a whole for various groups of respondents to the Survey of Adult Skills, a product of the Programme for the International Assessment of Adult Competencies (hereafter “PIAAC”). This chapter examines more disaggregated data at the respondent-item level, aiming to shed light on how respondents chose to allocate time to different items in the course of the assessment.
For this analysis, it is important to recognise that PIAAC was not designed as a timed assessment. Unlike other large-scale assessments such as the OECD Programme for International Student Assessment (PISA) and many high-stakes testing situations, there was no limit set on the amount of time that respondents could take to complete the assessment.
This feature of the design of the PIAAC assessment must be kept in mind when interpreting the results of the empirical analysis. The absence of an explicit time limit means that respondents who want to maximise their performance on the test are not subject to time constraints defined by the test protocol. In principle, they can take as much time as they need to maximise performance. The choice of how much time to allocate to different items becomes meaningful only when the analyst makes the assumption (reasonable in these circumstances) that time has a value for respondents (i.e. that doing the assessment is one of several alternative uses of their time). As a consequence, respondents in effect trade off the value they attach to their performance on the test against the value they attach to other uses of their time.
A second important aspect is that time on task represents an imperfect proxy of the effort exerted by respondents. This is because log files are silent about how respondents actually employ the time they spend on each item. While it is reasonable to think that the amount of time spent on an item is a good approximation of how much the respondent was engaged with the item, it is possible that the respondent spent a lot of time on a given item for other reasons (e.g. because he/she was interrupted, took a break or was distracted by different things or thoughts).
As noted in Chapter 2, log files provide no record of what respondents are actually doing during the time they spend on an item, nor of the many different ways they could show engagement with it – see Maddox et al. (2018[1]) for a small-scale observational study of respondents’ behaviour during the PIAAC assessment.
Time on task, as measured in log files, is the result of the interaction of a variety of factors:
respondents’ cognitive ability
respondents’ engagement and motivation
item characteristics
external events (distractions or unforeseen events during the course of the assessment).
Each of these factors has a different relationship to time on task, and the relationship is often non-linear. In the case of cognitive ability, for instance, highly skilled people are expected to solve items relatively rapidly. At the same time, low-skilled individuals are also expected to devote little time to difficult items, as they realise that they have little chance of solving them and will, therefore, skip them. Different item characteristics can affect the trade-offs that respondents face when deciding how much time to allocate to each item.
Item difficulty operates in a way similar to respondents’ cognitive ability. Very easy items will generally take less time to solve. However, more difficult items might require more time to solve, or they might be so difficult that the optimal action is to devote as little time as possible to them – or to skip them entirely.
Item position can also have opposite effects on time on task. Respondents might become tired at later stages of the assessment and thus need more time to solve an item. But fatigue can lead to a decrease in motivation and could, therefore, reduce the time devoted to the item. A decrease in time on task at later stages of the assessment can, in principle, also be attributed to the fact that respondents learn test-taking strategies or become more familiar with the user interface. As a consequence, they become quicker to solve the items (although the latent ability that the assessment is meant to capture does not change).
It is hard to disentangle the separate impacts of all these factors, but the chapter provides some evidence in this respect, adopting two approaches. The first, at the item level, consists of relating the time spent on various items to a range of item characteristics. The second, at the individual level, looks more precisely at how respondents allocate time between items (this could be called a within-respondents / between-items analysis).
Timing indicators and item characteristics
The analysis in this section aggregates information at the item level, relating timing indicators with various item characteristics. In particular, it examines time on task (the overall amount of time spent on the item).
At the item level, time on task displays a strong correlation with time to first interaction (Figure 4.1), but a very weak correlation with time since last action (Figure 4.2). The main reason for this is that many items do not require multiple interactions between the respondent and the computer interface. Consequently, the first interaction and the last action essentially coincide. Time on task is, thus, the indicator that is easiest to interpret. It provides the best approximation of the effort respondents decide to allocate to an item.
Figure 4.3 shows the distribution of time on task for each literacy and numeracy item in the dataset. There is considerable variation both across and within items. For many items, the distribution of time on task is extremely compressed, with half of the respondents taking between 20 and 50 seconds. For other items, the distribution of time on task is much more dispersed, with the most rapid quarter of respondents spending at most 60 seconds (1 minute), and the slowest quarter spending almost 150 seconds (2.5 minutes).
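These item-level distributional statistics can be computed directly from the respondent-item records. Below is a minimal sketch in Python, assuming a long-format extract of the log files with one row per respondent-item pair; the file name and column names (`respondent_id`, `item_id`, `time_on_task`) are illustrative, not the codes used in the released PIAAC log files.

```python
import pandas as pd

# Hypothetical long-format extract of the log files: one row per
# respondent-item pair (column names are illustrative, not the PIAAC codes).
df = pd.read_csv("piaac_log_items.csv")  # respondent_id, item_id, time_on_task (seconds)

# Quartiles of time on task for each item: the statistics behind Figure 4.3
item_quartiles = (
    df.groupby("item_id")["time_on_task"]
      .quantile([0.25, 0.50, 0.75])
      .unstack()
      .rename(columns={0.25: "q25", 0.50: "median", 0.75: "q75"})
)
item_quartiles["iqr"] = item_quartiles["q75"] - item_quartiles["q25"]

# Compressed items (middle half of respondents within a 30-second band)
# versus dispersed items (middle half spread over more than 90 seconds)
print(item_quartiles.query("iqr < 30").head())
print(item_quartiles.query("iqr > 90").head())
```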
Some of the within-item variation presented in Figure 4.3 is likely to be due to differences across countries in the average time spent on the assessment. Figure 4.4 shows this for a selection of items (and for a selection of countries, in order to preserve the readability of the graph). Respondents in Finland and Norway, for example, spend consistently more time on each item than respondents in Italy and Spain.
There are, however, no major differences between countries in the degree of variability of time spent on each item (Figure 4.5). Spain is the only outlier in this respect. In other words, for each item, the distribution of time on task appears to be shifted to the right or to the left depending on the country, but the spread of the distribution displays much less variation across countries.
An important factor that could explain between-item differences in time on task is item difficulty. Figure 4.6 clearly shows that time on task increases with item difficulty, for both literacy and numeracy items. This is partly because more difficult items involve more complex cognitive operations and more extensive stimulus materials.
It is also interesting to relate time on task to the final status of the item (i.e. whether the item was answered correctly, answered incorrectly or not answered at all). Figure 4.7 shows that, for any given item, respondents who gave the correct answer did not spend a significantly different amount of time than respondents who gave an incorrect answer. This is a further indication that time on task is strongly related to intrinsic characteristics of the item. The variation in time on task is much more limited for items that were not answered, indicating that the time required to decide whether or not it is worth trying to solve the item does not increase as much with item difficulty as the time needed to actually solve the item.
Finally, it is worth looking at time on task in relation to the position of the various items within the overall assessment. This provides a useful bridge to the following section, which examines in more detail individual behaviour in terms of allocation of time to items.
As explained in Chapter 2, PIAAC was designed as an adaptive test. One consequence is that the allocation of items to respondents was (conditionally) random. There are two main reasons why the same item could have been presented to respondents in different positions. First, respondents were randomly allocated to a literacy or numeracy module. Therefore, the same literacy item would appear in a different position depending on whether the respondent took literacy as the first or the second module. Second, within modules, respondents were allocated to different booklets. This allocation was only conditionally random, as the booklets varied in their level of difficulty and the allocation was based on observable characteristics of respondents that are likely to be correlated with ability. However, given that each item appears in at most two booklets, the main source of variation in the position of any given item is whether it was part of the first or the second module.
In all countries, respondents tend to devote less time to the second module than to the first module. Table 4.1 illustrates this point with reference to time on task, but the same result is found when looking at other timing indicators.
Table 4.1. Time on task, by module (seconds)

| | First module | Second module | Difference | % Difference |
|---|---|---|---|---|
| Austria | 1 529.6 | 1 358.7 | -171.0 | -11.2% |
| Denmark | 1 486.2 | 1 293.7 | -192.5 | -13.0% |
| England/Northern Ireland (UK) | 1 305.6 | 1 134.4 | -171.2 | -13.1% |
| Finland | 1 538.9 | 1 363.9 | -175.0 | -11.4% |
| France | 1 461.5 | 1 247.8 | -213.7 | -14.6% |
| Germany | 1 550.7 | 1 365.8 | -184.9 | -11.9% |
| Ireland | 1 328.9 | 1 110.7 | -218.2 | -16.4% |
| Italy | 1 334.1 | 1 071.8 | -262.3 | -19.7% |
| Netherlands | 1 437.5 | 1 310.6 | -126.9 | -8.8% |
| Norway | 1 622.1 | 1 432.6 | -189.5 | -11.7% |
| Poland | 1 335.4 | 1 144.9 | -190.5 | -14.3% |
| Slovak Republic | 1 297.1 | 1 132.0 | -165.2 | -12.7% |
| Spain | 1 273.6 | 1 075.1 | -198.5 | -15.6% |
| United States | 1 385.5 | 1 173.0 | -212.5 | -15.3% |
| Average | 1 420.5 | 1 229.6 | -190.8 | -13.4% |
Note: The sample includes only participants in the computer-based assessment who were assigned to the literacy and numeracy modules.
Source: OECD (2017[2]), Programme for the International Assessment of Adult Competencies (PIAAC), Log Files, http://dx.doi.org/10.4232/1.12955.
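The module comparison in Table 4.1 can be reproduced from the respondent-item records by summing time on task within each module before averaging by country. A minimal sketch, with illustrative column names and assuming the module is coded 1 or 2:

```python
import pandas as pd

# Hypothetical long-format extract; column names and codings are illustrative.
df = pd.read_csv("piaac_log_items.csv")  # country, respondent_id, module (1 or 2), item_id, time_on_task

# Total time on task per respondent within each module...
per_respondent = (
    df.groupby(["country", "respondent_id", "module"])["time_on_task"]
      .sum()
      .reset_index()
)

# ...then averaged by country, reproducing the layout of Table 4.1
table = per_respondent.pivot_table(
    index="country", columns="module", values="time_on_task", aggfunc="mean"
)
table["difference"] = table[2] - table[1]
table["pct_difference"] = 100 * table["difference"] / table[1]
print(table.round(1))
```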
The decrease in the time devoted to the assessment is associated with a decline in performance. Items in the second module are less likely to be answered correctly and more likely to be left blank or skipped, as shown in Table 4.2.
Table 4.2. Correct and missing answers, by module

| | Correct answers: first module | Correct answers: second module | Correct answers: difference | Missing answers: first module | Missing answers: second module | Missing answers: difference |
|---|---|---|---|---|---|---|
| Austria | 0.6158 | 0.5971 | -0.0187 | 0.039 | 0.066 | 0.027 |
| Denmark | 0.6270 | 0.5924 | -0.0346 | 0.055 | 0.101 | 0.046 |
| England/Northern Ireland (UK) | 0.5920 | 0.5502 | -0.0418 | 0.055 | 0.098 | 0.044 |
| Finland | 0.6864 | 0.6525 | -0.0339 | 0.029 | 0.056 | 0.027 |
| France | 0.5593 | 0.5197 | -0.0396 | 0.079 | 0.126 | 0.047 |
| Germany | 0.6140 | 0.5807 | -0.0334 | 0.050 | 0.079 | 0.029 |
| Ireland | 0.5784 | 0.5328 | -0.0455 | 0.048 | 0.097 | 0.049 |
| Italy | 0.5147 | 0.4835 | -0.0312 | 0.097 | 0.149 | 0.052 |
| Netherlands | 0.6510 | 0.6343 | -0.0168 | 0.034 | 0.059 | 0.024 |
| Norway | 0.6397 | 0.6124 | -0.0272 | 0.039 | 0.063 | 0.024 |
| Poland | 0.5831 | 0.5449 | -0.0383 | 0.056 | 0.105 | 0.049 |
| Slovak Republic | 0.6047 | 0.5998 | -0.0048 | 0.043 | 0.064 | 0.022 |
| Spain | 0.5166 | 0.4913 | -0.0253 | 0.087 | 0.127 | 0.040 |
| United States | 0.5642 | 0.5367 | -0.0275 | 0.039 | 0.068 | 0.029 |
| Average | 0.5962 | 0.5663 | -0.0299 | 0.054 | 0.090 | 0.036 |
Note: The sample includes only participants in the computer-based assessment who were assigned to the literacy and numeracy modules.
Source: OECD (2017[2]), Programme for the International Assessment of Adult Competencies (PIAAC), Log Files, http://dx.doi.org/10.4232/1.12955.
In the literature on large-scale assessments, a decline in performance over the course of the assessment is now a well-established result (Borgonovi and Biecek, 2016[3]; Brunello, Crema and Rocco, 2018[4]; Borghans and Schils, 2012[5]).
Timing information extracted from log files is important for better understanding the mechanisms behind this established result. Past literature has interpreted the decline in performance during the test as evidence of a lack of endurance or motivation. But the decline in time allocated could also (at least partly) be due to a learning effect and to increased efficiency in answering the questions. The next section attempts to disentangle the two channels by examining whether the relationship between time on task and probability of success changes in the course of the assessment (with item position).
Time-allocation strategies
While the previous section took a predominantly item-level approach, this section focuses on individual respondents, looking at how they allocated time to the different items. Chapter 3 presented information on how the time allocated to the assessment varied across respondents. This section deepens that analysis by looking at how time allocation interacts with item characteristics and how it varies during the course of the assessment.
A first question to address is whether respondents differ in the strategy they choose to allocate time between items. One way to answer this question is to compute, for each respondent, the percentile rank of the respondent in the distribution of time on task for each item presented to him/her. It is then possible to analyse the features of the individual-specific distributions of these percentile ranks. A compressed distribution indicates that the respondent adopted a consistent strategy, always devoting the same amount of time (relative to all other respondents who were assigned the same items) to all items in the assessment. On the other hand, a more dispersed distribution would characterise a respondent who spent an unusually large amount of time on some items and an unusually small amount of time on other items. The standard deviation of the individual distributions of percentile rank is used as a summary measure of the degree of dispersion.
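The dispersion measure just described is straightforward to compute. A minimal sketch, under the same hypothetical long-format layout as above:

```python
import pandas as pd

df = pd.read_csv("piaac_log_items.csv")  # respondent_id, item_id, time_on_task

# Percentile rank of each respondent within the time-on-task distribution
# of every item he/she was administered (0-100 scale)
df["pct_rank"] = df.groupby("item_id")["time_on_task"].rank(pct=True) * 100

# Standard deviation of those ranks, respondent by respondent: a compressed
# distribution signals a consistent pacing strategy across items.
dispersion = df.groupby("respondent_id")["pct_rank"].std()
print(dispersion.describe())  # the chapter reports an average of about 20 points
```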
Individual standard deviations can be aggregated by countries or by socio-demographic characteristics of respondents. The results are presented in Figure 4.8 and Figure 4.9. The average standard deviation is around 20 percentile points. This indicates a relatively large degree of individual heterogeneity: different respondents interact in different ways with the same item, with the result that the same respondent can be relatively fast on one item and relatively slow on other items. On the other hand, there is very little cross-country variation in this indicator, as shown in Figure 4.8. Similarly, Figure 4.9 shows very little variation across socio-demographic groups (note that the scale of Figure 4.9 ranges from -5 to +5 percentile ranks, while the scale of Figure 4.8 ranges from 0 to 40).
The previous section showed a strong relationship between time on task and item difficulty at the item level. It is possible to undertake this analysis at the respondent level, by asking how individuals allocate time to items based on the individual-specific probability of success. While difficulty of a specific item is a fixed characteristic of the item, the ex ante probability of success is an indicator that simultaneously takes into account the difficulty of the item and the respondent’s ability. Indeed, the reported PIAAC proficiency levels are constructed on the basis of the models used to estimate items’ parameters and respondents’ final scores and are defined in terms of a probabilistic relationship between respondents’ skills and item difficulty. It is then possible to make statements such as “A respondent at Level 3 of the PIAAC proficiency scale is able to correctly answer an item of Level 3 difficulty with a probability of 67%.”
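To make the probabilistic relationship concrete, the sketch below evaluates a stylised two-parameter logistic (2PL) item response function. The operational PIAAC scaling model is more elaborate, so this is an illustration of the 67% convention rather than the model actually used to produce the published proficiency levels.

```python
import math

def prob_success(theta, difficulty, discrimination=1.0):
    """Stylised 2PL item response function: the probability that a respondent
    with proficiency `theta` correctly answers an item of the given difficulty."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Proficiency levels are anchored on a 67% response probability. Under this
# stylised model, that convention pins down the ability-difficulty gap:
offset = math.log(0.67 / 0.33)        # about 0.71 logits
print(round(prob_success(theta=offset, difficulty=0.0), 2))  # 0.67
print(round(prob_success(theta=0.0, difficulty=2.0), 2))     # a much harder item: 0.12
```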
A rational individual who values his or her time should not devote too much time to questions that are too difficult, and which he/she is therefore very unlikely to be able to answer correctly. The adaptive nature of the PIAAC assessment makes these extreme situations less frequent. This is because, on average, items are targeted by construction to the expected ability of individual respondents. However, appropriate scaling also requires that some skilled respondents are assigned very easy items and some less skilled respondents are assigned very difficult items. Furthermore, given that the measure of ability used here to compute probability of success is only known at the end of the assessment, there is a good range of variation among individuals.
The pattern in Figure 4.10 is consistent with a priori expectations. As the item becomes excessively difficult, respondents devote less time to it (relative to other respondents who faced the same item). Time-on-task percentile also tends to decrease, although to a lesser extent, when the item is very easy. The decline in time on task is lower at the top than at the bottom end of the probability of success distribution, because respondents are more likely to skip difficult items (therefore devoting very little time to them). Easy items, on the other hand, necessarily take some time to answer correctly.
The shape of the relationship between time on task and probability of success does not seem to be affected by item position, as illustrated in Figure 4.11. The curve for the second module is simply shifted downwards, consistent with the fact that respondents spend less time on the second module than on the first module.
Figure 4.12 plots time on task against the individual probability of success, depending on whether the final answer was correct, incorrect or missing. In the case of correct answers, there is no decline at the bottom end of the probability of success distribution, while there is a decline at the top end, as is the case for the overall sample. The opposite happens in the case of incorrect answers. Time on task declines as items get harder, although less than in the overall sample, because respondents did attempt to give an answer. At the top end, there is no decline in time on task, which is what one would expect when respondents fail to give a correct answer to an easy item. Items for which respondents did not provide an answer follow a pattern similar to the overall sample, but the distribution of time on task is shifted downwards, indicating that at some point the respondents decided to give up.
Interestingly, the relationship between time on task and probability of success differs by module, but only for missing answers, as illustrated in Figure 4.13 and Figure 4.14. Moving from Module 1 to Module 2, the curves for correct and incorrect answers are simply shifted downwards, as in the case of Figure 4.11. For missing answers, the curve changes shape and becomes flatter. This means that, in Module 1, respondents spent a relatively larger amount of time before deciding to skip an easy item (i.e. an item with a large ex ante probability of success). In Module 2, decisions to skip easy items are taken much faster. On the other hand, there is no evidence that the increase in missing answers from Module 1 to Module 2 is concentrated in relatively easy or relatively difficult items.
More interestingly, and unexpectedly, the relationship between time on task and probability of success is unrelated to (self-reported) perseverance, which can be proxied by the answer to an item of the background questionnaire asking the respondent whether he/she “gets to the bottom of difficult things”.
The final section of this chapter looks at the relationship between time on task and actual performance on the assessment, measured by the probability of giving a correct answer to an item (Goldhammer et al., 2014[6]). This is not the same as the ex ante individual probability of success that was used in previous parts of the chapter. The ex ante individual probability of success is a measure of how difficult an item is for a respondent with a given ability level. The probability of answering an item correctly is the ex post realisation (i.e. a dummy variable taking a value of 1 if the respondent correctly answered an item and a value of 0 otherwise). No distinction is made between the absence of a response due to skipping and an incorrect answer.
The most straightforward way to investigate whether spending more time on an item actually increases the probability of giving a correct answer is through the following regression:

$$C_{ij} = \beta_1 T_{ij} + \beta_2 T_{ij}^2 + \mu_i + \gamma_j + \varepsilon_{ij}$$

where $C_{ij}$ is a dummy taking value 1 if individual $i$ correctly answered item $j$; $T_{ij}$ and $T_{ij}^2$ form a (quadratic) polynomial in the time on task spent by individual $i$ on item $j$; $\mu_i$ is a respondent fixed effect (that controls for any fixed individual characteristic, like ability and average motivation); and $\gamma_j$ is an item fixed effect (that controls for any characteristic of item $j$). $\varepsilon_{ij}$ is a random error term.
The regression exploits the fact that the data contain information on a variety of respondents answering the same set of items. As a result, the regression compares the outcome of different individuals who allocated a different amount of time to the same item, controlling at the same time for any fixed characteristic of the respondent (thanks to the fact that the data contain information on the same respondent answering different items).
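A minimal sketch of such a two-way fixed-effects linear probability model follows, absorbing the respondent and item effects by alternating within-group demeaning rather than estimating thousands of dummies. This is an illustration under assumed variable names ($C_{ij}$ as `correct`, $T_{ij}$ as `tot`), not the exact estimator behind the published results.

```python
import numpy as np
import pandas as pd

def absorb_two_fe(data, cols, fe_a, fe_b, tol=1e-8, max_iter=200):
    """Demean `cols` within both fixed-effect groupings, alternating until
    convergence (the standard alternating-projections trick for two-way FE)."""
    X = data[cols].astype(float).copy()
    for _ in range(max_iter):
        X_prev = X.copy()
        X = X - X.groupby(data[fe_a]).transform("mean")
        X = X - X.groupby(data[fe_b]).transform("mean")
        if (X - X_prev).abs().to_numpy().max() < tol:
            break
    return X

# Hypothetical long-format data: one row per respondent-item pair
df = pd.read_csv("piaac_log_items.csv")  # respondent_id, item_id, time_on_task, correct (0/1)
df["tot"] = df["time_on_task"] / 60.0    # time on task in minutes (illustrative unit)
df["tot_sq"] = df["tot"] ** 2

# Linear probability model: correct ~ tot + tot^2 + respondent FE + item FE
dem = absorb_two_fe(df, ["correct", "tot", "tot_sq"], "respondent_id", "item_id")
beta, *_ = np.linalg.lstsq(dem[["tot", "tot_sq"]].to_numpy(),
                           dem["correct"].to_numpy(), rcond=None)
print({"time on task": beta[0], "(time on task)^2": beta[1]})
# Expect a positive linear and a negative quadratic coefficient: more time
# raises the probability of a correct answer, at declining rates.
```

The alternative specification discussed next would simply drop the two fixed-effect groupings and include the ex ante probability of success as a regressor instead.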
An alternative specification would replace the individual and item fixed effect by the ex ante individual probability of success, a variable at the individual-item level that is supposed to contain all the relevant information in terms of the interaction between the respondent and the item (i.e. how difficult a given item is for a given respondent) (Table 4.3).
In both specifications, time on task has a positive but declining effect on the probability of giving a correct answer. In other words, spending more time on an item increases the probability of giving a correct answer, but only up to a certain point. Spending an excessive amount of time, in fact, signals that the respondent has not fully understood what the item requires and is therefore less likely to give a correct answer.
For Models 4 and 5, the time on task indicators are interacted with a dummy for whether the item was taken as part of Module 2. The regression also includes the main effect of Module 2. The coefficient on the main effect of Module 2 is negative, which is consistent with what was shown before: performance significantly declines in Module 2 compared to Module 1 (Table 4.2 showed that the proportion of missing answers increased from 5.4% to 9.0% in Module 2, while the proportion of correct answers decreased from 59.6% to 56.6%). This captures the average effect of fatigue or decreased motivation at later stages of the assessment.
More interesting is the fact that the coefficient on the interaction term is positive and statistically significant. This means that, compared to Module 1, the returns to time on task are higher in Module 2: spending more time on a given item leads to a higher increase in the probability of giving a correct answer when that item is administered in Module 2.
This result can be interpreted in two ways. On the one hand, respondents could achieve better performance if they spent a bit more time on the items. It is possible that, by the time they get to Module 2, respondents are tired or less motivated, and as a consequence the value they attach to their free time has increased relative to the value they attach to performing well on the assessment. On the other hand, for a given amount of time spent on an item, respondents are more likely to give a correct answer if the item is administered in Module 2 rather than in Module 1. This would suggest that respondents become more efficient in answering items, although it is not possible to determine for what reason.
Table 4.3. Time on task and item performance

| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| Time on task | 1.118 | 0.662 | 0.478 | 0.607 | 0.437 |
| | (0.006) | (0.005) | (0.004) | (0.007) | (0.007) |
| (Time on task)^2 | -0.010 | -0.006 | -0.004 | -0.005 | -0.004 |
| | (0.000) | (0.004) | (0.000) | (0.000) | (0.000) |
| Module 2 | - | - | - | -3.412 | -1.350 |
| | | | | (0.217) | (0.203) |
| Time on task × Module 2 | - | - | - | 0.088 | 0.078 |
| | | | | (0.010) | (0.009) |
| (Time on task)^2 × Module 2 | - | - | - | -0.000 | -0.000 |
| | | | | (0.000) | (0.000) |
| Item fixed effects | NO | YES | NO | YES | NO |
| Respondent fixed effects | NO | YES | NO | YES | NO |
| Ex ante probability of success | - | - | 98.436 | - | 98.431 |
| | | | (0.105) | | (0.105) |
| R2 | 0.03 | 0.34 | 0.38 | 0.34 | 0.38 |
| N. Observations | 1 538 752 | 1 538 752 | 1 538 752 | 1 538 752 | 1 538 752 |
Note: The table reports results from different regression models. In all of them, the dependent variable is a dummy variable which equals 1 if the respondent has given a correct answer to the item and 0 otherwise. Standard errors are reported in parentheses. Estimated coefficients and standard errors have been multiplied by 100. The sample includes only participants in the computer-based assessment who were assigned to the literacy and numeracy modules.
Source: OECD (2017[2]), Programme for the International Assessment of Adult Competencies (PIAAC), Log Files, http://dx.doi.org/10.4232/1.12955.
Conclusions
This chapter investigated the relationship between time on task and item characteristics to shed light on the strategies and criteria respondents use to allocate time across different items in the course of the assessment.
In the first part of the chapter, the analysis was carried out at the item level. A first result is the large degree of between-item variation in time on task. In this respect, the differences between countries are not very pronounced. Time on task is strongly related to intrinsic item characteristics, such as item difficulty. Further evidence in this respect comes from the fact that respondents who correctly answered an item spent about the same amount of time as respondents who provided an incorrect answer. Respondents devoted a considerably smaller amount of time to items administered in the second half of the assessment. This was accompanied by a decrease in performance (measured by the fraction of items answered correctly) and by an increase in the proportion of missing answers. This seems to suggest that the decrease in time on task is due to an increase in fatigue or disengagement.
The second part of the chapter shifted the analysis to the level of the individual respondents. An important result is that respondents seem to allocate time to tasks in a rational way, devoting less time to items that are either very difficult or very easy, and more time to challenging items for which the probability of success is close to 50%. This pattern is observed in different countries, as well as in the two modules of the assessment. However, in Module 2, respondents seem to be much faster in deciding to skip items. The fact that the relationship between time on task and individual probability of success keeps the same shape across the modules provides some evidence of a learning effect. The decrease in time on task during the course of the assessment is, then, not entirely due to fatigue or disengagement, but also to some degree to the fact that respondents become more efficient in the way they interact with the assessment.
Finally, the analysis estimates the impact of time on task on performance, measured by the probability of giving a correct answer to an item. The structure of the dataset and the partially random allocation of items to respondents make it possible to control for item and respondent fixed effects, as well as for the position of the item. Indeed, the analysis shows that spending more time on an item does increase the probability of giving a correct answer, although at declining rates.
References
[5] Borghans, L. and T. Schils (2012), The Leaning Tower of Pisa: Decomposing Achievement Test Scores into Cognitive and Noncognitive Components, http://www.sole-jole.org/13260.pdf.
[3] Borgonovi, F. and P. Biecek (2016), “An international comparison of students’ ability to endure fatigue and maintain motivation during a low-stakes test”, Learning and Individual Differences, Vol. 49, pp. 128-137, http://dx.doi.org/10.1016/j.lindif.2016.06.001.
[4] Brunello, G., A. Crema and L. Rocco (2018), “Testing at length if it is cognitive or non-cognitive”, Discussion Paper Series, No. 11603, IZA, http://ftp.iza.org/dp11603.pdf.
[6] Goldhammer, F. et al. (2014), “The time on task effect in reading and problem solving is moderated by task difficulty and skill: Insights from a computer-based large-scale assessment.”, Journal of Educational Psychology, Vol. 106/3, pp. 608-626, http://dx.doi.org/10.1037/a0034716.
[1] Maddox, B. et al. (2018), “Observing response processes with eye tracking in international large-scale assessments: Evidence from the OECD PIAAC assessment”, European Journal of Psychology of Education, Vol. 33/3, pp. 543-558, http://dx.doi.org/10.1007/s10212-018-0380-2.
[2] OECD (2017), Programme for the International Assessment of Adult Competencies (PIAAC), Log Files, GESIS Data Archive, Cologne, http://dx.doi.org/10.4232/1.12955.