Chapter 2. What log files are and why they are useful

Abstract

This chapter explains what log files are and how they complement traditional proficiency scores. It describes the features of the log files generated in the Survey of Adult Skills cognitive assessment and explores the impact of assessment design on the interpretation of information contained in the files.
What are log files?
Log files are the traces of the communication events between a user interface and a server. They are unformatted, produced on a large scale and not designed to be interpreted. Their primary purpose is to enable communication between the interface and the server and to store information within a software application. The structure and contents of log files are generally defined by software developers, not survey scientists, so their content is typically determined by the functionality of the computer interface rather than by the needs and interests of researchers and analysts.
The availability of log files for studies like the Survey of Adult Skills, a product of the Programme for the International Assessment of Adult Competencies (PIAAC) (hereafter referred to as “PIAAC”), or the Programme for International Student Assessment (PISA) can thus be seen as a by-product of technological innovations that increasingly allow these types of assessments to be administered on a computer-based platform, replacing more traditional paper-based assessments.
PIAAC was the first large-scale international assessment to be primarily designed for computer-based administration (Kirsch and Lennon, 2017[1]). The advantages of computer-based administration are manifold. It automates error-prone tasks such as questionnaire branching (delivering questions that depend on how participants answered previous questions), replaces the manual and costly encoding of handwritten answers into a formatted dataset, and allows automatic scoring of items.
Moreover, the technology is helpful for the implementation of adaptive testing. In an adaptive test, the sequence of items (particularly their difficulty) is targeted to the expected proficiency of respondents, which is inferred from responses to previous questions. Such an algorithm maximises the information that can be extracted from the answers to each item, for example by not administering easy questions to individuals who have already shown that they can correctly answer more difficult items. The implications of adaptive testing for the analysis of log files are discussed more extensively later in this chapter.
The PIAAC testing application had to fulfil two important technical requirements: 1) flexibility with respect to operating systems, hardware and character fonts (most notably non-western alphabets) to enable successful implementation of the platform in all participating countries; and 2) the possibility of adapting the system to PIAAC-specific features. The web-based architecture and open-license status of the TAO (testing assisté par ordinateur) platform satisfied these requirements. Chapter 9 of the Technical Report of the Survey of Adult Skills (PIAAC) provides further details on TAO and a detailed overview of the software framework to which log files belong (OECD, 2013[2]).
Like other similar platforms, TAO is structured around two main components: a user interface and a server. Survey participants interact with a user interface similar to a web browser. It displays test items sequentially, presenting embedded buttons, text entry boxes, checkboxes and selection elements in a single pane, together with the item stimulus and questions. The interface sends all relevant participant inputs (including final answers) to the server. The server saves this information, adapts the display dynamically according to participants’ actions within each item and directs the user to the next item or question. As the test is adaptive, this allocation follows predefined rules that assign items according to past answers and participants’ characteristics.
Log files are the traces of the communications between the user interface and the server. Importantly, all these records come with a precise timestamp that allows reconstruction of the complete chronology of respondents’ interactions with the test application over the course of the assessment.
Log files were created as a means to communicate and store information within a software application. They are comparable to the kind of web server traces that constitute the basic elements of what is now commonly known as big data. From the point of view of a user of survey data, the files are unformatted, not designed to be interpreted and produced on a large scale. They are typically stored in xml format, where (as in html) relevant information is embedded in series of tags. Since each interview creates several log files, the total number of files made available can quickly become very large. The current release of PIAAC log files is based on more than 200 000 files, covering about 60 000 respondents. Each of these files mixes embedded tags with question-specific and user-dependent information, the latter accounting for just a fraction of the total file size.
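To give a sense of what working directly with the raw files involves, the short Python sketch below parses one such file and lists its timestamped events in chronological order. The tag and attribute names (event, time, type) and the file name are hypothetical placeholders, not the actual PIAAC schema, which is described in the log-data documentation.

```python
import xml.etree.ElementTree as ET

def list_events(path):
    """Parse one log file and return its events sorted by timestamp.

    Assumes a hypothetical schema in which each interaction is stored as an
    <event> element with 'time' (milliseconds) and 'type' attributes; the
    real PIAAC schema differs and is described in the log-data documentation.
    """
    root = ET.parse(path).getroot()
    events = [(int(node.get("time")), node.get("type"), node.text)
              for node in root.iter("event")]
    # Timestamps allow the full chronology of interactions to be reconstructed.
    return sorted(events, key=lambda event: event[0])

if __name__ == "__main__":
    for timestamp, event_type, payload in list_events("respondent_0001_item.xml"):
        print(timestamp, event_type, payload)
```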
The opacity of log-file data has limited their use by education researchers, who are generally unfamiliar with xml files. This is why the public release of the PIAAC log files was accompanied by the release of the LogDataAnalyzer (LDA). The LDA was developed to turn log files into a user-friendly dataset. It also features a few statistical tools to make it easy to explore the distribution of each variable extracted (Figure 2.1).
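For users who prefer to work outside the LDA interface, a dataset exported from the tool can be analysed with standard statistical software. The snippet below is a minimal sketch of such a workflow, assuming the extracted variables have been exported to a flat file; the file name and column names are hypothetical.

```python
import pandas as pd

# Hypothetical export: one row per respondent-item pair with the variables
# extracted by the LDA (e.g. time on task in seconds).
lda = pd.read_csv("lda_export_literacy.csv")

# Inspect the distribution of time on task overall and by item, much as the
# LDA's built-in statistical tools would allow.
print(lda["time_on_task"].describe())
print(lda.groupby("item_id")["time_on_task"].median())
```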
It is important to note that log files record only some interactions between the respondent and the computer. They do not capture many potentially important aspects of respondents’ behaviour. For example, interactions that are dealt with within the user interface itself and do not need to be transferred to the server (such as text scrolling or mouse movement) are not captured, even though they could, in principle, provide valuable information about the cognitive and non-cognitive processes respondents follow in solving the items.
Moreover, participants’ actions are obviously not limited to interactions with the computer. Between each recorded interaction, participants may be reading the stimulus, thinking or simply remaining idle. They could also take other actions to solve the item, such as writing on a piece of paper or using a calculator. All these actions are potentially very valuable to help understand the behaviour of test-takers and the cognitive processes they follow to solve the tasks, but they are missing from the information contained in log files (Maddox, 2018[3]; Maddox et al., 2018[4]).
Why are log files valuable?
While log files are essentially traces or records of the transfer of information between components of the testing software, they are valuable because they provide information on how respondents processed their answers. All actions on the part of test-takers that cause a change in the interface need to be communicated to the server and are hence recorded in the log file along with a timestamp. These include not only test-takers’ responses to tasks, but also some (although not all) intermediate actions preceding input of the final answer.
Such actions causing a change in the interface typically include switches between the simulated web pages forming the test stimulus and the highlighting of text. As a result, it is possible to construct indicators that give information on how participants interact with each item before delivering the answer. Such information would be nearly impossible to collect systematically in the paper-based version of assessment, but it is an outcome of technical features of the testing application in computer-based assessments.
The type and amount of information recorded in the log files depend on the choices made by software developers. A first important lesson of this report is that the value of log files is maximised when researchers co-operate with software developers to identify the variables that should be recorded in the log files and when test developers incorporate the possible availability of log files in designing assessment items.
The information contained in log files can then be seen as clues on how participants process answers. This is potentially valuable to users of survey data (statisticians, education researchers and policy makers), as well as survey designers and managers.
Policy makers, researchers and test developers
In PIAAC and similar international surveys, measurement is outcome-based. The psychometric models used to estimate the proficiency of respondents take into account successes and failures and use them to build a continuous proficiency scale.
However, the performance of respondents on any given test results from a combination of individual cognitive and non-cognitive traits that are not measured. The Reader’s Companion for PIAAC (OECD, 2016[5]) acknowledges that: “ […] the focus of the Survey of Adult Skills is less on the mastery of certain content […] and a set of cognitive strategies than on the ability to draw on this content and these strategies to successfully perform information-processing tasks in a variety of real-world situations.” Importantly, this ability also includes the attitude of participants and their disposition to do well. However, although cognitive strategies, attitude and context provide a useful grid to describe test-taking behaviour, they are still ill-defined.
Up to now, efforts to understand skills acquisition with PIAAC have built on linking PIAAC proficiency scores with personal characteristics collected in the background questionnaire. PIAAC proficiency scales are principally designed to rank and compare survey participants for the purpose of studying the distribution of skills, both as a whole and across groups. These comparisons have provided essential results, most notably a quantification of the relationships between education, age and skills. These relationships identify good and bad performers and, consequently, which characteristics are associated with higher skill levels.
However, such analysis cannot account for differences in the cognitive processes deployed during the assessment. PIAAC proficiency scores are well suited to identify populations lacking skills, but they are not able to characterise the reasons behind a given performance, such as cognitive strategies, knowledge or attitudes, thus hindering development of policies to target these issues.
This limits the extent to which skill acquisition can be studied with PIAAC. Meaningful changes in skills always proceed from variations in knowledge, cognitive strategies or attitudes. Teachers, parents or peers do not teach skills per se, but influence attitudes and transfer knowledge or cognitive strategies.
Indeed, a large part of the research undertaken using log files or other forms of process data has tried to infer some measures of non-cognitive skills from participants’ behaviour during the assessment (Goldhammer et al., 2016[6]; Anghel and Balart, 2017[7]) (see Chapter 5 for a full discussion and analysis of disengagement).
Because log files describe how participants interact with each cognitive item, they can be used to analyse the cognitive and non-cognitive resources deployed by respondents during the assessment. However, making inferences on cognitive resources on the basis of the data from log files is not an easy task. It demands a certain amount of ingenuity on the part of the analyst, and the analysis may deliver only partial results.
Since the content of individual items provides the context for interpretation of the log-file variables, item-specific analysis is a simple and useful way to take advantage of log files. The meaning of each log-file variable is, in the end, always item-dependent. Focusing on a single item makes the analysis easier and more robust. But that is at the cost of lower generalisability of the results, because it is often difficult to extract information from log files that consistently measures the same underlying construct across different items. For example, a participant who managed to solve a given item in 30 seconds could be considered either slow or fast, depending on the item.
Item-specific analysis generally consists of going beyond an interest in the difference between correct and incorrect answers. In some cases, cognitive strategies can be observed. Greiff, Wüstenberg and Avvisati (2015[8]) provide an excellent example of the promises offered by log-file data. They study a PISA 2012 item on climate control, extracted from the Complex-Problem-Solving domain. This item requires an understanding of how a multi-parameter system works. It is generally solved by following a strategy described as vary-one-thing-at-a-time. The implementation of the strategy can be identified through the log files, without taking the final answer into account. As a result, it is possible to classify respondents based on the strategy they followed, irrespective of whether they ended up giving the right answer.
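As an illustration of how such a strategy might be detected from process data, the following sketch assumes that the log events for an exploration item have already been reduced to a list of rounds, each recording which input variables the respondent changed. The representation and control names are hypothetical; published analyses use richer criteria.

```python
def follows_votat(rounds, controls=("A", "B", "C")):
    """Check whether an exploration sequence follows vary-one-thing-at-a-time.

    `rounds` is a list of sets, each containing the controls changed in one
    exploration round, e.g. [{"A"}, {"B"}, {"C"}] for a clean VOTAT sequence.
    The criterion used here is deliberately simple: no round changes more than
    one control, and every control is varied in isolation at least once.
    """
    varied_alone = set()
    for changed in rounds:
        if len(changed) > 1:
            return False              # two or more inputs varied simultaneously
        varied_alone |= changed
    return varied_alone == set(controls)

print(follows_votat([{"A"}, {"B"}, {"C"}]))   # True: clean VOTAT strategy
print(follows_votat([{"A", "B"}, {"C"}]))     # False: inputs varied together
```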
Such analysis can be extended to a set of items, as long as their content is homogeneous enough to define common strategies or features. OECD (2015[9]) analyses web navigation strategies in a subset of PISA 2012 digital reading items. It distinguishes between task-oriented and exploratory navigation. In the case of exploratory navigation, while pupils may still find the correct answer, their browsing activity features visits to irrelevant web pages. Thanks to this distinction, participating countries can be classified according to the efficiency of web navigation rather than according to digital reading scores (i.e. in terms of attitude and cognitive strategies rather than outcome-based skills). A recent attempt to identify consistent indicators across PIAAC items measuring ability in Problem Solving in Technology-Rich Environments (PSTRE) is found in He, Borgonovi and Paccagnella (forthcoming[10]).
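A simple classification rule in this spirit can be written as follows, assuming that each respondent’s browsing activity has been reduced to an ordered list of visited pages and that the set of task-relevant pages is known for the item. Page identifiers are hypothetical.

```python
def classify_navigation(visited_pages, relevant_pages):
    """Label a browsing sequence as task-oriented or exploratory.

    A simple rule in the spirit of the PISA digital-reading analysis:
    navigation counts as exploratory as soon as it includes visits to pages
    that are irrelevant to the task at hand.
    """
    irrelevant = [page for page in visited_pages if page not in relevant_pages]
    return "exploratory" if irrelevant else "task-oriented"

print(classify_navigation(["home", "timetable"], {"home", "timetable"}))
print(classify_navigation(["home", "ads", "blog", "timetable"], {"home", "timetable"}))
```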
The more diverse the set of items, the less specific the analysis will be. At the same time, the conclusions will be more wide-ranging, opening the door to measuring attitudes or behavioural traits. However, a few indicators do lend themselves to consistent analysis of all items. The most straightforward example of such an indicator is probably time on task, which has been used, for instance, by Goldhammer et al. (2016[6]) to infer attitudes such as item disengagement among survey participants.
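A common, deliberately simplified way of operationalising disengagement from time on task is to flag responses faster than a threshold, as in the sketch below. The data layout and the five-second cut-off are assumptions for illustration; published work relies on more refined, item-specific rules.

```python
import pandas as pd

# Hypothetical extract of log-file variables: one row per respondent-item pair.
data = pd.DataFrame({
    "respondent":   [1, 1, 2, 2, 3, 3],
    "item":         ["I1", "I2", "I1", "I2", "I1", "I2"],
    "time_on_task": [42.0, 38.1, 57.5, 61.2, 2.4, 3.9],   # seconds
})

THRESHOLD = 5.0  # illustrative cut-off; real analyses use item-specific rules
data["rapid_response"] = data["time_on_task"] < THRESHOLD

# Share of suspiciously rapid (likely disengaged) responses per respondent.
print(data.groupby("respondent")["rapid_response"].mean())
```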
Measurements based on log-file variables are attractive, because they reflect actual behaviour, although only in the specific context of the PIAAC assessment. In this sense, they could be a useful complement to more traditional measures of behavioural traits. Psychometric measurements of individual behavioural traits, such as the widely used Big-Five scales or the readiness-to-learn scale available in PIAAC, generally rely on self-assessment. Although these measurements do not have the disadvantage of being context-specific, they are prone to other biases that are not present in the case of log-file data (e.g. lack of sincerity or differences in how respondents interpret the questions).
Log files can also help to better understand why some items display insufficient psychometric properties. The probability of successfully completing an item should be related only to the respondent’s underlying proficiency. Non-construct-related factors, such as culture or gender, should not affect item difficulty. This is a particularly challenging constraint in international assessments, where it is not rare to find that the relative difficulty of an item is not the same in all countries.1 In such cases, the item is said to lack measurement invariance. However, the available statistical procedures are only able to detect the presence or the absence of measurement invariance; they are silent on the underlying reasons behind the failure of an item to satisfy the invariance condition. Typically, the lack of invariance is due to some features of the item content or to translation errors. Information on respondents’ behaviour contained in log files can be a useful complement to provide test developers with a better understanding of why a certain group of respondents finds a certain item more or less difficult.
Survey design and management
For survey designers and managers, log files have proved to be useful in improving data quality in several ways. In PIAAC and PISA, log files have been used to detect data falsification. By allowing a comparison of the processes leading to a response, log files represent powerful tools in the prevention and detection of data falsification in low-stakes assessments. In contrast with high-stakes assessments, such as exams, the most important source of falsification in an international survey such as PIAAC is not the participants themselves, but those involved in survey administration: interviewers, survey contractors or national managers.
Yamamoto and Lennon (2018[11]) highlight how log-file data, in particular timing data, can be used to detect cases of fabricated data. Interviewers who want to minimise effort can fill in questionnaires and assessment answers themselves, but doing so in a way that is consistent with the timing and response patterns of real respondents would be cumbersome, and the effort required would likely offset the benefits. Survey managers who wish to inflate country performance could do so by replicating the response profiles of high achievers, if they have access to the master datasets. In doing so, however, they would also duplicate the associated timestamps (which are precise to the millisecond). Although identical answer profiles are plausible, identical timing profiles are not. Log files are particularly valuable for this purpose because they are difficult to edit. In principle, it is possible to fabricate log files and create plausible profiles, but this would require much more sophisticated knowledge and far more time and effort than simply editing final datasets by copying and pasting respondent records.
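The logic of such a check can be illustrated with a few lines of code. The sketch below groups respondents whose timing profiles are exactly identical; the identifiers and timestamps are invented for illustration.

```python
from collections import defaultdict

def duplicate_timing_profiles(profiles):
    """Group respondents whose timing profiles are exactly identical.

    `profiles` maps a respondent identifier to a tuple of millisecond-level
    times across the items taken. Identical answer profiles are plausible;
    identical timing profiles are not, so any group with more than one
    member warrants investigation.
    """
    groups = defaultdict(list)
    for respondent, timing in profiles.items():
        groups[tuple(timing)].append(respondent)
    return [ids for ids in groups.values() if len(ids) > 1]

profiles = {
    "r001": (18342, 27411, 9023),
    "r002": (18342, 27411, 9023),   # exact copy of r001: a red flag
    "r003": (20110, 25980, 11207),
}
print(duplicate_timing_profiles(profiles))   # [['r001', 'r002']]
```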
The use of log files for the management of data quality has been taken further through their integration into dashboard software. Dashboards are tools designed to help survey managers monitor the progress of data collection. Mohadjer and Edwards (2018[12]) document the use of dashboards during the data collection phase of Round 3 of PIAAC in the United States. These dashboards were connected to interviewers’ computers and used the log files, as they were generated during interviews, to track interviewer activity. Thanks to this system, it was possible to detect suspicious cases during data collection (such as interviews taking place at unlikely times or assessments with improbably short completion times), identify mistakes or falsification in a timely manner and take corrective action. By increasing the chances of detecting such behaviour in close to real time, the integration of dashboards and log files can greatly reduce the incentives for falsification and effort reduction on the part of interviewers and survey administrators.
Content and characteristics of PIAAC log files
The interpretation of variables derived from log files depends to a large extent on the content of test items: the tasks test-takers must carry out, the questions they must answer and the nature of the item stimulus. Inferences regarding the cognitive strategies of test-takers made on the basis of information in log files only make sense in light of the content and format of the items.
To ensure that potential respondents do not have access to the items and the correct answers, most test items are treated as confidential and are not accessible to researchers. The log-data documentation helps external users to access the contents of items that are already public. Some confidential items, including all items from the assessment of PSTRE, are also available upon submission of a detailed research proposal. The PIAAC technical report (Chapter 2) provides definitions for all three domains and describes the different context categories, the different types of tasks and the various dimensions that contribute to item difficulty (OECD, 2013[2]). However, while the technical report is a helpful resource for understanding how diverse cognitive items can be, it does not give any item-specific details.
In the end, the type of information that is recorded in the log files is a function of the interaction between the characteristics of each item and the characteristics of the digital assessment platform. Generally, the more complex the item stimulus, the more variables will be available. In principle, dynamic items, whose elements change in response to the actions of test-takers (e.g. manipulating values through the use of sliders or radio buttons) or become accessible only through the action of the test-taker (e.g. accessing a simulated web page by clicking on a hyperlink), will allow collection of more variables, as all changes in the user interface require some exchange of information between the server and the user interface.
It is important to keep in mind that assessment items have mainly been designed with the objective of estimating a proficiency score based on the final answer provided. Consequently, they often do not lend themselves to analysing the process through which the respondent arrived at a specific answer. For example, it is not always possible or straightforward to unambiguously observe or define a variety of theoretically grounded cognitive strategies that a respondent might choose to follow in trying to solve the items. This depends on the design of the item and/or on the amount of information that ends up being recorded in the log file. By their nature, PSTRE items and, to a lesser extent, multi-page literacy items lend themselves better to this kind of analysis.
The user interface that is common to all PIAAC items is divided into two parts (Figure 2.2). The left panel features navigation buttons, presents the item and states the question or describes the task. Clicking on the right-hand arrow terminates the current unit and opens a new one. The right panel consists of a flexible stimulus frame in which graphical representations, text, a website or application environment can be displayed.
The features of the stimuli vary according to the domain (Table 2.1). All numeracy items contain either charts or print text. Literacy items include stimuli based on printed text, charts or web environments. Web environments can include one or several web pages. Compared to literacy and numeracy items, PSTRE items feature a wider range of stimuli, including web environments, e-mail environments and combinations of e‑mail/spreadsheet/web environments.
Table 2.1. Types of stimuli in PIAAC items
| Stimulus type | Literacy | Numeracy | PSTRE |
|---|---|---|---|
| Print text / chart | 27 | 49 | 0 |
| Web environment | 22 | 0 | 4 |
| E-mail environment | 0 | 0 | 4 |
| Multiple environments | 0 | 0 | 6 |
Note: Multiple environments include spreadsheets, e-mail and web.
Response types define the format and range of possible answers. Numeracy and literacy items have different response types, and this will affect the interpretation of final answers. Response types can be classified as follows:
Stimulus choice or left-panel choice, which features a limited number of precoded answers that may or may not be mutually exclusive.
Stimulus clicking, which requires the participant to click on a graphical element in the stimulus (a cell in a table, a link).
Stimulus highlighting, which targets a string or strings of text. In the clicking and highlighting response modes, a correct response is defined in terms of a range of response actions (e.g. the minimum and maximum amount of text that can be highlighted for an answer to be correct).
Left-panel numeric entry, which requires the participant to provide the answer in the form of a number. The range of possible incorrect answers will thus depend on response mode.
Different item formats may also provide different incentives to respond in the first place. For example, it could be the case that respondents are more likely to provide answers to multiple-choice items, as this permits guessing (there is in fact no penalty for providing a wrong answer). The effort to provide an answer may be greater and the expected benefits lower in items with a more open response format (such as the input of a number or highlighting a portion of text).
It may seem surprising that PSTRE items do not feature a response mode. In fact, PSTRE items are not framed as questions but as tasks: rather than answering a question, the participant must reach an appropriate stage in the stimulus. For example, several items ask respondents to select, from a list, the objects that meet certain criteria.
Table 2.2 shows the number of items by response mode. Most numeracy items require a numeric entry, while most literacy items require the highlighting of strings of text in the stimulus. In each domain, only a few items feature a multiple-choice response format.
Table 2.2. Distribution of response modes
| Response type | Literacy | Numeracy |
|---|---|---|
| Left-panel numeric entry | 3 | 31 |
| Left-panel choice | 0 | 5 |
| Stimulus clicking | 8 | 11 |
| Stimulus highlighting | 31 | 0 |
| Stimulus choice | 7 | 2 |
Note: Left-panel choice and stimulus choice are both multiple-choice items. In the former, respondents select the answer in the left panel; in the latter, they select the answer below the stimulus.
Table 2.3 lists the different variables extracted from the log files by the LogDataAnalyzer and the number of items to which they relate. Time on task, time to first interaction, number of helps and number of cancel actions are the only variables available for all items. In most cases, cancel actions and helps are very rare events. Final answers cannot be defined for PSTRE items.
Time to first interaction is a generic variable whose interpretation differs greatly depending on the nature of the item. In simple static items, the first interaction is also the final one: the selection or input of an answer. In more dynamic items, the first interaction is the first change in the stimulus.
Time since last action represents the time elapsed between the action of providing a final answer and the time at which the test-taker passes to the next item. Although this variable is present for all numeracy and literacy items, it does not capture exactly the same information for all items. Answer interactions are transferred immediately to the server in all response modes other than left-panel numeric entry. In that case, the content of the text field is transferred only when the item is terminated (i.e. when the test-taker moves to the next item). As a result, for all items with a numeric entry response, time since last action is defined as zero and provides no useful information. Most numeracy items are in this category.
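The following sketch shows how these timing indicators can, in principle, be derived from a list of timestamped events recorded between the start and the end of an item. The event representation is hypothetical and does not reproduce the exact definitions used by the LogDataAnalyzer.

```python
def timing_indicators(events, item_start, item_end):
    """Derive basic timing indicators from a timestamped event list.

    `events` is a chronologically ordered list of (timestamp_ms, event_type)
    pairs recorded between item start and item end. Returns time on task,
    time to first interaction and time since last action, all in seconds.
    """
    time_on_task = (item_end - item_start) / 1000
    if not events:
        return time_on_task, None, None
    first_ms, last_ms = events[0][0], events[-1][0]
    return (time_on_task,
            (first_ms - item_start) / 1000,   # time to first interaction
            (item_end - last_ms) / 1000)      # time since last action

events = [(4200, "page_switch"), (15800, "highlight"), (31050, "answer")]
print(timing_indicators(events, item_start=0, item_end=33000))
```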
Table 2.3. Variables extracted from log files
| Variable | Numeracy | Literacy | PSTRE |
|---|---|---|---|
| Final response | 49 | 49 | 0 |
| Time on task | 49 | 49 | 14 |
| Time to first interaction | 49 | 49 | 14 |
| Time since last action | 49 | 49 | 0 |
| including validation | 18 | 45 | 0 |
| Number using cancel button | 49 | 49 | 14 |
| Number using help menu | 49 | 49 | 14 |
| Number of highlight events | 0 | 31 | 0 |
| Number of page revisits | 0 | 15 | 5 |
| Number of page visits | 0 | 15 | 5 |
| Number of different pages visited | 0 | 15 | 5 |
| Sequence of visited web pages | 0 | 15 | 5 |
| Time-sequence of time spent on web pages | 0 | 15 | 5 |
| Number of created e-mails | 0 | 0 | 6 |
| Number of different e-mail views | 0 | 0 | 6 |
| Number of revisited e-mails | 0 | 0 | 6 |
| Sequence of viewed e-mails | 0 | 0 | 6 |
| Sequence of switching environments | 0 | 0 | 6 |
| Number of switching environments | 0 | 0 | 6 |
The other variables record changes in the testing environment that result from participants’ actions. Four variables were generated for e-mail environments: number of created e-mails, number of different e-mail views, number of revisited e-mails and sequence of viewed e-mails. In items containing a web environment with several web pages, the LogDataAnalyzer extracts five different variables: sequence of visited web pages, time-sequence of time spent on web pages, number of page visits, number of page revisits and number of different pages visited. Finally, two variables describe switching between environments: the sequence and the number of switches.
The construction of a chronology of respondents’ interactions with the test application is possible only for items containing web environments with several web pages and/or e‑mail environments. This is true for a good proportion of literacy items (if they feature several web pages) and most PSTRE items. As numeracy items are all displayed in a much simpler environment, it is not possible to construct a similar chronology.
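Once such a chronology has been reconstructed, counts analogous to the page-visit variables listed in Table 2.3 can be derived from it, as in the sketch below. Page identifiers and the exact definitions are illustrative rather than those implemented in the LogDataAnalyzer.

```python
from collections import Counter

def page_navigation_summary(page_sequence):
    """Summarise a reconstructed chronology of page visits for one item.

    `page_sequence` is the ordered list of page identifiers visited by the
    respondent. Returns counts analogous to the variables in Table 2.3:
    total visits, distinct pages visited and revisits of already-seen pages.
    """
    visits = len(page_sequence)
    distinct = len(set(page_sequence))
    return {
        "number_of_page_visits": visits,
        "number_of_different_pages_visited": distinct,
        "number_of_page_revisits": visits - distinct,
        "most_visited_page": Counter(page_sequence).most_common(1),
    }

print(page_navigation_summary(["home", "results", "home", "details", "home"]))
```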
Although these variables cover most of the information available in log files, the documentation also includes details about all the various events that can be extracted from them. For every type of event, a short description is presented, along with the xml code that stands for the event and a few examples. Guidelines about the structure of log files complete these descriptions.
Log-file data are publicly available for 16 countries. They include data recorded from the cognitive instruments only.
Other features of test design relevant to analysis of log-file data
When analysing data from the PIAAC log files, it is important to consider two features of PIAAC: the routing of respondents in the computer-based branch and the adaptive nature of the assessment. These features are designed to maximise the efficiency of PIAAC and respond to the main objective of the study, which is to estimate the distribution of proficiency of the target population in the most efficient way. However, both of these features have consequences for secondary analysis of data at the individual level.
Routing of respondents
According to the PIAAC design, not all respondents were routed to the computer-based branch of the assessment (Figure 2.3). Log-file data are obviously not available for respondents who were routed to the paper-based branch of the assessment. It follows that the log-file data are not representative of the entire PIAAC target population, but are only available for a selected sample.
The allocation of respondents to the paper-based assessment followed a two-stage process. First, respondents who declared no prior computer experience, or who failed a simple test of information and communications technology (ICT) skills, were automatically directed to the paper-based assessment. In addition, respondents who passed the ICT assessment were offered the possibility of opting out of the computer-based route and choosing to take the assessment on paper. As a result, the population for which log-file data are available (equivalent to the population assigned to the computer-based assessment) is a sub-group within the PIAAC target population that: 1) had some computer experience; 2) accepted the computer-based assessment; and 3) passed the core ICT test. There is considerable variation across countries in the proportion of the population that meets these criteria (Table 2.4).
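In practice, identifying the subsample for which log-file data exist amounts to applying these three conditions to the respondent-level data, as in the sketch below. All variable names and values are hypothetical; the actual PIAAC files use different (documented) variables.

```python
import pandas as pd

# Hypothetical respondent-level flags; the actual PIAAC variables have
# different names, so this only illustrates the logic of the three conditions.
respondents = pd.DataFrame({
    "respondent_id":       [1, 2, 3, 4, 5],
    "computer_experience": [1, 1, 0, 1, 1],
    "opted_out_of_cba":    [0, 1, 0, 0, 0],
    "passed_ict_core":     [1, 1, 0, 1, 0],
})

cba_sample = respondents[
    (respondents["computer_experience"] == 1)   # 1) some computer experience
    & (respondents["opted_out_of_cba"] == 0)    # 2) accepted the computer branch
    & (respondents["passed_ict_core"] == 1)     # 3) passed the core ICT test
]
print(len(cba_sample) / len(respondents))       # proportion covered, as in Table 2.4
```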
In all countries, log-file data are available for a majority of the overall sample, and in most of them by a large margin. The lowest proportions are found in Estonia, Italy and the Slovak Republic (around 60% or below), while the proportion exceeds 75% in Belgium (Flanders), Denmark, England / Northern Ireland (United Kingdom), Finland, Germany, the Netherlands, Norway and the United States.
Table 2.4. Proportion of respondents that took the computer-based assessment
| Country | Proportion of sample covered | Number of cases |
|---|---|---|
| Austria | 0.746 | 3 827 |
| Belgium (Flanders) | 0.755 | 4 125 |
| Denmark | 0.824 | 6 036 |
| England / Northern Ireland (United Kingdom) | 0.806 | 7 163 |
| Estonia | 0.531 | 4 053 |
| Finland | 0.815 | 4 454 |
| France | 0.692 | 4 836 |
| Germany | 0.825 | 4 509 |
| Ireland | 0.678 | 4 055 |
| Italy | 0.605 | 2 797 |
| Netherlands | 0.874 | 4 521 |
| Norway | 0.836 | 4 286 |
| Poland | 0.635 | 5 951 |
| Slovak Republic | 0.609 | 3 487 |
| Spain | 0.640 | 3 873 |
| United States | 0.810 | 4 060 |
Adaptive nature of assessment design
In PIAAC, items were grouped in booklets, with each individual test-taker answering items from only a subset of the booklets. The population answering any specific item is therefore not, strictly speaking, a representative sample of the target population. The allocation of booklets to test-takers followed several sequential steps (Figure 2.3).
Test-takers taking the computer-based version of PIAAC were initially randomly allocated to a literacy, numeracy or PSTRE module. Participants assigned to literacy or numeracy would obtain first-stage and second-stage booklets. Booklets varied in difficulty, and allocation of the booklets to participants was only conditionally random. Allocation to the first-stage booklet was determined by a set of background variables that were assumed to be correlated with proficiency, such as age and education. Allocation to the second-stage booklet was based on the same background variables and on performance on the first-stage booklet. Knowledge of the characteristics that drove the allocation of respondents to different booklets is therefore essential to any kind of analysis that aims to investigate behaviour at the item level.
After the first module, participants were allocated to a second module, with the restriction that no respondent could take a second literacy or numeracy module (it was, however, possible to be assigned a second PSTRE module). Allocation of literacy and numeracy booklets in the second module followed the same rules as in the first module.
To some degree, the representativeness of the population answering each item was traded off for more efficient measurement of proficiency at the level of the overall target population. The more successful a participant was (according to background characteristics and answers to previous items), the more likely he/she was to be assigned a booklet with more difficult items. This is not an issue for PSTRE: in that domain, the limited size of the item pool did not allow the use of an adaptive design, with the consequence that all respondents who were (randomly) allocated to PSTRE modules took exactly the same items.
A consequence of the adaptive nature of the literacy and numeracy assessment is that the subsamples of participants who answer a given literacy or numeracy item are generally not comparable. The share of respondents assigned to any given item ranges from 10% to 40% of the overall sample. As the test is adaptive, good performers tend to be assigned more difficult items. Individual averages over all assigned items are thus not particularly informative. For instance, two participants with a similar proportion of correct answers might actually end up with very different scores because they were assigned to booklets of different difficulty. Raw comparisons of statistics on different items could be misleading, because they are not computed on a similar population. As allocation is at the booklet level, analysis should focus on the population assigned a given booklet and study and compare the items it contains.
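The practical implication for analysts is to condition on booklet assignment before comparing item-level statistics, as in the minimal sketch below. The dataset layout and the booklet and item identifiers are hypothetical.

```python
import pandas as pd

# Hypothetical item-level extract: one row per respondent-item pair, with the
# booklet assigned to the respondent.
item_data = pd.DataFrame({
    "respondent_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "booklet_id":    ["L1", "L1", "L1", "L1", "L2", "L2", "L2", "L2"],
    "item_id":       ["A", "B", "A", "B", "C", "D", "C", "D"],
    "correct":       [1, 0, 1, 1, 0, 1, 1, 1],
    "time_on_task":  [42.0, 61.3, 38.9, 55.0, 71.2, 80.4, 64.8, 59.9],
})

# Compute and compare item statistics within booklets, since booklet
# allocation depends on background characteristics and earlier performance.
within_booklet = item_data.groupby(["booklet_id", "item_id"]).agg(
    p_correct=("correct", "mean"),
    median_time=("time_on_task", "median"),
    n=("respondent_id", "size"),
)
print(within_booklet)
```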
Conclusions
Log files have the potential to significantly enrich the information derived from large-scale assessments. In particular, they are likely to help deliver a more nuanced, multifaceted and, ultimately, more realistic picture of the skills possessed by respondents. They also have the potential to provide important insights that would help to design more effective training and learning programmes.
However, the research on log files is still in its infancy. PIAAC is the first large-scale assessment that has allowed a serious analysis of log files, but the PIAAC log files are, to a large extent, a by-product of the fact that PIAAC is a computer-based assessment. As a result, their analysis is often cumbersome, and the information they contain often lends itself to multiple interpretations.
Reaping the full benefits of log files will require specifically designing the assessment items, the delivery platform and the hardware and software infrastructure to capture well-defined and theory-based alternative cognitive strategies that respondents may follow when approaching assessment tasks. Similar points are made by Bunderson, Inouye and Olsen (1989[14]). The fourth generation of their agenda for computer-assisted assessment, which they call “intelligent measurement”, aims to provide explanations for individual performance and advice to learners and teachers. The huge progress made in the last few years is a clear sign that we are finally embarking upon this generation of intelligent measurement.
References
[7] Anghel, B. and P. Balart (2017), “Non-cognitive skills and individual earnings: new evidence from PIAAC”, SERIEs, Vol. 8/4, pp. 417-473, http://dx.doi.org/10.1007/s13209-017-0165-x.
[14] Bunderson, C., D. Inouye and J. Olsen (1989), “The four generations of computerized educational measurement.”, in Educational measurement, 3rd ed., American Council on Education.
[6] Goldhammer, F. et al. (2016), “Test-taking engagement in PIAAC”, OECD Education Working Papers, No. 133, OECD Publishing, Paris, http://dx.doi.org/10.1787/5jlzfl6fhxs2-en.
[8] Greiff, S., S. Wüstenberg and F. Avvisati (2015), “Computer-generated log-file analyses as a window into students’ minds? A showcase study based on the PISA 2012 assessment of problem solving”, Computers & Education, Vol. 91, pp. 92-105, http://dx.doi.org/10.1016/J.COMPEDU.2015.10.018.
[10] He, Q., F. Borgonovi and M. Paccagnella (forthcoming), “Using process data to understand adults’ problem-solving behaviours in PIAAC: Identifying generalised patterns across multiple tasks with sequence mining”, OECD Education Working Papers, OECD Publishing, Paris.
[1] Kirsch, I. and M. Lennon (2017), “PIAAC: a new design for a new era”, Large-scale Assessments in Education, Vol. 5/1, p. 11, http://dx.doi.org/10.1186/s40536-017-0046-6.
[3] Maddox, B. (2018), “Interviewer-respondent interaction and rapport in PIAAC”, Quality Assurance in Education, Vol. 26/2, pp. 182-195, http://dx.doi.org/10.1108/QAE-05-2017-0022.
[4] Maddox, B. et al. (2018), “Observing response processes with eye tracking in international large-scale assessments: Evidence from the OECD PIAAC assessment”, European Journal of Psychology of Education, http://dx.doi.org/10.1007/s10212-018-0380-2.
[12] Mohadjer, L. and B. Edwards (2018), “Paradata and dashboards in PIAAC”, Quality Assurance in Education, Vol. 26/2, pp. 263-277, http://dx.doi.org/10.1108/QAE-06-2017-0031.
[13] OECD (2016), Technical Report of the Survey of Adult Skills (PIAAC) (Second Edition), OECD Publishing, Paris, http://www.oecd.org/skills/piaac/PIAAC_Technical_Report_2nd_Edition_Full_Report.pdf.
[5] OECD (2016), The Survey of Adult Skills: Reader’s Companion, Second Edition, OECD Skills Studies, OECD Publishing, Paris, https://dx.doi.org/10.1787/9789264258075-en.
[9] OECD (2015), “The importance of navigation in online reading: Think, then click”, in Students, Computers and Learning: Making the Connection, OECD Publishing, Paris, https://dx.doi.org/10.1787/9789264239555-7-en.
[2] OECD (2013), Technical Report of the Survey of Adult Skills (PIAAC), OECD Publishing, Paris, http://www.oecd.org/skills/piaac/_Technical%20Report_17OCT13.pdf.
[11] Yamamoto, K. and M. Lennon (2018), “Understanding and detecting data fabrication in large-scale assessments”, Quality Assurance in Education, Vol. 26/2, pp. 196-212, http://dx.doi.org/10.1108/QAE-07-2017-0038.
Note
1. See Chapter 12 of the PIAAC Technical Report (OECD, 2013[2]) for an illustration of the statistical procedures used to assess the psychometric properties of assessment items.