Ryan S. Baker
University of Pennsylvania
M. Aaron Hawn
University of Pennsylvania
Seiyon Lee
University of Pennsylvania
This chapter discusses the current state of the evidence on algorithmic bias in education. After defining algorithmic bias and its possible origins, it reviews the existing international evidence about algorithmic bias in education, which has focused on gender and race, but has also involved some other demographic categories. The chapter concludes with a few recommendations, notably to ensure that privacy requirements do not prevent researchers and developers from identifying bias, so that it can be addressed.
Concern about the problem of algorithmic bias has increased in the last decade. Algorithmic bias occurs when an algorithm encodes (typically unintentionally) the biases present in society, producing predictions or inferences that are clearly discriminatory towards specific groups (Executive Office of the President, 2014[1]; O’Neil, 2016[2]; Zuiderveen Borgesius, 2018[3]). This concern has emerged across domains, from criminal justice (Angwin et al., 2016[4]) to medicine (O’Reilly-Shah et al., 2020[5]), computer vision (Klare et al., 2012[6]), and hiring (Garcia, 2016[7]).
Research has demonstrated that algorithmic bias is a problem for algorithms used in education as well. Academics have been warning about possible uneven effectiveness and lack of generalizability across populations in educational algorithms for several years (e.g. (Bridgeman, Trapani and Attali, 2009[8]; Ocumpaugh and Heffernan, 2014[9])). In education, algorithmic bias can manifest in several ways. For instance, an algorithm used in testing to identify English language proficiency may systematically underrate the proficiency of learners from some countries (Wang, Zechner and Sun, 2018[10]; Loukina, Madnani and Zechner, 2019[11]), denying them access to college admission. To give another example, an algorithm identifying if learners are at risk of failing a course may underestimate the risk of learners in specific demographic groups (Hu and Rangwala, 2020[12]; Kung and Yu, 2020[13]; Yu et al., 2020[14]), denying them access to needed support.
This concern has led to increasing interest in addressing algorithmic bias in education, both in academia and industry. A rapidly growing number of publications attests to the increasing academic interest in this topic. Active, even fervent, debate is ongoing about how best to measure algorithmic bias (Caton and Haas, 2020[15]; Mehrabi et al., 2021[16]; Verma and Rubin, 2018[17]) and which technical approaches can correct bias (Kleinberg, Mullainathan and Raghavan, 2016[18]; Loukina, Madnani and Zechner, 2019[11]; Lee and Kizilcec, 2020[19]). Within industry and the NGO sector, efforts such as the Prioritizing Racial Equity in AI Design Product Certification from Digital Promise (Digital Promise, 2022[20]) demonstrate attempts to systematise the process of reducing algorithmic bias, and several companies have actively published evidence about algorithmic bias in their tools and platforms, sometimes in cooperation with academics (Bridgeman, Trapani and Attali, 2009[8]; Bridgeman, Trapani and Attali, 2012[21]; Christie et al., 2019[22]; Zhang et al., 2022[23]). There has not yet been comparable interest in addressing algorithmic bias in education within the policy space – if anything, current directions in policy are towards adopting privacy regulations that would make it impossible to fix the problem of algorithmic bias in education, by preventing collection of the data needed to identify whether it is occurring and to apply common methods for fixing it when it does occur – see a review of this issue in (Baker and Hawn, 2022[24]).
Despite the increasing concern about algorithmic bias in education, however, work to determine its scope and address it has remained limited. While an increasing number of papers look into algorithmic bias in education, as this review will illustrate, this research is highly uneven in focus. The overwhelming majority of work on algorithmic bias in education focuses on the impacts on a small number of racial and ethnic groups and on sex (Baker and Hawn, 2022[24]), with most effort going into the demographic variables that are most conveniently available to researchers (Belitz et al., 2022[25]). Work in this area is also extremely focused on algorithms used in a single country, the United States of America (Baker and Hawn, 2022[24]). The work that exists shows clear evidence that groups already disadvantaged societally are further disadvantaged by current educational technologies, a problem that requires action. But we do not yet know the full extent of the problem.
In this chapter, we discuss the current state of the evidence on algorithmic bias in education, key obstacles to creating fair algorithms, and steps that can be taken to surpass these obstacles. We conclude with recommendations for policy makers for what they can do to help resolve this still mostly hidden societal problem.
A recent survey of 146 papers found a lack of clarity in how authors define and use the term bias, ranging from missing explanations of how exactly systems are biased to confusion about the eventual harms that bias might cause (Crawford, 2017[26]; Blodgett et al., 2020[27]). We will briefly discuss some of the issues in defining algorithmic bias before proposing the limited working definition applied in this chapter.
The term algorithmic bias has been used to describe many problems of fairness in automated systems, only some of which map onto statistical or technical definitions of bias. Some researchers define the term broadly, referring to biases as the set of possible harms throughout the machine-learning process, including any “unintended or potentially harmful” properties of the data that lead to “unwanted or societally unfavourable outcome[s]” (Suresh and Guttag, 2021[28]). Others apply algorithmic bias in a more limited way to cases where a model’s performance or behaviour differs systematically between groups (Gardner, Brooks and Baker, 2019[29]; Mitchell et al., 2020[30]). This second definition of algorithmic bias – systematic skew in performance – may or may not lead to harmful disparate impacts or discrimination, depending on how model results are applied.
Because of this potential for algorithmic bias to translate into unintended impacts, the machine-learning process should be conducted with caution, anticipating some of the very real damages that may result from bias. A widely accepted framework for such harms categorises them broadly into allocative and representational forms (Crawford, 2017[26]; Suresh and Guttag, 2021[28]).
Allocative harms result from the withholding or unfair distribution of some opportunity across groups, with examples including gender bias in assigning credit limits (Knight, 2019; Telford, 2019); racial bias in sentencing decisions (Angwin et al., 2016[4]); racial bias in identifying patients for additional health care (Obermeyer et al., 2019[31]); and – in education – bias in standardised testing and its resulting impact on high-stakes admission decisions (Dorans, 2010[32]; Santelices and Wilson, 2010[33]).
Representational harms, on the other hand, manifest as the systematic representation of some group in a negative light, or by withholding positive representation (Crawford, 2017[26]). Multiple forms of representational harm have been uncovered in recent years, with Sweeney (2013[34]) identifying varieties of denigration and stereotyping, where, for instance, the word “criminal” was more frequently returned in online ads after searches for black-identifying first names.
While there are clearly a range of ways that algorithmic bias is discussed, here we focus on algorithmic bias in situations where model performance is substantially better or worse across mutually exclusive groups (i.e. (Gardner, Brooks and Baker, 2019[29]; Mehrabi et al., 2021[16]; Mitchell et al., 2020[30])). Other forms of algorithmic bias (such as the cases mentioned above) can be highly problematic, but – as we discuss below – the published research in education thus far has focused on this performance-related version of bias. In this review, we also home in on bias in algorithms, excluding the broader design of the learning or educational systems that use these algorithms. Bias can also emerge in the design of learning activities, leading to differential impact for different populations (Finkelstein et al., 2013[35]), but that is a much broader topic, beyond what this review covers.
Though the origins of algorithmic bias are complex, and solving or mitigating it can in some cases be challenging, identifying algorithmic bias related to model performance is relatively straightforward. Doing so requires only two steps: 1) obtaining data on student identity; 2) checking model performance for students belonging to different groups.
The first step poses some challenges in terms of concerns around student privacy (Pardo and Siemens, 2014[36]) and policies designed to protect student privacy (Baker, 2022[37]). If data on student identity and membership in key demographic groups was not collected initially, it can be difficult to collect after the fact. Once the data has been split into members of different groups, and the model has been applied to those learners, the results can be checked for differences in performance. There are a range of measures that can be used (Kizilcec and Lee, 2022[38]), and ideally several will be used in concert. First, the same measures generally used to evaluate algorithm performance – AUC ROC, Kappa, F1, precision, recall, and so on – can also be used to evaluate performance for sub-groups. Second, some measures specific to algorithmic bias analysis – ABROCA (Gardner, Brooks and Baker, 2019[29]), independence, separation, sufficiency, for instance – can be applied.
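To make the second step concrete, the sketch below (our illustration, not drawn from any specific published pipeline) shows how the same model scores used for overall evaluation can be disaggregated by group, computing AUC ROC, Kappa, and F1 per group alongside an approximate ABROCA value. The column names (at_risk, risk_score, the demographic column) and the 0.5 decision threshold are illustrative assumptions.

```python
# A minimal sketch of disaggregating model performance by group. Assumes a
# DataFrame with a binary label column, a model score column, and a demographic
# group column; column names and the 0.5 threshold are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score, f1_score, roc_auc_score, roc_curve

def per_group_metrics(df, group_col, label_col="at_risk", score_col="risk_score"):
    """Report AUC ROC, Kappa, and F1 separately for each demographic group."""
    rows = []
    for group, sub in df.groupby(group_col):
        if sub[label_col].nunique() < 2:
            continue  # AUC is undefined if a group contains only one class
        preds = (sub[score_col] >= 0.5).astype(int)
        rows.append({
            "group": group,
            "n": len(sub),
            "auc": roc_auc_score(sub[label_col], sub[score_col]),
            "kappa": cohen_kappa_score(sub[label_col], preds),
            "f1": f1_score(sub[label_col], preds),
        })
    return pd.DataFrame(rows)

def abroca(df, group_col, group_a, group_b, label_col="at_risk", score_col="risk_score"):
    """Approximate ABROCA: the area between the two groups' ROC curves."""
    grid = np.linspace(0, 1, 1001)
    curves = []
    for g in (group_a, group_b):
        sub = df[df[group_col] == g]
        fpr, tpr, _ = roc_curve(sub[label_col], sub[score_col])
        curves.append(np.interp(grid, fpr, tpr))  # put both curves on a common FPR grid
    return float(np.trapz(np.abs(curves[0] - curves[1]), grid))
```

Groups with very small samples should be reported with caution, or flagged separately, since their metric estimates will be unstable.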
After examining the metrics for the differences in algorithm performance between groups, it becomes possible to analyse the expected impacts, anticipating the ways that algorithmic bias might lead to a biased response or intervention. For example, if an algorithm for predicting high school drop-out achieves 20% poorer recall (the ability to identify all individuals at risk) for members of a historically disadvantaged group, then we know that many students in the group who are at risk and need an intervention will not receive it. By contrast, if the same algorithm were to achieve 20% poorer precision (the ability to avoid selecting an individual not at risk) for members of a historically disadvantaged group, then many students in the group will receive unnecessary interventions, at best wasting their time. Checking for expected impacts also gives a sense of what would be gained by fixing a specific bias identified, and ensures that work spent to address algorithmic biases, if successful, will increase the fairness and overall benefit of using the algorithm.
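As a purely hypothetical illustration of this reasoning (neither the counts nor the recall values below come from any study), a gap in recall translates directly into at-risk students who never receive support:

```python
# Hypothetical numbers, purely to illustrate the reasoning above.
truly_at_risk = {"group_A": 500, "group_B": 500}   # assumed counts of at-risk students per group
recall = {"group_A": 0.80, "group_B": 0.60}        # assumed per-group recall (20-point gap)

missed = {g: round(truly_at_risk[g] * (1 - recall[g])) for g in truly_at_risk}
print(missed)  # {'group_A': 100, 'group_B': 200} -> twice as many missed students in group_B
```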
Researchers have considered a range of groups which have been, or could be, impacted by algorithmic bias. Many of these groups have been defined by characteristics protected by law. In the United Kingdom, for instance, the Equality Act of 2010 merged over a hundred disparate pieces of legislation into a single legal framework, unifying protections against discrimination on the basis of sex, race, ethnicity, disability, religion, age, national origin, sexual orientation, and gender identity. In the United States, the same categories are protected by a combination of different legislation, commission rulings, and court rulings, dating back to the Civil Rights Act of 1964. Similar laws afford protections in the European Union and most other countries around the world, though differing in which groups are protected and how they are defined.
While preserving fairness for these legally defined groups is critical, looking for bias only under the lamppost of nationally protected classes (categories with their own complicated histories) may leave other, under-investigated, groups open to bias and harm. Other researchers have suggested additional characteristics which may be vulnerable to algorithmic bias in education: urbanicity (Ocumpaugh and Heffernan, 2014[9]), military-connected status (Baker, Berning and Gowda, 2020[39]), or speed of learning (Doroudi and Brunskill, 2019[40]). Existing legal frameworks used to decide which classes of people merit protection from discrimination may be helpful in assessing the unknown risks that algorithmic bias poses to less studied or unidentified groups (Soundarajan and Clausen, 2018[41]). Section 4 reviews the limited education research into algorithmic bias associated with other groups.
In an effort to better catalogue the origins of algorithmic bias, researchers have described the stages of the machine-learning life-cycle alongside the kinds of bias and harm that can arise at each stage (Barocas, Hardt and Narayanan, 2019[42]; Friedman and Nissenbaum, 1996[43]; Hellström, Dignum and Bensch, 2020[44]; Mehrabi et al., 2021[16]; Silva and Kenney, 2019[45]; Suresh and Guttag, 2021[28]). While some authors collapse the machine-learning process into broader stages (e.g., measurement, model learning, and action) (Barocas, Hardt and Narayanan, 2019[42]; Kizilcec and Lee, 2022[38]), others delimit finer-grained stages, such as data collection, data preparation, model development, model evaluation, model post-processing, and model deployment (Suresh and Guttag, 2021[28]). Industry researchers, in turn, have offered additional stages more common to applied contexts, such as Task Definition, Dataset Construction, Testing Process, Deployment, and ongoing Feedback from users (Cramer et al., 2019[46]).
At each of these stages, particular forms of bias can arise. Examples include historical bias, representation bias, measurement bias, aggregation bias, evaluation bias, and deployment bias (Suresh and Guttag, 2021[28]). Historical bias is often perpetuated in education when aspirational, goal-driven algorithms are grounded in data from a historically inequitable world. The most common example, perhaps, is using student demographics as a feature to increase model performance, with the result of lowering the predicted grades for some students based on membership in a demographic group (e.g. (Wolff et al., 2013[47])). A recent survey of the role of demographics in educational data mining finds that roughly half of papers incorporating demographics into models as features risk this form of bias, using at least one demographic attribute as a predictive feature without considering demographics during model testing or validation (Paquette et al., 2020[48]).
Representational bias occurs when groups under-sampled in training data receive lower-performing predictions. Measurement bias occurs when the selected variables lack construct validity in a way that leads to unequal prediction across groups. A model predicting school violence, for example, might be biased if the labelling of which students engage in violence involves prejudice – e.g. the same violent behaviour is documented for members of one race but not for members of another (Bireda, 2002[49]).
Past the data collection stages of machine-learning, the model learning phase is susceptible to aggregation bias, when training data from distinct populations are combined, with the resulting model working less well for some – or all – groups of learners (Suresh and Guttag, 2021[28]). When detectors of student emotion, for instance, were trained on a combination of urban, rural, and suburban students, they functioned more poorly for all three groups than detectors trained on individual groups (Ocumpaugh and Heffernan, 2014[9]). In the application phases of machine-learning, evaluation bias occurs when the test sets used to evaluate a model fail to represent the populations with which the model will be applied, and deployment bias occurs when a model designed for one purpose is used for other tasks, such as applying a model designed to help teachers identify student disengagement as a tool to assign summative participation grades to students.
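A sketch of one way to probe aggregation bias empirically, in the spirit of the urbanicity example above (not the method used in the cited study), is to compare a single model trained on pooled data with models trained separately per group, evaluating both on held-out data from each group. The logistic regression model, the column names (urbanicity, engaged), and the 70/30 split below are illustrative assumptions.

```python
# Compare a pooled model with group-specific models on held-out data per group.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def pooled_vs_grouped_auc(df, feature_cols, label_col="engaged", group_col="urbanicity"):
    """Return AUC per group for a pooled model and for a model trained only on that group."""
    train, test = train_test_split(df, test_size=0.3, random_state=0, stratify=df[group_col])
    pooled = LogisticRegression(max_iter=1000).fit(train[feature_cols], train[label_col])
    results = {}
    for g in df[group_col].unique():
        tr_g, te_g = train[train[group_col] == g], test[test[group_col] == g]
        if te_g[label_col].nunique() < 2:
            continue  # AUC is undefined if the held-out group data has only one class
        own = LogisticRegression(max_iter=1000).fit(tr_g[feature_cols], tr_g[label_col])
        results[(g, "pooled")] = roc_auc_score(
            te_g[label_col], pooled.predict_proba(te_g[feature_cols])[:, 1])
        results[(g, "group-specific")] = roc_auc_score(
            te_g[label_col], own.predict_proba(te_g[feature_cols])[:, 1])
    return results
```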
A growing body of research and journalism has exposed these forms of algorithmic bias in areas such as at-risk prediction for dropping out of high school or college (Anderson, Boodhwani and Baker, 2019[50]), at-risk prediction for failing a course (Hu and Rangwala, 2020[12]; Lee and Kizilcec, 2020[19]), automated essay scoring (Bridgeman, Trapani and Attali, 2009[8]; Bridgeman, Trapani and Attali, 2012[21]), assessment of spoken language proficiency (Wang, Zechner and Sun, 2018[10]), and the detection of student emotion (Ocumpaugh and Heffernan, 2014[9]). In these cases and others, reviewed below, algorithmic bias has been documented in educational algorithms in terms of student race, ethnicity, nationality, gender, native language, urbanicity, parental educational background, socio-economic status, and whether a student has a parent in the military. This evidence has prompted increasing academic and industry research into the ways that algorithmic bias can be more effectively identified and mitigated, and its harms reduced.
Much current work addressing algorithmic bias has focused on mitigation at the model evaluation and post‑processing stages of the machine-learning process. Recent surveys present several taxonomies and definitions of fairness with related metrics (Barocas, Hardt and Narayanan, 2019[42]; Caton and Haas, 2020[15]; Kizilcec and Lee, 2022[38]; Mehrabi et al., 2021[16]; Mitchell et al., 2020[30]; Verma and Rubin, 2018[17]). While these formalised metrics make a clear contribution to clarifying algorithmic bias, their application has revealed obstacles. Specifically, technical challenges to the use of fairness metrics manifest in several “impossibility” results (Chouldechova, 2017[51]; Kleinberg, Mullainathan and Raghavan, 2016[18]; Berk et al., 2018[52]; Loukina, Madnani and Zechner, 2019[11]; Lee and Kizilcec, 2020[19]; Darlington, 1971[53]), where satisfaction of one statistical criterion of fairness makes it impossible to satisfy another. For instance, Kleinberg et al. (2016[18]) demonstrate that it is mathematically impossible under normal conditions for a risk estimate model to avoid all three of the following undesirable properties: 1) systematically skewing upwards or downwards for one demographic group; 2) assigning a higher average risk estimate to individuals not at risk for one group than the other; 3) assigning a lower average risk estimate to individuals who are at risk in one group than the other.
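For readers who prefer a formal statement, the three properties paraphrased above correspond to calibration within groups and balance for the negative and positive classes; Kleinberg, Mullainathan and Raghavan (2016[18]) show these can hold simultaneously only when prediction is perfect or base rates are equal across groups. The notation below is ours, not a quotation from the original paper.

```latex
% R = risk score, Y = true outcome (1 = at risk), A = group membership (notation assumed).
\begin{align*}
\text{Calibration within groups:}\quad & \Pr(Y = 1 \mid R = r,\ A = a) = r \quad \text{for all } r \text{ and } a\\
\text{Balance for the negative class:}\quad & \mathbb{E}[R \mid Y = 0,\ A = a] \ \text{equal across groups } a\\
\text{Balance for the positive class:}\quad & \mathbb{E}[R \mid Y = 1,\ A = a] \ \text{equal across groups } a
\end{align*}
```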
Other challenges for this pathway to mitigating bias include the difficulty in describing optimal trade-offs in fairness for domain-specific problems (Lee and Kizilcec, 2020[19]; Makhlouf, Zhioua and Palamidessi, 2021[54]; Suresh and Guttag, 2021[28]), as well as the sociotechnical critique that an overemphasis on seemingly objective, statistical criteria for fairness may provide an excuse for developers and users of algorithms to avoid grappling with the full range of potential bias and harms from employing algorithms for high-stakes decisions (Green, 2020[55]; Green and Hu, 2018[56]; Green and Viljoen, 2020[57]). In order to address the fuller picture of algorithmic bias, it is critical to identify and mitigate bias, not only during the later stages of the process, but also during the earlier stages of data collection and data preparation.
Attempts to address algorithmic bias solely by adjusting algorithms may be ineffective if we have not collected the right data. Specifically, representational and measurement bias (Suresh and Guttag, 2021[28]) can prevent methods further down the pipeline from resolving, or even detecting, bias.
As a key example, if we collect training data only from suburban upper middle-class children, we should not expect our model to work for urban lower-income students. More broadly, if we do not collect data from the right sample of learners, we risk representational bias and cannot expect our models to work for all learners.
Measurement bias is another significant challenge that improved metrics or algorithms cannot overcome on their own. While measurement bias can occur in both predictor variables and training labels (Suresh and Guttag, 2021[28]), the most concerning cases involve the latter. If, for instance, Black students behave similarly to students from other groups, but are still more likely to be labelled in a dataset as engaging in school violence, then it is difficult to determine whether an algorithm works equally well for both groups, or to be at all confident that the algorithm’s functioning is not biased. Surprisingly, this bias in training labels may even come from students themselves if the label depends on students’ responses and can be impacted by confidence, cultural interpretation, or stereotype threat (Tempelaar, Rienties and Nguyen, 2020[58]). In these cases, finding an alternate variable to predict – one not as impacted by bias – may be the best alternative. Other cases of measurement bias may be easier to mitigate, such as when human coders, impacted by their own bias (Kraiger and Ford, 1985[59]; Okur et al., 2018[60]), label some aspect of previously collected data. In the situation where predictor variables are biased, they may be substituting for other variables that would explicitly define group membership, in which case it may be best to discard the biased predictors from consideration.
Ultimately, the best path to addressing both representational and measurement bias is to collect better data – data that includes sufficient proportions of relevant groups, and in which key variables are not themselves biased (Cramer et al., 2019[46]; Holstein et al., 2019[61]). Completing this task, however, depends on knowing what groups are critical to represent in the data sets used to develop models, the focus of our next section.
The great majority of research has focused on a limited number of groups within the diverse student population, concentrating on variables involving race and ethnicity, nationality, and gender (Baker and Hawn, 2022[24]). These, unsurprisingly, are the demographic categories most commonly collected by or made available to researchers, whether by convention or for convenience, especially as most research has been conducted in the United States.
Within these broad categories, there is some variance in how the variables are considered. At times, specific racial groups are considered separately; at other times, students are grouped according to whether or not they are an URM (Under-Represented Minority). Although a minority in most studies, Asian students are typically treated as non-URM in US educational research. Even when racial groups are separated in analysis, heterogeneity within these groups is typically ignored (e.g. people whose ancestors have lived in their current country for generations versus recent immigrants; individuals with different national origins and very different histories and cultures; (Baker et al., 2019[62])).
In this section, we will examine the evidence on algorithmic bias in education by addressing which groups of students have been systematically impacted, in terms of these most common categories. The overview will be organised into the different locations in the world in which each study was conducted, in order to illustrate the uneven amount of research on algorithmic bias in education that has occurred in different regions. We will discuss the implications of that unevenness, and how to address it, later in this chapter.
The majority of research on algorithmic bias in education thus far has been conducted in the United States. The strong interest in documenting and addressing algorithmic bias in the United States maps to broader societal concern in that country about algorithmic bias (Corbett-Davies and Goel, 2018[63]) and discrimination in general (Barocas, Hardt and Narayanan, 2019[42]; O’Neil, 2016[2]). It may also reflect the relatively high availability of educational data for research purposes in the United States. Even most of the research on how learners of different nationalities are impacted by algorithmic bias has been conducted in the United States (Bridgeman, Trapani and Attali, 2009[8]; Bridgeman, Trapani and Attali, 2012[21]; Li et al., 2021[64]; Ogan et al., 2015[65]; Wang, Zechner and Sun, 2018[10]).
Within the United States, a considerable amount of research has investigated the impact of algorithmic bias in education on different racial groups. A recent review by Baker and Hawn (2022[24]) identifies ten cases where this was investigated, across algorithms for purposes ranging from predicting dropout (Anderson, Boodhwani and Baker, 2019[50]; Christie et al., 2019[22]; Kai et al., 2017[66]; Yu, Lee and Kizilcec, 2021[67]) and predicting course failure (Lee and Kizilcec, 2020[19]; Yu et al., 2020[14]) to automated essay scoring (Bridgeman, Trapani and Attali, 2009[8]; Bridgeman, Trapani and Attali, 2012[21]; Ramineni and Williamson, 2018[68]). Typically, across studies, algorithms were less effective for Black and Hispanic/Latino students in general (Anderson, Boodhwani and Baker, 2019[50]; Bridgeman, Trapani and Attali, 2012[21]; Lee and Kizilcec, 2020[19]; Ramineni and Williamson, 2018[68]; Yu, Lee and Kizilcec, 2021[67]), and often also produced different profiles of false positive and false negative results for students in different racial groups (Anderson, Boodhwani and Baker, 2019[50]). More recently, the Penn Center for Learning Analytics (PCLA) wiki (Penn Center for Learning Analytics, n.d.[69]) has identified an additional six studies on this topic, published since Baker and Hawn (2022[24]) was finalised. Curiously, many of these more recent studies report smaller effects than earlier studies, which may suggest either that there were some “file drawer” problems with earlier work (that is, results with small effects went unpublished in the past), or that a broader range of possible contexts is now being investigated.
Though there has been considerable attention to race in general, much less research has focused on indigenous learners, often due to issues of sample size (Anderson, Boodhwani and Baker, 2019[50]), though notable counter-examples exist (e.g. (Christie et al., 2019[22])).
Within the United States, considerable research has also investigated the impact of algorithmic bias in education in terms of learner gender. Baker and Hawn (2022[24]) identified nine cases, and three additional papers have been identified since then by the PCLA wiki. Across these papers, gender effects were highly inconsistent, with significant biases against female students in some cases (Gardner, Brooks and Baker, 2019[29]; Yu et al., 2020[14]) and significant biases against male students in others (Hu and Rangwala, 2020[12]; Lee and Kizilcec, 2020[19]; Kai et al., 2017[66]).
While race has been used as a predictor variable in Europe (Wolff et al., 2013[47]), it has not been the subject of systematic investigation into algorithmic bias in education. Research into how algorithmic bias impacts learners from different nationalities has also not been carried out in Europe, to the best of our knowledge, although Bridgeman and colleagues (2009[8]; 2012[21]) investigated algorithmic bias in automated essay scoring on learners from around the world, including several European countries, finding that learners from European countries were less impacted than learners in Asia. However, Wang, Zechner and Sun (2018[10]) found substantial inaccuracies in speech evaluation for learners from Germany, and Li et al. (2021[64]) found that academic achievement prediction was less effective for learners from Moldova than learners in wealthier countries.
However, research on algorithmic bias in terms of gender has occurred in Europe: Riazy et al. (2020[70]) investigated the impacts of gender on course outcome prediction and Rzepka et al. (2022[71]) investigated the impacts of gender on prediction conducted during a spelling learning activity. Only small effects were found.
Overall, then, there is not yet evidence for major impacts of algorithmic bias in Europe, in terms of race, nationality, or gender, but there also have been few studies, and these studies do not cover the range of applications which research in the United States has investigated.
Multiple studies of algorithmic bias in terms of nationality have been conducted on learners from around the world, though primarily involving researchers based in the United States. Baker and Hawn (2022[24]) identified four such studies. These studies have involved a range of applications, from predicting academic achievement (Li et al., 2021[64]), to automated essay scoring (Bridgeman, Trapani and Attali, 2009[8]; Bridgeman, Trapani and Attali, 2012[21]), to speech evaluation (Wang, Zechner and Sun, 2018[10]), to models of help-seeking (Ogan et al., 2015[65]). The studies have documented biases impacting learners from China, Korea, India, Vietnam, the Philippines, Costa Rica, and individuals living in countries where the primary language is Arabic. These studies have been fairly different from each other (except for the two Bridgeman et al. studies) and have documented a range of patterns, clearly indicating that considerably more research is needed.
Three studies on algorithmic bias in education in terms of gender have been conducted outside of the United States and Europe. Verdugo and colleagues (2022[72]) found bias in algorithms predicting university dropout in Chile, negatively impacting female students. Sha and colleagues (2021[73]; 2022[74]) investigated algorithms for four different applications in Australia, finding substantial gender biases, but not always in the same direction.
Automated essay scoring used in a high-stakes examination (the Test of English as a Foreign Language) was found to systematically rate essays differently than human graders. Specifically, the algorithm rated native speakers of Arabic, Hindi, and Spanish lower than students from other countries, relative to human graders. The algorithm had been used to replace one of two human coders. In response to this evidence, the test developer instituted a new practice: First a single human grader and the machine rate the essay. If the human and machine give substantially different ratings, a second human rates the essay. If the two humans agree, the automated score is discarded (Bridgeman, Trapani and Attali, 2012[21]).
A model predicting first-year dropout from a Chilean university was found to perform more poorly for female students and students who attended private high schools. A range of fairness techniques were applied, improving the equity of model performance, and in turn the equity of the provision of dropout supports to students (Vasquez Verdugo et al., 2022[72]).
Models detecting student affect (whether a student was bored, frustrated, confused, or engaged) within an online learning platform were found to perform more poorly for students in rural communities than for students in urban or suburban communities. By creating a model tailored to rural students, model performance was improved for this group of students. The models are being used to conduct learning engineering research on how to improve the design of learning content; reducing inequities in the models reduces the risk that incorrect design decisions are made (Ocumpaugh and Heffernan, 2014[9]).
While the majority of research on algorithmic bias in education has investigated race and ethnicity, nationality and gender, other categories of identity have also been investigated. In this section, we will examine the evidence for algorithmic bias in education impacting learners in these categories. Across studies, researchers have investigated algorithmic bias in terms of learners’ urbanicity (city or rural area), socio-economic status, type of school attended (public or private), native language, disabilities, parental educational background, and military-connected status. These variables have generally not been investigated in sufficient detail to draw solid conclusions. As with race and ethnicity, nationality and gender, the majority of studies occurred in the United States (15 studies), compared to three in Europe and two in the rest of the world.
According to the PCLA wiki, four studies have thus far investigated native language in terms of algorithmic bias in education: two in the United States (Naismith et al., 2019[75]; Loukina, Madnani and Zechner, 2019[11]), one in Europe (Rzepka et al., 2022[71]), and one in Australia (Sha et al., 2021[73]). Three of four studies found evidence for algorithmic biases negatively impacting non-native speakers, but one (Rzepka et al., 2022[71]) found slightly better model accuracy for non-native speakers. All four studies involved educational tasks where the use of language was central (essay-writing, speaking, spelling, and posting to discussion forums).
The PCLA wiki identified five studies on parental educational background, four in the United States and one in Europe. All five show differences in model performance and prediction in terms of this variable, but the ways that bias manifests are inconsistent across studies, with some studies finding better performance for students with more educated parents and others finding better performance for students with less educated parents.
The PCLA wiki also identifies five studies on socio-economic status, all five conducted in the United States. Four of the five papers (predicting dropout, grade point average [GPA], and learning) find that algorithms are less effective for students with poorer socio-economic backgrounds, but the fifth (on automated essay scoring) finds no evidence of difference. A sixth study conducted in Chile, on whether students are attending public or private schools (highly associated with socio-economic status), finds that models predicting university dropout are more accurate for learners from public schools.
There has been relatively little research on how algorithmic bias impacts learners with disabilities. In the United States, Baker and Hawn (2022[24]) document a single study, by Loukina and Buzick (2017[76]), which found that a system for assessing proficiency in spoken English was less accurate for students who were identified by test administrators to have a speech impairment.
In Europe, Baker and Hawn (2022[24]) also document a single study, by Riazy et al. (2020[70]), who found that a system for predicting course outcomes had systematic inaccuracies for learners with self-declared disabilities. These two studies clearly do not cover the range of disabilities that may lead to algorithmic bias in education, and no studies have been documented outside the United States and Europe.
According to the PCLA wiki, two studies have investigated algorithmic bias in terms of student urbanicity (urban versus rural), both in the United States. Ocumpaugh and colleagues (2014[9]) find that models predicting student emotion are less effective if they are developed using data from urban learners and then tested on data from rural learners, compared to when they are tested on data from unseen urban learners. The same is true if the model is developed using data from rural learners and tested on data from urban learners. However, Samei and colleagues (2015[77]) find that the performance of models of classroom discourse does not differ between urban and rural settings. More research is needed to determine which types of prediction are impacted when going between urban and rural settings.
Finally, one study conducted in the United States finds that models predicting graduation and standardised examination scores are less accurate for students with family members in the military (Baker, Berning and Gowda, 2020[39]). Across these studies, a variety of variables have been investigated, and on the whole the evidence indicates that algorithmic bias has impacts that go beyond race/ethnicity, gender, and nationality. A range of variables have not yet been investigated at all, including religion, age, children of migrant workers, specific disabilities beyond speech impairment, transgender status, and sexual orientation.
The previous sections of this chapter outline what is currently known to the field about how algorithmic bias is manifesting in education. Our review indicates that there is clear evidence that algorithmic bias is manifesting in many ways, but also indicates how limited our current knowledge is. Many potential areas of algorithmic bias are documented in just a single article, and many potential areas where algorithmic bias could be occurring have not yet been studied at all. There also is no clear sense as to the magnitude of the problem for different use cases and groups of students.
As Baker and Hawn (2022[24]) note, we are at the very beginning of a progression in fixing the problem of algorithmic bias. At first, there is unknown bias – a problem exists, but developers and researchers do not know that the problem exists. Perhaps it is known that there is a problem in general, but not exactly who is being affected or exactly how. Descriptive research can move a specific educational algorithm from unknown bias to known bias.
In known bias, it is now known that there is a problem, where it is occurring, and who is impacted. Our knowledge may still be incomplete, but it is sufficient to potentially take action. Once we know what the bias is, it becomes possible to move towards fairness. There is increasing understanding of the steps that can be taken to increase the fairness of algorithms, within the broader machine-learning community (Mehrabi et al., 2021[16]; Narayanan, 2018[78]). Although this work remains far from perfect, and debate is ongoing about the best methods (Kleinberg, Mullainathan and Raghavan, 2016[18]; Berk et al., 2018[52]), there is enough knowhow about addressing algorithmic bias that once we know a bias exists, we can take steps to fix it. Finally, increasing algorithmic fairness can be a step towards creating a world with equity, with equal opportunity for all learners – see (Holstein and Doroudi, 2022[79]).
Working towards equity necessarily implies determining where current technologies and pedagogies are today most unfair, and working to fix these problems first. Many of these places of greatest unfairness involve inequities that are already widely known. But some may also be less well-known to the educational and policy communities. We may miss key inequities due to our own biases and assumptions. In other words, more research is needed, because the world of education today is mostly in a state of unknown bias.
There are currently many obstacles to achieving fairness and equity with educational technology. The biggest, as the previous section notes, is how much we do not know about the biases that exist in the world, in general but also between countries. As Baker and Hawn (2022[24]) note, unknown biases can be split into two categories. The first is when we do not know that algorithmic bias exists for a specific group of learners. The second is when we know that there is bias impacting a specific group, but we do not know how this bias manifests. Both types of bias appear to exist in our current understanding of algorithmic bias in education. The research thus far is limited, both in terms of which groups have been studied and how thoroughly we have studied algorithmic bias in education for the groups it is known to impact. Even for relatively thoroughly studied problems such as racism and sexism, we do not know all the ways that they impact the effectiveness of educational algorithms. The biases of educational algorithms for indigenous populations have been much less studied than the biases for other groups, for instance; transgender learners have been much less studied; and the experience of racial minorities with educational algorithms has been much more thoroughly studied in the United States than in other countries.
One of the key barriers to conducting this type of research is the lack of high-quality and easily accessible educational data on group membership, identity, perception, or status. As Belitz et al. (2022[25]) note, even when identity data is collected, it typically covers only a small number of categories. And most studies do not obtain even this limited level of identity or group membership data.
Barriers to collecting data on group membership arise for many reasons, including convenience, regulatory barriers, and concerns around student privacy. Often, compliance organisations such as privacy officers and institutional review boards consider demographic data to be high-risk and create incentives (not always consciously) to avoid collecting this type of data. If – to give an example commonly seen in the United States – a researcher is required to collect parental consent when they collect demographic data, but is not required to collect any consent at all if they avoid demographic data, then there is a strong incentive to avoid collecting demographic data, and in turn to ignore issues of algorithmic bias (and other forms of bias as well). There are current efforts in many countries to create stricter data privacy laws for education – laws that have the goal of protecting children, but as currently designed may make it impossible to identify or address algorithmic bias (see discussion in Baker, in press).
Another key disincentive to investigating algorithmic bias is the risk to any commercial organisation of being open about the flaws in its product. Any openness about a product’s flaws – or even about a product’s design – can be an opportunity for competitors. An environment where commercial companies can choose whether or not to analyse their product’s flaws, and where there is significant competition, is an environment where companies have a strong reason not to look into (and fix) biases in their products. Being too open about bias may not just lead to sales competition – it may lead to criticism by journalists, community members, and academics. At the extreme, an organisation that is public about bias in its product risks lawsuits or action by regulators.
While there is currently some incentive for learning systems to demonstrate educational effectiveness – see platforms such as the What Works Clearinghouse and Evidence for ESSA (WWC, 2012[80]; Slavin, 2020[81]) – these initiatives currently treat a curriculum as either effective or ineffective overall, rather than as effective or ineffective for specific groups of learners.
Another important obstacle to addressing algorithmic bias in education is the lack of toolkits for assessing and fixing algorithmic bias that are specifically tailored to education. Educational data is known to differ from other types of data commonly used in machine-learning, possessing a complex multi-level nature (actions within students within classrooms within teachers within schools within districts, and identity factors that are confounded with those levels) that must be accounted for in order for an analysis to be valid (O’Connell and McCoach, 2008[82]). While existing toolkits are applicable up to a point, more work is needed to make it easy for them to be adapted and used in education – see (Kizilcec and Lee, 2022[38]; Holstein and Doroudi, 2022[79]). Existing toolkits for identifying algorithmic bias offer generally useful metrics (discussed above), but often ignore the unique aspects of educational data, making them less relevant: they are designed to treat data points as interchangeable, and are therefore not compatible with educational algorithms that explicitly consider the multi-level nature of educational data. The lack of toolkits currently increases the cost of testing for and resolving algorithmic bias for organisations without expertise in this area.
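As one small illustration of what an education-specific toolkit would need to handle, bias audits of the kind discussed above should respect the multi-level structure of the data, for instance by cross-validating at the school level so that students from the same school never appear in both training and test folds. The sketch below is our own illustration under assumed column names (school_id, demographic_group, at_risk), not an existing toolkit.

```python
# Per-demographic-group AUC estimated with school-level cross-validation.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

def school_level_group_auc(df, feature_cols, label_col="at_risk",
                           school_col="school_id", demo_col="demographic_group"):
    """Average per-group AUC across folds that keep whole schools together."""
    aucs = {g: [] for g in df[demo_col].unique()}
    for train_idx, test_idx in GroupKFold(n_splits=5).split(df, groups=df[school_col]):
        train, test = df.iloc[train_idx], df.iloc[test_idx]
        model = LogisticRegression(max_iter=1000).fit(train[feature_cols], train[label_col])
        for g, sub in test.groupby(demo_col):
            if sub[label_col].nunique() < 2:
                continue  # AUC is undefined if this fold's group data has only one class
            aucs[g].append(roc_auc_score(sub[label_col],
                                         model.predict_proba(sub[feature_cols])[:, 1]))
    return {g: sum(v) / len(v) for g, v in aucs.items() if v}
```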
All in all, then, while the importance of addressing algorithmic bias in education is clear, there are manifestly several challenges and obstacles that, without concerted effort, will slow progress in this area. Fortunately, there are several steps which policy makers can take.
In this section, we present six recommendations for policy makers that can help to address algorithmic bias, resolving or working around the challenges currently present in the environment, and building on existing work by academics, NGOs, and industry (Box 9.2).
1. Consider algorithmic bias when considering privacy policy and mandates so that privacy requirements do not prevent researchers from identifying and addressing algorithmic bias.
2. Require algorithmic bias analyses, including requiring necessary data collection.
3. Guide algorithmic bias analysis based on local context and local equity concerns.
4. Fund research into unknown biases around the world.
5. Fund development of toolkits for algorithmic bias in education.
6. Re-design effectiveness clearinghouses to consider learner diversity.
The first recommendation is simply to not make it impossible to address algorithmic bias. As mentioned above, many countries are currently considering legislation around data privacy in education that would make it impossible to collect (or retain for sufficient time to conduct analysis) the data on student identity, interaction, and outcomes which is necessary to identify and address algorithmic bias. If educational technology providers cannot collect or cannot use data on learner identity, they cannot determine who is negatively impacted by algorithmic bias, and almost certainly cannot produce algorithms less impacted by algorithmic bias. If educational technology providers cannot retain data on student usage long enough to measure relevant outcomes, they cannot know if students in different groups are being differentially impacted. Student privacy is important but so is fairness.
Ideally, rather than create policy preventing the collection of data necessary to check for and address algorithmic bias, policy makers would require the collection of data needed for these purposes, under best-practices safeguards. Ideally, this data collection mandate would be combined with some degree of protection or release from liability for companies that fully followed required security practices (especially in today’s environment, where maintaining perfect data security is challenging even when following best practices).
This would be the first step towards requiring educational algorithms used beyond a certain scale (perhaps 1 000 active users) to explicitly document and publish checks for algorithmic bias, at minimum providing evidence on whether the models have substantial difference in their quality of performance for different populations (if present in their user base). The requirement to publicly release evidence on algorithmic bias would probably be sufficient to create strong pressure to fix biases found in the algorithms.
One current challenge faced by organisations making good-faith attempts to collect data to investigate algorithmic bias is deciding which identity variables to collect data on (Belitz et al., 2022[25]). Policy makers can assist with this. While census categories provide one source of possible variables, census categories simultaneously miss key categories shown to be associated with algorithmic bias (as discussed above) and can include groups not present in a specific data set due to uneven distribution of that group across the general population. Policy standardising a minimum set of identity markers to collect and report on in each policy region would provide consistency and comparability between different reports of algorithmic bias. It would also help to ensure that groups currently most disadvantaged in each region are supported rather than further disadvantaged by educational algorithms. Finally, standardising on a minimum set of identity categories would also prevent organisations from reporting only the groups for whom their tool is unbiased. The actual process of selecting which categories are relevant within a specific policy environment should not be arbitrary; ideally, selection would be made by a representative combination of stakeholders in the local community, including researchers who can evaluate the data available.
As the discussion above illustrates, it is difficult to fix a problem if we do not know if it is there; it is difficult to fix unknown biases. Thus far, the super-majority of research on algorithmic bias has involved race/ethnicity and gender in the United States – and even in the United States, key racial or ethnic groups more represented in specific geographical areas (such as Native Americans, and members of the Portuguese and Brazilian diasporas in New England) have been under-studied, as have other categories connected to algorithmic bias.
Outside the United States, there has been much less research into algorithmic bias. There is a clear need for further research on algorithmic bias in education in other OECD countries, investigating which groups are impacted and how they are impacted. Without this research, developers around the world will be limited to addressing the inequity problems known to exist in the United States, which differ from the problems in other countries (Wimmer, 2017[83]), or will be guided by intuition rather than data in choosing which problems to address.
Policy makers can address this current limitation by creating grant-making programs which make funds available for research into who is impacted by algorithmic bias in education in their region.
As discussed above, the current lack of good toolkits for identifying and addressing algorithmic bias in education raises the cost of doing so; an organisation must either hire an expert in this area or develop their own expertise over time. The development of high-quality, usability-engineered toolkits supporting the use of best practices will increase the feasibility of conducting this type of analysis and improvement, for a wide range of educational technology providers and researchers. Policy makers can address this limitation by creating grant-making programs which make funds available for the development of toolkits of this nature. Even one such toolkit would make a substantial difference to the field.
Currently, effectiveness clearinghouses such as the What Works Clearinghouse and Evidence for ESSA – created (respectively) directly by a governmental agency and with foundation grant funding – summarise the evidence for the effectiveness of different curricula, including computer-delivered curricula. However, they treat effectiveness as a single dimension – either a curriculum is effective for all or for none. Curricula and educational technologies may, however, be effective for specific groups of learners and not for others (Cheung and Slavin, 2013[84]). An educational technology that is algorithmically biased is unlikely to be equally effective for all learners; if its algorithms function less effectively for specific groups of learners, the technology is very likely to function less effectively at supporting those learners in achieving better outcomes. As new clearinghouses are developed, or existing clearinghouses seek future funding, it may be possible for policy makers to influence their directions towards considering differences between groups of learners in effectiveness. Doing so will provide greater incentive for educational technology providers (and curriculum developers in general) to document (and attend to) the effectiveness of their products for the full diversity of learners.
In this chapter, we have reviewed the current evidence for algorithmic bias in education: who is impacted, how they are impacted, and the (large) gaps in the field’s understanding of this area. We have also reviewed some of the factors slowing progress in this area, and concluded with recommendations for what policy makers can do to support the field in understanding and reducing algorithmic bias in education.
The potential of algorithms for education is high. The best adaptive learning systems and at-risk prediction systems have made large positive impacts on student outcomes (Ma et al., 2014[85]; VanLehn, 2011[86]; Millron, Malcom and Kil, 2014[87]). However, this potential cannot be fully reached if algorithms replicate or even magnify the biases occurring in societies around the world. It is only by researching and resolving algorithmic bias that we can develop educational technologies that reach their full potential, and in turn support every student in achieving their own full potential.
Policy makers around the world are at a key moment in the progress towards resolving algorithmic bias and developing educational technologies that are fair and equitable for all learners. There is increased understanding that algorithmic bias exists, including in education. There are the beginnings of progress in understanding who is impacted and how. However, this progress is limited in scope – specific dimensions of student identity (particularly race/ethnicity and gender) have been much more heavily studied than other dimensions which also appear to be affected by algorithmic bias. Furthermore, research on algorithmic bias in education has been heavily concentrated in the United States, creating a lack of clarity on who is being negatively impacted in the rest of the world, and how to support them. Finally, this progress is put at risk by the possibility of imbalanced privacy laws, which may prevent future work to investigate and fix algorithmic biases and, ultimately, enhance equity.
[50] Anderson, H., A. Boodhwani and R. Baker (2019), Assessing the Fairness of Graduation Predictions.
[4] Angwin, J., J. Larson, S. Mattu and L. Kirchner (2016), Machine Bias, Auerbach Publications.
[37] Baker, R. (2022), The Current Trade-off Between Privacy and Equity in Educational Technology, Rowman & Littlefield.
[39] Baker, R., A. Berning and S. Gowda (2020), Differentiating Military-Connected and Non-Military-Connected Students: Predictors of Graduation and SAT Score.
[24] Baker, R. and A. Hawn (2022), “Algorithmic Bias in Education”, International Journal of Artificial Intelligence in Education, Vol. 32/4, pp. 1052-1092.
[62] Baker, R., E. Walker, A. Ogan and M. Madaio (2019), “Culture in Computer-Based Learning Systems: Challenges and Opportunities”, Computer-Based Learning in Context, Vol. 1/1, pp. 1-13, https://doi.org/10.35542/osf.io/ad39g.
[42] Barocas, S., M. Hardt and A. Narayanan (2019), Fairness and Machine Learning. Limitations and opportunities, https://fairmlbook.org/.
[25] Belitz, C., J. Ocumpaugh, S. Ritter, R. Baker, S. Fancsali and N. Bosch (2022), “Constructing categories: Moving beyond protected classes in algorithmic fairness”, Journal of the Association for Information Science and Technology, pp. 1-6, https://doi.org/10.1002/asi.24643.
[52] Berk, R., H. Heidari, S. Jabbari, M. Kearns and A. Roth (2018), “Fairness in Criminal Justice Risk Assessments: The State of the Art”, Sociological Methods & Research, Vol. 50/1, pp. 3-44, https://doi.org/10.1177/0049124118782533.
[49] Bireda, M. (2002), Eliminating Racial Profiling in School Discipline: Cultures in Conflict, Rowman & Littlefield Education.
[27] Blodgett, S., S. Barocas, H. Daumé III and H. Wallach (2020), “Language (Technology) is Power: A Critical Survey of “Bias” in NLP”, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5454-5476, https://doi.org/10.18653/v1/2020.acl-main.485.
[21] Bridgeman, B., C. Trapani and Y. Attali (2012), “Comparison of Human and Machine Scoring of Essays: Differences by Gender, Ethnicity, and Country”, Applied Measurement in Education, Vol. 25/1, pp. 27-40, https://doi.org/10.1080/08957347.2012.635502.
[8] Bridgeman, B., C. Trapani and Y. Attali (2009), Considering Fairness and Validity in Evaluating Automated Scoring.
[15] Caton, S. and C. Haas (2020), “Fairness in Machine Learning: A Survey”, arXiv preprint arXiv:2010.04053, https://doi.org/10.48550/arXiv.2010.04053.
[84] Cheung, A. and R. Slavin (2013), “The effectiveness of educational technology applications for enhancing mathematics achievement in K-12 classrooms: A meta-analysis”, Educational Research Review, Vol. 9, pp. 88-113, https://doi.org/10.1016/j.edurev.2013.01.001.
[51] Chouldechova, A. (2017), “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments”, Big Data, Vol. 5/2, pp. 153-163, https://doi.org/10.1089/big.2016.0047.
[22] Christie, S.; D. Jarratt; L. Olson and T. Taijala (2019), “Machine-Learned School Dropout Early Warning at Scale”, International Educational Data Mining Society (EDM 2019), pp. 726-731.
[63] Corbett-Davies, S. and S. Goel (2018), “The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning”, arXiv, https://doi.org/10.48550/arXiv.1808.00023.
[46] Cramer, H. et al. (2019), Translation Tutorial: Challenges of incorporating algorithmic fairness.
[26] Crawford, K. (2017), The Trouble with Bias - NIPS 2017 Keynote, https://www.youtube.com/watch?v=fMym_BKWQzk.
[53] Darlington, R. (1971), “Another look at ‘cultural fairness’”, Journal of Educational Measurement, Vol. 8/2, pp. 71–82, https://doi.org/10.1111/j.1745-3984.1971.tb00908.x.
[20] Digital Promise (2022), Prioritizing Racial Equity in AI Design, https://productcertifications.microcredentials.digitalpromise.org/explore/1-prioritizing-racial-equity-in-ai-design-4 (accessed on 23 December 2022).
[32] Dorans, N. (2010), “Misrepresentations in Unfair Treatment by Santelices and Wilson”, Harvard Educational Review, Vol. 80/3, pp. 404-413, https://doi.org/10.17763/haer.80.3.l253473353686748.
[40] Doroudi, S. and E. Brunskill (2019), Fairer but Not Fair Enough: On the Equitability of Knowledge Tracing, https://doi.org/10.1145/3303772.3303838.
[1] Executive Office of the President (2014), Big Data: Seizing Opportunities, Preserving Values.
[35] Finkelstein, S.; E. Yarzebinski; C. Vaughn; A. Ogan and J. Cassell (2013), The Effects of Culturally Congruent Educational Technologies on Student Achievement.
[43] Friedman, B. and H. Nissenbaum (1996), “Bias in computer systems”, ACM Transactions on Information Systems, Vol. 14/3, pp. 330–347, https://doi.org/10.1145/230538.230561.
[7] Garcia, M. (2016), “Racist in the Machine: The Disturbing Implications of Algorithmic Bias”, World Policy Journal, Vol. 33/4, pp. 111-117, https://doi.org/10.1215/07402775-3813015.
[29] Gardner, J., C. Brooks and R. Baker (2019), Evaluating the Fairness of Predictive Student Models Through Slicing Analysis, Association for Computing Machinery, https://doi.org/10.1145/3303772.3303791.
[55] Green, B. (2020), The false promise of risk assessments: epistemic reform and the limits of fairness, https://doi.org/10.1145/3351095.3372869.
[56] Green, B. and L. Hu (2018), The Myth in the Methodology: Towards a Recontextualization of Fairness in Machine Learning.
[57] Green, B. and S. Viljoen (2020), Algorithmic realism: expanding the boundaries of algorithmic thought, https://doi.org/10.1145/3351095.3372840.
[44] Hellström, T., V. Dignum and S. Bensch (2020), Bias in Machine Learning -- What is it Good for?, https://doi.org/10.48550/arXiv.2004.00686.
[79] Holstein, K. and S. Doroudi (2022), Equity and Artificial Intelligence in education, Routledge.
[61] Holstein, K.; J. Wortman Vaughan; H. Daumé III; M. Dudik and H. Wallach (2019), Improving fairness in machine learning systems: What do industry practitioners need?
[12] Hu, Q. and H. Rangwala (2020), Towards Fair Educational Data Mining: A Case Study on Detecting At-risk Students, https://files.eric.ed.gov/fulltext/ED608050.pdf.
[66] Kai, S. et al. (2017), Predicting Student Retention from Behavior in an Online Orientation Course.
[38] Kizilcec, R. and H. Lee (2022), Algorithmic fairness in education, Routledge.
[6] Klare, B., M. Burge; J. Klontz; R. Vorder Bruegge and A. Jain (2012), Face Recognition Performance: Role of Demographic Information, IEEE, https://doi.org/10.1109/TIFS.2012.2214212.
[18] Kleinberg, J., S. Mullainathan and M. Raghavan (2016), Inherent Trade-Offs in the Fair Determination of Risk Scores, https://doi.org/10.48550/arXiv.1609.05807.
[59] Kraiger, K. and J. Ford (1985), “A Meta-Analysis of Ratee Race Effects in Performance Ratings”, Journal of Applied Psychology, Vol. 70/1, pp. 56-65.
[13] Kung, C. and R. Yu (2020), Interpretable Models Do Not Compromise Accuracy or Fairness in Predicting College Success, Association for Computing Machinery, https://doi.org/10.1145/3386527.3406755.
[19] Lee, H. and R. Kizilcec (2020), Evaluation of Fairness Trade-offs in Predicting Student Success, https://doi.org/10.48550/arXiv.2007.00088.
[64] Li, X.; D. Song; M. Han; Y. Zhang and R. Kizilcec (2021), “On the limits of algorithmic prediction across the globe”, arXiv preprint arXiv:2103.15212, https://doi.org/10.48550/arXiv.2103.15212.
[76] Loukina, A. and H. Buzick (2017), “Use of Automated Scoring in Spoken Language Assessments for Test Takers With Speech Impairments: Automated Scoring With Speech Impairments”, ETS Research Report Series, Vol. 3, https://doi.org/10.1002/ets2.12170.
[11] Loukina, A., N. Madnani and K. Zechner (2019), The many dimensions of algorithmic fairness in educational applications, Association for Computational Linguistics, https://doi.org/10.18653/v1/W19-4401.
[54] Makhlouf, K., S. Zhioua and C. Palamidessi (2021), “On the Applicability of Machine Learning Fairness Notions”, ACM SIGKDD Explorations Newsletter, Vol. 23, pp. 14-23.
[85] Ma, W.; O. Adesope; J. Nesbit and Q. Liu (2014), “Intelligent Tutoring Systems and Learning Outcomes: A Meta-Analysis”, Journal of Educational Psychology, Vol. 106/4, pp. 901-918.
[16] Mehrabi, N.; F. Morstatter; N. Saxena; K. Lerman and A. Galstyan (2021), “A Survey on Bias and Fairness in Machine Learning”, ACM Computing Surveys, Vol. 54/6, pp. 1-35, https://doi.org/10.1145/3457607.
[87] Milliron, M., L. Malcom and D. Kil (2014), “Insight and Action Analytics: Three Case Studies to Consider”, Research & Practice in Assessment, Vol. 9, pp. 70-89.
[30] Mitchell, S.; E. Potash; S. Barocas; A. D'Amour and K. Lum (2020), “Algorithmic Fairness: Choices, Assumptions, and Definitions”, Annual Review of Statistics and Its Application, Vol. 8, pp. 141-163, https://doi.org/10.1146/annurev-statistics-042720-125902.
[75] Naismith, B.; N. Han; A. Juffs; B. Hill and D. Zheng (2019), Accurate Measurement of Lexical Sophistication with Reference to ESL Learner Data.
[78] Narayanan, A. (2018), Translation tutorial: 21 fairness definitions and their politics.
[2] O’Neil, C. (2016), Weapons of math destruction: how big data increases inequality and threatens democracy, Crown Publishing.
[5] O’Reilly-Shah, V. et al. (2020), “Bias and ethical considerations in machine learning and the automation of perioperative risk assessment”, British Journal of Anaesthesia, Vol. 125/6, pp. 843-846, https://doi.org/10.1016/j.bja.2020.07.040.
[31] Obermeyer, Z.; B. Powers; C. Vogeli and S. Mullainathan (2019), “Dissecting racial bias in an algorithm used to manage the health of populations”, Science, Vol. 366/6464, pp. 447-453, https://doi.org/10.1126/science.aax2342.
[82] O’Connell, A. and D. McCoach (2008), Multilevel modeling of educational data, IAP.
[9] Ocumpaugh, J. and C. Heffernan (2014), “Population validity for educational data mining models: A case study in affect detection”, British Journal of Educational Technology, Vol. 45/3, pp. 487-501, https://doi.org/10.1111/bjet.12156.
[65] Ogan, A.; E. Walker; R. Baker; M. Rodrigo; J.C. Soriano and M.J. Castro (2015), “Towards understanding how to assess help-seeking behavior across cultures”, International Journal of Artificial Intelligence in Education, Vol. 25/2, pp. 229-248, https://doi.org/10.1007/s40593-014-0034-8.
[60] Okur, E.; S. Aslan; N. Alyuz; A. Arslan and R. Baker (2018), Role of Socio-Cultural Differences in Labeling Students’ Affective States, Springer International Publishing.
[48] Paquette, L.; J. Ocumpaugh; Z. Li; A. Andres and R. Baker (2020), “Who’s Learning? Using Demographics in EDM Research”, Journal of Educational Data Mining, Vol. 12/3, pp. 1–30, https://doi.org/10.5281/zenodo.4143612.
[36] Pardo, A. and G. Siemens (2014), “Ethical and privacy principles for learning analytics”, British Journal of Educational Technology, Vol. 45/3, pp. 438-450, https://doi.org/10.1111/bjet.12152.
[69] Penn Center for Learning Analytics (n.d.), Algorithmic Bias in Education, https://www.pcla.wiki/index.php/Algorithmic_Bias_in_Education.
[68] Ramineni, C. and D. Williamson (2018), “Understanding Mean Score Differences Between the e-rater® Automated Scoring Engine and Humans for Demographically Based Groups in the GRE® General Test”, ETS Research Report Series, Vol. 2018/1, pp. 1-31, https://doi.org/10.1002/ets2.12192.
[70] Riazy, S., K. Simbeck and V. Schreck (2020), Fairness in Learning Analytics: Student At-risk Prediction in Virtual Learning Environments, https://doi.org/10.5220/0009324100150025.
[71] Rzepka, N.; K. Simbeck; H. Müller and N. Pinkwart (2022), Fairness of In-session Dropout Prediction, https://doi.org/10.5220/0010962100003182.
[77] Samei, B.; A. Olney; S. Kelly; M. Nystrand; S. D'Mello; N. Blanchard and A. Graesser (2015), Modeling Classroom Discourse: Do Models That Predict Dialogic Instruction Properties Generalize across Populations?
[33] Santelices, M. and M. Wilson (2010), “Unfair Treatment? The Case of Freedle, the SAT, and the Standardization Approach to Differential Item Functioning”, Harvard Educational Review, Vol. 80/1, pp. 106-134, https://doi.org/10.17763/haer.80.1.j94675w001329270.
[74] Sha, L.; M. Raković; A. Das; D. Gašević and G. Chen (2022), “Leveraging Class Balancing Techniques to Alleviate Algorithmic Bias for Predictive Tasks in Education”, IEEE Transactions on Learning Technologies, Vol. 15/4, pp. 481-492, https://doi.org/10.1109/TLT.2022.3196278.
[73] Sha, L.; M. Raković; A. Whitelock-Wainwright; D. Carroll; V. Yew; D. Gašević and G. Chen (2021), Assessing algorithmic fairness in automatic classifiers of educational forum posts, https://doi.org/10.1007/978-3-030-78292-4_31.
[45] Silva, S. and M. Kenney (2019), “Algorithms, Platforms, and Ethnic Bias”, Communications of the ACM, Vol. 62/11, pp. 37-39, https://doi.org/10.1145/3318157.
[81] Slavin, R. (2020), “How evidence-based reform will transform research and practice in education”, Educational Psychologist, Vol. 55/1, pp. 21-31, https://doi.org/10.1080/00461520.2019.1611432.
[41] Soundarajan, S. and D. Clausen (2018), Equal Protection Under the Algorithm: A Legal-Inspired Framework for Identifying Discrimination in Machine Learning.
[28] Suresh, H. and J. Guttag (2021), “A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle”, EAAMO ’21: Equity and Access in Algorithms, Mechanisms, and Optimization 17, pp. 1-9, https://doi.org/10.1145/3465416.3483305.
[34] Sweeney, L. (2013), “Discrimination in online ad delivery”, Communications of the ACM, Vol. 56/5, pp. 44-54, https://doi.org/10.1145/2447976.2447990.
[58] Tempelaar, D., B. Rienties and Q. Nguyen (2020), “Subjective data, objective data and the role of bias in predictive modelling: Lessons from a dispositional learning analytics application”, Plos One, Vol. 15/6, https://doi.org/10.1371/journal.pone.0233977.
[86] VanLehn, K. (2011), “The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems”, Educational Psychologist, Vol. 46/4, pp. 197-221, https://doi.org/10.1080/00461520.2011.611369.
[72] Vasquez Verdugo, J.; X. Gitiaux; C. Ortega and H. Rangwala (2022), FairEd: A Systematic Fairness Analysis Approach Applied in a Higher Educational Context, https://doi.org/10.1145/3506860.3506902.
[17] Verma, S. and J. Rubin (2018), Fairness Definitions Explained, https://doi.org/10.1145/3194770.3194776.
[10] Wang, Z., K. Zechner and Y. Sun (2018), “Monitoring the Performance of Human and Automated Scores for Spoken Responses”, Language Testing, Vol. 35/1, pp. 101-120, https://doi.org/10.1177/0265532216679451.
[83] Wimmer, A. (2017), “Power and pride: national identity and ethnopolitical inequality around the world”, World Politics, Vol. 69/4, pp. 605-639, https://doi.org/10.1017/S0043887117000120.
[47] Wolff, A.; Z. Zdrahal; A. Nikolov and M. Pantucek (2013), Improving retention: predicting at-risk students by analysing clicking behaviour in a virtual learning environment, https://doi.org/10.1145/2460296.2460324.
[80] WWC (2012), What Works Clearinghouse, https://ies.ed.gov/ncee/wwc/.
[67] Yu, R., H. Lee and R. Kizilcec (2021), Should College Dropout Prediction Models Include Protected Attributes?, https://doi.org/10.48550/arXiv.2103.15237.
[14] Yu, R.; Q. Li; C. Fischer; S. Doroudi and D. Xu (2020), Towards Accurate and Fair Prediction of College Success: Evaluating Different Sources of Student Data.
[23] Zhang, J.; J.M.A.L. Andres; S. Hutt; R. Baker; J. Ocumpaugh; C. Mills; J. Brooks; S. Sethuraman and T. Young (2022), Detecting SMART Model Cognitive Operations in Mathematical Problem-Solving Process.
[3] Zuiderveen Borgesius, F. (2018), Discrimination, artificial intelligence, and algorithmic decision-making, Council of Europe, https://pure.uva.nl/ws/files/42473478/32226549.pdf.