The least absolute shrinkage and selection operator (lasso), a machine learning technique within the family of supervised statistical learning methods, is applied throughout this report as a compass to guide the selection of key teacher and school factors related to student achievement, social-emotional skills and gaps in student performance within schools. In this report, lasso is used for model selection, although it can also be used for prediction and inference. Lasso has several attributes that make it an attractive tool for selecting, among the many variables collected through the TALIS questionnaires, those that are potentially key predictors of student outcomes. These attributes are:
Lasso is designed to select variables that are important and should be included in the model.
The outcome variable guides the model selection process (i.e. lasso is a supervised statistical learning method).
Lasso can handle high‑dimensional models where the number of variables is high relative to the number of observations.
Lasso is most useful when only a few out of many potential variables affect the outcome (Hastie, Tibshirani and Friedman, 2017[1]; Hastie, Tibshirani and Wainwright, 2015[2]; Tibshirani, 1996[3]). The assumption that the number of non-zero coefficients (i.e. coefficients of variables correlated with the outcome) in the true model is small relative to the sample size is known as the sparsity assumption; it is formalised below. The approximate sparsity assumption requires that the number of non-zero coefficients in the model that best approximates the true model be small relative to the sample size.
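To fix ideas, denoting by $\beta_j$ the coefficient on the $j$-th of the $p$ potential covariates, by $s$ the number of non-zero coefficients and by $N$ the sample size, the exact sparsity assumption can be written as follows (a minimal formalisation, introduced here for illustration only):

$$ \sum_{j=1}^{p} \mathbf{1}\{\beta_j \neq 0\} = s \ll N $$

where $\mathbf{1}\{\cdot\}$ denotes the indicator function.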
Lasso estimates coefficients and, in doing so, selects a model. It selects variables that correlate well with the outcome in one dataset (the training sample) and then tests whether the selected variables predict the outcome well in another dataset (the validation sample). Lasso proceeds with model selection by estimating model coefficients in such a way that some of the coefficient estimates are exactly zero and the corresponding variables are, hence, excluded from the model, while others are not (Hastie, Tibshirani and Friedman, 2017[1]; Hastie, Tibshirani and Wainwright, 2015[2]; Tibshirani, 1996[3]). In the context of model selection, lasso may not always be able to distinguish an irrelevant predictor that is highly correlated with the predictors in the true model from the true predictors (Wang et al., 2019[4]; Zhao and Yu, 2006[5]).
Lasso for linear models solves an optimisation problem. The lasso estimate is defined as:

$$ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \mathbf{x}_i' \boldsymbol{\beta} \right)^2 + \lambda \sum_{j=1}^{p} \omega_j \left| \beta_j \right| \right\} $$

where $y_i$ is the outcome variable, $\mathbf{x}_i$ refers to the $p$ potential covariates, $\boldsymbol{\beta}$ is the vector of coefficients on $\mathbf{x}_i$, $\lambda \geq 0$ is the lasso penalty parameter, $\omega_j$ refers to the parameter-level weights known as penalty loadings and $\lambda \sum_{j=1}^{p} \omega_j |\beta_j|$ is the lasso penalty. As the lasso penalty term is not scale invariant, the variables included in the model need to be standardised before solving the optimisation problem.
Thus, the optimisation problem contains two parts: the least-squares fit measure,

$$ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \mathbf{x}_i' \boldsymbol{\beta} \right)^2 $$

and the penalty term,

$$ \lambda \sum_{j=1}^{p} \omega_j \left| \beta_j \right| $$

The $\lambda$ and $\omega_j$ parameters (also called “tuning” parameters) specify the weight applied to the penalty term. When $\lambda$ is large, the penalty term is also large, which results in lasso selecting few or no variables. As $\lambda$ decreases, the penalty associated with each non-zero $\beta_j$ decreases, which results in an increase in the number of coefficient estimates kept by lasso. When $\lambda = 0$, lasso reduces to the ordinary least squares (OLS) estimator, with no coefficient estimates excluded from the model.
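Under the simplifying assumptions of orthonormal covariates and unit penalty loadings (an illustration only, not the setting of the TALIS-PISA analyses), the optimisation problem has a closed-form solution that makes the selection mechanism explicit: each lasso estimate is a soft-thresholded version of the corresponding OLS estimate,

$$ \hat{\beta}_j^{\text{lasso}} = \operatorname{sign}\big( \hat{\beta}_j^{\text{OLS}} \big) \left( \big| \hat{\beta}_j^{\text{OLS}} \big| - \lambda \right)_+ $$

so any covariate whose OLS coefficient is smaller than $\lambda$ in absolute value receives a coefficient of exactly zero and is excluded from the model.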
Two commonly used methods to select the so-called “tuning” parameters are cross-validation (CV) and the adaptive lasso. CV finds the $\lambda$ that minimises the out-of-sample prediction error. Although CV works well for prediction, it tends to include covariates whose coefficients are zero in the true model that best approximates the data. The adaptive lasso, which consists of two CVs, is more parsimonious when it comes to model selection. After finding a CV solution for $\lambda$, it performs another CV among the covariates selected in the first step, using weights $\omega_j = 1/|\hat{\beta}_j|$ (where $\hat{\beta}_j$ are the penalised estimates from the first CV) on the coefficients in the penalty function. Covariates with smaller coefficients are thus more likely to be excluded in the second step (Drukker and Liu, 2019[6]).
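To illustrate, both methods map directly onto options of Stata's lasso command, which this report relies on (StataCorp, 2019[9]). The sketch below is illustrative only: the outcome math_score and the candidate covariates x1-x50 are hypothetical placeholder names, not the actual TALIS-PISA link variables.

    * Cross-validation: choose the lambda that minimises the out-of-sample
    * prediction error
    lasso linear math_score x1-x50, selection(cv) rseed(12345)
    lassoknots     // lambda grid and number of covariates selected at each knot
    lassocoef      // covariates retained at the CV-optimal lambda

    * Adaptive lasso: a second CV using weights based on the first-step
    * penalised estimates
    lasso linear math_score x1-x50, selection(adaptive) rseed(12345)
    lassocoef      // typically more parsimonious than plain CV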
The third commonly used method is the plugin lasso. Among the three methods, the plugin lasso is the most parsimonious and also the fastest in terms of computational time. Instead of minimising a CV function, as done by the CV and adaptive lasso methods presented above, the plugin method uses an iterative formula to find the smallest value of $\lambda$ that is large enough to dominate the estimation error in the coefficients. The plugin lasso selects the penalty loadings to normalise the scores of the (unpenalised) fit measure for each parameter and then chooses a value of $\lambda$ that is greater than the largest normalised score with a probability close to 1 (Drukker and Liu, 2019[6]). For more detail on the plugin lasso, see Belloni et al. (2012[7]) and Drukker and Liu (2019[8]). The plugin lasso tends to select the most important variables and is good at not including covariates that do not belong to the true model. However, unlike the adaptive method, the plugin lasso can overlook some covariates with large coefficients and select covariates with small coefficients (Drukker and Liu, 2019[6]). Given its favourable model selection attributes, the plugin method is applied in Chapters 2 and 3.
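The plugin method is invoked analogously (again with the hypothetical placeholder names used above):

    * Plugin lasso: lambda is set by an iterative plugin formula rather than
    * by cross-validation, so no random seed is involved
    lasso linear math_score x1-x50, selection(plugin)
    lassocoef      // usually the most parsimonious of the three methods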
In the school-level analysis of Chapter 4, however, which explores the teacher and school factors that could play a role in mitigating within-school disparities in performance between girls and boys, the adaptive lasso is used. Because the analysis is conducted at the school level rather than at the student level, sample sizes decrease considerably, leading lasso to select fewer variables. The adaptive lasso is therefore preferred, as it results in more covariates being selected than under the more parsimonious plugin lasso. It is also important to note that the model selection properties of lasso have limitations. Notably, irrespective of the way in which the tuning parameters are selected, lasso may not always be able to distinguish an irrelevant predictor that is highly correlated with the predictors in the true model from the true predictors.
Dividing the sample into training and validation sub-samples allows for validating the performance of the lasso estimator (or estimators, if different methods to select the tuning parameters are tested).2 In this report, the training and validation samples are generated by randomly splitting the overall TALIS-PISA link sample into two sub-samples, with 85% of the observations allocated to the training sample and 15% kept for the validation sample. While the proportion of observations allocated to each sub-sample could be considered somewhat arbitrary, sensitivity analysis shows that the lasso regression results reported herein are fairly robust to how the TALIS-PISA link sample is split into training and validation sub-samples.3 It is also important to note that the sample split is performed after creating a balanced sample that includes only those observations with full information for all variables included in the model (i.e. observations with missing information are excluded).4
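The following sketch illustrates this workflow in Stata, under the same hypothetical variable names as above; the 85/15 proportions follow the report, while the seed is arbitrary.

    * Keep only observations with complete information on all model variables
    egen nmiss = rowmiss(math_score x1-x50)
    drop if nmiss > 0

    * Randomly split the balanced sample: 85% training, 15% validation
    splitsample, generate(sample) split(0.85 0.15) rseed(12345)

    * Select the model on the training sample only
    lasso linear math_score x1-x50 if sample == 1, selection(plugin)

    * Compare predictive performance across the two sub-samples
    lassogof, over(sample) postselection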
Applying lasso for model selection means finding a model that fits the data, not finding a model that allows for interpreting estimated coefficients as effects. Thus, when used for model selection, lasso selects variables and estimates coefficients, but it does not provide the standard errors required for performing statistical inference. Indeed, lasso’s covariate‑selection ability makes it a non‑standard estimator and prevents the estimation of standard errors.
In this report, lasso is applied for model selection based on the overall population of 15-year-old students (Chapters 2 and 3) and schools (Chapter 4) surveyed within the TALIS-PISA link (i.e. the pooled sample across all participating countries and economies). As the model selection is based on the pooled sample, country fixed effects are imposed on lasso so that they are always included among the selected covariates. The same applies to the controls for student characteristics, such as student gender, migrant background and socio-economic status. Moreover, sampling weights are not used in the lasso regression analysis. This may be a limitation; hence, caution is warranted when interpreting the results of the lasso regression analyses within this report.
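In Stata's lasso syntax, covariates listed inside parentheses are unpenalised and therefore always retained, which provides one way of imposing such variables. In the sketch below, i.country stands for the country fixed effects, while girl, migrant and escs are hypothetical names for the imposed student controls.

    * Variables in parentheses are unpenalised ("always" covariates):
    * lasso keeps them in the model regardless of the size of lambda
    lasso linear math_score (i.country girl migrant escs) x1-x50, ///
        selection(plugin)
    lassocoef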
Lasso regressions are estimated using the “lasso” module of Stata (version 16.1) (StataCorp, 2019[9]); see Drukker and Liu (2019[6]) for an introduction.