This report uses a dataset of online job postings (OJPs) with monthly information from January 2018 to June 2022 to analyse Umbria’s labour market trends. The data are collected, transformed and harmonised by Lightcast (formerly Emsi-Burning Glass Technologies). The dataset comprises 6.8 million individual-level job postings for Italy and 72 434 for Umbria. It contains up to 70 different variables, including the skill keywords contained in each job posting, the qualifications and experience required to fill the job, its geographical location, the type of contract (permanent, temporary) and, when available, the salary offered for the specific role advertised. The OECD further transformed the data to create the yearly aggregates, cross tabulations and other statistics presented in this report. The raw text of the OJPs is also analysed, as explained in this Annex. Lightcast offers the unique possibility to investigate the text contained in each online job posting, which reveals a depth of information that other sources cannot match.
The Regional Training Catalogue (RTC) by ARPAL Umbria contains information on 1 649 different courses, which in total have 23 652 training spots available. The variables within this dataset are discussed in Table 2.1 in Chapter 2.
To analyse the text information contained in OJPs and in the RTC, this report leverages Natural Language Processing (NLP) techniques. NLP is a multi-disciplinary field that draws on techniques from computer science, linguistics, mathematics, and psychology. More precisely, NLP focuses on the interaction between human language and computers. It involves developing algorithms and computational models that can process and analyse natural language data, including text, speech, and images. NLP combines computational linguistics (rule-based modelling of human language) with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment (IBM, 2022[1]). In a nutshell, NLP allows researchers to create a map from words and their complex semantic meanings to numbers that can be analysed through algebraic manipulations and the use of probability models.
This Annex provides technical guidance on the methods that have been used throughout this report to create a mapping between the dictionary of words used in the Regional Training Catalogue (RTC) in Umbria and the skill keywords extracted from online job postings (OJPs).
Figure A A.1 provides a graphical description of the conceptual framework adopted to create the mapping between the vocabulary used to describe the training courses in the RTC and the keywords used by employers in OJPs. The two data sources (RTC and OJPs), shown at the top of the figure, are initially treated as distinct entities, each requiring data pre-processing, that is, standardisation of their textual content and the removal of unnecessary words (typically called ‘stop words’) that do not convey particular semantic meaning or useful information.
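The pre-processing step described above can be illustrated with a minimal sketch. It is not the pipeline used in this report: the function name `preprocess` and the very short stop-word list are hypothetical and serve only to show what standardisation and stop-word removal typically involve.

```python
import re
import unicodedata
from typing import List

# Hypothetical, deliberately tiny stop-word list used for illustration only;
# a real pipeline would rely on a complete Italian stop-word list.
STOP_WORDS = {"a", "con", "dei", "del", "della", "di", "e", "i", "il", "in", "la", "le", "per"}


def preprocess(text: str) -> List[str]:
    """Standardise a text and drop stop words (illustrative assumption)."""
    # Normalise Unicode so accented letters have a single representation,
    # then lower-case the whole string.
    text = unicodedata.normalize("NFKC", text).lower()
    # Keep runs of letters only; digits, punctuation and underscores are dropped.
    tokens = re.findall(r"[^\W\d_]+", text)
    # Remove words that carry little semantic content.
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("Gestione della contabilità e del bilancio per le imprese"))
# → ['gestione', 'contabilità', 'bilancio', 'imprese']
```

The same function could be applied to both RTC course descriptions and OJP text so that the two sources share a common, standardised vocabulary before any further analysis.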
The information contained in the RTC comes in separate structured files that contain the description of the learning modules and of the skills that students will learn once enrolled in the course. To highlight the most important aspects of the course descriptions, specific skill keywords have been extracted using simple rules (see Step 1, below).
The OJPs data, on the other hand, are already mapped to the ESCO skill taxonomy when collected by Lightcast and require minimal pre-processing and cleaning. The keywords extracted from the OJPs are far more numerous than those present in the vocabulary of the RTC, as they are derived from the analysis of over a million OJPs gathered for the entire Italian labour market in 2019. These keywords are used to create a semantic representation, the so-called embedding, that describes the extent to which each skill keyword in the OJPs is semantically associated with every other (see Step 2, below).
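To give an intuition of what an embedding is, the sketch below builds a very simple count-based representation in which each skill keyword is described by how often it appears in the same posting as every other keyword. This is a simplified stand-in chosen for illustration, not the embedding model used in this report (see Step 2); the function name `cooccurrence_embedding` and the toy postings are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, List


def cooccurrence_embedding(postings: List[List[str]]) -> Dict[str, List[float]]:
    """Represent each keyword by its co-occurrence counts with all keywords."""
    # Fixed, sorted vocabulary so that every vector uses the same dimensions.
    vocab = sorted({keyword for posting in postings for keyword in posting})
    counts: Dict[str, Counter] = defaultdict(Counter)
    for posting in postings:
        # Count each unordered pair of distinct keywords once per posting.
        for first, second in combinations(sorted(set(posting)), 2):
            counts[first][second] += 1
            counts[second][first] += 1
    return {kw: [float(counts[kw][other]) for other in vocab] for kw in vocab}


# Toy postings, invented for illustration (not taken from the Lightcast data).
toy_postings = [
    ["python", "sql", "data analysis"],
    ["python", "machine learning", "data analysis"],
    ["welding", "metalworking"],
]
embedding = cooccurrence_embedding(toy_postings)
print(embedding["python"])
# → [2.0, 1.0, 0.0, 0.0, 1.0, 0.0]
```

The dimensions follow the sorted vocabulary (data analysis, machine learning, metalworking, python, sql, welding). Keywords that tend to appear in the same postings end up with similar vectors, which is the property that the semantic representation relies on.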
The embedding created using the OJPs keywords can also be used to determine the extent to which each skill provided by a training programme in the RTC is related to a skill mentioned in the OJPs. This is achieved by measuring the semantic similarity (cosine similarity) between the skills in the RTC and those extracted from the OJPs. This strategy allows each skill keyword in the RTC to be mapped to a set of skills on the demand side (OJPs), effectively integrating them into the latter’s dictionary. This mapping is then used for all subsequent analyses, as presented and described in Chapters 2 and 3 of this report.
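The cosine similarity between two vectors is their dot product divided by the product of their lengths, so it equals 1 for vectors pointing in the same direction and 0 for orthogonal ones. The sketch below shows how such a measure could be used to map each RTC skill to its closest OJP skills. It assumes that every skill already has a vector in a common embedding space; the function names, the `top_k` parameter and the two-dimensional toy vectors are hypothetical and are not taken from the report's data.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def map_rtc_to_ojp(
    rtc_vectors: Dict[str, List[float]],
    ojp_vectors: Dict[str, List[float]],
    top_k: int = 2,
) -> Dict[str, List[Tuple[str, float]]]:
    """For each RTC skill, return the top_k most similar OJP skills."""
    mapping = {}
    for rtc_skill, rtc_vec in rtc_vectors.items():
        scored = [
            (ojp_skill, cosine_similarity(rtc_vec, ojp_vec))
            for ojp_skill, ojp_vec in ojp_vectors.items()
        ]
        # Highest similarity first.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        mapping[rtc_skill] = scored[:top_k]
    return mapping


# Toy two-dimensional vectors, invented for illustration only.
ojp_vectors = {
    "python": [1.0, 0.0],
    "welding": [0.0, 1.0],
    "data analysis": [0.8, 0.2],
}
rtc_vectors = {"programmazione": [0.9, 0.1]}

mapping = map_rtc_to_ojp(rtc_vectors, ojp_vectors)
print([skill for skill, _ in mapping["programmazione"]])
# → ['python', 'data analysis']
```

Retaining several OJP skills per RTC skill, rather than only the single closest one, reflects the idea of mapping each RTC keyword to a set of demand-side skills; the appropriate number of matches or similarity threshold is a design choice that this sketch does not settle.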