The contracting authority enters the data into a standardised reporting form through the government-run electronic procurement platforms: the Official Gazette, the Elektronikus Közbeszerzési Rendszer (EKR) and the Közbeszerzési Adatbázis. First, all publications describing a tendering process are scraped. Then the original information from each publication is parsed into a uniformly structured data template and cleaned, which includes converting structured text into standard data types (numbers, dates, enumeration values).
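The conversion of structured text into standard data types can be illustrated with a minimal sketch. It is not the pipeline's actual code: the input formats (space-separated thousands with a decimal comma, a handful of date layouts) and the `ProcedureType` enumeration are assumptions chosen for illustration only.

```python
from datetime import date, datetime
from enum import Enum


class ProcedureType(Enum):
    """Hypothetical enumeration; the real template's values are not given in the text."""
    OPEN = "open"
    RESTRICTED = "restricted"
    NEGOTIATED = "negotiated"


def parse_number(text):
    """Convert text such as '1 234 567,50' to a float; return None if unparseable."""
    # Assumed format: spaces (or non-breaking spaces) as thousands separators, comma as decimal mark.
    cleaned = text.replace("\u00a0", "").replace(" ", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_date(text):
    """Try a few assumed date layouts; return a date or None."""
    for fmt in ("%Y-%m-%d", "%Y.%m.%d.", "%d/%m/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_enum(text):
    """Map free text onto the enumeration; return None for unknown values."""
    try:
        return ProcedureType(text.strip().lower())
    except ValueError:
        return None


price = parse_number("1 234 567,50")
published = parse_date("2019.03.15.")
procedure = parse_enum(" Open ")
```

Returning `None` rather than raising keeps unparseable fields visible as missing values, which later steps can then handle explicitly.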
A single tender is typically described by multiple publications, such as a call for tender, a contract award notice, and possibly modifications, cancellations or notices about contract implementation. All relevant publications related to a single tender are grouped together based on publication cross-references. Once the publications describing the same tender have been matched, the next step is mastering: creating a single record for each tender that stores, for each variable, the final value that best represents the tendering process. The data cleaning, matching and mastering steps were originally developed by DIGIWHIST.1
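The matching and mastering steps can be sketched as follows. This is an illustrative simplification, not the DIGIWHIST implementation: publications are grouped as connected components of the cross-reference graph (using a union-find structure), and the mastering rule used here, taking the most recently published non-missing value for each variable, is an assumption, since the text does not specify how the final value is chosen. All field names (`id`, `refs`, `published`, and the variables) are hypothetical.

```python
from collections import defaultdict


def group_publications(publications):
    """Group publications into tenders via connected components of cross-references."""
    parent = {p["id"]: p["id"] for p in publications}

    def find(x):
        # Find the component root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p in publications:
        for ref in p.get("refs", []):
            if ref in parent:  # ignore references to publications not in the data
                parent[find(p["id"])] = find(ref)

    groups = defaultdict(list)
    for p in publications:
        groups[find(p["id"])].append(p)
    return list(groups.values())


def master(group, variables):
    """Build one record per tender: latest non-missing value per variable (assumed rule)."""
    ordered = sorted(group, key=lambda p: p["published"])  # ISO date strings sort correctly
    record = {}
    for var in variables:
        values = [p[var] for p in ordered if p.get(var) is not None]
        record[var] = values[-1] if values else None
    return record


pubs = [
    {"id": "A", "refs": [], "published": "2019-01-10",
     "title": "Road works", "estimated_price": 100.0, "final_price": None},
    {"id": "B", "refs": ["A"], "published": "2019-04-02",
     "title": "Road works", "estimated_price": None, "final_price": 95.0},
    {"id": "C", "refs": [], "published": "2019-02-01",
     "title": "IT services", "estimated_price": 40.0, "final_price": None},
]

tenders = group_publications(pubs)
records = [master(g, ["title", "estimated_price", "final_price"]) for g in tenders]
```

In this toy input, the award notice `B` references the call for tender `A`, so both end up in one tender whose mastered record combines the estimated price from the call with the final price from the award notice.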
In the final analysed dataset, each contract is stored as a separate observation. Some tenders contain more than one lot, and such tenders can have more than one contract award notice for a single call for tender publication. This design complicates the linking of tendering details: for example, it is not always possible to connect the estimated prices of individual lots to the individual contracts awarded. Framework agreements are a further complication, since they are first ‘pre-awarded’ and then followed by a follow-up award or contract implementation notice.
Once completed, the dataset goes through several phases of filtering to select the observations relevant for the analysis, covering tender years 2017-2022. Contracts with a missing bidder name or buyer name, cancelled contracts, and non-awarded parts of framework agreements are filtered out. After filtering, the final contract-level dataset contains 106 045 observations. This is a subset of the full dataset, which has 429 241 observations for the period 2012-2023, most of which relate to losing bidders or lack an important piece of information. To visualise and track yearly and monthly distributions of variables, two additional variables, ‘year’ and ‘month’, were created. The variable ‘year’ is based on the tender year. The variable ‘month’ is based on the first call for tender publication. For observations where this value is missing, an intermediate imputation step was added: the median difference between the contract award date and the first call for tender publication is computed per 4-digit sector and subtracted from the contract award date to obtain an imputed date. The month of this imputed date is then used wherever the original month variable is missing.
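The month imputation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the analysis: the field names (`sector`, `call_date`, `award_date`, `month`) are hypothetical, the difference is measured in days and rounded to a whole number, and observations whose sector has no observed difference are left missing.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import median


def impute_month(contracts):
    """Fill 'month' from the call date, or from award date minus the sector's median gap."""
    # Collect observed gaps (in days) between award and first call, per 4-digit sector.
    gaps = defaultdict(list)
    for c in contracts:
        if c["call_date"] is not None and c["award_date"] is not None:
            gaps[c["sector"]].append((c["award_date"] - c["call_date"]).days)
    median_gap = {sector: median(days) for sector, days in gaps.items()}

    for c in contracts:
        if c["call_date"] is not None:
            c["month"] = c["call_date"].month
        elif c["award_date"] is not None and c["sector"] in median_gap:
            # Subtract the sector's median gap from the award date to get an imputed date.
            imputed = c["award_date"] - timedelta(days=round(median_gap[c["sector"]]))
            c["month"] = imputed.month
        else:
            c["month"] = None  # nothing to impute from
    return contracts


def make_contract(sector, call_date, gap_days):
    """Helper for the toy data: award date is the call date plus a known gap."""
    return {"sector": sector, "call_date": call_date,
            "award_date": call_date + timedelta(days=gap_days)}


contracts = [
    make_contract("4523", date(2019, 1, 1), 60),
    make_contract("4523", date(2019, 5, 10), 90),
    make_contract("4523", date(2019, 8, 20), 120),
    # Call date missing: month is imputed from the sector median gap (90 days).
    {"sector": "4523", "call_date": None, "award_date": date(2020, 3, 15)},
]

result = impute_month(contracts)
```

For the last toy contract, the sector's median gap is 90 days, so the imputed date is 90 days before 15 March 2020, which falls in December 2019.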
As a result of these processing steps, a final data table is created for the analysis, in which each row corresponds to an awarded contract.