Data source and methodology for the experimental statistics on the calculation of the quarterly migration balance

1. Measuring international migration

One important objective of migration statistics is to determine the number of foreign citizens residing in Hungary according to the EU definition of the usual resident population (residence of at least 12 months) and also to determine the number of foreign citizens entering and leaving the country according to this definition. A further objective is to examine the migration of Hungarian citizens: the number of Hungarian citizens emigrating or returning is also determined on the basis of the definition of the usual resident population.

Since the beginning of migration statistical data collection, the Hungarian Central Statistical Office (HCSO) has published migration data at annual intervals. The aim of the current experimental statistics is to estimate the migration balance on a quarterly basis, which required 1) an increase in the frequency of data collection from annual to quarterly/monthly and 2) the development of models for estimating quarterly migration. The estimation procedure was based on regression models run on already available historical data, which allowed the estimation of the expected residence time of the migrating population. The imputation of missing data using multivariate analysis methods made it possible to estimate residence durations even for individuals for whom data on migration duration were incomplete.

2. Data sources

Currently, the main administrative data sources used to produce migration statistics are:

a)      Ministry of Energy, Personal Data and Address Records (OSAP 2228);

b)      Emigration and return migration of Hungarian citizens: the Social Insurance Identification Number (TAJ) Register of the National Health Insurance Fund Management (NHIF) (OSAP 2197);

Article (5), Section 80 of Act LXXXIII of 1997 on Compulsory Health Insurance Benefits stipulates that anyone who takes out health insurance abroad is obliged to notify the home health insurance company. The social insurance register therefore contains data based on the obligation to report emigration and includes registered migration events. Despite the obligation, the data are not comprehensive.

c)      Emigration and immigration of foreign citizens: aliens administrative registers of the National Directorate-General for Aliens Policing (NDGAP) (OSAP 2196, 2550 data transfers);

There are separate registers for nationals of European Economic Area (EEA) countries, who have the right of free movement and residence, and for third-country nationals.

3. Practice of migration statistical data processing

The NDGAP database (OSAP 2196 and 2550) is a panel database: a person can appear in several rows (have several migration events). Socio-economic and migration variables can be included as explanatory variables in multivariate analyses and explanatory models. The NDGAP database allows for the analysis of variables such as sex, age, marital status, citizenship, purpose of migration, educational attainment and occupation. However, the latter two variables are only recorded for third-country nationals.

The register of foreigners kept by the NDGAP is one of the most important sources of statistical data on foreigners who have arrived in Hungary and have been granted a residence or permanent residence permit. The register covers citizens of the European Union and EEA countries and third-country nationals. The data from the registers are received monthly by HCSO.

A specific feature of the EEA file is that registration events for one person are not linked to each other, whereas in the third-country national database the notifications concerning one person are linked. There may, however, be a cross-over between the two databases. One of the main challenges in preparing the quarterly migration balance was that the majority of EEA citizens do not have a residence period in the immigration files of the NDGAP. The main task was therefore to assign estimated residence periods in place of the missing ones.

The NHIF database also has a panel structure: a person can appear in several rows (have several migration events). In a multivariate statistical model, for example, the following socio-economic variables can be used as explanatory variables: sex, age, marital status and country of emigration. The use of date-type variables is particularly problematic in the NHIF database: so-called censored data are common, where the end of the emigration event has not yet occurred during the observed period, i.e. the date of emigration is known but the date of return is not. It may also happen that the date of emigration preceding a return migration is not known. There may also be cases where further emigration(s) and return migration(s) took place between two recorded emigration events, but these are not known. These missing dates may, for example, result from the fact that emigrating Hungarian citizens do not report their emigration and return to the authorities. Because of censored and missing data, the duration of Hungarian citizens' stay abroad has to be estimated, which can be done using multivariate analysis methods.
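The three situations described above (both dates known, only the emigration date known, only the return date known) can be told apart with a simple check. The following is a minimal sketch only: the function names and the labels are hypothetical and are not taken from the NHIF register or from HCSO's processing system.

```python
from datetime import date
from typing import Optional


def classify_event(emigration: Optional[date], return_date: Optional[date]) -> str:
    """Label a migration event by which of its two dates is recorded.

    The labels are illustrative, not official register codes.
    """
    if emigration is not None and return_date is not None:
        return "complete"        # duration is directly observed
    if emigration is not None:
        return "right_censored"  # left the country, return not (yet) observed
    if return_date is not None:
        return "start_missing"   # returned, but emigration date unknown
    return "unusable"            # neither date is recorded


def observed_duration_days(emigration: Optional[date],
                           return_date: Optional[date]) -> Optional[int]:
    """Return the observed duration in days, or None if it must be estimated."""
    if classify_event(emigration, return_date) != "complete":
        return None
    return (return_date - emigration).days
```

Only events labelled "complete" carry an observed duration; the other two usable groups are the ones for which a duration has to be estimated.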

4. Linear regression

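Numerous statistical methods (e.g. linear or censored regression models) can be used to estimate the censored and missing duration of migration events, which may take into account the explanatory variables mentioned earlier. We chose the log-linear regression model to estimate the missing duration of migration events. In its general form, the model equation can be written as follows:

\[ \ln(T_i) = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik} + \varepsilon_i \]

where:

T_i is the duration of migration event i in days;
x_{ik} is the value of the k-th explanatory variable for migration event i;
\beta_0 is the intercept and \beta_1, ..., \beta_K are the regression coefficients;
\varepsilon_i is the error term.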

The explanatory variables of the model can be the socio-demographic characteristics of the individual and the main characteristics associated with migration events. If only one known start or end date is associated with the migration event, then the estimated duration of stay in Hungary/abroad in days is added to or subtracted from the date to obtain the missing date. Using the log-linear model, it is possible to impute future dates, i.e. to estimate the expected end of the migration event.
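The estimation and the date arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not HCSO's implementation: the function names are hypothetical, the fit is an ordinary least-squares regression of the logarithm of the duration on numerically coded explanatory variables, and the predicted duration is obtained by a plain exponential back-transformation without any retransformation correction.

```python
import math
from datetime import date, timedelta


def fit_loglinear(X, durations_days):
    """OLS fit of ln(duration) on the columns of X plus an intercept.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    Returns [intercept, coefficient_1, ..., coefficient_K].
    """
    rows = [[1.0] + [float(v) for v in x] for x in X]
    y = [math.log(d) for d in durations_days]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for c in range(p):
        # Partial pivoting for numerical stability
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(p):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][p] for i in range(p)]


def predict_days(beta, x):
    """Back-transform the predicted log-duration to days."""
    return math.exp(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


def impute_missing_date(known: date, est_days: float, known_is_start: bool) -> date:
    """Add the estimated duration to a known start date, or subtract it
    from a known end date, to obtain the missing date."""
    delta = timedelta(days=round(est_days))
    return known + delta if known_is_start else known - delta
```

For example, `impute_missing_date(date(2023, 1, 1), predict_days(beta, x), known_is_start=True)` yields an estimated end date, which may lie in the future.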

5. Imputation of missing data

For the NDGAP and NHIF data, migration data with known durations were split into a training and a test set in a ratio of 80–20%, and the migration durations were estimated for the test set using the results of regressions fitted on the training set. When comparing the actual and predicted durations of the test set, the MAE (Mean Absolute Error) values were high. One reason for this is that the durations predicted by the log-linear model are too short, especially for EEA nationals, who have a high proportion of unknown durations of residence. For the foreigners who were in Hungary at the beginning of the reference period and for whom the duration of stay was missing, we proceeded as follows: the duration estimated by the log-linear model was added to the beginning of the reference period rather than to the date of immigration. We considered as immigrants those persons whose actual or estimated length of stay exceeded 12 months.
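The validation step and the reference-period rule can be sketched as below. The function names, the random seed, the use of a random shuffle for the split and the representation of 12 months as 365 days are all assumptions made for illustration; the text does not specify how the split was drawn or how months were converted to days. Using the immigration date as anchor for persons who arrived after the start of the reference period is likewise an assumption.

```python
import random
from datetime import date, timedelta

TWELVE_MONTHS_DAYS = 365  # assumption: 12 months approximated as 365 days


def split_80_20(records, seed=0):
    """Randomly split records with known durations into 80% training, 20% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted durations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def estimated_end_of_stay(immigration: date, est_days: float, ref_start: date) -> date:
    """For a stay with missing duration, count the estimated duration from the
    start of the reference period if the person was already in the country then;
    otherwise (assumption) from the immigration date."""
    anchor = max(immigration, ref_start)
    return anchor + timedelta(days=round(est_days))


def is_immigrant(length_of_stay_days: float) -> bool:
    """Actual or estimated length of stay must exceed 12 months."""
    return length_of_stay_days > TWELVE_MONTHS_DAYS
```

Anchoring at the start of the reference period lengthens the implied total stay of persons already present, which counteracts predicted durations that are too short.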

Regarding the NHIF dataset, we accepted the results of the duration estimation in two cases. Where only the date of return migration was known for a migration event, the number of returning citizens in the reference period was estimated by taking into account the estimated length of time spent abroad. If the estimated length of time spent abroad was 12 months or more and the actual date of return migration fell in the reference quarter, it was counted as a return migration. If only the date of emigration was known, it was checked whether the emigration took place during the reference period or one year before the reference period. In the former case, we took into account whether the estimated duration of stay abroad was expected to reach one year. If so, we counted it as emigration. In the case of the NHIF database, great care was taken to avoid overlaps between periods, so that circular migration is captured more accurately than in the past. This reduced the volume of emigration and return migration, but had less impact on the balance.
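The two counting rules stated above can be written as simple predicates. This is a sketch under assumptions: the function names are hypothetical, one year is represented as 365 days, and the reference quarter is given by its first and last day. The treatment of emigrations dated one year before the reference period is not spelled out in the text, so the sketch leaves those events uncounted.

```python
from datetime import date

YEAR_DAYS = 365  # assumption: one year / 12 months approximated as 365 days


def counts_as_return(return_date: date, est_days_abroad: float,
                     q_start: date, q_end: date) -> bool:
    """Only the return date is known: count the event as a return migration if
    the estimated time abroad is at least 12 months and the actual return date
    falls in the reference quarter."""
    return est_days_abroad >= YEAR_DAYS and q_start <= return_date <= q_end


def counts_as_emigration(emigration_date: date, est_days_abroad: float,
                         q_start: date, q_end: date) -> bool:
    """Only the emigration date is known: if the emigration took place in the
    reference quarter, count it when the estimated stay abroad is expected to
    reach one year."""
    if q_start <= emigration_date <= q_end:
        return est_days_abroad >= YEAR_DAYS
    # Emigration dated one year before the reference period: the rule is not
    # given in the text, so this sketch does not count such events.
    return False
```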