Data collection and storage

Data collection

There are several measures that first responders or survey enumerators can take to improve the reliability, coverage and timeliness of the data. For example, switching household surveys from computer-assisted web interviewing (CAWI) or computer-assisted telephone interviewing (CATI) to computer-assisted personal interviewing (CAPI) would improve the coverage of hard-to-reach migrant groups.

Refugee and refugee-related populations are under-sampled in censuses and surveys because of language barriers in completing questionnaires, a tendency to avoid contact with authorities and suspicion about the motives of data collection. This can be overcome by using enumerators and interpreters with appropriate language skills, ensuring that questionnaires can be translated into other languages in response to evolving patterns of migration, and working with refugee representatives to explain the purpose of the survey and reassure respondents (EGRISS, 2018). As emphasized in Part II, Chapter 3, it is crucial that censuses and surveys cover refugee camps, reception centres and informal settlements accommodating a high proportion of refugees and asylum seekers.

Household surveys may introduce a gender bias when the head of the household provides answers for other household members, especially if the questions relate to the reasons for migration, family planning and decision-making, or sexual and reproductive health. Therefore, the choice of enumerator and the timing and location of the interviews are all important considerations for ensuring that the perspectives of migrants of all genders and age groups are reflected in the data (EGRISS, 2018).

Data entry errors might be prevented by circulating Excel sheets with predefined answer categories (i.e., data validation) and training survey enumerators and first responders on how to record the data in line with standardized measurement guidelines. A common mistake in the recording of data is the assignment of values to the wrong answer categories, especially when the values are context specific. For example, certain occupations might not fit easily into the International Standard Classification of Occupations (ISCO) nomenclature, especially in contexts with a large informal economy. Transcription errors might be prevented by developing electronic platforms and distributing handheld devices so that survey enumerators do not have to record the data manually. Manual recording not only contributes to transcription errors but also leads to a backlog in data entry and cleaning.
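The idea of predefined answer categories can be illustrated with a minimal sketch. The field names and category lists below are hypothetical and stand in for whatever the survey codebook defines; the check simply flags values that fall outside the allowed set so that they can be reviewed rather than silently recorded.

```python
# Hypothetical answer categories; a real survey would take these from its codebook.
ALLOWED = {
    "sex": {"female", "male", "other", "not stated"},
    "employment_status": {"employed", "unemployed", "outside labour force", "not stated"},
}


def validate_record(record):
    """Return the (field, value) pairs that fall outside the predefined categories."""
    problems = []
    for field, allowed in ALLOWED.items():
        # Normalise case and whitespace before comparing with the allowed set.
        value = str(record.get(field, "")).strip().lower()
        if value not in allowed:
            problems.append((field, record.get(field)))
    return problems


# Two illustrative records: the second uses values that are not in the codebook.
records = [
    {"id": 1, "sex": "Female", "employment_status": "employed"},
    {"id": 2, "sex": "F", "employment_status": "street vendor"},
]
flagged = {r["id"]: validate_record(r) for r in records}
```

In this sketch the second record would be flagged on both fields. A context-specific value such as an informal occupation would then be referred to a trained coder instead of being forced into the nearest category.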

Data quality often suffers from a long chain of reporting, starting with first responders and survey enumerators and extending to higher levels of administration (EGRISS, 2018). Often the data do not reach the National Statistical Office (NSO) but remain with other entities, such as the ministry of interior. Establishing a clear chain of reporting from the moment the data are collected to the moment they are received by the NSO, or even by relevant regional institutions such as Eurostat, is highly recommended. Assigning an information management officer at a sub-national administrative unit who is responsible for cleaning the data before they are transferred to central government administrators would improve data quality. Limiting data modification rights to information management officers would also help ensure that data are not erased or duplicated. To prevent double-counting, data records should be linked by a unique PIN. An internal PIN is preferable, for data protection and privacy reasons, to an identification document number that directly identifies the data subject. Access to the concordance between internal PINs and identification documents should be available only to a very restricted number of officials.
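The internal PIN approach can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: the field name `document_number`, the use of random identifiers and the rule of keeping the first record per person are all hypothetical choices made for the example.

```python
import uuid


def assign_internal_pins(records, id_field="document_number"):
    """Replace the identifying document number with a random internal PIN.

    Returns (linked_records, concordance). Records that share a document
    number receive the same PIN and are collapsed to avoid double-counting.
    """
    concordance = {}  # document number -> internal PIN (restricted access)
    linked = {}       # internal PIN -> record without the document number
    for record in records:
        doc = record[id_field]
        # Reuse the existing PIN for a known document number, else create one.
        pin = concordance.setdefault(doc, uuid.uuid4().hex)
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["pin"] = pin
        # Keep only the first record per PIN; a real workflow would send
        # later duplicates for review rather than drop them.
        linked.setdefault(pin, cleaned)
    return list(linked.values()), concordance


# Illustrative input: the same person registered at two locations.
raw = [
    {"document_number": "A123", "site": "north"},
    {"document_number": "B456", "site": "north"},
    {"document_number": "A123", "site": "south"},
]
linked_records, concordance = assign_internal_pins(raw)
```

The linked records carry only the internal PIN and can be passed along the reporting chain, while the concordance would be stored separately with tightly restricted access.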

Data storage considerations

Utrecht University’s Research Data and Management Support has outlined several good practices in storing data which can be applied outside of academic institutions (Utrecht University):

Choosing storage media (e.g., portable devices, clouds, network drives) wisely, as not all storage locations are equally suitable for all types of storage;
Managing versions and copies of data carefully by protecting raw data, keeping temporary and master copies apart, backing-up master copies in physically distinct locations and setting up a strategy for version control, for example by numbering different versions of the same dataset;
Structuring folders and naming files in a clear and logical way;
Assigning metadata, which provide information about your data, such as the topic, content and circumstances under which the data were obtained;
Using standard file formats that are widely employed and supported by multiple software platforms. As further elaborated by EGRISS in their manual on refugee statistics, providing data and metadata in a variety of formats would improve data availability and sharing. Archiving and creating adequate data documentation (e.g., data codebooks, glossaries and variable measurement details) for use in policymaking and evaluation, including for gender-based analysis, encourages the use, reliability, comparability and replicability of the data (EGRISS, 2018).
Securing data files, for example, by controlling access to restricted materials with encryption, developing procedural arrangements for access conditions, not sending personal or confidential data via email, destroying data in a reliable manner when needed, setting up firewalls and using anti-virus software, using only secured wireless networks, keeping login credentials private, locking your computer when leaving it and storing an external hard disk or USB stick in a secure location.
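Some of the practices listed above (numbered versions, protected raw data and attached metadata) can be illustrated with a short sketch. The naming pattern, the JSON metadata file and the function names are assumptions made for this example and are not taken from the Utrecht University guidance.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, used to detect later changes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def save_version(raw_path, archive_dir, metadata):
    """Copy raw_path into archive_dir as the next numbered version.

    A JSON file with the metadata and a checksum is written next to the copy.
    The raw file itself is never modified.
    """
    raw_path, archive_dir = Path(raw_path), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    # Count existing versions of this dataset (metadata files are excluded).
    existing = [p for p in archive_dir.glob(f"{raw_path.stem}_v*")
                if p.suffix == raw_path.suffix]
    version = len(existing) + 1
    target = archive_dir / f"{raw_path.stem}_v{version:03d}{raw_path.suffix}"
    shutil.copy2(raw_path, target)
    sidecar = dict(metadata, file=target.name, sha256=sha256_of(target))
    target.with_suffix(".json").write_text(json.dumps(sidecar, indent=2))
    return target


# Demonstration in a temporary folder with a hypothetical survey file.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "survey.csv"
    raw.write_text("pin,sex\nabc,female\n")
    archive = Path(tmp) / "archive"
    first = save_version(raw, archive, {"topic": "household survey"})
    second = save_version(raw, archive, {"topic": "household survey"})
    names = [first.name, second.name]
    stored = json.loads(first.with_suffix(".json").read_text())
    checksum_ok = stored["sha256"] == sha256_of(raw)
```

In practice the archive folder would sit in a physically distinct location from the working copy, and the stored checksum allows a later check that a master copy has not been altered.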

Building a strong data infrastructure

The World Bank’s (2021) World Development Report on Data for Better Lives has emphasized that governments need to pay more attention to the specific infrastructure required to support the sharing, storage and processing of large volumes of data. Harnessing the full economic and social value of modern data services calls for digital infrastructure that is universally accessible, while also offering adequate internet speed at affordable cost. However, the developing world is lagging behind, with major gaps in broadband connectivity between high- and low-income populations and a substantial divide emerging between high- and low-income countries in the availability of data infrastructure. Lower- and middle-income countries lack domestic facilities for locally generated data to be exchanged (e.g., via internet exchange points), stored (e.g., at colocation data centres) and processed (e.g., on cloud platforms). Instead, many lower- and middle-income countries depend on overseas facilities, requiring them to transfer large volumes of data in and out of the country, which is both more costly and slower than using their own data facilities. As further elaborated in the World Bank 2021 report, governments can strengthen their data infrastructure by:

creating internet exchange points (IXP);
creating a favourable environment for colocation data centres;
securing on-ramps to global cloud resources;
investing in and retaining ICT human resources.
