Innovative data sources

Fuelled by rapid technological advancement, an increasing amount of migration-related information is now available, especially from the private sector. An unprecedented volume of data, commonly known as “big data”, has been generated through the use of digital devices such as mobile phones, internet-based platforms (such as social media and online payment services) and sensors (such as satellite and drone imagery). A growing body of research explores how the use and analysis of big data can help document forced migration, transnational networks and human trafficking, or estimate remittance flows (IOM, 2020).

Big data are referred to as “big” because of their volume, their “velocity” or speed and their “variety” or complexity (Hilbert, 2013). Some types of big data represent only the users of specific devices or apps. These data differ from data based on traditional household surveys because they refer not to a random sample of individuals but to the entire population of users of certain platforms (Hilbert, 2013), and because specific technical and analytical methods are required to extract meaningful insights from them (De Mauro, Greco and Grimaldi, 2016).

While new data sources should not replace population censuses, household surveys and administrative records, they can complement them when needed. In addition, practitioners might need to use innovative sources when seeking to analyse migration-related topics in real time. Compared with some traditional data, big data also offer lower data collection costs, greater timeliness and the potential to fill migration data gaps (e.g., on temporary or seasonal migration and on hard-to-reach populations with limited coverage) (IOM, 2023a).

Data innovation, including big data, is now the core focus of various task forces and working groups (UN Global Working Group on Big Data for Official Statistics, 2016; Eurostat Big Data Task Force, 2020; UN Global Pulse, 2020), and is mentioned in key global migration policy frameworks, such as the Global Compact for Safe, Orderly and Regular Migration (GCM) (IOM, 2021). Within this context, IOM’s GMDAC and the European Commission’s Joint Research Centre (JRC) founded the Big Data for Migration Alliance (BD4M) in collaboration with The GovLab at New York University (NYU). BD4M is the first dedicated network of stakeholders seeking to facilitate responsible data innovation and collaboration to improve the evidence base on migration and human mobility and its use for policymaking (see https://data4migration.org/about/). To accelerate further applications of data innovation that can provide useful insights for policy and programming, the BD4M facilitates access to summaries of curated projects and studies using data innovation to address the demand for policy-relevant analysis through the Data Innovation Directory (DID) (see https://www.migrationdataportal.org/data-innovation).

What are the different types of ‘big data’?

Big data sources that have so far been used in migration-related studies can be grouped under three broad categories (Global Migration Group, 2017): 

  1. Mobile-phone-based data or Mobile Positioning Data (MPD) refer to de-identified call detail records of mobile network operators which contain information about the time and associated cell tower of users’ text messages, calls or other data exchanges. They can provide insights into human movement at the national and regional levels in (almost) real time (IOM, 2023a);
  2. Internet-based data refer to data collected from internet platforms and social media services such as Facebook, Twitter, Instagram, LinkedIn, TikTok and Google. They contain information about likes, posts, comments, shares, searches and the use of hashtags, combined with the positions of the users based on GPS location data, IP addresses and Wi-Fi network locations (IOM, 2023a);
  3. Sensor-based data can come from satellites (e.g., images of the Earth) or from other types of sensors such as ship transponders. They contain information about structures on the surface, such as houses, roads, electricity lines, destroyed infrastructure and emergency camps (IOM, 2023a).

Mobile-phone-based data

Mobile phone data or Mobile Positioning Data (MPD) provide real-time data and are less affected by selectivity issues than internet-based data, making them one of the most useful big data sources. Appropriate safeguards such as pseudonymization, masking and data privacy impact assessments can help prevent privacy breaches connected to research with MPD. However, MPD are more reliable within a country than across borders, as people tend to switch to a local SIM card, especially when undertaking long-term journeys or migration (IOM, 2023a).
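
As an illustration of the pseudonymization and masking safeguards mentioned above, the following minimal Python sketch replaces a subscriber identifier with a keyed hash and masks a phone number. It is an assumption-laden example, not a method described in the sources cited here: the key, the function names and the choice of HMAC-SHA256 are all hypothetical, and pseudonymization on its own does not rule out re-identification from movement patterns.

```python
import hashlib
import hmac

# Hypothetical secret key (assumption): in practice it would be generated
# securely and kept by the data holder, not shared with analysts.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymize(subscriber_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed hash (HMAC-SHA256) of a subscriber identifier.

    The same identifier always maps to the same pseudonym, so records can
    still be linked, but the original identifier is not stored.
    """
    return hmac.new(key, subscriber_id.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_number(phone_number: str, visible: int = 3) -> str:
    """Mask all but the last `visible` characters of a phone number."""
    hidden = max(len(phone_number) - visible, 0)
    return "*" * hidden + phone_number[hidden:]
```

A keyed hash is used here rather than a plain hash so that someone without the key cannot rebuild the mapping by hashing every possible phone number.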

The most common types of MPD are call detail records (CDRs), signalling data and sociodemographic user data. CDRs are records of calls, text messages and data usage stored by mobile network operators for billing purposes. Signalling data are generated by communication between the mobile device and the network to improve signal quality. Sociodemographic data are basic descriptive characteristics of the users, which can be misleading if, for example, the head of the household signs a contract for multiple family members (IOM, 2023a).

MPD come in three forms:

  • Domestic data, or the MPD of subscribers to the network of the mobile network operator within the country;
  • Outbound roaming data, or the MPD of domestic subscribers using the network of a foreign mobile network operator;
  • Inbound roaming data, or the MPD of foreign subscribers using the network of a domestic mobile network operator.
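
The three forms above can be sketched as a simple classification rule. The record layout below (a subscriber's home country and the country of the serving network, both as country codes) is a hypothetical simplification for illustration; real operator data are structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MpdRecord:
    # Hypothetical fields (assumption): ISO country codes.
    subscriber_home_country: str   # country of the subscriber's own operator
    serving_network_country: str   # country of the network actually used


def classify(record: MpdRecord, reporting_country: str) -> str:
    """Classify a record from the point of view of `reporting_country`."""
    is_domestic_subscriber = record.subscriber_home_country == reporting_country
    on_domestic_network = record.serving_network_country == reporting_country

    if is_domestic_subscriber and on_domestic_network:
        return "domestic"
    if is_domestic_subscriber:
        return "outbound roaming"   # domestic subscriber on a foreign network
    if on_domestic_network:
        return "inbound roaming"    # foreign subscriber on a domestic network
    return "out of scope"           # neither subscriber nor network is domestic
```

For example, from Estonia's point of view, `classify(MpdRecord("EE", "FI"), "EE")` describes an Estonian subscriber using a Finnish network and is classified as outbound roaming.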

Mobile phone CDRs have been used to track internal displacement following natural disasters, such as the Nepal earthquake, or the spread of infections, such as COVID-19 (IOM, 2021). For example, in the absence of official statistics, analysing mobile phone CDRs in Nepal helped to understand large population movements following the earthquake that struck the country in 2015, contributing to more timely and effective responses by authorities and relief organizations (IOM, 2018). In 2022, IOM Mongolia and the National Statistical Office of Mongolia initiated a collaboration to explore privacy-aware and anonymous processing of CDRs to improve infrastructure for nomadic people (IOM, 2023c). MPD also enable researchers to compare different types of temporary migrants, including tourists, commuters, transnationals and long-term visitors, and to count hard-to-reach populations. Using CDRs to study temporary mobility between Estonia and Finland, Silm et al. (2021) show that regular cross-border travellers constitute 5% of visitors from Estonia to Finland. Salah et al. (2019) enriched CDRs with tags indicating group membership and initiated the Data for Refugees Challenge to address issues facing the more than 3.5 million Syrians who came to Turkey following the war in the Syrian Arab Republic.

Case study: Using big data to generate mobility statistics in Indonesia (IOM, 2023c)

Indonesia is among the most active explorers of big data sources for National Statistical Systems in the Asia–Pacific region. Since 2016, Statistics Indonesia (BPS) has made use of location-based services (LBS) and CDRs in the production of various types of mobility statistics, including tourism statistics, commuting statistics and statistics of population mobility between city centres and surrounding areas. The expansion of the use of mobile phone data in mobility statistics, supported by their increased coverage, cost-effectiveness, level of detail, timeliness and accuracy, has encouraged the creation of indicators estimating internal migration and transnationalism on the basis of combinations of traditional data sources and mobile phone data.

Case study: Using CDRs for monitoring COVID-19 prevention measures in Israel (World Bank, 2021)

The government of Israel approved emergency regulations in March 2020 to allow individual-level data collected from cell phones to be used to track people and then, through contact tracing, to curtail the spread of COVID-19. Analysis of the cellular data suggested that their use led to the identification of more than one-third of all of the country’s COVID-19 cases in the early weeks of the pandemic (more than 5,500 of the 16,200 people who had contracted the disease), possibly contributing to Israel’s exceptionally low initial rates of COVID-19 infections and deaths.

Internet-based data

Internet-based data come in larger quantities but are the most affected by selectivity, security and data privacy concerns (IOM, 2023a). Geo-located social media activity, such as on Twitter and Facebook, has been used to infer international migration flows and stocks, including disaggregated by age, sex, skill level or sector of occupation, based on user self-reported information. Data from LinkedIn can be used to study migrant workers and their occupational profiles, and changes in job positions can be used to estimate the movement of highly skilled migrants (IOM, 2018).

Case study: Using Twitter to study migration

Several studies have used Twitter data to study migration events, defining country of residence as the country from which a user tweets most often over a period of one year (Kim et al., 2020) and outmigration events as a change in the user's country of residence (Zagheni et al., 2014). Migration events can also be detected when the nationality of the user (identified through the language settings and social connections linked to the account) is different from the country of residence (Kim et al., 2020). While Twitter data are freely available and can be accessed using an application programming interface (API), only a very small percentage of tweets come with geolocations, since these depend on the user opting in to share their exact position (IOM, 2023a).
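
The definitions above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in the cited studies: the input format (user, date, country code) is hypothetical, the "period of one year" is approximated by a calendar year, and an outmigration event is taken to be a change in modal country between two consecutive years.

```python
from collections import Counter
from datetime import date
from typing import Iterable, Optional, Tuple

# Hypothetical input format (assumption): (user_id, tweet_date, country_code),
# one tuple per geolocated tweet.
Tweet = Tuple[str, date, str]


def country_of_residence(tweets: Iterable[Tweet], user_id: str, year: int) -> Optional[str]:
    """Country from which the user tweeted most often during `year`."""
    counts = Counter(
        country for uid, day, country in tweets
        if uid == user_id and day.year == year
    )
    if not counts:
        return None  # no geolocated tweets for this user in this year
    # Ties between countries are not handled specially in this sketch.
    return counts.most_common(1)[0][0]


def outmigration_event(tweets: Iterable[Tweet], user_id: str, year: int) -> Optional[Tuple[str, str]]:
    """Return (origin, destination) if residence changes between `year` and `year + 1`."""
    tweets = list(tweets)  # allow the input to be scanned twice
    origin = country_of_residence(tweets, user_id, year)
    destination = country_of_residence(tweets, user_id, year + 1)
    if origin is None or destination is None or origin == destination:
        return None
    return (origin, destination)
```

Because only a small share of tweets carry geolocations, many users would return `None` here, which illustrates the coverage limitation noted above.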

Case study: Using Facebook to study migration

Facebook data can be used to study movement after a natural disaster or economic crisis, and are a suitable alternative in situations where in-person data collection is difficult. For example, during the COVID-19 pandemic, Facebook’s Disease Prevention Maps provided daily data on population distribution and movement, available for further analysis of the disease outbreak (Maas et al., 2020; IOM, 2021). The Facebook advertising platform, designed for the purposes of targeted marketing, allows advertisers to select users who “used to live in [country x]”, “live abroad” or “recently moved”, the first criterion being the one most widely used by researchers as a proxy for migration history (IOM, 2023a). By collecting data on users living in a country other than their self-reported home country, also referred to as “expats”, Facebook can potentially be used as a ‘real-time census’ to estimate the stock of migrants in a country. The estimate of 273 million “expats” globally provided by Facebook is not far from the UNDESA 2017 estimate of 258 million international migrants (IOM, 2018). However, Facebook does not specify how previous country of residence is inferred or whether its definition of country of residence corresponds with the UN 1998 Recommendations or the revised Recommendations. An issue affecting both Facebook and Twitter is the presence of bots and fake accounts, which account for approximately five per cent of worldwide monthly active users (IOM, 2023a).
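
To make the comparison between the two global estimates concrete, the short Python sketch below computes the gap between the Facebook “expats” figure and the UNDESA 2017 figure, both expressed in millions. The function name is hypothetical, and the calculation is only descriptive: it says nothing about whether the two sources measure the same population.

```python
def relative_difference(estimate: float, benchmark: float) -> float:
    """Difference between an estimate and a benchmark, as a share of the benchmark."""
    return (estimate - benchmark) / benchmark


facebook_expats_millions = 273   # Facebook "expats" estimate cited above
undesa_2017_millions = 258       # UNDESA 2017 international migrant stock, in millions

absolute_gap = facebook_expats_millions - undesa_2017_millions
relative_gap = relative_difference(facebook_expats_millions, undesa_2017_millions)

print(f"Absolute gap: {absolute_gap} million")   # 15 million
print(f"Relative gap: {relative_gap:.1%}")       # 5.8%
```

The gap of roughly 6 per cent is of a similar order of magnitude to the estimated share of bots and fake accounts mentioned above, although the two figures are not directly comparable.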

Sensor-based data

Sensor-based data are particularly useful for assessing migration due to environmental threats and can provide knowledge of migration drivers in certain regions (IOM, 2023a).

Case study: Using satellite imagery for population mapping (IOM, 2023a)

South Sudan’s National Bureau of Statistics produces subnational population projections based on data from population censuses, the most recent of which was conducted in 2008. However, these projections do not account for displacement caused by widespread conflict and annual flooding. The 2019 South Sudan population data set produced by WorldPop (2020) as part of the GRID3 project combined subnational census projections with building footprints from recent satellite imagery, providing grid cell-level (100 m × 100 m) population estimates that account for displacement. The higher spatial resolution of the data provides more detailed information about the numbers and locations of IDPs. See figure below.

[Figure 4]

[Figure 5]
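
The general idea behind the South Sudan case study, combining an area-level population total with building footprints to obtain grid cell-level estimates, can be illustrated with a deliberately simplified Python sketch. This is not the WorldPop or GRID3 method, which the text above does not describe in detail: the proportional allocation rule, the function name and the input format are all assumptions made for illustration.

```python
from typing import List


def allocate_population(admin_total: float, building_counts: List[float]) -> List[float]:
    """Spread an administrative-unit total across grid cells.

    Each cell receives a share of `admin_total` proportional to the number of
    buildings detected in it (an illustrative rule, assumed for this sketch).
    """
    total_buildings = sum(building_counts)
    if total_buildings == 0:
        # No buildings detected anywhere: this sketch has no basis for
        # allocation, so every cell gets zero.
        return [0.0 for _ in building_counts]
    return [admin_total * count / total_buildings for count in building_counts]


# Hypothetical example: 1,000 people projected for a county with three grid cells
# containing 0, 10 and 30 detected buildings.
cells = allocate_population(1000, [0, 10, 30])
print(cells)  # [0.0, 250.0, 750.0]
```

Because recent imagery shows where buildings currently stand, an allocation of this kind shifts estimated population towards newly built-up areas such as displacement sites, which census projections alone would miss.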

What are the strengths of big data?

Big data may be particularly useful to study patterns of temporary or circular migration, which are hard to measure through traditional sources and methods, or to anticipate migration trends. They are particularly useful for studying ‘transnationals’, or people living and working in more than one country, who are usually hard to track through traditional data sources (IOM, 2018). They can also contribute to more timely monitoring of public opinion or media discourse on migration, compared to public opinion surveys, for instance. These data provide real-time observations, which are crucial in situations where traditional data collection systems are disrupted, such as in the case of global health or other kinds of crises (IOM, 2023a). Big data can support disaggregation by specific geographic locations, which facilitates analyses of internal migration and within-country comparisons of migration across administrative units (IOM, 2023a). Another advantage is that such data are generated at no additional cost and can be obtained at a lower cost than data from traditional sources, depending on the willingness of data holders to share data or the insights these can generate (IOM, 2021).

What are the limitations of big data?

The opportunities offered by big data are met by some significant challenges:

  • No regulatory framework: Collaborations with private sector entities such as Mobile Network Operators (MNOs), and regulatory frameworks for safeguarding data and privacy and ensuring secure data sharing, are generally lacking (IOM, 2023a).
  • Privacy and ethical issues: There are confidentiality and ethical issues in using data automatically generated by individuals, often without their informed consent. There are also human rights concerns due to the risks of using such data for surveillance or for purposes other than those for which they were collected, which are particularly serious in contexts of irregular migration and forced displacement. Citizens may be reluctant or sceptical about sharing data about social media and mobile phone usage given the many high-profile incidents of security and privacy breaches (IOM, 2023a). In Israel, the transfer of CDR data raised fundamental concerns about trust, with citizens concerned that their CDR data could be repurposed by government officials for unintended and potentially harmful purposes beyond public health (World Bank, 2021). Developing frameworks defining the dos and don’ts of big data usage in line with ethical, privacy and security standards can contribute to earning public trust (IOM, 2023a).
  • Sample bias: Big data are inherently biased. Users of social media or mobile phones are not necessarily representative of the population at large. Specifically, differences in internet access and in the use of mobile devices and social media platforms by level of economic development, sex, age and urban/rural area are still significant. Also noteworthy is the significant gender digital divide. Globally, some 250 million fewer women than men use the internet. In low-income countries, only one in seven women is online, compared with one in five men. Women are somewhat more likely than men to be challenged by digital literacy issues and to face additional obstacles to being online. For example, in many countries lack of family approval for women owning a cell phone is a major barrier (World Bank, 2021).
  • Social desirability: Self-reported information on social media may not always reflect reality, and the potential presence of fake or duplicate social media accounts needs to be corrected for.
  • Problem of measurement: There are also difficulties in applying the statistical definition of an international migrant, as information on the length of residence in the host country is hard to obtain from big data sources (IOM, 2018).
  • Limited technical capacities: Some of the challenges are due to difficulties in accessing data – held by private or state actors – or using data for research purposes; inappropriate infrastructure and data management and security systems; and methodological difficulties in extracting meaning from huge, complex and “noisy” volumes of data.

Table 1: Strengths and limitations of different data sources

EMDP2C2T1