Using IPUMS-DHS Contextual Variables

Variables Describing the Environmental and Social Context of Survey Respondents (contextual variables)

By eliminating the need to create linking keys and merge data files, IPUMS-DHS simplifies the analysis of how physical and social contexts shape health and behavior.

Contextual variables describe features of the physical and social environment of a small geographic area (a 5-10 kilometer radius) surrounding the location where a DHS respondent was interviewed (i.e., around the GPS points of survey clusters where samples were drawn). For DHS samples with GPS locations for the survey sampling points (clusters), IPUMS-DHS now includes contextual variables encompassing:

Check back frequently, as we will continue adding more contextual variables in the coming months.

The original DHS files include some variables that are implicitly contextual, such as the classification of where the respondent lives as urban versus rural. The IPUMS-DHS contextual variables allow researchers to study how a wider range of surrounding characteristics may influence health and well-being. For example, researchers have found that exposure to unusually hot days is correlated with higher rates of heart attack and low birthweights, while unusually high or low rainfall influences outmigration.

Researchers may include contextual variables in their customized data file (extract), treating these variables as characteristics of the respondents' environment (just like urban or rural residence). For samples not yet included in IPUMS-DHS, researchers may download a flat file containing environmental and contextual variables (available here) and link it to respondents in an original DHS data file on the basis of sample and cluster number.
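The linkage by sample and cluster number can be sketched in a few lines. This is only an illustration: the field names (sample, cluster, caseid, cropland) and the records are hypothetical and do not reflect the actual layout of the flat file or of any DHS data file.

```python
# Hypothetical contextual records keyed by (sample, cluster number).
contextual = {
    ("sample_a", 1): {"cropland": 0.42},
    ("sample_a", 2): {"cropland": 0.07},
}

# Hypothetical respondent records from an original DHS data file.
respondents = [
    {"sample": "sample_a", "cluster": 1, "caseid": "r1"},
    {"sample": "sample_a", "cluster": 2, "caseid": "r2"},
    {"sample": "sample_a", "cluster": 2, "caseid": "r3"},
    {"sample": "sample_a", "cluster": 9, "caseid": "r4"},  # no contextual match
]


def attach_context(respondents, contextual):
    """Left-join contextual values onto respondents by (sample, cluster)."""
    linked = []
    for person in respondents:
        key = (person["sample"], person["cluster"])
        # Respondents whose cluster has no match keep only their own fields.
        linked.append({**person, **contextual.get(key, {})})
    return linked


linked = attach_context(respondents, contextual)
for record in linked:
    print(record)
```

Every respondent in the same cluster receives the same contextual values, which matches the idea of treating them as characteristics of the shared environment.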

Below, we describe the DHS clusters used in creating the contextual variables, and we explain our overall approach to computing values for the contextual variables. Researchers interested in a specific type of contextual variable can jump to a description of that variable and the methods and source material used to create it, by using the links below.



List of IPUMS-DHS Contextual Variables


IPUMS-DHS Contextual Variables: DHS Clusters

About DHS Clusters

The Demographic and Health Surveys (DHS) Program provides GPS coordinates for clusters: groupings of households that participated in the survey. The GPS data enable the calculation of local environmental statistics from gridded data sources. IPUMS-DHS has created a series of such variables that users can add to their data extracts.

GPS readings are typically collected using GPS receivers and are accurate to within 15 meters. To ensure respondent confidentiality, the GPS coordinates in all surveys are randomly displaced by 0-2 kilometers in urban areas and 0-5 kilometers in rural areas, with a further 1% of rural clusters displaced by up to 10 kilometers. Displaced clusters remain within the same country and DHS survey region. The contextual variables calculated by IPUMS average the values within the radius of displacement. For more detail, please see the documentation of DHS cluster displacement.

IPUMS-DHS obtained DHS cluster shapefiles in World Geodetic System (WGS)-1984 Geographic Datum from The DHS Program website. DHS cluster data consist of attributes such as DHSID, latitude and longitude values, urban/rural label, and altitude. DHSID is the 14-character DHS identification code, which is a concatenation of the 2-character country code, the 4-digit survey year, and the 8-digit cluster identification number. DHSID is available for every IPUMS-DHS sample and uniquely identifies clusters across samples, and it serves as the unique linking key between IPUMS-DHS microdata and DHS cluster shapefiles. Users interested in performing spatial analysis may obtain DHS cluster shapefiles from The DHS Program website.
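As an illustration of the DHSID structure described above, the sketch below assembles and splits a 14-character identifier. The function names are hypothetical, the example values are invented, and zero-padding of the cluster number to 8 digits is an assumption inferred from the fixed field widths.

```python
def build_dhsid(country_code: str, survey_year: int, cluster_number: int) -> str:
    """Concatenate 2-character country code, 4-digit year, 8-digit cluster number."""
    if len(country_code) != 2:
        raise ValueError("country code must have exactly 2 characters")
    # Assumption: the cluster number is left-padded with zeros to 8 digits.
    dhsid = f"{country_code}{survey_year:04d}{cluster_number:08d}"
    if len(dhsid) != 14:
        raise ValueError("components do not fit the 14-character layout")
    return dhsid


def parse_dhsid(dhsid: str):
    """Split a 14-character DHSID into (country code, survey year, cluster number)."""
    if len(dhsid) != 14:
        raise ValueError("DHSID must have exactly 14 characters")
    return dhsid[:2], int(dhsid[2:6]), int(dhsid[6:])


example_id = build_dhsid("KE", 2014, 123)   # illustrative values only
print(example_id)
print(parse_dhsid(example_id))
```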

About Contextual Variable Computation

The statistics for all contextual variables in the list are computed within a buffer around DHS clusters, rather than at their exact location. The main purpose of using a buffer is to minimize the effects of DHS cluster displacement. It is also preferable to calculate environmental statistics over the area surrounding individuals or communities, instead of using the value at a single point location. The size of the buffer, however, varies by variable. A 10 kilometer buffer was used for cropland, pasture area, crop production, harvested area for crops, NDVI, precipitation, malaria incidence, and armed conflict. A 5 kilometer buffer was used for ecoregion, livelihood zone, soil, and population density data. The buffer size for a given variable is the same for all clusters, regardless of whether they are urban or rural, to make the data consistent and comparable across individuals.


IPUMS-DHS Contextual Variables: CROPLAND and PASTURELAND

About CROPLAND and PASTURELAND

CROPLAND and PASTURELAND data represent the proportion of a 10 kilometer circular buffer around each DHS cluster location that is covered by cropland or pasture area, respectively. CROPLAND and PASTURELAND are numeric variables between 0 and 1. Cropland and pasture data were originally obtained from EarthStat, which develops global datasets of croplands and pastures circa 2000 by combining agricultural inventory data and satellite-derived land cover data. Data are available at circa 10 kilometer spatial resolution and consist of the proportion (0 to 1) of each 10 kilometer pixel that is covered by cropland or pasture areas. Original documentation for these data is available here.

Data Production

The original Cropland and Pasture Area raster layers, with circa 10 kilometer spatial resolution and in the World Geodetic System (WGS)-1984 Geographic Datum coordinate system, were acquired from the EarthStat website. We performed all spatial data processing in Esri's ArcGIS software package. We used the Resample tool to create new layers with the finer spatial resolution of 500 meters. We then used the Focal Statistics tool to replace each pixel value (the proportion of cropland or pasture) with the mean proportion of cropland or pasture within a 10 kilometer circular buffer around each 500 meter pixel. Next, we used the Extract Values to Points tool to assign the focal cropland/pasture statistic of the intersecting pixel to each cluster location. If any cluster remained with no value, the previous steps were repeated up to 5 more times until each cluster location received a valid value.
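The focal-mean and extraction steps can be approximated outside ArcGIS as follows. This is a rough pure-Python sketch, not the ArcGIS implementation: the function names and the tiny grid are hypothetical, the radius is given in pixels (a 10 kilometer buffer on 500 meter pixels would be 20 pixels), and pixels outside the grid or without data are simply ignored.

```python
def circular_offsets(radius_cells):
    """Row/column offsets of every cell whose center lies within the radius."""
    r = radius_cells
    return [(dr, dc)
            for dr in range(-r, r + 1)
            for dc in range(-r, r + 1)
            if dr * dr + dc * dc <= r * r]


def focal_mean(grid, radius_cells):
    """Replace each cell with the mean of the valid cells in its circular window."""
    offsets = circular_offsets(radius_cells)
    n_rows, n_cols = len(grid), len(grid[0])
    result = [[None] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        for col in range(n_cols):
            values = [grid[row + dr][col + dc]
                      for dr, dc in offsets
                      if 0 <= row + dr < n_rows and 0 <= col + dc < n_cols
                      and grid[row + dr][col + dc] is not None]
            result[row][col] = sum(values) / len(values) if values else None
    return result


# Toy 3x3 grid of cropland proportions; a radius of 1 pixel keeps it readable.
proportions = [[0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0]]
smoothed = focal_mean(proportions, radius_cells=1)

# "Extract values to points": read the smoothed value at the cluster's pixel.
cluster_cell = (1, 1)  # hypothetical row/column of a cluster location
cluster_value = smoothed[cluster_cell[0]][cluster_cell[1]]
print(cluster_value)
```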

Original Cropland and Pasture Data Citation

Ramankutty, N., A.T. Evan, C. Monfreda, and J.A. Foley (2008), Farming the planet: 1. Geographic distribution of global agricultural lands in the year 2000. Global Biogeochemical Cycles 22, GB1003, doi:10.1029/2007GB002952.

IPUMS-DHS Contextual Variables: "CROP"_H

About "CROP"_H

"CROP"_H reports total harvested area, in hectares, dedicated to a specific crop within a 10 kilometer circular buffer around each DHS cluster location. Data were originally obtained from EarthStat, which provides land use datasets created by combining national, state, and county-level census statistics with a global dataset of croplands on a circa 10 kilometer latitude/longitude grid. The resulting land use datasets for harvested area are available for 175 distinct crops in the year 2000. IPUMS-DHS focuses on a subset of 17 major crops: barley, cassava, cotton, groundnuts, maize, millet, oil palm, potatoes, rapeseed, rice, rye, sorghum, soybeans, sugar beets, sugarcane, sunflowers, and wheat. Original data documentation on these harvested area data is available here.

Data Production

Harvested Area raster layers, with circa 10 kilometer spatial resolution and in the World Geodetic System (WGS)-1984 Geographic Datum coordinate system, were acquired from the EarthStat website for 17 crops. We performed all spatial data processing in Esri's ArcGIS software package. We used the Resample tool to create new layers with the finer spatial resolution of 500 meters. To account for the finer resolution, we divided each pixel's value by 400 (the number of 500 meter pixels we created from each 10 kilometer resolution pixel). We then used the Focal Statistics tool to replace each pixel value (harvested area) with the sum of harvested area values within a 10 kilometer circular buffer around each 500 meter pixel. Next, we used the Extract Values to Points tool to assign the value of the intersecting pixel to each cluster location.
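The division by 400 follows from the resolution change: each 10 kilometer pixel is split into 20 x 20 = 400 pixels of 500 meters, so dividing by 400 keeps the total harvested area unchanged. The sketch below checks this on a toy grid. It assumes each fine pixel inherits the value of its parent pixel (nearest-neighbor resampling), which the text does not state, and the function name and values are hypothetical.

```python
COARSE_PIXEL_M = 10_000
FINE_PIXEL_M = 500

factor = COARSE_PIXEL_M // FINE_PIXEL_M   # 20 fine pixels per side
subpixels = factor * factor               # 400 fine pixels per coarse pixel


def resample_and_rescale(coarse):
    """Split each coarse pixel into factor x factor fine pixels, dividing its total."""
    fine = []
    for row in coarse:
        fine_row = []
        for value in row:
            # Assumption: every fine pixel inherits an equal share of the parent value.
            fine_row.extend([value / subpixels] * factor)
        for _ in range(factor):
            fine.append(list(fine_row))
    return fine


# Toy harvested-area totals (hectares) for four 10 km pixels.
coarse = [[800.0, 0.0],
          [400.0, 1200.0]]
fine = resample_and_rescale(coarse)

total_coarse = sum(sum(row) for row in coarse)
total_fine = sum(sum(row) for row in fine)
print(subpixels, total_coarse, total_fine)
```

Without the rescaling step, a focal sum over the 500 meter grid would count each original pixel's area 400 times.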

Original Crop Data Citation

Monfreda, C., N. Ramankutty, and J.A. Foley (2008). Farming the planet. Part 2: Geographic distribution of crop areas, yields, physiological types, and net primary production in the year 2000. Global Biogeochemical Cycles 22, GB1022, doi:10.1029/2007GB002947.

IPUMS-DHS Contextual Variables: "CROP"_P

About "CROP"_P

"CROP"_P represents total production of a specific crop, in metric tons, within a 10 kilometer circular buffer around each DHS cluster location. Data were originally obtained from EarthStat, which provides a variety of datasets, including the crop production data for 175 crops. EarthStat offers land use datasets created by combining national, state, and county-level census statistics with a global dataset of croplands on a circa 10 kilometer latitude/longitude grid. The resulting land use datasets depict the production of 175 distinct crops of the world in the year 2000. IPUMS-DHS uses a subset of 17 major crops: barley, cassava, cotton, groundnuts, maize, millet, oil palm, potatoes, rapeseed, rice, rye, sorghum, soybeans, sugar beets, sugarcane, sunflowers, and wheat. Original documentation for these crop data is available here.

Data Production

Seventeen Crop Production raster layers, with circa 10 kilometer spatial resolution and in the World Geodetic System (WGS)-1984 Geographic Datum coordinate system, were acquired from the EarthStat website. We performed all spatial data processing in Esri's ArcGIS software package. We used the Resample tool to create new layers with the finer spatial resolution of 500 meters. To account for the finer resolution, we divided each pixel's value by 400 (the number of 500 meter pixels we created from each 10 kilometer resolution pixel). We then used the Focal Statistics tool to update each pixel value for crop production with the sum of crop production values within a 10 kilometer circular buffer around each 500 meter pixel. Next, we used the Extract Values to Points tool to assign the value of the intersecting pixel to each cluster location.

Original Crop Data Citation

Monfreda, C., N. Ramankutty, and J.A. Foley (2008). Farming the planet. Part 2: Geographic distribution of crop areas, yields, physiological types, and net primary production in the year 2000. Global Biogeochemical Cycles 22, GB1022, doi:10.1029/2007GB002947.

IPUMS-DHS Contextual Variable: ECOREGION

About ECOREGION

ECOREGION reports the predominant terrestrial ecoregion within a 5 kilometer circular buffer around each DHS cluster location. Data were originally obtained from the Terrestrial Ecoregions of the World (TEOW) database provided by the World Wildlife Fund (WWF). The ECOREGION dataset specifies the predominant ecoregion type and uses a 5-digit coding system, consisting of the 1-digit realm (ecozone) code, the 2-digit code for the biome (the type of habitat) to which the ecoregion belongs, and the 2-digit specific ecoregion number. See the ECOREGION Appendix for the codes and labels for each of the 825 terrestrial ecoregions. Users can conduct analyses at any of the three nested levels: realm, biome, or specific ecoregion. Original documentation for these ecoregion data is available here.
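Because the three levels are nested inside one code, a small helper can split it for analysis at the realm or biome level. This is a sketch under the digit layout described above; the function name is hypothetical and the example code 30107 is invented for illustration rather than taken from the ECOREGION Appendix.

```python
def split_ecoregion(code):
    """Split a 5-digit ECOREGION code into realm, biome, and ecoregion number."""
    text = str(code)
    if len(text) != 5 or not text.isdigit():
        raise ValueError("expected a 5-digit code")
    return {
        "realm": int(text[0]),        # 1-digit realm (ecozone) code
        "biome": int(text[1:3]),      # 2-digit biome code
        "ecoregion": int(text[3:5]),  # 2-digit specific ecoregion number
    }


parts = split_ecoregion(30107)  # invented example value
print(parts)
```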

Data Production

The TEOW shapefile, in the World Geodetic System (WGS)-1984 Geographic Datum coordinate system, was acquired from the WWF website. We performed all spatial data processing in Esri's ArcGIS software package. We used the Polygon to Raster Conversion tool to create a raster layer with 500 meter spatial resolution. We then used the Focal Statistics tool to replace each pixel value (the ecoregion type) with the predominant ecoregion within a 5 kilometer circular buffer around each 500 meter pixel. Next, we used the Extract Values to Points tool to assign the value of the intersecting pixel to each cluster location. If any cluster remained with no value, the previous steps were repeated until each cluster location received a valid value.
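For categorical layers such as ecoregion, the focal statistic is the most frequent category in the window rather than a mean. The sketch below illustrates that idea in pure Python. It is not the ArcGIS tool: the function name and grid are hypothetical, the radius is in pixels (5 kilometers on 500 meter pixels would be 10), and the tie-breaking rule (first category encountered) is an assumption.

```python
from collections import Counter


def focal_majority(grid, radius_cells):
    """Replace each cell with the most frequent category in its circular window."""
    r = radius_cells
    n_rows, n_cols = len(grid), len(grid[0])
    result = [[None] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        for col in range(n_cols):
            counts = Counter()
            for dr in range(-r, r + 1):
                for dc in range(-r, r + 1):
                    rr, cc = row + dr, col + dc
                    in_circle = dr * dr + dc * dc <= r * r
                    if (in_circle and 0 <= rr < n_rows and 0 <= cc < n_cols
                            and grid[rr][cc] is not None):
                        counts[grid[rr][cc]] += 1
            if counts:
                # Assumption: ties resolve to the category encountered first.
                result[row][col] = counts.most_common(1)[0][0]
    return result


# Toy grid of category codes (1 and 2 stand in for two ecoregions).
zones = [[1, 1, 2],
         [1, 2, 2],
         [2, 2, 2]]
predominant = focal_majority(zones, radius_cells=1)
print(predominant[1][1], predominant[0][0])
```

The same approach would apply to the other categorical variables (livelihood zone and soil type) described below.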

Original Ecoregion Data Citation

Olson, D. M., Dinerstein, E., Wikramanayake, E. D., Burgess, N. D., Powell, G. V. N., Underwood, E. C., D'Amico, J. A., Itoua, I., Strand, H. E., Morrison, J. C., Loucks, C. J., Allnutt, T. F., Ricketts, T. H., Kura, Y., Lamoreux, J. F., Wettengel, W. W., Hedao, P., Kassem, K. R. 2001. Terrestrial ecoregions of the world: a new map of life on Earth. Bioscience 51(11):933-938.

IPUMS-DHS Contextual Variable: LIVELIHOOD

About LIVELIHOOD

LIVELIHOOD represents the predominant livelihood zone (i.e., main economic basis for subsistence) within a 5 kilometer circular buffer around each DHS cluster location. Livelihood zone data were originally obtained from the Famine Early Warning Systems Network (FEWS NET). The livelihood zone code in this dataset consists of a 3-digit ISO country code, followed by the 3-digit livelihood code. Note that the livelihood zones are country-specific, and the livelihood codes and names are not consistent across countries; therefore, LIVELIHOOD is only comparable within countries. For the full list of livelihood zone codes and values, organized by country, see the LIVELIHOOD Appendix. Original documentation for these livelihood zone data is available here.

Data Production

The Livelihood Zones shapefiles, in the World Geodetic System (WGS)-1984 Geographic Datum coordinate system, were acquired from the FEWS NET website. We performed all spatial data processing in Esri's ArcGIS software package. We used the Polygon to Raster Conversion tool to convert each shapefile to a raster layer with 500 meter spatial resolution. We then used the Focal Statistics tool to update each pixel value with the predominant livelihood zone within a 5 kilometer circular buffer. Next, we used the Extract Values to Points tool to assign the focal livelihood zone of the intersecting pixel to each cluster location.

Original Livelihood Zone Data Citation

https://www.fews.net/sectors/livelihoods

IPUMS-DHS Contextual Variables: SOIL

About SOIL

SOIL reports the predominant soil type within a 5 kilometer circular buffer around each DHS cluster location. The SOIL dataset specifies the predominant soil type in each cluster's neighborhood and the top-level class to which the soil belongs. Data were originally obtained from SoilGrids, which is provided by ISRIC (World Soil Information). SoilGrids is an automated soil mapping system based on 150,000 global soil profiles and classes and a stack of 158 soil covariates. The mapping system uses machine learning algorithms to predict standard soil properties at different standard depths. SoilGrids250m represents the world's soil data as a gridded dataset with 250 meter pixels. The global dataset recognizes 118 soil types nested within 30 top-level categories. See the SOIL Appendix for the full list of soil classes and types, together with their codes in IPUMS-DHS. Original documentation for the soil type data, including a description of each soil type, is available here.

Data Production

The SoilGrids raster data, with circa 250 meter spatial resolution and in the World Geodetic System (WGS)-1984 Geographic Datum coordinate system, were acquired from the ISRIC website. We performed all spatial data processing in Esri's ArcGIS software package. We used the Focal Statistics tool to update each pixel value with the predominant soil type within a 5 kilometer circular buffer around each 250 meter pixel. Next, we used the Extract Values to Points tool to assign the focal soil type of the intersecting pixel to each cluster location. If any cluster remained with no value, the previous steps were repeated until each cluster location received a valid value.

Original Soil Data Citation

Hengl, T., Mendes de Jesus, J., Heuvelink, G. B.M., Ruiperez Gonzalez, M., Kilibarda, M. et al. (2017) SoilGrids250m: global gridded soil information based on Machine Learning. PLoS ONE 12(2): e0169748. doi:10.1371/journal.pone.0169748.

IPUMS-DHS Contextual Variables: NDVI

About NDVI

NDVI is a set of 72 numeric variables ranging from -1 to 1 (and from 0 to 1 for the cases included in IPUMS-DHS) measuring live greenness (i.e., greenness of vegetation and land cover) in an area. NDVI reports the maximum NDVI value within a 10 kilometer circular buffer around each DHS cluster for the 60 individual months prior to the survey start date, the month of the survey start date, and the 11 individual months following the survey start date. NDVI data in IPUMS-DHS do not contain any negative values, since the maximum NDVI value is reported.
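The 72-month window (60 months before, the survey start month, and 11 months after) can be enumerated with simple month arithmetic. The sketch below is illustrative only: the function name and the example start date are hypothetical, and it does not reproduce how IPUMS-DHS names the individual variables.

```python
def month_window(start_year, start_month, months_before=60, months_after=11):
    """List (offset, year, month) for each month in the window around the start month."""
    start_index = start_year * 12 + (start_month - 1)
    window = []
    for offset in range(-months_before, months_after + 1):
        year, month_zero_based = divmod(start_index + offset, 12)
        window.append((offset, year, month_zero_based + 1))
    return window


# Hypothetical survey start date: March 2010.
window = month_window(2010, 3)
print(len(window))   # number of monthly values
print(window[0])     # earliest month in the window
print(window[60])    # survey start month (offset 0)
print(window[-1])    # latest month in the window
```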

We also calculated the average value of NDVI maximums for each month over the 2000-2016 timeframe and created the NDVI_"month" variables (e.g., NDVI_JAN).
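A calendar-month average of this kind amounts to grouping the monthly values by month and averaging across years. The following sketch shows the calculation on invented values; the function name and the input structure (a mapping from (year, month) to a monthly maximum) are assumptions made for illustration.

```python
from collections import defaultdict


def monthly_climatology(series):
    """Average values by calendar month across all years present in the series."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (year, month), value in series.items():
        totals[month] += value
        counts[month] += 1
    return {month: totals[month] / counts[month] for month in totals}


# Invented monthly NDVI maxima for one cluster.
ndvi_max = {
    (2000, 1): 0.2,
    (2001, 1): 0.4,
    (2000, 7): 0.8,
    (2001, 7): 0.6,
}
climatology = monthly_climatology(ndvi_max)
print(climatology)
```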

Data were originally obtained from The Moderate Resolution Imaging Spectroradiometer (MODIS), which provides vegetation indices. Data are produced on 16-day intervals and at multiple spatial resolutions, providing consistent spatial and temporal comparisons of vegetation canopy greenness, a composite property of leaf area, chlorophyll and canopy structure. Original data documentation and user guides are available here.

Data Production

The NDVI raster layers, with 250 meter spatial resolution and in the World Geodetic System (WGS)-1984 Geographic Datum coordinate system, were acquired from the MODIS website. We performed all spatial data processing in Esri's ArcGIS software package. We used the Focal Statistics tool to update each pixel value with the maximum NDVI within a 10 kilometer circular buffer around each 250 meter pixel. Next, we used the Extract Values to Points tool to assign the focal NDVI statistic of the intersecting pixel to each cluster location.

Original NDVI Data Citation

K. Didan. (2015). MOD13Q1 MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V006. NASA EOSDIS Land Processes DAAC. https://doi.org/10.5067/modis/mod13q1.006

IPUMS-DHS Contextual Variables: POPDENSITY

About POPDENSITY_"YEAR"

The POPDENSITY_"year" variable reports the population density (persons per square kilometer) within a 5 kilometer circular buffer around each DHS cluster location, in that year. Data were originally obtained from Gridded Population of the World (GPW), v4, which is provided by the Socioeconomic Data and Applications Center (SEDAC). GPW-v4 consists of estimates of human population density based on counts consistent with national censuses and population registers; the estimates for the years 2000, 2005, 2010, 2015, and 2020 are calculated by extrapolating raw census counts. A proportional allocation gridding algorithm, utilizing approximately 12.5 million national and sub-national administrative units, is used to assign population values to circa 1 km grid cells. The population density grids are created by dividing the population count grids by the land area grids. The pixel values in GPW-v4 data represent persons per square kilometer. We computed the mean of all pixel values within a 5 kilometer circular buffer around each cluster location as the estimated population density for that cluster. Original data documentation is available here.
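The two calculations described above (density as count divided by land area, then the mean of pixel densities in the buffer) can be illustrated with a toy example. All values and names are invented, and for simplicity the four pixels are all treated as lying inside the buffer.

```python
# Invented population counts and land areas (square kilometers) for four pixels.
population_count = [[120.0, 80.0],
                    [0.0, 400.0]]
land_area_km2 = [[1.0, 1.0],
                 [0.5, 1.0]]


def density_grid(counts, areas):
    """Persons per square kilometer for each pixel; None where there is no land."""
    return [[(count / area if area > 0 else None)
             for count, area in zip(count_row, area_row)]
            for count_row, area_row in zip(counts, areas)]


def mean_density(density):
    """Mean of the valid pixel densities (here, every pixel is in the buffer)."""
    values = [value for row in density for value in row if value is not None]
    return sum(values) / len(values) if values else None


density = density_grid(population_count, land_area_km2)
cluster_density = mean_density(density)
print(density, cluster_density)
```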

Data Production

The Population Density gridded raster layers for the years 2000, 2005, 2010, 2015, and 2020, with circa 1 km spatial resolution and in the World Geodetic System (WGS)-1984 Geographic Datum coordinate system, were acquired from the SEDAC website. We performed all spatial data processing in Esri's ArcGIS software package. We used the Resample tool to create new raster layers with 500 meter spatial resolution. We then used the Focal Statistics tool to update each pixel value with the mean population density within a 5 kilometer circular buffer around each 500 meter pixel. Next, we used the Extract Values to Points tool to assign the value of the intersecting pixel to each cluster location. If any cluster remained with no value, the previous two steps were repeated until each cluster location received a valid value.

Original Population Density Data Citation

Center for International Earth Science Information Network - CIESIN - Columbia University. 2016. Documentation for the Gridded Population of the World, Version 4 (GPWv4). Palisades NY: NASA Socioeconomic Data and Applications Center (SEDAC). http://dx.doi.org/10.7927/H4D50JX4 Accessed DAY MONTH YEAR.

IPUMS-DHS Contextual Variables: PRECIP

About PRECIP

PRECIP is a set of 72 variables reporting the average precipitation, in millimeters, received within a 10 kilometer circular buffer around each DHS cluster. The variables report precipitation for the 60 individual months prior to the survey start date, the month of the survey start date, and the 11 individual months following the survey start date. Data were originally obtained from the Climate Hazards Group (CHG), which provides the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS). CHIRPS is a 30+ year quasi-global rainfall dataset. Spanning 50°S-50°N (and all longitudes), starting in 1981 and extending to the near-present, CHIRPS incorporates 0.05° resolution satellite imagery with in-situ station data to create gridded rainfall time series for trend analysis and seasonal drought monitoring. Original data documentation and user guides are available here.

We also calculated the average value of precipitation for each month over the 2000-2016 timeframe to create the PRECIP_"month" variables (e.g., PRECIP_JAN).

Data Production

The precipitation raster layers, with circa 5 kilometer spatial resolution and in the World Geodetic System (WGS)-1984 Geographic Datum coordinate system, were acquired from the CHIRPS website. We performed all spatial data processing in Esri's ArcGIS software package. We used the Resample tool to create new layers with spatial resolution of 500 meters, then used the Focal Statistics tool to update each pixel value with the mean precipitation within a 10 kilometer circular buffer around each 500 meter pixel. Next, we used the Extract Values to Points tool to assign the focal precipitation statistic of the intersecting pixel to each cluster location.

Original Precipitation Data Citation

Funk, Chris, Pete Peterson, Martin Landsfeld, Diego Pedreros, James Verdin, Shraddhanand Shukla, Gregory Husak, James Rowland, Laura Harrison, Andrew Hoell & Joel Michaelsen. 2015. "The climate hazards infrared precipitation with stations - a new environmental record for monitoring extremes". Scientific Data 2, 150066. doi:10.1038/sdata.2015.66.

IPUMS-DHS Contextual Variables: MALARIA

About MALARIA

MALARIA is a set of 16 separate variables, covering the years 2000 to 2015. These variables report the mean clinical Plasmodium falciparum malaria incidence (expressed as a fraction of cases per person) within a 10 kilometer circular buffer around each DHS cluster location, by year. MALARIA variables range from 0 to 1. Data were originally obtained from The Malaria Atlas Project (MAP), where users can find original data documentation and user guides.

Data Production

The malaria incidence raster layers, with circa 5 kilometer spatial resolution in the World Geodetic System (WGS)-1984 Geographic Datum coordinate system, were acquired from the MAP website. We performed all spatial data processing in Esri's ArcGIS software package. We used the Resample tool to create new layers with spatial resolution of 500 meters, then used the Focal Statistics tool to update each pixel value with the mean malaria incidence per person within a 10 kilometer circular buffer around each 500 meter pixel. Finally, we used the Extract Values to Points tool to assign the focal malaria statistics of the intersecting pixel to each cluster location.

Original Malaria Data Citation

Bhatt, S., Weiss, D.J.W., Cameron, E., Bisanzio, D., Mappin, B., Dalrymple, U., Battle, K.E., Moyes, C.L., Henry, A., Eckhoff, P.A., Wenger, E.A., Briet, O., Penny, M.A., Smith, T.A., Bennett, A., Yukich, J., Eisele, T.P., Griffin, J.T., Fergus, C.A., Lynch, M., Lindgren, F., Cohen, J.M., Murray, C.L.J., Smith, D.L., Hay, S.I., et al. (2015). The effect of malaria control on Plasmodium falciparum in Africa between 2000 and 2015. Nature, 526(7572): 207-211. DOI:http://www.nature.com/doifinder/10.1038/nature15535

IPUMS-DHS Contextual Variables: CONFLICT VARIABLES

About CONFLICT VARIABLES

Conflict data report the number of armed conflicts, by conflict type, within a 10 kilometer circular buffer around each DHS cluster location, in each year from 1997 to 2016. Data were originally obtained from the Armed Conflict Location and Event Data Project (ACLED). ACLED provides political violence and protest data for over 60 countries. IPUMS-DHS created annual counts of three types of armed conflicts: battles (all types), riots or protests, and violence against civilians.

Definitions for these types of armed conflicts are available in ACLED's codebook.

Data Production

The Armed Conflict data were acquired from the ACLED website. We performed all spatial data processing in Esri's ArcGIS software package. We used the Make XY Event Layer tool to create a point layer based on the XY coordinates available in the tables. We used the Buffer tool to create a 10 kilometer circular buffer around each cluster. Next, we used the Spatial Join tool to match each cluster with the event(s) that occurred within the buffer around that cluster to create counts. As per ACLED recommendations, only events that occurred within the same country as the DHS sample are included in the counts. We created three annual time series: battles (all types), riots or protests, and violence against civilians.
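The buffer-and-join logic can be approximated with a distance test. The sketch below counts events within 10 kilometers of a cluster, restricted to the same country, by year and event type. It is an approximation, not the ArcGIS workflow: great-circle (haversine) distance stands in for the Buffer and Spatial Join tools, and all field names, country codes, and coordinates are invented.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_events(cluster, events, buffer_km=10.0):
    """Count same-country events within the buffer, keyed by (year, event type)."""
    counts = Counter()
    for event in events:
        if event["country"] != cluster["country"]:
            continue  # only events in the same country as the sample are counted
        distance = haversine_km(cluster["lat"], cluster["lon"],
                                event["lat"], event["lon"])
        if distance <= buffer_km:
            counts[(event["year"], event["type"])] += 1
    return counts


# Invented cluster and events.
cluster = {"country": "AA", "lat": 0.0, "lon": 30.0}
events = [
    {"country": "AA", "lat": 0.05, "lon": 30.0, "year": 2010, "type": "battles"},
    {"country": "AA", "lat": 0.20, "lon": 30.0, "year": 2010, "type": "battles"},
    {"country": "BB", "lat": 0.0, "lon": 30.05, "year": 2010, "type": "battles"},
    {"country": "AA", "lat": 0.0, "lon": 30.03, "year": 2011, "type": "riots or protests"},
]
conflict_counts = count_events(cluster, events)
print(dict(conflict_counts))
```

In this toy example the second event falls outside the buffer (roughly 22 kilometers away) and the third is excluded because it lies in a different country.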

Original Conflict Data Citation

Raleigh, Clionadh, Andrew Linke, Havard Hegre and Joakim Karlsen. 2010. Introducing ACLED-Armed Conflict Location and Event Data. Journal of Peace Research 47(5) 651-660.

About CSV Files

Contextual Variable CSV Files

Static .csv files are available for all contextual variables for all DHS samples with GPS data available before January 2017. This includes samples that are not yet available in IPUMS-DHS. In addition, monthly time series data are available for NDVI (2000-2016) and precipitation (1981-2016).