Merge Data & Dissolve Polygons#

By: Steven Chao


Learning Objectives

  • Import dataframes into Python for analysis

  • Perform basic dataframe column operations

  • Merge dataframes using a unique key

  • Group attributes based on a similar attribute

  • Dissolve vector geometries based on attribute values


Introduction#

Dataframes are widely used in Python for manipulating data. Recall that a dataframe is essentially an Excel spreadsheet (a 2-D table of rows and columns); in fact, many of the functions that you might use in Excel can often be replicated when working with dataframes in Python!

This chapter will introduce you to some of the basic operations that you can perform on dataframes. We will use these basic operations in order to calculate and map poverty rates in the Commonwealth of Virginia. We will pull data from the US Census Bureau’s American Community Survey (ACS) 2019 (see this page for the data).

# Import modules
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
from census import Census
from us import states
import os

Accessing Data#

Import census data#

Let’s begin by accessing and importing census data. Importing census data into Python requires a Census API key. A key can be obtained from Census API Key. It will provide you with a unique 40 digit text string. Please keep track of this number. Store it in a safe place.

# Set API key
c = Census("CENSUS API KEY HERE")
#ignore this, I am just reading in my api key privately
c = Census(os.environ.get('census_api_key'))

With the Census API key set, we will access the census data at the tract level for the Commonwealth of Virginia from the 2019 ACS, specifically the ratio of income to poverty in the past 12 months (C17002_001E, total; C17002_002E, < 0.50; and C17002_003E, 0.50 - 0.99) variables and the total population (B01003_001E) variable. For more information on why these variables are used, refer to the US Census Bureau’s article on how the Census Bureau measures poverty and the list of variables found in ACS.

The census package provides us with some easy convenience methods that allow us to obtain data for a wide variety of geographies. The FIPS code for Virginia is 51, but if needed, we can also use the us library to help us figure out the relevant FIPS code.

# Obtain Census variables from the 2019 ACS at the tract level for the Commonwealth of Virginia (FIPS code: 51)
# C17002_001E: count of ratio of income to poverty in the past 12 months (total)
# C17002_002E: count of ratio of income to poverty in the past 12 months (< 0.50)
# C17002_003E: count of ratio of income to poverty in the past 12 months (0.50 - 0.99)
# B01003_001E: total population
# Sources: https://api.census.gov/data/2019/acs/acs5/variables.html; https://pypi.org/project/census/
va_census = c.acs5.state_county_tract(fields = ('NAME', 'C17002_001E', 'C17002_002E', 'C17002_003E', 'B01003_001E'),
                                      state_fips = states.VA.fips,
                                      county_fips = "*",
                                      tract = "*",
                                      year = 2017)

Now that we have accessed the data and assigned it to a variable, we can read the data into a dataframe using the pandas library.

# Create a dataframe from the census data
va_df = pd.DataFrame(va_census)

# Show the dataframe
print(va_df.head(2))
print('Shape: ', va_df.shape)
                                         NAME  C17002_001E  C17002_002E  \
0     Census Tract 60, Norfolk city, Virginia       3947.0        284.0   
1  Census Tract 65.02, Norfolk city, Virginia       3287.0        383.0   

   C17002_003E  B01003_001E state county   tract  
0        507.0       3947.0    51    710  006000  
1        480.0       3302.0    51    710  006502  
Shape:  (1907, 8)

By showing the dataframe, we can see that there are 1907 rows (therefore 1907 census tracts) and 8 columns.

Import Shapefile#

Let’s also read into Python a shapefile of the Virginia census tracts and reproject it to the UTM Zone 17N projection. (This shapefile can be downloaded on the Census Bureau’s website on the Cartographic Boundary Files page or the TIGER/Line Shapefiles page.)

# Access shapefile of Virginia census tracts
va_tract = gpd.read_file("https://www2.census.gov/geo/tiger/TIGER2019/TRACT/tl_2019_51_tract.zip")

# Reproject shapefile to UTM Zone 17N
# https://spatialreference.org/ref/epsg/wgs-84-utm-zone-17n/
va_tract = va_tract.to_crs(epsg = 32617)

# Print GeoDataFrame of shapefile
print(va_tract.head(2))
print('Shape: ', va_tract.shape)

# Check shapefile projection
print("\nThe shapefile projection is: {}".format(va_tract.crs))
  STATEFP COUNTYFP TRACTCE        GEOID    NAME             NAMELSAD  MTFCC  \
0      51      700  032132  51700032132  321.32  Census Tract 321.32  G5020   
1      51      700  032226  51700032226  322.26  Census Tract 322.26  G5020   

  FUNCSTAT    ALAND  AWATER     INTPTLAT      INTPTLON  \
0        S  2552457       0  +37.1475176  -076.5212499   
1        S  3478916  165945  +37.1625163  -076.5527816   

                                            geometry  
0  POLYGON ((897174.191 4119897.084, 897174.811 4...  
1  POLYGON ((893470.562 4123469.385, 893542.722 4...  
Shape:  (1907, 13)

The shapefile projection is: EPSG:32617

By printing the shapefile, we can see that the shapefile also has 1907 rows (1907 tracts). This number matches with the number of census records that we have on file. Perfect!

Not so fast, though. We have a potential problem: We have the census data, and we have the shapefile of census tracts that correspond with that data, but they are stored in two separate variables (va_df and va_tract respectively)! That makes it a bit difficult to map since these two separate datasets aren’t connected to each other.

Performing Dataframe Operations#

Create new column from old columns to get combined FIPS code#

To solve this problem, we can join the two dataframes together via a field or column that is common to both dataframes, which is referred to as a key.

Looking at the two datasets above, it appears that the GEOID column from va_tract and the state, county, and tract columns combined from va_df could serve as the unique key for joining these two dataframes together. In their current forms, this join will not be successful, as we’ll need to merge the state, county, and tract columns from va_df together to make it parallel to the GEOID column from va_tract. We can simply add the columns together, much like math or the basic operators in Python, and assign the “sum” to a new column.

To create a new column–or call an existing column in a dataframe–we can use indexing with [] and the column name (string). (There is a different way if you want to access columns using the index number; read more about indexing and selecting data in the pandas documentation.)

# Combine state, county, and tract columns together to create a new string and assign to new column
va_df["GEOID"] = va_df["state"] + va_df["county"] + va_df["tract"]

Printing out the first rew rows of the dataframe, we can see that the new column GEOID has been created with the values from the three columns combined.

# Print head of dataframe
va_df.head(2)
NAME C17002_001E C17002_002E C17002_003E B01003_001E state county tract GEOID
0 Census Tract 60, Norfolk city, Virginia 3947.0 284.0 507.0 3947.0 51 710 006000 51710006000
1 Census Tract 65.02, Norfolk city, Virginia 3287.0 383.0 480.0 3302.0 51 710 006502 51710006502

Remove dataframe columns that are no longer needed#

To reduce clutter, we can delete the state, county, and tract columns from va_df since we don’t need them anymore. Remember that when we want to modify a dataframe, we must assign the modified dataframe back to the original variable (or a new one, if preferred). Otherwise, any modifications won’t be saved. An alternative to assigning the dataframe back to the variable is to simply pass inplace = True. For more information, see the pandas help documentation on drop.

# Remove columns
va_df = va_df.drop(columns = ["state", "county", "tract"])

# Show updated dataframe
va_df.head(2)
NAME C17002_001E C17002_002E C17002_003E B01003_001E GEOID
0 Census Tract 60, Norfolk city, Virginia 3947.0 284.0 507.0 3947.0 51710006000
1 Census Tract 65.02, Norfolk city, Virginia 3287.0 383.0 480.0 3302.0 51710006502

Check column data types#

The key in both dataframe must be of the same data type. Let’s check the data type of the GEOID columns in both dataframes. If they aren’t the same, we will have to change the data type of columns to make them the same.

# Check column data types for census data
print("Column data types for census data:\n{}".format(va_df.dtypes))

# Check column data types for census shapefile
print("\nColumn data types for census shapefile:\n{}".format(va_tract.dtypes))

# Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
Column data types for census data:
NAME            object
C17002_001E    float64
C17002_002E    float64
C17002_003E    float64
B01003_001E    float64
GEOID           object
dtype: object

Column data types for census shapefile:
STATEFP       object
COUNTYFP      object
TRACTCE       object
GEOID         object
NAME          object
NAMELSAD      object
MTFCC         object
FUNCSTAT      object
ALAND          int64
AWATER         int64
INTPTLAT      object
INTPTLON      object
geometry    geometry
dtype: object

Looks like the GEOID columns are the same!

Merge dataframes#

Now, we are ready to merge the two dataframes together, using the GEOID columns as the primary key. We can use the merge method in GeoPandas called on the va_tract shapefile dataset.

# Join the attributes of the dataframes together
# Source: https://geopandas.org/docs/user_guide/mergingdata.html
va_merge = va_tract.merge(va_df, on = "GEOID")

# Show result
print(va_merge.head(2))
print('Shape: ', va_merge.shape)
  STATEFP COUNTYFP TRACTCE        GEOID  NAME_x             NAMELSAD  MTFCC  \
0      51      700  032132  51700032132  321.32  Census Tract 321.32  G5020   
1      51      700  032226  51700032226  322.26  Census Tract 322.26  G5020   

  FUNCSTAT    ALAND  AWATER     INTPTLAT      INTPTLON  \
0        S  2552457       0  +37.1475176  -076.5212499   
1        S  3478916  165945  +37.1625163  -076.5527816   

                                            geometry  \
0  POLYGON ((897174.191 4119897.084, 897174.811 4...   
1  POLYGON ((893470.562 4123469.385, 893542.722 4...   

                                             NAME_y  C17002_001E  C17002_002E  \
0  Census Tract 321.32, Newport News city, Virginia       5025.0        161.0   
1  Census Tract 322.26, Newport News city, Virginia       4167.0        736.0   

   C17002_003E  B01003_001E  
0        342.0       5079.0  
1        559.0       4167.0  
Shape:  (1907, 18)

Success! We still have 1907 rows, which means that all rows (or most of them) were successfully matched! Notice how the census data has been added on after the shapefile data in the dataframe.

Some additional notes about joining dataframes:

  • the columns for the key do not need to have the same name.

  • for this join, we had a one-to-one relationship, meaning one attribute in one dataframe matched to one (and only one) attribute in the other dataframe. Joins with a many-to-one, one-to-many, or many-to-many relationship are also possible, but in some cases, they require some special considerations. See this Esri ArcGIS help documentation on joins and relates for more information.

Subset dataframe#

Now that we merged the dataframes together, we can further clean up the dataframe and remove columns that are not needed. Instead of using the drop method, we can simply select the columns we want to keep and create a new dataframe with those selected columns.

# Create new dataframe from select columns
va_poverty_tract = va_merge[["STATEFP", "COUNTYFP", "TRACTCE", "GEOID", "geometry", "C17002_001E", "C17002_002E", "C17002_003E", "B01003_001E"]]

# Show dataframe
print(va_poverty_tract.head(2))
print('Shape: ', va_poverty_tract.shape)
  STATEFP COUNTYFP TRACTCE        GEOID  \
0      51      700  032132  51700032132   
1      51      700  032226  51700032226   

                                            geometry  C17002_001E  \
0  POLYGON ((897174.191 4119897.084, 897174.811 4...       5025.0   
1  POLYGON ((893470.562 4123469.385, 893542.722 4...       4167.0   

   C17002_002E  C17002_003E  B01003_001E  
0        161.0        342.0       5079.0  
1        736.0        559.0       4167.0  
Shape:  (1907, 9)

Notice how the number of columns dropped from 13 to 9.

Dissolve geometries and get summarized statistics to get poverty statistics at the county level#

Next, we will group all the census tracts within the same county (COUNTYFP) and aggregate the poverty and population values for those tracts within the same county. We can use the dissolve function in GeoPandas, which is the spatial version of groupby in pandas. We use dissolve instead of groupby because the former also groups and merges all the geometries (in this case, census tracts) within a given group (in this case, counties).

# Dissolve and group the census tracts within each county and aggregate all the values together
# Source: https://geopandas.org/docs/user_guide/aggregation_with_dissolve.html
va_poverty_county = va_poverty_tract.dissolve(by = 'COUNTYFP', aggfunc = 'sum')

# Show dataframe
print(va_poverty_county.head(2))
print('Shape: ', va_poverty_county.shape)
/home/mmann1123/miniconda3/envs/pygis2/lib/python3.10/site-packages/geopandas/geodataframe.py:1676: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  aggregated_data = data.groupby(**groupby_kwargs).agg(aggfunc)
                                                   geometry  C17002_001E  \
COUNTYFP                                                                   
001       POLYGON ((971814.409 4160130.005, 971664.507 4...      32345.0   
003       POLYGON ((722871.236 4189239.708, 722813.526 4...      97587.0   

          C17002_002E  C17002_003E  B01003_001E  
COUNTYFP                                         
001            2423.0       3993.0      32840.0  
003            5276.0       4305.0     105105.0  
Shape:  (133, 5)

Notice that we got the number of rows down from 1907 to 133.

Perform column math to get poverty rates#

We can estimate the poverty rate by dividing the sum of C17002_002E (ratio of income to poverty in the past 12 months, < 0.50) and C17002_003E (ratio of income to poverty in the past 12 months, 0.50 - 0.99) by B01003_001E (total population).

Side note: Notice that C17002_001E (ratio of income to poverty in the past 12 months, total), which theoretically should count everyone, does not exactly match up with B01003_001E (total population). We’ll disregard this for now since the difference is not too significant.

# Get poverty rate and store values in new column
va_poverty_county["Poverty_Rate"] = (va_poverty_county["C17002_002E"] + va_poverty_county["C17002_003E"]) / va_poverty_county["B01003_001E"] * 100

# Show dataframe
va_poverty_county.head(2)
geometry C17002_001E C17002_002E C17002_003E B01003_001E Poverty_Rate
COUNTYFP
001 POLYGON ((971814.409 4160130.005, 971664.507 4... 32345.0 2423.0 3993.0 32840.0 19.537150
003 POLYGON ((722871.236 4189239.708, 722813.526 4... 97587.0 5276.0 4305.0 105105.0 9.115646

Plotting Results#

Finally, since we have the spatial component connected to our census data, we can plot the results!

# Create subplots
fig, ax = plt.subplots(1, 1, figsize = (20, 10))

# Plot data
# Source: https://geopandas.readthedocs.io/en/latest/docs/user_guide/mapping.html
va_poverty_county.plot(column = "Poverty_Rate",
                       ax = ax,
                       cmap = "RdPu",
                       legend = True)

# Stylize plots
plt.style.use('bmh')

# Set title
ax.set_title('Poverty Rates (%) in Virginia', fontdict = {'fontsize': '25', 'fontweight' : '3'})
Text(0.5, 1.0, 'Poverty Rates (%) in Virginia')
../_images/e_vector_merge_dissolve_28_1.png