US Metro Areas: Using ‘tidycensus’ to Answer Social Justice Questions
Background and Motivation
Social justice is all about human interaction. How do people interact? What privileges and biases are woven into the ways people interact? How is wealth and opportunity distributed amongst individuals in society? Asking social justice questions directly relates to asking questions about people. Therefore, with high concentrations of people, cities are the perfect unit of analysis for social justice topics. With people living in close proximity, cities are where people, cultures, backgrounds, biases, and privileges collide.
With this line of thinking in mind, we turned to American cities and metro areas as the main unit of analysis for our PUG midterm/final project. Originally, we were curious about quantifying the wellbeing of a city. Is it possible to measure the wellbeing of a city? What factors contribute to wellbeing? From wealth to health, we quickly found that answering these questions is a complex task that requires considering an infinite amount of contributing variables. Because of this, we narrowed our definition of wellbeing. As a group, we decided to focus on change in population as a holistic measure of a city’s wellbeing. We reasoned that increasing population would be an indicator of positive growth/wellbeing and decreasing population would be an indicator of negative growth/wellbeing. Thus, our question of interest evolved from ‘what factors contribute to the wellbeing of American cities?” to “what factors contribute to population growth in American cities?”
What factors contribute to population growth in American Cities?
To address this question, we chose to create a Shiny App with an interactive linear regression model and leaflet map. The map allows users to spatially visualize the variation in our variables of interest and observe their change over time. One of the main takeaways from our midterm project was the observed relationships between city regions and the concentration of high/low values of our variables. Thus, our leaflet map serves as a way to further explore these relationships and presents them in a manner that is more understandable for the viewer. Although not specifically connected to our goal of finding factors that contribute to population growth, the map serves as a nice way to familiarize users with our variables and visualize the relationships present in the scatter plot/linear regression model. In the model and scatter plot section of our Shiny App, the user has the option to select a variable of interest and then see if the variable IS or IS NOT a significant predictor of change in population.
Although with many limitations, the results of our findings and our related conclusions tell a story about how the major metropolitan areas of America have changed over time, and which variables correlate–and perhaps even causate–to population growth or decay in these metro areas.
Data
TidyCensus
We returned to the ‘tidycensus’ package to collect data for our analysis of population growth in American cities. Tidycensus is an R package that allows users to access data from the US Census Bureau’s decennial Census and American Community Survey. More information about the basics of working with the ‘tidycensus’ package can be found here. We referred to this resource often throughout our project to learn about the basic functions and capabilities of tidycensus.
For our midterm project, we pulled 2018 American Community Survey data for US metropolitan areas (check out our midterm Shiny App here!). With out final project’s focus on analyzing change over time, we needed to pull data from years from ‘tidycensus’.
As we started compiling our new dataset, we quickly found that American Community Survey data was only available for metro areas via ‘tidycensus’ from 2009 to 2018. Because of this limitation, we chose to focus on this specific time range for our project. The next step in our data collection process was to select and compile variables of interest for all of our different years and cities. Variables are encoded within tidycensus (i.e B01003_001 = population). Thus, in order to access data using the get_acs function, one must know the variable’s specific code. Very quickly, we found that some of the variables we used in our midterm project had different codes for different years, complicating our data collection process. Therefore, a large part of this section of our project became filtering through the ACS 2009 - 2018 data to find variables/variable codes that were consistent for all of the years. Eventually we compiled the following list:
- population = “B01003_001”
- median_income = “B19326_001”
- foreign_born = “B05002_013”
- median_value = “B25077_001”
- married_households = “B11001_003”
- total_households = “B11001_001”
- houses_for_sale = “B25004_004”
- pop_white_alone = “B02001_002”
- pop_black_alone = “B02001_003”
- median_age_female = “B01002_003”
- median_age_male = “B01002_002”
- under_poverty_line = “B10059_002”
- bachelors_degree = “B15003_022”
After collecting these variable codes, we used a for loop to compile the data from all of the different years. Using a for loop was a really convenient way for us to avoid having to individually compile and join datasets for each year. Although difficult to understand out of context, here is a snippet of our for loop code to get a sense of the process of obtaining data via tidycensus:
for (i in 2013:2017){
  df <- get_acs(geography = geo, variables = vars, year = i)%>%
        pivot_wider(id_cols = NAME, 
                    names_prefix = paste0("y", i, "_"), 
                    names_from = variable, 
                    values_from = estimate)
  
  cities <- cities %>% left_join(df,
            by = "NAME")
}Lastly, because one of our end goals of this project was to create leaflet map, we needed to find the latitude and longitiude coordinates for our US metro areas. To accomplish this, we joined our dataset with the mdsr WorldCities dataset. We had to hardcode coordinates for San Antonio, San Juan, Virginia Beach, and McAllen because they were not in the WorldCities dataset. Later, while we were creating the map, we realized some of the coordinates were wrong because we had joined the mdsr dataset by city name and some cities have the same name (i.e Portland, Oregon was showing up in Portland, Maine). We hardcoded to correct the instances with this issue.
Shiny App
With our dataset compiled, we then created a shiny web application containing a Leaflet map and linear regression model.
Explore our Shiny App below, or click here to open it in a new tab!
Analysis and Findings
Our map and model are filled with all sorts of interesting relationships and conclusions.
Map
One of the most interesting relationships from the map portion of the Shiny App is shown below. It displays the median male age of each city in 2018. As the map displays, cities in Florida tend to have a higher median male age. The user can enter the Shiny application at the top of the page to experiment with different variables and years, as well as view specific selected values for individual cities.
A key piece of insight from using spatial data is how it opens up an opportunity to explore this data on different levels - as seen with the Florida example, there are patterns to be found on a state level, but quick comparisons of city data can be done with the map as is. Just as any other map, visualizations of spatial data allow for easier navigation. The model, which will follow shortly, provides further insight.
Model
There can probably never be an objective indicator of well-being, but we believed a city’s change in population could be considered a good measurement - the idea of this being that cities with growing populations must have some beneficial factor in a person’s life, thus justifying people moving in (or, at the very least, not moving out). Inside the Shiny app, a scatterplot was created comparing the predictor variables - which the user may select accordingly - with the percent change in population. Included below is the scatterplot of Percent of Married-Couple Households and Percent Population Change (also displaying the population, marked by the size of every dot), along with the fitting and analysis of its model.
## 
## Call:
## lm(formula = pct_pop_change ~ pct_married_hh, data = model_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.078899 -0.023230 -0.003886  0.022852  0.106304 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.6914     0.4123  -6.527 2.97e-09 ***
## pct_married_hh   2.9895     0.4501   6.641 1.75e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03337 on 98 degrees of freedom
## Multiple R-squared:  0.3104, Adjusted R-squared:  0.3033 
## F-statistic: 44.11 on 1 and 98 DF,  p-value: 1.747e-09With a p-value extremely close to zero, we may reasonably conclude that the percentage of married-couple households is a significant predictor for percent change in population. According to our model, an increase of 0.01 in the percentage of married-couple households will result, on average, in a 0.03 percent increase in the percent change in population (these percent increases are arithmetic, not geometric, e.g. .92 + 0.01 not .92 * 1.01). The correlation of determination (0.3104, 0.3033 adjusted) suggests that 31.04% of the variation of the percent change in population can be explained by this predictor - quite impressive for a single variable! From our analysis, this is the best single predictor for percent change in population in our dataset. Maybe more people are attracted to a city if there is a larger number of married couples, implying more families? We will clarify that we cannot conclude an increase in married-couple households causes an increase in population growth (as this is not an experiment).
A few other variables were shown to be significant: the Median Age, both Male and Female, Percentage of the Population that is Foreign-Born, Number of Vehicles per Household, and the Percentage of the Population with a Bachelor’s Degree. These, however, did not account for as much variability as the Percentage of Married-Couple Households (the highest other than the latter being Number of Vehicles, with an r-squared of about 0.08). The scatterplot and model output is included within the shiny app.
One thing to note is that, although the Shiny app gives the option to choose which regions of the United States to include in the scatterplot, it does not calculate a model for any combination other than the entire set of cities. Having to fit a model for any and all combinations of these four regions would result in 20 models per each of the 14 variables, or 280 linear models, far beyond the scope of this project. This is worth mentioning, however, as it is more than possible that the aforementioned predictors - or others - will account for more variability if the data were to be broken down by region. Further limitations of our project are included below.
Limitations and Conclusion
Probably the biggest limitation of our work is the small number of years for which we were able to gather data. Data from the American Community Survey is available via tidycensus from 2009 to 2018, but many areas were named differently by the ACS from 2009 to 2012. This presented a big enough obstacle to joining the data from these years that we decided to only include years 2013 to 2018.
Even if we had compiled data from all ten years of the ACS, though, this would still only provide an incomplete picture of the way US Metropolitan Areas have changed over time. Ideally, we would be able to go back several decades and connect patterns in the data with longer-term historical developments (this would require some qualitative background research, which was also beyond the scope of this project). Perhaps in a future Shiny app, we could include data going back into the 20th century and let the user specify the start and end years of a time period they want to look at.
Our findings are also not necessarily generalizable. We only looked at the one hundred most populous metro areas in the country, which are probably not representative of all of urban and suburban America. A lot of areas on this list might have similar characteristics to say, the next 100 most populated cities and towns, but it is likely that the top cities like New York, Los Angeles and Chicago skew the data. In any case, the map we’ve created has a lot of empty space since most cities are concentrated on the coasts; states like Montana, North and South Dakota, Maine, West Virginia, and Alaska have no data, making it hard to draw conclusions about the spatial distribution of variables across the country.
Finally, our model does an imperfect job of answering our main question: what factors contribute to cities’ wellbeing? This component of our Shiny app uses population growth from 2013 to 2018 as a proxy for wellbeing, when this is obviously not the ultimate indicator of how successful a city is or how high a quality of life its residents have. And even if population growth were a perfect measure, the cities that grew the most from 2013 to 2018 are not necessarily the ones that are growing the fastest right now or even the ones that have grown the most over the past twenty or fifty years. Our question, of course, is incredibly complex, and the answer probably involves an interaction between a multitude of variables. Still, the interactivity of our Shiny app presents the user with a number of thought-provoking patterns and relationships, which, hopefully, will lead to more good questions.
Overall, although our project and conclusions have many limitations, the findings displayed in the Shiny App point to interesting relationships that might be interesting for future, more specific, statistical analyses of US metro areas and wellbeing.
Sources
- Kyle Walker and Matt Herman (2020). tidycensus: Load US Census Boundary and Attribute Data as ‘tidyverse’ and ‘sf’-Ready Data Frames. R package version 0.10.2. https://CRAN.R-project.org/package=tidycensus 
- Ben Baumer, Nicholas Horton and Daniel Kaplan (2019). mdsr: Complement to ‘Modern Data Science with R’. R package version 0.1.7. https://CRAN.R-project.org/package=mdsr 
- Joe Cheng, Bhaskar Karambelkar and Yihui Xie (2019). leaflet: Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library. R package version 2.0.3. https://CRAN.R-project.org/package=leaflet 
- Winston Chang, Joe Cheng, JJ Allaire, Yihui Xie and Jonathan McPherson (2020). shiny: Web Application Framework for R. R package version 1.5.0. https://CRAN.R-project.org/package=shiny 
- Original S code by Richard A. Becker, Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka and Alex Deckmyn. (2018). maps: Draw Geographical Maps. R package version 3.3.0. https://CRAN.R-project.org/package=maps