Introduction

This blog post looks at different aspects of mental and social health across the United States during the 2020 Covid-19 pandemic. The pandemic has been, and continues to be, a great source of anxiety, depression, and other negative emotions, driven in part by quarantine, social distancing, and a cruel but necessary departure from normal, pre-pandemic life. Examining the different ways the pandemic has affected mental and social health will help us understand and respond to it as a country. Our main takeaway from our Shiny App project was that anxiety and depression are strongly correlated with each other but not necessarily with reduced access to healthcare, so we wanted to examine anxiety and depression further in this blog post. In this post, we examine the severity of anxiety and depression symptoms in different demographic groups; the most common sentiments and words used in tweets about Covid-19; and the spatial patterns of anxiety and depression, social media presence of Covid-19 across counties, stress related to state-mandated shutdowns, and social vulnerability across the contiguous United States.

Data

Dataset 1: CDC Anxiety/Depression Surveys

  • Link: https://www.cdc.gov/nchs/covid19/health-care-access-and-mental-health.htm
  • Description: We used data from the Household Pulse Survey (the “pulse polls”), which the CDC and the U.S. Census Bureau created in the early stages of the pandemic in the US; data collection started on April 23, 2020 and continues to the present. We focused on the “Anxiety and Depression” survey, which asked four questions about the frequency of depressive/anxious emotions in the past seven days.
  • The data was broken down into different categories, including race, gender, age group, and education level. By examining the quantitative values associated with anxiety and depression from the survey, we can explore the ways that depressive/anxious symptoms change across different demographic groups.
  • On their website, the CDC notes that this survey uses data collected in a somewhat unconventional manner, given the short duration between data collection and data analysis.

Dataset 2: Covid-19 Tweets

  • Link: https://www.kaggle.com/gpreda/covid19-tweets
  • Description: This is a dataset from Kaggle called “COVID19 Tweets” that came as a .csv file. It contains all tweets from 07/24/20 through 08/30/20 that used the hashtag #covid19. They were pulled using the Twitter API and a Python script. The dataset contains the following information: user name, location, other variables about the user (e.g., their description, number of followers, and favorites), date of the tweet, full text of the tweet, hashtags, source, and whether or not it is a retweet.

Dataset 3: The New York Times Covid-19 Data

  • Link: https://github.com/nytimes/covid-19-data
  • Description: This dataset was retrieved from The New York Times’ GitHub repository, where they’ve made publicly available the Covid-19 data that they have collected since January 2020. It contains weekly counts of current and probable cases and deaths in each county. We use only the confirmed current cases and have aggregated the time series to monthly values.

Dataset 4: CDC 2018 Social Vulnerability Index

  • Link: https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html
  • Description: The CDC has collected data through census tracts in 2000, 2010, 2014, 2016, and 2018 to investigate the social vulnerability of communities in the U.S. Natural disasters and infectious disease outbreaks, particularly Covid-19, pose threats to community health. Due to factors like socioeconomic status, household composition, minority identity, housing type, transportation limits, and disability, certain communities are more socially vulnerable than others and at particularly high risk from these disasters.
  • We’ve included this dataset in order to have a spatial view of how social vulnerability correlates with mental health factors and Covid-19 infection. The data has all the aforementioned variables on scales decided by the CDC; we use the cumulative value that assigns an overall score of vulnerability to every individual county in the contiguous United States. This data is temporally static, as it is a snapshot of community vulnerability at the time the last batch of census data was released in 2018.

Dataset 5: Social Media Counts During Covid-19

  • Link: https://coronavirus-resources.esri.com/datasets/feb6280d42de4e91b47cf37344a91eae_0?page=4&showData=true
  • Description: This dataset includes counts of social media posts that mention COVID-19 for each included county, though the central United States region suffers from a paucity of data as time progresses. This data is provided per week starting in January, 2020, and is updated through August, but we only use the data starting in April.
  • We’ve included this dataset because social media in our modern era is inextricably entwined with mental health and self-expression. The dataset includes county locations, total posts per week, unique users per week, and sentiment analysis values. For our analysis we only use county locations and total posts per week, which we’ve aggregated by month.
  • We needed both this dataset and Dataset 2. Dataset 2 contains the full tweet text without pre-analyzed sentiment values, which let us perform our own sentiment analysis, while this dataset provides the geographic indicators (county-level, specifically) that we needed to create our spatial visualizations.
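
As a rough illustration of the weekly-to-monthly conversion mentioned for this dataset and for Dataset 3, the base R sketch below totals weekly counts by county and calendar month. The data frame `weekly`, its column names, and its values are hypothetical; it also assumes that counts are summed and that each week is assigned to the month of its start date, neither of which is confirmed by the original analysis.

```r
# Hypothetical weekly post counts for two counties (FIPS codes are placeholders)
weekly <- data.frame(
  fips  = c("00001", "00001", "00001", "00002", "00002"),
  week  = as.Date(c("2020-04-06", "2020-04-13", "2020-05-04",
                    "2020-04-06", "2020-05-11")),
  posts = c(10, 15, 7, 3, 4)
)

# label each week with the calendar month of its start date (an assumption)
weekly$month <- format(weekly$week, "%Y-%m")

# total posts per county and month
monthly <- aggregate(posts ~ fips + month, data = weekly, FUN = sum)
print(monthly)
```

Averaging instead of summing only requires swapping `FUN = sum` for `FUN = mean`.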

Dataset 6: State Actions and Mandates in Response to Covid-19

  • Link: https://www.arcgis.com/home/item.html?id=90a137c9e0494ecab29672bba8cb3ee0&sublayer=0#data
  • Description: This dataset compiles information on how each state initially reacted to the pandemic around the month of March, 2020. The variables include written descriptions of what each state did in regard to curfew, masks, wage changes, non-essential business closures, limits on gathering, emergency declarations, travel restrictions, school closures, and more.
  • We focused on three variables: mask policy, curfew, and wage changes.

Other Datasets

These other datasets were used for the spatial visualizations, which required extra geographic information.

  1. Cities and counties. Link: https://simplemaps.com/data/us-cities
  2. County-level and state-level data from the maps package

Clustering by Demographic (Ava Tillman)

Elbow plots

Note: since rates of anxiety and depression were calculated in the same manner based on survey responses, standardizing the variables was not necessary.
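
For readers unfamiliar with elbow plots, here is a minimal sketch of how one can be produced with k-means in base R. The data frame `rates`, its column names, and all of its values are invented for illustration and are not the survey data; the choice of `nstart = 20` is likewise an assumption.

```r
# Hypothetical rates (percent reporting symptoms) for eight demographic groups;
# these numbers are invented for illustration only.
rates <- data.frame(
  anxiety    = c(42, 40, 38, 30, 29, 27, 20, 18),
  depression = c(36, 35, 31, 25, 24, 22, 15, 14)
)

set.seed(1) # k-means starts from random centers
ks <- 1:5

# total within-cluster sum of squares for each candidate number of clusters
wss <- sapply(ks, function(k) kmeans(rates, centers = k, nstart = 20)$tot.withinss)

# the "elbow" is the k after which adding clusters stops reducing wss by much
plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

Because both columns are on the same percentage scale, the sketch clusters them without standardizing.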

Impact

  1. Are certain groups (teens, Hispanic/“other”, females, and people with less than a high school diploma) responding worse to the pandemic?
  2. Are these the people who are disproportionately affected by Covid-19?
  3. How can we focus on giving people in these demographic groups adequate access to mental health resources?

Limitations

  • Our data on anxiety and depression only spans the duration of the pandemic, so we can’t compare these levels of anxiety and depression to “normal” levels (pre-Covid)
  • We can’t tell if certain groups are experiencing higher levels of anxiety and depression because of the pandemic, or if these are just general trends
  • We could strengthen our research by increasing the scope of the data, for example, by looking at global trends or comparing these anxiety and depression rates to previous months, years, or even decades
  • This data comes specifically from people in the US who chose to take the survey, but it could be extended to a global project to see if different demographic groups in other countries share similar trends

Text Analysis for Covid Tweets (Lauren Simpson)

Sentiment Analysis

We conducted a sentiment analysis on the full text of tweets from July 24 to August 30, 2020 that used the hashtag #covid19. We separated the tweets into unigrams and performed a sentiment analysis using sentiments from the NRC lexicon. We found that the most common sentiment in the tweets was Negative, followed by Positive, Fear, Trust, Anger, Sadness, Disgust, Anticipation, and Joy; the least common sentiment was Surprise. After the general sentiment analysis, we examined the top 10 most common words that belonged to sentiments related to anxiety and depression, specifically Positive, Negative, Fear, and Sadness. Interestingly, quite a few of the top 10 words overlap for Negative, Fear, and Sadness, such as pandemic, death, and hospital. Overall, exploring the most common anxiety- and depression-related words that individuals used in tweets about Covid-19 gives us a new perspective on how individuals were feeling during this time.
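
The unigram-and-lexicon approach described here can be sketched in base R as follows. This is only an illustration under stated assumptions: the two tweets are made up, and `lexicon` is a tiny hypothetical stand-in for the NRC lexicon, which is far larger. The actual analysis may have used different tooling.

```r
# Hypothetical tweets and a tiny stand-in lexicon; like the real NRC lexicon,
# it can map one word to several sentiments.
tweets <- c("The pandemic is a risk to every hospital",
            "Hope the vaccine ends this pandemic")
lexicon <- data.frame(
  word      = c("pandemic", "pandemic", "pandemic", "risk", "hospital", "hope", "vaccine"),
  sentiment = c("negative", "fear", "sadness", "fear", "fear", "positive", "positive")
)

# split each tweet into lowercase unigrams, dropping punctuation and empties
unigrams <- unlist(strsplit(tolower(tweets), "[^a-z0-9#']+"))
unigrams <- data.frame(word = unigrams[unigrams != ""])

# keep only unigrams found in the lexicon (one row per word-sentiment pair)
tagged <- merge(unigrams, lexicon, by = "word")

# sentiments ordered from most to least common
sentiment_counts <- sort(table(tagged$sentiment), decreasing = TRUE)
print(sentiment_counts)
```

Because a word such as "pandemic" carries several sentiments in the lexicon, each occurrence is counted once per sentiment, which is why the same words can top more than one sentiment's list.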

Comparison Cloud

Since we noticed some overlap in the top 10 most common words for anxiety- and depression-related sentiments, we decided to create a comparison word cloud to visually depict words by these four sentiments. Of note, comparison clouds are similar to typical word clouds, but they allow you to visualize the most frequent words across different categories. Additionally, the code for comparison clouds requires the data to be formatted into a term document matrix (tdm), as demonstrated below. The words are colored by the sentiment they primarily belong to, and the size of the words indicates how frequently they were used. We found that pandemic was the most common Sadness word, government and risk were among the most common Fear words, virus was the most common Negative word, and vaccine was the most common Positive word.

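## COMPARISON WORD CLOUD
# packages used below: readr (read_csv), dplyr (%>%, select), wordcloud (clouds)
library(readr)
library(dplyr)
library(wordcloud)
# read in data to create term document matrix for comparison and commonality clouds
# (path_karen is a directory path that must be defined before this line runs)
nrc_tweets_cloud <- read_csv(paste0(path_karen, "text analysis/nrc_tweets_cloud.csv"))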
# create a term document matrix for the comparison.cloud function
tdm <- nrc_tweets_cloud %>%
  # need to take unigrams in word out to create the matrix
  select(-word) %>%
  as.matrix() 
# add the unigrams in word back in once matrix is created
rownames(tdm) <- nrc_tweets_cloud$word
# create the comparison cloud
set.seed(10) # set seed for reproducibility
comparison.cloud(tdm, random.order=FALSE, 
                 colors = c("#9E31CC", "#BD1550", "#5E8C6A", "#3299BB"), 
                 title.size=2,
                 title.colors = c("#9E31CC", "#BD1550", "#5E8C6A", "#3299BB"),
                 title.bg.colors = c("#DEC8E8", "#EDC5D3", "#CCD9CF", "#C6E9F5"),
                 max.words=90)
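
To make the expected shape of the term document matrix concrete, here is a small base R sketch that builds one from word-sentiment pairs. The `tagged` data frame and its contents are hypothetical, and this is not necessarily how the CSV file read in by the code in this section was produced.

```r
# Hypothetical word-sentiment pairs, one row per tagged unigram occurrence
tagged <- data.frame(
  word      = c("pandemic", "pandemic", "virus", "vaccine", "vaccine", "risk"),
  sentiment = c("sadness", "fear", "negative", "positive", "positive", "fear")
)

# cross-tabulate: rows are words (terms), columns are sentiments (documents)
counts <- table(tagged$word, tagged$sentiment)

# convert the table to a plain numeric matrix with word row names
tdm_toy <- matrix(counts, nrow = nrow(counts),
                  dimnames = list(rownames(counts), colnames(counts)))
print(tdm_toy)
```

Each cell holds how often a word was tagged with a sentiment, so a word that appears under several sentiments has several non-zero cells in its row.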

Commonality Cloud

A commonality cloud is a type of word cloud that allows you to visualize the words shared across different categories. As with the comparison cloud, a term document matrix is needed. We only compared words across the fear, negative, and sadness sentiments, as positive did not have enough words in common with the other three to produce a meaningful cloud. Based on the cloud produced, we can see that pandemic is by far the most commonly used word across the fear, negative, and sadness sentiments. Death is the second most common. The remaining words, shown in purple, are other words shared by all three sentiment categories. The large number of purple words demonstrates that numerous words in the Covid-19 tweets span the sentiment categories of fear, negative, and sadness.

## COMMONALITY WORD CLOUD
# create a new term document matrix without positive
tdm_common <- nrc_tweets_cloud %>%
  # need to take unigrams in word out to create the matrix
  select(-word, -positive) %>%
  as.matrix() 
# add the unigrams in word back in once matrix is created
rownames(tdm_common) <- nrc_tweets_cloud$word
# create the commonality cloud
set.seed(2) # set seed for reproducibility
commonality.cloud(tdm_common, random.order = FALSE, 
                  scale = c(5, .5), 
                  colors = c("#9E31CC", "#BD1550", "#5E8C6A", "#3299BB"), 
                  max.words = 200)

Conclusion/Limitations

Overall, the results from the text and sentiment analyses demonstrate that individuals have been using many words associated with sadness, fear, and general negativity in their tweets about Covid-19. These findings are in line with the high levels of anxiety and depression during the pandemic that we observed in the CDC Anxiety and Depression dataset. Additionally, we were able to examine specific words that belonged to each of these sentiments individually, as well as words and themes that spanned across sentiments, most notably, “pandemic.” These findings make a lot of sense, as we know from our own personal experiences that the pandemic has been a great source of sadness, fear, and negative emotions.

Limitations of this text and sentiment analysis include the fact that the data was only collected over a short period of time, from July 24 to August 30, 2020, and the dataset only contained tweets using the hashtag #covid19. This month-long timeframe is a small snapshot of tweets during the pandemic, so the tweets analyzed in this text analysis may not be representative of tweets, sentiments, or most common words utilized during the pandemic as a whole, as they may only be specific to this time period. Additionally, there were likely many other tweets individuals posted about the coronavirus or pandemic without using #covid19, which were not included in our analysis. Thus, our findings may not be generalizable to all tweets about the coronavirus, but rather only tweets in which individuals use #covid19.

Our text analyses could be improved by examining tweets about the pandemic over a longer period of time, for example, a few months, rather than just one. Additionally, we could broaden the scope of hashtags used and examine tweets with other hashtags related to the coronavirus, rather than relying on one, #covid19. Furthermore, in future work it would be interesting to conduct text and sentiment analyses not only on coronavirus related tweets, but also text such as news stories, academic research articles, or even posts from other social media sites to get a more comprehensive view of how individuals are using words to express certain sentiments during this time.

Spatial Data (Karen Liu)

Our spatial visualizations depict the values of certain explanatory variables across the contiguous United States from April 2020 to August 2020, that is, from near the beginning of the Covid-19 pandemic in the U.S. to the time of writing. We’ve chosen to focus on anxiety and depression through time, from our main dataset; social media counts of how many times Covid-19 was mentioned per county, through time; the initial state mandates in response to Covid-19, such as lockdown procedures; and the CDC’s 2018 Social Vulnerability Index. Looking at these variables together should give us an idea of how these social factors may interact with each other during the current crisis. In addition to these variables, the hardest-hit cities are represented by green dots on the map, sized according to the number of cases in that city; this should provide a view of how regions highly impacted by the pandemic are performing with respect to our main variables.

Results

Usage of Data

While using the anxiety and depression, social media, and social vulnerability datasets was rather straightforward (we directly used quantitative variables on predetermined scales), using the state actions dataset was not. The state actions are, of course, not recorded as numbers; the dataset only summarizes every state’s policies in words, so we had to find a way to turn it into something quantifiable for our maps.

We focused on three variables (mask policy, curfew, and wage changes) because we felt that these were generally the most representative of the other variables in the dataset. Each variable was converted to a scale of 1 to 4 based on the policy. For example, for mask policy, the levels were “Recommendation”, “Mandatory (for essential business employees)”, “Mandatory (for essential business employees and patrons while on premises)”, and “Mandatory”. These were assigned values of 1, 2, 3, and 4, respectively, based on their perceived causative power for stress. In the end, a scale of 1-12 indicated the relative “stress” imposed by the state government onto the citizens on the basis of the pandemic.
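
The scoring scheme can be sketched as follows. The mask-policy levels are the ones quoted in this section; the three states, their curfew and wage scores, and the rule of adding the three component scores are assumptions made for illustration (the 12-point maximum suggests a sum, but the original code is not shown).

```r
# Ordered mask-policy levels quoted in the post; position in this vector is the score
mask_levels <- c(
  "Recommendation",
  "Mandatory (for essential business employees)",
  "Mandatory (for essential business employees and patrons while on premises)",
  "Mandatory"
)

# Hypothetical states; curfew and wage scores are given directly as 1-4
# because the post does not list the level names for those variables.
states <- data.frame(
  state  = c("A", "B", "C"),
  mask   = c("Mandatory", "Recommendation",
             "Mandatory (for essential business employees)"),
  curfew = c(4, 1, 2),
  wage   = c(3, 1, 4)
)

# match() returns each policy's position in mask_levels, i.e. its 1-4 score
states$mask_score <- match(states$mask, mask_levels)

# assumed combination rule: add the three component scores
states$stress <- states$mask_score + states$curfew + states$wage
print(states[, c("state", "mask_score", "curfew", "wage", "stress")])
```

Under this assumed rule, a state with a value for all three variables scores between 3 and 12. A policy description missing from `mask_levels` would yield `NA`, which makes unmapped text easy to spot.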

Note that our stress scale does not necessarily correlate with a “good” or “bad” government, and even less so political affiliation. Even though there are certainly political factors that can lead to certain governments making “bad” decisions for public health, population is also an important factor that affects a state’s policies. Even more importantly, in this pandemic certain “bad” decisions are the more relaxing ones while other “bad” decisions are more stressful. Take for example curfew; enforced curfew was assigned a 4 for stress but is considered a “good” government decision, while “no wage extension” was also assigned a 4 for stress but is considered a “bad” government decision.

Impact

The overall conclusion that this map seems to support is that urban areas and socially vulnerable populations are by far the most likely to fall victim to Covid-19 and experience mental health decline. This is interesting because social vulnerability was not necessarily at its highest in the urban coastal areas, though it was certainly higher there than in rural areas. Regardless of the overlap between urbanism and social vulnerability, the relationship between these two groups and health detriments appears to be quite strong as we look at these maps side by side.

Social media does not seem to have a strong relationship with any of the other variables, and if it does, then social media activity’s relationship to urbanism most likely eclipses any other relationships that might otherwise appear. Interestingly, though, mentions of Covid-19 slowly increased from April to August; this is worth noting because it suggests that, rather than people becoming weary of discussing the pandemic, interest and concern steadily mounted over time.

State actions were also tricky to assess because too many underlying variables muddle the correlation between the stress levels associated with state mandates and the other variables. This is discussed further in the Limitations section.

Limitations

  • The most obvious limitation is in the quantification of the state actions dataset. State actions are difficult to quantify fairly, as we would have to weigh each variable’s importance in the production of “stress”. Furthermore, the quantification could probably be improved by incorporating more of the variables, but for this project that was too tricky to code, so we settled on picking the most representative variables.
  • The most important limitation that is applicable to all the analysis is that there are many lurking variables behind our selected variables of interest. Political alignment, population size and density, urbanism, and ruralism all affect the truth behind the relationships that we seem to see in these maps. Political alignment, as discussed before in Usage of Data, heavily affects the state mandate-associated stress values. Population density is a cause of greater numbers of Covid-19 cases and may eclipse other reasons for having high case counts. Urbanism and ruralism are huge lurking variables, as urbanism tends to be associated with greater reports of anxiety and depression, whether this is because of lessened stigma in urban areas or greater stress in fast-moving cities. Urbanism also affects social vulnerability, as easier access to resources draws in vulnerable individuals while higher costs of living exacerbate many communities’ situations. Nonresponse bias could also be a problem in rural areas because rural residents may be less likely or less able to respond.
  • The city values for Covid-19 cases are not actually the values for those cities; rather, those are the values for the county that the city is in, and the city that is plotted was simply picked for being the most representative of that county (having the greatest population). This works well visually since we cannot tell which cities are which on the map, and the points are simply there for us to gauge the general presence of cases. Still, this is a limitation worth noting.
  • In the social media data, more and more counties have missing data as time progresses.
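
The city-for-county substitution described in the third limitation can be sketched in base R. The `cities` and `cases` tables, their column names, and their values are hypothetical; the real inputs would be the city database and the county-level case data.

```r
# Hypothetical city table; names, counties, and populations are invented
cities <- data.frame(
  city        = c("Alpha", "Beta", "Gamma", "Delta"),
  county_fips = c("00001", "00001", "00002", "00002"),
  population  = c(50000, 120000, 8000, 3000)
)
# Hypothetical county-level case counts
cases <- data.frame(county_fips = c("00001", "00002"), cases = c(900, 40))

# sort by population (largest first), then keep the first city in each county
cities <- cities[order(-cities$population), ]
top_city <- cities[!duplicated(cities$county_fips), ]

# attach the county's case count to its representative city
map_points <- merge(top_city, cases, by = "county_fips")
print(map_points)
```

Each plotted point then carries its county's case total, so the dot size reflects the county rather than the city itself.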

Conclusion

  • In addition to a strong correlation between anxiety and depression, levels of both variables seem to be influenced by age, race, gender, and education level. This may be connected to the amount of time certain groups spend on social media, and may be affected by the negative sentiments in Covid-19 tweets.
  • However, levels of anxiety and depression were not compared to levels before the pandemic, so we cannot say for sure if these demographic-specific trends are the result of Covid-related stressors.
  • Individuals have been using many words associated with sadness, fear, and general negativity in their tweets about Covid-19, with the most commonly used word that spanned all three sentiments being “pandemic.”
  • The main limitation of the text and sentiment analysis was that tweets were only collected from July to August, and only tweets with the hashtag #covid19 were used. Future research should examine Covid-related tweets over a longer period of time with a broader range of hashtags or other search criteria in order to get a more representative and accurate sense of the most commonly used words and sentiments in Covid-related tweets.
  • The overall conclusion that the map seems to support is that urban areas and socially vulnerable populations are by far the most likely to fall victim to Covid-19 and experience mental health decline. This is interesting because social vulnerability was not necessarily at its highest in the urban coastal areas, though it was certainly higher there than in rural areas. Regardless of the overlap between urbanism and social vulnerability, the relationship between these two groups and health detriments appears to be quite strong as we look at these maps side by side. Social media does not seem to have a strong relationship with any of the other variables, and if it does, then social media activity’s relationship to urbanism most likely eclipses any other relationships that might otherwise appear. Interestingly, though, mentions of Covid-19 slowly increased from April to August; this is worth noting because it suggests that, rather than people becoming weary of discussing the pandemic, interest and concern steadily mounted over time.
  • State actions are difficult to quantify fairly, as we would have to weigh each variable’s importance in the production of “stress”. Furthermore, the quantification could probably be improved by incorporating more of the variables. It would also be beneficial to look at a different quantification of social media activity. Another very important expansion would be to visualize the level of urbanism across the country.

Bibliography

  1. Centers for Disease Control and Prevention/ Agency for Toxic Substances and Disease Registry/ Geospatial Research, Analysis, and Services Program. CDC Social Vulnerability Index 2018 Database US. https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html.
  2. Centers for Disease Control and Prevention/ National Center for Health Statistics. NCHS Household Pulse Survey, Anxiety and Depression. 2020. https://www.cdc.gov/nchs/covid19/pulse/mental-health.htm
  3. National Governors Association. COVID-19 State and Territory Actions Tracker. 2020. https://coronavirus-resources.esri.com/datasets/NGA2::covid19-state-and-territory-actions-download?showData=true
  4. Preda, G. (2020, August). COVID19 Tweets, Version 24. Retrieved November 17, 2020 from https://www.kaggle.com/gpreda/covid19-tweets.
  5. Simplemaps. United States Cities Database. 2020. https://simplemaps.com/data/us-cities
  6. Spatial.ai & Datastory. COVID-19 Social Media Counts & Sentiment. 2020. https://coronavirus-resources.esri.com/datasets/feb6280d42de4e91b47cf37344a91eae_0?orderBy=current_fl&showData=true
  7. The New York Times. Coronavirus (Covid-19) Data in the United States. 2020. https://github.com/nytimes/covid-19-data