Tracking coronavirus in the headlines

One of the first projects I posted on this site discussed how to scrape headlines from Memeorandum.com, a news aggregation site with an easy-to-use archival feature that makes it possible to browse news stories all the way back to 2004. In that post, I commented on how Memeorandum, which uses an algorithm to find and cluster American news from across the political spectrum, can be thought of as a rich and accessible source for text data. It’s especially useful if you want to take a quick look at how topics traveled through the headlines at a given point in time. For this post, I’m dusting off the code I used in my original scraping project to try to learn something about how the coronavirus has been covered in the American media.

I’m scraping the data from Memeorandum using the same methods I described in my original project write-up. I won’t go into the details again here, but here’s what the data looks like:

library(tidyverse)  # dplyr, tidyr, stringr, ggplot2

data_2020 %>% 
  slice(1:3) %>% 
  knitr::kable()
| date | author | outlet | headline |
|---|---|---|---|
| 2019-12-31 | Abdul-Zahra | Associated Press | iraqi supporters of iran-backed militia attack us embassy — baghdad (ap) — dozens of angry iraqi shiite militia supporters broke into the u.s. embassy compound in baghdad on tuesday after smashing a main door and setting fire to a reception area, prompting tear gas and sounds of gunfire. |
| 2019-12-31 | Serhan | The Atlantic | the problem with a host country failing to protect its embassies |
| 2019-12-31 | Lippman | Politico | trump does new year’s eve his way |

Each observation represents a headline that was on the Memeorandum front page at 5:00 PM ET on a given day. You can check the source page for yourself here. I’m starting with December 31st, 2019, because that’s when the first known cases of coronavirus, also referred to as COVID-19, were reported. I, like many Americans, can only look back with wonder at how remote the threat felt at the time. But Memeorandum lets us take a trip back to those simpler times, when we “only” had to worry about war with Iran.

When did the phrase “coronavirus” first start appearing in Memeorandum headlines? Let’s do a quick search on the terms “COVID,” “coronavirus,” “new virus,” and “Wuhan.”

data_2020 %>% 
  filter(str_detect(headline, "covid|coronavirus|wuhan|new virus")) %>% 
  arrange(date) %>% 
  slice(1:3) %>% 
  knitr::kable()
| date | author | outlet | headline |
|---|---|---|---|
| 2020-01-21 | O’Donnell | The Week | the first case of wuhan virus has reportedly been detected in the u.s. breitbart : brazilian prosecutors charge glenn greenwald with cybercrimes |
| 2020-01-22 | Moritsugu | Associated Press | chinese city stops outbound flights, trains to fight virus — beijing (ap) — chinese state media say the city of wuhan is shutting down outbound flights and trains as the country battles the spread of a new virus that has sickened hundreds and killed 17. — the official xinhua news agency … |
| 2020-01-22 | Evans | National Review | china quarantines wuhan to prevent spread of coronavirus |

It looks like the first mentions appeared in a wave around three weeks after the virus was first reported. Now, this seems a little late to me. Since Memeorandum is designed to aggregate American political news, it’s likely that although COVID-19 was in the headlines before January 21st, the topic simply failed to register in this particular corner of the Internet. So for the rest of this post, bear in mind that we aren’t getting the full picture – this is just a quick look at how coronavirus coverage played out in the data we have available.
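To make the search step concrete, here is a minimal sketch run on a made-up three-row data frame. The `toy` object and its headlines are invented for illustration and are not taken from the scraped data. It assumes the headlines have already been lowercased, as they appear to be in the tables above, which is why the pattern is written in lowercase.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for data_2020; these headlines are invented.
toy <- tibble(
  date = as.Date(c("2020-01-20", "2020-01-21", "2020-01-22")),
  headline = c(
    "senate prepares for impeachment trial",
    "first case of wuhan virus detected",
    "china battles spread of a new virus"
  )
)

# "|" means "or" in a regular expression, so a headline matches if it
# contains any one of the four terms.
pattern <- "covid|coronavirus|wuhan|new virus"

# Keep matching headlines, sort by date, and take the earliest one.
first_mention <- toy %>%
  filter(str_detect(headline, pattern)) %>%
  arrange(date) %>%
  slice(1) %>%
  pull(date)

first_mention
```

If the headlines were not lowercased, a pattern like this would miss "Coronavirus" or "COVID"; wrapping it in `regex(pattern, ignore_case = TRUE)` would be one way around that.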

Below we have a chart that describes how mentions of coronavirus have grown over time. I’ve included mentions of Trump as a sort of baseline – he’s bound to make an appearance in the headlines every day, no matter how severe the crisis (or crises), and since his mentions are relatively consistent we can answer questions like “when did the coronavirus get more coverage than the president?” (It looks like it was around early March.) Lastly, I’ve included mentions of “social distancing” and related phrases, which start to rise just after COVID coverage picked up in earnest in late February.

library(hrbrthemes)  # theme_ipsum()

data_2020 %>% 
  # Count keyword matches within each headline
  mutate(coronavirus_count = str_count(headline, "covid|coronavirus|sars|new virus")) %>% 
  mutate(social_distancing_count = str_count(headline, "social distancing|quarantine|isolation")) %>% 
  mutate(trump_count = str_count(headline, "trump|potus")) %>% 
  # Reshape to one row per headline and metric
  gather(key = metric, value = n, ends_with("count")) %>% 
  group_by(date, metric) %>% 
  summarise(
    n = sum(n, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  mutate(metric = str_remove(metric, "_count$") %>% str_replace_all("_", " ") %>% str_to_title()) %>% 
  ggplot(aes(date, n, color = reorder(metric, -n))) + 
  geom_point(size = .5) + 
  geom_line() + 
  theme_ipsum() + 
  labs(
    x = NULL,
    y = "Mentions",
    color = NULL,
    title = "Headline mentions of Trump and the coronavirus",
    caption = "Data scraped from memeorandum.com"
  )
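The counting step in the code above relies on `str_count()`, which returns the number of matches in each headline rather than a simple yes/no, so a headline that mentions the virus twice counts twice. Here is a small sketch on invented data showing how those per-headline counts roll up into daily totals. The `toy` data frame is hypothetical, and I use `pivot_longer()` here, the newer tidyr replacement for `gather()`, rather than the exact call from the post.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical stand-in for data_2020; these headlines are invented.
toy <- tibble(
  date = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02")),
  headline = c(
    "trump says coronavirus is under control as coronavirus cases rise",
    "trump holds rally",
    "states urge social distancing amid coronavirus fears"
  )
)

daily_counts <- toy %>%
  # One count column per topic; each value is the number of matches
  # found in that headline.
  mutate(
    coronavirus_count = str_count(headline, "covid|coronavirus|sars|new virus"),
    trump_count = str_count(headline, "trump|potus")
  ) %>%
  # Stack the count columns into (metric, n) pairs.
  pivot_longer(ends_with("count"), names_to = "metric", values_to = "n") %>%
  # Total the matches for each topic on each day.
  group_by(date, metric) %>%
  summarise(n = sum(n), .groups = "drop")

daily_counts
```

Because these are match counts and not headline counts, a single long headline can move the daily total by more than one.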

It’s interesting to see an initial spike in coverage in late January – not because there wasn’t any cause for concern then, but because the topic pretty much disappeared from the headlines soon after, only to return a few weeks later when the threat to the U.S. became much harder to ignore. The virus was the dominant topic in March, when the U.S. economy began to shut down and the case count began to spike.

Here’s a quick and dirty function that will plot the most mentioned topics by day for a given month, which lets us see the comings and goings of various topics in the news cycle.

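library(lubridate)  # month()
library(tidytext)   # unnest_tokens()

month_names <- c("January, 2020", "February, 2020", "March, 2020", "April, 2020")
# Plot the single most-mentioned word for each day of a given month (1 = January)
month_topics <- function(month_n) {
  data_2020 %>% 
    filter(month(date) == month_n) %>% 
    # Split headlines into one word per row and drop stopwords
    unnest_tokens(word, headline) %>% 
    filter(!word %in% stopwords::stopwords(), !word %in% c("new")) %>% 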
    count(date, word) %>%
    group_by(date) %>% 
    filter(n == max(n)) %>% 
    slice(1) %>% 
    ungroup() %>% 
    mutate(word = str_to_title(word)) %>% 
    ggplot(aes(date, n, fill = reorder(word, -n))) +
    geom_col() + 
    scale_fill_brewer(palette = "Dark2") +
    theme_ipsum() + 
    labs(
      title = paste0("Most mentioned topics by day: ", month_names[month_n]),
      x = NULL,
      y = "Mentions",
      fill = NULL
    ) 
}
month_topics(1)
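One detail of the function above is worth spelling out: `filter(n == max(n))` can return several words when there is a tie, and `slice(1)` then keeps whichever comes first. Since `count()` returns its rows sorted by date and word, that should be the alphabetically first word. The sketch below shows this on invented data; the `toy` headlines and the tiny `small_stopwords` vector are assumptions for illustration, standing in for the scraped data and the full `stopwords::stopwords()` list.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for data_2020; these headlines are invented.
toy <- tibble(
  date = as.Date(c("2020-02-03", "2020-02-03", "2020-02-04")),
  headline = c(
    "iowa caucus results delayed",
    "iowa democrats blame app for caucus chaos",
    "trump delivers state of the union"
  )
)

# A tiny hand-written stopword list, used here only to keep the
# example self-contained.
small_stopwords <- c("for", "of", "the")

top_words <- toy %>%
  unnest_tokens(word, headline) %>%          # one row per word
  filter(!word %in% small_stopwords) %>%
  count(date, word) %>%                      # sorted by date, then word
  group_by(date) %>%
  filter(n == max(n)) %>%                    # may keep several tied words
  slice(1) %>%                               # tie-break: first row per day
  ungroup()

top_words
```

On the first day "caucus" and "iowa" are tied at two mentions and "caucus" wins alphabetically. On the second day every word appears once, so the "top topic" is simply the first word in sort order. Days with short bars in the charts are best read with that in mind.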

January involved a brief war scare with Iran after the U.S. Embassy in Baghdad was attacked. Trump dominated the rest of the month, in part because the Senate was considering his impeachment at the time.

month_topics(2)

Remember when the Iowa caucus debacle felt like a big problem? Ha. February was again dominated by Trump, although Sanders had his moment in the spotlight after winning the Nevada caucuses. And then the coronavirus started getting a lot more air time, which leads us to March…

month_topics(3)

… Where one of the most eventful weeks in Democratic primary history barely registers on this chart. Not even Trump can seem to get a word in edgewise.

month_topics(4)

And now April, where COVID continues to dominate, though Trump did manage to elbow his way back to the top of the pile for a few days.

Here’s a look at how the top six topics stack up in terms of cumulative mentions over time.

# Find the six most-mentioned words across all headlines
topics <- 
  data_2020 %>% 
  unnest_tokens(word, headline) %>% 
  filter(!word %in% stopwords::stopwords()) %>%
  # Drop filler words that say nothing about the topic
  filter(!word %in% c("news", "new", "u.s", "says")) %>% 
  # Strip possessives (straight or curly apostrophe) so "trump's" counts as "trump"
  mutate(word = str_remove(word, "['’]s$")) %>% 
  count(word, sort = TRUE) %>% 
  slice(1:6) %>% 
  pull(word)

data_2020 %>% 
  arrange(date) %>% 
  mutate(id = row_number()) %>% 
  unnest_tokens(word, headline) %>% 
  filter(word %in% topics) %>% 
  count(date, word) %>% 
  group_by(word) %>% 
  mutate(running_n = cumsum(n)) %>% 
  ungroup() %>% 
  mutate(word = str_to_title(word)) %>% 
  ggplot(aes(date, running_n, color = reorder(word, -running_n))) + 
  geom_line() + 
  scale_y_continuous(labels = scales::comma) + 
  scale_color_brewer(palette = "Dark2") + 
  theme_ipsum() + 
  labs(
    y = "Cumulative mentions",
    x = NULL,
    color = "Topic",
    title = "Topic prevalence over time"
  )
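The running totals in the code above come from grouping by word and applying `cumsum()` within each group. A minimal sketch of that step on invented numbers follows; the `daily` data frame is hypothetical and plays the role of the output of `count(date, word)`.

```r
library(dplyr)

# Hypothetical daily mention counts; the numbers are invented.
daily <- tibble(
  date = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02",
                   "2020-03-03", "2020-03-03")),
  word = c("trump", "coronavirus", "coronavirus", "trump", "coronavirus"),
  n = c(3, 1, 4, 2, 5)
)

running <- daily %>%
  arrange(date) %>%                    # cumsum() depends on row order
  group_by(word) %>%
  mutate(running_n = cumsum(n)) %>%    # running total within each word
  ungroup()

running
```

Note that "trump" has no row for March 2nd in this toy data. A word only gets a row on days when it appears, so its cumulative line simply carries over to the next date on which it shows up.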

As expected, Trump maintains a steady level of coverage throughout the year, whereas the virus quickly accelerates in late February / early March. Mentions of impeachment, after starting out strong, top out after Trump’s acquittal in the Senate. Perhaps surprisingly, Bernie is mentioned more than Biden throughout the course of the year, from his early wins to his eventual concession. It will be interesting to see if Biden’s mentions grow anywhere near Trump’s as we approach the general election, or if his campaign’s current strategy of lying low and avoiding the media will prevail.

There’s more to do with this data, but I’ll leave it here for now. Once we’re past the worst of the pandemic, it would be interesting to look back and re-run this analysis over the full course of the crisis. Here’s hoping that day comes soon.
