Scraping the news

Founded by Gabe Rivera in 2004, Memeorandum.com is a fully automated article aggregator for American political news. Rivera’s black box algorithm determines the “importance” of a given article by some combination of its popularity and age, and then clusters articles by topic.

For an aggregator that’s never touched by human editors, the results are nothing short of incredible. Since I discovered Memeorandum on Twitter a little over a year ago, it has become my go-to source for political news, whether I’m catching up on the past 24 hours or following breaking updates on the moment’s top stories.

Soon after discovering it, I also realized that it was a potentially valuable and untapped source of text data, one that might tell us something about political news cycles. While I won’t be mounting any serious inquiries here (for the moment, at least), I think Memeorandum is worth a deeper examination for a few reasons:

  • Memeorandum’s algorithm does a good job of collecting stories from both major news outlets and one-person blogs. It also doesn’t seem to have any partisan blind spots — for a given topic, you’re likely to see The New York Times, The Daily Caller, and Thinkprogress gathered side by side, each offering different takes on the same story.

  • The home page is refreshed every five minutes, and older versions are immediately made available through the site’s “Archives” feature. Simply type in a date and time and you’ll be shown exactly what the site looked like in that moment, nearly down to the minute.

  • Memeorandum has been running continuously since April 1st, 2004. That means the archives contain nearly fifteen years’ worth of in-the-moment political news, all organized according to importance and recency.

  • In most cases, even links dating back to 2004 still work, making Memeorandum a useful tool for navigating political news cycles. I looked to see what the headlines looked like on my 9th birthday, for instance, and was able to read archived pieces on George Bush’s then record-low 45% approval rating and John Edwards’ adventures in podcasting.

  • Memeorandum is pretty lo-fi, and admittedly kind of ugly. I mention this as a feature only because this makes it a bit easier to scrape text data from the site, which is the focus of this project.

For researchers seeking text data to enrich their understanding of the news and media, Memeorandum is perhaps a bit superficial, as it only displays headlines and sometimes a few sentences from an article’s lede. That said, while there are other ways to access old news articles, none that I know of can match the ease and efficiency of Memeorandum’s system, which — as we’ll soon see — also lends itself well to automated scraping. If you’re looking for something quick and dirty, either for analysis or to stay up to date on the news, Memeorandum is truly hard to beat.

Scraping the data

Given any date and time since March 2006 (when Memeorandum’s layout was updated to its current form), the following code will scrape, parse, and rectangle the contents of the homepage into a tidy form. Because I haven’t familiarized myself with rvest or any other modern scraping packages since first attempting a version of this project last year, I’ll be relying partly on the bespoke htmlToText.R script written by Tony Breyal.

The actual scraping isn’t run here; instead, the code used to scrape and parse the site’s contents over the past 90 days is shown, and the resulting data are read in from a previously saved file:

library(tidyverse)
library(lubridate)
library(tidytext)
library(igraph)
library(ggraph)
library(stringi)
source('~/Desktop/Summer Projects/htmlToText.R')
source('~/Desktop/Summer Projects/customize_script.R')

Memeorandum’s archival system follows a convenient pattern, e.g. https://www.memeorandum.com/060301/h1700 is the archived site for (20)06/03/01 17:00. url_date simply translates any date to a working url, with a default time of 5:00 PM. I use a “slow” version of the htmlToText() scraper because I don’t want to cause any trouble with Memeorandum’s servers. When scraping, it’s always best to be modest with your scripts, as I’ve learned after being blacklisted twice (I’m sorry, Memeorandum!).

url_date <- function(date, time = "17:00") {
  # archive urls follow the pattern /yymmdd/hHHMM, e.g. /060301/h1700
  date <- format(date, format = "%y%m%d") %>%
    str_replace_all("/|-", "")
  time <- str_replace(time, ":", "")
  paste0('http://www.memeorandum.com/', date, '/h', time)
}

slow_html_to_text <- function(url) {
  # be polite to Memeorandum's servers: pause between requests
  Sys.sleep(time = 5)
  htmlToText(url)
}
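
As a quick sanity check (not part of the original pipeline), url_date() should reproduce the archive url pattern described above; lubridate’s ymd() is already loaded:

url_date(ymd("2006-03-01"))
# [1] "http://www.memeorandum.com/060301/h1700"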

The following function takes care of the scraping, parsing, and rectangling all in one, generating a tibble with date, author, outlet, and headline columns. For bigger jobs, it would make sense to separate these steps in case anything breaks, but I haven’t had trouble with queries covering fewer than 100 days.

scrape_date <- function(date, ...) {
  text <- 
    url_date(date, ...) %>% 
    slow_html_to_text() %>%
    # collapse whitespace, then strip the site's navigation and boilerplate text
    str_replace_all("\\s+", " ") %>% 
    str_replace_all(
      "(About memeorandum:).*(memeorandum on Twitter More Items: )", ""
    ) %>% 
    str_replace_all("(From Mediagazer:).*", "") %>% 
    str_replace_all("(Sister Sites: Techmeme).*", "") %>% 
    str_replace_all(
      ".*(Refer to this page to reenable cookies. Top Items: )", ""
    ) %>% 
    str_replace_all("(Discussion: ).*?(Discussion:)", "") %>% 
    str_replace_all("(From Techmeme: ).*", "") %>% 
    stri_replace_all_fixed(
      "» All Related Discussion « Hide All Related Discussion ", ""
    ) %>% 
    str_replace_all("(Top Items: )", "") %>% 
    stri_replace_all_fixed("RELATED: ", "") %>% 
    stri_replace_all_fixed("Discussion: ", "") %>% 
    stri_replace_all_fixed("+", "")
  
  tibble(
    date = date %>% as_date(),
    byline = 
      text %>% 
      str_extract_all("(\\S+\\s+)\\/\\s(?<=\\s\\/\\s)[^:|+]*[^:|\\s]*") %>% 
      map(~ str_trim(., side = "both")) %>% 
      flatten_chr(),
    headline =
      text %>% 
      str_split("(\\S+\\s+)(\\S+\\s+)\\/\\s(?<=\\s\\/\\s)[^:|+]*[^:|\\s]*") %>%
      map(~ str_replace(., ": ", "")) %>% 
      map(~ str_trim(., side = "both")) %>% 
      flatten_chr() %>% 
      as_tibble() %>% 
      slice(-1) %>% 
      pull(value)
  ) %>% 
    separate(byline, into = c("author", "outlet"), sep = " / ")
}
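
If you want to test the parser on a single day before committing to a longer run, something like the following should work (a hypothetical check; note that it does hit Memeorandum’s servers):

scrape_date(ymd("2018-05-15"))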

Here I specify the 90-day date range I’m interested in before generating the tibble with a simple call to map_dfr() (no for loops here!). Once it’s done, we’ll have all the data we need for analysis.

date_range <- today() - 90:1

data_90 <- 
  date_range %>% 
  map_dfr(scrape_date)

Here I’ll use the 90 days’ worth of data generated by the previous script, which I saved in a .csv. When reading in the data you’ll notice I’m doing a bit more parsing, getting rid of punctuation — this is just a preference, since I’ll mostly be looking at things like word counts, though it might be useful to keep the “raw” text around for other purposes.

data_90 <- 
  read_csv(file_90) %>%  # file_90 is the path to the previously saved .csv
  mutate(
    headline = headline %>% 
      str_replace_all("\\-", " ") %>% 
      str_replace_all("\\'s", "") %>%  
      str_remove_all("[:punct:]"),
    date = mdy(date)
  )

Here’s what the data looks like:

head(data_90) %>% 
  knitr::kable()
| date       | author   | outlet            | headline |
|------------|----------|-------------------|----------|
| 2018-05-15 | Rogin    | Washington Post   | china gave trump a list of crazy demands and he caved to one of them after top trump officials went to beijing last month the chinese government wrote up a document with a list of economic and trade demands that ranged from the reasonable to the ridiculous |
| 2018-05-15 | Spross   | The Week          | trump zte puzzler |
| 2018-05-15 | Kuhn     | NPR               | as us china trade talks begin zte is in the spotlight sputnik international trump likely using chinese firm zte to bargain for more concessions from beijing |
| 2018-05-15 | Pramuk   | CNBC              | republican sen marco rubio warns trump reversal on china zte is a national security risk |
| 2018-05-15 | Stevens  | Unlikely Voter    | marco rubio actually understates how crazy it is to let zte go unchecked |
| 2018-05-15 | Hartmann | New York Magazine | why trump is suddenly worried about saving jobs in china 6 theories gordon |

The parsing I implemented isn’t perfect — it only captures authors’ last names, for instance — but for now I’m quite happy with how it turned out.

Analyzing the last 90 days of news

Now that we have the data, the first question we might ask is simple: where does Memeorandum get its articles from?

data_90 %>% 
  count(outlet) %>% 
  arrange(desc(n)) %>% 
  top_n(n = 20, wt = n) %>%
  ggplot(aes(reorder(outlet, n), n)) + 
  geom_col(fill = custom_palette[1]) + 
  coord_flip() + 
  custom_theme +
  labs(
    x = NULL,
    y = "Articles",
    title = "Most common outlets",
    subtitle = paste(min(data_90$date), "to", max(data_90$date))
  )

data_90 %>% 
  mutate(
    byline = paste0(author, " (", outlet, ")")
  ) %>% 
  count(byline) %>% 
  arrange(desc(n)) %>% 
  top_n(n = 20, wt = n) %>%
  ggplot(aes(reorder(byline, n), n)) + 
  geom_col(fill = custom_palette[1]) + 
  coord_flip() + 
  custom_theme +
  labs(
    x = NULL,
    y = "Articles",
    title = "Most common authors",
    subtitle = paste(min(data_90$date), "to", max(data_90$date))
  )

As I said before, there don’t appear to be any partisan blind spots. The top 20 outlets include a healthy dose of hard-news sources alongside a smattering of more partisan sites.

The list of the top 20 authors is less informative, because it mostly surfaces daily bloggers, who produce content at prodigious rates. The roughly three articles per day put out by Taegan Goddard seem impressive at first glance, but on closer inspection his posts on Political Wire are mostly bite-sized summaries of the day’s stories. In other words, it’s something like a Twitter feed, though to his credit he’s been doing it since 1999.

Next I examine the headlines themselves by taking advantage of the principles of tidytext, a wonderful package developed by Julia Silge and David Robinson. The following analysis borrows heavily from the lessons and code included in their free book, Text Mining with R.

I start by generating counts of bigrams, or pairs of words that appear next to each other in the text. The most frequent are, as one might expect, about President Trump.

bigram_counts <- 
  data_90 %>% 
  unnest_tokens(output = bigram, input = headline, token = "ngrams", n = 2) %>% 
  count(bigram) %>% 
  separate(bigram, into = c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>% 
  arrange(desc(n))

bigram_counts %>% 
  unite(word1, word2, col = "bigram", sep = " ", remove = FALSE) %>% 
  anti_join(
    data_90 %>% 
      mutate(outlet = str_trim(outlet) %>% str_to_lower()) %>% 
      select(outlet),
    by = c("bigram" = "outlet")
  ) %>% 
  anti_join(
    data_90 %>% 
      mutate(
        outlet = str_trim(outlet) %>% 
          str_to_lower() %>% 
          str_replace("the ", "")
      ) %>% 
      select(outlet),
    by = c("bigram" = "outlet")
  ) %>% 
  head() %>% 
  knitr::kable()
| bigram          | word1     | word2 | n   |
|-----------------|-----------|-------|-----|
| donald trump    | donald    | trump | 805 |
| president trump | president | trump | 550 |
| supreme court   | supreme   | court | 522 |
| north korea     | north     | korea | 456 |
| michael cohen   | michael   | cohen | 409 |
| york times      | york      | times | 366 |

If we’re interested in seeing which words are common to multiple bigrams, we can visualize the corpus as a network (where words are nodes and edges indicate pairs). Again, it’s not surprising to see that the corpus — much like the world — seems to revolve around Trump. Most of the other common bigrams are names, and are not that informative.

bigram_graph <- 
  bigram_counts %>% 
  filter(n >= 55) %>% 
  graph_from_data_frame()

ggraph(bigram_graph) +
  geom_edge_link(
    aes(edge_alpha = n), 
    show.legend = FALSE
  ) + 
  geom_node_point() + 
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) + 
  theme_void()

In short, counting words — even in bigram form — doesn’t seem all that exciting.

That’s where tf-idf comes in. Short for “term frequency—inverse document frequency,” tf-idf is a statistic that reflects how important a word is to a document in the context of an entire collection of documents, or corpus. In this case, I’m treating each day’s worth of articles as a separate document, and the collection of 90 days worth of data as the corpus. Therefore we’d expect extremely common words like “Trump” or “Congress” to yield low tf-idf scores, since they’re always in the news and therefore likely appear in every document. On the other hand, we should expect terms with high tf-idf scores to reference subjects that dominated news coverage for one or maybe two days, before quickly disappearing. In other words, finding which terms earn high tf-idf scores would tell us something about which stories tend to burn bright and fast in the news cycle.
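
To make the intuition concrete, here is a toy example with two invented one-line “documents” (not drawn from the scraped data) showing how tidytext’s bind_tf_idf() behaves:

toy <- tibble(
  doc  = c("day1", "day1", "day1", "day2", "day2"),
  word = c("trump", "roseanne", "trump", "trump", "korea")
)

toy %>% 
  count(doc, word) %>% 
  bind_tf_idf(term = word, document = doc, n = n)
# "trump" appears in both documents, so its idf (and tf-idf) is zero;
# "roseanne" and "korea" each appear in only one document and get positive scores.

Applying the same computation to the real data, with each day treated as a document: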

data_90 %>% 
  unnest_tokens(output = "word", input = headline) %>% 
  count(date, word, sort = TRUE) %>% 
  bind_tf_idf(word, date, n) %>% 
  arrange(desc(tf_idf)) %>% 
  top_n(n = 20, wt = tf_idf) %>%
  group_by(word) %>% 
  summarise(
    tf_idf = sum(tf_idf)
  ) %>% 
  ungroup() %>% 
  ggplot(aes(reorder(word, tf_idf), tf_idf)) + 
  geom_col(fill = custom_palette[1], width = .75) + 
  coord_flip() + 
  custom_theme +
  labs(
    y = "TF-IDF",
    x = NULL,
    title = "Most 'important' words according to tf-idf",
    subtitle = "High tf-idf words are common in the document but rare elsewhere"
  ) 

Our intuition seems to hold up. The top terms reference controversies (Roseanne Barr’s racist Twitter tirades, Omarosa’s tell-all talk show spree), tragedies (Anthony Bourdain’s death, the Santa Fe High School shooting), and world news not typically at the center of American political media’s attention (Venezuelan dictator Maduro’s attempted assassination, whatever happened in Ireland).

tidytext also makes it straightforward to link words in a corpus to sentiment. Although these techniques are not meant for relatively tiny sets of news headlines, the finding that news coverage seems to be consistently negative does have a ring of truth to it.

data_90 %>% 
  unnest_tokens(output = word, input = headline) %>% 
  anti_join(get_stopwords(), by = "word") %>% 
  left_join(get_sentiments("nrc")) %>% 
  filter(!is.na(sentiment)) %>% 
  mutate(
    sentiment = case_when(
      sentiment %in% c("anger", "disgust", "fear", "negative", "sadness") ~ "negative",
      sentiment %in% c("joy", "positive", "trust") ~ "positive",
      TRUE ~ "other"
    )
  ) %>% 
  group_by(date) %>% 
  summarise(
    Negative = sum(sentiment == "negative") / n(),
    Positive = sum(sentiment == "positive") / n(),
    Other = sum(sentiment == "other") / n()
  ) %>% 
  gather(key = sentiment, value = prop, -date) %>% 
  mutate(sentiment = fct_reorder(sentiment, prop) %>% fct_rev()) %>% 
  ggplot(aes(date, prop, color = sentiment, group = sentiment)) + 
  geom_line() + 
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = custom_palette) + 
  theme_minimal() +
  labs(
    title = "Headline sentiment by word count",
    x = NULL,
    y = NULL,
    color = NULL
  ) +
  custom_theme

Zooming in, we can see some of the problems with this approach. I, for one, don’t feel immediate positive associations with the terms “white” and “president” these days — and certainly not when they appear together. But while this form of sentiment analysis is somewhat of a blunt instrument, I do think it’s nice how straightforward it becomes with tidytext.

top_30_words <- 
  data_90 %>% 
  unnest_tokens(output = word, input = headline) %>% 
  anti_join(get_stopwords(), by = "word") %>% 
  left_join(get_sentiments("nrc")) %>% 
  filter(!is.na(sentiment)) %>% 
  mutate(
    sentiment = case_when(
      sentiment %in% c("anger", "disgust", "fear", "negative", "sadness") ~ "negative",
      sentiment %in% c("joy", "positive", "trust") ~ "positive",
      TRUE ~ "other"
    )
  ) %>% 
  count(sentiment, word) %>% 
  arrange(desc(n)) %>% 
  top_n(n = 30, wt = n) %>% 
  pull(word)

data_90 %>% 
  unnest_tokens(output = word, input = headline) %>% 
  anti_join(get_stopwords(), by = "word") %>% 
  left_join(get_sentiments("nrc")) %>% 
  filter(!is.na(sentiment), word %in% top_30_words) %>% 
  mutate(
    sentiment = case_when(
      sentiment %in% c("anger", "disgust", "fear", "negative", "sadness") ~ "negative",
      sentiment %in% c("joy", "positive", "trust") ~ "positive",
      TRUE ~ "other"
    )
  ) %>% 
  count(sentiment, word) %>% 
  filter(sentiment != "other") %>% 
  mutate(n = if_else(sentiment == "negative", -n, n)) %>% 
  ggplot(aes(reorder(word, n), n, fill = sentiment %>% fct_rev())) +
  geom_col(width = .75) + 
  scale_fill_manual(values = custom_palette[c(3, 5)]) +
  scale_y_continuous(labels = scales::comma) +
  coord_flip() + 
  custom_theme +
  labs(
    x = NULL,
    y = "Count",
    fill = NULL,
    title = "Most common 'positive' and 'negative' words"
  )

Analyzing the media’s role in the 2016 general election

Back when I first discovered Memeorandum and built my first rudimentary scraper, I scraped and downloaded the text from each day of the 2016 general election. Unfortunately my parsing was a little sloppy and I had never heard of tidy data at the time, so the file I saved is a bit of a mess. But thanks to stringr, gather(), and some gumption, tidying the data now is a breeze!

general <- 
  read_csv(file_general) %>% 
  select(-X1) %>% 
  gather(key = date, value = headline, -Byline, -Outlet) %>% 
  mutate(
    headline = 
      stri_replace_all_fixed(headline, ": ", "") %>% 
      stri_replace_all_fixed("'s", "") %>% 
      str_to_lower() %>% 
      str_replace_all("\\-", " ") %>% 
      str_replace_all("\\'s", "") %>%  
      str_remove_all("[:punct:]"),
    date = str_replace(date, "\\.\\d", "") %>% ymd()
  ) %>% 
  rename_all(str_to_lower) %>% 
  select(date, everything()) %>% 
  filter(!is.na(headline))

general %>% 
  unnest_tokens(output = bigram, input = headline, token = "ngrams", n = 2) %>% 
  filter(!is.na(bigram)) %>% 
  count(bigram) %>% 
  separate(bigram, into = c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>% 
  arrange(desc(n)) %>% 
  filter(n >= 80) %>% 
  graph_from_data_frame() %>% 
  ggraph() +
  geom_edge_link(
    aes(edge_alpha = n), 
    show.legend = FALSE
  ) + 
  geom_node_point() + 
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) + 
  theme_void()

The bigram graph displays some familiar names and topics, but interestingly the word “Trump” is less of a focal point. The largest subgraph, which includes both Trump and Clinton, mostly reflects topics related to the campaign.

We can also revisit the tf-idf statistic to see which terms were important for short periods of time before fading from the news cycle.

general %>% 
  unnest_tokens(output = "word", input = headline) %>% 
  count(date, word, sort = TRUE) %>% 
  bind_tf_idf(word, date, n) %>% 
  arrange(desc(tf_idf)) %>% 
  top_n(n = 20, wt = tf_idf) %>%
  group_by(word) %>% 
  summarise(
    tf_idf = sum(tf_idf)
  ) %>% 
  ungroup() %>% 
  ggplot(aes(reorder(word, tf_idf), tf_idf)) + 
  geom_col(fill = custom_palette[1], width = .75) + 
  coord_flip() + 
  custom_theme +
  labs(
    y = "TF-IDF",
    x = NULL,
    title = "Most 'important' words of the general election according to tf-idf",
    subtitle = "High tf-idf words are common in the document but rare elsewhere"
  ) 

The top term likely refers to the deadly sniper attack on Dallas police officers in July 2016. The next few likely have to do with the #BlackLivesMatter protests that followed the tragic deaths of Alton Sterling in Baton Rouge and Sylville Smith in Milwaukee at the hands of police.

The tf-idf scores from the past 90 days were interesting precisely because they set aside raw frequency: finding that Trump is the most commonly mentioned word in the news is not surprising now that he’s president. During the campaign, though, raw counts are worth a second look, since we might have expected another topic — Hillary Clinton — to keep pace.

It turns out that the results from 2016 aren’t all that different from today: Trump’s name dominated Clinton’s in the headlines, and it wasn’t close. Of course, a huge portion of this coverage was negative (which can likely be said for Clinton, as well), and the impact of such an imbalance on the course of the campaign is hard to decipher. But the overall trend reflects Trump’s ability to leverage his daily outrages into unprecedented levels of free publicity.

general %>% 
  unnest_tokens(output = "word", input = headline) %>% 
  filter(word == "trump" | word == "clinton") %>% 
  mutate(
    word = str_to_title(word) %>% fct_relevel("Trump")
  ) %>% 
  count(date, word) %>% 
  ggplot(aes(date, n, fill = word, group = word)) + 
  geom_col(position = "dodge") + 
  scale_fill_manual(values = custom_palette[c(5,3)]) +
  custom_theme +
  labs(
    x = NULL,
    y = NULL,
    fill = NULL,
    title = "Trump dominated news coverage during the general election"
  )

general %>% 
  unnest_tokens(output = "word", input = headline) %>% 
  filter(word == "trump" | word == "clinton") %>% 
  mutate(
    word = str_to_title(word)
  ) %>% 
  count(date, word) %>% 
  spread(key = word, value = n, fill = 0) %>% 
  mutate(
    diff = Trump - Clinton,
    trump_clinton = if_else(
      diff <= 0, 
      "Clinton mentioned more", 
      "Trump mentioned more"
    )
  ) %>% 
  ggplot(aes(date, diff, fill = trump_clinton)) + 
  geom_col() +
  scale_fill_manual(values = custom_palette[c(3, 5)]) +
  custom_theme + 
  theme(
    legend.position = c(.5, .85)
  ) +
  labs(
    y = "Difference in mentions",
    x = NULL,
    fill = NULL,
    title = "Trump was mentioned more than Clinton on all but a handful of days\nduring the election"
  )

Since the early days of the election cycle, the imbalance in coverage between Trump and everyone else has been questioned and scrutinized, and it’s unclear what effect it might have had on the campaign. After all, Clinton didn’t lose for lack of name recognition. Still, the inordinate attention paid to Trump’s antics clearly benefited his campaign — some researchers say to the tune of at least $2 billion in free media — and may have informed his team’s internal decision to focus on digital persuasion rather than traditional ad buys.

In the future, it would be worth asking how anomalous this imbalance was compared to other elections. The beauty of Memeorandum is that we could plug in the dates for the 2008 and 2012 elections and soon find out. But I want to give the site’s servers a rest and take a break from scraping — that can be a project for another day.