Scraping the news

We live in an age where the media you consume — and how you consume it — says a lot about you. If I told you that I read a lot of The New York Times, Politico, The New Republic, and Vox, you might reasonably guess something about my politics and interests. If I told you that I checked Twitter between ten and ten thousand times a day, that might also tell you something about me. But what if I told you the first place I look for news is Memeorandum.com?

The front page of Memeorandum at the time this post was originally published.

Chances are you've never heard of it, but it is in fact the place I've turned to most often over the last few years to get a bird's-eye view of the news I'm interested in. Memeorandum is an un-editorialized news aggregator that crawls the web to find the stories that are driving the news cycle, particularly in American politics. Its algorithm ingests news from all across the political spectrum, meaning you'll be able to find coverage of the same topics from both Breitbart and Mother Jones, or The Daily Mail and The Washington Post. If you can grow to appreciate the site's early-Internet aesthetic and slightly behind-the-times name (it was created in 2004), you'll be hard-pressed to find a better aggregator for consuming American news.

Nate Silver has referenced Memeorandum a few times in his work, and gave a nice explanation of the site here:

The site uses an algorithm to determine which stories are leading political coverage on the Internet; the details of the calculation are somewhat opaque, but a lot of it is based on which stories are being linked to by other news organizations and what themes are commonly recurring among different news outlets. Simply put, Memeorandum is a good indicator of what stories journalists are talking about.

Another great feature of the site is its easily accessible archives. The site updates every five minutes with new articles, and the archive feature lets you navigate back to any five minute interval in the last sixteen years of Memeorandum's existence to find a snapshot of that moment's top stories. The result is a treasure trove of data that might tell us a thing or two about online political media and modern newsmaking, so in this post I'll walk through how I scraped the Memeorandum archives for 2020 and offer up a few simple analyses to demonstrate how the data can be used.

Loading packages

The workhorse packages behind this post are rvest, which handles the web scraping and HTML parsing used to collect the data, and tidytext, which provides some useful functions for text analysis.

library(tidyverse) # data wrangling and plotting
library(rvest)     # web scraping and HTML parsing
library(lubridate) # date handling
library(tidytext)  # tokenization and stop words
library(ggtext)    # markdown formatting in ggplot2 text elements

Scraping the archives

The Memeorandum main page is a simple static site that clusters headlines by topic and orders them by prevalence. Stories appear as either main items or related articles, and both the author and outlet are listed next to the headlines. The key to scraping past instances of the main page is to notice that the URL convention for archived posts follows a simple date-time format: www.memeorandum.com/[YYMMDD]/h[hhmm]. So the URL for a snapshot of the news on January 1st, 2010 at 5:00 PM EST would end with /100101/h1700.
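
For example, the snapshot URL mentioned above can be built directly from a date with lubridate's ymd() and a format string:

paste0(
  "https://www.memeorandum.com/",
  format(ymd("2010-01-01"), "%y%m%d"),
  "/h1700"
)
# "https://www.memeorandum.com/100101/h1700"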

I wrote a function that downloads the raw HTML from the site at 6:00 PM on a given date, then extracts the relevant headline, author, and outlet data from each story on the site at that moment. I was able to use rvest::html_nodes() and some trial-and-error to extract each story from one of two sections: the main items at the top of each topic cluster, or the sub-items made up of related articles and discussion pieces.

scrape_memeorandum <- function(scrape_date) {
  # Convert date to relevant URL
  url <- scrape_date %>%
    format("%y%m%d") %>%
    paste0("https://www.memeorandum.com/", ., "/h1800") # snapshots at 6 PM EST
  # Read raw HTML
  raw <- read_html(url)

  tibble( # scrape sub items
    byline = raw %>%
      html_nodes(".lnkr cite") %>%
      html_text(),
    headline = raw %>%
      html_nodes(".lnkr a:nth-child(2)") %>% # first child is redundant
      html_text(),
    scrape_date = scrape_date,
    section = "sub"
  ) %>%
    bind_rows(
      tibble( # scrape main items
        byline = raw %>%
          html_nodes("script ~ cite") %>% # main items are structured differently, need subsequent selector
          html_text(),
        headline = raw %>%
          html_nodes("strong a") %>%
          html_text(),
        scrape_date = scrape_date,
        section = "main"
      )
    ) %>%
    transmute(
      scrape_date,
      outlet = if_else( # parse byline into outlet/author
        str_detect(byline, "\\s\\/\\s"),
        str_extract(byline, "(?<=\\s\\/\\s).+(?=\\:)"),
        str_extract(byline, ".+(?=\\:)")
      ),
      author = str_extract(byline, ".+(?=\\s\\/\\s)"),
      headline,
      section
    )
}
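
For reference, here's how the byline parsing in transmute() behaves on an illustrative byline of the form "Author / Outlet:" (the string below is my own example, inferred from the regexes above; bylines with no author are just "Outlet:"):

byline <- "Alex Henderson / Alternet.org:"
str_extract(byline, "(?<=\\s\\/\\s).+(?=\\:)") # outlet: "Alternet.org"
str_extract(byline, ".+(?=\\s\\/\\s)")         # author: "Alex Henderson"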

Once I had this working, I iterated through each day of the year so far, pausing for 10 seconds between visits to Memeorandum out of consideration for their servers. I stored each day's worth of data separately so that if my loop failed partway through, I wouldn't lose what I'd already collected or have to re-scrape those dates.

I was not able to find any guidance from Memeorandum about their terms of use or scraping limits, so I tried to follow the basic etiquette described here. If you plan on scraping Memeorandum as well, be conservative and be considerate. And if you take issue with my approach, please let me know!
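
If you want a programmatic check before scraping, the robotstxt package (not used elsewhere in this post) can query a site's robots.txt rules; a minimal sketch:

library(robotstxt)

# TRUE if the archive path is allowed for generic crawlers per robots.txt
paths_allowed("https://www.memeorandum.com/100101/h1700")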

scrape_dates <-
  as_date(ymd("2020-01-01"):(today() - 1))

scrape_dates %>%
  walk( # side effects only: write one CSV per day
    ~ {
      print(now()) # log progress
      Sys.sleep(10) # pause between requests
      write_csv(
        scrape_memeorandum(.x),
        paste0(
          "~/personal-website/content/post/memeorandum-scraping/data/memeorandum_",
          format(.x, "%y%m%d"),
          ".csv"
        )
      )
    }
  )

Analysis
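
With the scraping done, the per-day CSVs can be read back into a single tibble. Here's a minimal sketch, assuming the same data/ directory used in the loop above and naming the result df to match the code below:

df <-
  list.files(
    "~/personal-website/content/post/memeorandum-scraping/data",
    pattern = "^memeorandum_.*\\.csv$",
    full.names = TRUE
  ) %>%
  map_dfr(read_csv, col_types = cols())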

Now that we have all the data in a tibble, we can start our analysis. Here's what the scraped data looks like:

head(df) %>% 
  knitr::kable()
| scrape_date | outlet              | author          | headline                                                                                              | section |
|-------------|---------------------|-----------------|-------------------------------------------------------------------------------------------------------|---------|
| 2020-01-01  | Alternet.org        | Alex Henderson  | GOP consultant says the Republican Party has reached a ‘day of reckoning’ as Trump consumes it whole   | sub     |
| 2020-01-01  | BBC                 | NA              | US embassy attack: Protesters withdraw after standoff                                                   | sub     |
| 2020-01-01  | New York Times      | NA              | Pro-Iranian Protesters End Siege of U.S. Embassy in Baghdad                                             | sub     |
| 2020-01-01  | Quartz              | Justin Rohrlich | Amid Baghdad embassy attacks, US spending on diplomatic security drops 11%                              | sub     |
| 2020-01-01  | Wall Street Journal | NA              | Protesters Retreat From U.S. Embassy Site in Iraq                                                       | sub     |
| 2020-01-01  | WND                 | Joe Saunders    | Trump show of strength pays off, militants reported withdrawing from U.S. Embassy siege                 | sub     |

We can also use tidytext::unnest_tokens() to expand our data to the level of each individual word, and take out stop words while we're at it:

data(stop_words)

df %>% 
  slice(1) %>% 
  unnest_tokens(word, headline) %>% 
  anti_join(stop_words, by = "word") %>% 
  knitr::kable()
| scrape_date | outlet       | author         | section | word       |
|-------------|--------------|----------------|---------|------------|
| 2020-01-01  | Alternet.org | Alex Henderson | sub     | gop        |
| 2020-01-01  | Alternet.org | Alex Henderson | sub     | consultant |
| 2020-01-01  | Alternet.org | Alex Henderson | sub     | republican |
| 2020-01-01  | Alternet.org | Alex Henderson | sub     | party      |
| 2020-01-01  | Alternet.org | Alex Henderson | sub     | reached    |
| 2020-01-01  | Alternet.org | Alex Henderson | sub     | day        |
| 2020-01-01  | Alternet.org | Alex Henderson | sub     | reckoning  |
| 2020-01-01  | Alternet.org | Alex Henderson | sub     | trump      |
| 2020-01-01  | Alternet.org | Alex Henderson | sub     | consumes   |

To start off, let's look at the top eight topics over the course of the year. As we can see, Trump and his White House absolutely dominated coverage, as one might expect. No topic came close to matching the number of mentions he received, and he was mentioned at a steady pace throughout the year (though both he and Biden seem to have experienced a slight bump toward the end of the election cycle).

top_8 <-
  df %>%
  unnest_tokens(word, headline) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(word = str_remove(word, "\\'s$")) %>%
  count(word) %>%
  slice_max(n = 8, order_by = n)

df %>%
  unnest_tokens(word, headline) %>%
  mutate(word = str_remove(word, "\\'s$")) %>% # apply the same cleaning used to build top_8
  filter(word %in% top_8$word) %>%
  group_by(word) %>%
  arrange(scrape_date) %>%
  mutate(row_n = row_number()) %>%
  ungroup() %>%
  ggplot(aes(scrape_date, row_n, color = word)) +
  geom_line(aes(group = word), show.legend = FALSE) +
  ggrepel::geom_label_repel(
    data = transmute(top_8, scrape_date = ymd("2020-12-13"), row_n = n, word),
    aes(label = word),
    show.legend = FALSE
  ) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(plot.title = element_markdown(face = "bold")) +
  labs(
    x = NULL,
    y = "Mentions",
    color = NULL,
    title = "Most common _Memeorandum.com_ headline topics of 2020"
  )

Trump's dominance over the headlines was the rule, but there were exceptions. Here are all the days when Trump was not the most-mentioned topic in the news. Non-Trump news cycles were few and far between, and most often emerged in the wake of the deaths of key figures like Kobe Bryant, George Floyd, John Lewis, and Ruth Bader Ginsburg.

# Top words over time
top_words <-
  df %>%
  unnest_tokens(word, headline) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(word = str_remove(word, "\\'s$")) %>%
  count(scrape_date, word) %>%
  group_by(scrape_date) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>%
  ungroup()

top_words_text <-
  top_words %>%
  filter(word != "trump") %>%
  group_by(word) %>%
  slice_min(order_by = scrape_date) %>%
  ungroup() %>%
  mutate(word = str_to_title(word))

top_words %>%
  ggplot(aes(scrape_date, n)) +
  geom_col(aes(fill = word == "trump"), size = 0, show.legend = FALSE) +
  scale_fill_manual(values = c("red", "grey")) +
  ggrepel::geom_label_repel(
    data = top_words_text,
    aes(label = word),
    vjust = 1,
    force = 20,
    arrow = arrow(angle = 45, length = unit(0.05, "cm"), ends = "last", type = "open")
  ) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    plot.title = element_markdown(face = "bold"),
    plot.subtitle = element_markdown()
  ) +
  labs(
    title = "Which topics broke through Donald Trump's noise?",
    subtitle = "Highlighting days <span style = 'color:red;'>when Trump wasn't the #1 topic</span> on _Memeorandum.com_.",
    y = "Mentions",
    x = NULL
  )

However, there was a long stretch spanning March and April when the focus turned elsewhere. March began with the end of the Democratic primary — when Joe Biden won Super Tuesday and Elizabeth Warren dropped out — before quickly turning into a waking nightmare, as the coronavirus made its first show of force in America and sent parts of the country into lockdown.

top_words <-
  df %>%
  filter(scrape_date >= ymd("2020-03-01"), scrape_date < ymd("2020-05-01")) %>% 
  unnest_tokens(word, headline) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(word = str_remove(word, "\\'s$")) %>%
  count(scrape_date, word) %>%
  group_by(scrape_date) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>%
  ungroup()

top_words_text <-
  top_words %>%
  filter(word != "trump") %>%
  group_by(word) %>%
  slice_min(order_by = scrape_date) %>%
  ungroup() %>%
  mutate(word = str_to_title(word))

top_words %>%
  mutate(
    col = case_when(
      word == "trump" ~ "Trump",
      word == "coronavirus" ~ "COVID",
      word == "warren" ~ "Warren",
      TRUE ~ "Biden"
    )
  ) %>% 
  ggplot(aes(scrape_date, n)) +
  geom_col(aes(fill = col), size = 0) +
  scale_fill_manual(values = c("blue", "red", "grey", "dodgerblue")) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    plot.title = element_markdown(face = "bold"),
    plot.subtitle = element_markdown()
  ) +
  labs(
    title = "Which topics broke through Donald Trump's noise?",
    subtitle = "March through April, 2020",
    y = "Mentions",
    x = NULL,
    fill = NULL
  )

Inspired by the @NYT_first_said bot on Twitter, which records the appearance of novel words in The New York Times lexicon, I wanted to see how often new words appeared on the main page — though for the purposes of this exercise, I'm only considering data from this year in my definition of "new."

## First appearances
first_appearances <-
  df %>%
  select(scrape_date, headline) %>%
  unnest_tokens(word, headline) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(row_n = row_number()) %>%
  group_by(scrape_date) %>%
  mutate(row_n_date = row_number()) %>%
  ungroup() %>%
  group_by(word) %>%
  mutate(first_appearance = row_n == min(row_n)) %>%
  ungroup()

first_appearances %>%
  ggplot(aes(scrape_date, -row_n_date, fill = first_appearance, alpha = first_appearance)) +
  geom_tile(show.legend = FALSE) +
  scale_fill_manual(values = c("grey", "red")) + 
  scale_alpha_manual(values = c(0.15, 1)) + 
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    plot.title = element_markdown(face = "bold"),
    axis.text.y = element_blank()
  ) +
  labs(
    title = "Prevalence of <span style = 'color:red;'>new words</span> on the _Memeorandum_ front page",
    y = NULL,
    x = NULL
  )

I made this more as a fun visual exercise than as any sort of useful analysis, so in that spirit I'll top things off with a list of all the funky words that made their first 2020 appearance on Memeorandum on December 13th.

first_appearances %>%
  filter(scrape_date == ymd("2020-12-13"), first_appearance) %>%
  mutate(x = 0, y = 0) %>%
  ggplot(aes(x, y, label = word)) +
  ggrepel::geom_label_repel(segment.alpha = 0, force = 15) +
  theme_void() +
  theme(plot.title = element_markdown(face = "bold")) +
  labs(
    title = "Words first appearing on _Memeorandum.com_ on 12/13/2020"
  )

There's plenty more to do with this data, so I may come back and update this post later. But for now I hope I've at least convinced you to check out Memeorandum for yourself!
