David Yu (Flickr), Creative Commons

Policing trends in San Francisco

In this project I’ll be exploring the San Francisco Police Department’s popular incident data set for the year 2017, which records basic details about reported police interactions, including location, time, and the related offense(s). Similar data stretching back to 2003 is also available on the DataSF website.

As always, I’ll begin by loading some libraries and reading in the data. I plan to supplement the police data with spatial data, also from DataSF, as well as census block data from the Eviction Lab. The Eviction Lab data includes information on population, demographics, and poverty rates for each census block in the city; I use records from 2016, the last year for which complete data are available.

Reading in the data

library(tidyverse) 
library(sf)
library(lubridate)

fix_name <- function(var) {
  var %>% 
    str_to_lower() %>% 
    str_replace_all(pattern = " ", replacement = "_") %>% 
    str_replace_all(pattern = "-", replacement = "_")  
}
data <- 
  read_csv(file_police) %>% 
  rename_all(fix_name) %>% 
  rename(id = incidntnum) %>% 
  mutate_if(is.character, str_to_lower) %>% 
  mutate(
    dayofweek = factor(
      dayofweek,
      levels = c("sunday", "monday", "tuesday",
                 "wednesday", "thursday", "friday", "saturday"
      )
    ),
    date = mdy(date)
  ) %>% 
  st_as_sf(coords = c("x", "y"), crs = 4326, remove = FALSE)
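
As a quick sanity check, here is what the fix_name() helper does to some hypothetical raw column names (a standalone sketch; the example names are invented for illustration):

```r
library(stringr)   # str_to_lower(), str_replace_all()
library(magrittr)  # %>% pipe

# Same helper as above
fix_name <- function(var) {
  var %>% 
    str_to_lower() %>% 
    str_replace_all(pattern = " ", replacement = "_") %>% 
    str_replace_all(pattern = "-", replacement = "_")  
}

# Hypothetical raw column names, invented for illustration
fix_name(c("IncidntNum", "Day Of-Week", "PD District"))
#> [1] "incidntnum"  "day_of_week" "pd_district"
```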

neighborhoods <- 
  st_read(file_neighborhoods) %>% 
  st_transform(crs = 4326)
## Reading layer `geo_export_388aef45-a5dc-4fff-86fe-89e0817d7cf8' from data source `/Users/Benjamin/Desktop/Summer Projects/Analysis Neighborhoods/geo_export_388aef45-a5dc-4fff-86fe-89e0817d7cf8.shp' using driver `ESRI Shapefile'
## Simple feature collection with 41 features and 1 field
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.5149 ymin: 37.70813 xmax: -122.357 ymax: 37.8333
## epsg (SRID):    4326
## proj4string:    +proj=longlat +ellps=WGS84 +no_defs
blocks <- 
  st_read(file_blocks) %>% 
  rename_all(fix_name) %>% 
  filter(countyfp == "075") %>% 
  st_set_crs(value = 4326)
## Reading layer `cb_2017_06_bg_500k' from data source `/Users/Benjamin/Desktop/Stanford/Spring '18/ENGR 150/cb_2017_06_bg_500k/cb_2017_06_bg_500k.shp' using driver `ESRI Shapefile'
## Simple feature collection with 23196 features and 10 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -124.4096 ymin: 32.53416 xmax: -114.1312 ymax: 42.00952
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
eviction_lab <- 
  read_csv(file_evictionlab) %>% 
  rename_all(fix_name) %>% 
  filter(parent_location == "San Francisco County, California", year == 2016)

stations <- 
  read_csv(file_police_stations) %>% 
  rename_all(fix_name) %>% 
  mutate(
    location = str_replace_all(location, pattern = "\\(|\\)", replacement = "")
  ) %>% 
  separate(location, into = c("x", "y"), sep = ", ") %>% 
  mutate(
    x = as.numeric(x),
    y = as.numeric(y)
  )

First steps

Let’s start by looking at what kinds of incidents are included in the data set. In the following plot we see that auto break-ins are the most commonly reported crime, followed by a handful of other property crimes like vandalism and theft.

data %>% 
  count(descript) %>% 
  arrange(desc(n)) %>% 
  top_n(n = 25, wt = n) %>% 
  mutate(descript = str_to_title(descript)) %>% 
  ggplot(aes(reorder(descript, n), n)) + 
  geom_point() + 
  coord_flip() + 
  scale_y_continuous(labels = scales::comma) + 
  labs(
    x = NULL,
    y = NULL,
    title = "Most common police incidents of 2017",
    caption = "SF Open Data"
  )

We can also consider the incidents in terms of their “categories,” as determined by the SFPD. Here’s a look at how the top five categories (excluding the ambiguous “other offenses” category, which was actually the most common) changed over time. There are some odd spikes here and there, but the frequency of these top crimes seems mostly consistent across the year.

top_5_crimes <-
  data %>% 
  st_set_geometry(value = NULL) %>% 
  filter(category != "other offenses") %>% 
  count(category) %>% 
  top_n(n = 5, wt = n) %>% 
  arrange(desc(n)) %>% 
  pull(category)
  
data %>% 
  st_set_geometry(value = NULL) %>% 
  distinct(id, .keep_all = TRUE) %>% 
  filter(category %in% top_5_crimes) %>% 
  unite(col = dtime, date, time, sep = " ") %>% 
  mutate(
    dtime = ymd_hms(dtime),
    hour = hour(dtime),
    wday = wday(dtime, label = TRUE, abbr = TRUE),
    date = floor_date(dtime, unit = "day")
  ) %>% 
  count(category, date) %>% 
  ggplot(
    aes(date, n, color = reorder(category, n) %>% fct_rev(), group = category)
  ) +
  geom_line() + 
  theme_minimal() + 
  labs(
    x = NULL,
    y = NULL,
    color = "Category",
    title = "Police incident trends, 2017"
  )

And here’s how the number of overall incidents changes over the course of a given day, with spikes around noon and 6:00 PM (I guess cops are busy during meal times?). We can also see that more late-night/early-morning crimes are committed on weekends than during the week.

data %>%
  st_set_geometry(value = NULL) %>% 
  unite(col = dtime, date, time, sep = " ") %>% 
  mutate(
    dtime = ymd_hms(dtime),
    hour = hour(dtime),
    wday = wday(dtime, label = TRUE, abbr = TRUE)
  ) %>% 
  count(hour, wday) %>% 
  ggplot(aes(hour, n, color = wday, group = wday)) + 
  geom_line() + 
  scale_color_hue() + 
  theme_minimal() + 
  labs(
    x = "Hour",
    y = "Incidents",
    color = NULL,
    title = "Police incidents by day and hour"
  )

Building heatmaps

With a rich data set like this, it can be difficult to effectively visualize trends without omitting a fair amount of information. While each of the three previous plots tells an interesting story, each also leaves out a lot of key data.

The best way I’ve found to convey almost all of the information in the data set is through heatmaps. The following plots break down the data by time of day, offense, day of week, and police district. They show the same trends as the previous plots, but in a compact and easily comparable fashion. Thanks to the viridis package, they also look pretty nice.

data %>% 
  st_set_geometry(value = NULL) %>% 
  unite(col = dtime, date, time, sep = " ") %>% 
  mutate(
    dtime = ymd_hms(dtime),
    hour = hour(dtime),
    wday = wday(dtime, label = TRUE, abbr = TRUE),
    month = month(dtime)
  ) %>% 
  mutate(category = str_to_title(category)) %>% 
  count(hour, category) %>% 
  filter(n != 0) %>% 
  ggplot(aes(hour, reorder(category, n))) + 
  geom_tile(aes(fill = n)) +
  viridis::scale_fill_viridis(
    breaks = c(1, 10, 100, 4000),
    labels = c(1, 10, 100, 4000),
    trans = "log2"
  ) + 
  scale_x_continuous(
    breaks = c(0, 5, 10, 15, 20),
    labels = c("12AM", "5AM", "10AM", "3PM", "8PM")
  ) +
  labs(
    y = NULL,
    x = NULL,
    fill = "Frequency",
    title = "San Francisco police heatmap"
  ) + 
  guides(
    fill = guide_colorbar(
      nbin = 5, 
      barheight = 9,
      barwidth = .25,
      raster = FALSE,
      ticks = FALSE,
      title.position = "top"
    )
  ) +
  theme_minimal()

top_15_crimes <-
  data %>% 
  st_set_geometry(value = NULL) %>% 
  count(category) %>% 
  top_n(n = 15, wt = n) %>% 
  arrange(desc(n)) %>% 
  pull(category)

data %>% 
  st_set_geometry(value = NULL) %>% 
  filter(category %in% top_15_crimes) %>% 
  unite(col = dtime, date, time, sep = " ") %>% 
  mutate(
    dtime = ymd_hms(dtime),
    hour = hour(dtime),
    wday = wday(dtime, label = TRUE, abbr = TRUE),
    month = month(dtime)
  ) %>% 
  mutate(category = str_to_title(category)) %>% 
  count(hour, category, wday) %>% 
  filter(n != 0) %>% 
  ggplot(aes(hour, reorder(category, n))) + 
  geom_tile(aes(fill = n)) +
  facet_wrap(. ~ wday) +
  viridis::scale_fill_viridis(
    breaks = c(1, 8, 60, 400),
    labels = c(1, 8, 60, 400),
    trans = "log2"
  ) + 
  scale_x_continuous(
    breaks = c(0, 5, 10, 15, 20),
    labels = c("12AM", "5AM", "10AM", "3PM", "8PM")
  ) +
  labs(
    y = NULL,
    x = NULL,
    fill = "Frequency",
    title = "San Francisco police heatmap by day of week"
  ) + 
  guides(
    fill = guide_colorbar(
      nbin = 5, 
      barheight = 9,
      barwidth = .25,
      raster = FALSE,
      ticks = FALSE,
      title.position = "top"
    )
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 0))

data %>% 
  st_set_geometry(value = NULL) %>% 
  filter(category %in% top_15_crimes) %>% 
  unite(col = dtime, date, time, sep = " ") %>% 
  mutate(
    dtime = ymd_hms(dtime),
    hour = hour(dtime),
    wday = wday(dtime, label = TRUE, abbr = TRUE),
    month = month(dtime)
  ) %>% 
  mutate(
    category = str_to_title(category),
    pddistrict = str_to_title(pddistrict)
  ) %>% 
  count(hour, category, pddistrict) %>% 
  filter(n != 0) %>% 
  ggplot(aes(hour, reorder(category, n))) + 
  geom_tile(aes(fill = n)) +
  facet_wrap(. ~ pddistrict) +
  viridis::scale_fill_viridis(
    breaks = c(1, 8, 60, 400),
    labels = c(1, 8, 60, 400),
    trans = "log2"
  ) + 
  scale_x_continuous(
    breaks = c(0, 5, 10, 15, 20),
    labels = c("12AM", "5AM", "10AM", "3PM", "8PM")
  ) +
  labs(
    y = NULL,
    x = NULL,
    fill = "Frequency",
    title = "San Francisco police heatmap by police district"
  ) + 
  guides(
    fill = guide_colorbar(
      nbin = 5, 
      barheight = 9,
      barwidth = .25,
      raster = FALSE,
      ticks = FALSE,
      title.position = "top"
    )
  ) +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 0))

Spatial analysis

Now let’s take a spatial view of the data. Because I have shapefiles that describe each of San Francisco’s neighborhoods and census blocks, I can use sf to add more detail about each incident’s location. For each shapefile, I use st_intersects() to find which incidents should be grouped together according to their geography, and then update the data set with this new information. After labeling each incident’s location, I can do some simple group_by() and summarise() calls to describe each geographic unit in terms of the police data.

For now, I’ll be looking at the proportion of incidents that are “resolved” by police (in most cases, this means the incident resulted in an arrest). I’ll start with neighborhoods.
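
The core of this approach is st_intersects() with sparse = FALSE, which returns a logical matrix with one row per point and one column per polygon. Here’s a minimal sketch with toy geometry (the squares and points are invented for illustration, not taken from the shapefiles):

```r
library(sf)

# Two adjacent unit squares standing in for neighborhoods
squares <- st_sfc(
  st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
  st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
)

# Two points standing in for incident locations
pts <- st_sfc(st_point(c(0.5, 0.5)), st_point(c(1.5, 0.5)))

# Logical matrix: one row per point, one column per polygon
st_intersects(pts, squares, sparse = FALSE)
#>       [,1]  [,2]
#> [1,]  TRUE FALSE
#> [2,] FALSE  TRUE
```

Gathering that matrix into long form and filtering to the TRUE cells is what pairs each incident with the polygon containing it.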

intersect_tbl <- 
  data %>% 
  st_intersects(
    neighborhoods,
    sparse = FALSE
  ) %>% 
  as_tibble() 

colnames(intersect_tbl) <- 
  neighborhoods %>% 
  pull(nhood)

incident_nhood <- 
  intersect_tbl %>% 
  mutate(
    id = data %>% pull(id)
  ) %>% 
  select(id, everything()) %>% 
  gather(key = nhood, value = val, -id) %>% 
  filter(val) %>% 
  select(-val)

data <- 
  data %>% 
  st_set_geometry(value = NULL) %>% 
  left_join(
    incident_nhood,
    by = "id"
  ) %>% 
  distinct(id, category, .keep_all = TRUE)

nhood_resolved <- 
  data %>% 
  filter(resolution == "none" | resolution == "arrest, booked") %>% 
  filter(!is.na(nhood)) %>% 
  group_by(nhood) %>% 
  summarise(
    prop_resolved = 1 - (sum(resolution == "none") / n())
  ) %>% 
  ungroup()

nhood_resolved %>% 
  arrange(desc(prop_resolved)) %>% 
  top_n(n = 5, wt = prop_resolved) %>% 
  knitr::kable()
nhood                  prop_resolved
Tenderloin                 0.3355153
Mission                    0.2686959
Visitacion Valley          0.2657784
Excelsior                  0.2420113
Bayview Hunters Point      0.2379668

Now I’ll do the same for census blocks.

intersect_tbl_blocks <- 
  data %>% 
  st_as_sf(coords = c("x", "y"), crs = 4326) %>% 
  st_intersects(
    blocks,
    sparse = FALSE
  ) %>% 
  as_tibble() 

colnames(intersect_tbl_blocks) <- 
  blocks %>% 
  pull(geoid)

incident_blocks <- 
  intersect_tbl_blocks %>% 
  mutate(
    id = data %>% pull(id)
  ) %>% 
  select(id, everything()) %>% 
  gather(key = block, value = val, -id) %>% 
  filter(val) %>% 
  select(-val)

data <- 
  data %>% 
  left_join(incident_blocks, by = "id") %>% 
  distinct(id, category, .keep_all = TRUE)

blocks_resolved <-
  data %>% 
  left_join(
    eviction_lab %>% select(geoid, population),
    by = c("block" = "geoid")
  ) %>% 
  filter(!is.na(block)) %>% 
  group_by(block, population) %>% 
  summarise(
    n_incident = n(),
    prop_resolved = 1 - (sum(resolution == "none") / n()),
    prop_arrested = 
      sum(
        resolution == "arrest, booked" | resolution == "arrest, cited"
      ) / 
      n()
  ) %>% 
  ungroup()

blocks_resolved %>% 
  arrange(desc(prop_resolved)) %>% 
  top_n(n = 5, wt = prop_resolved) %>% 
  knitr::kable()
block         population  n_incident  prop_resolved  prop_arrested
060750124011        2061         807      0.5452292      0.5303594
060750260032        2189         130      0.4769231      0.4461538
060750124021         995        1222      0.4623568      0.4451718
060750301021        2789         168      0.4345238      0.4226190
060750124012        2726         822      0.4330900      0.4111922

Now, these statistics seem interesting, but they’re pretty meaningless without knowing where these blocks sit in the city. Let’s visualize them, and let’s also include the locations of the city’s police stations.

blocks_resolved %>%
  left_join(blocks, by = c("block" = "geoid")) %>% 
  ggplot() +
  geom_sf(aes(fill = prop_resolved), size = 0) +
  geom_point(data = stations, aes(y, x), color = "navy", size = 3) + 
  geom_text(
    data = stations, 
    aes(y, x), 
    color = "yellow", 
    label = "★", 
    family = "HiraKakuPro-W3",
    size = 1.5
  ) +
  coord_sf(datum = NA, ylim = c(37.71, 37.83), xlim = c(-122.525, -122.32)) + 
  theme_void() +
  scale_fill_gradientn(
    colors = c("#fee8c8", "#fdbb84", "#e34a33"),
    labels = scales::percent
  ) + 
  labs(
    title = "San Francisco policing trends by block",
    fill = "Police incidents resolved"
  ) +
  guides(
    fill = guide_colorbar(
      nbin = 5, 
      barheight = .25,
      barwidth = 9,
      raster = FALSE,
      ticks = FALSE,
      title.position = "top"
    ),
    size = guide_legend(
      title.position = "top"
    )
  ) + 
  theme(
    legend.position = "bottom"
  )

When we group by neighborhood, we can clearly see that the Tenderloin is something of an outlier.

top_10_nhoods <- 
  data %>% 
  count(nhood) %>% 
  arrange(desc(n)) %>% 
  top_n(n = 10, w = n) %>% 
  pull(nhood)

nhood_labels <- 
  neighborhoods %>%
  filter(nhood %in% top_10_nhoods) %>% 
  mutate(
    cent = st_centroid(geometry),
    x = str_extract(cent, "-122.*\\s") %>% parse_number(),
    y = str_extract(cent, "37\\..*") %>% parse_number()
  ) %>%
  st_set_geometry(value = NULL) 

nhood_resolved %>% 
  left_join(neighborhoods, by = "nhood") %>% 
  ggplot() + 
  geom_sf(aes(fill = prop_resolved), size = 0) +
  geom_point(data = stations, aes(y, x), color = "navy", size = 3) + 
  geom_text(
    data = stations, 
    aes(y, x), 
    color = "yellow", 
    label = "★", 
    family = "HiraKakuPro-W3",
    size = 1.5
  ) +
  ggrepel::geom_text_repel(
    data = nhood_labels,
    aes(x, y, label = nhood),
    point.padding = .15
  ) +
  coord_sf(datum = NA) + 
  theme_void() + 
  scale_fill_gradientn(
    colors = c("#fee8c8", "#fdbb84", "#e34a33"),
    labels = scales::percent
  ) + 
  labs(
    title = "San Francisco policing trends by neighborhood",
    fill = "Police incidents resolved"
  ) +
  guides(
    fill = guide_colorbar(
      nbin = 5, 
      barheight = .25,
      barwidth = 9,
      raster = FALSE,
      ticks = FALSE,
      title.position = "top"
    ),
    size = guide_legend(
      title.position = "top"
    )
  ) + 
  theme(
    legend.position = "bottom"
  )

Since we’re looking at the locations of different police stations, it might be interesting to look at how resolution rates differ across police districts. In fact, this might be a more informative geographic unit of analysis, since presumably different police stations treat their respective districts differently. The only problem is that there don’t appear to be police district shapefiles readily available to help us run the same analysis we conducted for blocks and neighborhoods.

Luckily, each incident is labeled with a district name, and that should be just enough given the richness of the location data we’re working with. Using sf::st_combine(), we can build new shapes from smaller pieces, in this case census blocks. Assuming that police districts are drawn along census block lines (which may not be exactly true, but is likely a good approximation), we can use the incident labels to create new geometry objects representing the districts. This lets us create a map similar to the ones we made for blocks and neighborhoods.
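
To see what st_combine() contributes here, consider a toy sketch (the squares below are invented): it gathers several geometries into a single MULTIPOLYGON feature without dissolving their interior borders, which is all we need in order to draw one outline per district. (By contrast, st_union() would also merge the shapes and erase shared edges.)

```r
library(sf)

# Two adjacent unit squares standing in for census blocks
b1 <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
b2 <- st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
blocks_toy <- st_sfc(b1, b2)

# One combined geometry per "district", borders kept
district_shape <- st_combine(blocks_toy)

length(district_shape)
as.character(st_geometry_type(district_shape))
#> [1] "MULTIPOLYGON"
```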

As it turns out, we see some similar patterns, but surprisingly the Tenderloin is even more of an outlier in this map than it was in the census block or neighborhood maps. This might suggest that the resolution rate is in fact tied in some way to the responding police stations and the districts they patrol.

districts <- 
  data %>% 
  st_as_sf(coords = c("x", "y"), crs = 4326, remove = FALSE) %>% 
  count(block, pddistrict) %>% 
  group_by(block) %>% 
  filter(n == max(n)) %>% 
  ungroup() %>% 
  left_join(
    blocks %>% 
      rename(block_geom = geometry) %>% 
      mutate(geoid = as.character(geoid)) %>% 
      select(geoid, block_geom) %>% 
      as_tibble(),
    by = c("block" = "geoid")
  ) %>%
  filter(!is.na(block)) %>% 
  group_by(pddistrict) %>% 
  mutate(
    new_geom = st_combine(block_geom)
  ) %>% 
  ungroup() 

districts_resolved <- 
  data %>% 
  group_by(pddistrict) %>% 
  summarise(
    n_incident = n(),
    prop_resolved = 1 - (sum(resolution == "none") / n())
  ) %>% 
  ungroup()

districts_resolved %>% 
  arrange(desc(prop_resolved)) %>% 
  top_n(n = 5, wt = prop_resolved) %>% 
  knitr::kable()
pddistrict  n_incident  prop_resolved
tenderloin        7929      0.4025728
mission          20152      0.2702461
bayview          12770      0.2299922
ingleside        10407      0.2294609
southern         26746      0.2281463

districts_resolved %>% 
  left_join(
    districts %>% 
      st_set_geometry(value = NULL) %>% 
      select(pddistrict, new_geom) %>% 
      distinct(pddistrict, .keep_all = TRUE),
    by = "pddistrict"
  ) %>% 
  ggplot() + 
  geom_sf(aes(geometry = new_geom, fill = prop_resolved), size = 0) +
  geom_point(data = stations, aes(y, x), color = "navy", size = 3) + 
  geom_text(
    data = stations, 
    aes(y, x), 
    color = "yellow", 
    label = "★", 
    family = "HiraKakuPro-W3",
    size = 1.5
  ) +
  ggrepel::geom_text_repel(
    data = 
      stations %>% 
      mutate(
        district_name = 
          word(district_name, 1) %>%
          str_to_title()
        ), 
    aes(y, x, label = district_name),
    point.padding = .15
  ) + 
  coord_sf(datum = NA, ylim = c(37.71, 37.83), xlim = c(-122.525, -122.32)) + 
  scale_fill_gradientn(
    colors = c("#fee8c8", "#fdbb84", "#e34a33"),
    labels = scales::percent
  ) + 
  labs(
    title = "San Francisco policing trends by police district",
    fill = "Police incidents resolved"
  ) +
  guides(
    fill = guide_colorbar(
      nbin = 5, 
      barheight = .25,
      barwidth = 9,
      raster = FALSE,
      ticks = FALSE,
      title.position = "top"
    ),
    size = guide_legend(
      title.position = "top"
    )
  ) + 
  theme_void() + 
  theme(
    legend.position = "bottom"
  )

Conclusion

What can we make of this information? Why are incidents in the Tenderloin seemingly so likely to end in arrest compared to other neighborhoods and to other police districts? I don’t want to speculate too much, but I imagine it has something to do with the police department’s historical relationship to the populations and communities residing there. In many ways, the Tenderloin is underprivileged and underserved, and police there might respond differently to incidents than they would in other parts of the city.

To learn more, I’d want to speak with residents about their experiences and perhaps with officers who patrol the district. It’s important to recognize that in this case, as in many, data is better used as a means to motivate thoughtful questions rather than to find complete answers. The disparity we’ve observed is worth exploring further, but ultimately won’t be entirely explained by the limited data we have available here.