lab3.Rmd - Grip

title: "Lab3 Solutions" output: github_document

Exercise 1 - Create a chloropleth map of Canada to visualize statistics

library(maptools)
library(rgdal)
library(ggmap)
library(tidyverse)
library(stringr)
library(gridExtra)

# Read in shape files and store it as SpatialPolygonsDataFrame object
tract <- readOGR(dsn = 'data/gpr_000a11a_e', layer = 'gpr_000a11a_e')
gpclibPermit()
# Transform into R readable dataframe
tract <- fortify(tract, region='PRENAME')

# Read in physical activity during leisure time, by sex, provinces and territories data
health <- read_csv('data/health78a-eng.csv')

# Edit column names
colnames(health) <- c('id', 2010:2014)

# Wrangle/clean data to include overall province count by year only
health_edit <- health %>%
  filter(row_number() %in% 7:(nrow(.) - 1)) %>% 
  filter(id != 'Males' & id != 'Females') %>%
  gather(year, count, c(`2010`, `2011`, `2012`, `2013`, `2014`)) %>% 
  mutate(count = as.numeric(str_replace_all(count, '[[:punct:]]', '')), year = as.numeric(year)) %>% 
  select(id, count, year)

# Join health data with tract
tract_health <- tract %>% left_join(health_edit)

# Make a map using ggplot
health_vis_2014 <- tract_health %>% 
  filter(year == 2014) %>% 
  ggplot() +
  geom_map(map = tract_health, aes(map_id = id, x = long, y = lat, group = id, fill = count / 1e6), color = 'black', size = 0.25) +
  coord_map('ortho') +
  theme_nothing(legend = TRUE) +
  theme(plot.title = element_text(size = 16)) +
  labs(title = 'Physical activity during leisure time by province/territory in 2014',
       fill = '') +
  scale_fill_continuous(low = "#6e96b7", high = "#01182b", guide = "colourbar") +
  guides(fill = guide_colorbar(title = "Number of persons in millions", barheight = 10, ticks = FALSE))

print(health_vis_2014)

We can see that in 2014, Ontario has the highest physical activity during leisure time with Quebec in second and BC in third. We can also see that, in general, Northern Canada is much less physically active during leisure time than Southern.

First of all, I reversed the scale since it made more sense to me that the most heavily populated areas are the darkest. Also, this visual is not very effective since these are absolute numbers and do not take population into account. Clearly more heavily populated places would have higher rates. Here, one quantitative attribute is used per region, encoded with a continuous colormap. The mark used in this case is area, and channels are color and position. This map is good to see density and comparing exercise levels across provinces. The tasks that this kind of plot is useful for is to see how various geographical areas compare to one another for a specific statistic or measurement. The choropleth map provides an easy way to visualize how a measurement varies across a geographic area or it shows the level of variability within a region. While there are millions of color variations the human eye is limited to how many colors it can easily distinguish. Generally five to seven color categories is recommended.

Here the scalability is moderate. There is only one attribute shown per region, and while this map shows only 13 regions we could theoretically fit hundreds of regions within a single screen. The main limits to scalability is when there is large variance in region sizes, which would lead to the color differences shown in small regions being less visually salient than in large regions.

I chose blue as the color here because it is generally a safe color to go with and shows the variation clearly. Here we change the saturation in the monochromatic color blue to show the variation in physical activity. The attribute characteristic is sequential because it's in one direction.

Exercise 2 - Correcting count data for population size

2a

A chloropleth map that illustrates the number of asthma cases per Canadian province/territory for the year of 2014.

# Read in asthma data
asthma <- read_csv('data/health50a-eng.csv')

# Edit column names
colnames(asthma) <- c('id', 2010:2014)

# Wrangle/clean data to include overall province count
asthma_2014 <- asthma %>% 	
  filter(row_number() %in% 6:(nrow(.) - 1)) %>% 
  filter(id != 'Males' & id != 'Females') %>%
  mutate(`2014` = as.numeric(str_replace_all(`2014`, '[[:punct:]]', ''))) %>% 
  select(id, `2014`)

#Join asthma data with tract
tract_asthma <- left_join(tract, asthma_2014)

#Make a map using ggplot
asthma_vis_2014 <- tract_asthma %>% 
  ggplot() +
  geom_map(map = tract_asthma, aes(map_id = id, x = long, y = lat, group = id, fill = `2014`/1000), color = 'black', size = 0.25) +
  coord_map('ortho') +
  theme_nothing(legend = TRUE) +
  labs(title = 'Number of asthma cases by province/territory in 2014',
       fill = '') + 
  scale_fill_continuous(low = "#6e96b7", high = "#01182b", guide = "colourbar",
                        breaks = c(10, 250, 500, 750, 910)) +
  guides(fill = guide_colorbar(title = "\n\nNumber of asthma cases in thousands", barheight = 10, ticks = FALSE))

2b

A chloropleth map that illustrates the number of asthma cases per Canadian province/territory for the year of 2014, standardized to the population size of that province

# Read in population data
pop <- read_csv('data/canadian_pop_by_prov.csv')

#Wrangle/clean data to include overall province population
pop_2014 <- pop %>%
  filter(Ref_Date == '2014/10' & GEO != 'Canada') %>% 
  select(id = GEO, population = Value)

#Join pop_2014 data with tract_asthma and add per_1000 column
tract_asthma_pop <- left_join(tract_asthma, pop_2014) %>% 
  mutate(per_1000 = `2014`/(population/1000))


#Make a map using ggplot
asthma_pop_vis_2014 <- tract_asthma_pop %>% 
  ggplot() +
  geom_map (map = tract_asthma_pop, aes(map_id = id, x = long, y = lat, group = id, fill = per_1000), color = 'black', size = 0.25) +
  coord_map('ortho') +
  theme_nothing(legend = TRUE) +
  labs(title = 'Number of asthma cases per 1000 people in 2014',
       fill = '') +
  scale_fill_continuous(low = "#6e96b7", high = "#01182b", guide = "colourbar", breaks = c(43, 50, 60, 70, 78)) +
  guides(fill = guide_colorbar(title = "\n\nNumber of asthma cases per 1000 people", barheight = 10, ticks = FALSE))

grid.arrange(asthma_vis_2014, asthma_pop_vis_2014, nrow = 2, heights = unit(c(4, 4), "in"))

2c

The two visualizations share similar characteristics (stylistically) as the one from question 1 (marks/channels/etc.). In the first visualization, we look at number of asthma cases in absolute terms per province, without considering population. In that case, from the viz alone, we can see that Ontario and Quebec have the highest number of asthma case. We see the lowest numbers in the north. If we look at the second plot, which takes into account population and number of cases per 1000 people, we see quite different results. In this case we are looking at proportions. Quebec has the highest number of asthma cases per 1000 people not Ontario. We can also see darker shades in some northern parts that were light in the previous plot. For example, the Northwest territories have a relatively higher number of asthma cases per 1000 people. It has an overall low number (plot 1) because its population is very low, compared to Ontario for example, so it would be expected that it is not comparable.

Exercise 3 - Plotting points on a map using ggmap

I downloaded the Canadian earthquake data between 12/1/2016 and 12/7/2016.


# Read in earthquake data between 12/1/2016 and 12/7/2016
quakes <- read_csv('data/EQCanOB_20161208.083845.csv', skip = 1)

#Wrangle/clean data to include overall province population
quakes_2016 <- quakes[1:length(quakes$Lat)-1, ] %>%
  mutate(Mag = as.numeric(str_replace_all(Mag, '[[:alpha:]]', '')), Depth = as.numeric(str_replace_all(Depth, '[[:alpha:]]|[[*]]', ''))) %>% 
  select(lat = Lat, long = Long, depth = Depth, magnitude = Mag)

#Overlay quakes data on map
quakes_vis <- ggmap(get_map(location = 'canada', source = 'google', maptype = 'terrain', crop = FALSE, zoom = 3)) +
  geom_point(data = quakes_2016, aes(x = long, y = lat, size = magnitude, color = depth), alpha = 0.8) +
  scale_size("Richter magnitude") +
  scale_colour_gradient(high = "#132B43", low = "#56B1F7") + 
  labs(title = 'Earthquakes in Canada in Week 1 of December 2016', x = '', y = '') +
  guides(color = guide_colorbar(ticks = FALSE, title = "Depth (km)")) +
  scale_y_continuous(limits = c(40, 75)) +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())	

print(quakes_vis)

Discuss the results of your visualization (what did you find out about the data by creating the visualization).

I visualized only one week's worth of data (Week 1 of December 2016). I could see that most of the earthquakes from last week were concentrated in the far East and far West of the country, and mostly in coastal areas. It seems that most of these quakes had similar magnitudes (2-3 range), with the ones in the West (close to BC) having a range of depths.

The data is effective visually depending on the purpose. If we want to see the frequency of earthquakes an their magnitude/depths then yes this vis is effective. The marks used in this case are points, the channels are position, color (for depth), and size (for magnitude). This kind of visual is effective to see locations of certain events and we can even infer densities. In the first visual, I assigned rounded values to the magnitudes to get a better representation size wise. In that case I think it is acceptable to have overlapping points as they suggest a density of earthquakes in that specific location. I used blue as the color because I believe it represents the points well. Here we altered saturation and transparency for effect. The data in this case is sequential with regards to depth and magnitude, only one direction. Also, the scalability here depends on purpose and size of map. Based on my research of seismic maps online, the first visual is perfectly good to show earthquakes in a given time period. It depends on the size of the map, the distribution of the points, and what we are trying to accomplish. Here I would say scalability would be hundreds for observations, and singles for levels of each attribute. Too many different sizes for magnitude would make the distinction more difficult.

Exercise 4 - Contour heat maps with ggmap

Download the .csv file for the Vancouver Police Department Crime dataset (or find another dataset with similar data) and create a contour heat map (also known as a contour plot, a filled isocontour plot, or a level plot) with ggmap. Refer to Lesson # 3 in this tutorial as a guide on how to do this.


# Read in crime data
crime <- read_csv('data/crime_csv_all_years.csv')


#Bind rows and transform to long/lat
crime_utm <- SpatialPoints(cbind(crime$X, crime$Y), proj4string = CRS("+proj=utm +zone=10"))
crime_lat_long <- data.frame(spTransform(crime_utm, CRS("+proj=longlat")))

#Get map
crime_vis <- ggmap(get_map(location = c(-123.1207, 49.2827),
                           zoom = 14,
                           maptype = "roadmap"),
                   extent = "device") +
  stat_density2d(data = crime_lat_long,
                 aes(x = coords.x1, 
                     y = coords.x2, 
                     fill = ..level..,
                     alpha= ..level..), 
                 bins = 15, 
                 geom = "polygon") +
  scale_fill_gradient(low = "black", high = "red", 
                      guide = guide_colorbar(title = "crimes reported since 2003")) +
  ggtitle("Map of Crime Density in Vancouver") +
  guides(alpha = FALSE)

print(crime_vis)

We can see from this visualization that the highest rates of crime happen in Granville (12500) and some crime in Hastings with some crime sprawl around those areas too.

This vis uses area marks. The color saturation and transparency channels are used to redundantly encode the crime density. This supports comparing crime (or other factor) levels in different map areas. Scalability in this case would be very large for observations, easily could be millions. It is possible to have large amounts of data since the density will adjust appropriately according to the ratios. Here we have a red to black saturation. The data is sequential and luminance is the shades of red, where higher shades of re represent higher crime density. Red and black have a strong enough disparity to provide for good colors in this case.