Create maps in R in 10 (fairly) easy steps

01.03.2016

One problem with your typical election map is that it lets you see which candidate won where, but that's about it. In classic presidential election maps, that means it's easy to see if a state voted Democrat or Republican, but not necessarily which candidate actually won, thanks to large differences in U.S. population densities — giving us states like Montana, which is 4th-largest in area but tied for dead last in number of electoral votes.

That's why, if you're a data geek, it can be interesting to make your own election maps to visualize the questions that are important to you. For example: Where were a candidate's real areas of strength and weakness? Which areas contributed most to victory? This goal of better mapping trends can work with many other types of data, from sales figures to mobile data coverage.

There are many options for mapping. If you do this kind of thing often or want to create a map with lots of slick bells and whistles, it could make more sense to learn GIS software like Esri's ArcGIS or open-source QGIS. If you care only about well-used geographic areas such as cities, counties or zip codes, software like Tableau and Microsoft Power BI may have easier interfaces. If you don't mind drag-and-drop tools and having your data in the cloud, there are still more options such as Google Fusion Tables.

But there are also advantages to using R -- a language designed for data analysis and visualization. It's open source, which means you don't have to worry about ever losing access to (or paying for) your tools. All your data stays local if you want it to. It's fully command-line scripted end-to-end, making an easily repeatable process in a single platform from data input and re-formatting through final visualization. And, R's mapping options are surprisingly robust.

Ready to code your own election results maps -- or any other kind of color-coded choropleth map? Here's how to handle a straightforward two-person race and a more complex race with three or more candidates in R.

We'll be using two mapping packages in this tutorial: tmap for quick static maps and leaflet for interactive maps. You can install and load them now with:

install.packages("tmap")
install.packages("leaflet")
library("tmap")
library("leaflet")

(Skip the install.packages lines for any R packages that are already on your system.)

I'll start with the New Hampshire Democratic primary results, which are available from the NH secretary of state's office as a downloadable Excel spreadsheet.

Getting election data into the proper format for mapping is one of this project's biggest challenges -- more so than actually creating the map. For simplicity, let's stick to results by county instead of drilling down to individual towns and precincts.

One common problem: Results data need to have one column with all election district names — whether counties, precincts or states — and candidate names as column headers. Many election returns, though, are reported with each election district in its own column and candidate results by row.

That's the case with the official NH results. I transposed the data to fix that and otherwise cleaned up the spreadsheet a bit before importing it into R (such as removing ", d" after each candidate's name). The first column now has county names, while every additional column is a candidate name; each row is a county result. I also got rid of the "total" row at the bottom, which can interfere with data sorting.

You can do the same -- or, if you'd like to download the data file and all the other files I'm using, including R code, head to the "Mapping with R" file download page below. (Free Insider registration needed. Bonus: You'll be helping me convince my boss that I ought to write more of these types of tutorials). If you download and unzip the mapping with R file, look for NHD2016.xlsx in the zip file.
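The transposing described above was done in the spreadsheet, but the same reshaping could also be sketched in R. This is a minimal illustration with made-up numbers and placeholder county names, not the actual NH file; the object names raw, flipped and results are hypothetical.

```r
# Made-up numbers in the layout many official returns use:
# one row per candidate, one column per election district.
raw <- data.frame(
  Candidate = c("Clinton", "Sanders"),
  CountyA = c(10, 20),
  CountyB = c(30, 40),
  stringsAsFactors = FALSE
)

# Transpose everything except the candidate-name column,
# then use the candidate names as column headers
flipped <- as.data.frame(t(raw[, -1]))
names(flipped) <- raw$Candidate

# Move the district names out of the row names and into a real column
results <- data.frame(County = rownames(flipped), flipped,
                      row.names = NULL, stringsAsFactors = FALSE)
results
```

The result has one row per district and one column per candidate, which is the shape the rest of this tutorial expects.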

To make your R mapping script as re-usable as possible, I suggest putting data file names at the top of the script -- that makes it easy to swap in different data files without having to hunt through code to find where a file name appears. You can put this toward the top of your R script:

datafile <- "data/NHD2016.xlsx"

Note: My data file isn't in the same working directory as my R script; I have it in a data subdirectory. Make sure to include the appropriate file path for your system, using forward slashes even on Windows.

There are several packages for importing Excel files into R; but for ease of use, you can't beat rio. Install it with:

install.packages("rio")

if it's not already on your system, and then run:

nhdata <- rio::import(datafile)

to store data from the election results spreadsheet into a variable called nhdata.

There were actually 28 candidates in the results; but to focus on mapping instead of data wrangling, let's not worry about the many minor candidates and pretend there were just two: Hillary Clinton and Bernie Sanders. Select just the County, Clinton and Sanders columns with:

nhdata <- nhdata[,c("County", "Clinton", "Sanders")]

Now we need to think about what exactly we'd like to color-code on the map. We need to pick one column of data for the map's county colors, but all we have so far is raw vote totals. We probably want to calculate either the winner's overall percent of the vote, the winner's percentage-point margin of victory or, less common, the winner's margin expressed by number of votes (after all, winning by 5 points in a heavily populated county might be more useful than winning by 10 points in a place with way fewer people if the goal is to win the entire state).

It turns out that Sanders won every county; but if he didn't, we could still map the Sanders "margin of victory" and use negative values for counties he lost.

Let's add columns for candidates' margins of victory (or loss) and percent of the vote, again for now pretending there were votes cast only for the two main candidates:
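A minimal sketch of those calculations follows. The column names SandersMarginVotes, SandersMarginPctgPoints and ClintonPct are the ones used later in this tutorial; SandersPct and twoway_total are assumed names, and the stand-in vote counts are made up so the code can run without the spreadsheet.

```r
# Stand-in with made-up vote counts, used only if nhdata
# has not already been imported in the steps above
if (!exists("nhdata")) {
  nhdata <- data.frame(
    County = c("County A", "County B"),
    Clinton = c(400, 550),
    Sanders = c(600, 450),
    stringsAsFactors = FALSE
  )
}

# Total votes cast for the two main candidates only
twoway_total <- nhdata$Sanders + nhdata$Clinton

# Margin in raw votes: negative wherever Sanders lost
nhdata$SandersMarginVotes <- nhdata$Sanders - nhdata$Clinton

# Shares of the two-candidate vote, stored as fractions (0.6 means 60%)
nhdata$SandersPct <- nhdata$Sanders / twoway_total
nhdata$ClintonPct <- nhdata$Clinton / twoway_total

# Margin as a difference in vote share (multiply by 100 for percentage points)
nhdata$SandersMarginPctgPoints <- nhdata$SandersPct - nhdata$ClintonPct
```

Keeping the shares as fractions leaves the formatting for later, when a percent sign is needed for display.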

Whether you're mapping results for your city, your state or the nation, you need geographic data for the area you'll be mapping in addition to election results. There are several common formats for such geospatial data; but for this tutorial, we'll focus on just one: shapefiles, a widely used format developed by Esri.

If you want to map results down to your city or town's precinct level, you'll probably need to get files from a local or state GIS office. For mapping by larger areas like cities, counties or states, the Census Bureau is a good place to find shapefiles.

For this New Hampshire mapping project by county, I downloaded files from the Cartographic Boundary shapefiles page -- these are smaller, simplified files designed for mapping projects where extraordinarily precise boundaries aren't needed. (Files for engineering projects or redistricting tend to be considerably larger).

I chose the national county file at http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_5m.zip and unzipped it within my data subdirectory. With R, it's easy to create a subset for just one state, or more; and now I've got a file I can re-use for other state maps by county as well.

There are a lot of files in that newly unzipped subdirectory; the one you want has the .shp extension. I'll store the name of this file in a variable called usshapefile:

usshapefile <- "data/cb_2014_us_county_5m/cb_2014_us_county_5m.shp"

Several R packages have functions for importing shapefiles into R. I'll use tmap's read_shape(), which I find quite intuitive:

usgeo <- read_shape(file=usshapefile)

If you want to check to see if the usgeo object looks like geography of the U.S., run tmap's quick thematic map command: qtm(usgeo). This may take a while to load and appear small and rather boring, but if you've got a map of the U.S. with divisions, you're probably on the right track.

If you run str(usgeo) to see the data structure within usgeo, it will look pretty unusual if you haven't done GIS in R before. usgeo contains a LOT of data, including slots accessed with @ as well as more familiar entries accessed with $. If you're interested in the ins and outs of this type of geospatial object, known as a SpatialPolygonsDataFrame, see Robin Lovelace's excellent Creating maps in R tutorial, especially the section on "The structure of spatial data in R."

For this tutorial, we're interested in what's in usgeo@data -- the object's "data slot." (Mapping software will need spatial data in the @polygons slot, but that's nothing we'll manipulate directly). Run str(usgeo@data) and that structure should look familiar, much more like a typical R data frame, including columns for STATEFP (state FIPS code), COUNTYFP (county FIPS codes) and NAME (in this case county names, making it easy to match up with county names in election results).

Extracting geodata just for New Hampshire is similar to subsetting any other type of data in R; we just need the state FIPS code for New Hampshire, which turns out to be 33 -- or in this case "33", since the codes aren't stored as integers in usgeo.

Here's the command to extract New Hampshire data using FIPS code 33:

nhgeo <- usgeo[usgeo@data$STATEFP=="33",]

If you want to do a quick check to see if nhgeo looks correct, run the quick thematic map function again with qtm(nhgeo); you should see an outline of New Hampshire divided into counties.

Still somewhat boring, but it looks like the Granite State with county-sized divisions, so it appears we've got the correct file subset.

The next step is to merge the geodata with the election results. Like any database join or merge, this has two requirements: 1) a column shared by each data set, and 2) records stored exactly the same way in both data sets. (Having a county listed as "Hillsborough" in one file and FIPS code "011" in another wouldn't give R any idea how to match them up without some sort of translation table.)

Trust me: You will save yourself a lot of time if you run a few R commands to see whether the nhgeo@data$NAME vector of county names is the same as the nhdata$County vector of county names.

Do they have the same structure?

str(nhgeo@data$NAME)
Factor w/ 1921 levels "Abbeville","Acadia",..: 1470 684 416 1653 138 282 1131 1657 334 791
str(nhdata$County)
chr [1:11] "Belknap" "Carroll" "Cheshire" "Coos" "Grafton"

Whoops, problem number one: The geospatial file lists counties as R factors, while they're plain character text in the data. Change the factors to character strings with:

nhgeo@data$NAME <- as.character(nhgeo@data$NAME)

Next, it is helpful to sort both data sets by county name and then compare.

nhgeo <- nhgeo[order(nhgeo@data$NAME),]
nhdata <- nhdata[order(nhdata$County),]

Are the two county columns identical now? They should be; let's check:

identical(nhgeo@data$NAME, nhdata$County)
[1] TRUE
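If identical() returns FALSE instead, it helps to see exactly which names fail to match. Here is a small sketch using base R's setdiff() on two short stand-in vectors; geo_names and data_names are hypothetical names standing in for nhgeo@data$NAME and nhdata$County, and the mismatch is invented for illustration.

```r
# Stand-ins for the two county-name vectors, with one deliberate mismatch
geo_names  <- c("Belknap", "Carroll", "Coos")
data_names <- c("Belknap", "Carroll", "Coos County")

# FALSE here, because the third entries differ
identical(geo_names, data_names)

# Names that appear in one vector but not in the other
in_geo_only  <- setdiff(geo_names, data_names)
in_data_only <- setdiff(data_names, geo_names)
in_geo_only
in_data_only
```

Both results being empty is a good sign that the join keys line up, although identical() is still the stricter test because it also checks order and type.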

Now we can join the two files. The sp package's merge function is pretty common for this type of task, but I like tmap's append_data() because of its intuitive syntax and because it allows the names of the two join columns to be different.

nhmap <- append_data(nhgeo, nhdata, key.shp = "NAME", key.data="County")

You can see the new data structure with:

str(nhmap@data)

The hard part is done: finding data, getting it into the right format and merging it with geospatial data. Now, creating a simple static map of Sanders' margins by county in number of votes is as easy as:

qtm(nhmap, "SandersMarginVotes")

and mapping margins by percentage:

qtm(nhmap, "SandersMarginPctgPoints")

We can see that there's some difference between which areas gave Sanders the highest percent win versus which ones were most valuable for largest number-of-votes advantage.

For more control over the map's colors, borders and such, use the tm_shape() function, which uses a ggplot2-like syntax to set fill, border and other attributes:
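A sketch of what such a call might look like is below. It assumes the tm_fill() title and palette arguments and the tm_borders() alpha argument from the tmap syntax this tutorial describes; the title text, the label size and the variable name nhstaticmap are illustrative choices. The two-square stand-in (built with the sf package, with made-up values) is only there so the code can run without the real nhmap.

```r
library(tmap)

# Stand-in: two made-up squares with made-up values.
# Skipped if nhmap already exists from the steps above.
if (!exists("nhmap")) {
  library(sf)
  square <- function(x, y) {
    st_polygon(list(rbind(c(x, y), c(x + 1, y), c(x + 1, y + 1),
                          c(x, y + 1), c(x, y))))
  }
  nhmap <- st_sf(
    NAME = c("County A", "County B"),
    SandersMarginVotes = c(1200, 300),
    ClintonPct = c(0.40, 0.45),
    geometry = st_sfc(square(-72, 43), square(-71, 43), crs = 4326)
  )
}

# Build the map layer by layer, ggplot2-style
nhstaticmap <- tm_shape(nhmap) +
  tm_fill("SandersMarginVotes", title = "Sanders Margin, Total Votes",
          palette = "PRGn") +
  tm_borders(alpha = 0.5) +
  tm_text("NAME", size = 0.8)
```

Typing nhstaticmap at the console draws the map. tm_borders() adds the county outlines and tm_text() labels each polygon with the value in the NAME column.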

The tm_shape() call sets the geodata file to be mapped, while tm_fill() sets the data column to use for mapping color values. The "PRGn" palette argument is a ColorBrewer palette of purples and greens -- if you're not familiar with ColorBrewer, you can see the various palettes available at colorbrewer2.org. Don't like the ColorBrewer choices? You can use built-in R palettes or set your own color HEX values manually instead of using a named ColorBrewer option.

There are also a few built-in tmap themes, such as tm_style_classic:

You can save static maps created by tmap by using the save_tmap() function:

The filename extension can be .jpg, .svg, .pdf, .png and several others; tmap will then produce the appropriate file, defaulting to the size of your current plotting window. There are also arguments for width, height, dpi and more; run ?save_tmap for more info.

If you'd like to learn more about available tmap options, package creator Martijn Tennekes posted a PDF presentation on creating maps with tmap as well as tmap in a nutshell.

The next map we'll create will let users click to see underlying data as well as switch between maps, thanks to RStudio's Leaflet package that gives an R front-end to the open-source JavaScript Leaflet mapping library.

For a Leaflet map, there are two extra things we'll want to create in addition to the data we already have: A color palette and pop-up window contents.

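For palette, we specify the data range we're mapping and what kind of color palette we want -- both the particular colors and the type of color scale. There are four built-in types: colorNumeric() for a continuous range of numeric values, colorBin() for numeric values grouped into bins, colorQuantile() for numeric values grouped into quantiles, and colorFactor() for categorical data.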

Create a Leaflet palette with this syntax:

mypalette <- colorFunction(palette = "colors I want", domain = mydataframe$dataColumnToMap)

where colorFunction is one of the four scale types above, such as colorNumeric() or colorFactor(), and "colors I want" is a vector of colors.

Just to change things up a bit, I'll map where Hillary Clinton was strongest, the inverse of the Sanders maps. To map Clinton's vote percentage, we could use this palette:

clintonPalette <- colorNumeric(palette = "Blues", domain=nhmap$ClintonPct)

where "Blues" is a range of blues from ColorBrewer and domain is the data range of the color scale. This can be the same as the data we're actually plotting but doesn't have to be. colorNumeric means we want a continuous range of colors, not specific categories.
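To make the idea of a palette function more concrete, here is a small sketch that builds one palette of each type on made-up values and then calls two of them. The variable names and the two hex colors are illustrative assumptions; each leaflet function returns another function that turns data values into color codes.

```r
library(leaflet)

demo_pct    <- c(0.20, 0.30, 0.35, 0.45, 0.50)  # made-up vote shares
demo_winner <- c("A", "B", "A", "A", "B")       # made-up categories

# Continuous scale: every value gets its own shade
numericPal <- colorNumeric(palette = "Blues", domain = demo_pct)

# Binned scale: values are grouped into ranges first
binPal <- colorBin(palette = "Blues", domain = demo_pct, bins = 3)

# Quantile scale: groups hold roughly equal numbers of values
quantilePal <- colorQuantile(palette = "Blues", domain = demo_pct, n = 2)

# Categorical scale: one color per category
factorPal <- colorFactor(palette = c("#984ea3", "#ff7f00"),
                         domain = demo_winner)

# Calling a palette function returns hex color strings
numericPal(demo_pct)
factorPal(demo_winner)
```

The same pattern applies to clintonPalette: passing it a vector of Clinton percentages returns one color per county.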

We'll also want to add a pop-up window -- what good is an interactive map without being able to click or tap and see underlying data?

Aside: For the pop-up window text display, we'll want to turn the decimal numbers for votes such as 0.7865 into percentages like 78.7%. We could do it by writing a short formula, but the scales package has a percent() function to make this easier. Install (if you need to) and load the scales package:

install.packages("scales")
library("scales")

Content for a pop-up window is just a simple combination of HTML and R variables, such as:
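As an illustration, pop-up text could be assembled along these lines. The data frame nhdemo, its values and the SandersPct column are stand-ins invented for this sketch; the HTML tags are ordinary bold and line-break markup.

```r
library(scales)

# Made-up stand-in for the data behind the map
nhdemo <- data.frame(
  NAME = c("County A", "County B"),
  SandersPct = c(0.60, 0.55),
  ClintonPct = c(0.40, 0.45),
  stringsAsFactors = FALSE
)

# One HTML string per county: bold labels, line breaks, formatted shares
nhpopup <- paste0(
  "<b>County: </b>", nhdemo$NAME,
  "<br><b>Sanders: </b>", percent(nhdemo$SandersPct),
  "<br><b>Clinton: </b>", percent(nhdemo$ClintonPct)
)
nhpopup
```

Because paste0() is vectorized, the result has one entry per row, in the same order as the rows it was built from.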

(If you're not familiar with paste0, it's a concatenate function to join text and text within variables).

Now, the map code:
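The following is a sketch of a call that fits the description below, not necessarily the exact original. The smoothFactor and fillOpacity values, the pop-up wording and the variable name nhleaflet are assumptions; the two-square stand-in (built with the sf package, with made-up values) exists only so the code can run without the real nhmap.

```r
library(leaflet)
library(scales)

# Stand-in: two made-up squares with made-up values.
# Skipped if nhmap already exists from the steps above.
if (!exists("nhmap")) {
  library(sf)
  square <- function(x, y) {
    st_polygon(list(rbind(c(x, y), c(x + 1, y), c(x + 1, y + 1),
                          c(x, y + 1), c(x, y))))
  }
  nhmap <- st_sf(
    NAME = c("County A", "County B"),
    SandersMarginVotes = c(1200, 300),
    ClintonPct = c(0.40, 0.45),
    geometry = st_sfc(square(-72, 43), square(-71, 43), crs = 4326)
  )
}

# Palette and pop-up text, built from the same object that is mapped
clintonPalette <- colorNumeric(palette = "Blues", domain = nhmap$ClintonPct)
nhpopup <- paste0("<b>County: </b>", nhmap$NAME,
                  "<br><b>Clinton: </b>", percent(nhmap$ClintonPct))

nhleaflet <- leaflet(nhmap) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(stroke = FALSE,
              smoothFactor = 0.2,
              fillOpacity = 0.8,
              popup = nhpopup,
              color = ~clintonPalette(ClintonPct))
```

Typing nhleaflet at the console opens the interactive map in the viewer or browser.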

Let's go over the code. leaflet(nhmap) creates a leaflet map object and sets nhmap as the data source. addProviderTiles("CartoDB.Positron") sets the background map tiles to CartoDB's attractive Positron design. There's a list of free background tiles and what they look like on GitHub if you'd like to choose something else.

The addPolygons() function does the rest -- putting the county shapes on the map and coloring them accordingly. stroke=FALSE says no border around the counties, fillOpacity sets the opacity of the colors, popup sets the contents of the popup window, and color sets both the palette and the data that should be mapped to the color. The tilde before the palette name turns the expression into a formula, which leaflet evaluates against the map's data source.

The Leaflet package has a number of other features we haven't used yet, including adding legends and the ability to turn layers on and off. Both will be very useful when mapping a race with three or more candidates, such as the current Republican primary.

Let's look at the GOP results in South Carolina among the top three candidates. I won't go over the data wrangling on this, except to say that I downloaded results from the South Carolina State Election Commission as well as Census Bureau data for education levels by county. If you download the project files, you'll see the initial data as well as the R code I used to add candidate vote percentages and join all that data to the South Carolina shapefile. That creates a geospatial object scmap to map.

There's so much data for a multi-candidate race that it's a little more complicated to choose what to color beyond "who won." I decided to go with one map layer to show the winner in each county, one layer each for the top three candidates (Trump, Rubio and Cruz) and a final layer showing percent of adult population with at least a bachelor's degree. (Why education? Some news reports out of South Carolina said that seemed to correlate with levels of Trump's support; mapping that will help show such a trend.)

In making my color palettes, I decided to use the same numerical scale for all three candidates. If I scaled color intensity for each candidate's minimum and maximum, a candidate with 10% to 18% would have a map with the same color intensities as one who had 45% to 52%, which gives a wrong impression of the losing candidate's strength. So, first I calculated the minimum and maximum for the combined Trump/Rubio/Cruz county results:
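One way to compute that shared range is sketched here on a made-up stand-in; the data frame scdemo and its column names (TrumpPct, RubioPct, CruzPct) are hypothetical and may differ from the columns in the project files.

```r
# Made-up county percentages; column names are hypothetical
scdemo <- data.frame(
  County = c("County A", "County B", "County C"),
  TrumpPct = c(0.30, 0.42, 0.35),
  RubioPct = c(0.25, 0.20, 0.28),
  CruzPct = c(0.22, 0.24, 0.19),
  stringsAsFactors = FALSE
)

# Pool all three candidates' percentages, then take the overall extremes
allpcts <- c(scdemo$TrumpPct, scdemo$RubioPct, scdemo$CruzPct)
minpct <- min(allpcts)
maxpct <- max(allpcts)
```

With these stand-in values, minpct comes from the Cruz column and maxpct from the Trump column, which is exactly why a per-candidate scale would be misleading.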

Now I can create a palette for each candidate using different colors but the same intensity range.
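A sketch of three such palettes follows. The ColorBrewer palette names chosen here ("Purples", "Reds", "Oranges") and the hard-coded range are illustrative assumptions; the key point is that all three share the same domain.

```r
library(leaflet)

# Hypothetical overall minimum and maximum across all three candidates
minpct <- 0.19
maxpct <- 0.42

# Different colors, identical numeric scale
trumpPalette <- colorNumeric(palette = "Purples", domain = c(minpct, maxpct))
rubioPalette <- colorNumeric(palette = "Reds", domain = c(minpct, maxpct))
cruzPalette  <- colorNumeric(palette = "Oranges", domain = c(minpct, maxpct))

# The same share gets the same relative intensity on every candidate's scale
trumpPalette(0.30)
rubioPalette(0.30)
cruzPalette(0.30)
```

Because the domain is shared, a 20% county looks pale on every layer and a 40% county looks dark, whichever candidate is being shown.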

I'll also add palettes for the winner and education layers:
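These two palettes might be built as follows; the stand-in vectors, the hex colors and the variable names are assumptions made for this sketch. The winner is a category, so it calls for colorFactor(), while the education figure is numeric and can use colorNumeric().

```r
library(leaflet)

# Made-up stand-ins for the winner and education columns
demo_winner  <- c("Trump", "Trump", "Rubio")
demo_college <- c(0.18, 0.22, 0.40)

# Categorical palette: one color per winning candidate
winnerPalette <- colorFactor(palette = c("#984ea3", "#e41a1c"),
                             domain = demo_winner)

# Continuous palette for share of adults with a college degree
edPalette <- colorNumeric(palette = "Blues", domain = demo_college)

winnerPalette(demo_winner)
edPalette(demo_college)
```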

Finally, I'll create a basic pop-up showing the county name, who won, the percentage for each candidate and percent of population with a college degree:
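A sketch of such a pop-up, again on a made-up stand-in: the data frame scdemo and all of its column names are hypothetical, and the education figure is assumed to be stored as a fraction like the vote shares.

```r
library(scales)

# Made-up stand-in; column names are hypothetical
scdemo <- data.frame(
  County = c("County A", "County B"),
  winner = c("Trump", "Rubio"),
  TrumpPct = c(0.38, 0.28),
  RubioPct = c(0.20, 0.30),
  CruzPct = c(0.25, 0.19),
  PctCollegeDegree = c(0.18, 0.35),
  stringsAsFactors = FALSE
)

# County, winner, each candidate's share, then the education figure
scpopup <- paste0(
  "<b>County: </b>", scdemo$County,
  "<br><b>Winner: </b>", scdemo$winner,
  "<br>Trump: ", percent(scdemo$TrumpPct),
  "<br>Rubio: ", percent(scdemo$RubioPct),
  "<br>Cruz: ", percent(scdemo$CruzPct),
  "<br>College degree: ", percent(scdemo$PctCollegeDegree)
)
scpopup
```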

This shows a basic map of winners by county:

A multi-layer map with layer controls starts off the same as our previous map, with one addition: A group name. In this case, each layer will be its own group, but it's also possible to turn multiple layers on and off together.

The next step is to add additional polygon layers for each candidate and a final layer for college education, along with a layer control to wrap up the code. This time, we'll store the map in a variable and then display it:
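Here is a sketch of how the layered map could be assembled. Everything about the stand-in scdemo (geometry, values, column names) is made up, and the styling values and group labels are illustrative choices; with the real data, scmap and its own column names would take the stand-in's place.

```r
library(leaflet)
library(sf)

# Stand-in for scmap: two made-up squares with made-up values
square <- function(x, y) {
  st_polygon(list(rbind(c(x, y), c(x + 1, y), c(x + 1, y + 1),
                        c(x, y + 1), c(x, y))))
}
scdemo <- st_sf(
  County = c("County A", "County B"),
  winner = c("Trump", "Rubio"),
  TrumpPct = c(0.38, 0.28),
  RubioPct = c(0.20, 0.30),
  CruzPct = c(0.25, 0.19),
  PctCollegeDegree = c(0.18, 0.35),
  geometry = st_sfc(square(-82, 33), square(-81, 33), crs = 4326)
)

# Shared scale for the candidates, separate palettes for winner and education
pctRange <- range(c(scdemo$TrumpPct, scdemo$RubioPct, scdemo$CruzPct))
trumpPalette  <- colorNumeric("Purples", domain = pctRange)
rubioPalette  <- colorNumeric("Reds", domain = pctRange)
cruzPalette   <- colorNumeric("Oranges", domain = pctRange)
winnerPalette <- colorFactor(c("#984ea3", "#e41a1c"), domain = scdemo$winner)
edPalette     <- colorNumeric("Blues", domain = scdemo$PctCollegeDegree)
scpopup <- paste0("<b>", scdemo$County, "</b><br>Winner: ", scdemo$winner)

# One addPolygons() call per layer, each tagged with its own group name
scGOPmap <- leaflet(scdemo) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(stroke = TRUE, weight = 1, smoothFactor = 0.2, fillOpacity = 0.75,
              popup = scpopup, color = ~winnerPalette(winner), group = "Winners") %>%
  addPolygons(stroke = TRUE, weight = 1, smoothFactor = 0.2, fillOpacity = 0.75,
              popup = scpopup, color = ~trumpPalette(TrumpPct), group = "Trump") %>%
  addPolygons(stroke = TRUE, weight = 1, smoothFactor = 0.2, fillOpacity = 0.75,
              popup = scpopup, color = ~rubioPalette(RubioPct), group = "Rubio") %>%
  addPolygons(stroke = TRUE, weight = 1, smoothFactor = 0.2, fillOpacity = 0.75,
              popup = scpopup, color = ~cruzPalette(CruzPct), group = "Cruz") %>%
  addPolygons(stroke = TRUE, weight = 1, smoothFactor = 0.2, fillOpacity = 0.75,
              popup = scpopup, color = ~edPalette(PctCollegeDegree),
              group = "College degs") %>%
  addLayersControl(
    baseGroups = c("Winners", "Trump", "Rubio", "Cruz", "College degs"),
    options = layersControlOptions(collapsed = FALSE)
  )
```

The group names passed to addLayersControl() must match the group values given to the polygon layers exactly, since that is how the control knows which layer each radio button switches on.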

And now display the map with:

scGOPmap

addLayersControl can have two types of groups: baseGroups, as used above, which allow only one layer to be viewed at a time; and overlayGroups, where multiple layers can be viewed at once and each turned off individually.

If you're familiar with RMarkdown or Shiny, a Leaflet map can be embedded in an RMarkdown document or Shiny Web application. If you'd like to use this map as an HTML page on a website or elsewhere, save a Leaflet map with the htmlwidgets package's saveWidget() function:

You can also save the map with external resources such as jQuery and the Leaflet JavaScript code in a separate directory by using the selfcontained=FALSE argument and choosing the subdirectory for the dependency files:
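A minimal sketch of the second variant, saving with dependencies in a separate directory. The bare tile map, the file name and the "js" subdirectory name are placeholders chosen for this example, and the file is written to a temporary directory so nothing lands in your project by accident.

```r
library(leaflet)
library(htmlwidgets)

# Any leaflet map will do; this bare tile map is just a placeholder
demoMap <- leaflet() %>% addTiles()

# Write the page to a temporary directory, with its JavaScript and CSS
# dependencies stored in a "js" subdirectory next to the HTML file
outfile <- file.path(tempdir(), "demo_map.html")
saveWidget(widget = demoMap, file = outfile,
           selfcontained = FALSE, libdir = "js")
```

Leaving out the selfcontained and libdir arguments falls back to saveWidget()'s default of a single self-contained HTML file, which is the first variant described above.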

This should get you started on creating your own choropleth maps with R. To see how to create maps with latitude/longitude point markers, see Useful new R packages for data visualization and analysis.


(www.computerworld.com)

Sharon Machlis
