ggplot2 is based on the grammar of graphics, a coherent system for describing and building graphs.
With ggplot2, you can build every graph from the same components:
To display values, map variables in the data to visual properties of the geom (aesthetics) like size, color, and x and y locations.
Complete the template below to build a graph.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
ggplot(data = mpg, aes(x = cty, y = hwy))
qplot creates a complete plot with given data, geom, and mappings. Supplies many useful defaults.
last_plot() returns the last plot
ggsave(“plot.png”, width = 5, height = 5) saves last plot as 5’ x 5’ file named “plot.png” in working directory.
You can add a third varible, like class
, to a two dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot.
Aesthetics include things like:
class
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
size
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.
# Left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# Right
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
set the aesthetic properties manually - outside of aes()
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
A geom is the geometrical object that a plot uses to represent data.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
ggplot2 provides over 30 geoms, and extension packages provide even more (see https://www.ggplot2-exts.org for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at http://rstudio.com/cheatsheets. To learn more about any single geom, use help: ?geom_smooth
.
In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype
example).
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv),
show.legend = FALSE
)
To display multiple geoms:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
The diamonds
dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price
, carat
, color
, clarity
, and cut
of each diamond.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. The figure below describes how this process works with geom_bar()
.
?geom_bar
shows that the default value for stat
is “count”, which means that geom_bar()
uses stat_count()
.
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)
ggplot(data = demo) +
geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.ymin = min,
fun.ymax = max,
fun.y = median
)
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
The stacking is performed automatically by the position adjustment specified by the position
argument. If you don’t want a stacked bar chart, you can use one of three other options:
"identity"
,"dodge"
"fill"
. ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
geom_bar(fill = NA, position = "identity")
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
The values of hwy
and displ
are rounded so the points appear on a grid and many points overlap each other. This problem is known as overplotting.
position = "jitter"
adds a small amount of random noise to each point.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point
coord_flip()
switches the x and y axes. ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
coord_quickmap()
sets the aspect ratio correctly for maps.#install.packages("maps")
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
coord_polar()
uses polar coordinates. bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()
R tools to access open data from Eurostat database
Data in the Eurostat database is stored in tables. Each table has an identifier, a short table_code, and a description (e.g. tsdtr420 - People killed in road accidents).
Key eurostat functions allow to find the table_code, download the eurostat table and polish labels in the table.
Find the table code
The search_eurostat(pattern, …) function scans the directory of Eurostat tables and returns codes and descriptions of tables that match pattern.
# install.packages("eurostat")
library(eurostat)
query <- search_eurostat("road", type = "table")
query[3:100,1:2]
Download the table
The get_eurostat(id, time_format = “date”, filters = “none”, type = “code”, cache = TRUE, …) function downloads the requested table from the Eurostat bulk download facility or from The Eurostat Web Services JSON API (if filters are defined). Downloaded data is cached (if cache=TRUE). Additional arguments define how to read the time column (time_format) and if table dimensions shall be kept as codes or converted to labels (type).
dat <- get_eurostat(id = "sdg_11_40", time_format = "num")
## Table sdg_11_40 cached at C:\Users\ncloud\AppData\Local\Temp\2\RtmpmmhtcB/eurostat/sdg_11_40_num_code_TF.rds
head(dat)
Add labels
The label_eurostat(x, lang = “en”, …) gets definitions for Eurostat codes and replace them with labels in given language (“en”, “fr” or “de”).
dat <- label_eurostat(dat)
head(dat, 10)
The get_eurostat() function returns tibbles in the long format. Packages dplyr and tidyr are well suited to transform these objects. The ggplot2 package is well suited to plot these objects.
t1 <- get_eurostat("sdg_11_40", filters = list(geo = c("UK", "FR", "PL", "ES", "PT"), unit = c("NR")))
t1
library(ggplot2)
ggplot(t1, aes(x = time, y = values, color = geo, group = geo, shape = geo)) +
geom_point(size = 2) +
geom_line() +
theme_bw() +
labs(title="Road accidents", x = "Year", y = "Victims")
library("dplyr")
t2 <- t1 %>% filter(time == "2014-01-01")
ggplot(t2, aes(geo, values, fill=geo)) +
geom_bar(stat = "identity") + theme_bw() +
theme(legend.position = "none")+
labs(title="Road accidents in 2014", x="", y="Victims")