dplyr provides a grammar for manipulating tables in R. This cheat sheet will guide you through the grammar, reminding you how to select, filter, arrange, mutate, summarise, group, and join data frames and tibbles.
dplyr functions work with pipes and expect tidy data.
pipes makes x %>% f(y) becomes f(x, y)
# install.packages('nycflights13')
library(nycflights13)
library(tidyverse)
If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter()
and stats::lag()
.
nycflights13::flights
. This data frame contains all 336,776 flights that departed from New York City in 2013.
flights
key dplyr functions :
filter()
).arrange()
).select()
).mutate()
).summarise()
).These can all be used in conjunction with group_by()
which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
These six functions provide the verbs for a language of data manipulation.
All verbs work similarly:
These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value.
summarise(mtcars, avg = mean(mpg))
count(iris, Species)
VARIATIONS
Use group_by() to create a “grouped” copy of a table. dplyr functions will manipulate each “group” separately and then combine the results.
g_iris <- group_by(iris, Species)
ungroup(x, …) : Returns ungrouped copyof table.
ungroup(g_iris)
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(mpg))
EXTRACT CASES
Row functions return a subset of rows as a new table.
Extract rows that meet logical criteria.
filter(iris, Sepal.Length > 7)
Remove rows with duplicate values.
distinct(iris, Species)
Randomly select fraction of rows.
sample_frac(iris, 0.5, replace = TRUE)
Randomly select size rows.
sample_n(iris, 10, replace = TRUE)
Select rows by position.
slice(iris, 10:15)
Select and order top n entries (by group if grouped data).
top_n(iris, 5, Sepal.Width)
Logical and Boolean Operators
ARRANGE CASES
Order rows by values of a column or columns (low to high), use with desc() to order from high to low.
arrange(mtcars, mpg)
arrange(mtcars, desc(mpg))
Add one or more rows to a table.
add_row(faithful, eruptions = 1, waiting = 1)
EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.
Extract column values as a vector. Choose by name or index.
pull(iris, Sepal.Length)
Extract columns as a table. Also select_if().
select(iris, Sepal.Length, Species)
Use these helpers with select ()
select(iris, starts_with("Sepal"))
MAKE NEW VARIABLES
These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output.
Compute new column(s).
mutate(mtcars, gpm = 1/mpg)
Compute new column(s), drop others.
transmute(mtcars, gpm = 1/mpg)
Apply funs to every column. Use with funs(). Also mutate_if().
mutate_all(faithful, funs(log(.), log2(.)))
mutate_if(iris, is.numeric, funs(log(.)))
Apply funs to specific columns. Use with funs(), vars() and the helper functions for select().
mutate_at(iris, vars( -Species), funs(log(.)))
Add new column(s). Also add_count(), add_tally().
add_column(mtcars, new = 1:32)
Rename columns.
rename(iris, Length = Sepal.Length)
TO USE WITH MUTATE ()
mutate() and transmute() apply vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.
OFFSETS
dplyr::lead() - Offset elements by -1
CUMULATIVE AGGREGATES
dplyr::cumsum()
RANKINGS
dplyr::row_number() - rank with ties = “first”
MATH
dplyr::near() - safe == for floating point numbers
MISC
dplyr::recode_factor() - Vectorized switch() for factors
TO USE WITH SUMMARISE ()
summarise() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.
sum(!is.na()) - # of non-NA’s
median() - median
sum() - # of TRUE’s
dplyr::nth() - value in nth location of vector
max() - maximum value
var() - variance
Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.
Move row names into col.
a <- rownames_to_column(iris, var = "C")
Move col in row names.
column_to_rownames(a, var = "C")
has_rownames()
remove_rownames()
COMBINE VARIABLES
BE SURE THAT ROWS ALIGN.
left_join(x, y, by = NULL,
copy=FALSE, suffix=c(“.x”,“.y”),…)
right_join(x, y, by = NULL,
copy = FALSE, suffix=c(“.x”,“.y”),…)
inner_join(x, y, by = NULL,
copy = FALSE, suffix=c(“.x”,“.y”),…)
full_join(x, y, by = NULL,
copy=FALSE, suffix=c(“.x”,“.y”),…)
left_join(x, y, by = "A")
left_join(x, y, by = c("C" = "D"))
left_join(x, y, by = c("C" = "D"), suffix =
c("1", "2"))
COMBINE CASES
bind_rows(…, .id = NULL)
intersect(x, y, …)
setdiff(x, y, …)
union(x, y, …)
EXTRACT ROWS
semi_join(x, y, by = NULL, …)
anti_join(x, y, by = NULL, …)
There are two main alternatives to readr: base R and data.table’s fread()
. The most important differences are discussed below.
Compared to the corresponding base functions, readr functions:
Use a consistent naming scheme for the parameters (e.g. col_names
and col_types
not header
and colClasses
).
Are much faster (up to 10x).
Leave strings as is by default, and automatically parse common date/time formats.
Have a helpful progress bar if loading is going to take a while.
All functions work exactly the same way regardless of the current locale. To override the US-centric defaults, use locale()
.
fread()
data.table has a function similar to read_csv()
called fread. Compared to fread, readr functions:
Are slower (currently ~1.2-2x slower. If you want absolutely the best performance, use data.table::fread()
.
Use a slightly more sophisticated parser, recognising both doubled (""""
) and backslash escapes ("\""
), and can produce factors and date/times directly.
Forces you to supply all parameters, where fread()
saves you work by automatically guessing the delimiter, whether or not the file has a header, and how many lines to skip.
Are built on a different underlying infrastructure. Readr functions are designed to be quite general, which makes it easier to add support for new rectangular data formats. fread()
is designed to be as fast as possible.