Data Manipulation with dplyr
The Grammar of Data Manipulation
2025-08-10
What is dplyr?
Part of the tidyverse ecosystem
A grammar of data manipulation
Provides a consistent set of verbs for data manipulation
Fast, expressive, and easy to read
Works with data frames and tibbles
dplyr is one of the most important R packages for data science. It’s designed to make data manipulation fast and intuitive.
The Five Main Verbs
Transform:
filter() - subset rows
select() - subset columns
mutate() - create new variables
arrange() - reorder rows
Summarize:
summarize() - reduce to summary
group_by() - group operations
count() - count observations
Setup
Let’s load the required packages and data:
library (dplyr)
library (nycflights13)
# Look at the flights data
glimpse (flights)
We’ll use the nycflights13 dataset throughout this presentation.
filter() - Subset Rows
Filter rows based on conditions:
# Flights on January 1st
flights |>
filter (month == 1 , day == 1 )
# Flights with departure delays > 60 minutes
flights |>
filter (dep_delay > 60 )
# Multiple conditions with OR
flights |>
filter (carrier == "AA" | carrier == "UA" )
Key operators: ==, !=, >, <, >=, <=, %in%, is.na()
select() - Choose Columns
Select specific columns by name or position:
# Select specific columns
flights |>
select (year, month, day, carrier, flight)
# Select columns by range
flights |>
select (year: day, carrier: flight)
# Select with helper functions
flights |>
select (starts_with ("dep" ), ends_with ("delay" ))
Helper functions: starts_with(), ends_with(), contains(), matches(), everything()
mutate() - Create New Variables
Add new columns or modify existing ones:
# Create new variables
flights |>
mutate (
gain = dep_delay - arr_delay,
speed = distance / air_time * 60 ,
hour = dep_time %/% 100
)
# Use newly created variables
flights |>
mutate (
gain = dep_delay - arr_delay,
gain_per_hour = gain / hour
)
arrange() - Reorder Rows
Sort rows by one or more variables:
# Sort by departure delay (ascending)
flights |>
arrange (dep_delay)
# Sort by departure delay (descending)
flights |>
arrange (desc (dep_delay))
# Sort by multiple variables
flights |>
arrange (year, month, day, desc (dep_delay))
Combining Operations with the Pipe
Chain operations together for readable code:
# Find the most delayed flights to Houston
flights |>
filter (dest %in% c ("IAH" , "HOU" )) |>
arrange (desc (dep_delay)) |>
select (month, day, carrier, flight, dep_delay, dest) |>
head (10 )
The pipe |> makes code more readable by eliminating nested function calls.
summarize() - Create Summaries
Collapse data into summary statistics:
# Basic summaries
flights |>
summarize (
avg_delay = mean (dep_delay, na.rm = TRUE ),
median_delay = median (dep_delay, na.rm = TRUE ),
n_flights = n (),
n_carriers = n_distinct (carrier)
)
Always use na.rm = TRUE when dealing with missing values!
group_by() - Group Operations
Apply operations to groups of data:
# Average delay by carrier
flights |>
group_by (carrier) |>
summarize (
avg_delay = mean (dep_delay, na.rm = TRUE ),
n_flights = n (),
.groups = "drop"
) |>
arrange (desc (avg_delay))
Use .groups = "drop" to remove grouping after summarizing.
Multiple Grouping Variables
Group by multiple variables for more detailed analysis:
# Average delay by carrier and month
flights |>
group_by (carrier, month) |>
summarize (
avg_delay = mean (dep_delay, na.rm = TRUE ),
n_flights = n (),
.groups = "drop"
) |>
arrange (carrier, month)
count() - Count Observations
A shortcut for group_by() + summarize(n = n()):
# Count flights by destination
flights |>
count (dest, sort = TRUE )
# Count with weights
flights |>
count (carrier, wt = distance, sort = TRUE , name = "total_miles" )
Conditional Logic with if_else()
Create conditional variables:
flights |>
mutate (
delay_status = if_else (dep_delay > 0 , "Delayed" , "On Time" ),
flight_length = if_else (distance > 1000 , "Long" , "Short" )
) |>
count (delay_status, flight_length)
Advanced Conditional Logic with case_when()
Handle multiple conditions elegantly:
flights |>
mutate (
delay_category = case_when (
dep_delay < 0 ~ "Early" ,
dep_delay <= 15 ~ "On Time" ,
dep_delay <= 60 ~ "Delayed" ,
dep_delay > 60 ~ "Very Delayed" ,
is.na (dep_delay) ~ "Cancelled"
)
) |>
count (delay_category)
Working with Missing Values
Handle missing data appropriately:
# Filter out missing values
flights |>
filter (! is.na (dep_delay)) |>
summarize (avg_delay = mean (dep_delay))
# Count missing values
flights |>
summarize (
total_flights = n (),
missing_dep_delay = sum (is.na (dep_delay)),
prop_missing = mean (is.na (dep_delay))
)
slice() Functions - Select Rows by Position
Various ways to select rows by position:
# First and last rows
flights |> slice_head (n = 5 )
flights |> slice_tail (n = 5 )
# Random sample
flights |> slice_sample (n = 100 )
# Top n by variable
flights |>
slice_max (dep_delay, n = 5 , na_rm = TRUE )
across() - Apply Functions to Multiple Columns
Apply the same operation to multiple columns:
# Summarize all numeric columns
flights |>
summarize (
across (where (is.numeric), mean, na.rm = TRUE )
)
# Apply multiple functions
flights |>
group_by (carrier) |>
summarize (
across (c (dep_delay, arr_delay),
list (mean = mean, sd = sd),
na.rm = TRUE )
)
Real-World Example
Let’s analyze flight performance by carrier:
carrier_performance <- flights |>
filter (! is.na (dep_delay), ! is.na (arr_delay)) |>
group_by (carrier) |>
summarize (
flights = n (),
avg_dep_delay = mean (dep_delay),
avg_arr_delay = mean (arr_delay),
on_time_rate = mean (dep_delay <= 0 ),
very_late_rate = mean (dep_delay > 60 ),
.groups = "drop"
) |>
arrange (desc (on_time_rate))
carrier_performance
Common Patterns
Window functions with mutate():
# Rank flights by delay within each day
flights |>
group_by (year, month, day) |>
mutate (
delay_rank = min_rank (desc (dep_delay)),
delay_percentile = percent_rank (dep_delay)
) |>
filter (delay_rank <= 3 )
Best Practices
Do: - Use meaningful variable names - Filter early in the pipeline - Handle missing values explicitly - Use helper functions like n_distinct()
Avoid: - Overly complex single operations - Forgetting about missing values - Not dropping groups after summarizing - Modifying original data unnecessarily
Resources for Learning More
Summary
The dplyr toolkit:
filter() - subset rows based on conditions
select() - choose columns
mutate() - create/modify variables
arrange() - sort rows
summarize() + group_by() - create summaries
count() - count observations
across() - apply functions to multiple columns
Remember: Start simple, chain operations with |>, and handle missing values!
Questions?
Thank you for your attention!
Practice with your own data and explore the dplyr documentation for more advanced features.
Encourage questions about specific use cases. Remind audience that practice is key to mastering dplyr.