Data Manipulation with dplyr

The Grammar of Data Manipulation

Your Name

2025-08-10

What is dplyr?

  • Part of the tidyverse ecosystem
  • A grammar of data manipulation
  • Provides a consistent set of verbs for data manipulation
  • Fast, expressive, and easy to read
  • Works with data frames and tibbles

The Five Main Verbs

Transform:

  • filter() - subset rows
  • select() - subset columns
  • mutate() - create new variables
  • arrange() - reorder rows

Summarize:

  • summarize() - reduce to summary
  • group_by() - group operations
  • count() - count observations

Setup

Let’s load the required packages and data:

library(dplyr)
library(nycflights13)

# Look at the flights data
glimpse(flights)

We’ll use the nycflights13 dataset throughout this presentation.

filter() - Subset Rows

Filter rows based on conditions:

# Flights on January 1st
flights |>
  filter(month == 1, day == 1)

# Flights with departure delays > 60 minutes
flights |>
  filter(dep_delay > 60)

# Multiple conditions with OR
flights |>
  filter(carrier == "AA" | carrier == "UA")

Key operators: ==, !=, >, <, >=, <=, %in%, is.na()

select() - Choose Columns

Select specific columns by name or position:

# Select specific columns
flights |>
  select(year, month, day, carrier, flight)

# Select columns by range
flights |>
  select(year:day, carrier:flight)

# Select with helper functions
flights |>
  select(starts_with("dep"), ends_with("delay"))

Helper functions: starts_with(), ends_with(), contains(), matches(), everything()

mutate() - Create New Variables

Add new columns or modify existing ones:

# Create new variables
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    hour = dep_time %/% 100
  )

# Use newly created variables
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    gain_per_hour = gain / hour
  )

arrange() - Reorder Rows

Sort rows by one or more variables:

# Sort by departure delay (ascending)
flights |>
  arrange(dep_delay)

# Sort by departure delay (descending)
flights |>
  arrange(desc(dep_delay))

# Sort by multiple variables
flights |>
  arrange(year, month, day, desc(dep_delay))

Combining Operations with the Pipe

Chain operations together for readable code:

# Find the most delayed flights to Houston
flights |>
  filter(dest %in% c("IAH", "HOU")) |>
  arrange(desc(dep_delay)) |>
  select(month, day, carrier, flight, dep_delay, dest) |>
  head(10)

The pipe |> makes code more readable by eliminating nested function calls.

summarize() - Create Summaries

Collapse data into summary statistics:

# Basic summaries
flights |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    median_delay = median(dep_delay, na.rm = TRUE),
    n_flights = n(),
    n_carriers = n_distinct(carrier)
  )

Always use na.rm = TRUE when dealing with missing values!

group_by() - Group Operations

Apply operations to groups of data:

# Average delay by carrier
flights |>
  group_by(carrier) |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n_flights = n(),
    .groups = "drop"
  ) |>
  arrange(desc(avg_delay))

Use .groups = "drop" to remove grouping after summarizing.

Multiple Grouping Variables

Group by multiple variables for more detailed analysis:

# Average delay by carrier and month
flights |>
  group_by(carrier, month) |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n_flights = n(),
    .groups = "drop"
  ) |>
  arrange(carrier, month)

count() - Count Observations

A shortcut for group_by() + summarize(n = n()):

# Count flights by destination
flights |>
  count(dest, sort = TRUE)

# Count with weights
flights |>
  count(carrier, wt = distance, sort = TRUE, name = "total_miles")

Conditional Logic with if_else()

Create conditional variables:

flights |>
  mutate(
    delay_status = if_else(dep_delay > 0, "Delayed", "On Time"),
    flight_length = if_else(distance > 1000, "Long", "Short")
  ) |>
  count(delay_status, flight_length)

Advanced Conditional Logic with case_when()

Handle multiple conditions elegantly:

flights |>
  mutate(
    delay_category = case_when(
      dep_delay < 0      ~ "Early",
      dep_delay <= 15    ~ "On Time", 
      dep_delay <= 60    ~ "Delayed",
      dep_delay > 60     ~ "Very Delayed",
      is.na(dep_delay)   ~ "Cancelled"
    )
  ) |>
  count(delay_category)

Working with Missing Values

Handle missing data appropriately:

# Filter out missing values
flights |>
  filter(!is.na(dep_delay)) |>
  summarize(avg_delay = mean(dep_delay))

# Count missing values
flights |>
  summarize(
    total_flights = n(),
    missing_dep_delay = sum(is.na(dep_delay)),
    prop_missing = mean(is.na(dep_delay))
  )

slice() Functions - Select Rows by Position

Various ways to select rows by position:

# First and last rows
flights |> slice_head(n = 5)
flights |> slice_tail(n = 5)

# Random sample
flights |> slice_sample(n = 100)

# Top n by variable
flights |> 
  slice_max(dep_delay, n = 5, na_rm = TRUE)

across() - Apply Functions to Multiple Columns

Apply the same operation to multiple columns:

# Summarize all numeric columns
flights |>
  summarize(
    across(where(is.numeric), mean, na.rm = TRUE)
  )

# Apply multiple functions
flights |>
  group_by(carrier) |>
  summarize(
    across(c(dep_delay, arr_delay), 
           list(mean = mean, sd = sd), 
           na.rm = TRUE)
  )

Real-World Example

Let’s analyze flight performance by carrier:

carrier_performance <- flights |>
  filter(!is.na(dep_delay), !is.na(arr_delay)) |>
  group_by(carrier) |>
  summarize(
    flights = n(),
    avg_dep_delay = mean(dep_delay),
    avg_arr_delay = mean(arr_delay),
    on_time_rate = mean(dep_delay <= 0),
    very_late_rate = mean(dep_delay > 60),
    .groups = "drop"
  ) |>
  arrange(desc(on_time_rate))

carrier_performance

Common Patterns

Window functions with mutate():

# Rank flights by delay within each day
flights |>
  group_by(year, month, day) |>
  mutate(
    delay_rank = min_rank(desc(dep_delay)),
    delay_percentile = percent_rank(dep_delay)
  ) |>
  filter(delay_rank <= 3)

Performance Tips

  • Filter early - reduce data size as soon as possible
  • Use count() instead of group_by() + summarize(n = n())
  • Chain operations logically with the pipe
  • Be explicit about handling missing values
  • Use .groups = "drop" after summarize() to avoid warnings

Best Practices

Do: - Use meaningful variable names - Filter early in the pipeline - Handle missing values explicitly - Use helper functions like n_distinct()

Avoid: - Overly complex single operations - Forgetting about missing values - Not dropping groups after summarizing - Modifying original data unnecessarily

Resources for Learning More

Summary

The dplyr toolkit:

  • filter() - subset rows based on conditions
  • select() - choose columns
  • mutate() - create/modify variables
  • arrange() - sort rows
  • summarize() + group_by() - create summaries
  • count() - count observations
  • across() - apply functions to multiple columns

Remember: Start simple, chain operations with |>, and handle missing values!

Questions?

Thank you for your attention!

Practice with your own data and explore the dplyr documentation for more advanced features.