
Case study: Using debrief for iterative R performance optimization
Source:vignettes/case-study-tidypredict.Rmd
case-study-tidypredict.RmdAbout this case study
This case study was conducted by Claude (an AI assistant) to optimize tidypredict’s random forest parsing. Two important notes about the debrief package usage:
No prior knowledge: Claude had no prior knowledge of the debrief package before this session. The package was not part of Claude’s training data.
No documentation lookup: Claude did not read debrief’s documentation, README, or help files. The only information provided was the user’s prompt showing the basic usage pattern (
pv_print_debrief(p)). All interpretation of the output was done from first principles based on the section names and content.Test-driven verification: After each optimization, Claude ran the package’s existing test suite (
devtools::test()) to verify the changes didn’t break functionality. All 38 relevant tests (ranger: 18, randomForest: 13, partykit: 7) passed after the optimizations.
This demonstrates that debrief’s output is self-explanatory enough to guide optimization work without requiring prior familiarity with the package. The section names (TOP FUNCTIONS BY SELF-TIME, HOT LINES, MEMORY ALLOCATION) clearly communicate what each part shows, and the “Next steps” suggestions in the output hinted at additional functions available.
Overview
This case study demonstrates how the debrief package can guide iterative performance optimization in R. We use tidypredict’s random forest parsing as a real-world example, achieving a 331x speedup through systematic profiling and targeted fixes.
Problem statement
tidymodels/orbital#83
reported that parsing random forest models in tidypredict takes 45+
seconds for a single tree. We rolled back to the problematic version
(commit 0dde432) to investigate.
User prompt
The user provided the following instructions to guide the profiling session:
profile the tidypredict_fit call in the following code.
library(tidymodels)
library(orbital)
rec_spec <- recipe(Sale_Price ~ ., data = ames) |>
step_normalize(all_numeric_predictors()) |>
step_dummy(all_nominal_predictors())
mod_spec <- rand_forest("regression", trees = 1)
wf_spec <- workflow(rec_spec, mod_spec)
wf_fit <- fit(wf_spec, ames)
rf_tree <- extract_fit_engine(wf_fit)
tmp <- tidypredict_fit(rf_tree)
I want you to use the profvis and debrief package to iteratively make improvements to the function
library(profvis)
library(debrief)
# Profile some code
p <- profvis({
# your code here
})
pv_print_debrief(p)This prompt established the workflow: profile with profvis, analyze
with pv_print_debrief(), optimize, and repeat.
Iteration 1: Initial profiling
Running debrief
p <- profvis({
tmp <- tidypredict_fit(rf_tree)
})
pv_print_debrief(p)Debrief output (key sections)
## PROFILING SUMMARY
Total time: 47340 ms (4734 samples @ 10 ms interval)
### TOP FUNCTIONS BY SELF-TIME
15180 ms ( 32.1%) [.data.frame
5280 ms ( 11.2%) <GC>
4220 ms ( 8.9%) sys.call
3950 ms ( 8.3%) [[.data.frame
### HOT LINES (by self-time)
12710 ms ( 26.8%) /R/model-ranger.R:6
row <- tree[tree$nodeID == j, ]
2880 ms ( 6.1%) /R/model-ranger.R:7
lc <- row["leftChild"][[1]] == find
2580 ms ( 5.4%) /R/model-ranger.R:8
lr <- row["rightChild"][[1]] == find
### MEMORY ALLOCATION (by line)
15655.00 MB /R/model-ranger.R:6
row <- tree[tree$nodeID == j, ]
3450.47 MB /R/model-ranger.R:7
lc <- row["leftChild"][[1]] == find
How I interpreted this output
TOP FUNCTIONS BY SELF-TIME showed
[.data.frameat 32.1% - this is the data.frame subsetting operator, suggesting repeated data.frame operations are the culprit.HOT LINES pinpointed the exact line:
row <- tree[tree$nodeID == j, ]consuming 26.8% of total time. This single line was the primary bottleneck.MEMORY ALLOCATION revealed this same line was allocating 15.6 GB - explaining the high GC time (11.2%). Each subsetting operation creates a new data.frame copy.
The pattern: Line 6 is inside a loop (indicated by the variable
j). Filtering a data.frame withtree[tree$nodeID == j, ]is O(n) per call. In a loop, this becomes O(n^2).
Iteration 2: Finding the next bottleneck
Debrief output after first optimization
## PROFILING SUMMARY
Total time: 290 ms (29 samples @ 10 ms interval)
### TOP FUNCTIONS BY SELF-TIME
60 ms ( 20.7%) is.na
20 ms ( 6.9%) .f
20 ms ( 6.9%) if (lc || lr) {
20 ms ( 6.9%) lc <- left_child[idx] == find
### HOT LINES (by self-time)
60 ms ( 20.7%) /R/model-ranger.R:16
if (is.na(lc)) {
20 ms ( 6.9%) /R/model-ranger.R:14
lc <- left_child[idx] == find
20 ms ( 6.9%) /R/model-ranger.R:22
if (lc || lr) {
How I interpreted this output
TOP FUNCTIONS BY SELF-TIME now showed
is.naas the top consumer at 20.7%. The bottleneck shifted from data.frame operations to NA checking.HOT LINES showed the
is.nachecks on lines 16 and 19 were being called repeatedly in the loop.The insight: The code was searching for parent nodes by iterating from
node_iddown to 0, checking each potential parent. This linear search could be replaced with a direct lookup.
Optimization applied
Pre-compute parent relationships once, then traverse directly:
# Before: Linear search for parent
for (j in node_id:0) {
idx <- j + 1L
lc <- left_child[idx] == find
lr <- right_child[idx] == find
if (is.na(lc)) lc <- FALSE
if (is.na(lr)) lr <- FALSE
if (lc || lr) { ... }
}
# After: Direct traversal using parent lookup
while (!is.na(parent[current + 1L])) {
current <- parent[current + 1L]
path <- c(path, current)
}Iteration 3: Diminishing returns
Debrief output (using 10 trees for better sampling)
## PROFILING SUMMARY
Total time: 1650 ms (165 samples @ 10 ms interval)
### TOP FUNCTIONS BY SELF-TIME
280 ms ( 17.0%) <GC>
200 ms ( 12.1%) enexpr
110 ms ( 6.7%) reduce_impl
### HOT LINES (by self-time)
60 ms ( 3.6%) /R/model-rf.R:80
if (.x$op == "less") i <- expr(!!sym(.x$col) < !!.x$val)
30 ms ( 1.8%) /R/model-rf.R:78
if (.x$op == "more") i <- expr(!!sym(.x$col) > !!.x$val)
### TOP FUNCTIONS BY TOTAL TIME
1020 ms ( 61.8%) path_formulas
500 ms ( 30.3%) expr
How I interpreted this output
TOP FUNCTIONS BY TOTAL TIME showed
path_formulasat 61.8% - the bottleneck shifted to formula building.HOT LINES showed multiple if-statements checking
.x$oprepeatedly (lines 78-81).TOP FUNCTIONS BY SELF-TIME showed
enexpr(expression building) andreduce_implas top consumers - these are doing useful work.The insight: The if-chain could be replaced with
switch(), but we’re now optimizing code that’s doing inherent work. Further gains will be marginal.
Optimization applied
Replace if-chain with switch statement:
# Before:
if (.x$op == "more") i <- expr(!!sym(.x$col) > !!.x$val)
if (.x$op == "more-equal") i <- expr(!!sym(.x$col) >= !!.x$val)
if (.x$op == "less") i <- expr(!!sym(.x$col) < !!.x$val)
if (.x$op == "less-equal") i <- expr(!!sym(.x$col) <= !!.x$val)
# After:
switch(
.x$op,
"more" = expr(!!col_sym > !!val),
"more-equal" = expr(!!col_sym >= !!val),
"less" = expr(!!col_sym < !!val),
"less-equal" = expr(!!col_sym <= !!val)
)Why I only used pv_print_debrief()
The debrief package offers several functions, and
pv_print_debrief() suggests others at the end of its
output:
### Next steps
pv_source_context(p, "/R/model-ranger.R")
pv_suggestions(p)
pv_focus(p, ".f")
pv_help()
Functions I didn’t use
| Function | Purpose | Why I skipped it |
|---|---|---|
pv_source_context() |
Shows more code context around hot lines | I was already reading full source files directly. Once HOT LINES said “line 6 in model-ranger.R”, I just read that file. |
pv_suggestions() |
Provides automated optimization suggestions | The bottlenecks were obvious enough (32% in
[.data.frame, 15.6 GB on one line) that I could diagnose
the issue myself. |
pv_focus() |
Filters analysis to a specific function | The initial output was already clear about where problems were. No need to drill down. |
When pv_print_debrief() alone is sufficient
The single function was enough because:
- HOT LINES gave exact locations - File path and line number meant I could go straight to the problem code
- MEMORY ALLOCATION correlated with time - The same lines appeared in both sections, confirming the diagnosis
- Patterns were recognizable - Data.frame subsetting in a loop is a known anti-pattern
When to use the other functions
I would reach for additional functions if:
- Diffuse profile - Time spread across many functions without a clear culprit
- Complex call paths - Need to understand how a bottleneck is reached from different entry points
-
Unfamiliar codebase -
pv_source_context()would help see surrounding code without switching tools -
Stuck on diagnosis -
pv_suggestions()might catch patterns I missed