dplyr is an amazing tool for data wrangling and I use it daily. However, there is one type of operation that I frequently do that has historically caused me some confusion and frustration: row-wise means. Once I figured out what was going on, I wanted to share what I learned through this brief blog post. It will focus on how to avoid some common issues I ran into and how to speed up rowwise operations with large data frames. I hope some find it helpful.
# Load packages used in this post
library(tidyverse)
library(microbenchmark)
# Set random seed for reproduciblity
set.seed(2020)
Let's say we have a tibble (or data frame) containing 10 observations and 4 numerical variables: y
, x1
, x2
, and x3
. We can simulate this quickly using rnorm()
to sample from different normal distributions.
# Simulate data
y <- rnorm(n = 10, mean = 100, sd = 15)
x1 <- rnorm(n = 10, mean = 0, sd = 1)
x2 <- rnorm(n = 10, mean = 10, sd = 10)
x3 <- rnorm(n = 10, mean = 20, sd = 5)
n10 <- tibble(y, x1, x2, x3) %>% print()
## # A tibble: 10 x 4 ## y x1 x2 x3 ## <dbl> <dbl> <dbl> <dbl> ## 1 106. -0.853 31.7 15.9 ## 2 105. 0.909 21.0 16.3 ## 3 83.5 1.20 13.2 25.5 ## 4 83.0 -0.372 9.27 32.2 ## 5 58.1 -0.123 18.3 21.9 ## 6 111. 1.80 12.0 21.5 ## 7 114. 1.70 23.0 18.6 ## 8 96.6 -3.04 19.4 20.4 ## 9 126. -2.29 8.53 17.2 ## 10 102. 0.0583 11.1 22.2
Now let's say we want to add a new variable xmean
to the tibble containing each observation's mean of x1
, x2
, and x3
. If you are just learning dplyr
, you would probably try to combine the mean()
and mutate()
functions as below.
# Example 1 (unintended results)
n10 %>%
mutate(xmean = mean(c(x1, x2, x3)))
## # A tibble: 10 x 5 ## y x1 x2 x3 xmean ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 106. -0.853 31.7 15.9 12.6 ## 2 105. 0.909 21.0 16.3 12.6 ## 3 83.5 1.20 13.2 25.5 12.6 ## 4 83.0 -0.372 9.27 32.2 12.6 ## 5 58.1 -0.123 18.3 21.9 12.6 ## 6 111. 1.80 12.0 21.5 12.6 ## 7 114. 1.70 23.0 18.6 12.6 ## 8 96.6 -3.04 19.4 20.4 12.6 ## 9 126. -2.29 8.53 17.2 12.6 ## 10 102. 0.0583 11.1 22.2 12.6
However, you'll notice in the output above that the new xmean
variable contains repetitions of a constant value. What is going on here? Basically, what mutate()
did was take all the numbers in x1
, x2
, and x3
, combine them into one long vector of 30 numbers, and send that vector to the mean()
function. The mean()
function then returns a single value—the mean of all 30 numbers—and tries to put that into the new column xmean
. But because the column needs to be a vector of 10 numbers to fit into the tibble, that single value gets recycled (i.e., repeated 10 times). To verify this is what happened, we can do the operation by hand and see that we get the same number:
mean(c(x1, x2, x3))
## [1] 12.60424
So, clearly mutate()
is not doing what we intended it to do. Luckily, dplyr 1.0.0 added some great features for doing operations within rows. The simplest version simply adds a call to the rowwise()
function to our pipeline.
# Example 2 (intended results)
n10 %>%
rowwise() %>%
mutate(xmean = mean(c(x1, x2, x3)))
## # A tibble: 10 x 5 ## # Rowwise: ## y x1 x2 x3 xmean ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 106. -0.853 31.7 15.9 15.6 ## 2 105. 0.909 21.0 16.3 12.7 ## 3 83.5 1.20 13.2 25.5 13.3 ## 4 83.0 -0.372 9.27 32.2 13.7 ## 5 58.1 -0.123 18.3 21.9 13.4 ## 6 111. 1.80 12.0 21.5 11.7 ## 7 114. 1.70 23.0 18.6 14.4 ## 8 96.6 -3.04 19.4 20.4 12.2 ## 9 126. -2.29 8.53 17.2 7.81 ## 10 102. 0.0583 11.1 22.2 11.1
This did what we wanted it to do, despite the actual mutate()
call being identical to what is was before! Pretty cool. We can even save some time by selecting the variables to include in the mean()
operation automatically, instead of listing them out in the c()
function. This isn't such a time-savings in this case with only three variables, but in settings with more variables it can really add up. To do so, we just need to use a tidy selection function; in this case, all the variables we want to include start with the letter “x” so let's use starts_with()
.
# Example 3 (error)
n10 %>%
rowwise() %>%
mutate(xmean = mean(starts_with("x")))
## Error: Problem with `mutate()` input `xmean`. ## x `starts_with()` must be used within a *selecting* function. ## i See <https://tidyselect.r-lib.org/reference/faq-selection-context.html>. ## i Input `xmean` is `mean(starts_with("x"))`. ## i The error occured in row 1.
Shoot, that didn't work. But why not? Basically, the problem is that mutate()
doesn't know what do to with selection functions like starts_with()
. The error message basically says that we are in the “wrong context” for a selection function. Luckily, dplyr 1.0.0 also added the c_across()
function, which will allow us to change the context to one that does allow selection functions. The code below now works as intended, first selecting all the variables starting with “x” and then computing their row-wise means.
# Example 4 (intended results)
n10 %>%
rowwise() %>%
mutate(xmean = mean(c_across(starts_with("x"))))
## # A tibble: 10 x 5 ## # Rowwise: ## y x1 x2 x3 xmean ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 106. -0.853 31.7 15.9 15.6 ## 2 105. 0.909 21.0 16.3 12.7 ## 3 83.5 1.20 13.2 25.5 13.3 ## 4 83.0 -0.372 9.27 32.2 13.7 ## 5 58.1 -0.123 18.3 21.9 13.4 ## 6 111. 1.80 12.0 21.5 11.7 ## 7 114. 1.70 23.0 18.6 14.4 ## 8 96.6 -3.04 19.4 20.4 12.2 ## 9 126. -2.29 8.53 17.2 7.81 ## 10 102. 0.0583 11.1 22.2 11.1
Huzzah. There are two things to note about rowwise()
, however. First, it transformed our tibble into an implicitly “grouped” tibble, which is what allowed our mutate()
function to calculate row-wise means instead of overall means (basically, it is treating each row as a separate group and calculating the means per group/row). However, after that mutate()
call, the tibble remains grouped. This is handy if we want to continue doing row-wise operations, but how do we tell it to stop once we are done with row-wise operations and want to return to “normal” behavior? Let's see when this could be a problem; one example is if we want to calculate the maximum row-wise mean xmean_max
.
# Example 5 (unintended results)
n10 %>%
rowwise() %>%
mutate(xmean = mean(c_across(starts_with("x")))) %>%
summarize(xmean_max = max(xmean))
## # A tibble: 10 x 1 ## xmean_max ## <dbl> ## 1 15.6 ## 2 12.7 ## 3 13.3 ## 4 13.7 ## 5 13.4 ## 6 11.7 ## 7 14.4 ## 8 12.2 ## 9 7.81 ## 10 11.1
Here we wanted to summarize over all values of xmean
and expected a single maximum value. Instead we got the same 10 values back. What happened? Basically, our tibble was still implicitly grouped by row and the summarize()
function respected that, calculating the maximum of each group/row. To avoid this behavior, we can add the ungroup()
function to our pipeline (reverting the tibble back to a standard one without implicit grouping).
# Example 6 (intended results)
n10 %>%
rowwise() %>%
mutate(xmean = mean(c_across(starts_with("x")))) %>%
ungroup() %>%
summarize(xmean_max = max(xmean))
## # A tibble: 1 x 1 ## xmean_max ## <dbl> ## 1 15.6
Now we get the desired behavior, and so I am usually very careful to add ungroup()
to my pipeline as soon as I am done with row-wise operations (otherwise you might end up with some unexpected problems).
The other thing to note about rowwise()
is that it can be slow. With a small tibble like this, it doesn't matter much, but the difference could be meaningful for larger and more complex data. In these cases, you have some alternatives. This blog post describes some of them, but does not address the specific case of means, which is what I want to do the most in practice. A faster alternative in this case is to use the rowMeans()
function. As you might imagine, this function takes in a numeric matrix or dataframe and returns the mean of each row.
rowMeans(n10)
## [1] 38.12065 35.67395 30.84624 31.02943 24.55300 36.51232 39.33531 33.31696 37.45554 33.78977
But we want to exclude the y
variable and append it to the n10
tibble. How to do so? We might reasonably try to put it into mutate()
like we did with mean()
:
# Example 7 (error)
n10 %>%
mutate(xmean = rowMeans(c(x1, x2, x3)))
## Error: Problem with `mutate()` input `xmean`. ## x 'x' must be an array of at least two dimensions ## i Input `xmean` is `rowMeans(c(x1, x2, x3))`.
But here we run into a problem. rowMeans()
is expecting a numeric matrix or data frame, but is being provided with a vector of 30 numbers again (as in Example 1). Thus, it doesn't have rows to calculate means within and returns an error. We can solve this by transforming the vector to a matrix:
# Example 8 (intended results)
n10 %>%
mutate(xmean = rowMeans(matrix(c(x1, x2, x3), ncol = 3)))
## # A tibble: 10 x 5 ## y x1 x2 x3 xmean ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 106. -0.853 31.7 15.9 15.6 ## 2 105. 0.909 21.0 16.3 12.7 ## 3 83.5 1.20 13.2 25.5 13.3 ## 4 83.0 -0.372 9.27 32.2 13.7 ## 5 58.1 -0.123 18.3 21.9 13.4 ## 6 111. 1.80 12.0 21.5 11.7 ## 7 114. 1.70 23.0 18.6 14.4 ## 8 96.6 -3.04 19.4 20.4 12.2 ## 9 126. -2.29 8.53 17.2 7.81 ## 10 102. 0.0583 11.1 22.2 11.1
But a better option might be to pass it a tibble (or data frame) instead, using the select()
function and the dot operator (.
). The dot operator represents the last item in the pipeline (in this case, the tibble n10
). Read more about the dot operator here.
# Example 9 (intended results)
n10 %>%
mutate(xmean = rowMeans(select(., x1, x2, x3)))
## # A tibble: 10 x 5 ## y x1 x2 x3 xmean ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 106. -0.853 31.7 15.9 15.6 ## 2 105. 0.909 21.0 16.3 12.7 ## 3 83.5 1.20 13.2 25.5 13.3 ## 4 83.0 -0.372 9.27 32.2 13.7 ## 5 58.1 -0.123 18.3 21.9 13.4 ## 6 111. 1.80 12.0 21.5 11.7 ## 7 114. 1.70 23.0 18.6 14.4 ## 8 96.6 -3.04 19.4 20.4 12.2 ## 9 126. -2.29 8.53 17.2 7.81 ## 10 102. 0.0583 11.1 22.2 11.1
This works nicely and, because select()
creates a context that allows selection functions, we can use starts_with()
here.
# Example 10 (intended results)
n10 %>%
mutate(xmean = rowMeans(select(., starts_with("x"))))
## # A tibble: 10 x 5 ## y x1 x2 x3 xmean ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 106. -0.853 31.7 15.9 15.6 ## 2 105. 0.909 21.0 16.3 12.7 ## 3 83.5 1.20 13.2 25.5 13.3 ## 4 83.0 -0.372 9.27 32.2 13.7 ## 5 58.1 -0.123 18.3 21.9 13.4 ## 6 111. 1.80 12.0 21.5 11.7 ## 7 114. 1.70 23.0 18.6 14.4 ## 8 96.6 -3.04 19.4 20.4 12.2 ## 9 126. -2.29 8.53 17.2 7.81 ## 10 102. 0.0583 11.1 22.2 11.1
But dplyr 1.0.0 also added a new way to streamline this kind of thing. We can use the new across()
function to create this context and avoid the need for the dot operator. Note that we using across()
here instead of c_across()
because the latter is for working within rows (in combination with rowwise()
) and here we want the row-wise operations to be handled by rowMeans()
.
# Example 11 (intended results)
n10 %>%
mutate(xmean = rowMeans(across(starts_with("x"))))
## # A tibble: 10 x 5 ## y x1 x2 x3 xmean ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 106. -0.853 31.7 15.9 15.6 ## 2 105. 0.909 21.0 16.3 12.7 ## 3 83.5 1.20 13.2 25.5 13.3 ## 4 83.0 -0.372 9.27 32.2 13.7 ## 5 58.1 -0.123 18.3 21.9 13.4 ## 6 111. 1.80 12.0 21.5 11.7 ## 7 114. 1.70 23.0 18.6 14.4 ## 8 96.6 -3.04 19.4 20.4 12.2 ## 9 126. -2.29 8.53 17.2 7.81 ## 10 102. 0.0583 11.1 22.2 11.1
To test the speeds of each approach, we can use the microbenchmark package, which will precisely time each approach over multiple iterations (in this case, 100). Let's wrap up by testing the speech of each approach with increasingly large datasets.
# Create function wrappers so the microbenchmark output is prettier
f_rowwise <- function(.data) {
.data %>% rowwise() %>% mutate(xmean = mean(c_across(starts_with("x"))))
}
f_rowMeans <- function(.data) {
.data %>% mutate(xmean = rowMeans(across(starts_with("x"))))
}
# Simulate larger datasets
n100 <- n10 %>% slice(rep(1:n(), times = 10))
n1000 <- n100 %>% slice(rep(1:n(), times = 10))
n10000 <- n1000 %>% slice(rep(1:n(), times = 10))
# Perform microbenchmarking
speedtest <-
microbenchmark(
f_rowwise(n10),
f_rowwise(n100),
f_rowwise(n1000),
f_rowwise(n10000),
f_rowMeans(n10),
f_rowMeans(n100),
f_rowMeans(n1000),
f_rowMeans(n10000),
times = 100L
)
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
f_rowwise(n10) | 3.856 | 4.055 | 4.594 | 4.190 | 4.409 | 12.177 | 100 |
f_rowwise(n100) | 7.255 | 7.746 | 8.687 | 8.039 | 8.576 | 16.815 | 100 |
f_rowwise(n1000) | 42.225 | 44.410 | 47.811 | 45.748 | 49.755 | 74.409 | 100 |
f_rowwise(n10000) | 410.771 | 432.763 | 450.767 | 442.979 | 458.940 | 616.720 | 100 |
f_rowMeans(n10) | 2.505 | 2.665 | 3.118 | 2.828 | 3.055 | 10.686 | 100 |
f_rowMeans(n100) | 2.516 | 2.643 | 3.186 | 2.771 | 2.976 | 11.093 | 100 |
f_rowMeans(n1000) | 2.515 | 2.677 | 3.098 | 2.835 | 3.006 | 6.802 | 100 |
f_rowMeans(n10000) | 2.642 | 2.801 | 3.034 | 2.938 | 3.089 | 6.894 | 100 |
Although the two approaches are very similar at first, the speed difference grows with the number of observations. So, although I think it is worth learning the new rowwise()
and c_across()
functions, in settings where observations are many and speed in paramount, it may be worthwhile to use “parallel” functions such as rowMeans()
, rowSums()
, pmin()
, pmax()
, and paste()
.
▶ Click here for session info
sessionInfo()
## R version 4.0.2 (2020-06-22) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 10 x64 (build 18363) ## ## Matrix products: default ## ## locale: ## [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] kableExtra_1.1.0 RWordPress_0.2-3 knitr_1.29 microbenchmark_1.4-7 forcats_0.5.0 stringr_1.4.0 ## [7] dplyr_1.0.0 purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.2 ggplot2_3.3.2 ## [13] tidyverse_1.3.0 ## ## loaded via a namespace (and not attached): ## [1] bitops_1.0-6 matrixStats_0.56.0 fs_1.4.2 usethis_1.6.1 lubridate_1.7.9 devtools_2.3.0 ## [7] webshot_0.5.2 httr_1.4.1 rprojroot_1.3-2 rstan_2.21.1 tools_4.0.2 backports_1.1.8 ## [13] utf8_1.1.4 R6_2.4.1 DBI_1.1.0 colorspace_1.4-1 withr_2.2.0 tidyselect_1.1.0 ## [19] gridExtra_2.3 prettyunits_1.1.1 processx_3.4.3 curl_4.3 compiler_4.0.2 cli_2.0.2 ## [25] rvest_0.3.5 xml2_1.3.2 desc_1.2.0 scales_1.1.1 callr_3.4.3 digest_0.6.25 ## [31] StanHeaders_2.21.0-5 rmarkdown_2.3 pkgconfig_2.0.3 htmltools_0.5.0 sessioninfo_1.1.1 highr_0.8 ## [37] dbplyr_1.4.4 rlang_0.4.6 readxl_1.3.1 rstudioapi_0.11 XMLRPC_0.3-1 farver_2.0.3 ## [43] generics_0.0.2 jsonlite_1.7.0 inline_0.3.15 RCurl_1.98-1.2 magrittr_1.5 loo_2.3.0 ## [49] Rcpp_1.0.5 munsell_0.5.0 fansi_0.4.1 lifecycle_0.2.0 stringi_1.4.6 yaml_2.2.1 ## [55] pkgbuild_1.0.8 grid_4.0.2 blob_1.2.1 parallel_4.0.2 crayon_1.3.4 haven_2.3.1 ## [61] hms_0.5.3 ps_1.3.3 pillar_1.4.4 markdown_1.1 codetools_0.2-16 stats4_4.0.2 ## [67] pkgload_1.1.0 reprex_0.3.0 XML_3.99-0.4 glue_1.4.1 packrat_0.5.0 evaluate_0.14 ## [73] V8_3.2.0 remotes_2.1.1 RcppParallel_5.0.2 modelr_0.1.8 vctrs_0.3.1 testthat_2.3.2 ## [79] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 xfun_0.15 mime_0.9 broom_0.7.0 ## [85] viridisLite_0.3.0 rsconnect_0.8.16 memoise_1.1.0 ellipsis_0.3.1
Here is some history about the development of this stuff (thanks @mattansb):
https://github.com/tidyverse/dplyr/issues/4544
Very helpful. Thank you!
Awesome thanks!
Thank you! it totally solves my problem! You’re genius