Running benchmarks
As already discussed in the previous chapters, with the help of the microbenchmark package, we can run any number of different functions a specified number of times on the same machine to get reproducible results on their performance.
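As a quick refresher, microbenchmark is typically called as follows; the toy expressions here are purely illustrative and are not taken from the preceding examples:
> library(microbenchmark)
> x <- runif(1e5)
> # both expressions compute the square root; microbenchmark evaluates
> # each of them 10 times and summarizes the timing distribution
> microbenchmark(sqrt(x), x ^ 0.5, times = 10)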
Our first step is to define the functions that we want to benchmark; these were compiled from the preceding examples:
> AGGR1 <- function() aggregate(hflights$Diverted,
+     by = list(hflights$DayOfWeek), FUN = mean)
> AGGR2 <- function() with(hflights, aggregate(Diverted,
+     by = list(DayOfWeek), FUN = mean))
> AGGR3 <- function() aggregate(Diverted ~ DayOfWeek,
+     data = hflights, FUN = mean)
> TAPPLY <- function() tapply(X = hflights$Diverted,
+     INDEX = hflights$DayOfWeek, FUN = mean)
> PLYR1 <- function() ddply(hflights, .(DayOfWeek),
+     function(x) mean(x$Diverted))
> PLYR2 <- function() ddply(hflights, .(DayOfWeek), summarise,
+     Diverted = mean(Diverted))
> DPLYR <- function() dplyr::summarise(hflights_DayOfWeek,
+     mean(Diverted))
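Note that these functions assume that the plyr and dplyr packages are loaded and that the pre-grouped hflights_DayOfWeek object already exists; based on the preceding examples, it was presumably created along the following lines:
> library(hflights)
> library(plyr)
> library(dplyr)
> # pre-group the data.frame by the day of the week for dplyr::summarise
> hflights_DayOfWeek <- group_by(hflights, DayOfWeek)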
However, as mentioned before, the summarise function in dplyr needs some prior data restructuring, which also takes time. To this end, let's define another function that also includes the creation of the new data structure along with the real aggregation:
> DPLYR_ALL <- function() {
+     hflights_DayOfWeek <- group_by(hflights, DayOfWeek)
+     dplyr::summarise(hflights_DayOfWeek, mean(Diverted))
+ }
Similarly, benchmarking data.table also requires some additional variables in the test environment; as hflights_dt is already sorted by DayOfWeek, let's create a new data.table object without a key for benchmarking:
> hflights_dt_nokey <- data.table(hflights)
Further, it probably makes sense to verify that it has no keys:
> key(hflights_dt_nokey)
NULL
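For reference, hflights_dt itself was presumably created and indexed in the earlier examples along the following lines, which is also why it is already sorted by DayOfWeek:
> library(data.table)
> hflights_dt <- data.table(hflights)
> # setting the key sorts the table by DayOfWeek and indexes it
> setkey(hflights_dt, 'DayOfWeek')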
Okay, now we can define the data.table test cases, along with a function that also includes the transformation to data.table and the addition of an index, just to be fair to dplyr:
> DT <- function() hflights_dt_nokey[, mean(Diverted),
+     by = DayOfWeek]
> DT_KEY <- function() hflights_dt[, mean(Diverted),
+     by = DayOfWeek]
> DT_ALL <- function() {
+     setkey(hflights_dt_nokey, 'DayOfWeek')
+     hflights_dt_nokey[, mean(Diverted), by = DayOfWeek]
+     setkey(hflights_dt_nokey, NULL)
+ }
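Before timing anything, it makes sense to verify that the competing implementations really compute the same values. One possible spot check: both AGGR1 and DT_KEY return the seven weekday means in sorted order, so their value columns should compare as equal:
> # the value columns of the base R and the keyed data.table results
> # should match element by element
> all.equal(AGGR1()$x, DT_KEY()$V1)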
Now that we have all the described implementations ready for testing, let's load the microbenchmark package to do its job:
> library(microbenchmark)
> res <- microbenchmark(AGGR1(), AGGR2(), AGGR3(), TAPPLY(), PLYR1(),
+     PLYR2(), DPLYR(), DPLYR_ALL(), DT(), DT_KEY(), DT_ALL())
> print(res, digits = 3)
Unit: milliseconds
        expr     min      lq  median      uq     max neval
     AGGR1() 2279.82 2348.14 2462.02 2597.70 2719.88    10
     AGGR2() 2278.15 2465.09 2528.55 2796.35 2996.98    10
     AGGR3() 2358.71 2528.23 2726.66 2879.89 3177.63    10
    TAPPLY()   19.90   21.05   23.56   29.65   33.88    10
     PLYR1()   56.93   59.16   70.73   82.41  155.88    10
     PLYR2()   58.31   65.71   76.51   98.92  103.48    10
     DPLYR()    1.18    1.21    1.30    1.74    1.84    10
 DPLYR_ALL()    7.40    7.65    7.93    8.25   14.51    10
        DT()    5.45    5.73    5.99    7.75    9.00    10
    DT_KEY()    5.22    5.45    5.63    6.26   13.64    10
    DT_ALL()   31.31   33.26   35.19   38.34   42.83    10
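The summary can also be printed in relative units, which rescales every row to the fastest method and makes the order-of-magnitude differences easier to grasp:
> # rescale the timings so that the fastest expression equals 1
> print(res, unit = 'relative', digits = 3)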
The results are pretty spectacular: from more than 2,000 milliseconds, we have managed to get the very same results in only a little more than 1 millisecond. The spread can be demonstrated easily on a violin plot with a logarithmic scale:
> autoplot(res)

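Note that autoplot is a ggplot2 generic, so the ggplot2 package has to be loaded as well; the method provided by microbenchmark plots the timings on a logarithmic scale by default, which is what keeps this huge spread readable. The explicit equivalent of the preceding call would be something like:
> library(ggplot2)
> # log = TRUE is the default for microbenchmark objects
> autoplot(res, log = TRUE)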
Therefore, dplyr seems to be the most efficient solution, although if we also take the extra step of grouping the data.frame into account, this otherwise clear advantage becomes rather unconvincing. As a matter of fact, if we already have a data.table object, and can thus save the transformation of a traditional data.frame into a data.table, then data.table performs better than dplyr. However, I am pretty sure that you will not really notice the time difference between the two high-performance solutions; both of them do a very good job, even with larger datasets.
It's worth mentioning that dplyr can work with data.table objects as well; so, to ensure that you are not locked into either package, it's definitely worth using both if needed. The following is a proof-of-concept example:
> dplyr::summarise(group_by(hflights_dt, DayOfWeek), mean(Diverted))
Source: local data table [7 x 2]

  DayOfWeek mean(Diverted)
1         1    0.002997672
2         2    0.002559323
3         3    0.003226211
4         4    0.003065727
5         5    0.002687865
6         6    0.002823121
7         7    0.002589057
Okay, so now we can be pretty sure that we will use either data.table or dplyr for computing group averages in the future. However, what about more complex operations?