Running benchmarks
As already discussed in the previous chapters, with the help of the microbenchmark package, we can run any number of different functions a specified number of times on the same machine to get reproducible results on their performance.
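As a quick refresher, microbenchmark is typically called as follows; the toy expressions here are purely illustrative and are not taken from the preceding examples:
> library(microbenchmark)
> x <- runif(1e5)
> # both expressions compute the square root; microbenchmark evaluates
> # each of them 10 times and summarizes the timing distribution
> microbenchmark(sqrt(x), x ^ 0.5, times = 10)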
Our first step is to define the functions that we want to benchmark; these were compiled from the preceding examples:
> AGGR1 <- function() aggregate(hflights$Diverted,
+     by = list(hflights$DayOfWeek), FUN = mean)
> AGGR2 <- function() with(hflights, aggregate(Diverted,
+     by = list(DayOfWeek), FUN = mean))
> AGGR3 <- function() aggregate(Diverted ~ DayOfWeek,
+     data = hflights, FUN = mean)
> TAPPLY <- function() tapply(X = hflights$Diverted,
+     INDEX = hflights$DayOfWeek, FUN = mean)
> PLYR1 <- function() ddply(hflights, .(DayOfWeek),
+     function(x) mean(x$Diverted))
> PLYR2 <- function() ddply(hflights, .(DayOfWeek), summarise,
+     Diverted = mean(Diverted))
> DPLYR <- function() dplyr::summarise(hflights_DayOfWeek,
+     mean(Diverted))
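Note that these functions assume that the plyr and dplyr packages are loaded and that the pre-grouped hflights_DayOfWeek object already exists; based on the preceding examples, it was presumably created along the following lines:
> library(hflights)
> library(plyr)
> library(dplyr)
> # pre-group the data.frame by the day of the week for dplyr::summarise
> hflights_DayOfWeek <- group_by(hflights, DayOfWeek)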
However, as mentioned before, the summarise function in dplyr needs some prior data restructuring, which also takes time. To this end, let's define another function that also includes the creation of the new data structure along with the real aggregation:
> DPLYR_ALL <- function() {
+     hflights_DayOfWeek <- group_by(hflights, DayOfWeek)
+     dplyr::summarise(hflights_DayOfWeek, mean(Diverted))
+ }
Similarly, benchmarking data.table also requires some additional variables in the test environment; as hflights_dt is already sorted by DayOfWeek, let's create a new data.table object without a key for benchmarking:
> hflights_dt_nokey <- data.table(hflights)
Further, it probably makes sense to verify that it has no keys:
> key(hflights_dt_nokey)
NULL
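For reference, hflights_dt itself was presumably created and indexed in the earlier examples along the following lines, which is also why it is already sorted by DayOfWeek:
> library(data.table)
> hflights_dt <- data.table(hflights)
> # setting the key sorts the table by DayOfWeek and indexes it
> setkey(hflights_dt, 'DayOfWeek')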
Okay, now we can define the data.table test cases, along with a function that also includes the transformation to data.table and the addition of an index, just to be fair to dplyr:
> DT <- function() hflights_dt_nokey[, mean(Diverted),
+     by = DayOfWeek]
> DT_KEY <- function() hflights_dt[, mean(Diverted),
+     by = DayOfWeek]
> DT_ALL <- function() {
+     setkey(hflights_dt_nokey, 'DayOfWeek')
+     hflights_dt_nokey[, mean(Diverted), by = DayOfWeek]
+     setkey(hflights_dt_nokey, NULL)
+ }
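Before timing anything, it makes sense to verify that the competing implementations really compute the same values. One possible spot check: both AGGR1 and DT_KEY return the seven weekday means in sorted order, so their value columns should compare as equal:
> # the value columns of the base R and the keyed data.table results
> # should match element by element
> all.equal(AGGR1()$x, DT_KEY()$V1)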
Now that we have all the described implementations ready for testing, let's load the microbenchmark package to do its job:
> library(microbenchmark)
> res <- microbenchmark(AGGR1(), AGGR2(), AGGR3(), TAPPLY(), PLYR1(),
+     PLYR2(), DPLYR(), DPLYR_ALL(), DT(), DT_KEY(), DT_ALL())
> print(res, digits = 3)
Unit: milliseconds
        expr     min      lq  median      uq     max neval
     AGGR1() 2279.82 2348.14 2462.02 2597.70 2719.88    10
     AGGR2() 2278.15 2465.09 2528.55 2796.35 2996.98    10
     AGGR3() 2358.71 2528.23 2726.66 2879.89 3177.63    10
    TAPPLY()   19.90   21.05   23.56   29.65   33.88    10
     PLYR1()   56.93   59.16   70.73   82.41  155.88    10
     PLYR2()   58.31   65.71   76.51   98.92  103.48    10
     DPLYR()    1.18    1.21    1.30    1.74    1.84    10
 DPLYR_ALL()    7.40    7.65    7.93    8.25   14.51    10
        DT()    5.45    5.73    5.99    7.75    9.00    10
    DT_KEY()    5.22    5.45    5.63    6.26   13.64    10
    DT_ALL()   31.31   33.26   35.19   38.34   42.83    10
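The summary can also be printed in relative units, which rescales every row to the fastest method and makes the order-of-magnitude differences easier to grasp:
> # rescale the timings so that the fastest expression equals 1
> print(res, unit = 'relative', digits = 3)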
The results are pretty spectacular: from more than 2,000 milliseconds, we have managed to get the very same results in only a little more than 1 millisecond. The spread can be demonstrated easily on a violin plot with a logarithmic scale:
> autoplot(res)

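Note that autoplot is a ggplot2 generic, so the ggplot2 package has to be loaded as well; the method provided by microbenchmark plots the timings on a logarithmic scale by default, which is what keeps this huge spread readable. The explicit equivalent of the preceding call would be something like:
> library(ggplot2)
> # log = TRUE is the default for microbenchmark objects
> autoplot(res, log = TRUE)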
Therefore, dplyr seems to be the most efficient solution, although if we also take the extra step of grouping the data.frame into account, this otherwise clear advantage becomes rather unconvincing. As a matter of fact, if we already have a data.table object, and can thus save the transformation of a traditional data.frame into a data.table, then data.table performs better than dplyr. However, I am pretty sure that you will not really notice the time difference between the two high-performance solutions; both of them do a very good job, even with larger datasets.
It's worth mentioning that dplyr can work with data.table objects as well; so, to ensure that you are not locked into either package, it's definitely worth using both if needed. The following is a proof-of-concept example:
> dplyr::summarise(group_by(hflights_dt, DayOfWeek), mean(Diverted))
Source: local data table [7 x 2]

  DayOfWeek mean(Diverted)
1         1    0.002997672
2         2    0.002559323
3         3    0.003226211
4         4    0.003065727
5         5    0.002687865
6         6    0.002823121
7         7    0.002589057
Okay, so now we can be pretty sure that we will use either data.table or dplyr for computing group averages in the future. However, what about more complex operations?