- Mastering Data Analysis with R
- Gergely Daróczi
Aggregation
The most straightforward way of summarizing data is calling the `aggregate` function from the `stats` package, which does exactly what we are looking for: splitting the data into subsets by a grouping variable, then computing summary statistics for them separately. The most basic way to call the `aggregate` function is to pass the numeric vector to be aggregated and a factor variable defining the splits, plus the function to be applied on each subset in the `FUN` argument. Now, let's see the average ratio of diverted flights on each weekday:
```
> aggregate(hflights$Diverted, by = list(hflights$DayOfWeek),
+   FUN = mean)
  Group.1           x
1       1 0.002997672
2       2 0.002559323
3       3 0.003226211
4       4 0.003065727
5       5 0.002687865
6       6 0.002823121
7       7 0.002589057
```
Well, it took some time to run the preceding script, but please bear in mind that we have just aggregated around a quarter of a million rows to see the daily averages for the number of diverted flights departing from the Houston airport in 2011.
In other words, and in a form that also makes sense for those not into statistics, this is the percentage of diverted flights per weekday. The results are rather interesting: it seems that flights are diverted more often in the middle of the week (around 0.3 percent) than over the weekends (around 0.05 percentage points less), at least from Houston.
An alternative way of calling the preceding function is to supply the arguments inside of the `with` function, which seems to be a more human-friendly expression after all, because it saves us from the repeated mention of the `hflights` data frame:
```
> with(hflights, aggregate(Diverted, by = list(DayOfWeek),
+   FUN = mean))
```
The results are not shown here, as they are exactly the same as those shown earlier. The manual of the `aggregate` function (see `?aggregate`) states that it returns the results in a convenient form. Well, checking the column names of the above returned data does not seem too convenient, right? We can overcome this issue by using the formula notation instead of defining the numeric and factor variables separately:
```
> aggregate(Diverted ~ DayOfWeek, data = hflights, FUN = mean)
  DayOfWeek    Diverted
1         1 0.002997672
2         2 0.002559323
3         3 0.003226211
4         4 0.003065727
5         5 0.002687865
6         6 0.002823121
7         7 0.002589057
```
The gain of using the formula notation is at least threefold:

- There are relatively few characters to type
- The headers and row names are correct in the results
- This version also runs a bit faster than the previous `aggregate` calls; please see the all-out benchmark at the end of this section
The only downside of using the formula notation is that you have to learn it, which might seem a bit awkward at first; but as formulas are widely used in a number of R functions and packages, particularly for defining models, it's definitely worth learning how to use them in the long run.
Note
The formula notation is inherited from the S language with the following general syntax: `response_variable ~ predictor_variable_1 + ... + predictor_variable_n`. The notation also includes some other symbols, such as `-` for excluding variables, and `:` or `*` to include the interaction between the variables without or with the variables themselves. See Chapter 5, Building Models (authored by Renata Nemeth and Gergely Toth), and `?formula` in the R console for more details.
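For example, more than one grouping variable can be combined with the `+` symbol in the very same `aggregate` call (a quick sketch; the choice of the `Month` column here is arbitrary):

```r
# Mean diversion ratio for each weekday within each month:
# the formula lists two grouping variables joined by +
aggregate(Diverted ~ DayOfWeek + Month, data = hflights, FUN = mean)
```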
Quicker aggregation with base R commands
An alternative solution for aggregating data might be to call the `tapply` or `by` function, which can apply an R function over a ragged array. The latter means that we can provide one or more `INDEX` variables, which will be coerced to factors, and then run the provided R function separately on all cells of each subset. The following is a quick example:
```
> tapply(hflights$Diverted, hflights$DayOfWeek, mean)
       1        2        3        4        5        6        7 
0.002998 0.002559 0.003226 0.003066 0.002688 0.002823 0.002589 
```
Please note that `tapply` returns an `array` object instead of a convenient data frame; on the other hand, it runs a lot quicker than the abovementioned `aggregate` calls. Thus, it might be reasonable to use `tapply` for the computations and then convert the results to a `data.frame` with the appropriate column names.
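Such a conversion might be sketched as follows (the column names are, of course, arbitrary):

```r
# Compute the daily means with tapply, then turn the resulting
# named numeric vector into a data.frame with proper column names
res <- tapply(hflights$Diverted, hflights$DayOfWeek, mean)
data.frame(DayOfWeek = as.integer(names(res)), Diverted = as.numeric(res))
```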
Convenient helper functions
Such conversions can be done easily and in a very user-friendly way by, for example, using the `plyr` package, the general version of the `dplyr` package, which stands for plyr specialized for data frames.
The `plyr` package provides a variety of functions to apply a function to `data.frame`, `list`, or `array` objects, and it can return the results in any of the mentioned formats. The naming scheme of these functions is easy to remember: the first character of the function name stands for the class of the input data, and the second character represents the output format, followed by ply in all cases. Besides the three abovementioned R classes, there are some special options coded by the following characters:
- `d` stands for `data.frame`
- `a` stands for `array`
- `l` stands for `list`
- `m` is a special input type, which means that we provide multiple arguments in a tabular format for the function
- `r` is an input type that expects an integer, which specifies the number of times the function should be replicated
- `_` is a special output type that does not return anything for the function
Thus, the following most frequently used combinations are available:
- `ddply` takes a `data.frame` as input and returns a `data.frame`
- `ldply` takes a `list` as input but returns a `data.frame`
- `l_ply` does not return anything, but it's really useful, for example, to iterate through a number of elements instead of a `for` loop; with the `.progress` argument set, the function can also show the current state of iterations and the estimated remaining time
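The latter two combinations can be sketched with a toy example (the input lists here are made up purely for illustration):

```r
library(plyr)
# ldply: apply the function to each list element and
# bind the results into a single data.frame
ldply(list(a = 1:3, b = 4:6), function(x) data.frame(sum = sum(x)))

# l_ply: called only for its side effects; the .progress
# argument draws a textual progress bar while iterating
l_ply(1:3, function(x) Sys.sleep(0.1), .progress = 'text')
```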
Please find more details, examples, and use cases of `plyr` in Chapter 4, Restructuring Data. Here, we will only concentrate on how to summarize data. To this end, we will use `ddply` (not to be confused with the `dplyr` package) in all the following examples: taking a `data.frame` as the input argument and returning data of the same class.
So, let's load the package and apply the `mean` function on the `Diverted` column over each subset defined by `DayOfWeek`:
```
> library(plyr)
> ddply(hflights, .(DayOfWeek), function(x) mean(x$Diverted))
  DayOfWeek          V1
1         1 0.002997672
2         2 0.002559323
3         3 0.003226211
4         4 0.003065727
5         5 0.002687865
6         6 0.002823121
7         7 0.002589057
```
An important thing to note here is that `ddply` is much quicker than our first attempt with the `aggregate` function. On the other hand, I am not yet pleased with the results; `V1` and other such creative column names have always freaked me out. Instead of updating the names of the `data.frame` in post-processing, let's call the `summarise` helper function instead of the previously applied anonymous one; here, we can also provide the desired name for our newly computed column:
```
> ddply(hflights, .(DayOfWeek), summarise, Diverted = mean(Diverted))
  DayOfWeek    Diverted
1         1 0.002997672
2         2 0.002559323
3         3 0.003226211
4         4 0.003065727
5         5 0.002687865
6         6 0.002823121
7         7 0.002589057
```
Okay, much better. But, can we do even better?
High-performance helper functions
Hadley Wickham, the author of `ggplot`, `reshape`, and several other R packages, started working on the second generation, or rather a specialized version, of `plyr` in 2008. The basic concept was that `plyr` is most frequently used to transform one `data.frame` into another `data.frame`; therefore, this operation deserves extra attention. The `dplyr` package, `plyr` specialized for data frames, provides a faster implementation of the `plyr` functions, written in C++, and `dplyr` can also deal with remote databases.
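As a quick illustration of the latter feature (an assumption-laden sketch: it requires the DBI, RSQLite, and dbplyr packages, none of which are used elsewhere in this section), the very same verbs can be run against a database table:

```r
library(dplyr)
library(DBI)
# Copy the data into an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "hflights", as.data.frame(hflights))
# group_by and summarise are translated into SQL behind the scenes
tbl(con, "hflights") %>%
  group_by(DayOfWeek) %>%
  summarise(Diverted = mean(Diverted))
```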
However, the performance improvements go hand in hand with some other changes; for example, the syntax of `dplyr` has changed a lot compared to `plyr`. Although the previously mentioned `summarise` function does exist in `dplyr`, we do not have the `ddply` function any more; instead, all functions in the package are dedicated to act as some component of `plyr::ddply`.
Anyway, to keep the theoretical background short, if we want to summarize the subgroups of a dataset, we have to define the groups before aggregation:
```
> hflights_DayOfWeek <- group_by(hflights, DayOfWeek)
```
The resulting object is the very same `data.frame` that we had previously, with one exception: a bunch of metadata was merged into the object by means of attributes. To keep the following output short, we do not list the whole structure (`str`) of the object; only the attributes are shown:
```
> str(attributes(hflights_DayOfWeek))
List of 9
 $ names             : chr [1:21] "Year" "Month" "DayofMonth" ...
 $ class             : chr [1:4] "grouped_df" "tbl_df" "tbl" ...
 $ row.names         : int [1:227496] 5424 5425 5426 5427 5428 ...
 $ vars              :List of 1
  ..$ : symbol DayOfWeek
 $ drop              : logi TRUE
 $ indices           :List of 7
  ..$ : int [1:34360] 2 9 16 23 30 33 40 47 54 61 ...
  ..$ : int [1:31649] 3 10 17 24 34 41 48 55 64 70 ...
  ..$ : int [1:31926] 4 11 18 25 35 42 49 56 65 71 ...
  ..$ : int [1:34902] 5 12 19 26 36 43 50 57 66 72 ...
  ..$ : int [1:34972] 6 13 20 27 37 44 51 58 67 73 ...
  ..$ : int [1:27629] 0 7 14 21 28 31 38 45 52 59 ...
  ..$ : int [1:32058] 1 8 15 22 29 32 39 46 53 60 ...
 $ group_sizes       : int [1:7] 34360 31649 31926 34902 34972 ...
 $ biggest_group_size: int 34972
 $ labels            :'data.frame': 7 obs. of 1 variable:
  ..$ DayOfWeek: int [1:7] 1 2 3 4 5 6 7
  ..- attr(*, "vars")=List of 1
  .. ..$ : symbol DayOfWeek
```
From this metadata, the `indices` attribute is important. It simply lists the IDs of the rows belonging to each weekday, so later operations can easily select the subgroups from the whole dataset. So, let's see the proportion of diverted flights, with some performance boost due to using `summarise` from `dplyr` instead of `plyr`:
```
> dplyr::summarise(hflights_DayOfWeek, mean(Diverted))
Source: local data frame [7 x 2]

  DayOfWeek mean(Diverted)
1         1    0.002997672
2         2    0.002559323
3         3    0.003226211
4         4    0.003065727
5         5    0.002687865
6         6    0.002823121
7         7    0.002589057
```
The results are pretty familiar, which is good. However, while running this example, did you measure the execution time? It was close to instant, which makes `dplyr` look even better.
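As a side note, the two steps can also be chained into a single expression with the pipe operator re-exported by dplyr (a stylistic sketch, equivalent to the calls above):

```r
library(dplyr)
# Group and summarise in one pipeline instead of two statements
hflights %>%
  group_by(DayOfWeek) %>%
  summarise(Diverted = mean(Diverted))
```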
Aggregate with data.table
Do you remember the second argument of `[.data.table`? It's called `j`, and it stands for a `SELECT` or an `UPDATE` SQL statement; its most important feature is that it can be any R expression. Thus, we can simply pass a function there and set the groups with the help of the `by` argument:
```
> hflights_dt[, mean(Diverted), by = DayOfWeek]
   DayOfWeek          V1
1:         6 0.002823121
2:         7 0.002589057
3:         1 0.002997672
4:         2 0.002559323
5:         3 0.003226211
6:         4 0.003065727
7:         5 0.002687865
```
I am pretty sure that you are not in the least surprised by how fast `data.table` returned the results, as people get used to great tools very quickly. Further, it was very concise compared to the previous two-line `dplyr` call, right? The only downside of this solution is that the weekdays are ordered by some hardly intelligible rank. Please see Chapter 4, Restructuring Data, for more details on this; for now, let's fix the issue quickly by setting a key, which means that we order the `data.table` first by `DayOfWeek`:
```
> setkey(hflights_dt, 'DayOfWeek')
> hflights_dt[, mean(Diverted), by = DayOfWeek]
   DayOfWeek          V1
1:         1 0.002997672
2:         2 0.002559323
3:         3 0.003226211
4:         4 0.003065727
5:         5 0.002687865
6:         6 0.002823121
7:         7 0.002589057
```
Besides getting the results in the expected order, summarizing data by an already existing key also runs relatively fast. Let's verify this with some empirical evidence on your machine!
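Such a benchmark might be sketched as follows (a hedged example: it assumes the microbenchmark package is installed and that the objects created earlier in this section are still in the workspace):

```r
library(microbenchmark)
# Compare the aggregation approaches covered in this section;
# each expression computes the very same seven daily means
microbenchmark(
  aggregate  = aggregate(Diverted ~ DayOfWeek, data = hflights, FUN = mean),
  tapply     = tapply(hflights$Diverted, hflights$DayOfWeek, mean),
  plyr       = plyr::ddply(hflights, "DayOfWeek", plyr::summarise,
                           Diverted = mean(Diverted)),
  dplyr      = dplyr::summarise(hflights_DayOfWeek, mean(Diverted)),
  data.table = hflights_dt[, mean(Diverted), by = DayOfWeek],
  times = 10)
```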