- Mastering Data Analysis with R
- Gergely Daróczi
Aggregation
The most straightforward way of summarizing data is calling the `aggregate` function from the `stats` package, which does exactly what we are looking for: splitting the data into subsets by a grouping variable, then computing summary statistics for them separately. The most basic way to call the `aggregate` function is to pass the numeric vector to be aggregated and a factor variable defining the splits, plus the function to be applied on each subset in the `FUN` argument. Now, let's see the average ratio of diverted flights on each weekday:
```
> aggregate(hflights$Diverted, by = list(hflights$DayOfWeek),
+   FUN = mean)
  Group.1           x
1       1 0.002997672
2       2 0.002559323
3       3 0.003226211
4       4 0.003065727
5       5 0.002687865
6       6 0.002823121
7       7 0.002589057
```
Well, it took some time to run the preceding script, but please bear in mind that we have just aggregated around a quarter of a million rows to see the daily averages for the number of diverted flights departing from the Houston airport in 2011.
In other words, and in a form that also makes sense for those not into statistics, this is the percentage of diverted flights per weekday. The results are rather interesting: it seems that flights are diverted more often in the middle of the week (around 0.3 percent) than over the weekends (around 0.05 percentage points less), at least from Houston.
An alternative way of calling the preceding function is to supply the arguments inside of the `with` function, which seems to be a more human-friendly expression after all, because it saves us from the repeated mention of the `hflights` data frame:
```
> with(hflights, aggregate(Diverted, by = list(DayOfWeek),
+   FUN = mean))
```
The results are not shown here, as they are exactly the same as those shown earlier. The manual of the `aggregate` function (see `?aggregate`) states that it returns the results in a convenient form. Well, checking the column names of the above returned data does not seem too convenient, right? We can overcome this issue by using the formula notation instead of defining the numeric and factor variables separately:
```
> aggregate(Diverted ~ DayOfWeek, data = hflights, FUN = mean)
  DayOfWeek    Diverted
1         1 0.002997672
2         2 0.002559323
3         3 0.003226211
4         4 0.003065727
5         5 0.002687865
6         6 0.002823121
7         7 0.002589057
```
The gain of using the formula notation is at least threefold:

- There are relatively few characters to type
- The headers and row names are correct in the results
- This version also runs a bit faster than the previous `aggregate` calls; please see the all-out benchmark at the end of this section
The only downside of using the formula notation is that you have to learn it, which might seem a bit awkward at first; but as formulas are widely used in a number of R functions and packages, particularly for defining models, it's definitely worth learning how to use them in the long run.
Note
The formula notation is inherited from the S language with the following general syntax: `response_variable ~ predictor_variable_1 + ... + predictor_variable_n`. The notation also includes some other symbols, such as `-` for excluding variables, and `:` or `*` to include the interaction between the variables without or with the variables themselves. See Chapter 5, Building Models (authored by Renata Nemeth and Gergely Toth), and `?formula` in the R console for more details.
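For example, more than one grouping variable can be combined with the `+` symbol in the very same `aggregate` call (a quick sketch; the choice of the `Month` column here is arbitrary):

```r
# Mean diversion ratio for each weekday within each month:
# the formula lists two grouping variables joined by +
aggregate(Diverted ~ DayOfWeek + Month, data = hflights, FUN = mean)
```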
Quicker aggregation with base R commands
An alternative solution for aggregating data might be to call the `tapply` or `by` function, which can apply an R function over a ragged array. The latter means that we can provide one or more `INDEX` variables, which will be coerced to factors, and then run the provided R function separately on all cells of each subset. The following is a quick example:
```
> tapply(hflights$Diverted, hflights$DayOfWeek, mean)
       1        2        3        4        5        6        7 
0.002998 0.002559 0.003226 0.003066 0.002688 0.002823 0.002589 
```
Please note that `tapply` returns an `array` object instead of a convenient data frame; on the other hand, it runs a lot quicker than the abovementioned `aggregate` calls. Thus, it might be reasonable to use `tapply` for the computations and then convert the results to a `data.frame` with the appropriate column names.
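Such a conversion might be sketched as follows (the column names are, of course, arbitrary):

```r
# Compute the daily means with tapply, then turn the resulting
# named numeric vector into a data.frame with proper column names
res <- tapply(hflights$Diverted, hflights$DayOfWeek, mean)
data.frame(DayOfWeek = as.integer(names(res)), Diverted = as.numeric(res))
```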
Convenient helper functions
Such conversions can be done easily and in a very user-friendly way by, for example, using the `plyr` package, the general version of the `dplyr` package, which stands for plyr specialized for data frames.
The `plyr` package provides a variety of functions to apply a function to `data.frame`, `list`, or `array` objects, and it can return the results in any of the mentioned formats. The naming scheme of these functions is easy to remember: the first character of the function name stands for the class of the input data, and the second character represents the output format, followed by ply in all cases. Besides the three abovementioned R classes, there are some special options coded by the following characters:
- `d` stands for `data.frame`
- `a` stands for `array`
- `l` stands for `list`
- `m` is a special input type, which means that we provide multiple arguments in a tabular format for the function
- `r` is an input type that expects an integer, which specifies the number of times the function should be replicated
- `_` is a special output type that does not return anything for the function
Thus, the following most frequently used combinations are available:
- `ddply` takes a `data.frame` as input and returns a `data.frame`
- `ldply` takes a `list` as input but returns a `data.frame`
- `l_ply` does not return anything, but it's really useful, for example, to iterate through a number of elements instead of a `for` loop; with the `.progress` argument set, the function can also show the current state of iterations and the estimated remaining time
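The latter two combinations can be sketched with a toy example (the input lists here are made up purely for illustration):

```r
library(plyr)
# ldply: apply the function to each list element and
# bind the results into a single data.frame
ldply(list(a = 1:3, b = 4:6), function(x) data.frame(sum = sum(x)))

# l_ply: called only for its side effects; the .progress
# argument draws a textual progress bar while iterating
l_ply(1:3, function(x) Sys.sleep(0.1), .progress = 'text')
```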
Please find more details, examples, and use cases of `plyr` in Chapter 4, Restructuring Data. Here, we will only concentrate on how to summarize data. To this end, we will use `ddply` (not to be confused with the `dplyr` package) in all the following examples: taking a `data.frame` as the input argument and returning data of the same class.
So, let's load the package and apply the `mean` function on the `Diverted` column over each subset defined by `DayOfWeek`:
```
> library(plyr)
> ddply(hflights, .(DayOfWeek), function(x) mean(x$Diverted))
  DayOfWeek          V1
1         1 0.002997672
2         2 0.002559323
3         3 0.003226211
4         4 0.003065727
5         5 0.002687865
6         6 0.002823121
7         7 0.002589057
```
An important thing to note here is that `ddply` is much quicker than our first attempt with the `aggregate` function. On the other hand, I am not yet pleased with the results; `V1` and other such creative column names have always freaked me out. Instead of updating the names of the `data.frame` in post-processing, let's call the `summarise` helper function instead of the previously applied anonymous one; here, we can also provide the desired name for our newly computed column:
```
> ddply(hflights, .(DayOfWeek), summarise, Diverted = mean(Diverted))
  DayOfWeek    Diverted
1         1 0.002997672
2         2 0.002559323
3         3 0.003226211
4         4 0.003065727
5         5 0.002687865
6         6 0.002823121
7         7 0.002589057
```
Okay, much better. But, can we do even better?
High-performance helper functions
Hadley Wickham, the author of `ggplot`, `reshape`, and several other R packages, started working on the second generation, or rather a specialized version, of `plyr` in 2008. The basic concept was that `plyr` is most frequently used to transform one `data.frame` into another `data.frame`; therefore, this operation deserves extra attention. The `dplyr` package, `plyr` specialized for data frames, provides a faster implementation of the `plyr` functions, written in C++, and `dplyr` can also deal with remote databases.
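As a quick illustration of the latter feature (an assumption-laden sketch: it requires the DBI, RSQLite, and dbplyr packages, none of which are used elsewhere in this section), the very same verbs can be run against a database table:

```r
library(dplyr)
library(DBI)
# Copy the data into an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "hflights", as.data.frame(hflights))
# group_by and summarise are translated into SQL behind the scenes
tbl(con, "hflights") %>%
  group_by(DayOfWeek) %>%
  summarise(Diverted = mean(Diverted))
```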
However, the performance improvements go hand in hand with some other changes; for example, the syntax of `dplyr` has changed a lot compared to `plyr`. Although the previously mentioned `summarise` function does exist in `dplyr`, we do not have the `ddply` function any more; instead, all functions in the package are dedicated to act as some component of `plyr::ddply`.
Anyway, to keep the theoretical background short, if we want to summarize the subgroups of a dataset, we have to define the groups before aggregation:
```
> hflights_DayOfWeek <- group_by(hflights, DayOfWeek)
```
The resulting object is the very same `data.frame` that we had previously, with one exception: a bunch of metadata was merged into the object by means of attributes. To keep the following output short, we do not list the whole structure (`str`) of the object; only the attributes are shown:
```
> str(attributes(hflights_DayOfWeek))
List of 9
 $ names             : chr [1:21] "Year" "Month" "DayofMonth" ...
 $ class             : chr [1:4] "grouped_df" "tbl_df" "tbl" ...
 $ row.names         : int [1:227496] 5424 5425 5426 5427 5428 ...
 $ vars              :List of 1
  ..$ : symbol DayOfWeek
 $ drop              : logi TRUE
 $ indices           :List of 7
  ..$ : int [1:34360] 2 9 16 23 30 33 40 47 54 61 ...
  ..$ : int [1:31649] 3 10 17 24 34 41 48 55 64 70 ...
  ..$ : int [1:31926] 4 11 18 25 35 42 49 56 65 71 ...
  ..$ : int [1:34902] 5 12 19 26 36 43 50 57 66 72 ...
  ..$ : int [1:34972] 6 13 20 27 37 44 51 58 67 73 ...
  ..$ : int [1:27629] 0 7 14 21 28 31 38 45 52 59 ...
  ..$ : int [1:32058] 1 8 15 22 29 32 39 46 53 60 ...
 $ group_sizes       : int [1:7] 34360 31649 31926 34902 34972 ...
 $ biggest_group_size: int 34972
 $ labels            :'data.frame': 7 obs. of 1 variable:
  ..$ DayOfWeek: int [1:7] 1 2 3 4 5 6 7
  ..- attr(*, "vars")=List of 1
  .. ..$ : symbol DayOfWeek
```
From this metadata, the `indices` attribute is important. It simply lists the IDs of the rows belonging to each weekday, so later operations can easily select the subgroups from the whole dataset. So, let's see the proportion of diverted flights, with some performance boost due to using `summarise` from `dplyr` instead of `plyr`:
```
> dplyr::summarise(hflights_DayOfWeek, mean(Diverted))
Source: local data frame [7 x 2]

  DayOfWeek mean(Diverted)
1         1    0.002997672
2         2    0.002559323
3         3    0.003226211
4         4    0.003065727
5         5    0.002687865
6         6    0.002823121
7         7    0.002589057
```
The results are pretty familiar, which is good. However, while running this example, did you measure the execution time? It was close to instant, which makes `dplyr` look even better.
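As a side note, the two steps can also be chained into a single expression with the pipe operator re-exported by dplyr (a stylistic sketch, equivalent to the calls above):

```r
library(dplyr)
# Group and summarise in one pipeline instead of two statements
hflights %>%
  group_by(DayOfWeek) %>%
  summarise(Diverted = mean(Diverted))
```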
Aggregate with data.table
Do you remember the second argument of `[.data.table`? It's called `j`, and it stands for a `SELECT` or an `UPDATE` SQL statement; its most important feature is that it can be any R expression. Thus, we can simply pass a function there and set the groups with the help of the `by` argument:
```
> hflights_dt[, mean(Diverted), by = DayOfWeek]
   DayOfWeek          V1
1:         6 0.002823121
2:         7 0.002589057
3:         1 0.002997672
4:         2 0.002559323
5:         3 0.003226211
6:         4 0.003065727
7:         5 0.002687865
```
I am pretty sure that you are not in the least surprised by how fast `data.table` returned the results, as people get used to great tools very quickly. Further, it was very concise compared to the previous two-line `dplyr` call, right? The only downside of this solution is that the weekdays are ordered by some hardly intelligible rank. Please see Chapter 4, Restructuring Data, for more details on this; for now, let's fix the issue quickly by setting a key, which means that we order the `data.table` first by `DayOfWeek`:
```
> setkey(hflights_dt, 'DayOfWeek')
> hflights_dt[, mean(Diverted), by = DayOfWeek]
   DayOfWeek          V1
1:         1 0.002997672
2:         2 0.002559323
3:         3 0.003226211
4:         4 0.003065727
5:         5 0.002687865
6:         6 0.002823121
7:         7 0.002589057
```
Besides getting the results in the expected order, summarizing data by an already existing key also runs relatively fast. Let's verify this with some empirical evidence on your machine!
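Such a benchmark might be sketched as follows (a hedged example: it assumes the microbenchmark package is installed and that the objects created earlier in this section are still in the workspace):

```r
library(microbenchmark)
# Compare the aggregation approaches covered in this section;
# each expression computes the very same seven daily means
microbenchmark(
  aggregate  = aggregate(Diverted ~ DayOfWeek, data = hflights, FUN = mean),
  tapply     = tapply(hflights$Diverted, hflights$DayOfWeek, mean),
  plyr       = plyr::ddply(hflights, "DayOfWeek", plyr::summarise,
                           Diverted = mean(Diverted)),
  dplyr      = dplyr::summarise(hflights_DayOfWeek, mean(Diverted)),
  data.table = hflights_dt[, mean(Diverted), by = DayOfWeek],
  times = 10)
```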