- Mastering Data Analysis with R
- Gergely Daróczi
Drop needless data
Although the optimal solution is not to load the needless data in the first place (see the Loading a subset of text files and Loading data from databases sections in Chapter 1, Hello, Data!), we often have to filter the original dataset inside R. This can be done with the traditional tools and functions from base R, such as subset, by using which and the [ or [[ operators (see the following code), or, for example, with the SQL-like approach of the sqldf package:
> library(sqldf)
> sqldf("SELECT * FROM mtcars WHERE am=1 AND vs=1")
   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
2 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
3 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
4 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
5 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
6 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
7 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
I am sure that all readers with a solid SQL background who are just getting started with R will appreciate this alternative way of filtering data, but I personally prefer the following rather similar, native, and much more concise R version:
> subset(mtcars, am == 1 & vs == 1)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Please note the slight difference in the results. This is because the row.names argument of sqldf is FALSE by default, which can of course be overridden to get exactly the same results:
> identical(
+     sqldf("SELECT * FROM mtcars WHERE am=1 AND vs=1",
+         row.names = TRUE),
+     subset(mtcars, am == 1 & vs == 1)
+ )
[1] TRUE
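For completeness, the other base R tools mentioned above (the [ operator combined with which, or plain logical indexing) can produce the same subset without any extra package; a minimal sketch:

```r
# the [ operator with which: integer indices of the matching rows
mtcars[which(mtcars$am == 1 & mtcars$vs == 1), ]
# plain logical indexing gives the same rows (mtcars has no NA values,
# so there is no difference in NA handling here)
mtcars[mtcars$am == 1 & mtcars$vs == 1, ]
```

Both calls keep the original row names, just like subset does.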
These examples focused on how to drop rows from a data.frame, but what if we also want to remove some columns?
The SQL approach is really straightforward: just specify the required columns instead of * in the SELECT statement. On the other hand, subset also supports this via the select argument, which can take vectors or an R expression describing, for example, a range of columns:
> subset(mtcars, am == 1 & vs == 1, select = hp:wt)
                hp drat    wt
Datsun 710      93 3.85 2.320
Fiat 128        66 4.08 2.200
Honda Civic     52 4.93 1.615
Toyota Corolla  65 4.22 1.835
Fiat X1-9       66 4.08 1.935
Lotus Europa   113 3.77 1.513
Volvo 142E     109 4.11 2.780
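For comparison, the same column selection can be sketched on the SQL side by listing the columns in the SELECT statement, assuming the sqldf package is already loaded:

```r
library(sqldf)
# keep only the hp, drat and wt columns of the filtered rows
sqldf("SELECT hp, drat, wt FROM mtcars WHERE am=1 AND vs=1")
```

As before, the row names are dropped unless row.names = TRUE is set.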
Let's take this to the next step, and see how we can apply the aforementioned filters to some larger datasets, where we face performance issues with the base functions.
Drop needless data in an efficient way
R works best with datasets that fit in the actual physical memory, and some R packages provide extremely fast access to this amount of data.
Some of the related packages were already mentioned in Chapter 1, Hello, Data!, where we benchmarked reading a relatively large amount of data from the hflights package into R.
Let's see how the preceding examples perform on this dataset of a quarter of a million rows:
> library(hflights)
> system.time(sqldf("SELECT * FROM hflights WHERE Dest == 'BNA'",
+     row.names = TRUE))
   user  system elapsed
  1.487   0.000   1.493
> system.time(subset(hflights, Dest == 'BNA'))
   user  system elapsed
  0.132   0.000   0.131
The base::subset function seems to perform pretty well, but can we make it any faster? Well, the second generation of the plyr package, called dplyr (the relevant details are discussed in the High-performance helper functions section in this chapter and in Chapter 4, Restructuring Data), provides extremely fast C++ implementations of the most common data manipulation methods in a rather intuitive way:
> library(dplyr)
> system.time(filter(hflights, Dest == 'BNA'))
   user  system elapsed
  0.021   0.000   0.022
Further, we can extend this solution by dropping some columns from the dataset, just like we did with subset before, although now we call the select function instead of passing an argument with the same name:
> str(select(filter(hflights, Dest == 'BNA'), DepTime:ArrTime))
'data.frame':   3481 obs. of  2 variables:
 $ DepTime: int  1419 1232 1813 900 716 1357 2000 1142 811 1341 ...
 $ ArrTime: int  1553 1402 1948 1032 845 1529 2132 1317 945 1519 ...
Therefore, it's like calling the filter function instead of subset, and we get the results faster than the blink of an eye! The dplyr package can work with traditional data.frame or data.table objects, or can interact directly with the most widely used database engines. Please note that row names are not preserved in dplyr, so if you require them, it's worth copying the names to an explicit variable before passing the data to dplyr or directly to data.table, as follows:
> mtcars$rownames <- rownames(mtcars)
> select(filter(mtcars, hp > 300), c(rownames, hp))
       rownames  hp
1 Maserati Bora 335
Drop needless data in another efficient way
Let's see a quick example of the data.table solution on its own, without dplyr.
After loading the package, we have to transform the traditional hflights data.frame to a data.table. Then, we create a new column, called rownames, to which we assign the rownames of the original dataset with the help of the := assignment operator, which is specific to data.table:
> library(data.table)
> hflights_dt <- data.table(hflights)
> hflights_dt[, rownames := rownames(hflights)]
> system.time(hflights_dt[Dest == 'BNA'])
   user  system elapsed
  0.021   0.000   0.020
Well, it takes some time to get used to the custom data.table syntax, and it might even seem a bit strange to the traditional R user at first sight, but it's definitely worth mastering in the long run. You get great performance, and the syntax turns out to be natural and flexible after the relatively steep learning curve of the first few examples.
As a matter of fact, the data.table syntax is pretty similar to SQL:
DT[i, j, ..., drop = TRUE]
This could be described with SQL commands as follows:
DT[where, select | update, group by][having][order by][ ]...[ ]
Therefore, [.data.table (which stands for the [ operator applied to a data.table object) has some different arguments as compared to the traditional [.data.frame syntax, as you have already seen in the preceding example.
Note
Now, we are not dealing with the assignment operator in detail, as this example might be too complex for such an introductory part of the book, and we are probably getting out of our comfort zone. Therefore, please find more details in Chapter 4, Restructuring Data, or head to ?data.table for a rather technical overview.
It seems that the first argument (i) of the [.data.table operator stands for filtering, or in other words, for the WHERE statement in SQL parlance, while [.data.frame expects indices specifying which rows to keep from the original dataset. The real difference between the two arguments is that the former can take any R expression, while the latter traditional method expects mainly integers or logical values.
Anyway, filtering is as easy as passing an R expression to the i argument of the [ operator specific to data.table. Further, let's see how we can select the columns in the data.table syntax, which should be done in the second argument (j) of the call, on the basis of the abovementioned general data.table syntax:
> str(hflights_dt[Dest == 'BNA', list(DepTime, ArrTime)])
Classes 'data.table' and 'data.frame':  3481 obs. of  2 variables:
 $ DepTime: int  1419 1232 1813 900 716 1357 2000 1142 811 1341 ...
 $ ArrTime: int  1553 1402 1948 1032 845 1529 2132 1317 945 1519 ...
 - attr(*, ".internal.selfref")=<externalptr>
Okay, so we now have the two expected columns with the 3,481 observations. Note that list was used to define the required columns to keep, although the use of c (a function from base R to concatenate vector elements) is more traditional with [.data.frame. The latter is also possible with [.data.table, but then you have to pass the variable names as a character vector and set with to FALSE:
> hflights_dt[Dest == 'BNA', c('DepTime', 'ArrTime'), with = FALSE]
Now that we are more or less familiar with our options for filtering data inside a live R session, and we know the overall syntax of the dplyr and data.table packages, let's see how these can be used to aggregate and summarize data in action.