- Mastering Data Analysis with R
- Gergely Daróczi
- 503字
- 2025-02-18 02:16:38
Merging datasets
Besides the previously described elementary actions on a single dataset, joining multiple data sources is one of the most used methods in everyday action. The most often used solution for such a task is to simply call the merge
S3 method, which can act as a traditional SQL inner and left/right/full outer joiner of operations—represented in a brief summary by C.L. Moffatt (2008) as follows:
data:image/s3,"s3://crabby-images/91c67/91c67b4b41e6708592c7f08370e85f5433c48365" alt="Merging datasets"
The dplyr
package provides some easy ways for doing the previously presented join operations right from R, in an easy way:
inner_join
: This joins the variables of all the rows, which are found in both datasetsleft_join
: This includes all the rows from the first dataset and join variables from the other tablesemi_join
: This includes only those rows from the first dataset that are found in the other one as wellanti_join
: This is similar tosemi_join
, but includes only those rows from the first dataset that are not found in the other one
These features are also supported by the mult
argument of [
operator of data.table
call, but for the time being, let's stick to the simpler use cases.
In the following example, we will merge a tiny dataset with the hflights
data. Let's create the data.frame
demo by assigning names to the possible values of the DayOfWeek
variable:
> (wdays <- data.frame( + DayOfWeek = 1:7, + DayOfWeekString = c("Sunday", "Monday", "Tuesday", + "Wednesday", "Thursday", "Friday", "Saturday") + )) DayOfWeek DayOfWeekString 1 1 Sunday 2 2 Monday 3 3 Tuesday 4 4 Wednesday 5 5 Thursday 6 6 Friday 7 7 Saturday
Let's see how we can left-join the previously defined data.frame
with another data.frame
and other tabular objects, as merge
also supports fast operations on, for example, data.table
:
> system.time(merge(hflights, wdays)) user system elapsed 0.700 0.000 0.699 > system.time(merge(hflights_dt, wdays, by = 'DayOfWeek')) user system elapsed 0.006 0.000 0.009
The prior example automatically merged the two tables via the DayOfWeek
variable, which was part of both datasets and resulted in an extra variable in the original hflights
dataset. However, we had to pass the variable name in the second example, as the by
argument of merge.data.table
defaults to the key variable of the object, which was missing then. One thing to note is that merging with data.table
was a lot faster than the traditional tabular object type.
A much simpler way of merging datasets is when you simply want to add new rows or columns to the dataset with the same structure. For this end, rbind
and cbind
, or rBind
and cBind
for sparse matrices, do a wonderful job.
One of the most often used functions along with these base commands is do.call
, which can execute the rbind
or cbind
call on all elements of a list
, thus enabling us, for example, to join a list of data frames. Such lists are usually created by lapply
or the related functions from the plyr
package. Similarly, rbindlist
can be called to merge a list
of data.table
objects in a much faster way.