Piping or Chaining

Chaining, also called piping, is a concept that helps greatly when working with data. The usual way to perform multiple operations in one line is to nest function calls inside one another.

As an example, we will look at the data provided in the gapminder package:

library(gapminder)
library(dplyr)  # provides select(), filter() and the %>% operator used below
head(gapminder)
## # A tibble: 6 × 6
##       country continent  year lifeExp      pop gdpPercap
##        <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia  1977  38.438 14880372  786.1134

Let’s say that we want the GDP per capita and life expectancy for Kenya. Traditionally, we could do this in a nested manner:

filter(select(gapminder, country, lifeExp, gdpPercap), country=="Kenya")

It is not easy to see exactly what this code is doing, but we can write it in a manner that follows our logic much better. The code below shows how to do this with chaining.

gapminder %>%
    select(country, lifeExp, gdpPercap) %>%
    filter(country=="Kenya")

We now have something that is much clearer to read. Here is what our chaining command says:

1. Take the gapminder data

2. Select the variables: country, lifeExp and gdpPercap.

3. Only keep information from Kenya.

The nested code says the same thing, but it is hard to see what is going on if you have not been coding for very long. The result of this command is below:

## # A tibble: 12 × 3
##    country lifeExp gdpPercap
##     <fctr>   <dbl>     <dbl>
## 1    Kenya  42.270  853.5409
## 2    Kenya  44.686  944.4383
## 3    Kenya  47.949  896.9664
## 4    Kenya  50.654 1056.7365
## 5    Kenya  53.559 1222.3600
## 6    Kenya  56.155 1267.6132
## 7    Kenya  58.766 1348.2258
## 8    Kenya  59.339 1361.9369
## 9    Kenya  59.285 1341.9217
## 10   Kenya  54.407 1360.4850
## 11   Kenya  50.992 1287.5147
## 12   Kenya  54.110 1463.2493
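
As a quick check that the two styles really do the same work, here is a minimal sketch. It assumes the gapminder and dplyr packages are installed; the object names nested and chained are illustrative only.

```r
library(gapminder)
library(dplyr)

# Nested form: read from the inside out
nested <- filter(select(gapminder, country, lifeExp, gdpPercap), country == "Kenya")

# Chained form: read from top to bottom
chained <- gapminder %>%
    select(country, lifeExp, gdpPercap) %>%
    filter(country == "Kenya")

# Compare the two results and count the rows kept for Kenya
identical(nested, chained)
nrow(chained)
```

Storing each result in its own object makes the comparison explicit; in everyday work only one of the two forms would be written.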

What is %>%?

In the previous code we used %>% in the command. You can think of this operator as saying "then". For example:

gapminder %>%
    select(country, lifeExp, gdpPercap) %>%
    filter(country=="Kenya")

This translates to:

Take gapminder, then select the columns country, lifeExp and gdpPercap, then filter the rows so that we only keep Kenya.
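
To make the "then" reading concrete, the following sketch shows how %>% hands the value on its left to the function on its right as the first argument. It assumes the magrittr package is installed; the variable names are illustrative only.

```r
library(magrittr)  # provides the %>% operator

x <- c(4, 9, 16)

# x %>% f() is evaluated as f(x)
piped  <- x %>% sqrt()
nested <- sqrt(x)

# x %>% f(y) is evaluated as f(x, y):
# the left-hand side becomes the first argument
piped_round  <- 3.14159 %>% round(2)
nested_round <- round(3.14159, 2)

# Both comparisons should agree
identical(piped, nested)
identical(piped_round, nested_round)
```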

Why Chain?

We still might ask why we would want to do this. Chaining increases readability significantly when there are many commands, because it replaces deeply nested function calls with a sequence of steps. Packages such as dplyr import the chaining operator %>% from the magrittr package and make it available automatically.

User Defined Function

Let’s say that we wish to find the Euclidean distance between two vectors, say x1 and x2. We could use the math formula:

$$\sqrt{\mathrm{sum}\left((x_1 - x_2)^2\right)}$$

In the nested manner this would be:

x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))

However, if we chain this, the code follows the order in which we would perform the calculation mathematically.

# chaining method
(x1-x2)^2 %>% sum() %>% sqrt()

If we did it by hand, we would subtract x2 from x1 elementwise, then square each difference, then sum the squared values, and finally take the square root of the sum.

# chaining method
(x1-x2)^2 %>% sum() %>% sqrt()
## [1] 2.236068

Many of us have been performing calculations step by step like this for years, so chaining really is more natural for us.
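
The chained calculation can also be wrapped in a user-defined function. The sketch below assumes the magrittr package is installed; the function name euclidean_distance and the length check are illustrative additions, not part of the original example.

```r
library(magrittr)  # provides the %>% operator

# Hypothetical helper: Euclidean distance between two equal-length numeric vectors
euclidean_distance <- function(x1, x2) {
    stopifnot(length(x1) == length(x2))
    # The outer parentheses are optional because ^ binds tighter than %>%,
    # but they make the grouping explicit
    ((x1 - x2)^2) %>% sum() %>% sqrt()
}

x1 <- 1:5; x2 <- 2:6
euclidean_distance(x1, x2)
```

With the vectors from the example above, the function returns the same value as the chained one-liner.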