What’s in [a] purrr? – Gavin Masterson

I have experimented with functional programming in R over the years, yet I’ve never given the paradigm sufficient time to feel comfortable in its presence¹.

In this series of posts I will tackle my mental blocks by working on increasingly complex applications of functional programming. The path will meander a bit because this is essentially a journey of discovery for me.

Note: I will not be spending any time working with the apply family of functions in this series. I have used them successfully and I recommend that you read their documentation to understand how they resemble, and differ from, the map* family of functions in purrr.

How Hard is it to Purrr?

The more I use R, the more I grow to appreciate the ‘ecosystem’ in which a package operates. This is particularly true of the ‘Tidyverse’ experience, where the core packages are carefully curated to do one or a few things very well. According to the Tidyverse website, the eight core packages are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats.

I’ve spent a lot of time working with ggplot2, dplyr and stringr, and I’m still learning new tricks with each of them. So how does purrr compare to the other seven core packages in terms of size and/or complexity?

To answer this question, I’m going to start by comparing the namespaces of the eight core Tidyverse packages.

One very quick (and dirty) way that I use to get a feel for a package is the getNamespaceExports function. As the function name suggests, it returns a character vector containing the names of all the objects available to me in a package’s namespace.

Let me demonstrate this quickly by listing the first six exported functions from the utils package:

getNamespaceExports("utils") |> # Bonus: This is the base R pipe (v 4.1+)
    head(6)

[1] "aspell_package_Rd_files" "vi"                     
[3] "read.table"              "URLdecode"              
[5] "rc.status"               "write.csv"

Tangent Time

Did you know that you could call vi (the Unix text editor) from within the R workspace?
I didn’t.

You can call vi inside R or RStudio if you are in a Unix OS (or using WSL - like me!):

vi("I learn new things every day!", file = "always-learning.txt")

But I digress…

Clearing the Throat

To investigate purrr, I will use my trusty Tidyverse tools - but this time I will be including purrr itself.

We can load the eight core packages of the Tidyverse with a single library call.

library(tidyverse)

The next step is to make a character vector containing the core package names so that I can iterate over them for different purposes.

tidyverse_core <- c("ggplot2", "dplyr", "tidyr", "readr", 
                    "purrr", "tibble", "stringr", "forcats")

We have already seen how to return a vector of the exported objects in the namespace of a package. Now I want to return all the exported objects from each of the eight core packages of the Tidyverse.

Non-vectorised Functions

In an ‘ideal world’², my trusty getNamespaceExports function would take a vector of inputs and return a well-curated object that contains the information I want. Let’s cross our fingers and see if we get lucky:

getNamespaceExports(tidyverse_core) %>%
    length()

[1] 533

While the getNamespaceExports function appears to work with a vector, the sad truth is that it has only returned the objects for the first name in my tidyverse_core vector, which is ggplot2³.
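One way to check this behaviour without relying on the printed length is to compare the result for a whole vector against the result for its first element alone. The sketch below is only an illustration: it uses two packages that ship with R (stats and utils) as stand-ins for the Tidyverse names, and the variable names pkgs, from_vector, from_first and first_only are made up for the example.

```r
# Two packages that ship with R, used here as stand-ins for tidyverse_core
pkgs <- c("stats", "utils")

# Exports returned when the whole vector is supplied
from_vector <- getNamespaceExports(pkgs)

# Exports returned when only the first name is supplied
from_first <- getNamespaceExports(pkgs[1])

# Compare the two sets of names, ignoring order
first_only <- setequal(from_vector, from_first)
first_only
```

If first_only is TRUE, the call on the vector returned nothing from the second package, which matches what happened with tidyverse_core above.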

In Purrr-suit of Answers

Now I know I have to use iteration to step through each name in my tidyverse_core vector. I can either write a very careful for loop and assign each step’s output to a pre-allocated output of my choice, or I can learn to tap the power of the ‘ready-to-use’ for loops in the purrr::map* functions.
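For comparison, the ‘careful for loop’ route might look something like the sketch below. It is a rough illustration rather than a recommended pattern, and it assumes three packages that ship with R (stats, utils and tools) in place of the Tidyverse names so that it runs on a bare R installation; pkgs and exports are hypothetical names.

```r
# Stand-in package names that ship with R; swap in tidyverse_core if it is installed
pkgs <- c("stats", "utils", "tools")

# Pre-allocate a list with one empty slot per package
exports <- vector("list", length(pkgs))

# Step through the names and store each package's exports in its own slot
for (i in seq_along(pkgs)) {
    exports[[i]] <- getNamespaceExports(pkgs[[i]])
}

# Inspect the structure of the filled list
str(exports)
```

The pre-allocation, the index bookkeeping and the assignment inside the loop are the parts that a map call takes care of on my behalf.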

For clarity, I will write the purrr:: namespace prefix before every purrr function that I use in each chunk.

First I will use purrr::map to call the getNamespaceExports function on each element of the tidyverse_core vector and then print the str (structure) of the object that map returns:

tidyverse_core %>%
    purrr::map(getNamespaceExports) %>%
    str()

List of 8
 $ : chr [1:533] "draw_key_vpath" "StatDensity2dFilled" "find_panel" "stat_density2d_filled" ...
 $ : chr [1:288] "rows_upsert" "src_local" "db_analyze" "n_groups" ...
 $ : chr [1:65] "complete" "tribble" "pivot_wider" "full_seq" ...
 $ : chr [1:115] "read_log" "read_fwf" "read_tsv" "spec_csv2" ...
 $ : chr [1:189] "pmap_chr" "invoke_map_df" "as_vector" "is_vector" ...
 $ : chr [1:47] "set_tidy_names" "lst" "size_sum" "deframe" ...
 $ : chr [1:59] "str_glue_data" "str_replace_na" "str_to_upper" "str_order" ...
 $ : chr [1:37] "fct_match" "fct_inseq" "fct_inorder" "first2" ...

Excellent! I know that my function call has worked because I got back a list containing eight vectors of object names.

This doesn’t give me a tidy comparison of the relative sizes of the eight packages, though, so I need to perform another iteration. This time I want to call the length function on each vector in the list that map(getNamespaceExports) creates.

tidyverse_core %>%
    purrr::map(getNamespaceExports) %>%
    purrr::map(length) %>%
    str()

List of 8
 $ : int 533
 $ : int 288
 $ : int 65
 $ : int 115
 $ : int 189
 $ : int 47
 $ : int 59
 $ : int 37

This is looking good, but I can’t be certain which number relates to which package name. I would prefer to output a dataframe that contains a column of the package names and a column listing the number of exported objects.

What happens if I pass the list output as a column to the tibble function?

tidyverse_core %>% 
    purrr::map(getNamespaceExports) %>% 
    purrr::map(length) %>%
    tibble(
        package = tidyverse_core,
        num_exports = .)

# A tibble: 8 × 2
  package num_exports
  <chr>   <list>     
1 ggplot2 <int [1]>  
2 dplyr   <int [1]>  
3 tidyr   <int [1]>  
4 readr   <int [1]>  
5 purrr   <int [1]>  
6 tibble  <int [1]>  
7 stringr <int [1]>  
8 forcats <int [1]>

Well, that’s not great. The tibble I wanted is printed, but the values returned by map(length) are hidden from view. This is because map always returns a list, and a tibble stores a list as a list column (tibbles support nesting). I need to end up with a numeric vector to pass to the num_exports column.

Luckily I can do this with the map_dbl function. The documentation tells me that map_dbl works exactly like map, except that it returns a numeric (double) vector instead of a list. The other map_* functions work the same way, each returning a vector of the type named in the function suffix. So map_chr returns a character vector, and map_lgl returns a logical vector.
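To make the suffix rule concrete, here is a small sketch on a toy list. The list x and the result names are invented for illustration, and the only assumption is that purrr is installed.

```r
library(purrr)

# A toy list whose elements differ in type and length
x <- list(1:3, letters, c(TRUE, FALSE))

# Each variant returns an atomic vector of the type named in its suffix
lens_dbl <- purrr::map_dbl(x, length)        # double vector of lengths
lens_int <- purrr::map_int(x, length)        # integer vector of lengths
classes  <- purrr::map_chr(x, class)         # character vector of class names
is_chr   <- purrr::map_lgl(x, is.character)  # logical vector

str(list(lens_dbl, lens_int, classes, is_chr))
```

Because length returns an integer, map_int would also work here as a stricter choice; map_dbl simply converts those integers to doubles.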

With this change to map_dbl:

tidyverse_core %>% 
    purrr::map(getNamespaceExports) %>% 
    purrr::map_dbl(length) %>%
    tibble(
        package = tidyverse_core,
        num_exports = .)

# A tibble: 8 × 2
  package num_exports
  <chr>         <dbl>
1 ggplot2         533
2 dplyr           288
3 tidyr            65
4 readr           115
5 purrr           189
6 tibble           47
7 stringr          59
8 forcats          37

Success! Well - almost. I’d like to order the rows for readability so the last thing I do below is add an arrange call on the dataframe.

tidyverse_core %>% 
    purrr::map(getNamespaceExports) %>% 
    purrr::map_dbl(length) %>%
    tibble(
        package = tidyverse_core,
        num_exports = .
    ) %>%
    arrange(desc(num_exports))

Before we look at the final output, let’s mentally review each line of this chunk.

Using the purrr::map function, I call the getNamespaceExports function on each element of the tidyverse_core vector.
I want to know the length of each of the eight returned vectors from my purrr::map call. These are contained in a list, so I use purrr::map_dbl to call the length function on each vector and return a numeric vector.
Printing a list makes for messy output, so I use tibble to create a data.frame with two columns named package and num_exports.
Lastly, the arrange(desc(num_exports)) call reorders the data so that the packages are listed in descending order according to the num_exports column.

… which produces this concise summary:

# A tibble: 8 × 2
  package num_exports
  <chr>         <dbl>
1 ggplot2         533
2 dplyr           288
3 purrr           189
4 readr           115
5 tidyr            65
6 stringr          59
7 tibble           47
8 forcats          37

As you might have predicted, ggplot2 exports the most objects of the eight core Tidyverse packages, followed by dplyr and then purrr.

I must point out that not all of these objects will be functions that we will need to learn to use. Some of them are supporting functions that are used by the other purrr functions e.g., as_mapper. Nevertheless, there are enough functions in purrr to keep us entertained for a long while yet.
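As a rough way to see what kinds of objects sit behind those names, the sketch below fetches each export with getExportedValue and tests whether it is a function. It assumes purrr is installed; pkg, exports and is_fun are names made up for this example. Whether a function is one I ‘need to learn’ is a judgement call that no code can make for me.

```r
library(purrr)

# Package to inspect; any installed package name could be used instead
pkg <- "purrr"
exports <- getNamespaceExports(pkg)

# For each exported name, fetch the object and test whether it is a function
is_fun <- purrr::map_lgl(exports, ~ is.function(getExportedValue(pkg, .x)))

# Tabulate functions (TRUE) against any other exported objects (FALSE)
table(is_fun)
```

Running the same chunk with pkg set to "ggplot2" is a useful contrast, because names such as StatDensity2dFilled in the earlier output look like objects rather than ordinary functions.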

Sidenote: If anyone tries to make you feel guilty about ‘Googling’ your way to solutions with ggplot2, ask them to name all 533 objects exported by ggplot2 from memory.

A Feeling of Purrr-fection

When I think about all the things that are happening within the piped code sequence above, I feel incredibly satisfied with how little code I had to write to achieve it.

Using a vector, we generated a very large list of eight vectors with different lengths and then summarised the list elements with a single line of code. ‘Prettifying’ the final output took as many lines as the computation.

That is the beauty of purrr and functional programming. When I use purrr, I save myself all the time of writing the equivalent for loops that would achieve the same result. If I had to code my own for loops here, I’d probably still be writing this blog post.

Footnotes

¹ https://twitter.com/gavinprm/status/1512156623636209667
² An ‘ideal world’ is the one in which someone else has written a function that exactly matches my use case so that I can do exactly what I want with minimal effort.
³ You can confirm this by comparing the result with the output of getNamespaceExports("ggplot2").