Learning to purrr (Part 1)
Gavin Masterson
May 2, 2022
I have experimented with functional programming in R over the years, yet I’ve never given the paradigm sufficient time to feel comfortable in its presence [1].
In this series of posts I will tackle my mental blocks by working on increasingly complex applications of functional programming. The path will meander a bit because this is essentially a journey of discovery for me.
Note: I will not be spending any time working with the apply family of functions in this series. I have used them successfully and I recommend that you read their documentation to understand how they are similar to, and different from, the map* family of functions in purrr.
The more I use R, the more I grow to appreciate the ‘ecosystem’ in which a package operates. This is particularly true of the ‘Tidyverse’ experience, where the core packages are carefully curated to do one or a few things very well. According to the Tidyverse website, the eight core packages are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats.
I’ve spent a lot of time working with ggplot2, dplyr and stringr, and I’m still learning new tricks with each of them. So how does purrr compare to the other seven core packages in terms of size and/or complexity?
To answer this question, I’m going to start by comparing the namespaces of the eight core Tidyverse packages.
One very quick (and dirty) way that I use to get a feel for a package is the getNamespaceExports function. As the function name suggests, it returns a character vector containing the names of all the objects available to me in a package’s namespace.
Let me demonstrate this quickly by listing the first six exported functions from the utils package:
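A call along the following lines produces output like that shown next. It is a sketch using the base R pipe; first_six is a hypothetical name, and the exact names returned (and their order) may differ between R versions.

```r
# getNamespaceExports() is part of base R, so no packages need to be attached.
first_six <- getNamespaceExports("utils") |> # Bonus: This is the base R pipe (v 4.1+)
  head(6)

first_six
```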
[1] "aspell_package_Rd_files" "vi"
[3] "read.table" "URLdecode"
[5] "rc.status" "write.csv"
Did you know that you could call vi (the Unix text editor) from within the R workspace?
I didn’t.
You can call vi inside R or RStudio if you are on a Unix OS (or using WSL, like me!).
But I digress…
To investigate purrr, I will use my trusty Tidyverse tools - but this time I will be including purrr itself.
We can load the eight core packages of the Tidyverse with a single library call.
The next step is to make a character vector containing the core package names so that I can iterate over them for different purposes.
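These two steps might look like the following sketch. The name tidyverse_core matches the vector used in the rest of the post, and the element order is taken from the output tables further down; the code assumes the tidyverse meta-package is installed.

```r
# Attach the core Tidyverse packages with a single call
# (assumption: the tidyverse meta-package is installed).
library(tidyverse)

# The eight core package names discussed in this post, as a character vector.
tidyverse_core <- c(
  "ggplot2", "dplyr", "tidyr", "readr",
  "purrr", "tibble", "stringr", "forcats"
)
```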
We have already seen how to return a vector of the exported objects in the namespace of a package. Now I want to return all the exported objects from each of the eight core packages of the Tidyverse.
In an ‘ideal world’ [2], my trusty getNamespaceExports function would take a vector of inputs and return a well-curated object that contains the information I want. Let’s cross our fingers and see if we get lucky:
While the getNamespaceExports function appears to work with a vector, the sad truth is that it has only returned the objects for the first name in my tidyverse_core vector, which is ggplot2 [3].
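One way to check this behaviour is sketched below. The names from_vector and from_first are hypothetical, and the code assumes ggplot2 is installed. If only the first element of the vector is used, as described above, the comparison should come back TRUE.

```r
tidyverse_core <- c(
  "ggplot2", "dplyr", "tidyr", "readr",
  "purrr", "tibble", "stringr", "forcats"
)

# Pass the whole vector, then pass only its first element.
from_vector <- getNamespaceExports(tidyverse_core)
from_first  <- getNamespaceExports("ggplot2")

# Compare the two results as sets, ignoring order.
setequal(from_vector, from_first)
```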
Now I know I have to use iteration to step through each name in my tidyverse_core vector. I can either write a very careful for loop and assign each step’s output to a pre-allocated output of my choice, or I can learn to tap the power of the ‘ready-to-use’ for loops in the purrr::map* functions.
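For comparison, the for loop route might look like the sketch below. The name exports_list is hypothetical, and the code assumes the eight packages are installed.

```r
tidyverse_core <- c(
  "ggplot2", "dplyr", "tidyr", "readr",
  "purrr", "tibble", "stringr", "forcats"
)

# Pre-allocate a list with one empty slot per package.
exports_list <- vector("list", length(tidyverse_core))

# Fill each slot with the exported object names of the matching package.
for (i in seq_along(tidyverse_core)) {
  exports_list[[i]] <- getNamespaceExports(tidyverse_core[[i]])
}

# Inspect the structure of the result.
str(exports_list)
```

The purrr::map* functions wrap this pre-allocate, loop and assign pattern in a single call.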
First I will use purrr::map to call the getNamespaceExports function on each element of the tidyverse_core vector and then print the str (structure) of the object that map returns:
For clarity, I will list the purrr:: namespace prefix before every purrr function that I use in each chunk.
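The map call described above can be sketched as follows, assuming the tidyverse is installed and tidyverse_core is defined as before. The counts in the output that follows reflect the package versions installed at the time of writing, so a fresh run may report different numbers.

```r
library(tidyverse)

tidyverse_core <- c(
  "ggplot2", "dplyr", "tidyr", "readr",
  "purrr", "tibble", "stringr", "forcats"
)

# Call getNamespaceExports() once per package name, then show the structure
# of the resulting list.
tidyverse_core %>%
  purrr::map(getNamespaceExports) %>%
  str()
```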
List of 8
$ : chr [1:533] "draw_key_vpath" "StatDensity2dFilled" "find_panel" "stat_density2d_filled" ...
$ : chr [1:288] "rows_upsert" "src_local" "db_analyze" "n_groups" ...
$ : chr [1:65] "complete" "tribble" "pivot_wider" "full_seq" ...
$ : chr [1:115] "read_log" "read_fwf" "read_tsv" "spec_csv2" ...
$ : chr [1:189] "pmap_chr" "invoke_map_df" "as_vector" "is_vector" ...
$ : chr [1:47] "set_tidy_names" "lst" "size_sum" "deframe" ...
$ : chr [1:59] "str_glue_data" "str_replace_na" "str_to_upper" "str_order" ...
$ : chr [1:37] "fct_match" "fct_inseq" "fct_inorder" "first2" ...
Excellent! I know that my function call has worked because I got back a list containing eight vectors of object names.
This doesn’t tell me the relative size of the eight packages though, so I need to perform another iteration. This time I want to call the length function on each vector in the list that map(getNamespaceExports) creates.
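Chaining a second map call onto the first might look like this sketch, under the same assumptions as before (tidyverse installed, tidyverse_core defined as earlier in the post).

```r
library(tidyverse)

tidyverse_core <- c(
  "ggplot2", "dplyr", "tidyr", "readr",
  "purrr", "tibble", "stringr", "forcats"
)

# First map: one character vector of export names per package.
# Second map: the length of each of those vectors.
tidyverse_core %>%
  purrr::map(getNamespaceExports) %>%
  purrr::map(length) %>%
  str()
```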
List of 8
$ : int 533
$ : int 288
$ : int 65
$ : int 115
$ : int 189
$ : int 47
$ : int 59
$ : int 37
This is looking good, but I can’t be certain which number relates to which package name. I would prefer to output a dataframe that contains a column of the package names and a column listing the number of exported objects.
What happens if I pass the list output as a column to the tibble function?
tidyverse_core %>%
  purrr::map(getNamespaceExports) %>%
  purrr::map(length) %>%
  tibble(
    package = tidyverse_core,
    num_exports = .)
# A tibble: 8 × 2
package num_exports
<chr> <list>
1 ggplot2 <int [1]>
2 dplyr <int [1]>
3 tidyr <int [1]>
4 readr <int [1]>
5 purrr <int [1]>
6 tibble <int [1]>
7 stringr <int [1]>
8 forcats <int [1]>
Well that’s not great. The tibble I wanted is printed but the values returned by map(length) are hidden from view. This is because map always returns a list and tibbles support list columns (or nesting), so each count is stored as a one-element entry in a list column. I need to end up with a numeric vector to pass to the num_exports column.
Luckily I can do this with the map_dbl function. The documentation tells me that map_dbl works exactly like map, except that map_dbl returns a numeric vector as output. The other map_* functions work the same way, returning a vector of the type named in the function suffix. So map_chr returns a character vector, and map_lgl returns a logical vector.
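A toy example may make the difference between the variants concrete. The input list words and the result names are hypothetical and used only for illustration; the code assumes purrr is installed.

```r
library(purrr)

# A small made-up input list.
words <- list("a", "bb", "ccc")

as_list <- map(words, nchar)                        # always returns a list
as_dbl  <- map_dbl(words, nchar)                    # returns a double vector
as_chr  <- map_chr(words, toupper)                  # returns a character vector
as_lgl  <- map_lgl(words, function(w) nchar(w) > 1) # returns a logical vector

str(as_list)
str(as_dbl)
str(as_chr)
str(as_lgl)
```

Each typed variant raises an error if the function returns something that does not fit the requested type, which is a useful safety check compared with plain map.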
With this change to map_dbl:
tidyverse_core %>%
  purrr::map(getNamespaceExports) %>%
  purrr::map_dbl(length) %>%
  tibble(
    package = tidyverse_core,
    num_exports = .)
# A tibble: 8 × 2
package num_exports
<chr> <dbl>
1 ggplot2 533
2 dplyr 288
3 tidyr 65
4 readr 115
5 purrr 189
6 tibble 47
7 stringr 59
8 forcats 37
Success! Well - almost. I’d like to order the rows for readability, so the last thing I do is add an arrange call on the dataframe.
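A sketch of what that final chunk might look like is given here. It is a reconstruction: it assumes the count column is named exports, as in the summary output further down, and export_counts is a hypothetical name for the result.

```r
library(tidyverse)

tidyverse_core <- c(
  "ggplot2", "dplyr", "tidyr", "readr",
  "purrr", "tibble", "stringr", "forcats"
)

export_counts <- tidyverse_core %>%
  purrr::map(getNamespaceExports) %>%   # list of export-name vectors
  purrr::map_dbl(length) %>%            # numeric vector of counts
  tibble(
    package = tidyverse_core,
    exports = .) %>%                    # two-column tibble
  arrange(desc(exports))                # largest package first

export_counts
```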
Before we look at the final output, let’s mentally review each line of this chunk.
1. Using the purrr::map function, I call the getNamespaceExports function on each element of the tidyverse_core vector.
2. The next line works on the vectors of object names returned by the purrr::map call. These are contained in a list, so I use purrr::map_dbl to call the length function on each vector and return a numeric vector.
3. I use tibble to create a data.frame with two columns named package and exports.
4. The arrange(desc(exports)) call reorders the data so that the packages are listed in descending order according to the exports column.
… which produces this concise summary:
# A tibble: 8 × 2
package exports
<chr> <dbl>
1 ggplot2 533
2 dplyr 288
3 purrr 189
4 readr 115
5 tidyr 65
6 stringr 59
7 tibble 47
8 forcats 37
As you might have predicted, ggplot2 exports the most objects of the eight core Tidyverse packages, followed by dplyr and then purrr.
I must point out that not all of these objects will be functions that we will need to learn to use. Some of them are supporting functions that are used by the other purrr functions, e.g. as_mapper. Nevertheless, there are enough functions in purrr to keep us entertained for a long while yet.
Sidenote: If anyone tries to make you feel guilty about ‘Googling’ your way to solutions with ggplot2, ask them to name all 533 objects exported by ggplot2 from memory.
When I think about all the things that are happening within the piped code sequence above, I feel incredibly satisfied with how little code I had to write to achieve it.
Using a vector, we generated a very large list of eight vectors with different lengths and then summarised the list elements with a single line of code. ‘Prettifying’ the final output took as many lines as the computation.
That is the beauty of purrr and functional programming. When I use purrr, I save myself all the time of writing the equivalent for loops that would achieve the same result. If I had to code my own for loops here, I’d probably still be writing this blog post.
[1] https://twitter.com/gavinprm/status/1512156623636209667
[2] An ‘ideal world’ is the one in which someone else has written a function that exactly matches my use case so that I can do exactly what I want with minimal effort.
[3] You can confirm this by comparing the result with the output of getNamespaceExports("ggplot2").