Data Sharing in Distributed and Polyglot Settings
I have some data to share with you.
```python
# python -m pip install pins
import pins

base_url = "https://gavinmasterson.com/pins"
pin_paths = {
    "snake_detections": "snake_detections/20240730T195415Z-6c718/",
    "snake_top_five": "snake_top_five/20240730T195447Z-2039f/",
}
board = pins.board_url(base_url, pin_paths)
board.pin_list()

# Investigate individual data sets
board.pin_meta("snake_detections")
board.pin_read("snake_detections")
```
Our human experience is defined by our connections…
Our connections ground us, teach us, and enable us.
A single computer is powerful.
A network of computers is unlimited.
Data on a single computer is useful.
Data shared throughout a network can change everything.
“R or Python” is a consistent search query:
Apparently a ‘high stakes’ decision in Singapore:
Playwright => Visualise with ggplot2
tidymodels => Dashboard with Streamlit

In this presentation, I will demonstrate a workflow to:
You can run this workflow locally, with the same data that I have used.
In my website project, I create a directory named ‘pins’, and then:
board_folder() creates a board inside a folder. You can use this to share files via a folder on a shared network drive or inside a Dropbox.
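As a minimal sketch of this step, the board could be created like so (the path matches the ‘pins’ directory above; passing versioned = TRUE explicitly is my assumption, to make the versioning behaviour obvious):

```r
library(pins)

# Create (or connect to) a versioned board inside the 'pins' directory
board <- board_folder("pins", versioned = TRUE)
```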
If we add a new field to the data, we might break code that uses the previous version of snake_detections in our pin board. This is where the versioned argument shines.
Let us add the observer_id field to our data frame:
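A sketch of this step, assuming dplyr is loaded; the observer_id value here is a made-up placeholder, not from the real dataset:

```r
library(dplyr)

# Add a (hypothetical) observer_id field to the detections data
snake_data <- snake_data |>
  mutate(observer_id = "GM-001")  # placeholder value for illustration
```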
Now we write the new data frame as a new version of the snake_detections pin:
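Assuming the board object from earlier, the write could look like this; I use type = "json" because the data is stored in json format on this board. Writing under an existing name creates a new version on a versioned board:

```r
library(pins)

# Same name as before => a new version of the existing pin
board |>
  pin_write(snake_data, name = "snake_detections", type = "json")
```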
What if consumers want just the top five most frequently detected species from the snake_detections pin? For example, the dataset may be very large, or you may want to share a training dataset for a model. A version of snake_detections with fewer species would cause confusion and break some users’ code, so we create a new pin instead. Let us subset the full dataset:
```r
library(dplyr)

snakes_top_5 <-
  snake_data |>
  count(common_name, sort = TRUE) |>
  head(5) |>
  semi_join(x = snake_data, y = _, by = "common_name")
```
Note
We pass the modified snake_data object to the y argument of semi_join() using the _ placeholder for R’s base pipe (the near-equivalent of . with magrittr’s %>%). If we do not, the semi_join() step does nothing.
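To illustrate the two placeholders side by side, here is the same step written with each pipe (the magrittr version assumes that package is loaded; the base `_` placeholder requires R >= 4.2 and must be passed to a named argument):

```r
# Base R pipe: `_` stands in for the piped value
snake_data |>
  count(common_name, sort = TRUE) |>
  head(5) |>
  semi_join(x = snake_data, y = _, by = "common_name")

# magrittr pipe: `.` plays the same role
snake_data %>%
  count(common_name, sort = TRUE) %>%
  head(5) %>%
  semi_join(x = snake_data, y = ., by = "common_name")
```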
Now we can write snakes_top_5 to our board as a new pin:
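Assuming the board object from earlier, this is a single pin_write() call; the pin name snake_top_five matches the name used on the live board:

```r
library(pins)

# A new name => a new pin, rather than a new version
board |>
  pin_write(snakes_top_5, name = "snake_top_five", type = "json")
```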
We have used pin_write() three times, to pin:
- snake_detections, with all the data.
- A new version of snake_detections that includes an observer_id.
- A subset of the snake_detections data, which we named snake_top_five.
To view the state of the board, we can use the following code:
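The original code is not reproduced in these notes; one way to inspect the board, as a sketch assuming the board object from earlier, is:

```r
library(pins)

pin_list(board)                          # names of all pins on the board
pin_versions(board, "snake_detections")  # versions of a single pin
```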
Next, we can add a manifest file to our board. The manifest is a yaml file that helps users navigate our board. From what I can tell, navigating the folder does not depend on the manifest file, so this part of the workflow is a matter of personal preference.
A board manifest file records all the pins, along with their versions, stored on a board. This can be useful for a board built using, for example, board_folder() or board_s3(), then served as a website, such that others can consume using board_url(). The manifest file is not versioned like a pin is, and this function will overwrite any existing _pins.yaml file on your board. It is your responsibility as the user to keep the manifest up to date (emphasis added).
In R, we can generate a manifest for our board using:
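The call itself, assuming the board object from earlier, is:

```r
library(pins)

write_board_manifest(board)
```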
The function outputs a _pins.yaml file to the root of our board folder. The manifest file for our board looks like this:
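The manifest itself is not reproduced in these notes; as a sketch of its structure, using the version paths from the board_url() code shown earlier (a real board may list additional versions per pin):

```yaml
snake_detections:
- snake_detections/20240730T195415Z-6c718/
snake_top_five:
- snake_top_five/20240730T195447Z-2039f/
```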
Note
There is no Python equivalent of write_board_manifest() at present. If desired, the file can be created manually to match the structure of _pins.yaml shown previously.
The workflow I have demonstrated here is the exact process I used to share the snake-detections.csv data via my website, in the form of two versioned {pins} objects.
board_url()
Here is the same code we used at the start of the presentation:
```python
# python -m pip install pins
import pins

base_url = "https://gavinmasterson.com/pins"
pin_paths = {
    "snake_detections": "snake_detections/20240730T195415Z-6c718/",
    "snake_top_five": "snake_top_five/20240730T195447Z-2039f/",
}
board = pins.board_url(base_url, pin_paths)
board.pin_list()

# Investigate individual data sets
board.pin_meta("snake_detections")
board.pin_read("snake_detections")
```
There are many other board_* functions to use. Here are some of the functions that create pin boards using cloud services/folders:
- board_azure
- board_gcs
- board_gdrive
- board_ms365
- board_s3
In this demonstration, I have stored csv data in json format. There are many other formats to choose from. Files can be stored as one of:
- csv
- json
- parquet
- arrow
- rds or qs (R binary formats)
- joblib (Python binary format, via the joblib module)

Custom data formats (not listed above) can also be pinned to boards.
To write custom file types to a board, use pin_upload().
To get a custom file from a board, use pin_download().
Using {pins} for these files allows you to integrate them into your data science workflows.
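As a sketch, assuming the board object from earlier (the file name here is purely illustrative):

```r
library(pins)

# Upload an arbitrary file as-is; it is stored without being re-serialised
board |>
  pin_upload("snake-photo.jpg", name = "snake_photo")

# pin_download() returns the local path(s) to the cached file(s)
paths <- board |>
  pin_download("snake_photo")
```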
To serve my pin board on my Quarto website, I have modified the header of my _quarto.yml file to look like this:
_quarto.yml
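The actual header is not reproduced in these notes; a minimal sketch of the relevant part, assuming a standard Quarto website project and the ‘pins’ directory created earlier:

```yaml
project:
  type: website
  resources:
    - pins
```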
The directories/files listed under the resources keyword are copied into my rendered website when I run quarto render.