Data Sharing in Distributed and Polyglot Settings
I have some data to share with you.
```python
# python -m pip install pins
import pins

base_url = "https://gavinmasterson.com/pins"
pin_paths = {
    "snake_detections": "snake_detections/20240730T195415Z-6c718/",
    "snake_top_five": "snake_top_five/20240730T195447Z-2039f/",
}
board = pins.board_url(base_url, pin_paths)
board.pin_list()

# Investigate individual data sets
board.pin_meta("snake_detections")
board.pin_read("snake_detections")
```
Our human experience is defined by our connections…
Our connections ground us, teach us, and enable us.
A single computer is powerful.
A network of computers is unlimited.
Data on a single computer is useful.
Data shared throughout a network can change everything.
“R or Python” is a consistent search query:
Apparently a ‘high stakes’ decision in Singapore:
Playwright => Visualise with ggplot2
tidymodels => Dashboard with Streamlit

In this presentation, I will demonstrate a workflow to:
You can run this workflow locally, with the same data that I have used.
In my website project, I create a directory named ‘pins’, and then:
board_folder() creates a board inside a folder. You can use this to share files via a folder on a shared network drive or inside a Dropbox.
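As a minimal sketch of this step, the board could be created like so (the path matches the ‘pins’ directory above; passing versioned = TRUE explicitly is my assumption, to make the versioning behaviour obvious):

```r
library(pins)

# Create (or connect to) a versioned board inside the 'pins' directory
board <- board_folder("pins", versioned = TRUE)
```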
If we add a new field to the data, we might break code that uses the previous version of snake_detections in our pin board. This is where the versioned argument shines.
Let us add the observer_id field to our data frame:
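A sketch of this step, assuming dplyr is loaded; the observer_id value here is a made-up placeholder, not from the real dataset:

```r
library(dplyr)

# Add a (hypothetical) observer_id field to the detections data
snake_data <- snake_data |>
  mutate(observer_id = "GM-001")  # placeholder value for illustration
```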
Now we write the new data frame as a new version of the snake_detections pin:
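Assuming the board object from earlier, the write could look like this; I use type = "json" because the data is stored in json format on this board. Writing under an existing name creates a new version on a versioned board:

```r
library(pins)

# Same name as before => a new version of the existing pin
board |>
  pin_write(snake_data, name = "snake_detections", type = "json")
```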
What if consumers want just the top five most frequently detected species from the snake_detections pin? For example, the dataset may be very large, or you may want to share a training dataset for a model. A version of snake_detections with fewer species would cause confusion and break some users’ code, so we create a new pin instead. Let us subset the full dataset:
```r
library(dplyr)

snakes_top_5 <-
  snake_data |>
  count(common_name, sort = TRUE) |>
  head(5) |>
  semi_join(x = snake_data, y = _, by = "common_name")
```
Note
We pass the modified snake_data object to the y argument of semi_join() using the _ placeholder for R’s base pipe (the near-equivalent of . with magrittr’s %>%). If we do not, the semi_join() step does nothing.
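To illustrate the two placeholders side by side, here is the same step written with each pipe (the magrittr version assumes that package is loaded; the base `_` placeholder requires R >= 4.2 and must be passed to a named argument):

```r
# Base R pipe: `_` stands in for the piped value
snake_data |>
  count(common_name, sort = TRUE) |>
  head(5) |>
  semi_join(x = snake_data, y = _, by = "common_name")

# magrittr pipe: `.` plays the same role
snake_data %>%
  count(common_name, sort = TRUE) %>%
  head(5) %>%
  semi_join(x = snake_data, y = ., by = "common_name")
```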
Now we can write snakes_top_5 to our board as a new pin:
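Assuming the board object from earlier, this is a single pin_write() call; the pin name snake_top_five matches the name used on the live board:

```r
library(pins)

# A new name => a new pin, rather than a new version
board |>
  pin_write(snakes_top_5, name = "snake_top_five", type = "json")
```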
We have used pin_write() three times, to pin:
- snake_detections, with all the data.
- A new version of snake_detections that includes an observer_id.
- A subset of the snake_detections data, which we named snake_top_five.
To view the state of the board, we can use the following code:
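The original code is not reproduced in these notes; one way to inspect the board, as a sketch assuming the board object from earlier, is:

```r
library(pins)

pin_list(board)                          # names of all pins on the board
pin_versions(board, "snake_detections")  # versions of a single pin
```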
Next, we can add a manifest file to our board. The manifest is a yaml file that helps users navigate our board. From what I can tell, navigating the folder does not depend on the manifest file, so this part of the workflow is a matter of personal preference.
A board manifest file records all the pins, along with their versions, stored on a board. This can be useful for a board built using, for example, board_folder() or board_s3(), then served as a website, such that others can consume using board_url(). The manifest file is not versioned like a pin is, and this function will overwrite any existing _pins.yaml file on your board. It is your responsibility as the user to keep the manifest up to date (emphasis added).
In R, we can generate a manifest for our board using:
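The call itself, assuming the board object from earlier, is:

```r
library(pins)

write_board_manifest(board)
```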
The function outputs a _pins.yaml file to the root of our board folder. The manifest file for our board looks like this:
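The manifest itself is not reproduced in these notes; as a sketch of its structure, using the version paths from the board_url() code shown earlier (a real board may list additional versions per pin):

```yaml
snake_detections:
- snake_detections/20240730T195415Z-6c718/
snake_top_five:
- snake_top_five/20240730T195447Z-2039f/
```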
Note
There is no Python equivalent of write_board_manifest() at present. If desired, the file can be created manually to match the structure of _pins.yaml shown previously.
The workflow I have demonstrated here is the exact process I used to share the snake-detections.csv data via my website, in the form of two versioned {pins} objects.
board_url()
Here is the same code we used at the start of the presentation:
```python
# python -m pip install pins
import pins

base_url = "https://gavinmasterson.com/pins"
pin_paths = {
    "snake_detections": "snake_detections/20240730T195415Z-6c718/",
    "snake_top_five": "snake_top_five/20240730T195447Z-2039f/",
}
board = pins.board_url(base_url, pin_paths)
board.pin_list()

# Investigate individual data sets
board.pin_meta("snake_detections")
board.pin_read("snake_detections")
```
There are many other board_* functions to use. Here are some of the functions that create pin boards using cloud services/folders:
- board_azure
- board_gcs
- board_gdrive
- board_ms365
- board_s3
In this demonstration, I have stored csv data in json format. There are many other formats to choose from. Files can be stored as one of:
- csv
- json
- parquet
- arrow
- rds or qs (R binary formats)
- joblib (Python binary format, via the joblib module)

Custom data formats (not listed above) can also be pinned to boards.
To write custom file types to a board, use pin_upload().
To get a custom file from a board, use pin_download().
Using {pins} for these files allows you to integrate them into your data science workflows.
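As a sketch, assuming the board object from earlier (the file name here is purely illustrative):

```r
library(pins)

# Upload an arbitrary file as-is; it is stored without being re-serialised
board |>
  pin_upload("snake-photo.jpg", name = "snake_photo")

# pin_download() returns the local path(s) to the cached file(s)
paths <- board |>
  pin_download("snake_photo")
```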
To serve my pin board on my Quarto website, I have modified the header of my _quarto.yml file to look like this:
_quarto.yml
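The actual header is not reproduced in these notes; a minimal sketch of the relevant part, assuming a standard Quarto website project and the ‘pins’ directory created earlier:

```yaml
project:
  type: website
  resources:
    - pins
```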
The directories/files listed under the resources keyword are copied into my rendered website when I run quarto render.