For large datasets, it is sometimes convenient to explore them without downloading them locally. With Arrow, you can work with remote files stored in AWS S3 or Google Cloud Storage; however, it is not yet possible for files served over HTTPS (it is on the roadmap). DuckDB, on the other hand, with its “httpfs” extension, lets you query these Parquet files over the wire.
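For comparison, the Arrow approach for a file stored on S3 looks something like this (a minimal sketch; the bucket and path below are hypothetical):
library(arrow)
# Lazily open a remote Parquet file without downloading it first
penguins_s3 <- open_dataset("s3://a-public-bucket/penguins.parquet")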
You can even set things up so you can use dplyr verbs to work with these remote files. I will demonstrate this using a Parquet version of the penguins dataset hosted on my site.
Let’s start by loading the required packages:
library(DBI)
library(duckdb)
library(dplyr)
We are creating a con object to hold our DuckDB connection:
con <- dbConnect(duckdb::duckdb())
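By default, this creates a transient in-memory database. If you would rather persist it to disk, you can point dbConnect() at a file via the dbdir argument (the file name here is just an example):
con <- dbConnect(duckdb::duckdb(), dbdir = "penguins.duckdb")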
Let’s install (only needed once) and load the httpfs extension:
dbExecute(con, "INSTALL httpfs;")
dbExecute(con, "LOAD httpfs;")
At this point, we could use DuckDB’s SQL syntax to work with our remote dataset:
dbGetQuery(con,
  "SELECT species,
          AVG(bill_length_mm) AS avg_bill_length,
          AVG(bill_depth_mm) AS avg_bill_depth
   FROM PARQUET_SCAN('https://francoismichonneau.net/assets/data/penguins.parquet')
   GROUP BY species;")
# A tibble: 3 × 3
  species   avg_bill_length avg_bill_depth
  <chr>               <dbl>          <dbl>
1 Adelie               38.8           18.3
2 Gentoo               47.5           15.0
3 Chinstrap            48.8           18.4
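Because the “httpfs” extension reads the file with HTTP range requests, DuckDB only fetches the parts of the Parquet file a query actually needs. Selecting fewer columns or filtering rows therefore keeps the amount of data transferred small; for instance (the filter below is arbitrary, just to illustrate):
dbGetQuery(con,
  "SELECT species, island, body_mass_g
   FROM PARQUET_SCAN('https://francoismichonneau.net/assets/data/penguins.parquet')
   WHERE island = 'Dream'
   LIMIT 5;")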
However, you can also create a view of this remote file, which, in turn, will let you query it with dplyr:
dbExecute(con,
  "CREATE VIEW penguins AS
   SELECT * FROM PARQUET_SCAN('https://francoismichonneau.net/assets/data/penguins.parquet');")
You can check that it worked by running:
dbListTables(con)
[1] "penguins"
Now you can work with this remote dataset using dplyr:
tbl(con, "penguins") |>
group_by(species) |>
summarize(
avg_bill_length = mean(bill_length_mm),
avg_bill_depth = mean(bill_depth_mm)
)
# Source: SQL [3 x 3]
# Database: DuckDB 0.8.1 [francois@Linux 6.2.0-20-generic:R 4.3.0/:memory:]
  species   avg_bill_length avg_bill_depth
  <chr>               <dbl>          <dbl>
1 Adelie               38.8           18.3
2 Gentoo               47.5           15.0
3 Chinstrap            48.8           18.4
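If you are curious about the SQL that dplyr generates behind the scenes, you can append show_query() to a pipeline (and collect() will pull the results into a local tibble):
tbl(con, "penguins") |>
  group_by(species) |>
  summarize(avg_bill_length = mean(bill_length_mm, na.rm = TRUE)) |>
  show_query()
And once you are done, close the connection and shut the database down:
dbDisconnect(con, shutdown = TRUE)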