lazysf

February 18, 2026 · View on GitHub

CRAN
status

lazysf provides a dplyr backend for any vector data source readable by GDAL. It uses gdalraster::GDALVector to talk to GDAL and dbplyr for SQL translation, giving you lazy evaluation of spatial data through familiar dplyr verbs.

Vector data sources — files, URLs, databases, cloud storage — are accessed through a DBI connection. Nothing is read into memory until you call collect().

Quick start

library(lazysf)
library(dplyr)

f <- system.file("extdata/nc.gpkg", package = "lazysf", mustWork = TRUE)
lf <- lazysf(f)
lf
#> Warning in new_CppObject_xp(fields$.module, fields$.pointer, ...): SpatiaLite
#> is not available
#> # Source:   table<"nc"> [?? x 16]
#> # Database: GDAL <SQLITE> WKB [/perm_storage/home/mdsumner/R/x86_64-pc-linux-g...]
#>        FID  AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74
#>    <int64> <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>
#>  1       1 0.114     1.442  1825    1825 Ashe  37009  37009        5  1091     1
#>  2       2 0.061     1.231  1827    1827 Alle… 37005  37005        3   487     0
#>  3       3 0.143     1.63   1828    1828 Surry 37171  37171       86  3188     5
#>  4       4 0.07      2.968  1831    1831 Curr… 37053  37053       27   508     1
#>  5       5 0.153     2.206  1832    1832 Nort… 37131  37131       66  1421     9
#>  6       6 0.097     1.67   1833    1833 Hert… 37091  37091       46  1452     7
#>  7       7 0.062     1.547  1834    1834 Camd… 37029  37029       15   286     0
#>  8       8 0.091     1.284  1835    1835 Gates 37073  37073       37   420     0
#>  9       9 0.118     1.421  1836    1836 Warr… 37185  37185       93   968     4
#> 10      10 0.124     1.428  1837    1837 Stok… 37169  37169       85  1612     1
#> # ℹ more rows
#> # ℹ 5 more variables: NWBIR74 <dbl>, BIR79 <dbl>, SID79 <dbl>, NWBIR79 <dbl>,
#> #   geom <wk_wkb>

Standard dplyr verbs generate SQL that GDAL executes:

lf |>
  filter(AREA < 0.1) |>
  select(NAME, AREA, geom) |>
  arrange(AREA)
#> Warning in new_CppObject_xp(fields$.module, fields$.pointer, ...): SpatiaLite
#> is not available
#> # Source:     SQL [?? x 3]
#> # Database:   GDAL <SQLITE> WKB [/perm_storage/home/mdsumner/R/x86_64-pc-linux-g...]
#> # Ordered by: AREA
#>        FID NAME         AREA geom                                               
#>    <int64> <chr>       <dbl> <wk_wkb>                                           
#>  1       0 New Hanover 0.042 <MULTIPOLYGON (((-77.96073 34.18924, -77.96587 34.…
#>  2       1 Chowan      0.044 <MULTIPOLYGON (((-76.68874 36.29452, -76.64822 36.…
#>  3       2 Clay        0.051 <MULTIPOLYGON (((-83.938 34.98939, -83.98855 34.98…
#>  4       3 Pasquotank  0.053 <MULTIPOLYGON (((-76.29893 36.21423, -76.32423 36.…
#>  5       4 Mitchell    0.059 <MULTIPOLYGON (((-82.11885 35.81853, -82.14665 35.…
#>  6       5 Polk        0.06  <MULTIPOLYGON (((-82.21017 35.19313, -82.27833 35.…
#>  7       6 Alleghany   0.061 <MULTIPOLYGON (((-81.23989 36.36536, -81.24069 36.…
#>  8       7 Camden      0.062 <MULTIPOLYGON (((-76.00897 36.3196, -75.95718 36.1…
#>  9       8 Perquimans  0.063 <MULTIPOLYGON (((-76.48053 36.07979, -76.53696 36.…
#> 10       9 Avery       0.064 <MULTIPOLYGON (((-81.94135 35.95498, -81.9614 35.9…
#> # ℹ more rows

Use show_query() to see the SQL:

lf |>
  filter(AREA < 0.1) |>
  select(NAME, AREA, geom) |>
  arrange(AREA) |>
  show_query()
#> <SQL>
#> SELECT "NAME", "AREA", "geom"
#> FROM "nc"
#> WHERE ("AREA" < 0.1)
#> ORDER BY "AREA"

Use collect() to pull data into memory:

d <- lf |>
  filter(AREA < 0.1) |>
  select(NAME, AREA, geom) |>
  collect()
#> Warning in new_CppObject_xp(fields$.module, fields$.pointer, ...): SpatiaLite
#> is not available
d
#> # A tibble: 34 × 4
#>        FID NAME        AREA geom                                                
#>    <int64> <chr>      <dbl> <wk_wkb>                                            
#>  1       0 Alleghany  0.061 <MULTIPOLYGON (((-81.23989 36.36536, -81.24069 36.3…
#>  2       1 Currituck  0.07  <MULTIPOLYGON (((-76.00897 36.3196, -76.01735 36.33…
#>  3       2 Hertford   0.097 <MULTIPOLYGON (((-76.74506 36.23392, -76.98069 36.2…
#>  4       3 Camden     0.062 <MULTIPOLYGON (((-76.00897 36.3196, -75.95718 36.19…
#>  5       4 Gates      0.091 <MULTIPOLYGON (((-76.56251 36.34057, -76.60424 36.3…
#>  6       5 Vance      0.072 <MULTIPOLYGON (((-78.49252 36.17359, -78.51472 36.1…
#>  7       6 Pasquotank 0.053 <MULTIPOLYGON (((-76.29893 36.21423, -76.32423 36.2…
#>  8       7 Watauga    0.081 <MULTIPOLYGON (((-81.80622 36.10456, -81.81715 36.1…
#>  9       8 Perquimans 0.063 <MULTIPOLYGON (((-76.48053 36.07979, -76.53696 36.0…
#> 10       9 Chowan     0.044 <MULTIPOLYGON (((-76.68874 36.29452, -76.64822 36.3…
#> # ℹ 24 more rows

Spatial SQL (SQLITE dialect)

lazysf defaults to the SQLITE dialect, which gives you access to spatial SQL functions. These are translated from R-style names to their SQL equivalents:

lf |>
  mutate(area_m2 = st_area(geom)) |>
  filter(st_area(geom) > 0.1) |>
  mutate(srid = st_srid(geom)) |>
  collect()

Functions like st_area(), st_srid(), and st_geometrytype() work via GDAL’s built-in SQLite engine (no SpatiaLite extension needed). Functions that operate on geometry values (like st_astext(), st_intersects(), st_buffer()) require a SpatiaLite-enabled GDAL build.

Arrow stream interface

For larger datasets, enable GDAL’s columnar Arrow C stream interface for faster data transfer:

lf <- lazysf(f, use_arrow = TRUE)
lf |> collect()

This uses GDALVector$getArrowStream() via nanoarrow — data moves from GDAL to R in columnar batches instead of row-by-row. Requires GDAL >= 3.6.

SQL dialects

lazysf supports two SQL dialects:

SQLITE (default): Full SQLite syntax including subqueries, spatial functions, CAST, GROUP BY, ORDER BY. Required for dbplyr to work properly. This is GDAL’s embedded SQLite engine, available for any format.

lazysf(f) |>
  group_by(SID74) |>
  summarise(n = n(), mean_area = mean(AREA, na.rm = TRUE)) |>
  collect()
#> Warning in new_CppObject_xp(fields$.module, fields$.pointer, ...): SpatiaLite
#> is not available
#> # A tibble: 23 × 4
#>        FID SID74     n mean_area
#>    <int64> <dbl> <int>     <dbl>
#>  1       0     0    13  0.084846
#>  2       1     1    11  0.083091
#>  3       2     2     8  0.13088 
#>  4       3     3     6  0.10783 
#>  5       4     4    13  0.14715 
#>  6       5     5    11  0.12545 
#>  7       6     6     4  0.16575 
#>  8       7     7     4  0.14975 
#>  9       8     8     5  0.134   
#> 10       9     9     2  0.1605  
#> # ℹ 13 more rows

OGRSQL: GDAL’s native OGR SQL. Simpler, no subquery support, but works for basic operations:

lazysf(f, dialect = "OGRSQL") |>
  filter(NAME %LIKE% "A%") |>
  collect()

DBI connection

For more control, use the DBI interface directly:

con <- dbConnect(GDALSQL(), f)
con
#> <GDALVectorConnection>
#>       DSN: /perm_storage/home/mdsumner/R/x86_64-pc-linux-gnu-library/4.5/lazysf/extdata/nc.gpkg
#>   dialect: SQLITE
#>  geometry: WKB
#>     arrow: off
#>    tables: nc
dbListTables(con)
#> [1] "nc"
dbListFields(con, "nc")
#>  [1] "AREA"      "PERIMETER" "CNTY_"     "CNTY_ID"   "NAME"      "FIPS"     
#>  [7] "FIPSNO"    "CRESS_ID"  "BIR74"     "SID74"     "NWBIR74"   "BIR79"    
#> [13] "SID79"     "NWBIR79"   "geom"
DBI::dbDisconnect(con)
#> <GDALVectorConnection>
#>       DSN: 
#>   dialect: SQLITE
#>  geometry: WKB
#>     arrow: off
#>    tables: (unavailable)

Geometry formats

lazysf supports four geometry output formats, set via geom_format:

  • "WKB" (default): Well-Known Binary, as wk::wkb vectors
  • "WKT": Well-Known Text, as wk::wkt vectors
  • "BBOX": Bounding box per feature, as wk::rct vectors
  • "NONE": No geometry (attributes only, faster for non-spatial queries)
lazysf(f, geom_format = "WKT") |>
  select(NAME, geom) |>
  head(3) |>
  collect()
#> Warning in new_CppObject_xp(fields$.module, fields$.pointer, ...): SpatiaLite
#> is not available
#> # A tibble: 3 × 3
#>       FID NAME      geom                                                        
#>   <int64> <chr>     <wk_wkt>                                                    
#> 1       0 Ashe      MULTIPOLYGON (((-81.47276 36.23436, -81.54084 36.27251, -81…
#> 2       1 Alleghany MULTIPOLYGON (((-81.23989 36.36536, -81.24069 36.37942, -81…
#> 3       2 Surry     MULTIPOLYGON (((-80.45634 36.24256, -80.47639 36.25473, -80…

Online sources

Any GDAL-readable source works — local files, URLs, cloud storage, databases:

## GeoJSON from URL
lazysf("https://example.com/data.geojson")

## PostgreSQL/PostGIS
lazysf("PG:host=localhost dbname=mydb user=me password=secret")

## Cloud-optimized sources via /vsicurl/
lazysf("/vsicurl/https://example.com/large.gpkg")

Installation

# From CRAN
install.packages("lazysf")

# Development version from r-universe
options(repos = c(
  hypertidy = "https://hypertidy.r-universe.dev",
  CRAN = "https://cloud.r-project.org"))
install.packages("lazysf")

Known limitations

  • No window functions: slice_min(), slice_max(), row_number() etc. are not supported by GDAL’s SQLite engine.
  • OGRSQL dialect: Does not support subqueries, so complex dplyr verb chains may fail. Use SQLITE dialect (the default) instead.
  • FID is sticky: GDAL always includes the feature ID column in results, even when your SQL doesn’t select it.
  • Format-dependent: Performance and capability varies by GDAL driver. File formats like Shapefile and GeoPackage work well; text formats like CSV or GeoJSON are slower for large data.

Code of Conduct

Please note that the lazysf project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.