April 21, 2026
| nc: named capture | [[https://rdatatable-community.github.io/The-Raft/posts/2024-08-01-seal_of_approval-nc/][https://rdatatable-community.github.io/The-Raft/posts/2024-08-01-seal_of_approval-nc/hex_approved.png]] |
| [[file:tests/testthat][tests]] | [[https://github.com/tdhock/nc/actions][https://github.com/tdhock/nc/workflows/R-CMD-check/badge.svg]] |
| [[https://github.com/jimhester/covr][coverage]] | [[https://app.codecov.io/gh/tdhock/nc?branch=master][https://codecov.io/gh/tdhock/nc/branch/master/graph/badge.svg]] |
User-friendly functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions, and for melting/reshaping columns that match a regular expression. Please read and cite my related R Journal papers if you use this code!
- [[https://journal.r-project.org/archive/2019/RJ-2019-050/index.html][Comparing namedCapture with other R packages for regular expressions]] (2019).
- [[https://journal.r-project.org/archive/2021/RJ-2021-029/index.html][Wide-to-tall Data Reshaping Using Regular Expressions and the nc Package]] (2021).
** Quick demo of matching functions
#+BEGIN_SRC R
fruit.vec <- c("granny smith apple", "blood orange and yellow banana")
fruit.pattern <- list(type=".*?", " ", fruit="orange|apple|banana")
nc::capture_first_vec(fruit.vec, fruit.pattern)
#>            type  fruit
#> 1: granny smith  apple
#> 2:        blood orange
nc::capture_all_str(fruit.vec, fruit.pattern)
#>            type  fruit
#> 1: granny smith  apple
#> 2:        blood orange
#> 3:   and yellow banana
#+END_SRC
** Quick demo of reshaping functions
#+begin_src R
(one.iris <- iris[1,])
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
nc::capture_melt_single(one.iris, part=".*", "[.]", dim=".*")
#>    Species  part    dim value
#> 1:  setosa Sepal Length   5.1
#> 2:  setosa Sepal  Width   3.5
#> 3:  setosa Petal Length   1.4
#> 4:  setosa Petal  Width   0.2
nc::capture_melt_multiple(one.iris, part=".*", "[.]", column=".*")
#>    Species  part Length Width
#> 1:  setosa Petal    1.4   0.2
#> 2:  setosa Sepal    5.1   3.5
nc::capture_melt_multiple(one.iris, column=".*", "[.]", dim=".*")
#>    Species    dim Petal Sepal
#> 1:  setosa Length   1.4   5.1
#> 2:  setosa  Width   0.2   3.5
#+end_src
** Installation
#+BEGIN_SRC R
install.packages("nc")
## OR:
if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/nc")
#+END_SRC
** Usage overview
Watch the [[https://www.youtube.com/watch?v=4mDJnVtzsbg&list=PLwc48KSH3D1P8R7470s0lgcUObJLEXSSO&index=1][screencast tutorial videos]]!
The main functions provided in nc are:
| Subject              | nc function             | Similar to                            | And                     |
|----------------------+-------------------------+---------------------------------------+-------------------------|
| Single string        | =capture_all_str=       | =stringr::str_match_all=              | =rex::re_matches=       |
| Character vector     | =capture_first_vec=     | =stringr::str_match=                  | =rex::re_matches=       |
| Data frame chr cols  | =capture_first_df=      | =tidyr::extract/separate_wider_regex= | =data.table::tstrsplit= |
| Data frame col names | =capture_melt_single=   | =tidyr::pivot_longer=                 | =data.table::melt=      |
| Data frame col names | =capture_melt_multiple= | =tidyr::pivot_longer=                 | =data.table::melt=      |
| File paths           | =capture_first_glob=    | =arrow::open_dataset=                 |                         |
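
As a minimal sketch of the "Data frame chr cols" row in the table above, the code below gives =capture_first_df= a separate pattern for each of two character columns. The input data and the column names =JobID= and =Elapsed= are made up for illustration; each =as.integer= is a conversion function applied to the named group just before it.

#+BEGIN_SRC R
library(data.table)
## Hypothetical input: each character column encodes several fields.
subject.dt <- data.table(
  JobID = c("13937810_25", "14022192_1"),
  Elapsed = c("07:04:42", "00:00:00"))
## One named argument per column to parse; each value is a pattern list.
match.dt <- nc::capture_first_df(
  subject.dt,
  JobID = list(
    job = "[0-9]+", as.integer,
    "_",
    task = "[0-9]+", as.integer),
  Elapsed = list(
    hours = "[0-9]+", as.integer,
    ":",
    minutes = "[0-9]+", as.integer,
    ":",
    seconds = "[0-9]+", as.integer))
match.dt
#+END_SRC

The result should keep the two input columns and add one integer column per named group (=job=, =task=, =hours=, =minutes=, =seconds=).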
- [[https://cloud.r-project.org/web/packages/nc/vignettes/v0-overview.html][Vignette 0]] provides an overview of the various functions.
- [[https://cloud.r-project.org/web/packages/nc/vignettes/v1-capture-first.html][Vignette 1]] discusses =capture_first_vec= and =capture_first_df=, which capture the first match in each of several subjects (character vector, data frame character columns).
- [[https://cloud.r-project.org/web/packages/nc/vignettes/v2-capture-all.html][Vignette 2]] discusses =capture_all_str= which captures all matches in a single string, or a single multi-line text file. The vignette also shows how to use =capture_all_str= on several different strings/files, using data.table =by= syntax.
- [[https://cloud.r-project.org/web/packages/nc/vignettes/v3-capture-melt.html][Vignette 3]] discusses =capture_melt_single= and =capture_melt_multiple= which match a regex to the column names of a wide data frame, then melt/reshape the matching columns. These functions are especially useful when more than one separate piece of information can be captured from each column name, e.g. the iris column names =Petal.Width=, =Sepal.Width=, etc each have two pieces of information (flower part and measurement dimension).
- [[https://cloud.r-project.org/web/packages/nc/vignettes/v4-comparisons.html][Vignette 4]] shows comparisons with related R packages.
- [[https://cloud.r-project.org/web/packages/nc/vignettes/v5-helpers.html][Vignette 5]] explains how to use helper functions for creating complex regular expressions.
- [[https://cloud.r-project.org/web/packages/nc/vignettes/v6-engines.html][Vignette 6]] explains how to use different regex engines.
- [[https://cloud.r-project.org/web/packages/nc/vignettes/v7-capture-glob.html][Vignette 7]] explains how to read regularly named files, and use a regex to extract meta-data from the file names, using =nc::capture_first_glob=.
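
The data.table =by= usage mentioned for Vignette 2 can be sketched as follows. This is an illustrative example rather than code from the vignette: the table =subject.dt= and its columns =id= and =text= are assumptions, and =capture_all_str= is called once per group so that each subject string yields its own set of match rows.

#+BEGIN_SRC R
library(data.table)
## Hypothetical input: one subject string per id.
subject.dt <- data.table(
  id = c("a", "b"),
  text = c("x=1 y=22", "x=333"))
## For each id, capture every name=value pair found in its text.
match.dt <- subject.dt[, nc::capture_all_str(
  text,
  name = "[a-z]",
  "=",
  value = "[0-9]+", as.integer),
  by = id]
match.dt
#+END_SRC

Here the first string contributes two rows and the second contributes one, with =id= carried along as the grouping column.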
*** Choice of regex engine
By default, nc uses PCRE. Other options include ICU and RE2.
To tell nc that you would like to use a certain engine, set the =nc.engine= option:

#+BEGIN_SRC R
options(nc.engine="RE2")
#+END_SRC
Every function also has an engine argument, e.g.
#+BEGIN_SRC R
nc::capture_first_vec(
  "foo a\U0001F60E# bar",
  before=".*?",
  emoji="\\p{EMOJI_Presentation}",
  after=".*",
  engine="ICU")
#>   before emoji after
#> 1  foo a    😎 # bar
#+END_SRC
** Related work
For a detailed comparison of regex C libraries in R (ICU, PCRE, TRE, RE2), see my [[https://github.com/tdhock/namedCapture-article][R Journal (2019) paper about namedCapture]].
The nc reshaping functions provide functionality similar to packages tidyr, stats, data.table, reshape, reshape2, cdata, utils, etc. The main difference is that =nc::capture_melt_*= support named capture regular expressions with type conversion, which (1) makes it easier to create/maintain a complex regex, and (2) results in less repetition in user code. For a detailed comparison, see [[https://github.com/tdhock/nc-article][my R Journal (2021) paper about nc]].
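
To illustrate what type conversion means during a reshape, here is a small sketch with made-up data: the column names =day1= and =day10= of the hypothetical =wide.df= each contain a number, which is captured in the =day= group and converted by =as.integer=, so no separate post-processing step is needed.

#+BEGIN_SRC R
## Hypothetical wide data: one measurement column per day.
wide.df <- data.frame(
  id = c("a", "b"),
  day1 = c(1.5, 2.5),
  day10 = c(3.5, 4.5))
## Columns matching the regex are melted; the others (id) are kept as is.
tall.dt <- nc::capture_melt_single(
  wide.df,
  "day",
  day = "[0-9]+", as.integer)
tall.dt
#+END_SRC

The output should have one row per (id, day) combination, with =day= stored as an integer column rather than as text.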
Below I list the main differences between the functions in =nc= and other analogous R functions:
- Main =nc= functions all have the =capture_= prefix for easy auto-completion.
- Output in =nc= is always a data.table (other packages output either a list, character matrix, or data frame).
- For memory efficiency, =nc::capture_first_df= modifies the input if it is a data table, whereas =tidyr= functions always copy the input table.
- By default, =nc::capture_first_vec= stops with an error if any subjects do not match, whereas other functions return NA/missing rows.
- =nc::capture_all_str= only supports capturing multiple matches in a single subject (returning a data table), whereas other functions support multiple subjects (and return a list of character matrices). For handling multiple subjects using =nc=, use =DT[, nc::capture_all_str(subject), by]= (see [[https://cloud.r-project.org/web/packages/nc/vignettes/v2-capture-all.html][vignette 2]] for more info).
- =nc::capture_melt_single= and =nc::capture_melt_multiple= use regex for wide-to-tall data reshaping, see [[https://cloud.r-project.org/web/packages/nc/vignettes/v3-capture-melt.html][Vignette 3]] and my [[https://journal.r-project.org/archive/2021/RJ-2021-029/index.html][R Journal (2021)]] paper for more info. Whereas in nc these are two separate functions, other packages typically provide a single function which does both kinds of reshaping, for example [[https://r-datatable.com/reference/measure.html][measure]] in =data.table=.
- =nc::capture_first_glob= is for reading any kind of regularly named files into R using regex, whereas =arrow::open_dataset= requires a particular naming scheme (does not support regex).
- Helper function =nc::measure= can be used to create the =measure.vars= argument of =data.table::melt=, and =nc::capture_longer_spec= can be used to create the =spec= argument of =tidyr::pivot_longer=. This can be useful if you want to use nc to define the regex, but you want to use the other package functions to do the reshape.
- Similar to [[https://github.com/r-lib/rex/blob/main/R/capture.R][rex::capture]], helper function =nc::field= is provided for defining patterns that match subjects like variable=value, and create a column/group named variable (useful to avoid repeating variable names in regex code). See [[https://cloud.r-project.org/web/packages/nc/vignettes/v2-capture-all.html][vignette 2]] for more info.
- Similar to [[https://github.com/r-lib/rex/blob/main/R/or.R][rex::or]], =nc::alternatives_with_shared_groups= is provided for defining a pattern containing alternatives with shared groups. See [[https://cloud.r-project.org/web/packages/nc/vignettes/v5-helpers.html][vignette 5]] for more info.
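
Two of the points above can be sketched in a few lines. The subjects and patterns are made up for illustration. The first call assumes the =nomatch.error= argument of =nc::capture_first_vec=, which when set to =FALSE= should give a row of missing values instead of an error for a subject that does not match. The second call uses =nc::field= so that each variable name is written once, serving as both the literal text to match and the output column name.

#+BEGIN_SRC R
## Hypothetical subjects: the second one does not match the pattern.
subject.vec <- c("chr1:100-200", "not a region")
match.dt <- nc::capture_first_vec(
  subject.vec,
  chrom = "chr.*?",
  ":",
  start = "[0-9]+", as.integer,
  nomatch.error = FALSE) # missing row instead of an error
match.dt

## nc::field("alpha", "=", "[0-9]+") matches the text alpha=<digits>
## and stores the digits in a column named alpha.
field.dt <- nc::capture_first_vec(
  "alpha=1 beta=20",
  nc::field("alpha", "=", "[0-9]+"),
  " ",
  nc::field("beta", "=", "[0-9]+"))
field.dt
#+END_SRC

Without =nomatch.error=FALSE=, the first call would stop with an error because of the second subject, which is the default behavior described in the list above.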