TidierStrings.jl
August 22, 2024
What is TidierStrings.jl?
TidierStrings.jl is a 100% Julia implementation of the R stringr package.
TidierStrings.jl has one main goal: to bring stringr's straightforward syntax and ease of use to Julia users. While this package was developed to work seamlessly with TidierData.jl functions and macros, it also works independently as a standalone package.
Installation
For the stable version:
] add TidierStrings
The ] character starts the Julia package manager. Press the backspace key to return to the Julia prompt.
or, from the Julia prompt:
using Pkg
Pkg.add("TidierStrings")
For the development version:
using Pkg
Pkg.add(url = "https://github.com/TidierOrg/TidierStrings.jl.git")
What functions does TidierStrings.jl support?
TidierStrings.jl currently supports:
| Category | Function |
|---|---|
| Matching | str_count, str_detect, str_locate, str_locate_all, str_replace, str_replace_all, str_remove, str_remove_all, str_split, str_starts, str_ends, str_subset, str_which |
| Concatenation | str_c, str_flatten, str_flatten_comma |
| Characters | str_dup, str_length, str_width, str_trim, str_squish, str_wrap, str_pad |
| Locale | str_equal, str_to_upper, str_to_lower, str_to_title, str_to_sentence, str_unique |
| Other | str_conv, str_like, str_replace_missing, word |
Examples
using Tidier
using TidierStrings
df = DataFrame(
    Names = ["Alice", "Bob", "Charlie", "Dave", "Eve", "Frank", "Grace"],
    City = ["New York 2019-20", "Los \n\n\n\n\n\n Angeles 2007-12 2020-21", "San Antonio 1234567890 ", " New York City", "LA 2022-23", "Philadelphia 2023-24", "San Jose 9876543210"],
    Occupation = ["Doctor", "Engineer", "Final Artist", "Scientist", "Physician", "Lawyer", "Teacher"],
    Description = ["Alice is a doctor in New York",
                   "Bob is is is an engineer in Los Angeles",
                   "Charlie is an artist in Chicago",
                   "Dave is a scientist in Houston",
                   "Eve is a physician in Phoenix",
                   "Frank is a lawyer in Philadelphia",
                   "Grace is a teacher in San Antonio"]
)
7×4 DataFrame
 Row │ Names    City                         Occupation    Description
     │ String   String                       String        String
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │ Alice    New York 2019-20             Doctor        Alice is a doctor in New York
   2 │ Bob      Los \n\n\n\n\n\n Angeles 2…  Engineer      Bob is is is an engineer in Los …
   3 │ Charlie  San Antonio 1234567890       Final Artist  Charlie is an artist in Chicago
   4 │ Dave      New York City               Scientist     Dave is a scientist in Houston
   5 │ Eve      LA 2022-23                   Physician     Eve is a physician in Phoenix
   6 │ Frank    Philadelphia 2023-24         Lawyer        Frank is a lawyer in Philadelphia
   7 │ Grace    San Jose 9876543210          Teacher       Grace is a teacher in San Antonio
str_squish(): Removes leading and trailing whitespace from a string and collapses each run of consecutive whitespace between words, including newlines, into a single space.
df = @chain df begin
    @mutate(City = str_squish(City))
end
7×4 DataFrame
 Row │ Names    City                         Occupation    Description
     │ String   String                       String        String
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │ Alice    New York 2019-20             Doctor        Alice is a doctor in New York
   2 │ Bob      Los Angeles 2007-12 2020-21  Engineer      Bob is is is an engineer in Los …
   3 │ Charlie  San Antonio 1234567890       Final Artist  Charlie is an artist in Chicago
   4 │ Dave     New York City                Scientist     Dave is a scientist in Houston
   5 │ Eve      LA 2022-23                   Physician     Eve is a physician in Phoenix
   6 │ Frank    Philadelphia 2023-24         Lawyer        Frank is a lawyer in Philadelphia
   7 │ Grace    San Jose 9876543210          Teacher       Grace is a teacher in San Antonio
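For intuition, the whitespace handling of str_squish() can be approximated in plain Julia. The following is only a sketch built on Base functions: squish_sketch is a hypothetical helper introduced for illustration, not the package's implementation.

```julia
# Hypothetical helper (not part of TidierStrings.jl): approximates the
# squish behavior using only Base Julia.
function squish_sketch(s::AbstractString)
    # Collapse every run of whitespace (spaces, tabs, newlines) into one
    # space, then trim any whitespace left at either end.
    return strip(replace(s, r"\s+" => " "))
end

println(squish_sketch("Los \n\n\n\n\n\n Angeles 2007-12 2020-21"))  # Los Angeles 2007-12 2020-21
println(squish_sketch(" New York City"))                            # New York City
```

Collapsing first and trimming second means a string that starts or ends with a run of whitespace is still cleaned up in a single pass over each step.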
Regex support: str_detect, str_replace, str_replace_all, str_remove, str_remove_all, str_count, str_equal, and str_subset
str_detect()
str_detect() checks whether a pattern exists in a string. It takes a string and a pattern as arguments and returns a Boolean indicating whether the pattern is present. It can be used inside @filter, @mutate, if_else(), and case_when(). str_detect() also supports the logical operators | and & within a pattern.
if_else() and case_when() with @filter and str_detect()
@chain df begin
    @mutate(Occupation = if_else(str_detect(Occupation, "Doctor | Physician"), "Physician", Occupation))
    @filter(str_detect(Description, "artist | doctor"))
end
2×4 DataFrame
 Row │ Names    City                    Occupation    Description
     │ String   String                  String        String
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │ Alice    New York 2019-20        Physician     Alice is a doctor in New York
   2 │ Charlie  San Antonio 1234567890  Final Artist  Charlie is an artist in Chicago
@chain df begin
    @mutate(state = case_when(str_detect(City, "NYC | New York") => "NY",
                              str_detect(City, "LA | Los Angeles | San & Jose") => "CA",
                              true => "other"))
end
7×5 DataFrame
 Row │ Names    City                         Occupation    Description                        state
     │ String   String                       String        String                             String
─────┼──────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Alice    New York 2019-20             Doctor        Alice is a doctor in New York      NY
   2 │ Bob      Los Angeles 2007-12 2020-21  Engineer      Bob is is is an engineer in Los …  CA
   3 │ Charlie  San Antonio 1234567890       Final Artist  Charlie is an artist in Chicago    other
   4 │ Dave     New York City                Scientist     Dave is a scientist in Houston     NY
   5 │ Eve      LA 2022-23                   Physician     Eve is a physician in Phoenix      CA
   6 │ Frank    Philadelphia 2023-24         Lawyer        Frank is a lawyer in Philadelphia  other
   7 │ Grace    San Jose 9876543210          Teacher       Grace is a teacher in San Antonio  CA
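One way to read the | and & operators in a plain-string pattern is as an OR of AND groups, with the whitespace around each piece ignored. The sketch below illustrates that reading in Base Julia. detect_sketch is a hypothetical helper, and the assumption that pieces are trimmed and matched case-sensitively as literal substrings is inferred from the output above rather than taken from the package source.

```julia
# Hypothetical helper (not part of TidierStrings.jl): interprets
# "a | b & c" as (a) OR (b AND c), trimming whitespace around each piece.
function detect_sketch(s::AbstractString, pattern::AbstractString)
    # The string matches if at least one '|'-separated group matches.
    any(split(pattern, '|')) do group
        # Within a group, every '&'-separated piece must occur in the string.
        all(piece -> occursin(strip(piece), s), split(group, '&'))
    end
end

pattern = "LA | Los Angeles | San & Jose"
println(detect_sketch("San Jose 9876543210", pattern))     # true
println(detect_sketch("San Antonio 1234567890", pattern))  # false
println(detect_sketch("LA 2022-23", pattern))              # true
```

Under this reading, "San Antonio 1234567890" fails because it contains "San" but not "Jose", which agrees with the "other" label in row 3 above.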
str_replace()
Replaces the first occurrence of a pattern in a string with the specified text. It takes a string, the pattern to search for, and the replacement text as arguments, and it supports regex and the logical operator |. In contrast, str_replace_all() replaces every occurrence of a match within a string.
@chain df begin
    @mutate(City = str_replace(City, r"\s*20\d{2}-\d{2,4}\s*", " ####-## "))
    @mutate(Description = str_replace(Description, "is | a", "will become "))
end
7×4 DataFrame
 Row │ Names    City                         Occupation    Description
     │ String   String                       String        String
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │ Alice    New York ####-##             Doctor        Alice will become a doctor in Ne…
   2 │ Bob      Los Angeles ####-## 2020-21  Engineer      Bob will become is is an enginee…
   3 │ Charlie  San Antonio 1234567890       Final Artist  Charlie will become an artist in…
   4 │ Dave     New York City                Scientist     Dave will become a scientist in …
   5 │ Eve      LA ####-##                   Physician     Eve will become a physician in P…
   6 │ Frank    Philadelphia ####-##         Lawyer        Frank will become a lawyer in Ph…
   7 │ Grace    San Jose 9876543210          Teacher       Grace will become a teacher in S…
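The difference between replacing the first match and replacing all matches has a direct analogue in Base Julia's replace function, whose count keyword limits the number of substitutions. The sketch below uses only Base and is meant as an analogy, not as a description of how str_replace() and str_replace_all() are implemented.

```julia
s = "Bob is is is an engineer in Los Angeles"

# Replace only the first match (analogous to str_replace).
first_only = replace(s, r"\bis\b" => "was", count = 1)

# Replace every match (analogous to str_replace_all).
every_match = replace(s, r"\bis\b" => "was")

println(first_only)   # Bob was is is an engineer in Los Angeles
println(every_match)  # Bob was was was an engineer in Los Angeles
```

The word-boundary anchors (\b) are a choice made for this sketch so that "is" inside a longer word is left alone.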
str_remove and str_remove_all
These remove the first match or all matches of a pattern, respectively.
@chain df begin
    @mutate(split = str_remove_all(Description, "is"))
end
7×5 DataFrame
 Row │ Names    City                         Occupation    Description                        split
     │ String   String                       String        String                             String
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Alice    New York 2019-20             Doctor        Alice is a doctor in New York      Alice a doctor in New York
   2 │ Bob      Los Angeles 2007-12 2020-21  Engineer      Bob is is is an engineer in Los …  Bob an engineer in Los Angeles
   3 │ Charlie  San Antonio 1234567890       Final Artist  Charlie is an artist in Chicago    Charlie an artist in Chicago
   4 │ Dave     New York City                Scientist     Dave is a scientist in Houston     Dave a scientist in Houston
   5 │ Eve      LA 2022-23                   Physician     Eve is a physician in Phoenix      Eve a physician in Phoenix
   6 │ Frank    Philadelphia 2023-24         Lawyer        Frank is a lawyer in Philadelphia  Frank a lawyer in Philadelphia
   7 │ Grace    San Jose 9876543210          Teacher       Grace is a teacher in San Antonio  Grace a teacher in San Antonio
str_equal()
Checks if two strings are exactly the same. Takes two strings as arguments and returns a boolean indicating whether the strings are identical.
@chain df begin
    @mutate(Same_City = case_when(str_equal(City, Occupation) => "Yes",
                                  true => "No"))
end
7×5 DataFrame
 Row │ Names    City                         Occupation    Description                        Same_City
     │ String   String                       String        String                             String
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Alice    New York 2019-20             Doctor        Alice is a doctor in New York      No
   2 │ Bob      Los Angeles 2007-12 2020-21  Engineer      Bob is is is an engineer in Los …  No
   3 │ Charlie  San Antonio 1234567890       Final Artist  Charlie is an artist in Chicago    No
   4 │ Dave     New York City                Scientist     Dave is a scientist in Houston     No
   5 │ Eve      LA 2022-23                   Physician     Eve is a physician in Phoenix      No
   6 │ Frank    Philadelphia 2023-24         Lawyer        Frank is a lawyer in Philadelphia  No
   7 │ Grace    San Jose 9876543210          Teacher       Grace is a teacher in San Antonio  No
str_to_upper and str_to_lower
These convert a string to all uppercase or all lowercase, respectively.
@chain df begin
    @mutate(Names = str_to_upper(Names))
    @select(Names)
end
7×1 DataFrame
 Row │ Names
     │ String
─────┼─────────
   1 │ ALICE
   2 │ BOB
   3 │ CHARLIE
   4 │ DAVE
   5 │ EVE
   6 │ FRANK
   7 │ GRACE