TidierText.jl
December 7, 2023
What is TidierText.jl?
TidierText.jl is a 100% Julia implementation of the R tidytext package. The purpose of the package is to make it easy to analyze text data using DataFrames.
An extensive guide to tidy text analysis by Julia Silge and David Robinson is available here: https://www.tidytextmining.com/.
Installation
For the development version:
using Pkg
Pkg.add(url="https://github.com/TidierOrg/TidierText.jl")
What functions does TidierText.jl support?
@bind_tf_idf()
@unnest_tokens()
@unnest_regex()
@unnest_characters()
@unnest_ngrams()
get_stopwords()
tidy()
nma_words
How does the package work?
Let's load the package and read in the UCLA Fall 2018 course dataset.
using TidierData
using TidierText
using CSV
courses = CSV.read(download("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/ucla_f18.csv"), DataFrame)
What are the course names?
@chain courses begin
@select(id = rownames, course)
@slice(1:10)
end
10×2 DataFrame
 Row │ id     course
     │ Int64  String
─────┼──────────────────────────────────────────
   1 │     1  Leadership Laboratory
   2 │     2  Heritage and Values
   3 │     3  Team and Leadership Fundamentals
   4 │     4  Air Force Leadership Studies
   5 │     5  National Security Affairs/Prepar…
   6 │     6  Introduction to Black Studies
   7 │     7  African American Musical Heritage
   8 │     8  UCLA Centennial Initiative: Arth…
   9 │     9  UCLA Centennial Initiative: Soci…
  10 │    10  Student Research Program
Let's tokenize the course names and convert them to lowercase.
tokens = @chain courses begin
@select(id = rownames, course)
@slice(1:10)
@unnest_tokens(word, course, to_lower = true)
end;
@chain tokens @slice(1:10)
10×2 DataFrame
 Row │ id     word
     │ Int64  SubStrin…
─────┼─────────────────────
   1 │     1  leadership
   2 │     1  laboratory
   3 │     2  heritage
   4 │     2  and
   5 │     2  values
   6 │     3  team
   7 │     3  and
   8 │     3  leadership
   9 │     3  fundamentals
  10 │     4  air
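To make the one-token-per-row idea concrete, here is a minimal sketch in plain Julia. It is not how @unnest_tokens() is implemented: the helper `simple_tokens` is hypothetical, and it assumes that a token is simply a run of word characters. The two sample titles are copied from the table of course names above.

```julia
# Hypothetical helper, NOT part of TidierText.jl: split a string into
# lowercase tokens, treating each run of word characters as one token.
simple_tokens(text::AbstractString) =
    [lowercase(m.match) for m in eachmatch(r"\w+", text)]

# Two course titles taken from the table of course names.
ids = [1, 2]
titles = ["Leadership Laboratory", "Heritage and Values"]

# One row per (id, word) pair, mirroring the tidy one-token-per-row layout.
rows = [(id = id, word = w) for (id, t) in zip(ids, titles) for w in simple_tokens(t)]

for row in rows
    println(row.id, "  ", row.word)
end
```

Each title contributes as many rows as it has tokens, and the `id` value is repeated on every row so that each word can be traced back to its course.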
Let's add the term frequency, inverse document frequency, and the tf-idf.
@chain tokens begin
@count(id, word)
@bind_tf_idf(word, id, n)
@slice(1:10)
end
10×6 DataFrame
 Row │ id     word          n      tf        idf       tf_idf
     │ Int64  SubStrin…     Int64  Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────────
   1 │     1  leadership        1  0.5       1.20397   0.601986
   2 │     1  laboratory        1  0.5       2.30259   1.15129
   3 │     2  heritage          1  0.333333  1.60944   0.536479
   4 │     2  and               1  0.333333  0.916291  0.30543
   5 │     2  values            1  0.333333  2.30259   0.767528
   6 │     3  team              1  0.25      2.30259   0.575646
   7 │     3  and               1  0.25      0.916291  0.229073
   8 │     3  leadership        1  0.25      1.20397   0.300993
   9 │     3  fundamentals      1  0.25      2.30259   0.575646
  10 │     4  air               1  0.25      2.30259   0.575646
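The three computed columns can be checked by hand. The sketch below assumes the conventional definitions, which the output above is consistent with: tf is a word's count divided by the total number of tokens in its document, idf is the natural log of the number of documents divided by the number of documents containing the word, and tf_idf is their product. The function names are hypothetical and not part of TidierText.jl. The document count of 10 comes from the @slice(1:10) step, and the count of 3 documents containing "leadership" is read off the visible course titles (rows 1, 3, and 4).

```julia
# Hypothetical helpers, NOT part of TidierText.jl.
# tf: share of a document's tokens accounted for by one term.
term_frequency(n, total_tokens) = n / total_tokens

# idf: natural log of (number of documents / documents containing the term).
inverse_document_frequency(n_docs, n_docs_with_term) = log(n_docs / n_docs_with_term)

# "leadership" in course 1: 1 of 2 tokens; appears in 3 of the 10 courses.
tf_leadership = term_frequency(1, 2)
idf_leadership = inverse_document_frequency(10, 3)
tf_idf_leadership = tf_leadership * idf_leadership

# "laboratory" in course 1: 1 of 2 tokens; appears in 1 of the 10 courses.
tf_idf_laboratory = term_frequency(1, 2) * inverse_document_frequency(10, 1)

println("leadership: tf = ", tf_leadership, ", idf = ", idf_leadership,
        ", tf_idf = ", tf_idf_leadership)
println("laboratory: tf_idf = ", tf_idf_laboratory)
```

Words that occur in many documents, such as "and", get a low idf and therefore a low tf_idf, while words unique to one course title score highest.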