StringZilla 🦖 Command-Line Interface
December 17, 2025 · View on GitHub

Most text-processing command-line utilities have obscure syntax, limited portability across operating systems, can't handle larger-than-memory datasets, and don't leverage modern SIMD capabilities, such as AVX-512 on x86 and SVE on Arm. This utility is written in Rust and uses StringZilla for both pipe- and file-based text processing across Linux, macOS, and Windows. Just one command to install it from Crates.io:
cargo install stringzilla-cli
It provides the following subcommands:
- sz-find: find all inclusions of a substring in a file, similar to grep, but with a saner syntax
- sz-outline: provide an LLM with an outline of a file, for Markdown, HTML, C, C++, and Python sources
- sz-count: 3x faster wc word count that can actually handle UTF-8 properly
- sz-dedup: deduplicate lines; safe for larger-than-memory files
- sz-split: 4x faster split file splitting that won't break UTF-8 characters or lines
- sz-cols: extract columns from delimited text; replaces cut -f and awk '{print $N}' with simpler syntax
- sz-rows: extract rows by index or range; replaces sed -n, head, tail, and awk 'NR==N' (coming soon)
- sz-sort: sort lines (coming soon)
- sz-fuzzy-find: combination of exact and Levenshtein-bounded substring search
Installation
cargo install --git https://github.com/ashvardanian/StringZilla-CLI # install from GitHub
cargo install --path . --force # or install from local clone
sz-find: Unicode Aware Substring Search
A grep-like tool using literal substring matching (not regex) for maximum speed.
Unlike grep and ripgrep, sz-find performs full Unicode-compliant case folding for case-insensitive search, correctly handling all 1M+ Unicode code points.
# Basic search
sz-find "error" log.txt
# Case-insensitive search (with full Unicode case folding)
sz-find -i "error" log.txt
# Show line numbers
sz-find -n "pattern" file.txt
# Count matches only
sz-find -c "pattern" file.txt
# Context lines (like grep -B/-A/-C)
sz-find -B 2 -A 2 "error" log.txt
# Multi-line patterns (pattern can span lines)
sz-find -m "hello\nworld" file.txt
# UTF-8 mode (handles Unicode newlines: NEL, LINE SEPARATOR, etc.)
sz-find --utf8 "pattern" file.txt
ripgrep (rg -i) has only partial support for Unicode case folding.
The one tool that seemingly implements full folding is pcre2, which is designed for regex rather than substring search.
The former is comparably fast; the latter is orders of magnitude slower.
Here's what Unicode compliance on a mixed dataset means for German queries, where the Eszett (ß) character is commonly used and folds to "ss":
$ rg -c -i "strasse" xlsum.csv # 183 results
$ sz-find -c -i "strasse" xlsum.csv # 205 results, +22 more
$ rg -c -i "gross" xlsum.csv # 5412 results
$ sz-find -c -i "gross" xlsum.csv # 5418 results, +6 more
$ rg -c -i "weiss" xlsum.csv # 350 results
$ sz-find -c -i "weiss" xlsum.csv # 352 results, +2 more
Ligatures from PDFs and word processors (ﬁ, ﬂ, ﬀ, ﬃ, ﬄ) are also folded correctly:
$ rg -c -i "fi" xlsum.csv # 464678 results
$ sz-find -c -i "fi" xlsum.csv # 464699 results, +21 more
$ rg -c -i "ffi" xlsum.csv # 162155 results
$ sz-find -c -i "ffi" xlsum.csv # 162157 results, +2 more
Turkish dotted/dotless I (İ/I/i/ı) folding is also a common pitfall:
$ rg -c -i "işi" xlsum.csv # 23957 results
$ sz-find -c -i "işi" xlsum.csv # 25065 results, +1108 more
The difference becomes significant when searching legal documents, German/Swiss news, Turkish text, PDF-extracted content, or any content with typographic ligatures.
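The gap between the two tools comes down to full case folding versus simple lowercasing. Here is a minimal Python sketch of the distinction, using the standard library's casefold rather than sz-find's own implementation:

```python
# Simple lowercasing leaves the German Eszett and typographic ligatures
# untouched, so a byte-wise case-insensitive search misses them.
# Full case folding maps them to their multi-character equivalents.
haystack = "Bahnhofstraße"

print("strasse" in haystack.lower())     # False: "ß".lower() is still "ß"
print("strasse" in haystack.casefold())  # True:  "ß" folds to "ss"

# The same holds for ligatures extracted from PDFs:
print("file" in "ﬁle".lower())           # False: the "ﬁ" ligature survives lowercasing
print("file" in "ﬁle".casefold())        # True:  "ﬁ" folds to "fi"
```

This is exactly why the sz-find counts above are consistently higher than ripgrep's on the same queries.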
sz-outline: File Outliner for LLMs
Extract structural outlines from source files for LLM context windows.
When feeding large files to language models, you often need a high-level overview without the full content.
sz-outline extracts headings, function signatures, includes, and other structural elements.
# Outline a Markdown file (headings only)
sz-outline README.md
# With line numbers and byte offsets (-v)
sz-outline -v README.md
# Detailed mode with child blocks (-vv)
sz-outline -vv README.md
# Outline C/C++ source (includes + function signatures)
sz-outline src/main.c
# Force file type detection
sz-outline -t md document.txt
# Read from stdin
cat file.md | sz-outline -t md -
There are several verbosity levels supported:
Default - Names only:
$ sz-outline README.md
# StringZilla CLI
## Installation
## sz-find: Unicode Aware Substring Search
### Performance vs ripgrep
## sz-outline: File Outliner for LLMs
-v - Add line numbers and byte offsets:
$ sz-outline -v README.md
# StringZilla CLI [L1, @0]
## Installation [L27, @892]
## sz-find: Unicode Aware Substring Search [L34, @1045]
### Performance vs ripgrep [L127, @4521]
-vv - Detailed with child blocks (code blocks, tables, images):
$ sz-outline -vv README.md
# StringZilla CLI [L1, @0, 45B]
- paragraph [L3-5, 312B]
- code (bash) [L9-11, 89B]
## Installation [L27, @892, 156B]
- code (bash) [L29-32, 245B]
For C source files, sz-outline extracts includes and function signatures:
$ sz-outline -v src/parser.c
#include <stdio.h> [L1, @0]
#include <stdlib.h> [L2, @19]
#include "parser.h" [L3, @39]
static int parse_token(const char *input) [L12-45, @156, definition]
int parse_file(FILE *fp) [L47-123, @892, definition]
void cleanup(void) [L125-130, @2341, definition]
Function signatures are normalized (whitespace collapsed) and categorized as declarations (;) or definitions ({}).
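The normalization and categorization step can be sketched in a few lines of Python. This is a hypothetical illustration of the rule described above, not sz-outline's actual Rust implementation:

```python
def classify_signature(raw: str) -> str:
    """Hypothetical helper: collapse whitespace, then label a C signature
    as a declaration when it ends with ';' and as a definition when its
    body opens with '{'."""
    sig = " ".join(raw.split())  # collapse runs of spaces, tabs, newlines
    if sig.endswith(";"):
        return "declaration"
    if sig.endswith("{"):
        return "definition"
    return "unknown"

print(classify_signature("int   parse_file(FILE *fp);"))                  # declaration
print(classify_signature("static int parse_token(const char *input) {"))  # definition
```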
sz-count: Word Count
The wc utility on Linux can be used to count the number of lines, words, and bytes in a file.
Using SIMD-accelerated character and character-set search, StringZilla can be noticeably faster, even with slow SSDs.
$ time wc enwik9.txt
13147025 129348346 1000000000 enwik9.txt
real 0m3.562s
user 0m3.470s
sys 0m0.092s
$ time sz-count --wc enwik9.txt
13147025 139132610 1000000000 enwik9.txt # Note: word count differs due to stricter ASCII whitespace handling
real 0m1.165s
user 0m1.121s
sys 0m0.044s
sz-split: Split File into Smaller Ones
The split utility on Linux can be used to split a file into smaller ones.
The current prototype only splits by line counts.
$ time split -l 100000 enwik9.txt ...
real 0m6.424s
user 0m0.179s
sys 0m0.663s
$ time sz-split -l 100000 enwik9.txt ...
real 0m1.482s
user 0m1.020s
sys 0m0.460s
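Line-based splitting also guarantees that chunks remain valid UTF-8. Cutting a file at an arbitrary byte offset can land in the middle of a multi-byte sequence, leaving a chunk that no longer decodes, as this quick Python illustration shows:

```python
# "é" occupies two bytes in UTF-8, so a byte-offset split can cut it in half.
data = "café au lait".encode("utf-8")
chunk = data[:4]  # slice ends between the two bytes of "é"

try:
    chunk.decode("utf-8")
except UnicodeDecodeError:
    print("chunk is not valid UTF-8")  # splitting on line boundaries avoids this
```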
sz-cols: Extract Columns
The cut utility and awk '{print $N}' are commonly used to extract columns from delimited text.
sz-cols provides a simpler, more intuitive syntax with SIMD-accelerated delimiter scanning.
# Replaces: cut -f2 data.tsv
# Replaces: awk -F'\t' '{print \$2}' data.tsv
$ sz-cols -f 2 data.tsv
# Extract multiple columns with custom output delimiter
# Replaces: cut -f1,3 -d',' --output-delimiter=';' data.csv
$ sz-cols -f 1,3 -d ',' -D ';' data.csv
# Extract a range of columns
# Replaces: cut -f2-5 data.tsv
$ sz-cols -f 2-5 data.tsv
# Mixed selection: specific columns and ranges
$ sz-cols -f 1,3-5,8 data.tsv
sz-rows: Extract Rows
The sed -n 'Np', head -n N, tail -n N, and awk 'NR==N' commands are commonly used to extract specific lines.
sz-rows unifies all these use cases with a single, intuitive interface.
# Extract line 5
# Replaces: sed -n '5p' file.txt
# Replaces: awk 'NR==5' file.txt
$ sz-rows -r 5 file.txt
# Extract lines 10-20
# Replaces: sed -n '10,20p' file.txt
$ sz-rows -r 10-20 file.txt
# Extract first 10 lines
# Replaces: head -n 10 file.txt
$ sz-rows -r 1-10 file.txt
# Extract last 10 lines
# Replaces: tail -n 10 file.txt
$ sz-rows --tail 10 file.txt
# Extract specific lines
# Replaces: sed -n '1p;5p;10p' file.txt
$ sz-rows -r 1,5,10 file.txt
# Extract every 5th line
# Replaces: awk 'NR % 5 == 0' file.txt
$ sz-rows --every 5 file.txt
# Show line numbers in output
$ sz-rows -n -r 5-10 file.txt