README.md
February 4, 2026 ยท View on GitHub
Seqtool is a fast and flexible command line program for dealing with
large amounts of biological sequences.
It provides different subcommands for converting, inspecting
and modifying sequences.
The standalone binary (5-7 MB) is simply named st to save some typing.
Documentation
See https://markschl.github.io/seqtool-docs
Downloads
Binaries are available for Linux, OS X and Windows
๐ฅ download the latest version (v0.4.0)
Feature overview
File formats
Reads and writes FASTA, FASTQ and CSV/TSV, optionally compressed with GZIP, BZIP2, or the faster and more modern Zstandard or LZ4 formats
Example: compressed FASTQ to FASTA
Combine multiple compressed FASTQ files, converting them to FASTA, using pass.
st pass file1.fastq.gz file2.fastq.gz -o output.fasta
Note: almost every command can read multiple input files and convert between formats,
but pass does nothing other than reading and writing while other command perform certain actions.
Example: FASTA to tab-separated list
Aside from ID and sequence, any variable/function such as
the sequence length (seqlen) can be written to delimited text.
st pass input.fasta --to-tsv id,seq,seqlen
id1 ACG 3
id1 ACGTACGT 7
id1 ACGTA 5
Highly versatile thanks to variables/functions
See also variables/functions for more details.
Example: count sequences in a large set of FASTQ files
st count -k path data/*.fastq.gz
data/sample1.fastq.gz 30601
data/sample2.fastq.gz 15702
data/sample3.fastq.gz 264965
data/sample4.fastq.gz 1120
data/sample5.fastq.gz 7021
(...)
In count, one or several categorical variables/functions
can be specified with -k/--key.
Example: summarize the GC content in 10% intervals
The function bin(variable, interval) groups continuous numeric values
into intervals
st count -k 'bin(gc_percent, 10)' sequences.fasta
(10, 20] 57
(20, 30] 2113
(30, 40] 11076
(40, 50] 7184
(50, 60] 12
Example: Assign new sequence IDs
st set -i 'seq_{num}' seqs.fasta > renamed.fasta
>seq_1
SEQUENCE
>seq_2
SEQUENCE
>seq_3
SEQUENCE
(...)
Example: De-replicate by description and sequence
seqs.fasta with a 'group' annotation in the header:
>id1 group1
SEQUENCE1
>id2 group1
SEQUENCE2
>id3 group1
SEQUENCE2
>id4 group2
SEQUENCE1
>id5 group2
SEQUENCE1
st unique 'desc,seq' seqs.fasta > grouped_uniques.fasta
>id1 group1
SEQUENCE1
>id2 group1
SEQUENCE2
>id4 group2
SEQUENCE1
Expressions
From simple math to complicated filter expressions, the tiny integrated JavaScript engine (QuickJS) offers countless possibilities for customized sequence processing.
Example: filter FASTQ sequences by quality and length
This filter command removes sequencing reads with more than one expected sequencing error (like USEARCH can do) or sequence length of <100 bp.
st filter 'exp_err < 1 && seqlen >= 100' reads.fastq > filtered.fastq
Header attributes for metadata storage
key=value header attributes allow storing and passing on
all kinds of information
Example: De-replicate by sequence (seq variable) and/or other properties
The unique command returns all unique sequences and annotates the number of records with the same sequence in the header:
st unique seq -a abund={n_duplicates} input.fasta > uniques.fasta
>id1 abund=3
TCTTTAATAACCTGATTAG
>id3 abund=1
GGAGGATCCGAGCG
(...)
It is also possible to de-replicate by multiple keys, e.g. by sequence,
but grouped by a sample attribute in the header:
st unique 'seq,attr(sample)' input.fasta > uniques.fasta
>id1 sample=1
SEQUENCE1
>id3 sample=2
SEQUENCE2
>id10 sample=1
SEQUENCE3
>id11 sample=3
SEQUENCE4
(...)
Fast pattern search
The find command enables flexible and fast pattern search, where the optimal algorithm is chosen automatically. Primer/adapter searches usually yield the same results as Cutadapt, while being faster.
Example: pre-processing of mixed multi-marker amplicon sequences (primer trimming, grouping by amplicon)
These steps could be part of an amplicon pipeline that de-multiplexes multi-marker amplicons.
Three different seqtool commands are chained, whereby find searches for a set of primers,
which are removed by trim and finally split distributes the sequences into different files named
by the forward primer. The primer names and positions are passed to the next command in the form of
header attributes (-a/--attr option).
primers.fasta
>prA
PRIMER
>prB
PRIMER
Command for searching/trimming
st find file:primers.fasta -a primer='{pattern_name}' -a end='{match_end}' sequences.fasta |
st trim -e '{attr(end)}..' |
st split -o '{attr(primer)}'
| prA.fasta | prB.fasta | undefined.fasta |
|---|---|---|
|
|
Note: no primer, sequence not trimmed since |
Integration of external metadata
Integration of sequence metadata sources in the form of delimited text
Example: Add Genus names from a separate tab-separated list
| input.fasta | genus.tsv |
|---|---|
|
|
Using -m/--meta to include genus.tsv as metadata source:
st set -m genus.tsv --desc '{meta(genus)}' input.fasta > with_genus.fasta
| with_genus.fasta |
|---|
|
Example: Choose specific sequences given a separate file with an ID list
| input.fasta | id_list.txt |
|---|---|
|
|
st filter -m id_list.txt 'has_meta()' input.fasta > subset.fasta
| subset.fasta |
|---|
|
Commands
Basic conversion/editing
- pass: Directly pass input to output without any processing, useful for converting and attribute setting
Information about sequences
- view: View biological sequences, colored by base / amino acid, or by sequence quality
- count: Count all records in the input (total or categorized by variables/functions)
- stat: Return per-sequence statistics as tab delimited list
Subset/reorganize/compare
- sort: Sort records by sequence or any other criterion
- unique: De-replicate by sequence and/or other properties, returning only unique records
- filter: Keep/exclude sequences based on different properties with a mathematical (JavaScript) expression
- split: Distribute sequences into multiple files based on a variable/function or advanced expression
- cmp: Compare two input files/streams
- sample: Get a random subset of sequences; either a fixed number or an approximate fraction of the input
- slice: Return a range of sequence records from the input
- head: Return the first N sequences
- tail: Return the last N sequences
- interleave: Interleave records of all files in the input
Search and replace
- find: Search for pattern(s) in sequences or sequene headers for record filtering, pattern replacement or passing hits to next command
- replace: Fast and simple pattern replacement in sequences or headers
Modifying commands
- del: Delete header ID/description and/or attributes
- set: Replace the header, header attributes or sequence with new content
- trim: Trim sequences on the left and/or right (single range) or extract and concatenate several ranges
- mask: Soft or hard mask sequence ranges
- upper: Convert sequences to uppercase
- lower: Convert sequences to lowercase
- revcomp: Reverse complements DNA or RNA sequences
- concat: Concatenates sequences/alignments from different files
Comparison with other tools
There are other tools with a similar focus such as Seqtk, SeqKit, the FASTX-Toolkit, as well as the more specialized USEARCH and VSEARCH offering some of the functions as well.
Seqtool performs well compared to these tools on a selection of diverse tasks:
License
Either MIT or Apache-2.0 at your choice