Human-readable path compression in GFA-formatted pangenome graphs
March 2, 2026
sqz is a tool for compressing and decompressing path annotations in GFA files. It uses a modified version of the GFA format that adds Q-lines and extends W-lines so that they can refer to them.
Usage
Use
sqz compress-full <GFA-FILE> > <COMPRESSED-GFA-FILE>
to compress a GFA file and
sqz decompress <COMPRESSED-GFA-FILE> > <GFA-FILE>
for decompression. For large GFA files, take a look at
sqz compress-partial --help
Installation
Using conda/mamba
conda install -c conda-forge -c bioconda sqz
Using a precompiled binary
Use
wget https://github.com/codialab/sqz/releases/download/v0.2.0/sqz
chmod +x sqz
and make sqz available on your PATH to install it.
Building from source
sqz is written in Rust and requires a working Rust toolchain (version >= 1.74.1) to build. See here for more details.
git clone git@github.com:codialab/sqz.git
cd sqz
cargo build --release
Format
The sqz format is based on the GFA format, extended with one new type of line: the Q-line.
W-lines have been changed to allow the usage of identifiers from Q-lines.
Q Rule line
A Q-line defines part of a compressed walk that can be used as part of other compressed walks.
Required fields
| Column | Field | Type | Regexp | Description |
|---|---|---|---|---|
| 1 | RecordType | Character | Q | Record type |
| 2 | Name | String | @[!-)+-<>-~][!-~]* | Rule name |
| 3 | CompressedWalk | String | ([><][!-;=?-~]+)+ | Compressed Walk |
A Walk is defined as
<walk> ::= ( `>' | `<' <segId> )+
where <segId> corresponds either to the identifier of a segment or the
identifier of a Q-line. A valid walk must exist in the graph. The identifier of
a Q-line starts with the character @.
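As a rough illustration of the field definitions above, the following Python sketch validates a Q-line against the two regular expressions from the table and splits its compressed walk into oriented steps. It is not part of sqz; the function name `parse_q_line` is hypothetical, and the sketch assumes that fields are tab-separated, as in standard GFA.

```python
import re

# Patterns copied from the Q-line field table.
RULE_NAME = re.compile(r"@[!-)+-<>-~][!-~]*")
COMPRESSED_WALK = re.compile(r"([><][!-;=?-~]+)+")
# One step of a walk: an orientation followed by an identifier.
STEP = re.compile(r"([><])([!-;=?-~]+)")


def parse_q_line(line):
    """Split a Q-line into (rule name, list of (orientation, identifier))."""
    # Assumption: fields are tab-separated, as in standard GFA.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3 or fields[0] != "Q":
        raise ValueError("not a Q-line")
    name, walk = fields[1], fields[2]
    if not RULE_NAME.fullmatch(name):
        raise ValueError(f"invalid rule name: {name!r}")
    if not COMPRESSED_WALK.fullmatch(walk):
        raise ValueError(f"invalid compressed walk: {walk!r}")
    # Identifiers starting with '@' refer to other Q-lines,
    # all other identifiers refer to segments.
    return name, STEP.findall(walk)
```

For the Q-line of the example further down, `parse_q_line("Q\t@q1\t>s11<s12")` yields the rule name `@q1` and the two steps `(">", "s11")` and `("<", "s12")`.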
W Compressed walk line
A walk line describes an oriented walk in the graph. It is only intended for graphs without overlaps between segments. Note that W-lines cannot use jump connections (introduced in GFA v1.2).
Required fields
| Column | Field | Type | Regexp | Description |
|---|---|---|---|---|
| 1 | RecordType | Character | W | Record type |
| 2 | SampleId | String | [!-)+-<>-~][!-~]* | Sample identifier |
| 3 | HapIndex | Integer | [0-9]+ | Haplotype index |
| 4 | SeqId | String | [!-)+-<>-~][!-~]* | Sequence identifier |
| 5 | SeqStart | Integer | \*\|[0-9]+ | Optional start position |
| 6 | SeqEnd | Integer | \*\|[0-9]+ | Optional end position (BED-like, half-closed half-open) |
| 7 | CompressedWalk | String | ([><][!-;=?-~]+)+ | Compressed Walk |
For a haploid sample, HapIndex is 0. For a diploid or polyploid sample,
HapIndex starts at 1. Two W-lines with the same
(SampleId,HapIndex,SeqId) should have non-overlapping [SeqStart,SeqEnd)
intervals. A Walk is defined as
<walk> ::= ( `>' | `<' <segId> )+
where <segId> corresponds either to the identifier of a segment or the
identifier of a Q-line. A valid walk must exist in the graph.
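To make the relation between compressed walks and plain GFA walks concrete, here is a minimal Python sketch that substitutes Q-line references until only segment identifiers remain. It is an illustration, not the sqz implementation: `expand_walk` and the `rules` dictionary are hypothetical names, and the handling of a rule referenced in reverse orientation (`<@name`) is an assumption, namely that the rule is traversed back to front with every segment orientation flipped.

```python
import re

# One step of a walk: an orientation followed by an identifier.
STEP = re.compile(r"([><])([!-;=?-~]+)")
FLIP = {">": "<", "<": ">"}


def expand_walk(walk, rules):
    """Expand a compressed walk into a walk over segments only.

    `rules` maps a Q-line name (e.g. "@q1") to its compressed walk string.
    """
    out = []
    for orient, ident in STEP.findall(walk):
        if ident.startswith("@"):
            # Recursively expand the referenced rule first.
            inner = STEP.findall(expand_walk(rules[ident], rules))
            if orient == "<":
                # Assumption: a reversed rule is read back to front
                # with each segment orientation flipped.
                inner = [(FLIP[o], s) for o, s in reversed(inner)]
            out.extend(inner)
        else:
            out.append((orient, ident))
    return "".join(o + s for o, s in out)
```

With `rules = {"@q1": ">s11<s12"}`, the call `expand_walk(">@q1>s13", rules)` returns `>s11<s12>s13`. The recursion keeps the sketch short; deeply nested rules in large files would call for an iterative version.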
Example
S s11 ACCTT
S s12 TC
S s13 GATT
L s11 + s12 - 0M
L s12 - s13 + 0M
L s11 + s13 + 0M
Q @q1 >s11<s12
W NA12878 1 chr1 0 11 >@q1>s13
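The example can be checked by hand: substituting @q1 into the W-line gives the plain walk >s11<s12>s13. The Python sketch below verifies two properties of that expanded walk: that every pair of consecutive steps is backed by an L-line, and that the segment lengths add up to SeqEnd - SeqStart = 11. The helper names `walk_length` and `walk_exists` are hypothetical, and the check assumes zero-length overlaps (0M), as in the example.

```python
import re

# One step of a walk: an orientation followed by an identifier.
STEP = re.compile(r"([><])([!-;=?-~]+)")
SIGN = {">": "+", "<": "-"}
FLIP = {"+": "-", "-": "+"}


def walk_length(walk, segments):
    """Sum of segment lengths along an expanded walk (no overlaps)."""
    return sum(len(segments[seg]) for _, seg in STEP.findall(walk))


def walk_exists(walk, links):
    """Check that consecutive steps of an expanded walk are linked."""
    steps = [(seg, SIGN[o]) for o, seg in STEP.findall(walk)]
    for (a, oa), (b, ob) in zip(steps, steps[1:]):
        # An L-line also implies the reverse-complement traversal.
        forward = (a, oa, b, ob)
        reverse = (b, FLIP[ob], a, FLIP[oa])
        if forward not in links and reverse not in links:
            return False
    return True


# Data transcribed from the S- and L-lines of the example.
segments = {"s11": "ACCTT", "s12": "TC", "s13": "GATT"}
links = {
    ("s11", "+", "s12", "-"),
    ("s12", "-", "s13", "+"),
    ("s11", "+", "s13", "+"),
}
expanded = ">s11<s12>s13"  # >@q1>s13 with @q1 substituted

print(walk_length(expanded, segments))  # 11
print(walk_exists(expanded, links))     # True
```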
Citation
Peter Heringer and Daniel Doerr. Human Readable Compression of GFA Paths Using Grammar-Based Code. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 14:1-14:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025) https://doi.org/10.4230/LIPIcs.WABI.2025.14