Prune Graph
March 17, 2026 ยท View on GitHub
Fast pruning of weighted graphs with optional filtering.
Instalation
Clone the repository:
$ git clone https://github.com/fgvieira/prune_graph.git
Compile:
$ cargo build --release
And run the tests:
$ cargo test
Usage
$ ./target/release/prune_graph --in input.tsv --out out.keep
or:
$ zcat input.tsv.gz | ./target/release/prune_graph > out.keep
To plot the graph (optional)
$ cat out.dot | dot -Tsvg > out.svg
If you want to get a full list of option, just run:
$ ./target/release/prune_graph --help
Input data
As input, you need a TSV file (with or without header) with, at least, three columns. The first two columns must be the node names (defining an edge), and an additional column with the edge's weight (can be specified with --weight).
Transform input data
If you want to transform input data, you can use any CSV manipulation tool (e.g. Miller or CSVtk). For example, to use absolute values on column 5:
$ cat test/example.tsv | mlr --tsv --implicit-csv-header put '\$5 = abs(\$5) then fill-empty' | ./target/release/prune_graph --header [...]
Filter edges
To filter edges, you can use option --filter with any expression supported by fasteval. For example, to use column 7 as weight and only consider edges > 0.2:
$ cat test/example.tsv | ./target/release/prune_graph --weight "column_7" --filter "column_7 > 0.2" --out out.keep
or, if also wanted to filter on column_3 > 1000:
$ cat test/example.tsv | ./target/release/prune_graph --weight "column_7" --filter "column_3 > 1000 && column_7 > 0.2" --out out.keep
or, if you want to only use 0.1 < weight > 0.2:
$ cat test/example.tsv | ./target/release/prune_graph --weight "column_7" --filter "column_3 > 1000 && (column_7 < 0.1 || column_7 > 0.2)" --out out.keep
Output
The output will be a list of the remaining nodes after pruning. Optionally, you can also get a list of the nodes that were removed (--out-excl).
Performance
Due to the way prune_graph is parallelized, its performance is strongly dependent on the degree of connectivity of the graph (see examples below).
Random function
shuf_seed () {
SEED=\$1; shift
shuf --random-source <(openssl enc -aes-256-ctr -pass pass:$SEED -nosalt </dev/zero 2>/dev/null) $@
}
Example 1 - Single component, highly connected
Graph with fewer nodes than edges.
Code
$ N_NODES=100000
$ N_EDGES=5000000
$ seq --equal-width 1 $N_NODES | xargs printf "node_%s\n" > /tmp/nodes.rnd
$ paste <(shuf_seed 123 --repeat --head-count $N_EDGES /tmp/nodes.rnd) <(shuf_seed 456 --repeat --head-count $N_EDGES /tmp/nodes.rnd) | awk '{OFS="\t"; print \$0,rand()}' > example_1compH.tsv
Example 2 - Single component, loosely connected
Graph similar to LD in a chromosome, where each node/snp has (on average) $AVG_N_FWD_EDGES edges with downstream node/snp(s).
Code
$ N_NODES=1000000
$ AVG_N_FWD_EDGES=10
$ seq --equal-width 1 $N_NODES | xargs printf "node_%s\n" | perl -se 'srand(12345); @n = <>; chomp(@n); for ($i=0; $i <= $#n; $i++){ for ($j=$i+1; $j <= $#n; $j++) {print(join("\t", @{[$n[$i], $n[$j], sprintf("%.3f", rand())]})."\n"); last if rand() < 1/$n}}' -- -n=$AVG_N_FWD_EDGES > example_1compL.tsv
Example 3 - Large number of components
Graph with more nodes than edges.
Code
$ N_NODES=10000000
$ N_EDGES=5000000
$ seq --equal-width 1 $N_NODES | xargs printf "node_%s\n" > /tmp/nodes.rnd
$ paste <(shuf_seed 123 --repeat --head-count $N_EDGES /tmp/nodes.rnd) <(shuf_seed 456 --repeat --head-count $N_EDGES /tmp/nodes.rnd) | awk '{OFS="\t"; print \$0,rand()}' > example_large.tsv
Speed-up overview
The speed-up is measured as the average processing speed (nodes / s) over the first 20 iterations, using default settings (unless specified otherwise).
| Example 1 | Example 1 | Example 2 | Example 2 | Example 3 | Example 3 | |
|---|---|---|---|---|---|---|
| n_threads | Mode 1 | Mode 2 | Mode 1 | Mode 2 | Mode 1 | Mode 2 |
| 1 | 0.45 | 1.37 | 1.20 | 1.79 | 159473.62 | 0.18 |
| 2 | 0.46 | 2.05 | 1.24 | 1.90 | 167429.55 | 0.18 |
| 4 | 0.51 | 3.78 | 1.28 | 2.03 | 187083.92 | 0.18 |
| 6 | 0.54 | 5.48 | 1.28 | 2.02 | 193490.72 | 0.18 |
| 8 | 0.55 | 6.75 | 1.31 | 2.02 | 195761.61 | 0.18 |
| 10 | 0.55 | 8.13 | 1.32 | 2.03 | 199361.19 | 0.19 |
| 15 | 0.56 | 11.03 | 1.30 | 2.05 | 207449.42 | 0.19 |
| 20 | 0.57 | 12.95 | 1.31 | 2.12 | 204722.47 | 0.18 |
Very large graphs
If you are dealing with very large graphs (more than 4 billion nodes/edges), you need to compile with the large_graph feature:
$ cargo build --features large_graph --release
Since it can be quite slow to prune large graphs (specially if they made up of few but highly connected components), you might want to split your original graph into its components and prune wach one separately. First, to get a list of the components, run:
$ ./target/release/prune_graph --in example_large.tsv --weight "column_3" --filter 'column_3 < 0.5' --out-comps example_large.comps.tsv | sort | md5sum
64b755cd826a2c6876aa25ec6159ebfd -
and press Ctrl+C when it starts the prunning. File example_large.comps.tsv will have the graph edges as a TSV file.
Then run prune_graph on groups of (e.g.) 10 components (without splitting components):
$ cat example_large.comps.tsv | awk '\$4 != prev_comp {printf "==="} {prev_comp=\$4; print}' | parallel --pipe --recend "===" --rrs -N 10 ./target/release/prune_graph --quiet --weight "column_3" | sort | md5sum
64b755cd826a2c6876aa25ec6159ebfd -