Prune Graph

March 17, 2026 · View on GitHub

Fast pruning of weighted graphs with optional filtering.

Instalation

Clone the repository:

$ git clone https://github.com/fgvieira/prune_graph.git

Compile:

$ cargo build --release

And run the tests:

$ cargo test

Usage

$ ./target/release/prune_graph --in input.tsv --out out.keep

or:

$ zcat input.tsv.gz | ./target/release/prune_graph > out.keep

To plot the graph (optional)

$ cat out.dot | dot -Tsvg > out.svg

If you want to get a full list of option, just run:

$ ./target/release/prune_graph --help

As input, you need a TSV file (with or without header) with, at least, three columns. The first two columns must be the node names (defining an edge), and an additional column with the edge's weight (can be specified with --weight).

Transform input data

If you want to transform input data, you can use any CSV manipulation tool (e.g. Miller or CSVtk). For example, to use absolute values on column 5:

$ cat test/example.tsv | mlr --tsv --implicit-csv-header put '\$5 = abs(\$5) then fill-empty' | ./target/release/prune_graph --header [...]

Filter edges

To filter edges, you can use option --filter with any expression supported by fasteval. For example, to use column 7 as weight and only consider edges > 0.2:

$ cat test/example.tsv | ./target/release/prune_graph --weight "column_7" --filter "column_7 > 0.2" --out out.keep

or, if also wanted to filter on column_3 > 1000:

$ cat test/example.tsv | ./target/release/prune_graph --weight "column_7" --filter "column_3 > 1000 && column_7 > 0.2" --out out.keep

or, if you want to only use 0.1 < weight > 0.2:

$ cat test/example.tsv | ./target/release/prune_graph --weight "column_7" --filter "column_3 > 1000 && (column_7 < 0.1 || column_7 > 0.2)" --out out.keep

Output

The output will be a list of the remaining nodes after pruning. Optionally, you can also get a list of the nodes that were removed (--out-excl).

Performance

Due to the way prune_graph is parallelized, its performance is strongly dependent on the degree of connectivity of the graph (see examples below).

Random function

shuf_seed () {
    SEED=\$1; shift
    shuf --random-source <(openssl enc -aes-256-ctr -pass pass:$SEED -nosalt </dev/zero 2>/dev/null) $@
}

Example 1 - Single component, highly connected

Graph with fewer nodes than edges.

Code

$ N_NODES=100000
$ N_EDGES=5000000
$ seq --equal-width 1 $N_NODES | xargs printf "node_%s\n" > /tmp/nodes.rnd
$ paste <(shuf_seed 123 --repeat --head-count $N_EDGES /tmp/nodes.rnd) <(shuf_seed 456 --repeat --head-count $N_EDGES /tmp/nodes.rnd) | awk '{OFS="\t"; print \$0,rand()}' > example_1compH.tsv

Example 2 - Single component, loosely connected

Graph similar to LD in a chromosome, where each node/snp has (on average) $AVG_N_FWD_EDGES edges with downstream node/snp(s).

Code

$ N_NODES=1000000
$ AVG_N_FWD_EDGES=10
$ seq --equal-width 1 $N_NODES | xargs printf "node_%s\n" | perl -se 'srand(12345); @n = <>; chomp(@n); for ($i=0; $i <= $#n; $i++){ for ($j=$i+1; $j <= $#n; $j++) {print(join("\t", @{[$n[$i], $n[$j], sprintf("%.3f", rand())]})."\n"); last if rand() < 1/$n}}' -- -n=$AVG_N_FWD_EDGES > example_1compL.tsv

Example 3 - Large number of components

Graph with more nodes than edges.

Code

$ N_NODES=10000000
$ N_EDGES=5000000
$ seq --equal-width 1 $N_NODES | xargs printf "node_%s\n" > /tmp/nodes.rnd
$ paste <(shuf_seed 123 --repeat --head-count $N_EDGES /tmp/nodes.rnd) <(shuf_seed 456 --repeat --head-count $N_EDGES /tmp/nodes.rnd) | awk '{OFS="\t"; print \$0,rand()}' > example_large.tsv

Speed-up overview

The speed-up is measured as the average processing speed (nodes / s) over the first 20 iterations, using default settings (unless specified otherwise).

	Example 1	Example 1	Example 2	Example 2	Example 3	Example 3
n_threads	Mode 1	Mode 2	Mode 1	Mode 2	Mode 1	Mode 2
1	0.45	1.37	1.20	1.79	159473.62	0.18
2	0.46	2.05	1.24	1.90	167429.55	0.18
4	0.51	3.78	1.28	2.03	187083.92	0.18
6	0.54	5.48	1.28	2.02	193490.72	0.18
8	0.55	6.75	1.31	2.02	195761.61	0.18
10	0.55	8.13	1.32	2.03	199361.19	0.19
15	0.56	11.03	1.30	2.05	207449.42	0.19
20	0.57	12.95	1.31	2.12	204722.47	0.18

Very large graphs

If you are dealing with very large graphs (more than 4 billion nodes/edges), you need to compile with the large_graph feature:

$ cargo build --features large_graph --release

Since it can be quite slow to prune large graphs (specially if they made up of few but highly connected components), you might want to split your original graph into its components and prune wach one separately. First, to get a list of the components, run:

$ ./target/release/prune_graph --in example_large.tsv --weight "column_3" --filter 'column_3 < 0.5' --out-comps example_large.comps.tsv | sort | md5sum
64b755cd826a2c6876aa25ec6159ebfd  -

and press Ctrl+C when it starts the prunning. File example_large.comps.tsv will have the graph edges as a TSV file.

Then run prune_graph on groups of (e.g.) 10 components (without splitting components):

$ cat example_large.comps.tsv | awk '\$4 != prev_comp {printf "==="} {prev_comp=\$4; print}' | parallel --pipe --recend "===" --rrs -N 10 ./target/release/prune_graph --quiet --weight "column_3" | sort | md5sum
64b755cd826a2c6876aa25ec6159ebfd  -