PANABYSS

June 3, 2026

Tools to manipulate and visualize pangenome variation graphs.


Overview

PanAbyss is based on a graph database model (Neo4j) built from a pangenome graph (GFA file) and annotation files (GFF or GTF). It provides the following features:

  • Load a GFA file into a Neo4j database
  • Load annotation files (GFF or GTF): this will link annotations to the pangenome.
  • Compute shared regions between a set of selected haplotypes
  • Compute the sequences of a selected region
  • Compute a global phylogenetic tree or a local phylogenetic tree from a selected region (neighbour joining with a distance matrix based on Jaccard index)
  • Visualize a region of the pangenome and its annotations
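
As an illustration of the Jaccard-based distance mentioned in the list above, the following minimal sketch computes a pairwise distance matrix from sets of node identifiers, one set per haplotype. The function names and the toy data are hypothetical and are not taken from the PanAbyss code base; the tree construction step itself (neighbour joining) is not shown.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(nodes_by_haplotype: dict) -> dict:
    """Build a symmetric distance matrix keyed by (haplotype, haplotype)."""
    matrix = {(h, h): 0.0 for h in nodes_by_haplotype}
    for h1, h2 in combinations(nodes_by_haplotype, 2):
        d = jaccard_distance(nodes_by_haplotype[h1], nodes_by_haplotype[h2])
        matrix[(h1, h2)] = d
        matrix[(h2, h1)] = d
    return matrix


# Toy example: node identifiers traversed by three hypothetical haplotypes.
toy = {
    "sampleA_1": {"n1", "n2", "n3", "n4"},
    "sampleB_1": {"n1", "n2", "n3", "n5"},
    "sampleC_1": {"n1", "n6"},
}
matrix = distance_matrix(toy)
```

Haplotypes that traverse mostly the same nodes get a distance close to 0, while haplotypes with few nodes in common get a distance close to 1.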

Figure 1 – Visualization of the TP53 gene on the HPRC pangenome (the orange path is the GRCh38 path; exons are shown in green).

Installation

Requirements

  • Docker: see the Docker documentation if it is not installed. The $USER user must be able to run Docker; if not, follow the procedure for running Docker as a non-root user:
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
  • Miniconda 3. Choose the installer corresponding to your OS: Miniconda downloads
  • Mamba: this package will be automatically installed if not present.
  • 20 GB RAM (32 GB or more recommended for large datasets). Small pangenomes can run with 8 GB RAM and 6 GB of swap or virtual memory (to do this, modify the memory configuration in the data/conf/neo4j.conf file)
  • Sufficient disk space: approximately 10 times the size of the GFA, ideally on an SSD (HDDs are about 10 times slower and are not recommended for this use)
  • The version of PanAbyss used to build the database must be compatible with the version used to visualize and analyse the data (the database version is indicated in the Stats node)
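
The disk space rule of thumb above (about 10 times the size of the GFA) can be checked before loading data. The sketch below is only an illustration: the helper names are hypothetical, the factor of 10 comes from the requirement above, and the check is not part of PanAbyss.

```python
import os
import shutil


def required_disk_bytes(gfa_paths, factor=10):
    """Estimate the disk space needed: factor x total size of the GFA files."""
    return factor * sum(os.path.getsize(path) for path in gfa_paths)


def has_enough_disk(gfa_paths, target_dir=".", factor=10):
    """Compare the estimate with the free space of the target directory."""
    free = shutil.disk_usage(target_dir).free
    return free >= required_disk_bytes(gfa_paths, factor)
```

For example, has_enough_disk(["data/gfa/my_pangenome.gfa"], target_dir="data") returns False when the free space under data is below the estimate (my_pangenome.gfa is a placeholder file name).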

Quickstart

  • Installation (only the first time):
    • Download the latest release of the PanAbyss project and unzip the archive where you want to store your data
    • Make the launcher executable (Linux/Mac):
      chmod +x launch.sh
    
    • Copy your GFA files into the /data/gfa directory.
    • Copy your annotation files into the /data/annotations directory.
  • Launch the tool:
    • Run the launcher: ./launch.sh on Linux/Mac (for server mode in production or for large pangenomes, use ./launch_gunicorn.sh instead) or launch.bat on Windows, then go to http://localhost:8050
    • To launch on another port (default is 8050), specify the port as an argument. For example, to launch on port 8051: ./launch.sh 8051 or launch.bat 8051.
  • Prepare the database (GUI):
    • Go to "DB management" page and set the container name.
    • Load GFA data: select the GFA files to load. Depending on your data:
      • A single GFA with multiple chromosomes: leave the chromosome input box associated with the file empty (PanAbyss will get the chromosome from the P/W line according to the rules described in the "Important notes" section).
      • One GFA file per chromosome: it is strongly recommended to enter the chromosome value in the input box associated with the file. Otherwise, PanAbyss will use the chromosome defined in the path/walk, but this value is not always well defined.
      • The GFA does not involve any chromosome: set the chromosome input to zero, otherwise the database will not be properly constructed.
    • Create the database: click on the Create new DB button. This step uses the CSV files generated by the previous step in the /data/import directory.
    • Load annotations by selecting the files to load (GTF or GFF) and, for each selected file, the associated haplotype. Indexes must be fully created before annotations are loaded: they are created automatically after the GFA is loaded, but this takes some time for large datasets, in which case a warning message is displayed. Once the data are loaded, the tool can be used (see "Quick pages description").

Important notes

  • To launch multiple Neo4j instances, the Neo4j ports must be changed. These ports are defined in the db_conf.json file and can be updated there.
  • On Windows, raxml-ng must be installed manually (see the raxml documentation). If it is not installed, the global phylogenetic tree cannot be computed with this method (but the neighbor joining method will still work).
  • The default memory used by the Neo4j database is defined in the data/conf/Neo4j.conf file. It requires at least 20 GB; if the system (and the Docker configuration) does not have this memory available, these values must be tuned.
  • The GFA file must be properly structured for the application to correctly identify the individual name and chromosome. We strongly recommend using W lines. Depending on the GFA format:
    • For GFA files with W lines:
      The data is typically organized as follows:

      • Column 2: Individual name
      • Column 3: Haplotype number (the individual will then be named individual_haplotype)
      • Column 4: Chromosome identifier
    • For P lines:
      The path name (pathName, in column 2) must follow one of the two formats below:

      • genome#haplotype#chromosome
      • genome.haplotype.chromosome

In all cases, if a chromosome is specified when loading the GFA file, that value will take precedence.
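
The naming rules above can be sketched as follows. This is a simplified illustration, not the PanAbyss parser: the function names are hypothetical, and naming the individual genome_haplotype for P lines is an assumption made by analogy with the W line rule.

```python
def parse_walk_line(line: str):
    """Extract (individual, chromosome) from a GFA W line."""
    fields = line.rstrip("\n").split("\t")
    # fields[0] is the record type "W", so columns 2-4 are indexes 1-3.
    sample, haplotype, chromosome = fields[1], fields[2], fields[3]
    return f"{sample}_{haplotype}", chromosome


def parse_path_line(line: str):
    """Extract (individual, chromosome) from a GFA P line."""
    path_name = line.rstrip("\n").split("\t")[1]
    # Accept "genome#haplotype#chromosome" or "genome.haplotype.chromosome".
    for separator in ("#", "."):
        parts = path_name.split(separator)
        if len(parts) == 3:
            genome, haplotype, chromosome = parts
            # Assumption: same individual naming as for W lines.
            return f"{genome}_{haplotype}", chromosome
    raise ValueError(f"Unrecognised path name: {path_name!r}")


def resolve_chromosome(parsed_chromosome, user_chromosome=None):
    """A chromosome given when loading the GFA takes precedence."""
    return user_chromosome if user_chromosome is not None else parsed_chromosome


# Toy records (hypothetical sample names).
walk = "W\tHG002\t1\tchr17\t0\t1000\t>1>2>3"
path = "P\tHG002#1#chr17\t1+,2+,3+\t*"
print(parse_walk_line(walk))
print(parse_path_line(path))
```

A path name that matches neither format raises an error in this sketch, which is one reason why entering the chromosome explicitly is safer for files with one chromosome each.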

Logs

Logs are displayed by default in the console and in log files located in the ./logs directory.

To configure logging behavior, modify the following parameters in the ./conf.json file:

  • log_retention_days — Defines the number of days to keep log files.
  • log_level — Set to DEBUG, INFO, WARNING, or ERROR to display only logs at or above the selected level.
  • log_server — Determines where logs are written:
    • "console" — Log only to the console.
    • "file" — Log only to a file.
    • "both" — Log to both the console and file (default value).
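
To make the effect of these three parameters concrete, here is a minimal sketch built on the Python standard logging module. It is not the PanAbyss implementation: the logger name, the app.log file name, the daily rotation, and the mapping of log_retention_days to the number of rotated files are all assumptions.

```python
import logging
import sys
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def build_logger(conf: dict, log_dir: str = "./logs") -> logging.Logger:
    """Configure a logger from the three logging parameters."""
    logger = logging.getLogger("panabyss_example")  # hypothetical name
    logger.setLevel(getattr(logging, conf.get("log_level", "INFO")))
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    mode = conf.get("log_server", "both")
    if mode in ("console", "both"):
        logger.addHandler(logging.StreamHandler(sys.stdout))
    if mode in ("file", "both"):
        Path(log_dir).mkdir(parents=True, exist_ok=True)
        # One file per day; older files beyond backupCount are deleted.
        handler = TimedRotatingFileHandler(
            Path(log_dir) / "app.log",  # hypothetical file name
            when="midnight",
            backupCount=int(conf.get("log_retention_days", 7)),
        )
        logger.addHandler(handler)
    return logger


logger = build_logger({"log_level": "WARNING", "log_server": "console"})
logger.info("hidden: below the WARNING threshold")
logger.warning("shown on the console")
```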

Quick pages description

The menu gives access to the following pages:

  • DB management (available only in admin mode for server mode installations): this page is used only at the start, to create the DB and load data.
  • Home page: page to visualize data (by defining the chromosome, start and stop, or the genome and gene_name / gene_id).
  • Shared region discovery: page to detect the nodes shared by a selection of haplotypes. It computes the list of identified regions, which can be exported as CSV. The sequence associated with a region can be displayed by clicking on the "size" column.
  • Phylogenetic: on the left, a reference phylogenetic tree can be loaded. On the right, clicking the "Plot tree..." button computes the tree of the region defined in the Home page.
  • Sequences: clicking the button computes the sequence of each haplotype for the region selected in the Home page.
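
The idea behind shared region discovery can be illustrated with a small sketch: keep the nodes traversed by every selected haplotype, then merge adjacent node intervals into regions. This is a deliberately simplified view with hypothetical function names and data structures; the actual PanAbyss algorithm runs on the Neo4j database and may apply different criteria.

```python
def shared_nodes(haplotypes_by_node: dict, selected: set) -> list:
    """Return the nodes traversed by every selected haplotype."""
    return [
        node
        for node, haplotypes in haplotypes_by_node.items()
        if selected <= haplotypes
    ]


def merge_into_regions(intervals, max_gap=0):
    """Merge (start, stop) intervals separated by at most max_gap."""
    regions = []
    for start, stop in sorted(intervals):
        if regions and start - regions[-1][1] <= max_gap:
            # Extend the previous region instead of opening a new one.
            regions[-1] = (regions[-1][0], max(regions[-1][1], stop))
        else:
            regions.append((start, stop))
    return regions


# Toy data: which haplotypes traverse each node (hypothetical names).
toy_nodes = {"n1": {"A", "B", "C"}, "n2": {"A", "B"}, "n3": {"A"}}
print(shared_nodes(toy_nodes, {"A", "B"}))
print(merge_into_regions([(0, 10), (10, 20), (50, 60)]))
```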

URL query: a pangenome region can be accessed directly via URL. For example:

Generate the database

  • There are 3 ways to generate the database:
    • From a dump file: this is the fastest way, but the dump must be available. It uses the neo4j-admin load functionality. If the neo4j.dump file is available, move or copy it into the ./import directory; this file will be used to generate the database.
    • From CSV files: this is a fast way to create the database if the CSV files are available. It uses the neo4j-admin import functionality. This is the default procedure when creating a database from a GFA.
    • From the GFA file (and GTF / GFF if available): the database is constructed directly from the GFA file. This procedure is slow compared to the others and is only used to add data later if needed (not recommended for large GFA files).
  • For dump or CSV files: these files must be placed in the /data/import directory before database creation. If these files exist, importing a GFA file is not required and you can directly click on the "Create new DB" button.
  • A dump file can be generated from the GUI by clicking on the Dump DB button.

Server Mode Configuration

In server mode, if the server is publicly accessible, it may be necessary to restrict access and disable administration and file upload features.

To do so, modify the configuration file (./conf.json):

  • Set the "server_mode" parameter to true.
  • If temporary administration functions are needed (for example, for the initial data upload):
    • Set the "admin_mode" parameter to true.
    • In this case, the application will prompt for a username and password to access it.
    • This login information is defined in the "admin_users" field — you should update it with the desired credentials.
    • Once the data has been loaded, set "admin_mode" back to false to allow open access for all users.
  • To launch the application, use the ./launch_gunicorn.sh script (the gunicorn server is available only for Linux/Mac).
  • To stop the application, use the ./stop_server.sh script.
  • Gunicorn logs to /logs/gunicorn; setting up log rotation for this log is recommended.
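
Switching between the temporary administration phase and open access only requires changing two values in ./conf.json. The helper below is a hypothetical convenience script, not something shipped with PanAbyss; it deliberately leaves "admin_users" untouched because the structure of that field is not described here.

```python
import json
from pathlib import Path


def set_admin_mode(conf_path, enabled: bool) -> dict:
    """Enable server mode and switch admin mode on or off in conf.json."""
    path = Path(conf_path)
    conf = json.loads(path.read_text(encoding="utf-8"))
    conf["server_mode"] = True
    conf["admin_mode"] = enabled
    # Other keys, including "admin_users", are written back unchanged.
    path.write_text(json.dumps(conf, indent=2), encoding="utf-8")
    return conf
```

A typical sequence would be set_admin_mode("./conf.json", True) before the initial data upload, then set_admin_mode("./conf.json", False) once the data are loaded. Whether the application must be restarted to pick up the change is not stated here, so restarting it is the safe assumption.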

Running PanAbyss on Low-Memory Systems

Several solutions are available to run PanAbyss on machines with limited RAM:

  • The simplest solution, especially when using an SSD, is to increase the swap size so that the total available memory (RAM + swap) is sufficient.

  • If increasing swap is not possible, the application can still run with 8 GB of RAM and 6 GB of swap/virtual memory for small pangenomes.
    In this case, before the first launch of the application, you must edit the file: ./install/conf/neo4j.conf and adjust the following parameters according to your machine resources:

    • server.memory.heap.max_size=4g
    • server.memory.pagecache.size=2g
  • The most memory-intensive step is the conversion of the pangenome into a Neo4j-compatible import format (GFA → CSV conversion). For very large pangenomes or when working on a machine with limited RAM, this preprocessing step can be executed on another machine with more resources or on a computing cluster. To do so:

    • Download PanAbyss on the remote machine
    • Place the .gfa file(s) into ./data/gfa. If there are multiple GFA files, a chromosomes_file.csv must be created in the same directory, with the GFA file name (filename) in the first column and the chromosome name associated with that GFA (chromosome) in the second column.
    • Run: ./launch.sh --generate_csv_import. This command generates the Neo4j import files inside ./data/import
    • The generated files can then be copied into the local machine's ./data/import directory
    • After that, the database can be created locally without selecting a GFA file, and the application can be used normally.
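
When several GFA files are used for the remote CSV generation, the chromosomes_file.csv described above can be written with a few lines of Python. The sketch assumes a comma-separated file with a header row named filename,chromosome; the exact delimiter and header expected by PanAbyss should be checked before use, and the GFA file names below are placeholders.

```python
import csv
from pathlib import Path


def write_chromosomes_file(gfa_dir, chromosome_by_file: dict) -> Path:
    """Write chromosomes_file.csv next to the GFA files."""
    out = Path(gfa_dir) / "chromosomes_file.csv"
    with out.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "chromosome"])  # assumed header row
        for filename, chromosome in sorted(chromosome_by_file.items()):
            writer.writerow([filename, chromosome])
    return out
```

For example, write_chromosomes_file("./data/gfa", {"chr1.gfa": "1", "chr2.gfa": "2"}) produces one line per GFA file before ./launch.sh --generate_csv_import is run.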

Parameters file

The parameter file is named ./conf.json. It contains the following parameters:

  • "container_name": set by the application, it is the name of the Docker container containing the Neo4j DB.
  • "http_port": the HTTP port for the Neo4j DB (default: 7474).
  • "bolt_port": the Bolt port for the Neo4j DB (default: 7687).
  • "login": login to access the Neo4j DB (default: "neo4j"). If this value needs to be changed, it must first be modified in the Neo4j configuration file.
  • "password": password to access the Neo4j DB (default: "Administrateur"). If this value needs to be changed, it must first be modified in Neo4j.
  • "server_mode": set to false for local installation and true for server installation. This will block administration and file upload features.
  • "admin_mode": only used in server mode. Set to false to block all administration features. If set to true, a login/password will be required to access the application.
  • "admin_users": defines the login/password to access the application when admin_mode is set to true.
  • "server_log_mode": "console", "file", or "both" — determines where logs are written.
  • "log_retention_days": number of days to keep logs (default: 7).
  • "log_level": "DEBUG", "INFO", "WARNING", "ERROR" — logs are displayed only if their level is equal to or higher than this setting.
  • "gunicorn_log_level": "DEBUG", "INFO", "WARNING", "ERROR" - log level for the gunicorn server. If not set, gunicorn will not produce any log.
  • "db_gfa_loading_batch_size": batch size used when loading the GFA. Choose it according to the available RAM: a bigger batch size is faster but consumes more memory. Default value is 2,000,000.
  • "max_nodes_from_db": set the maximum number of nodes to retrieve from the database (can be used for phylogenetic or sequence computations). Default value is 50,000.
  • "max_nodes_to_visualize": set the maximum number of nodes to visualize in the GUI. Default value is 10,000.
  • "max_gwas_store": set the maximum number of stored results for shared region discovery. Set to 0 (or not set) for no limit.
  • "max_gwas_running_inactivity_hours": set the maximum time (in hours) a running job can go without being refreshed before it is deleted.
  • "max_gwas_regions": set the maximum number of regions to visualize in shared region discovery. Set to 0 (or not set) for no limit.
  • "gwas_annotations_windows_size": set the window size for the annotation search in shared region discovery.
  • "gwas_annotations_max_attempts": set the number of attempts to search for annotations before or after a region in shared region discovery. The total size of the searched region is gwas_annotations_windows_size x gwas_annotations_max_attempts.
  • "gwas_max_running_jobs": set the limit of running gwas jobs. Set to 0 for no limit, or to -1 to deactivate the functionality.
  • "phylo_block_tree_recomputation": if set to true, the global tree cannot be recomputed (intended for server installations).
  • "viz_filter_by_flow": if set to true, when the region is too wide, the algorithm tries to reduce the number of nodes by filtering by flow (it tries to keep the near-core pangenome). Default is true, but it is time consuming.
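
The relation between gwas_annotations_windows_size and gwas_annotations_max_attempts can be illustrated with a short sketch that lists the successive windows searched before or after a position. The function name, the half-open (start, stop) convention and the values used are assumptions chosen for the example, not PanAbyss defaults.

```python
def search_windows(position, window_size, max_attempts, after=True):
    """List the (start, stop) windows tried when looking for annotations."""
    windows = []
    for attempt in range(max_attempts):
        if after:
            start = position + attempt * window_size
            stop = start + window_size
        else:
            stop = position - attempt * window_size
            if stop <= 0:
                break  # reached the start of the sequence
            start = max(0, stop - window_size)
        windows.append((start, stop))
    return windows


# With a window of 100 and 3 attempts, at most 300 positions are covered.
print(search_windows(1000, 100, 3, after=True))
print(search_windows(1000, 100, 3, after=False))
```

Increasing the number of attempts widens the search without changing the size of each individual query, which is the trade-off these two parameters expose.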

Contacts

F Graziani, M Zytnicki

Genotoul Bioinfo team, MIAT, INRAE Auzeville, France.

Licence

Panabyss is available under the GNU license.