pygscatalog

December 10, 2025

This repository contains Python applications and libraries for working with polygenic scores (PGS 🧬) and the PGS Catalog, an open database of polygenic scores and the relevant metadata required for accurate application and evaluation. It is based on a previous codebase of utilities (pgscatalog_utils) that has been converted to namespace packages for modularity and re-use.

User applications

These CLI applications are used internally by the PGS Catalog Calculator (pgsc_calc) workflow for calculating PGS and performing common adjustments for genetic ancestry.

If you want an automatic method of calculating PGS, including genetic ancestry similarity estimation and PGS normalisation, the workflow is the easiest method.

Tip

If you want to use all of the applications listed below, you can install the pgscatalog-utils package with pip or bioconda.

| Application | Description | Install | Link |
| --- | --- | --- | --- |
| pgscatalog-download | Download scoring files from the PGS Catalog in specific genome builds | pipx install pgscatalog.core | README |
| pgscatalog-format | Format scoring files into a consistent schema | pipx install pgscatalog.core | README |
| pgscatalog-relabel | Relabel values in a column based on values in a column in another file | pipx install pgscatalog.core | README |
| pgscatalog-match | Match structured scoring file to variants in target genomes | pipx install pgscatalog.match | README |
| pgscatalog-matchmerge | Merge variant match results, useful on larger datasets | pipx install pgscatalog.match | README |
| pgscatalog-intersect | Match variants across two different variant information files (e.g. reference & target genomes) | pipx install pgscatalog.match | README |
| pgscatalog-aggregate | Aggregate calculated PGS split across multiple files | pipx install pgscatalog.calc | README |
| pgscatalog-ancestry-adjust | Adjust calculated PGS in the context of genetic ancestry | pipx install pgscatalog.calc | README |
| pgsc_calc load | Query an indexed VCF/BGEN and create a zarr zip archive (pre-release) | pipx install pgscatalog.calc | README |
| pgsc_calc score | Calculate polygenic scores from zarr zip archives (pre-release) | pipx install pgscatalog.calc | README |
| pgscatalog-validate | Check if the scoring files match the PGS Catalog scoring file format | pipx install pgscatalog.validate | README |

Developers

Getting started with pgscatalog.utils

The pgscatalog.utils package is set up as a uv workspace. A workspace is a way to manage several related packages that have common dependencies.

$ tree -L 3
pgscatalog.utils
├── LICENSE
├── README.md
├── docker
│   ├── build.Dockerfile
│   └── dev.Dockerfile
├── noxfile.py
├── packages
│   ├── pgscatalog.calc
│   │   ├── LICENSE
│   │   ├── README.md
│   │   ├── pyproject.toml
│   │   ├── src
│   │   └── tests
│   ├── pgscatalog.core
│   │   ├── CHANGELOG.md
│   │   ├── LICENSE
│   │   ├── README.md
│   │   ├── pyproject.toml
│   │   ├── src
│   │   └── tests
│   └── pgscatalog.match
│       ├── CHANGELOG.md
│       ├── LICENSE
│       ├── README.md
│       ├── poetry.toml
│       ├── pyproject.toml
│       ├── src
│       └── tests
├── pyproject.toml
├── src
│   └── pgscatalog
│       └── utils
├── tests
│   └── test_utils.py
└── uv.lock

There are four Python packages in total:

  • pgscatalog.core (a uv subpackage)
  • pgscatalog.match (a uv subpackage)
  • pgscatalog.calc (a uv subpackage)
  • pgscatalog.utils (the root package in the uv workspace)

nox is set up to automate common development tasks. The GitHub Actions workflows use nox as the main entrypoint for most checks.

You'll need to install uv too.

Creating a development environment

$ cd pgscatalog.utils
$ nox -s dev

This will create a .venv directory in the pgscatalog.utils folder. In a uv workspace only one venv and one lockfile exist (in the root package).

Packages in a workspace are installed in editable mode to simplify development: make changes and run with uv, with no need to reinstall any packages.

To get started with development it's simplest to set up your IDE with the created virtual environment.

To run CLI applications you can also use uv:

$ uv run pgscatalog-download --help

Running tests

The test suite is run against every supported Python version automatically:

$ cd pgscatalog.utils
$ nox -s tests -- pgscatalog.core

If no positional arguments are set (e.g. nox -s tests) pgscatalog.utils is tested, but testing the root package isn't very helpful.
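
nox passes everything after `--` to the session as `session.posargs`. The fallback described above could be sketched as follows. This is a minimal illustration, not the repository's noxfile: `resolve_target` and `pytest_command` are hypothetical names, and using `uv run --package` to invoke pytest is an assumption.

```python
def resolve_target(posargs, default="pgscatalog.utils"):
    """Pick the package to act on: the first positional argument, else the root package."""
    return posargs[0] if posargs else default


def pytest_command(posargs):
    """Build the command a session might run (assumed invocation, for illustration only)."""
    package = resolve_target(posargs)
    return ["uv", "run", "--package", package, "pytest"]


# Equivalent to `nox -s tests -- pgscatalog.core`
print(pytest_command(["pgscatalog.core"]))

# Equivalent to `nox -s tests`, which falls back to the root package
print(pytest_command([]))
```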

Linting packages

$ cd pgscatalog.utils
$ nox -s lint -- pgscatalog.core

Building packages

$ cd pgscatalog.utils
$ nox -s build -- pgscatalog.core

The build artefacts will be in dist/.

Libraries

If you write Python code to work with PGS, the underlying libraries for the apps are documented and available for re-use:

| Library | Description | Link |
| --- | --- | --- |
| pgscatalog.core | Core classes and functions to work with PGS data | API reference |
| pgscatalog.match | Variant matching across scoring files and target genomes | API reference |
| pgscatalog.calc | Genetic ancestry similarity estimation and normalisation | API reference |

Documentation

Full documentation for the applications and libraries is available at https://pygscatalog.readthedocs.io/.

Credits & Licence

pygscatalog (aka pgscatalog_utils) is developed as part of the PGS Catalog project, a collaboration between the University of Cambridge’s Department of Public Health and Primary Care (Michael Inouye, Samuel Lambert) and the European Bioinformatics Institute (Helen Parkinson, Laura Harris).

This package contains code libraries and apps for working with PGS Catalog data and calculating PGS within the PGS Catalog Calculator (pgsc_calc) workflow, and is based on an earlier codebase (pgscatalog_utils) with contributions and input from members of the PGS Catalog team (Samuel Lambert, Benjamin Wingfield, Aoife McMahon, Laurent Gil) and the Inouye lab (Rodrigo Canovas, Scott Ritchie, Jingqin Wu).

If you use this package or the PGS Catalog Calculator (pgsc_calc) workflow, we ask you to cite our paper describing the software and the updated PGS Catalog resource:

  • Lambert, Wingfield et al. (2024) Enhancing the Polygenic Score Catalog with tools for score calculation and ancestry normalization. Nature Genetics. doi:10.1038/s41588-024-01937-x.

All of our code is open source and permissively licensed under the Apache License 2.0.

This work has received funding from EMBL-EBI core funds, the Baker Institute, the University of Cambridge, Health Data Research UK (HDRUK), and the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101016775 INTERVENE.