ProForma (Proteoform and Peptidoform Notation)
September 26, 2025
Protein and peptide sequences are usually represented as strings of amino acids written in the well-known one-letter code endorsed by the IUPAC. However, there is still no clear consensus on how to represent ‘proteoforms’ and ‘peptidoforms’, meaning all possible variations of a protein or peptide sequence, including protein modifications, whether artefactual or post-translational (PTMs). There are multiple ways of encoding mass modifications, and extended discussion has taken place to reach a consensus. The community therefore requires a standard notation for proteoforms and peptidoforms that can be embedded in many relevant PSI (and potentially other) file formats.
The PSI has developed a format called PEFF (PSI Extended FASTA Format) that can be used to represent proteoforms. Additionally, the Consortium for Top Down Proteomics (CTDP) developed a notation called ProForma v1, which aims to represent proteoforms.
This format specification represents the consensus for the standard representation of proteoforms and peptidoforms. This notation aims to support the main proteomics approaches, including bottom-up (focused on peptides/peptidoforms) and top-down (focused on proteins/proteoforms) approaches.
Use cases supported (with examples)
The ProForma notation is a string of characters that linearly represents one or more peptidoform/proteoform primary structures, with the possibility of linking peptidic chains together. It is not meant to represent secondary or tertiary structures.
Canonical IUPAC amino acids and ambiguous/unusual amino acids
EMEVEESPEK
VAEJNPSNGGTT (J indicates either I or L)
PTMs using common ontologies or controlled vocabularies (e.g. Unimod, PSI-MOD, and RESID)
EM[Oxidation]EVEES[UNIMOD:21]PEK
EM[L-methionine sulfoxide]EVEES[MOD:00046]PEK
EM[R:L-methionine (R)-sulfoxide]EVEES[RESID:AA0037]PEK
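As an illustration of how residue-level tags attach to the preceding amino acid, the following is a minimal sketch of a tokenizer. It is not a conforming ProForma parser: the function name `tokenize` is hypothetical, and the sketch assumes only uppercase one-letter residues followed by zero or more square-bracket tags (no terminal, labile, global, or range syntax).

```python
def tokenize(proforma: str) -> list[tuple[str, list[str]]]:
    """Split a simple ProForma string into (residue, [tags]) pairs.

    Assumption: the input contains only uppercase residue letters and
    square-bracket tags that follow a residue.
    """
    tokens: list[tuple[str, list[str]]] = []
    i = 0
    while i < len(proforma):
        ch = proforma[i]
        if ch.isalpha() and ch.isupper():
            tokens.append((ch, []))
            i += 1
        elif ch == "[":
            if not tokens:
                raise ValueError("a tag before any residue is not handled in this sketch")
            # Scan to the matching ']' while tolerating nested brackets.
            depth, j = 1, i + 1
            while j < len(proforma) and depth:
                if proforma[j] == "[":
                    depth += 1
                elif proforma[j] == "]":
                    depth -= 1
                j += 1
            if depth:
                raise ValueError("unbalanced '['")
            tokens[-1][1].append(proforma[i + 1 : j - 1])
            i = j
        else:
            raise ValueError(f"unsupported character {ch!r} at position {i}")
    return tokens


if __name__ == "__main__":
    for residue, tags in tokenize("EM[Oxidation]EVEES[UNIMOD:21]PEK"):
        print(residue, tags)
```

Keeping each tag as an uninterpreted string leaves the choice of vocabulary (Unimod, PSI-MOD, RESID) to a later lookup step.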
Cross-linkers using the XL-MOD ontology
EMEVTK[XLMOD:02001#XL1]SESPEK[#XL1]
EVTSEKC[L-cystine (cross-link)#XL1]LEMSC[#XL1]EFD
Glycans using the Glycan Naming Ontology (GNO)
YPVLN[GNO:G62765YT]VTMPN[GNO:G02815KT]NSNGKFDK
Arbitrary mass shifts and unknown mass gaps
EM[+15.9949]EVEES[-79.9663]PEK
RTAAX[+367.0537]WT
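To make the arithmetic behind signed mass shifts concrete, here is a small sketch that sums every tag consisting solely of a signed number. The names `MASS_SHIFT` and `total_mass_shift` are hypothetical, and the sketch deliberately ignores multi-valued tags and named modifications.

```python
import re

# Matches a tag that contains only a signed decimal number, e.g. [+15.9949].
MASS_SHIFT = re.compile(r"\[([+-]\d+(?:\.\d+)?)\]")


def total_mass_shift(proforma: str) -> float:
    """Return the sum of all purely numeric mass-shift tags in the string."""
    return sum((float(value) for value in MASS_SHIFT.findall(proforma)), 0.0)


if __name__ == "__main__":
    # 15.9949 - 79.9663, up to floating-point rounding
    print(total_mass_shift("EM[+15.9949]EVEES[-79.9663]PEK"))
```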
Elemental formulas and Glycan compositions
SEQUEN[Formula:C12H20O2]CE
SEQUEN[Glycan:HexNAc1Hex 2]CE
Terminal and Labile Modifications
[iTRAQ4plex]-EMEVNESPEK-[Methyl]
{Glycan:Hex}EMEVNESPEK
Ambiguity of modification position (completely unlocalised, n possible sites, or a range of sites)
[Phospho]?EMEVTSESPEK
EMEVT[#g1]S[#g1]ES[Phospho#g1]PEK
PROT(EOSFORMS)[+19.0523]ISK
Global modifications (e.g. isotopic labeling or fixed protein modifications)
<13C>ATPEILTVNSIGQLK
<[S-carboxamidomethyl-L-cysteine]@C>ATPEILTCNSIGCLK
Additional user-supplied information and multi-valued tags
ELV[info:AnyString]IS
ELV[+11.9784|info:suspected frobinylation]IS
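The multi-valued tag shown above separates its values with a pipe character. A minimal sketch of splitting such a tag and pulling out the free-text `info:` values might look like this; `split_tag` is a hypothetical helper, and it assumes the `info:` prefix is spelled exactly as in the examples.

```python
def split_tag(tag: str) -> dict[str, list[str]]:
    """Split the inside of one bracket tag on '|' and separate info: values.

    Assumption: the caller has already stripped the surrounding brackets.
    """
    result: dict[str, list[str]] = {"info": [], "other": []}
    for part in tag.split("|"):
        part = part.strip()
        if part.startswith("info:"):
            result["info"].append(part[len("info:"):])
        else:
            result["other"].append(part)
    return result


if __name__ == "__main__":
    print(split_tag("+11.9784|info:suspected frobinylation"))
```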
Defined charges or charge carriers (ProForma extension, see section 7.1)
VAEINPSNGGTT/2
Chimeric spectra (ProForma extension, see section 7.2)
VAEINPSNGGTT+FNEKFKGGKATJ
[iTRAQ4plex]-EMEVNESPEK-[Methyl]+[Phospho]?EMEVTSESPEK
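Because `+` also appears inside mass-shift tags such as `[+15.9949]`, a chimeric string cannot simply be split on every plus sign. The sketch below splits only on `+` characters found outside square and curly brackets. The function name `split_chimeric` is hypothetical, and tracking only these two bracket types is a simplifying assumption of this example.

```python
def split_chimeric(proforma: str) -> list[str]:
    """Split a chimeric string into its peptidoforms on top-level '+' only."""
    parts: list[str] = []
    current: list[str] = []
    depth = 0  # nesting depth inside [...] or {...}
    for ch in proforma:
        if ch in "[{":
            depth += 1
        elif ch in "]}":
            depth -= 1
        if ch == "+" and depth == 0:
            # A '+' outside any tag separates two peptidoforms.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


if __name__ == "__main__":
    print(split_chimeric("VAEINPSNGGTT+FNEKFKGGKATJ"))
    print(split_chimeric("[iTRAQ4plex]-EMEVNESPEK-[Methyl]+[Phospho]?EMEVTSESPEK"))
```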