awesome-proteomics

A curated list of awesome Proteomics software, resources, and libraries. Mostly command line based, and free or open-source. Please feel free to contribute!

Proteomics
Proteomics Databases
- Mass Spectrometry Databases
- Protein Databases
Proteomics Tools
- Database search algorithms
- MS/MS peptide quantification

Proteomics

Proteomics is the study of proteins in biological systems. It involves the large-scale study of proteins, their structure and physiological role or functions. The word proteome is actually a combination of protein and genome and was coined by Mark Wilkins in 1994. The proteome is a broad term that also encompasses the alterations or modifications produced in native protein when organisms are subjected to a plethora of changes.

Proteomics Databases

Mass Spectrometry Databases

ProteomeXchange The ProteomeXchange Consortium provide globally coordinated standard data submission and dissemination pipelines involving the main proteomics repositories, and encourage open data policies in the field. It provide the ability to search for datasets in all participating PX resources at once. The submitted dataset files remain in the receiving resources but are then linked from ProteomeCentral.
omicsDI The Omics Discovery Index (OmicsDI) is an integrated and open source platform facilitating the access and dissemination of omics datasets. It provides a unique infrastructure to integrate datasets coming from multiple omics studies, including at present proteomics, genomics, transcriptomics and metabolomics. It stores metadata coming from the public datasets from every resource using an efficient indexing system, which is able to integrate different biological entities including genes, proteins and metabolites with the relevant life science literature.
PRIDE The PRIDE PRoteomics IDEntifications (PRIDE) Archive database is a centralized, standards compliant, public data repository for mass spectrometry proteomics data, including protein and peptide identifications and the corresponding expression values, post-translational modifications and supporting mass spectra evidence.
MassIVE MassIVE is a community resource developed by the NIH-funded Center for Computational Mass Spectrometry to promote the global, free exchange of mass spectrometry data. It is integrated into a flexible, scalable, user-friendly framework offering a diverse set of data analysis algorithms.
jPOSTdb jPOSTdb (Japan ProteOme STandard DataBase) is a database containing re-analysis results with unified criteria for proteome data from jPOSTrepo. It provides viewers showing the frequency of detected post-translational modifications, the co-occurrence of phosphorylation sites on a peptide and peptide sharing among proteoforms.
iProX iProX is an integrated proteome resources center in China, which is built to accelerate the worldwide data sharing in proteomics. iProX is composed of a data submission system and a proteome database.
Peptide Atlas PeptideAtlas is a multi-organism, publicly accessible compendium of peptides identified in a large set of tandem mass spectrometry proteomics experiments. Mass spectrometer output files are collected for human, mouse, yeast, and several other organisms, and searched using the latest search engines and protein sequences.
ProteomicsDB ProteomicsDB is a protein-centric in-memory database for the exploration of large collections of quantitative mass spectrometry-based proteomics data. It is dedicated to expedite the identification of the human proteome and its use across the scientific community.
Panorma Panorama is a freely-available, open-source repository server application for targeted mass spectrometry assays that integrates into a Skyline mass spec workflow. It has been implemented as a module within LabKey Server, an open-source bioinformatics data management platform with extensive support for proteomics and small molecule data and a security model rich enough to support clinical studies.

Protein Databases

UniProt The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt Knowledgebase, the centrepiece of the UniProt Consortium’s activities, is an expertly and richly curated protein database, consisting of two sections called UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot contains high-quality manually annotated and non-redundant protein sequence records. UniProtKB/TrEMBL contains high-quality computationally analysed records enriched with automatic annotation and classification. Three UniRef databases – UniRef100, UniRef90 and UniRef50 – merge sequences automatically across species. UniRef100 is based on all UniProtKB records. It also contains selected UniParc records, including Ensembl protein translations from chicken, cow, dog, fly, Fugu, human, mouse, rat, Tetraodon, Xenopus and zebrafish. UniParc is designed to capture all publicly available protein sequence data and contains all the protein sequences from the main publicly available protein sequence databases. This makes UniParc the most comprehensive publicly accessible non-redundant protein sequence database.
neXtProt neXtProt is a comprehensive human-centric discovery platform, offering its users a seamless integration of and navigation through protein-related data. It is a knowledgebase focusing exclusively on human proteins, leverages the expert manual annotation carried out at specialist resources and in-house to provide a single point of reference. Information concerning human protein function, cellular localization, tissular expression, interactions, variants and their phenotypic effect, post-translational modifications (PTMs), as well as peptide identified in mass spectrometry experiments and epitopes recognized by antibodies have been integrated from a number of resources. The current neXtProt release was built using human genome assembly GRCh38 and UniProtKB. The data from UniProtKB is currently supplemented with data from Bgee, HPA, PeptideAtlas, SRMAtlas, GOA, dbSNP, Ensembl, COSMIC, DKF GFP-cDNA localization, Weizmann Institute of Science’s Kahn Dynamic Proteomics Database, IntAct, GlyConnect, gnomAD, as well as in-house curated data.
Human Protein Atlas The Human Protein Atlas is a Swedish-based program initiated in 2003 with the aim to map all the human proteins in cells, tissues, and organs using an integration of various omics technologies, including antibody-based imaging, mass spectrometry-based proteomics, transcriptomics, and systems biology. The Human Protein Atlas consists of ten separate sections, each focusing on a particular aspect of the genome-wide analysis of the human proteins. The Tissue section, showing the distribution of the proteins across all major tissues and organs in the human body. The Brain section, exploring the distribution of proteins in various regions of the mammalian brain. The Single Cell Type section, showing expression of protein-coding genes in single human cell types based on scRNA-seq. The Tissue Cell Type section, showing expression of protein-coding genes in human cell types based on bulk RNAseq data. The Pathology section, showing the impact of protein levels for the survival of patients with cancer. The Immune Cell section, showing expression of protein-coding genes in immune cell types. The Blood Protein section, describing proteins detected in blood and proteins secreted by human tissues. The Subcellular section, showing the subcellular localization of proteins in single cells. The Cell Line section, showing expression of protein-coding genes in human cell lines. The Metabolic section, exploring expression of protein-coding genes in the context of the human metabolic network.
NCBI-RefSeq The Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and diversity studies. They provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses. RefSeq genomes are copies of selected assembled genomes available in GenBank. RefSeq transcript and protein records are generated by several processes including: Computation, Manual curation, Propagation from annotated genomes that are submitted to members of the International Nucleotide Sequence Database Collaboration (INSDC).

Proteomics Tools

Database search algorithms

OMSSA (2004) Purpose - Efficient MS/MS search algorithm Advantage - Specificity is calculated by a classic probability score using an explicit model for matching experimental spectra to sequences Publication - Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH. Open mass spectrometry search algorithm. Journal of proteome research. 2004 Oct 11;3(5):958-64. PMID: 15473683 License - Free Links - Website - ftp://ftp.ncbi.nih.gov/pub/lewisg/omssa/
MassWiz Purpose - MS/MS search algorithm Advantage - A novel empirical scoring function that gives appropriate weights to major ions, continuity of b-y ions, intensities, and the supporting neutral losses based on the instrument type Publication - Yadav AK, Kumar D, Dash D. MassWiz: a novel scoring algorithm with target-decoy based analysis pipeline for tandem mass spectrometry. Journal of proteome research. 2011 May 6;10(5):2154-60. License - Free Links - Website - https://sourceforge.net/projects/masswiz/
InsPecT
X!Tandem
Mascot
SEQUEST
MSFragger (2017) Purpose - Peptide identification in mass spectrometry-based proteomics. Advantage - Fragment-ion indexing method enabling high speed search Publication - 28394336 License - Free Links - Website - https://msfragger.nesvilab.org/ GitHub - https://github.com/Nesvilab/MSFragger
PGA (2016) Purpose - R package for novel peptides identification using RNA-Seq Advantage - Customized protein database and can generate an HTML-based report with a visualized interface Publication - 27316337 License - Free Links - Website - http://wenbostar.github.io/PGA/, https://bioconductor.org/packages/PGA GitHub - https://github.com/wenbostar/PGA
MS-GF+ (2014) Purpose - Database search tool for proteomics Advantage - Identifies more PSMs as compared with Mascot, X!Tandem, OMSSA, Crux, Comet and InsPecT and integrated in various pipelines Publication - 25358478 License - Free Links - Website - https://omics.pnl.gov/software/ms-gf GitHub - https://github.com/MSGFPlus/msgfplus
Open-pFind (2018) Purpose - Sequence-tag-based search engine to identify peptides Advantage - Improves peptide and protein identification by matching MS/MS spectra to a substantially expanded search space Publication - 30295672 License - Free (License required for installation) Links - Website - http://pfind.ict.ac.cn/software/pFind3 GitHub - https://github.com/pFindStudio/pFind3

MS/MS peptide quantification

MaxQuant MaxQuant is a proteomics software package designed for analyzing large mass-spectrometric data sets using Andromeda search engine. It supports several labeling techniques as well as label-free quantification.
MetaMorpheus MetaMorpheus is a bottom-up proteomics database search software with integrated post-translational modification (PTM) discovery capability.
msInspect
OpenMS/TOPP
pView 2
mzMine 2
Census

R for Mass Spectrometry

The aim of the R for Mass Spectrometry initiative is to provide efficient, thoroughly documented (with vignettes and longer tutorials), tested and flexible R software for the analysis and interpretation of high throughput mass spectrometry assays, including proteomics and metabolomics experiments. All packages are shared under permissible open source licences.

Here is a list of the core packages:

The Spectra package provides base classes and processing methods for raw mass spectrometry data. It is designed with efficiency, both in terms of memory footprint and processing time in mind, and can manage data in different types of formats.
The QFeatures package offers the infrastructure to manage and process quantitative features for high-throughput mass spectrometry assays, including proteomics and metabolomics experiments.
The MsCoreUtils package defines low-level functions for mass spectrometry data processing and is independent of any high-level data structures.
The PSMatch package package to load, process and analyse PSMs (Peptide-Spectrum Matches).
The MsExperiment package provides the infrastructure to store and manage all aspects related to a complete proteomics or metabolomics mass spectrometry experiment. It relies on the other R for Mass Spectrometry core packages for the data crunching.

Most of the R for Mass Spectrometry packages are also available on Bioconductor, and can be installed with BiocManager::install(). The code below illustrates this for the Spectra package:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Spectra")

For questions about specific software or their usage, please refer to the software’s github issue page, or use the Bioconductor support site.