An Open Software Development-based Ecosystem of R Packages for Proteomics Data Analysis


Author(s): Laurent Gatto, RforMassSpectrometry contributors

Affiliation(s): de Duve Institute, UCLouvain, Belgium



A frequent problem with scientific research software is the lack of support, maintenance and further development. In particular, development by a single researcher can easily result in orphaned and dysfunctional software packages, especially if combined with poor documentation, missing unit tests or lack of adherence to open software development standards. The RforMassSpectrometry (https://www.rformassspectrometry.org/) initiative aims to develop an efficient, scalable, and stable infrastructure for mass spectrometry (MS) based proteomics (Gatto et al. poster) and metabolomics (Rainer et al. poster) data analysis. As part of this initiative, a growing ecosystem of R software packages is being developed, covering different aspects of metabolomics and proteomics data analysis. To avoid the aforementioned problems, community contributions are fostered, and open development, documentation and long-term support are emphasised. At the heart of the package ecosystem lies the *Spectra* package, which provides the core infrastructure to handle, process and visualise MS data. Its design allows easy expansion to support existing and new file or data formats, including data representations with minimal memory footprint or remote data access. For proteomics data analysis, two packages in particular are dedicated to the analysis of quantitative and identification data. The *PSMatch* package handles and manages peptide identification data. It also provides functions to model and visualise peptide-protein relations to make informed decisions about shared peptide filtering. The package also provides functions to calculate and visualise MS2 fragment ions, in conjunction with the *Spectra* package. The *QFeatures* package is the workhorse for quantitative proteomics data.
It builds on the *SummarizedExperiment* and *MultiAssayExperiment* infrastructure and provides a familiar Bioconductor user experience to manage bulk and single-cell quantitative data across different assay levels (such as peptide spectrum matches, peptides and proteins) in a coherent and tractable way. These three packages rely on *MsCoreUtils* for efficient implementations of commonly used algorithms, designed to be re-used by other R packages. In contrast to a monolithic software design, the RforMassSpectrometry ecosystem enables users to build customised, modular, and reproducible analysis workflows. Future proteomics-related development will focus on improved data structures and analysis methods, better support for third-party data import, and better interoperability with other open-source software, including direct integration with Python MS libraries.