The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.
MassSpecGym: A benchmark for the discovery and identification of molecules
MassSpecGym is a new benchmark for MS/MS data that defines three challenges (de novo molecular structure generation, molecule retrieval, and spectrum simulation), with new evaluation metrics and a generalization-demanding data split.
- Year: 2024
- Venue: arXiv 2024
- Authors: 30
Attribution
- Abstract & full text: arxiv.org/abs/2410.23326v3
- TL;DR: Semantic Scholar
Authors (30)
Fei Wang, Bo Wang, Josef Sivic, Roman Bushuiev, Anton Bushuiev, Niek F. de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils A. Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, David S. Wishart, Li-Ping Liu, Juho Rousu, Wout Bittremieux, Hannes Rost, Tytus D. Mak, Soha Hassoun, Florian Huber, Justin J. J. van der Hooft, Michael A. Stravs, Sebastian Böcker, Tomáš Pluskal