Machine Learning-Assisted False Positive Detection in Metabolite Identification Workflows

72nd ASMS Conference on Mass Spectrometry. June 2024

Ramon Adàlia1,3; Fabien Fontaine2; Luca Morettoni2; Ismael Zamora2

1Lead Molecular Design, Sant Cugat del Vallès, Spain; 2Mass Analytica, Sant Cugat del Vallès, Spain; 3Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Spain

Abstract

Introduction

Metabolite identification is pivotal in both drug discovery and metabolomics, enabling the comprehensive analysis of small molecules within biological systems. However, the complexity inherent in mass spectrometry data often results in numerous false positive peak detections. Current methods for false positive detection rely on manual data inspection, a labor-intensive and time-consuming process that acts as a bottleneck in metabolite identification workflows.

In this study, we propose leveraging machine learning models to assist experts in identifying false positives. We demonstrate the viability of this approach by developing models that achieve high predictive performance across two distinct experimental protocols. Additionally, we utilize the SHapley Additive exPlanations (SHAP) method to analyze feature importances, offering insights into the primary factors driving predictions.

Methods

Metabolite identification data was gathered from public repositories and processed using specialized software to automatically identify corresponding metabolites. A field expert later examined the results, manually categorizing them as either true or false positives.

Features were extracted for each metabolite peak from chromatographic peak data, mass spectrometry data and, where applicable, kinetic data. The dataset was then divided into training and test sets with an 80/20 split. Gradient-boosted decision tree (GBDT) models were trained on the training set, with hyperparameters tuned by random search and 5-fold cross-validation. The best-performing model was evaluated on the test set. This entire process was repeated for each of the two experimental protocols present in the data.
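The workflow described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn's GradientBoostingClassifier as the GBDT, uses a synthetic feature table in place of real peak features, and picks an arbitrary hyperparameter grid and scoring metric (F1). The split here is a simple stratified random split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the peak feature table (label 1 = false positive).
# Roughly two thirds of the peaks are labelled as false positives.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6,
    weights=[0.33, 0.67], random_state=0,
)

# 80/20 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Assumed search space; the abstract does not list the tuned hyperparameters.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "subsample": [0.6, 0.8, 1.0],
}

# Random search with 5-fold cross-validation on the training set only.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=5,
    scoring="f1",
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the best model once on the held-out test set.
y_pred = search.best_estimator_.predict(X_test)
test_recall = recall_score(y_test, y_pred)
test_precision = precision_score(y_test, y_pred)
```

Under these assumptions, the same script would be run once per experimental protocol, each with its own feature table.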

Preliminary data

Two distinct experimental protocols were present in the data. The primary difference between them lay in the number of incubation time points: one protocol involved 5 time points while the other featured just 1. Because the analysis of the data differs substantially between the two, with the 5-time-point protocol in particular requiring kinetic data, a separate model was developed for each protocol.

For experiments following the 1-time-point protocol, a total of 2,570 metabolite peaks were examined, with 1,703 determined as false positives (66.26%). The primary objective of these experiments was the identification of glutathione conjugates, formed by appending a glutathione moiety to a parent molecule. The test set was constructed by splitting the data by experiment, resulting in 431 metabolites, of which 287 were false positives (66.60%). The best-performing model achieved a recall of 93.73% and a precision of 89.67% on the test set.
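Splitting by experiment can be illustrated with a short sketch. It assumes that whole experiments are assigned to either the training or the test set, so that no experiment contributes peaks to both; the abstract does not spell out the exact procedure. The function name, the `experiment_id` field and the toy data are hypothetical. A split of this kind need not yield exactly 20% of the peaks (the reported test set is 431 of 2,570 peaks, roughly 17%).

```python
import random
from collections import defaultdict


def split_by_experiment(peaks, test_fraction=0.2, seed=0):
    """Assign whole experiments to train or test so none spans both sets."""
    by_experiment = defaultdict(list)
    for peak in peaks:
        by_experiment[peak["experiment_id"]].append(peak)

    # Shuffle experiment identifiers reproducibly.
    experiment_ids = sorted(by_experiment)
    random.Random(seed).shuffle(experiment_ids)

    target = test_fraction * len(peaks)
    train, test = [], []
    for exp_id in experiment_ids:
        # Fill the test set first until it reaches the target number of peaks.
        bucket = test if len(test) < target else train
        bucket.extend(by_experiment[exp_id])
    return train, test


# Toy data: 10 hypothetical experiments with 5 peaks each.
peaks = [
    {"experiment_id": f"exp{e}", "peak_id": p}
    for e in range(10)
    for p in range(5)
]
train, test = split_by_experiment(peaks)
```

Keeping experiments intact avoids evaluating the model on peaks from an experiment it has already seen during training.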

Experiments adhering to the 5-time-point protocol involved a total of 1,543 metabolite peaks, with 1,068 identified as false positives (69.22%). The primary focus of these experiments was soft spot identification. The test set was again constructed by splitting the data by experiment, yielding 419 metabolites, of which 287 were false positives (68.50%). The top-performing model attained a recall of 97.56% and a precision of 93.33% on the test set.
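To make the reported metrics more concrete, the sketch below back-calculates approximate confusion counts from the recall, precision and number of false positives given for each test set. These counts are inferred here and are not reported in the abstract. The calculation assumes that "false positive peak" is the positive class, so that recall is the share of false positive peaks the model flags and precision is the share of flagged peaks that really are false positives. The helper name is hypothetical.

```python
def back_calculate(n_false_positives, recall, precision):
    """Infer confusion counts from reported recall and precision.

    Assumes the positive class is "false positive peak".
    """
    # recall = correctly_flagged / n_false_positives
    correctly_flagged = round(recall * n_false_positives)
    # precision = correctly_flagged / total_flagged
    total_flagged = round(correctly_flagged / precision)
    return correctly_flagged, total_flagged


# (false positives in test set, recall, precision) as reported per protocol.
reported = {
    "1-time-point": (287, 0.9373, 0.8967),
    "5-time-point": (287, 0.9756, 0.9333),
}

for protocol, (n_fp, recall, precision) in reported.items():
    hit, flagged = back_calculate(n_fp, recall, precision)
    print(
        f"{protocol}: {hit} of {n_fp} false positives flagged, "
        f"{n_fp - hit} missed, "
        f"{flagged - hit} true metabolites flagged in error"
    )
```

Under this reading, the 1-time-point model would flag about 269 of the 287 false positives while wrongly flagging about 31 true metabolites, and the 5-time-point model would flag about 280 of 287 while wrongly flagging about 20.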

Beyond evaluating the predictive performance of the models, the SHapley Additive exPlanations (SHAP) method was employed to analyze feature importances. The most significant features for each prediction were extracted and aggregated by category or source. This facilitated a clear understanding of the main reasons behind predictions, which were empirically verified to be accurate in most cases. Consequently, this method could serve as a guiding tool to aid experts in the manual inspection of metabolite identification results, as well as in reassessing and rectifying previous manual annotations.
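The aggregation step can be sketched as follows. The feature names, their categories and the SHAP values are all made up for illustration; in practice the per-feature values would come from a SHAP explainer applied to the trained model. The choice of keeping the three largest-magnitude features per prediction is likewise an assumption.

```python
from collections import defaultdict

# Hypothetical mapping from feature name to its source category.
FEATURE_CATEGORY = {
    "peak_area": "chromatography",
    "peak_width": "chromatography",
    "mass_error_ppm": "mass spectrometry",
    "fragment_match_score": "mass spectrometry",
    "time_course_slope": "kinetics",
}


def top_categories(shap_values, top_k=3):
    """Sum the top_k largest-magnitude SHAP values of one prediction by category."""
    top = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]
    totals = defaultdict(float)
    for feature, value in top:
        totals[FEATURE_CATEGORY[feature]] += value
    # Strongest category first, ranked by absolute contribution.
    return sorted(totals.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Made-up SHAP values for a single peak; positive values push the
# prediction toward "false positive".
example = {
    "peak_area": -0.125,
    "peak_width": 0.5,
    "mass_error_ppm": 0.375,
    "fragment_match_score": 0.25,
    "time_course_slope": -0.0625,
}
summary = top_categories(example)
```

For this toy peak, the summary ranks mass spectrometry evidence ahead of chromatographic evidence, which is the kind of per-prediction hint an expert could use to decide where to look first during manual inspection.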

 
