Machine Learning-Assisted False Positive Detection in Metabolite Identification Workflows

December 10, 2025

Ramon Adàlia, Paula Cifuentes, Joyce Liu, Lionel Cheruzel, Gemma Sanjuan, Tomàs Margalef, Ismael Zamora

Abstract

Metabolite identification is a pivotal step in drug discovery and development, enabling the comprehensive analysis of drug-derived compounds within biological systems. However, the complexity of liquid chromatography–mass spectrometry data often results in numerous false positives, complicating the identification of true metabolites. This study introduces a machine-learning-based approach to improve the accuracy of false positive detection in metabolite identification workflows. By incorporating expert knowledge, we develop a feature set for metabolite-related chromatographic peaks that characterizes true and false positives with high accuracy, integrating data from mass spectra, chromatographic signals, and kinetic profiles. We validate this method via gradient boosting decision tree classifiers on both publicly available and proprietary “real-world” data sets, including small molecules and new modalities. Our findings demonstrate that machine learning-assisted techniques significantly reduce false positive identifications, thereby increasing the efficiency and accuracy of metabolite identification processes.

Prediction of peptide cleavage sites using protein language models and graph neural networks

October 30, 2025

Paula Cifuentes, Ramon Adàlia, Ismael Zamora

Abstract

The growing interest in peptide molecules as therapeutic agents, driven by their high selectivity and efficacy, has become a significant trend in the pharmaceutical industry. However, their oral administration remains challenging due to their low bioavailability and vulnerability to proteases, which cleave peptide bonds. To optimize peptide drug development, in silico tools based on machine learning algorithms have been developed for cleavage site prediction. These tools, which rely on manual feature extraction, have limitations in capturing complex peptide structures, especially those involving non-natural amino acids or cyclic peptides. This study presents two novel in silico approaches for cleavage site prediction. The first approach uses protein language models, specifically ESM-2, which has been fine-tuned to leverage its learned peptide structure embeddings for accurate cleavage site prediction, eliminating the need for manual feature engineering. The second approach employs graph neural networks, representing peptides via hierarchical graphs at the atom and amino acid levels, effectively handling cyclic peptide structures, including those containing non-natural amino acids. The applicability of this second approach is shown through a case study on a set of four cyclic peptides containing non-natural amino acids, comparing in silico predictions with experimental data.
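
To make the amino-acid level of such a graph representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes residues are nodes, peptide bonds are edges, and a head-to-tail cyclic peptide simply gains one closing edge; the function name `residue_graph` and the example residue labels are hypothetical. The atom-level graph and the neural network itself are not shown.

```python
def residue_graph(residues, cyclic=False):
    """Return peptide-bond edges as (i, j) index pairs over the residue list.

    Residues are plain labels, so non-natural amino acids need no special
    handling at this level. Each edge is a candidate cleavage site.
    """
    edges = [(i, i + 1) for i in range(len(residues) - 1)]
    if cyclic and len(residues) > 2:
        # Head-to-tail cyclization: bond between the last and first residue.
        edges.append((len(residues) - 1, 0))
    return edges


# Hypothetical four-residue peptide containing a non-natural residue (Aib).
peptide = ["Ala", "Aib", "Phe", "Gly"]
linear_edges = residue_graph(peptide)
cyclic_edges = residue_graph(peptide, cyclic=True)
```

Under these assumptions, a cleavage site predictor would score each edge; the cyclic case differs from the linear one only by the extra closing bond.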

Scalable Peptide MRM Transition Prediction for High-Throughput Proteomics via Hashing-Based Sequence Encoding

Peptide analysis via Multiple Reaction Monitoring (MRM) is indispensable for quantification, biomarker validation, and drug development, yet its reliance on experimental transition optimization limits scalability. Current computational models for small molecules fail to address peptide-specific complexities, such as sequence-dependent fragmentation and charge-state variability. We introduce a novel framework that combines hashing-based peptide fragment encoding with gradient-boosted decision trees to predict MRM transitions efficiently. This method eliminates bottlenecks in experimental workflows, enabling rapid, resource-efficient transition identification without compromising accuracy, a critical advancement for high-throughput proteomics pipelines.
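
The abstract does not specify the hashing scheme, so the following is only an illustrative sketch of the general hashing trick applied to peptide sequences. It assumes overlapping k-mers are hashed into a fixed number of buckets; the function name `hash_kmers` and the defaults `k=3` and `n_buckets=64` are hypothetical choices, not values from the paper.

```python
import hashlib


def hash_kmers(sequence, k=3, n_buckets=64):
    """Encode a peptide as a fixed-length count vector via the hashing trick."""
    vector = [0] * n_buckets
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # md5 gives a stable hash across runs; Python's built-in hash()
        # is salted per process and would make the encoding irreproducible.
        digest = hashlib.md5(kmer.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        vector[bucket] += 1
    return vector


# Example: an 8-residue peptide yields 6 overlapping 3-mers.
features = hash_kmers("PEPTIDEK")
```

The resulting vector has the same length for any peptide, which is what allows it to be passed to a tree-based model such as gradient-boosted decision trees. Collisions between different k-mers are possible and become less frequent as the bucket count grows.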

Molecular Structure and Mass Spectral Data Quality Driven Processing of High-Resolution Mass Spectrometry Data for Pharmacokinetics Studies

Our inability to comprehensively process high-resolution mass spectrometry data for quantitative analysis has long been an impediment to the broader adoption of this powerful technique. We have developed an approach that agnostically and automatically identifies all ions related to the compound in both the MS and MSMS data. The algorithm uses the structure of the molecule to automatically select the optimal compound-related MS and MSMS signals and parameters (extraction window, S/N) to provide the best overall method to meet the assay acceptance criteria defined by the user. Results using this structure- and data-driven approach are presented for pharmacokinetic data that were collected using the same set of samples analyzed on both QQQ and HRMS instruments.
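
The two parameters named above, extraction window and S/N, can be illustrated with a short sketch. This is not the published algorithm: it assumes a symmetric window expressed in ppm around a target m/z, and one simple S/N convention (peak height above the baseline mean, divided by the baseline standard deviation). All function names and the example values are hypothetical.

```python
from statistics import mean, pstdev


def extraction_window(mz, ppm):
    """Return the (low, high) m/z bounds of a +/- ppm window around mz."""
    delta = mz * ppm / 1e6
    return mz - delta, mz + delta


def extracted_intensity(spectrum, mz, ppm):
    """Sum the intensities of all (m/z, intensity) pairs inside the window."""
    low, high = extraction_window(mz, ppm)
    return sum(i for peak_mz, i in spectrum if low <= peak_mz <= high)


def signal_to_noise(trace, n_baseline):
    """Peak height above the baseline mean over the baseline std deviation."""
    baseline = trace[:n_baseline]  # assumed to contain no analyte signal
    noise = pstdev(baseline)
    height = max(trace) - mean(baseline)
    return float("inf") if noise == 0 else height / noise


# Made-up spectrum: (m/z, intensity) pairs near a target ion at m/z 500.
spectrum = [(499.990, 10.0), (499.998, 100.0), (500.004, 50.0), (500.020, 5.0)]
narrow = extracted_intensity(spectrum, 500.0, ppm=10)  # 499.995 to 500.005
wide = extracted_intensity(spectrum, 500.0, ppm=50)    # 499.975 to 500.025

# Made-up chromatographic trace; the first four points serve as baseline.
trace = [1.0, 3.0, 1.0, 3.0, 20.0, 52.0, 18.0, 2.0]
snr = signal_to_noise(trace, n_baseline=4)
```

A wider window captures more signal but also admits neighbouring ions, which is the trade-off an automated parameter search would have to weigh against the user's acceptance criteria.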

Plasma lipidomics analysis reveals altered profile of triglycerides and phospholipids in children with Medium-Chain Acyl-CoA dehydrogenase deficiency

July 2024

Inês M S Guerra, Helena B Ferreira, Tatiana Maurício, Marisa Pinho, Luísa Diogo, Sónia Moreira, Laura Goracci, Stefano Bonciarelli, Tânia Melo, Pedro Domingues, M Rosário Domingues, Ana S P Moreira

Abstract

Medium-chain acyl-CoA dehydrogenase deficiency (MCADD) is the most prevalent mitochondrial fatty acid β-oxidation disorder. In this study, we assessed the variability of the lipid profile in MCADD by analysing plasma samples obtained from 25 children with metabolically controlled MCADD (following a normal diet with frequent feeding and under l-carnitine supplementation) and 21 paediatric control subjects (CT). Gas chromatography-mass spectrometry was employed for the analysis of esterified fatty acids, while high-resolution C18-liquid chromatography-mass spectrometry was used to analyse lipid species. We identified a total of 251 lipid species belonging to 15 distinct lipid classes. Principal component analysis revealed a clear distinction between the MCADD and CT groups. Univariate analysis demonstrated that 126 lipid species exhibited significant differences between the two groups. The lipid species that displayed the most pronounced variations included triacylglycerols and phosphatidylcholines containing saturated and monounsaturated fatty acids, specifically C14:0 and C16:0, which were found to be more abundant in MCADD. The observed changes in the plasma lipidome of children with non-decompensated MCADD suggest an underlying alteration in lipid metabolism. Therefore, longitudinal monitoring and further in-depth investigations are warranted to better understand whether such alterations are specific to MCADD children and their potential long-term impacts.

Keywords: Lipid profile; Lipidomics; Mass spectrometry; Medium‐chain acyl‐CoA dehydrogenase deficiency (MCADD); Phospholipids (PL); Plasma analysis; Triacylglycerols (TG).