MD02.06.03 2023 MRS Spring Meeting

Improving Deep Neural Network in Predicting Electron Ionization Mass Spectra by Molecular Similarity-wise Sampling

When and Where

Apr 13, 2023
2:15pm - 2:30pm

Marriott Marquis, Second Level, Foothill G1/G2

Presenter

Co-Author(s)

Ryohei Yamaguchi¹, Shigenori Takeda¹, Toshifumi Kakiuchi¹, Yutaka Imamura¹

¹AGC Co.

Abstract

Mass spectrometry is an indispensable analytical tool for identifying molecular species. A target molecule is often identified by comparing its mass spectrum with known spectra stored in a database. This approach is useful for detecting known molecules but is not applicable to novel ones. To address this issue, a recent study proposed adding machine-learning-predicted spectra to the database in order to enlarge the chemical space it covers. This method is expected to extend the applicability of the knowledge-based search system, but its coverage is still restricted by the predictive ability of the machine-learning model. In this study, we therefore modified the machine-learning model by training it on our own dataset in addition to the original open mass spectrometry data.

First, we tried to extend this approach by adding our own dataset of fluorinated molecules to the open database, which comprises more than 200 thousand molecules; however, the additional dataset was much smaller than the original one and was not enough to improve the model's predictions for fluorinated molecules. We therefore applied a simple but powerful sampling scheme that downsamples the molecular datasets on the basis of molecular similarity. The similarity between molecules in our dataset and the open dataset was evaluated using the Tanimoto coefficient on Morgan fingerprints, and the data with low similarity were then removed to give the downsampled dataset used for training the machine-learning model. Additionally, data exhibiting bit collisions, which arise from the limitations of Extended-Connectivity Fingerprints (ECFP4) in representing molecular structures, were eliminated to improve training performance.

Consequently, the model trained with the reduced dataset was found to outperform the original model in predicting the mass spectra of additional fluorinated molecules. This demonstrates that strategic learning with appropriate downsampling can improve the performance of deep learning models even when the additional data are limited.
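
The similarity-wise downsampling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each fingerprint is already available as a set of on-bit indices (in practice these would come from a cheminformatics toolkit), and the function names, the toy fingerprints, and the 0.5 threshold are all hypothetical choices made for illustration; the abstract does not state a cutoff.

```python
from typing import Dict, FrozenSet, List

# Assumption: a fingerprint is represented as the set of its "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: |A and B| / |A or B| over the on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def downsample_by_similarity(
    open_fps: Dict[str, Fingerprint],
    own_fps: List[Fingerprint],
    threshold: float,
) -> List[str]:
    """Keep open-dataset entries whose best similarity to any molecule
    in the in-house dataset is at least `threshold` (hypothetical cutoff)."""
    kept = []
    for name, fp in open_fps.items():
        best = max((tanimoto(fp, own) for own in own_fps), default=0.0)
        if best >= threshold:
            kept.append(name)
    return kept


# Toy data, invented for illustration only.
own_dataset = [frozenset({1, 2, 3, 4})]
open_dataset = {
    "A": frozenset({1, 2, 3, 5}),     # shares 3 of 5 distinct bits
    "B": frozenset({7, 8, 9}),        # shares no bits
    "C": frozenset({1, 2, 3, 4, 6}),  # shares 4 of 5 distinct bits
}

kept_names = downsample_by_similarity(open_dataset, own_dataset, threshold=0.5)
print(kept_names)
```

In this toy case, entry "B" has no bits in common with the in-house molecule and is dropped, while "A" and "C" are retained for training. The bit-collision filtering mentioned in the abstract is a separate step and is not covered by this sketch.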

Symposium Organizers

Soumendu Bagchi, Los Alamos National Laboratory
Huck Beng Chew, The University of Illinois at Urbana-Champaign
Haoran Wang, Utah State University
Jiaxin Zhang, Oak Ridge National Laboratory

Symposium Support

Bronze
Patterns and Matter, Cell Press
