Ryohei Yamaguchi1,Shigenori Takeda1,Toshifumi Kakiuchi1,Yutaka Imamura1
AGC Co.1
Mass spectrometry is an indispensable analytical tool for identifying molecular species. A target molecule is often identified by comparing its mass spectrum with known spectra stored in a database. This approach is useful for detecting known molecules but is not applicable to novel ones. To address this issue, a recent study proposed adding machine-learning-predicted spectra to the database in order to expand the chemical space it covers. This method is expected to extend the applicability of the knowledge-based search system, but its coverage is still restricted by the predictive ability of the machine-learning model. In this study, we therefore modified the machine-learning model by training it on our own dataset in addition to the original open mass spectrometry data.

First, we tried to extend this approach by adding our own dataset of fluorinated molecules to the open database of more than 200 thousand molecules; however, the additional dataset, far smaller than the original one, was not enough to improve the model's predictions for fluorinated molecules. We therefore applied a simple but powerful sampling scheme that downsamples the molecular datasets on the basis of molecular similarity. The similarity between molecules in our own and the open datasets was evaluated with the Tanimoto coefficient on Morgan fingerprints, and the dataset in which the low-similarity data had been reduced was then used as the downsampled dataset for training the machine-learning model. Additionally, data exhibiting bit collisions, which arise from the limitation of extended-connectivity fingerprints (ECFP4) in representing molecular structures, were eliminated to improve training performance.
Consequently, the model trained with the reduced dataset was found to outperform the original model in predicting mass spectra of additional fluorinated molecules, demonstrating that strategic training with appropriate downsampling can improve the performance of deep learning models even when the additional data are limited.
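
The similarity-based downsampling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the fingerprints are assumed to be precomputed (for example, Morgan fingerprints from a cheminformatics toolkit) and are represented here simply as sets of on-bit indices. The function names, the toy data, the rule of keeping an open-dataset molecule when its best Tanimoto similarity to any in-house molecule reaches a threshold, and the threshold value itself are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List

# A fingerprint is represented by the indices of its "on" bits (assumption).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient |a & b| / |a | b| of two on-bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def downsample_by_similarity(
    open_fps: Dict[str, Fingerprint],
    target_fps: List[Fingerprint],
    threshold: float,
) -> List[str]:
    """Return the names of open-dataset molecules worth keeping.

    A molecule is kept when its highest Tanimoto similarity to any
    molecule of the in-house (target) dataset is at least `threshold`.
    This selection rule is a hypothetical choice for illustration.
    """
    kept = []
    for name, fp in open_fps.items():
        best = max((tanimoto(fp, t) for t in target_fps), default=0.0)
        if best >= threshold:
            kept.append(name)
    return kept


# Toy data: three "open" molecules and one "in-house" molecule.
open_fps = {
    "A": frozenset({1, 2, 3, 4}),
    "B": frozenset({10, 11, 12}),
    "C": frozenset({2, 3, 4, 5, 6}),
}
target_fps = [frozenset({1, 2, 3, 5})]

# Hypothetical threshold; the abstract does not state the value used.
kept = downsample_by_similarity(open_fps, target_fps, threshold=0.5)
print(kept)  # ['A', 'C']
```

In this toy example, molecule B shares no bits with the in-house molecule and is dropped, so the training set is weighted toward structures resembling the molecules of interest.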