December 1 - 6, 2024
Boston, Massachusetts
2024 MRS Fall Meeting & Exhibit
MT04.12.05

A Novel Graph Representation and an Improved Transfer Learning Method for Accurate Predictions of the Chemical Properties of Molecules

When and Where

Dec 6, 2024
9:00am - 9:15am
Hynes, Level 2, Room 210

Presenter(s)

Co-Author(s)

Joonhyuk Choi1, Youngchun Kwon1

Samsung Advanced Institute of Technology1

Abstract

Artificial intelligence (AI) models have proven quite successful at predicting the chemical properties of molecules, and choosing a proper molecular representation, such as SMILES or a graph, is crucial to improving their performance. Recently, graph neural networks (GNNs) have demonstrated superior performance in predicting the chemical properties of given molecules, and graphs have become the prevailing molecular representation. Because of the high dimensionality of graph-based data representations, however, the computational resources required grow rapidly with the number of data points, and managing the graph representations during training becomes quite difficult. Furthermore, the scarcity of datasets available for predicting chemical properties such as the retention time of molecules makes developing accurate prediction models challenging. To address the scalability issue of graph-based molecular representations, we propose a sparsified graph representation that regards only the heavy atoms in a molecule as nodes and the chemical bonds between them as edges. We show that this representation, combined with improved message passing and readout functions in a GNN, scales better to large molecules and provides higher prediction accuracy for NMR chemical shifts than commonly used graph-based methods. To overcome the scarcity of training data for building an accurate retention-time model for small molecules, we present an improved transfer learning method that learns from a small training set with a pre-trained GNN. The GNN is pre-trained on the METLIN-SMRT dataset and then fine-tuned on the target training set for a fixed number of iterations using the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimizer. We demonstrate that this method provides better prediction accuracy across numerous chromatographic systems than other existing transfer learning methods.
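As an illustration of the sparsified representation described above, the sketch below builds a graph whose nodes are only the heavy atoms of a molecule and whose edges are its chemical bonds. It is a minimal sketch, not the authors' implementation: RDKit is assumed for SMILES parsing, and the node and edge features (atomic number, attached hydrogen count, bond order) are illustrative choices rather than the features used in this work.

```python
# Minimal sketch of a heavy-atom-only molecular graph (illustrative, not the authors' code).
from rdkit import Chem

def heavy_atom_graph(smiles: str):
    """Nodes are heavy atoms only; edges are the chemical bonds between them."""
    mol = Chem.MolFromSmiles(smiles)  # hydrogens stay implicit, so only heavy atoms appear
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # Illustrative node features: atomic number and number of attached hydrogens
    nodes = [(a.GetAtomicNum(), a.GetTotalNumHs()) for a in mol.GetAtoms()]
    # One edge per chemical bond between heavy atoms, with its bond order
    edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), b.GetBondTypeAsDouble())
             for b in mol.GetBonds()]
    return nodes, edges

print(heavy_atom_graph("CCO"))  # ethanol: 3 heavy-atom nodes, 2 bond edges
```

The transfer learning step can be sketched in a similarly hedged way. The fragment below fine-tunes a pre-trained model on a small target set for a bounded number of L-BFGS iterations; `pretrained_gnn`, `x_target`, and `y_target` are hypothetical placeholders, and PyTorch's built-in L-BFGS optimizer stands in for whatever implementation the authors used.

```python
# Minimal sketch of fine-tuning a pre-trained model with L-BFGS on a small target dataset.
import torch

def finetune_lbfgs(pretrained_gnn, x_target, y_target, max_iter=200):
    model = pretrained_gnn
    model.train()
    optimizer = torch.optim.LBFGS(model.parameters(), max_iter=max_iter)
    loss_fn = torch.nn.L1Loss()  # mean absolute error, a common choice for retention time

    def closure():
        # L-BFGS re-evaluates the objective several times per step, hence the closure
        optimizer.zero_grad()
        loss = loss_fn(model(x_target), y_target)
        loss.backward()
        return loss

    optimizer.step(closure)  # runs up to max_iter quasi-Newton iterations
    return model
```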

Symposium Organizers

Kjell Jorner, ETH Zurich
Jian Lin, University of Missouri-Columbia
Daniel Tabor, Texas A&M University
Dmitry Zubarev, IBM

Session Chairs

Jian Lin
Dmitry Zubarev
