Investigation of Machine Learning Force Fields for Biomolecular Systems Using Fragment Molecular Orbital Method Data

Hiromu Matsumoto

Ryosuke Kita

Chiduru Watanabe

Masateru Ohta

Naoki Tanimura

Koji Okuwaki

Yu-Shi Tian

Daisuke Takaya

Mitsunori Ikeguchi

Kaori Fukuzawa

Teruki Honma

Tsuyohiko Fujigaya

Koichiro Kato

Hiromu Matsumoto¹, Ryosuke Kita¹, Chiduru Watanabe², Masateru Ohta², Naoki Tanimura³, Koji Okuwaki⁴, Yu-Shi Tian⁵, Daisuke Takaya⁵, Mitsunori Ikeguchi²,⁶, Kaori Fukuzawa⁵, Teruki Honma², Tsuyohiko Fujigaya¹, Koichiro Kato¹

Kyushu University¹, RIKEN², Mizuho Research & Technologies³, JSOL Corporation⁴, Osaka University⁵, Yokohama City University⁶

Introduction
In molecular simulations for drug discovery, achieving both high accuracy and low computational cost is crucial. Unlike traditional molecular force fields and quantum mechanical (QM) calculations, machine learning force fields (MLFFs) are expected to meet these demands effectively. Previous approaches to developing MLFFs have relied on QM-based datasets derived from conventional density functional theory (DFT) or ab initio molecular orbital methods. However, the significant computational costs associated with these methods, particularly for large systems such as biomolecules, have considerably restricted the scope of MLFF research in drug discovery. We considered that the Fragment Molecular Orbital (FMO) method¹⁾, which offers efficient QM calculations for entire biomolecular systems, could address this issue. In this study, we investigate whether FMO data can be effectively used to construct MLFFs. Furthermore, we explore the use of the FMO Database (FMODB) to enhance MLFF accuracy through transfer learning.

Methods
To evaluate the utility of FMO data in constructing MLFFs, TrpCage, a small protein consisting of 20 residues, was selected as the model system. Additionally, the effectiveness of transfer learning to improve the accuracy of MLFFs was investigated by utilizing the FMODB, a comprehensive public database of FMO calculation results.
For the MLFF training dataset, diverse configurations of TrpCage were sampled through molecular dynamics (MD) simulations, and the potential energy and the forces acting on each atom were computed for each structure using the FMO method. MD simulations of the TrpCage NMR structure (PDB ID: 1L2Y) in water were conducted using GROMACS with the Amber ff14SB force field and the TIP3P water model. A total of 5,000 structures were obtained from these simulations, sampled every 1 ns over 50 runs of 100 ns each. Subsequent FMO calculations (FMO2-MP2/6-31G* with energy gradient) were performed to evaluate the energy and forces for each structure using the ABINIT-MP program on the Fugaku supercomputer (hp230131). The dataset was divided into training, validation, and test sets in an 8:1:1 ratio. The MLFF was constructed based on the High Dimensional Neural Network Potential (HDNNP) framework proposed by Behler and Parrinello²⁾.
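To illustrate the HDNNP idea referenced above, the following minimal Python sketch computes radial (G2-type) symmetry functions with a cosine cutoff and sums per-element atomic network outputs into a total energy. It is a simplified sketch under stated assumptions, not the model used in this study: the function names, network size, eta values, and cutoff radius are hypothetical, angular and element-resolved descriptors are omitted, and the weights are random rather than trained.

```python
import numpy as np

def cosine_cutoff(r, r_c):
    """Cosine cutoff: 0.5 * (cos(pi * r / r_c) + 1) for r < r_c, else 0."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def radial_symmetry_functions(positions, etas, r_s=0.0, r_c=6.0):
    """G2-type radial descriptors: one row per atom, one column per eta.

    r_s and r_c are illustrative defaults (assumptions), not values
    taken from the study.
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)           # (N, N) pair distances
    fc = cosine_cutoff(dist, r_c)
    np.fill_diagonal(fc, 0.0)                      # exclude the j == i term
    gauss = np.exp(-etas[None, None, :] * (dist[:, :, None] - r_s) ** 2)
    return np.sum(gauss * fc[:, :, None], axis=1)  # sum over neighbours j

def init_atomic_network(n_in, n_hidden, rng):
    """Random weights for a one-hidden-layer atomic network (untrained)."""
    return {
        "W1": rng.normal(scale=0.1, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.1, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }

def atomic_energy(net, g):
    """Map descriptors g (n_atoms, n_in) to one energy per atom."""
    hidden = np.tanh(g @ net["W1"] + net["b1"])
    return (hidden @ net["W2"] + net["b2"])[:, 0]

def total_energy(positions, species, networks, etas):
    """Total energy as a sum of atomic energies, one network per element."""
    g = radial_symmetry_functions(positions, etas)
    energy = 0.0
    for element, net in networks.items():
        mask = species == element
        if np.any(mask):
            energy += atomic_energy(net, g[mask]).sum()
    return float(energy)
```

Because the descriptors depend only on interatomic distances and the total energy is a sum over atoms, the result is unchanged when the structure is rotated or when atoms of the same element are reordered. Forces would follow as the negative gradient of this energy with respect to the atomic positions, which the sketch does not implement.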
Additionally, a pre-trained model for transfer learning was developed using 15,454 energy records from FMODB, including atomic species C, H, N, O, S, F, and Cl.
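The pre-train-then-fine-tune workflow can be sketched in a few lines of Python. The example below is a toy illustration only: it uses a small one-hidden-layer network and plain gradient descent, and all names, hyperparameters, and the optional freezing of the hidden layer are assumptions made for this sketch. The abstract does not state how the pre-trained weights were reused, so this should not be read as the procedure applied to the FMODB and TrpCage data.

```python
import numpy as np

def init_params(n_in, n_hidden, rng):
    """Random initial weights for a one-hidden-layer regression network."""
    return {
        "W1": rng.normal(scale=0.5, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.5, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }

def predict(params, X):
    hidden = np.tanh(X @ params["W1"] + params["b1"])
    return (hidden @ params["W2"] + params["b2"])[:, 0]

def mse(params, X, y):
    return float(np.mean((predict(params, X) - y) ** 2))

def train(params, X, y, lr=0.02, steps=500, freeze_hidden=False):
    """Full-batch gradient descent on the mean squared error.

    Starts from a copy of `params`, so passing pre-trained weights here
    is the transfer-learning step. With freeze_hidden=True only the
    output layer is updated.
    """
    p = {k: v.copy() for k, v in params.items()}
    n = len(y)
    for _ in range(steps):
        hidden = np.tanh(X @ p["W1"] + p["b1"])
        err = (hidden @ p["W2"] + p["b2"])[:, 0] - y
        grad_out = (2.0 / n) * err[:, None]        # dLoss/dPrediction
        grad_w2 = hidden.T @ grad_out
        grad_b2 = grad_out.sum(axis=0)
        if not freeze_hidden:
            # Back-propagate through tanh using the pre-update W2.
            grad_hidden = (grad_out @ p["W2"].T) * (1.0 - hidden ** 2)
            p["W1"] -= lr * (X.T @ grad_hidden)
            p["b1"] -= lr * grad_hidden.sum(axis=0)
        p["W2"] -= lr * grad_w2
        p["b2"] -= lr * grad_b2
    return p

# Synthetic stand-ins (NOT FMODB or TrpCage data): a large source set
# and a small, slightly shifted target set.
rng = np.random.default_rng(0)
X_source = rng.normal(size=(2000, 4))
y_source = np.sin(X_source).sum(axis=1)
X_target = rng.normal(size=(40, 4))
y_target = np.sin(X_target).sum(axis=1) + 0.3

pretrained = train(init_params(4, 16, rng), X_source, y_source)
fine_tuned = train(pretrained, X_target, y_target, steps=200)
```

The design choice shown here is that fine-tuning starts from the pre-trained weights instead of a random initialization; whether to update all layers or only the output layer is a separate decision that depends on how much target data is available.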

Results
Without Transfer Learning
Without transfer learning, the correlation coefficients (R) for the predicted energies and forces of TrpCage were 0.58 and 0.70, respectively. These results indicate that the constructed MLFF could learn the relationship between structure and energy/forces from FMO data, although the prediction accuracy remained moderate.
With Transfer Learning
Transfer learning from FMODB data improved the predictions for TrpCage: the R values for energy and force predictions increased from 0.58 to 0.61 and from 0.70 to 0.73, respectively. These gains, though modest, indicate that large-scale pre-training datasets can improve the accuracy of MLFFs.
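For reference, a correlation coefficient of the kind reported above can be computed as follows. This is a minimal sketch assuming R denotes the Pearson correlation between reference (FMO) and predicted values, and that force arrays are flattened so each Cartesian component counts as one sample; the abstract does not specify either convention, and the function name is hypothetical.

```python
import numpy as np

def pearson_r(reference, predicted):
    """Pearson correlation between reference and predicted values.

    Inputs of any shape are flattened, so a force array of shape
    (n_structures, n_atoms, 3) is treated component by component
    (an assumption of this sketch).
    """
    ref = np.asarray(reference, dtype=float).ravel()
    pred = np.asarray(predicted, dtype=float).ravel()
    ref_c = ref - ref.mean()
    pred_c = pred - pred.mean()
    return float(ref_c @ pred_c / np.sqrt((ref_c @ ref_c) * (pred_c @ pred_c)))
```

Note that R measures only linear association: a model with a constant offset or a scale error can still reach a high R, so error metrics such as the mean absolute error are commonly reported alongside it.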

Acknowledgments
This research was conducted as part of the Life Intelligence Consortium (LINC) and the FMO Drug Design Consortium (FMODD). The work was supported in part by the Japan Agency for Medical Research and Development (AMED) under the Drug Discovery and Life Science Research Support Platform Project (BINDS) (Grant No. JP23ama121030).

References
1) Kitaura, K. et al., Chem Phys Lett 313, 701–706 (1999).
2) Behler, J. & Parrinello, M., Phys Rev Lett 98, 146401 (2007).

Symposium Organizers

Kjell Jorner, ETH Zurich

Jian Lin, University of Missouri-Columbia

Daniel Tabor, Texas A&M University

Dmitry Zubarev, IBM

2024 MRS Fall Meeting & Exhibit
