MRS Meetings and Events

 

DS04.07.11 2023 MRS Fall Meeting

A Transformer Based Large-Scale Molecular Representation Model

When and Where

Nov 28, 2023
8:00pm - 10:00pm

Hynes, Level 1, Hall A

Presenter

Co-Author(s)

Indra Priyadarsini S¹, Seiji Takeda¹, Akihiro Kishimoto¹, Hajime Shinohara¹, Daiju Nakano¹

¹IBM Research - Tokyo

Abstract

Large-scale molecular representation methods have proven useful in several applications and areas of materials science, including virtual screening, drug discovery, chemical modeling, material design and molecular dynamics simulation. These representations enable both effective and efficient analysis of molecular data. With the advancements in deep learning, several models have been developed to learn representations directly from molecular structures. Recently, transformer-based molecular representations have gained significant importance in the field of materials informatics, and that importance continues to grow as researchers explore their potential in advancing drug discovery, materials science, and other areas of molecular research.

In this study, we develop one such transformer-based model that is capable of capturing complex relationships and interactions within molecules. While most existing works focus only on capturing representations through encoder-only models, we present an encoder-decoder model based on BART (Bidirectional and Auto-Regressive Transformers) that is capable not only of efficiently learning molecular representations but also of auto-regressively generating molecules from those representations. This can prove highly impactful, especially for new molecule design and generation, enabling efficient and effective analysis and manipulation of molecular data.

The model is trained on a dataset of 10 billion molecules from the publicly available ZINC-22 database, rendering it the most extensive training dataset employed to date. The dataset is encoded in the SELFIES (SELF-referencing Embedded Strings) representation, as SELFIES provides a more concise and interpretable representation, making it suitable for machine learning applications where compactness and generalization are important. The encoded SELFIES are then tokenized using an efficient tokenization scheme with masking in order to improve generalizability. We show that the learned molecular representation outperforms existing baselines on downstream tasks, thus validating the efficacy of the large pre-trained model.
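The abstract does not specify the tokenization scheme or the masking strategy, so the following is only a minimal sketch of what tokenizing a SELFIES string and masking some of its tokens could look like. It assumes one token per bracketed SELFIES symbol and independent per-token masking; the `<mask>` symbol, the 15% masking rate, and the function names `tokenize_selfies` and `mask_tokens` are all hypothetical choices made for illustration, not details from the authors' model. The example string is the SELFIES form of benzene as produced by the open-source `selfies` package.

```python
import random
import re

# Hypothetical mask symbol; the abstract does not name the one actually used.
MASK_TOKEN = "<mask>"


def tokenize_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed symbols, one token each."""
    tokens = re.findall(r"\[[^\[\]]+\]", selfies)
    # Reject input that is not made up purely of [...] symbols.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return tokens


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace each token with MASK_TOKEN independently with probability mask_prob.

    The 15% default is an assumption borrowed from common masked-language-model
    practice, not a value reported in the abstract.
    """
    rng = random.Random(seed)
    return [MASK_TOKEN if rng.random() < mask_prob else tok for tok in tokens]


# SELFIES string for benzene.
benzene = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
tokens = tokenize_selfies(benzene)
masked = mask_tokens(tokens, mask_prob=0.15, seed=0)

print(tokens)
print(masked)
```

In a denoising setup of this kind, the masked sequence would be the encoder input and the original token sequence the decoder target. BART-style pre-training typically masks whole spans rather than single tokens; the per-token version here is a simplification chosen to keep the sketch short.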

Symposium Organizers

Andrew Detor, GE Research
Jason Hattrick-Simpers, University of Toronto
Yangang Liang, Pacific Northwest National Laboratory
Doris Segets, University of Duisburg-Essen

Symposium Support

Bronze
Cohere

Session Chairs

Jason Hattrick-Simpers
Yangang Liang
Michael Thuis

In this Session

DS03.07.05
WITHDRAWN (NO SHOW) 12.13.2023: Optimizing 2.8 Micron Emission in Er:YLF Q-Switched Lasers

DS04.07.01
Unraveling the Mechanisms of Stability in CoxMo70-xFe10Ni10Cu10 High Entropy Alloys via Physically Interpretable Graph Neural Networks

DS04.07.02
Autoencoder Based on Graph and Recurrent Neural Networks and Application to Property Prediction

DS04.07.03
Chemical State Analysis Assisted Combinatorial Exploration of New Phase Spaces: Application to Ternary Zn-M-N Nitrides and Synthesis of Wurtzite Zn2TaN3.

DS04.07.04
Data-Driven Doping for Semiconductors: Identifying Top Dopant Candidates for Complex Crystals

DS04.07.05
Optimizing Active Learning in Materials Discovery Through a Holistic Pruning Strategy for NN-based Agents

DS04.07.06
Hydrogen Absorption and Diffusion in High Entropy Alloys: Insights from DFT and Machine Learning

DS04.07.07
A Convergence of Fast Sintering, Grain Growth Analysis, High Throughput Measurements, and Data Driven Computer Models to Develop New Solid-State Sodium-Ion Battery Materials

DS04.07.08
A Unified Theory Quantifying How Lattice Dynamics Facilitate Proton Transport in Various Ternary-Oxide Phases

DS04.07.09
Machine Learning Prediction of Heat Capacity for Solid Mixtures of Pseudo-Binary Oxides

