MRS Meetings and Events

 

DS04.12.05 2023 MRS Fall Meeting

Alternative Machine-Readable Representation of Molecules for More Efficient Use of Computational Resources in Machine Learning Applications

When and Where

Nov 30, 2023
2:45pm - 3:00pm

Sheraton, Second Floor, Back Bay B

Presenter and Co-Author(s)

Emilio Alexis de la Cruz Nuñez Andrade (1), Isaac Vidal-Daza (2,1), Rafael Gómez-Bombarelli (3), Francisco Martin-Martinez (1)

(1) Swansea University, (2) Universidad de Granada, (3) Massachusetts Institute of Technology

Abstract

Demand for computational resources is rising at an unprecedented rate, driven by the surge in artificial intelligence (AI), big data, and high-throughput computing. In chemistry, machine learning (ML) is transforming molecular discovery, materials design, and property prediction in areas ranging from biomedicine to energy harvesting and storage, among many others. In practice, applying ML methods requires encoding chemical structures into a format suitable for computational tools. To this end, the chemistry community adopted the Simplified Molecular Input Line Entry System (SMILES) [1] for initial structure codification, followed by DeepSMILES [2] and SELFIES (SELF-referencing Embedded Strings) [3] as more sophisticated approaches. These string representations are further encoded into a machine-readable format that captures the structural and chemical characteristics of molecules, such as one-hot encoding (OHE), molecular graphs (MG), or descriptors (molecular fingerprints); the choice depends on the application, the data set, and the ML model to be trained. In this study, we propose an alternative to the traditional OHE of SMILES, DeepSMILES, and SELFIES that achieves comparable model accuracy and robustness while using computational resources more efficiently. To evaluate its effectiveness, we conducted benchmarks and comparative analyses with a Variational Autoencoder and a Recurrent Neural Network. We also explored the impact of this alternative representation on the required training-set size, on molecular diversity, novelty, and validity, and on model complexity for different numbers of hyperparameters. This alternative representation opens a new avenue for more efficient computing, with lower resource use and faster performance, benefiting ML methods in chemistry as well as any other field that uses OHE as a data representation.

[1] Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1), 31-36.
[2] O'Boyle, N., & Dalke, A. (2018). DeepSMILES: An adaptation of SMILES for use in machine-learning of chemical structures.
[3] Krenn, M., Häse, F., Nigam, A., Friederich, P., & Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4), 045024.
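
For readers unfamiliar with the baseline the abstract compares against, the following is a minimal sketch of one-hot encoding a SMILES string. The character vocabulary, padding token, and fixed sequence length are assumptions made for illustration; they are not details of the representation proposed in the study.

```python
import numpy as np

# Hypothetical toy vocabulary of SMILES characters; in practice the vocabulary
# is built from the training set and typically includes a padding token.
VOCAB = ["<pad>", "C", "c", "O", "N", "(", ")", "=", "1"]
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}
MAX_LEN = 12  # assumed fixed sequence length used for padding/truncation


def one_hot_encode(smiles: str) -> np.ndarray:
    """Encode a SMILES string as a (MAX_LEN, len(VOCAB)) one-hot matrix."""
    matrix = np.zeros((MAX_LEN, len(VOCAB)), dtype=np.float32)
    for pos, char in enumerate(smiles[:MAX_LEN]):
        matrix[pos, CHAR_TO_IDX[char]] = 1.0
    # Remaining positions are marked with the padding token.
    for pos in range(min(len(smiles), MAX_LEN), MAX_LEN):
        matrix[pos, CHAR_TO_IDX["<pad>"]] = 1.0
    return matrix


# Example: phenol, SMILES "c1ccccc1O"
encoded = one_hot_encode("c1ccccc1O")
print(encoded.shape)  # (12, 9): one nonzero entry per row, all other entries zero
```

The sketch illustrates why OHE scales poorly: the matrix grows with both vocabulary size and sequence length and is almost entirely zeros, which is the kind of overhead a more compact representation can reduce.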

Symposium Organizers

Andrew Detor, GE Research
Jason Hattrick-Simpers, University of Toronto
Yangang Liang, Pacific Northwest National Laboratory
Doris Segets, University of Duisburg-Essen

Symposium Support

Bronze
Cohere

Publishing Alliance

MRS publishes with Springer Nature