Emilio Alexis de la Cruz Nuñez Andrade1, Isaac Vidal-Daza2,1, Rafael Gómez-Bombarelli3, Francisco Martin-Martinez1
Swansea University1, Universidad de Granada2, Massachusetts Institute of Technology3
The demand for computational resources is increasing at an unprecedented rate because of the surge in artificial intelligence (AI), big data, and high-throughput computing. In chemistry, machine learning (ML) is revolutionizing molecular discovery, materials design, and property prediction in areas ranging from biomedicine to energy harvesting and storage, among many others. In practice, applying ML methods requires encoding chemical structures into a format suitable for computational tools. To this end, the chemistry community adopted the Simplified Molecular Input Line Entry System (SMILES)[1] for initial structure encoding, and subsequently DeepSMILES[2] and SELFIES (SELF-referencing Embedded Strings)[3] as more robust alternatives. These string representations are further encoded into a machine-readable format that captures the structural and chemical characteristics of molecules, such as one-hot encoding (OHE), molecular graphs, or descriptors (molecular fingerprints); the choice depends on the application, the dataset, and the ML model to be trained. In this study, we propose an alternative to the traditional OHE of SMILES, DeepSMILES, and SELFIES that achieves comparable model accuracy and robustness while using computational resources more efficiently. To evaluate its effectiveness, we conducted a set of benchmarks and comparative analyses with a variational autoencoder and a recurrent neural network. We also explored the impact of this alternative representation on the required size of the training dataset; on molecular diversity, novelty, and validity; and on model complexity for different numbers of hyperparameters.
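For readers unfamiliar with OHE of molecular strings, the following is a minimal illustrative sketch (our own, not code from this study) of character-level one-hot encoding of a SMILES string. For brevity, the vocabulary is derived from a single string; a real pipeline would build a fixed vocabulary over the whole dataset and pad sequences to a common length.

```python
# Minimal sketch of character-level one-hot encoding of a SMILES string.
smiles = "CCO"  # ethanol

# Character vocabulary and index map (dataset-wide in practice).
vocab = sorted(set(smiles))               # ['C', 'O']
index = {ch: i for i, ch in enumerate(vocab)}

# One row per character, one column per vocabulary entry.
ohe = [[1 if index[ch] == j else 0 for j in range(len(vocab))] for ch in smiles]

print(ohe)  # [[1, 0], [1, 0], [0, 1]]
```

The resulting matrix grows with both sequence length and vocabulary size, which is the source of the memory cost that alternative representations aim to reduce.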
This alternative representation opens a new avenue for more efficient computing, with lower resource use and faster performance, which benefits ML methods not only in chemistry but also in any other field that uses OHE as a data representation.

[1] Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1), 31-36.
[2] O'Boyle, N., & Dalke, A. (2018). DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures.
[3] Krenn, M., Häse, F., Nigam, A., Friederich, P., & Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4), 045024.