Emilio Alexis de la Cruz Nuñez Andrade1, Isaac Vidal-Daza2,1, Rafael Gómez-Bombarelli3, Francisco Martin-Martinez1
Swansea University1, Universidad de Granada2, Massachusetts Institute of Technology3
The demand for computational resources is increasing at an unprecedented rate because of the surge in artificial intelligence (AI), big data, and high-throughput computing. In chemistry, machine learning (ML) is revolutionizing molecular discovery, materials design, and property prediction in areas ranging from biomedicine to energy harvesting and storage, among many others. In practice, applying ML methods requires encoding chemical structures into a format suitable for computational tools. To this end, the chemistry community adopted the Simplified Molecular Input Line Entry System (SMILES)[1] for initial structure encoding, and subsequently DeepSMILES[2] and SELFIES (SELF-referencing Embedded Strings)[3] as more robust alternatives. These string representations are further encoded into a machine-readable format that captures the structural and chemical characteristics of molecules, such as one-hot encoding (OHE), molecular graphs, or descriptors (molecular fingerprints); the choice depends on the application, the dataset, and the ML model to be trained. In this study, we propose an alternative to the traditional OHE of SMILES, DeepSMILES, and SELFIES that achieves comparable model accuracy and robustness while using computational resources more efficiently. To evaluate its effectiveness, we conducted a set of benchmarks and comparative analyses with a variational autoencoder and a recurrent neural network. We also explored the impact of this alternative representation on the required size of the training dataset; on molecular diversity, novelty, and validity; and on model complexity for different numbers of hyperparameters.
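For readers unfamiliar with OHE of molecular strings, the following is a minimal illustrative sketch (our own, not code from this study) of character-level one-hot encoding of a SMILES string. For brevity, the vocabulary is derived from a single string; a real pipeline would build a fixed vocabulary over the whole dataset and pad sequences to a common length.

```python
# Minimal sketch of character-level one-hot encoding of a SMILES string.
smiles = "CCO"  # ethanol

# Character vocabulary and index map (dataset-wide in practice).
vocab = sorted(set(smiles))               # ['C', 'O']
index = {ch: i for i, ch in enumerate(vocab)}

# One row per character, one column per vocabulary entry.
ohe = [[1 if index[ch] == j else 0 for j in range(len(vocab))] for ch in smiles]

print(ohe)  # [[1, 0], [1, 0], [0, 1]]
```

The resulting matrix grows with both sequence length and vocabulary size, which is the source of the memory cost that alternative representations aim to reduce.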
This alternative representation opens a new avenue for more efficient computing, with lower resource use and faster performance, which benefits ML methods not only in chemistry but also in any other field that uses OHE as a data representation.

[1] Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1), 31-36.
[2] O'Boyle, N., & Dalke, A. (2018). DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures.
[3] Krenn, M., Häse, F., Nigam, A., Friederich, P., & Aspuru-Guzik, A. (2020). Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4), 045024.