Jannis Born1,2, Matteo Manica1
IBM Research Europe1, ETH Zürich2
Transformer-based models lack an intrinsic way of representing numerals as tokens. Hence, the benefits of large-scale self-supervised pretraining do not yet extend to text datasets with quantitative numerical labels. However, efficiently encoding continuous properties jointly with sentences would open the door for "Swiss army knife" autoregressive Transformers that concurrently perform property prediction and conditional generation, depending on the mask location. To that end, we present the Regression Transformer (RT), an XLNet-based language model that can be trained on numerically labeled text datasets. We introduce a scheme to convert floats of arbitrary precision into a sequence of tokens and then devise numerical encodings that preserve the distances between digits in the embedding space. Focusing on chemical languages, we propose an alternating training scheme to concurrently optimize property prediction (PP) and text generation, and we extend the XLNet objective with a self-consistency loss. Our results on several synthetic and real-world molecular PP datasets demonstrate that the generality of self-supervised pretraining extends to numerically labeled datasets. In particular, the performance of traditional regression models can be surpassed by encoding numerals as tokens and training with a cross-entropy loss. Importantly, priming the same model with continuous properties encoded as tokens naturally yields a conditional generative model that proves useful for property-driven, local exploration of the chemical space.
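To make the two core ideas of the abstract concrete, the sketch below illustrates (i) how a float of arbitrary precision could be split into place-aware digit tokens and (ii) how a toy numerical encoding could make embedding distances mirror numeric distances. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names (`tokenize_float`, `numerical_encoding`), the token format, and the encoding formula are hypothetical.

```python
def tokenize_float(value: float, precision: int = 3) -> list[str]:
    """Turn a non-negative float such as 12.34 into place-aware digit tokens,
    e.g. ['_1_1_', '_2_0_', '_3_-1_', '_4_-2_'].
    Each token records a digit and its decimal place, so numbers of
    arbitrary precision become ordinary sequence tokens."""
    int_part, frac_part = f"{value:.{precision}f}".split(".")
    frac_part = frac_part.rstrip("0")  # drop trailing zeros of the fraction only
    tokens = []
    # integer digits occupy places len(int_part)-1 ... 0
    for i, d in enumerate(int_part):
        place = len(int_part) - 1 - i
        tokens.append(f"_{d}_{place}_")
    # fractional digits occupy places -1, -2, ...
    for i, d in enumerate(frac_part):
        tokens.append(f"_{d}_{-(i + 1)}_")
    return tokens


def numerical_encoding(token: str, dim: int = 8) -> list[float]:
    """A toy distance-preserving encoding (an assumption, not the paper's
    exact formula): the vector scales linearly with digit * 10**place, so
    the distance between two digit embeddings reflects their numeric gap."""
    _, digit, place, _ = token.split("_")
    magnitude = int(digit) * 10 ** int(place)
    return [magnitude / (j + 1) for j in range(dim)]


if __name__ == "__main__":
    toks = tokenize_float(12.34)
    print(toks)                               # ['_1_1_', '_2_0_', '_3_-1_', '_4_-2_']
    print(numerical_encoding(toks[0], dim=4)) # [10.0, 5.0, 3.33..., 2.5]
```

Because every digit token carries its decimal place, masking such tokens lets the same sequence model either predict a property (regress the masked numeral) or generate text conditioned on a given numeral, depending on where the mask is placed.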