Jannis Born1,2, Matteo Manica1
IBM Research Europe1, ETH Zürich2
Transformer-based models lack an intrinsic way of representing numerals as tokens. Hence, the benefits of large-scale self-supervised pretraining do not yet extend to text datasets with quantitative numerical labels. However, efficiently encoding continuous properties jointly with sentences would open the door for "Swiss army knife" autoregressive Transformers that concurrently perform property prediction and conditional generation, depending on the mask location. To that end, we present the Regression Transformer (RT), an XLNet-based language model that can be trained on numerically labeled text datasets. We introduce a scheme to convert floats of arbitrary precision into a sequence of tokens and then devise numerical encodings that preserve the distances between digits in the embedding space. Focusing on chemical languages, we propose an alternating training scheme to concurrently optimize property prediction (PP) and text generation, and we extend the XLNet objective with a self-consistency loss. Our results on several synthetic and real-world molecular PP datasets demonstrate that the generality of self-supervised pretraining extends to numerically labeled datasets. In particular, the performance of traditional regression models can be surpassed by encoding numerals as tokens and training with a cross-entropy loss. Importantly, priming the same model with continuous properties encoded as tokens naturally yields a conditional generative model that proves useful for property-driven, local exploration of the chemical space.
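To make the two core ideas of the abstract concrete, the sketch below illustrates (i) how a float of arbitrary precision could be split into place-aware digit tokens and (ii) how a toy numerical encoding could make embedding distances mirror numeric distances. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names (`tokenize_float`, `numerical_encoding`), the token format, and the encoding formula are hypothetical.

```python
def tokenize_float(value: float, precision: int = 3) -> list[str]:
    """Turn a non-negative float such as 12.34 into place-aware digit tokens,
    e.g. ['_1_1_', '_2_0_', '_3_-1_', '_4_-2_'].
    Each token records a digit and its decimal place, so numbers of
    arbitrary precision become ordinary sequence tokens."""
    int_part, frac_part = f"{value:.{precision}f}".split(".")
    frac_part = frac_part.rstrip("0")  # drop trailing zeros of the fraction only
    tokens = []
    # integer digits occupy places len(int_part)-1 ... 0
    for i, d in enumerate(int_part):
        place = len(int_part) - 1 - i
        tokens.append(f"_{d}_{place}_")
    # fractional digits occupy places -1, -2, ...
    for i, d in enumerate(frac_part):
        tokens.append(f"_{d}_{-(i + 1)}_")
    return tokens


def numerical_encoding(token: str, dim: int = 8) -> list[float]:
    """A toy distance-preserving encoding (an assumption, not the paper's
    exact formula): the vector scales linearly with digit * 10**place, so
    the distance between two digit embeddings reflects their numeric gap."""
    _, digit, place, _ = token.split("_")
    magnitude = int(digit) * 10 ** int(place)
    return [magnitude / (j + 1) for j in range(dim)]


if __name__ == "__main__":
    toks = tokenize_float(12.34)
    print(toks)                               # ['_1_1_', '_2_0_', '_3_-1_', '_4_-2_']
    print(numerical_encoding(toks[0], dim=4)) # [10.0, 5.0, 3.33..., 2.5]
```

Because every digit token carries its decimal place, masking such tokens lets the same sequence model either predict a property (regress the masked numeral) or generate text conditioned on a given numeral, depending on where the mask is placed.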