MRS Meetings and Events

DS01.15.03 2022 MRS Spring Meeting

Identification of Enzymatic Active Sites with Unsupervised Language Modelling

When and Where

May 23, 2022
11:15am - 11:30am

DS01-Virtual

Co-Author(s)

Matteo Manica¹, Loïc Kwate Dassi¹, Daniel Probst¹, Philippe Schwaller¹, Yves Nana Teukam¹, Teodoro Laino¹

¹ IBM Research Europe

Abstract

The first decade of genome sequencing saw a surge in the characterisation of proteins with unknown functionality. Even so, more than 20% of proteins in well-studied model animals have yet to be identified, making the discovery of their active sites one of biology's greatest puzzles. Herein, we apply a Transformer architecture to a language representation of bio-catalysed chemical reactions to learn the signal underlying the atomic interactions between substrate and active site. The language representation comprises a reaction simplified molecular-input line-entry system (SMILES) string for substrates and products, complemented with amino acid (AA) sequence information for the enzyme. We demonstrate that, by creating a custom tokenizer and a score based on attention values, we can capture the substrate-active site interaction signal and use it to determine the active site position in unknown protein sequences, unravelling complicated 3D interactions using just 1D representations.

We consider a Transformer-based model, BERT, trained with different losses, and analyse its performance in comparison with statistical baselines and methods based on sequence alignments.

This approach recovers, with no supervision, 31.51% of the active site when co-crystallised substrate-enzyme structures are taken as ground truth, largely outperforming sequence alignment-based approaches. Our findings are further corroborated by docking simulations on the 3D structures of a few enzymes. This work confirms the unprecedented impact of natural language processing, and more specifically of the Transformer architecture, on domain-specific languages, paving the way to effective solutions for protein functional characterisation and bio-catalysis engineering.
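To make the attention-based scoring idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the tokenizer rules, how attention is aggregated, or how recovery is measured. The sketch assumes that the reaction SMILES tokens come first in the input, followed by one token per amino acid, that the attention matrix has already been averaged over heads and layers, and that a residue's score is the total attention it receives from the SMILES tokens. The function names (`tokenize`, `residue_scores`, `top_k_residues`, `recovered_fraction`) and the simplified SMILES pattern are hypothetical.

```python
import re
import numpy as np

# Simplified SMILES token pattern (an assumption, not the paper's tokenizer):
# bracket atoms, two-letter halogens, the reaction arrow, then single symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|>>|[A-Za-z]|\d|[=#+\-()./\\@%]")


def tokenize(reaction_smiles, aa_sequence):
    """Split a reaction SMILES into tokens and an enzyme into one token per residue."""
    smiles_tokens = SMILES_TOKEN.findall(reaction_smiles)
    aa_tokens = list(aa_sequence)
    return smiles_tokens, aa_tokens


def residue_scores(attention, n_smiles):
    """Sum the attention that SMILES tokens (rows) pay to each residue (columns).

    `attention` is assumed to be a square (n_tokens, n_tokens) matrix, already
    averaged over heads and layers, with SMILES tokens first and residues after.
    """
    attention = np.asarray(attention, dtype=float)
    return attention[:n_smiles, n_smiles:].sum(axis=0)


def top_k_residues(scores, k):
    """Return the sorted indices of the k highest-scoring residues."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return sorted(int(i) for i in order[:k])


def recovered_fraction(predicted, true_site):
    """Fraction of the reference active-site residues found among the predictions."""
    true_set = set(true_site)
    if not true_set:
        raise ValueError("true_site must contain at least one residue index")
    return len(set(predicted) & true_set) / len(true_set)


if __name__ == "__main__":
    # Toy input: a random matrix stands in for attention from a trained model.
    smiles_tokens, aa_tokens = tokenize("CC(=O)O>>CCO", "MKVLAHG")
    n_tokens = len(smiles_tokens) + len(aa_tokens)
    rng = np.random.default_rng(0)
    attention = rng.random((n_tokens, n_tokens))
    attention /= attention.sum(axis=1, keepdims=True)  # rows sum to 1, like softmax
    scores = residue_scores(attention, len(smiles_tokens))
    predicted = top_k_residues(scores, k=3)
    print("predicted residue indices:", predicted)
    print("recovered fraction:", recovered_fraction(predicted, [2, 5]))
```

With a trained model, the random matrix would be replaced by the model's own attention weights, and the reference residue indices would come from co-crystallised substrate-enzyme structures. The choice of which layers and heads to aggregate, and of the cutoff `k`, is left open here because the abstract does not state it.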

Symposium Organizers

Mathieu Bauchy, University of California, Los Angeles
Mathew Cherukara, Argonne National Laboratory
Grace Gu, University of California, Berkeley
Badri Narayanan, University of Louisville
