Jiwoo Choi<sup>1,2</sup>, Kihoon Bang<sup>1</sup>, Suji Jang<sup>1</sup>, Kwang-Ryeol Lee<sup>1</sup>, Sang Soo Han<sup>1</sup>, Donghun Kim<sup>1</sup>
Korea Institute of Science and Technology<sup>1</sup>, Korea University<sup>2</sup>
In recent years, big data and artificial intelligence have permeated materials science research. Most openly available materials databases, such as the Materials Project, the Novel Materials Discovery (NOMAD) repository, and the Open Quantum Materials Database (OQMD), contain results derived from computer simulations rather than from experiments, and building a large-scale experimental materials database remains difficult. In this context, the scientific literature is an underutilized data source: it contains well-organized, easily accessible experimental data, and natural language processing (NLP) makes it possible to extract such data from a huge volume of materials science publications automatically.<br/> Among the many research topics in materials science, CO<sub>2</sub> reduction reaction (CO<sub>2</sub>RR) catalysis is a particularly attractive target for NLP. CO<sub>2</sub>RR catalysis, which converts carbon dioxide into valuable compounds, could help alleviate today’s energy and environmental crises. Although a large volume of CO<sub>2</sub>RR studies has been published, a corresponding experimental database has not yet been built. We aim to build large-scale experimental databases using a variety of NLP techniques and to mine them for research trends and insights that would benefit the relevant research community.<br/> In this work, we collected papers related to CO<sub>2</sub>RR and extracted key entities from them via named entity recognition (NER). We provide a general method to crawl and screen papers of a user’s interest (here, CO<sub>2</sub> electrochemical reduction research) and to exclude noise papers by combining Doc2Vec with a Latent Dirichlet Allocation (LDA) model; as a result, we collected approximately 4,800 papers.
We then developed NER models based on long short-term memory (LSTM) networks and on bidirectional encoder representations from transformers (BERT). Applying these models to the abstracts of the collected papers, we extracted ten key entity types covering material names (catalyst, electrolyte, etc.) and catalytic performance metrics (Faradaic efficiency, current density, etc.). The average F1 score of the MatBERT-based approach exceeds 85%, greatly surpassing that of the LSTM-based approach and indicating that a context-aware approach is necessary. We also compared several BERT variants (BERT_base, SciBERT, MatSciBERT, and MatBERT); the comparison shows that the more materials domain knowledge a model incorporates, the better its performance. Lastly, we will discuss the trends and insights extracted from these NER studies in the CO<sub>2</sub>RR research field.