Jiwoo Choi<sup>1,2</sup>, Kihoon Bang<sup>1</sup>, Suji Jang<sup>1</sup>, Kwang-Ryeol Lee<sup>1</sup>, Sang Soo Han<sup>1</sup>, Donghun Kim<sup>1</sup>
Korea Institute of Science and Technology<sup>1</sup>, Korea University<sup>2</sup>
In recent years, big data and artificial intelligence have permeated materials science research. Most openly available materials databases, such as the Materials Project, the Novel Materials Discovery (NOMAD) repository, and the Open Quantum Materials Database (OQMD), contain results derived from computer simulations rather than from experiments, and building a large-scale experimental materials database remains difficult. In this context, the scientific literature is an underutilized data source: it contains well-organized, easily accessible experimental data, and natural language processing (NLP) makes it possible to extract such data from a huge volume of materials science publications automatically.<br/> Among the many research topics in materials science, CO<sub>2</sub> reduction reaction (CO<sub>2</sub>RR) catalysis is a particularly attractive target for NLP. CO<sub>2</sub>RR catalysis, which converts carbon dioxide into valuable compounds, could help alleviate today’s energy and environmental crises. Although a large volume of CO<sub>2</sub>RR studies has been published, a corresponding experimental database has not yet been built. We aim to build large-scale experimental databases using a variety of NLP techniques and to mine them for research trends and insights that would benefit the relevant research community.<br/> In this work, we collected papers related to CO<sub>2</sub>RR and extracted key entities from them via named entity recognition (NER). We provide a general method to crawl and screen papers of a user’s interest (here, CO<sub>2</sub> electrochemical reduction research) and to exclude noise papers by combining Doc2Vec with a Latent Dirichlet Allocation (LDA) model; as a result, we collected approximately 4,800 papers.
We then developed NER models based on long short-term memory (LSTM) networks and on bidirectional encoder representations from transformers (BERT). Applying these models to the abstracts of the collected papers, we extracted ten key entity types covering material names (catalyst, electrolyte, etc.) and catalytic performance metrics (Faradaic efficiency, current density, etc.). The average F1 score of the MatBERT-based approach exceeds 85%, greatly surpassing that of the LSTM-based approach and indicating that a context-aware approach is necessary. We also compared several BERT variants (BERT_base, SciBERT, MatSciBERT, and MatBERT); the comparison shows that the more materials domain knowledge a model incorporates, the better its performance. Lastly, we will discuss the trends and insights extracted from these NER studies in the CO<sub>2</sub>RR research field.