Dec 2, 2024
11:30am - 11:45am
Sheraton, Second Floor, Constitution B
Matilda Sipilä1, Farrokh Mehryary1, Sampo Pyysalo1, Filip Ginter1, Milica Todorović1
University of Turku1
Scientific text is a promising source of data in materials science, and there is ongoing research on how to use textual data in materials discovery. The recent success of transformer-based language models has led to new machine learning tools, such as question answering (QA), that can be applied to information extraction (IE) from scientific literature. QA models are large language models (here, BERT models) tuned for an IE task that is carried out by asking a question in natural language. The potential of the QA method lies in its versatility, accessibility and scalability. Natural-language queries make it easy to use, even for researchers with no previous knowledge of language technology. In addition, the QA model does not need to be retrained to extract information about different materials and properties.

We explored the IE performance of the QA method on the task of extracting bandgap values of halide perovskite materials from scientific literature. We tested five different BERT models and found that the MatBERT model produced the best results. Compared to the more established IE tool ChemDataExtractor2, the QA method performed well, and we were able to collect correct bandgap values from text. The extracted information will next be used to map the space of materials properties and to find promising new materials solutions. We implemented this method in a web application to make the QA tool more widely available. Through this work, we seek to lower the barriers for non-experts to use large language models for IE and to help democratize the use of language technology in materials research.
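
The QA-based extraction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the question wording, the function names (`build_question`, `extract_bandgap`, `stub_qa`), the confidence threshold and the regular expression are all assumptions made for illustration. The sketch assumes a QA model exposed as a callable that takes `question` and `context` and returns a dict with `answer` and `score` keys, which is the shape returned by a Hugging Face question-answering pipeline. A simple stand-in is used here so the example runs without downloading any model.

```python
import re
from typing import Callable, Optional

# Matches a number followed by "eV", e.g. "1.55 eV" or "2.3eV" (illustrative pattern).
_EV_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*eV\b")


def build_question(material: str) -> str:
    """Phrase the extraction task as a plain-language question (hypothetical wording)."""
    return f"What is the bandgap of {material}?"


def extract_bandgap(
    qa: Callable[..., dict],
    material: str,
    context: str,
    min_score: float = 0.5,  # assumed threshold, not taken from the abstract
) -> Optional[float]:
    """Ask the QA model for the bandgap and parse the answer span as a value in eV.

    `qa` is any callable accepting `question=` and `context=` and returning a
    dict with "answer" and "score" keys.
    """
    result = qa(question=build_question(material), context=context)
    if result["score"] < min_score:
        return None  # the model is not confident that the text answers the question
    match = _EV_PATTERN.search(result["answer"])
    return float(match.group(1)) if match else None


def stub_qa(question: str, context: str) -> dict:
    """Stand-in for a real QA model so that the sketch runs offline.

    It ignores the question and returns the first "<number> eV" span in the context.
    """
    match = _EV_PATTERN.search(context)
    if match is None:
        return {"answer": "", "score": 0.0}
    return {"answer": match.group(0), "score": 0.9}


if __name__ == "__main__":
    text = "The MAPbI3 film shows a bandgap of 1.55 eV."
    print(extract_bandgap(stub_qa, "MAPbI3", text))
```

With the `transformers` library installed, a question-answering pipeline built from a suitable BERT checkpoint could be passed in place of `stub_qa`. Changing the material only changes the question string, which reflects the point that no retraining is needed to query different materials.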