Sonakshi Gupta1, Pranav Shetty1, Aishat Adeboye1, Rampi Ramprasad1
Georgia Institute of Technology1
Polymer informatics has made great strides in recent years in predicting polymer properties and designing new materials. These data-driven models depend on curated data, which often must be painstakingly extracted by hand from a rapidly growing corpus of journal articles. Data curators and materials scientists who search this literature for material property information face an uphill task.<br/><br/>In this work, we present a pipeline that leverages large language models to extract material property information from the text of journal articles. We frame extraction as a text-completion problem: the text containing material property data and a prompt with the relevant instructions are passed to the GPT-3.5 model, accessed through the OpenAI API. An example prompt looks like ‘Extract all bandgap values from the following text in json format: ...’. For this prompt, the language model returns each material and its property value as a dictionary. We use few-shot prompting, in which a few representative examples are selected and supplied to the model as input-output pairs within the prompt. This specifies the format of the data to be extracted and improves extraction performance. We benchmarked our method on two datasets of abstracts, one containing polymer glass transition temperatures and the other bandgaps, and show that it outperforms fully supervised information extraction that combines named entity recognition with heuristic rules for relation extraction. The resulting method was then applied to a corpus of 2.6 million materials science articles to extract all polymer glass transition temperature and bandgap values recorded therein.
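The few-shot prompting step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' pipeline: the helper names `build_prompt` and `parse_completion`, the record keys `material` and `value`, and the example sentences are all assumptions introduced here, and the call to the language model itself is deliberately left out.

```python
import json

# Hypothetical few-shot examples: (input text, expected output records).
# The sentences and values are placeholders, not data from the paper.
FEW_SHOT_EXAMPLES = [
    ("Polymer A exhibits a bandgap of 1.8 eV.",
     [{"material": "polymer A", "value": "1.8 eV"}]),
    ("Films of polymer B showed a bandgap of 2.4 eV.",
     [{"material": "polymer B", "value": "2.4 eV"}]),
]


def build_prompt(property_name, text, examples):
    """Assemble a few-shot text-completion prompt.

    Each example is rendered as instruction + text + JSON answer, and the
    final block ends after the new text so the model completes the answer.
    """
    instruction = (
        f"Extract all {property_name} values from the following text "
        "in json format:"
    )
    parts = [
        f"{instruction}\n{ex_text}\n{json.dumps(ex_records)}"
        for ex_text, ex_records in examples
    ]
    parts.append(f"{instruction}\n{text}\n")
    return "\n\n".join(parts)


def parse_completion(completion):
    """Parse the model's completion into a list of material/value records.

    Returns an empty list when the completion is not valid JSON, and keeps
    only dictionaries that carry both assumed keys.
    """
    try:
        records = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if isinstance(records, dict):
        records = [records]  # tolerate a single record instead of a list
    if not isinstance(records, list):
        return []
    return [
        r for r in records
        if isinstance(r, dict) and "material" in r and "value" in r
    ]
```

In use, the string returned by `build_prompt("bandgap", abstract_text, FEW_SHOT_EXAMPLES)` would be sent to the model, and the returned completion passed to `parse_completion`. Validating the parsed records is a design choice made for this sketch, since a text-completion model is not guaranteed to return well-formed JSON.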