April 7 - 11, 2025
Seattle, Washington
2025 MRS Spring Meeting & Exhibit
MT03.10.02

LLMs for FAIR Materials Data Curation

When and Where

Apr 11, 2025
10:45am - 11:00am
Summit, Level 4, Room 422

Presenter(s)

Co-Author(s)

Catherine Brinson¹, Defne Circi¹, Bhuwan Dhingra¹

¹Duke University

Abstract

Advances in materials science require leveraging past findings and data from the vast published literature. Large language models (LLMs) and vision-language models (VLMs) offer transformative potential to systematically convert unstructured textual, tabular, and graphical information embedded within articles into structured, analyzable formats. Despite their promise, the capability of these models to extract information from hybrid materials science articles, which often include tables alongside text, remains underexplored. Furthermore, the scarcity of annotated datasets poses a significant barrier to progress in this domain, particularly for charts, which contain the densest information. To address this gap, we introduce an automated framework that evaluates the quality of information extraction from hybrid articles and charts. In addition, we propose benchmark datasets to support and standardize future research. To overcome the challenge of limited data availability for training, we also develop a method for synthetically generating chart datasets. We aim to fine-tune a pretrained image-to-text model on materials science figures with complete and consistent annotations to demonstrate the efficiency of our synthetic data generation.
Our results emphasize the importance of multimodal datasets and benchmarks in advancing the application of LLMs and VLMs for scientific research. By bridging gaps in data accessibility and enabling robust evaluations, this work contributes to the acceleration of materials discovery and highlights the broader potential of LLM-driven knowledge extraction in scientific fields.
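
As an illustration of what automated evaluation of extraction quality can look like, the sketch below scores extracted numeric values against a reference annotation using field-level precision, recall, and F1. The abstract does not state which metrics or data layout the authors use, so the function name `score_extraction`, the `(sample, property)` keys, and the 1% relative tolerance are all assumptions made for this example only.

```python
from typing import Dict, Tuple

# Hypothetical layout: each record maps (sample name, property name) to a
# numeric value. This is an assumption for illustration, not the authors' schema.
Record = Dict[Tuple[str, str], float]


def score_extraction(
    predicted: Record, reference: Record, rel_tol: float = 0.01
) -> Dict[str, float]:
    """Return field-level precision, recall, and F1 for extracted values.

    A predicted value counts as correct when its key exists in the reference
    and the value lies within `rel_tol` (relative) of the reference value.
    """
    correct = 0
    for key, pred_value in predicted.items():
        ref_value = reference.get(key)
        if ref_value is None:
            continue  # extracted a field that the reference does not contain
        if abs(pred_value - ref_value) <= rel_tol * max(abs(ref_value), 1e-12):
            correct += 1

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Keying on (sample, property) pairs keeps the comparison independent of the order in which a model emits values. A relative tolerance is one simple way to absorb small reading errors when values are recovered from charts.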

Symposium Organizers

Qian Yang, University of Connecticut
Tuan Anh Pham, Lawrence Livermore National Laboratory
Victor Fung, Georgia Institute of Technology
James Chapman, Boston University

Session Chairs

James Chapman
Victor Fung
N M Anoop Krishnan
