Apr 11, 2025
10:45am - 11:00am
Summit, Level 4, Room 422
Catherine Brinson1, Defne Circi1, Bhuwan Dhingra1
Duke University1
Advances in materials science require leveraging past findings and data from the vast published literature. Large language models (LLMs) and vision-language models (VLMs) offer transformative potential to systematically convert unstructured textual, tabular, and graphical information embedded within articles into structured, analyzable formats. Despite their promise, the ability of these models to extract information from hybrid materials science articles, which often interleave tables with text, remains underexplored. Furthermore, the scarcity of annotated datasets, particularly for charts, which often carry the densest information, poses a significant barrier to progress in this domain. To address this gap, we introduce an automated framework that evaluates the quality of information extraction from hybrid articles and charts. In addition, we propose benchmark datasets to support and standardize future research. To overcome the challenge of limited training data, we also develop a method for synthetically generating chart datasets. We aim to fine-tune a pretrained image-to-text model on materials science figures with complete and consistent annotations to demonstrate the effectiveness of our synthetic data generation.
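As a rough illustration of what such synthetic chart generation could look like, the sketch below renders simple matplotlib plots and saves a paired ground-truth record for each one; the plot type, axis labels, value ranges, and JSON schema are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch: generate synthetic charts with paired ground-truth
# annotations, suitable as supervision for an image-to-text extraction model.
# Plot style, quantities, and the annotation schema are assumptions.
import json
import random
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt


def make_synthetic_chart(idx: int, out_dir: Path) -> dict:
    """Render one synthetic line chart and return its ground-truth record."""
    # Hypothetical materials-style quantities for the axes.
    n_points = random.randint(5, 15)
    x = sorted(random.uniform(0.0, 10.0) for _ in range(n_points))
    y = [round(2.0 * xi + random.gauss(0.0, 1.0), 3) for xi in x]

    fig, ax = plt.subplots(figsize=(4, 3), dpi=150)
    ax.plot(x, y, marker="o", linestyle="-")
    ax.set_xlabel("Filler loading (wt%)")
    ax.set_ylabel("Tensile modulus (GPa)")
    ax.set_title(f"Sample {idx}")
    fig.tight_layout()

    image_path = out_dir / f"chart_{idx:05d}.png"
    fig.savefig(image_path)
    plt.close(fig)

    # Ground truth stored alongside the image; a model can be fine-tuned to
    # reproduce this structured record from the rendered pixels.
    return {
        "image": image_path.name,
        "x_label": "Filler loading (wt%)",
        "y_label": "Tensile modulus (GPa)",
        "series": [{"x": [round(v, 3) for v in x], "y": y}],
    }


if __name__ == "__main__":
    out_dir = Path("synthetic_charts")
    out_dir.mkdir(exist_ok=True)
    records = [make_synthetic_chart(i, out_dir) for i in range(100)]
    with open(out_dir / "annotations.json", "w") as f:
        json.dump(records, f, indent=2)
```

Image-annotation pairs produced this way could then serve as fully labeled training examples for fine-tuning a pretrained image-to-text model on chart-to-data extraction.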
Our results emphasize the importance of multimodal datasets and benchmarks in advancing the application of LLMs and VLMs for scientific research. By bridging gaps in data accessibility and enabling robust evaluations, this work contributes to the acceleration of materials discovery and highlights the broader potential of LLM-driven knowledge extraction in scientific fields.