Apr 10, 2025
5:00pm - 7:00pm
Summit, Level 2, Flex Hall C
Sumin Lee¹, Hyekyung Choi¹, MinJong Noh², Jae-Hyuck Shim¹,³,⁴, Yunseok Kim¹,³,⁴
Sungkyunkwan University¹; Samsung SDI Co., Ltd²; Energy Materials Research Center, Korea Institute of Science and Technology (KIST)³; KIST-SKKU Carbon-Neutral Research Center, Sungkyunkwan University (SKKU)⁴
Recent progress in natural language processing (NLP), exemplified by the release of ChatGPT, has heightened interest in automating the retrieval and analysis of scientific publications. Despite this enthusiasm, researchers often lack a standardized framework for systematically collecting, processing, and leveraging data extracted from diverse articles. In this study, we introduce a relational database architecture optimized for storing scientific articles and supporting downstream NLP tasks. Our approach centers on a core “reference information” table that captures essential metadata (title, abstract, publication year, and DOI) and links each record to additional tables containing figure captions, table captions, and paragraph texts. This design simplifies data collection and promotes flexible model integration, enabling researchers to construct tailored NLP pipelines aligned with their specific research objectives. To demonstrate the utility of our framework, we assembled a dataset of 278 journal articles on hydrogen storage alloys and performed a natural language understanding (NLU) classification task with four distinct label classes defined from domain-specific knowledge. We then evaluated the performance of five transformer-based models (SciBERT, MatSciBERT, ChemicalSciBERT, BERT, and SBERT) on this classification task. Among these models, MatSciBERT achieved the highest accuracy, a result corroborated by multidimensional scaling (MDS) analyses. Finally, we show that this classification strategy reduces the need for full-scan searching in a vector database, thereby improving the efficiency of retrieval-augmented generation (RAG) processes. Taken together, these findings underscore the broad applicability and potential impact of our proposed framework in materials research and beyond.
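The table layout described in the abstract can be illustrated with a minimal sketch using Python's built-in sqlite3 module. This is not the authors' implementation: every table name, column name, and function name below is hypothetical, the `label` column is an assumed place to store a predicted class, and the sample records are placeholders rather than data from the study. The sketch shows a central reference table linked to caption and paragraph tables, plus a query that narrows the candidate paragraphs to one class before any vector search is run.

```python
import sqlite3

# Hypothetical schema: one core metadata table plus three linked text tables.
SCHEMA = """
CREATE TABLE reference_information (
    ref_id           INTEGER PRIMARY KEY,
    title            TEXT NOT NULL,
    abstract         TEXT,
    publication_year INTEGER,
    doi              TEXT UNIQUE,
    label            TEXT              -- assumed column for a predicted class
);
CREATE TABLE figure_captions (
    caption_id INTEGER PRIMARY KEY,
    ref_id     INTEGER NOT NULL REFERENCES reference_information(ref_id),
    caption    TEXT NOT NULL
);
CREATE TABLE table_captions (
    caption_id INTEGER PRIMARY KEY,
    ref_id     INTEGER NOT NULL REFERENCES reference_information(ref_id),
    caption    TEXT NOT NULL
);
CREATE TABLE paragraphs (
    paragraph_id INTEGER PRIMARY KEY,
    ref_id       INTEGER NOT NULL REFERENCES reference_information(ref_id),
    position     INTEGER NOT NULL,
    body         TEXT NOT NULL
);
"""


def build_database(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_article(conn, title, abstract, year, doi, label=None,
                paragraphs=(), figure_captions=(), table_captions=()):
    """Insert one article and its linked text records; return its ref_id."""
    cur = conn.execute(
        "INSERT INTO reference_information "
        "(title, abstract, publication_year, doi, label) VALUES (?, ?, ?, ?, ?)",
        (title, abstract, year, doi, label),
    )
    ref_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO paragraphs (ref_id, position, body) VALUES (?, ?, ?)",
        [(ref_id, i, text) for i, text in enumerate(paragraphs)],
    )
    conn.executemany(
        "INSERT INTO figure_captions (ref_id, caption) VALUES (?, ?)",
        [(ref_id, text) for text in figure_captions],
    )
    conn.executemany(
        "INSERT INTO table_captions (ref_id, caption) VALUES (?, ?)",
        [(ref_id, text) for text in table_captions],
    )
    conn.commit()
    return ref_id


def paragraphs_for_label(conn, label):
    """Return paragraph texts only from articles carrying the given label."""
    rows = conn.execute(
        "SELECT p.body FROM paragraphs AS p "
        "JOIN reference_information AS r ON r.ref_id = p.ref_id "
        "WHERE r.label = ? ORDER BY r.ref_id, p.position",
        (label,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_database()
    # Placeholder records, not data from the study.
    add_article(conn, "Example article A", "Placeholder abstract.", 2020,
                "10.0000/example-a", label="class_a",
                paragraphs=["First paragraph of A.", "Second paragraph of A."],
                figure_captions=["Placeholder figure caption."])
    add_article(conn, "Example article B", "Placeholder abstract.", 2021,
                "10.0000/example-b", label="class_b",
                paragraphs=["Only paragraph of B."])
    print(paragraphs_for_label(conn, "class_a"))
```

In a retrieval-augmented generation setting, one plausible use of `paragraphs_for_label` is to embed and compare only the returned subset instead of scanning every stored vector. How the classification output is actually wired into the vector database is not specified in the abstract.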