Synthesizing Multimodal Experimental Datasets from Scientific Literature of Materials Science

When and Where

Dec 3, 2024
4:00pm - 4:15pm

Sheraton, Second Floor, Constitution B

Presenter(s)

Vipul Gupta

Florian Pyczak

Ingo Schmitt

Co-Author(s)

Vipul Gupta¹,², Florian Pyczak¹,², Ingo Schmitt²

Helmholtz-Zentrum Hereon¹, Brandenburgische Technische Universität Cottbus-Senftenberg²

Abstract

Recent developments in data mining have received significant attention across scientific communities for their potential to advance research. Experimental datasets of research findings are usually published in scientific literature. Mining this literature therefore allows the combined experimental datasets to be evaluated together, which enables the discovery of synergistic effects and meaningful insights. Machine-readable collections of experimental datasets drawn from the relevant literature are thus essential for knowledge discovery. Unfortunately, no existing tool or digital library provides such collections. Creating them demands: i) highly specific searches to identify relevant literature, and ii) non-trivial extraction of experimental datasets, which follow complex patterns and appear in multimodal representations such as text, tables, and scatter plots. For example, within materials science, it is not possible with existing tools or digital libraries to create a collection that exclusively contains experimental datasets on a specific mechanical property of a particular alloy system.

This work introduces a scientific literature data mining platform designed to address these challenges. It supports automatic ingestion of literature from digital libraries through federated search, followed by retrieval of the relevant literature. In addition to phrase, faceted, full-text, conjunctive, and disjunctive search, the implemented information retrieval system allows dataset-aware literature retrieval based on the metadata of visual elements. This metadata comprises the type of each visual element, assigned according to its content and characteristics, together with its caption text. Moreover, the platform enables semi-automatic extraction of experimental datasets from the identified relevant literature. In particular, it employs plot digitisation and deep learning-based techniques to extract named entities (e.g., temperature, stress, and microstructure) and events (e.g., the thermal history of a specimen) from both the text corpus and visual elements. Furthermore, the platform aids in creating curated datasets that can be used for exploratory data analysis and predictive modelling. This presentation emphasizes the features and applicability of the platform within materials science, exemplified by a use case that creates a minimum creep rate dataset for a gamma titanium aluminide system.
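
To make the idea of dataset-aware retrieval more concrete, the following is a minimal Python sketch of filtering literature by visual element metadata (element type plus caption text), combined with conjunctive and disjunctive term matching. It is not the platform's implementation: the class names, the `dataset_aware_search` function, the element type labels, and the sample records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VisualElement:
    kind: str      # hypothetical type label, e.g. "scatter_plot" or "table"
    caption: str   # caption text of the visual element


@dataclass
class Paper:
    title: str
    elements: list[VisualElement] = field(default_factory=list)


def dataset_aware_search(papers, kind, all_terms=(), any_terms=()):
    """Return papers holding at least one visual element of type `kind`
    whose caption contains every term in `all_terms` (conjunctive) and,
    if `any_terms` is non-empty, at least one of them (disjunctive)."""

    def matches(element):
        caption = element.caption.lower()
        if element.kind != kind:
            return False
        if not all(term.lower() in caption for term in all_terms):
            return False
        if any_terms and not any(term.lower() in caption for term in any_terms):
            return False
        return True

    return [p for p in papers if any(matches(e) for e in p.elements)]


# Made-up records, used only to exercise the sketch.
PAPERS = [
    Paper("Paper A", [VisualElement(
        "scatter_plot", "Minimum creep rate versus applied stress at 700 °C")]),
    Paper("Paper B", [VisualElement(
        "table", "Chemical composition of the investigated alloys")]),
    Paper("Paper C", [VisualElement(
        "scatter_plot", "Hardness as a function of ageing time")]),
]

hits = dataset_aware_search(
    PAPERS,
    kind="scatter_plot",
    all_terms=("creep",),
    any_terms=("stress", "temperature"),
)
print([p.title for p in hits])
```

In this sketch, only papers whose scatter plot captions mention creep together with stress or temperature are returned, which mirrors the kind of narrowing the abstract describes for building a property-specific collection. A real system would presumably rely on an indexed search backend rather than a linear scan over in-memory records.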

Symposium Organizers

Deepak Kamal, Syensqo

Christopher Kuenneth, University of Bayreuth

Antonia Statt, University of Illinois

Milica Todorović, University of Turku

Symposium Support

Bronze
Matter

Session Chairs

Christopher Kuenneth

Milica Todorović

2024 MRS Fall Meeting & Exhibit