Dec 3, 2024
4:00pm - 4:15pm
Sheraton, Second Floor, Constitution B
Vipul Gupta¹˒², Florian Pyczak¹˒², Ingo Schmitt²
¹Helmholtz-Zentrum Hereon, ²Brandenburgische Technische Universität Cottbus-Senftenberg
Recent developments in the field of data mining have received significant attention across scientific communities for their potential to advance research. Experimental datasets of research findings are usually published in scientific literature. Mining such literature thus enables the discovery of synergistic effects and meaningful insights by evaluating the combined experimental datasets. The availability of machine-readable collections containing experimental datasets from relevant literature is therefore essential for knowledge discovery in scientific literature. Unfortunately, such collections are not provided by any existing tool or digital library. Creating these collections demands: i) highly specific searches to identify relevant literature, and ii) non-trivial extraction of experimental datasets, owing to complex patterns and multimodal representations such as text, tables, and scatter plots. For example, within the field of materials science, it is currently not possible with existing tools to create a collection that contains exclusively experimental datasets on a specific mechanical property of a particular alloy system.

This work introduces a scientific literature data mining platform designed to address these challenges. It facilitates federated search-based automatic ingestion of literature from digital libraries, followed by retrieval of the relevant literature. In addition to phrase, faceted, full-text, conjunctive, and disjunctive search capabilities, the implemented information retrieval system allows dataset-aware literature retrieval based on the metadata of visual elements. This metadata includes the type of each visual element, determined by its content and characteristics, along with its caption text. Moreover, the platform enables semi-automatic extraction of experimental datasets from the identified relevant literature.
In particular, it employs plot digitisation and deep learning-based techniques to extract named entities (e.g., temperature, stress, and microstructure) and events (e.g., the thermal history of a specimen) from both the text corpus and the visual elements. Furthermore, the platform aids in creating curated datasets that can be utilized for exploratory data analysis and predictive modelling. This presentation highlights the features and applicability of the platform within the materials science field, exemplified by a use case in which a minimum creep rate dataset is created for a gamma titanium aluminide system.
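
As an illustration of the plot digitisation step mentioned above, the following is a minimal sketch of axis calibration, which converts pixel positions of scatter-plot markers into data values. It is not the platform's implementation: the function names, calibration pixels, and axis ranges are all hypothetical, and it assumes the marker pixel positions have already been detected. A logarithmic axis is included because creep rates are commonly plotted on a log scale.

```python
import math


def make_axis_mapper(pixel_a, value_a, pixel_b, value_b, log_scale=False):
    """Build a pixel-to-value mapping for one axis from two calibration ticks.

    Hypothetical helper: (pixel_a, value_a) and (pixel_b, value_b) are two
    known tick positions on the axis and the data values they represent.
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration pixels must differ")
    if log_scale:
        if value_a <= 0 or value_b <= 0:
            raise ValueError("log-scale axis values must be positive")
        # Interpolate linearly in log10 space for logarithmic axes.
        value_a, value_b = math.log10(value_a), math.log10(value_b)
    slope = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        interpolated = value_a + slope * (pixel - pixel_a)
        return 10 ** interpolated if log_scale else interpolated

    return to_value


def digitise_points(pixel_points, x_mapper, y_mapper):
    """Convert (x_pixel, y_pixel) marker positions into data coordinates."""
    return [(x_mapper(px), y_mapper(py)) for px, py in pixel_points]


# Assumed calibration: a linear stress axis (MPa) and a logarithmic
# creep-rate axis (1/s). Image y-pixels grow downwards, so the larger
# value sits at the smaller pixel coordinate.
x_map = make_axis_mapper(100, 0.0, 500, 400.0)
y_map = make_axis_mapper(400, 1e-9, 100, 1e-6, log_scale=True)

points = digitise_points([(300, 250)], x_map, y_map)
```

Two calibration ticks per axis are enough for undistorted plots; a rotated or skewed scan would need more reference points and a full affine fit, which this sketch does not attempt.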