Not as Simple as We Thought: A Rigorous Examination of Data Aggregation in Materials Informatics

When and Where

Apr 24, 2024
9:00am - 9:30am

Room 320, Level 3, Summit

Presenter(s)

Taylor Sparks

Federico Ottomano

Giovanni De Felice

Vladimir Gusev

Co-Author(s)

Taylor Sparks¹,², Federico Ottomano², Giovanni De Felice², Vladimir Gusev²

University of Utah¹, University of Liverpool²

Abstract

Recent developments in machine learning (ML) have opened new possibilities for materials research. However, because ML estimators are statistical in nature, their performance depends heavily on the quality of the training data, which in materials informatics is severely limited and fragmented. Here, we investigate whether state-of-the-art ML models for property prediction can benefit from aggregating different datasets. We probe three aggregation strategies in which we prioritize training size, element diversity, and composition diversity using novelty scores from the DiSCoVeR algorithm. Surprisingly, our results consistently show that both simple and refined data aggregation strategies reduce performance, which suggests caution when merging different experimental data sources. To guide how the training set is enlarged, we compare DiSCoVeR-based selection, which prioritizes chemical diversity, with random selection. Our results show that targeting novel chemistries is not beneficial when building a training dataset.
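
As a rough illustration of the comparison described above (novelty-guided versus random selection of additional training points), the following Python sketch ranks candidate compositions by a simple stand-in novelty score. It does not reproduce the DiSCoVeR algorithm or the authors' pipeline: the score used here (L1 distance in element-fraction space to the nearest composition already in the base set), the function names, and the toy compositions are all assumptions made for illustration only.

```python
import random

def fractions(composition):
    """Normalize a {element: amount} dict to element fractions."""
    total = sum(composition.values())
    return {el: amt / total for el, amt in composition.items()}

def distance(a, b):
    """L1 distance between two element-fraction dicts."""
    elements = set(a) | set(b)
    return sum(abs(a.get(el, 0.0) - b.get(el, 0.0)) for el in elements)

def novelty(candidate, base):
    """Stand-in novelty score (assumption, not DiSCoVeR):
    distance from the candidate to its nearest neighbor in the base set."""
    return min(distance(candidate, b) for b in base)

def select_novel(base, pool, k):
    """Pick the k pool compositions that are most novel relative to base."""
    base_f = [fractions(c) for c in base]
    ranked = sorted(pool, key=lambda c: novelty(fractions(c), base_f),
                    reverse=True)
    return ranked[:k]

def select_random(pool, k, seed=0):
    """Baseline: pick k pool compositions uniformly at random."""
    return random.Random(seed).sample(pool, k)

# Hypothetical toy data: a base training set and a pool from a second source.
base = [{"Fe": 2, "O": 3}, {"Ti": 1, "O": 2}]
pool = [{"Fe": 1, "O": 1}, {"Ga": 1, "As": 1}, {"Ti": 2, "O": 3}]

novel_pick = select_novel(base, pool, 1)   # most chemically distant candidate
random_pick = select_random(pool, 1)       # random baseline of the same size
```

In a study like the one summarized here, each selected subset would be appended to the base training set and a property-prediction model retrained and evaluated on a fixed test set, so that the two selection rules can be compared at equal training size.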

Symposium Organizers

Raymundo Arroyave, Texas A&M Univ

Elif Ertekin, University of Illinois at Urbana-Champaign

Rodrigo Freitas, Massachusetts Institute of Technology

Aditi Krishnapriyan, UC Berkeley

Session Chairs

Aditi Krishnapriyan

Wennie Wang

Symposium Supporters

2024 MRS Spring Meeting & Exhibit