April 22 - 26, 2024
Seattle, Washington
May 7 - 9, 2024 (Virtual)
2024 MRS Spring Meeting & Exhibit
MT01.05.02

Not as Simple as We Thought: A Rigorous Examination of Data Aggregation in Materials Informatics

When and Where

Apr 24, 2024
9:00am - 9:30am
Room 320, Level 3, Summit

Presenter(s)

Co-Author(s)

Taylor Sparks (1,2), Federico Ottomano (2), Giovanni De Felice (2), Vladimir Gusev (2)

(1) University of Utah, (2) University of Liverpool

Abstract

Recent developments in machine learning (ML) have opened new possibilities for materials research. However, because ML estimators are statistical in nature, their performance depends heavily on the quality of the training data, and datasets in materials informatics are severely limited and fragmented. Here, we investigate whether state-of-the-art ML models for property prediction can benefit from aggregating different datasets. We probe three aggregation strategies that prioritize training size, element diversity, and composition diversity, using novelty scores from the DiSCoVeR algorithm. Surprisingly, our results consistently show that both simple and refined aggregation strategies reduce performance, which suggests caution when merging different experimental data sources. To guide the increase in training-set size, we also compare selection with DiSCoVeR, which prioritizes chemical diversity, against random selection. The results show that targeting novel chemistries is not beneficial when building a training dataset.
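
The comparison between novelty-guided and random augmentation of a training set can be illustrated with a minimal sketch. This is not the DiSCoVeR algorithm and not the authors' code: the novelty score below is a simple stand-in (distance to the nearest training point in an assumed feature space), and the function names `novelty_scores`, `select_random`, `select_novel`, and `aggregate` are hypothetical.

```python
import numpy as np


def novelty_scores(candidates, train):
    """Stand-in novelty score (an assumption, not DiSCoVeR):
    Euclidean distance from each candidate to its nearest training point."""
    # Pairwise differences via broadcasting: shape (n_candidates, n_train, n_features)
    diffs = candidates[:, None, :] - train[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return dists.min(axis=1)


def select_random(candidates, k, rng):
    """Baseline: pick k distinct candidate indices uniformly at random."""
    return rng.choice(len(candidates), size=k, replace=False)


def select_novel(candidates, train, k):
    """Pick the k candidate indices with the highest novelty score."""
    scores = novelty_scores(candidates, train)
    return np.argsort(scores)[::-1][:k]


def aggregate(train, candidates, idx):
    """Append the selected candidates to the training set."""
    return np.vstack([train, candidates[idx]])


# Toy feature vectors (illustrative values only, not real materials data)
train = np.array([[0.0, 0.0], [1.0, 0.0]])
candidates = np.array([[0.0, 0.1], [5.0, 5.0], [1.0, 1.0]])

rng = np.random.default_rng(0)
random_set = aggregate(train, candidates, select_random(candidates, 2, rng))
novel_set = aggregate(train, candidates, select_novel(candidates, train, 2))
```

In a study of the kind described in the abstract, each augmented set would then be used to retrain the same model, and the two would be compared on a common held-out test set.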

Symposium Organizers

Raymundo Arroyave, Texas A&M University
Elif Ertekin, University of Illinois at Urbana-Champaign
Rodrigo Freitas, Massachusetts Institute of Technology
Aditi Krishnapriyan, UC Berkeley

Session Chairs

Aditi Krishnapriyan
Wennie Wang
