Apr 24, 2024
9:00am - 9:30am
Room 320, Level 3, Summit
Taylor Sparks¹,², Federico Ottomano², Giovanni De Felice², Vladimir Gusev²
¹University of Utah, ²University of Liverpool
Recent developments in machine learning (ML) have opened new possibilities for materials research. However, because ML estimators are statistical in nature, their performance depends heavily on the quality of the training data, which in materials informatics is severely limited and fragmented. Here, we investigate whether state-of-the-art ML models for property prediction can benefit from aggregating different datasets. We probe three aggregation strategies that prioritize training size, element diversity, and composition diversity, using novelty scores from the DiSCoVeR algorithm. Surprisingly, our results consistently show that both simple and refined aggregation strategies reduce performance, which suggests caution when merging different experimental data sources. To guide how the training set is enlarged, we also compare selection with DiSCoVeR, which prioritizes chemical diversity, against random selection. Our results show that targeting novel chemistries is not beneficial when building a training dataset.
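As a rough illustration of the comparison described above, the sketch below enlarges a base training set in two ways: by picking the most novel candidates and by picking candidates at random. It rests on several assumptions: the novelty score is a simple nearest-neighbour distance between element-fraction vectors, used only as a stand-in and not the DiSCoVeR score; the function names (`novelty`, `select_novel`, `select_random`) and the toy compositions are hypothetical and do not come from the study.

```python
import math
import random


def novelty(candidate, pool):
    """Stand-in novelty score: Euclidean distance to the nearest composition in pool.

    This is NOT the DiSCoVeR score, only a placeholder for illustration.
    """
    return min(math.dist(candidate, p) for p in pool)


def select_novel(base, candidates, k):
    """Greedily pick up to k candidates, each the most novel w.r.t. the set so far."""
    current = list(base)
    remaining = list(candidates)
    chosen = []
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda c: novelty(c, current))
        remaining.remove(best)
        chosen.append(best)
        current.append(best)  # later picks must also be far from this one
    return chosen


def select_random(candidates, k, seed=0):
    """Baseline: pick up to k candidates uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    pool = list(candidates)
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical toy data: each tuple holds the fractions of three elements.
base = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0)]
extra = [(0.8, 0.2, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.0), (0.0, 1.0, 0.0)]

novel_pick = select_novel(base, extra, 2)
random_pick = select_random(extra, 2)
```

In a full experiment, each selection would be appended to the base set, a property-prediction model retrained, and the two resulting models evaluated on the same held-out data; that step is omitted here.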