Uncertainty Quantification in Machine Learning Models of Materials Properties

When and Where

Apr 24, 2024
3:30pm - 4:00pm

Room 320, Level 3, Summit

Presenter

Dane Morgan

Ryan Jacobs

Lane Schultz

Vidit Agrawal

Shixin Zhang

Glenn Palmer

Ben Blaiszik

Aristana Scourtas

KJ Schmidt

Co-Author(s)

Dane Morgan¹, Ryan Jacobs¹, Lane Schultz¹, Vidit Agrawal¹, Shixin Zhang¹, Glenn Palmer², Ben Blaiszik³, Aristana Scourtas³, KJ Schmidt³

University of Wisconsin–Madison¹, Duke University², The University of Chicago³

Abstract

Machine learning models are increasingly used to predict an enormous range of materials properties. Such models are typically trained on computed and/or experimental data that are strongly biased toward the systems sampled, potentially leading to models with limited accuracy and narrow domains of applicability. It is therefore increasingly important to establish effective practices for uncertainty quantification of machine learning models used for materials properties. In this talk we share an approach that divides uncertainty quantification into the separate challenges of error determination and domain determination, which together provide a strong framework for practical uncertainty quantification. This approach tells users whether prediction on a given test data point is likely to be appropriate and, if it is, what accuracy might be expected. For determining errors, we demonstrate that, when properly calibrated, ensembles of models fit to bootstrap samples of the training data can provide robust and easily accessible estimates of test data point residuals [1]. For determining domain, we demonstrate that a kernel density estimate of the training data density in feature space can be used to identify regions of feature space that are inadequately sampled and therefore likely to be out of domain. Assessing any domain determination strategy is difficult because there is no unique ground truth for whether a test data point is in or out of domain. To manage this problem, we propose a set of criteria for ground truth that match being out of domain with chemical intuition, with expected large residuals, and with large residual estimation errors. We show that a kernel density approach can generally categorize new test data points as in or out of domain with good accuracy (e.g., maximum F1 scores of about 80% or better) when using any of these criteria.
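
The error and domain steps described above can be illustrated with a minimal sketch. It is not the MAST-ML implementation: the least-squares base model, the synthetic data, the single calibration scale factor, the kernel bandwidth of 0.5, and the 5th-percentile density cutoff are all assumptions chosen for illustration, and every function name is hypothetical. The calibration form actually used is described in [1].

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Least-squares fit with intercept; a stand-in for any base regressor."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def bootstrap_ensemble(X, y, n_models=50):
    """Fit one model per bootstrap resample (rows drawn with replacement)."""
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)
        models.append(fit_linear(X[idx], y[idx]))
    return models

def ensemble_predict(models, X):
    """Ensemble mean and spread; the spread is the uncalibrated error estimate."""
    preds = np.stack([predict_linear(m, X) for m in models])
    return preds.mean(axis=0), preds.std(axis=0, ddof=1)

def calibration_scale(residuals, sigma):
    """Scale a for which a*sigma maximizes the Gaussian likelihood of residuals.

    Assumption: one multiplicative factor, a simplification for this sketch.
    """
    return np.sqrt(np.mean((residuals / sigma) ** 2))

def kde_density(X_train, X_query, bandwidth):
    """Unnormalized Gaussian kernel density of query points in feature space."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * bandwidth**2)).mean(axis=1)

# Synthetic data (hypothetical): two features, linear target plus noise.
X = rng.normal(size=(300, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=300)
X_train, y_train = X[:200], y[:200]
X_cal, y_cal = X[200:], y[200:]  # held out to calibrate the error estimates

# Error determination: bootstrap ensemble, then calibrate its spread.
models = bootstrap_ensemble(X_train, y_train)
mean_cal, sigma_cal = ensemble_predict(models, X_cal)
a = calibration_scale(y_cal - mean_cal, sigma_cal)

# One test point near the training data, one far from it.
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])
mean_new, sigma_new = ensemble_predict(models, X_new)
sigma_new_calibrated = a * sigma_new

# Domain determination: flag points whose training-data density is low.
# The cutoff is the 5th percentile of the training points' own densities
# (each includes its self-contribution, a simplification).
bandwidth = 0.5
threshold = np.percentile(kde_density(X_train, X_train, bandwidth), 5)
in_domain = kde_density(X_train, X_new, bandwidth) >= threshold

for x, m, s, ok in zip(X_new, mean_new, sigma_new_calibrated, in_domain):
    print(f"x={x}, prediction={m:.2f}, calibrated sigma={s:.2f}, in domain={ok}")
```

In this sketch the point far from the training data is flagged as out of domain, so its prediction and error estimate would be set aside rather than trusted. In practice the bandwidth and cutoff would need to be tuned, for example against ground-truth criteria like those described above.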
Finally, we discuss how these methods can be easily integrated into model fits through the MAST-ML [2] package and how such uncertainty-aware models can be easily hosted in the cloud through the Foundry [3] service.

[1] Palmer, G.; Du, S. Q.; Politowicz, A.; Emory, J. P.; Yang, X. Y.; Gautam, A.; Gupta, G.; Li, Z. L.; Jacobs, R.; Morgan, D. Calibration after bootstrap for accurate uncertainty quantification in regression models. npj Comput. Mater. 2022, 8 (1), 9. DOI: 10.1038/s41524-022-00794-8.

[2] Jacobs, R.; Mayeshiba, T.; Afflerbach, B.; Miles, L.; Williams, M.; Turner, M.; Finkel, R.; Morgan, D. The Materials Simulation Toolkit for Machine learning (MAST-ML): An automated open source toolkit to accelerate data-driven materials research. Comput. Mater. Sci. 2020, 176, 13. DOI: 10.1016/j.commatsci.2020.109544.

[3] Blaiszik, B.; Schmidt, K.; Scourtas, A. Foundry-ML. 2023. https://foundry-ml.org (accessed 2023).