April 7 - 11, 2025
Seattle, Washington
2025 MRS Spring Meeting & Exhibit
MT01.02.08

Exploring the Limits of kNN Noisy Feature Detection and Recovery for Self-Driving Labs

When and Where

Apr 8, 2025
4:30pm - 4:45pm
Summit, Level 4, Room 424

Presenter(s)

Co-Author(s)

Qiuyu Shi (1), Kangming Li (2,1), Yao Fehlis (3), Daniel Persaud (1), Jason Hattrick-Simpers (1)

(1) University of Toronto, (2) Acceleration Consortium, (3) Artificial, Inc.

Abstract

Integrating machine learning with automated experimentation platforms to form a Self-Driving Laboratory (SDL) promises to accelerate materials discovery. This promise, however, depends on the consistency and quality of the data the platform gathers and feeds to the machine learning models that drive experimentation. Over the lifetime of an SDL, inconsistencies can arise from several sources, such as transcription mistakes, calculation errors, and equipment malfunctions, any of which may compromise SDL performance. Previous studies have demonstrated the potential of k-Nearest Neighbor (kNN) imputation for recovering noisy or missing values in datasets, but a systematic study that integrates noise detection and recovery while examining how dataset size, noise intensity, and noise type affect recovery reliability is still lacking.
In this work, we establish an automated workflow for detecting and correcting noisy features in datasets, aiming to explore the limits of successful imputation as a function of dataset size, noise type, and noise intensity. We first use the statistical distributions of features predicted via kNN imputation to isolate noisy features, comparing the Earth Mover’s Distance of each feature between noisy test data and clean validation data. Once a noisy feature is detected, we correct it with the same imputation method, using the remaining N-1 features to predict the Nth (noisy) feature. We identify which samples are recoverable and quantify the success of the recovery, showing that the influence of noise is closely tied to the distribution of the feature values. Furthermore, we systematically investigate how sample size, noise type, and noise intensity affect both the detectability and recoverability of noisy features, as well as their correlations with feature characteristics. We find that as Gaussian noise in a feature decreases, kNN rapidly loses the ability to identify that feature as noisy when the dataset size falls below 1,000 samples. Our framework not only benchmarks kNN imputation efficiency but can also be extended to evaluate the stability of other imputation methods.
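The detect-then-recover loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the DataFrame layout, the EMD threshold, and n_neighbors are placeholders, and scikit-learn's KNNImputer plus SciPy's wasserstein_distance stand in for whatever kNN and Earth Mover's Distance routines were actually used.

# Minimal sketch of noisy-feature detection and recovery, assuming numeric
# pandas DataFrames with identical columns. The threshold and n_neighbors
# values are illustrative placeholders, not the study's settings.
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance
from sklearn.impute import KNNImputer

def detect_noisy_features(test_df, clean_val_df, threshold=0.5):
    """Flag features whose test-set distribution drifts from the clean
    validation distribution, measured by the Earth Mover's Distance."""
    flagged = []
    for col in test_df.columns:
        emd = wasserstein_distance(test_df[col].dropna(), clean_val_df[col].dropna())
        if emd > threshold:  # hypothetical cutoff; EMD is scale-dependent
            flagged.append(col)
    return flagged

def recover_feature(test_df, clean_val_df, noisy_col, n_neighbors=5):
    """Blank out the flagged feature and re-predict it from the remaining
    N-1 features, with kNN neighbors drawn from the clean validation data."""
    imputer = KNNImputer(n_neighbors=n_neighbors)
    imputer.fit(clean_val_df)             # donor pool: clean samples
    masked = test_df.copy()
    masked[noisy_col] = np.nan            # discard the corrupted values
    recovered = imputer.transform(masked)
    return pd.DataFrame(recovered, columns=test_df.columns, index=test_df.index)

In practice the EMD cutoff would need to account for feature scale (e.g. by standardizing features first), which is one of the factors the study examines.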
Overall, this study provides a robust framework for detecting and correcting noisy features in datasets while deepening our understanding of which noise types are most amenable to detection and recovery under various conditions. These findings could be used to strengthen data management strategies in SDLs, leading to more resilient and precise experimental outcomes.

Symposium Organizers

Nongnuch Artrith, University of Utrecht
Haegyeom Kim, Lawrence Berkeley National Laboratory
Mahshid Ahmadi, University of Tennessee, Knoxville
Guoxiang (Emma) Hu, Georgia Institute of Technology

Symposium Support

Bronze
APL Machine Learning
Jiang Family Foundation
Wellcos Corporation

Session Chairs

Mahshid Ahmadi
Haegyeom Kim
