Apr 8, 2025
4:30pm - 4:45pm
Summit, Level 4, Room 424
Qiuyu Shi1, Kangming Li2,1, Yao Fehils3, Daniel Persaud1, Jason Hattrick-Simpers1
University of Toronto1, Acceleration Consortium2, Artifical, Inc.3
Integrating machine learning with automated experimentation platforms to form a Self-Driving Laboratory (SDL) promises to accelerate materials discovery. This promise, however, depends on the consistency and quality of the data the platform gathers, since these data drive the machine learning models that guide experimentation. Over the lifetime of an SDL, inconsistencies can arise from several sources, such as transcription mistakes, calculation errors, and equipment malfunctions, any of which may compromise SDL performance. Previous studies have demonstrated the potential of k-Nearest Neighbor (kNN) imputation for recovering noisy or missing values in datasets. However, a systematic study that integrates noise detection and recovery while examining how dataset size, noise intensity, and noise type affect recovery reliability is still lacking.
In this work, we establish an automated workflow for detecting and correcting noisy features in datasets, with the aim of exploring the limits of successful imputation as a function of dataset size, noise type, and noise intensity. We first use properties of the statistical distributions of the features predicted via kNN imputation to isolate noisy features, comparing the Earth Mover's Distance of each feature between noisy test data and clean validation data. Once a noisy feature is detected, we apply the same imputation method to correct it, using the remaining N-1 features to predict the Nth (noisy) feature. We identify which samples are recoverable and quantify the success of the recovery, showing that the influence of noise is closely tied to the distribution of the feature values. Furthermore, we systematically investigate how sample size, noise type, and noise intensity affect both the detectability and recoverability of noisy features, as well as how these outcomes correlate with feature characteristics. We find that as the magnitude of Gaussian noise in a feature decreases, kNN rapidly loses the ability to flag that feature as noisy when the dataset contains fewer than 1,000 samples. Our framework not only benchmarks the efficiency of kNN imputation but can also be extended to evaluate the stability of other imputation methods.
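The detect-then-correct loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the synthetic correlated dataset, the choice of scikit-learn's KNNImputer with five neighbors, and the mask-one-feature scheme for producing the predicted distributions are all assumptions made for the sake of a runnable example.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Toy stand-in for SDL data: 500 samples of 4 correlated features.
base = rng.normal(size=(500, 1))
X_clean = base + 0.1 * rng.normal(size=(500, 4))

# Corrupt one feature (index 2) with additive Gaussian noise.
noisy_idx = 2
X_noisy = X_clean.copy()
X_noisy[:, noisy_idx] += rng.normal(scale=2.0, size=500)

# Fit the imputer on clean validation data, which serves as the
# donor pool for the kNN predictions.
imputer = KNNImputer(n_neighbors=5)
imputer.fit(X_clean)

def impute_feature(X, j):
    """Mask feature j and predict it from the remaining N-1 features."""
    X_masked = X.copy()
    X_masked[:, j] = np.nan
    return imputer.transform(X_masked)[:, j]

# Detection: for each feature, compare the observed distribution with
# the kNN-predicted one via the Earth Mover's Distance; the corrupted
# feature stands out with a large EMD.
emd = np.array([
    wasserstein_distance(X_noisy[:, j], impute_feature(X_noisy, j))
    for j in range(X_noisy.shape[1])
])
detected = int(np.argmax(emd))

# Correction: overwrite the flagged feature with its imputed values,
# then check the error against the clean ground truth.
X_recovered = X_noisy.copy()
X_recovered[:, detected] = impute_feature(X_noisy, detected)
mse_noisy = np.mean((X_noisy[:, noisy_idx] - X_clean[:, noisy_idx]) ** 2)
mse_recovered = np.mean((X_recovered[:, noisy_idx] - X_clean[:, noisy_idx]) ** 2)
```

Fitting the imputer on clean validation data means every imputed value is drawn from uncorrupted donors, so the corrupted feature's predicted distribution diverges sharply from its observed one, which is what the EMD comparison exploits.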
Overall, this study provides a robust framework for detecting and correcting noisy features in datasets while deepening our understanding of which noise types are most amenable to detection and recovery under various conditions. These findings could be used to strengthen data management strategies in SDLs, leading to more resilient and precise experimental outcomes.