Apr 8, 2025
4:30pm - 4:45pm
Summit, Level 4, Room 424
Qiuyu Shi1, Kangming Li2,1, Yao Fehils3, Daniel Persaud1, Jason Hattrick-Simpers1
University of Toronto1, Acceleration Consortium2, Artifical, Inc.3
Integrating machine learning with automated experimentation platforms to form a Self-Driving Laboratory (SDL) promises to accelerate materials discovery. This promise, however, depends on the consistency and quality of the data the platform gathers, since these data drive the machine learning models that guide experimentation. Over the lifetime of an SDL, inconsistencies can arise from several sources, such as transcription mistakes, calculation errors, and equipment malfunctions, any of which may compromise SDL performance. Previous studies have demonstrated the potential of k-Nearest Neighbor (kNN) imputation for recovering noisy or missing values in datasets. However, a systematic study that integrates noise detection and recovery while examining how dataset size, noise intensity, and noise type affect recovery reliability is still lacking.
In this work, we establish an automated workflow for detecting and correcting noisy features in datasets, with the aim of exploring the limits of successful imputation as a function of dataset size, noise type, and noise intensity. We first use properties of the statistical distributions of the features predicted via kNN imputation to isolate noisy features, comparing the Earth Mover's Distance of each feature between noisy test data and clean validation data. Once a noisy feature is detected, we apply the same imputation method to correct it, using the remaining N-1 features to predict the Nth (noisy) feature. We identify which samples are recoverable and quantify the success of the recovery, showing that the influence of noise is closely tied to the distribution of the feature values. Furthermore, we systematically investigate how sample size, noise type, and noise intensity affect both the detectability and recoverability of noisy features, as well as how these outcomes correlate with feature characteristics. We find that as the magnitude of Gaussian noise in a feature decreases, kNN rapidly loses the ability to flag that feature as noisy when the dataset contains fewer than 1,000 samples. Our framework not only benchmarks the efficiency of kNN imputation but can also be extended to evaluate the stability of other imputation methods.
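The detect-then-correct loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the synthetic correlated dataset, the choice of scikit-learn's KNNImputer with five neighbors, and the mask-one-feature scheme for producing the predicted distributions are all assumptions made for the sake of a runnable example.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Toy stand-in for SDL data: 500 samples of 4 correlated features.
base = rng.normal(size=(500, 1))
X_clean = base + 0.1 * rng.normal(size=(500, 4))

# Corrupt one feature (index 2) with additive Gaussian noise.
noisy_idx = 2
X_noisy = X_clean.copy()
X_noisy[:, noisy_idx] += rng.normal(scale=2.0, size=500)

# Fit the imputer on clean validation data, which serves as the
# donor pool for the kNN predictions.
imputer = KNNImputer(n_neighbors=5)
imputer.fit(X_clean)

def impute_feature(X, j):
    """Mask feature j and predict it from the remaining N-1 features."""
    X_masked = X.copy()
    X_masked[:, j] = np.nan
    return imputer.transform(X_masked)[:, j]

# Detection: for each feature, compare the observed distribution with
# the kNN-predicted one via the Earth Mover's Distance; the corrupted
# feature stands out with a large EMD.
emd = np.array([
    wasserstein_distance(X_noisy[:, j], impute_feature(X_noisy, j))
    for j in range(X_noisy.shape[1])
])
detected = int(np.argmax(emd))

# Correction: overwrite the flagged feature with its imputed values,
# then check the error against the clean ground truth.
X_recovered = X_noisy.copy()
X_recovered[:, detected] = impute_feature(X_noisy, detected)
mse_noisy = np.mean((X_noisy[:, noisy_idx] - X_clean[:, noisy_idx]) ** 2)
mse_recovered = np.mean((X_recovered[:, noisy_idx] - X_clean[:, noisy_idx]) ** 2)
```

Fitting the imputer on clean validation data means every imputed value is drawn from uncorrupted donors, so the corrupted feature's predicted distribution diverges sharply from its observed one, which is what the EMD comparison exploits.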
Overall, this study provides a robust framework for detecting and correcting noisy features in datasets while deepening our understanding of which noise types are most amenable to detection and recovery under various conditions. These findings could be used to strengthen data management strategies in SDLs, leading to more resilient and precise experimental outcomes.