Dec 5, 2024
4:00pm - 4:15pm
Hynes, Level 2, Room 209
Julia Hsu1, Imon Mia1, Armi Tiihonen2, Mark Lee1, Roman Garnett3, Tonio Buonassisi4, William Vandenberghe1
The University of Texas at Dallas1, Aalto University2, Washington University in St. Louis3, Massachusetts Institute of Technology4
Optimization is a common task in materials science. Bayesian optimization (BO) is increasingly used in experimental work involving varying levels of automation. Before implementing BO in an experimental campaign, many researchers prefer to implement BO in a simulation environment using synthetic data, which provides pedagogical and troubleshooting value. Two major differences between experimental and simulation work are that (1) experiments are often performed in batches, i.e., processing multiple samples at once, to save materials cost or time, and (2) experimental data contain aleatoric uncertainties that manifest as noise.

In this work, we develop a framework to visualize BO step by step, first as an evaluation tool for simulation environments and later, possibly, as a debugging tool for experiments. We showcase an example of simulated data with increasing noise, evaluating optimization strategies as a function of noise magnitude. In our demonstration, we implement batch BO using the Emukit package to find the optimum of 6-dimensional Ackley and Hartmann functions. Six dimensions in the predictor inputs (X) are chosen to mimic the number of input variables commonly used in experimental work. The Ackley function represents a needle-in-a-haystack experimental manifold, i.e., a hard-to-find global maximum in the objective (y), while the Hartmann function represents a more gradual landscape that contains a second local maximum similar in objective value to the global maximum but at a significantly different X. Using synthetic data without noise, we first study how the optimization (learning) progress is affected by the choice of acquisition function (expected improvement vs. upper confidence bound), hyperparameters, and batch-picking method. Latin hypercube sampling (LHS) is used to pick the initial X values for collecting data, followed by 50 learning cycles with a batch size of 4 in each round. Ninety-nine independent LHS initializations are run to quantify statistical variation. The optimization results are evaluated based on instant regret in X, defined as the Euclidean distance between the final optimal X_opt from the model and the X_max at which the ground-truth y is maximized, averaged over the 99 LHS initializations. While most papers in the literature track the difference between the model and ground-truth y values, we argue that X is more important to experimenters because the inputs are what can be controlled and varied, and the model's y values deviate from the ground truth because of the details of the Gaussian process regression. The effects of noise on the optimization are evaluated for normally distributed noise levels ranging from 1% to 20%. We show that adding noise as a percentage of the ground-truth y maximum, as is commonly done in the literature, overestimates the noise compared with the signal-to-noise ratio of experiments. We also develop several visualization methods to show the optimization progression and outcomes, since visualization is important for high-dimensional problems: it is difficult for humans to comprehend results in more than three dimensions.

This work is supported by NSF CMMI-2109554. JWPH and TB acknowledge the support of the Simons Foundation Pivot Fellowship.
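As a rough illustration of the batch-BO setup described above, the sketch below assembles a 6-dimensional loop with Emukit: a Latin hypercube initial design, a GPy Gaussian-process surrogate, an expected-improvement acquisition, and 50 cycles that each collect a batch of 4 points. The search bounds, kernel choice, and initial design size are illustrative assumptions rather than the settings used in this work, and Emukit minimizes by default, so the sketch targets the Ackley minimum rather than a maximum.

```python
import numpy as np
import GPy
from emukit.core import ContinuousParameter, ParameterSpace
from emukit.core.initial_designs import LatinDesign
from emukit.core.loop import FixedIterationsStoppingCondition
from emukit.model_wrappers import GPyModelWrapper
from emukit.bayesian_optimization.acquisitions import ExpectedImprovement
from emukit.bayesian_optimization.loops import BayesianOptimizationLoop

DIM = 6         # input dimensions, as in the abstract
BATCH_SIZE = 4  # points collected per learning cycle
N_CYCLES = 50   # learning cycles
N_INIT = 10     # assumed size of the initial LHS design (not stated in the abstract)

def ackley(x: np.ndarray) -> np.ndarray:
    """Standard Ackley function; Emukit minimizes, so the target is its global minimum at the origin."""
    a, b, c = 20.0, 0.2, 2.0 * np.pi
    term1 = -a * np.exp(-b * np.sqrt(np.mean(x ** 2, axis=1)))
    term2 = -np.exp(np.mean(np.cos(c * x), axis=1))
    return (term1 + term2 + a + np.e).reshape(-1, 1)

# Assumed search bounds; the bounds used in the study may differ.
space = ParameterSpace([ContinuousParameter(f"x{i}", -2.0, 2.0) for i in range(DIM)])

# Latin hypercube sampling for the initial design.
X_init = LatinDesign(space).get_samples(N_INIT)
Y_init = ackley(X_init)

# GPy Gaussian-process surrogate wrapped for Emukit.
gp = GPy.models.GPRegression(X_init, Y_init, GPy.kern.Matern52(DIM, ARD=True))
model = GPyModelWrapper(gp)

# Batch BO loop: expected improvement acquisition, 4 points per cycle.
loop = BayesianOptimizationLoop(
    space=space,
    model=model,
    acquisition=ExpectedImprovement(model),
    batch_size=BATCH_SIZE,
)
loop.run_loop(ackley, FixedIterationsStoppingCondition(N_CYCLES))

# Best input found over the initial design plus all collected batches.
X_all, Y_all = loop.loop_state.X, loop.loop_state.Y
x_best = X_all[np.argmin(Y_all)]
```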
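The two quantities emphasized above, instant regret measured in X and noise scaled as a percentage of the ground-truth y maximum, can be written compactly. The helper names below are hypothetical, and the noise model is a plain reading of "normally distributed noise levels ranging from 1% to 20%"; the authors' exact definitions may differ.

```python
import numpy as np

def instant_regret_in_x(x_opt: np.ndarray, x_max: np.ndarray) -> float:
    """Euclidean distance between the model's best input X_opt and the ground-truth optimizer X_max."""
    return float(np.linalg.norm(np.asarray(x_opt) - np.asarray(x_max)))

def add_percentage_noise(y: np.ndarray, noise_level: float, y_max: float,
                         rng: np.random.Generator) -> np.ndarray:
    """Add Gaussian noise whose standard deviation is a fraction of the ground-truth |y| maximum.

    noise_level = 0.01 .. 0.20 corresponds to the 1%-20% range discussed above.
    """
    return y + rng.normal(0.0, noise_level * abs(y_max), size=np.shape(y))

# Example: average the regret over repeated LHS initializations (99 in the abstract).
# x_opts = [...]  # best X from each run
# mean_regret = np.mean([instant_regret_in_x(x, x_max_true) for x in x_opts])
```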