Apr 11, 2025
4:00pm - 4:15pm
Summit, Level 4, Room 423
Savyasanchi Aggarwal1,2, Seán Kavanagh3, Kedar Hippalgaonkar2,4, David Scanlon5
University College London1, Agency for Science Technology and Research (A*STAR)2, Harvard University Center for the Environment3, Nanyang Technological University4, University of Birmingham5
Defects are key to the performance of materials across diverse applications. However, simulating them entails vast computational overhead because it requires high-level electronic structure methods (e.g. hybrid DFT), large supercells and appropriate structure-searching [1]. This has thus far inhibited accurate high-throughput investigations of defect systems. Deep learning is a viable, low-cost alternative, but requires a large, high-fidelity input database to achieve similar levels of accuracy and universality. While previous efforts [2-4] have aimed to construct such databases, their attempts to circumvent the cost of hybrid DFT with other approximations have limited their accuracy. Furthermore, they predate the work by Mosquera-Lois et al. [1], making it unclear how many of their datapoints are reliable representations of defect structures and energies.
This study builds upon these previous efforts, utilising their model architectures to identify the features most important for improving accuracy and universality, and thereby maximise performance in predicting defect formation energies. Additionally, it evaluates appropriate tokenization methods for encoding different descriptors. This is a vital consideration for point defects, where local structures have reduced symmetry and lead to a convoluted feature space if encoded only with fractional coordinates. To do this, a range of methods were compared and modified, including popular structural representations such as SOAP [5], CGCNN [6] and WyCryst [7], alongside charge distributions.
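As a rough illustration of why raw fractional coordinates make a poor encoding, the sketch below compares them with a very simple local fingerprint: the sorted distances from a defect site to its neighbours under periodic boundary conditions. This is a toy stand-in, not SOAP, CGCNN or any representation used in the study. The function names, the orthorhombic-cell assumption and all coordinates are hypothetical and chosen only for illustration.

```python
import math

def min_image_distance(frac_a, frac_b, cell_lengths):
    """Distance between two fractional positions in an orthorhombic cell,
    using the minimum-image convention for periodic boundaries."""
    total = 0.0
    for fa, fb, length in zip(frac_a, frac_b, cell_lengths):
        d = fa - fb
        d -= round(d)  # wrap the fractional difference into [-0.5, 0.5]
        total += (d * length) ** 2
    return math.sqrt(total)

def local_fingerprint(defect_site, neighbours, cell_lengths):
    """Sorted defect-to-neighbour distances: unchanged by a rigid shift of
    the cell origin or by reordering the neighbour list."""
    return sorted(min_image_distance(defect_site, n, cell_lengths)
                  for n in neighbours)

def translate(frac, shift):
    """Rigidly shift a fractional position and wrap it back into [0, 1)."""
    return tuple((x + s) % 1.0 for x, s in zip(frac, shift))

# Hypothetical toy data (not taken from the study's database).
cell = (4.0, 4.0, 4.0)  # orthorhombic cell edge lengths in angstrom
defect = (0.5, 0.5, 0.5)
neighbours = [(0.25, 0.5, 0.5), (0.75, 0.5, 0.5),
              (0.5, 0.2, 0.5), (0.5, 0.5, 0.9)]

# Same physical structure, different origin and a reordered atom list.
shift = (0.4, 0.4, 0.4)
shifted_defect = translate(defect, shift)
shifted_neighbours = [translate(n, shift) for n in reversed(neighbours)]

fp_original = local_fingerprint(defect, neighbours, cell)
fp_shifted = local_fingerprint(shifted_defect, shifted_neighbours, cell)
```

The two coordinate lists differ entry by entry even though they describe the same defect environment, whereas `fp_original` and `fp_shifted` agree to within floating-point error. Descriptors such as SOAP build in this kind of invariance in a far richer form.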
To maintain accuracy, this study involved the construction of a large, high-quality database of point defect calculations using hybrid DFT. This encompasses over 1200 intrinsic and extrinsic defect configurations across a wide range of host systems, from binary to quaternary compounds, comparable with previous dataset sizes. Furthermore, each datapoint has been modelled following the ShakeNBreak approach developed by Mosquera-Lois et al. [8], ensuring the final defect energies correspond to ground-state defect configurations. Many other local and bulk properties for each system are also included, using methods from the doped [9] and pymatgen [10] packages. This high-fidelity database offers an excellent starting point for training deep learning models and mapping between defect properties and structures, and the steps taken in this study are vital for enabling reproducible and high-quality model performance.
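For readers unfamiliar with the target quantity, the sketch below evaluates the standard supercell expression for a defect formation energy, E_f = E_defect - E_bulk - sum_i(n_i * mu_i) + q * (E_VBM + E_F) + E_corr. It is a minimal, assumed illustration and not the doped or pymatgen API. The function name, argument names and all numerical values are hypothetical.

```python
def formation_energy(e_defect, e_bulk, delta_n, chem_pots,
                     charge=0, e_vbm=0.0, e_fermi=0.0, e_corr=0.0):
    """Standard supercell defect formation energy (all energies in eV).

    delta_n   : {element: atoms added (+) or removed (-) to form the defect}
    chem_pots : {element: chemical potential of that element}
    e_fermi   : Fermi level measured relative to the valence band maximum
    e_corr    : finite-size correction for charged defects
    """
    exchange = sum(n * chem_pots[el] for el, n in delta_n.items())
    return e_defect - e_bulk - exchange + charge * (e_vbm + e_fermi) + e_corr

# Hypothetical numbers for a doubly charged oxygen vacancy (illustrative only).
e_f = formation_energy(
    e_defect=-500.0,       # total energy of the defect supercell
    e_bulk=-510.0,         # total energy of the pristine supercell
    delta_n={"O": -1},     # one oxygen atom removed
    chem_pots={"O": -5.0},
    charge=2,
    e_vbm=2.0,
    e_fermi=0.5,
    e_corr=0.3,
)
```

With these made-up inputs the terms are 10.0 eV (supercell energy difference), -5.0 eV (chemical potential term), +5.0 eV (charge term) and +0.3 eV (correction), giving about 10.3 eV. The energy of the defect supercell enters directly, which is why relaxing to the true ground-state configuration matters for every datapoint.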
[1] Mosquera-Lois, I. et al. (2023). Identifying the ground state structures of point defects in solids.
npj Comput. Mater., 9(1), 25
[2] Xiang, X. et al. (2024). Exploration of Deep Learning Models for Accelerated Defect Property Predictions and Device Design of Cubic Semiconductor Crystals.
J. Phys. Chem. C.
[3] Rahman, M. H. et al. (2024). Accelerating defect predictions in semiconductors using graph neural networks.
APL Mach. Learn., 2(1)
[4] Witman, M. D. et al. (2023). Defect graph neural networks for materials discovery in high-temperature clean-energy applications.
Nat. Comput. Sci., 3(8), 675-686
[5] Himanen, L. et al. (2020). DScribe: Library of descriptors for machine learning in materials science.
Comput. Phys. Commun., 247, 106949
[6] Xie, T. et al. (2018). Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties.
Phys. Rev. Lett., 120(14), 145301
[7] Choudhary, K. et al. (2021). Atomistic line graph neural network for improved materials property predictions.
npj Comput. Mater., 7(1), 185
[8] Mosquera-Lois, I. et al. (2022). ShakeNBreak: Navigating the defect configurational landscape.
J. Open Source Softw., 7(80), 4817
[9] Kavanagh, S. R. et al. (2024). doped: Python toolkit for robust and repeatable charged defect supercell calculations.
J. Open Source Softw., 9(96), 6433
[10] Ong, S. P. et al. (2013). Python Materials Genomics (pymatgen): A robust, open-source python library for materials analysis.
Comput. Mater. Sci., 68, 314–319