Dec 5, 2024
9:15am - 9:30am
Hynes, Level 2, Room 210
Vineeth Venugopal1, Elsa Olivetti1
Massachusetts Institute of Technology1
The MatSciLLM Leaderboard is a large-scale evaluation of large language models (LLMs) tailored to materials science, intended to guide researchers in selecting the most suitable models for tasks in this domain and to answer critical questions about the role of these models in the field. The initiative assesses 23 LLMs across 12 tasks, for a total of more than 250 evaluated models, providing a detailed analysis of their capabilities and limitations in the context of materials science. By offering insight into model performance, this work addresses the pressing need for AI tools that can support complex materials research and discovery. A key focus of the evaluation is how model size, fine-tuning, and other adaptations affect performance across tasks. Larger models generally outperform smaller ones, and fine-tuned versions of LLMs consistently outperform their base counterparts. Despite these advantages, LLMs still struggle in certain areas, such as numerical data prediction, where performance remains suboptimal. Among the models evaluated, Llama3 and Mistral emerged as strong performers across multiple tasks, demonstrating their versatility and robustness.

This study also introduces a novel cross-task evaluation, which investigates whether LLMs fine-tuned on one materials science task perform better on other tasks. This analysis raises the question of whether these models can effectively "learn" materials science through task-specific training. We further examine the role of model ensembles, testing whether combining multiple models can outperform individual large models, and provide insight into the utility of ensemble approaches in materials science applications.

Additionally, we evaluate the impact of data and model size on training effectiveness across the 12 tasks, offering a thorough exploration of how these factors influence performance. This work represents the most extensive evaluation of LLMs in materials science to date and provides practical guidance for researchers and practitioners in the field. The findings underscore the importance of selecting the right model configuration for specific research needs and highlight the potential for further gains in LLM performance through targeted optimization.
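To make the cross-task and ensemble protocols concrete, the following is a minimal sketch, not the authors' code: the task names, model identifiers, and the `predict` stub are hypothetical placeholders, and a real evaluation would call each fine-tuned LLM on held-out task data.

```python
# Minimal sketch of a cross-task accuracy matrix and a majority-vote ensemble.
# All names below (tasks, models, predict) are hypothetical placeholders.
from collections import Counter

TASKS = ["entity_extraction", "property_classification", "synthesis_action_labeling"]
MODELS = ["llama3_ft_entity", "mistral_ft_property", "base_llm"]

def predict(model: str, task: str, example: dict) -> str:
    """Placeholder: in practice this would query the (fine-tuned) LLM."""
    return example["label"]  # stub so the sketch runs end to end

def accuracy(model: str, task: str, dataset: list[dict]) -> float:
    correct = sum(predict(model, task, ex) == ex["label"] for ex in dataset)
    return correct / len(dataset)

def cross_task_matrix(models, tasks, datasets):
    """Rows: model fine-tuned on one task; columns: task it is evaluated on."""
    return {m: {t: accuracy(m, t, datasets[t]) for t in tasks} for m in models}

def ensemble_predict(models, task, example) -> str:
    """Majority vote over several (possibly smaller) models."""
    votes = Counter(predict(m, task, example) for m in models)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    datasets = {t: [{"text": "...", "label": "A"}] for t in TASKS}
    for model, row in cross_task_matrix(MODELS, TASKS, datasets).items():
        print(model, row)
    print(ensemble_predict(MODELS, TASKS[0], datasets[TASKS[0]][0]))
```

In this framing, off-diagonal entries of the matrix indicate whether fine-tuning on one task transfers to others, and the ensemble vote tests whether several models together can match or exceed a single larger model.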