Dec 4, 2024
8:00pm - 10:00pm
Hynes, Level 1, Hall A
Rachel Luu¹, Markus Buehler¹
¹Massachusetts Institute of Technology
Generative AI tools, such as large language models (LLMs), have demonstrated transformative abilities in generating content across diverse contexts while exhibiting broad expertise and analytical rigor. Leveraging the extensive knowledge embedded in the research literature, we explore the feasibility of finetuning foundational LLMs for use in scientific workflows. This work details our method for developing scientific datasets and scientifically finetuned AI models, illustrated by our example model, BioinspiredLLM, a conversational LLM finetuned on a literature corpus of biological and bio-inspired materials.

In the initial step of our process, we employ text- and data-mining techniques to assemble a comprehensive collection of full-text articles in the field of structural biological materials, yielding over a thousand articles. These articles, initially in PDF format, are pre-processed with optical character recognition (OCR) to extract their textual content (sketched below). We then develop a data distillation technique to further process and clean the text, removing extraneous information while preserving the core knowledge and concepts. With this pre-processing step, we observe enhanced conversational performance in the final finetuned models.

We then finetune a variety of open-source foundational LLMs using a low-rank adaptation (LoRA) strategy, introducing our specialized dataset while retaining the models' pre-trained knowledge (sketched below). We evaluate the finetuned models on custom benchmarks, demonstrating improved performance on tasks such as knowledge recall, hypothesis generation, and synthesis within the specialized domain. Notably, integration with Retrieval-Augmented Generation (RAG) further enhances these capabilities (sketched below). Additionally, we show that model efficiency and usability can be improved through quantization, enabling faster training and inference.

Finally, we discuss the implications of scientifically finetuned AI models and their integration into robust workflows for materials discovery, including multi-agent systems. This work demonstrates consistent performance improvements across a variety of model types, suggesting that the finetuning method can feasibly be applied to a wide range of specialized scientific fields.
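To make the corpus-extraction step concrete, the following is a minimal sketch of pulling text out of a directory of article PDFs. It uses the pypdf library, which reads text already embedded in the PDF; genuinely scanned pages would instead require a true OCR engine such as Tesseract. The directory name and file layout are illustrative assumptions, not the exact pipeline described above.

```python
# Minimal sketch: extract raw text from a folder of article PDFs.
# Assumes text-based PDFs; scanned pages would need a real OCR engine
# (e.g., Tesseract via pytesseract) instead of pypdf.
from pathlib import Path

from pypdf import PdfReader

def pdf_to_text(pdf_path: Path) -> str:
    """Concatenate the extracted text of every page of one PDF."""
    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# "articles/" is a placeholder for wherever the mined PDFs are stored.
corpus = {p.name: pdf_to_text(p) for p in Path("articles").glob("*.pdf")}
print(f"Extracted {len(corpus)} articles")
```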
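The core finetuning step can be sketched with the Hugging Face transformers, peft, and datasets libraries. The base model name, dataset file, and hyperparameters below are illustrative placeholders, not the exact settings used for BioinspiredLLM; the sketch combines 4-bit quantized loading with LoRA adapters so that only small low-rank matrices are trained while the pre-trained weights stay frozen.

```python
# Minimal sketch: LoRA finetuning of a 4-bit-quantized open-source LLM
# with Hugging Face transformers + peft. Model name, dataset file, and
# hyperparameters are illustrative placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "meta-llama/Llama-2-13b-chat-hf"  # placeholder base model

# Load the frozen base model in 4-bit precision to cut memory use.
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Inject trainable low-rank adapters into the attention projections;
# the original pre-trained weights remain untouched.
lora_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1,
                         target_modules=["q_proj", "k_proj",
                                         "v_proj", "o_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# "distilled_corpus.jsonl" stands in for the cleaned literature dataset,
# assumed to hold one {"text": ...} record per training passage.
dataset = load_dataset("json", data_files="distilled_corpus.jsonl")["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bioinspired-lora",
                           num_train_epochs=3,
                           per_device_train_batch_size=2,
                           learning_rate=2e-4,
                           logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because only the adapter matrices receive gradients, the same recipe transfers across different base models, which is consistent with the cross-model improvements reported above.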
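The retrieval step of RAG can likewise be sketched in a few lines: corpus passages are embedded once, and at query time the most similar passages are prepended to the prompt handed to the finetuned model. The embedding model and corpus file named here are illustrative assumptions, not the exact components used in this work.

```python
# Minimal sketch: the retrieval half of Retrieval-Augmented Generation.
# The embedding model and corpus file are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Assume the cleaned corpus is stored as blank-line-separated passages.
passages = open("distilled_corpus.txt").read().split("\n\n")
passage_vecs = embedder.encode(passages, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k passages most similar to the query (cosine similarity)."""
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = passage_vecs @ query_vec
    return [passages[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved context is prepended to the question before generation.
question = "How does nacre achieve its fracture toughness?"
prompt = ("Context:\n" + "\n".join(retrieve(question))
          + f"\n\nQuestion: {question}")
```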