The development of new functional materials for technologies ranging from smartphones to automobiles relies heavily on experimental data, yet much of this valuable information remains buried within millions of published scientific papers. A research team at Japan's National Institute for Materials Science (NIMS) has developed two artificial intelligence tools designed to automate and accelerate the extraction of this data, addressing a critical bottleneck in materials science research. The relationship between materials and their properties is complex: slight variations in composition or synthesis method often produce dramatically different characteristics. This makes theoretical prediction difficult and raises the importance of empirical data.
Led by Senior Researcher Dr. Yukari Katsura, the team focused on enhancing the construction of Starrydata, a materials property database launched in 2015 that previously relied on manual data collection from papers. "Graphs in the millions of papers published to date contain valuable experimental data collected by past researchers, and much of it remains untapped," Katsura explained. The first tool, Starrydata Auto-Suggestion for Sample Information, is already integrated into the Starrydata2 web system. When users paste text from a paper's abstract or experimental methods section, the system sends it to OpenAI's GPT via API and automatically displays candidate entries for pre-designed data fields specific to each materials domain. This tool helps standardize data entry while reducing the manual effort required.
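The auto-suggestion flow described above (pasted text in, candidate values for pre-designed fields out) can be illustrated with a minimal Python sketch. This is not the Starrydata2 implementation: the field names, the prompt wording, and the helper functions `build_prompt` and `parse_candidates` are all assumptions made for illustration, and the model reply is a hard-coded stand-in so that the example runs without any API call.

```python
import json

# Hypothetical field list for one materials domain (not the actual Starrydata schema).
THERMOELECTRIC_FIELDS = ["composition", "synthesis_method", "sample_form"]


def build_prompt(pasted_text, fields):
    """Compose an instruction asking the model for one JSON object keyed by the given fields."""
    field_list = ", ".join(fields)
    return (
        "Extract sample information from the text below. "
        f"Reply with a single JSON object with exactly these keys: {field_list}. "
        "Use null for any field the text does not state.\n\n"
        f"Text:\n{pasted_text}"
    )


def parse_candidates(reply_text, fields):
    """Keep only the expected fields; anything missing or malformed becomes None."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {name: data.get(name) for name in fields}


# Stand-in for a model reply (assumed shape), so the sketch runs offline.
example_reply = (
    '{"composition": "Bi2Te3", '
    '"synthesis_method": "spark plasma sintering", '
    '"extra": 1}'
)

prompt = build_prompt("Bi2Te3 pellets were prepared by ...", THERMOELECTRIC_FIELDS)
candidates = parse_candidates(example_reply, THERMOELECTRIC_FIELDS)
print(candidates)
```

Restricting the parsed reply to a fixed field list is one plausible way to get the standardization benefit the article mentions: unexpected keys are dropped and missing ones show up as empty candidates for the user to fill in.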
The second, more comprehensive tool is Starrydata Auto-Summary GPT, which deconstructs entire open-access paper PDFs and automatically summarizes all descriptions of figures, tables, and samples as structured data in JSON format. The resulting data can be viewed as easy-to-read tables in a web browser, dramatically accelerating the work of data collectors in locating target information. The research detailing these tools was recently published in the journal Science and Technology of Advanced Materials: Methods at https://doi.org/10.1080/27660400.2025.2590811. Katsura noted that many publishers prohibit artificial intelligence use on paper PDFs, so the system currently targets open-access papers. "We found that by specifying a data structure and giving instructions to an LLM, we can accurately and comprehensively extract information about figures, tables, and samples from the text of paper PDFs across a wide range of fields," she said.
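To make the idea of a structured paper summary concrete, the following Python sketch shows one possible JSON layout for figures, tables, and samples, together with a small routine that flattens it into a readable text table. The schema, the key names, and the sample entries are hypothetical placeholders; the actual data structure used by Starrydata Auto-Summary GPT is defined in the published paper and is not reproduced here.

```python
import json

# Hypothetical summary of one paper; keys and values are illustrative only.
summary_json = """
{
  "figures": [
    {"id": "Fig. 1", "description": "Seebeck coefficient vs temperature",
     "samples": ["S1", "S2"]}
  ],
  "tables": [
    {"id": "Table 1", "description": "Nominal compositions",
     "samples": ["S1", "S2"]}
  ],
  "samples": [
    {"id": "S1", "description": "Undoped reference"},
    {"id": "S2", "description": "Doped sample"}
  ]
}
"""

HEADERS = ("kind", "id", "description", "samples")


def to_rows(summary):
    """Flatten the nested summary into (kind, id, description, samples) rows."""
    rows = []
    for kind in ("figures", "tables", "samples"):
        for item in summary.get(kind, []):
            linked = ", ".join(item.get("samples", []))
            rows.append((kind, item["id"], item["description"], linked))
    return rows


def format_table(rows, headers=HEADERS):
    """Render rows as a fixed-width text table with a header line."""
    all_rows = [headers] + rows
    widths = [max(len(str(r[i])) for r in all_rows) for i in range(len(headers))]
    lines = [
        " | ".join(str(cell).ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in all_rows
    ]
    return "\n".join(lines)


rows = to_rows(json.loads(summary_json))
print(format_table(rows))
```

Linking each figure and table to sample identifiers, as assumed here, is what would let a data collector jump from a sample of interest to the figures that report its properties.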
The tools represent a significant advancement because large language models like ChatGPT can perform flexible information extraction that considers background knowledge and context, enabling the conversion of complex scientific papers into structured data. While the JSON data output from the Auto-Summary tool isn't currently incorporated directly into the Starrydata database, it helps data collectors quickly locate target information for manual entry. Reading data points from graph images remains challenging for LLMs, so this task is performed using an independently developed semi-automated tool. "A paper is a logical structure assembled to convey the author's claims, but by deconstructing it and returning it to the form of experimental data, other researchers can also use it for their own research," Katsura explained.
The team aims to establish paper data collection as a recognized form of research within the scientific community and to promote broader awareness of the potential of large-scale experimental data. So far, Starrydata has built databases for specific materials science fields such as thermoelectric materials and magnets, and as an open dataset for new materials development it is beginning to be used by leading researchers worldwide. The journal where the research was published, Science and Technology of Advanced Materials: Methods, focuses on emergent methods for accelerating materials development and maintains an open access website at https://www.tandfonline.com/STAM-M. By automating the extraction of experimental data that would otherwise require manual review of countless papers, these AI tools could let researchers gain inspiration from a bird's-eye view of materials data and use machine learning to predict properties from empirical trends. This approach addresses a fundamental challenge in materials science: theoretical models alone cannot provide reliable predictions, and researcher intuition built on years of experience has traditionally played a significant role.


