Alexander Dunn's research while affiliated with Lawrence Berkeley National Laboratory and other places


Publications (18)


Schematic comparison of previous relation extraction (RE) methods to this work
The objective of each method is to extract entities (colored text) and their relationships from unstructured text. a An example multi-step pipeline approach first performs entity recognition, then intermediate processing such as coreference resolution, and finally classification of links between entities. b seq2seq approaches encode relationships as 2-tuples in the output sequence. Named entities and relationship links are tagged with special symbols (e.g., “@FORMULA@”, “@N2F@”). c The method shown in this work outputs entities and their relationships as JSON documents or other hierarchical structures.
Overview of the proposed sequence-to-sequence approach to document-level joint named entity recognition and relationship extraction task
In the first step, lists of JSON documents are prepared from abstracts according to a predefined schema, and the large language model (LLM) is trained. In the second step, this preliminary (intermediate) model is used to accelerate the preparation of additional training data by pre-annotation with the partially trained model and manual correction. An example error is shown highlighted in red. This step may be repeated multiple times with each subsequent partial fine-tuning improving in performance. In the final step, the LLM is fine-tuned on the complete dataset and used for inference to extract desired information from new text.
Annotation time as a function of intermediate large language model (LLM) fine-tuning samples for the named entity recognition and relation extraction (NERRE) method
We show the time taken for a domain expert to annotate new abstracts for the general materials chemistry task with assistance from intermediate (partially-trained) LLM-NERRE models on a (a) word basis, (b) material entry basis, and (c) token basis. Outputs from models trained on more data contain fewer mistakes and require less time to correct. Source data are provided as a Source Data file.
Test set performance vs. number of training samples for the doping extraction task using GPT-3 with the Doping-English schema
This schema specifically requires the model to learn a new and specific sentence structure to use as the output. We separate scores by (a) host-dopant links (relations), (b) host entities alone, and (c) dopant entities alone. We note that below approximately 10 samples, the scores are zero because the model has not learned the specific structure of the desired output sentences. Source data are provided as a Source Data file.
Diagrams of doping information extraction using large language models (LLMs) for joint named entity and relation extraction (NERRE)
In all three panels, an LLM trained to output a particular schema (far left) reads a raw text prompt and outputs a structured completion in that schema. The structured completion can then be parsed, decoded, and formatted to construct relational diagrams (far right). We show an example for each schema (desired output structure). Parsing refers to the reading of the structured output, while decoding refers to the programmatic (rule-based) conversion of that output into JSON form. Normalization and postprocessing are programmatic steps which transform raw strings (e.g., “Co+2”) into structured entities with attributes (e.g., Element: Co, Oxidation state: +2). a Raw sentences are passed to the model with the Doping-English schema, which outputs newline-separated structured sentences that contain one host and one or more dopant entities. b Raw sentences are passed to a model with the Doping-JSON schema, which outputs a nested JSON object. Each host entity has its own key-value pair, as does each dopant entity. There is also a list of host2dopant relations that links the corresponding dopant keys to each host key. c Example for the extraction with a model using the DopingExtra-English schema. The first part of the schema is the same as in a, but additional information is contained in doping modifiers, and results-bearing sentences are included at the end of the schema.
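As a concrete illustration of the Doping-JSON layout described in panel b, a minimal sketch is given below; the key names are illustrative assumptions rather than the exact schema published with the paper.

```python
import json

# Hypothetical Doping-JSON completion for the sentence
# "We studied Mn-doped ZnO thin films."
# Key names are illustrative; the published schema may differ.
completion = {
    "hosts": {"h0": "ZnO"},          # each host entity gets its own key-value pair
    "dopants": {"d0": "Mn"},         # each dopant entity gets its own key-value pair
    "host2dopant": {"h0": ["d0"]},   # relations linking dopant keys to each host key
}

print(json.dumps(completion, indent=2))
```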


Structured information extraction from scientific text with large language models
  • Article
  • Full-text available

February 2024 · 420 Reads · 45 Citations · Nature Communications

Alexander Dunn · Sanghoon Lee · [...]

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
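To make the fine-tuning setup in the abstract concrete, the sketch below assembles prompt–completion pairs (the completion being the structured record to extract) and writes them as JSONL, a format commonly used when fine-tuning GPT-3-class models; the example sentence, separator token, schema keys, and file name are all illustrative rather than taken from the paper.

```python
import json

# Hypothetical training examples for joint named entity recognition and
# relation extraction: each completion is the structured record the fine-tuned
# model should emit for the given passage. All field names are illustrative.
examples = [
    {
        "prompt": "Mn-doped ZnO thin films were grown by pulsed laser deposition.\n\n###\n\n",
        "completion": json.dumps(
            {"hosts": ["ZnO"], "dopants": ["Mn"], "links": [["ZnO", "Mn"]]}
        ),
    },
]

# One JSON object per line (JSONL), a common input format for LLM fine-tuning.
with open("nerre_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```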


Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Scientific Text with LLMs

September 2023 · 87 Reads · 7 Citations · Digital Discovery

Although gold nanorods have been the subject of much research, the pathways for controlling their shape and thereby their optical properties remain largely heuristically understood. Although it is apparent that the simultaneous presence of and interaction between various reagents during synthesis control these properties, computational and experimental approaches for exploring the synthesis space can be either intractable or too time-consuming in practice. This motivates an alternative approach leveraging the wealth of synthesis information already embedded in the body of scientific literature by developing tools to extract relevant structured data in an automated, high-throughput manner. To that end, we present an approach using the powerful GPT-3 language model to extract structured multi-step seed-mediated growth procedures and outcomes for gold nanorods from unstructured scientific text. GPT-3 prompt completions are fine-tuned to predict synthesis templates in the form of JSON documents from unstructured text input with an overall accuracy of 86% aggregated by entities and 76% aggregated by papers. The performance is notable, considering the model is performing simultaneous entity recognition and relation extraction. We present a dataset of 11 644 entities extracted from 1137 papers, resulting in 268 papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.
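A rough sketch of the kind of structured record this extraction produces is shown below; the field names and numeric values are hypothetical placeholders chosen for illustration, not the schema or data reported in the paper.

```python
import json

# Hypothetical structured record for one seed-mediated gold nanorod growth
# procedure. Every key and value here is a placeholder for illustration only.
record = {
    "seed_solution": {"reagents": ["HAuCl4", "NaBH4", "CTAB"]},
    "growth_solution": {"reagents": ["HAuCl4", "AgNO3", "ascorbic acid", "CTAB"]},
    "outcome": {"aspect_ratio": 3.5, "lspr_peak_nm": 780},
}

print(json.dumps(record, indent=2))
```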


Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Literature with GPT-3

April 2023 · 121 Reads

Although gold nanorods have been the subject of much research, the pathways for controlling their shape and thereby their optical properties remain largely heuristically understood. Although it is apparent that the simultaneous presence of and interaction between various reagents during synthesis control these properties, computational and experimental approaches for exploring the synthesis space can be either intractable or too time-consuming in practice. This motivates an alternative approach leveraging the wealth of synthesis information already embedded in the body of scientific literature by developing tools to extract relevant structured data in an automated, high-throughput manner. To that end, we present an approach using the powerful GPT-3 language model to extract structured multi-step seed-mediated growth procedures and outcomes for gold nanorods from unstructured scientific text. GPT-3 prompt completions are fine-tuned to predict synthesis templates in the form of JSON documents from unstructured text input with an overall accuracy of 86%. The performance is notable, considering the model is performing simultaneous entity recognition and relation extraction. We present a dataset of 11,644 entities extracted from 1,137 papers, resulting in 268 papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.


Fig. 7: Test set performance by number of training samples for the doping extraction task using the Doping-ENG model.
Fig. 8: Time taken for a domain expert to annotate new abstracts for the general materials chemistry task with assistance from intermediate (partially-trained) LLM-NERRE models. At each data point, an intermediate model trained on n samples sampled from the original training set (where n is shown on the x axis) infers the completion given the same text prompt presented to the annotator. The pre-populated annotation is then corrected by the annotator. Resulting times for each annotation are shown per abstract, per material mention (JSON document), and per prompt token. The green verification line represents the time taken for the annotator to simply verify whether a given annotation is entirely correct. The verification line represents a lower bound on the annotation time; even with a perfectly-performing model, the annotator must still take time to verify the annotation.
Parameters for the models trained on the three materials information extraction tasks.
Scores for the MOF-JSON model on the metal-organic framework (MOF) information extraction task, evaluated on an exact word-match basis. Links are only correct if both entities and the relationship are correct.
Manual information extraction scores for the general materials task. Scores measure the model's ability to extract inter-related data together (i.e. assigning entities correct labels and grouping them appropriately, as described in Section II.)
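The exact word-match scoring convention referenced in these captions (a link counts only if both entities and the relationship match the gold annotation) can be sketched as simple set comparisons; the function and example data below are illustrative, not the authors' evaluation code.

```python
# Minimal sketch of exact-match scoring for extracted links. A (host, dopant)
# link is correct only if both entity strings match the gold annotation exactly.
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


gold_links = {("ZnO", "Mn"), ("TiO2", "N")}         # illustrative gold annotations
pred_links = {("ZnO", "Mn"), ("TiO2", "C")}         # illustrative model output
print(precision_recall_f1(pred_links, gold_links))  # (0.5, 0.5, 0.5)
```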
Structured information extraction from complex scientific text with fine-tuned large language models

December 2022 · 1,305 Reads · 12 Citations

Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.


Machine-Learning Rationalization and Prediction of Solid-State Synthesis Conditions

August 2022 · 158 Reads · 38 Citations · Chemistry of Materials

There currently exist no quantitative methods to determine the appropriate conditions for solid-state synthesis. This not only hinders the experimental realization of novel materials but also complicates the interpretation and understanding of solid-state reaction mechanisms. Here, we demonstrate a machine-learning approach that predicts synthesis conditions using large solid-state synthesis data sets text-mined from scientific journal articles. Using feature importance ranking analysis, we discovered that optimal heating temperatures have strong correlations with the stability of precursor materials quantified using melting points and formation energies (ΔG_f, ΔH_f). In contrast, features derived from the thermodynamics of synthesis-related reactions did not directly correlate to the chosen heating temperatures. This correlation between optimal solid-state heating temperature and precursor stability extends Tammann's rule from intermetallics to oxide systems, suggesting the importance of reaction kinetics in determining synthesis conditions. Heating times are shown to be strongly correlated with the chosen experimental procedures and instrument setups, which may be indicative of human bias in the data set. Using these predictive features, we constructed machine-learning models with good performance and general applicability to predict the conditions required to synthesize diverse chemical systems.
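The feature-importance analysis described in this abstract can be approximated with standard tooling; the sketch below fits a gradient-boosted regressor on synthetic precursor features (melting point, formation enthalpy, reaction energy) and ranks their importances. The data and feature names are placeholders, not the paper's text-mined dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for precursor-derived features: melting point,
# formation enthalpy, and reaction driving force (placeholder units).
X = rng.normal(size=(200, 3))
# Synthetic heating temperature loosely tied to the first two features,
# mimicking the reported correlation with precursor stability.
y = 900 + 150 * X[:, 0] + 80 * X[:, 1] + rng.normal(scale=20, size=200)

model = GradientBoostingRegressor().fit(X, y)
for name, importance in zip(
    ["melting_point", "formation_enthalpy", "reaction_energy"],
    model.feature_importances_,
):
    print(f"{name}: {importance:.2f}")
```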


Machine-learning rationalization and prediction of solid-state synthesis conditions

April 2022 · 279 Reads

There currently exist no quantitative methods to determine the appropriate conditions for solid-state synthesis. This not only hinders the experimental realization of novel materials but also complicates the interpretation and understanding of solid-state reaction mechanisms. Here, we demonstrate a machine-learning approach that predicts synthesis conditions using large solid-state synthesis datasets text-mined from scientific journal articles. Using feature importance ranking analysis, we discovered that optimal heating temperatures have strong correlations with the stability of precursor materials quantified using melting points and formation energies (ΔG_f, ΔH_f). In contrast, features derived from the thermodynamics of synthesis-related reactions did not directly correlate to the chosen heating temperatures. This correlation between optimal solid-state heating temperature and precursor stability extends Tammann's rule from intermetallics to oxide systems, suggesting the importance of reaction kinetics in determining synthesis conditions. Heating times are shown to be strongly correlated with the chosen experimental procedures and instrument setups, which may be indicative of human bias in the dataset. Using these predictive features, we constructed machine-learning models with good performance and general applicability to predict the conditions required to synthesize diverse chemical systems. Codes and data used in this work can be found at: https://github.com/CederGroupHub/s4.


Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science

April 2022 · 326 Reads · 77 Citations · Patterns

A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to an increase in publications. This problem may be addressed by using named entity recognition (NER) to extract structured summary-level data from unstructured materials science text. We compare the performance of four NER models on three materials science datasets. The four models include a bidirectional long short-term memory (BiLSTM) and three transformer models (BERT, SciBERT, and MatBERT) with increasing degrees of domain-specific materials science pre-training. MatBERT improves over the other two BERT_BASE-based models by 1%–12%, implying that domain-specific pre-training provides measurable advantages. Despite relative architectural simplicity, the BiLSTM model consistently outperforms BERT, perhaps due to its domain-specific pre-trained word embeddings. Furthermore, MatBERT and SciBERT models outperform the original BERT model to a greater extent in the small data limit. MatBERT’s higher-quality predictions should accelerate the extraction of structured data from materials science literature.
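For readers unfamiliar with the transformer NER setup being benchmarked, the sketch below loads a science-domain BERT checkpoint for token classification with the Hugging Face transformers library and runs it on a materials sentence. The checkpoint name and label count are stand-ins (the classification head is untrained until fine-tuned on labeled NER data), and the paper's MatBERT weights are not referenced here.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Stand-in checkpoint; the paper compares BERT, SciBERT, and MatBERT.
checkpoint = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# num_labels is illustrative (e.g., BIO tags for a few materials entity types);
# the token-classification head is randomly initialized until fine-tuned.
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=7)

inputs = tokenizer(
    "LiFePO4 cathodes were synthesized by solid-state reaction.",
    return_tensors="pt",
)
logits = model(**inputs).logits  # shape: (batch, num_tokens, num_labels)
print(logits.shape)
```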


Experimental validation of high thermoelectric performance in RECuZnP2 predicted by high-throughput DFT calculations

January 2021 · 158 Reads · 46 Citations · Materials Horizons

Accurate density functional theory calculations of the interrelated properties of thermoelectric materials entail high computational cost, especially as crystal structures increase in complexity and size. New methods involving ab initio scattering and transport (AMSET) and compressive sensing lattice dynamics are used to compute the transport properties of quaternary CaAl2Si2-type rare-earth phosphides RECuZnP2 (RE = Pr, Nd, Er), which were identified to be promising thermoelectrics from high-throughput screening of 20 000 disordered compounds. Experimental measurements of the transport properties agree well with the computed values. Compounds with stiff bulk moduli (>80 GPa) and high speeds of sound (>3500 m s⁻¹) such as RECuZnP2 are typically dismissed as thermoelectric materials because they are expected to exhibit high lattice thermal conductivity. However, RECuZnP2 exhibits not only low electrical resistivity, but also low lattice thermal conductivity (∼1 W m⁻¹ K⁻¹). Contrary to prior assumptions, polar-optical phonon scattering was revealed by AMSET to be the primary mechanism limiting the electronic mobility of these compounds, raising questions about existing assumptions of scattering mechanisms in this class of thermoelectric materials. The resulting thermoelectric performance (zT of 0.5 for ErCuZnP2 at 800 K) is among the best observed in phosphides and can likely be improved with further optimization.
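The figure of merit quoted in this abstract follows the standard relation zT = S²σT/κ; the short sketch below evaluates it for placeholder values of the Seebeck coefficient, electrical conductivity, and total thermal conductivity (the numbers are illustrative, not measurements from the paper).

```python
def figure_of_merit(seebeck_V_per_K: float,
                    conductivity_S_per_m: float,
                    thermal_conductivity_W_per_m_K: float,
                    temperature_K: float) -> float:
    """Thermoelectric figure of merit zT = S^2 * sigma * T / kappa (dimensionless)."""
    return (seebeck_V_per_K ** 2 * conductivity_S_per_m * temperature_K
            / thermal_conductivity_W_per_m_K)

# Placeholder inputs: 200 uV/K, 5e4 S/m, 1.6 W/(m K), 800 K -> zT of about 1.0.
print(figure_of_merit(200e-6, 5e4, 1.6, 800.0))
```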




Citations (15)


... Second, due to the static nature of LLMs' knowledge and the use of generic text in training, they often struggle with extraction queries that require domain-specific clinical knowledge (Ji et al., 2023). Third, although LLMs may provide high accuracy for basic extraction tasks, they often miss fine-grained details (Dagdelen et al., 2024). This is because extraction of lung lesion information requires an understanding of domain-specific fields (such as margin and solidity) that are not included in predefined schema applicable for more general domains (Linguistic Data Consortium, 2006, 2008). ...

Reference:

Automated Clinical Data Extraction with Knowledge Conditioned LLMs
Structured information extraction from scientific text with large language models

Nature Communications

... (Polak et al., 2024) explore the use of GPT-4 to extract bulk modulus information from sentences extracted from scientific papers, while (Yang et al., 2023) uses zero-shot ChatGPT with a verification step to extract band gap information from sentences collected in a dataset by (Dong and Cole, 2022). (Walker et al., 2023) fine-tunes GPT-3.5 to extract growth procedures and outcomes for gold nanorods from paper text. (Montanelli et al., 2024) uses the Cohere model to extract phase-property relationships from a corpus of full-text materials papers. ...

Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Scientific Text with LLMs
Digital Discovery


... Advancements in NLP, notably the use of large language models such as GPT-3, have revolutionized the structuring of unstructured data for misinformation analysis. Dunn et al. demonstrated that complex scientific information could be extracted from GPT-3, albeit cautioning about the risks of data 'hallucination' [26]. Agrawal et al. [27] showcased GPT-3's effectiveness in clinical information extraction, outperforming existing baselines even without domain-specific training. ...

Structured information extraction from complex scientific text with fine-tuned large language models

... As mentioned earlier, data selection and labeling are difficult in such models due to the absence of unsynthesizable crystal compounds. The second group of studies aims to develop models for predicting synthesis routes or reactions (e.g., solid-state, sol-gel, or solution-hydrothermal, precipitation), synthesis procedures, synthesis conditions (e.g., temperatures, times), or synthesis precursors or reactants [37][38][39][40][41][42][43][44][45][46] . These studies encompass a range of approaches, from data-driven learning of materials synthesis information using natural language processing of existing scientific literature [38][39][40][41][42][43][44] to the development of graph-based networks based on thermodynamic and kinetic data (i.e., physics-informed) [46][47][48][49] . ...

Machine-Learning Rationalization and Prediction of Solid-State Synthesis Conditions

Chemistry of Materials

... These tasks can be carried out by supervised machine learning algorithms or rule-based methods. The supervised machine learning NER approach was deployed to extract materials synthesis parameters [12], polymer names [13], general materials information [14] and doping procedures of materials [15] from literature. Similar models were applied to RE to extract synthesis parameters [16]. ...

Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science

Patterns

... Although these general-purpose models exhibited strong performance, the distributional shift of vocabulary led to sub-optimal performance on domain-specific natural language understanding and generation tasks (Beltagy et al., 2019). Following this observation, several domain-specific LLMs such as SCIBERT (Beltagy et al., 2019), BIOBERT, MATBERT (Walker et al., 2021), BATTERYBERT (Huang and Cole, 2022) and SCHOLARBERT (Hong et al., 2023) were developed with the goal of improving accuracy on in-domain NLP tasks (Araci, 2019; Wu et al., 2023). ...

The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science
  • Citing Article
  • January 2021

SSRN Electronic Journal

... The evaluation of the TE performance in semiconductors also depends on the scattering mechanism of carriers; hence the current work considers three crucial types of carrier scattering: ADP scattering, IMP scattering, and POP scattering. In the literature, we see that IMP and POP scattering do have a more significant effect on electronic transport in terms of carrier lifetime over the generally considered ADP scattering based on deformation theory [75,76]. Therefore, these metrics are drawn for both electron (n-type) and hole (p-type) doping in Fig. 6. ...

Experimental validation of high thermoelectric performance in RECuZnP2 predicted by high-throughput DFT calculations
  • Citing Article
  • January 2021

Materials Horizons

... It is therefore no surprise that machine learning (ML) has gathered substantial interest as a means to develop efficient surrogate models for the prediction of elastic properties. In a nutshell, state-of-the-art ML models for elastic properties encode compositional information [19][20][21] and/or structural information [20][21][22][23] in a material as feature vectors and then map them to a target using some regression algorithms. This approach is adopted in many existing works for learning elastic properties (e.g., for alloys [24][25][26][27] and polycrystals [28,29]) and related atomic properties like stress and energy fields [30]. ...

Author Correction: Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm

npj Computational Materials

... The benchmark unifies four previously reported representations and five new representations we propose in this work, covering many relevant inductive biases. The MatText benchmark is based on open-source datasets and established materials informatics tooling (Dunn et al., 2020; Ong et al., 2013) and can be easily applied to test other systems. • MatText Representations: To enable the MatText benchmark, we also develop a software package to convert geometric representations of materials into text-based representations. ...

Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm

npj Computational Materials

... For the former, standard computational Molecular Operating Environment (MOE) 18,19 chemical descriptors were used, while descriptors relying on a three-dimensional shape of the investigated molecule were excluded to avoid adding any potential ambiguities to the descriptor space. Crystal-level descriptors, as implemented in a Python library, [20][21][22] included atomic orbital information, i.e., energies of highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO), orbital energies, types of and distances between atomic sites, and fractions of nearest neighbors for each atomic and bond type. ...

Gapped metals as thermoelectric materials revealed by high-throughput screening
  • Citing Article
  • August 2020

Journal of Materials Chemistry A