Alexander Dunn's research while affiliated with Lawrence Berkeley National Laboratory and other places


Publications (18)


Schematic comparison of previous relation extraction (RE) methods to this work
The objective of each method is to extract entities (colored text) and their relationships from unstructured text. a An example multi-step pipeline approach first performs entity recognition, then intermediate processing such as coreference resolution, and finally classification of links between entities. b seq2seq approaches encode relationships as 2-tuples in the output sequence. Named entities and relationship links are tagged with special symbols (e.g., “@FORMULA@”, “@N2F@”). c The method shown in this work outputs entities and their relationships as JSON documents or other hierarchical structures.
Overview of the proposed sequence-to-sequence approach to document-level joint named entity recognition and relationship extraction task
In the first step, lists of JSON documents are prepared from abstracts according to a predefined schema, and the large language model (LLM) is trained. In the second step, this preliminary (intermediate) model is used to accelerate the preparation of additional training data by pre-annotation with the partially trained model and manual correction. An example error is shown highlighted in red. This step may be repeated multiple times with each subsequent partial fine-tuning improving in performance. In the final step, the LLM is fine-tuned on the complete dataset and used for inference to extract desired information from new text.
Annotation time as a function of intermediate large language model (LLM) fine-tuning samples for the named entity recognition and relation extraction (NERRE) method
We show the time taken for a domain expert to annotate new abstracts for the general materials chemistry task with assistance from intermediate (partially-trained) LLM-NERRE models on a (a) word basis, (b) material entry basis, and (c) token basis. Outputs from models trained on more data contain fewer mistakes and require less time to correct. Source data are provided as a Source Data file.
Test set performance vs. number of training samples for the doping extraction task using GPT-3 with the Doping-English schema
This schema specifically requires the model to learn a new and specific sentence structure to use as the output. We separate scores by (a) host-dopant links (relations), (b) host entities alone, and (c) dopant entities alone. We note that below approximately 10 samples, the scores are zero because the model has not learned the specific structure of the desired output sentences. Source data are provided as a Source Data file.
Diagrams of doping information extraction using large language models (LLMs) for joint named entity and relation extraction (NERRE)
In all three panels, an LLM trained to output a particular schema (far left) reads a raw text prompt and outputs a structured completion in that schema. The structured completion can then be parsed, decoded, and formatted to construct relational diagrams (far right). We show an example for each schema (desired output structure). Parsing refers to the reading of the structured output, while decoding refers to the programmatic (rule-based) conversion of that output into JSON form. Normalization and postprocessing are programmatic steps which transform raw strings (e.g., “Co+2”) into structured entities with attributes (e.g., Element: Co, Oxidation state: +2). a Raw sentences are passed to the model with the Doping-English schema, which outputs newline-separated structured sentences that contain one host and one or more dopant entities. b Raw sentences are passed to a model with the Doping-JSON schema, which outputs a nested JSON object. Each host entity has its own key-value pair, as does each dopant entity. There is also a list of host2dopant relations that links the corresponding dopant keys to each host key. c Example for the extraction with a model using the DopingExtra-English schema. The first part of the schema is the same as in a, but additional information is contained in doping modifiers, and results-bearing sentences are included at the end of the schema.
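As a concrete illustration of the Doping-JSON layout described in panel b, a minimal sketch is given below; the key names are illustrative assumptions rather than the exact schema published with the paper.

```python
import json

# Hypothetical Doping-JSON completion for the sentence
# "We studied Mn-doped ZnO thin films."
# Key names are illustrative; the published schema may differ.
completion = {
    "hosts": {"h0": "ZnO"},          # each host entity gets its own key-value pair
    "dopants": {"d0": "Mn"},         # each dopant entity gets its own key-value pair
    "host2dopant": {"h0": ["d0"]},   # relations linking dopant keys to each host key
}

print(json.dumps(completion, indent=2))
```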


Structured information extraction from scientific text with large language models
  • Article
  • Full-text available

February 2024 · 420 Reads · 45 Citations · Nature Communications

Alexander Dunn · Sanghoon Lee · [...]

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
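To make the fine-tuning setup in the abstract concrete, the sketch below assembles prompt–completion pairs (the completion being the structured record to extract) and writes them as JSONL, a format commonly used when fine-tuning GPT-3-class models; the example sentence, separator token, schema keys, and file name are all illustrative rather than taken from the paper.

```python
import json

# Hypothetical training examples for joint named entity recognition and
# relation extraction: each completion is the structured record the fine-tuned
# model should emit for the given passage. All field names are illustrative.
examples = [
    {
        "prompt": "Mn-doped ZnO thin films were grown by pulsed laser deposition.\n\n###\n\n",
        "completion": json.dumps(
            {"hosts": ["ZnO"], "dopants": ["Mn"], "links": [["ZnO", "Mn"]]}
        ),
    },
]

# One JSON object per line (JSONL), a common input format for LLM fine-tuning.
with open("nerre_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```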


Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Scientific Text with LLMs

September 2023 · 87 Reads · 7 Citations · Digital Discovery

Although gold nanorods have been the subject of much research, the pathways for controlling their shape and thereby their optical properties remain largely heuristically understood. Although it is apparent that the simultaneous presence of and interaction between various reagents during synthesis control these properties, computational and experimental approaches for exploring the synthesis space can be either intractable or too time-consuming in practice. This motivates an alternative approach leveraging the wealth of synthesis information already embedded in the body of scientific literature by developing tools to extract relevant structured data in an automated, high-throughput manner. To that end, we present an approach using the powerful GPT-3 language model to extract structured multi-step seed-mediated growth procedures and outcomes for gold nanorods from unstructured scientific text. GPT-3 prompt completions are fine-tuned to predict synthesis templates in the form of JSON documents from unstructured text input with an overall accuracy of 86% aggregated by entities and 76% aggregated by papers. The performance is notable, considering the model is performing simultaneous entity recognition and relation extraction. We present a dataset of 11 644 entities extracted from 1137 papers, resulting in 268 papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.
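A rough sketch of the kind of structured record this extraction produces is shown below; the field names and numeric values are hypothetical placeholders chosen for illustration, not the schema or data reported in the paper.

```python
import json

# Hypothetical structured record for one seed-mediated gold nanorod growth
# procedure. Every key and value here is a placeholder for illustration only.
record = {
    "seed_solution": {"reagents": ["HAuCl4", "NaBH4", "CTAB"]},
    "growth_solution": {"reagents": ["HAuCl4", "AgNO3", "ascorbic acid", "CTAB"]},
    "outcome": {"aspect_ratio": 3.5, "lspr_peak_nm": 780},
}

print(json.dumps(record, indent=2))
```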


Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Literature with GPT-3

April 2023 · 121 Reads

Although gold nanorods have been the subject of much research, the pathways for controlling their shape and thereby their optical properties remain largely heuristically understood. Although it is apparent that the simultaneous presence of and interaction between various reagents during synthesis control these properties, computational and experimental approaches for exploring the synthesis space can be either intractable or too time-consuming in practice. This motivates an alternative approach leveraging the wealth of synthesis information already embedded in the body of scientific literature by developing tools to extract relevant structured data in an automated, high-throughput manner. To that end, we present an approach using the powerful GPT-3 language model to extract structured multi-step seed-mediated growth procedures and outcomes for gold nanorods from unstructured scientific text. GPT-3 prompt completions are fine-tuned to predict synthesis templates in the form of JSON documents from unstructured text input with an overall accuracy of 86%. The performance is notable, considering the model is performing simultaneous entity recognition and relation extraction. We present a dataset of 11,644 entities extracted from 1,137 papers, resulting in 268 papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.


Fig. 7: Test set performance by number of training samples for the doping extraction task using the Doping-ENG model.
Fig. 8: Time taken for a domain expert to annotate new abstracts for the general materials chemistry task with assistance from intermediate (partially-trained) LLM-NERRE models. At each data point, an intermediate model trained on n samples sampled from the original training set (where n is shown on the x axis) infers the completion given the same text prompt presented to the annotator. The pre-populated annotation is then corrected by the annotator. Resulting times for each annotation are shown per abstract, per material mention (JSON document), and per prompt token. The green verification line represents the time taken for the annotator to simply verify whether a given annotation is entirely correct. The verification line represents a lower bound on the annotation time; even with a perfectly-performing model, the annotator must still take time to verify the annotation.
Parameters for the models trained on the three materials information extraction tasks.
Scores for the MOF-JSON model on the metal-organic framework (MOF) information extraction task, evaluated on an exact word-match basis. Links are only correct if both entities and the relationship are correct.
Manual information extraction scores for the general materials task. Scores measure the model's ability to extract inter-related data together (i.e. assigning entities correct labels and grouping them appropriately, as described in Section II.)
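The exact word-match scoring convention referenced in these captions (a link counts only if both entities and the relationship match the gold annotation) can be sketched as simple set comparisons; the function and example data below are illustrative, not the authors' evaluation code.

```python
# Minimal sketch of exact-match scoring for extracted links. A (host, dopant)
# link is correct only if both entity strings match the gold annotation exactly.
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


gold_links = {("ZnO", "Mn"), ("TiO2", "N")}         # illustrative gold annotations
pred_links = {("ZnO", "Mn"), ("TiO2", "C")}         # illustrative model output
print(precision_recall_f1(pred_links, gold_links))  # (0.5, 0.5, 0.5)
```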
Structured information extraction from complex scientific text with fine-tuned large language models

December 2022 · 1,305 Reads · 12 Citations

Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.


Machine-Learning Rationalization and Prediction of Solid-State Synthesis Conditions

August 2022 · 158 Reads · 38 Citations · Chemistry of Materials

There currently exist no quantitative methods to determine the appropriate conditions for solid-state synthesis. This not only hinders the experimental realization of novel materials but also complicates the interpretation and understanding of solid-state reaction mechanisms. Here, we demonstrate a machine-learning approach that predicts synthesis conditions using large solid-state synthesis data sets text-mined from scientific journal articles. Using feature importance ranking analysis, we discovered that optimal heating temperatures have strong correlations with the stability of precursor materials quantified using melting points and formation energies (ΔG_f, ΔH_f). In contrast, features derived from the thermodynamics of synthesis-related reactions did not directly correlate to the chosen heating temperatures. This correlation between optimal solid-state heating temperature and precursor stability extends Tammann's rule from intermetallics to oxide systems, suggesting the importance of reaction kinetics in determining synthesis conditions. Heating times are shown to be strongly correlated with the chosen experimental procedures and instrument setups, which may be indicative of human bias in the data set. Using these predictive features, we constructed machine-learning models with good performance and general applicability to predict the conditions required to synthesize diverse chemical systems.
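The feature-importance analysis described in this abstract can be approximated with standard tooling; the sketch below fits a gradient-boosted regressor on synthetic precursor features (melting point, formation enthalpy, reaction energy) and ranks their importances. The data and feature names are placeholders, not the paper's text-mined dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for precursor-derived features: melting point,
# formation enthalpy, and reaction driving force (placeholder units).
X = rng.normal(size=(200, 3))
# Synthetic heating temperature loosely tied to the first two features,
# mimicking the reported correlation with precursor stability.
y = 900 + 150 * X[:, 0] + 80 * X[:, 1] + rng.normal(scale=20, size=200)

model = GradientBoostingRegressor().fit(X, y)
for name, importance in zip(
    ["melting_point", "formation_enthalpy", "reaction_energy"],
    model.feature_importances_,
):
    print(f"{name}: {importance:.2f}")
```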


Machine-learning rationalization and prediction of solid-state synthesis conditions

April 2022 · 279 Reads

There currently exist no quantitative methods to determine the appropriate conditions for solid-state synthesis. This not only hinders the experimental realization of novel materials but also complicates the interpretation and understanding of solid-state reaction mechanisms. Here, we demonstrate a machine-learning approach that predicts synthesis conditions using large solid-state synthesis datasets text-mined from scientific journal articles. Using feature importance ranking analysis, we discovered that optimal heating temperatures have strong correlations with the stability of precursor materials quantified using melting points and formation energies (ΔG_f, ΔH_f). In contrast, features derived from the thermodynamics of synthesis-related reactions did not directly correlate to the chosen heating temperatures. This correlation between optimal solid-state heating temperature and precursor stability extends Tammann's rule from intermetallics to oxide systems, suggesting the importance of reaction kinetics in determining synthesis conditions. Heating times are shown to be strongly correlated with the chosen experimental procedures and instrument setups, which may be indicative of human bias in the dataset. Using these predictive features, we constructed machine-learning models with good performance and general applicability to predict the conditions required to synthesize diverse chemical systems. Codes and data used in this work can be found at: https://github.com/CederGroupHub/s4.


Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science

April 2022 · 326 Reads · 77 Citations · Patterns

A bottleneck in efficiently connecting new materials discoveries to established literature has arisen due to an increase in publications. This problem may be addressed by using named entity recognition (NER) to extract structured summary-level data from unstructured materials science text. We compare the performance of four NER models on three materials science datasets. The four models include a bidirectional long short-term memory (BiLSTM) and three transformer models (BERT, SciBERT, and MatBERT) with increasing degrees of domain-specific materials science pre-training. MatBERT improves over the other two BERT_BASE-based models by 1%–12%, implying that domain-specific pre-training provides measurable advantages. Despite relative architectural simplicity, the BiLSTM model consistently outperforms BERT, perhaps due to its domain-specific pre-trained word embeddings. Furthermore, MatBERT and SciBERT models outperform the original BERT model to a greater extent in the small data limit. MatBERT’s higher-quality predictions should accelerate the extraction of structured data from materials science literature.
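For readers unfamiliar with the transformer NER setup being benchmarked, the sketch below loads a science-domain BERT checkpoint for token classification with the Hugging Face transformers library and runs it on a materials sentence. The checkpoint name and label count are stand-ins (the classification head is untrained until fine-tuned on labeled NER data), and the paper's MatBERT weights are not referenced here.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Stand-in checkpoint; the paper compares BERT, SciBERT, and MatBERT.
checkpoint = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# num_labels is illustrative (e.g., BIO tags for a few materials entity types);
# the token-classification head is randomly initialized until fine-tuned.
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=7)

inputs = tokenizer(
    "LiFePO4 cathodes were synthesized by solid-state reaction.",
    return_tensors="pt",
)
logits = model(**inputs).logits  # shape: (batch, num_tokens, num_labels)
print(logits.shape)
```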


Experimental validation of high thermoelectric performance in RECuZnP2 predicted by high-throughput DFT calculations

January 2021 · 158 Reads · 46 Citations · Materials Horizons

Accurate density functional theory calculations of the interrelated properties of thermoelectric materials entail high computational cost, especially as crystal structures increase in complexity and size. New methods involving ab initio scattering and transport (AMSET) and compressive sensing lattice dynamics are used to compute the transport properties of quaternary CaAl2Si2-type rare-earth phosphides RECuZnP2 (RE = Pr, Nd, Er), which were identified to be promising thermoelectrics from high-throughput screening of 20 000 disordered compounds. Experimental measurements of the transport properties agree well with the computed values. Compounds with stiff bulk moduli (>80 GPa) and high speeds of sound (>3500 m s⁻¹) such as RECuZnP2 are typically dismissed as thermoelectric materials because they are expected to exhibit high lattice thermal conductivity. However, RECuZnP2 exhibits not only low electrical resistivity, but also low lattice thermal conductivity (∼1 W m⁻¹ K⁻¹). Contrary to prior assumptions, polar-optical phonon scattering was revealed by AMSET to be the primary mechanism limiting the electronic mobility of these compounds, raising questions about existing assumptions of scattering mechanisms in this class of thermoelectric materials. The resulting thermoelectric performance (zT of 0.5 for ErCuZnP2 at 800 K) is among the best observed in phosphides and can likely be improved with further optimization.
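The figure of merit quoted in this abstract follows the standard relation zT = S²σT/κ; the short sketch below evaluates it for placeholder values of the Seebeck coefficient, electrical conductivity, and total thermal conductivity (the numbers are illustrative, not measurements from the paper).

```python
def figure_of_merit(seebeck_V_per_K: float,
                    conductivity_S_per_m: float,
                    thermal_conductivity_W_per_m_K: float,
                    temperature_K: float) -> float:
    """Thermoelectric figure of merit zT = S^2 * sigma * T / kappa (dimensionless)."""
    return (seebeck_V_per_K ** 2 * conductivity_S_per_m * temperature_K
            / thermal_conductivity_W_per_m_K)

# Placeholder inputs: 200 uV/K, 5e4 S/m, 1.6 W/(m K), 800 K -> zT of about 1.0.
print(figure_of_merit(200e-6, 5e4, 1.6, 800.0))
```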




Citations (15)


... Second, due to the static nature of LLMs' knowledge and the use of generic text in training, they often struggle with extraction queries that require domain-specific clinical knowledge (Ji et al., 2023). Third, although LLMs may provide high accuracy for basic extraction tasks, they often miss fine-grained details (Dagdelen et al., 2024). This is because extraction of lung lesion information requires an understanding of domain-specific fields (such as margin and solidity) that are not included in predefined schema applicable for more general domains (Linguistic Data Consortium, 2006, 2008). ...

Reference:

Automated Clinical Data Extraction with Knowledge Conditioned LLMs
Structured information extraction from scientific text with large language models

Nature Communications

... (Polak et al., 2024) explore the use of GPT-4 to extract bulk modulus information from sentences extracted from scientific papers, while (Yang et al., 2023) uses zero-shot ChatGPT with a verification step to extract band gap information from sentences collected in a dataset by (Dong and Cole, 2022). (Walker et al., 2023) fine-tunes GPT-3.5 to extract growth procedures and outcomes for gold nanorods from paper text. (Montanelli et al., 2024) uses the Cohere model to extract phase-property relationships from a corpus of full-text materials papers. ...

Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Scientific Text with LLMs
Digital Discovery


... Advancements in NLP, notably the use of large language models such as GPT-3, have revolutionized the structuring of unstructured data for misinformation analysis. Dunn et al. demonstrated that complex scientific information could be extracted from GPT-3, albeit cautioning about the risks of data 'hallucination' [26]. Agrawal et al. [27] showcased GPT-3's effectiveness in clinical information extraction, outperforming existing baselines even without domain-specific training. ...

Structured information extraction from complex scientific text with fine-tuned large language models

... As mentioned earlier, data selection and labeling are difficult in such models due to the absence of unsynthesizable crystal compounds. The second group of studies aims to develop models for predicting synthesis routes or reactions (e.g., solid-state, sol-gel, or solution-hydrothermal, precipitation), synthesis procedures, synthesis conditions (e.g., temperatures, times), or synthesis precursors or reactants [37][38][39][40][41][42][43][44][45][46] . These studies encompass a range of approaches, from data-driven learning of materials synthesis information using natural language processing of existing scientific literature [38][39][40][41][42][43][44] to the development of graph-based networks based on thermodynamic and kinetic data (i.e., physics-informed) [46][47][48][49] . ...

Machine-Learning Rationalization and Prediction of Solid-State Synthesis Conditions

Chemistry of Materials

... These tasks can be carried out by supervised machine learning algorithms or rule-based methods. The supervised machine learning NER approach was deployed to extract materials synthesis parameters [12], polymer names [13], general materials information [14] and doping procedures of materials [15] from literature. Similar models were applied to RE to extract synthesis parameters [16]. ...

Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science

Patterns

... Although these general-purpose models exhibited strong performance, the distributional shift of vocabulary led to sub-optimal performance on domain-specific natural language understanding and generation tasks (Beltagy et al., 2019). Following this observation, several domain-specific LLMs such as SCIBERT (Beltagy et al., 2019), BIOBERT, MATBERT (Walker et al., 2021), BATTERYBERT (Huang and Cole, 2022) and SCHOLARBERT (Hong et al., 2023) were developed with the goal of improving accuracy on in-domain NLP tasks (Araci, 2019; Wu et al., 2023). ...

The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science
  • Citing Article
  • January 2021

SSRN Electronic Journal

... The evaluation of the TE performance in semiconductors also depends on the scattering mechanism of carriers; hence the current work considers three crucial types of carrier scattering: ADP scattering, IMP scattering, and POP scattering. In the literature, we see that IMP and POP scattering do have a more significant effect on electronic transport in terms of carrier lifetime over the generally considered ADP scattering based on deformation theory [75,76]. Therefore, these metrics are drawn for both electron (n-type) and hole (p-type) doping in Fig. 6. ...

Experimental validation of high thermoelectric performance in RECuZnP2 predicted by high-throughput DFT calculations
  • Citing Article
  • January 2021

Materials Horizons

... It is therefore no surprise that machine learning (ML) has gathered substantial interest as a means to develop efficient surrogate models for the prediction of elastic properties. In a nutshell, state-of-the-art ML models for elastic properties encode compositional information [19][20][21] and/or structural information [20][21][22][23] in a material as feature vectors and then map them to a target using some regression algorithms. This approach is adopted in many existing works for learning elastic properties (e.g., for alloys [24][25][26][27] and polycrystals [28,29]) and related atomic properties like stress and energy fields [30]. ...

Author Correction: Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm

npj Computational Materials

... The benchmark unifies four previously reported representations and five new representations we propose in this work, covering many relevant inductive biases. The MatText benchmark is based on open-source datasets and established materials informatics tooling (Dunn et al., 2020; Ong et al., 2013) and can be easily applied to test other systems. • MatText Representations: To enable the MatText benchmark, we also develop a software package to convert geometric representations of materials into text-based representations. ...

Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm

npj Computational Materials

... For the former, standard computational Molecular Operating Environment (MOE) 18,19 chemical descriptors were used, while descriptors relying on a three-dimensional shape of the investigated molecule were excluded to avoid adding any potential ambiguities to the descriptor space. Crystal-level descriptors, as implemented in a Python library, [20][21][22] included atomic orbital information, i.e., energies of highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO), orbital energies, types of and distances between atomic sites, and fractions of nearest neighbors for each atomic and bond type. ...

Gapped metals as thermoelectric materials revealed by high-throughput screening
  • Citing Article
  • August 2020

Journal of Materials Chemistry A