
Informed Chemical Classification of Organophosphorus Compounds via Unsupervised Machine Learning of X-ray Absorption Spectroscopy and X-ray Emission Spectroscopy

Preprints and early-stage research may not have been peer reviewed yet.

Abstract

We analyze an ensemble of organophosphorus compounds to form an unbiased characterization of the information encoded in their X-ray absorption near edge structure (XANES) and valence-to-core X-ray emission spectra (VtC-XES). Data-driven emergence of chemical classes via unsupervised machine learning, specifically cluster analysis in the Uniform Manifold Approximation and Projection (UMAP) embedding, finds spectral sensitivity to coordination, oxidation, aromaticity, intramolecular hydrogen bonding, and ligand identity. Subsequently, we implement supervised machine learning via Gaussian Process classifiers to identify confidence in predictions which match our initial qualitative assessments of clustering. The results further support the benefit of utilizing unsupervised machine learning as a precursor to supervised machine learning.
Samantha Tetef (+, 1),
Vikram Kashyap (+, 1),
Alexandra Velian (2),
Niranjan Govind (3),
Gerald T. Seidler (1, *)
(+) Co-first authors
(1) Department of Physics, University of Washington, Seattle WA 98195, USA
(2) Department of Chemistry, University of Washington, Seattle WA 98195, USA
(3) Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory,
Richland, Washington 99352, USA
Corresponding Author
(*) seidler@uw.edu
KEYWORDS X-ray absorption fine structure, valence-to-core X-ray emission spectroscopy,
Gaussian Process, UMAP, unsupervised machine learning.
The information content in any spectroscopy method is constrained by the lossiness of
the underlying quantum mechanics that connects atomic-scale structure and dynamics to
experimental observables. Further limitations to the sensitivity of spectroscopy techniques often
include the inherent nonlinear or stochastic responses of the experimental probe. These facts
constrain our ability to correlate physical measurements, e.g., spectral features, to desired
microscopic properties. Thus, the use of data science and machine learning (ML) in
spectroscopy, with applications across the physical sciences, has grown rapidly 1-5. These
data-driven models can frequently disentangle and infer patterns from lossy measurements as
well as provide insight into the information encoded in spectra.
In general, supervised ML studies across a wide range of spectroscopies target either
predicting properties from spectra or correlating specific properties of interest to spectral features
6. This necessarily assumes that sufficient information is, in fact, encoded in spectra; otherwise,
ML models will correlate spurious features to requested properties. This detail of encoded
information is often addressed by hand-selecting a targeted training domain which depends
heavily on prior knowledge 7. However, issues arise if the training domain is too small or biased.
First, if the training domain is too small, the model will be unable to generalize well beyond its
specialized scope, which violates the essential assumption that the training and test data are
sampled from the same distribution. Second, although some bias is essential for any machine
learning model 8, unwanted bias, especially from unrepresentative data, silently undermines
the reliability of inferences and has led to contemporary ethical concerns 9-12.
In an effort to combat unwanted bias and to improve generalizability to complex
datasets, this study demonstrates the value of the pipeline shown in Figure 1. The pipeline
validates encoded information via unsupervised machine learning, i.e., cluster analysis on a
reduced-dimensional embedding of the spectra, before passing either the embedding or the
original spectra, selected as an unbiased training (sub)set, to a supervised machine learning
model. By adding steps (3) and (4) to a typical ML pipeline, it validates spectral sensitivity to
the properties requested during supervised predictions and thereby reduces implicit biases and
spurious correlations.
Figure 1 Flowchart of an analysis framework that uses unsupervised machine learning (such as
cluster analysis) as a precursor to predictions on spectra via supervised machine learning.
We utilize this pipeline for a spectroscopy method that has seen an ongoing exploration
of ML applications: X-ray absorption spectroscopy (XAS) 13-31. XAS is most commonly used in
chemistry, biology, and materials science to investigate the element-specific local coordination
environment and electronic structure, with applications including energy storage 32, 33, catalysis
34, and photochemical dynamics 35. XAS, which includes both X-ray absorption near-edge
structure (XANES) and extended X-ray absorption fine structure (EXAFS), probes the
unoccupied electronic states of the excited state of a chosen atomic species. Conversely,
relaxation to fill the core hole results in either nonradiative (Auger) or radiative processes. The
latter results in the emission of X-ray fluorescence that can be finely characterized by X-ray
emission spectroscopy (XES) for insight into the occupied electronic states 36-38. Often discussed
as complementary to XANES in information content, valence-to-core XES (VtC-XES) is
produced when electrons de-excite from the valence shell to fill the core hole, giving direct
information about occupied electronic states involved in bonding. While XAS and XES have
traditionally been synchrotron-based methods, we note that access to them, including VtC-XES,
is now being steadily broadened by a renaissance of lab-based spectrometers 39-41, including in
studies of sufficient scale for data science methods 42.
In the first study to utilize ML in XAS, Timoshenko, et al. 13 predicted coordination from
XANES spectra using a neural network, while Zheng, et al. 15 also predicted coordination but
used a random forest model. Torrisi, et al. 27 likewise used a random forest model, in their
case to correlate polynomial fitting parameters of spectra to properties like bond distance.
Other works utilizing machine learning in XAS include a XANES matching algorithm 16,
hierarchical clustering on spectra 17, and use of an autoencoder to correlate coordination to a
reduced-dimensional representation of spectra 18. Most of these studies assumed that the
desired information was in fact encoded in spectra, largely because the training datasets were
hand-crafted to be relevant. However, our pipeline, via the unsupervised machine learning
precursor, allows for explorative and unbiased refinement of chemical descriptors, a step that
we propose is both necessary and likely sufficient when addressing much more complex datasets.
The present study is prompted by our recent work 43 that compared the variance and
information content of sulfur K-edge XANES to VtC-XES spectra for sulforganics. We
found that nonlinear dimensionality reduction algorithms, a subset of unsupervised ML, provided
an effective way to extract features and thus important chemical information encoded in spectra.
Moreover, our results exemplified the benefits of utilizing unsupervised ML to mold and
understand the full potential of supervised ML analysis 44.
Here, we investigate the information content and sensitivity of phosphorus K-edge
XANES and VtC-XES in a more complex chemical system, organophosphorus compounds,
and indeed find sensitivity to a wider range of chemical properties, including coordination,
oxidation, aromaticity, intramolecular hydrogen bonding, and ligand identity. The dataset of
spectra we analyze is calculated from molecular structures gathered from the PubChem 45
database using moldl, a Python module we have written to aid in collecting and managing
molecular structure datasets. moldl is open-source and freely available to anyone. See the SI for
more details. For the rest of this paper, we will refer to the phosphorus K-edge XANES and VtC-
XES as just XANES and VtC-XES, respectively, for brevity.
Organophosphorus compounds have much higher total variance than sulforganics, as well
as higher variance within the same bonding geometry. We can therefore tune the input domain to
account for these highly variant structures, allowing us to understand the sensitivity of these
spectra to a wider range of properties. In addition, we can find, in an unbiased way, the extent of
the information that may be extracted using dimensionality reduction algorithms, especially
when confined to very limited dimensions. These explorations allow for full utilization of real
spectral information during supervised ML predictions.
To this end, we utilize Uniform Manifold Approximation and Projection (UMAP) 46 for
dimensionality reduction, which allows us to develop chemical classes by examining clustering
of spectra in a two-dimensional embedding. UMAP is a nonlinear embedding similar to t-
distributed Stochastic Neighbor Embedding (t-SNE) 47, which was used in our recent work 43 to
extract chemical classes. UMAP has additional benefits compared to t-SNE, such as being
parametric and preserving global structure, which allows for future data compression as well as
interpretation of overall global similarities. These advantages have led to its recent popularity,
such as in single cell RNA sequencing (scRNA-seq) data analysis 48, but it has not yet seen use in
XAS analysis.
To begin, heuristically one expects coordination to yield the strongest distinguishing
feature between spectra, specifically the distinction between tricoordinate phosphorus and
tetracoordinate phosphorus. Not only do these coordination geometries have different hybridized
orbital character, but they are often a proxy for oxidation state. In organophosphorus compounds
with tricoordinate phosphorus centers, the phosphorus is typically in a 3+ oxidation state,
whereas compounds with tetracoordinate phosphorus centers usually have the phosphorus in a 5+
oxidation state. We chose compounds with varying numbers of oxygens bonded to phosphorus
within these two coordination configurations to further vary the effective charge on the
phosphorus. The class-averaged VtC-XES and XANES spectra for each
tricoordinate and tetracoordinate phosphorus class are shown in Fig. S1. We then
applied UMAP to the VtC-XES and XANES spectra to create a two-dimensional embedding of
the ensemble. The results are color-coded based on whether the compound includes tricoordinate
phosphorus or tetracoordinate phosphorus, as shown in Figure 2.
Individual classes within each coordination are shown in columns A and B. Additionally,
all R groups are restricted to carbon-based groups (e.g., alkyl or aryl chains). For phosphates
only (which we will explore later), an R group bound to an oxygen may instead be a hydrogen,
forming a hydroxyl group. As expected, coordination distinguishes most of the groupings of
the compounds, with a handful of outliers.
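The heuristic classes discussed above can be written as a simple lookup from coordination number and the number of oxygen atoms bonded to phosphorus. The sketch below is illustrative only: the class names follow common organophosphorus nomenclature and are an assumption here, not necessarily the exact labels used in Table S1, and the function name `heuristic_class` is hypothetical.

```python
# Illustrative lookup: (coordination number, number of O bonded to P) -> class name.
# Names follow common organophosphorus nomenclature (an assumption; the paper's
# own labels are listed in its Table S1).
HEURISTIC_CLASSES = {
    (3, 0): "phosphine",
    (3, 1): "phosphinite",
    (3, 2): "phosphonite",
    (3, 3): "phosphite",
    (4, 1): "phosphine oxide",
    (4, 2): "phosphinate",
    (4, 3): "phosphonate",
    (4, 4): "phosphate",
}


def heuristic_class(coordination: int, n_oxygen: int) -> str:
    """Return the assumed class name, or raise ValueError if none is defined."""
    try:
        return HEURISTIC_CLASSES[(coordination, n_oxygen)]
    except KeyError:
        raise ValueError(
            f"no class defined for coordination={coordination}, n_oxygen={n_oxygen}"
        ) from None


if __name__ == "__main__":
    print(heuristic_class(3, 0))
    print(heuristic_class(4, 4))
```

Labels of this kind are what the color-coding in the embeddings is based on; the embedding itself is computed from the spectra alone.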
Figure 2 UMAP representation of VtC-XES (top) and XANES (bottom), color-coded by
coordination. R1, R2, and R3 are defined to be carbon-based aryl or alkyl chains, with only
phosphates allowed to have R1 and R2 as H atoms.
It follows that there are chemically relevant sub-groupings within each coordination.
Figure 3 shows the embedding color-coded within each of the tri- and tetra-coordinate classes
based on the number of oxygens bonded to the phosphorus. We expected effective charge of the
phosphorus to have the biggest impact on both the VtC-XES and XANES spectra. For the VtC-
XES, the ligand peaks (the small low-energy peak in Fig. S1) will increase in both energy and
intensity with an increase in phosphorus oxidation. From a molecular orbital perspective, this
trend arises from a larger overlap between the ligand valence orbital and the phosphorus 3p orbital
(valence shell). In general, this feature (which also changes with different ligand symmetries and
orientation) is why VtC-XES is so strongly sensitive to ligand identity 49. For the XANES
spectra, an increase in the oxidation of the phosphorus, i.e., the number of oxygen ligands within
a coordination, will cause a blueshift of the absorption edge, also demonstrated by the average
spectra in Fig. S1.
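The blueshift described above can be quantified by locating the absorption edge in each spectrum. The following is a minimal sketch, assuming the edge is defined as the energy of the steepest rise in absorption (one common convention; the text does not state which definition was used). It uses a synthetic arctangent step rather than real spectra, and the function names and the 2145 eV and 2150 eV edge positions are purely illustrative.

```python
import math


def arctan_edge(energies, e0, width=1.0):
    """Synthetic absorption step centred at e0 (illustrative, not a real spectrum)."""
    return [0.5 + math.atan((e - e0) / width) / math.pi for e in energies]


def edge_energy(energies, intensities):
    """Estimate the edge as the midpoint of the grid interval with the steepest rise."""
    best_slope = float("-inf")
    best_energy = None
    for i in range(len(energies) - 1):
        slope = (intensities[i + 1] - intensities[i]) / (energies[i + 1] - energies[i])
        if slope > best_slope:
            best_slope = slope
            best_energy = 0.5 * (energies[i] + energies[i + 1])
    return best_energy


if __name__ == "__main__":
    grid = [2140.0 + 0.1 * i for i in range(201)]  # 2140 to 2160 eV in 0.1 eV steps
    lower_oxidation = arctan_edge(grid, e0=2145.0)
    higher_oxidation = arctan_edge(grid, e0=2150.0)
    shift = edge_energy(grid, higher_oxidation) - edge_energy(grid, lower_oxidation)
    print(f"edge shift: {shift:.2f} eV")
```

The estimate is only as fine as the energy grid, so real spectra with noise would typically need smoothing or interpolation before differentiation.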
Figure 3 UMAP representation of VtC-XES (top) and XANES (bottom) for tricoordinate
phosphorus (A) and tetracoordinate phosphorus (B) compounds, color-coded by number of
oxygens bonded to the phosphorus within each coordination.
Note that the phosphates are segregated from the other tetracoordinate phosphorus
compounds and seem to sub-cluster as well. This observation brings us to our next hypothesis
that VtC-XES and XANES are both sensitive to ligand identity. As stated earlier, VtC-XES is
highly sensitive to ligand identity, observed by changes in the ligand peak feature. Again,
because the absorption edge of a XANES spectrum shifts with oxidation, the electronegativity of
ligands will cause the biggest spectral change. However, even for ligands with approximately the
same electronegativity, different phase shifts and cross sections cause finer changes to the
XANES spectra.
To systematically probe the effect of ligand identity, a series of tetracoordinate
phosphorus compounds (phosphates) was evaluated in which one or two of the oxygen
substituents were replaced with sulfur atoms. Compared to oxygen, sulfur is significantly less
electronegative, with a Pauling electronegativity value near that of carbon and phosphorus 50.
Thus, these oxygen-to-sulfur ligand substitutions likely cause the biggest spectral change by
adjusting the effective charge on the phosphorus. The resulting clusters are shown in Figure 4.
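The electronegativity argument above can be made concrete with a short calculation. This sketch uses standard tabulated Pauling electronegativities (the values are supplied here, not taken from the paper) and a hypothetical helper, `polarity_vs_phosphorus`, to compare how strongly each ligand atom is expected to draw charge from phosphorus.

```python
# Standard tabulated Pauling electronegativities (values supplied for illustration).
PAULING = {"H": 2.20, "C": 2.55, "O": 3.44, "P": 2.19, "S": 2.58}


def polarity_vs_phosphorus(ligand: str) -> float:
    """Electronegativity difference between a ligand atom and phosphorus."""
    return PAULING[ligand] - PAULING["P"]


if __name__ == "__main__":
    for atom in ("O", "S", "C"):
        print(f"P-{atom}: {polarity_vs_phosphorus(atom):+.2f}")
```

On this scale the P-O difference is several times larger than the P-S or P-C difference, which is consistent with oxygen-to-sulfur substitution producing a large change in the effective charge on phosphorus.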
Figure 4 UMAP representation of VtC-XES (left) and XANES (right) for compounds with
sulfur ligands, color-coded by number of sulfurs.
As expected, the different ligand identities contribute to cluster separations. The
VtC-XES embedding also clearly has an outlier: the orange phosphorothioate in the red
dithiophosphate cluster at the bottom right of that figure. Chemically, that compound
(PubChem CID 104781) is structurally different from the others because the oxygens form one
edge of a carbon tetrahedrane. Thus, UMAP clearly identifies chemical outliers.
We then analyzed whether the spectra would be sensitive to substitutions of R groups (if
bonded to an oxygen) with a hydrogen atom, thus forming hydroxyl groups, as shown in Figure
5. Here, we have taken phosphinate and phosphonate as starting points, and consecutively
replaced O-R groups with OH groups. In general, this distinction seems to be better illuminated
by the VtC-XES spectra than the XANES (which is shown in Fig. S5), as the clustering in the
VtC-XES is suggestive of a sensitivity to hydroxyl groups. However, Figure 5 also exemplifies
that first-nearest neighbors, e.g., the oxygen ligands directly bonded to the phosphorus, likely
cause the biggest spectral changes and thus are the biggest contributing factor to clustering,
which is consistent with our earlier observations.
Figure 5 UMAP representation of the VtC-XES of compounds with consecutively more R
groups (if bonded to an oxygen) replaced with an H atom (to create hydroxyl groups), color-
coded by chemical class.
In the above discussion, we have motivated our classes by important chemical properties
that we heuristically expected to yield the biggest spectral differences. However, even within this
chemically driven framework, there are sub-clusters within our heuristic chemical classes which
are instead emergent from UMAP. For example, we found that sub-clustering of the phosphate
chemical class (exemplified by the multiple separate sub-clusters in Figures 3 and 4) was caused
by unexpected variations in the secondary substituent (atoms bound to oxygens, not directly to
phosphorus), indicating that XANES spectra are sensitive to even more subtle details than
anticipated.
Let us examine this sub-division of the phosphates, specifically in the UMAP embedding
of their XANES spectra. Applying UMAP to just the phosphates, we obtain the embedding shown
in Figure 6, in which the phosphates are grouped into four clusters determined by the DBSCAN 51
clustering algorithm: I, II, III, and IV. The average spectrum for each cluster is shown at the
bottom and the common structural motifs for each cluster are shown to the right.
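To illustrate how density-based clustering assigns labels to points in a two-dimensional embedding, the following is a minimal pure-Python sketch of the DBSCAN procedure. It is not the implementation used in this work; the `eps` and `min_samples` values and the toy coordinates are arbitrary assumptions, since the text does not report the settings used. A real analysis would more likely rely on an existing library implementation.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one label per point, with NOISE (-1) for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


if __name__ == "__main__":
    # Toy 2D "embedding": two tight groups and one isolated point.
    embedding = [
        (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
        (10.0, 0.0),
    ]
    print(dbscan(embedding, eps=0.3, min_samples=3))
```

Because DBSCAN does not require the number of clusters in advance and can flag isolated points as noise, it suits embeddings in which the number of chemically meaningful sub-groups is not known beforehand.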
In Cluster I, 77% of the compounds have two alkyl R groups, with the third R group either
an alkyl or an aryl ring. This distinguishes Cluster I from Clusters II to IV, which instead
typically have two R groups as H atoms rather than carbon-based groups. Cluster II is the largest
sub-cluster; 94% of its compounds have two hydroxyl groups bonded to the phosphorus and
an alkyl chain as the last R group. These two clusters are the most distinct.
On the other hand, Clusters III and IV are similar in composition. Cluster III consists
of compounds in which the third R group is (a) an alkyl ring, i.e., a cycloalkane (36%), (b) an
aromatic ring (23%), or (c) a group that takes part in intramolecular hydrogen bonding with one
of the hydroxyl groups bonded to the phosphorus. Cluster IV compounds are structurally very
similar to Cluster III compounds, even though their spectra are distinct. However, 54% of
Cluster IV compounds have an aromatic ring as their third R group. All compounds in Clusters I
to IV can be viewed in Figs.
S10 to S13. For some example compounds in each cluster along with their spectra and structure,
see Figs. S6 to S9. Additionally, given the linear nature of Clusters I, III, and IV in the UMAP
embedding, we tested the correlation between the embedding location and the energy of the
absorption edge, as demonstrated in Fig. S14, and found no strong correlation. This further
supports the nonlinear nature of spectra and the idea that spectral fingerprints in complex
datasets do not correlate solely to a single high-variance property like the absorption edge, but
rather a combination of properties.
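A correlation check of the kind described above can be sketched with a plain Pearson coefficient between one embedding coordinate and the edge energy of each spectrum. The choice of Pearson correlation is an assumption (the text does not specify the statistic used for Fig. S14), and the function name `pearson_r` and the numbers in the demo are placeholders, not data from this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values: position along a cluster in the embedding vs. edge energy (eV).
    position = [0.0, 1.0, 2.0, 3.0, 4.0]
    edge_ev = [2152.1, 2151.8, 2152.3, 2151.9, 2152.2]
    print(f"r = {pearson_r(position, edge_ev):.2f}")
```

Pearson's r only measures linear association, so a value near zero does not by itself rule out a nonlinear dependence on the edge energy.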
Figure 6 UMAP representation of XANES of phosphates, color-coded by sub-clusters. Cluster-
averaged spectra and a summary structural motif for each cluster are also shown.
Taken en masse, these results, independent of the specific dimensionality reduction
algorithm used, show the extent to which chemically-relevant information is, or is not, encoded
by the quantum mechanics involved in XANES and VtC-XES. As to the specific algorithm,
UMAP can be used iteratively as more data is collected and thus has the potential to show
evolution through the domain space, similar to the latent space of a variational autoencoder
(VAE) 52. This property facilitates real-time analysis of high-throughput experiments. Finally,
and of key importance here, UMAP can generate embeddings of spectra that can be used for
unbiased refinement of the training data set as well as for preprocessing before supervised
ML predictions.
The most common use of supervised ML in X-ray spectroscopy is to predict numerical
properties, such as bond length or coordination, from XANES spectra. Here, we instead predict
chemical classes from both VtC-XES and XANES spectra. Moreover, we predict these classes
from a five-dimensional UMAP representation of the spectra instead of from the original spectra
themselves. Such preprocessing through dimensionality reduction can help separate inherently
correlated and nonlinear spectral features 44 as well as greatly reduce both the computational cost
and the effect of spectral noise.
Furthermore, we used a Gaussian Process (GP) in order to incorporate prior knowledge
into our models and generate an informed predictor 53. A GP is a non-parametric kernel method
that formally incorporates Bayes' rule into the model, which not only allows for priors to be
specified during training, but also allows for a probabilistic interpretation of the results. This
probability gives uncertainty estimates, or conversely confidence, of the predictions. We note
that one of the biggest downsides of a GP is that it scales poorly, which is another reason why
applying a nonlinear dimensionality reduction routine like UMAP beforehand can transform this
problem into a computationally tractable one.
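For reference, the probabilistic output and the scaling mentioned above can be summarized in the standard textbook form of binary GP classification. This is a generic formulation, not the specific model configuration of this work: the kernel k, the link function, and the approximate inference scheme are left unspecified here because the text does not state them.

```latex
% Latent function with a GP prior; k is the kernel (covariance function).
f(\mathbf{x}) \sim \mathcal{GP}\bigl(0,\; k(\mathbf{x}, \mathbf{x}')\bigr)

% Class probability for a test point x_*, with sigma a sigmoid link function.
p(y_* = 1 \mid \mathbf{x}_*, X, \mathbf{y})
  = \int \sigma(f_*)\, p(f_* \mid \mathbf{x}_*, X, \mathbf{y})\, \mathrm{d}f_*

% Training requires factorizing the n x n kernel matrix K, K_{ij} = k(x_i, x_j):
\text{cost} = \mathcal{O}(n^3) \ \text{time}, \qquad \mathcal{O}(n^2) \ \text{memory}
```

Here n is the number of training spectra. For common stationary kernels, each kernel evaluation costs O(d) in the input dimension d, so working in a low-dimensional embedding reduces the cost of building K, while the cubic term depends on n.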
The results of training a GP on each of the five classification schemes we developed (see
Table S1), namely coordination, number of oxygen ligands, phosphate subcluster, number of
sulfur ligands, and number of hydroxyl ligands, are shown in Figure 7, which reports the average
accuracy score on the test set together with the average probability assigned to the
prediction, i.e., the confidence score. There is a clear correlation between the average
accuracy and confidence, indicating that
the GP is, in fact, properly modeling uncertainty of predictions.
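The two quantities compared above can be computed from any classifier that returns class probabilities. The sketch below assumes that "confidence" means the mean probability assigned to the predicted class, which matches the description in the text but may differ in detail from the authors' calculation. The function name and the toy probabilities are illustrative.

```python
def accuracy_and_confidence(probabilities, true_labels):
    """Return (mean accuracy, mean probability of the predicted class).

    probabilities: list of per-sample dicts mapping class label -> probability.
    true_labels:   list of the corresponding true class labels.
    """
    if len(probabilities) != len(true_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    n_correct = 0
    total_confidence = 0.0
    for probs, truth in zip(probabilities, true_labels):
        predicted = max(probs, key=probs.get)  # most probable class
        n_correct += predicted == truth
        total_confidence += probs[predicted]
    n = len(true_labels)
    return n_correct / n, total_confidence / n


if __name__ == "__main__":
    # Toy predictions for a two-class scheme (values are made up).
    predicted_probs = [
        {"tri": 0.9, "tetra": 0.1},
        {"tri": 0.2, "tetra": 0.8},
        {"tri": 0.6, "tetra": 0.4},
        {"tri": 0.3, "tetra": 0.7},
    ]
    truth = ["tri", "tetra", "tetra", "tetra"]
    accuracy, confidence = accuracy_and_confidence(predicted_probs, truth)
    print(f"accuracy = {accuracy:.2f}, confidence = {confidence:.2f}")
```

When a model's uncertainty is well modeled, schemes with lower accuracy should also show lower mean confidence, which is the relationship examined in Figure 7.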
Figure 7 Gaussian Process Classifier prediction accuracies with corresponding average
probability (“confidence”) for all chemically driven and cluster-driven classification schemes.
Finally, the accuracies and confidences of the predictions across the VtC-XES and
XANES data match what we observed in our two-dimensional UMAP figures. This is clearly
demonstrated in the hydroxyl ligand and phosphate subcluster classification schemes, where the
XANES and VtC-XES, respectively, cluster poorly by these schemes, and the low corresponding
GP confidence reflects this. Overall, these results further validate that visualizing data via a
dimensionality reduction algorithm like UMAP reflects the extractable information content and
can properly inform classes to be used for supervised ML.
By utilizing UMAP and analyzing the resulting clustering in a two-dimensional
embedding of VtC-XES and XANES spectra of an ensemble of organophosphorus compounds,
we noticed sensitivity to coordination and ligand identity (specifically by distinguishing number
of oxygen ligands, sulfur ligands, and hydroxyl groups). Additionally, the XANES was clearly
more sensitive to phosphate sub-groupings (which resulted from an unexpected, unintuitive
fingerprint). Together, these results culminate in a valuable analysis framework: (1)
applying nonlinear dimensionality reduction routines and cluster analysis to check for both
heuristic chemical sensitivities and emergent ones present in spectra, (2) applying dimensionality
reduction methods like UMAP before querying supervised ML models, and (3) utilizing models
that incorporate prior knowledge, such as a Gaussian Process, to estimate uncertainty or
confidence of these predictions on the clustering-informed classes. Furthermore, this framework,
visualized in Figure 1, is broadly applicable: it can easily be extended both to other systems
and to other one-dimensional spectroscopies, providing a way to validate predictions instead of
relying solely on the initial construction of an appropriate training dataset.
ASSOCIATED CONTENT
Supporting Information.
The following files are available free of charge:
Computational Methods (docx)
Figure S1 Class averages of spectra with different coordination (png)
Figure S2 Scree plot of VtC-XES and XANES data (png)
Figure S3 PCA reconstruction of VtC-XES spectra (png)
Figure S4 PCA reconstruction of XANES spectra (png)
Figure S5 UMAP representation of XANES with H atom substitutions (png)
Figure S6 Phosphate sub-cluster I example spectra (png)
Figure S7 Phosphate sub-cluster II example spectra (png)
Figure S8 Phosphate sub-cluster III example spectra (png)
Figure S9 Phosphate sub-cluster IV example spectra (png)
Figure S10 Phosphate sub-cluster I structures (png)
Figure S11 Phosphate sub-cluster II structures (png)
Figure S12 Phosphate sub-cluster III structures (png)
Figure S13 Phosphate sub-cluster IV structures (png)
Figure S14 Phosphate sub-clusters correlation (png)
Figure S15 3D UMAP visualizations (png)
Table S1 Classification table (docx)
AUTHOR INFORMATION
The authors declare no competing financial interests.
ACKNOWLEDGMENT
ST acknowledges funding from NRT-DESE: Data Intensive Research Enabling Clean
Technologies (DIRECT) under grant no. NSF #1633216 and acknowledges funding from NSF
CHE-1904437. VK acknowledges support from the Washington NASA Space Grant from the
Washington NASA Space Grant Consortium (WSGC). NG acknowledges support from the US
Department of Energy, Office of Science, Office of Basic Energy Sciences, Chemical Sciences,
Geosciences and Biosciences under Award No. KC-030105172685. AV acknowledges support
from the Research Corporation for Science Advancement through a Cottrell Scholars Award.
This research benefited from computational resources provided by the Environmental Molecular
Sciences Laboratory (EMSL), a DOE Office of Science User Facility sponsored by the Office of
Biological and Environmental Research and located at PNNL. PNNL is operated by Battelle
Memorial Institute for the United States Department of Energy under DOE Contract No. DE-
AC05-76RL1830. Additionally, this work was facilitated through the use of advanced
computational, storage, and networking infrastructure provided by the Hyak supercomputer
system and funded by the STF at the University of Washington.
REFERENCES
1. Butler, K. T.; Davies, D. W.; Cartwright, H.; Isayev, O.; Walsh, A., Machine learning
for molecular and materials science. Nature 2018, 559 (7715), 547-555.
2. Zhou, Z. Q.; He, Q. F.; Liu, X. D.; Wang, Q.; Luan, J. H.; Liu, C. T.; Yang, Y.,
Rational design of chemically complex metallic glasses by hybrid modeling guided machine
learning. npj Computational Materials 2021, 7 (1), 138.
3. Liu, Y.; Zhao, T. L.; Ju, W. W.; Shi, S. Q., Materials discovery and design using
machine learning. Journal of Materiomics 2017, 3 (3), 159-177.
4. Liu, Y.; Guo, B. R.; Zou, X. X.; Li, Y. J.; Shi, S. Q., Machine learning assisted
materials design and discovery for rechargeable batteries. Energy Storage Materials 2020, 31,
434-450.
5. Saal, J. E.; Kirklin, S.; Aykol, M.; Meredig, B.; Wolverton, C., Materials Design and
Discovery with High-Throughput Density Functional Theory: The Open Quantum Materials
Database (OQMD). JOM 2013, 65 (11), 1501-1509.
6. Meza Ramirez, C. A.; Greenop, M.; Ashton, L.; Rehman, I. u., Applications of machine
learning in spectroscopy. Applied Spectroscopy Reviews 2021, 56 (8-10), 733-763.
7. Gordon, D. F.; Desjardins, M., Evaluation and Selection of Biases in Machine Learning.
Machine Learning 1995, 20 (1-2), 5-22.
8. Wolpert, D. H.; Macready, W. G., No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation 1997, 1 (1), 67-82.
9. Alelyani, S., Detection and Evaluation of Machine Learning Bias. Applied Sciences 2021,
11 (14).
10. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A., A Survey on Bias
and Fairness in Machine Learning. ACM Computing Surveys 2021, 54 (6).
11. Pot, M.; Kieusseyan, N.; Prainsack, B., Not all biases are bad: equitable and inequitable
biases in machine learning and radiology. Insights Into Imaging 2021, 12 (1).
12. Hiemstra, A. M. F.; Cassel, T.; Born, M. P.; Liem, C. C. S., The promises and perils of
machine learning algorithms to reduce bias and discrimination in personnel selection procedures.
Gedrag en Organisatie 2020, 33 (4), 279-299.
13. Timoshenko, J.; Lu, D. Y.; Lin, Y. W.; Frenkel, A. I., Supervised Machine-Learning-
Based Determination of Three-Dimensional Structure of Metallic Nanoparticles. Journal of
Physical Chemistry Letters 2017, 8 (20), 5091-5098.
14. Timoshenko, J.; Frenkel, A. I., "Inverting" X-ray Absorption Spectra of Catalysts by
Machine Learning in Search for Activity Descriptors. ACS Catalysis 2019, 9 (11), 10192-10211.
15. Zheng, C.; Chen, C.; Chen, Y.; Ong, S. P., Random Forest Models for Accurate
Identification of Coordination Environments from X-Ray Absorption Near-Edge Structure.
Patterns 2020, 1 (2), 100013.
16. Zheng, C.; Mathew, K.; Chen, C.; Chen, Y. M.; Tang, H. M.; Dozier, A.; Kas, J. J.;
Vila, F. D.; Rehr, J. J.; Piper, L. F. J.; Persson, K. A.; Ong, S. P., Automated generation and
ensemble-learned matching of X-ray absorption spectra. npj Computational Materials 2018, 4,
12.
17. Kiyohara, S.; Miyata, T.; Tsuda, K.; Mizoguchi, T., Data-driven approach for the
prediction and interpretation of core-electron loss spectroscopy. Scientific Reports 2018, 8 (1),
13548.
18. Routh, P. K.; Liu, Y.; Marcella, N.; Kozinsky, B.; Frenkel, A. I., Latent Representation
Learning for Structural Characterization of Catalysts. The Journal of Physical Chemistry Letters
2021, 12 (8), 2086-2094.
19. Aarva, A.; Deringer, V. L.; Sainio, S.; Laurila, T.; Caro, M. A., Understanding X-ray
Spectroscopy of Carbonaceous Materials by Combining Experiments, Density Functional
Theory, and Machine Learning. Part I: Fingerprint Spectra. Chemistry of Materials 2019, 31
(22), 9243-9255.
20. Carbone, M. R.; Yoo, S.; Topsakal, M.; Lu, D., Classification of local chemical
environments from x-ray absorption spectra using supervised machine learning. Physical Review
Materials 2019, 3 (3), 033604.
21. Carbone, M. R.; Topsakal, M.; Lu, D.; Yoo, S., Machine-Learning X-Ray Absorption
Spectra to Quantitative Accuracy. Physical Review Letters 2020, 124 (15), 156401(6).
22. Liu, Y.; Marcella, N.; Timoshenko, J.; Halder, A.; Yang, B.; Kolipaka, L.; Pellin, M.
J.; Seifert, S.; Vajda, S.; Liu, P.; Frenkel, A. I., Mapping XANES spectra on structural
descriptors of copper oxide clusters using supervised machine learning. The Journal of Chemical
Physics 2019, 151 (16), 164201.
23. Martini, A.; Guda, S. A.; Guda, A. A.; Smolentsev, G.; Algasov, A.; Usoltsev, O.;
Soldatov, M. A.; Bugaev, A.; Rusalev, Y.; Lamberti, C.; Soldatov, A. V., PyFitit: The software
for quantitative analysis of XANES spectra using machine-learning algorithms. Computer
Physics Communications 2020, 250, 107064.
24. Miyazato, I.; Takahashi, L.; Takahashi, K., Automatic oxidation threshold recognition of
XAFS data using supervised machine learning. Molecular Systems Design & Engineering 2019,
4 (5), 1014-1018.
25. Guda, A. A.; Guda, S. A.; Martini, A.; Kravtsova, A. N.; Algasov, A.; Bugaev, A.;
Kubrin, S. P.; Guda, L. V.; Šot, P.; van Bokhoven, J. A.; Copéret, C.; Soldatov, A. V.,
Understanding X-ray absorption spectra by means of descriptors and machine learning
algorithms. npj Computational Materials 2021, 7 (1), 203.
26. Fang, Z.; Hu, W.; Wang, M.; Wang, R.; Zhong, S.; Chen, S., X-ray absorption
spectroscopy combined with machine learning for diagnosis of schistosomiasis cirrhosis.
Biomedical Signal Processing and Control 2020, 60, 101944.
27. Torrisi, S. B.; Carbone, M. R.; Rohr, B. A.; Montoya, J. H.; Ha, Y.; Yano, J.; Suram,
S. K.; Hung, L., Random forest machine learning models for interpretable X-ray absorption near-
edge structure spectrum-property relationships. npj Computational Materials 2020, 6 (1), 109.
28. Trejo, O.; Dadlani, A. L.; De La Paz, F.; Acharya, S.; Kravec, R.; Nordlund, D.;
Sarangi, R.; Prinz, F. B.; Torgersen, J.; Dasgupta, N. P., Elucidating the Evolving Atomic
Structure in Atomic Layer Deposition Reactions with in Situ XANES and Machine Learning.
Chemistry of Materials 2019, 31 (21), 8937-8947.
29. Rankine, C. D.; Madkhali, M. M. M.; Penfold, T. J., A Deep Neural Network for the
Rapid Prediction of X-ray Absorption Spectra. The Journal of Physical Chemistry A 2020, 124
(21), 4263-4270.
30. Rankine, C. D.; Penfold, T. J., Progress in the Theory of X-ray Spectroscopy: From
Quantum Chemistry to Machine Learning and Ultrafast Dynamics. The Journal of Physical
Chemistry A 2021, 125 (20), 4276-4293.
31. Kiyohara, S.; Tsubaki, M.; Mizoguchi, T., Learning excited states from ground states by
using an artificial neural network. npj Computational Materials 2020, 6 (1), 68.
32. Cuisinier, M.; Cabelguen, P.-E.; Evers, S.; He, G.; Kolbeck, M.; Garsuch, A.; Bolin,
T.; Balasubramanian, M.; Nazar, L. F., Sulfur Speciation in Li–S Batteries Determined by
Operando X-ray Absorption Spectroscopy. The Journal of Physical Chemistry Letters 2013, 4
(19), 3227-3232.
33. Asakura, D.; Hosono, E.; Niwa, H.; Kiuchi, H.; Miyawaki, J.; Nanba, Y.; Okubo, M.;
Matsuda, H.; Zhou, H.; Oshima, M.; Harada, Y., Operando soft x-ray emission spectroscopy of
LiMn2O4 thin film involving Li-ion extraction/insertion reaction. Electrochemistry
Communications 2015, 50, 93-96.
34. Zhou, Y.; Doronkin, D. E.; Zhao, Z.; Plessow, P. N.; Jelic, J.; Detlefs, B.;
Pruessmann, T.; Studt, F.; Grunwaldt, J.-D., Photothermal Catalysis over Nonplasmonic
Pt/TiO2 Studied by Operando HERFD-XANES, Resonant XES, and DRIFTS. ACS Catalysis
2018, 8 (12), 11398-11406.
35. Maiuri, M.; Garavelli, M.; Cerullo, G., Ultrafast Spectroscopy: State of the Art and Open
Challenges. Journal of the American Chemical Society 2020, 142 (1), 3-15.
36. Bunker, G., Introduction to XAFS: A Practical Guide to X-ray Absorption Fine Structure
Spectroscopy. Cambridge University Press: Cambridge, 2010.
37. Glatzel, P.; Bergmann, U., High resolution 1s core hole X-ray spectroscopy in 3d
transition metal complexes – electronic and structural information. Coordination Chemistry
Reviews 2005, 249 (1), 65-95.
38. de Groot, F., High-Resolution X-ray Emission and X-ray Absorption Spectroscopy.
Chemical Reviews 2001, 101 (6), 1779-1808.
39. Seidler, G. T.; Mortensen, D. R.; Remesnik, A. J.; Pacold, J. I.; Ball, N. A.; Barry, N.;
Styczinski, M.; Hoidn, O. R., A laboratory-based hard x-ray monochromator for high-resolution
x-ray emission spectroscopy and x-ray absorption near edge structure measurements. Review of
Scientific Instruments 2014, 85 (11), 113906.
40. Malzer, W.; Schlesiger, C.; Kanngießer, B., A century of laboratory X-ray absorption
spectroscopy – A review and an optimistic outlook. Spectrochimica Acta Part B: Atomic
Spectroscopy 2021, 177, 106101.
41. Zimmermann, P.; Peredkov, S.; Abdala, P. M.; DeBeer, S.; Tromp, M.; Müller, C.;
van Bokhoven, J. A., Modern X-ray spectroscopy: XAS and XES in the laboratory. Coordination
Chemistry Reviews 2020, 423, 213466.
42. Holden, W. M.; Jahrman, E. P.; Govind, N.; Seidler, G. T., Probing Sulfur Chemical and
Electronic Structure with Experimental Observation and Quantitative Theoretical Prediction of
Kα and Valence-to-Core Kβ X-ray Emission Spectroscopy. The Journal of Physical Chemistry A
2020, 124 (26), 5415-5434.
43. Tetef, S.; Govind, N.; Seidler, G. T., Unsupervised machine learning for unbiased
chemical classification in X-ray absorption spectroscopy and X-ray emission spectroscopy.
Physical Chemistry Chemical Physics 2021, 23 (41), 23586-23601.
44. Ceriotti, M., Unsupervised machine learning in atomistic simulations, between
predictions and understanding. The Journal of Chemical Physics 2019, 150 (15), 150901.
45. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.;
Thiessen, P. A.; Yu, B.; Zaslavsky, L.; Zhang, J.; Bolton, E. E., PubChem 2019 update:
improved access to chemical data. Nucleic Acids Research 2020, 47 (D1).
46. McInnes, L.; Healy, J.; Melville, J., UMAP: Uniform Manifold Approximation and
Projection for Dimension Reduction. arXiv 2020, (1802.03426).
47. van der Maaten, L.; Hinton, G., Visualizing Data using t-SNE. Journal of Machine
Learning Research 2008, 9, 2579-2605.
48. Pont, F.; Tosolini, M.; Fournie, J. J., Single-Cell Signature Explorer for comprehensive
visualization of single cell signatures across scRNA-seq datasets. Nucleic Acids Research 2019,
47 (21).
49. Rovezzi, M.; Glatzel, P., Hard x-ray emission spectroscopy: a powerful tool for the
characterization of magnetic semiconductors. Semiconductor Science and Technology 2014, 29, 023002.
50. Murphy, L. R.; Meek, T. L.; Allred, A. L.; Allen, L. C., Evaluation and Test of Pauling's
Electronegativity Scale. The Journal of Physical Chemistry A 2000, 104 (24), 5867-5871.
51. Hahsler, M.; Piekenbrock, M.; Doran, D., dbscan: Fast Density-Based Clustering with R.
Journal of Statistical Software 2019, 91 (1), 1-30.
52. Shrestha, A.; Mahmood, A., Review of Deep Learning Algorithms and Architectures.
IEEE Access 2019, 7, 53040-53065.
53. Rasmussen, C. E.; Williams, C. K. I., Gaussian Processes for Machine Learning. The
MIT Press: 2006.