Available via license: CC BY-NC-ND 4.0
Informed Chemical Classification of Organophosphorus
Compounds via Unsupervised Machine Learning of X-ray
Absorption Spectroscopy and X-ray Emission Spectroscopy
Samantha Tetef (+,1),
Vikram Kashyap (+,1),
Alexandra Velian (2),
Niranjan Govind (3),
Gerald T. Seidler (1,*)
(+) Co-first authors
(1) Department of Physics, University of Washington, Seattle WA 98195, USA
(2) Department of Chemistry, University of Washington, Seattle WA 98195, USA
(3) Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory,
Richland, Washington 99352, USA
Corresponding Author
(*) seidler@uw.edu
ABSTRACT
We analyze an ensemble of organophosphorus compounds to form an unbiased characterization
of the information encoded in their X-ray absorption near edge structure (XANES) and valence-
to-core X-ray emission spectra (VtC-XES). Data-driven emergence of chemical classes via
unsupervised machine learning, specifically cluster analysis in the Uniform Manifold
Approximation and Projection (UMAP) embedding, finds spectral sensitivity to coordination,
oxidation, aromaticity, intramolecular hydrogen bonding, and ligand identity. Subsequently, we
implement supervised machine learning via Gaussian Process classifiers to identify confidence in
predictions which match our initial qualitative assessments of clustering. The results further
support the benefit of utilizing unsupervised machine learning as a precursor to supervised
machine learning.
KEYWORDS X-ray absorption fine structure, valence-to-core X-ray emission spectroscopy,
Gaussian Process, UMAP, unsupervised machine learning.
The information content in any spectroscopy method is constrained by the lossiness of
the underlying quantum mechanics that connects atomic-scale structure and dynamics to
experimental observables. Further limitations to the sensitivity of spectroscopy techniques often
include the inherent nonlinear or stochastic responses of the experimental probe. These facts
constrain our ability to correlate physical measurements, e.g., spectral features, to desired
microscopic properties. Thus, the use of data science and machine learning (ML) in
spectroscopy, with applications across the physical sciences, has grown rapidly 1-5. These
data-driven models can frequently disentangle and infer patterns from lossy measurements as
well as provide insight into the information encoded in spectra.
In general, supervised ML studies across a wide range of spectroscopies target either
predicting properties from spectra or correlating specific properties of interest to spectral features
6. This necessarily assumes that sufficient information is, in fact, encoded in spectra; otherwise,
ML models will correlate spurious features to requested properties. This detail of encoded
information is often addressed by hand-selecting a targeted training domain which depends
heavily on prior knowledge 7. However, issues arise if the training domain is too small or biased.
First, if the training domain is too small, the model will be unable to generalize well beyond its
specialized scope, which violates the essential assumption that the training and test data are
sampled from the same distribution. Second, although some bias is essential for any machine
learning model 8, unwanted bias, especially from unrepresentative data, blindly undermines
reliability of inferences and has led to contemporary ethical concerns 9-12.
In the effort to combat unwanted bias as well as provide generalizability to complex
datasets, this study demonstrates the value of the pipeline exemplified in Figure 1, which
validates encoded information via unsupervised machine learning, i.e., cluster analysis on a
reduced-dimensional embedding of the spectra, before passing either the embedding or the
original spectra – selected as an unbiased training (sub)set – to a supervised machine learning
model. This pipeline removes implicit biases and spurious correlations by adding steps (3) and
(4) to a typical ML pipeline, which validate spectral sensitivity to properties requested during
supervised predictions.
Figure 1 Flowchart of an analysis framework that uses unsupervised machine learning (such as
cluster analysis) as a precursor to predictions on spectra via supervised machine learning.
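One simple way to quantify the validation step in this pipeline is to ask how cleanly the emergent clusters line up with a property one intends to predict. The sketch below is only an illustration under stated assumptions: the function name `cluster_purity`, the toy cluster assignments, and the toy class labels are hypothetical and are not taken from this work.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, class_labels):
    """Return {cluster_id: (majority_label, fraction_of_cluster)}."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)
    purity = {}
    for cid, labels in members.items():
        # Most common label in this cluster and how often it occurs.
        majority, count = Counter(labels).most_common(1)[0]
        purity[cid] = (majority, count / len(labels))
    return purity

# Hypothetical toy data: cluster assignments found in an embedding, and the
# chemical class that a supervised model would later be asked to predict.
clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2]
classes = ["tri", "tri", "tri", "tetra", "tetra", "tetra", "tri", "tetra", "tetra"]
result = cluster_purity(clusters, classes)
```

Clusters with a high majority fraction suggest that the spectra are sensitive to the requested property, whereas strongly mixed clusters suggest that a supervised model trained on that property may end up relying on spurious features.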
We utilize this pipeline for a spectroscopy method that has seen an ongoing exploration
of ML applications: X-ray absorption spectroscopy (XAS) 13-31. XAS is most commonly used in
chemistry, biology, and materials science to investigate the element-specific local coordination
environment and electronic structure, with applications including energy storage 32, 33, catalysis
34, and photochemical dynamics 35. XAS, which includes both X-ray absorption near-edge
structure (XANES) and extended X-ray absorption fine structure (EXAFS), probes the
unoccupied electronic states of the excited state of a chosen atomic species. Conversely,
relaxation to fill the core hole results in either nonradiative (Auger) or radiative processes. The
latter results in the emission of X-ray fluorescence that can be finely characterized by X-ray
emission spectroscopy (XES) for insight into the occupied electronic states 36-38. Often discussed
as complementary to XANES in information content, valence-to-core XES (VtC-XES) is
produced when electrons de-excite from the valence shell to fill the core hole, giving direct
information about occupied electronic states involved in bonding. While XAS and XES have
traditionally been synchrotron-based methods, we note that their access, including for VtC-XES,
is now being steadily augmented with a renaissance of lab-based spectrometers 39-41, including in
studies of sufficient scale for data science methods 42.
In the first study to utilize ML in XAS, Timoshenko, et al. 13 predicted coordination from
XANES spectra using a neural network, while Zheng, et al. 15 predicted coordination with a
random forest model instead. Torrisi, et al. 27 also used a random forest model, but to
correlate polynomial fitting parameters of spectra to properties like bond distance. Other works
utilizing machine learning in XAS include a XANES matching algorithm 16, hierarchical
clustering on spectra 17, and use of an autoencoder to correlate coordination to a reduced
dimensional representation of spectra 18. Most of these studies assumed desired information was
in fact encoded in spectra, largely because of hand-crafting relevant training datasets. However,
our pipeline, via the unsupervised machine learning precursor, allows for explorative and
unbiased refinement of chemical descriptors – a step that we propose is both necessary, and
likely sufficient, when addressing much more complex datasets.
The present study is prompted by our recent work 43 that compared the variance and
information content of sulfur K-edge XANES to VtC-XES Kβ spectra for sulforganics. We
found that nonlinear dimensionality reduction algorithms, a subset of unsupervised ML, provided
an effective way to extract features and thus important chemical information encoded in spectra.
Moreover, our results exemplified the benefits of utilizing unsupervised ML to mold and
understand the full potential of supervised ML analysis 44.
Here, we investigate the information content and sensitivity of phosphorus K-edge
XANES and VtC-XES Kβ in a more complex chemical system, organophosphorus compounds,
and indeed find sensitivity to a wider range of chemical properties, including coordination,
oxidation, aromaticity, intramolecular hydrogen bonding, and ligand identity. The dataset of
spectra we analyze is calculated from molecular structures gathered from the PubChem 45
database using moldl, a Python module we have written to aid in collecting and managing
molecular structure datasets. moldl is open-source and freely available to anyone. See the SI for
more details. For the rest of this paper, we will refer to the phosphorus K-edge XANES and VtC-
XES Kβ as just XANES and VtC-XES, respectively, for brevity.
Organophosphorus compounds have much higher total variance than sulforganics, as well
as higher variance within the same bonding geometry. We can therefore tune the input domain to
account for these highly variant structures, allowing us to understand the sensitivity of these
spectra to a wider range of properties. In addition, we can find, in an unbiased way, the extent of
the information that may be extracted using dimensionality reduction algorithms, especially
when confined to very limited dimensions. These explorations allow for full utilization of real
spectral information during supervised ML predictions.
To this end, we utilize Uniform Manifold Approximation and Projection (UMAP) 46 for
dimensionality reduction, which allows us to develop chemical classes by examining clustering
of spectra in a two-dimensional embedding. UMAP is a nonlinear embedding similar to t-
distributed Stochastic Neighbor Embedding (t-SNE) 47, which was used in our recent work 43 to
extract chemical classes. UMAP has additional benefits compared to t-SNE, such as being
parametric and preserving global structure, which allows for future data compression as well as
interpretation of overall global similarities. These advantages have led to its recent popularity,
for example in single-cell RNA sequencing (scRNA-seq) data analysis 48, but UMAP has not yet seen
use in XAS analysis.
To begin, heuristically one expects coordination to yield the strongest distinguishing
feature between spectra, specifically the distinction between tricoordinate phosphorus and
tetracoordinate phosphorus. Not only do these coordination geometries have different hybridized
orbital character, but they are often a proxy for oxidation state. In organophosphorus compounds
with tricoordinate phosphorus centers, the phosphorus is typically in a 3+ oxidation state,
whereas compounds with tetracoordinate phosphorus centers usually have the phosphorus in a 5+
oxidation state. We chose compounds with a diverse number of oxygens bonded to phosphorus
within these two coordination configurations to further vary the effective charge on the
phosphorus. The spectral averages for both the VtC-XES and XANES spectra for each
tricoordinate phosphorus and tetracoordinate phosphorus class are shown in Fig. S1. We then
applied UMAP to the VtC-XES and XANES spectra to create a two-dimensional embedding of
the ensemble. The results are color-coded based on whether the compound includes tricoordinate
phosphorus or tetracoordinate phosphorus, as shown in Figure 2.
Individual classes within each coordination are shown in columns A and B. Additionally,
all R groups are constrained to exclusively carbons (e.g., alkyl or aryl chains), and sometimes
hydrogens (when bound to the oxygen) to achieve hydroxyl groups, but only for phosphates
(which we will explore later). As expected, coordination distinguishes most of the groupings of
the compounds, with a handful of outliers.
Figure 2 UMAP representation of VtC-XES (top) and XANES (bottom), color-coded by
coordination. R1, R2, and R3 are defined to be carbon-based aryl or alkyl chains, with only
phosphates allowed to have R1 and R2 as H atoms.
It follows that there are chemically relevant sub-groupings within each coordination.
Figure 3 shows the embedding color-coded within each of the tri- and tetra-coordinate classes
based on the number of oxygens bonded to the phosphorus. We expected effective charge of the
phosphorus to have the biggest impact on both the VtC-XES and XANES spectra. For the VtC-
XES, the ligand peaks (the small low-energy peak in Fig. S1) will increase in both energy and
intensity with an increase in phosphorus oxidation. From a molecular orbital perspective, this
trend is from a larger overlap between the ligand valence orbital and the phosphorus 3p orbital
(valence shell). In general, this feature (which also changes with different ligand symmetries and
orientation) is why VtC-XES is so strongly sensitive to ligand identity 49. For the XANES
spectra, an increase in the oxidation of the phosphorus, i.e., the number of oxygen ligands within
a coordination, will cause a blueshift of the absorption edge, also demonstrated by the average
spectra in Fig. S1.
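The edge blueshift described here can be made concrete with a small numerical sketch. It assumes a toy arctangent edge model and defines the edge position as the energy where the normalized absorption first rises through 0.5. The model, the half-height convention, and the energies are illustrative assumptions, not the procedure or data of this work.

```python
import math

def edge_energy(energies, absorption, level=0.5):
    """Energy at which a normalized spectrum first rises through `level`."""
    for i in range(1, len(energies)):
        lo, hi = absorption[i - 1], absorption[i]
        if lo < level <= hi:
            # Linear interpolation between the two bracketing grid points.
            frac = (level - lo) / (hi - lo)
            return energies[i - 1] + frac * (energies[i] - energies[i - 1])
    raise ValueError("spectrum never crosses the requested level")

def toy_edge(energies, e0, width=1.0):
    """Hypothetical arctangent step centered at e0 (a stand-in for a real edge)."""
    return [0.5 + math.atan((e - e0) / width) / math.pi for e in energies]

grid = [2140.0 + 0.1 * i for i in range(301)]   # illustrative energy grid in eV
less_oxidized = toy_edge(grid, e0=2148.0)       # hypothetical edge position
more_oxidized = toy_edge(grid, e0=2152.0)       # hypothetical, shifted to higher energy
shift = edge_energy(grid, more_oxidized) - edge_energy(grid, less_oxidized)
```

A positive `shift` corresponds to a blueshift of the edge. In real spectra, pre-edge features and white-line intensity can complicate a single-threshold definition like this one.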
Figure 3 UMAP representation of VtC-XES (top) and XANES (bottom) for tricoordinate
phosphorus (A) and tetracoordinate phosphorus (B) compounds, color-coded by number of
oxygens bonded to the phosphorus within each coordination.
Note that the phosphates are segregated from the other tetracoordinate phosphorus
compounds and seem to sub-cluster as well. This observation brings us to our next hypothesis
that VtC-XES and XANES are both sensitive to ligand identity. As stated earlier, VtC-XES is
highly sensitive to ligand identity, observed by changes in the ligand peak feature. Again,
because the absorption edge of a XANES spectrum shifts with oxidation, the electronegativity of
ligands will cause the biggest spectral change. However, even for ligands with approximately the
same electronegativity, different phase shifts and cross sections cause finer changes to the
XANES spectra.
To systematically probe the effect of ligand identity, a series of tetracoordinate
phosphorus compounds (phosphates) were evaluated in which the oxygen substituents were
replaced with one or two sulfur atoms. Compared to oxygen, sulfur is significantly less
electronegative, with a Pauling electronegativity value near that of carbon and phosphorus 50.
Thus, these oxygen-to-sulfur ligand substitutions likely cause the biggest spectral change by
adjusting the effective charge on the phosphorus. The resulting clusters are shown in Figure 4.
Figure 4 UMAP representation of VtC-XES (left) and XANES (right) for compounds with
sulfur ligands, color-coded by number of sulfurs.
As expected, the different ligand identities are contributing to cluster separations. The
VtC-XES also clearly has an outlier – the orange phosphorothioate in the red dithiophosphate
cluster at the bottom right of that figure. Chemically, that compound (PubChem CID 104781) is
structurally different from others because the oxygens form one edge of a carbon tetrahedrane.
Thus, UMAP clearly identifies chemical outliers.
We then analyzed whether the spectra would be sensitive to substitutions of R groups (if
bonded to an oxygen) with a hydrogen atom, thus forming hydroxyl groups, as shown in Figure
5. Here, we have taken phosphinate and phosphonate as starting points, and consecutively
replaced O-R groups with OH groups. In general, this distinction seems to be better illuminated
by the VtC-XES spectra than the XANES (which is shown in Fig. S5), as the clustering in the
VtC-XES is suggestive of a sensitivity to hydroxyl groups. However, Figure 5 also exemplifies
that first-nearest neighbors, e.g., the oxygen ligands directly bonded to the phosphorus, likely
cause the biggest spectral changes and thus are the biggest contributing factor to clustering,
which is consistent with our earlier observations.
Figure 5 UMAP representation of the VtC-XES of compounds with consecutively more R
groups (if bonded to an oxygen) replaced with an H atom (to create hydroxyl groups), color-
coded by chemical class.
In the above discussion, we have motivated our classes by important chemical properties
that we heuristically expected to yield the biggest spectral differences. However, even within this
chemically driven framework, there are sub-clusters within our heuristic chemical classes which
are instead emergent from UMAP. For example, we found that sub-clustering of the phosphate
chemical class (exemplified by the multiple separate sub-clusters in Figures 3 and 4) was caused
by unexpected variations in the secondary substituent (atoms bound to oxygens, not directly to
phosphorus), indicating that XANES spectra are sensitive to even more subtle details than
anticipated.
Let us examine this sub-division of the phosphates, specifically in the UMAP embedding
of their XANES spectra. Applying UMAP to just the phosphates, we obtain the embedding shown
in Figure 6, in which the phosphates are labeled by the four clusters (I, II, III, and IV) found by the
DBSCAN 51 clustering algorithm. The average spectrum for each cluster is shown at the
bottom and the common structural motifs for each cluster are shown to the right.
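To illustrate how density-based clustering assigns points in a two-dimensional embedding to clusters, a minimal pure-Python sketch in the spirit of DBSCAN is given below. It is not the implementation used for Figure 6. The parameter values `eps` and `min_samples` and the toy coordinates are assumptions chosen only for illustration.

```python
import math

NOISE = -1

def dbscan(points, eps, min_samples):
    """Label each point with a cluster index (0, 1, ...) or NOISE."""
    def neighbors(i):
        # All points within eps of point i (the point itself is included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)       # j is a core point, so keep expanding
    return labels

# Hypothetical embedding coordinates: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.3, min_samples=3)
```

Because the number of clusters is not specified in advance, the groups that emerge depend on the local density of the embedding and on the two parameters. The isolated point is labeled as noise rather than being forced into a cluster.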
Compounds with two alkyl R groups and a third group consisting of either alkyl or aryl rings
make up 77% of Cluster I. This distinguishes Cluster I from Clusters II to IV, whose compounds
typically have H atoms rather than carbon-based groups as two of their R groups. Cluster II is the largest
sub-cluster and 94% of the compounds have two hydroxyl groups bonded to the phosphorus and
the last R group an alkyl chain. These two clusters are the most distinct.
On the other hand, Clusters III and IV are similar in composition. Cluster III consists
of compounds whose third R group is (a) an alkyl ring, i.e., a cycloalkane (36%), (b) an aromatic
ring (23%), or (c) a group that takes part in intramolecular hydrogen bonding with one of the hydroxyl groups
bonded to the phosphorus. Cluster IV compounds are structurally very similar to Cluster III
compounds, even though their spectra are distinct. However, 54% of Cluster IV compounds have
their third R group as aromatic rings. All compounds in Clusters I to IV can be viewed in Figs.
S10 to S13. For some example compounds in each cluster along with their spectra and structure,
see Figs. S6 to S9. Additionally, given the linear nature of Clusters I, III, and IV in the UMAP
embedding, we tested the correlation between the embedding location and the energy of the
absorption edge, as demonstrated in Fig. S14, and found no strong correlation. This further
supports the nonlinear nature of spectra and the idea that spectral fingerprints in complex
datasets do not correlate solely to a single high-variant property like the absorption edge, but
rather a combination of properties.
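A check of this kind can be sketched with a Pearson correlation coefficient between a one-dimensional position along a cluster and the edge energy of each spectrum. The sketch below assumes that "embedding location" can be reduced to a single coordinate, which may differ from how Fig. S14 was constructed. All numbers are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical position of five spectra along an elongated cluster.
position = [0.0, 1.0, 2.0, 3.0, 4.0]

# Case 1: edge energy (eV) grows steadily along the cluster.
edge_linear = [2150.0, 2150.5, 2151.0, 2151.5, 2152.0]
# Case 2: edge energy shows no linear trend along the cluster.
edge_scattered = [2151.0, 2150.0, 2152.0, 2150.0, 2151.0]

r_linear = pearson_r(position, edge_linear)
r_scattered = pearson_r(position, edge_scattered)
```

A coefficient near 1 or -1 would indicate that the embedding coordinate mostly tracks the edge position, while a coefficient near 0 is consistent with the embedding reflecting a combination of spectral features instead.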
Figure 6 UMAP representation of XANES of phosphates, color-coded by sub-clusters. Cluster-
averaged spectra and a summary structural motif for each cluster are also shown.
Taken en masse, these results – independent of the specific dimensionality reduction
algorithm used – show the extent to which chemically-relevant information is, or is not, encoded
by the quantum mechanics involved in XANES and VtC-XES. As to the specific algorithm,
UMAP can be used iteratively as more data is collected and thus has the potential to show
evolution through the domain space, similar to the latent space of a variational autoencoder
(VAE) 52. This property facilitates real-time analysis of high-throughput experiments. Finally,
and of key importance here, UMAP can generate embeddings of spectra that can be used both for
unbiased refinement of the training data set and as a preprocessing step before supervised
ML predictions.
The most common use of supervised ML in X-ray spectroscopy is to predict numerical
properties, such as bond length or coordination, from XANES spectra. Here, we instead predict
chemical classes from both VtC-XES and XANES spectra. Moreover, we predict these classes
from a five-dimensional UMAP representation of the spectra instead of from the original spectra
themselves. Such preprocessing through dimensionality reduction can help separate inherently
correlated and nonlinear spectral features 44 as well as greatly reduce both the computational cost
and the effect of spectral noise.
Furthermore, we used a Gaussian Process (GP) in order to incorporate prior knowledge
into our models and generate an informed predictor 53. A GP is a non-parametric kernel method
that formally incorporates Bayes' rule into the model, which not only allows for priors to be
specified during training, but also allows for a probabilistic interpretation of the results. This
probability gives uncertainty estimates, or conversely confidence, of the predictions. We note
that one of the biggest downsides of a GP is that it scales poorly, which is another reason why
applying a nonlinear dimensionality reduction routine like UMAP beforehand can transform this
problem into a computationally tractable one.
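For reference, the Bayesian structure described above can be written in its standard textbook form for a binary GP classifier. This is the generic formulation and not a statement of the specific kernel, prior, or multi-class scheme used in this work. The symbols $f$ (latent function values), $X$ and $y$ (training inputs and labels), $x_*$ (a test input), and $\sigma$ (a sigmoid link function) are notation introduced here for illustration.

```latex
p(f \mid X, y) = \frac{p(y \mid f)\, p(f \mid X)}{p(y \mid X)},
\qquad
\bar{\pi}_* = p(y_* = 1 \mid X, y, x_*)
            = \int \sigma(f_*)\, p(f_* \mid X, y, x_*)\, \mathrm{d}f_*
```

The predictive probability $\bar{\pi}_*$ is the quantity that can be read as a confidence. Exact inference with the $n \times n$ kernel matrix nominally costs $O(n^3)$ operations for $n$ training spectra, which is the usual origin of the poor scaling noted above.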
The results of training a GP on each of the five classification schemes (see Table S1) we
developed – coordination, number of oxygen ligands, phosphate subcluster, number of sulfur
ligands, and number of hydroxyl ligands – are shown in Figure 7, with the average accuracy
score on the test set as well as the probability of that prediction, i.e., the confidence score,
shown. There is a clear correlation between the average accuracy and confidence, indicating that
the GP is, in fact, properly modeling uncertainty of predictions.
Figure 7 Gaussian Process Classifier prediction accuracies with corresponding average
probability (“confidence”) for all chemically driven and cluster-driven classification schemes.
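The comparison between accuracy and confidence can be sketched as follows, taking the confidence of a prediction to be the probability assigned to the predicted class, as described above. The function name `accuracy_and_confidence`, the class labels, and the probabilities are hypothetical. In practice the probabilities would come from a trained classifier rather than being written by hand.

```python
def accuracy_and_confidence(probabilities, true_labels):
    """Average accuracy and average predicted-class probability.

    probabilities: one dict per test sample, mapping class label -> probability.
    """
    correct = 0
    confidence_sum = 0.0
    for probs, truth in zip(probabilities, true_labels):
        predicted = max(probs, key=probs.get)   # most probable class
        confidence_sum += probs[predicted]      # probability of that prediction
        correct += predicted == truth
    n = len(true_labels)
    return correct / n, confidence_sum / n

# Hypothetical predicted probabilities for four test spectra.
predicted_probs = [
    {"tri": 0.9, "tetra": 0.1},
    {"tri": 0.2, "tetra": 0.8},
    {"tri": 0.6, "tetra": 0.4},   # confidently wrong relative to the label below
    {"tri": 0.3, "tetra": 0.7},
]
truth = ["tri", "tetra", "tetra", "tetra"]
accuracy, confidence = accuracy_and_confidence(predicted_probs, truth)
```

When the average confidence tracks the average accuracy across classification schemes, as in this toy case, the predicted probabilities are behaving as reasonable uncertainty estimates.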
Finally, the accuracies and confidences of each prediction across the VtC-XES and
XANES data match what we observed in our two-dimensional UMAP figures. This is clearly
demonstrated in the hydroxyl ligand and phosphate subcluster classification schemes, where the
XANES and VtC-XES, respectively, poorly cluster by these schemes, and the low corresponding
GP confidence reflects this. Overall, these results further validate that visualizing data via a
dimensionality reduction algorithm like UMAP correlates to extractable information content and
can properly inform classes to be used for supervised ML.
By utilizing UMAP and analyzing the resulting clustering in a two-dimensional
embedding of VtC-XES and XANES spectra of an ensemble of organophosphorus compounds,
we noticed sensitivity to coordination and ligand identity (specifically by distinguishing number
of oxygen ligands, sulfur ligands, and hydroxyl groups). Additionally, the XANES was clearly
more sensitive to phosphate sub-groupings (which resulted from an unexpected, unintuitive
fingerprint). Taken together, these results culminate in a valuable analysis framework: (1)
applying nonlinear dimensionality reduction routines and cluster analysis to check for both
heuristic chemical sensitivities and emergent ones present in spectra, (2) applying dimensionality
reduction methods like UMAP before querying supervised ML models, and (3) utilizing models
that incorporate prior knowledge, such as a Gaussian Process, to estimate uncertainty or
confidence of these predictions on the clustering-informed classes. Furthermore, this framework,
visualized in Figure 1, is broadly applicable – it can easily be expanded to both other systems
and other one-dimensional spectroscopies – providing a way to validate predictions instead of
relying solely on the initial construction of an appropriate training dataset.
ASSOCIATED CONTENT
Supporting Information.
The following files are available free of charge:
Computational Methods (docx)
Figure S1 Class averages of spectra with different coordination (png)
Figure S2 Scree plot of VtC-XES and XANES data (png)
Figure S3 PCA reconstruction of VtC-XES spectra (png)
Figure S4 PCA reconstruction of XANES spectra (png)
Figure S5 UMAP representation of XANES with H atom substitutions (png)
Figure S6 Phosphate sub-cluster I example spectra (png)
Figure S7 Phosphate sub-cluster II example spectra (png)
Figure S8 Phosphate sub-cluster III example spectra (png)
Figure S9 Phosphate sub-cluster IV example spectra (png)
Figure S10 Phosphate sub-cluster I structures (png)
Figure S11 Phosphate sub-cluster II structures (png)
Figure S12 Phosphate sub-cluster III structures (png)
Figure S13 Phosphate sub-cluster IV structures (png)
Figure S14 Phosphate sub-clusters correlation (png)
Figure S15 3D UMAP visualizations (png)
Table S1 Classification table (docx)
AUTHOR INFORMATION
The authors declare no competing financial interests.
ACKNOWLEDGMENT
ST acknowledges funding from NRT-DESE: Data Intensive Research Enabling Clean
Technologies (DIRECT) under grant no. NSF #1633216 and acknowledges funding from NSF
CHE-1904437. VK acknowledges support from the Washington NASA Space Grant from the
Washington NASA Space Grant Consortium (WSGC). NG acknowledges support from the US
Department of Energy, Office of Science, Office of Basic Energy Sciences, Chemical Sciences,
Geosciences and Biosciences under Award No. KC-030105172685. AV acknowledges support
from the Research Corporation for Science Advancement through a Cottrell Scholars Award.
This research benefited from computational resources provided by the Environmental Molecular
Sciences Laboratory (EMSL), a DOE Office of Science User Facility sponsored by the Office of
Biological and Environmental Research and located at PNNL. PNNL is operated by Battelle
Memorial Institute for the United States Department of Energy under DOE Contract No. DE-
AC05-76RL1830. Additionally, this work was facilitated through the use of advanced
computational, storage, and networking infrastructure provided by the Hyak supercomputer
system and funded by the STF at the University of Washington.
REFERENCES
1. Butler, K. T.; Davies, D. W.; Cartwright, H.; Isayev, O.; Walsh, A., Machine learning
for molecular and materials science. Nature 2018, 559 (7715), 547-555.
2. Zhou, Z. Q.; He, Q. F.; Liu, X. D.; Wang, Q.; Luan, J. H.; Liu, C. T.; Yang, Y.,
Rational design of chemically complex metallic glasses by hybrid modeling guided machine
learning. npj Computational Materials 2021, 7 (1), 138.
3. Liu, Y.; Zhao, T. L.; Ju, W. W.; Shi, S. Q., Materials discovery and design using
machine learning. Journal of Materiomics 2017, 3 (3), 159-177.
4. Liu, Y.; Guo, B. R.; Zou, X. X.; Li, Y. J.; Shi, S. Q., Machine learning assisted
materials design and discovery for rechargeable batteries. Energy Storage Materials 2020, 31,
434-450.
5. Saal, J. E.; Kirklin, S.; Aykol, M.; Meredig, B.; Wolverton, C., Materials Design and
Discovery with High-Throughput Density Functional Theory: The Open Quantum Materials
Database (OQMD). JOM 2013, 65 (11), 1501-1509.
6. Meza Ramirez, C. A.; Greenop, M.; Ashton, L.; Rehman, I. u., Applications of machine
learning in spectroscopy. Applied Spectroscopy Reviews 2021, 56 (8-10), 733-763.
7. Gordon, D. F.; Desjardins, M., Evaluation and Selection of Biases in Machine Learning.
Machine Learning 1995, 20 (1-2), 5-22.
8. Wolpert, D. H.; Macready, W. G., No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation 1997, 1 (1), 67-82.
9. Alelyani, S., Detection and Evaluation of Machine Learning Bias. Applied Sciences 2021,
11 (14).
10. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A., A Survey on Bias
and Fairness in Machine Learning. ACM Computing Surveys 2021, 54 (6).
11. Pot, M.; Kieusseyan, N.; Prainsack, B., Not all biases are bad: equitable and inequitable
biases in machine learning and radiology. Insights Into Imaging 2021, 12 (1).
12. Hiemstra, A. M. F.; Cassel, T.; Born, M. P.; Liem, C. C. S., The promises and perils of
machine learning algorithms to reduce bias and discrimination in personnel selection procedures.
Gedrag en Organisatie 2020, 33 (4), 279-299.
13. Timoshenko, J.; Lu, D. Y.; Lin, Y. W.; Frenkel, A. I., Supervised Machine-Learning-
Based Determination of Three-Dimensional Structure of Metallic Nanoparticles. Journal of
Physical Chemistry Letters 2017, 8 (20), 5091-5098.
14. Timoshenko, J.; Frenkel, A. I., "Inverting" X-ray Absorption Spectra of Catalysts by
Machine Learning in Search for Activity Descriptors. ACS Catalysis 2019, 9 (11), 10192-10211.
15. Zheng, C.; Chen, C.; Chen, Y.; Ong, S. P., Random Forest Models for Accurate
Identification of Coordination Environments from X-Ray Absorption Near-Edge Structure.
Patterns 2020, 1 (2), 100013.
16. Zheng, C.; Mathew, K.; Chen, C.; Chen, Y. M.; Tang, H. M.; Dozier, A.; Kas, J. J.;
Vila, F. D.; Rehr, J. J.; Piper, L. F. J.; Persson, K. A.; Ong, S. P., Automated generation and
ensemble-learned matching of X-ray absorption spectra. npj Computational Materials 2018, 4,
12.
17. Kiyohara, S.; Miyata, T.; Tsuda, K.; Mizoguchi, T., Data-driven approach for the
prediction and interpretation of core-electron loss spectroscopy. Scientific Reports 2018, 8 (1),
13548.
18. Routh, P. K.; Liu, Y.; Marcella, N.; Kozinsky, B.; Frenkel, A. I., Latent Representation
Learning for Structural Characterization of Catalysts. The Journal of Physical Chemistry Letters
2021, 12 (8), 2086-2094.
19. Aarva, A.; Deringer, V. L.; Sainio, S.; Laurila, T.; Caro, M. A., Understanding X-ray
Spectroscopy of Carbonaceous Materials by Combining Experiments, Density Functional
Theory, and Machine Learning. Part I: Fingerprint Spectra. Chemistry of Materials 2019, 31
(22), 9243-9255.
20. Carbone, M. R.; Yoo, S.; Topsakal, M.; Lu, D., Classification of local chemical
environments from x-ray absorption spectra using supervised machine learning. Physical Review
Materials 2019, 3 (3), 033604.
21. Carbone, M. R.; Topsakal, M.; Lu, D.; Yoo, S., Machine-Learning X-Ray Absorption
Spectra to Quantitative Accuracy. Physical Review Letters 2020, 124 (15), 156401(6).
22. Liu, Y.; Marcella, N.; Timoshenko, J.; Halder, A.; Yang, B.; Kolipaka, L.; Pellin, M.
J.; Seifert, S.; Vajda, S.; Liu, P.; Frenkel, A. I., Mapping XANES spectra on structural
descriptors of copper oxide clusters using supervised machine learning. The Journal of Chemical
Physics 2019, 151 (16), 164201.
23. Martini, A.; Guda, S. A.; Guda, A. A.; Smolentsev, G.; Algasov, A.; Usoltsev, O.;
Soldatov, M. A.; Bugaev, A.; Rusalev, Y.; Lamberti, C.; Soldatov, A. V., PyFitit: The software
for quantitative analysis of XANES spectra using machine-learning algorithms. Computer
Physics Communications 2020, 250, 107064.
24. Miyazato, I.; Takahashi, L.; Takahashi, K., Automatic oxidation threshold recognition of
XAFS data using supervised machine learning. Molecular Systems Design & Engineering 2019,
4 (5), 1014-1018.
25. Guda, A. A.; Guda, S. A.; Martini, A.; Kravtsova, A. N.; Algasov, A.; Bugaev, A.;
Kubrin, S. P.; Guda, L. V.; Šot, P.; van Bokhoven, J. A.; Copéret, C.; Soldatov, A. V.,
Understanding X-ray absorption spectra by means of descriptors and machine learning
algorithms. npj Computational Materials 2021, 7 (1), 203.
26. Fang, Z.; Hu, W.; Wang, M.; Wang, R.; Zhong, S.; Chen, S., X-ray absorption
spectroscopy combined with machine learning for diagnosis of schistosomiasis cirrhosis.
Biomedical Signal Processing and Control 2020, 60, 101944.
27. Torrisi, S. B.; Carbone, M. R.; Rohr, B. A.; Montoya, J. H.; Ha, Y.; Yano, J.; Suram,
S. K.; Hung, L., Random forest machine learning models for interpretable X-ray absorption near-
edge structure spectrum-property relationships. npj Computational Materials 2020, 6 (1), 109.
28. Trejo, O.; Dadlani, A. L.; De La Paz, F.; Acharya, S.; Kravec, R.; Nordlund, D.;
Sarangi, R.; Prinz, F. B.; Torgersen, J.; Dasgupta, N. P., Elucidating the Evolving Atomic
Structure in Atomic Layer Deposition Reactions with in Situ XANES and Machine Learning.
Chemistry of Materials 2019, 31 (21), 8937-8947.
29. Rankine, C. D.; Madkhali, M. M. M.; Penfold, T. J., A Deep Neural Network for the
Rapid Prediction of X-ray Absorption Spectra. The Journal of Physical Chemistry A 2020, 124
(21), 4263-4270.
30. Rankine, C. D.; Penfold, T. J., Progress in the Theory of X-ray Spectroscopy: From
Quantum Chemistry to Machine Learning and Ultrafast Dynamics. The Journal of Physical
Chemistry A 2021, 125 (20), 4276-4293.
31. Kiyohara, S.; Tsubaki, M.; Mizoguchi, T., Learning excited states from ground states by
using an artificial neural network. npj Computational Materials 2020, 6 (1), 68.
32. Cuisinier, M.; Cabelguen, P.-E.; Evers, S.; He, G.; Kolbeck, M.; Garsuch, A.; Bolin,
T.; Balasubramanian, M.; Nazar, L. F., Sulfur Speciation in Li–S Batteries Determined by
Operando X-ray Absorption Spectroscopy. The Journal of Physical Chemistry Letters 2013, 4
(19), 3227-3232.
33. Asakura, D.; Hosono, E.; Niwa, H.; Kiuchi, H.; Miyawaki, J.; Nanba, Y.; Okubo, M.;
Matsuda, H.; Zhou, H.; Oshima, M.; Harada, Y., Operando soft x-ray emission spectroscopy of
LiMn2O4 thin film involving Li–ion extraction/insertion reaction. Electrochemistry
Communications 2015, 50, 93-96.
34. Zhou, Y.; Doronkin, D. E.; Zhao, Z.; Plessow, P. N.; Jelic, J.; Detlefs, B.;
Pruessmann, T.; Studt, F.; Grunwaldt, J.-D., Photothermal Catalysis over Nonplasmonic
Pt/TiO2 Studied by Operando HERFD-XANES, Resonant XES, and DRIFTS. ACS Catalysis
2018, 8 (12), 11398-11406.
35. Maiuri, M.; Garavelli, M.; Cerullo, G., Ultrafast Spectroscopy: State of the Art and Open
Challenges. Journal of the American Chemical Society 2020, 142 (1), 3-15.
36. Bunker, G., Introduction to XAFS: A Practical Guide to X-ray Absorption Fine Structure
Spectroscopy. Cambridge University Press: Cambridge, 2010.
37. Glatzel, P.; Bergmann, U., High resolution 1s core hole X-ray spectroscopy in 3d
transition metal complexes—electronic and structural information. Coordination Chemistry
Reviews 2005, 249 (1), 65-95.
38. de Groot, F., High-Resolution X-ray Emission and X-ray Absorption Spectroscopy.
Chemical Reviews 2001, 101 (6), 1779-1808.
39. Seidler, G. T.; Mortensen, D. R.; Remesnik, A. J.; Pacold, J. I.; Ball, N. A.; Barry, N.;
Styczinski, M.; Hoidn, O. R., A laboratory-based hard x-ray monochromator for high-resolution
x-ray emission spectroscopy and x-ray absorption near edge structure measurements. Review of
Scientific Instruments 2014, 85 (11), 113906.
40. Malzer, W.; Schlesiger, C.; Kanngießer, B., A century of laboratory X-ray absorption
spectroscopy – A review and an optimistic outlook. Spectrochimica Acta Part B: Atomic
Spectroscopy 2021, 177, 106101.
41. Zimmermann, P.; Peredkov, S.; Abdala, P. M.; DeBeer, S.; Tromp, M.; Müller, C.;
van Bokhoven, J. A., Modern X-ray spectroscopy: XAS and XES in the laboratory. Coordination
Chemistry Reviews 2020, 423, 213466.
42. Holden, W. M.; Jahrman, E. P.; Govind, N.; Seidler, G. T., Probing Sulfur Chemical and
Electronic Structure with Experimental Observation and Quantitative Theoretical Prediction of
Kα and Valence-to-Core Kβ X-ray Emission Spectroscopy. The Journal of Physical Chemistry A
2020, 124 (26), 5415-5434.
43. Tetef, S.; Govind, N.; Seidler, G. T., Unsupervised machine learning for unbiased
chemical classification in X-ray absorption spectroscopy and X-ray emission spectroscopy. Physical
Chemistry Chemical Physics 2021, 23 (41), 23586-23601.
44. Ceriotti, M., Unsupervised machine learning in atomistic simulations, between
predictions and understanding. The Journal of Chemical Physics 2019, 150 (15), 150901.
45. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.;
Thiessen, P. A.; Yu, B.; Zaslavsky, L.; Zhang, J.; Bolton, E. E., PubChem 2019 update:
improved access to chemical data. Nucleic Acids Research 2019, 47 (D1).
46. McInnes, L.; Healy, J.; Melville, J., UMAP: Uniform Manifold Approximation and
Projection for Dimension Reduction. arXiv 2020, (1802.03426).
47. van der Maaten, L.; Hinton, G., Visualizing Data using t-SNE. Journal of Machine
Learning Research 2008, 9, 2579-2605.
48. Pont, F.; Tosolini, M.; Fournie, J. J., Single-Cell Signature Explorer for comprehensive
visualization of single cell signatures across scRNA-seq datasets. Nucleic Acids Research 2019,
47 (21).
49. Rovezzi, M.; Glatzel, P., Hard x-ray emission spectroscopy: a powerful tool for the
characterization of magnetic semiconductors. Semiconductor Science and Technology 2014, 29 (2), 023002.
50. Murphy, L. R.; Meek, T. L.; Allred, A. L.; Allen, L. C., Evaluation and Test of Pauling's
Electronegativity Scale. The Journal of Physical Chemistry A 2000, 104 (24), 5867-5871.
51. Hahsler, M.; Piekenbrock, M.; Doran, D., dbscan: Fast Density-Based Clustering with R.
Journal of Statistical Software 2019, 91 (1), 1-30.
52. Shrestha, A.; Mahmood, A., Review of Deep Learning Algorithms and Architectures.
IEEE Access 2019, 7, 53040-53065.
53. Rasmussen, C. E.; Williams, C. K. I., Gaussian Processes for Machine Learning. The
MIT Press: 2006.