Informed Chemical Classification of Organophosphorus Compounds via Unsupervised Machine Learning of X-ray Absorption Spectroscopy and X-ray Emission Spectroscopy

February 2022

DOI:10.26434/chemrxiv-2022-tlmm4

License
CC BY-NC-ND 4.0

Authors:

Niranjan Govind

Pacific Northwest National Laboratory

Preprints and early-stage research may not have been peer reviewed yet.

Samantha Tetef+1,

Vikram Kashyap+1,

Alexandra Velian2,

Niranjan Govind3,

Gerald T. Seidler1*

+Co-first authors

1Department of Physics, University of Washington, Seattle WA 98195, USA

2Department of Chemistry, University of Washington, Seattle WA 98195, USA

3Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory,

Richland, Washington 99352, USA

Corresponding Author

*seidler@uw.edu

ABSTRACT

We analyze an ensemble of organophosphorus compounds to form an unbiased characterization

of the information encoded in their X-ray absorption near edge structure (XANES) and valence-

to-core X-ray emission spectra (VtC-XES). Data-driven emergence of chemical classes via

unsupervised machine learning, specifically cluster analysis in the Uniform Manifold

Approximation and Projection (UMAP) embedding, finds spectral sensitivity to coordination,

oxidation, aromaticity, intramolecular hydrogen bonding, and ligand identity. Subsequently, we

implement supervised machine learning via Gaussian Process classifiers to identify confidence in

predictions which match our initial qualitative assessments of clustering. The results further

support the benefit of utilizing unsupervised machine learning as a precursor to supervised

machine learning.

KEYWORDS X-ray absorption fine structure, valence-to-core X-ray emission spectroscopy,

Gaussian Process, UMAP, unsupervised machine learning.

The information content in any spectroscopy method is constrained by the lossiness of

the underlying quantum mechanics that connects atomic-scale structure and dynamics to

experimental observables. Further limitations to the sensitivity of spectroscopy techniques often

include the inherent nonlinear or stochastic responses of the experimental probe. These facts

constrain our ability to correlate physical measurements, e.g., spectral features, to desired

microscopic properties. In response, the use of data science and machine learning (ML) in

spectroscopy, with applications across the physical sciences, has grown rapidly 1-5. These

data-driven models can frequently disentangle and infer patterns from lossy measurements as

well as provide insight into the information encoded in spectra.

In general, supervised ML studies across a wide range of spectroscopies target either

predicting properties from spectra or correlating specific properties of interest to spectral features

6. This necessarily assumes that sufficient information is, in fact, encoded in spectra; otherwise,

ML models will correlate spurious features to requested properties. This detail of encoded

information is often addressed by hand-selecting a targeted training domain which depends

heavily on prior knowledge 7. However, issues arise if the training domain is too small or biased.

First, if the training domain is too small, the model will be unable to generalize well beyond its

specialized scope, which violates the essential assumption that the training and test data are

sampled from the same distribution. Second, although some bias is essential for any machine

learning model 8, unwanted bias, especially from unrepresentative data, silently undermines the

reliability of inferences and has led to contemporary ethical concerns 9-12.

In the effort to combat unwanted bias as well as provide generalizability to complex

datasets, this study demonstrates the value of the pipeline exemplified in Figure 1, which

validates encoded information via unsupervised machine learning, i.e., cluster analysis on a

reduced-dimensional embedding of the spectra, before passing either the embedding or the

original spectra – selected as an unbiased training (sub)set – to a supervised machine learning

model. This pipeline removes implicit biases and spurious correlations by adding steps (3) and

(4) to a typical ML pipeline, which validate spectral sensitivity to properties requested during

supervised predictions.

Figure 1 Flowchart of an analysis framework that uses unsupervised machine learning (such as

cluster analysis) as a precursor to predictions on spectra via supervised machine learning.

We utilize this pipeline for a spectroscopy method that has seen an ongoing exploration

of ML applications: X-ray absorption spectroscopy (XAS) 13-31. XAS is most commonly used in

chemistry, biology, and materials science to investigate the element-specific local coordination

environment and electronic structure, with applications including energy storage 32, 33, catalysis

34, and photochemical dynamics 35. XAS, which includes both X-ray absorption near-edge

structure (XANES) and extended X-ray absorption fine structure (EXAFS), probes the

unoccupied electronic states of the excited state of a chosen atomic species. Conversely,

relaxation to fill the core hole results in either nonradiative (Auger) or radiative processes. The

latter results in the emission of X-ray fluorescence that can be finely characterized by X-ray

emission spectroscopy (XES) for insight into the occupied electronic states 36-38. Often discussed

as complementary to XANES in information content, valence-to-core XES (VtC-XES) is

produced when electrons de-excite from the valence shell to fill the core hole, giving direct

information about occupied electronic states involved in bonding. While XAS and XES have

traditionally been synchrotron-based methods, we note that their access, including for VtC-XES,

is now being steadily augmented with a renaissance of lab-based spectrometers 39-41, including in

studies of sufficient scale for data science methods 42.

In the first study to utilize ML in XAS, Timoshenko, et al. 13 predicted coordination from

XANES spectra using a neural network, while Zheng, et al. 15 also predicted coordination but

with a random forest model. Torrisi, et al. 27 likewise used a random forest model, but to

correlate polynomial fitting parameters of spectra with properties like bond distance. Other works

utilizing machine learning in XAS include a XANES matching algorithm 16, hierarchical

clustering on spectra 17, and use of an autoencoder to correlate coordination to a reduced

dimensional representation of spectra 18. Most of these studies assumed desired information was

in fact encoded in spectra, largely because of hand-crafting relevant training datasets. However,

our pipeline, via the unsupervised machine learning precursor, allows for explorative and

unbiased refinement of chemical descriptors – a step that we propose is both necessary, and

likely sufficient, when addressing much more complex datasets.

The present study is prompted by our recent work 43 that compared the variance and

information content of sulfur K-edge XANES to VtC-XES Kβ spectra for sulforganics. We

found that nonlinear dimensionality reduction algorithms, a subset of unsupervised ML, provided

an effective way to extract features and thus important chemical information encoded in spectra.

Moreover, our results exemplified the benefits of utilizing unsupervised ML to mold and

understand the full potential of supervised ML analysis 44.

Here, we investigate the information content and sensitivity of phosphorus K-edge

XANES and VtC-XES Kβ in a more complex chemical system, organophosphorus compounds,

and indeed find sensitivity to a wider range of chemical properties, including coordination,

oxidation, aromaticity, intramolecular hydrogen bonding, and ligand identity. The dataset of

spectra we analyze is calculated from molecular structures gathered from the PubChem 45

database using moldl, a Python module we have written to aid in collecting and managing

molecular structure datasets. moldl is open-source and freely available to anyone. See the SI for

more details. For the rest of this paper, we will refer to the phosphorus K-edge XANES and VtC-

XES Kβ as just XANES and VtC-XES, respectively, for brevity.

Organophosphorus compounds have much higher total variance than sulforganics, as well

as higher variance within the same bonding geometry. We can therefore tune the input domain to

account for these highly variant structures, allowing us to understand the sensitivity of these

spectra to a wider range of properties. In addition, we can find, in an unbiased way, the extent of

the information that may be extracted using dimensionality reduction algorithms, especially

when confined to very limited dimensions. These explorations allow for full utilization of real

spectral information during supervised ML predictions.

To this end, we utilize Uniform Manifold Approximation and Projection (UMAP) 46 for

dimensionality reduction, which allows us to develop chemical classes by examining clustering

of spectra in a two-dimensional embedding. UMAP is a nonlinear embedding similar to t-

distributed Stochastic Neighbor Embedding (t-SNE) 47, which was used in our recent work 43 to

extract chemical classes. UMAP has additional benefits compared to t-SNE, such as being

able to embed new data into a learned projection and preserving more global structure, which allows for future data compression as well as

interpretation of overall global similarities. These advantages have led to its recent popularity,

such as in single cell RNA sequencing (scRNA-seq) data analysis 48, but UMAP has not yet seen use in

XAS analysis.

To begin, heuristically one expects coordination to yield the strongest distinguishing

feature between spectra, specifically the distinction between tricoordinate phosphorus and

tetracoordinate phosphorus. Not only do these coordination geometries have different hybridized

orbital character, but they are often a proxy for oxidation state. In organophosphorus compounds

with tricoordinate phosphorus centers, the phosphorus is typically in a 3+ oxidation state,

whereas compounds with tetracoordinate phosphorus centers usually have the phosphorus in a 5+

oxidation state. We chose compounds with a diverse number of oxygens bonded to phosphorus

within these two coordination configurations to further vary the effective charge on the

phosphorus. The spectral averages for both the VtC-XES and XANES spectra for each

tricoordinate phosphorus and tetracoordinate phosphorus class are shown in Fig. S1. We then

applied UMAP to the VtC-XES and XANES spectra to create a two-dimensional embedding of

the ensemble. The results are color-coded based on whether the compound includes tricoordinate

phosphorus or tetracoordinate phosphorus, as shown in Figure 2.
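The class-averaged spectra referred to above can be computed with a few lines of code. The following is a minimal sketch, assuming every spectrum is sampled on the same energy grid; the function name `class_average_spectra`, the labels, and the toy intensities are hypothetical and are not taken from the dataset used in this work.

```python
from collections import defaultdict


def class_average_spectra(spectra, labels):
    """Return {label: mean spectrum} for spectra on a common energy grid."""
    grouped = defaultdict(list)
    for spectrum, label in zip(spectra, labels):
        grouped[label].append(spectrum)

    averages = {}
    for label, members in grouped.items():
        n_points = len(members[0])
        if any(len(m) != n_points for m in members):
            raise ValueError("all spectra must share the same energy grid")
        # zip(*members) walks the grid point by point across the class.
        averages[label] = [sum(column) / len(members) for column in zip(*members)]
    return averages


# Toy intensities on a shared 4-point grid (made-up numbers, not real spectra).
spectra = [
    [0.0, 0.2, 0.8, 1.0],
    [0.0, 0.4, 1.0, 1.0],
    [0.0, 0.0, 0.3, 1.0],
]
labels = ["tricoordinate", "tricoordinate", "tetracoordinate"]
averages = class_average_spectra(spectra, labels)
```

Comparing such averages gives a first, purely descriptive check of whether a proposed class separates spectrally before any embedding is computed.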

Individual classes within each coordination are shown in columns A and B. Additionally,

all R groups are constrained to exclusively carbons (e.g., alkyl or aryl chains), and sometimes

hydrogens (when bound to the oxygen) to achieve hydroxyl groups, but only for phosphates

(which we will explore later). As expected, coordination distinguishes most of the groupings of

the compounds, with a handful of outliers.

Figure 2 UMAP representation of VtC-XES (top) and XANES (bottom), color-coded by

coordination. R1, R2, and R3 are defined to be carbon-based aryl or alkyl chains, with only

phosphates allowed to have R1 and R2 as H atoms.

It follows that there are chemically relevant sub-groupings within each coordination.

Figure 3 shows the embedding color-coded within each of the tri- and tetra-coordinate classes

based on the number of oxygens bonded to the phosphorus. We expected effective charge of the

phosphorus to have the biggest impact on both the VtC-XES and XANES spectra. For the VtC-

XES, the ligand peaks (the small low-energy peak in Fig. S1) will increase in both energy and

intensity with an increase in phosphorus oxidation. From a molecular orbital perspective, this

trend arises from a larger overlap between the ligand valence orbital and the phosphorus 3p orbital

(valence shell). In general, this feature (which also changes with different ligand symmetries and

orientation) is why VtC-XES is so strongly sensitive to ligand identity 49. For the XANES

spectra, an increase in the oxidation of the phosphorus, i.e., the number of oxygen ligands within

a coordination, will cause a blueshift of the absorption edge, also demonstrated by the average

spectra in Fig. S1.
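To make the notion of an edge shift concrete, the sketch below estimates an absorption edge position as the energy where a normalized spectrum first rises through half of the edge step, and compares two synthetic spectra. Both the half-step definition of the edge and the arctangent toy spectra are assumptions made for illustration; they are not the edge definition or the calculated spectra used in this work, and the energy values are hypothetical.

```python
import math


def edge_energy(energies, absorption, level=0.5):
    """Energy at which a normalized spectrum first rises through `level`.

    Linear interpolation is used between the two grid points that bracket
    the crossing. Placing the edge at half of the edge step is an assumption
    of this sketch.
    """
    for i in range(1, len(energies)):
        lo, hi = absorption[i - 1], absorption[i]
        if lo < level <= hi:
            frac = (level - lo) / (hi - lo)
            return energies[i - 1] + frac * (energies[i] - energies[i - 1])
    raise ValueError("spectrum never crosses the requested level")


def toy_spectrum(energies, e0, width=1.0):
    """Arctangent step centered at e0 that rises from 0 to 1."""
    return [0.5 + math.atan((e - e0) / width) / math.pi for e in energies]


# Hypothetical energy grid in eV; the edge positions below are illustrative.
energies = [2140.0 + 0.1 * i for i in range(301)]
less_oxidized = toy_spectrum(energies, e0=2147.0)
more_oxidized = toy_spectrum(energies, e0=2152.0)

shift = edge_energy(energies, more_oxidized) - edge_energy(energies, less_oxidized)
print(f"edge shift: {shift:.2f} eV")
```

A positive `shift` corresponds to the blueshift described above; other edge definitions, such as the maximum of the first derivative, would serve equally well in this role.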

Figure 3 UMAP representation of VtC-XES (top) and XANES (bottom) for tricoordinate

phosphorus (A) and tetracoordinate phosphorus (B) compounds, color-coded by number of

oxygens bonded to the phosphorus within each coordination.

Note that the phosphates are segregated from the other tetracoordinate phosphorus

compounds and seem to sub-cluster as well. This observation brings us to our next hypothesis

that VtC-XES and XANES are both sensitive to ligand identity. As stated earlier, VtC-XES is

highly sensitive to ligand identity, observed by changes in the ligand peak feature. Again,

because the absorption edge of a XANES spectrum shifts with oxidation, the electronegativity of

ligands will cause the biggest spectral change. However, even for ligands with approximately the

same electronegativity, different phase shifts and cross sections cause finer changes to the

XANES spectra.

To systematically probe the effect of ligand identity, a series of tetracoordinate

phosphorus compounds (phosphates) were evaluated in which the oxygen substituents were

replaced with one or two sulfur atoms. Compared to oxygen, sulfur is significantly less

electronegative, with a Pauling electronegativity value near that of carbon and phosphorus 50.

Thus, these oxygen-to-sulfur ligand substitutions likely cause the biggest spectral change by

adjusting the effective charge on the phosphorus. The resulting clusters are shown in Figure 4.

Figure 4 UMAP representation of VtC-XES (left) and XANES (right) for compounds with

sulfur ligands, color-coded by number of sulfurs.

As expected, the different ligand identities are contributing to cluster separations. The

VtC-XES also clearly has an outlier – the orange phosphorothioate in the red dithiophosphate

cluster at the bottom right of that figure. Chemically, that compound (PubChem CID 104781) is

structurally different from others because the oxygens form one edge of a carbon tetrahedrane.

Thus, UMAP clearly identifies chemical outliers.

We then analyzed whether the spectra would be sensitive to substitutions of R groups (if

bonded to an oxygen) with a hydrogen atom, thus forming hydroxyl groups, as shown in Figure

5. Here, we have taken phosphinate and phosphonate as starting points, and consecutively

replaced O-R groups with OH groups. In general, this distinction seems to be better illuminated

by the VtC-XES spectra than the XANES (which is shown in Fig. S5), as the clustering in the

VtC-XES is suggestive of a sensitivity to hydroxyl groups. However, Figure 5 also exemplifies

that first-nearest neighbors, e.g., the oxygen ligands directly bonded to the phosphorus, likely

cause the biggest spectral changes and thus are the biggest contributing factor to clustering,

which is consistent with our earlier observations.

Figure 5 UMAP representation of the VtC-XES of compounds with consecutively more R

groups (if bonded to an oxygen) replaced with an H atom (to create hydroxyl groups), color-

coded by chemical class.

In the above discussion, we have motivated our classes by important chemical properties

that we heuristically expected to yield the biggest spectral differences. However, even within this

chemically driven framework, there are sub-clusters within our heuristic chemical classes which

are instead emergent from UMAP. For example, we found that sub-clustering of the phosphate

chemical class (exemplified by the multiple separate sub-clusters in Figures 3 and 4) was caused

by unexpected variations in the secondary substituent (atoms bound to oxygens, not directly to

phosphorus), indicating that XANES spectra are sensitive to even more subtle details than

anticipated.

Let us examine this sub-division of the phosphates, specifically in the UMAP embedding

of their XANES spectra. Applying UMAP to just the phosphates, we obtain the embedding shown

in Figure 6, in which the phosphates are labeled by the four clusters (I, II, III, and IV) determined by the DBSCAN 51

clustering algorithm. The average spectrum for each cluster is shown at the

bottom and the common structural motifs for each cluster are shown to the right.
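For readers unfamiliar with density-based clustering, the following is a minimal pure-Python sketch of the DBSCAN idea applied to points in a two-dimensional embedding. It is not the implementation used in this work; the `eps` and `min_samples` values and the toy coordinates are hypothetical, and a point counts as its own neighbor when testing for core points.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster index (0, 1, ...) or NOISE (-1)."""

    def neighbors(i):
        # Brute-force neighborhood query; includes point i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep growing
    return labels


# Toy embedding: two tight groups and one isolated point (made-up coordinates).
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0),
]
labels = dbscan(points, eps=0.2, min_samples=3)
```

Because the number of clusters is not specified in advance, the cluster count emerges from the density of the embedding, and isolated points are flagged as noise rather than forced into a class.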

In Cluster I, 77% of the compounds have two alkyl R groups, with the third group either alkyl

or aryl. This sets Cluster I apart from Clusters II to IV, whose compounds

typically have two R groups as H atoms instead of carbon-based groups. Cluster II is the largest

sub-cluster and 94% of the compounds have two hydroxyl groups bonded to the phosphorus and

the last R group an alkyl chain. These two clusters are the most distinct.

On the other hand, Clusters III and IV are similar in composition. Cluster III consists

of compounds whose third R group is (a) an alkyl ring, i.e., a cycloalkane (36%), (b) an aromatic

ring (23%), or (c) a group that takes part in intramolecular hydrogen bonding with one of the hydroxyl groups

bonded to the phosphorus. Cluster IV compounds are structurally very similar to Cluster III

compounds, even though their spectra are distinct. However, 54% of Cluster IV compounds have

their third R group as aromatic rings. All compounds in Clusters I to IV can be viewed in Figs.

S10 to S13. For some example compounds in each cluster along with their spectra and structure,

see Figs. S6 to S9. Additionally, given the linear nature of Clusters I, III, and IV in the UMAP

embedding, we tested the correlation between the embedding location and the energy of the

absorption edge, as demonstrated in Fig. S14, and found no strong correlation. This further

supports the nonlinear nature of spectra and the idea that spectral fingerprints in complex

datasets do not correlate solely to a single high-variant property like the absorption edge, but

rather a combination of properties.
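A simple way to run this kind of check is to compute a linear correlation coefficient between a coordinate in the embedding and the edge energy. The sketch below uses the Pearson coefficient, which is an assumption of this example since the statistic behind Fig. S14 is not specified here; all numbers are made up. The second toy case shows that a property can depend strongly, but nonlinearly, on the embedding coordinate while the linear correlation vanishes.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical position of five compounds along a linear-looking cluster.
embedding_coordinate = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Case 1: edge energy (eV, made up) varies linearly with the coordinate.
edge_linear = [2147.0 + 0.5 * u for u in embedding_coordinate]
# Case 2: edge energy varies quadratically, so the linear correlation is zero.
edge_quadratic = [2147.0 + u ** 2 for u in embedding_coordinate]

r_linear = pearson_r(embedding_coordinate, edge_linear)
r_quadratic = pearson_r(embedding_coordinate, edge_quadratic)
```

A weak Pearson coefficient therefore rules out only a linear relationship; rank-based or nonlinear measures would be needed to exclude other dependencies.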

Figure 6 UMAP representation of XANES of phosphates, color-coded by sub-clusters. Cluster-

averaged spectra and a summary structural motif for each cluster are also shown.

Taken en masse, these results – independent of the specific dimensionality reduction

algorithm used – show the extent to which chemically-relevant information is, or is not, encoded

by the quantum mechanics involved in XANES and VtC-XES. As to the specific algorithm,

UMAP can be used iteratively as more data is collected and thus has the potential to show

evolution through the domain space, similar to the latent space of a variational autoencoder

(VAE) 52. This property facilitates real-time analysis of high-throughput experiments. Finally,

and of key importance here, UMAP can generate embeddings of spectra that can serve both for

unbiased refinement of the training data set and as a preprocessing step before supervised

ML predictions.

The most common use of supervised ML in X-ray spectroscopy is to predict numerical

properties, such as bond length or coordination, from XANES spectra. Here, we instead predict

chemical classes from both VtC-XES and XANES spectra. Moreover, we predict these classes

from a five-dimensional UMAP representation of the spectra instead of from the original spectra

themselves. Such preprocessing through dimensionality reduction can help separate inherently

correlated and nonlinear spectral features 44 as well as greatly reduce both the computational cost

and the effect of spectral noise.

Furthermore, we used a Gaussian Process (GP) in order to incorporate prior knowledge

into our models and generate an informed predictor 53. A GP is a non-parametric kernel method

that formally incorporates Bayes' rule into the model, which not only allows for priors to be

specified during training, but also allows for a probabilistic interpretation of the results. This

probability gives uncertainty estimates, or conversely confidence, of the predictions. We note

that one of the biggest downsides of a GP is that it scales poorly, which is another reason why

applying a nonlinear dimensionality reduction routine like UMAP beforehand can transform this

problem into a computationally tractable one.
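For reference, the role of Bayes' rule in a GP classifier can be written out in the standard textbook form for the binary case. This is a generic sketch and not the specific kernel or inference scheme used in this work; the symbols $k$, $\sigma$, $\mathbf{f}$, and $\pi_*$ are introduced here for illustration only, and the multi-class schemes in this study would use a generalization of it.

```latex
% Prior: latent function values at the training inputs X
f \sim \mathcal{GP}\bigl(0,\, k(\mathbf{x}, \mathbf{x}')\bigr)

% Likelihood: a sigmoid-type link \sigma maps each latent value to a class probability
p(y_i = +1 \mid f_i) = \sigma(f_i)

% Bayes' rule: posterior over the latent values given the labeled training data
p(\mathbf{f} \mid X, \mathbf{y}) =
  \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)}

% Predictive class probability ("confidence") for a new input x_*
\pi_* = p(y_* = +1 \mid \mathbf{x}_*, X, \mathbf{y})
      = \int \sigma(f_*)\, p(f_* \mid \mathbf{x}_*, X, \mathbf{y})\, \mathrm{d}f_*
```

The predictive probability $\pi_*$ is the quantity interpreted as a confidence. In this standard formulation the dominant cost comes from factorizing the $n \times n$ kernel matrix over the $n$ training points, which scales as $O(n^3)$, while a lower-dimensional input mainly reduces the cost of each kernel evaluation.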

The results of training a GP on each of the five classification schemes (see Table S1) we

developed – coordination, number of oxygen ligands, phosphate subcluster, number of sulfur

ligands, and number of hydroxyl ligands – are shown in Figure 7, with the average accuracy

score on the test set as well as the probability of that prediction, i.e., the confidence score,

shown. There is a clear correlation between the average accuracy and confidence, indicating that

the GP is, in fact, properly modeling uncertainty of predictions.
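The two quantities compared here can be computed directly from the class probabilities returned by any probabilistic classifier. The sketch below assumes that the confidence of a prediction is the probability assigned to the predicted (most probable) class, averaged over the test set; that reading, the function name, and the toy probabilities are assumptions of this example rather than details taken from the models in this work.

```python
def accuracy_and_confidence(probabilities, true_labels, classes):
    """Return (mean accuracy, mean probability of the predicted class).

    `probabilities[i][k]` is the predicted probability that test sample i
    belongs to `classes[k]`.
    """
    if len(probabilities) != len(true_labels) or not true_labels:
        raise ValueError("need one probability row per test label")

    predicted, confidences = [], []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # most probable class
        predicted.append(classes[best])
        confidences.append(row[best])

    n_correct = sum(p == t for p, t in zip(predicted, true_labels))
    accuracy = n_correct / len(true_labels)
    confidence = sum(confidences) / len(confidences)
    return accuracy, confidence


# Made-up predictions for four test spectra in a two-class scheme.
classes = ["tricoordinate", "tetracoordinate"]
probabilities = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
    [0.55, 0.45],
]
true_labels = ["tricoordinate", "tetracoordinate", "tetracoordinate", "tricoordinate"]

accuracy, confidence = accuracy_and_confidence(probabilities, true_labels, classes)
```

When a model's uncertainty is well behaved, schemes with lower accuracy should also show lower average confidence, which is the trend examined in Figure 7.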

Figure 7 Gaussian Process Classifier prediction accuracies with corresponding average

probability (“confidence”) for all chemically driven and cluster-driven classification schemes.

Finally, the accuracies and confidence of each prediction across the VtC-XES and

XANES data match what we observed in our two-dimensional UMAP figures. This is clearly

demonstrated in the hydroxyl ligand and phosphate subcluster classification schemes, where the

XANES and VtC-XES, respectively, poorly cluster by these schemes, and the low corresponding

GP confidence reflects this. Overall, these results further validate that visualizing data via a

dimensionality reduction algorithm like UMAP correlates to extractable information content and

can properly inform classes to be used for supervised ML.

By utilizing UMAP and analyzing the resulting clustering in a two-dimensional

embedding of VtC-XES and XANES spectra of an ensemble of organophosphorus compounds,

we noticed sensitivity to coordination and ligand identity (specifically by distinguishing number

of oxygen ligands, sulfur ligands, and hydroxyl groups). Additionally, the XANES was clearly

more sensitive to phosphate sub-groupings (which resulted from an unexpected, unintuitive

fingerprint). Taken together, these results culminate in a valuable analysis framework: (1)

applying nonlinear dimensionality reduction routines and cluster analysis to check for both

heuristic chemical sensitivities and emergent ones present in spectra, (2) applying dimensionality

reduction methods like UMAP before querying supervised ML models, and (3) utilizing models

that incorporate prior knowledge, such as a Gaussian Process, to estimate uncertainty or

confidence of these predictions on the clustering-informed classes. Furthermore, this framework,

visualized in Figure 1, is broadly applicable – it can easily be expanded to both other systems

and other one-dimensional spectroscopies – providing a way to validate predictions instead of

relying solely on the initial construction of an appropriate training dataset.

ASSOCIATED CONTENT

Supporting Information.

The following files are available free of charge:

Computational Methods (docx)

Figure S1 Class averages of spectra with different coordination (png)

Figure S2 Scree plot of VtC-XES and XANES data (png)

Figure S3 PCA reconstruction of VtC-XES spectra (png)

Figure S4 PCA reconstruction of XANES spectra (png)

Figure S5 UMAP representation of XANES with H atom substitutions (png)

Figure S6 Phosphate sub-cluster I example spectra (png)

Figure S7 Phosphate sub-cluster II example spectra (png)

Figure S8 Phosphate sub-cluster III example spectra (png)

Figure S9 Phosphate sub-cluster IV example spectra (png)

Figure S10 Phosphate sub-cluster I structures (png)

Figure S11 Phosphate sub-cluster II structures (png)

Figure S12 Phosphate sub-cluster III structures (png)

Figure S13 Phosphate sub-cluster IV structures (png)

Figure S14 Phosphate sub-clusters correlation (png)

Figure S15 3D UMAP visualizations (png)

Table S1 Classification table (docx)

AUTHOR INFORMATION

The authors declare no competing financial interests.

ACKNOWLEDGMENT

ST acknowledges funding from NRT-DESE: Data Intensive Research Enabling Clean

Technologies (DIRECT) under grant no. NSF #1633216 and acknowledges funding from NSF

CHE-1904437. VK acknowledges support from the Washington NASA Space Grant from the

Washington NASA Space Grant Consortium (WSGC). NG acknowledges support from the US

Department of Energy, Office of Science, Office of Basic Energy Sciences, Chemical Sciences,

Geosciences and Biosciences under Award No. KC-030105172685. AV acknowledges support

from the Research Corporation for Science Advancement through a Cottrell Scholars Award.

This research benefited from computational resources provided by the Environmental Molecular

Sciences Laboratory (EMSL), a DOE Office of Science User Facility sponsored by the Office of

Biological and Environmental Research and located at PNNL. PNNL is operated by Battelle

Memorial Institute for the United States Department of Energy under DOE Contract No. DE-

AC05-76RL1830. Additionally, this work was facilitated through the use of advanced

computational, storage, and networking infrastructure provided by the Hyak supercomputer

system and funded by the STF at the University of Washington.

REFERENCES

1. Butler, K. T.; Davies, D. W.; Cartwright, H.; Isayev, O.; Walsh, A., Machine learning

for molecular and materials science. Nature 2018, 559 (7715), 547-555.

2. Zhou, Z. Q.; He, Q. F.; Liu, X. D.; Wang, Q.; Luan, J. H.; Liu, C. T.; Yang, Y.,

Rational design of chemically complex metallic glasses by hybrid modeling guided machine

learning. npj Computational Materials 2021, 7 (1), 138.

3. Liu, Y.; Zhao, T. L.; Ju, W. W.; Shi, S. Q., Materials discovery and design using

machine learning. Journal of Materiomics 2017, 3 (3), 159-177.

4. Liu, Y.; Guo, B. R.; Zou, X. X.; Li, Y. J.; Shi, S. Q., Machine learning assisted

materials design and discovery for rechargeable batteries. Energy Storage Materials 2020, 31,

434-450.

5. Saal, J. E.; Kirklin, S.; Aykol, M.; Meredig, B.; Wolverton, C., Materials Design and

Discovery with High-Throughput Density Functional Theory: The Open Quantum Materials

Database (OQMD). JOM 2013, 65 (11), 1501-1509.

6. Meza Ramirez, C. A.; Greenop, M.; Ashton, L.; Rehman, I. u., Applications of machine

learning in spectroscopy. Applied Spectroscopy Reviews 2021, 56 (8-10), 733-763.

7. Gordon, D. F.; Desjardins, M., Evaluation and Selection of Biases in Machine Learning.

Machine Learning 1995, 20 (1-2), 5-22.

8. Wolpert, D. H.; Macready, W. G., No free lunch theorems for optimization. IEEE

Transactions on Evolutionary Computation 1997, 1 (1), 67-82.

9. Alelyani, S., Detection and Evaluation of Machine Learning Bias. Applied Sciences 2021,

11 (14).

10. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A., A Survey on Bias

and Fairness in Machine Learning. ACM Computing Surveys 2021, 54 (6).

11. Pot, M.; Kieusseyan, N.; Prainsack, B., Not all biases are bad: equitable and inequitable

biases in machine learning and radiology. Insights Into Imaging 2021, 12 (1).

12. Hiemstra, A. M. F.; Cassel, T.; Born, M. P.; Liem, C. C. S., The promises and perils of

machine learning algorithms to reduce bias and discrimination in personnel selection procedures.

Gedrag en Organisatie 2020, 33 (4), 279-299.

13. Timoshenko, J.; Lu, D. Y.; Lin, Y. W.; Frenkel, A. I., Supervised Machine-Learning-

Based Determination of Three-Dimensional Structure of Metallic Nanoparticles. Journal of

Physical Chemistry Letters 2017, 8 (20), 5091-5098.

14. Timoshenko, J.; Frenkel, A. I., "Inverting" X-ray Absorption Spectra of Catalysts by

Machine Learning in Search for Activity Descriptors. ACS Catalysis 2019, 9 (11), 10192-10211.

15. Zheng, C.; Chen, C.; Chen, Y.; Ong, S. P., Random Forest Models for Accurate

Identification of Coordination Environments from X-Ray Absorption Near-Edge Structure.

Patterns 2020, 1 (2), 100013.

16. Zheng, C.; Mathew, K.; Chen, C.; Chen, Y. M.; Tang, H. M.; Dozier, A.; Kas, J. J.;

Vila, F. D.; Rehr, J. J.; Piper, L. F. J.; Persson, K. A.; Ong, S. P., Automated generation and

ensemble-learned matching of X-ray absorption spectra. npj Computational Materials 2018, 4,

12.

17. Kiyohara, S.; Miyata, T.; Tsuda, K.; Mizoguchi, T., Data-driven approach for the

prediction and interpretation of core-electron loss spectroscopy. Scientific Reports 2018, 8 (1),

13548.

18. Routh, P. K.; Liu, Y.; Marcella, N.; Kozinsky, B.; Frenkel, A. I., Latent Representation

Learning for Structural Characterization of Catalysts. The Journal of Physical Chemistry Letters

2021, 12 (8), 2086-2094.

19. Aarva, A.; Deringer, V. L.; Sainio, S.; Laurila, T.; Caro, M. A., Understanding X-ray

Spectroscopy of Carbonaceous Materials by Combining Experiments, Density Functional

Theory, and Machine Learning. Part I: Fingerprint Spectra. Chemistry of Materials 2019, 31

(22), 9243-9255.

20. Carbone, M. R.; Yoo, S.; Topsakal, M.; Lu, D., Classification of local chemical

environments from x-ray absorption spectra using supervised machine learning. Physical Review

Materials 2019, 3 (3), 033604.

21. Carbone, M. R.; Topsakal, M.; Lu, D.; Yoo, S., Machine-Learning X-Ray Absorption

Spectra to Quantitative Accuracy. Physical Review Letters 2020, 124 (15), 156401(6).

22. Liu, Y.; Marcella, N.; Timoshenko, J.; Halder, A.; Yang, B.; Kolipaka, L.; Pellin, M.

J.; Seifert, S.; Vajda, S.; Liu, P.; Frenkel, A. I., Mapping XANES spectra on structural

descriptors of copper oxide clusters using supervised machine learning. The Journal of Chemical

Physics 2019, 151 (16), 164201.

23. Martini, A.; Guda, S. A.; Guda, A. A.; Smolentsev, G.; Algasov, A.; Usoltsev, O.;

Soldatov, M. A.; Bugaev, A.; Rusalev, Y.; Lamberti, C.; Soldatov, A. V., PyFitit: The software

for quantitative analysis of XANES spectra using machine-learning algorithms. Computer

Physics Communications 2020, 250, 107064.

24. Miyazato, I.; Takahashi, L.; Takahashi, K., Automatic oxidation threshold recognition of

XAFS data using supervised machine learning. Molecular Systems Design & Engineering 2019,

4 (5), 1014-1018.

25. Guda, A. A.; Guda, S. A.; Martini, A.; Kravtsova, A. N.; Algasov, A.; Bugaev, A.;

Kubrin, S. P.; Guda, L. V.; Šot, P.; van Bokhoven, J. A.; Copéret, C.; Soldatov, A. V.,

Understanding X-ray absorption spectra by means of descriptors and machine learning

algorithms. npj Computational Materials 2021, 7 (1), 203.

26. Fang, Z.; Hu, W.; Wang, M.; Wang, R.; Zhong, S.; Chen, S., X-ray absorption

spectroscopy combined with machine learning for diagnosis of schistosomiasis cirrhosis.

Biomedical Signal Processing and Control 2020, 60, 101944.

27. Torrisi, S. B.; Carbone, M. R.; Rohr, B. A.; Montoya, J. H.; Ha, Y.; Yano, J.; Suram,

S. K.; Hung, L., Random forest machine learning models for interpretable X-ray absorption near-

edge structure spectrum-property relationships. npj Computational Materials 2020, 6 (1), 109.

28. Trejo, O.; Dadlani, A. L.; De La Paz, F.; Acharya, S.; Kravec, R.; Nordlund, D.;

Sarangi, R.; Prinz, F. B.; Torgersen, J.; Dasgupta, N. P., Elucidating the Evolving Atomic

Structure in Atomic Layer Deposition Reactions with in Situ XANES and Machine Learning.

Chemistry of Materials 2019, 31 (21), 8937-8947.

29. Rankine, C. D.; Madkhali, M. M. M.; Penfold, T. J., A Deep Neural Network for the

Rapid Prediction of X-ray Absorption Spectra. The Journal of Physical Chemistry A 2020, 124

(21), 4263-4270.

30. Rankine, C. D.; Penfold, T. J., Progress in the Theory of X-ray Spectroscopy: From

Quantum Chemistry to Machine Learning and Ultrafast Dynamics. The Journal of Physical

Chemistry A 2021, 125 (20), 4276-4293.

31. Kiyohara, S.; Tsubaki, M.; Mizoguchi, T., Learning excited states from ground states by

using an artificial neural network. npj Computational Materials 2020, 6 (1), 68.

32. Cuisinier, M.; Cabelguen, P.-E.; Evers, S.; He, G.; Kolbeck, M.; Garsuch, A.; Bolin,

T.; Balasubramanian, M.; Nazar, L. F., Sulfur Speciation in Li–S Batteries Determined by

Operando X-ray Absorption Spectroscopy. The Journal of Physical Chemistry Letters 2013, 4

(19), 3227-3232.

33. Asakura, D.; Hosono, E.; Niwa, H.; Kiuchi, H.; Miyawaki, J.; Nanba, Y.; Okubo, M.;

Matsuda, H.; Zhou, H.; Oshima, M.; Harada, Y., Operando soft x-ray emission spectroscopy of

LiMn2O4 thin film involving Li–ion extraction/insertion reaction. Electrochemistry

Communications 2015, 50, 93-96.

34. Zhou, Y.; Doronkin, D. E.; Zhao, Z.; Plessow, P. N.; Jelic, J.; Detlefs, B.;

Pruessmann, T.; Studt, F.; Grunwaldt, J.-D., Photothermal Catalysis over Nonplasmonic

Pt/TiO2 Studied by Operando HERFD-XANES, Resonant XES, and DRIFTS. ACS Catalysis

2018, 8 (12), 11398-11406.

35. Maiuri, M.; Garavelli, M.; Cerullo, G., Ultrafast Spectroscopy: State of the Art and Open

Challenges. Journal of the American Chemical Society 2020, 142 (1), 3-15.

36. Bunker, G., Introduction to XAFS: A Practical Guide to X-ray Absorption Fine Structure

Spectroscopy. Cambridge University Press: Cambridge, 2010.

37. Glatzel, P.; Bergmann, U., High resolution 1s core hole X-ray spectroscopy in 3d

transition metal complexes—electronic and structural information. Coordination Chemistry

Reviews 2005, 249 (1), 65-95.

38. de Groot, F., High-Resolution X-ray Emission and X-ray Absorption Spectroscopy.

Chemical Reviews 2001, 101 (6), 1779-1808.

39. Seidler, G. T.; Mortensen, D. R.; Remesnik, A. J.; Pacold, J. I.; Ball, N. A.; Barry, N.;

Styczinski, M.; Hoidn, O. R., A laboratory-based hard x-ray monochromator for high-resolution

x-ray emission spectroscopy and x-ray absorption near edge structure measurements. Review of

Scientific Instruments 2014, 85 (11), 113906.

40. Malzer, W.; Schlesiger, C.; Kanngießer, B., A century of laboratory X-ray absorption

spectroscopy – A review and an optimistic outlook. Spectrochimica Acta Part B: Atomic

Spectroscopy 2021, 177, 106101.

41. Zimmermann, P.; Peredkov, S.; Abdala, P. M.; DeBeer, S.; Tromp, M.; Müller, C.;

van Bokhoven, J. A., Modern X-ray spectroscopy: XAS and XES in the laboratory. Coordination

Chemistry Reviews 2020, 423, 213466.

42. Holden, W. M.; Jahrman, E. P.; Govind, N.; Seidler, G. T., Probing Sulfur Chemical and

Electronic Structure with Experimental Observation and Quantitative Theoretical Prediction of

Kα and Valence-to-Core Kβ X-ray Emission Spectroscopy. The Journal of Physical Chemistry A

2020, 124 (26), 5415-5434.

43. Tetef, S.; Govind, N.; Seidler, G. T., Unsupervised machine learning for unbiased

chemical classification in X-ray absorption spectroscopy and X-ray emission spectroscopy.

Physical Chemistry Chemical Physics 2021, 23 (41), 23586-23601.

44. Ceriotti, M., Unsupervised machine learning in atomistic simulations, between

predictions and understanding. The Journal of Chemical Physics 2019, 150 (15), 150901.

45. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.;

Thiessen, P. A.; Yu, B.; Zaslavsky, L.; Zhang, J.; Bolton, E. E., PubChem 2019 update:

improved access to chemical data. Nucleic Acids Research 2019, 47 (D1).

46. McInnes, L.; Healy, J.; Melville, J., UMAP: Uniform Manifold Approximation and

Projection for Dimension Reduction. arXiv 2020, arXiv:1802.03426.

47. van der Maaten, L.; Hinton, G., Visualizing Data using t-SNE. Journal of Machine

Learning Research 2008, 9, 2579-2605.

48. Pont, F.; Tosolini, M.; Fournie, J. J., Single-Cell Signature Explorer for comprehensive

visualization of single cell signatures across scRNA-seq datasets. Nucleic Acids Research 2019,

47 (21).

49. Rovezzi, M.; Glatzel, P., Hard x-ray emission spectroscopy: a powerful tool for the

characterization of magnetic semiconductors. Semiconductor Science and Technology 2014, 29, 023002.

50. Murphy, L. R.; Meek, T. L.; Allred, A. L.; Allen, L. C., Evaluation and Test of Pauling's

Electronegativity Scale. The Journal of Physical Chemistry A 2000, 104 (24), 5867-5871.

51. Hahsler, M.; Piekenbrock, M.; Doran, D., dbscan: Fast Density-Based Clustering with R.

Journal of Statistical Software 2019, 91 (1), 1-30.

52. Shrestha, A.; Mahmood, A., Review of Deep Learning Algorithms and Architectures.

IEEE Access 2019, 7, 53040-53065.

53. Rasmussen, C. E.; Williams, C. K. I., Gaussian Processes for Machine Learning. The

MIT Press: 2006.
