Quantitative structure–property relationships prediction of some physico-chemical properties of glycerol based solvents

Quantitative structure–properties relationships (QSPR) models have been developed for three characteristic properties of a series of 62 new glycerol derivatives, relevant to solvent classification and substitution uses. Using structural descriptor variables, three equations have been found using multiple linear regression analysis, which can be applied for in silico prediction of physico-chemical properties, allowing a faster selection of target solvents for a given application.
García et al., Green Chem. 2013, 15, 2283-2293
José I. García,*a Héctor García-Marín,a José A. Mayoral,a,b and Pascual Pérezc
Introduction
Organic solvents are used in huge amounts in many industrial and everyday applications, but unfortunately most of them come from petroleum and are often labelled as toxic or hazardous substances. For this reason, substantial efforts are being made to develop more benign solvents from renewable sources. Our group has recently described a family of solvents based on glycerol,1 a co-product of biodiesel production. To facilitate the search for possible substitution applications, we have also determined a number of physico-chemical properties of these glycerol-derived solvents and compared them with those of conventional organic solvents. Many of these properties are difficult to measure, so efficient quantitative structure-property relationship (QSPR) equations would be of great interest to accelerate the search for the best solvent for a given application. The concept rests on the close relationship that exists between the bulk properties and the molecular structure of a series of similar chemical compounds. In this context, solvent classification is a very interesting issue, which has traditionally been addressed from both microscopic (intermolecular interactions) and macroscopic (continuum medium) approaches. However, solvation processes are hard to parameterize, given that solvation energy (the only observable magnitude) is controlled by a large number of factors. For this reason, the classification of solvents, and especially that of neoteric solvents, is far from straightforward,2-7 and hence, during the last three decades of the 20th century, many efforts were devoted to classifying solvents using empirical parameters.8
Quantitative structure-property relationships are mathematical equations relating chemical structure to a wide variety of physical, chemical, and biological properties; in our case, solvent properties. QSPR models, once established, can be used to predict properties of compounds as yet unmeasured or even unknown.9-13 There are many reports on applications of QSPR in connection with solvents, covering physico-chemical properties in alkane series,14 optical properties of organic compounds,15 thermophysical properties of some fluids,16 solubility of hazardous compounds,17 acidity constants of some acid derivatives,13 permeability of organic compounds in membranes,18 and important properties of room temperature ionic liquids (RTILs), such as toxicity.19-21 A major step in the development of QSPR models is finding a set of molecular descriptors able to represent the variation of the structural features of the molecules, and therefore a wide variety of descriptors have been reported for use in QSPR analysis.22-25 The molecular descriptors chosen (X) are correlated with one or more response variables (Y) using different statistical approaches. Many statistical procedures are available to establish those relationships, such as Partial Least Squares Analysis (PLSA), Multiple Linear Regression (MLR), Artificial Neural Networks (ANN), or Principal Component Analysis (PCA); a good example of the use of QSPR in the classification of solvents through PCA can be found in the literature.26 MLR27 is probably the most widely used because it is simple and intuitive.
Scheme 1. General structure and codification of the glycerol-derived
solvents used in this study.
From an industrial application point of view, three main solvent features must be taken into account: 1) behaviour in dissolution processes, which can be well defined through the solvatochromic parameter $E_T^N$ (see below),28-29 2) mechanical aspects, which can be quantified by viscosity, and 3) volatility aspects, closely related to safety, toxicity and air pollution, which can be considered through the boiling point.
[Scheme 1: glycerol skeleton bearing OR1, OR2 and OR3 groups. R groups and their codes: H = 0; Me = 1; Et = 2; iPr = 3i; nBu = 4; iBu = 4i; tBu = 4t; CF3CH2 = 3F; CF3CF2CH2 = 5F; CF3(CF2)2CH2 = 7F.]
Table 1. List of properties in the 62 glycerol solvents studied in the present work.

| Solvent | Code | $E_T^N$ | Visc. (cP) |
|---|---|---|---|
| 1,2,3-propanetriol | 000 | 0.812 | 93430 |
| 3-methoxy-1,2-propanediol | 100 | 0.710 | 37.72 |
| 3-ethoxy-1,2-propanediol | 200 | 0.690 | 35.14 |
| 3-n-butoxy-1,2-propanediol | 400 | 0.680 | 42.03 |
| 1,3-dimethoxy-2-propanol | 101 | 0.610 | 3.46 |
| 1-isopropoxy-3-methoxy-2-propanol | 103i | 0.490 | 3.38 |
| 1-n-butoxy-3-methoxy-2-propanol | 104 | 0.480 | |
| 1-isobutoxy-3-methoxy-2-propanol | 104i | 0.490 | |
| 1-tert-butoxy-3-methoxy-2-propanol | 104t | 0.440 | |
| 1-n-butoxy-3-isopropoxy-2-propanol | 403i | 0.470 | 4.59 |
| 1,3-di-n-butoxy-2-propanol | 404 | 0.450 | 5.53 |
| 3-n-butoxy-1-tert-butoxy-2-propanol | 404t | 0.390 | |
| 3-n-butoxy-1-isobutoxy-2-propanol | 404i | 0.460 | |
| 1-ethoxy-3-isopropoxy-2-propanol | 203i | 0.450 | |
| 1-n-butoxy-3-ethoxy-2-propanol | 204 | 0.450 | |
| 1-tert-butoxy-3-ethoxy-2-propanol | 204t | 0.410 | |
| 1-isobutoxy-3-ethoxy-2-propanol | 204i | 0.460 | |
| 1,3-diisopropoxy-2-propanol | 3i03i | 0.460 | |
| 1-tert-butoxy-3-isopropoxy-2-propanol | 3i04t | 0.370 | |
| 1-isobutoxy-3-isopropoxy-2-propanol | 3i04i | 0.440 | |
| 1-isopropoxy-3-(2,2,2-trifluoroethoxy)-2-propanol | 3i03F | 0.590 | 6.90 |
| 1-n-butoxy-3-(2,2,2-trifluoroethoxy)-2-propanol | 403F | 0.600 | |
| 1-tert-butoxy-3-(2,2,2-trifluoroethoxy)-2-propanol | 4t03F | 0.570 | 8.61 |
| 1-isobutoxy-3-(2,2,2-trifluoroethoxy)-2-propanol | 4i03F | 0.600 | |
| 1,3-bis(2,2,2-trifluoroethoxy)-2-propanol | 3F03F | 0.700 | 8.14 |
| 1,3-bis(2,2,3,3,3-pentafluoropropoxy)-2-propanol | 5F05F | 0.699 | |
| 1,3-bis(2,2,3,3,4,4,4-heptafluorobutoxy)-2-propanol | 7F07F | 0.685 | 19.60 |
| 1,2,3-trimethoxypropane | 111 | | |
| 1-isopropoxy-2,3-dimethoxypropane | 113i | | 1.03 |
| 2-n-butoxy-3-methoxy-1-isopropoxypropane | 143i | 0.145 | |
| 1-tert-butoxy-2,3-dimethoxypropane | 114t | 0.214 | |
| 2-n-butoxy-1-tert-butoxy-3-methoxypropane | 144t | | |
| 1-n-butoxy-2,3-dimethoxypropane | 114 | 0.178 | |
| 1,2-di-n-butoxy-3-methoxypropane | 144 | | |
| 1-isobutoxy-2,3-dimethoxypropane | 114i | | |
| 2-n-butoxy-1-isobutoxy-3-methoxypropane | 144i | | |
| 2-ethoxy-3-methoxy-1-isopropoxypropane | 123i | 0.167 | |
| 3-ethoxy-2-methoxy-1-isopropoxypropane | 213i | 0.171 | |
| 1-tert-butoxy-2-ethoxy-3-methoxypropane | 124t | | |
| 1-tert-butoxy-3-ethoxy-2-methoxypropane | 214t | 0.150 | |
| 1-n-butoxy-2-ethoxy-3-methoxypropane | 124 | 0.155 | |
| 1-n-butoxy-3-ethoxy-2-methoxypropane | 214 | 0.164 | |
| 1-isobutoxy-2-ethoxy-3-methoxypropane | 124i | | |
| 1-isobutoxy-3-ethoxy-2-methoxypropane | 214i | 0.161 | |
| 2,3-diethoxy-1-isopropoxypropane | 223i | | |
| 1-tert-butoxy-2,3-diethoxypropane | 224t | 0.155 | |
| 1-n-butoxy-2,3-diethoxypropane | 224 | 0.161 | |
| 1-isobutoxy-2,3-diethoxypropane | 224i | | |
| 1-n-butoxy-2-methoxy-3-isopropoxypropane | 413i | 0.155 | 1.67 |
| 1-n-butoxy-2-ethoxy-3-isopropoxypropane | 423i | | |
| 3-n-butoxy-1-tert-butoxy-2-methoxypropane | 414t | 0.141 | |
| 3-n-butoxy-1-tert-butoxy-2-ethoxypropane | 424t | | |
| 1,3-di-n-butoxy-2-methoxypropane | 414 | 0.145 | 3.78 |
| 3-n-butoxy-1-isobutoxy-2-methoxypropane | 414i | 0.150 | |
| 3-n-butoxy-1-isobutoxy-2-ethoxypropane | 424i | | |
| 3-isopropoxy-2-methoxy-1-(2,2,2-trifluoroethoxy)propane | 3i13F | | |
| 3-tert-butoxy-2-methoxy-1-(2,2,2-trifluoroethoxy)propane | 4t13F | 0.373 | 2.14 |
| 1,2,3-tri-n-butoxypropane | 444 | | 2.72 |
| 3-n-butoxy-2-methoxy-1-(2,2,2-trifluoroethoxy)propane | 413F | | |
| 2-methoxy-1,3-bis(2,2,2-trifluoroethoxy)propane | 3F13F | 0.553 | 2.33 |
| 2-ethoxy-1,3-bis(2,2,2-trifluoroethoxy)propane | 3F23F | 0.595 | |
| 2-n-butoxy-1,3-bis(2,2,2-trifluoroethoxy)propane | 3F43F | 0.574 | |
Therefore, we selected for the present work 62 solvents based on glycerol, all of them prepared in our laboratory (Scheme 1 and Table 1),1 and the three above-mentioned properties, also determined by us, were analyzed for this solvent set using several QSPR models.
Results and discussion
Molecular structure definition
There are many ways of describing the structure of a chemical compound as a vector of numbers. In this work we have used two different approaches, based on molecular connectivity descriptors: topological parameters and DARC/PELCO descriptors.
Figure 1. DARC/PELCO scheme used to describe glycerol based
solvent structures.
Topological parameters are based on the molecular graph of each compound.25,31 They are easily determined from the connectivity and adjacency matrices of each compound. The number of connected components of a graph is a topological invariant that measures the number of structurally independent or disjoint subnetworks. These parameters are excellent descriptors of molecular size, shape and flexibility. They are global parameters in the sense that the whole molecular structure is condensed into a single number. The topological descriptors selected for QSPR studies in this work are: i) hydrogen bond acceptor count (HBA), ii) hydrogen bond donor count (HBD), iii) rotatable bond count (RB), iv) flexibility index (ϕ),32 v) Balaban index (Bal),33 vi) Wiener index (W),34 vii) Zagreb index (Z),35 viii) Kier shape index (κn),32 ix) subcount index (SC),36 and x) connectivity index (χ).25 Full definitions of the indices used in the statistical analyses are given in the Supporting Information.
DARC/PELCO (Description, Acquisition, Retrieval and Computer-aided design / Perturbation of an Environment which is Limited, Concentric and Ordered)37 is another excellent way to describe chemical structures, yet it is much less used in QSPR studies. This system is particularly suitable for studying families of compounds with a common chemical substructure. The DARC/PELCO method is based on the exhaustive generation of all topochromatic sites around the reference structure (F0), which corresponds to the glycerol skeleton common to all structures, and the evaluation of their contribution to the property. The DARC/PELCO descriptors are local, since each one indicates the presence or absence of a group of atoms in a given molecular position. Their definition is shown in Figure 1. In this definition we have incorporated the symmetry of the glycerol derivatives used, by assuming that the contributions of groups occupying equivalent positions (i.e. those linked to carbons 1 and 3 of the glycerol moiety) will display the same influence on the property under study. Preliminary studies have demonstrated that this simplification does not alter the results of the regression analyses.
[Figure 1 site labels: F0; A1, B1, C1, D1; A2, B2, C2, D2; BF2, CF2, DF2.]
Solvent Properties Selection
Solvent polarity ($E_T^N$)
Solvent polarity parameters have demonstrated their usefulness not only to classify organic solvents but also to explain solvent effects on very different physical and chemical processes. An excellent overview of solvent polarity parameters and their applications can be found in Reichardt's book.8 Although there are several procedures to quantify solvent polarity, solvatochromic measurement of probe dyes is undoubtedly the most successful methodology for an accurate determination of this solvent feature, owing to its ease of determination and high sensitivity to small polarity changes. From this point of view, the Dimroth and Reichardt ET(30) parameter28-29 is one of the most widely used parameters. ET(30) values represent a blend of dipolarity/polarizability and hydrogen bond donor solvation abilities of the solvent, the latter feature contributing to the total ET(30) value to a greater extent. $E_T^N$ is a normalized form of ET(30), taking the value 0 for hexadecane and 1 for water.
Viscosity.
Viscosity describes a fluid's internal resistance to flow and may be thought of as a measure of fluid friction. This property is particularly interesting from the viewpoint of possible large-scale industrial applications, where large solvent volumes have to be stirred and pumped from one place to another.
Boiling point
One major problem concerning the use of organic solvents is the presence of traces of these compounds in the air. Indeed, the most common volatile organic compounds (VOCs) are solvents. Nowadays a great effort is being made to solve this problem by substituting these volatile solvents with others that are less volatile or non-volatile. For this reason it is important not only to measure this property but also to predict it. Boiling point is a quick and easy way to estimate the volatility of a solvent, since in general a higher boiling point correlates with a lower volatility at ambient pressure and temperature.
Quantitative Structure Properties Relationships
Multiple Linear Regression (MLR) with topological indices.
It is often assumed that the relationship between structural parameters and experimental properties is well represented by a linear model:
$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \quad \text{or} \quad Y = X B \ \text{(in matrix form)} \qquad \text{Eq. (1)}$$
In Eq. (1) the $b_i$ are unknown coefficients, and the objective of regression analysis is to estimate their values. As QSPR data sets consist of variables that are diverse in range, variation and size, auto-scaling is usually applied prior to regression analysis, i.e., the ith column is mean-centred (by subtracting its mean $\bar{x}_i$) and scaled by $1/\mathrm{SD}(x_i)$, where SD is the standard deviation. When X is of full rank the least squares solution is $B = (X^T X)^{-1} X^T Y$, where B is the estimator vector for the regression coefficients. However, very often not all these coefficients are statistically significant, so the final QSPR model should only keep those descriptors that really contribute to the variation in the property observed. To this end we used a stepwise method for variable selection. In this way, independent variables $x_i$ enter and leave the regression equation, and only those having statistically significant coefficients are finally kept in the fitted model.
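The auto-scaling step described above can be sketched in a few lines of Python. This is only an illustration: the function name `autoscale` and the three-value descriptor column are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

def autoscale(column):
    """Mean-centre a descriptor column and scale it by 1/SD.

    Assumption: SD is the sample standard deviation (n - 1 denominator).
    """
    centre = mean(column)
    spread = stdev(column)
    return [(value - centre) / spread for value in column]

# Hypothetical descriptor column (e.g. a hydrogen bond donor count
# for three imaginary compounds), used only to show the transformation.
example_column = [2, 1, 0]
scaled_column = autoscale(example_column)
print(scaled_column)
```

After auto-scaling every column has zero mean and unit standard deviation, so regression coefficients of descriptors with very different numerical ranges become directly comparable.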
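The three regression equations obtained for the three experimental properties fitted are the following:
$$E_T^N = b_0 + b_1\,\mathrm{HBA} + b_2\,\mathrm{HBD} + b_3\,\mathrm{SC} \qquad \text{Eq. (2)}$$
$$\eta = b_0 + b_1\,\mathrm{HBD} \qquad \text{Eq. (3)}$$
$$\mathrm{bp} = b_0 + b_1\,\mathrm{RB} + b_2\,\mathrm{HBD} + b_3\,\mathrm{HBA} \qquad \text{Eq. (4)}$$
The corresponding coefficient values and MLR parameters are shown in Table 2.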
Table 2. Linear regression parameters from equations 2-4.a

| | $E_T^N$ | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.206 ± 0.035 | b | 111.1 ± 17.0 |
| b1 ± e1 | 0.073 ± 0.010 | 14.50 ± 3.56 | 11.8 ± 1.9 |
| b2 ± e2 | 0.194 ± 0.021 | | 24.7 ± 5.0 |
| b3 ± e3 | 0.019 ± 0.004 | | 3.2 ± 1.2 |
| N | 46 | 17 | 62 |
| R2 | 0.957 | 0.823 | 0.769 |
| σ(y) | 0.0437 | 7.51 | 12.2 |
| Fc | 72.39 | 74.57 | 64.52 |

a bi are the coefficients of each regression; ei is the tolerance of the bi value at the 95% confidence level. N is the number of cases (solvent data) used in each regression; R2 is the determination coefficient.
b As the b0 coefficient turned out to be non-significant in the standard MLR analysis, fitting was done by forcing the equation to pass through the origin of coordinates. A slight improvement in R2 was obtained in this way.
c F(3,43; 0.05) = 2.84; F(1,18; 0.05) = 4.41; F(3,59; 0.05) = 2.84. All equations are statistically significant at the 95% confidence level.
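As a worked illustration of Eq. (3) fitted through the origin, the slope of a one-descriptor regression without intercept is b1 = Σ(x·y)/Σ(x²). The sketch below applies this to the 17 viscosities of Table 1 (glycerol excluded). It assumes, as a simplification not stated in the text, that HBD equals the number of free hydroxyl groups (2 for the propanediols, 1 for the 2-propanols, 0 for the triethers); the names used are illustrative only. Under that assumption the slopes agree with the b1 values tabulated for the full and training sets to two decimals.

```python
# Viscosities (cP) from Table 1, grouped by an ASSUMED HBD count
# (number of free OH groups). Glycerol (000) is left out.
viscosity_by_hbd = {
    2: {"100": 37.72, "200": 35.14, "400": 42.03},
    1: {"101": 3.46, "103i": 3.38, "403i": 4.59, "404": 5.53,
        "3i03F": 6.90, "4t03F": 8.61, "3F03F": 8.14, "7F07F": 19.60},
    0: {"113i": 1.03, "413i": 1.67, "414": 3.78, "4t13F": 2.14,
        "444": 2.72, "3F13F": 2.33},
}

# Solvent codes of the test set (Table 3).
test_set = {"200", "104", "3i03F", "5F05F", "113i", "414t", "4t13F", "3F23F"}

def slope_through_origin(exclude=frozenset()):
    """Least-squares slope of eta = b1 * HBD (no intercept)."""
    sum_xy = 0.0
    sum_xx = 0.0
    n_cases = 0
    for hbd, solvents in viscosity_by_hbd.items():
        for code, eta in solvents.items():
            if code in exclude:
                continue
            sum_xy += hbd * eta
            sum_xx += hbd * hbd
            n_cases += 1
    return sum_xy / sum_xx, n_cases

b1_full, n_full = slope_through_origin()
b1_train, n_train = slope_through_origin(exclude=test_set)
print(f"full set:     N = {n_full}, b1 = {b1_full:.2f}")
print(f"training set: N = {n_train}, b1 = {b1_train:.2f}")
```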
As can be seen, the hydrogen-bonding ability of the solvent seems to be the most important feature in modelling the three properties under study. This result is consistent with the kind of intermolecular interactions involved. It is well known that $E_T^N$ values are dominated by the HBD ability of the solvent, due to the strong specific solvation established through hydrogen bonding with the phenolate oxygen of the betaine dye. Similarly, the strong solvent-solvent intermolecular hydrogen-bond interactions of most of the glycerol-derived solvents included in the study are at the origin of the viscosity values obtained, and hence of the importance of this coefficient in the MLR model. Finally, the same strong intermolecular interactions can be invoked to explain the high boiling points displayed by most of the solvents considered.
Figure 2 plots the experimental values vs. those calculated with the three MLR models. The dotted line represents the least squares fit between both sets of data.
Figure 2. Plots of predicted vs. experimental values of $E_T^N$ (a), viscosity (b), and boiling point (c), as calculated through MLR analysis using topological indices.
As can be seen, the best results are obtained in the case of the $E_T^N$ solvation parameter, which is consistent with the higher determination coefficient obtained in the MLR analysis. In the other two cases, although there is a clear correlation, as indicated by the grouping of points around the diagonal, the fit is not good enough to lead to a fully predictive model.
The robustness and predictive character of the method were tested by splitting the data into a training set and a test set, which was created by extracting eight solvents from the complete set, so the training set consists of 54 solvents. The solvents of the test set (Table 3) were selected bearing in mind the representativity of the whole set, and for all the properties the test set size is within the usually recommended percentage of 10-20% of total cases.
Table 3. Subgroup of eight solvents extracted from the total set of solvents in order to create the new 54-solvent training set.

| Solvent | $E_T^N$ | η (cP) | b.p. (ºC) |
|---|---|---|---|
| 200 | 0.690 | 35.140 | 221 |
| 104 | 0.480 | | 208 |
| 3i03F | 0.590 | 6.900 | 176 |
| 5F05F | 0.699 | | 204 |
| 113i | | 1.030 | 170 |
| 414t | 0.141 | | 234 |
| 4t13F | 0.373 | 2.140 | 185 |
| 3F23F | 0.595 | | 171 |
The three new regression equations obtained with the 54-solvent training set are summarized in Table 4.
Table 4. Linear regression parameters from equations 2-4 obtained with the training set of solvents.a

| | $E_T^N$ | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.196 ± 0.039 | b | 111.4 ± 17.6 |
| b1 ± e1 | 0.071 ± 0.012 | 14.19 ± 4.59 | 11.7 ± 2.0 |
| b2 ± e2 | 0.200 ± 0.024 | | 25.9 ± 5.4 |
| b3 ± e3 | 0.018 ± 0.005 | | 3.0 ± 1.4 |
| N | 39 | 13 | 54 |
| R2 | 0.953 | 0.791 | 0.782 |
| σ(y) | 0.0456 | 8.16 | 11.9 |
| Fc | 238.24 | 45.31 | 59.64 |

a bi are the coefficients of each regression; ei is the tolerance of the bi value at the 95% confidence level. N is the number of cases (solvent data) used in each regression; R2 is the determination coefficient.
b As the b0 coefficient turned out to be non-significant in the standard MLR analysis, fitting was done by forcing the equation to pass through the origin of coordinates. A slight improvement in R2 was obtained in this way.
c F(3,36; 0.05) = 2.88; F(1,12; 0.05) = 4.75; F(3,51; 0.05) = 2.79. All equations are statistically significant at the 95% confidence level.
As can be seen, the regression coefficients are in all cases very close to those calculated with the full set of solvents, which illustrates the robustness of the equations obtained. These new equations were used to predict the polarity, viscosity and boiling point of the solvents in the test set. As a measure of the goodness of the prediction we used the mean unsigned error (MUE). In the case of $E_T^N$ the MUE of the fitting of the training set was 0.028, whereas that of the predictions of the test set was 0.030, which represents less than 5% of the whole range of values (0.671). This points to a reasonable predictivity for the model developed. In the case of viscosity the corresponding MUE values for the training and test sets are 7.28 and 5.08, respectively, i.e. 18% of the whole range of values (41.0) in the worst case, which indicates the poorer predictivity of the corresponding equation, although it could still be used in a semi-quantitative way. Finally, in the case of the boiling points, the MUE values for the training and test sets are 8.6 and 10.9, respectively. Again, the error is only slightly higher in the case of the “pure predictions” (test set), representing less than 8% of the whole range of values (140.0), which would allow a reasonable degree of predictivity. A plot comparing the predicted and experimental values of the test set is given in Figure S1 of the Electronic Supplementary Information (ESI).
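The error measure used in this comparison can be written out explicitly. The following sketch defines the mean unsigned error as the average absolute difference between predicted and observed values, and then re-derives the quoted percentages from the MUE values and property ranges given in the text. The function and variable names are illustrative, and the two-point example data are hypothetical.

```python
def mean_unsigned_error(predicted, observed):
    """Average of |predicted - observed| over all cases."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

# Hypothetical two-point example, only to show the definition at work.
example_mue = mean_unsigned_error([1.0, 3.0], [2.0, 1.0])

# (training MUE, test MUE, range of experimental values), as quoted in the text.
reported = {
    "E_T^N": (0.028, 0.030, 0.671),
    "viscosity": (7.28, 5.08, 41.0),
    "boiling point": (8.6, 10.9, 140.0),
}

# Worst-case error expressed as a percentage of the property range.
worst_case_percent = {
    name: 100.0 * max(train, test) / span
    for name, (train, test, span) in reported.items()
}

for name, percent in worst_case_percent.items():
    print(f"{name}: {percent:.1f}% of the range")
```

Expressing the MUE relative to the range of each property is what allows the three models, which have very different units and magnitudes, to be compared on the same footing.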
Partial Least Squares (PLS) Regression with topological indices.
One problem when using topological indices is the high pair correlation existing between many of them, given that they often capture similar structural features of the target molecule. This can have undesirable consequences in MLR analyses, since the real significance of a variable cannot be ascertained if it is highly correlated with another one. For instance, when examining variable coefficients in Eq. 2 one should be aware that HBA and SC have a pair correlation coefficient as high as 0.828 (full pairwise correlation data are gathered in Table S3 in the ESI).
Table 5. PLS regression results obtained in the treatment of the
experimental solvent properties studied in this work.
bi
E!
! a
ηb
b.p.c
HBA
0.0141
0.238
8.9
HBD
0.1370
9.387
46.7
RB
0.0097
0.713
6.4
ϕ
0.0104
0.001
2.4
BalJX
0.1500
3.210
31.1
BalJY
0.0806
2.638
35.4
Wr
0.0000
0.003
0.0
Z
0.0007
0.005
0.2
κ!
!!.
0.0026
0.024
2.1
κ!
!!.
0.0093
0.004
2.0
κ!
!!.
0.0228
1.118
5.4
𝑆𝐶!
!
0.0030
0.013
2.3
𝑆𝐶!
!
0.0030
0.013
2.3
𝑆𝐶!
!
0.0022
0.025
1.7
𝑆𝐶!
!
0.0006
0.136
6.4
𝑆𝐶!
!"
0.0040
0.105
7.7
𝜒!
0.0041
0.000
1.1
𝜒!
0.0044
0.034
10.8
𝜒!
0.0100
0.106
5.0
𝜒!
!
0.0011
0.819
0.9
𝜒!
!"
0.0211
0.188
2.6
𝜒!
!.!.
0.0185
0.560
11.2
𝜒!
!.!.
0.0201
0.344
12.8
𝜒!
!.!.
0.0242
1.185
17.0
𝜒!
!,!.!.
0.0796
0.445
109.8
𝜒!
!",!.!.
0.0632
3.142
12.0
b0
0.9997
30.973
111.3
N
46
17
62
R2
0.969 (0.954)a
0.700 (0.535)
0.891 (0.770)
σ(y)
0.036
7.29
8.1
a PLS regression used 4 latent variables built from the 26 original ones.
b PLS regression used 3 latent variables built from the 26 original ones.
c PLS regression used 7 latent variables built from the 26 original ones.
d Values in parentheses correspond to full cross-validated analyses, i.e. each value is predicted by the equation obtained leaving that solvent out. The resulting fit is therefore more representative of the true predictive ability of the model.
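The leave-one-out procedure described in footnote d can be sketched as follows. For brevity the sketch does not use the PLS model; it swaps in the simple one-descriptor, through-origin viscosity model of Eq. (3), with the Table 1 viscosities (glycerol excluded) and the assumption that HBD is the number of free hydroxyl groups. All names are illustrative, and the numbers it prints are not meant to reproduce the cross-validated values reported for PLS.

```python
# ASSUMED HBD counts (free OH groups) and Table 1 viscosities (cP),
# in the order: 100, 200, 400, 101, 103i, 403i, 404, 3i03F, 4t03F,
# 3F03F, 7F07F, 113i, 413i, 414, 4t13F, 444, 3F13F.
hbd = [2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
eta = [37.72, 35.14, 42.03, 3.46, 3.38, 4.59, 5.53, 6.90, 8.61,
       8.14, 19.60, 1.03, 1.67, 3.78, 2.14, 2.72, 2.33]

def fit_slope(xs, ys):
    """Least-squares slope of y = b1 * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mean_unsigned_error(predicted, observed):
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

# Ordinary fit: every solvent contributes to the slope used to predict it.
slope_all = fit_slope(hbd, eta)
fitted = [slope_all * x for x in hbd]

# Leave-one-out: each solvent is predicted from a model fitted without it.
cross_validated = []
for i in range(len(hbd)):
    xs = hbd[:i] + hbd[i + 1:]
    ys = eta[:i] + eta[i + 1:]
    cross_validated.append(fit_slope(xs, ys) * hbd[i])

mue_fitted = mean_unsigned_error(fitted, eta)
mue_loo = mean_unsigned_error(cross_validated, eta)
print(f"MUE fitted: {mue_fitted:.2f} cP, MUE leave-one-out: {mue_loo:.2f} cP")
```

For a least-squares model the leave-one-out error of each point is never smaller than its fitted error, which is why cross-validated statistics give a more conservative picture of predictive ability.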
A possible solution to this problem is to transform the original variables into a new set of a few orthogonal (uncorrelated) variables, gathering most of the total variance of the data. In the case of PLS regression,38,39 both the dependent (y) and the independent (x) variables are projected into a new space, trying to maximize the explanation of the variance of y through the variance of the latent variables x. Once this relationship is found, the PLS coefficients are projected back to the original x-space to obtain the corresponding regression coefficients.
Figure 3. Plots of predicted vs. experimental values of $E_T^N$ (a), viscosity (b), and boiling point (c), as calculated through PLS analysis using topological indices.
When the PLS regression technique was applied to our problem, slightly better models were obtained for two of the three properties considered. The corresponding coefficients and PLS parameters are shown in Table 5, the most important coefficients corresponding again to the hydrogen-bonding indices. Plots of predicted vs. experimental values of the properties are displayed in Figure 3. As can be seen in these plots, in the case of $E_T^N$ the PLS model fits the values of most of the 62 solvents used in the analysis very well. The MUE of the fitted values is 0.028, identical to that obtained in the previous MLR analysis. The full cross-validated predictions (i.e., those performed by leaving the predicted point out of the PLS calculation of the coefficients) are close to normal predictions in all but one case (7F07F), which points to the robustness of the model and the reliability of the predictions. The MUE in this case is only slightly higher, 0.034. On the other hand, viscosity behaves poorly in the PLS analysis, with a determination coefficient (R2) even lower than that found in the MLR analysis. Again, hydrogen bond donor ability and the κ3 shape index are the topological variables with the highest coefficients. However, the MUE of the fitted values is 5.86, and that of the cross-validated values increases to 7.88, values which are not far from those obtained in the MLR analyses, although they seem to be too high to allow reliable quantitative predictions.
Table 6. PLS regression results obtained in the treatment of the training
set of solvents.
bi
E!
! a
ηb
b.p.c
HBA
0.0145
0.142
4.181
HBD
0.1390
10.982
38.876
RB
0.0097
0.717
10.645
ϕ
0.0096
0.047
1.938
BalJX
0.1680
2.110
31.450
BalJY
0.0990
2.036
29.459
Wr
0.0000
0.002
0.006
Z
0.0006
0.003
0.226
κ!
!!.
0.0022
-0.019
0.010
κ!
!!.
0.0088
0.047
2.044
κ!
!!.
0.0213
1.290
1.801
𝑆𝐶!
!
0.0027
0.011
0.155
𝑆𝐶!
!
0.0027
0.011
0.155
𝑆𝐶!
!
0.0022
0.019
1.142
𝑆𝐶!
!
0.0006
0.125
3.202
𝑆𝐶!
!"
0.0042
0.079
1.859
𝜒!
0.0038
0.001
0.943
𝜒!
0.0034
0.017
3.560
𝜒!
0.0101
0.128
4.585
𝜒!
!
0.0008
0.757
1.390
𝜒!
!"
0.0219
0.268
7.733
𝜒!
!.!.
0.0180
0.441
8.287
𝜒!
!.!.
0.0184
0.186
7.186
𝜒!
!.!.
0.0198
0.864
0.827
𝜒!
!,!.!.
0.0699
0.959
69.209
𝜒!
!".,!.!.
0.0519
3.170
19.237
b0
1.0874
22.028
56.840
N
39
13
54
R2
0.967
0.668
0.874
σ(y)
0.036
7.55
8.7
a PLS regression used 4 latent variables built from the 26 original ones.
b PLS regression used 3 latent variables built from the 26 original ones.
c PLS regression used 8 latent variables built from the 26 original ones.
Finally, the fitting of boiling points is slightly better with the PLS approach (higher R2 and lower σ(y)), and the resulting model is quite robust, with only three outliers: glycerol itself (000), 444 and 7F07F. In this case, the MUE values are 6.2 (fitted values) and 8.4 (cross-validated values), slightly better than those found in the MLR analyses.
To obtain more reliable proof of the predictive ability of these equations, we again split the data into the same training and test sets used in the MLR analyses. The results of the corresponding regressions are shown in Table 6. Plots of experimental vs. predicted values (including solvents in the test set) are shown in Figure S2 (ESI).
As can be seen from the values in Table 6, both the goodness of the fitting and the regression coefficients obtained with the training set of solvents are quite similar to those calculated with the full set.
Concerning the prediction errors, the MUE values for $E_T^N$ are 0.027 for the training set (almost identical to that calculated with the full set of solvents) and 0.030 for the test set, which points to a good predictivity of the equations developed. Concerning the viscosity, the corresponding MUE values are 6.33 and 4.62 for the training and test sets, respectively, which are also quite close to that obtained with the full set of solvents (5.86) and point to a worse predictivity of this property by the model developed. Finally, the MUE values for the prediction of boiling points are 6.6 (training set) and 10.3 (test set). Even if the latter is clearly higher, it still represents about 7% of the full range of b.p. values, which may be enough to obtain a reasonable predictivity of this solvent property.
Multiple Linear Regression (MLR) with DARC/PELCO descriptors.
In this case we again used the stepwise method to include in the regression equation only those variables which are statistically significant. It should be noted that for predictive purposes, given the local character of the DARC/PELCO descriptors, the coefficients of all the variables not included in the final equations must be taken as zero. The three MLR equations thus obtained are the following:
E!
!=b!+b!·A!+b!·B!" +b!·A!+b!·B!+b!·C!" +b!·C!
Eq. (5)
η=b!+b!·A!+b!·C!" +b!·A! Eq. (6)
bp =b!+b!·D!+b!·A!+b!·C!+b!·C!+b!·B2 +b!·C!" +
b!·B!" +b!·A! Eq. (7) 70
Table 7. Linear regression parameters from equations 5-7.

| bi | $E_T^N$ | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.851 ± 0.057 | 70.79 ± 5.45 | 278.2 ± 10.6 |
| b1 ± e1 | 0.278 ± 0.023 | 32.50 ± 3.10 | 19.1 ± 3.60 |
| b2 ± e2 | 0.140 ± 0.024 | 6.90 ± 2.40 | 55.6 ± 6.68 |
| b3 ± e3 | 0.160 ± 0.038 | 3.52 ± 2.50 | 33.6 ± 6.32 |
| b4 ± e4 | 0.026 ± 0.012 | | 12.6 ± 2.49 |
| b5 ± e5 | 0.059 ± 0.032 | | 7.9 ± 1.86 |
| b6 ± e6 | 0.016 ± 0.014 | | 12.0 ± 5.84 |
| b7 ± e7 | | | 7.0 ± 3.98 |
| b8 ± e8 | | | 6.1 ± 3.97 |
| N | 46 | 17 | 62 |
| R2 | 0.972 | 0.981 | 0.933 |
| σ(y) | 0.036 | 2.08 | 6.9 |
| Fa | 229.23 | 228.29 | 92.18 |

a F(6,40; 0.05) = 2.34; F(3,14; 0.05) = 3.34; F(8,54; 0.05) = 2.18. All equations are statistically significant at the 95% confidence level.
The corresponding coefficient values and MLR parameters are shown in Table 7, and the plots of predicted vs. experimental values of the properties are displayed in Figure 4.
Figure 4. Plots of predicted vs. experimental values of $E_T^N$ (a), viscosity (b), and boiling point (c), as calculated through MLR analysis using DARC/PELCO descriptors.
As can be seen, the fitting of the three properties is better than those obtained with the preceding approaches. Even the viscosity displays good values. At first sight, this cannot be ascribed to overfitting, given that the final equation has only three independent variables to fit seventeen data points, i.e. more than five times as many data points as variables. Similarly, boiling point also displays a very good fit, with a low standard error (ca. 7 ºC).
The robustness of the method was tested again by removing the same test set of solvents (Table 3) from the entire data set and, as can be seen from the values gathered in Table 8, the regression coefficients in eqs. 5-7 do not change dramatically, all values lying within the calculated confidence margins.
Table 8. Linear regression factors from equations 5-7 using a reduced training set of 54 solvents.

| bi | $E_T^N$ | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.839 ± 0.059 | 74.13 ± 6.81 | 281.3 ± 11.0 |
| b1 ± e1 | 0.289 ± 0.025 | 34.26 ± 3.78 | 18.4 ± 3.5 |
| b2 ± e2 | 0.126 ± 0.026 | 6.99 ± 2.50 | 56.4 ± 6.8 |
| b3 ± e3 | 0.148 ± 0.039 | 2.99 ± 2.99 | 33.0 ± 6.1 |
| b4 ± e4 | 0.030 ± 0.013 | | 12.2 ± 2.4 |
| b5 ± e5 | 0.055 ± 0.041 | | 7.8 ± 1.9 |
| b6 ± e6 | 0.016 ± 0.014 | | 10.6 ± 7.5 |
| b7 ± e7 | | | 8.1 ± 4.3 |
| b8 ± e8 | | | 6.5 ± 4.1 |
| N | 39 | 13 | 54 |
| R2 | 0.975 | 0.983 | 0.940 |
| σ(y) | 0.035 | 2.05 | 6.6 |
| F | 206.40 | 174.86 | 87.83 |

F(6,33; 0.05) = 2.42; F(3,10; 0.05) = 3.71; F(8,46; 0.05) = 2.18. All equations are statistically significant at the 95% confidence level.
Figure S3 (in ESI) shows the predicted data for the eight members of the test group. It can be seen that the best predicted property is the boiling point, whose deviations from experiment are less than ten percent in the worst case. $E_T^N$ displays a more erratic behaviour, especially in the case of fluorinated compounds, for which deviations are important in relative terms, although they preserve the qualitative order experimentally observed. As expected, the largest deviations correspond to those structural features less represented in the training set (highly branched and highly fluorinated chains).
Concerning the MUE, in all cases the values for the fitted values using the solvent training set are lower than those obtained with the topological descriptors (0.024, 1.37 and 4.6 for $E_T^N$, viscosity and b.p., respectively), but these values are significantly higher for the test set (0.051, 2.05 and 10.3, respectively). In any case, these errors represent between 5% and 8% of the full range of values, which points to a reasonably good predictivity of these equations.
As already mentioned, DARC/PELCO descriptors are highly intuitive, given their straightforward matching with the molecular structure. As a consequence, the prediction of the property of a new compound is extremely simple. As an example we present the calculation of the boiling point of a glycerol-derived solvent not belonging to our 62-solvent set, namely 1,2,3-triethoxypropane (222). This compound and its boiling point were described in the literature, so the example represents a “real world” prediction, given that the property was determined by other authors using a different experimental technique. Table 9 gathers the detailed prediction procedure from the calculated regression coefficients. As can be seen, the predicted value (177 ºC) is reasonably close to the experimental one (181 ºC)40 and within the standard regression error (ca. 95% of predicted values should be within a range of ±14 ºC from the experimental ones).
Table 9. Example of boiling point prediction of 1,2,3-triethoxypropane (222) using the linear regression obtained with DARC/PELCO descriptors.a

| Fragment | bi | No. of fragments | Total contribution |
|---|---|---|---|
| F0 | 278.2 | 1 | 278.2 |
| A1 | −6.1 | 1 | −6.1 |
| A2 | −55.6 | 2 | −111.2 |
| B1 | 0.0 | 1 | 0.0 |
| B2 | 7.9 | 2 | 15.8 |
| Sum | | | 176.7 |

a Experimental value: 181 °C.40
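The additive scheme of Table 9 can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `predict_property` is hypothetical, and the negative signs of the A1 and A2 coefficients are an assumption, chosen because they are the signs needed for the individual contributions to add up to the tabulated total of 176.7 °C.

```python
# Regression coefficients (deg C) and fragment counts for
# 1,2,3-triethoxypropane (222), taken from Table 9.
# ASSUMPTION: A1 and A2 are negative; with these signs the
# contributions reproduce the tabulated total of 176.7 deg C.
coefficients = {"F0": 278.2, "A1": -6.1, "A2": -55.6, "B1": 0.0, "B2": 7.9}
fragment_counts = {"F0": 1, "A1": 1, "A2": 2, "B1": 1, "B2": 2}


def predict_property(coeffs, counts):
    """Sum b_i * n_i over all fragments present in the molecule."""
    return sum(coeffs[name] * n for name, n in counts.items())


predicted_bp = predict_property(coefficients, fragment_counts)
experimental_bp = 181.0  # literature value quoted in the footnote of Table 9

print(f"predicted b.p. = {predicted_bp:.1f} deg C")
print(f"deviation      = {predicted_bp - experimental_bp:+.1f} deg C")
```

Predicting another solvent of the family would only require a different `fragment_counts` dictionary, built from the DARC/PELCO scheme of Figure 1.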
Multiple Linear Regression (MLR) with mixed DARC/PELCO
and topological descriptors.
More compact prediction equations (equations 8–10) were obtained by mixing DARC/PELCO descriptors and topological indices, thus considering simultaneously local and global structure descriptors, respectively. The coefficients and statistical parameters for these regressions are gathered in Table 10, and the plots of predicted vs. experimental values of the properties are displayed in Figure 5.
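$$E_T^N = b_0 + b_1 \cdot \mathrm{HBD} + b_2 \cdot B_{F2} + b_3 \cdot A_1 + b_4 \cdot \chi^{v.m.}_{[\ldots]}$$ Eq. (8)
$$\eta = b_0 + b_1 \cdot A_2 + b_2 \cdot A_1 + b_3 \cdot {}^{0}\chi$$ Eq. (9)
$$\mathrm{bp} = b_0 + b_1 \cdot A_2 + b_2 \cdot \mathrm{RB} + b_3 \cdot \mathrm{Bal}_{JY} + b_4 \cdot \chi^{v.m.}_{[\ldots]} + b_5 \cdot \chi^{v.m.}_{[\ldots]}$$ Eq. (10)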
Table 10. Linear regression factors from equations 8–10.

| bi | ETN | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.523 ± 0.122 | 67.55 ± 3.94 | 292.6 ± 35.8 |
| b1 ± e1 | 0.140 ± 0.042 | 35.86 ± 2.51 | 49.7 ± 9.0 |
| b2 ± e2 | 0.177 ± 0.020 | 5.27 ± 1.75 | 12.9 ± 3.4 |
| b3 ± e3 | 0.099 ± 0.043 | 0.99 ± 0.23 | 26.0 ± 13.2 |
| b4 ± e4 | 0.026 ± 0.010 | | 20.4 ± 6.9 |
| b5 ± e5 | | | 8.3 ± 7.5 |
| N | 46 | 17 | 62 |
| R2 | 0.968 | 0.989 | 0.932 |
| σ(y) | 0.036 | 1.46 | 6.8 |
| F | 314.31 | 376.64 | 153.47 |

F(6, 40, 0.05) = 2.34; F(3, 14, 0.05) = 3.34; F(8, 54, 0.05) = 2.18. All equations are statistically significant (p > 95%).
Although the statistical tests are very similar to those obtained with the DARC/PELCO descriptors only, fewer independent variables are used in the final equations, leading to higher ratios of number of cases to number of variables. In the case of viscosity, the number of independent variables does not change, but the standard error of the predictions is slightly improved (from 2.05 to 1.46 cP).
The robustness and predictivity of these equations were again tested by splitting the solvent set into training and test sets. The corresponding regression results are gathered in Table 11. As can be seen, there are no significant changes in fitting parameters and regression coefficients. Figure S4 (in ESI) shows the predicted data for the eight members of the test group.
Figure 5. Plots of predicted vs. experimental values of ETN (a), viscosity (b), and boiling point (c), as calculated through MLR analysis using topological indices and DARC/PELCO descriptors.
Table 11. Linear regression factors from equations 8–10 using a reduced training set of 54 solvents.

| bi | ETN | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.512 ± 0.129 | 70.41 ± 4.49 | 288.6 ± 38.2 |
| b1 ± e1 | 0.139 ± 0.045 | –37.83 ± 2.71 | 49.3 ± 8.9 |
| b2 ± e2 | 0.173 ± 0.021 | 5.46 ± 1.91 | 13.3 ± 3.5 |
| b3 ± e3 | 0.112 ± 0.050 | 1.03 ± 0.23 | 22.5 ± 14.1 |
| b4 ± e4 | 0.024 ± 0.011 | | 19.5 ± 6.6 |
| b5 ± e5 | | | 9.3 ± 7.6 |
| N | 39 | 13 | 54 |
| R2 | 0.969 | 0.993 | 0.944 |
| σ(y) | 0.038 | 1.35 | 6.2 |
| F | 268.19 | 405.64 | 161.36 |

F(4, 35, 0.05) = 2.64; F(3, 10, 0.05) = 3.71; F(5, 49, 0.05) = 2.42. All equations are statistically significant (p > 95%).
The comparison of the MUE calculated with the fitting of the training set and the test set indicates that prediction errors are significantly higher in the latter case, but they are in any case lower than those obtained with the preceding models, representing 5–6% of the full range of experimental values in all cases. A summary of the MUE calculated for all the equations developed in this work is gathered in Table 12.
Table 12. Mean unsigned errors (MUE) calculated for the different equations developed in this work.a

| Model | ETN training | ETN test | η training | η test | bp training | bp test |
|---|---|---|---|---|---|---|
| MLR Topol. | 0.028 | 0.030 | 7.28 | 5.08 | 8.6 | 10.9 |
| PLS Topol. | 0.027 | 0.030 | 6.34 | 4.62 | 6.6 | 10.3 |
| MLR D.-P. | 0.024 | 0.051 | 1.37 | 2.05 | 4.6 | 10.3 |
| MLR Mixed | 0.026 | 0.033 | 1.02 | 1.95 | 3.9 | 8.8 |

a Boldface values indicate errors within 5% of the full range of experimental values, and italicized values indicate errors within 8% of the full range.
If we take the MUE calculated for the test set as a measure of the actual predictivity of the equations, we can conclude that good predictive models have been developed for the three properties under study. Topological descriptors seem to be more adequate for the prediction of ETN, mostly due to the poor predictions of DARC/PELCO descriptors for fluorinated solvents. The latter, on the other hand, perform much better in the prediction of viscosities. Overall, the mixed DARC/PELCO-topological model constitutes the best compromise for reasonably predicting the three solvent properties studied here.
A referee suggested that PLS analyses could also be applied to the DARC/PELCO and mixed parameter models. The corresponding results can be found in the Electronic Supplementary Information, but in no case could an improvement over the MLR equations be obtained, so they will not be discussed here.
Experimental
Glycerol-based solvents were obtained by ring opening of either the appropriate glycidol ether (non-symmetric glycerol-based solvents) or epichlorohydrin (symmetric glycerol-based solvents) with the corresponding alkoxide in alcoholic media, and purified by vacuum distillation as described previously.1
The complete list of the 62 solvents used in QSPR analyses and the values of the experimental properties studied are gathered in Table 1.
Different topological descriptors were calculated for the molecular structures of every solvent using Materials Studio Modeling 4.0 from Accelrys. This software can calculate topological descriptors on the basis of molecular structural information. All these descriptors are gathered in Table S1 of the Supporting Information.
DARC/PELCO descriptors were generated from the scheme shown in Figure 1. The presence of a C unit (bearing the corresponding hydrogen atoms) was codified as 1 in the data matrix (2 if the unit is simultaneously present at both symmetric sides of the glycerol moiety). C units bearing fluorine atoms were codified as independent variables (those starting with “F” in the regression analyses). The final DARC/PELCO matrix is gathered in Table S2.
Multiple linear regression analyses were carried out using the SPSS software. In all Tables the following information is provided:
- Regression coefficients bi, as defined previously ($B = (X^{T}X)^{-1}X^{T}Y$).
- Individual confidence intervals (at the 95% probability level) of each bi coefficient. These confidence intervals are calculated from the estimated standard error of bi and the Student's t value with N−p degrees of freedom:
$$e_i = \mathrm{s.e.}(b_i) \cdot t(N-p,\, 0.975)$$
- Number of cases included in the regression, N.
- Multiple determination coefficient, R2, which is a measure of the proportion of the total variation about the mean of y explained by the regression.
- Standard error of the regression, σ(y), which is the square root of the residual mean square and an estimate of the error with which any observed value of y could be predicted by the regression equation.
- F value, defined as the quotient of the regression and residual mean squares. When compared with a Fisher-Snedecor F distribution with p−1 and N−p degrees of freedom, at a 95% probability level (values given in the table footnotes), it allows establishing whether the variance explained by the regression equation is significantly different from that of the error. More strictly, it tests the null hypothesis H0, i.e., that all regression coefficients are zero. If the calculated F value is larger than the tabulated one, the hypothesis is rejected, and the equation is considered statistically significant.
- Stepwise linear regression procedure is a method to select the “best” regression equation from a set of independent variables, x. Each variable is sequentially included in the equation, following its single correlation with the response, y. For each new variable entering, a partial F-test is performed to see if the improvement in the equation is significant. If the variable is accepted, then partial F-tests are also performed for the rest of the variables already in the equation. Those not passing the test are then eliminated. The procedure is repeated until no more variables are included in the equation. Partial F-tests are carried out at a 90% probability level.
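The quantities listed above can be illustrated with a small, dependency-free Python sketch of ordinary least squares. It is not the SPSS procedure used in this work: the helper names `solve_linear_system` and `fit_mlr` are hypothetical, the six data points are invented purely for illustration, and the confidence margins ei are left out because they need a Student's t quantile from a table or a statistics package.

```python
import math


def solve_linear_system(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(rows, y):
    """Ordinary least squares; returns (b, R2, sigma(y), F)."""
    x = [[1.0] + list(r) for r in rows]      # leading 1 gives the intercept b0
    n, p = len(x), len(x[0])                 # p parameters, intercept included
    xtx = [[sum(xi[j] * xi[k] for xi in x) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(x, y)) for j in range(p)]
    b = solve_linear_system(xtx, xty)        # B = (X^T X)^-1 X^T Y
    y_hat = [sum(bj * xj for bj, xj in zip(b, xi)) for xi in x]
    y_mean = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_reg = sum((fi - y_mean) ** 2 for fi in y_hat)
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    sigma = math.sqrt(ss_res / (n - p))                 # root of residual mean square
    f_value = (ss_reg / (p - 1)) / (ss_res / (n - p))   # regression MS / residual MS
    return b, r2, sigma, f_value


# Invented data: two descriptors per case, response close to 10 - 3*x1 + 2*x2.
x_rows = [(0, 1), (1, 0), (1, 1), (2, 1), (0, 2), (2, 2)]
y_values = [12.1, 6.9, 9.0, 6.1, 13.9, 8.0]

coeffs, r2, sigma, f_value = fit_mlr(x_rows, y_values)
print("b =", [round(c, 3) for c in coeffs])
print(f"R2 = {r2:.4f}, sigma(y) = {sigma:.3f}, F = {f_value:.1f}")
```

The calculated F would then be compared with the tabulated value for p−1 and N−p degrees of freedom, as described in the list above.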
The Mean Unsigned (or Absolute) Error (MUE or MAE) is the average of the absolute errors ei = |ŷi − yi|, where ŷi is the value predicted by the model and yi the experimental value.
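As a minimal sketch of this definition, the MUE of a test-set prediction can be computed as follows. The experimental and calculated boiling points are the data labels that survive from the test-set bar charts in the ESI; attributing this particular calculated series to the mixed model (equations 8–10) is an assumption, supported only by the fact that the result rounds to the 8.8 reported for that model in Table 12. The function name `mean_unsigned_error` is illustrative.

```python
def mean_unsigned_error(predicted, experimental):
    """MUE: average of |y_hat_i - y_i| over all cases."""
    if len(predicted) != len(experimental):
        raise ValueError("both series must have the same length")
    total = sum(abs(p - e) for p, e in zip(predicted, experimental))
    return total / len(predicted)


# Boiling points (deg C) of the eight test-set solvents.
# ASSUMPTION: the calculated series belongs to the mixed
# DARC/PELCO-topological model (equations 8-10).
test_solvents = ["200", "104", "3i03F", "5F05F", "113i", "414t", "4t13F", "3F23F"]
bp_experimental = [221, 208, 176, 204, 170, 234, 185, 171]
bp_calculated = [236, 209, 190, 198, 173, 219, 189, 183]

mue = mean_unsigned_error(bp_calculated, bp_experimental)
print(f"MUE (test set, b.p.) = {mue:.2f} deg C")  # prints 8.75
```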
Conclusions
In this study, three characteristic properties relevant to classifying solvents and facilitating the search for substitution uses have been investigated in a series of 62 glycerol derivatives that can be used as solvents. Global topological descriptors, based on the molecular graphs, have been successfully applied to analyze and predict solvent polarities, using both traditional MLR and PLS regression analyses. However, boiling points and viscosities are not so well modeled using this kind of structural variable.
On the other hand, DARC/PELCO local structural descriptors have proved clearly superior for describing the viscosity of this family of solvents. Boiling points are similarly well predicted with both kinds of approaches. Overall, the mixed model with DARC/PELCO and topological descriptors constitutes the best compromise for reasonably predicting the three solvent properties studied in this work.
Highly significant regression equations have been developed for the three properties under study. The robustness and predictive value of these equations have been demonstrated through the use of an independent test set of solvents. Therefore, the QSPR models developed provide significant additional insight into the relationship between the molecular structure and some fundamental solvent properties.
Based on these results, it seems that quantitative structure-activity/property relationships (QSAR/QSPR) could be quite useful for in silico prediction of physico-chemical properties, allowing a faster selection of target solvents for a given application.
Acknowledgements
Financial support from the Spanish MINECO (project CTQ2011-28124-C02-01), the European Social Fund (ESF) and the Gobierno de Aragón (Grupo Consolidado E11) is gratefully acknowledged.
Notes and references
a Instituto de Síntesis Química y Catálisis Homogénea, Facultad de
Ciencias, CSIC-Univ. de Zaragoza, Pedro Cerbuna, 12, E-50009
Zaragoza, Spain. Tel: +34 976762271; E-mail: jig@unizar.es
b Dept. Organic Chemistry, Facultad de Ciencias, Univ. de Zaragoza,
Pedro Cerbuna, 12, E-50009 Zaragoza, Spain.
c Dept. Physical Chemistry, Facultad de Ciencias, Univ. de Zaragoza,
Pedro Cerbuna, 12, E-50009 Zaragoza, Spain.
1 García, J. I., García-Marín, H., Mayoral, J. A., Pérez, P., Green Chem., 2010, 12, 426.
2 Cramer, R. D., J. Am. Chem. Soc., 1980, 102, 1837.
3 Carlson, R., Design and Optimization in Organic Synthesis. Ed. Elsevier, Amsterdam, 1992.
4 Chastrette, M., Rajzmann, M., Chanon, M., Purcell, K. F., J. Am. Chem. Soc., 1985, 107, 1.
5 Koppel, I. A., Palm, V. A., The Influence of the Solvent on Organic Reactivity, in Advances in Linear Free Energy Relationships. Ed. Plenum Press, London, 1972.
6 Kamlet, M. J., Abboud, J. L., Abraham, M. H., Taft, R. W., J. Org. Chem., 1983, 48 (17), 2877.
7 Catalán, J., Solvent Effects based on non-HBD Solvents in Handbook of Solvents. Ed. William Andrew Publishing, New York, 2001.
8 Reichardt, C., Solvents and Solvent Effects. 3rd ed.; Ed. Wiley-VCH, Weinheim, 2003.
9 Ravi, M., Hopfinger, A. J., Hormann, R. E., Dinan, L., J. Chem. Inf. Comput. Sci., 2001, 41, 1587.
10 Luke, B. T., J. Mol. Struct. (Theochem.), 1999, 13, 468.
11 Bruneau, P., J. Chem. Inf. Comput. Sci., 2001, 41, 1605.
12 Katritzky, A. R., Petrukhin, R., Tatham, D., J. Chem. Inf. Comput. Sci., 2001, 41, 679.
13 Ghasemi, J., Saaidpour, S., Brown, S. D., J. Mol. Struct. (Theochem.), 2007, 805, 27.
14 Brauner, N., Shacham, M., Cholakov, G. S., Stateva, R. P., Chem. Eng. Sci., 2005, 60, 5458.
15 Lind, P., Lopes, C., Oberg, K., Eliasson, B., Chem. Phys. Lett., 2004, 387, 238.
16 Ungerer, P., Nieto-Draghi, C., Rousseau, B., Ahunbay, G., Lachet, V., J. Mol. Liq., 2007, 134, 71.
17 Ghasemi, J. B., Abdolmaleki, A., Mandoumi, N., J. Hazardous Mat., 2009, 161, 74.
18 Fatemi, M. H., Haghdadi, M., J. Mol. Struct., 2008, 886, 43.
19 Torrecilla, J. S., Palomar, J., Lemus, J., Rodríguez, F., Green Chem., 2010, 12, 123.
20 Alvarez-Guerra, M., Irabien, A., Green Chem., 2011, 13, 1507.
21 Yan, F., Xia, S., Wang, Q., Ma, P., J. Chem. Eng. Data, 2012, 57, 2252.
22 Consonni, V., Todeschini, R., Pavan, M., Gramatica, P., J. Chem. Inf. Comput. Sci., 2002, 42, 693.
23 Krenkel, G., Castro, E. A., Toropov, A. A., J. Mol. Struct. (Theochem.), 2001, 542, 107.
24 Ghasemi, J., Shahmirani, S., Farahani, E. V., Ann. Chim., 2006, 96, 327.
25 Kier, L. B., Hall, L. H., Molecular Connectivity in Structure-Activity Analysis. Ed. Research Studies Press Ltd, New York, 1985.
26 Katritzky, A. R., Fara, D. C., Kuanar, M., Hur, E., Karelson, M., J. Phys. Chem. A, 2005, 109, 10323.
27 Draper, N. R., Smith, H., Applied Regression Analysis. Ed. Wiley-Interscience, 1998.
28 Dimroth, K., Reichardt, C., Siepmann, T., Bohlmann, F., Liebigs Ann. Chem., 1963, 1, 661.
29 Dimroth, K., Reichardt, C., Schweig, A., Liebigs Ann. Chem., 1963, 95, 669.
30 Lide, D. R., Handbook of Chemistry and Physics. 84th ed.; Ed. CRC, New York, 2004.
31 Katritzky, A. R., Gordeeva, E. V., J. Chem. Inf. Comput. Sci., 1993, 835.
32 Hall, L. H., Kier, L. B., Rev. Comput. Chem. II, 1991, 367.
33 Balaban, A. T., Chem. Phys. Lett., 1982, 309.
34 Wiener, H., J. Chem. Phys., 1947, 17.
35 Bonchev, D., Information Theoretic Indices for Characterization of Chemical Structures. Ed. Research Studies Press Ltd., New York, 1983.
36 Kier, L. B., Hall, L. H., Molecular Connectivity Indices in Chemistry and Drug Research. Ed. deStevens, New York, 1976.
37 Dubois, J. E., Computer Representation and Manipulation of Chemical Information. Ed. Wiley, New York, 1974.
38 Wold, S., Ruhe, A., Wold, H., Dunn, W., SIAM J. Sci. Stat. Comput., 1984, 5, 735.
39 Geladi, P., Kowalski, B. R., Anal. Chim. Acta, 1986, 185, 1.
40 Fairbourne, A., Gibson, G. P., Stephens, D. W., J. Chem. Soc., 1931, 445.
Electronic Supplementary Information for
Quantitative structure-property relationships prediction of
some physicochemical properties of glycerol based solvents.
José I. García,*a Héctor García-Marín,a José A. Mayoral,a,b and Pascual Pérezc
a Instituto de Síntesis Química y Catálisis Homogénea, Facultad de Ciencias, CSIC-Univ. de Zaragoza, Pedro
Cerbuna, 12, E-50009 Zaragoza, Spain. Tel: +34 976762271; E-mail: jig@unizar.es.
b Dept. Organic Chemistry, Facultad de Ciencias, Univ. de Zaragoza, Pedro Cerbuna, 12, E-50009 Zaragoza, Spain.
E-mail: mayoral@unizar.es.
c Dept. Physical Chemistry, Facultad de Ciencias, Univ. de Zaragoza, Pedro Cerbuna, 12, E-50009 Zaragoza, Spain.
E-mail: pascual@unizar.es.
Definition of the topological parameters
Topological indices are usually obtained from two-dimensional molecular structures (molecular graphs, G),
mostly through the connectivity adjacency (A(G)) and topological distance matrices (D(G)), and the vertex
degree vector (δ(G)):
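A(G)

| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 | 1 | 0 |

δ(G)

| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| δ | 1 | 3 | 1 | 2 | 2 | 1 |

D(G)

| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 2 | 2 | 3 | 4 |
| 2 | 1 | 0 | 1 | 1 | 2 | 3 |
| 3 | 2 | 1 | 0 | 2 | 3 | 4 |
| 4 | 2 | 1 | 2 | 0 | 1 | 2 |
| 5 | 3 | 2 | 3 | 1 | 0 | 1 |
| 6 | 4 | 3 | 4 | 2 | 1 | 0 |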
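How δ(G) and D(G) follow from A(G) can be sketched with a short breadth-first search over the six-vertex example graph. The bond list below is read directly from the non-zero entries of A(G); the variable and function names are illustrative only.

```python
from collections import deque

# Bonds of the six-vertex example graph (non-zero entries of A(G)).
bonds = [(1, 2), (2, 3), (2, 4), (4, 5), (5, 6)]
n_atoms = 6

neighbours = {v: [] for v in range(1, n_atoms + 1)}
for i, j in bonds:
    neighbours[i].append(j)
    neighbours[j].append(i)

# Vertex degree: number of neighbours in the hydrogen-suppressed graph.
degrees = [len(neighbours[v]) for v in range(1, n_atoms + 1)]


def distances_from(start):
    """Breadth-first search: bonds on the shortest path to every vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in neighbours[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return [dist[v] for v in range(1, n_atoms + 1)]


# Topological distance matrix, one row per vertex.
distance_matrix = [distances_from(v) for v in range(1, n_atoms + 1)]

print("degrees:", degrees)
for row in distance_matrix:
    print(row)
```

The degree vector obtained is 1, 3, 1, 2, 2, 1, and the first row of the distance matrix is 0, 1, 2, 2, 3, 4. Because the graph is undirected, the computed matrix is symmetric.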
Topological indices are calculated from different invariant features of the molecular graph, and contain information about molecular size, molecular shape, branching, molecular flexibility, etc. The exact definitions of the indices used in this work are given below.
Balaban indices (JX, JY):15, 23
Balaban index is defined as:
$$J = \frac{M}{M - N + 2} \sum_{\text{bonds}} \frac{1}{\sqrt{s_i^a \, s_j^a}}$$
where M is the number of bonds, N is the number of atoms in the molecule, and si is calculated as the sum of terms from a modified topological distance matrix. In this modified distance matrix, each bond contributes with 1/b to the total connectivity, with b=1 for single bonds, b=2 for double bonds, b=3 for triple bonds, and b=1.5 for aromatic bonds:
$$s_i = \sum_{j=1}^{N} d_{ij}$$
Corrections for heteroatoms have been introduced through contributions for the modification of the electronegativity (X) and the atomic radii (Y):
$$X = 0.4196 - 0.0078\, i + 0.1567\, G_i$$
$$Y = 1.1191 + 0.0160\, i - 0.0537\, G_i$$
where i is the atomic number and Gi is the group number in the Periodic Table of the elements. From these corrections, the $s_i^a$ values are defined as:
$$s_i^a = X s_i \quad \text{(for JX index)}$$
$$s_i^a = Y s_i \quad \text{(for JY index)}$$
Wiener index (W):16, 24
The Wiener index is defined as the sum of the lengths of the shortest paths between all pairs of vertices in the chemical graph representing the non-hydrogen atoms in the molecule. It is easily computed from the topological distance matrix:
$$W = \frac{1}{2} \sum_{i,j} d_{ij}$$
This index is a measure of the centrality of the graph, and hence it is related to the molecular compactness.
Zagreb index:17
It is defined as the sum of squares of the difference between the number of electrons participating in covalent bonds and the number of hydrogen atoms bonded to the same atom. This is equivalent to the sum of the squares of the vertex degrees, δi:
$$\mathrm{Zagreb} = \sum_i (\sigma_i - h_i)^2 = \sum_i \delta_i^2$$
Randic and Kier & Hall connectivity indices (χ):18
χ indices were first proposed by Randic25 from the vertex degrees, as:
$$\chi = \sum_{B} \frac{1}{\sqrt{\delta_i \delta_j}}$$
extended to all bonds in the molecule (B).
Kier and Hall extended the definition by including the number of edges of a given sub-graph (h), and different kinds of sub-graphs (r):
$${}^{h}\chi_{r} = \sum_{j=1}^{\sigma_n} \prod_{i=1}^{h+1} \frac{1}{\sqrt{\delta_i}}$$
where σn is the number of sub-graphs of length h and δ is the vertex degree.
There are four kinds of sub-graphs, known as path (linear chains), cluster (branched chains), path/cluster, and chain (cycles), each one emphasizing a particular aspect of the molecular connectivity. The n superindex refers to the number of bonds considered to calculate the topological index. Thus, n=0 refers to individual atoms, n=1 refers to directly connected atoms, n=2 refers to three atoms connected through two consecutive bonds, and so on:
$$P(n)_{sub} = \prod_{i=1}^{n} \frac{1}{\sqrt{\delta_i}} = \frac{1}{\sqrt{\delta_i \times \delta_j \times \cdots \times \delta_n}}$$
and hence
$${}^{n}\chi_{s} = \mathrm{Chi}(n) = \sum_{sub} P(n)_{sub}$$
A further refinement10d, 18 can be included in the χ indices by considering the atom valences, thus allowing the presence of heteroatoms in the structure to be distinguished. This is accomplished by calculating a “corrected” δ value, using the atomic number and the number of valence electrons of the vertex atoms:
$$\delta^{v} = \frac{Z^{v} - h}{Z - Z^{v} - 1}$$
where Zv is the number of valence electrons, Z is the atomic number and h is the number of hydrogen atoms bonded to the vertex atom. The resulting “valence-corrected” indices are named χv.
Kier & Hall count indices (SC):
SC is the count of sub-graphs of a given length present in the molecule. Thus, SC=0 is the number of atoms, SC=1 the number of chemical bonds, SC=2 the number of bond pairs, and so on. For longer sub-graphs, path, cluster, path/cluster and chain types can also be considered.
Kier shape indices (κn):
All the preceding topological indices are heavily influenced by the size of the molecular graph. Kier developed the κ indices to better discriminate between different shapes of the molecules. They are defined from sub-graphs of a given length, taking into account also the maximum and minimum connectivity of the molecule for the same length (a way to “normalize” the κ values, making them independent of the molecular size):
$$\kappa_n = K \cdot \frac{{}^{m}P_{max} \cdot {}^{m}P_{min}}{\left({}^{m}P_i\right)^{2}}$$
where m is the chosen length of the sub-graph, mPi the number of sub-graphs of length m contained in the total graph, and mPmax and mPmin are the maximum and minimum possible numbers of sub-graphs of length m that the total graph can contain. Some examples are given below.
κ1, K = 2:
$${}^{1}P_{min} = N - 1, \qquad {}^{1}P_{max} = \frac{N(N-1)}{2}, \qquad {}^{1}P_i = \text{number of edges}$$
κ2, K = 2:
$${}^{2}P_{min} = N - 2, \qquad {}^{2}P_{max} = \frac{(N-1)(N-2)}{2}, \qquad {}^{2}P_i = \text{number of adjacent edges}$$
κ3, K = 4:
$${}^{3}P_{min} = N - 3, \qquad {}^{3}P_{max} = \frac{(N-2)^{2}}{4} \ (N\ \text{even}), \qquad {}^{3}P_{max} = \frac{(N-1)(N-3)}{4} \ (N\ \text{odd}), \qquad {}^{3}P_i = \text{trios of adjacent edges}$$
Similarly to the χ indices, a modification has been suggested for κ indices to account for the presence of heteroatoms in the molecular graph.14, 26 In this modification, both the covalent radii and the hybridizations are considered. The $\kappa_n^{\alpha}$ indices are defined as the κn ones, but substituting N by N+α, where α is defined as:
$$\alpha = \sum_i \left( \frac{r_i}{r_{Csp^3}} - 1 \right)$$
where ri is the covalent radius of atom i and rCsp3 is taken as 0.77 Å (the covalent radius of a carbon atom with sp3 hybridization).
Molecular flexibility index (ϕ):14
The starting hypothesis to define ϕ is that an infinitely long linear saturated hydrocarbon molecule (i.e. all-sp3 C–C bonds) is infinitely flexible. Flexibility is reduced by the presence of a limited number of atoms, rings, branched chains, and the presence of atoms with covalent radii shorter than that of Csp3:
$$\phi = \frac{{}^{1}\kappa_{\alpha} \cdot {}^{2}\kappa_{\alpha}}{N}$$
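As a worked check of these definitions, the sketch below evaluates several indices for the hydrogen-suppressed graph of glycerol (solvent code 000). It is an illustration under stated assumptions rather than the Materials Studio implementation: the atom labels are arbitrary, α is taken as −0.04 per sp3 oxygen (Kier's tabulated value, not given in the text), and the α correction is applied to the path counts as well as to N, following Kier's usual definition. With these assumptions the results round to the entries of the first row of Table S1 (W = 31, Z = 20, κ1 = 5.88, κ2 = 3.08, 0χ = 4.99, 1χ = 2.81, ϕ = 3.02).

```python
import math
from collections import deque

# Hydrogen-suppressed graph of glycerol: O1-C1-C2(-O2)-C3-O3.
# Atom labels are illustrative only.
bonds = [("O1", "C1"), ("C1", "C2"), ("C2", "O2"), ("C2", "C3"), ("C3", "O3")]

neighbours = {}
for u, v in bonds:
    neighbours.setdefault(u, []).append(v)
    neighbours.setdefault(v, []).append(u)
atoms = sorted(neighbours)
degree = {atom: len(neighbours[atom]) for atom in atoms}


def distance_sum(start):
    """Sum of topological distances from one atom to all the others."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in neighbours[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return sum(dist.values())


# Wiener index: half the sum of all distance-matrix entries.
wiener = sum(distance_sum(atom) for atom in atoms) // 2
# Zagreb index: sum of squared vertex degrees.
zagreb = sum(d * d for d in degree.values())
# Connectivity indices of order 0 (atoms) and order 1 (bonds).
chi0 = sum(1 / math.sqrt(d) for d in degree.values())
chi1 = sum(1 / math.sqrt(degree[u] * degree[v]) for u, v in bonds)

# Kier shape indices with the alpha correction.
n_atoms = len(atoms)
p1 = len(bonds)                                       # paths of length 1
p2 = sum(d * (d - 1) // 2 for d in degree.values())   # pairs of adjacent bonds
alpha = 3 * (-0.04)   # ASSUMPTION: -0.04 per sp3 oxygen atom
n_alpha = n_atoms + alpha
kappa1 = n_alpha * (n_alpha - 1) ** 2 / (p1 + alpha) ** 2
kappa2 = (n_alpha - 1) * (n_alpha - 2) ** 2 / (p2 + alpha) ** 2
# Molecular flexibility index.
phi = kappa1 * kappa2 / n_atoms

print(f"W = {wiener}, Zagreb = {zagreb}")
print(f"0chi = {chi0:.2f}, 1chi = {chi1:.2f}")
print(f"kappa1 = {kappa1:.2f}, kappa2 = {kappa2:.2f}, phi = {phi:.2f}")
```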
Table S1. Topological parameters of 62 glycerol based solvents.

| Code | HBA | HBD | RB | ϕ | BalJX | BalJY | W | Z | κ1(am) | κ2(am) | κ3(am) | SC0(p) | SC1(p) | SC2(p) | SC3(p) | SC3(c) | 0χ | 1χ | 2χ | 3χ(p) | 3χ(cl) | 0χ(vm) | 1χ(vm) | 2χ(vm) | 3χ(p,vm) | 3χ(cl,vm) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 000 | 3 | 3 | 5 | 3.02 | 2.572 | 2.814 | 31 | 20 | 5.88 | 3.08 | 2.88 | 6 | 5 | 5 | 4 | 1 | 4.99 | 2.81 | 1.92 | 1.39 | 0.29 | 3.33 | 1.71 | 1.02 | 0.42 | 0.13 |
| 100 | 3 | 2 | 5 | 3.98 | 2.620 | 2.901 | 50 | 24 | 6.88 | 4.05 | 3.72 | 7 | 6 | 6 | 5 | 1 | 5.70 | 3.31 | 2.30 | 1.48 | 0.29 | 4.29 | 2.09 | 1.29 | 0.57 | 0.13 |
| 200 | 3 | 2 | 6 | 4.95 | 2.665 | 2.926 | 76 | 28 | 7.88 | 5.03 | 4.88 | 8 | 7 | 7 | 6 | 1 | 6.41 | 3.81 | 2.66 | 1.75 | 0.29 | 5.00 | 2.68 | 1.50 | 0.73 | 0.13 |
| 400 | 3 | 2 | 8 | 6.91 | 2.723 | 2.939 | 153 | 36 | 9.88 | 6.99 | 6.88 | 10 | 9 | 9 | 8 | 1 | 7.82 | 4.81 | 3.36 | 2.25 | 0.29 | 6.42 | 3.68 | 2.26 | 1.16 | 0.13 |
| 101 | 3 | 1 | 5 | 4.95 | 2.686 | 2.996 | 75 | 28 | 7.88 | 5.03 | 4.88 | 8 | 7 | 7 | 6 | 1 | 6.41 | 3.81 | 2.68 | 1.56 | 0.29 | 5.26 | 2.47 | 1.56 | 0.72 | 0.13 |
| 103i | 3 | 1 | 6 | 5.58 | 2.915 | 3.193 | 143 | 38 | 9.88 | 5.65 | 6.88 | 10 | 9 | 10 | 8 | 2 | 7.98 | 4.66 | 3.87 | 2.02 | 0.70 | 6.83 | 3.45 | 2.49 | 0.98 | 0.37 |
| 104 | 3 | 1 | 8 | 7.89 | 2.788 | 3.033 | 202 | 40 | 10.88 | 7.98 | 7.78 | 11 | 10 | 10 | 9 | 1 | 8.53 | 5.31 | 3.74 | 2.33 | 0.29 | 7.38 | 4.06 | 2.54 | 1.31 | 0.13 |
| 104i | 3 | 1 | 7 | 6.51 | 2.909 | 3.162 | 194 | 42 | 10.88 | 6.58 | 7.78 | 11 | 10 | 11 | 9 | 2 | 8.69 | 5.16 | 4.22 | 2.26 | 0.70 | 7.54 | 3.91 | 3.04 | 1.12 | 0.54 |
| 104t | 3 | 1 | 6 | 4.65 | 3.173 | 3.444 | 180 | 46 | 10.88 | 4.70 | 7.78 | 11 | 10 | 13 | 9 | 5 | 8.91 | 4.96 | 4.99 | 2.17 | 1.85 | 7.76 | 3.76 | 3.53 | 1.07 | 1.24 |
| 403i | 3 | 1 | 9 | 8.40 | 2.973 | 3.205 | 324 | 50 | 12.88 | 8.48 | 9.80 | 13 | 12 | 13 | 11 | 2 | 10.10 | 6.16 | 4.93 | 2.79 | 0.70 | 8.95 | 5.04 | 3.46 | 1.57 | 0.37 |
| 404 | 3 | 1 | 11 | 10.86 | 2.907 | 3.121 | 419 | 52 | 13.88 | 10.96 | 10.88 | 14 | 13 | 13 | 12 | 1 | 10.65 | 6.81 | 4.80 | 3.10 | 0.29 | 9.50 | 5.64 | 3.51 | 1.91 | 0.13 |
| 404t | 3 | 1 | 9 | 7.15 | 3.152 | 3.380 | 388 | 58 | 13.88 | 7.21 | 10.88 | 14 | 13 | 16 | 12 | 5 | 11.03 | 6.46 | 6.05 | 2.94 | 1.85 | 9.88 | 5.35 | 4.51 | 1.66 | 1.24 |
| 404i | 3 | 1 | 10 | 9.36 | 2.984 | 3.203 | 408 | 54 | 13.88 | 9.44 | 10.88 | 14 | 13 | 14 | 12 | 2 | 10.81 | 6.66 | 5.28 | 3.03 | 0.70 | 9.66 | 5.50 | 4.01 | 1.71 | 0.54 |
| 203i | 3 | 1 | 7 | 6.51 | 2.950 | 3.214 | 192 | 42 | 10.88 | 6.58 | 7.78 | 11 | 10 | 11 | 9 | 2 | 8.69 | 5.16 | 4.22 | 2.29 | 0.70 | 7.54 | 4.04 | 2.70 | 1.14 | 0.37 |
| 204 | 3 | 1 | 9 | 8.88 | 2.845 | 3.082 | 262 | 44 | 11.88 | 8.97 | 8.88 | 12 | 11 | 11 | 10 | 1 | 9.23 | 5.81 | 4.10 | 2.60 | 0.29 | 8.08 | 4.64 | 2.74 | 1.47 | 0.13 |
| 204t | 3 | 1 | 7 | 5.46 | 3.176 | 3.434 | 237 | 50 | 11.88 | 5.51 | 8.88 | 12 | 11 | 14 | 10 | 5 | 9.61 | 5.46 | 5.35 | 2.44 | 1.85 | 8.46 | 4.35 | 3.74 | 1.22 | 1.24 |
| 204i | 3 | 1 | 8 | 7.45 | 2.950 | 3.193 | 253 | 46 | 11.88 | 7.53 | 8.88 | 12 | 11 | 12 | 10 | 2 | 9.40 | 5.66 | 4.57 | 2.53 | 0.70 | 8.25 | 4.50 | 3.25 | 1.28 | 0.54 |
| 3i03i | 3 | 1 | 7 | 6.34 | 3.079 | 3.334 | 243 | 48 | 11.88 | 6.40 | 8.88 | 12 | 11 | 13 | 10 | 3 | 9.56 | 5.52 | 5.05 | 2.47 | 1.11 | 8.41 | 4.43 | 3.42 | 1.24 | 0.60 |
| 3i04t | 3 | 1 | 7 | 5.53 | 3.273 | 3.522 | 296 | 56 | 12.88 | 5.58 | 9.80 | 13 | 12 | 16 | 11 | 6 | 10.48 | 5.81 | 6.18 | 2.62 | 2.26 | 9.33 | 4.75 | 4.46 | 1.33 | 1.48 |
| 3i04i | 3 | 1 | 8 | 7.23 | 3.068 | 3.305 | 314 | 52 | 12.88 | 7.30 | 9.80 | 13 | 12 | 14 | 11 | 3 | 10.27 | 6.02 | 5.40 | 2.72 | 1.11 | 9.12 | 4.89 | 3.97 | 1.38 | 0.77 |
| 3i03F | 6 | 1 | 7 | 6.06 | 3.115 | 3.499 | 378 | 60 | 13.67 | 6.21 | 10.67 | 14 | 13 | 17 | 12 | 6 | 11.19 | 6.31 | 6.53 | 2.86 | 2.26 | 8.17 | 4.25 | 3.17 | 1.20 | 0.54 |
| 403F | 6 | 1 | 9 | 7.72 | 3.032 | 3.384 | 484 | 62 | 14.67 | 7.90 | 11.60 | 15 | 14 | 17 | 13 | 5 | 11.73 | 6.96 | 6.41 | 3.17 | 1.85 | 8.72 | 4.86 | 3.21 | 1.53 | 0.31 |
| 4t03F | 6 | 1 | 7 | 5.55 | 3.267 | 3.639 | 450 | 68 | 14.67 | 5.67 | 11.60 | 15 | 14 | 20 | 13 | 9 | 12.11 | 6.60 | 7.66 | 3.01 | 3.41 | 9.10 | 4.57 | 4.21 | 1.29 | 1.42 |
| 4i03F | 6 | 1 | 8 | 6.87 | 3.106 | 3.465 | 472 | 64 | 14.67 | 7.03 | 11.60 | 15 | 14 | 18 | 13 | 6 | 11.90 | 6.81 | 6.88 | 3.10 | 2.26 | 8.88 | 4.71 | 3.72 | 1.34 | 0.72 |
| 3F03F | 9 | 1 | 7 | 6.05 | 3.136 | 3.615 | 557 | 72 | 15.46 | 6.26 | 12.46 | 16 | 15 | 21 | 14 | 9 | 12.82 | 7.10 | 8.01 | 3.25 | 3.41 | 7.94 | 4.07 | 2.91 | 1.15 | 0.49 |
| 5F05F | 13 | 1 | 9 | 6.90 | 3.636 | 4.205 | 1283 | 108 | 21.18 | 7.17 | 6.88 | 22 | 21 | 33 | 32 | 17 | 17.82 | 9.60 | 11.16 | 7.06 | 4.70 | 10.45 | 5.33 | 4.09 | 2.02 | 0.84 |
| 7F07F | 17 | 1 | 11 | 8.01 | 4.071 | 4.718 | 2399 | 144 | 26.90 | 8.33 | 6.20 | 28 | 27 | 45 | 50 | 25 | 22.82 | 12.10 | 14.41 | 10.13 | 6.20 | 12.96 | 6.58 | 5.24 | 2.82 | 1.17 |
| 111 | 3 | 0 | 5 | 5.93 | 2.907 | 3.263 | 102 | 32 | 8.88 | 6.01 | 4.39 | 9 | 8 | 8 | 8 | 1 | 7.11 | 4.35 | 2.85 | 1.97 | 0.20 | 6.22 | 2.85 | 1.77 | 1.04 | 0.12 |
| 113i | 3 | 0 | 6 | 6.51 | 3.111 | 3.431 | 182 | 42 | 10.88 | 6.58 | 6.28 | 11 | 10 | 11 | 10 | 2 | 8.69 | 5.20 | 4.03 | 2.43 | 0.61 | 7.79 | 3.84 | 2.70 | 1.30 | 0.35 |
| 143i | 3 | 0 | 9 | 9.36 | 3.331 | 3.612 | 369 | 54 | 13.88 | 9.44 | 9.26 | 14 | 13 | 14 | 13 | 2 | 10.81 | 6.70 | 5.12 | 3.07 | 0.61 | 9.92 | 5.42 | 3.68 | 1.82 | 0.35 |
| 114t | 3 | 0 | 6 | 5.46 | 3.341 | 3.652 | 225 | 50 | 11.88 | 5.51 | 7.32 | 12 | 11 | 14 | 11 | 5 | 9.61 | 5.49 | 5.16 | 2.58 | 1.77 | 8.72 | 4.15 | 3.74 | 1.39 | 1.23 |
| 144t | 3 | 0 | 9 | 8.02 | 3.513 | 3.789 | 436 | 62 | 14.88 | 8.08 | 10.17 | 15 | 14 | 17 | 14 | 5 | 11.73 | 6.99 | 6.25 | 3.22 | 1.77 | 10.84 | 5.74 | 4.73 | 1.91 | 1.23 |
| 114 | 3 | 0 | 8 | 8.88 | 2.974 | 3.260 | 250 | 44 | 11.88 | 8.97 | 7.32 | 12 | 11 | 11 | 11 | 1 | 9.23 | 5.85 | 3.91 | 2.74 | 0.20 | 8.34 | 4.44 | 2.74 | 1.63 | 0.12 |
| 144 | 3 | 0 | 11 | 11.8 | 3.246 | 3.505 | 470 | 56 | 14.88 | 11.95 | 10.17 | 15 | 14 | 14 | 14 | 1 | 11.36 | 7.35 | 5.00 | 3.38 | 0.20 | 10.46 | 6.03 | 3.73 | 2.15 | 0.12 |
| 114i | 3 | 0 | 7 | 7.46 | 3.089 | 3.383 | 241 | 46 | 11.88 | 7.53 | 7.32 | 12 | 11 | 12 | 11 | 2 | 9.40 | 5.70 | 4.39 | 2.67 | 0.61 | 8.50 | 4.30 | 3.24 | 1.44 | 0.53 |
| 144i | 3 | 0 | 10 | 10.3 | 3.330 | 3.595 | 458 | 58 | 14.88 | 10.40 | 10.17 | 15 | 14 | 15 | 14 | 2 | 11.52 | 7.20 | 5.48 | 3.31 | 0.61 | 10.62 | 5.89 | 4.23 | 1.96 | 0.53 |
| 123i | 3 | 0 | 7 | 7.45 | 3.240 | 3.552 | 232 | 46 | 11.88 | 7.53 | 7.32 | 12 | 11 | 12 | 11 | 2 | 9.40 | 5.70 | 4.42 | 2.55 | 0.61 | 8.50 | 4.42 | 2.92 | 1.37 | 0.35 |
| 213i | 3 | 0 | 7 | 7.45 | 3.158 | 3.465 | 237 | 46 | 11.88 | 7.53 | 7.32 | 12 | 11 | 12 | 11 | 2 | 9.40 | 5.70 | 4.39 | 2.70 | 0.61 | 8.50 | 4.42 | 2.90 | 1.46 | 0.35 |
| 124t | 3 | 0 | 7 | 6.29 | 3.450 | 3.753 | 282 | 54 | 12.88 | 6.35 | 8.22 | 13 | 12 | 15 | 12 | 5 | 10.32 | 5.99 | 5.54 | 2.70 | 1.77 | 9.42 | 4.74 | 3.96 | 1.46 | 1.23 |
| 214t | 3 | 0 | 7 | 6.29 | 3.366 | 3.665 | 288 | 54 | 12.88 | 6.35 | 8.22 | 13 | 12 | 15 | 12 | 5 | 10.32 | 5.99 | 5.52 | 2.85 | 1.77 | 9.42 | 4.74 | 3.94 | 1.54 | 1.23 |
| 124 | 3 | 0 | 9 | 9.87 | 3.114 | 3.395 | 310 | 48 | 12.88 | 9.96 | 8.22 | 13 | 12 | 12 | 12 | 1 | 9.94 | 6.35 | 4.29 | 2.86 | 0.20 | 9.05 | 5.03 | 2.96 | 1.70 | 0.12 |
| 214 | 3 | 0 | 9 | 9.87 | 3.045 | 3.322 | 316 | 48 | 12.88 | 9.96 | 8.22 | 13 | 12 | 12 | 12 | 1 | 9.94 | 6.35 | 4.27 | 3.01 | 0.20 | 9.05 | 5.03 | 2.95 | 1.79 | 0.12 |
| 124i | 3 | 0 | 8 | 8.40 | 3.219 | 3.507 | 300 | 50 | 12.88 | 8.48 | 8.22 | 13 | 12 | 13 | 12 | 2 | 10.10 | 6.20 | 4.77 | 2.79 | 0.61 | 9.21 | 4.89 | 3.46 | 1.51 | 0.53 |
| 214i | 3 | 0 | 8 | 8.40 | 3.146 | 3.430 | 306 | 50 | 12.88 | 8.48 | 8.22 | 13 | 12 | 13 | 12 | 2 | 10.10 | 6.20 | 4.74 | 2.94 | 0.61 | 9.21 | 4.89 | 3.45 | 1.60 | 0.53 |
| 223i | 3 | 0 | 8 | 8.40 | 3.304 | 3.604 | 294 | 50 | 12.88 | 8.48 | 8.22 | 13 | 12 | 13 | 12 | 2 | 10.10 | 6.20 | 4.77 | 2.82 | 0.61 | 9.21 | 5.01 | 3.12 | 1.53 | 0.35 |
| 224t | 3 | 0 | 8 | 7.15 | 3.498 | 3.792 | 352 | 58 | 13.88 | 7.21 | 9.26 | 14 | 13 | 16 | 13 | 5 | 11.03 | 6.49 | 5.90 | 2.97 | 1.77 | 10.13 | 5.33 | 4.16 | 1.61 | 1.23 |
| 224 | 3 | 0 | 10 | 10.86 | 3.197 | 3.471 | 383 | 52 | 13.88 | 10.96 | 9.26 | 14 | 13 | 13 | 13 | 1 | 10.65 | 6.85 | 4.65 | 3.13 | 0.20 | 9.75 | 5.62 | 3.17 | 1.86 | 0.12 |
| 224i | 3 | 0 | 9 | 9.36 | 3.292 | 3.572 | 372 | 54 | 13.88 | 9.44 | 9.26 | 14 | 13 | 14 | 13 | 2 | 10.81 | 6.70 | 5.12 | 3.06 | 0.61 | 9.92 | 5.47 | 3.67 | 1.67 | 0.53 |
| 413i | 3 | 0 | 9 | 9.36 | 3.175 | 3.445 | 384 | 54 | 13.88 | 9.44 | 9.26 | 14 | 13 | 14 | 13 | 2 | 10.81 | 6.70 | 5.10 | 3.20 | 0.61 | 9.92 | 5.42 | 3.67 | 1.90 | 0.35 |
| 423i | 3 | 0 | 10 | 10.32 | 3.329 | 3.598 | 458 | 58 | 14.88 | 10.40 | 10.17 | 15 | 14 | 15 | 14 | 2 | 11.52 | 7.20 | 5.48 | 3.32 | 0.61 | 10.62 | 6.01 | 3.89 | 1.96 | 0.35 |
| 414t | 3 | 0 | 9 | 8.02 | 3.349 | 3.616 | 454 | 62 | 14.88 | 8.08 | 10.17 | 15 | 14 | 17 | 14 | 5 | 11.73 | 6.99 | 6.22 | 3.35 | 1.77 | 10.84 | 5.74 | 4.71 | 1.98 | 1.23 |
| 424t | 3 | 0 | 10 | 8.90 | 3.500 | 3.765 | 535 | 66 | 15.88 | 8.97 | 11.21 | 16 | 15 | 18 | 15 | 5 | 12.44 | 7.49 | 6.60 | 3.47 | 1.77 | 11.55 | 6.33 | 4.93 | 2.05 | 1.23 |
| 414 | 3 | 0 | 11 | 11.86 | 3.105 | 3.357 | 488 | 56 | 14.88 | 11.95 | 10.17 | 15 | 14 | 14 | 14 | 1 | 11.36 | 7.35 | 4.97 | 3.51 | 0.20 | 10.46 | 6.03 | 3.72 | 2.23 | 0.12 |
| 414i | 3 | 0 | 10 | 10.32 | 3.182 | 3.439 | 476 | 58 | 14.88 | 10.40 | 10.17 | 15 | 14 | 15 | 14 | 2 | 11.52 | 7.20 | 5.45 | 3.44 | 0.61 | 10.62 | 5.89 | 4.22 | 2.03 | 0.53 |
| 424i | 3 | 0 | 11 | 11.28 | 3.339 | 3.595 | 559 | 62 | 15.88 | 11.37 | 11.21 | 16 | 15 | 16 | 15 | 2 | 12.23 | 7.70 | 5.83 | 3.56 | 0.61 | 11.33 | 6.47 | 4.44 | 2.10 | 0.53 |
| 3i13F | 6 | 0 | 7 | 6.87 | 3.304 | 3.723 | 444 | 64 | 14.67 | 7.03 | 9.96 | 15 | 14 | 18 | 14 | 6 | 11.90 | 6.85 | 6.70 | 3.27 | 2.17 | 9.14 | 4.64 | 3.37 | 1.52 | 0.53 |
| 4t13F | 6 | 0 | 7 | 6.28 | 3.457 | 3.864 | 522 | 72 | 15.67 | 6.42 | 11.00 | 16 | 15 | 21 | 15 | 9 | 12.82 | 7.14 | 7.83 | 3.42 | 3.33 | 10.06 | 4.95 | 4.41 | 1.61 | 1.41 |
| 444 | 3 | 0 | 14 | 14.84 | 3.443 | 3.683 | 789 | 68 | 17.88 | 14.94 | 13.17 | 18 | 17 | 17 | 17 | 1 | 13.48 | 8.85 | 6.06 | 4.15 | 0.20 | 12.58 | 7.62 | 4.70 | 2.74 | 0.12 |
| 413F | 6 | 0 | 9 | 8.60 | 3.223 | 3.610 | 559 | 66 | 15.67 | 8.78 | 11.00 | 16 | 15 | 18 | 15 | 5 | 12.44 | 7.49 | 6.58 | 3.58 | 1.77 | 9.68 | 5.24 | 3.42 | 1.85 | 0.30 |
| 3F13F | 9 | 0 | 7 | 6.80 | 3.325 | 3.835 | 638 | 76 | 16.46 | 7.02 | 11.72 | 17 | 16 | 22 | 16 | 9 | 13.53 | 7.64 | 8.18 | 3.66 | 3.33 | 8.90 | 4.46 | 3.12 | 1.47 | 0.48 |
| 3F23F | 9 | 0 | 8 | 7.57 | 3.476 | 3.980 | 736 | 80 | 17.46 | 7.80 | 12.76 | 18 | 17 | 23 | 17 | 9 | 14.23 | 8.14 | 8.56 | 3.78 | 3.33 | 9.61 | 5.04 | 3.34 | 1.54 | 0.48 |
| 3F43F | 9 | 0 | 10 | 9.15 | 3.642 | 4.119 | 987 | 88 | 19.46 | 9.41 | 14.72 | 20 | 19 | 25 | 19 | 9 | 15.65 | 9.14 | 9.27 | 4.29 | 3.33 | 11.02 | 6.04 | 4.11 | 1.99 | 0.48 |
Figure S1. Predicted vs. experimental values of ETN, viscosity, and boiling point for the selected solvent test set using MLR analysis with topological parameters (equations 2–4 in the main text).

| Solvent | ETN exp. | ETN calc. | Dynamic visc. (cP) exp. | Dynamic visc. (cP) calc. | Boiling point (°C) exp. | Boiling point (°C) calc. |
|---|---|---|---|---|---|---|
| 200 | 0.690 | 0.701 | 35.1 | 29.5 | 221 | 224 |
| 104 | 0.480 | 0.447 | | | 208 | 222 |
| 3i03F | 0.590 | 0.606 | 6.9 | 13.5 | 176 | 201 |
| 5F05F | 0.699 | 0.743 | | | 204 | 204 |
| 113i | | | 1.0 | −2.5 | 170 | 172 |
| 414t | 0.141 | 0.157 | | | 234 | 208 |
| 4t13F | 0.373 | 0.352 | 2.1 | −2.5 | 185 | 175 |
| 3F23F | 0.595 | 0.529 | | | 171 | 178 |

Figure S2. Predicted vs. experimental values of ETN, viscosity, and boiling point for the selected solvent test set using PLS analysis with topological parameters.

Figure S3. Predicted vs. experimental values of ETN, viscosity, and boiling point for the selected solvent test set using MLR analysis with DARC/PELCO descriptors (equations 5–7 in the text).

| Solvent | ETN exp. | ETN calc. | Dynamic visc. (cP) exp. | Dynamic visc. (cP) calc. | Boiling point (°C) exp. | Boiling point (°C) calc. |
|---|---|---|---|---|---|---|
| 200 | 0.690 | 0.661 | 35.1 | 39.9 | 221 | 233 |
| 104 | 0.480 | 0.497 | | | 208 | 207 |
| 3i03F | 0.590 | 0.609 | 6.9 | 5.6 | 176 | 192 |
| 5F05F | 0.699 | 0.795 | | | 204 | 185 |
| 113i | | | 1.0 | 2.6 | 170 | 178 |
| 414t | 0.141 | 0.118 | | | 234 | 224 |
| 4t13F | 0.373 | 0.290 | 2.1 | 2.6 | 185 | 194 |
| 3F23F | 0.595 | 0.506 | | | 171 | 178 |

Figure S4. Predicted vs. experimental values of ETN, viscosity, and boiling point for the selected solvent test set using MLR analysis with mixed topological and DARC/PELCO descriptors (equations 8–10 in the text).

| Solvent | ETN exp. | ETN calc. | Dynamic visc. (cP) exp. | Dynamic visc. (cP) calc. | Boiling point (°C) exp. | Boiling point (°C) calc. |
|---|---|---|---|---|---|---|
| 200 | 0.690 | 0.670 | 35.1 | 39.2 | 221 | 236 |
| 104 | 0.480 | 0.474 | | | 208 | 209 |
| 3i03F | 0.590 | 0.628 | 6.9 | 6.3 | 176 | 190 |
| 5F05F | 0.699 | 0.746 | | | 204 | 198 |
| 113i | | | 1.0 | −1.7 | 170 | 173 |
| 414t | 0.141 | 0.140 | | | 234 | 219 |
| 4t13F | 0.373 | 0.332 | 2.1 | 2.5 | 185 | 189 |
| 3F23F | 0.595 | 0.515 | | | 171 | 183 |
S21
Table S4. Comparison between MLR and PLS analyses with DARC/PELCO descriptors for the three solvent properties studied.
Descriptor
𝐄𝐓
𝐍
Dynamic Viscosity (cP)
Boiling point (ºC)
MLR
PLSa
MLR
PLSb
MLR
PLSc
B0
0.851
0.796
70.79
71.800
278.2
279.3
A1
0.278
0.268
3.52
4.588
6.1
-8.4
A2
0.160
0.116
32.50
34.448
55.6
-55.7
B1
n.s.
0.045
n.s.
0.083
n.s.
5.0
B2
0.026
0.034
n.s.
0.546
7.9
7.7
BF2
0.140
0.134
n.s.
2.606
7.0
6.7
C1
n.s.
0.003
n.s.
0.083
33.6
15.3
C2
0.016
0.018
n.s.
0.799
12.6
12. 5
CF2
0.059
0.055
6.90
2.816
12.0
9.5
D1
n.s.
0.003
n.s.
0.083
n.s.
15.3
D2
n.s.
0.015
n.s.
0.799
19.1
18.9
DF2
n.s.
0.033
n.s.
2.816
n.s.
3.0
N
46
46
17
17
62
62
R2
0.972
0.968
0.981
0.991
0.933
0.935
σ
0.036
0.036
2.08
1.28
6.9
6.3
a 4 latent variables. b 5 latent variables. c 6 latent variables.
Given that C1 and D1 are linearly dependent (see Table S3), they behave differently in the stepwise MLR and PLS analyses of the boiling point response. In stepwise MLR, the single variable that enters the equation takes the full value (33.6), whereas in the “back-projection” of the PLS coefficients onto the original variables, each coefficient takes half of the full value (15.3). The predictions within the solvent set used are nevertheless identical, given that all structures for which C1 = 1 also have D1 = 1. Similar, but not identical, behaviour is observed for other highly correlated parameters, such as CF2 and DF2.
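The coefficient-splitting argument above can be illustrated with a small sketch. It uses hypothetical toy data (an intercept of 250 and a full effect of 33.6, chosen only to echo the MLR coefficient in Table S4) and is not the authors' fitting procedure; it does not reproduce the fitted PLS values. The equal split is imposed by hand as a stand-in for the PLS back-projection, under the assumption that two identical columns receive identical coefficients.

```python
# Toy illustration: two identical indicator descriptors (C1 and D1).
# All numbers below are hypothetical, not data from the paper.
c1 = [0, 1, 0, 1, 1, 0]
d1 = list(c1)  # D1 duplicates C1, i.e. the columns are linearly dependent
y = [250.0, 283.6, 250.0, 283.6, 283.6, 250.0]  # 250 + 33.6 * C1


def fit_simple(x, y):
    """Closed-form ordinary least squares for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# Stepwise-style model: only one of the duplicate columns enters the
# equation, so it carries the full effect.
b0, b_full = fit_simple(c1, y)

# Split model: the same effect shared equally between C1 and D1.
b_half = b_full / 2

pred_stepwise = [b0 + b_full * c for c in c1]
pred_split = [b0 + b_half * c + b_half * d for c, d in zip(c1, d1)]

print("full coefficient:", round(b_full, 3))
print("split coefficient:", round(b_half, 3))
print("max prediction difference:",
      max(abs(p - q) for p, q in zip(pred_stepwise, pred_split)))
```

Because C1 and D1 always take the same value in this toy set, any pair of coefficients with the same sum gives the same predictions; the two models differ only for a hypothetical structure in which C1 and D1 differ.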
Table S5. Comparison between MLR and PLS analyses with mixed topological and DARC/PELCO descriptors for the three solvent properties studied.

| Descriptor | $E_T^N$, MLR | $E_T^N$, PLS^a | Dynamic viscosity (cP), MLR | Dynamic viscosity (cP), PLS^b | Boiling point (ºC), MLR | Boiling point (ºC), PLS^c |
|---|---|---|---|---|---|---|
| B0 | 0.523 | 0.865 | 67.55 | 156.41 | 292.6 | 171.2 |
| A1 | 0.099 | 0.054 | 5.27 | 6.333 | n.s. | 10.918 |
| A2 | n.s. | 0.005 | 35.86 | 27.740 | 49.7 | 26.743 |
| B1 | n.s. | 0.014 | n.s. | 5.713 | n.s. | 3.438 |
| B2 | n.s. | 0.012 | n.s. | 1.155 | n.s. | 2.847 |
| BF2 | 0.177 | 0.007 | n.s. | 0.224 | n.s. | 1.900 |
| C1 | n.s. | 0.004 | n.s. | 5.713 | n.s. | 4.677 |
| C2 | n.s. | 0.024 | n.s. | 0.904 | n.s. | 5.368 |
| CF2 | n.s. | 0.013 | n.s. | 0.223 | n.s. | 0.045 |
| D1 | n.s. | 0.004 | n.s. | 5.713 | n.s. | 4.677 |
| D2 | n.s. | 0.014 | n.s. | 0.904 | n.s. | 4.454 |
| DF2 | n.s. | 0.006 | n.s. | 0.223 | n.s. | 3.253 |
| HBA | n.s. | 0.035 | n.s. | 1.564 | n.s. | 0.898 |
| HBD | 0.140 | 0.060 | n.s. | 21.407 | n.s. | 15.825 |
| RB | n.s. | 0.032 | n.s. | 14.793 | 12.9 | 15.721 |
| φ | n.s. | 0.014 | n.s. | 7.672 | n.s. | 0.395 |
| BalJX | n.s. | 0.013 | n.s. | 30.362 | n.s. | 0.272 |
| BalJY | n.s. | 0.014 | n.s. | 18.803 | 26.0 | 0.564 |
| W | n.s. | 0.000 | n.s. | 0.054 | n.s. | 0.032 |
| Z | n.s. | 0.002 | n.s. | 3.661 | n.s. | 1.387 |
| κ1 | n.s. | 0.009 | n.s. | 5.738 | n.s. | 0.526 |
| κ2 | n.s. | 0.013 | n.s. | 8.699 | n.s. | 0.765 |
| κ3 | n.s. | 0.022 | n.s. | 5.679 | n.s. | 8.078 |
| SC0p | n.s. | 0.007 | n.s. | 5.848 | n.s. | 0.463 |
| SC1p | n.s. | 0.007 | n.s. | 5.848 | n.s. | 0.463 |
| SC2p | n.s. | 0.006 | n.s. | 7.678 | n.s. | 0.231 |
| SC3p | n.s. | 0.017 | n.s. | 2.194 | n.s. | 8.409 |
| SC3cl | n.s. | 0.012 | n.s. | 3.522 | n.s. | 0.869 |
| 0χ | n.s. | 0.003 | 0.99 | 1.811 | n.s. | 0.144 |
| 1χ | n.s. | 0.007 | n.s. | 3.813 | n.s. | 0.412 |
| 2χ | n.s. | 0.008 | n.s. | 8.389 | n.s. | 0.720 |
| 3χp | n.s. | 0.003 | n.s. | 1.152 | n.s. | 3.781 |
| 3χcl | n.s. | 0.002 | n.s. | 7.165 | n.s. | 2.006 |
| 0χvm | 0.026 | 0.039 | n.s. | 3.486 | 8.3 | 3.314 |
| 1χvm | n.s. | 0.009 | n.s. | 6.006 | | 2.333 |
| 2χvm | n.s. | 0.010 | n.s. | 12.129 | 20.4 | 3.026 |
| 3χpvm | n.s. | 0.009 | n.s. | 5.831 | | 2.857 |
| 3χclvm | n.s. | 0.009 | n.s. | 10.869 | | 1.690 |
| N | 46 | 46 | 17 | 17 | 62 | 62 |
| R² | 0.968 | 0.968 | 0.989 | 0.999 | 0.932 | 0.935 |
| s | 0.036 | 0.036 | 1.46 | 0.20 | 6.8 | 6.3 |

^a 6 latent variables. ^b 13 latent variables. ^c 9 latent variables.