All content in this area was uploaded by José I García on May 04, 2016
García et al., Green Chem. 2013, 15, 2283–2293
Quantitative structure-property relationships prediction of some
physico–chemical properties of glycerol based solvents.
José I. García,*a Héctor García-Marín,a José A. Mayoral,a,b and Pascual Pérezc
Quantitative Structure-Properties Relationships (QSPR) models have been developed for three characteristic properties of a series of 62 new glycerol derivatives, relevant to solvent classification and substitution uses. Using structural descriptor variables, three equations have been obtained by Multiple Linear Regression analysis. They can be applied to the in silico prediction of physico-chemical properties, allowing a faster selection of target solvents for a given application.
Introduction
Organic solvents are used in huge amounts in many industrial and daily life applications, but unfortunately the majority of them come from petroleum and they are often labelled as toxic or hazardous substances. For this reason, substantial efforts are being made to develop more benign solvents from renewable sources. Our group has recently described a family of solvents based on glycerol,1 a concomitant product in biodiesel production. To facilitate the search for possible substitution applications, we have also determined a number of physico-chemical properties of these glycerol-derived solvents, and compared them with those of conventional organic solvents. Many of these properties are difficult to measure, so efficient quantitative structure-properties relationship (QSPR) equations would be of great interest to accelerate the search for the best solvent for a given application. The concept is based on the fact that there is a close relationship between the bulk properties and the molecular structure of a series of similar chemical compounds. In this context, solvent classification is a very interesting issue, which has traditionally been addressed from both microscopic (intermolecular interactions) and macroscopic (as a continuum medium) approaches. However, solvation processes are hard to parameterize given that solvation energy (the only observable magnitude) is controlled by a large number of factors. For this reason, classification of solvents, and especially that of neoteric solvents, is far from being straightforward,2-7 and hence, during the last three decades of the 20th century, many efforts were devoted to classifying solvents using empirical parameters.8
Quantitative structure–property relationships are mathematical equations relating chemical structure to a wide variety of physical, chemical, and biological properties; in our case, solvent properties. QSPR models, once established, can be used to predict properties of compounds as yet unmeasured or even unknown.9-13 In this context, there are many reports about the applications of QSPR in connection with solvents, such as physico-chemical properties in alkanes series,14 optical properties of organic compounds,15 thermophysical properties of some fluids,16 solubility of hazardous compounds,17 acidity constants of some acid derivatives,13 permeability of organic compounds in membranes,18 or important properties of room temperature ionic liquids (RTILs), such as toxicity.19-21 A major step in the development of QSPR models is finding a set of molecular descriptors able to represent the variation of the structural features of the molecules, and therefore a wide variety of descriptors have been reported for use in QSPR analysis.22-25 The molecular descriptors chosen (X) are correlated with one or more response variables (Y) using different statistical approaches. Many statistical procedures are available to establish those relationships, such as Partial Least Squares Analysis (PLSA), Multiple Linear Regression (MLR), Artificial Neural Networks (ANN), or Principal Component Analysis (PCA); a good example of the use of QSPR in the classification of solvents through PCA can be found in the literature.26 MLR27 is probably the most widely used because it is simple and intuitive.
Scheme 1. General structure and codification of the glycerol-derived
solvents used in this study.
From an industrial application point of view, three main solvent features must be taken into account: 1) behaviour in dissolution processes, which can be well defined through the solvatochromic parameter $E_T^N$ (see below),28-29 2) mechanical aspects, which can be quantified by viscosity, and 3) volatility aspects, closely related to safety, toxicity and air pollution, which can be considered through the boiling point.
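[Scheme 1: glycerol skeleton bearing the groups R1O, OR2 and OR3.]

| R | Code |
|---|---|
| H | 0 |
| Me | 1 |
| Et | 2 |
| iPr | 3i |
| nBu | 4 |
| iBu | 4i |
| tBu | 4t |
| CF3CH2 | 3F |
| CF3CF2CH2 | 5F |
| CF3(CF2)2CH2 | 7F |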
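The codification of Scheme 1 can be decoded mechanically. The following Python sketch is only an illustration: the function names `decode` and `hydroxyl_count` are hypothetical, and it assumes (as the systematic names in Table 1 suggest) that the three fragments of a solvent code correspond to R1, R2 and R3 in that order.

```python
import re

# Substituent codes taken from Scheme 1, stored as code -> R group.
CODE_TO_GROUP = {
    "0": "H", "1": "Me", "2": "Et", "3i": "iPr", "4": "nBu",
    "4i": "iBu", "4t": "tBu", "3F": "CF3CH2", "5F": "CF3CF2CH2",
    "7F": "CF3(CF2)2CH2",
}

# Two-character codes are listed first so they win over one-character codes.
_TOKEN = re.compile(r"3i|4i|4t|3F|5F|7F|[0124]")


def decode(code):
    """Split a solvent code such as '3i03F' into its three R groups."""
    tokens = _TOKEN.findall(code)
    if len(tokens) != 3 or "".join(tokens) != code:
        raise ValueError(f"not a valid three-fragment solvent code: {code!r}")
    return tuple(CODE_TO_GROUP[t] for t in tokens)


def hydroxyl_count(code):
    """Number of free OH groups, i.e. positions where R = H (code 0)."""
    return decode(code).count("H")


print(decode("3i03F"))        # ('iPr', 'H', 'CF3CH2')
print(hydroxyl_count("100"))  # 2
```

Counting the fragments equal to 0 gives the number of free hydroxyl groups, which separates the diols, mono-alcohols and triethers of the series.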
Table 1. List of properties in the 62 glycerol solvents studied in the
present work.
| Solvent | Code | $E_T^N$ | Visc. (cP) | b.p. (ºC) |
|---|---|---|---|---|
| 1,2,3-propanetriol | 000 | 0.812 | 934 (ref. 30) | 290 (ref. 30) |
| 3–methoxy–1,2–propanediol | 100 | 0.710 | 37.72 | 222 |
| 3–ethoxy–1,2–propanediol | 200 | 0.690 | 35.14 | 221 |
| 3–n–butoxy–1,2–propanediol | 400 | 0.680 | 42.03 | 249 |
| 1,3–dimethoxy–2–propanol | 101 | 0.610 | 3.46 | 170 |
| 1–isopropoxy–3–methoxy–2–propanol | 103i | 0.490 | 3.38 | 188 |
| 1–n–butoxy–3–methoxy–2–propanol | 104 | 0.480 | | 208 |
| 1–isobutoxy–3–methoxy–2–propanol | 104i | 0.490 | | 200 |
| 1–tert–butoxy–3–methoxy–2–propanol | 104t | 0.440 | | 195 |
| 1–n–butoxy–3–isopropoxy–2–propanol | 403i | 0.470 | 4.59 | 223 |
| 1,3–di–n–butoxy–2–propanol | 404 | 0.450 | 5.53 | 248 |
| 3–n–butoxy–1–tert–butoxy–2–propanol | 404t | 0.390 | | 230 |
| 3–n–butoxy–1–isobutoxy–2–propanol | 404i | 0.460 | | 229 |
| 1–ethoxy–3–isopropoxy–2–propanol | 203i | 0.450 | | 187 |
| 1–n–butoxy–3–ethoxy–2–propanol | 204 | 0.450 | | 220 |
| 1–tert–butoxy–3–ethoxy–2–propanol | 204t | 0.410 | | 204 |
| 1–isobutoxy–3–ethoxy–2–propanol | 204i | 0.460 | | 214 |
| 1,3–diisopropoxy–2–propanol | 3i03i | 0.460 | | 202 |
| 1–tert–butoxy–3–isopropoxy–2–propanol | 3i04t | 0.370 | | 202 |
| 1–isobutoxy–3–isopropoxy–2–propanol | 3i04i | 0.440 | | 215 |
| 1–isopropoxy–3–(2,2,2–trifluoroethoxy)–2–propanol | 3i03F | 0.590 | 6.90 | 176 |
| 1–n–butoxy–3–(2,2,2–trifluoroethoxy)–2–propanol | 403F | 0.600 | | 210 |
| 1–tert–butoxy–3–(2,2,2–trifluoroethoxy)–2–propanol | 4t03F | 0.570 | 8.61 | 199 |
| 1–isobutoxy–3–(2,2,2–trifluoroethoxy)–2–propanol | 4i03F | 0.600 | | 205 |
| 1,3–bis(2,2,2–trifluoroethoxy)–2–propanol | 3F03F | 0.700 | 8.14 | 197 |
| 1,3–bis(2,2,3,3,3–pentafluoropropoxy)–2–propanol | 5F05F | 0.699 | | 204 |
| 1,3–bis(2,2,3,3,4,4,4–heptafluorobutoxy)–2–propanol | 7F07F | 0.685 | 19.60 | 206 |
| 1,2,3–trimethoxypropane | 111 | | | 150 |
| 1–isopropoxy–2,3–dimethoxypropane | 113i | | 1.03 | 170 |
| 2–n–butoxy–3–methoxy–1–isopropoxypropane | 143i | 0.145 | | 215 |
| 1–tert–butoxy–2,3–dimethoxypropane | 114t | 0.214 | | 180 |
| 2–n–butoxy–1–tert–butoxy–3–methoxypropane | 144t | | | 219 |
| 1–n–butoxy–2,3–dimethoxypropane | 114 | 0.178 | | 199 |
| 1,2–di–n–butoxy–3–methoxypropane | 144 | | | 234 |
| 1–isobutoxy–2,3–dimethoxypropane | 114i | | | 193 |
| 2–n–butoxy–1–isobutoxy–3–methoxypropane | 144i | | | 227 |
| 2–ethoxy–3–methoxy–1–isopropoxypropane | 123i | 0.167 | | 161 |
| 3–ethoxy–2–methoxy–1–isopropoxypropane | 213i | 0.171 | | 183 |
| 1–tert–butoxy–2–ethoxy–3–methoxypropane | 124t | | | 190 |
| 1–tert–butoxy–3–ethoxy–2–methoxypropane | 214t | 0.150 | | 193 |
| 1–n–butoxy–2–ethoxy–3–methoxypropane | 124 | 0.155 | | 209 |
| 1–n–butoxy–3–ethoxy–2–methoxypropane | 214 | 0.164 | | 209 |
| 1–isobutoxy–2–ethoxy–3–methoxypropane | 124i | | | 198 |
| 1–isobutoxy–3–ethoxy–2–methoxypropane | 214i | 0.161 | | 201 |
| 2,3–diethoxy–1–isopropoxypropane | 223i | | | 192 |
| 1–tert–butoxy–2,3–diethoxypropane | 224t | 0.155 | | 199 |
| 1–n–butoxy–2,3–diethoxypropane | 224 | 0.161 | | 217 |
| 1–isobutoxy–2,3–diethoxypropane | 224i | | | 210 |
| 1–n–butoxy–2–methoxy–3–isopropoxypropane | 413i | 0.155 | 1.67 | 218 |
| 1–n–butoxy–2–ethoxy–3–isopropoxypropane | 423i | | | 222 |
| 3–n–butoxy–1–tert–butoxy–2–methoxypropane | 414t | 0.141 | | 234 |
| 3–n–butoxy–1–tert–butoxy–2–ethoxypropane | 424t | | | 211 |
| 1,3–di–n–butoxy–2–methoxypropane | 414 | 0.145 | 3.78 | 244 |
| 3–n–butoxy–1–isobutoxy–2–methoxypropane | 414i | 0.150 | | 226 |
| 3–n–butoxy–1–isobutoxy–2–ethoxypropane | 424i | | | 241 |
| 3–isopropoxy–2–methoxy–1–(2,2,2–trifluoroethoxy)–propane | 3i13F | | | 180 |
| 3–tert–butoxy–2–methoxy–1–(2,2,2–trifluoroethoxy)–propane | 4t13F | 0.373 | 2.14 | 185 |
| 1,2,3–tri–n–butoxypropane | 444 | | 2.72 | 270 |
| 3–n–butoxy–2–methoxy–1–(2,2,2–trifluoroethoxy)propane | 413F | | | 207 |
| 2–methoxy–1,3–bis(2,2,2–trifluoroethoxy)propane | 3F13F | 0.553 | 2.33 | 178 |
| 2–ethoxy–1,3–bis(2,2,2–trifluoroethoxy)propane | 3F23F | 0.595 | | 171 |
| 2–n–butoxy–1,3–bis(2,2,2–trifluoroethoxy)propane | 3F43F | 0.574 | | 208 |
Therefore, we decided to select for the present work 62 solvents based on glycerol, all of them prepared in our laboratory (Scheme 1 and Table 1),1 and the three above-mentioned properties, also determined by us, were analyzed for this solvent set using several QSPR models.
Results and discussion
Molecular structure definition
There are many ways of describing the structure of a chemical compound as a vector of numbers. In this work we have used two different approaches, based on molecular connectivity descriptors: topological parameters and DARC/PELCO descriptors.
Figure 1. DARC/PELCO scheme used to describe glycerol based
solvent structures.
Topological parameters are based on the molecular graph of each compound.25,31 They are easily determined from the connectivity and adjacency matrices of each compound. The number of connected components of a graph is a topological invariant that measures the number of structurally independent or disjoint subnetworks. These parameters are excellent descriptors of molecular size, shape and flexibility. They are global parameters in the sense that the whole molecular structure is condensed in a single number. The topological descriptors selected for QSPR studies in this work are: i) Hydrogen bond acceptor counters (HBA), ii) Hydrogen bond donor counters (HBD), iii) Rotatable bond counters (RB), iv) Flexibility index (ϕ),32 v) Balaban index (Bal),33 vi) Wiener index (W),34 vii) Zagreb index (Z),35 viii) Kier shape index (κn),32 ix) Subcount index (SC),36 and x) Connectivity index (χ).25 Full definitions of the indices used in the statistical analyses are given in the Supporting Information.
DARC/PELCO (Description, Acquisition, Retrieval and Computer–aided design / Perturbation of an Environment which is Limited, Concentric and Ordered)37 is another excellent way to describe chemical structures, yet it is much less used in QSPR studies. This system is particularly suitable for studying families of compounds with a common chemical substructure. The DARC/PELCO method is based on the exhaustive generation of all topochromatic sites around the reference structure (F0), which corresponds to the glycerol skeleton common to all structures, and the evaluation of their contribution to the property. The DARC/PELCO descriptors are local, since each one indicates the presence or absence of a group of atoms in a given molecular position. Their definition is shown in Figure 1. In this definition we have incorporated the symmetry of the glycerol derivatives used, by assuming that the contributions of groups occupying equivalent positions (i.e. those linked to carbons 1 and 3 of the glycerol moiety) will display the same influence on the property under study. Preliminary studies have demonstrated that this simplification does not alter the results of the regression analyses.
[Figure 1 site labels: F0; A1, B1, C1, D1; A2, B2, C2, D2; (BF2), (CF2), (DF2).]
Solvent Properties Selection
Solvent polarity ($E_T^N$)
Solvent polarity parameters have demonstrated their usefulness not only to classify organic solvents but also to explain solvent effects on very different physical and chemical processes. An excellent overview of solvent polarity parameters and their applications can be found in Reichardt’s outstanding book.8 Although there are several procedures to quantify solvent polarity, solvatochromic measurements of probe dyes are undoubtedly the most successful methodology for an accurate determination of this solvent feature, owing to their ease of determination and their high sensitivity to small polarity changes. From this point of view, the Dimroth and Reichardt $E_T(30)$ parameter28-29 is one of the most widely used parameters. $E_T(30)$ values represent a blend of dipolarity/polarizability and hydrogen bond donor solvation abilities of the solvent, the latter feature contributing to the total $E_T(30)$ value to a greater extent. $E_T^N$ is a normalized form of $E_T(30)$, taking the value 0 for hexadecane and 1 for water.
Viscosity.
Viscosity describes a fluid's internal resistance to flow and may be thought of as a measure of fluid friction. This property is particularly interesting from the viewpoint of possible large scale industrial applications, where big solvent volumes have to be stirred and pumped from one place to another.
Boiling point
One major problem concerning the use of organic solvents is the presence of traces of these compounds in the air. Indeed, the most common volatile organic compounds (VOCs) are solvents. Nowadays a great effort is being made to solve this problem by substituting these volatile solvents with others that are less volatile or non-volatile. For this reason it is important not only to measure this property but also to predict it. Boiling point is a quick and easy way to estimate the volatility of a solvent, since in general a higher boiling point correlates with a lower volatility at ambient pressure and temperature.
Quantitative Structure Properties Relationships
Multiple Linear Regression (MLR) with topological indices.
It is often assumed that the relationship between structural parameters and experimental properties is well represented by a linear model:
$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$, or $Y = X \cdot B$ (in matrix form) Eq. (1)
In Eq. (1) the $b_i$ are unknown coefficients, and the objective of regression analysis is to estimate their values. As QSPR data sets consist of variables that are diverse in range, variation and size, auto-scaling is usually applied prior to regression analysis, i.e., the ith column is mean centred (with $\bar{x}_i$) and scaled with $1/\mathrm{SD}(x_i)$, where SD is the standard deviation. When X is of full rank the least squares solution is $B = (X^T X)^{-1} X^T Y$, where B is the estimator vector for the regression coefficients. However, very often not all these coefficients are statistically significant, so the final QSPR model should keep only those descriptors that really contribute to the variation in the observed property. To this end we used a stepwise method for variable selection. In this way, independent variables $x_i$ enter and leave the regression equation, and only those with statistically significant coefficients are finally kept in the model.
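The auto-scaling and least-squares steps of Eq. (1) can be sketched in a few lines of plain Python. This is a minimal illustration, not the software used in this work: the function names are hypothetical, the sample standard deviation (n − 1 denominator) is an assumption, stepwise selection is not included, and the data are synthetic values generated from a known linear relation purely to check the arithmetic.

```python
from statistics import mean, stdev


def autoscale(columns):
    """Mean-centre each descriptor column and scale it by 1/SD."""
    return [[(v - mean(c)) / stdev(c) for v in c] for c in columns]


def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def mlr(columns, y):
    """Return [b0, b1, ...] for y regressed on the auto-scaled columns."""
    design = [[1.0] * len(y)] + autoscale(columns)  # intercept + descriptors
    k = len(design)
    xtx = [[sum(p * q for p, q in zip(design[i], design[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(design[i], y)) for i in range(k)]
    return solve(xtx, xty)  # normal equations (X^T X) B = X^T Y


# Synthetic check: y = 1 + 2*x1 + 3*x2 exactly.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
coefficients = mlr([x1, x2], y)
print(coefficients)
```

Because the descriptors are auto-scaled, the intercept equals the mean of y and each slope is the original slope multiplied by the standard deviation of its descriptor, which makes coefficients of descriptors with very different ranges directly comparable.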
The three regression equations obtained for the three
experimental properties fitted are the following:
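$E_T^N = b_0 + b_1 \cdot \mathrm{HBA} + b_2 \cdot \mathrm{HBD} + b_3 \cdot \mathrm{SC}_{?}$ Eq. (2)
$\eta = b_0 + b_1 \cdot \mathrm{HBD}$ Eq. (3)
$\mathrm{bp} = b_0 + b_1 \cdot \mathrm{RB} + b_2 \cdot \mathrm{HBD} + b_3 \cdot \mathrm{HBA}$ Eq. (4)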
The corresponding coefficient values and MLR parameters
are shown in Table 2.
Table 2. Linear regression parameters from equations 2–4. (a)

| | $E_T^N$ | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.206 ± 0.035 | — (b) | 111.1 ± 17.0 |
| b1 ± e1 | 0.073 ± 0.010 | 14.50 ± 3.56 | 11.8 ± 1.9 |
| b2 ± e2 | 0.194 ± 0.021 | — | 24.7 ± 5.0 |
| b3 ± e3 | −0.019 ± 0.004 | — | −3.2 ± 1.2 |
| N | 46 | 17 | 62 |
| R2 | 0.957 | 0.823 | 0.769 |
| σ(y) | 0.0437 | 7.51 | 12.2 |
| F (c) | 72.39 | 74.57 | 64.52 |

a bi are the coefficients of each regression; ei is the tolerance of the bi value at the 95% confidence level. N is the number of cases (solvent data) used in each regression; R2 is the determination coefficient.
b As the b0 coefficient turned out to be non-significant in the standard MLR analysis, the fit was done by forcing the equation to pass through the origin of coordinates. A slight improvement in R2 was obtained in this way.
c F(3,43, 0.05) = 2.84; F(1,18, 0.05) = 4.41; F(3,59, 0.05) = 2.84. All equations are statistically significant (confidence level > 95%).
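As a consistency check, the viscosity model of Eq. (3) can be refitted from the Table 1 data with a one-variable least-squares fit through the origin. The sketch below rests on two assumptions that are not stated explicitly in the text: HBD is taken as the number of free hydroxyl groups (the number of 0 fragments in the solvent code), and glycerol (000) is left out so that N = 17. The variable and function names are illustrative only.

```python
from math import sqrt

# (solvent code, assumed HBD = number of free OH groups, viscosity in cP),
# taken from Table 1; glycerol (000) is omitted, an assumption made to match N = 17.
DATA = [
    ("100", 2, 37.72), ("200", 2, 35.14), ("400", 2, 42.03),
    ("101", 1, 3.46), ("103i", 1, 3.38), ("403i", 1, 4.59),
    ("404", 1, 5.53), ("3i03F", 1, 6.90), ("4t03F", 1, 8.61),
    ("3F03F", 1, 8.14), ("7F07F", 1, 19.60),
    ("113i", 0, 1.03), ("413i", 0, 1.67), ("414", 0, 3.78),
    ("4t13F", 0, 2.14), ("444", 0, 2.72), ("3F13F", 0, 2.33),
]


def fit_through_origin(x, y):
    """Least-squares slope of y = b1 * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


hbd = [row[1] for row in DATA]
visc = [row[2] for row in DATA]

b1 = fit_through_origin(hbd, visc)
residuals = [v - b1 * h for h, v in zip(hbd, visc)]
# Standard error with N - 1 degrees of freedom (one fitted coefficient).
sigma = sqrt(sum(r * r for r in residuals) / (len(visc) - 1))

print(f"b1 = {b1:.2f} cP per OH group, sigma(y) = {sigma:.2f} cP")
```

Under these assumptions the slope and standard error come out at about 14.50 and 7.51, the values listed for η in Table 2. The model therefore assigns roughly 29 cP to every diol, 14.5 cP to every mono-alcohol and 0 cP to every triether, which explains its limited quantitative value.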
As can be seen, the hydrogen-bonding ability of the solvent seems to be the most important feature in modeling the three properties under study. This result is consistent with the kind of intermolecular interactions involved. It is well known that $E_T^N$ values are dominated by the HBD ability of the solvent, due to the strong specific solvation established through hydrogen bonding with the phenolate oxygen of the betaine dye. Similarly, the strong solvent-solvent intermolecular hydrogen-bond interactions of most of the glycerol-derived solvents included in the study are at the origin of the viscosity values obtained, and hence of the importance of this coefficient in the MLR model. Finally, the same strong intermolecular interactions can be invoked to explain the high boiling points displayed by most of the solvents considered.
Figure 2 plots the experimental values vs. those calculated with the three MLR models. The dotted line represents the least squares fit between both sets of data.
Figure 2. Plots of predicted vs. experimental values of $E_T^N$ (a), viscosity (b), and boiling point (c), as calculated through MLR analysis using topological indices.
As can be seen, the best results are obtained in the case of the $E_T^N$ solvation parameter, which is consistent with the higher determination coefficient obtained in the MLR analysis. In the other two cases, although there is a clear correlation, as indicated by the grouping of points around the diagonal, the fit is not good enough to lead to a fully predictive model.
The robustness and predictive character of the method were tested by splitting the data into a training set and a test set. The test set was created by extracting eight solvents from the complete set, so the training set consists of 54 solvents. The solvents of the test set (Table 3) were selected to be representative of the whole set, and for all the properties the test set size is within the usually recommended 10-20% of total cases.
Table 3. Subgroup of eight solvents extracted from the complete set in order to create the new training set of 54 solvents.

| Solvent | $E_T^N$ | η (cP) | b.p. (ºC) |
|---|---|---|---|
| 200 | 0.690 | 35.140 | 221 |
| 104 | 0.480 | — | 208 |
| 3i03F | 0.590 | 6.900 | 176 |
| 5F05F | 0.699 | — | 204 |
| 113i | — | 1.030 | 170 |
| 414t | 0.141 | — | 234 |
| 4t13F | 0.373 | 2.140 | 185 |
| 3F23F | 0.595 | — | 171 |
The three new regression equations obtained with the 54 solvents of the training set are summarized in Table 4.
Table 4. Linear regression parameters from equations 2–4 obtained with the training set of solvents. (a)

| | $E_T^N$ | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.196 ± 0.039 | — (b) | 111.4 ± 17.6 |
| b1 ± e1 | 0.071 ± 0.012 | 14.19 ± 4.59 | 11.7 ± 2.0 |
| b2 ± e2 | 0.200 ± 0.024 | — | 25.9 ± 5.4 |
| b3 ± e3 | −0.018 ± 0.005 | — | −3.0 ± 1.4 |
| N | 39 | 13 | 54 |
| R2 | 0.953 | 0.791 | 0.782 |
| σ(y) | 0.0456 | 8.16 | 11.9 |
| F (c) | 238.24 | 45.31 | 59.64 |

a bi are the coefficients of each regression; ei is the tolerance of the bi value at the 95% confidence level. N is the number of cases (solvent data) used in each regression; R2 is the determination coefficient.
b As the b0 coefficient turned out to be non-significant in the standard MLR analysis, the fit was done by forcing the equation to pass through the origin of coordinates. A slight improvement in R2 was obtained in this way.
c F(3,36, 0.05) = 2.88; F(1,12, 0.05) = 4.75; F(3,51, 0.05) = 2.79. All equations are statistically significant (confidence level > 95%).
As can be seen, the regression coefficients are in all cases very close to those calculated with the full set of solvents, which illustrates the robustness of the equations obtained. These new equations were used to predict the polarity, viscosity and boiling point of solvents in the test set. As a measure of the goodness of the prediction we used the mean unsigned error (MUE). In the case of $E_T^N$ the MUE of the fitting of the training set was 0.028, whereas that of the predictions of the test set was 0.030, which represents less than 5% of the whole range of values (0.671). This points to a reasonable predictivity for the model developed. In the case of viscosity the corresponding MUE values for the training and test sets are 7.28 and 5.08, respectively, i.e. 18% of the whole range of values (41.0) in the worst case, which indicates the poorer predictivity of the corresponding equation, although it could still be used in a semi-quantitative way. Finally, in the case of the boiling points, the MUE values for the training and test sets are 8.6 and 10.9, respectively. Again, the error is only slightly higher in the case of the “pure predictions” (test set), representing less than 8% of the whole range of values (140.0), which would allow a reasonable degree of predictivity. A plot comparing the predicted and experimental values of the test set is shown in Figure S1 of the Electronic Supplementary Information (ESI).
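The percentages quoted above follow directly from the MUE values and the property ranges. The short Python sketch below, with hypothetical helper names, shows how the MUE is defined and reproduces that arithmetic from the numbers given in the text.

```python
def mean_unsigned_error(predicted, observed):
    """Average of |predicted - observed| over all cases."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def percent_of_range(error, value_range):
    """Express an error as a percentage of the full range of the property."""
    return 100.0 * error / value_range


# (MUE, range of experimental values) as quoted in the text.
cases = {
    "E_T^N, test set": (0.030, 0.671),
    "viscosity, training set (worst case)": (7.28, 41.0),
    "boiling point, test set": (10.9, 140.0),
}

for label, (mue, value_range) in cases.items():
    print(f"{label}: {percent_of_range(mue, value_range):.1f}% of range")
# E_T^N, test set: 4.5% of range
# viscosity, training set (worst case): 17.8% of range
# boiling point, test set: 7.8% of range
```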
Partial Least Squares (PLS) Regression with topological indices.
One problem when using topological indices is the high pair correlation existing between many of them, given that they often capture similar structural features of the target molecule. This can have undesirable consequences in MLR analyses, since the real significance of a variable cannot be ascertained if it is highly correlated with another one. For instance, when examining variable coefficients in Eq. 2 one should be aware that HBA and $\mathrm{SC}_{?}$ have a pair correlation coefficient as high as 0.828 (full pairwise correlation data are gathered in Table S3 in the ESI).
Table 5. PLS regression results obtained in the treatment of the experimental solvent properties studied in this work.

| bi | $E_T^N$ (a) | η (b) | b.p. (c) |
|---|---|---|---|
| HBA | 0.0141 | 0.238 | 8.9 |
| HBD | 0.1370 | 9.387 | 46.7 |
| RB | 0.0097 | 0.713 | 6.4 |
| ϕ | −0.0104 | −0.001 | −2.4 |
| BalJX | −0.1500 | −3.210 | 31.1 |
| BalJY | −0.0806 | −2.638 | 35.4 |
| Wr | 0.0000 | 0.003 | 0.0 |
| Z | 0.0007 | 0.005 | −0.2 |
| κ? | 0.0026 | −0.024 | 2.1 |
| κ? | −0.0093 | 0.004 | −2.0 |
| κ? | 0.0228 | −1.118 | −5.4 |
| SC? | 0.0030 | −0.013 | 2.3 |
| SC? | 0.0030 | −0.013 | 2.3 |
| SC? | 0.0022 | 0.025 | −1.7 |
| SC? | −0.0006 | 0.136 | −6.4 |
| SC? | 0.0040 | 0.105 | −7.7 |
| χ? | 0.0041 | 0.000 | 1.1 |
| χ? | 0.0044 | −0.034 | 10.8 |
| χ? | 0.0100 | −0.106 | −5.0 |
| χ? | −0.0011 | 0.819 | 0.9 |
| χ? | 0.0211 | −0.188 | 2.6 |
| χ? | −0.0185 | −0.560 | −11.2 |
| χ? | −0.0201 | −0.344 | −12.8 |
| χ? | −0.0242 | −1.185 | 17.0 |
| χ? | −0.0796 | 0.445 | 109.8 |
| χ? | −0.0632 | −3.142 | 12.0 |
| b0 | 0.9997 | 30.973 | −111.3 |
| N | 46 | 17 | 62 |
| R2 (d) | 0.969 (0.954) | 0.700 (0.535) | 0.891 (0.770) |
| σ(y) | 0.036 | 7.29 | 8.1 |

a PLS regression used 4 latent variables built from the 26 original ones.
b PLS regression used 3 latent variables built from the 26 original ones.
c PLS regression used 7 latent variables built from the 26 original ones.
d Values in parentheses correspond to full cross-validated analyses, i.e. each value is predicted by the equation obtained leaving that solvent out. The resulting fit is therefore more representative of the true predictive ability of the model.
A possible solution to this problem is to transform the original variables into a new set of a few orthogonal (uncorrelated) variables that gather most of the total variance of the data. In the case of PLS regression,38,39 both the dependent (y) and the independent (x) variables are projected into a new space, trying to maximize the explanation of the variance of y through the variance of the latent variables x. Once this relationship is found, the PLS coefficients are projected back to the original x-space, to obtain the corresponding regression coefficients.
Figure 3. Plots of predicted vs. experimental values of $E_T^N$ (a), viscosity (b), and boiling point (c), as calculated through PLS analysis using topological indices.
When the PLS regression technique was applied to our problem, slightly better models were obtained for two of the three properties considered. The corresponding coefficients and PLS parameters are shown in Table 5, the most important coefficients corresponding again to the hydrogen-bonding indices. Plots of predicted vs. experimental values of the properties are displayed in Figure 3. As can be seen in these plots, in the case of $E_T^N$ the PLS model fits very well the values of most of the 62 solvents used in the analysis. The MUE of the fitted values is 0.028, identical to that obtained in the previous MLR analysis. The full cross-validated predictions (i.e., those performed by leaving the predicted point out of the PLS calculation of the coefficients) are close to normal predictions in all but one case (7F07F), which points to the robustness of the model and the reliability of the predictions. The MUE in this case is only slightly higher, 0.034. On the other hand, viscosity behaves poorly in the PLS analysis, with a determination coefficient (R2) even lower than that found in the MLR analysis. Again, hydrogen bond donor ability and the α-modified Kier shape index κ3 are the topological variables with the highest coefficients. However, the MUE of the fitted values is 5.86, and that of the cross-validated values increases to 7.88. These values are not far from those obtained in the MLR analyses, although they seem to be too high to allow reliable quantitative predictions.
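The difference between fitted and fully cross-validated errors can be illustrated with a leave-one-out loop. The sketch below does not reproduce the PLS models of Table 5. It applies the same procedure to a one-descriptor model through the origin (the form of Eq. 3), assuming that HBD equals the number of free hydroxyl groups and that glycerol is excluded, so its numbers are not comparable with the PLS figures quoted above. Function names are illustrative.

```python
# (assumed HBD = number of free OH groups, viscosity in cP) for the 17
# solvents with a viscosity in Table 1, glycerol excluded (assumption).
POINTS = [
    (2, 37.72), (2, 35.14), (2, 42.03),
    (1, 3.46), (1, 3.38), (1, 4.59), (1, 5.53),
    (1, 6.90), (1, 8.61), (1, 8.14), (1, 19.60),
    (0, 1.03), (0, 1.67), (0, 3.78), (0, 2.14), (0, 2.72), (0, 2.33),
]


def slope_through_origin(points):
    """Least-squares slope of y = b * x for a list of (x, y) pairs."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)


def mue_fitted(points):
    """MUE when every point is predicted by the model fitted to all points."""
    b = slope_through_origin(points)
    return sum(abs(y - b * x) for x, y in points) / len(points)


def mue_leave_one_out(points):
    """MUE when each point is predicted by a model fitted without it."""
    total = 0.0
    for i, (x, y) in enumerate(points):
        rest = points[:i] + points[i + 1:]
        total += abs(y - slope_through_origin(rest) * x)
    return total / len(points)


fit_error = mue_fitted(POINTS)
loo_error = mue_leave_one_out(POINTS)
print(f"fitted MUE = {fit_error:.2f} cP, leave-one-out MUE = {loo_error:.2f} cP")
```

For this simple model the leave-one-out error is somewhat larger than the fitted error, which is the same pattern the text reports for the PLS fits.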
Table 6. PLS regression results obtained in the treatment of the training set of solvents.

| bi | $E_T^N$ (a) | η (b) | b.p. (c) |
|---|---|---|---|
| HBA | 0.0145 | 0.142 | 4.181 |
| HBD | 0.1390 | 10.982 | 38.876 |
| RB | 0.0097 | 0.717 | 10.645 |
| ϕ | −0.0096 | 0.047 | 1.938 |
| BalJX | −0.1680 | −2.110 | 31.450 |
| BalJY | −0.0990 | −2.036 | 29.459 |
| Wr | 0.0000 | 0.002 | −0.006 |
| Z | 0.0006 | 0.003 | −0.226 |
| κ? | 0.0022 | −0.019 | −0.010 |
| κ? | −0.0088 | 0.047 | 2.044 |
| κ? | 0.0213 | −1.290 | −1.801 |
| SC? | 0.0027 | −0.011 | 0.155 |
| SC? | 0.0027 | −0.011 | 0.155 |
| SC? | 0.0022 | 0.019 | −1.142 |
| SC? | −0.0006 | 0.125 | −3.202 |
| SC? | 0.0042 | 0.079 | −1.859 |
| χ? | 0.0038 | −0.001 | −0.943 |
| χ? | 0.0034 | −0.017 | 3.560 |
| χ? | 0.0101 | −0.128 | −4.585 |
| χ? | −0.0008 | 0.757 | −1.390 |
| χ? | 0.0219 | −0.268 | 7.733 |
| χ? | −0.0180 | −0.441 | −8.287 |
| χ? | −0.0184 | −0.186 | −7.186 |
| χ? | −0.0198 | −0.864 | −0.827 |
| χ? | −0.0699 | 0.959 | 69.209 |
| χ? | −0.0519 | −3.170 | 19.237 |
| b0 | 1.0874 | 22.028 | −56.840 |
| N | 39 | 13 | 54 |
| R2 | 0.967 | 0.668 | 0.874 |
| σ(y) | 0.036 | 7.55 | 8.7 |

a PLS regression used 4 latent variables built from the 26 original ones.
b PLS regression used 3 latent variables built from the 26 original ones.
c PLS regression used 8 latent variables built from the 26 original ones.
Finally, the fitting of boiling points is slightly better with the PLS approach (higher R2 and lower σ(y)), and the resulting model is quite robust, with only three outliers: glycerol itself (000), 444 and 7F07F. In this case, the MUE values are 6.2 (fitted values) and 8.4 (cross-validated values), slightly better than those found in the MLR analyses.
In order to have a more reliable proof of the predictive ability of these equations, we again split the data into the same training and test sets used in the MLR analyses. The results of the corresponding regressions are shown in Table 6. Plots of experimental vs. predicted values (including solvents in the test set) are shown in Figure S2 (ESI).
As can be seen from the values in Table 6, both the goodness of the fitting and the regression coefficients obtained with the training set of solvents are quite similar to those calculated with the full set.
Concerning the prediction errors, the MUE values for $E_T^N$ are 0.027 for the training set (almost identical to that calculated with the full set of solvents) and 0.030 for the test set, which points to a good predictivity of the equations developed. Concerning the viscosity, the corresponding MUE values are 6.33 and 4.62 for the training and test sets, respectively, which are also quite close to that obtained with the full set of solvents (5.86) and point to a worse predictivity of this property by the model developed. Finally, the MUE values for the prediction of boiling points are 6.6 (training set) and 10.3 (test set). Even if the latter is clearly higher, it still represents about 7% of the full range of b.p. values, which may be enough to obtain a reasonable predictivity of this solvent property.
Multiple Linear Regression (MLR) with DARC/PELCO
descriptors.
In this case we again used the stepwise method to include in the regression equation only those variables which are statistically significant. It should be noted that for predictive purposes, given the local character of the DARC/PELCO descriptors, the values of the coefficients of all the variables not included in the final equations must be taken as zero. The three MLR equations thus obtained are the following:
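$E_T^N = b_0 + b_1 \cdot A_{?} + b_2 \cdot B_{?} + b_3 \cdot A_{?} + b_4 \cdot B_{?} + b_5 \cdot C_{?} + b_6 \cdot C_{?}$ Eq. (5)
$\eta = b_0 + b_1 \cdot A_{?} + b_2 \cdot C_{?} + b_3 \cdot A_{?}$ Eq. (6)
$\mathrm{bp} = b_0 + b_1 \cdot D_{?} + b_2 \cdot A_2 + b_3 \cdot C_{?} + b_4 \cdot C_{?} + b_5 \cdot B_2 + b_6 \cdot C_{?} + b_7 \cdot B_{?} + b_8 \cdot A_1$ Eq. (7)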
Table 7. Linear regression parameters from equations 5–7.

| bi | $E_T^N$ | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.851 ± 0.057 | 70.79 ± 5.45 | 278.2 ± 10.6 |
| b1 ± e1 | −0.278 ± 0.023 | −32.50 ± 3.10 | 19.1 ± 3.60 |
| b2 ± e2 | 0.140 ± 0.024 | 6.90 ± 2.40 | −55.6 ± 6.68 |
| b3 ± e3 | −0.160 ± 0.038 | −3.52 ± 2.50 | 33.6 ± 6.32 |
| b4 ± e4 | −0.026 ± 0.012 | — | 12.6 ± 2.49 |
| b5 ± e5 | −0.059 ± 0.032 | — | 7.9 ± 1.86 |
| b6 ± e6 | −0.016 ± 0.014 | — | 12.0 ± 5.84 |
| b7 ± e7 | — | — | 7.0 ± 3.98 |
| b8 ± e8 | — | — | −6.1 ± 3.97 |
| N | 46 | 17 | 62 |
| R2 | 0.972 | 0.981 | 0.933 |
| σ(y) | 0.036 | 2.08 | 6.9 |
| F (a) | 229.23 | 228.29 | 92.18 |

a F(6,40, 0.05) = 2.34; F(3,14, 0.05) = 3.34; F(8,54, 0.05) = 2.18. All equations are statistically significant (confidence level > 95%).
The corresponding coefficient values and MLR parameters are shown in Table 7, and the plots of predicted vs. experimental values of the properties are displayed in Figure 4.
Figure 4. Plots of predicted vs. experimental values of $E_T^N$ (a), viscosity (b), and boiling point (c), as calculated through MLR analysis using DARC/PELCO descriptors.
As can be seen, the fitting of the three properties is better than that obtained with the preceding approaches. Even the viscosity displays good values. In a first approach, this cannot be ascribed to overfitting, given that the final equation has only three independent variables to fit seventeen data points, i.e. more than five times as many data points as variables. Similarly, the boiling point also displays a very good fit, with a low standard error (ca. 7 ºC).
The robustness of the method was tested again by removing the same test set of solvents (Table 3) from the entire data set and, as can be seen from the values gathered in Table 8, the regression coefficients in eq. 5–7 do not change dramatically, all values lying within the calculated confidence margins.
Table 8. Linear regression factors from equations 5–7 using a reduced training set of 54 solvents.

| bi | $E_T^N$ | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.839 ± 0.059 | 74.13 ± 6.81 | 281.3 ± 11.0 |
| b1 ± e1 | −0.289 ± 0.025 | −34.26 ± 3.78 | 18.4 ± 3.5 |
| b2 ± e2 | 0.126 ± 0.026 | 6.99 ± 2.50 | −56.4 ± 6.8 |
| b3 ± e3 | −0.148 ± 0.039 | −2.99 ± 2.99 | 33.0 ± 6.1 |
| b4 ± e4 | −0.030 ± 0.013 | — | 12.2 ± 2.4 |
| b5 ± e5 | −0.055 ± 0.041 | — | 7.8 ± 1.9 |
| b6 ± e6 | −0.016 ± 0.014 | — | 10.6 ± 7.5 |
| b7 ± e7 | — | — | 8.1 ± 4.3 |
| b8 ± e8 | — | — | −6.5 ± 4.1 |
| N | 39 | 13 | 54 |
| R2 | 0.975 | 0.983 | 0.940 |
| σ(y) | 0.035 | 2.05 | 6.6 |
| F | 206.40 | 174.86 | 87.83 |

F(6,33, 0.05) = 2.42; F(3,10, 0.05) = 3.71; F(8,46, 0.05) = 2.18. All equations are statistically significant (confidence level > 95%).
Figure S3 (in ESI) shows the predicted data for the eight members of the test group. It can be seen that the most predictable property is the boiling point, whose deviations from experiment are less than ten percent in the worst case. $E_T^N$ displays a more erratic behaviour, especially in the case of fluorinated compounds, for which deviations are important in relative terms, although they preserve the qualitative order experimentally observed. As expected, the largest deviations correspond to those structural features less represented in the training set (highly branched and highly fluorinated chains). Concerning the MUE, in all cases the values for the fitted values using the solvent training set are lower than those obtained with the topological descriptors (0.024, 1.37 and 4.6 for $E_T^N$, viscosity and b.p., respectively), but these values are significantly higher for the test set (0.051, 2.05 and 10.3, respectively). In any case, these errors represent between 5% and 8% of the full range of values, which points to a reasonably good predictivity of these equations.
As already mentioned, DARC/PELCO descriptors are highly intuitive, given their straightforward matching with the molecular structure. As a consequence, the prediction of the property of a new compound is extremely simple. As an example we present the calculation of the boiling point of a glycerol-derived solvent not belonging to our 62 solvent set, namely 1,2,3-triethoxypropane (222). This compound and its boiling point were described in the literature, so the example represents a “real world” prediction, given that the property was determined by other authors using a different experimental technique. Table 9 gathers the detailed prediction procedure from the calculated regression coefficients. As can be seen, the predicted value (177 ºC) is reasonably close to the experimental one (181 ºC)40, and within the standard regression error (ca. 95% of predicted values should be within a range of ±14 ºC from experimental ones).
Table 9. Example of boiling point prediction of 1,2,3-triethoxypropane (222) using the linear regression obtained with DARC/PELCO descriptors.a

| Descriptor | bi | No. of fragments | Total contribution |
|---|---|---|---|
| F0 | 278.2 | 1 | 278.2 |
| A1 | −6.1 | 1 | −6.1 |
| A2 | −55.6 | 2 | −111.2 |
| B1 | 0.0 | 1 | 0.0 |
| B2 | 7.9 | 2 | 15.8 |
| Sum | | | 176.7 |

a Experimental value: 181 ºC.40
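The additive procedure of Table 9 can be reproduced in a few lines. The following is a minimal sketch, not the authors' code: the coefficients and fragment counts are those listed in Table 9, while the function and variable names are hypothetical.

```python
# Regression coefficients (in ºC) for the boiling point equation, as listed in Table 9.
BP_COEFFICIENTS = {"F0": 278.2, "A1": -6.1, "A2": -55.6, "B1": 0.0, "B2": 7.9}

# DARC/PELCO fragment counts for 1,2,3-triethoxypropane (222), as listed in Table 9.
FRAGMENTS_222 = {"F0": 1, "A1": 1, "A2": 2, "B1": 1, "B2": 2}


def predict_property(coefficients, fragment_counts):
    """Sum coefficient x count over all fragments present in the molecule."""
    return sum(coefficients[name] * count for name, count in fragment_counts.items())


bp_222 = predict_property(BP_COEFFICIENTS, FRAGMENTS_222)
print(f"Predicted b.p. of 222: {bp_222:.1f} ºC (experimental: 181 ºC)")
```

The same function would apply to any other solvent once its fragment counts are read from its structure; descriptors that are absent from the regression equation simply do not appear in the dictionaries.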
Multiple Linear Regression (MLR) with mixed DARC/PELCO
and topological descriptors.
More compact prediction equations (equations 8−10) were obtained by mixing DARC/PELCO and topological indices, thus considering simultaneously local and global structure descriptors, respectively. The coefficients and statistical parameters for these regressions are gathered in Table 10, and the plots of predicted vs. experimental values of the properties are displayed in Figure 5.
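$E_T^N = b_0 + b_1\cdot\mathrm{HBD} + b_2\cdot B_{F2} + b_3\cdot A_1 + b_4\cdot{}^{?}\chi^{v.m.}$   Eq. (8)
$\eta = b_0 + b_1\cdot A_2 + b_2\cdot A_1 + b_3\cdot{}^{0}\chi$   Eq. (9)
$\mathrm{bp} = b_0 + b_1\cdot A_2 + b_2\cdot\mathrm{RB} + b_3\cdot\mathrm{Bal}_{JY} + b_4\cdot{}^{?}\chi^{v.m.} + b_5\cdot{}^{?}\chi^{v.m.}$   Eq. (10)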
Table 10. Linear regression factors from equations 8–10.

| bi | ETN | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.523 ± 0.122 | 67.55 ± 3.94 | 292.6 ± 35.8 |
| b1 ± e1 | 0.140 ± 0.042 | −35.86 ± 2.51 | −49.7 ± 9.0 |
| b2 ± e2 | 0.177 ± 0.020 | −5.27 ± 1.75 | 12.9 ± 3.4 |
| b3 ± e3 | −0.099 ± 0.043 | 0.99 ± 0.23 | −26.0 ± 13.2 |
| b4 ± e4 | −0.026 ± 0.010 | — | 20.4 ± 6.9 |
| b5 ± e5 | — | — | −8.3 ± 7.5 |
| N | 46 | 17 | 62 |
| R2 | 0.968 | 0.989 | 0.932 |
| σ(y) | 0.036 | 1.46 | 6.8 |
| F | 314.31 | 376.64 | 153.47 |

F(6,40, 0.05) = 2.34; F(3,14, 0.05) = 3.34; F(8,54, 0.05) = 2.18. All equations are statistically significant (at the 95% confidence level).
Although the statistical tests are very similar to those obtained with the DARC/PELCO descriptors only, fewer independent variables are used in the final equations, leading to higher ratios of number of cases to number of variables. In the case of viscosity, the number of independent variables does not change, but the standard error of the predictions is slightly improved (from 2.05 to 1.46 cP).
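The change in the cases/variables ratio can be made explicit with a short calculation. This is only an illustrative sketch: N is taken from Table 10, and the numbers of independent variables are counted from the coefficients b1 onwards in Tables 8 and 10 (assuming the full-set DARC/PELCO equations contain the same variables as the reduced-set ones in Table 8). The names used are hypothetical.

```python
# property: (N, variables in the DARC/PELCO-only equation, variables in the mixed equation)
MODELS = {
    "ETN": (46, 6, 4),
    "viscosity": (17, 3, 3),
    "boiling point": (62, 8, 5),
}

# Cases-per-variable ratio for each property and each kind of model.
ratios = {
    prop: (n / k_darc_pelco, n / k_mixed)
    for prop, (n, k_darc_pelco, k_mixed) in MODELS.items()
}

for prop, (r_darc_pelco, r_mixed) in ratios.items():
    print(f"{prop}: {r_darc_pelco:.1f} -> {r_mixed:.1f} cases per variable")
```

Under these counts the ratio rises for ETN and the boiling point and stays the same for viscosity, in line with the statement above.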
The robustness and predictivity of these equations were again tested by splitting the solvent set into training and test sets. The corresponding regression results are gathered in Table 11. As can be seen, there are no significant changes in fitting parameters and regression coefficients. Figure S4 (in ESI) shows the predicted data for the eight members of the test group.
Figure 5. Plots of predicted vs. experimental values of ETN (a), viscosity (b), and boiling point (c), as calculated through MLR analysis using topological indices and DARC/PELCO descriptors.
Table 11. Linear regression factors from equations 8–10 using a reduced training set of 54 solvents.

| bi | ETN | η | b.p. |
|---|---|---|---|
| b0 ± e0 | 0.512 ± 0.129 | 70.41 ± 4.49 | 288.6 ± 38.2 |
| b1 ± e1 | 0.139 ± 0.045 | −37.83 ± 2.71 | −49.3 ± 8.9 |
| b2 ± e2 | 0.173 ± 0.021 | −5.46 ± 1.91 | 13.3 ± 3.5 |
| b3 ± e3 | −0.112 ± 0.050 | 1.03 ± 0.23 | −22.5 ± 14.1 |
| b4 ± e4 | −0.024 ± 0.011 | — | 19.5 ± 6.6 |
| b5 ± e5 | — | — | −9.3 ± 7.6 |
| N | 39 | 13 | 54 |
| R2 | 0.969 | 0.993 | 0.944 |
| σ(y) | 0.038 | 1.35 | 6.2 |
| F | 268.19 | 405.64 | 161.36 |

F(4,35, 0.05) = 2.64; F(3,10, 0.05) = 3.71; F(5,49, 0.05) = 2.42. All equations are statistically significant (at the 95% confidence level).
The comparison of the MUE calculated for the fit of the training set and for the test set indicates that prediction errors are significantly higher in the latter case, but they are in any case lower than those obtained with the preceding models, representing 5-6% of the full range of experimental values in all cases. A summary of the MUE calculated for all the equations developed in this work is gathered in Table 12.
Table 12. Mean unsigned errors (MUE) calculated for the different equations developed in this work.a

| Model | ETN (training) | ETN (test) | η (training) | η (test) | bp (training) | bp (test) |
|---|---|---|---|---|---|---|
| MLR Topol. | 0.028 | 0.030 | 7.28 | 5.08 | 8.6 | 10.9 |
| PLS Topol. | 0.027 | 0.030 | 6.34 | 4.62 | 6.6 | 10.3 |
| MLR D.-P. | 0.024 | 0.051 | 1.37 | 2.05 | 4.6 | 10.3 |
| MLR Mixed | 0.026 | 0.033 | 1.02 | 1.95 | 3.9 | 8.8 |

a Boldface values indicate errors within 5% of the full range of experimental values, and italicized values indicate errors within 8% of the full range.
If we take the MUE calculated for the test set as a measure of the actual predictivity of the equations, we can conclude that good predictive models have been developed for the three properties under study. Topological descriptors seem to be more adequate for the prediction of ETN, mostly due to the poor predictions of DARC/PELCO descriptors for fluorinated solvents. The latter, on the other hand, perform much better in the prediction of viscosities. Overall, the mixed DARC/PELCO-topological model constitutes the best compromise for reasonably predicting the three solvent properties studied here.
A referee suggested that PLS analyses could also be applied to the DARC/PELCO and mixed parameter models. The corresponding results can be found in the Electronic Supplementary Information, but in no case could an improvement over the MLR equations be obtained, so they will not be discussed here.
Experimental
Glycerol-based solvents were obtained by ring opening of either the appropriate glycidol ether (non-symmetric glycerol-based solvents) or epichlorohydrin (symmetric glycerol-based solvents) with the corresponding alkoxide in alcoholic media, and purified by vacuum distillation as described previously.1
The complete list of the 62 solvents used in QSPR analyses and the values of the experimental properties studied are gathered in Table 1.
Different topological descriptors were calculated for the molecular structures of every solvent using Materials Studio Modeling 4.0 from Accelrys. This software can calculate topological descriptors on the basis of molecular structural information. All these descriptors are gathered in Table S1 of the Supporting Information.
DARC/PELCO descriptors were generated from the scheme shown in Figure 1. The presence of a C unit (bearing the corresponding hydrogen atoms) was codified as 1 in the data matrix (2 if the unit is simultaneously present at both symmetric sides of the glycerol moiety). C units bearing fluorine atoms were codified as independent variables (those starting with “F” in the regression analyses). The final DARC/PELCO matrix is gathered in Table S2.
Multiple linear regression analyses were carried out using the SPSS software. In all Tables the following information is provided:
- Regression coefficients bi, as defined previously (B = (XᵀX)⁻¹XᵀY).
- Individual confidence intervals (at the 95% probability level) of each bi coefficient. These confidence intervals are calculated from the estimated standard error of bi and the Student’s t distribution with N−p degrees of freedom:
ei = s.e.(bi)·t(N−p, 0.975)
- Number of cases included in the regression, N.
- Multiple determination coefficient, R2, which is a measure of the proportion of the total variation about the mean of y explained by the regression.
- Standard error of the regression, σ(y), which is the square root of the residual mean square; it is an estimate of the error with which any observed value of y could be predicted by the regression equation.
- F value, defined as the quotient of the regression and residual mean squares. When compared with a Fisher-Snedecor F distribution with p−1 and N−p degrees of freedom, at a 95% probability level (values given in the table footnotes), it allows establishing whether the variance explained by the regression equation is significantly different from that of the error. More strictly, it tests the H0 hypothesis, i.e., that all regression coefficients are zero. If the calculated F value is larger than the tabulated one, the hypothesis is rejected, and the equation is considered statistically significant.
- Stepwise linear regression procedure is a method to select the “best” regression equation from a set of independent variables, x. Each variable is sequentially included in the equation, following its single correlation with the response, y. For each new variable entering, a partial F-test is performed to see if the improvement in the equation is significant. If the variable is accepted, then partial F-tests are also performed for the rest of the variables already in the equation. Those not passing the test are then eliminated. The procedure is repeated until no more variables are included in the equation. Partial F-tests are carried out at a 90% probability level.
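The quantities listed above can be illustrated with a small, dependency-free sketch. It is not the SPSS procedure used in this work; it simply evaluates B = (XᵀX)⁻¹XᵀY, s.e.(bi), R2, σ(y) and F for a toy data set. The function names and the toy data are hypothetical, and p is taken as the number of fitted parameters including b0.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: descriptors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mlr_statistics(descriptors, y):
    """Fit y = b0 + sum(bi * xi) by least squares and return the tabulated statistics."""
    X = [[1.0] + list(row) for row in descriptors]  # leading 1 gives the intercept b0
    n_cases, p = len(X), len(X[0])                  # p counts b0 as a parameter
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    xtx_inv = invert(xtx)
    b = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    fitted = [sum(bi * xi for bi, xi in zip(b, r)) for r in X]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    mean_y = sum(y) / n_cases
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ms_res = ss_res / (n_cases - p)        # residual mean square
    ms_reg = (ss_tot - ss_res) / (p - 1)   # regression mean square
    return {
        "b": b,
        "se": [math.sqrt(ms_res * xtx_inv[i][i]) for i in range(p)],
        "N": n_cases,
        "R2": 1.0 - ss_res / ss_tot,
        "sigma": math.sqrt(ms_res),
        "F": ms_reg / ms_res,
    }


# Toy data (hypothetical, not taken from the paper): one descriptor, five cases.
stats = mlr_statistics([[0], [1], [2], [3], [4]], [1.1, 2.9, 5.2, 6.8, 9.0])
print(stats["b"], round(stats["R2"], 4), round(stats["sigma"], 3), round(stats["F"], 1))
```

To obtain the confidence intervals ei, each standard error would still have to be multiplied by the tabulated Student's t value for N−p degrees of freedom, which the standard library does not provide (for instance, `scipy.stats.t.ppf(0.975, N - p)` could be used if SciPy is available).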
The Mean Unsigned (or Absolute) Error (MUE or MAE) is an average of the absolute errors ei=|ŷi−yi|, where ŷi is the value predicted by the model and yi the experimental value.
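As a minimal sketch of this definition, the MUE can be computed as follows. The boiling points used are the bar labels (in ºC) that survive in the ESI after the Figure S4 caption for the eight test solvents; attributing them to the mixed model is an assumption, supported only by the fact that the result rounds to the 8.8 listed for that model in Table 12. The function name is hypothetical.

```python
def mean_unsigned_error(predicted, experimental):
    """Average of |predicted - experimental| over all cases."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("both sequences must be non-empty and of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)


# Test solvents, in order: 200, 104, 3i03F, 5F05F, 113i, 414t, 4t13F, 3F23F
bp_experimental = [221, 208, 176, 204, 170, 234, 185, 171]
bp_calculated = [236, 209, 190, 198, 173, 219, 189, 183]  # assumed: mixed-model values

mue_bp = mean_unsigned_error(bp_calculated, bp_experimental)
print(f"MUE (boiling point, test set): {mue_bp:.2f} ºC")
```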
Conclusions
In this study three characteristic properties relevant to classify solvents and facilitate the search of substitution uses have been investigated in a series of 62 glycerol derivatives that can be used as solvents. Global topological descriptors, based on the molecular graphs, have been successfully applied to analyze and predict solvent polarities, using both traditional MLR and PLS regression analyses. However, boiling points and viscosities are not so well modeled using this kind of structural variables.
On the other hand, DARC/PELCO local structural descriptors have proved clearly superior to describe the viscosity of this family of solvents. Boiling points are similarly well predicted with both kinds of approaches. Overall, the mixed model with DARC/PELCO and topological descriptors constitutes the best compromise for reasonably predicting the three solvent properties studied in this work.
Highly significant regression equations have been developed for the three properties under study. The robustness and predictive value of these equations have been demonstrated through the use of an independent test set of solvents. Therefore, the QSPR models developed provide significant additional insight into the relationship between the molecular structure and some fundamental solvent properties.
Based on these results, it seems that quantitative structure activity/property relationships (QSAR/QSPR) could be quite useful for in silico prediction of physico-chemical properties, allowing a faster selection of target solvents for a given application.
Acknowledgements
Financial support from the Spanish MINECO (project CTQ2011-28124-C02-01), the European Social Fund (ESF) and the Gobierno de Aragón (Grupo Consolidado E11) is gratefully acknowledged.
Notes and references
a Instituto de Síntesis Química y Catálisis Homogénea, Facultad de Ciencias, CSIC-Univ. de Zaragoza, Pedro Cerbuna, 12, E-50009 Zaragoza, Spain. Tel: +34 976762271; E-mail: jig@unizar.es
b Dept. Organic Chemistry, Facultad de Ciencias, Univ. de Zaragoza, Pedro Cerbuna, 12, E-50009 Zaragoza, Spain.
c Dept. Physical Chemistry, Facultad de Ciencias, Univ. de Zaragoza, Pedro Cerbuna, 12, E-50009 Zaragoza, Spain.
1 García, J. I., García-Marín, H., Mayoral, J. A., Pérez, P., Green Chem., 2010, 12, 426.
2 Cramer, R. D., J. Am. Chem. Soc., 1980, 102, 1837.
3 Carlson, R., Design and Optimization in Organic Synthesis. Ed. Elsevier, Amsterdam, 1992.
4 Chastrette, M., Rajzmann, M., Chanon, M., Purcell, K. F., J. Am. Chem. Soc., 1985, 107, 1.
5 Koppel, I. A., Palm, V. A., The Influence of the Solvent on Organic Reactivity, in Advances in Linear Free Energy Relationships. Ed. Plenum Press, London, 1972.
6 Kamlet, M. J., Abboud, J. L., Abraham, M. H., Taft, R. W., J. Org. Chem., 1983, 48 (17), 2877.
7 Catalán, J., Solvent Effects based on non-HBD Solvents in Handbook of Solvents. Ed. William Andrew Publishing, New York, 2001.
8 Reichardt, C., Solvents and Solvent Effects. 3rd ed.; Ed. Wiley-VCH, Weinheim, 2003.
9 Ravi, M., Hopfinger, A. J., Hormann, R. E., Dinan, L., J. Chem. Inf. Comput. Sci., 2001, 41, 1587.
10 Luke, B. T., J. Mol. Struct. (Theochem.), 1999, 468, 13.
11 Bruneau, P., J. Chem. Inf. Comput. Sci., 2001, 41, 1605.
12 Katritzky, A. R., Petrukhin, R., Tatham, D., J. Chem. Inf. Comput. Sci., 2001, 41, 679.
13 Ghasemi, J., Saaidpour, S., Brown, S. D., J. Mol. Struct. (Theochem.), 2007, 805, 27.
14 Brauner, N., Shacham, M., Cholakov, G. S., Stateva, R. P., Chem. Eng. Sci., 2005, 60, 5458.
15 Lind, P., Lopes, C., Oberg, K., Eliasson, B., Chem. Phys. Lett., 2004, 387, 238.
16 Ungerer, P., Nieto-Draghi, C., Rousseau, B., Ahunbay, G., Lachet, V., J. Mol. Liq., 2007, 134, 71.
17 Ghasemi, J. B., Abdolmaleki, A., Mandoumi, N., J. Hazardous Mat., 2009, 161, 74.
18 Fatemi, M. H., Haghdadi, M., J. Mol. Struct., 2008, 886, 43.
19 Torrecilla, J. S., Palomar, J., Lemus, J., Rodríguez, F., Green Chem., 2010, 12, 123.
20 Alvarez-Guerra, M., Irabien, A., Green Chem., 2011, 13, 1507.
21 Yan, F., Xia, S., Wang, Q., Ma, P., J. Chem. Eng. Data, 2012, 57, 2252.
22 Consonni, V., Todeschini, R., Pavan, M., Gramatica, P., J. Chem. Inf. Comput. Sci., 2002, 42, 693.
23 Krenkel, G., Castro, E. A., Toropov, A. A., J. Mol. Struct. (Theochem.), 2001, 542, 107.
24 Ghasemi, J., Shahmirani, S., Farahani, E. V., Ann. Chim., 2006, 96, 327.
25 Kier, L. B., Hall, L. H., Molecular Connectivity in Structure-Activity Analysis. Ed. Research Studies Press Ltd, New York, 1985.
26 Katritzky, A. R., Fara, D. C., Kuanar, M., Hur, E., Karelson, M., J. Phys. Chem. A, 2005, 109, 10323.
27 Draper, N. R., Smith, H., Applied Regression Analysis. Ed. Wiley-Interscience, 1998.
28 Dimroth, K., Reichardt, C., Siepmann, T., Bohlmann, F., Liebigs Ann. Chem., 1963, 661, 1.
29 Dimroth, K., Reichardt, C., Schweig, A., Liebigs Ann. Chem., 1963, 669, 95.
30 Lide, D. R., Handbook of Chemistry and Physics. 84th ed.; Ed. CRC, New York, 2004.
31 Katritzky, A. R., Gordeeva, E. V., J. Chem. Inf. Comput. Sci., 1993, 835.
32 Hall, L. H., Kier, L. B., Rev. Comput. Chem. II, 1991, 367.
33 Balaban, A. T., Chem. Phys. Lett., 1982, 309.
34 Wiener, H., J. Chem. Phys., 1947, 17.
35 Bonchev, D., Information Theoretic Indices for Characterization of Chemical Structures. Ed. Research Studies Press Ltd., New York, 1983.
36 Kier, L. B., Hall, L. H., Molecular Connectivity Indices in Chemistry and Drug Research. Ed. deStevens, New York, 1976.
37 Dubois, J. E., Computer Representation and Manipulation of Chemical Information. Ed. Wiley, New York, 1974.
38 Wold, S., Ruhe, A., Wold, H., Dunn, W., SIAM J. Sci. Stat. Comput., 1984, 5, 735.
39 Geladi, P., Kowalski, B. R., Anal. Chim. Acta, 1986, 185, 1.
40 Fairbourne, A., Gibson, G. P., Stephens, D. W., J. Chem. Soc., 1931, 445.
Electronic Supplementary Information for
Quantitative structure-property relationships prediction of
some physico–chemical properties of glycerol based solvents.
José I. García,*a Héctor García-Marín,a José A. Mayoral,a,b and Pascual Pérezc
a Instituto de Síntesis Química y Catálisis Homogénea, Facultad de Ciencias, CSIC-Univ. de Zaragoza, Pedro
Cerbuna, 12, E-50009 Zaragoza, Spain. Tel: +34 976762271; E-mail: jig@unizar.es.
b Dept. Organic Chemistry, Facultad de Ciencias, Univ. de Zaragoza, Pedro Cerbuna, 12, E-50009 Zaragoza, Spain.
E-mail: mayoral@unizar.es.
c Dept. Physical Chemistry, Facultad de Ciencias, Univ. de Zaragoza, Pedro Cerbuna, 12, E-50009 Zaragoza, Spain.
E-mail: pascual@unizar.es.
Definition of the topological parameters
Topological indices are usually obtained from two-dimensional molecular structures (molecular graphs, G),
mostly through the connectivity adjacency (A(G)) and topological distance matrices (D(G)), and the vertex
degree vector (δ(G)):
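| A(G) | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 | 1 | 0 |

| δ(G) | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| δ | 1 | 3 | 1 | 2 | 2 | 1 |

| D(G) | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 2 | 2 | 3 | 4 |
| 2 | 1 | 0 | 1 | 1 | 2 | 3 |
| 3 | 2 | 1 | 0 | 2 | 3 | 4 |
| 4 | 2 | 1 | 2 | 0 | 1 | 2 |
| 5 | 3 | 2 | 3 | 1 | 0 | 1 |
| 6 | 4 | 3 | 4 | 2 | 1 | 0 |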
Topological indices are calculated from different invariant features of the molecular graph, and contain information about molecular size, molecular shape, branching, molecular flexibility, etc. The exact definitions of the indices used in this work are given below.

Balaban indices (JX, JY):15, 23
The Balaban index is defined as:
$$J = \frac{M}{M - N + 2} \sum_{\mathrm{bonds}} \frac{1}{\sqrt{s_i^a\, s_j^a}}$$
where M is the number of bonds, N is the number of atoms in the molecule, and si is calculated as the sum of terms from a modified topological distance matrix. In this modified distance matrix, each bond contributes with 1/b to the total connectivity, with b=1 for single bonds, b=2 for double bonds, b=3 for triple bonds, and b=1.5 for aromatic bonds:
$$s_i = \sum_{j=1}^{N} d_{ij}$$
Corrections for heteroatoms have been introduced through contributions for the modification of the electronegativity (X) and the atomic radii (Y):
$$X = 0.4196 - 0.0078\,i + 0.1567\,G_i$$
$$Y = 1.1191 + 0.0160\,i - 0.0537\,G_i$$
where i is the atomic number and Gi is the group number in the Periodic Table of the elements. From these corrections, the $s_i^a$ values are defined as:
$s_i^a = X \cdot s_i$ (for JX index)
$s_i^a = Y \cdot s_i$ (for JY index)

Wiener index (W):16, 24
The Wiener index is defined as the sum of the lengths of the shortest paths between all pairs of vertices in the chemical graph representing the non-hydrogen atoms in the molecule. It is easily computed from the topological distance matrix:
$$W = \frac{1}{2} \sum_{i,j} d_{ij}$$
This index is a measure of the centrality of the graph, and hence it is related to the molecular compactness.

Zagreb index:17
It is defined as the sum of squares of the difference between the number of electrons participating in covalent bonds (σi) and the number of hydrogen atoms bonded to the same atom (hi). This is equivalent to the sum of the squares of the vertex degrees, δi:
$$\mathrm{Zagreb} = \sum_i \delta_i^2 = \sum_i (\sigma_i - h_i)^2$$

Randić and Kier & Hall connectivity indices (χ):18
χ indices were first proposed by Randić25 from the vertex degrees, as:
$$\chi = \sum_{B} \frac{1}{\sqrt{\delta_i \cdot \delta_j}}$$
extended to all bonds in the molecule (B).
Kier and Hall extended the definition by including the number of edges of a given sub-graph (h), and different kinds of sub-graphs (r):
$${}^{h}\chi_{r} = \sum_{n=1}^{\sigma_n} \prod_{i=1}^{h+1} \frac{1}{\sqrt{\delta_i}}$$
where σn is the number of sub-graphs of length h and δ is the vertex degree.
There are four kinds of sub-graphs, known as path (linear chains), cluster (branched chains), path/cluster, and chain (cycles), each one emphasizing a particular aspect of the molecular connectivity. The n superindex refers to the number of bonds considered to calculate the topological index. Thus, n=0 refers to individual atoms, n=1 refers to directly connected atoms, n=2 refers to three atoms connected through two consecutive bonds, and so on:
$${}^{n}P_{sub} = \prod_{i=1}^{n} \left( \frac{1}{\sqrt{\delta_i}} \right) = \frac{1}{\sqrt{\delta_i \times \delta_j \times \dots \times \delta_n}}$$
and hence
$${}^{n}\chi_{sub} \equiv \mathrm{Chi}(n)(sub) = \sum_{s} {}^{n}P_{sub}$$
A further refinement10d, 18 can be included in the χ indices by considering the atom valences, thus allowing the presence of heteroatoms in the structure to be distinguished. This is accomplished by calculating a “corrected” δ value, using the atomic number and the number of valence electrons of the vertex atoms:
$$\delta^{v} = \frac{Z^{v} - h}{Z - Z^{v} - 1}$$
where Zv is the number of valence electrons, Z is the atomic number and h is the number of hydrogen atoms bonded to the vertex atom. The resulting “valence-corrected” indices are named χv.

Kier & Hall count indices (SC):
SC is the count of sub-graphs of a given length present in the molecules. Thus, SC=0 is the number of atoms, SC=1 the number of chemical bonds, SC=2 the number of pairs of adjacent bonds, and so on. For longer sub-graphs, path, cluster, path/cluster and chain types can also be considered.

Kier shape indices (κm):
All the preceding topological indices are heavily influenced by the size of the molecular graph. Kier developed the κ indices to discriminate better between different shapes of the molecules. They are defined from sub-graphs of a given length, taking into account also the maximum and minimum connectivity of the molecule for the same length (a way to “normalize” the κ values, making them independent of the molecular size):
$${}^{m}\kappa = K \cdot \frac{{}^{m}P_{max} \cdot {}^{m}P_{min}}{({}^{m}P_{i})^{2}}$$
where m is the chosen length of the sub-graph, mPi is the number of sub-graphs of length m contained in the total graph, and mPmax and mPmin are the maximum and minimum possible numbers of sub-graphs of length m that the total graph can contain. Some examples are given below.
κ1, K = 2:
$${}^{1}P_{min} = N - 1; \quad {}^{1}P_{max} = \frac{N(N-1)}{2}; \quad {}^{1}P_{i} = \text{number of edges}$$
κ2, K = 2:
$${}^{2}P_{min} = N - 2; \quad {}^{2}P_{max} = \frac{(N-1)(N-2)}{2}; \quad {}^{2}P_{i} = \text{number of pairs of adjacent edges}$$
κ3, K = 4:
$${}^{3}P_{min} = N - 3; \quad {}^{3}P_{max} = \frac{(N-2)^{2}}{4}\ (N\ \text{even}); \quad {}^{3}P_{max} = \frac{(N-1)(N-3)}{4}\ (N\ \text{odd}); \quad {}^{3}P_{i} = \text{number of trios of adjacent edges}$$
Similarly to the χ indices, a modification has been suggested for the κ indices to account for the presence of heteroatoms in the molecular graph.14, 26 In this modification, both the covalent radii and the hybridizations are considered. The ${}^{m}\kappa_{\alpha}$ indices are defined as the ${}^{m}\kappa$ ones, but substituting N by N+α, where α is defined as:
$$\alpha = \sum_{i} \left( \frac{r_i}{r_{Csp^3}} - 1 \right)$$
where ri is the covalent radius of atom i and rCsp3 is taken as 0.77 Å (the covalent radius of a carbon atom with sp3 hybridization).

Molecular flexibility index (ϕ):14
The starting hypothesis to define ϕ is that an infinitely long linear saturated hydrocarbon molecule (i.e. all-sp3 C−C bonds) is infinitely flexible. Flexibility is reduced by the presence of a limited number of atoms, rings, branched chains, and the presence of atoms with covalent radii shorter than that of Csp3:
$$\phi = \frac{{}^{1}\kappa_{\alpha} \cdot {}^{2}\kappa_{\alpha}}{N}$$
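As an illustration of the simplest definitions in this section, the sketch below computes the Wiener, Zagreb, zero-order χ and first-order (Randić) χ indices from an adjacency matrix. It is not the Materials Studio implementation used in this work; the function names and the atom numbering of glycerol are assumptions made for the example, and the graphs are assumed to be connected and hydrogen-suppressed.

```python
from collections import deque
import math


def adjacency(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from 1-based bonded atom pairs."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i - 1][j - 1] = adj[j - 1][i - 1] = 1
    return adj


def distance_matrix(adj):
    """Topological distances (number of bonds) by breadth-first search from each atom."""
    n = len(adj)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist


def degrees(adj):
    """Vertex degree vector: number of neighbours of each atom."""
    return [sum(row) for row in adj]


def wiener(adj):
    """Half the sum of all entries of the distance matrix."""
    return sum(map(sum, distance_matrix(adj))) // 2


def zagreb(adj):
    """Sum of squared vertex degrees."""
    return sum(d * d for d in degrees(adj))


def chi0(adj):
    """Zero-order connectivity index: sum over atoms of 1/sqrt(degree)."""
    return sum(1.0 / math.sqrt(d) for d in degrees(adj))


def chi1(adj):
    """First-order (Randic) index: sum over bonds of 1/sqrt(degree_i * degree_j)."""
    deg = degrees(adj)
    n = len(adj)
    return sum(1.0 / math.sqrt(deg[i] * deg[j])
               for i in range(n) for j in range(i + 1, n) if adj[i][j])


# Example graph of this section: bonds 1-2, 2-3, 2-4, 4-5, 5-6.
example = adjacency(6, [(1, 2), (2, 3), (2, 4), (4, 5), (5, 6)])

# Glycerol heavy atoms, numbered (as an assumption) O1-C2-C3-C4-O5 with O6 bonded to C3.
glycerol = adjacency(6, [(1, 2), (2, 3), (3, 4), (4, 5), (3, 6)])

print(wiener(example), wiener(glycerol), zagreb(glycerol),
      round(chi0(glycerol), 2), round(chi1(glycerol), 2))
```

For glycerol (code 000) these functions give W = 31, Zagreb = 20, 0χ = 4.99 and 1χ = 2.81, which are the values listed for that solvent in Table S1.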
Table S1. Topological parameters of 62 glycerol based solvents.

| Code | HBA | HBD | RB | ϕ | BalJX | BalJY | W | Z | 1κam | 2κam | 3κam | 0SCp | 1SCp | 2SCp | 3SCp | 3SCc | 0χ | 1χ | 2χ | 3χp | 3χcl | 0χvm | 1χvm | 2χvm | 3χpvm | 3χclvm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 000 | 3 | 3 | 5 | 3.02 | 2.572 | 2.814 | 31 | 20 | 5.88 | 3.08 | 2.88 | 6 | 5 | 5 | 4 | 1 | 4.99 | 2.81 | 1.92 | 1.39 | 0.29 | 3.33 | 1.71 | 1.02 | 0.42 | 0.13 |
| 100 | 3 | 2 | 5 | 3.98 | 2.620 | 2.901 | 50 | 24 | 6.88 | 4.05 | 3.72 | 7 | 6 | 6 | 5 | 1 | 5.70 | 3.31 | 2.30 | 1.48 | 0.29 | 4.29 | 2.09 | 1.29 | 0.57 | 0.13 |
| 200 | 3 | 2 | 6 | 4.95 | 2.665 | 2.926 | 76 | 28 | 7.88 | 5.03 | 4.88 | 8 | 7 | 7 | 6 | 1 | 6.41 | 3.81 | 2.66 | 1.75 | 0.29 | 5.00 | 2.68 | 1.50 | 0.73 | 0.13 |
| 400 | 3 | 2 | 8 | 6.91 | 2.723 | 2.939 | 153 | 36 | 9.88 | 6.99 | 6.88 | 10 | 9 | 9 | 8 | 1 | 7.82 | 4.81 | 3.36 | 2.25 | 0.29 | 6.42 | 3.68 | 2.26 | 1.16 | 0.13 |
| 101 | 3 | 1 | 5 | 4.95 | 2.686 | 2.996 | 75 | 28 | 7.88 | 5.03 | 4.88 | 8 | 7 | 7 | 6 | 1 | 6.41 | 3.81 | 2.68 | 1.56 | 0.29 | 5.26 | 2.47 | 1.56 | 0.72 | 0.13 |
| 103i | 3 | 1 | 6 | 5.58 | 2.915 | 3.193 | 143 | 38 | 9.88 | 5.65 | 6.88 | 10 | 9 | 10 | 8 | 2 | 7.98 | 4.66 | 3.87 | 2.02 | 0.70 | 6.83 | 3.45 | 2.49 | 0.98 | 0.37 |
| 104 | 3 | 1 | 8 | 7.89 | 2.788 | 3.033 | 202 | 40 | 10.88 | 7.98 | 7.78 | 11 | 10 | 10 | 9 | 1 | 8.53 | 5.31 | 3.74 | 2.33 | 0.29 | 7.38 | 4.06 | 2.54 | 1.31 | 0.13 |
| 104i | 3 | 1 | 7 | 6.51 | 2.909 | 3.162 | 194 | 42 | 10.88 | 6.58 | 7.78 | 11 | 10 | 11 | 9 | 2 | 8.69 | 5.16 | 4.22 | 2.26 | 0.70 | 7.54 | 3.91 | 3.04 | 1.12 | 0.54 |
| 104t | 3 | 1 | 6 | 4.65 | 3.173 | 3.444 | 180 | 46 | 10.88 | 4.70 | 7.78 | 11 | 10 | 13 | 9 | 5 | 8.91 | 4.96 | 4.99 | 2.17 | 1.85 | 7.76 | 3.76 | 3.53 | 1.07 | 1.24 |
| 403i | 3 | 1 | 9 | 8.40 | 2.973 | 3.205 | 324 | 50 | 12.88 | 8.48 | 9.80 | 13 | 12 | 13 | 11 | 2 | 10.10 | 6.16 | 4.93 | 2.79 | 0.70 | 8.95 | 5.04 | 3.46 | 1.57 | 0.37 |
| 404 | 3 | 1 | 11 | 10.86 | 2.907 | 3.121 | 419 | 52 | 13.88 | 10.96 | 10.88 | 14 | 13 | 13 | 12 | 1 | 10.65 | 6.81 | 4.80 | 3.10 | 0.29 | 9.50 | 5.64 | 3.51 | 1.91 | 0.13 |
| 404t | 3 | 1 | 9 | 7.15 | 3.152 | 3.380 | 388 | 58 | 13.88 | 7.21 | 10.88 | 14 | 13 | 16 | 12 | 5 | 11.03 | 6.46 | 6.05 | 2.94 | 1.85 | 9.88 | 5.35 | 4.51 | 1.66 | 1.24 |
| 404i | 3 | 1 | 10 | 9.36 | 2.984 | 3.203 | 408 | 54 | 13.88 | 9.44 | 10.88 | 14 | 13 | 14 | 12 | 2 | 10.81 | 6.66 | 5.28 | 3.03 | 0.70 | 9.66 | 5.50 | 4.01 | 1.71 | 0.54 |
| 203i | 3 | 1 | 7 | 6.51 | 2.950 | 3.214 | 192 | 42 | 10.88 | 6.58 | 7.78 | 11 | 10 | 11 | 9 | 2 | 8.69 | 5.16 | 4.22 | 2.29 | 0.70 | 7.54 | 4.04 | 2.70 | 1.14 | 0.37 |
| 204 | 3 | 1 | 9 | 8.88 | 2.845 | 3.082 | 262 | 44 | 11.88 | 8.97 | 8.88 | 12 | 11 | 11 | 10 | 1 | 9.23 | 5.81 | 4.10 | 2.60 | 0.29 | 8.08 | 4.64 | 2.74 | 1.47 | 0.13 |
| 204t | 3 | 1 | 7 | 5.46 | 3.176 | 3.434 | 237 | 50 | 11.88 | 5.51 | 8.88 | 12 | 11 | 14 | 10 | 5 | 9.61 | 5.46 | 5.35 | 2.44 | 1.85 | 8.46 | 4.35 | 3.74 | 1.22 | 1.24 |
| 204i | 3 | 1 | 8 | 7.45 | 2.950 | 3.193 | 253 | 46 | 11.88 | 7.53 | 8.88 | 12 | 11 | 12 | 10 | 2 | 9.40 | 5.66 | 4.57 | 2.53 | 0.70 | 8.25 | 4.50 | 3.25 | 1.28 | 0.54 |
| 3i03i | 3 | 1 | 7 | 6.34 | 3.079 | 3.334 | 243 | 48 | 11.88 | 6.40 | 8.88 | 12 | 11 | 13 | 10 | 3 | 9.56 | 5.52 | 5.05 | 2.47 | 1.11 | 8.41 | 4.43 | 3.42 | 1.24 | 0.60 |
| 3i04t | 3 | 1 | 7 | 5.53 | 3.273 | 3.522 | 296 | 56 | 12.88 | 5.58 | 9.80 | 13 | 12 | 16 | 11 | 6 | 10.48 | 5.81 | 6.18 | 2.62 | 2.26 | 9.33 | 4.75 | 4.46 | 1.33 | 1.48 |
| 3i04i | 3 | 1 | 8 | 7.23 | 3.068 | 3.305 | 314 | 52 | 12.88 | 7.30 | 9.80 | 13 | 12 | 14 | 11 | 3 | 10.27 | 6.02 | 5.40 | 2.72 | 1.11 | 9.12 | 4.89 | 3.97 | 1.38 | 0.77 |
| 3i03F | 6 | 1 | 7 | 6.06 | 3.115 | 3.499 | 378 | 60 | 13.67 | 6.21 | 10.67 | 14 | 13 | 17 | 12 | 6 | 11.19 | 6.31 | 6.53 | 2.86 | 2.26 | 8.17 | 4.25 | 3.17 | 1.20 | 0.54 |
| 403F | 6 | 1 | 9 | 7.72 | 3.032 | 3.384 | 484 | 62 | 14.67 | 7.90 | 11.60 | 15 | 14 | 17 | 13 | 5 | 11.73 | 6.96 | 6.41 | 3.17 | 1.85 | 8.72 | 4.86 | 3.21 | 1.53 | 0.31 |
| 4t03F | 6 | 1 | 7 | 5.55 | 3.267 | 3.639 | 450 | 68 | 14.67 | 5.67 | 11.60 | 15 | 14 | 20 | 13 | 9 | 12.11 | 6.60 | 7.66 | 3.01 | 3.41 | 9.10 | 4.57 | 4.21 | 1.29 | 1.42 |
| 4i03F | 6 | 1 | 8 | 6.87 | 3.106 | 3.465 | 472 | 64 | 14.67 | 7.03 | 11.60 | 15 | 14 | 18 | 13 | 6 | 11.90 | 6.81 | 6.88 | 3.10 | 2.26 | 8.88 | 4.71 | 3.72 | 1.34 | 0.72 |
| 3F03F | 9 | 1 | 7 | 6.05 | 3.136 | 3.615 | 557 | 72 | 15.46 | 6.26 | 12.46 | 16 | 15 | 21 | 14 | 9 | 12.82 | 7.10 | 8.01 | 3.25 | 3.41 | 7.94 | 4.07 | 2.91 | 1.15 | 0.49 |
| 5F05F | 13 | 1 | 9 | 6.90 | 3.636 | 4.205 | 1283 | 108 | 21.18 | 7.17 | 6.88 | 22 | 21 | 33 | 32 | 17 | 17.82 | 9.60 | 11.16 | 7.06 | 4.70 | 10.45 | 5.33 | 4.09 | 2.02 | 0.84 |
| 7F07F | 17 | 1 | 11 | 8.01 | 4.071 | 4.718 | 2399 | 144 | 26.90 | 8.33 | 6.20 | 28 | 27 | 45 | 50 | 25 | 22.82 | 12.10 | 14.41 | 10.13 | 6.20 | 12.96 | 6.58 | 5.24 | 2.82 | 1.17 |
| 111 | 3 | 0 | 5 | 5.93 | 2.907 | 3.263 | 102 | 32 | 8.88 | 6.01 | 4.39 | 9 | 8 | 8 | 8 | 1 | 7.11 | 4.35 | 2.85 | 1.97 | 0.20 | 6.22 | 2.85 | 1.77 | 1.04 | 0.12 |
| 113i | 3 | 0 | 6 | 6.51 | 3.111 | 3.431 | 182 | 42 | 10.88 | 6.58 | 6.28 | 11 | 10 | 11 | 10 | 2 | 8.69 | 5.20 | 4.03 | 2.43 | 0.61 | 7.79 | 3.84 | 2.70 | 1.30 | 0.35 |
| 143i | 3 | 0 | 9 | 9.36 | 3.331 | 3.612 | 369 | 54 | 13.88 | 9.44 | 9.26 | 14 | 13 | 14 | 13 | 2 | 10.81 | 6.70 | 5.12 | 3.07 | 0.61 | 9.92 | 5.42 | 3.68 | 1.82 | 0.35 |
| 114t | 3 | 0 | 6 | 5.46 | 3.341 | 3.652 | 225 | 50 | 11.88 | 5.51 | 7.32 | 12 | 11 | 14 | 11 | 5 | 9.61 | 5.49 | 5.16 | 2.58 | 1.77 | 8.72 | 4.15 | 3.74 | 1.39 | 1.23 |
| 144t | 3 | 0 | 9 | 8.02 | 3.513 | 3.789 | 436 | 62 | 14.88 | 8.08 | 10.17 | 15 | 14 | 17 | 14 | 5 | 11.73 | 6.99 | 6.25 | 3.22 | 1.77 | 10.84 | 5.74 | 4.73 | 1.91 | 1.23 |
| 114 | 3 | 0 | 8 | 8.88 | 2.974 | 3.260 | 250 | 44 | 11.88 | 8.97 | 7.32 | 12 | 11 | 11 | 11 | 1 | 9.23 | 5.85 | 3.91 | 2.74 | 0.20 | 8.34 | 4.44 | 2.74 | 1.63 | 0.12 |
| 144 | 3 | 0 | 11 | 11.8 | 3.246 | 3.505 | 470 | 56 | 14.88 | 11.95 | 10.17 | 15 | 14 | 14 | 14 | 1 | 11.36 | 7.35 | 5.00 | 3.38 | 0.20 | 10.46 | 6.03 | 3.73 | 2.15 | 0.12 |
| 114i | 3 | 0 | 7 | 7.46 | 3.089 | 3.383 | 241 | 46 | 11.88 | 7.53 | 7.32 | 12 | 11 | 12 | 11 | 2 | 9.40 | 5.70 | 4.39 | 2.67 | 0.61 | 8.50 | 4.30 | 3.24 | 1.44 | 0.53 |
| 144i | 3 | 0 | 10 | 10.3 | 3.330 | 3.595 | 458 | 58 | 14.88 | 10.40 | 10.17 | 15 | 14 | 15 | 14 | 2 | 11.52 | 7.20 | 5.48 | 3.31 | 0.61 | 10.62 | 5.89 | 4.23 | 1.96 | 0.53 |
| 123i | 3 | 0 | 7 | 7.45 | 3.240 | 3.552 | 232 | 46 | 11.88 | 7.53 | 7.32 | 12 | 11 | 12 | 11 | 2 | 9.40 | 5.70 | 4.42 | 2.55 | 0.61 | 8.50 | 4.42 | 2.92 | 1.37 | 0.35 |
| 213i | 3 | 0 | 7 | 7.45 | 3.158 | 3.465 | 237 | 46 | 11.88 | 7.53 | 7.32 | 12 | 11 | 12 | 11 | 2 | 9.40 | 5.70 | 4.39 | 2.70 | 0.61 | 8.50 | 4.42 | 2.90 | 1.46 | 0.35 |
| 124t | 3 | 0 | 7 | 6.29 | 3.450 | 3.753 | 282 | 54 | 12.88 | 6.35 | 8.22 | 13 | 12 | 15 | 12 | 5 | 10.32 | 5.99 | 5.54 | 2.70 | 1.77 | 9.42 | 4.74 | 3.96 | 1.46 | 1.23 |
| 214t | 3 | 0 | 7 | 6.29 | 3.366 | 3.665 | 288 | 54 | 12.88 | 6.35 | 8.22 | 13 | 12 | 15 | 12 | 5 | 10.32 | 5.99 | 5.52 | 2.85 | 1.77 | 9.42 | 4.74 | 3.94 | 1.54 | 1.23 |
| 124 | 3 | 0 | 9 | 9.87 | 3.114 | 3.395 | 310 | 48 | 12.88 | 9.96 | 8.22 | 13 | 12 | 12 | 12 | 1 | 9.94 | 6.35 | 4.29 | 2.86 | 0.20 | 9.05 | 5.03 | 2.96 | 1.70 | 0.12 |
| 214 | 3 | 0 | 9 | 9.87 | 3.045 | 3.322 | 316 | 48 | 12.88 | 9.96 | 8.22 | 13 | 12 | 12 | 12 | 1 | 9.94 | 6.35 | 4.27 | 3.01 | 0.20 | 9.05 | 5.03 | 2.95 | 1.79 | 0.12 |
| 124i | 3 | 0 | 8 | 8.40 | 3.219 | 3.507 | 300 | 50 | 12.88 | 8.48 | 8.22 | 13 | 12 | 13 | 12 | 2 | 10.10 | 6.20 | 4.77 | 2.79 | 0.61 | 9.21 | 4.89 | 3.46 | 1.51 | 0.53 |
| 214i | 3 | 0 | 8 | 8.40 | 3.146 | 3.430 | 306 | 50 | 12.88 | 8.48 | 8.22 | 13 | 12 | 13 | 12 | 2 | 10.10 | 6.20 | 4.74 | 2.94 | 0.61 | 9.21 | 4.89 | 3.45 | 1.60 | 0.53 |
| 223i | 3 | 0 | 8 | 8.40 | 3.304 | 3.604 | 294 | 50 | 12.88 | 8.48 | 8.22 | 13 | 12 | 13 | 12 | 2 | 10.10 | 6.20 | 4.77 | 2.82 | 0.61 | 9.21 | 5.01 | 3.12 | 1.53 | 0.35 |
| 224t | 3 | 0 | 8 | 7.15 | 3.498 | 3.792 | 352 | 58 | 13.88 | 7.21 | 9.26 | 14 | 13 | 16 | 13 | 5 | 11.03 | 6.49 | 5.90 | 2.97 | 1.77 | 10.13 | 5.33 | 4.16 | 1.61 | 1.23 |
| 224 | 3 | 0 | 10 | 10.86 | 3.197 | 3.471 | 383 | 52 | 13.88 | 10.96 | 9.26 | 14 | 13 | 13 | 13 | 1 | 10.65 | 6.85 | 4.65 | 3.13 | 0.20 | 9.75 | 5.62 | 3.17 | 1.86 | 0.12 |
| 224i | 3 | 0 | 9 | 9.36 | 3.292 | 3.572 | 372 | 54 | 13.88 | 9.44 | 9.26 | 14 | 13 | 14 | 13 | 2 | 10.81 | 6.70 | 5.12 | 3.06 | 0.61 | 9.92 | 5.47 | 3.67 | 1.67 | 0.53 |
| 413i | 3 | 0 | 9 | 9.36 | 3.175 | 3.445 | 384 | 54 | 13.88 | 9.44 | 9.26 | 14 | 13 | 14 | 13 | 2 | 10.81 | 6.70 | 5.10 | 3.20 | 0.61 | 9.92 | 5.42 | 3.67 | 1.90 | 0.35 |
| 423i | 3 | 0 | 10 | 10.32 | 3.329 | 3.598 | 458 | 58 | 14.88 | 10.40 | 10.17 | 15 | 14 | 15 | 14 | 2 | 11.52 | 7.20 | 5.48 | 3.32 | 0.61 | 10.62 | 6.01 | 3.89 | 1.96 | 0.35 |
| 414t | 3 | 0 | 9 | 8.02 | 3.349 | 3.616 | 454 | 62 | 14.88 | 8.08 | 10.17 | 15 | 14 | 17 | 14 | 5 | 11.73 | 6.99 | 6.22 | 3.35 | 1.77 | 10.84 | 5.74 | 4.71 | 1.98 | 1.23 |
| 424t | 3 | 0 | 10 | 8.90 | 3.500 | 3.765 | 535 | 66 | 15.88 | 8.97 | 11.21 | 16 | 15 | 18 | 15 | 5 | 12.44 | 7.49 | 6.60 | 3.47 | 1.77 | 11.55 | 6.33 | 4.93 | 2.05 | 1.23 |
| 414 | 3 | 0 | 11 | 11.86 | 3.105 | 3.357 | 488 | 56 | 14.88 | 11.95 | 10.17 | 15 | 14 | 14 | 14 | 1 | 11.36 | 7.35 | 4.97 | 3.51 | 0.20 | 10.46 | 6.03 | 3.72 | 2.23 | 0.12 |
| 414i | 3 | 0 | 10 | 10.32 | 3.182 | 3.439 | 476 | 58 | 14.88 | 10.40 | 10.17 | 15 | 14 | 15 | 14 | 2 | 11.52 | 7.20 | 5.45 | 3.44 | 0.61 | 10.62 | 5.89 | 4.22 | 2.03 | 0.53 |
| 424i | 3 | 0 | 11 | 11.28 | 3.339 | 3.595 | 559 | 62 | 15.88 | 11.37 | 11.21 | 16 | 15 | 16 | 15 | 2 | 12.23 | 7.70 | 5.83 | 3.56 | 0.61 | 11.33 | 6.47 | 4.44 | 2.10 | 0.53 |
| 3i13F | 6 | 0 | 7 | 6.87 | 3.304 | 3.723 | 444 | 64 | 14.67 | 7.03 | 9.96 | 15 | 14 | 18 | 14 | 6 | 11.90 | 6.85 | 6.70 | 3.27 | 2.17 | 9.14 | 4.64 | 3.37 | 1.52 | 0.53 |
| 4t13F | 6 | 0 | 7 | 6.28 | 3.457 | 3.864 | 522 | 72 | 15.67 | 6.42 | 11.00 | 16 | 15 | 21 | 15 | 9 | 12.82 | 7.14 | 7.83 | 3.42 | 3.33 | 10.06 | 4.95 | 4.41 | 1.61 | 1.41 |
| 444 | 3 | 0 | 14 | 14.84 | 3.443 | 3.683 | 789 | 68 | 17.88 | 14.94 | 13.17 | 18 | 17 | 17 | 17 | 1 | 13.48 | 8.85 | 6.06 | 4.15 | 0.20 | 12.58 | 7.62 | 4.70 | 2.74 | 0.12 |
| 413F | 6 | 0 | 9 | 8.60 | 3.223 | 3.610 | 559 | 66 | 15.67 | 8.78 | 11.00 | 16 | 15 | 18 | 15 | 5 | 12.44 | 7.49 | 6.58 | 3.58 | 1.77 | 9.68 | 5.24 | 3.42 | 1.85 | 0.30 |
| 3F13F | 9 | 0 | 7 | 6.80 | 3.325 | 3.835 | 638 | 76 | 16.46 | 7.02 | 11.72 | 17 | 16 | 22 | 16 | 9 | 13.53 | 7.64 | 8.18 | 3.66 | 3.33 | 8.90 | 4.46 | 3.12 | 1.47 | 0.48 |
| 3F23F | 9 | 0 | 8 | 7.57 | 3.476 | 3.980 | 736 | 80 | 17.46 | 7.80 | 12.76 | 18 | 17 | 23 | 17 | 9 | 14.23 | 8.14 | 8.56 | 3.78 | 3.33 | 9.61 | 5.04 | 3.34 | 1.54 | 0.48 |
| 3F43F | 9 | 0 | 10 | 9.15 | 3.642 | 4.119 | 987 | 88 | 19.46 | 9.41 | 14.72 | 20 | 19 | 25 | 19 | 9 | 15.65 | 9.14 | 9.27 | 4.29 | 3.33 | 11.02 | 6.04 | 4.11 | 1.99 | 0.48 |
Figure S1. Predicted vs. experimental values of ETN, viscosity, and boiling point for the selected solvent test set using MLR analysis with topological parameters (equations 2–4 in the main text).
Figure S2. Predicted vs. experimental values of ETN, viscosity, and boiling point for the selected solvent test set using PLS analysis with topological parameters.
[Bar charts. Panels: ETN; dynamic viscosity (cP); boiling point (ºC). Series: Exp. and Calc. Test solvents on the horizontal axis: 200, 104, 3i03F, 5F05F, 113i, 414t, 4t13F and 3F23F.]
Figure S3. Predicted vs. experimental values of ETN, viscosity, and boiling point for the selected solvent test set using MLR analysis with DARC/PELCO descriptors (equations 5–7 in the text).
Figure S4. Predicted vs. experimental values of ETN, viscosity, and boiling point for the selected solvent test set using MLR analysis with mixed topological and DARC/PELCO descriptors (equations 8–10 in the text).
[Bar charts. Panels: ETN; dynamic viscosity (cP); boiling point (ºC). Series: Exp. and Calc. Test solvents on the horizontal axis: 200, 104, 3i03F, 5F05F, 113i, 414t, 4t13F and 3F23F.]
Table S4. Comparison between MLR and PLS analyses with DARC/PELCO descriptors for the three solvent properties studied.

| Descriptor | ETN (MLR) | ETN (PLSa) | Dynamic viscosity, cP (MLR) | Dynamic viscosity, cP (PLSb) | Boiling point, ºC (MLR) | Boiling point, ºC (PLSc) |
|---|---|---|---|---|---|---|
| B0 | 0.851 | 0.796 | 70.79 | 71.800 | 278.2 | 279.3 |
| A1 | −0.278 | −0.268 | −3.52 | −4.588 | −6.1 | −8.4 |
| A2 | −0.160 | −0.116 | −32.50 | −34.448 | −55.6 | −55.7 |
| B1 | n.s. | −0.045 | n.s. | 0.083 | n.s. | 5.0 |
| B2 | −0.026 | −0.034 | n.s. | 0.546 | 7.9 | 7.7 |
| BF2 | 0.140 | 0.134 | n.s. | 2.606 | 7.0 | 6.7 |
| C1 | n.s. | 0.003 | n.s. | 0.083 | 33.6 | 15.3 |
| C2 | −0.016 | −0.018 | n.s. | 0.799 | 12.6 | 12.5 |
| CF2 | −0.059 | −0.055 | 6.90 | 2.816 | 12.0 | 9.5 |
| D1 | n.s. | 0.003 | n.s. | 0.083 | n.s. | 15.3 |
| D2 | n.s. | −0.015 | n.s. | 0.799 | 19.1 | 18.9 |
| DF2 | n.s. | −0.033 | n.s. | 2.816 | n.s. | 3.0 |
| N | 46 | 46 | 17 | 17 | 62 | 62 |
| R2 | 0.972 | 0.968 | 0.981 | 0.991 | 0.933 | 0.935 |
| σ | 0.036 | 0.036 | 2.08 | 1.28 | 6.9 | 6.3 |

a 4 latent variables. b 5 latent variables. c 6 latent variables.
Given that C1 and D1 are linearly dependent (see Table S3), they behave differently in the stepwise MLR and PLS analyses of the boiling point response. In the former case, the variable entering the equation takes the full value (33.6), whereas in the “back-projection” of the PLS coefficients onto the original variables, each coefficient takes about half of the full value (15.3). Within the solvent set used, the predictions are unaffected by how this contribution is shared, given that all structures for which C1 = 1 also have D1 = 1. Similar, but not identical, behaviour is observed for other highly correlated parameters, such as CF2 and DF2.
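The effect of two linearly dependent descriptors can be illustrated with a small numerical sketch. The data below are synthetic and hypothetical (they are not the solvent set of this work), and minimum-norm least squares is used as a simple stand-in for the PLS back-projection, not as the PLS algorithm itself. When only one of the duplicated columns enters the model, it carries the whole effect; when both are kept, the effect is shared equally, and the fitted values do not change.

```python
import numpy as np

# Hypothetical indicator data (not the paper's solvent set):
# c1 and d1 are identical columns, mimicking C1 = D1 for every structure.
c1 = np.array([0, 0, 1, 1, 0, 1, 0, 1], dtype=float)
d1 = c1.copy()
a2 = np.array([0, 1, 0, 1, 1, 0, 0, 1], dtype=float)  # independent descriptor
const = np.ones_like(c1)

# Synthetic response with a total C1/D1 effect of 30 units.
y = 200.0 - 50.0 * a2 + 30.0 * c1

# (1) Stepwise-like fit: only one of the duplicated columns enters the model.
X_one = np.column_stack([const, a2, c1])
coef_one, *_ = np.linalg.lstsq(X_one, y, rcond=None)

# (2) Both columns kept: the design matrix is rank deficient, and lstsq
# returns the minimum-norm solution, which splits the effect equally.
X_both = np.column_stack([const, a2, c1, d1])
coef_both, *_ = np.linalg.lstsq(X_both, y, rcond=None)

print("one column  :", np.round(coef_one, 3))
print("both columns:", np.round(coef_both, 3))
print("same fitted values:", np.allclose(X_one @ coef_one, X_both @ coef_both))
```

In this sketch the single-column fit assigns 30 to `c1`, whereas the two-column fit assigns 15 to each of `c1` and `d1`, mirroring the full-value versus shared-value pattern described above.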
Table S5. Comparison between MLR and PLS analyses with mixed topological and DARC/PELCO descriptors for the three solvent properties studied.

| Descriptor | E_T^N, MLR | E_T^N, PLS^a | Dynamic viscosity (cP), MLR | Dynamic viscosity (cP), PLS^b | Boiling point (ºC), MLR | Boiling point (ºC), PLS^c |
|---|---|---|---|---|---|---|
| B0 | 0.523 | 0.865 | 67.55 | 156.41 | 292.6 | 171.2 |
| A1 | −0.099 | −0.054 | −5.27 | 6.333 | n.s. | 10.918 |
| A2 | n.s. | −0.005 | −35.86 | −27.740 | −49.7 | −26.743 |
| B1 | n.s. | −0.014 | n.s. | 5.713 | n.s. | −3.438 |
| B2 | n.s. | −0.012 | n.s. | −1.155 | n.s. | 2.847 |
| BF2 | 0.177 | 0.007 | n.s. | −0.224 | n.s. | 1.900 |
| C1 | n.s. | −0.004 | n.s. | 5.713 | n.s. | 4.677 |
| C2 | n.s. | 0.024 | n.s. | 0.904 | n.s. | 5.368 |
| CF2 | n.s. | 0.013 | n.s. | −0.223 | n.s. | −0.045 |
| D1 | n.s. | −0.004 | n.s. | 5.713 | n.s. | 4.677 |
| D2 | n.s. | 0.014 | n.s. | 0.904 | n.s. | 4.454 |
| DF2 | n.s. | −0.006 | n.s. | −0.223 | n.s. | −3.253 |
| HBA | n.s. | 0.035 | n.s. | −1.564 | n.s. | −0.898 |
| HBD | 0.140 | 0.060 | n.s. | 21.407 | n.s. | 15.825 |
| RB | n.s. | 0.032 | n.s. | −14.793 | 12.9 | 15.721 |
| φ | n.s. | −0.014 | n.s. | 7.672 | n.s. | −0.395 |
| BalJX | n.s. | −0.013 | n.s. | −30.362 | n.s. | 0.272 |
| BalJY | n.s. | −0.014 | n.s. | −18.803 | −26.0 | −0.564 |
| W | n.s. | 0.000 | n.s. | −0.054 | n.s. | −0.032 |
| Z | n.s. | −0.002 | n.s. | 3.661 | n.s. | 1.387 |
| κ1 | n.s. | −0.009 | n.s. | −5.738 | n.s. | 0.526 |
| κ2 | n.s. | −0.013 | n.s. | 8.699 | n.s. | −0.765 |
| κ3 | n.s. | 0.022 | n.s. | −5.679 | n.s. | −8.078 |
| SC0 p | n.s. | −0.007 | n.s. | −5.848 | n.s. | 0.463 |
| SC1 p | n.s. | −0.007 | n.s. | −5.848 | n.s. | 0.463 |
| SC2 p | n.s. | 0.006 | n.s. | 7.678 | n.s. | 0.231 |
| SC3 p | n.s. | −0.017 | n.s. | −2.194 | n.s. | −8.409 |
| SC3 cl | n.s. | 0.012 | n.s. | −3.522 | n.s. | 0.869 |
| 0χ | n.s. | −0.003 | 0.99 | 1.811 | n.s. | 0.144 |
| 1χ | n.s. | −0.007 | n.s. | −3.813 | n.s. | 0.412 |
| 2χ | n.s. | 0.008 | n.s. | 8.389 | n.s. | −0.720 |
| 3χp | n.s. | 0.003 | n.s. | −1.152 | n.s. | 3.781 |
| 3χcl | n.s. | 0.002 | n.s. | −7.165 | n.s. | 2.006 |
| 0χvm | −0.026 | −0.039 | n.s. | 3.486 | −8.3 | −3.314 |
| 1χvm | n.s. | −0.009 | n.s. | −6.006 | | 2.333 |
| 2χvm | n.s. | −0.010 | n.s. | 12.129 | 20.4 | 3.026 |
| 3χp vm | n.s. | −0.009 | n.s. | 5.831 | | 2.857 |
| 3χcl vm | n.s. | −0.009 | n.s. | −10.869 | | 1.690 |
| N | 46 | 46 | 17 | 17 | 62 | 62 |
| R² | 0.968 | 0.968 | 0.989 | 0.999 | 0.932 | 0.935 |
| s | 0.036 | 0.036 | 1.46 | 0.20 | 6.8 | 6.3 |

^a 6 latent variables. ^b 13 latent variables. ^c 9 latent variables.