Features Transformation and Normalization:
A Visual Approach
(CDT-24)
Luciano da Fontoura Costa
luciano@ifsc.usp.br
São Carlos Institute of Physics, DFCM/USP
v.2 : 28th March 2020
v.1 : 23rd March 2020
Abstract
Pattern recognition involves extracting features, transforming/normalizing them, and then performing classification
in order to assign new or predefined categories to the entities of interest. In this work, we focus on the transformation
and normalization of features. The concept of morphing a set of points and visualizations of the respective displacement
fields are introduced and applied in order to visualize and better understand the possible effects of feature transformation
and normalization. When applied with care, these operations have the potential to enhance the classification stage.
However, we illustrate that feature transformations and normalizations can also create false clusters as well as merge
existing clusters, so that special attention is required when performing these operations. Some important feature
transformations and normalizations, including the standardization procedure as well as principal component analysis
and linear discriminant analysis, are also presented and briefly discussed.
“The eye does not see things but images of things that
mean other things.”
Italo Calvino.
1 Introduction
Pattern recognition (e.g. [1, 2, 3]) is much more general
and widespread than often realized, as it underlies much
of the intelligence and activities of humans (as well as of
other living beings). As a consequence, pattern recognition has an
ample range of applications, extending from quality con-
trol to text and image interpretation.
Figure 1 depicts the main data and operations involved
in pattern recognition. First, the patterns (or entities)
to be recognized have to be somehow generated (e.g. [4])
based on the respective parameters. Then, a selected set
of features is extracted from the patterns so as to pro-
vide a quantitative characterization of the entities to be
recognized (e.g. [3]). This step is particularly challeng-
ing, as the selection of features is not straightforward and
depends on previous experience with the data, measure-
ments, and classification methods (e.g. [3]). The extracted
features can then be transformed or normalized with the
objective of deriving new features capable of improving
the characterization of the entities. Feature transformations
can act on each feature independently or combine it with
other features. For instance, transformations can be used
to remove noise from features, to make them smoother,
to obtain more discriminative measurements, to reduce
the dimensionality of the feature space, etc. Normaliza-
tions are often required in order to remove translation,
rotation or scaling effects from the features, providing a
more standardized set of features. The final step in our
diagram consists of classification, i.e. assigning previously
defined or new categories to the entities based on
their transformed/normalized features.
The present work focuses on the important stage of
features transformation and normalization. These two
tasks, which can have important effects (wanted and un-
wanted) in pattern recognition, are indeed unavoidable
as the very implementation of any measurement implies
some choice related to transformation and normalization.
For instance, in case we are measuring the weight of fruits
to be considered as respective features, we are immediately
faced with the question of which unit to adopt, such
as grams, ounces, etc. In addition, the very definition of
features often involves transformations combining other
features, such as the ratio between the standard deviation
and the mean of a given measurement (known as the
coefficient of variation).
We will start by introducing the concept of morphing
a set of points, which will be subsequently applied to il-
lustrate the concept and effects of feature transformation
and normalization. Indeed, most feature transformations
and normalizations can be conceptualized in terms of mor-
phing and respective vector fields, which help to visualize
and understand the respective effects, such as creating
false clusters that did not originally exist and merging
clusters when they should be separated. The provided
examples motivate the need for special care and attention
when transforming and normalizing features.
Next, we will discuss independent feature transforma-
tion, in which each feature is transformed into a respective
new feature without considering other features. Transfor-
mations involving the combination of features are covered
subsequently. The especially important normalizations in-
volving the minimum/maximum values of the features, as
well as the standardization procedure, are then presented
and illustrated. The interesting feature normalizations
known as Principal Component Analysis (PCA), which is
an unsupervised methodology, as well as Linear Discrim-
inant Analysis (LDA), a supervised method, are then in-
troduced and discussed. Both of the latter normalizations
involve linear combinations of the original features.
2 Morphing a Set of Points
The term morphing can be understood as changing the
shape of a given geometric structure, as done by the distorting
mirrors in amusement parks. In this section we present the
concept of morphing a set of $N$ discrete points in $\mathbb{R}^2$, each of
them represented in terms of the respective coordinates
$(x, y)$. The extension to higher dimensions is straightforward,
though the respective visualization of the morphing
operation becomes more challenging.
Each given point $(x, y)$ is transformed into a new point
$(\tilde{x}, \tilde{y})$ by the morphing operation. We will consider
that the morphing is implemented through two respective
scalar fields $S_x$ and $S_y$, which are both functions of $x$
and $y$, i.e.

$$\tilde{x} = S_x(x, y)$$
$$\tilde{y} = S_y(x, y) \quad (1)$$
The above concept can be directly extended to points
defined by higher dimensional feature spaces, as is typically
the case in pattern recognition:

$$\tilde{f}_i = S_{f_i}(f_1, f_2, \ldots, f_M) \quad (2)$$

for $i = 1, 2, \ldots, M$ features. For simplicity's sake,
we will illustrate the morphing approach with respect to
points $[x, y]$ in $\mathbb{R}^2$.
In other words, the morphing operation moves each
original point to a new position in the feature space.
The morphing operation can be more conveniently visualized
by decomposing it in terms of the original position
of the point, i.e. $[x, y]$, and the respective displacement
$[D_x(x, y), D_y(x, y)]$:

$$\tilde{x} = x + D_x(x, y)$$
$$\tilde{y} = y + D_y(x, y) \quad (3)$$

This can be understood as transforming the action of
the scalar fields $S_x(x, y)$ and $S_y(x, y)$ into the effect of a
respective vector field $[D_x(x, y), D_y(x, y)]$.
Figure 2 illustrates this decomposition.
It follows from Equations 1 and 3 that:

$$D_x(x, y) = S_x(x, y) - x$$
$$D_y(x, y) = S_y(x, y) - y \quad (4)$$
The displacement vector
$[D_x(x, y), D_y(x, y)]$ provides an interesting conceptual
visualization of the transformation effect, which
will be used in this work in order to illustrate the effect
of several feature transformation/normalization operations.
The following section provides some interesting
examples of how the application of these operations
can substantially change the cluster structure in feature
spaces.
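As a minimal computational illustration of Equations 1-4, the sketch below (in Python, assuming NumPy and Matplotlib are available; the choice of fields and all names are ours, introduced only for illustration) morphs a set of points and visualizes the respective displacement field:

import numpy as np
import matplotlib.pyplot as plt

# Scalar fields S_x(x, y) and S_y(x, y) defining the morphing (Eq. 1);
# here we adopt, as an example, the quadratic fields of Eq. 5 in the next section.
def S_x(x, y):
    return x**2 * np.sign(x)

def S_y(x, y):
    return y**2 * np.sign(y)

# Uniformly distributed points inside the unit circle.
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(3000, 2))
pts = pts[np.hypot(pts[:, 0], pts[:, 1]) <= 1]
x, y = pts[:, 0], pts[:, 1]

# Morphed coordinates (Eq. 1) and respective displacement field (Eq. 4).
xt, yt = S_x(x, y), S_y(x, y)
Dx, Dy = xt - x, yt - y

# Original points, displacement field, and morphed points (cf. Figure 3).
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].plot(x, y, '.', markersize=2)
axes[0].set_title('original points')
axes[1].quiver(x[::20], y[::20], Dx[::20], Dy[::20], angles='xy', scale_units='xy', scale=1)
axes[1].set_title('displacement field')
axes[2].plot(xt, yt, '.', markersize=2)
axes[2].set_title('morphed points')
plt.show()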
3 Features Transformations
Let the points $[x, y]$ be transformed by the following
independent transformation (whose associated displacement
field follows from Eq. 4):

$$\tilde{x} = x^2\,\mathrm{sgn}(x)$$
$$\tilde{y} = y^2\,\mathrm{sgn}(y) \quad (5)$$
Figure 3 depicts the action of the above point transfor-
mation on a circularly bound set of uniformly distributed
points. As summarized by the visualization of the applied
displacement field, the original points moved toward the
center of the coordinate system. This happened more
noticeably with the points near the circular border, as
the magnitude of the applied field is stronger at those
positions, see Fig. 3(b). As a consequence, the density
Figure 1: The several stages involved in the endeavor of pattern recognition. First, patterns are generated according to respective parameter
configurations. Then, features are extracted from those patterns in order to provide respective quantitative representations. Though a
one-dimensional feature space $f_1$ is shown in this diagram for simplicity's sake, higher dimensional spaces are typically involved in pattern
recognition. These features can subsequently be transformed and/or normalized (red arrows), which can involve combinations of the previous
features, yielding a new feature space $\tilde{f}_1$. The transformed/normalized features are then fed to a classification method, which will then
hopefully provide the respective correct categories. The present work focuses on the feature transformation and normalization stage.
Figure 2: The morphing from point $[x, y]$ (blue vector) into the point
$[\tilde{x}, \tilde{y}]$ (orange vector) can be decomposed in terms of the vector sum
of the original point position and the respective displacement vector
$[D_x(x, y), D_y(x, y)]$ (green vector).
near the border of the obtained point distribution is
found to be higher than that near the origin of the coordinate
system, which had its density nearly preserved.
The border of the region where the points were originally
contained also changed shape as a consequence of the
transformation.
This first example corroborates the fact that applying
transformations to features can have substantial effects
on the obtained density and distribution of points.
Figure 4 illustrates another important effect that can
be observed when applying feature transformations. In
this case, the original points were subjected to a displacement
field that forced convergence toward two distinct centers,
namely $[-3, -3]$ and $[3, 3]$. Each of these centers was
implied by a respective gaussian distribution of displacement
magnitudes.
This example illustrates the situation in which false
clusters can arise as a consequence of feature transformations.
Another effect to be avoided is shown in Figure 5, in
which a feature space containing two well-defined clusters
has been mapped into a single cluster by applying
a displacement field (Fig. 5(c)) that does not depend
on $y$ and whose vectors have magnitudes defined by a one-dimensional
gaussian centered at the origin of the coordinate
system.
This situation illustrates that feature transformation/normalization
can impact the separation and shape
of clusters.
Despite the possible problems illustrated above, fea-
tures transformations and normalizations are often very
useful when applied with due care and attention, as
they can help to emphasize clusters, control noise and
bias/distortions in the original data, as well as to reduce
the dimension of the feature space.
The remainder of this work presents some of the feature
transformations and normalizations often adopted in pattern
recognition, but before that we characterize two main
groups of transformations: those depending only on each
feature (independent), and those involving combinations of
features. The general forms of these two types of transformations
are given in Equations 6 and 7, respectively.
Figure 3: Action of the transformation defined by Eq. 5. The initial set of uniformly distributed points (a) is transformed through
the displacement field in (b) into a new set of points (c) that exhibits enhanced density at the respective borders. Observe that this
transformation also changes the initially circular shape of the border of the point distribution. The magnitude of the displacement field has
been reduced to a fraction (0.5) of its original value in order not to clutter the visualization.
Figure 4: Feature transformations, as illustrated here, can inadvertently create clusters that were not present in the uniformly distributed
original data. Each of the concentration centers was implied by a respective radial displacement field (pointing toward the respective center)
with gaussian magnitudes.
$$\begin{bmatrix} \tilde{f}_1 \\ \tilde{f}_2 \\ \vdots \\ \tilde{f}_M \end{bmatrix} = \begin{bmatrix} S_1(f_1) \\ S_2(f_2) \\ \vdots \\ S_M(f_M) \end{bmatrix} \quad (6)$$

$$\begin{bmatrix} \tilde{f}_1 \\ \tilde{f}_2 \\ \vdots \\ \tilde{f}_M \end{bmatrix} = \begin{bmatrix} S_1(f_1, f_2, \ldots, f_M) \\ S_2(f_1, f_2, \ldots, f_M) \\ \vdots \\ S_M(f_1, f_2, \ldots, f_M) \end{bmatrix} \quad (7)$$
For instance, the transformation in Equation 5 is an
independent transformation. Observe also that the transformation
scalar fields, e.g. $S_x$ and $S_y$, will necessarily be
of the same type as the respectively associated displacement
vector fields, e.g. $D_x$ and $D_y$.
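As a brief sketch of this distinction (Python with NumPy; the specific functions are arbitrary illustrative choices of ours), an independent transformation of the form of Equation 6 can be contrasted with a combined transformation of the form of Equation 7 as follows:

import numpy as np

rng = np.random.default_rng(1)
F = rng.normal(size=(100, 2))   # N = 100 entities, M = 2 features f_1 and f_2

# Independent transformation (Eq. 6): each new feature depends only on its own feature.
F_indep = np.column_stack((F[:, 0]**2 * np.sign(F[:, 0]),
                           np.log(1 + np.abs(F[:, 1]))))

# Combined transformation (Eq. 7): each new feature may depend on all original features.
F_comb = np.column_stack((F[:, 0] + F[:, 1],
                          F[:, 0] - F[:, 1]))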
4 MinMax Normalization
Given a feature (or measurement) $f_i$ varying in the interval
$[f_{i,\min}, f_{i,\max}]$, a new respective version of this feature
$\tilde{f}_i$ varying in the interval $[0, 1]$ can be obtained by applying
the following independent feature transformation:

$$\tilde{f}_i = \frac{f_i - f_{i,\min}}{f_{i,\max} - f_{i,\min}} \quad (8)$$
For simplicity’s sake, this normalization transformation
will be henceforth called minmax normalization in the
present work.
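A minimal sketch of Equation 8 applied to an N x M feature matrix (Python with NumPy; the function name is ours):

import numpy as np

def minmax_normalize(F):
    # Map each column (feature) of F to the interval [0, 1], as in Eq. 8.
    # Note: a feature with constant value would lead to a division by zero.
    f_min = F.min(axis=0)
    f_max = F.max(axis=0)
    return (F - f_min) / (f_max - f_min)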
Figure 6 illustrates the minmax feature normalization
with respect to two clusters of uniformly distributed
points in (a). The respective displacement, shown in (b),
implies a vertical expansion of the clusters, resulting in
Figure 5: Two well-separated clusters can be merged as a consequence of certain feature transformations such as that applied in this example.
the elongated clusters in (c). This new shape of the two
clusters can have an impact on the subsequent classification
stage.
5 Unit Transformations
We have already briefly discussed in the introduction of
this work that the units in which the features are taken
can influence the respective classification, possibly requiring
transformation and normalization. Given a measurement
$f_i$ varying in the interval $[f_{i,\min}, f_{i,\max}]$, it is
possible to linearly transform it into another measurement
$f_j$ varying in the interval $[f_{j,\min}, f_{j,\max}]$ by applying
the following equation:

$$f_j = (f_{j,\max} - f_{j,\min})\,\frac{f_i - f_{i,\min}}{f_{i,\max} - f_{i,\min}} + f_{j,\min} \quad (9)$$
As illustrated in Figure 7, this transformation can be
understood as a minmax normalization, yielding the new
variation interval $[0, 1]$, followed by a scaling by
$(f_{j,\max} - f_{j,\min})$ and a translation by $f_{j,\min}$.
Observe that this transformation assumes a linear relationship
between the two features of interest.
For instance, the conversion from Celsius (C) to Fahrenheit
(F) can be obtained by considering the respective intervals
$[0, 100]$ and $[32, 212]$ as:

$$F = (212 - 32)\,\frac{C - 0}{100 - 0} + 32 = 1.8\,C + 32 \quad (10)$$

Similarly, the conversion from Fahrenheit to Celsius can
be immediately obtained as:

$$C = (100 - 0)\,\frac{F - 32}{212 - 32} + 0 = \frac{100}{180}\,(F - 32) \quad (11)$$
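The generic unit transformation of Equation 9 can be sketched as a small Python function (names ours), with the Celsius-to-Fahrenheit conversion of Equation 10 serving as a check:

def unit_transform(f_i, i_min, i_max, j_min, j_max):
    # Linearly map a value from [i_min, i_max] to [j_min, j_max] (Eq. 9).
    return (j_max - j_min) * (f_i - i_min) / (i_max - i_min) + j_min

print(unit_transform(100.0, 0.0, 100.0, 32.0, 212.0))  # 212.0 (boiling point of water)
print(unit_transform(37.0, 0.0, 100.0, 32.0, 212.0))   # 98.6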
6 Standardization
Standardization of a measurement $f_i$ involves subtracting
its mean $\mu_{f_i}$ followed by a division by its respective
standard deviation $\sigma_{f_i}$, i.e.:

$$\tilde{f}_i = \frac{f_i - \mu_{f_i}}{\sigma_{f_i}} \quad (12)$$

This statistical transformation of a feature yields a new
feature $\tilde{f}_i$ that is dimensionless and that has zero mean
and unit standard deviation (e.g. [5]). In addition, a great
deal of the instances of the new measurement $\tilde{f}_i$ will tend
to fall within the interval $[-2, 2]$.
The standardization of a feature $f_i$ can be understood
as moving the center of the respective distribution to zero
(a translation in the feature space), accompanied by a
scale normalization in which the dispersion of the measurements
becomes fixed or standardized.
Because standardization of a feature yields a dimensionless
respective feature, this operation is often applied
in order to normalize the influence of unit choices on the
respective classification.
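A minimal sketch of Equation 12 (Python with NumPy; the function name is ours):

import numpy as np

def standardize(F):
    # Subtract the mean and divide by the standard deviation of each feature (Eq. 12).
    return (F - F.mean(axis=0)) / F.std(axis=0)

F = np.random.default_rng(2).normal(loc=5.0, scale=3.0, size=(500, 2))
Ft = standardize(F)
print(Ft.mean(axis=0), Ft.std(axis=0))   # approximately [0, 0] and [1, 1]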
Figure 8 illustrates the standardization of two clus-
ters composed of uniformly distributed points in a two-
dimensional respective feature space (the same situation
as in the previous example). The respective displacement
field, shown in (b), is similar but with more intense magni-
tudes, as the points undergo larger movements. Observe,
however, that the magnitude of the displacements should
often be considered in relative terms between the involved
displacements (e.g. even larger magnitudes would be ob-
tained if the clusters were further away from the coordi-
nate origin, but the result would still be the same). As
with the minmax normalization, the two clusters end up
with an elongated shape that can impact the subsequent
processing.
Figure 6: The minmax normalization of two well-separated clusters of uniformly distributed points (a). Elongated clusters are obtained (c)
as a consequence of the vertical expansion implied by the respective displacement field (b).
Figure 7: Generic unit transformations from a measurement $f_i$ ranging in the interval $[f_{i,\min}, f_{i,\max}]$ into another measurement $f_j$ varying
in the interval $[f_{j,\min}, f_{j,\max}]$ can be conceptually understood as a minmax transformation followed by a scaling by $(f_{j,\max} - f_{j,\min})$ and
a translation by $f_{j,\min}$.
7 Principal Component Analysis
(PCA)
Given a set of $N$ entities, each represented by a respective
feature vector $\vec{f}_p = [f_{1,p};\, f_{2,p};\, \ldots;\, f_{M,p}]^T$ ($M$ features),
with $p = 1, 2, \ldots, N$, it is possible to apply a transformation
that completely decorrelates these features, allowing
a possible dimensionality reduction, in the sense of yielding
a new set of $m$ features such that $m < M$ while implying
little loss of variation. This can be achieved
by using the statistical transformation typically known as
principal component analysis (PCA, e.g. [6]).
We start by deriving the covariance matrix $K$ of the
feature vectors (which can be understood as instances of a
random vector), which in many cases can be estimated as

$$K_{i,j} = \mathrm{covariance}(f_i, f_j) \approx \frac{\sum_{p=1}^{N} (f_{i,p} - \mu_{f_i})(f_{j,p} - \mu_{f_j})}{N - 1} \quad (13)$$
Once this covariance matrix is obtained, its eigenvalues
and eigenvectors are computed. The eigenvalues are then
sorted in decreasing order, yielding $\lambda_1, \lambda_2, \ldots, \lambda_M$, with
respectively associated eigenvectors $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_M$. The
latter are stacked as rows of an $M \times M$ matrix $Q$, i.e.:

$$Q = \begin{bmatrix} \vec{v}_1 \\ \vec{v}_2 \\ \vdots \\ \vec{v}_M \end{bmatrix} \quad (14)$$

The new feature vectors, $\tilde{\vec{f}}_p$, can now be obtained by
the PCA transformation as:

$$\tilde{\vec{f}}_p = Q\,\vec{f}_p \quad (15)$$
This corresponds to a linear transformation, implying
that the new feature vectors are obtained as linear com-
binations of the original ones. Furthermore, this trans-
formation can be understood as a rotation of the original
feature space so that the data variance is concentrated
Figure 8: The standardization of two well-separated clusters of uniformly distributed points (a). As with the minmax normalization, elongated
clusters are obtained (c) as a consequence of the vertical expansion implied by the respective displacement field (b).
along the first new axes, corresponding to the principal
new features. The variance of the new features is given
by the eigenvalues associated to the respective axes.
It can be shown (e.g. [6]) that PCA yields a new set
of features that are fully uncorrelated, implying a respective
redundancy reduction. As a consequence, the new
covariance matrix will necessarily be diagonal.
We can define the covariance explanation index $\eta$ provided
by the first $m$ new features as:

$$\eta = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{M} \lambda_i} \quad (16)$$
If we set a value for $\eta$, such as 90%, we can keep only
the $m$ features required for achieving at least that covariance
explanation index, thus achieving a reduction of
dimensionality from $M$ to $m$. This is possible because, by
removing covariance, PCA also decreases the redundancy
of the features.
It is often interesting to standardize the features (see
Section 6) prior to PCA.
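A minimal PCA sketch following Equations 13-16 (Python with NumPy; function and variable names are ours), returning the transformed features and the covariance explanation index for the first m components:

import numpy as np

def pca(F, m):
    # F is an N x M feature matrix; keep the first m principal components.
    K = np.cov(F, rowvar=False)              # covariance matrix (Eq. 13)
    eigvals, eigvecs = np.linalg.eigh(K)     # eigendecomposition (K is symmetric)
    order = np.argsort(eigvals)[::-1]        # eigenvalues in decreasing order
    eigvals = eigvals[order]
    Q = eigvecs[:, order].T                  # eigenvectors stacked as rows (Eq. 14)
    F_new = (F - F.mean(axis=0)) @ Q.T       # projection (Eq. 15), after centering the features
    eta = eigvals[:m].sum() / eigvals.sum()  # covariance explanation index (Eq. 16)
    return F_new[:, :m], eta

# Elongated cloud of uniformly distributed points, as in Figure 9.
rng = np.random.default_rng(3)
F = rng.uniform(-1, 1, size=(1000, 2)) * np.array([3.0, 0.5])
F_new, eta = pca(F, m=1)
print(eta)   # most of the variance is explained by the first new axis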
Figure 9 illustrates the effect of PCA on an elongated set
of points. Observe that the first new axis, $\tilde{x}$, results aligned
with the direction of largest variation in the original set
of points.
8 Concluding Remarks
Pattern recognition involves several stages, including fea-
tures transformation and normalization. In this work, we
addressed the latter operations with the help of the con-
cept of morphing a set of points and visualization of re-
spective displacement fields. In addition to revising some
of the main transformations and normalizations, it has
also been shown that these operations can substantially
influence the subsequent stage of classification.
Figure 9: Illustration of the PCA action on an elongated set of uniformly
distributed points. Observe that the first new axis, $\tilde{x}$, aligns itself
along the direction of maximum variation in the original data. The
eigenvalues associated to the two new axes are also shown, which
yield a variance explanation of $\eta = 0.272/(0.272 + 0.011) = 96\%$
when keeping only the first new axis.
Depending on the type of data and transformation/normalization,
we can both enhance and undermine cluster identification.
Therefore, great care should be taken while transforming
and normalizing features.
Acknowledgments.
Luciano da F. Costa thanks CNPq (grant
no. 307085/2018-0) for sponsorship. This work has
benefited from FAPESP grant 15/22308-2.
References
[1] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley Interscience, 2000.
[2] K. Koutroumbas and S. Theodoridis. Pattern Recognition. Academic Press, 2008.
[3] L. da F. Costa. Pattern cognition, pattern recognition. Researchgate, Dec 2019. https://www.researchgate.net/publication/338168835_Pattern_Cognition_Pattern_Recognition_CDT-19. Online; accessed 29-Feb-2020.
[4] L. da F. Costa. Where do patterns to be recognized come from? Researchgate, 2020. https://www.researchgate.net/publication/339599069_Where_Do_Patterns_To_Be_Recognized_Come_From_CDT-22. Online; accessed 09-March-2020.
[5] L. da F. Costa. Statistical modeling. Researchgate, 2019. https://www.researchgate.net/publication/334726352_Statistical_Modeling_CDT-13. Online; accessed 22-Dec-2019.
[6] F. Gewers, G. R. Ferreira, H. F. Arruda, F. N. Silva, C. H. Comin, D. R. Amancio, and L. da F. Costa. Principal component analysis: A natural approach to data exploration. Researchgate, 2019. https://www.researchgate.net/publication/324454887_Principal_Component_Analysis_A_Natural_Approach_to_Data_Exploration. Online; accessed 25-Dec-2019.
Costa’s Didactic Texts CDTs
CDTs intend to be a halfway point between a
formal scientific article and a dissemination text
in the sense that they: (i) explain and illustrate
concepts in a more informal, graphical and acces-
sible way than the typical scientific article; and
(ii) provide more in-depth mathematical develop-
ments than a more traditional dissemination work.
It is hoped that CDTs can also incorporate new
insights and analogies concerning the reported
concepts and methods. We hope these character-
istics will contribute to making CDTs interesting
both to beginners as well as to more senior
researchers.
Each CDT focuses on a limited set of interrelated
concepts. Though attempting to be relatively
self-contained, CDTs also aim at being relatively
short. Links to related material are provided in
order to complement the covered subjects.
Observe that CDTs, which come with absolutely
no warranty, are non-distributable and for non-
commercial use only.
The complete set of CDTs can be found
at: https://www.researchgate.net/project/
Costas-Didactic-Texts-CDTs.