Features Transformation and Normalization:
A Visual Approach
(CDT-24)
Luciano da Fontoura Costa
luciano@ifsc.usp.br
São Carlos Institute of Physics, DFCM/USP
v.2 : 28th March 2020
v.1 : 23rd March 2020
Abstract
Pattern recognition involves extracting features, transforming/normalizing them, and then performing classification
in order to assign new or predefined categories to the entities of interest. In this work, we focus on the transformation
and normalization of features. The concept of morphing a set of points and visualizations of the respective displacement
fields are introduced and applied in order to visualize and better understand the possible effects of feature transformation
and normalization. When applied with care, these operations have the potential to enhance the classification stage.
However, we illustrate that feature transformations and normalizations can also create false clusters as well as merge
existing clusters, so that special attention is required when performing these operations. Some important feature
transformations and normalizations, including the standardization procedure as well as principal component analysis
and linear discriminant analysis, are also presented and briefly discussed.
“The eye does not see things but images of things that
mean other things.”
Italo Calvino.
1 Introduction
Pattern recognition (e.g. [1, 2, 3]) is much more general
and widespread than often realized, as it underlies much
of the intelligence and activities of humans (as well as of
other living beings). As a consequence, pattern recognition has an
ample range of applications, extending from quality con-
trol to text and image interpretation.
Figure 1 depicts the main data and operations involved
in pattern recognition. First, the patterns (or entities)
to be recognized have to be somehow generated (e.g. [4])
based on the respective parameters. Then, a selected set
of features is extracted from the patterns so as to pro-
vide a quantitative characterization of the entities to be
recognized (e.g. [3]). This step is particularly challeng-
ing, as the selection of features is not straightforward and
depends on previous experience with the data, measure-
ments, and classification methods (e.g. [3]). The extracted
features can then be transformed or normalized with the
objective of deriving new features capable of improving
the characterization of the entities. Feature transformations
can act on each feature independently or combine it with
other features. For instance, transformations can be used
to remove noise from features, to make them smoother,
to obtain more discriminative measurements, to reduce
the dimensionality of the feature space, etc. Normaliza-
tions are often required in order to remove translation,
rotation or scaling effects from the features, providing a
more standardized set of features. The final step in our
diagram consists of classification, i.e. assigning previously
defined or new categories to the entities based on
their transformed/normalized features.
The present work focuses on the important stage of
features transformation and normalization. These two
tasks, which can have important effects (wanted and un-
wanted) in pattern recognition, are indeed unavoidable
as the very implementation of any measurement implies
some choice related to transformation and normalization.
For instance, in case we are measuring the weight of fruits
to be considered as respective features, we are immediately
faced with the question of which unit to adopt, such
as grams, ounces, etc. In addition, the very definition of
features often involves transformations combining other
features, such as the ratio between the standard deviation
and the mean of a given measurement (known as the
coefficient of variation).
We will start by introducing the concept of morphing
a set of points, which will be subsequently applied to il-
lustrate the concept and effects of feature transformation
and normalization. Indeed, most feature transformations
and normalizations can be conceptualized in terms of mor-
phing and respective vector fields, which help to visualize
and understand the respective effects, such as creating
false clusters that did not originally exist and merging
clusters when they should be separated. The provided
examples motivate the need for special care and attention
when transforming and normalizing features.
Next, we will discuss independent feature transforma-
tion, in which each feature is transformed into a respective
new feature without considering other features. Transfor-
mations involving the combination of features are covered
subsequently. The especially important normalizations in-
volving the minimum/maximum values of the features, as
well as the standardization procedure, are then presented
and illustrated. The interesting feature normalizations
known as Principal Component Analysis (PCA), which is
an unsupervised methodology, as well as Linear Discrim-
inant Analysis (LDA), a supervised method, are then in-
troduced and discussed. Both of the latter normalizations
involve linear combinations of the original features.
2 Morphing a Set of Points
The term morphing can be understood as changing the
shape of a given geometric structure, as done by the distorting
mirrors in amusement parks. In this section we present the
concept of morphing a set of $N$ discrete points in $\mathbb{R}^2$, each of
them represented in terms of the respective coordinates
$(x, y)$. The extension to higher dimensions is straightforward,
though the respective visualization of the morphing
operation becomes more challenging.
Each given point $(x, y)$ is transformed into a new point
$(\tilde{x}, \tilde{y})$ by the morphing operation. We will consider
that the morphing is implemented through two respective
scalar fields $S_x$ and $S_y$, which are both functions of $x$
and $y$, i.e.

$$\tilde{x} = S_x(x, y)$$
$$\tilde{y} = S_y(x, y) \quad (1)$$
The above concept can be directly extended to points
defined by higher dimensional feature spaces, as is typically
the case in pattern recognition:

$$\tilde{f}_i = S_{f_i}(f_1, f_2, \ldots, f_M) \quad (2)$$

for $i = 1, 2, \ldots, M$ features. For simplicity's sake,
we will illustrate the morphing approach with respect to
points $[x, y]$ in $\mathbb{R}^2$.
In other words, the morphing operation moves each
original point to a new position in the feature space.
The morphing operation can be more conveniently visualized
by decomposing it in terms of the original position
of the point, i.e. $[x, y]$, and the respective displacement
$[D_x(x, y), D_y(x, y)]$:

$$\tilde{x} = x + D_x(x, y)$$
$$\tilde{y} = y + D_y(x, y) \quad (3)$$

This can be understood as transforming the action of
the scalar fields $S_x(x, y)$ and $S_y(x, y)$ into the effect of a
respective vector field $[D_x(x, y), D_y(x, y)]$.
Figure 2 illustrates this decomposition.
It follows from Equations 1 and 3 that:

$$D_x(x, y) = S_x(x, y) - x$$
$$D_y(x, y) = S_y(x, y) - y \quad (4)$$
The displacement vector
$[D_x(x, y), D_y(x, y)]$ provides an interesting conceptual
visualization of the transformation effect, which
will be used in this work in order to illustrate the effect
of several feature transformation/normalization operations.
The following section provides some interesting
examples of how the application of these operations
can substantially change the cluster structure in feature
spaces.
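As a minimal computational illustration of Equations 1-4, the sketch below (in Python, assuming NumPy and Matplotlib are available; the choice of fields and all names are ours, introduced only for illustration) morphs a set of points and visualizes the respective displacement field:

import numpy as np
import matplotlib.pyplot as plt

# Scalar fields S_x(x, y) and S_y(x, y) defining the morphing (Eq. 1);
# here we adopt, as an example, the quadratic fields of Eq. 5 in the next section.
def S_x(x, y):
    return x**2 * np.sign(x)

def S_y(x, y):
    return y**2 * np.sign(y)

# Uniformly distributed points inside the unit circle.
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(3000, 2))
pts = pts[np.hypot(pts[:, 0], pts[:, 1]) <= 1]
x, y = pts[:, 0], pts[:, 1]

# Morphed coordinates (Eq. 1) and respective displacement field (Eq. 4).
xt, yt = S_x(x, y), S_y(x, y)
Dx, Dy = xt - x, yt - y

# Original points, displacement field, and morphed points (cf. Figure 3).
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].plot(x, y, '.', markersize=2)
axes[0].set_title('original points')
axes[1].quiver(x[::20], y[::20], Dx[::20], Dy[::20], angles='xy', scale_units='xy', scale=1)
axes[1].set_title('displacement field')
axes[2].plot(xt, yt, '.', markersize=2)
axes[2].set_title('morphed points')
plt.show()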
3 Features Transformations
Let the points $[x, y]$ be transformed by the following
independent transformation (whose associated displacement
field follows from Eq. 4):

$$\tilde{x} = x^2\,\mathrm{sgn}(x)$$
$$\tilde{y} = y^2\,\mathrm{sgn}(y) \quad (5)$$
Figure 3 depicts the action of the above point transfor-
mation on a circularly bound set of uniformly distributed
points. As summarized by the visualization of the applied
displacement field, the original points moved toward the
center of the coordinate system. This happened more
noticeably with the points near the circular border, as
the magnitude of the applied field is stronger at those
positions, see Fig. 3(b). As a consequence, the density
Figure 1: The several stages involved in the endeavor of pattern recognition. First, patterns are generated according to respective parameter
configurations. Then, features are extracted from those patterns in order to provide respective quantitative representations. Though a
one-dimensional feature space $f_1$ is shown in this diagram for simplicity's sake, higher dimensional spaces are typically involved in pattern
recognition. These features can subsequently be transformed and/or normalized (red arrows), which can involve combinations of the previous
features, yielding a new feature space $\tilde{f}_1$. The transformed/normalized features are then fed to a classification method, which will then
hopefully provide the respective correct categories. The present work focuses on the feature transformation and normalization stage.
Figure 2: The morphing from point $[x, y]$ (blue vector) into the point
$[\tilde{x}, \tilde{y}]$ (orange vector) can be decomposed in terms of the vector sum
of the original point position and the respective displacement vector
$[D_x(x, y), D_y(x, y)]$ (green vector).
near the border of the obtained point distribution is
found to be higher than that near the origin of the coordinate
system, which had its density nearly preserved.
The border of the region where the points were originally
contained also changed shape as a consequence of the
transformation.
This first example corroborates the fact that applying
transformations to features can have substantial effects
on the obtained density and distribution of points.
Figure 4 illustrates another important effect that can
be observed when applying feature transformations. In
this case, the original points were subjected to a displacement
field that forced convergence toward two distinct centers,
namely $[-3, -3]$ and $[3, 3]$. Each of these centers was
implied by a respective gaussian distribution of displacement
magnitudes.
This example illustrates the situation in which false
clusters can arise as a consequence of feature transformations.
Another effect to be avoided is shown in Figure 5, in
which a feature space containing two well-defined clusters
has been mapped into a single cluster by applying
a displacement field (Fig. 5(c)) that does not depend
on $y$ and whose vectors have magnitudes defined by a one-dimensional
gaussian centered at the origin of the coordinate
system.
This situation illustrates that feature transformation/normalization
can impact the separation and shape
of clusters.
Despite the possible problems illustrated above, fea-
tures transformations and normalizations are often very
useful when applied with due care and attention, as
they can help to emphasize clusters, control noise and
bias/distortions in the original data, as well as to reduce
the dimension of the feature space.
The remainder of this work presents some of the feature
transformations and normalizations often adopted in pattern
recognition, but before that we characterize two main
groups of transformations: those depending only on each
feature (independent), and those involving combinations of
features. The general forms of these two types of transformations
are given in Equations 6 and 7, respectively.
Figure 3: Action of the transformation defined by Eq. 5. The initial set of uniformly distributed points (a) is transformed through
the displacement field in (b) into a new set of points (c) that exhibits enhanced density at the respective borders. Observe that this
transformation also changes the initially circular shape of the border of the point distribution. The magnitude of the displacement field has
been reduced to a fraction (0.5) of its original value in order not to clutter the visualization.
Figure 4: Feature transformations, as illustrated here, can inadvertently create clusters that were not present in the uniformly distributed
original data. Each of the concentration centers was implied by a respective radial displacement field (pointing toward the respective center)
with gaussian magnitudes.
$$\begin{bmatrix} \tilde{f}_1 \\ \tilde{f}_2 \\ \vdots \\ \tilde{f}_M \end{bmatrix} = \begin{bmatrix} S_1(f_1) \\ S_2(f_2) \\ \vdots \\ S_M(f_M) \end{bmatrix} \quad (6)$$

$$\begin{bmatrix} \tilde{f}_1 \\ \tilde{f}_2 \\ \vdots \\ \tilde{f}_M \end{bmatrix} = \begin{bmatrix} S_1(f_1, f_2, \ldots, f_M) \\ S_2(f_1, f_2, \ldots, f_M) \\ \vdots \\ S_M(f_1, f_2, \ldots, f_M) \end{bmatrix} \quad (7)$$
For instance, the transformation in Equation 5 is an
independent transformation. Observe also that the transformation
scalar fields, e.g. $S_x$ and $S_y$, will necessarily be
of the same type as the respectively associated displacement
vector fields, e.g. $D_x$ and $D_y$.
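As a brief sketch of this distinction (Python with NumPy; the specific functions are arbitrary illustrative choices of ours), an independent transformation of the form of Equation 6 can be contrasted with a combined transformation of the form of Equation 7 as follows:

import numpy as np

rng = np.random.default_rng(1)
F = rng.normal(size=(100, 2))   # N = 100 entities, M = 2 features f_1 and f_2

# Independent transformation (Eq. 6): each new feature depends only on its own feature.
F_indep = np.column_stack((F[:, 0]**2 * np.sign(F[:, 0]),
                           np.log(1 + np.abs(F[:, 1]))))

# Combined transformation (Eq. 7): each new feature may depend on all original features.
F_comb = np.column_stack((F[:, 0] + F[:, 1],
                          F[:, 0] - F[:, 1]))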
4 MinMax Normalization
Given a feature (or measurement) $f_i$ varying in the interval
$[f_{i,\min}, f_{i,\max}]$, a new respective version of this feature
$\tilde{f}_i$ varying in the interval $[0, 1]$ can be obtained by applying
the following independent feature transformation:

$$\tilde{f}_i = \frac{f_i - f_{i,\min}}{f_{i,\max} - f_{i,\min}} \quad (8)$$
For simplicity’s sake, this normalization transformation
will be henceforth called minmax normalization in the
present work.
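A minimal sketch of Equation 8 applied to an N x M feature matrix (Python with NumPy; the function name is ours):

import numpy as np

def minmax_normalize(F):
    # Map each column (feature) of F to the interval [0, 1], as in Eq. 8.
    # Note: a feature with constant value would lead to a division by zero.
    f_min = F.min(axis=0)
    f_max = F.max(axis=0)
    return (F - f_min) / (f_max - f_min)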
Figure 6 illustrates the minmax feature normalization
with respect to two clusters of uniformly distributed
points in (a). The respective displacement, shown in (b),
implies a vertical expansion of the clusters, resulting in
Figure 5: Two well-separated clusters can be merged as a consequence of certain feature transformations such as that applied in this example.
the elongated clusters in (c). This new shape of the two
clusters can have an impact on the subsequent classification
stage.
5 Unit Transformations
We have already briefly discussed in the introduction of
this work that the units in which the features are taken
can influence the respective classification, possibly requiring
transformation and normalization. Given a measurement
$f_i$ varying in the interval $[f_{i,\min}, f_{i,\max}]$, it is
possible to linearly transform it into another measurement
$f_j$ varying in the interval $[f_{j,\min}, f_{j,\max}]$ by applying
the following equation:

$$f_j = (f_{j,\max} - f_{j,\min})\,\frac{f_i - f_{i,\min}}{f_{i,\max} - f_{i,\min}} + f_{j,\min} \quad (9)$$
As illustrated in Figure 7, this transformation can be
understood as a minmax normalization, yielding the new
variation interval $[0, 1]$, followed by a scaling by
$(f_{j,\max} - f_{j,\min})$ and a translation by $f_{j,\min}$.
Observe that this transformation assumes a linear relationship
between the two features of interest.
For instance, the conversion from Celsius (C) to Fahrenheit
(F) can be obtained by considering the respective intervals
$[0, 100]$ and $[32, 212]$ as:

$$F = (212 - 32)\,\frac{C - 0}{100 - 0} + 32 = 1.8\,C + 32 \quad (10)$$

Similarly, the conversion from Fahrenheit to Celsius can
be immediately obtained as:

$$C = (100 - 0)\,\frac{F - 32}{212 - 32} + 0 = \frac{100}{180}\,(F - 32) \quad (11)$$
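The generic unit transformation of Equation 9 can be sketched as a small Python function (names ours), with the Celsius-to-Fahrenheit conversion of Equation 10 serving as a check:

def unit_transform(f_i, i_min, i_max, j_min, j_max):
    # Linearly map a value from [i_min, i_max] to [j_min, j_max] (Eq. 9).
    return (j_max - j_min) * (f_i - i_min) / (i_max - i_min) + j_min

print(unit_transform(100.0, 0.0, 100.0, 32.0, 212.0))  # 212.0 (boiling point of water)
print(unit_transform(37.0, 0.0, 100.0, 32.0, 212.0))   # 98.6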
6 Standardization
Standardization of a measurement $f_i$ involves subtracting
its mean $\mu_{f_i}$ followed by a division by its respective
standard deviation $\sigma_{f_i}$, i.e.:

$$\tilde{f}_i = \frac{f_i - \mu_{f_i}}{\sigma_{f_i}} \quad (12)$$

This statistical transformation of a feature yields a new
feature $\tilde{f}_i$ that is dimensionless and that has zero mean
and unit standard deviation (e.g. [5]). In addition, a great
deal of the instances of the new measurement $\tilde{f}_i$ will tend
to fall within the interval $[-2, 2]$.
The standardization of a feature $f_i$ can be understood
as moving the center of the respective distribution to zero
(a translation in the feature space), accompanied by a
scale normalization in which the dispersion of the measurements
becomes fixed or standardized.
Because standardization of a feature yields a dimensionless
respective feature, this operation is often applied
in order to normalize the influence of unit choices on the
respective classification.
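A minimal sketch of Equation 12 (Python with NumPy; the function name is ours):

import numpy as np

def standardize(F):
    # Subtract the mean and divide by the standard deviation of each feature (Eq. 12).
    return (F - F.mean(axis=0)) / F.std(axis=0)

F = np.random.default_rng(2).normal(loc=5.0, scale=3.0, size=(500, 2))
Ft = standardize(F)
print(Ft.mean(axis=0), Ft.std(axis=0))   # approximately [0, 0] and [1, 1]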
Figure 8 illustrates the standardization of two clus-
ters composed of uniformly distributed points in a two-
dimensional respective feature space (the same situation
as in the previous example). The respective displacement
field, shown in (b), is similar but with more intense magni-
tudes, as the points undergo larger movements. Observe,
however, that the magnitude of the displacements should
often be considered in relative terms between the involved
displacements (e.g. even larger magnitudes would be ob-
tained if the clusters were further away from the coordi-
nate origin, but the result would still be the same). As
with the minmax normalization, the two clusters end up
with an elongated shape that can impact the subsequent
processing.
Figure 6: The minmax normalization of two well-separated clusters of uniformly distributed points (a). Elongated clusters are obtained (c)
as a consequence of the vertical expansion implied by the respective displacement field (b).
Figure 7: Generic unit transformations from a measurement $f_i$ ranging in the interval $[f_{i,\min}, f_{i,\max}]$ into another measurement $f_j$ varying
in the interval $[f_{j,\min}, f_{j,\max}]$ can be conceptually understood as a minmax transformation followed by a scaling by $(f_{j,\max} - f_{j,\min})$ and
a translation by $f_{j,\min}$.
7 Principal Component Analysis
(PCA)
Given a set of $N$ entities, each represented by a respective
feature vector $\vec{f}_p = [f_{1,p};\, f_{2,p};\, \ldots;\, f_{M,p}]^T$ ($M$ features),
with $p = 1, 2, \ldots, N$, it is possible to apply a transformation
that completely decorrelates these features, allowing
a possible dimensionality reduction, in the sense of yielding
a new set of $m$ features such that $m < M$ while implying
little loss of variation. This can be achieved
by using the statistical transformation typically known as
principal component analysis (PCA, e.g. [6]).
We start by deriving the covariance matrix $K$ of the
feature vectors (which can be understood as instances of a
random vector), which in many cases can be estimated as

$$K_{i,j} = \mathrm{covariance}(f_i, f_j) \approx \frac{\sum_{p=1}^{N} (f_{i,p} - \mu_{f_i})(f_{j,p} - \mu_{f_j})}{N - 1} \quad (13)$$
Once this covariance matrix is obtained, its eigenvalues
and eigenvectors are computed. The eigenvalues are then
sorted in decreasing order, yielding $\lambda_1, \lambda_2, \ldots, \lambda_M$, with
respectively associated eigenvectors $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_M$. The
latter are stacked as rows of an $M \times M$ matrix $Q$, i.e.:

$$Q = \begin{bmatrix} \vec{v}_1 \\ \vec{v}_2 \\ \vdots \\ \vec{v}_M \end{bmatrix} \quad (14)$$

The new feature vectors, $\tilde{\vec{f}}_p$, can now be obtained by
the PCA transformation as:

$$\tilde{\vec{f}}_p = Q\,\vec{f}_p \quad (15)$$
This corresponds to a linear transformation, implying
that the new feature vectors are obtained as linear com-
binations of the original ones. Furthermore, this trans-
formation can be understood as a rotation of the original
feature space so that the data variance is concentrated
Figure 8: The standardization of two well-separated clusters of uniformly distributed points (a). As with the minmax normalization, elongated
clusters are obtained (c) as a consequence of the vertical expansion implied by the respective displacement field (b).
along the first new axes, corresponding to the principal
new features. The variance of the new features is given
by the eigenvalues associated to the respective axes.
It can be shown (e.g. [6]) that PCA yields a new set
of features that are fully uncorrelated, implying a respective
redundancy reduction. As a consequence, the new
covariance matrix will necessarily be diagonal.
We can define the covariance explanation index $\eta$ provided
by the first $m$ new features as:

$$\eta = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{M} \lambda_i} \quad (16)$$
If we set a value for $\eta$, such as 90%, we can keep only
the $m$ features required for achieving at least that covariance
explanation index, thus achieving a reduction of
dimensionality from $M$ to $m$. This is possible because, by
removing covariance, PCA also decreases the redundancy
of the features.
It is often interesting to standardize the features (see
Section 6) prior to PCA.
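A minimal PCA sketch following Equations 13-16 (Python with NumPy; function and variable names are ours), returning the transformed features and the covariance explanation index for the first m components:

import numpy as np

def pca(F, m):
    # F is an N x M feature matrix; keep the first m principal components.
    K = np.cov(F, rowvar=False)              # covariance matrix (Eq. 13)
    eigvals, eigvecs = np.linalg.eigh(K)     # eigendecomposition (K is symmetric)
    order = np.argsort(eigvals)[::-1]        # eigenvalues in decreasing order
    eigvals = eigvals[order]
    Q = eigvecs[:, order].T                  # eigenvectors stacked as rows (Eq. 14)
    F_new = (F - F.mean(axis=0)) @ Q.T       # projection (Eq. 15), after centering the features
    eta = eigvals[:m].sum() / eigvals.sum()  # covariance explanation index (Eq. 16)
    return F_new[:, :m], eta

# Elongated cloud of uniformly distributed points, as in Figure 9.
rng = np.random.default_rng(3)
F = rng.uniform(-1, 1, size=(1000, 2)) * np.array([3.0, 0.5])
F_new, eta = pca(F, m=1)
print(eta)   # most of the variance is explained by the first new axis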
Figure 9 illustrates the effect of PCA on an elongated set
of points. Observe that the first new axis, $\tilde{x}$, results aligned
with the direction of largest variation in the original set
of points.
8 Concluding Remarks
Pattern recognition involves several stages, including fea-
tures transformation and normalization. In this work, we
addressed the latter operations with the help of the con-
cept of morphing a set of points and visualization of re-
spective displacement fields. In addition to revising some
of the main transformations and normalizations, it has
also been shown that these operations can substantially
influence the subsequent stage of classification.
Figure 9: Illustration of the PCA action on an elongated set of uniformly
distributed points. Observe that the first new axis, $\tilde{x}$, aligns itself
along the direction of maximum variation in the original data. The
eigenvalues associated to the two new axes are also shown, which
yield a variance explanation of $\eta = 0.272/(0.272 + 0.011) = 96\%$
when keeping only the first new axis.
Depending on the type of data and transformation/normalization,
we can both enhance and undermine cluster identification.
Therefore, great care should be taken while transforming
and normalizing features.
Acknowledgments.
Luciano da F. Costa thanks CNPq (grant
no. 307085/2018-0) for sponsorship. This work has
benefited from FAPESP grant 15/22308-2.
References
[1] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley Interscience, 2000.
[2] K. Koutroumbas and S. Theodoridis. Pattern Recognition. Academic Press, 2008.
[3] L. da F. Costa. Pattern cognition, pattern recognition. Researchgate, Dec 2019. https://www.researchgate.net/publication/338168835_Pattern_Cognition_Pattern_Recognition_CDT-19. Online; accessed 29-Feb-2020.
[4] L. da F. Costa. Where do patterns to be recognized come from? Researchgate, 2020. https://www.researchgate.net/publication/339599069_Where_Do_Patterns_To_Be_Recognized_Come_From_CDT-22. Online; accessed 09-March-2020.
[5] L. da F. Costa. Statistical modeling. Researchgate, 2019. https://www.researchgate.net/publication/334726352_Statistical_Modeling_CDT-13. Online; accessed 22-Dec-2019.
[6] F. Gewers, G. R. Ferreira, H. F. Arruda, F. N. Silva, C. H. Comin, D. R. Amancio, and L. da F. Costa. Principal component analysis: A natural approach to data exploration. Researchgate, 2019. https://www.researchgate.net/publication/324454887_Principal_Component_Analysis_A_Natural_Approach_to_Data_Exploration. Online; accessed 25-Dec-2019.
Costa’s Didactic Texts CDTs
CDTs intend to be a halfway point between a
formal scientific article and a dissemination text
in the sense that they: (i) explain and illustrate
concepts in a more informal, graphical and acces-
sible way than the typical scientific article; and
(ii) provide more in-depth mathematical develop-
ments than a more traditional dissemination work.
It is hoped that CDTs can also incorporate new
insights and analogies concerning the reported
concepts and methods. We hope these character-
istics will contribute to making CDTs interesting
both to beginners as well as to more senior
researchers.
Each CDT focuses on a limited set of interrelated
concepts. Though attempting to be relatively
self-contained, CDTs also aim at being relatively
short. Links to related material are provided in
order to complement the covered subjects.
Observe that CDTs, which come with absolutely
no warranty, are non-distributable and for non-
commercial use only.
The complete set of CDTs can be found
at: https://www.researchgate.net/project/
Costas-Didactic-Texts-CDTs.