Control of a Speech Robot via an Optimum Neural-Network-Based Internal Model With Constraints

An optimum internal model with constraints is proposed and discussed for the control of a speech robot, which is based on the human-like behavior. The main idea of the study is that the robot movements are carried out in such a way that the length of the path traveled in the internal space, under external acoustical and mechanical constraints, is minimized. This optimum strategy defines the designed internal model, which is responsible for the robot task planning. First, an exact analytical way to deal with the problem is proposed. Next, by using some empirical findings, an approximate solution for the designed internal model is developed. Finally, the implementation of this solution, which is applied to the control of a speech robot, yields interesting results in the field of task-planning strategies, task anticipation (namely, speech coarticulation), and the influence of force on the accuracy of executed tasks.
142 IEEE TRANSACTIONS ON ROBOTICS, VOL. 26, NO. 1, FEBRUARY 2010
Iaroslav V. Blagouchine and Eric Moreau, Senior Member, IEEE
Index Terms—Artificial neural networks (ANNs), constrained
optimization, Lagrange’s multipliers method, mathematical and
computational issues in robotics control, mathematical physics,
models and theories of speech production, λ-model [equilibrium-
point hypothesis (EPH)], optimum control, optimum task planning,
path and trajectory planning, robotics of speech production, robot-
motion planning, variational calculus.
I. INTRODUCTION
ROBOTICS of speech production is quite a challenging
subject in modern design and engineering. Since it is
well-known that the tongue is one of the principal elements,
which is responsible for speech production, many speech robots
are based on the modeling of the movements of the tongue.
To ensure the quality of these movements and, consequently,
that of the produced speech (both have to be as close as pos-
sible to the real ones), the control of such robots is extremely
important. To this end, the principles of control of such artificial-
intelligence devices are often borrowed from human beings, and
many robots are based on human-like behavior and are modeled
in close conjunction with the motor-control theories [1]–[5].
Manuscript received April 14, 2009; revised July 7, 2009 and September 17,
2009. First published November 13, 2009; current version published February
9, 2010. This paper was recommended for publication by Associate Editor T.
Kanda and Editor J.-P. Laumond upon evaluation of the reviewers’ comments.
I. V. Blagouchine is with the Department of Telecommunication, Insti-
tute of Engineering Sciences of Toulon-Var-School of Engineering, Univer-
sity of Toulon, Toulon F-83162, France, and also with the Department of
Mobile Communication, Eurécom, F-06904 Sophia-Antipolis, France (e-mail:
iaroslav.blagouchine@univ-tln.fr).
E. Moreau is with the Department of Telecommunication, Institute of Engi-
neering Sciences of Toulon-Var-School of Engineering, University of Toulon,
Toulon F-83162, France (e-mail: moreau@univ-tln.fr).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TRO.2009.2033331
Currently, there is no single unique theory in the field of mo-
tor control and task planning. Over the past 80 years, many
different approaches were developed and are currently in com-
petition; these include the electromyographic approach [1]–[3],
the information-channel approach [3], [6]–[8], the global econ-
omy of the diverse mechanical factors approaches [9]–[15],
the equilibrium-point hypothesis (EPH, which is also known
as the λ-model) [1], [5], [16]–[27], the internal models’ ap-
proaches [28]–[30], etc.
Among these approaches, the EPH, economy’s approaches,
and internal models’ ones have received some special interest
in speech robotics.
The EPH is a development of the classic linear damped spring model of muscle [31], which is completed by central nervous system (CNS) influence. According to the EPH, the muscle can be modeled as a nonlinear spring, which is controlled by a special motor command $\lambda$, which descends from the CNS. The force $F$, which is generated by such a muscle, depends on the difference between its actual length $l$ and the CNS motor command $\lambda$, as well as on several other physical parameters associated with muscle. In other words, $F(l, \lambda) = f(s)\,H(s)$, where $f(\cdot)$ is the transfer function of muscle, $H(\cdot)$ is the Heaviside step function, and the parameter $s$, which is called activation, is defined as $s = l - \lambda$ for the static case and as $s = l - \lambda + \kappa\nu$ for the dynamic case, where parameters $\kappa$ and $\nu$ are, respectively, the damping factor and the speed of muscle lengthening. As to the transfer function of muscle $f(\cdot)$, it has been noted that it is a nonlinear function, and it is quite well-approximated by an exponential function. Thus
$$F(\lambda) = \rho\,(e^{cs} - 1)\,H(s) \tag{1}$$
where $c$ is the form parameter, and $\rho$ is a parameter related to the force-generating capability of muscle [19], [32], [33]. By expanding the latter expression in the Maclaurin series for $s > 0$, it is straightforward to see that for $0 < s \ll 1$, the muscle, in first approximation, behaves as a classic linear spring
$$F(\lambda) = \rho \sum_{n=1}^{\infty} \frac{(cs)^n}{n!} = \rho c s + O(s^2) \tag{2}$$
which is another argument in favor of this model. From the point of view of robotics and cybernetics, the λ-model is especially attractive, because it provides a simple mathematical means to conceive artificial-intelligence devices based on human-like behavior, without going into details about the underlying principles of motor control. The EPH has also become quite popular in the articulatory speech-production field, for which correct modeling of the tongue and jaw movements is of great importance, because these are physically responsible for speech production. Owing to growing research interest and to constantly increasing computational capacity, many such works have appeared over the past 15 years (e.g., see [33]–[40]).
Fig. 1. Simplified anatomical structure of tongue (from [43]).
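The exponential force law (1) and its linear-spring limit (2) can be compared numerically. The sketch below is only an illustration: the function names and the default value rho = 1 are assumptions made here, while c = 1 follows the constant quoted in the text (in cm^-1).

```python
import math

def activation(length, lam, kappa=0.0, speed=0.0):
    """Activation s = l - lambda + kappa * nu (static case when kappa * nu = 0)."""
    return length - lam + kappa * speed

def muscle_force(s, rho=1.0, c=1.0):
    """Force law (1): F = rho * (exp(c * s) - 1) * H(s)."""
    if s <= 0.0:
        # Heaviside step: an inactive muscle generates no force.
        return 0.0
    return rho * math.expm1(c * s)

def linear_spring_force(s, rho=1.0, c=1.0):
    """Leading term of (2): F ~ rho * c * s for small positive s."""
    return rho * c * s if s > 0.0 else 0.0

if __name__ == "__main__":
    # rho = 1 is an illustrative value, not a parameter of the robot.
    for s in (0.01, 0.1, 0.5):
        print(s, muscle_force(s), linear_spring_force(s))
```

For small activations the two laws nearly coincide, and the gap grows as s increases, which is the behavior that the expansion (2) describes.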
One of these examples is the articulatory-based speech robot that we used in our study (see, e.g., [33], [41], and [42]).* This robot represents an artificial tongue, which is modeled by six main muscles that are responsible for shaping and moving the tongue in the sagittal plane: posterior and anterior parts of genioglossus, styloglossus, hyoglossus, inferior and superior longitudinalis, and verticalis (see Fig. 1) [43]. Each of these muscles is controlled by its own motor command $\lambda_i$, $i = 1, \ldots, n$, $n = 6$, according to the EPH.¹ Their forces are generated according to (1), with different $\rho$ for each muscle, and with constant $c = 1$ cm$^{-1}$ [19], [32]. Initial vocal-tract geometry is reconstructed from anatomical cineradiographic data. By means of
the finite-element method [44], the tongue is divided into small
volumes connected by 221 nodes, each of which anatomically
belongs to the defined muscle(s). The motion of each node is
then described by a second-order ordinary differential equa-
tion (ODE) with damping and external terms, due to viscosity,
gravity, and contact reaction forces. The stiffness matrix, which determines the distribution of the internal forces within the finite-element structure, is calculated by the finite-element algorithm.
Such a complex system of the ODEs is solved numerically by
means of the Runge–Kutta method using MATLAB software,
which finally gives the trajectory of motion of each node and,
by further interpolation, the motion of the tongue body. In order
to achieve the vocal-tract reconstruction, lips, palate, and phar-
ynx are also added to model mechanical contacts with tongue
(see Fig. 2). The jaw is represented by static rigid structures to
which the tongue is attached. Note, finally, that there are also
other articulatory-based speech robots that might be interest-
ing [45]–[59], especially because the optimum internal model
that we will introduce is meant as a general model and can be
used with many other speech robots using similar principles of
control.
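To make the numerical step concrete, the following sketch integrates a single damped second-order node with the classical fourth-order Runge–Kutta scheme. It is a toy under stated assumptions: one scalar degree of freedom instead of the 221 coupled nodes, constant made-up coefficients, and a hand-written RK4 step, since the text does not say which Runge–Kutta variant was used.

```python
def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def damped_node(mass, damping, stiffness, force):
    """Right-hand side of m*x'' + b*x' + k*x = F, written as a first-order system."""
    def rhs(t, y):
        x, v = y
        return [v, (force - damping * v - stiffness * x) / mass]
    return rhs

def simulate(rhs, y0, t_end, h):
    """Integrate from t = 0 to t_end with fixed step h; return the final state."""
    t, y = 0.0, list(y0)
    for _ in range(int(round(t_end / h))):
        y = rk4_step(rhs, t, y, h)
        t += h
    return y

if __name__ == "__main__":
    # Hypothetical coefficients, chosen only so that the node settles at F / k = 0.5.
    rhs = damped_node(mass=1.0, damping=2.0, stiffness=4.0, force=2.0)
    print(simulate(rhs, [0.0, 0.0], t_end=20.0, h=0.01))
```

With a constant external force, the node converges to the static equilibrium where the spring force balances the load, which is a convenient sanity check for any integrator of this kind.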
¹ For simplicity, we will write these motor commands as components of the vector $\boldsymbol{\lambda} \equiv (\lambda_1, \ldots, \lambda_n)$.
* This articulatory-based speech robot is called throughout the paper the biomechanical tongue model (BTM).
Fig. 2. Modeled vocal tract and its further cutting by an acoustical-tube model for the computation of formants F. The upper contour is the palate, and the lower one is the tongue dorsum. The lips' area is variable, ranging from 0.5 to 3.0 cm², depending on the vowel.
The diverse-economy approaches consider that the movements are defined by some economy principle, that is to say, the movements are always carried out in such a way that some
criterion is optimized. These approaches are basically inspired by analytical mechanics, namely, by the principle of least action [60]–[69], which is one of the most universal principles of physics (many fundamental equations of physics can be deduced from it). This principle states that motions are always carried out in such a way that the action² is minimum. However, because of the high complexity of biosystems, direct applications of this principle in motor-control theories are quite limited. Under these circumstances, with an exact mathematical description being almost impossible, one possible solution is to describe the movements more globally, i.e., by supposing that some global criterion is optimized during the movement. This criterion, which is often known as a cost, may be defined in many different ways, e.g., time cost, energy cost, force cost, impulse cost, accuracy cost, etc. In this field, the concept that appears to be the most frequent and interesting is that of the minimum of the jerk cost³ [9]–[11]. The idea of the economy
principles also affected robotics and, in particular, the speech-
motor-control community, and several studies using economy
principles appeared. Basically, these works propose to search
for the shortest trajectories between the steady-state positions
in the command internal space [32], [36], [39], [70]–[73]. In
some of these works, the shortest distance principle is explicitly
stated, and thus, authors suggest using straight lines as solutions
(the straight line is the shortest trajectory between two points if
there are no constraints). In others, it is implicitly formulated
by constant-rate transitions between steady-state positions, i.e.,
again by straight-line transitions. These works reported that by shifting motor-control commands $\boldsymbol{\lambda}$ at constant speed, realistic articulatory movements and speech signals may be produced. It is interesting to note that the minimization of the trajectory length (3), like that of the jerk cost, is quite similar, at least mathematically, to that of the action. In fact, in all three cases, one seeks to optimize a trajectory-dependent functional, which is given as a definite integral over a time interval. Finally, the exploration of optimum principles may also be interesting in the field of inverse problems. By optimizing a cost functional (or just a function) under constraints on outputs, one can find the corresponding inputs (as we will show later). This idea is not novel in the speech field. For more details, see [74]–[76].⁴
² The action is the definite integral over a time interval of the Lagrangian, the latter being the difference between kinetic and potential energies.
³ Jerk, also known as jolt, is the rate of change of acceleration, i.e., the third derivative of displacement with respect to time. The jerk cost is, therefore, defined as the definite integral over a time interval of the square of jerk.
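As an illustration of the jerk cost defined in footnote 3, the sketch below approximates it from equally spaced displacement samples. The function name and the use of a third forward difference with a rectangle-rule sum are choices made here for brevity; they are not taken from the cited works.

```python
def jerk_cost(samples, dt):
    """Approximate the integral of squared jerk from equally spaced displacement samples."""
    cost = 0.0
    for i in range(len(samples) - 3):
        # The third forward difference approximates the third time derivative.
        jerk = (samples[i + 3] - 3 * samples[i + 2]
                + 3 * samples[i + 1] - samples[i]) / dt ** 3
        # Rectangle rule: each jerk estimate is weighted by one time step.
        cost += jerk ** 2 * dt
    return cost

if __name__ == "__main__":
    dt = 0.01
    cubic = [(i * dt) ** 3 for i in range(103)]   # jerk is 6 everywhere
    line = [2.0 * i * dt for i in range(103)]     # jerk is 0 everywhere
    print(jerk_cost(cubic, dt), jerk_cost(line, dt))
```

A constant-rate (straight-line) transition has zero jerk cost, while any curved time profile is penalized, which is why this cost and the path length are natural competitors as economy criteria.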
Finally, there is another approach that should be mentioned in the human-like robotics context: the internal models [28]–[30]. This approach supposes that any living creature has an internal representation of all the external tasks he/she can do.⁵ Thus, typically, the learning of a new task implies, inter alia, the determination of the corresponding place in the internal space. It is also important to note that there is no bijection between internal and external spaces, since the same task can be achieved differently. The notion of tasks is closely related to that of targets, and it is very important for the planning theories. In speech motor control, there are different interpretations of targets, which are also known as reference frames. For example, these may be vocal-tract configurations (e.g., tongue shape, constriction position, and lips area) and, consequently, the output acoustic patterns, which may be expressed in terms of formants. Since there is great variability of the formants for the same vowel, the auditory system normalizes them in order to recognize the vowel, which is the so-called target-normalization theory. The other theory is more complex and takes into account not only the static acousticophonetic parameters but also the dynamic ones, such as transitions and their character⁶ (e.g., linear, different nonlinear forms, etc.). These transitions are often associated with formant transitions [79], [80], because it has been empirically established that this dynamic acoustic information also contributes to vowel identification [81] (see also various coarticulation references given in Section III-A2), which is the so-called dynamic-target-specification theory [82], [83]. However, these two basic interpretations of targets are far from being exhaustive, and the reader might particularly appreciate the theoretical study in [78], where one can find muscle-length targets, articulator targets, constriction-position targets, acoustic targets, auditory perceptual targets, etc.
⁴ This work seems to be misinterpreted in [77], where the cost function from [76] was called “length,” while it is clearly called “variation” by its author, and in addition, the formula provided in [76] does not represent the length.
⁵ It has even been suggested that, as in human beings, the most probable site for the latter may be the cerebellar cortex [26].
⁶ The question of where the transitions are planned and how they are controlled is also a subject of controversial discussion; some works suggested that it may be in spatial reference frames, while others reported that it may be more closely related to physical levels (e.g., joints and muscles) [78].
The aim of our study is to propose an optimum internal model for the control of speech robots based on the EPH. The motion planning of the robot is performed in its internal space, whose coordinates are the λ-motor commands of the EPH (the internal space $\boldsymbol{\lambda}$ is, therefore, an $n$-dimensional space). Being inspired by the principle of least action, it is proposed that the robot task planning is based on a global optimum principle, which is related to the aforementioned internal space, with external constraints related to the execution of tasks (e.g., the quality of the executed tasks), or in motor-control words, to the targets. Namely, it is proposed that all the movements, including those of the tongue, which are mainly responsible for speech production and are controlled by the $\boldsymbol{\lambda}$ commands according to the EPH, are carried out in such a way that the length of the path traveled in the internal space $\boldsymbol{\lambda}$ is minimized under external physical constraints, namely, acoustical and mechanical ones. The robot's behavior is, therefore, completely determined by this optimum principle, which permits finding the corresponding optimum commands $\boldsymbol{\lambda}$, which are sent to the robot. Therefore, the originality of our work consists in two important differences with respect to previously referenced works in the EPH-based robotics field. First, previous works do not perform the minimization of length in order to find the corresponding motor commands $\boldsymbol{\lambda}$. They just use the fact that the straight line is the shortest path between two points. However, the latter fact is true only if there are no constraints,⁷ and with constraints, their approaches cannot provide solutions. Second, the optimization that we carry out is, in addition, a constrained one. We first perform it under one constraint (the acoustical one) and then under two constraints (the acoustical and the mechanical ones).
II. OPTIMUM INTERNAL MODEL
A. Preliminaries
First of all, we specify what we exactly mean by external
physical constraints. The acoustical constraints consist in the
specification of the sound that we wish to produce. Its speci-
fication is made in terms of the spectrum, and since the opti-
mum internal model is mainly designed for vowels, the latter
can be roughly approximated via the first $k$ formants of the vowel, which are denoted by the vector $\mathbf{F} \equiv (F_1, \ldots, F_k)$. In practice, the
formants are obtained via the BTM, which is followed by an
acoustical tube model (see Fig. 3). First, the BTM provides
the vocal-tract geometry (x,y), and then, the acoustical tube
model cuts the vocal tract in cross sections (see Fig. 2) and
approximates it by a tube of variable cross section. This yields
the area function of the vocal tract, by which, the formants are
computed [84]–[86]. The mechanical constraint consists in the
requirement to keep the prescribed mean force’s level contained
in the tongue during speech production. This level is calculated
as the arithmetic (or sample) mean of the absolute values of
the forces at each node of the BTM. Physically, this level may
be interpreted as mean muscular tongue effort, or measure of
global tongue stiffness, and phonetically, it helps to account for
lax and tense vowels.
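The mean force level described above is a plain arithmetic mean over the nodes. A minimal sketch follows, assuming that each nodal force is a two-component vector in the sagittal plane and that its "absolute value" is the Euclidean norm; both points are interpretations made here, and the function names are hypothetical.

```python
import math

def mean_force_level(node_forces):
    """Arithmetic mean of the magnitudes of the nodal forces (fx, fy)."""
    if not node_forces:
        raise ValueError("at least one nodal force is required")
    return sum(math.hypot(fx, fy) for fx, fy in node_forces) / len(node_forces)

def mechanical_residual(node_forces, prescribed_level):
    """Difference between the produced mean force level and the prescribed one."""
    return mean_force_level(node_forces) - prescribed_level

if __name__ == "__main__":
    # Three made-up nodal forces; the real model has 221 nodes.
    forces = [(3.0, 4.0), (0.0, 0.0), (-6.0, 8.0)]
    print(mean_force_level(forces), mechanical_residual(forces, 5.0))
```

A zero residual means that the prescribed effort level is kept, which is the condition the mechanical constraint imposes along the whole movement.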
B. Mathematical Formalization of the Formants–Commands
and Force–Commands Relationships and Learning of the
Artificial Neural Networks
As we saw before, the BTM is not a fully analytical robot. In other words, there are no explicit analytical relationships between its inputs and outputs. However, in order to mathematically implement the minimization algorithm of the optimum internal model, we need the analytical relationships between the formants $\mathbf{F}$ and the motor commands $\boldsymbol{\lambda}$, which are denoted by the vector field $\mathbf{F}(\boldsymbol{\lambda})$, and between the global mean force's level $\mathcal{F}$ and $\boldsymbol{\lambda}$, which are denoted by the scalar field $\mathcal{F}(\boldsymbol{\lambda})$, as well as their derivatives. For this reason, we approximated the BTM, followed by an acoustical tube model, by two artificial neural networks (ANNs) [5], [87]–[91] (see Fig. 3). The choice of the ANNs for similar problems was already suggested by several authors [92]–[97]; moreover, the ANNs are known precisely for their good properties for multidimensional approximations. Besides, the replacement of a particular BTM by ANNs has potentially another application: the generalization of the proposed internal model for its use with other speech robots based on the EPH or using similar principles of control, whose input–output relationships may be approximated by the ANNs.
Fig. 3. (Top) Direct or real model and its replacement by the (bottom) approximate one.
⁷ Or they are trivial, e.g., a straight line or a plane passing through the endpoints.
The learned ANNs (for details, see the Appendix) reveal the general nonlinear character of the dependencies $\mathbf{F}(\boldsymbol{\lambda})$ and $\mathcal{F}(\boldsymbol{\lambda})$, as shown in Fig. 4. Nevertheless, one can note that the dependencies $\mathbf{F}(\boldsymbol{\lambda})$ and, especially, $\mathcal{F}(\boldsymbol{\lambda})$ are not highly nonlinear; this suggests that it would also be reasonable to try multidimensional polynomials instead of the ANNs, since the former are computationally “lighter.”
C. Model Itself
The optimum internal model, which is designed according to the principle of the shortest path in the internal space under constraints, logically leads to the calculus of variations [98]–[105]. In fact, the problem of finding a curve whose length is least under constraints is one of the typical problems of variational calculus, which is known as the geodesic problem. The length of a curve, which is given in parametric form $\lambda_i \equiv \lambda_i(t)$, in the $n$-dimensional space $\boldsymbol{\lambda}$, can be written as [99], [100], [102]–[105]
$$L[\boldsymbol{\lambda}(t)] = \int_{t_1}^{t_2} \sqrt{\dot{\lambda}_1^2 + \cdots + \dot{\lambda}_n^2}\; dt = \int_{t_1}^{t_2} \bigl\|\dot{\boldsymbol{\lambda}}(t)\bigr\|\, dt \tag{3}$$
where $t_1$ and $t_2$ are, respectively, the initial and final times of movement, and $\boldsymbol{\lambda}(t_1)$ and $\boldsymbol{\lambda}(t_2)$ are the corresponding positions in the internal space $\boldsymbol{\lambda}$. We will now seek the vector-valued function $\boldsymbol{\lambda}(t)$, i.e., the set of functions $\{\lambda_i(t)\}_{i=1}^{n}$, which minimizes this integral under two constraints: acoustical and mechanical.
Fig. 4. Dependencies (upper six subfigures) $F_1(\boldsymbol{\lambda})$ and (lower six subfigures) $\mathcal{F}(\boldsymbol{\lambda})$ approximated by the corresponding ANN. Since both the functions $F_1(\boldsymbol{\lambda})$ and $\mathcal{F}(\boldsymbol{\lambda})$ depend on six variables ($n = 6$), they are shown in the following way: We fix five variables out of six and show the dependency solely on the remaining sixth variable. Six panels for $F_1$ and for $\mathcal{F}$ show these dependencies, where the sixth variable switches from $\lambda_1$ to $\lambda_6$, respectively. Three different cases are presented in each small panel; they are obtained by setting five fixed variables to their minimal, mean, and maximal values, respectively.
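The length functional (3) can be approximated for any sampled path by summing the chord lengths between consecutive samples. The sketch below assumes a path given as a Python callable of time; the helper names and the two example paths are illustrative only.

```python
import math

def sample_path(path, t1, t2, steps):
    """Evaluate path(t) at steps + 1 equally spaced times between t1 and t2."""
    return [path(t1 + (t2 - t1) * i / steps) for i in range(steps + 1)]

def path_length(points):
    """Polygonal approximation of the length functional (3)."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

if __name__ == "__main__":
    # A straight transition and a curved one between two points of a 3-D command space.
    straight = lambda t: (t, 2.0 * t, 2.0 * t)
    arc = lambda t: (math.cos(t), math.sin(t), 0.0)
    print(path_length(sample_path(straight, 0.0, 1.0, 100)))
    print(path_length(sample_path(arc, 0.0, math.pi, 2000)))
```

Without constraints, the straight path is always the shortest one between its endpoints; the constrained problem discussed next is where curved solutions can become optimal.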
The acoustical constraint consists in the specification of the initial and final phonetic targets, which correspond, respectively, to the initial $t_1$ and final $t_2$ moments. These targets are the zones in the formant space $\mathbf{F}$. Formally, the constraint is defined as the appertaining of the first $k$ formants of each produced vowel to its own specific formant zone, which is defined by a $k$-dimensional ellipsoidorectangle in the formant space $\mathbf{F}$. In other words, mathematically, the formants of the $j$th produced vowel, which are denoted by $\mathbf{F}_j \equiv (F_{1,j}, \ldots, F_{k,j})$, must satisfy $H(G_{a,j}) = 0$, where
$$G_{a,j}(\mathbf{F}_j) = \sum_{l=1}^{k} \frac{\bigl(F_{l,j}(\boldsymbol{\lambda}_j) - \tilde{F}_{l,j}\bigr)^{2\eta_j}}{\varepsilon_{l,j}^{2\eta_j}} - 1 \tag{4}$$
where $\tilde{F}_{l,j}$ are the prescribed formants, $F_{l,j}$ are the produced ones, parameters $\varepsilon_{l,j}$ define the axes of the formant ellipsoidorectangle, $\eta_j$ defines its shape (rounded or rectangular), $\boldsymbol{\lambda}_j \equiv \boldsymbol{\lambda}(t_j)$ is the vector motor command that is responsible for the production of the $j$th vowel, and $j = 1$ and $2$, since we have only two targets (vowels): the initial one and the final one. By an ellipsoidorectangle, we actually mean a voluminous $k$-dimensional figure, which is obtained from the previous equation by setting $G_{a,j} = 0$. For $\eta_j = 1$, it is a $k$-dimensional ellipse; then, by increasing the parameter $\eta_j$, it becomes more and more rectangular, and finally, for large $\eta_j$, it becomes definitively a hyperrectangle. The equality, which is given by $H(G_{a,j}) = 0$, means that we wish the formants of the produced sounds to be inside their ellipsoidorectangles; this can be viewed as a weak constraint, because we do not specify where exactly we want the formants to be; they must be just somewhere inside the ellipsoidorectangles. We could pose $G_{a,j} = -1$, i.e., the strict belonging to the given set of formants of the produced vowels; however, this constraint is too rigid and, in practice, it does not seem very realistic, since a slight fluctuation of the phonetic targets is always present (actually, parameters $\varepsilon_{l,j}$ were precisely introduced for this purpose, i.e., in order to define the size of the formant zones),⁸ or in addition, $G_{a,j} = 0$, i.e., the strict appertaining to the surface of the formant ellipsoidorectangle (which is another weak constraint, because we do not specify where exactly on the ellipsoidorectangle's surface the formants must be, but it is stronger than $H(G_{a,j}) = 0$). Furthermore, the constraints of the type $H(G_{a,j}) = 0$ will be called the constraints of the first kind; those of the type $G_{a,j} = 0$ will be called the constraints of the second kind.
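The two kinds of acoustical constraint can be checked with a few lines of code. The sketch below evaluates the function of (4) for one vowel and tests the first-kind condition; the argument names, the convention H(0) = 0, and the formant values in the usage example are assumptions made for illustration, not data from the study.

```python
def acoustic_constraint(produced, prescribed, half_axes, eta=1):
    """G_a of (4): sum over formants of ((F - F_target) / axis) ** (2 * eta), minus 1."""
    return sum(((f - f0) / eps) ** (2 * eta)
               for f, f0, eps in zip(produced, prescribed, half_axes)) - 1.0

def heaviside(x):
    """Step function with H(0) = 0, so the zone boundary counts as inside."""
    return 1 if x > 0 else 0

def satisfies_first_kind(produced, prescribed, half_axes, eta=1):
    """Constraint of the first kind: H(G_a) = 0, i.e., the formants lie in the zone."""
    return heaviside(acoustic_constraint(produced, prescribed, half_axes, eta)) == 0

if __name__ == "__main__":
    # Hypothetical target (Hz) and zone half-axes for two formants.
    target, axes = (500.0, 1500.0), (50.0, 100.0)
    print(satisfies_first_kind((540.0, 1590.0), target, axes, eta=1))
    print(satisfies_first_kind((540.0, 1590.0), target, axes, eta=4))
```

The same produced formants fall outside the elliptical zone (eta = 1) but inside the more rectangular one (eta = 4), which illustrates how the shape parameter widens the corners of the zone.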
The mechanical constraint simply consists in the equality of the global mean force's level $\mathcal{F}$ to the prescribed value $\tilde{\mathcal{F}}(t)$ (time-dependent in general), which must be kept during the whole transition between $t_1$ and $t_2$:
$$G_m(\boldsymbol{\lambda}, t) = \mathcal{F}(\boldsymbol{\lambda}) - \tilde{\mathcal{F}}(t) = 0 \tag{5}$$
i.e., this constraint has to be satisfied at every moment and at every point of the path, and not only at $t_1$ and $t_2$, as is the case for the acoustical one. Note that the introduction of the constraints in the model aims precisely to mathematically formalize the targets (see Section I). The acoustical constraints actually represent a sort of static targets, which are in accordance with target-normalization theory (especially that of the first kind). In contrast, the mechanical constraint, which is a dynamic one, corresponds to the dynamic-target-specification theory (since it must be satisfied during the transition and not only at the static endpoints belonging to some zone) and aims to better represent the reality of the system.
⁸ For instance, it is well known that the distribution of formants about its mean $\bar{\mathbf{F}}_j$ is near-normal. Thus, for the particular case $\eta_j = 1$, the parameters $\boldsymbol{\varepsilon}_j$ define the formant zone of the constant probability level $a^{-1}$, $a > 1$, with respect to the maximum level at $\bar{\mathbf{F}}_j$, if we pose the referents $\tilde{\mathbf{F}}_j = \bar{\mathbf{F}}_j$ and $\boldsymbol{\varepsilon}_j$ equal to $\sqrt{2 \ln a}\,\times$ standard deviation of the aforementioned normal distribution; in other words, $\boldsymbol{\varepsilon}_j$ define the formant equiprobability's ellipses.
As we may recall from variational calculus, the function $\boldsymbol{\lambda}(t)$ that minimizes the functional (3) is the solution of the corresponding system of the Euler–Lagrange differential equations. For the ordinary variational problem, which requires the stationarity of the functional
$$Y[\boldsymbol{\lambda}(t)] = \int_{t_1}^{t_2} f(\boldsymbol{\lambda}, \dot{\boldsymbol{\lambda}}, t)\, dt \tag{6}$$
with given fixed boundary conditions $\boldsymbol{\lambda}(t_1) = \boldsymbol{\lambda}_1$ and $\boldsymbol{\lambda}(t_2) = \boldsymbol{\lambda}_2$, under the $m$ constraints $G_j(\boldsymbol{\lambda}, t) = 0$, for $j = 1, \ldots, m$, the solution can be found from the following system of $n$ Euler–Lagrange equations:
$$\frac{\partial}{\partial \lambda_i}\Bigl(f + \sum_{j=1}^{m} \mu_j G_j\Bigr) - \frac{d}{dt}\,\frac{\partial}{\partial \dot{\lambda}_i}\Bigl(f + \sum_{j=1}^{m} \mu_j G_j\Bigr) = 0 \tag{7}$$
where $\mu_j \equiv \mu_j(t)$ are Lagrange's undetermined multipliers. The latter equation may be reduced to the following one, which is represented in vector form as
$$\frac{\partial f}{\partial \boldsymbol{\lambda}} + \sum_{j=1}^{m} \mu_j \frac{\partial G_j}{\partial \boldsymbol{\lambda}} - \frac{d}{dt}\,\frac{\partial f}{\partial \dot{\boldsymbol{\lambda}}} = 0 \tag{8}$$
where $\partial/\partial\boldsymbol{\lambda}$ is the operator of partial differentiation with respect to each component of the vector $\boldsymbol{\lambda}$. Note that since we have $n$ differential equations and $m$ equations of constraint, we can find all $n$ components of $\boldsymbol{\lambda}(t)$ and $m$ Lagrange's multipliers; the remaining $2n$ unknowns, due to $n$ differential equations of second order, can be found from the $2n$ boundary or initial conditions. Note also that the constraints $G_j$ may be static or dynamic, which does not change the previously mentioned differential equation, since they do not contain $\dot{\boldsymbol{\lambda}}$ (for more information, see [100] and [104], where we can also find the cases of the constraints given as ODEs). Obviously, similar reasoning also applies to the mechanical constraint (5).
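For the functional (3), the Euler–Lagrange equations are particularly simple, because the integrand contains only the derivative of $\boldsymbol{\lambda}(t)$, which is a particularity of all geodesic problems, i.e., we have the following system:
$$\frac{d}{dt}\left[\frac{\dot{\lambda}_i}{\sqrt{\dot{\lambda}_1^2 + \cdots + \dot{\lambda}_n^2}}\right] - \mu \frac{\partial \mathcal{F}}{\partial \lambda_i} = 0, \qquad i = 1, \ldots, n \tag{9}$$
with the additional equation (5) to find $\mu(t)$. The latter expression can also be written as
$$\frac{d}{dt}\left[\frac{\dot{\boldsymbol{\lambda}}}{\|\dot{\boldsymbol{\lambda}}(t)\|}\right] - \mu \frac{\partial \mathcal{F}}{\partial \boldsymbol{\lambda}} = 0. \tag{10}$$
After the total differentiation with respect to time, it becomes
$$\frac{\ddot{\boldsymbol{\lambda}}\,\langle \dot{\boldsymbol{\lambda}}, \dot{\boldsymbol{\lambda}} \rangle - \dot{\boldsymbol{\lambda}}\,\langle \dot{\boldsymbol{\lambda}}, \ddot{\boldsymbol{\lambda}} \rangle}{\|\dot{\boldsymbol{\lambda}}\|^{3}} - \mu \frac{\partial \mathcal{F}}{\partial \boldsymbol{\lambda}} = 0 \tag{11}$$
where $\langle \cdot, \cdot \rangle$ denotes the scalar product. Moreover, for the particular case $n = 3$, by using the well-known rule of linear algebra transforming the scalar products into the vector ones, we may simplify (11) as follows:
$$\frac{\dot{\boldsymbol{\lambda}} \times \bigl(\ddot{\boldsymbol{\lambda}} \times \dot{\boldsymbol{\lambda}}\bigr)}{\|\dot{\boldsymbol{\lambda}}\|^{3}} - \mu \frac{\partial \mathcal{F}}{\partial \boldsymbol{\lambda}} = 0. \tag{12}$$
We can even generalize (9)–(12) for the cases when there is more than one constraint that must be fulfilled along the optimum solution $\boldsymbol{\lambda}(t)$, i.e., at each point of $\boldsymbol{\lambda}(t)$. By using (7) or (8), we can generalize (11) as follows:
$$\frac{\ddot{\boldsymbol{\lambda}}\,\langle \dot{\boldsymbol{\lambda}}, \dot{\boldsymbol{\lambda}} \rangle - \dot{\boldsymbol{\lambda}}\,\langle \dot{\boldsymbol{\lambda}}, \ddot{\boldsymbol{\lambda}} \rangle}{\|\dot{\boldsymbol{\lambda}}\|^{3}} - \sum_{j=1}^{m} \mu_j \frac{\partial G_j}{\partial \boldsymbol{\lambda}} = 0. \tag{13}$$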
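The step from (8) to (10) and (11) can be spelled out as a short worked derivation. It is a sketch under two assumptions: the speed $\|\dot{\boldsymbol{\lambda}}\|$ does not vanish along the path, and the only pathwise constraint is the mechanical one (5). The calligraphic symbol for the force level and the tilde for its prescribed value are notational choices of this sketch.

```latex
% Integrand of (3) and its partial derivatives
\[
f = \|\dot{\boldsymbol{\lambda}}\|, \qquad
\frac{\partial f}{\partial \boldsymbol{\lambda}} = 0, \qquad
\frac{\partial f}{\partial \dot{\boldsymbol{\lambda}}}
  = \frac{\dot{\boldsymbol{\lambda}}}{\|\dot{\boldsymbol{\lambda}}\|}.
\]
% Gradient of the mechanical constraint (5): the prescribed level depends on t only
\[
\frac{\partial G_m}{\partial \boldsymbol{\lambda}}
  = \frac{\partial \mathcal{F}}{\partial \boldsymbol{\lambda}}.
\]
% Substituting into (8) with m = 1 and multiplying by -1 gives (10)
\[
\frac{d}{dt}\!\left[\frac{\dot{\boldsymbol{\lambda}}}{\|\dot{\boldsymbol{\lambda}}\|}\right]
  - \mu\,\frac{\partial \mathcal{F}}{\partial \boldsymbol{\lambda}} = 0.
\]
% Time derivative of the speed
\[
\frac{d}{dt}\|\dot{\boldsymbol{\lambda}}\|
  = \frac{\langle \dot{\boldsymbol{\lambda}}, \ddot{\boldsymbol{\lambda}} \rangle}
         {\|\dot{\boldsymbol{\lambda}}\|}.
\]
% Quotient rule, then a common denominator: the first term of (11)
\[
\frac{d}{dt}\!\left[\frac{\dot{\boldsymbol{\lambda}}}{\|\dot{\boldsymbol{\lambda}}\|}\right]
  = \frac{\ddot{\boldsymbol{\lambda}}}{\|\dot{\boldsymbol{\lambda}}\|}
  - \frac{\dot{\boldsymbol{\lambda}}\,
          \langle \dot{\boldsymbol{\lambda}}, \ddot{\boldsymbol{\lambda}} \rangle}
         {\|\dot{\boldsymbol{\lambda}}\|^{3}}
  = \frac{\ddot{\boldsymbol{\lambda}}\,
          \langle \dot{\boldsymbol{\lambda}}, \dot{\boldsymbol{\lambda}} \rangle
        - \dot{\boldsymbol{\lambda}}\,
          \langle \dot{\boldsymbol{\lambda}}, \ddot{\boldsymbol{\lambda}} \rangle}
         {\|\dot{\boldsymbol{\lambda}}\|^{3}}.
\]
% For n = 3, the identity a x (b x a) = b<a,a> - a<a,b> turns the numerator into (12)
\[
\ddot{\boldsymbol{\lambda}}\,
\langle \dot{\boldsymbol{\lambda}}, \dot{\boldsymbol{\lambda}} \rangle
- \dot{\boldsymbol{\lambda}}\,
\langle \dot{\boldsymbol{\lambda}}, \ddot{\boldsymbol{\lambda}} \rangle
= \dot{\boldsymbol{\lambda}} \times
  \bigl(\ddot{\boldsymbol{\lambda}} \times \dot{\boldsymbol{\lambda}}\bigr).
\]
```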
Note that the acoustical constraints [see (4)] are not present in any of these equations. This is because they are related only to the boundary conditions but not to the whole path $\boldsymbol{\lambda}(t)$, which is precisely the complexity of our case. In fact, (9)–(13) and (5) are only the necessary conditions that the optimum solution $\boldsymbol{\lambda}(t)$ must satisfy, and they are not sufficient for its complete determination. Generally, the latter is carried out with the help of the boundary conditions. However, in our case, these conditions are not given explicitly but implicitly via the acoustical constraints (4), i.e., $t_1$ is related to $G_{a,1}$, and $t_2$ to $G_{a,2}$, as follows:
$$t_j \to \boldsymbol{\lambda}_j \to \mathbf{F}_j \to G_{a,j}, \qquad j = 1, 2. \tag{14}$$
In this case, which is often called in the literature the undetermined endpoints case [100], the optimum function $\boldsymbol{\lambda}(t)$ must also satisfy a supplementary system of differential equations,⁹ involving the derivatives of the acoustical constraints, and this condition is sufficient to completely determine the optimum solution $\boldsymbol{\lambda}(t)$, thereby giving the trajectory of motion in the internal space.
The solution of such a system of partial differential equations represents a quite complicated problem of mathematical physics (we recall that the dependencies $\mathbf{F}(\boldsymbol{\lambda})$ and $\mathcal{F}(\boldsymbol{\lambda})$ are given by two nonlinear ANNs). Thus, first, we would like to discuss some mathematical issues related to (9)–(13), raising serious questions about the validity of some variants of the EPH, and then propose a solution for the problem.
⁹ The so-called left-hand and right-hand endpoint requirements [100].
D. Discussion of the Drawbacks of the Linearized λ-Model
1) Brief Description of the Linearized λ-Model: Classic EPH does not imply any dynamic description of the applied motor commands $\boldsymbol{\lambda}$. In order to bring some dynamics to the system, the so-called linearized λ-model was proposed. This variant of the EPH is basically the classic λ-model with the supplementary assumption that the time transitions between static positions in the internal space $\boldsymbol{\lambda}$ may be effected only at a constant rate. In other words, during the change from one posture to another, the commands $\boldsymbol{\lambda}$ are modified linearly with time, i.e., $\boldsymbol{\lambda}(t) = \boldsymbol{\alpha} t + \boldsymbol{\beta}$, where $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are the constant coefficients (vectors). We have already mentioned this model in Section I; it is the simplest implementation of the shortest distance principle in the internal motor-command space. However, we will show that this model works only in trivial cases, and it does not support dynamic systems well.
2) Contradiction With the Principle of the Shortest Path:
We will show now that if the constraints on the form of the
path, i.e., Gj(λ,t), are nonlinear10 in λ, the principle of the
shortest path and the linearized λ-model are in mathematical
contradiction.
Reductio ad absurdum: The only solutions that the linearized λ-model allows are the linear ones, λ(t) = αt + β. If we substitute these linear solutions into (13), the left part of this equation is always equal to zero, which means that the remaining part must always be zero as well. However, the dependencies Gj(λ, t), j = 1, ..., m, are, in general, nonlinear in λ; if at least one of them is nonlinear in λ, the sum of their derivatives cannot be null in all cases. Hence, the first term in the left part of (9)–(13) cannot always be null, and thus, the optimum solutions cannot be linear functions λ(t) = αt + β. It is simple to show that the only case in which this first term vanishes is that in which λ(t) is a linear function. An extremely simple proof, which is based on geometry, may be given for the particular case of a 3-D internal space. From (12), we have

$$\dot{\lambda} \times (\ddot{\lambda} \times \dot{\lambda}) = 0$$

which means that either $\dot{\lambda}$ is parallel to $\ddot{\lambda} \times \dot{\lambda}$, or one of them is zero. The parallelism is impossible because $\ddot{\lambda} \times \dot{\lambda}$ is orthogonal to both of its arguments, and one of them is precisely $\dot{\lambda}$. Thus, one of these vectors is null. In the most general case, $\ddot{\lambda} = 0$, and therefore, λ(t) = αt + β.
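The geometric step above can be checked numerically. The following is a minimal sketch, not part of the original study: for random 3-D vectors standing in for the velocity and acceleration of λ (the names `v` and `acc` are illustrative), it verifies that the velocity is orthogonal to the cross product of acceleration and velocity, so that the norm of the triple product equals the product of the two norms and can vanish only if one of the two vectors is null.

```python
import random

def cross(a, b):
    """Cross product of two 3-D vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return dot(a, a) ** 0.5

random.seed(0)
max_dot = 0.0   # largest |v . (acc x v)| seen, expected to be ~0
max_gap = 0.0   # largest | |v x (acc x v)| - |v| |acc x v| | seen, expected ~0
for _ in range(1000):
    v = tuple(random.uniform(-1.0, 1.0) for _ in range(3))    # stands for d(lambda)/dt
    acc = tuple(random.uniform(-1.0, 1.0) for _ in range(3))  # stands for d2(lambda)/dt2
    w = cross(acc, v)
    max_dot = max(max_dot, abs(dot(v, w)))
    max_gap = max(max_gap, abs(norm(cross(v, w)) - norm(v) * norm(w)))

print(max_dot, max_gap)
```

Both printed quantities stay at the level of floating-point rounding, which is consistent with the orthogonality argument used in the proof.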
It is important to note that the impossibility of the linear solu-
tions is not due to any particular formulation of the constraints
but to the nature of the geodesic problem itself, for which the
solutions are, in fact, always determined by the constraints (e.g.,
for the original geodesic problem, the Earth’s surface determines
the corresponding solutions). If the constraints are nonlinear, the optimum solutions λ(t) cannot be linear. Thus, only the trivial constraints related to the whole traveled path λ(t) [e.g., those described in footnote 7] can be compatible with the linearized λ-model. These findings cast doubt on the use of the linearized λ-model for task planning and on its general use in motor-control theories.
3) Contradiction With the Finite-Energy and Finite-Power Principles: Another contradiction is that with the finite-energy and finite-power principles. Generally, processes having finite energy and, especially, finite power are said to be physically stable (note that some processes can have infinite energy but finite power, for instance, classic small undamped harmonic oscillations). It is not complicated to show that Feldman’s linearized λ-model (both static and dynamic variants; see Section I) may lead to a mechanical process of infinitely growing energy and power. We decided to compare four spring models: the classic linear damped spring model, the exponential damped spring model, the exponential damped spring model controlled by a linear λ command (i.e., the static Feldman’s model with λ(t) = αt + β), and the exponential damped spring model controlled by a linear λ command with the adjustment of feedback related to the current velocity¹¹ (i.e., the dynamic Feldman’s model with λ(t) = αt + β). The first model is classically described by a second-order ODE of motion: $\ddot{x} + \gamma\dot{x} + \rho x = 0$. The second one can be described by

$$\ddot{x} + \gamma\dot{x} + \rho\left(e^{\eta x} - 1\right) = 0 \tag{15}$$

the third model by

$$\ddot{x} + \gamma\dot{x} + \rho\left(e^{\eta x - \alpha t - \beta} - 1\right) = 0 \tag{16}$$

and the fourth one by

$$\ddot{x} + \gamma\dot{x} + \rho\left(e^{\eta x + \kappa\dot{x} - \alpha t - \beta} - 1\right) = 0 \tag{17}$$

the displacement being a function of time, i.e., x ≡ x(t), and γ, ρ, η, κ, α, and β being constant parameters. These parameters are (for all models) given by γ = 0.16, ρ = 1, η = 1, κ = 0.05, α = +0.25 (case α > 0), α = −0.50 (case α < 0), and β = 5. The initial conditions, which are fixed for all differential equations, are x(0) = 0 and $\dot{x}(0) = 1$. Unfortunately, only the first equation has an exact analytical solution. We, therefore, had to resort to numerical methods in order to obtain the corresponding solutions (namely, we used MATLAB’s ode45 function, which is based on the Runge–Kutta method). The results are shown in Fig. 5. From this figure, we can ascertain that both the linear and the exponential noncontrolled spring models have stable, exponentially damped oscillating solutions. On the contrary, for the exponential spring models controlled by a linear λ command, the solutions increase without bound, tending to +∞ for α > 0, and decrease without bound, tending to −∞ for α < 0; therefore, in both latter cases, they are neither in L1 nor in L2 for t ∈ [0, ∞), and the energy and power of such a process tend to infinity.

Fig. 5. The behaviors of the classic linear spring, the exponential spring, and the exponential springs of the Feldman’s linearized λ-model (static and dynamic variants) for positive and negative α.
¹⁰ Or even linear, but in other contexts than described in footnote 7.
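The comparison of the four spring models can be reproduced in outline with a few lines of code. The sketch below is an illustration under stated assumptions, not the authors' MATLAB script: it uses a hand-written fixed-step fourth-order Runge–Kutta integrator instead of ode45, it assumes that the exponents of the controlled models read ηx − αt − β and ηx + κẋ − αt − β, and the final time `t_end` and step `dt` are arbitrary choices.

```python
import math

# Parameters quoted in the text; the integration settings below are assumptions.
GAMMA, RHO, ETA, KAPPA, BETA = 0.16, 1.0, 1.0, 0.05, 5.0

def accel_linear(t, x, v):
    """Classic linear damped spring: x'' = -gamma x' - rho x."""
    return -GAMMA * v - RHO * x

def accel_exponential(t, x, v):
    """Exponential damped spring without control."""
    return -GAMMA * v - RHO * (math.exp(ETA * x) - 1.0)

def make_controlled(alpha, kappa=0.0):
    """Exponential spring driven by lambda(t) = alpha t + beta.
    kappa = 0 gives the static variant, kappa > 0 the dynamic one."""
    def accel(t, x, v):
        z = ETA * x + kappa * v - alpha * t - BETA
        return -GAMMA * v - RHO * (math.exp(z) - 1.0)
    return accel

def simulate(accel, t_end=60.0, dt=0.005, x0=0.0, v0=1.0):
    """Integrate x'' = accel(t, x, x') with classical RK4; return x(t_end)."""
    t, x, v = 0.0, x0, v0
    for _ in range(int(round(t_end / dt))):
        k1x, k1v = v, accel(t, x, v)
        k2x = v + 0.5 * dt * k1v
        k2v = accel(t + 0.5 * dt, x + 0.5 * dt * k1x, k2x)
        k3x = v + 0.5 * dt * k2v
        k3v = accel(t + 0.5 * dt, x + 0.5 * dt * k2x, k3x)
        k4x = v + dt * k3v
        k4v = accel(t + dt, x + dt * k3x, k4x)
        x += dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        v += dt * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
        t += dt
    return x

x_linear = simulate(accel_linear)
x_exp = simulate(accel_exponential)
x_static_pos = simulate(make_controlled(+0.25))
x_static_neg = simulate(make_controlled(-0.50))
x_dynamic_pos = simulate(make_controlled(+0.25, KAPPA))
x_dynamic_neg = simulate(make_controlled(-0.50, KAPPA))

for name, value in [("linear", x_linear), ("exponential", x_exp),
                    ("static, alpha>0", x_static_pos), ("static, alpha<0", x_static_neg),
                    ("dynamic, alpha>0", x_dynamic_pos), ("dynamic, alpha<0", x_dynamic_neg)]:
    print(f"{name:18s} x(60) = {value:8.3f}")
```

Under these assumptions, the two noncontrolled springs settle near zero, whereas the controlled ones follow the moving equilibrium (αt + β)/η and therefore drift away without bound as the final time is increased.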
4) Potentially Limited System Dynamics: Finally, the third drawback of linearized λ-models lies in the fact that the prescription of linear variations of λ(t) may potentially limit the dynamics of the system. Taking into account that the dependency F(λ) is associated with the physics of the system (its biomechanics) and cannot be changed without relearning, the prescription of a linear λ(t) directly impacts the dynamics of the system. In this case, it is clear that if F(λ) is linear in λ, then so is F(t) ≡ F(λ(t)) in t; if F(λ) is quadratic in λ, then so is F(t) in t, etc. In this context, it is difficult to imagine how, with the linearized λ-model, one can truly implement the dynamic-target-specification theory, thus allowing, for example, some variability of transitions (see Section I), if the system dynamics are fixed by its biomechanics and the CNS influence is taken into account only in a sketchy linear form.
¹¹ It is sometimes called the proprioceptive feedback.
E. Solution in First Approximation
The exact analytical solution of the optimum internal model problem, which is described in Section II-C, is not simple. The solution of a system of partial differential equations that involve neural networks, and whose boundary conditions are not given explicitly, is a quite complicated problem of mathematical physics. On the other hand, one can easily see from Fig. 4 that the force dependencies F(λ) are not strongly nonlinear, and therefore, they may be replaced by straight lines in first approximation. As to the formant dependencies F(λ), they are strongly nonlinear; however, as follows from Section II-C, the corresponding constraints do not affect the form of the solution but only its endpoints. Thus, based on Section II-D2, the optimum solutions become the straight lines λ(t) = αt + β, and the undetermined coefficients α and β are the variables by which the mechanical and acoustical constraints must be satisfied. In other words, in our case, where the constraints on the form of the path are not far from linear ones, the linearized λ-model can actually be viewed as the first approximation to the global problem of finding the optimal path. We can no longer “play” with the form of this optimum path λ(t) (that was precisely the main role of variational calculus) but only with the limits of the integral (3), which are written in implicit form, i.e., λ1 ≡ λ(t1) and λ2 ≡ λ(t2). Since the force dependency F(λ) is now linear and the optimum solution λ(t) is a straight line, it is sufficient for the constraint (5) to be satisfied at only two points (e.g., at the ends λ1 and λ2) in order to be satisfied at every point of the optimum solution λ(t). Mathematically, this means that the mechanical constraint (5), which was initially on the form of the path, now becomes a constraint on the boundary conditions. As to the acoustical constraints (4), they remain, as before, on the boundary conditions λ1 and λ2. By substituting the linear solutions λ(t) = αt + β into the functional (3), we obtain the following well-known formula for the length of a straight line in n-dimensional space:

$$L(\lambda_1, \lambda_2) = \int_{t_1}^{t_2} \|\alpha\|\, dt = \|\alpha\|\,(t_2 - t_1) = \|\lambda_2 - \lambda_1\|.$$
Moreover, we extend this approach to more than two vowels between which the commands are linear, say, p vowels. In this case, the total length becomes

$$L(\lambda_1, \ldots, \lambda_p) = \sum_{j=1}^{p-1} \int_{t_j}^{t_{j+1}} \|\alpha\|\, dt = \sum_{j=1}^{p-1} \|\lambda_{j+1} - \lambda_j\| = \sum_{j=1}^{p-1} \sqrt{\sum_{i=1}^{n} (\lambda_{i,j+1} - \lambda_{i,j})^2}. \tag{18}$$

Obviously, in this case, the boundary conditions have to be fulfilled at the points λ1, λ2, ..., λp, instead of λ1 and λ2.
We now state the exact mathematical formulation of the problem: We seek to extremize the function L(λ1, ..., λp) with respect to the variables (λ1, ..., λp), under p acoustical¹² and p mechanical constraints [see also (4) and (5)]

$$G_{a,j}(\lambda_j) = 0, \quad j = 1, \ldots, p$$
$$G_{m,j}(\lambda_j) = 0, \quad j = 1, \ldots, p. \tag{19}$$

Thus, instead of the optimization with respect to the form of the traveled path and its ends (defined implicitly via the boundary conditions), we now carry out the optimization only with respect to the ends (λ1, ..., λp). In other words, the initial problem of the constrained optimization of a functional becomes that of the constrained optimization of a function, which is usually simpler to solve.
The latter optimization problem is classically solved by means of the Lagrange’s undetermined multipliers method for functions. This method consists in introducing a composite function U(λ1, ..., λp, µa, µm) of n × p + 2p variables, which is the sum of the function L(λ1, ..., λp) of n × p variables and of the 2p constraints (19), which are weighted by the corresponding Lagrange’s multipliers µa ≡ (µa,1, ..., µa,p) and µm ≡ (µm,1, ..., µm,p)

$$U = L + \sum_{h=1}^{p} \mu_{a,h}\, G_{a,h}(\lambda_h) + \sum_{h=1}^{p} \mu_{m,h}\, G_{m,h}(\lambda_h). \tag{20}$$

Here, the index j was replaced by h in order to avoid confusion with the derivatives that follow. The optimization itself consists in finding the optimum set $(\hat{\lambda}_1, \ldots, \hat{\lambda}_p, \hat{\mu}_a, \hat{\mu}_m)$ such that

$$\left.\frac{\partial U}{\partial \lambda_j}\right|_{\lambda_j = \hat{\lambda}_j} = 0, \quad \left.\frac{\partial U}{\partial \mu_a}\right|_{\mu_a = \hat{\mu}_a} = 0, \quad \left.\frac{\partial U}{\partial \mu_m}\right|_{\mu_m = \hat{\mu}_m} = 0, \quad j = 1, \ldots, p \tag{21}$$

or, in short, grad U = 0. We recall that setting the derivatives of U with respect to the Lagrange’s multipliers to zero actually gives the constraints (19), which is precisely the interest of the Lagrange’s method: The constrained optimization of the function L of n × p variables reduces to the unconstrained one of the function U of n × p + 2p variables, where the last 2p variables are the Lagrange’s undetermined multipliers.
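To make the stationarity conditions concrete, here is a minimal toy sketch. It does not use the ANN-based constraints of the paper: the two circular zones (`CENTERS`, `RADII`) and the 2-D internal space are hypothetical stand-ins chosen so that the solution is known in closed form (the shortest segment joins the circles along the line of centers, with length 10 − 1 − 2 = 7). The code evaluates grad U by central differences and shows that it vanishes at that solution but not at another point satisfying the first constraint.

```python
import math

# Hypothetical toy zones (not from the paper): two circles in a 2-D space.
CENTERS = [(0.0, 0.0), (10.0, 0.0)]
RADII = [1.0, 2.0]

def constraint(j, lam):
    """Toy equality constraint G_j(lam) = ||lam - c_j||^2 - r_j^2."""
    cx, cy = CENTERS[j]
    return (lam[0] - cx) ** 2 + (lam[1] - cy) ** 2 - RADII[j] ** 2

def composite_u(z):
    """U = L + sum_h mu_h G_h for p = 2 points.
    z = [lam1_x, lam1_y, lam2_x, lam2_y, mu_1, mu_2]."""
    lam1, lam2, mu = z[0:2], z[2:4], z[4:6]
    length = math.hypot(lam2[0] - lam1[0], lam2[1] - lam1[1])
    return length + mu[0] * constraint(0, lam1) + mu[1] * constraint(1, lam2)

def gradient(f, z, h=1e-6):
    """Central-difference gradient of f at z."""
    g = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += h
        zm[i] -= h
        g.append((f(zp) - f(zm)) / (2.0 * h))
    return g

# Closed-form solution of the toy problem and its multipliers.
z_star = [1.0, 0.0, 8.0, 0.0, 0.5, 0.25]
grad_star = gradient(composite_u, z_star)

# Another point where the first point still lies on its circle, but which is not optimal.
z_other = [0.0, 1.0, 8.0, 0.0, 0.5, 0.25]
grad_other = gradient(composite_u, z_other)

print(max(abs(g) for g in grad_star), max(abs(g) for g in grad_other))
```

The last two components of the gradient are the constraints themselves, which illustrates why setting the derivatives with respect to the multipliers to zero recovers (19).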
¹² Note that the use of the acoustical constraints of the second kind instead of the first one cannot be considered too restrictive for our problem from an acousticophonetic point of view. On the one hand, the position of the borders is variable and can be adjusted via the parameters $\Delta_{l,j}$ and $\tilde{F}_{l,j}$. On the other hand, as we shall see later, almost always, the solution for the constraints of the second kind is also that for the constraints of the first kind.
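The procedure of differentiation is quite particular for the length function L(λ1, ..., λp). The derivative with respect to λj is not always calculated in the same way, because the first term λ1 and the last term λp are present only once in the sum (18), while all the intermediate terms λ2, ..., λp−1 are present twice. Thus, by differentiating with respect to each component of λ1, we obtain the following system:

$$\frac{\partial L}{\partial \lambda_{i,1}} = -\frac{\lambda_{i,2} - \lambda_{i,1}}{\sqrt{\sum_{r=1}^{n} (\lambda_{r,2} - \lambda_{r,1})^2}}, \quad i = 1, \ldots, n \tag{22}$$

or in the following vector form:

$$\frac{\partial L}{\partial \lambda_1} = -\frac{\lambda_2 - \lambda_1}{\|\lambda_2 - \lambda_1\|}. \tag{23}$$

With respect to λ2, ..., λp−1, we obtain

$$\frac{\partial L}{\partial \lambda_j} = \frac{\lambda_j - \lambda_{j-1}}{\|\lambda_j - \lambda_{j-1}\|} - \frac{\lambda_{j+1} - \lambda_j}{\|\lambda_{j+1} - \lambda_j\|} \tag{24}$$

where j = 2, ..., p − 1. Finally, for j = p, it yields

$$\frac{\partial L}{\partial \lambda_p} = \frac{\lambda_p - \lambda_{p-1}}{\|\lambda_p - \lambda_{p-1}\|}. \tag{25}$$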
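The three cases of the derivative (first, intermediate, and last point) can be sketched as follows and compared with a finite-difference gradient. This is a minimal illustration with made-up points in a 3-D space; the function names are not taken from the original implementation.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def unit_difference(a, b):
    """Unit vector (a - b) / ||a - b||."""
    d = [x - y for x, y in zip(a, b)]
    n = norm(d)
    return [x / n for x in d]

def path_length(points):
    """Total length of the piecewise-linear path through the points, as in (18)."""
    return sum(norm([x - y for x, y in zip(points[j + 1], points[j])])
               for j in range(len(points) - 1))

def length_gradient(points):
    """Analytic gradient of the path length with respect to every point.
    The incoming segment contributes +u, the outgoing segment contributes -u,
    so the first and last points receive only one of the two terms."""
    p, n = len(points), len(points[0])
    grads = []
    for j in range(p):
        g = [0.0] * n
        if j > 0:
            u = unit_difference(points[j], points[j - 1])
            g = [gi + ui for gi, ui in zip(g, u)]
        if j < p - 1:
            u = unit_difference(points[j + 1], points[j])
            g = [gi - ui for gi, ui in zip(g, u)]
        grads.append(g)
    return grads

def numerical_gradient(points, h=1e-6):
    """Central-difference gradient, used only as a cross-check."""
    grads = []
    for j in range(len(points)):
        row = []
        for i in range(len(points[j])):
            plus = [list(pt) for pt in points]
            minus = [list(pt) for pt in points]
            plus[j][i] += h
            minus[j][i] -= h
            row.append((path_length(plus) - path_length(minus)) / (2.0 * h))
        grads.append(row)
    return grads

# Illustrative points: segments of length 5 and 12.
pts = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [3.0, 4.0, 12.0]]
analytic = length_gradient(pts)
numeric = numerical_gradient(pts)
max_error = max(abs(a - b) for ra, rb in zip(analytic, numeric) for a, b in zip(ra, rb))
print(path_length(pts), analytic, max_error)
```

For these points the intermediate gradient is the difference of the two unit vectors, (0.6, 0.8, −1), which is the pattern expressed by the middle formula.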
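The differentiation of the constraints is less sophisticated. For the acoustical one, it becomes

$$\frac{\partial G_{a,h}(\lambda_h)}{\partial \lambda_j} = \begin{cases} 2\eta_j \displaystyle\sum_{l=1}^{k} \frac{\left(F_{l,j}(\lambda_j) - \tilde{F}_{l,j}\right)^{2\eta_j - 1}}{\Delta_{l,j}^{2\eta_j}} \cdot \frac{\partial F_{l,j}}{\partial \lambda_j}, & h = j \\ 0, & h \neq j \end{cases}$$

for j = 1, ..., p, where the derivatives of the lth formant of the jth vowel with respect to the motor commands of this vowel λj are calculated according to (36), i.e.,

$$\frac{\partial F_{l,j}}{\partial \lambda_j} = 2 \sum_{s=1}^{S_1} b_1^2\, v_{s,l}\, (w_s - \lambda_j)\, e^{-(\|w_s - \lambda_j\|\, b_1)^2} \tag{26}$$

for l = 1, ..., k, j = 1, ..., p, and dim λj = n (i.e., k × p × n derivatives in all). It may be noted that since the constraints on the boundary conditions of the hth vowel are independent of the jth vowel, these derivatives are all null for h ≠ j. The derivatives of the mechanical constraint are simply given by

$$\frac{\partial G_{m,h}(\lambda_h)}{\partial \lambda_j} = \begin{cases} \dfrac{\partial F(\lambda_j)}{\partial \lambda_j}, & h = j \\ 0, & h \neq j \end{cases} \tag{27}$$

where the last derivatives are calculated similarly to the formant ones (without the index l and with the ANN weights corresponding to the λ → F force network).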
The system (21) cannot be solved analytically. The optimiza-
tion of Uwas, therefore, performed numerically by means of the
gradient descent method (which is also known as the method of
steepest descent), implemented in the MATLAB programming
language.
TABLE I
FREQUENCIES (IN HERTZ) WE USED FOR THE DEFINITION OF THE PHONETIC TARGETS FOR THE ACOUSTICAL CONSTRAINT
III. RESULTS
A. Only Acoustical Constraints Are Applied
1) Simulations With the Model and Effect of Different Task-Planning Strategies: We first considered a slightly simpler case, namely, the one without the third term in (20), i.e., without the mechanical constraints (5) at all. First of all, we fixed static phonetic targets (i.e., prescribed formants) according to Table I. Besides, the formant frequencies $\tilde{\mathbf{F}}_j$ are also used for the initialization of the optimization algorithm: In the initial database used for the ANN learning, we search for the data pairs λ → F having F as close as possible to the formant zone’s centers $\tilde{\mathbf{F}}_j$ (these frequencies are generally found within ±10 Hz for the first formant and ±30 Hz for the second one). The corresponding motor commands λ are taken as initial points for the gradient descent method. Note also that since there is no bijection between the spaces λ and F, there is a finite (n − k)-dimensional volume in λ, any point of which maps to the same point in F, e.g., to the projection of the initial point $\tilde{\mathbf{F}}_j$. Thus, the initial point in λ cannot be uniquely determined. Since the numerical optimization method is gradient descent, the optimum solution may potentially be influenced by the choice of this initial point λ. However, the projection of the initial point (i.e., the ellipsoidorectangle’s center), regardless of the initial point itself in λ, is quite close to the solution (i.e., to the ellipsoidorectangle’s border); moreover, the function L(λ1, ..., λp), which is given by (18), is “sufficiently” convex, and the constraints (19) can be considered locally monotonic¹³ (see Fig. 4). Therefore, the local-minima problem does not really affect the solution. In addition, we also tested this potential dependency empirically, and the optimization algorithm always returned the same final solution (to within the ANN accuracy), regardless of the initial point.
The optimization results given by our algorithm for the sequence of three vowels [i a O] are shown in Fig. 6. As we can observe from this figure, the formant zones were defined as ellipses, i.e., all ηj = 1. The set of the Lagrange’s multipliers found by our optimization algorithm is $\hat{\mu}_a$ = (0.0670, 0.1700, 0.0880) mm, and that of the optimal motor commands $\hat{\lambda}_1, \ldots, \hat{\lambda}_p$ (in millimeters) is given by

$$\begin{aligned}
\hat{\lambda}_1 &= (33.76,\ 48.12,\ 46.84,\ 74.92,\ 18.38,\ 63.78), && \text{for [i]} \\
\hat{\lambda}_2 &= (46.96,\ 48.02,\ 41.78,\ 74.04,\ 18.70,\ 63.90), && \text{for [a]} \\
\hat{\lambda}_3 &= (51.04,\ 49.44,\ 41.21,\ 72.02,\ 18.74,\ 64.02), && \text{for [O]}
\end{aligned}$$

where the gradient step was set to 0.0125. The set of the corresponding formants $\hat{\mathbf{F}}_1, \ldots, \hat{\mathbf{F}}_p \equiv \mathbf{F}(\hat{\lambda}_1), \ldots, \mathbf{F}(\hat{\lambda}_p)$, which is the projection of the found optimal commands onto the formant space, is given by (in hertz, first and second formants)

$$\begin{aligned}
\hat{\mathbf{F}}_1 &= (353.3,\ 2117.2), && \text{for [i]} \\
\hat{\mathbf{F}}_2 &= (569.1,\ 1284.4), && \text{for [a]} \\
\hat{\mathbf{F}}_3 &= (570.9,\ 1054.5), && \text{for [O].}
\end{aligned} \tag{28}$$

Fig. 6. Optimization of the sequence [i a O]. Formant zones are defined by the corresponding formant ellipses. Notation: “Optimum ends” ≡ $\mathbf{F}(\hat{\lambda}_j)$ ≡ $(\hat{\mathbf{F}}_1, \ldots, \hat{\mathbf{F}}_p)$.
¹³ One may also note that the projections of the traveled paths onto the formant space F (shown in dots in the figures) are just slightly curved in all experiments.
We can clearly observe that when the optimization is finished, all three acoustical constraints vanish, and the function U reaches its minimum, which is designated by $\hat{U}$ and is equal to 18.962 mm. A total of 253 iterations of the method of steepest descent for the Lagrange’s multipliers method were necessary to find this optimal solution. Note that the path traveled in the formant space F is shown in dots, in order to emphasize the fact that the real path is actually traveled in the internal space λ, and the path shown in the formant space is only its projection onto F. Analogously, the optimum endpoints are found in the internal space λ, and we show their projections onto the formant space. Note that they lie exactly on the ellipses’ borders, which means that the acoustical constraints from (19) are fulfilled (the same can actually be observed directly in the lower left panel of Fig. 6).
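Since the constraints vanish at the optimum, the reported minimum of U should coincide with the path length (18) evaluated at the reported optimal commands. The short check below is an illustrative sketch that takes the three command vectors exactly as printed above (rounded to two decimals, in millimeters) and sums the two segment lengths.

```python
import math

# Optimal motor commands for [i], [a], [O] as printed in the text (mm).
reported_commands = [
    (33.76, 48.12, 46.84, 74.92, 18.38, 63.78),  # [i]
    (46.96, 48.02, 41.78, 74.04, 18.70, 63.90),  # [a]
    (51.04, 49.44, 41.21, 72.02, 18.74, 64.02),  # [O]
]
REPORTED_MINIMUM_MM = 18.962  # minimum of U quoted in the text

def path_length(points):
    """Sum of Euclidean distances between consecutive points, as in (18)."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

segment_lengths = [math.dist(a, b)
                   for a, b in zip(reported_commands, reported_commands[1:])]
total_length = path_length(reported_commands)
print(segment_lengths, total_length, total_length - REPORTED_MINIMUM_MM)
```

The sum comes out at about 18.97 mm, which agrees with the quoted 18.962 mm to within the rounding of the printed commands.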
We now present a different case of optimization: All initial parameters are the same, except the definition of the formant zone for the vowel [a], which is defined by an ellipsoidorectangle with parameter η2 = 3 (see Fig. 7). For this case, the optimization algorithm found $\hat{\mu}_a$ = (0.0645, 0.0580, 0.0850) mm and $\hat{U}$ = 18.693 mm. The set of the formants $(\hat{\mathbf{F}}_1, \ldots, \hat{\mathbf{F}}_p)$, which corresponds to the found optimal commands, is given by (in hertz, first and second formants)

$$\begin{aligned}
\hat{\mathbf{F}}_1 &= (352.7,\ 2115.4), && \text{for [i]} \\
\hat{\mathbf{F}}_2 &= (568.1,\ 1262.5), && \text{for [a]} \\
\hat{\mathbf{F}}_3 &= (568.9,\ 1057.3), && \text{for [O].}
\end{aligned} \tag{29}$$

Fig. 7. Optimization of the sequence [i a O]. Formant zones of [i] and [O] are defined by formant ellipses (i.e., η1 = η3 = 1) and that of [a] by an ellipsoidorectangle with parameter η2 = 3.
By comparing the two sets (28) and (29), or Figs. 6 and 7, we note that the main difference lies in the formants of [a], especially the second formant; in the first case, it is 21.9 Hz (1.7%) greater than in the second one. This actually means that by accenting the same vowel differently in the same sequence, we can obtain different acoustical variants of it. Note that the parameter η = 3 means, on the one hand, a different geometrical form of the formant zone and, on the other hand, that when the solution reaches the border and starts to leave the formant zone, the acoustical constraint increases much more strongly than for the normal ellipse given by η = 1. Thus, the acoustical constraint defines not only the formant zone but also the degree of accentuation of this zone (note that we employ the word “accentuation” specifically in this sense). Therefore, even small, simple changes in the task-planning strategy can have an impact on the formant space F. Note also that the character of the impact is mostly local, i.e., the other vowels of the sequence, whose strategy remained unchanged, showed relatively small modifications of their positions in the formant space F.
2) Phenomenon of Task Anticipation: We found that our model is in accordance with the phenomenon of task anticipation or, more precisely, with its acoustic variant, which has long been observed by phoneticians and is known in phonetics as coarticulation or the effect of phonetic environment [80]–[82], [108]–[112]. In a sequence of several vowels, this phenomenon represents the influence of the following vowel(s) on the previous one(s); it especially concerns two vowels following one after another, i.e., in a sequence of p vowels, the jth vowel is especially influenced by the (j + 1)th vowel, j = 1, ..., p − 1. Acoustically, this phenomenon can be observed in terms of formants. It is usually present in real speech to different degrees (depending on the context and other conditions, e.g., accentuation), which is why we wanted to find out whether the proposed model was able to reproduce it.
Fig. 8. Optimization of the sequence [i a œ O]. Formant zones are defined by the corresponding formant ellipses.
Fig. 9. Optimization of the sequence [i a O œ]. Formant zones are defined by the corresponding formant ellipses.
For demonstration, we choose an example where the triple phenomenon of task anticipation is produced: the sequences [i a œ O] versus [i a O œ]. Thus, the acoustical anticipation will be studied on the vowels [a], [O], and [œ]. In addition, two different planning strategies will be compared. The optimizations of the sequences with elliptic planning strategies related to the constraints are shown in Figs. 8 and 9. For the former, the optimization algorithm returns $\hat{\mu}_a$ = (0.0673, 0.1770, 0.1200, 0.0780) mm, $\hat{U}$ = 23.989 mm, and the optimal formant set (in hertz)

$$\begin{aligned}
\hat{\mathbf{F}}_1 &= (357.4,\ 2117.2), && \text{for [i]} \\
\hat{\mathbf{F}}_2 &= (571.1,\ 1296.0), && \text{for [a]} \\
\hat{\mathbf{F}}_3 &= (528.6,\ 1422.1), && \text{for [œ]} \\
\hat{\mathbf{F}}_4 &= (551.0,\ 1062.8), && \text{for [O].}
\end{aligned} \tag{30}$$

For the latter, the optimization algorithm returns $\hat{\mu}_a$ = (0.0658, 0.1980, 0.1419, 0.061) mm, $\hat{U}$ = 25.010 mm, and the optimal formant set (in hertz)

$$\begin{aligned}
\hat{\mathbf{F}}_1 &= (354.0,\ 2119.9), && \text{for [i]} \\
\hat{\mathbf{F}}_2 &= (567.6,\ 1285.1), && \text{for [a]} \\
\hat{\mathbf{F}}_3 &= (556.9,\ 1066.5), && \text{for [O]} \\
\hat{\mathbf{F}}_4 &= (515.8,\ 1388.0), && \text{for [œ].}
\end{aligned} \tag{31}$$

Thus, we can note a slight anticipation on each vowel; however, taking the precision into account, we mainly observe the anticipation on the vowel [œ]; the formant difference between the two cases is ΔF = (12.76, 34.12) Hz, which represents 2.4% on each formant, or 3.4% of total difference.¹⁴
Fig. 10. Optimization of the sequence [i a œ O]. Formant zone of [a] is defined by an ellipsoidorectangle of η2 = 3.
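The percentages quoted for [œ] can be recomputed from the two printed formant sets. The sketch below assumes that the total difference is the root of the sum of squared relative deviations, multiplied by 100, and that the first sequence serves as the reference; the function name is illustrative, and the printed formants are rounded, so the deviations differ slightly from the unrounded (12.76, 34.12) Hz.

```python
import math

def total_difference_percent(reference, other):
    """100 * sqrt(sum over formants of ((other - reference) / reference)^2)."""
    return 100.0 * math.sqrt(sum(((o - r) / r) ** 2
                                 for r, o in zip(reference, other)))

# Formants of [oe] in the two sequences, as printed in (30) and (31), in Hz.
oe_first_sequence = (528.6, 1422.1)   # [i a oe O]
oe_second_sequence = (515.8, 1388.0)  # [i a O oe]

per_formant_percent = [100.0 * abs(o - r) / r
                       for r, o in zip(oe_first_sequence, oe_second_sequence)]
total_percent = total_difference_percent(oe_first_sequence, oe_second_sequence)
print(per_formant_percent, total_percent)
```

Both per-formant deviations come out close to 2.4%, and the total difference is close to 3.4%, in line with the values stated in the text.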
We will now show that, according to our model, the chosen strategy can be the reason for greater or smaller anticipation. Once again, we change the strategy for the vowel [a] by defining its formant zone by an ellipsoidorectangle of η2 = 3, and we compare the two previous sequences under these conditions. The results are presented in Figs. 10 and 11. For the sequence [i a œ O], our optimization algorithm returned $\hat{\mu}_a$ = (0.0674, 0.052, 0.123, 0.078) mm, $\hat{U}$ = 23.885 mm, and the optimal formant set (in hertz)

$$\begin{aligned}
\hat{\mathbf{F}}_1 &= (358.0,\ 2122.9), && \text{for [i]} \\
\hat{\mathbf{F}}_2 &= (572.0,\ 1320.3), && \text{for [a]} \\
\hat{\mathbf{F}}_3 &= (529.7,\ 1431.6), && \text{for [œ]} \\
\hat{\mathbf{F}}_4 &= (551.8,\ 1063.5), && \text{for [O].}
\end{aligned} \tag{32}$$

For the sequence [i a O œ], it gave $\hat{\mu}_a$ = (0.065, 0.080, 0.1421, 0.061) mm, $\hat{U}$ = 25.028 mm, and the optimal formant set (in hertz)

$$\begin{aligned}
\hat{\mathbf{F}}_1 &= (353.3,\ 2118.9), && \text{for [i]} \\
\hat{\mathbf{F}}_2 &= (568.3,\ 1263.6), && \text{for [a]} \\
\hat{\mathbf{F}}_3 &= (557.0,\ 1065.2), && \text{for [O]} \\
\hat{\mathbf{F}}_4 &= (515.4,\ 1387.4), && \text{for [œ].}
\end{aligned} \tag{33}$$

We obtain an anticipation on [a] of 56.7 Hz for its second formant, which represents 4.3% of its initial value. Note that the previous vowel [i] also turned out to be affected by this modification of strategy for [a]; its first formant decreased by 4.7 Hz (1.3%). Thus, our model confirms that the degree of anticipation may depend on the chosen strategy. Moreover, since the anticipation on [a] was very small (practically null) in the case of the elliptic formant strategy, one can suppose that the anticipation itself may actually be one of the consequences of the chosen strategy. Thus, we think that the acoustical anticipation, or coarticulation, is due not to the muscular mechanics and body dynamics, as was claimed in several previous studies [32], [72], but probably to centrally planned mechanisms.¹⁵

Fig. 11. Optimization of the sequence [i a O œ]. Formant zone of [a] is defined by an ellipsoidorectangle of η2 = 3.
¹⁴ The latter is $\%td = 100\sqrt{(\Delta F_1/F_1)^2 + \cdots + (\Delta F_k/F_k)^2}$.
Finally, it would also be interesting to compare the obtained results with the real ones obtained by phoneticians. Unfortunately, a quantitative analysis of this kind seems quite difficult and inconsistent for several reasons. First, the proposed solution is only a first-approximation solution. Second, the BTM is not ideal, and neither is the ANN. Third, it is very difficult to compare our results with those found in the phonetic literature, because the measurement techniques are very different, and even basic acousticophonetic data differ strongly (e.g., compare [107] with [112], although both studies are on American English vowels). Moreover, in most of the coarticulation works, the anticipation extent is measured not by frequency deviation (as we did) but by the human-perception error rate, i.e., by means of the so-called discriminant analysis (percentage of correct versus incorrect identification by listeners) [82], [107]. This analysis is based on the fact that nearly all errors involve confusions between adjacent vowels [107], [112]. However, such an error rate is difficult to interpret in terms of the absolute formant values that we reported. Another problem is that the discriminant analysis is quite subjective (the speech is produced by some speakers and perceived by some listeners), and therefore, it is individual-dependent. Fourth, in our model, there are some individual-dependent parameters, $\eta_j$ and $\Delta_{i,j}$, which are meant to represent a concrete speaker; in other words, the set of these parameters is meant to represent age, sex, accent, effort, concentration, weariness, etc. However, of course, their concrete values are difficult to estimate numerically. This is why we think that it is better to compare the obtained results qualitatively rather than quantitatively. If, nevertheless, we agree to compare our results quantitatively with those provided by the small number of works where the coarticulation frequency deviations were reported, e.g., [80] and [108], we find that they are of the same order (for example, the anticipation on [A] is up to 10 Hz for F1 and up to 50 Hz for F2, depending on the context). However, more interesting is the overall analysis of these results. It shows, in particular, that the anticipation extent depends on the size of the formant zones, as well as on the position of the vowel inside the formant space. Generally, vowels having larger formant zones and greater opening angles to the neighboring vowels possess greater anticipation, e.g., the cases of [æ], [E], [œ], [I], [U], [u], but not [i], which has a small opening angle, or [A] and [a], which have small opening angles and zones (see, e.g., [80], [107], and [112]). We obtained similar results: quite small anticipation on [i], [a] (case of small formant zone, η = 1), or [O] (which was limited to only two close neighbors), and greater anticipation on [œ] and on [a] (case of larger formant zone, η = 3).
¹⁵ Moreover, these works reported not only that context sensitivity arises from biomechanics but also that it need not be represented in the CNS control commands, whereas we obtained this effect precisely from the CNS-planning mechanism.
3) Few Words About Two Kinds of the Acoustical Constraints From Practical Point of View and Their Roles in the Previous Optimizations: One can also note that, although we gave the solution for the constrained problem with the acoustical constraints of the second kind (see Section II-C), it is, in fact, also the solution for the problem with weaker constraints, those of the first kind, if we suppose the local monotonicity of the dependencies F(λ) inside the ellipsoidorectangles.¹⁶ This assumption can generally be made because, as we previously saw (see Fig. 4; Section II-B), these relationships can, depending on the interval, be considered locally monotonic. Furthermore, not only does this give the solution for the problem with constraints of the first kind, but it is also better to use the constraints of the second kind to find a solution to the problem with the constraints of the first kind. In fact, as we may recall from previous figures, during the optimization process (at different iterations), the solution may temporarily leave the authorized formant zone. This is not so important for the constraints of the second kind but is very important for those of the first kind, because the Heaviside function does not permit leaving the authorized zone even temporarily: its sharp increase wrongfully stops the gradient algorithm (this mostly remains true even if we approximate the Heaviside function by its continuous variants; see the formula given later). We present a comparison of such cases in Figs. 12 and 13. Note that the acoustical constraints of the first and second kinds are fulfilled in both cases (the formant projections of the optimal solutions are on the ellipses’ borders); however, the correct solution is given only by the model with the constraints of the second kind, i.e., we have $\hat{U}$ = 14.084 mm in the case of the constraints of the second kind, while in the case of the constraints of the first kind, $\hat{U}$ = 21.105 mm, which is the wrong solution, because the length is not minimum. Thus, the gradient algorithm was unable to find the correct solution by directly using the constraints of the first kind; therefore, it is better to circumvent these difficulties by using the constraints of the second kind to find the optimal solution for the constraints of the first kind. Finally, in order to improve the quality of convergence of the optimization algorithm, the Heaviside function was replaced by its continuous approximation H(z) ≈ (1 + tanh az)/2, with the parameter a taken in the range 60–1000.
¹⁶ See also the short discussion devoted to the choice of the initial point in Section III-A1.
Fig. 12. Correct optimization of the sequence [i a] by using the constraints of the second kind; those of the first kind are, therefore, also satisfied and the minimum of U is reached.
Fig. 13. Incorrect optimization of the sequence [i a] by using the constraints of the first kind; those of the second kind are also satisfied, but the minimum of U is not reached.
Fig. 14. Optimization of the sequence [i O] with corresponding mean force’s level. Formant zones are defined by the corresponding formant ellipses. Only acoustical constraints are applied.
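The continuous replacement of the Heaviside function mentioned above is easy to sketch. The code below is only an illustration of the formula (1 + tanh az)/2; the function name and the sample arguments are assumptions, and the two values of the steepness a are the ends of the quoted range.

```python
import math

def smooth_heaviside(z, a=60.0):
    """Continuous approximation of the Heaviside step: (1 + tanh(a z)) / 2."""
    return 0.5 * (1.0 + math.tanh(a * z))

# The approximation equals 1/2 at the origin and saturates quickly on both sides;
# a larger `a` makes the transition narrower.
for a in (60.0, 1000.0):
    samples = [smooth_heaviside(z, a) for z in (-0.1, -0.001, 0.0, 0.001, 0.1)]
    print(a, [round(s, 6) for s in samples])
```

With a = 60 the transition from 0 to 1 is spread over a few hundredths of a unit of z, while with a = 1000 it is confined to a few thousandths, which is the trade-off between smoothness for the gradient method and fidelity to the true step.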
B. Both Acoustical and Mechanical Constraints Are Applied
The main drawback of our model without mechanical con-
straints is that the optimal solution gives the formants that are
always on the borders of their ellipsoidorectangles, for both
kinds of the acoustical constraints. Moreover, for many vow-
els, only some parts of these borders can be covered (e.g., for
the vowel [a], only the upper part of its formant zone can be
covered), and the covered zone depends essentially on the position of the vowel inside the vocalic triangle. This is actually the
consequence of the chosen strategy of planning (i.e., the mini-
mization of the traveled path under constraints) and of the fact
that the dependency F(λ)is often near-monotonic. Therefore,
one can suppose that the addition of the mechanical constraints
may permit reaching some uncovered zones.
First of all, in all previous simulations without mechanical constraints, the mean force’s level corresponding to the optimal ends of the traveled path was not the same for all vowels. We now run another simulation without mechanical constraints, namely, for the sequence [i O].¹⁷ The optimization results are shown in Fig. 14; the numerical ones are given as

$$\begin{aligned}
\hat{U} &= 22.23\ \text{mm} \\
\hat{\mu}_a &= (0.0647,\ 0.1124)\ \text{mm} \\
\hat{\mathbf{F}}_1 &= (299.7,\ 2196.3)\ \text{Hz, for [i]} \\
\hat{\mathbf{F}}_2 &= (535.9,\ 901.0)\ \text{Hz, for [O]} \\
\hat{F}_1 &= 0.65\ \text{N, for [i]} \\
\hat{F}_2 &= 0.25\ \text{N, for [O]}
\end{aligned} \tag{34}$$

where the last two lines give the mean force’s levels $\hat{F}_j \equiv F(\hat{\lambda}_j)$, j = 1, ..., p; in other words, j corresponds to the sound in the sequence. We will now try to prescribe the mean force’s level with both kinds of the acoustical constraints and to prescribe an equal global mean force’s level to each vowel in the sequence.
Fig. 15. Optimization of the sequence [i O] with mechanical and acoustical constraints. All formant zones are defined by ellipses. Case of equal mean force’s levels (low levels).
¹⁷ We recall that in this section, [i] and [O] are those with from Table I.
In the first example, we try to force both vowels to decrease their mean force’s level to the value 0.20 N. The results of this optimization are given as

$$\begin{aligned}
\hat{U} &= 26.90\ \text{mm} \\
\hat{\mu}_a &= (0.36,\ 0.13)\ \text{mm} \\
\hat{\mu}_m &= (6.8,\ 1.0)\ \text{mm/N} \\
\hat{\mathbf{F}}_1 &= (292.5,\ 2194.4)\ \text{Hz, for [i]} \\
\hat{\mathbf{F}}_2 &= (531.8,\ 897.0)\ \text{Hz, for [O]} \\
\hat{F}_1 &= 0.20\ \text{N, for [i]} \\
\hat{F}_2 &= 0.20\ \text{N, for [O].}
\end{aligned}$$

Fig. 15 shows the optimization. Mainly, we note that [O] did not significantly change its position on the formant zone’s border, while [i] was slightly influenced. This can be understood: the prescribed mean force’s level is much closer to the unconstrained level of [O], which is why it especially impacts [i], whose unconstrained level was high. On the other hand, the modification of its position was not very great, notwithstanding the quite different muscular commands. This is normal because, as we previously said, there is no bijection between the internal and external spaces; thus, a whole domain in the internal space may correspond to one point in the external space. In other words, the most significant changes were produced in this domain of the internal space; more precisely, for [i], the total difference in the internal space λ is %td = 32.8%, while that in the external space F is only %td = 2.4%. This is also related to the fact that the length of the path traveled in the internal space increased: 26.90 mm versus 22.23 mm.
Fig. 16. Optimization of the sequence [i O] with mechanical and acoustical constraints. All formant zones are defined by ellipses. Case of equal mean force’s levels (high levels).
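In the second example, we try to obtain a high prescribed mean force level, say $\bar{F} = 0.77$ N, for all vowels in the same sequence. The optimization returns the following:

\[
\begin{aligned}
U &= 29.65\ \text{mm} \\
\boldsymbol{\mu}_a &= (0.077,\ 0.191)\ \text{mm} \\
\boldsymbol{\mu}_m &= (0.89,\ 5.50)\ \text{mm/N} \\
\mathbf{F}_1 &= (272.6,\ 2191.4)\ \text{Hz, for [i]} \\
\mathbf{F}_2 &= (562.5,\ 915.4)\ \text{Hz, for [O]} \\
\bar{F}_1 &= 0.77\ \text{N, for [i]} \\
\bar{F}_2 &= 0.77\ \text{N, for [O].}
\end{aligned}
\]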
Fig. 16 shows the optimization. In this case, both vowels were influenced: the formants of [i] by %td = 9.1% and those of [O] by %td = 5.2%; the corresponding values in the internal space $\boldsymbol{\lambda}$ are much greater: %td = 19.9% and %td = 43.7%, respectively.

Fig. 17. Optimization of the sequence [i O] with mechanical and acoustical constraints. All formant zones are defined by ellipses. Case of different mean force levels (low and high levels).
The two previous simulations showed that the model allows equal mean force levels to be prescribed to each vowel in the sequence. However, this case is not very realistic; prescribing different mean force levels to the vowels would better represent reality. In the next example, for the sequence [i O], we try to prescribe a low mean force level to the first vowel (whose level was high in the mechanically unconstrained simulation), namely $\bar{F}_{\text{i}} = 0.3$ N, and, conversely, a high level to the second one, namely $\bar{F}_{\text{O}} = 0.6$ N. This simulation, which is shown in Fig. 17, returned the following optimization results:

\[
\begin{aligned}
U &= 34.70\ \text{mm} \\
\boldsymbol{\mu}_a &= (0.139,\ 0.211)\ \text{mm} \\
\boldsymbol{\mu}_m &= (2.78,\ 4.17)\ \text{mm/N} \\
\mathbf{F}_1 &= (308.1,\ 2200.6)\ \text{Hz, for [i]} \\
\mathbf{F}_2 &= (536.8,\ 901.7)\ \text{Hz, for [O]} \\
\bar{F}_1 &= 0.30\ \text{N, for [i]} \\
\bar{F}_2 &= 0.60\ \text{N, for [O].}
\end{aligned}
\]
Both vowels were influenced, as was the character of the projection of the traveled path. As to the formant deviation, the formants of [i] changed by %td = 2.8% and those of [O] by %td = 1.9%; the corresponding values in the internal space $\boldsymbol{\lambda}$ are much larger, %td = 24.7% and %td = 32.4%, respectively, with respect to the mechanically unconstrained simulation (see (34) and Fig. 14).
These results show that, regardless of the mean force level, it is possible to keep the vowels in their formant zones (on the border); thus, there was no impact on the fulfillment of the acoustical constraints, and, mathematically, the problem could always be solved. The prescribed mean force level has a variable impact on the external formant space $\mathbf{F}$, sometimes giving access to parts of the formant zones that were not reached before. In contrast, the prescribed mean force level strongly impacts the internal command space $\boldsymbol{\lambda}$, which is always influenced to a great degree; in addition, the distance between the optimum ends always increases, which suggests that a change of the external tasks may often be compensated by the CNS.
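As a quick arithmetic check on the reported path lengths, the following minimal Python sketch computes the relative increase of U in each mechanically constrained run with respect to the unconstrained value in (34). The dictionary labels and the helper name `relative_increase` are illustrative choices, not notation from the paper; only the values of U are taken from the results above.

```python
# Path lengths U (in mm) reported for the sequence [i O].
BASELINE_U = 22.23  # mechanically unconstrained run, see (34)

CONSTRAINED_U = {
    "equal low levels (0.20 N)": 26.90,
    "equal high levels (0.77 N)": 29.65,
    "different levels (0.30 N and 0.60 N)": 34.70,
}


def relative_increase(value, reference):
    """Return the increase of `value` over `reference`, in percent."""
    return 100.0 * (value - reference) / reference


for label, u in CONSTRAINED_U.items():
    print(f"{label}: U = {u:.2f} mm, +{relative_increase(u, BASELINE_U):.1f}%")
```

With these numbers, the path in the internal space grows by roughly 21%, 33%, and 56%, respectively, while the formant deviations reported above stay below 10%.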
IV. CONCLUSION
We presented an optimum neural-network-based internal model for the control of speech robots controlled by the EPH. The internal model was designed according to the principle of the shortest path in the internal command space, under acoustical and mechanical constraints, which were meant to represent static acoustic and dynamic mechanical targets.
We first addressed the exact analytical solution, which is based on the calculus of variations. We proved that this solution cannot, in general, be a linear function, which casts doubt on the so-called linearized λ-model; the latter has received some interest in artificial-intelligence device modeling and supposes that the time transitions in the internal space can only be linear. Moreover, it was also shown that the linearized λ-model may have potential instability issues because of infinitely growing energy and power and that it may not be fully compatible with dynamic-specification target theory. We therefore suggest reconsidering the linearized λ-model.
Then, by using some empirical findings, we developed a first-approximation solution for the proposed optimum internal model. Experimental tests showed that this model is in accordance with the phenomenon of acousticophonetic anticipation (i.e., coarticulation) and that the degree of the latter is closely related to the chosen task-planning strategy. Moreover, it was also suggested that the anticipation itself may be due to the internal CNS strategy and not to the dynamics or biomechanics of the physical system (such as the human body), as was previously suggested in several works [32], [72].
As to the influence of the mean muscular tongue effort, which helps to account for lax and tense vowels, its strict prescription via mechanical constraints did not make the problem mathematically over-posed, and we always succeeded in obtaining optimum solutions. Thus, the model did not allow us to answer questions about the relationship between the forces and hypo/hyper-speech (see, e.g., [113]); namely, we could not affirm that a greater or smaller mean force level leads to a restriction or an enlargement of the formant zones. We could reach small levels as well as large ones. The strict prescription of the mean force level has a variable impact on the external formant space. In contrast, it strongly impacts the internal command space: the length of the traveled path increased, and the motor commands became quite different, even when the impact on the formant space was small. In other words, the robot, controlled by the proposed internal model, was able to compensate for a change in one of the external tasks, without significantly changing the quality of the other one, through a corresponding shift in the internal space, which was determined by the proposed optimum algorithm.
APPENDIX
ANN: CONSTRUCTION, LEARNING, AND VALIDATION
To construct the approximate model, we employed a two-layer ANN using radial basis neurons for the first layer (also called the hidden or radial layer) and linear neurons for the second one (also called the output layer) [5], [87]–[91]. The chosen radial basis transfer function is the Gaussian curve $e^{-(\cdot)^2}$. The choice of radial basis networks is motivated by the fact that they are usually considered among the best for most nonlinear approximations, thus giving a good tradeoff between the complexity and the precision of the network.
Based on the architecture of such a network [91], the formalization of the input–output relationships is not complicated. The output $a_s$ of the $s$th radial layer neuron is given by

\[
a_s = e^{-\left(\|\mathbf{w}_s - \boldsymbol{\lambda}\|\, b_1\right)^2}, \qquad s = 1, \ldots, S_1
\tag{35}
\]

where $\mathbf{w}_s \equiv (w_{1,s}, \ldots, w_{n,s})$ is the input weight vector of the $s$th radial neuron, $\boldsymbol{\lambda} \equiv (\lambda_1, \ldots, \lambda_n)$ is the input training vector, $b_1$ is the narrowness parameter of the Gaussian function of the $s$th hidden neuron, $S_1$ is the number of hidden layer neurons, and $\|\cdot\|$ denotes the Euclidean distance. The $l$th output of the linear layer $\alpha_l$ can be written as

\[
\alpha_l = \mathbf{v}_l \mathbf{a} + b_{2,l} = \sum_{s=1}^{S_1} v_{s,l}\, e^{-\left(\|\mathbf{w}_s - \boldsymbol{\lambda}\|\, b_1\right)^2} + b_{2,l}
\tag{36}
\]

for $l = 1, \ldots, S_2$, where $\mathbf{a}$ is the column vector $(a_1, \ldots, a_{S_1})^t$, $\mathbf{v}_l$ is the output weight row vector $(v_{1,l}, \ldots, v_{S_1,l})$, $S_2$ is the number of output linear neurons, and $b_{2,l}$ is the bias of the $l$th linear neuron (where "$t$" stands for transpose). The learning procedure consists of determining the weight vectors $\mathbf{w}_s$, $\mathbf{v}_l$, and the bias vector $\mathbf{b}_2 \equiv (b_{2,1}, \ldots, b_{2,S_2})$ by minimizing the sum of squared output errors (SSE) on the set of input–output data, while the parameter $b_1$ is often left to the user.
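The input–output relationships (35) and (36) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `rbf_forward`, the argument layout, and the toy weights in the usage lines are hypothetical, and a single narrowness parameter `b1` is shared by all hidden neurons, as the notation in (35) suggests. It is not the MATLAB implementation used in the paper.

```python
import math


def rbf_forward(lam, W, b1, V, b2):
    """Evaluate the two-layer radial basis network of (35)-(36) for one input.

    lam : input vector (lambda_1, ..., lambda_n)
    W   : S1 input weight vectors w_s, each of length n
    b1  : narrowness parameter of the Gaussian (assumed shared by all neurons)
    V   : S2 output weight vectors; V[l][s] plays the role of v_{s,l}
    b2  : S2 output biases b_{2,l}
    """
    # Hidden layer, eq. (35): a_s = exp(-(||w_s - lam|| * b1)^2)
    a = [math.exp(-((math.dist(w, lam) * b1) ** 2)) for w in W]
    # Linear output layer, eq. (36): alpha_l = v_l . a + b_{2,l}
    return [sum(v_s * a_s for v_s, a_s in zip(v, a)) + b for v, b in zip(V, b2)]


# Toy usage with hypothetical values: n = 2 inputs, S1 = 2 neurons, S2 = 1 output.
W = [[0.0, 0.0], [1.0, 0.0]]
V = [[1.0, 1.0]]
out = rbf_forward([0.0, 0.0], W, 1.0, V, [0.5])
print(out)
```

In this toy call, the first hidden neuron sits exactly on the input, so its activation is 1, while the second one is at unit distance and contributes $e^{-1}$; the output is their weighted sum plus the bias.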
The ANNs were implemented with the Neural Network Toolbox of MATLAB [91], by means of the newrb function. To train the network, we first generated 17 000 random motor commands $\boldsymbol{\lambda}$, distributed uniformly and occupying a six-orthotope in the internal space $\boldsymbol{\lambda}$. Then, these motor commands were applied to the BTM one after another (see Fig. 3). This produced 17 000 output formant vectors $\mathbf{F}$ (see Fig. 18) and output global mean force levels $\bar{F}$. For $\mathbf{F}$, only the first two formants are taken into account, i.e., $k = 2$ (the first two formants are, in general, sufficient to distinguish the vowel [86], [107]; see also footnote 18). Then, the 17 000 data were split into two parts: 6000 data were given to the ANNs in order to choose the 340 optimal ones for their construction (the hidden layer of the ANN is, therefore, composed of 340 radial layer neurons, i.e., $S_1 = 340$; for more details, see [91]), and the remaining 11 000 were used to test the accuracy of the obtained ANNs, which is about 1% on each output.

18 Although the values of $n$ and $k$ were fixed in the experiments, all the formulas are written for arbitrary values of $n$ and $k$, in order to keep the generality of the model.

Fig. 18. Distribution of the 17 000 output vectors $\mathbf{F}$, i.e., first two formants $F_1$ and $F_2$.
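The data preparation described above can be sketched as follows, again in plain Python and only as an illustration. The orthotope bounds, the seed, and the helper names `sample_orthotope`, `split_data`, and `mean_relative_error` are assumptions introduced here; the BTM itself is not reproduced, and the per-output mean relative error is just one plausible reading of the reported accuracy of about 1% on each output.

```python
import random


def sample_orthotope(bounds, count, rng):
    """Draw `count` points uniformly from the box given by (low, high) pairs."""
    return [[rng.uniform(low, high) for low, high in bounds] for _ in range(count)]


def split_data(data, n_train):
    """Split the data into a construction part and a test part."""
    return data[:n_train], data[n_train:]


def mean_relative_error(predicted, target):
    """Per-output mean of |predicted - target| / |target| (an assumed metric)."""
    n_outputs = len(target[0])
    return [
        sum(abs(p[k] - t[k]) / abs(t[k]) for p, t in zip(predicted, target))
        / len(target)
        for k in range(n_outputs)
    ]


rng = random.Random(0)                # fixed seed, for reproducibility only
bounds = [(0.0, 1.0)] * 6             # hypothetical bounds of the six-orthotope
commands = sample_orthotope(bounds, 17000, rng)
train_set, test_set = split_data(commands, 6000)
print(len(train_set), len(test_set))  # sizes of the two parts
```

In an actual experiment, each sampled command would be passed through the biomechanical model to obtain the formants and the mean force level, and `mean_relative_error` would then be evaluated on the 11 000 held-out pairs.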
REFERENCES
[1] M. L. Latash, Neurophysiological Basis of Movement. Champaign, IL:
Hum. Kinet., 1998.
[2] M. L. Latash and F. Lestienne, Motor Control and Learning. New
York: Springer Sci.–Bus. Media, 2006.
[3] R. A. Schmidt and T. D. Lee, Motor Control and Learning, 4th ed.
Champaign, IL: Hum. Kinet., 2005.
[4] P. Cordo and S. R. Harnad, Movement Control. Cambridge, U.K.:
Cambridge Univ., 1994.
[5] M. A. Arbib, The Handbook of Brain Theory and Neural Networks, 2nd
ed. Cambridge, MA: MIT, 2003.
[6] P. M. Fitts, “The information capacity of the human motor system in
controlling the amplitude of movement,” J. Exp. Psychol., vol. 47, no. 6,
pp. 381–391, 1954.
[7] P. M. Fitts and J. R. Peterson, “Information capacity of discrete motor
responses,” J. Exp. Psychol., vol. 67, no. 2, pp. 103–112, 1964.
[8] P. M. Fitts and B. K. Radford, “Information capacity of discrete motor
responses under different cognitive sets,” J. Exp. Psychol., vol. 71,
pp. 475–482, 1966.
[9] N. Hogan, “An organizing principle for a class of voluntary movements,”
J. Neurosci., vol. 4, no. 11, pp. 2745–2764, 1984.
[10] N. Hogan, “Moving gracefully: Quantitative theories of motor coordina-
tion,” Trends Neurosci., vol. 10, pp. 170–174, 1985.
[11] W. L. Nelson, “Physical principles for economies of skilled movements,”
Biol. Cybern., vol. 46, pp. 135–147, 1983.
[12] M. Dornay, M. K. Y. Uno, and R. Suzuki, “Minimum muscle–tension
change trajectories predicted by using a 17-muscle model of the monkey’s
arm,” J. Motor Behav., vol. 28, no. 2, pp. 83–100, 1996.
[13] Y. Uno, M. Kawato, and R. Suzuki, “Formation and control of optimal
trajectory in human multijoint arm movement,” Biol. Cybern., vol. 61,
no. 2, pp. 89–101, 1989.
[14] H. Hatze and J. D. Buys, “Energy-optimal controls in the mammalian
neuromuscular system,” Biol. Cybern., vol. 27, pp. 9–20, 1977.
[15] P. Zukofsky, “Arm movements in skilled violin playing,” presented at
the 22nd Annu. Meeting Psychon. Soc., Philadelphia, PA, 1981.
[16] D. G. Asatryan and A. G. Feldman, “Functional tuning of the nervous
system with control of movement or maintenance of a steady posture.
I: Mechanographic analysis of the work of the joint or execution of a
postural task,” Biophys. J., vol. 10, pp. 925–935, 1965.
[17] A. G. Feldman, “Functional tuning of the nervous system with control of
movement or maintenance of a steady posture. II: Controllable parame-
ters of the muscles,” Biophys. J., vol. 11, pp. 565–578, 1966.
[18] A. G. Feldman, “Functional tuning of the nervous system with control
of movement or maintenance of a steady posture. III: Mechanographic
analysis of execution by man of the simplest of motor tasks,” Biophys.
J., vol. 11, pp. 766–775, 1966.
[19] A. G. Feldman and G. N. Orlovsky, “The influence of different descending
systems on the tonic stretch reflex in the cat,” Exp. Neurol., vol. 37,
pp. 481–494, 1972.
[20] A. G. Feldman, “Once more on the equilibrium point hypothesis (λ
model) for motor control,” J. Motor Behav., vol. 18, pp. 17–54, 1986.
[21] A. G. Feldman, “Change of muscle length as a consequence of a shift in
an equilibrium of muscle load system,” Biophys., vol. 19, pp. 544–548,
1974.
[22] A. G. Feldman, “Control of the length of a muscle,” Biophys., vol. 19,
pp. 766–771, 1974.
[23] E. Bizzi, N. Hogan, F. A. Mussa-Ivaldi, and S. F. Giszter, “Does the
nervous system use equilibrium-point control to guide single and multiple
joint movements?” Behav. Brain Sci., vol. 15, pp. 603–613, 1992.
[24] X. Gu and D. H. Ballard, “An equilibrium point based model
unifying movement control in humanoids,” in Proc. Robot. Sci. Syst.,
2006, pp. 1–7.
[25] E. Bizzi, N. Hogan, F. A. Mussa-Ivaldi, and S. F. Giszter, “The
equilibrium-point framework: A point of departure,” Behav. Brain Sci.,
vol. 15, pp. 808–815, 1992.
[26] X. Gu and D. H. Ballard, “Robot movement planning and control based
on equilibrium point hypothesis,” in Proc. IEEE Conf. Robot., Autom.
Mechatron., 2006, pp. 1–6.
[27] S. A. Migliore and S. DeWeerth, “Control of robotic joints using prin-
ciples from the equilibrium point hypothesis of animal motor control,”
M.S. thesis, Georgia Inst. Technol., Atlanta, GA, 2004.
[28] M. Kawato, “Internal models for motor control and trajectory planning,”
Curr. Opin. Neurobiol., vol. 9, pp. 718–727, 1999.
[29] J. R. Flanagan and A. M. Wing, “The role of internal models in mo-
tion planning and control: evidence from grip force adjustments during
movements of hand-held loads,” J. Neurosci., vol. 17, pp. 1519–1528,
1997.
[30] M. Kawato, M. Isobe, Y. Maeda, and R. Suzuki, “Coordinates trans-
formation and learning control for visually-guided voluntary movement
with iteration: A Newton-like method in a function space,” Biol. Cybern.,
vol. 59, no. 3, pp. 161–177, 1988.
[31] J. D. Cooke, “The organization of simple skilled movements,” in Ad-
vances in Psychology: Tutorials in Motor Behavior, G. E. Stelmach and
J. Requin, Eds. Amsterdam, The Netherlands: North-Holland, 1980.
[32] P. L. Gribble, D. J. Ostry, V. Sanguineti, and R. Laboissière, “Are complex
control signals required for human arm movement?” J. Neurophysiol.,
vol. 79, no. 3, pp. 1409–1424, 1998.
[33] V. Sanguineti, R. Laboissière, and D. J. Ostry, “A dynamic biomechanical
model for neural control of speech production,” J. Acoust. Soc. Amer.,
vol. 103, no. 3, pp. 1615–1627, 1998.
[34] J. R. Flanagan, D. J. Ostry, and A. G. Feldman, “Control of human
jaw and multi-joint arm movements,” in Cerebral Control of Speech
and Limb Movements, G. Hammond, Ed. New York: Springer-Verlag,
1990, pp. 29–58.
[35] D. J. Ostry, J. R. Flanagan, A. G. Feldman, and K. G. Munhall, “Human
jaw movement kinematics and control,” in Tutorials in Motor Behavior
II, G. E. Stelmach and J. Requin, Eds. Amsterdam, The Netherlands:
Elsevier, 1992, pp. 646–660.
[36] “Special issue on the equilibrium point hypothesis and its applications
in speech,” Bull. Commun. Parlée, vol. 4, pp. 5–110, 1998.
[37] R. Laboissière, D. J. Ostry, and A. G. Feldman, “The control of
multimuscle systems: Human jaw and hyoid movements,” Biol. Cybern.,
vol. 74, no. 3, pp. 373–384, 1996.
[38] R. Wilhelms-Tricarico, “Physiological modeling of speech production:
Methods for modeling soft-tissue articulators,” J. Acoust. Soc. Amer.,
vol. 97, no. 5, pp. 3085–3098, 1995.
[39] R. Wilhelms-Tricarico and C.-M. Wu, “A biomechanical model of the
tongue,” in Proc. Bioeng. Conf., vol. 35, K. B. Chandran, R. Vanderby,
Jr., and M. S. Hefzy, Eds. New York: ASME, 1997, pp. 69–70.
[40] P. Buscemi, M. Carlson, and R. W. Tricarico, “A computational approach
to muscle modeling of the human tongue via the finite element method
along with motion control correlations with MRI tracking data for simple
speech patterns,” J. Med. Devices, vol. 2, no. 2, p. 027548, 2008.
[41] H. Ranc, C. Servais, P.-F. Chauvy, S. Debaud, and S. Mischler,
“Effect of surface structure on frictional behaviour of a tongue/palate
tribological system,” Tribol. Int., vol. 39, no. 12, pp. 1518–1526, 2006.
[42] W. S. Levine, C. E. Torcaso, and M. Stone, “Controlling the shape of a
muscular hydrostat: A tongue or tentacle,” Lect. Notes Control Inf. Sci.,
vol. 321, pp. 20–222, 2005.
[43] W. H. Lewis, Ed., Gray’s Anatomy of the Human Body, 20th U.S. ed.
Philadelphia, PA: Lea & Febiger, 1918.
[44] O. C. Zienkiewicz and R. L. Taylor, The Finite Element Method. Basic
Formulation and Linear Problems. New York: McGraw-Hill, 1989.
[45] P. E. Rubin, T. Baer, and P. Mermelstein, “An articulatory synthesizer
for perceptual research,” J. Acoust. Soc. Amer., vol. 70, pp. 321–328,
1981.
[46] P. E. Rubin, E. Saltzman, L. Goldstein, R. McGowan, M. Tiede, and
C. Browman, “CASY and extensions to the task-dynamic model,” in
Proc. 1st ESCA Tutorial Res. Workshop Speech Producing Model., 4th
Speech Prod. Semin., 1996, pp. 125–128.
[47] J. S. Perkell and K. N. Stevens, “A physiologically-oriented model of
tongue activity in speech production,” Ph.D. dissertation, Speech Com-
mun. Group, Dept. Electr. Eng., Mass. Inst. Technol., Cambridge, MA,
1974.
[48] S. Kiritani, K. Miyawaki, O. Fujimura, and J. E. Miller, “Computational
model of the tongue,” J. Acoust. Soc. Amer., vol. 57, no. S1, pp. S3–S3,
1975.
[49] Y. Kakita and O. Fujimura, “Computational model of the tongue: A
revised version,” J. Acoust. Soc. Amer., vol. 62, no. S1, pp. S15–S16,
1977.
[50] M. M. Sondhi and J. Schroeter, “A nonlinear articulatory speech synthesizer
using both time- and frequency-domain elements,” in Proc. ICASSP,
Tokyo, Japan, 1986, pp. 1999–2002.
[51] M. M. Sondhi and J. Schroeter, “A hybrid time-frequency domain articu-
latory speech synthesizer,” IEEE Trans. Acoust. Speech Signal Process.,
vol. ASSP-35, no. 7, pp. 955–967, Jul. 1986.
[52] S. Maeda, “An articulatory model of the tongue based on a statistical
analysis,” J. Acoust. Soc. Amer., vol. 65, no. s1, p. S22, 1988.
[53] S. Maeda, “Improved articulatory model,” J. Acoust. Soc. Amer., vol. 84,
no. s1, p. S146, 1988.
[54] F. Vogt, “Finite element modeling of the tongue,” in Proc. Int. Workshop
Audio Vis. Speech Process., 2005, pp. 143–144.
[55] O. Engwall, “A 3D tongue model based on MRI data,” in Proc. 6th Int.
Conf. Spoken Lang. Process., Beijing, China, 2000, pp. 901–904.
[56] K. van den Doel, F. Vogt, R. E. English, and S. Fels, “Towards articulatory
speech synthesis with a dynamic 3D finite element tongue model,” in
Proc. ISSP, 2006, pp. 59–66.
[57] [Online]. Available: http://www.takanishi.mech.waseda.ac.jp/top/
research/voice/index.htm
[58] K. Nishikawa, K. Asama, K. Hayashi, H. Takanobu, and A. Takanishi,
“Development of a talking robot,” in Proc. IEEE/RSJ Int. Conf. Intell.
Robots Syst., 2000, pp. 1760–1765.
[59] K. Fukui, K. Nishikawa, S. Ikeo, E. Shintaku, K. Takada, H. Takanobu,
M. Honda, and A. Takanishi, “Development of a talking robot with
vocal cords and lips having human-like biological structures,” in Proc.
IEEE/RSJ Int. Conf. Intell. Robots Syst., 2005, pp. 2023–2028.
[60] P. L. M. de Maupertuis, “Accord de différentes lois de la nature qui avaient
jusqu’ici paru incompatibles (Eng. trans.: “Accord between different laws
of nature that seemed incompatible”),” presented at the Mémoires l’Acad.
Sci. Paris, Paris, France, 1744, p. 417.
[61] P. L. M. de Maupertuis, “Les lois du mouvement et du repos, déduites
d’un principe de métaphysique (Eng. trans.: “Derivation of the laws of
motion and equilibrium from a metaphysical principle”),” presented at
the Mémoires l’Acad. Sci. Paris, Berlin, Germany, 1746, p. 267.
[62] J.-L. Lagrange, Œuvres de Lagrange. Tome Onzième: Mécanique Analytique,
Tome Premier, Quatrième éd. Paris, France: Gauthier-Villars et
fils, imprimeurs-libraires, 1888.
[63] W. R. Hamilton, “On a general method in dynamics,” Philos. Trans. R.
Soc., 1834, pp. 247–308.
[64] W. R. Hamilton, “On a general method in dynamics,” Philos. Trans. R.
Soc., 1835, pp. 95–144.
[65] H. Goldstein, Classical Mechanics. Reading, MA: Addison-Wesley,
1953.
[66] V. I. Arnold, Mathematical Methods of Classical Mechanics, 2nd ed.
Berlin, Germany: Springer-Verlag, 1989.
[67] L. D. Landau and E. M. Lifshitz, Course of Theoretical Physics. Vol. I:
Mechanics, 3rd ed. Oxford, U.K.: Elsevier, 2003.
[68] L. N. Hand and J. D. Finch, Analytical Mechanics. Cambridge, U.K.:
Cambridge Univ., 1998.
[69] C. Lanczos, The Variational Principles of Mechanics. Toronto, ON,
Canada: Univ. Toronto, 1970.
[70] D. J. Ostry and K. G. Munhall, “Control of jaw orientation and posi-
tion in mastication and speech,” J. Neurosci., vol. 71, pp. 1528–1545,
1994.
[71] P. L. Gribble, R. Laboissière, and D. J. Ostry, “Control of human
arm and jaw motion: issues related to musculo-skeletal geometry,” in
Self-Organization, Computational Maps and Motor Control, vol. 118.
Amsterdam, The Netherlands: North-Holland/Elsevier, 1997, pp. 483–506.
[72] D. J. Ostry, P. L. Gribble, and V. L. Gracco, “Coarticulation of jaw move-
ments in speech production: is context sensitivity in speech kinematics
centrally planned?” J. Neurosci., vol. 16, pp. 1570–1579, 1996.
[73] A. G. Feldman, S. V. Adamovich, D. J. Ostry, and J. R. Flanagan, “The
origin of electromyograms—Explanations based on the equilibrium point
hypothesis,” in Multiple Muscle Systems: Biomechanics and Movement
Organization, J. Winters and S. Woo, Eds. Berlin, Germany: Springer-
Verlag, 1990, pp. 195–213.
[74] M. R. Schroeder, “Determination of the geometry of the human vocal
tract by acoustic measurements,” J. Acoust. Soc. Amer., vol. 41, no. 4B,
pp. 1002–1010, 1967.
[75] S. Hiroya and M. Honda, “Estimation of articulatory movements from
speech acoustics using an hmm-based speech production model,” IEEE
Trans. Speech Audio Process., vol. 12, no. 2, pp. 175–185, Mar.
2004.
[76] R. Marret, “Apprentissage des relations entre commandes musculaires et
géométrie de la langue,” Master’s thesis, Inst. Nat. Polytech. de Grenoble,
Grenoble, France, 2002.
[77] P. Perrier, L. Ma, and Y. Payan, “Modeling the production of VCV
sequences via the inversion of a biomechanical model of the tongue,”
in Proc. 9th Eur. Conf. Speech Commun. Technol., 2005, pp. 1041–
1044.
[78] F. H. Guenther, M. Hampson, and D. Johnson, “A theoretical investiga-
tion of reference frames for the planning of speech movements,” Psychol.
Rev., vol. 105, pp. 611–633, 1998.
[79] D. Kewley-Port and S. S. Goodman, “Thresholds for second formant
transitions in front vowels,” J. Acoust. Soc. Amer., vol. 118, no. 5,
pp. 3252–3260, 2005.
[80] J. Hillenbrand, M. J. Clark, and T. M. Nearey, “Effects of consonant
environment on vowel formant patterns,” J. Acoust. Soc. Amer., vol. 109,
no. 2, pp. 748–763, 2001.
[81] T. L. Gottfried and W. Strange, “Identification of coarticulated vowels,”
J. Acoust. Soc. Amer., vol. 68, no. 6, pp. 1626–1635, 1980.
[82] W. Strange, J. J. Jenkins, and T. L. Johnson, “Dynamic specification of
coarticulated vowels,” J. Acoust. Soc. Amer., vol. 74, no. 3, pp. 695–705,
1983.
[83] J. E. Andruski and T. M. Nearey, “On the sufficiency of compound target
specification of isolated vowels and vowels in /bvb/ syllables,” J. Acoust.
Soc. Amer., vol. 91, pp. 390–410, 1992.
[84] H. Dudley and T. H. Tarnoczy, “The calculation of vowel resonances,
and an electrical vocal tract,” J. Acoust. Soc. Amer., vol. 22, no. 6,
pp. 740–753, 1950.
[85] G. Fant, Acoustic Theory of Speech Production. Hague, The
Netherlands: Mouton, 1960.
[86] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals.
Englewood Cliffs, NJ: Prentice-Hall, 1978.
[87] Y. H. Hu and J. N. Hwang, Eds., Handbook of Neural Network Signal
Processing. Boca Raton, FL: CRC, 2002.
[88] L. Jain, Recent Advances in Artificial Neural Networks. Design and
Applications (Int. Ser. Comput. Intell.), A. M. Fanelli, Ed. Boca Raton,
FL: CRC, 2000.
[89] S. Chen, C. Cowan, and P. Grant, “Orthogonal least squares learning
algorithm for radial basis function networks,” IEEE Trans. Neural Netw.,
vol. 2, no. 2, pp. 302–309, Mar. 1991.
[90] T. Poggio and F. Girosi, “Networks for approximation and learning,”
Proc. IEEE, vol. 78, no. 9, pp. 1481–1497, Sep. 1990.
[91] H. Demuth, M. Beale, and M. Hagan, Neural Network Toolbox for Use
With MATLAB (Version 5). Natick, MA: The MathWorks, 2006.
[92] M. Kawato, K. Furukawa, and R. Suzuki, “A hierarchical neural-network
model for control and learning of voluntary movement,” Biol. Cybern.,
vol. 57, no. 3, pp. 169–185, 1987.
[93] F. H. Guenther, “A neural network model of speech acquisition and motor
equivalent speech production,” Biol. Cybern., no. 72, pp. 43–53, 1994.
[94] T. D. Sanger, “Neural network learning control of robot manipulators
using gradually increasing task difficulty,” IEEE Trans. Robot. Autom.,
vol. 10, no. 3, pp. 323–333, Jun. 1994.
[95] F. H. Guenther, “Speech sound acquisition, coarticulation, and rate ef-
fects in a neural network model of speech production,” Psychol. Rev.,
vol. 102, pp. 594–621, 1995.
[96] F. H. Guenther and J. W. Bohland, “Learning sound categories: A neural
model and supporting experiments. Acoustical science and technology,”
Acoust. Sci. Technol., vol. 23, no. 4, pp. 213–220, 2002.
[97] M. G. Rahim, C. C. Goodyear, W. B. Kleijn, J. Schroeter, and
M. M. Sondhi, “On the use of neural networks in articulatory speech
synthesis,” J. Acoust. Soc. Amer., vol. 93, no. 2, pp. 1109–1121, 1993.
[98] L. Eulero. (1744). Methodus inveniendi lineas curvas maximi minimive
proprietate gaudentes, sive solutio problematis isoperimetrici latissimo
sensu accepti. Lausannæ/Genevæ, Switzerland: Apud Marcum-Michaelem
Bousquet/Socios [Online]. Available:
http://math.dartmouth.edu/euler/pages/E065.html
[99] M. J. Forray, Variational Calculus in Science and Engineering. New
York: McGraw-Hill, 1967.
[100] R. Weinstock, Calculus of Variations With Applications to Physics and
Engineering. New York: McGraw-Hill, 1952.
[101] G. A. Bliss, Lectures on the Calculus of Variations. Chicago, IL: Univ.
Chicago, 1947.
[102] V. I. Smirnov, A Course of Higher Mathematics, vol. I–V, Oxford, U.K.:
Pergamon, 1964.
[103] R. Courant and D. Hilbert, Methods of Mathematical Physics, vol. I. New
York: Interscience, 1966.
[104] G. A. Korn and T. M. Korn, Mathematical Handbook for Scientists
and Engineers. Definitions, Theorems, and Formulas for Reference and
Review, 2nd ed. enlarged and revised ed. New York: McGraw-Hill,
1968.
[105] I. N. Bronshtein and K. A. Semendyayev, Handbook of Mathematics,
3rd ed. Berlin, Germany: Springer-Verlag, 1998.
[106] Calliope, La Parole et Son Traitement Automatique. Paris, France:
Dunod, 1989.
[107] G. E. Peterson and H. L. Barney, “Control methods used in a study of
the vowels,” J. Acoust. Soc. Amer., vol. 24, no. 2, pp. 175–184, 1952.
[108] K. N. Stevens and A. S. House, “Perturbation of vowel articulations by
consonantal context,” J. Speech Hear. Res., vol. 6, pp. 111–128, 1963.
[109] M. J. Macchi, “Identification of vowels spoken in isolation versus vowels
spoken in consonantal context,” J. Acoust. Soc. Amer., vol. 68, no. 6,
pp. 1636–1642, 1980.
[110] T. M. Nearey, “Static, dynamic, and relational properties in vowel per-
ception,” J. Acoust. Soc. Amer., vol. 85, no. 5, pp. 2088–2113, 1989.
[111] J. Talley, “Vowel perception in varied symmetric CVC contexts,” J.
Acoust. Soc. Amer., vol. 108, no. 5, pp. 2601–2601, 2000.
[112] J. Hillenbrand, L. A. Getty, K. Wheeler, and M. J. Clark, “Acoustic
characteristics of American English vowels,” J. Acoust. Soc. Amer., vol. 95,
no. 5, pp. 2875–2875, 1995.
[113] B. Lindblom, “Explaining phonetic variation: a sketch of the H&H the-
ory,” in Speech Production and Speech Modelling, W.J. Hardcastle and
A. Marchal, Eds. Dordrecht, The Netherlands: Kluwer, 1990, pp. 403–
439.
Iaroslav V. Blagouchine was born in St. Petersburg, Russia, on December 22, 1979. He received the B.S. degree in physics from the St. Petersburg State University in 2000, the M.S. degree in electronic engineering from the Grenoble Institute of Technology, and the Ph.D. degree in signal processing and applied mathematics from the École Centrale, France, in 2001 and 2009, respectively.
From 2001 to 2002, he was with the Department Señales, Sistemas y Radiocomunicaciones, Universidad Politécnica de Madrid, Madrid, Spain. During 2003, he was a Research Engineer with the Grenoble Institute of Technology, Grenoble, France, where he was also a Teaching Assistant from 2004 to 2007. From 2007 to 2009, he was a Postdoctoral Researcher and a Teaching Assistant with the Telecommunication Department, University of Toulon, Toulon, France. Since September 2009, he has been a Research Engineer with the Department of Mobile Communication, Eurécom, Sophia Antipolis, France. His current research interests include biologically inspired robotics (especially equilibrium-point-hypothesis-based), speech robotics, constraint optimization techniques, variational calculus, and statistical signal processing.
Eric Moreau (M’96–SM’08) was born in Lille, France. He graduated from the Ecole Nationale Supérieure des Arts et Métiers, Paris, France, in 1989. He received the Agrégation de Physique degree from the Ecole Normale Supérieure de Cachan, Cachan Cedex, France, in 1990 and the DEA and Ph.D. degrees in signal processing from the Université Paris-Sud in 1991 and 1995, respectively.
From 1995 to 2001, he was an Assistant Professor with the Department of Telecommunications, Institute of Engineering Sciences of Toulon-Var-School of Engineering, University of Toulon, Toulon, France, where he is currently a Professor. His current research interests include constraint optimization, neural networks applications, and statistical signal processing.
... How the CNS can adapt to the age-related changes and predict the appropriate motor commands to stabilize the body, has been a challenge for postural control research the latest years [3], [9]. The conceptual hypothesis of the CNS, which is called the Internal Model (IM) has received interest in the recent years with characteristic works to be in [9], [10], [11]. This theoretical model predicts the motor commands from the sensory inputs integration, during the changes in the environment and tries to address the issues of how the brain adapts to the body or to the environmental changes, as well as, how the brain predicts the motor commands with delayed and noisy sensory input signals. ...
... In the last decades, with the achieved technological progress in computers and sensors, many researchers became more interested in collecting human data and trying to implement and validate the internal models by System Identification and Machine Learning techniques. Due to the cause and effect of the relationship of the input-output signals, these approaches can infer more practical models that can be validated based on further experimentation [17] and used in human-inspired robots like the one in [10]. For instance, in [18] the authors presented a system identification technique for the control strategies of humans in different situations, which was inspired by the IM hypothesis. ...
Conference Paper
Full-text available
The second most common cause of injury in the elderly population is falling. In an effort to understand the mechanism behind the reduced ability to maintain balance in any posture or activity, we study the performance of the central nervous system as a controller of the body, while maintaining the balance in some postures or activities. Towards this direction, forty-five subjects aged over 70 were tested in different trials of quiet stance: a) hard stable surface with open eyes, b) stable surface with closed eyes, c) soft unstable surface with open eyes, and d) unstable surface, while eyes were closed. In the sequel, the body kinematics were described by legs and trunk segment angles in the sagittal plane, while the muscle activations were described by a weighted sum of rectified EMG signals from tibialis anterior and gastrocnemius muscles of left and right legs. Using the neuro-science hypothesis and adaptive control theory, a completely novel model was identified for the CNS based on the feedback internal model. The proposed model is able to predict the output commands, based on a recurrent neural network, while the efficiency of the proposed scheme has been proven based on multiple experimental results, showing that the model can sufficiently predict the muscle activity based on the optimum sensory inputs.
... Fuzzy systems and neural networks have strong nonlinear approximation ability and have been applied as powerful tools for solving complex nonlinear systems. Scholars have applied fuzzy control, neural network control, and similar methods to robots, introducing intelligent robot control methods [19][20][21]. In practice, however, the structures of neural network control and fuzzy control are relatively complex, the weights and structure of the network are difficult to determine, and the rules of fuzzy control still have no unified form. ...
... However, as recent articles on this topic show [14], [15], the latest developments in computing and sensor technology, system identification, and machine learning are making the corresponding algorithms more attractive. In [16], the authors used deep reinforcement learning as a controller for a humanoid robot; in [17], a neural network based on the internal model was used to control robot motion planning; and in [1], the authors simulated a reinforcement learning algorithm to stabilize an inverted pendulum. ...
Conference Paper
The human body is mechanically unstable, and the brain, as the main controller, is responsible for maintaining our balance. However, the mechanisms by which the brain achieves balance are still an open research question. In this article, we therefore propose a novel modeling architecture for replicating and understanding the fundamental mechanisms that generate balance in humans. To this end, a nonlinear Recurrent Neural Network (RNN) is proposed that can predict, with high accuracy, the performance of the Central Nervous System (CNS) in stabilizing the human body. The network is trained on multiple sets of collected human balancing data using system identification techniques. One fundamental contribution of the article is that the obtained network for the balancing mechanisms is experimentally evaluated on a single-link inverted pendulum, which replicates the basic model of human balance, and can be directly extended to humanoids and balancing exoskeletons.
Article
Industrial Cloud Robotics is an amalgamation of cloud computing and industrial robots that establishes a service-oriented manufacturing system. Teleoperation of an industrial robot, that is, the ability to control a robot from a remote location, is facilitated through this system. In this paper, a framework is proposed that uses speech recognition and industrial cloud robots for applications in sustainable manufacturing. We have used Google’s speech recognition services to control the robot manipulator. We have employed the Android Speech API in a custom Android application that receives the speech signal, transcribes it, and forwards it to a server. The application can be viewed or accessed by any host computer, which primarily serves as a user. The monitoring unit and the data are fetched by the Robot Speech Interface unit via the internet and web sockets. The interface triggers the required action of the robot through relay board actuation of the robot's digital input facility. This work shows that, despite disturbances and noise interference, speed and reliability are not compromised.
Article
A number of internal model concepts are now widespread in neuroscience and cognitive science. These concepts are supported by behavioral, neurophysiological, and imaging data; furthermore, these models have had their structures and functions revealed by such data. In particular, a specific theory on inverse dynamics model learning is directly supported by unit recordings from cerebellar Purkinje cells. Multiple paired forward-inverse models describing how diverse objects and environments can be controlled and learned separately have recently been proposed. The 'minimum variance model' is another major recent advance in the computational theory of motor control. This model integrates two furiously disputed approaches to trajectory planning, strongly suggesting that both kinematic and dynamic internal models are utilized in movement planning and control.
Chapter
Movement is arguably the most fundamental and important function of the nervous system. Purposive movement requires the coordination of actions within many areas of the cerebral cortex, cerebellum, basal ganglia, spinal cord, and peripheral nerves and sensory receptors, which together must control a highly complex biomechanical apparatus made up of the skeleton and muscles. Beginning at the level of biomechanics and spinal reflexes and proceeding upward to brain structures in the cerebellum, brainstem and cerebral cortex, the chapters in this book highlight the important issues in movement control. Commentaries provide a balanced treatment of the articles that have been written by experts in a variety of areas concerned with movement, including behaviour, physiology, robotics, and mathematics.
Book
Motor Control and Learning focuses on the effects of development, aging, and practice on the control of human voluntary movement. These issues have been at the center of attention of the motor control community, but no book until now has addressed all of these issues under one cover in the context of contemporary views on the control of human voluntary movement. This book emphasizes the links between progress in basic motor control research and applied areas such as motor disorders and motor rehabilitation. Contributors are established scientists in the areas of both theoretical/experimental motor control and its applications. The chapters focus more on large, general issues than on their particular research. As a result, Motor Control and Learning is relevant to both professionals in the areas of motor control, movement disorders, and motor rehabilitation, and to students who are starting their careers in one of these actively developed areas. Dr. Mark L. Latash is Professor of Kinesiology at the Pennsylvania State University. Dr. Francis G. Lestienne is Professor and Director of the Center for Science and Technology in Physical Activity and Sports at the Université de Caen Basse-Normandie, France.
Chapter
The acoustic characteristics of any speech sound are determined by the whole complex of the movement and configurations of the speech production process. We have seen that some aspects of speech production have a fairly predictable effect on the acoustic speech signal. For example, periodicity in the acoustic waveform is the acoustic consequence of vocal fold vibration that characterises voiced sounds, while a nearly random fluctuation in air pressure variation results from a turbulent airstream in the production of most voiceless sounds.