How to run a world record? A Reinforcement Learning approach
Sajad Shahsavari, Eero Immonen
Computational Engineering and Analysis (COMEA)
Turku University of Applied Sciences
20520 Turku, Finland
Email: sajad.shahsavari@turkuamk.fi
Masoomeh Karami, Hashem Haghbayan,
and Juha Plosila
Department of Computing
University of Turku (UTU)
20500 Turku, Finland
KEYWORDS
Optimal Control, Optimal Pacing Profile, Reinforcement Learning, Competitive Running Race
ABSTRACT
Finding the optimal distribution of exerted effort by an
athlete in competitive sports has been widely investigated
in the fields of sport science, applied mathematics and op-
timal control. In this article, we propose a reinforcement
learning-based solution to the optimal control problem in
the running race application. Well-known mathematical
model of Keller is used for numerically simulating the
dynamics in runner’s energy storage and motion. A feed-
forward neural network is employed as the probabilistic
controller model in continuous action space which trans-
forms the current state (position, velocity and available
energy) of the runner to the predicted optimal propulsive
force that the runner should apply in the next time step.
A logarithmic barrier reward function is designed to
evaluate performance of simulated races as a continuous
smooth function of runner’s position and time. The neural
network parameters, then, are identified by maximizing
the expected reward using on-policy actor-critic policy-
gradient RL algorithm. We trained the controller model
for three race lengths: 400, 1500 and 10000 meters and
found the force and velocity profiles that produce a
near-optimal solution for the runner’s problem. Results
conform with Keller’s theoretical findings with relative
percent error of 0.59% and are comparable to real world
records with relative percent error of 2.38%, while the
same error for Keller’s findings is 2.82%.
I INTRODUCTION
I-A Background and motivation
A competitive runner attempts to optimize their pac-
ing to minimize the time spent in covering a given dis-
tance. This optimal pacing (or velocity) profile is always a
trade-off between moving faster and saving energy during
the race, and is specific to the individual’s physique
characteristics. The optimal pacing problem has been
widely studied in sports science (e.g. Abbiss and Laursen
(2008); Casado, Hanley, Jiménez-Reyes, and Renfree
(2021); Jones, Vanhatalo, Burnley, Morton, and Poole
(2010)) as well as mathematical modeling (e.g. Alvarez-
Ramirez (2002)) and optimal control theory (e.g. Keller
(1974); Reardon (2013); Woodside (1991)). Through
technological advances in high-performance computing
as well as in wearable devices, which today are capable
of predicting the running power on-line, the problem
has also been addressed by computational optimization
(e.g. Aftalion (2017); Maroński and Rogowski (2011)).
Finding the optimal running power profile that yields
minimum race time is carried out by solving a dynam-
ical optimization problem, subject to constraints from
human metabolism and race conditions. Although the-
oretical solutions for this optimal control problem are
proposed in prior research, they fail on more complicated,
higher order mathematical models and constraints. On
the other hand, numerical approaches, such as nonlinear
programming in the so-called direct methods, suffer from
local convergence, i.e., tending to converge to solutions
close to the supplied first guesses, thus, requiring a good
initial guess for the solution. This article addresses these issues by using a probabilistic controller model trained by reinforcement learning (RL).
Optimizing the pacing strategy for a given race is a
well-known open problem (Aftalion & Trélat, 2021). This
problem is non-trivial because of: (1) the complexity of
the human body metabolism (processes of aerobic and
anaerobic energy production/consumption), (2) individ-
ual variations in the human physique, (3) mathematical
modeling of human metabolism considering individual
variations, and (4) resolving the corresponding optimal
control problem based on the developed mathematical
model.
The study of mathematical models of human body
metabolism tries to specify the differential equations that
describe the energy (or oxygen) variation in one’s body
based on the recovery rate and amount of applied work.
Typically these models include a number of physiological
parameters that characterize the athlete’s body and need
to be identified, individually, based on the experimental
data on the athlete's performance. Among these parameters, the oxygen uptake rate ($\dot{V}O_2$, the rate of oxygen recovery or energy production during exercise) and the critical power (CP, the highest possible long-lasting rate of energy consumption) are the most significant ones. Besides running races,
mathematical modeling for kinetic and metabolic variabil-
ity has been used in a wide variety of other competitive
sports such as road cycling (Wolf, 2019), swimming
(McGibbon, Pyne, Shephard, & Thompson, 2018), horse
racing (Mercier & Aftalion, 2020), wheelchair athletics
(Cooper, 1990), etc.
The challenge of resolving an optimal control profile is not restricted to the sport sciences. Generally, optimal
control of dynamical systems governed by differential
equations, as a fundamental problem in mathematical
optimization, has numerous applications in scientific and
engineering research. Such practical applications include, for example, space shuttle reentry trajectory optimization (Rahimi, Dev Kumar, & Alighanbari, 2013), unemployment minimization in economics and management (Kamien & Schwartz, 2012), and robotic resource management and task allocation (Elamvazhuthi & Berman, 2015), all of which are ultimately reduced to the problem of
finding a sequence of decisions (control variables) over
a period of time which optimizes an objective function.
Analytical solutions such as the Linear-Quadratic Regulator (LQR) can solve the optimal control problem for linear systems with a quadratic cost function, but systems are usually nonlinear and state-constrained. Numerical solutions for such nonlinear problems can be categorized into two classes: (1) indirect methods, which apply first-order necessary optimality conditions based on Pontryagin's Maximum Principle to turn the optimal control problem into a multi-point boundary value problem that can be solved by an appropriate solver, and (2) direct methods, which discretize the problem in time, approximate the control input, the state, or both with parametric functions, and iteratively identify the best parameter set that optimizes the objective function (Böhme & Frank, 2017).
In the current article, we propose a solution based
on reinforcement learning to the optimal control problem
in running races. In essence, the approach is to use a statistical parameterized control model driven towards the optimal solution by experience in the simulated race, with no a priori assumptions, whilst remaining sufficiently generic to be applied to a variety of dynamical models. Moreover, unlike the widely used direct methods in optimal control, here we optimize a parameterized probabilistic policy function, enabling it to explore the control trajectory space and iteratively improve its performance. We utilize the well-known mathematical model of Keller (1973), which represents the dynamics of motion and of the human body's energy conservation in the running race application. We use the theoretical optimal solution for this model, provided by Keller (1974), to validate the solution of our proposed method; since the physiological constants of a world-champion runner are also identified in Keller (1973), we can reproduce the exact same dynamical model. It is interesting that while Keller assumes a 3-stage run (initial, middle, end), here this 3-part profile is not assumed a priori but emerges as part of the solution.
I-B Contributions and key limitations
Overall, this paper presents a machine learning so-
lution that is able to learn the optimal pacing profiles
predicted by Keller’s theory of competitive running. More
specifically, the main contributions of the present work
include:
- A neural network-based probabilistic policy model transforming the state of the system into the optimal control action in continuous space.
- A reinforcement learning-based parameter optimization procedure to iteratively improve the policy model's performance.
- A numerical simulation of the dynamical system that is essentially an implementation of Keller's mathematical model (differential equations governing the runner's force, energy and velocity dynamics), temporally discretized by the Runge-Kutta approximation method (Tan & Chen, 2012).
- A validation discussion of the proposed method, comparing the predictions of our model with the theoretical results provided by Keller (1973).
It is important to note that the study of the convergence of the learner and of the sensitivity of the approximation (to the time-discretization step size of the numerical model) is performed only qualitatively and is not studied thoroughly in the present research. Another limitation of the current work is that the proposed RL procedure requires a large number of simulated experiences in the environment. Nonetheless, a model optimized on simulated experience could potentially be used as a reasonable initial point for an on-line real-world system, thus reducing the number of required training data samples. Besides, even though the analytical techniques proposed in the literature try to consider several characteristics/features of the runners' physique and the environment at the same time, the complex nature of modeling an individual runner means that no single solution is globally suitable for all runners. Different physical characteristics of the runners and different environments wherein the runners act provide a wide range of features that affect the constant parameters used in the models and might even change the mathematical modeling itself. Another fact is that the mental and psychological features that might significantly affect the modeling are mainly the result of the interaction between the runner and their environment. In this paper, instead of a bottom-up structural/formal solution for the runner's optimal energy consumption, we propose a high-level behavioural model that can be trained and refined over time and can consider and accept many features in one model. Moreover, the proposed computational approach is easily portable to different mathematical models, sports and terrains, and even to different applications altogether, such as battery electric vehicles.
I-C Relation to previous work
This work is directly linked to Keller (1973), a pioneering mathematical model relying on Newton's law of motion and an energy conservation model assuming that $\dot{V}O_2$ is constant during the race (the model is fully described in Subsection II-A). Several modifications to this fundamental work have been introduced, considering the effect of fatigue (Woodside, 1991), wind resistance and altitude (Quinn, 2004), variation in the oxygen uptake rate (Behncke, 1997), etc.
To incorporate aerobic and anaerobic energy, a conceptual hydraulic model has been developed (Morton, 2006), comprising two energy tanks: a store of aerobic energy with infinite capacity but a limited consumption rate, and a store of anaerobic energy with an unlimited consumption rate but limited capacity. Aftalion and Bonnans (2014) utilized the dynamics of this hydraulic model and solved the optimal running problem for the 1500-meter race with a Runge-Kutta discretization scheme and a nonlinear programming solver, all implemented in the Bocop toolbox (Bonnans, Martinon, & Grélard, 2012).
Reinforcement learning has been used for optimal
control of dynamical systems in different applications
(see for example J. Duan et al. (2019) for optimal
charge/discharge control of hybrid AC-DC microgrids
or Liu, Xie, and Modiano (2019) on computer network
queueing systems), but to the authors’ knowledge it has
not been used in optimal control of competitive sports.
I-D Organization of the article
The rest of this paper is structured as follows. In Section II, we formally define the runner's problem and detail the proposed reinforcement learning-based solution. Afterward, the validation experiments are described and the predictions of the control input for three track lengths are presented in Section III. Finally, we conclude the paper in Section IV and discuss future directions.
II PROPOSED METHOD
II-A Formal case study definition
Keller (1973) considered a mathematical model that incorporates the applied force dynamics and the runner's energy conservation. The dynamical model is briefly described here for clarity.
The time $T$ to run a track of length $D$ is related to the velocity profile $v(t)$ by:
$$D = \int_0^T v(t)\,dt \qquad (1)$$
The governing differential equation of the runner's force balance is defined as:
$$\frac{dv(t)}{dt} + \frac{v(t)}{\tau} = f(t) \qquad (2)$$
with $v(t)$ being the instantaneous velocity of the runner, $f(t)$ the total propulsive force per unit mass applied by the runner, and $\tau$ the constant damping/friction coefficient. Note that the propulsive force $f(t)$ is under the control of the runner, through which the velocity is manipulated. The initial condition is $v(0) = 0$, i.e., initially the runner is at rest, and the upper bound on the propulsive force is
$$f(t) \leq F_{\max}, \quad \forall t \in [0, T] \qquad (3)$$
The dynamics of the runner's energy storage are described by the differential equation:
$$\frac{dE(t)}{dt} = \sigma - f(t) \cdot v(t) \qquad (4)$$
in which $E(t)$ is the quantity of available muscular energy per unit mass at a given time and $\sigma$ is the rate of energy recovery per unit mass at which energy is supplied to the body by the respiratory and circulatory systems (in excess of the non-running metabolism). Note that initially the runner has a certain amount of available energy in their muscles, i.e., $E(0) = E_0$, and the energy can never be negative, i.e.,
$$E(t) \geq 0, \quad \forall t \in [0, T] \qquad (5)$$
The optimal control problem is to find $f(t)$, $v(t)$ and $E(t)$ such that (2)-(5) and the initial conditions are satisfied and $T$ defined by (1) is minimized. The four physiological constants $\tau$, $F_{\max}$, $\sigma$ and $E_0$ and the distance $D$ are given.
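To make the dynamics concrete, the following minimal Python sketch implements the right-hand sides of Equations (2) and (4), using the fitted constants of Keller (1973) that are listed later in Section III-A; the function names and the feasibility helper are our own illustration, not the authors' code.

```python
# Keller's fitted physiological constants (Keller, 1973), as listed in Section III-A.
TAU = 0.892      # damping/friction time constant [s]
F_MAX = 12.2     # maximum propulsive force per unit mass [m/s^2]
SIGMA = 41.54    # energy recovery rate per unit mass [J/(kg*s)]
E0 = 2405.8      # initial available energy per unit mass [J/kg]


def velocity_rate(v, f, tau=TAU):
    """Right-hand side of Eq. (2): dv/dt = f - v / tau."""
    return f - v / tau


def energy_rate(v, f, sigma=SIGMA):
    """Right-hand side of Eq. (4): dE/dt = sigma - f * v."""
    return sigma - f * v


def feasible(v, E, f, f_max=F_MAX):
    """Inequality constraints (3) and (5), plus forward motion."""
    return (f <= f_max) and (E >= 0.0) and (v >= 0.0)
```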
Keller (1974) attained an analytical solution to this problem by using the calculus of variations with optimization of constrained differential equations. He determined the optimal velocity profile under three assumptions: (1) the race should be divided into exactly three distinct phases, (2) the runner should start with maximum force in the first phase ($f(t) = F_{\max}$ for $t \in [0, t_1]$), and (3) the runner should deplete the energy source before the end point and finish the race with zero energy in the third phase ($E(t) = 0$ for $t \in [t_2, T]$). With these assumptions, he derived the velocity profile and found that the velocity in the middle phase (on which no assumption was imposed) is constant. He argued that these assumptions arise from the fact that in the optimal solution, variables with inequality constraints should appear at their extreme bounds (where the constraints become equalities).
II-B Reinforcement learning formulation
Our proposed algorithm for finding the optimal control inputs of the runner is based on the policy-gradient reinforcement learning method, whereby we use (1) a fully connected feed-forward neural network as the parameterized probabilistic controller model, and (2) a numerical simulation of the dynamical model described in Subsection II-A to simulate the runner's environment and provide the reward signal. The neural network controller model is used as the parametric policy function to transform the current state of the runner into the predicted optimal propulsive force for the next time step. This predicted action value is then used in the numerical simulation to apply the propulsive force, update the runner's state accordingly and compute the reward value of the new state. The reward value is then used to update the neural network parameters by gradient-based stochastic optimization; see Figure 1 for a conceptual illustration.
The RL algorithm, in general, relies on trial and error, while reinforcing the actions of good trials by rewarding them. In our formulation of the runner's problem, for example, positions close to the finish line are highly rewarded, while any constraint violation terminates the run with a considerably smaller reward value. The policy-gradient method we utilize belongs to a family of RL algorithms that parameterize the policy function directly and optimize its parameters by gradient ascent on the performance objective (the cumulative reward, also known as the return). This optimization is usually performed on-policy, meaning that each update only uses data collected while acting according to the most recent version of the policy. The key idea underlying policy-gradient methods in RL is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until the model reaches the optimal policy.
In this section, we describe the controller model
that predicts the propulsive force for a given state, the
simulated runner and the RL optimization procedure to
identify the controller model parameters.
1) Neural network control model
The neural network controller model takes the runner's current state (namely the instantaneous position, velocity and available energy) as input and, through several parametric hidden layers, generates the corresponding predicted mean value of the normal distribution on the runner's propulsive force (the next control input).

Fig. 1: Modules of the reinforcement learning framework: (1) Controller model: reads the state and reward from the current time of the environment and computes the next action $f_t$; the reward signal $r_t$ is also used to train the model. (2) Environment: reads the action input to advance execution of the dynamical model one step forward.

Fig. 2: Structure of a sample neural network as the predictor of the mean of the normal distribution on the propulsive force, given the runner's state.
Suppose that the runner's state at time $t$ is $s_t = (x_t, v_t, E_t)$. If the corresponding output of the neural network for this input state is $\mu_t$, then the next propulsive force is sampled from the probabilistic policy distribution $\pi_t(f_t \mid s_t)$ as
$$f_t \sim \mathcal{N}(\mu_t, \sigma^2)$$
where $\sigma^2$ is assumed to be constant for all $t$.
The trainable parameters of the neural network are the set of weight matrices $W_i$ and bias vectors $B_i$ of each layer. The forward pass of the neural network consists of computing the values of the stacked hidden layers by
$$z_i = W_i \cdot h_{i-1} + B_i, \qquad h_i = \tanh(z_i)$$
where $\tanh(\cdot)$ is used as the activation function. Finally, the single output of the network is a weighted average of the last hidden layer values. Figure 2 shows the structure of the neural network control model.
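As an illustration, a minimal PyTorch sketch of such a Gaussian policy could look as follows; it uses one hidden layer of 32 tanh units and a fixed variance of 0.36, matching the setup reported in Section III-A (which yields 161 learnable parameters), but the class and method names are our own and the published code base may differ.

```python
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Maps the state (x_t, v_t, E_t) to the mean of a Normal distribution over f_t."""

    def __init__(self, state_dim: int = 3, hidden: int = 32, sigma: float = 0.6):
        super().__init__()
        # 3*32 + 32 weights/biases in the hidden layer, 32 + 1 in the output layer: 161 parameters.
        self.mu_net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        self.sigma = sigma  # fixed standard deviation (sigma^2 = 0.36 in Section III-A)

    def distribution(self, state: torch.Tensor) -> torch.distributions.Normal:
        mu = self.mu_net(state)
        return torch.distributions.Normal(mu, self.sigma)

    def act(self, state: torch.Tensor) -> torch.Tensor:
        """Sample a propulsive force f_t ~ N(mu_t, sigma^2)."""
        return self.distribution(state).sample()
```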
2) Environment
In the RL context, a simulated (or real) environment is needed to perform the predicted action of the controller model and to compute the reward value of the selected action. We implemented a discretized version of the dynamics of Keller's mathematical model (Equations (2) and (4)). This numerical simulation reads the instantaneous propulsive force and accordingly performs one update step of the runner's state. We employed the Runge-Kutta method (RK4) to numerically approximate new values of the state variables (energy and velocity) based on their governing differential equations and previous values. RK4 is a fourth-order iterative approximation method that uses a weighted average of four slope values (evaluations of the derivative function) at the beginning, middle and end of the time interval $\Delta t$ to approximate the next value of the variable. The accumulated error of RK4 is of order $O(\Delta t^4)$, compared to the first-order Euler method (Butcher, 2016) with accumulated error of order $O(\Delta t)$. Thus, RK4 lowers the approximation error caused by time discretization and reduces the sensitivity of the simulation to the time step size.
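A possible RK4 update of the coupled position/velocity/energy state over one step $\Delta t$ is sketched below, reusing the hypothetical velocity_rate and energy_rate helpers from the earlier sketch; this is an illustration of the scheme, not the authors' implementation.

```python
def rk4_step(x, v, E, f, dt):
    """One RK4 update of (position, velocity, energy) under a constant force f over dt."""

    def derivs(state):
        xi, vi, Ei = state  # dx/dt = v; dv/dt and dE/dt do not depend on x or E here
        return (vi, velocity_rate(vi, f), energy_rate(vi, f))

    def shifted(state, k, scale):
        return tuple(s + scale * ki for s, ki in zip(state, k))

    s0 = (x, v, E)
    k1 = derivs(s0)
    k2 = derivs(shifted(s0, k1, dt / 2))
    k3 = derivs(shifted(s0, k2, dt / 2))
    k4 = derivs(shifted(s0, k3, dt))
    # Weighted average of the four slopes, as in the classical RK4 scheme.
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(s0, k1, k2, k3, k4)
    )
```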
The step method of the runner's simulation checks the success/failure conditions after each update (to determine whether the race is finished or has failed). In particular, a race is recognized as successful if $x_t \geq D$, and as failed if any of the following happens: (1) the predicted propulsive force exceeds its maximum possible value ($f_t > F_{\max}$), (2) the runner runs out of energy ($E_t < 0$), or (3) the runner, irrationally, moves backward ($v_t < 0$).
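This termination logic could be sketched as follows (a hypothetical helper, assuming the F_MAX constant defined in the earlier sketch).

```python
def check_termination(x, v, E, f, D, f_max=F_MAX):
    """Return 'success', 'failed', or None (race still in progress) for the current step."""
    if f > f_max or E < 0.0 or v < 0.0:
        return "failed"   # constraint violation: force bound, energy depletion, or backward motion
    if x >= D:
        return "success"  # finish line reached (or passed)
    return None
```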
In addition, in the RL framework a reward signal is needed for the model optimizer to form the objective function in the training process. Therefore, while stepping the runner forward, the simulation model computes a reward value for every executed time step. The lower the race time, the higher the reward value should be. We use the following instantaneous reward function for each state-action pair (step):
$$r_t(s_t, f_t) = \begin{cases} \dfrac{1}{t} \cdot \dfrac{D}{D - x_t} & \text{if } x_t < D \text{ and not failed,} \\ \dfrac{1}{t} \cdot \dfrac{1}{\epsilon} & \text{if } x_t \geq D, \\ 0 & \text{if failed.} \end{cases} \qquad (6)$$
where $s_t = (x_t, v_t, E_t)$ is the runner's current state, $f_t$ is the predicted control action, failed is true if any of the three failure constraints mentioned above is violated, and $\epsilon$ is a small value. This reward function increases the reward rapidly as the runner's position approaches the finish line and incorporates the race time through an inverse relation. Given a trajectory (the sequence of states and actions in one run) $\delta = (s_0, f_0, s_1, f_1, \ldots, s_T)$, the cumulative reward (return) is defined as
$$R(\delta) = \sum_{t=0}^{T} r_t(s_t, f_t)$$
Now, if we consider the return function up to some intermediate time $t$ as $R(\delta, t)$, then it is proportional to the logarithmic barrier function
$$R(\delta, t) \propto -\frac{1}{t} \cdot \log\left(1 - \frac{x_t}{D}\right)$$
which helps the optimization converge (unlike discontinuous reward functions).
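For illustration, the instantaneous reward of Equation (6) and the plain summed return could be implemented as below; the value of $\epsilon$ is not reported in the paper, so the constant used here is an arbitrary placeholder.

```python
EPS = 1e-3  # placeholder for the small value epsilon in Eq. (6); the paper does not report it


def instant_reward(t, x, D, failed):
    """Instantaneous reward r_t of Eq. (6); t is the elapsed time in seconds (t > 0)."""
    if failed:
        return 0.0
    if x >= D:
        return (1.0 / t) * (1.0 / EPS)   # large terminal reward, scaled by 1/t
    return (1.0 / t) * D / (D - x)       # barrier-like growth as x approaches D


def trajectory_return(rewards):
    """Cumulative reward R(delta): plain sum of the per-step rewards."""
    return sum(rewards)
```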
Figure 3 demonstrates the functions $r_t$ and $R(\delta, t)$ over time for a successful sample trajectory on a 400-meter track. The final peak of the instantaneous reward occurs when the simulation reaches the finish line. The high values at the beginning are due to the term $1/t$ and are beneficial in the sense that they lead the optimization toward forward movement at the start of the race. Steep portions in the cumulative reward graph speed up the stochastic optimization, since each training step is based on the gradient of the expected return (details in the following subsection).

Fig. 3: Graph of (1) the instantaneous reward over time (blue curve), which continuously and rapidly increases as the runner's position $x_t$ approaches the finish line, and (2) the cumulative reward over time (red curve), which is the cumulative sum of the instantaneous reward values up to time $t$.
3) Optimization
The weight parameters of the controller model neural network are initialized by the Kaiming uniform initialization method (He, Zhang, Ren, & Sun, 2015): $W_i \sim \mathcal{N}(0, 2/n_i)$, where $n_i$ is the number of inputs of layer $i$. These model parameters need to be identified. We consider the standard policy-gradient model-free actor-critic reinforcement learning algorithm (Degris, Pilarski, & Sutton, 2012; Y. Duan, Chen, Houthooft, Schulman, & Abbeel, 2016). This algorithm maximizes the expected return function
$$J(\pi) = \mathbb{E}_{\delta \sim \pi}[R(\delta)] \qquad (7)$$
incrementally, to identify these parameters (denoting all learnable parameters by $\theta$) of the parametric probabilistic policy $\pi_\theta(f \mid s)$, by applying the gradient of (7) as:
$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta_k} \qquad (8)$$
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\delta \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(f_t \mid s_t)\, A^{\pi_\theta}(s_t, f_t)\right] \qquad (9)$$
where $A^{\pi_\theta}(s_t, f_t)$ is the advantage function. The advantage function measures whether the action is better or worse than the policy's default behavior (Schulman, Moritz, Levine, Jordan, & Abbeel, 2015) and is defined as:
$$A^{\pi_\theta}(s_t, f_t) = Q^{\pi_\theta}(s_t, f_t) - V^{\pi_\theta}(s_t) \qquad (10)$$
with $Q^{\pi_\theta}(s_t, f_t)$ being the expected return when starting at state $s_t$, taking action $f_t$ and then following the policy $\pi_\theta$ (the on-policy action-value function), and $V^{\pi_\theta}(s_t)$ being the expected return when starting in state $s_t$ and following the policy $\pi_\theta$ (the value function). These two functions are formally defined as:
$$Q^{\pi}(s, f) = \mathbb{E}_{\delta \sim \pi}[R(\delta) \mid s_0 = s, f_0 = f] \qquad (11)$$
$$V^{\pi}(s) = \mathbb{E}_{\delta \sim \pi}[R(\delta) \mid s_0 = s] \qquad (12)$$
In practice, the expected values are estimated with a sample mean. If we collect a set of trajectories $\mathcal{D} = \{\delta_i\}_{i=1,\ldots,N}$ (along with their return values), where each trajectory is obtained by employing the policy $\pi_\theta$ in the environment, the policy gradient is estimated with
$$\hat{g} = \frac{1}{N} \sum_{\delta \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(f_t \mid s_t)\, \hat{A}(s_t, f_t) \qquad (13)$$
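In an automatic-differentiation framework, the estimate (13) is typically obtained by backpropagating through a surrogate loss whose gradient matches it. A minimal PyTorch sketch, assuming the hypothetical GaussianPolicy class from the earlier sketch and batched tensors of states, actions and advantage estimates:

```python
import torch


def policy_gradient_loss(policy, states, actions, advantages):
    """Surrogate loss whose gradient is the negative of the estimate in Eq. (13)."""
    dist = policy.distribution(states)          # N(mu_theta(s_t), sigma^2)
    logp = dist.log_prob(actions).squeeze(-1)   # log pi_theta(f_t | s_t)
    # Minimizing this loss performs gradient ascent on the expected return.
    return -(logp * advantages).mean()
```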
The advantage function is also estimated, by training a separate neural network to predict the value function $V^{\pi}(s)$. This model is trained along with the policy, by minimizing the mean squared residual error between the estimated value of $V^{\pi}(s)$ and the computed return value in the training data.
Essentially, the training procedure consists of repeatedly performing the following operations: (1) collect training data $\mathcal{D}$ by running the environment with the current policy parameters $\theta_k$, (2) compute the estimated value function $\hat{V}(s)$ for the training data, (3) compute the estimated advantage function $\hat{A}(s, f)$ for the training data, (4) compute the estimated gradient by Equation (13), (5) update the policy's parameters (and also those of the value function estimation model) by applying the gradient ascent (descent) step in Equation (8), and repeat the procedure from (1). For more details about the RL training procedure, see Achiam (2018) and Mnih et al. (2016).
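Wiring the pieces together, one iteration of this procedure could be organized as in the sketch below. It reuses the hypothetical GaussianPolicy and policy_gradient_loss from the earlier sketches, adds a small value network as the critic, and leaves data collection to a placeholder collect_trajectories function; the learning rates are arbitrary placeholders, while the batch size and step count follow Section III-A.

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 1))  # critic V(s)
policy = GaussianPolicy()                                                  # actor (earlier sketch)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)    # placeholder learning rates
vf_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

num_training_steps = 40_000  # as reported in Section III-A

for step in range(num_training_steps):
    # (1) collect a batch of trajectories with the current policy (hypothetical helper)
    states, actions, returns = collect_trajectories(policy, batch_size=5000)

    # (2)-(3) value estimates and advantage estimates
    values = value_net(states).squeeze(-1)
    advantages = (returns - values).detach()

    # (4)-(5) policy update: gradient ascent on the expected return via the surrogate loss
    pi_opt.zero_grad()
    policy_gradient_loss(policy, states, actions, advantages).backward()
    pi_opt.step()

    # critic update: mean squared error between V(s) and the observed returns
    vf_opt.zero_grad()
    nn.functional.mse_loss(values, returns).backward()
    vf_opt.step()
```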
III VALIDATION EXPERIMENTS
III-A Setup
We trained separate neural network controller models for three distances: 400, 1500 and 10000 meters. The fitted physiological parameters $\tau = 0.892\ \mathrm{s}$, $F_{\max} = 12.2\ \mathrm{m/s^2}$, $\sigma = 41.54\ \mathrm{J/(kg \cdot s)}$ and $E_0 = 2405.8\ \mathrm{J/kg}$ are taken from Keller (1973) to simulate a replication of the same dynamical model. Therefore, the results of our model and Keller's theoretical results are comparable. A neural network with a single hidden layer of 32 neurons (161 learnable parameters in total) is used for both the control input model and the value function estimator. The model is trained for 40000 steps, each step using a batch of 5000 training data samples (collected by running the environment with the current policy). The Adam optimizer (Kingma & Ba, 2014) has been employed for the stochastic optimization of the neural network parameters. A fixed step size is used for the numerical simulation of the dynamical model (0.1 seconds for the 400-meter track, 1 second for the 1500- and 10000-meter tracks). The value of the normal distribution variance $\sigma^2$ has been set to 0.36.
III-B Results and discussion
Figure 4 shows the progress of the training procedure for the 400-meter track over the training steps. The top curve shows the trajectory return averaged over a batch at each training step, the middle curve shows the trajectory lengths averaged over a batch, and the bottom curve shows the state-value function averaged over a batch. The average trajectory length (middle curve) approaches the optimal value (43.9 seconds for the 400-m case) rapidly at the beginning of the training (0-5K training steps), after which the optimizer tries to improve the time while not violating any constraint. In the bottom curve, it can be seen that the controller model steers the simulated runner into states with higher values, which achieve higher rewards. Each training step consists of a single update of the policy model parameters based on the loss computed over a batch of 5000 data samples. A similar training procedure is performed for the 1500- and 10000-meter tracks. After completion of the training steps, all intermediate models have been tested and the best race time has been found among them. The predicted mean value of the propulsive force has been used as the control action, and no sampling is performed in the test phase.
Table I shows the best race times found by the RL-based method along with the theoretical results of Keller and their percent errors relative to the actual world record times. The average percent error between our RL records and the world records is 2.38% (while the error of Keller's findings is 2.82%). The percent error between our RL records and Keller's records is 0.59%.
Figure 5 shows the propulsive force profile and the dynamics of the runner's position, velocity and available energy for race tracks of length 400, 1500 and 10000 meters (plotted in 5a, 5b and 5c, respectively). Keller's solution is also plotted for comparison. The force and velocity profiles for 1500 and 10000 meters are analogous to the theoretical optimal solution, which shows that the proposed RL method is able to find a near-optimal solution of the dynamical system. Interestingly, however, the force profile for the 400-meter track (top plot in Figure 5a) differs in shape from the theoretical solution, even though both result in similar race times. The difference in shape arises because the RL method is capable of finding more complex profiles, whereas Keller's solution takes a global view of the problem under the a priori assumption that the race is divided into exactly three phases. In other words, only two parameters specify Keller's solution completely, namely the durations of the first and last phases ($t_1$ and $t_2$ in Keller (1973)), compared to 161 parameters in our RL-based controller model. Another potential reason for the different profile shapes is the higher resolution of the numerical simulation (smaller time step) for the 400-meter track, which enables fine-grained control for decreasing the velocity smoothly in the final part of the race. Note, for example, the final part (in the time interval from about 38 to 44 seconds) of the energy dynamics (bottom plot in Figure 5a), which lies close to, but does not cross, zero energy. In this situation, there is a risk of constraint violation when the controller model advances with a relatively large time step. This risk increases especially with the probabilistic sampling of the force distribution, which can generate a sampled propulsive force far from the mean and thereby violate a system constraint, even when the mean value itself is acceptable and lower than the maximum force.
Fig. 4: Learning curves for the 400-m track (return, trajectory length, and evaluation of the value function of states, all averaged over a batch of training data).
IV CONCLUSIONS
In this article, we described a reinforcement learning-based solution for finding the optimal propulsive force profile in the running race application. The mathematical model of Keller (1973) has been employed to model the dynamics of motion and of the available energy in the runner's body. A numerical simulation model has been implemented for this mathematical model using the Runge-Kutta approximation method and, together with a reward calculation based on a designed logarithmic barrier function, has been used as the environment in the RL framework. The parameters of the neural network-based probabilistic controller model have then been trained by a policy-gradient model-free actor-critic RL algorithm. The obtained solution conforms with the theoretical results, providing a proof of concept for the proposed method. Unlike theoretical analyses of the problem, the proposed method does not require imposing any a priori assumptions regarding the characteristics of the solution, enabling it to (1) be applied to any dynamical system, and (2) find more complicated solutions in the extended search space.

Despite its capability of solving the optimal control problem, the proposed RL-based method has some limitations. First, the training procedure requires running many trials (experiences) in a simulated environment fitted to the athlete's body. This limits the applicability of the method for on-line use from scratch. However, a model trained off-line with simulated data can initialize the on-line adaptation with real-world sensory data from the runner. Another limitation is that the method does not guarantee the optimality of the solution, which is a direct consequence of the stochastic optimization used to maximize the total reward.
TABLE I: Race time results for 1972 world records, Keller's solution and our trained controller model (along with their percent errors). The first two data columns are obtained from Keller (1973).

Track Length    World Record (sec)   Keller's Theory Record (sec)   Our RL Record (sec)   Keller's Theory Error (%)   Our RL Error (%)
400 meters      44.5                 43.27                          43.9                  -2.76                       -1.34
1500 meters     213.1                219.44                         219.9                  2.97                        3.19
10000 meters    1659.4               1614.1                         1616.0                -2.72                       -2.61
Fig. 5: Propulsive force profile (predicted with our RL method in blue and Keller's in dashed red) and the resulting position, velocity and energy dynamics of the runner: (a) 400-meter track, (b) 1500-meter track, (c) 10000-meter track. Each panel shows, from top to bottom, propulsive force (m/s²), position (m), velocity (m/s) and energy (kJ) versus time (s).
Future research on the topic can focus on exploiting a more realistic mathematical model for the energy and motion dynamics (with potentially variable oxygen uptake rate and critical power during the race). Another interesting direction is to incorporate a variable variance in the controller model and study the exploration behaviour in the search space due to probabilistic sampling. In this case, the neural network would predict the time- and state-dependent variance $\sigma_t^2$ of the normal distribution along with its mean, regulating the exploration range during the run and potentially reducing the risk of violating the inequality constraints, especially at the end of the track. A sensitivity analysis on the time step size may also be investigated in the future.
SOURCE CODE
The source code of the framework is available at: https://github.com/COMEA-TUAS/rl-optimal-control-keller
ACKNOWLEDGEMENTS
The authors gratefully acknowledge funding from
Academy of Finland (ADAFI project).
REFERENCES
Abbiss, C. R., & Laursen, P. B. (2008). Describing and understanding
pacing strategies during athletic competition. Sports Medicine,
38(3), 239–252.
Achiam, J. (2018). Spinning up in deep reinforcement learning. URL https://spinningup.openai.com.
Aftalion, A. (2017). How to run 100 meters. SIAM Journal on Applied
Mathematics,77(4), 1320–1334.
Aftalion, A., & Bonnans, J. F. (2014). Optimization of running strategies
based on anaerobic energy and variations of velocity. SIAM
Journal on Applied Mathematics,74(5), 1615–1636.
Aftalion, A., & Trélat, E. (2021). Pace and motor control optimization
for a runner. Journal of Mathematical Biology,83(1), 1–21.
Alvarez-Ramirez, J. (2002). An improved peronnet-thibault mathemati-
cal model of human running performance. European journal of
applied physiology,86(6), 517–525.
Behncke, H. (1997). Optimization models for the force and energy in
competitive running. Journal of mathematical biology,35(4),
375–390.
Böhme, T. J., & Frank, B. (2017). Indirect methods for optimal control.
In Hybrid systems, optimal control and hybrid vehicles: Theory,
methods and applications (pp. 215–231). Cham: Springer
International Publishing. doi: 10.1007/978-3-319-51317-1 7
Bonnans, F., Martinon, P., & Grélard, V. (2012). Bocop - a collection of
examples (Unpublished doctoral dissertation). Inria.
Butcher, J. C. (2016). Numerical methods for ordinary differential
equations. John Wiley & Sons.
Casado, A., Hanley, B., Jiménez-Reyes, P., & Renfree, A. (2021). Pacing
profiles and tactical behaviors of elite runners. Journal of Sport
and Health Science,10(5), 537–549.
Cooper, R. A. (1990). A force/energy optimization model for wheelchair
athletics. IEEE transactions on systems, man, and cybernetics,
20(2), 444–449.
Degris, T., Pilarski, P. M., & Sutton, R. S. (2012). Model-free
reinforcement learning with continuous action in practice. In
2012 american control conference (acc) (pp. 2177–2182).
Duan, J., Yi, Z., Shi, D., Lin, C., Lu, X., & Wang, Z. (2019).
Reinforcement-learning-based optimal control of hybrid energy
storage systems in hybrid ac–dc microgrids. IEEE Transactions
on Industrial Informatics,15(9), 5355–5364.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., & Abbeel, P. (2016).
Benchmarking deep reinforcement learning for continuous con-
trol. In International conference on machine learning (pp. 1329–
1338).
Elamvazhuthi, K., & Berman, S. (2015). Optimal control of stochastic
coverage strategies for robotic swarms. In 2015 ieee interna-
tional conference on robotics and automation (icra) (pp. 1822–
1829).
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep
into rectifiers: Surpassing human-level performance on imagenet
classification. In Proceedings of the ieee international confer-
ence on computer vision (pp. 1026–1034).
Jones, A. M., Vanhatalo, A., Burnley, M., Morton, R. H., & Poole,
D. C. (2010). Critical power: implications for determination of
VO2max and exercise tolerance. Medicine & Science in Sports
& Exercise,42(10), 1876–1890.
Kamien, M. I., & Schwartz, N. L. (2012). Dynamic optimization: the
calculus of variations and optimal control in economics and
management. Courier Corporation.
Keller, J. B. (1973). A theory of competitive running. Physics today,
26(9), 42–47.
Keller, J. B. (1974). Optimal velocity in a race. The American
Mathematical Monthly,81(5), 474–480.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980.
Liu, B., Xie, Q., & Modiano, E. (2019). Reinforcement learning
for optimal control of queueing systems. In 2019 57th annual
allerton conference on communication, control, and computing
(allerton) (pp. 663–670).
Maroński, R., & Rogowski, K. (2011). Minimum-time running: a
numerical approach. Acta of Bioengineering and Biomechan-
ics/Wrocław University of Technology,13(2), 83–86.
McGibbon, K. E., Pyne, D., Shephard, M., & Thompson, K. (2018).
Pacing in swimming: A systematic review. Sports Medicine,
48(7), 1621–1633.
Mercier, Q., & Aftalion, A. (2020). Optimal speed in thoroughbred
horse racing. Plos one,15(12), e0235024.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T.,
. . . Kavukcuoglu, K. (2016). Asynchronous methods for deep
reinforcement learning. In International conference on machine
learning (pp. 1928–1937).
Morton, R. H. (2006). The critical power and related whole-body
bioenergetic models. European journal of applied physiology,
96(4), 339–354.
Quinn, M. (2004). The effects of wind and altitude in the 400-m sprint.
Journal of sports sciences,22(11-12), 1073–1081.
Rahimi, A., Dev Kumar, K., & Alighanbari, H. (2013). Particle swarm
optimization applied to spacecraft reentry trajectory. Journal of
Guidance, Control, and Dynamics,36(1), 307–310.
Reardon, J. (2013). Optimal pacing for running 400-and 800-m track
races. American Journal of Physics,81(6), 428–435.
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015).
High-dimensional continuous control using generalized advan-
tage estimation. arXiv preprint arXiv:1506.02438.
Tan, D., & Chen, Z. (2012). On a general formula of fourth order runge-
kutta method. Journal of Mathematical Science & Mathematics
Education,7(2), 1–10.
Wolf, S. (2019). Applications of optimal control to road cycling
(Unpublished doctoral dissertation).
Woodside, W. (1991). The optimal strategy for running a race (a
mathematical model for world records from 50 m to 275 km).
Mathematical and computer modelling,15(10), 1–12.
AUTHOR BIOGRAPHIES
SAJAD SHAHSAVARI works as Researcher at Turku University of
Applied Sciences, and is a PhD candidate at University of Turku,
Department of Computing, Finland.
EERO IMMONEN is an Adjunct Professor at Department of Mathe-
matics at University of Turku, Finland, and works as Principal Lecturer
at Turku University of Applied Sciences, Finland.
MASOOMEH KARAMI is a PhD candidate in Autonomous Systems
Lab at University of Turku, Department of Computing, Finland.
HASHEM HAGHBAYAN is a post-doctoral researcher at University
of Turku, Department of Computing, Finland.
JUHA PLOSILA is a Professor in the field of Autonomous Systems
and Robotics at the University of Turku, Department of Computing,
Finland.