Learning to Forecast Pedestrian Intention from Pose Dynamics


For an autonomous car, the ability to foresee a human's action is very useful for mitigating the risk of a possible collision. To humans this pedestrian intention foresight comes naturally, as they are able to recognize another person's actions just by perceiving subtle changes in posture. Approximating this intention inference ability by directly training a deep neural network is useful but especially challenging. First, sufficiently large datasets for intention recognition with frame-wise human pose and intention annotations are rare and expensive to compile. Second, training on smaller datasets can lead to overfitting and make it difficult to adapt to intra-class variations in action executions. Therefore, in this paper, we propose a real-time framework that learns (i) intention recognition using weak supervision and (ii) locomotion dynamics of intention from pose information using transfer learning. This new formulation is able to tackle the lack of frame-wise annotations and to learn intra-class variation in action executions. We empirically demonstrate that our proposed approach leads to earlier and more stable detection of intention than other state-of-the-art approaches, with real-time operation and the ability to detect intention one second before the pedestrian reaches the kerb.
For modern intelligent systems, enabling a physical au-
tonomous agent to correctly preempt human actions and react
appropriately is imperative. In an inner city traffic scenario
an autonomous car should be able to foresee a pedestrian’s
future action and proactively take measures to avoid a
potentially dangerous situation. Early intention recognition
is therefore key for safe and comfortable driving.
Fig. 1 illustrates an urban traffic scenario where a woman (orange bounding box) is approaching the road. Given a history of visual observations, the task of interest is to predict whether the woman intends to cross the road or stop at the kerb. Drawing on past experience, a human observer can anticipate that she is likely to stop at the kerb. This decision is grounded in subtle changes of the observed person [1]. Even if scene context is not visible, as illustrated in the enlarged bounding boxes of Fig. 1, an observer can still infer that she is likely to stop based solely on the subtle changes in posture [2].
1 The authors are affiliated with Robert Bosch GmbH, Hildesheim, Germany, (email- Omair.Ghori, Radek.Mackowiak, Niklas.Beuter, Lucas.RegoDrumond, )
2 The authors are affiliated with Heidelberg University, Heidelberg, Germany, (email-,
Fig. 1: The sequence above illustrates an inner city traffic scene where a person (orange bounding box) approaches the road and eventually stops at the kerb. Our proposed method is able to correctly infer intention one second before the pedestrian reaches the kerb, which for a car traveling at 30 km/h would translate to a braking distance of eight meters. The enlarged bounding boxes focus on the pedestrians without the context of the environment.
Initial works framed intention recognition as a path prediction problem [3], [4], [5], [6], where the focus is to predict the short term path for a pedestrian of interest. Such an
approach avoided the difficulties associated with annotating an intention recognition dataset. Annotating each frame in a sequence with intention labels is a challenging, expensive and highly subjective task [1]. As the enlarged bounding boxes of Fig. 1 illustrate, based on a single frame devoid of context, both crossing and stopping intentions look likely. However, when viewed as a sequence of images, the true intention becomes clearer. Thus, while only a sequential chunk of frames can be annotated for intention, defining the temporal extent of this chunk is a very subjective task. A few earlier methods utilize pedestrian head pose [7], [3] as an additional feature for path prediction. These methods are, however, evaluated on staged datasets with actors, where the impact of head pose as a feature for intention prediction cannot be truly gauged. In the wild, head pose only implies that a pedestrian is aware of an oncoming
In this paper we present a framework for learning pedestrian intention in a weakly supervised manner. Instead of utilizing raw pixel data for inferring intention, we propose to extract a high-level structure from a given RGB input frame and subsequently track its temporal dynamics for pedestrian intention recognition. Framing our problem as weakly supervised allows us to overcome the limitations imposed by the lack of precise temporal annotations.
Based on the domain knowledge that human posture encodes information about intention, we use human pose as our high-level feature. Although intention recognition using the human skeleton has been studied before [8], [9], [10], these works require detailed 3D pose information for recognition. In contrast, we investigate intention recognition in 2D monocular image sequences where ground-truth pose annotations are not available.
The main contributions of this paper are:
• To overcome the lack of frame-wise annotations, we pose our problem as weakly supervised learning; just a single label per sequence is needed for inferring real-time intention recognition per frame.
• We propose human pose as our high-level feature in order to have a compact feature representation that is interpretable by humans, to learn its temporal dynamics, and to have an efficient model for running on embedded hardware.
• In order to enable accurate intention recognition even when using a relatively small dataset for training, we use transfer learning to learn a latent pose representation.
With the increased focus on achieving fully autonomous driving, pedestrian intention recognition as a means of avoiding possible collisions with pedestrians is very important and has therefore received attention from the research community. Recently, Fang et al. [11] utilized pedestrian pose as a feature for intention recognition. They first use a CNN for pedestrian pose estimation. From the pose keypoints they compute a total of 396 features based on distances and angles between pairs of keypoints. These features are then used to train a binary classifier for intention recognition. In contrast, our approach is learned completely end-to-end with no hand-crafted features. Furthermore, we utilize the full 14-keypoint skeleton instead of the 9 keypoints used by Fang et al. Most prior works related to pedestrian intention recognition have tended to focus on path prediction [7], [10], [3], [4], [5], [6], [12] instead of explicit intention recognition. Schneider et al. [4] investigated various Kalman filter based models for pedestrian path prediction in a one-second future time window. The proposed approach leveraged features extracted from pedestrian motion dynamics only. Kooij et al. [7] extended this initial work by incorporating additional features which approximate the pedestrian's behaviour as well as the environmental context. Two non-linear methods for estimating the crossing intention of a laterally approaching pedestrian were introduced by Keller et al. [6]. Probabilistic Hierarchical Trajectory Matching (PHTM) tries to find the best match for a partially observed pedestrian motion track among a database of motion snippets. The closest matching pedestrian trajectory is then used as a model for approximating future pedestrian position. For longer term pedestrian path prediction, Rehder et al. [13] frame the problem as one of planning and treat the pedestrian's destination as a latent variable. Using inverse reinforcement learning, [14] investigates trajectory prediction for multiple people. However, this works only for a fixed surveillance camera setup, while we work with a moving camera.
Gaussian Process Dynamical Models (GPDM) have also been demonstrated to be effective using dense optical flow features and pedestrian motion dynamics as input features [6] and body pose as an input feature [10]. Head pose has also been used as a complementary feature for intention recognition [3], [15]. However, both these approaches are only ever evaluated on a dataset of staged scenarios. Hariyono et al. [16] investigated intention recognition by developing a model for a walking human. Unlike our work, however, they do not explicitly estimate human pose but instead rely on statistics extracted from a pedestrian bounding box.
A. Problem Formulation
Having a sequence of images from time $0$ to $T$, $F = \{F_t : t = 0, \dots, T\}$, we are interested in recognizing the intention, $y_t$, of an observed pedestrian at time $t$ before an action occurs at time $T$, where $t < T$. Due to the ambiguity of detecting intention from a single frame, we aim to infer the intention class with the maximum probability given a sequence of frames prior to the occurrence of the action. Thus estimating the intention up to frame $F_t$ is formulated as a maximum a posteriori Bayesian inference problem,
$$\hat{y}_t = \arg\max_{y_t \in \{1, \dots, M\}} P(y_t \mid F_{0:t}), \tag{1}$$
where $M$ is the number of predefined intention categories, and $P(y_t \mid F_{0:t})$ is the probability of intention class $y_t$ given all the frames up to frame $t$. Instead of explicitly computing the probabilities, we take the softmax over the model outputs, which gives us the distribution over the $M$ possible classes
$$P(y_t \mid F_{0:t}) = \frac{\exp h_{y_t}(F_{0:t})}{\sum_{m=1}^{M} \exp h_m(F_{0:t})}, \tag{2}$$
where $h_{y_t}(F_{0:t})$ are the features extracted from frames $0$ to $t$ for intention $y_t$. Due to the recurrent nature of the problem, we are able to estimate the intention class for each frame up to $t$,
$$\hat{Y}_t = \arg\max_{Y_t \in \mathcal{M}} P(Y_t \mid F_{0:t}), \tag{3}$$
where $\mathcal{M}$ is the set of all possible intention annotations, $Y_t = [y_0, \dots, y_t]$. The probability of an intention class given a sequence of frames, $P(Y_t \mid F_{0:t})$, can be approximated as
$$P(Y_t \mid F_{0:t}) \approx \prod_{j=0}^{t} P(y_j \mid F_{0:j}) \tag{4}$$
$$\approx \prod_{j=0}^{t} P(y_j \mid y_{j-1}, F_j), \tag{5}$$
where $P(y_j \mid y_{j-1}, F_j)$ is the probability of the intention class at frame $j$ given the current frame $F_j$ and the previous intention class, $y_{j-1}$.
The former approximation, Eq. (4), estimates each intention class independently, requiring the computation of all spatio-temporal features $h_{y_t}(F_{0:t})$ at each time step before estimating the softmax in Eq. (2). In contrast, the latter approximation, Eq. (5), infers the intention class based on the current frame and the previous intention, allowing the extracted features, $h_{y_t}(F_{0:t})$, to be calculated recurrently, independent of the total number of time steps. Despite reducing the computational cost of the feature extractor, Eq. (5) is not able to learn long-term dynamics of the intention. Therefore, following the idea of computing the feature extractor recurrently, we decouple $P(Y_t \mid F_{0:t})$ into two components, a visual and a temporal feature extractor. The former captures spatial frame-wise characteristics of the intention, whereas the temporal feature extractor captures long-term dynamics of the intention given the visual features. Hence, given a sequence of frames before the action actually begins, we learn the temporal dynamics of features indicative of the person's intention. As the classifier starts to see those features occurring during test time, the probability of the correct class will begin to increase.
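The softmax and arg max steps of the formulation above can be illustrated with a minimal sketch. The score values and the function names `softmax` and `map_intention` are illustrative assumptions and not part of the proposed implementation; the scores simply stand in for the model outputs for each of the M classes.

```python
import math


def softmax(scores):
    """Turn raw class scores into a probability distribution over the classes."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def map_intention(scores):
    """Return (class index, probability) of the most probable intention class."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Hypothetical scores for M = 5 classes at three consecutive frames.
frame_scores = [
    [0.2, 0.1, 0.0, 0.0, 0.1],
    [0.9, 0.3, 0.0, 0.1, 0.2],
    [2.5, 0.4, 0.1, 0.0, 0.3],
]

for t, scores in enumerate(frame_scores):
    label, prob = map_intention(scores)
    print(f"frame {t}: class {label} with probability {prob:.2f}")
```

In this toy input the score of class 0 grows from frame to frame, so its probability rises as more evidence arrives, which mirrors the behaviour described for the classifier at test time.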
B. Visual Feature Extraction
As a first step towards intention recognition, a feature representation of the visual contents of frame $F_t$ needs to be extracted. Normally, both intentions and their associated actions can be well represented by generic visual descriptors [17]. These feature representations work for action classes which differ greatly in execution. In contrast, we focus on human pose as a compact visual feature descriptor. The proposed feature encodes information about a person's intention [2], and helps in identifying subtle differences in motor movements.
The challenge then is to estimate the pose of a pedestrian $q$ in a given frame $F_t$. In order to do this, first an off-the-shelf pedestrian detector is used to obtain a bounding box image, $I_t^q$, of pedestrian $q$. Given this bounding box image as the input of a feature transformation, we estimate the coordinates of a 14-keypoint human skeleton as defined in [18]. To this end we learn a function, $\phi$, parametrized by parameters $\theta$, to regress the real-valued vector output, $\hat{Z}$, corresponding to the pose of pedestrian $q$, from the grayscale input image $I_t^q$,
$$\hat{Z} = \phi(I_t^q; \theta). \tag{6}$$
Specifically, we train a Convolutional Neural Network (CNN), represented by the transformation $\phi$, for estimating the pose from the obtained bounding box image. The CNN architecture is similar to the one proposed by Belagiannis et al. [19]. Fig. 2 summarizes the architecture we utilize for pose estimation in the box marked Pose CNN. The top row of Fig. 3 illustrates the pose estimation output of the network on frames of a sequence in our dataset.
Fig. 3: The top row shows the posture as estimated by our pose CNN after training on a standard pose dataset. The bottom row shows the refinement in pose after end-to-end intention recognition learning. The network is never explicitly trained for pose on the intention dataset.
For parameter learning, Tukey's biweight loss function [19] is minimized. This loss function is defined as:
$$\rho(r_i) = \begin{cases} \dfrac{c^2}{6}\left[1 - \left(1 - \left(\dfrac{r_i}{c}\right)^2\right)^3\right], & \text{if } |r_i| \le c \\ \dfrac{c^2}{6}, & \text{otherwise,} \end{cases} \tag{7}$$
where $r_i$ is a component of the residual, defined as $r = Z - \hat{Z}$, and $c$ is a tuning constant. For regression tasks, Tukey's biweight loss function is more robust to the influence of outliers than the more commonly used L2 loss. We set an initial learning rate of 0.01 and use AdaGrad [20] for optimizing the network.
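The robustness claim can be made concrete with a small sketch that compares Tukey's biweight loss with a squared-error loss on a few residuals. The default tuning constant `c=4.685` is an assumption (a value commonly used for this loss), since the text does not state the value used; the function names are illustrative.

```python
def tukey_biweight(residual, c=4.685):
    """Tukey's biweight loss for a single residual component."""
    if abs(residual) <= c:
        scaled = residual / c
        return (c * c / 6.0) * (1.0 - (1.0 - scaled * scaled) ** 3)
    # Beyond the cut-off the loss is constant, so outliers stop adding loss.
    return c * c / 6.0


def l2_loss(residual):
    """Squared-error loss, scaled by 1/2 so that it matches Tukey's loss near zero."""
    return 0.5 * residual * residual


for r in (0.1, 1.0, 4.0, 10.0, 100.0):
    print(f"r={r:6.1f}  tukey={tukey_biweight(r):8.3f}  l2={l2_loss(r):10.3f}")
```

For small residuals the two losses nearly coincide, while for large residuals the squared error keeps growing and the biweight loss saturates at c²/6. A badly estimated keypoint therefore contributes a bounded loss and no gradient.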
C. Intention Recognition Learning
Fig. 2: The network architecture for the pose estimation CNN and the intention recognition network. [Diagram labels: 56 dims; 128 dims; M dims; M dims; Conv 5x5, 32 filters, stride 2, pad 0, 1 channel; Conv 3x3, 32 filters, stride 1, pad 0; Conv 3x3, 64 filters, stride 1, pad 1; Conv 3x3, 64 filters; Conv 12x7, 1024 filters, stride 1; 2048 dims; 1024 dims; FC Reg., N dims.]
Fig. 4: A snapshot showing the variation in appearance of pedestrians in our compiled dataset. The YouTube and self-recorded sequences were captured in the wild, while the Daimler and Hanau sequences were recorded with actors.
In the intention recognition step we aim to classify the intention of pedestrian $q$ based on the feature representation, $Z_{0:t}$, extracted from the input frames $F_{0:t}$ as described in Sec. III-B. For notational simplicity, we denote $Z_{0:t} = \phi(F_{0:t}; \theta)$ to represent the features extracted from pedestrian $q$. Given a sequence of inputs $Z_{0:t}$, we learn a function, $\Phi$, that maps the temporal dynamics of the input to an $M$-dimensional output score vector and gives the distribution over the $M$ possible intention classes. Therefore, for intention recognition we can reformulate Eq. (3) as,
$$\hat{Y}_t = \arg\max_{Y_t \in \mathcal{M}} P(Y_t \mid F_{0:t}) = \arg\max_{Y_t \in \mathcal{M}} \prod_{i=0}^{t} \Phi(F_i), \tag{8}$$
where
$$\Phi(F_i) = \frac{\exp h_{y_i}(Z_i, h(Z_{i-1}, \dots); \Omega)}{\sum_{m=1}^{M} \exp h_m(Z_i, h(Z_{i-1}, \dots); \Omega)}, \tag{9}$$
and $h_{y_i}(Z_i, h(Z_{i-1}, \dots); \Omega)$ represents the accumulated spatio-temporal features of the intention $y_i$ up to frame $i$, and $h(\phi(F_i; \theta), h(\phi(F_{i-1}; \theta), \dots); \Omega)$ represents the accumulated $M$-dimensional output score vector. For modeling the required complex long-term temporal dynamics from the input pose, the function $h_{y_i}(Z_i, h(Z_{i-1}, \dots); \Omega)$ is modeled with a shallow LSTM network parametrized by $\Omega$. The LSTM takes as input the estimated human posture, $Z_i = \phi(F_i; \theta)$, at frame $i$ and together with its hidden internal state models the temporal dynamics of the human pose. The network structure is as defined in Fig. 2 in the box marked Intention LSTM. The second LSTM layer is followed by a fully connected layer which outputs an $M$-dimensional feature vector of scores for all intention classes at each time step.
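How a recurrent layer accumulates pose information over time can be sketched with a single LSTM layer in plain Python. This is a simplified illustration under stated assumptions and not the network of Fig. 2: it uses one layer instead of two, random weights, a hidden size of 8, and a 28-dimensional pose vector (assuming 14 keypoints with an x and a y coordinate each). All function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(pose, hidden, cell, weights):
    """Advance the LSTM by one frame and return the new hidden and cell state."""
    joint = pose + hidden  # concatenate current pose and previous hidden state

    def gate(name, squash):
        rows, biases = weights[name]
        return [squash(sum(w * v for w, v in zip(row, joint)) + b)
                for row, b in zip(rows, biases)]

    i = gate("input", sigmoid)
    f = gate("forget", sigmoid)
    o = gate("output", sigmoid)
    g = gate("candidate", math.tanh)
    # The cell state mixes what is kept from the past with the new candidate.
    new_cell = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, cell, i, g)]
    new_hidden = [ok * math.tanh(ck) for ok, ck in zip(o, new_cell)]
    return new_hidden, new_cell


def random_weights(input_size, hidden_size, rng):
    """Create small random weights for the four LSTM gates (illustrative only)."""
    def init():
        rows = [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]
        return rows, [0.0] * hidden_size
    return {name: init() for name in ("input", "forget", "output", "candidate")}


def run_sequence(poses, weights, hidden_size):
    """Feed a pose sequence frame by frame and collect the hidden state per frame."""
    hidden = [0.0] * hidden_size
    cell = [0.0] * hidden_size
    states = []
    for pose in poses:
        hidden, cell = lstm_step(pose, hidden, cell, weights)
        states.append(hidden)
    return states


POSE_SIZE = 28   # assumption: 14 keypoints, each with an x and a y coordinate
HIDDEN_SIZE = 8  # illustrative value
rng = random.Random(0)
weights = random_weights(POSE_SIZE, HIDDEN_SIZE, rng)
poses = [[rng.uniform(-1.0, 1.0) for _ in range(POSE_SIZE)] for _ in range(16)]
states = run_sequence(poses, weights, HIDDEN_SIZE)
print(len(states), len(states[-1]))
```

Because the hidden state is carried from frame to frame, one output is available at every time step without recomputing features over the whole history. A fully connected layer on top of the hidden state would then produce the per-class scores.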
Before we can start training our network, two obstacles must be addressed, namely the lack of posture annotations and the lack of frame-level intention labels for any of the frames in any of the sequences in the dataset.
We tackle the first issue by utilizing transfer learning. Specifically, the network specified by Eq. (6) is first trained on a standard pose training dataset [18] and then used as an initialization for the Pose CNN in Fig. 2. During end-to-end training, the layers in this portion of the overall model receive a smaller learning rate than the recurrent layers. The convolutional parameters are thus fine-tuned for intention recognition during training, thereby learning more relevant and discriminative features. The effect of this fine-tuning on pose estimation is visible in the bottom row of Fig. 3.
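The idea of giving the pre-trained convolutional layers a smaller learning rate than the recurrent layers can be sketched with a minimal AdaGrad update. The `AdaGrad` class below is an illustrative re-implementation and not the training code of the paper. The rate of 0.001 for the Pose CNN is an assumed example of a smaller rate, and the scalar parameters are hypothetical stand-ins for whole layers.

```python
import math


class AdaGrad:
    """Minimal AdaGrad: each step is scaled by the accumulated squared gradients."""

    def __init__(self, learning_rate, eps=1e-8):
        self.learning_rate = learning_rate
        self.eps = eps
        self.accumulated = {}

    def step(self, params, grads):
        for name, grad in grads.items():
            self.accumulated[name] = self.accumulated.get(name, 0.0) + grad * grad
            scale = math.sqrt(self.accumulated[name]) + self.eps
            params[name] -= self.learning_rate * grad / scale


# Hypothetical scalar parameters standing in for whole layers.
pose_cnn_params = {"conv1": 0.5}
lstm_params = {"lstm1": 0.5}

# 0.01 for the recurrent layers is the value stated in the text; 0.001 for the
# pre-trained Pose CNN is an assumed example of a smaller rate.
pose_cnn_optimizer = AdaGrad(learning_rate=0.001)
lstm_optimizer = AdaGrad(learning_rate=0.01)

# Pretend both parts received the same gradient from the intention loss.
pose_cnn_optimizer.step(pose_cnn_params, {"conv1": 1.0})
lstm_optimizer.step(lstm_params, {"lstm1": 1.0})

print(pose_cnn_params, lstm_params)
```

With identical gradients the pre-trained parameter moves ten times less than the recurrent one, so the pose representation is only gently adapted to the intention task.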
The lack of frame level intention annotations is dealt with
by treating our task as weakly supervised. Only a sequence
label, YT, is provided which reflects the intention at the final
time step, T. We therefore perform back propagation through
time only at the final time step of a sequence. Optimizing
in this manner ensures that we make no assumptions about
when the intention begins to manifest and thereby avoid
any bias that might be introduced due to subjective labeling.
Therefore, intention recognition is formulated to minimize:
$$\hat{\Omega}, \hat{\theta} = \arg\min_{\Omega, \theta} L(Y_T, \Phi(F_T; \Omega, \theta)), \tag{10}$$
where $L$ is the cross-entropy loss between the correct intention class, $Y_T$, encoded by a one-hot label vector, and the probabilities of the intention class at the end of the sequence, $\Phi(F_T; \Omega, \theta)$. For end-to-end training the LSTM network has an initial learning rate set to 0.01 and is optimized with back propagation through time and AdaGrad [20].
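The weakly supervised objective, in which only the prediction at the final time step is compared with the sequence label, can be sketched as follows. The score sequences and the function name `final_step_loss` are illustrative assumptions; the scores stand in for the per-frame network outputs.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def final_step_loss(score_sequence, sequence_label):
    """Cross-entropy between the sequence label and the final-step prediction."""
    final_probs = softmax(score_sequence[-1])
    # With a one-hot label the cross-entropy reduces to the negative
    # log-probability of the labeled class.
    return -math.log(final_probs[sequence_label])


# Hypothetical per-frame scores for two three-frame sequences with two classes.
# They differ in the early frames but agree at the final frame.
scores_a = [[0.0, 0.0], [0.2, 0.1], [0.3, 1.5]]
scores_b = [[5.0, -5.0], [4.0, -4.0], [0.3, 1.5]]

print(final_step_loss(scores_a, 1), final_step_loss(scores_b, 1))
```

Both sequences receive the same loss because no label constrains the early frames directly. In the actual recurrent model the early frames still matter, since they shape the hidden state that produces the final scores and receive gradients through back propagation through time.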
A. Datasets
For evaluating our proposed method we compile a dataset of sequences of pedestrians walking towards the road. It leverages two existing datasets, namely the Daimler [4] and Hanau [21] datasets, both of which contain sequences recorded with instructed actors. Moreover, we extend them with sequences of pedestrians in real world traffic scenarios. The real world sequences contain recordings from the United Kingdom, Canada, the United States of America, Germany, Turkey, China and Pakistan. A snapshot of pedestrians in our datasets can be seen in Fig. 4. The diverse appearance and attitude towards road safety make this a particularly difficult dataset. Additionally, the dataset contains sequences recorded at day and night time as well as in cloudy and sunny conditions. In total we have 466 sequences, with 270 sequences reserved for training, 35 for validation and 161 for testing, taking care to preserve any predefined split of the Daimler and Hanau datasets. The real world sequences comprise 58 sequences downloaded from YouTube and 315 recorded by us. All sequences were resampled at a frame rate of 16 frames per second.
In addition to the intention classes introduced by [4], namely crossing, stopping, starting and turning, we also include the 'walking along' class. This class refers to the case when a pedestrian walks along the road longitudinally. The semantic meaning of each class is illustrated in Fig. 5. We use this dataset for all our evaluations. Reported results are for training and evaluation runs which utilize ground-truth pedestrian bounding boxes as inputs.
B. Baselines
We contrast our approach with several other methods
which do not utilize explicitly explainable features such as
pose. The idea is to highlight how domain knowledge, i.e.
pose encoding intention, allows us to estimate pedestrian
intention more efficiently and robustly.
(a) The orange arrow represents the crossing class, blue the stopping class while yellow illustrates the starting class.
(b) The green arrow represents the walking-along class while purple shows the turning class.
Fig. 5: The figure shows the five pedestrian intention classes where the arrows only illustrate the direction of pedestrian
C3D/SVM: A simple baseline is established by extracting fc6 features using the C3D [22] net and training a linear SVM classifier on top. A sliding window of 6 frames, with a stride of one, is run over each sequence in order to extract spatio-temporal features. Wider observation time windows were also tested, but they negatively affected the detection of fast changing intentions such as a running person coming to a stop. We therefore settled on an observation time window width of 6.
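The sliding window used by this baseline can be sketched in a few lines. The function name `sliding_windows` and the use of integer frame indices as stand-ins for images are illustrative assumptions.

```python
def sliding_windows(frames, width=6, stride=1):
    """Return all windows of `width` consecutive frames, moved by `stride`."""
    last_start = len(frames) - width
    return [frames[start:start + width]
            for start in range(0, last_start + 1, stride)]


# One second of video at 16 frames per second, with indices standing in for images.
frames = list(range(16))
windows = sliding_windows(frames)

print(len(windows))  # number of windows in the sequence
print(windows[0])    # the first window, complete only once six frames exist
```

The first window only becomes available after six frames have been observed, which is the source of the initial output lag of window-based methods. Frames older than the window are never seen by the classifier.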
CNN Slow Fusion (CNN SF) [23]: We train a CNN based
on the SF architecture for end-to-end intention recognition.
The network takes as input multiple frames and outputs the
intention class probabilities. Best results were achieved with
a sliding window of 6 frames.
CNN+LSTM variations: This method is similar to our
approach except a generic CNN based feature extractor, pre-
trained on ImageNet, is used in conjunction with a recurrent
Pose Feature/SVM [11]: This method initially uses a state-of-the-art pose estimator [24] trained on the MS-COCO dataset [25] for estimating the human pose. The estimated pose is then used to calculate 396 features (distances and angles between body keypoints). These features are then used to train an SVM for classifying the probability of different
Probabilistic Hierarchical Trajectory Matching (PHTM) [6]: This method aims to find the best match for a partially observed pedestrian motion track from among a database of previously observed motion snippets. The closest matching snippet is then used as a model for extrapolating the future pedestrian position for the currently observed track. However, this approach requires offline pedestrian trajectory data, which is only available for the Daimler dataset.
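The matching idea can be illustrated with a strongly simplified sketch. It uses a plain nearest-neighbour search with Euclidean distance over raw positions, which is an assumption made for illustration and not the probabilistic hierarchical matching of [6]. The snippet database, the coordinates and the function names are hypothetical.

```python
import math


def track_distance(track_a, track_b):
    """Euclidean distance between two equally long lists of (x, y) positions."""
    return math.sqrt(sum((ax - bx) ** 2 + (ay - by) ** 2
                         for (ax, ay), (bx, by) in zip(track_a, track_b)))


def predict_next_position(observed, database):
    """Extrapolate one step using the database snippet closest to the observed track."""
    length = len(observed)
    best = None  # (distance, snippet, offset) of the closest match so far
    for snippet in database:
        # Stop one short of the end so a future point always follows the match.
        for offset in range(len(snippet) - length):
            distance = track_distance(observed, snippet[offset:offset + length])
            if best is None or distance < best[0]:
                best = (distance, snippet, offset)
    if best is None:
        return None  # no snippet is long enough to match and extrapolate
    _, snippet, offset = best
    last_x, last_y = snippet[offset + length - 1]
    next_x, next_y = snippet[offset + length]
    # Apply the matched snippet's next displacement to the observed track.
    return (observed[-1][0] + next_x - last_x, observed[-1][1] + next_y - last_y)


# Hypothetical database with one walking and one stopping snippet.
walking = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
stopping = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0), (1.5, 0.0)]
database = [walking, stopping]

observed = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0)]  # the pedestrian is slowing down
print(predict_next_position(observed, database))
```

The slowing track matches the stopping snippet, so the extrapolated position stays where the pedestrian currently is. The sketch also shows why such methods depend on a database of previously recorded trajectories.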
Latent-dynamic Conditional Random Field (LDCRF) [3]: This method also takes pedestrian motion dynamics as input. In addition, it utilizes pedestrian head pose as an input, which serves as a proxy for situational awareness.
C. Intention Recognition Results
From an Advanced Driver Assistance Systems (ADAS)
point of view, the two most important classes are a pedestrian
crossing the road or stopping at the kerb. All other classes
form a subset of these two major classes. Our evaluation
is therefore performed at two levels of granularity with
respect to the intention classes: a binary classification prob-
lem, where the goal is to predict whether an approaching
pedestrian plans on stopping or crossing in front of the ego
vehicle and a second more in depth multi-class classification
scenario with all labeled classes being considered.
Binary classification is evaluated on the basis of the
F1-score across three time horizons: one second, half a
second and one frame before event. The single frame before
event case represents a time interval dependent on the frame
rate. The time horizons were chosen keeping in mind realistic
urban driving scenarios as well as the erratic nature of
pedestrian movement.
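The relation between the three time horizons and frame indices can be made explicit with a short sketch. It assumes the 16 frames per second resampling rate of the dataset; the function names and the example event frame are hypothetical.

```python
FRAME_RATE = 16  # frames per second, the resampling rate of the dataset


def horizon_in_frames(seconds, frame_rate=FRAME_RATE):
    """Convert a time horizon in seconds into a whole number of frames."""
    return round(seconds * frame_rate)


def frame_before_event(event_frame, seconds, frame_rate=FRAME_RATE):
    """Index of the frame that lies `seconds` before the event frame."""
    return event_frame - horizon_in_frames(seconds, frame_rate)


event_frame = 40  # hypothetical frame index at which the event occurs
for seconds in (1.0, 0.5, 1.0 / FRAME_RATE):
    print(seconds, horizon_in_frames(seconds), frame_before_event(event_frame, seconds))
```

At this frame rate the horizons correspond to 16 frames, 8 frames and a single frame before the event, so the shortest horizon equals 1/16 of a second.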
Multi-class evaluation analyzes how the probabilities of each intention class vary with respect to Time to Event (TTE) [4]. This metric has previously been employed for measuring the performance of pedestrian intention recognition methods [3], [4], [6], [7].
Binary Class Analysis: Table I shows the F1-score across three time horizons for the stopping and crossing classes. It is clear from the results that our proposed model outperforms the baseline methods across all three time horizons. Of particular significance is our superior intention recognition performance one second before the event. The results table reflects that the spatio-temporal features extracted using the C3D net are informative and help in distinguishing between the two intention classes only as the time of event gets closer. One second before the event the C3D based model is biased towards the crossing intention class. As the time of event approaches and pedestrian appearance begins to differ for both classes, the accuracy of the C3D based method increases. Furthermore, from Table I we can see that the F1-score using C3D/SVM for the crossing class fluctuates as the time of event approaches. This is a direct consequence of operating on a fixed observation time window. When reasoning about intention, frames outside of the observation window are not considered, due to which larger variance is observed in the output probabilities at each time step. Fig. 6 highlights the probability variation with respect to time for a stopping scenario. Initially there are no visible signs of the pedestrian stopping; our model therefore outputs a higher probability for crossing intention. In contrast, C3D/SVM and Slow Fusion both operate on a chunk of frames and initially their output is not available as the required number of frames has not been accumulated. In subsequent frames, our model starts to detect subtle changes of posture like a reduction of step size and leaning back slightly, and hence infers a stopping intention with high probability 0.75 seconds before the event.
F1-score w.r.t. time to event for stopping / crossing
Method                         | 1 second    | 1/2 second  | 1/16 second
Random                         | 0.14 / 0.11 | 0.14 / 0.11 | 0.14 / 0.11
C3D/SVM [22]                   | 0.57 / 0.71 | 0.60 / 0.59 | 0.77 / 0.79
CNN SF [23]                    | 0.33 / 0.40 | 0.47 / 0.52 | 0.64 / 0.69
VGG-M [26]+LSTM                | 0.48 / 0.52 | 0.43 / 0.47 | 0.43 / 0.46
Latent Pose CNN+LSTM (ours)    | 0.71 / 0.72 | 0.72 / 0.73 | 0.87 / 0.85
TABLE I: F1-score for pedestrian intention recognition on stopping and crossing classes. Our method outperforms all other evaluated methods with stopping intention being accurately detected one second before event by a significant
Fig. 6: Variation in stopping probability for a stopping scenario for our method and selected baseline methods. For the other methods an initial time lag, associated with frame accumulation, is also seen before a valid output is available.
We extend our baseline comparisons to include methods which utilize pedestrian motion dynamics. Specifically, we implemented an approach similar to PHTM [6] and a Latent-dynamic Conditional Random Field (LDCRF) based approach [3]. Also included are results from the Pose feature/SVM [11] approach as it performs evaluation on the same Daimler data subset. This evaluation is performed only for the Daimler [4] subset of the data as it contains pedestrian trajectory information. Fig. 7 shows the variation in average stopping probability for all stopping scenarios. Our posture based approach reacts much more quickly when changes indicative of stopping are seen. From the curve for the LDCRF based approach it can be seen that while head pose plays a part in determining intention, the change in head pose occurs much later than the changes in posture associated with stopping. The Pose feature/SVM [11] approach performs well on this particular subset; however, after the time of event the probability of stopping anomalously begins to decrease, whereas for the other methods the probability increases.
[Figure 7 plot: x-axis Time To Event (frames), ticks from -5 to 20; y-axis Prob. of Stopping; legend: Pose+LSTM (ours), PHTM-like [6], Pose feature/SVM [11].]
Fig. 7: The curves show the average change in probability of stopping for all Daimler [4] stopping scenarios. Our method reacts more quickly with stopping probability increasing earlier as TTE decreases. Negative TTE values indicate frames after the event occurrence.
Multi-Class Analysis: The results for multi-class analysis are presented in the form of probability graphs as illustrated in Fig. 8. From the graphs it can be seen that as time to event decreases, the probability of the correct class increases. Of particular interest are the turning and walking cases, which have a large semantic overlap. Both actions initially consist of a person walking longitudinally along the road. At a later point in time, the person in the turning class suddenly turns on to the road. This case is reflected in Fig. 8d, where initially the probability of walking is higher than the probability of turning, but as TTE gets smaller the probability of turning begins to increase quickly. Similar behaviour can be seen for the stopping and starting classes.
Run Time Comparison: Table II shows the average running time for all the considered methods on a CPU as well as a GPU (Nvidia GTX Titan X). Owing to its small size in comparison to the other networks, the slow fusion architecture has the fastest average running time on the CPU. On the GPU our network is the fastest; it is able to function in real time for the pedestrian intention recognition use case and to fit on embedded hardware. The run time does not include the time needed for pedestrian detection.
[Figure 8 plots: five panels, each with x-axis Time to Event, ticks from 0 to 30: (a) Stopping Scenario, (b) Crossing Scenarios, (c) Starting Scenarios, (d) Turning Scenarios, (e) Walking Scenarios.]
Fig. 8: The five graphs above illustrate the variation in average probability for each class for the five different scenarios. As the TTE decreases the probability of the correct class begins to rise.
Method                         | Running time (CPU) | Running time (GPU)
C3D/SVM [22]                   | 1330 ms            | 10 ms
CNN SF [23]                    | 13 ms              | 10 ms
VGG-M [26]+LSTM                | 490 ms             | 8 ms
Latent Pose CNN+LSTM (ours)    | 97 ms              | 6.1 ms
TABLE II: Average running times for each method for intention recognition on a per frame basis on both CPU and GPU. Our network is the fastest on the GPU. These run times do not take into account the time needed for pedestrian detection.
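The running times of Table II can be related to the frame rate of the data with a short calculation. The real-time criterion used here, namely that one frame must be processed before the next one arrives at 16 frames per second, is an assumption made for illustration; the function names are hypothetical.

```python
# Per-frame running times in milliseconds as (CPU, GPU), copied from Table II.
RUNNING_TIME_MS = {
    "C3D/SVM": (1330.0, 10.0),
    "CNN SF": (13.0, 10.0),
    "VGG-M+LSTM": (490.0, 8.0),
    "Latent Pose CNN+LSTM (ours)": (97.0, 6.1),
}

FRAME_RATE = 16                        # frames per second of the resampled sequences
FRAME_BUDGET_MS = 1000.0 / FRAME_RATE  # time between two consecutive frames


def frames_per_second(time_ms):
    """Throughput that corresponds to a given per-frame running time."""
    return 1000.0 / time_ms


def keeps_up(time_ms, budget_ms=FRAME_BUDGET_MS):
    """Assumed real-time criterion: a frame is processed before the next arrives."""
    return time_ms <= budget_ms


for method, (cpu_ms, gpu_ms) in RUNNING_TIME_MS.items():
    print(f"{method}: CPU {frames_per_second(cpu_ms):.1f} fps, "
          f"GPU {frames_per_second(gpu_ms):.1f} fps, "
          f"GPU within budget: {keeps_up(gpu_ms)}")
```

Under this assumed criterion all methods stay within the 62.5 ms frame budget on the GPU, while on the CPU only the slow fusion network does. Pedestrian detection time is excluded, as in the table.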
D. Ablation Analysis
To investigate the benefit of using pose as an intermediate feature for intention recognition, we extend our experiments. We first train the recurrent portion of our network with pose features as predicted by an off-the-shelf state-of-the-art pose estimation CNN [24]. In addition, we also train the recurrent network from a random initialization with the convolutional portion of the network frozen. Together these experiments allow us to gauge the impact of end-to-end training, how the quality of the predicted pose affects our final intention recognition performance, as well as the benefit of allowing the model to abstract the pose feature. For completeness, the full model is trained from a random initialization as well. In this case, given the relatively small amount of data and the large number of trainable parameters, overfitting is a concern.
The results of our experiments are summarized in Table III. Results for the VGG-M+LSTM network variant are also included for comparison. Our proposed network together with the staggered semi-supervised approach to intention recognition training outperforms other variations to the network structure or training methods. The F1-score for the Pose CNN+LSTM trained from a random initialization illustrates that directly training a deep network using the small available dataset leads to strong overfitting. Furthermore, utilizing a pre-trained feature extractor such as VGG-M and fine-tuning it still leads to overfitting. It is important to point out that our network has almost eight times fewer parameters than the VGG variant. The LSTM network trained with the output of a state-of-the-art pose estimation network [24] gives competitive results, but the achieved performance is still lower than our method.
Method                            | F1-score 1 second before event
Pose CNN (random init.)+LSTM      | Undefined
Pose CNN (fixed)+LSTM             | 0.64
Pose estimate [24]+LSTM           | 0.66
VGG-M [26]+LSTM                   | 0.48
Latent Pose CNN+LSTM (ours)       | 0.71
TABLE III: F1-score on the Pedestrian Intention Recognition dataset one second before stopping. The CNN+LSTM trained from random initialization completely overfits on the crossing class and fails to recognize any stopping sequence.
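Why the F1-score of the randomly initialized model is reported as undefined can be seen from a small sketch of the metric. The labels and the function name `f1_score` are illustrative; returning `None` for the undefined case is a design choice of this sketch and not a convention taken from the paper.

```python
def f1_score(true_labels, predicted_labels, positive):
    """F1-score for one class, or None when precision or recall is undefined."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fp == 0 or tp + fn == 0:
        return None  # the class is never predicted, or never occurs
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


truth = ["stop", "stop", "cross", "cross"]

# A model that has collapsed onto the crossing class never predicts "stop".
collapsed = ["cross", "cross", "cross", "cross"]
print(f1_score(truth, collapsed, positive="stop"))

# A model that recognizes one of the two stopping sequences.
partial = ["stop", "cross", "cross", "cross"]
print(f1_score(truth, partial, positive="stop"))
```

When no sequence is ever classified as stopping, the precision for the stopping class is a division of zero by zero, so the F1-score has no value. This matches a model that overfits completely to the crossing class.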
In this paper we propose an efficient method for recognizing intention from pose dynamics. We learn to infer pedestrian intention using weakly supervised learning, which bypasses two difficulties of labeling a dataset: (i) the subjective nature of labeling intention sequences and (ii) the cost associated with labeling a sufficiently large dataset. Furthermore, we learn the locomotion dynamics of intention from pose information using transfer learning. Pose offers a compact feature representation which is interpretable by humans and also allows us to have a small model that runs on embedded hardware. Our proposed approach outperforms current state-of-the-art intention recognition systems in terms of accuracy and is able to accurately infer pedestrian intention one second before a pedestrian reaches the road, a capability which is invaluable for urban autonomous driving.
