
Learning to Forecast Pedestrian Intention from Pose Dynamics

June 2018


DOI:10.1109/IVS.2018.8500657

Conference: 2018 IEEE Intelligent Vehicles Symposium (IV)

Authors:

Omair Ghori

Radek Mackowiak

Bosch

Miguel Angel Bautista

Universität Heidelberg


For an autonomous car, the ability to foresee a human's action is very useful for mitigating the risk of a possible collision. To humans this pedestrian intention foresight comes naturally as they are able to recognize another person’s actions just by perceiving subtle changes in posture. Approximating this intention inference ability by directly training a deep neural network is useful but especially challenging. First, sufficiently large datasets for intention recognition with frame-wise human pose and intention annotations are rare and expensive to compile. Second, training on smaller datasets can lead to overfitting and make it difficult to adapt to intra-class variations in action executions. Therefore, in this paper, we propose a real-time framework that learns (i) intention recognition using weak supervision and (ii) locomotion dynamics of intention from pose information using transfer learning. This new formulation is able to tackle the lack of frame-wise annotations and to learn intra-class variation in action executions. We empirically demonstrate that our proposed approach leads to earlier and more stable detection of intention than other state-of-the-art approaches, with real-time operation and the ability to detect intention one second before the pedestrian reaches the kerb.

The sequence above illustrates an inner city traffic scene where a person (orange bounding box) approaches the road and eventually stops at the kerb. Our proposed method is able to correctly infer intention one second before the pedestrian reaches the kerb, which for a car traveling at 30kph would translate to a braking distance of eight meters. The enlarged bounding boxes focus on the pedestrians without the context of the environment.


A snapshot showing the variation in appearance of pedestrians in our compiled dataset. Youtube and Self recorded were captured in the wild while Daimler and Hanau are with actors.


The curves show the average change in probability of stopping for all Daimler [4] stopping scenarios. Our method reacts more quickly with stopping probability increasing earlier as TTE decreases. Negative TTE values indicate frames after the event occurrence. feature/SVM [11] approach as it performs evaluation on the same Daimler data subset. This evaluation is performed only for the Daimler [4] subset of the data as it contains pedestrian trajectory information. Fig. 7 shows the variation in average stopping probability for all stopping scenarios. Our posture based approach reacts much more quickly when changes indicative of stopping are seen. From the curve for the LDCRF based approach it can be seen that while head pose plays a part in determining intention, change in head pose occurs much later than changes in posture associated with stopping. The Pose feature/SVM [11] approach performs well on this particular subset, however after time of event the probability of stopping anomalously begins to decrease whereas for the other methods the probability increases. Multi-Class Analysis: The results for multi-class analysis are presented in the form of probability graphs as illustrated in Fig. 8. From the graphs it can be seen that as time to event decreases the probability of the correct class increases. Of particular interest are the turning and walking case which have a large semantic overlap. Both actions initially consist of a person walking longitudinally along the road. At a later point in time the person for the turning class suddenly turns on to the road. This case is reflected in Fig. 8d where the initially the probability of walking is higher than the probability of turning but as TTE gets smaller the probability of turning begins to increase quickly. Similar behaviour can be seen for the stopping and starting class. 
Run Time Comparison: Table II shows the results for average running time for all the considered methods on a CPU as well as a GPU (Nvidia GTX Titan X). Owing to its tiny size in comparison to the other networks, the slow fusion architecture has the fastest average running time on the CPU. On the GPU our network is the fastest and is able to function in real time for the pedestrian intention recognition use case and fit on embedded hardware. The run time does not include the time needed for pedestrian detection.


The five graphs above illustrate the variation in average probability for each class for the five different scenarios. As the TTE decreases the probability of the correct class begins to rise.


Learning to Forecast Pedestrian Intention from Pose Dynamics

Omair Ghori1,2, Radek Mackowiak1, Miguel Bautista2, Niklas Beuter1, Lucas Drumond1, Ferran Diego1 and Björn Ommer2


I. INTRODUCTION

For modern intelligent systems, enabling a physical autonomous agent to correctly preempt human actions and react appropriately is imperative. In an inner city traffic scenario an autonomous car should be able to foresee a pedestrian’s future action and proactively take measures to avoid a potentially dangerous situation. Early intention recognition is therefore key for safe and comfortable driving.

Fig. 1 illustrates an urban traffic scenario where a woman (orange bounding box) is approaching the road. Given a history of visual observations, the task of interest is to predict whether the woman intends to cross the road or stop at the kerb. Based on past experience, a human observer can anticipate that she is likely to stop at the kerb. This decision is grounded on subtle changes in the observed person [1]. Even if scene context is not visible, as illustrated in the enlarged bounding boxes of Fig. 1, an observer can still infer that she is likely to stop based solely on the subtle changes in posture [2].

Initial works framed intention recognition as a path prediction problem [3], [4], [5], [6], where the focus is to predict the short term path for a pedestrian of interest. Such an approach avoided the difficulties associated with annotating an intention recognition dataset. Annotating each frame in a sequence with intention labels is a challenging, expensive and highly subjective task [1]. As the enlarged bounding boxes of Fig. 1 illustrate, based on a single frame devoid of context, both crossing and stopping intentions look likely. However, when viewed as a sequence of images, the true intention becomes clearer. Thus while only a sequential chunk of frames can be annotated for intention, defining the temporal extent of this chunk of frames is a very subjective task. A few earlier methods utilize pedestrian head pose [7], [3] as an additional feature for path prediction. These methods are, however, evaluated on staged datasets with actors, where the impact of head pose as a feature for intention prediction cannot be truly gauged. In the wild, head pose only implies that a pedestrian is aware of an oncoming car.

Fig. 1: The sequence above illustrates an inner city traffic scene where a person (orange bounding box) approaches the road and eventually stops at the kerb. Our proposed method is able to correctly infer intention one second before the pedestrian reaches the kerb, which for a car traveling at 30 kph would translate to a braking distance of eight meters. The enlarged bounding boxes focus on the pedestrians without the context of the environment.

1 The authors are affiliated with Robert Bosch GmbH, Hildesheim, Germany (email: Omair.Ghori, Radek.Mackowiak, Niklas.Beuter, Lucas.RegoDrumond, Ferran.DiegoAndilla@de.bosch.com)

2 The authors are affiliated with Heidelberg University, Heidelberg, Germany (email: miguel.bautista@iwr.uni-heidelberg.de, ommer@uni-heidelberg.de)
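The Fig. 1 caption relates one second of early warning at 30 kph to roughly eight meters. The short check below reproduces that figure under the assumption that it refers to the distance covered at constant speed during that second; the function name is illustrative and not from the paper.

```python
def distance_travelled_m(speed_kph: float, seconds: float) -> float:
    """Distance covered at constant speed, in metres."""
    speed_mps = speed_kph * 1000.0 / 3600.0  # convert km/h to m/s
    return speed_mps * seconds


# One second of warning at 30 kph.
warning_distance = distance_travelled_m(30.0, 1.0)
print(round(warning_distance, 2))  # 8.33
```

The result of about 8.3 m is consistent with the "eight meters" quoted in the caption.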

In this paper we present a framework for learning pedestrian intention in a weakly supervised manner. Instead of utilizing raw pixel data for inferring intention, we propose to use a high-level structure from a given RGB input frame and subsequently track its temporal dynamics for pedestrian intention recognition. Framing our problem as weakly supervised allows us to overcome the limitations imposed by lack of precise temporal annotations.

Based on the domain knowledge of human posture encoding information about intention, we use human pose as our high-level feature. Although intention recognition using the human skeleton has been studied before [8], [9], [10], these works require detailed 3D pose information for recognition. In contrast we investigate intention recognition in 2D monocular image sequences where groundtruth pose annotations are not available.

The main contributions of this paper are:

• To overcome the lack of frame-wise annotations we pose our problem as weakly supervised learning; just a single label per sequence is needed for inferring real-time intention recognition per frame.

• We propose human pose as our high-level feature in order to have a compact feature representation that is interpretable by humans, to learn its temporal dynamics, and to have an efficient model for running on embedded hardware.

• In order to enable accurate intention recognition even when using a relatively small dataset for training, we use transfer learning to learn a latent pose representation.

II. RELATED WORK

With the increased focus on achieving fully autonomous driving, pedestrian intention recognition as a means of avoiding possible collisions with pedestrians is very important and has therefore received attention from the research community. Recently Fang et al. [11] utilized pedestrian pose as a feature for intention recognition. In their work they first use a CNN for pedestrian pose estimation. Based on the pose keypoints they compute a total of 396 features based on distances and angles between pairs of keypoints. These features are then used to train a binary classifier for intention recognition. In contrast our approach is completely end-to-end learned with no hand crafted features. Furthermore we utilize the full 14 keypoint skeleton instead of the 9 keypoints used by Fang et al. Most prior works related to pedestrian intention recognition have tended to focus on path prediction [7], [10], [3], [4], [5], [6], [12] instead of explicit intention recognition. Schneider et al. [4] investigated various Kalman filter based models for pedestrian path prediction in a one second future time window. The proposed approach leveraged features extracted from pedestrian motion dynamics only. Kooij et al. [7] extended this initial work by incorporating additional features which approximate the pedestrian's behaviour as well as the environmental context. Two non-linear methods for estimating crossing intention of a laterally approaching pedestrian were introduced by Keller et al. [6]. Probabilistic Hierarchical Trajectory Matching (PHTM) tries to find the best match for a partially observed pedestrian motion track among a database of motion snippets. The closest matching pedestrian trajectory is then used as a model for approximating future pedestrian position. For longer term pedestrian path prediction, Rehder et al. [13] frame the problem as one of planning and treat the pedestrian's destination as a latent variable. Using inverse reinforcement learning, [14] investigates trajectory prediction for multiple people. However, this works only for a fixed surveillance camera setup while we work with a moving camera.

Gaussian Process Dynamical Models (GPDM) have also been demonstrated to be effective using dense optical flow features and pedestrian motion dynamics as input features [6] and body pose as an input feature [10]. Head pose has also been used as a complementary feature for intention recognition [3], [15]. However, both these approaches are only ever evaluated on a dataset of staged scenarios. Hariyono et al. [16] investigated intention recognition by developing a model for a walking human. Unlike our work however, they do not explicitly estimate human pose but instead rely on statistics extracted from a pedestrian bounding box.

III. PROPOSED APPROACH

A. Problem Formulation

Having a sequence of images from time $0$ to $T$, $F = \{F_t : t = 0, \dots, T\}$, we are interested in recognizing the intention, $y_t$, of an observed pedestrian at time $t$ before an action occurs at time $T$, where $t < T$. Due to the ambiguity of detecting intention from a single frame, we aim to infer the intention class with the maximum probability given a sequence of frames prior to occurrence of action. Thus estimating the intention up to frame $F_t$ is formulated as a maximum a posteriori Bayesian inference problem,

$$ y_t^{MAP} = \arg\max_{y_t \in \{1, \dots, M\}} P(y_t \mid F_{0:t}), \tag{1} $$

where $M$ is the number of predefined intention categories, and $P(y_t \mid F_{0:t})$ is the probability of intention class $y_t$ given all the frames up to frame $t$. Instead of explicitly computing the probabilities, we take the softmax over the model outputs, which gives us the distribution over the $M$ possible classes as:

$$ P(y_t \mid F_{0:t}) \approx \frac{\exp h_{y_t}(F_{0:t})}{\sum_{m=1}^{M} \exp h_m(F_{0:t})}, \tag{2} $$

where $h_{y_t}(F_{0:t})$ are the features extracted from frames $0$ to $t$ for intention $y_t$. Due to the recurrent nature of the problem, we are able to estimate the intention class for each frame up to $t$,

$$ Y_t^{MAP} = \arg\max_{Y_t \in \mathcal{M}} P(Y_t \mid F_{0:t}), \tag{3} $$

where $\mathcal{M}$ is the set of all possible intention annotations, $Y_t = [y_0, \dots, y_t]$. The probability of an intention class given a sequence of frames, $P(Y_t \mid F_{0:t})$, can be approximated as

$$ P(Y_t \mid F_{0:t}) \approx \prod_{j=0}^{t} P(y_j \mid F_{0:j}), \tag{4} $$

$$ \approx P(y_0 \mid F_0) \prod_{j=1}^{t} P(y_j \mid y_{j-1}, F_j), \tag{5} $$

where $P(y_j \mid y_{j-1}, F_j)$ is the probability of the intention class at frame $j$ given the current frame $F_j$ and the previous intention class, $y_{j-1}$.

The former approximation, Eq. (4), estimates each intention class independently, requiring the computation of all spatio-temporal features $h_{y_t}(F_{0:t})$ at each timestep before estimating the softmax in Eq. (2). In contrast, the latter approximation, Eq. (5), infers the intention class based on the current frame and the previous intention, allowing the extracted features, $h_{y_t}(F_{0:t})$, to be calculated recurrently, independent of the total number of time steps. Despite reducing the computational cost of the feature extractor, Eq. (5) is not able to learn long-term dynamics of the intention. Therefore, following the idea of computing the feature extractor recurrently, we decouple $P(Y_t \mid F_{0:t})$ into two components, a visual and a temporal feature extractor. The former captures spatial frame-wise characteristics of the intention, whereas the temporal feature extractor captures long-term dynamics of the intention given the visual features. Hence, given a sequence of frames before the action actually begins, we learn over the temporal dynamics of features indicative of the person's intention. As the classifier starts to see those features occurring during test time, the probability of the correct class will begin to increase.
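Eqs. (1), (2) and (4) can be illustrated with a small sketch. It assumes the model already provides one raw score per class at every frame; the score values below are hypothetical, and classes are indexed from 0 rather than from 1 as in the text.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities, as in Eq. (2)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def map_class(scores):
    """Eq. (1): index and probability of the most probable class."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda m: probs[m])
    return best, probs[best]


def sequence_log_prob(per_frame_scores, labels):
    """Log of Eq. (4): sum over frames of log P(y_j | F_0:j)."""
    return sum(math.log(softmax(scores)[y])
               for scores, y in zip(per_frame_scores, labels))


# Hypothetical scores for M = 3 classes at three consecutive frames.
scores_over_time = [[0.2, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 2.5, 0.3]]
for t, scores in enumerate(scores_over_time):
    print(t, map_class(scores))
```

Working with the logarithm of the product in Eq. (4) avoids numerical underflow for long sequences; this is a common implementation choice and is not stated in the paper.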

B. Visual Feature Extraction

As a first step towards intention recognition a feature representation of the visual contents of frame, $F_t$, needs to be extracted. Normally, both intentions and their associated actions can be well represented by generic visual descriptors [17]. These feature representations work for action classes which differ greatly in execution. In contrast, we focus on human pose as a compact visual feature descriptor. The proposed feature encodes information about a person's intention [2], and helps in identifying subtle differences in motor movements.

The challenge then is to estimate the pose of a pedestrian $q$ in a given frame $F_t$. In order to do this, first an off-the-shelf pedestrian detector is used to obtain a bounding box image, $I^q_t$, of pedestrian $q$. Given this bounding box image as the input of a feature transformation we estimate coordinates of a 14 keypoint human skeleton as defined in [18]. To this end we learn a function, $\phi$, parametrized by parameters $\theta$, to regress the real valued vector output, $\hat{Z}^q$, corresponding to the pose of pedestrian $q$, from the grayscale input image $I^q_t$:

$$ \hat{Z}^q = \phi(I^q_t; \theta). \tag{6} $$

Specifically we train a Convolutional Neural Network (CNN), represented by the transformation $\phi$, for estimating the pose from the obtained bounding box image. The CNN architecture is similar to the one proposed by Belagiannis et al. [19]. Fig. 2 summarizes the architecture we utilize for pose estimation in the box marked Pose CNN. The top row of Fig. 3 illustrates the pose estimation output of the network on frames of a sequence in our dataset.

Fig. 3: The top row shows the posture as estimated by our pose CNN after training on a standard pose dataset. The bottom row shows the refinement in pose after end-to-end intention recognition learning. The network is never explicitly trained for pose on the intention dataset.

For parameter learning, Tukey’s biweight loss function [19] is minimized. This loss function is defined as:

$$ \rho(r_i) = \begin{cases} \dfrac{c^2}{6}\left[1 - \left(1 - \left(\dfrac{r_i}{c}\right)^2\right)^3\right], & \text{if } |r_i| \le c \\[2ex] \dfrac{c^2}{6}, & \text{otherwise,} \end{cases} \tag{7} $$

where $r_i$ is a component of the residual, defined as $r = Z - \hat{Z}$, and $c$ is a tuning constant. For regression tasks, Tukey’s biweight loss function is more robust to the influence of outliers than the more commonly used $L_2$ loss. We set an initial learning rate of 0.01 and use AdaGrad [20] for optimizing the network.
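The robustness claim for Eq. (7) can be checked numerically with a minimal sketch. The tuning constant `c = 4.6851` is a commonly used default for Tukey's biweight and is an assumption here, since the value of $c$ is not given in this section; the $L_2$ loss is written as $r^2/2$ so that both losses agree for small residuals.

```python
def tukey_biweight(r: float, c: float = 4.6851) -> float:
    """Tukey's biweight loss for a single residual r, following Eq. (7)."""
    if abs(r) <= c:
        return (c * c / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return c * c / 6.0  # constant for outliers: their gradient is zero


def l2_loss(r: float) -> float:
    """Squared-error loss, scaled to match Tukey's loss near r = 0."""
    return 0.5 * r * r


# Small residuals are penalised almost identically by both losses,
# while a large outlier saturates under Tukey's biweight.
for residual in (0.1, 1.0, 4.0, 50.0):
    print(residual, tukey_biweight(residual), l2_loss(residual))
```

Because the loss is flat beyond $|r| = c$, a grossly wrong keypoint contributes no gradient, whereas under $L_2$ its contribution grows quadratically.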

C. Intention Recognition Learning

In the intention recognition step we aim to classify the intention of pedestrian $q$ based on the feature representation, $Z_{0:t}$, extracted from the input frames $F_{0:t}$ as described in Sec. III-B. For notational simplicity, we denote $Z_{0:t} = \phi(F_{0:t}; \theta)$ to represent the features extracted from pedestrian $q$. Given a sequence of inputs $Z_{0:t}$, we learn a function, $\Phi$, that maps the temporal dynamics of the input to an $M$-dimensional output score vector and gives the distribution over the $M$ possible intention classes. Therefore, for intention recognition we can reformulate Eq. (3) as,

$$ Y_t^{MAP} = \arg\max_{Y_t \in \mathcal{M}} P(Y_t \mid F_{0:t}) = \arg\max_{Y_t \in \mathcal{M}} \prod_{i=0}^{t} \Phi(F_i), \tag{8} $$

where

$$ \Phi(F_i) = \frac{\exp h_{y_i}(Z_i, h(Z_{i-1}, \dots); \Omega)}{\sum_{m=1}^{M} \exp h_m(Z_i, h(Z_{i-1}, \dots); \Omega)}, \tag{9} $$

and $h_{y_i}(Z_i, h(Z_{i-1}, \dots); \Omega)$ represents the accumulated spatio-temporal features of the intention $y_i$ up to frame $i$, and $h(\phi(F_i; \theta), h(\phi(F_{i-1}; \theta), \dots); \Omega)$ represents the accumulated $M$-dimensional output score vector. For modeling the required complex long-term temporal dynamics from the input pose, the function $h_{y_i}(Z_i, h(Z_{i-1}, \dots); \Omega)$ is modeled with a shallow LSTM network parametrized by $\Omega$. The LSTM takes as input the estimated human posture, $Z_i = \phi(F_i; \theta)$, at frame $i$ and together with its hidden internal state models the temporal dynamics of the human pose. The network structure is as defined in Fig. 2 in the box marked Intention LSTM. The second LSTM layer is followed by a fully connected layer which outputs an $M$-dimensional feature vector of scores for all intention classes at each time step.

Fig. 2: The network architecture for the pose estimation CNN and the intention recognition network. [Diagram with blocks labeled "Intention Recognition Model", "Pose CNN" and "Intention LSTM". Labels recovered from the Pose CNN block: 1-channel 120x80 input; Conv 5x5 (32 filters, stride 2), Conv 3x3 (32 filters), two Conv 3x3 (64 filters) and Conv 12x7 (1024 filters) with ReLU, LRN, max-pooling and dropout 0.5; layers of 2048 and 1024 dimensions; fully connected regression output of N dimensions. Labels recovered from the Intention LSTM block: LSTM 56 dims, dropout 0.5, LSTM 128 dims, softmax output of M dimensions.]

Fig. 4: A snapshot showing the variation in appearance of pedestrians in our compiled dataset. YouTube and Self recorded were captured in the wild while Daimler and Hanau are with actors.

Before we can start training our network two obstacles must be addressed, namely the lack of posture annotations and frame level intention labels for any of the frames in any of the sequences in the dataset.

We tackle the first issue by utilizing transfer learning. Specifically, the network specified by Eq. (6) is first trained on a standard pose training dataset [18] and then used as an initialization for the Pose CNN in Fig. 2. During end-to-end training the layers in this portion of the overall model receive a smaller learning rate than the recurrent layers, thus the convolutional parameters are fine tuned for intention recognition during training, thereby learning more relevant and discriminative features. The effect of this fine tuning on pose estimation is visible in the bottom row of Fig. 3.

The lack of frame level intention annotations is dealt with by treating our task as weakly supervised. Only a sequence label, $Y_T$, is provided which reflects the intention at the final time step, $T$. We therefore perform back propagation through time only at the final time step of a sequence. Optimizing in this manner ensures that we make no assumptions about when the intention begins to manifest and thereby avoid any bias that might be introduced due to subjective labeling. Therefore, intention recognition is formulated to minimize:

$$ \hat{\Omega}, \hat{\theta} = \arg\min_{\Omega, \theta} \mathcal{L}(Y_T, \Phi(F_T; \Omega, \theta)), \tag{10} $$

where $\mathcal{L}$ is the cross-entropy loss between the correct intention class, $Y_T$, encoded by a one-hot label vector, and the probabilities of the intention class at the end of the sequence, $\Phi(F_T; \Omega, \theta)$. For end-to-end training the LSTM network has an initial learning rate set to 0.01 and is optimized with back propagation through time and AdaGrad [20].
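The weakly supervised objective of Eq. (10) can be sketched in a few lines. The example assumes the per-frame class scores are already given as plain lists (hypothetical values); in the actual model they would come from the recurrent network, so earlier frames still influence the final scores through the hidden state, which this sketch does not model.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def final_step_loss(score_sequence, sequence_label):
    """Cross-entropy between the one-hot sequence label and the class
    probabilities at the last frame only, as in Eq. (10)."""
    final_probs = softmax(score_sequence[-1])
    return -math.log(final_probs[sequence_label])


# Two hypothetical score sequences that differ only in the early frames.
early_a = [[0.0, 0.0], [0.0, 0.0], [0.5, 2.0]]
early_b = [[3.0, -1.0], [1.0, 1.0], [0.5, 2.0]]
print(final_step_loss(early_a, 1) == final_step_loss(early_b, 1))  # True
```

No loss term is attached to the earlier frames, so the objective itself does not encode any assumption about when the intention first becomes visible.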

IV. EXPERIMENTS

A. Datasets

For evaluating our proposed method we compile a dataset of sequences of pedestrians walking towards the road. It leverages two existing datasets, namely the Daimler [4] and Hanau [21] datasets, both of which contain sequences recorded with instructed actors. Moreover, we extend them with sequences of pedestrians in real world traffic scenarios. The real world sequences contain recordings from the United Kingdom, Canada, the United States of America, Germany, Turkey, China and Pakistan. A snapshot of pedestrians in our datasets can be seen in Fig. 4. The diverse appearance and attitude towards road safety make this a particularly difficult dataset. Additionally, the dataset contains sequences recorded at day and night time as well as in cloudy and sunny conditions. In total we have 466 sequences, with 270 sequences reserved for training, 35 for validation and 161 for testing, taking care to preserve any predefined split of the Daimler and Hanau datasets. The real world sequences contain 58 sequences downloaded from YouTube and 315 recorded by us. All sequences were resampled at a frame rate of 16 frames per second.

In addition to the intention classes introduced by [4], namely crossing, stopping, starting and turning, we also include the 'walking along' class. This class refers to the case when a pedestrian walks along the road longitudinally. The semantic meaning of each class is illustrated in Fig. 5. We use this dataset for all our evaluations. Reported results are for trainings and evaluations which utilize groundtruth pedestrian bounding boxes as inputs.

B. Baselines

We contrast our approach with several other methods

which do not utilize explicitly explainable features such as

pose. The idea is to highlight how domain knowledge, i.e.

pose encoding intention, allows us to estimate pedestrian

intention more efﬁciently and robustly.

C3D/SVM: A simple baseline is established by extracting fc6 features using the C3D [22] net and training a linear SVM classifier on top. A sliding window of 6 frames, with a stride of one, is run over each sequence in order to extract spatio-temporal features. Wider observation time windows were also tested but they negatively affected the detection of fast changing intention such as a running person coming to a stop. We therefore settled on an observation time window width of 6 frames.

Fig. 5: The figure shows the five pedestrian intention classes where the arrows only illustrate the direction of pedestrian movement. (a) The orange arrow represents the crossing class, blue the stopping class while yellow illustrates the starting class. (b) The green arrow represents the walking-along class while purple shows the turning class.
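The fixed observation window used by the C3D/SVM baseline can be sketched as follows. The frames are represented by integer indices for illustration; the function name and the generic `width` and `stride` parameters are assumptions, with the defaults set to the values quoted in the text.

```python
def sliding_windows(frames, width=6, stride=1):
    """Yield consecutive chunks of `width` frames, advancing by `stride`."""
    for start in range(0, len(frames) - width + 1, stride):
        yield frames[start:start + width]


frames = list(range(8))  # stand-in for 8 video frames
for window in sliding_windows(frames):
    print(window)
# [0, 1, 2, 3, 4, 5]
# [1, 2, 3, 4, 5, 6]
# [2, 3, 4, 5, 6, 7]
```

No window, and therefore no prediction, is available until 6 frames have been accumulated, which at the dataset's 16 frames per second corresponds to 6/16 = 0.375 s.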

CNN Slow Fusion (CNN SF) [23]: We train a CNN based on the SF architecture for end-to-end intention recognition. The network takes as input multiple frames and outputs the intention class probabilities. Best results were achieved with a sliding window of 6 frames.

CNN+LSTM variations: This method is similar to our approach except a generic CNN based feature extractor, pre-trained on ImageNet, is used in conjunction with a recurrent network.

Pose Feature/SVM [11]: This method initially uses a state-of-the-art pose estimator [24] trained on the MS-COCO dataset [25] for estimating the human pose. The estimated pose is then used to calculate 396 features (distances and angles between body keypoints). These features are then used to train an SVM for classifying the probability of different intentions.

Probabilistic Hierarchical Trajectory Matching (PHTM) [6]: This method aims to find the best match for a partially observed pedestrian motion track from among a database of previously observed motion snippets. The closest matching snippet is then used as a model for extrapolating future pedestrian position for the currently observed track. However, this approach requires offline pedestrian trajectory data, which is only available for the Daimler dataset.
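The matching idea behind this baseline can be illustrated with a deliberately simplified sketch. It is a plain nearest-neighbour search over frame-to-frame displacements, not the probabilistic hierarchical matching of [6]; the track format (lists of 2D positions), the function names and the example trajectories are all assumptions made for illustration.

```python
def _displacements(track):
    """Frame-to-frame position changes of a track of (x, y) points."""
    return [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(track, track[1:])]


def closest_snippet(observed, database):
    """Return the stored snippet whose initial motion best matches `observed`."""
    obs = _displacements(observed)

    def cost(snippet):
        ref = _displacements(snippet)[:len(obs)]
        return sum((a - c) ** 2 + (b - d) ** 2
                   for (a, b), (c, d) in zip(obs, ref))

    # Only snippets longer than the observation can be used to extrapolate.
    candidates = [s for s in database if len(s) > len(observed)]
    return min(candidates, key=cost)


def extrapolate(observed, snippet, steps=1):
    """Continue `observed` using the matched snippet's subsequent motion."""
    first = len(observed) - 1
    future = _displacements(snippet)[first:first + steps]
    x, y = observed[-1]
    path = []
    for dx, dy in future:
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


# Hypothetical database: one constant-speed snippet, one decelerating snippet.
walking = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
stopping = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0), (1.7, 0.0), (1.7, 0.0)]
database = [walking, stopping]

observed = [(10.0, 5.0), (11.0, 5.0), (11.5, 5.0)]  # pedestrian slowing down
match = closest_snippet(observed, database)
print(match is stopping, extrapolate(observed, match, steps=2))
```

Matching on displacements rather than absolute positions makes the comparison independent of where the pedestrian is in the scene, which is a design choice of this sketch.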

Latent-dynamic Conditional Random Field (LDCRF) [3]: This method also takes pedestrian motion dynamics as input. In addition it utilizes pedestrian head pose as an input, which serves as a proxy for situational awareness.

C. Intention Recognition Results

From an Advanced Driver Assistance Systems (ADAS) point of view, the two most important classes are a pedestrian crossing the road or stopping at the kerb. All other classes form a subset of these two major classes. Our evaluation is therefore performed at two levels of granularity with respect to the intention classes: a binary classification problem, where the goal is to predict whether an approaching pedestrian plans on stopping or crossing in front of the ego vehicle, and a second, more in-depth multi-class classification scenario with all labeled classes being considered.

Binary classification is evaluated on the basis of the F1-score across three time horizons: one second, half a second and one frame before event. The single frame before event case represents a time interval dependent on the frame rate. The time horizons were chosen keeping in mind realistic urban driving scenarios as well as the erratic nature of pedestrian movement.
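The evaluation protocol above can be sketched as a per-class F1 computation together with the conversion from time horizons to frame counts. The label strings and the example predictions are hypothetical; the frame rate of 16 frames per second is the one stated for the dataset.

```python
def f1_score(true_labels, predicted_labels, positive):
    """F1-score of one class, treating `positive` as the class of interest."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def horizon_in_frames(seconds, fps=16):
    """Number of frames before the event for a given time horizon."""
    return round(seconds * fps)


# Hypothetical predictions for five sequences at one time horizon.
truth = ["stop", "stop", "cross", "cross", "stop"]
preds = ["stop", "cross", "cross", "stop", "stop"]
print(f1_score(truth, preds, "stop"), f1_score(truth, preds, "cross"))
print([horizon_in_frames(s) for s in (1.0, 0.5, 1 / 16)])  # [16, 8, 1]
```

At 16 frames per second the three horizons correspond to 16, 8 and 1 frames before the event, which is why the single-frame horizon equals 1/16 second for this dataset.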

Multi-class evaluation analyzes how the probabilities of each intention class vary with respect to Time to Event (TTE) [4]. This metric has previously been employed for measuring performance of pedestrian intention recognition methods [3], [4], [6], [7].
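A TTE curve of this kind can be computed by aligning every sequence on its event frame and averaging the class probability at each offset. The sketch below assumes TTE is measured in frames, with negative values after the event; the input format and the probability values are hypothetical.

```python
from collections import defaultdict


def mean_probability_by_tte(sequences):
    """Average class probability per Time to Event.

    `sequences` is a list of (probs, event_frame) pairs, where probs[i] is
    the probability of the class of interest at frame i and event_frame is
    the index of the frame at which the event occurs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for probs, event_frame in sequences:
        for i, p in enumerate(probs):
            tte = event_frame - i  # positive before the event, 0 at the event
            sums[tte] += p
            counts[tte] += 1
    return {tte: sums[tte] / counts[tte] for tte in sorted(sums, reverse=True)}


# Two hypothetical stopping sequences of different lengths.
sequences = [([0.2, 0.4, 0.8], 2), ([0.4, 0.6], 1)]
curve = mean_probability_by_tte(sequences)
print(list(curve))  # [2, 1, 0]
```

Aligning on the event frame lets sequences of different lengths contribute to the same curve, with fewer sequences contributing at large TTE values.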

Binary Class Analysis: Table I shows the F1-score across three time horizons for the stopping and crossing classes. It is clear from the results that our proposed model outperforms the baseline methods across all three time horizons. Of particular significance is our superior intention recognition performance one second before the event. The results table reflects that the spatio-temporal features extracted using the C3D net are informative and help in distinguishing between the two intention classes only as the time of event gets closer. One second before the event the C3D based model is biased towards the crossing intention class. As the time of event approaches and pedestrian appearance begins to differ for both classes the accuracy of the C3D based method increases. Furthermore from Table I we can see that the F1-score using C3D/SVM for the crossing class fluctuates as time of event approaches. This is a direct consequence of operating on a fixed observation time window. When reasoning about intention, frames outside of the observation window are not considered, due to which larger variance is observed in output probabilities at each time step. Fig. 6 highlights the probability variation with respect to time for a stopping scenario. Initially there are no visible signs of the pedestrian stopping; our model therefore outputs a higher probability for crossing intention. In contrast, C3D/SVM and Slow Fusion both operate on a chunk of frames and initially their output is not available as the required number of frames has not been accumulated. In subsequent frames, our model starts to detect subtle changes of posture like a reduction of step size and leaning back slightly, and hence infers a stopping intention with high probability 0.75 seconds before event.

TABLE I: F1-score for pedestrian intention recognition on stopping and crossing classes. Our method outperforms all other evaluated methods with stopping intention being accurately detected one second before event by a significant margin.

F1-score with respect to Time to Event, given as stopping / crossing

Method                        | 1 second    | 1/2 second  | 1/16 second
Random                        | 0.14 / 0.11 | 0.14 / 0.11 | 0.14 / 0.11
C3D/SVM [22]                  | 0.57 / 0.71 | 0.60 / 0.59 | 0.77 / 0.79
CNN SF [23]                   | 0.33 / 0.40 | 0.47 / 0.52 | 0.64 / 0.69
VGG-M [26]+LSTM               | 0.48 / 0.52 | 0.43 / 0.47 | 0.43 / 0.46
Latent Pose CNN+LSTM (ours)   | 0.71 / 0.72 | 0.72 / 0.73 | 0.87 / 0.85

Fig. 6: Variation in stopping probability for a stopping scenario for our method and selected baseline methods. For the other methods an initial time lag, associated with frame accumulation, is also seen before a valid output is available.

We extend our baseline comparisons to include methods which utilize pedestrian motion dynamics. Specifically, we implemented an approach similar to PHTM [6] and a Latent-dynamic Conditional Random Field (LDCRF) based approach [3]. Also included are results from the Pose feature/SVM [11] approach, as it is evaluated on the same Daimler data subset. This evaluation is performed only for the Daimler [4] subset of the data, as it contains pedestrian trajectory information. Fig. 7 shows the variation in average stopping probability for all stopping scenarios. Our posture based approach reacts much more quickly when changes indicative of stopping are seen. From the curve for the LDCRF based approach it can be seen that, while head pose plays a part in determining intention, the change in head pose occurs much later than the changes in posture associated with stopping. The Pose feature/SVM [11] approach performs well on this particular subset; however, after the time of event its probability of stopping anomalously begins to decrease, whereas for the other methods the probability increases.

[Fig. 7 plot: probability of stopping (axis ticks 0.1 to 0.9) against time to event in frames (axis ticks -5, 0, 5, 10, 15, 20). Curves: Pose+LSTM (ours), PHTM-like [6], LDCRF [3], Pose feature/SVM [11].]

Fig. 7: The curves show the average change in probability of stopping for all Daimler [4] stopping scenarios. Our method reacts more quickly, with stopping probability increasing earlier as TTE decreases. Negative TTE values indicate frames after the event occurrence.
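An averaged curve of this kind can be obtained by aligning every sequence on its time to event (TTE) and averaging the stopping probability per TTE value. The following is a minimal sketch of that alignment; the function name `average_by_tte`, the data layout and the toy numbers are assumptions made for illustration and are not taken from the evaluation code or results.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One sequence is a list of (tte_in_frames, stopping_probability) pairs.
Sequence = List[Tuple[int, float]]


def average_by_tte(sequences: List[Sequence]) -> Dict[int, float]:
    """Average the stopping probability over all sequences at each TTE."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for seq in sequences:
        for tte, prob in seq:
            buckets[tte].append(prob)
    # Order from large TTE (early) down to negative TTE (after the event).
    ordered = sorted(buckets.items(), reverse=True)
    return {tte: sum(probs) / len(probs) for tte, probs in ordered}


# Made-up toy values, not results from the paper.
toy = [
    [(2, 0.2), (1, 0.5), (0, 0.8), (-1, 0.9)],
    [(1, 0.3), (0, 0.6), (-1, 0.7)],
]
curve = average_by_tte(toy)
```

Sequences of different length contribute only at the TTE values they cover, so early TTE values may be averaged over fewer sequences than later ones.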

Multi-Class Analysis: The results for multi-class analysis are presented in the form of probability graphs, as illustrated in Fig. 8. From the graphs it can be seen that as time to event decreases, the probability of the correct class increases. Of particular interest are the turning and walking cases, which have a large semantic overlap. Both actions initially consist of a person walking longitudinally along the road. At a later point in time, the person in the turning class suddenly turns on to the road. This case is reflected in Fig. 8d, where initially the probability of walking is higher than the probability of turning, but as TTE gets smaller the probability of turning begins to increase quickly. Similar behaviour can be seen for the stopping and starting classes.
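The behaviour described for the turning case, where the walking probability initially dominates and the turning probability takes over as TTE shrinks, can be located programmatically. The sketch below is one hypothetical way to do so; `crossover_tte` and the toy probabilities are illustrative assumptions and do not reproduce the values plotted in Fig. 8d.

```python
from typing import List, Optional, Tuple


def crossover_tte(samples: List[Tuple[int, float, float]]) -> Optional[int]:
    """Return the largest TTE from which the turning probability stays above
    the walking probability until the end of the sequence, or None.

    `samples` holds (tte, p_walking, p_turning), ordered by decreasing TTE.
    """
    crossover: Optional[int] = None
    for tte, p_walk, p_turn in samples:
        if p_turn > p_walk:
            if crossover is None:
                crossover = tte
        else:
            crossover = None  # the turning lead was lost, so start over
    return crossover


# Made-up toy values, not results from the paper.
toy_turning = [(30, 0.6, 0.2), (20, 0.5, 0.3), (10, 0.35, 0.45), (0, 0.1, 0.8)]
switch_point = crossover_tte(toy_turning)
```

For the toy sequence the turning class overtakes walking at a TTE of 10 frames and keeps its lead until the event.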

Run Time Comparison: Table II shows the average running time of all the considered methods on a CPU as well as a GPU (Nvidia GTX Titan X). Owing to its tiny size in comparison to the other networks, the slow fusion architecture has the fastest average running time on the CPU. On the GPU our network is the fastest, and it is able to function in real time for the pedestrian intention recognition use case and fit on embedded hardware. The run time does not include the time needed for pedestrian detection.

[Fig. 8 panels, each plotting probability against time to event (axis ticks 0, 10, 20, 30): (a) Stopping Scenario, (b) Crossing Scenarios, (c) panel title not recovered, (d) Turning Scenarios, (e) Walking Scenarios.]

Fig. 8: The five graphs above illustrate the variation in average probability for each class for the five different scenarios. As the TTE decreases the probability of the correct class begins to rise.

Method | Running time, CPU | Running time, GPU
C3D/SVM [22] | 1330 ms | 10 ms
CNN SF [23] | 13 ms | 10 ms
VGG-M [26]+LSTM | 490 ms | 8 ms
Latent Pose CNN+LSTM (ours) | 97 ms | 6.1 ms

TABLE II: Average per-frame running times of each method for intention recognition on both CPU and GPU. As can be clearly seen, our network is the fastest on the GPU. These run times do not take into account the time needed for pedestrian detection.
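To relate the per-frame times in Table II to a frame rate, the conversion is simply 1000 divided by the time in milliseconds. The sketch below applies it to the table values; the dictionary layout and helper names are assumptions for illustration, and the resulting rates cover intention recognition only, since pedestrian detection time is excluded from the table.

```python
# Per-frame running times in milliseconds, copied from Table II.
RUNTIME_MS = {
    "C3D/SVM": {"cpu": 1330.0, "gpu": 10.0},
    "CNN SF": {"cpu": 13.0, "gpu": 10.0},
    "VGG-M+LSTM": {"cpu": 490.0, "gpu": 8.0},
    "Latent Pose CNN+LSTM (ours)": {"cpu": 97.0, "gpu": 6.1},
}


def frames_per_second(ms_per_frame: float) -> float:
    """Convert a per-frame processing time into a frame rate."""
    return 1000.0 / ms_per_frame


def fastest(device: str) -> str:
    """Name of the method with the smallest per-frame time on `device`."""
    return min(RUNTIME_MS, key=lambda name: RUNTIME_MS[name][device])


gpu_fps = {name: frames_per_second(t["gpu"]) for name, t in RUNTIME_MS.items()}
```

With these numbers, 6.1 ms per frame corresponds to roughly 164 frames per second on the GPU, and `fastest("cpu")` selects the slow fusion network, in line with the discussion above.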

D. Ablation Analysis

To investigate the benefit of using pose as an intermediate feature for intention recognition, we extend our experiments. We first train the recurrent portion of our network with pose features as predicted by an off-the-shelf, state-of-the-art pose estimation CNN [24]. In addition, we also train the recurrent network from a random initialization with the convolutional portion of the network being frozen. Together these experiments allow us to gauge the impact of end-to-end training, how the quality of the predicted pose affects our final intention recognition performance, and the benefit of allowing the model to abstract the pose feature. For completeness, the full model is trained from a random initialization as well. In this case, given the relatively small amount of data and the large number of trainable parameters, overfitting is a concern.

The results of our experiments are summarized in Table III. Results for the VGG-M+LSTM network variant are also included for comparison. Our proposed network, together with the staggered semi-supervised approach to intention recognition training, outperforms other variations to the network structure or training methods. The F1-score for the Pose CNN+LSTM trained from a random initialization illustrates that directly training a deep network using the small available dataset leads to strong overfitting. Furthermore, utilizing a pre-trained feature extractor such as VGG-M and fine-tuning it still leads to overfitting. It is important to point out that our network has almost eight times fewer parameters than the VGG variant. The LSTM network trained with the output of a state-of-the-art pose estimation network [24] gives competitive results, but the achieved performance is still lower than that of our method.

Method | F1-score 1 second before event
Pose CNN (random init.)+LSTM | Undefined
Pose CNN (fixed)+LSTM | 0.64
Pose estimate [24]+LSTM | 0.66
VGG-M [26]+LSTM | 0.48
Latent Pose CNN+LSTM (ours) | 0.71

TABLE III: F1-score on the Pedestrian Intention Recognition dataset one second before stopping. The CNN+LSTM trained from random initialization completely overfits on the crossing class and fails to recognize any stopping sequence.
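The "Undefined" entry in Table III follows from the definition of the F1-score: if a model never predicts the stopping class, there are no positive predictions and precision has a zero denominator. The sketch below illustrates this with made-up labels. Returning `None` for the undefined case is a convention chosen here for illustration (some toolkits report 0 instead), and it is not a statement about how the scores in the paper were computed.

```python
from typing import List, Optional


def f1_score(y_true: List[int], y_pred: List[int], positive: int = 1) -> Optional[float]:
    """F1-score of the `positive` class, or None when precision or recall
    is undefined (no positive predictions or no positive ground truth)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fp == 0 or tp + fn == 0:
        return None
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up toy labels: 1 = stopping, 0 = crossing.
truth = [1, 1, 0, 0, 0, 0]
collapsed = [0, 0, 0, 0, 0, 0]  # model that always predicts crossing
mixed = [1, 0, 0, 0, 1, 0]      # one hit, one miss, one false alarm

undefined_case = f1_score(truth, collapsed)
mixed_case = f1_score(truth, mixed)
```

In the toy example the collapsed predictor yields an undefined score for the stopping class, while the mixed predictor has precision and recall of 0.5 each and therefore an F1-score of 0.5.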

V. CONCLUSION AND FUTURE WORK

In this paper we propose an efficient intention recognition method from pose dynamics. We learn to specifically infer pedestrian intention using weakly supervised learning that bypasses the difficulties of labeling a dataset: (i) the subjective nature of labeling intention sequences and (ii) the cost associated with labeling a sufficiently large dataset. Furthermore, we learn the locomotion dynamics of intention from pose information using transfer learning. Pose offers a compact feature representation which is interpretable by humans and also allows us to have a small model that runs on embedded hardware. Our proposed approach outperforms current state-of-the-art intention recognition systems in terms of accuracy and is able to accurately infer pedestrian intention one second before a pedestrian reaches the road, a capability which is invaluable for urban autonomous driving.
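What one second of anticipation means in distance can be checked with simple arithmetic. The sketch below assumes a constant vehicle speed over the anticipation interval and uses the 30 kph example speed quoted earlier in the paper; it is not a braking model, and `distance_travelled_m` is a hypothetical helper.

```python
def distance_travelled_m(speed_kph: float, seconds: float) -> float:
    """Distance in meters covered at constant speed during `seconds`."""
    meters_per_second = speed_kph / 3.6  # 1 m/s equals 3.6 km/h
    return meters_per_second * seconds


one_second_at_30 = distance_travelled_m(30.0, 1.0)
three_quarters_at_30 = distance_travelled_m(30.0, 0.75)
```

At 30 kph the vehicle covers about 8.3 m in one second and 6.25 m in 0.75 seconds, which is the margin gained by recognizing the intention that much earlier.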

REFERENCES

[1] S. Schmidt and B. Färber, “Pedestrians at the kerb recognising the action intentions of humans,” Transportation Research Part F: Traffic Psychology and Behaviour, vol. 12, no. 4, pp. 300–310, 2009.

[2] J. M. Kilner, “More than one pathway to action understanding,” Trends in Cognitive Sciences, vol. 15, no. 8, p. 352, 2011.

[3] A. Schulz and R. Stiefelhagen, “Pedestrian intention recognition using latent-dynamic conditional random fields,” in Intelligent Vehicles Symposium (IV), 2015 IEEE, 2015, pp. 622–627.

[4] N. Schneider and D. Gavrila, “Pedestrian path prediction with recursive bayesian filters: A comparative study,” in Pattern Recognition, 2013, vol. 8142, pp. 174–183.

[5] S. Kohler, M. Goldhammer, S. Bauer, K. Doll, U. Brunsmann, and K. Dietmayer, “Early detection of the pedestrian’s intention to cross the street,” in Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on, Sept 2012, pp. 1759–1764.

[6] C. Keller and D. Gavrila, “Will the pedestrian cross? a study on pedestrian path prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 2, pp. 494–506, 2014.

[7] J. Kooij, N. Schneider, F. Flohr, and D. Gavrila, “Context-based pedestrian path prediction,” in Computer Vision ECCV 2014. Springer International Publishing, 2014, vol. 8694, pp. 618–633.

[8] Y. Li, C. Lan, J. Xing, W. Zeng, C. Yuan, and J. Liu, Online Human Action Detection Using Joint Classification-Regression Recurrent Neural Networks, 2016, pp. 203–220.

[9] M. Zanfir, M. Leordeanu, and C. Sminchisescu, “The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection,” in 2013 IEEE International Conference on Computer Vision, Dec 2013, pp. 2752–2759.

[10] R. Quintero, I. Parra, D. F. Llorca, and M. A. Sotelo, “Pedestrian intention and pose prediction through dynamical models and behaviour classification,” in IEEE 18th International Conference on Intelligent Transportation Systems (ITSC), 2015.

[11] Z. Fang, D. Vázquez, and A. M. López, “On-board detection of pedestrian intentions,” Sensors, vol. 17, no. 10, p. 2193, 2017.

[12] H. Kataoka, Y. Aoki, Y. Satoh, S. Oikawa, and Y. Matsui, “Fine-grained walking activity recognition via driving recorder dataset,” in IEEE 18th International Conference on Intelligent Transportation Systems (ITSC), 2015.

[13] E. Rehder and H. Kloeden, “Goal-directed pedestrian prediction,” in 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), Dec 2015, pp. 139–147.

[14] W.-C. Ma, D.-A. Huang, N. Lee, and K. M. Kitani, “Forecasting interactive dynamics of pedestrians with fictitious play,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[15] F. Flohr, M. Dumitru-Guzu, J. Kooij, and D. Gavrila, “Joint probabilistic pedestrian head and body orientation estimation,” in Intelligent Vehicles Symposium Proceedings, 2014 IEEE, June 2014, pp. 617–622.

[16] J. Hariyono and K.-H. Jo, “Detection of pedestrian crossing road: A study on pedestrian pose recognition,” Neurocomputing, vol. 234, pp. 144–153, 2017.

[17] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[18] D. Hall and P. Perona, “Fine-grained classification of pedestrians in video: Benchmark and state of the art,” in CVPR, 2015.

[19] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, “Robust optimization for deep regression,” in International Conference on Computer Vision (ICCV), 2015.

[20] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” J. Mach. Learn. Res., 2011.

[21] D. Kondermann, R. Nair, S. Meister, W. Mischler, B. Güssefeld, S. Hofmann, C. Brenner, and B. Jähne, “Stereo ground truth with error bars,” in Asian Conference on Computer Vision, ACCV, 2014.

[22] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in The IEEE International Conference on Computer Vision (ICCV), 2015.

[23] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014.

[24] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, 2017.

[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 740–755.

[26] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in British Machine Vision Conference, 2014.
