Fig. 2 - uploaded by Miguel Romero
For each time-stamp t, all the next positions from t until t + H are predicted.


Source publication
Preprint
Full-text available
Head motion prediction is an important problem for 360° videos, in particular to inform streaming decisions. Various methods tackling this problem with deep neural networks have been proposed recently. In this article we first show the startling result that all such existing methods, which attempt to benefit both from the history of pas...

Contexts in source publication

Context 1
... section reviews the existing methods relevant to the problem we consider. We start by formulating the exact problem: it consists, at each video playback time t, in predicting the user's future head positions between t and t + H, as illustrated in Fig. 1 and represented in Fig. 2, with the only knowledge of this user's past positions and the (entire) video content. We therefore do not consider methods aiming to predict the entire user trajectory from the start, based on the content and on the starting point, as, e.g., targeted by the challenge in [16], or methods summarizing a 360° video into 2D [17], [18]. As well, and ...
Context 2
... the settings of the works we compare with, T_start is set to 0 sec. for all the curves generated in Sec. 3. In order to skip the exploration phase, as explained in Sec. 5.4, and to be more favorable to all methods, as they are not able to consider the non-stationarity of the motion process, we set T_start = 6 sec. from Sec. 5 onward. We now refer to Fig. 2. Let H be the prediction horizon. We define the terms prediction step s and video time-stamp t such that: at every time-stamp t ∈ [T_start, T], we run predictions P̂_{t+s} for all prediction steps s ∈ [0, H]. We formulate the problem of trajectory prediction as finding the best model F*_H ...
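The sliding-window formulation above can be sketched as follows. The helper names and the trivial-static baseline (which simply repeats the last known position) are illustrative assumptions, not the paper's code:

```python
import numpy as np

def predict_trajectory(positions, t, H, model):
    """At time-stamp t, predict the positions for every prediction
    step s in [0, H], given only the history up to t (hypothetical
    helper, not the paper's API)."""
    history = positions[:t + 1]      # positions known up to time t
    return model(history, H)         # H + 1 predicted positions

def static_baseline(history, H):
    """Trivial-static predictor: repeat the last observed position."""
    return np.repeat(history[-1][None, :], H + 1, axis=0)

# toy 2D head trajectory, one sample per time-stamp
traj = np.cumsum(np.ones((10, 2)) * 0.1, axis=0)
preds = predict_trajectory(traj, t=4, H=3, model=static_baseline)
print(preds.shape)  # (4, 2): one prediction per step s = 0, 1, 2, 3
```

In the paper's terms, running this at every t ∈ [T_start, T] yields the family of predictions P̂_{t+s} that the models are evaluated on.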
Context 3
... depicted in Fig. 4, a seq2seq framework consists of an encoder and a decoder. The encoder receives the historic window input (samples from t − M to t − 1 shown in Fig. 2) and generates an internal representation. The decoder receives the output of the encoder and progressively produces predictions over the target horizon, by re-injecting the previous prediction as input for the new prediction time-steps. This is a strong baseline (not only a trivial-static or linear predictor) processing the head ...
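The autoregressive decode loop described above, where the decoder re-injects its previous prediction as the next input, can be sketched as follows. The toy mean-motion encoder and linear decoder stand in for the RNNs; all names are illustrative assumptions:

```python
import numpy as np

def seq2seq_predict(history, H, encode, decode_step):
    """Encoder/decoder loop: encode the historic window, then decode
    step by step, feeding each prediction back in as the next input
    (function names are illustrative, not the paper's API)."""
    state = encode(history)              # internal representation
    prev = history[-1]                   # last observed position
    preds = []
    for _ in range(H):
        prev, state = decode_step(prev, state)
        preds.append(prev)               # re-injected at next step
    return np.stack(preds)

# toy components: "state" is the average motion, decoding extrapolates it
def encode(history):
    return np.diff(history, axis=0).mean(axis=0)

def decode_step(prev, state):
    return prev + state, state

history = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]])
out = seq2seq_predict(history, H=2, encode=encode, decode_step=decode_step)
print(out)  # approximately [[0.3, 0.0], [0.4, 0.0]]
```

Replacing `encode` and `decode_step` with recurrent cells recovers the seq2seq baseline's structure; the re-injection loop is what makes it stronger than a one-shot predictor.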
Context 4
... Exploratory, we rely on the users' behavior. Indeed, the more the users tend to have a focusing behavior, the lower the entropy of the GT saliency map. Thus we consider the entropy of the GT saliency map of each video to assign the video to one category or the other. We sort the videos of the test set in increasing entropy, and we represent in Fig. 20 the results averaged over the bottom 10% (focus-type videos) and top 10% (exploratory ...
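The entropy-based split described above can be sketched as follows; the Shannon-entropy helper, the function names, and the toy saliency maps are our assumptions:

```python
import numpy as np

def saliency_entropy(sal_map):
    """Shannon entropy of a ground-truth saliency map, treated as a
    probability distribution over pixels (illustrative helper)."""
    p = sal_map / sal_map.sum()
    p = p[p > 0]                         # 0 * log(0) contributes nothing
    return -(p * np.log2(p)).sum()

def split_focus_exploratory(videos, frac=0.10):
    """Sort videos by saliency entropy: the bottom `frac` are
    focus-type, the top `frac` are exploratory."""
    ranked = sorted(videos, key=lambda v: saliency_entropy(videos[v]))
    k = max(1, int(len(ranked) * frac))
    return ranked[:k], ranked[-k:]

# toy maps: a single peak (focusing) vs. a uniform map (exploring)
focus = np.zeros((8, 8)); focus[4, 4] = 1.0
uniform = np.ones((8, 8))
videos = {"focus_video": focus, "exploratory_video": uniform}
print(split_focus_exploratory(videos))
```

A peaked map has near-zero entropy while a uniform map reaches the maximum (log2 of the pixel count), which is why low entropy signals focus-type content.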
Context 5
... confirm the analysis that led us to introduce this new architecture TRACK for dynamic head motion prediction, we perform an ablation study of the additional elements we brought compared to CVPR18-improved: we either replace the RNN processing the CB-saliency with two FC layers (line named AblatSal in Fig. 22), or replace the fusion RNN with two FC layers (line named AblatFuse). Figs. 18 and 22 confirm the analysis in Sec. 7.1: the removal of the first extra RNN (not present in CVPR18) processing the saliency input has more impact: AblatSal degrades away from the deep-position-only baseline in the first time-steps. The degradation is not as ...

Similar publications

Article
Full-text available
Predicting human motion based on past observed motion is one of the challenging issues in computer vision and graphics. Existing research works deal with this issue using discriminative models, showing results for cases that follow a homogeneous distribution (in-distribution) without discussing the issues of the domain shift proble...