Fig. 2: For each time-stamp t, all the next positions from t until t + H are predicted.

Source publication
Article
Full-text available
We consider predicting the user's head motion in 360° videos, with only 2 modalities: the user's past position and the video content (not knowing other users' traces). We make two main contributions. First, we re-examine existing deep-learning approaches for this problem and identify hidden flaws through a thorough root-cause analysis. Second, from the...

Contexts in source publication

Context 1
... section reviews the existing methods relevant for the problem we consider. We start by formulating the exact problem: it consists, at each video playback time t, in predicting the future user's head positions between t and t + H, as illustrated in Fig. 1 and represented in Fig. 2, with the only knowledge of this user's past positions and the (entire) video content. We therefore do not consider methods aiming to predict the entire user trajectory from the start based on the content and on the starting point as, e.g., targeted by the challenge in [16] or summarizing a 360° video into 2D [17], [18]. As well, and ...
Context 2
... the settings of the works we compare with, T_start is set to 0 sec. for all the curves generated in Sec. 3. In order to skip the exploration phase, as explained in Sec. 5.4, and be more favorable to all methods as they are not able to consider non-stationarity of the motion process, we set T_start = 6 sec. from Sec. 5 onward. We now refer to Fig. 2. Let H be the prediction horizon. We define the terms prediction step s and video time-stamp t such that: at every time-stamp t ∈ [T_start, T], we run predictions P̂_{t+s} for all prediction steps s ∈ [0, H]. We formulate the problem of trajectory prediction as finding the best model F*_H ...
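To make the protocol concrete, a minimal sketch of this sliding evaluation loop is given below; the predictor interface and array shapes are assumptions for illustration, not the paper's exact code.

```python
# Sketch of the prediction protocol: at every time-stamp t in [T_start, T], a model
# produces predictions P̂_{t+s} for all prediction steps s in [0, H], using only the
# user's past positions (content-aware models would also receive the video frames).
import numpy as np

def run_prediction_protocol(positions, predictor, t_start, horizon):
    """positions: (T+1, dim) ground-truth head positions; predictor(past, horizon)
    is an assumed callable returning an array of shape (horizon + 1, dim)."""
    T = positions.shape[0] - 1
    predictions = {}
    for t in range(t_start, T + 1):
        past = positions[: t + 1]                  # everything observed up to time t
        predictions[t] = predictor(past, horizon)  # P̂_{t+s} for s = 0 .. H
    return predictions
```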
Context 3
... depicted in Fig. 4, a seq2seq framework consists of an encoder and a decoder. The encoder receives the historic window input (samples from t − M to t − 1 shown in Fig. 2) and generates an internal representation. The decoder receives the output of the encoder and progressively produces predictions over the target horizon, by re-injecting the previous prediction as input for the new prediction time-steps. This is a strong baseline (not only a trivial-static or linear predictor) processing the head ...
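A hedged sketch of such an encoder-decoder baseline is shown below; the layer sizes, the use of LSTMs, and the class name are illustrative assumptions rather than the exact architecture evaluated in the paper.

```python
# Minimal seq2seq position-only baseline: the encoder summarizes the historic window
# (samples t-M .. t-1) and the decoder autoregressively re-injects its own previous
# prediction to produce the next H positions.
import torch
import torch.nn as nn

class Seq2SeqPositionPredictor(nn.Module):
    def __init__(self, pos_dim=3, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(pos_dim, hidden_dim, batch_first=True)
        self.decoder_cell = nn.LSTMCell(pos_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, pos_dim)

    def forward(self, past_positions, horizon):
        # past_positions: (batch, M, pos_dim) historic window
        _, (h, c) = self.encoder(past_positions)   # internal representation of the past
        h, c = h.squeeze(0), c.squeeze(0)
        prev = past_positions[:, -1, :]            # last observed position
        preds = []
        for _ in range(horizon):
            h, c = self.decoder_cell(prev, (h, c))
            prev = self.out(h)                     # re-inject prediction as next input
            preds.append(prev)
        return torch.stack(preds, dim=1)           # (batch, horizon, pos_dim)

# Example: predict 25 future samples from a 25-sample history for a batch of 8 users.
# model = Seq2SeqPositionPredictor(); future = model(torch.randn(8, 25, 3), horizon=25)
```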
Context 4
... Exploratory, we rely on the users' behavior. Indeed, the more the users tend to have a focusing behavior, the lower the entropy of the GT saliency map. Thus we consider the entropy of the GT saliency map of each video to assign the video to one category or the other. We sort the videos of the test set in increasing entropy, and we represent in Fig. 20 the results averaged over the bottom 10% (focus-type videos) and top 10% (exploratory ...
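A small sketch of this entropy-based split is given below; the input format (one aggregated ground-truth saliency map per video) and the function names are assumptions for illustration.

```python
# Rank videos by the Shannon entropy of their ground-truth (GT) saliency maps and keep
# the bottom/top 10% as focus-type and exploratory videos, respectively.
import numpy as np

def saliency_entropy(sal_map, eps=1e-12):
    p = sal_map / (sal_map.sum() + eps)        # normalize the map to a distribution
    return float(-np.sum(p * np.log2(p + eps)))

def split_focus_exploratory(gt_saliency, fraction=0.10):
    # gt_saliency: dict {video_id: 2D numpy array}, e.g. averaged over frames and users
    ranked = sorted(gt_saliency, key=lambda vid: saliency_entropy(gt_saliency[vid]))
    k = max(1, round(fraction * len(ranked)))
    return ranked[:k], ranked[-k:]             # (focus-type videos, exploratory videos)
```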
Context 5
... confirm the analysis that led us to introduce this new architecture TRACK for dynamic head motion prediction, we perform an ablation study of the additional elements we brought compared to CVPR18-improved: we either replace the RNN processing the CB-saliency with two FC layers (line named AblatSal in Fig. 22), or replace the fusion RNN with two FC layers (line named AblatFuse). Figs. 18 and 22 confirm the analysis in Sec. 7.1: the removal of the first extra RNN (not present in CVPR18) processing the saliency input has more impact: AblatSal degrades away from the deep-position-only baseline in the first time-steps. The degradation is not as ...

Citations

... Another line of research addresses the prediction of the egocentric scanpath under the free-viewing condition [33,34,35,36,37,38]. Xu et al. [33] train their model on individual observer fixation trajectories while freely viewing 360° videos on a VR headset. ...
Preprint
Full-text available
Previous research on scanpath prediction has mainly focused on group models, disregarding the fact that the scanpaths and attentional behaviors of individuals are diverse. The disregard of these differences is especially detrimental to social human-robot interaction, whereby robots commonly emulate human gaze based on heuristics or predefined patterns. However, human gaze patterns are heterogeneous and varying behaviors can significantly affect the outcomes of such human-robot interactions. To fill this gap, we developed a deep learning-based social cue integration model for saliency prediction to instead predict scanpaths in videos. Our model learned scanpaths by recursively integrating fixation history and social cues through a gating mechanism and sequential attention. We evaluated our approach on gaze datasets of dynamic social scenes, observed under the free-viewing condition. The introduction of fixation history into our models makes it possible to train a single unified model rather than the resource-intensive approach of training individual models for each set of scanpaths. We observed that the late neural integration approach surpasses early fusion when training models on a large dataset, in comparison to a smaller dataset with a similar distribution. Results also indicate that a single unified model, trained on all the observers' scanpaths, performs on par or better than individually trained models. We hypothesize that this outcome is a result of the group saliency representations instilling universal attention in the model, while the supervisory signal and fixation history guide it to learn personalized attentional behaviors, providing the unified model a benefit over individual models due to its implicit representation of universal attention.
... In the existing works, VR video streaming research almost always divides the VR video into tiles, and the system mainly transmits the content in the FoV [4]–[17]. There are also many articles that study caching algorithms which cache popular content on the edge server to reduce the remote request delay [4]–[12]. ...
... There are also many articles that study caching algorithms which cache popular content on the edge server to reduce the remote request delay [4]–[12]. In the FoV prediction literature, researchers prefer to use machine learning to design predictors [13]–[17]. VR video tile transmission and FoV prediction can be achieved without an edge server, but the caching method generally requires an edge server, on which the content is cached. ...
... In [19], the authors develop a predictive scheduling policy for proactive scheduling and deadline scheduling at the access point (AP), and optimize the scheduling decision. In [13], the authors use machine learning to predict the user's head movement. In addition to caching and prediction research, there are also articles that study the balanced allocation of resources to improve the user's quality of experience (QoE). ...
Article
Full-text available
The transmission optimization of VR video streaming can improve the quality of user experience, which includes content prediction optimization and caching strategy optimization. Existing work focuses either on content prediction or on caching strategy. However, in the end-edge-cloud system, prediction and caching should be considered together. In this paper, we jointly optimize the four stages of prediction, caching, computing and transmission in a mobile edge caching system, aiming to maximize the user's quality of experience. In terms of caching strategy, we design a caching algorithm, VIE, that handles unknown future request content and can efficiently improve the content hit rate, as well as the durations for prediction, computing and transmission. The VIE caching algorithm is shown to outperform other algorithms in terms of delay. We optimize the four stages under arbitrary resource allocation and obtain the optimal results. Finally, in a realistic scenario, the proposed caching algorithm is verified by comparison with several other caching algorithms; simulation results show that the user's QoE is improved under the proposed caching algorithm.
... Various approaches have been developed to identify important 2D views, with most falling into two categories. Firstly, many works [4,11,23,26,30,33,38,40,45] rely on manual viewer control to select views. In this approach, the viewer manually adjusts their viewing direction during playback, receiving the corresponding 2D view. ...
... Pros & Cons. Many existing works [4,11,23,25,26,28,30,33,38,40,45] consider MANUAL to be satisfactory for viewers. This is because manual control allows viewers to obtain views that align with their own preferences and are therefore important to them. ...
... Consequently, viewers' viewing preferences exhibit a dynamic degree of convergence that varies across video frames. It is noteworthy that some works [1,13,23,30–32,35] have examined viewers' behaviors and patterns during 360° video viewing, but they solely investigate MANUAL mode. These studies observe dynamic convergence in the viewports of viewers in MANUAL, where viewers have a limited field of view. ...
Preprint
Full-text available
360-degree video has become increasingly popular in content consumption. However, finding the viewing direction for important content within each frame poses a significant challenge. Existing approaches rely on either viewer input or algorithmic determination to select the viewing direction, but neither mode consistently outperforms the other in terms of content-importance. In this paper, we propose 360TripleView, the first view management system for 360-degree video that automatically infers and utilizes the better view mode for each frame, ultimately providing viewers with higher content-importance views. Through extensive experiments and a user study, we demonstrate that 360TripleView achieves over 90\% accuracy in inferring the better mode and significantly enhances content-importance compared to existing methods.
... In the past ten years, many scanpath prediction methods in 360° videos have been proposed, differing mainly in three aspects: 1) the input formats and modalities, 2) the computational prediction mechanisms, and 3) the loss functions. For the input formats and modalities, Rondón et al. [11] revealed that the user's past scanpath alone suffices to inform the prediction for time horizons shorter than two to three seconds. Nevertheless, the majority of existing methods take 360° video frames as an "indispensable" form of visual input for improved scanpath prediction. ...
... Among numerous 360° video representations, the equirectangular projection (ERP) format is the most widely adopted, which however exhibits noticeable geometric deformations, especially for objects at high latitudes. For the computational prediction mechanisms, existing methods are inclined to rely on external algorithms for saliency detection [11], [12], [13] or optical flow estimation [12], [13] for visual feature analysis, whose performance is inevitably upper-bounded by these external methods, which are often trained on planar rather than 360° videos. After multimodal feature extraction and aggregation, a sequence-to-sequence (seq2seq) predictor, implemented by an unfolded recurrent neural network (RNN) or a transformer, is adopted to gather historical information. ...
... For the loss functions in guiding the optimization, some form of "ground-truth" scanpaths is commonly specified to gauge the prediction accuracy. A convenient choice is the mean squared error (MSE) [11], [13], [14] or its spherical derivative [15], which assumes the underlying probability distribution to be unimodal Gaussian. Such "imitation learning" is weak at capturing the scanpath uncertainty of an individual user and the scanpath diversity of different users. ...
Preprint
Full-text available
Predicting human scanpaths when exploring panoramic videos is a challenging task due to the spherical geometry and the multimodality of the input, and the inherent uncertainty and diversity of the output. Most previous methods fail to give a complete treatment of these characteristics, and thus are prone to errors. In this paper, we present a simple new criterion for scanpath prediction based on principles from lossy data compression. This criterion suggests minimizing the expected code length of quantized scanpaths in a training set, which corresponds to fitting a discrete conditional probability model via maximum likelihood. Specifically, the probability model is conditioned on two modalities: a viewport sequence as the deformation-reduced visual input and a set of relative historical scanpaths projected onto respective viewports as the aligned path input. The probability model is parameterized by a product of discretized Gaussian mixture models to capture the uncertainty and the diversity of scanpaths from different users. Most importantly, the training of the probability model does not rely on the specification of "ground-truth" scanpaths for imitation learning. We also introduce a proportional-integral-derivative (PID) controller-based sampler to generate realistic human-like scanpaths from the learned probability model. Experimental results demonstrate that our method consistently produces better quantitative scanpath results in terms of prediction accuracy (by comparing to the assumed "ground-truths") and perceptual realism (through machine discrimination) over a wide range of prediction horizons. We additionally verify the perceptual realism improvement via a formal psychophysical experiment and the generalization improvement on several unseen panoramic video datasets.
... 3) Deep-learning-based Approaches: With the rapid development of deep learning technology, numerous viewport prediction methods [9]–[17], [24], [25] have utilized Convolutional Neural Networks (CNNs) and/or Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRUs), to model a user's head movement during 360° video consumption. Some content-agnostic models [9]–[12] only consider historical viewport scanpaths, while other content-aware methods [13]–[17], [24], [25] also incorporate video content information, such as video frames or saliency maps, in addition to the historical trajectory to improve prediction accuracy. ...
... Rondon et al. [17] proposed the method TRACK, which takes into account the past scanpath and the saliency map of the future frame to train a prediction model with LSTMs. It takes the 1-second viewport scanpath as input and predicts the viewport scanpath over the following 5 seconds. ...
Preprint
Full-text available
Numerous viewport prediction methods have been proposed recently for 360° video streaming. However, most of these methods are not evaluated in terms of the QoE delivered by the video streaming system. In order to understand the progress of currently existing methods, we build a tile-based viewport-adaptive video streaming system and evaluate existing viewport prediction methods based on objective quality assessment metrics. We vary the parameters of the streaming system to see their effect on the viewport prediction methods. The experimental results show that viewport prediction methods improve the visual quality when the streaming system has a longer buffer length.
... Rondon et al. [77] found that the orthodromic metric is the most suitable distance metric for spherical surfaces. It can handle the periodicity of the latitude, while fitting the spherical geometry distance problem more accurately. ...
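As an illustration of the metric referred to above (not code from [77]), the orthodromic (great-circle) distance between two head positions given as longitude/latitude pairs in radians can be computed as follows; the argument convention is an assumption.

```python
# Orthodromic distance on the unit sphere via the spherical law of cosines.
# Inputs are (longitude, latitude) in radians; the result is an arc length in radians,
# so it naturally handles angular wrap-around instead of comparing raw coordinates.
import numpy as np

def orthodromic_distance(p, q):
    lon1, lat1 = p
    lon2, lat2 = q
    cos_d = (np.sin(lat1) * np.sin(lat2)
             + np.cos(lat1) * np.cos(lat2) * np.cos(lon1 - lon2))
    return float(np.arccos(np.clip(cos_d, -1.0, 1.0)))  # clip guards rounding errors
```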
Article
Full-text available
Optimizing user quality of experience (QoE) for 360° videos faces two major roadblocks: inaccurate viewport prediction and viewers missing the plot of a story. To tackle these issues simultaneously, alignment edits have emerged as a promising solution. These edits, also known as "re-targeting edits," work by aligning the user’s field of view with a specific region of interest in the video content. Despite their potential benefits, there is limited knowledge about the actual impacts of alignment edits on user experience (UX). Therefore, we conducted subjective experiments following ITU-T P.919 methodology to explore their effects on QoE. We proposed an alignment edit based on gradual rotation of the 360° frame, aiming to replicate natural viewing behavior. We tested this approach under various conditions and thoroughly analyzed its impacts using both head motion data and feedback from observers, focusing on their sense of presence, comfort, and perceived experience. The results of our experiments are encouraging. Our proposed gradual alignment technique achieved a level of comfort and presence comparable to that of instant edits. Furthermore, all alignment edits tested led to a noticeable reduction in head movement speed after the edit, affirming their potential utility for on-demand video streaming. Notably, the gradual edits, in particular, induced a significant reduction of approximately 8% in head movement speeds when compared to the instant alignment technique. These findings shed light on the positive effects of alignment edits on user experience and firmly establish the viability of the proposed gradual alignment technique to enhance QoE during video consumption.
... Both show acceptable prediction results for short prediction intervals but fail for long intervals because the assumptions become unaligned with reality. Multiple machine learning (ML)-based prediction solutions have also been proposed based on a linear regression model [13], multilayer perceptrons (MLPs) [14], [15], convolutional neural networks (CNNs) [16], recurrent neural networks (RNNs) [17]–[19], long short-term memory (LSTM) [20], [21], and gated recurrent units (GRUs) [20], [22], which learn the correlations between past head pose data and the future head pose. Another approach to motion prediction is to utilize the user's neck surface electromyographic (sEMG) data to make predictions using a trained artificial neural network, based on the fact that myoelectric signals precede exertion [23], [24]. ...
... Fan et al. [18] proposed the use of an RNN to predict the fixation point of the user, where the input to the RNN consists of the user motion data and the saliency map from the video frame and the output is the orientation of the user. A similar approach was also proposed by [19]. ...
Article
Full-text available
Offloading virtual reality (VR) computations to a cloud computing entity can enable support for VR services on low-end user devices but may result in increased latency, which will lead to mismatch between the user’s viewport and the received VR image, thus inducing motion sickness. Predicting future motion and rendering future images accordingly is a promising solution to the latency problem. In this paper, we develop velocity- and error-aware model switching schemes applicable to a wide range of existing motion prediction models. First, we consider the chattering problem of machine learning (ML)-based prediction models and the relationship between the velocity and the prediction error gap between an ML model and the case of no prediction (NOP). Accordingly, we propose a velocity-aware switching (VAS) scheme that combines the outputs from the ML model and the NOP case via a weight determined by the head motion velocity. Next, we develop an ensemble method combining a set of outputs from VAS and other models, called error-aware switching (EAS). EAS switches between model outputs based on the error statistics of those outputs under the parallel execution of multiple models, including VAS models. For EAS, schemes for both hard switching and soft integration of the model outputs are proposed. We evaluate the proposed schemes based on real VR motion traces for diverse ML-based prediction models.
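The velocity-aware switching idea summarized above (blending the ML prediction with the no-prediction output through a velocity-dependent weight) can be illustrated with the following sketch; the sigmoid weighting function and its parameters are assumptions for illustration only, not the scheme proposed in the paper.

```python
# Illustrative velocity-aware blending of an ML head-pose prediction with the
# no-prediction (NOP) fallback, i.e. the current pose. The mapping from velocity to
# weight is an assumed example; angle wrap-around handling is omitted for brevity.
import numpy as np

def velocity_aware_pose(ml_pred, current_pose, velocity_deg_s, v0=30.0, k=0.2):
    w = 1.0 / (1.0 + np.exp(-k * (velocity_deg_s - v0)))   # weight in (0, 1)
    return w * np.asarray(ml_pred) + (1.0 - w) * np.asarray(current_pose)
```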
... We also use the orthodromic distance, which is the shortest distance between two points on the surface of a sphere, measured along the surface of the sphere [21]. We perform the experiment on a single RTX 3070 GPU and average the performance of 5 different runs for each method. ...
... Since a correlation between the saliency map and the user trajectory has been empirically proved in [106], many efforts have been dedicated to studying, inferring, and exploiting saliency in ODV streaming. Specifically, deep learning frameworks aimed at predicting users' trajectories were augmented by using saliency maps as a further input [73,107,108]. Different learning architectures and paradigms were considered in these studies: a Reinforcement Learning (RL)-based approach looking at the user's behaviour as sequential actions taken over time [73]; and a recurrent learning approach exploiting the temporal correlation of users' trajectories [107,108]. Xu et al. [73] proposed an RL-based workflow that first estimates the saliency map for each frame and then predicts the viewport direction based on historical data and the predicted saliency. ...
... Specifically, deep learning frameworks aimed at predicting users' trajectories were augmented by using saliency maps as a further input [73,107,108]. Different learning architectures and paradigms were considered in these studies: a Reinforcement Learning (RL)-based approach looking at the user's behaviour as sequential actions taken over time [73]; and a recurrent learning approach exploiting the temporal correlation of users' trajectories [107,108]. Xu et al. [73] proposed an RL-based workflow that first estimates the saliency map for each frame and then predicts the viewport direction based on historical data and the predicted saliency. This prediction is cast as an RL agent that aims at minimizing the prediction loss (the dissimilarity between the predicted and ground-truth trajectories). ...
... The learning framework was able to overcome main limitations such as the central saliency bias and single-object focus (i.e., ODV users quickly scan through all objects in a single viewport). Interestingly, Rondón et al. [108] show that the historical data points (in terms of past trajectories) and the content features may influence the future trajectories differently, depending on the prediction horizon. They observe that the user's trajectory is affected by the content mainly toward the end of the trajectory. ...
Chapter
Full-text available
Omnidirectional videos (ODVs) have gone beyond the passive paradigm of traditional video, offering higher degrees of immersion and interaction. The revolutionary novelty of this technology is the possibility for users to interact with the surrounding environment, and to feel a sense of engagement and presence in a virtual space. Users are clearly the main driving force of immersive applications and consequently the services need to be properly tailored to them. In this context, this chapter highlights the importance of the new role of users in ODV streaming applications, and thus the need for understanding their behaviour while navigating within ODVs. A comprehensive overview of the research efforts aimed at advancing ODV streaming systems is also presented. In particular, the state-of-the-art solutions under examination in this chapter are distinguished in terms of system-centric and user-centric streaming approaches: the former approach is a quite straightforward extension of well-established solutions for the 2D video pipeline, while the latter takes advantage of an understanding of users' behaviour to enable more personalised ODV streaming.
... Therefore, viewport-based adaptive streaming, which only streams the user's viewport with high quality, has arisen as the primary technique for bandwidth saving over the best-effort Internet [1]. Hence, predicting a user's future head movement based on their past head movement becomes an essential task for the 360° video streaming system [2]–[6]. ...
... Given a user's viewport scanpath in the previous H seconds, the viewport prediction method [2]–[6] predicts the user's viewport scanpath in the following F seconds. The viewport scanpath of a user watching a 360° video of duration T can be defined as {P_t}_{t=0}^{T}. ...
... Referring to existing viewport prediction methods [2]–[4], we selected two metrics, the average great-circle distance and the average ratio of overlapping tiles, to evaluate the prediction accuracy. The great-circle distance computes the distance between the predicted point and the ground-truth point on a sphere. ...
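Building on the great-circle distance sketched earlier, the second metric mentioned above can be illustrated as follows; representing each viewport as a set of tile indices is an assumption, not the exact tiling of the cited work.

```python
# Ratio of overlapping tiles between the predicted and ground-truth viewports,
# assuming each viewport has already been mapped to the set of tile indices it covers.
def tile_overlap_ratio(pred_tiles, gt_tiles):
    pred_tiles, gt_tiles = set(pred_tiles), set(gt_tiles)
    if not gt_tiles:
        return 0.0
    return len(pred_tiles & gt_tiles) / len(gt_tiles)

# Example with a 4x3 tiling, tiles indexed as (col, row):
# tile_overlap_ratio({(0, 1), (1, 1)}, {(1, 1), (2, 1)})  -> 0.5
```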
Conference Paper
Full-text available
Predicting the user's viewport scanpath is an essential task for 360° viewport-based adaptive streaming. It informs the system which parts of the content should be streamed with high quality for bandwidth saving over the best-effort Internet. However, in light of growing privacy concerns among consumers and increasingly strict data privacy legislation, user data collection and storage have been constrained. This paper proposes a novel privacy-preserving framework employing Federated Learning (FL) for online viewport prediction in a live 360° video streaming scenario. In this framework, the user data is only collected and processed on the client side in the current viewing session and not shared with external parties, e.g., servers and other clients. We evaluated the framework over a widely used dataset and measured the computation and transmission time of the proposed streaming system. The experiments show that our framework provides high prediction accuracy and meets the real-time computation requirements of live video streaming. On privacy preservation, our results demonstrate that in a tile-based 360° video streaming system, the user identification rate can be decreased by 18.11 percentage points with 4 × 3 tiles per frame and 9.65 percentage points with 16 × 9 tiles per frame. The code will be publicly available to further contribute to the community.