Fig. 2: For each time-stamp t, all the next positions from t until t + H are predicted.

Source publication
Article
Full-text available
We consider predicting the user's head motion in 360° videos, with only 2 modalities: the user's past position and the video content (not knowing other users' traces). We make two main contributions. First, we re-examine existing deep-learning approaches for this problem and identify hidden flaws through a thorough root-cause analysis. Second, from the...

Contexts in source publication

Context 1
... section reviews the existing methods relevant for the problem we consider. We start by formulating the exact problem: it consists, at each video playback time t, in predicting the future user's head positions between t and t + H, as illustrated in Fig. 1 and represented in Fig. 2, with the only knowledge of this user's past positions and the (entire) video content. We therefore do not consider methods aiming to predict the entire user trajectory from the start based on the content and on the starting point as, e.g., targeted by the challenge in [16] or summarizing a 360° video into 2D [17], [18]. As well, and ...
Context 2
... the settings of the works we compare with, T_start is set to 0 sec. for all the curves generated in Sec. 3. In order to skip the exploration phase, as explained in Sec. 5.4, and be more favorable to all methods as they are not able to consider non-stationarity of the motion process, we set T_start = 6 sec. from Sec. 5 onward. We now refer to Fig. 2. Let H be the prediction horizon. We define the terms prediction step s and video time-stamp t such that: at every time-stamp t ∈ [T_start, T], we run predictions P̂_{t+s} for all prediction steps s ∈ [0, H]. We formulate the problem of trajectory prediction as finding the best model F*_H ...
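To make the protocol concrete, a minimal sketch of this sliding evaluation loop is given below; the predictor interface and array shapes are assumptions for illustration, not the paper's exact code.

```python
# Sketch of the prediction protocol: at every time-stamp t in [T_start, T], a model
# produces predictions P̂_{t+s} for all prediction steps s in [0, H], using only the
# user's past positions (content-aware models would also receive the video frames).
import numpy as np

def run_prediction_protocol(positions, predictor, t_start, horizon):
    """positions: (T+1, dim) ground-truth head positions; predictor(past, horizon)
    is an assumed callable returning an array of shape (horizon + 1, dim)."""
    T = positions.shape[0] - 1
    predictions = {}
    for t in range(t_start, T + 1):
        past = positions[: t + 1]                  # everything observed up to time t
        predictions[t] = predictor(past, horizon)  # P̂_{t+s} for s = 0 .. H
    return predictions
```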
Context 3
... depicted in Fig. 4, a seq2seq framework consists of an encoder and a decoder. The encoder receives the historic window input (samples from t − M to t − 1 shown in Fig. 2) and generates an internal representation. The decoder receives the output of the encoder and progressively produces predictions over the target horizon, by re-injecting the previous prediction as input for the new prediction time-steps. This is a strong baseline (not only a trivial-static or linear predictor) processing the head ...
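A hedged sketch of such an encoder-decoder baseline is shown below; the layer sizes, the use of LSTMs, and the class name are illustrative assumptions rather than the exact architecture evaluated in the paper.

```python
# Minimal seq2seq position-only baseline: the encoder summarizes the historic window
# (samples t-M .. t-1) and the decoder autoregressively re-injects its own previous
# prediction to produce the next H positions.
import torch
import torch.nn as nn

class Seq2SeqPositionPredictor(nn.Module):
    def __init__(self, pos_dim=3, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(pos_dim, hidden_dim, batch_first=True)
        self.decoder_cell = nn.LSTMCell(pos_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, pos_dim)

    def forward(self, past_positions, horizon):
        # past_positions: (batch, M, pos_dim) historic window
        _, (h, c) = self.encoder(past_positions)   # internal representation of the past
        h, c = h.squeeze(0), c.squeeze(0)
        prev = past_positions[:, -1, :]            # last observed position
        preds = []
        for _ in range(horizon):
            h, c = self.decoder_cell(prev, (h, c))
            prev = self.out(h)                     # re-inject prediction as next input
            preds.append(prev)
        return torch.stack(preds, dim=1)           # (batch, horizon, pos_dim)

# Example: predict 25 future samples from a 25-sample history for a batch of 8 users.
# model = Seq2SeqPositionPredictor(); future = model(torch.randn(8, 25, 3), horizon=25)
```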
Context 4
... Exploratory, we rely on the users' behavior. Indeed, the more the users tend to have a focusing behavior, the lower the entropy of the GT saliency map. Thus we consider the entropy of the GT saliency map of each video to assign the video to one category or the other. We sort the videos of the test set in increasing entropy, and we represent in Fig. 20 the results averaged over the bottom 10% (focus-type videos) and top 10% (exploratory ...
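A small sketch of this entropy-based split is given below; the input format (one aggregated ground-truth saliency map per video) and the function names are assumptions for illustration.

```python
# Rank videos by the Shannon entropy of their ground-truth (GT) saliency maps and keep
# the bottom/top 10% as focus-type and exploratory videos, respectively.
import numpy as np

def saliency_entropy(sal_map, eps=1e-12):
    p = sal_map / (sal_map.sum() + eps)        # normalize the map to a distribution
    return float(-np.sum(p * np.log2(p + eps)))

def split_focus_exploratory(gt_saliency, fraction=0.10):
    # gt_saliency: dict {video_id: 2D numpy array}, e.g. averaged over frames and users
    ranked = sorted(gt_saliency, key=lambda vid: saliency_entropy(gt_saliency[vid]))
    k = max(1, round(fraction * len(ranked)))
    return ranked[:k], ranked[-k:]             # (focus-type videos, exploratory videos)
```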
Context 5
... confirm the analysis that led us to introduce this new architecture TRACK for dynamic head motion prediction, we perform an ablation study of the additional elements we brought compared to CVPR18-improved: we either replace the RNN processing the CB-saliency with two FC layers (line named AblatSal in Fig. 22), or replace the fusion RNN with two FC layers (line named AblatFuse). Figs. 18 and 22 confirm the analysis in Sec. 7.1: the removal of the first extra RNN (not present in CVPR18) processing the saliency input has more impact: AblatSal degrades away from the deep-position-only baseline in the first time-steps. The degradation is not as ...

Citations

... Another line of research addresses the prediction of the egocentric scanpath under the free-viewing condition [33,34,35,36,37,38]. Xu et al. [33] train their model on individual observer fixation trajectories while freely viewing 360° videos on a VR headset. ...
Preprint
Full-text available
Previous research on scanpath prediction has mainly focused on group models, disregarding the fact that the scanpaths and attentional behaviors of individuals are diverse. The disregard of these differences is especially detrimental to social human-robot interaction, whereby robots commonly emulate human gaze based on heuristics or predefined patterns. However, human gaze patterns are heterogeneous and varying behaviors can significantly affect the outcomes of such human-robot interactions. To fill this gap, we developed a deep learning-based social cue integration model for saliency prediction to instead predict scanpaths in videos. Our model learned scanpaths by recursively integrating fixation history and social cues through a gating mechanism and sequential attention. We evaluated our approach on gaze datasets of dynamic social scenes, observed under the free-viewing condition. The introduction of fixation history into our models makes it possible to train a single unified model rather than the resource-intensive approach of training individual models for each set of scanpaths. We observed that the late neural integration approach surpasses early fusion when training models on a large dataset, in comparison to a smaller dataset with a similar distribution. Results also indicate that a single unified model, trained on all the observers' scanpaths, performs on par or better than individually trained models. We hypothesize that this outcome is a result of the group saliency representations instilling universal attention in the model, while the supervisory signal and fixation history guide it to learn personalized attentional behaviors, providing the unified model a benefit over individual models due to its implicit representation of universal attention.
... In the existing works, VR video streaming research almost always divides the VR video into tiles, and the system mainly transmits the content in the FoV [4]–[17]. There are also many articles that study caching algorithms which cache popular content on the edge server to reduce the remote request delay [4]–[12]. ...
... There are also many articles that study caching algorithms which cache popular content on the edge server to reduce the remote request delay [4]–[12]. In the FoV prediction literature, researchers prefer to use machine learning to design predictors [13]–[17]. VR video tile transmission and FoV prediction can be achieved without an edge server, but the caching method generally requires an edge server, on which the content is cached. ...
... In [19], the authors develop a predictive scheduling policy for proactive scheduling and deadline scheduling at the access point (AP), and optimize the scheduling decision. In [13], the authors use machine learning to predict the user's head movement. In addition to caching and prediction research, there are also articles that study the balanced allocation of resources to improve the user's quality of experience (QoE). ...
Article
Full-text available
The transmission optimization of VR video streaming can improve the quality of user experience, which includes content prediction optimization and caching strategy optimization. Existing work focuses either on content prediction or on caching strategy. However, in the end-edge-cloud system, prediction and caching should be considered together. In this paper, we jointly optimize the four stages of prediction, caching, computing and transmission in a mobile edge caching system, aiming to maximize the user's quality of experience. In terms of caching strategy, we design a caching algorithm, VIE, that handles unknown future request content and can efficiently improve the content hit rate, as well as the durations for prediction, computing and transmission. The VIE caching algorithm is shown to outperform other algorithms in terms of delay. We optimize the four stages under arbitrary resource allocation and obtain the optimal results. Finally, in a realistic scenario, the proposed caching algorithm is verified by comparison with several other caching algorithms; simulation results show that the user's QoE is improved under the proposed caching algorithm.
... Various approaches have been developed to identify important 2D views, with most falling into two categories. Firstly, many works [4,11,23,26,30,33,38,40,45] rely on manual viewer control to select views. In this approach, the viewer manually adjusts their viewing direction during playback, receiving the corresponding 2D view. ...
... Pros & Cons. Many existing works [4,11,23,25,26,28,30,33,38,40,45] consider MANUAL to be satisfactory for viewers. This is because manual control allows viewers to obtain views that align with their own preferences and are therefore important to them. ...
... Consequently, viewers' viewing preferences exhibit a dynamic degree of convergence that varies across video frames. It is noteworthy that some works [1,13,23,30–32,35] have examined viewers' behaviors and patterns during 360° video viewing, but they solely investigate MANUAL mode. These studies observe dynamic convergence in the viewports of viewers in MANUAL, where viewers have a limited field of view. ...
Preprint
Full-text available
360-degree video has become increasingly popular in content consumption. However, finding the viewing direction for important content within each frame poses a significant challenge. Existing approaches rely on either viewer input or algorithmic determination to select the viewing direction, but neither mode consistently outperforms the other in terms of content-importance. In this paper, we propose 360TripleView, the first view management system for 360-degree video that automatically infers and utilizes the better view mode for each frame, ultimately providing viewers with higher content-importance views. Through extensive experiments and a user study, we demonstrate that 360TripleView achieves over 90\% accuracy in inferring the better mode and significantly enhances content-importance compared to existing methods.
... In the past ten years, many scanpath prediction methods in 360° videos have been proposed, differing mainly in three aspects: 1) the input formats and modalities, 2) the computational prediction mechanisms, and 3) the loss functions. For the input formats and modalities, Rondón et al. [11] revealed that the user's past scanpath alone suffices to inform the prediction for time horizons shorter than two to three seconds. Nevertheless, the majority of existing methods take 360° video frames as an "indispensable" form of visual input for improved scanpath prediction. ...
... Among numerous 360° video representations, the equirectangular projection (ERP) format is the most widely adopted, which however exhibits noticeable geometric deformations, especially for objects at high latitudes. For the computational prediction mechanisms, existing methods are inclined to rely on external algorithms for saliency detection [11], [12], [13] or optical flow estimation [12], [13] for visual feature analysis, whose performance is inevitably upper-bounded by these external methods, which are often trained on planar rather than 360° videos. After multimodal feature extraction and aggregation, a sequence-to-sequence (seq2seq) predictor, implemented by an unfolded recurrent neural network (RNN) or a transformer, is adopted to gather historical information. ...
... For the loss functions in guiding the optimization, some form of "ground-truth" scanpaths is commonly specified to gauge the prediction accuracy. A convenient choice is the mean squared error (MSE) [11], [13], [14] or its spherical derivative [15], which assumes the underlying probability distribution to be unimodal Gaussian. Such "imitation learning" is weak at capturing the scanpath uncertainty of an individual user and the scanpath diversity of different users. ...
Preprint
Full-text available
Predicting human scanpaths when exploring panoramic videos is a challenging task due to the spherical geometry and the multimodality of the input, and the inherent uncertainty and diversity of the output. Most previous methods fail to give a complete treatment of these characteristics, and thus are prone to errors. In this paper, we present a simple new criterion for scanpath prediction based on principles from lossy data compression. This criterion suggests minimizing the expected code length of quantized scanpaths in a training set, which corresponds to fitting a discrete conditional probability model via maximum likelihood. Specifically, the probability model is conditioned on two modalities: a viewport sequence as the deformation-reduced visual input and a set of relative historical scanpaths projected onto respective viewports as the aligned path input. The probability model is parameterized by a product of discretized Gaussian mixture models to capture the uncertainty and the diversity of scanpaths from different users. Most importantly, the training of the probability model does not rely on the specification of "ground-truth" scanpaths for imitation learning. We also introduce a proportional-integral-derivative (PID) controller-based sampler to generate realistic human-like scanpaths from the learned probability model. Experimental results demonstrate that our method consistently produces better quantitative scanpath results in terms of prediction accuracy (by comparing to the assumed "ground-truths") and perceptual realism (through machine discrimination) over a wide range of prediction horizons. We additionally verify the perceptual realism improvement via a formal psychophysical experiment and the generalization improvement on several unseen panoramic video datasets.
... 3) Deep-learning-based Approaches: With the rapid development of deep learning technology, numerous viewport prediction methods [9]–[17], [24], [25] have utilized Convolutional Neural Networks (CNNs) and/or Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRUs), to model a user's head movement during 360° video consumption. Some content-agnostic models [9]–[12] only consider historical viewport scanpaths, while other content-aware methods [13]–[17], [24], [25] also incorporate video content information, such as video frames or saliency maps, in addition to the historical trajectory to improve prediction accuracy. ...
... Rondon et al. [17] proposed the method TRACK, which takes into account the past scanpath and the saliency map of the future frame to train a prediction model with LSTMs. It takes the 1-second viewport scanpath as input and predicts the viewport scanpath over the following 5 seconds. ...
Preprint
Full-text available
Numerous viewport prediction methods have been proposed recently for 360° video streaming. However, most of these methods are not evaluated in terms of the QoE delivered by the video streaming system. In order to understand the progress of currently existing methods, we build a tile-based viewport-adaptive video streaming system and evaluate existing viewport prediction methods based on objective quality assessment metrics. We vary the parameters of the streaming system to see their effect on the viewport prediction methods. The experimental results show that viewport prediction methods improve the visual quality when the streaming system has a longer buffer length.
... Rondon et al. [77] found that the orthodromic metric is the most suitable distance metric for spherical surfaces. It can handle the periodicity of the latitude, while fitting the spherical geometry distance problem more accurately. ...
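As an illustration of the metric referred to above (not code from [77]), the orthodromic (great-circle) distance between two head positions given as longitude/latitude pairs in radians can be computed as follows; the argument convention is an assumption.

```python
# Orthodromic distance on the unit sphere via the spherical law of cosines.
# Inputs are (longitude, latitude) in radians; the result is an arc length in radians,
# so it naturally handles angular wrap-around instead of comparing raw coordinates.
import numpy as np

def orthodromic_distance(p, q):
    lon1, lat1 = p
    lon2, lat2 = q
    cos_d = (np.sin(lat1) * np.sin(lat2)
             + np.cos(lat1) * np.cos(lat2) * np.cos(lon1 - lon2))
    return float(np.arccos(np.clip(cos_d, -1.0, 1.0)))  # clip guards rounding errors
```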
Article
Full-text available
Optimizing user quality of experience (QoE) for 360° videos faces two major roadblocks: inaccurate viewport prediction and viewers missing the plot of a story. To tackle these issues simultaneously, alignment edits have emerged as a promising solution. These edits, also known as "re-targeting edits," work by aligning the user’s field of view with a specific region of interest in the video content. Despite their potential benefits, there is limited knowledge about the actual impacts of alignment edits on user experience (UX). Therefore, we conducted subjective experiments following ITU-T P.919 methodology to explore their effects on QoE. We proposed an alignment edit based on gradual rotation of the 360° frame, aiming to replicate natural viewing behavior. We tested this approach under various conditions and thoroughly analyzed its impacts using both head motion data and feedback from observers, focusing on their sense of presence, comfort, and perceived experience. The results of our experiments are encouraging. Our proposed gradual alignment technique achieved a level of comfort and presence comparable to that of instant edits. Furthermore, all alignment edits tested led to a noticeable reduction in head movement speed after the edit, affirming their potential utility for on-demand video streaming. Notably, the gradual edits, in particular, induced a significant reduction of approximately 8% in head movement speeds when compared to the instant alignment technique. These findings shed light on the positive effects of alignment edits on user experience and firmly establish the viability of the proposed gradual alignment technique to enhance QoE during video consumption.
... Both show acceptable prediction results for short prediction intervals but fail for long intervals because the assumptions become unaligned with reality. Multiple machine learning (ML)-based prediction solutions have also been proposed based on a linear regression model [13], multilayer perceptrons (MLPs) [14], [15], convolutional neural networks (CNNs) [16], recurrent neural networks (RNNs) [17]–[19], long short-term memory (LSTM) [20], [21], and gated recurrent units (GRUs) [20], [22], which learn the correlations between past head pose data and the future head pose. Another approach to motion prediction is to utilize the user's neck surface electromyographic (sEMG) data to make predictions using a trained artificial neural network, based on the fact that myoelectric signals precede exertion [23], [24]. ...
... Fan et al. [18] proposed the use of an RNN to predict the fixation point of the user, where the input to the RNN consists of the user motion data and the saliency map from the video frame and the output is the orientation of the user. A similar approach was also proposed by [19]. ...
Article
Full-text available
Offloading virtual reality (VR) computations to a cloud computing entity can enable support for VR services on low-end user devices but may result in increased latency, which will lead to mismatch between the user’s viewport and the received VR image, thus inducing motion sickness. Predicting future motion and rendering future images accordingly is a promising solution to the latency problem. In this paper, we develop velocity- and error-aware model switching schemes applicable to a wide range of existing motion prediction models. First, we consider the chattering problem of machine learning (ML)-based prediction models and the relationship between the velocity and the prediction error gap between an ML model and the case of no prediction (NOP). Accordingly, we propose a velocity-aware switching (VAS) scheme that combines the outputs from the ML model and the NOP case via a weight determined by the head motion velocity. Next, we develop an ensemble method combining a set of outputs from VAS and other models, called error-aware switching (EAS). EAS switches between model outputs based on the error statistics of those outputs under the parallel execution of multiple models, including VAS models. For EAS, schemes for both hard switching and soft integration of the model outputs are proposed. We evaluate the proposed schemes based on real VR motion traces for diverse ML-based prediction models.
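The velocity-aware switching idea summarized above (blending the ML prediction with the no-prediction output through a velocity-dependent weight) can be illustrated with the following sketch; the sigmoid weighting function and its parameters are assumptions for illustration only, not the scheme proposed in the paper.

```python
# Illustrative velocity-aware blending of an ML head-pose prediction with the
# no-prediction (NOP) fallback, i.e. the current pose. The mapping from velocity to
# weight is an assumed example; angle wrap-around handling is omitted for brevity.
import numpy as np

def velocity_aware_pose(ml_pred, current_pose, velocity_deg_s, v0=30.0, k=0.2):
    w = 1.0 / (1.0 + np.exp(-k * (velocity_deg_s - v0)))   # weight in (0, 1)
    return w * np.asarray(ml_pred) + (1.0 - w) * np.asarray(current_pose)
```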
... We also use the orthodromic distance, which is the shortest distance between two points on the surface of a sphere, measured along the surface of the sphere [21]. We perform the experiment on a single RTX 3070 GPU and average the performance of 5 different runs for each method. ...
... Since a correlation between the saliency map and the user trajectory has been empirically proved in [106], many efforts have been dedicated to studying, inferring, and exploiting saliency in ODV streaming. Specifically, deep learning frameworks aimed at predicting users' trajectories were augmented by using saliency maps as a further input [73,107,108]. Different learning architectures and paradigms were considered in these studies: a Reinforcement Learning (RL)-based approach looking at the user's behaviour as sequential actions taken over time [73]; and a recurrent learning approach exploiting the temporal correlation of users' trajectories [107,108]. Xu et al. [73] proposed an RL-based workflow that first estimates the saliency map for each frame and then predicts the viewport direction based on historical data and the predicted saliency. ...
... Specifically, deep learning frameworks aimed at predicting users' trajectories were augmented by using saliency maps as a further input [73,107,108]. Different learning architectures and paradigms were considered in these studies: a Reinforcement Learning (RL)-based approach looking at the user's behaviour as sequential actions taken over time [73]; and a recurrent learning approach exploiting the temporal correlation of users' trajectories [107,108]. Xu et al. [73] proposed an RL-based workflow that first estimates the saliency map for each frame and then predicts the viewport direction based on historical data and the predicted saliency. This prediction is cast as an RL agent that aims at minimizing the prediction loss (the dissimilarity between the predicted and ground-truth trajectories). ...
... The learning framework was able to overcome main limitations such as the central saliency bias and single-object focus (i.e., ODV users quickly scan through all objects in a single viewport). Interestingly, Rondón et al. [108] show that the historical data points (in terms of past trajectories) and the content features may influence the future trajectories differently, depending on the prediction horizon. They observe that the user's trajectory is affected by the content mainly toward the end of the trajectory. ...
Chapter
Full-text available
Omnidirectional videos (ODVs) have gone beyond the passive paradigm of traditional video, offering higher degrees of immersion and interaction. The revolutionary novelty of this technology is the possibility for users to interact with the surrounding environment, and to feel a sense of engagement and presence in a virtual space. Users are clearly the main driving force of immersive applications and consequently the services need to be properly tailored to them. In this context, this chapter highlights the importance of the new role of users in ODV streaming applications, and thus the need for understanding their behaviour while navigating within ODVs. A comprehensive overview of the research efforts aimed at advancing ODV streaming systems is also presented. In particular, the state-of-the-art solutions under examination in this chapter are distinguished in terms of system-centric and user-centric streaming approaches: the former approach is a quite straightforward extension of well-established solutions for the 2D video pipeline, while the latter takes advantage of an understanding of users' behaviour to enable more personalised ODV streaming.
... Therefore, viewport-based adaptive streaming, which only streams the user's viewport with high quality, has arisen as the primary technique for bandwidth saving over the best-effort Internet [1]. Hence, predicting a user's future head movement based on their past head movement becomes an essential task for the 360° video streaming system [2]–[6]. ...
... Given a user's viewport scanpath in the previous H seconds, the viewport prediction method [2]–[6] predicts the user's viewport scanpath in the following F seconds. The viewport scanpath of a user watching a 360° video of duration T can be defined as {P_t}_{t=0}^{T}. ...
... Referring to existing viewport prediction methods [2]–[4], we selected two metrics, the average great-circle distance and the average ratio of overlapping tiles, to evaluate the prediction accuracy. The great-circle distance computes the distance between the predicted point and the ground-truth point on a sphere. ...
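Building on the great-circle distance sketched earlier, the second metric mentioned above can be illustrated as follows; representing each viewport as a set of tile indices is an assumption, not the exact tiling of the cited work.

```python
# Ratio of overlapping tiles between the predicted and ground-truth viewports,
# assuming each viewport has already been mapped to the set of tile indices it covers.
def tile_overlap_ratio(pred_tiles, gt_tiles):
    pred_tiles, gt_tiles = set(pred_tiles), set(gt_tiles)
    if not gt_tiles:
        return 0.0
    return len(pred_tiles & gt_tiles) / len(gt_tiles)

# Example with a 4x3 tiling, tiles indexed as (col, row):
# tile_overlap_ratio({(0, 1), (1, 1)}, {(1, 1), (2, 1)})  -> 0.5
```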
Conference Paper
Full-text available
Predicting the user's viewport scanpath is an essential task for 360° viewport-based adaptive streaming. It informs the system which parts of the content should be streamed with high quality for bandwidth saving over the best-effort Internet. However, in light of growing privacy concerns among consumers and increasingly strict data privacy legislation, user data collection and storage have been constrained. This paper proposes a novel privacy-preserving framework employing Federated Learning (FL) for online viewport prediction in a live 360° video streaming scenario. In this framework, the user data is only collected and processed on the client side in the current viewing session and not shared with external parties, e.g., servers and other clients. We evaluated the framework over a widely used dataset and measured the computation and transmission time of the proposed streaming system. The experiments show that our framework provides high prediction accuracy and meets the real-time computation requirements of live video streaming. On privacy preservation, our results demonstrate that in a tile-based 360° video streaming system, the user identification rate can be decreased by 18.11 percentage points with 4 × 3 tiles per frame and 9.65 percentage points with 16 × 9 tiles per frame. The code will be publicly available to further contribute to the community.