Classification scheme.

Source publication
Article
Full-text available
Today's content-based video retrieval technologies are still far from meeting humans' requirements. A fundamental reason is the lack of a content representation able to bridge the gap between visual features and semantic conceptions in video. In this paper, we propose a motion pattern descriptor, motion texture, that characterizes motion in a...

Context in source publication

Context 1
... video clips do not always have salient motion patterns and the semantic conceptions are not always conveyed by motion patterns, not all of the video clips can be mapped to semantic conceptions based on motion patterns. Therefore, we define a classification scheme, as shown in Figure 5, to facilitate semantic video classification. At first, video clips are classified into two basic classes: patterned and nonpatterned. ...
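As a minimal sketch of the two-stage scheme this context describes (assuming a binary patterned/nonpatterned classifier followed by a semantic classifier trained on patterned clips only; the SVM choice and all names here are illustrative, not the paper's exact pipeline):

```python
from sklearn.svm import SVC

def classify_clips(train_X, train_pattern_y, train_semantic_y, test_X):
    """Two-stage scheme: stage 1 separates patterned (1) from nonpatterned (0)
    clips; stage 2 maps patterned clips to semantic conceptions. Inputs are
    numpy arrays of motion-pattern feature vectors and labels."""
    stage1 = SVC(kernel="rbf").fit(train_X, train_pattern_y)
    is_patterned = stage1.predict(test_X)

    mask = train_pattern_y == 1   # train stage 2 on patterned clips only
    stage2 = SVC(kernel="rbf").fit(train_X[mask], train_semantic_y[mask])

    return [stage2.predict(x.reshape(1, -1))[0] if p == 1 else "nonpatterned"
            for x, p in zip(test_X, is_patterned)]
```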

Similar publications

Article
Full-text available
Recent literature comprises a large number of papers on the query and retrieval of visual information based on its content. At the same time, a number of prototype systems have been implemented enabling searching through on-line image databases and still image retrieval. However, it has often been pointed out that meaningful/semantic information sh...
Conference Paper
Full-text available
We conducted a user study with 4 video clips and 37 viewing sessions on how users interact with a web-based zoomable video system, where users can zoom and pan within the video to view selected regions-of-interest with more detail. The study shows that frequency of interaction is very high and the period during which users watch the video withou...
Conference Paper
Full-text available
Keyframe extraction methods aim to find in a video sequence the most significant frames, according to specific criteria. In this paper we propose a new method to search, in a video database, for frames that are related to a given keyword, and to extract the best ones, according to a proposed quality factor. We first exploit a speech to text algorit...
Article
Full-text available
We present a continuous and unobtrusive approach to analyze and reason about users' personal experiences of interacting with virtual and game environments. Focusing on an immersive educational game environment that we are developing, this is achieved through the capture and storage of users' movements and events that occur as a result of intera...
Thesis
Full-text available
This dissertation proposes a new approach regarding multimedia documents creation. To support this approach, an evolutionary module and a multimedia production system were conceived, integrated and implemented. A new paradigm is pursued for editing multimedia described documents. Cut and merge of video segments is applied by a Genetic Algorithm. Th...

Citations

... According to the field area and field distribution characteristics, football video shot classification, event detection, and video summarization are performed. Ma et al. [13] classified simple sports actions in sports videos by detecting motion patterns in video frames, such as running, jumping, and serving shots, as well as panning and zooming. Liu et al. [14] classified videos by separating the target objects from other information in the video. ...
Article
Full-text available
As a hot research topic, sports video classification research has a wide range of applications in switched TV, video on demand, smart TV, and other fields, and is closely related to people's lives. Against this background, sports video classification research has aroused great interest. However, existing methods usually rely on manual video classification, which is influenced by the workers themselves; it is challenging to ensure the accuracy of the results, leading to misclassification. Due to these limitations, we introduce neural network technology to the automatic classification of sports. This paper proposes a novel attention-based graph convolution-guided third-order hourglass network (AGTH-Net) classification model. First, we design a graph convolution model based on the attention mechanism. The key of the model is to introduce the attention mechanism for the allocation of neighborhood node weights, which reduces the impact of erroneous nodes in the neighborhood while avoiding manual weight assignment. Second, according to the complex characteristics of sports video images, we use a third-order hourglass network structure for the extraction and fusion of multiscale sports features. In addition, residual-dense modules are introduced inside the hourglass network, enabling features at different levels to be transferred and reused across the network, which helps extract maximally detailed features and enhances the network's expressive ability. Comparison and ablation experiments are also carried out to prove the effectiveness and superiority of the proposed algorithm.
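As a rough illustration of the attention-based neighborhood weighting the abstract mentions (learned weights over neighbor nodes rather than manual assignment), here is a single attention-weighted aggregation step in numpy; the parameterization is a simplified, hypothetical stand-in for the AGTH-Net layer:

```python
import numpy as np

def attention_aggregate(H, adj, a_src, a_dst):
    """Pairwise attention logits from node features, a softmax restricted to
    graph neighbors, then a weighted mix of neighbor features.
    H: (N, d) node features; adj: (N, N) 0/1 adjacency;
    a_src, a_dst: (d,) attention parameters (fixed here for the demo)."""
    adj = adj + np.eye(len(H))                             # self-loops keep every row nonempty
    logits = (H @ a_src)[:, None] + (H @ a_dst)[None, :]   # (N, N) pairwise logits
    logits = np.where(adj > 0, logits, -np.inf)            # attend to neighbors only
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)              # per-node softmax
    return alpha @ H                                       # attention-weighted aggregation
```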
... The audio features normally used are the well-known Mel-Frequency Cepstral Coefficients (MFCCs) [3]. Regarding video features, image features can be used [4,5], but also motion features [6] or a combination of both [2,3]. ...
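For concreteness, a small sketch combining MFCC audio features with a crude motion statistic; librosa supplies the MFCCs, while the frame-difference energy is a hypothetical stand-in for the motion features of [6]:

```python
import numpy as np
import librosa

def audiovisual_features(wav_path, frames):
    """Concatenate clip-level MFCC audio features with a simple motion
    statistic. `frames` is a (T, H, W) grayscale array."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)   # (13,)
    motion = np.abs(np.diff(frames.astype(float), axis=0)).mean()     # frame-difference energy
    return np.concatenate([mfcc, [motion]])
```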
Article
The aim of the present work is to design a system for automatic classification of personal video recordings based on simple audiovisual features that can be easily implemented in different devices. Specifically, the main objective is to classify personal video recordings frame by frame into 24 semantically meaningful categories. Such categories include information about the environment, like indoor or outdoor, the presence or absence of people, and their activity, ranging from sports to partying. In order to achieve a robust classification, features derived from both audio and image data will be used and combined with state-of-the-art classifiers such as Gaussian Mixture Models or Support Vector Machines. In the process, several combination schemes of features and classifiers are defined and evaluated over a real data set of personal video recordings. The system learns which parameters and classifiers are most appropriate for this task.
... In another approach [27], semantic aspects of a video genre, such as editing, motion, and color distribution, have been used as features, and a decision tree algorithm was used to build the classifier. In [19], motion patterns (via a block motion estimation algorithm) from compressed-domain features have been used for video classification and retrieval. Support vector machines (SVMs) have been used for sports video classification in [25]. ...
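As a brief sketch of the classic exhaustive form of block motion estimation mentioned for [19] (block size and search range are arbitrary choices here):

```python
import numpy as np

def block_motion(prev, curr, block=16, search=8):
    """For each block of `curr`, find the displacement into `prev` within
    +/- `search` pixels that minimizes the sum of absolute differences (SAD).
    Returns an (H//block, W//block, 2) field of (dy, dx) motion vectors."""
    prev, curr = prev.astype(float), curr.astype(float)
    H, W = curr.shape
    mvf = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(H // block):
        for bx in range(W // block):
            y0, x0 = by * block, bx * block
            tgt = curr[y0:y0 + block, x0:x0 + block]
            best = np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = y0 + dy, x0 + dx
                    if 0 <= y <= H - block and 0 <= x <= W - block:
                        sad = np.abs(prev[y:y + block, x:x + block] - tgt).sum()
                        if sad < best:
                            best, mvf[by, bx] = sad, (dy, dx)
    return mvf
```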
Article
This paper presents a genre-specific modeling strategy capable of improving the task of content-based video classification and the speed of data retrieval operations. With the ever-increasing growth of video data, it is important to classify video shots into groups based on their content. For that reason, it is of primary concern to design systems that can automatically classify videos into different genres based on their content. We consider the genre recognition task as a classification problem. We use support vector machines to perform the classification task and propose an improved video classification method. The experimental results show that genre-specific modeling of features can significantly improve performance. Results have been compared with two contemporary works on video classification to demonstrate the superiority of our proposed framework.
... Djeraba (2002) stated that low-level video features are features extracted from video clips and audio tracks without referring to any external knowledge. For example, color, texture, and motion are major low-level features extracted from video clips (Gibert et al., 2003; Huang et al., 1999; Ma and Zhang, 2003). Fischer et al. (1995) utilized audio features such as audio volume, audio waveforms, and the audio frequency spectrum. ...
... First, prior knowledge is not required for it to obtain high generalization performance, and it can perform consistently with very high input dimensions. Second, SVM can obtain a global optimal solution and is especially suitable for solving classification problems with small samples (Ma and Zhang, 2003). In addition, SVM has shown excellent video classification performance (Jing et al., 2004; Lazebnik et al., 2006; Zhang et al., 2007). ...
... There are a lot of approaches one can consider for classifying videos. SVM (support vector machine) and HMM (hidden Markov model) are examples of model-based classifiers [5, 13, 17]. However, recent work on image classification has shown that nearest-neighbor-based classifiers serve as fast classifiers which can provide good performance even with large datasets [4]. ...
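To make the contrast concrete, a toy comparison of a model-based SVM and an instance-based nearest-neighbor classifier on invented clip-level feature vectors (data and dimensions are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))        # 200 clips, 64-dim features
y_train = rng.integers(0, 6, 200)           # 6 made-up genre labels
X_test = rng.normal(size=(10, 64))

svm = SVC(kernel="rbf").fit(X_train, y_train)                     # model-based
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)   # instance-based
print(svm.predict(X_test), knn.predict(X_test))
```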
Conference Paper
Object of Interest (OOI) detection has been widely used in many recent works in video analysis, especially in video similarity and video retrieval. In this paper, we describe a generic video classification algorithm using object of interest detection. We use online user-submitted videos and aim to categorize the videos into six broad categories: hot star, news, anime, pets, sports, and commercials. We show through our experiments that detecting and describing the object of interest improves the video classification accuracy by about 10 percentage points.
... Videos can then be indexed based on either global or segmentation features. In [24], a motion pattern descriptor, namely motion texture, is proposed for video retrieval and the classification of simple camera and object motion patterns. In [8], spatio-temporal interactions between objects are expressed by predicate logic for video retrieval. ...
Conference Paper
Full-text available
Event detection plays an essential role in video content analysis. However, the existing features are still weak in event detection because: i) most features just capture what is involved in an event or how the event evolves separately, and thus cannot completely describe the event; ii) to capture event evolution information, only motion distribution over the whole frame is used, which proves to be noisy in unconstrained videos; iii) the estimated object motion is usually distorted by camera movement. To cope with these problems, in this paper, we propose a new motion feature, namely the Expanded Relative Motion Histogram of Bag-of-Visual-Words (ERMH-BoW), to employ motion relativity and visual relatedness for event detection. In ERMH-BoW, by representing the what aspect of an event with Bag-of-Visual-Words (BoW), we construct relative motion histograms between visual words to depict object activities, i.e., the how aspect of the event. ERMH-BoW thus integrates both what and how aspects for a complete event description. Instead of motion distribution features, local motion of visual words is employed, which is more discriminative in event detection. Meanwhile, we show that by employing relative motion, ERMH-BoW is able to faithfully describe object activities in an event regardless of varying camera movement. Besides, to alleviate the visual word correlation problem in BoW, we propose a novel method to expand the relative motion histogram. The expansion is achieved by diffusing the relative motion among correlated visual words measured by visual relatedness. To validate the effectiveness of the proposed feature, ERMH-BoW is used to measure video clip similarity with Earth Mover's Distance (EMD) for event detection. We conduct experiments for detecting LSCOM events in the TRECVID 2005 video corpus, and performance is improved by 74% and 24% compared with existing motion distribution features and BoW features, respectively.
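The following toy reduction illustrates the relative-motion idea: differencing the mean motion of two visual words cancels a shared camera-motion component, and the differences are binned by orientation. This is a simplified, hypothetical sketch, not the paper's exact ERMH-BoW construction:

```python
import numpy as np

def relative_motion_histogram(words, motions, n_words, n_bins=8):
    """words: (n,) visual-word id per keypoint (numpy array);
    motions: (n, 2) motion vector per keypoint.
    Returns an (n_words, n_words, n_bins) orientation histogram of the
    relative motion between each pair of visual words."""
    mean_motion = np.zeros((n_words, 2))
    for w in range(n_words):
        pts = motions[words == w]
        if len(pts):
            mean_motion[w] = pts.mean(axis=0)
    hist = np.zeros((n_words, n_words, n_bins))
    for i in range(n_words):
        for j in range(n_words):
            rel = mean_motion[i] - mean_motion[j]      # shared camera motion cancels
            ang = np.arctan2(rel[1], rel[0]) % (2 * np.pi)
            hist[i, j, int(ang / (2 * np.pi) * n_bins) % n_bins] += 1
    return hist
```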
... If dealing with video content, the detection of semantic concepts often follows a keyframe extraction for efficiency reasons, leading to an image retrieval problem (e.g., [10,18]). A valuable source of information beyond such static image content is motion, which has, for example, been employed in the form of motion descriptors [8]. Each keyframe is fed to feature pipelines F_1, ..., F_k, in which visual features are extracted and give scores P_{F_j}(X_i) for each keyframe X_i. ...
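A minimal sketch of how such per-keyframe pipeline scores might be pooled, assuming plain averaging (the cited system's actual fusion rule may differ):

```python
import numpy as np

def late_fusion(scores):
    """scores[j][i] is pipeline F_j's score for keyframe X_i; pool over
    keyframes within each pipeline, then fuse across pipelines."""
    per_pipeline = [np.mean(s) for s in scores]
    return float(np.mean(per_pipeline))

# e.g. two pipelines scoring three keyframes each:
print(late_fusion([[0.9, 0.7, 0.8], [0.4, 0.5, 0.6]]))
```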
Conference Paper
Full-text available
We present a system that automatically tags videos, i.e. detects high-level semantic concepts like objects or actions in them. To do so, our system does not rely on datasets manually annotated for research purposes. Instead, we propose to use videos from online portals like youtube.com as a novel source of training data, whereas tags provided by users during upload serve as ground truth annotations. This allows our system to learn autonomously by automatically downloading its training set. The key contribution of this work is a number of large-scale quantitative experiments on real-world online videos, in which we investigate the influence of the individual system components, and how well our tagger generalizes to novel content. Our key results are: (1) Fair tagging results can be obtained by a late fusion of several kinds of visual features. (2) Using more than one keyframe per shot is helpful. (3) To generalize to different video content (e.g., another video portal), the system can be adapted by expanding its training set.
... Ideally, none of the five dimensions would be compressed in order to best represent motion patterns in a series of video shots. Motion textures [10] retain magnitude, direction, and time, but compress the two dimensions of space. Previous work on using motion for a variety of human-specific motion detection tasks includes motion history images for detection of human motion in videos [16] and unsupervised learning of human action categories [17], to name a few. ...
Conference Paper
Among the various types of semantic concepts modeled, events pose the greatest challenge in terms of computational power needed to represent the event and accuracy that can be achieved in modeling it. We introduce a novel low-level visual feature that summarizes motion in a shot. This feature leverages motion vectors from MPEG-encoded video, and aggregates local motion vectors over time in a matrix, which we refer to as a motion image. The resulting motion image is representative of the overall motion in a video shot, having compressed the temporal dimension while preserving spatial ordering. Building motion models using this feature permits us to combine the power of discriminant modeling with the dynamics of the motion in video shots, which cannot be accomplished by building generative models over a time series of motion features from multiple frames in the video shot. Evaluation of models built using several motion image features in the TRECVID 2005 dataset shows that use of this novel motion feature results in an average improvement in concept detection performance of 140% over existing motion features. Furthermore, experiments also reveal that when this motion feature is combined with static feature representations of a single keyframe from the shot, such as color and texture features, the fused detection results in an improvement of between 4 and 12% over fusion across the static features alone.
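In the spirit of that description, a minimal sketch that collapses a time series of motion vector fields into one matrix while preserving spatial ordering; mean magnitude is one plausible aggregate, not necessarily the paper's:

```python
import numpy as np

def motion_image(mvfs):
    """mvfs: (T, H, W, 2) per-frame motion vectors. Returns an (H, W) matrix
    that compresses the temporal dimension but keeps spatial ordering."""
    mags = np.linalg.norm(mvfs.astype(float), axis=-1)   # (T, H, W) magnitudes
    return mags.mean(axis=0)                             # temporal aggregation
```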
... Ngo et al. [20] analyzed the orientated patterns in a spatio-temporal image slice to capture the dominant motion by detecting the peak of the orientation histogram along the temporal dimension. Ma et al. [22] utilized moment functions to describe the angular distributions of motion vectors from directional slices along the temporal axis. The resulting descriptor is used for classifying motion patterns. ...
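As a loose, simplified nod to that moment-function description (the actual motion texture descriptor works on directional slices along the temporal axis), one might summarize a single field's direction distribution by its low-order moments:

```python
import numpy as np

def direction_moments(mvf, n_moments=4):
    """mvf: (H, W, 2) motion vectors. Returns the first `n_moments` raw
    moments of the motion-direction distribution."""
    ang = np.arctan2(mvf[..., 1], mvf[..., 0]).ravel()   # directions in [-pi, pi]
    return np.array([(ang ** k).mean() for k in range(1, n_moments + 1)])
```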
... Aiming at a nonparametric representation, our approach does not manually introduce heuristic knowledge like subregion partitioning [7], [25] and directional slicing [22]. Rather, we emphasize unsupervised feature space analysis (i.e., mean shift) to capture dominant modes in single or multiple OFF/MVFs for characterizing motion. ...
... Two recent learning-based methods were reported in [9], [22]. A causal spatio-temporal Gibbs model was proposed in [9] to learn the cooccurrences of a sequence of local motion-related measurements to discriminate motion classes of interest. ...
Article
Full-text available
Motion characterization plays a critical role in video indexing. An effective way of characterizing camera motion facilitates the video representation, indexing and retrieval tasks. This paper describes a novel nonparametric motion representation to achieve an effective and robust recognition of parts of the video in which the camera is static, or panning, or tilting, or zooming, etc. This representation employs mean shift filtering and vector histograms to produce a compact description of a motion field. The basic idea is to perform spatio-temporal mode-seeking in the motion feature space and use the histogram-based spatial distributions of dominant motion modes to represent a motion field. Unlike most existing approaches, which focus on the estimation of a parametric motion model from a dense optical flow field (OFF) or a block matching-based motion vector field (MVF), the proposed method combines the motion representation and machine learning techniques (e.g., support vector machines) to perform camera motion analysis from the classification point of view. The main motivation lies in the impossibility of uniformly securing a proper parametric assumption in a wide range of video scenarios. The diverse camera shot sizes and frequent occurrences of bad OFF/MVF necessitate a learning mechanism, which can not only capture the domain-independent parametric constraints, but also acquire the domain-dependent knowledge to tolerate the influence of bad OFF/MVF. In order to improve performance, we can use this learning-based method to train enhanced classifiers aiming at a certain context (i.e., shot size, neighbor OFF/MVFs, and video genre). Other visual cues (e.g., dominant color) can also be incorporated for further motion analysis. Our main aim is to use a generic feature space analysis method to explore a flexible OFF/MVF representation in a nonparametric technique, which could be fed into a learning framework to robustly capture the global motion by incorporating the context information. Results on videos with various types of content (23 191 MVFs culled from the MPEG-7 dataset, and 20 000 MVFs culled from broadcast tennis, soccer, and basketball videos) are reported to validate the proposed approach.
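As a loose illustration of the mean-shift-plus-histogram representation, the sketch below seeks modes in a single motion vector field with scikit-learn's MeanShift and histograms the field over the detected modes; the paper's spatio-temporal mode-seeking is richer than this:

```python
import numpy as np
from sklearn.cluster import MeanShift

def motion_field_descriptor(mvf, bandwidth=2.0):
    """mvf: (H, W, 2) block motion vectors. Returns the dominant motion modes
    and a normalized histogram of vectors assigned to each mode."""
    vecs = mvf.reshape(-1, 2).astype(float)
    ms = MeanShift(bandwidth=bandwidth).fit(vecs)
    hist = np.bincount(ms.labels_, minlength=len(ms.cluster_centers_))
    return ms.cluster_centers_, hist / hist.sum()
```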
... A distance between events is then built, based on the comparison of the empirical histograms of these features. Recently, in [11], the authors have proposed motion pattern descriptors extracted from motion vector fields and have exploited support vector machines (SVMs) for the classification of video clips into semantic categories. For an objective very close to video summarization, namely shot overview, the method in [5] relies on the nonlinear temporal modelling of wavelet-based motion features. ...
Article
Full-text available
We present a method for motion-based video segmentation and segment classification as a step towards video summarization. The sequential segmentation of the video is performed by detecting changes in the dominant image motion, assumed to be related to camera motion and represented by a 2D affine model. The detection is achieved by analysing the temporal variations of some coefficients of the 2D affine model (robustly) estimated. The obtained video segments supply reasonable temporal units to be further classified. For the second stage, we adopt a statistical representation of the residual motion content of the video scene, relying on the distribution of temporal co-occurrences of local motion-related measurements. Pre-identified classes of dynamic events are learned off-line from a training set of video samples of the genre of interest. Each video segment is then classified according to a Maximum Likelihood criterion. Finally, excerpts of the relevant classes can be selected for video summarization. Experiments regarding the two steps of the method are presented on different video genres leading to very encouraging results while only low-level motion information is considered.
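A compact sketch of the two steps the abstract describes: a (non-robust) least-squares fit of the 2D affine motion model, and a boundary detector that flags jumps in the temporal series of affine coefficients; the paper itself uses a robust estimator and a more careful statistical test:

```python
import numpy as np

def affine_params(points, flow):
    """Least-squares fit of u = a1 + a2*x + a3*y, v = a4 + a5*x + a6*y.
    points: (n, 2) pixel coordinates; flow: (n, 2) motion vectors."""
    A = np.column_stack([np.ones(len(points)), points])   # (n, 3) design matrix
    ax, *_ = np.linalg.lstsq(A, flow[:, 0], rcond=None)
    ay, *_ = np.linalg.lstsq(A, flow[:, 1], rcond=None)
    return np.concatenate([ax, ay])                       # 6 affine coefficients

def segment_boundaries(param_series, thresh):
    """param_series: (T, 6) affine coefficients per frame. Flags frames where
    the coefficients jump, i.e. candidate changes in the dominant motion."""
    jumps = np.linalg.norm(np.diff(param_series, axis=0), axis=1)
    return np.where(jumps > thresh)[0] + 1
```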