Conference Paper

Recovering Human Body Configurations: Combining Segmentation and Recognition.


Abstract

The goal of this work is to detect a human figure in an image and localize its joints and limbs, along with their associated pixel masks. We attempt to tackle this problem in a general setting: the dataset we use is a collection of sports news photographs of baseball players, varying dramatically in pose and clothing. Our approach uses segmentation to guide the recognition algorithm to salient parts of the image. With this segmentation approach we build limb and torso detectors, whose outputs are assembled into human figures. We present quantitative results on torso localization, in addition to shortlisted full-body configurations.
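The pipeline the abstract describes (over-segment, score candidate part regions, shortlist assembled figures) can be sketched as follows. This is a minimal illustration, assuming SLIC as the segmentation step and a hypothetical score_torso() classifier stub; the paper's actual detectors and assembly procedure are more involved.

    from skimage import segmentation
    from skimage.measure import regionprops

    def torso_shortlist(image, score_torso, top_k=5):
        # Over-segment so that recognition only considers salient regions.
        labels = segmentation.slic(image, n_segments=200, compactness=10)
        scored = []
        for region in regionprops(labels):
            mask = labels == region.label            # pixel mask of the segment
            scored.append((score_torso(image, mask), mask))
        scored.sort(key=lambda s: s[0], reverse=True)  # best-scoring first
        return scored[:top_k]                        # shortlisted torso candidates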


... The most common features include: image silhouettes [1], for effectively separating the person from background in static scenes; color [16], for modeling un-occluded skin or clothing; edges [16], for modeling external and internal contours of the body; and gradients [5], for modeling the texture over the body parts. Less common features include shading and focus [14]. To reduce dimensionality and increase robustness to noise, these raw features are often encapsulated in image descriptors, such as shape context [1,2,6], SIFT [6] and histogram of oriented gradients [5]. ...
... Different component-based approaches attempt to group regions of an image into body parts and successively assemble those parts into a body. Good examples of such strategies are introduced by Mori et al. [14] and Ren et al. [17]. In [14], superpixels were first assembled into body parts based on the evaluation of low-level image cues, including contour, shape, shading, and focus. ...
... Good examples of such strategies are introduced by Mori et al. [14] and Ren et al. [17]. In [14], superpixels were first assembled into body parts based on the evaluation of low-level image cues, including contour, shape, shading, and focus. The part proposals were then pruned and assembled using length, body part adjacency, and clothing symmetry. ...
Technical Report
Full-text available
Markerless motion capture (MoCap): the ordinary motion of the human body is captured and analyzed without attaching markers or straps. MoCap technology is beneficial for applications ranging from character animation to medical evaluation of gait pathologies. In this paper, we focus mainly on human pose estimation using MoCap technology and data analysis; human pose estimation allows for higher-level reasoning in the context of human-computer interaction and activity recognition. We validate the effectiveness of our approach on the task of articulated human pose estimation. The paper presents and discusses different solutions for determining the human pose.
... Furthermore, the features which belong to the same coherent region are linearly dependent on each other [5]. In this work, superpixels are created using the algorithm mentioned in [19] and the superpixel creation from an input image is illustrated in Figure 1. ...
... In this work, L represents the number of groups/clusters and n represents the number of superpixels. Superpixels are created by the algorithm described in [19]. Algorithm 2 describes the overall framework of the image segmentation procedure. ...
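For concreteness, superpixel creation of the kind these excerpts rely on can be reproduced in a few lines. This sketch uses SLIC as a stand-in for the algorithm cited as [19] above, which may differ; the file name is hypothetical.

    from skimage import io, segmentation

    image = io.imread("input.png")              # hypothetical input image
    superpixels = segmentation.slic(image, n_segments=300, compactness=10)
    print(superpixels.max(), "superpixels created")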
... In recent years, superpixel segmentation has become an integral preprocessing tool in various image processing and computer vision applications such as object detection [1,2], recognition [3], semantic segmentation [4], image classification [5], object proposal detection [6,7], visual tracking [8,9], indoor scene understanding [10], and salient object detection [11][12][13][14][15][16]. Superpixel segmentation of an image partitions it into non-overlapping regions where each constituent region is a grouping of pixels that are similar in color or other low-level cues. ...
... The proposed AWkS algorithm is discussed in Sect. 3. Qualitative and quantitative performance evaluation of different distance measures under AWkS is presented in Sect. 4. Application of the AWkS algorithm for saliency detection constitutes Sect. 5. Finally, Sect. ...
Article
Full-text available
Clustering-inspired superpixel algorithms perform a restricted partitioning of an image, where each visually coherent region containing perceptually similar pixels serves as a primitive in subsequent processing stages. Simple linear iterative clustering (SLIC) has emerged as a standard superpixel generation tool due to its exceptional performance in terms of segmentation accuracy and speed. However, SLIC applies a manually adjusted distance measure for dissimilarity computation, which directly affects the quality of superpixels. In this work, self-adjustable distance measures are adapted from weighted k-means clustering (W-k-means) for generating superpixel segmentations. In the proposed distance measures, an adaptive weight associated with each variable reflects its relevance in the clustering process. Intuitively, the variable weights correspond to the normalization terms in SLIC that affect the trade-off between superpixel boundary adherence and compactness. Weights that influence consistency in superpixel generation are automatically updated. The variable weight update is accomplished during optimization with a closed-form solution based on the current image partition. The proposed adaptive, W-k-means-based superpixels (AWkS), evaluated on three benchmarks under different distance measures, outperform the conventional SLIC algorithm with respect to various boundary adherence metrics. Finally, the effectiveness of AWkS over SLIC is demonstrated for saliency detection.
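The W-k-means-style closed-form weight update that AWkS adapts can be sketched as follows: variables with low within-cluster dispersion receive higher weight. This is a minimal sketch under the standard W-k-means formulation; the exponent beta and the exact update used by AWkS are assumptions.

    import numpy as np

    def update_weights(X, labels, centers, beta=2.0):
        # D[j]: total within-cluster dispersion of variable j
        D = np.zeros(X.shape[1])
        for l, c in enumerate(centers):
            D += ((X[labels == l] - c) ** 2).sum(axis=0)
        D = np.maximum(D, 1e-12)          # guard against zero dispersion
        inv = D ** (-1.0 / (beta - 1.0))  # closed-form W-k-means update
        return inv / inv.sum()            # weights normalized to sum to one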
... As the foothold of basic sports movement teaching and of improving students' physical quality, it is an important part of developing sports activities in PE class. A large number of PE movements are designed by combining various basic movements and arranging changes of rhythm; their movement route and sequence usually have specific requirements, so beginners may have many problems in learning coherent movements [19], mainly the following: ...
Article
Full-text available
For a long time, the situation of students' learning in physical education (PE) has not been optimistic, especially for basic movement learning after class, which has lacked effective online learning tools. With in-depth research on deep neural networks and the rapid development of computer hardware, artificial intelligence technology based on deep learning has performed well in the field of basic teaching. Therefore, in this paper, an intelligent teaching system for basic movements in PE is designed. First, the information of coordinate points is collected according to a Gaussian model, and the pose of students is estimated by OpenPose. Second, the overall architecture and functional modules of the system are designed. Finally, the deviating limbs that affect the standard of overall movements are identified by the matching algorithm, which realises the evaluation of, and feedback on, basic movements in PE. Through this teaching system, teachers can obtain the learning situation of students' movements, and students can adjust their movements through the feedback, which achieves convenient interaction in PE teaching.
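The matching step can be illustrated with a simple limb-orientation comparison: estimated keypoints are compared against a reference pose, and limbs whose angles deviate beyond a tolerance are flagged. The keypoint indices and the 15-degree tolerance are illustrative assumptions, not the paper's parameters.

    import numpy as np

    LIMBS = {"upper_arm_r": (2, 3), "forearm_r": (3, 4)}  # (joint_a, joint_b)

    def limb_angle(kpts, a, b):
        v = kpts[b] - kpts[a]
        return np.degrees(np.arctan2(v[1], v[0]))

    def deviating_limbs(student, reference, tol_deg=15.0):
        flagged = []
        for name, (a, b) in LIMBS.items():
            diff = abs(limb_angle(student, a, b) - limb_angle(reference, a, b))
            diff = min(diff, 360.0 - diff)   # handle angle wrap-around
            if diff > tol_deg:
                flagged.append((name, diff))
        return flagged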
... Humans base their decisions on a wide range of bottom-up and top-down cues, ranging from color to texture to an overall "figure/ground" contour, and the context that surrounds the object to be recognized [1][2][3][4][5][6][7][8][9]. Humans combine or seamlessly switch between such cues [10,11]. These cues help recognize the presence of an "object" instead of accurately predicting the low-level details about it (e.g. ...
Preprint
Full-text available
Some recent artificial neural networks (ANNs) have claimed to model important aspects of primate neural and human performance data. Their demonstrated performance in object recognition is still dependent on exploiting low-level features for solving visual tasks in a way that humans do not. Out-of-distribution or adversarial input is challenging for ANNs. Humans instead learn abstract patterns and are mostly unaffected by certain extreme image distortions. We introduce a set of novel image transforms inspired by neurophysiological findings and evaluate humans and networks on an object recognition task. We show that machines perform better than humans for certain transforms and struggle to perform on par with humans on other transforms that are easy for humans. We quantify the differences in accuracy for humans and machines and find a ranking for our transforms through human data. We also suggest how certain characteristics of human visual processing can be adapted to improve the performance of ANNs for our difficult-for-machines transforms.
... Therefore, some scholars have proposed a second method: OBIA [8]. This approach includes the watershed [9], mean shift [10][11][12], graph-based segmentation with local variations [13], normalized cut (Ncut) [14], Ncut-based Super-Pixel (Ncut-SP) [15,16], turbopixel [17,18], etc. These classification methods aim to divide an image into oversegmented regions that are small in size but high in spectral homogeneity, with small inter-class differences within regions, and each region typically represents the most basic land cover category, such as cropland, buildings, etc. ...
Chapter
In remote sensing image classification, it is difficult to capture both the homogeneity of the same land class and the heterogeneity between different land classes. Moreover, high-spatial-resolution remote sensing images often show fragmentation of ground-object classes and salt-and-pepper noise after classification. To mitigate these phenomena, Markov random fields (MRF) are a widely used method for remote sensing image classification due to their effective description of spatial context. Some MRF-based methods capture more image information by building interactions between pixel granularity and object granularity. Some other MRF-based methods construct representations at different semantic layers of the image to extract the spatial relationships of objects. This paper proposes a new MRF-based method that combines multi-granularity and different semantic layers of information to improve remote sensing image classification. A hierarchical interaction algorithm is proposed that iteratively updates information between different granularity and semantic layers to generate results. The experimental results demonstrate that, on Gaofen-2 imagery, the proposed model shows better classification performance than other methods. Keywords: Markov random field (MRF); Remote sensing image classification; Multiclass-layer
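The pixel-level MRF machinery the chapter builds on can be sketched with iterated conditional modes (ICM) and a Potts prior; the proposed multi-granularity, multi-layer interaction is not reproduced here, and the beta smoothing weight is an illustrative choice.

    import numpy as np

    def icm(unary, beta=1.5, iters=5):
        # unary: (H, W, K) per-pixel class costs, e.g. negative log-likelihoods
        H, W, K = unary.shape
        labels = unary.argmin(axis=2)
        for _ in range(iters):
            for i in range(H):
                for j in range(W):
                    costs = unary[i, j].astype(float)
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < H and 0 <= nj < W:
                            # Potts penalty for disagreeing with a neighbour
                            costs += beta * (np.arange(K) != labels[ni, nj])
                    labels[i, j] = costs.argmin()
        return labels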
... This approach allows the categorization of the images into a defined number of segments. This pixel-grid, however, is not a natural representation of the actual image (Mori et al., 2004). ...
Thesis
Full-text available
Planetary exploration is one of the main goals that humankind has established as a must for space exploration in order to be prepared for colonizing new places and provide scientific data for a better understanding of the formation of our solar system. In order to provide a safe approach, several safety measures must be undertaken to guarantee not only the success of the mission but also the safety of the crew. One of these safety measures is the Autonomous Hazard, Detection, and Avoidance (HDA) sub-system for celestial body landers that will enable different spacecraft to complete solar system exploration. The main objective of the HDA sub-system is to assemble a map of the local terrain during the descent of the spacecraft so that a safe landing site can be marked down. This thesis will be focused on a passive method using a monocular camera as its primary detection sensor due to its form factor and weight, which enables its implementation alongside the proposed HDA algorithm in the Intuitive Machines lunar lander NOVA-C as part of the Commercial Lunar Payload Services technological demonstration in 2021 for the NASA Artemis program to take humans back to the moon. This algorithm is implemented by including two different sources for making decisions, a two-dimensional (2D) vision-based HDA map and a three-dimensional (3D) HDA map obtained through a Structure from Motion process in combination with a plane fitting sequence. These two maps will provide different metrics in order to provide the lander a better probability of performing a safe touchdown. These metrics are processed to optimize a cost function.
... In order to improve the computational and memory efficiency of our construction, we introduce the notion of molecular superpixels. Originally developed in computer vision [Ren and Malik, 2003, Mori et al., 2004, Kohli et al., 2009], superpixels are defined as perceptually uniform regions in the image. In the molecular context, we refer to superpixels as segments on the protein surface capturing higher-level fingerprint features and protein motifs such as hydrophobic binding sites. ...
Preprint
Full-text available
Proteins are fundamental biological entities mediating key roles in cellular function and disease. This paper introduces a multi-scale graph construction of a protein -- HoloProt -- connecting surface to structure and sequence. The surface captures coarser details of the protein, while sequence as primary component and structure -- comprising secondary and tertiary components -- capture finer details. Our graph encoder then learns a multi-scale representation by allowing each level to integrate the encoding from level(s) below with the graph at that level. We test the learned representation on different tasks, (i.) ligand binding affinity (regression), and (ii.) protein function prediction (classification). On the regression task, contrary to previous methods, our model performs consistently and reliably across different dataset splits, outperforming all baselines on most splits. On the classification task, it achieves a performance close to the top-performing model while using 10x fewer parameters. To improve the memory efficiency of our construction, we segment the multiplex protein surface manifold into molecular superpixels and substitute the surface with these superpixels at little to no performance loss.
... Superpixel segmentation is a commonly used preprocessing method in computer vision fields such as target detection and image segmentation [33][34][35]. Superpixel segmentation divides the pixels in an image into different groups with certain properties. The pixels belonging to the same superpixel are similar in texture, brightness, color, or other characteristics [36]. ...
Article
Full-text available
Convolutional neural networks (CNNs) can extract advanced features of joint spectral–spatial information, which are useful for hyperspectral image (HSI) classification. However, patch-based neighborhoods of samples with fixed sizes are usually used as the input of the CNNs, which cannot exploit the homogeneity between the pixels within and outside of the patch. In addition, the spatial features are quite different in different spectral bands, which is not fully utilized by the existing methods. In this paper, a two-branch convolutional neural network based on multi-spectral entropy rate superpixel segmentation (TBN-MERS) is designed for HSI classification. Firstly, entropy rate superpixel (ERS) segmentation is performed on the image of each spectral band in an HSI. The segmented images obtained are stacked band by band, called the multi-spectral entropy rate superpixel segmentation image (MERSI), and then preprocessed to serve as the input of one branch in TBN-MERS. The preprocessed HSI is used as the input of the other branch in TBN-MERS. TBN-MERS extracts features from both the HSI and the MERSI and then utilizes the fused spectral–spatial features for the classification of HSIs. TBN-MERS makes full use of the joint spectral–spatial information of HSIs at the scale of superpixels and the scale of neighborhoods. Therefore, it achieves excellent performance in the classification of HSIs. Experimental results on four datasets demonstrate that the proposed TBN-MERS can effectively extract features from HSIs and significantly outperforms some state-of-the-art methods with a few training samples.
... 3) From the definition of superpixel. The definition of superpixel determines that the SI can preserve both the local structure and texture information [30], [31]. In other words, SI contains mid-level semantic information [32]. ...
Preprint
Full-text available
Anomaly detection in medical images refers to the identification of abnormal images with only normal images in the training set. Most existing methods solve this problem with a self-reconstruction framework, which tends to learn an identity mapping and reduces the sensitivity to anomalies. To mitigate this problem, in this paper, we propose a novel Proxy-bridged Image Reconstruction Network (ProxyAno) for anomaly detection in medical images. Specifically, we use an intermediate proxy to bridge the input image and the reconstructed image. We study different proxy types, and we find that the superpixel-image (SI) is the best one. We set all pixels' intensities within each superpixel to their average intensity, and denote this image as the SI. The proposed ProxyAno consists of two modules, a Proxy Extraction Module and an Image Reconstruction Module. In the Proxy Extraction Module, a memory is introduced to memorize the feature correspondence for a normal image to its corresponding SI, while the memorized correspondence does not apply to abnormal images, which leads to information loss for abnormal images and facilitates anomaly detection. In the Image Reconstruction Module, we map an SI to its reconstructed image. Further, we crop a patch from the image and paste it on the normal SI to mimic anomalies, and enforce the network to reconstruct the normal image even with the pseudo-abnormal SI. In this way, our network enlarges the reconstruction error for anomalies. Extensive experiments on brain MR images, retinal OCT images and retinal fundus images verify the effectiveness of our method for both image-level and pixel-level anomaly detection.
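The superpixel-image construction defined above is straightforward to reproduce: every pixel takes the average intensity of its superpixel. This sketch assumes an RGB input and SLIC as the superpixel algorithm, which the paper does not necessarily use.

    import numpy as np
    from skimage import segmentation

    def superpixel_image(img):
        # img: (H, W, 3) RGB image
        labels = segmentation.slic(img, n_segments=400, compactness=10)
        si = np.zeros_like(img, dtype=float)
        for l in np.unique(labels):
            mask = labels == l
            si[mask] = img[mask].mean(axis=0)  # average intensity per superpixel
        return si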
... Thus, bottom-up pose estimation methods have fewer limits on their application and are more robust to rapid movements. In the past decades, more and more researchers have focused on these bottom-up approaches [53][54][55][56][57][58]. Among these methods, pictorial-structure-based methods are the most successful techniques for bottom-up pose estimation. ...
Thesis
Human pose estimation is a challenging problem in computer vision and shares all the difficulties of object detection. This thesis focuses on the problems of human pose estimation in still images or video, including the diversity of appearances, changes in scene illumination and confounding background clutter. To tackle these problems, we build a robust model consisting of the following components. First, top-down and bottom-up methods are combined to estimate human pose. We extend the Pictorial Structure (PS) model to cooperate with an annealed particle filter (APF) for robust multi-view pose estimation. Second, we propose an upper-body-based multiple mixture parts (MMP) model for human pose estimation that contains two stages. In the pre-estimation stage, there are three steps: upper body detection, model category estimation for the upper body, and full model selection for pose estimation. In the estimation stage, we address the problem of a variety of human poses and activities. Finally, a Deep Convolutional Neural Network (DCNN) is introduced for human pose estimation. A Local Multi-Resolution Convolutional Neural Network (LMR-CNN) is proposed to learn the representation of each body part. Moreover, an LMR-CNN-based hierarchical model is defined to meet the structural complexity of limb parts. The experimental results demonstrate the effectiveness of the proposed model.
... Similar nodes are assigned higher weights. Superpixels are then created by minimizing a cost function defined over the graph [42]. In the field of HSI processing, the entropy rate superpixel (ERS) [43] is the most commonly used graph-based superpixel segmentation algorithm. ...
Article
Full-text available
Recently, many spectral-spatial classification models have emerged one after another in the remote sensing community. These models aim to introduce the spatial information of a pixel to improve the accuracy of its class attribute. However, for spectral-spatial classification algorithms, not all pixels need the corresponding spatial information, since using a large amount of spatial information is costly in time. To solve this problem, this paper proposes a robust dual-stage spatial embedding (RDSSE) model for spectral-spatial classification of hyperspectral images, which is composed of the following main steps: First, an over-segmentation algorithm is employed to cluster the original hyperspectral image into many superpixel blocks with shape-adaptive characteristics. Then, we design a k-peak criterion to fuse the spectral features of pixels within and between superpixels. Next, a low-time-consumption spectral classifier is introduced to conduct a primary classification of a testing pixel to obtain the corresponding probability distribution. Specifically, the difference between the probability of the largest class and that of the second-largest class serves as the class confidence. Finally, the predicted label of the low-confidence testing pixels is reclassified based on a high-accuracy spectral-spatial classification method. Experimental results on several real images illustrate that the proposed RDSSE can obtain superior performance with respect to several competitive methods.
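The dual-stage gating idea reduces to a few lines: confidence is the margin between the two largest class probabilities, and only low-confidence pixels are passed to the expensive spectral-spatial classifier. The 0.2 threshold is an illustrative assumption.

    import numpy as np

    def low_confidence_mask(probs, threshold=0.2):
        # probs: (N, K) class probabilities from the fast spectral classifier
        srt = np.sort(probs, axis=1)
        margin = srt[:, -1] - srt[:, -2]   # top-1 minus top-2 probability
        return margin < threshold          # True -> send to the slow classifier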
... Both of them assume that the original cluster is composed of two non-overlapping subclusters, and only one of them is effective, achieving excellent classification performance in [38][39][40]. Superpixel segmentation has been extensively developed in computer vision [41][42][43] in recent years. According to the texture structure of the image, superpixel segmentation algorithms can cluster the image into many non-overlapping homogeneous regions. ...
Article
Full-text available
This paper presents a composite kernel method (MWASCK) based on multiscale weighted adjacent superpixels (ASs) to classify hyperspectral images (HSI). The MWASCK adequately exploits the spatial-spectral features of weighted adjacent superpixels to guarantee that more accurate spectral features can be extracted. Firstly, we use a superpixel segmentation algorithm to divide the HSI into multiple superpixels. Secondly, the similarities between each target superpixel and its ASs are calculated to construct the spatial features. Finally, a weighted AS-based composite kernel (WASCK) method for HSI classification is proposed. In order to avoid searching for the optimal superpixel scale and to fuse the multiscale spatial features, the MWASCK method uses multiscale weighted superpixel neighbor information. Experiments on two real HSIs indicate the superior performance of the WASCK and MWASCK methods compared with some popular classification methods.
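A generic weighted composite kernel of the kind MWASCK builds on combines spectral and superpixel-derived spatial features; the sketch below uses RBF kernels and abstracts away the paper's adjacency weighting. The mixing weight mu and gamma are illustrative.

    import numpy as np

    def rbf(A, B, gamma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def composite_kernel(spec_a, spec_b, spat_a, spat_b, mu=0.6):
        # convex combination of spectral and spatial kernels
        return mu * rbf(spec_a, spec_b) + (1 - mu) * rbf(spat_a, spat_b)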
... Superpixel segmentation aims to obtain local regions with appearance and location consistency. It is used to extract perceptually meaningful element regions, which significantly reduces the computation complexity for other computer vision applications, such as saliency detection [1], [2], object segmentation [3]- [7], object detection [8] and recognition [9], and biomedical image analysis [10]. ...
Article
Full-text available
Superpixel segmentation can benefit computer vision tasks due to its perceptually meaningful results with similar appearance and location. To obtain accurate superpixel segmentations, existing methods introduce geodesic distance to fit object boundaries. However, conventional geodesic distance easily suffers from error accumulation and excessive time consumption. This paper proposes a fast superpixel segmentation method based on a new geodesic distance, called the forgetting geodesic distance. In contrast to the conventional geodesic distance, the forgetting geodesic distance utilizes a forgetting factor to gradually reduce the influence of previous path cost and focuses on the latest pixels' differences. Intuitively, a pixel with a lower difference with respect to the latest path contextual distance will be more similar to the corresponding region. In the new path cost, much greater attention is devoted to the latest pixels' differences, which significantly relieves error accumulation. The pixels are also aggregated with less dependence on seeds as the path extends, which avoids seed updating. The experimental results validate that the proposed method obtains 2 percent and 1 percent gains on average compared with most of the state-of-the-art methods on the BSD500 and VOC2012 datasets, respectively.
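The path-cost recurrence the abstract suggests can be written in isolation: a forgetting factor lam in (0, 1) decays the accumulated cost so the newest pixel differences dominate, while lam = 1 recovers the conventional geodesic accumulation. The full seed-growing loop is omitted, and lam = 0.7 is an assumed value.

    def forgetting_path_cost(diffs, lam=0.7):
        # diffs: successive pixel differences along a path
        cost = 0.0
        for d in diffs:
            cost = lam * cost + d   # older differences are gradually forgotten
        return cost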
... Superpixels can represent images more concisely and efficiently than pixels, which can significantly reduce the complexity of subsequent image processing. After the concept of superpixels was proposed by Ren and Malik [1] in 2003, this simplified image representation became a key step in many computer vision tasks, such as segmentation [2][3][4][5][6], saliency detection [7][8][9], object recognition [10,11], modelling [12,13], 3D reconstruction [14], and other vision tasks [15][16][17]. For these applications, high-quality superpixels are desired to meet the following properties: ...
Article
Full-text available
Superpixel segmentation has become a crucial tool in many image processing and computer vision applications. In this paper, a novel content-sensitive superpixel generation algorithm with boundary adjustment is proposed. First, the local image entropy is used to measure the amount of information in the image, and the amount of information is evenly distributed to each seed. This places more seeds in content-dense regions to achieve lower under-segmentation, and fewer seeds in content-sparse regions to increase computational efficiency. Second, the Prim algorithm is adopted to generate uniform superpixels efficiently. Third, a boundary adjustment strategy with an adaptive distance further optimizes the superpixels to improve their performance. Experimental results on the Berkeley Segmentation Database show that our method outperforms competing methods under the evaluation metrics.
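The entropy-driven seed allocation can be sketched by summing a local-entropy map over grid cells and distributing the seed budget proportionally. The grid granularity, disk radius, and seed budget are illustrative assumptions.

    import numpy as np
    from skimage.filters.rank import entropy
    from skimage.morphology import disk
    from skimage.util import img_as_ubyte

    def seeds_per_cell(gray, n_seeds=300, grid=8):
        ent = entropy(img_as_ubyte(gray), disk(5))   # local entropy map
        H, W = ent.shape
        cells = np.array([[ent[i*H//grid:(i+1)*H//grid,
                               j*W//grid:(j+1)*W//grid].sum()
                           for j in range(grid)] for i in range(grid)])
        # proportional seed budget per cell (content-dense cells get more)
        return np.round(n_seeds * cells / cells.sum()).astype(int)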
... In this article, we aim to bring some arguments to the aforementioned questions and to answer the last one in an over-segmentation scenario. Over-segmentation still constitutes a current trend, dating back to the SuperPixels [2] and TurboPixels [3] approaches. The philosophy behind over-segmentation is to obtain a dense segmentation map which offers more flexibility to subsequent tasks, such as object detection and recognition [4]. ...
Article
Full-text available
It is said that image segmentation is a very difficult or complex task. First of all, we emphasize the subtle difference between the notions of difficulty and complexity. Then, in this article, we focus on the question of how two widely used color image complexity measures correlate with the number of segments resulting in over-segmentation. We study the evolution of both the image complexity measures and number of segments as the image complexity is gradually decreased by means of low-pass filtering. In this way, we tackle the possibility of predicting the difficulty of color image segmentation based on image complexity measures. We analyze the complexity of images from the point of view of color entropy and color fractal dimension and for color fractal images and the Berkeley data set we correlate these two metrics with the segmentation results, more specifically the number of quasi-flat zones and the number of JSEG regions in the resulting segmentation map. We report on our experimental results and draw conclusions.
... In our work, we analyze various algorithms for segmenting rock images into lithotypes, both by extracting superpixels (semantically similar image regions) using machine vision methods [Mori et al., 2004; Xiaofeng and Malik, 2003] and by means of semantic image segmentation algorithms using a U-Net-type deep neural network [Ronneberger et al., 2015]. ...
Conference Paper
Full-text available
The computational power of reservoir modelling is growing nowadays, enabling the use of more precise core descriptions. The industry needs high-accuracy models for precise reserves estimation. As a way to improve that, different authors have proposed semiautomatic image segmentation algorithms based on color-space approaches. Segmentation algorithms are common in machine vision, as most images consist of semantically different parts. This paper focuses on the review and application of different machine vision algorithms for semi-supervised segmentation of full core images based on a superpixel approach. Such an approach takes into account pixel groups with their semantic (texture, intensities, etc.) meaning. The reviewed algorithms can contribute to the precise description of rocks at different scales. An automatic way to segment lithotypes and other characteristics of rock is introduced. A U-Net-like convolutional neural network fine-tuned on a small dataset may produce meaningful results.
... Image parsing consists of three key characteristics: 1) integration of generic image segmentation and object recognition; 2) combination of bottom-up (object detection, edge detection, image segmentation) and top-down modules (object recognition, shape prior); 3) competition between background regions and foreground objects through analysis-and-synthesis. Similar methods have been developed for grammar-based models [59], scene understanding [30,17], human pose recognition [40], segmentation [4], and object detection [28]. Scene parsing bears a ...
Preprint
Recently, the vision community has shown renewed interest in the effort of panoptic segmentation --- previously known as image parsing. While a large amount of progress has been made within both the instance and semantic segmentation tasks separately, panoptic segmentation implies knowledge of both (countable) "things" and semantic "stuff" within a single output. A common approach involves the fusion of respective instance and semantic segmentations proposals, however, this method has not explicitly addressed the jump from instance segmentation to non-overlapping placement within a single output and often fails to layout overlapping instances adequately. We propose a straightforward extension to the Mask R-CNN framework that is tasked with resolving how two instance masks should overlap one another in the fused output as a binary relation. We show competitive increases in overall panoptic quality (PQ) and particular gains in the "things" portion of the standard panoptic segmentation benchmark, reaching state-of-the-art against methods with comparable architectures.
... Many applications use knowledge of the 3D structure of an object to assist in image analysis. Applications include estimating limb orientation in humans [Mori et al., 2004;Rogez et al., 2008] and animals [Gardenier et al., 2018], or for assisting in the classification of common objects. Liebelt and Schmid [Liebelt and Schmid, 2010] use detailed 3D models of the object's geometry to assist with identifying those objects in images. ...
Conference Paper
Full-text available
Finding corresponding features between stereo images collected in low-contrast environments (e.g. underwater) can be challenging. We present a technique for ranging objects from stereo images which does not rely on locating corresponding features between frames. The rotation of the object relative to the two cameras is estimated using a predictor trained on renders of a simplified computer-generated 3D model of the object. The difference in apparent rotation of the object between the two camera frames, the rotational disparity, is used to constrain its possible positions, and allows its location to be determined based on its appearance within each of the stereo frames independently, without the need for obtaining point correspondences on the object. The implementation presented uses a convolutional neural network trained on rendered images to estimate the rotation of an object in real images. The network correctly estimated the angle of renders (to within 2°) of a model with an accuracy of 84.1%, and predicted the angle of a real object from images with a mean error of 2.7°. A second network trained to directly estimate the rotational disparity between a pair of renders predicted correctly (to within 2°) with 100% accuracy on the artificial renders, but failed to generate meaningful predictions from real images.
... Segmentation-based tracking algorithms have been investigated actively to better represent non-rigid targets. In particular, the superpixel has been one of the most promising representations, with demonstrated success in image segmentation [7] and object recognition [8]. These methods segment the image into numerous superpixels, which not only offers more flexibility in modeling the target's constituent parts by well preserving their edge information, but also reduces the complexity of sophisticated image processing tasks, since the number of superpixels is much smaller than the number of pixels when adopting a sliding window or particle sampling scheme. ...
Conference Paper
Full-text available
Visual tracking, a fundamental task in computer vision, has been criticized as less well-posed, since reliable target information is only given at the first frame. In this case, most existing template-matching-based trackers fail to locate the target when non-rigid deformations or variations occur. To address these issues, we propose a principled way to take advantage of superpixel labeling and discriminative tracking algorithms. For each frame, a correlation tracker is first adopted to provide the coarse target location. Afterwards, a collaborative segmentation approach is advocated to segment the surrounding region of the target into superpixels. Target appearance and motion trajectory are considered as spatial and temporal constraints and incorporated into the superpixel labeling module. The fine-segmentation result, in turn, provides a more accurate target status for template updating. The effectiveness of the proposed algorithm is validated through experimental comparison on widely used tracking benchmark datasets.
... The problem of object parsing, which aims to decompose objects into their semantic parts, has been addressed by numerous works [27,29,38,43,45], most of which have concentrated on parsing humans. However, none of the aforementioned works have parsed objects at an instance level as shown in Fig. 1, but rather category level. ...
... These MSER are stable under different imaging conditions and can be captured under different viewpoints. In [RM03,MREM04], super-pixels obtained using normalised cuts are used as regions for feature extraction and since all the super-pixels extracted from an image have similar scales, this method is not scale invariant. Some of the other affine feature/region detectors present in the literature are Intensity Based Regions (IBR) [TG04] and Edge Based Regions (EBR) [TG04]. ...
Thesis
Nowadays, computer vision algorithms can be found abundantly in applications related to video surveillance, 3D reconstruction, autonomous vehicles, medical imaging, etc. Image/object matching and detection forms an integral step in many of these algorithms. The most common methods for image/object matching and detection are based on local image descriptors, where interest points in the image are initially detected, followed by extracting the image features from the neighbourhood of the interest point and, finally, constructing the image descriptor. In this thesis, contributions to the field of image feature matching using rotating half filters are presented. Here we follow three approaches: first, by presenting a new low bit-rate descriptor and a cascade matching strategy, which are integrated on a video platform; secondly, by constructing a new local image patch descriptor that embeds the response of rotating half filters in the Histogram of Oriented Gradients (HoG) framework; and finally, by proposing a new approach for descriptor construction using second-order image statistics. All three approaches provide interesting and promising results, outperforming the state-of-the-art descriptors. Keywords: Rotating half filters, local image descriptor, image matching, Histogram of Oriented Gradients (HoG), Difference of Gaussian (DoG).
... In [96], a simple 2D model in the form of a stick figure is fitted to a silhouette. The authors of [197] localize joints and limbs in an image based on multiple visual cues. A method for recovering a 3D human pose from a monocular image is presented in [3]. ...
Thesis
Full-text available
Technology plays an important role in modern sport. Athletes and coaches benefit from the development of methods for automatic analysis of sports motion. Significant progress in this field has been made in recent years, however several relevant challenges still remain. In this work, novel methods for motion analysis are proposed, which address these challenges. A single sports discipline, namely fencing, was chosen for the evaluation of the proposed methods. Fencing is a very technical sport, in which all discussed issues are relevant. The research in this thesis considers three important subjects related to sports analysis. Firstly, recognition of sport-specific actions is addressed. General action recognition methods are not sufficient for sports analysis, since sports actions include different motion pattern and are characterized by different parameters. In particular, actions with similar trajectories, but different dynamics of motion can correspond to different techniques. In this work, novel methods for extraction, selection and fusion of features relevant for sports actions, based on visual and inertial signals, are proposed, and applied to classification of basic fencing footwork actions. The second subject is devoted to the temporal segmentation and qualitative analysis. In order to analyze sports actions it is required to perform temporal segmentation of the continuous motion in the captured training routine. The qualitative analysis of the detected action segments allows to provide the athletes with relevant information regarding the performed actions. In this work novel methods for model-based adaptive signal filtering are proposed, which allow to efficiently detect lunge actions in continuous fencing footwork routine, based on visual and inertial data. The qualitative parameters of the lunge actions are determined, and delivered to fencers in real-time during practice. The third subject is related to the issue of providing feedback.
... Superpixel segmentation aims at obtaining local regions with appearance and location consistency. It is used to extract perceptually meaningful element regions, which significantly reduces the computation complexity for other computer vision applications, such as object segmentation [1], [2], object detection [3] and recognition [4]. ...
Article
This letter proposes a fast superpixel segmentation method based on boundary sampling and interpolation. The basic idea is as follows: instead of labeling all local region pixels, we estimate superpixel boundaries by interpolating candidate boundary pixels from a down-sampled image segmentation. On the one hand, there exists high spatial redundancy within each local region, which can be discarded. On the other hand, we estimate the labels of candidate boundary pixels by sampling the superpixel boundary within the corresponding neighbourhood. Benefiting from the reduction in candidate pixel distance calculations, the proposed method significantly accelerates superpixel segmentation. Experiments on the BSD500 benchmark demonstrate that our method needs half the time compared with the state-of-the-art methods, with almost no reduction in accuracy.
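The sampling-and-interpolation idea can be sketched as follows: segment a down-sampled image, upscale the label map with nearest-neighbour interpolation, and mark only pixels near the upscaled superpixel boundaries for re-examination. SLIC, the RGB assumption, and the refinement rule are stand-ins for the letter's actual components.

    import numpy as np
    from skimage import segmentation, transform

    def fast_superpixels(img, factor=2, n_segments=300):
        # img: (H, W, 3) RGB image
        small = transform.rescale(img, 1 / factor, channel_axis=-1,
                                  anti_aliasing=True)
        labels = segmentation.slic(small, n_segments=n_segments)
        up = transform.resize(labels, img.shape[:2], order=0,
                              preserve_range=True).astype(int)  # nearest-neighbour
        boundary = segmentation.find_boundaries(up)  # candidate pixels to refine
        return up, boundary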
Article
Fractures in reservoirs have a profound impact on hydrocarbon production operations. The more accurately fractures can be detected, the better the exploration and production processes can be optimized. Therefore, fracture detection is an essential step in understanding the reservoir's behavior and the stability of the wellbore. The conventional method for detecting fractures is image logging, which captures images of the borehole and fractures. However, the interpretation of these images is a laborious and subjective process that can lead to errors, inaccuracies, and inconsistencies, even when aided by software. Automating this process is essential for expediting operations, minimizing errors, and increasing efficiency. While there have been some attempts to automate fracture detection, this paper takes a novel approach by proposing the use of YOLOv5 as a Deep Learning (DL) tool to detect fractures automatically. YOLOv5 is unique in that it excels at speed, both in training and detection, while maintaining high accuracy in fracture detection. We observed that YOLOv5 can detect fractures in near real-time with a high mean average precision (mAP) of 98.2, requiring significantly less training than other DL algorithms. Furthermore, our approach overcomes the shortcomings of other fracture detection methods. The proposed method has many potential benefits, including reducing manual interpretation errors, decreasing the time required for fracture detection, and improving fracture detection accuracy. Our approach can be utilized in various reservoir engineering applications, including hydraulic fracturing design, wellbore stability analysis, and reservoir simulation. By using this technique, the efficiency and accuracy of hydrocarbon exploration and production can be significantly improved.
Article
Full-text available
Some recent artificial neural networks (ANNs) claim to model aspects of primate neural and human performance data. Their success in object recognition is, however, dependent on exploiting low-level features for solving visual tasks in a way that humans do not. As a result, out-of-distribution or adversarial input is often challenging for ANNs. Humans instead learn abstract patterns and are mostly unaffected by many extreme image distortions. We introduce a set of novel image transforms inspired by neurophysiological findings and evaluate humans and ANNs on an object recognition task. We show that machines perform better than humans for certain transforms and struggle to perform on par with humans on others that are easy for humans. We quantify the differences in accuracy for humans and machines and find a ranking of difficulty for our transforms for human data. We also suggest how certain characteristics of human visual processing can be adapted to improve the performance of ANNs for our difficult-for-machines transforms.
Article
Deep neural networks (DNNs) have underpinned most of the recent progress in hyperspectral image (HSI) classification. One premise of their success lies in high image quality without noise corruption. However, due to the limitations of imaging sensors and imaging conditions, HSIs captured in practice inevitably suffer from random noise, which degrades the generalization performance and robustness of most existing DNN-based methods. In this study, we propose a dynamic super-pixel normalization based DNN for HSI classification, which can adaptively relieve the negative effect caused by various types of noise corruption and improve the generalization performance. To achieve this goal, we propose a dynamic super-pixel normalization module which, for a given super-pixel, normalizes the inner pixel features using parameters dynamically generated from themselves. By doing this, such a module adaptively restores the similarity among pixels within a super-pixel corrupted by random noise by aligning their feature distribution, thus enhancing the generalization performance on noisy HSIs. Moreover, it can be directly plugged into any other existing DNN architecture. To appropriately train the proposed DNN model, we further present a semi-supervised learning framework, which integrates the cross-entropy loss and Kullback-Leibler divergence loss on labeled samples with the information entropy loss on unlabeled samples for joint learning to sidestep over-fitting. Experiments on three benchmark HSI classification datasets demonstrate the advantages of the proposed method over several state-of-the-art competitors in handling HSIs under different types of noise corruption.
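The core normalization step can be sketched statically: pixel features inside each super-pixel are aligned by standardizing with their own statistics. The paper generates the scale and shift dynamically from the features; fixed standardization is used here purely for illustration.

    import numpy as np

    def superpixel_normalize(features, labels, eps=1e-6):
        # features: (N, C) pixel features; labels: (N,) super-pixel index per pixel
        out = np.empty_like(features, dtype=float)
        for l in np.unique(labels):
            f = features[labels == l]
            out[labels == l] = (f - f.mean(0)) / (f.std(0) + eps)
        return out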
Article
Superpixel generation is an increasingly important area for computer vision tasks. While superpixels with highly regular shapes are preferred to make subsequent processing easier, the accuracy of superpixel boundaries is also necessary. Previous methods usually depend on a distance function considering both spatial and color coherency regularization over the whole image, which, however, makes it hard to balance shape regularity against boundary adherence, especially when the desired number of superpixels is small. In addition, non-adaptive parameters and insufficient contour information also affect the performance of segmentation. To mitigate these problems, we propose a robust divide-and-conquer superpixel segmentation method. The core idea is to apply a new contour information extraction and a pixel clustering to separate the input image into flat and non-flat regions, where the former targets shape regularity and the latter emphasizes boundary adherence, followed by an efficient hierarchical merging to clean up tiny and dangling superpixels. Our algorithm requires no additional parameter tuning except the desired number of superpixels, since our internal parameters are self-adaptive to the image contents. Experimental results demonstrate that, on public benchmark datasets, our algorithm consistently generates more regular superpixels with stronger boundary adherence than state-of-the-art methods while maintaining competitive efficiency. The source code is available at https://github.com/YunyangXu/HQSGRD .
Article
In this paper, we present BodyTrak, an intelligent sensing technology that can estimate full body poses on a wristband. It only requires one miniature RGB camera to capture the body silhouettes, which are learned by a customized deep learning model to estimate the 3D positions of 14 joints on the arms, legs, torso, and head. We conducted a user study with 9 participants in which each participant performed 12 daily activities, such as walking, sitting, or exercising, in varying scenarios (wearing different clothes, outdoors/indoors) with a different number of camera settings on the wrist. The results show that our system can infer the full body pose (3D positions of 14 joints) with an average error of 6.9 cm using only one miniature RGB camera (11.5 mm x 9.5 mm) on the wrist pointing towards the body. Based on the results, we discuss the possible applications, challenges, and limitations of deploying our system in real-world scenarios.
Article
An innovative approach to the representation and description of shape components for object recognition, based on the complex potential, is proposed. In the complex plane, the flow velocity is a crucial factor for discriminating different shapes. Hence, the present paper computes the potential flow by transforming the shape of the input object into the complex plane. The Vortex-based Complex Potential (VCP) signature is computed by considering the radial lines as equipotential lines and the circles as streamlines. The proposed VCP signature is described with the Fourier transform to generate the feature vector. The Chebyshev distance measure is used in the shape matching stage. The efficiency of the proposed descriptor is evaluated on the MPEG-CE-1 Set B database. The results prove the competence of the proposed descriptor compared to the benchmark descriptors.
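A hedged sketch of the descriptor pipeline: a contour is mapped into the complex plane, described by Fourier magnitudes, and compared with the Chebyshev (L-infinity) distance. The plain complex contour here stands in for the vortex/complex-potential signature itself, and the coefficient count is an assumption.

    import numpy as np

    def fourier_descriptor(contour_xy, n_coeffs=32):
        z = contour_xy[:, 0] + 1j * contour_xy[:, 1]  # shape in the complex plane
        mag = np.abs(np.fft.fft(z))[1:n_coeffs + 1]   # drop the DC term
        return mag / (mag[0] + 1e-12)                 # normalize for scale invariance

    def chebyshev(a, b):
        return np.max(np.abs(a - b))                  # L-infinity distance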
Thesis
Gastrointestinal (GI) diseases are among the most frequently occurring diseases that pose a significant threat to people’s health. Endoscopic techniques represent the gold standard for GI disease diagnosis. Endoscopic examinations are, however, resource-intensive and highly demanding in terms of equipment cost and the need for trained personnel. Also, the evaluation of the severity and sub-classification of different endoscopic findings may vary from one physician to another. Accurate detection and classification of GI anomalies are crucial for effective treatment planning. Numerous computer-aided GI image processing and classification techniques have been proposed. In this thesis, GI image segmentation and classification were explored based on two superpixel segmentation methods, namely simple linear iterative clustering (SLIC) and linear spectral clustering (LSC). Experiments were carried out on the Kvasir dataset which includes GI tract images from the upper part (esophagitis and z-line), middle part (pylorus and polyps), and the lower part (cecum and ulcerative colitis). For these images, three types of features were extracted and fed into a support vector machine (SVM) classifier. The GI images were evaluated and compared with and without superpixel segmentation. LSC-based superpixels were shown to be generally more intuitive, perceptually satisfactory, and uniform compared to those of the SLIC method. Moreover, in comparison to pixel-wise classification results, the results obtained by the superpixel-based classification were found to be generally better (for both the LSC- and SLIC-based methods). Experimental results show that the performance of each superpixel method varies from one GI part to another. For the upper GI tract, the SLIC-based classifier outperformed the LSC-based one, reaching accuracy, sensitivity, and specificity values of 77.33%, 77.89%, and 76.8%, respectively. However, for both middle and lower GI parts, the LSC-based classifier outperformed the SLIC-based one. Accuracy, sensitivity, and specificity values were respectively: 98.5%, 100%, and 97.1% for the middle GI tract, and 93.67%, 91.72%, and 95.8% for the lower GI tract. In terms of computational time, the SLIC-based method was moderately faster than the LSC one.
Article
Anomaly detection in medical images refers to the identification of abnormal images with only normal images in the training set. Most existing methods solve this problem with a self-reconstruction framework, which tends to learn an identity mapping and reduces the sensitivity to anomalies. To mitigate this problem, in this paper, we propose a novel Proxy-bridged Image Reconstruction Network (ProxyAno) for anomaly detection in medical images. Specifically, we use an intermediate proxy to bridge the input image and the reconstructed image. We study different proxy types, and we find that the superpixel-image (SI) is the best one. We set all pixels' intensities within each superpixel to their average intensity, and denote this image as the SI. The proposed ProxyAno consists of two modules, a Proxy Extraction Module and an Image Reconstruction Module. In the Proxy Extraction Module, a memory is introduced to memorize the feature correspondence for a normal image to its corresponding SI, while the memorized correspondence does not apply to abnormal images, which leads to information loss for abnormal images and facilitates anomaly detection. In the Image Reconstruction Module, we map an SI to its reconstructed image. Further, we crop a patch from the image and paste it on the normal SI to mimic anomalies, and enforce the network to reconstruct the normal image even with the pseudo-abnormal SI. In this way, our network enlarges the reconstruction error for anomalies. Extensive experiments on brain MR images, retinal OCT images and retinal fundus images verify the effectiveness of our method for both image-level and pixel-level anomaly detection.
Chapter
Boundary extraction; Contour detection
Article
Change detection on the surface of the earth plays an important role in understanding global-scale patterns of climate and the biogeochemistry of the world, which helps to comprehend the connections and associations between humans and nature. Remote Sensing and Geographic Information Systems can provide accurate data regarding land use and land cover changes. However, pixel-based change detection methods are limited in suppressing outliers and noise; they often fail to process remote sensing images with high spatial/spectral resolution. To overcome these drawbacks, a superpixel-level change detection and analysis method is proposed in this paper. Superpixels are atomic regions gathering pixels with similar properties, and are more efficient and robust than pixels. A deep neural network is a powerful feature learning and classification tool; it can represent superpixels abstractly and classify them robustly. The learning process of the deep architecture includes unsupervised sample selection and supervised feature learning: the unsupervised stage aims at selecting training samples for the deep neural network, and the supervised stage aims at learning the representation of superpixels and fine-tuning the whole network to finish the classification. Experimental results on multi-temporal images demonstrate that the proposed approach can handle the task of change detection and analysis effectively and accurately.
Article
Recognition of the malposition and location of high-temperature forgings plays a critical role in the realisation of robotised die forging, which is an ongoing trend in intelligent manufacturing. This study is aimed at the robotised die forging of the scraper beam of an armoured face conveyor, which is the only transporting equipment used in the coal mine workface. Firstly, a novel process to recognise the malposition and location of high-temperature forgings using two monocular cameras, one placed horizontally and another vertically, is proposed. Secondly, a novel image preprocessing algorithm combining grey linear transformation, exponential transformation, and median filtering is proposed. After processing the high-temperature forging image using the proposed preprocessing algorithm, the grey difference between the target region and background region of the processed image is greatly increased. This is conducive to the subsequent image segmentation and contour extraction processes conducted on the forging region. Thirdly, after comparison and analysis of three commonly used image segmentation methods, namely edge detection, threshold segmentation, and region growing, it is found that the region growing method is suitable for image segmentation of high-temperature forgings. Fourthly, a two-way modified, blob-analysis-based forging location algorithm is proposed to reduce the location error caused by the axially and radially asymmetric flash produced during the forging process. Finally, the proposed algorithms are validated by experiments, and the location recognition error of the proposed location algorithm is only 0.86059 mm. This study provides technical support for the realisation of robotised die forging.
Article
Full-text available
The proposed work employs a segmentation method using patches and labels to segment the diseased portion of citrus canker leaves. The patches-and-labels method is based on region merging, color mapping and clustering techniques, with statistical tests to determine the merging of regions. The method utilizes the color feature of the leaf images, whereby the leaf image can be segmented into multiple parts by its colors. The color intensity feature of the leaf image is used as the basis for grouping pixels into patches. A range of colors is considered, and the respective pixels within each color range are grouped to form patches based on color threshold levels (Q values). The leaf image is represented at 9 different color threshold levels (Q), where the nth threshold level applies 2^(n-1) colors in the color space for further color mapping to form patches. The patched image divides and represents the different regions of the leaf image as the segmented output. The patched image, formed by grouping pixels within neighborhood connectivity, is represented as a collection of clustered color patches with labels. The boundary information of each labeled patch is obtained. The labeling of the clustered color patches aids in separating the region of interest from the remaining, uninteresting regions.
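The Q-level patching can be sketched by quantizing intensities into 2^(n-1) bins at level n and labeling connected groups of same-bin pixels as patches. The grayscale simplification and the connectivity choice are assumptions; the paper works in a color space.

    import numpy as np
    from skimage.measure import label

    def color_patches(img_gray, n):
        k = 2 ** (n - 1)                        # number of bins at threshold level n
        q = (img_gray.astype(float) / 256 * k).astype(int)   # quantize to k bins
        return label(q + 1, connectivity=1)     # connected same-bin pixels -> patches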
Article
The quantity and quality of training samples have a great influence on the performance of most hyperspectral image classification approaches. However, in a real scenario, manually annotating a large number of accurate training samples is extremely labor-intensive and time-consuming. In this article, a multilabel training sample augmentation method is proposed. Instead of giving an exact label to each pixel, we just precisely label a small number of pixels by giving them a single label (called single-label samples) and annotate a large number of pixels in certain regions together by giving them multiple labels (called multilabel samples). Furthermore, in order to make full use of the multilabel training samples, a superpixel segmentation and recursive filtering-based method is proposed. The proposed method consists of the following major steps: recursive filtering-based feature extraction, superpixel-based segmentation, and spectral-spatial similarity-based mislabeled sample removal. Experimental results demonstrate that the proposed method can significantly improve the classification accuracy of multiple classifiers by using the multilabel training samples.
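One way to picture the mislabeled-sample removal is: for a region carrying several candidate labels, keep the label whose single-label class is most similar. The fragment below shows only the spectral half of that similarity test, with hypothetical variable names; the paper's rule additionally uses spatial information and recursive-filtering features.

    import numpy as np

    def resolve_multilabel(sp_mean, candidate_labels, class_means):
        # Keep the candidate class whose single-label spectral mean is
        # closest to this region's mean spectrum; drop the other labels.
        dists = {c: np.linalg.norm(sp_mean - class_means[c])
                 for c in candidate_labels}
        return min(dists, key=dists.get)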
Chapter
There is a growing reliance on imaging equipment in the medical domain, so medical experts' specialized visual perceptual capability becomes key to their superior performance. In this paper, we propose a principled generative model to detect and segment dermatological lesions by exploiting the experts' perceptual expertise, represented by their patterned eye movement behaviors while examining and diagnosing dermatological images. The diagnostic significance levels of image superpixels are inferred from the correlations between their appearances and the spatial structures of the experts' signature eye movement patterns. In this process, the global relationships between the superpixels are also manifested by the spans of the signature eye movement patterns. Our model takes these dependencies between experts' perceptual skill and image properties into account to generate a holistic understanding of cluttered dermatological images. A Gibbs sampler is derived that uses the generative model's structure to estimate the diagnostic significance and lesion spatial distributions from superpixel-based representations of dermatological images and experts' signature eye movement patterns. We demonstrate the effectiveness of our approach on a set of dermatological images for which dermatologists' eye movements were recorded. The results suggest that integrating experts' perceptual skill with dermatological images can greatly improve medical image understanding and retrieval.
Article
Purpose: For many years, deep convolutional neural networks have achieved state-of-the-art results on a wide variety of computer vision tasks. 3D human pose estimation is no exception, and results on public benchmarks are impressive. However, specialized domains, such as operating rooms, pose additional challenges: clinical settings include severe occlusions, clutter and difficult lighting conditions, and privacy concerns of patients and staff make it necessary to use unidentifiable data. In this work, we aim to bring robust human pose estimation to the clinical domain. Methods: We propose a 2D-3D information fusion framework that makes use of a network of multiple depth cameras and strong pose priors. In a first step, probabilities of 2D joints are predicted from single depth images. This information is fused in a shared voxel space, yielding a rough estimate of the 3D pose. Final joint positions are obtained by regressing into the latent pose space of a pre-trained convolutional autoencoder. Results: We evaluate our approach against several baselines on the challenging MVOR dataset. The best results are obtained when fusing 2D information from multiple views and constraining the predictions with learned pose priors. Conclusions: We present a robust 3D human pose estimation framework based on a multi-depth-camera network in the operating room. Because depth images are the only input modality, our approach preserves the anonymity of patients and staff, which makes it especially interesting for clinical applications.
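The 2D-to-3D fusion step can be sketched as projecting each voxel centre into every camera, reading the 2D joint probability there, and combining the per-view probabilities. The Python fragment below illustrates that idea for a single joint under assumed 3x4 projection matrices; it is a simplification of the paper's pipeline, which additionally regresses into a learned pose prior.

    import numpy as np

    def fuse_joint_heatmaps(voxels, projections, heatmaps):
        # voxels: (N, 3) world points; projections: list of 3x4 camera
        # matrices; heatmaps: per-view 2D probability maps for one joint.
        score = np.ones(len(voxels))
        hom = np.c_[voxels, np.ones(len(voxels))]
        for P, H in zip(projections, heatmaps):
            uvw = (P @ hom.T).T
            uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)
            u = uv[:, 0].clip(0, H.shape[1] - 1)
            v = uv[:, 1].clip(0, H.shape[0] - 1)
            inside = ((uv[:, 0] >= 0) & (uv[:, 0] < H.shape[1]) &
                      (uv[:, 1] >= 0) & (uv[:, 1] < H.shape[0]))
            score *= np.where(inside, H[v, u], 0.0)  # product over views
        return voxels[np.argmax(score)]              # rough 3D joint estimate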
Article
Full-text available
Image boundaries and regularity are two important factors in superpixel segmentation, and balancing their influence is key to producing good superpixels. In this paper, we present a novel superpixel segmentation algorithm, called weighted superpixel segmentation (WSS), which is capable of generating superpixels with high boundary adherence and regular shape in linear time. In WSS, superpixels are generated according to a distance metric that combines a weight function term, a color distance term and a plane distance term. Unlike other superpixel algorithms, the weight function is calculated for each pixel to determine the relative weight of the color distance term and the plane distance term in the metric. To increase superpixel regularity, superpixel seeds are initialized in a hexagonal pattern, and the distance metric is then used to obtain the initial superpixels. Because the seed search range strongly affects accuracy, a dynamic circular search range is designed that yields better superpixel results. Finally, a merging strategy is applied to obtain the final superpixels and to ensure that their number agrees with expectations. Experimental results demonstrate that WSS performs as well as or better than existing methods in terms of several commonly used superpixel evaluation metrics.
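The distance metric at the heart of WSS can be read as a per-pixel weighted sum of a color term and a plane (spatial) term. The snippet below shows that shape in Python; the weight w would come from the paper's per-pixel weight function, and the exact functional form here is a hedged guess rather than the published formula.

    import numpy as np

    def wss_distance(pix_lab, pix_xy, seed_lab, seed_xy, w, seed_spacing):
        # w in [0, 1] balances color adherence against spatial regularity.
        d_color = np.linalg.norm(pix_lab - seed_lab)            # color term
        d_plane = np.linalg.norm(pix_xy - seed_xy) / seed_spacing
        return w * d_color + (1.0 - w) * d_plane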
Article
In this work, we consider the problem of single-query 6-DoF camera pose estimation, i.e. estimating the position and orientation of a camera by using reference images and a point cloud. We perform a systematic comparison of three state-of-the-art strategies for 6-DoF camera pose estimation: feature-based, photometric-based and mutual-information-based approaches. Two standard datasets with self-driving setups are used for experiments, and the performance of the studied methods is evaluated in terms of success rate, translation error and maximum orientation error. Building on the analysis of the results, we evaluate a hybrid approach that combines feature-based and mutual-information-based pose estimation methods to benefit from their complementary properties. Experiments show that (1) in cases with large appearance change between query and reference, the hybrid approach outperforms the feature-based and mutual-information-based approaches by an average increment of 9.4% and 8.7% in success rate, respectively; (2) in cases where query and reference images are captured under similar imaging conditions, the hybrid approach performs similarly to the feature-based approach, but outperforms both the photometric-based and mutual-information-based approaches by a clear margin; (3) the feature-based approach is consistently more accurate than the mutual-information-based and photometric-based approaches when at least 4 consistent matching points are found between the query and reference images.
Article
Full-text available
This paper provides an algorithm for partitioning grayscale images into disjoint regions of coherent brightness and texture. Natural images contain both textured and untextured regions, so the cues of contour and texture differences are exploited simultaneously. Contours are treated in the intervening contour framework, while texture is analyzed using textons. Each of these cues has a domain of applicability, so to facilitate cue combination we introduce a gating operator based on the texturedness of the neighborhood at a pixel. Having obtained a local measure of how likely two nearby pixels are to belong to the same region, we use the spectral graph theoretic framework of normalized cuts to find partitions of the image into regions of coherent texture and brightness. Experimental results on a wide range of images are shown.
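At its core, the normalized-cuts step solves an eigenvalue problem on the pairwise affinity matrix built from the contour and texture cues. A minimal bipartition sketch in Python/SciPy follows; the cue computation (textons, intervening contours, the texturedness gate) is omitted, and the affinity matrix W is assumed to be given.

    import numpy as np
    from scipy.sparse.linalg import eigsh

    def ncut_bipartition(W):
        # W: symmetric affinity matrix between pixels or regions.
        d = np.asarray(W.sum(axis=1)).ravel()
        Dis = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(d)) - Dis @ W @ Dis      # normalised Laplacian
        vals, vecs = eigsh(L, k=2, which='SM')  # two smallest eigenpairs
        return vecs[:, 1] > 0                   # threshold the Fiedler vector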
Article
Full-text available
The ability to recognize humans and their activities by vision is key for a machine to interact intelligently and effortlessly with a human-inhabited environment. Because of many potentially important applications, “looking at people” is currently one of the most active application domains in computer vision. This survey identifies a number of promising applications and provides an overview of recent developments in this domain. The scope of this survey is limited to work on whole-body or hand motion; it does not include work on human faces. The emphasis is on discussing the various methodologies; they are grouped in 2-D approaches with or without explicit shape models and 3-D approaches. Where appropriate, systems are reviewed. We conclude with some thoughts about future directions.
Conference Paper
Full-text available
Human activity can be described as a sequence of 3D body postures. The traditional approach to recognition and 3D reconstruction of human activity has been to track motion in 3D, mainly using advanced geometric and dynamic models. In this paper we reverse this process. View-based activity recognition serves as an input to a human body location tracker, with the ultimate goal of 3D reanimation in mind. We demonstrate that specific human actions can be detected from single-frame postures in a video sequence. By recognizing the image of a person's posture as corresponding to a particular key frame from a set of stored key frames, it is possible to map body locations from the key frames to actual frames. This is achieved using a shape matching algorithm based on qualitative similarity that computes point-to-point correspondence between shapes, together with information about appearance. As the mapping is from fixed key frames, our tracking does not suffer from the problem of having to reinitialise when it gets lost; it is effectively a closed loop. We present experimental results for both recognition and tracking on a sequence of a tennis player.
Article
Full-text available
This paper describes a pedestrian detection system that integrates image intensity information with motion information. We use a detection style algorithm that scans a detector over two consecutive frames of a video sequence. The detector is trained (using AdaBoost) to take advantage of both motion and appearance information to detect a walking person. Past approaches have built detectors based on motion information or detectors based on appearance information, but ours is the first to combine both sources of information in a single detector. The implementation described runs at about 4 frames/second, detects pedestrians at very small scales (as small as 20 × 15 pixels), and has a very low false positive rate. Our approach builds on the detection work of Viola and Jones. Novel contributions of this paper include: (i) development of a representation of image motion which is extremely efficient, and (ii) implementation of a state of the art pedestrian detection system which operates on low resolution images under difficult conditions (such as rain and snow).
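The motion part of such a detector can be approximated by building difference images between the two frames (direct and shifted), which then feed rectangle features and a boosted classifier. The sketch below, assuming NumPy and scikit-learn, is a simplified stand-in for the paper's Viola-Jones-style filters; the training arrays are hypothetical.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def motion_channels(f0, f1):
        f0, f1 = f0.astype(np.float32), f1.astype(np.float32)
        delta = np.abs(f1 - f0)                      # raw motion magnitude
        shifts = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
        chans = [delta] + [np.abs(f1 - np.roll(f0, s, axis=(0, 1)))
                           for s in shifts]
        return np.stack(chans, axis=-1)

    # Rectangle sums over these channels, plus appearance features, would
    # form X; y marks pedestrian windows (hypothetical data):
    # clf = AdaBoostClassifier(n_estimators=200).fit(X, y)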
Article
Full-text available
An unsupervised learning algorithm that can obtain a probabilistic model of an object composed of a collection of parts (a moving human body in our examples) automatically from unlabeled training data is presented. The training data include both useful "foreground" features as well as features that arise from irrelevant background clutter - the correspondence between parts and detected features is unknown. The joint probability density function of the parts is represented by a mixture of decomposable triangulated graphs which allow for fast detection. To learn the model structure as well as model parameters, an EM-like algorithm is developed where the labeling of the data (part assignments) is treated as hidden variables. The unsupervised learning technique is not limited to decomposable triangulated graphs. The efficiency and effectiveness of our algorithm is demonstrated by applying it to generate models of human motion automatically from unlabeled image sequences, and testing the learned models on a variety of sequences.
Conference Paper
Full-text available
A new exemplar-based, probabilistic paradigm for visual tracking is presented. Probabilistic mechanisms are attractive because they handle fusion of information, especially temporal fusion, in a principled manner. Exemplars are selected representatives of raw training data, used here to represent probabilistic mixture distributions of object configurations. Their use avoids tedious hand-construction of object models and problems with changes of topology. Using exemplars in place of a parameterized model poses several challenges, addressed here with what we call the “Metric Mixture” (M²) approach. The M² model has several valuable properties. Principally, it provides alternatives to standard learning algorithms by allowing the use of metrics that are not embedded in a vector space. Secondly, it uses a noise model that is learned from training data. Lastly, it eliminates any need for an assumption of probabilistic pixelwise independence. Experiments demonstrate the effectiveness of the M² model in two domains: tracking walking people using chamfer distances on binary edge images, and tracking mouth movements by means of a shuffle distance.
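The chamfer distance used for the walking-people domain has a compact expression: distance-transform the image edge map once, then average its values under the template's edge pixels. A small sketch with SciPy follows; in the exemplar framework this score would be evaluated per exemplar and fed into the learned noise model.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def chamfer_score(template_edges, image_edges):
        # Distance to the nearest image edge, computed once per image.
        dt = distance_transform_edt(~image_edges.astype(bool))
        # Mean distance under the template's edge pixels (lower = better).
        return dt[template_edges.astype(bool)].mean()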
Article
Full-text available
We present a general example-based framework for detecting objects in static images by components. The technique is demonstrated by developing a system that locates people in cluttered scenes. The system is structured with four distinct example-based detectors that are trained to separately find the four components of the human body: the head, legs, left arm, and right arm. After ensuring that these components are present in the proper geometric configuration, a second example-based classifier combines the results of the component detectors to classify a pattern as either a “person” or a “nonperson.” We call this type of hierarchical architecture, in which learning occurs at multiple stages, an adaptive combination of classifiers (ACC). We present results that show that this system performs significantly better than a similar full-body person detector. This suggests that the improvement in performance is due to the component-based approach and the ACC data classification architecture. The algorithm is also more robust than the full-body person detection method in that it is capable of locating partially occluded views of people and people whose body parts have little contrast with the background.
Article
Finding people in pictures presents a particularly difficult object recognition problem. We show how to find people by finding candidate body segments, and then constructing assemblies of segments that are consistent with the constraints on the appearance of a person that result from kinematic properties. Since a reasonable model of a person requires at least nine segments, it is not possible to inspect every group, due to the huge combinatorial complexity. We propose two approaches to this problem. In one, the search can be pruned by using projected versions of a classifier that accepts groups corresponding to people. We describe an efficient projection algorithm for one popular classifier, and demonstrate that our approach can be used to determine whether images of real scenes contain people. The second approach employs a probabilistic framework, so that we can draw samples of assemblies with probabilities proportional to their likelihood, which allows us to draw human-like assemblies more often than non-person ones. The main performance bottleneck is image segmentation, but the overall results of both approaches on real images of people are encouraging.
Article
For a machine to be able to ‘see’, it must know something about the object it is ‘looking’ at. A common method in machine vision is to provide the machine with general rather than specific knowledge about the object. An alternative technique, and the one used in this paper, is a model-based approach in which particulars about the object are given and drive the analysis. The computer program described here, the WALKER model, maps images into a description in which a person is represented by a series of hierarchical levels, i.e. a person has an arm, which has a lower-arm, which has a hand. The performance of the program is illustrated by superimposing the machine-generated picture over the original photographic images.
Conference Paper
While navigating in an environment, a vision system has to be able to recognize where it is and what the main objects in the scene are. We present a context-based vision system for place and object recognition. The goal is to identify familiar locations (e.g., office 610, conference room 941, main street), to categorize new environments (office, corridor, street) and to use that information to provide contextual priors for object recognition (e.g., tables are more likely in an office than a street). We present a low-dimensional global image representation that provides relevant information for place recognition and categorization, and show how such contextual information introduces strong priors that simplify object recognition. We have trained the system to recognize over 60 locations (indoors and outdoors) and to suggest the presence and locations of more than 20 different object types. The algorithm has been integrated into a mobile system that provides realtime feedback to the user.
Conference Paper
We propose a two-class classification model for grouping. Human segmented natural images are used as positive examples. Negative examples of grouping are constructed by randomly matching human segmentations and images. In a preprocessing stage an image is over-segmented into super-pixels. We define a variety of features derived from the classical Gestalt cues, including contour, texture, brightness and good continuation. Information-theoretic analysis is applied to evaluate the power of these grouping cues. We train a linear classifier to combine these features. To demonstrate the power of the classification model, a simple algorithm is used to randomly search for good segmentations. Results are shown on a wide range of images.
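The trained combination amounts to a linear classifier over the cue features for each candidate grouping. A hedged sketch with scikit-learn follows; the random arrays are stand-ins for the Gestalt-cue measurements and the human-segmentation ground truth.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Stand-in cue features: one row per candidate superpixel pair,
    # columns = contour, texture, brightness, good-continuation scores.
    rng = np.random.default_rng(0)
    X = rng.random((1000, 4))
    y = (X @ np.array([0.4, 0.3, 0.2, 0.1]) > 0.5).astype(int)  # toy labels

    clf = LogisticRegression().fit(X, y)        # linear cue combination
    p_same_group = clf.predict_proba(X)[:, 1]   # grouping probability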
Conference Paper
We propose a general framework for parsing images into regions and objects. In this framework, the detection and recognition of objects proceed simultaneously with image segmentation in a competitive and cooperative manner. We illustrate our approach on natural images of complex city scenes where the objects of primary interest are faces and text. This method makes use of bottom-up proposals combined with top-down generative models using the data driven Markov chain Monte Carlo (DDMCMC) algorithm, which is guaranteed to converge to the optimal estimate asymptotically. More precisely, we define generative models for faces, text, and generic regions, e.g. shading, texture, and clutter. These models are activated by bottom-up proposals. The proposals for faces and text are learnt using a probabilistic version of AdaBoost. The DDMCMC combines reversible jump and diffusion dynamics to enable the generative models to explain the input images in a competitive and cooperative manner. Our experiments illustrate the advantages and importance of combining bottom-up and top-down models and of performing segmentation and object detection/recognition simultaneously.
Conference Paper
Example-based methods are effective for parameter estimation problems when the underlying system is simple or the dimensionality of the input is low. For complex and high-dimensional problems such as pose estimation, the number of required examples and the computational complexity rapidly become prohibitively high. We introduce a new algorithm that learns a set of hashing functions that efficiently index examples relevant to a particular estimation task. Our algorithm extends locality-sensitive hashing, a recently developed method to find approximate neighbors in time sublinear in the number of examples. This method depends critically on the choice of hash functions that are optimally relevant to a particular estimation problem. Experiments demonstrate that the resulting algorithm, which we call parameter-sensitive hashing, can rapidly and accurately estimate the articulated pose of human figures from a large database of example images.
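Plain locality-sensitive hashing, which this work extends, can be written in a few lines: each hash bit thresholds a random projection, and lookups touch only the matching bucket. The sketch below is the unoptimised LSH baseline, not the paper's method; parameter-sensitive hashing would instead select projections that are informative about pose.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_hash(dim, n_bits):
        # Each bit: does a random projection of x exceed a random threshold?
        W = rng.standard_normal((n_bits, dim))
        t = rng.standard_normal(n_bits)
        return lambda x: tuple((W @ x > t).astype(int))

    h = make_hash(dim=128, n_bits=16)
    table = {}
    # table.setdefault(h(feat), []).append(pose)   # index example images
    # candidates = table.get(h(query_feat), [])    # sublinear-time lookup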
Conference Paper
We consider the problem of segmenting an image into foreground and background, with foreground containing solely objects of interest known a priori. We propose an integration model that incorporates both edge detection and object part detection results. It consists of two parallel processes: low-level pixel grouping and high-level patch grouping. We seek a solution that optimizes a joint grouping criterion in a reduced space enforced by grouping correspondence between pixels and patches. Using spectral graph partitioning, we show that a near global optimum can be found by solving a constrained eigenvalue problem. We report promising experimental results on a dataset of 15 objects under clutter and occlusion.
Article
Example-based methods are effective for parameter estimation problems when the underlying system is simple or the dimensionality of the input is low. For complex and high-dimensional problems such as pose estimation, the number of required examples and the computational complexity rapidly become prohibitively high. We introduce a new algorithm that learns a set of hashing functions that efficiently index examples in a way relevant to a particular estimation task. Our algorithm extends locality-sensitive hashing, a recently developed method to find approximate neighbors in time sublinear in the number of examples. This method depends critically on the choice of hash functions; we show how to find the set of hash functions that are optimally relevant to a particular estimation problem. Experiments demonstrate that the resulting algorithm, which we call Parameter-Sensitive Hashing, can rapidly and accurately estimate the articulated pose of human figures from a large database of example images.
Conference Paper
The problem we consider in this paper is to take a single two-dimensional image containing a human body, locate the joint positions, and use these to estimate the body configuration and pose in three-dimensional space. The basic approach is to store a number of exemplar 2D views of the human body in a variety of different configurations and viewpoints with respect to the camera. On each of these stored views, the locations of the body joints (left elbow, right knee, etc.) are manually marked and labelled for future use. The test shape is then matched to each stored view, using the technique of shape context matching in conjunction with a kinematic chain-based deformation model. Assuming that there is a stored view sufficiently similar in configuration and pose, the correspondence process will succeed. The locations of the body joints are then transferred from the exemplar view to the test shape. Given the joint locations, the 3D body configuration and pose are then estimated.
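The shape context descriptor that drives the matching is a log-polar histogram of the other sample points' positions relative to a given point. A compact NumPy version is shown below; the kinematic-chain deformation model and the actual correspondence search are beyond this sketch.

    import numpy as np

    def shape_context(points, i, n_r=5, n_theta=12):
        # Histogram, in log-polar bins, of where the other contour points
        # lie relative to points[i].
        v = np.delete(points, i, axis=0) - points[i]
        r = np.log1p(np.hypot(v[:, 0], v[:, 1]))
        theta = np.arctan2(v[:, 1], v[:, 0]) % (2 * np.pi)
        r_edges = np.linspace(0.0, r.max() + 1e-6, n_r + 1)
        t_edges = np.linspace(0.0, 2 * np.pi, n_theta + 1)
        hist, _, _ = np.histogram2d(r, theta, bins=[r_edges, t_edges])
        return hist / hist.sum()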
Article
A pictorial structure is a collection of parts arranged in a deformable configuration. Each part is represented using a simple appearance model, and the deformable configuration is represented by spring-like connections between pairs of parts. While pictorial structures were introduced a number of years ago, they have not been broadly applied to matching and recognition problems, due in part to the computational difficulty of matching pictorial structures to images. In this paper we present an efficient algorithm for finding the best global match of a pictorial structure to an image. The running time of the algorithm is optimal, and it takes only a few seconds to match a model with five to ten parts. With this improved algorithm, pictorial structures provide a practical and powerful framework for qualitative descriptions of objects and scenes, and are suitable for many generic image recognition problems. We illustrate the approach using simple models of a person and a car.
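The efficiency of the matching comes from turning the spring costs into a distance-transform-style minimisation over part locations. The toy two-part example below uses a box spring (a min-filter over a window) instead of the quadratic spring solved exactly by the paper's generalised distance transform, but it shows the shape of the computation.

    import numpy as np
    from scipy.ndimage import minimum_filter

    def match_two_parts(cost_parent, cost_child, spring_radius):
        # For every parent location, find the best child response within a
        # (2r+1)x(2r+1) window around it; a box-shaped spring stands in
        # for the quadratic spring of the generalised distance transform.
        best_child = minimum_filter(cost_child, size=2 * spring_radius + 1)
        total = cost_parent + best_child
        return np.unravel_index(np.argmin(total), total.shape)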
NIST. Anthrokids - anthropometric data of children, http://ovrt.nist.gov/projects/anthrokids/, 1977.