Figure 1
A holistic scene understanding approach to semantic segmentation consists of a conditional random field (CRF) model that jointly reasons about: (a) classification of local patches (segmentation), (b) object detection, (c) shape analysis, (d) scene recognition and (e) contextual reasoning. In this paper we analyze the relative importance of each of these components by building an array of hybrid human-machine CRFs where each component is performed by a machine (default), or replaced by human subjects or ground truth, or is removed all together (top).

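In schematic form, the energy such a holistic CRF minimizes can be written as a sum of task-specific potentials and compatibility terms. The formulation below is a simplified sketch (the paper's exact potentials differ in form and detail), with x_i the superpixel labels, b_d the detection variables and s the scene class:

E(x, b, s) = \sum_i \phi_{seg}(x_i) + \sum_d \phi_{det}(b_d) + \phi_{scene}(s) + \sum_{i,d} \psi_{shape}(x_i, b_d) + \sum_{i<j} \psi_{cooc}(x_i, x_j) + \sum_i \psi_{compat}(x_i, s)

Replacing a component with human or ground-truth input amounts to fixing the corresponding potential while leaving the rest of the energy unchanged.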

Source publication
Conference Paper
Full-text available
Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, and contextual reasoning. In this work, we are interested in understanding the roles of these different tasks in aiding semantic segmentation. Towards this...

Context in source publication

Context 1
... is a conditional random field (CRF) that models the interplay between segmentation and a variety of components such as local super-pixel appearance, object detection, scene recognition, shape analysis, class co-occurrence, and compatibility of classes with scene categories. To gain insights into the relative importance of these different factors or tasks, we isolate each one, and substitute a machine with a human for that task, keeping the rest of the model intact (Figure 1). The resultant improvement in segmentation performance, if any, will give us an indication of how much "head room" there is to improve segmentation by focusing research efforts on that task. ...
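The substitution protocol described in this excerpt can be summarized as a small ablation loop. The sketch below is purely illustrative (the component names, the toy scoring function and the numbers are invented, not the authors' code); it only shows how one component at a time is swapped for an oracle signal while the joint model is left untouched:

# Toy sketch of the hybrid human-machine ablation protocol (all names and values invented).
COMPONENTS = ["segmentation", "detection", "shape", "scene", "context"]

# Per-image component signals: 1.0 stands in for a perfect human/ground-truth signal,
# lower values for noisier machine predictions.
machine = {"img1": {c: 0.6 for c in COMPONENTS}, "img2": {c: 0.7 for c in COMPONENTS}}
oracle = {"img1": {c: 1.0 for c in COMPONENTS}, "img2": {c: 1.0 for c in COMPONENTS}}

def crf_segmentation_score(potentials):
    # Stand-in for joint CRF inference plus IoU evaluation: a simple average,
    # so that improving any single component improves the final score.
    return sum(potentials.values()) / len(potentials)

def evaluate(replaced=None):
    scores = []
    for img, outputs in machine.items():
        potentials = dict(outputs)
        if replaced is not None:
            potentials[replaced] = oracle[img][replaced]  # swap exactly one component
        scores.append(crf_segmentation_score(potentials))
    return sum(scores) / len(scores)

baseline = evaluate()
for comp in COMPONENTS:
    print(f"head room from improving {comp}: {evaluate(replaced=comp) - baseline:+.3f}")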

Citations

... Moreover, the lower-resolution feature maps in deeper layers pose a challenge for object detectors to precisely locate the defects in the original image [10]. Not only that, as Mottaghi et al. [36] suggest, humans are worse than machines at classifying an object without contextual information, but far better when it is provided. For instance, humans are able to distinguish different types of trees by taking into account the surrounding weather conditions and the location where the tree is planted. ...
Article
The quality of a printed circuit board (PCB) is paramount towards ensuring proper functionality of electronic products. To achieve the required quality standards, substantial research and development efforts were invested to automate PCB inspection for defect detection, primarily using computer vision techniques. Despite these advancements, the accuracy of such techniques is often susceptible towards varying board and component size. Efforts to increase its accuracy especially for small or tiny defects on a PCB often lead to a tradeoff with reduced real-time performance, which in turn limits its applicability in the manufacturing industry. Hence, this paper puts forward an enhanced deep learning network which addresses the difficulty in inferring tiny or varying defects on a PCB in real-time. Our proposed enhancements consist of i) A novel multi-scale feature pyramid network to enhance tiny defect detection through context information inclusion; and ii) A refined complete intersection over union loss function to precisely encapsulate tiny defects. Experimental results on a publicly available PCB defects dataset demonstrate that our model achieves 99.17% mean-average precision, while maintaining real-time inferencing speed at 90 frames per second. In addition, we introduce three trend detection algorithms which alert an operator when abnormal development of defect characteristics is detected. Each algorithm is responsible for localizing defect buildups, increasing defect size and increasing defect occurrences, respectively. As a whole, the proposed model is capable of performing accurate and reliable real-time PCB inspection with the aid of an automated alert capability. The dataset and trained models are available at: https://github.com/JiaLim98/YOLO-PCB.
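For context, the standard complete-IoU (CIoU) loss that the authors refine penalizes not only box overlap but also center distance and aspect-ratio mismatch. The sketch below implements only the common baseline formulation in plain Python; the paper's refined variant is not reproduced here, and the example boxes are arbitrary:

import math

def ciou_loss(box_p, box_g):
    """Standard CIoU loss between two axis-aligned boxes given as (x1, y1, x2, y2).
    The refined variant proposed in the paper differs; this is only the common baseline."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    eps = 1e-9
    # Overlap (IoU) term
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)
    # Squared distance between box centers
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4.0
    cw = max(px2, gx2) - min(px1, gx1)   # enclosing box width
    ch = max(py2, gy2) - min(py1, gy1)   # enclosing box height
    c2 = cw ** 2 + ch ** 2 + eps         # squared enclosing diagonal
    # Aspect-ratio consistency term
    v = (4.0 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1 + eps))
                                - math.atan((px2 - px1) / (py2 - py1 + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - (iou - rho2 / c2 - alpha * v)

print(ciou_loss((0, 0, 10, 10), (2, 2, 12, 12)))   # toy boxes, not PCB data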
...
Model | Task | Backbone | Context type | Context level | Framework
Internal-External Network Feature Fusion Attention Model (Lim et al., 2021) | Object detection | ResNet | Other | Global, Local | Feature fusion SSD
Deformable Part-based Model (Mottaghi et al., 2014) | Object detection | / | Spatial | Global, Local | Markov random field
Siamese Context Network (Sun and Jacobs, 2017) | Object detection | Custom | Spatial | Global | Siamese CNN
Bayes Probabilistic Model (Torralba et al., 2010) | Object detection | / | Spatial | Global, Local | Bayesian model
Semantic Relation Reasoning Model (Zhu et al., 2021) | Object detection | ResNet | Other | Global | SSD
Cascaded Refinement Network (Johnson et al., 2018) | Scene graph generation | GCN | Spatial | Global, Local | GCN
Iterative Message Passing (Xu et al., 2017) | Scene graph generation | VGGNet | Spatial | Global, Local | Conditional random fields
Graph R-CNN | Scene graph generation | GCN | Spatial | Global, Local | GCN
MOTIFNET (Zellers et al., 2018) | Scene graph generation | ResNet | Spatial | Global | Bayesian model
Conditional Random Field (CRF) (Mottaghi et al., 2013) | Semantic segmentation | / | Other | Global | Conditional random field
Context-based SVM (Du et al., 2012) | Text detection | Custom | Spatial | Local | SVM
Visual-language Re-ranker (Sabir et al., 2018) | Text detection | ResNet/GoogLeNet | Other | Global | Language model
PLEX (Wang et al., 2011) | Text detection | / | Spatial | Local | Trie structure
Scene Context-based Model (Zhu et al., 2016) | Text detection | Custom | Other | Global, Local | CNN/SVM
Context-dependent Diffusion Network | Visual relationship detection | VGGNet | Spatial | Global | Graph model
Dynamic Tree Structure (Tang et al., 2019) | Visual Q&A | VGGNet | Spatial | Global, Local | Tree-structured model
advantages of spatial semantic context to achieve better performance. However, the label co-occurrence may describe the object relation precisely when the dataset is large enough and the objects are highly correlated. ...
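As a small illustration of the label co-occurrence statistics mentioned at the end of this excerpt, such priors are usually estimated simply by counting which class labels appear together in the same annotated image (the labels below are made up):

from collections import Counter
from itertools import combinations

# Per-image label sets from an annotated dataset (toy example).
image_labels = [
    {"road", "car", "building"},
    {"road", "car", "person"},
    {"grass", "cow", "sky"},
]

# Count how often each unordered pair of classes appears in the same image.
pair_counts = Counter()
for labels in image_labels:
    pair_counts.update(combinations(sorted(labels), 2))

# Normalize into co-occurrence frequencies usable as a contextual prior.
total = sum(pair_counts.values())
cooccurrence = {pair: n / total for pair, n in pair_counts.items()}
print(cooccurrence[("car", "road")])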
Preprint
Full-text available
Contextual information plays an important role in many computer vision tasks, such as object detection, video action detection, image classification, etc. Recognizing a single object or action out of context can sometimes be very challenging, and context information may greatly help improve the understanding of a scene or an event. Appearance context information, e.g., the colors or shapes of the background of an object, can improve the recognition accuracy of the object in the scene. Semantic context (e.g. a keyboard on an empty desk vs. a keyboard next to a desktop computer) will improve accuracy and exclude unrelated events. Context information that is not in the image itself, such as the time or location at which an image was captured, can also help to decide whether a certain event or action should occur. Other types of context (e.g. the 3D structure of a building) will also provide additional information to improve the accuracy. In this survey, the different context information that has been used in computer vision tasks is reviewed. We categorize context into different types and different levels. We also review available machine learning models and image/video datasets that can employ context information. Furthermore, we compare context-based integration and context-free integration in mainly two classes of tasks: image-based and video-based. Finally, this survey concludes with a set of promising future directions in context learning and utilization.
... Deep learning methods have shown very promising semantic labeling performance due to their capability of learning discriminative features, where the receptive fields of the neurons in the convolution layers are cascaded to implicitly capture contextual information [28]. Such contextual knowledge is important for understanding and capturing local and global pixel dependencies [6,40,41]. For example, a multi-scale CNN was used to overcome the limitations of hand-crafted features [15]. ...
Article
Full-text available
Rare-class objects in natural scene images that are usually small and less frequent often convey more important information for scene understanding than the common ones. However, they are often overlooked in scene labeling studies due to two main reasons, low occurrence frequency and limited spatial coverage. Many methods have been proposed to enhance overall semantic labeling performance, but only a few consider rare-class objects. In this work, we present a deep semantic labeling framework with special consideration of rare classes via three techniques. First, a novel dual-resolution coarse-to-fine superpixel representation is developed, where fine and coarse superpixels are applied to rare classes and background areas respectively. This unique dual representation allows seamless incorporation of shape features into integrated global and local convolutional neural network (CNN) models. Second, shape information is directly involved during the CNN feature learning for both frequent and rare classes from the re-balanced training data, and also explicitly involved in data inference. Third, the proposed framework incorporates both shape information and the CNN architecture into semantic labeling through a fusion of probabilistic multi-class likelihood. Experimental results demonstrate competitive semantic labeling performance on two standard datasets both qualitatively and quantitatively, especially for rare-class objects.
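The fusion of probabilistic multi-class likelihoods mentioned in this abstract can, in generic form, be realized as a weighted combination of per-class probability vectors from different predictors. The snippet below shows only one common realization (a weighted geometric mean with renormalization); the weights, class count and predictor names are illustrative, not the paper's actual scheme:

import numpy as np

def fuse_likelihoods(prob_vectors, weights):
    """Fuse several per-class probability vectors (each summing to 1)
    via a weighted geometric mean, then renormalize."""
    fused = np.ones_like(prob_vectors[0])
    for p, w in zip(prob_vectors, weights):
        fused *= np.power(p + 1e-12, w)   # small epsilon avoids zero probabilities
    return fused / fused.sum()

# Toy example: global CNN, local CNN and a shape-based predictor for one superpixel.
global_p = np.array([0.70, 0.20, 0.10])
local_p = np.array([0.50, 0.40, 0.10])
shape_p = np.array([0.30, 0.60, 0.10])
print(fuse_likelihoods([global_p, local_p, shape_p], weights=[0.5, 0.3, 0.2]))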
... Our work extends weakly supervised learning methods by involving humans in the loop (Vaughan 2018). Existing human-in-the-loop approaches mainly leverage crowds to label individual data instances (Yan et al. 2011; Yang et al. 2018) or to debug the training data (Krishnan et al. 2016; Yang et al. 2019) or components (Parikh and Zitnick 2011; Mottaghi et al. 2013; Nushi et al. 2017) of a machine learning system. Unlike these works, we leverage crowd workers to label sampled microposts in order to obtain keyword-specific expectations, which can then be generalized to help classify microposts containing the same keyword, thus amplifying the utility of the crowd. ...
Article
Full-text available
Microblogging platforms such as Twitter are increasingly being used in event detection. Existing approaches mainly use machine learning models and rely on event-related keywords to collect the data for model training. These approaches make strong assumptions on the distribution of the relevant microposts containing the keyword – referred to as the expectation of the distribution – and use it as a posterior regularization parameter during model training. Such approaches are, however, limited as they fail to reliably estimate the informativeness of a keyword and its expectation for model training. This paper introduces a Human-AI loop approach to jointly discover informative keywords for model training while estimating their expectation. Our approach iteratively leverages the crowd to estimate both keyword-specific expectation and the disagreement between the crowd and the model in order to discover new keywords that are most beneficial for model training. These keywords and their expectation not only improve the resulting performance but also make the model training process more transparent. We empirically demonstrate the merits of our approach, both in terms of accuracy and interpretability, on multiple real-world datasets and show that our approach improves the state of the art by 24.3%.
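One way to picture the keyword-discovery step is as a disagreement score between the crowd-estimated expectation for a keyword and the model's average prediction on microposts containing it. The toy sketch below illustrates only that selection idea; the keywords, numbers and the simple absolute-difference criterion are invented, and the paper's actual objective is richer:

# Toy illustration of picking the next keyword by crowd-model disagreement (invented data).
crowd_expectation = {"earthquake": 0.80, "shaking": 0.35, "magnitude": 0.55}

model_predictions = {               # model-estimated probability that a micropost
    "earthquake": [0.9, 0.7, 0.8, 0.95],   # containing the keyword is event-related
    "shaking": [0.6, 0.7, 0.8],
    "magnitude": [0.5, 0.6, 0.4, 0.7],
}

def disagreement(keyword):
    model_expectation = sum(model_predictions[keyword]) / len(model_predictions[keyword])
    return abs(crowd_expectation[keyword] - model_expectation)

next_keyword = max(crowd_expectation, key=disagreement)
print(next_keyword, disagreement(next_keyword))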
... (III) Empower the annotator. In most annotation approaches there is a fixed sequence of annotation actions [11,17,22,51,58,92] or the sequence is determined by the machine [43,69,83]. In contrast, Fluid Annotation empowers the annotator: he sees at a glance the best available machine segmentation of all scene elements, and then decides what to annotate and in which order. ...
Conference Paper
We introduce Fluid Annotation, an intuitive human-machine collaboration interface for annotating the class label and outline of every object and background region in an image. Fluid Annotation is based on three principles: (I) Strong machine-learning aid. We start from the output of a strong neural network model, which the annotator can edit by correcting the labels of existing regions, adding new regions to cover missing objects, and removing incorrect regions. The edit operations are also assisted by the model. (II) Full image annotation in a single pass. As opposed to performing a series of small annotation tasks in isolation [51,68], we propose a unified interface for full image annotation in a single pass. (III) Empower the annotator. We empower the annotator to choose what to annotate and in which order. This enables concentrating on what the machine does not already know, i.e. putting human effort only on the errors it made. This helps using the annotation budget effectively. Through extensive experiments on the COCO+Stuff dataset [11,51], we demonstrate that Fluid Annotation leads to accurate annotations very efficiently, taking 3x less annotation time than the popular LabelMe interface [70].
... (I) Strong Machine-Learning aid. Popular semantic segmentation datasets [11,17,22,51,58,92] are annotated fully manually which is very costly. Instead Fluid Annotation starts from the output of a neural network model [30], which the annotator can edit by correcting the label of existing regions, adding new regions to cover missing objects, and removing incorrect regions (Fig. 2). ...
... (III) Empower the annotator. In most annotation approaches there is a fixed sequence of annotation actions [11,17,22,51,58,92] or the sequence is determined by the machine [43,69,83]. In contrast, Fluid Annotation empowers the annotator: he sees at a glance the best available machine segmentation of all scene elements, and then decides what to annotate and in which order. This enables focusing on what the machine does not already know, i.e. putting human effort only on the errors it made, and typically addressing the biggest errors first. This helps using the annotation budget effectively, and also steers towards labeling hard examples first. ...
Preprint
We introduce Fluid Annotation, an intuitive human-machine collaboration interface for annotating the class label and outline of every object and background region in an image. Fluid Annotation starts from the output of a strong neural network model, which the annotator can edit by correcting the labels of existing regions, adding new regions to cover missing objects, and removing incorrect regions. Fluid Annotation has several attractive properties: (a) it is very efficient in terms of human annotation time; (b) it supports full image annotation in a single pass, as opposed to performing a series of small tasks in isolation, such as indicating the presence of objects, clicking on instances, or segmenting a single object known to be present. Fluid Annotation subsumes all these tasks in one unified interface. (c) It empowers the annotator to choose what to annotate and in which order. This enables putting human effort only on the errors the machine made, which helps use the annotation budget effectively. Through extensive experiments on the COCO+Stuff dataset, we demonstrate that Fluid Annotation leads to accurate annotations very efficiently, taking three times less annotation time than the popular LabelMe interface.
... In these problems, the classifier can assign its preferred label to an input sample regardless of the labels of neighboring samples. In machine vision tasks, CRFs are usually used to identify objects and segment the scene [1,32,35,36,51,52,65]. ...
Preprint
Full-text available
This paper gives an overview of semantic segmentation, consisting of an explanation of the field, its status and relation to other fundamental vision tasks, and the different datasets and common evaluation metrics that have been used by researchers. This survey also includes an overall review of a variety of recent approaches (RDF, MRF, CRF, etc.), their advantages and challenges, and shows the superiority of CNN-based semantic segmentation systems on the CamVid and NYUDv2 datasets. In addition, some areas that are promising for future work are mentioned.
... Most of the recent object detection efforts have focused on recognizing and localizing thing classes, such as cat and car. Such classes have a specific size [21,27] and shape [21,51,55,39,17,14], and identifiable parts (e.g. a car has wheels). Indeed, the main recognition challenges [18,43,35] are all about things. ...
... Defining things and stuff. The literature provides definitions for several aspects of stuff and things, including: (1) Shape: Things have characteristic shapes (car, cat, phone), whereas stuff is amorphous (sky, grass, water) [21,59,28,51,55,39,17,14]. (2) Size: Things occur at characteristic sizes with little variance, whereas stuff regions are highly variable in size [21,2,27]. ...
... This concept finds the weakest link in a connected system by estimating the improvement in performance when a certain component is replaced with some other similar component. Later on, this concept was applied in practice by Mottaghi et al. [47,48], who studied the performance of a CRF used for scene understanding. Its components were replaced by crowd-workers recruited through Mechanical Turk, and it was found that system performance increased when human classifications were provided to the CRF, whereas humans performing a component in isolation, without the machine, did not fare as well. ...
Article
Full-text available
This paper provides an in-depth analysis of the field of crowdsourcing and its impact when used in the world of machine learning. It comprises the various contributions that crowdsourcing can make to techniques that employ machine learning, such as producing data, debugging and checking models, building hybrid intelligent machines that reduce the human intervention required for high-quality performance by artificial intelligence, and developmental experimentation to improve human-computer interaction. A discussion regarding the nature of crowd-workers follows, focusing on various factors such as their reaction to different forms of motivation, their behaviour towards each other, and deceit among them. The takeaways of this paper include a few tips and routines to be followed to achieve success through crowdsourcing.
... Scene classification is one of the primary goals in computer vision, involving many sub-tasks, such as object detection and recognition. These sub-tasks have been studied intensely over the past few decades, and there is still ample room for improvement (Mottaghi et al., 2013). In general, scene classification refers to the process of learning to answer a "what" question from a given sample, where the answer is naturally determined by what objects a scene contains. ...
Article
Full-text available
Indoor scene classification forms a basis for scene interaction by service robots. The task is challenging because the layout and decoration of a scene vary considerably. Previous studies on knowledge-based methods commonly ignore the importance of visual attributes when constructing the knowledge base, a shortcoming that restricts classification performance. A semantic hierarchy structure was proposed to describe the similarities between different parts of scenes in a fine-grained way. Besides the commonly used semantic features, visual attributes were also introduced to construct the knowledge base. Inspired by the processes of human cognition and the characteristics of indoor scenes, we proposed an inferential framework based on the Markov logic network. The framework is evaluated on a popular indoor scene dataset, and the experimental results demonstrate its effectiveness.
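For readers unfamiliar with Markov logic networks, their general form attaches a weight w_i to each first-order formula F_i and scores a possible world x by

P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(x) \Big),

where n_i(x) is the number of true groundings of F_i in x and Z is the normalizing constant. An illustrative (invented) weighted rule in this setting could be 2.0 : Contains(scene, Bed) => SceneType(scene, Bedroom); the actual rules, attributes and weights used in the cited framework are those defined in that paper.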