Article

CSAST: Content self-supervised and style contrastive learning for arbitrary style transfer


Abstract

Arbitrary artistic style transfer has achieved great success with deep neural networks, but it is still difficult for existing methods to tackle the dilemma of content preservation and style translation due to the inherent content-and-style conflict. In this paper, we introduce content self-supervised learning and style contrastive learning into arbitrary style transfer to improve content preservation and style translation, respectively. The former is based on the assumption that the stylization of a geometrically transformed image is perceptually similar to applying the same transformation to the stylized result of the original image. This content self-supervised constraint noticeably improves content consistency before and after style translation, and also helps reduce noise and artifacts. Furthermore, it is especially suitable for video style transfer, because it promotes inter-frame continuity, which is crucial to the visual stability of video sequences. For the latter, we construct a contrastive learning scheme that pulls together style representations (Gram matrices) of the same style and pushes apart those of different styles. This yields more accurate style translation and a more appealing visual effect. Extensive qualitative and quantitative experiments demonstrate the superiority of our method in improving arbitrary style transfer quality, both for images and videos.
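A rough PyTorch-style sketch of the two constraints described in the abstract is given below. It assumes a hypothetical stylization network stylize(content, style), a VGG-based feature extractor vgg_feats, and simple geometric transforms such as flips; it is a minimal reading of the idea, not the authors' implementation.

import torch
import torch.nn.functional as F

def gram(feat):
    # feat: (B, C, H, W) -> (B, C, C) Gram matrix, a standard style representation.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def content_self_supervised_loss(stylize, content, style, transform):
    # Stylizing a geometrically transformed image should look like applying
    # the same transformation to the stylized result of the original image.
    stylized_then_t = transform(stylize(content, style))
    t_then_stylized = stylize(transform(content), style)
    return F.l1_loss(t_then_stylized, stylized_then_t)

def style_contrastive_loss(vgg_feats, stylized, pos_style, neg_styles, tau=0.1):
    # Pull the Gram matrix of the stylized image toward that of its own style
    # image and push it away from Gram matrices of other styles (InfoNCE form).
    a = F.normalize(gram(vgg_feats(stylized)).flatten(1), dim=1)
    p = F.normalize(gram(vgg_feats(pos_style)).flatten(1), dim=1)
    logits = [(a * p).sum(1, keepdim=True) / tau]
    for n in neg_styles:
        n = F.normalize(gram(vgg_feats(n)).flatten(1), dim=1)
        logits.append((a * n).sum(1, keepdim=True) / tau)
    logits = torch.cat(logits, dim=1)  # the positive pair sits at index 0
    labels = torch.zeros(a.size(0), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, labels)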


... Subsequently, it iteratively optimizes the content image to match the style of the referenced image. Since then, numerous methods have been proposed to enhance various aspects of performance, such as controllability [19][20][21][23], text-driven stylization [2,3,11,12] and quality [25][26][27][28][29]. Controllability: Babaeizadeh et al. [19] propose a method based on a feed-forward neural network to achieve a trade-off between content and style by adjusting the weights of the corresponding losses. ...
Article
Full-text available
Text-driven image stylization aims to synthesize content images with learned textual styles. Recent studies have shown the potential of the diffusion model for producing rich stylizations. However, existing approaches inefficiently control the degree of stylization, which hinders the balance between style and content in generated images. In this paper, we propose a Controllable Text-Driven Image Stylization (ConIS) Framework based on the diffusion model. The proposed framework introduces two modules into the pre-trained text-to-image model. The first is an unconditional null-text inversion (UNTI) module, which optimizes null-text embedding to reduce the bias between inversion and sampling in the diffusion model. Given a content image, this module is able to reconstruct it without semantic guidance. The second is a null-text dilution (NTD) module. We design a parameterization mechanism for the semantic intensity of textual conditions, which indirectly controls the degree of stylization through the style degree factor. Finally, we replace the attention maps used in the sampling process with those from the UNTI module to constrain the structure of content images. Experiments have shown that the proposed method enables fine-grained control over the degree of stylization without retraining or fine-tuning the network. Both qualitative and quantitative results indicate that the ConIS framework outperforms state-of-the-art methods in balancing artistic detail and content structure.
... Its idea is to perform instance-level discrimination and learn the feature embedding by pulling the features from the same image together and pushing those from different images away. Contrastive loss has been utilized in several low-level visual tasks, resulting in outstanding performance in various applications such as style transfer [40], image generation [24], image smoothing [43], image super-resolution and deraining [4,35]. Patch-NCE [23] proposes patch-based contrastive learning, which uses a noise-contrastive estimation framework to learn the correspondence between patches of the input image and the corresponding patches of the generated image. ...
Article
Full-text available
Existing image-to-image (I2I) translation methods achieve state-of-the-art performance by incorporating patch-wise contrastive learning into generative adversarial networks. However, patch-wise contrastive learning only focuses on local content similarity and neglects the global structure constraint, which affects the quality of the generated images. In this paper, we propose a new unpaired I2I translation framework based on dual contrastive regularization and spectral normalization, namely SN-DCR. To maintain consistency of the global structure and texture, we design the dual contrastive regularization using two different deep feature spaces. To improve the global structure information of the generated images, we formulate a semantic contrastive loss that makes the global semantic structure of the generated images similar to that of real images from the target domain in the semantic feature space. We use Gram matrices to extract the texture style from images and, similarly, design a style contrastive loss to improve the global texture information of the generated images. Moreover, to enhance the stability of the model, we employ spectrally normalized convolutional networks in the design of our generator. We conduct comprehensive experiments to evaluate the effectiveness of SN-DCR, and the results show that our method achieves state-of-the-art performance in multiple tasks. The code and pretrained models are available at https://github.com/zhihefang/SN-DCR.
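As a small, generic illustration of the spectral normalization mentioned above (not the SN-DCR code), recent PyTorch exposes torch.nn.utils.spectral_norm, which can wrap a generator's convolutions:

import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv_block(in_ch, out_ch):
    # A spectrally normalized convolution block as it might appear in a generator;
    # spectral_norm bounds the layer's spectral norm to stabilize adversarial training.
    return nn.Sequential(
        spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )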
... Subsequently, it iteratively optimizes the content image to match the style of the referenced image. Since then, numerous methods have been proposed to enhance various aspects of performance, such as controllability [18][19][20][22], text-driven stylization [2,3,11,12] and quality [24][25][26][27][28]. ...
Preprint
Full-text available
Text-driven image stylization aims to synthesize content images with learned textual styles. Recent studies have shown the potential of the diffusion model for producing rich stylizations. However, existing approaches inefficiently control the degree of stylization, which hinders the balance between style and content in generated images. In this paper, we propose a Controllable Text-Driven Image Stylization (ConIS) Framework based on the diffusion model. The proposed framework introduces two modules into the pre-trained text-to-image model. The first is an Unconditional Null-text Inversion (UNTI) module, which optimizes null-text embedding to reduce the bias between inversion and sampling in the diffusion model. Given a content image, this module is able to reconstruct it without semantic guidance. The second is a Null-text Dilution (NTD) module. We design a parameterization mechanism for the semantic intensity of textual conditions, which indirectly controls the degree of stylization through the style degree factor. Finally, we replace the attention maps used in the sampling process with those from the UNTI module to constrain the structure of content images. Experiments have shown that the proposed method enables fine-grained control over the degree of stylization without retraining or fine-tuning the network. Both qualitative and quantitative results indicate that the ConIS framework outperforms state-of-the-art methods in balancing artistic detail and content structure.
... According to the number of styles a single model can transfer, style transfer methods can be divided into three types: Per-Style-Per-Model, Multiple-Style-Per-Model [6][7][8][9], and Arbitrary-Style-Per-Model [3][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][30]. Arbitrary-Style-Per-Model methods perform arbitrary style transfer. ...
Preprint
Full-text available
Most arbitrary style transfer methods only consider transferring the features of the style image and the content image. Although pixel-wise style transfer is achieved, such models are limited in preserving the content structure: the model tends to favor transferring style features, image information is lost during the transfer process, and content preservation is weak, so the generated pictures contain artifacts and patterns from the style pictures. In this paper, an attention feature distribution matching method for arbitrary style transfer is proposed. In the network architecture, a combination of a self-attention mechanism and second-order statistics is used to perform style transfer, and a style strengthen block (SSB) enhances the style features of generated images. In the loss function, the traditional content loss is not used; instead, we integrate the attention mechanism and feature distribution matching to construct the loss function, strengthening the constraints to avoid artifacts in the generated image. Qualitative and quantitative experiments demonstrate the effectiveness of our method in improving arbitrary style transfer quality compared with state-of-the-art arbitrary style transfer methods.
Article
Full-text available
Most arbitrary style transfer methods only consider transferring the features of the style and content images. Although pixel-wise style transfer is achieved, such models are limited in preserving the content structure: the model tends to favor transferring style features, image information is lost during the transfer process, and content preservation is weak, so the generated pictures contain artifacts and patterns from the style pictures. In this paper, an attention feature distribution matching method for arbitrary style transfer is proposed. In the network architecture, a combination of the self-attention mechanism and second-order statistics is used to perform style transfer, and the style strengthen block enhances the style features of generated images. In the loss function, the traditional content loss is not used; instead, we integrate the attention mechanism and feature distribution matching to construct the loss function, strengthening the constraints to avoid artifacts in the generated image. Qualitative and quantitative experiments demonstrate the effectiveness of our method in improving arbitrary style transfer quality compared with state-of-the-art arbitrary style transfer methods.
Conference Paper
Full-text available
Scarcity of labeled data has motivated the development of semi-supervised learning methods, which learn from large portions of unlabeled data alongside a few labeled samples. Consistency regularization between a model's predictions under different input perturbations, in particular, has been shown to provide state-of-the-art results in a semi-supervised framework. However, most of these methods have been limited to classification and segmentation applications. We propose Transformation Consistency Regularization, which delves into the more challenging setting of image-to-image translation, which remains unexplored by semi-supervised algorithms. The method introduces a diverse set of geometric transformations and enforces the model's predictions for unlabeled data to be invariant to those transformations. We evaluate the efficacy of our algorithm on three different applications: image colorization, denoising and super-resolution. Our method is significantly data efficient, requiring only around 10-20% of labeled samples to achieve image reconstructions similar to its fully supervised counterpart. Furthermore, we show the effectiveness of our method in video processing applications, where knowledge from a few frames can be leveraged to enhance the quality of the rest of the movie.
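Under assumed notation where f is the image-to-image model, x an unlabeled image, and T a sampled geometric transformation, the consistency term takes the general form below (the exact distance used in the paper may differ):

\mathcal{L}_{\mathrm{TCR}} = \mathbb{E}_{x,\,T}\left[ \left\lVert f\!\left(T(x)\right) - T\!\left(f(x)\right) \right\rVert_1 \right]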
Article
Full-text available
Prior normalization methods rely on affine transformations to produce arbitrary image style transfers, whose parameters are computed in a pre-defined way. This manually defined nature eventually results in high-cost, shared encoders for both style and content encoding, making style transfer systems cumbersome to deploy in resource-constrained environments such as mobile devices. In this paper, we propose a new and generalized normalization module, termed Dynamic Instance Normalization (DIN), that allows for flexible and more efficient arbitrary style transfer. Comprising an instance normalization and a dynamic convolution, DIN encodes a style image into learnable convolution parameters, upon which the content image is stylized. Unlike conventional methods that use shared complex encoders to encode content and style, the proposed DIN introduces a sophisticated style encoder, yet comes with a compact and lightweight content encoder for fast inference. Experimental results demonstrate that the proposed approach yields very encouraging results on challenging style patterns and, to the best of our knowledge, for the first time enables arbitrary style transfer using a MobileNet-based lightweight architecture, leading to a reduction of more than twenty times in computational cost compared to existing approaches. Furthermore, the proposed DIN provides flexible support for state-of-the-art convolutional operations, and thus triggers novel functionalities, such as uniform-stroke placement for non-natural images and automatic spatial-stroke control.
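A rough sketch of the idea behind DIN as described above: the style image is encoded into per-sample convolution weights that are applied on top of instance-normalized content features. The module names, the style_encoder with an out_dim attribute, and the single dynamic layer are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicInstanceNorm(nn.Module):
    # Instance normalization followed by a convolution whose kernel is
    # predicted from the style image (dynamic convolution).
    def __init__(self, style_encoder, channels, k=3):
        super().__init__()
        self.style_encoder = style_encoder                    # maps style image -> vector
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_weight = nn.Linear(style_encoder.out_dim, channels * channels * k * k)
        self.channels, self.k = channels, k

    def forward(self, content_feat, style_img):
        b, c, h, w = content_feat.shape
        weight = self.to_weight(self.style_encoder(style_img)).view(b * c, c, self.k, self.k)
        x = self.norm(content_feat).view(1, b * c, h, w)
        # Grouped convolution applies each sample's predicted kernel to its own features.
        out = F.conv2d(x, weight, padding=self.k // 2, groups=b)
        return out.view(b, c, h, w)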
Conference Paper
Full-text available
Generative Adversarial Networks (GANs) excel at creating realistic images with complex models for which maximum likelihood is infeasible. However, the convergence of GAN training has still not been proved. We propose a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions. TTUR has an individual learning rate for both the discriminator and the generator. Using the theory of stochastic approximation, we prove that the TTUR converges under mild assumptions to a stationary local Nash equilibrium. The convergence carries over to the popular Adam optimization, for which we prove that it follows the dynamics of a heavy ball with friction and thus prefers flat minima in the objective landscape. For the evaluation of the performance of GANs at image generation, we introduce the "Fréchet Inception Distance" (FID), which captures the similarity of generated images to real ones better than the Inception Score. In experiments, TTUR improves learning for DCGANs and Improved Wasserstein GANs (WGAN-GP), outperforming conventional GAN training on CelebA, CIFAR-10, SVHN, LSUN Bedrooms, and the One Billion Word Benchmark. https://papers.nips.cc/paper/7240-gans-trained-by-a-two-time-scale-update-rule-converge-to-a-local-nash-equilibrium
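In practice, the two time-scale rule amounts to giving the discriminator and the generator independent optimizers with different learning rates. A minimal sketch, assuming discriminator and generator modules are already defined; the values are illustrative, not the paper's tuned settings:

import torch

# Independent Adam optimizers with separate learning rates for D and G.
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))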
Article
Full-text available
Universal style transfer aims to transfer arbitrary visual styles to content images. Existing feed-forward based methods, while enjoying inference efficiency, are mainly limited by an inability to generalize to unseen styles or by compromised visual quality. In this paper, we present a simple yet effective method that tackles these limitations without training on any pre-defined styles. The key ingredient of our method is a pair of feature transforms, whitening and coloring, that are embedded into an image reconstruction network. The whitening and coloring transforms reflect a direct matching of the feature covariance of the content image to that of a given style image, which shares a similar spirit with the optimization of the Gram-matrix-based cost in neural style transfer. We demonstrate the effectiveness of our algorithm by generating high-quality stylized images, with comparisons to a number of recent methods. We also analyze our method by visualizing the whitened features and synthesizing textures via simple feature coloring.
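A condensed sketch of the whitening and coloring transforms on flattened features from a single VGG layer; the variable names and single-layer setting are assumptions for illustration, whereas the paper embeds the transforms in a multi-level reconstruction network.

import torch

def whiten_color(content_feat, style_feat, eps=1e-5):
    # content_feat, style_feat: (C, H*W) features from the same VGG layer.
    c_mean = content_feat.mean(dim=1, keepdim=True)
    s_mean = style_feat.mean(dim=1, keepdim=True)
    fc, fs = content_feat - c_mean, style_feat - s_mean

    # Whitening: remove the content feature covariance.
    cov_c = fc @ fc.t() / (fc.size(1) - 1) + eps * torch.eye(fc.size(0))
    ec, vc = torch.linalg.eigh(cov_c)
    whitened = vc @ torch.diag(ec.clamp(min=eps).rsqrt()) @ vc.t() @ fc

    # Coloring: impose the style feature covariance, then restore the style mean.
    cov_s = fs @ fs.t() / (fs.size(1) - 1) + eps * torch.eye(fs.size(0))
    es, vs = torch.linalg.eigh(cov_s)
    colored = vs @ torch.diag(es.clamp(min=eps).sqrt()) @ vs.t() @ whitened
    return colored + s_mean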
Article
Full-text available
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
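In symbols, the two-player game described above is the minimax objective

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],

whose unique optimum has G reproducing the data distribution and D(x) = 1/2 everywhere, as stated in the abstract.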
Conference Paper
Full-text available
In this paper, we analyse two well-known objective image quality metrics, the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM), and we derive a simple mathematical relationship between them which holds for various kinds of image degradations such as Gaussian blur, additive Gaussian white noise, and JPEG and JPEG2000 compression. A series of tests on images extracted from the Kodak database gives a better understanding of the similarity and difference between the SSIM and the PSNR.
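PSNR itself is a one-line computation from the mean squared error; a minimal PyTorch sketch, assuming images scaled to [0, 1]:

import torch

def psnr(img, ref, max_val=1.0):
    # PSNR = 10 * log10(MAX^2 / MSE); higher values indicate closer agreement.
    mse = torch.mean((img - ref) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)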
Article
Full-text available
Objective methods for assessing perceptual image quality traditionally attempted to quantify the visibility of errors (differences) between a distorted image and a reference image using a variety of known properties of the human visual system. Under the assumption that human visual perception is highly adapted for extracting structural information from a scene, we introduce an alternative complementary framework for quality assessment based on the degradation of structural information. As a specific example of this concept, we develop a structural similarity index and demonstrate its promise through a set of intuitive examples, as well as comparison to both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000. A MATLAB implementation of the proposed algorithm is available online at http://www.cns.nyu.edu/~lcv/ssim/.
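For reference, the per-window index described above is commonly written as

\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},

where \mu and \sigma^2 are local means and variances, \sigma_{xy} is the local covariance, and c_1, c_2 are small stabilizing constants.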
Article
Video style transfer is attracting increasing attention from the artificial intelligence community because of its numerous applications, such as augmented reality and animation production. Relative to traditional image style transfer, video style transfer presents new challenges, including how to effectively generate satisfactory stylized results for any specified style while maintaining temporal coherence across frames. Towards this end, we propose a Multi-Channel Correlation network (MCCNet), which can be trained to fuse exemplar style features and input content features for efficient style transfer while naturally maintaining coherence from input videos to output videos. Specifically, MCCNet works directly on the feature spaces of the style and content domains, where it learns to rearrange and fuse style features on the basis of their similarity to content features. The outputs generated by MCCNet are features containing the desired style patterns that can further be decoded into images with vivid style textures. Moreover, MCCNet is also designed to explicitly align the features to the input and thereby ensure that the outputs maintain the content structures and temporal continuity. To further improve the performance of MCCNet under complex light conditions, we also introduce an illumination loss during training. Qualitative and quantitative evaluations demonstrate that MCCNet performs well in arbitrary video and image style transfer tasks. Code is available at https://github.com/diyiiyiii/MCCNet.
Article
In this paper, we propose a cycle-consistent-network-based end-to-end TTS model for speaking style transfer, including intra-speaker, inter-speaker, and unseen-speaker style transfer, for both parallel and unparallel transfer. The proposed approach is built upon a multi-speaker Variational Autoencoder (VAE) TTS model. The model is usually trained in a paired manner, which means the reference speech is fully paired with the output, including speaker identity, text, and style. To achieve better quality for style transfer, which in most cases is unpaired, we augment the model with an unpaired path with a separate variational style encoder. The unpaired path takes as input an unpaired reference speech and yields an unpaired output. The unpaired output, which lacks a direct ground-truth target, is then constrained by a delicately designed cycle-consistent network. Specifically, the unpaired output of the forward transfer is fed into the model again as an unpaired reference input, and after the backward transfer it yields an output expected to be the same as the original unpaired reference speech. An ablation study shows the effectiveness of the unpaired path, the separate style encoders, and the cycle-consistent network in the proposed model. The final evaluation demonstrates that the proposed approach significantly outperforms the Global Style Token (GST) and VAE based systems for all six style transfer categories, in terms of naturalness, speech quality, similarity of speaker identity, and similarity of speaking style.
Chapter
An unsupervised image-to-image translation (UI2I) task deals with learning a mapping between two domains without paired images. While existing UI2I methods usually require numerous unpaired images from different domains for training, there are many scenarios where training data is quite limited. In this paper, we argue that even if each domain contains a single image, UI2I can still be achieved. To this end, we propose TuiGAN, a generative model that is trained on only two unpaired images and amounts to one-shot unsupervised learning. With TuiGAN, an image is translated in a coarse-to-fine manner where the generated image is gradually refined from global structures to local details. We conduct extensive experiments to verify that our versatile method can outperform strong baselines on a wide variety of UI2I tasks. Moreover, TuiGAN is capable of achieving comparable performance with the state-of-the-art UI2I models trained with sufficient data.
Article
Image style transfer renders the content of an image in different styles. Current methods have made decent progress in transferring the style of a single image; however, visual statistics from one image cannot reflect the full scope of an artist. Also, previous work did not give content preservation sufficient priority, which results in poor structural integrity and deteriorates the comprehensibility of the generated image. These two problems limit the visual quality improvement of style transfer results. Targeting the style resemblance and content preservation problems, we propose a style transfer system composed of a collection representation space and semantic-guided reconstruction. We train an encoder-decoder network with art collections to construct a representation space that can reflect the style of the artist. Then, we use semantic information as guidance to reconstruct the target representation of the input image for better content preservation. We conduct both quantitative analysis and qualitative evaluation to assess the proposed method. Experimental results demonstrate that our approach balances well the trade-off between capturing artistic characteristics and preserving content information in style transfer tasks.
Conference Paper
Unsupervised domain mapping aims to learn a function G_XY to translate domain X to domain Y in the absence of paired examples. Finding the optimal G_XY without paired data is an ill-posed problem, so appropriate constraints are required to obtain reasonable solutions. While some prominent constraints such as cycle consistency and distance preservation successfully constrain the solution space, they overlook the special property of images that simple geometric transformations do not change an image's semantic structure. Based on this property, we develop a geometry-consistent generative adversarial network (GcGAN), which enables one-sided unsupervised domain mapping. GcGAN takes the original image and its counterpart transformed by a predefined geometric transformation as inputs and generates two images in the new domain, coupled with the corresponding geometry-consistency constraint. The geometry-consistency constraint reduces the space of possible solutions while keeping the correct solutions in the search space. Quantitative and qualitative comparisons with the baseline (GAN alone) and state-of-the-art methods including CycleGAN [66] and DistanceGAN [5] demonstrate the effectiveness of our method.
Conference Paper
Neural style transfer has drawn considerable attention from both academic and industrial fields. Although visual effect and efficiency have been significantly improved, existing methods are unable to coordinate the spatial distribution of visual attention between the content image and the stylized image, or to render diverse levels of detail via different brush strokes. In this paper, we tackle these limitations by developing an attention-aware multi-stroke style transfer model. We first propose to assemble a self-attention mechanism into a style-agnostic reconstruction autoencoder framework, from which the attention map of a content image can be derived. By performing multi-scale style swap on content features and style features, we produce multiple feature maps reflecting different stroke patterns. A flexible fusion strategy is further presented to incorporate the salient characteristics of the attention map, which allows integrating multiple stroke patterns into different spatial regions of the output image harmoniously. We demonstrate the effectiveness of our method and generate stylized images with multiple stroke patterns that are comparable to those of state-of-the-art methods.
Article
This paper proposes Markovian Generative Adversarial Networks (MGANs), a method for training generative neural networks for efficient texture synthesis. While deep neural network approaches have recently demonstrated remarkable results in terms of synthesis quality, they still come at considerable computational cost (minutes of run-time for low-resolution images). Our paper addresses this efficiency issue. Instead of the numerical deconvolution used in previous work, we precompute a feed-forward, strided convolutional network that captures the feature statistics of Markovian patches and is able to directly generate outputs of arbitrary dimensions. Such a network can directly decode brown noise into realistic texture, or photos into artistic paintings. With adversarial training, we obtain quality comparable to recent neural texture synthesis methods. As no optimization is required at generation time, our run-time performance (0.25M-pixel images at 25 Hz) surpasses previous neural texture synthesizers by a significant margin (at least 500 times faster). We apply this idea to texture synthesis, style transfer, and video stylization.
Conference Paper
We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve the optimization problem proposed by Gatys et al. in real-time. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.
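A compact sketch of such a perceptual loss built on a pretrained VGG-16 from recent torchvision; the layer choice and weights enum are illustrative assumptions rather than the paper's exact configuration.

import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    # Compares images in the feature space of a frozen, pretrained VGG-16
    # instead of (or in addition to) a per-pixel loss.
    def __init__(self, cut=16):  # features[:16] ends at relu3_3 in VGG-16
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
        self.slice = nn.Sequential(*list(vgg.children())[:cut]).eval()
        for p in self.slice.parameters():
            p.requires_grad = False

    def forward(self, output, target):
        return nn.functional.mse_loss(self.slice(output), self.slice(target))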
Article
Head portraits are popular in traditional painting. Automating portrait painting is challenging, as the human visual system is sensitive to the slightest irregularities in human faces. Applying generic painting techniques often deforms facial structures. On the other hand, portrait painting techniques are mainly designed for the graphite style and/or are based on image analogies; an example painting as well as its original unpainted version are required. This limits their domain of applicability. We present a new technique for transferring the painting from one head portrait onto another. Unlike previous work, our technique only requires the example painting and is not restricted to a specific style. We impose novel spatial constraints by locally transferring the color distributions of the example painting. This better captures the painting texture and maintains the integrity of facial structures. We generate a solution through Convolutional Neural Networks, and we present an extension to video, where motion is exploited in a way that reduces temporal inconsistencies and the shower-door effect. Our approach transfers the painting style while maintaining the identity of the input photograph. In addition, it significantly reduces facial deformations compared with the state of the art.
Conference Paper
We present a new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Artistic style transfer with internal-external learning and contrastive learning
  • Chen
Wasserstein generative adversarial networks
  • Arjovsky