
High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis

Authors:
Yipin Yang, Zhiguo Huang
1. Introduction
In the previous post, we went through an introduction to image inpainting and the first GAN-based inpainting algorithm, Context Encoders. If you have not read the previous post, I highly recommend having a quick look at it first! This time, we will dive into another inpainting method that can be regarded as an improved version of Context Encoders. Let's start! First, let me briefly recall what we learnt in the previous post. A deep semantic understanding of an image, i.e. its context, is important for inpainting, and a (channel-wise) fully-connected layer is one way to capture that context. For image inpainting, the visual quality of the filled images matters much more than pixel-wise reconstruction accuracy. More specifically, since there is no single correct answer for the generated pixels (we do not have the ground truth in real-world situations), we simply want realistic-looking filled images. Existing inpainting algorithms can only handle low-resolution images because of memory limitations and the difficulty of training on high-resolution images. Although the state-of-the-art inpainting method, Context Encoders, can regress (predict) the missing parts with a certain degree of semantic correctness, there is still room for improvement in the textures and details of the predicted pixels, as shown in Figure 1.
Context Encoder is not perfect: i) the texture details of the generated pixels can be further improved, and ii) it cannot handle high-resolution images. At the same time, Neural Style Transfer is a hot topic in which we transfer the style of one image (the style image) to another image while keeping the latter's content (the content image), as shown in Figure 2 below. Note that textures and colours can be regarded as a kind of style. The authors of this paper employ a style transfer algorithm to enhance the texture details of the generated pixels.
The authors employ a Context Encoder to predict the missing parts and obtain an initial set of predicted pixels. Then, they apply a style transfer algorithm to the predicted pixels and the valid pixels. The main idea is to transfer the style of the most similar valid pixels to the predicted pixels to enhance the texture details. In their formulation, they assume that the test images are always 512x512 with a 256x256 center hole. They use a three-level pyramid to handle this high-resolution inpainting problem. The input is first resized to 128x128 with a 64x64 center hole for a low-resolution reconstruction. After that, the filled image is up-sampled to 256x256 with a 128x128 coarsely filled hole for the second reconstruction. Finally, the filled image is again up-sampled to 512x512 with a 256x256 filled hole for the last reconstruction (or one may call it refinement).
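To make the scale schedule concrete, here is a minimal sketch of the coarse-to-fine loop in Python/PyTorch. The function names (`content_network`, `refine_with_texture_loss`) are placeholders for the paper's two stages, not the authors' actual code.

```python
import torch.nn.functional as F

# Three-level pyramid assumed in the paper: 128 -> 256 -> 512,
# with a centered square hole of half the image size at every level.
SCALES = [128, 256, 512]

def inpaint_multiscale(image_512, content_network, refine_with_texture_loss):
    """image_512: (1, 3, 512, 512) tensor whose 256x256 center region is missing."""
    filled = None
    for size in SCALES:
        x = F.interpolate(image_512, size=(size, size), mode="bilinear",
                          align_corners=False)
        if filled is None:
            # Lowest scale: the content network predicts the filled 128x128 image.
            filled = content_network(x)
        else:
            # Higher scales: paste the up-sampled previous result into the hole,
            # then refine it with the texture (style) optimization.
            hole = size // 2                      # 128, then 256
            top = (size - hole) // 2
            coarse = F.interpolate(filled, size=(size, size), mode="bilinear",
                                   align_corners=False)
            x[:, :, top:top+hole, top:top+hole] = \
                coarse[:, :, top:top+hole, top:top+hole]
            filled = refine_with_texture_loss(x, hole_box=(top, top, hole, hole))
    return filled
```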
The contributions can be summarized as follows: i) a framework that combines techniques from Context Encoders and Neural Style Transfer; ii) a multi-scale scheme to handle high-resolution images; iii) experimental evidence that style transfer techniques can be used to enhance the texture details of the generated pixels. Figure 3 shows the proposed framework, and it is actually not difficult to understand. The Content Network is a slightly modified Context Encoder, while the Texture Network is a VGG-19 network pre-trained on ImageNet. To me, this is an early version of a coarse-to-fine network that can operate at multiple scales. The main insight of this paper is how the model is optimized (i.e. the design of the loss function).

Content Network. As mentioned, the content network is the Context Encoder. The authors first train the content network independently; the output of the trained content network is then used to optimize the entire proposed framework. Referring to the structure of the content network in Figure 3, there are two differences from the original Context Encoder: i) the channel-wise fully-connected layer in the middle is replaced by a standard fully-connected layer, and ii) all ReLU and Leaky ReLU activation layers are replaced by ELU layers. The authors claim that ELU handles large negative neural responses better than ReLU and Leaky ReLU; note that ReLU only allows positive responses to pass through. They train the Content Network in the same way as the original Context Encoder, i.e. with a combination of L2 loss and adversarial loss. You may refer to my previous post for details.
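As a rough illustration of the two modifications described above, here is a minimal encoder-decoder sketch with ELU activations and a standard fully-connected bottleneck. The layer sizes are illustrative assumptions, not the exact architecture from the paper.

```python
import torch.nn as nn

class ContentNetworkSketch(nn.Module):
    """Toy Context-Encoder-style network for 128x128 inputs with a 64x64 hole.
    ELU replaces ReLU/LeakyReLU; a plain Linear layer replaces the
    channel-wise fully-connected bottleneck (illustrative sizes)."""
    def __init__(self, bottleneck=4000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ELU(),     # 128 -> 64
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ELU(),   # 64 -> 32
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ELU(),  # 32 -> 16
            nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.ELU(),  # 16 -> 8
        )
        self.fc = nn.Sequential(                 # standard FC bottleneck
            nn.Flatten(),
            nn.Linear(512 * 8 * 8, bottleneck), nn.ELU(),
            nn.Linear(bottleneck, 512 * 8 * 8), nn.ELU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ELU(),  # 8 -> 16
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ELU(),  # 16 -> 32
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1), nn.Tanh(),   # 32 -> 64
        )

    def forward(self, x):
        z = self.fc(self.encoder(x)).view(-1, 512, 8, 8)
        return self.decoder(z)  # predicted 64x64 hole content
```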
I will try to explain the texture network in more detail here, as it relates to neural style transfer; interested readers may look up the topic for further details. The objective of the texture network is to ensure that the fine details of the generated pixels are similar to those of the valid pixels (i.e. we want a consistent style/texture across the image). Simply speaking, the authors make use of the findings in [2]: to some extent, the feature maps at different layers inside a network represent the image style. In other words, given a trained network, if two images have similar feature maps inside the network, we may claim that the two images have similar styles.

Figure 1.

Figure 2.

To be honest, this is an over-simplified claim. In [2], the authors employ a VGG network pre-trained on ImageNet classification as a feature extractor and compute a Gram matrix (also called an autocorrelation matrix) of the feature maps at each VGG layer. If two images have similar Gram matrices, they have similar styles, such as textures and colours. Back to the inpainting paper: the authors also use the pre-trained VGG network as their Texture Network, as shown in Figure 3. They enforce that the feature-map responses inside the center hole region are similar to those outside the hole region at several layers of the VGG; specifically, they use the relu3_1 and relu4_1 layers for this computation.
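To make the Gram-matrix idea concrete, here is a small sketch (assuming a recent torchvision) that extracts VGG-19 features at the layers mentioned above and compares their Gram matrices. It is a generic style-similarity check, not the paper's exact texture loss, which matches local feature patches instead (see Section 3).

```python
import torch
from torchvision import models

# Indices of relu3_1 and relu4_1 in torchvision's VGG-19 `features` module.
LAYERS = {11: "relu3_1", 20: "relu4_1"}

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

def gram_matrices(x):
    """Return {layer_name: Gram matrix} for an image batch x of shape (N, 3, H, W)."""
    grams = {}
    feat = x
    for idx, layer in enumerate(vgg):
        feat = layer(feat)
        if idx in LAYERS:
            n, c, h, w = feat.shape
            f = feat.view(n, c, h * w)
            grams[LAYERS[idx]] = f @ f.transpose(1, 2) / (c * h * w)
    return grams

def style_distance(img_a, img_b):
    """Mean squared difference between the Gram matrices of two images."""
    ga, gb = gram_matrices(img_a), gram_matrices(img_b)
    return sum(torch.mean((ga[k] - gb[k]) ** 2) for k in ga)
```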
Figure 3.
2. Related Work
Controllable image synthesis has been a long-term objective in computer vision and computer graphics. In earlier works [24,46], researchers used many aligned image pairs (i.e., visual domain guidance) from a source domain and a target domain to train a translation model that maps source images to the desired target images.

Collecting paired data is usually costly in practical applications, and in many cases it is even impossible to acquire plausible paired data, e.g., when translating real images to cartoon images. Thus, unsupervised methods [77,27,62] have attracted a lot of attention, as they can be trained in an unpaired setting. To achieve reliable generation performance, some labeling or expert guidance is still expected in certain applications, e.g., old movie restoration [43] or genomics [53]. Therefore, semi-supervised learning methods [28,49,5] have been introduced into image synthesis to further improve the quality of generated images. Semi-supervised approaches leverage source images together with only a few source-target aligned image pairs for training, yet achieve more compelling generation results than the unsupervised setting. On the other hand, humans can learn from only one or a few exemplars and still achieve meaningful results. As described in the meta-learning and few-shot learning literature [74,54], humans can effectively use prior experience and knowledge when learning new tasks, while neural networks usually overfit to limited data and fail to generalize. Thus, few-shot or one-shot learning models have also been explored in many works [38,34,35,36]. Although the dataset settings differ, most of these image generation techniques tend to learn a one-to-one mapping and only generate single-modal outputs. In practice, however, the translation between domains is inherently ambiguous, as one input image may correspond to multiple possible outputs. Multimodal generation maps the input image from one domain to a distribution of potential outputs in the target domain while remaining faithful to the input. These diverse outputs represent different samples but preserve characteristics similar to the source image.
Many computer vision problems can be seen as an image-to-image translation problem: mapping an image from one domain to a corresponding image in a different domain. As an illustration, super-resolution can be viewed as mapping a low-resolution image to a corresponding high-resolution one, and image colorization is the problem of mapping a gray-scale image to a corresponding colour one. The problem can be studied with supervised or unsupervised learning methods. In the supervised approaches, pairs of images across the domains are available [24]. In the unsupervised models, only two separate sets of images are available, one composed of images from one domain and the other composed of images from a different domain; there are no paired samples showing how an image could be translated to a corresponding image in the other domain. Because of the lack of corresponding images, the unsupervised image-to-image translation problem is considered more difficult, but it is also more practical because training data are easier to collect.

When viewing the image translation problem from a likelihood perspective, the main challenge is to learn a joint distribution of images in different domains. In the unsupervised setting, the two sets consist of images drawn from the two marginal distributions of the different domains, and the task is to infer the joint distribution from these images. However, deriving the joint distribution from the marginal distributions is an extremely ill-posed problem. In this section, we discuss image-to-image translation methods. Image-to-image translation is similar to style transfer, which takes a style image and a content image as input and outputs an image that has the content of the content image and the style of the style image. Image-to-image translation, however, not only transfers image styles but can also manipulate features of objects. This section lists several models proposed for image-to-image translation, from supervised methods to unsupervised ones.
2.1. Supervised Translation
Isola et al. [24] proposed to combine the adversarial loss with an L1 regularization loss, so that the generator is trained not only to fool the discriminator but also to produce images that contain realistic objects and stay close to the ground-truth images; L1 was chosen over L2 because it produces less blurry images. The conditional GAN loss is formulated as:

$$\ell_{cGAN}(G, D) = \mathbb{E}_{(x,y)\sim p_{data}(x,y)}[\log D(x, y)] + \mathbb{E}_{x\sim p_{data}(x),\, z\sim p_z(z)}[\log(1 - D(x, G(x, z)))], \tag{1}$$

in which $(x, y)\sim p_{data}(x, y)$ denotes images that have different styles but belong to the same scene, similar to the standard GAN [18], and $z\sim p_z(z)$ represents random noise. The L1 loss encouraging self-similarity is defined as:

$$\ell_{L1}(G) = \mathbb{E}_{(x,y)\sim p_{data}(x,y),\, z\sim p_z(z)}\big[\, \|y - G(x, z)\|_1 \,\big], \tag{2}$$

and the overall objective is specified by:

$$G^*, D^* = \arg\min_G \max_D \; \ell_{cGAN}(G, D) + \lambda\, \ell_{L1}(G), \tag{3}$$

in which the hyperparameter $\lambda$ balances the two loss terms. Moreover, in [24] the authors pointed out that the noise $z$ does not have a noticeable influence on the result; therefore, they inject noise in the form of dropout during training and testing instead of sampling it from a random distribution. In this model, the generator $G$ is based on a U-Net structure with skip connections joining each encoder layer to the decoder layer at the same resolution, so that low-level information such as object edges can be shared. In [24] the authors also proposed PatchGAN: rather than classifying the whole image, the discriminator classifies each $N \times N$ patch of the image and averages the patch scores to obtain the final score. The experiments show that restricting the discriminator to local patches is sufficient for capturing high-frequency details.
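As a rough sketch of the objective in Eqs. (1)-(3), here is how the pix2pix-style generator and PatchGAN discriminator losses could be computed in PyTorch. `G`, `D`, and `lam` are placeholders, the discriminator is assumed to output a patch map of real/fake logits, and the generator term uses the standard non-saturating BCE variant rather than the literal log(1 - D) form.

```python
import torch
import torch.nn.functional as F

def pix2pix_losses(G, D, x, y, lam=100.0):
    """x: input image, y: ground-truth target. D(x, y) returns an NxN patch
    map of logits (PatchGAN). Returns (generator loss, discriminator loss)."""
    fake = G(x)

    # Discriminator: real pairs -> 1, fake pairs -> 0 (Eq. 1).
    d_real = D(x, y)
    d_fake = D(x, fake.detach())
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator: fool the discriminator and stay close to y in L1 (Eqs. 2-3).
    d_fake_for_g = D(x, fake)
    loss_g = F.binary_cross_entropy_with_logits(d_fake_for_g,
                                                torch.ones_like(d_fake_for_g)) + \
             lam * F.l1_loss(fake, y)
    return loss_g, loss_d
```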
Yoo et al. proposed an algorithm for supervised image-to-image translation with a secondary discriminator $D_{pair}$ that evaluates whether a pair of images from the two domains is associated with each other. The loss of $D_{pair}$ is calculated as follows:

$$\ell_{pair} = -t \log[D_{pair}(X_s, X)] + (t - 1)\log[1 - D_{pair}(X_s, X)], \quad \text{s.t. } t = \begin{cases} 1 & \text{if } X = X_t \\ 0 & \text{if } X = \hat{X}_t \\ 0 & \text{if } X = \bar{X}_t \end{cases} \tag{4}$$

where $X_s$ is the input image from the source domain, $X_t$ is its ground-truth image in the target domain, and $\bar{X}_t$ is an irrelevant image in the target domain. The generator of the proposed model transfers $X_s$ into a single image $\hat{X}_t$ in the associated domain.
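Eq. (4) is simply a binary cross-entropy on whether the pair is associated; a minimal sketch, assuming `D_pair` returns logits:

```python
import torch
import torch.nn.functional as F

def pair_discriminator_loss(D_pair, X_s, X, t):
    """Eq. (4): t = 1 for the ground-truth pair (X_s, X_t); t = 0 for a
    generated pair (X_s, X_hat_t) or an irrelevant pair (X_s, X_bar_t)."""
    logits = D_pair(X_s, X)
    target = torch.full_like(logits, float(t))
    return F.binary_cross_entropy_with_logits(logits, target)
```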
Other works extend adversarial training in related directions. One work proposed efficient pyramid adversarial networks for generating synthetic labels from target domains for road segmentation in remote sensing images. Zareapoor et al. proposed semi-supervised adversarial networks for dataset balancing in mechanical devices, and multi-instance learning has been integrated into adversarial networks for human pose estimation, with reported results showing high accuracy and fast performance. To handle imbalanced class problems, Shamsolmoali et al. proposed a capsule adversarial network based on minority-class augmentation. Some authors proposed a general learning framework that assigns the generated samples to a distribution over a set of labels instead of a single label, and demonstrated its effectiveness through a set of experiments. Zhang et al. proposed the DRCW-ASEG method to generate synthetic examples for multi-class imbalanced problems and showed that their strategy improves classification accuracy.
There is no noise input in the generator of pix2pix. A key point of pix2pix is that its generator learns a mapping from an observed image y to an output image G(y), for example from a grayscale image to a colour image. As a follow-up to pix2pix, pix2pixHD [60] used cGANs and a feature matching loss for high-resolution image synthesis and semantic manipulation; with multiple discriminators, the learning problem becomes a multi-task learning problem. Chrysos et al. [8] proposed robust cGANs, and Thekumparampil et al. [59] discussed the robustness of conditional GANs to noisy labels. Conditional CycleGAN [39] uses cGANs with cycle consistency. Mode seeking GANs (MSGANs) [40] propose a simple yet effective regularization term to address the mode collapse issue of cGANs. GANs are also utilized for image composition [33,3,70,64]. Based on cGANs, we can generate samples conditioned on class labels [45,44] or text [50,22,72]. In [72,71], text-to-photo-realistic image synthesis is conducted with stacked generative adversarial networks (SGAN) [23]. cGANs have also been used for convolutional face generation [15], face aging [1], multi-modal image translation [58,67], panoramic image generation [14,55], exemplar-based image synthesis [76,73,69], synthesizing outdoor images with specific scenery attributes [25], natural image description [9], and scene manipulation [61]. Most cGAN-based methods [11,48,52,13,56] feed the conditional information y into the discriminator by simply concatenating (an embedding of) y to the input or to the feature vector at some middle layer, while cGANs with a projection discriminator [41] adopt an inner product between the condition vector y and the feature vector. Two-domain I2I can solve many problems in computer vision, computer graphics, and image processing, such as image style transfer [77,31], bounding box and keypoint prediction [51,68], which can be used in photo editor apps to improve user experience, semantic segmentation [47,79], which benefits autonomous driving, image colorization [57,32], and domain adaptation [42,6,37,65,66]. If low-resolution images are taken as the source domain and high-resolution images as the target domain, we naturally obtain image super-resolution [63,75].
2.1.1 Multimodal Outputs
Multimodal image translation maps the input image from one domain to a distribution of potential outputs in the target domain while remaining faithful to the input.

Figure 4.

Figure 5.

In fact, multimodal translation benefits from solutions to the mode collapse problem [17,2,19], in which the generator tends to map different input samples to the same output. Thus, many multimodal image translation methods [78,4] focus on solving the mode collapse problem so that diverse outputs emerge naturally. BicycleGAN [78] was the first supervised multimodal image translation work; it combines cVAE-GAN [21,29,30] and cLR-GAN [7,12,13] to systematically study a family of solutions to the mode collapse problem and to generate diverse and realistic outputs. Similarly, Bansal et al. [4] proposed PixelNN to achieve multimodal and controllable results in image translation. They proposed a nearest-neighbor (NN) approach combined with pixelwise matching to translate the incomplete, conditioned input into multiple outputs and to allow a user to control the translation through on-the-fly editing of the exemplar set.
Another way to produce diverse outputs is to use disentangled representations [7,20,26,10], which aim to break down, or disentangle, each factor of variation into a narrowly defined variable and encode it as a separate dimension. When combining this with image translation, researchers disentangle the representations of the source and target domains into two parts: domain-invariant content features, which are preserved during the translation, and domain-specific style features, which are changed during the translation. In other words, image translation transfers images from the source domain to the target domain by preserving content while replacing style. Therefore, one can achieve multimodal outputs by randomly choosing the style features, which are often regularized to be drawn from a prior Gaussian distribution N(0, 1) (see the short sketch below).

Figure 6.

Figure 7.

Figure 8.

Gonzalez-Garcia et al. [16] disentangled the representations of the two domains into three parts: a shared part containing information common to both domains, and two exclusive parts that only represent the factors of variation particular to each domain. In addition to bi-directional multimodal translation and retrieval of similar images across domains, they can also perform domain-specific transfer and interpolation across the two domains.
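To illustrate the content/style decomposition described above, here is a minimal sketch of how a disentangled translator could produce multiple outputs for one input by re-sampling the style code from the Gaussian prior; the encoder and decoder modules are placeholders, not any specific published architecture.

```python
import torch

def sample_multimodal_outputs(content_encoder, decoder, x, style_dim=8, k=5):
    """Produce k diverse translations of input image x by pairing its
    domain-invariant content code with k style codes drawn from N(0, 1)."""
    c = content_encoder(x)                     # domain-invariant content
    outputs = []
    for _ in range(k):
        s = torch.randn(x.size(0), style_dim)  # domain-specific style ~ N(0, 1)
        outputs.append(decoder(c, s))          # same content, different style
    return outputs
```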
3. Methods & Results
The total loss function consists of three terms: a content loss (L2 loss), a texture loss, and a TV loss (total variation loss). Together they form the joint loss that the authors minimize (a reconstruction of the objective is sketched below). Note that i indexes the scale and, as mentioned, they employ 3 scales in this work; x is the ground-truth image (i.e. an intact image without missing parts); h(x, R) returns the colour content of x within the hole region R; φ_t(x) returns the feature maps computed by network t given input x; and R_φ denotes the corresponding hole region in the feature maps. The last term is the total variation loss, which is commonly used in image processing to encourage smoothness. α and β are weights balancing the loss terms.

Figure 9.

Figure 10.

The content loss term is very easy to understand: it is simply the L2 loss that ensures pixel-wise reconstruction accuracy. The texture loss term looks a bit more complicated but is also easy to understand. First, the images are fed to the pre-trained VGG-19 network to obtain feature maps at the relu3_1 and relu4_1 layers (middle layers). Then, the feature maps are split into two groups: one for the hole region (R_φ) and one for the outside (i.e. the valid region). Each local feature patch P inside the hole region has size s x s x c (s is the spatial size and c is the number of feature maps). For each local patch, the most similar patch outside the hole region is found, and the average L2 distance between each local patch and its nearest neighbour is computed. In their Eq. 3, |R_φ| is the total number of patches sampled in the region R_φ, P_i is the local patch centered at location i, and nn(i) is the index of its nearest neighbour, found with their Eq. 4. Finally, the TV loss is computed on the output image.
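For reference, here is a LaTeX reconstruction of the objective following the notation of the prose above; the exact formulation should be checked against the original paper by Yang et al., and $\Upsilon(x)$ is used here as a symbol for the total variation term.

```latex
% Joint objective at each scale i (content + texture + total variation):
x_i^{*} \;=\; \arg\min_{x}\;
    \underbrace{\big\| h(x, R) - h(x_i, R) \big\|_2^2}_{\text{content loss}}
  \;+\; \alpha\, E_t\!\left(\phi_t(x), R_\phi\right)
  \;+\; \beta\, \Upsilon(x)

% Texture loss: average distance between each hole patch and its
% nearest-neighbour patch taken from the valid region:
E_t\!\left(\phi_t(x), R_\phi\right) \;=\;
    \frac{1}{|R_\phi|} \sum_{i \in R_\phi}
    \big\| P_i\!\left(\phi_t(x)\right) - P_{nn(i)}\!\left(\phi_t(x)\right) \big\|_2^2,
\qquad
nn(i) \;=\; \arg\min_{j \notin R_\phi} \big\| P_i - P_j \big\|_2^2
```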
Experimental Results. As with the Context Encoder, two datasets are used for evaluation: Paris StreetView and ImageNet. Paris StreetView consists of 14,900 training images and 100 test images; ImageNet contains 1.26M training images, with 200 test images randomly selected from the validation set. Table 1 shows the quantitative results of the different methods; higher PSNR means better performance, and the proposed method clearly offers the highest PSNR. The authors also note that quantitative metrics (e.g. PSNR, L1 error, etc.) may not be the most suitable for the image inpainting task, since the objective is to generate realistic-looking filled images. Figure 4 is the visual comparison with several methods. From the zoomed-in versions of (d) and (e), we can see that the proposed method generates sharper texture details than the state-of-the-art method, Context Encoder. The authors also provide an ablation study of the loss terms. Figure 5 shows the result without the content loss term; clearly, without it, the structure of the inpainting result is completely incorrect. Apart from showing that the content loss term is necessary, the authors also show the importance of the texture loss term. Figure 6 shows the effect of different texture weights α in their Eq. 1: a larger texture weight gives sharper results but may affect the overall image structure, as shown in Figure 6(d). As mentioned, the authors train the Content Network in the same way as the Context Encoder, and they show the effect of using only the L2 loss versus using both L2 and adversarial losses. From Figure 7, we can clearly see that the quality of the content network's output matters for the final result, and that the content network is better trained with both L2 and adversarial losses. As mentioned before, the authors propose a multi-scale scheme to handle high-resolution images; Figure 8 shows the high-resolution image inpainting results. The Context Encoder only works on 128x128 inputs, so its results are up-sampled to 512x512 using bilinear interpolation. For the proposed method, the input goes through the network three times, at three scales, to complete the reconstruction. The proposed method clearly offers the best visual quality among the compared methods. However, because of the multi-scale approach to high-resolution inpainting, the proposed method takes roughly 1 minute to fill a 256x256 hole in a 512x512 image on a Titan X GPU, which is a major drawback (i.e. low efficiency).
The authors further extend the proposed method to handle irregularly shaped holes. Simply speaking, they first convert the irregular hole into its bounding rectangle, then crop and pad the image to position the hole at the center. In this way, they can handle images with irregular holes; some examples are shown below, and a small sketch of this preprocessing follows this paragraph.
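As a rough illustration of that preprocessing, here is a small NumPy sketch that turns an irregular binary mask into a centered rectangular hole by cropping around its bounding box; it only mirrors the idea described above, not the authors' actual implementation.

```python
import numpy as np

def center_irregular_hole(image, mask):
    """image: (H, W, 3) array; mask: (H, W) boolean array, True inside the hole.
    Returns a crop of the image in which the hole's bounding box is centered."""
    ys, xs = np.where(mask)
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1
    h, w = image.shape[:2]

    # Amount of context kept around the hole on each side (assumption:
    # keep as much as fits symmetrically inside the original image).
    ctx_y = min(top, h - bottom)
    ctx_x = min(left, w - right)

    crop = image[top - ctx_y: bottom + ctx_y, left - ctx_x: right + ctx_x]
    return crop  # the bounding-box hole now sits at the crop's center
```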
Overall, this is a clear improvement over the Context Encoder. The authors adopt techniques from Neural Style Transfer to further enhance the texture details of the pixels generated by the Context Encoder. As a result, we are one step closer to realistic-looking filled images. However, the authors also point out some directions for future improvement: i) it is still difficult to fill the missing parts when the scene is complicated, as shown in Figure 10, and ii) speed is a problem, as the method cannot achieve real-time performance.
Again, I would like to highlight some points here that will be useful for future posts. This work is an early version of the coarse-to-fine network (also called a two-stage network): we first reconstruct the missing parts with a certain level of pixel-wise reconstruction accuracy (i.e. ensuring the structure is correct), then refine the texture details of the reconstructed parts so that the filled images have good visual quality. The concept of a texture loss plays an important role in later image inpainting papers; by employing this loss, we can obtain sharper generated images. Later works usually achieve sharp generated images by using a Perceptual Loss and/or Style Loss. We will cover them very soon!
References
[1] Grigory Antipov, Moez Baccouche, and Jean-Luc Dugelay.
Face aging with conditional generative adversarial networks.
In 2017 IEEE International Conference on Image Processing
(ICIP), pages 2089–2093. IEEE, 2017. 4
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein generative adversarial networks. In Interna-
tional Conference on Machine Learning, pages 214–223,
2017. 4
[3] Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, and
Trevor Darrell. Compositional gan: Learning image-
conditional binary composition. International Journal of
Computer Vision, 128(10):2570–2585, 2020. 4
[4] Aayush Bansal, Yaser Sheikh, and Deva Ramanan. Pixelnn:
Example-based image synthesis. In International Confer-
ence on Learning Representations, 2018. 4
[5] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas
Papernot, Avital Oliver, and Colin A Raffel. Mixmatch:
A holistic approach to semi-supervised learning. In Ad-
vances in Neural Information Processing Systems, pages
5049–5059, 2019. 2
[6] Jinming Cao, Oren Katzir, Peng Jiang, Dani Lischinski,
Danny Cohen-Or, Changhe Tu, and Yangyan Li. Dida: Dis-
entangled synthesis for domain adaptation, 2018. 4
[7] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya
Sutskever, and Pieter Abbeel. Infogan: Interpretable rep-
resentation learning by information maximizing generative
adversarial nets. In Neural Information Processing Systems,
pages 2172–2180, 2016. 4
[8] Grigorios G Chrysos, Jean Kossaifi, and Stefanos Zafeiriou.
Robust conditional generative adversarial networks. arXiv
preprint arXiv:1805.08657, 2018. 4
[9] Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. To-
wards diverse and natural image descriptions via a condi-
tional gan. In IEEE International Conference on Computer
Vision, pages 2970–2979, 2017. 4
[10] Emily L Denton and vighnesh Birodkar. Unsupervised learn-
ing of disentangled representations from video. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett, editors, Advances in Neural In-
formation Processing Systems 30, pages 4414–4423. Curran
Associates, Inc., 2017. 4
[11] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep
generative image models using laplacian pyramid of adver-
sarial networks. In Neural Information Processing Systems,
pages 1486–1494, 2015. 4
[12] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Ad-
versarial feature learning. arXiv preprint arXiv:1605.09782,
2016. 4
[13] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier
Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron
Courville. Adversarially learned inference. arXiv preprint
arXiv:1606.00704, 2016. 4
[14] Marc-André Gardner, Kalyan Sunkavalli, Ersin Yumer, Xiaohui Shen, Emiliano Gambaretto, Christian Gagné, and Jean-François Lalonde. Learning to predict indoor illumina-
tion from a single image. arXiv preprint arXiv:1704.00090,
2017. 4
[15] Jon Gauthier. Conditional generative adversarial nets for
convolutional face generation. Class Project for Stanford
CS231N: Convolutional Neural Networks for Visual Recog-
nition, Winter semester, 2014(5):2, 2014. 4
[16] Abel Gonzalez-Garcia, Joost Van De Weijer, and Yoshua
Bengio. Image-to-image translation for cross-domain dis-
entanglement. In Advances in neural information processing
systems, pages 1287–1298, 2018. 5
[17] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial
networks, 2017. 4
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Neural In-
formation Processing Systems, pages 2672–2680, 2014. 3
[19] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent
Dumoulin, and Aaron C Courville. Improved training of
wasserstein gans. In Neural Information Processing Systems,
pages 5767–5777, 2017. 4
[20] I. Higgins, Loïc Matthey, A. Pal, Christopher P. Burgess,
Xavier Glorot, M. Botvinick, S. Mohamed, and Alexander
Lerchner. beta-vae: Learning basic visual concepts with a
constrained variational framework. In ICLR, 2017. 4
[21] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing
the dimensionality of data with neural networks. science,
313(5786):504–507, 2006. 4
[22] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and
Honglak Lee. Inferring semantic layout for hierarchical text-
to-image synthesis. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 7986–7994, 2018. 4
[23] Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and
Serge Belongie. Stacked generative adversarial networks. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 5077–5086, 2017. 4
[24] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A
Efros. Image-to-image translation with conditional adver-
sarial networks. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 1125–1134, 2017. 2,3
[25] Levent Karacan, Zeynep Akata, Aykut Erdem, and Erkut
Erdem. Learning to generate images of outdoor scenes
from attributes and semantic layouts. arXiv preprint
arXiv:1612.00215, 2016. 4
[26] Hyunjik Kim and Andriy Mnih. Disentangling by factoris-
ing. In International Conference on Machine Learning,
pages 2649–2658. PMLR, 2018. 4
[27] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee,
and Jiwon Kim. Learning to discover cross-domain relations
with generative adversarial networks. In International Con-
ference on Machine Learning, pages 1857–1865, 2017. 2
[28] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende,
and Max Welling. Semi-supervised learning with deep gen-
erative models. In Advances in neural information process-
ing systems, pages 3581–3589, 2014. 2
[29] Diederik P Kingma and Max Welling. Auto-encoding varia-
tional bayes. stat, 1050:1, 2014. 4
[30] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo
Larochelle, and Ole Winther. Autoencoding beyond pixels
using a learned similarity metric. In International conference
on machine learning, pages 1558–1566. PMLR, 2016. 4
[31] Hsin-Ying Lee, Hung-Yu Tseng, Qi Mao, Jia-Bin Huang,
Yu-Ding Lu, Maneesh Singh, and Ming-Hsuan Yang.
Drit++: Diverse image-to-image translation via disentangled
representations. International Journal of Computer Vision,
pages 1–16, 2020. 4
[32] Junsoo Lee, Eungyeup Kim, Yunsung Lee, Dongjun Kim,
Jaehyuk Chang, and Jaegul Choo. Reference-based sketch
image colorization using augmented-self reference and
dense semantic correspondence. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2020. 4
[33] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman,
and Simon Lucey. St-gan: Spatial transformer generative
adversarial networks for image compositing. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 9455–9464, 2018. 4
[34] Jianxin Lin, Yingxue Pang, Yingce Xia, Zhibo Chen, and
Jiebo Luo. Tuigan: Learning versatile image-to-image trans-
lation with two unpaired images. In European Conference
on Computer Vision, pages 18–35. Springer, 2020. 2
[35] Jianxin Lin, Yijun Wang, Tianyu He, and Zhibo Chen.
Learning to transfer: Unsupervised meta domain translation.
arXiv preprint arXiv:1906.00181, 2019. 2
[36] Jianxin Lin, Yingce Xia, Sen Liu, Tao Qin, and Zhibo
Chen. Zstgan: An adversarial approach for unsuper-
vised zero-shot image-to-image translation. arXiv preprint
arXiv:1906.00184, 2019. 2
[37] Alexander H Liu, Yen-Cheng Liu, Yu-Ying Yeh, and Yu-
Chiang Frank Wang. A unified feature disentangler for multi-
domain image translation and manipulation. In Advances
in neural information processing systems, pages 2590–2599,
2018. 4
[38] Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo
Aila, Jaakko Lehtinen, and Jan Kautz. Few-shot unsuper-
vised image-to-image translation. In Proceedings of the
IEEE/CVF International Conference on Computer Vision
(ICCV), October 2019. 2
[39] Yongyi Lu, Yu-Wing Tai, and Chi-Keung Tang. Conditional
cyclegan for attribute guided face image generation. arXiv
preprint arXiv:1705.09966, 2017. 4
[40] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and
Ming-Hsuan Yang. Mode seeking generative adversarial
networks for diverse image synthesis. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 1429–
1437, 2019. 4
[41] Takeru Miyato and Masanori Koyama. cgans with projection
discriminator. arXiv preprint arXiv:1802.05637, 2018. 4
[42] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ra-
mamoorthi, and Kyungnam Kim. Image to image translation
for domain adaptation. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
June 2018. 4
[43] Aamir Mustafa and Rafał K. Mantiuk. Transformation
consistency regularization: a semi-supervised paradigm
for image-to-image translation. In Andrea Vedaldi, Horst
Bischof, Thomas Brox, and Jan-Michael Frahm, editors,
Computer Vision – ECCV 2020, pages 599–615, Cham,
2020. Springer International Publishing. 2
[44] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovit-
skiy, and Jason Yosinski. Plug & play generative networks:
Conditional iterative generation of images in latent space. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 4467–4477, 2017. 4
[45] Augustus Odena, Christopher Olah, and Jonathon Shlens.
Conditional image synthesis with auxiliary classifier gans.
In International Conference on Machine Learning, pages
2642–2651, 2017. 4
[46] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan
Zhu. Semantic image synthesis with spatially-adaptive nor-
malization. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 2337–2346, 2019. 2
[47] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan
Zhu. Semantic image synthesis with spatially-adaptive nor-
malization. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), June
2019. 4
[48] Guim Perarnau, Joost Van De Weijer, Bogdan Raducanu, and
Jose M. Álvarez. Invertible conditional gans for image edit-
ing. arXiv preprint arXiv:1611.06355, 2016. 4
[49] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri
Valpola, and Tapani Raiko. Semi-supervised learning with
ladder networks. In Advances in neural information process-
ing systems, pages 3546–3554, 2015. 2
[50] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Lo-
geswaran, Bernt Schiele, and Honglak Lee. Generative ad-
versarial text to image synthesis. In International Conference
on Machine Learning, pages 1–10, 2016. 4
[51] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka,
Bernt Schiele, and Honglak Lee. Learning what and where
to draw. In Neural Information Processing Systems, pages
217–225, 2016. 4
[52] Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Tem-
poral generative adversarial nets with singular value clip-
ping. In IEEE International Conference on Computer Vision,
pages 2830–2839, 2017. 4
[53] Mingguang Shi and Bing Zhang. Semi-supervised learning
improves gene expression-based prediction of cancer recur-
rence. Bioinformatics, 27(21):3017–3023, 2011. 2
[54] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypi-
cal networks for few-shot learning. In Advances in neural
information processing systems, pages 4077–4087, 2017. 2
[55] Shuran Song and Thomas Funkhouser. Neural illumination:
Lighting prediction for indoor environments. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 6918–6926, 2019. 4
[56] Kumar Sricharan, Raja Bala, Matthew Shreve, Hui Ding,
Kumar Saketh, and Jin Sun. Semi-supervised conditional
gans. arXiv preprint arXiv:1708.05789, 2017. 4
[57] Patricia L Suárez, Angel D Sappa, and Boris X Vintimilla.
Infrared image colorization based on a triplet dcgan archi-
tecture. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition Workshops, pages 18–
23, 2017. 4
[58] Hao Tang, Dan Xu, Nicu Sebe, Yanzhi Wang, Jason J Corso,
and Yan Yan. Multi-channel attention selection gan with cas-
caded semantic guidance for cross-view image translation. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 2417–2426, 2019. 4
[59] Kiran K Thekumparampil, Ashish Khetan, Zinan Lin, and
Sewoong Oh. Robustness of conditional gans to noisy labels.
In Neural Information Processing Systems, pages 10271–
10282, 2018. 4
[60] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao,
Jan Kautz, and Bryan Catanzaro. High-resolution image syn-
thesis and semantic manipulation with conditional gans. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 8798–8807, 2018. 4
[61] Shunyu Yao, Tzu Ming Hsu, Jun-Yan Zhu, Jiajun Wu, Anto-
nio Torralba, Bill Freeman, and Josh Tenenbaum. 3d-aware
scene manipulation via inverse graphics. In Neural Informa-
tion Processing Systems, pages 1887–1898, 2018. 4
[62] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dual-
gan: Unsupervised dual learning for image-to-image trans-
lation. In Proceedings of the IEEE international conference
on computer vision, pages 2849–2857, 2017. 2
[63] Yuan Yuan, Siyuan Liu, Jiawei Zhang, Yongbing Zhang,
Chao Dong, and Liang Lin. Unsupervised image super-
resolution using cycle-in-cycle generative adversarial net-
works. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR) Workshops,
June 2018. 4
[64] Fangneng Zhan and Shijian Lu. Esir: End-to-end scene text
recognition via iterative image rectification. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 2059–2068, 2019. 4
[65] Fangneng Zhan, Shijian Lu, and Chuhui Xue. Verisimilar
image synthesis for accurate detection and recognition of
texts in scenes. In Proceedings of the European Conference
on Computer Vision (ECCV), pages 249–266, 2018. 4
[66] Fangneng Zhan, Chuhui Xue, and Shijian Lu. Ga-dan:
Geometry-aware domain adaptation network for scene text
detection and recognition. In Proceedings of the IEEE Inter-
national Conference on Computer Vision, pages 9105–9115,
2019. 4
[67] Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang,
and Shijian Lu. Multimodal image synthesis and editing: A
survey. arXiv preprint arXiv:2112.13592, 2021. 4
[68] Fangneng Zhan, Changgong Zhang, Wenbo Hu, Shijian Lu,
Feiying Ma, Xuansong Xie, and Ling Shao. Sparse needlets
for lighting estimation with spherical transport loss. In
Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 12830–12839, 2021. 4
[69] Fangneng Zhan, Jiahui Zhang, Yingchen Yu, Rongliang Wu,
and Shijian Lu. Modulated contrast for versatile image syn-
thesis. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 18280–18290,
2022. 4
[70] Fangneng Zhan, Hongyuan Zhu, and Shijian Lu. Spa-
tial fusion gan for image synthesis. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pages 3653–3662, 2019. 4
[71] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xi-
aogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stack-
gan++: Realistic image synthesis with stacked generative ad-
versarial networks. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 41(8):1947–1962, 2019. 4
[72] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiao-
gang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stack-
gan: Text to photo-realistic image synthesis with stacked
generative adversarial networks. In IEEE International Con-
ference on Computer Vision, pages 5907–5915, 2017. 4
[73] Pan Zhang, Bo Zhang, Dong Chen, Lu Yuan, and Fang Wen.
Cross-domain correspondence learning for exemplar-based
image translation. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
5143–5153, 2020. 4
[74] Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua
Bengio, and Yangqiu Song. Metagan: An adversarial ap-
proach to few-shot learning. In Advances in Neural Informa-
tion Processing Systems, pages 2365–2374, 2018. 2
[75] Yongbing Zhang, Siyuan Liu, Chao Dong, Xinfeng Zhang,
and Yuan Yuan. Multiple cycle-in-cycle generative adversar-
ial networks for unsupervised image super-resolution. IEEE
transactions on Image Processing, 29:1101–1112, 2019. 4
[76] Xingran Zhou, Bo Zhang, Ting Zhang, Pan Zhang, Jianmin
Bao, Dong Chen, Zhongfei Zhang, and Fang Wen. Cocosnet
v2: Full-resolution correspondence learning for image trans-
lation. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 11465–11475,
2021. 4
[77] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A
Efros. Unpaired image-to-image translation using cycle-
consistent adversarial networks. In International Conference
on Computer Vision, pages 2223–2232, 2017. 2,4
[78] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Dar-
rell, Alexei A Efros, Oliver Wang, and Eli Shechtman. To-
ward multimodal image-to-image translation. In Neural In-
formation Processing Systems, pages 465–476, 2017. 4
[79] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka.
Sean: Image synthesis with semantic region-adaptive nor-
malization. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), June
2020. 4