
High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis

Authors:
Yipin Yang, Zhiguo Huang
1. Introduction
In the previous post, we went through an introduction to image inpainting and the first GAN-based inpainting algorithm, Context Encoders. If you have not read the previous post, I highly recommend having a quick look at it first! This time, we will dive into another inpainting method that can be regarded as an improved version of Context Encoders. Let's start! First, let me briefly recall what we learnt in the previous post. A deep semantic understanding of an image, i.e. its context, is important for inpainting, and a (channel-wise) fully-connected layer is one way to capture that context. For image inpainting, the visual quality of the filled images matters much more than pixel-wise reconstruction accuracy. More specifically, since there is no single correct answer for the generated pixels (we do not have the ground truth in real-world situations), we simply want realistic-looking filled images. Existing inpainting algorithms can only handle low-resolution images because of memory limitations and the difficulty of training on high-resolution images. Although the state-of-the-art inpainting method, Context Encoders, can regress (predict) the missing parts with a certain degree of semantic correctness, there is still room for improvement in the textures and details of the predicted pixels, as shown in Figure 1.
Context Encoder is not perfect: i) the texture details of the generated pixels can be further improved, and ii) it cannot handle high-resolution images. At the same time, Neural Style Transfer is a hot topic in which we transfer the style of one image (the style image) to another image while keeping the latter's content (the content image), as shown in Figure 2 below. Note that textures and colours can be regarded as a kind of style. The authors of this paper employ a style transfer algorithm to enhance the texture details of the generated pixels.
The authors employ a Context Encoder to predict the missing parts and obtain an initial set of predicted pixels. Then, they apply a style transfer algorithm to the predicted pixels and the valid pixels. The main idea is to transfer the style of the most similar valid pixels to the predicted pixels to enhance the texture details. In their formulation, they assume that the test images are always 512x512 with a 256x256 center hole. They use a three-level pyramid to handle this high-resolution inpainting problem. The input is first resized to 128x128 with a 64x64 center hole for a low-resolution reconstruction. After that, the filled image is up-sampled to 256x256 with a 128x128 coarsely filled hole for the second reconstruction. Finally, the filled image is again up-sampled to 512x512 with a 256x256 filled hole for the last reconstruction (or one may call it refinement).
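To make the scale schedule concrete, here is a minimal sketch of the coarse-to-fine loop in Python/PyTorch. The function names (`content_network`, `refine_with_texture_loss`) are placeholders for the paper's two stages, not the authors' actual code.

```python
import torch.nn.functional as F

# Three-level pyramid assumed in the paper: 128 -> 256 -> 512,
# with a centered square hole of half the image size at every level.
SCALES = [128, 256, 512]

def inpaint_multiscale(image_512, content_network, refine_with_texture_loss):
    """image_512: (1, 3, 512, 512) tensor whose 256x256 center region is missing."""
    filled = None
    for size in SCALES:
        x = F.interpolate(image_512, size=(size, size), mode="bilinear",
                          align_corners=False)
        if filled is None:
            # Lowest scale: the content network predicts the filled 128x128 image.
            filled = content_network(x)
        else:
            # Higher scales: paste the up-sampled previous result into the hole,
            # then refine it with the texture (style) optimization.
            hole = size // 2                      # 128, then 256
            top = (size - hole) // 2
            coarse = F.interpolate(filled, size=(size, size), mode="bilinear",
                                   align_corners=False)
            x[:, :, top:top+hole, top:top+hole] = \
                coarse[:, :, top:top+hole, top:top+hole]
            filled = refine_with_texture_loss(x, hole_box=(top, top, hole, hole))
    return filled
```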
The contributions can be summarized as follows: i) a framework that combines techniques from Context Encoders and Neural Style Transfer; ii) a multi-scale scheme to handle high-resolution images; iii) experimental evidence that style transfer techniques can be used to enhance the texture details of the generated pixels. Figure 3 shows the proposed framework, and it is actually not difficult to understand. The Content Network is a slightly modified Context Encoder, while the Texture Network is a VGG-19 network pre-trained on ImageNet. To me, this is an early version of a coarse-to-fine network that can operate at multiple scales. The main insight of this paper is how the model is optimized (i.e. the design of the loss function).

Content Network. As mentioned, the content network is the Context Encoder. The authors first train the content network independently; the output of the trained content network is then used to optimize the entire proposed framework. Referring to the structure of the content network in Figure 3, there are two differences from the original Context Encoder: i) the channel-wise fully-connected layer in the middle is replaced by a standard fully-connected layer, and ii) all ReLU and Leaky ReLU activation layers are replaced by ELU layers. The authors claim that ELU handles large negative neural responses better than ReLU and Leaky ReLU; note that ReLU only allows positive responses to pass through. They train the Content Network in the same way as the original Context Encoder, i.e. with a combination of L2 loss and adversarial loss. You may refer to my previous post for details.
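As a rough illustration of the two modifications described above, here is a minimal encoder-decoder sketch with ELU activations and a standard fully-connected bottleneck. The layer sizes are illustrative assumptions, not the exact architecture from the paper.

```python
import torch.nn as nn

class ContentNetworkSketch(nn.Module):
    """Toy Context-Encoder-style network for 128x128 inputs with a 64x64 hole.
    ELU replaces ReLU/LeakyReLU; a plain Linear layer replaces the
    channel-wise fully-connected bottleneck (illustrative sizes)."""
    def __init__(self, bottleneck=4000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ELU(),     # 128 -> 64
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ELU(),   # 64 -> 32
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ELU(),  # 32 -> 16
            nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.ELU(),  # 16 -> 8
        )
        self.fc = nn.Sequential(                 # standard FC bottleneck
            nn.Flatten(),
            nn.Linear(512 * 8 * 8, bottleneck), nn.ELU(),
            nn.Linear(bottleneck, 512 * 8 * 8), nn.ELU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ELU(),  # 8 -> 16
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ELU(),  # 16 -> 32
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1), nn.Tanh(),   # 32 -> 64
        )

    def forward(self, x):
        z = self.fc(self.encoder(x)).view(-1, 512, 8, 8)
        return self.decoder(z)  # predicted 64x64 hole content
```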
I will try to explain the texture network in more detail here, as it relates to neural style transfer; interested readers may look up the topic for further details. The objective of the texture network is to ensure that the fine details of the generated pixels are similar to those of the valid pixels (i.e. we want a consistent style/texture across the image). Simply speaking, the authors make use of the findings in [2]: to some extent, the feature maps at different layers inside a network represent the image style. In other words, given a trained network, if two images have similar feature maps inside the network, we may claim that the two images have similar styles.

Figure 1.

Figure 2.

To be honest, this is an over-simplified claim. In [2], the authors employ a VGG network pre-trained on ImageNet classification as a feature extractor and compute a Gram matrix (also called an autocorrelation matrix) of the feature maps at each VGG layer. If two images have similar Gram matrices, they have similar styles, such as textures and colours. Back to the inpainting paper: the authors also use the pre-trained VGG network as their Texture Network, as shown in Figure 3. They enforce that the feature-map responses inside the center hole region are similar to those outside the hole region at several layers of the VGG; specifically, they use the relu3_1 and relu4_1 layers for this computation.
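To make the Gram-matrix idea concrete, here is a small sketch (assuming a recent torchvision) that extracts VGG-19 features at the layers mentioned above and compares their Gram matrices. It is a generic style-similarity check, not the paper's exact texture loss, which matches local feature patches instead (see Section 3).

```python
import torch
from torchvision import models

# Indices of relu3_1 and relu4_1 in torchvision's VGG-19 `features` module.
LAYERS = {11: "relu3_1", 20: "relu4_1"}

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

def gram_matrices(x):
    """Return {layer_name: Gram matrix} for an image batch x of shape (N, 3, H, W)."""
    grams = {}
    feat = x
    for idx, layer in enumerate(vgg):
        feat = layer(feat)
        if idx in LAYERS:
            n, c, h, w = feat.shape
            f = feat.view(n, c, h * w)
            grams[LAYERS[idx]] = f @ f.transpose(1, 2) / (c * h * w)
    return grams

def style_distance(img_a, img_b):
    """Mean squared difference between the Gram matrices of two images."""
    ga, gb = gram_matrices(img_a), gram_matrices(img_b)
    return sum(torch.mean((ga[k] - gb[k]) ** 2) for k in ga)
```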
Figure 3.
2. Related Work
Controllable image synthesis has been a long-term objective in computer vision and computer graphics. In earlier works [24,46], researchers used many aligned image pairs (i.e., visual domain guidance) from a source domain and a target domain to train a translation model that maps source images to the desired target images.

Collecting paired data is usually costly in practical applications, and in many cases it is even impossible to acquire plausible paired data, e.g., when translating real images to cartoon images. Thus, unsupervised methods [77,27,62] have attracted a lot of attention, as they can be trained in an unpaired setting. To achieve reliable generation performance, some labeling or expert guidance is still expected in certain applications, e.g., old movie restoration [43] or genomics [53]. Therefore, semi-supervised learning methods [28,49,5] have been introduced into image synthesis to further improve the quality of generated images. Semi-supervised approaches leverage source images together with only a few source-target aligned image pairs for training, yet achieve more compelling generation results than the unsupervised setting. On the other hand, humans can learn from only one or a few exemplars and still achieve meaningful results. As described in the meta-learning and few-shot learning literature [74,54], humans can effectively use prior experience and knowledge when learning new tasks, while neural networks usually overfit to limited data and fail to generalize. Thus, few-shot or one-shot learning models have also been explored in many works [38,34,35,36]. Although the dataset settings differ, most of these image generation techniques tend to learn a one-to-one mapping and only generate single-modal outputs. In practice, however, the translation between domains is inherently ambiguous, as one input image may correspond to multiple possible outputs. Multimodal generation maps the input image from one domain to a distribution of potential outputs in the target domain while remaining faithful to the input. These diverse outputs represent different samples but preserve characteristics similar to the source image.
Many computer vision problems can be seen as an image-to-image translation problem: mapping an image from one domain to a corresponding image in a different domain. As an illustration, super-resolution can be viewed as mapping a low-resolution image to a corresponding high-resolution one, and image colorization is the problem of mapping a gray-scale image to a corresponding colour one. The problem can be studied with supervised or unsupervised learning methods. In the supervised approaches, pairs of images across the domains are available [24]. In the unsupervised models, only two separate sets of images are available, one composed of images from one domain and the other composed of images from a different domain; there are no paired samples showing how an image could be translated to a corresponding image in the other domain. Because of the lack of corresponding images, the unsupervised image-to-image translation problem is considered more difficult, but it is also more practical because training data are easier to collect.

When viewing the image translation problem from a likelihood perspective, the main challenge is to learn a joint distribution of images in different domains. In the unsupervised setting, the two sets consist of images drawn from the two marginal distributions of the different domains, and the task is to infer the joint distribution from these images. However, deriving the joint distribution from the marginal distributions is an extremely ill-posed problem. In this section, we discuss image-to-image translation methods. Image-to-image translation is similar to style transfer, which takes a style image and a content image as input and outputs an image that has the content of the content image and the style of the style image. Image-to-image translation, however, not only transfers image styles but can also manipulate features of objects. This section lists several models proposed for image-to-image translation, from supervised methods to unsupervised ones.
2.1. Supervised Translation
Isola et al. [24] proposed to combine the adversarial loss with an L1 regularization loss, so that the generator is trained not only to fool the discriminator but also to produce images that contain realistic objects and stay close to the ground-truth images; L1 was chosen over L2 because it produces less blurry images. The conditional GAN loss is formulated as:

$$\ell_{cGAN}(G, D) = \mathbb{E}_{(x,y)\sim p_{data}(x,y)}[\log D(x, y)] + \mathbb{E}_{x\sim p_{data}(x),\, z\sim p_z(z)}[\log(1 - D(x, G(x, z)))], \tag{1}$$

in which $(x, y)\sim p_{data}(x, y)$ denotes images that have different styles but belong to the same scene, similar to the standard GAN [18], and $z\sim p_z(z)$ represents random noise. The L1 loss encouraging self-similarity is defined as:

$$\ell_{L1}(G) = \mathbb{E}_{(x,y)\sim p_{data}(x,y),\, z\sim p_z(z)}\big[\, \|y - G(x, z)\|_1 \,\big], \tag{2}$$

and the overall objective is specified by:

$$G^*, D^* = \arg\min_G \max_D \; \ell_{cGAN}(G, D) + \lambda\, \ell_{L1}(G), \tag{3}$$

in which the hyperparameter $\lambda$ balances the two loss terms. Moreover, in [24] the authors pointed out that the noise $z$ does not have a noticeable influence on the result; therefore, they inject noise in the form of dropout during training and testing instead of sampling it from a random distribution. In this model, the generator $G$ is based on a U-Net structure with skip connections joining each encoder layer to the decoder layer at the same resolution, so that low-level information such as object edges can be shared. In [24] the authors also proposed PatchGAN: rather than classifying the whole image, the discriminator classifies each $N \times N$ patch of the image and averages the patch scores to obtain the final score. The experiments show that restricting the discriminator to local patches is sufficient for capturing high-frequency details.
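As a rough sketch of the objective in Eqs. (1)-(3), here is how the pix2pix-style generator and PatchGAN discriminator losses could be computed in PyTorch. `G`, `D`, and `lam` are placeholders, the discriminator is assumed to output a patch map of real/fake logits, and the generator term uses the standard non-saturating BCE variant rather than the literal log(1 - D) form.

```python
import torch
import torch.nn.functional as F

def pix2pix_losses(G, D, x, y, lam=100.0):
    """x: input image, y: ground-truth target. D(x, y) returns an NxN patch
    map of logits (PatchGAN). Returns (generator loss, discriminator loss)."""
    fake = G(x)

    # Discriminator: real pairs -> 1, fake pairs -> 0 (Eq. 1).
    d_real = D(x, y)
    d_fake = D(x, fake.detach())
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator: fool the discriminator and stay close to y in L1 (Eqs. 2-3).
    d_fake_for_g = D(x, fake)
    loss_g = F.binary_cross_entropy_with_logits(d_fake_for_g,
                                                torch.ones_like(d_fake_for_g)) + \
             lam * F.l1_loss(fake, y)
    return loss_g, loss_d
```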
Yoo et al. proposed an algorithm for supervised image-to-image translation with a secondary discriminator $D_{pair}$ that evaluates whether a pair of images from the two domains is associated with each other. The loss of $D_{pair}$ is calculated as follows:

$$\ell_{pair} = -t \log[D_{pair}(X_s, X)] + (t - 1)\log[1 - D_{pair}(X_s, X)], \quad \text{s.t. } t = \begin{cases} 1 & \text{if } X = X_t \\ 0 & \text{if } X = \hat{X}_t \\ 0 & \text{if } X = \bar{X}_t \end{cases} \tag{4}$$

where $X_s$ is the input image from the source domain, $X_t$ is its ground-truth image in the target domain, and $\bar{X}_t$ is an irrelevant image in the target domain. The generator of the proposed model transfers $X_s$ into a single image $\hat{X}_t$ in the associated domain.
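Eq. (4) is simply a binary cross-entropy on whether the pair is associated; a minimal sketch, assuming `D_pair` returns logits:

```python
import torch
import torch.nn.functional as F

def pair_discriminator_loss(D_pair, X_s, X, t):
    """Eq. (4): t = 1 for the ground-truth pair (X_s, X_t); t = 0 for a
    generated pair (X_s, X_hat_t) or an irrelevant pair (X_s, X_bar_t)."""
    logits = D_pair(X_s, X)
    target = torch.full_like(logits, float(t))
    return F.binary_cross_entropy_with_logits(logits, target)
```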
Other works extend adversarial training in related directions. One work proposed efficient pyramid adversarial networks for generating synthetic labels from target domains for road segmentation in remote sensing images. Zareapoor et al. proposed semi-supervised adversarial networks for dataset balancing in mechanical devices, and multi-instance learning has been integrated into adversarial networks for human pose estimation, with reported results showing high accuracy and fast performance. To handle imbalanced class problems, Shamsolmoali et al. proposed a capsule adversarial network based on minority-class augmentation. Some authors proposed a general learning framework that assigns the generated samples to a distribution over a set of labels instead of a single label, and demonstrated its effectiveness through a set of experiments. Zhang et al. proposed the DRCW-ASEG method to generate synthetic examples for multi-class imbalanced problems and showed that their strategy improves classification accuracy.
There is no noise input in the generator of pix2pix. A key point of pix2pix is that its generator learns a mapping from an observed image y to an output image G(y), for example from a grayscale image to a colour image. As a follow-up to pix2pix, pix2pixHD [60] used cGANs and a feature matching loss for high-resolution image synthesis and semantic manipulation; with multiple discriminators, the learning problem becomes a multi-task learning problem. Chrysos et al. [8] proposed robust cGANs, and Thekumparampil et al. [59] discussed the robustness of conditional GANs to noisy labels. Conditional CycleGAN [39] uses cGANs with cycle consistency. Mode seeking GANs (MSGANs) [40] propose a simple yet effective regularization term to address the mode collapse issue of cGANs. GANs are also utilized for image composition [33,3,70,64]. Based on cGANs, we can generate samples conditioned on class labels [45,44] or text [50,22,72]. In [72,71], text-to-photo-realistic image synthesis is conducted with stacked generative adversarial networks (SGAN) [23]. cGANs have also been used for convolutional face generation [15], face aging [1], multi-modal image translation [58,67], panoramic image generation [14,55], exemplar-based image synthesis [76,73,69], synthesizing outdoor images with specific scenery attributes [25], natural image description [9], and scene manipulation [61]. Most cGAN-based methods [11,48,52,13,56] feed the conditional information y into the discriminator by simply concatenating (an embedding of) y to the input or to the feature vector at some middle layer, while cGANs with a projection discriminator [41] adopt an inner product between the condition vector y and the feature vector. Two-domain I2I can solve many problems in computer vision, computer graphics, and image processing, such as image style transfer [77,31], bounding box and keypoint prediction [51,68], which can be used in photo editor apps to improve user experience, semantic segmentation [47,79], which benefits autonomous driving, image colorization [57,32], and domain adaptation [42,6,37,65,66]. If low-resolution images are taken as the source domain and high-resolution images as the target domain, we naturally obtain image super-resolution [63,75].
2.1.1 Multimodal Outputs
Multimodal image translation maps the input image from one domain to a distribution of potential outputs in the target domain while remaining faithful to the input.

Figure 4.

Figure 5.

In fact, multimodal translation benefits from solutions to the mode collapse problem [17,2,19], in which the generator tends to map different input samples to the same output. Thus, many multimodal image translation methods [78,4] focus on solving the mode collapse problem so that diverse outputs emerge naturally. BicycleGAN [78] was the first supervised multimodal image translation work; it combines cVAE-GAN [21,29,30] and cLR-GAN [7,12,13] to systematically study a family of solutions to the mode collapse problem and to generate diverse and realistic outputs. Similarly, Bansal et al. [4] proposed PixelNN to achieve multimodal and controllable results in image translation. They proposed a nearest-neighbor (NN) approach combined with pixelwise matching to translate the incomplete, conditioned input into multiple outputs and to allow a user to control the translation through on-the-fly editing of the exemplar set.
Another way to produce diverse outputs is to use disentangled representations [7,20,26,10], which aim to break down, or disentangle, each factor of variation into a narrowly defined variable and encode it as a separate dimension. When combining this with image translation, researchers disentangle the representations of the source and target domains into two parts: domain-invariant content features, which are preserved during the translation, and domain-specific style features, which are changed during the translation. In other words, image translation transfers images from the source domain to the target domain by preserving content while replacing style. Therefore, one can achieve multimodal outputs by randomly choosing the style features, which are often regularized to be drawn from a prior Gaussian distribution N(0, 1) (see the short sketch below).

Figure 6.

Figure 7.

Figure 8.

Gonzalez-Garcia et al. [16] disentangled the representations of the two domains into three parts: a shared part containing information common to both domains, and two exclusive parts that only represent the factors of variation particular to each domain. In addition to bi-directional multimodal translation and retrieval of similar images across domains, they can also perform domain-specific transfer and interpolation across the two domains.
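To illustrate the content/style decomposition described above, here is a minimal sketch of how a disentangled translator could produce multiple outputs for one input by re-sampling the style code from the Gaussian prior; the encoder and decoder modules are placeholders, not any specific published architecture.

```python
import torch

def sample_multimodal_outputs(content_encoder, decoder, x, style_dim=8, k=5):
    """Produce k diverse translations of input image x by pairing its
    domain-invariant content code with k style codes drawn from N(0, 1)."""
    c = content_encoder(x)                     # domain-invariant content
    outputs = []
    for _ in range(k):
        s = torch.randn(x.size(0), style_dim)  # domain-specific style ~ N(0, 1)
        outputs.append(decoder(c, s))          # same content, different style
    return outputs
```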
3. Methods & Results
The total loss function consists of three terms: a content loss (L2 loss), a texture loss, and a TV loss (total variation loss). Together they form the joint loss that the authors minimize (a reconstruction of the objective is sketched below). Note that i indexes the scale and, as mentioned, they employ 3 scales in this work; x is the ground-truth image (i.e. an intact image without missing parts); h(x, R) returns the colour content of x within the hole region R; φ_t(x) returns the feature maps computed by network t given input x; and R_φ denotes the corresponding hole region in the feature maps. The last term is the total variation loss, which is commonly used in image processing to encourage smoothness. α and β are weights balancing the loss terms.

Figure 9.

Figure 10.

The content loss term is very easy to understand: it is simply the L2 loss that ensures pixel-wise reconstruction accuracy. The texture loss term looks a bit more complicated but is also easy to understand. First, the images are fed to the pre-trained VGG-19 network to obtain feature maps at the relu3_1 and relu4_1 layers (middle layers). Then, the feature maps are split into two groups: one for the hole region (R_φ) and one for the outside (i.e. the valid region). Each local feature patch P inside the hole region has size s x s x c (s is the spatial size and c is the number of feature maps). For each local patch, the most similar patch outside the hole region is found, and the average L2 distance between each local patch and its nearest neighbour is computed. In their Eq. 3, |R_φ| is the total number of patches sampled in the region R_φ, P_i is the local patch centered at location i, and nn(i) is the index of its nearest neighbour, found with their Eq. 4. Finally, the TV loss is computed on the output image.
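For reference, here is a LaTeX reconstruction of the objective following the notation of the prose above; the exact formulation should be checked against the original paper by Yang et al., and $\Upsilon(x)$ is used here as a symbol for the total variation term.

```latex
% Joint objective at each scale i (content + texture + total variation):
x_i^{*} \;=\; \arg\min_{x}\;
    \underbrace{\big\| h(x, R) - h(x_i, R) \big\|_2^2}_{\text{content loss}}
  \;+\; \alpha\, E_t\!\left(\phi_t(x), R_\phi\right)
  \;+\; \beta\, \Upsilon(x)

% Texture loss: average distance between each hole patch and its
% nearest-neighbour patch taken from the valid region:
E_t\!\left(\phi_t(x), R_\phi\right) \;=\;
    \frac{1}{|R_\phi|} \sum_{i \in R_\phi}
    \big\| P_i\!\left(\phi_t(x)\right) - P_{nn(i)}\!\left(\phi_t(x)\right) \big\|_2^2,
\qquad
nn(i) \;=\; \arg\min_{j \notin R_\phi} \big\| P_i - P_j \big\|_2^2
```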
Experimental Results. As with the Context Encoder, two datasets are used for evaluation: Paris StreetView and ImageNet. Paris StreetView consists of 14,900 training images and 100 test images; ImageNet contains 1.26M training images, with 200 test images randomly selected from the validation set. Table 1 shows the quantitative results of the different methods; higher PSNR means better performance, and the proposed method clearly offers the highest PSNR. The authors also note that quantitative metrics (e.g. PSNR, L1 error, etc.) may not be the most suitable for the image inpainting task, since the objective is to generate realistic-looking filled images. Figure 4 is the visual comparison with several methods. From the zoomed-in versions of (d) and (e), we can see that the proposed method generates sharper texture details than the state-of-the-art method, Context Encoder. The authors also provide an ablation study of the loss terms. Figure 5 shows the result without the content loss term; clearly, without it, the structure of the inpainting result is completely incorrect. Apart from showing that the content loss term is necessary, the authors also show the importance of the texture loss term. Figure 6 shows the effect of different texture weights α in their Eq. 1: a larger texture weight gives sharper results but may affect the overall image structure, as shown in Figure 6(d). As mentioned, the authors train the Content Network in the same way as the Context Encoder, and they show the effect of using only the L2 loss versus using both L2 and adversarial losses. From Figure 7, we can clearly see that the quality of the content network's output matters for the final result, and that the content network is better trained with both L2 and adversarial losses. As mentioned before, the authors propose a multi-scale scheme to handle high-resolution images; Figure 8 shows the high-resolution image inpainting results. The Context Encoder only works on 128x128 inputs, so its results are up-sampled to 512x512 using bilinear interpolation. For the proposed method, the input goes through the network three times, at three scales, to complete the reconstruction. The proposed method clearly offers the best visual quality among the compared methods. However, because of the multi-scale approach to high-resolution inpainting, the proposed method takes roughly 1 minute to fill a 256x256 hole in a 512x512 image on a Titan X GPU, which is a major drawback (i.e. low efficiency).
The authors further extend the proposed method to handle irregularly shaped holes. Simply speaking, they first convert the irregular hole into its bounding rectangle, then crop and pad the image to position the hole at the center. In this way, they can handle images with irregular holes; some examples are shown below, and a small sketch of this preprocessing follows this paragraph.
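As a rough illustration of that preprocessing, here is a small NumPy sketch that turns an irregular binary mask into a centered rectangular hole by cropping around its bounding box; it only mirrors the idea described above, not the authors' actual implementation.

```python
import numpy as np

def center_irregular_hole(image, mask):
    """image: (H, W, 3) array; mask: (H, W) boolean array, True inside the hole.
    Returns a crop of the image in which the hole's bounding box is centered."""
    ys, xs = np.where(mask)
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1
    h, w = image.shape[:2]

    # Amount of context kept around the hole on each side (assumption:
    # keep as much as fits symmetrically inside the original image).
    ctx_y = min(top, h - bottom)
    ctx_x = min(left, w - right)

    crop = image[top - ctx_y: bottom + ctx_y, left - ctx_x: right + ctx_x]
    return crop  # the bounding-box hole now sits at the crop's center
```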
Overall, this is a clear improvement over the Context Encoder. The authors adopt techniques from Neural Style Transfer to further enhance the texture details of the pixels generated by the Context Encoder. As a result, we are one step closer to realistic-looking filled images. However, the authors also point out some directions for future improvement: i) it is still difficult to fill the missing parts when the scene is complicated, as shown in Figure 10, and ii) speed is a problem, as the method cannot achieve real-time performance.
Again, I would like to highlight some points here that will be useful for future posts. This work is an early version of the coarse-to-fine network (also called a two-stage network): we first reconstruct the missing parts with a certain level of pixel-wise reconstruction accuracy (i.e. ensuring the structure is correct), then refine the texture details of the reconstructed parts so that the filled images have good visual quality. The concept of a texture loss plays an important role in later image inpainting papers; by employing this loss, we can obtain sharper generated images. Later works usually achieve sharp generated images by using a Perceptual Loss and/or Style Loss. We will cover them very soon!
References
[1] Grigory Antipov, Moez Baccouche, and Jean-Luc Dugelay.
Face aging with conditional generative adversarial networks.
In 2017 IEEE International Conference on Image Processing
(ICIP), pages 2089–2093. IEEE, 2017. 4
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein generative adversarial networks. In Interna-
tional Conference on Machine Learning, pages 214–223,
2017. 4
[3] Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, and
Trevor Darrell. Compositional gan: Learning image-
conditional binary composition. International Journal of
Computer Vision, 128(10):2570–2585, 2020. 4
[4] Aayush Bansal, Yaser Sheikh, and Deva Ramanan. Pixelnn:
Example-based image synthesis. In International Confer-
ence on Learning Representations, 2018. 4
[5] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas
Papernot, Avital Oliver, and Colin A Raffel. Mixmatch:
A holistic approach to semi-supervised learning. In Ad-
vances in Neural Information Processing Systems, pages
5049–5059, 2019. 2
[6] Jinming Cao, Oren Katzir, Peng Jiang, Dani Lischinski,
Danny Cohen-Or, Changhe Tu, and Yangyan Li. Dida: Dis-
entangled synthesis for domain adaptation, 2018. 4
[7] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya
Sutskever, and Pieter Abbeel. Infogan: Interpretable rep-
resentation learning by information maximizing generative
adversarial nets. In Neural Information Processing Systems,
pages 2172–2180, 2016. 4
[8] Grigorios G Chrysos, Jean Kossaifi, and Stefanos Zafeiriou.
Robust conditional generative adversarial networks. arXiv
preprint arXiv:1805.08657, 2018. 4
[9] Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. To-
wards diverse and natural image descriptions via a condi-
tional gan. In IEEE International Conference on Computer
Vision, pages 2970–2979, 2017. 4
[10] Emily L Denton and vighnesh Birodkar. Unsupervised learn-
ing of disentangled representations from video. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett, editors, Advances in Neural In-
formation Processing Systems 30, pages 4414–4423. Curran
Associates, Inc., 2017. 4
[11] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep
generative image models using laplacian pyramid of adver-
sarial networks. In Neural Information Processing Systems,
pages 1486–1494, 2015. 4
[12] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Ad-
versarial feature learning. arXiv preprint arXiv:1605.09782,
2016. 4
[13] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier
Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron
Courville. Adversarially learned inference. arXiv preprint
arXiv:1606.00704, 2016. 4
[14] Marc-André Gardner, Kalyan Sunkavalli, Ersin Yumer, Xiaohui Shen, Emiliano Gambaretto, Christian Gagné, and Jean-François Lalonde. Learning to predict indoor illumina-
tion from a single image. arXiv preprint arXiv:1704.00090,
2017. 4
[15] Jon Gauthier. Conditional generative adversarial nets for
convolutional face generation. Class Project for Stanford
CS231N: Convolutional Neural Networks for Visual Recog-
nition, Winter semester, 2014(5):2, 2014. 4
[16] Abel Gonzalez-Garcia, Joost Van De Weijer, and Yoshua
Bengio. Image-to-image translation for cross-domain dis-
entanglement. In Advances in neural information processing
systems, pages 1287–1298, 2018. 5
[17] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial
networks, 2017. 4
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Neural In-
formation Processing Systems, pages 2672–2680, 2014. 3
[19] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent
Dumoulin, and Aaron C Courville. Improved training of
wasserstein gans. In Neural Information Processing Systems,
pages 5767–5777, 2017. 4
[20] I. Higgins, Loïc Matthey, A. Pal, Christopher P. Burgess,
Xavier Glorot, M. Botvinick, S. Mohamed, and Alexander
Lerchner. beta-vae: Learning basic visual concepts with a
constrained variational framework. In ICLR, 2017. 4
[21] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing
the dimensionality of data with neural networks. science,
313(5786):504–507, 2006. 4
[22] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and
Honglak Lee. Inferring semantic layout for hierarchical text-
to-image synthesis. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 7986–7994, 2018. 4
[23] Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and
Serge Belongie. Stacked generative adversarial networks. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 5077–5086, 2017. 4
[24] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A
Efros. Image-to-image translation with conditional adver-
sarial networks. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 1125–1134, 2017. 2,3
[25] Levent Karacan, Zeynep Akata, Aykut Erdem, and Erkut
Erdem. Learning to generate images of outdoor scenes
from attributes and semantic layouts. arXiv preprint
arXiv:1612.00215, 2016. 4
[26] Hyunjik Kim and Andriy Mnih. Disentangling by factoris-
ing. In International Conference on Machine Learning,
pages 2649–2658. PMLR, 2018. 4
[27] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee,
and Jiwon Kim. Learning to discover cross-domain relations
with generative adversarial networks. In International Con-
ference on Machine Learning, pages 1857–1865, 2017. 2
[28] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende,
and Max Welling. Semi-supervised learning with deep gen-
erative models. In Advances in neural information process-
ing systems, pages 3581–3589, 2014. 2
[29] Diederik P Kingma and Max Welling. Auto-encoding varia-
tional bayes. stat, 1050:1, 2014. 4
[30] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo
Larochelle, and Ole Winther. Autoencoding beyond pixels
using a learned similarity metric. In International conference
on machine learning, pages 1558–1566. PMLR, 2016. 4
[31] Hsin-Ying Lee, Hung-Yu Tseng, Qi Mao, Jia-Bin Huang,
Yu-Ding Lu, Maneesh Singh, and Ming-Hsuan Yang.
Drit++: Diverse image-to-image translation via disentangled
representations. International Journal of Computer Vision,
pages 1–16, 2020. 4
[32] Junsoo Lee, Eungyeup Kim, Yunsung Lee, Dongjun Kim,
Jaehyuk Chang, and Jaegul Choo. Reference-based sketch
image colorization using augmented-self reference and
dense semantic correspondence. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2020. 4
[33] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman,
and Simon Lucey. St-gan: Spatial transformer generative
adversarial networks for image compositing. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 9455–9464, 2018. 4
[34] Jianxin Lin, Yingxue Pang, Yingce Xia, Zhibo Chen, and
Jiebo Luo. Tuigan: Learning versatile image-to-image trans-
lation with two unpaired images. In European Conference
on Computer Vision, pages 18–35. Springer, 2020. 2
[35] Jianxin Lin, Yijun Wang, Tianyu He, and Zhibo Chen.
Learning to transfer: Unsupervised meta domain translation.
arXiv preprint arXiv:1906.00181, 2019. 2
[36] Jianxin Lin, Yingce Xia, Sen Liu, Tao Qin, and Zhibo
Chen. Zstgan: An adversarial approach for unsuper-
vised zero-shot image-to-image translation. arXiv preprint
arXiv:1906.00184, 2019. 2
[37] Alexander H Liu, Yen-Cheng Liu, Yu-Ying Yeh, and Yu-
Chiang Frank Wang. A unified feature disentangler for multi-
domain image translation and manipulation. In Advances
in neural information processing systems, pages 2590–2599,
2018. 4
[38] Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo
Aila, Jaakko Lehtinen, and Jan Kautz. Few-shot unsuper-
vised image-to-image translation. In Proceedings of the
IEEE/CVF International Conference on Computer Vision
(ICCV), October 2019. 2
[39] Yongyi Lu, Yu-Wing Tai, and Chi-Keung Tang. Conditional
cyclegan for attribute guided face image generation. arXiv
preprint arXiv:1705.09966, 2017. 4
[40] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and
Ming-Hsuan Yang. Mode seeking generative adversarial
networks for diverse image synthesis. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 1429–
1437, 2019. 4
[41] Takeru Miyato and Masanori Koyama. cgans with projection
discriminator. arXiv preprint arXiv:1802.05637, 2018. 4
[42] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ra-
mamoorthi, and Kyungnam Kim. Image to image translation
for domain adaptation. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
June 2018. 4
[43] Aamir Mustafa and Rafał K. Mantiuk. Transformation
consistency regularization: a semi-supervised paradigm
for image-to-image translation. In Andrea Vedaldi, Horst
Bischof, Thomas Brox, and Jan-Michael Frahm, editors,
Computer Vision – ECCV 2020, pages 599–615, Cham,
2020. Springer International Publishing. 2
[44] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovit-
skiy, and Jason Yosinski. Plug & play generative networks:
Conditional iterative generation of images in latent space. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 4467–4477, 2017. 4
[45] Augustus Odena, Christopher Olah, and Jonathon Shlens.
Conditional image synthesis with auxiliary classifier gans.
In International Conference on Machine Learning, pages
2642–2651, 2017. 4
[46] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan
Zhu. Semantic image synthesis with spatially-adaptive nor-
malization. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 2337–2346, 2019. 2
[47] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan
Zhu. Semantic image synthesis with spatially-adaptive nor-
malization. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), June
2019. 4
[48] Guim Perarnau, Joost Van De Weijer, Bogdan Raducanu, and
Jose M. Álvarez. Invertible conditional gans for image edit-
ing. arXiv preprint arXiv:1611.06355, 2016. 4
[49] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri
Valpola, and Tapani Raiko. Semi-supervised learning with
ladder networks. In Advances in neural information process-
ing systems, pages 3546–3554, 2015. 2
[50] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Lo-
geswaran, Bernt Schiele, and Honglak Lee. Generative ad-
versarial text to image synthesis. In International Conference
on Machine Learning, pages 1–10, 2016. 4
[51] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka,
Bernt Schiele, and Honglak Lee. Learning what and where
to draw. In Neural Information Processing Systems, pages
217–225, 2016. 4
[52] Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Tem-
poral generative adversarial nets with singular value clip-
ping. In IEEE International Conference on Computer Vision,
pages 2830–2839, 2017. 4
[53] Mingguang Shi and Bing Zhang. Semi-supervised learning
improves gene expression-based prediction of cancer recur-
rence. Bioinformatics, 27(21):3017–3023, 2011. 2
[54] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypi-
cal networks for few-shot learning. In Advances in neural
information processing systems, pages 4077–4087, 2017. 2
[55] Shuran Song and Thomas Funkhouser. Neural illumination:
Lighting prediction for indoor environments. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 6918–6926, 2019. 4
[56] Kumar Sricharan, Raja Bala, Matthew Shreve, Hui Ding,
Kumar Saketh, and Jin Sun. Semi-supervised conditional
gans. arXiv preprint arXiv:1708.05789, 2017. 4
[57] Patricia L Suárez, Angel D Sappa, and Boris X Vintimilla.
Infrared image colorization based on a triplet dcgan archi-
tecture. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition Workshops, pages 18–
23, 2017. 4
[58] Hao Tang, Dan Xu, Nicu Sebe, Yanzhi Wang, Jason J Corso,
and Yan Yan. Multi-channel attention selection gan with cas-
caded semantic guidance for cross-view image translation. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 2417–2426, 2019. 4
[59] Kiran K Thekumparampil, Ashish Khetan, Zinan Lin, and
Sewoong Oh. Robustness of conditional gans to noisy labels.
In Neural Information Processing Systems, pages 10271–
10282, 2018. 4
[60] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao,
Jan Kautz, and Bryan Catanzaro. High-resolution image syn-
thesis and semantic manipulation with conditional gans. In
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 8798–8807, 2018. 4
[61] Shunyu Yao, Tzu Ming Hsu, Jun-Yan Zhu, Jiajun Wu, Anto-
nio Torralba, Bill Freeman, and Josh Tenenbaum. 3d-aware
scene manipulation via inverse graphics. In Neural Informa-
tion Processing Systems, pages 1887–1898, 2018. 4
[62] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dual-
gan: Unsupervised dual learning for image-to-image trans-
lation. In Proceedings of the IEEE international conference
on computer vision, pages 2849–2857, 2017. 2
[63] Yuan Yuan, Siyuan Liu, Jiawei Zhang, Yongbing Zhang,
Chao Dong, and Liang Lin. Unsupervised image super-
resolution using cycle-in-cycle generative adversarial net-
works. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR) Workshops,
June 2018. 4
[64] Fangneng Zhan and Shijian Lu. Esir: End-to-end scene text
recognition via iterative image rectification. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 2059–2068, 2019. 4
[65] Fangneng Zhan, Shijian Lu, and Chuhui Xue. Verisimilar
image synthesis for accurate detection and recognition of
texts in scenes. In Proceedings of the European Conference
on Computer Vision (ECCV), pages 249–266, 2018. 4
[66] Fangneng Zhan, Chuhui Xue, and Shijian Lu. Ga-dan:
Geometry-aware domain adaptation network for scene text
detection and recognition. In Proceedings of the IEEE Inter-
national Conference on Computer Vision, pages 9105–9115,
2019. 4
[67] Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang,
and Shijian Lu. Multimodal image synthesis and editing: A
survey. arXiv preprint arXiv:2112.13592, 2021. 4
[68] Fangneng Zhan, Changgong Zhang, Wenbo Hu, Shijian Lu,
Feiying Ma, Xuansong Xie, and Ling Shao. Sparse needlets
for lighting estimation with spherical transport loss. In
Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 12830–12839, 2021. 4
[69] Fangneng Zhan, Jiahui Zhang, Yingchen Yu, Rongliang Wu,
and Shijian Lu. Modulated contrast for versatile image syn-
thesis. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 18280–18290,
2022. 4
[70] Fangneng Zhan, Hongyuan Zhu, and Shijian Lu. Spa-
tial fusion gan for image synthesis. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pages 3653–3662, 2019. 4
[71] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xi-
aogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stack-
gan++: Realistic image synthesis with stacked generative ad-
versarial networks. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 41(8):1947–1962, 2019. 4
[72] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiao-
gang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stack-
gan: Text to photo-realistic image synthesis with stacked
generative adversarial networks. In IEEE International Con-
ference on Computer Vision, pages 5907–5915, 2017. 4
[73] Pan Zhang, Bo Zhang, Dong Chen, Lu Yuan, and Fang Wen.
Cross-domain correspondence learning for exemplar-based
image translation. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
5143–5153, 2020. 4
[74] Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua
Bengio, and Yangqiu Song. Metagan: An adversarial ap-
proach to few-shot learning. In Advances in Neural Informa-
tion Processing Systems, pages 2365–2374, 2018. 2
[75] Yongbing Zhang, Siyuan Liu, Chao Dong, Xinfeng Zhang,
and Yuan Yuan. Multiple cycle-in-cycle generative adversar-
ial networks for unsupervised image super-resolution. IEEE
transactions on Image Processing, 29:1101–1112, 2019. 4
[76] Xingran Zhou, Bo Zhang, Ting Zhang, Pan Zhang, Jianmin
Bao, Dong Chen, Zhongfei Zhang, and Fang Wen. Cocosnet
v2: Full-resolution correspondence learning for image trans-
lation. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 11465–11475,
2021. 4
[77] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A
Efros. Unpaired image-to-image translation using cycle-
consistent adversarial networks. In International Conference
on Computer Vision, pages 2223–2232, 2017. 2,4
[78] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Dar-
rell, Alexei A Efros, Oliver Wang, and Eli Shechtman. To-
ward multimodal image-to-image translation. In Neural In-
formation Processing Systems, pages 465–476, 2017. 4
[79] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka.
Sean: Image synthesis with semantic region-adaptive nor-
malization. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), June
2020. 4