
Coarse-to-Fine Image Inpainting via Region-wise Convolutions and Non-Local Correlation

August 2019


DOI:10.24963/ijcai.2019/433

Conference: Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}

Authors:

Yuqing Ma

Beihang University (BUAA)

Xianglong Liu

Beihang University (BUAA)

Shihao Bai

Beihang University (BUAA)


Recently, deep neural networks have achieved promising performance for filling large missing regions in image inpainting tasks. They usually adopt the standard convolutional architecture over the corrupted image, where the same convolution filters try to restore the diverse information in both existing and missing regions, while ignoring the long-distance correlation among the regions. Relying only on the surrounding areas inevitably leads to meaningless contents and artifacts, such as color discrepancy and blur. To address these problems, we first propose region-wise convolutions to locally deal with the different types of regions, which help exactly reconstruct existing regions and, at the same time, roughly infer the missing ones from existing regions. Then, a non-local operation is introduced to globally model the correlation among different regions, ensuring visual consistency between missing and existing regions. Finally, we integrate the region-wise convolutions and non-local correlation in a coarse-to-fine framework to restore semantically reasonable and visually realistic images. Extensive experiments have been conducted on three widely used image inpainting datasets, and both qualitative and quantitative results demonstrate that the proposed model significantly outperforms state-of-the-art approaches, especially for large irregular missing regions.



Yuqing Ma¹, Xianglong Liu∗¹,², Shihao Bai¹, Lei Wang¹, Dailan He¹ and Aishan Liu¹

¹ State Key Lab of Software Development Environment, Beihang University, China
² Beijing Advanced Innovation Center for Big Data-Based Precision Medicine, Beihang University, Beijing, China

{mayuqing, xlliu, 16061167, HBwanglei, hdl730, liuaishan}@buaa.edu.cn


1 Introduction

Image inpainting (i.e., image completion or image hole-filling), which synthesizes visually realistic and semantically plausible contents in missing regions, has attracted great attention in recent years. It can be widely applied in many tasks [Barnes et al., 2009a; Newson et al., 2014; Park et al., 2017; Simakov et al., 2008], such as photo editing, image-based rendering, and computational photography. Many methods have been proposed for generating desirable contents in different ways, including traditional methods using handcrafted features and deep generative models.

Figure 1: Image inpainting results using EdgeConnect (EC) and our proposed method on a street view image. Panels: (a) Input, (b) EC, (c) Ours.

∗Corresponding Author

Traditional approaches can be roughly divided into two types: diffusion-based and patch-based. The former propagate background data into missing regions by following a diffusive process, typically modeled using differential operators [Ballester et al., 2000; Esedoglu and Shen, 2002]. Patch-based methods [Kwatra et al., 2005; Barnes et al., 2009b] fill in missing regions with patches from a collection of source images that maximize the patch similarity. These methods work well for completing images with repeating structures. However, they are usually time-consuming, and they cannot hallucinate semantically plausible contents for challenging cases where the inpainting regions involve complex, non-repetitive structures, e.g., faces and objects.

The significant development of deep neural networks and generative adversarial networks (GANs) has inspired recent works to formulate inpainting as a conditional image generation problem. Context Encoders [Pathak et al., 2016] first exploited GANs to restore images, using a channel-wise fully connected layer to propagate information between encoder and decoder. [Iizuka et al., 2017] utilized dilated convolutions and employed both global and local discriminators to assess images. [Yu et al., 2018b] adopted a coarse-to-fine network with an attention mechanism to gradually refine the generated images. To perceptually enhance image quality, several studies [Yang et al., 2017; Song et al., 2017; Wang et al., 2018b] extracted features using a pre-trained VGG network to reduce the perceptual loss or style loss. More recently, [Liu et al., 2018; Yu et al., 2018a; Nazeri et al., 2019] further concentrated on irregular missing regions and achieved satisfying performance, especially for highly structured images.

Despite the encouraging progress in image inpainting, most existing methods still face inconsistency problems, such as distorted structures and blurry textures (see the result of the very recent method EC [Nazeri et al., 2019] in Figure 1). This phenomenon is most likely due to the inappropriate convolution operation over the two types of regions, i.e., existing and missing regions. Intuitively, different feature representations should be extracted to characterize different types of regions, since there is sufficient content information in existing regions, but none in the missing ones, which need to be inferred from existing regions. Therefore, directly applying the same convolution filters to generate semantic contents inevitably leads to visual artifacts such as color discrepancy, blur and obvious edge responses surrounding holes. Changeable masks have been proposed in recent works [Liu et al., 2018; Yu et al., 2018a] to handle the difference. However, because they rely on the same filters for different regions, they still fail to generate favourable results.

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Figure 2: The architecture of our proposed coarse-to-fine image inpainting framework. Diagram labels: image and mask inputs; Stage 1 and Stage 2, each followed by a composite step; standard, dilated, missing-region and existing-region convolutions; region-wise convolution over the existing and missing regions; skip lines; losses.

In this paper, to generate desirable contents for missing regions, we treat the different types of regions using different convolution filters. Existing regions contain sufficient information and thus can be reconstructed from themselves, while the missing ones, without any information, have to be inferred from the existing regions. Therefore, we develop region-wise convolution operations, i.e., self-reconstruction and restoration from the existing regions, to deal with existing and missing regions separately. The region-wise convolutions help infer the missing semantic contents, but inevitably cause inconsistent appearance because they ignore the correlation between existing and missing regions. We further propose a non-local operation to model the correlation among regions, and thus generate more meaningful contents that connect them naturally. Then, we introduce a two-stage coarse-to-fine image inpainting framework with an $\ell_1$ reconstruction loss, a correlation loss and the popular style loss.

The framework produces natural, semantic contents for missing regions by incorporating region-wise convolutions and the non-local operation at the coarse stage, and then outputs the restored image by eliminating visually unpleasant artifacts at the fine stage. Figure 2 shows the architecture of the whole framework. Extensive experiments on various datasets such as faces (CelebA-HQ [Karras et al., 2017]), street views (Paris StreetView [Doersch et al., 2012]) and natural scenes (Places2 [Zhou et al., 2018]) demonstrate that our proposed method can significantly outperform other state-of-the-art approaches in image inpainting.

2 The Approach

In this section, we elaborate on the details of our coarse-to-fine image inpainting framework with an encoder-decoder architecture. We first introduce the whole framework, which consists of two stages that respectively learn the missing regions at the coarse stage and further refine the whole image at the fine stage. Then, we present our region-wise convolutions and the non-local operation. Finally, the whole formulation and optimization strategies are provided.

2.1 The Coarse-to-fine Framework

State-of-the-art image inpainting solutions often ignore either the difference or the correlation between the existing and missing regions. To address both issues simultaneously, we adopt a two-stage coarse-to-fine framework based on the encoder-decoder architecture. At the coarse stage, the framework first infers the semantic contents from the existing regions using region-wise convolution filters, rather than identical ones. Then, it further enhances the quality of the composited image using the non-local operation, which takes the correlation between different regions into consideration. At the fine stage, the two different regions are considered together using a style loss over the whole image, which perceptually enhances the image quality. With this two-stage progressive generation, the framework makes the restored images more realistic and perceptually consistent.

As shown in Figure 2, the framework takes the incomplete image $\hat{I}_g$ and a binary mask $M$ as input, and attempts to restore a complete image close to the ground truth image $I_g$, where $M$ indicates the missing regions (the mask value is 0 for missing pixels and 1 elsewhere), $\hat{I}_g = I_g \odot M$, and $\odot$ denotes the element-wise product. To accomplish this goal, networks $E_1$ and $E_2$ serve as encoders in the two stages respectively, extracting semantic features from the corresponding input images. A decoder $G$ composed of the proposed region-wise convolutional layers is employed after encoder $E_1$ to restore the semantic contents for different regions, and generates the predicted image $I_p^{(1)} = G(E_1(\hat{I}_g))$ at the coarse stage. After feeding the composited image $I_c^{(1)} = \hat{I}_g + I_p^{(1)} \odot (1 - M)$ from the coarse stage to encoder $E_2$, another decoder $D$ at the second stage further synthesizes the refined image $I_p^{(2)} = D(E_2(I_c^{(1)}))$. Based on the encoder-decoder architectures, we finally have the visually and semantically realistic inpainting result $I_c^{(2)} = \hat{I}_g + I_p^{(2)} \odot (1 - M)$ close to the ground truth image $I_g$.

2.2 Inferring Region-wise Contents

For image inpainting tasks, the input images are composed of both existing regions with valid pixels and missing (masked) regions whose invalid pixels are to be synthesized. Relying only on the same convolution filters, we can hardly restore the semantic features over different regions, which in practice usually leads to visual artifacts such as color discrepancy, blur and obvious edge responses surrounding the missing regions. Motivated by this observation, we first propose region-wise convolutions in the decoder network $G$ at the coarse stage, so that the decoder can separately generate the corresponding contents for different regions using different convolution filters.

Specifically, let $W$ and $\hat{W}$ be the weights of the region-wise convolution filters for existing and missing regions respectively, and let $b$ and $\hat{b}$ be the corresponding biases. $x$ is the feature for the current convolution (sliding) window belonging to the whole feature map $X$. Then, the region-wise convolutions at every location can be formulated as follows:

$$
x' =
\begin{cases}
W^\top x + b, & x \in X \odot M \\
\hat{W}^\top x + \hat{b}, & x \in X \odot (1 - M)
\end{cases}
\tag{1}
$$

This means that different convolution filters are learnt for the feature representation of different types of regions.

In practice, we can accomplish region-wise convolutions by proportionally resizing the mask as the feature maps are down-sampled through the convolution layers. In this way, different regions can be easily distinguished according to the resized mask by channels, and thus the information in different regions can be transmitted consistently across layers. The convolution filters for existing regions try to reconstruct themselves, while those for missing regions focus on inferring the semantic contents from existing parts.
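To make the region-wise idea concrete, the sketch below applies two different 3x3 filters to a single-channel feature map and picks the output per location according to the mask. It is a simplified illustration under stated assumptions: one channel, zero padding, hand-chosen kernels instead of learnt ones, and region membership decided by the mask value at the window centre. The names `conv2d_same` and `region_wise_conv` are hypothetical.

```python
def conv2d_same(feature, kernel, bias):
    """Single-channel 2-D correlation with zero padding (output has the input size)."""
    h, w = len(feature), len(feature[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = bias
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + k][dc + k] * feature[rr][cc]
            out[r][c] = acc
    return out


def region_wise_conv(feature, mask, w_exist, b_exist, w_miss, b_miss):
    """Use (w_exist, b_exist) where mask == 1 and (w_miss, b_miss) where mask == 0."""
    exist_out = conv2d_same(feature, w_exist, b_exist)
    miss_out = conv2d_same(feature, w_miss, b_miss)
    return [[e if m == 1 else s for e, s, m in zip(row_e, row_s, row_m)]
            for row_e, row_s, row_m in zip(exist_out, miss_out, mask)]


# Hand-picked kernels (assumptions): identity for existing pixels,
# mean of the 8 neighbours for missing pixels.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
neighbour_mean = [[0.125, 0.125, 0.125], [0.125, 0.0, 0.125], [0.125, 0.125, 0.125]]

feature = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]  # centre is missing
mask = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
output = region_wise_conv(feature, mask, identity, 0.0, neighbour_mean, 0.0)
```

Here the existing pixels reproduce themselves while the missing centre is filled from its neighbours, which mirrors the self-reconstruction versus inference split described in the text. Computing both convolutions everywhere and then selecting is chosen for clarity, not efficiency.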

2.3 Modelling Non-local Correlation

After the region-wise convolutions, the framework generates a coarse predicted image, where missing regions are almost recovered with semantically meaningful contents. However, the predicted image is still far from visually realistic. This is mainly because convolution operations are good at processing local neighborhoods but fail to model the correlation between distant positions.

To address this problem and improve the visual quality of the recovered image, a non-local operation is adopted following prior studies [Wang et al., 2018a]. It computes the response at a position as a weighted sum of the features at all positions in the input feature map, and thus can capture long-distance correlation between patches inside an image. Note that the traditional way to accomplish the non-local operation relies on simple matrix multiplication and is usually adopted in the feed-forward process to obtain more information for specific tasks. However, the computation is quite memory-consuming for large feature maps, which makes it inapplicable to our generative models, where the smallest feature map created by $G$ is $128 \times 128$.

In this paper, we accomplish the non-local operation using the simple outer product between different positions, rather than the non-local block. Formally, given an image $I_c^{(1)}$, $\Psi(I_c^{(1)})$ denotes the $c \times h \times w$ feature map computed by the feature extraction method $\Psi$. In practice, in order to index an output position in the spatial dimension easily, we reshape the feature map to the size of $c \times n$, where $n = h \times w$. Correspondingly, $\Psi_i(I_g)$ is the $i$-th column, of length $c$, in the reshaped feature map $\Psi(I_g)$, where $i = 1, \ldots, n$. Then, a pairwise function $f_{ij}$ can be defined as a non-local operation, which generates an $n \times n$ Gram matrix evaluating the correlation between positions $i$ and $j$:

$$
f_{ij}(I_c^{(1)}) = \Psi_i(I_c^{(1)})^\top \Psi_j(I_c^{(1)}). \tag{2}
$$

Once we have the non-local correlation, we can bring it into the inpainting framework by introducing a correlation loss based on the Gram matrix.
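The pairwise function in Eq. (2) can be illustrated with a few lines of plain Python. This is a sketch only: the helper names `flatten_spatial` and `correlation_matrix` are assumptions, the feature map is a toy array rather than a VGG activation, and positions are indexed from 0 instead of 1.

```python
def flatten_spatial(feature_map):
    """Reshape a c x h x w nested list into c x n, with n = h * w (row-major order)."""
    return [[v for row in channel for v in row] for channel in feature_map]


def correlation_matrix(features):
    """Return the n x n matrix whose (i, j) entry is the dot product of columns i and j."""
    c, n = len(features), len(features[0])
    return [[sum(features[k][i] * features[k][j] for k in range(c))
             for j in range(n)]
            for i in range(n)]


# Toy feature map with c = 2 channels and h x w = 1 x 3 positions.
feature_map = [[[1, 0, 2]], [[0, 1, 1]]]
features = flatten_spatial(feature_map)  # columns are (1, 0), (0, 1), (2, 1)
f = correlation_matrix(features)
```

The result is symmetric and its size grows as $n^2$, which is why the memory cost of such pairwise computations matters for large feature maps.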

2.4 The Formulation

To guide the learning of the two-stage encoder-decoder network, we introduce the following loss functions.

Reconstruction Loss

We employ an $\ell_1$ reconstruction loss to ensure that the predicted images at the two stages, including both the existing regions and the missing ones, are consistent with the ground truth at the pixel level:

$$
L_r = \left\| I_p^{(1)} - I_g \right\|_1 + \left\| I_p^{(2)} - I_g \right\|_1. \tag{3}
$$

The reconstruction loss helps the region-wise convolution filters learn to generate meaningful contents for different regions, especially at the first stage.

Correlation Loss

The reconstruction loss treats all pixels independently without considering their correlation, while in our observation the relationship among distant local patches plays a critical role in keeping the semantic and visual consistency between the generated missing regions and the existing ones. Therefore, we further introduce a correlation loss that helps determine the expected non-local operation. Namely, for image $I_c^{(1)}$, the correlation loss is defined based on $f_{ij}(\cdot)$:

$$
L_c = \sigma \sum_{i,j} \left\| f_{ij}(I_c^{(1)}) - f_{ij}(I_g) \right\|_1, \tag{4}
$$

where $\sigma$ denotes the normalization factor by position. The correlation loss forces the model to generate images with semantic details much closer to the realistic image. Here, different from the prior work of PConv, we only consider the non-local correlation for the composited image.
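As a rough illustration of Eq. (4), the sketch below compares the position-wise Gram matrices of two already-flattened $c \times n$ feature arrays. The paper only says that $\sigma$ normalizes by position, so the choice $\sigma = 1/n^2$ here is an assumption, as are the function names and the toy features.

```python
def gram(features):
    """Position-wise Gram matrix of a c x n feature array (n x n result)."""
    c, n = len(features), len(features[0])
    return [[sum(features[k][i] * features[k][j] for k in range(c))
             for j in range(n)]
            for i in range(n)]


def correlation_loss(feat_composited, feat_truth):
    """Sum of absolute Gram-matrix differences, scaled by an assumed sigma = 1 / n^2."""
    n = len(feat_composited[0])
    f_c = gram(feat_composited)
    f_g = gram(feat_truth)
    total = sum(abs(f_c[i][j] - f_g[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)


# Toy features with c = 2 channels and n = 2 positions.
feat_truth = [[1, 0], [0, 1]]       # Gram matrix is the identity
feat_composited = [[1, 1], [0, 1]]  # Gram matrix is [[1, 1], [1, 2]]
loss = correlation_loss(feat_composited, feat_truth)
```

The loss is zero when both images yield the same pairwise correlations, and it penalizes mismatches between any two positions regardless of how far apart they are.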


Style Loss

Although the non-local correlation loss is capable of capturing long-distance dependencies and enhancing the restoration of details, it still fails to avoid visual artifacts in unstable generative models. Therefore, we append a style loss to produce clean results and further refine the images perceptually as a whole at the second stage. The style loss is widely used in image inpainting and style transfer tasks, and also serves as an effective tool to combat "checkerboard" artifacts [Sajjadi et al., 2017]. After projecting image $I_c^{(2)}$ into a higher-level feature space using a pre-trained VGG, we obtain the feature map $\Phi_p(I_c^{(2)})$ of the $p$-th layer with size $c_p \times h_p \times w_p$, and thus the style loss is formulated as follows:

$$
L_s = \sum_p \delta_p \left\| \Phi_p(I_c^{(2)})^\top \Phi_p(I_c^{(2)}) - \Phi_p(I_g)^\top \Phi_p(I_g) \right\|_1, \tag{5}
$$

where $\delta_p$ denotes the normalization factor for the $p$-th selected layer by channel. The style loss focuses on the relationship between different channels to transfer the style for the composited image at the second stage.

Overall Loss

The overall loss $L$ combines the reconstruction, correlation and style loss functions:

$$
L = L_r + \lambda_1 L_c + \lambda_2 L_s. \tag{6}
$$

In our coarse-to-fine framework, the reconstruction loss works in both stages to guarantee pixel-wise consistency between the predicted images and the ground truth. To capture the relationship among different regions and generate detailed contents at the first stage, the correlation loss is adopted to guide the training of the networks $E_1$ and $G$. Finally, at the second stage, the style loss helps perceptually enhance the image quality by considering the whole image.
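The weighted combination in Eq. (6) can be sketched as follows. The default weights are the values the paper reports choosing empirically in its experimental setup; everything else is an assumption for illustration: the function names, the unnormalized sum used for the $\ell_1$ norm, and the placeholder numbers passed in for the correlation and style terms.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equally sized 2-D lists."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def overall_loss(pred_coarse, pred_fine, truth, corr_loss, style_loss,
                 lam1=1e-5, lam2=1e-3):
    """L = L_r + lam1 * L_c + lam2 * L_s, with L_r summed over both stages."""
    l_r = l1_distance(pred_coarse, truth) + l1_distance(pred_fine, truth)
    return l_r + lam1 * corr_loss + lam2 * style_loss


# Toy 1x2 images; corr_loss and style_loss are placeholder values.
truth = [[0.0, 1.0]]
pred_coarse = [[0.5, 1.0]]
pred_fine = [[0.0, 0.5]]
total = overall_loss(pred_coarse, pred_fine, truth, corr_loss=2.0, style_loss=3.0)
```

With these toy inputs the reconstruction term dominates, which shows how small weights keep the correlation and style terms from overwhelming the pixel-level objective when their raw magnitudes are large.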

2.5 Implementation and Training

In practice, we exploit the widely adopted pre-trained VGG network to extract features for the calculation of the correlation loss as well as the style loss. For the correlation loss, only feature maps extracted by pool2 are adopted, due to the weak semantic representation capability of pool1 and the blur caused by pool3 and pool4. To calculate the style loss, we use the outputs of pool1, pool2, and pool3 together. In other words, $\Psi(\cdot) = \Phi_p(\cdot)$ when $p = 2$.

We also adopt skip links, which, as [Liu et al., 2018] claimed, may propagate noise in most inpainting architectures. However, we find that skip links do not have this negative effect in our framework thanks to the region-wise convolutions, and thus they enable detailed output from existing regions.

The entire training procedure follows the standard forward and backward optimization paradigm. In the forward step, given a ground truth image $I_g$, we first sample an irregular binary mask $M$ and subsequently generate the incomplete image $\hat{I}_g$. The inpainting framework takes the concatenation of $\hat{I}_g$ and $M$ as the input, and outputs the predicted images $I_p^{(1)}$ and $I_p^{(2)}$ in the coarse and fine stages respectively. In the backward step, according to the three types of losses over the predicted and composited images, we update the network parameters using back-propagation.

Metric           Mask      GLCIC      CA         PConv     EC        Ours
PSNR∗            0-10%     26.71      36.13      30.41     30.32     42.52
                 10-20%    20.97      22.97      26.93     26.92     29.52
                 20-30%    18.22      20.26      24.80     24.91     26.77
                 30-40%    16.31      18.47      23.14     23.37     24.87
                 40-50%    14.88      17.09      21.71     22.06     23.34
                 50-60%    13.80      16.01      20.41     20.91     22.04
ℓ1† (10^-3)      0-10%     23.55      17.40      18.94     18.82     4.85
                 10-20%    40.32      32.50      24.49     24.08     10.22
                 20-30%    59.26      47.76      30.48     29.62     15.91
                 30-40%    80.33      63.63      37.25     35.74     22.15
                 40-50%    102.67     80.36      45.23     42.67     29.08
                 50-60%    124.63     97.11      54.77     50.44     36.58
ℓ2† (10^-3)      0-10%     3.06       2.20       1.14      1.17      0.46
                 10-20%    9.54       6.90       2.50      2.53      1.55
                 20-30%    17.40      11.92      4.04      4.00      2.77
                 30-40%    26.57      17.34      5.85      5.66      4.19
                 40-50%    36.60      23.25      8.07      7.58      5.85
                 50-60%    46.71      29.34      10.77     9.79      7.78
SSIM∗            0-10%     0.902      0.965      0.924     0.925     0.982
                 10-20%    0.806      0.888      0.880     0.881     0.942
                 20-30%    0.708      0.811      0.834     0.836     0.901
                 30-40%    0.609      0.730      0.784     0.788     0.856
                 40-50%    0.513      0.647      0.728     0.736     0.807
                 50-60%    0.427      0.566      0.667     0.680     0.755
FID†             0-10%     8.21       1.26       1.75      1.38      0.02
                 10-20%    34.48      8.73       2.10      1.80      0.11
                 20-30%    62.74      20.35      2.88      2.69      0.31
                 30-40%    90.94      36.53      4.31      4.36      0.68
                 40-50%    117.23     57.60      6.97      7.38      1.38
                 50-60%    140.53     81.66      12.10     12.52     2.66
Perceptual†      0-10%     183.39     81.58      128.64    126.98    36.11
                 10-20%    363.68     220.77     193.84    192.50    109.42
                 20-30%    546.10     348.93     258.47    255.98    178.49
                 30-40%    729.94     471.10     326.36    321.03    247.02
                 40-50%    906.89     587.90     401.07    389.19    316.61
                 50-60%    1062.77    1132.34    485.31    459.95    385.93

Table 1: Quantitative comparisons among different methods on Places2, in terms of different evaluation metrics. † means lower is better, while ∗ means higher is better.

3 Experiments

In this section, we evaluate our proposed method visually and quantitatively on several common image inpainting datasets, in comparison with state-of-the-art methods. More results can be found in the supplementary material¹.

3.1 Datasets and Protocols

We employ datasets widely used in prior studies, including CelebA-HQ [Karras et al., 2017], Places2 [Zhou et al., 2018], and Paris StreetView [Doersch et al., 2012]. CelebA-HQ contains 30k high-resolution face images, and we adopt the same partition as [Yu et al., 2018b]. The Places2 dataset includes 8,097,967 training images with diverse scenes. Paris StreetView contains 14,900 training images and 100 test images. For these two datasets, we adopt the original train, test, and validation splits.

We compare our method with four state-of-the-art models, namely Globally and Locally Consistent Image Completion (GLCIC) [Iizuka et al., 2017], Contextual Attention (CA) [Yu et al., 2018b], Partial Convolution (PConv) [Liu et al., 2018] and EdgeConnect (EC) [Nazeri et al., 2019]. Among those models, GLCIC and CA were initially designed for regular missing regions, while PConv, EC and ours focus on irregular holes. Besides, the training of GLCIC and CA relies heavily on local discriminators that assume the local bounding boxes of the holes are available, which does not make sense under our experimental setting. Therefore, we directly apply the released pre-trained models of these two methods in our experiments. For EC, we use the pre-trained models on the Paris StreetView and Places2 datasets, and train the model on CelebA-HQ with the released code. As for PConv, since there is no published code, we borrow an implementation from GitHub², and retrain the model following the authors' advice.

¹ https://drive.google.com/file/d/1iO0cZ0fwgVeaRrhTLCuk-rvbCekkMVmv/view?usp=sharing
² https://github.com/MathiasGruber/PConv-Keras

Figure 3: Qualitative comparisons between different methods on the Places2, Paris StreetView and CelebA-HQ datasets. Panels: (a) Input, (b) GLCIC, (c) CA, (d) PConv, (e) EC, (f) Ours, (g) GT.

Figure 4: Object removal results (column (c)) using our model: removing a beard, a watermark and a kid from the original images (column (a)) according to the input mask (column (b)). Panels: (a) Origin, (b) Input, (c) Output.

For our method, we develop the model based on the architecture of CA, discarding its contextual attention module but adding the region-wise convolutions. Input images are resized to $256 \times 256$, and the proportion of irregular missing regions varies from 0 to 40% in the training process. We empirically choose the hyper-parameters $\lambda_1 = 10^{-5}$ and $\lambda_2 = 10^{-3}$, and the initial learning rate $10^{-4}$. Using the Adam optimizer, we train the model with a batch size of 8 for 20 epochs on CelebA-HQ and Paris StreetView, and with a batch size of 48 on Places2.

3.2 Qualitative Results

Figure 3 shows the inpainting results of different methods on several examples from Places2, Paris StreetView and CelebA-HQ respectively, where "GT" stands for the ground truth images. All the reported results are the direct outputs of the trained models without any post-processing. Note that images in Places2 contain too much semantic content to be shown clearly at small size, so in the first row of Figure 3 we mark the specific regions with yellow rectangles. From the figure, we can see that GLCIC and CA bring strong distortions to the inpainted images, while PConv can recover the semantic information for the missing irregular regions in most cases, but still shows obvious deviations from the ground truth. EC performs well when the missing regions are small (e.g., 0-30%; see more results in the supplementary material), but also fails to infer the correct edge information for large holes. Among all the methods, our model restores images with more natural contents in the missing regions, which look more consistent with existing regions and much closer to the ground truth.

Unwanted object removal is one of the most useful applications of image inpainting. Therefore, we also study the performance of our method on this task, and show several examples in Figure 4. The inpainted images look natural and harmonious, even when the unwanted objects have complex shapes and backgrounds.


Figure 5: The effect of different components in our model: (a) the input incomplete images, (b) results using standard convolutions instead of our region-wise convolutions, (c) results of the model trained without our correlation loss $L_c$, (d) results of the model trained with $L_c$ and $L_s$ at the same stage, (e) results of the coarse stage, (f) results of our full coarse-to-fine model, and (g) the ground truth images.

3.3 Quantitative Results

Following [Nazeri et al., 2019], we investigate the performance of different methods using the following quantitative metrics: 1) $\ell_1$ error, 2) $\ell_2$ error, 3) peak signal-to-noise ratio (PSNR), and 4) structural similarity index (SSIM). These metrics assume pixel-wise independence, and can help to compare the visual appearance of different inpainted images. In practice, however, they may assign favorable scores to perceptually inaccurate results. Recent works [Xu et al., 2018] have shown that metrics based on deep features are closer to human perception. Therefore, we also adopt two further metrics, Frechet Inception Distance (FID) [Xu et al., 2018] and perceptual error [Johnson et al., 2016], computed on deep features to evaluate the performance at the semantic level.
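For readers unfamiliar with PSNR, the sketch below computes it from the mean squared error using the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The paper does not describe its evaluation code, so the assumed pixel range of [0, 1], the function name `psnr` and the toy images are illustrative only.

```python
import math


def psnr(pred, truth, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D lists."""
    squared = [(p - t) ** 2
               for row_p, row_t in zip(pred, truth)
               for p, t in zip(row_p, row_t)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 1x2 images with one pixel off by 0.1 (pixel range assumed to be [0, 1]).
truth = [[0.0, 1.0]]
pred = [[0.0, 0.9]]
score = psnr(pred, truth)
```

Because PSNR depends only on per-pixel squared differences, two results with the same score can look very different, which is the limitation that motivates the deep-feature metrics mentioned above.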

Table 1 lists the results of all methods on the largest dataset, Places2, in terms of different metrics and with respect to different mask sizes. First, we can observe that as the missing area gradually increases, all the methods perform worse in terms of all metrics. Compared to the others, however, our method obtains the best performance in all cases, and its performance decreases much more slowly as the mask size grows. This means that our method works stably and robustly, especially for input images with large missing regions. Besides, in terms of FID and perceptual error, our method achieves a much more significant improvement over state-of-the-art methods like PConv and EC, which indicates that the proposed framework can produce more semantically meaningful contents for missing regions. Moreover, in terms of PSNR, $\ell_1$ and $\ell_2$ errors, the superior performance over the other methods indicates that our method has a strong capability of generating more detailed contents for better visual quality.

3.4 Ablation Study

As mentioned above, our method mainly gains from the region-wise convolutions and the non-local correlation. Thus, we study the effects of the different parts on image inpainting. Figure 5 shows the inpainting results obtained by our full framework and by variants that use standard convolution filters instead of region-wise ones, remove the correlation loss, use $L_c$ and $L_s$ at the same stage, or adopt only the coarse stage. From the results, we can see that without region-wise convolutional layers, the framework can hardly infer information consistent with the existing regions. Furthermore, without considering the non-local correlation, the framework restores the missing regions only according to the surrounding areas. Moreover, using $L_c$ and $L_s$ at the same stage causes artifacts and fails to restore semantic contents. Besides, although the coarse stage can restore the semantic information, its outputs still contain strange artifacts. With the help of both region-wise convolutions and non-local correlation, our framework is able to generate images that are visually and semantically close to the ground truth.

4 Conclusion

We propose a two-stage coarse-to-fine generative image inpainting framework, which integrates region-wise convolutions and the non-local operation to deal with the differences and correlation between existing and missing regions. Region-wise convolutions reconstruct existing regions while inferring missing regions from existing ones. The non-local operation ensures that missing regions are visually consistent with existing regions, e.g., in color, texture and edges. We show that our proposed method is able to restore meaningful contents for missing regions and connect existing and missing regions naturally, and thus significantly improves inpainting results. Furthermore, we demonstrate that our inpainting framework can edit faces, clear watermarks and remove unwanted objects in practical applications. Extensive experiments on various datasets such as faces, Paris street views and natural scenes demonstrate that our proposed method can significantly outperform other state-of-the-art approaches in image inpainting.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61690202, 61872021), the Fundamental Research Funds for Central Universities (YWF-19-BJ-J-271), the Beijing Municipal Science and Technology Commission (Z171100000117022), and the State Key Lab of Software Development Environment (SKLSDE-2018ZX-04).

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)


References

[Ballester et al., 2000] Coloma Ballester, Marcelo Bertalmio, Vicent Caselles, Guillermo Sapiro, and Joan Verdera. Filling-in by joint interpolation of vector fields and gray levels. 2000.

[Barnes et al., 2009a] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH), 28(3), August 2009.

[Barnes et al., 2009b] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (ToG), 28(3):24, 2009.

[Doersch et al., 2012] Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei Efros. What makes Paris look like Paris? ACM Transactions on Graphics, 31(4), 2012.

[Esedoglu and Shen, 2002] Selim Esedoglu and Jianhong Shen. Digital inpainting based on the Mumford–Shah–Euler image model. European Journal of Applied Mathematics, 13(4):353–370, 2002.

[Iizuka et al., 2017] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.

[Johnson et al., 2016] Justin Johnson, Alexandre Alahi, and Fei Fei Li. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.

[Karras et al., 2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

[Kwatra et al., 2005] Vivek Kwatra, Irfan Essa, Aaron Bobick, and Nipun Kwatra. Texture optimization for example-based synthesis. In ACM Transactions on Graphics (ToG), volume 24, pages 795–802. ACM, 2005.

[Liu et al., 2018] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. arXiv preprint arXiv:1804.07723, 2018.

[Nazeri et al., 2019] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Qureshi, and Mehran Ebrahimi. EdgeConnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212, 2019.

[Newson et al., 2014] Alasdair Newson, Andrés Almansa, Matthieu Fradet, Yann Gousseau, and Patrick Pérez. Video inpainting of complex scenes. SIAM Journal on Imaging Sciences, 7(4):1993–2019, 2014.

[Park et al., 2017] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C Berg. Transformation-grounded image generation network for novel 3D view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3500–3509, 2017.

[Pathak et al., 2016] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.

[Sajjadi et al., 2017] Mehdi S. M. Sajjadi, Bernhard Schölkopf, and Michael Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[Simakov et al., 2008] Denis Simakov, Yaron Caspi, Eli Shechtman, and Michal Irani. Summarizing visual data using bidirectional similarity. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

[Song et al., 2017] Yuhang Song, Chao Yang, Zhe L. Lin, Hao Li, Qin Huang, and C.-C. Jay Kuo. Image inpainting using multi-scale feature image translation. CoRR, abs/1711.08590, 2017.

[Wang et al., 2018a] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.

[Wang et al., 2018b] Yi Wang, Xin Tao, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Image inpainting via generative multi-column convolutional neural networks, 2018.

[Xu et al., 2018] Qiantong Xu, Huang Gao, Yuan Yang, Chuan Guo, and Kilian Weinberger. An empirical study on evaluation metrics of generative adversarial networks. 2018.

[Yang et al., 2017] Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.

[Yu et al., 2018a] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589, 2018.

[Yu et al., 2018b] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. arXiv preprint, 2018.

[Zhou et al., 2018] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2018.

