Coarse-to-Fine Image Inpainting via Region-wise Convolutions
and Non-Local Correlation
Yuqing Ma¹, Xianglong Liu¹,², Shihao Bai¹, Lei Wang¹, Dailan He¹ and Aishan Liu¹
1State Key Lab of Software Development Environment, Beihang University, China
2Beijing Advanced Innovation Center for Big Data-Based Precision Medicine,
Beihang University, Beijing, China
{mayuqing, xlliu, 16061167, HBwanglei, hdl730, liuaishan}@buaa.edu.cn
Abstract
Recently, deep neural networks have achieved promising performance in filling large missing regions for image inpainting tasks. They usually adopt a standard convolutional architecture over the corrupted image, where the same convolution filters try to restore the diverse information of both existing and missing regions, while ignoring the long-distance correlation among regions. Relying only on the surrounding areas inevitably leads to meaningless contents and artifacts, such as color discrepancy and blur. To address these problems, we first propose region-wise convolutions to locally deal with the different types of regions, which help exactly reconstruct existing regions and roughly infer the missing ones from existing regions at the same time. Then, a non-local operation is introduced to globally model the correlation among different regions, promising visual consistency between missing and existing regions. Finally, we integrate the region-wise convolutions and non-local correlation in a coarse-to-fine framework to restore semantically reasonable and visually realistic images. Extensive experiments on three widely-used datasets for image inpainting have been conducted, and both qualitative and quantitative results demonstrate that the proposed model significantly outperforms state-of-the-art approaches, especially for large irregular missing regions.
1 Introduction
Image inpainting (i.e., image completion or image hole-filling), which synthesizes visually realistic and semantically plausible contents in missing regions, has attracted great attention in recent years. It can be widely applied in many tasks [Barnes et al., 2009a; Newson et al., 2014; Park et al., 2017; Simakov et al., 2008], such as photo editing, image-based rendering, computational photography, etc. To date, many methods have been proposed for generating desirable contents in different ways, including traditional methods using handcrafted features and deep generative models.

Figure 1: Image inpainting results using EdgeConnect (EC) and our proposed method on a street view image. (a) Input, (b) EC, (c) Ours.
Traditional approaches can be roughly divided into two types: diffusion-based and patch-based. The former propagate background data into missing regions by following a diffusive process, typically modeled using differential operators [Ballester et al., 2000; Esedoglu and Shen, 2002]. Patch-based methods [Kwatra et al., 2005; Barnes et al., 2009b] fill in missing regions with patches from a collection of source images that maximize the patch similarity. These methods work well when completing images with repeating structures. However, they are usually time-consuming and cannot hallucinate semantically plausible contents for challenging cases where the inpainting regions involve complex, non-repetitive structures, e.g., faces, objects, etc.
The significant development of deep neural networks and generative adversarial networks has inspired recent works to formulate inpainting as a conditional image generation problem. Context Encoders [Pathak et al., 2016] first exploited GANs to restore images, using a channel-wise fully connected layer to propagate information between the encoder and the decoder. [Iizuka et al., 2017] utilized dilated convolutions and employed both global and local discriminators to assess images. [Yu et al., 2018b] adopted a coarse-to-fine network with an attention mechanism to gradually refine the generated images. To perceptually enhance image quality, several studies [Yang et al., 2017; Song et al., 2017; Wang et al., 2018b] attempted to extract features using a pre-trained VGG network to reduce the perceptual loss or style loss. More recently, [Liu et al., 2018; Yu et al., 2018a; Nazeri et al., 2019] further concentrated on irregular missing regions and achieved satisfying performance, especially for highly structured images.
Figure 2: The architecture of our proposed coarse-to-fine image inpainting framework. (The diagram shows the two stages connected by skip lines, with a legend distinguishing standard, dilated, missing-region and existing-region convolutions; region-wise convolutions, compositing and the losses act on the existing and missing regions of the mask and image inputs.)

Despite the encouraging progress in image inpainting,
most existing methods still face inconsistency problems, such as distorted structures and blurry textures (see the result of the very recent method EC [Nazeri et al., 2019] in Figure 1). This phenomenon is most likely due to the inappropriate convolution operation over the two types of regions, i.e., existing and missing regions. Intuitively, different feature representations should be extracted to characterize the different types of regions, since there is sufficient content information in existing regions but none in the missing ones, which need to be inferred from existing regions. Therefore, directly applying the same convolution filters to generate semantic contents inevitably leads to visual artifacts such as color discrepancy, blur and obvious edge responses surrounding holes. A changeable mask was proposed in recent works [Liu et al., 2018; Yu et al., 2018a] to handle this difference. However, relying on the same filters for different regions, they still fail to generate favorable results.
In this paper, to generate desirable contents for missing regions, we treat the different types of regions with different convolution filters. Existing regions contain sufficient information and thus can be reconstructed from themselves, while the missing ones, without any information, have to be inferred from the existing regions. Therefore, we develop region-wise convolution operations, i.e., self-reconstruction and restoration from the existing regions, to separately deal with existing and missing regions. The region-wise convolutions help infer the missing semantic contents, but inevitably cause inconsistent appearance because they ignore the correlation between existing and missing regions. We further propose a non-local operation to model the correlation among regions, thus generating more meaningful contents that connect them naturally. Then, we introduce a two-stage coarse-to-fine image inpainting framework with an ℓ1 reconstruction loss, a correlation loss and the popular style loss.
The framework produces natural, semantic contents for missing regions by incorporating region-wise convolutions and the non-local operation at the coarse stage, and further outputs the restored image by eliminating visually unpleasant artifacts at the fine stage. Figure 2 shows the architecture of our whole framework. Extensive experiments on various datasets covering faces (CelebA-HQ [Karras et al., 2017]), street views (Paris StreetView [Doersch et al., 2012]) and natural scenes (Places2 [Zhou et al., 2018]) demonstrate that our proposed method significantly outperforms other state-of-the-art approaches in image inpainting.
2 The Approach
In this section, we elaborate on the details of our coarse-to-fine image inpainting framework with an encoder-decoder architecture. We first introduce the whole framework, which consists of two stages that respectively learn the missing regions at the coarse stage and refine the whole image at the fine stage. Then, we present our region-wise convolutions and the non-local operation. Finally, the whole formulation and optimization strategies are provided.
2.1 The Coarse-to-fine Framework
The state-of-the-art image inpainting solutions often ignore either the difference or the correlation between the existing and missing regions. To simultaneously address both issues, we adopt a two-stage coarse-to-fine framework based on the encoder-decoder architecture. At the coarse stage, the framework first infers the semantic contents from the existing regions using region-wise convolution filters, rather than identical ones. Then, it further enhances the quality of the composited image using the non-local operation, which takes the correlation between different regions into consideration. At the fine stage, the two different regions are considered together using a style loss over the whole image, which perceptually enhances the image quality. With this two-stage progressive generation, the framework makes the restored images more realistic and perceptually consistent.
As shown in Figure 2, the framework takes the incomplete image $\hat{I}_g$ and a binary mask $M$ as input, and attempts to restore the complete image close to the ground truth image $I_g$, where $M$ indicates the missing regions (the mask value is 0 for missing pixels and 1 elsewhere), $\hat{I}_g = I_g \odot M$, and $\odot$ denotes the element-wise (dot) product. To accomplish this goal, networks $E_1$ and $E_2$ serve as encoders in the two stages respectively to extract semantic features from the corresponding input images. A decoder $G$ composed of the proposed region-wise convolutional layers is employed after encoder $E_1$ to restore the semantic contents for different regions, and generates the predicted image $I^{(1)}_p = G(E_1(\hat{I}_g))$ at the coarse stage. After feeding the composited image $I^{(1)}_c = \hat{I}_g + I^{(1)}_p \odot (1-M)$ from the coarse stage to encoder $E_2$, another decoder $D$ at the second stage further synthesizes the refined image $I^{(2)}_p = D(E_2(I^{(1)}_c))$. Based on the encoder-decoder architectures, we finally obtain the visually and semantically realistic inpainting result $I^{(2)}_c = \hat{I}_g + I^{(2)}_p \odot (1-M)$, close to the ground truth image $I_g$.
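To make this data flow concrete, the following sketch shows one possible PyTorch-style forward pass through the two stages, including the mask-based compositing; the module names E1, G, E2 and D are placeholders mirroring the notation above, not the authors' released code.

```python
import torch

def coarse_to_fine_forward(E1, G, E2, D, I_g, M):
    """Sketch of the two-stage forward pass of Sec. 2.1.

    I_g: ground-truth image batch, shape (B, 3, H, W)
    M:   binary mask, 1 = existing pixels, 0 = missing pixels, shape (B, 1, H, W)
    E1, G, E2, D: encoder/decoder modules (placeholders).
    """
    I_hat = I_g * M                                 # incomplete input image I_g ⊙ M
    I_p1 = G(E1(torch.cat([I_hat, M], dim=1)))      # coarse prediction
    I_c1 = I_hat + I_p1 * (1 - M)                   # composite: keep existing pixels
    I_p2 = D(E2(I_c1))                              # refined prediction
    I_c2 = I_hat + I_p2 * (1 - M)                   # final composited result
    return I_p1, I_c1, I_p2, I_c2
```

The composites keep the ground-truth pixels wherever the mask is 1, so only the missing regions are filled by the network outputs.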
2.2 Inferring Region-wise Contents
For image inpainting tasks, the input images are composed of existing regions with valid pixels and missing (masked) regions with invalid pixels to be synthesized. Relying only on the same convolution filters, we can hardly restore the semantic features over the different regions, which in practice usually leads to visual artifacts such as color discrepancy, blur and obvious edge responses surrounding the missing regions. Motivated by this observation, we first propose region-wise convolutions in the decoder network G at the coarse stage, so that the decoder can separately generate the corresponding contents for different regions using different convolution filters.
Specifically, let $W$ and $\hat{W}$ be the weights of the region-wise convolution filters for existing and missing regions respectively, and let $b$ and $\hat{b}$ be the corresponding biases. Let $x$ denote the feature in the current convolution (sliding) window of the whole feature map $X$. Then, the region-wise convolutions at every location can be formulated as follows:

$$
x' =
\begin{cases}
W^{\top} x + b, & x \in X \odot M \\
\hat{W}^{\top} x + \hat{b}, & x \in X \odot (1 - M)
\end{cases}
\quad (1)
$$

This means that for different types of regions, different convolution filters will be learnt for feature representation.
In practice, we accomplish region-wise convolutions by proportionally resizing the mask to match the feature maps downsampled through the convolution layers. In this way, different regions can be easily distinguished according to the resized mask channels, and the information of different regions can be transmitted consistently across layers. The convolution filters for existing regions try to reconstruct those regions themselves, while those for missing regions focus on inferring the semantic contents from the existing parts.
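A minimal sketch of how Eq. (1) and the mask-resizing trick could be realized in PyTorch is given below; the layer is an illustration of the idea (two filter sets selected by the resized mask), not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionWiseConv2d(nn.Module):
    """Illustrative region-wise convolution (Eq. (1)): one filter set for
    existing regions, another for missing regions, selected by the mask."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv_exist = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)  # W, b
        self.conv_miss = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)   # W_hat, b_hat

    def forward(self, x, mask):
        out_exist = self.conv_exist(x)   # filters for existing regions
        out_miss = self.conv_miss(x)     # filters for missing regions
        # Resize the binary mask to the output spatial size (Sec. 2.2);
        # 'nearest' keeps it binary so the two regions stay distinguishable.
        m = F.interpolate(mask, size=out_exist.shape[2:], mode='nearest')
        return out_exist * m + out_miss * (1.0 - m)
```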
2.3 Modelling Non-local Correlation
After the region-wise convolutions, the framework generates a coarse predicted image, where the missing regions are largely recovered with semantically meaningful contents. However, the predicted image is still far from visually realistic. This is mainly because convolution operations are good at processing local neighborhoods but fail to model the correlation between distant positions.
To address this problem and improve the visual quality of the recovered image, a non-local operation is adopted following prior studies [Wang et al., 2018a]. It computes the response at a position as a weighted sum of the features at all positions in the input feature map, and thus can capture long-distance correlation between patches inside an image. Note that the traditional way to accomplish the non-local operation relies on simple matrix multiplication and is usually adopted in the feed-forward process to obtain more information for specific tasks. However, the computation is quite memory-consuming for large feature maps, which is not applicable in our generative model, where the smallest feature map created by G is 128 × 128.
In this paper, we accomplish the non-local operation using the simple outer product between different positions, rather than the non-local block. Formally, given an image $I^{(1)}_c$, $\Psi(I^{(1)}_c)$ denotes the $c \times h \times w$ feature map computed by a feature extraction method $\Psi$. In practice, in order to easily index an output position in the spatial dimension, we reshape the feature map to the size $c \times n$, where $n = h \times w$. Correspondingly, $\Psi_i(I_g)$ is the $i$-th column of the reshaped feature map $\Psi(I_g)$, of length $c$, where $i = 1, \dots, n$. Then, a pairwise function $f_{ij}$ can be defined as a non-local operation, which generates an $n \times n$ gram matrix evaluating the correlation between positions $i$ and $j$:

$$
f_{ij}(I^{(1)}_c) = \Psi_i(I^{(1)}_c)^{\top} \, \Psi_j(I^{(1)}_c). \quad (2)
$$

Once we have the non-local correlation, we can bring it into the inpainting framework by introducing a correlation loss based on the gram matrix.
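As a reference for Eq. (2), the pairwise correlation over all spatial positions can be computed in a single batched matrix product; this sketch assumes `feat` is the VGG feature map Ψ(I) described in Section 2.5.

```python
import torch

def nonlocal_gram(feat):
    """Pairwise correlation of Eq. (2): an n x n gram matrix over spatial
    positions, where feat is a (B, C, H, W) feature map Psi(I)."""
    b, c, h, w = feat.shape
    psi = feat.reshape(b, c, h * w)             # columns Psi_i of length c, n = h*w
    gram = torch.bmm(psi.transpose(1, 2), psi)  # gram[b, i, j] = Psi_i^T Psi_j
    return gram
```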
2.4 The Formulation
To guide the learning of the two-stage encoder-decoder network, we introduce the following loss functions.
Reconstruction Loss
We employ an $\ell_1$ reconstruction loss to keep the predicted images at the two stages, including both the existing regions and the missing ones, consistent with the ground truth at the pixel level:

$$
\mathcal{L}_r = \left\| I^{(1)}_p - I_g \right\|_1 + \left\| I^{(2)}_p - I_g \right\|_1. \quad (3)
$$

The reconstruction loss helps the region-wise convolution filters learn to generate meaningful contents for different regions, especially at the first stage.
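A direct reading of Eq. (3), up to the choice of mean versus sum normalization (which the text does not fix), is:

```python
import torch.nn.functional as F

def reconstruction_loss(I_p1, I_p2, I_g):
    """l1 reconstruction loss of Eq. (3) over both predicted images."""
    return F.l1_loss(I_p1, I_g) + F.l1_loss(I_p2, I_g)
```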
Correlation Loss
The reconstruction loss treats all pixels independently without considering their correlation, while in our observation the relationship among distant local patches plays a critical role in keeping the semantic and visual consistency between the generated missing regions and the existing ones. Therefore, we further introduce a correlation loss that helps determine the expected non-local operation. Namely, for image $I^{(1)}_c$, the correlation loss is defined based on $f_{ij}(\cdot)$:

$$
\mathcal{L}_c = \sigma \sum_{i,j}^{n} \left\| f_{ij}(I^{(1)}_c) - f_{ij}(I_g) \right\|_1, \quad (4)
$$

where $\sigma$ denotes the normalization factor by position. The correlation loss forces the model to generate images whose semantic details are much closer to the realistic image. Here, different from the prior work of PConv, we only consider the non-local correlation for the composited image.
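Building on the gram-matrix sketch after Eq. (2), the correlation loss of Eq. (4) could be computed as below; the exact form of the normalization factor σ is not specified in the text, so 1/(c·n) is an assumption of this illustration.

```python
def correlation_loss(feat_c1, feat_gt):
    """Correlation loss of Eq. (4): l1 distance between the gram matrices of
    the composited image's features and the ground truth's features."""
    b, c, h, w = feat_c1.shape
    n = h * w
    g_c = nonlocal_gram(feat_c1)   # (B, n, n), from the Eq. (2) sketch above
    g_g = nonlocal_gram(feat_gt)
    sigma = 1.0 / (c * n)          # assumed per-position normalization factor
    return sigma * (g_c - g_g).abs().sum(dim=(1, 2)).mean()
```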
Style Loss
Although the non-local correlation loss is capable of capturing long-distance dependencies and enhancing the restoration of details, it still fails to avoid visual artifacts in unstable generative models. Therefore, we append a style loss to produce clean results and further refine the images perceptually as a whole at the second stage. The style loss is widely used in image inpainting and style transfer tasks and is an effective tool to combat "checkerboard" artifacts [Sajjadi et al., 2017]. After projecting image $I^{(2)}_c$ into a higher-level feature space using a pre-trained VGG, we obtain the feature map $\Phi_p(I^{(2)}_c)$ of the $p$-th layer with size $c_p \times h_p \times w_p$, and the style loss is formulated as follows:

$$
\mathcal{L}_s = \sum_p \delta_p \left\| \Phi_p(I^{(2)}_c)^{\top} \Phi_p(I^{(2)}_c) - \left(\Phi_p(I_g)\right)^{\top} \left(\Phi_p(I_g)\right) \right\|_1, \quad (5)
$$

where $\delta_p$ denotes the normalization factor for the $p$-th selected layer by channel. The style loss focuses on the relationship between different channels to transfer the style for the composited image at the second stage.
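A sketch of Eq. (5) using channel-wise gram matrices of the selected VGG layers (pool1-pool3) follows; the per-layer factors δ_p and the reduction are assumptions of this illustration.

```python
import torch

def style_loss(feats_c2, feats_gt, deltas):
    """Style loss of Eq. (5): l1 distance between channel-wise gram matrices
    of VGG features, one term per selected layer p."""
    loss = 0.0
    for phi_c, phi_g, delta in zip(feats_c2, feats_gt, deltas):
        b, c, h, w = phi_c.shape
        fc = phi_c.reshape(b, c, h * w)
        fg = phi_g.reshape(b, c, h * w)
        gram_c = torch.bmm(fc, fc.transpose(1, 2))  # (B, c, c): channel correlations
        gram_g = torch.bmm(fg, fg.transpose(1, 2))
        loss = loss + delta * (gram_c - gram_g).abs().mean()
    return loss
```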
Overall Loss
The overall loss $\mathcal{L}$ combines the reconstruction, correlation and style loss functions:

$$
\mathcal{L} = \mathcal{L}_r + \lambda_1 \mathcal{L}_c + \lambda_2 \mathcal{L}_s. \quad (6)
$$

In our coarse-to-fine framework, the reconstruction loss works in both stages to guarantee the pixel-wise consistency between the predicted images and the ground truth. To capture the relationship among different regions and generate detailed contents at the first stage, the correlation loss is adopted to guide the training of the networks $E_1$ and $G$. Finally, at the second stage, the style loss helps perceptually enhance the image quality by considering the whole image.
2.5 Implementation and Training
In practice, we exploit the widely-adopted pre-trained VGG network to extract features for the calculation of the correlation loss as well as the style loss. For the correlation loss, only feature maps extracted by pool2 are adopted, due to the weak semantic representation capability of pool1 and the blur caused by pool3 and pool4. To calculate the style loss, we use the outputs of pool1, pool2, and pool3 together. In other words, Ψ(·) = Φ_p(·) with p = 2.

We also adopt skip links, which, as [Liu et al., 2018] claimed, may propagate noise in most inpainting architectures. However, we find that skip links do not suffer from this negative effect in our framework thanks to the region-wise convolutions, and they thus enable detailed output for the existing regions.

The entire training procedure follows the standard forward and backward optimization paradigm. In the forward step, given a ground truth image I_g, we first sample an irregular binary mask M and subsequently generate the incomplete image Î_g. The inpainting framework takes the concatenation of Î_g and M as input, and outputs the predicted images I^(1)_p and I^(2)_p in the coarse and fine stages respectively. In the backward step, according to the three types of losses over the predicted and composited images, we simply update the network parameters using backward propagation.
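Putting the pieces together, one training iteration following Section 2.5 might look like the sketch below. Here `model`, `vgg_feats` and `sample_mask` are hypothetical helpers (the two-stage network of Figure 2, a pre-trained-VGG feature extractor, and an irregular-mask sampler), and the loss weights follow the hyper-parameters reported in Section 3.1 as reconstructed there.

```python
def train_step(model, vgg_feats, optimizer, I_g, sample_mask,
               lambda1=1e-5, lambda2=1e-3):
    """One forward/backward iteration of Sec. 2.5 (a sketch).
    Reuses reconstruction_loss / correlation_loss / style_loss from the
    sketches above; all helper names are placeholders."""
    M = sample_mask(I_g.shape)                   # 1 = existing, 0 = missing
    I_p1, I_c1, I_p2, I_c2 = model(I_g * M, M)   # coarse and fine outputs

    L_r = reconstruction_loss(I_p1, I_p2, I_g)                     # Eq. (3)
    L_c = correlation_loss(vgg_feats(I_c1, 'pool2'),               # Eq. (4), pool2 only
                           vgg_feats(I_g, 'pool2'))
    layers = ('pool1', 'pool2', 'pool3')
    L_s = style_loss([vgg_feats(I_c2, p) for p in layers],         # Eq. (5)
                     [vgg_feats(I_g, p) for p in layers],
                     deltas=(1.0, 1.0, 1.0))                       # uniform deltas, assumed
    loss = L_r + lambda1 * L_c + lambda2 * L_s                     # Eq. (6)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```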
Metric           Mask     GLCIC     CA        PConv    EC       Ours
PSNR ↑           0-10%    26.71     36.13     30.41    30.32    42.52
                 10-20%   20.97     22.97     26.93    26.92    29.52
                 20-30%   18.22     20.26     24.80    24.91    26.77
                 30-40%   16.31     18.47     23.14    23.37    24.87
                 40-50%   14.88     17.09     21.71    22.06    23.34
                 50-60%   13.80     16.01     20.41    20.91    22.04
ℓ1 (10^-3) ↓     0-10%    23.55     17.40     18.94    18.82    4.85
                 10-20%   40.32     32.50     24.49    24.08    10.22
                 20-30%   59.26     47.76     30.48    29.62    15.91
                 30-40%   80.33     63.63     37.25    35.74    22.15
                 40-50%   102.67    80.36     45.23    42.67    29.08
                 50-60%   124.63    97.11     54.77    50.44    36.58
ℓ2 (10^-3) ↓     0-10%    3.06      2.20      1.14     1.17     0.46
                 10-20%   9.54      6.90      2.50     2.53     1.55
                 20-30%   17.40     11.92     4.04     4.00     2.77
                 30-40%   26.57     17.34     5.85     5.66     4.19
                 40-50%   36.60     23.25     8.07     7.58     5.85
                 50-60%   46.71     29.34     10.77    9.79     7.78
SSIM ↑           0-10%    0.902     0.965     0.924    0.925    0.982
                 10-20%   0.806     0.888     0.880    0.881    0.942
                 20-30%   0.708     0.811     0.834    0.836    0.901
                 30-40%   0.609     0.730     0.784    0.788    0.856
                 40-50%   0.513     0.647     0.728    0.736    0.807
                 50-60%   0.427     0.566     0.667    0.680    0.755
FID ↓            0-10%    8.21      1.26      1.75     1.38     0.02
                 10-20%   34.48     8.73      2.10     1.80     0.11
                 20-30%   62.74     20.35     2.88     2.69     0.31
                 30-40%   90.94     36.53     4.31     4.36     0.68
                 40-50%   117.23    57.60     6.97     7.38     1.38
                 50-60%   140.53    81.66     12.10    12.52    2.66
Perceptual ↓     0-10%    183.39    81.58     128.64   126.98   36.11
                 10-20%   363.68    220.77    193.84   192.50   109.42
                 20-30%   546.10    348.93    258.47   255.98   178.49
                 30-40%   729.94    471.10    326.36   321.03   247.02
                 40-50%   906.89    587.90    401.07   389.19   316.61
                 50-60%   1062.77   1132.34   485.31   459.95   385.93

Table 1: Quantitative comparisons among different methods on Places2, in terms of different evaluation metrics. ↓ means lower is better, while ↑ means higher is better.
3 Experiments
In this section, we evaluate our proposed method visually and quantitatively on several common image inpainting datasets, in comparison with state-of-the-art methods. More results can be found in the supplementary material¹.
3.1 Datasets and Protocols
We employ the widely-used datasets from prior studies, including CelebA-HQ [Karras et al., 2017], Places2 [Zhou et al., 2018], and Paris StreetView [Doersch et al., 2012]. CelebA-HQ contains 30k high-resolution face images, and we adopt the same partition as [Yu et al., 2018b] did. The Places2 dataset includes 8,097,967 training images with diverse scenes. Paris StreetView contains 14,900 training images and 100 test images. For both of these datasets, we adopt the original train, test, and validation splits.
We compare our method with four state-of-the-art models, namely, Globally and Locally Consistent Image Completion (GLCIC) [Iizuka et al., 2017], Contextual Attention (CA) [Yu et al., 2018b], Partial Convolution (PConv) [Liu et al., 2018] and EdgeConnect (EC) [Nazeri et al., 2019].

¹ https://drive.google.com/file/d/1iO0cZ0fwgVeaRrhTLCuk-rvbCekkMVmv/view?usp=sharing
Figure 3: Qualitative comparisons between different methods on the Places2, Paris StreetView and CelebA-HQ datasets. (a) Input, (b) GLCIC, (c) CA, (d) PConv, (e) EC, (f) Ours, (g) GT.
Figure 4: Object removal results (column (c)) using our model: removing a beard, a watermark and a kid from the original images (column (a)) according to the input masks (column (b)).
Among these models, GLCIC and CA are initially designed for regular missing regions, while PConv, EC and ours focus on irregular holes. Besides, the training of GLCIC and CA heavily relies on local discriminators that assume the availability of local bounding boxes of the holes, which does not hold under our experimental setting. Therefore, we directly apply their released pre-trained models for these two methods in our experiments. For EC, we use their pre-trained models on the Paris StreetView and Places2 datasets, and train the model on CelebA-HQ with the released code. As for PConv, since there is no published code, we borrow the implementation on GitHub² and retrain the model following the authors' advice.
² https://github.com/MathiasGruber/PConv-Keras
For our method, we basically develop the model based on the architecture of CA, discarding its contextual attention module but adding the region-wise convolutions. Input images are resized to 256 × 256, and the proportion of irregular missing regions varies from 0 to 40% during training. We empirically choose the hyper-parameters λ1 = 10^-5 and λ2 = 10^-3, and an initial learning rate of 10^-4. Using the Adam optimizer, we train the model with a batch size of 8 for 20 epochs on CelebA-HQ and Paris StreetView, and with a batch size of 48 on Places2.
3.2 Qualitative Results
Figure 3 shows the inpainting results of the different methods on several examples from Places2, Paris StreetView and CelebA-HQ respectively, where "GT" stands for the ground truth images. All the reported results are the direct outputs of the trained models without any post-processing. Note that images in Places2 contain many semantic contents and thus cannot be shown clearly at a small size, so in the first row of Figure 3 we mark the specific regions with yellow rectangles. From the figure, we can see that GLCIC and CA introduce strong distortions in the inpainted images, while PConv can recover the semantic information of the missing irregular regions in most cases but still shows obvious deviations from the ground truth. EC performs well when the missing regions are small (e.g., 0-30%; see more results in the supplementary material), but also fails to infer the correct edge information for large holes. Among all the methods, our model restores images with more natural contents in the missing regions, which look more consistent with the existing regions and much closer to the ground truth.
Unwanted object removal is one of the most useful applications of image inpainting. Therefore, we also study the performance of our method in this task and show several examples in Figure 4. The inpainted images look natural and harmonious, even when the unwanted objects appear with complex shapes and backgrounds.
Figure 5: The effect of different components in our model: (a) the input incomplete images, (b) results using standard convolutions instead of our region-wise convolutions, (c) results of the model trained without our correlation loss Lc, (d) results of the model trained with Lc and Ls at the same stage, (e) results of the coarse stage, (f) results of our full coarse-to-fine model, and (g) the ground truth images.
3.3 Quantitative Results
Following [Nazeri et al., 2019], we investigate the performance of the different methods using the following quantitative metrics: 1) ℓ1 error, 2) ℓ2 error, 3) peak signal-to-noise ratio (PSNR), and 4) structural similarity index (SSIM). These metrics assume pixel-wise independence and help to compare the visual appearance of different inpainting results, but in practice they may assign favorable scores to perceptually inaccurate results. Recent work [Xu et al., 2018] has shown that metrics based on deep features are closer to human perception. Therefore, we also adopt two further metrics computed on deep features, the Fréchet Inception Distance (FID) [Xu et al., 2018] and the perceptual error [Johnson et al., 2016], to evaluate the performance at the semantic level.
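For reference, the pixel-level metrics above can be computed as follows for images scaled to [0, 1]; FID and the perceptual error additionally require a pre-trained feature extractor and are omitted from this sketch.

```python
import numpy as np

def pixel_metrics(pred, gt):
    """Pixel-wise metrics of Sec. 3.3 for float images in [0, 1]."""
    l1 = np.mean(np.abs(pred - gt))               # reported as x10^-3 in Table 1
    l2 = np.mean((pred - gt) ** 2)                # reported as x10^-3 in Table 1
    psnr = 10.0 * np.log10(1.0 / max(l2, 1e-12))  # peak value 1 for [0, 1] images
    return {'l1': l1, 'l2': l2, 'psnr': psnr}
```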
Table 1 lists the results of all methods on the largest dataset, Places2, in terms of the different metrics and with respect to different mask sizes. First, we observe that as the missing area gradually increases, all methods perform worse on all metrics. Compared to the others, however, our method obtains the best performance in all cases, and its performance decreases much more slowly as the mask size grows. This means that our method works stably and robustly, especially for input images with large missing regions. Besides, in terms of FID and perceptual error, our method achieves a much more significant improvement over state-of-the-art methods such as PConv and EC, which indicates that the proposed framework pursues more semantically meaningful contents for missing regions. Moreover, in terms of PSNR and the ℓ1 and ℓ2 errors, the superior performance over the other methods shows that our method has a strong capability of generating more detailed contents with better visual quality.
3.4 Ablation Study
As aforementioned, our method mainly gains from the region-wise convolutions and the non-local correlation. Thus, we study the effects of the different parts on image inpainting. Figure 5 shows the inpainting results obtained by our full framework, and by the framework using standard convolution filters instead of region-wise ones, removing the correlation loss, using Lc and Ls at the same stage, or only adopting the coarse stage. From the results, we can see that without the region-wise convolutional layers, the framework can hardly infer information consistent with the existing regions. Furthermore, without considering the non-local correlation, the framework restores the missing regions only according to the surrounding areas. Moreover, using Lc and Ls at the same stage causes artifacts and cannot restore semantic contents. Besides, although the coarse stage can restore the semantic information, its outputs still contain strange artifacts. With the help of both region-wise convolutions and non-local correlation, our framework generates images visually and semantically close to the ground truth.
4 Conclusion
We propose a two-stage coarse-to-fine generative image inpainting framework, which integrates region-wise convolutions and a non-local operation to deal with the differences and the correlation between existing and missing regions. The region-wise convolutions reconstruct existing regions while inferring missing regions from existing ones. The non-local operation encourages missing regions to be visually consistent with existing regions, e.g., in color, texture and edges. We show that our proposed method restores meaningful contents for missing regions and connects existing and missing regions naturally, and thus significantly improves inpainting results. Furthermore, we demonstrate that our inpainting framework can edit faces, clear watermarks, and remove unwanted objects in practical applications. Extensive experiments on various datasets covering faces, Paris street views and natural scenes demonstrate that our proposed method significantly outperforms other state-of-the-art approaches in image inpainting.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61690202, 61872021), the Fundamental Research Funds for Central Universities (YWF-19-BJ-J-271), the Beijing Municipal Science and Technology Commission (Z171100000117022), and the State Key Lab of Software Development Environment (SKLSDE-2018ZX-04).
References
[Ballester et al., 2000] Coloma Ballester, Marcelo Bertalmio, Vicent Caselles, Guillermo Sapiro, and Joan Verdera. Filling-in by joint interpolation of vector fields and gray levels. 2000.
[Barnes et al., 2009a] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH), 28(3), August 2009.
[Barnes et al., 2009b] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (ToG), 28(3):24, 2009.
[Doersch et al., 2012] Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei Efros. What makes Paris look like Paris? ACM Transactions on Graphics, 31(4), 2012.
[Esedoglu and Shen, 2002] Selim Esedoglu and Jianhong Shen. Digital inpainting based on the Mumford-Shah-Euler image model. European Journal of Applied Mathematics, 13(4):353-370, 2002.
[Iizuka et al., 2017] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.
[Johnson et al., 2016] Justin Johnson, Alexandre Alahi, and Fei-Fei Li. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.
[Karras et al., 2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[Kwatra et al., 2005] Vivek Kwatra, Irfan Essa, Aaron Bobick, and Nipun Kwatra. Texture optimization for example-based synthesis. In ACM Transactions on Graphics (ToG), volume 24, pages 795-802. ACM, 2005.
[Liu et al., 2018] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. arXiv preprint arXiv:1804.07723, 2018.
[Nazeri et al., 2019] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Qureshi, and Mehran Ebrahimi. EdgeConnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212, 2019.
[Newson et al., 2014] Alasdair Newson, Andrés Almansa, Matthieu Fradet, Yann Gousseau, and Patrick Pérez. Video inpainting of complex scenes. SIAM Journal on Imaging Sciences, 7(4):1993-2019, 2014.
[Park et al., 2017] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C Berg. Transformation-grounded image generation network for novel 3d view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3500-3509, 2017.
[Pathak et al., 2016] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536-2544, 2016.
[Sajjadi et al., 2017] Mehdi S. M. Sajjadi, Bernhard Schölkopf, and Michael Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
[Simakov et al., 2008] Denis Simakov, Yaron Caspi, Eli Shechtman, and Michal Irani. Summarizing visual data using bidirectional similarity. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1-8. IEEE, 2008.
[Song et al., 2017] Yuhang Song, Chao Yang, Zhe L. Lin, Hao Li, Qin Huang, and C.-C. Jay Kuo. Image inpainting using multi-scale feature image translation. CoRR, abs/1711.08590, 2017.
[Wang et al., 2018a] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794-7803, 2018.
[Wang et al., 2018b] Yi Wang, Xin Tao, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Image inpainting via generative multi-column convolutional neural networks, 2018.
[Xu et al., 2018] Qiantong Xu, Gao Huang, Yang Yuan, Chuan Guo, and Kilian Weinberger. An empirical study on evaluation metrics of generative adversarial networks. 2018.
[Yang et al., 2017] Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
[Yu et al., 2018a] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589, 2018.
[Yu et al., 2018b] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. arXiv preprint, 2018.
[Zhou et al., 2018] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452-1464, 2018.