Deep Metric Learning-Based for Multi-Target Few-Shot Pavement Distress Classification

Abstract—Pavement distress detection is of great significance
for road maintenance and to ensure road safety. At present,
detection methods based on deep learning have achieved
outstanding performance in related fields. However, these
methods require large-scale training samples. For pavement
distress detection, it is difficult to collect more images with
pavement distress, and the types of pavement diseases are
increasing with time, so it is impossible to ensure sufficient
pavement distress samples to train the supervised deep model. In
this paper, we propose a new few-shot pavement distress detection
method based on metric learning, which can effectively learn new
categories from a few labeled samples. In our work, we adopt the
backend network (ResNet18) to extract multilevel feature
information from the base classes and then send the extracted
features into the metric module. In the metric module, we
introduce the attention mechanism to learn the feature attributes
of "what" and "where" and focus the model on the desired
characteristics. We also introduce a new metric loss function to
maximize the distance between different categories while
minimizing the distance between the same categories. In the testing
stage, we calculate the cosine similarity between the support set
and query set to complete novel category detection. The
experimental results show that the proposed method significantly
outperforms several benchmarking methods on the pavement
distress dataset, with classification accuracies of 77.20% and 87.28%
for 5-way 1-shot and 5-way 5-shot, respectively.
Index Terms- metric learning; deep learning; pavement distress
detection; few-shot; attention mechanism
I. INTRODUCTION
Pavement distresses, such as cracks, blocks, potholes, and
alligator cracking, are mainly caused by vehicle
overloading, weather changes and road aging. If these
distresses are not treated in time, they reduce road
quality and endanger traffic safety. Rapid and accurate
detection of road surface damage helps maintain
roads in time, prevent traffic accidents and ensure vehicle
safety. In the past, pavement distress detection mainly relied on
engineering vehicles that collected pavement images with cameras
while traveling along roads, after which the images were manually
classified and processed. This approach
is not only time consuming but also highly subjective.
Computer vision technology has made great progress in
image processing and related fields, replacing complex and
tedious manual detection [1]-[4]. In pavement distress
detection research, methods based on computer vision
have also appeared steadily. They typically extract hand-crafted
features such as histogram-of-oriented-gradient (HOG) [5],
local-binary-pattern (LBP) [6] and wavelet [7] features, and then
apply classifiers such as BP neural
networks and support vector machines (SVM) to classify
pavement distresses. Although this kind of method reduces the
reliance on manual inspection to some extent, these hand-crafted
features rely on expert knowledge and lack universality.
Moreover, the performance of these methods is still limited by
the complex structure and diverse shapes of distresses, complex backgrounds
and strong interference from various noise sources (such as oil spots,
gravel, and zebra crossings).
In recent years, with the availability of large-scale datasets
(e.g., ImageNet) and the development of high-performance
computing units, deep learning-based methods have drawn
great attention in various visual tasks. These methods use the
convolutional neural network (CNN) [8] [9] to obtain
multilevel features from the input data to complete the
representational learning of the input image. However, most of
these supervised models need many labeled samples to fit the
deep CNN parameters. In industrial applications, it is difficult
to collect enough labeled images to train a CNN model. Hence,
these supervised CNN-based methods have difficulty learning
object distributions with a few labeled samples and suffer from
overfitting in the training process.
Fig. 1. A brief illustration of few-shot learning for pavement distress
classification task. The aim of this task is to predict the query samples based on
the similarity with the support samples (with label) by few-shot model.
Hongwen Dong, Kechen Song, Member, IEEE, Qi Wang, Yunhui Yan, Peng Jiang
This work is supported by the National Natural Science Foundation of China
(51805078), the National Key Research and Development Program of China
(2017YFB0304200), the Fundamental Research Funds for the Central
Universities (N2003021, N2103011). (Corresponding authors: Kechen Song;
Yunhui Yan)
H. Dong, K. Song, Q. Wang, and Y. Yan are with the School of Mechanical
Engineering and Automation, Northeastern University, Shenyang, Liaoning,
110819, China, and the Key Laboratory of Vibration and Control of Aero-Propulsion
Systems Ministry of Education of China, Northeastern University,
Shenyang, 110819, China. (e-mail: donghongwenliran@163.com,
songkc@me.neu.edu.cn, 1810109@stu.neu.edu.cn, yanyh@mail.neu.edu.cn).
P. Jiang is with the department of Liaoning ATS Intelligent Transportation
Technology Co., Ltd., Shenyang, Liaoning, China. (jiangpeng1986@139.com)
In recent years, few-shot
learning has attracted attention in computer vision tasks,
especially in image classification. The aim of few-shot
learning is to learn novel objects with little supervision, as easily
as humans do. Most few-shot learning methods adopt a metric-learning
scheme. The concept of metric learning is to learn the
similarity of a pair of samples so as to maximize the inter-class
variations and minimize the intra-class variations. For example,
Matching Networks [10] train an end-to-end classifier similar
to a nearest-neighbor classifier; the trained model does not need to
be adjusted and can also classify categories that
did not appear in the training process. Prototypical Networks
[11] use Euclidean distance as the distance measure, pulling
each sample toward the prototype
representation of its own class and pushing it away from the
prototype representations of other classes. Relation
Networks [12] apply baseline CNN modules as the feature
encoder and then discriminate the similarities and
dissimilarities between the support and query samples from their
concatenated features.
In the pavement distress classification task, because most of
the pavement is normal, it is difficult to collect enough images of
distressed pavement. In addition, distresses with the
same label can look very different across scenes, road
conditions and pavement materials. Therefore, supervised CNN
classification is not the best method for pavement distress
detection. To solve the above challenges, we introduce a deep
metric learning-based method for multi-target few-shot
pavement distress classification. The overview of our task is
shown in Fig. 1. Different from the approaches mentioned
above, our few-shot model improves the classification accuracy
in two ways. First, our few-shot model uses the baseline CNN
module as the feature extractor and adopts an attention
mechanism to obtain more robust and discriminative
information from images, which focuses the model on the
distressed region characteristics. In addition, we introduce a
new metric loss function to optimize the network model, which
makes the sample features of the same kind more compact and
enhances the separability of the sample features of different
categories. The framework of our method is shown in Fig. 2.
The main contributions of our work are summarized as follows:
1) A deep metric learning-based method for multi-target few-
shot pavement distress classification is proposed. To the best of
our knowledge, our work is the first attempt to do so in
pavement distress classification task.
2) A novel metric module is proposed. In the module, an
attention mechanism is applied to obtain more discriminative
information from images, and the model focuses on the
distressed region characteristics. Additionally, a metric loss
function is used to optimize the model, which maximizes the
inter-class variations and minimizes the intra-class variations.
3) We carry out few-shot classification experiments on a
pavement distress dataset and achieve competitive performance
with state-of-the-art methods.
The rest of this paper is organized as follows: Section Ⅱ
introduces the related works. Section Ⅲ describes our method
in detail. Section Ⅳ presents the details and results of the
experiments. Finally,
Section Ⅴ concludes the paper.
II. RELATED WORKS
In this section, we briefly review some related works on
pavement distress detection, few-shot learning for classification,
and attention mechanisms.
A. Pavement distress detection
In this section we briefly review traditional pavement distress
detection methods and deep learning-based pavement distress
detection methods. The traditional methods introduced in this
section refer to the non-deep learning-based methods.
1) Traditional pavement distress detection methods: In early
studies [13] [14], different threshold methods were used to
highlight crack regions from the background. However, these
thresholds were set subjectively, so the selected thresholds
could not adapt to the changes or differences in image color
information caused by different acquisition conditions.
Furthermore, edge-based algorithms [15] [16] were adopted for
crack edge detection. However, these methods were limited by
the low contrast and noise of the images. Currently, most of the
approaches adopt manually designed features such as Gabor
filters, wavelet transform, local binary pattern, and histogram
of oriented gradient for pavement crack detection. However,
these manually designed features are not suitable for complex
cracks and lack universality.
In recent years, many researchers have applied machine
learning for pavement distress detection. In [17], a new
algorithm that relies on a minimal path with images was
proposed for pavement crack detection. In [18], a supervised
learning method based on AdaBoost was used for road surface
detection. In [19], two simple local statistics, the mean and the
standard deviation, were adopted to classify whether image
blocks contain crack pixels. In [20], a novel framework based
on random structured forests was proposed for road crack
detection. Although these methods have some advantages
compared with traditional methods, the detection effect of these
methods depends heavily on artificially designed features, and
the generalization performance is not strong.
2) Deep learning-based pavement distress detection methods:
Deep learning-based methods benefit from powerful feature
representation and have achieved outstanding results in
computer vision-related fields. In the pavement distress
detection task, Zhang et al. [21] applied a deep CNN framework
for pavement distress image classification. In [22], a
comparative analysis of pavement distress classification based
on deep learning frameworks was introduced. In [23], a DCNN
was applied to classify pavement cracks in 3D images into
5 different categories. The works in [24] [25] [26]
used deep learning-based methods to locate crack regions.
Dong et al. [27] fused multi-level features into different stages
and added the global context into the network for surface defect
segmentation. Yang et al. [28] fused multi-level features from
top-to-down for pavement crack segmentation. Zhang et al. [29]
adopted a three-stream boundary-aware network for fine-
grained pavement disease segmentation. Although these
methods achieved outstanding performance in pavement
disease detection, most of them only detect one kind of
pavement disease (e.g., crack) and lack universality; moreover, these
methods are not effective for novel categories with only a few labeled
samples.
B. Few-shot learning for classification
In this section, we briefly review two categories of existing
few-shot classification methods.
1) Meta-learning: Meta-learning, sometimes called learning
to learn, focuses more on tasks than on data. MAML [30] is a
model-agnostic meta-learning algorithm
that trains a model on learning tasks so that it can handle
a new learning task with a few training samples. Ravi et al. [31]
proposed an LSTM-based optimization algorithm for
learning one learner neural network classifier, which is used to
train another in the case of a few samples. Li et al. [32]
proposed Meta-SGD, which, similar to SGD, can be trained
easily while initializing and adapting learners in only one step.
However, in these methods, the model structure is fixed, and
the image input size of the model is also fixed, so the
generalization is not good. Additionally, updating the model weights
requires computing second-order gradients, which increases the
instability of the model.
2) Metric Learning: Few-shot classification
algorithms based on metric learning mainly use an encoder to
extract the features from input samples (labeled and unlabeled),
and then use a metric function to calculate the similarity of the
features of unlabeled and labeled samples to output the category
prediction of unlabeled samples. Matching Networks [10] train
an end-to-end classifier similar to the nearest neighbor, and the
trained model does not need to be adjusted. Prototypical
Networks [11] adopt Euclidean distance as metric function,
which can maximize the inter-class variations and minimize the
intra-class variations. Relation Networks [12] apply baseline
CNN modules as the features encoder, and then discriminate the
similarities and dissimilarities between the support and query
samples by concatenated features. In [33], a graph convolution
network is used as the metric function. However, these methods
are still fixed for few-shot learning tasks.
C. Attention mechanism
The attention mechanism is a special signal processing
mechanism in human vision that suppresses useless
information and focuses on objects of interest. In recent years,
attention mechanisms have been widely used in various deep
learning fields, such as image classification, object recognition
and semantic segmentation. For example, [34] introduced a
recurrent attention model that learns to direct high resolution
attention to the most discriminative regions without any spatial
supervision for fine-grained classification. In [35], an attention-
based global contextualized subnetwork was recurrently
adopted to generate the attentive location map for the input
image to highlight useful global contextual locations to provide
better object detection. Li et al. [36] proposed a pyramid
attention network, which implements spatial pyramid attention
on high-level features to exploit the impact of global contextual
information in semantic segmentation.
Inspired by the above methods, we introduce an attention
mechanism into our method to extract more robust features. Our
attention mechanism includes two components: channel
attention and spatial attention. The former weights the
channel features according to their importance and focuses on the
channels with large weights, so that the model learns
“what” the key features are. The latter adopts a non-local block to
obtain spatial attention and learns “where” the key features are.
Fig. 2. Flow chart of our method. An encoder module (fe) extracts base features from the input images. A metric module (gm) adopts an attention
mechanism to obtain more discriminative information from the input features and learns a metric function that maximizes the distance between different categories
while minimizing the distance within the same category. During testing, the features of the support set (with labels) and the query set (without labels) are extracted
by (fe + gm), their cosine similarities are compared, and the predictions for the query set are output.
III. METHOD
A. Task Setting
Specifically, few-shot learning for classification task usually
involves three datasets [37]: a base class set Dbase, a support set
Dsupport and a query set Dquery. The goal of this task is to classify
each unlabeled query sample in Dquery correctly according to
Dsupport. However, because there are only a few labeled samples
for each class in Dsupport, a classification model cannot be trained
effectively. Therefore, we usually introduce Dbase to train a
model and learn transferable knowledge to help solve this
problem.
$D_{base} = \{(x_i, y_i)\}_{i=1}^{N}$ is used for training the classification
model, where $y_i$ is the label corresponding to sample $x_i$,
and $N$ is the number of training samples.
$D_{novel} = D_{support} \cup D_{query}$ is a novel class set, where
$D_{support} = \{(x_i^s, y_i^s)\}_{i=1}^{M}$ is a support set with $M$ labeled
samples, $y_i^s$ is the label corresponding to sample $x_i^s$, and
$D_{query} = \{x_i^q\}_{i=1}^{t}$ is the set without labels.
With $D_{novel} \cap D_{base} = \emptyset$, the goal of this task is to classify each
unlabeled query sample in $D_{query}$ given $D_{support}$.
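The task setting above can be sketched as an episode sampler. This is a minimal illustration, not the authors' code: the function name `sample_episode`, the dictionary layout of the novel set, and the choice of 15 query samples per class (which gives the 75 queries per 5-way episode reported in the tables) are assumptions.

```python
import random

def sample_episode(novel_set, n_way=5, k_shot=1, n_query=15, seed=None):
    """Draw one N-way K-shot episode from a novel-class pool.

    novel_set: dict mapping class label -> list of samples (hypothetical layout).
    Returns (support, query), each a list of (sample, label) pairs.
    In a real test episode the query labels are hidden from the model and
    used only for scoring.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(novel_set), n_way)        # pick the N classes
    support, query = [], []
    for label in classes:
        picked = rng.sample(novel_set[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]  # K labeled samples
        query += [(x, label) for x in picked[k_shot:]]    # disjoint query samples
    return support, query
```

Sampling the support and query samples from one draw without replacement keeps the two sets disjoint within an episode.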
B. Encoder Module
The robust features extracted from the input image have a
great impact on the final classification accuracy. In our method,
we build the encoder module $f_e(x; \varphi)$ on the pre-trained
ResNet-18 network to extract multi-level features, from
raw to semantic. The encoder module contains four residual
blocks and a global average pooling layer. The details of the
encoder module are shown in Table I. Each residual block is
composed of a convolutional layer, batch normalization, a non-linear
activation function, and a pooling layer. Given a batch of
images $X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{C \times W \times H}$ with class labels
$C = \{c_1, c_2, \ldots, c_n\}$, the output of $f_e(x; \varphi)$ is:
$V = f_e(x_i; \varphi) = \mathrm{down\_scale}(\delta(\mathrm{BN}(\mathrm{conv}(x_i))))$ (1)
where BN denotes batch normalization, $\delta$ is the non-linear
activation function (ReLU), conv is the convolution operation
with a 3×3 kernel size, $\varphi$ denotes the trainable weights, and
down_scale denotes the max pooling operation.
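As a quick check of the spatial sizes listed in Table I, the following sketch traces a 224×224 input through the encoder. It assumes that the stride-1 convolution preserves the spatial size (for example with padding 1) and that every 2×2 max pooling halves it; the function name is hypothetical.

```python
def encoder_output_sizes(input_size=224, num_blocks=4):
    """Spatial size after each encoder stage, under the stated assumptions."""
    sizes = [input_size]      # 3x3 conv, stride 1: size assumed unchanged
    size = input_size // 2    # 2x2 max pool, stride 2
    sizes.append(size)
    for _ in range(num_blocks):
        size //= 2            # each residual block ends with a 2x2 max pool
        sizes.append(size)
    return sizes
```

For a 224×224 input this reproduces the 224, 112, 56, 28, 14, 7 progression of Table I.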
C. Metric Module
Generally, the aim of metric learning is to maximize the
inter-class variations and minimize the intra-class variations by
a metric function. To further improve the aggregation and
separation of features, we introduce the attention mechanism
into the metric module to capture the key information, followed
by two loss functions.
Channel attention learns “what” the key features are: it
estimates the importance of each channel to the key
information and emphasizes the channels with large weights
to improve the feature
representation of discriminative semantics (as shown in Fig. 3).
Let $V = [v_1, v_2, \ldots, v_{C_2}] \in \mathbb{R}^{C_2 \times W \times H}$ represent the encoder module
output. First, we adopt a global average pooling operation to
fuse the features of V over the dimension W×H to produce a channel-wise
descriptor $U = [u_1, u_2, \ldots, u_{C_2}] \in \mathbb{R}^{C_2}$:
$u_i = \frac{1}{W \times H} \sum_{w=1}^{W} \sum_{h=1}^{H} v_i(w, h)$ (2)
Second, we adopt two 1×1 convolution layers to weight U to
capture channel-wise dependencies. We use the sigmoid on the
final feature maps:
$M = \sigma(W_2\, \delta(W_1 U)) = [m_1, m_2, \ldots, m_{C_2}] \in \mathbb{R}^{C_2}$ (3)
where $\sigma$ and $\delta$ denote the sigmoid and ReLU functions,
respectively. $W_1$ and $W_2$ are 1×1 convolution operations.
Third, we use M to reweight the channels of the original
feature map V to obtain the new feature distribution:
$E = [m_1 v_1, m_2 v_2, \ldots, m_{C_2} v_{C_2}] \in \mathbb{R}^{C_2 \times W \times H}$ (4)
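The three channel-attention steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' implementation: feature maps are nested lists, the two 1×1 convolutions are written as bias-free matrix products (equivalent on a 1×1 spatial map), and the name `channel_attention` is hypothetical.

```python
import math

def channel_attention(V, W1, W2):
    """V: list of C channel maps (each a list of rows).
    W1: hidden x C matrix, W2: C x hidden matrix (bias-free, assumed)."""
    # Eq. (2): global average pooling gives one descriptor per channel.
    U = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in V]
    # Eq. (3): M = sigmoid(W2 * relu(W1 * U)).
    hidden = [max(0.0, sum(w * u for w, u in zip(row, U))) for row in W1]
    M = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in W2]
    # Eq. (4): rescale every channel map by its weight m_i.
    E = [[[m * v for v in row] for row in ch] for m, ch in zip(M, V)]
    return U, M, E
```

Because the sigmoid keeps every weight in (0, 1), the reweighting can only attenuate channels relative to one another; it never changes the spatial layout of a channel.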
Spatial attention learns “where” the key features are, that is, it
focuses on the spatial location information of key features. A
convolution operation, whatever its kernel size, can only
capture the information of one local neighborhood at a time.
TABLE I
DETAILS OF ENCODER MODULE
Stage | Type | Output
 | 3×3 conv, stride = 1 | 224×224
 | 2×2 max pool, stride = 2 | 112×112
R1 | [conv 3×3 + BN + ReLU, max pool 2×2] | 56×56
R2 | [conv 3×3 + BN + ReLU, max pool 2×2] | 28×28
R3 | [conv 3×3 + BN + ReLU, max pool 2×2] | 14×14
R4 | [conv 3×3 + BN + ReLU, max pool 2×2] | 7×7
Fig. 3. The overview architecture of the attention mechanism
To obtain better spatial information, we consider all the feature
positions. Inspired by non-local neural networks [38], we add
non-local block into the metric module to obtain spatial
attention, and the details are shown in Fig. 3. The non-local
operation can be defined as:
$y_i = \frac{1}{C(V)} \sum_{\forall j} f(V_i, V_j)\, g(V_j)$ (5)
where V is the input feature calculated by Eq. 1, y denotes the
output of the non-local operation, i is the output position index,
and j is the index of all possible locations in the V feature. The
bivariate function $f(V_i, V_j)$ calculates the weight between
positions i and j in feature V and outputs a
scalar. The unary function g calculates the representation
of V at position j. C(V) represents the normalization factor.
We use the Gaussian function as the bivariate function $f(V_i, V_j)$
to calculate the weight between positions i and j in feature V,
which is defined as:
$f(V_i, V_j) = e^{V_i^{T} V_j}$ (6)
where $V_i^{T} V_j$ denotes the dot product. We use a linear
embedding $g(V_j) = W_g V_j$ with the trainable weights $W_g$. The
output of spatial attention S is calculated as:
$S = W_s y_i + V_i = W_s\, \mathrm{softmax}(f)\, g(V_j) + V_i$ (7)
where softmax is taken over all positions j, and $W_s$ is the trainable weight.
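The non-local operation can be sketched on a flattened feature map, where each spatial position is one vector. This is a minimal sketch under assumptions: positions are plain Python lists, the weights `Wg` and `Ws` are square matrices, the Gaussian weights are normalized with a softmax over positions, and the function names are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def non_local_attention(V, Wg, Ws):
    """V: list of N position vectors of length d. Returns N output vectors."""
    g = [matvec(Wg, vj) for vj in V]                # g(V_j) = W_g V_j
    out = []
    for vi in V:
        scores = [sum(a * b for a, b in zip(vi, vj)) for vj in V]  # V_i^T V_j
        top = max(scores)                           # shift for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)                         # normalization factor C(V)
        y = [sum(w * gj[k] for w, gj in zip(weights, g)) / norm
             for k in range(len(g[0]))]
        out.append([a + b for a, b in zip(matvec(Ws, y), vi)])    # residual sum
    return out
```

Subtracting the maximum score before exponentiation leaves the softmax unchanged and only avoids overflow. With `Ws` set to zero, the block reduces to the identity because of the residual connection.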
The final output feature map of the metric module is the
fusion of channel attention and spatial attention, followed by a
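convolutional layer and a nonlinear activation function:
$F = \delta(\mathrm{conv}(S \oplus E; \omega))$ (8)
where $\delta$ denotes the ReLU activation function, $\oplus$ denotes the fusion of the two attention outputs, and $\omega$ represents
the trainable weights.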
After the attention features are extracted from the encoder output,
an effective metric function is needed to improve the
discriminative ability of the model and generalize it to the novel classes
$D_{novel}$. In this module, we introduce center loss [39] to minimize
the intra-class variations. Center loss learns the feature center
of each class and penalizes the distance between the features
and the center of the corresponding class. The formulation for
center loss is as follows:
$L_c = \frac{1}{2} \sum_{i=1}^{B} \left\| x_i - z_{y_i} \right\|_2^2$ (9)
where $z_{y_i}$ denotes the center of class $y_i$ for the deep features $x_i$
extracted by the metric module, and B represents a mini-batch.
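A minimal sketch of the center loss value for one mini-batch is given below. It only evaluates the loss; how the class centers are updated during training is not covered here. The function name and the dictionary of centers are assumptions made for illustration.

```python
def center_loss(features, labels, centers):
    """Half the sum of squared L2 distances between each feature and the
    center of its own class.

    features: list of feature vectors, labels: list of class ids,
    centers: dict mapping class id -> center vector (hypothetical layout).
    """
    total = 0.0
    for x, y in zip(features, labels):
        z = centers[y]                                  # center of the sample's class
        total += sum((a - b) ** 2 for a, b in zip(x, z))
    return 0.5 * total
```

The loss is zero only when every feature coincides with its class center, which is why it pulls same-class features together but does nothing by itself to separate different classes.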
Intuitively, for few-shot classification tasks, the center loss
function can minimize the spatial distance between the same
categories. However, the differences between some categories
are very small, and how to keep the features of different classes
separable is important. In this paper, we let the module learn a
discriminant function that can maximize the inter-class
variations. The discriminant function can be formulated as:
$L_d = -\log \frac{\exp(-ED(o_k, z_k))}{\sum_{k' \in N} \exp(-ED(o_{k'}, z_{k'}))}$ (10)
where ED(·,·) denotes the standardized Euclidean distance and $o_k$ is the
average feature of the k-th class in every mini-batch, which is defined as:
$o_k = \frac{1}{B_k} \sum_{x_i \in D_{train}^{k}} F(x_i)$ (11)
where $D_{train}$ is the base class dataset and B represents a mini-batch.
The final metric function is defined as:
$L_{metric} = L_c + L_d$ (12)
D. Loss Function
In this paper, we adopt joint supervision to optimize the
model. First, we put the vector
$F = [f_1, f_2, \ldots, f_i]$ defined in Eq.
(8) into a classifier; in the classification task based on
convolutional neural network, the fully connected layer with
softmax is usually used as the classifier, and outputs the
probability $P = p(c_i \mid y_i)$ of the $c_i$ category:
$P = p(c_i) = \frac{e^{f_{c_i}}}{\sum_{j=1}^{n} e^{f_{c_j}}}$ (13)
Next, we compute the loss of input samples $x_i$ belonging to the
target category $c_i$:
$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ q_i \log p_i + (1 - q_i) \log (1 - p_i) \right]$ (14)
where N is the number of samples in a mini-batch, and $q_i$ and $p_i$ represent the
ground truth and predicted label probabilities, respectively. The
final loss $L_{final}$ is defined as:
$L_{final} = L_{CE} + \alpha L_{metric}$ (15)
where $\alpha$ is the balance parameter for the trade-off between
distribution and generalization. A smaller parameter value
indicates that the model tends to extract more robust and
generalized features. A larger parameter value indicates that the
model focuses on learning the spatial distribution of the features.
In the experiments, we analyze the influence of the parameter
$\alpha \in [0, 1]$.
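The joint supervision can be sketched numerically as follows. This is an illustration under assumptions: the softmax is written in its usual numerically stable form, the cross-entropy follows the two-term form with ground truth q and prediction p, probabilities are clamped by a small `eps` to avoid log(0), and the balance parameter is assumed to multiply the metric term. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Class probabilities from classifier outputs (stable form)."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(q, p, eps=1e-12):
    """Mean two-term cross-entropy between ground truth q and prediction p."""
    loss = 0.0
    for qi, pi in zip(q, p):
        pi = min(max(pi, eps), 1.0 - eps)   # clamp to keep the logs finite
        loss += qi * math.log(pi) + (1.0 - qi) * math.log(1.0 - pi)
    return -loss / len(q)

def final_loss(l_ce, l_metric, alpha):
    """Joint loss; alpha is assumed to weight the metric term."""
    return l_ce + alpha * l_metric
```

With `alpha = 0` the model is trained with the classification loss alone, so sweeping `alpha` over [0, 1] shows how much the metric term contributes.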
E. Classifier fine-tuning
Classifier fine-tuning is the test phase in few-shot learning
for pavement distress classification. For the supervised CNN-based
classification task, the CNN is trained and
optimized repeatedly on the training dataset to obtain an
optimal model encapsulating the classification weights, which is then
used to compute classification scores on the test set. However, these
encapsulated classification weights do not fit new classes
(with a few labeled samples) which are not included in the training
set. In this work, the cosine classifier [9] is used as a similarity
classifier for few-shot tasks, which can be defined as:
$\mathrm{CosineSimilarity}(x_s, x_q) = \frac{x_s \cdot x_q}{\|x_s\|_2 \, \|x_q\|_2}$ (16)
where $\cdot$ denotes the dot product and $\|\cdot\|_2$ represents the L2 norm.
$x_s$ and $x_q$ denote the support and query feature vectors
extracted by the above metric module. By calculating the
similarity of the two feature vectors, the classifier outputs the
predicted labels of the query samples.
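A minimal sketch of the cosine classifier at test time is shown below. The source does not say how several support samples of one class are combined in the 5-shot case; averaging them into one class vector is an assumption made here, and the function names are hypothetical. Feature vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_query(query, support):
    """support: list of (feature, label) pairs; returns the predicted label."""
    sums, counts = {}, {}
    for feat, label in support:
        acc = sums.setdefault(label, [0.0] * len(feat))
        for k, v in enumerate(feat):
            acc[k] += v
        counts[label] = counts.get(label, 0) + 1
    # Assumed: one averaged support vector per class.
    means = {c: [v / counts[c] for v in s] for c, s in sums.items()}
    return max(means, key=lambda c: cosine_similarity(query, means[c]))
```

Because cosine similarity ignores vector length, the prediction depends only on the direction of the features, not on their magnitude.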
IV. EXPERIMENTS
A. Implementation details
1) Parameter Setting: Our method employs a basic encoder
module together with a metric module for multi-target few-shot
pavement distress classification. For the basic encoder module,
ResNet-18 is employed as the backbone network. During
training, the initial learning rate is 0.001 and is halved every 10 epochs.
The weights of the newly added
convolutional layers are initialized with the “Xavier” scheme. We train the
model for a total of 100 epochs.
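The learning-rate schedule described above can be written as a simple step decay. This sketch assumes epochs are counted from 0 and that the rate is halved exactly at epochs 10, 20, and so on; the source does not state which scheduler was used, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, step=10, gamma=0.5):
    """Step decay: multiply the base rate by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)
```

In PyTorch, `torch.optim.lr_scheduler.StepLR` with `step_size=10` and `gamma=0.5` produces the same kind of schedule. Under these assumptions, the rate in the last of the 100 epochs is 0.001 × 0.5⁹.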
2) Computation Platform: The experiments are implemented
using the PyTorch framework on an NVIDIA GTX TITAN GPU
running Ubuntu 16.04 Linux. The code is available at
https://github.com/DHW-Master/FS_PDD.git.
3) Evaluation: The classification accuracy is adopted to
evaluate the experimental results, which is defined as:
$\mathrm{accuracy} = \frac{1}{T_s} \sum_{i=1}^{T_s} \frac{r^{(i)}}{Q^{(i)}}$ (17)
where $r^{(i)}$ and $Q^{(i)}$ denote the number of correctly classified
query samples and the total number of query samples in the i-th test episode,
respectively. $T_s$ denotes the number of test episodes.
B. Results
1) Classification on Pavement Distress Dataset: We collect
the pavement distress images from [40]; the dataset consists of 10 different
classes, and each image has 640×640 resolution. In this work,
we reorganized these distresses so that each class contains
approximately 300 samples with 224×224 resolution. Some
samples in this dataset are shown in Fig. 4. We can observe
that the conditions of the samples in this dataset are complex
and changeable, with uneven brightness, low contrast, and the
presence of oil stains and zebra crossings, which makes
detection more challenging. In the experiments, we divide the
dataset into two class groups, as listed in Table Ⅲ. We take one group as
the base classes to train the model, and the other as a novel set to
evaluate the few-shot task; Data set1 and Data set2 swap the roles of
the two groups. The numeric results presented in
Table Ⅱ show that our method outperforms the compared methods,
achieving a mean classification accuracy of 77.20% on 5-way 1-shot
and 87.28% on 5-way 5-shot.
TABLE Ⅱ
FIVE-WAY FEW-SHOT CLASSIFICATION ACCURACY ON THE PAVEMENT DISTRESS DATA SET (AVERAGE OF 50 TEST EPISODES AND EACH EPISODE CONTAINS 75 QUERY SAMPLES WITH 95% CONFIDENCE INTERVALS)
5-way Accuracy (%)
Methods | Backbone | 1-shot Data set1 | 1-shot Data set2 | 1-shot Mean | 5-shot Data set1 | 5-shot Data set2 | 5-shot Mean
Prototypical Net [11] | 64-64-64-64 | 62.23 ± 0.98 | 46.95 ± 1.02 | 54.59 ± 1.01 | 75.70 ± 0.86 | 64.72 ± 0.96 | 70.21 ± 0.93
Matching Net [10] | 64-64-64-64 | 60.83 ± 0.99 | 57.12 ± 1.00 | 58.97 ± 0.99 | 68.43 ± 0.94 | 74.30 ± 0.88 | 71.36 ± 0.91
Relation Net [12] | 64-96-128-256 | 64.26 ± 0.97 | 54.13 ± 1.01 | 59.19 ± 0.99 | 69.07 ± 0.93 | 66.54 ± 0.95 | 67.80 ± 0.94
MAML [30] | 32-32-32-32 | 59.00 ± 0.99 | 57.86 ± 1.00 | 58.43 ± 0.99 | 73.70 ± 0.89 | 73.90 ± 0.89 | 73.80 ± 0.89
Ours | ResNet-18 | 75.00 ± 0.88 | 79.40 ± 0.82 | 77.20 ± 0.85 | 86.53 ± 0.69 | 88.03 ± 0.66 | 87.28 ± 0.67
Fig. 4. Example samples of pavement distress dataset, (a) Alligator, (b) Block,
(c) Lane-longitudinal, (d) Longitudinal, (e) Pothole, (f) Reflective, (g) Sealed-longitudinal,
(h) Sealed-reflective, (i) Transvers, (j) Sealed-alligator.
TABLE Ⅲ
THE DETAILS OF PAVEMENT DISTRESS DATASET
Dataset | Basic training classes | Novel classes
Data set1 | Alligator, Transvers, Lane-longitudinal, Longitudinal, Sealed-reflective | Reflective, Sealed-longitudinal, Block, Pothole, Sealed-alligator
Data set2 | Reflective, Sealed-longitudinal, Block, Pothole, Sealed-alligator | Alligator, Transvers, Lane-longitudinal, Longitudinal, Sealed-reflective
TABLE Ⅳ
FIVE-WAY FEW-SHOT CLASSIFICATION ACCURACY ON THE MVTec DATA SET (AVERAGE OF 50 TEST EPISODES AND EACH EPISODE CONTAINS 75 QUERY SAMPLES WITH 95% CONFIDENCE INTERVALS)
MVTec 5-way Accuracy (%)
Method | Backbone | 1-shot | 5-shot
Prototypical Net [11] | 64-64-64-64 | 92.75 ± 0.52 | 94.85 ± 0.44
Matching Net [10] | 64-64-64-64 | 89.28 ± 0.63 | 92.54 ± 0.53
Relation Net [12] | 64-96-128-256 | 92.57 ± 0.53 | 93.59 ± 0.49
MAML [30] | 32-32-32-32 | 70.96 ± 0.92 | 89.77 ± 0.61
Ours | ResNet-18 | 95.33 ± 0.42 | 99.60 ± 0.13
TABLE Ⅴ
FIVE-WAY FEW-SHOT CLASSIFICATION ACCURACY ON THE miniImageNet DATA SET (AVERAGE OF 50 TEST EPISODES AND EACH EPISODE CONTAINS 75 QUERY SAMPLES WITH 95% CONFIDENCE INTERVALS)
miniImageNet 5-way Accuracy (%)
Method | Backbone | 1-shot | 5-shot
Prototypical Net [11] | 64-64-64-64 | 49.42 ± 0.78 | 68.20 ± 0.66
Matching Net [10] | 64-64-64-64 | 43.56 ± 0.84 | 55.31 ± 0.73
Relation Net [12] | 64-96-128-256 | 50.44 ± 0.82 | 65.32 ± 0.70
MAML [30] | 32-32-32-32 | 48.70 ± 1.84 | 63.11 ± 0.92
Shot-Free [41] | ResNet-12 | 59.04 (n/a) | 77.64 (n/a)
MetaOptNet [42] | ResNet-12 | 62.64 ± 0.61 | 78.63 ± 0.46
CTM [43] | ResNet-18 | 64.12 ± 0.82 | 80.51 ± 0.13
RFS [44] | ResNet-12 | 64.82 ± 0.60 | 82.14 ± 0.43
DeepEMD [45] | ResNet-12 | 65.91 ± 0.82 | 82.41 ± 0.56
Ours | ResNet-18 | 70.40 ± 0.93 | 84.40 ± 0.73
2) Classification on MVTec Dataset: The MVTec dataset
contains 1709 high-resolution images of 15 different classes.
Each class contains defect-free images and different types of
anomalous images. In the experiments, we validate our method
on the anomalous images in this dataset. We reorganize the MVTec
dataset and expand it by mirroring, flipping, and rotation.
The reorganized MVTec dataset consists of 66 classes,
and each class contains approximately 130 samples; 40
classes are randomly selected as the base classes, and the rest
serve as novel classes to verify the few-shot task. The numeric results
are presented in Table Ⅳ, from which we can observe that
our method outperforms the compared methods, achieving 95.33%
classification accuracy on 5-way 1-shot and 99.60%
classification accuracy on 5-way 5-shot.
3) Classification on miniImageNet Dataset: The
miniImageNet dataset is a standard benchmark for recent few-shot
learning methods. It consists of 100 classes
randomly sampled from ImageNet, and each class contains
600 samples with 84×84 resolution. It is split into 64 base classes, 16
validation classes and 20 novel classes. The numeric results
presented in Table Ⅴ show that our method outperforms the compared
methods, achieving 70.40% classification accuracy on 5-way 1-shot
and 84.40% classification accuracy on 5-way 5-shot.
C. Ablation Studies and Discussion
We conduct ablation studies and discussions to analyze how
each component affects the performance of the proposed
method. We mainly consider four ablation components:
backbone networks, loss function, attention mechanism module,
and balance hyperparameter.
1) Ablation study of different backbone networks: In the
experiments, we use different backbone networks as the
encoder module to verify their influence on the performance
of our method. We run all the experiments on the pavement
distress dataset. The classification accuracy is listed in
Table Ⅵ, from which we can observe that performance improves
as the backbone deepens from the shallow convolutional
backbones to ResNet-18, and that ResNet-18 yields the best
accuracy among the tested backbones. Our analysis shows that
most of the features extracted by a shallow network are
low-level features, which cannot effectively represent
object category information. Higher feature dimensions
capture the high-level semantic features that are crucial to
object category information. However, as network depth
increases further (ResNet-50 and ResNet-101), the network
becomes more complex and performance degrades, because the
parameters over-fit the training set and do not generalize
effectively to new categories. Using ResNet-18 as the encoder,
our method achieves 77.20% and 87.28% mean accuracy for 5-way
1-shot and 5-way 5-shot on the pavement distress dataset,
respectively.
2) Ablation study of loss function: We conduct ablation
studies to verify the contribution of each component of the loss
function. As mentioned in the paper, the purpose of our method
is to learn a metric space from a small number of samples
that minimizes the intra-class distance and maximizes the
inter-class distance to improve the classification accuracy.
To illustrate this point, we visualize the spatial distribution
of features extracted by our model on the pavement distress
dataset, as shown in Fig. 5. The images in the first row show the
TABLE VI
CLASSIFICATION ACCURACY WITH DIFFERENT BACKBONE NETWORKS AND LOSS FUNCTION ON PAVEMENT DISTRESS DATA SET
(5-WAY ACCURACY, %)
                                 1-shot                                        5-shot
Ablation  Method                 Data set1     Data set2     Mean          Data set1     Data set2     Mean
Backbone  32-32-32-32            54.07 ± 1.01  56.08 ± 1.00  55.08 ± 1.01  62.56 ± 0.98  68.93 ± 0.93  65.75 ± 0.96
Backbone  64-64-64-64            54.10 ± 1.01  58.13 ± 0.99  56.12 ± 1.00  63.40 ± 0.97  70.27 ± 0.92  66.84 ± 0.95
Backbone  64-96-128-256          55.33 ± 1.00  58.67 ± 0.99  57.00 ± 1.00  65.87 ± 0.96  71.20 ± 0.91  68.54 ± 0.94
Backbone  ResNet-18              75.00 ± 0.88  79.40 ± 0.82  77.20 ± 0.85  86.53 ± 0.69  88.03 ± 0.66  87.28 ± 0.67
Backbone  ResNet-50              65.89 ± 0.95  61.20 ± 0.98  63.54 ± 0.97  72.83 ± 0.90  72.16 ± 0.91  72.49 ± 0.91
Backbone  ResNet-101             62.96 ± 0.98  62.64 ± 0.99  62.80 ± 0.98  71.65 ± 0.91  71.31 ± 0.92  71.48 ± 0.92
Loss      ResNet-18+LCE          65.12 ± 0.96  78.07 ± 0.83  71.60 ± 0.91  79.96 ± 0.81  86.14 ± 0.70  83.05 ± 0.76
Loss      ResNet-18+Att.+LCE     67.12 ± 0.95  78.80 ± 0.82  72.96 ± 0.90  84.50 ± 0.73  85.60 ± 0.71  85.05 ± 0.72
Loss      ResNet-18+Att.+Lfinal  75.00 ± 0.88  79.40 ± 0.82  77.20 ± 0.85  86.53 ± 0.69  88.03 ± 0.66  87.28 ± 0.67
Fig. 5. The spatial distribution of features in the experiments; different
colors denote different classes. The first row shows the features learned
under LCE, the second row the features learned under Ld, and the third row
the features learned under Lmetric.
feature space distribution under the cross-entropy loss function,
where the features overlap heavily and cannot be distinguished.
The images in the second row show the feature space distribution
under the Ld loss function. As the number of epochs increases,
features of different categories become clearly separated, but the
features of the same category remain widely scattered. The images
in the third row show the feature space distribution learned under
the Lmetric loss function. As the number of epochs increases, the
model makes the distance between features of the same category
smaller and the distance between features of different categories
larger. The numeric results presented in Table Ⅵ show that our loss
function improves the mean classification accuracy from 72.96% to
77.20% for 5-way 1-shot and from 85.05% to 87.28% for 5-way 5-shot.
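The idea of pulling same-class features together while pushing different classes apart can be illustrated with a generic center-based loss. The sketch below is an assumption for illustration only: it is not the paper's Ld or Lmetric, whose exact definitions are given in the method section, and the function name, the hinge form, and the `margin` parameter are hypothetical.

```python
import numpy as np


def center_margin_loss(features, labels, margin=1.0):
    """Generic intra-class compactness plus inter-class separation loss.

    intra: mean squared distance of each feature to its class center.
    inter: mean squared hinge penalty on class centers closer than `margin`.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)  # sorted unique class ids
    centers = np.stack([features[labels == c].mean(axis=0) for c in classes])

    # Intra-class term: pull every feature toward its own class center.
    idx = np.searchsorted(classes, labels)
    intra = np.mean(np.sum((features - centers[idx]) ** 2, axis=1))

    # Inter-class term: penalize pairs of centers that are closer than the margin.
    diffs = centers[:, None, :] - centers[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
    upper = np.triu_indices(len(classes), k=1)
    inter = np.mean(np.maximum(0.0, margin - dists[upper]) ** 2) if len(upper[0]) else 0.0
    return float(intra + inter)


if __name__ == "__main__":
    separated = [[0, 0], [0, 0], [5, 5], [5, 5]]
    overlapping = [[0, 0], [1, 1], [0, 1], [1, 0]]
    print(center_margin_loss(separated, [0, 0, 1, 1]))
    print(center_margin_loss(overlapping, [0, 0, 1, 1]))
```

Compact, well-separated clusters give a loss near zero, while overlapping clusters are penalized by both terms, which mirrors the qualitative behavior described for the third row of Fig. 5.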
3) Ablation study of attention mechanism module: In few-shot
classification in a complex environment, it is particularly
important to obtain robust features from a well-trained feature
extractor. To this end, we introduce a parallel attention
mechanism module that simultaneously generates channel and
spatial attention information. In the experiments, we evaluate
the influence of different attention mechanisms by comparing
four attention modules with ours on the pavement distress
dataset. The experimental results are listed in Table Ⅶ,
from which we can observe that our attention mechanism
module outperforms all four competing attention mechanisms.
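To make the "parallel" idea concrete, the sketch below computes a channel weight ("what") and a spatial weight ("where") from the same input feature map and applies both. It is a deliberately simplified, parameter-free illustration: the actual module uses learned layers, and the pooling choices, the sigmoid gating, and the multiplicative fusion here are assumptions rather than the authors' design.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def parallel_attention(x):
    """Reweight a feature map of shape (C, H, W) with parallel attention branches.

    Both branches read the same input, so neither depends on the other's output.
    """
    x = np.asarray(x, dtype=float)
    # Channel branch ("what"): one weight per channel from global average pooling.
    channel_w = sigmoid(x.mean(axis=(1, 2)))[:, None, None]  # shape (C, 1, 1)
    # Spatial branch ("where"): one weight per location from the channel-wise mean.
    spatial_w = sigmoid(x.mean(axis=0))[None, :, :]          # shape (1, H, W)
    # Assumed fusion: multiply the input by both weights.
    return x * channel_w * spatial_w


if __name__ == "__main__":
    feature_map = np.random.default_rng(0).normal(size=(4, 8, 8))
    print(parallel_attention(feature_map).shape)
```

A sequential design such as CBAM [49] would instead feed the channel-refined map into the spatial branch; the parallel form computes both weights from the unrefined input.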
4) Ablation study of the balance hyperparameter: The
hyperparameter α in Eq. (15) balances the penalty term in
Lfinal. In the experiment, we study the performance of our
method on the pavement distress dataset under different values
of α. As shown in Fig. 6, our method performs best when
α ∈ [0.6, 0.7].
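A sweep of this kind can be sketched as a simple grid search. The helper below is hypothetical: `evaluate` stands for whatever routine trains the model with a given α and returns its validation accuracy, and the grid spacing of 0.1 is an assumption, since the text does not state which values were tested.

```python
def select_alpha(evaluate, alphas):
    """Return (best_alpha, best_accuracy) over a grid of candidate values.

    `evaluate` is a caller-supplied function mapping alpha to validation accuracy.
    """
    scores = {alpha: evaluate(alpha) for alpha in alphas}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    grid = [round(0.1 * i, 1) for i in range(11)]  # 0.0, 0.1, ..., 1.0

    def toy_evaluate(alpha):
        # Placeholder score for demonstration only; not a result from the paper.
        return 1.0 - abs(alpha - 0.6)

    print(select_alpha(toy_evaluate, grid))
```

In practice each call to `evaluate` is a full training and validation run, so the grid is usually kept coarse.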
D. Visualization Results
The confusion matrices of our method on the Dnovel samples of the
pavement distress dataset, given one and five labeled samples,
are shown in Fig. 7, where the intersection of the i-th row and
j-th column denotes the fraction of query samples of the i-th
class that are classified as the j-th class. In Fig. 7, we can see
that given one labeled sample, our model performs poorly on
category 3; given five labeled samples, its accuracy on
category 3 improves greatly.
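The row-normalized confusion matrix described above can be computed as in the following sketch. The function name and the integer label encoding (classes numbered from 0) are assumptions made for the example.

```python
def confusion_rates(true_labels, pred_labels, n_classes):
    """Row-normalized confusion matrix.

    Entry [i][j] is the fraction of samples of true class i predicted as class j.
    """
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, pred_labels):
        counts[t][p] += 1
    rates = []
    for row in counts:
        total = sum(row)
        # Classes with no query samples get an all-zero row.
        rates.append([c / total if total else 0.0 for c in row])
    return rates


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    preds = [0, 1, 1, 1]
    for row in confusion_rates(truth, preds, n_classes=2):
        print(row)
```

The diagonal entries are the per-class accuracies, so a weak class such as category 3 in the 1-shot case shows up as a low diagonal value with the remainder spread across its row.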
V. CONCLUSION
In this work, we introduce a deep metric learning-based
method for multi-target few-shot pavement distress
classification. Our model contains two modules. First, we
design a baseline encoder module to extract multi-level features
from the input, because robust features are important for the
result. Second, we introduce a novel metric module. In the
metric module, channel attention is adopted to learn the “what”
features: it estimates the importance of different channel
features to the key information, focusing on the
TABLE VII
CLASSIFICATION ACCURACY WITH DIFFERENT ATTENTION MECHANISMS ON PAVEMENT DISTRESS DATA SET
(5-WAY ACCURACY, %)
                           1-shot                                        5-shot
Method        Backbone     Data set1     Data set2     Mean          Data set1     Data set2     Mean
GC-Net [46]   ResNet-18    67.25 ± 0.95  76.29 ± 0.86  71.77 ± 0.74  76.85 ± 0.86  85.79 ± 0.71  81.32 ± 0.79
SENet [47]    ResNet-18    68.88 ± 0.94  72.16 ± 0.91  70.52 ± 0.95  79.52 ± 0.82  83.63 ± 0.75  81.58 ± 0.78
ECA-Net [48]  ResNet-18    67.04 ± 0.95  72.88 ± 0.90  69.96 ± 0.93  76.75 ± 0.85  84.08 ± 0.74  80.42 ± 0.80
CBAM [49]     ResNet-18    70.11 ± 0.92  72.11 ± 0.91  71.11 ± 0.92  79.31 ± 0.82  81.17 ± 0.79  80.24 ± 0.81
Ours          ResNet-18    75.00 ± 0.88  79.40 ± 0.82  77.20 ± 0.85  86.53 ± 0.69  88.03 ± 0.66  87.28 ± 0.67
Fig. 6. The verification accuracies with different α for 5-way N-shot (N = 1, 5)
on the pavement distress dataset.
Fig. 7. The confusion matrices of our method. (a) is the result with one labeled
sample (1-shot), and (b) is the result with five labeled samples (5-shot).
information with large weights according to their degree of
importance. Spatial attention is used to learn the “where”
features, focusing on the spatial location of key features.
Furthermore, we introduce a new metric loss function, which
guides the model to make the distance between features of the
same category smaller and the distance between features of
different categories larger. Experimental results show that the
proposed method outperforms the compared methods on the pavement
distress classification task.
REFERENCES
[1] Y. He, K. Song, Q. Meng and Y. Yan, "An End-to-End Steel Surface
Defect Detection Approach via Fusing Multiple Hierarchical Features,"
in IEEE Transactions on Instrumentation and Measurement, vol. 69, no.
4, pp. 1493-1504, April 2020.
[2] D. Zhang, K. Song, Q. Wang, Y. He, X. Wen and Y. Yan, "Two Deep
Learning Networks for Rail Surface Defect Inspection of Limited
Samples with Line-Level Label," in IEEE Transactions on Industrial
Informatics. doi: 10.1109/TII.2020.3045196
[3] M. Niu, K. Song, L. Huang, Q. Wang, Y. Yan and Q. Meng,
"Unsupervised Saliency Detection of Rail Surface Defects Using
Stereoscopic Images," in IEEE Transactions on Industrial Informatics,
vol. 17, no. 3, pp. 2271-2281, March 2021
[4] Y. Bao, K. Song, J. Liu, Y. Wang, Y. Yan, H. Yu, and X. Li, "Triplet-
Graph Reasoning Network for Few-Shot Metal Generic Surface Defect
Segmentation," in IEEE Transactions on Instrumentation and
Measurement, vol. 70, pp. 1-11, 2021, Art no. 5011111, doi:
10.1109/TIM.2021.3083561.
[5] R. Kapela et al., “Asphalt surfaced pavement cracks detection based on
histograms of oriented gradients,” in International Conference Mixed
Design of Integrated Circuits & Systems (MIXDES), Torun, 2015, pp.
579-584.
[6] Y. Hu, and C. Zhao. “A novel LBP based methods for pavement crack
detection,” J. Pattern Recognit. Res., vol. 5, no. 1, pp. 140-147, 2010.
[7] P. Subirats, J. Dumoulin, V. Legeay and D. Barba, “Automation of
Pavement Surface Crack Detection using the Continuous Wavelet
Transform,” in International Conference on Image Processing, Atlanta,
GA, 2006, pp. 3037-3040.
[8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” in Proc. Int. Conf. Learn.
Representations, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proc. Comput. Vis. Pattern Recognit., Jun. 2016, pp.
770–778.
[10] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra,
“Matching Networks for One Shot Learning,” NIPS, 2016.
[11] J. Snell, K. Swersky, and R. S. Zemel, “Prototypical Networks for Few-
shot Learning,” NIPS, 2017.
[12] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M.
Hospedales, “Learning to Compare: Relation Network for Few-Shot
Learning,” CVPR, 2018.
[13] F. Liu, G. Xu, Y. Yang, X. Niu, and Y. Pan, “Novel approach to
pavement cracking automatic detection based on segment extending,” in
Proc. Int. Symp. Knowl. Acquisition Modeling, Dec. 2008, pp. 610–614.
[14] W. Xu, Z. Tang, J. Zhou, and J. Ding, “Pavement crack detection based
on saliency and statistical features,” in Proc. IEEE Int. Conf. Image
Process. (ICIP), Sep. 2013, pp. 4093–4097.
[15] H. Zakeri, F. M. Nejad, A. Fahimifar, A. D. Torshizi, and M. H. F.
Zarandi, “A multi-stage expert system for classification of pavement
cracking,” in Proc. Joint IFSA World Congr. NAFIPS Annu. Meeting,
Jun. 2013, pp. 1125–1130.
[16] Y. Shi, L. Cui, Z. Qi, F. Meng, and Z. Chen, “Automatic road crack
detection using random structured forests,” IEEE Trans. Intell. Transp.
Syst., vol. 17, no. 12, pp. 3434–3445, Dec. 2016.
[17] R. Amhaz, S. Chambon, J. Idier and V. Baltazart, “Automatic Crack
Detection on Two-Dimensional Pavement Images: An Algorithm Based
on Minimal Path Selection,” in IEEE Trans. Intell. Transp. Syst., vol. 17,
no. 10, pp. 2718-2729, Oct. 2016.
[18] A. Cord and S. Chambon, “Automatic road defect detection by textural
pattern recognition based on adaboost,” Computer-Aided Civil and
Infrastructure Engineering, vol. 27, no. 4, pp. 244–259, 2012.
[19] H. Oliveira and P. L. Correia, “Automatic road crack detection and
characterization,” in IEEE Trans. Intell. Transp. Syst., vol. 14, no. 1, pp.
155–168, 2013.
[20] Y. Shi, L. Cui, Z. Qi, F. Meng, and Z. Chen, “Automatic road crack
detection using random structured forests,” in IEEE Trans. Intell. Transp.
Syst., vol. 17, no. 12, pp. 3434–3445, 2016.
[21] L. Zhang, F. Yang, Y. Daniel Zhang and Y. J. Zhu, “Road crack detection
using deep convolutional neural network,” in IEEE International
Conference on Image Processing (ICIP), Phoenix, AZ, 2016, pp. 3708-
3712.
[22] V. Mandal, A. R. Mussah, and Y. Adu-Gyamfi, “Deep Learning
Frameworks for Pavement Distress Classification: A Comparative
Analysis,” arXiv preprint arXiv:2010.10681, 2020.
[23] B. Li, K. C. Wang, A. Zhang, E. Yang and G. Wang, “Automatic
classification of pavement crack using deep convolutional neural
network,” International Journal of Pavement Engineering, vol. 21, no. 4,
pp. 457-463, 2020.
[24] X. Wang and Z. Hu, “Grid-based pavement crack analysis using deep
learning,” in Transportation Information and Safety (ICTIS), 2017 4th
International Conference on. IEEE, 2017, pp. 917–924.
[25] Y. Du, N. Pan, Z. Xu, F. Deng, Y. Shen and H. Kang, “Pavement distress
detection and classification based on yolo network,” International
Journal of Pavement Engineering, pp. 1–14, 2020.
[26] E. Ibragimov, H.-J. Lee, J.-J. Lee and N. Kim, “Automated pavement
distress detection using region based convolutional neural networks,”
International Journal of Pavement Engineering, pp. 1–12, 2020.
[27] H. Dong, K. Song, Y. He, J. Xu, Y. Yan and Q. Meng, "PGA-Net:
Pyramid Feature Fusion and Global Context Attention Network for
Automated Surface Defect Detection," in IEEE Transactions on
Industrial Informatics, vol. 16, no. 12, pp. 7448-7458, Dec. 2020.
[28] F. Yang, L. Zhang, S. Yu, D. Prokhorov, X. Mei and H. Ling, “Feature
Pyramid and Hierarchical Boosting Network for Pavement Crack
Detection,” in IEEE Trans. Intell. Transp. Syst., vol. 21, no. 4, pp. 1525-
1535, April 2020.
[29] Y. Zhang, Q. Li, X. Zhao and M. Tan, “TB-Net: A Three-Stream
Boundary-Aware Network for Fine-Grained Pavement Disease
Segmentation,” in IEEE/CVF Winter Conference on Applications of
Computer Vision. 2021. p. 3655-3664.
[30] C. Finn, P. Abbeel, and S. Levine, “Model-Agnostic Meta-Learning for
Fast Adaptation of Deep Networks,” International Conference on
Machine Learning (ICML), 2017.
[31] S. Ravi and H. Larochelle, “Optimization as a model for few-shot
learning,” in International Conference on Learning Representations, 2017.
[32] Z. Li, F. Zhou, F. Chen, and H. Li, “Meta-SGD: Learning to learn quickly
for few shot learning,” arXiv preprint arXiv:1707.09835, 2017.
[33] V. G. Satorras and J. B. Estrach, “Few-shot learning with graph neural
networks,” in Proc. ICLR, 2018, pp. 1–13.
[34] P. Sermanet, A. Frome, and E. Real, “Attention for fine-grained
categorization,” arXiv preprint arXiv:1412.7054, 2014.
[35] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan, “Attentive
contexts for object detection,” IEEE Transactions on Multimedia,
vol. 19, no. 5, pp. 944–954, 2017.
[36] H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network for
semantic segmentation,” in BMVC, p. 285, 2018.
[37] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few
examples: A survey on few-shot learning,” ACM Computing Surveys
(CSUR), vol. 53, no. 3, pp. 1-34, 2020.
[38] X. Wang, R. Girshick, A. Gupta and K. He, "Non-local Neural
Networks," 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 7794-7803.
[39] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning
approach for deep face recognition,” in Lecture Notes in Computer
Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), vol. 9911 LNCS, 2016, pp. 499–515.
[40] M. Hamed, et al. Pavement Image Datasets: A New Benchmark Dataset
to Classify and Densify Pavement Distresses. Transportation Research
Record, 2020, 2674.2: 328-339.
[41] A. Ravichandran, R. Bhotika and S. Soatto, “Few-Shot Learning With
Embedded Class Models and Shot-Free Meta Training,” 2019 IEEE/CVF
International Conference on Computer Vision (ICCV), Seoul, Korea
(South), 2019, pp. 331-339.
[42] K. Lee, S. Maji, A. Ravichandran and S. Soatto, “Meta-Learning With
Differentiable Convex Optimization,” 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), Long Beach, CA,
USA, 2019, pp. 10649-10657.
[43] H. Li, D. Eigen, S. Dodge, M. Zeiler and X. Wang, “Finding Task-
Relevant Features for Few-Shot Learning by Category Traversal,” 2019
IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), Long Beach, CA, USA, 2019, pp. 1-10.
[44] Y. Tian, Y. Wang, D. Krishnan, J. B Tenenbaum, and P. Isola,
“Rethinking few-shot image classification: a good embedding is all you
need?” arXiv preprint arXiv:2003.11539, 2020.
[45] C. Zhang, Y. Cai, G. Lin and C. Shen, “DeepEMD: Few-Shot Image
Classification With Differentiable Earth Mover’s Distance and
Structured Classifiers,” 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp.
12200-12210.
[46] Y. Cao, J. Xu, et al., “GCNet: Non-local networks meet squeeze-
excitation networks and beyond,” in Proceedings of the IEEE CVPR
Workshops, 2019.
[47] J. Hu, L. Shen, et al., “Squeeze-and-excitation networks,” in
Proceedings of the IEEE CVPR, pp. 7132–7141, 2018.
[48] Q. Wang, B. Wu, et al., “ECA-Net: Efficient channel attention for deep
convolutional neural networks,” in 2020 IEEE CVPR, 2020.
[49] S. Woo, J. Park, et al., “CBAM: Convolutional block attention module,”
in Proceedings of the ECCV, pp. 3–19, 2018.
Hongwen Dong received the B.S. degree in
School of Mechanical Engineering and
Automation, Liaoning University of
Technology, Jinzhou, China, in 2016, and
the M.S. degree in School of Mechanical
Engineering and Automation, Northeastern
University, Shenyang, China, in 2018. He is
currently pursuing the Ph.D. degree with
School of Mechanical Engineering and
Automation, Northeastern University, China. His research
interests include deep learning, pattern recognition and
semantic segmentation.
Kechen Song received the B.S., M.S. and
Ph.D. degrees in School of Mechanical
Engineering and Automation, Northeastern
University, Shenyang, China, in 2009,
2011 and 2014, respectively. Between 2018
and 2019, he was an Academic Visitor in
the Department of Computer Science,
Loughborough University, UK. He is
currently an Associate Professor in the School of Mechanical
Engineering and Automation, Northeastern University. His
research interest covers vision-based inspection system for steel
surface defects, surface topography, image processing and
pattern recognition.
Qi Wang received the B.S. and M.S. degrees
in mechanical engineering from the
University of Science and Technology
Liaoning, Anshan, China, in 2015 and 2018,
respectively. He is currently working toward
the Ph.D. degree in mechanical design and
theory with the School of Mechanical
Engineering and Automation, Northeastern
University, Shenyang, China. His current
research interests include image segmentation and thermal
imaging defect detection.
Yunhui Yan received the B.S., M.S. and
Ph.D. degrees in School of Mechanical
Engineering and Automation, Northeastern
University, Shenyang, China, in 1981, 1985
and 1997, respectively. He has been a
teacher at Northeastern University, China,
since 1982, and became a professor in 1997.
During 1993-1994, he stayed in the Tohoku
National Industrial Research Institute as a
visiting scholar. His research interest covers intelligent
inspection, image processing and pattern recognition.
Peng Jiang received the B.S. and M.S.
degrees in School of Software Engineering,
Northeastern University, Shenyang, China,
in 2009 and 2011, respectively. Between
2016 and 2018, he was a Senior Engineer in
the Department of Information Technology,
Liaoning Transportation Research Institute
Group Co., Ltd. He is currently a Director of
R&D department in the Liaoning ATS
Intelligent Transportation Technology Co., Ltd. His research
interest covers software engineering and information
construction of expressway.
... This limitation hampers the accuracy of CNNs in breast lesion segmentation tasks, resulting in decreased segmentation precision [25]. Cao et al. designed a set of mixed dilated convolutions applied to D2U-Net to address challenges posed by low signal-to-noise ratio, significant artifacts, and variations in breast tumor shape and size [26]. Abraham et al. proposed an improved Attention U-Net model by combining image pyramids and attention mechanisms to capture context features at different levels for breast cancer segmentation [27]. ...
... The paper delves into the embedding strategy of CBAM in the Efficient Last Stage of the backbone network MobileNet V3 and explores three different structures [25], [26], as illustr ated in Fig. 5: (a) applying CBAM before the convolutional l ayers in the efficient last stage; (b) introducing CBAM after t he convolutional layers in the efficient last stage; (c) applying CBAM to the entire network after adaptive average pooling. Through experimental validation in Section 4.1, we observed that the effect of (b) was significantly more pronounced. ...
Article
Full-text available
Automated segmentation of breast tumors in breast ultrasound images has been a challenging frontier issue. The morphological diversity, boundary ambiguity, and heterogeneity of malignant tumors in breast lesions constrain the improvement of segmentation accuracy. To address these challenges, we propose an innovative deep learning-based method, namely Dual-Channel Deep Residual Attention UPerNet (DDRA-net), for efficient and accurate segmentation of breast tumor regions. The core of DDRA-net lies in the Dual-Channel Deep Residual Attention module (DDRA), which integrates depth-wise separable convolution and Convolutional Block Attention Module (CBAM). This design aims to enhance the extraction of crucial features within the receptive field to better capture subtle details of breast lesions. Through extensive experimental evaluation, DDRA-net demonstrates remarkable performance on a publicly available breast ultrasound datasets, exhibiting higher segmentation accuracy and stability compared to contemporary mainstream deep learning methods. Importantly, it is worth emphasizing that the flexibility of this method allows easy integration with other network structures to further improve the performance and applicability of breast tumor segmentation. In the segmentation of the Breast Ultrasound Image dataset, our precision, recall, IoU, F1 score, Dice, and Hausdorff Distance achieved the following values: 95.31%, 90.79%, 88.00%, 92.39%, 95.46%, and 3.02, respectively. Compared to the original UPerNet, DDRA-net demonstrated improvements of 2.92%, 4.64%, 5.52%, 4.97%, 3.4%, and 24.5% in these six metrics on the Breast Ultrasound Image dataset.
... Yu et al. [21] advance the field with a deep convolutional neural network coupled with an enhanced chicken swarm algorithm aimed at improving the model's generalization for concrete crack detection. Dong et al. [22] refine this methodology by employing ResNet18 as the backend network for extracting multi-level feature information, incor- porating an attention mechanism and a novel metric loss function to bolster the model's focus on pertinent details, a technique that underscores the utility of attention in enhancing model performance on critical data [23,24]. Wu et al. [25,26] propose integrating attention into YOLOv4, resulting in improved accuracy, and introduce CA attention to YOLOv5, demonstrating its superiority through extensive experiments. ...
Article
Full-text available
Rapid real-time detection of crack images helps prevent the emergence of more significant potential hazards. However, mature and sophisticated convolutional neural networks are more concerned with images of general everyday objects. These neural networks do not meet the real-time requirements for concrete defect detection for cracks with complex morphology and varying scales. This manuscript proposes a lightweight improvement strategy, which consists mainly of malleable efficient channel pruning, a more scaled wide-area receptive field (MSWR), and multi-channel fusion of spatial attention, referred to as MMM strategies. Firstly, the channel pruning count can intuitively make the general convolutional neural network more lightweight. Secondly, the wider receptive field can fuse multi-scale feature maps and recognize cracks of various scales. Finally, the multi-channel fusion of spatial attention enhances detection performance efficiently, ensuring real-time capability at minimal cost. The experimental results show that the lightweight network improved by the MMM strategy sacrifices no more than 8% in the detection accuracy of defects. In some cases, the detection accuracy is even improved, while the detection speed has a significant advantage. This lightweight strategy improves defect detection and has higher real-time adaptability than mainstream convolutional neural networks. The codes are available at https://github.com/mmm587/MMM.
... The challenge of differentiating the degree of carbon deposit can be solved partly by selecting an appropriate backbone network for small datasets of carbon deposit images. In general, networks with more parameters and a complex structure can extract deeper features, but some few-shot tasks [8,15,16] demonstrate that modest networks outperform complex networks. We validate this argument through experimental comparisons in the carbon deposit dataset and select the best model suitable for discriminating the degree of carbon deposit in automobile engines. ...
Article
The detection of carbon deposit degree is of great significance to the maintenance of automobile engine. Due to issues with poor feature aggregation, inter-class similarity, and intra-class variance in carbon deposit data with a small number of samples, model-based discriminative approaches cannot be widely implemented in the market. In order to overcome this technical barrier, the article examines the impact of DCNNs (Deep Convolutional Neural Networks) level on the recognition effect of the degree of carbon deposit, introduces a dropout structure and data enhancement strategy to lower the risk of overfitting brought on by the small dataset, and suggests a recognition method based on the kernel of dual-dimensional multiscale-multifrequency information features to enhance the differentiation characteristic. After experimental testing, the accuracy of this method is 86.9 %, the F1-score is 87.2 %, and the inference speed is 190 FPS, which can meet the practical requirements and provide basic support for the large-scale promotion of the model discrimination.
... With the extension of pavement service life, asphalt pavement distress has gradually become one of the main distresses [1][2][3][4], which causes serious impact on pavement service life and vehicle safety [5][6][7]. The emergence of crack further increases the challenge of asphalt pavement detection and maintenance [8][9][10][11][12]. By the end of 2022, China's pavement detection and maintenance mileage accounted for 99.9% of the total mileage, and the conventional asphalt pavement crack detection means is inefficient [13][14][15][16], difficult to achieve rapid and large-scale detection goals [17][18][19][20]. ...
... The Neck part is composed of Spatial Pyramid Pooling (SPP) structure and Path Aggregation Network (PANet) structure [42], and its main function is to fuse the feature information of different depths of the backbone network, so that the output information can better express the target in the image and improve the network accuracy [43][44][45]. ...
Article
Full-text available
Metal surface defect segmentation can play an important role in dealing with the issue of quality control during the production and manufacturing stages. There are still two major challenges in industrial applications. One is the case that the number of metal surface defect samples is severely insufficient, and the other is that the most existing algorithms can only be used for specific surface defects and it is difficult to generalize to other metal surfaces. In this work, a theory of few-shot metal generic surface defect segmentation is introduced to solve these challenges. Simultaneously, the Triplet-Graph Reasoning Network (TGRNet) and a novel dataset Surface Defects- $4^{i}$ are proposed to achieve this theory. In our TGRNet, the surface defect triplet (including triplet encoder and trip loss) is proposed and is used to segment background and defect area, respectively. Through triplet, the few-shot metal surface defect segmentation problem is transformed into few-shot semantic segmentation problem of defect area and background area. For few-shot semantic segmentation, we propose a method of multi-graph reasoning to explore the similarity relationship between different images. And to improve segmentation performance in the industrial scene, an adaptive auxiliary prediction module is proposed. For Surface Defects- $4^{i}$ , it includes multiple categories of metal surface defect images to verify the generalization performance of our TGRNet and adds the nonmetal categories (leather and tile) as extensions. Through extensive comparative experiments and ablation experiments, it is proved that our architecture can achieve state-of-the-art results.
Conference Paper
Full-text available
Semantic segmentation of surgical instruments plays a critical role in computer-assisted surgery. However, specular reflection and scale variation of instruments are likely to occur in the surgical environment, undesirably altering visual features of instruments, such as color and shape. These issues make semantic segmentation of surgical instruments more challenging. In this paper, a novel network, Pyramid Attention Aggregation Network, is proposed to aggregate multiscale attentive features for surgical instruments. It contains two critical modules: Double Attention Module and Pyramid Upsampling Module. Specifically, the Double Attention Module includes two attention blocks (i.e., position attention block and channel attention block), which model semantic dependencies between positions and channels by capturing joint semantic information and global contexts, respectively. The attentive features generated by the Double Attention Module can distinguish target regions, contributing to solving the specular reflection issue. Moreover, the Pyramid Upsampling Module extracts local details and global contexts by aggregating multi-scale attentive features. It learns the shape and size features of surgical instruments in different receptive fields and thus addresses the scale variation issue. The proposed network achieves state-of-the-art performance on various datasets. It achieves a new record of 97.10% mean IOU on Cata7. Besides, it comes first in the MICCAI EndoVis Challenge 2017 with 9.90% increase on mean IOU.
Article
Rail surface defect (RSD) inspection is an essential routine maintenance task. Computer vision testing is suitable for RSD inspection with its intuitiveness and rapidity. Deep learning techniques, which can extract deep semantic features, have been applied to inspect RSDs in recent years. However, these methods demand thousands of samples. And sample collection requires the industry and costs high. To address the issue, a novel inspection scheme for RSDs is presented for limited samples with line-level label, which regards defect images as sequence data and classifies pixel lines. Thousands of pixel lines are easy to be collected and labeling line-level is a simple task in labeling works. Then two methods OC-IAN and OC-TD are designed for inspecting express rail defects and common/heavy rail defects respectively. OC-IAN and OC-TD both employ one-dimensional convolutional neural network (ODCNN) to extract features and long and short term memory (LSTM) network to extract context information. The main differences between OC-IAN and OC-TD are that OC-TD applies a double-branch structure and removes the attention module. Experimental results on RSDDs dataset demonstrate that our methods are effective and outperform the state-of-the-art methods on defect-level metrics (Type-I: Rec-0.9314, Pre-0.8421, F1-0.8845; Type-II: Rec-0.9427, Pre-0.9176, F1-0.9300).
Chapter
The focus of recent meta-learning research has been on the development of learning algorithms that can quickly adapt to test time tasks with limited data and low computational cost. Few-shot learning is widely used as one of the standard benchmarks in meta-learning. In this work, we show that a simple baseline: learning a supervised or self-supervised representation on the meta-training set, followed by training a linear classifier on top of this representation, outperforms state-of-the-art few-shot learning methods. An additional boost can be achieved through the use of self-distillation. This demonstrates that using a good learned embedding model can be more effective than sophisticated meta-learning algorithms. We believe that our findings motivate a rethinking of few-shot image classification benchmarks and the associated role of meta-learning algorithms. Code: http://github.com/WangYueFt/rfs/.
Article
Automatic pavement crack detection is essential for evaluating maintenance requirements and ensuring driving safety. Crack detection plays a primary role in realising the automatic evaluation of pavement condition. Most existing approaches to pavement crack detection rely on laborious manual work, which is time- and cost-intensive. Although there has been considerable research on pavement crack detection, it remains a challenging task owing to diverse and complex pavement conditions. Recently, deep learning-based algorithms have achieved significant success in computer vision tasks. However, these techniques still have limitations for automatic pavement distress detection. To overcome the current limitations, this study proposes a method for detecting signs of pavement distress based on the faster region-based convolutional neural network (Faster R-CNN). The study focuses on the detection of longitudinal cracks, transverse cracks, alligator cracks, and partial patching in pavement images. A framework for applying the Faster R-CNN technique to a full-size pavement image is also proposed, which allows the sliding window size to be reduced, thus enabling the detection of larger images. The performance of the proposed method was validated against a dataset containing actual pavement images. The experimental results show that the proposed method detects cracks and partial patching accurately.
Article
Visual information has received growing attention in rail surface defect detection because of its high efficiency and stability. However, visual information alone is not sufficient to detect defects completely against complex backgrounds. Adding profile data can improve this situation, because profiles capture the physical geometry of the surface. In high-speed detection, however, traditional three-dimensional (3D) profile acquisition is difficult and is carried out separately from image acquisition, so it cannot satisfy these requirements effectively. Therefore, an unsupervised stereoscopic saliency detection method based on a binocular line-scanning system is proposed in this paper. This method obtains high-precision image and profile information at the same time and avoids the decoding distortion of the structured light reconstruction method. In this method, a global low-rank non-negative reconstruction algorithm with a background constraint is proposed. Unlike the low-rank recovery (LRR) model, the algorithm has more comprehensive low-rank and background clustering properties. In addition, outlier detection based on the geometric properties of the rail surface is proposed. Finally, the image saliency results and the depth outlier detection results are combined through interactive fusion. A dataset (RSDDS-113) containing rail surface defects is also established for experimental verification. The experimental results demonstrate that our method achieves an MAE of 0.09 and an AUC of 0.94, outperforming 15 other algorithms.