
Deep Metric Learning-Based for Multi-Target Few-Shot Pavement Distress Classification

June 2021
IEEE Transactions on Industrial Informatics PP(99):1-1


DOI:10.1109/TII.2021.3090036

Authors:

Ke-Chen Song

The Logistics Institute, Northeastern University


In this paper, we propose a new few-shot pavement distress detection method based on metric learning, which can effectively learn new categories from a few labeled samples. In our work, we adopt the backend network (ResNet18) to extract multilevel feature information from the base classes and then send the extracted features into the metric module. In the metric module, we introduce the attention mechanism to learn the feature attributes of "what" and "where" and focus the model on the desired characteristics. We also introduce a new metric loss function to maximize the distance between different categories while minimizing the distance between the same categories. In the testing stage, we calculate the cosine similarity between the support set and query set to complete novel category detection. The experimental results show that the proposed method significantly outperforms several benchmarking methods on the pavement distress dataset.


Abstract—Pavement distress detection is of great significance

for road maintenance and to ensure road safety. At present,

detection methods based on deep learning have achieved

outstanding performance in related fields. However, these

methods require large-scale training samples. For pavement

distress detection, it is difficult to collect more images with

pavement distress, and the types of pavement diseases are

increasing with time, so it is impossible to ensure sufficient

pavement distress samples to train the supervised deep model. In

this paper, we propose a new few-shot pavement distress detection

method based on metric learning, which can effectively learn new

categories from a few labeled samples. In our work, we adopt the

backend network (ResNet18) to extract multilevel feature

information from the base classes and then send the extracted

features into the metric module. In the metric module, we

introduce the attention mechanism to learn the feature attributes

of "what" and "where" and focus the model on the desired

characteristics. We also introduce a new metric loss function to

maximize the distance between different categories while

minimizing the distance between the same categories. In the testing

stage, we calculate the cosine similarity between the support set

and query set to complete novel category detection. The

experimental results show that the proposed method significantly

outperforms several benchmarking methods on the pavement

distress dataset. (The classification accuracies of 5-way 1-shot and

5-way 5-shot are 77.20% and 87.28%, respectively)

Index Terms- metric learning; deep learning; pavement distress

detection; few-shot; attention mechanism

I. INTRODUCTION

Pavement distresses, such as cracks, blocks, potholes,

alligator cracking and so on, are mainly caused by vehicle

overloading, weather changes and road aging. If these

distresses cannot be treated in time, they will reduce the road

quality, and endanger traffic safety. Rapid and accurate

detection of road surface damage is helpful for maintaining the

roads in time, preventing traffic accidents and ensuring vehicle

safety. In the past, pavement distress detection mainly relied on

engineering vehicles that collected pavement images with cameras

while traveling along roads, after which the images were manually

classified and processed. This method

is not only time consuming but also highly subjective.

Computer vision technology has made great achievements in

related fields of image processing, replacing the complex and

tedious manual detection [1]-[4]. In pavement distress

detection research, methods based on computer vision

have also appeared steadily, using features such as histogram-of-oriented-gradient

(HOG) [5], local-binary-pattern (LBP) [6]

and wavelet [7], followed by classifiers such as BP neural

networks and support vector machine (SVM) to classify

pavement distresses. Although this kind of method reduces the

problems caused by manual inspection to some extent, these handcrafted

features rely on expert knowledge and lack universality.

Moreover, the performance of these methods is still limited by

the complex structure, diverse shapes, complex backgrounds

and the strong interference of various noises (such as oil spots,

gravel, and zebra crossings, etc.).

In recent years, with the availability of large-scale datasets

(e.g., ImageNet) and the development of high-performance

computing units, deep learning-based methods have drawn

great attention in various visual tasks. These methods use the

convolutional neural network (CNN) [8] [9] to obtain

multilevel features from the input data to complete the

representational learning of the input image. However, most of

these supervised models need many labeled samples to fit the

deep CNN parameters. In industrial applications, it is difficult

to collect enough labeled images to train a CNN model. Hence,

these supervised CNN-based methods have difficulty learning

object distributions with a few labeled samples and suffer from

overfitting in the training process. In recent years, few-shot learning has attracted attention in computer vision tasks, especially in image classification. The aim of few-shot learning is to learn novel objects with little supervision, as easily as humans do. Most few-shot learning methods adopt the metric-learning scheme. The concept of metric learning is to learn the similarity of a pair of samples in a way that maximizes the inter-class variations and minimizes the intra-class variations. For example, Matching Networks [10] train an end-to-end classifier similar to the nearest neighbor, and the trained model does not need to be adjusted; it can also be used to classify categories that did not appear in the training process. Prototypical Networks [11] use Euclidean distance as the distance measure, pulling each sample close to the prototype representation of its own class and keeping it farther from the prototype representations of other classes. Relation Networks [12] apply baseline CNN modules as the feature encoder and then discriminate the similarities and dissimilarities between the support and query samples from their concatenated features.

Hongwen Dong, Kechen Song, Member, IEEE, Qi Wang, Yunhui Yan, Peng Jiang

Fig. 1. A brief illustration of few-shot learning for the pavement distress classification task. The aim of this task is to predict the query samples based on their similarity with the labeled support samples by the few-shot model.

This work is supported by the National Natural Science Foundation of China (51805078), the National Key Research and Development Program of China (2017YFB0304200), the Fundamental Research Funds for the Central Universities (N2003021, N2103011). (Corresponding authors: Kechen Song; Yunhui Yan)

H. Dong, K. Song, Q. Wang, and Y. Yan are with the School of Mechanical Engineering and Automation, Northeastern University, Shenyang, Liaoning, 110819, China, and the Key Laboratory of Vibration and Control of Aero-Propulsion Systems Ministry of Education of China, Northeastern University, Shenyang, 110819, China. (e-mail: donghongwenliran@163.com, songkc@me.neu.edu.cn, 1810109@stu.neu.edu.cn, yanyh@mail.neu.edu.cn).

P. Jiang is with the department of Liaoning ATS Intelligent Transportation Technology Co., Ltd., Shenyang, Liaoning, China. (jiangpeng1986@139.com)

In the pavement distress classification task, because most of

the pavement is normal, it is difficult to collect enough distress

pavement images. In addition, in different scenes, different road

conditions and different pavement materials, distresses with the

same label name are very different. Therefore, supervised CNN

classification is not the best method for pavement distress

detection. To solve the above challenges, we introduce a deep

metric learning-based method for multi-target few-shot

pavement distress classification. The overview of our task is

shown in Fig. 1. Different from the approaches mentioned

above, our few-shot model improves the classification accuracy

in two ways. First, our few-shot model uses the baseline CNN

module as the feature extractor and adopts an attention

mechanism to obtain more robust and discriminative

information from images, which focuses the model on the

distressed region characteristics. In addition, we introduce a

new metric loss function to optimize the network model, which

makes the sample features of the same kind more compact and

enhances the separability of the sample features of different

categories. The framework of our method is shown in Fig. 2.

The main contributions of our work are summarized as follows:

1) A deep metric learning-based method for multi-target few-

shot pavement distress classification is proposed. To the best of

our knowledge, our work is the first attempt to do so in

pavement distress classification task.

2) A novel metric module is proposed. In the module, an

attention mechanism is applied to obtain more discriminative

information from images, and the model focuses on the

distressed region characteristics. Additionally, a metric loss

function is used to optimize the model, which maximizes the

inter-class variations and minimizes the intra-class variations.

3) We carry out few-shot classification experiments on a

pavement distress dataset and achieve competitive performance

with state-of-the-art methods.

The rest of this paper is organized as follows: Section Ⅱ

introduces the related works. Section Ⅲ describes our method

in detail. Next, we present the experimental details and results

in Section Ⅳ. Finally, the

conclusion of this paper is summarized in Section Ⅴ.

II. RELATED WORKS

In this section, we briefly review some related works on

pavement distress detection, few-shot learning for classification,

and attention mechanisms.

A. Pavement distress detection

In this section we briefly review traditional pavement distress

detection methods and deep learning-based pavement distress

detection methods. The traditional methods introduced in this

section refer to the non-deep learning-based methods.

1) Traditional pavement distress detection methods: In early

studies [13] [14], different threshold methods were used to

highlight crack regions from the background. However, these

thresholds were set subjectively, so the selected thresholds

could not adapt to the changes or differences in image color

information caused by different acquisition conditions.

Furthermore, edge-based algorithms [15] [16] were adopted for

crack edge detection. However, these methods were limited by

the low contrast and noise of the images. Currently, most of the

approaches adopt manually designed features such as Gabor

filters, wavelet transform, local binary pattern, and histogram

of oriented gradient for pavement crack detection. However,

these manually designed features are not suitable for complex

cracks and lack universality.

In recent years, many researchers have applied machine

learning for pavement distress detection. In [17], a new

algorithm that relies on a minimal path with images was

proposed for pavement crack detection. In [18], a supervised

learning method based on AdaBoost was used for road surface

detection. In [19], two simple local statistics, the mean and

standard deviation, were adopted to classify whether image

blocks contain crack pixels. In [20], a novel framework based

on random structured forests was proposed for road crack

detection. Although these methods have some advantages

compared with traditional methods, the detection effect of these

methods depends heavily on artificially designed features, and

the generalization performance is not strong.

2) Deep learning-based pavement distress detection methods:

Deep learning-based methods benefit from powerful feature

representation, which makes outstanding achievements in

computer vision-related fields. In the pavement distress

detection task, Zhang et al. [21] applied a deep CNN framework

for pavement distress image classification. In [22], a

comparative analysis of pavement distress classification based

on deep learning frameworks was introduced. In [23], a DCNN

was applied to classify pavement cracks on 3D images and

those cracks were labeled into 5 different categories. The works in [24]-[26]

used deep learning-based methods to locate crack regions.

Dong et al. [27] fused multi-level features into different stages

and added the global context into the network for surface defect

segmentation. Yang et al. [28] fused multi-level features from

top-to-down for pavement crack segmentation. Zhang et al. [29]

adopted a three-stream boundary-aware network for fine-

grained pavement disease segmentation. Although these

methods achieved outstanding performance in pavement

disease detection, most of them only detect one kind of

pavement disease (e.g., crack) and lack universality, and these

methods are not effective for novel categories with only a few labeled

samples.

B. Few-shot learning for classification

In this section, we briefly review two categories of existing

few-shot learning for classification methods.

1) Meta-learning: Meta-learning, sometimes called learning

to learn, focuses more on tasks than data. In [30], a

model-agnostic meta-learning (MAML) algorithm was

proposed, which trains a model on learning tasks so that it can handle

a new learning task with only a few training samples. Ravi et al. [31]

proposed an LSTM-based meta-learner that learns the

optimization algorithm used to train another learner neural network

classifier in the case of a few samples. Li et al. [32]

proposed Meta-SGD, which, similar to SGD, can be trained

easily while initializing and adapting learners in only one step.

However, in these methods, the model structure is fixed, and

the image input size of the model is also fixed, so the

generalization is not good. Additionally, the model weights

need to calculate the second-order gradient which increases the

instability of the model.

2) Metric Learning: The concept of few-shot classification

algorithm based on metric learning mainly uses an encoder to

extract the features from input samples (labeled and unlabeled),

and then uses a metric function to calculate the similarity of the

features of unlabeled and labeled samples to output the category

prediction of unlabeled samples. Matching Networks [10] train

an end-to-end classifier similar to the nearest neighbor, and the

trained model does not need to be adjusted. Prototypical

Networks [11] adopt Euclidean distance as metric function,

which can maximize the inter-class variations and minimize the

intra-class variations. Relation Networks [12] apply baseline

CNN modules as the features encoder, and then discriminate the

similarities and dissimilarities between the support and query

samples by concatenated features. In [33], a graph convolution

network is used as the metric function. However, these methods

are still fixed for few-shot learning tasks.

C. Attention mechanism

The attention mechanism is a special signal processing

mechanism in human vision that can suppress useless

information and obtain interesting objects. In recent years,

attention mechanisms have been widely used in various deep

learning fields, such as image classification, object recognition

and semantic segmentation. For example, [34] introduced a

recurrent attention model that learns to direct high resolution

attention to the most discriminative regions without any spatial

supervision for fine-grained classification. In [35], an attention-

based global contextualized subnetwork was recurrently

adopted to generate the attentive location map for the input

image to highlight useful global contextual locations to provide

better object detection. Li et al. [36] proposed a pyramid

attention network, which implements spatial pyramid attention

on high-level features to exploit the impact of global contextual

information in semantic segmentation.

Inspired by the above method, we introduce an attention

mechanism into our method to extract more robust features. Our

attention mechanism includes two components: channel

attention and spatial attention. The former weights the

different channel features according to their importance and

emphasizes the channels with large weights, so that

the model learns “what” to attend to. The latter adopts a non-local block to

obtain spatial attention and learns “where” the key features are located.

Fig. 2. Flow chart of our method. An encoder module (fe) extracts base features from the input images. A metric module (gm) adopts an attention mechanism to obtain more discriminative information from the input features and learns a metric function that maximizes the distance between different categories while minimizing the distance within the same category. During testing, the features of the support set (with labels) and the query set (without labels) are extracted by (fe + gm), their cosine similarities are compared, and the prediction for the query set is output.

III. METHOD

A. Task Setting

Specifically, few-shot learning for classification task usually

involves three datasets [37]: a base class set Dbase, a support set

Dsupport and a query set Dquery. The goal of this task is to classify

each unlabeled query sample in Dquery correctly according to

Dsupport. However, because there are only a few labeled samples

for each class in Dsupport, a classification model cannot be trained

effectively. Therefore, we usually introduce Dbase to train a

model and learn transferable knowledge to help solve this

problem.









$D_{base} = \{(x_i, y_i)\}_{i=1}^{N}$ is used for training the classification model, where $y_i$ is the label corresponding to sample $x_i$ and $N$ is the number of training samples. $D_{novel} = D_{support} \cup D_{query}$ is a novel class set, where $D_{support} = \{(x_i, y_i)\}_{i=1}^{M}$ is a support set with $M$ labeled samples, $y_i$ is the label corresponding to sample $x_i$, and $D_{query} = \{x_i\}$ is the set without labels. Given $D_{novel} \cap D_{base} = \emptyset$, the goal of this task is to classify each unlabeled query sample in $D_{query}$ given $D_{support}$.
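The task setting above can be illustrated with a small episode sampler. This is only a sketch under stated assumptions: the function name `sample_episode`, the toy class names and sample ids are hypothetical, and 15 query samples per class is assumed because it yields the 75 query samples per 5-way episode reported in the table captions.

```python
import random

def sample_episode(samples_by_class, n_way=5, k_shot=1, n_query=15, seed=None):
    """Draw one N-way K-shot episode from {class_name: [sample_id, ...]}."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(samples_by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(samples_by_class[c], k_shot + n_query)
        support += [(s, c) for s in picked[:k_shot]]   # labeled support samples
        query += [(s, c) for s in picked[k_shot:]]     # labels are hidden at test time
    return support, query

# Toy novel set: 5 classes with 30 hypothetical sample ids each.
novel = {f"class_{i}": [f"img_{i}_{j}" for j in range(30)] for i in range(5)}
support, query = sample_episode(novel, n_way=5, k_shot=1, n_query=15, seed=0)
print(len(support), len(query))  # 5 75
```

Support and query samples are drawn without overlap, so a query sample never appears among the labeled examples it is compared against.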

B. Encoder Module

The robust features extracted from the input image have a

great impact on the final classification accuracy. In our method,

we build the encoder module $f_e(\cdot;\varphi)$ on the pre-trained

ResNet-18 network to extract multi-level features from

raw to semantic. The encoder module contains four residual

blocks and a global average pooling layer. The details of the

encoder module are shown in Table I. Each residual block is

composed of a convolutional layer, non-linear activation

function, batch normalization, and pooling layer. Given a batch of images $X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{C \times W \times H}$ with class labels $C = \{c_1, c_2, \ldots, c_n\}$, the output of $f_e(\cdot;\varphi)$ is:

$$V = f_e(x_i; \varphi) = \mathrm{down\_scale}(\delta(\mathrm{BN}(\mathrm{conv}(x_i)))) \quad (1)$$

where BN denotes batch normalization, $\delta$ is the non-linear activation function (ReLU), conv is the convolution operation with a 3×3 kernel size, $\varphi$ denotes the trainable weights, and down_scale denotes the max pooling operation.
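As a quick check on the output sizes listed in Table I, the spatial resolution can be traced through the pooling stages. The sketch assumes that every 3×3 convolution preserves the spatial size (stride 1 with padding 1) and that each 2×2 max pool uses stride 2 without padding; the helper name `pooled_size` is hypothetical.

```python
def pooled_size(size, kernel=2, stride=2):
    """Side length after a max pool without padding."""
    return (size - kernel) // stride + 1

size = 224            # input resolution used in the experiments
trace = [size]
for _ in range(5):    # one stem pool, then one pool in each of the four residual blocks
    size = pooled_size(size)
    trace.append(size)
print(trace)          # [224, 112, 56, 28, 14, 7]
```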

C. Metric Module

Generally, the aim of metric learning is to maximize the

inter-class variations and minimize the intra-class variations by

a metric function. To further improve the aggregation and

separation of features, we introduce the attention mechanism

into the metric module to capture the key information, followed

by two loss functions.

Channel attention learns “what” to attend to: it estimates the importance of each feature channel to the key information and emphasizes the channels with large weights, which improves the feature representation of discriminative semantics (as shown in Fig. 3).
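A minimal pure-Python sketch of this channel attention follows: global average pooling per channel, two 1×1 convolutions with ReLU and sigmoid, then channel reweighting, as formalized in Eqs. (2)-(4). The weights and the channel reduction from 2 to 1 are hypothetical toy values; the paper does not state a reduction ratio. On a pooled C-vector, a 1×1 convolution reduces to a matrix-vector product, which is what the code uses.

```python
import math

def channel_attention(V, W1, W2):
    """V: C x h x w nested lists; W1: R x C; W2: C x R. Returns (E, M)."""
    # Squeeze: global average pooling over the spatial dimensions.
    U = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in V]
    # First 1x1 convolution followed by ReLU.
    hidden = [max(0.0, sum(w * u for w, u in zip(row, U))) for row in W1]
    # Second 1x1 convolution followed by sigmoid gives one gate per channel.
    M = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in W2]
    # Reweight every channel of V by its gate.
    E = [[[m * v for v in row] for row in ch] for m, ch in zip(M, V)]
    return E, M

V = [[[1.0, 2.0], [3.0, 4.0]],   # channel 1, spatial mean 2.5
     [[0.0, 0.0], [0.0, 0.0]]]   # channel 2, spatial mean 0
W1 = [[1.0, 1.0]]                # hypothetical weights, 2 channels -> 1
W2 = [[1.0], [-1.0]]             # hypothetical weights, 1 -> 2 channels
E, M = channel_attention(V, W1, W2)
print([round(m, 3) for m in M])
```

With these toy weights the informative first channel receives a gate close to 1 and the empty second channel a gate close to 0.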

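Let $V = [v_1, v_2, \ldots, v_C] \in \mathbb{R}^{C \times w \times h}$ represent the encoder module output. First, we adopt a global average pooling operation to fuse the features of V over the spatial dimension w×h to produce a channel-wise descriptor $U = [u_1, u_2, \ldots, u_C] \in \mathbb{R}^{C}$:

$$u_c = \frac{1}{w \times h} \sum_{i=1}^{w} \sum_{j=1}^{h} v_c(i, j) \quad (2)$$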

Second, we adopt two 1×1 convolution layers to weight U to capture channel-wise dependencies, and we apply the sigmoid function to the final feature maps:

$$M = \sigma(W_2\, \delta(W_1 U)) = [m_1, m_2, \ldots, m_C] \quad (3)$$

where $\sigma$ and $\delta$ denote the sigmoid and ReLU functions, respectively, and $W_1$ and $W_2$ are 1×1 convolution operations.

Third, we use M to reweight the channels of the original feature map V to obtain the new feature distribution:

$$E = [m_1 v_1, m_2 v_2, \ldots, m_C v_C] \in \mathbb{R}^{C \times w \times h} \quad (4)$$

TABLE I
DETAILS OF ENCODER MODULE

Stage   Type                                    Output
-       3×3 conv, stride = 1                    224×224
-       2×2 max pool, stride = 2                112×112
R1      [conv 3×3 + BN + ReLU, max pool 2×2]    56×56
R2      [conv 3×3 + BN + ReLU, max pool 2×2]    28×28
R3      [conv 3×3 + BN + ReLU, max pool 2×2]    14×14
R4      [conv 3×3 + BN + ReLU, max pool 2×2]    7×7

Fig. 3. The overview architecture of the attention mechanism

Spatial attention learns “where” to attend, focusing on the spatial location information of key features. A convolution operation, whatever its kernel size, can only obtain the information of one local neighborhood at a time. To obtain better spatial information, we consider all the feature positions. Inspired by non-local neural networks [38], we add a non-local block into the metric module to obtain spatial attention, and the details are shown in Fig. 3. The non-local operation can be defined as:



$$y_i = \frac{1}{C(V)} \sum_{\forall j} f(V_i, V_j)\, g(V_j) \quad (5)$$

where V is the input feature calculated by Eq. (1), y denotes the output of the non-local operation, i is the output position index, and j is the index of all possible locations in the V feature. The bivariate function $f(V_i, V_j)$ calculates the weight between positions i and j in feature V and outputs a scalar. The unary function g calculates the characterization value of V at position j. C(V) represents the normalization factor. We use the Gaussian function as the bivariate function $f(V_i, V_j)$ to calculate the weight between positions i and j in feature V, which is defined as:

$$f(V_i, V_j) = e^{V_i^{T} V_j} \quad (6)$$

where $V_i^{T} V_j$ denotes the dot product. We use linear weighting for $g(V_j) = W_g V_j$ with the trainable weights $W_g$. The output of spatial attention S is calculated as:

$$S_i = W_s y_i + V_i = W_s\, \mathrm{softmax}(V_i^{T} V_j)\, g(V_j) + V_i \quad (7)$$

where the softmax is taken over the positions j, and $W_s$ is the trainable weight.
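The non-local spatial attention of Eqs. (5)-(7) can be sketched in pure Python as follows. The feature map is assumed to be flattened into a list of N = w·h position vectors of length C, and `Wg` and `Ws` are plain C×C matrices standing in for the trainable weights; the function name `non_local` is hypothetical.

```python
import math

def non_local(V, Wg, Ws):
    """V: list of N position vectors (length C). Wg, Ws: C x C matrices."""
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    G = [matvec(Wg, v) for v in V]                    # g(V_j) = W_g V_j
    S = []
    for vi in V:
        scores = [dot(vi, vj) for vj in V]            # V_i^T V_j for every j
        top = max(scores)                             # subtract max for stability
        weights = [math.exp(s - top) for s in scores] # Gaussian pairwise function
        total = sum(weights)                          # normalization factor C(V)
        y = [sum(w / total * g[c] for w, g in zip(weights, G))
             for c in range(len(vi))]
        # Residual connection: W_s y_i + V_i.
        S.append([a + b for a, b in zip(matvec(Ws, y), vi)])
    return S

identity = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S = non_local(V, identity, identity)
print(len(S), len(S[0]))  # 3 2
```

Because of the residual term, setting `Ws` to zero returns the input unchanged, which is a convenient sanity check.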

The final output feature map of the metric module is the fusion of channel attention and spatial attention, followed by a convolutional layer and a nonlinear activation function:

$$F = \delta(\mathrm{conv}(S \oplus E;\ \theta)) \quad (8)$$

where $\oplus$ denotes the fusion of the two attention outputs, $\delta$ denotes the ReLU activation function and $\theta$ represents the trainable weight.

After the attention features are obtained from the encoder output, an effective metric function is needed to improve the discriminant ability of the model and generalize it to the novel classes Dnovel. In this module, we introduce center loss [39] to minimize the intra-class variations. Center loss learns the feature center of each class and penalizes the distance between the features and the center of the corresponding class. The formulation for center loss is as follows:

$$L_c = \frac{1}{2} \sum_{i \in B} \left\| x_i - z_{y_i} \right\|_2^2 \quad (9)$$

where $z_{y_i}$ denotes the center of class $y_i$ for the deep features $x_i$ extracted by the metric module, and B represents a mini-batch.
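A minimal sketch of the center loss over one mini-batch, assuming the standard form from [39] (half the summed squared Euclidean distance between each feature and its class center). The function name, the toy features and the zero-valued centers are hypothetical.

```python
def center_loss(features, labels, centers):
    """0.5 * sum over the mini-batch of ||x_i - z_{y_i}||^2."""
    total = 0.0
    for x, y in zip(features, labels):
        z = centers[y]                                   # center of the sample's class
        total += sum((a - b) ** 2 for a, b in zip(x, z))
    return 0.5 * total

features = [[1.0, 0.0], [0.0, 2.0]]
labels = [0, 1]
centers = {0: [0.0, 0.0], 1: [0.0, 0.0]}
print(center_loss(features, labels, centers))  # 2.5
```

The loss only pulls features toward their own class center; it does not by itself push different classes apart, which is why a second term is needed.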

Intuitively, for few-shot classification tasks, the center loss

function can minimize the spatial distance between the same

categories. However, the differences between some categories

are very small, and how to keep the features of different classes

separable is important. In this paper, we let the module learn a

discriminant function that can maximize the inter-class

variations. The discriminant function can be formulated as:



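$$L_d = -\log \frac{\exp\left(-ED(o_{y_i}, z_{y_i})\right)}{\sum_{k} \exp\left(-ED(o_k, z_{y_i})\right)} \quad (10)$$

where ED(⋅,⋅) denotes the standardized Euclidean distance and $o_k$ is the k-th class average feature in every mini-batch, which is defined as:

$$o_k = \frac{1}{|B_k|} \sum_{x_i \in B_k} F(x_i), \quad B_k = \{x_i \in B \subset D_{train} : y_i = k\} \quad (11)$$

where $D_{train}$ is the basic class dataset and B represents a mini-batch.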

The final metric function is defined as:

$$L_{metric} = L_c + L_d \quad (12)$$

D. Loss Function

In this paper, we adopt joint supervision to optimize the

model. First, we put the vector $F = [f_1, f_2, \ldots, f_i]$ defined in Eq. (8) into a classifier. In classification tasks based on convolutional neural networks, a fully connected layer with softmax is usually used as the classifier, and it outputs the probability $P = p(c_i)$ of the $c_i$ category:

$$P = p(c_i) = \frac{e^{f_i}}{\sum_{j} e^{f_j}} \quad (13)$$

Next, we compute the loss of input samples xi belonging to the target category ci:

$$L_{CE} = -\frac{1}{N} \sum_{i} \left[ q_i \log p_i + (1 - q_i) \log (1 - p_i) \right] \quad (14)$$

where N is the number of samples in a mini-batch, and $q_i$ and $p_i$ represent the ground truth and predicted label probabilities, respectively. The final loss $L_{final}$ is defined as:

$$L_{final} = L_{CE} + \lambda L_{metric} \quad (15)$$

where $\lambda$ is the balance parameter for the trade-off between distribution and generalization. A smaller parameter value indicates that the model tends to extract more robust and generalized features. A larger parameter value indicates that the model focuses on learning the spatial distribution of the features. In the experiments, we analyze the influence of the parameter $\lambda \in [0, 1]$.

E. Classifier fine-tuning

Classifier fine-tuning is the test phase in few-shot learning

for pavement distress classification. For the supervised CNN-

based classification task, the CNN network is trained and

optimized repeatedly on the training dataset to obtain an

optimal model encapsulating classification weights and then

computes classification scores on the test set. However, these encapsulated classification weights do not fit new classes (with only a few labeled samples) that are not included in the training set. In this work, the cosine classifier [9] is used as a similarity classifier for few-shot tasks, which can be defined as:

$$\mathrm{CosineSimilarity}(x_s, x_q) = \frac{x_s \cdot x_q}{\|x_s\|_2\, \|x_q\|_2} \quad (16)$$

where $\cdot$ denotes the dot product and $\|\cdot\|_2$ represents the L2 norm. xs and xq denote the support and query feature vectors extracted by the above metric module. By calculating the similarity of the two feature vectors, the classifier outputs the predicted class of the query samples.
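A minimal sketch of cosine-similarity classification at test time follows. How the K support features of one class are combined in the multi-shot case is not specified in the text; averaging them into one vector per class is an assumption made here, and the class names and feature values are toy data.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def classify(query_feat, support_feats):
    """support_feats: {label: [feature vectors]}. Returns (label, score)."""
    best_label, best_score = None, -2.0
    for label, feats in support_feats.items():
        # Assumed aggregation: average the K shots of each class.
        mean_feat = [sum(col) / len(feats) for col in zip(*feats)]
        score = cosine_similarity(query_feat, mean_feat)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

support_feats = {"crack": [[1.0, 0.0]], "pothole": [[0.0, 1.0]]}
label, score = classify([0.9, 0.1], support_feats)
print(label)  # crack
```

Because cosine similarity ignores vector length, the prediction depends only on the direction of the features, not on their magnitude.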

IV. EXPERIMENTS

A. Implementation details

1) Parameter Setting: Our method employs a basic encoder

module together with a metric module for multi-target few-shot

pavement distress classification. For the basic encoder module,

ResNet-18 is employed as the backbone network. During the

training, the learning rate is 0.001 and halved every 10 epochs.

The weights of the newly added convolutional layers are

initialized with the “Xavier” scheme. We train the

model for a total of 100 epochs.
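The learning-rate schedule described above (0.001, halved every 10 epochs, 100 epochs in total) amounts to a step decay. The sketch below assumes that epochs are counted from 0 and that the first halving happens at epoch 10; the function name is hypothetical. In PyTorch the same behavior is usually obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)`, although whether the authors used that scheduler is not stated.

```python
def learning_rate(epoch, base_lr=0.001, step=10, gamma=0.5):
    """Step decay: multiply the rate by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

for epoch in (0, 9, 10, 25, 99):
    print(epoch, learning_rate(epoch))
```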

2) Computation Platform: The experiments are implemented

using PyTorch framework on NVIDIA GTX TITAN GPU on

Ubuntu 16.04 Linux. https://github.com/DHW-Master/FS_PDD.git.

3) Evaluation: The classification accuracy is adopted to

evaluate the experimental results, which is defined as:

$$\mathrm{accuracy} = \frac{1}{T_s} \sum_{i=1}^{T_s} \frac{r(i)}{Q(i)} \quad (17)$$

where r(i) and Q(i) denote the number of correctly classified samples and the number of query samples in the i-th test episode, respectively. Ts denotes the number of test episodes.
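The evaluation metric can be sketched as the mean of per-episode accuracies. The tables report 95% confidence intervals, but the text does not say how they are computed; the normal-approximation half-width used below (1.96 times the standard error) is an assumption, and the episode counts are toy values.

```python
import math

def episode_accuracy(correct, total):
    """Mean per-episode accuracy and an assumed 95% half-width."""
    accs = [r / q for r, q in zip(correct, total)]
    t = len(accs)
    mean = sum(accs) / t
    # Sample variance across episodes (undefined for a single episode).
    var = sum((a - mean) ** 2 for a in accs) / (t - 1) if t > 1 else 0.0
    half_width = 1.96 * math.sqrt(var / t)
    return mean, half_width

# Three toy episodes with 75 query samples each.
mean, half_width = episode_accuracy(correct=[60, 45, 75], total=[75, 75, 75])
print(round(mean, 4), round(half_width, 4))
```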

B. Results

1) Classification on Pavement Distress Dataset: We use the pavement distress images from [40], which cover 10 different classes, with each image at 640×640 resolution. In this work, we reorganized these distresses so that each class contains approximately 300 samples with 224×224 resolution. Some samples in this dataset are shown in Fig. 4, and we can observe that the conditions of the samples in this dataset are complex and changeable, such as uneven brightness, low contrast, and the presence of oil stains and zebra crossings, which make the detection more challenging. In the experiments, we build two splits of the dataset, as listed in Table Ⅲ. In each split, five classes serve as base classes to train the model, and the other five serve as novel classes to evaluate the few-shot task. The numeric results presented in Table Ⅱ show that, compared with other methods, our method achieves a mean classification accuracy of 77.20% on 5-way 1-shot and 87.28% on 5-way 5-shot.

2) Classification on MVTec Dataset: The MVtec dataset

contains 1709 high-resolution images of 15 different classes.

Each class contains defect-free images and different types of

TABLE Ⅱ

FIVE-WAY FEW-SHOT CLASSIFICATION ACCURACY ON THE PAVEMENT DISTRESS DATA SET (AVERAGE OF 50 TEST EPISODES AND EACH EPISODE CONTAINS 75

QUERY SAMPLES WITH 95% CONFIDENCE INTERVALS)

Methods Backbone

5-way Accuracy (%)

1-shot 5-shot

Data set1 Data set2 Mean Data set1 Data set2 Mean

Prototypical Net [11] 64-64-64-64 62.23  0.98 46.95  1.02 54.59  1.01 75.70  0.86 64.72  0.96 70.21  0.93

Matching Net [10] 64-64-64-64 60.83  0.99 57.12  1.00 58.97  0.99 68.43  0.94 74.30  0.88 71.36  0.91

Relation Net [12] 64-96-128-256 64.26  0.97 54.13  1.01 59.19  0.99 69.07  0.93 66.54  0.95 67.80  0.94

MAML [30] 32-32-32-32 59.00  0.99 57.86  1.00 58.43  0.99 73.70  0.89 73.90  0.89 73.80  0.89

Ours ResNet-18 75.00  0.88 79.40  0.82 77.20  0.85 86.53  0.69 88.03  0.66 87.28  0.67

Fig. 4. Example samples of pavement distress dataset, (a) Alligator, (b) Block,

longitudinal, (h) Sealed-reflective, (i) Transvers, (j) Sealed-alligator.

TABLE Ⅲ

THE DETAILS OF PAVEMENT DISTRESS DATASET

Dataset

Data set1 Data set2

Basic training

classes Novel classes Basic training

classes Novel classes

Classes

name

Alligator

Transvers

Lane-longitudinal

Longitudinal

Sealed-reflective

Reflective

Sealed-longitudinal

Block

Pothole

Sealed-alligator

Reflective

Sealed-longitudinal

Block

Pothole

Sealed-alligator

Alligator

Transvers

Lane-longitudinal

Longitudinal

Sealed-reflective

TABLE Ⅳ

FIVE-WAY FEW-SHOT CLASSIFICATION ACCURACY ON THE MVTec DATA

SET (AVERAGE OF 50 TEST EPISODES AND EACH EPISODE CONTAINS 75

QUERY SAMPLES WITH 95% CONFIDENCE INTERVALS)

Method Backbone MVTec 5-way Accuracy (%)

1-shot 5-shot

Prototypical Net [11] 64-64-64-64 92.75  0.52 94.85  0.44

Matching Net [10] 64-64-64-64 89.28  0.63 92.54  0.53

Relation Net [12] 64-96-128-256 92.57  0.53 93.59  0.49

MAML [30] 32-32-32-32 70.96  0.92 89.77  0.61

Ours ResNet-18 95.33  0.42 99.60  0.13

TABLE Ⅴ

FIVE-WAY FEW-SHOT CLASSIFICATION ACCURACY ON THE miniImageNet

DATA SET (AVERAGE OF 50 TEST EPISODES AND EACH EPISODE CONTAIN S 75

QUERY SAMPLES WITH 95% CONFIDENCE INTERVALS)

Method Backbone

miniImageNet 5-way

Accuracy (%)

1-shot 5-shot

Prototypical Net [11] 64-64-64-64 49.42  0.78 68.20  0.66

Matching Net [10] 64-64-64-64 43.56  0.84 55.31  0.73

Relation Net [12] 94-96-128-256 50.44  0.82 65.32  0.70

MAML [30] 32-32-32-32 48.70  1.84 63.11  0.92

Shot-Free [41] ResNet-12 59.04  n/a 77.64  n/a

MetaOptNet [42] ResNet-12 62.64  0.61 78.63  0.46

CTM [43] ResNet-18 64.12  0.82 80.51  0.13

RFS [44] ResNet-12 64.82  0.60 82.14  0.43

DeepEMD [45] ResNet-12 65.91  0.82 82.41  0.56

Ours ResNet-18 70.40  0.93 84.40  0.73

anomalous images. In the experiments, we validate our method on the anomalous images in this dataset. We reorganize the MVTec dataset and expand it by mirroring, flipping, and rotation. The reorganized MVTec dataset consists of 66 classes, each containing approximately 130 samples; 40 classes are randomly selected as the base classes, and the remaining 26 serve as novel classes to verify the few-shot task. The numeric results are presented in Table Ⅳ: our method achieves 95.33% classification accuracy on 5-way 1-shot and 99.60% on 5-way 5-shot, exceeding the best compared method (Prototypical Net, 92.75% and 94.85%).
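The expansion step mentioned above (mirroring, flipping, and rotation) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `augment` is hypothetical, and restricting rotation to multiples of 90 degrees is an assumption, since the rotation angles are not stated in the text.

```python
import numpy as np

def augment(image):
    """Return the original image plus mirrored, flipped, and rotated copies.

    `image` is an array of shape (H, W) or (H, W, C).
    """
    variants = [image]
    variants.append(np.fliplr(image))      # mirroring: reverse the column order
    variants.append(np.flipud(image))      # flipping: reverse the row order
    for k in (1, 2, 3):                    # assumed rotation angles: 90, 180, 270 degrees
        variants.append(np.rot90(image, k=k))
    return variants

# Usage on a toy 3x4 "image"; a real sample would be a pixel array.
toy = np.arange(12).reshape(3, 4)
augmented = augment(toy)
```

Each input yields six samples in this sketch. Rotation by arbitrary angles would need interpolation and is left out here.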

3) Classification on miniImageNet Dataset: The miniImageNet dataset is a standard benchmark in recent few-shot learning work. It consists of 100 classes randomly sampled from ImageNet, and each class contains 600 samples of size 84×84. It is split into 64 base classes, 16 validation classes, and 20 novel classes. The numeric results presented in Table Ⅴ show that our method achieves 70.40% classification accuracy on 5-way 1-shot and 84.40% on 5-way 5-shot, compared with 65.91% and 82.41% for the strongest compared method (DeepEMD).
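The episodic protocol behind Tables Ⅳ and Ⅴ (5-way episodes with 75 query samples, averaged over test episodes with 95% confidence intervals) can be sketched as below. This is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, 15 query samples per class is inferred from 75 queries over 5 classes, and the normal-approximation interval (factor 1.96) is a common convention that the text does not confirm.

```python
import numpy as np

def sample_episode(labels, n_way=5, k_shot=1, n_query=15, rng=None):
    """Pick n_way classes, then k_shot support and n_query query indices per class."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])                     # labeled examples of class c
        query.extend(idx[k_shot:k_shot + n_query])       # disjoint examples to classify
    return np.array(support), np.array(query)

def mean_and_ci95(episode_accuracies):
    """Mean accuracy and half-width of a normal-approximation 95% interval."""
    acc = np.asarray(episode_accuracies, dtype=float)
    half_width = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
    return acc.mean(), half_width

# Usage: 20 novel classes with 600 samples each, as in the miniImageNet split.
labels = np.repeat(np.arange(20), 600)
support_idx, query_idx = sample_episode(labels, rng=np.random.default_rng(0))
```

Per-episode accuracies collected over the test episodes would then be passed to `mean_and_ci95` to produce entries of the form "mean ± half-width".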

C. Ablation Studies and Discussion

We conduct ablation studies and discussions to analyze how

each component affects the performance of the proposed

method. We mainly consider four ablation components:

backbone networks, loss function, attention mechanism module,

and balance hyperparameter.

1) Ablation study of different backbone networks: In the experiments, we use different backbone networks as the encoder module to verify their influence on the performance of our method. We run all the experiments on the pavement distress dataset. The classification accuracy is listed in Table Ⅵ, from which we can observe that performance improves as the backbone deepens from the shallow convolutional networks to ResNet-18, and that selecting ResNet-18 as the backbone significantly improves the performance of the proposed method. Our analysis shows that most of the features extracted by a shallow network are low-level features, which cannot effectively represent object category information. A deeper network can extract high-level semantic features, which are crucial to the category information of the object. However, as the network depth increases further (ResNet-50 and ResNet-101), the network becomes more complex and the performance of the model degrades, because the parameters over-fit the training set and the model cannot generalize effectively to new categories. Using ResNet-18 as the encoder, our method achieves mean accuracies of 77.20% and 87.28% for 5-way 1-shot and 5-way 5-shot on the pavement distress dataset, respectively.

2) Ablation study of loss function: We conduct ablation studies to verify the contribution of each component of the loss function. As mentioned in the paper, the purpose of our method is to learn a metric space from a small number of samples that minimizes the intra-class distance and maximizes the inter-class distance, thereby improving the classification accuracy.

To illustrate this point, we visualize the spatial distribution of the features extracted by our model on the pavement distress dataset, as shown in Fig. 5. The first row of images shows the

TABLE Ⅵ

CLASSIFICATION ACCURACY WITH DIFFERENT BACKBONE NETWORKS AND LOSS FUNCTION ON PAVEMENT DISTRESS DATA SET (5-WAY ACCURACY, %)

| Ablation | Method | 1-shot: Data set1 | 1-shot: Data set2 | 1-shot: Mean | 5-shot: Data set1 | 5-shot: Data set2 | 5-shot: Mean |
|---|---|---|---|---|---|---|---|
| Backbone | 32-32-32-32 | 54.07 ± 1.01 | 56.08 ± 1.00 | 55.08 ± 1.01 | 62.56 ± 0.98 | 68.93 ± 0.93 | 65.75 ± 0.96 |
| | 64-64-64-64 | 54.10 ± 1.01 | 58.13 ± 0.99 | 56.12 ± 1.00 | 63.40 ± 0.97 | 70.27 ± 0.92 | 66.84 ± 0.95 |
| | 64-96-128-256 | 55.33 ± 1.00 | 58.67 ± 0.99 | 57.00 ± 1.00 | 65.87 ± 0.96 | 71.20 ± 0.91 | 68.54 ± 0.94 |
| | ResNet-18 | 75.00 ± 0.88 | 79.40 ± 0.82 | 77.20 ± 0.85 | 86.53 ± 0.69 | 88.03 ± 0.66 | 87.28 ± 0.67 |
| | ResNet-50 | 65.89 ± 0.95 | 61.20 ± 0.98 | 63.54 ± 0.97 | 72.83 ± 0.90 | 72.16 ± 0.91 | 72.49 ± 0.91 |
| | ResNet-101 | 62.96 ± 0.98 | 62.64 ± 0.99 | 62.80 ± 0.98 | 71.65 ± 0.91 | 71.31 ± 0.92 | 71.48 ± 0.92 |
| Loss | ResNet-18+LCE | 65.12 ± 0.96 | 78.07 ± 0.83 | 71.60 ± 0.91 | 79.96 ± 0.81 | 86.14 ± 0.70 | 83.05 ± 0.76 |
| | ResNet-18+Att.+LCE | 67.12 ± 0.95 | 78.80 ± 0.82 | 72.96 ± 0.90 | 84.50 ± 0.73 | 85.60 ± 0.71 | 85.05 ± 0.72 |
| | ResNet-18+Att.+Lfinal | 75.00 ± 0.88 | 79.40 ± 0.82 | 77.20 ± 0.85 | 86.53 ± 0.69 | 88.03 ± 0.66 | 87.28 ± 0.67 |

Fig. 5. The spatial distribution of features in the experiments; different colors denote different classes. The first row shows the features learned under LCE, the second row shows the features learned under Ld, and the third row shows the features learned under Lmetric.

feature space distribution under the cross-entropy loss function, where the features overlap heavily and cannot be distinguished. The second row of images shows the feature space distribution under the Ld loss function. As the number of epochs increases, features of different categories become clearly separated, but the features of the same category remain widely scattered. The third row of images shows the learned feature space distribution under the Lmetric loss function. As the number of epochs increases, the model makes the distance between features of the same category smaller and the distance between features of different categories larger. The numeric results presented in Table Ⅵ show that our loss function improves the mean classification accuracy from 72.96% to 77.20% for 5-way 1-shot and from 85.05% to 87.28% for 5-way 5-shot.
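The qualitative trend described for Fig. 5 (tighter clusters within a category, larger gaps between categories) can also be summarized numerically. The following is a minimal sketch and not the paper's loss or evaluation code: the function name `class_distance_summary` and the use of Euclidean distance to class centers are assumptions made for illustration.

```python
import numpy as np

def class_distance_summary(features, labels):
    """Return (mean intra-class distance, mean inter-class center distance).

    features: array of shape (n_samples, dim); labels: array of shape (n_samples,).
    Requires at least two distinct classes.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)                      # sorted unique class ids
    centers = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Intra-class: average distance from each sample to its own class center.
    own_center = centers[np.searchsorted(classes, labels)]
    intra = np.linalg.norm(features - own_center, axis=1).mean()
    # Inter-class: average distance over all pairs of distinct class centers.
    diffs = centers[:, None, :] - centers[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=-1)
    inter = pairwise[np.triu_indices(len(classes), k=1)].mean()
    return intra, inter

# Usage on four 2-D toy features from two classes.
toy_features = [[0, 0], [0, 2], [10, 0], [10, 2]]
toy_labels = [0, 0, 1, 1]
intra, inter = class_distance_summary(toy_features, toy_labels)
```

Under this summary, an embedding that behaves like the third row of Fig. 5 would show a small intra-class value together with a large inter-class value.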

3) Ablation study of attention mechanism module: In few-shot classification in a complex environment, it is important to obtain robust features from a well-trained feature extractor. In this paper, we introduce an attention mechanism module with a parallel strategy for this purpose, which simultaneously generates channel and spatial attention information. In the experiments, we evaluate the influence of different attention mechanisms. Four attention modules (GC-Net, SENet, ECA-Net, and CBAM) are compared with ours on the pavement distress dataset. The experimental results are listed in Table Ⅶ, from which we can observe that our attention mechanism module outperforms all competing attention mechanisms.
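To make the idea of a parallel strategy concrete, the sketch below computes channel attention and spatial attention from the same input and fuses the two branches. It is a parameter-free illustration only and should not be read as the module proposed in the paper: a real module would use learned layers in each branch, and both the function name `parallel_attention` and the fusion by summation are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def parallel_attention(feature_map):
    """Apply channel and spatial attention in parallel to a (C, H, W) feature map."""
    feature_map = np.asarray(feature_map, dtype=float)
    # Channel branch ("what"): one weight per channel from global average pooling.
    channel_weights = sigmoid(feature_map.mean(axis=(1, 2)))      # shape (C,)
    channel_out = feature_map * channel_weights[:, None, None]
    # Spatial branch ("where"): one weight per location from the channel-wise mean.
    spatial_weights = sigmoid(feature_map.mean(axis=0))           # shape (H, W)
    spatial_out = feature_map * spatial_weights[None, :, :]
    # Both branches read the same input; fusion by summation is an assumption.
    return channel_out + spatial_out

# Usage on a random feature map with 8 channels and 4x4 spatial size.
rng = np.random.default_rng(0)
attended = parallel_attention(rng.normal(size=(8, 4, 4)))
```

A sequential design such as CBAM would instead feed the output of the channel branch into the spatial branch; the parallel form keeps the two estimates independent of each other.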

4) Ablation study of the balance hyperparameter: The balance hyperparameter in Eq. (15) balances the penalty term in Lfinal. In the experiment, we study the performance of our method on the pavement distress dataset under different values of this hyperparameter. As shown in Fig. 6, our method performs best when its value lies in the interval [0.6, 0.7].

D. Visualization Results

The confusion matrices of our method on the Dnovel samples of the pavement distress dataset, given one and five labeled samples, are shown in Fig. 7, where the entry at the intersection of the i-th row and j-th column denotes the rate at which query samples of the i-th class are classified as the j-th class. In Fig. 7, we can see that given one labeled sample, our model performs poorly on category 3; given five labeled samples, the accuracy on category 3 improves greatly.
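The row-wise rates described above can be computed from query predictions as in the following sketch. It is a minimal illustration; the function name `confusion_rates` and the integer class encoding (0 to n_classes - 1) are assumptions, not details taken from the paper.

```python
import numpy as np

def confusion_rates(true_labels, predicted_labels, n_classes):
    """Row-normalized confusion matrix.

    Entry (i, j) is the fraction of query samples of class i predicted as class j.
    """
    counts = np.zeros((n_classes, n_classes), dtype=float)
    # Accumulate one count per (true, predicted) pair, including repeated pairs.
    np.add.at(counts, (np.asarray(true_labels), np.asarray(predicted_labels)), 1)
    row_totals = counts.sum(axis=1, keepdims=True)
    # Rows without any query sample stay all-zero instead of dividing by zero.
    return np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)

# Usage with three classes and six query samples.
rates = confusion_rates([0, 0, 1, 1, 1, 2], [0, 1, 1, 1, 0, 2], n_classes=3)
```

The diagonal of the result gives the per-class accuracy, which is the quantity discussed for category 3.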

V. CONCLUSION

In this work, we introduce a deep metric learning-based method for multi-target few-shot pavement distress classification. Our model contains two modules. First, we design a baseline encoder module to extract multi-level features from the input, because robust features are important for the result. Second, we introduce a novel metric module. In the metric module, channel attention is adopted to learn the “what” features by extracting the importance of different channel features to the key information and focusing on the

TABLE Ⅶ

CLASSIFICATION ACCURACY WITH DIFFERENT ATTENTION MECHANISMS ON PAVEMENT DISTRESS DATA SET (5-WAY ACCURACY, %)

| Method | Backbone | 1-shot: Data set1 | 1-shot: Data set2 | 1-shot: Mean | 5-shot: Data set1 | 5-shot: Data set2 | 5-shot: Mean |
|---|---|---|---|---|---|---|---|
| GC-Net [46] | ResNet-18 | 67.25 ± 0.95 | 76.29 ± 0.86 | 71.77 ± 0.74 | 76.85 ± 0.86 | 85.79 ± 0.71 | 81.32 ± 0.79 |
| SENet [47] | ResNet-18 | 68.88 ± 0.94 | 72.16 ± 0.91 | 70.52 ± 0.95 | 79.52 ± 0.82 | 83.63 ± 0.75 | 81.58 ± 0.78 |
| ECA-Net [48] | ResNet-18 | 67.04 ± 0.95 | 72.88 ± 0.90 | 69.96 ± 0.93 | 76.75 ± 0.85 | 84.08 ± 0.74 | 80.42 ± 0.80 |
| CBAM [49] | ResNet-18 | 70.11 ± 0.92 | 72.11 ± 0.91 | 71.11 ± 0.92 | 79.31 ± 0.82 | 81.17 ± 0.79 | 80.24 ± 0.81 |
| Ours | ResNet-18 | 75.00 ± 0.88 | 79.40 ± 0.82 | 77.20 ± 0.85 | 86.53 ± 0.69 | 88.03 ± 0.66 | 87.28 ± 0.67 |

Fig. 6. The verification accuracies with different values of the balance hyperparameter for 5-way N-shot (N = 1, 5) on the pavement distress dataset.

Fig. 7. The confusion matrices of our method. (a) is the result with one label (1-shot), and (b) is the result with five labels (5-shot).

information with large weights according to its degree of importance. Spatial attention is used to learn the “where” features, focusing on the spatial location of key features. Furthermore, we introduce a new metric loss function, which guides the model to make the distance between features of the same category smaller and the distance between features of different categories larger. Experimental results show that the proposed method outperforms the compared methods on the pavement distress classification task.

REFERENCES

[1] Y. He, K. Song, Q. Meng and Y. Yan, "An End-to-End Steel Surface

Defect Detection Approach via Fusing Multiple Hierarchical Features,"

in IEEE Transactions on Instrumentation and Measurement, vol. 69, no.

4, pp. 1493-1504, April 2020.

[2] D. Zhang, K. Song, Q. Wang, Y. He, X. Wen and Y. Yan, "Two Deep

Learning Networks for Rail Surface Defect Inspection of Limited

Samples with Line-Level Label," in IEEE Transactions on Industrial

Informatics. doi: 10.1109/TII.2020.3045196

[3] M. Niu, K. Song, L. Huang, Q. Wang, Y. Yan and Q. Meng,

"Unsupervised Saliency Detection of Rail Surface Defects Using

Stereoscopic Images," in IEEE Transactions on Industrial Informatics,

vol. 17, no. 3, pp. 2271-2281, March 2021

[4] Y. Bao, K. Song, J. Liu, Y. Wang, Y. Yan, H. Yu, and X. Li, "Triplet-

Graph Reasoning Network for Few-Shot Metal Generic Surface Defect

Segmentation," in IEEE Transactions on Instrumentation and

Measurement, vol. 70, pp. 1-11, 2021, Art no. 5011111, doi:

10.1109/TIM.2021.3083561.

[5] R. Kapela et al., “Asphalt surfaced pavement cracks detection based on

histograms of oriented gradients,” in International Conference Mixed

Design of Integrated Circuits & Systems (MIXDES), Torun, 2015, pp.

579-584.

[6] Y. Hu, and C. Zhao. “A novel LBP based methods for pavement crack

detection,” J. Pattern Recognit. Res., vol. 5, no. 1, pp. 140-147, 2010.

[7] P. Subirats, J. Dumoulin, V. Legeay and D. Barba, “Automation of

Pavement Surface Crack Detection using the Continuous Wavelet

Transform,” in International Conference on Image Processing, Atlanta,

GA, 2006, pp. 3037-3040.

[8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for

large-scale image recognition,” in Proc. Int. Conf. Learn.

Representations, 2015.

[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image

recognition,” in Proc. Comput. Vis. Pattern Recognit., Jun. 2016, pp.

770–778.

[10] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra,

“Matching Networks for One Shot Learning,” NIPS, 2016.

[11] J. Snell, K. Swersky, and R. S. Zemel, “Prototypical Networks for Few-

shot Learning,” NIPS, 2017.

[12] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M.

Hospedales, “Learning to Compare: Relation Network for Few-Shot

Learning,” CVPR, 2017.

[13] F. Liu, G. Xu, Y. Yang, X. Niu, and Y. Pan, “Novel approach to

pavement cracking automatic detection based on segment extending,” in

Proc. Int. Symp. Knowl. Acquisition Modeling, Dec. 2008, pp. 610–614.

[14] W. Xu, Z. Tang, J. Zhou, and J. Ding, “Pavement crack detection based

on saliency and statistical features,” in Proc. IEEE Int. Conf. Image

Process. (ICIP), Sep. 2013, pp. 4093–4097.

[15] H. Zakeri, F. M. Nejad, A. Fahimifar, A. D. Torshizi, and M. H. F.

Zarandi, “A multi-stage expert system for classification of pavement

cracking,” in Proc. Joint IFSA World Congr. NAFIPS Annu. Meeting,

Jun. 2013, pp. 1125–1130.

[16] Y. Shi, L. Cui, Z. Qi, F. Meng, and Z. Chen, “Automatic road crack

detection using random structured forests,” IEEE Trans. Intell. Transp.

Syst., vol. 17, no. 12, pp. 3434–3445, Dec. 2016.

[17] R. Amhaz, S. Chambon, J. Idier and V. Baltazart, “Automatic Crack

Detection on Two-Dimensional Pavement Images: An Algorithm Based

on Minimal Path Selection,” in IEEE Trans. Intell. Transp. Syst., vol. 17,

no. 10, pp. 2718-2729, Oct. 2016.

[18] A. Cord and S. Chambon, “Automatic road defect detection by textural

pattern recognition based on adaboost,” Computer-Aided Civil and

Infrastructure Engineering, vol. 27, no. 4, pp. 244–259, 2012.

[19] H. Oliveira and P. L. Correia, “Automatic road crack detection and

characterization,” in IEEE Trans. Intell. Transp. Syst., vol. 14, no. 1, pp.

155–168, 2013.

[20] Y. Shi, L. Cui, Z. Qi, F. Meng, and Z. Chen, “Automatic road crack

detection using random structured forests,” in IEEE Trans. Intell. Transp.

Syst., vol. 17, no. 12, pp. 3434–3445, 2016.

[21] L. Zhang, F. Yang, Y. Daniel Zhang and Y. J. Zhu, “Road crack detection

using deep convolutional neural network,” in IEEE International

Conference on Image Processing (ICIP), Phoenix, AZ, 2016, pp. 3708-

3712.

[22] V. Mandal, A. R. Mussah and Y. Adu-Gyamfi, (2020). Deep Learning

Frameworks for Pavement Distress Classification: A Comparative

Analysis. arXiv preprint arXiv:2010.10681.

[23] B. Li, K. C. Wang, A. Zhang, E. Yang and G. Wang, “Automatic

classification of pavement crack using deep convolutional neural

network,” International Journal of Pavement Engineering, vol. 21, no. 4,

pp. 457-463, 2020.

[24] X. Wang and Z. Hu, “Grid-based pavement crack analysis using deep

learning,” in Transportation Information and Safety (ICTIS), 2017 4th

International Conference on. IEEE, 2017, pp. 917–924.

[25] Y. Du, N. Pan, Z. Xu, F. Deng, Y. Shen and H. Kang, “Pavement distress

detection and classification based on yolo network,” International

Journal of Pavement Engineering, pp. 1–14, 2020.

[26] E. Ibragimov, H.-J. Lee, J.-J. Lee and N. Kim, “Automated pavement

distress detection using region based convolutional neural networks,”

International Journal of Pavement Engineering, pp. 1–12, 2020.

[27] H. Dong, K. Song, Y. He, J. Xu, Y. Yan and Q. Meng, "PGA-Net:

Pyramid Feature Fusion and Global Context Attention Network for

Automated Surface Defect Detection," in IEEE Transactions on

Industrial Informatics, vol. 16, no. 12, pp. 7448-7458, Dec. 2020.

[28] F. Yang, L. Zhang, S. Yu, D. Prokhorov, X. Mei and H. Ling, “Feature

Pyramid and Hierarchical Boosting Network for Pavement Crack

Detection,” in IEEE Trans. Intell. Transp. Syst., vol. 21, no. 4, pp. 1525-

1535, April 2020.

[29] Y. Zhang, Q. Li, X. Zhao and M. Tan, “TB-Net: A Three-Stream

Boundary-Aware Network for Fine-Grained Pavement Disease

Segmentation,” in IEEE/CVF Winter Conference on Applications of

Computer Vision. 2021. p. 3655-3664.

[30] C. Finn, P. Abbeel, and S. Levine, “Model-Agnostic Meta-Learning for

Fast Adaptation of Deep Networks,” International Conference on

Machine Learning (ICML), 2017.

[31] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in International Conference on Learning Representations, 2017.

[32] Z. Li, F. Zhou, F. Chen, and H. Li, “Meta-SGD: Learning to learn quickly for few shot learning,” arXiv preprint arXiv:1707.09835, 2017.

[33] V. G. Satorras and J. B. Estrach, “Few-shot learning with graph neural networks,” in Proc. ICLR, 2018, pp. 1–13.

[34] P. Sermanet, A. Frome, and E. Real, “Attention for fine-grained categorization,” arXiv preprint arXiv:1412.7054, 2014.

[35] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan, “Attentive contexts for object detection,” IEEE Transactions on Multimedia, vol. 19, no. 5, pp. 944–954, 2017.

[36] H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network for semantic segmentation,” in BMVC, p. 285, 2018.

[37] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few

examples: A survey on few-shot learning,” ACM Computing Surveys

(CSUR), vol. 53, no. 3, pp. 1-34, 2020.

[38] X. Wang, R. Girshick, A. Gupta and K. He, "Non-local Neural

Networks," 2018 IEEE/CVF Conference on Computer Vision and

Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 7794-7803.

[39] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9911 LNCS, 2016, pp. 499–515.

[40] M. Hamed, et al. Pavement Image Datasets: A New Benchmark Dataset

to Classify and Densify Pavement Distresses. Transportation Research

Record, 2020, 2674.2: 328-339.

[41] A. Ravichandran, R. Bhotika and S. Soatto, “Few-Shot Learning With

Embedded Class Models and Shot-Free Meta Training,” 2019 IEEE/CVF

International Conference on Computer Vision (ICCV), Seoul, Korea

(South), 2019, pp. 331-339.

[42] K. Lee, S. Maji, A. Ravichandran and S. Soatto, “Meta-Learning With

Differentiable Convex Optimization,” 2019 IEEE/CVF Conference on

Computer Vision and Pattern Recognition (CVPR), Long Beach, CA,

USA, 2019, pp. 10649-10657.

[43] H. Li, D. Eigen, S. Dodge, M. Zeiler and X. Wang, “Finding Task-

Relevant Features for Few-Shot Learning by Category Traversal,” 2019

IEEE/CVF Conference on Computer Vision and Pattern Recognition

(CVPR), Long Beach, CA, USA, 2019, pp. 1-10.

[44] Y. Tian, Y. Wang, D. Krishnan, J. B Tenenbaum, and P. Isola,

“Rethinking few-shot image classification: a good embedding is all you

need?” arXiv preprint arXiv:2003.11539, 2020.

[45] C. Zhang, Y. Cai, G. Lin and C. Shen, “DeepEMD: Few-Shot Image

Classification With Differentiable Earth Mover’s Distance and

Structured Classifiers,” 2020 IEEE/CVF Conference on Computer

Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp.

12200-12210.

[46] Y. Cao, J. Xu, et al., “Gcnet: Non-local networks meet squeeze-excitation networks and beyond,” in Proceedings of the IEEE CVPR Workshops, pp. 0–0, 2019.

[47] J. Hu, L. Shen, et al., “Squeeze-and-excitation networks,” in Proceedings of the IEEE CVPR, pp. 7132–7141, 2018.

[48] Q. Wang, B. Wu, et al., “Eca-net: Efficient channel attention for deep convolutional neural networks,” in 2020 IEEE CVPR, 2020.

[49] S. Woo, J. Park, et al., “Cbam: Convolutional block attention module,” in Proceedings of the ECCV, pp. 3–19, 2018.

Hongwen Dong received the B.S. degree in

School of Mechanical Engineering and

Automation, Liaoning University of

Technology, Jinzhou, China, in 2016, and

the M.S. degree in School of Mechanical

Engineering and Automation, Northeastern

University, Shenyang, China, in 2018. He is

currently pursuing the Ph.D. degree with

School of Mechanical Engineering and

Automation, Northeastern University, China. His research

interests include deep learning, pattern recognition and

semantic segmentation.

Kechen Song received the B.S., M.S. and

Ph.D. degrees in School of Mechanical

Engineering and Automation, Northeastern

University, Shenyang, China, in 2009,

2011 and 2014, respectively. Between 2018

and 2019, he was an Academic Visitor in

the Department of Computer Science,

Loughborough University, UK. He is

currently an Associate Professor in the School of Mechanical

Engineering and Automation, Northeastern University. His

research interest covers vision-based inspection system for steel

surface defects, surface topography, image processing and

pattern recognition.

Qi Wang received the B.S. and M.S. degrees

in mechanical engineering from the

University of Science and Technology

Liaoning, Anshan, China, in 2015 and 2018,

respectively. He is currently working toward

the Ph.D. degree in mechanical design and

theory with the School of Mechanical

Engineering and Automation, Northeastern

University, Shenyang, China. His current

research interests include image segmentation and thermal

imaging defect detection.

Yunhui Yan received the B.S., M.S. and

Ph.D. degrees in School of Mechanical

Engineering and Automation, Northeastern

University, Shenyang, China, in 1981, 1985

and 1997, respectively. He has been a teacher at Northeastern University, China, since 1982, and became a professor in 1997.

During 1993-1994, he stayed in the Tohoku

National Industrial Research Institute as a

visiting scholar. His research interest covers intelligent

inspection, image processing and pattern recognition.

Peng Jiang received the B.S. and M.S.

degrees in School of Software Engineering,

Northeastern University, Shenyang, China,

in 2009 and 2011, respectively. Between

2016 and 2018, he was a Senior Engineer in

the Department of Information Technology,

Liaoning Transportation Research Institute

Group Co., Ltd. He is currently a Director of

R&D department in the Liaoning ATS

Intelligent Transportation Technology Co., Ltd. His research

interest covers software engineering and information

construction of expressway.

DDRA-net: Dual-Channel Deep Residual Attention UPerNet for Breast Lesions Segmentation in Ultrasound Images

Article

Full-text available

Jan 2024

Automated segmentation of breast tumors in breast ultrasound images has been a challenging frontier issue. The morphological diversity, boundary ambiguity, and heterogeneity of malignant tumors in breast lesions constrain the improvement of segmentation accuracy. To address these challenges, we propose an innovative deep learning-based method, namely Dual-Channel Deep Residual Attention UPerNet (DDRA-net), for efficient and accurate segmentation of breast tumor regions. The core of DDRA-net lies in the Dual-Channel Deep Residual Attention module (DDRA), which integrates depth-wise separable convolution and Convolutional Block Attention Module (CBAM). This design aims to enhance the extraction of crucial features within the receptive field to better capture subtle details of breast lesions. Through extensive experimental evaluation, DDRA-net demonstrates remarkable performance on a publicly available breast ultrasound datasets, exhibiting higher segmentation accuracy and stability compared to contemporary mainstream deep learning methods. Importantly, it is worth emphasizing that the flexibility of this method allows easy integration with other network structures to further improve the performance and applicability of breast tumor segmentation. In the segmentation of the Breast Ultrasound Image dataset, our precision, recall, IoU, F1 score, Dice, and Hausdorff Distance achieved the following values: 95.31%, 90.79%, 88.00%, 92.39%, 95.46%, and 3.02, respectively. Compared to the original UPerNet, DDRA-net demonstrated improvements of 2.92%, 4.64%, 5.52%, 4.97%, 3.4%, and 24.5% in these six metrics on the Breast Ultrasound Image dataset.

Malleable pruning meets more scaled wide-area of attention model for real-time crack detection

Article

Full-text available

Jun 2024
VISUAL COMPUT

Rapid real-time detection of crack images helps prevent the emergence of more significant potential hazards. However, mature and sophisticated convolutional neural networks are more concerned with images of general everyday objects. These neural networks do not meet the real-time requirements for concrete defect detection for cracks with complex morphology and varying scales. This manuscript proposes a lightweight improvement strategy, which consists mainly of malleable efficient channel pruning, a more scaled wide-area receptive field (MSWR), and multi-channel fusion of spatial attention, referred to as MMM strategies. Firstly, the channel pruning count can intuitively make the general convolutional neural network more lightweight. Secondly, the wider receptive field can fuse multi-scale feature maps and recognize cracks of various scales. Finally, the multi-channel fusion of spatial attention enhances detection performance efficiently, ensuring real-time capability at minimal cost. The experimental results show that the lightweight network improved by the MMM strategy sacrifices no more than 8% in the detection accuracy of defects. In some cases, the detection accuracy is even improved, while the detection speed has a significant advantage. This lightweight strategy improves defect detection and has higher real-time adaptability than mainstream convolutional neural networks. The codes are available at https://github.com/mmm587/MMM.

Research on Discrimination Method of Carbon Deposit Degree of Automobile Engine Based on Deep Learning

Article

Apr 2024

The detection of carbon deposit degree is of great significance to the maintenance of automobile engine. Due to issues with poor feature aggregation, inter-class similarity, and intra-class variance in carbon deposit data with a small number of samples, model-based discriminative approaches cannot be widely implemented in the market. In order to overcome this technical barrier, the article examines the impact of DCNNs (Deep Convolutional Neural Networks) level on the recognition effect of the degree of carbon deposit, introduces a dropout structure and data enhancement strategy to lower the risk of overfitting brought on by the small dataset, and suggests a recognition method based on the kernel of dual-dimensional multiscale-multifrequency information features to enhance the differentiation characteristic. After experimental testing, the accuracy of this method is 86.9 %, the F1-score is 87.2 %, and the inference speed is 190 FPS, which can meet the practical requirements and provide basic support for the large-scale promotion of the model discrimination.

Asphalt Pavement Crack Detection Based on Improved YOLOv5s Algorithm

Conference Paper

Mar 2024

Qian Liu

Research on pavement crack detection based on improved YOLOv4

Conference Paper

Mar 2024

Qian Liu

Few-shot defect classification via feature aggregation based on graph neural network

Article

May 2024
J VIS COMMUN IMAGE R

Survey on Pavement Distress Detection and Recognition

Conference Paper

Feb 2024

AEKD: Unsupervised auto-encoder knowledge distillation for industrial anomaly detection

Article

Apr 2024
J MANUF SYST

Automation in road distress detection, diagnosis and treatment

Article

Mar 2024

Multiple distresses detection for Asphalt Pavement using improved you Only Look Once Algorithm based on convolutional neural network

Article

Mar 2024

Triplet-Graph Reasoning Network for Few-Shot Metal Generic Surface Defect Segmentation

Article

Full-text available

May 2021

Metal surface defect segmentation can play an important role in dealing with the issue of quality control during the production and manufacturing stages. There are still two major challenges in industrial applications. One is the case that the number of metal surface defect samples is severely insufficient, and the other is that the most existing algorithms can only be used for specific surface defects and it is difficult to generalize to other metal surfaces. In this work, a theory of few-shot metal generic surface defect segmentation is introduced to solve these challenges. Simultaneously, the Triplet-Graph Reasoning Network (TGRNet) and a novel dataset Surface Defects- $4^{i}$ are proposed to achieve this theory. In our TGRNet, the surface defect triplet (including triplet encoder and trip loss) is proposed and is used to segment background and defect area, respectively. Through triplet, the few-shot metal surface defect segmentation problem is transformed into few-shot semantic segmentation problem of defect area and background area. For few-shot semantic segmentation, we propose a method of multi-graph reasoning to explore the similarity relationship between different images. And to improve segmentation performance in the industrial scene, an adaptive auxiliary prediction module is proposed. For Surface Defects- $4^{i}$ , it includes multiple categories of metal surface defect images to verify the generalization performance of our TGRNet and adds the nonmetal categories (leather and tile) as extensions. Through extensive comparative experiments and ablation experiments, it is proved that our architecture can achieve state-of-the-art results.

Pyramid Attention Aggregation Network for Semantic Segmentation of Surgical Instruments

Conference Paper

Full-text available

Apr 2020

Semantic segmentation of surgical instruments plays a critical role in computer-assisted surgery. However, specular reflection and scale variation of instruments are likely to occur in the surgical environment, undesirably altering visual features of instruments, such as color and shape. These issues make semantic segmentation of surgical instruments more challenging. In this paper, a novel network, Pyramid Attention Aggregation Network, is proposed to aggregate multiscale attentive features for surgical instruments. It contains two critical modules: Double Attention Module and Pyramid Upsampling Module. Specifically, the Double Attention Module includes two attention blocks (i.e., position attention block and channel attention block), which model semantic dependencies between positions and channels by capturing joint semantic information and global contexts, respectively. The attentive features generated by the Double Attention Module can distinguish target regions, contributing to solving the specular reflection issue. Moreover, the Pyramid Upsampling Module extracts local details and global contexts by aggregating multi-scale attentive features. It learns the shape and size features of surgical instruments in different receptive fields and thus addresses the scale variation issue. The proposed network achieves state-of-the-art performance on various datasets. It achieves a new record of 97.10% mean IOU on Cata7. Besides, it comes first in the MICCAI EndoVis Challenge 2017 with 9.90% increase on mean IOU.

TB-Net: A Three-Stream Boundary-Aware Network for Fine-Grained Pavement Disease Segmentation

Conference Paper

Jan 2021

Deep Learning Frameworks for Pavement Distress Classification: A Comparative Analysis

Conference Paper

Dec 2020

Two Deep Learning Networks for Rail Surface Defect Inspection of Limited Samples With Line-Level Label

Article

Dec 2020

Rail surface defect (RSD) inspection is an essential routine maintenance task. Computer vision testing is suitable for RSD inspection with its intuitiveness and rapidity. Deep learning techniques, which can extract deep semantic features, have been applied to inspect RSDs in recent years. However, these methods demand thousands of samples. And sample collection requires the industry and costs high. To address the issue, a novel inspection scheme for RSDs is presented for limited samples with line-level label, which regards defect images as sequence data and classifies pixel lines. Thousands of pixel lines are easy to be collected and labeling line-level is a simple task in labeling works. Then two methods OC-IAN and OC-TD are designed for inspecting express rail defects and common/heavy rail defects respectively. OC-IAN and OC-TD both employ one-dimensional convolutional neural network (ODCNN) to extract features and long and short term memory (LSTM) network to extract context information. The main differences between OC-IAN and OC-TD are that OC-TD applies a double-branch structure and removes the attention module. Experimental results on RSDDs dataset demonstrate that our methods are effective and outperform the state-of-the-art methods on defect-level metrics (Type-I: Rec-0.9314, Pre-0.8421, F1-0.8845; Type-II: Rec-0.9427, Pre-0.9176, F1-0.9300).

Rethinking Few-Shot Image Classification: A Good Embedding is All You Need?

Chapter

Nov 2020

The focus of recent meta-learning research has been on the development of learning algorithms that can quickly adapt to test time tasks with limited data and low computational cost. Few-shot learning is widely used as one of the standard benchmarks in meta-learning. In this work, we show that a simple baseline: learning a supervised or self-supervised representation on the meta-training set, followed by training a linear classifier on top of this representation, outperforms state-of-the-art few-shot learning methods. An additional boost can be achieved through the use of self-distillation. This demonstrates that using a good learned embedding model can be more effective than sophisticated meta-learning algorithms. We believe that our findings motivate a rethinking of few-shot image classification benchmarks and the associated role of meta-learning algorithms. Code: http://github.com/WangYueFt/rfs/.

Automated pavement distress detection using region based convolutional neural networks

Article

Oct 2020

Automatic pavement crack detection is essential for evaluating maintenance requirements and ensuring driving safety. Crack detection plays a primary role in realising the automatic evaluation of pavement condition. Most existing research on pavement crack detection relies on laborious work, which is a time- and cost-intensive process. Although there has been considerable research on pavement crack detection, it remains a challenging task owing to diverse and complex pavement conditions. Recently, deep learning-based algorithms have achieved significant success in computer vision tasks. However, these techniques still have limitations for automatic pavement distress detection. To overcome the current limitations, this study proposes a method for detecting signs of pavement distress based on the faster region-based convolutional neural network (Faster R-CNN). The study focuses on the detection of longitudinal cracks, transverse cracks, alligator cracks, and partial patching in pavement images. A framework for applying the Faster R-CNN technique to a full-size pavement image is also proposed, which allows the sliding window size to be reduced, thus enabling the detection of larger images. The performance of the proposed method was validated against a dataset containing actual pavement images. The experimental test results show that the proposed method could successfully detect cracks and partial patching with accuracy.

ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

Conference Paper

Jun 2020

DeepEMD: Few-Shot Image Classification With Differentiable Earth Mover’s Distance and Structured Classifiers

Conference Paper

Jun 2020

Unsupervised Saliency Detection of Rail Surface Defects Using Stereoscopic Images

Article

Jun 2020

Visual information has received increasing attention in rail surface defect detection for its high efficiency and stability. However, it is not sufficient for detecting complete defects against complex backgrounds. Adding profiles can effectively improve this situation because they carry entity information. However, in high-speed detection, traditional three-dimensional (3D) profile acquisition is difficult and separate from image acquisition, so it cannot satisfy these requirements effectively. Therefore, an unsupervised stereoscopic saliency detection method based on a binocular line-scanning system is proposed in this paper. This method can obtain highly precise image and profile information at the same time, and it avoids the decoding distortion of the structured light reconstruction method. In this method, a global low-rank non-negative reconstruction algorithm with a background constraint is proposed. Unlike the low-rank recovery (LRR) model, the algorithm has more comprehensive low-rank and background clustering properties. In addition, outlier detection based on the geometric properties of the rail surface is proposed. Finally, image saliency results and depth outlier detection results are combined through interactive fusion. A data set (RSDDS-113) containing rail surface defects is also established for experimental verification. The experimental results demonstrate that the method achieves an MAE of 0.09 and an AUC of 0.94, which is better than 15 other algorithms.
