IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 70, 2021 5010015
Prior Mask R-CNN Based on Graph Cuts Loss and Size Input for Precipitation Measurement
Mingchun Li, Dali Chen, Shixin Liu, Member, IEEE, and Fang Liu
Abstract— Fusing prior knowledge with data-driven deep learning for measurement is interesting and challenging. For the detection of metallographic precipitates, the size and shape of the precipitates observed through a transmission electron microscope (TEM) are roughly predictable in advance. In this article, we propose a novel instance segmentation network named prior mask R-CNN that fuses such prior knowledge for automatic precipitate detection. On the basis of the typical mask R-CNN framework, we make the following improvements. First, at the bounding box classification stage, we feed in the box size alongside the uniform-dimension features produced by region of interest align (RoIAlign), restoring the area information. Second, at the mask segmentation stage, we propose a new loss function based on normalized graph cuts. It is category-sensitive, setting different weight strategies for different categories based on their prior shapes. In addition, from the point of view of practicality, we design an effective measurement extraction module to obtain specific measurements, such as the length of precipitates, from the final prediction results of our network. In a variety of experiments, our method achieves the highest mean average precision (mAP) of 0.475 and 0.298 among several well-known methods for the bounding box detection and mask segmentation tasks, respectively, which proves its effectiveness.
Index Terms— Graph cuts, instance segmentation, metallographic image, precipitation detection, prior knowledge.
I. INTRODUCTION
Precipitates are nanoscale microstructures of alloy materials that play key roles in the mechanical properties of products, such as yield strength, ultimate tensile strength, and elongation. As a result, it is extremely important to measure the precipitates accurately. In this article, we mainly focus on six-series aluminum alloy. It is an excellent structural material due to its good formability, corrosion resistance, weldability, and low cost [1]. In order to investigate
the nanoscale microstructures in materials, a transmission
electron microscope (TEM) is typically used [2]. When we
Manuscript received December 25, 2020; revised April 12, 2021; accepted
April 29, 2021. Date of publication May 6, 2021; date of current version
May 19, 2021. This work was supported in part by the National Key Research
and Development Program of China under Grant 2017YFB0306400 and
in part by the National Natural Science Foundation of China under Grant
61773104. The Associate Editor coordinating the review process was
Mohamad Forouzanfar. (Corresponding author: Dali Chen.)
Mingchun Li, Dali Chen, and Shixin Liu are with the College of
Information Science and Engineering, Northeastern University, Shenyang
110819, China (e-mail: 407996328@qq.com; chendali@ise.neu.edu.cn;
sxliu@mail.neu.edu.cn).
Fang Liu is with the School of Materials Science and Engi-
neering, Northeastern University, Shenyang 110819, China (e-mail:
liufang@smm.neu.edu.cn).
Digital Object Identifier 10.1109/TIM.2021.3077996
observe the aluminum alloy with TEM under the standard
setting, we find that the precipitates are embedded in the alloy
matrix (aluminum) in horizontal, vertical, and longitudinal
directions with expected size, as shown in Fig. 1.
From the point of view of materials science, these microstructures are described as needle-shaped precipitates (horizontal and longitudinal) and dot-shaped precipitates (vertical), respectively, and are of significant research value [3]. However, the precipitates observed in TEM images are often ambiguous, and their contours are not as obvious as in natural images, so traditional computer vision methods struggle to measure the precipitates directly. For practical production and academic research, materials scientists need to manually measure the precipitates in these three directions for each specimen, which is time-consuming and tedious.
Fortunately, in recent years, with the development of computer vision, methods based on deep learning have achieved remarkable results in image classification [4], boundary detection [5], object detection [6], and image segmentation [7].
In fact, in the measurement field for materials science, deep
learning methods are also widely used. For example, for
steel inner microstructures, Azimi et al. [8] designed a fully
convolutional neural network (FCNN) under a novel max
voting strategy to obtain pixel-level segmentation of marten-
site, tempered martensite, bainite, and pearlite. For nickel-
based superalloy, Wang et al. [9] used typical U-Net to get
information of precipitates and established a microstructure-
related hardness model according to the segmentation results.
For aluminum alloy, Li et al. [10] employed generative adver-
sarial network (GAN) and multitask learning to achieve the
detection of the second phase particles and grain bound-
aries. These deep learning methods tend to perform better
than the traditional machine learning methods (support vec-
tor regression [11], shallow neural network [12], and mean
shift [13]) or rule-based methods (automatic thresholding [14],
level set [15], graph cuts [16], and ultimate opening [17]),
especially when dealing with more complex measurement
tasks. Therefore, in view of the complexity of the TEM images in this article, we design an instance segmentation framework based on deep learning to accurately detect precipitates in the alloy.
As a natural extension of the object detection task, instance
segmentation aims to predict pixelwise object instance seg-
mentation and object category [18]. In recent years, it has
been widely used in the field of measurement, for example,
1557-9662 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Fig. 1. Stereogram display of the three directions (horizontal, vertical, and longitudinal) of precipitations embedded in aluminum alloy.
detection and diagnosis of electrical equipment based on infrared [19] and natural [20] images. In fact, the major instance segmentation frameworks are based on proposal segmentation, following the great success of R-CNN [21]. A typical example is mask R-CNN [22]. It consists
of two parts: region proposal network (RPN) and region of
interest network. Based on box regression and classification
involved in the second stage of faster R-CNN [23], it added
an additional mask prediction branch. At that time, it achieved
top performance for the MS COCO instance segmentation
task [24]. The success of this method is attributed to the excel-
lent performance of a fully convolution network (FCN) [25]
in segmentation tasks and the effectiveness of gradient prop-
agation under the region of interest align (RoIAlign) layers.
Following this route, in recent years, more instance segmenta-
tion methods have been proposed. For the instance segmentation pipeline, Liu et al. [26] proposed PAFPN, which enhances the entire feature hierarchy at the RPN stage and links feature grids through adaptive feature pooling. For
the topological structure of instance segmentation framework,
cascade mask R-CNN [27], hybrid task cascade [28], and
mask scoring R-CNN [29] set up an extra block to improve
performance through cascade structures [27], [28] and quality
score modules [29]. It is worth noting that these methods
do not pay attention to the combination of prior knowledge.
This is understandable, of course, because these methods are
proposed for natural images, in which the information (size
or shape) of different categories is diffuse and unpredictable.
However, when there is obvious and predictable knowledge of
categories, such as the shape of metal microstructures, how to
effectively use prior knowledge is worth further consideration.
From the perspective of the training strategy, the most
intuitive way to employ prior knowledge is based on transfer
learning, which is widely used in a variety of deep learning
frameworks [30]. Transfer learning can augment the target domain with relevant source-domain data, so as to obtain a network with better generalization ability and performance. Typical paradigms include fine-tuning parameters [31], domain adaptation [32], and so on.
It is one of the important tools to solve few-shot learning
from the perspective of data [33]. It implicitly transfers prior
knowledge to the network through relevant data. Besides that,
some recent studies showed that prior knowledge could be
fused to the end-to-end network more explicitly. For example,
when the shape of the foreground is known in the segmentation
task, Mirikharaji and Hamarneh [34] proposed a novel loss
term to encode the object shape and embed it into the loss
function to punish the predicted shape that does not satisfy
the prior knowledge. Han et al. [35] designed convex shape
sensitive loss function through a simple ergodic formula to
improve the robustness of the deep network to the noise and
reflection for pupil segmentation. When the location of each
category is known in advance, Zotti et al. [36] took the cardiac distribution in 3-D position as prior knowledge and merged it with the feature map before the classification layer in the network to improve the performance of cardiac segmentation.
In addition to the training strategy, we can also modify the
structure of the network to fuse prior knowledge. Specifically,
for the image segmentation task, an obvious knowledge is that
the prediction results should be smooth and continuous. This
is very important for a dense prediction network, considering
that the network is usually trained pixelwise. In order to
address this problem, Chen et al. [37] designed an additional
conditional random field (CRF) module as postprocessing for
the prediction of a deep network to improve the localization
performance. Similarly, Zheng et al. [38] integrated the complex energy function inference process of CRF into the running logic of recurrent neural networks (RNNs), so as to realize the
end-to-end training with additional position and color informa-
tion. In fact, for the typical black-box model, such structural
modification can effectively make up for the deficiency of deep
learning and transfer some common and general knowledge to
the network to ensure the rationality of the prediction results.
In this article, our contributions can be summarized as
follows.
1) We proposed a two-stage instance segmentation frame-
work called prior mask R-CNN for automatic metallographic
precipitation measurement of aluminum alloys. At the RoI
network stage, we input the size information of each object for
recovering the area information between different categories
after the RoIAlign layer.
2) A new loss function based on normalized graph cuts is
proposed. By assigning weights in the graph based on different
rules for each category, we designed a shape-sensitive cut loss
function and embed it into the mask training period with the
original cross-entropy loss function, meanwhile.
3) We developed a simple postprocessing module to extract
the measurement information from the prediction results based
on the region-growing algorithm. This module could effec-
tively obtain quantitative information about the precipitates,
which plays a key role from the perspective of practical
applications.
II. METHODOLOGY
Here, we clearly point out that our task is to measure
the precipitates in TEM images of aluminum alloys. The
categories of precipitates can be divided into three types
according to their growing direction: horizontal precipitates,
vertical precipitates, and longitudinal precipitates (see Fig. 1).
First, we get the instance segmentation from the metallo-
graphic image by the proposed prior mask R-CNN. Second,
for the prediction results, we set up a postprocessing module to
obtain the specific measurement of each kind of precipitate.
In this section, we introduce the proposed methodology in detail. Section II-A introduces the topology
structure of prior mask R-CNN and the specific size input
link. Section II-B demonstrates the cut loss function that
includes shape prior knowledge based on graph cuts. The
postprocessing module used to extract precipitate information
would be presented in Section II-C.
A. Structure of Prior Mask R-CNN
In this work, our ultimate goal is to help materials scientists obtain the information of interest from metallographic images. This information mainly refers to size statistics of the precipitates, which can reveal the mechanical
properties of the alloy. Specifically, different from image
classification [39], [40] and semantic segmentation [41], our
task needs to detect each precipitate in the image and measure
it one by one. That is to say, it involves object detection [42]
and segmentation in turn. From the perspective of computer
vision, this is a typical instance segmentation task [43], [44].
Considering the challenge of specific noises (such as occlu-
sion, interference, and distortion) in the TEM images, on the
basis of the mask R-CNN framework, we introduce the size
information, and its specific structure is shown in Fig. 2.
In general, our network mainly consists of three parts:
backbone, RPN, and RoI network. The backbone is deployed
to obtain the abstract features of metallographic images by a
hierarchic convolution operation. On the basis of these con-
volution feature maps, RPN is employed to provide proposals
that may contain foreground at the first stage, and the RoI
network is used to fine-tune the proposal results and get the
mask segmentation at the second stage. Specifically, for the backbone, we adopt ResNet50 with feature pyramid networks (FPNs). After successive stride operations in ResNet50, a series of feature maps with different sizes is obtained. In the FPN, these feature maps are gradually fused, and finally, we obtain feature maps at five scales (x_{1-5}). At the stage of RPN, we set up three kinds of anchors with different length-to-width ratios. Here, we perform foreground detection on each of the five feature-map scales separately, instead of applying multiscale anchors to fused features. That is to say, the loss function of RPN is the sum of the box classification and regression losses over the five feature maps:

L_{RPN} = L_{rpn\_cls} + L_{rpn\_reg} = \sum_{i=1}^{5} \left[ L_{rpn\_cls}(x_i) + L_{rpn\_reg}(x_i) \right]  (1)

where x_i denotes the feature map at the ith scale.
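As a minimal illustration of (1), the total RPN loss is simply a sum of per-scale classification and regression terms. The per-scale loss values below are placeholder numbers, not the paper's actual loss computations:

```python
# Illustrative sketch of the RPN loss in Eq. (1): the total loss is the sum of
# box classification and regression losses over the five FPN scales.
# The per-scale loss values are placeholders, not the paper's implementation.

def rpn_loss(cls_losses, reg_losses):
    """Sum classification and regression losses over all scales, Eq. (1)."""
    assert len(cls_losses) == len(reg_losses)
    return sum(c + r for c, r in zip(cls_losses, reg_losses))

# Placeholder per-scale losses for the five feature maps x1..x5.
cls_losses = [0.5, 0.4, 0.3, 0.2, 0.1]
reg_losses = [0.2, 0.2, 0.1, 0.1, 0.1]
total = rpn_loss(cls_losses, reg_losses)
```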
The last part of the framework is the RoI network. It is
used to classify objects and segment each instance. Its input
involves two parts: the proposal from RPN and the feature
maps extracted based on the backbone. Among them, the effect
of the backbone is to extract deep features through learned
hierarchical convolution. The effect of RPN is to provide
candidate boxes that might contain the desired object from
the image. In order to obtain more accurate boundary boxes
of objects, nonmaximum suppression (NMS) based on IoU is
employed [22]. It could reject a region if it has an IoU overlap
with a higher scoring selected region larger than a certain
threshold. After NMS, we obtain multiple instance-level feature maps by cropping and resizing the image-level features guided by the boxes. These feature maps are fed into the regression layer, used to fine-tune the bounding box, and the classification layer, used to identify the object category.
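The IoU-based suppression rule described above can be sketched as follows; the greedy loop and threshold are the standard formulation, while the box format and values are illustrative assumptions:

```python
# Minimal sketch of IoU-based nonmaximum suppression: a box is rejected if its
# IoU with an already-selected, higher-scoring box exceeds a threshold.
# Boxes are (x1, y1, x2, y2); values are made-up examples.

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
# The second box overlaps the first heavily, so it is suppressed.
```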
After the bounding box correction at the RoI network stage, we obtain the refined feature maps by RoIAlign and feed them into the mask layer for object segmentation. Different from
the fully connected network at the classification and regression
layer, the segmentation layer is based on the convolution
network. The total loss of the RoI network could be expressed
as follows:
L_{RoI} = L_{cls} + L_{reg} + L_{mask}.  (2)
It is worth noting that the RoIAlign operation will change
the feature map of different scales into a unified scale through
bilinear interpolation. This operation can effectively transform
the objects of different scales to a uniform size, which is
necessary for the following network of classification and
segmentation tasks. In essence, RoIAlign [22] is standard
Fig. 2. Structure of prior mask R-CNN.
operations for extracting a small, fixed-size feature map from each proposal box, regardless of the box size. That is to say, after the RoIAlign operation, both large and small boxes are converted into normalized feature maps of the same size by interpolation. There is no doubt that such an operation will
greatly damage the scale information of the object, especially
when the scale has a clear correlation with the object category.
To solve this scale damage problem, we assume that the sizes
of precipitates are related to the manufacturing process and
could be predicted when accurate precipitate information is
obtained. For example, the average size of the horizontal precipitates is 1046.7 nm² in Fig. 3. That is, the object category is highly dependent on the box size. However, all the feature
maps after RoIAlign would be the same size, and their original
scale information would be discarded. Therefore, we make the
following structural adjustment that constitutes our method’s
novelty and effectiveness.
For the classification (cls) layer and regression (reg) layer
of RoI, we adopt a four-layer fully connected network. First, we flatten the 3-D feature map (channel × height × width) into a 1-D feature vector. Next, the flattened features are fed into two fully connected layers
whose outputs are gradually reduced in turn. In the output
layer, we fuse the size information as an additional input
(green neural unit) with the output of the second fully con-
nected layer. Here, we point out that the size is calculated
based on the area of the boundary box, which is readily
available through a simple product operation from the RPN
results. Finally, the classification layer and the regression
layer based on the shared features, respectively, predict the
category of the object and the bounding box location. Specifically, in Fig. 3, classes H, V, and L refer to the horizontal, vertical, and longitudinal precipitates, respectively; t_{cx}, t_{cy}, t_{cw}, and t_{ch} are the offsets of the detected box for category c. So far, through a simple skip connection,
we realized the size input to help the model fuse scale
information without additional high computational cost.
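The size-input skip connection described above can be sketched as follows. All layer sizes and values are illustrative assumptions, far smaller than a real implementation, and the two-layer head stands in for the paper's four-layer fully connected network:

```python
import numpy as np

# Hypothetical sketch of the size-input skip connection: the flattened RoI
# features pass through fully connected layers, then the proposal box area is
# concatenated as one extra input (the "green neural unit") before the shared
# output feeds the classification/regression layers.

rng = np.random.default_rng(0)

def fc_relu(x, w, b):
    # one fully connected layer with ReLU activation
    return np.maximum(x @ w + b, 0.0)

roi_feat = rng.standard_normal((1, 7 * 7 * 32))            # flattened RoIAlign output
w1, b1 = 0.01 * rng.standard_normal((7 * 7 * 32, 128)), np.zeros(128)
w2, b2 = 0.01 * rng.standard_normal((128, 64)), np.zeros(64)

h = fc_relu(fc_relu(roi_feat, w1, b1), w2, b2)             # shared FC features

box = (10.0, 20.0, 50.0, 60.0)                             # RPN proposal (x1, y1, x2, y2)
area = np.array([[(box[2] - box[0]) * (box[3] - box[1])]]) # box size input: 40 * 40
fused = np.concatenate([h, area], axis=1)                  # skip connection

# "fused" then feeds the classification layer (classes H, V, L) and the
# regression layer predicting (t_cx, t_cy, t_cw, t_ch).
```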
B. Loss Function of Prior Mask R-CNN
Once the topology of the network is determined, we need
to consider specific learning objectives. The objective function
of the optimization problem is more commonly called the
loss function for machine learning. It plays a key role in
deep learning, which determines how to guide the parameters
in the model to update. Whether for supervised learning or
unsupervised learning, it is very important to set a reasonable
loss function. In this section, we will introduce the loss
function involved in prior mask R-CNN in detail.
In general, the loss function in prior mask R-CNN is mainly
involved at the RPN stage and the RoI network, which are
shown as (1) and (2), respectively. In (2), we find that the
loss function of the RoI network consists of three parts: classification loss L_{cls}, regression loss L_{reg}, and segmentation loss L_{mask}. In view of the structure of the total loss function,
it could be regarded as multitask learning [45], which aims
to leverage valuable knowledge that is involved in related
tasks to improve the whole performance of the network. First,
we show the classification loss and the regression loss of the
RoI network for one sample, as follows:
L_{cls} = k_c \sum_{c=1}^{C} -g_c \log(s_c)  (3)

L_{reg} = k_c \sum_{c=1}^{C} \sum_{i=1}^{I} g_c \left| t^*_{ci} - t_{ci} \right|  (4)

where k_c is the weight of category c (horizontal, vertical, and longitudinal precipitates), g_c ∈ {0, 1} is the class-level binary ground truth, s_c ∈ [0, 1] is the class-level prediction after the activation function, t^* = (t^*_{cx}, t^*_{cy}, t^*_{cw}, t^*_{ch}) is the real offset set between the ground truth and the region of interest, and t = (t_{cx}, t_{cy}, t_{cw}, t_{ch}) is the predicted offset set.
In (3) and (4), we add the weight k_c for categories on top of the standard cross-entropy loss and L1-norm loss functions. The reason is that the numbers of the different kinds of precipitates are obviously different. In fact, unlike the unpredictability of objects in natural landscape images, each metallographic image in our dataset contains all three kinds of precipitates with different amounts at the same time. For example, in a TEM image, the number of vertical precipitates observed is always much larger than that of longitudinal or horizontal precipitates. Therefore, inspired by the class-balanced strategy [46], we use the weight k_c to alleviate the sample imbalance problem for the object classification and
Fig. 3. Proposed box size input (green neural unit) by skipping connection in the RoI network.
bounding box regression tasks. Specifically, k_c is equal to the total number of objects divided by the number of objects of category c.
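The class-balancing rule is easy to sketch: k_c is the total object count divided by the per-category count, so rare categories receive larger weights. The counts below are made-up examples:

```python
# Sketch of the class-balancing weight k_c described above:
# k_c = (total number of objects) / (number of objects of category c).
# The per-category counts are made-up examples.

counts = {"horizontal": 20, "vertical": 120, "longitudinal": 10}
total = sum(counts.values())
k = {c: total / n for c, n in counts.items()}
# The rare category ("longitudinal" here) receives the largest weight.
```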
Besides, for the mask layer in the RoI network, the task is to segment the region of interest. Different from the previous classification and regression layers, it performs dense prediction based on a convolutional network. Here, we directly present the mask loss function for metallographic precipitates in prior mask R-CNN:
L_{mask} = αL_{CE} + βL_B + γL_{Cut}  (5)

where L_{CE} is the typical cross-entropy loss, L_B is the boundary loss proposed in [47], L_{Cut} is the cut loss proposed in this article, and α = 1, β = 1.5, and γ = 1.5 are the weights of these three losses.
Among them, the cross-entropy loss LCE is the most
common loss function in segmentation tasks. It guides the
network learning by calculating the cross entropy of each pixel
independently based on ground truth. It can be expressed as
follows:
L_{CE} = \sum_{c=1}^{C} -G_c · \log(S_c)  (6)

where G_c ∈ {0, 1}^N is the binary ground-truth vector of length N for category c, N is the total number of pixels in the image, S_c ∈ [0, 1]^N is the predicted value vector after the sigmoid function for category c, and · denotes the vector inner product.
In fact, by its nature, the cross-entropy loss in (6) treats image segmentation as many isolated pixel classification problems, which is somewhat inconsistent with the human visual system. To alleviate this problem, many novel designs have been proposed to compensate for the cross-entropy loss. For example, one can directly optimize the evaluation metric to improve the performance of the model. Specifically, in [48], the Dice loss
function, based on the Dice similarity coefficient, is proposed to solve the imbalance between foreground and background voxels in medical images. In fact, for the class
imbalance problem of image segmentation, the Dice loss
function is widely employed for various deep networks and
respective tasks. Besides V-Net [48], in [49], it is used to
train a fully convolutional densenet for diffusion-weighted
images. Similarly, for vessel segmentation in X-ray coronary angiography image sequences [50], the Dice loss
is selected as a loss function to train an encoder–decoder
framework with a channel attention mechanism to tackle the
class imbalance problem. Similarly, the Hausdorff distance
used to quantify the difference between two sets is also
encoded as a simple loss function, which is estimated by
three approximate methods in [51]. In addition, the perceptual
loss from the deep network, the loss function based on the
region, and the energy-based loss function are also widely
used to solve their respective problems [52]. No doubt, it is
important to design an appropriate loss function according
to the specific situation. Considering that the morphology of precipitates in TEM metallographic images is predictable (vertical precipitates are dot-shaped, whereas horizontal and longitudinal precipitates are needle-shaped with different directions), which is effective prior knowledge, we propose a cut loss based on normalized graph cuts to compensate for the cross-entropy loss.
Graph cuts are an effective unsupervised image segmentation method based on graph clustering. For the binary classification of foreground and background in image I, let the point set of the foreground be A and that of the background be B, that is, A ∩ B = ∅ and A ∪ B = I. The graph cut can be expressed as

Cut(A, B) = \sum_{u ∈ A, v ∈ B} w(u, v)  (7)

where w(u, v) indicates the designed weight between u and v.
Specifically, to improve the performance of segmentation,
a regularized and extended version of the cut measurement
named normalized graph cuts [53] can be written as

NCut(A, B) = \frac{Cut(A, B)}{assoc(A, I)} + \frac{Cut(A, B)}{assoc(B, I)}  (8)

where assoc(A, I) = \sum_{u ∈ A, t ∈ I} w(u, t) indicates the sum of the weights between the points in A and all the points in image I; assoc(B, I) is defined analogously.
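Equations (7) and (8) can be checked directly on a toy graph; the symmetric weight matrix and the partition below are made-up examples, not values from the paper:

```python
import numpy as np

# Toy illustration of Eqs. (7)-(8) on a 4-node graph: nodes {0, 1} form set A
# and nodes {2, 3} form set B. W is a made-up symmetric weight matrix.

W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])

A, B = [0, 1], [2, 3]
cut = sum(W[u, v] for u in A for v in B)              # Cut(A, B), Eq. (7)
assoc_A = sum(W[u, t] for u in A for t in range(4))   # assoc(A, I)
assoc_B = sum(W[v, t] for v in B for t in range(4))   # assoc(B, I)
ncut = cut / assoc_A + cut / assoc_B                  # NCut(A, B), Eq. (8)
```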
From the perspective of optimization, once the weight matrix W is determined, the objective function based on normalized graph cuts is

\min_{V_c} \sum_{c=1}^{C} g_c \frac{V_c^T W_c (1 - V_c)}{D_c^T V_c}  (9)

where W_c ∈ R^{N×N} is the weight matrix for category c, g_c ∈ {0, 1} is the class-level binary ground truth representing the categories contained in image I, D_c = W_c 1 is the vector of row sums of W_c, and V_c ∈ {0, 1}^N is the decision variable.
Here, we expand the binary classification task between foreground and background into a multiclass problem, which is more appropriate for our task. Obviously, for the optimization problem in (9), the decision variable V_c is the final segmentation result based on graph cuts. An effective method to solve this kind of problem is to transform it into an eigenvalue problem based on the Rayleigh quotient [53]. Furthermore, if we relax the hard constraint on V_c from {0, 1}^N to the soft [0, 1]^N, it can be regarded as the probability output S_c of a deep network after the sigmoid activation function. In other words, we transform the direct optimization of the decision variables into the optimization of the network parameters in deep learning. Inspired by [54], such an optimization problem can be used as a loss function in a deep network and solved iteratively by the gradient-based backpropagation algorithm. The proposed cut loss function and its gradient can be written as follows:
L_{Cut} = \sum_{c=1}^{C} g_c \frac{S_c^T W_c (1 - S_c)}{D_c^T S_c}  (10)

\frac{∂L_{Cut}}{∂θ} = \frac{∂}{∂θ} \sum_{c=1}^{C} g_c \frac{S_c^T W_c (1 - S_c)}{D_c^T S_c} = \sum_{c=1}^{C} g_c \left[ \frac{S_c^T W_c S_c D_c}{(D_c^T S_c)^2} - \frac{2 W_c S_c}{D_c^T S_c} \right] \frac{∂S_c}{∂θ}  (11)

where θ indicates the parameters of the deep network.
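The cut loss (10) and the bracketed gradient term in (11) can be verified numerically on a small example; the weight matrix and soft prediction below are random toy values, with a single category and g_c = 1:

```python
import numpy as np

# Toy check of Eq. (10) and the gradient in Eq. (11) with respect to the soft
# prediction S, assuming a symmetric weight matrix W with zero diagonal and a
# single category (g_c = 1). All values are random examples.

rng = np.random.default_rng(1)
N = 6
W = rng.random((N, N))
W = (W + W.T) / 2.0          # symmetric weights
np.fill_diagonal(W, 0.0)
D = W.sum(axis=1)            # D = W 1 (row sums)
S = rng.random(N)            # soft prediction in [0, 1]^N

def cut_loss(s):
    return s @ W @ (1.0 - s) / (D @ s)   # Eq. (10), one category

# Analytic gradient from Eq. (11): (S^T W S) D / (D^T S)^2 - 2 W S / (D^T S).
grad = (S @ W @ S) * D / (D @ S) ** 2 - 2.0 * W @ S / (D @ S)

# Central finite-difference check of the same gradient.
eps = 1e-6
num = np.array([
    (cut_loss(S + eps * np.eye(N)[i]) - cut_loss(S - eps * np.eye(N)[i])) / (2 * eps)
    for i in range(N)
])
```

The finite-difference vector matches the closed form in (11), which confirms the sign simplification used there (the D_c/(D_c^T S_c) terms cancel because D_c^T S_c = S_c^T D_c).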
In (10) and (11), the category weight matrix W_c specifies the correlation strength between pixels, which largely determines the final segmentation result. Generally speaking, the weight matrix is symmetric, and each element needs to be calculated independently. A simple way to obtain the weight matrix is to use a kernel function based on pixel feature vectors, which is the popular way to construct the energy function in CRF [55]:

W_c^{i,j} = k(F_i, F_j) = e^{-(F_i - F_j)^T Σ_c^{-1} (F_i - F_j)}  (12)

where the pair i and j refers to the index position in the matrix W_c, F_i is the feature vector of pixel i (and likewise F_j), and Σ_c^{-1} indicates the inverse of the covariance matrix.
Considering the smoothness of the prediction and the difference between categories, the pixel features include the 2-D position information {X, Y} and the color information with three channels {R, G, B}. Here, for simplicity, we only consider the gray image with a single channel G. In other words, the feature vector F_i for pixel i is [x_i, y_i, g_i]. Next, we discuss the relationship between the weight matrix and prior knowledge.
In fact, when the shape of a particular category is known in advance, its position statistics are also predictable. This inspires the setting of the weight matrix in the cut loss function. As stated above, our task is to segment the precipitates in three different directions: horizontal, vertical, and longitudinal. The precipitates with different shapes have their own position statistics. For example, for horizontal precipitates, the position axes X and Y are negatively correlated. In contrast, X and Y in longitudinal precipitates are always positively correlated. In fact, this phenomenon still holds after the RoIAlign of the prior mask R-CNN framework because RoIAlign does not change the sign of the correlation between the position axes X and Y. An example is shown in Fig. 4.
In Fig. 4, we show the weight matrix of the central pixel, which can be obtained by reshaping the middle row or column of the category weight matrix W_c. First, it should be pointed out that, after RoIAlign of the feature map, we need to additionally crop the raw image with the same bounding box proposal to obtain the color information, as shown in the upper left of Fig. 4. Here, we set three kinds of correlation degrees τ_xy = {+1.5, 0, −1.5} with the color channel to illustrate the difference of the weight matrix under the respective strategies. Among them, the color channel focuses on the gray difference, whereas the position focuses on smoothness. The weight matrix of Color Channel + XY Channel (τ_xy = +1.5) (lower middle) is the closest to the binarization matrix based on the ground truth, which is also the most consistent with the position statistics of the horizontal precipitate. This shows that, when we know the shape, and more specifically the location statistics, of a category in advance, the prior knowledge can be encoded through the covariance matrix Σ_c involved in the weight matrix W_c and thus fused into the loss function L_{Cut}. Here, we directly give the inverse of the matrix Σ_c, which can be used in (12):
Σ_c^{-1} = \begin{bmatrix} τ_p & τ_p τ_{cxy} & 0 \\ τ_p τ_{cxy} & τ_p & 0 \\ 0 & 0 & τ_g \end{bmatrix}  (13)

where τ_p = 1/10² and τ_g = 1/16² are the weights for the position and color channels, and τ_{cxy} = {+2.5, 0, −2.5} refers to the correlation degree, which can be selected according to the opposite of the covariance sign of each category (horizontal, vertical, and longitudinal).
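The kernel in (12) with the inverse covariance of (13) can be sketched as follows. The τ values follow the text; the pixel coordinates and gray levels are made-up examples:

```python
import numpy as np

# Sketch of the kernel weight in Eq. (12) using the inverse covariance from
# Eq. (13). Feature vectors are [x, y, g] (2-D position plus gray value).
# The tau values follow the text; the two pixels are made-up examples.

tau_p, tau_g = 1.0 / 10**2, 1.0 / 16**2
tau_cxy = -2.5     # e.g., the setting for one category with anticorrelated X and Y

Sigma_inv = np.array([[tau_p,           tau_p * tau_cxy, 0.0],
                      [tau_p * tau_cxy, tau_p,           0.0],
                      [0.0,             0.0,             tau_g]])

def kernel(Fi, Fj):
    d = np.asarray(Fi, float) - np.asarray(Fj, float)
    return np.exp(-d @ Sigma_inv @ d)   # Eq. (12)

w = kernel([3.0, 4.0, 120.0], [5.0, 2.0, 118.0])
# Pairs whose displacement matches the preferred direction get larger weights.
```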
Fig. 4. Weight matrix of the central pixel under different strategies.
Objectively speaking, scholars have proposed many different loss functions for segmentation tasks. According to the taxonomy in [56], our cut loss can be regarded as a region-based loss, in contrast to the pixelwise CE loss. It is worth mentioning that the cut loss does not require pixel-level annotations, which is an interesting characteristic. It can be
seamlessly integrated into weakly supervised learning [57].
We only need class-level annotations to indicate the shape. The
category information will be encoded in the weight matrix Wc,
which combines the original image information. Specifically,
by setting different covariance, we can specify the smoothness
preference to maintain the prediction shape of the category.
From the perspective of computational efficiency, unlike the
complex inference process in CRF based on energy function
minimization, our loss function can be easily realized through
gradient backward propagation in deep learning, as shown in
(11). Of course, it is still time-consuming to calculate the huge
weight matrix Wc. However, since the raw images and the
feature maps are reduced to a smaller scale after RoIAlign,
the computational power loss is acceptable.
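Since (10) is not reproduced in this excerpt, the following is only a sketch of a differentiable normalized-cuts-style loss in the spirit of the proposed cut loss: the classic soft relaxation cut(A, Ā)/assoc(A, V) over a pixel affinity matrix, realized with plain matrix products so that gradients flow through the soft masks. The function name and the dense-matrix layout are ours.

```python
import numpy as np

def soft_ncut_loss(probs, W):
    """Soft normalized-cut loss over one flattened RoI.

    probs: (C, N) soft class assignments; W: (N, N) pixel affinity matrix.
    Returns the sum over classes of cut(A, not A) / assoc(A, V), which is
    minimized when each class forms a strongly connected region.
    """
    degree = W.sum(axis=1)                    # W @ ones, per-pixel degree
    loss = 0.0
    for s in probs:                           # s: (N,) soft mask of a class
        assoc = s @ degree                    # s^T W 1
        cut = s @ W @ (1.0 - s)               # s^T W (1 - s)
        loss += cut / (assoc + 1e-8)          # eps guards empty classes
    return loss
```

In a deep framework the same expression would be written on tensors, so the gradient with respect to the predicted soft masks is obtained by ordinary backpropagation, consistent with the discussion around (11).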
In addition to the statistical characteristics of the position,
the contour is a more explicit and intuitive descriptor for
the shape of the object. When the pixel-level annotation is
available, it is meaningful to measure the predicted boundary
∂p against the boundary ∂g in the ground truth for transferring the
shape information. Here, we use the boundary loss proposed
in [47] as another loss function of the mask layer to improve
the performance. We directly show the final nonsymmetric
L2 distance result after approximation as follows:

$$L_B = \sum_{c=1}^{C} g_c\,(\phi_G \cdot S_c) \tag{14}$$

where φG ∈ ℝᴺ is a distance vector that can be calculated in
advance with the same shape as Sc.
In (14), every element in φG represents the signed distance
between the current pixel and the nearest real boundary in the
ground truth. Specifically, if the current pixel is inside the
ground-truth region, the sign of the distance is negative. Otherwise,
it is positive. In other words, in order to minimize the boundary
loss LB, we need to simultaneously maximize the predicted values
for the pixels with negative φG (the foreground) and minimize the
predicted values for the pixels with positive φG (the background).
This is in line with perceptual
cognition. It should be noted that, unlike conventional loss
functions, such as cross entropy, its results may be negative
due to the approximation in the mathematical derivation that is
used for simplified calculation. However, this does not affect
its effectiveness in a deep network, which has been verified in
medical imaging [47]. In fact, its initial aim is to measure
the distance between two curves by integrating pixels on
the boundary. In view of differentiable requirements and the
limitation of computational power, the loss function is simpli-
fied as the inner product of two vectors after approximation.
To some extent, it is equivalent to an L1-norm loss function
with pixelwise weight, which implies useful information about
boundaries and shapes.
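Under these definitions, φG and the boundary loss (14) can be sketched for a single class with a signed Euclidean distance transform. This is a minimal sketch following the construction in [47]; in the article the loss operates on per-RoI softmax probability maps, and the function names here are ours.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance(gt):
    """phi_G: signed distance to the ground-truth boundary.

    Negative inside the ground-truth region, positive outside.
    If there is no foreground at all, phi_G is defined as zeros here.
    """
    gt = gt.astype(bool)
    if not gt.any():
        return np.zeros(gt.shape)
    # edt(~gt): distance of background pixels to the nearest foreground;
    # edt(gt): distance of foreground pixels to the nearest background.
    return distance_transform_edt(~gt) - distance_transform_edt(gt)

def boundary_loss(prob, gt):
    """Inner product of the probability map with phi_G; may be negative."""
    phi = signed_distance(gt)
    return float((phi * prob).mean())
```

Because φG is negative inside the ground truth, a prediction that fills the foreground drives the loss below zero, which matches the remark above that this loss, unlike cross entropy, can take negative values.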
At this point, we have completed the design of the seg-
mentation loss function of the mask layer in the proposed
prior mask R-CNN. In general, we set up three loss functions:
pixelwise cross-entropy loss LCE in (6), boundary loss LB
in (14) based on boundary (shape) measurement, and the
proposed cut loss LCut in (10) considered shape statistical
characteristics, as shown in Fig. 5.
Besides, in metallographic images, noise exists objectively
and can be divided into three categories: occlusion (inclination
fringes), interference (dislocation), and distortion (residual).
They are caused by the observation process, the specimen
itself, and the preparation, respectively. Among them, noise
caused by occlusion affects performance the most. More
seriously, considering the imaging principle of TEM, occlusion
(equal inclination or thickness fringes) is very common in
images. When the precipitates appear near the occlusions,
the model needs to be able to repair the occluded part.
From the perspective of noise suppression, our designed loss
function (see Fig. 5) could fill in the occluded part effectively
Fig. 5. Loss function of the mask layer in prior mask R-CNN.
Algorithm 1 Postprocessing for Information Extraction
Input: mask list M = [m1, m2, ..., mp], category list C = [c1, c2, ..., cp]
Output: precipitate length list L = [l1, l2, ..., lp]
Initialize list L = []
For i in range(p):
    Extract mi from mask list M and ci from category list C;
    Get region instances ri by the region-growing algorithm based on mi;
    Filter out non-maximum areas in ri;
    Detect eight corner key points (p1, p2, ..., p8) in ri;
    If ci == horizontal:
        li = Mean[Dist(p1, p8), Dist(p2, p7), Dist(p3, p6), Dist(p4, p5)]
    elif ci == longitudinal:
        li = Mean[Dist(p1, p4), Dist(p2, p3), Dist(p5, p8), Dist(p6, p7)]
    else:
        li = Mean[Dist(p1, p5), Dist(p2, p6), Dist(p3, p7), Dist(p4, p8)]
    Append li to L
Return precipitate length list L
by introducing the prior knowledge of shape and contour,
which is a significant improvement.
C. Postprocessing Module for Measurement
For the analysis of precipitates in this article, our ultimate
aim is to help material scientists measure the precipitates,
rather than solve a pure computer vision problem. The mea-
surement here refers to the statistical information of the three
kinds of precipitates, such as the distribution or mean value
of precipitates’ length, which is very important to reveal the
mechanical properties of the alloy. Therefore, from the practi-
cal point of view, we design a postprocessing module to extract
valuable information on the basis of instance segmentation.
Fig. 6. Flowchart of the postprocessing module.
Specifically, from the output of the computer vision network
to measurement acquisition, we mainly face two inevitable
problems. First, the segmentation results of network output
are not always connected, considering that it is obtained
by aggregating each pixel prediction independently. In other
words, a predicted mask may contain multiple isolated regions
that are treated as precipitates at the same time, which might be
caused by the visual noise in TEM images. Second, how to obtain
robust and reliable length information from irregular connected
domains is another problem that needs attention. In view
of the above problems, we designed a simple postprocessing
module based on region growing and key points’ detection for
measuring the precipitations, and the implementation details
are shown in Algorithm 1.
In brief, Algorithm 1 mainly includes a region-growing
algorithm, area filter, key points’ detection, and category-
wise length measurement. First, we use the seeded region-
growing algorithm [58] to get the region instances ri. Under
eight-neighbor pixels’ strategy, the selected seed points (pre-
dicted as foreground pixels) are grown to get the instance of
the connected domain. Next, nonmaximum areas are filtered
out to eliminate the effect of noise and obtain the real precipi-
tates. Then, eight corner points are detected in turn by simple
maximum and minimum functions based on plane position.
We point out that these corners may coincide, considering
the irregular shape. Finally, we set up different distance
measurement rules according to the categories of precipitates,
which is consistent with the statement in materials science. For
example, for horizontal precipitation, the length is based on
the long side. In contrast, for vertical precipitation, the length
refers to its diameter. An example of a postprocessing module
is shown in Fig. 6.
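The steps of Algorithm 1 can be sketched in Python as follows. This is a simplified sketch: connected-component labeling with an 8-neighborhood stands in for the seeded region-growing step, and the category-wise length rules are approximated with extreme points rather than the article's eight corner key points.

```python
import numpy as np
from scipy.ndimage import label

def precipitate_length(mask, category):
    """Approximate length (in pixels) of one predicted precipitate.

    mask: (H, W) binary prediction; category: 'horizontal', 'vertical',
    or 'longitudinal'. Returns 0.0 for an empty mask.
    """
    regions, n = label(mask, structure=np.ones((3, 3), dtype=int))  # 8-neighbor
    if n == 0:
        return 0.0
    sizes = np.bincount(regions.ravel())[1:]
    largest = regions == (1 + int(sizes.argmax()))   # filter non-maximum areas
    ys, xs = np.nonzero(largest)
    if category == 'vertical':
        # Roughly circular: average the axis-aligned extents as the diameter.
        return float((xs.max() - xs.min() + ys.max() - ys.min()) / 2 + 1)
    # Needle-like: largest distance between extreme points as the long side.
    pts = np.stack([xs, ys], axis=1).astype(float)
    extremes = pts[[xs.argmin(), xs.argmax(), ys.argmin(), ys.argmax()]]
    gaps = np.linalg.norm(extremes[:, None] - extremes[None, :], axis=-1)
    return float(gaps.max())
```

Multiplying the pixel length by the imaging scale (1 pixel : 0.15625 nm in this article) converts the result to nanometers.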
III. RESULTS
In this section, we will show a variety of experiments to
test the effectiveness of the proposed prior mask R-CNN for
the detection of precipitates in alloys.
A. Dataset
In this article, our experimental object is direct chill cast
Al–12.7Si–0.7Mg alloy without further chemical modification,
Fig. 7. Left: training samples (blue box) and test samples (red box) in one
slice. Right: overlapping samples (blue box) by sliding for augmentation.
which is widely used as structural materials. The dataset
contains 30 metallographic slices with the size of 2048 ×
2048 under different heat treatment conditions (such as dif-
ferent aging times and aging temperatures). It was observed
by transmission electron microscopy at a scale of 1 pixel:
0.15625 nm. For these metallographic images, we can find
a series of precipitates based on Mg and Si growing in three
orthogonal directions. Specifically, we call them horizontal,
vertical, and longitudinal precipitates, which are labeled at
pixel level by a material expert and seven volunteers in our
team. Here, we point out that the terms “horizontal” (about
+30° to the horizontal axis) and “longitudinal” (about −60°
to the horizontal axis) in this article are not strict. They are
only used to distinguish each other. After annotation, we divide
each slice into four parts and distribute them to the training
set and the test set, respectively. As with many deep learning
projects, we augment our metallographic dataset to expand
the training samples used to train the network. It should be
noted that not all typical image augmentation methods and
affine transformation are allowed in view of the clear material
science significance of precipitates in metallographic images.
For example, image rescaling or rotation may lead to weird
precipitates that cannot be observed in practice, at least under
the current TEM settings. In contrast, overlapping cutting
in the slice is allowed, as shown in Fig. 7. Finally, after
data augmentation, our training set contains 300 (90 raw +
210 augmented) images with the size of 1024 × 1024,
whereas the test set contains 30 images of the same size.
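The overlapping cutting of Fig. 7 amounts to a sliding-window crop over each slice. The sketch below illustrates the idea; the 512-pixel stride is an illustrative choice, not a value stated in the article.

```python
def overlapping_crops(slice_size=2048, crop=1024, stride=512):
    """Top-left corners (y, x) of overlapping crops from one square slice.

    A stride smaller than the crop size yields overlapping samples,
    which augments the training set without the rescaling or rotation
    that would distort the physical meaning of the precipitates.
    """
    starts = range(0, slice_size - crop + 1, stride)
    return [(y, x) for y in starts for x in starts]

# One 2048x2048 slice with stride 512 yields a 3x3 grid of 1024x1024 crops.
```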
B. Instance Segmentation Results
First, we introduce some quantitative indicators used to
evaluate our method. For the instance segmentation task,
generally speaking, the performance of the model is evaluated
from two aspects: object detection and mask segmentation.
Among them, for object detection tasks, a common evaluation
index is the mean average precision (mAP). It is popular for
natural image tasks, such as MS COCO [24] and PASCAL
VOC challenge [59], as follows:
$$\mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP}_c = \frac{1}{C}\sum_{c=1}^{C} \frac{1}{N_c} \sum_{TP=1}^{N_c} \max_{tp \ge TP} \frac{tp}{tp + \mathrm{FP}(tp)} \tag{15}$$

where Nc means the total number of instances for category c,
tp means the number of truly detected objects, and FP(tp) is
a specific function that returns the number of falsely detected
objects; specifically, if tp samples can be detected, the minimum
number of false positives is returned; otherwise, infinity is
returned.

TABLE I
HYPERPARAMETERS INVOLVED IN OUR METHOD
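The per-category AP in (15) can be computed by sweeping the score-sorted detection list and, for each target count of true positives TP, taking the best achievable precision. This is a sketch under our reading of FP(tp) in (15); the function name and input convention are ours.

```python
import numpy as np

def average_precision(is_tp, n_gt):
    """AP of one category following (15).

    is_tp: detections sorted by descending score, True for true positives;
    n_gt: number of ground-truth instances N_c. For each target TP = 1..N_c,
    the best precision tp / (tp + FP(tp)) over all operating points reaching
    that count is taken; unreachable counts contribute 0 (FP = infinity).
    """
    is_tp = np.asarray(is_tp, dtype=bool)
    tp = np.cumsum(is_tp)                 # true positives up to each rank
    fp = np.cumsum(~is_tp)                # false positives up to each rank
    prec = tp / (tp + fp)
    ap = 0.0
    for target in range(1, n_gt + 1):
        reachable = prec[tp >= target]
        ap += reachable.max() if reachable.size else 0.0
    return ap / n_gt
```

Averaging this quantity over the C categories gives the mAP reported in Table II.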
Next, we will do experiments to test the performance of the
proposed prior mask R-CNN. The specific hyperparameters of
this work are shown in Table I. Overall, we basically inherited
the typical settings in mmdetection [60]. For example, in each
training iteration, up to 256 anchors and 512 RoI are randomly
selected to guide RPN and RoI network learning, respectively.
The maximum number of proposal boxes from RPN is 1000,
and the NMS threshold for the positive sample is set to 0.7.
Considering the specificity of our task, some hyperparameters
need to be adjusted accordingly. In view of the small size
of vertical precipitates, we set the scale of the anchor to 4
to ensure that they can be fully detected. As for the learning
strategy, we use Adam [61] with a learning rate of 0.0003,
and the training epoch is set to 15. After the 15th training
epoch, 30 test images that did not appear in the training period
would be used to evaluate the performance. Instances with the
predicted probabilities higher than 0.3 will be considered valid,
and based on this, the quantitative index will be calculated
by (15). In order to further verify the effectiveness of the pro-
posed method, mask R-CNN [22], mask scoring R-CNN [29],
and cascade mask R-CNN [27] are also considered under the
same dataset as a comparison. The quantitative indicators are
shown in Table II.
In Table II, we list the mean mAP and each category
AP at the same time, where the subscripts H, V, and L
refer to horizontal, vertical, and longitudinal precipitations.
The definition of AR is based on the same rule, and the
bold number of each column is the best performance of the
corresponding evaluation index. Considering universality and
fairness, mAP is selected as the main index to comprehensively
evaluate methods.
On the whole, the performances for object detection (upper
of Table II) are generally better than that for mask segmenta-
tion (lower of Table II). This degradation could be understood,
TABLE II
PERFORMANCE OF OBJECT DETECTION (UPPER) AND MASK SEGMENTATION (LOWER) AMONG MASK R-CNN, MASK SCORING R-CNN, CASCADE
MASK R-CNN, AND THE PROPOSED PRIOR MASK R-CNN
considering that mask segmentation is often based on the
results of object detection for the typical two-stage instance
segmentation framework. However, whether for object detec-
tion (upper of Table II) or mask segmentation task (lower of
Table II), the proposed prior mask R-CNN achieves better
performances in more evaluation indexes. Among them, for the
main index mAP of object detection, our algorithm achieves
the highest score of 0.475, which is ahead of 0.397 from mask
R-CNN, 0.447 from scoring R-CNN, and 0.378 from cascade
R-CNN. The situation of mask segmentation is basically the
same, and our method achieves the highest score of 0.298 for
the mask segmentation task. It is obvious that our method
should be more effective and appropriate for the detection and
segmentation of metallographic precipitates.
Besides, we observe significant differences in performance
among different categories. For example, in the upper of
Table II, the minimum APV for vertical precipitates of all
methods is 0.496 (mask R-CNN), whereas the maximum APH
for horizontal precipitates is only 0.356 (prior mask R-CNN).
This phenomenon is even more obvious in the mask segmen-
tation task. This is mainly due to the difficulty in predicting
the horizontal or longitudinal precipitates. In view of the
imaging principle of TEM, the horizontal and longitudinal
precipitates are often blurred with inexact contour compared
with the obvious dark gray vertical precipitates with a circular
shape, as shown in Fig. 1. In addition, for the prediction of
rectangle (needle) shape with a large length-to-width ratio of
horizontal and longitudinal precipitates, conventional convolu-
tion networks may encounter difficulties. It is worth noting that
this performance difference between categories is relatively
small for our proposed prior mask R-CNN. Fusing prior
knowledge into the deep network by our specific structure
(see Section II-A) and loss function (see Section II-B) might
alleviate the phenomenon. In addition, in order to compare
different methods more intuitively, we show prediction results
directly for the test set, just as in Fig. 8.
In Fig. 8, we selected three TEM metallographic images
to show the prediction results, which are realized by the
mmdetection toolbox [60]. The first row in Fig. 8 is the overall
prediction results of different methods, whereas the second
and third rows focus more on the boundary box detection
effect and mask segmentation result, respectively. Specifically,
for the second row in Fig. 8, our method predicts more
precipitates with higher scores, such as yellow horizontal
precipitates. This implies that our method is sensitive to the
complex precipitates, which is consistent with the high recall
rate (mAR = 0.586) in Table II. Unlike mask scoring R-CNN
and cascade mask R-CNN, which add an extra subnetwork to
the topology structure of mask R-CNN, our method only fuses
the size input through a simple skip connection (see Fig. 3).
This is helpful for scale-sensitive classification problems, such
as precipitate detection in this article, so our method can
effectively detect more precipitates accurately. Besides, for the
third row in Fig. 8, our mask segmentation results are closest
to the shape of annotated precipitates in the ground truth. This
might be related to two additional segmentation loss functions
in prior mask R-CNN. To be more precise, the proposed cut
loss (10) that contains prior knowledge guides the network to
produce a smooth and consistent mask, by setting different
weight matrixes according to statistical characteristics. The
boundary loss (14) further ensures the rationality of the
predicted shape by measuring the contour distance between
prediction and ground truth. All these specific settings enable
our method to achieve better results.
Furthermore, we point out that the selection of hyperpa-
rameters in the model is ad hoc without using a validation
set. It implies that the hyperparameters in Table I may not
be optimal. The main criteria for selecting them are based
on the specific situation of our task. For example, we set the
“Anchor Scale” as 4 to ensure that the vertical precipitates
with an average size of 26 nm² can be detected effectively.
The selections of the correlation coefficients τHxy = +2.5,
τVxy = 0, and τLxy = −2.5 in the cut loss are based
on the statistical knowledge from the currently available
dataset. In addition, practicability is also an important cri-
terion. We changed the “Threshold for Test” from a typical
0.05 to 0.3. This correction leads to the degradation of the
mAP score (from 0.503 to 0.475) but effectively reduces
the false positive rate, which is more valuable for material
experts. Of course, the settings of all methods in Table II
are basically consistent, except for some inherent structures
or loss functions (e.g., additional scoring layer in Scoring
R-CNN [29]). Under the above configuration, the performance
improvement of our prior mask R-CNN is relatively obvious,
just like Table II.
As mentioned above, different from the natural image
challenge, the evaluation index based on computer vision is not
the most important for the actual microstructure detection task.
From a practical point of view, our method should be able to
extract useful information from TEM images, which is helpful
Fig. 8. Prediction results between mask R-CNN, mask scoring R-CNN, cascade mask R-CNN, and the proposed prior mask R-CNN.
for material scientists to measure and analyze. Therefore,
in the following, we test the performance of the postprocessing
module for measurement proposed in this article. In order
to get the results more fairly, we selected three different
batches in the test set. These three batches are produced under
different heat treatments, specifically aging time, which is
meaningful to study the mechanical properties of the alloy.
That is to say, the difference between these test images is
even more obvious due to the different production processes
and inevitable changes in the environment. In addition, other
methods are also considered for comparison. The results are
shown in Table III.
In Table III, we show the average length of three kinds of
precipitates for three batches. H, V, and L refer to the hor-
izontal, vertical, and longitudinal precipitations, respectively.
The GT in the last row indicates the real annotated results by
experts, and the bold numbers in each column are the closest
results to the ground truth. Generally speaking, the post-
processing results based on prior mask R-CNN are more con-
sistent with the real results, no matter for the aging time of 1 or
12 h. This shows that the measurement results based on region
growing and key points’ detection (see Section II-C) can
accurately extract the material science information from the
network prediction results. It further proves the effectiveness
and robustness of our method. However, we must point out
that the accuracy under the current dataset is limited (maximum
error = 5 nm), which is not enough for material analysis.
However, with more accurate and fine-labeled metallographic
data, our method still has the potential to be used in the actual
production process.
It is worth mentioning that, in the actual production, the size
of the image may not be consistent with that of the images (1024 ×
1024) in this article. If we want to use the trained model to
predict images of different sizes directly, we need to manually
convert these images into the same scale (1 pixel:
0.15625 nm). In other words, our model is more sensitive to
scale than to size. This may be related to the mechanism of the
convolutional network. Furthermore, by converting the scale,
more metallographic images can be used for training or testing.
Considering the value of TEM images, this is very meaningful
compared with directly discarding these data.
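Bringing an image acquired at a different magnification to the training scale amounts to a simple resize-factor computation, as sketched below; the acquisition scale in the example is hypothetical.

```python
def rescale_size(width, height, src_nm_per_px, dst_nm_per_px=0.15625):
    """Pixel size an image must be resized to so that one pixel again
    covers dst_nm_per_px nanometers (the training scale in this article)."""
    factor = src_nm_per_px / dst_nm_per_px
    return round(width * factor), round(height * factor)

# An image taken at 0.3125 nm/px must be upsampled by a factor of 2
# before being fed to the trained model.
```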
C. Ablation Study
For our proposed prior mask R-CNN, we make two
improvements to the basic mask R-CNN framework, from the
network structure and loss function to more accurately detect
the precipitates in the alloy. Specifically, in terms of structure,
we introduce size input in the classification and regression
layer of the RoI network by skipping connection. For the loss
function, the weakly supervised loss function (10) based on
traditional graph cuts and the boundary loss function (14)
based on distance are used to segment in the mask layer
of the RoI network. These two improvements together make
our method achieve better performance, whether from the
perspective of computer vision or practical point, as shown
in Tables II and III. In this section, the specific role of these
two improvements will be analyzed in detail. Specifically,
under the same training settings and dataset, we make addi-
tional experiments on basic mask R-CNN with only structural
improvement and only loss function improvement for ablation
study. The final object detection performance, mask segmenta-
tion performance, and prediction results are shown in Table IV
and Fig. 9, respectively.
The term “Specific Structure” in Table IV corresponds
to the “Size Input Structure” in Fig. 9, which refers to the basic
TABLE III
AVERAGE LENGTH OF PRECIPITATES FOR THREE BATCHES (AGING TIME: 1, 3, AND 12 H) IN THE TEST SET AFTER THE INFORMATION EXTRACTION MODULE
TABLE IV
PERFORMANCE OF OBJECT DETECTION (UPPER) AND MASK SEGMENTATION (LOWER) OF BASIC MASK R-CNN, MASK R-CNN WITH SIZE INPUT
STRUCTURE, MASK R-CNN WITH ADDITIONAL CUT AND BOUNDARY LOSS, AND COMPLETE PRIOR MASK R-CNN
Fig. 9. Prediction results between basic mask R-CNN, mask R-CNN with size input, mask R-CNN with cut and boundary loss, and complete prior mask
R-CNN.
mask R-CNN with additional size input. Similarly, “Specific
Loss” corresponds to “Cut & Boundary Loss,” which indicates
the basic mask R-CNN with cut and boundary loss function.
The convention for bold numbers is the same as before, that is, the
best performance for each evaluation index. First, from the
quantitative results in Table IV, it is obvious that both “Specific
Structure” and “Specific Loss” could improve the performance
of basic mask R-CNN. In the object detection task, from
the mAP column of the main evaluation index in upper of
Table IV, we find that the improvement effect of “Specific
Structure” (from 0.397 to 0.446) is slightly better than that of
“Specific Loss” (from 0.397 to 0.445). On the contrary, the
improvement effect of “Specific Loss” (from 0.242 to 0.296)
is better than that of “Specific Structure” (from 0.242 to 0.264)
in the mask segmentation task based on the lower of Table IV.
This situation is similar to other indicators, such as mAR,
which is used to test the recall rate of the model.
This phenomenon should be consistent with the original
intention of the designed two improvements. Specifically, for
structural improvement, we input the size information to the
classification and regression layer at the object detection stage,
after observing the obvious difference in the size distribution
of different types of precipitates (see Fig. 3). This is helpful
for size-sensitive microstructure classification tasks. It leads
to the improvement of the performance of object detection
after the structural improvement of the basic mask R-CNN.
Besides, loss improvement is mainly designed for the mask
segmentation task. Based on the predictable shape of different
precipitates, we set the cut loss to produce a smooth and proper
prediction result with category-related preference. In addition,
because the model is based on the typical two-stage instance
segmentation framework, no matter for RPN or the classi-
fication layer, the regression layer, and the mask layer in
the RoI network, their inputs are from the same backbone
convolution network (ResNet50 +FPN in this article). That
is to say, the improvement of any branch may be linked.
This also explains why the results in Table IV tend to show
methodological relevance. For example, an additional cut loss
set in the mask layer is also significant for the improvement
of the object detection task, except for the deserved original
segmentation task.
Finally, we test the effect of the proposed cut loss, which
is an important contribution of this article. In short, the cut
loss function is a kind of segmentation loss function, which is
inspired by the graph cuts theory. Compared with pixel-level
annotation used in the cross entropy, the cut loss only needs
class-level annotation and corresponding statistical character-
istics, which is also adaptive to weakly supervised learning.
The statistical characteristics of different categories could be
regarded as prior knowledge. When objects with the same
category appear in desired and predictable shapes, our loss
function, which benefits from prior knowledge, is helpful for the
corresponding segmentation task. We note that the prior
knowledge is integrated into the loss function by setting the
corresponding weight matrix of categories. In the following,
in order to further test the effect of prior knowledge and cut
loss, we set three different weight matrices for longitudinal
precipitates by selecting different correlation degrees τLxy =
(−2, −2.5, −3) in (13). The final prediction results are shown
in Fig. 10.
In Fig. 10, the first column shows the overall prediction
results under the three correlation degrees, followed by the
results of the longitudinal precipitates and the enlarged view.
Intuitively, the outputs of the network are quite different.
Specifically, when τLxy =−2, the prediction shapes of
longitudinal precipitates are relatively blunt. However, when
we set τLxy =−3, the predicted shapes of that become sharp.
This shows that, with constant learning, the loss function could
control the shape of the prediction result. At the same time,
it also implies that the iterative method based on gradient
descent is effective to optimize the objective function of nor-
malized graph cuts to a certain extent. That is to say, by setting
the weight matrix that involves prior knowledge in cut loss,
we can control the shape of predicted segmentation. Different
from the cross-entropy loss based on the pixel level, this loss
function is more consistent with human visual perception. The
relevant prior knowledge is naturally integrated into the end-
to-end training period of the deep network without additional
Fig. 10. Prediction results for longitudinal precipitations under different
correlation degrees [τLxy = (−2, −2.5, −3)].
postprocessing modules or complex inference processes. More
importantly, the cut loss can be employed for any conventional
image segmentation network, besides the instance segmenta-
tion framework in this article. The cut loss function may be
an appropriate complement to the cross-entropy loss when the
shapes of categories are statistically significant.
D. Limitations
In general, based on deep learning, we proposed a novel
framework for the measurement of precipitates. There may be
some possible limitations in this study. First, from the per-
spective of experimental materials, our metallographic dataset
has only 300 nanometer-level TEM images. To be honest, it is
relatively small compared to popular natural image datasets,
such as MS COCO (≈330k). Considering the high cost
of specimen preparation and expert annotation, it is difficult
to obtain a large number of metallographic images. How
to achieve better performance in the current small dataset
is worth thinking about. Second, from the perspective of a
deep network, strictly speaking, the tuning of hyperparameters
depends on the performance of the validation set. However,
in our work, the selection of hyperparameters in the model
is ad hoc without using such a validation set. As a result,
these results may not be fully generalized. In fact, finding the
proper hyperparameters might be very difficult, especially for
the model that contains many hyperparameters to be set in
this work. Finally, with the introduction of the proposed cut
loss (10), the computational efficiency will inevitably decrease.
Of course, this degradation (from about 0.55 to 0.75 s/image)
is basically acceptable. In future work, in order to solve the
problem of generalization, some learning strategies are worthy
of attention, such as few-shot learning and transfer learning.
As for the selection of hyperparameters, automatic machine
learning [62] seems to be a good solution. These may be the
key to the practical application of our automatic measurement
methods in materials science.
IV. CONCLUSION
In this article, we proposed a novel framework for the
measurement of precipitates in aluminum alloys. It is a
two-stage instance segmentation network, which is based on
mask R-CNN and consists of the backbone network, RPN,
and the RoI network. For the RoI network, considering that
the size distributions of different precipitate categories have
obvious differences, we input the size information based on
boundary box area into the classification layer and regression
layer of the RoI network through a simple skip connection.
Besides, since the shape of precipitates is predictable, the
proposed cut loss function, including prior knowledge, and
the boundary loss function for measuring contour distance
are designed to segment the mask in the mask layer. In fact,
our framework improves the basic mask R-CNN in terms of
topological structure and loss, respectively, based on the prior
knowledge (size and shape) of the category. As a result, we call
the proposed framework prior mask R-CNN. From a practical
point of view, we design a simple postprocessing module
to extract material information based on the region-growing
algorithm and key points’ detection. As for the experiments,
our method achieves an mAP score of 0.475 in the object
detection task and an mAP score of 0.298 in the mask
segmentation task, which surpasses other comparison methods.
In addition, the length information of precipitates obtained
from the output of our network is more consistent with that
annotated by experts. This should be attributed to the designed
structure and loss function of our method. In the ablation study,
we tested these designs separately and explored the relevance
of the proposed cut loss to the predicted shape. In summary,
when the shapes and sizes of the objects are predictable in
advance, our framework, prior mask R-CNN, offers a new way
to improve automatic measurement performance by fusing prior
knowledge.
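As an illustration of the postprocessing idea summarized above (a simplified sketch under assumed inputs, not the exact module), seeded region growing on a predicted soft mask followed by a farthest-pair search over the grown region recovers a length measurement:

```python
import numpy as np
from collections import deque

def grow_region(mask, seed, thresh=0.5):
    """4-connected seeded region growing on a soft mask in [0, 1]."""
    h, w = mask.shape
    region = np.zeros((h, w), dtype=bool)
    q = deque([seed])
    while q:
        r, c = q.popleft()
        if 0 <= r < h and 0 <= c < w and not region[r, c] and mask[r, c] >= thresh:
            region[r, c] = True
            q.extend([(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)])
    return region

def length_from_region(region):
    """Approximate the object length as the largest pairwise distance
    between region pixels (the two key points of an elongated object)."""
    pts = np.argwhere(region).astype(float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return dists.max()
```

The brute-force pairwise search is quadratic in the number of region pixels, which is acceptable for nanoscale precipitates occupying small RoIs; a convex-hull rotating-calipers step would be the standard optimization for larger regions.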
REFERENCES
[1] L. P. Troeger and E. A. Starke, “Microstructural and mechanical char-
acterization of a superplastic 6xxx aluminum alloy,” Mater. Sci. Eng.,
A, vol. 277, nos. 1–2, pp. 102–113, Jan. 2000.
[2] T. Hemalatha, S. Akilandeswari, T. Krishnakumar, S. G. Leonardi,
G. Neri, and N. Donato, “Comparison of electrical and sensing proper-
ties of pure, Sn- and Zn-doped CuO gas sensors,” IEEE Trans. Instrum.
Meas., vol. 68, no. 3, pp. 903–912, Mar. 2019.
[3] F. Liu, F. Yu, D. Zhao, and L. Zuo, “Microstructure and mechanical
properties of an Al-12.7Si-0.7Mg alloy processed by extrusion and heat
treatment,” Mater. Sci. Eng. A., vol. 528, pp. 3786–3790, Apr. 2011.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” in Proc. Int. Conf. Learn. Represent.,
2015, pp. 1–14.
[5] Y. Liu et al., “Richer convolutional features for edge detection,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1939–1946,
Aug. 2019.
[6] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., Jun. 2016, pp. 779–788.
[7] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep
convolutional encoder-decoder architecture for image segmentation,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495,
Dec. 2017.
[8] S. M. Azimi, D. Britz, M. Engstler, M. Fritz, and F. Mücklich,
“Advanced steel microstructural classification by deep learning meth-
ods,” Sci. Rep., vol. 8, no. 1, pp. 1–14, Dec. 2018.
[9] C. Wang, D. Shi, and S. Li, “A study on establishing a microstructure-
related hardness model with precipitate segmentation using deep learning
method,” Materials, vol. 13, no. 5, p. 1256, Mar. 2020.
[10] M. Li, D. Chen, S. Liu, and F. Liu, “Grain boundary detection
and second phase segmentation based on multi-task learning and
generative adversarial network,” Measurement, vol. 162, Oct. 2020,
Art. no. 107857.
[11] K. Gajalakshmi, S. Palanivel, N. J. Nalini, S. Saravanan, and
K. Raghukandan, “Grain size measurement in optical microstruc-
ture using support vector regression,” Optik, vol. 138, pp. 320–327,
Jun. 2017.
[12] O. Dengiz, A. E. Smith, and I. Nettleship, “Grain boundary detection in
microstructure images using computational intelligence,” Comput. Ind.,
vol. 56, nos. 8–9, pp. 854–866, Dec. 2005.
[13] X. Zhenying, Z. Jiandong, Z. Qi, and P. Yamba, “Algorithm based
on regional separation for automatic grain boundary extraction using
improved mean shift method,” Surf. Topography, Metrology Properties,
vol. 6, no. 2, Apr. 2018, Art. no. 025001.
[14] H. Peregrina-Barreto, I. R. Terol-Villalobos, J. J. Rangel-Magdaleno,
A. M. Herrera-Navarro, L. A. Morales-Hernández, and
F. Manríquez-Guerrero, “Automatic grain size determination in
microstructures using image processing,” Measurement, vol. 46,
no. 1, pp. 249–258, Jan. 2013.
[15] B. Lu, M. Cui, Q. Liu, and Y. Wang, “Automated grain boundary
detection using the level set method,” Comput. Geosci., vol. 35, no. 2,
pp. 267–275, Feb. 2009.
[16] B. Ma et al., “Fast-FineCut: Grain boundary detection in micro-
scopic images considering 3D information,” Micron, vol. 116, pp. 5–14,
Jan. 2019.
[17] C. A. Paredes-Orta, J. D. Mendiola-Santibañez, F. Manriquez-Guerrero,
and I. R. Terol-Villalobos, “Method for grain size determination in
carbon steels based on the ultimate opening,” Measurement, vol. 133,
pp. 193–207, Feb. 2019.
[18] L. Liu et al., “Deep learning for generic object detection: A survey,”
Int. J. Comput. Vis., vol. 128, no. 2, pp. 261–318, Jan. 2020.
[19] B. Wang et al., “Automatic fault diagnosis of infrared insulator images
based on image instance segmentation and temperature analysis,” IEEE
Trans. Instrum. Meas., vol. 69, no. 8, pp. 5345–5355, Aug. 2020.
[20] J. Ma, K. Qian, X. Zhang, and X. Ma, “Weakly supervised instance
segmentation of electrical equipment based on RGB-T automatic anno-
tation,” IEEE Trans. Instrum. Meas., vol. 69, no. 12, pp. 9720–9731,
Dec. 2020.
[21] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[22] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc.
IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2961–2969.
[23] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards
real-time object detection with region proposal networks,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[24] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc.
Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[25] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 39, no. 4, pp. 640–651, Apr. 2017.
[26] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for
instance segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Recognit., Jun. 2018, pp. 8759–8768.
[27] Z. Cai and N. Vasconcelos, “Cascade R-CNN: High quality object
detection and instance segmentation,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 43, no. 5, pp. 1483–1498, May 2021.
[28] K. Chen et al., “Hybrid task cascade for instance segmentation,” in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019,
pp. 4974–4983.
[29] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, “Mask scoring
R-CNN,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2019, pp. 6409–6418.
[30] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans.
Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[31] Y. Guo, H. Shi, A. Kumar, K. Grauman, T. Rosing, and R. Feris,
“SpotTune: Transfer learning through adaptive fine-tuning,” in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019,
pp. 4805–4814.
[32] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim,
“Image to image translation for domain adaptation,” in Proc. IEEE/CVF
Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4500–4509.
[33] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few
examples: A survey on few-shot learning,” ACM Comput. Surv., vol. 53,
no. 3, pp. 1–34, Jul. 2020.
[34] Z. Mirikharaji and G. Hamarneh, “Star shape prior in fully convolu-
tional networks for skin lesion segmentation,” in Proc. MICCAI, 2018,
pp. 737–745.
[35] S. Y. Han, H. J. Kwon, Y. Kim, and N. I. Cho, “Noise-robust pupil center
detection through CNN-based segmentation with shape-prior loss,” IEEE
Access, vol. 8, pp. 64739–64749, 2020.
[36] C. Zotti, Z. Luo, O. Humbert, A. Lalande, and P. M. Jodoin, “GridNet
with automatic shape prior registration for automatic MRI cardiac
segmentation,” in Proc. STACOM-MICCAI, 2017, pp. 73–81.
[37] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,
“DeepLab: Semantic image segmentation with deep convolutional nets,
atrous convolution, and fully connected CRFs,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018.
[38] S. Zheng et al., “Conditional random fields as recurrent neural net-
works,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015,
pp. 1529–1537.
[39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2016, pp. 770–778.
[40] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4,
Inception-ResNet and the impact of residual connections on learning,”
in Proc. AAAI Conf. Artif. Intell., 2016, pp. 4278–4284.
[41] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional net-
works for biomedical image segmentation,” in Proc. MICCAI, 2015,
pp. 234–241.
[42] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
“Feature pyramid networks for object detection,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.
[43] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “YOLACT: Real-time instance
segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV),
Oct. 2019, pp. 9157–9166.
[44] X. Chen, R. Girshick, K. He, and P. Dollár, “TensorMask: A foundation
for dense object segmentation,” in Proc. IEEE/CVF Int. Conf. Comput.
Vis. (ICCV), Oct. 2019, pp. 2061–2069.
[45] Y. Zhang and Q. Yang, “A survey on multi-task learning,” 2017,
arXiv:1707.08114. [Online]. Available: http://arxiv.org/abs/1707.08114
[46] S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proc. IEEE
Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1395–1403.
[47] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and
I. B. Ayed, “Boundary loss for highly unbalanced segmentation,” in
Proc. Int. Conf. Med. Imag. Deep Learn., 2019, pp. 285–296.
[48] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional
neural networks for volumetric medical image segmentation,” in Proc.
4th Int. Conf. 3D Vis. (3DV), Oct. 2016, pp. 565–571.
[49] R. Zhang et al., “Automatic segmentation of acute ischemic stroke
from DWI using 3-D fully convolutional DenseNets,” IEEE Trans. Med.
Imag., vol. 37, no. 9, pp. 2149–2160, Sep. 2018.
[50] D. Hao et al., “Sequential vessel segmentation via deep channel attention
network,” Neural Netw., vol. 128, pp. 172–187, Aug. 2020.
[51] D. Karimi and S. E. Salcudean, “Reducing the Hausdorff distance in
medical image segmentation with convolutional neural networks,” IEEE
Trans. Med. Imag., vol. 39, no. 2, pp. 499–513, Feb. 2020.
[52] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time
style transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vis.,
2016, pp. 694–711.
[53] J. Shi and J. Malik, “Normalized cuts and image segmentation,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905,
Aug. 2000.
[54] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers,
“Normalized cut loss for weakly-supervised CNN segmentation,” in
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
pp. 1818–1827.
[55] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected
CRFs with Gaussian edge potentials,” in Proc. Adv. Neural Inf. Process.
Syst., 2011, pp. 109–117.
[56] J. Ma, “Segmentation loss odyssey,” 2020, arXiv:2005.13449. [Online].
Available: http://arxiv.org/abs/2005.13449
[57] Z.-H. Zhou, “A brief introduction to weakly supervised learning,” Nat.
Sci. Rev., vol. 5, no. 1, pp. 44–53, Jan. 2018.
[58] R. Adams and L. Bischof, “Seeded region growing,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 16, no. 6, pp. 641–647, Jun. 1994.
[59] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams,
J. Winn, and A. Zisserman, “The Pascal visual object classes challenge:
A retrospective,” Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136,
Jan. 2015.
[60] K. Chen et al., “MMDetection: Open MMLab detection tool-
box and benchmark,” 2019, arXiv:1906.07155. [Online]. Available:
http://arxiv.org/abs/1906.07155
[61] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–41.
[62] X. He, K. Zhao, and X. Chu, “AutoML: A survey of the state-of-the-art,”
Knowl.-Based Syst., vol. 212, Jan. 2021, Art. no. 106622.
Mingchun Li received the B.S. and M.S. degrees in
automation from Northeastern University, Shenyang,
China, in 2015 and 2018, respectively, where he is
currently pursuing the Ph.D. degree with the College
of Information Science and Engineering.
His research lies at the intersection of machine
learning and image processing. His current research
work is about medical signals and industrial intelli-
gence based on deep learning.
Dali Chen received the B.S., M.S., and Ph.D.
degrees in automation, pattern recognition, and
intelligent systems from Northeastern University,
Shenyang, China, in 2003, 2005, and 2008, respec-
tively.
He is currently an Associate Professor with the
College of Information Science and Engineering,
Northeastern University. His research lies at the
intersection of machine learning and image process-
ing. His current research interest is to develop deep
learning algorithms for medical image processing
and industrial intelligent systems.
Shixin Liu (Member, IEEE) received the B.S.
degree in mechanical engineering from Southwest
Jiaotong University, Sichuan, China, in 1990, and
the M.S. and Ph.D. degrees in systems engineer-
ing from Northeastern University, Shenyang, China,
in 1993 and 2000, respectively.
He is currently a Professor with the College of
Information Science and Engineering, Northeastern
University. He has authored or coauthored over
100 publications, including one book. His research
interests are in intelligent optimization algorithms,
planning and scheduling, machine learning, and computer vision.
Fang Liu received the B.S. and Ph.D. degrees
in materials science from Northeastern University,
Shenyang, China, in 2004 and 2013, respectively.
She is currently a Lecturer with the School
of Materials Science and Engineering, Northeast-
ern University. Her current research interests are
wrought aluminum–silicon alloy and alloy design
based on finely dispersed second-phase particles
strengthening matrix.