Fire Detection based on Convolutional Neural
Networks with Channel Attention
Xiaobo Zhang
School of Automation
Southeast University
Nanjing 210096, China
zxb852@sina.com
Kun Qian
School of Automation, Southeast University,
Nanjing 210096, China
Key Laboratory of Measurement and
Control of Complex Systems of Engineering,
Ministry of Education of China
kqian@seu.edu.cn
Kaihe Jing
School of Automation
Southeast University
Nanjing 210096, China
928104929@qq.com
Jianwei Yang
Future Science & Technology Park
Changping District,
Beijing, China, 102209
hvdcyjw@sina.com
Hai Yu
Global Energy Interconnection Research
Institute, State Grid
Nanjing, China, 210000
83616883@qq.com
Abstract—Existing research on fire detection mostly relies on two-stage methods, which slows detection and bounds localization accuracy by the first-stage candidate region extraction algorithm. To achieve high-precision real-time fire detection, this paper proposes a Yolo detection network combined with an attention mechanism. The attention module is serially appended to the final convolutional layers at each of the three output scales of Yolo v3. The channel attention module updates each feature map by a weighted sum over all channels, which captures the semantic dependencies between channels in the deep layers of the network and improves the generalization ability of the model. Experiments show that the proposed method improves fire detection accuracy without reducing detection speed.
Keywords—Image recognition; fire detection; deep learning; attention mechanism; Yolo
This work is supported by the National Natural Science Foundation of China (Grant No. 61573101) and by the Science and Technology Program of the Global Energy Interconnection Research Institute, "Research on infrared/ultraviolet image recognition algorithm and software modules development".
I. INTRODUCTION
Typical industrial production scenes such as power plants contain flammable and explosive materials. A fire can cause severe damage to production and endanger personnel. In addition, a production environment may contain multiple operating devices, and knowing which device is burning enables a more targeted response. Real-time fire detection is therefore essential.
Fire detection methods can be divided into two categories: 1) traditional fire detection and 2) vision-based fire detection. Traditional fire detection generally relies on sensors such as smoke, temperature, and photosensitive detectors. It works only at close range, can monitor only a limited volume of space, and is unsuitable for open areas; moreover, it cannot localize the flame.
Computer vision and deep learning have developed rapidly in recent years, and image-based algorithms can effectively overcome the shortcomings of traditional fire detection. Compared with traditional sensors, images offer a wider field of view and a longer detection range, and they carry more information, including the position of the flame. Surveillance cameras are common in production scenes and inexpensive. Image-based fire detection is likewise divided into traditional methods and deep learning methods. Among deep learning methods, most current research focuses on image-level fire detection rather than region-level fire detection. However, in typical power plant environments such as control rooms and substations, detecting and localizing flame regions in images is essential for identifying the burning equipment.
Yolo v3 is a recently proposed object detection network with excellent performance. This paper combines the channel attention mechanism with the Yolo v3 network, using the channel attention module to capture the dependencies among Yolo's deep semantic features. Experiments show that the method improves fire detection accuracy in multi-scale scenes without reducing detection speed.
II. RELATED WORK
Image processing provides an effective solution to the problems of traditional fire detection. Fire detection based on image processing is divided into traditional methods and deep learning methods. Traditional recognition algorithms can be summarized into three stages: 1) flame pixel classification, 2) motion detection, and 3) candidate region feature analysis. Flame pixel classification typically establishes static color models in various color spaces, which generate flame candidate regions; motion detection captures the dynamic features of flames; candidate region feature analysis synthesizes the preceding results and realizes fire detection through conditional filtering. Different methods combine flame pixel classification and motion detection in different ways. Celik et al. [9] used manually marked flame masks to extract flame pixel values for subsequent color model research. Celik et al. [1] proposed a flame color model represented by geometric shapes in RGB space.
In recent years, deep learning has become the mainstream approach to fire detection. Most existing studies only identify whether an image contains flame; few address flame localization. Dunnings et al. [7] used superpixel segmentation to divide the image into regions, performed fire detection on each segment, and took the union of the detected regions to obtain the segmentation. The positioning accuracy of this two-stage method is poor, and the superpixel segmentation step is very time-consuming. Building on a traditional color model, Zhong et al. [4] proposed a two-stage algorithm combining candidate region generation with a classification network, but the candidate regions were not used to locate the flame, and the output is only a binary classification of the image. Zhao et al. [14] used saliency detection to extract suspected fire areas, computed color and texture features of the ROI, and then applied two logistic regression classifiers to the ROI feature vectors; this is also a two-stage method whose positioning accuracy and detection speed are limited.
Yolo v3 is a recently proposed detection network with excellent performance. It casts bounding box localization as a regression problem, achieves high accuracy at high speed, and is well suited to real-time object detection. The attention mechanism was first used in machine translation and has since been applied to convolutional networks for image processing with good results. Fu et al. [6] proposed dual spatial and channel attention modules that capture the spatial and channel relationships of deep semantics. Wu et al. [2] introduced the channel attention mechanism into a flame classification network, using attention to learn the nonlinear interaction between channels and improving the accuracy of flame type classification. This paper combines Yolo v3 with the attention mechanism to achieve accurate and efficient fire detection.
III. BUILDING FLAME DATASET BY RGB-T AUTOMATIC
ANNOTATION
At present, most flame datasets available online provide only binary image-level labels rather than the bounding box labels required for fire detection. Therefore, this paper constructs a flame dataset. To obtain sample labels accurately and quickly, our previous work on RGB-T image registration was used. The RGB-IR camera shown in Fig. 1 was used to collect flame samples, and flames in multi-scale scenes were captured as paired RGB-T samples. A total of 5 videos were collected. The RGB images serve as the network input, and the infrared images feed the automatic annotation algorithm that generates the sample labels. To segment the flame mask in the infrared image accurately, we kept the relative position and viewing angle between the camera and the flame fixed and minimized background interference in the infrared image.
Fig. 1. RGB-IR camera (with RGB image channel, thermal image channel, and camera tripod labeled)
The automatic annotation process shown in Fig. 2 is based on our previous work on RGB-T image registration [18]. A mutual information method computes the transformation between a pair of infrared and RGB images so as to realize the automatic annotation. Registering the flame samples of the RGB and infrared images yields the transformation matrix M from the visible pixel coordinate system to the infrared pixel coordinate system:
M = \begin{bmatrix} R & T \\ 0^{T} & 1 \end{bmatrix} \quad (1)

\begin{bmatrix} X' \\ Y' \\ 1 \end{bmatrix} = \begin{bmatrix} R & T \\ 0^{T} & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \quad (2)
First, we use the transformation matrix to transform the RGB
images as shown in (2), and then crop the infrared field of view
and generate pixel-aligned pairs of RGB-T samples. An
example is shown in Fig. 3.
Fig. 2. Automatic annotation process
A sample video only needs to be registered once, and the
transformation matrix is applicable to all frames of this sample
video. Because the pixels in the flame area have higher
brightness in the infrared image, simple image processing
methods can be used to obtain the mask of the flame in the
infrared image. Since the infrared image and the RGB image are
pixel-aligned, the flame mask in the infrared image coincides with the flame mask in the RGB image. Finally, the minimum bounding rectangle of the mask is taken as the sample label.
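As a rough illustration, the following sketch shows the core of this annotation step with OpenCV and NumPy: warp the RGB frame with M, crop to the infrared field of view, threshold the infrared image to obtain the flame mask, and take the minimum bounding rectangle as the label. The file names, matrix values, and brightness threshold are hypothetical placeholders, not the authors' released code.

```python
# Minimal sketch of the RGB-T automatic annotation step.
# Assumptions: M (a 2x3 affine form of the [R|T] transform estimated once
# per video by registration) is known; file names and the brightness
# threshold below are hypothetical.
import cv2
import numpy as np

rgb = cv2.imread("frame_rgb.png")                        # visible-light frame
ir = cv2.imread("frame_ir.png", cv2.IMREAD_GRAYSCALE)    # infrared frame

# M maps visible pixel coordinates into the infrared pixel coordinate system.
M = np.array([[0.98, 0.01, 12.0],
              [-0.01, 0.98, 8.0]], dtype=np.float32)     # placeholder values

h, w = ir.shape
rgb_aligned = cv2.warpAffine(rgb, M, (w, h))  # warp + crop to IR field of view

# Flame pixels are much brighter in the infrared image, so a simple
# threshold (scene-dependent value) recovers the flame mask.
_, mask = cv2.threshold(ir, 200, 255, cv2.THRESH_BINARY)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))

# Because rgb_aligned and ir are pixel-aligned, the same mask applies to the
# RGB image; its minimum bounding rectangle is the bounding-box label.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    x, y, bw, bh = cv2.boundingRect(max(contours, key=cv2.contourArea))
    print("bounding-box label:", (x, y, x + bw, y + bh))
```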
Fig. 3. Example of generating pixel-aligned pairs of RGB-T samples: the RGB image is warped by M and cropped to the infrared field of view, then merged with the infrared image
Fig. 4 shows some examples of automatic annotation. The left column shows the infrared images, the middle column the flame masks extracted by the image processing algorithm, and the right column the generated sample labels.
Fig. 4. Examples of automatic annotation
To expand the diversity of the sample set, 10 flame videos were downloaded from the Internet, covering flame samples in different scenes such as indoor and outdoor, long and short range, and day and night; these were manually labeled. Finally, the original dataset was expanded by data augmentation, and the final flame dataset contains 6000 samples.
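The paper does not specify which augmentation operations were used. Purely as an illustration, a sketch of two common, box-safe augmentations for detection data (horizontal flip with box adjustment, and brightness jitter) might look like this:

```python
# Illustrative data augmentation for box-labeled flame samples.
# The operations shown (horizontal flip, brightness jitter) are assumptions;
# the paper does not state which augmentations were applied.
import random
import numpy as np

def augment(image: np.ndarray, boxes: np.ndarray):
    """image: HxWx3 uint8; boxes: Nx4 array of (x1, y1, x2, y2)."""
    h, w = image.shape[:2]
    boxes = boxes.copy().astype(np.float32)

    if random.random() < 0.5:                     # horizontal flip
        image = image[:, ::-1].copy()
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]   # mirror x-coordinates

    if random.random() < 0.5:                     # brightness jitter
        gain = random.uniform(0.7, 1.3)
        image = np.clip(image.astype(np.float32) * gain, 0, 255).astype(np.uint8)

    return image, boxes
```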
IV. PROPOSED METHOD
The proposed method combines the Yolo v3 network with the attention mechanism. The network structure is shown in Fig. 5. The attention module is serially added after the final convolutional layers at each of the three scales. The structure of the attention module follows DANet [6]. The deep layers of the network mainly extract semantic information, where each channel corresponds to a certain type of semantic response. The channel attention module captures the semantic dependencies between channels and encodes channel context into local features. Its input is the original feature maps, and the pairwise dependency between feature maps serves as the weight used to update them. The updated feature map is a weighted sum over all feature maps, which captures the semantic dependency between channels and strengthens the expressive power of semantic features.
The channel attention mechanism is shown in Fig. 6, where the input feature maps A have scale C×H×W. First, A is reshaped to a C×N feature map, where N = H×W is the number of pixels in each channel; A is also reshaped and transposed to obtain an N×C feature map. The product of these two feature maps is passed through a softmax layer to obtain the channel attention matrix X (C×C), whose elements are:
x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)} \quad (3)
Next, the original input feature maps A are reshaped to C×N and multiplied by the transpose of X, and the result is reshaped back to C×H×W, the same scale as the input. Finally, the result is multiplied by a scale factor and added to the original feature maps to give the final output E (C×H×W):
E_j = \beta \sum_{i=1}^{C} \left( x_{ji} \cdot A_i \right) + A_j \quad (4)
The scale factor β is the only trainable parameter in the channel attention module; it is initialized to 0 and is progressively assigned weight during training. X is the channel attention matrix: each row represents the weighting relationship between channels, and each value in a row is the weight of the corresponding channel. The final output feature E has the same scale as the input feature A, and the feature map of each channel is a weighted sum over the channels of the original input A. The channel attention module thus captures the relevance between different layers of semantic features, improves their expressive power, and models the semantic dependence between channels.
Fig. 6. Channel attention module
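To make the computation concrete, here is a minimal PyTorch sketch of the channel attention module as described by equations (3) and (4). The formulation follows DANet; the class and variable names are ours, and the point of insertion into Yolo v3 is indicated only by a comment, so treat this as an illustration rather than the authors' implementation.

```python
# Minimal PyTorch sketch of the DANet-style channel attention module
# described above. Names are ours; this is an illustration under the
# paper's equations, not the authors' released code.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self):
        super().__init__()
        # beta is the module's only trainable parameter, initialized to 0.
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, A):                  # A: (B, C, H, W)
        B, C, H, W = A.shape
        a = A.view(B, C, -1)               # (B, C, N), N = H*W
        # Channel affinity: (B, C, N) x (B, N, C) -> (B, C, C)
        energy = torch.bmm(a, a.transpose(1, 2))
        X = torch.softmax(energy, dim=-1)  # x_ji, normalized over i (eq. 3)
        # Weighted sum over channels, reshaped back to (B, C, H, W)
        out = torch.bmm(X, a).view(B, C, H, W)
        return self.beta * out + A         # E_j = beta * sum_i x_ji A_i + A_j (eq. 4)

# In Yolo v3-SA the module would be inserted serially before each of the
# three multi-scale detection heads, e.g. feat = self.attn(feat) right
# before the final detection convolutions.
```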
Fig. 5. Yolo v3 detection network combined with the attention mechanism: a 416×416×3 input passes through the Darknet-53 backbone (without FC layers); attention modules are serially inserted at the three scales, whose heads produce multi-scale outputs of 13×13×255, 26×26×255, and 52×52×255 via upsampling and concatenation
V. EXPERIMENT AND EVALUATION
The tests were conducted on a computer with an Intel Core i7-9700K CPU @ 3.60 GHz, an NVIDIA RTX 2080 Ti GPU, 16 GB RAM, and Ubuntu 16.04. Detection performance was evaluated on two test sets. The first test set was generated by the hold-out method: the original dataset was randomly split into a training set and a test set, with 600 samples in the test set. The distribution of the first test set is close to that of the training set. Yolo v3 and the proposed method (Yolo v3-SA) were both evaluated on this set using the Average Precision (AP) calculation of Pascal VOC (a sketch of this computation is given below). At an IOU threshold of 0.8, the AP of Yolo v3 is 66.3% and that of Yolo v3-SA is 66.77%, a slight increase. The channel attention mechanism captures the semantic dependencies between channels by weighting and summing the channels, but it also blurs the original semantic features; as a result, the improvement is small on this first test set, whose distribution is close to the training set.
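For reference, the following is a compact sketch of a Pascal VOC style AP computation from a precision-recall curve. It uses all-point interpolation; the paper does not state which VOC interpolation variant (11-point or all-point) was used, so that choice is an assumption.

```python
# Sketch of Pascal VOC style AP from a precision-recall curve
# (all-point interpolation; the exact variant used in the paper is
# not stated, so this is an assumption).
import numpy as np

def voc_ap(recall: np.ndarray, precision: np.ndarray) -> float:
    """recall must be sorted ascending, with precision aligned to it."""
    # Append sentinel values at both ends of the curve.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically decreasing (interpolation step).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the area under the curve where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```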
During RGB-T registration, the RGB image is transformed and cropped, so part of the field of view is lost, which shifts the distribution relative to the original flame samples. To test the detection performance of the model on samples with a different distribution, 250 new test samples were generated by manually labeling the original flame samples. The first test set was expanded with these new samples, giving a new test set of 850 samples, called the generalized test set.
First, detection speed was tested; the results are shown in Table I. The experiment shows that the proposed method reduces detection speed only slightly, and the algorithm remains real-time.
TABLE I. DETECTION SPEED

Algorithm            Yolo v3    Yolo v3-SA
Frames Per Second    22.14      21.90
The two models were then tested on the generalized test set. Table II lists the AP of the two models under different IOU thresholds (a minimal IoU helper is sketched after the table). The precision-recall curves at an IOU of 0.75 are drawn in Fig. 7. On the generalized test set, the accuracy of the proposed method is 4.57% higher than that of Yolo v3. Fig. 9 shows some test examples.
TABLE II. DETECTION AVERAGE PRECISION

Algorithm     0.85 IOU   0.80 IOU   0.75 IOU   0.70 IOU   0.65 IOU
Yolo v3       28.99%     52.88%     69.31%     80.04%     85.79%
Yolo v3-SA    29.57%     55.86%     73.88%     86.51%     91.74%
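The IOU thresholds in Table II decide whether a detection matches a ground-truth box. A minimal helper for this matching criterion (a generic textbook computation, not code from the paper) is:

```python
# IoU between two boxes (x1, y1, x2, y2); used as the matching criterion
# for the thresholds in Table II. Generic computation, not the paper's code.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```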
Fig. 7. Precision-Recall curves at 0.75 IOU: (a) Yolo v3, (b) Yolo v3-SA
Finally, supplementary experiments were performed on the public VOC2007 dataset to verify the effectiveness of the proposed method. The mAP results are shown in Fig. 8: the mean Average Precision of the proposed method is 0.54% higher than that of Yolo v3.
Fig. 8. Supplementary experimental results on the VOC2007 dataset: (a) Yolo v3, (b) Yolo v3-SA
Fig. 9. Examples of Yolo v3-SA fire detection
VI. CONCLUSION
To achieve high-precision real-time fire detection, this paper proposes a Yolo detection network combined with an attention mechanism. A flame dataset is generated using an automatic annotation algorithm and data augmentation. By combining the attention mechanism with the Yolo v3 network, the channel attention module captures the semantic dependencies between channels. Experiments show that the proposed method improves fire detection accuracy without reducing detection speed. Future work will address the scarcity of flame samples from real production scenes to further improve the generalization ability of the model.
REFERENCES
[1] T. Celik, H. Demirel, and H. Ozkaramanli, "Automatic fire detection in
video sequences," in 2006 14th European Signal Processing Conference,
2006: IEEE, pp. 1-5.
[2] Y. Wu, Y. He, P. Shivakumara, Z. Li, H. Guo, and T. Lu, "Channel-wise
attention model-based fire and rating level detection in video," CAAI
Transactions on Intelligence Technology, vol. 4, no. 2, pp. 117-121, 2019.
[3] B. U. Töreyin, Y. Dedeoğlu, U. Güdükbay, and A. E. Çetin, "Computer
vision based method for real-time fire and flame detection," Pattern
recognition letters, vol. 27, no. 1, pp. 49-58, 2006.
[4] Z. Zhong, M. Wang, Y. Shi, and W. Gao, "A convolutional neural
network-based flame detection method in video sequence," Signal, Image
and Video Processing, vol. 12, no. 8, pp. 1619-1627, 2018.
[5] K. Muhammad, J. Ahmad, I. Mehmood, S. Rho, and S. W. Baik,
"Convolutional neural networks based fire detection in surveillance
videos," IEEE Access, vol. 6, pp. 18174-18183, 2018.
[6] J. Fu et al., "Dual attention network for scene segmentation," in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2019, pp. 3146-3154.
[7] A. J. Dunnings and T. P. Breckon, "Experimentally defined convolutional
neural network architecture variants for non-temporal real-time fire
detection," in 2018 25th IEEE International Conference on Image
Processing (ICIP), 2018: IEEE, pp. 1558-1562.
[8] X. Wang, A. Shrivastava, and A. Gupta, "A-fast-rcnn: Hard positive
generation via adversary for object detection," in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2017, pp. 2606-
2615.
[9] T. Celik and H. Demirel, "Fire detection in video sequences using a
generic color model," Fire safety journal, vol. 44, no. 2, pp. 147-158, 2009.
[10] B. U. Toreyin, Y. Dedeoglu, and A. E. Cetin, "Flame detection in video
using hidden Markov models," in IEEE International Conference on
Image Processing 2005, 2005, vol. 2: IEEE, pp. II-1230.
[11] W. Phillips III, M. Shah, and N. da Vitoria Lobo, "Flame recognition in
video," Pattern recognition letters, vol. 23, no. 1-3, pp. 319-327, 2002.
[12] Y. Ren, C. Zhu, and S. Xiao, "Object detection based on fast/faster RCNN
employing fully convolutional architectures," Mathematical Problems in
Engineering, vol. 2018, 2018.
[13] C. Hu, P. Tang, W. Jin, Z. He, and W. Li, "Real-time fire detection based
on deep convolutional long-recurrent networks and optical flow method,"
in 2018 37th Chinese Control Conference (CCC), 2018: IEEE, pp. 9061-
9066.
[14] Y. Zhao, J. Ma, X. Li, and J. Zhang, "Saliency detection and deep
learning-based wildfire identification in UAV imagery," Sensors, vol. 18,
no. 3, p. 712, 2018.
[15] B. Kim and J. Lee, "A video-based fire detection using deep learning
models," Applied Sciences, vol. 9, no. 14, p. 2862, 2019.
[16] C. B. Liu and N. Ahuja, "Vision based fire detection," in Proceedings of
the 17th International Conference on Pattern Recognition, 2004. ICPR
2004., 2004, vol. 4: IEEE, pp. 134-137.
[17] Q. X. Zhang, G. H. Lin, Y. M. Zhang, G. Xu, and J. I. Wang, "Wildland
forest fire smoke detection based on faster R-CNN using synthetic smoke
images," Procedia engineering, vol. 211, pp. 441-446, 2018.
[18] J. Ma, K. Qian, X. Zhang, and X. Ma, "Weakly Supervised Instance
Segmentation of Electrical Equipment Based on RGB-T Automatic
Annotation," IEEE Transactions on Instrumentation and Measurement,
2020, DOI: 10.1109/TIM.2020.3001796.