Visual Scene Understanding for Autonomous
Driving using Semantic Segmentation
Markus Hofmarcher*, Thomas Unterthiner*, José Arjona-Medina*,
Günter Klambauer, Sepp Hochreiter, Bernhard Nessler
LIT AI Lab & Institute for Machine Learning,
Johannes Kepler University Linz,
A-4040 Linz, Austria
*These authors contributed equally to this work.
Abstract.
Deep neural networks are an increasingly important technique
for autonomous driving, especially as a visual perception component.
Deployment in a real environment necessitates the explainability and
inspectability of the algorithms controlling the vehicle. Such insightful
explanations are relevant not only for legal issues and insurance matters
but also for engineers and developers in order to achieve provable func-
tional quality guarantees. This applies to all scenarios where the results
of deep networks control potentially life-threatening machines.
We suggest the use of a tiered approach, whose main component is
a semantic segmentation model, over an end-to-end approach for an
autonomous driving system. In order for a system to provide meaningful
explanations for its decisions it is necessary to give an explanation about
the semantics that it attributes to the complex sensory inputs that it
perceives. In the context of high-dimensional visual input this attribution
is done as a pixel-wise classification process that assigns an object class
to every pixel in the image. This process is called semantic segmentation.
We propose an architecture that delivers real-time viable segmentation
performance and which conforms to the limitations in computational
power that is available in production vehicles. The output of such a
semantic segmentation model can be used as an input for an interpretable
autonomous driving system.
Keywords: Deep Learning, Convolutional Neural Networks, Semantic Segmentation, Classification, Visual Scene Understanding, Interpretability.
1 Introduction
In the context of very complex learning systems there are two main approaches.
The traditional approach is based on handcrafted feature functions that pre-
process input in a manner deemed useful for humans. A human expert uses his
domain knowledge in the design of these features such that the task-relevant
information is enhanced and irrelevant signals are suppressed as much as
possible. This pre-processing greatly reduced the input dimensionality, which was
necessary in order to apply traditional machine learning techniques.
The modern principle is the one favored by the nature of deep learning, the
so-called end-to-end approach. This approach gives the network the freedom
to learn a function that is based on raw inputs and directly produces the final
outputs. This end-to-end approach unleashed deep learning to revolutionize AI
in the last couple of years [8,27,36,35,16].
The advantage of the end-to-end principle is clear: the optimization in the
learning process has the freedom to access the full input space and build arbitrary
functions as needed. However, there are several disadvantages. First, a large
number of training examples are required to optimize the more complex function.
Furthermore, the domain knowledge usually encoded in engineered features is
missing. Finally, the resulting function is even more difficult to interpret than in
the traditional pipeline.
In the problem setting of autonomous driving the raw inputs are pixel-wise
sensor data from RGB cameras, point-wise depth information from LIDAR
sensors, proximity measurements from ultrasonic or radar sensors and odometry
information from an IMU. The desired outputs are just the steering angle, torque,
and brake commands. In an end-to-end approach the functional model should utilize the
input in an optimal fashion to output optimal driving instructions with respect
to a given loss function [4,6,7].
However, the aforementioned drawbacks of an end-to-end approach quickly
turn out to be insurmountable for this application, as the number of training ex-
amples required to train such an end-to-end driving model for arbitrary scenarios
is huge. Furthermore, it is extremely hard to interpret such a model to answer
questions such as why a model decided for a certain action. This is especially
problematic in view of legal issues if the decisions of the model lead to erroneous
behavior of the vehicle.
Recently, research into the area of attribution methods, whose goal is
to determine which part of an input was responsible for a model's output, has
gained traction [38,42,2,39]. However, in the scenario of autonomous driving with
its high-dimensional image inputs such attribution methods would explain to
which extent each single pixel has contributed to the driving decision. This will
provide some information about the functional correspondence, but this does not
suffice to answer the typical human "why"-question, which asks for the inner
model of the system and its world understanding.
We propose a tiered approach, whereby the inputs to individual sensory
components are directly - without hand-crafted features - fed into a deep neural
network whose training target is defined to be a human interpretable abstract
representation of the input. In particular, for a video input signal, we want to
abstract from the color of the clothing of pedestrians, from the specific texture
of a vehicle and from the specific appearances of streets and buildings. We are
only interested in their shapes and positions in space. Semantic segmentation
achieves this by assigning an object class to every single pixel of the input image.
This is similar to interpreting machine learning models by translating from one
modality to another [33,11,21]. Due to the geometry of our real world this leads
to smaller or larger continuous patches of single class identities representing the
shape of individual objects, like pedestrians, cars, buildings, lanes and pavement
(see Figure 1).
Fig. 1. Example output (bottom) of the network with ground truth (center) on images from the Cityscapes validation set.
It is obvious that the output of a semantic segmentation network is of much
lower dimensionality when compared to the raw input image. Moreover, it already
represents important first decisions of the overall system. Decisions about the
presence or absence of humans or vehicles are made at this stage: whether a
human is recognized, whether other vehicles are recognized as such, and the
course of the road are all determined by this intermediate processing of the
input and reflected in the output segments. To a certain extent we are back to a traditional pipeline
of engineered features, but this tiered processing is substantially different. We
do not hand-craft features; instead we employ human expert knowledge to
define an internal abstraction layer that is known to be suitable for the overall
problem. Thus, the raw feature analysis is fully trained and optimized as in
end-to-end approaches. The crucial advantage lies in the explanatory power of
this layered processing.
In this article, parts of which were already published in [40], we propose
novel architectures that are able to infer the semantic segmentation in real-
time on embedded hardware, while achieving better accuracy than comparable
competitors. In Section 2 we shortly review the current state-of-the-art and
the techniques employed in semantic segmentation with a focus on autonomous
driving applications. In Section 3 we present our proposed architecture for efficient
semantic segmentation in autonomous vehicles and show its experimental settings
and their evaluation results in Sections 4 and 5.
2 Semantic Segmentation
Convolutional Neural Networks (CNNs) have emerged as the best performing
method for image related tasks [27,36] such as image classification [35,16] or traffic
sign recognition [8] and also semantic segmentation [29,41,44,5]. However, per-
forming semantic segmentation might incur a significant computational overhead
as compared to end-to-end models. This is especially the case when improved
segmentation performance is accomplished by increasing the network size [15].
When employing deep learning for semantic segmentation, the recognition
of major objects in the image, such as persons or vehicles, is realized at higher
levels of a neural network. By design, these layers work on a coarser scale
and are translation invariant (e.g. imposed via pooling operations), such that
minor variations on a pixel level do not influence the recognition. At the same time,
semantic segmentation requires pixel-exact classification of small features, which
are typically only found in lower layers of a network. This trade-off in resolution
is typically solved by using skip-connections from lower layers to the output
which increase the resolution at layers close to the output. Skip-connections were
introduced by the Fully Convolutional Network (FCN) [29], which still serves as a
blueprint for most modern approaches. These approaches only differ in how they
encode the object-level information and how they decode these classifications into
pixel-exact labels. The original FCN employed the VGG network [37] that was
pre-trained on the LSVRC image classification task. FCN then added information
coming from lower layers to higher layers, upscaled through a transposed
convolution. The FCN architecture was improved by alternative ways to connect
to the lower layers, e.g. by accessing the lower-level pooling layers [3], by using
enhanced methods to integrate lower-level information [32], or by forgoing pooling
operations for dilated convolutions [28,41]. As a post-processing step, many recent
systems apply CRF-based refinement to the output produced by the neural
network [28,45]. The CRF increases the accuracy of the segmentation at the cost of
additional computation.
Reducing the computational burden of semantic segmentation is essential
to make it feasible for embedded systems and autonomous driving. Neural
networks are trained on servers or workstations with powerful GPUs, and these
GPU systems are subsequently used for inference on new data. However, these
commodities do not exist in self-driving cars. A self-driving car needs to react
to new events instantly to guarantee the safety of passengers and other traffic
participants, while it is often acceptable if the borders of objects are not recognized
perfectly down to pixel resolution. Segmenting an image in real time is a strong
requirement in self-driving applications. Thus, it is critical that any convolutional
neural network deployed in these systems fulfills strict requirements in execution
speed.
There has been a vast amount of research in reducing the computation required
for deep learning. SqueezeNet [22] showed that it was possible to reproduce the
image classification accuracy of powerful CNNs, such as AlexNet [25], using 50x
fewer parameters by employing a more efficient architecture. ENet [31] followed
the same path and showed that semantic segmentation is feasible on embedded
devices in real time. Another line of research increases the efficiency of existing
networks by deriving smaller networks from larger counterparts [1,17], by pruning
or quantizing weights [14,13], or by tweaking the network for execution on specific
hardware designs [12]. These methods can be applied on top of new architectures
to speed up execution.
3 Methods
3.1 Overview
In order to achieve semantic segmentation in real time, we have to trade ex-
ecution speed against achievable segmentation accuracy. Like most successful
segmentation networks, our network is structured as an encoder-decoder pair.
An encoder CNN detects higher-level objects such as cars or pedestrians in the
input image. A decoder takes this information and enriches it with information
from the lower layers of the encoder, supplying a prediction for each pixel in the
original input. Figure 2 depicts the architecture.
3.2 Encoder
The encoder is a modified SqueezeNet 1.1 architecture [22], which was designed
as a low-latency network for image recognition while retaining AlexNet-like [25]
accuracy. The main computational modules of SqueezeNet are the so-called “fire”
modules consisting of three convolutional operations, depicted in Figure 3a. The
encoder consists of eight “fire” modules, interspersed with a total of three max-
pooling layers for downsampling. All rectified linear units (ReLUs) of the original
architecture are substituted with exponential linear units (ELUs) [9], which make
more efficient use of parameters by also conveying information in the negative
part of the activation.
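As an illustration, the following is a minimal PyTorch-style sketch of such a fire module with ELU activations; it is not the original Caffe implementation, and the channel sizes in the example call are placeholders rather than the exact encoder configuration.

    import torch
    import torch.nn as nn

    class Fire(nn.Module):
        """SqueezeNet-style fire module with ELU activations (illustrative sketch)."""
        def __init__(self, in_ch, squeeze_ch, expand_ch):
            super().__init__()
            # 1x1 "squeeze" convolution reduces the number of channels
            self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
            # two parallel "expand" convolutions, 1x1 and 3x3
            self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
            self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
            self.act = nn.ELU(alpha=1.0)

        def forward(self, x):
            x = self.act(self.squeeze(x))
            # concatenate both expand branches along the channel dimension
            return torch.cat([self.act(self.expand1x1(x)),
                              self.act(self.expand3x3(x))], dim=1)

    # Example: Fire(128, 16, 64) maps 128 input channels to 2*64 = 128 output channels.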
3.3 Parallel Dilated Convolutions
The decoder is based on a parallel dilated convolution layer [28] as depicted in
Figure 3b. This dilated layer combines the feature maps at the encoder output
at different receptive field sizes by using four dilated convolutions of kernel size 3
with different dilation factors. This is equivalent to sampling the layer input with
different rates. The contributions from the four dilated convolutions are then
fused by an element-wise sum. As a result, the receptive field size is increased
and multiscale spatial dependencies are taken into account without having to
resort to stacking multiple layers which would be computationally expensive.
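To make the fusion concrete, a sketch of such a parallel dilated convolution block is given below. The dilation factors (1, 2, 4, 8) are chosen for illustration only; the text above does not fix their exact values.

    import torch.nn as nn

    class ParallelDilatedConv(nn.Module):
        """Four parallel dilated 3x3 convolutions fused by an element-wise sum (sketch)."""
        def __init__(self, in_ch, out_ch, dilations=(1, 2, 4, 8)):
            super().__init__()
            # padding equal to the dilation keeps the spatial resolution unchanged
            self.branches = nn.ModuleList([
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
                for d in dilations
            ])

        def forward(self, x):
            # sum the branch outputs: multi-scale context without stacking layers
            out = self.branches[0](x)
            for branch in self.branches[1:]:
                out = out + branch(x)
            return out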
3.4 Decoder and Bypasses
Pooling layers in the encoder are used to ensure a degree of translational invariance
when detecting the parts of an object. However, they in turn reduce the spatial
Fig. 2. Architecture of the proposed network for semantic segmentation: convolution and pooling layers followed by fire modules in the encoder, an atrous (parallel dilated) score layer, and transposed convolutions with refinement modules in the decoder.
resolution of the output. Transposed convolutions in the decoder are used to
upsample the information back to its original size. To improve the upsampling, we
do not just use the data that comes directly from the layer immediately before the
transposed convolution layer, but combine it with low-level knowledge from lower
layers of the encoder. These layers are responsible for detecting finer structures at
a higher resolution, which helps with classifying the contours of objects more ex-
actly. Each refinement module combines two streams of information, one coming
from the previous upsampling layer, the other one from the encoder. The two con-
volutional layers in the refinement module learn how to weigh these two streams
before passing the information on to the next upsampling layer. We use refinement
modules similar to the ones used in the SharpMask approach [32]. We again use
ELUs instead of ReLU units (Figure 3c shows the implementation of the module).
Right before every pooling layer in the encoder, a bypass branches off to the
refinement module. Once there, a convolution layer weights knowledge from lower
layers. Then, it is concatenated with semantic object information from the previ-
ous upsampling layer. A second convolutional layer combines the concatenated
feature maps from both branches into the class map.
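A possible realization of this refinement module, in the same illustrative PyTorch style as above, is sketched below; the kernel sizes and the exact channel bookkeeping are assumptions, not a specification of our implementation.

    import torch
    import torch.nn as nn

    class Refinement(nn.Module):
        """Bypass refinement module in the spirit of SharpMask, with ELUs (sketch)."""
        def __init__(self, encoder_ch, decoder_ch, out_ch):
            super().__init__()
            # first convolution weights the low-level knowledge from the encoder bypass
            self.conv_encoder = nn.Conv2d(encoder_ch, decoder_ch, kernel_size=3, padding=1)
            # second convolution merges the concatenated streams into the class map
            self.conv_merge = nn.Conv2d(2 * decoder_ch, out_ch, kernel_size=3, padding=1)
            self.act = nn.ELU()

        def forward(self, encoder_feat, decoder_feat):
            low_level = self.act(self.conv_encoder(encoder_feat))
            # concatenate low-level detail with semantic information from the decoder
            merged = torch.cat([low_level, decoder_feat], dim=1)
            return self.act(self.conv_merge(merged))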
Fig. 3. a) SqueezeNet fire module; b) Parallel dilated convolution layer; c) Bypass refinement module.
3.5 Exponential Linear Units (ELU)
Our network makes extensive use of the exponential linear unit [9] because we
empirically found it to work better than ReLU units for our purpose. ELUs were
designed to avoid the bias shift in neural network training. The ELU activation
function is defined as
f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(\exp(x) - 1) & \text{if } x \le 0 \end{cases}, \qquad
f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ f(x) + \alpha & \text{if } x \le 0 \end{cases} \qquad (1)
The parameter α was set to its default value of 1 for all of our experiments.
Similar to the often-used ReLU activation, ELUs have a linear positive part
which helps to avoid the vanishing gradient [18,19,20] and thus allows training very
deep networks. However, in contrast to the ReLU, the saturating negative part
converges to −α. This allows the ELU to have a mean activation of 0, thereby
avoiding any bias shift effect: In ReLU networks, units will typically have a non-
zero mean activation and thus act as an additional bias for units in the
next layer. By enabling units to have zero mean, this bias shift effect is reduced,
which makes it easier for units to focus solely on actual information processing.
This could otherwise only be achieved by using batch normalization [23] or the
SELU activation function [24].
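A short numerical check illustrates both properties; the numbers below are for a standard-normal toy input, not activations of the actual network.

    import numpy as np

    def elu(x, alpha=1.0):
        # Eq. (1): identity for x > 0, saturating towards -alpha for x <= 0
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

    def relu(x):
        return np.maximum(x, 0.0)

    x = np.random.randn(100_000)          # roughly zero-mean pre-activations
    print(elu(np.array([-10.0])))         # approx. -1.0: negative part saturates at -alpha
    print(relu(x).mean(), elu(x).mean())  # the ELU mean is noticeably closer to zero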
4 Experiments
4.1 Cityscapes dataset
We trained and evaluated the network on the Cityscapes dataset [10]. Cityscapes
is a high-quality dataset for semantic street-scene understanding. The dataset is
split into 2975, 500, and 1525 images of resolution 2048x1024 pixels for training,
validation, and testing, respectively. It contains 35 classes annotated at pixel level,
of which 19 are selected for evaluation in the challenge. Each class belongs to one
of the categories flat, nature, object, sky, construction, human, and vehicle. As
performance measure, the commonly used intersection over union (IoU) metric is
used, which is evaluated for individual classes and categories. Notice that our results were
achieved without CRFs as post-processing because that would increase inference
time dramatically.
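For reference, a simplified version of this metric can be computed as sketched below; the official Cityscapes evaluation accumulates intersections and unions over the entire test set rather than per image, so this is illustrative only.

    import numpy as np

    def mean_iou(pred, target, num_classes=19, ignore_index=255):
        """Per-class intersection over union, averaged over the classes present (sketch)."""
        pred, target = pred.flatten(), target.flatten()
        valid = target != ignore_index          # Cityscapes marks void pixels with 255
        pred, target = pred[valid], target[valid]
        ious = []
        for c in range(num_classes):
            intersection = np.sum((pred == c) & (target == c))
            union = np.sum((pred == c) | (target == c))
            if union > 0:
                ious.append(intersection / union)
        return float(np.mean(ious))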
4.2 Training
The SqueezeNet encoder was initialized using publicly available ImageNet pre-
trained weights and fine-tuned for semantic segmentation on Cityscapes. The
rest of the weights were initialized using the MSRA scheme [16] and the network
was trained with full-resolution images. The loss function was the sum of cross-
entropy terms over all classes, each weighted equally. It
was optimized with Stochastic Gradient Descent, using a fixed learning rate of
10^-8, momentum of 0.9, and a weight decay factor of 0.0002. The architecture
was implemented using the Caffe framework. Total training time was around 22
hours using 2 Titan X Maxwell GPUs, each with a batch size of 3 full-resolution
images. In our experiments we trained for the pixel-wise segmentation task using
only the fine annotations without any additional training data. Augmentations
were not applied in order to establish a baseline for performance.
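For illustration, a PyTorch-style approximation of this training setup is sketched below on dummy data; the original implementation used Caffe, and details such as the unnormalized (summed) per-pixel loss are assumptions that make the very small learning rate plausible.

    import torch
    import torch.nn as nn

    # toy stand-in for the encoder-decoder network of Section 3
    model = nn.Conv2d(3, 19, kernel_size=1)

    # summed per-pixel cross entropy over the 19 Cityscapes classes
    criterion = nn.CrossEntropyLoss(ignore_index=255, reduction='sum')
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-8,
                                momentum=0.9, weight_decay=2e-4)

    # dummy batch; actual training used batches of 3 full-resolution (2048x1024) images
    images = torch.randn(3, 3, 256, 512)
    labels = torch.randint(0, 19, (3, 256, 512))

    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()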
5 Results
Our network compares favourably against the ENet segmentation network [31], a
widely-used network architecture for efficient semantic segmentation on embedded
devices. Comparable, newer architectures exist [30,34,43], but we did not compare
against them.
5.1 Segmentation performance
We evaluated the network on the test set using the official Cityscapes evaluation
server. We achieve 59.8 per-class mean IoU and 84.3 per-category mean IoU.
Hence the network architecture is able to outperform both ENet and
SegNet [3], as can be seen in Table 1. We improved on per-class IoU for all
but 5 classes (wall, truck, bus, train, motorcycle) compared to ENet. Detailed
per-class classification results are presented in Table 2. Visual inspection of the
predictions of the network shows satisfying results on typical urban street scene
images (Figure 1). Object contours are segmented very sharply. We believe that
this is due to the enhanced ability to integrate pixel-level information from early
layers in the encoder into upsampling layers in the decoder by using refinement
modules in the bypasses.
Table 1. Mean IoU over the 19 individual classes and the 7 categories on the Cityscapes
test set. ENet and SegNet results are taken from [31].
Class IoU Category IoU
Ours 59.8 84.3
ENet 58.3 80.4
SegNet 56.1 79.8
Table 2. Per-class IoU on the Cityscapes test set. We improved on the ENet results for all
but 5 classes (wall, truck, bus, train, motorcycle). ENet results are taken from [31].
road sidewalk building wall fence pole trafficlight trafficsign vegetation terrain
Ours 96.9 75.4 87.9 31.6 35.7 50.9 52.0 61.7 90.9 65.8
ENet 96.3 74.2 85.0 32.2 33.2 43.5 34.1 44.0 88.6 61.4
sky person rider car truck bus train motorcycle bicycle
Ours 93.0 73.8 42.6 91.5 18.8 41.2 33.3 34.0 59.9
ENet 90.6 65.5 38.4 90.6 36.9 50.5 48.1 38.8 55.4
5.2 Inference Run-time Performance
Similar to ENet, we are able to surpass the 10 fps design goal on the Nvidia TX1
board at a resolution of 640x360, which is a sensible lower limit for enabling self-
driving car applications. See Table 3 for a comparison of run-times between the
different architectures. The run-times were achieved using CUDA 7.5 and cuDNN
4.0; however, we expect that timings will improve significantly by switching to
newer software versions that support Winograd convolutions [26].
Table 3. Comparison of inference times on the Nvidia Tegra X1. Timings for ENet were
taken from the original publication [31].
480x320 640x360 1280x720
ms fps ms fps ms fps
Ours 60 16.7 86 11.6 389 2.6
ENet 47 21.1 69 14.6 262 3.8
6 Discussion
As a complementary approach to end-to-end deep learning with its lack of
explainability we have proposed a tiered approach, whereby only individual
components are trained in an end-to-end-like fashion. The intermediate nodes,
which are outputs from one tier and inputs to the next, are defined in such a way
that their meaning is interpretable by a human expert with knowledge in the
domain of the overall system. Semantic segmentation is a powerful instrument
that provides a lower-dimensional abstraction of video input signals.
Currently, deep neural networks are the state-of-the-art systems in semantic
image segmentation. However, these deep networks incur high computational
cost, which makes them unsuitable for deployment in self-driving cars. In this
work we have shown how neural networks can be made small enough to run on
embedded devices used in autonomous vehicles while still retaining a level of
segmentation performance sufficient for this application. We believe this to be a
promising approach towards a modular and interpretable system for autonomous
driving.
Subsequent components of the overall system can then take these semantic
features as input for their decisions. One such component could be a model for
situational awareness, classifying the kind of situation the vehicle is currently in.
Another component could be trained to determine an optimal driving path in
view of the semantically segmented surroundings and the situation classification.
Finally, based on these intermediate human understandable classifications and
regressions a controller can be trained on a last tier in order to translate the
now interpretable inner model of the system about the real world into specific
commands for steering, throttle response or braking. In case of erroneous behavior
of the vehicle an expert can investigate such a tiered system by looking at the
intermediate nodes of each tier and obtain answers to the "why"-question.
This gain in interpretability may come at a price. A tiered approach allows more
fine-grained control and human-interpretable checks of the intermediate steps of
driving decisions. However, current research has repeatedly shown that end-to-end
approaches, while harder to interpret, often allow deep learning systems to achieve
higher performance, sometimes even outperforming humans. This is likely because the
system is able to process information in a way that is not compatible with typical
human perception. For example, while reducing the lower-level camera signals
to semantic segmentation maps and discarding more fine-grained information
might make sense for humans, it is possible that machine learning approaches
can still make use of such information in higher-level decision-making processes.
Despite these drawbacks, we would argue that, given the accuracy achieved with
modern systems, the gain in interpretability is worth the potential loss in overall
performance.
Acknowledgments
This work was supported by the Audi.JKU Deep Learning
Center, Audi Electronics Venture GmbH, Zalando SE with Research Agreement
01/2016, the Austrian Science Fund with Project P28660-N31 and NVIDIA
Corporation.
References
1. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems. NIPS (2014)
2. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10(7), 1–46 (2015)
3. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12), 2481–2495 (2017)
4. Bojarski, M., Testa, D.D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., Zieba, K.: End to end learning for self-driving cars. CoRR abs/1604.07316 (2016)
5. Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587 (2017)
6. Chen, Z., Huang, X.: End-to-end learning for lane keeping of self-driving cars. In: IEEE Intelligent Vehicles Symposium. pp. 1856–1860. IEEE (2017)
7. Chi, L., Mu, Y.: Deep steering: Learning end-to-end driving model from spatial and temporal visual cues. CoRR abs/1708.03798 (2017)
8. Ciresan, D., Meier, U., Masci, J., Schmidhuber, J.: Multi-column deep neural network for traffic sign classification. Neural Networks 32, 333–338 (2012)
9. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). In: International Conference on Learning Representations. ICLR (2016)
10. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2016)
11. Donahue, J., Hendricks, L.A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 677–691 (2017)
12. Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Dally, W.J.: EIE: Efficient inference engine on compressed deep neural network. International Conference on Computer Architecture (2016)
13. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In: International Conference on Learning Representations. ICLR (2016)
14. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems. NIPS (2015)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2015)
16. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: IEEE International Conference on Computer Vision. ICCV (2015)
17. Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. CoRR abs/1503.02531 (2015)
18. Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, Technische Universität München, Institut für Informatik (1991)
19. Hochreiter, S.: The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 6(2), 107–116 (1998)
20. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: Kremer, Kolen (eds.) A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press (2001)
21. Hong, S., Yang, D., Choi, J., Lee, H.: Interpretable text-to-image synthesis with hierarchical semantic layout generation. Springer LNCS (2019)
22. Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360 (2016)
23. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning. ICML (2015)
24. Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural networks. In: Advances in Neural Information Processing Systems. NIPS, Curran Associates, Inc. (2017)
25. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. NIPS (2012)
26. Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2016)
27. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
28. Liang-Chieh, C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: International Conference on Learning Representations. ICLR (2015)
29. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2015)
30. Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Proceedings of the European Conference on Computer Vision. ECCV (2018)
31. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: ENet: A deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147 (2016)
32. Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P.: Learning to refine object segments. In: Proceedings of the European Conference on Computer Vision. ECCV (2016)
33. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text-to-image synthesis. In: Proceedings of the 33rd International Conference on Machine Learning. ICML (2016)
34. Romera, E., Álvarez, J.M., Bergasa, L.M., Arroyo, R.: ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 19, 263–272 (2018)
35. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015)
36. Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks 61, 85–117 (2015)
37. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations. ICLR (2015)
38. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR abs/1312.6034 (2013)
39. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning. ICML (2017)
40. Treml, M., Arjona-Medina, J., Unterthiner, T., Durgesh, R., Friedmann, F., Schuberth, P., Mayr, A., Heusel, M., Hofmarcher, M., Widrich, M., Bodenhofer, U., Nessler, B., Hochreiter, S.: Speeding up semantic segmentation for autonomous driving. In: Workshop on Machine Learning for Intelligent Transport Systems, Neural Information Processing Systems (NIPS) (2016)
41. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: International Conference on Learning Representations. ICLR (2016)
42. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Proceedings of the European Conference on Computer Vision. ECCV (2014)
43. Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: ICNet for real-time semantic segmentation on high-resolution images. In: Proceedings of the European Conference on Computer Vision. ECCV (2018)
44. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2017)
45. Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.: Conditional random fields as recurrent neural networks. In: IEEE International Conference on Computer Vision. ICCV (2015)