Visual Scene Understanding for Autonomous
Driving using Semantic Segmentation
Markus Hofmarcher*, Thomas Unterthiner*, José Arjona-Medina*,
Günter Klambauer, Sepp Hochreiter, Bernhard Nessler
LIT AI Lab & Institute for Machine Learning,
Johannes Kepler University Linz,
A-4040 Linz, Austria
*These authors contributed equally to this work.
Abstract.
Deep neural networks are an increasingly important technique
for autonomous driving, especially as a visual perception component.
Deployment in a real environment necessitates the explainability and
inspectability of the algorithms controlling the vehicle. Such insightful
explanations are relevant not only for legal issues and insurance matters
but also for engineers and developers in order to achieve provable func-
tional quality guarantees. This applies to all scenarios where the results
of deep networks control potentially life-threatening machines.
We suggest the use of a tiered approach, whose main component is
a semantic segmentation model, over an end-to-end approach for an
autonomous driving system. In order for a system to provide meaningful
explanations for its decisions it is necessary to give an explanation about
the semantics that it attributes to the complex sensory inputs that it
perceives. In the context of high-dimensional visual input this attribution
is done as a pixel-wise classification process that assigns an object class
to every pixel in the image. This process is called semantic segmentation.
We propose an architecture that delivers real-time viable segmentation
performance and which conforms to the limitations in computational
power that is available in production vehicles. The output of such a
semantic segmentation model can be used as an input for an interpretable
autonomous driving system.
Keywords: Deep Learning, Convolutional Neural Networks, Semantic Segmentation, Classification, Visual Scene Understanding, Interpretability.
1 Introduction
In the context of very complex learning systems there are two main approaches.
The traditional approach is based on handcrafted feature functions that pre-
process input in a manner deemed useful for humans. A human expert uses his
domain knowledge in the design of these features such that the task-relevant
information is enhanced and irrelevant signals are suppressed as much as
possible. This pre-processing greatly reduced the input dimensionality, which was
necessary in order to apply traditional machine learning techniques.
The modern principle is the one favored by the nature of deep learning, the
so-called end-to-end approach. This approach gives the network the freedom
to learn a function that is based on raw inputs and directly produces the final
outputs. This end-to-end approach unleashed deep learning to revolutionize AI
in the last couple of years [8,27,36,35,16].
The advantage of the end-to-end principle is clear: the optimization in the
learning process has the freedom to access the full input space and build arbitrary
functions as needed. However, there are several disadvantages. First, a large
number of training examples are required to optimize the more complex function.
Furthermore, the domain knowledge usually encoded in engineered features is
missing. Finally, the resulting function is even more difficult to interpret than in
the traditional pipeline.
In the problem setting of autonomous driving the raw inputs are pixel-wise
sensor data from RGB cameras, point-wise depth information from LIDAR
sensors, proximity measurements from ultrasonic or radar sensors and odometry
information from an IMU. The desired outputs are just the steering angle, torque,
and brake commands. In an end-to-end approach the functional model should utilize the
input in an optimal fashion to output optimal driving instructions with respect
to a given loss function [4,6,7].
However, the aforementioned drawbacks of an end-to-end approach quickly
turn out to be insurmountable for this application, as the number of training ex-
amples required to train such an end-to-end driving model for arbitrary scenarios
is huge. Furthermore, it is extremely hard to interpret such a model to answer
questions such as why a model decided for a certain action. This is especially
problematic in view of legal issues if the decisions of the model lead to erroneous
behavior of the vehicle.
Recently, research into the area of attribution methods, whose goal is
to determine which part of an input was responsible for a model's output, has
gained traction [38,42,2,39]. However, in the scenario of autonomous driving with
its high-dimensional image inputs such attribution methods would explain to
which extent each single pixel has contributed to the driving decision. This will
provide some information about the functional correspondence, but this does not
suffice to answer the typical human "why"-question, which asks for the inner
model of the system and its world understanding.
We propose a tiered approach, whereby the inputs to individual sensory
components are directly - without hand-crafted features - fed into a deep neural
network whose training target is defined to be a human interpretable abstract
representation of the input. In particular, for a video input signal, we want to
abstract from the color of the clothing of pedestrians, from the specific texture
of a vehicle and from the specific appearances of streets and buildings. We are
only interested in their shapes and positions in space. Semantic segmentation
achieves this by assigning an object class to every single pixel of the input image.
This is similar to interpreting machine learning models by translating from one
modality to another [33,11,21]. Due to the geometry of our real world this leads
to smaller or larger continuous patches of single class identities representing the
shape of individual objects, like pedestrians, cars, buildings, lanes and pavement
(see Figure 1).
Fig. 1. Example output (bottom) of the network with ground truth (center) on images from the Cityscapes validation set.
It is obvious that the output of a semantic segmentation network is of much
lower dimensionality when compared to the raw input image. Moreover, it already
represents important first decisions of the overall system. Decisions about the
presence or absence of humans or vehicles are made at this stage: whether a
human is recognized, whether other vehicles are recognized as such, and the
course of the road are all determined by this intermediate processing of the
input and reflected in the output segments. To a certain extent we are back to a traditional pipeline
of engineered features, but this tiered processing is substantially different. We
do not hand-craft features; instead we employ human expert knowledge to
define an internal abstraction layer that is known to be suitable for the overall
problem. Thus, the raw feature analysis is fully trained and optimized as in
end-to-end approaches. The crucial advantage lies in the explanatory power of
this layered processing.
In this article, parts of which were already published in [40], we propose
novel architectures that are able to infer the semantic segmentation in real-
time on embedded hardware, while achieving better accuracy than comparable
competitors. In Section 2 we shortly review the current state-of-the-art and
the techniques employed in semantic segmentation with a focus on autonomous
driving applications. In Section 3 we present our proposed architecture for efficient
semantic segmentation in autonomous vehicles and show its experimental settings
and their evaluation results in Sections 4 and 5.
2 Semantic Segmentation
Convolutional Neural Networks (CNNs) have emerged as the best performing
method for image related tasks [27,36] such as image classification [35,16] or traffic
sign recognition [8] and also semantic segmentation [29,41,44,5]. However, per-
forming semantic segmentation might incur a significant computational overhead
as compared to end-to-end models. This is especially the case when improved
segmentation performance is accomplished by increasing the network size [15].
When employing deep learning for semantic segmentation, the recognition
of major objects in the image, such as persons or vehicles, is realized at higher
levels of a neural network. By design, these layers work on a coarser scale
and are translation invariant (e.g. imposed via pooling operations), such that
minor variations on a pixel level do not influence the recognition. At the same time,
semantic segmentation requires pixel-exact classification of small features, which
are typically only found in lower layers of a network. This trade-off in resolution
is typically solved by using skip-connections from lower layers to the output
which increase the resolution at layers close to the output. Skip-connections were
introduced by the Fully Convolutional Network (FCN) [29], which still serves as a
blueprint for most modern approaches. These approaches only differ in how they
encode the object-level information and how they decode these classifications into
pixel-exact labels. The original FCN employed the VGG network [37] that was
pre-trained on the LSVRC image classification task. FCN then added information
coming from lower layers to higher layers, upscaled through a transposed
convolution. The FCN architecture was improved by alternative ways to connect
to the lower layers, e.g. by accessing the lower-level pooling layers [3], by using
enhanced methods to integrate lower-level information [32], or by forgoing pooling
operations for dilated convolutions [28,41]. As a post-processing step, many recent
systems apply CRF-based refinement to the output produced by the neural
network [28,45]. The CRF increases the accuracy of the segmentation at the cost of
additional computation.
Reducing the computational burden of semantic segmentation is essential
to make it feasible for embedded systems and autonomous driving. Neural
networks are trained on servers or workstations with powerful GPUs, and these
GPU systems are subsequently used for inference on new data. However, these
commodities do not exist in self-driving cars. A self-driving car needs to react
to new events instantly to guarantee the safety of passengers and other traffic
participants, while it is often acceptable if the borders of objects are not recognized
perfectly down to pixel resolution. Segmenting an image in real time is a strong
requirement in self-driving applications. Thus, it is critical that any convolutional
neural network deployed in these systems fulfills strict requirements in execution
speed.
There has been a vast amount of research in reducing the computation required
for deep learning. SqueezeNet [22] showed that it was possible to reproduce the
image classification accuracy of powerful CNNs, such as AlexNet [25], using 50x
fewer parameters by employing a more efficient architecture. ENet [31] followed
the same path and showed that semantic segmentation is feasible on embedded
devices in real time. Another line of research increases the efficiency of existing
networks by deriving smaller networks from larger counterparts [1,17], by pruning
or quantizing weights [14,13], or by tweaking the network for execution on specific
hardware designs [12]. These methods can be applied on top of new architectures
to speed up execution.
3 Methods
3.1 Overview
In order to achieve semantic segmentation in real time, we have to trade ex-
ecution speed against achievable segmentation accuracy. Like most successful
segmentation networks, our network is structured as an encoder-decoder pair.
An encoder CNN detects higher-level objects such as cars or pedestrians in the
input image. A decoder takes this information and enriches it with information
from the lower layers of the encoder, supplying a prediction for each pixel in the
original input. Figure 2 depicts the architecture.
3.2 Encoder
The encoder is a modified SqueezeNet 1.1 architecture [22], which was designed
as a low-latency network for image recognition while retaining AlexNet-like [25]
accuracy. The main computational modules of SqueezeNet are the so-called “fire”
modules consisting of three convolutional operations, depicted in Figure 3a. The
encoder consists of eight “fire” modules, interspersed with a total of three max-
pooling layers for downsampling. All rectified linear units (ReLUs) of the original
architecture are substituted with exponential linear units (ELUs) [9], which make
more efficient use of parameters by also conveying information in the negative
part of the activation.
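As an illustration, the following is a minimal PyTorch-style sketch of such a fire module with ELU activations; it is not the original Caffe implementation, and the channel sizes in the example call are placeholders rather than the exact encoder configuration.

    import torch
    import torch.nn as nn

    class Fire(nn.Module):
        """SqueezeNet-style fire module with ELU activations (illustrative sketch)."""
        def __init__(self, in_ch, squeeze_ch, expand_ch):
            super().__init__()
            # 1x1 "squeeze" convolution reduces the number of channels
            self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
            # two parallel "expand" convolutions, 1x1 and 3x3
            self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
            self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
            self.act = nn.ELU(alpha=1.0)

        def forward(self, x):
            x = self.act(self.squeeze(x))
            # concatenate both expand branches along the channel dimension
            return torch.cat([self.act(self.expand1x1(x)),
                              self.act(self.expand3x3(x))], dim=1)

    # Example: Fire(128, 16, 64) maps 128 input channels to 2*64 = 128 output channels.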
3.3 Parallel Dilated Convolutions
The decoder is based on a parallel dilated convolution layer [28] as depicted in
Figure 3b. This dilated layer combines the feature maps at the encoder output
at different receptive field sizes by using four dilated convolutions of kernel size 3
with different dilation factors. This is equivalent to sampling the layer input with
different rates. The contributions from the four dilated convolutions are then
fused by an element-wise sum. As a result, the receptive field size is increased
and multiscale spatial dependencies are taken into account without having to
resort to stacking multiple layers which would be computationally expensive.
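To make the fusion concrete, a sketch of such a parallel dilated convolution block is given below. The dilation factors (1, 2, 4, 8) are chosen for illustration only; the text above does not fix their exact values.

    import torch.nn as nn

    class ParallelDilatedConv(nn.Module):
        """Four parallel dilated 3x3 convolutions fused by an element-wise sum (sketch)."""
        def __init__(self, in_ch, out_ch, dilations=(1, 2, 4, 8)):
            super().__init__()
            # padding equal to the dilation keeps the spatial resolution unchanged
            self.branches = nn.ModuleList([
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
                for d in dilations
            ])

        def forward(self, x):
            # sum the branch outputs: multi-scale context without stacking layers
            out = self.branches[0](x)
            for branch in self.branches[1:]:
                out = out + branch(x)
            return out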
3.4 Decoder and Bypasses
Pooling layers in the encoder are used to ensure a degree of translational invariance
when detecting the parts of an object. However, they in turn reduce the spatial
Fig. 2. Architecture of the proposed network for semantic segmentation: convolution and pooling layers followed by fire modules in the encoder, an atrous (parallel dilated) score layer, and transposed convolutions with refinement modules in the decoder.
resolution of the output. Transposed convolutions in the decoder are used to
upsample the information back to its original size. To improve the upsampling, we
do not just use the data that comes directly from the layer immediately before the
transposed convolution layer, but combine it with low-level knowledge from lower
layers of the encoder. These layers are responsible for detecting finer structures at
a higher resolution, which helps with classifying the contours of objects more ex-
actly. Each refinement module combines two streams of information, one coming
from the previous upsampling layer, the other one from the encoder. The two con-
volutional layers in the refinement module learn how to weigh these two streams
before passing the information on to the next upsampling layer. We use refinement
modules similar to the ones used in the SharpMask approach [32]. We again use
ELUs instead of ReLU units (Figure 3c shows the implementation of the module).
Right before every pooling layer in the encoder, a bypass branches off to the
refinement module. Once there, a convolution layer weights knowledge from lower
layers. Then, it is concatenated with semantic object information from the previ-
ous upsampling layer. A second convolutional layer combines the concatenated
feature maps from both branches into the class map.
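A possible realization of this refinement module, in the same illustrative PyTorch style as above, is sketched below; the kernel sizes and the exact channel bookkeeping are assumptions, not a specification of our implementation.

    import torch
    import torch.nn as nn

    class Refinement(nn.Module):
        """Bypass refinement module in the spirit of SharpMask, with ELUs (sketch)."""
        def __init__(self, encoder_ch, decoder_ch, out_ch):
            super().__init__()
            # first convolution weights the low-level knowledge from the encoder bypass
            self.conv_encoder = nn.Conv2d(encoder_ch, decoder_ch, kernel_size=3, padding=1)
            # second convolution merges the concatenated streams into the class map
            self.conv_merge = nn.Conv2d(2 * decoder_ch, out_ch, kernel_size=3, padding=1)
            self.act = nn.ELU()

        def forward(self, encoder_feat, decoder_feat):
            low_level = self.act(self.conv_encoder(encoder_feat))
            # concatenate low-level detail with semantic information from the decoder
            merged = torch.cat([low_level, decoder_feat], dim=1)
            return self.act(self.conv_merge(merged))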
Fig. 3. a) SqueezeNet fire module; b) Parallel dilated convolution layer; c) Bypass refinement module.
3.5 Exponential Linear Units (ELU)
Our network makes extensive use of the exponential linear unit [9] because we
empirically found it to work better than ReLU units for our purpose. ELUs were
designed to avoid the bias shift in neural network training. The ELU activation
function is defined as
f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(\exp(x) - 1) & \text{if } x \le 0 \end{cases}, \qquad
f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ f(x) + \alpha & \text{if } x \le 0 \end{cases} \qquad (1)
The parameter α was set to its default value of 1 for all of our experiments.
Similar to the often-used ReLU activation, ELUs have a linear positive part
which helps to avoid the vanishing gradient [18,19,20] and thus allows training very
deep networks. However, in contrast to the ReLU, the saturating negative part
converges to −α. This allows the ELU to have a mean activation of 0, thereby
avoiding any bias shift effect: In ReLU networks, units will typically have a non-
zero mean activation and thus act as an additional bias for units in the
next layer. By enabling units to have zero mean, this bias shift effect is reduced,
which makes it easier for units to focus solely on actual information processing.
This could otherwise only be achieved by using batch normalization [23] or the
SELU activation function [24].
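A short numerical check illustrates both properties; the numbers below are for a standard-normal toy input, not activations of the actual network.

    import numpy as np

    def elu(x, alpha=1.0):
        # Eq. (1): identity for x > 0, saturating towards -alpha for x <= 0
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

    def relu(x):
        return np.maximum(x, 0.0)

    x = np.random.randn(100_000)          # roughly zero-mean pre-activations
    print(elu(np.array([-10.0])))         # approx. -1.0: negative part saturates at -alpha
    print(relu(x).mean(), elu(x).mean())  # the ELU mean is noticeably closer to zero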
4 Experiments
4.1 Cityscapes dataset
We trained and evaluated the network on the Cityscapes dataset [10]. Cityscapes
is a high-quality dataset for semantic street-scene understanding. The dataset is
split into 2975, 500, and 1525 images of resolution 2048x1024 pixels for training,
validation, and testing, respectively. It contains 35 classes annotated at pixel level,
of which 19 are selected for evaluation in the challenge. Each class belongs to one
of the categories flat, nature, object, sky, construction, human, and vehicle. As
performance measure, the commonly used intersection over union (IoU) metric is
used, which is evaluated for individual classes and categories. Notice that our results were
achieved without CRFs as post-processing because that would increase inference
time dramatically.
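For reference, a simplified version of this metric can be computed as sketched below; the official Cityscapes evaluation accumulates intersections and unions over the entire test set rather than per image, so this is illustrative only.

    import numpy as np

    def mean_iou(pred, target, num_classes=19, ignore_index=255):
        """Per-class intersection over union, averaged over the classes present (sketch)."""
        pred, target = pred.flatten(), target.flatten()
        valid = target != ignore_index          # Cityscapes marks void pixels with 255
        pred, target = pred[valid], target[valid]
        ious = []
        for c in range(num_classes):
            intersection = np.sum((pred == c) & (target == c))
            union = np.sum((pred == c) | (target == c))
            if union > 0:
                ious.append(intersection / union)
        return float(np.mean(ious))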
4.2 Training
The SqueezeNet encoder was initialized using publicly available ImageNet pre-
trained weights and fine-tuned for semantic segmentation on Cityscapes. The
rest of the weights were initialized using the MSRA scheme [16] and the network
was trained with full-resolution images. The loss function was the sum of cross-
entropy terms over all classes, each weighted equally. It
was optimized with Stochastic Gradient Descent, using a fixed learning rate of
10^-8, momentum of 0.9, and a weight decay factor of 0.0002. The architecture
was implemented using the Caffe framework. Total training time was around 22
hours using 2 Titan X Maxwell GPUs, each with a batch size of 3 full-resolution
images. In our experiments we trained for the pixel-wise segmentation task using
only the fine annotations without any additional training data. Augmentations
were not applied in order to establish a baseline for performance.
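For illustration, a PyTorch-style approximation of this training setup is sketched below on dummy data; the original implementation used Caffe, and details such as the unnormalized (summed) per-pixel loss are assumptions that make the very small learning rate plausible.

    import torch
    import torch.nn as nn

    # toy stand-in for the encoder-decoder network of Section 3
    model = nn.Conv2d(3, 19, kernel_size=1)

    # summed per-pixel cross entropy over the 19 Cityscapes classes
    criterion = nn.CrossEntropyLoss(ignore_index=255, reduction='sum')
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-8,
                                momentum=0.9, weight_decay=2e-4)

    # dummy batch; actual training used batches of 3 full-resolution (2048x1024) images
    images = torch.randn(3, 3, 256, 512)
    labels = torch.randint(0, 19, (3, 256, 512))

    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()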
5 Results
Our network compares favourably against the ENet segmentation network [31], a
widely-used network architecture for efficient semantic segmentation on embedded
devices. Comparable, newer architectures exist [30,34,43], but we did not compare
against them.
5.1 Segmentation performance
We evaluated the network on the test set using the official Cityscapes evaluation
server. We achieve 59.8 per-class mean IoU and 84.3 per-category mean IoU.
Hence the network architecture is able to outperform both ENet and
SegNet [3], as can be seen in Table 1. We improved on per-class IoU for all
but 5 classes (wall, truck, bus, train, motorcycle) compared to ENet. Detailed
per-class classification results are presented in Table 2. Visual inspection of the
predictions of the network shows satisfying results on typical urban street scene
images (Figure 1). Object contours are segmented very sharply. We believe that
this is due to the enhanced ability to integrate pixel-level information from early
layers in the encoder into upsampling layers in the decoder by using refinement
modules in the bypasses.
Table 1. Mean IoU over the 19 individual classes and the 7 categories on the Cityscapes
test set. ENet and SegNet results are taken from [31].
Class IoU Category IoU
Ours 59.8 84.3
ENet 58.3 80.4
SegNet 56.1 79.8
Table 2. Per-class IoU on the Cityscapes test set. We improved on the ENet results for all
but 5 classes (wall, truck, bus, train, motorcycle). ENet results are taken from [31].
road sidewalk building wall fence pole trafficlight trafficsign vegetation terrain
Ours 96.9 75.4 87.9 31.6 35.7 50.9 52.0 61.7 90.9 65.8
ENet 96.3 74.2 85.0 32.2 33.2 43.5 34.1 44.0 88.6 61.4
sky person rider car truck bus train motorcycle bicycle
Ours 93.0 73.8 42.6 91.5 18.8 41.2 33.3 34.0 59.9
ENet 90.6 65.5 38.4 90.6 36.9 50.5 48.1 38.8 55.4
5.2 Inference Run-time Performance
Similar to ENet, we are able to surpass the 10 fps design goal on the Nvidia TX1
board at a resolution of 640x360, which is a sensible lower limit for enabling self-
driving car applications. See Table 3 for a comparison of run-times between the
different architectures. The run-times were achieved using CUDA 7.5 and cuDNN
4.0; however, we expect that timings will improve significantly by switching to
newer software versions that support Winograd convolutions [26].
Table 3. Comparison of inference times on the Nvidia Tegra X1. Timings for ENet were
taken from the original publication [31].
480x320 640x360 1280x720
ms fps ms fps ms fps
Ours 60 16.7 86 11.6 389 2.6
ENet 47 21.1 69 14.6 262 3.8
6 Discussion
As a complementary approach to end-to-end deep learning with its lack of
explainability we have proposed a tiered approach, whereby only individual
components are trained in an end-to-end-like fashion. The intermediate nodes,
which are outputs from one tier and inputs to the next, are defined in such a way
that their meaning is interpretable by a human expert with knowledge in the
domain of the overall system. Semantic segmentation is a powerful instrument
that provides a lower-dimensional abstraction of video input signals.
Currently, deep neural networks are the state-of-the-art systems in semantic
image segmentation. However, these deep networks incur high computational
cost, which makes them unsuitable for deployment in self-driving cars. In this
work we have shown how neural networks can be made small enough to run on
embedded devices used in autonomous vehicles while still retaining a level of
segmentation performance sufficient for this application. We believe this to be a
promising approach towards a modular and interpretable system for autonomous
driving.
Subsequent components of the overall system can then take these semantic
features as input for their decisions. One such component could be a model for
situational awareness, classifying the kind of situation the vehicle is currently in.
Another component could be trained to determine an optimal driving path in
view of the semantically segmented surroundings and the situation classification.
Finally, based on these intermediate human understandable classifications and
regressions a controller can be trained on a last tier in order to translate the
now interpretable inner model of the system about the real world into specific
commands for steering, throttle response or braking. In case of erroneous behavior
of the vehicle an expert can investigate such a tiered system by looking at the
intermediate nodes of each tier and obtain answers to the "why"-question.
This gain in interpretability may come at a price. A tiered approach allows more
fine-grained control and human-interpretable checks of the intermediate steps of
driving decisions. However, current research has repeatedly shown that end-to-end
approaches, while harder to interpret, often allow deep learning systems to achieve
higher performance, sometimes even outperforming humans. This is likely because the
system is able to process information in a way that is not compatible with typical
human perception. For example, while reducing the lower-level camera signals
to semantic segmentation maps and discarding more fine-grained information
might make sense for humans, it is possible that machine learning approaches
can still make use of such information in higher-level decision-making processes.
Despite these drawbacks, we would argue that, given the accuracy achieved with
modern systems, the gain in interpretability is worth the potential loss in overall
performance.
Acknowledgments
This work was supported by the Audi.JKU Deep Learning
Center, Audi Electronics Venture GmbH, Zalando SE with Research Agreement
01/2016, the Austrian Science Fund with Project P28660-N31 and NVIDIA
Corporation.
References
1. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems. NIPS (2014)
2. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10(7), 1–46 (2015)
3. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12), 2481–2495 (2017)
4. Bojarski, M., Testa, D.D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., Zieba, K.: End to end learning for self-driving cars. CoRR abs/1604.07316 (2016)
5. Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587 (2017)
6. Chen, Z., Huang, X.: End-to-end learning for lane keeping of self-driving cars. In: IEEE Intelligent Vehicles Symposium. pp. 1856–1860. IEEE (2017)
7. Chi, L., Mu, Y.: Deep steering: Learning end-to-end driving model from spatial and temporal visual cues. CoRR abs/1708.03798 (2017)
8. Ciresan, D., Meier, U., Masci, J., Schmidhuber, J.: Multi-column deep neural network for traffic sign classification. Neural Networks 32, 333–338 (2012)
9. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). In: International Conference on Learning Representations. ICLR (2016)
10. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2016)
11. Donahue, J., Hendricks, L.A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 677–691 (2017)
12. Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Dally, W.J.: EIE: Efficient inference engine on compressed deep neural network. International Conference on Computer Architecture (2016)
13. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In: International Conference on Learning Representations. ICLR (2016)
14. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems. NIPS (2015)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2015)
16. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: IEEE International Conference on Computer Vision. ICCV (2015)
17. Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. CoRR abs/1503.02531 (2015)
18. Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, Technische Universität München, Institut für Informatik (1991)
19. Hochreiter, S.: The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 6(2), 107–116 (1998)
20. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: Kremer, Kolen (eds.) A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press (2001)
21. Hong, S., Yang, D., Choi, J., Lee, H.: Interpretable text-to-image synthesis with hierarchical semantic layout generation. Springer LNCS (2019)
22. Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360 (2016)
23. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning. ICML (2015)
24. Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural networks. In: Advances in Neural Information Processing Systems. NIPS, Curran Associates, Inc. (2017)
25. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. NIPS (2012)
26. Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2016)
27. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
28. Liang-Chieh, C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: International Conference on Learning Representations. ICLR (2015)
29. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2015)
30. Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Proceedings of the European Conference on Computer Vision. ECCV (2018)
31. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: ENet: A deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147 (2016)
32. Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P.: Learning to refine object segments. In: Proceedings of the European Conference on Computer Vision. ECCV (2016)
33. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text-to-image synthesis. In: Proceedings of the 33rd International Conference on Machine Learning. ICML (2016)
34. Romera, E., Álvarez, J.M., Bergasa, L.M., Arroyo, R.: ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 19, 263–272 (2018)
35. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015)
36. Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks 61, 85–117 (2015)
37. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations. ICLR (2015)
38. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR abs/1312.6034 (2013)
39. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning. ICML (2017)
40. Treml, M., Arjona-Medina, J., Unterthiner, T., Durgesh, R., Friedmann, F., Schuberth, P., Mayr, A., Heusel, M., Hofmarcher, M., Widrich, M., Bodenhofer, U., Nessler, B., Hochreiter, S.: Speeding up semantic segmentation for autonomous driving. In: Workshop on Machine Learning for Intelligent Transport Systems, Neural Information Processing Systems (NIPS) (2016)
41. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: International Conference on Learning Representations. ICLR (2016)
42. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Proceedings of the European Conference on Computer Vision. ECCV (2014)
43. Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: ICNet for real-time semantic segmentation on high-resolution images. In: Proceedings of the European Conference on Computer Vision. ECCV (2018)
44. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2017)
45. Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.: Conditional random fields as recurrent neural networks. In: IEEE International Conference on Computer Vision. ICCV (2015)