Visual Scene Understanding for Autonomous
Driving using Semantic Segmentation
Markus Hofmarcher ∗, Thomas Unterthiner ∗, José Arjona-Medina ∗,
Günter Klambauer, Sepp Hochreiter, Bernhard Nessler
LIT AI Lab & Institute for Machine Learning,
Johannes Kepler University Linz,
A-4040 Linz, Austria
Abstract.
Deep neural networks are an increasingly important technique
for autonomous driving, especially as a visual perception component.
Deployment in a real environment necessitates the explainability and
inspectability of the algorithms controlling the vehicle. Such insightful
explanations are relevant not only for legal issues and insurance matters
but also for engineers and developers in order to achieve provable functional
quality guarantees. This applies to all scenarios where the results
of deep networks control potentially life-threatening machines.
We suggest the use of a tiered approach, whose main component is
a semantic segmentation model, over an end-to-end approach for an
autonomous driving system. In order for a system to provide meaningful
explanations for its decisions it is necessary to give an explanation about
the semantics that it attributes to the complex sensory inputs that it
perceives. In the context of high-dimensional visual input this attribution
is done as a pixel-wise classification process that assigns an object class
to every pixel in the image. This process is called semantic segmentation.
We propose an architecture that delivers real-time viable segmentation
performance and which conforms to the limitations in computational
power that is available in production vehicles. The output of such a
semantic segmentation model can be used as an input for an interpretable
autonomous driving system.
Keywords: Deep Learning, Convolutional Neural Networks, Semantic
Segmentation, Classification, Visual Scene Understanding, Interpretability.
1 Introduction
In the context of very complex learning systems there are two main approaches.
The traditional approach is based on handcrafted feature functions that
pre-process input in a manner deemed useful by humans. A human expert uses
domain knowledge in the design of these features such that the task-relevant
information is enhanced and irrelevant signals are suppressed as much as
possible. This pre-processing greatly reduced the input dimensionality, which was
necessary in order to apply traditional machine learning techniques.
∗ These authors contributed equally to this work.
The modern principle is the one favored by the nature of deep learning, the
so-called end-to-end approach. This approach gives the network the freedom
to learn a function that is based on raw inputs and directly produces the final
outputs. This end-to-end approach unleashed deep learning to revolutionize AI
in the last couple of years [8,27,36,35,16].
The advantage of the end-to-end principle is clear: the optimization in the
learning process has the freedom to access the full input space and build arbitrary
functions as needed. However, there are a couple of disadvantages. First, a large
number of training examples is required to optimize the more complex function.
Furthermore, the domain knowledge usually encoded in engineered features is
missing. Finally, the resulting function is even more difficult to interpret than in
the traditional pipeline.
In the problem setting of autonomous driving the raw inputs are pixel-wise
sensor data from RGB cameras, point-wise depth information from LIDAR
sensors, proximity measurements from ultrasonic or radar sensors and odometry
information from an IMU. The desired output is just the steering angle, torque
and brake. In an end-to-end approach the functional model should utilize the
input in an optimal fashion to output optimal driving instructions with respect
to a given loss function [4,6,7].
However, the aforementioned drawbacks of an end-to-end approach quickly
turn out to be insurmountable for this application, as the number of training ex-
amples required to train such an end-to-end driving model for arbitrary scenarios
is huge. Furthermore, it is extremely hard to interpret such a model to answer
questions such as why a model decided for a certain action. This is especially
problematic in view of legal issues if the decisions of the model lead to erroneous
behavior of the vehicle.
Recently, research into attribution methods, whose goal is
to determine which part of an input was responsible for a model’s output, has
gained traction [38,42,2,39]. However, in the scenario of autonomous driving with
its high-dimensional image inputs such attribution methods would explain to
what extent each single pixel has contributed to the driving decision. This
provides some information about the functional correspondence, but it does not
suffice to answer the typical human "why"-question, which asks for the inner
model of the system and its world understanding.
We propose a tiered approach, whereby the inputs to individual sensory
components are directly - without hand-crafted features - fed into a deep neural
network whose training target is defined to be a human interpretable abstract
representation of the input. In particular, for a video input signal, we want to
abstract from the color of the clothing of pedestrians, from the specific texture
of a vehicle and from the specific appearances of streets and buildings. We are
only interested in their shapes and positions in space. Semantic segmentation
achieves this by assigning an object class to every single pixel of the input image.
This is similar to interpreting machine learning models by translating from one
modality to another [33,11,21]. Due to the geometry of our real world this leads
to smaller or larger continuous patches of single class identities representing the
shape of individual objects, like pedestrians, cars, buildings, lanes and pavement
(see Figure 1).
Fig. 1. Example output (bottom) of the network with ground truth (center) on images
from the Cityscapes validation set.
It is obvious that the output of a semantic segmentation network is of much
lower dimensionality when compared to the raw input image. Moreover, it already
represents important first decisions of the overall system. Decisions about the
presence or absence of humans or vehicles are made at this stage. For example,
whether a human or another vehicle is recognized as such is fully determined by
this intermediate processing of the input, and the course of the road is already
reflected in the output as one of the segments. To a certain extent we are back
to a traditional pipeline of engineered features, but this tiered processing is
substantially different. We do not hand-craft features; instead we employ human
expert knowledge to define an internal abstraction layer that is known to be
suitable for the overall problem. Thus, the raw feature analysis is fully trained and optimized like in
end-to-end approaches. The crucial advantage lies in the explanatory power of
this layered processing.
In this article, parts of which were already published in [40], we propose
novel architectures that are able to infer the semantic segmentation in real time
on embedded hardware, while achieving better accuracy than comparable
competitors. In Section 2 we briefly review the current state of the art and
the techniques employed in semantic segmentation with a focus on autonomous
driving applications. In Section 3 we present our proposed architecture for efficient
semantic segmentation in autonomous vehicles. The experimental setup and the
evaluation results are described in Sections 4 and 5.
2 Semantic Segmentation
Convolutional Neural Networks (CNNs) have emerged as the best performing
method for image related tasks [27,36] such as image classification [35,16] or traffic
sign recognition [8] and also semantic segmentation [29,41,44,5]. However,
performing semantic segmentation might incur a significant computational overhead
as compared to end-to-end models. This is especially the case when improved
segmentation performance is accomplished by increasing the network size [15].
When employing deep learning for semantic segmentation, the recognition
of major objects in the image, such as persons or vehicles, is realized at higher
levels of a neural network. By design, these layers work on a coarser scale
and are translation invariant (e.g. imposed via pooling operations), such that
minor variations on a pixel level do not influence the recognition. At the same time,
semantic segmentation requires pixel-exact classification of small features, which
are typically only found in lower layers of a network. This trade-off in resolution
is typically solved by using skip-connections from lower layers to the output,
which increase the resolution at layers close to the output. Skip-connections were
introduced by the Fully Convolutional Network (FCN) [29], which still serves as a
blueprint for most modern approaches. These approaches only differ in how they
encode the object-level information and how they decode these classifications to
pixel-exact labels. The original FCN employed the VGG network [37] that was
pre-trained on the ILSVRC image classification task. FCN then combined the higher
layers with information coming from lower layers, using transposed convolutions
for upscaling. The FCN architecture was improved by alternative ways to connect
to the lower layers, e.g. by accessing the lower-level pooling layers [3], by using
enhanced methods to integrate lower-level information [32] or by forgoing pooling
operations in favor of dilated convolutions [28,41]. As a post-processing step, many recent
systems apply CRF-based refinement to the output produced by the neural
network [28,45]. CRF refinement increases the accuracy of the segmentation at the cost of
additional computation.
Reducing the computational burden of semantic segmentation is essential
to make it feasible for embedded systems and autonomous driving. Neural
networks are trained on servers or workstations with powerful GPUs, and these
GPU systems are subsequently used for inference on new data. However, such
hardware is not available in self-driving cars. A self-driving car needs to react
to new events instantly to guarantee the safety of passengers and other traffic
participants, while it is often acceptable if the borders of objects are not recognized
perfectly down to a pixel resolution. To segment an image in real-time is a strong
requirement in self-driving applications. Thus, it is critical that any convolutional
neural network deployed in these systems fulfills strict requirements in execution
speed.
There has been a vast amount of research in reducing the computation required
for deep learning. SqueezeNet [22] showed that it was possible to reproduce the
image classification accuracy of powerful CNNs, such as AlexNet [25], using 50x
fewer parameters by employing a more efficient architecture. ENet [31] followed
the same path and showed that semantic segmentation is feasible on embedded
devices in real time. Another line of research increases the efficiency of existing
networks by deriving smaller networks from larger counterparts [1,17], by pruning
or quantizing weights [14,13] or by tweaking the network for execution on specific
hardware designs [12]. These methods can be applied on top of new architectures
to speed up execution.
3 Methods
3.1 Overview
In order to achieve semantic segmentation in real time, we have to trade ex-
ecution speed against achievable segmentation accuracy. Like most successful
segmentation networks, our network is structured as an encoder-decoder pair.
An encoder CNN detects higher-level objects such as cars or pedestrians in the
input image. A decoder takes this information and enriches it with information
from the lower layers of the encoder, supplying a prediction for each pixel in the
original input. Figure 2 depicts the architecture.
3.2 Encoder
The encoder is a modified SqueezeNet 1.1 architecture [22], which was designed
as a low-latency network for image recognition while retaining AlexNet-like [25]
accuracy. The main computational modules of SqueezeNet are the so-called “fire”
modules consisting of three convolutional operations, depicted in Figure 3a. The
encoder consists of eight “fire” modules, interspersed with a total of three
max-pooling layers for downsampling. All rectified linear units (ReLUs) of the original
architecture are substituted with exponential linear units (ELUs) [9], which make
more efficient use of parameters by also conveying information in the negative
part of the activation.
3.3 Parallel Dilated Convolutions
The decoder is based on a parallel dilated convolution layer [28] as depicted in
Figure 3b. This dilated layer combines the feature maps at the encoder output
at different receptive field sizes by using four dilated convolutions of kernel size 3
with different dilation factors. This is equivalent to sampling the layer input with
different rates. The contributions from the four dilated convolutions are then
fused by an element-wise sum. As a result, the receptive field size is increased
and multiscale spatial dependencies are taken into account without having to
resort to stacking multiple layers, which would be computationally expensive.
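The following is a minimal sketch of such a layer for a single-channel feature map, written in plain Python for readability. It is not the Caffe implementation used in the experiments. The dilation factors (1, 2, 4, 8) are hypothetical placeholders, since the text does not state the values used, and the function names are chosen only for this illustration. As is usual in deep learning frameworks, the "convolution" is computed as a cross-correlation with zero padding.

```python
def dilated_conv2d(image, kernel, dilation):
    """3x3 convolution with the given dilation factor and zero padding.

    `image` is a list of rows; the output has the same height and width.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    # Kernel taps are spaced `dilation` pixels apart.
                    sy = y + (ky - 1) * dilation
                    sx = x + (kx - 1) * dilation
                    if 0 <= sy < height and 0 <= sx < width:
                        acc += kernel[ky][kx] * image[sy][sx]
            out[y][x] = acc
    return out


def parallel_dilated_conv(image, kernels, dilations):
    """Run one dilated convolution per branch and fuse by element-wise sum."""
    height, width = len(image), len(image[0])
    fused = [[0.0] * width for _ in range(height)]
    for kernel, dilation in zip(kernels, dilations):
        branch = dilated_conv2d(image, kernel, dilation)
        for y in range(height):
            for x in range(width):
                fused[y][x] += branch[y][x]
    return fused


def receptive_field(dilation, kernel_size=3):
    """Side length of the input window covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


if __name__ == "__main__":
    # Hypothetical dilation factors; the values used in the paper are not given.
    dilations = [1, 2, 4, 8]
    print([receptive_field(d) for d in dilations])
```

Each branch has the same number of weights (nine), so the receptive field grows with the dilation factor while the parameter count per branch stays constant.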
3.4 Decoder and Bypasses
Pooling layers in the encoder are used to ensure a degree of translational invariance
when detecting the parts of an object. However, they in turn reduce the spatial
resolution of the output. Transposed convolutions in the decoder are used to
upsample the information back to its original size. To improve the upsampling, we
do not just use the data that comes directly from the layer immediately before the
transposed convolution layer, but combine it with low-level knowledge from lower
layers of the encoder. These layers are responsible for detecting finer structures at
a higher resolution, which helps with classifying the contours of objects more
exactly. Each refinement module combines two streams of information, one coming
from the previous upsampling layer, the other one from the encoder. The two
convolutional layers in the refinement module learn how to weigh these two streams
before passing the information on to the next upsampling layer. We use refinement
modules similar to the ones used in the SharpMask approach [32]. We again use
ELUs instead of ReLU units (Figure 3c shows the implementation of the module).
Fig. 2. Architecture of the proposed network for semantic segmentation. Legend: convolution (C), pooling (P), fire module (F), atrous score layer (A), transposed convolution (T), refinement module (R).
Right before every pooling layer in the encoder, a bypass branches off to the
refinement module. There, a convolution layer weights the knowledge from the lower
layers. The result is then concatenated with semantic object information from the
previous upsampling layer. A second convolutional layer combines the concatenated
feature maps from both branches into the class map.
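The data flow through one refinement module can be sketched as follows. This is an illustration under assumptions, not the implementation from the paper: the kernel sizes are not stated in the text, so 1x1 convolutions are assumed here, and the placement of the ELU after the first convolution is likewise an assumption. Feature maps are nested lists indexed as `[row][column][channel]`, and all function names are hypothetical.

```python
import math


def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def conv1x1(feature_map, weights):
    """Per-pixel linear map; weights[o][i] maps input channel i to output o."""
    return [
        [[sum(w * v for w, v in zip(w_row, pixel)) for w_row in weights]
         for pixel in row]
        for row in feature_map
    ]


def refinement_module(bypass, upsampled, w_bypass, w_fuse):
    """Combine encoder bypass features with upsampled decoder features.

    1. A first convolution weights the low-level bypass features.
    2. The result is concatenated channel-wise with the upsampled features.
    3. A second convolution maps the concatenation to the class map.
    Both inputs must have the same spatial size.
    """
    weighted = conv1x1(bypass, w_bypass)
    weighted = [[[elu(v) for v in pixel] for pixel in row] for row in weighted]
    concatenated = [
        [low + high for low, high in zip(row_low, row_high)]
        for row_low, row_high in zip(weighted, upsampled)
    ]
    return conv1x1(concatenated, w_fuse)
```

The second weight matrix has one row per output class and one column per concatenated channel, which is where the module learns how to weigh the two streams against each other.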
Fig. 3. a) SqueezeNet Fire module; b) Parallel dilated convolution layer; c) Refinement
module. Legend: convolution, ELU, concatenation, dilated convolution (δ), element-wise sum (+). Panel a) labels: Squeeze 1x1, Expand 1x1, Expand 3x3; panel b) labels: δ0, δ1, δ2, δ3.
3.5 Exponential Linear Units (ELU)
Our network makes extensive use of the exponential linear unit [9] because we
empirically found it to work better than ReLU units for our purpose. ELUs were
designed to avoid the bias shift in neural network training. The ELU activation
function and its derivative are defined as
\[
f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(\exp(x) - 1) & \text{if } x \le 0 \end{cases},
\qquad
f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ f(x) + \alpha & \text{if } x \le 0 \end{cases}
\tag{1}
\]
The parameter α was set to its default value of 1 for all of our experiments.
Similar to the often-used ReLU activation, ELUs have a linear positive part
which helps to avoid the vanishing gradient [18,19,20] and thus allows training very
deep networks. However, in contrast to the ReLU, the negative part saturates and
converges to −α. This allows the ELU to have a mean activation of 0, thereby
avoiding any bias shift effect: In ReLU networks, units will typically have a
non-zero mean activation, thus they will act as an additional bias unit for units in the
next layer. By enabling units to have zero mean, this bias shift effect is reduced,
which makes it easier for units to focus solely on actual information processing.
This could otherwise only be achieved by using batch normalization [23] or the
SELU activation function [24].
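As a small illustration of Equation (1), the sketch below implements the ELU and its derivative with α = 1 and compares the mean activation of ELU and ReLU on a symmetric set of inputs. The sample inputs are arbitrary and only meant to show the direction of the effect; the function names are chosen for this example.

```python
import math


def elu(x, alpha=1.0):
    """ELU activation: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x, alpha=1.0):
    """Derivative of the ELU: 1 for x > 0, f(x) + alpha otherwise."""
    return 1.0 if x > 0 else elu(x, alpha) + alpha


def relu(x):
    """Rectified linear unit, for comparison."""
    return max(0.0, x)


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    inputs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]  # arbitrary symmetric sample
    print("mean ReLU activation:", mean([relu(x) for x in inputs]))
    print("mean ELU activation: ", mean([elu(x) for x in inputs]))
    print("ELU for a very negative input:", elu(-50.0))
```

On this sample the mean ELU activation lies closer to zero than the mean ReLU activation, because negative inputs contribute negative outputs that saturate towards −α instead of being clipped to zero.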
4 Experiments
4.1 Cityscapes dataset
We trained and evaluated the network on the Cityscapes dataset [10]. Cityscapes
is a high-quality dataset for semantic street scene understanding. The dataset is
split into 2975, 500, and 1525 images of resolution 2048x1024 pixels for training,
validation, and testing, respectively. It contains 35 classes annotated at pixel level,
of which 19 are selected for evaluation in the challenge. Each class belongs to one
of seven categories: flat, nature, object, sky, construction, human, vehicle. As performance
measure we use the common intersection over union (IoU) metric, which is
evaluated for individual classes and categories. Note that our results were
achieved without CRFs as post-processing because that would increase inference
time dramatically.
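For reference, the intersection over union of a class is TP / (TP + FP + FN), counted over all evaluated pixels. The sketch below computes it from flat lists of predicted and ground-truth label ids. It is a simplified illustration, not the official Cityscapes evaluation script; in particular, the `ignore_label` value of 255 for pixels excluded from evaluation is an assumption made for this example.

```python
def per_class_iou(prediction, ground_truth, num_classes, ignore_label=255):
    """Return a list with the IoU of each class (None if the class never occurs).

    `prediction` and `ground_truth` are flat sequences of integer class ids
    of equal length. Pixels whose ground truth equals `ignore_label` are skipped.
    """
    true_pos = [0] * num_classes
    false_pos = [0] * num_classes
    false_neg = [0] * num_classes
    for pred, truth in zip(prediction, ground_truth):
        if truth == ignore_label:
            continue
        if pred == truth:
            true_pos[truth] += 1
        else:
            false_pos[pred] += 1   # predicted class is charged a false positive
            false_neg[truth] += 1  # true class is charged a false negative
    ious = []
    for c in range(num_classes):
        denom = true_pos[c] + false_pos[c] + false_neg[c]
        ious.append(true_pos[c] / denom if denom else None)
    return ious


def mean_iou(ious):
    """Average over the classes that actually occur."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)
```

The per-category IoU follows the same formula after mapping each class id to its category id.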
4.2 Training
The SqueezeNet encoder was initialized using publicly available ImageNet
pre-trained weights and fine-tuned for semantic segmentation on Cityscapes. The
rest of the weights were initialized using the MSRA scheme [16] and the network
was trained with full resolution images. The loss function was the sum of cross
entropy terms for all classes, equally weighted in the overall loss function. It
was optimized with Stochastic Gradient Descent, using a fixed learning rate of
10^-8, momentum of 0.9 and a weight decay factor of 0.0002. The architecture
was implemented using the Caffe framework. Total training time was around 22
hours using 2 Titan X Maxwell GPUs with a batch size of 3 each on full resolution
images. In our experiments we trained for the pixel-wise segmentation task using
only the fine annotations without any additional training data. Augmentations
were not applied in order to establish a baseline for performance.
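To make the role of the three optimizer hyperparameters concrete, the sketch below shows one update step of SGD with momentum and L2 weight decay in a commonly used formulation, with the values stated above as defaults. It operates on plain lists of floats and is an illustration only; it is not taken from the training code, and the exact solver behaviour in Caffe may differ in details.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=1e-8, momentum=0.9, weight_decay=0.0002):
    """One SGD step with momentum and L2 weight decay.

    Returns the updated (weights, velocity) as new lists.
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        # Weight decay adds the gradient of (weight_decay / 2) * w**2.
        g_total = g + weight_decay * w
        # The velocity accumulates an exponentially decaying sum of past steps.
        v_next = momentum * v - lr * g_total
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity
```

With a fixed learning rate, as used here, `lr` stays constant over all iterations and only the velocity carries state from one step to the next.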
5 Results
Our network compares favourably against the ENet segmentation network [31], a
widely-used network architecture for efficient semantic segmentation on embedded
devices. Comparable, newer architectures exist [30,34,43], but were not compared
against.
5.1 Segmentation performance
We evaluated the network on the test set using the official Cityscapes evaluation
server. We achieve 59.8 per-class mean IoU and 84.3 per-category mean IoU.
Hence the network architecture is able to outperform both ENet and
SegNet [3], as can be seen in Table 1. We improved on per-class IoU for all
but 5 classes (wall, truck, bus, train, motorcycle) compared to ENet. Detailed
per-class classification results are presented in Table 2. Visual inspection of the
predictions of the network shows satisfying results on typical urban street scene
images (Figure 1). Object contours are segmented very sharply. We believe that
this is due to the enhanced ability to integrate pixel-level information from early
layers in the encoder into upsampling layers in the decoder by using refinement
modules in the bypasses.
Table 1. Mean IoU over the 19 individual classes and the 7 categories on the Cityscapes
test set. ENet and SegNet results are taken from [31].

          Class IoU   Category IoU
Ours      59.8        84.3
ENet      58.3        80.4
SegNet    56.1        79.8
Table 2. Per-class IoU on the Cityscapes test set. We improved the ENet results on all
but 5 classes (wall, truck, bus, train, motorcycle). ENet results are taken from [31].

        road   sidewalk   building   wall   fence   pole   traffic light   traffic sign   vegetation   terrain
Ours    96.9   75.4       87.9       31.6   35.7    50.9   52.0            61.7           90.9         65.8
ENet    96.3   74.2       85.0       32.2   33.2    43.5   34.1            44.0           88.6         61.4

        sky    person   rider   car    truck   bus    train   motorcycle   bicycle
Ours    93.0   73.8     42.6    91.5   18.8    41.2   33.3    34.0         59.9
ENet    90.6   65.5     38.4    90.6   36.9    50.5   48.1    38.8         55.4
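The per-class values in Table 2 can be related to the class-level means in Table 1 with a few lines of Python. The numbers below are copied from Table 2; the variable and function names are chosen for this example only.

```python
# Per-class IoU values copied from Table 2 (Cityscapes test set).
CLASSES = [
    "road", "sidewalk", "building", "wall", "fence", "pole", "traffic light",
    "traffic sign", "vegetation", "terrain", "sky", "person", "rider", "car",
    "truck", "bus", "train", "motorcycle", "bicycle",
]
OURS = [96.9, 75.4, 87.9, 31.6, 35.7, 50.9, 52.0, 61.7, 90.9, 65.8,
        93.0, 73.8, 42.6, 91.5, 18.8, 41.2, 33.3, 34.0, 59.9]
ENET = [96.3, 74.2, 85.0, 32.2, 33.2, 43.5, 34.1, 44.0, 88.6, 61.4,
        90.6, 65.5, 38.4, 90.6, 36.9, 50.5, 48.1, 38.8, 55.4]


def mean_iou(values):
    """Unweighted mean over the evaluated classes."""
    return sum(values) / len(values)


def classes_where_worse(ours, baseline, names):
    """Names of the classes on which `ours` scores below `baseline`."""
    return [n for n, o, b in zip(names, ours, baseline) if o < b]


if __name__ == "__main__":
    print("mean IoU (ours):", round(mean_iou(OURS), 1))
    print("mean IoU (ENet):", round(mean_iou(ENET), 1))
    print("ENet better on:", classes_where_worse(OURS, ENET, CLASSES))
```

The unweighted means reproduce the class IoU column of Table 1, and the list of classes on which ENet scores higher matches the five classes named in the caption of Table 2.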
5.2 Inference Run-time Performance
Similar to ENet, we are able to surpass the 10 fps design goal on the Nvidia TX1
board at a resolution of 640x360, which is a sensible lower limit for enabling
self-driving car applications. See Table 3 for a comparison of run-times between the
different architectures. The run-times were achieved using CUDA 7.5 and cuDNN
4.0; however, we expect that timings will improve significantly by switching to
newer software versions that support Winograd convolutions [26].
Table 3. Comparison of inference times on Nvidia Tegra X1. Timings for ENet were
taken from the original publication [31].

        480x320        640x360        1280x720
        ms    fps      ms    fps      ms    fps
Ours    60    16.7     86    11.6     389   2.6
ENet    47    21.1     69    14.6     262   3.8
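The relation between the latency and frame-rate columns of Table 3, and the check against the 10 fps design goal, can be written out as a short sketch. The latencies are the values reported for our network in Table 3; the names are hypothetical. The frame rates listed for ENet were taken from its publication and differ slightly from 1000 / ms, presumably because of rounding in the source.

```python
DESIGN_GOAL_FPS = 10.0  # design goal stated in Section 5.2

# Inference latency of our network in milliseconds, copied from Table 3.
OURS_LATENCY_MS = {"480x320": 60, "640x360": 86, "1280x720": 389}


def fps_from_latency_ms(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


def meets_design_goal(latency_ms, goal_fps=DESIGN_GOAL_FPS):
    """True if the latency allows at least `goal_fps` frames per second."""
    return fps_from_latency_ms(latency_ms) >= goal_fps


if __name__ == "__main__":
    for resolution, latency in OURS_LATENCY_MS.items():
        fps = fps_from_latency_ms(latency)
        print(resolution, round(fps, 1), meets_design_goal(latency))
```

A goal of 10 fps corresponds to a budget of 100 ms per frame, which the network meets at 480x320 and 640x360 but not at 1280x720.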
6 Discussion
As a complementary approach to end-to-end deep learning with its lack of
explainability we have proposed a tiered approach, whereby only individual
components are trained in an end-to-end-like fashion. The intermediate nodes,
which are outputs from one tier and inputs to the next, are defined in such a way
that their meaning is interpretable by a human expert with knowledge in the
domain of the overall system. Semantic segmentation is a powerful instrument
that provides a lower-dimensional abstraction of video input signals.
Currently Deep Neural Networks are state-of-the-art systems in semantic
image segmentation. However, these deep networks incur high computational
cost, which makes them unsuitable for deployment in self-driving cars. In this
work we have shown how neural networks can be made small enough to run on
embedded devices used in autonomous vehicles while still retaining a level of
segmentation performance sufficient for this application. We believe this to be a
promising approach towards a modular and interpretable system for autonomous
driving.
Subsequent components of the overall system can then take these semantic
features as input for their decisions. One such component could be a model for
situational awareness, classifying the kind of situation the vehicle is currently in.
Another component could be trained to determine an optimal driving path in
view of the semantically segmented surroundings and the situation classification.
Finally, based on these intermediate human understandable classifications and
regressions a controller can be trained on a last tier in order to translate the
now interpretable inner model of the system about the real world into specific
commands for steering, throttle response or braking. In case of erroneous behavior
of the vehicle an expert can investigate such a tiered system by looking at the
intermediate nodes of each tier and obtain answers to the "why"-question.
This gain in interpretability comes at a price. A tiered approach allows more
fine-grained control and human-interpretable checks of the intermediate steps of
driving decisions. However, current research has repeatedly shown that end-to-end
approaches, while harder to interpret, allow deep learning systems to achieve
higher performance, sometimes even outperforming humans. This is likely because the
system is able to process information in a way that is not compatible with typical
human perception. For example, while reducing the lower-level camera signals
to semantic segmentation maps and discarding more fine-grained information
might make sense for humans, it is possible that machine learning approaches
can still make use of such information in higher-level decision-making processes.
Despite these drawbacks, we would argue that, given the accuracy achieved with
modern systems, the gain in interpretability is worth the potential loss in overall
performance.
Acknowledgments
This work was supported by Audi.JKU Deep Learning
Center, Audi Electronics Venture GmbH, Zalando SE with Research Agreement
01/2016, the Austrian Science Fund with Project P28660-N31 and NVIDIA
Corporation.
References
1. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural
Information Processing Systems. NIPS (2014)
2. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On
pixel-wise explanations for non-linear classifier decisions by layer-wise relevance
propagation. PLOS ONE 10(7), 1–46 (2015)
3. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis
and Machine Intelligence 39(12), 2481–2495 (2017)
4. Bojarski, M., Testa, D.D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel,
L.D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., Zieba, K.: End to
end learning for self-driving cars. CoRR abs/1604.07316 (2016)
5. Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution
for semantic image segmentation. CoRR abs/1706.05587 (2017)
6. Chen, Z., Huang, X.: End-to-end learning for lane keeping of self-driving cars. In:
IEEE Intelligent Vehicles Symposium. pp. 1856–1860. IEEE (2017)
7. Chi, L., Mu, Y.: Deep steering: Learning end-to-end driving model from spatial
and temporal visual cues. CoRR abs/1708.03798 (2017)
8. Ciresan, D., Meier, U., Masci, J., Schmidhuber, J.: Multi-column deep neural
network for traffic sign classification. Neural Networks 32, 333–338 (2012)
9. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network
learning by exponential linear units (ELUs). In: International Conference on Learning
Representations. ICLR (2016)
10. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene
understanding. In: IEEE Conference on Computer Vision and Pattern Recognition.
CVPR (2016)
11. Donahue, J., Hendricks, L.A., Rohrbach, M., Venugopalan, S., Guadarrama, S.,
Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual
recognition and description. IEEE Transactions on Pattern Analysis and Machine
Intelligence 39(4), 677–691 (2017)
12. Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Dally, W.J.:
Eie: Efficient inference engine on compressed deep neural network. International
Conference on Computer Architecture (2016)
13. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks
with pruning, trained quantization and huffman coding. In: International Conference
on Learning Representations. ICLR (2016)
14. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for
efficient neural network. In: Advances in Neural Information Processing Systems.
NIPS (2015)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2015)
16. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In: IEEE International Conference on
Computer Vision. ICCV (2015)
17. Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network.
CoRR abs/1503.02531 (2015)
18. Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen. Master’s thesis,
Technische Universität München, Institut für Informatik (1991)
19. Hochreiter, S.: The vanishing gradient problem during learning recurrent neural
nets and problem solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 6(2),
107–116 (1998)
20. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent
nets: the difficulty of learning long-term dependencies. In: Kremer, Kolen (eds.) A
Field Guide to Dynamical Recurrent Neural Networks. IEEE Press (2001)
21. Hong, S., Yang, D., Choi, J., Lee, H.: Interpretable Text-to-Image Synthesis with
Hierarchical Semantic Layout Generation. Springer LNCS (2019)
22. Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.:
Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model
size. CoRR abs/1602.07360 (2016)
23. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training
by reducing internal covariate shift. In: Proceedings of the 32nd International
Conference on Machine Learning. ICML (2015)
24. Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural
networks. In: Advances in Neural Information Processing Systems. NIPS, Curran
Associates, Inc. (2017)
25. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep
convolutional neural networks. In: Advances in Neural Information Processing
Systems. NIPS (2012)
26. Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks. In: IEEE
Conference on Computer Vision and Pattern Recognition. CVPR (2016)
27. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444
(2015)
28. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.: Semantic
image segmentation with deep convolutional nets and fully connected crfs. In:
International Conference on Learning Representations. ICLR (2015)
29. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition.
CVPR (2015)
30. Mehta, S., Rastegari, M., Caspi, A., Shapiro, L., Hajishirzi, H.: Espnet: Efficient
spatial pyramid of dilated convolutions for semantic segmentation. In: Proceedings
of the European Conference on Computer Vision. ECCV (2018)
31. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: Enet: A deep neural network
architecture for real-time semantic segmentation. CoRR abs/1606.02147 (2016)
32. Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P.: Learning to refine object
segments. In: Proceedings of the European Conference on Computer Vision. ECCV
(2016)
33. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative
adversarial text-to-image synthesis. In: Proceedings of The 33rd International
Conference on Machine Learning. ICML (2016)
34. Romera, E., Álvarez, J.M., Bergasa, L.M., Arroyo, R.: Erfnet: Efficient residual
factorized convnet for real-time semantic segmentation. IEEE Transactions on
Intelligent Transportation Systems 19, 263–272 (2018)
35. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large
Scale Visual Recognition Challenge. International Journal of Computer Vision
(IJCV) 115(3), 211–252 (2015)
36. Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks
61, 85–117 (2015)
37. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: International Conference on Learning Representations. ICLR
(2015)
38. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks:
Visualising image classification models and saliency maps. CoRR abs/1312.6034
(2013)
39. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks.
In: Proceedings of the 34th International Conference on Machine Learning. ICML
(2017)
40. Treml, M., Arjona-Medina, J., Unterthiner, T., Durgesh, R., Friedmann, F.,
Schuberth, P., Mayr, A., Heusel, M., Hofmarcher, M., Widrich, M., Bodenhofer, U.,
Nessler, B., Hochreiter, S.: Speeding up Semantic Segmentation for Autonomous
Driving. In: Workshop on Machine Learning for Intelligent Transport Systems,
Neural Information Processing Systems (NIPS) (2016)
41. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In:
International Conference on Learning Representations. ICLR (2016)
42. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In:
Proceedings of the European Conference on Computer Vision. ECCV (2014)
43. Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: ICNet for real-time semantic segmentation
on high-resolution images. In: Proceedings of the European Conference on Computer
Vision. ECCV (2018)
44. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In:
IEEE Conference on Computer Vision and Pattern Recognition. CVPR (2017)
45. Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang,
C., Torr, P.: Conditional random fields as recurrent neural networks. In: IEEE
International Conference on Computer Vision. ICCV (2015)