Navigating the Landscape for Real-Time Localization and Mapping for Robotics and Virtual and Augmented Reality

August 2018
Proceedings of the IEEE 106(11):1-20

DOI:10.1109/JPROC.2018.2856739

Authors:

Sajad Saeedi

Bruno Bodin

Yale-NUS College

Andy Nisbet

The University of Manchester

Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm and the appropriate hardware, and compilation pathway, to meet their performance, accuracy, and energy consumption goals. The major contributions we present are (1) tools and methodology for systematic quantitative evaluation of SLAM algorithms, (2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives, (3) end-to-end simulation tools to enable optimisation of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches, and (4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.

Focal-plane Sensor-Processor Arrays (FPSPs) are parallel processing systems, where each pixel has a processing element.

demonstrates execution times for common convolution filters on various CPUs and GPUs compared with an implementation of FPSP, known as SCAMP [60]. The code for FPSP was automatically generated as explained in [65]. The parallel nature of the FPSP allows it to perform all of the tested filter kernels (x-axis) in a fraction of the time (y-axis) needed by the other devices. This is a direct consequence of having a dedicated processing element available for every pixel, building up the filter on the whole image at the same time. As for the other devices, we see that for dense kernels (Gauss, Box), GPUs usually perform better than CPUs, whereas for sparse kernels (Sobel, Laplacian, Sharpen), CPUs seem to have an advantage. An outlier is the 7 × 7 box filter, for which only the most powerful graphics card achieves a result comparable to the CPUs. It is assumed that the CPU implementation follows a more suitable algorithm than the GPU implementation, even though both implementations are based on their vendors' performance libraries (Intel IPP, NVIDIA NPP). Another reason could be that the GTX680 and GTX780 are based on a hardware architecture that is less suitable for this type of filter than the TITAN X's architecture. While Fig. 7 shows that there is a significant reduction in execution time, the SCAMP chip consumes only 1.23 W under full load. Compared to the CPU and GPU systems used in the experiments, this is at least 20 times less power. Clearly a more specialised image processing pipeline architecture could be more energy-efficient than these fully programmable architectures. There is scope for further research to map the space of alternative designs, including specialised heterogeneous multicore vision processing accelerators such as the Myriad-2 Vision Processing Unit [66].

Static, dynamic, and hybrid scheduling are the software optimisation methods presented for power efficiency and speed improvement.

An overview of the Diplomat framework. The user provides (1) the task implementations in various languages and (2) the dependencies between the tasks. Then in (3) Diplomat performs timing analysis on the target platform and in (4) abstracts the task-graph as a static dataflow model. Finally, a dataflow model analysis step is performed in (5), and in (6) the Diplomat compiler performs the code generation.

Evaluation of the best result obtained with Diplomat for CPU and GPU configurations, and comparison with handwritten solutions (OpenMP, OpenCL) and automatic heuristics (Partitioning, Speed-up mapping) for KinectFusion on the Arndale platform. The numbers on the x-axis denote different KinectFusion algorithmic parameter configurations, and the percentages on top of the Diplomat bars are the speedups over the manual implementation.

Navigating the Landscape for Real-time Localisation and Mapping for Robotics and Virtual and Augmented Reality

Sajad Saeedi•, Bruno Bodin*, Harry Wagstaff*, Andy Nisbet†, Luigi Nardi‡, John Mawer†, Nicolas Melot•, Oscar Palomar†, Emanuele Vespa•, Tom Spink*, Cosmin Gorgovan†, Andrew Webb†, James Clarkson†, Erik Tomusk*, Thomas Debrunner•, Kuba Kaszyk*, Pablo Gonzalez-de-Aledo•, Andrey Rodchenko†, Graham Riley†, Christos Kotselidis†, Björn Franke*, Michael F. P. O’Boyle*, Andrew J. Davison•, Paul H. J. Kelly•, Mikel Luján†, and Steve Furber†

Index Terms—SLAM, automatic performance tuning, hardware simulation, scheduling

I. INTRODUCTION

Programming increasingly heterogeneous systems for emerging application domains is an urgent challenge. One particular domain with massive potential is real-time 3D scene understanding, poised to effect a radical transformation in the engagement between digital devices and the physical human world. In particular, visual Simultaneous Localisation and Mapping (SLAM), defined as determining the position and orientation of a moving camera in an unknown environment by processing image frames in real-time, has emerged as an enabling technology for robotics and virtual/augmented reality applications.

The objective of this work is to build the tools to enable the computer vision pipeline architecture to be designed so that SLAM requirements are aligned with hardware capability. Since SLAM is computationally very demanding, several subgoals are defined: developing systems with 1) power and energy efficiency, 2) speed and runtime improvement, and 3) improved results in terms of accuracy and robustness. Fig. 1 presents an overview of the directions explored. At the first stage, we consider different layers of the system including architecture, compiler and runtime, and computer vision algorithms. Several distinct contributions have been presented in these three layers, explained throughout the paper. These contributions include novel benchmarking frameworks for SLAM algorithms, various scheduling techniques for software performance improvement, and ‘functional’ and ‘detailed’ hardware simulation frameworks. Additionally, we present holistic optimisation techniques, such as Design Space Exploration (DSE), that allow us to take into account all these layers together and optimise the system holistically to achieve the desired performance metrics.

•Department of Computing, Imperial College London, UK
*School of Informatics, University of Edinburgh, UK
†School of Computer Science, University of Manchester, UK
‡Electrical Engineering - Computer Systems, Stanford University, USA

Fig. 1: The objective of the paper is to create a pipeline that aligns computer vision requirements with hardware capabilities. The paper’s focus is on three layers: algorithms, compiler and runtime, and architecture. The goal is to develop a system that allows us to achieve power and energy efficiency, speed and runtime improvement, and accuracy/robustness at each layer and also holistically through design space exploration and machine learning techniques.

The major contributions we present are:

• tools and methodology for systematic quantitative evaluation of SLAM algorithms,
• automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives,
• end-to-end simulation tools to enable optimisation of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches, and
• tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.

This article is an overview of a large body of work unified by these common objectives: to apply software synthesis and automatic performance tuning in the context of compilers and library generators, performance engineering, program generation, languages, and hardware synthesis. We specifically target mobile, embedded, and wearable contexts, where trading off quality-of-result against energy consumption is of critical importance. The key significance of the work lies, we believe, in showing the importance and the feasibility of extending these ideas across the full stack, incorporating algorithm selection and configuration into the design space along with code generation and hardware levels of the system.

arXiv:1808.06352v1 [cs.CV] 20 Aug 2018

A. Background

Based on the structure shown in Fig. 1, this section briefly presents background material on the following topics:

• computer vision,
• system software,
• computer architecture, and
• model-based design space exploration.

1) Computer Vision: In the computer vision and robotics communities, SLAM is a well-known problem. Using SLAM, a sensor, such as a camera, is able to localise itself in an unknown environment by incrementally building a map and at the same time localising itself within the map. Various methods have been proposed to solve the SLAM problem, but robustness and real-time performance are still challenging [1]. From the mid 1990s onwards, a strong return has been made to a model-based paradigm enabled primarily by the adoption of probabilistic algorithms [2] which are able to cope with the uncertainty in all real sensor measurements [3]. A breakthrough came when SLAM was shown to be feasible using computer vision applied to commodity camera hardware. The MonoSLAM system offered real-time 3D tracking of the position of a hand-held or robot-mounted camera while reconstructing a sparse point cloud of scene landmarks [4]. Increasing computer power has since enabled previously “off-line” vision techniques to be brought into the real-time domain; Parallel Tracking and Mapping (PTAM) made use of classical bundle adjustment within a real-time loop [5]. Then live dense reconstruction methods, Dense Tracking and Mapping (DTAM) using a standard single camera [6] and KinectFusion using a Microsoft Kinect depth camera [7], showed that surface reconstruction can be a standard part of a live SLAM pipeline, making use of GPU-accelerated techniques for rapid dense reconstruction and tracking.

KinectFusion is an important research contribution and has been used throughout this paper in several sections, including in SLAMBench benchmarking (Section II-A), in improved mapping and path planning in robotic applications (Section II-B), in Diplomat static scheduling (Section III-A2), in Tornado and MaxineVM dynamic scheduling (Sections III-B1 and III-B2), in MaxSim hardware profiling (Section IV-B2), and in various design space exploration and crowdsourcing methods (Section V).

KinectFusion models the occupied space only and says nothing about the free space, which is vital for robot navigation. In this paper, we present a method to extend KinectFusion to model free space as well (Section II-B). Additionally, we introduce two benchmarking frameworks, SLAMBench and SLAMBench2 (Section II-A). These frameworks allow us to study various SLAM algorithms, including KinectFusion, under different hardware and software configurations. Moreover, a new sensor technology, focal-plane sensor-processor arrays, is used to develop scene understanding algorithms, operating at very high frame rates with very low power consumption (Section II-C).

2) System Software: Smart scheduling strategies can bring significant performance improvements in execution time [8] or energy consumption [9], [10], [11] by breaking an algorithm into smaller units, distributing the units between the available cores or Intellectual Property (IP) blocks, and adjusting the voltage and frequency of the cores. Scheduling can be done either statically or dynamically. Static scheduling requires extended knowledge about the application, i.e., how an algorithm can be broken into units, and how these units behave in different settings. Decomposing an algorithm this way impacts a static scheduler’s choice in allocating and mapping resources to computation units, and therefore it needs to be optimised. In this paper, two static scheduling techniques are introduced (Section III-A): idiom-based compilation and Diplomat, a task-graph framework that exploits static dataflow analysis to perform CPU/GPU mapping.
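As a rough illustration of static CPU/GPU mapping, the sketch below greedily assigns each task of a small task graph to whichever device would finish it earliest, using per-device timings assumed to have been measured offline. It is not the algorithm used by Diplomat or by idiom-based compilation; the helper `static_schedule`, the task names, and all timings are hypothetical, and data-transfer costs between devices are ignored.

```python
from collections import namedtuple

# Hypothetical task record: cost maps a device name to an execution time in seconds.
Task = namedtuple("Task", ["name", "deps", "cost"])


def static_schedule(tasks):
    """Greedy earliest-finish-time mapping; tasks must be in topological order."""
    device_free = {"cpu": 0.0, "gpu": 0.0}  # time at which each device becomes idle
    finish = {}    # task name -> finish time
    mapping = {}   # task name -> chosen device
    for task in tasks:
        # A task may start only once all of its dependencies have finished.
        ready = max((finish[d] for d in task.deps), default=0.0)
        best_device, best_finish = None, float("inf")
        for device, cost in task.cost.items():
            end = max(ready, device_free[device]) + cost
            if end < best_finish:
                best_device, best_finish = device, end
        mapping[task.name] = best_device
        finish[task.name] = best_finish
        device_free[best_device] = best_finish
    return mapping, max(finish.values())


# Illustrative (made-up) timings for a four-stage dense SLAM pipeline.
EXAMPLE_TASKS = [
    Task("preprocess", [], {"cpu": 4.0, "gpu": 1.0}),
    Task("track", ["preprocess"], {"cpu": 6.0, "gpu": 2.0}),
    Task("integrate", ["track"], {"cpu": 8.0, "gpu": 3.0}),
    Task("render", ["track"], {"cpu": 3.0, "gpu": 5.0}),
]

mapping, makespan = static_schedule(EXAMPLE_TASKS)
print(mapping, makespan)
```

Because the whole schedule is computed before execution, its cost is paid once offline; the price is that it cannot react to data-dependent behaviour at runtime.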

Since static schedulers do not operate online, optimisation time is not a primary concern. However, important optimisation opportunities may depend on the data being processed; therefore, dynamic schedulers have a better chance of obtaining the best performance. In this paper, two novel dynamic scheduling techniques are introduced: MaxineVM, a research platform for managed runtime languages executing on ARMv7, and Tornado, a heterogeneous task-parallel programming framework designed for heterogeneous systems where the specific configurations of CPUs, GPGPUs, FPGAs, DSPs, etc. in a system are not known until runtime (Section III-B).

In contrast to static schedulers, dynamic schedulers cannot spend too much processing power to find good solutions, as the performance penalty may outweigh the benefits they bring. Quasi-static scheduling is a compromise approach that statically computes a good schedule and further improves it online depending on runtime conditions [12]. A hybrid scheduling technique called power-aware code generation is introduced, which is a compiler-based approach to runtime power management for heterogeneous cores (Section III-C).

3) Computer Architecture: It has been shown that moving to a dynamic heterogeneous model, where the use of hardware resources and the capabilities of those resources are adjusted at run-time, allows far more flexible optimisation of system performance and efficiency [13], [14]. Simulation methods, such as memory and instruction set simulation, are powerful tools to design and evaluate such systems. A large number of simulation tools are available [15]; in this paper we further improve upon current tools by introducing novel ‘functional’ and ‘detailed’ hardware simulation packages that can simulate individual cores and also complete CPU/GPU systems (Section IV-A). Novel profiling (Section IV-B) and specialisation (Section IV-C) techniques are also introduced, which allow us to custom-design chips for SLAM and computer vision applications.

4) Model-based Design Space Exploration: Machine learning has rapidly emerged as a viable means to automate sequential optimising compiler construction. Rather than hand-craft a set of optimisation heuristics based on compiler expert insight, learning techniques automatically determine how to apply optimisations based on statistical modelling and learning. Its great advantage is that it can adapt to changing platforms as it has no a priori assumptions about their behaviour. There are many studies showing it outperforms human-based approaches [16], [17], [18], [19].

Fig. 2: Outline of the paper. The contributions of the paper have been organised under four sections, shown with solid blocks. These blocks cover algorithmic, software, architecture, and holistic optimisation domains. Power efficiency, runtime speed, and quality of results are the subgoals of the project. Quality of results includes metrics such as accuracy of model reconstruction, accuracy of trajectory, and robustness.
[Fig. 2 blocks: Algorithm (mapping, sensors, benchmarking); Software (static, dynamic, and hybrid scheduling); Architecture (functional and detailed simulation, specialisation, profiling); Holistic Optimisation (multi-domain, motion-aware, and comparative design space exploration; crowdsourcing); Goals (power efficiency, speed, quality of results).]

Recent work shows that machine learning can automatically port across architecture spaces with no additional learning time, and can find different, appropriate ways of mapping program parallelism for different parallel platforms [20], [21]. There is now ample evidence from previous research that design space exploration based on machine learning provides a powerful tool for optimising the configuration of complex systems both statically and dynamically. It has been used from the perspective of single-core processor design [22], the modelling and prediction of processor performance [23], the dynamic reconfiguration of memory systems for energy efficiency [24], the design of SoC interconnect architectures [25], and power management [24]. The DSE methodology will address this paper’s goals from the perspective of future many-core systems, extending beyond compilers and architecture to elements of the system stack including application choices and run-time policies. In this paper, several DSE-related works are introduced. Multi-domain DSE performs exploration on hardware, software, and algorithmic choices (Section V-A1). With multi-domain DSE, it is possible to trade off metrics such as runtime speed, power consumption, and SLAM accuracy. In Motion-aware DSE (Section V-A2), we develop a comprehensive DSE that also takes into account the complexity of the environment being modelled, including the photometric texture of the environment, the geometric shape of the environment, and the speed of the camera in the environment. These DSE techniques allow us to design applications that can optimally choose a set of hardware, software, and algorithmic parameters meeting certain desired performance metrics. One example application is active SLAM (Section V-A2a).
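The multi-objective trade-off at the heart of DSE can be illustrated with a Pareto-front filter. The sketch below is a generic illustration, not the machine-learning-guided search described in Section V: it assumes every metric is to be minimised, and the configuration names and metric values (runtime per frame in seconds, power in watts, trajectory error in metres) are invented for the example.

```python
def dominates(a, b):
    """True if a is no worse than b on every metric and strictly better on one.

    All metrics are assumed to be minimised.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(configs):
    """Return the sorted names of configurations not dominated by any other."""
    return sorted(
        name
        for name, metrics in configs.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in configs.items()
            if other_name != name
        )
    )


# Hypothetical measurements: (runtime s/frame, power W, trajectory error m).
EXAMPLE_CONFIGS = {
    "A": (0.030, 2.5, 0.020),
    "B": (0.050, 1.5, 0.025),
    "C": (0.060, 2.8, 0.030),  # worse than A on all three metrics
    "D": (0.040, 3.0, 0.015),
}

print(pareto_front(EXAMPLE_CONFIGS))
```

Configurations on the front represent the available compromises: none can be improved on one metric without giving up another, so the final choice depends on the application's constraints.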

B. Outline

Real-time 3D scene understanding is the main driving force behind this work. 3D scene understanding has various applications in wearable devices, mobile systems, personal assistant devices, Internet of Things, and many more. Throughout this paper, we aim to answer the following questions: 1) How can we improve 3D scene understanding (especially SLAM) algorithms? 2) How can we improve power performance for heterogeneous systems? 3) How can we reduce the development complexity of hardware and software? As shown in Fig. 2, we focus on four design domains: computer vision algorithms, software, hardware, and holistic optimisation methods. Several novel improvements have been introduced, organised as shown in Fig. 2.

• Section II (Algorithm) explains the algorithmic contributions such as using novel sensors, improving dense mapping, and novel benchmarking methods.
• Section III (Software) introduces software techniques for improving system performance, including various types of scheduling.
• Section IV (Architecture) presents hardware developments, including simulation, specialisation, and profiling techniques.
• Section V (Holistic Optimisation) introduces holistic optimisation approaches, such as design space exploration and crowdsourcing.
• Section VI summarises the work.

Fig. 3: Algorithmic contributions include benchmarking tools, advanced sensors, and improved probabilistic mapping.
[Fig. 3 blocks: Benchmarking (SLAMBench, SLAMBench2, datasets); Advanced Sensors (focal-plane sensor-processor arrays); Probabilistic Mapping (OFusion); goals (speed, quality of results, power efficiency).]

II. COMPUTER VISION ALGORITHMS AND APPLICATIONS

Computer vision algorithms are the main motivation of the paper. We focus mainly on SLAM. Within the past few decades, researchers have developed various SLAM algorithms, but few tools are available to compare and benchmark these algorithms and evaluate their performance on the diverse hardware platforms available. Moreover, the general research direction is also moving towards making the current algorithms more robust, to eventually make them available in industry and everyday life. Additionally, as sensing technologies progress, the pool of SLAM algorithms becomes more diverse and fundamentally new approaches need to be invented.

This section presents algorithmic contributions from three different aspects. As shown in Fig. 3, three main topics are covered: 1) benchmarking tools to compare the performance of the SLAM algorithms, 2) improved probabilistic mapping, and 3) new sensor technologies for scene understanding.

Fig. 4: SLAMBench enables benchmarking of the KinectFusion algorithm on various types of platforms by providing different implementations such as C++, OpenMP, CUDA, and OpenCL.
[Fig. 4 blocks: KinectFusion application (input, acquire and pre-process, pose estimation, update reconstruction, surface prediction); implementations (C++, OpenMP, OpenCL, CUDA, ...); platforms (ARM, Intel, NVIDIA, ...); correctness verification (ICL-NUIM dataset, visualisation tool); performance evaluation (frame rate, energy consumption, accuracy trade-off).]

A. Benchmarking: Evaluation of SLAM Algorithms

Real-time computer vision and SLAM offer great potential for a new level of scene modelling, tracking, and real environmental interaction for many types of robots, but their high computational requirements mean that implementation on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for robotics and vision researchers to implement their algorithms in a performance-portable way.

To tackle the aforementioned challenges, two computer vision benchmarking frameworks are introduced in this section: SLAMBench and SLAMBench2. Benchmarking is a scientific method to compare the performance of different hardware and software systems. Both benchmarking frameworks share common functionalities, but their objectives are different. While SLAMBench provides a framework that is able to benchmark various implementations of KinectFusion, SLAMBench2 provides a framework that is able to benchmark a variety of SLAM algorithms in their original implementations.

Additionally, to systematically choose the proper datasets to evaluate the SLAM algorithms, we introduce a dataset complexity scoring method. All these projects allow us to optimise power, speed, and accuracy.

1) SLAMBench: As a first approach to investigate SLAM algorithms, we introduced SLAMBench [26], a publicly available software framework which represents a starting point for quantitative, comparable, and validatable experimental research to investigate trade-offs in performance, accuracy, and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion [7] implementation, inspired by the open-source KFusion implementation [27]. SLAMBench provides the same KinectFusion algorithm in C++, OpenMP, CUDA, and OpenCL variants, and harnesses the ICL-NUIM synthetic RGB-D dataset [28] with trajectory and scene ground truth for reliable accuracy comparison of different implementations and algorithms. The overall vision of the SLAMBench framework is shown in Fig. 4; refer to [26] for more information.

Fig. 5: SLAMBench2 allows multiple algorithms (and implementations) to be combined with a wide array of datasets. A simple API and dataset make it easy to interface with new algorithms.
[Fig. 5 blocks: SLAMBench2 metrics (runtime, power, accuracy); algorithm API (KFusion: C++, OpenCL, CUDA; ElasticFusion: CUDA); dataset format (ICL-NUIM, EuRoC MAV).]

Algorithm | Type | Implementations
ElasticFusion [33] | Dense | CUDA
InfiniTAM [34] | Dense | C++, OpenMP, CUDA
KinectFusion [7] | Dense | C++, OpenMP, OpenCL, CUDA
LSD-SLAM [35] | Semi-Dense | C++, PThread
ORB-SLAM2 [36] | Sparse | C++
MonoSLAM [37] | Sparse | C++, OpenCL
OKVIS [38] | Sparse | C++
PTAM [5] | Sparse | C++
SVO [39] | Sparse | C++

TABLE I: List of SLAM algorithms currently integrated in SLAMBench2. These algorithms provide either dense, semi-dense, or sparse reconstructions [32].

Third parties have provided implementations of SLAMBench in additional emerging languages:

• C++ SYCL for OpenCL, a Khronos Group standard [29], and
• PENCIL, a platform-neutral compute intermediate language for accelerator programming [30]; the PENCIL SLAMBench implementation can be found in [31].

As demonstrated in Fig. 2, SLAMBench has enabled us to do more research in the algorithmic, software, and architecture domains, explained throughout the paper. Examples include Diplomat static scheduling (Section III-A2), Tornado dynamic scheduling (Section III-B1), MaxSim hardware profiling (Section IV-B2), multi-domain design space exploration (Section V-A1), comparative design space exploration (Section V-A3), and crowdsourcing (Section V-B).

2) SLAMBench2: SLAMBench has had substantial success within both the compiler and architecture realms of academia and industry. The SLAMBench performance evaluation framework is tailored for the KinectFusion algorithm and the ICL-NUIM input dataset. In SLAMBench 2.0, however, we re-engineered SLAMBench to have more modularity by integrating two major features [32]. Firstly, a SLAM API has been defined, which provides an easy interface to integrate any new SLAM algorithm into the framework. Secondly, there is now an I/O system in SLAMBench2 which enables the easy integration of new datasets and new sensors (see Fig. 5). Additionally, SLAMBench2 features a new set of algorithms and datasets from among the most popular in the computer vision community; Table I summarises these algorithms.

The works in [40] and [41] present benchmarking results, comparing several SLAM algorithms on various hardware platforms; however, SLAMBench2 provides a framework that researchers can easily integrate and use to explore various SLAM algorithms.

Dataset | Trajectory | Max | Mean | Variance
ICL-NUIM | lr kt0 | 0.0250 | 0.0026 | 0.0014
ICL-NUIM | lr kt1 | 0.0183 | 0.0026 | 0.0012
ICL-NUIM | lr kt2 | 0.0427 | 0.0032 | 0.0023
ICL-NUIM | lr kt3 | 0.0352 | 0.0032 | 0.0023

TABLE II: Complexity level metrics using information divergence [44].

3) Datasets: Research papers on SLAM often report performance metrics such as pose estimation accuracy, scene reconstruction error, or energy consumption. The reported performance metrics may not be representative of how well an algorithm will work in real-world applications. Additionally, as the diversity of datasets grows, it becomes challenging to decide which and how many datasets should be used to compare the results. To address this concern, we have not only categorised datasets according to their complexity in terms of trajectory and environment, but also proposed new synthetic datasets with highly detailed scenes and realistic trajectories [42], [43].

In general, datasets do not come with a measure of complexity level, and thus the comparisons may not reveal all strengths or weaknesses of a new SLAM algorithm. In [44], we proposed to use frame-by-frame Kullback-Leibler divergence as a simple and fast metric to measure the complexity of a dataset. Across all frames in a dataset, the mean divergence and the variance of divergence were used to assess the complexity. Table II shows some of these statistics for ICL-NUIM sequences for intensity divergence. Based on the reported trajectory error metrics of the ElasticFusion algorithm [33], datasets lr kt2 and lr kt3 are more difficult than lr kt0 and lr kt1. Using the proposed statistical divergence, these difficult trajectories have a higher complexity score as well.
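A minimal sketch of such a frame-by-frame divergence score is given below. The exact formulation in [44] is not reproduced here: the histogram binning, the add-one smoothing, and the direction of the divergence (previous frame relative to the next) are assumptions made for illustration, and frames are represented as flat lists of 8-bit intensities.

```python
import math


def histogram(frame, bins=8, max_value=256):
    """Normalised intensity histogram; add-one smoothing avoids empty bins."""
    counts = [1.0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def complexity_stats(frames, bins=8):
    """Return (max, mean, variance) of the divergence between consecutive frames."""
    hists = [histogram(f, bins) for f in frames]
    divs = [kl_divergence(hists[i], hists[i + 1]) for i in range(len(hists) - 1)]
    mean = sum(divs) / len(divs)
    variance = sum((d - mean) ** 2 for d in divs) / len(divs)
    return max(divs), mean, variance


# A static scene produces zero divergence; abrupt appearance changes raise the score.
static_scene = [[0, 0, 255, 255]] * 3
flicker = [[0] * 16, [255] * 16, [0] * 16]
print(complexity_stats(static_scene))
print(complexity_stats(flicker))
```

The three returned values correspond to the Max, Mean, and Variance columns of Table II, although the numbers produced by this toy input are not comparable with those reported there.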

B. OFusion: Probabilistic Mapping

Modern dense volumetric methods based on signed distance functions, such as DTAM [6], or explicit point clouds, such as ElasticFusion [33], are able to recover high quality geometric information in real-time. However, they do not explicitly encode information about empty space, which essentially becomes equivalent to unmapped space. In various robotic applications this could be a problem, as many navigation algorithms require explicit and persistent knowledge about the mapped empty space. Such information is instead well encoded in classic occupancy grids, which, on the other hand, lack the ability to faithfully represent the surface boundaries.

Loop et al. [45] proposed a novel probabilistic fusion framework aiming at closing this information gap by employing a continuous occupancy map representation in which the surface boundaries are well-defined. Targeting real-time robotics applications, we have extended this framework to make it suitable for the incremental tracking and mapping typical of an exploratory SLAM system. The new formulation, denoted as OFusion [46], allows robots to seamlessly perform camera tracking, occupancy grid mapping and surface reconstruction at the same time. As shown in Table III, OFusion not only encodes the free space, but also performs at the same level as or better than state-of-the-art volumetric SLAM pipelines such as KinectFusion [7] and InfiniTAM [34] in terms of mean Absolute Trajectory Error (ATE). To demonstrate the effectiveness of our approach we implemented a simple path planning application on top of our mapping pipeline. We used Informed RRT* [47] to generate a collision-free 3-metre-long trajectory between two obstructed start-goal endpoints, showing the feasibility of achieving tracking, mapping and planning in a single integrated control loop in real-time.

Trajectory | TSDF | OFusion | InfiniTAM
ICL-NUIM lr kt0 | 0.0113 | 0.2289 | 0.3052
ICL-NUIM lr kt1 | 0.0117 | 0.0170 | 0.0214
ICL-NUIM lr kt2 | 0.0040 | 0.0055 | 0.1725
ICL-NUIM lr kt3 | 0.7582 | 0.0904 | 0.4858
TUM fr1 xyz | 0.0295 | 0.0322 | 0.0273
TUM fr1 floor | × | × | ×
TUM fr1 plant | × | × | ×
TUM fr1 desk | 0.1030 | 0.0918 | 0.0647
TUM fr2 desk | 0.0641 | 0.0724 | 0.0598
TUM fr3 office | 0.0686 | 0.0531 | 0.0996

TABLE III: Absolute Trajectory Error (ATE), in metres, comparison between KinectFusion (TSDF), occupancy mapping (OFusion), and InfiniTAM across sequences from the ICL-NUIM and TUM RGB-D datasets. Cross signs indicate tracking failure.
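For reference, a mean ATE of the kind reported in Table III can be sketched as follows. This simplified version assumes that the estimated and ground-truth trajectories are already time-synchronised and expressed in the same coordinate frame; standard evaluation tools first align the two trajectories with a rigid-body transformation, a step omitted here. The function name and the sample positions are illustrative only.

```python
import math


def mean_absolute_trajectory_error(estimated, ground_truth):
    """Mean Euclidean distance between matched camera positions.

    Both inputs are equal-length sequences of (x, y, z) positions in metres,
    assumed to be time-synchronised and already aligned to a common frame.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    errors = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    return sum(errors) / len(errors)


# Made-up example: the middle pose is off by 0.5 m, the others are exact.
estimated = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.4), (2.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(mean_absolute_trajectory_error(estimated, ground_truth))
```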

C. Advanced Sensors

Mobile robotics and various applications of SLAM, Convolutional Neural Networks (CNNs), and VR/AR are constrained by limited power resources and low sensor frame rates. These applications would benefit both from higher frame rates and from lower energy consumption.

Monocular cameras have been used in many scene understanding and SLAM algorithms [37]. Passive stereo cameras (e.g. Bumblebee2, 48 FPS @ 2.5 W [48]), structured light cameras (e.g. Kinect, 30 FPS @ 2.25 W [49]) and time-of-flight cameras (e.g. Kinect One, 30 FPS @ 15 W [49]) additionally provide metric depth measurements; however, these cameras are limited by low frame rates and have relatively demanding power budgets for mobile devices, problems that modern bio-inspired and analogue methods are trying to address.

The Dynamic Vision Sensor (DVS), also known as the event camera, is a novel bio-inspired imaging technology which has the potential to address some of the key limitations of conventional imaging systems. Instead of capturing and sending a full frame, an event camera captures and sends a sparse set of events generated by changes in intensity. Event cameras are low-power and are able to detect changes very quickly. They have been used in camera tracking [50], optical flow estimation [51], and pose estimation [52], [53], [54]. The very high dynamic range of the DVS makes it suitable for real-world applications.
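The idea of sending only sparse change events can be illustrated with a small sketch. It emulates an event camera by comparing two frames and emitting an event wherever the log intensity changes by more than a contrast threshold. This is a simplification under stated assumptions: a real DVS works asynchronously per pixel rather than on frames, and the function name, the threshold value and the `+ 1.0` offset (used to avoid taking the logarithm of zero) are illustrative choices.

```python
import math


def generate_events(prev_frame, curr_frame, threshold=0.2):
    """Return sparse (row, col, polarity) events for pixels whose
    log intensity changed by at least `threshold` between two frames."""
    events = []
    for r, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for c, (p, q) in enumerate(zip(prev_row, curr_row)):
            delta = math.log(q + 1.0) - math.log(p + 1.0)
            if abs(delta) >= threshold:
                events.append((r, c, 1 if delta > 0 else -1))
    return events


# Illustrative 2x3 frames: one pixel brightens, one darkens, four are unchanged.
prev = [[10, 10, 10], [10, 10, 10]]
curr = [[10, 30, 10], [10, 10, 4]]
events = generate_events(prev, curr)
print(events)
```

Only two of the six pixels produce output here, which is the source of the bandwidth and power savings described above.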

Cellular vision chips, such as the ACE400 [55],

ACE16K [56], MIPA4K [57], and Focal-plane Sensor-

Processor Arrays (FPSPs) [58], [59], [60], integrate sensing

and processing in the focal plane. FPSPs are massively parallel processing systems on a single chip. Eliminating the need for data transmission not only increases the effective frame rate but also reduces power consumption significantly. The individual processing elements are small general-purpose analogue processors with a reduced instruction set and memory. Fig. 6 shows a concept diagram of an FPSP, where each pixel has not only a light-sensitive sensor but also a simple processing element. The main advantage of FPSPs is their high effective frame rate at lower clock frequencies, which in turn reduces power consumption compared to conventional sensing and processing systems [61]. However, with their limited instruction sets and local memory [60], developing new applications for FPSPs, such as image filtering or camera tracking, is a challenging problem.

In the past, several interesting works have been presented

using FPSPs, including high-dynamic range imaging [62].

New directions are being followed to explore the performance

of FPSPs in real-world robotic and virtual reality applications.

These directions include 1) 4-DOF camera tracking [63], and 2) automatic filter kernel code generation as well as Viola-Jones [64] face detection [65]. The key concept behind these works is that an FPSP is able to report the sum of the intensity values of all (or a selection of) pixels in just one clock cycle. This ability allows us to develop kernel code generation and also to develop and verify motion hypotheses for visual odometry and camera tracking applications. The results of these works demonstrate that FPSPs not only consume much less power than conventional cameras, but can also be operated at very high frame rates, such as 10,000 FPS.
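To make the role of the global sum concrete, the sketch below scores a set of candidate horizontal image shifts by the sum of absolute differences between the shifted previous frame and the current frame, and keeps the best one. It is only an illustration of hypothesis verification with a global summation and is not the tracking algorithm of [63]; `global_sum`, `shift_columns` and `best_shift` are hypothetical helper names, and the candidate shifts are assumed to be smaller than the image width.

```python
def global_sum(image):
    """Stand-in for the FPSP global summation: add up every pixel value."""
    return sum(sum(row) for row in image)


def shift_columns(image, dx):
    """Shift an image horizontally by dx pixels, padding with zeros."""
    width = len(image[0])
    if dx >= 0:
        return [[0] * dx + row[:width - dx] for row in image]
    return [row[-dx:] + [0] * (-dx) for row in image]


def best_shift(previous, current, candidates):
    """Return the candidate shift with the lowest global sum of differences."""
    def score(dx):
        moved = shift_columns(previous, dx)
        diff = [[abs(a - b) for a, b in zip(r1, r2)]
                for r1, r2 in zip(moved, current)]
        return global_sum(diff)
    return min(candidates, key=score)


# Illustrative frames: a bright column moves two pixels to the right.
previous = [[0, 9, 0, 0, 0], [0, 9, 0, 0, 0]]
current = [[0, 0, 0, 9, 0], [0, 0, 0, 9, 0]]
estimated_shift = best_shift(previous, current, [-2, -1, 0, 1, 2])
print(estimated_shift)
```

On an FPSP the shift, the difference and the summation would each be carried out across the whole array at once, so every hypothesis costs only a handful of instructions.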

Fig. 7 shows execution times for common convolution filters on various CPUs and GPUs compared with an implementation of FPSP known as SCAMP [60]. The code for the FPSP was automatically generated as explained in [65]. The parallel nature of the FPSP allows it to perform all of the tested filter kernels (shown on the x-axis) in a fraction of the time (shown on the y-axis) needed by the other devices. This is a direct consequence of having a dedicated processing element available for every pixel, building up the filter on the whole image at the same time. As for the other devices, we see that for dense kernels (Gauss, Box), GPUs usually perform better than CPUs, whereas for sparse kernels (Sobel, Laplacian, Sharpen), CPUs seem to have an advantage. An outlier is the 7×7 box filter, for which only the most powerful graphics card manages to get a result comparable to the CPUs. We assume that the CPU implementation follows a more suitable algorithm than the GPU implementation, even though both implementations are based on their vendors' performance libraries (Intel IPP, NVIDIA NPP). Another reason could be that the GTX680 and GTX780 are based on a hardware architecture that is less suitable for this type of filter than the TITAN X's architecture. In addition to the significant reduction in execution time shown in Fig. 7, the SCAMP chip consumes only 1.23 W under full load, which is at least 20 times less power than the tested CPU and GPU systems. Clearly, a more specialised image processing pipeline architecture could be more energy-efficient than these fully programmable architectures. There is scope for further research to map the space of alternative designs, including specialised heterogeneous multicore vision processing accelerators such as the Myriad-2 Vision Processing Unit [66].
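The way a per-pixel processor array builds up a filter on the whole image at once can be sketched with whole-image shift and add operations. The example below computes a 3×3 box sum using only such operations. It is a plain Python emulation with hypothetical helper names (`shift`, `box3`) and is not the code produced by the generator of [65]; on the real device every pixel would execute each shift or addition simultaneously.

```python
def shift(image, dy, dx):
    """Whole-image shift by (dy, dx) with zero padding. On an FPSP every
    pixel would read its neighbour's value in the same step."""
    h, w = len(image), len(image[0])
    return [
        [image[r - dy][c - dx] if 0 <= r - dy < h and 0 <= c - dx < w else 0
         for c in range(w)]
        for r in range(h)
    ]


def add(*images):
    """Element-wise sum of equally sized images."""
    return [[sum(values) for values in zip(*rows)] for rows in zip(*images)]


def box3(image):
    """3x3 box sum built from two separable passes of shifts and additions."""
    # Horizontal pass: each pixel adds its left and right neighbours.
    horizontal = add(image, shift(image, 0, 1), shift(image, 0, -1))
    # Vertical pass: each pixel adds the rows above and below.
    return add(horizontal, shift(horizontal, 1, 0), shift(horizontal, -1, 0))


ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
filtered = box3(ones)
print(filtered)
```

The number of array-wide steps depends only on the kernel, not on the image size, which is consistent with the short SCAMP filtering times reported in Fig. 7.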

Fig. 6: Focal-plane Sensor-Processor Arrays (FPSPs) are parallel processing systems, where each pixel has a processing element. (Diagram labels: I/O, Pixel, Registers, Image Sensor.)

Fig. 7: Time for a single filter application of several well-known filters on CPU, GPU, and SCAMP FPSP hardware. The FPSP code was generated by the method explained in [65]; the CPU and GPU code is based on OpenCV 3.3.0. (Bar chart. x-axis: Gauss3, Gauss5, Box3, Box5, Box7, Sobel, Laplacian, Sharpen; y-axis: filtering time [µs]; series: i7-3720, i7-4790, i7-6700, E5-1630, GTX680, GTX780, TITAN X, SCAMP.)

Fig. 8: Static, dynamic, and hybrid scheduling are the software optimisation methods

presented for power efﬁciency and speed improvement.

III. SOFTWARE OPTIMISATIONS, COMPILERS AND RUNTIMES

In this section, we investigate how software optimisations, mainly implemented as a collection of compiler and runtime techniques, can be used to improve the trade-off between power consumption and speed. The optimisations must determine how to efficiently map and schedule program parallelism onto multi-core, heterogeneous processor architectures. This section presents the novel static, dynamic, and hybrid approaches used to specialise computer vision applications for execution on energy-efficient runtimes and hardware (Fig. 8).

A. Static Scheduling and Code Transformation

In this section, we focus on static techniques applied when building an optimised executable. Static schedulers and optimisers can rely only on performance models of the underlying architectures or of the code to optimise, which limits their opportunities. However, they do not require additional code to execute, which reduces runtime overhead. We first introduce in III-A1 an idiom-based heterogeneous compilation methodology which, given the source code of a program, can automatically identify and transform portions of code so that they can be accelerated using many-core CPUs or GPUs. Then, in III-A2, we propose a different methodology used to determine which resources should be used to execute those portions of code. This methodology takes a specialised direction, where applications need to be expressed using a particular model in order to be scheduled.

1) Idiom-based heterogeneous compilation: A wide variety

of high-performance accelerators now exist, ranging from em-

bedded DSPs, to GPUs, to highly specialised devices such as

Tensor Processing Unit [67] and Vision Processing Unit [66].

These devices have the capacity to deliver high performance

and energy efﬁciency, but these improvements come at a cost:

to obtain peak performance, the target application or kernel

often needs to be rewritten or heavily modified. Although high-level abstractions can reduce the cost and difficulty of these modifications, they make it more difficult to obtain peak performance. In order to extract the maximum performance

from a particular accelerator, an application must be aware of

its exact hardware parameters (number of processors, mem-

ory sizes, bus speed, Network-on-Chip (NoC) routers, etc.),

and this often requires low level programming and tuning.

Optimised numeric libraries and Domain Speciﬁc Languages

(DSLs) have been proposed as a means of reconciling pro-

grammer ease and hardware performance. However, they still

require signiﬁcant legacy code modiﬁcation and increase the

number of languages programmers need to master.

Ideally, the compiler should be able to automatically take

advantage of these accelerators, by identifying opportunities

for their use, and then automatically calling into the appropriate libraries or DSLs. However, in practice, compilers struggle to identify such opportunities due to the complex and expensive analysis required. Additionally, when such opportunities are found, they are frequently on too small a scale to obtain any real benefit, with the cost of setting up the accelerator (e.g. data movement, Remote Procedure Call (RPC) costs, etc.) being much greater than the improvement in execution time or power efficiency. Larger-scale opportunities are difficult to identify because they often require inter-procedural analyses, loop invariant detection, pointer and alias analyses, etc., which are complex to implement in the compiler and expensive to compute. On the other hand, when humans attempt to use these accelerators, they often lack the detailed knowledge of the compiler and resort to “hunches” or ad-hoc methods, leading to sub-optimal performance.

Fig. 9: An overview of the Diplomat framework. The user provides (1) the task implementations in various languages and (2) the dependencies between the tasks. Then in (3) Diplomat performs timing analysis on the target platform and in (4) abstracts the task-graph as a static dataflow model. Finally, a dataflow model analysis step is performed in (5), and in (6) the Diplomat compiler performs the code generation. (Diagram labels: Tasks' implementation in C++, OpenMP, OpenCL, ...; Diplomat DSL; Task graph representation; Time profiling; Static dataflow abstraction; Dataflow analysis, e.g. mapping; Code generation; Generated source code C++, OpenMP, OpenCL, ...)

In [68], we develop a novel approach to automatically detect and exploit opportunities to take advantage of accelerators and DSLs. We call these opportunities “idioms”. By expressing these idioms as constraint problems, we can take advantage of constraint solving techniques (in our case a Satisfiability Modulo Theories (SMT) solver). Our technique converts the constraint problem which describes each idiom into an LLVM compiler pass. When running on LLVM IR (Intermediate Representation), these passes identify and report instances of each idiom. This technique is further strengthened by the use of Symbolic Execution and Static Analysis techniques, so that formally proved transformations can be automatically applied when idioms are detected.
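The notion of an idiom as a set of structural constraints can be illustrated on a much smaller scale. The sketch below does not use LLVM or an SMT solver; as a stand-in, it walks a Python syntax tree with the standard `ast` module and reports loops that satisfy the constraints of a simple sum-reduction idiom. The function name `find_sum_reductions` and the chosen constraints are assumptions made for illustration, not the formulation of [68].

```python
import ast


def find_sum_reductions(source):
    """Report (line, accumulator) for loops whose body is `acc += <expr>`."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.For) or len(node.body) != 1:
            continue
        stmt = node.body[0]
        # Constraints of the idiom: a single augmented addition onto a plain
        # name that is not the loop variable itself.
        if (isinstance(stmt, ast.AugAssign)
                and isinstance(stmt.op, ast.Add)
                and isinstance(stmt.target, ast.Name)
                and isinstance(node.target, ast.Name)
                and stmt.target.id != node.target.id):
            matches.append((node.lineno, stmt.target.id))
    return matches


example = """
total = 0
for x in values:
    total += x * x
for y in values:
    print(y)
"""
found = find_sum_reductions(example)
print(found)
```

Once such an instance is reported, a compiler could replace the loop with a call into an optimised reduction routine, which is the role played by the transformations described above.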

We have described idioms for sparse and dense linear

algebra, and stencils and reductions, and written transforma-

tions from these idioms to the established cuSPARSE and

clSPARSE libraries, as well as a data-parallel, functional DSL

which can be used to generate high performance platform

speciﬁc OpenCL code. We have then evaluated this tech-

nique on the NAS, Parboil, and Rodinia sequential C/C++

benchmarks, where we detect 55 instances of our described

idioms. The NAS, Parboil, and Rodinia benchmarks include

several key and frequently used computer vision and SLAM

related tasks such as convolution ﬁltering, particle ﬁltering,

backpropagation, k-means clustering, breadth-ﬁrst search, and

other fundamental computational building blocks. In the cases

where these idioms form a signiﬁcant part of the sequential

execution time, we are able to transform the program to obtain

performance improvements ranging from 1.24x to over 20x on

integrated and discrete GPUs, contributing to the fast execution

time objective.

2) Diplomat, Static mapping of multi-kernel applications on heterogeneous platforms: We propose a novel approach to heterogeneous embedded systems programmability using a task-graph based DSL called Diplomat [69]. Diplomat is a task-graph framework that exploits the potential of static dataflow modelling and analysis to deliver performance estimation and CPU/GPU mapping. An application has to be specified once, and then the framework can automatically propose good mappings. This work aims at improving runtime as much as performance robustness.

Fig. 10: Evaluation of the best result obtained with Diplomat for CPU and GPU configurations, and comparison with handwritten solutions (OpenMP, OpenCL) and automatic heuristics (Partitioning, Speed-up mapping) for KinectFusion on the Arndale platform. The labels on the x-axis are different KinectFusion algorithmic parameter configurations, and the percentages on top of the Diplomat bars are the speedups over the manual implementation. (Bar chart. x-axis: ARN 0, ARN 1, ARN 2, ARN 3; y-axis: speedup over sequential; series: Diplomat (CPU/GPU), Speedup-Mapping (CPU/GPU), Partitioning (CPU/GPU), Manual-OpenCL (CPU/GPU); bar annotations: +0.7%, +25.2%, +30.5%, +0.8%.)

The Diplomat front-end is embedded in the Python pro-

gramming language and it allows the framework to gather

fundamental information about the application: the different

possible implementations of the tasks, their expected input and

output data sizes, and the existing data dependencies between

each of them.

At compile-time, the framework performs static analysis. In order to benefit from existing dataflow analysis techniques, the initial task-graph needs to be turned into a dataflow model. As the dataflow graph is not used to generate the code, this representation of the application does not need to be exact, but it must model the application's behaviour closely enough to obtain good performance estimations. Diplomat performs the following steps. First, the initial task-graph is abstracted into a static dataflow formalism. This includes a timing profiling step to estimate task durations and communication delays. Then, by using static analysis techniques [70], a throughput evaluation and a mapping of the application are performed.

Once a potential mapping has been selected, executable C++ code is automatically generated. This generated implementation takes advantage of task-parallelism and data-parallelism. It can use OpenMP and OpenCL, and it may apply partitioning between CPU and GPU when it is beneficial. This overview is summarised in Fig. 9.
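The static mapping step can be pictured with a deliberately simplified sketch. Given profiled task durations per device and the task dependencies, it enumerates every CPU/GPU assignment, estimates the completion time of one frame, and keeps the best assignment. This is not Diplomat's dataflow throughput analysis [70]: the exhaustive search, the fixed transfer delay, the function names and all durations are hypothetical. Only the four stage names follow the usual KinectFusion pipeline.

```python
from itertools import product


def makespan(order, deps, durations, mapping, transfer):
    """Estimate the finish time of a task graph under one device mapping."""
    finish = {}
    device_free = {"cpu": 0.0, "gpu": 0.0}
    for task in order:  # `order` must be a topological order of the graph
        device = mapping[task]
        ready = 0.0
        for parent in deps.get(task, []):
            # Crossing devices pays a hypothetical fixed transfer delay.
            delay = transfer if mapping[parent] != device else 0.0
            ready = max(ready, finish[parent] + delay)
        start = max(ready, device_free[device])
        finish[task] = start + durations[task][device]
        device_free[device] = finish[task]
    return max(finish.values())


def best_mapping(order, deps, durations, transfer):
    """Exhaustively search all CPU/GPU assignments for the lowest makespan."""
    candidates = (dict(zip(order, devices))
                  for devices in product(("cpu", "gpu"), repeat=len(order)))
    return min(candidates,
               key=lambda m: makespan(order, deps, durations, m, transfer))


# Hypothetical profiled durations in milliseconds.
order = ["preprocess", "track", "integrate", "raycast"]
deps = {"track": ["preprocess"], "integrate": ["track"], "raycast": ["integrate"]}
durations = {
    "preprocess": {"cpu": 1.0, "gpu": 3.0},
    "track": {"cpu": 8.0, "gpu": 3.0},
    "integrate": {"cpu": 10.0, "gpu": 2.0},
    "raycast": {"cpu": 9.0, "gpu": 2.0},
}
mapping = best_mapping(order, deps, durations, transfer=1.5)
print(mapping, makespan(order, deps, durations, mapping, 1.5))
```

With these numbers the cheapest plan keeps the light first stage on the CPU and moves the remaining stages to the GPU, even though it pays one transfer; exhaustive search is only viable for small graphs such as this one.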

We evaluate Diplomat with KinectFusion on two embedded platforms, Odroid-XU3 and Arndale, with four different, manually chosen configurations of the algorithmic parameters. Fig. 10 shows the results for Arndale for the four configurations, marked as ARN 0 to ARN 3. Using Diplomat, we observed a 16% speed improvement on average and up to a 30% improvement over the best existing hand-coded implementation. This is an improvement in runtime speed, one of the goals outlined earlier.

B. Dynamic Scheduling

Dynamic scheduling takes place while the optimised program runs with actual data. Because dynamic schedulers can monitor actual performance, they can compensate for performance skews due to data-dependent control flow and computation that static schedulers cannot accurately capture and model. Dynamic schedulers can therefore exploit additional run-time information to enable more optimisation opportunities. However, they also require the execution of additional profiling and monitoring code, which can create performance penalties.
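A minimal sketch of this trade-off is given below: a scheduler that first profiles each device and then routes work to the device with the lowest observed mean run time. The class name `DynamicScheduler`, the warm-up policy and the timings are hypothetical and are not taken from Tornado or MaxineVM; in a real system the recorded times would come from monitoring code, which is the overhead mentioned above.

```python
class DynamicScheduler:
    """Pick the device with the lowest observed mean run time."""

    def __init__(self, devices, warmup=1):
        self.devices = list(devices)
        self.warmup = warmup
        self.samples = {device: [] for device in self.devices}

    def choose(self):
        # Profile every device at least `warmup` times before trusting means.
        for device in self.devices:
            if len(self.samples[device]) < self.warmup:
                return device
        return min(self.devices,
                   key=lambda d: sum(self.samples[d]) / len(self.samples[d]))

    def record(self, device, seconds):
        """Store one measured run time for a device."""
        self.samples[device].append(seconds)


# Hypothetical per-frame run times in seconds, standing in for measurements.
observed = {"cpu": 0.030, "gpu": 0.012}
scheduler = DynamicScheduler(["cpu", "gpu"])
choices = []
for _ in range(5):
    device = scheduler.choose()
    scheduler.record(device, observed[device])
    choices.append(device)
print(choices)
```

The first two frames are spent profiling; afterwards every frame is sent to the faster device.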

Tornado and MaxineVM runtime scheduling are research

prototype systems that we are using to explore and investigate

dynamic scheduling opportunities. Tornado is a framework

(prototyped on top of Java) using dynamic scheduling for

transparent exploitation of task-level parallelism on hetero-

geneous systems having multicore CPUs, and accelerators

such as GPUs, DSPs and FPGAs. MaxineVM is a research

Java Virtual Machine (JVM) that we are initially using to

investigate dynamic heterogeneous multicore scheduling for

application and JVM service threads in order to better meet

the changing power and performance objectives of a system

under dynamically varying battery life and application service

demands.

1) Tornado: Tornado is a heterogeneous programming framework that has been designed for programming systems that have a higher degree of heterogeneity than existing GPGPU-accelerated systems and where system configurations are unknown until runtime. The current Tornado prototype [71], superseding JACC, described in [72], can dynamically offload code to big.LITTLE cores and GPUs through its OpenCL backend, which supports the widest possible set of accelerators. Tornado can also be used to generate OpenCL code that is suitable for high-level synthesis tools in order to produce FPGA accelerators, although this is not practical unless the relatively long place-and-route times of FPGA vendor tools can be amortised over the application's run time. The main benefit of Tornado is that it allows portable dynamic exploration of how heterogeneous scheduling decisions for task-parallel frameworks lead to improvements in power-performance trade-offs without rewriting the application-level code, even where knowledge of the heterogeneous configuration of a system is delayed until runtime.

The Tornado API cleanly separates computation logic from

co-ordination logic that is expressed using a task-based pro-

gramming model. Currently, data parallelisation is expressed

using standard Java support for annotations [71]. Applications

remain architecture-neutral, and as the current implementation

of Tornado is based on the Java managed language, we are

able to dynamically generate code for heterogeneous execution

without recompilation of the Java source, and without manu-

ally generating new optimised routines for any accelerators

that may become available. Applications need only to be

conﬁgured at runtime for execution on the available hardware.

Tornado currently uses an OpenCL driver for maximum device coverage: this includes mature support for multi-core CPUs and GPGPUs, and maturing support for Xeon Phi coprocessors/accelerators. The current dynamic compiler technology of Tornado is built upon the JVMCI and GRAAL APIs for Java 8 and above. The sequential Java and C++ versions of KinectFusion in SLAMBench both perform at under 3 FPS, with the C++ version being 3.4x faster than Java. Fig. 11 shows the runtime speed of these versions and the improvement obtained with Tornado. By accelerating KinectFusion through GPGPU execution using Tornado, we manage to achieve a constant rate of over 30 FPS (33.13 FPS) across all 882 frames of the ICL-NUIM dataset with the room 2 configuration [28]. To achieve 30 FPS, all kernels have been accelerated, by up to 821.20x and by 47.84x on average across the whole application [71], [73]. Tornado is an attractive framework for the development of portable computer vision applications, as its dynamic JIT compilation for traditional CPU cores and OpenCL compute devices such as GPUs enables real-time performance constraints to be met whilst eliminating the need to rewrite and optimise code for different GPU devices.

Fig. 11: Execution performance of KinectFusion (in FPS) over time using Tornado (Java/OpenCL) vs. baseline Java and C++. (Plot. x-axis: frame number, 0 to 1000; y-axis: frames per second; annotations: C++ 2.72 FPS, Java 0.81 FPS, Java/OpenCL 33.13 FPS.)

2) MaxineVM: The main contribution of MaxineVM is to provide a research infrastructure for managed runtime systems that can execute on top of modern Instruction Set Architectures (ISAs) supplied by both Intel and ARM. This is especially relevant because ARM is the dominant ISA in mobile and embedded platforms. MaxineVM has been released as open-source software [74].

Heterogeneous multicore systems composed of CPUs that have the same ISA but different power/performance design points create a significant challenge for virtual machines, which are typically agnostic to CPU core heterogeneity when making thread-scheduling decisions. Further, heterogeneous CPU core clusters are typically attached to NUMA-like memory systems; consequently, thread scheduling policies need to be adjusted to make appropriate decisions that do not adversely affect the performance and power consumption of managed applications.

In MaxineVM, we are using the Java managed runtime environment to optimise thread scheduling for heterogeneous architectures. Consequently, we have chosen to use and extend the Oracle Labs research project software for MaxineVM [75], which provided a state-of-the-art research VM for x86-64. We have developed a robust port of MaxineVM to ARMv7 ISA processors [71], [76] (an AArch64 port is also in progress) that can run important Java and SLAM benchmarks, including a Java version of KinectFusion. MaxineVM has been designed for maximum flexibility. This sacrifices some performance, but it makes it trivial to replace the public implementation of an interface or scheme, such as the monitor or garbage collection scheme, with simple command-line switches to the command that generates a MaxineVM executable image.

C. Hybrid Scheduling

Hybrid scheduling considers dynamic techniques which take advantage of both static and dynamic data. A schedule can be statically optimised for a target architecture and application (e.g. using machine learning), and a dynamic scheduler can then adjust this schedule to further optimise actual code executions. Since it can rely on a statically optimised schedule, the dynamic scheduler can save a significant amount of work and therefore lower its negative impact on performance.

1) Power-aware Code Generation: Power is an important

constraint in modern multi-core processor design. We have

shown that power across heterogeneous cores varies consider-

ably [77]. This work develops a compiler-based approach to

runtime power management for heterogeneous cores. Given

an externally determined power budget, it generates parallel

parameterised partitioned code that attempts to give the best

performance within that power budget. It uses the compiler infrastructure developed in [78]. The hybrid scheduling has been tested on standard benchmarks such as DSPstone, UTDSP, and Polybench. These benchmarks provide an in-depth comparison with other methods and include key building blocks of many SLAM and computer vision tasks such as matrix multiplication, edge detection, and image histogram. We applied this technique to embedded parallel OpenMP benchmarks on the TI OMAP4 platform for a range of power budgets. On average we obtain a 1.37x speed-up over dynamic voltage and frequency scaling (DVFS). For low power budgets, we see a 2x speed-up. SLAM systems, and vision applications in general, are composed of different phases. An adaptive power budget for each phase positively impacts frame rate and power consumption.
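The selection problem behind a per-phase power budget can be sketched as follows: for each application phase, choose the highest-throughput configuration whose power draw fits that phase's budget. The configuration table, phase names and budgets below are entirely hypothetical and only illustrate the idea; they are not measurements from the OMAP4 experiments, and the sketch does not reproduce the code generation of [77].

```python
def best_config(configs, power_budget_w):
    """Return the highest-throughput configuration within the power budget."""
    feasible = [c for c in configs if c["power_w"] <= power_budget_w]
    if not feasible:
        raise ValueError("no configuration fits the power budget")
    return max(feasible, key=lambda c: c["fps"])


# Hypothetical core configurations with assumed power draw and throughput.
configs = [
    {"name": "1 slow core", "power_w": 0.4, "fps": 6.0},
    {"name": "2 slow cores", "power_w": 0.7, "fps": 11.0},
    {"name": "1 fast core", "power_w": 1.2, "fps": 14.0},
    {"name": "2 fast cores", "power_w": 2.1, "fps": 25.0},
]

# Hypothetical per-phase budgets in watts.
phase_budgets = {"tracking": 1.0, "integration": 2.5}
plan = {phase: best_config(configs, budget)["name"]
        for phase, budget in phase_budgets.items()}
print(plan)
```

A tighter budget for one phase simply restricts the feasible set for that phase, so the remaining phases can still use the faster configurations.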

IV. HARDWARE AND SIMULATION

The designers of heterogeneous Multiprocessor System-

on-Chip (MPSoC) are faced with an enormous task when

attempting to design a system that is co-optimised to deliver

power-performance efﬁciency under a wide range of dynamic

operating conditions concerning the available power stored in

a battery, and the current application performance demands.

In this paper, a variety of simulation tools and technologies

have been presented to assist designers in their evaluations

of how performance, energy, and power consumption trade-

offs are affected by computer vision algorithm parameters

and computational characteristics of speciﬁc implementations

on different heterogeneous processors and accelerators. Tools

have been developed that focus on the evaluation of native

and managed runtime systems, that execute on ARM and x86-

64 processor instruction set architectures in conjunction with

GPU and custom accelerator intellectual property.

The contributions of this section have been organised under three main topics: simulation, profiling, and specialisation. Under each topic, several novel tools and methods are presented. The main objective in developing these tools and methods is to reduce development complexity and increase reproducibility for system analysis. Fig. 12 presents a graph where all simulation, profiling, and specialisation tools are summarised.

Fig. 12: Hardware development tasks are simulation, profiling, and specialisation tools, each with its own goals. With these three tasks, it is possible to develop customised hardware for computer vision applications.

A. Fast Simulation

Simulators have become an essential tool for hardware

design. They allow designers to prototype different systems

before committing to a silicon design, and save enormous

amounts of money and time. They allow embedded systems

engineers to develop the driver and compiler stack, before the

system is available, and be able to verify their results. Even

after releasing the hardware, software engineers can make

use of simulators to prototype their programs in a virtual

environment, without the latency of ﬂashing the software onto

the hardware, or even without access to the hardware.

These different use cases require very different simulation

technologies. Prototyping hardware typically requires ‘de-

tailed’ performance modelling simulation to be performed,

which comes with a signiﬁcant slowdown compared to real

hardware. On the other hand, software development often does

not require such detailed simulation, and so faster ‘functional’

simulators can be used. This has led to the development of

multiple simulation systems within this work, with the GenSim

system being used for ‘functional’ simulation and APTsim

being used for more detailed simulation.

In this section, three novel system simulation works are

presented. These works are: GenSim, CPU/GPU simulation,

and APTsim.

1) The GenSim Architecture Description Language: Mod-

ern CPU architectures often have a large number of extensions

and versions. At the same time, simulation technologies have

improved, making simulators both faster and more accurate.

However, this has made the creation of a simulator for a mod-

ern architecture much more complex. Architecture Description

Languages (ADLs) seek to solve this problem by decoupling

the details of the simulated architecture from the tool used to

simulate it.
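The decoupling that an ADL provides can be illustrated with a toy example in which the architecture is described purely as data (bit fields and instruction behaviours) and a generic functional simulator interprets that description. The 16-bit encoding, the two instructions and all names below are invented for illustration; this is not GenSim's ADL syntax.

```python
def addi(regs, rd, rs, imm):
    """rd = rs + imm, wrapped to 8 bits."""
    regs[rd] = (regs[rs] + imm) & 0xFF


def mov(regs, rd, rs, imm):
    """rd = rs (the immediate field is ignored)."""
    regs[rd] = regs[rs]


# Hypothetical architecture description: field layout as (shift, mask),
# and opcode -> (mnemonic, behaviour).
TOY_ARCH = {
    "fields": {"opcode": (12, 0xF), "rd": (10, 0x3), "rs": (8, 0x3), "imm": (0, 0xFF)},
    "instructions": {0x1: ("addi", addi), 0x2: ("mov", mov)},
}


def decode(arch, word):
    """Extract every described bit field from an instruction word."""
    return {name: (word >> shift) & mask
            for name, (shift, mask) in arch["fields"].items()}


def simulate(arch, program, num_regs=4):
    """Generic functional simulator that only knows the description's shape."""
    regs = [0] * num_regs
    trace = []
    for word in program:
        f = decode(arch, word)
        mnemonic, behaviour = arch["instructions"][f["opcode"]]
        behaviour(regs, f["rd"], f["rs"], f["imm"])
        trace.append(mnemonic)
    return regs, trace


# addi r0, r0, 5 ; mov r1, r0 ; addi r1, r1, 3
regs, trace = simulate(TOY_ARCH, [0x1005, 0x2400, 0x1503])
print(regs, trace)
```

Supporting a different instruction set then means writing a new description rather than a new simulator, which is the problem ADLs aim to solve.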

We have developed the GenSim simulation infrastructure, which includes an ADL toolchain (see Fig. 13). This ADL is designed to enable the rapid development of fast functional simulation tools [79], and the prototyping of architectural extensions (and potentially full instruction set architectures). This infrastructure is used in the CPU/GPU simulation work (Section IV-A2). The GenSim infrastructure is described in a number of publications [80], [81], [82]. GenSim is available under a permissive open-source license at [83].

Fig. 13: Diagram showing the general flow of the GenSim ADL toolchain. (Diagram labels: System, Modes, Features, Syntax, Formats, Decoding, Disassembly, Semantics, Behaviours, Exceptions, Predication, GenSim, CAPTIVE Components, ArcSim Components, Tests.)

2) Full-system simulation for CPU/GPU: Graphics processing units are highly specialised processors that were originally designed to process large graphics workloads effectively; however, they have become influential in many industries, including in executing computer vision tasks. Simulators for parallel architectures, including GPUs, have not reached the same level of maturity as simulators for CPUs, both due to the secrecy of leading GPU vendors and due to the problems arising from mapping parallel architectures onto scalar architectures, or onto different parallel architectures.

At the moment, the GPU simulators that have been presented in the literature have limitations resulting from lack of verification, poor accuracy, poor speed, and limited observability due to incomplete modelling of certain hardware features. As they do not accurately model the full native software stack, they are unable to execute realistic GPU workloads, which rely on extensive interaction with user and system runtime libraries.

In this work, we propose a full-system methodology for

GPU simulation, where rather than simulating the GPU as

an independent unit, we simulate it as a component of a

larger system, comprising a CPU simulator with supporting

devices, operating system, and a native, unmodiﬁed driver

stack. This faithful modelling results in a simulation platform

indistinguishable from real hardware.

We have been focusing our efforts on simulation of the

ARM Mali GPU, and have built a substantial amount of

surrounding infrastructure. We have seen promising results

in simulation of compute applications, most notably SLAM-

Bench.

The work directly contributed to full-system simulation by implementing the ARMv7 MMU, the ARMv7 and Thumb-2 instruction sets, and a number of devices needed to communicate with the GPU. To connect the GPU model realistically, we have implemented an ARM CPU-GPU interface containing an ARM device on the CPU side [84].

The implementation of the Mali GPU simulator comprises:
• An implementation of the Job Manager, a hardware resource for controlling jobs on the GPU side,
• The Shader Core Infrastructure, which allows for retrieving important context needed to execute shader programs efficiently,
• The Shader Program Decoder, which allows us to interpret Mali Shader binary programs,
• The Shader Program Execution Engine, which allows us to simulate the behaviour of Mali programs.

Future plans for simulation include extending the infrastruc-

ture to support real time graphics simulation, increasing GPU

Simulation performance using Dynamic Binary Translation

(DBT) [79], [82], [85] techniques, and extending the Mali

Model to support performance modelling. We have also con-

tinued to investigate new techniques for full-system dynamic

binary translation (such as exploiting hardware features on the

host to further accelerate simulation performance), as well as

new methodologies for accelerating the implementation and

veriﬁcation of full system instruction set simulators. Fast full

system simulation presents a large number of unique chal-

lenges and difﬁculties and in addressing and overcoming these

difﬁculties, we expect to be able to produce a signiﬁcant body

of novel research. Taken as a whole, these tools will directly

allow us to explore next-generation many-core applications,

and design hardware that is characterised by high performance

and low power.

3) APTSim - simulation and prototyping platform: APTSim (Fig. 14) is intended as a fast simulator allowing rapid simulation of microprocessor architectures and microarchitectures, as well as the prototyping of accelerators. The system runs on a platform consisting of a processor, for functional simulation, and an FPGA, for implementing architecture timing models and prototypes. Currently the Xilinx Zynq family is used as the host platform. APTSim uses MAMBO (see Section IV-B1) to dynamically instrument a running executable, together with the MAST co-design library, described below. Custom MAMBO plugins allow specific events, such as load/store instructions or PC-changing events, to be sent to MAST hardware models, such as memory systems or processor pipelines. From a simulation perspective, the hardware models are for timing and gathering statistics and do not perform functional simulation, which is carried out on the host processor as native execution. For example, if we send a request to a cache system, the model will tell us the memory level at which the result is present and the response time, while the actual data will be returned from the processor's own memory. This separation allows smaller, less complicated hardware models to gather statistics whilst the processor executes the benchmark natively and the MAMBO plugins capture the necessary events with low overhead.
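The separation between functional execution and timing models can be illustrated with a minimal software sketch. This is not APTSim or MAST code: the class `TimingCacheModel`, the function `run_with_model`, the direct-mapped organisation, the 4-byte word size, and the latency values are all assumptions chosen for illustration. The point it shows is that the model only ever sees addresses and returns a memory level and a latency, while the data itself comes from the host's own memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TimingCacheModel:
    """Timing-only direct-mapped cache: it tracks tags and never stores data."""
    num_lines: int = 64
    line_bytes: int = 32
    hit_latency: int = 2      # cycles (illustrative value)
    miss_latency: int = 100   # cycles (illustrative value)
    tags: Dict[int, int] = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def access(self, address: int) -> Tuple[str, int]:
        """Return (memory level that served the request, latency in cycles)."""
        line = address // self.line_bytes
        index = line % self.num_lines
        tag = line // self.num_lines
        if self.tags.get(index) == tag:
            self.hits += 1
            return "L1", self.hit_latency
        self.tags[index] = tag  # fill the line on a miss
        self.misses += 1
        return "DRAM", self.miss_latency


def run_with_model(memory: List[int], word_indices: List[int],
                   model: TimingCacheModel) -> Tuple[int, int]:
    """Sum memory words natively while the model accumulates timing."""
    total = 0
    cycles = 0
    for idx in word_indices:
        total += memory[idx]                  # functional result: host memory
        _, latency = model.access(idx * 4)    # load event sent to the model
        cycles += latency
    return total, cycles
```

In this sketch the functional result (`total`) would be identical with or without the model attached; only the statistics (`cycles`, `hits`, `misses`) depend on it, which mirrors the division of labour described above.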

The MAST library provides a framework for easily integrating many hardware IP blocks, implemented on FPGA, with a Linux-based application running on a host processor. MAST consists of two principal parts: a software component and a hardware library. The software component allows the discovery and management of hardware IP blocks and the management of memory used by the hardware blocks; critically, this allows new hardware blocks to be configured and used at runtime using only user-space software. The hardware library, written in Bluespec, contains parametrised IP blocks including architecture models, such as cache systems or pipeline models, and accelerator modules for computer vision, such as filters or feature detectors. The hardware models can either be masters or slaves, from a memory perspective. As masters, models can directly access processor memory, leaving the processor to execute code whilst the hardware is analysing the execution of the last code block.

Fig. 14: APTSim, an FPGA-accelerated simulation and prototyping platform, currently implemented on a Zynq SoC. (Components shown: native ARM application running on the Zynq A9, MAMBO with load/store and instruction trace plugins, MAST pipeline and cache drivers, pipeline and memory models on the FPGA, SimCtrl, and statistics such as utilisation and performance.)

APTSim also allows us to evaluate prototype hardware. For example, we evaluated multiple branch predictors by implementing them in Bluespec behind a MAST-compliant interface. This allows us to execute our benchmark code once on the CPU and offload to multiple candidate implementations to rapidly explore the design space.

In [86] we show that on the Xilinx Zynq 7000 FPGA board, coupled with a relatively slow 666 MHz ARM Cortex-A9 processor, the slowdown of APTSim is 400x in comparison to native execution on the same processor. While a significant slowdown over native execution is unavoidable when implementing fine-grained performance monitoring, the slowdown of APTSim is about half that of GEM5 running at 3.2 GHz on an Intel Xeon E3 to simulate the same ARM system. Note that, unlike APTSim, GEM5 on the Xeon does not benefit from any FPGA acceleration. This shows the value of APTSim in exploiting FPGA acceleration to implement a fast Register Transfer Level (RTL) simulation and monitor its performance, while hiding the complexity of FPGA programming from the user.

B. Profiling

Profiling is the process of analysing the runtime behaviour of a program in order to measure its performance, for example, to determine which parts of the program take the most time to execute. This information can then be used to improve software (for example, by using a more optimised implementation of frequently executed functions) or to improve hardware (by including hardware structures or instructions which provide better performance for frequently executed functions). Profiling of native applications is typically performed via dynamic binary instrumentation. However, when a managed runtime environment is used, the runtime environment can often perform the necessary instrumentation. In this section, we explore both of these possibilities, with MAMBO being used for native profiling, and MaxSim being used for the profiling of Java applications.

Fig. 15: MaxSim overview of ZSim and MaxineVM based profiling. (Components shown: ZSim (C++) with an OOO core model and profiling data; Maxine VM (Java + C) with heap and code cache; magic NOPs (xchg rcx, rcx); loads and stores through tagged pointers p:[tag(16b):base(48b)]; Protocol Buffers messages FieldProf (offset, readCount, writeCount, cacheMissCount) and FieldInfo (name, classId, offset, ...); the profGen, profUse and MaxineInfoGen tools; and the ZSimProf.db and MaxineInfo.db databases.)

1) MAMBO: instruction level proﬁling: Dynamic Binary

Instrumentation (DBI) is a technique for instrumenting ap-

plications transparently while they are executed, working at

the level of machine code. As the ARM architecture expands

beyond its traditional embedded domain, there is a growing

interest in DBI systems for the general-purpose multicore

processors that are part of the ARM family. DBI systems

introduce a performance overhead and reducing it is an active

area of research; however, most efforts have focused on the

x86 architecture.

MAMBO is a low overhead DBI framework for 32-bit

(AArch32) and 64-bit ARM (AArch64) [87]. MAMBO is

open-source [88]. MAMBO provides an event-driven plugin

API for the implementation of instrumentation tools with mini-

mal complexity. The API allows the enumeration, analysis and

instrumentation of the application code ahead of execution, as

well as tracking and control of events such as system calls.

Furthermore, the MAMBO API provides a number of high

level facilities for developing portable instrumentation, i.e.

plugins which can execute efﬁciently both on AArch32 and

AArch64, while being implemented using mostly high level

architecture-agnostic code.

MAMBO incorporates a number of novel optimisations,

speciﬁcally designed for the ARM architecture, which allow

it to minimise its performance overhead. The geometric mean

runtime overhead of MAMBO running SPEC CPU2006 with

no instrumentation is as low as 12% (on an APM X-C1

system), compared with DynamoRIO [89], a state-of-the-art DBI system, which has an overhead of 34% under the same test conditions.
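As a reminder of how a geometric mean overhead of this kind is typically computed, the following minimal sketch takes per-benchmark native and instrumented runtimes and returns the geometric mean of their ratios, minus one. The function name `geomean_overhead` and the sample numbers are illustrative assumptions and are not taken from the MAMBO evaluation.

```python
import math
from typing import Sequence


def geomean_overhead(native_times: Sequence[float],
                     instrumented_times: Sequence[float]) -> float:
    """Geometric mean of per-benchmark slowdown ratios, minus 1.

    A return value of 0.12 corresponds to a 12% geometric mean overhead.
    """
    if len(native_times) != len(instrumented_times) or not native_times:
        raise ValueError("need two non-empty sequences of equal length")
    ratios = [inst / nat for nat, inst in zip(native_times, instrumented_times)]
    # Average in log space to avoid overflow and to weight benchmarks equally.
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return math.exp(log_mean) - 1.0
```

The geometric mean is used rather than the arithmetic mean because the quantities being averaged are ratios, so a benchmark that runs for a long time does not dominate the result.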

Fig. 16: Performance of MaxSim on KinectFusion. (a) Heap space saving (%) per number of tag bits (0 to 11) using tagged pointers, (b) relative reduction (%) in execution time, and (c) relative reduction (%) in DRAM dynamic energy. (Legend entries: Class Information Pointer (CIP) Elimination, Object Pointers Compression; 4CA-EA/4C-B, 4CAL-EAL/4C-C, 1CA-EA/4C-B, 1CAL-EAL/1C-C.)

2) MaxSim: profiling and prototyping hardware-software co-design for managed runtime systems: Managed applications, written in programming languages such as Java, C# and others, represent a significant share of workloads in the mobile, desktop, and server domains. Microarchitectural timing simulation of such workloads is useful for characterisation and performance analysis of both hardware and software, as well as for research and development of novel hardware extensions. MaxSim [90] (see Fig. 15) is a simulation platform based on the MaxineVM [75] (explained in Section III-B2), the ZSim [91] simulator, and the McPAT [92] modelling framework. MaxSim can perform fast and accurate simulation of managed runtime workloads running on top of MaxineVM [74]. MaxSim's capabilities include: 1) low-intrusive microarchitectural profiling via pointer tagging on x86-64 platforms, 2) modelling of hardware extensions related, but not limited to, tagged pointers, and 3) modelling of complex software changes via address-space morphing.
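The pointer layout shown in Fig. 15, p:[tag(16b):base(48b)], can be sketched in a few lines. The code below is an illustration only, assuming that the tag occupies the upper 16 bits of a 64-bit word and that a simulator attributes each memory event to the tag of the pointer used; the function names `tag_pointer`, `untag_pointer` and `profile_accesses` are hypothetical and do not come from MaxSim.

```python
from collections import Counter
from typing import Iterable, Tuple

TAG_BITS = 16
BASE_BITS = 48
TAG_MASK = (1 << TAG_BITS) - 1
BASE_MASK = (1 << BASE_BITS) - 1


def tag_pointer(base: int, tag: int) -> int:
    """Pack a 16-bit tag into the upper bits of a 48-bit address."""
    if not 0 <= base <= BASE_MASK:
        raise ValueError("base address does not fit in 48 bits")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag does not fit in 16 bits")
    return (tag << BASE_BITS) | base


def untag_pointer(pointer: int) -> Tuple[int, int]:
    """Split a tagged pointer into (tag, base address)."""
    return (pointer >> BASE_BITS) & TAG_MASK, pointer & BASE_MASK


def profile_accesses(pointers: Iterable[int]) -> Counter:
    """Count memory accesses per tag, e.g. per class or allocation site."""
    counts: Counter = Counter()
    for pointer in pointers:
        tag, _ = untag_pointer(pointer)
        counts[tag] += 1
    return counts
```

Because the tag travels with the pointer, events can be attributed to a type or allocation site without extra lookups on the profiled path, which is consistent with the low-intrusive profiling described above.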

Low-intrusive microarchitectural profiling is achieved by utilising tagged pointers to collect type- and allocation-site-related hardware events. Furthermore, MaxSim allows, through a novel technique called address-space morphing, the easy modelling of complex object layout transformations. Finally, through the co-designed capabilities of MaxSim, novel hardware extensions can be implemented and evaluated. We showcase MaxSim's capabilities by simulating the whole set of the DaCapo-9.12-bach benchmarks in less than a day while performing an up-to-date microarchitectural power and performance characterisation [90]. Furthermore, we demonstrate a hardware/software co-designed optimisation that performs dynamic load elimination for array length retrieval, achieving up to a 14% reduction in L1 data cache loads and up to a 4% reduction in dynamic energy.

In [93] we present results for MaxineVM with MaxSim. We use SLAMBench to experiment with KinectFusion on a 4-core Nehalem system, using 1 and 4 cores (denoted by 1C and 4C, respectively). We use MaxSim's extensions for the Address Generation Unit (AGU) (denoted by 1CA and 4CA) and for the Load-Store Unit (LSU) (denoted by 1CAL and 4CAL). Fig. 16-a shows heap savings of more than 30% on SLAMBench thanks to CIP (Class Information Pointer) elimination. Fig. 16-b demonstrates the relative reduction in execution time using the proposed framework. In this figure, EA refers to a machine configuration with CIP elimination and a 16-bit CID (Class Information), and EAL refers to a variant with CIP elimination, a 4-bit CID, and the AGU and LSU extensions. B stands for the standard baseline MaxSim virtual machine and C is B with object compression. Fig. 16-b shows execution time benefits of up to 6% for CIP elimination over MaxSim with none of our extensions, whether it uses 4 cores (4CA-EA/4CA-B) or 1 core (1CA-EA/1C-B). Finally, Fig. 16-c shows the relative reduction in DRAM dynamic energy for the cases mentioned above. As the graph shows, there is an 18% to 28% reduction in DRAM dynamic energy. These reductions contribute to the objective of improving the quality of the results. MaxSim is open-source [74].

C. Specialisation

Recent developments in computer vision and machine learning have challenged hardware and circuit designers to design faster and more efficient systems for these tasks [94]. The Tensor Processing Unit (TPU) from Google [67], the Vision Processing Unit (VPU) from Intel Movidius [66], and the Intelligence Processing Unit (IPU) from Graphcore [95] are examples of such devices, in which major hardware re-engineering has resulted in outstanding performance. While the development of custom hardware can be appealing due to the possible significant benefits, it can lead to extremely high design, development, and verification costs, and a very long time to market. One method of avoiding these costs while still obtaining many of the benefits of custom hardware is to specialise existing hardware. We have explored several possible paths to specialisation, including specialised memory architectures for GPGPU computations (which are frequently used in computer vision algorithm implementations), the use of single-ISA heterogeneity (as seen in ARM's big.LITTLE platforms), and the potential for power and area savings by replacing hardware structures with software.

1) Memory Architectures for GPGPU Computation: Current GPUs are no longer perceived as accelerators solely for graphics workloads, and now cater to a much broader spectrum of applications. In a short time, GPUs have proven to be of substantive significance in the world of general-purpose computing, playing a pivotal role in Scientific and High Performance Computing (HPC). The rise of general-purpose computing on GPUs has contributed to the introduction of on-chip cache hierarchies in those systems. Additionally, SLAM algorithms frequently reuse previously processed data, for example in bundle adjustment, loop detection, and loop closure. It has been shown that efficient memory use can improve the runtime speed of the algorithm. For instance, the Distributed Particle (DP) filter optimises memory requirements using an efficient data structure for maintaining the map [96].

Fig. 17: Speed-up of Instructions Per Cycle (IPC) with varying remote L1 access latencies.

We have carried out a workload characterisation of GPU

architectures on general-purpose workloads, to assess the

efﬁciency of their memory hierarchies [97] and proposed a

novel cache optimisation to resolve some of the memory

performance bottlenecks in GPGPU systems [98].

In our workload characterisation study (overview in Fig. 17) we saw that, in general, high level-1 (L1) data cache miss rates place high demands on the available level-2 (L2) bandwidth that is shared by the large number of cores in typical GPUs. In particular, Fig. 17 represents bandwidth as the number of Instructions Per Cycle (IPC). Furthermore, the high demand for L2 bandwidth leads to extensive congestion in the L2 access path, and in turn this leads to high memory latencies. Although GPUs are heavily multi-threaded, in memory-intensive applications the memory latency becomes exposed due to a shortage of active compute threads, reducing the ability of the multi-threaded GPU to hide memory latency (the exposed latency range in Fig. 17). Our study also quantified congestion in

the memory system, at each level of the memory hierarchy,

and characterised the implications of high latencies due to

congestion. We identiﬁed architectural parameters that play a

pivotal role in memory system congestion, and explored the

design space of architectural options to mitigate the bandwidth

bottleneck. We showed that the improvement in performance

achieved by mitigating the bandwidth bottleneck in the cache

hierarchy can exceed the speedup obtained by simply in-

creasing the on-chip DRAM bandwidth. We also showed that

addressing the bandwidth bottleneck in isolation at speciﬁc

levels can be suboptimal and can even be counter-productive.

In summary, we showed that it is imperative to resolve the

bandwidth bottleneck synergistically across all levels of the

memory hierarchy. The second part of our work in this area

aimed to reduce the pressure on the shared L2 bandwidth. One

of the key factors we have observed is that there is signiﬁcant

replication of data among private L1 caches, presenting an

opportunity to reuse data among the L1s. We have proposed

a Cooperative Caching Network (CCN), which exploits reuse

by connecting the L1 caches with a lightweight ring network

to facilitate inter-core communication of shared data. When

measured on a selection of GPGPU benchmarks, this approach

delivers a performance improvement of 14.7% for applications

that exhibit reuse.
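The intuition behind cooperative caching can be shown with a toy model. The sketch below is not the CCN protocol of [98]: it assumes unbounded private L1 caches, ignores latency and coherence, and simply counts how many requests reach the shared L2 when an L1 miss may first be served by another core's L1. The function `count_l2_requests` and the trace format are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def count_l2_requests(trace: List[Tuple[int, int]], num_cores: int,
                      cooperative: bool) -> int:
    """Count requests that reach the shared L2 for a (core, line) trace.

    With cooperative=True, an L1 miss first probes the other private L1s
    in ring order and only goes to L2 if no other L1 holds the line.
    """
    l1: List[Set[int]] = [set() for _ in range(num_cores)]
    l2_requests = 0
    for core, line in trace:
        if line in l1[core]:
            continue  # local L1 hit, no traffic
        remote_hit = cooperative and any(
            line in l1[(core + hop) % num_cores]
            for hop in range(1, num_cores)
        )
        if not remote_hit:
            l2_requests += 1  # served by the shared L2
        l1[core].add(line)    # line is now replicated in this L1
    return l2_requests
```

When several cores read the same line, as in the replication pattern described above, only the first miss reaches L2 in the cooperative case, which is the source of the reduced pressure on shared L2 bandwidth.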

2) Evaluation of single-ISA heterogeneity: We have investigated the design of heterogeneous processors sharing a common ISA. The underlying motivation for single-ISA heterogeneity is that a diverse set of cores can enable runtime flexibility. We argue that selecting a diverse set of heterogeneous cores to enable flexible operation at runtime is a non-trivial problem due to diversity in program behaviour. We further show that common evaluation methods lead to false conclusions about diversity. We suggest the Kolmogorov–Smirnov (KS) statistical test as an evaluation metric. The KS test is the first step towards a heterogeneous design methodology that optimises for runtime flexibility [99], [100].

Fig. 18: Example of a Baseline selection, and 2- and 8-Core selections for a specific benchmark application. (Axes: normalised power and normalised time.)
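For readers unfamiliar with the KS test, its statistic is the largest vertical distance between the empirical cumulative distribution functions of two samples. The sketch below computes that two-sample statistic in plain Python. How the statistic is applied to sets of cores is described in [99], [100]; the function name `ks_statistic` and the idea of feeding it per-program measurements for two candidate selections are assumptions made here for illustration.

```python
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical CDFs
    of the two samples: 0.0 for identical samples, 1.0 for disjoint ranges.
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    distance = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both CDFs past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        distance = max(distance, abs(i / len(a) - j / len(b)))
    return distance
```

Because the statistic compares whole distributions rather than means, two selections with the same average behaviour but different spreads are still distinguished.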

A major roadblock to the further development of heteroge-

neous processors is the lack of appropriate evaluation metrics.

Existing metrics can be used to evaluate individual cores,

but to evaluate a heterogeneous processor, the cores must be

considered as a collective. Without appropriate metrics, it is

impossible to establish design goals for processors, and it is

difﬁcult to accurately compare two different heterogeneous

processors. We present four new metrics to evaluate user-

oriented aspects of sets of heterogeneous cores: localized

non-uniformity, gap overhead, set overhead, and generality.

The metrics consider sets rather than individual cores. We

use examples to demonstrate each metric, and show that the

metrics can be used to quantify intuitions about heterogeneous

cores [101].

For a heterogeneous processor to be effective, it must

contain a diverse set of cores to match a range of runtime

requirements and program behaviours. Selecting a diverse set

of cores is, however, a non-trivial problem. We present a

method of core selection that chooses cores at a range of

power-performance points. For example, we see in Fig. 18 that for a normalised power budget of 1.3 (1.3 times higher than the most power-efficient alternative), the best possible normalised time using the baseline selection is 1.75 (1.75 times the fastest execution time), whereas an 8-core selection can lower this ratio to 1.4 without exceeding the normalised power budget, i.e., our method brings a 20% reduction in execution time (a 1.25x speedup). Our algorithm is based on the observation that it is not necessary for a core to consistently have high performance or low power; one type of core can fulfil different roles for different types of programs. Given a power budget, cores selected with our method provide an average speedup of 7% on EEMBC mobile benchmarks, and 23% on SPECint 2006 benchmarks, over the state-of-the-art core selection method [102].
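The Fig. 18 comparison can be reproduced as a small worked example. Only the values 1.3, 1.75 and 1.4 come from the text; the remaining (power, time) points in the usage note, and the function names, are hypothetical and stand in for the curves in the figure. Note that going from a normalised time of 1.75 to 1.4 is a 20% reduction in execution time, which is equivalent to a 1.25x speedup.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (normalised power, normalised time)


def best_time_under_budget(points: List[Point],
                           power_budget: float) -> Optional[float]:
    """Lowest normalised time among cores within the power budget."""
    feasible = [time for power, time in points if power <= power_budget]
    return min(feasible) if feasible else None


def time_reduction(old_time: float, new_time: float) -> float:
    """Fraction of execution time saved, e.g. 0.2 for 20%."""
    return 1.0 - new_time / old_time


def speedup(old_time: float, new_time: float) -> float:
    """Ratio of old to new execution time, e.g. 1.25 for 1.25x."""
    return old_time / new_time
```

With hypothetical selections `baseline = [(1.0, 3.0), (1.25, 1.75), (1.6, 1.0)]` and `eight_core = [(1.0, 3.0), (1.1, 2.2), (1.3, 1.4), (1.6, 1.0)]`, a budget of 1.3 yields times of 1.75 and 1.4 respectively, matching the figures quoted in the text.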

Fig. 19: Holistic optimisation methods explore all domains of real-time 3D scene understanding, including hardware, software, and computer vision algorithms. Two holistic works are presented here: Design Space Exploration (DSE) and Crowdsourcing.

V. HOLISTIC OPTIMISATION METHODS

In this section, we introduce holistic optimisation methods that combine developments from multiple domains, i.e. hardware, software, and algorithm, to develop efficient end-to-end solutions. The design space exploration (DSE) work presents the idea of exploring many sets of possible parameters in order to exploit them properly in different situations. The crowdsourcing work further tests the DSE idea on a massive number of devices. Fig. 19 summarises their goals and contributions.

A. Design Space Exploration

Design space exploration is the exploration of various possible design choices before running the system [103]. In scene understanding algorithms, the space of possible design choices is very large and spans from high-level algorithmic choices down to parametric choices within an algorithm. For instance, Zhang et al. [104] explore algorithmic choices and parameters for a visual-inertial algorithm on an ARM CPU, as well as on a Xilinx Kintex-7 XC7K355T FPGA. In this section, we introduce two DSE algorithms. The first one, called multi-domain DSE, explores algorithmic, compiler and hardware parameters. The second one, coined motion-aware DSE, further adds the complexity of the motion and the environment to the exploration space. The latter work is extended to develop an active SLAM algorithm.

1) Multi-domain DSE: Until now, resource-intensive scene understanding algorithms, such as KinectFusion, could only run in real-time on powerful desktop GPUs. In [105] we examine how such an algorithm can be mapped to power-constrained embedded systems and we introduce HyperMapper, a tool for multi-objective DSE. HyperMapper was demonstrated in a variety of applications ranging from computer vision and robotics to compilers [105], [106], [44], [107]. Key to our approach is the idea of incremental co-design exploration, where optimisation choices that concern the domain layer are incrementally explored together with low-level compiler and architecture choices (see Fig. 21, dashed boxes). The goal of this exploration is to reduce execution time while minimising power and meeting our quality of result objective. Fig. 20 shows an example performed with KinectFusion, in which each point corresponds to a set of parameters and is plotted against two metrics: maximum ATE and runtime. As the design space is too large to evaluate exhaustively, we use active learning based on a random forest predictor to find good designs. We show that our approach can, for the first time, achieve dense 3D mapping and tracking in the real-time range within a 1W power budget on a popular embedded device. This is a 4.8x execution time improvement and a 2.8x power reduction compared to the state-of-the-art.

Fig. 20: This plot illustrates the result of HyperMapper on the Design Space Exploration of the KinectFusion algorithmic parameters considering accuracy and frame rate metrics. We can see the result of random sampling (red) as well as the improvement of solutions after active learning (black). (Axes: runtime in seconds, 0.00 to 0.45; maximum ATE in metres, 0.035 to 0.055. Markers: accuracy limit = 0.05 m, default configuration, active learning, random sampling.)
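The multi-objective part of such an exploration can be sketched independently of the search strategy. The code below does not reproduce HyperMapper or its random-forest active learning; it only shows, under the assumption that each evaluated configuration yields a (runtime, maximum ATE) pair and that both are minimised, how the Pareto-optimal configurations and the fastest configuration within an accuracy limit might be extracted. The function names and the sample points in the usage note are illustrative.

```python
from typing import List, Optional, Tuple

Result = Tuple[float, float]  # (runtime in seconds, max ATE in metres)


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep the results that no other result beats on both metrics."""
    front: List[Result] = []
    # After sorting by runtime, a result is non-dominated only if its
    # error is strictly lower than that of every faster result.
    for runtime, ate in sorted(results):
        if not front or ate < front[-1][1]:
            front.append((runtime, ate))
    return front


def fastest_within_limit(results: List[Result],
                         ate_limit: float = 0.05) -> Optional[Result]:
    """Fastest Pareto-optimal result whose max ATE meets the limit."""
    valid = [r for r in pareto_front(results) if r[1] <= ate_limit]
    return min(valid) if valid else None
```

For example, with results `[(0.40, 0.040), (0.10, 0.052), (0.15, 0.045), (0.20, 0.047), (0.30, 0.038)]` and the 0.05 m accuracy limit of Fig. 20, the fastest admissible configuration is the one at 0.15 s, since the 0.10 s configuration violates the limit.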

2) Motion and Structure-aware DSE: In Multi-domain DSE, when tuning software and hardware parameters, we also need to take into account the structure of the environment and the motion of the camera. In the Motion and Structure-aware Design Space Exploration (MS-DSE) work [44], we determine the complexity of the structure and motion with a few parameters calculated using information theory. Depending on this complexity and the desired performance metrics, suitable parameters are explored and determined. The hypothesis of MS-DSE is that we can use a small set of parameters as a very useful proxy for a full description of the setting and motion of a SLAM application. We call these Motion and Structure (MS) parameters, and define them based on an information divergence metric. Fig. 21 demonstrates the set of all design spaces.

MS-DSE presents a comprehensive parametrisation of 3D scene understanding algorithms, and thus, based on this new parametrisation, many new concepts and applications can be developed. One of these applications, active SLAM, is outlined here. For more applications, please see [105], [106], [44], [107].
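To give a concrete sense of an information divergence between two frames, the sketch below computes the Kullback-Leibler divergence between two histograms, for instance intensity or depth histograms of consecutive frames. The exact definition of the MS parameters is given in [44]; the choice of KL divergence, the histogram inputs, the smoothing constant `eps`, and the function names are assumptions made for this illustration.

```python
import math
from typing import List, Sequence


def normalise(hist: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Turn raw bin counts into probabilities, smoothing empty bins."""
    total = sum(hist) + eps * len(hist)
    return [(count + eps) / total for count in hist]


def kl_divergence(hist_p: Sequence[float], hist_q: Sequence[float]) -> float:
    """Kullback-Leibler divergence D(P || Q) between two histograms, in nats."""
    if len(hist_p) != len(hist_q) or not hist_p:
        raise ValueError("histograms must be non-empty and of equal length")
    p, q = normalise(hist_p), normalise(hist_q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A value near zero indicates that two frames have nearly identical distributions (slow motion or simple structure), while a large value indicates a large change between frames.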

a) Active SLAM: Active SLAM is the method for choosing the optimal camera trajectory, in order to maximise the accuracy of the camera pose estimation, the accuracy of the reconstruction, or the coverage of the environment. In [44], it is shown that MS-DSE can be utilised not only to optimise fixed system parameters, but also to guide a robotic platform to maintain good localisation and mapping performance. As shown in Fig. 21, a Pareto front holds all optimal parameters. The front has been prepared in advance by exploring the set of all parameters. When the system is operating, optimal parameters are chosen given the desired performance metrics. Then these parameters are used to initialise the system. Using MS parameters, the objective is to avoid motions that cause very high statistical divergence between two consecutive frames. This way, we can provide a robust SLAM algorithm by allowing the tracking to work all the time. Fig. 22 compares active SLAM with a random walk algorithm. The experiments were done in four different environments. In each environment, each algorithm was run 10 times. Repeated experiments serve as a measure of the robustness of the algorithm in dealing with uncertainties arising from minor changes in illumination, or inaccuracies in the response of the controller or actuator to the commands. The consistency of the generated map was evaluated manually as either a success or a failure of SLAM. If duplicates of one object were present in the map, it was considered a failure. This experiment shows a success rate of more than 50% in SLAM when employing the proposed active SLAM algorithm [44], an improvement in the robustness of SLAM algorithms achieved by relying on design space exploration.

Fig. 21: Motion and structure aware active SLAM design space exploration using HyperMapper. (Design spaces: hardware, e.g. clock frequency; compiler, e.g. numerical precision; SLAM algorithm, e.g. weights; motion and structure, e.g. divergence. Metric 1: execution time; metric 2: trajectory error. Other labels: Pareto front, desired metrics, parameters, navigation.)

Fig. 22: Success vs. failure rate when mapping the same environment with different motion planning algorithms: active SLAM and random walk. (Environments: Window, Table, Wall, Carpet; each with active and random experiments.)

3) Comparative DSE of Dense vs Semi-dense SLAM: Another direction in DSE work is performance exploration across multiple algorithms. While Multi-domain DSE explores different parameters of a given algorithm, the comparative DSE, presented in [108], explores the performance of two different algorithms under different parametric choices.

In comparative DSE, two state-of-the-art SLAM algorithms, KinectFusion and LSD-SLAM, are compared on multiple datasets. Using SLAMBench benchmarking capabilities, a full design space exploration is performed over algorithmic parameters, compilation flags and multiple architectures. Such thorough parameter space exploration gives us key insights into the behaviour of each algorithm in different operating conditions and the relationship between different sets of distinct, yet correlated, parameter blocks.

Fig. 23: Distribution of Absolute Trajectory Error (ATE) using KinectFusion and LSD-SLAM, run with default parameters on Desktop. The mean absolute error has been highlighted. (a) Real scene: TUM RGB-D fr2 xyz. (b) Synthetic scene: ICL-NUIM lr kt2. (Axes: Absolute Trajectory Error in cm, 0 to 18; % of time recorded.)

As an example, in Fig. 23 we show the result of comparative DSE between LSD-SLAM and KinectFusion in terms of their ATE distribution across two scenes of two different datasets. The histograms display the error distribution across the entire sequence, from which we can get a sense of how well the algorithms are performing for the whole trajectory. We hope that these analyses enable researchers to develop more robust algorithms. Without the holistic approach enabled by SLAMBench, such insights would have been much harder to obtain. This sort of information is invaluable for a wide range of SLAM practitioners, from VR/AR designers to roboticists, who want to select or modify the best algorithm for their particular use case.
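A histogram of the kind shown in Fig. 23 can be derived from per-frame errors as sketched below. This is a simplified illustration rather than SLAMBench code: it assumes the estimated and ground-truth trajectories are already time-associated and expressed in the same reference frame (ATE evaluation normally includes an alignment step, which is omitted here), and the function names and the 2 cm bin width are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Position = Tuple[float, float, float]


def ate_per_frame(estimated: Sequence[Position],
                  ground_truth: Sequence[Position]) -> List[float]:
    """Euclidean distance between estimated and true position per frame."""
    if len(estimated) != len(ground_truth):
        raise ValueError("trajectories must have the same number of frames")
    return [
        math.sqrt(sum((e - g) ** 2 for e, g in zip(est, gt)))
        for est, gt in zip(estimated, ground_truth)
    ]


def ate_summary(errors: Sequence[float]) -> Dict[str, float]:
    """Mean and maximum error over the whole sequence."""
    return {"mean": sum(errors) / len(errors), "max": max(errors)}


def ate_histogram(errors_cm: Sequence[float],
                  bin_width_cm: float = 2.0) -> Dict[float, float]:
    """Percentage of frames per error bin, keyed by the bin's lower edge."""
    counts = Counter(int(err // bin_width_cm) for err in errors_cm)
    total = len(errors_cm)
    return {
        index * bin_width_cm: 100.0 * count / total
        for index, count in sorted(counts.items())
    }
```

Looking at the full distribution rather than only the mean or maximum shows whether an algorithm is consistently accurate or accurate on average with occasional large excursions.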

B. Crowdsourcing

The SLAMBench framework, and more specifically its various KinectFusion implementations, has been ported to Android. More than 2000 downloads have been made since its official release on the Google Play store. We received numerous positive feedback reports, and this application has generated a great deal of interest in the community and with industrial partners.

This level of uptake allowed us to collect data from more than 100 different mobile phones. Fig. 24 shows the speed-up across many models of Android devices that we have experimented with. Clearly, it is possible to more than double the runtime speed by tuning the system parameters using the tools introduced in the paper. We plan to use these data to analyse the performance of KinectFusion on those platforms, and to provide techniques to optimise KinectFusion performance depending on the targeted platform. This work will apply a transfer-learning methodology. We believe that by combining design space exploration [106] and the collected data, we can train a decision machine to select code variants and configurations for diverse mobile platforms automatically.

VI. CONCLUSION

In this paper we focused on SLAM, which is an enabling technology in many fields including virtual reality, augmented reality, and robotics. The paper presented several contributions across the hardware architecture, compiler and runtime software systems, and computer vision algorithmic layers of the SLAM pipeline. We proposed not only contributions at each layer, but also holistic methods that optimise the system as a whole.

Fig. 24: By combining design space exploration and crowdsourcing, we checked that design space exploration works efficiently on various types of platforms. This figure demonstrates the speed-up of the KinectFusion algorithm on various types of Android devices. Each bar represents the speed-up for one type (model) of Android device. The models are not shown for the sake of clarity of the figure. (Axes: speed-up, 0 to 14; Android devices.)

In computer vision and applications, we presented bench-

marking tools that allow us to select a proper dataset and use it

to evaluate different SLAM algorithms. SLAMBench is used

to evaluate the KinectFusion algorithm on various different

hardware platforms. SLAMBench2 is used to compare various

SLAM algorithms very efﬁciently. We also extended the

KinectFusion algorithm, such that it can be used in robotic

path planning and navigation algorithms by mapping both

occupied and free space of the environment. Moreover, we

explored new sensing technologies such as focal-plane sensor-

processor arrays, which have low power consumption and high

effective frame rate.

The software layer of this project demonstrated that soft-

ware optimisation can be used to deliver signiﬁcant improve-

ments in power consumption and speed trade-off when spe-

cialised for computer vision applications. We explored static,

dynamic, and hybrid approaches and focused their application

on the KinectFusion algorithm. Being able to select and deploy optimisations adaptively is particularly beneficial in the context of a dynamic runtime environment, where application-specific details can strongly improve the result of JIT compilation and thus the speed of the program.

The project has made a range of contributions across the

hardware design and development ﬁeld. Proﬁling tools have

been developed in order to locate and evaluate performance

bottlenecks in both native and managed applications. These

bottlenecks could then be addressed by a range of special-

isation techniques, and the specialised hardware evaluated

using the presented simulation techniques. This represents a

full workﬂow for creating new hardware for computer vision

applications which might be used in future platforms.

Finally, we report on holistic methods that exploit our ability to explore the design space at every level. We demonstrated several design space exploration methods, showing that it is possible to fine-tune the system to meet desired performance metrics. We also showed that crowdsourcing can increase public engagement in accelerating design space exploration.

In future work, two main directions will be followed: The

ﬁrst is exploiting our knowledge from all domains of this

paper to select a SLAM algorithm and design a chip that

is customised to efﬁciently implement the algorithm. This

approach will utilise data from SLAMBench2 and real-world

experiments to drive the design of a specialised vision proces-

sor. The second direction is utilising the tools and techniques

presented here to develop a standardised method that takes the

high-level scene understanding functionalities and develops the

optimal code that maps the functionalities to the heterogeneous

resources available, optimising for the desired performance

metrics.

VII. ACKNOWLEDGEMENTS

This research is supported by the Engineering and Physical Sciences Research Council (EPSRC), grant reference EP/K008730/1, PAMELA project.


A comparative study for the assessment of marker-less mixed reality applications for the operator training

Article

Feb 2024

TDO-SLAM: Traffic Sign and Dynamic Object based Visual SLAM

Article

Full-text available

Jan 2024

This paper introduces a real-time visual SLAM system, TDO-SLAM, using only a stereo vision camera. TDO-SLAM works not only in static but also in dynamic road environment by incorporating the object motion and the planar property of standing traffic signs. Traditional visual SLAM systems assume that the road environment is static. However, a variety of dynamic objects exist in the real-world urban environment. Thus, the traditional SLAM systems are subject to fail due to the various motion of the dynamic objects. To solve this inherent problem in the dynamic environment, TDO-SLAM detects, tracks, and manages the global object identification of dynamic objects and standing traffic signs through a novel Object-Level-Tracking method. We improve the accuracy of camera pose estimation through several steps of bundle adjustments, including the residual terms for the planar constraint of traffic signs and the dynamic object motion. Experimental results show that pose estimation accuracy is improved in complex environment with several dynamic objects and traffic signs. Performance of TDO-SLAM is analyzed and compared with ORB-SLAM2, ORB-SLAM3, and DynaSLAM using three benchmark datasets, KITTI Odometry dataset, KITTI Raw dataset, and Complex Urban dataset.

Pervasive Augmented Reality to support logistics operators in industrial scenarios: a shop floor user study on kit assembly

Article

Full-text available

May 2023
INT J ADV MANUF TECH

Augmented Reality (AR) is a pillar of the transition to Industry 4.0 and smart manufacturing. It can facilitate training, maintenance, assembly, quality control, remote collaboration and other tasks. AR has the potential to revolutionize the way information is accessed, used and exchanged, extending user's perception and improving their performance. This work proposes a Pervasive AR tool, created with partners from the industry sector, to support the training of logistics operators on industrial shop floors. A Human-Centered Design (HCD) methodology was used to identify operators difficulties, challenges, and define requirements. After initial meetings with stakeholders, two distinct methods were considered to configure and visualize AR content on the shop floor: Head-Mounted Display (HMD) and Handheld Device (HHD). A first (preliminary) user study with 26 participants was conducted to collect qualitative data regarding the use of AR in logistics, from individuals with different levels of expertise. The feedback obtained was used to improve the proposed AR application. A second user study was realized, in which 10 participants used different conditions to fulfill distinct logistics tasks: C1-paper; C2-HMD; C3-HHD. Results emphasize the potential of Pervasive AR in the operators' workspace, in particular for training of operators not familiar with the tasks. Condition C2 was preferred by all participants and considered more useful and efficient in supporting the operators activities on the shop floor.

Localization Coverage Analysis of THz Communication Systems with a 3D Array

Conference Paper

Full-text available

Dec 2022

Advances in Visual Simultaneous Localisation and Mapping Techniques for Autonomous Vehicles: A Review

Article

Full-text available

Nov 2022
SENSORS-BASEL

The recent advancements in Information and Communication Technology (ICT) as well as increasing demand for vehicular safety has led to significant progressions in Autonomous Vehicle (AV) technology. Perception and Localisation are major operations that determine the success of AV development and usage. Therefore, significant research has been carried out to provide AVs with the capabilities to not only sense and understand their surroundings efficiently, but also provide detailed information of the environment in the form of 3D maps. Visual Simultaneous Localisation and Mapping (V-SLAM) has been utilised to enable a vehicle understand its surroundings, map the environment, and identify its position within the area. This paper presents a detailed review of V-SLAM techniques implemented for AV perception and localisation. An overview of SLAM techniques is presented. In addition, an in-depth review is conducted to highlight various V-SLAM schemes, their strengths, and limitations. Challenges associated with V-SLAM deployment and future research directions are also provided in this paper.

PAL-SLAM2: Visual and visual–inertial monocular SLAM for panoramic annular lens

Article

May 2024
ISPRS J PHOTOGRAMM

Architectural Design Model Guided On-Demand Power Management of Energy-Efficient GPGPU for SLAM

Article

Feb 2023

Simultaneously localization and mapping (SLAM) is a core component in many embedded domains, e.g., robots, augmented and virtual reality. Due to SLAM’s high demand on computation resources, general-purpose graphic processing units (GPGPUs) are often used as its processing engine. Meanwhile, embedded systems usually have strict power constraint. Thus, how to deliver required performance for SLAM, yet still meet the power limit, is a great challenge faced by GPGPU designer. In this work, we discover the general principles of designing energy-efficient GPGPU for SLAM as “many SMs, enough SPs and registers, small caches”, by analyzing the implication of individual design parameters on both performance and power. Then, we conduct large-scale design space exploration and fit the Pareto frontier with a two-term exponential model. Further, we construct gradient boosting decision tree (GBDT)-based design models to predict the performance and power given the design parameters. The evaluation shows that our GBDT-based models can achieve [Formula: see text]3% mean average percentage error, which significantly outperform other machine learning models. With these models, a kernel’s requirement on hardware resources can be well understood. Based on such knowledge, we introduce design model guided power management strategies, including power gating and dynamic frequency and voltage scaling (DFVS). Overall, by combining these two power management strategies, we can improve the energy delay product by 36%.

UV Disinfection Robots: A Review

Article

Dec 2022
ROBOT AUTON SYST

The novel coronavirus (COVID-19) pandemic has completely changed our lives and how we interact with the world. The pandemic has brought about a pressing need to have effective disinfection practices that can be incorporated into daily life. They are needed to limit the spread of infections through surfaces and air, particularly in public settings. Most of the current methods utilize chemical disinfectants, which can be laborious and time-consuming. Ultraviolet (UV) irradiation is a proven and powerful means of disinfection. There has been a rising interest in the implementation of UV disinfection robots by various public institutions, such as hospitals, long-term care homes, airports, and shopping malls. The use of UV-based disinfection robots could make the disinfection process faster and more efficient. The objective of this review is to equip readers with the necessary background on UV disinfection and provide relevant discussion on various aspects of UV robots.

Hardware implementation of SLAM algorithms: a survey on implementation approaches and platforms

Article

Nov 2022
ARTIF INTELL REV

Simultaneous localization and mapping (SLAM) is an active research topic in machine vision and robotics. It has various applications in many different fields such as mobile robots, augmented and virtual reality, medical imaging, image-guided surgery systems, and unmanned aerial vehicles (UAVs). The computational complexity of SLAM algorithms is very high. Therefore, in many applications, it is necessary to implement them in real-time on platforms with low power consumption and small sizes. This paper reviews the implementation and the performance of SLAM algorithms on various platforms. Although there are various review studies on SLAM algorithms, the studies assessing the hardware implementation of these algorithms are very limited. This study attempts to fill this gap. It is shown that using hardware–software (HW/SW) co-design approaches rather than pure software (SW) or hardware (HW) approaches is currently the primary option for implementing SLAM algorithms on hardware platforms. A combination of a hardware accelerator and a software approach increases the speed of the implementation as well as the performance and the speed of the algorithm. Also, dividing different parts of the algorithm between hardware and software according to the structure and the nature of the algorithm in the HW/SW co-design approaches reduces the resource consumption and the cost. Furthermore, the design of hardware-compatible algorithms is one of the most critical gaps in the implementation of SLAM algorithms on hardware platforms.

Reconfigurable System-on-Chip Architectures for Robust Visual SLAM on Humanoid Robots

Article

Nov 2022

Visual Simultaneous Localization and Mapping (vSLAM) is the method of employing an optical sensor to map the robot’s observable surroundings while also identifying the robot’s pose in relation to that map. The accuracy and speed of vSLAM calculations can have a very significant impact on the performance and effectiveness of subsequent tasks that need to be executed by the robot, making it a key building component for current robotic designs. The application of vSLAM in the area of humanoid robotics is particularly difficult due to the robot’s unsteady locomotion. This paper introduces a pose graph optimization module based on RGB (ORB) features, as an extension of the KinectFusion pipeline (a well-known vSLAM algorithm), to assist in recovering the robot’s stance during unstable gait patterns when the KinectFusion tracking system fails. We develop and test a wide range of embedded MPSoC FPGA designs, and we investigate numerous architectural improvements, both precise and approximate, to study their impact on performance and accuracy. Extensive design space exploration reveals that properly designed approximations, which exploit domain knowledge and efficient management of CPU and FPGA fabric resources, enable real-time vSLAM at more than 30 fps in humanoid robots with high energy-efficiency and without compromising robot tracking and map construction. This is the first FPGA design to achieve robust, real-time dense SLAM operation targeting specifically humanoid robots. An open source release of our implementations and data can be found in [1].

Characterizing Visual Localization and Mapping Datasets

Conference Paper

May 2019

On the Future of Research VMs: A Hardware/Software Perspective

Conference Paper

Apr 2018

In recent years, we have witnessed an explosion in the use of Virtual Machines (VMs), which are currently found in desktops, smartphones, and cloud deployments. These recent developments create new research opportunities in the VM domain extending from performance to energy efficiency, and scalability studies. Research into these directions necessitates research frameworks for VMs that provide full coverage of the execution domains and hardware platforms. Unfortunately, the state of the art on Research VMs does not live up to such expectations and lags behind industrial-strength software, making it hard for the research community to provide valuable insights. This paper presents our work in attempting to tackle those shortcomings by introducing Beehive, our vision towards a modular and seamlessly extensible ecosystem for research on virtual machines. Beehive unifies a number of existing state-of-the-art tools and components with novel ones, providing a complete platform for hardware/software co-design of Virtual Machines.

Visual Odometry for Pixel Processor Arrays

Conference Paper

Oct 2017

ACME: adaptive compilation made efficient

Conference Paper

Jun 2005

SLAMBench2: Multi-Objective Head-to-Head Benchmarking for Visual SLAM

Preprint

Aug 2018

SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is a problem since different SLAM applications can have different functional and non-functional requirements. For example, a mobile phone-based AR application has a tight energy budget, while a UAV navigation system usually requires high accuracy. SLAMBench2 is a benchmarking framework to evaluate existing and future SLAM systems, both open and closed source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. A wide variety of existing SLAM algorithms and datasets is supported, e.g. ElasticFusion, InfiniTAM, ORB-SLAM2, OKVIS, and integrating new ones is straightforward and clearly specified by the framework. SLAMBench2 is a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs across SLAM systems.

Towards practical heterogeneous virtual machines

Conference Paper

Apr 2018

Heterogeneous computing has emerged as a means to achieve high performance and energy efficiency. Naturally, this trend has been accompanied by changes in software development norms that do not necessarily favor programmers. A prime example is the two most popular heterogeneous programming languages, CUDA and OpenCL, which expose several low-level features to the API making them difficult to use by non-expert users. Instead of using low-level programming languages, developers tend to prefer more high-level, object-oriented languages typically executed on managed runtime environments. Although many programmers might expect that such languages would have already been adapted for execution on heterogeneous hardware, the reality is that their support is either very limited or totally absent. This paper highlights the main reasons and complexities of enabling heterogeneous managed runtime systems and proposes a number of directions to address those challenges.

Spatial: a language and compiler for application accelerators

Conference Paper

Jun 2018

Industry is increasingly turning to reconfigurable architectures like FPGAs and CGRAs for improved performance and energy efficiency. Unfortunately, adoption of these architectures has been limited by their programming models. HDLs lack abstractions for productivity and are difficult to target from higher level languages. HLS tools are more productive, but offer an ad-hoc mix of software and hardware abstractions which make performance optimizations difficult. In this work, we describe a new domain-specific language and compiler called Spatial for higher level descriptions of application accelerators. We describe Spatial's hardware-centric abstractions for both programmer productivity and design performance, and summarize the compiler passes required to support these abstractions, including pipeline scheduling, automatic memory banking, and automated design tuning driven by active machine learning. We demonstrate the language's ability to target FPGAs and CGRAs from common source code. We show that applications written in Spatial are, on average, 42% shorter and achieve a mean speedup of 2.9x over SDAccel HLS when targeting a Xilinx UltraScale+ VU9P FPGA on an Amazon EC2 F1 instance.

A Benchmark Comparison of Monocular Visual-Inertial Odometry Algorithms for Flying Robots

Conference Paper

May 2018

Flying robots require a combination of accuracy and low latency in their state estimation in order to achieve stable and robust flight. However, due to the power and payload constraints of aerial platforms, state estimation algorithms must provide these qualities under the computational constraints of embedded hardware. Cameras and inertial measurement units (IMUs) satisfy these power and payload constraints, so visual-inertial odometry (VIO) algorithms are popular choices for state estimation in these scenarios, in addition to their ability to operate without external localization from motion capture or global positioning systems. It is not clear from existing results in the literature, however, which VIO algorithms perform well under the accuracy, latency, and computational constraints of a flying robot with onboard state estimation. This paper evaluates an array of publicly-available VIO pipelines (MSCKF, OKVIS, ROVIO, VINS-Mono, SVO+MSF, and SVO+GTSAM) on different hardware configurations, including several single-board computer systems that are typically found on flying robots. The evaluation considers the pose estimation accuracy, per-frame processing time, and CPU and memory load while processing the EuRoC datasets, which contain six degree of freedom (6DoF) trajectories typical of flying robots. We present our complete results as a benchmark for the research community. Narrated video presentation: https://youtu.be/ymI3FmwU9AY

Automatic Matching of Legacy Code to Heterogeneous APIs: An Idiomatic Approach

Conference Paper

Mar 2018

Heterogeneous accelerators often disappoint. They provide the prospect of great performance, but only deliver it when using vendor specific optimized libraries or domain specific languages. This requires considerable legacy code modifications, hindering the adoption of heterogeneous computing. This paper develops a novel approach to automatically detect opportunities for accelerator exploitation. We focus on calculations that are well supported by established APIs: sparse and dense linear algebra, stencil codes and generalized reductions and histograms. We call them idioms and use a custom constraint-based Idiom Description Language (IDL) to discover them within user code. Detected idioms are then mapped to BLAS libraries, cuSPARSE and clSPARSE and two DSLs: Halide and Lift. We implemented the approach in LLVM and evaluated it on the NAS and Parboil sequential C/C++ benchmarks, where we detect 60 idiom instances. In those cases where idioms are a significant part of the sequential execution time, we generate code that achieves 1.26x to over 20x speedup on integrated and external GPUs.

Efficient Octree-Based Volumetric SLAM Supporting Signed-Distance and Occupancy Mapping

Article

Jan 2018

We present a dense volumetric simultaneous localisation and mapping (SLAM) framework that uses an octree representation for efficient fusion and rendering of either a truncated signed distance field (TSDF) or an occupancy map. The primary aim of this letter is to use one single representation of the environment that can be used not only for robot pose tracking and high-resolution mapping, but seamlessly for planning. We show that our highly efficient octree representation of space fits SLAM and planning purposes in a real-time control loop. In a comprehensive evaluation, we demonstrate dense SLAM accuracy and runtime performance on par with flat hashing approaches when using TSDF-based maps, and considerable speed-ups when using occupancy mapping compared to standard occupancy maps frameworks. Our SLAM system can run at 10–40 Hz on a modern quadcore CPU, without the need for massive parallelization on a GPU. We, furthermore, demonstrate a probabilistic occupancy mapping as an alternative to TSDF mapping in dense SLAM and show its direct applicability to online motion planning, using the example of informed rapidly-exploring random trees (RRT$^*$).
