AirLoc: Object-based Indoor Relocalization
Aryan1, Bowen Li2, Sebastian Scherer2, Yun-Jou Lin3, and Chen Wang4
Abstract: Indoor relocalization is vital for both robotic tasks like autonomous exploration and civil applications such as navigation with a cell phone in a shopping mall. Some previous approaches adopt geometrical information such as key-point features or local textures to carry out indoor relocalization, but they either easily fail in an environment with visually similar scenes or require many database images. Inspired by the fact that humans often remember places by recognizing unique landmarks, we resort to objects, which are more informative than geometry elements. In this work, we propose a simple yet effective object-based indoor relocalization approach, dubbed AirLoc. To overcome the critical challenges of object re-identification and remembering object relationships, we extract object-wise appearance embeddings and inter-object geometric relationships. The geometry and appearance features are integrated to generate cumulative scene features. This results in a robust, accurate, and portable indoor relocalization system, which outperforms the state-of-the-art methods in room-level relocalization by 9.5% in PR-AUC and 7% in accuracy. In addition to exhaustive evaluation, we also carry out real-world tests, where AirLoc shows robustness to challenges like severe occlusion, perceptual aliasing, viewpoint shift, and deformation.
Index Terms: Indoor Relocalization, Object Graph
I. INTRODUCTION
Indoor relocalization has gained increasing attention with the development of numerous mobile phone and robotic applications such as virtual reality (VR) [1], augmented reality (AR) [2], and robot navigation [3]. For example, it can be employed in large buildings such as shopping malls and offices, where one can use a cell phone for self-relocalization when lost. Additionally, many existing mobile robot localization techniques, such as visual odometry [4] and simultaneous localization and mapping (SLAM) [5], require indoor relocalization to correct accumulated drift.
Many algorithms [6] focus on providing accurate pose estimation; however, exact camera poses are often not required by civil applications such as indoor navigation. For instance, a lost patient in a hospital just wants to figure out which room they are in, rather than a precise centimeter-level location. Besides, we expect to re-recognize a place with only a few image samples (database), making the system commercially viable. If a method requires a large database, generalizing the system and creating such a database for multiple places is impractical due to memory and latency issues.
1 The Department of Electronics and Communication, Delhi Technological University, Delhi, India aryanmangal2022@gmail.com
2 The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA {bowenli2, basti}@andrew.cmu.edu
3 OPPO Palo Alto, California, USA. rose.lin@oppo.com
4 The Department of Computer Science and Engineering, State University of New York at Buffalo, NY 14260, USA. chenwang@dr.com
Fig. 1: The pipeline of AirLoc for object-based indoor relocalization. AirLoc provides room-level relocalization by constructing an object graph from a single query image and comparing it with the database, which can be established with only K (K = 1, 2, 5, 10) images per room.
In recent years, indoor relocalization methods have focused on geometric textures with key-point features [7], [8] or semantic information [9]. However, they are often not scalable for two major reasons. First, they require either a 3D scene model [10] or a large number of database images [11], which are not readily accessible in most real-world indoor scenes. Second, these methods do not work well in challenging scenarios like occlusion, lighting changes, and interference from dynamic objects such as humans. This is because they rely heavily on local texture matching, which often produces false matches under illumination changes or in visually similar scenes. Image-based methods, such as NetVLAD [11] and PatchNetVLAD [12], also produce false matches because they rely on the collective features of an image rather than understanding the individual identities depicted in the image. It remains questionable whether these challenges can be resolved with a limited number of database images available. Therefore, in this paper, we resort to higher-level information such as objects' appearance and relative geometry to tackle the problem of indoor relocalization.
Researchers have shown increasing interest in object encoding and re-identification tasks [13], [14]. The strong representations extracted from objects can be utilized for re-identification with remarkable efficacy. Inspired by this, we propose AirLoc, an object-based indoor relocalization approach shown in Fig. 1, which fully utilizes appearance and geometry relations. We show that room-level relocalization for a single
query image can be effectively achieved given a database of rooms. Furthermore, since the model is usually expected to quickly generalize to new environments, where a large number of database images cannot be quickly obtained, we take only a few images (K = 1, 2, 3, 5, 10) from every room to construct the database. AirLoc outperforms various baselines and achieves a speed of about 20 ms per frame, making it affordable for low-power mobile robots or cellphones and demonstrating its effectiveness and robustness. In summary, the main contributions of this paper are:
• We introduce a simple yet effective indoor relocalization framework, named AirLoc, that relies on object-level information to overcome the limitations of local-feature- or image-based approaches.
• We propose two modules to extract appearance- and geometry-related features, respectively, which are then combined to perform room-level relocalization.
• We perform an exhaustive experimental evaluation on the newly rendered Reloc110 dataset, which contains 306K images and 113 rooms. AirLoc robustly outperforms the state-of-the-art methods, obtaining improvements of 9.5% in PR-AUC and 7% in accuracy.
• We conduct real-world tests to validate the robustness of AirLoc to illumination change, occlusion, and viewpoint shift. We release the source code at https://github.com/sair-lab/AirLoc to benefit the robotics community.
II. RELATED WORK
We first review the related datasets for indoor relocalization. Then, methods based on key-point features [8], [11], [15] and objects [13], [14] are presented, respectively.
A. Datasets for Indoor Relocalization
Many datasets have been collected for semantic scene understanding. The Places365-Standard dataset [16] is built for visual understanding tasks like scene context, action and event prediction, and object recognition. It contains 1.8 million training images from 365 scene categories. The ADE20k [17] dataset contains images exhaustively annotated with objects and object parts, with additional occlusion information. The MIT Indoor Scenes database [18] contains 67 indoor categories and 15,620 images, but the distribution of images varies per category. A recently introduced indoor RGB-D dataset, RIO10 [19], captures changing indoor environments and contains 74 sequences split into training, validation, and testing sets.
Datasets used for object-based scene understanding tasks, such as real-world indoor relocalization, should include room labels. Properties such as ground-truth segmentation and images from varying viewpoints are also important for finer-grained learning. Existing datasets miss at least one of the above characteristics, which motivates us to construct a new dataset with such labels and specifications.
B. Key-point and Image Feature-based Methods
Handcrafted key-point features such as SIFT [20] and SURF [8] have been widely applied in conventional methods for image retrieval, loop closure detection, and visual place recognition (VPR). The binary descriptor ORB [21] was utilized in DBoW2 [22] for image retrieval using a visual vocabulary of features. However, these handcrafted local features are not discriminative in more complex and cluttered environments, where the conventional methods easily fail.
Compared to handcrafted features, approaches using deep-learned features have proved more robust [23]. SuperPoint [24], a recently proposed deep learning method, uses self-supervised learning to train interest point detectors and descriptors. Expanding upon SuperPoint, SuperGlue [25] introduced a graph neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. For tasks such as feature matching and place recognition, both SuperPoint and SuperGlue have received widespread adoption [26].
Some image retrieval methods [12] directly extract CNN-based image features. [27] produces a global image representation by aggregating CNN activation features. NetVLAD [11] uses a generalized end-to-end deep-learning-based Vector of Locally Aggregated Descriptors (VLAD) [28] layer. However, one of the main challenges faced by NetVLAD and similar methods is the limited availability of training data, which can adversely affect performance. To overcome this issue, spatial/depth data have been incorporated [29], and input modalities such as RGB-D images and point clouds have been explored.
These descriptors are capable of producing distinguishable descriptions but struggle in visually similar environments. In such conditions, different scenes can have similar local textures, which results in similar descriptions and ultimately leads to matching failures.
C. Object Semantic Features and Their Applications
Object-based semantic features are more robust and informative, and have been widely used in robotics applications such as SLAM. The pioneering work of SLAM++ [30] performs object-level SLAM using a depth camera. [31] develops a quadratic-programming-based semantic object initialization scheme to achieve high-accuracy object-level data association and real-time semantic mapping. [32] integrates object detection and localization modules to obtain semantic maps of the environment and improve localization. X-View [33] globally localizes aerial-to-ground and ground-to-ground robot data captured from drastically different viewpoints using object graph descriptors based on random walks.
Recently, AirCode [13] proposed a feature-sparse and object-dense encoding method that is robust to viewpoint changes, scaling, occlusion, and even object deformation. Building upon that, AirObject [14] introduced a temporal convolutional network over structural information from multiple frames, obtained from a graph-attention-based encoder, to perform temporal 3D object encoding. However, using these object descriptors for relocalization remains an open question. Taking motivation from the above examples, we use object encoders, such as AirCode, to extract object embeddings for relocalization.
Fig. 2: The proposed object matching framework uses a geometry module and an appearance module to match query images with database objects for indoor relocalization.
III. PROPOSED APPROACH
We propose AirLoc, a new architecture shown in Fig. 2. It consists of two parts, namely a geometry module and an appearance module. In this section, we first present the individual modules and then explain their ensembling. Finally, we present the loss function for the geometry module.
A. Appearance Module
The appearance module encodes objects' visual characteristics. Typically, each room in our database has K (K = 1, 2, 5, 10) images, and a query consists of 1 to 2 images. Objects are first encoded into feature vectors, and if an object appears in more than one image, we take the arithmetic mean of its embeddings. We then construct a database consisting of room-wise object embeddings for relocalization.
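For concreteness, a minimal sketch of this room-wise database construction is given below. It is an illustration only, not the released AirLoc code; the object-id keys, the dictionary layout, and the assumption that embeddings are NumPy arrays are ours.

```python
import numpy as np

def build_room_database(per_image_objects):
    """Average object embeddings that appear in several database images.

    per_image_objects: list (one entry per database image) of dicts
        mapping an object id to its embedding vector (np.ndarray).
    Returns a dict mapping object id -> mean embedding for the room.
    """
    accumulator, counts = {}, {}
    for image_objects in per_image_objects:
        for obj_id, embedding in image_objects.items():
            accumulator[obj_id] = accumulator.get(obj_id, 0.0) + embedding
            counts[obj_id] = counts.get(obj_id, 0) + 1
    # Arithmetic mean over the images in which each object appears.
    return {obj_id: accumulator[obj_id] / counts[obj_id] for obj_id in accumulator}

# Usage (hypothetical): database[room_name] = build_room_database(k_image_object_dicts)
```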
1) Object Encoders: Instead of using masks or rectangular patches of objects, we extract their features using a group of key-points on the object, which can be more distinctive. Based on previous research [34], we believe these key-points can provide robust object re-identification and can thus be used for embedding. Specifically, we use SuperPoint [24] to extract feature points, where the position of each point is denoted as $h_i = (x_i, y_i)$, $i \in [1, N]$, and the associated descriptor as $d_i \in \mathbb{R}^{D_p}$, where $D_p$ is the descriptor dimension. We then group the points into objects using instance segmentation masks, which can be obtained from commonly used networks like Mask R-CNN [35] or an open-world object detector [36].
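The grouping step can be illustrated roughly as follows, assuming the keypoints, descriptors, and an integer instance-id mask are already available; this is a sketch of the idea rather than the exact AirLoc implementation.

```python
import numpy as np

def group_keypoints_by_instance(keypoints, descriptors, instance_mask):
    """Group SuperPoint-style keypoints into per-object descriptor sets.

    keypoints:     (N, 2) array of (x, y) pixel coordinates.
    descriptors:   (N, D_p) array of the associated descriptors.
    instance_mask: (H, W) integer array; 0 = background, >0 = object id.
    Returns a dict: object id -> (M_i, D_p) descriptor array.
    """
    groups = {}
    for (x, y), desc in zip(keypoints.astype(int), descriptors):
        obj_id = int(instance_mask[y, x])  # look up the instance label under the keypoint
        if obj_id == 0:
            continue  # skip background points
        groups.setdefault(obj_id, []).append(desc)
    return {k: np.stack(v) for k, v in groups.items()}
```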
Given the grouped points, we next aggregate the individual features to form a collective object encoding. One of the most intuitive solutions is to use graph-based networks such as GCN [37] and GAT [38] for feature aggregation, where each feature point is taken as a node. However, we found that graph networks perform well when training and testing data come from the same distribution but easily overfit and generalize poorly to unseen environments. In contrast, image-based feature aggregation methods show better generalization ability for this task. For efficiency, we adopt the widely used image retrieval framework NetVLAD [11] and modify it to fit our feature-point-based representation, as shown in Fig. 3. In our experiments, we found that this framework generalizes to the new dataset, Reloc110, even though our model is only trained on COCO [39] and YT-VIS [40],
indicating its robustness to environmental changes.

Fig. 3: An object encoder is used in the appearance module to generate object descriptors for objects in a room using images and semantic labels.
Given $N$ descriptors $d_i$ $(i = 1, \cdots, N)$, the object encoding $O$ can be represented as a $C \times D_p$ dimensional vector:

$$O(c) = \phi\left(\sum_{i=1}^{N} a_c(d_i)\,(d_i - x_c)\right), \quad (1)$$

where $O(c) \in \mathbb{R}^{D_p}$ is the $c$-th row of $O$, $x_c$ is the $c$-th cluster center ($c = 1, \cdots, C$, with $C$ predefined), $a_c(\cdot)$ is the learnable soft assignment of descriptor $d_i$ to cluster $x_c$, and $\phi$ is a composed normalization, i.e., an intra-normalization that makes the model scale insensitive, followed by an L2-normalization after the rows are horizontally stacked into a vector.
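A NumPy sketch of the aggregation in Eq. (1) is shown below. It assumes the soft assignments $a_c(d_i)$ are given (in the learned model they come from a trainable NetVLAD layer), so it only illustrates the residual aggregation and the two normalizations; it is not the trained module itself.

```python
import numpy as np

def vlad_aggregate(descriptors, cluster_centers, soft_assign):
    """Aggregate N keypoint descriptors into a (C * D_p) object encoding, as in Eq. (1).

    descriptors:     (N, D_p) keypoint descriptors d_i.
    cluster_centers: (C, D_p) cluster centers x_c.
    soft_assign:     (N, C) soft assignments a_c(d_i), rows summing to 1.
    """
    residuals = descriptors[:, None, :] - cluster_centers[None, :, :]   # (N, C, D_p)
    vlad = (soft_assign[:, :, None] * residuals).sum(axis=0)            # (C, D_p)
    # Intra-normalization: L2-normalize each cluster row independently.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    # Stack the rows into one vector and L2-normalize the result.
    flat = vlad.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)
```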
2) Similarity: We propose an architecture to match the query with the database. Once the object descriptors are generated, they are exhaustively matched against the database using cosine similarity. This results in an object similarity matrix $S$, where each column contains the similarity scores of a query object with all candidate objects in the database. This can be represented as:

$$S(j, k) = \cos(O_d(j), O_q(k)), \quad (2)$$

where $j$ and $k$ index the $j$-th database object and the $k$-th query object, $\cos$ is the cosine similarity, and $O_d$ and $O_q$ are the database and query object embeddings, respectively.
For efficiency, we adopt a simple yet effective object-level and room-level matching framework, shown in Fig. 4. Object-level matching takes the maximum similarity of each query object over the database objects, while room-level matching sums these object matching scores over each room. This is because matched rooms often share similar objects, so the summation of object similarities can reason about room similarity, which can be represented as

$$R(p, q) = \sum_{k=0}^{Z} \max_{j}\left(S_{pq}(j, k)\right), \quad (3)$$

where $R$ is the room similarity matrix, $S_{pq}$ is the object similarity matrix between database room $p$ and the query room $q$, and $Z$ is the total number of query objects.
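The matching of Eqs. (2) and (3) can be sketched as follows, assuming L2-normalized object embeddings so that cosine similarity reduces to a dot product; the data layout and variable names are illustrative assumptions.

```python
import numpy as np

def room_similarity(query_objects, database_rooms):
    """Score every database room against the query objects, Eqs. (2)-(3).

    query_objects:  (Z, E) L2-normalized query object embeddings O_q.
    database_rooms: dict room_name -> (M, E) L2-normalized embeddings O_d.
    Returns a dict room_name -> appearance similarity score R(p, q).
    """
    scores = {}
    for room, db_objects in database_rooms.items():
        S = db_objects @ query_objects.T    # (M, Z) cosine similarities, Eq. (2)
        scores[room] = S.max(axis=0).sum()  # best database match per query object, summed, Eq. (3)
    return scores

# The predicted room is the one with the highest score:
# best_room = max(scores, key=scores.get)
```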
Fig. 4: Appearance-based matching: the maximum object similarity for every query-database pair is summed up to form a room similarity, which is then used for relocalization.

B. Geometry Module

Merely relying on appearance embeddings has a potential problem, since rooms sharing similar objects can be confused. Inspired by the fact that objects are usually placed at different relative locations, we design a geometry module, shown in Fig. 5, to assist the appearance-based matching.
An intuitive way to compute relative locations is to use depth measurements, but this makes the framework incompatible with cell-phone applications, where depth information is often unavailable. For better generalizability, we resort to object-wise key-point locations to encode geometric information. Specifically, we use their mean location ($\mu_j$), standard deviation ($\sigma_j$), 1st-, 2nd-, and 3rd-order moments ($m^1_j, m^2_j, m^3_j$), and singular value decomposition ($\mathrm{svd}_j$). Similar to the appearance module, if an object appears in more than one image, we take the arithmetic mean of its geometric features. Afterwards, the geometric features are passed through a multilayer perceptron (MLP) and then subtracted from each other to obtain relative geometric features. In this way, if there are $Z$ objects, we get $\binom{Z}{2}$ relative geometric features, which can be computed as:
$$o_j = [\mu_j, \sigma_j, m^1_j, m^2_j, m^3_j, \mathrm{svd}_j], \quad (4)$$

$$e_{jk} = g(o_j) - g(o_k), \quad (5)$$

where $[\cdot]$ denotes concatenation, $e_{jk}$ is the relative location feature between the $j$-th and $k$-th objects, and $g(\cdot)$ denotes an MLP layer.
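A sketch of the per-object geometric features and their pairwise differences (Eqs. (4) and (5)) is given below. The MLP $g(\cdot)$ is left as an arbitrary callable, and taking the singular values of the centered keypoint coordinates for $\mathrm{svd}_j$ is our reading of the text, not a confirmed detail.

```python
import numpy as np
from itertools import combinations

def object_geometry(keypoints):
    """Geometric feature o_j of one object from its keypoint pixel locations, Eq. (4)."""
    mu = keypoints.mean(axis=0)
    sigma = keypoints.std(axis=0)
    centered = keypoints - mu
    moments = [np.mean(centered ** k, axis=0) for k in (1, 2, 3)]  # 1st- to 3rd-order moments
    svd = np.linalg.svd(centered, compute_uv=False)                # singular values (assumed reading)
    return np.concatenate([mu, sigma, *moments, svd])

def relative_features(objects, g):
    """Relative features e_jk = g(o_j) - g(o_k) for all object pairs, Eq. (5)."""
    encoded = [g(object_geometry(kps)) for kps in objects]
    return [encoded[j] - encoded[k] for j, k in combinations(range(len(encoded)), 2)]
```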
These geometric features are then passed through a two-layer GAT [38] to perform attention-based message propagation between the location features:

$$e^t_u = \sigma\left(\sum_{v \in \mathcal{N}(u)} a_{uv} \cdot W \cdot e^{t-1}_v\right), \quad (6)$$

$$r = \frac{1}{U}\sum_{u=0}^{U} e_u, \quad (7)$$

where $e^t_u$ is the $u$-th location feature at the $t$-th graph layer, $\sigma$ is a nonlinearity, $a_{uv}$ is the attention coefficient, $W$ is a learnable weight matrix [38], and $r$ is the room-level embedding of dimension $E_o$. Finally, cosine similarity matching of the query and database room embeddings yields a room similarity matrix, analogous to the appearance module:

$$R_{loc}(p, q) = \cos(r_p, r_q), \quad (8)$$

where $r_p$ and $r_q$ are the $p$-th database and $q$-th query room embeddings.
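To make Eqs. (6)-(8) concrete, the following PyTorch sketch builds a room embedding with the off-the-shelf GATConv layer from torch_geometric as a stand-in for the paper's two-layer GAT; the ELU nonlinearity, the graph connectivity supplied by the caller, and the exact dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GeometryEncoder(torch.nn.Module):
    """Two GAT layers over relative location features, then mean-pool to a room embedding."""

    def __init__(self, in_dim=256, hidden_dim=512, out_dim=1024, heads=8, dropout=0.5):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden_dim, heads=heads, dropout=dropout, concat=False)
        self.gat2 = GATConv(hidden_dim, out_dim, heads=heads, dropout=dropout, concat=False)

    def forward(self, e, edge_index):
        # e: (U, in_dim) relative location features; edge_index: (2, num_edges) connectivity.
        h = F.elu(self.gat1(e, edge_index))   # Eq. (6), first layer
        h = F.elu(self.gat2(h, edge_index))   # Eq. (6), second layer
        return h.mean(dim=0)                  # Eq. (7): room-level embedding r

def room_geometry_similarity(r_query, r_db):
    """Eq. (8): cosine similarity between query and database room embeddings."""
    return F.cosine_similarity(r_query, r_db, dim=0)
```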
Fig. 5: The structure of the geometry module.
C. Feature Ensembling
After obtaining a set of room similarities based on appearance and geometry features, the final step is to integrate them using a weighted sum with weight $w$:

$$R' = w \cdot R + R_{loc}. \quad (9)$$
Furthermore, we observe that in most true positives from appearance-only matching, the similarity of the matched room is much higher than that of the other rooms. In such cases, there is little need to use both modules. Therefore, to reduce the runtime and avoid result degradation caused by rooms that have similar geometry but different objects, we apply the geometry-based assistance only to those queries where the difference between the highest and second-highest appearance similarities is less than a threshold that we call the "appearance threshold" ($T_{\mathrm{diff}}$). Queries with a difference greater than the threshold are classified by appearance matching alone.
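The ensembling and gating logic can be sketched as follows; the assumption that appearance scores lie in a cosine-like range and the function signature are illustrative.

```python
import numpy as np

def relocalize(appearance_scores, geometry_scores, w=10.0, t_diff=0.1):
    """Fuse appearance and geometry room similarities, Eq. (9), with T_diff gating.

    appearance_scores, geometry_scores: (num_rooms,) arrays of room similarities.
    """
    top_two = np.sort(appearance_scores)[-2:]
    if top_two[1] - top_two[0] > t_diff:
        # Confident appearance match: skip the geometry module entirely.
        return int(np.argmax(appearance_scores))
    fused = w * appearance_scores + geometry_scores  # R' = w * R + R_loc
    return int(np.argmax(fused))
```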
D. Loss Function
The graph attention encoder in the geometry module is supervised by a room matching loss. The room matching loss $L_r$ maximizes the cosine similarity of positive room pairs and minimizes the cosine similarity of negative room pairs:

$$L_r = \sum_{\{p,q\} \in P^+} \left(1 - \cos(r_p, r_q)\right) + \sum_{\{p,q\} \in P^-} \max\left(0, \cos(r_p, r_q) - \zeta\right), \quad (10)$$

where $\zeta = 0.2$ is a constant margin, $\cos$ is the cosine similarity, and $P^+$, $P^-$ are the sets of positive and negative room pairs.
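A hedged PyTorch sketch of Eq. (10) is shown below; how positive and negative room pairs are mined during training is not specified here and is assumed to be handled by the caller.

```python
import torch
import torch.nn.functional as F

def room_matching_loss(pos_pairs, neg_pairs, margin=0.2):
    """Contrastive room matching loss, Eq. (10).

    pos_pairs, neg_pairs: lists of (r_p, r_q) tuples of room embedding tensors.
    """
    loss = torch.zeros(())
    for r_p, r_q in pos_pairs:
        # Pull positive room pairs together.
        loss = loss + (1.0 - F.cosine_similarity(r_p, r_q, dim=0))
    for r_p, r_q in neg_pairs:
        # Push negative room pairs below the margin zeta.
        loss = loss + torch.clamp(F.cosine_similarity(r_p, r_q, dim=0) - margin, min=0.0)
    return loss
```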
IV. EXPERIMENTAL RESULTS
A. Dataset
The dataset adopted in this work, named Reloc110, is newly rendered using Habitat-Sim [41], a high-performance, physics-enabled 3D simulator supporting 3D scans of indoor/outdoor spaces and rigid-body mechanics.
Fig. 6: Precision-recall plots comparing AirLoc with different baselines for different K values. PR-AUC (Baseline 1 / Baseline 2 / NetVLAD / GCN / AirLoc): K=1: 0.3377 / 0.3356 / 0.4921 / 0.6205 / 0.8059; K=2: 0.3076 / 0.3577 / 0.6119 / 0.7489 / 0.8807; K=5: 0.3603 / 0.3938 / 0.7634 / 0.9159 / 0.9698; K=10: 0.3930 / 0.4222 / 0.8280 / 0.9713 / 0.9929.
TABLE I: Statistics of the newly rendered Reloc110 dataset.
We present the names, images, and rooms of 15 scenes.
Scene Images Rooms Scene Images Rooms
8WUm 18803 8 ULsK 13600 5
EDJb 22800 8 Vzqf 27000 9
i5no 18200 7 wc2J 32800 12
jh4f 13400 5 WYY7 11200 5
mJXq 25199 9 X7Hy 17800 7
qoiz 25500 9 YFuZ 20800 8
RPmz 17800 6 yqst 15600 6
S9hN 25000 9 Total 306000 113
To minimize the gap between simulation and the real world, we borrow Matterport3D [42], a large-scale RGB-D dataset that contains 90 building-scale scenes. All Matterport3D scenes are textured 3D meshes created from real-world RGB-D images.
We select 15 scenes from the dataset, each containing approximately 8 rooms. For every room, we sample approximately 2,500 random poses that are easily accessible for a human or a robot, i.e., not inside a wall or under the ground. Therefore, the images corresponding to these poses are similar to what humans or robots perceive in their usual activities. We then render the corresponding RGB images and semantic segmentation labels for all the collected poses. The dataset contains a total of 306,000 images divided into 113 rooms. Table I shows the total number of rooms and images generated for every scene. We further divide the dataset into train and test splits, where 3 scenes (RPmz, S9hN, ULsK) form the test split and the remaining scenes form the train split.
B. Implementation Details
The AirLoc configuration for appearance-based matching uses a SuperPoint descriptor dimension of $D_p = 256$ and $C = 32$ clusters in NetVLAD. The configuration for geometric matching is: relative location feature dimension $E = 256$, graph layer hidden dimension $E_h = 512$, and graph output dimension $E_o = 1024$. For the GAT, we use 8 heads and a dropout of 0.5. For training, we use a batch size of 256 and a learning rate of $1 \times 10^{-4}$. The network is trained for 30 epochs using the Adam optimizer on an NVIDIA A100 80GB GPU.
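For reference, the hyperparameters above can be gathered in a single configuration object; this is only an illustrative summary, not part of the released code.

```python
from dataclasses import dataclass

@dataclass
class AirLocConfig:
    # Appearance module
    descriptor_dim: int = 256        # SuperPoint descriptor dimension D_p
    netvlad_clusters: int = 32       # number of NetVLAD clusters C
    # Geometry module
    relative_feature_dim: int = 256  # E
    graph_hidden_dim: int = 512      # E_h
    graph_output_dim: int = 1024     # E_o
    gat_heads: int = 8
    gat_dropout: float = 0.5
    # Training
    batch_size: int = 256
    learning_rate: float = 1e-4
    epochs: int = 30
    # Inference
    appearance_threshold: float = 0.1  # T_diff
    ensemble_weight: float = 10.0      # w
```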
To validate the generalizability of AirLoc, we do not train the appearance module on the Reloc110 dataset. Instead, we use NetVLAD pretrained on the COCO [39] and YT-VIS [40] datasets. The train split is only used for learning the geometry module, which considers only the relative positions of objects and hence generalizes easily to unseen rooms.

TABLE II: Accuracy (%) comparing AirLoc with baselines.
Method        K=1    K=2    K=3    K=5    K=10
Baseline 1    40.64  61.55  69.88  78.69  84.81
Baseline 2    41.31  49.14  51.65  61.67  65.64
NetVLAD [11]  58.01  74.16  79.89  90.02  95.37
GCN [38]      61.31  76.57  86.30  91.61  96.62
AirLoc        75.35  87.26  91.75  94.35  98.32
For evaluating room-level relocalization performance, we use the test split of the Reloc110 dataset. To switch between appearance-only and appearance-geometry matching, we set the appearance threshold $T_{\mathrm{diff}}$ to 0.1. The weight $w$ for the weighted sum of appearance and geometry similarities is 10.
C. Evaluation Metrics
AirLoc's performance is evaluated with two metrics: accuracy and precision-recall. When computing accuracy, we perform one-to-one matching, where a query is matched only with its most similar room. Accuracy is then calculated as the ratio of correctly matched queries to the total number of queries. When computing precision-recall, however, we allow one-to-many matching: a query-database pair with a similarity value higher than a threshold $\rho$ is considered a match. From the resulting true positives, false positives, and false negatives, we calculate precision and recall. Furthermore, by varying the threshold $\rho \in (0, 1)$, we obtain precision-recall curves and calculate the areas under the curves (AUCs).
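These two metrics can be sketched as follows, assuming a precomputed room similarity matrix (database rooms by queries) and integer ground-truth room indices; this is a simplified illustration rather than the evaluation code used in the paper.

```python
import numpy as np

def accuracy(room_similarity, gt_rooms):
    """One-to-one matching: each query is assigned its single most similar room."""
    predictions = room_similarity.argmax(axis=0)   # (num_queries,)
    return float((predictions == gt_rooms).mean())

def precision_recall(room_similarity, gt_rooms, rho):
    """One-to-many matching: every pair above threshold rho counts as a match."""
    matches = room_similarity > rho                 # (num_rooms, num_queries)
    correct = np.zeros_like(matches, dtype=bool)
    correct[gt_rooms, np.arange(len(gt_rooms))] = True
    tp = np.logical_and(matches, correct).sum()
    fp = np.logical_and(matches, ~correct).sum()
    fn = np.logical_and(~matches, correct).sum()
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return precision, recall

# Sweeping rho over (0, 1) traces the precision-recall curve whose area gives the PR-AUC.
```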
D. Comparison to State-of-the-art Methods
AirLoc is compared with two types of baselines: room-level and object-level. The room-level baselines (Baseline 1, Baseline 2, and NetVLAD) extract room-level features from the input and calculate a room similarity matrix, thereby avoiding object matching. In Baseline 1, NetVLAD-based object encoders are used to extract individual object features, and the room features are then calculated by averaging the output object embeddings. In Baseline 2, the object encoder from Baseline 1 is replaced with a GAT, allowing a comparison of the performance of NetVLAD and GAT for object encoding. The NetVLAD baseline uses the output image descriptors from a NetVLAD module as room features, similar to how NetVLAD is typically used for place recognition [11]. It is worth noting that in this baseline, the NetVLAD module is not used for object encoding, but rather for encoding the entire image.
TABLE III: Precision-Recall Results Comparing AirLoc with baselines.
Method K=1 K=2 K=3 K=5 K=10
P R F-1 P R F-1 P R F-1 P R F-1 P R F-1
Baseline 1 92.30 8.63 15.78 55.31 19.35 28.67 37.67 34.17 35.84 26.61 47.34 34.07 21.33 57.78 31.16
Baseline 2 73.18 11.38 19.71 53.34 25.55 34.55 39.26 35.67 37.38 31.93 46.42 37.83 27.36 54.85 36.51
NetVLAD 100 0.4 0.8 98.65 16.20 27.83 67.57 53.16 59.51 33.43 87.17 48.33 16.22 97.93 27.84
GCN 72.17 46.79 56.77 80.49 58.94 68.05 90.44 68.04 77.66 94.06 78.26 85.44 97.80 89.17 93.29
AirLoc 82.43 67.77 74.39 90.66 73.27 81.05 94.63 78.19 85.63 98.33 86.47 92.02 99.27 95.40 97.30
Fig. 7: Qualitative results. Columns (left to right): query image, AirLoc match, NetVLAD match.
The object-level baseline, GCN [38], extracts object information first and matches object-level data to generate room similarity scores. It uses a similar architecture to AirLoc, but with two differences. First, the NetVLAD-based object encoder used in AirLoc is replaced with a graph-attention-based object encoder, which allows a comparison of these two types of object encoders. Second, the geometry module is not used in the GCN baseline, meaning that it does not incorporate information about the spatial relationships between objects.
In Fig. 6 and Table III, the performance of AirLoc is compared to the baseline methods using precision-recall and F-1 score, respectively, for different values of K. The results show that AirLoc consistently outperforms all the baselines across all K values in both PR-AUC and F-1. In particular, AirLoc exceeds GCN and NetVLAD by averages of 9.5% and 22.5% in PR-AUC, and by 10% and 49% in F-1 score, respectively. It can also be noticed that for both metrics, the performance gap between object-based and room-based methods is consistently large, demonstrating the importance of object-level data.
Table II presents comparisons of AirLoc and the baseline methods in terms of accuracy. AirLoc outperforms all other approaches in accuracy as well. Specifically, it outperforms GCN and NetVLAD by an average of 7% and 10%, respectively, and the margin of improvement is larger when K is smaller, indicating that AirLoc does not need as many database images as the other methods to perform well.

TABLE IV: Runtime Analysis.
Module  Node Encoding  Appearance  Geometry  Overall
AirLoc  2.5 ms         13.1 ms     4.8 ms    20.4 ms
GCN     8.1 ms         14.3 ms     --        22.4 ms
In Fig. 7, we present examples demonstrating the difference in performance between NetVLAD and AirLoc. For each query, the closest database image produced by NetVLAD is shown in the right column, while the closest database image produced by AirLoc is shown in the middle column. It can be observed that NetVLAD's matches look more visually similar to the query, but the objects in these images differ from the query objects, resulting in wrong matches for NetVLAD. In contrast, AirLoc relies on object-level data and is able to correctly match the query image even though the two images do not look visually similar. This demonstrates the effectiveness of using object-level data, as opposed to relying solely on visual similarity.
E. Efficiency
Table IV presents the overall runtime and the inference time of the individual modules of AirLoc. The runtime of the geometry module, which does not run for every query and whose invocation depends on the appearance threshold $T_{\mathrm{diff}}$, is 4.8 ms, much lower than that of the appearance module. The overall running time of AirLoc is about 20.4 ms, satisfying the real-time requirements of most applications. Even though the GCN baseline does not have a geometry module, its overall runtime is higher than that of AirLoc. This is due to the longer time taken by the node encoding of GCN, which uses a GAT rather than NetVLAD. NetVLAD itself has a lower runtime than the other methods because it encodes the entire image rather than individual objects; however, its accuracy and PR-AUC are much lower than those of AirLoc.
F. Ablation Studies
To evaluate the effectiveness of the geometry module, we compare the performance of AirLoc with and without the geometry module, as well as with $T_{\mathrm{diff}} = 1$. Setting $T_{\mathrm{diff}} = 1$ means that every query is evaluated using appearance-geometry matching, as opposed to AirLoc, where some queries are evaluated using appearance matching only.
TABLE V: Ablation studies (accuracy, %).
Method                 K=1    K=2    K=3
AirLoc (Tdiff = 1)     73.27  87.20  89.58
AirLoc (w/o Geometry)  74.14  85.86  90.97
AirLoc                 75.35  87.26  91.75
TABLE VI: Variation of accuracy (%) with the appearance threshold Tdiff.
Tdiff  K=1    K=2    K=3    K=5    K=10
0.01 74.34 86.02 91.36 94.04 98.45
0.05 75.12 86.80 91.60 94.14 98.42
0.1 75.35 87.26 91.75 94.35 98.32
0.2 75.17 87.58 91.40 93.62 97.95
0.35 74.59 87.46 90.62 92.78 97.44
0.5 74.14 87.28 90.12 92.12 96.97
The results, shown in Table V, demonstrate that AirLoc outperforms AirLoc without the geometry module by an average of 1.2%. This suggests that the geometry module helps the system reason about the geometry of the scene, leading to more accurate relocalization. Additionally, except for K = 2, the performance of AirLoc with $T_{\mathrm{diff}} = 1$ is lower than that of AirLoc without the geometry module, indicating that the current setting with $T_{\mathrm{diff}} < 1$ generalizes to most cases.
G. Parameter Analysis
To study the effect of different hyperparameters on the accuracy of AirLoc, we conduct a parameter analysis by varying the hyperparameter values and measuring the resulting performance. The results, shown in Table VI, demonstrate that the maximum accuracy for most values of K occurs around $T_{\mathrm{diff}} = 0.1$, leading us to choose this value for the appearance threshold. The results in Table VII show that the accuracy is highest for an appearance-to-geometry weight of $w = 10$. These results provide insight into the impact of different hyperparameter values on the accuracy of AirLoc.
H. Real-World Demo
This section presents real-world testing results of AirLoc to demonstrate its robustness and generalization ability. We collect only 4 images per room for the database and use the pretrained models described in Section IV-B for the geometry module and NetVLAD in this demo. For each example in Fig. 8, the left side displays the corresponding query captured by a mobile phone, while the right side shows the relocalization result. It can be seen that AirLoc relocalizes well under the illumination changes in Fig. 8a and the human interference in Fig. 8b. For better visualization, we strongly suggest that readers watch the video accompanying this paper at https://youtu.be/7CflVLbQOkg.
V. CONCLUSION
In this work, we present a novel indoor relocalization method, AirLoc, which can play a crucial role in the advancement of evolving applications such as augmented reality and indoor positioning using mobile phones.
TABLE VII: Variation of accuracy (%) with the weight w.
w   K=1    K=2    K=3    K=5    K=10
1 72.38 85.14 89.28 92.68 97.71
5 74.75 87.10 91.07 93.97 97.98
10 75.35 87.26 91.75 94.35 98.32
20 75.10 87.36 91.36 94.31 98.15
50 74.08 87.27 89.63 93.22 98.41
Fig. 8: The live relocalization demo. (a) Illumination changes. (b) Human interference.
To quickly generalize to new environments, we employ objects as the fundamental building blocks of the method. Specifically, AirLoc uses objects' appearance for relocalization and relative object geometry to differentiate between scenes with similar objects. Our experiments show that AirLoc outperforms existing methods and achieves the best performance on the newly rendered Reloc110 dataset. We envision AirLoc playing a pivotal role in the development of robust and generalizable indoor positioning systems for robots and humans.
VI. ACKNOWLEDGEMENT
This work was supported by OPPO US, the Spatial AI
& Robotics (SAIR) Lab at State University of New York at
Buffalo, and the AirLab at Carnegie Mellon University.
REFERENCES
[1] L. Meng, F. Tung, J. J. Little, J. Valentin, and C. W. de Silva,
“Exploiting points and lines in regression forests for rgb-d camera
relocalization,” in 2018 IEEE/RSJ International Conference on Intel-
ligent Robots and Systems (IROS). IEEE, 2018, pp. 6827–6834.
[2] T. Khan, K. Johnston, and J. Ophoff, “The impact of an augmented
reality application on learning motivation of students, Advances in
Human-Computer Interaction, vol. 2019, 2019.
[3] M. Shahjalal, M. Hossan, M. Hasan, M. Z. Chowdhury, N. T. Le, Y. M.
Jang, et al., “An implementation approach and performance analysis
of image sensor based multilateral indoor localization and navigation
system,” Wireless Communications and Mobile Computing, vol. 2018,
2018.
[4] H. Bavle, S. Manthe, P. De La Puente, A. Rodriguez-Ramos,
C. Sampedro, and P. Campoy, “Stereo visual odometry and semantics
based localization of aerial robots in indoor environments, in 2018
IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS). IEEE, 2018, pp. 1018–1023.
[5] J. Li, P. Wang, C. Ni, and W. Rong, “Loop closure detection based on
image semantic segmentation in indoor environment, Mathematical
Problems in Engineering, vol. 2022, 2022.
[6] M. Tian, Q. Nie, and H. Shen, “3d scene geometry-aware constraint for
camera localization with deep learning,” in 2020 IEEE International
Conference on Robotics and Automation (ICRA). IEEE, 2020, pp.
4211–4217.
[7] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a
versatile and accurate monocular slam system,” IEEE transactions on
robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[8] H. Bay, T. Tuytelaars, and L. V. Gool, “Surf: Speeded up robust
features,” in European conference on computer vision. Springer, 2006,
pp. 404–417.
[9] X. Guo, J. Hu, J. Chen, F. Deng, and T. L. Lam, “Semantic histogram
based graph matching for real-time multi-robot global localization
in large scale environment, IEEE Robotics and Automation Letters,
vol. 6, no. 4, pp. 8349–8356, 2021.
[10] Y. He, W. Sun, H. Huang, J. Liu, H. Fan, and J. Sun, “Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11632–11641.
[11] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad:
Cnn architecture for weakly supervised place recognition,” in Pro-
ceedings of the IEEE conference on computer vision and pattern
recognition, 2016, pp. 5297–5307.
[12] S. Hausler, S. Garg, M. Xu, M. Milford, and T. Fischer, “Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14141–14152.
[13] K. Xu, C. Wang, C. Chen, W. Wu, and S. Scherer, Aircode: A robust
object encoding method,” IEEE Robotics and Automation Letters,
vol. 7, no. 2, pp. 1816–1823, 2022.
[14] N. V. Keetha, C. Wang, Y. Qiu, K. Xu, and S. Scherer, Airobject:
A temporally evolving graph embedding for object identification, in
Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2022, pp. 8407–8416.
[15] P. C. Ng and S. Henikoff, “Sift: Predicting amino acid changes that
affect protein function, Nucleic acids research, vol. 31, no. 13, pp.
3812–3814, 2003.
[16] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places:
A 10 million image database for scene recognition,” IEEE transactions
on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1452–
1464, 2017.
[17] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba,
“Scene parsing through ade20k dataset,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2017, pp. 633–
641.
[18] A. Quattoni and A. Torralba, “Recognizing indoor scenes, in 2009
IEEE conference on computer vision and pattern recognition. IEEE,
2009, pp. 413–420.
[19] J. Wald, A. Avetisyan, N. Navab, F. Tombari, and M. Nießner, “Rio:
3d object instance re-localization in changing indoor environments, in
Proceedings of the IEEE/CVF International Conference on Computer
Vision, 2019, pp. 7658–7667.
[20] D. G. Lowe, “Distinctive image features from scale-invariant key-
points,” International journal of computer vision, vol. 60, no. 2, pp.
91–110, 2004.
[21] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International conference on computer vision. IEEE, 2011, pp. 2564–2571.
[22] D. Gálvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
[23] Z. Chen, A. Jacobson, N. Sünderhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, and M. Milford, “Deep learning features at scale for visual place recognition,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3223–3230.
[24] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-
supervised interest point detection and description,” in Proceedings
of the IEEE conference on computer vision and pattern recognition
workshops, 2018, pp. 224–236.
[25] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Su-
perglue: Learning feature matching with graph neural networks, in
Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, 2020, pp. 4938–4947.
[26] N. V. Keetha, M. Milford, and S. Garg, “A hierarchical dual model
of environment-and place-specific utility for visual place recognition,
IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 6969–6976,
2021.
[27] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn
features off-the-shelf: an astounding baseline for recognition, in
Proceedings of the IEEE conference on computer vision and pattern
recognition workshops, 2014, pp. 806–813.
[28] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 2010, pp. 3304–3311.
[29] H. F. Zaki, F. Shafait, and A. Mian, “Viewpoint invariant semantic
object and scene categorization with rgb-d sensors, Autonomous
Robots, vol. 43, no. 4, pp. 1005–1022, 2019.
[30] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and
A. J. Davison, “Slam++: Simultaneous localisation and mapping at the
level of objects, in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2013, pp. 1352–1359.
[31] Z. Qian, K. Patath, J. Fu, and J. Xiao, “Semantic slam with autonomous object-level data association,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 11203–11209.
[32] L. Zhang, L. Wei, P. Shen, W. Wei, G. Zhu, and J. Song, “Semantic slam based on object detection and improved octomap,” IEEE Access, vol. 6, pp. 75545–75559, 2018.
[33] A. Gawel, C. Del Don, R. Siegwart, J. Nieto, and C. Cadena, “X-
view: Graph-based semantic multi-view localization, IEEE Robotics
and Automation Letters, vol. 3, no. 3, pp. 1687–1694, 2018.
[34] M. J. Tarr and W. G. Hayward, “The concurrent encoding of
viewpoint-invariant and viewpoint-dependent information in visual
object recognition,” Visual Cognition, vol. 25, no. 1-3, pp. 100–121,
2017.
[35] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[36] K. Joseph, S. Khan, F. S. Khan, and V. N. Balasubramanian, “To-
wards open world object detection,” in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2021, pp.
5830–5840.
[37] T. N. Kipf and M. Welling, “Semi-supervised classification with graph
convolutional networks, arXiv preprint arXiv:1609.02907, 2016.
[38] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
[39] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
[40] L. Yang, Y. Fan, and N. Xu, “Video instance segmentation,” in
Proceedings of the IEEE/CVF International Conference on Computer
Vision, 2019, pp. 5188–5197.
[41] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain,
J. Straub, J. Liu, V. Koltun, J. Malik, et al., “Habitat: A platform for
embodied ai research,” in Proceedings of the IEEE/CVF International
Conference on Computer Vision, 2019, pp. 9339–9347.
[42] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva,
S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d
data in indoor environments, arXiv preprint arXiv:1709.06158, 2017.