AirLoc: Object-based Indoor Relocalization
Aryan1, Bowen Li2, Sebastian Scherer2, Yun-Jou Lin3, and Chen Wang4
Abstract: Indoor relocalization is vital for both robotic tasks like autonomous exploration and civil applications such as navigation with a cell phone in a shopping mall. Some previous approaches adopt geometrical information such as key-point features or local textures to carry out indoor relocalization, but they either easily fail in an environment with visually similar scenes or require many database images. Inspired by the fact that humans often remember places by recognizing unique landmarks, we resort to objects, which are more informative than geometry elements. In this work, we propose a simple yet effective object-based indoor relocalization approach, dubbed AirLoc. To overcome the critical challenges of object re-identification and remembering object relationships, we extract object-wise appearance embeddings and inter-object geometric relationships. The geometry and appearance features are integrated to generate cumulative scene features. This results in a robust, accurate, and portable indoor relocalization system, which outperforms the state-of-the-art methods in room-level relocalization by 9.5% in PR-AUC and 7% in accuracy. In addition to exhaustive evaluation, we also carry out real-world tests, where AirLoc shows robustness to challenges like severe occlusion, perceptual aliasing, viewpoint shift, and deformation.
Index Terms: Indoor Relocalization, Object Graph
I. INTRODUCTION
Indoor relocalization has gained increasing attention with the development of numerous mobile phone and robotic applications such as virtual reality (VR) [1], augmented reality (AR) [2], and robot navigation [3]. For example, it can be employed in large buildings such as shopping malls and offices, where one can use a cell phone for self-relocalization when lost. Additionally, many existing mobile robot localization techniques, such as visual odometry [4] and simultaneous localization and mapping (SLAM) [5], require indoor relocalization to correct accumulated drift.
Many algorithms [6] focus on providing accurate pose estimation; however, exact camera poses are often not required by civil applications such as indoor navigation. For instance, a lost patient in a hospital just wants to figure out which room they are in, rather than a precise centimeter-level location. Besides, we expect to re-recognize a place with only a few image samples (database), making the system commercially viable. If a method requires a large database, generalizing the system and creating such a database for multiple places is impractical due to memory and latency issues.
1 The Department of Electronics and Communication, Delhi Technological University, Delhi, India aryanmangal2022@gmail.com
2 The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA {bowenli2, basti}@andrew.cmu.edu
3 OPPO Palo Alto, California, USA. rose.lin@oppo.com
4 The Department of Computer Science and Engineering, State University of New York at Buffalo, NY 14260, USA. chenwang@dr.com
Fig. 1: The pipeline of AirLoc for object-based indoor relocalization. AirLoc provides room-level relocalization by constructing an object graph from a single query image and comparing it with the database, which can be established with only K (K = 1, 2, 5, 10) images per room.
In recent years, indoor relocalization methods have focused on geometric textures with key-point features [7], [8] or semantic information [9]. However, they are often not scalable for two major reasons. First, they require either a 3D scene model [10] or a large number of database images [11], which are not readily accessible in most real-world indoor scenes. Second, these methods do not work well in challenging scenarios like occlusion, lighting changes, and interference from dynamic objects such as humans. This is because they rely heavily on local texture matching, which often produces false matches under illumination changes or in visually similar scenes. Image-based methods, such as NetVLAD [11] and PatchNetVLAD [12], also produce false matches because they rely on the collective features of an image rather than understanding the individual identities depicted in the image. It remains questionable whether these challenges can be resolved with a limited number of database images available. Therefore, in this paper, we resort to higher-level information such as objects' appearance and relative geometry to tackle the problem of indoor relocalization.
Researchers have shown increasing interest in object encoding and re-identification tasks [13], [14]. The strong representations extracted from objects can be utilized for re-identification with remarkable efficacy. Inspired by this, we propose AirLoc, an object-based indoor relocalization approach shown in Fig. 1, which fully utilizes appearance and geometry relations. We show that room-level relocalization for a single
query image can be effectively achieved given a database of rooms. Furthermore, since the model is usually expected to quickly generalize to new environments, where a large number of database images cannot be quickly obtained, we take only a few images (K = 1, 2, 3, 5, 10) from every room to construct the database. AirLoc outperforms various baselines and achieves a speed of about 20 ms per frame, making it affordable for low-power mobile robots or cellphones and demonstrating its effectiveness and robustness. In summary, the main contributions of this paper are:
• We introduce a simple yet effective indoor relocalization framework, named AirLoc, that relies on object-level information to overcome the limitations of local-feature- or image-based approaches.
• We propose two modules to extract appearance- and geometry-related features, respectively, which are then combined to perform room-level relocalization.
• We perform an exhaustive experimental evaluation on the newly rendered Reloc110 dataset, which contains 306K images and 113 rooms. AirLoc robustly outperforms the state-of-the-art methods, obtaining improvements of 9.5% in PR-AUC and 7% in accuracy.
• We conduct real-world tests to validate the robustness of AirLoc to illumination change, occlusion, and viewpoint shift. We release the source code at https://github.com/sair-lab/AirLoc to benefit the robotics community.
II. RELATED WORK
We first review the related datasets for indoor relocalization. Then, methods based on key-point features [8], [11], [15] and objects [13], [14] are presented, respectively.
A. Datasets for Indoor Relocalization
Many datasets have been collected for semantic scene understanding. The Places365-Standard dataset [16] is built for visual understanding tasks like scene context, action and event prediction, and object recognition. It contains 1.8 million training images from 365 scene categories. The ADE20k [17] dataset contains images exhaustively annotated with objects and object parts, with additional occlusion information. The MIT Indoor Scenes database [18] contains 67 indoor categories and 15,620 images, but the distribution of images varies per category. A recently introduced indoor RGB-D dataset, RIO10 [19], captures changing indoor environments and contains 74 sequences split into training, validation, and testing sets.
Datasets used for object-based scene understanding tasks, such as real-world indoor relocalization, should include room labels. Properties such as ground-truth segmentation and images from varying viewpoints are also important for finer-grained learning. Existing datasets miss at least one of the above characteristics, which motivates us to construct a new dataset with such labels and specifications.
B. Key-point and Image Feature-based Methods
Handcrafted key-point features such as SIFT [20] and SURF [8] have been widely applied in conventional methods for image retrieval, loop closure detection, and visual place recognition (VPR). The binary descriptor ORB [21] was utilized in DBoW2 [22] for image retrieval using a visual vocabulary of features. However, these handcrafted local features are not discriminative in more complex and cluttered environments, where the conventional methods easily fail.
Compared to handcrafted features, approaches using deep-learned features have proved more robust [23]. SuperPoint [24], a recently proposed deep learning method, uses self-supervised learning to train interest point detectors and descriptors. Expanding upon SuperPoint, SuperGlue [25] introduced a graph neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. For tasks such as feature matching and place recognition, both SuperPoint and SuperGlue have received widespread adoption [26].
Some image retrieval methods [12] directly extract CNN-based image features. [27] produces a global image representation by aggregating CNN activation features. NetVLAD [11] uses a generalized end-to-end deep-learning-based Vector of Locally Aggregated Descriptors (VLAD) [28] layer. However, one of the main challenges faced by NetVLAD and similar methods is the limited availability of training data, which can adversely affect performance. To overcome this issue, spatial/depth data have been incorporated [29], and input modalities such as RGB-D images and point clouds have been explored.
These descriptors are capable of producing distinguishable descriptions but struggle in visually similar environments. In such conditions, different scenes can have similar local textures, which results in similar descriptions and ultimately leads to matching failures.
C. Object Semantic Features and Their Applications
Object-based semantic features are more robust and informative, and have been widely used in robotics applications such as SLAM. The pioneering work of SLAM++ [30] performs object-level SLAM using a depth camera. [31] develops a quadratic-programming-based semantic object initialization scheme to achieve high-accuracy object-level data association and real-time semantic mapping. [32] integrates object detection and localization modules to obtain semantic maps of the environment and improve localization. X-View [33] globally localizes aerial-to-ground and ground-to-ground robot data captured from drastically different viewpoints using object graph descriptors based on random walks.
Recently, AirCode [13] proposed a feature-sparse and object-dense encoding method that is robust to viewpoint changes, scaling, occlusion, and even object deformation. Building upon that, AirObject [14] introduced a temporal convolutional network over structural information from multiple frames, obtained from a graph-attention-based encoder, to perform temporal 3D object encoding. However, using these object descriptors for relocalization remains an open question. Taking motivation from the above examples, we use object encoders, such as AirCode, to extract object embeddings for relocalization.
Fig. 2: The proposed object matching framework uses a geometry module and an appearance module to match query images with database objects for indoor relocalization.
III. PROPOSED APPROACH
We propose AirLoc, a new architecture shown in Fig. 2. It consists of two parts, namely a geometry module and an appearance module. In this section, we first present the individual modules and then explain their ensembling. Finally, we present the loss function for the geometry module.
A. Appearance Module
The appearance module encodes objects' visual characteristics. Typically, each room in our database has K (K = 1, 2, 5, 10) images, and a query consists of 1 to 2 images. Objects are first encoded into feature vectors, and if an object appears in more than one image, we take the arithmetic mean of its embeddings. We then construct a database consisting of room-wise object embeddings for relocalization.
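For concreteness, a minimal sketch of this room-wise database construction is given below. It is an illustration only, not the released AirLoc code; the object-id keys, the dictionary layout, and the assumption that embeddings are NumPy arrays are ours.

```python
import numpy as np

def build_room_database(per_image_objects):
    """Average object embeddings that appear in several database images.

    per_image_objects: list (one entry per database image) of dicts
        mapping an object id to its embedding vector (np.ndarray).
    Returns a dict mapping object id -> mean embedding for the room.
    """
    accumulator, counts = {}, {}
    for image_objects in per_image_objects:
        for obj_id, embedding in image_objects.items():
            accumulator[obj_id] = accumulator.get(obj_id, 0.0) + embedding
            counts[obj_id] = counts.get(obj_id, 0) + 1
    # Arithmetic mean over the images in which each object appears.
    return {obj_id: accumulator[obj_id] / counts[obj_id] for obj_id in accumulator}

# Usage (hypothetical): database[room_name] = build_room_database(k_image_object_dicts)
```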
1) Object Encoders: Instead of using masks or rectangular patches of objects, we extract their features using a group of key-points on the object, which can be more distinctive. Based on previous research [34], we believe these key-points can provide robust object re-identification and can thus be used for embedding. Specifically, we use SuperPoint [24] to extract feature points, where the position of each point is denoted as $h_i = (x_i, y_i)$, $i \in [1, N]$, and the associated descriptor as $d_i \in \mathbb{R}^{D_p}$, where $D_p$ is the descriptor dimension. We then group the points into objects using instance segmentation masks, which can be obtained from commonly used networks like Mask R-CNN [35] or an open-world object detector [36].
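The grouping step can be illustrated roughly as follows, assuming the keypoints, descriptors, and an integer instance-id mask are already available; this is a sketch of the idea rather than the exact AirLoc implementation.

```python
import numpy as np

def group_keypoints_by_instance(keypoints, descriptors, instance_mask):
    """Group SuperPoint-style keypoints into per-object descriptor sets.

    keypoints:     (N, 2) array of (x, y) pixel coordinates.
    descriptors:   (N, D_p) array of the associated descriptors.
    instance_mask: (H, W) integer array; 0 = background, >0 = object id.
    Returns a dict: object id -> (M_i, D_p) descriptor array.
    """
    groups = {}
    for (x, y), desc in zip(keypoints.astype(int), descriptors):
        obj_id = int(instance_mask[y, x])  # look up the instance label under the keypoint
        if obj_id == 0:
            continue  # skip background points
        groups.setdefault(obj_id, []).append(desc)
    return {k: np.stack(v) for k, v in groups.items()}
```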
Given the grouped points, we next aggregate the individual features to form a collective object encoding. One of the most intuitive solutions is to use graph-based networks such as GCN [37] and GAT [38] for feature aggregation, where each feature point is taken as a node. However, we found that graph networks perform well when training and testing data come from the same distribution but easily overfit and generalize poorly to unseen environments. In contrast, image-based feature aggregation methods show better generalization ability for this task. For efficiency, we adopt the widely used image retrieval framework NetVLAD [11] and modify it to fit our feature-point-based representation, as shown in Fig. 3. In our experiments, we found that this framework generalizes to the new dataset, Reloc110, even though our model is only trained on COCO [39] and YT-VIS [40],
indicating its robustness to environmental changes.

Fig. 3: An object encoder is used in the appearance module to generate object descriptors for objects in a room using images and semantic labels.
Given $N$ descriptors $d_i$ $(i = 1, \cdots, N)$, the object encoding $O$ can be represented as a $C \times D_p$ dimensional vector:

$$O(c) = \phi\left(\sum_{i=1}^{N} a_c(d_i)\,(d_i - x_c)\right), \quad (1)$$

where $O(c) \in \mathbb{R}^{D_p}$ is the $c$-th row of $O$, $x_c$ is the $c$-th cluster center ($c = 1, \cdots, C$, with $C$ predefined), $a_c(\cdot)$ is the learnable soft assignment of descriptor $d_i$ to cluster $x_c$, and $\phi$ is a composed normalization, i.e., an intra-normalization that makes the model scale insensitive, followed by an L2-normalization after the rows are horizontally stacked into a vector.
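A NumPy sketch of the aggregation in Eq. (1) is shown below. It assumes the soft assignments $a_c(d_i)$ are given (in the learned model they come from a trainable NetVLAD layer), so it only illustrates the residual aggregation and the two normalizations; it is not the trained module itself.

```python
import numpy as np

def vlad_aggregate(descriptors, cluster_centers, soft_assign):
    """Aggregate N keypoint descriptors into a (C * D_p) object encoding, as in Eq. (1).

    descriptors:     (N, D_p) keypoint descriptors d_i.
    cluster_centers: (C, D_p) cluster centers x_c.
    soft_assign:     (N, C) soft assignments a_c(d_i), rows summing to 1.
    """
    residuals = descriptors[:, None, :] - cluster_centers[None, :, :]   # (N, C, D_p)
    vlad = (soft_assign[:, :, None] * residuals).sum(axis=0)            # (C, D_p)
    # Intra-normalization: L2-normalize each cluster row independently.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    # Stack the rows into one vector and L2-normalize the result.
    flat = vlad.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)
```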
2) Similarity: We propose an architecture to match the query with the database. Once the object descriptors are generated, they are exhaustively matched against the database using cosine similarity. This results in an object similarity matrix $S$, where each column contains the similarity scores of a query object with all candidate objects in the database. This can be represented as:

$$S(j, k) = \cos(O_d(j), O_q(k)), \quad (2)$$

where $j$ and $k$ index the $j$-th database object and the $k$-th query object, $\cos$ is the cosine similarity, and $O_d$ and $O_q$ are the database and query object embeddings, respectively.
For efficiency, we adopt a simple yet effective object-level and room-level matching framework, shown in Fig. 4. Object-level matching takes the maximum similarity of each query object over the database objects, while room-level matching sums these object matching scores over each room. This is because matched rooms often share similar objects, so the summation of object similarities can reason about room similarity, which can be represented as

$$R(p, q) = \sum_{k=0}^{Z} \max_{j}\left(S_{pq}(j, k)\right), \quad (3)$$

where $R$ is the room similarity matrix, $S_{pq}$ is the object similarity matrix between database room $p$ and the query room $q$, and $Z$ is the total number of query objects.
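The matching of Eqs. (2) and (3) can be sketched as follows, assuming L2-normalized object embeddings so that cosine similarity reduces to a dot product; the data layout and variable names are illustrative assumptions.

```python
import numpy as np

def room_similarity(query_objects, database_rooms):
    """Score every database room against the query objects, Eqs. (2)-(3).

    query_objects:  (Z, E) L2-normalized query object embeddings O_q.
    database_rooms: dict room_name -> (M, E) L2-normalized embeddings O_d.
    Returns a dict room_name -> appearance similarity score R(p, q).
    """
    scores = {}
    for room, db_objects in database_rooms.items():
        S = db_objects @ query_objects.T    # (M, Z) cosine similarities, Eq. (2)
        scores[room] = S.max(axis=0).sum()  # best database match per query object, summed, Eq. (3)
    return scores

# The predicted room is the one with the highest score:
# best_room = max(scores, key=scores.get)
```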
Fig. 4: Appearance-based matching: the maximum object similarity for every query-database pair is summed up to form a room similarity, which is then used for relocalization.

B. Geometry Module

Merely relying on appearance embeddings has a potential problem, since rooms sharing similar objects can be confused. Inspired by the fact that objects are usually placed at different relative locations, we design a geometry module, shown in Fig. 5, to assist the appearance-based matching.
An intuitive way to compute relative locations is to use depth measurements, but this makes the framework incompatible with cell-phone applications, where depth information is often unavailable. For better generalizability, we resort to object-wise key-point locations to encode geometric information. Specifically, we use their mean location ($\mu_j$), standard deviation ($\sigma_j$), 1st-, 2nd-, and 3rd-order moments ($m^1_j, m^2_j, m^3_j$), and singular value decomposition ($\mathrm{svd}_j$). Similar to the appearance module, if an object appears in more than one image, we take the arithmetic mean of its geometric features. Afterwards, the geometric features are passed through a multilayer perceptron (MLP) and then subtracted from each other to obtain relative geometric features. In this way, if there are $Z$ objects, we get $\binom{Z}{2}$ relative geometric features, which can be computed as:
$$o_j = [\mu_j, \sigma_j, m^1_j, m^2_j, m^3_j, \mathrm{svd}_j], \quad (4)$$

$$e_{jk} = g(o_j) - g(o_k), \quad (5)$$

where $[\cdot]$ denotes concatenation, $e_{jk}$ is the relative location feature between the $j$-th and $k$-th objects, and $g(\cdot)$ denotes an MLP layer.
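A sketch of the per-object geometric features and their pairwise differences (Eqs. (4) and (5)) is given below. The MLP $g(\cdot)$ is left as an arbitrary callable, and taking the singular values of the centered keypoint coordinates for $\mathrm{svd}_j$ is our reading of the text, not a confirmed detail.

```python
import numpy as np
from itertools import combinations

def object_geometry(keypoints):
    """Geometric feature o_j of one object from its keypoint pixel locations, Eq. (4)."""
    mu = keypoints.mean(axis=0)
    sigma = keypoints.std(axis=0)
    centered = keypoints - mu
    moments = [np.mean(centered ** k, axis=0) for k in (1, 2, 3)]  # 1st- to 3rd-order moments
    svd = np.linalg.svd(centered, compute_uv=False)                # singular values (assumed reading)
    return np.concatenate([mu, sigma, *moments, svd])

def relative_features(objects, g):
    """Relative features e_jk = g(o_j) - g(o_k) for all object pairs, Eq. (5)."""
    encoded = [g(object_geometry(kps)) for kps in objects]
    return [encoded[j] - encoded[k] for j, k in combinations(range(len(encoded)), 2)]
```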
These geometric features are then passed through a two-layer GAT [38] to perform attention-based message propagation between the location features:

$$e^t_u = \sigma\left(\sum_{v \in \mathcal{N}(u)} a_{uv} \cdot W \cdot e^{t-1}_v\right), \quad (6)$$

$$r = \frac{1}{U}\sum_{u=0}^{U} e_u, \quad (7)$$

where $e^t_u$ is the $u$-th location feature at the $t$-th graph layer, $\sigma$ is a nonlinearity, $a_{uv}$ is the attention coefficient, $W$ is a learnable weight matrix [38], and $r$ is the room-level embedding of dimension $E_o$. Finally, cosine similarity matching of the query and database room embeddings yields a room similarity matrix, analogous to the appearance module:

$$R_{loc}(p, q) = \cos(r_p, r_q), \quad (8)$$

where $r_p$ and $r_q$ are the $p$-th database and $q$-th query room embeddings.
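To make Eqs. (6)-(8) concrete, the following PyTorch sketch builds a room embedding with the off-the-shelf GATConv layer from torch_geometric as a stand-in for the paper's two-layer GAT; the ELU nonlinearity, the graph connectivity supplied by the caller, and the exact dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GeometryEncoder(torch.nn.Module):
    """Two GAT layers over relative location features, then mean-pool to a room embedding."""

    def __init__(self, in_dim=256, hidden_dim=512, out_dim=1024, heads=8, dropout=0.5):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden_dim, heads=heads, dropout=dropout, concat=False)
        self.gat2 = GATConv(hidden_dim, out_dim, heads=heads, dropout=dropout, concat=False)

    def forward(self, e, edge_index):
        # e: (U, in_dim) relative location features; edge_index: (2, num_edges) connectivity.
        h = F.elu(self.gat1(e, edge_index))   # Eq. (6), first layer
        h = F.elu(self.gat2(h, edge_index))   # Eq. (6), second layer
        return h.mean(dim=0)                  # Eq. (7): room-level embedding r

def room_geometry_similarity(r_query, r_db):
    """Eq. (8): cosine similarity between query and database room embeddings."""
    return F.cosine_similarity(r_query, r_db, dim=0)
```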
Fig. 5: The structure of the geometry module.
C. Feature Ensembling
After obtaining a set of room similarities based on appearance and geometry features, the final step is to integrate them using a weighted sum with weight $w$:

$$R' = w \cdot R + R_{loc}. \quad (9)$$
Furthermore, we observe that in most true positives from appearance-only matching, the similarity of the matched room is much higher than that of the other rooms. In such cases, there is little need to use both modules. Therefore, to reduce the runtime and avoid result degradation caused by rooms that have similar geometry but different objects, we apply the geometry-based assistance only to those queries where the difference between the highest and second-highest appearance similarities is less than a threshold that we call the "appearance threshold" ($T_{\mathrm{diff}}$). Queries with a difference greater than the threshold are classified by appearance matching alone.
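The ensembling and gating logic can be sketched as follows; the assumption that appearance scores lie in a cosine-like range and the function signature are illustrative.

```python
import numpy as np

def relocalize(appearance_scores, geometry_scores, w=10.0, t_diff=0.1):
    """Fuse appearance and geometry room similarities, Eq. (9), with T_diff gating.

    appearance_scores, geometry_scores: (num_rooms,) arrays of room similarities.
    """
    top_two = np.sort(appearance_scores)[-2:]
    if top_two[1] - top_two[0] > t_diff:
        # Confident appearance match: skip the geometry module entirely.
        return int(np.argmax(appearance_scores))
    fused = w * appearance_scores + geometry_scores  # R' = w * R + R_loc
    return int(np.argmax(fused))
```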
D. Loss Function
The graph attention encoder in the geometry module is supervised by a room matching loss. The room matching loss $L_r$ maximizes the cosine similarity of positive room pairs and minimizes the cosine similarity of negative room pairs:

$$L_r = \sum_{\{p,q\} \in P^+} \left(1 - \cos(r_p, r_q)\right) + \sum_{\{p,q\} \in P^-} \max\left(0, \cos(r_p, r_q) - \zeta\right), \quad (10)$$

where $\zeta = 0.2$ is a constant margin, $\cos$ is the cosine similarity, and $P^+$, $P^-$ are the sets of positive and negative room pairs.
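A hedged PyTorch sketch of Eq. (10) is shown below; how positive and negative room pairs are mined during training is not specified here and is assumed to be handled by the caller.

```python
import torch
import torch.nn.functional as F

def room_matching_loss(pos_pairs, neg_pairs, margin=0.2):
    """Contrastive room matching loss, Eq. (10).

    pos_pairs, neg_pairs: lists of (r_p, r_q) tuples of room embedding tensors.
    """
    loss = torch.zeros(())
    for r_p, r_q in pos_pairs:
        # Pull positive room pairs together.
        loss = loss + (1.0 - F.cosine_similarity(r_p, r_q, dim=0))
    for r_p, r_q in neg_pairs:
        # Push negative room pairs below the margin zeta.
        loss = loss + torch.clamp(F.cosine_similarity(r_p, r_q, dim=0) - margin, min=0.0)
    return loss
```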
IV. EXPERIMENTAL RESULTS
A. Dataset
The dataset adopted in this work, named Reloc110, is newly rendered using Habitat-Sim [41], a high-performance, physics-enabled 3D simulator supporting 3D scans of indoor/outdoor spaces and rigid-body mechanics.
Fig. 6: Precision-recall plots comparing AirLoc with different baselines for different K values. PR-AUC (Baseline 1 / Baseline 2 / NetVLAD / GCN / AirLoc): K=1: 0.3377 / 0.3356 / 0.4921 / 0.6205 / 0.8059; K=2: 0.3076 / 0.3577 / 0.6119 / 0.7489 / 0.8807; K=5: 0.3603 / 0.3938 / 0.7634 / 0.9159 / 0.9698; K=10: 0.3930 / 0.4222 / 0.8280 / 0.9713 / 0.9929.
TABLE I: Statistics of the newly rendered Reloc110 dataset.
We present the names, images, and rooms of 15 scenes.
Scene Images Rooms Scene Images Rooms
8WUm 18803 8 ULsK 13600 5
EDJb 22800 8 Vzqf 27000 9
i5no 18200 7 wc2J 32800 12
jh4f 13400 5 WYY7 11200 5
mJXq 25199 9 X7Hy 17800 7
qoiz 25500 9 YFuZ 20800 8
RPmz 17800 6 yqst 15600 6
S9hN 25000 9 Total 306000 113
To minimize the gap between simulation and the real world, we borrow Matterport3D [42], a large-scale RGB-D dataset that contains 90 building-scale scenes. All Matterport3D scenes are textured 3D meshes created from real-world RGB-D images.
We select 15 scenes from the dataset, each containing approximately 8 rooms. For every room, we sample approximately 2,500 random poses that are easily accessible for a human or a robot, i.e., not inside a wall or under the ground. Therefore, the images corresponding to these poses are similar to what humans or robots perceive in their usual activities. We then render the corresponding RGB images and semantic segmentation labels for all the collected poses. The dataset contains a total of 306,000 images divided into 113 rooms. Table I shows the total number of rooms and images generated for every scene. We further divide the dataset into train and test splits, where 3 scenes (RPmz, S9hN, ULsK) form the test split and the remaining scenes form the train split.
B. Implementation Details
The AirLoc configuration for appearance-based matching uses a SuperPoint descriptor dimension of $D_p = 256$ and $C = 32$ clusters in NetVLAD. The configuration for geometric matching is: relative location feature dimension $E = 256$, graph layer hidden dimension $E_h = 512$, and graph output dimension $E_o = 1024$. For the GAT, we use 8 heads and a dropout of 0.5. For training, we use a batch size of 256 and a learning rate of $1 \times 10^{-4}$. The network is trained for 30 epochs using the Adam optimizer on an NVIDIA A100 80GB GPU.
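For reference, the hyperparameters above can be gathered in a single configuration object; this is only an illustrative summary, not part of the released code.

```python
from dataclasses import dataclass

@dataclass
class AirLocConfig:
    # Appearance module
    descriptor_dim: int = 256        # SuperPoint descriptor dimension D_p
    netvlad_clusters: int = 32       # number of NetVLAD clusters C
    # Geometry module
    relative_feature_dim: int = 256  # E
    graph_hidden_dim: int = 512      # E_h
    graph_output_dim: int = 1024     # E_o
    gat_heads: int = 8
    gat_dropout: float = 0.5
    # Training
    batch_size: int = 256
    learning_rate: float = 1e-4
    epochs: int = 30
    # Inference
    appearance_threshold: float = 0.1  # T_diff
    ensemble_weight: float = 10.0      # w
```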
To validate the generalizability of AirLoc, we do not train the appearance module on the Reloc110 dataset. Instead, we use NetVLAD pretrained on the COCO [39] and YT-VIS [40] datasets. The train split is only used for learning the geometry module, which considers only the relative positions of objects and hence generalizes easily to unseen rooms.

TABLE II: Accuracy (%) comparing AirLoc with baselines.
Method        K=1    K=2    K=3    K=5    K=10
Baseline 1    40.64  61.55  69.88  78.69  84.81
Baseline 2    41.31  49.14  51.65  61.67  65.64
NetVLAD [11]  58.01  74.16  79.89  90.02  95.37
GCN [38]      61.31  76.57  86.30  91.61  96.62
AirLoc        75.35  87.26  91.75  94.35  98.32
For evaluating room-level relocalization performance, we use the test split of the Reloc110 dataset. To switch between appearance-only and appearance-geometry matching, we set the appearance threshold $T_{\mathrm{diff}}$ to 0.1. The weight $w$ for the weighted sum of appearance and geometry similarities is 10.
C. Evaluation Metrics
AirLoc's performance is evaluated with two metrics: accuracy and precision-recall. When computing accuracy, we perform one-to-one matching, where a query is matched only with its most similar room. Accuracy is then calculated as the ratio of correctly matched queries to the total number of queries. When computing precision-recall, however, we allow one-to-many matching: a query-database pair with a similarity value higher than a threshold $\rho$ is considered a match. From the resulting true positives, false positives, and false negatives, we calculate precision and recall. Furthermore, by varying the threshold $\rho \in (0, 1)$, we obtain precision-recall curves and calculate the areas under the curves (AUCs).
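These two metrics can be sketched as follows, assuming a precomputed room similarity matrix (database rooms by queries) and integer ground-truth room indices; this is a simplified illustration rather than the evaluation code used in the paper.

```python
import numpy as np

def accuracy(room_similarity, gt_rooms):
    """One-to-one matching: each query is assigned its single most similar room."""
    predictions = room_similarity.argmax(axis=0)   # (num_queries,)
    return float((predictions == gt_rooms).mean())

def precision_recall(room_similarity, gt_rooms, rho):
    """One-to-many matching: every pair above threshold rho counts as a match."""
    matches = room_similarity > rho                 # (num_rooms, num_queries)
    correct = np.zeros_like(matches, dtype=bool)
    correct[gt_rooms, np.arange(len(gt_rooms))] = True
    tp = np.logical_and(matches, correct).sum()
    fp = np.logical_and(matches, ~correct).sum()
    fn = np.logical_and(~matches, correct).sum()
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return precision, recall

# Sweeping rho over (0, 1) traces the precision-recall curve whose area gives the PR-AUC.
```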
D. Comparison to State-of-the-art Methods
AirLoc is compared with two types of baselines: room-level and object-level. The room-level baselines (Baseline 1, Baseline 2, and NetVLAD) extract room-level features from the input and calculate a room similarity matrix, thereby avoiding object matching. In Baseline 1, NetVLAD-based object encoders are used to extract individual object features, and the room features are then calculated by averaging the output object embeddings. In Baseline 2, the object encoder from Baseline 1 is replaced with a GAT, allowing a comparison of the performance of NetVLAD and GAT for object encoding. The NetVLAD baseline uses the output image descriptors from a NetVLAD module as room features, similar to how NetVLAD is typically used for place recognition [11]. It is worth noting that in this baseline, the NetVLAD module is not used for object encoding, but rather for encoding the entire image.
TABLE III: Precision-Recall Results Comparing AirLoc with baselines.
Method K=1 K=2 K=3 K=5 K=10
P R F-1 P R F-1 P R F-1 P R F-1 P R F-1
Baseline 1 92.30 8.63 15.78 55.31 19.35 28.67 37.67 34.17 35.84 26.61 47.34 34.07 21.33 57.78 31.16
Baseline 2 73.18 11.38 19.71 53.34 25.55 34.55 39.26 35.67 37.38 31.93 46.42 37.83 27.36 54.85 36.51
NetVLAD 100 0.4 0.8 98.65 16.20 27.83 67.57 53.16 59.51 33.43 87.17 48.33 16.22 97.93 27.84
GCN 72.17 46.79 56.77 80.49 58.94 68.05 90.44 68.04 77.66 94.06 78.26 85.44 97.80 89.17 93.29
AirLoc 82.43 67.77 74.39 90.66 73.27 81.05 94.63 78.19 85.63 98.33 86.47 92.02 99.27 95.40 97.30
Fig. 7: Qualitative results. Columns (left to right): query image, AirLoc match, NetVLAD match.
The object-level baseline, GCN [38], extracts object information first and matches object-level data to generate room similarity scores. It uses a similar architecture to AirLoc, but with two differences. First, the NetVLAD-based object encoder used in AirLoc is replaced with a graph-attention-based object encoder, which allows a comparison of these two types of object encoders. Second, the geometry module is not used in the GCN baseline, meaning that it does not incorporate information about the spatial relationships between objects.
In Fig. 6 and Table III, the performance of AirLoc is compared to the baseline methods using precision-recall and F-1 score, respectively, for different values of K. The results show that AirLoc consistently outperforms all the baselines across all K values in both PR-AUC and F-1. In particular, AirLoc exceeds GCN and NetVLAD by averages of 9.5% and 22.5% in PR-AUC, and by 10% and 49% in F-1 score, respectively. It can also be noticed that for both metrics, the performance gap between object-based and room-based methods is consistently large, demonstrating the importance of object-level data.
Table II presents comparisons of AirLoc and the baseline methods in terms of accuracy. AirLoc outperforms all other approaches in accuracy as well. Specifically, it outperforms GCN and NetVLAD by an average of 7% and 10%, respectively, and the margin of improvement is larger when K is smaller, indicating that AirLoc does not need as many database images as the other methods to perform well.

TABLE IV: Runtime Analysis.
Module  Node Encoding  Appearance  Geometry  Overall
AirLoc  2.5 ms         13.1 ms     4.8 ms    20.4 ms
GCN     8.1 ms         14.3 ms     --        22.4 ms
In Fig. 7, we present examples demonstrating the difference in performance between NetVLAD and AirLoc. For each query, the closest database image produced by NetVLAD is shown in the right column, while the closest database image produced by AirLoc is shown in the middle column. It can be observed that NetVLAD's matches look more visually similar to the query, but the objects in these images differ from the query objects, resulting in wrong matches for NetVLAD. In contrast, AirLoc relies on object-level data and is able to correctly match the query image even though the two images do not look visually similar. This demonstrates the effectiveness of using object-level data, as opposed to relying solely on visual similarity.
E. Efficiency
Table IV presents the overall runtime and the inference time of the individual modules of AirLoc. The runtime of the geometry module, which does not run for every query and whose invocation depends on the appearance threshold $T_{\mathrm{diff}}$, is 4.8 ms, much lower than that of the appearance module. The overall running time of AirLoc is about 20.4 ms, satisfying the real-time requirements of most applications. Even though the GCN baseline does not have a geometry module, its overall runtime is higher than that of AirLoc. This is due to the longer time taken by the node encoding of GCN, which uses a GAT rather than NetVLAD. NetVLAD itself has a lower runtime than the other methods because it encodes the entire image rather than individual objects; however, its accuracy and PR-AUC are much lower than those of AirLoc.
F. Ablation Studies
To evaluate the effectiveness of the geometry module, we compare the performance of AirLoc with and without the geometry module, as well as with $T_{\mathrm{diff}} = 1$. Setting $T_{\mathrm{diff}} = 1$ means that every query is evaluated using appearance-geometry matching, as opposed to AirLoc, where some queries are evaluated using appearance matching only.
TABLE V: Ablation studies (accuracy, %).
Method                 K=1    K=2    K=3
AirLoc (Tdiff = 1)     73.27  87.20  89.58
AirLoc (w/o Geometry)  74.14  85.86  90.97
AirLoc                 75.35  87.26  91.75
TABLE VI: Variation of accuracy (%) with the appearance threshold Tdiff.
Tdiff  K=1    K=2    K=3    K=5    K=10
0.01 74.34 86.02 91.36 94.04 98.45
0.05 75.12 86.80 91.60 94.14 98.42
0.1 75.35 87.26 91.75 94.35 98.32
0.2 75.17 87.58 91.40 93.62 97.95
0.35 74.59 87.46 90.62 92.78 97.44
0.5 74.14 87.28 90.12 92.12 96.97
The results, shown in Table V, demonstrate that AirLoc outperforms AirLoc without the geometry module by an average of 1.2%. This suggests that the geometry module helps the system reason about the geometry of the scene, leading to more accurate relocalization. Additionally, except for K = 2, the performance of AirLoc with $T_{\mathrm{diff}} = 1$ is lower than that of AirLoc without the geometry module, indicating that the current setting with $T_{\mathrm{diff}} < 1$ generalizes to most cases.
G. Parameter Analysis
To study the effect of different hyperparameters on the accuracy of AirLoc, we conduct a parameter analysis by varying the hyperparameter values and measuring the resulting performance. The results, shown in Table VI, demonstrate that the maximum accuracy for most values of K occurs around $T_{\mathrm{diff}} = 0.1$, leading us to choose this value for the appearance threshold. The results in Table VII show that the accuracy is highest for an appearance-to-geometry weight of $w = 10$. These results provide insight into the impact of different hyperparameter values on the accuracy of AirLoc.
H. Real-World Demo
This section presents real-world testing results of AirLoc to demonstrate its robustness and generalization ability. We collect only 4 images per room for the database and use the pretrained models described in Section IV-B for the geometry module and NetVLAD in this demo. For each example in Fig. 8, the left side displays the corresponding query captured by a mobile phone, while the right side shows the relocalization result. It can be seen that AirLoc relocalizes well under the illumination changes in Fig. 8a and the human interference in Fig. 8b. For better visualization, we strongly suggest that readers watch the video accompanying this paper at https://youtu.be/7CflVLbQOkg.
V. CONCLUSION
In this work, we present a novel indoor relocalization method, AirLoc, which can play a crucial role in the advancement of evolving applications such as augmented reality and indoor positioning using mobile phones.
TABLE VII: Variation of accuracy (%) with the weight w.
w   K=1    K=2    K=3    K=5    K=10
1 72.38 85.14 89.28 92.68 97.71
5 74.75 87.10 91.07 93.97 97.98
10 75.35 87.26 91.75 94.35 98.32
20 75.10 87.36 91.36 94.31 98.15
50 74.08 87.27 89.63 93.22 98.41
Fig. 8: The live relocalization demo. (a) Illumination changes. (b) Human interference.
To quickly generalize to new environments, we employ objects as the fundamental building blocks of the method. Specifically, AirLoc uses objects' appearance for relocalization and relative object geometry to differentiate between scenes with similar objects. Our experiments show that AirLoc outperforms existing methods and achieves the best performance on the newly rendered Reloc110 dataset. We envision AirLoc playing a pivotal role in the development of robust and generalizable indoor positioning systems for robots and humans.
VI. ACKNOWLEDGEMENT
This work was supported by OPPO US, the Spatial AI
& Robotics (SAIR) Lab at State University of New York at
Buffalo, and the AirLab at Carnegie Mellon University.
REFERENCES
[1] L. Meng, F. Tung, J. J. Little, J. Valentin, and C. W. de Silva,
“Exploiting points and lines in regression forests for rgb-d camera
relocalization,” in 2018 IEEE/RSJ International Conference on Intel-
ligent Robots and Systems (IROS). IEEE, 2018, pp. 6827–6834.
[2] T. Khan, K. Johnston, and J. Ophoff, “The impact of an augmented
reality application on learning motivation of students, Advances in
Human-Computer Interaction, vol. 2019, 2019.
[3] M. Shahjalal, M. Hossan, M. Hasan, M. Z. Chowdhury, N. T. Le, Y. M.
Jang, et al., “An implementation approach and performance analysis
of image sensor based multilateral indoor localization and navigation
system,” Wireless Communications and Mobile Computing, vol. 2018,
2018.
[4] H. Bavle, S. Manthe, P. De La Puente, A. Rodriguez-Ramos,
C. Sampedro, and P. Campoy, “Stereo visual odometry and semantics
based localization of aerial robots in indoor environments, in 2018
IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS). IEEE, 2018, pp. 1018–1023.
[5] J. Li, P. Wang, C. Ni, and W. Rong, “Loop closure detection based on
image semantic segmentation in indoor environment, Mathematical
Problems in Engineering, vol. 2022, 2022.
[6] M. Tian, Q. Nie, and H. Shen, “3d scene geometry-aware constraint for
camera localization with deep learning,” in 2020 IEEE International
Conference on Robotics and Automation (ICRA). IEEE, 2020, pp.
4211–4217.
[7] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a
versatile and accurate monocular slam system,” IEEE transactions on
robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[8] H. Bay, T. Tuytelaars, and L. V. Gool, “Surf: Speeded up robust
features,” in European conference on computer vision. Springer, 2006,
pp. 404–417.
[9] X. Guo, J. Hu, J. Chen, F. Deng, and T. L. Lam, “Semantic histogram
based graph matching for real-time multi-robot global localization
in large scale environment, IEEE Robotics and Automation Letters,
vol. 6, no. 4, pp. 8349–8356, 2021.
[10] Y. He, W. Sun, H. Huang, J. Liu, H. Fan, and J. Sun, “Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11632–11641.
[11] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad:
Cnn architecture for weakly supervised place recognition,” in Pro-
ceedings of the IEEE conference on computer vision and pattern
recognition, 2016, pp. 5297–5307.
[12] S. Hausler, S. Garg, M. Xu, M. Milford, and T. Fischer, “Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14141–14152.
[13] K. Xu, C. Wang, C. Chen, W. Wu, and S. Scherer, Aircode: A robust
object encoding method,” IEEE Robotics and Automation Letters,
vol. 7, no. 2, pp. 1816–1823, 2022.
[14] N. V. Keetha, C. Wang, Y. Qiu, K. Xu, and S. Scherer, Airobject:
A temporally evolving graph embedding for object identification, in
Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2022, pp. 8407–8416.
[15] P. C. Ng and S. Henikoff, “Sift: Predicting amino acid changes that
affect protein function, Nucleic acids research, vol. 31, no. 13, pp.
3812–3814, 2003.
[16] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places:
A 10 million image database for scene recognition,” IEEE transactions
on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1452–
1464, 2017.
[17] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba,
“Scene parsing through ade20k dataset,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2017, pp. 633–
641.
[18] A. Quattoni and A. Torralba, “Recognizing indoor scenes, in 2009
IEEE conference on computer vision and pattern recognition. IEEE,
2009, pp. 413–420.
[19] J. Wald, A. Avetisyan, N. Navab, F. Tombari, and M. Nießner, “Rio:
3d object instance re-localization in changing indoor environments, in
Proceedings of the IEEE/CVF International Conference on Computer
Vision, 2019, pp. 7658–7667.
[20] D. G. Lowe, “Distinctive image features from scale-invariant key-
points,” International journal of computer vision, vol. 60, no. 2, pp.
91–110, 2004.
[21] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International conference on computer vision. IEEE, 2011, pp. 2564–2571.
[22] D. Gálvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
[23] Z. Chen, A. Jacobson, N. Sünderhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, and M. Milford, “Deep learning features at scale for visual place recognition,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3223–3230.
[24] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-
supervised interest point detection and description,” in Proceedings
of the IEEE conference on computer vision and pattern recognition
workshops, 2018, pp. 224–236.
[25] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Su-
perglue: Learning feature matching with graph neural networks, in
Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, 2020, pp. 4938–4947.
[26] N. V. Keetha, M. Milford, and S. Garg, “A hierarchical dual model
of environment-and place-specific utility for visual place recognition,
IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 6969–6976,
2021.
[27] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn
features off-the-shelf: an astounding baseline for recognition, in
Proceedings of the IEEE conference on computer vision and pattern
recognition workshops, 2014, pp. 806–813.
[28] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 2010, pp. 3304–3311.
[29] H. F. Zaki, F. Shafait, and A. Mian, “Viewpoint invariant semantic
object and scene categorization with rgb-d sensors, Autonomous
Robots, vol. 43, no. 4, pp. 1005–1022, 2019.
[30] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and
A. J. Davison, “Slam++: Simultaneous localisation and mapping at the
level of objects, in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2013, pp. 1352–1359.
[31] Z. Qian, K. Patath, J. Fu, and J. Xiao, “Semantic slam with autonomous object-level data association,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 11203–11209.
[32] L. Zhang, L. Wei, P. Shen, W. Wei, G. Zhu, and J. Song, “Semantic slam based on object detection and improved octomap,” IEEE Access, vol. 6, pp. 75545–75559, 2018.
[33] A. Gawel, C. Del Don, R. Siegwart, J. Nieto, and C. Cadena, “X-
view: Graph-based semantic multi-view localization, IEEE Robotics
and Automation Letters, vol. 3, no. 3, pp. 1687–1694, 2018.
[34] M. J. Tarr and W. G. Hayward, “The concurrent encoding of
viewpoint-invariant and viewpoint-dependent information in visual
object recognition,” Visual Cognition, vol. 25, no. 1-3, pp. 100–121,
2017.
[35] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[36] K. Joseph, S. Khan, F. S. Khan, and V. N. Balasubramanian, “To-
wards open world object detection,” in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2021, pp.
5830–5840.
[37] T. N. Kipf and M. Welling, “Semi-supervised classification with graph
convolutional networks, arXiv preprint arXiv:1609.02907, 2016.
[38] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
[39] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
[40] L. Yang, Y. Fan, and N. Xu, “Video instance segmentation,” in
Proceedings of the IEEE/CVF International Conference on Computer
Vision, 2019, pp. 5188–5197.
[41] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain,
J. Straub, J. Liu, V. Koltun, J. Malik, et al., “Habitat: A platform for
embodied ai research,” in Proceedings of the IEEE/CVF International
Conference on Computer Vision, 2019, pp. 9339–9347.
[42] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva,
S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d
data in indoor environments, arXiv preprint arXiv:1709.06158, 2017.