
A Dense Optical Flow-Based Feature Matching Approach in Visual Odometry

Authors:
Wenzhe Chen, Hao Fu, Meiping Shi, Ying Chen
College of Mechatronic Engineering and Automation
National University of Defense Technology
Changsha, P.R.China, 410073
chenwenzhe119@163.com
Abstract—Monocular visual odometry serves as the front end of SLAM; its task is to estimate the camera motion between adjacent images. One of the most important steps in visual odometry is to find feature correspondences between neighboring frames. When finding feature matches, the search region cannot be too small, otherwise too few matches will be found. On the other hand, it cannot be too large, otherwise many false matches will be generated and the performance of the entire system will deteriorate. In this paper, we propose to add a dense optical flow computation step before the matching step. The dense optical flow is computed through optimization over the whole image. The image context helps to ensure that the result does not deviate too far from the true position, and thus provides an initial position estimate for the feature matches. Experiments on the KITTI dataset show that the proposed approach can indeed narrow down the search range for finding feature correspondences while still being able to find a large number of correct matches.
Keywords-visual odometry; ORB feature; dense optical flow;
I. INTRODUCTION
The research on visual odometry dates back to the 1970s, when Hans Moravec [1] studied the problem of estimating the motion of a robot equipped with a camera. In recent years, several successful approaches have been proposed, and existing approaches can be divided into two categories: direct approaches [2, 4, 5, 6] and indirect approaches [3, 7, 8, 9]. Whilst direct approaches are usually believed to be computationally intensive and do not apply to situations where the inter-frame motion is large, feature-based approaches involve a feature detection and matching step, and are therefore more applicable to cases with large inter-frame motion. In this paper, we focus on the feature-based approach.
A complete feature-based visual odometry approach usually involves three steps: the first step is to establish feature correspondences between neighboring frames; then the camera pose and the 3D position of each feature can be computed using multiple view geometry techniques; finally, to improve robustness, a local bundle adjustment can be employed which simultaneously optimizes the camera poses and the 3D structure in a local neighborhood. Among the three steps, the first plays a fundamental role and heavily influences the performance of the entire system.
An ideal visual odometry approach requires the feature correspondence step to produce a large number of correct matches. The correctness and the quantity of the matches, however, are two conflicting goals that must be traded off. On the one hand, if we require a large number of feature matches, we need to enlarge the search window for finding feature correspondences; nevertheless, as the search window becomes larger, the chance of finding erroneous matches increases. On the other hand, if we want to increase the rate of correct feature matches, we must perform the search in a relatively small neighborhood; however, as the search window becomes smaller, many matches will be missed, which leads to a reduction in the number of matches.
To resolve this contradiction between quantity and quality, we propose in this paper a novel approach which provides an initial position estimate for the feature matches using dense optical flow. As the dense optical flow is computed through optimization over the whole image, the image context helps to ensure that the result does not deviate too far from the true position, and it thus provides an initial position estimate for the feature matches. An illustrative example is shown in Fig. 1, where we narrow the search window to ensure a high correctness rate while simultaneously obtaining more matches than the original approach.
The rest of the paper is structured as follows: Section II reviews relevant work. Section III describes the details of our proposed approach. Experimental results are given in Section IV, and conclusions are drawn in Section V.
Figure 1. Alignment results on test images. The top image shows the original feature-based method, and the bottom image shows our method.
II. RELATED WORK
Existing visual odometry approaches can be categorized into two classes: direct approaches and feature-based approaches.
A. Direct Approach
The direct approach is pixel-based. It estimates the camera motion directly from the brightness of pixels, without computing keypoints and descriptors. As a result, the direct approach avoids the computational cost of feature detection and feature matching.
Direct approaches share similarities with optical flow methods: both rely on the brightness constancy assumption. However, their purposes differ: optical flow describes the movement of pixels in the image, whereas the direct approach aims to recover the camera motion. LSD-SLAM [9] is a well-known direct visual odometry approach. Its main contribution is to apply the direct method to semi-dense monocular SLAM, which marks the successful application of the monocular direct method in SLAM. Nevertheless, its accuracy is somewhat lower than that of feature-based approaches.
B. Feature-based Approach
Different from the direct approach, the feature-based approach involves a feature computation step. Sparse salient feature points are first extracted from each image, then feature correspondences are found by comparing the descriptors extracted for each feature, and finally the camera motion is recovered from these correspondences. The features are usually extracted in a very efficient way. The feature-based approach is usually believed to be more efficient than the direct approach, and it can nicely handle situations with severe illumination changes or large inter-frame displacements.
Representative feature-based approaches include PTAM [5] and ORB-SLAM [6]. PTAM proposed to perform tracking and mapping in parallel, and was the first to use non-linear optimization rather than a filter as the back-end solution.
ORB-SLAM is a famous successor of PTAM; it is fast and easy to use in modern SLAM systems. The ORB feature [14] is not as time-consuming as SIFT or SURF and can be computed in real time on a CPU. Meanwhile, it has good rotation and scale invariance compared to simple corner detectors such as the Harris corner detector [13].
III. PROPOSED METHOD
In this paper, we propose to add a dense optical flow computation step before the matching step. The result of the dense optical flow provides an initial position estimate for finding feature correspondences. To meet real-time requirements, following [14], we choose FAST [11] as the feature detector and BRIEF [15] as the feature descriptor. To make the BRIEF descriptor rotationally invariant, we compute the corner orientation of each FAST feature.
Figure 2. An overview of the proposed approach.
For the dense optical flow, we choose Dense Inverse Search (DIS) [10], a recently proposed, very fast approach.
In the absence of ground-truth annotations, we employ multiple view geometry constraints to verify the correctness of feature matches. More specifically, we use the epipolar constraint together with a second constraint which requires the triangulated 3D feature point to lie in front of both cameras.
An overview of our system is illustrated in Fig. 2. The key modules are feature detection and description, fast dense optical flow computation, and verification of the feature matches with the two proposed constraints. We discuss them in more detail in the following sub-sections.
A. Feature Detection and Description
1) FAST detector
The FAST detection and corner orientation computation involves the following steps:
a) The threshold of FAST is adjusted to ensure that the number of feature points detected by the FAST algorithm exceeds N, the desired number of feature points;
b) At the position of each feature point, the Harris response R is calculated, and the N points with the largest R values are retained as the FAST feature points;
c) Since we want to embed rotation invariance into the BRIEF descriptor, the corner orientation of each feature point must be calculated. The moments of the image patch around a feature point are computed as:

$$M_{ij} = \sum_{x,y} x^i y^j I(x, y) \qquad (1)$$
The intensity centroid and the corner orientation are computed as:

$$C = (c_x, c_y) = \left( \frac{M_{10}}{M_{00}}, \frac{M_{01}}{M_{00}} \right), \qquad \theta = \tan^{-1}\left( \frac{c_y}{c_x} \right) \qquad (2)$$

where $(x, y)$ in (1) ranges over the points within the feature neighborhood, $I(x, y)$ is the image intensity, and $\theta$ is the corner orientation of the FAST feature point.
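To make steps a)-c) concrete, the following is a minimal Python sketch of the intensity-centroid orientation described above. It is illustrative rather than the paper's exact implementation: the patch radius, the file name, and the use of the built-in FAST score in place of the Harris response of step b) are assumptions.

```python
import cv2
import numpy as np

def corner_orientation(img, kp, half=15):
    """Intensity-centroid orientation of a FAST keypoint (sketch)."""
    x0, y0 = int(round(kp.pt[0])), int(round(kp.pt[1]))
    # patch around the keypoint; assumes the keypoint lies away from the border
    patch = img[y0 - half:y0 + half + 1, x0 - half:x0 + half + 1].astype(np.float64)
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    m10 = (xs * patch).sum()     # moment M10
    m01 = (ys * patch).sum()     # moment M01
    return np.arctan2(m01, m10)  # theta = atan2(M01, M10) = atan(c_y / c_x)

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
fast = cv2.FastFeatureDetector_create(threshold=20)
# keep the N strongest corners; here the FAST score stands in for the
# Harris response of step b)
kps = sorted(fast.detect(img, None), key=lambda k: -k.response)[:500]
angles = [corner_orientation(img, kp) for kp in kps]
```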
2) BRIEF description
BRIEF stands for Binary Robust Independent Elementary Features. It is a very fast, binary-coded descriptor, which greatly accelerates descriptor computation. The main idea of BRIEF is to randomly select a number of point pairs (usually N = 256) near the feature point and compare the gray values of each pair; the comparison results are concatenated into a binary string, which serves as the final descriptor. Each bit of the BRIEF descriptor is thus the result of a binary comparison between two selected pixels. The descriptor is computed in a patch rotated according to the FAST corner orientation, which yields good rotational invariance.
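As a rough illustration of the binary test, the snippet below sketches an unrotated BRIEF descriptor. The sampling pattern (random offsets with a fixed seed) and patch size are assumptions; a steered variant, as used here, would first rotate the offsets by the keypoint orientation.

```python
import numpy as np

rng = np.random.RandomState(0)
# 256 random tests; each row holds the offsets (x1, y1, x2, y2) of one pixel pair
pairs = rng.randint(-15, 16, size=(256, 4))

def brief_descriptor(img, kp):
    """Unrotated BRIEF (sketch): one bit per intensity comparison."""
    x0, y0 = int(round(kp.pt[0])), int(round(kp.pt[1]))
    bits = [int(img[y0 + y1, x0 + x1] < img[y0 + y2, x0 + x2])
            for x1, y1, x2, y2 in pairs]
    return np.packbits(np.array(bits, dtype=np.uint8))  # 32-byte binary descriptor
```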
B. Fast Dense Optical Flow
Optical flow describes the movement of pixels between neighboring frames and comes in sparse and dense variants. Sparse optical flow is computed only at a number of feature points, whereas dense optical flow computes an offset for every pixel in the image. The dense optical flow is obtained through optimization over the entire image, and the image context helps to ensure that the result does not deviate too far from the true position. In [10], the authors propose Dense Inverse Search (DIS), an extremely fast algorithm that achieves competitive accuracy.
The method is summarized as follows. A coarse-to-fine image pyramid is first constructed, and at each scale:
a) Create a regular grid of overlapping 2D patches with fixed size and overlap;
b) Estimate the 2D displacement of each patch by gradient descent on the image intensities:

$$\mathbf{u} = \arg\min_{\mathbf{u}'} \sum_{\mathbf{x}} \left[ I_{t+1}(\mathbf{x} + \mathbf{u}') - T(\mathbf{x}) \right]^2 \qquad (3)$$

where $T$ is the template patch extracted from frame $t$;
c) Calculate the dense displacement field as a weighted average of the 2D patch displacements:

$$U_s(\mathbf{x}) = \frac{1}{Z} \sum_{i=1}^{N_s} \frac{\lambda_{i,\mathbf{x}}}{\max\left(1, \lVert d_i(\mathbf{x}) \rVert_2\right)} \, \mathbf{u}_i \qquad (4)$$

where $\lambda_{i,\mathbf{x}}$ indicates whether patch $i$ overlaps pixel $\mathbf{x}$, $d_i(\mathbf{x})$ is the intensity difference between the displaced patch and the current image, and $Z$ is the normalization over the contributing weights;
d) Apply variational refinement by minimizing an energy composed of an intensity data term $E_I$, a gradient data term $E_G$, and a smoothness term $E_S$ (each term is weighted in [10]):

$$E(U) = \int \Phi(E_I) + \Phi(E_G) + \Phi(E_S) \; d\mathbf{x} \qquad (5)$$
With the DIS method, we can obtain a dense optical flow field at faster-than-real-time rates.
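OpenCV ships an implementation of DIS, so a dense flow field of the kind used here can be obtained with a few lines. The preset and file names below are illustrative assumptions (OpenCV 4.x):

```python
import cv2

prev = cv2.imread("frame_0000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)

# DIS optical flow with the fast preset; calc() returns an HxWx2 field
# holding the (dx, dy) displacement of every pixel from prev to curr
dis = cv2.DISOpticalFlow_create(cv2.DISOPTICAL_FLOW_PRESET_FAST)
flow = dis.calc(prev, curr, None)
```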
C. Epipolar Constraint and Triangulation Constraint
To verify the correctness of feature matches, we use techniques from multiple view geometry. A point in one view defines an epipolar line in the other view on which the corresponding point must lie. The epipolar geometry depends only on the relative camera poses and their internal parameters; it does not depend on the scene structure. Geometrically, it is the geometry of the intersection between the image planes and the pencil of planes having the baseline as axis (the baseline is the line joining the two camera centers). The epipolar geometry is represented by a 3×3 matrix called the fundamental matrix F.
Formula (6), known as the epipolar constraint, holds for any point match corresponding to the same 3D point:

$$P_2^T F P_1 = 0 \qquad (6)$$

where P1 and P2 are the matched feature points in the two images, given in homogeneous pixel coordinates, and F is the fundamental matrix.
In this paper, we compute F from the ground-truth poses of each frame and use the resulting F_true for constraint verification. We can then obtain the transformation R|t by singular value decomposition (SVD) of the essential matrix E.
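A sketch of this computation, assuming world-to-camera ground-truth poses (R_i, t_i) and a known intrinsic matrix K (KITTI ground truth is given camera-to-world and would first need to be inverted):

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_poses(R1, t1, R2, t2, K):
    R = R2 @ R1.T            # relative rotation from camera 1 to camera 2
    t = t2 - R @ t1          # relative translation
    E = skew(t) @ R          # essential matrix
    K_inv = np.linalg.inv(K)
    return K_inv.T @ E @ K_inv

# Epipolar residual of a putative match, with p1, p2 homogeneous pixel
# coordinates: residual = p2 @ F_true @ p1; matches whose |residual|
# exceeds a threshold are rejected.
```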
The geometric interpretation of the four SVD solutions of the essential matrix is that there are four possible configurations of the two cameras and the spatial points that keep the projections unchanged. Because a 3D point observed by a camera cannot lie behind it, only one of the solutions places the majority of triangulated points in front of both cameras. We call this the triangulation constraint.
Feature matches satisfying both the epipolar constraint and the triangulation constraint are considered potentially correct matches.
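In OpenCV, the SVD decomposition of E and the triangulation (cheirality) test are bundled in cv2.recoverPose. The sketch below assumes pts1 and pts2 are Nx2 arrays of putatively matched pixel coordinates, K is the intrinsic matrix, and F_true and matches come from the previous steps:

```python
import cv2
import numpy as np

# E from the ground-truth fundamental matrix; recoverPose tries the four
# (R, t) candidates and keeps the one that puts the triangulated points in
# front of both cameras. `mask` flags the matches passing the test.
E = K.T @ F_true @ K
n_inliers, R, t, mask = cv2.recoverPose(E, pts1, pts2, K)
inlier_matches = [m for m, ok in zip(matches, mask.ravel()) if ok]
```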
IV. EXPERIMENTAL RESULTS
A. Datasets
We perform experiments on the KITTI visual odometry dataset (sequence 02, image_0), which contains 4661 frames.
B. Qualitative Results
When the search window for finding feature matches is extremely small, the ORB matches are mostly concentrated on the distant part of the scene, and their number is small, as shown in Fig. 3. In this case the system may resort to a homography matrix to describe the inter-frame motion, but a homography is clearly insufficient for a non-planar, complex road scene. In contrast, our method finds correct matches in both near and far parts of the scene even when the search range is narrowed.
Figure 3. Matching results when the search window size is 10 pixels. (a) The result of ORB, which finds 140 matches; (b) the result of our method, which finds 237 matches.
Figure 4. Validity of the constraints. Matches that pass both the epipolar constraint and the triangulation constraint are considered potentially true matches.
As shown in Fig. 4, we qualitatively verify the validity of the constraints. Under normal circumstances, the epipolar constraint can verify the correctness of the vast majority of points, but it is a necessary rather than sufficient condition for a true match: when the scene contains repetitive structure, some false matches still satisfy it. In these circumstances, the triangulation constraint can help. Matches satisfying both constraints are considered potentially correct matches.
Fig. 5 shows two dense optical flow fields. (a) is the Lucas-Kanade [16] optical flow computed at each pixel, and (b) shows the matches in the LK dense flow field that satisfy the epipolar constraint; (c) is the improved Horn-Schunck [17] optical flow field, displayed using the Munsell color system, and (d) shows the matches in the HS dense flow field that satisfy the epipolar constraint.
From the figure we can see that the variation between adjacent pixels is small and the error of the computed flow is limited, as evidenced by the fact that most of the points satisfy the epipolar constraint. Accordingly, we can use this dense optical flow field to provide an initial position estimate for subsequent feature matching.
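To make this step concrete, here is a minimal sketch of flow-guided matching, hypothetical rather than the paper's exact implementation: each keypoint is advanced by the dense flow, and descriptor comparison is restricted to a small window around the predicted location.

```python
import cv2
import numpy as np

def flow_guided_match(kps1, des1, kps2, des2, flow, radius=10, max_dist=64):
    """Match binary descriptors inside a flow-predicted search window (sketch)."""
    pts2 = np.float32([kp.pt for kp in kps2])
    matches = []
    for i, kp in enumerate(kps1):
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        dx, dy = flow[y, x]                         # flow-predicted displacement
        pred = np.float32([kp.pt[0] + dx, kp.pt[1] + dy])
        # candidate keypoints of the next frame inside the search window
        cand = np.where(np.linalg.norm(pts2 - pred, axis=1) < radius)[0]
        if cand.size == 0:
            continue
        # Hamming distance between binary (BRIEF) descriptors
        dists = [cv2.norm(des1[i], des2[j], cv2.NORM_HAMMING) for j in cand]
        best = int(np.argmin(dists))
        if dists[best] < max_dist:
            matches.append(cv2.DMatch(i, int(cand[best]), float(dists[best])))
    return matches
```

Without the flow prior, the same radius would have to be centered on the keypoint's old position, which is exactly the small-window failure mode discussed in the introduction.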
C. Quantitative Results
As shown in Fig. 6, we compute the average number of matched points under different search window sizes. The blue solid line and blue dotted line show the number of ORB matches and the number of ORB matches satisfying both constraints, respectively, while the green dotted line shows the number of ORB matches satisfying only the epipolar constraint. Similarly, the red solid line and red dotted line show the number of matches of our method and the number satisfying both constraints, and the black dotted line shows the number of our matches satisfying only the epipolar constraint.
From the results we can see that as the radius of the search window decreases, the number of ORB matches drops dramatically. In contrast, both the number of matched points and the number of potentially true matches decrease only slowly with our method. Even with a search radius as small as 10 pixels, our approach still finds more than 230 true matches, whilst the original approach finds fewer than 150 matches.
Figure 5. The dense optical flow fields and the matches satisfying the epipolar constraint. (a), (b): the LK dense optical flow; (c), (d): the improved HS optical flow.
Figure 6. The number of matched points for different search window sizes. The abscissa is the search window size in pixels; the ordinate is the number of matched points.
Figure 7. The matching error for different search window sizes. The abscissa is the search window size in pixels; the ordinate is the error of the matched points in squared pixels.
As shown in Fig. 7, we compute error statistics for the matches that do not satisfy the epipolar constraint. More specifically, we compute the average squared distance between the matched points and the corresponding epipolar lines in adjacent frames. The error of our method is clearly lower than that of ORB at the same window size, which demonstrates that the matching accuracy of our method is superior.
V. CONCLUSION
In this paper, we propose a novel approach for monocular visual odometry that combines ORB features with dense optical flow. Compared with the traditional ORB approach, our method not only yields more matches but also achieves higher matching accuracy when the search range is narrowed.
REFERENCES
[1] H. P. Moravec. Obstacle Avoidance and Navigation in the Real World by a Seeing Robot Rover. Stanford University, 1980.
[2] M. Irani and P. Anandan. "About Direct Methods." Vision Algorithms: Theory and Practice, 1999, pp. 267-277.
[3] P. H. S. Torr and A. Zisserman. "Feature Based Methods for Structure and Motion Estimation." Vision Algorithms: Theory and Practice, Springer Berlin Heidelberg, 2000, pp. 278-294.
[4] A. J. Davison et al. "MonoSLAM: Real-Time Single Camera SLAM." IEEE Transactions on Pattern Analysis & Machine Intelligence, 29.6 (2007): 1052.
[5] G. Klein and D. Murray. "Parallel Tracking and Mapping for Small AR Workspaces." IEEE and ACM International Symposium on Mixed and Augmented Reality, 2008, pp. 1-10.
[6] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. "ORB-SLAM: A Versatile and Accurate Monocular SLAM System." IEEE Transactions on Robotics, 31.5 (2015): 1147-1163.
[7] J. Engel, V. Koltun, and D. Cremers. "Direct Sparse Odometry." IEEE Transactions on Pattern Analysis & Machine Intelligence, PP.99 (2017): 1-1.
[8] C. Forster, M. Pizzoli, and D. Scaramuzza. "SVO: Fast Semi-Direct Monocular Visual Odometry." IEEE International Conference on Robotics and Automation, 2014, pp. 15-22.
[9] J. Engel, T. Schöps, and D. Cremers. "LSD-SLAM: Large-Scale Direct Monocular SLAM." Computer Vision - ECCV 2014, Springer International Publishing, 2014, pp. 834-849.
[10] T. Kroeger et al. "Fast Optical Flow Using Dense Inverse Search." European Conference on Computer Vision, Springer International Publishing, 2016, pp. 471-488.
[11] E. Rosten and T. Drummond. "Machine Learning for High-Speed Corner Detection." European Conference on Computer Vision, Springer Berlin Heidelberg, 2006, pp. 430-443.
[12] M. Calonder et al. "BRIEF: Binary Robust Independent Elementary Features." European Conference on Computer Vision, Springer-Verlag, 2010, pp. 778-792.
[13] C. Harris. "A Combined Corner and Edge Detector." Proc. Alvey Vision Conference, 1988, pp. 147-151.
[14] E. Rublee et al. "ORB: An Efficient Alternative to SIFT or SURF." IEEE International Conference on Computer Vision, 2011, pp. 2564-2571.
[15] M. Calonder et al. "BRIEF: Binary Robust Independent Elementary Features." European Conference on Computer Vision, Springer-Verlag, 2010, pp. 778-792.
[16] B. D. Lucas and T. Kanade. "An Iterative Image Registration Technique with an Application to Stereo Vision." International Joint Conference on Artificial Intelligence, Morgan Kaufmann, 1981, pp. 674-679.
[17] B. K. P. Horn and B. G. Schunck. "Determining Optical Flow." Artificial Intelligence, 17.1-3 (1981): 185-203.
We propose a novel direct sparse visual odometry formulation. It combines a fully direct probabilistic model (minimizing a photometric error) with consistent, joint optimization of all model parameters, including geometry -- represented as inverse depth in a reference frame -- and camera motion. This is achieved in real time by omitting the smoothness prior used in other direct methods and instead sampling pixels evenly throughout the images. Since our method does not depend on keypoint detectors or descriptors, it can naturally sample pixels from across all image regions that have intensity gradient, including edges or smooth intensity variations on mostly white walls. The proposed model integrates a full photometric calibration, accounting for exposure time, lens vignetting, and non-linear response functions. We thoroughly evaluate our method on three different datasets comprising several hours of video. The experiments show that the presented approach significantly outperforms state-of-the-art direct and indirect methods in a variety of real-world settings, both in terms of tracking accuracy and robustness.