Foundations and Trends® in Computer Graphics and Vision
Vol. 4, No. 4 (2008) 287–404
© 2010 T. Moons, L. Van Gool and M. Vergauwen
DOI: 10.1561/0600000007
3D Reconstruction from Multiple Images
Part 1: Principles
By Theo Moons, Luc Van Gool, and
Maarten Vergauwen
Contents
1 Introduction to 3D Acquisition
1.1 A Taxonomy of Methods
1.2 Passive Triangulation
1.3 Active Triangulation
1.4 Other Methods
1.5 Challenges
1.6 Conclusions
2 Principles of Passive 3D Reconstruction
2.1 Introduction
2.2 Image Formation and Camera Model
2.3 The 3D Reconstruction Problem
2.4 The Epipolar Relation Between Two Images of a Static Scene
2.5 Two Image-Based 3D Reconstruction Up-Close
2.6 From Projective to Metric Using More Than Two Images
2.7 Some Important Special Cases
Bibliography
References
3D Reconstruction from Multiple Images
Part 1: Principles
Theo Moons¹, Luc Van Gool²,³, and Maarten Vergauwen⁴
¹ Hogeschool — Universiteit Brussel, Stormstraat 2, Brussel, B-1000, Belgium, Theo.Moons@hubrussel.be
² Katholieke Universiteit Leuven, ESAT — PSI, Kasteelpark Arenberg 10, Leuven, B-3001, Belgium, Luc.VanGool@esat.kuleuven.be
³ ETH Zurich, BIWI, Sternwartstrasse 7, Zurich, CH-8092, Switzerland, vangool@vision.ee.ethz.ch
⁴ GeoAutomation NV, Karel Van Lotharingenstraat 2, Leuven, B-3000, Belgium, maarten.vergauwen@geoautomation.com
Abstract
This issue discusses methods to extract three-dimensional (3D) models
from plain images. In particular, the 3D information is obtained from
images for which the camera parameters are unknown. The principles
underlying such uncalibrated structure-from-motion methods are out-
lined. First, a short review of 3D acquisition technologies puts such
methods in a wider context and highlights their important advan-
tages. Then, the actual theory behind this line of research is given. The
authors have tried to keep the text maximally self-contained, thereby also avoiding reliance on extensive knowledge of the projective concepts that usually appear in texts about self-calibrating 3D methods.
Rather, mathematical explanations that are more amenable to intu-
ition are given. The explanation of the theory includes the stratification
of reconstructions obtained from image pairs as well as metric recon-
struction on the basis of more than two images combined with some
additional knowledge about the cameras used. Readers who want to
obtain more practical information about how to implement such uncal-
ibrated structure-from-motion pipelines may be interested in two more
Foundations and Trends issues written by the same authors. Together
with this issue they can be read as a single tutorial on the subject.
Preface
Welcome to this Foundations and Trends tutorial on three-
dimensional (3D) reconstruction from multiple images. The focus is on
the creation of 3D models from nothing but a set of images, taken from
unknown camera positions and with unknown camera settings. In this
issue, the underlying theory for such “self-calibrating” 3D reconstruc-
tion methods is discussed. Of course, the text cannot give a complete
overview of all aspects that are relevant. That would mean dragging
in lengthy discussions on feature extraction, feature matching, track-
ing, texture blending, dense correspondence search, etc. Nonetheless,
we tried to keep at least the geometric aspects of the self-calibration
reasonably self-contained and this is where the focus lies.
The issue consists of two main parts, organized in separate sections.
Section 1 places the subject of self-calibrating 3D reconstruction from
images in the wider context of 3D acquisition techniques. This sec-
tion thus also gives a short overview of alternative 3D reconstruction
techniques, as the uncalibrated structure-from-motion approach is not
necessarily the most appropriate one for all applications. This helps to
bring out the pros and cons of this particular approach.
Section 2 starts the actual discussion of the topic. With images as
our key input for 3D reconstruction, this section first discusses how we
can mathematically model the process of image formation by a camera,
and which parameters are involved. Equipped with that camera model,
it then discusses the process of self-calibration for multiple cameras
from a theoretical perspective. It deals with the core issues of this tuto-
rial: given images and incomplete knowledge about the cameras, what
can we still retrieve in terms of 3D scene structure and how can we
make up for the missing information. This section also describes cases
in between fully calibrated and uncalibrated reconstruction. Breaking
a bit with tradition, we have tried to describe the whole self-calibration
process in intuitive, Euclidean terms. We have avoided the usual expla-
nation via projective concepts, as we believe that entities like the dual
of the projection of the absolute quadric are not very amenable to
intuition.
Readers who are interested in implementation issues and a prac-
tical example of a self-calibrating 3D reconstruction pipeline may be
interested in two complementary, upcoming issues by the same authors,
which together with this issue can be read as a single tutorial.
1 Introduction to 3D Acquisition
This section discusses different methods for capturing or ‘acquiring’
the three-dimensional (3D) shape of surfaces and, in some cases, also
the distance or ‘range’ of the object to the 3D acquisition device. The
section aims at positioning the methods discussed in the sequel of the
tutorial within this more global context. This will make clear that alter-
native methods may actually be better suited for some applications
that need 3D. This said, the discussion will also show that the kind of
approach described here is one of the more flexible and powerful ones.
1.1 A Taxonomy of Methods
A 3D acquisition taxonomy is given in Figure 1.1. A first distinction is
between active and passive methods. With active techniques the light
sources are specially controlled, as part of the strategy to arrive at the
3D information. Active lighting incorporates some form of temporal or
spatial modulation of the illumination. With passive techniques, on the other hand, the light is either not controlled at all or controlled only with respect to image quality. Typically, passive techniques work with whatever reasonable ambient light is available. From a computational point of view, active methods
Fig. 1.1 Taxonomy of methods for the extraction of information on 3D shape.
tend to be less demanding, as the special illumination is used to simplify
some of the steps in the 3D capturing process. Their applicability is
restricted to environments where the special illumination techniques
can be applied.
A second distinction concerns the number of vantage points from which the scene is observed and/or illuminated. With single-vantage
methods the system works from a single vantage point. In case there
are multiple viewing or illumination components, these are positioned
very close to each other, and ideally they would coincide. The latter
can sometimes be realized virtually, through optical means like semi-
transparent mirrors. With multi-vantage systems, several viewpoints
and/or controlled illumination source positions are involved. For multi-
vantage systems to work well, the different components often have to
be positioned far enough from each other. One says that the ‘baseline’
between the components has to be wide enough. Single-vantage meth-
ods have as advantages that they can be made compact and that they
do not suffer from the occlusion problems that occur when parts of the
scene are not visible from all vantage points in multi-vantage systems.
The methods mentioned in the taxonomy will now be discussed in
a bit more detail. In the remaining sections, we then continue with
the more elaborate discussion of passive, multi-vantage structure-from-
motion (SfM) techniques, the actual subject of this tutorial. As this
overview of 3D acquisition methods is intended to be neither in-depth nor exhaustive, but merely to provide some context for our subsequent account of image-based 3D reconstruction from uncalibrated images, we do not
include references in this part.
1.2 Passive Triangulation
Several multi-vantage approaches use the principle of triangulation
for the extraction of depth information. This also is the key concept
exploited by the self-calibrating structure-from-motion (SfM) methods
described in this tutorial.
1.2.1 (Passive) Stereo
Suppose we have two images, taken at the same time and from differ-
ent viewpoints. Such a setting is referred to as stereo. The situation is
illustrated in Figure 1.2. The principle behind stereo-based 3D recon-
struction is simple: given the two projections of the same point in the
world onto the two images, its 3D position is found as the intersection
of the two projection rays. Repeating such a process for several points
Fig. 1.2 The principle behind stereo-based 3D reconstruction is very simple: given two
images of a point, the point’s position in space is found as the intersection of the two
projection rays. This procedure is referred to as ‘triangulation’.
yields the 3D shape and configuration of the objects in the scene. Note
that this construction — referred to as triangulation — requires the
equations of the rays and, hence, complete knowledge of the cameras:
their (relative) positions and orientations, but also their settings like
the focal length. These camera parameters will be discussed in Sec-
tion 2. The process to determine these parameters is called (camera)
calibration.
Moreover, in order to perform this triangulation process, one needs
ways of solving the correspondence problem, i.e., finding the point in
the second image that corresponds to a specific point in the first image,
or vice versa. Correspondence search actually is the hardest part of
stereo, and one would typically have to solve it for many points. Often
the correspondence problem is solved in two stages. First, correspon-
dences are sought for those points for which this is easiest. Then, corre-
spondences are sought for the remaining points. This will be explained
in more detail in subsequent sections.
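To make the triangulation step concrete, here is a minimal sketch (in Python/NumPy, with names of our own choosing) that recovers a 3D point from two projection rays. It assumes the camera centers and ray directions are already known, i.e., exactly the calibration knowledge discussed above; with noisy data the two rays are generally skew, so the midpoint of their shortest connecting segment is returned.

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Return the 3D point closest to two projection rays.

    c1, c2 : 3-vectors, the camera centers (ray origins).
    d1, d2 : 3-vectors, the ray directions (need not be unit length).
    The rays are x = c1 + s*d1 and x = c2 + t*d2; with noise they do not
    intersect exactly, so the midpoint of the shortest connecting segment
    is taken as the triangulated point.
    """
    c1, d1 = np.asarray(c1, float), np.asarray(d1, float)
    c2, d2 = np.asarray(c2, float), np.asarray(d2, float)
    # Solve for s, t minimizing ||(c1 + s*d1) - (c2 + t*d2)||^2.
    A = np.stack([d1, -d2], axis=1)           # 3x2 system matrix
    b = c2 - c1
    (s, t), *_ = np.linalg.lstsq(A, b, rcond=None)
    return 0.5 * ((c1 + s * d1) + (c2 + t * d2))

# Example: two cameras 1 m apart, both looking at the point (0, 0, 5).
X = triangulate_midpoint([0, 0, 0], [0, 0, 5],
                         [1, 0, 0], [-1, 0, 5])
print(X)   # approximately [0, 0, 5]
```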
1.2.2 Structure-from-Motion
Passive stereo uses two cameras, usually synchronized. If the scene is
static, the two images could also be taken by placing the same cam-
era at the two positions, and taking the images in sequence. Clearly,
once such a strategy is considered, one may just as well take more than
two images, while moving the camera. Such strategies are referred to
as structure-from-motion or SfM for short. If images are taken over
short time intervals, it will be easier to find correspondences, e.g.,
by tracking feature points over time. Moreover, having more camera
views will yield object models that are more complete. Last but not
least, if multiple views are available, the camera(s) need no longer
be calibrated beforehand, and a self-calibration procedure may be
employed instead. Self-calibration means that the internal and exter-
nal camera parameters (cf. Section 2.2) are extracted from images of
the unmodified scene itself, and not from images of dedicated calibra-
tion patterns. These properties render SfM a very attractive 3D acqui-
sition strategy. A more detailed discussion is given in the following
sections.
1.3 Active Triangulation
Finding corresponding points can be facilitated by replacing one of the
cameras in a stereo setup by a projection device. Hence, we combine
one illumination source with one camera. For instance, one can project
a spot onto the object surface with a laser. The spot will be easily
detectable in the image taken by the camera. If we know the position
and orientation of both the laser ray and the camera projection ray,
then the 3D surface point is again found as their intersection. The
principle is illustrated in Figure 1.3 and is just another example of the
triangulation principle.
The problem is that knowledge about the 3D coordinates of one
point is hardly sufficient in most applications. Hence, in the case of
the laser, it should be directed at different points on the surface and
each time an image has to be taken. In this way, the 3D coordinates
of these points are extracted, one point at a time. Such a ‘scanning’
Fig. 1.3 The triangulation principle, used already with stereo, can also be used in an active configuration. The laser L projects a ray of light onto the object O. The intersection point P with the object is viewed by a camera and forms a spot on its image plane I. This information suffices for the computation of the three-dimensional coordinates of P, assuming that the laser-camera configuration is known.
process requires precise mechanical apparatus (e.g., by steering rotat-
ing mirrors that reflect the laser light into controlled directions). If
the equations of the laser rays are not known precisely, the resulting
3D coordinates will be imprecise as well. One would also not want the
system to take a long time for scanning. Hence, one ends up with the
conflicting requirements of guiding the laser spot precisely and fast.
These challenging requirements have an adverse effect on the price.
Moreover, the times needed to take one image per projected laser spot
add up to seconds or even minutes of overall acquisition time. A way
out is using special, super-fast imagers, but again at an additional cost.
In order to remedy this problem, substantial research has gone into
replacing the laser spot by more complicated patterns. For instance,
the laser ray can without much difficulty be extended to a plane, e.g.,
by putting a cylindrical lens in front of the laser. Rather than forming
a single laser spot on the surface, the intersection of the plane with the
surface will form a curve. The configuration is depicted in Figure 1.4.
The 3D coordinates of each of the points along the intersection curve
Fig. 1.4 If the active triangulation configuration is altered by turning the laser spot into
a line (e.g., by the use of a cylindrical lens), then scanning can be restricted to a one-
directional motion, transversal to the line.
can be determined again through triangulation, namely as the intersec-
tion of the plane with the viewing ray for that point. This still yields
a unique point in space. From a single image, many 3D points can be
extracted in this way. Moreover, the two-dimensional scanning motion
as required with the laser spot can be replaced by a much simpler
one-dimensional sweep over the surface with the laser plane.
It now stands to reason to try and eliminate any scanning alto-
gether. Is it not possible to directly go for a dense distribution of points
all over the surface? Unfortunately, extensions to the two-dimensional
projection patterns that are required are less straightforward. For
instance, when projecting multiple parallel lines of light simultaneously,
a camera viewing ray will no longer have a single intersection with such
a pencil of illumination planes. We would have to include some kind
of code into the pattern to make a distinction between the different
lines in the pattern and the corresponding projection planes. Note that
counting lines has its limitations in the presence of depth discontinu-
ities and image noise. There are different ways of including a code. An
obvious one is to give the lines different colors, but interference by the
surface colors may make it difficult to identify a large number of lines
in this way. Alternatively, one can project several stripe patterns in
sequence, giving up on using a single projection but still only using a
few. Figure 1.5 gives a (non-optimal) example of binary patterns. The
sequence of being bright or dark forms a unique binary code for each
column in the projector. Although one could project different shades
of gray, using binary (i.e., all-or-nothing black or white) type of codes
Fig. 1.5 A series of masks that can be projected for active stereo applications. Subsequent
masks contain ever finer stripes. Each of the masks is projected and for a point in the
scene the sequence of black/white values is recorded. The bits obtained that way characterize the horizontal position of the points, i.e., the plane of intersection (see text). The required resolution (related to the width of the thinnest stripes) determines how many such masks have to be used.
is beneficial for robustness. Nonetheless, so-called phase shift methods
successfully use a set of patterns with sinusoidally varying intensities
in one direction and constant intensity in the perpendicular direction
(i.e., a more gradual stripe pattern than in the previous example).
Each of the three sinusoidal patterns has the same amplitude but is
phase shifted by 120° with respect to the others. Intensity ratios in the
images taken under each of the three patterns yield a unique position
modulo the periodicity of the patterns. The sine patterns sum up to a
constant intensity, so adding the three images yields the scene texture.
The three subsequent projections yield dense range values plus texture.
An example result is shown in Figure 1.6. These 3D measurements have
been obtained with a system that works in real time (30 Hz depth +
texture).
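For concreteness, the following sketch decodes such a three-step phase-shift sequence per pixel. It is a generic textbook formulation, not the particular real-time system mentioned above; the function and variable names are ours.

```python
import numpy as np

def decode_three_step_phase(i1, i2, i3):
    """Per-pixel phase and texture from a three-step phase-shift sequence.

    i1, i2, i3 : images (arrays of equal shape) taken under sinusoidal
    patterns phase shifted by -120, 0 and +120 degrees, respectively.
    Returns (phase, texture): the phase in (-pi, pi], which encodes the
    position modulo the pattern period, and the average image, which is
    the scene texture because the three sine patterns sum to a constant.
    """
    i1, i2, i3 = (np.asarray(i, dtype=float) for i in (i1, i2, i3))
    phase = np.arctan2(np.sqrt(3.0) * (i1 - i3), 2.0 * i2 - i1 - i3)
    texture = (i1 + i2 + i3) / 3.0
    return phase, texture
```

Converting the recovered phase into depth still requires triangulation with the corresponding projector plane, as well as unwrapping the phase across pattern periods.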
One can also design more intricate patterns that contain local spa-
tial codes to identify parts of the projection pattern. An example
is shown in Figure 1.7. The figure shows a face on which the sin-
gle, checkerboard kind of pattern on the left is projected. The pat-
tern is such that each column has its own distinctive signature. It
consists of combinations of little white or black squares at the ver-
tices of the checkerboard squares. 3D reconstructions obtained with
this technique are shown in Figure 1.8. The use of this pattern only
requires the acquisition of a single image. Hence, continuous projection
Fig. 1.6 3D results obtained with a phase-shift system. Left: 3D reconstruction without
texture. Right: The same 3D reconstruction with texture, obtained by summing the three
images acquired with the phase-shifted sine projections.
Fig. 1.7 Example of one-shot active range technique. Left: The projection pattern allowing disambiguation of its different vertical columns. Right: The pattern is projected on a face.
Fig. 1.8 Two views of the 3D description obtained with the active method of Figure 1.7.
in combination with video input yields a 4D acquisition device that
can capture 3D shape (but not texture) and its changes over time. All
these approaches with specially shaped projected patterns are com-
monly referred to as structured light techniques.
1.4 Other Methods
With the exception of time-of-flight techniques, all other methods in the
taxonomy of Figure 1.1 are of less practical importance (yet). Hence, only time-of-flight is discussed at somewhat greater length. For the other approaches, only their general principles are outlined.
1.4.1 Time-of-Flight
The basic principle of time-of-flight sensors is to measure how long it takes before an emitted, time-modulated signal — usually light from a laser — returns to the sensor. This travel time is proportional to the distance to the object. This is an active, single-vantage approach. Depending
on the type of waves used, one calls such devices radar (electromag-
netic waves of low frequency), sonar (acoustic waves), or optical radar
(optical electromagnetic waves, including near-infrared).
A first category uses pulsed waves and measures the delay between
the transmitted and the received pulse. These are the most often used
type. A second category is used for smaller distances and measures
phase shifts between outgoing and returning sinusoidal waves. The low
level of the returning signal and the high bandwidth required for detec-
tion put pressure on the signal to noise ratios that can be achieved.
Measurement problems and health hazards with lasers can be allevi-
ated by the use of ultrasound. The beam then has a much larger opening angle, however, and resolution decreases considerably.
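The relation between the measured delay (or phase) and the range underlying both categories can be written down directly; the small sketch below is a generic illustration, not a description of any particular sensor.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def range_from_pulse_delay(delay_s):
    """Pulsed time-of-flight: the signal travels to the object and back,
    so the distance is half the round-trip time times the wave speed."""
    return 0.5 * SPEED_OF_LIGHT * delay_s

def range_from_phase_shift(phase_rad, modulation_hz):
    """Continuous-wave time-of-flight: a phase shift of 2*pi corresponds
    to one modulation wavelength of round-trip travel, so the range is
    unambiguous only modulo c / (2 * modulation_hz)."""
    return SPEED_OF_LIGHT * phase_rad / (4.0 * math.pi * modulation_hz)

# A 10 ns round-trip delay corresponds to a range of about 1.5 m.
print(range_from_pulse_delay(10e-9))
```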
Mainly optical signal-based systems (typically working in the near-
infrared) represent serious competition for the methods mentioned
before. Such systems are often referred to as LIDAR (LIght Detec-
tion And Ranging) or LADAR (LAser Detection And Ranging, a term
more often used by the military, where wavelengths tend to be longer,
like 1,550 nm, in order to be invisible to night-vision goggles). As these sys-
tems capture 3D data point-by-point, they need to scan. Typically a
horizontal motion of the scanning head is combined with a faster ver-
tical flip of an internal mirror. Scanning can be a rather slow process,
even if at the time of writing there were already LIDAR systems on the
market that can measure 50,000 points per second. On the other hand,
LIDAR gives excellent precision at larger distances in comparison to
passive techniques, which start to suffer from limitations in image res-
olution. Typically, errors at tens of meters will be within a range of
a few centimeters. Triangulation-based techniques require quite some
baseline to achieve such small margins. A disadvantage is that surface
texture is not captured and that errors will be substantially larger for
dark surfaces, which reflect little of the incoming signal. Missing texture
can be resolved by adding a camera, as close as possible to the LIDAR
scanning head. But of course, even then the texture is not taken from
exactly the same vantage point. The output is typically delivered as a
massive, unordered point cloud, which may cause problems for further
processing. Moreover, LIDAR systems tend to be expensive.
More recently, 3D cameras have entered the market that use the same kind of time-of-flight principle, but acquire an entire 3D image at once. These cameras have been designed to yield real-time 3D measurements of smaller scenes, typically up to a couple of meters. So far, resolutions are still limited (in the order of 150 × 150 range values) and depth resolutions are only moderate (a couple of millimeters under ideal circumstances, but worse otherwise), yet this technology is making advances fast. It is expected that the price of such cameras will drop sharply soon, as some games console manufacturers plan to offer such cameras as input devices.
1.4.2 Shape-from-Shading and Photometric Stereo
We now discuss the remaining, active techniques in the taxonomy of
Figure 1.1.
‘Shape-from-shading’ techniques typically handle smooth, untex-
tured surfaces. Without the use of structured light or time-of-flight
methods these are difficult to handle. Passive methods like stereo may
find it difficult to extract the necessary correspondences. Yet, people
can estimate the overall shape quite well (qualitatively), even from a
single image and under uncontrolled lighting. That ability would earn it a place among the passive methods, but no computer algorithm today can achieve such performance. Yet, progress has been made under simplifying conditions. One can use directional lighting with known direction
and intensity. Hence, we have placed the method in the ‘active’ family
for now. Gray levels of object surface patches then convey information
on their 3D orientation. This process not only requires information on
the sensor-illumination configuration, but also on the reflection char-
acteristics of the surface. The complex relationship between gray levels
and surface orientation can theoretically be calculated in some cases —
e.g., when the surface reflectance is known to be Lambertian — but is
usually derived from experiments and then stored in ‘reflectance maps’
for table-lookup. For a Lambertian surface with known albedo and for
a known light source intensity, the angle between the surface normal
and the incident light direction can be derived. This yields surface nor-
mals that lie on a cone about the light direction. Hence, even in this
simple case, the normal of a patch cannot be derived uniquely from
its intensity. Therefore, information from different patches is combined
through extra assumptions on surface smoothness. Neighboring patches
can be expected to have similar normals. Moreover, for a smooth sur-
face the normals at the visible rim of the object can be determined
from their tangents in the image if the camera settings are known.
Indeed, the 3D normals are perpendicular to the plane formed by the
projection ray at these points and the local tangents to the boundary
in the image. This yields strong boundary conditions. Estimating the
lighting conditions is sometimes made part of the problem. This may
be very useful, as in cases where the light source is the sun. The light
is also not always assumed to be coming from a single direction. For
instance, some lighting models consist of both a directional component
and a homogeneous ambient component, where light is coming from all
directions in equal amounts. Surface interreflections are a complication
which these techniques so far cannot handle.
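To make the cone-of-normals argument explicit, the Lambertian case referred to above can be written as follows (the symbols are our own notation, not taken from the text): for a patch with albedo ρ, lit by a distant source of intensity L from unit direction l, the observed intensity is

I = ρ L (n · l) = ρ L cos θ,   hence   θ = arccos( I / (ρ L) ),

with n the unit surface normal. A single intensity measurement thus fixes only the angle θ between n and l, so n is constrained to a cone of half-angle θ around the light direction, which is why extra assumptions or extra light sources are needed.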
The need to combine normal information from different patches can
be reduced by using different light sources with different positions. The
light sources are activated one after the other. The subsequent observed
intensities for the surface patches yield only a single possible normal
orientation (notwithstanding noise in the intensity measurements).
For a Lambertian surface, three different lighting directions suffice to
eliminate uncertainties about the normal direction. The three cones
intersect in a single line, which is the sought patch normal. Of course,
it still is a good idea to further improve the results, e.g., via smoothness
assumptions. Such a ‘photometric stereo’ approach is more stable than
shape-from-shading, but it requires a more controlled acquisition envi-
ronment. An example is shown in Figure 1.9. It shows a dome with 260
LEDs that is easy to assemble and disassemble (modular design, fitting
Fig. 1.9 (a) Mini-dome with different LED light sources, (b) scene with one of the LEDs
activated, (c) 3D reconstruction of a cuneiform tablet, without texture, and (d) the same
tablet with texture.
in a standard aircraft suitcase; see part (a) of the figure). The LEDs
are automatically activated in a predefined sequence. There is one over-
head camera. The resulting 3D reconstruction of a cuneiform tablet is
shown in Figure 1.9(c) without texture, and in (d) with texture.
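The photometric stereo computation described above amounts to a per-pixel least-squares problem once the light directions are known. The sketch below is a generic Lambertian formulation, not the software behind the mini-dome of Figure 1.9; all names are ours.

```python
import numpy as np

def photometric_stereo(intensities, light_dirs):
    """Lambertian photometric stereo.

    intensities : array of shape (k, h, w), one image per light source.
    light_dirs  : array of shape (k, 3), the corresponding (unit) light
                  directions, with k >= 3 and not all coplanar.
    Returns (normals, albedo): unit normals of shape (h, w, 3) and the
    albedo (scaled by the light intensity) of shape (h, w).
    """
    I = np.asarray(intensities, dtype=float)
    L = np.asarray(light_dirs, dtype=float)          # (k, 3)
    k, h, w = I.shape
    # For each pixel, I = L @ (albedo * n); solve in the least-squares sense.
    b = I.reshape(k, -1)                             # (k, h*w)
    g, *_ = np.linalg.lstsq(L, b, rcond=None)        # (3, h*w)
    albedo = np.linalg.norm(g, axis=0)               # (h*w,)
    normals = (g / np.maximum(albedo, 1e-12)).T      # (h*w, 3)
    return normals.reshape(h, w, 3), albedo.reshape(h, w)
```

The recovered normal field can then be integrated into a surface, as mentioned further on in this subsection.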
As with structured light techniques, one can try to reduce the num-
ber of images that have to be taken, by giving the light sources differ-
ent colors. The resulting mix of colors at a surface patch yields direct
information about the surface normal. If three projections suffice, one
can exploit the R-G-B channels of a normal color camera. It is like
taking three intensity images in parallel, one per spectral band of the
camera.
Note that none of the above techniques yield absolute depths, but
rather surface normal directions. These can be integrated into full
3D models of shapes.
1.4.3 Shape-from-Texture and Shape-from-Contour
Passive single vantage methods include shape-from-texture and shape-
from-contour. These methods do not yield true range data, but, as in
the case of shape-from-shading, only surface orientation.
Shape-from-texture assumes that a surface is covered by a homo-
geneous texture (i.e., a surface pattern with some statistical or geo-
metric regularity). Local inhomogeneities of the imaged texture (e.g.,
anisotropy in the statistics of edge orientations for an isotropic tex-
ture, or deviations from assumed periodicity) are regarded as the result
of projection. Surface orientations which allow the original texture to
be maximally isotropic or periodic are selected. Figure 1.10 shows an
Fig. 1.10 Left: The regular texture yields a clear perception of a curved surface. Right: The
result of a shape-from-texture algorithm.
example of a textured scene. The impression of an undulating surface
is immediate. The right-hand side of the figure shows the results for
a shape-from-texture algorithm that uses the regularity of the pattern
for the estimation of the local surface orientation. Actually, what is
assumed here is a square shape of the pattern’s period (i.e., a kind
of discrete isotropy). This assumption suffices to calculate the local
surface orientation. The ellipses represent circles with such calculated
orientation of the local surface patch. The small stick at their center
shows the computed normal to the surface.
Shape-from-contour makes similar assumptions about the true
shape of, usually planar, objects. Observing an ellipse, the assumption
can be made that it actually is a circle, and the slant and tilt angles of
the plane can be determined. For instance, in the shape-from-texture
figure we have visualized the local surface orientation via ellipses. This
3D impression is compelling, because we tend to interpret the elliptical
shapes as projections of what in reality are circles. This is an exam-
ple of shape-from-contour as applied by our brain. The circle–ellipse
relation is just a particular example, and more general principles have
been elaborated in the literature. An example is the maximization of
area over perimeter squared, as a measure of shape compactness, over
all possible deprojections, i.e., surface patch orientations. Returning
to our example, an ellipse would be deprojected to a circle for this
measure, consistent with human vision. Similarly, symmetries in the
original shape will get lost under projection. Choosing the slant and
tilt angles that maximally restore symmetry is another example of a
criterion for determining the normal to the shape. As a matter of fact,
the circle–ellipse case also is an illustration for this measure. Regular
figures with at least a 3-fold rotational symmetry yield a single orien-
tation that could make up for the deformation in the image, except
for the mirror reversal with respect to the image plane (assuming that
perspective distortions are too small to be picked up). This is but a
special case of the more general result, that a unique orientation (up to
mirror reflection) also results when two copies of a shape are observed
in the same plane (except when their orientations differ by 0° or 180°, in which case nothing can be said on the mere assumption
that both shapes are identical). Both cases are more restrictive than
skewed mirror symmetry (without perspective effects), which yields a
one-parameter family of solutions only.
1.4.4 Shape-from-Defocus
Cameras have a limited depth-of-field. Only points at a particular
distance will be imaged with a sharp projection in the image plane.
Although often a nuisance, this effect can also be exploited because it
yields information on the distance to the camera. The level of defocus
has already been used to create depth maps. As points can be blurred
because they are closer or farther from the camera than at the position
of focus, shape-from-defocus methods will usually combine more than
a single image, taken from the same position but with different focal
lengths. This should disambiguate the depth.
1.4.5 Shape-from-Silhouettes
Shape-from-silhouettes is a passive, multi-vantage approach. Suppose
that an object stands on a turntable. At regular rotational intervals
an image is taken. In each of the images, the silhouette of the object
is determined. Initially, one has a virtual lump of clay, larger than the
object and fully containing it. From each camera orientation, the silhou-
ette forms a cone of projection rays, for which the intersection with this
virtual lump is calculated. The result of all these intersections yields an
approximate shape, a so-called visual hull. Figure 1.11 illustrates the
process.
One has to be careful that the silhouettes are extracted with good
precision. A way to ease this process is by providing a simple back-
ground, like a homogeneous blue or green cloth (‘blue keying’ or ‘green
keying’). Once a part of the lump has been removed, it can never be
retrieved in straightforward implementations of this idea. Therefore,
more refined, probabilistic approaches have been proposed to fend off
such dangers. Also, cavities that do not show up in any silhouette will
not be removed. For instance, the eye sockets in a face will not be
detected with such method and will remain filled up in the final model.
This can be solved by also extracting stereo depth from neighboring
Fig. 1.11 The first three images show different backprojections from the silhouette of a
teapot in three views. The intersection of these backprojections forms the visual hull of the
object, shown in the bottom right image. The more views are taken, the closer the visual
hull approaches the true shape, but cavities not visible in the silhouettes are not retrieved.
viewpoints and by combining the 3D information coming from both
methods.
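The silhouette-carving idea just described can be sketched as a simple loop over a block of candidate voxels: a voxel survives only if it projects inside the silhouette in every view. The projection function and the silhouette masks are assumed to be given (e.g., from calibrated or self-calibrated cameras); the names below are ours.

```python
import numpy as np

def carve_visual_hull(voxels, silhouettes, project):
    """Approximate the visual hull by silhouette carving.

    voxels      : (n, 3) array of candidate 3D points (the 'lump of clay').
    silhouettes : list of boolean images, one per view (True = object).
    project     : project(view_index, points) -> (n, 2) pixel coordinates
                  (x, y), assumed available from the calibrated cameras.
    Returns a boolean mask over the voxels: True = inside the visual hull.
    """
    inside = np.ones(len(voxels), dtype=bool)
    for i, sil in enumerate(silhouettes):
        h, w = sil.shape
        px = np.round(project(i, voxels)).astype(int)
        x, y = px[:, 0], px[:, 1]
        visible = (x >= 0) & (x < w) & (y >= 0) & (y < h)
        # A voxel is carved away as soon as it falls outside one silhouette
        # (or outside the image); cavities never seen in any silhouette
        # remain filled, as noted in the text.
        inside &= visible & sil[np.clip(y, 0, h - 1), np.clip(x, 0, w - 1)]
    return inside
```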
The hardware needed is minimal, and very low-cost shape-from-
silhouette systems can be produced. If multiple cameras are placed
around the object, the images can be taken all at once and the capture
time can be reduced. This will increase the price, and also the silhouette
extraction may become more complicated. If video cameras
are used, a dynamic scene like a moving person can be captured in 3D
over time (but note that synchronization issues are introduced). An
example is shown in Figure 1.12, where 15 video cameras were set up
in an outdoor environment.
Of course, in order to extract precise cones for the intersection, the
relative camera positions and their internal settings have to be known
precisely. This can be achieved with the same self-calibration methods
expounded in the following sections. Hence, also shape-from-silhouettes
can benefit from the presented ideas and this is all the more interesting
Fig. 1.12 (a) Fifteen cameras set up in an outdoor environment around a person, (b) a more detailed view of the visual hull at a specific moment of the action, (c) a detailed view of the visual hull textured by backprojecting the image colors, and (d) another view of the
visual hull with backprojected colors. Note how part of the sock area has been erroneously
carved away.
as this 3D extraction approach is among the most practically relevant
ones for dynamic scenes (‘motion capture’).
1.4.6 Hybrid Techniques
The aforementioned techniques often have complementary strengths
and weaknesses. Therefore, several systems try to exploit multiple tech-
niques in conjunction. A typical example is the combination of shape-
from-silhouettes with stereo as already hinted in the previous section.
Both techniques are passive and use multiple cameras. The visual hull
produced from the silhouettes provides a depth range in which stereo
can try to refine the surfaces in between the rims, in particular at the
cavities. Similarly, one can combine stereo with structured light. Rather
than trying to generate a depth map from the images pure, one can
project a random noise pattern, to make sure that there is enough tex-
ture. As still two cameras are used, the projected pattern does not have
to be analyzed in detail. Local pattern correlations may suffice to solve
the correspondence problem. One can project in the near-infrared, to
simultaneously take color images and retrieve the surface texture with-
out interference from the projected pattern. So far, the problem with
this has often been the weaker contrast obtained in the near-infrared
band. Many such integrated approaches can be thought of.
This said, there is no single 3D acquisition system to date that can
handle all types of objects or surfaces. Transparent or glossy surfaces
(e.g., glass, metals), fine structures (e.g., hair or wires), and too weak,
too busy, or too repetitive surface textures (e.g., identical tiles on a
wall) may cause problems, depending on the system that is being used.
The next section discusses still existing challenges in a bit more detail.
1.5 Challenges
The production of 3D models has been a popular research topic already
for a long time now, and important progress has indeed been made since
the early days. Nonetheless, the research community is well aware of
the fact that still much remains to be done. In this section we list some
of these challenges.
As seen in the previous subsections, there is a wide variety of tech-
niques for creating 3D models, but depending on the geometry and
material characteristics of the object or scene, one technique may be
much better suited than another. For example, untextured objects are
a nightmare for traditional stereo, but too much texture may interfere
with the patterns of structured-light techniques. Hence, one would seem
to need a battery of systems to deal with the variability of objects —
e.g., in a museum — to be modeled. As a matter of fact, having to
model the entire collections of diverse museums is a useful application
area to think about, as it poses many of the pending challenges, often
several at once. Another area is 3D city modeling, which has quickly
grown in importance over the last years. It is another extreme in terms
of conditions under which data have to be captured, in that cities rep-
resent an absolutely uncontrolled and large-scale environment. Also in
that application area, many problems remain to be resolved.
Here is a list of remaining challenges, which we do not claim to be
exhaustive:
• Many objects have an intricate shape, the scanning of which
requires high precision combined with great agility of the
scanner to capture narrow cavities and protrusions, deal with
self-occlusions, fine carvings, etc.
• The types of objects and materials that potentially have to be
handled — think of the museum example — are very diverse,
like shiny metal coins, woven textiles, stone or wooden sculp-
tures, ceramics, gems in jewellery and glass. No single tech-
nology can deal with all these surface types and for some of
these types of artifacts there are no satisfactory techniques
yet. Also, apart from the 3D shape the material characteris-
tics may need to be captured as well.
• The objects to be scanned range from tiny ones like a needle
to an entire construction or excavation site, landscape, or
city. Ideally, one would handle this range of scales with the
same techniques and similar protocols.
• For many applications, data collection may have to be
undertaken on-site under potentially adverse conditions or
implying transportation of equipment to remote or harsh
environments.
• Objects are sometimes too fragile or valuable to be touched
and need to be scanned ‘hands-off’. The scanner needs to be
moved around the object, without it being touched, using
portable systems.
• Masses of data often need to be captured, like in the museum
collection or city modeling examples. More efficient data cap-
ture and model building are essential if this is to be practical.
• Those undertaking the digitization may or may not be tech-
nically trained. Not all applications are to be found in indus-
try, and technically trained personnel may very well not
be around. This raises the need for intelligent devices that
ensure high-quality data through (semi-)automation, self-
diagnosis, and effective guidance of the operator.
• In many application areas the money that can be spent is
very limited and solutions therefore need to be relatively
cheap.
• Also, precision is a moving target in many applications and
as higher precisions are achieved, new applications present
themselves that push for going even beyond. Analyzing the
3D surface of paintings to study brush strokes is a case in
point.
These considerations about the particular conditions under which
models may need to be produced lead to a number of desirable tech-
nological developments for 3D data acquisition:
• Combined extraction of shape and surface
reflectance. Increasingly, 3D scanning technology is
aimed at also extracting high-quality surface reflectance
information. Yet, there still is an appreciable way to go
before high-precision geometry can be combined with
detailed surface characteristics like full-fledged BRDF
(Bidirectional Reflectance Distribution Function) or BTF
(Bidirectional Texture Function) information.
• In-hand scanning. The first truly portable scanning sys-
tems are already around. But the choice is still restricted,
especially when also surface reflectance information is
required and when the method ought to work with all types
of materials, including metals, glass, etc. Also, transportable
here is supposed to mean more than ‘can be dragged between
places’, i.e., rather the possibility to easily move the system
around the object, ideally also by hand. But there also is the
interesting alternative to take the objects to be scanned in
one’s hands, and to manipulate them such that all parts get
exposed to the fixed scanner. This is not always a desirable
option (e.g., in the case of very valuable or heavy pieces), but
has the definite advantage of exploiting the human agility
in presenting the object and in selecting optimal, additional
views.
• On-line scanning. The physical action of scanning and the
actual processing of the data often still are two separate
steps. This may create problems in that the completeness and
quality of the result can only be inspected after the scanning
session is over and the data are analyzed and combined at the
lab or the office. It may then be too late or too cumbersome
to take corrective actions, like taking a few additional scans.
It would be very desirable if the system would extract the
3D data on the fly, and would give immediate visual feed-
back. This should ideally include steps like the integration
and remeshing of partial scans. This would also be a great
help in planning where to take the next scan during scanning.
A refinement can then still be performed off-line.
• Opportunistic scanning. Not a single 3D acquisition tech-
nique is currently able to produce 3D models of even a large
majority of exhibits in a typical museum. Yet, they often have
complementary strengths and weaknesses. Untextured sur-
faces are a nightmare for passive techniques, but may be ideal
for structured light approaches. Ideally, scanners would auto-
matically adapt their strategy to the object at hand, based
on characteristics like spectral reflectance, texture spatial
frequency, surface smoothness, glossiness, etc. One strategy
would be to build a single scanner that can switch strategy
on-the-fly. Such a scanner may consist of multiple cameras
and projection devices, and by today’s technology could still
be small and light-weight.
• Multi-modal scanning. Scanning may not only combine
geometry and visual characteristics. Additional features like
non-visible wavelengths (UV, (N)IR) could have to be cap-
tured, as well as haptic impressions. The latter would then
also allow for a full replay to the public, where audiences can
hold even the most precious objects virtually in their hands,
and explore them with all their senses.
• Semantic 3D. Gradually, computer vision is getting to the point where scene understanding becomes feasible. Out of 2D
images, objects and scene types can be recognized. This will
in turn have a drastic effect on the way in which ‘low’-level
processes can be carried out. If high-level, semantic interpre-
tations can be fed back into ‘low’-level processes like motion
and depth extraction, these can benefit greatly. This strat-
egy ties in with the opportunistic scanning idea. Recognizing
what it is that is to be reconstructed in 3D (e.g., a car and
its parts) can help a system to decide how best to proceed,
resulting in increased speed, robustness, and accuracy. It can
provide strong priors about the expected shape and surface
characteristics.
• Off-the-shelf components. In order to keep 3D modeling
cheap, one would ideally construct the 3D reconstruction sys-
tems on the basis of off-the-shelf, consumer products. At least
as much as possible. This does not only reduce the price, but
also lets the systems surf on a wave of fast-evolving, mass-
market products. For instance, the resolution of still, digital
cameras is steadily on the increase, so a system based on
such camera(s) can be upgraded to higher quality without
much effort or investment. Moreover, as most users will be
acquainted with such components, the learning curve to use
the system is probably not as steep as with a totally novel,
dedicated technology.
Obviously, once 3D data have been acquired, further process-
ing steps are typically needed. These entail challenges of their own.
Improvements in automatic remeshing and decimation are definitely
still possible. Also solving large 3D puzzles automatically, preferably
exploiting shape in combination with texture information, would be
something in high demand from several application areas. Level-of-
detail (LoD) processing is another example. All these can also be
expected to greatly benefit from a semantic understanding of the data.
Surface curvature alone is a weak indicator of the importance of a shape
feature in LoD processing. Knowing one is at the edge of a salient, func-
tionally important structure may be a much better reason to keep it in
at many scales.
1.6 Conclusions
Given the above considerations, the 3D reconstruction of shapes
from multiple, uncalibrated images is one of the most promising
3D acquisition techniques. In terms of our taxonomy of techniques,
self-calibrating structure-from-motion is a passive, multi-vantage
point strategy. It offers high degrees of flexibility in that one can
freely move a camera around an object or scene. The camera can be
hand-held. Most people have a camera and know how to use it. Objects
or scenes can be small or large, assuming that the optics and the
amount of camera motion are appropriate. These methods also give
direct access to both shape and surface reflectance information, where
both can be aligned without special alignment techniques. Efficient
implementations of several subparts of such Structure-from-Motion
pipelines have been proposed lately, so that the on-line application
of such methods is gradually becoming a reality. Also, the required
hardware is minimal, and in many cases consumer type cameras will
suffice. This keeps prices for data capture relatively low.
2 Principles of Passive 3D Reconstruction
2.1 Introduction
In this section the basic principles underlying self-calibrating, passive
3D reconstruction are explained. More specifically, the central goal
is to arrive at a 3D reconstruction from the uncalibrated image data
alone. But, to understand how three-dimensional (3D) objects can be
reconstructed from two-dimensional (2D) images, one first needs to
know how the reverse process works: i.e., how images of a 3D object
arise. Section 2.2 therefore discusses the image formation process in a
camera and introduces the camera model which will be used through-
out the text. As will become clear this model incorporates internal
and external parameters related to the technical specifications of the
camera(s) and their location with respect to the objects in the scene.
Subsequent sections then set out to extract 3D models of the scene
without prior knowledge of these parameters, i.e., without the need
to calibrate the cameras internally or externally first. This reconstruc-
tion problem is formulated mathematically in Section 2.3 and a solu-
tion strategy is initiated. The different parts in this solution of the 3D
reconstruction problem are elaborated in the following sections. Along
the way, fundamental notions such as the correspondence problem,
the epipolar relation, and the fundamental matrix of an image pair
are introduced (Section 2.4), and the possible stratification of the
reconstruction process into Euclidean, metric, affine, and projective
reconstructions is explained (Section 2.5). Furthermore, self-calibration
equations are derived and their solution is discussed in Section 2.6.
Apart from the generic case, special camera motions are considered as
well (Section 2.7). In particular, camera translation and camera rota-
tion are discussed. These often occur in practice, but their systems of
self-calibration equations or reconstruction equations become singular.
Attention is paid also to the case of internally calibrated cameras and
the important notion and use of the essential matrix is explored for
that case.
As already mentioned in the preface, we follow a particular route —
somewhat different than usual but hopefully all the more intuitive and
self-contained — to develop the different ideas and to extract the cor-
responding results. Nonetheless, there are numerous relevant papers
that present alternative and complementary approaches which are not
explicitly referenced in the text. We provide, therefore, a complete Bib-
liography of all the papers. This precedes the cited References and read-
ers who want to gain in-depth knowledge of the field are encouraged to
look also at those papers.
As a note on widely used terminology in this domain, the word
camera is often to be interpreted as a certain viewpoint and viewing
direction — a field of view or image — and if mention is made of a
first, second, . . . camera then this can just as well refer to the same
camera being moved around to a first, second, . . . position.
2.2 Image Formation and Camera Model
2.2.1 The Pinhole Camera
The simplest model of the image formation process in a camera is that
of a pinhole camera or camera obscura. The camera obscura is no more than a black box, one side of which is punctured to yield a small
hole. The rays of light from the outside world that pass through the
hole and fall on the opposite side of the box there form a 2D image of
the 3D environment outside the box (called the scene), as is depicted
Fig. 2.1 In a pinhole camera or camera obscura an image of the scene is formed by the rays
of light that are reflected by the objects in the scene and fall through the center of projection
onto the opposite wall of the box, forming a photo-negative image of the scene. The photo-
positive image of the scene corresponds to the projection of the scene onto a hypothetical
image plane situated in front of the camera. It is this hypothetical plane which is typically
used in computer vision, in order to avoid sign reversals.
in Figure 2.1. Some art historians believe that the painter Vermeer
actually used a room-sized version of a camera obscura. Observe that
this pinhole image actually is the photo-negative image of the scene.
The photo-positive image one observes when watching a photograph
or a computer screen corresponds to the projection of the scene onto a
hypothetical plane that is situated in front of the camera obscura at the
same distance from the hole as the opposite wall on which the image is
actually formed. In the sequel, the term image plane will always refer to
this hypothetical plane in front of the camera. This hypothetical plane
is preferred to avoid sign reversals in the computations. The distance
between the center of projection (the hole) and the image plane is called
the focal length of the camera.
The amount of light that falls into the box through the small hole
is very limited. One can increase this amount of light by making the
hole bigger, but then rays coming from different 3D points can fall
onto the same point on the image, thereby causing blur. One way of
getting around this problem is by making use of lenses, which focus the
light. Apart from the introduction of geometric and chromatic aber-
rations, even the most perfect lens will come with a limited depth-of-
field. This means that only scene points within a limited depth range
are imaged sharply. Within that depth range the camera with lens
basically behaves like the pinhole model. The ‘hole’ in the box will
in the sequel be referred to as the center of projection or the camera
center, and the type of projection realized by this idealized model is
referred to as perspective projection.
It has to be noted that whereas in principle a single convex lens
might be used, real camera lenses are composed of multiple lenses, in
order to reduce deviations from the ideal model (i.e., to reduce the
aforementioned aberrations). A detailed discussion on this important
optical component is out of the scope of this tutorial, however.
2.2.2 Projection Equations for a Camera-Centered
Reference Frame
To translate the image formation process into mathematical formulas
we first introduce a reference frame for the 3D environment (also called
the world) containing the scene. The easiest is to fix it to the camera.
Figure 2.2 shows such a camera-centered reference frame. It is a right-
handed and orthonormal reference frame whose origin is at the center
of projection. Its Z-axis is the principal axis of the camera — i.e.,
the line through the center of projection and orthogonal to the image
Fig. 2.2 The camera-centered reference frame is fixed to the camera and aligned with its intrinsic directions, of which the principal axis is one. The coordinates of the projection m of a scene point M onto the image plane in a pinhole camera model with a camera-centered reference frame, as expressed by equation (2.1), are given with respect to the principal point p in the image.
plane — and the XY-plane is the plane through the center of projection and parallel to the image plane. The image plane is the plane with equation Z = f, where f denotes the focal length of the camera. The principal axis intersects the image plane in the principal point p.
The camera-centered reference frame induces an orthonormal uv
reference frame in the image plane, as depicted in Figure 2.2. The
image of a scene point M is the point m where the line through M and the origin of the camera-centered reference frame intersects the image plane. If M has coordinates (X, Y, Z) ∈ R³ with respect to the camera-centered reference frame, then an arbitrary point on the line through the origin and the scene point M has coordinates ρ (X, Y, Z) for some real number ρ. The point of intersection of this line with the image plane must satisfy the relation ρ Z = f, or equivalently, ρ = f / Z. Hence, the image m of the scene point M has coordinates (u, v, f), where

u = f X / Z   and   v = f Y / Z.   (2.1)
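As a small numerical illustration of equation (2.1), the sketch below (with names of our own choosing) projects points expressed in the camera-centered frame, assuming Z > 0, i.e., points in front of the camera.

```python
import numpy as np

def project_pinhole(points_cam, f):
    """Perspective projection of equation (2.1).

    points_cam : (n, 3) array of scene points (X, Y, Z) expressed in the
                 camera-centered reference frame, with Z > 0.
    f          : focal length, in the same metric units.
    Returns (n, 2) image coordinates (u, v), measured from the principal point.
    """
    P = np.asarray(points_cam, dtype=float)
    X, Y, Z = P[:, 0], P[:, 1], P[:, 2]
    return np.stack([f * X / Z, f * Y / Z], axis=1)

# A point at depth 2f projects halfway between the principal point and
# the projection of the same (X, Y) at depth f:
print(project_pinhole([[0.1, 0.2, 2.0]], f=1.0))   # [[0.05, 0.1]]
```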
Projections onto the image plane cannot be detected with infinite
precision. An image rather consists of physical cells capturing photons,
so-called picture elements, or pixels for short. Apart from some exotic
designs (e.g., hexagonal or log-polar cameras), these pixels are arranged
in a rectangular grid, i.e., according to rows and columns, as depicted
in Figure 2.3 (left).

Fig. 2.3 Left: In a digital image, the position of a point in the image is indicated by its
pixel coordinates. This corresponds to the way in which a digital image is read from a CCD.
Right: The coordinates (u, v) of the projection of a scene point in the image are defined
with respect to the principal point p. Pixel coordinates, on the other hand, are measured
with respect to the upper left corner of the image.

Pixel positions are typically indicated with a row and column number
measured with respect to the top left corner of the image. These
numbers are called the pixel coordinates of an image
point. We will denote them by (x, y), where the x-coordinate is mea-
sured horizontally and increasing to the right, and the y-coordinate is
measured vertically and increasing downwards. This choice has several
advantages:
• The way in which x- and y-coordinates are assigned to image
points corresponds directly to the way in which an image is
read out by several digital cameras with a CCD: starting at
the top left and reading line by line.
• The camera-centered reference frame for the world being
right-handed then implies that its Z-axis is pointing away
from the image into the scene (as opposed to into the camera).
Hence, the Z-coordinate of a scene point corresponds
to the "depth" of that point with respect to the camera,
which conceptually is nice because it is the big unknown to
be solved for in 3D reconstruction problems.
As a consequence, we are not so much interested in the metric coordinates
(u, v) indicating the projection m of a scene point M in the image
and given by formula (2.1), as in the corresponding row and column
numbers (x, y) of the underlying pixel. At the end of the day, it will
be these pixel coordinates to which we have access when analyzing the
image. Therefore we have to make the transition from (u, v)-coordinates
to pixel coordinates (x, y) explicit first.
In a camera-centered reference frame the X-axis is typically chosen
parallel to the rows and the Y-axis parallel to the columns of the rectangular
grid of pixels. In this way, the u- and v-axes induced in the
image plane have the same direction and sense as those in which the
pixel coordinates x and y of image points are measured. But, whereas
pixel coordinates are measured with respect to the top left corner of
the image, (u, v)-coordinates are measured with respect to the principal
point p. The first step in the transition from (u, v)- to (x, y)-coordinates
for an image point m thus is to apply offsets to each coordinate. To this
end, denote by p_u and p_v the metric distances, measured in the horizontal
and vertical directions, respectively, of the principal point p from
the upper left corner of the image (see Figure 2.3 (right)). With the
top left corner of the image as origin, the principal point now has coordinates
(p_u, p_v) and the perspective projection m of the scene point M,
as described by formula (2.1), will have coordinates

$$\tilde{u} = \frac{fX}{Z} + p_u \quad \text{and} \quad \tilde{v} = \frac{fY}{Z} + p_v.$$
These (ũ, ṽ)-coordinates of the image point m are still expressed in the
metric units of the camera-centered reference frame. To convert them
to pixel coordinates, one has to divide ũ and ṽ by the width and the
height of a pixel, respectively. Let m_u and m_v be the inverse of respectively
the pixel width and height, so that m_u and m_v indicate how many
pixels fit into one horizontal, respectively vertical, metric unit. The pixel
coordinates (x, y) of the projection m of the scene point M in the image
are thus given by

$$x = m_u f \frac{X}{Z} + p_u \quad \text{and} \quad y = m_v f \frac{Y}{Z} + p_v;$$

or equivalently,

$$x = \alpha_x \frac{X}{Z} + p_x \quad \text{and} \quad y = \alpha_y \frac{Y}{Z} + p_y, \qquad (2.2)$$

where α_x = m_u f and α_y = m_v f are the focal length expressed in number
of pixels for the x- and y-direction of the image, and (p_x, p_y) are the
pixel coordinates of the principal point. The ratio α_y/α_x = m_v/m_u, giving the
ratio of the pixel width with respect to the pixel height, is called the
aspect ratio of the pixels.
2.2.3 A Matrix Expression for Camera-Centered Projection
More elegant expressions for the projection equations (2.2) are obtained
if one uses extended pixel coordinates for the image points. In particular,
if a point m with pixel coordinates (x, y) in the image is represented by
the column vector m = (x, y, 1)^T, then formula (2.2) can be rewritten as:

$$Z\,\mathbf{m} = Z \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} \alpha_x & 0 & p_x \\ 0 & \alpha_y & p_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}. \qquad (2.3)$$
Observe that, if one interprets the extended pixel coordinates (x, y, 1)^T
of the image point m as a vector indicating a direction in the world,
then, since Z describes the "depth" in front of the camera at which the
corresponding scene point M is located, the 3 × 3-matrix

$$\begin{pmatrix} \alpha_x & 0 & p_x \\ 0 & \alpha_y & p_y \\ 0 & 0 & 1 \end{pmatrix}$$

represents the transformation that converts world measurements
(expressed in meters, centimeters, millimeters, ...) into the pixel metric
of the digital image. This matrix is called the calibration matrix
of the camera, and it is generally represented as the upper triangular
matrix

$$K = \begin{pmatrix} \alpha_x & s & p_x \\ 0 & \alpha_y & p_y \\ 0 & 0 & 1 \end{pmatrix}, \qquad (2.4)$$

where α_x and α_y are the focal length expressed in number of pixels for
the x- and y-directions in the image respectively, and with (p_x, p_y) the
pixel coordinates of the principal point. The additional scalar s in the
calibration matrix K is called the skew factor and models the situation
in which the pixels are parallelograms (i.e., not rectangular). It also
yields an approximation to the situation in which the physical imaging
plane is not perfectly perpendicular to the optical axis of the lens or
objective (as was assumed above). In fact, s is inversely proportional to
the tangent of the angle between the X- and the Y-axis of the camera-centered
reference frame. Consequently, s = 0 for digital cameras with
rectangular pixels.
Together, the entries α_x, α_y, s, p_x, and p_y of the calibration
matrix K describe the internal behavior of the camera and are therefore
called the internal parameters of the camera. Furthermore, the
projection equations (2.3) of a pinhole camera with respect to a camera-centered
reference frame for the scene are compactly written as:

$$\rho\,\mathbf{m} = K\,\mathbf{M}, \qquad (2.5)$$

where M = (X, Y, Z)^T are the coordinates of a scene point M with respect
to the camera-centered reference frame for the scene, m = (x, y, 1)^T are
the extended pixel coordinates of its projection m in the image, and K is
the calibration matrix of the camera. Furthermore, ρ is a positive real
number which actually represents the "depth" of the scene point M
in front of the camera, because, due to the structure of the calibration
matrix K, the third row in the matrix equality (2.5) reduces to
ρ = Z. Therefore ρ is called the projective depth of the scene point M
corresponding to the image point m.
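As a small illustration of equation (2.5), the following Python/NumPy sketch
projects a point that is given in the camera-centered reference frame into pixel
coordinates and reads off its projective depth. The function name and the example
calibration values are hypothetical and only serve to make the formula concrete.

    import numpy as np

    def project_camera_centered(M, K):
        """Project a scene point M = (X, Y, Z), expressed in the camera-centered
        frame, to pixel coordinates via equation (2.5): rho * m = K * M."""
        m = K @ np.asarray(M, dtype=float)
        rho = m[2]                # projective depth; equals Z for K of form (2.4)
        return m[:2] / rho        # pixel coordinates (x, y)

    # Example (made-up) calibration matrix: focal length in pixels and principal point.
    # K = np.array([[1000.0,    0.0, 320.0],
    #               [   0.0, 1000.0, 240.0],
    #               [   0.0,    0.0,   1.0]])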
2.2.4 The General Linear Camera Model
When more than one camera is used, or when the objects in the scene
are to be represented with respect to another, non-camera-centered
reference frame (called the world frame), then the position and orientation
of the camera in the scene are described by a point C, indicating
the center of projection, and a 3 × 3-rotation matrix R indicating
the orientation of the camera-centered reference frame with respect to
the world frame. More precisely, the column vectors r_i of the rotation
matrix R are the unit direction vectors of the coordinate axes of the
camera-centered reference frame, as depicted in Figure 2.4. As C and
R represent the setup of the camera in the world space, they are called
the external parameters of the camera.
Fig. 2.4 The position and orientation of the camera in the scene are given by a position
vector C and a 3 × 3-rotation matrix R. The projection m of a scene point M is then given
by formula (2.6).

The coordinates of a scene point M with respect to the camera-centered
reference frame are found by projecting the relative position
vector M − C orthogonally onto each of the coordinate axes of the
camera-centered reference frame. The column vectors r_i of the rotation
matrix R being the unit direction vectors of the coordinate axes of the
camera-centered reference frame, the coordinates of M with respect to
the camera-centered reference frame are given by the dot products of
the relative position vector M − C with the unit vectors r_i; or equivalently,
by premultiplying the column vector M − C with the transpose
of the orientation matrix R, viz. R^T(M − C). Hence, following
formula (2.5), the projection m of the scene point M in the image is
given by the (general) projection equations:

$$\rho\,\mathbf{m} = K R^T (\mathbf{M} - \mathbf{C}), \qquad (2.6)$$
where M = (X, Y, Z)^T are the coordinates of a scene point M with respect
to an (arbitrary) world frame, m = (x, y, 1)^T are the extended pixel coordinates
of its projection m in the image, K is the calibration matrix of
the camera, C is the position and R is the rotation matrix expressing
the orientation of the camera with respect to the world frame, and ρ is
a positive real number representing the projective depth of the scene
point M with respect to the camera.
Many authors prefer to use extended coordinates for scene points as
well. So, if (X, Y, Z, 1)^T are the extended coordinates of the scene
point M = (X, Y, Z)^T, then the projection equations (2.6) become

$$\rho\,\mathbf{m} = \left( K R^T \mid -K R^T \mathbf{C} \right) \begin{pmatrix} \mathbf{M} \\ 1 \end{pmatrix}. \qquad (2.7)$$

The 3 × 4-matrix P = (K R^T | −K R^T C) is called the projection matrix
of the camera.
Notice that, if only the 3 × 4-projection matrix P is known, it is
possible to retrieve the internal and external camera parameters from it.
Indeed, as is seen from formula (2.7), the upper left 3 × 3-submatrix of
P is formed by multiplying K and R^T. Its inverse is R K^{-1}, since R is a
rotation matrix and thus R^T = R^{-1}. Furthermore, K is a non-singular
upper triangular matrix and so is K^{-1}. In particular, R K^{-1} is the product
of an orthogonal matrix and an upper triangular one. Recall from
linear algebra that every 3 × 3-matrix of maximal rank can uniquely be
decomposed as a product of an orthogonal and a non-singular, upper
triangular matrix with positive diagonal entries by means of the QR-decomposition [3]
(with Q the orthogonal and R the upper triangular
matrix). Hence, given the 3 × 4-projection matrix P of a pinhole camera,
the calibration matrix K and the orientation matrix R of the
camera can easily be recovered from the inverse of the upper left 3 × 3-submatrix
of P by means of QR-decomposition. If K and R are known,
then the center of projection C is found by premultiplying the fourth
column of P with the matrix −R K^{-1}.
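The recipe just described translates directly into a few lines of code. The
Python/NumPy sketch below (the function name is chosen here for illustration)
recovers K, R, and C from a given projection matrix P via QR-decomposition of
the inverse of its left 3 × 3-submatrix; it assumes that submatrix is non-singular
and that P is scaled such that a proper rotation (det R = +1) results.

    import numpy as np

    def decompose_projection_matrix(P):
        """Split P = (K R^T | -K R^T C) into K, R and C (a minimal sketch)."""
        M = P[:, :3]                               # M = K R^T
        Q, U = np.linalg.qr(np.linalg.inv(M))      # inv(M) = R K^{-1} = Q U
        # QR is unique only up to the signs of U's diagonal; make the diagonal
        # positive so that K comes out with positive focal lengths.
        S = np.diag(np.sign(np.diag(U)))
        Q, U = Q @ S, S @ U                        # Q U is unchanged (S S = I)
        R = Q                                      # orientation of the camera
        K = np.linalg.inv(U)
        K = K / K[2, 2]                            # fix the scale so K[2,2] = 1
        C = -np.linalg.inv(M) @ P[:, 3]            # since last column = -K R^T C
        return K, R, C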
The camera model of formula (2.7) is usually referred to as the
general linear camera model. Taking a close look at this formula shows
how general the camera projection matrix P = (K R^T | −K R^T C) is as
a matrix, in fact. Apart from the fact that the 3 × 3-submatrix on
the left has to be of full rank, one cannot demand more than that it
has to be QR-decomposable, which holds for any such matrix. The
attentive reader may now object that, according to formula (2.4), the
calibration matrix K must have entry 1 at the third position in the
last row, whereas there is no such constraint for the upper triangular
matrix in a QR-decomposition. This would seem a further restriction
on the left 3 × 3-submatrix of P, but it can easily be lifted by observing
that the camera projection matrix P is actually only determined up to
a scalar factor. Indeed, due to the non-zero scalar factor ρ in the left-hand
side of formula (2.7), one can always ensure this property to hold.
Put differently, any 3 × 4-matrix whose upper left 3 × 3-submatrix is
non-singular can be interpreted as the projection matrix of a (linear)
pinhole camera.
2.2.5 Non-linear Distortions
The perspective projection model described in the previous sections is
linear in the sense that the scene point, the corresponding image point
and the center of projection are collinear, and that straight lines in the
scene do generate straight lines in the image. Perspective projection
therefore only models the linear effects in the image formation pro-
cess. Images taken by real cameras, on the other hand, also experience
non-linear deformations or distortions which make the simple linear
pinhole model inaccurate. The most important and best known non-
linear distortion is radial distortion. Figure 2.5(left) shows an example
Fig. 2.5 Left: An image exhibiting radial distortion. The vertical wall at the left of the
building appears bent in the image and the gutter on the frontal wall on the right appears
curved too. Right: The same image after removal of the radial distortion. Straight lines in
the scene now appear as straight lines in the image as well.
of a radially distorted image. Radial distortion is caused by a system-
atic variation of the optical magnification when radially moving away
from a certain point, called the center of distortion. The larger the dis-
tance between an image point and the center of distortion, the larger
the effect of the distortion. Thus, the effect of the distortion is mostly
visible near the edges of the image. This can clearly be seen in Fig-
ure 2.5 (left). Straight lines near the edges of the image are no longer
straight but are bent. For practical use, the center of radial distortion
can often be assumed to coincide with the principal point, which usu-
ally also coincides with the center of the image. But it should be noted
that these are only approximations and dependent on the accuracy
requirements, a more precise determination may be necessary [7].
Radial distortion is a non-linear effect and is typically modeled using
a Taylor expansion. Typically, only the even order terms play a role
in this expansion, i.e., the effect is symmetric around the center. The
effect takes place in the lens, hence mathematically the radial distortion
should be between the external and internal parameters of the pinhole
model. The model we will propose here follows this strategy. Let us
define
ρmu=
mux
muy
1
=RT(MC),
2.2 Image Formation and Camera Model 327
where M=(X,Y,Z)Tare the coordinates of a scene point with respect
to the world frame. The distance rof the point mufrom the optical axis
is then
r2=m2
ux +m2
uy.
We now define m_d as

$$\mathbf{m}_d = \begin{pmatrix} m_{dx} \\ m_{dy} \\ 1 \end{pmatrix} = \begin{pmatrix} (1 + \kappa_1 r^2 + \kappa_2 r^4 + \kappa_3 r^6 + \ldots)\, m_{ux} \\ (1 + \kappa_1 r^2 + \kappa_2 r^4 + \kappa_3 r^6 + \ldots)\, m_{uy} \\ 1 \end{pmatrix}. \qquad (2.8)$$

The lower order terms of this expansion are the most important ones
and typically one does not compute more than three parameters (κ_1,
κ_2, κ_3). Finally the projection m of the 3D point M is:

$$\mathbf{m} = K\,\mathbf{m}_d. \qquad (2.9)$$
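For concreteness, the following Python/NumPy sketch applies equations (2.8)
and (2.9) to project a single 3D point with radial distortion. The function
name is hypothetical, and the sketch assumes the point lies in front of the
camera and that at most three distortion coefficients are used.

    import numpy as np

    def project_with_radial_distortion(M, K, R, C, kappa=(0.0, 0.0, 0.0)):
        """Project a 3D point M with the distortion model of (2.8)-(2.9),
        where the distortion acts between the external and internal parameters."""
        m_u = R.T @ (np.asarray(M, float) - C)    # camera-centered coordinates
        m_u = m_u / m_u[2]                        # normalize so the third entry is 1
        r2 = m_u[0]**2 + m_u[1]**2                # squared distance from the optical axis
        k1, k2, k3 = kappa
        factor = 1.0 + k1*r2 + k2*r2**2 + k3*r2**3
        m_d = np.array([factor*m_u[0], factor*m_u[1], 1.0])
        m = K @ m_d                               # pixel coordinates (homogeneous)
        return m[:2]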
When the distortion parameters are known, the image can be undis-
torted in order to make all lines straight again and thus make the lin-
ear pinhole model valid. The undistorted version of Figure 2.5 (left) is
shown in the same Figure 2.5 (right).
The model described above puts the radial distortion parameters
between the external and linear internal parameters of the camera. In
the literature one often finds that the distortion is put on the left of the
internal parameters, i.e., a 3D point is first projected into the image
via the linear model and then shifted. Conceptually the latter model is
less suited than the one used here because putting the distortion at the
end makes it dependent on the internal parameters, especially on the
focal length. This means one has to re-estimate the radial distortion
parameters every time the focal length changes. This is not necessary
in the model as suggested here. This, however, does not mean that this
model is perfect. In reality the center of radial distortion (now assumed
to be the principal point) sometimes changes when the focal length is
altered. This effect cannot be modeled with this approach.
In the remainder of the text we will assume that radial distortion
has been removed from all images if present, unless stated otherwise.
2.2.6 Explicit Camera Calibration
As explained before a perspective camera can be described by its
internal and external parameters. The process of determining these
parameters is known as camera calibration. Accordingly, we make a
distinction between internal calibration and external calibration, also
known as pose estimation. For the determination of all parameters one
often uses the term complete calibration. Traditional 3D passive recon-
struction techniques had a separate, explicit camera calibration step.
As highlighted before, the difference with self-calibration techniques as
explained in this tutorial is that in the latter the same images used for
3D scene reconstruction are also used for camera calibration.
Traditional internal calibration procedures [1, 17, 18] extract the
camera parameters from a set of known 3D–2D correspondences, i.e., a
set of 3D coordinates with corresponding 2D coordinates in the image
for their projections. In order to easily obtain such 3D–2D correspon-
dences, they employ special calibration objects, like the ones displayed
in Figure 2.6. These objects contain easily recognizable markers. Inter-
nal calibration sometimes starts by fitting a linearized calibration model
to the 3D–2D correspondences, which is then improved with a subse-
quent non-linear optimization step.
Some applications do not suffer from unknown scales or effects of
projection, but their results would deteriorate under the influence of
non-linear effects like radial distortion. It is possible to only undo such
distortion without having to go through the entire process of internal
calibration.

Fig. 2.6 Calibration objects: 3D (left) and 2D (right).

One way is by detecting structures that are known to be
straight, but appear curved under the influence of the radial distortion.
Then, for each line (e.g., the curved roof top in the example of Figure 2.5
(left)), points are sampled along it and a straight line is fitted through
these data. If we want the distortion to vanish, the error consisting
of the sum of the distances of each point to their line should be zero.
Hence a non-linear minimization algorithm like Levenberg–Marquardt
is applied to these data. The algorithm is initialized with the distortion
parameters set to zero. At every iteration, new values are computed for
these parameters, the points are warped accordingly and new lines are
fitted. The algorithm stops when it converges to a solution where all
selected lines are straight (i.e., the resulting error is close to zero). The
resulting unwarped image can be seen on the right in Figure 2.5.
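The line-straightening procedure just described can be sketched as a small
non-linear least-squares problem. The Python code below is only an illustration
under simplifying assumptions: the correcting warp is parameterized directly in
pixel coordinates around an assumed center of distortion, the straight line per
point set is refitted at every evaluation, and SciPy's Levenberg-Marquardt
solver plays the role of the minimization algorithm mentioned above. All
function names are my own.

    import numpy as np
    from scipy.optimize import least_squares

    def warp_points(pts, kappa, center):
        """Apply a polynomial radial warp around 'center' to pixel points."""
        d = pts - center
        r2 = np.sum(d**2, axis=1, keepdims=True)
        factor = 1.0 + kappa[0]*r2 + kappa[1]*r2**2 + kappa[2]*r2**3
        return center + d * factor

    def line_residuals(kappa, line_point_sets, center):
        """For each set of points sampled along a 'should-be-straight' line,
        warp the points and return their distances to the best-fit line."""
        res = []
        for pts in line_point_sets:
            w = warp_points(np.asarray(pts, float), kappa, center)
            w = w - w.mean(axis=0)
            _, _, vt = np.linalg.svd(w, full_matrices=False)
            normal = vt[-1]                 # normal of the fitted line
            res.extend(w @ normal)          # signed point-to-line distances
        return np.asarray(res)

    # Usage sketch: start from zero distortion and let Levenberg-Marquardt refine.
    # line_point_sets = [np.array([...]), ...]   # points sampled along curved edges
    # center = np.array([width/2.0, height/2.0]) # assumed center of distortion
    # sol = least_squares(line_residuals, x0=np.zeros(3),
    #                     args=(line_point_sets, center), method='lm')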
2.3 The 3D Reconstruction Problem
The aim of passive 3D reconstruction is to recover the geometric structure
of a (static) scene from one or more of its images: Given a point m
in an image, determine the point M in the scene of which m is the projection.
Or, in mathematical parlance, given the pixel coordinates (x, y) of
a point m in a digital image, determine the world coordinates (X, Y, Z)
of the scene point M of which m is the projection in the image.

Fig. 2.7 3D reconstruction from one image is an underdetermined problem: a point m in the
image can be the projection of any world point M along the projecting ray of m.

As can be observed from the schematic representation of Figure 2.7, a point m in
the image can be the projection of any point M in the world space that
lies on the line through the center of projection and the image point m.
This line is called the projecting ray or the line of sight of the image
point m in the given camera. Thus, 3D reconstruction from one image
is an underdetermined problem.
On the other hand, if two images of the scene are available, the
position of a scene point M can be recovered from its projections m1 and
m2 in the images by triangulation: M is the point of intersection of the
projecting rays of m1 and m2, as depicted in Figure 2.8. This stereo setup
and the corresponding principle of triangulation have already been
introduced in Section 1. As already noted there, the 3D reconstruction
problem has not yet been solved, unless the internal and external
parameters of the cameras are known. Indeed, if we assume that the
images are corrected for radial distortion and other non-linear effects
and that the general linear pinhole camera model is applicable, then,
according to formula (2.6) in Section 2.2.4, the projection equations of
the first camera are modeled as:

$$\rho_1 \mathbf{m}_1 = K_1 R_1^T (\mathbf{M} - \mathbf{C}_1), \qquad (2.10)$$

where M = (X, Y, Z)^T are the coordinates of the scene point M with
respect to the world frame, m1 = (x1, y1, 1)^T are the extended pixel
coordinates of its projection m1 in the first image, K1 is the calibration
matrix of the first camera, C1 is the position and R1 is the orientation
of the first camera with respect to the world frame, and ρ1 is a positive
real number representing the projective depth of M with respect to the
first camera.

Fig. 2.8 Given two images of a static scene, the location of the scene point M can be recovered
from its projections m1 and m2 in the respective images by means of triangulation.

To find the projecting ray of an image point m1 in the
first camera, and therefore all points projecting onto m1 there, recall
from Section 2.2.3 that the calibration matrix K1 converts world measurements
(expressed in meters, centimeters, millimeters, etc.) into the
pixel metric of the digital image. Since m1 = (x1, y1, 1)^T are the extended
pixel coordinates of the point m1 in the first image, the direction of
the projecting ray of m1 in the camera-centered reference frame of the
first camera is given by the three-vector K1^{-1} m1. With respect to the
world frame, the direction vector of the projecting ray is R1 K1^{-1} m1, by
definition of R1. As the position of the first camera in the world frame
is given by the point C1, the parameter equations of the projecting ray
of m1 in the world frame are:

$$\mathbf{M} = \mathbf{C}_1 + \rho_1 R_1 K_1^{-1} \mathbf{m}_1 \quad \text{for some } \rho_1 \in \mathbb{R}. \qquad (2.11)$$

So, every scene point M satisfying equation (2.11) for some real number
ρ1 projects onto the point m1 in the first image. Notice that
expression (2.11) can be found directly by solving the projection equations
(2.10) for M. Clearly, the parameter equations (2.11) of the projecting
ray of a point m1 in the first image are only fully known provided
the calibration matrix K1 and the position C1 and orientation R1 of
the camera with respect to the world frame are known (i.e., when the
first camera is fully calibrated).
Similarly, the projection equations for the second camera are:

$$\rho_2 \mathbf{m}_2 = K_2 R_2^T (\mathbf{M} - \mathbf{C}_2), \qquad (2.12)$$

where m2 = (x2, y2, 1)^T are the extended pixel coordinates of M's projection
m2 in the second image, K2 is the calibration matrix of the second
camera, C2 is the position and R2 the orientation of the second camera
with respect to the world frame, and ρ2 is a positive real number representing
the projective depth of M with respect to the second camera.
Solving equation (2.12) for M yields:

$$\mathbf{M} = \mathbf{C}_2 + \rho_2 R_2 K_2^{-1} \mathbf{m}_2; \qquad (2.13)$$

and, if in this equation ρ2 is seen as a parameter, then formula (2.13)
gives just the parameter equations of the projecting ray of the image
point m2 in the second camera. Again, these parameter equations are
fully known only if K2, C2, and R2 are known (i.e., when the second
camera is fully calibrated). The system of six equations (2.10) and
(2.12) can be solved for the five unknowns X, Y, Z, ρ1, and ρ2. Observe
that this requires the system to be rank-deficient, which is guaranteed
if the points m1 and m2 are in correspondence (i.e., their projecting rays
intersect) and therefore special relations (in particular, the so-called
epipolar relations, which will be derived in the next section) hold.
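In practice the two projecting rays (2.11) and (2.13) will not intersect exactly
because of noise, so a least-squares solution is commonly used. The following
Python/NumPy sketch (function name chosen for illustration) solves for ρ1 and ρ2
from the ray equations of two fully calibrated cameras and returns the midpoint
of the closest points on the two rays.

    import numpy as np

    def triangulate(m1, K1, R1, C1, m2, K2, R2, C2):
        """Recover M from pixel points m1 = (x1, y1) and m2 = (x2, y2),
        following equations (2.11) and (2.13). A minimal least-squares sketch."""
        d1 = R1 @ np.linalg.inv(K1) @ np.array([m1[0], m1[1], 1.0])  # ray direction 1
        d2 = R2 @ np.linalg.inv(K2) @ np.array([m2[0], m2[1], 1.0])  # ray direction 2
        # Solve C1 + rho1*d1 = C2 + rho2*d2 in the least-squares sense.
        A = np.stack([d1, -d2], axis=1)          # 3x2 system matrix
        b = C2 - C1
        sol, *_ = np.linalg.lstsq(A, b, rcond=None)
        rho1, rho2 = sol
        # Midpoint of the closest points on the two rays.
        return 0.5 * ((C1 + rho1 * d1) + (C2 + rho2 * d2))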
When the cameras are not internally and externally calibrated, then
it is not immediately clear how to perform triangulation from the image
data alone. On the other hand, one intuitively feels that every image
of a static scene constrains in one way or another the shape and the
relative positioning of the objects in the world, even if no information
about the camera parameters is known. The key to the solution of the
3D reconstruction problem is found in understanding how the locations
m1 and m2 of the projections of a scene point M in different views are
related to each other. This relationship is explored in the next section.
2.4 The Epipolar Relation Between Two
Images of a Static Scene
2.4.1 The Fundamental Matrix
A point m1 in a first image of the scene is the projection of a scene
point M that can be at any position along the projecting ray of m1 in that
first camera. Therefore, the corresponding point m2 (i.e., the projection
of M) in a second image of the scene must lie on the projection ℓ2
of this projecting ray in the second image, as depicted in Figure 2.9.
To derive the equation of this projection ℓ2, suppose for a moment
that the internal and external parameters of both cameras are known.
Then, the projecting ray of the point m1 in the first camera is given by
formula (2.11), viz. M = C1 + ρ1 R1 K1^{-1} m1. Substituting the right-hand
side in the projection equations (2.12) of the second camera yields

$$\rho_2 \mathbf{m}_2 = \rho_1 K_2 R_2^T R_1 K_1^{-1} \mathbf{m}_1 + K_2 R_2^T (\mathbf{C}_1 - \mathbf{C}_2). \qquad (2.14)$$

Fig. 2.9 The point m2 in the second image corresponding to a point m1 in the first image
lies on the epipolar line ℓ2 which is the projection in the second image of the projecting ray
of m1 in the first camera.

The last term in this equation corresponds to the projection e2 of the
position C1 of the first camera in the second image:

$$\rho_{e_2} \mathbf{e}_2 = K_2 R_2^T (\mathbf{C}_1 - \mathbf{C}_2). \qquad (2.15)$$
e2 is called the epipole of the first camera in the second image. The
first term in the right-hand side of equation (2.14), on the other hand,
indicates the direction of the projecting ray (2.11) in the second image.
Indeed, recall from Section 2.3 that R1 K1^{-1} m1 is the direction vector
of the projecting ray of m1 with respect to the world frame. In the
camera-centered reference frame of the second camera, the coordinates
of this vector are R2^T R1 K1^{-1} m1. The point in the second image that
corresponds to this viewing direction is then given by K2 R2^T R1 K1^{-1} m1.
Put differently, K2 R2^T R1 K1^{-1} m1 are homogeneous coordinates for the
vanishing point of the projecting ray (2.11) in the second image, as can
be seen from Figure 2.10.
To simplify the notation, put A = K2 R2^T R1 K1^{-1}. Then A is an
invertible 3 × 3-matrix which, for every point m1 in the first image, gives
homogeneous coordinates A m1 for the vanishing point in the second
view of the projecting ray of m1 in the first camera. In the literature this
matrix is referred to as the infinite homography, because it corresponds
to the 2D projective transformation induced between the images by
the plane at infinity of the scene. More about this interpretation of the
matrix A can be found in Section 2.4.3.
Fig. 2.10 The epipole e2 of the first camera in the second image indicates the position in
the second image where the center of projection C1 of the first camera is observed. The
point A m1 in the second image is the vanishing point of the projecting ray of m1 in the
second image.

Formula (2.14) can now be rewritten as:

$$\rho_2 \mathbf{m}_2 = \rho_1 A \mathbf{m}_1 + \rho_{e_2} \mathbf{e}_2. \qquad (2.16)$$
Formula (2.16) algebraically expresses the geometrical observation that,
for a given point m1 in one image, the corresponding point m2 in another
image of the scene lies on the line ℓ2 through the epipole e2 and the
vanishing point A m1 of the projecting ray of m1 in the first camera
(cf. Figure 2.10). The line ℓ2 is called the epipolar line in the
second image corresponding to m1, and formula (2.16) is referred to as the
epipolar relation between corresponding image points. ℓ2 is the sought
projection in the second image of the entire projecting ray of m1 in the
first camera. It is important to realize that the epipolar line ℓ2 in the
second image depends on the point m1 in the first image. Put differently,
selecting another point m1 in the first image generically will result in
another epipolar line ℓ2 in the second view. However, as is seen from
formula (2.16), these epipolar lines all run through the epipole e2. This
was to be expected, of course, as all the projecting rays of the first
camera originate from the center of projection C1 of the first camera.
Hence, their projections in the second image, which, by definition,
are epipolar lines in the second view, must also all run through
the projection of C1 in the second image, which is just the epipole e2.
Figure 2.11 illustrates this observation graphically.

Fig. 2.11 All projecting rays of the first camera originate from its center of projection. Their
projections into the image plane of the second camera are therefore all seen to intersect in
the projection of this center of projection, i.e., at the epipole.
In the literature the epipolar relation (2.16) is usually expressed
in closed form. To this end, we fix some notations: for a 3-vector a =
(a1, a2, a3)^T ∈ R³, let [a]_× denote the skew-symmetric 3 × 3-matrix

$$[\mathbf{a}]_\times = \begin{pmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{pmatrix}, \qquad (2.17)$$

which represents the cross product with a; i.e., [a]_× v = a × v for all
3-vectors v ∈ R³. Observe that [a]_× has rank 2 if a is non-zero. The
epipolar relation states that, for a point m1 in the first image, its corresponding
point m2 in the second image must lie on the line through
the epipole e2 and the vanishing point A m1. Algebraically, this is
expressed by demanding that the 3-vectors m2, e2, and A m1 representing
homogeneous coordinates of the corresponding image points are
linearly dependent (cf. formula (2.16)). Recall from linear algebra that
this is equivalent to |m2 e2 Am1| = 0, where the vertical bars denote
the determinant of the 3 × 3-matrix whose columns are the specified
column vectors. Moreover, by definition of the cross product, this determinant
equals

$$|\mathbf{m}_2 \;\; \mathbf{e}_2 \;\; A\mathbf{m}_1| = \mathbf{m}_2^T (\mathbf{e}_2 \times A\mathbf{m}_1).$$

Expressing the cross product as a matrix multiplication then yields

$$|\mathbf{m}_2 \;\; \mathbf{e}_2 \;\; A\mathbf{m}_1| = \mathbf{m}_2^T [\mathbf{e}_2]_\times A\mathbf{m}_1.$$

Hence, the epipolar relation (2.16) is equivalently expressed by the
equation

$$\mathbf{m}_2^T F\,\mathbf{m}_1 = 0, \qquad (2.18)$$

where F = [e2]_× A is a 3 × 3-matrix, called the fundamental matrix of
the image pair, and with e2 the epipole in the second image and A the
invertible 3 × 3-matrix defined above [2, 4]. Note that, since [a]_× is a
rank 2 matrix, the fundamental matrix F also has rank 2.
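When the camera parameters are known, the fundamental matrix can be assembled
directly from its definition F = [e2]_× A, which is useful for checking the
epipolar relation (2.18) on synthetic data. The Python/NumPy sketch below uses
illustrative function names of my own choosing.

    import numpy as np

    def skew(a):
        """The skew-symmetric matrix [a]_x of equation (2.17)."""
        return np.array([[0.0, -a[2], a[1]],
                         [a[2], 0.0, -a[0]],
                         [-a[1], a[0], 0.0]])

    def fundamental_from_cameras(K1, R1, C1, K2, R2, C2):
        """Build F = [e2]_x A from known internal and external parameters."""
        A = K2 @ R2.T @ R1 @ np.linalg.inv(K1)    # the infinite homography
        e2 = K2 @ R2.T @ (C1 - C2)                # epipole, equation (2.15), up to scale
        return skew(e2) @ A

    # For corresponding points m1, m2 (extended coordinates), m2 @ F @ m1 should
    # then be (numerically close to) zero, in line with equation (2.18).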
2.4.2 Gymnastics with F
The closed form (2.18) of the epipolar relation has the following
advantages:

(1) The fundamental matrix F can, up to a non-zero scalar factor,
be computed from the image data alone.
Indeed, for each pair of corresponding points m1 and m2 in the
images, relation (2.18) yields one homogeneous linear equation
in the entries of the fundamental matrix F. Knowing
(at least) eight pairs of corresponding points between the
two images, the fundamental matrix F can, up to a non-zero
scalar factor, be computed from these point correspondences
in a linear manner. Moreover, by also exploiting the
fact that F has rank 2, the fundamental matrix F can even
be computed, up to a non-zero scalar factor, from seven point
correspondences between the images, albeit by a non-linear
algorithm, as the rank 2 condition involves a relation between
products of three entries of F. Different methods for efficient
and robust computation of F will be explained in Subsection 4.2.2
of Section 4 in Part 2 of this tutorial.
(2) Given F, the epipole e2 in the second image is the unique
3-vector with third coordinate equal to 1 satisfying F^T e2 = 0.
This observation follows immediately from the fact that F =
[e2]_× A and that [e2]_×^T e2 = −[e2]_× e2 = −e2 × e2 = 0.

(3) Similarly, the epipole e1 of the second camera in the first
image (i.e., the projection e1 of the position C2 of the second
camera in the first image) is the unique 3-vector with
third coordinate equal to 1 satisfying F e1 = 0.
According to formula (2.6) in Section 2.2.4, the projection
e1 of the position C2 of the second camera in
the first image is given by ρ_{e1} e1 = K1 R1^T (C2 − C1), with
ρ_{e1} a non-zero scalar factor. Since A = K2 R2^T R1 K1^{-1},
ρ_{e1} A e1 = K2 R2^T (C2 − C1) = −ρ_{e2} e2, and thus ρ_{e1} F e1 =
[e2]_× (ρ_{e1} A e1) = −[e2]_× (ρ_{e2} e2) = 0. Notice that this also
shows that the infinite homography A maps the epipole e1
in the first image onto the epipole e2 in the second image.

(4) Given a point m1 in the first image, the 3-vector F m1 yields
homogeneous coordinates for the epipolar line ℓ2 in the second
image corresponding to m1; i.e., ℓ2 ≃ F m1.
Recall that the epipolar relation m2^T F m1 = 0 expresses the
geometrical observation that the point m2 in the second
image, which corresponds to m1, lies on the line ℓ2 through
the epipole e2 and the point A m1, which by definition is the
epipolar line in the second image corresponding to m1. This
proves the claim.

(5) Similarly, given a point m2 in the second image, the 3-vector
F^T m2 yields homogeneous coordinates for the epipolar line ℓ1
in the first image corresponding to m2; i.e., ℓ1 ≃ F^T m2.
By interchanging the role of the two images in the reasoning
leading up to the epipolar relation derived above, one easily
sees that the epipolar line ℓ1 in the first image corresponding
to a point m2 in the second image is the line through the
epipole e1 in the first image and the vanishing point A^{-1} m2
in the first image of the projecting ray of m2 in the second
camera. The corresponding epipolar relation

$$|\mathbf{m}_1 \;\; \mathbf{e}_1 \;\; A^{-1}\mathbf{m}_2| = 0 \qquad (2.19)$$

expresses that m1 lies on that line. As A is an invertible
matrix, its determinant |A| is a non-zero scalar. Multiplying
the left-hand side of equality (2.19) with |A| yields

$$|A|\,|\mathbf{m}_1 \;\; \mathbf{e}_1 \;\; A^{-1}\mathbf{m}_2| = |A\mathbf{m}_1 \;\; A\mathbf{e}_1 \;\; \mathbf{m}_2| = \left|A\mathbf{m}_1 \;\; \left(-\tfrac{\rho_{e_2}}{\rho_{e_1}}\mathbf{e}_2\right) \;\; \mathbf{m}_2\right| = \tfrac{\rho_{e_2}}{\rho_{e_1}}\,|\mathbf{m}_2 \;\; \mathbf{e}_2 \;\; A\mathbf{m}_1| = \tfrac{\rho_{e_2}}{\rho_{e_1}}\,\mathbf{m}_2^T F\,\mathbf{m}_1,$$

because ρ_{e1} A e1 = −ρ_{e2} e2, as seen in item (3) above, and
|m2 e2 Am1| = m2^T (e2 × Am1) = m2^T F m1, by definition of the
fundamental matrix F (cf. formula (2.18)). Consequently,
the epipolar relation (2.19) is equivalent to m2^T F m1 = 0, and
the epipolar line ℓ1 in the first image corresponding to a
given point m2 in the second image has homogeneous coordinates
F^T m2. We could have concluded this directly from
formula (2.18) based on symmetry considerations as well.
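Items (2) and (3) are easy to put into practice: the epipoles are the null vectors
of F^T and F, which can be obtained from a singular value decomposition. The
Python/NumPy sketch below (function name chosen for illustration) does exactly
that and normalizes the third coordinate to 1.

    import numpy as np

    def epipoles_from_F(F):
        """Recover e1 and e2 from a (rank-2) fundamental matrix F:
        F e1 = 0 and F^T e2 = 0, cf. items (2) and (3) above."""
        _, _, vt = np.linalg.svd(F)
        e1 = vt[-1]; e1 = e1 / e1[2]          # right null vector of F
        _, _, vt = np.linalg.svd(F.T)
        e2 = vt[-1]; e2 = e2 / e2[2]          # right null vector of F^T
        return e1, e2

    # Epipolar lines follow items (4) and (5): for m1 = (x1, y1, 1) in the first
    # image, l2 = F @ m1 are homogeneous coordinates of its epipolar line in the
    # second image, and l1 = F.T @ m2 for a point m2 in the second image.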
2.4.3 Grasping the Infinite Homography
Before continuing our investigation of how to recover 3D information
about the scene from images alone, it is worth having a closer look at
the invertible matrix A introduced in Section 2.4.1 first. The matrix A
is defined algebraically as A = K2 R2^T R1 K1^{-1}, but it also has a clear
geometrical interpretation: the matrix A transfers vanishing points of
directions in the scene from the first image to the second one. Indeed,
consider a line L in the scene with direction vector V ∈ R³. The vanishing
point v1 of its projection ℓ1 in the first image is the point of
intersection of the line through the center of projection C1 and parallel
to L with the image plane of the first camera, as depicted in Figure 2.12.
Parameter equations of this line are M = C1 + τ V with τ a
scalar parameter, and the projection of every point M on this line in
the first image is given by K1 R1^T (M − C1) = τ K1 R1^T V. The vanishing
point v1 of the line ℓ1 in the first image thus satisfies the equation
ρ_{v1} v1 = K1 R1^T V for some non-zero scalar ρ_{v1}. Similarly, the vanishing
point v2 of the projection ℓ2 of the line L in the second image is given by
ρ_{v2} v2 = K2 R2^T V for some non-zero scalar ρ_{v2}. Conversely, given a vanishing
point v1 in the first image, the corresponding direction vector V
in the scene is V = ρ_{v1} R1 K1^{-1} v1 for some scalar ρ_{v1}, and its vanishing
point in the second image is given by ρ_{v2} v2 = ρ_{v1} K2 R2^T R1 K1^{-1} v1. As
A = K2 R2^T R1 K1^{-1}, this relation between the vanishing points v1 and
v2 can be simplified to ρ v2 = A v1, where ρ = ρ_{v2}/ρ_{v1} is a non-zero scalar
factor. Hence, if v1 is the vanishing point of a line in the first image,
then A v1 are homogeneous coordinates of the vanishing point of the
corresponding line in the second image, as was claimed. In particular,
the observation made in Section 2.4.1 that for any point m1 in the first
image A m1 are homogeneous coordinates for the vanishing point in the
second view of the projecting ray of m1 in the first camera is in fact
another instance of the same general property.

Fig. 2.12 The vanishing point v1 in the first image of the projection ℓ1 of a line L in the
scene is the point of intersection of the line through the center of projection C1 and parallel
to L with the image plane.
As is explained in more detail in Appendix A, in projective geometry,
direction vectors V in the scene are represented as points on the
plane at infinity of the scene. A vanishing point in an image then is
just the perspective projection onto the image plane of a point on the
plane at infinity in the scene. In this respect, the matrix A is a homography
matrix of the projective transformation that maps points from
the first image via the plane at infinity of the scene into the second
image. This explains why A is called the infinite homography in the
computer vision literature.
2.5 Two Image-Based 3D Reconstruction Up-Close
From an algebraic point of view, triangulation can be interpreted as
solving the two camera projection equations for the scene point M. Formulated
in this way, passive 3D reconstruction is seen as solving the
following problem: Given two images I1 and I2 of a static scene and
a set of corresponding image points m1 ∈ I1 and m2 ∈ I2 between these
images, determine a calibration matrix K1, a position C1 and an orientation
R1 for the first camera, and a calibration matrix K2, a position
C2 and an orientation R2 for the second camera, and for every pair of
corresponding image points m1 ∈ I1 and m2 ∈ I2 compute world coordinates
(X, Y, Z) of a scene point M such that:

$$\rho_1 \mathbf{m}_1 = K_1 R_1^T (\mathbf{M} - \mathbf{C}_1) \quad \text{and} \quad \rho_2 \mathbf{m}_2 = K_2 R_2^T (\mathbf{M} - \mathbf{C}_2). \qquad (2.20)$$

These equations are the point of departure for our further analysis. In
traditional stereo one would know K1, R1, C1, K2, R2, and C2. Then
formula (2.20) yields a system of six linear equations in five unknowns
from which the coordinates of M as well as the scalar factors ρ1 and ρ2
can be computed, as was explained in Section 2.3.
Here, however, we are interested in the question which information
can be salvaged in cases where our knowledge about the camera configuration
is incomplete. In the following sections, we will gradually
assume less and less information about the camera parameters to be
known. For each case, we will examine the damage to the precision with
which we can still reconstruct the 3D structure of the scene. As will
be seen, depending on what is still known about the camera setup, the
geometric uncertainty about the 3D reconstruction can range from a
Euclidean motion up to a 3D projectivity.
2.5.1 Euclidean 3D Reconstruction
Let us first assume that we do not know about the camera positions
and orientations relative to the world coordinate frame, but that we
only know the position and orientation of the second camera relative
to (the camera-centered reference frame of) the first one. We will also
assume that both cameras are internally calibrated so that the matrices
K1 and K2 are known as well. This case is relevant when, e.g., using a
hand-held stereo rig and taking a single stereo image pair.
A moment's reflection shows that it is not possible to determine
the world coordinates of M from the images in this case. Indeed, changing
the position and orientation of the world frame does not alter the
setup of the cameras in the scene, and consequently, does not alter the
images. Thus it is impossible to recover absolute information about
the cameras' external parameters in the real world from the projection
equations (2.20) alone, beyond what we already know (relative camera
pose). Put differently, one cannot hope for more than to recover the
3D structure of the scene up to a 3D Euclidean transformation of the
scene from the projection equations (2.20) alone.
On the other hand, the factor R1^T (M − C1) in the right-hand side of
the first projection equation in formula (2.20) is just a Euclidean transformation
of the scene. Thus, without loss of generality, we may replace
this factor by M′, because we just lost all hope of retrieving the 3D coordinates
of M more precisely than up to some unknown 3D Euclidean
transformation anyway. The first projection equation then simplifies to
ρ1 m1 = K1 M′. Solving M′ = R1^T (M − C1) for M gives M = R1 M′ + C1, and
substituting this into the second projection equation in (2.20) yields
ρ2 m2 = K2 R2^T R1 M′ + K2 R2^T (C1 − C2). Together,

$$\rho_1 \mathbf{m}_1 = K_1 \mathbf{M}' \quad \text{and} \quad \rho_2 \mathbf{m}_2 = K_2 R_2^T R_1 \mathbf{M}' + K_2 R_2^T (\mathbf{C}_1 - \mathbf{C}_2) \qquad (2.21)$$

constitute a system of equations which allows to recover the 3D structure
of the scene up to the 3D Euclidean transformation M′ = R1^T (M − C1).
Indeed, as the cameras are assumed to be internally calibrated, K1 and
K2 are known and the first equation in (2.21) can be solved for M′, viz.
M′ = ρ1 K1^{-1} m1. Plugging this new expression for M′ into the right-hand
side of the second equation in formula (2.21), one gets

$$\rho_2 \mathbf{m}_2 = \rho_1 K_2 R_2^T R_1 K_1^{-1} \mathbf{m}_1 + K_2 R_2^T (\mathbf{C}_1 - \mathbf{C}_2). \qquad (2.22)$$

Notice that this actually brings us back to the epipolar relation (2.16)
which was derived in Section 2.4.1. In this equation R2^T R1 and
R2^T (C1 − C2) represent, respectively, the relative orientation and the
relative position of the first camera with respect to the second one.
As these are assumed to be known too, equality (2.22) yields a system
of three linear equations from which the two unknown scalar factors
ρ1 and ρ2 can be computed. And, when ρ1 is found, the Euclidean
transformation M′ of the scene point M is found as well.
In summary, if the cameras are internally calibrated and the relative
position and orientation of the cameras are known, then for each
pair of corresponding points m1 and m2 in the images the Euclidean
transformation M′ of the underlying scene point M can be recovered from
the relations (2.21). Formula (2.21) is therefore referred to as a system
of Euclidean reconstruction equations for the scene, and the 3D points
M′ satisfying these equations constitute a Euclidean reconstruction of
the scene. As a matter of fact, better than Euclidean reconstruction is
often not needed, as the 3D shape of the objects in the scene is perfectly
retrieved, only not their position relative to the world coordinate
frame, which is completely irrelevant in many applications.
Notice how we have absorbed the unknown parameters into the new
coordinates M′. This is a strategy that will be used repeatedly in the next
sections. It is interesting to observe that M′ = R1^T (M − C1) are, in fact,
the coordinates of the scene point M with respect to the camera-centered
reference frame of the first camera, as was calculated in Section 2.2.4.
The Euclidean reconstruction equations (2.21) thus coincide with the
system of projection equations (2.20) if the world frame is the camera-centered
reference frame of the first camera (i.e., C1 = 0 and R1 = I3),
and the rotation matrix R2 then expresses the relative orientation of
the second camera with respect to the first one.
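To make the role of equation (2.22) concrete, the following Python/NumPy sketch
solves it for ρ1 and ρ2 in the least-squares sense and then recovers the
Euclidean-transformed point M′ of this section. The function name and the
shorthand arguments R_rel = R2^T R1 and t_rel = R2^T (C1 − C2) are my own
notational choices, not the text's.

    import numpy as np

    def euclidean_reconstruction(m1, m2, K1, K2, R_rel, t_rel):
        """Recover M' = R1^T (M - C1) from a correspondence (m1, m2) with
        internally calibrated cameras and known relative pose, cf. (2.21)-(2.22)."""
        m1h = np.array([m1[0], m1[1], 1.0])
        m2h = np.array([m2[0], m2[1], 1.0])
        # Equation (2.22): rho2 * m2 = rho1 * K2 R_rel K1^{-1} m1 + K2 t_rel
        a = K2 @ R_rel @ np.linalg.inv(K1) @ m1h
        b = K2 @ t_rel
        B = np.stack([-a, m2h], axis=1)             # 3x2 system in (rho1, rho2)
        sol, *_ = np.linalg.lstsq(B, b, rcond=None)
        rho1, rho2 = sol
        return rho1 * np.linalg.inv(K1) @ m1h       # M' = rho1 * K1^{-1} m1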
2.5.2 Metric 3D Reconstruction
Next consider a stereo setup as that of the previous section, but suppose
that we do not know the distance between the centers of projection
C1 and C2 any more. We do, however, still know the relative
orientation of the cameras and the direction along which the second
camera is shifted with respect to the first one. This means that
R2^T (C1 − C2) is only known up to a non-zero scalar factor. As the
cameras still are internally calibrated, the calibration matrices K1 and
K2 are known and it follows that the last term in the second of the
Euclidean reconstruction equations (2.21), viz. K2 R2^T (C1 − C2), can
only be determined up to an unknown scalar factor. According to formula
(2.15), K2 R2^T (C1 − C2) = ρ_{e2} e2 yields the epipole e2 of the first
camera in the second image. Knowledge of the direction in which the
second camera is shifted with respect to the first one, combined with
knowledge of K2, clearly also allows to determine e2. It is interesting
to notice, however, that, if a sufficient number of corresponding points
can be found between the images, then the fundamental matrix of the
image pair can be computed, up to a non-zero scalar factor, and the
epipole e2 can also be recovered that way (as explained in item (2) of
Section 2.4.2 and, in more detail, in Section 4 in Part 2 of this tutorial).
Having a sufficient number of point correspondences would thus eliminate
the need to know about the direction of camera shift beforehand.
On the other hand, the assumption that the inter-camera distance is
not known implies that the scalar factor ρ_{e2} in the last term of the
Euclidean reconstruction equations

$$\rho_1 \mathbf{m}_1 = K_1 \mathbf{M}' \quad \text{and} \quad \rho_2 \mathbf{m}_2 = K_2 R_2^T R_1 \mathbf{M}' + \rho_{e_2} \mathbf{e}_2 \qquad (2.23)$$

is unknown. This should not come as a surprise, since ρ_{e2} is the
projective depth of C1 in the second camera, and thus it is directly
related to the inter-camera distance. Algebraically, the six homogeneous
equations (2.23) do not suffice to solve for the six unknowns
M′, ρ1, ρ2, and ρ_{e2}. One only gets a solution up to an unknown scale.
But, using the absorption trick again, we introduce the new coordinates
M̄ = (1/ρ_{e2}) M′ = (1/ρ_{e2}) R1^T (M − C1), which is a 3D similarity transformation of
the original scene. Formula (2.23) then reduces to

$$\bar{\rho}_1 \mathbf{m}_1 = K_1 \bar{\mathbf{M}} \quad \text{and} \quad \bar{\rho}_2 \mathbf{m}_2 = K_2 R_2^T R_1 \bar{\mathbf{M}} + \mathbf{e}_2, \qquad (2.24)$$

where ρ̄1 = ρ1/ρ_{e2} and ρ̄2 = ρ2/ρ_{e2} are scalar factors expressing the projective
depth of the scene point underlying m1 and m2 in each camera relative to
the scale ρ_{e2} of the metric reconstruction of the scene. The coordinates
M̄ provide a 3D reconstruction of the scene point M up to an unknown
3D similarity, as expected.
We could also have seen this additional scaling issue coming intuitively.
If one were to scale a scene together with the cameras in it, then
this would have no impact on the images. In terms of the relative camera
positions, this would only change the distance between them, not
their relative orientations or the relative direction in which one camera
is displaced with respect to the other. The calibration matrices K1
and K2 would remain the same, since both the focal lengths and the
pixel sizes are supposed to be scaled by the same factor and the number
of pixels in the image is kept the same as well, so that the offsets
in the calibration matrices do not change. Again, as such changes are
not discernible in the images, having internally calibrated cameras and
external calibration only up to the exact distance between the cameras
leaves us with one unknown, but fixed, scale factor. Together with
the unknown Euclidean motion already present in the 3D reconstruction
derived in the previous section, this unknown scaling brings the
geometric uncertainty about the 3D scene up to an unknown 3D similarity
transformation. Such a reconstruction of the scene is commonly
referred to in the computer vision literature as a metric reconstruction
of the scene, and formula (2.24) is referred to as a system of metric
reconstruction equations. Although annoying, it should be noted that
fixing the overall unknown scale usually is the least of our worries in
practice, as indeed knowledge about a single distance or length in the
scene suffices to lift the uncertainty about scale.
2.5.3 Affine 3D Reconstruction
A further step toward our goal of 3D reconstruction from the image data
alone is to give up on knowledge of the internal camera parameters as
well. For the metric reconstruction equations (2.24) this implies that
the calibration matrices K1 and K2 are also unknown. As before, one
can perform a change of coordinates M̃ = K1 M̄ and replace M̄ in the
reconstruction equations (2.24) by M̄ = K1^{-1} M̃. This gives:

$$\tilde{\rho}_1 \mathbf{m}_1 = \tilde{\mathbf{M}} \quad \text{and} \quad \tilde{\rho}_2 \mathbf{m}_2 = A \tilde{\mathbf{M}} + \mathbf{e}_2, \qquad (2.25)$$

where ρ̃1 = ρ̄1, ρ̃2 = ρ̄2, and A = K2 R2^T R1 K1^{-1} is the infinite homography
introduced in Section 2.4.1. If the invertible matrix A is known,
then this system (2.25) can be solved for the scalars ρ̃1, ρ̃2, and, more
importantly, for M̃, as in the metric case (cf. Section 2.5.2). More on
how to extract A from image information only is to follow shortly. As
M̃ = K1 M̄ = (1/ρ_{e2}) K1 R1^T (M − C1) represents a 3D affine transformation of
the world space, formula (2.25) is referred to as a system of affine reconstruction
equations for the scene, and the 3D points M̃ satisfying these
equations constitute an affine reconstruction of the scene, i.e., a reconstruction
which is correct up to an unknown 3D affine transformation.
It suffices to know A and e2 in order to compute an affine reconstruction
of the scene. As explained in Section 2.4.2 and in more detail
in Section 4, e2 can be extracted from F, and F can be derived,
up to a non-zero scalar factor, from corresponding points between
the images. Since e2 is in the left nullspace of F, an unknown scalar
factor on F does not prevent the extraction of e2. Unfortunately,
determining A is not that easy in practice. A was defined as
A = K2 R2^T R1 K1^{-1}, where K1 and K2 are the calibration matrices and
R2^T R1 represents the relative orientation of the cameras. If this information
about the cameras is not available, then this formula cannot be
used to compute A. On the other hand, the fundamental matrix F of
the image pair has been defined in Section 2.4.1 as F = [e2]_× A. But,
unfortunately, the relation F = [e2]_× A does not define the matrix A
uniquely. Indeed, suppose A1 and A2 are 3 × 3-matrices such that
F = [e2]_× A1 and F = [e2]_× A2. Then [e2]_× (A1 − A2) = 0. As [e2]_× is
the skew-symmetric 3 × 3-matrix which represents the cross product
with the 3-vector e2, i.e., [e2]_× v = e2 × v for all v ∈ R³, the columns
of A1 and A2 can only differ by a scalar multiple of e2. In particular,
A1 = A2 + e2 a^T for some 3-vector a ∈ R³. Hence, the infinite homography
A cannot be recovered from point correspondences between the
images alone.
So, what other image information can then be used to determine A?
Recall from Section 2.4.3 that the matrix A transfers vanishing points
of directions in the scene from the first image to the second one. Each
pair of corresponding vanishing points in the images therefore yields
constraints on the infinite homography A. More precisely, if v1 and v2
are the vanishing points in, respectively, the first and the second image
of a particular direction in the scene, then ρ v2 = A v1 for some non-zero
scalar factor ρ. Since A is a 3 × 3-matrix and as each such constraint
brings three equations, but also adds one additional unknown ρ, at
least four such constraints are needed to determine the matrix A up
to a scalar factor ρ_A. This unknown factor does not form an obstacle
for the obtained matrix to be used in the affine reconstruction equations
(2.25), because by multiplying the first reconstruction equation
with the same factor and then absorbing it into M̃ in the right-hand side
of both equations, one still obtains an affine 3D reconstruction of the
scene.
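The constraint ρ v2 = A v1 can be turned into linear equations on the entries of
A by eliminating ρ with a cross product, v2 × (A v1) = 0, which gives two
independent equations per correspondence. The following Python/NumPy sketch
(a DLT-style estimator with an illustrative function name) solves for A, up to
a scalar factor, from four or more pairs of vanishing points given as
homogeneous 3-vectors.

    import numpy as np

    def infinite_homography_from_vanishing_points(v1s, v2s):
        """Estimate A (up to scale) from >= 4 corresponding vanishing points
        using the constraint v2 x (A v1) = 0."""
        rows = []
        for v1, v2 in zip(v1s, v2s):
            v1 = np.asarray(v1, float)
            x, y, w = np.asarray(v2, float)
            # two independent rows of the cross-product constraint
            rows.append(np.concatenate([np.zeros(3), -w * v1, y * v1]))
            rows.append(np.concatenate([w * v1, np.zeros(3), -x * v1]))
        _, _, vt = np.linalg.svd(np.asarray(rows))
        return vt[-1].reshape(3, 3)     # A, defined up to a non-zero scalar factor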
Identifying the vanishing points of four independent directions in an
image pair is rarely possible. More often, one has three dominant directions,
typically orthogonal to each other. This is the case for many built-up
environments. Fortunately, there is one direction which is always
available, namely that of the line passing through the positions C1 and C2
in the scene. The vanishing points of this line in the images are the
intersections of the line with each image plane. But these are just the
epipoles. So, the epipoles e1 and e2 in a pair of images are corresponding
vanishing points of the direction of the line through the camera
positions C1 and C2 in the scene, and therefore satisfy the relation

$$\rho_e \mathbf{e}_2 = A \mathbf{e}_1 \quad \text{for some } \rho_e \in \mathbb{R},$$

as we have already noted in Section 2.4.2, item (3). Consequently, if
the vanishing points of three independent directions in the scene can be
identified in the images, then the infinite homography A can be computed
up to a non-zero scalar factor; at least if none of the three directions
is that of the line connecting the centers of projection C1 and C2.
Notice that we have only absorbed K1 into the affine coordinates M̃
of the 3D reconstruction, but that K2 also can remain unknown, as it
is absorbed by A.
2.5.4 Projective 3D Reconstruction
Finally, we have arrived at the situation where we assume no knowledge
about the camera configuration or about the scene whatsoever. Instead,
we only assume that one can find point correspondences between the
images and extract the fundamental matrix of the image pair.
The main conclusion of the previous section is that, if no information
about the internal and external parameters of the cameras is available
and if insufficient vanishing points can be found and matched, then
the only factor that separates us from an affine 3D reconstruction of
the scene is the infinite homography A. Let us therefore investigate
whether some partial knowledge about A can still be retrieved from
general point correspondences between the images.
Recall from Section 2.4.1 that F = [e2]_× A and that the epipole e2
can uniquely be determined from the fundamental matrix F. It is not
difficult to verify that ([e2]_×)³ = −‖e2‖² [e2]_×, where ‖e2‖ denotes the
norm of the 3-vector e2. As F = [e2]_× A, it follows that

$$[\mathbf{e}_2]_\times \left( [\mathbf{e}_2]_\times F \right) = ([\mathbf{e}_2]_\times)^3 A = -\|\mathbf{e}_2\|^2 [\mathbf{e}_2]_\times A = -\|\mathbf{e}_2\|^2 F.$$

So, [e2]_× F is a 3 × 3-matrix which, when premultiplied with [e2]_×,
yields a non-zero scalar multiple of the fundamental matrix F. In
other words, up to a non-zero scalar factor, the 3 × 3-matrix [e2]_× F
could be a candidate for the unknown matrix A. Unfortunately, as
both the fundamental matrix F and [e2]_× have rank 2, the matrix
[e2]_× F is not invertible as A ought to be. But, recall from the previous
section that two matrices A1 and A2 satisfying F = [e2]_× A1
and F = [e2]_× A2 are related by A1 = A2 + e2 a^T for some 3-vector
a ∈ R³. This implies that the unknown matrix A must be of the form
A = (1/‖e2‖²) [e2]_× F + e2 a^T for some 3-vector a ∈ R³. It follows
that the invertible matrix A, needed for an affine reconstruction of
the scene, can only be recovered up to the three unknown components of
a. As we do not know them, the simplest thing to do is to put them to
zero or to make a random guess. This section analyzes what happens
to the reconstruction if we do just that.
The expression for A only takes on the particular form given above
in case F is obtained from camera calibration (i.e., from the rotation
and calibration matrices). In case F is to be computed from point
correspondences, as is the case here, it can only be determined up
to a non-zero scalar factor. Let F̂ be an estimate of the fundamental
matrix as obtained from point correspondences; then F = κ F̂ for some
non-zero scalar factor κ. Now define Â = (1/‖e2‖²) [e2]_× F̂. Then A =
κ Â + e2 a^T for some unknown 3-vector a ∈ R³. Notice that, as observed
before, the scalar factor κ between F and F̂ has no influence on the
pixel coordinates of e2, as derived from F̂ instead of F. Using Â for A
in the affine reconstruction equations (2.25), for corresponding image
points m1 and m2 we now solve the following system of linear equations:

$$\hat{\rho}_1 \mathbf{m}_1 = \hat{\mathbf{M}} \quad \text{and} \quad \hat{\rho}_2 \mathbf{m}_2 = \hat{A} \hat{\mathbf{M}} + \mathbf{e}_2, \qquad (2.26)$$

where ρ̂1 and ρ̂2 are non-zero scalar factors, and where the 3D points
M̂ constitute a 3D reconstruction of the scene which, as will be
demonstrated next, differs from the original scene by a (unique, but
unknown) 3D projective transformation. The set of 3D points M̂ is called
a projective 3D reconstruction of the scene and formula (2.26) is referred
to as a system of projective reconstruction equations. Figure 2.13 summarizes
the steps that have led up to these equations.
Projective 3D reconstruction from two uncalibrated images

Given: a set of point correspondences m1 ∈ I1 and m2 ∈ I2
between two uncalibrated images I1 and I2 of a static scene.

Objective: a projective 3D reconstruction M̂ of the scene.

Algorithm:
(1) Compute an estimate F̂ for the fundamental matrix (a).
(2) Compute the epipole e2 from F̂ (b).
(3) Compute the 3 × 3-matrix Â = (1/‖e2‖²) [e2]_× F̂.
(4) For each pair of corresponding image points m1 and m2,
    solve the following system of linear equations for M̂:
        ρ̂1 m1 = M̂ and ρ̂2 m2 = Â M̂ + e2
    (ρ̂1 and ρ̂2 are non-zero scalars).

(a) Cf. item (1) in Section 2.4.2. See also Subsections 4.2.2 and 4.2.3 in Section 4 in Part 2
of this tutorial.
(b) Cf. item (2) in Section 2.4.2. See also Subsection 4.3 in Section 4 in Part 2 of this
tutorial.

Fig. 2.13 A basic algorithm for projective 3D reconstruction from two uncalibrated images.
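The boxed algorithm maps directly onto a few lines of linear algebra. The
Python/NumPy sketch below (with a function name chosen here for illustration)
implements steps (2) to (4) for a single point correspondence, given an estimate
F̂ of the fundamental matrix; step (1), the estimation of F̂ itself, is deferred
to Part 2 of this tutorial.

    import numpy as np

    def projective_reconstruction(F_hat, m1, m2):
        """Return a projectively reconstructed point M_hat for the pixel
        correspondence (m1, m2), following Figure 2.13 (a minimal sketch)."""
        # Step (2): epipole e2 as the left null vector of F_hat, third coordinate 1.
        _, _, vt = np.linalg.svd(F_hat.T)
        e2 = vt[-1]; e2 = e2 / e2[2]
        # Step (3): A_hat = (1/||e2||^2) [e2]_x F_hat.
        e2x = np.array([[0.0, -e2[2], e2[1]],
                        [e2[2], 0.0, -e2[0]],
                        [-e2[1], e2[0], 0.0]])
        A_hat = e2x @ F_hat / np.dot(e2, e2)
        # Step (4): rho1*m1 = M_hat and rho2*m2 = A_hat M_hat + e2, i.e.
        # rho2*m2 - rho1*A_hat m1 = e2, a linear system in (rho1, rho2).
        m1h = np.array([m1[0], m1[1], 1.0])
        m2h = np.array([m2[0], m2[1], 1.0])
        B = np.stack([-A_hat @ m1h, m2h], axis=1)
        sol, *_ = np.linalg.lstsq(B, e2, rcond=None)
        rho1 = sol[0]
        return rho1 * m1h                           # M_hat = rho1 * m1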
In order to prove that the points M̂ obtained from equations (2.26)
do indeed constitute a projective 3D reconstruction of the scene,
we first express the equations (2.26) in terms of projection matrices
and extended coordinates (X̂, Ŷ, Ẑ, 1)^T for the 3D point M̂ =
(X̂, Ŷ, Ẑ)^T (cf. formula (2.7) in Section 2.2.4):

$$\hat{\rho}_1 \mathbf{m}_1 = (I_3 \mid \mathbf{0}) \begin{pmatrix} \hat{\mathbf{M}} \\ 1 \end{pmatrix} \quad \text{and} \quad \hat{\rho}_2 \mathbf{m}_2 = (\hat{A} \mid \mathbf{e}_2) \begin{pmatrix} \hat{\mathbf{M}} \\ 1 \end{pmatrix}. \qquad (2.27)$$
Similarly, the affine reconstruction equations (2.25) are written as:
˜ρ1m1=(I3|0)˜
M
1and ˜ρ2m2=(A|e2)˜
M
1,(2.28)
where ˜
M
1=(˜
X, ˜
Y,˜
Z,1)Tare the extended coordinates of the 3D point
˜
M=(˜
X, ˜
Y,˜
Z)T. Recall that the invertible matrix Ais of the form
A=κˆ
A+e2aTfor some non-zero scalar κand 3-vector aR3. The
last equality in (2.28) therefore is:
˜ρ2m2=(κˆ
A+e2aT|e2)˜
M
1=(ˆ
A|e2)κI30
aT1˜
M
1; (2.29)
and the first equality in (2.28) can be rewritten as:
κ˜ρ1m1=(I3|0)κI30
aT1˜
M
1.
Comparing these expressions to formula (2.27), it follows that
λˆ
M
1=κI30
aT1˜
M
1.(2.30)
for some non-zero scalar λRand that ˆρ1=(κ/λρ1and ˆρ2=
(1ρ2. Notice that one cannot simply go for a solution with λ=1,as
this implies that aT=0Tand therefore A=κˆ
A. This would mean that
we have been lucky enough to guess the correct infinite homography
matrix A(up to a scalar factor) right away. But this cannot be the case
according to our proposed choice : ˆ
Ahas rank 2 and thus is singular
whereas the correct matrix Ais invertible. Rather, eliminating λfrom
equation (2.30) gives:
ˆ
M=κ˜
M
aT˜
M+1
.
350 Principles of Passive 3D Reconstruction
Recall from Section 2.5.3 that ˜
M=1
ρe2K1RT
1(MC1). Substituting this
into the previous equation, one sees that
ˆ
M=κK1RT
1(MC1)
aTK1RT
1(MC1)+ρe2
is a projective transformation of the scene. Moreover, since ˜
M=K1¯
M
with ¯
M=1
ρe2RT
1(MC1) being the metric reconstruction of the scene
defined in Section 2.5.2, formula (2.30) can also be written as:
λˆ
M
1=κK10
aTK11¯
M
1; (2.31)
or, after elimination of λ,
ˆ
M=κK1¯
M
aTK1¯
M+1
,
which shows that ˆ
Malso is a projective transformation of ¯
M.
2.5.5 Taking Stock — Stratification
The goal put forward at the beginning of Section 2.5 was to recover
the three-dimensional geometrical structure of a scene from two images
of it, without necessarily having complete information about the inter-
nal and external parameters of the cameras. It was immediately seen
that one can only recover it up to a 3D Euclidean transformation if
only information about the relative camera poses is available for the
external camera parameters. If the precise inter-camera distance is also
unknown, 3D reconstruction is only possible up to a 3D similarity and
the 3D reconstruction is said to be metric. Moreover, if the calibra-
tion matrix of the first camera is unknown (and also of the second
camera for that matter), then at most an affine 3D reconstruction of
the scene is feasible. And, if the infinite homography Aintroduced in
Section 2.4.1 is unknown, then the scene can only be reconstructed up
to a 3D projective transformation. The latter case is very relevant, as
it applies to situations where the internal and external parameters of
the camera pair are unknown, but where point correspondences can be
found.
2.5 Two Image-Based 3D Reconstruction Up-Close 351
Projective
Affine
Metric
similarity
Original
Euclidean
Fig. 2.14 The aim of passive 3D reconstruction is to recover the 3D geometrical structure
of the scene from images. With a fully calibrated setup of two cameras a Euclidean recon-
struction can be achieved. The reconstruction degenerates to metric if the inter-camera
distance (sometimes referred to as baseline) is unknown. If the calibration matrices of the
cameras are unknown, then the scene structure can only be recovered up to a 3D affine
transformation; and, if no information about the camera setup is available, then only a
projective reconstruction is feasible.
Figure 2.14 illustrates these different situations. In the figure the
scene consists of a cube. In a metric reconstruction a cube is found, but
the actual size of the cube is undetermined. In an affine reconstruction
the original cube is reconstructed as a parallelepiped. Affine transfor-
mations preserve parallelism, but they do not preserve metric relations
such as angles and relative lengths. In a projective reconstruction the
scene appears as an (irregular) hexahedron, because projective trans-
formations only preserve incidence relations such as collinearity and
coplanarity, but parallelism or any metric information is not preserved.
Table 2.1 shows the mathematical expressions that constitute these geo-
metrical transformations. The mutual relations between the different
types of 3D reconstruction are often referred to as stratification of the
geometries. This term reflects that the transformations higher up in the
list are special types (subgroups) of the transformations lower down.
Obviously, the uncertainty about the reconstruction increases when
going down the list, as also corroborated by the number of degrees of
freedom in these transformations.
In the first instance, the conclusion of this section — namely that
without additional information about the (internal and external) cam-
era parameters the three-dimensional structure of the scene only can
be recovered from two images up to an unknown projective transforma-
tion — might come as a disappointment. From a mathematical point of
view, however, this should not come as a surprise, because algebraically
passive 3D reconstruction boils down to solving the reconstruction
352 Principles of Passive 3D Reconstruction
Table 2.1. The stratification of geometries.
Geometrical transf. Mathematical expression
Euclidean transf. M=RM+Twith Ra rotation matrix, TR3
similarity transf. M=κRM+Twith Ra rotation matrix, TR3,κR
affine transf. M=QM+Twith Qan invertible matrix, TR3
projective transf.
X=p11X+p12 Y+p13 Z+p14
p41X+p42 Y+p43 Z+p44
Y=p21X+p22 Y+p23 Z+p24
p41X+p42 Y+p43 Z+p44
Z=p31X+p32 Y+p33 Z+p34
p41X+p42 Y+p43 Z+p44
with P=(pij ) an invertible 4 ×4-matrix
equations (2.20) for the camera parameters and the scene points M.In
terms of projection matrices and extended coordinates (cf. formula (2.7)
in Section 2.2.4), the reconstruction equations (2.20) are formulated as:
ρ1m1=P1M
1and ρ2m2=P2M
1,(2.32)
where Pj=(KjRT
j|−KjRT
jCj) is the 3 ×4-projection matrix of the
j-th camera, with j=1 or j= 2, and M
1=(X,Y,Z,1)Tare the
extended coordinates of the scene point M=(X, Y, Z)T. Moreover, it
was observed in Section 2.2.4 that in the general linear camera model
any 3×4-matrix of maximal rank can be interpreted as the projection
matrix of a linear pinhole camera. Consequently, inserting an arbitrary
invertible 4 ×4-matrix and its inverse in the right-hand sides of the
projection equations (2.32) does not alter the image points m1and m2
in the left-hand sides of the equations and yields another — but equally
valid — decomposition of the reconstruction equations:
ρ1m1=P1H1HM
1and ρ2m2=P2H1HM
1;
or equivalently,
ˆρ1m1=ˆ
P1ˆ
M
1and ˆρ2m2=ˆ
P2ˆ
M
1,(2.33)
with ˆ
P1=P1H1and ˆ
P2=P2H1two 3 ×4-matrices of maximal
rank, λˆ
M
1=HM
1with λa non-zero scalar a 3D projective transfor-
2.5 Two Image-Based 3D Reconstruction Up-Close 353
mation of the scene, and ˆρ1=ρ1
λand ˆρ2=ρ2
λnon-zero scalar factors.
Clearly, formulas (2.33) can be interpreted as the projection equations
of scene points ˆ
Mthat, when observed by cameras with respective pro-
jection matrices ˆ
P1and ˆ
P2, yield the same set of image points m1and m2.
As Hcan be any invertible 4 ×4-matrix, it is clear that one cannot
hope to do better than recovering the 3D geometric structure of the
scene up to an arbitrary 3D projective transformation if no information
about the cameras is available. But, the longer analysis presented in the
previous sections and leading up to the same conclusion has provided
an explicit algorithm for projective 3D reconstruction, which will be
refined in the next sections.
Awareness of the stratification is also useful when additional infor-
mation on the scene (rather than the cameras) is available. One may
know some (relative) lengths or angles (including orthogonalities and
parallelisms). Exploiting such information can make it possible to
upgrade the geometric structure of the reconstruction to one with less
uncertainty, i.e., to one higher up in the stratification table. Based
on known lengths and angles, it may, for instance, become possible
to construct a 3D projective transformation matrix (homography) H
that converts the projective reconstruction ˆ
M, obtained from the image
data (cf. Figure 2.13 and formula (2.31) ), directly into a Euclidean
one. And, even if no metric information about the scene is available,
other geometrical relations that are known to exist in the scene may
be useful. In particular, we have already discussed the case of parallel
lines in three independent directions, that suffice to upgrade a pro-
jective reconstruction into an affine one through the three vanishing
points they deliver. Indeed, they allow to determine the three unknown
parameters aR3of the invertible matrix A(cf. Section 2.5.3). Know-
ing aallows to upgrade the projective 3D reconstruction ˆ
Mto the affine
3D reconstruction κ˜
Mby using forumla (2.30) in Section 2.5.4. How-
ever, one does not always need the projections of (at least) two parallel
lines in the image to compute a vanishing point. Alternatively, if in an
image three points can be identified that are the projections of collinear
scene points M1,M2, and M3of which the ratio d(M1,M2)
d(M1,M3)of their Euclidean
distances in the real world is known (e.g., three equidistant points in
the scene), then one can determine the vanishing point of this direction
354 Principles of Passive 3D Reconstruction
in the image using the cross ratio (cf. Appendix A). In Subsection 4.6.1
of Section 4 in Part 2 of this tutorial such possibilities for improving the
3D reconstruction will be explored further. In the next section, how-
ever, we will assume that no information about the scene is available
and we will investigate how more than two images can contribute to
better than a projective 3D reconstruction.
2.6 From Projective to Metric Using
More Than Two Images
In the geometric stratification of the previous section, we ended up
with a 3D projective reconstruction in case no prior camera calibration
information is available whatsoever and we have to work purely from
image correspondences. In this section we will investigate how and when
we can work our way up to a metric 3D reconstruction, if we were to
have more than just two uncalibrated images.
2.6.1 Projective Reconstruction and Projective Camera
Matrices from Multiple Images
Suppose we are given mimages I1,I2,... ,Imof a static scene and a set
of corresponding points mj∈I
jbetween the images (j∈{1,2,...,m}).
As in formula (2.20) the projection equations of the j-th camera are:
ρjmj=KjRT
j(MCj) for j∈{1,2,...,m}; (2.34)
or, in terms of projection matrices and extended coordinates as in for-
mula (2.32):
ρjmj=KjRT
j|−KjRT
jCjM
1for j∈{1,2,...,m}. (2.35)
Of course, when one has more than just two views, one can still extract
at least a projective reconstruction of the scene, as when one only had
two. Nonetheless, a note of caution is in place here. If we were to try
and build a projective reconstruction by pairing the first with each
of the other images separately and then combine the resulting projec-
tive reconstructions, this would in general not work. Indeed, one has
to ensure that the same projective distortion is obtained for each of
2.6 From Pro jective to Metric Using More Than Two Images 355
the reconstructions. That this will not automatically amount from a
pairwise reconstruct-and-then-combine procedure becomes clear if one
writes down the formulas explicitly. When the internal and external
parameters of the cameras are unknown, a projective 3D reconstruc-
tion of the scene can be computed from the first two images by the
procedure of Figure 2.13 in Section 2.5.4. In particular, for each point
correspondence m1∈I
1and m2∈I
2in the first two images, the recon-
structed 3D point ˆ
Mis the solution of the system of linear equations:
ˆρ1m1=ˆ
Mand ˆρ2m2=ˆ
A2ˆ
M+e2,(2.36)
where ˆρ1and ˆρ2are non-zero scalars (which depend on ˆ
Mand thus are
also unknown) and ˆ
A2=(1/e22)[e2]׈
F12 with ˆ
F12 an estimate of
the fundamental matrix between the first two images as computed from
point correspondences. The resulting points ˆ
Mconstitute a 3D recon-
struction of the scene which relates to the metric reconstruction ¯
Mby
the projective transformation:
λˆ
M
1=κK10
aTK11¯
M
1,(2.37)
as was demonstrated in Section 2.5.4 (formula (2.31)). Let Hbe the
homography matrix in the right-hand side of formula (2.37), viz.:
H=κK10
aTK11.(2.38)
For the other images a similar set of equations can be derived by pairing
each additional image with the first one. But, as mentioned before, one
has to be careful in order to end up with the same projective distortion
already resulting from the reconstruction based on the first two views.
Indeed, for an additional image Ijwith j3, blindly applying the pro-
cedure of Figure 2.13 in Section 2.5.4 to the first and the j-th images
without considering the second image or the already obtained projec-
tively reconstructed 3D points ˆ
Mwould indeed result in a 3D projective
reconstruction ˆ
M(j), but one which relates to the metric reconstruction ¯
M
by the projective transformation:
λˆ
M(j)
1=κ(j)K10
aT
(j)K11¯
M
1,
356 Principles of Passive 3D Reconstruction
in which the parameters κ(j)Rand a(j)R3are not related to the
parameters κand ain the projective transformation (2.37) and the
corresponding homography matrix Hin formula (2.38). A consistent
projective reconstruction implies that aT
(j)(j)=aT. So, when intro-
ducing additional images Ijwith j3 into the reconstruction process,
one cannot simply choose the matrix ˆ
Ajfor each new view indepen-
dently, but one has to make sure that the linear system of reconstruction
equations:
ˆρ1m1=ˆ
M
and ˆρjmj=ˆ
Ajˆ
M+ejfor j∈{2,...,m}(2.39)
is consistent, i.e., that it is such that the solutions ˆ
Msatisfy all recon-
struction equations at once, in as far as image projections mjare avail-
able. The correct way to proceed for the j-th image with j3 therefore
is to express ˆ
Ajmore generally as ˆ
Aj=κj(1/ej2)[ej]׈
F1j+ejaT
j
where κjRand ajR3are parameters such that the reconstruc-
tion equations ˆρjmj=ˆ
Ajˆ
M+ejhold for all reconstructed 3D points ˆ
M
obtained from formulas (2.36). Each image point mjbrings three linear
equations for the unknown parameters κjand aj, but also introduces
1 unknown scalar factor ˆρj. Hence, theoretically 2 point correspon-
dences between the first, the second and the j-th image would suffice
to determine κjand ajin a linear manner. However, the parameter-
ization of the matrix ˆ
Ajrelies on the availability of the fundamental
matrix F1jbetween the first and the j-th image. And, as explained
in Section 2.4.2, if F1jis to be estimated from point correspondences
too, then at least 7 point correspondences are needed. Therefore, from a
computational point of view it is better to estimate both ˆ
Aand e3from
the relation ˆρjmj=ˆ
Ajˆ
M+ejdirectly. Indeed, for each image point mj
this formula yields three linear equations in the nine unknown entries of
the matrix ˆ
Ajand the two unknown pixel coordinates of the epipole ej,
but it also introduces one unknown scalar factor ˆρj. Consequently, at
least 6 point correspondences are needed to uniquely determine ˆ
Ajand
ejin a linear manner. Moreover, since the fundamental matrix between
the first and the j-th image is defined in Section 2.4.1 as F1j=[ej]×Aj,
an estimate for Ajand ejimmediately implies an estimate for F1jas
well. At first sight this may seem a better approach to estimate the
2.6 From Pro jective to Metric Using More Than Two Images 357
fundamental matrix F1j, because it only needs 6 point correspondences
instead of 7, but one should realize that three images are involved here
(instead of only two images in Section 2.4.2). The relations between
three or more views of a static scene will be further explored in Sec-
tion 3 of Part 2 of this tutorial. More information on how to efficiently
compute ˆ
Ajin practice can be found in Subsection 4.4 of Section 4 in
Part 2 of this tutorial.
2.6.2 From Projective to Affine 3D Reconstruction
Now that the discussion of projective reconstruction is extended to the
case of more than two cameras, we next consider options to go beyond
this and upgrade to an affine or even a metric reconstruction. But
before doing so, we first make explicit the link between the plane at
infinity of the scene and the projective transformation matrix:
H=κK10
aTK11,(2.40)
which describes the transition from a metric to a projective reconstruc-
tion. Readers who are not very familiar with projective geometry can
find in Appendix A in Part 3 of this tutorial all necessary background
material that is used in this subsection, or they may skip this subsection
entirely and continue immediately with Section 2.6.3.
Recall that in projective geometry direction vectors are represented
as points on the plane at infinity of the world space. If we would be able
to identify the plane at infinity of the scene in the projective reconstruc-
tion, then by moving it to infinity, we can already upgrade the projec-
tive reconstruction to an affine one. In the projective reconstruction ˆ
Mof
Section 2.6.1, the plane at infinity of the scene is found as the plane with
equation aTˆ
Mκ= 0. Indeed, as explained in Appendix A, the plane
at infinity of the scene has homogeneous coordinates (0,0,0,1)Tin the
world space. Moreover, if a projective transformation whose action on
homogeneous coordinates of 3D points is represented by an invertible
4×4-homography matrix His applied to the world space, then homo-
geneous coordinates of planes are transformed by the inverse transpose
HT=(H1)T=(HT)1of the homography matrix H. For example,
358 Principles of Passive 3D Reconstruction
the 3D similarity transformation ¯
M=1
ρe2RT
1(MC1) is represented in
matrix notation by:
¯
M
1=1
ρe2RT
11
ρe2RT
1C1
0T1M
1.
The corresponding homography matrix is:
1
ρe2RT
11
ρe2RT
1C1
0T1
and its inverse transpose is:
ρe2RT
10
CT
11.
The plane at infinity of the scene has homogeneous coordinates
(0,0,0,1)Tand is mapped by this similarity transformation onto itself,
because
ρe2RT
10
CT
11
0
0
0
1
=
0
0
0
1
.
This illustrates again the general fact that Euclidean transformations,
similarity transformations and affine transformations of the world space
do not affect the plane at infinity of the scene. A projective transfor-
mation on the contrary generally does affect the plane at infinity. In
particular, the projective transformation λˆ
M
1=H¯
M
1defined by the
homography matrix (2.40) maps the plane at infinity of the scene to
the plane with homogeneous coordinates:
HT
0
0
0
1
=1
κKT
11
κa
0T1
0
0
0
1
=1
κa
1.
In other words, the plane at infinity of the scene is found in the
projective reconstruction ˆ
Mas the plane with equation 1
κaTˆ
M+1=0,
2.6 From Pro jective to Metric Using More Than Two Images 359
or equivalently, aTˆ
Mκ= 0. Given this geometric interpretation of
1
κa, it becomes even clearer why we need to keep these values the
same for the different choices of ˆ
Ajin the foregoing discussion about
building a consistent, multi-view projective reconstruction of the scene.
All two-view projective reconstructions need to share the same plane
at infinity for them to be consistent. This said, even if one keeps the
3D reconstruction consistent following the aforementioned method,
one still does not know Hor 1
κaT.
If the position of the plane at infinity of the scene were known in the
projective reconstruction, one could derive a projective transformation
which will turn the projective reconstruction into an affine one, i.e., to
put the plane at infinity really at infinity. Indeed, formula (2.37) can
be rewritten as:
λˆ
M
1=κK10
aTK11¯
M
1=κI30
aT1K1¯
M
1
=κI30
aT1˜
M
1,
where ˜
M=K1¯
Mis the affine 3D reconstruction that was introduced by
formula (2.25) in Section 2.5.3 (see also formula (2.30) in Section 2.5.4).
Knowing that 1
κaTˆ
M+ 1 = 0 is the equation of the plane at infinity
of the scene in the projective reconstruction, the previous equation can
also be written as:
λˆ
M
1=κI30
aT1˜
M
1=I30
1
κaT1κ˜
M
1.(2.41)
Observe that since ˜
Mis an affine 3D reconstruction of the scene, κ˜
Mis
an affine 3D reconstruction as well. Denoting π=1
κa, the plane at
infinity has equation πT
ˆ
M+ 1 = 0 and Equation (2.41) reads as:
λˆ
M
1=I30
πT
1κ˜
M
1.
This is the 3D projective transformation which maps the plane at
infinity in the affine reconstruction κ˜
Mto the plane with equation
360 Principles of Passive 3D Reconstruction
πT
ˆ
M+ 1 = 0 in the projective 3D reconstruction ˆ
M. Put differently,
if the plane at infinity of the scene can be identified as the plane
πT
ˆ
M+ 1 = 0 in the projective reconstruction ˆ
M(e.g., from directions
which are known to be parallel in the scene or from vanishing points
in the images), then the inverse projective transformation:
˜
λκ˜
M
1=I30
πT
1ˆ
M
1with ˜
λa non-zero scalar, (2.42)
or equivalently,
κ˜
M=ˆ
M
πTˆ
M+1
turns the projective 3D reconstruction ˆ
Minto the affine reconstruc-
tion κ˜
M. In mathematical parlance, the projective transformation (2.42)
maps the plane πT
ˆ
M+1=0 to infinity.
2.6.3 Recovering Metric Structure : Self-Calibration
Equations
If no information is available about where the plane at infinity is located
in the projective reconstruction ˆ
M, one can still upgrade the reconstruc-
tion, even to metric. Recall from formulas (2.39) that ˆρjmj=ˆ
Ajˆ
M+ej
for all j2 and from formula (2.37) that:
λˆ
M
1=κK10
aTK11¯
M
1.
Together, these equations yield
ˆρjmj=ˆ
Ajˆ
M+ej=(ˆ
Aj|ej)ˆ
M
1
=1
λ(ˆ
Aj|ej)κK10
aTK11¯
M
1
=1
λκˆ
AjK1+ejaTK1ej¯
M
1
or equivalently,
λˆρjmj=(κˆ
AjK1+ejaTK1)¯
M+ejfor all j2.
2.6 From Pro jective to Metric Using More Than Two Images 361
On the other hand, a similar computation as in the derivation of the
metric reconstruction equations (2.24) in Section 2.5.2 shows that
¯ρjmj=KjRT
jR1¯
M+ej,with ¯ρj=ρj
ρej
.
Comparing both equations, one sees that
κˆ
AjK1+ejaTK1=λjKjRT
jR1for all j2
and for some scalar λj. In Section 2.6.2 it is observed that the param-
eters κRand aR3determine the plane at infinity of the scene in
the projective reconstruction: π=1
κa. Making this reference to the
plane at infinity of the scene explicit in the previous equation, one gets
κj(ˆ
AjejπT
)K1=KjRT
jR1for all j2, (2.43)
where κj=κ/λjis a non-zero scalar. This equation has two interesting
consequences. First of all, multiplying both sides of the equality on the
right with the inverse of the calibration matrix K1gives
κj(ˆ
AjejπT
)=KjRT
jR1K1
1for all j2,
which, since the right-hand side of this equation is just the invertible
matrix Ajdefined in Section 2.4.1, yields an explicit relation between
the infinite homography Ajand the 3 ×3-matrices ˆ
Ajcomputed from
the images as described in Section 2.6.1, viz
Aj=κj(ˆ
AjejπT
) for all j2 (2.44)
with non-zero scalars κjR. And secondly, multiplying both sides in
equality (2.43) on the right with the inverse R1
1=RT
1of the rotation
matrix R1gives
κj(ˆ
AjejπT
)K1RT
1=KjRT
jfor all j2.
If one now multiplies both sides of this last equation with its transpose,
then
(KjRT
j)(KjRT
j)T=κ2
j(ˆ
AjejπT
)K1RT
1(K1RT
1)T(ˆ
AjejπT
)T
for all j2, which by RT
j=R1
jreduces to
KjKT
j=κ2
j(ˆ
AjejπT
)K1KT
1(ˆ
AjejπT
)T(2.45)
362 Principles of Passive 3D Reconstruction
for all j∈{2,...,m}. Equations (2.45) are the so-called self-calibration
or autocalibration equations [8] and all self-calibration methods essen-
tially are variations on solving these equations for the calibration matri-
ces Kjand the 3-vector πlocating the plane at infinity of the scene
in the projective reconstruction. The various methods may differ in the
constraints or the assumptions on the calibration matrices they employ,
however.
2.6.4 Scrutinizing the Self-Calibration Equations
The self-calibration equations (2.45) have a simple geometric interpre-
tation, which we will explore first before looking into ways for solving
them. Readers who are merely interested in the practice of 3D recon-
struction may skip this section and continue with Section 2.6.5.
2.6.4.1 Metric Structure and the Preservation of Angles
In Section 2.5.2 it was observed that if one wants to reconstruct a scene
from images only and if no absolute distance is given for any parts of
the scene, then one can never hope to do better than a metric recon-
struction, i.e., a 3D reconstruction which differs from the original scene
by an unknown 3D similarity transformation. Typical of a 3D simi-
larity transformation is that all distances in the scene are scaled by
a fixed scalar factor and that all angles are preserved. Moreover, for
a projective 3D reconstruction of the scene to be a metric one, it is
necessary and sufficient that the angles of any triangle formed by three
points in the reconstruction are equal to the corresponding angles in the
triangle formed by the original three scene points. The self-calibration
equations (2.45) enforce this condition in the projective reconstruction
at hand, as will be explained now.
Let M,P, and Qbe three arbitrary points in the scene. The angle
between the line segments [M,P] and [M,Q] in Euclidean 3-space is found
as the angle between the 3-vectors PMand QMin R3. This angle
is uniquely defined by its cosine, which is given by the formula:
cos(PM,QM)= PM,QM
PMQM,
2.6 From Pro jective to Metric Using More Than Two Images 363
where PM,QMdenotes the (standard) inner product; and,
PM=PM,PMand QM=QM,QMare the
norms of the 3-vectors PMand QMin R3.Now,PMis a direc-
tion vector for the line defined by the points Mand Pin the scene.
As explained in Appendix A, the vanishing point vjof the line MP in
the j-th image (j∈{1,2,...,m}) is given by ρvj vj=KjRT
j(PM),
where ρjmj=KjRT
j(MCj) are the projection equations of the j-th
camera, as defined by formula (2.34). Since Rjis a rotation matrix,
RT
j=R1
jand the 3-vector PMcan (theoretically) be recovered up
to scale from its vanishing point vjin the j-th image by the for-
mula PM=ρvj RjK1
jvj. Similarly, QMis a direction vector of the
line through the points Mand Qin the scene and the vanishing point
wjof this line in the j-th image is given by ρwj wj=KjRT
j(QM),
thus yielding QM=ρwj RjK1
jwj. By definition of the inner
product in R3,
PM,QM=(PM)T(QM)
=(ρvj RjK1
jvj)T(ρwj RjK1
jwj)
=ρvj ρwj vT
jKT
jK1
jwj
=ρvj ρwj vT
j(KjKT
j)1wj.
A similar calculation yields
PM2=(PM)T(PM)=ρ2
vj vT
j(KjKT
j)1vj
and QM2=(QM)T(QM)=ρ2
wj wT
j(KjKT
j)1wj.
Combining the previous expressions, one gets the following formula for
the cosine of the angle between the line segments [M,P] and [M,Q]inthe
scene:
cos(PM,QM)= vT
j(KjKT
j)1wj
vT
j(KjKT
j)1vjwT
j(KjKT
j)1wj
.(2.46)
This equation states that the angle between two lines in the scene can
be measured from a perspective image of the scene if the vanishing
points of these lines can be identified in the image and if the calibration
matrix Kjof the camera — or rather KjKT
j— is known.
364 Principles of Passive 3D Reconstruction
This should not come as a surprise, since vanishing points encode
3D directions in a perspective image. What is more interesting to
notice, however, is that the inner product of the scene is encoded in
the image by the symmetric matrix (KjKT
j)1. In the computer vision
literature this matrix is commonly denoted by ωjand referred to as the
image of the absolute conic in the j-th view [16]. We will not expand on
this interpretation right now, but the interested reader can find more
details on this matter in Appendix B. The fact that the calibration
matrix Kjappears in a formula relating measurements in the image to
measurements in the scene was to be expected. The factor RT
j(MCj)
in the right-hand side of the projection equations ρjmj=KjRT
j(MCj)
corresponds to a rigid motion of the scene, and hence does not have an
influence on angles and distances in the scene. The calibration matrix
Kj, on the other hand, is an upper triangular matrix, and thus intro-
duces scaling and skewing. When measuring scene angles and distances
from the image, one therefore has to undo this skewing and scaling
first by premultiplying the image coordinates with the inverse matrix
K1
j. And, last but not least, from the point of view of camera self-
calibration, formula (2.46) introduces additional constraints between
different images of the same scene, viz:
vT
i(KiKT
i)1wi
vT
i(KiKT
i)1viwT
i(KiKT
i)1wi
=vT
j(KjKT
j)1wj
vT
j(KjKT
j)1vjwT
j(KjKT
j)1wj
,
which must hold for every pair of images i, j ∈{1,2,...,m}and for
every two pairs of corresponding vanishing points vi,vjand wi,wjin
these images. Obviously, only m1 of these relations are independent:
vT
1(K1KT
1)1w1
vT
1(K1KT
1)1v1wT
1(K1KT
1)1w1
=vT
j(KjKT
j)1wj
vT
j(KjKT
j)1vjwT
j(KjKT
j)1wj
(2.47)
for every j2 and for every two pairs of corresponding vanishing
points v1,vjand w1,wjbetween the first and the j-th image. The
self-calibration equations (2.45) enforce these constraints in the projec-
tive reconstruction of the scene, as will be demonstrated next.
2.6 From Pro jective to Metric Using More Than Two Images 365
2.6.4.2 Infinity Homographies and the
Preservation of Angles
To see why the final claim in the last section holds, we have to rein-
terpret the underlying relation (2.46) in terms of projective geome-
try. Consider again the (arbitrary) scene points M,P, and Q. Their
extended coordinates, respectively, are M
1,P
1, and Q
1. As explained in
Appendix A, the direction vector PMof the line Lthrough the points
Mand Pin the scene corresponds in projective 3-space to the point of
intersection of the line through the projective points M
1and P
1with
the plane at infinity of the scene; and the vanishing point vjof Lin
the j-th image is the perspective projection of this point of intersection
onto the image plane of the j-th camera (j∈{1,2,...,m}). In particu-
lar, the vanishing points v1,v2,... , vmare corresponding points in the
images I1,I2, ... ,Im. Moreover, it was explained in Section 2.4.3 that
the matrix Aj=KjRT
jR1K1
1introduced in Section 2.4.1 actually is
a homography matrix representing the projective transformation that
maps (vanishing) points from the first image via the plane at infinity
of the scene onto the corresponding (vanishing) point in the j-th image
(j2), and therefore is referred to in the computer vision literature as
the infinite homography between the first and the j-th image. Explicitly,
Ajv1=ρjvjfor some non-zero scalar factor ρj. Similarly, the vanishing
points w1and wjof the line through Mand Qin, respectively, the first
and j-th image satisfy Ajw1=σjwjfor some non-zero scalar factor σj.
Using the infinite homography Aj=KjRT
jR1K1
1, the inner product
vT
j(KjKT
j)1wjin the j-th image can also be expressed in terms of the
corresponding vanishing points v1and w1in the first image:
vT
j(KjKT
j)1wj=1
ρj
Ajv1T
(KjKT
j)11
σj
Ajw1
=1
ρj
1
σj
vT
1AT
j(KjKT
j)1Ajw1.(2.48)
Using again Aj=KjRT
jR1K1
1and the fact that RT
j=R1
j, since
Rjis a rotation matrix, the 3 ×3-matrix AT
j(KjKT
j)1Ajin the
366 Principles of Passive 3D Reconstruction
right-hand side of the previous equality simplifies to:
AT
j(KjKT
j)1Aj=(KjRT
jR1K1
1)T(KT
jK1
j)(KjRT
jR1K1
1)
=(K1KT
1)1.
Equation (2.48) then reads
vT
1(K1KT
1)1w1=ρjσjvT
j(KjKT
j)1wj.
By similar calculations, one also finds that vT
1(K1KT
1)1v1=
ρ2
jvT
j(KjKT
j)1vjand wT
1(K1KT
1)1w1=σ2
jwT
j(KjKT
j)1wj. Together,
these three equalities establish the relations (2.47) in an almost triv-
ial manner. It is important to realize, however, that it actually is the
equality:
AT
j(KjKT
j)1Aj=(K1KT
1)1,(2.49)
which makes that the constraints (2.47) are satisfied. Our claim is that
the self-calibration equations (2.45) are nothing else but this fundamen-
tal relation (2.49) expressed in terms of the projective reconstruction
of the scene which was computed from the given images as described
in Section 2.6.1.
2.6.4.3 Equivalence of Self-Calibration and the
Preservation of Angles
Now, suppose that no information whatsoever on the scene is given,
but that we have computed matrices ˆ
Ajand epipoles ejfor each image
(j2) as well as a collection of 3D points ˆ
Msatisfying the projective
reconstruction equations (2.39), viz:
ˆρ1m1=ˆ
Mand ˆρjmj=ˆ
Ajˆ
M+ejfor j∈{2,...,m},
for given point correspondences m1,m2,...,mmbetween the images. The
3D points ˆ
Mhave been shown in Section 2.5.4 to form a projective
3D reconstruction of the scene. To prove our claim about the self-
calibration equations, we will follow the same line of reasoning as the
one that led to relation (2.49) above. Let ˆ
Mand ˆ
Pbe two arbitrary points
in the projective reconstruction. As explained in Appendix A, the van-
ishing point vjin the j-th image of the line ˆ
Lthrough the points ˆ
Mand
2.6 From Pro jective to Metric Using More Than Two Images 367
ˆ
Pin the 3D reconstruction is the projection of the point of intersection
of the line ˆ
Lwith the plane at infinity of the scene. Since no information
whatsoever on the scene is available, we do not know where the plane
at infinity of the scene is located in the reconstruction. However, we do
know that for every 3D line in the projective reconstruction, its vanish-
ing points in the respective images should be mapped onto each other
by the projective transformation which maps one image to another via
the plane at infinity of the scene. The idea now is to identify a plane in
the projective reconstruction which does exactly that. Suppose that the
(unknown) equation of this plane is πT
ˆ
M+ 1 = 0. The projective trans-
formation which maps the first image to the j-th one via this plane is
computed as follows: From the first projective reconstruction equation
ˆρ1m1=ˆ
Mit follows that parameter equations for the projecting ray of
an arbitrary image point m1in the first camera are given by ˆ
Mρm1
where ˆρRis a scalar parameter. This projecting ray intersects the
plane at infinity of the scene in the point ˆ
Mρm1that satifies the
equation πT
ˆ
M+ 1 = 0. Hence,
ˆρ=1
πT
m1
and ˆ
M=1
πT
m1
m1.
By the projective reconstruction equations (2.39), the projection mjof
this point ˆ
Min the j-th image for j2 satisfies ˆρjmj=ˆ
Ajˆ
M+ej.
Substituting the expression for ˆ
Min this equation yields
ˆρjmj=1
πT
m1
ˆ
Ajm1+ej,
or equivalently,
(πT
m1ρjmj=ˆ
Ajm1ej(πT
m1)=(ˆ
AjejπT
)m1.
Consequently, the 3 ×3-matrix ˆ
AjejπT
is a homography matrix of
the projective transformation which maps the first image to the j-th
one via the plane at infinity of the scene. In Section 2.4.3, on the other
hand, it was demonstrated that the invertible matrix Ajintroduced in
Section 2.4.1 also is a matrix for this infinite homography. Therefore,
ˆ
AjejπT
and Ajmust be equal up to a non-zero scalar factor, i.e.,
Aj=κj(ˆ
AjejπT
) for some non-zero scalar κjR. Notice that this
368 Principles of Passive 3D Reconstruction
is exactly the same expression for Ajas was found in formula (2.44) of
Section 2.6.3, but now it is obtained by a geometrical argument instead
of an algebraic one.
Clearly, for any plane πTˆ
M+ 1 = 0 the homography matrix ˆ
Aj
ejπTwill map the first image to the j-th one via that plane. But only
the plane at infinity of the scene will guarantee that the cosines
vT
1(K1KT
1)1w1
vT
1(K1KT
1)1v1wT
1(K1KT
1)1w1
and vT
j(KjKT
j)1wj
vT
j(KjKT
j)1vjwT
j(KjKT
j)1wj
(cf. formula (2.47) ) computed in each image j(j2) admit the same
value whenever v1is mapped onto vjand w1is mapped onto wjby the
homography ˆ
AjejπT. And, as was observed earlier in this section,
this will only be guaranteed if AT
j(KjKT
j)1Aj=(K1KT
1)1where
Ajnow has to be interpreted as the infinite homography mapping the
first image to the j-th image by the plane at infinity of the scene (cf.
formula (2.49) ). Since Aj=κj(ˆ
AjejπT
), the plane at infinity of
the scene is that plane πT
ˆ
M+ 1 = 0 in the projective reconstruction
for which
κ2
j(ˆ
AjejπT
)T(KjKT
j)1(ˆ
AjejπT
)=(K1KT
1)1
for all j∈{2,3,...,m}. This equality expresses that the intrinsic way
of measuring in the j-th image — represented by the symmetric matrix
(KjKT
j)1— must be compatible with the intrinsic way of mea-
suring in the first image — represented by the symmetric matrix
(K1KT
1)1— and, more importantly, that it is the infinite homogra-
phy between the first and the j-th image — represented by the matrix
(ˆ
AjejπT
) — which actually transforms the metric (KjKT
j)1into
the metric (K1KT
1)1. Finally, if both sides of this matrix equality are
inverted, one gets the relation:
1
κ2
j
(ˆ
AjejπT
)1KjKT
j(ˆ
AjejπT
)T=K1KT
1,
or, after solving for KjKT
j,
KjKT
j=κ2
j(ˆ
AjejπT
)K1KT
1(ˆ
AjejπT
)T,(2.50)
2.6 From Pro jective to Metric Using More Than Two Images 369
which must hold for all j∈{2,3,...,m}. Observe that these are exactly
the self-calibration equations (2.45), as we claimed earlier.
Intuitively, these equations state that if a plane πT
ˆ
M+1=0 in
the reconstruction is known or found to be the plane at infinity of the
scene and if a metric — represented by a symmetric, positive-definite
matrix K1KT
1— is induced in the first image, then by really map-
ping the plane πˆ
M+ 1 = 0 to infinity and by measuring in the j-th
image (j2) according to a metric induced from K1KT
1through for-
mula (2.50), the projective reconstruction will be upgraded to a metric
3D reconstruction of the scene. This is in accordance with our findings
in Section 2.6.2 that the projective transformation
κ¯
M=K1
1ˆ
M
πT
ˆ
M+1
transforms the projective reconstruction ˆ
Minto the metric reconstruc-
tion κ¯
M.
2.6.5 A Glimpse on Absolute Conics and Quadrics
The right-hand side of the self-calibration equations (2.50) can also be
written as:
KjKT
j=κ2
j(ˆ
AjejπT
)K1KT
1(ˆ
AjejπT
)T
=κ2
jˆ
AjejI3
πT
K1KT
1I3πˆ
AT
j
eT
j
=κ2
jˆ
AjejK1KT
1K1KT
1π
πT
K1KT
1πT
K1KT
1πˆ
AT
j
eT
j.
In the computer vision literature, the 3 ×3-matrix KjKT
jin the left-
hand side of the equality is commonly denoted by ω
jand referred to
as the dual image of the absolute conic in the j-th view [16]. It is the
mathematical dual of the image of the absolute conic ωjin the j-th
view. The 4 ×4-matrix
=K1KT
1K1KT
1π
πT
K1KT
1πT
K1KT
1π(2.51)
370 Principles of Passive 3D Reconstruction
in the right-hand side of the previous equality, on the other hand, is
in the computer vision literature usually referred to as the absolute
quadric [16]. The self-calibration equations (2.50) are compactly writ-
ten in this manner as:
ω
j=κ2
j(ˆ
Aj|ej)Ω
(ˆ
Aj|ej)T,(2.52)
where ω
j=KjKT
jand with Ωas defined above. We will not expand
on this interpretation in terms of projective geometry right now, but the
interested reader can find more details on this matter in Appendix B.
The main advantage of writing the self-calibration equations in the
form (2.52) is that it yields linear equations in the entries of ω
jand
the entries of Ω, which are easier to solve in practice than the non-
linear formulation (2.50). Moreover, the absolute quadric Ωencodes
both the plane at infinity of the scene and the internal calibration of
the first camera in a very concise fashion. Indeed,
=I30
πT
1K1KT
10
0T0 I30
πT
1T
,
from which it immediately follows that Ωhas rank 3 and that its
nullspace is spanned by the plane at infinity of the scene (πT
1)T.
Moreover, Ωcan also be decomposed as:
=K10
πT
K11I30
0T0 K10
πT
K11T
,
where the 4 ×4-matrix
K10
πT
K11=I30
πT
1K10
0T1,(2.53)
is the homography matrix of the projective transformation
λˆ
M
1=K10
πT
K11κ¯
M
1,
which relates the projective reconstruction ˆ
Mto the metric reconstruc-
tion κ¯
M, as discussed in Section 2.6.2. Hence, the rectifying homogra-
phy to update the projective reconstruction of the scene to a metric
2.6 From Pro jective to Metric Using More Than Two Images 371
one is directly available once the absolute quadric Ωhas been recov-
ered. It is interesting to observe that the decomposition (2.53) of the
rectifying homography exhibits the stratification of geometries as dis-
cussed in Section 2.5.5. The rightmost matrix in the right-hand side
of equation (2.53) represents the affine deformation induced on the
metric reconstruction κ¯
Mby including the uncertainty about the inter-
nal camera parameters K1in the 3D reconstruction, yielding the affine
reconstruction κ˜
M; and the leftmost matrix in the decomposition (2.53)
changes the plane at infinity to the plane πT
ˆ
M+ 1 = 0 in the projective
reconstruction ˆ
Mof the scene.
How Ωis computed in practice, given a projective reconstruction
of the scene is discussed in more detail in Subsection 4.6.2 of Section 4
in Part 2 of the tutorial. Therefore we will not continue with this topic
any further here, but instead we will address the question of how many
images are needed in order for the self-calibration equations to yield a
unique solution.
2.6.6 When do the Self-Calibration Equations Yield a
Unique Solution ?
The self-calibration equations yield additional constraints on the cal-
ibration matrices of the cameras and about the location of the plane
at infinity of the scene in a projective 3D reconstruction. But as was
already observed in Sections 2.5.4 and 2.5.5, if no additional informa-
tion about the internal and/or external calibration of the cameras or
about the Euclidean structure of the scene is available, then with only
two images one cannot hope for better than a projective reconstruction
of the scene. The question that remains unsolved up till now is: Is it
possible to recover a metric reconstruction of the scene with only the
images at our disposal; and, more importantly, how many images are
needed to obtain a unique solution?
Consider again the self-calibration equations (2.50) in their compact
formulation:
KjKT
j=κ2
j(ˆ
AjejπT
)K1KT
1(ˆ
AjejπT
)T(2.54)
for each j∈{2,3,...,m}. In these equations the calibration matrices Kj
all appear as KjKT
j. It is thus advantageous to take the entries of
372 Principles of Passive 3D Reconstruction
KjKT
jas the unknowns in the self-calibration equations instead of
expressing them in terms of the internal camera parameters consti-
tuting Kj(cf. formula (2.4)). As Kjis an invertible upper-triangular
matrix, KjKT
jis a positive-definite symmetric matrix. So, if KjKT
j
is known, the calibration matrix Kjitself can uniquely be obtained
from KjKT
jby Cholesky factorization [3]. Furthermore, each KjKT
j
is a symmetric 3 ×3-matrix whose (3,3)-th entry equals 1 (cf. for-
mula (2.4)). Consequently, each KjKT
jis completely characterized by
five scalar parameters, viz. the diagonal elements other than the (3,3)-
th one and the upper-triangular entries. Similarly, the scalar factors κ2
j
can be considered as being single variables in the self-calibration equa-
tions. Together with the three unknown components of the 3-vector π,
the number of unknowns in the self-calibration equations (2.54) for
mimages add up to 5m+(m1)+3=6m+ 2. On the other hand,
for mimages, formula (2.54) yields m1 matrix equations. Since
both sides of these equations are formed by symmetric 3 ×3-matrices,
each matrix equation induces only six different non-linear equations
in the components of KjKT
j, the components of πand the scalars
κ2
jfor j∈{1,2,...,m}. Hence, for mimages, the self-calibration equa-
tions (2.54) yield a system of 6(m1)=6m6 non-linear equations
in 6m+ 2 unknowns. Clearly, without additional constraints on the
unknowns, this system does not have a unique solution.
In practical situations, however, quantitative or qualitative infor-
mation about the cameras can be used to constrain the number of
solutions. Let us consider some examples.
Images obtained by the same or identical cameras.
If the images are obtained with one or more cameras whose
calibration matrices are the same, then K1=K2=... =
Km=Kand the self-calibration equations (2.54) reduce to
KKT=κ2
j(ˆ
AjejπT
)KKT(ˆ
AjejπT
)T
for all j∈{2,...,m}. In this case, only five internal cam-
era parameters — in practice, the five independent scalars
characterizing KKT— are to be determined, reducing the
number of unknowns to 5 + (m1)+3=m+ 7. On the
2.6 From Pro jective to Metric Using More Than Two Images 373
other hand, the self-calibration equations yield six equations
for each image other than the first one. If these equations
are independent for each view, a solution is determined
provided 6(m1) m+ 7. Consequently, if m3images
obtained by cameras with identical calibration matrices, then
the calibration matrix Kand the plane at infinity of the
scene can — in principle — be determined from the self-
calibrations equations and a metric reconstruction of the
scene can be obtained.
Images obtained by the same or identical cameras
with different focal lengths.
If only the focal length of the camera is varying between the
images, then four of the five internal parameters are the same
for all cameras, which brings the total number of unknowns
to 4 + m+(m1)+3=2m+ 6. Since the self-calibration
equations (2.54) bring six equations for each image other
than the first one, a solution is in principle determined pro-
vided 6(m1) 2m+ 6. In other words, when the focal
length of the camera is allowed to vary between the images,
then — in principle — a metric reconstruction of the scene
can be obtained from m3images.
Known aspect ratio and skew, but unknown and dif-
ferent focal length and principal point.
When the aspect ratio and the skew of the cameras are
known, but the focal length and the principal point of the
cameras are unknown and possibly different for each image,
only three internal parameters have to be determined for each
camera. This brings the number of unknowns for all mimages
to 3m+(m1)+3=4m+ 2. As formula (2.54) brings
six equations for each image other than the first one, a solu-
tion is in principle determined provided 6(m1) 4m+2.
In other words, when the aspect ratio and the skew of the cam-
eras are known, but the focal length and the principal point
of the cameras are unknown and allowed to vary between the
images, then — in principle — a metric reconstruction of
374 Principles of Passive 3D Reconstruction
the scene can be obtained from m4images. Note that the
case of square pixels, usual with digital cameras, is a special
case of this.
Rectangular pixels (and, hence, known skew) and
unknown, but fixed aspect ratio.
In case the skew of the pixels is known and if the aspect
ratio is identical for all cameras, but unknown, then the
total number of unknowns in the self-calibration equations is
1+3m+(m1)+3=4m+ 3. Because there are six self-
calibration equations for each image other than the first one,
a solution is in principle determined provided 6(m1)
4m+ 3. In other words, when the skew of the cameras is
known and if the aspect ratio is identical for all cameras, but
unknown, and if the focal length and the principal point of
the cameras are unknown and allowed to vary between the
images, then — in principle — a metric reconstruction of
the scene can be obtained from m5images.
Aspect ratio and skew identical, but unknown.
In the situation where the aspect ratio and the skew of the
cameras are identical, but unknown, two of the five inter-
nal parameters are the same for all cameras, which brings
the total number of unknowns to 2 + 3 m+(m1)+3=
4m+ 4. As formula (2.54) brings six equations for each
image other than the first one, a solution is in principle deter-
mined provided 6(m1) 4m+ 4. In other words, if the
aspect ratio and the skew are the same for each camera, but
unknown, and when the focal length and the principal point of
the camera are allowed to vary between the images, then —
in principle — a metric reconstruction of the scene can be
obtained from m5images.
Rectangular pixels or known skew.
If only the skew is known for each image, then four internal
parameters have to be determined for each camera, which
brings the total number of unknowns to 4m+(m1)+3=
5m+ 2. Since there are six self-calibration equations for each
2.6 From Pro jective to Metric Using More Than Two Images 375
image other than the first one, a solution is in principle deter-
mined provided 6(m1) 5m+ 2. In other words, when
only the skew is known for each image, but all the other
internal parameters of the camera are unknown and allowed
to vary between the images, then — in principle — a met-
ric reconstruction of the scene can be obtained from m8
images.
Aspect ratio unknown, but fixed.
When the aspect ratio of the pixels is the same for all cam-
eras, but its value is unknown, then of the five unknown
internal parameters of the cameras, one is the same for all
cameras, thus bringing the total number of unknowns to
1+4m+(m1)+3=5m+ 3. With six self-calibration
equations for each image other than the first one, a solu-
tion is in principle determined provided 6(m1) 5m+3.
In other words, if the aspect ratio of the pixels is the same
for each camera, but its value is unknown, and when all the
other internal parameters of the cameras are unknown and
allowed to vary between the images, then — in principle —
a metric reconstruction of the scene can be obtained from
m9images.
In conclusion, although the self-calibration equations (2.54) bring
too few equations to allow a unique solution for the calibration
matrix Kjof each camera and to uniquely identify the plane at infin-
ity of the scene in the projective 3D reconstruction, in most practical
situations a unique solution can be obtained by exploiting additional
constraints on the internal parameters of the cameras, provided a suffi-
cient number of images are available. The required minimum number of
images as given above for different situations only is indicative in that
it is correct, provided the resulting self-calibration equations are inde-
pendent. In most applications this will be the case. However, one should
always keep in mind that there do exist camera configurations and cam-
era motions for which the self-calibration equations become dependent
and the system is degenerate. Such special situations are referred to
in the literature as critical motion sequences. A detailed analysis of all
376 Principles of Passive 3D Reconstruction
these cases is beyond the scope of this text, but the interested reader
is referred to [9, 14, 15] for further information. Moreover, it must also
be emphasized that the self-calibration equations in (2.54) are not very
well suited for numerical computations. For practical use, their linear
formulation (2.52) in terms of the absolute quadric is recommended.
Subsection 4.6.2 of Section 4 in Part 2 of this tutorial discusses this
issue in more detail.
2.7 Some Important Special Cases
In the preceding sections the starting point was that no information
about the internal and external parameters of the cameras is available
and that the positions and orientations of the cameras can be com-
pletely arbitrary. In the last section, it was observed that one has to
assume some constraints on the internal parameters in order to solve the
self-calibration equations, thereby still leaving the external parameters
completely unknown at the outset. In practical applications, the lat-
ter might not always be the case either and quite a bit of knowledge
about the camera motion may be available. Situations in which the
object or the camera has purely translated in between the acquisition
of the images are quite common (e.g., a camera on rails), and so are
cases of pure camera rotation (e.g., a camera on a tripod). These simple
camera motions both offer opportunities and limitations, which will be
explored in this section.
Moreover, sometimes it makes more sense to first calibrate cameras
internally, and then to only recover 3D structure and external param-
eters from the images. An example is digital surveying (3D measuring
in large-scale environments like cities) with one or more fixed cameras
mounted on a van. These cameras can be internally calibrated before
driving off for a new surveying campaign, during which the internal
parameters can be expected to remain fixed.
And, last but not least, in practical applications it may often be use-
ful and important — apart from reconstructing the three-dimensional
structure of the scene — to obtain also information about the camera
positions and orientations or about the camera parameters underlying
the image data. In the computer vision literature this is referred to as
2.7 Some Important Special Cases 377
structure-and-motion. The 3D reconstruction process outlined in the
previous sections can provide such information too. The fundamental
concepts needed to retrieve camera pose information will be introduced
in the subsequent sections as well.
2.7.1 Camera Translation and Stereo Rigs
Suppose that in between the acquisition of the first and the second
image the camera only has translated and that the internal camera
parameters did not change. In that case, the orientation of the camera
has not changed — i.e., R1=R2=R— and the calibration matrices
are the same — i.e., K1=K2=K. Consequently, the invertible matrix
Aintroduced in Section 2.4.1 reduces to:
A=K2RT
2R1K1
1=KRTRK1=I3,
the 3 ×3-identity matrix I3, because Ris an orthogonal matrix and
thus Rt=R1. In other words, if one knows that the camera has
only translated between the acquisition of the images, then the infinite
homography Ais theoretically known to be the 3 ×3-identity matrix.
Recall from Section 2.5.3 that, if Ais known, an affine reconstruction
of the scene can be computed by solving the system of affine recon-
struction equations (2.25), viz:
˜ρ1m1=˜
Mand ˜ρ2m2=A˜
M+e2=˜
M+e2,
yielding six equations in five unknowns. Put differently, in case of a
pure camera translation an affine reconstruction of the scene can be
computed from two uncalibrated images.
An example of an image pair, taken with a camera that has trans-
lated parallel to the object, is shown in Figure 2.15. Two views of
the resulting affine 3D reconstruction are shown in Figure 2.16. The
results are quite convincing for the torsos, but the homogeneous wall
in the background could not be reconstructed. The difference lies in the
presence or absence of texture, respectively, and the related ease or dif-
ficulty in finding corresponding points. In the untextured background
all points look the same and the search for correspondences fails.
It is important to observe that without additional information about
the camera(s) or the scene, one can still do no better than an affine
378 Principles of Passive 3D Reconstruction
Fig. 2.15 A pair of stereo images for a scene with two torsos, where the cameras have
identical settings and are purely translated with respect to each other.
Fig. 2.16 Two views of a three-dimensional affine reconstruction obtained from the stereo
image pair of Figure 2.15.
reconstruction even if one gets additional, translated views. Indeed, the
self-calibration equations (2.45) derived in Section 2.6.3 are:
KjKT
j=κ2
j(ˆ
AjejπT
)K1KT
1(ˆ
AjejπT
)T
for all j∈{2,...,m}. In case of a translating camera with con-
stant internal parameters, K1=K2=... =Km=K,A2=A3=... =
Am=I3and π=(0,0,0)t, as the scene structure has been recovered
up to a 3D affine transformation (instead of only a general projective
one). Hence, the self-calibration equations reduce to:
KKT=κ2
j(I3ej0T)KKT(I3ej0T)T
2.7 Some Important Special Cases 379
for all j∈{2,...,m}; or equivalently, KKT=κ2
jKKT, which only
implies that all κ2
jmust be equal to 1, but do not yield any infor-
mation about the calibration matrix K. In summary, with a translating
camera an affine reconstruction of the scene can be obtained already
from two images, but self-calibration and metric reconstruction are not
possible.
A special case of camera translation, which is regularly used in
practical applications, is a stereo rig with two identical cameras (i.e.,
cameras having the same internal parameters) in the following config-
uration: The optical axes of the cameras are parallel and their image
planes are coplanar, with coincident x-axes. This special configuration
is depicted in Figure 2.17. The distance between the two centers of pro-
jection is called the baseline of the stereo rig and is denoted by b.To
simplify the mathematical analysis, we may, without loss of generality,
let the world frame coincide with the camera-centered reference frame
of the left or ‘first’ camera. The projection equations of the stereo rig
Fig. 2.17 Schematic representation of a simple stereo rig: two identical cameras having
coplanar image planes and parallel axes. In particular, their x-axes are aligned.
380 Principles of Passive 3D Reconstruction
then reduce to:
ρ1
x1
y1
1
=K
X
Y
Z
and ρ2
x2
y2
1
=K
Xb
Y
Z
,
where K=
αxsp
x
0αypy
001
is the calibration matrix of the
two cameras. The pixel coordinates (x1,y1) and (x2,y2) of the projec-
tions m1and m2of a scene point Mwhose coordinates with respect to
the world frame are (X,Y,Z), are given by:
$$x_1 = \alpha_x \frac{X}{Z} + s\,\frac{Y}{Z} + p_x, \qquad y_1 = \alpha_y \frac{Y}{Z} + p_y$$
and
$$x_2 = \alpha_x \frac{X - b}{Z} + s\,\frac{Y}{Z} + p_x, \qquad y_2 = \alpha_y \frac{Y}{Z} + p_y.$$
In particular, $y_1 = y_2$ and $x_1 = x_2 + \alpha_x \frac{b}{Z}$. In other words, corresponding points in the images are found on the same horizontal line (the epipolar line for this particular setup) and the horizontal distance between them, viz. $x_1 - x_2 = \alpha_x \frac{b}{Z}$, is inversely proportional to the Z-coordinate (i.e., the projective depth) of the underlying scene point M.
In the computer vision literature the difference $x_1 - x_2$ is called the disparity between the image points, and the projective depth Z is sometimes also referred to as the range of the scene point. In photography the resolution of an image is defined as the minimum distance two points in the image have to be apart in order to be visually distinguishable. As the range Z is inversely proportional to the disparity, it follows that beyond a certain distance, depth measurement will become very coarse. Human stereo depth perception, for example, which is based on a two-eye configuration similar to this stereo rig, is limited to distances of about 10 m. Beyond, depth impressions arise from other cues. In a stereo rig the disparity between corresponding image points, apart from being inversely proportional to the projective depth Z, also is directly proportional to the baseline b and to $\alpha_x$. Since $\alpha_x$ expresses the focal length of the cameras in number of pixels for the x-direction of the image (cf. formula (2.2) in Section 2.2.2), the depth resolution of a stereo rig can be increased by increasing one or both of these variables.
Upon such a change, the same distance will correspond to a larger disparity and therefore distance sampling gets finer. It should be noted, however, that one should strike a balance between increasing resolution and keeping as much of the scene as possible visible to both cameras. Indeed, when disparities get larger, chances of finding both projections within the images diminish. Figure 2.18 shows planes of equal disparity for two focal lengths and two baseline distances.
Fig. 2.18 Equal disparities correspond to points at equal distance to the stereo system (iso-depth planes): Rays that project onto points with a fixed disparity intersect in planes of constant range. At a given distance these planes get closer (i.e., sample distance more densely) if the baseline is increased (top-left to top-right), the focal length is increased (top-left to bottom-left), or both (top-left to bottom-right).
Notice how for this particular stereo setup points with identical disparities form planes at equal depth or distance from the stereo system. The same distance is seen to be sampled more finely after the focal length or the baseline has been increased. A smaller part of the scene is visible to both cameras upon such a change, certainly for those parts which are close.
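The disparity-to-range relation of the simple stereo rig can be applied directly in code. The minimal helper below (name and interface are ours, purely illustrative) converts disparities into ranges and makes the trade-off explicit: doubling the baseline b or the focal length $\alpha_x$ doubles the disparity produced by a point at a given range Z.

```python
import numpy as np

def range_from_disparity(x1, x2, alpha_x, b):
    """Range Z of a scene point seen by the simple stereo rig, using
    x1 - x2 = alpha_x * b / Z, i.e. Z = alpha_x * b / (x1 - x2).

    x1, x2  : x-coordinates (pixels) of corresponding points on the same scanline
    alpha_x : focal length expressed in pixels along the x-direction
    b       : baseline length (same unit as the returned range)
    """
    disparity = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return alpha_x * b / disparity

# Example: with alpha_x = 1000 pixels and b = 0.1 m, a disparity of 5 pixels
# corresponds to a range of 20 m; doubling the baseline doubles the disparity
# measured for the same point, so the same range is sampled twice as finely.
```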
Similar considerations are also relevant for more general relative camera displacements than pure translations: precision goes up as camera views differ more and the projecting rays intersect at larger angles (i.e., if images are taken under wide-baseline conditions). On the other hand, without intermediate views at one's disposal, there tend to be holes in the reconstruction, for points visible in only one view or in none.
2.7.2 Pure Rotation Around the Center of Projection
Another special case of camera motion is a camera that rotates around
the center of projection. This situation is depicted in Figure 2.19. The
point C denotes the center of projection. A first image I1 is recorded and then the camera is rotated around the center of projection to record the second image I2. A scene point M is projected in the images I1 and I2 onto the image points m1 and m2, respectively. As the figure shows, 3D reconstruction from point correspondences is not possible in this situation. The underlying scene point M is to be found at the intersection of the projecting rays of m1 and m2, but these coincide.
Fig. 2.19 Two images I1 and I2 are recorded by a camera which rotates around the center of projection C. A scene point M is projected in the images onto the image points m1 and m2, respectively.
This conclusion can also be obtained algebraically by investigating
the reconstruction equations for this case. Recall from Section 2.3 that
the projection equations for two cameras in general position are:
$$\rho_1 m_1 = K_1 R_1^T (M - C_1) \quad\text{and}\quad \rho_2 m_2 = K_2 R_2^T (M - C_2).$$
If the camera performs a pure rotation around the center of projection, then $C_1 = C_2 = C$, and the projection equations become:
$$\rho_1 m_1 = K_1 R_1^T (M - C) \quad\text{and}\quad \rho_2 m_2 = K_2 R_2^T (M - C).$$
Solving the first equation for the scene point M and substituting this into the second equation, as in Section 2.4.1, gives
$$\rho_2 m_2 = \rho_1\, K_2 R_2^T R_1 K_1^{-1} m_1,$$
or, using $A = K_2 R_2^T R_1 K_1^{-1}$ as before, one gets
$$\rho_2 m_2 = \rho_1\, A\, m_1. \qquad (2.55)$$
This equation establishes a direct relationship between the pixel coordinates of corresponding image points. In fact, the equation states that the second image I2 is a projective transformation of the first image I1 with the invertible matrix A introduced in Section 2.4.1 as homography matrix. This should not come as a surprise, because looking back at Figure 2.19 and forgetting for a moment the scene point M, one sees that image I2 is a perspective projection of image I1 with C as the center of projection. Hence, searching for corresponding points between two images obtained by a camera that has only rotated around the center of projection is quite simple: One only needs to determine the invertible matrix A. If the internal parameters of the cameras are known and if the relative orientation $R = R_1 R_2^T$ between the two cameras is known as well, then the infinite homography A can be computed directly from its definition $A = K_2 R_2^T R_1 K_1^{-1}$. However, when the internal and external camera parameters are not known, then A can be computed from four pairs of corresponding points between the images. Indeed, each corresponding point pair $(m_1, m_2) \in I_1 \times I_2$ must satisfy the relation $A m_1 = \rho\, m_2$ for some non-zero scalar factor ρ, and thus brings a system of three linear equations in the nine unknown components of the matrix A and the unknown scalar ρ. As these equations are homogeneous, at least four point correspondences are needed to determine the matrix A up to a non-zero scalar factor. We will not pursue these computational issues any further here, since they are discussed in detail in Subsection 4.2.4 of Section 4 in Part 2 of the tutorial. Once the matrix A is known, formula (2.55) says that for every point m1 in the first image, $A m_1$ are homogeneous coordinates of the corresponding point m2 in the second image.
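For completeness, here is a minimal sketch of how such a homography can be estimated linearly from point correspondences. It is the standard direct linear transformation, shown only as an illustration; the detailed and more robust procedure the text refers to is in Subsection 4.2.4 of Part 2, and the function below is ours.

```python
import numpy as np

def homography_from_points(m1, m2):
    """Estimate the 3x3 homography A with  A m1 ~ m2  from correspondences.

    m1, m2 : (N, 3) arrays of corresponding points in homogeneous coordinates,
             with N >= 4.
    Returns A up to a non-zero scalar factor.
    """
    rows = []
    for p1, p2 in zip(m1, m2):
        # A m1 ~ m2  <=>  m2 x (A m1) = 0, which yields two independent linear
        # equations per correspondence in the nine entries of A.
        x, y, w = p2
        rows.append(np.concatenate((np.zeros(3), -w * p1,  y * p1)))
        rows.append(np.concatenate((w * p1, np.zeros(3), -x * p1)))
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 3)   # null vector of the system, reshaped to 3x3
```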
Observe that formula (2.55) actually expresses the epipolar relation between the images I1 and I2. Indeed, in its homogeneous form (2.16) the epipolar relation between corresponding image points is:
$$\rho_2 m_2 = \rho_1\, A\, m_1 + \rho_{e_2} e_2,$$
where $\rho_{e_2} e_2 = K_2 R_2^T (C_1 - C_2)$ is the epipole of the first camera in the second image. For a rotating camera, $C_1 = C_2$, which implies that $\rho_{e_2} e_2 = 0$, and formula (2.55) results. On the other hand, as the infinite homography A can be determined from point correspondences between the images, one could hope for an affine 3D reconstruction of the scene, as explained in Section 2.5.3. Unfortunately, as we already know, this is not possible, because in this case the affine reconstruction equations (2.25), viz. $\tilde{\rho}_1 m_1 = \tilde{M}$ and $\tilde{\rho}_2 m_2 = A \tilde{M} + e_2$, reduce to $\tilde{\rho}_1 m_1 = \tilde{M}$ and $\tilde{\rho}_2 m_2 = A \tilde{M}$. Taking $\tilde{M} = \tilde{\rho}_1 m_1$ from the first equation and substituting it into the second one in order to compute the unknown scalar $\tilde{\rho}_1$ now brings back formula (2.55), which does not allow $\tilde{\rho}_1$ to be uniquely determined. This proves algebraically that 3D reconstruction from point correspondences is not possible in case the images are obtained by a camera that rotates around the center of projection.
Although 3D reconstruction is not possible from images acquired with a camera that rotates around the center of projection, it is possible to recover the internal camera parameters from the images alone. Indeed, as explained in Section 2.6.4, the self-calibration equations (2.45) result from the fundamental relation (2.49), viz:
$$A^T\, (K_2 K_2^T)^{-1}\, A = (K_1 K_1^T)^{-1}, \qquad (2.56)$$
where A is the infinite homography introduced in Section 2.4.1. In case of a rotating camera the matrix A can be determined up to a non-zero scalar factor if at least four pairs of corresponding points are identified in the images. Let $\hat{A}$ be such an estimate for A. Then $A = \kappa \hat{A}$ for some unknown scalar κ. Substitution in formula (2.56) and solving for $K_2 K_2^T$ yields $K_2 K_2^T = \kappa^2\, \hat{A}\, (K_1 K_1^T)\, \hat{A}^T$. If the internal camera parameters have remained constant during camera rotation, then $K_1 = K_2 = K$ and the self-calibration equations reduce to $K K^T = \kappa^2\, \hat{A}\, (K K^T)\, \hat{A}^T$. Since $\hat{A}$ is only determined up to a non-zero scalar factor, one may assume without loss of generality that its determinant equals 1. Taking determinants of both sides of the self-calibration equations, it follows that $\kappa^2 = 1$, because K is an invertible matrix and generally has non-unit determinant. Consequently, the self-calibration equations become $K K^T = \hat{A}\, (K K^T)\, \hat{A}^T$ and they yield a system of six linear equations in the five unknown entries of the symmetric matrix $K K^T$. The calibration matrix K itself can be recovered from $K K^T$ by Cholesky factorization [3], as explained in Section 2.6.6. In summary, with a camera that rotates around the center of projection 3D reconstruction of the scene is not possible, but self-calibration is [5].
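The sketch below turns this reasoning into a small routine: given homography estimates between the images of a rotating camera with constant internal parameters, it normalizes each estimate to unit determinant, solves the linear system $K K^T = \hat{A}\,(K K^T)\,\hat{A}^T$ for the symmetric matrix $K K^T$, and factors out an upper-triangular K. It is only an illustrative Python/NumPy implementation under the stated assumptions (data clean enough for $K K^T$ to come out positive definite), not the procedure of Section 2.6.6 itself.

```python
import numpy as np

def calibrate_from_rotation_homographies(H_list):
    """Self-calibration of a rotating camera with constant internal parameters.

    H_list : iterable of 3x3 homographies mapping image 1 to image j, each an
             estimate of the infinite homography up to a non-zero scale factor.
    Returns the calibration matrix K (upper triangular, K[2,2] = 1).
    """
    # Basis of symmetric 3x3 matrices parameterizing X = K K^T (6 unknowns).
    basis = []
    for i in range(3):
        for j in range(i, 3):
            E = np.zeros((3, 3))
            E[i, j] = E[j, i] = 1.0
            basis.append(E)

    blocks = []
    for H in H_list:
        A = H / np.cbrt(np.linalg.det(H))      # det(A) = 1, so kappa^2 = 1
        # The constraint X - A X A^T = 0 is linear in the entries of X:
        blocks.append(np.column_stack([(A @ E @ A.T - E).ravel() for E in basis]))
    M = np.vstack(blocks)                      # (9 * number of views) x 6

    _, _, Vt = np.linalg.svd(M)                # null vector gives the entries of X
    x = Vt[-1]
    X = np.zeros((3, 3))
    k = 0
    for i in range(3):
        for j in range(i, 3):
            X[i, j] = X[j, i] = x[k]
            k += 1
    if X[2, 2] < 0:                            # fix the overall sign of X = K K^T
        X = -X

    # Recover an upper-triangular K with K K^T = X (Cholesky-style factorization).
    L = np.linalg.cholesky(np.linalg.inv(X))   # lower triangular, L L^T = X^{-1}
    K = np.linalg.inv(L).T                     # upper triangular, K K^T = X
    return K / K[2, 2]
```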
2.7.3 Internally Calibrated Cameras and the Essential
Matrix
In some applications, the internal parameters of the cameras may be
known, through a (self-)calibration procedure applied prior to the cur-
rent processing of new input images. It will be demonstrated in this
section that when the internal parameters of the cameras are known,
but no information about the (absolute or relative) position and orien-
tation of the cameras is available, a metric 3D reconstruction from two
images is feasible.
2.7.3.1 Known Camera Matrices and 3D Reconstruction
Consider again the projection equations (2.20) for two cameras observ-
ing a static scene, as used in Section 2.5:
$$\rho_1 m_1 = K_1 R_1^T (M - C_1) \quad\text{and}\quad \rho_2 m_2 = K_2 R_2^T (M - C_2). \qquad (2.57)$$
If the calibration matrices K1 and K2 are known, then the (unbiased) perspective projections of the scene point M in each image plane can be retrieved as $q_1 = K_1^{-1} m_1$ and $q_2 = K_2^{-1} m_2$, respectively. By multiplying the metric 3D reconstruction equations (2.24) derived in Section 2.5.2, viz.:
$$\bar{\rho}_1 m_1 = K_1 \bar{M} \quad\text{and}\quad \bar{\rho}_2 m_2 = K_2 R_2^T R_1 \bar{M} + e_2,$$
on the left with $K_1^{-1}$ and $K_2^{-1}$, respectively, they simplify to
$$\bar{\rho}_1 q_1 = \bar{M} \quad\text{and}\quad \bar{\rho}_2 q_2 = R_2^T R_1 \bar{M} + q_e, \qquad (2.58)$$
where $q_e = K_2^{-1} e_2$ is the (unbiased) perspective projection of the position C1 of the first camera onto the image plane of the second camera. It is interesting to observe that, since the epipole e2 is defined by formula (2.15) in Section 2.4.1 as $\rho_{e_2} e_2 = K_2 R_2^T (C_1 - C_2)$, it follows that $\rho_{e_2} q_e = R_2^T (C_1 - C_2)$. In other words, $\rho_{e_2} q_e$ gives the position C1 of the first camera with respect to the camera-centered reference frame of the second camera, i.e., the relative position. And, as $\rho_{e_2}$ is the unknown scale of the metric reconstruction $\bar{M}$, $q_e$ in fact represents the translation direction between the two cameras in the metric 3D reconstruction of the scene.
Thus, the rotation matrix $R = R_2^T R_1$ is the only unknown factor in the 3D reconstruction equations (2.58) that separates us from a metric 3D reconstruction of the scene. In fact, R represents the orientation of the first camera in the camera-centered reference frame of the second one (cf. Section 2.2.4), i.e., the relative orientation of the cameras. It will be demonstrated below that the rotation matrix R can be recovered from the so-called essential matrix of this pair of calibrated images. But first the notion of essential matrix has to be defined.
2.7.3.2 The Essential Matrix
Recall from Section 2.4.1 that the epipolar relation (2.18) between corresponding image points is found by solving the first projection equation in formula (2.57) for M and substituting the resulting expression in the second projection equation, thus yielding
$$\rho_2 m_2 = \rho_1\, K_2 R_2^T R_1 K_1^{-1} m_1 + K_2 R_2^T (C_1 - C_2)$$
(cf. formula (2.14) in Section 2.4.1). Multiplying both sides of this equation on the left by $K_2^{-1}$ yields
$$\rho_2\, K_2^{-1} m_2 = \rho_1\, R_2^T R_1 K_1^{-1} m_1 + R_2^T (C_1 - C_2).$$
Introducing $q_1 = K_1^{-1} m_1$, $q_2 = K_2^{-1} m_2$, and $R = R_2^T R_1$ as above, one gets
$$\rho_2 q_2 = \rho_1 R q_1 + R_2^T (C_1 - C_2).$$
Let us denote the last term in this equation by t, then $t = R_2^T (C_1 - C_2) = \rho_{e_2} q_e$ represents the relative position of the first camera with respect to the second one, as explained above. The epipolar relation then is
$$\rho_2 q_2 = \rho_1 R q_1 + t. \qquad (2.59)$$
From an algebraic point of view, this equation expresses that the 3-vectors $q_2$, $R q_1$, and t are linearly dependent, and hence the determinant $|q_2\;\; t\;\; R q_1| = 0$. Following the same reasoning as in Section 2.4.1,
$$|q_2\;\; t\;\; R q_1| = q_2^T\, (t \times R q_1) = q_2^T\, [t]_\times R\, q_1,$$
where $[t]_\times$ is the skew-symmetric 3×3 matrix that represents the cross product with the 3-vector t. The 3×3 matrix $E = [t]_\times R$ is known in the literature as the essential matrix of the (calibrated) image pair [11] and the epipolar relation between the calibrated images is expressed by the equation:
$$q_2^T\, E\, q_1 = 0. \qquad (2.60)$$
Given enough corresponding projections q1 and q2, the essential matrix E can be recovered up to a non-zero scalar factor from this relation in a linear manner. In fact, since $E = [t]_\times R$ with t a 3-vector and R a 3×3 rotation matrix, the essential matrix E has six degrees of freedom. Consequently, five corresponding projections q1 and q2 suffice to compute E up to a non-zero scalar factor [13, 12]. How this can be achieved in practice will be discussed in Subsection 4.6.3 of Section 4 in Part 2 of the tutorial.
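Since relation (2.60) is linear in the entries of E, a simple (non-minimal) estimate can already be obtained from eight or more correspondences by solving a homogeneous linear system and then enforcing the singular-value structure derived in the next subsections. The sketch below is this linear variant only, written by us for illustration; it is not the five-point algorithm of [13, 12].

```python
import numpy as np

def estimate_essential_linear(q1, q2):
    """Linear estimation of the essential matrix from calibrated correspondences.

    q1, q2 : (N, 3) arrays of corresponding directions q = K^{-1} m, with N >= 8.
    Returns an estimate of E up to a non-zero scalar factor.
    """
    # Each correspondence gives one homogeneous equation  q2^T E q1 = 0,
    # which is linear in the nine entries of E.
    A = np.array([np.outer(p2, p1).ravel() for p1, p2 in zip(q1, q2)])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Project onto the set of essential matrices: two equal singular values, third zero.
    U, s, Vt = np.linalg.svd(E)
    sigma = 0.5 * (s[0] + s[1])
    return U @ np.diag([sigma, sigma, 0.0]) @ Vt
```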
2.7.3.3 The Mathematical Relationship Between E and F
For the sake of completeness, it might be useful to highlight the math-
ematical relationship between the essential matrix E and the funda-
mental matrix F. Since the content of this subsection is not essential
for understanding the remainder of the text, readers who are primarily
interested in the practice of 3D reconstruction can skip this subsection.
Recall from formula (2.60) that $q_2^T E q_1 = 0$ describes the epipolar relation between corresponding projections q1 and q2 in the image planes of the two cameras. Substituting $q_1 = K_1^{-1} m_1$ and $q_2 = K_2^{-1} m_2$ in formula (2.60) thus yields the epipolar relation between the two images in terms of the image (i.e., pixel) coordinates m1 and m2, viz.:
$$m_2^T\, K_2^{-T} E K_1^{-1}\, m_1 = 0. \qquad (2.61)$$
Comparison with the common form of the epipolar relation $m_2^T F m_1 = 0$ shows that the fundamental matrix F is a scalar multiple of the 3×3 matrix $K_2^{-T} E K_1^{-1}$. More precisely, recall from Section 2.4.1 that $F = [e_2]_\times A$ and that
$$m_2^T F m_1 = m_2^T [e_2]_\times A\, m_1 = m_2^T (e_2 \times A m_1) = |m_2\;\; e_2\;\; A m_1|.$$
Substituting $m_1 = K_1 q_1$, $m_2 = K_2 q_2$, and $e_2 = K_2 q_e$ in the previous expression gives
$$m_2^T F m_1 = |K_2 q_2\;\; K_2 q_e\;\; A K_1 q_1| = |K_2|\; |q_2\;\; q_e\;\; K_2^{-1} A K_1 q_1|.$$
Using $A = K_2 R_2^T R_1 K_1^{-1}$, $R = R_2^T R_1$, and $t = \rho_{e_2} q_e$, the right-hand side simplifies to
$$m_2^T F m_1 = \frac{|K_2|}{\rho_{e_2}}\, |q_2\;\; t\;\; R q_1| = \frac{|K_2|}{\rho_{e_2}}\, q_2^T (t \times R q_1) = \frac{|K_2|}{\rho_{e_2}}\, q_2^T [t]_\times R\, q_1 = \frac{|K_2|}{\rho_{e_2}}\, q_2^T E\, q_1.$$
Substituting $q_1 = K_1^{-1} m_1$ and $q_2 = K_2^{-1} m_2$ again, one finally gets
$$m_2^T F m_1 = \frac{|K_2|}{\rho_{e_2}}\, m_2^T K_2^{-T} E K_1^{-1} m_1 = m_2^T \left( \frac{|K_2|}{\rho_{e_2}}\, K_2^{-T} E K_1^{-1} \right) m_1.$$
Because this equality must hold for all 3-vectors m1 and m2, it follows that
$$F = \frac{|K_2|}{\rho_{e_2}}\, K_2^{-T} E K_1^{-1}; \quad\text{or equivalently,}\quad E = \frac{\rho_{e_2}}{|K_2|}\, K_2^T F K_1.$$
This precise relationship exists between the theoretical definitions of the fundamental matrix F and the essential matrix E only. In practice, however, the fundamental matrix F and the essential matrix E can only be recovered up to a non-zero scalar factor from point correspondences between the images. Therefore, it suffices to compute such an estimate for one of them and to use the relevant formula:
$$\hat{F} = K_2^{-T} \hat{E} K_1^{-1} \quad\text{or}\quad \hat{E} = K_2^T \hat{F} K_1 \qquad (2.62)$$
as an estimate for the other one, as could be inferred directly from formula (2.61).
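In code, the practical conversion of formula (2.62) is a one-liner in each direction. The functions below are merely our illustration and assume the inputs are NumPy arrays; both results are only defined up to a non-zero scalar factor.

```python
import numpy as np

def essential_from_fundamental(F, K1, K2):
    """Formula (2.62): essential-matrix estimate from a fundamental-matrix estimate."""
    return K2.T @ F @ K1

def fundamental_from_essential(E, K1, K2):
    """Formula (2.62) in the other direction: F_hat = K2^{-T} E_hat K1^{-1}."""
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)
```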
2.7.3.4 Recovering the Relative Camera Setup from the
Essential Matrix
Suppose an estimate $\hat{E}$ for the essential matrix has been computed from point correspondences between the images. We will now demonstrate how the relative setup of the cameras (i.e., the rotation matrix R and (the direction of) the translation vector t defining the essential matrix E) can be recovered from $\hat{E}$. Due to the homogeneous nature of the epipolar relation (2.60), $\hat{E}$ is a non-zero scalar multiple of the essential matrix $E = [t]_\times R$ defined above. Therefore, $\hat{E} = \lambda\, [t]_\times R$ for some non-zero scalar factor λ. Observe that $\hat{E}$ is a 3×3 matrix of rank 2. Indeed, $[t]_\times$ is skew-symmetric and thus has rank 2 if t is a non-zero 3-vector, whereas R is a rotation matrix and hence it is invertible. Moreover,
$$\hat{E}^T t = \lambda\, ([t]_\times R)^T t = \lambda\, R^T ([t]_\times)^T t = -\lambda\, R^T [t]_\times t = -\lambda\, R^T (t \times t) = 0,$$
which implies that the 3-vector t belongs to the left nullspace of $\hat{E}$.
Let $\hat{E} = U \Sigma V^T$ be the singular value decomposition of the matrix $\hat{E}$ [3]. Then Σ is a diagonal matrix of rank 2 and the left nullspace of $\hat{E}$ is spanned by the 3-vector $u_3$ constituting the third column of the orthogonal matrix U. As t belongs to the left nullspace of $\hat{E}$, t must be a scalar multiple of $u_3$. This already yields t, up to a scale.
Write $t = \mu\, u_3$ for some non-zero scalar µ. Then $\hat{E} = \lambda\, [t]_\times R = \kappa\, [u_3]_\times R$ with $\kappa = \lambda \mu$ a non-zero scalar factor. Furthermore, recall from linear algebra that the singular values of $\hat{E}$ are the square roots of the eigenvalues of the symmetric matrix $\hat{E} \hat{E}^T$. Now
$$\hat{E} \hat{E}^T = (\kappa\, [u_3]_\times R)(\kappa\, [u_3]_\times R)^T = \kappa^2\, [u_3]_\times R R^T ([u_3]_\times)^T = -\kappa^2\, ([u_3]_\times)^2,$$
where the last equality follows from the fact that R is a rotation matrix (so that $R R^T = I_3$, the 3×3 identity matrix) and that $[u_3]_\times$ is a skew-symmetric matrix (i.e., $([u_3]_\times)^T = -[u_3]_\times$). Because $U = [u_1\; u_2\; u_3]$ is an orthogonal matrix, the first two columns $u_1$ and $u_2$ of U are orthogonal to the third column $u_3$ and, as they are all unit vectors, it follows that
$$([u_3]_\times)^2 u_1 = u_3 \times (u_3 \times u_1) = -u_1 \quad\text{and}\quad ([u_3]_\times)^2 u_2 = u_3 \times (u_3 \times u_2) = -u_2.$$
Consequently,
$$\hat{E} \hat{E}^T u_1 = -\kappa^2\, ([u_3]_\times)^2 u_1 = \kappa^2 u_1 \quad\text{and}\quad \hat{E} \hat{E}^T u_2 = -\kappa^2\, ([u_3]_\times)^2 u_2 = \kappa^2 u_2.$$
In particular, $u_1$ and $u_2$ are eigenvectors of $\hat{E} \hat{E}^T$ with eigenvalue $\kappa^2$. Furthermore, $u_3$ is an eigenvector of $\hat{E} \hat{E}^T$ with eigenvalue 0, because
$$\hat{E} \hat{E}^T u_3 = -\kappa^2\, ([u_3]_\times)^2 u_3 = -\kappa^2\, u_3 \times (u_3 \times u_3) = 0.$$
Together this proves that the diagonal matrix Σ in the singular value decomposition of the matrix $\hat{E}$ equals
$$\Sigma = \begin{pmatrix} |\kappa| & 0 & 0 \\ 0 & |\kappa| & 0 \\ 0 & 0 & 0 \end{pmatrix} = |\kappa| \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad (2.63)$$
where $|\kappa|$ denotes the absolute value of κ. If we denote the columns of the orthogonal matrix V by $v_1$, $v_2$, and $v_3$, respectively, then $V = [v_1\; v_2\; v_3]$ and the singular value decomposition of $\hat{E}$ is given by:
$$\hat{E} = U \Sigma V^T = |\kappa|\, u_1 v_1^T + |\kappa|\, u_2 v_2^T.$$
As U and V are orthogonal matrices, and because $u_3$ and $v_3$ do not actively participate in the singular value decomposition of the rank-2 matrix $\hat{E}$, we can infer them to be $u_3 = u_1 \times u_2$ and $v_3 = v_1 \times v_2$.
On the other hand, $\hat{E} = \kappa\, [u_3]_\times R$ and our aim is to compute the unknown rotation matrix R. To this end, we will re-express the skew-symmetric matrix $[u_3]_\times$ in terms of the orthogonal matrix $U = [u_1\; u_2\; u_3]$. Recall that:
$$[u_3]_\times u_1 = u_3 \times u_1 = u_2, \qquad [u_3]_\times u_2 = u_3 \times u_2 = -u_1 \qquad\text{and}\qquad [u_3]_\times u_3 = u_3 \times u_3 = 0;$$
or, in matrix form,
$$[u_3]_\times U = [u_3]_\times\, [u_1\; u_2\; u_3] = [u_2\;\; -u_1\;\; 0] = [u_1\; u_2\; u_3] \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} = U \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
Consequently, $[u_3]_\times U = U Z$, or, equivalently, $[u_3]_\times = U Z U^T$ with
$$Z = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},$$
since U is an orthogonal matrix. The matrix $\hat{E} = \kappa\, [u_3]_\times R$ can now be rewritten as $\hat{E} = \kappa\, U Z U^T R$. Combining this expression with the singular value decomposition $\hat{E} = U \Sigma V^T$ yields $\kappa\, U Z U^T R = U \Sigma V^T$.
By some algebraic manipulations this equality can be simplified to
$$\begin{aligned}
\kappa\, U Z U^T R = U \Sigma V^T
\;&\Longleftrightarrow\; \kappa\, Z U^T R = \Sigma V^T &&\text{(multiplying on the left with $U^T$)}\\
\;&\Longleftrightarrow\; \kappa\, Z U^T = \Sigma V^T R^T &&\text{(multiplying on the right with $R^T$)}\\
\;&\Longleftrightarrow\; \kappa\, U Z^T = R\, V\, \Sigma^T &&\text{(taking transposes of both sides)}\\
\;&\Longleftrightarrow\; \kappa\, [u_1\; u_2\; u_3] \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} = |\kappa|\, R\, [v_1\; v_2\; v_3] \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} &&\text{(expanding $U$, $Z$, $V$ and $\Sigma$)}\\
\;&\Longleftrightarrow\; \kappa\, [-u_2\;\; u_1\;\; 0] = |\kappa|\, R\, [v_1\;\; v_2\;\; 0] &&\text{(matrix multiplication)}\\
\;&\Longleftrightarrow\; R v_1 = -\epsilon\, u_2 \;\text{ and }\; R v_2 = \epsilon\, u_1 &&\text{(equality of matrices)}
\end{aligned}$$
where $\epsilon = \frac{\kappa}{|\kappa|}$ equals 1 if κ is positive and −1 if κ is negative. Because R is a rotation matrix and since $v_3 = v_1 \times v_2$ and $u_3 = u_1 \times u_2$,
$$R v_3 = R (v_1 \times v_2) = (R v_1) \times (R v_2) = (-\epsilon\, u_2) \times (\epsilon\, u_1) = \epsilon^2\, u_1 \times u_2 = u_3.$$
But then
$$R\, [v_1\; v_2\; v_3] = [-\epsilon\, u_2\;\; \epsilon\, u_1\;\; u_3] = [u_1\; u_2\; u_3] \begin{pmatrix} 0 & \epsilon & 0 \\ -\epsilon & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$
or equivalently,
$$R\, V = U \begin{pmatrix} 0 & \epsilon & 0 \\ -\epsilon & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
This yields the following formula for the rotation matrix R:
$$R = U \begin{pmatrix} 0 & \epsilon & 0 \\ -\epsilon & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} V^T.$$
Observe that U and V are the orthogonal matrices in the singular value decomposition $\hat{E} = U \Sigma V^T$ of the matrix $\hat{E}$ which is computed from point correspondences in the images, and thus are known. The scalar $\epsilon$, on the other hand, equals $\epsilon = \frac{\kappa}{|\kappa|}$ where κ is the unknown scalar factor in $\hat{E} = \kappa\, [u_3]_\times R$ and hence is not known. But, as $\epsilon$ can take only the values 1 and −1, the previous formula yields two possible solutions for the rotation matrix R, viz.:
$$\hat{R} = U \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} V^T \quad\text{or}\quad \hat{R}' = U \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} V^T. \qquad (2.64)$$
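The derivation above translates directly into a few lines of code: compute the SVD of $\hat{E}$, read off $u_3$ as the translation direction (up to sign), and form the two candidate rotations of formula (2.64). The sketch below is our illustration of this recipe; the sign fix on the singular vectors simply ensures that the candidates are proper rotations (determinant +1) rather than reflections.

```python
import numpy as np

def decompose_essential(E_hat):
    """Candidate relative camera setups from an essential-matrix estimate.

    Returns (u3, R_a, R_b): the translation direction up to sign and the two
    candidate rotations of formula (2.64). The four candidate setups are then
    (+u3, R_a), (+u3, R_b), (-u3, R_a) and (-u3, R_b), cf. (2.68)/(2.70).
    """
    U, _, Vt = np.linalg.svd(E_hat)
    # Flipping the third left/right singular vector does not change U S V^T
    # (the third singular value is zero) but guarantees det(U) = det(V) = +1,
    # so that the matrices below are rotations and not reflections.
    if np.linalg.det(U) < 0:
        U[:, 2] *= -1.0
    if np.linalg.det(Vt) < 0:
        Vt[2, :] *= -1.0
    u3 = U[:, 2]                       # spans the left nullspace of E_hat
    W = np.array([[0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])    # the epsilon = +1 matrix in (2.64)
    R_a = U @ W @ Vt                   # epsilon = +1
    R_b = U @ W.T @ Vt                 # epsilon = -1
    return u3, R_a, R_b
```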
2.7.3.5 Euclidean 3D Reconstruction for a Known
Inter-Camera Distance
With the conclusion that the unknown relative rotation matrix R must be one of the two matrices in (2.64), it is now proven that a metric 3D reconstruction of the scene can be computed from two images by the metric reconstruction equations (2.58) (still assuming that the calibration matrices of both cameras are known). If the distance between the two camera positions C1 and C2 is known too, then a Euclidean
3D reconstruction of the scene can be computed. This is quite evident
from the fact that this distance allows us to fix the scale of the metric
reconstruction. In the remainder of this section, we will give a more
formal proof, as it also allows us to dwell further on the ambiguities
that persist.
Consider again the projection equations (2.57) for two cameras
observing a static scene:
$$\rho_1 m_1 = K_1 R_1^T (M - C_1) \quad\text{and}\quad \rho_2 m_2 = K_2 R_2^T (M - C_2).$$
If the calibration matrices K1 and K2 are known, then the (unbiased) perspective projections $q_1 = K_1^{-1} m_1$ and $q_2 = K_2^{-1} m_2$ of the scene point M in the respective image planes can be retrieved. Multiplying both equations on the left with $K_1^{-1}$ and $K_2^{-1}$, respectively, yields
$$\rho_1 q_1 = R_1^T (M - C_1) \quad\text{and}\quad \rho_2 q_2 = R_2^T (M - C_2). \qquad (2.65)$$
The right-hand side of the first equation, viz. $R_1^T (M - C_1)$, gives the 3D coordinates of the scene point M with respect to the camera-centered reference frame of the first camera (cf. Section 2.2.4). And similarly, the right-hand side of the second equation, viz. $R_2^T (M - C_2)$, gives the 3D coordinates of the scene point M with respect to the camera-centered reference frame of the second camera. As argued in Section 2.5.1, it is not possible to recover absolute information about the cameras' external parameters in the real world from the previous equations alone. Therefore, the strategy proposed in Section 2.5.1 is to reconstruct the scene with respect to the camera-centered reference frame of the first camera, thus yielding the Euclidean 3D reconstruction $M' = R_1^T (M - C_1)$. Solving this expression for M, gives $M = C_1 + R_1 M'$ and substituting these expressions in formulas (2.65) yields
$$\rho_1 q_1 = M' \quad\text{and}\quad \rho_2 q_2 = R_2^T R_1 M' + R_2^T (C_1 - C_2).$$
Using $R = R_2^T R_1$ and $t = R_2^T (C_1 - C_2)$ as in the previous subsections, one gets the following Euclidean 3D reconstruction equations for calibrated images:
$$\rho_1 q_1 = M' \quad\text{and}\quad \rho_2 q_2 = R M' + t. \qquad (2.66)$$
It was demonstrated in the previous subsection that, if an estimate $\hat{E}$ of the essential matrix E of the calibrated image pair is available, then an estimate for the rotation matrix R and for the direction of the translation vector t can be derived from a singular value decomposition of $\hat{E}$. More precisely, if $\hat{E} = U \Sigma V^T$ is a singular value decomposition of $\hat{E}$, then t is a scalar multiple of the unit 3-vector $u_3$ constituting the third column of the orthogonal matrix U, and R is one of the two matrices $\hat{R}$ or $\hat{R}'$ defined by the expressions in (2.64). Now, since $u_3$ is a unit vector, there are two possibilities for t, namely:
$$t = \|t\|\, u_3 \quad\text{or}\quad t = -\|t\|\, u_3. \qquad (2.67)$$
Together, expressions (2.64) and (2.67) yield four possible, but different, candidates for the relative setup of the two cameras, viz.:
$$(\hat{t}, \hat{R}), \quad (\hat{t}, \hat{R}'), \quad (-\hat{t}, \hat{R}), \quad\text{and}\quad (-\hat{t}, \hat{R}'). \qquad (2.68)$$
Observe that $\|t\|$ actually is the Euclidean distance between the two camera positions C1 and C2. Indeed, $t = R_2^T (C_1 - C_2)$ and thus $\|t\|^2 = t^T t = \|C_1 - C_2\|^2$. Hence, if the distance between the camera positions C1 and C2 is known, then each of the four possibilities in (2.68) together with the reconstruction equations (2.66) yields a Euclidean 3D reconstruction of the scene and one of these 3D reconstructions corresponds to a description of the scene in coordinates with respect to the camera-centered reference frame of the first camera. In particular, in the Euclidean reconstruction M′ of the scene computed from formula (2.66) the first camera is positioned at the origin and its orientation is given by the 3×3 identity matrix $I_3$. The position of the second camera in the 3D reconstruction M′, on the other hand, is given by
$$R_1^T (C_2 - C_1) = R_1^T R_2\, R_2^T (C_2 - C_1) = -\bigl(R_2^T R_1\bigr)^T R_2^T (C_1 - C_2) = -R^T t$$
and the orientation of the second camera in the 3D reconstruction M′ is given by $R_1^T R_2 = R^T$. The four possibilities for a setup of the cameras which are compatible with an estimated essential matrix $\hat{E}$ and which are listed in (2.68) correspond to four mirror-symmetric configurations, as depicted in Figure 2.20.
Fig. 2.20 From the singular value decomposition of (an estimate $\hat{E}$ of) the essential matrix E, four possibilities for the relative translation and rotation between the two cameras can be computed. In the figure, the first camera is depicted in grey and the other four cameras correspond to these four different possibilities. In particular, with the notations used in the text (cf. formula (2.68)), the four possible solutions for the camera setup, viz. $(\hat{t}, \hat{R})$, $(\hat{t}, \hat{R}')$, $(-\hat{t}, \hat{R})$ and $(-\hat{t}, \hat{R}')$, correspond to the blue, yellow, red and green camera, respectively.
Since the two possibilities in formula (2.67) differ only in sign, changing $\hat{t}$ into $-\hat{t}$ in the relative setup of the cameras results in a reversal of the baseline of the camera pair. And, it follows from the expressions in (2.64) that
$$\hat{R}' = \hat{R}\; V \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix} V^T,$$
where the matrix product
$$V \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix} V^T$$
in the right-hand side of the equation represents a rotation through 180° about the line joining the centers of projection. Hence, changing $\hat{R}$ into $\hat{R}'$ in the relative setup of the cameras results in rotating the second camera through 180° about the baseline of the camera pair. Moreover, given a pair m1 and m2 of corresponding points between the images, the reconstructed 3D point M′ will be in front of both cameras in only one of the four possibilities for the camera setup computed from $\hat{E}$. Hence, to identify the correct camera setup among the candidates (2.68) it suffices to test for a single reconstructed point M′ in which of the four possibilities it is in front of both the cameras (i.e., for which of the four candidates the projective depths ρ1 and ρ2 of M′ in the Euclidean reconstruction equations (2.66) are both positive) [6, 10].
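This test is easy to carry out in practice. Assuming the candidates produced by the decomposition sketched earlier (the helper functions and names below are ours, for illustration only), one triangulates a single correspondence with each candidate setup via equations (2.66)/(2.69) and keeps the candidate for which both projective depths are positive:

```python
import numpy as np
from itertools import product

def projective_depths(q1, q2, R, t):
    """Solve  rho2 * q2 = rho1 * R q1 + t  for the two depths (least squares)."""
    A = np.column_stack((R @ q1, -q2))   # 3 x 2 system in (rho1, rho2)
    rho, *_ = np.linalg.lstsq(A, -t, rcond=None)
    return rho[0], rho[1]

def select_camera_setup(q1, q2, u3, R_a, R_b):
    """Pick the candidate setup of (2.68)/(2.70) that places the reconstructed
    point in front of both cameras, i.e. with both depths positive."""
    for t, R in product((u3, -u3), (R_a, R_b)):
        rho1, rho2 = projective_depths(q1, q2, R, t)
        if rho1 > 0 and rho2 > 0:
            return R, t
    raise ValueError("no candidate places the point in front of both cameras")
```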
2.7.3.6 Metric 3D Reconstruction from Two Calibrated
Images
Finally, if the distance between the camera positions C1 and C2 is not known, then the transformation $\bar{M} = \frac{1}{\|t\|}\, M' = \frac{1}{\|t\|}\, R_1^T (M - C_1)$ still yields a metric 3D reconstruction of the scene at scale $\|t\| = \|C_1 - C_2\|$. Reconstruction equations for this metric reconstruction follow immediately from equations (2.66) by dividing both equations by $\|t\|$, viz.:
$$\bar{\rho}_1 q_1 = \bar{M} \quad\text{and}\quad \bar{\rho}_2 q_2 = R \bar{M} + u, \qquad (2.69)$$
where $u = \frac{t}{\|t\|}$. It follows from formula (2.67) that the 3-vector $u_3$ constituting the third column of the matrix U in a singular value decomposition of an estimate $\hat{E}$ of the essential matrix E yields two candidates for the unit vector u in the metric 3D reconstruction equations (2.69), viz. $u_3$ and $-u_3$. Together with the two possibilities for the rotation matrix R given by the expressions in (2.64), one gets the following four candidates for the relative setup of the two cameras in the metric reconstruction:
$$(u_3, \hat{R}), \quad (u_3, \hat{R}'), \quad (-u_3, \hat{R}), \quad\text{and}\quad (-u_3, \hat{R}'). \qquad (2.70)$$
As before, the correct camera setup can easily be identified among the candidates (2.70) by testing for a single reconstructed point $\bar{M}$ in which of the four possibilities it is in front of both the cameras (i.e., for which of the four candidates the projective depths $\bar{\rho}_1$ and $\bar{\rho}_2$ of $\bar{M}$ in the metric reconstruction equations (2.69) are both positive).
It is important to note the difference between these metric 3D recon-
structions and the metric 3D reconstruction described in Section 2.5.2:
Apart from different setups of the cameras, all four possible metric
reconstructions described here differ from the scene by a fixed scale,
which is the (unknown) distance between the two camera positions;
whereas for the metric 3D reconstruction in Section 2.5.2, nothing is
known or guaranteed about the actual scale of the reconstruction with
respect to the original scene.
References
[1] J. Y. Bouguet, “Camera calibration toolbox for matlab,” http://www.vision.caltech.edu/bouguetj/calib_doc/.
[2] O. Faugeras, “What can be seen in three dimensions with an uncalibrated
stereo rig,” in Computer Vision — (ECCV’92), pp. 563–578, vol. LNCS 588,
Berlin/Heidelberg/New York/Tokyo: Springer-Verlag, 1992.
[3] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD, USA: The Johns Hopkins University Press, 1996.
[4] R. Hartley, “Estimation of relative camera positions for uncalibrated cam-
eras,” in Computer Vision — (ECCV’92), pp. 579–587, vol. LNCS 588,
Berlin/Heidelberg/New York/Tokyo: Springer-Verlag, 1992.
[5] R. Hartley, “Self-calibration from multiple views with a rotating cam-
era,” in Computer Vision — (ECCV’94), pp. 471–478, vol. LNCS 800/801,
Berlin/Heidelberg/New York/Tokyo: Springer-Verlag, 1994.
[6] R. Hartley, “Cheirality,” International Journal of Computer Vision, vol. 26,
no. 1, pp. 41–61, 1998.
[7] R. Hartley and S. B. Kang, “Parameter-free radial distortion correction
with center of distortion estimation,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, vol. 29, no. 8, pp. 1309–1321, doi:10.1109/
TPAMI.2007.1147, June 2007.
[8] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision.
Cambridge University Press, ISBN: 0521540518, 2004.
[9] F. Kahl, B. Triggs, and K. Åström, “Critical motions for auto-calibration when
some intrinsic parameters can vary,” Journal of Mathematical Imaging and
Vision, vol. 13, no. 2, pp. 131–146, October 2000.
[10] H. Longuet-Higgins, “A computer algorithm for reconstructing a scene from
two projections,” Nature, vol. 293, no. 10, pp. 133–135, 1981.
[11] H. C. Longuet-Higgins, “A computer algorithm for reconstructing a scene from
two projections,” Nature, vol. 293, pp. 133–135, 1981.
[12] D. Nistér, “An efficient solution to the five-point relative pose problem,” IEEE
Transactions On Pattern Analysis and Machine Intelligence, vol. 26, no. 6,
pp. 756–777, June 2004.
[13] J. Philip, “A non-iterative algorithm for determining all essential matrices cor-
responding to five point Pairs,” The Photogrammetric Record, vol. 15, no. 88,
pp. 589–599, 1996.
[14] P. Sturm, “Critical motion sequences and conjugacy of ambiguous euclidean
reconstructions,” in Proceedings of the 10th Scandinavian Conference on Image
Analysis, Lappeenranta, Finland, vol. I, (M. Frydrych, J. Parkkinen, and
A. Visa, eds.), pp. 439–446, June 1997.
[15] P. Sturm, “Critical motion sequences for the self-calibration of cameras and
stereo systems with variable focal length,” in British Machine Vision Confer-
ence, Nottingham, England, pp. 63–72, September 1999.
[16] B. Triggs, “Autocalibration and the absolute quadric,” in CVPR ’97: Pro-
ceedings of the 1997 Conference on Computer Vision and Pattern Recognition
(CVPR ’97), pp. 609–614, Washington, DC, USA: IEEE Computer Society,
1997.
[17] R. Y. Tsai, “A versatile camera calibration technique for high-accuracy 3D
machine vision metrology using off-the-shelf TV cameras and lenses,” Radiom-
etry, pp. 221–244, 1992.
[18] Z. Zhang, “Flexible camera calibration by viewing a plane from unknown ori-
entations,” in ICCV, pp. 666–673, 1999.