Foundations and Trends® in Computer Graphics and Vision
Vol. 4, No. 4 (2008) 287–404
© 2010 T. Moons, L. Van Gool and M. Vergauwen
DOI: 10.1561/0600000007
3D Reconstruction from Multiple Images
Part 1: Principles
By Theo Moons, Luc Van Gool, and
Maarten Vergauwen
Contents
1 Introduction to 3D Acquisition
1.1 A Taxonomy of Methods
1.2 Passive Triangulation
1.3 Active Triangulation
1.4 Other Methods
1.5 Challenges
1.6 Conclusions
2 Principles of Passive 3D Reconstruction
2.1 Introduction
2.2 Image Formation and Camera Model
2.3 The 3D Reconstruction Problem
2.4 The Epipolar Relation Between Two Images of a Static Scene
2.5 Two Image-Based 3D Reconstruction Up-Close
2.6 From Projective to Metric Using More Than Two Images
2.7 Some Important Special Cases
Bibliography
References
3D Reconstruction from Multiple Images
Part 1: Principles
Theo Moons¹, Luc Van Gool²,³, and Maarten Vergauwen⁴
¹ Hogeschool — Universiteit Brussel, Stormstraat 2, Brussel, B-1000, Belgium, Theo.Moons@hubrussel.be
² Katholieke Universiteit Leuven, ESAT — PSI, Kasteelpark Arenberg 10, Leuven, B-3001, Belgium, Luc.VanGool@esat.kuleuven.be
³ ETH Zurich, BIWI, Sternwartstrasse 7, Zurich, CH-8092, Switzerland, vangool@vision.ee.ethz.ch
⁴ GeoAutomation NV, Karel Van Lotharingenstraat 2, Leuven, B-3000, Belgium, maarten.vergauwen@geoautomation.com
Abstract
This issue discusses methods to extract three-dimensional (3D) models
from plain images. In particular, the 3D information is obtained from
images for which the camera parameters are unknown. The principles
underlying such uncalibrated structure-from-motion methods are out-
lined. First, a short review of 3D acquisition technologies puts such
methods in a wider context and highlights their important advan-
tages. Then, the actual theory behind this line of research is given. The
authors have tried to keep the text maximally self-contained, thereby also avoiding reliance on extensive knowledge of the projective concepts that usually appear in texts about self-calibrating 3D methods.
Rather, mathematical explanations that are more amenable to intu-
ition are given. The explanation of the theory includes the stratification
of reconstructions obtained from image pairs as well as metric recon-
struction on the basis of more than two images combined with some
additional knowledge about the cameras used. Readers who want to
obtain more practical information about how to implement such uncal-
ibrated structure-from-motion pipelines may be interested in two more
Foundations and Trends issues written by the same authors. Together
with this issue they can be read as a single tutorial on the subject.
Preface
Welcome to this Foundations and Trends tutorial on three-
dimensional (3D) reconstruction from multiple images. The focus is on
the creation of 3D models from nothing but a set of images, taken from
unknown camera positions and with unknown camera settings. In this
issue, the underlying theory for such “self-calibrating” 3D reconstruc-
tion methods is discussed. Of course, the text cannot give a complete
overview of all aspects that are relevant. That would mean dragging
in lengthy discussions on feature extraction, feature matching, track-
ing, texture blending, dense correspondence search, etc. Nonetheless,
we tried to keep at least the geometric aspects of the self-calibration
reasonably self-contained and this is where the focus lies.
The issue consists of two main parts, organized in separate sections.
Section 1 places the subject of self-calibrating 3D reconstruction from
images in the wider context of 3D acquisition techniques. This sec-
tion thus also gives a short overview of alternative 3D reconstruction
techniques, as the uncalibrated structure-from-motion approach is not
necessarily the most appropriate one for all applications. This helps to
bring out the pros and cons of this particular approach.
Section 2 starts the actual discussion of the topic. With images as
our key input for 3D reconstruction, this section first discusses how we
can mathematically model the process of image formation by a camera,
and which parameters are involved. Equipped with that camera model,
it then discusses the process of self-calibration for multiple cameras
from a theoretical perspective. It deals with the core issues of this tuto-
rial: given images and incomplete knowledge about the cameras, what
can we still retrieve in terms of 3D scene structure and how can we
make up for the missing information. This section also describes cases
in between fully calibrated and uncalibrated reconstruction. Breaking
a bit with tradition, we have tried to describe the whole self-calibration
process in intuitive, Euclidean terms. We have avoided the usual expla-
nation via projective concepts, as we believe that entities like the dual
of the projection of the absolute quadric are not very amenable to
intuition.
Readers who are interested in implementation issues and a prac-
tical example of a self-calibrating 3D reconstruction pipeline may be
interested in two complementary, upcoming issues by the same authors,
which together with this issue can be read as a single tutorial.
1 Introduction to 3D Acquisition
This section discusses different methods for capturing or ‘acquiring’
the three-dimensional (3D) shape of surfaces and, in some cases, also
the distance or ‘range’ of the object to the 3D acquisition device. The
section aims at positioning the methods discussed in the sequel of the
tutorial within this more global context. This will make clear that alter-
native methods may actually be better suited for some applications
that need 3D. This said, the discussion will also show that the kind of
approach described here is one of the more flexible and powerful ones.
1.1 A Taxonomy of Methods
A 3D acquisition taxonomy is given in Figure 1.1. A first distinction is
between active and passive methods. With active techniques the light
sources are specially controlled, as part of the strategy to arrive at the
3D information. Active lighting incorporates some form of temporal or
spatial modulation of the illumination. With passive techniques, on the other hand, the light is either not controlled at all or controlled only with respect to image quality. Typically, passive techniques work with whatever reasonable ambient light is available. From a computational point of view, active methods
Fig. 1.1 Taxonomy of methods for the extraction of information on 3D shape.
tend to be less demanding, as the special illumination is used to simplify
some of the steps in the 3D capturing process. Their applicability is
restricted to environments where the special illumination techniques
can be applied.
A second distinction concerns the number of vantage points from which the scene is observed and/or illuminated. With single-vantage
methods the system works from a single vantage point. In case there
are multiple viewing or illumination components, these are positioned
very close to each other, and ideally they would coincide. The latter
can sometimes be realized virtually, through optical means like semi-
transparent mirrors. With multi-vantage systems, several viewpoints
and/or controlled illumination source positions are involved. For multi-
vantage systems to work well, the different components often have to
be positioned far enough from each other. One says that the ‘baseline’
between the components has to be wide enough. Single-vantage meth-
ods have as advantages that they can be made compact and that they
do not suffer from the occlusion problems that occur when parts of the
scene are not visible from all vantage points in multi-vantage systems.
The methods mentioned in the taxonomy will now be discussed in
a bit more detail. In the remaining sections, we then continue with
the more elaborate discussion of passive, multi-vantage structure-from-
motion (SfM) techniques, the actual subject of this tutorial. As this
overview of 3D acquisition methods is intended to be neither in-depth nor exhaustive, but merely to provide some context for our subsequent account of image-based 3D reconstruction from uncalibrated images, we do not
include references in this part.
1.2 Passive Triangulation
Several multi-vantage approaches use the principle of triangulation
for the extraction of depth information. This also is the key concept
exploited by the self-calibrating structure-from-motion (SfM) methods
described in this tutorial.
1.2.1 (Passive) Stereo
Suppose we have two images, taken at the same time and from differ-
ent viewpoints. Such a setting is referred to as stereo. The situation is
illustrated in Figure 1.2. The principle behind stereo-based 3D recon-
struction is simple: given the two projections of the same point in the
world onto the two images, its 3D position is found as the intersection
of the two projection rays. Repeating such a process for several points
Fig. 1.2 The principle behind stereo-based 3D reconstruction is very simple: given two
images of a point, the point’s position in space is found as the intersection of the two
projection rays. This procedure is referred to as ‘triangulation’.
yields the 3D shape and configuration of the objects in the scene. Note
that this construction — referred to as triangulation — requires the
equations of the rays and, hence, complete knowledge of the cameras:
their (relative) positions and orientations, but also their settings like
the focal length. These camera parameters will be discussed in Sec-
tion 2. The process to determine these parameters is called (camera)
calibration.
Moreover, in order to perform this triangulation process, one needs
ways of solving the correspondence problem, i.e., finding the point in
the second image that corresponds to a specific point in the first image,
or vice versa. Correspondence search actually is the hardest part of
stereo, and one would typically have to solve it for many points. Often
the correspondence problem is solved in two stages. First, correspon-
dences are sought for those points for which this is easiest. Then, corre-
spondences are sought for the remaining points. This will be explained
in more detail in subsequent sections.
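To make the triangulation step concrete, here is a minimal sketch (in Python/NumPy, with names of our own choosing) that recovers a 3D point from two projection rays. It assumes the camera centers and ray directions are already known, i.e., exactly the calibration knowledge discussed above; with noisy data the two rays are generally skew, so the midpoint of their shortest connecting segment is returned.

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Return the 3D point closest to two projection rays.

    c1, c2 : 3-vectors, the camera centers (ray origins).
    d1, d2 : 3-vectors, the ray directions (need not be unit length).
    The rays are x = c1 + s*d1 and x = c2 + t*d2; with noise they do not
    intersect exactly, so the midpoint of the shortest connecting segment
    is taken as the triangulated point.
    """
    c1, d1 = np.asarray(c1, float), np.asarray(d1, float)
    c2, d2 = np.asarray(c2, float), np.asarray(d2, float)
    # Solve for s, t minimizing ||(c1 + s*d1) - (c2 + t*d2)||^2.
    A = np.stack([d1, -d2], axis=1)           # 3x2 system matrix
    b = c2 - c1
    (s, t), *_ = np.linalg.lstsq(A, b, rcond=None)
    return 0.5 * ((c1 + s * d1) + (c2 + t * d2))

# Example: two cameras 1 m apart, both looking at the point (0, 0, 5).
X = triangulate_midpoint([0, 0, 0], [0, 0, 5],
                         [1, 0, 0], [-1, 0, 5])
print(X)   # approximately [0, 0, 5]
```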
1.2.2 Structure-from-Motion
Passive stereo uses two cameras, usually synchronized. If the scene is
static, the two images could also be taken by placing the same cam-
era at the two positions, and taking the images in sequence. Clearly,
once such a strategy is considered, one may just as well take more than
two images, while moving the camera. Such strategies are referred to
as structure-from-motion or SfM for short. If images are taken over
short time intervals, it will be easier to find correspondences, e.g.,
by tracking feature points over time. Moreover, having more camera
views will yield object models that are more complete. Last but not
least, if multiple views are available, the camera(s) need no longer
be calibrated beforehand, and a self-calibration procedure may be
employed instead. Self-calibration means that the internal and exter-
nal camera parameters (cf. Section 2.2) are extracted from images of
the unmodified scene itself, and not from images of dedicated calibra-
tion patterns. These properties render SfM a very attractive 3D acqui-
sition strategy. A more detailed discussion is given in the following
sections.
1.3 Active Triangulation
Finding corresponding points can be facilitated by replacing one of the
cameras in a stereo setup by a projection device. Hence, we combine
one illumination source with one camera. For instance, one can project
a spot onto the object surface with a laser. The spot will be easily
detectable in the image taken by the camera. If we know the position
and orientation of both the laser ray and the camera projection ray,
then the 3D surface point is again found as their intersection. The
principle is illustrated in Figure 1.3 and is just another example of the
triangulation principle.
The problem is that knowledge about the 3D coordinates of one
point is hardly sufficient in most applications. Hence, in the case of
the laser, it should be directed at different points on the surface and
each time an image has to be taken. In this way, the 3D coordinates
of these points are extracted, one point at a time. Such a ‘scanning’
Fig. 1.3 The triangulation principle, used already with stereo, can also be used in an active configuration. The laser L projects a ray of light onto the object O. The intersection point P with the object is viewed by a camera and forms a spot on its image plane I. This information suffices for the computation of the three-dimensional coordinates of P, assuming that the laser-camera configuration is known.
process requires precise mechanical apparatus (e.g., by steering rotat-
ing mirrors that reflect the laser light into controlled directions). If
the equations of the laser rays are not known precisely, the resulting
3D coordinates will be imprecise as well. One would also not want the
system to take a long time for scanning. Hence, one ends up with the
conflicting requirements of guiding the laser spot precisely and fast.
These challenging requirements have an adverse effect on the price.
Moreover, the times needed to take one image per projected laser spot
add up to seconds or even minutes of overall acquisition time. A way
out is using special, super-fast imagers, but again at an additional cost.
In order to remedy this problem, substantial research has gone into
replacing the laser spot by more complicated patterns. For instance,
the laser ray can without much difficulty be extended to a plane, e.g.,
by putting a cylindrical lens in front of the laser. Rather than forming
a single laser spot on the surface, the intersection of the plane with the
surface will form a curve. The configuration is depicted in Figure 1.4.
The 3D coordinates of each of the points along the intersection curve
Fig. 1.4 If the active triangulation configuration is altered by turning the laser spot into
a line (e.g., by the use of a cylindrical lens), then scanning can be restricted to a one-
directional motion, transversal to the line.
can be determined again through triangulation, namely as the intersec-
tion of the plane with the viewing ray for that point. This still yields
a unique point in space. From a single image, many 3D points can be
extracted in this way. Moreover, the two-dimensional scanning motion
as required with the laser spot can be replaced by a much simpler
one-dimensional sweep over the surface with the laser plane.
It now stands to reason to try and eliminate any scanning alto-
gether. Is it not possible to directly go for a dense distribution of points
all over the surface? Unfortunately, extensions to the two-dimensional
projection patterns that are required are less straightforward. For
instance, when projecting multiple parallel lines of light simultaneously,
a camera viewing ray will no longer have a single intersection with such
a pencil of illumination planes. We would have to include some kind
of code into the pattern to make a distinction between the different
lines in the pattern and the corresponding projection planes. Note that
counting lines has its limitations in the presence of depth discontinu-
ities and image noise. There are different ways of including a code. An
obvious one is to give the lines different colors, but interference by the
surface colors may make it difficult to identify a large number of lines
in this way. Alternatively, one can project several stripe patterns in
sequence, giving up on using a single projection but still only using a
few. Figure 1.5 gives a (non-optimal) example of binary patterns. The
sequence of being bright or dark forms a unique binary code for each
column in the projector. Although one could project different shades
of gray, using binary (i.e., all-or-nothing black or white) type of codes
Fig. 1.5 A series of masks that can be projected for active stereo applications. Subsequent
masks contain ever finer stripes. Each of the masks is projected and for a point in the
scene the sequence of black/white values is recorded. The bits obtained that way characterize the horizontal position of the points, i.e., the plane of intersection (see text). The required resolution (related to the width of the thinnest stripes) determines how many such masks have to be used.
is beneficial for robustness. Nonetheless, so-called phase shift methods
successfully use a set of patterns with sinusoidally varying intensities
in one direction and constant intensity in the perpendicular direction
(i.e., a more gradual stripe pattern than in the previous example).
Each of the three sinusoidal patterns has the same amplitude but is
phase shifted by 120° with respect to the others. Intensity ratios in the
images taken under each of the three patterns yield a unique position
modulo the periodicity of the patterns. The sine patterns sum up to a
constant intensity, so adding the three images yields the scene texture.
The three subsequent projections yield dense range values plus texture.
An example result is shown in Figure 1.6. These 3D measurements have
been obtained with a system that works in real time (30 Hz depth +
texture).
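For concreteness, the following sketch decodes such a three-step phase-shift sequence per pixel. It is a generic textbook formulation, not the particular real-time system mentioned above; the function and variable names are ours.

```python
import numpy as np

def decode_three_step_phase(i1, i2, i3):
    """Per-pixel phase and texture from a three-step phase-shift sequence.

    i1, i2, i3 : images (arrays of equal shape) taken under sinusoidal
    patterns phase shifted by -120, 0 and +120 degrees, respectively.
    Returns (phase, texture): the phase in (-pi, pi], which encodes the
    position modulo the pattern period, and the average image, which is
    the scene texture because the three sine patterns sum to a constant.
    """
    i1, i2, i3 = (np.asarray(i, dtype=float) for i in (i1, i2, i3))
    phase = np.arctan2(np.sqrt(3.0) * (i1 - i3), 2.0 * i2 - i1 - i3)
    texture = (i1 + i2 + i3) / 3.0
    return phase, texture
```

Converting the recovered phase into depth still requires triangulation with the corresponding projector plane, as well as unwrapping the phase across pattern periods.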
One can also design more intricate patterns that contain local spa-
tial codes to identify parts of the projection pattern. An example
is shown in Figure 1.7. The figure shows a face on which the sin-
gle, checkerboard kind of pattern on the left is projected. The pat-
tern is such that each column has its own distinctive signature. It
consists of combinations of little white or black squares at the ver-
tices of the checkerboard squares. 3D reconstructions obtained with
this technique are shown in Figure 1.8. The use of this pattern only
requires the acquisition of a single image. Hence, continuous projection
Fig. 1.6 3D results obtained with a phase-shift system. Left: 3D reconstruction without
texture. Right: The same 3D reconstruction with texture, obtained by summing the three
images acquired with the phase-shifted sine projections.
Fig. 1.7 Example of one-shot active range technique. Left: The projection pattern allowing disambiguation of its different vertical columns. Right: The pattern is projected on a face.
Fig. 1.8 Two views of the 3D description obtained with the active method of Figure 1.7.
in combination with video input yields a 4D acquisition device that
can capture 3D shape (but not texture) and its changes over time. All
these approaches with specially shaped projected patterns are com-
monly referred to as structured light techniques.
1.4 Other Methods
With the exception of time-of-flight techniques, all other methods in the
taxonomy of Figure 1.1 are of less practical importance (yet). Hence, only time-of-flight is discussed at somewhat greater length. For the other approaches, only their general principles are outlined.
1.4.1 Time-of-Flight
The basic principle of time-of-flight sensors is to measure how long it takes before an emitted, time-modulated signal — usually light from a laser — returns to the sensor. This travel time is proportional to the distance to the object. This is an active, single-vantage approach. Depending
on the type of waves used, one calls such devices radar (electromag-
netic waves of low frequency), sonar (acoustic waves), or optical radar
(optical electromagnetic waves, including near-infrared).
A first category uses pulsed waves and measures the delay between
the transmitted and the received pulse. These are the most often used
type. A second category is used for smaller distances and measures
phase shifts between outgoing and returning sinusoidal waves. The low
level of the returning signal and the high bandwidth required for detec-
tion put pressure on the signal to noise ratios that can be achieved.
Measurement problems and health hazards with lasers can be allevi-
ated by the use of ultrasound. The beam then has a much larger opening angle, however, and resolution decreases considerably.
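The relation between the measured delay (or phase) and the range underlying both categories can be written down directly; the small sketch below is a generic illustration, not a description of any particular sensor.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def range_from_pulse_delay(delay_s):
    """Pulsed time-of-flight: the signal travels to the object and back,
    so the distance is half the round-trip time times the wave speed."""
    return 0.5 * SPEED_OF_LIGHT * delay_s

def range_from_phase_shift(phase_rad, modulation_hz):
    """Continuous-wave time-of-flight: a phase shift of 2*pi corresponds
    to one modulation wavelength of round-trip travel, so the range is
    unambiguous only modulo c / (2 * modulation_hz)."""
    return SPEED_OF_LIGHT * phase_rad / (4.0 * math.pi * modulation_hz)

# A 10 ns round-trip delay corresponds to a range of about 1.5 m.
print(range_from_pulse_delay(10e-9))
```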
Mainly optical signal-based systems (typically working in the near-
infrared) represent serious competition for the methods mentioned
before. Such systems are often referred to as LIDAR (LIght Detec-
tion And Ranging) or LADAR (LAser Detection And Ranging, a term
more often used by the military, where wavelengths tend to be longer,
like 1,550 nm, in order to be invisible to night-vision goggles). As these sys-
tems capture 3D data point-by-point, they need to scan. Typically a
horizontal motion of the scanning head is combined with a faster ver-
tical flip of an internal mirror. Scanning can be a rather slow process,
even if at the time of writing there were already LIDAR systems on the
market that can measure 50,000 points per second. On the other hand,
LIDAR gives excellent precision at larger distances in comparison to
passive techniques, which start to suffer from limitations in image res-
olution. Typically, errors at tens of meters will be within a range of
a few centimeters. Triangulation-based techniques require quite some
baseline to achieve such small margins. A disadvantage is that surface
texture is not captured and that errors will be substantially larger for
dark surfaces, which reflect little of the incoming signal. Missing texture
can be resolved by adding a camera, as close as possible to the LIDAR
scanning head. But of course, even then the texture is not taken from
exactly the same vantage point. The output is typically delivered as a
massive, unordered point cloud, which may cause problems for further
processing. Moreover, LIDAR systems tend to be expensive.
More recently, 3D cameras have entered the market that use the same kind of time-of-flight principle, but acquire an entire 3D image at once. These cameras have been designed to yield real-time 3D measurements of smaller scenes, typically up to a couple of meters. So far, resolutions are still limited (in the order of 150 × 150 range values) and depth resolutions are only moderate (a couple of millimeters under ideal circumstances, but worse otherwise), yet this technology is making advances fast. It is expected that the price of such cameras will drop sharply soon, as some games console manufacturers plan to offer such cameras as input devices.
1.4.2 Shape-from-Shading and Photometric Stereo
We now discuss the remaining, active techniques in the taxonomy of
Figure 1.1.
‘Shape-from-shading’ techniques typically handle smooth, untex-
tured surfaces. Without the use of structured light or time-of-flight
methods these are difficult to handle. Passive methods like stereo may
find it difficult to extract the necessary correspondences. Yet, people
can estimate the overall shape quite well (qualitatively), even from a
single image and under uncontrolled lighting. That ability would earn it a place among the passive methods, but no computer algorithm today can achieve such performance. Yet, progress has been made under simplifying conditions. One can use directional lighting with known direction
and intensity. Hence, we have placed the method in the ‘active’ family
for now. Gray levels of object surface patches then convey information
on their 3D orientation. This process not only requires information on
the sensor-illumination configuration, but also on the reflection char-
acteristics of the surface. The complex relationship between gray levels
and surface orientation can theoretically be calculated in some cases —
e.g., when the surface reflectance is known to be Lambertian — but is
usually derived from experiments and then stored in ‘reflectance maps’
for table-lookup. For a Lambertian surface with known albedo and for
a known light source intensity, the angle between the surface normal
and the incident light direction can be derived. This yields surface nor-
mals that lie on a cone about the light direction. Hence, even in this
simple case, the normal of a patch cannot be derived uniquely from
its intensity. Therefore, information from different patches is combined
through extra assumptions on surface smoothness. Neighboring patches
can be expected to have similar normals. Moreover, for a smooth sur-
face the normals at the visible rim of the object can be determined
from their tangents in the image if the camera settings are known.
Indeed, the 3D normals are perpendicular to the plane formed by the
projection ray at these points and the local tangents to the boundary
in the image. This yields strong boundary conditions. Estimating the
lighting conditions is sometimes made part of the problem. This may
be very useful, as in cases where the light source is the sun. The light
is also not always assumed to be coming from a single direction. For
instance, some lighting models consist of both a directional component
and a homogeneous ambient component, where light is coming from all
directions in equal amounts. Surface interreflections are a complication
which these techniques so far cannot handle.
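To make the cone-of-normals argument explicit, the Lambertian case referred to above can be written as follows (the symbols are our own notation, not taken from the text): for a patch with albedo ρ, lit by a distant source of intensity L from unit direction l, the observed intensity is

I = ρ L (n · l) = ρ L cos θ,   hence   θ = arccos( I / (ρ L) ),

with n the unit surface normal. A single intensity measurement thus fixes only the angle θ between n and l, so n is constrained to a cone of half-angle θ around the light direction, which is why extra assumptions or extra light sources are needed.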
The need to combine normal information from different patches can
be reduced by using different light sources with different positions. The
light sources are activated one after the other. The subsequent observed
intensities for the surface patches yield only a single possible normal
orientation (notwithstanding noise in the intensity measurements).
For a Lambertian surface, three different lighting directions suffice to
eliminate uncertainties about the normal direction. The three cones
intersect in a single line, which is the sought patch normal. Of course,
it still is a good idea to further improve the results, e.g., via smoothness
assumptions. Such a ‘photometric stereo’ approach is more stable than
shape-from-shading, but it requires a more controlled acquisition envi-
ronment. An example is shown in Figure 1.9. It shows a dome with 260
LEDs that is easy to assemble and disassemble (modular design, fitting
Fig. 1.9 (a) Mini-dome with different LED light sources, (b) scene with one of the LEDs
activated, (c) 3D reconstruction of a cuneiform tablet, without texture, and (d) the same
tablet with texture.
in a standard aircraft suitcase; see part (a) of the figure). The LEDs
are automatically activated in a predefined sequence. There is one over-
head camera. The resulting 3D reconstruction of a cuneiform tablet is
shown in Figure 1.9(c) without texture, and in (d) with texture.
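The photometric stereo computation described above amounts to a per-pixel least-squares problem once the light directions are known. The sketch below is a generic Lambertian formulation, not the software behind the mini-dome of Figure 1.9; all names are ours.

```python
import numpy as np

def photometric_stereo(intensities, light_dirs):
    """Lambertian photometric stereo.

    intensities : array of shape (k, h, w), one image per light source.
    light_dirs  : array of shape (k, 3), the corresponding (unit) light
                  directions, with k >= 3 and not all coplanar.
    Returns (normals, albedo): unit normals of shape (h, w, 3) and the
    albedo (scaled by the light intensity) of shape (h, w).
    """
    I = np.asarray(intensities, dtype=float)
    L = np.asarray(light_dirs, dtype=float)          # (k, 3)
    k, h, w = I.shape
    # For each pixel, I = L @ (albedo * n); solve in the least-squares sense.
    b = I.reshape(k, -1)                             # (k, h*w)
    g, *_ = np.linalg.lstsq(L, b, rcond=None)        # (3, h*w)
    albedo = np.linalg.norm(g, axis=0)               # (h*w,)
    normals = (g / np.maximum(albedo, 1e-12)).T      # (h*w, 3)
    return normals.reshape(h, w, 3), albedo.reshape(h, w)
```

The recovered normal field can then be integrated into a surface, as mentioned further on in this subsection.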
As with structured light techniques, one can try to reduce the num-
ber of images that have to be taken, by giving the light sources differ-
ent colors. The resulting mix of colors at a surface patch yields direct
information about the surface normal. If three projections suffice, one
can exploit the R-G-B channels of a normal color camera. It is like
taking three intensity images in parallel, one per spectral band of the
camera.
Note that none of the above techniques yield absolute depths, but
rather surface normal directions. These can be integrated into full
3D models of shapes.
1.4.3 Shape-from-Texture and Shape-from-Contour
Passive single vantage methods include shape-from-texture and shape-
from-contour. These methods do not yield true range data, but, as in
the case of shape-from-shading, only surface orientation.
Shape-from-texture assumes that a surface is covered by a homo-
geneous texture (i.e., a surface pattern with some statistical or geo-
metric regularity). Local inhomogeneities of the imaged texture (e.g.,
anisotropy in the statistics of edge orientations for an isotropic tex-
ture, or deviations from assumed periodicity) are regarded as the result
of projection. Surface orientations which allow the original texture to
be maximally isotropic or periodic are selected. Figure 1.10 shows an
Fig. 1.10 Left: The regular texture yields a clear perception of a curved surface. Right: The
result of a shape-from-texture algorithm.
example of a textured scene. The impression of an undulating surface
is immediate. The right-hand side of the figure shows the results for
a shape-from-texture algorithm that uses the regularity of the pattern
for the estimation of the local surface orientation. Actually, what is
assumed here is a square shape of the pattern’s period (i.e., a kind
of discrete isotropy). This assumption suffices to calculate the local
surface orientation. The ellipses represent circles with such calculated
orientation of the local surface patch. The small stick at their center
shows the computed normal to the surface.
Shape-from-contour makes similar assumptions about the true
shape of, usually planar, objects. Observing an ellipse, the assumption
can be made that it actually is a circle, and the slant and tilt angles of
the plane can be determined. For instance, in the shape-from-texture
figure we have visualized the local surface orientation via ellipses. This
3D impression is compelling, because we tend to interpret the elliptical
shapes as projections of what in reality are circles. This is an exam-
ple of shape-from-contour as applied by our brain. The circle–ellipse
relation is just a particular example, and more general principles have
been elaborated in the literature. An example is the maximization of
area over perimeter squared, as a measure of shape compactness, over
all possible deprojections, i.e., surface patch orientations. Returning
to our example, an ellipse would be deprojected to a circle for this
measure, consistent with human vision. Similarly, symmetries in the
original shape will get lost under projection. Choosing the slant and
tilt angles that maximally restore symmetry is another example of a
criterion for determining the normal to the shape. As a matter of fact,
the circle–ellipse case also is an illustration for this measure. Regular
figures with at least a 3-fold rotational symmetry yield a single orien-
tation that could make up for the deformation in the image, except
for the mirror reversal with respect to the image plane (assuming that
perspective distortions are too small to be picked up). This is but a
special case of the more general result, that a unique orientation (up to
mirror reflection) also results when two copies of a shape are observed
in the same plane (except when their orientations differ by 0° or 180°, in which case nothing can be said on the mere assumption
that both shapes are identical). Both cases are more restrictive than
skewed mirror symmetry (without perspective effects), which yields a
one-parameter family of solutions only.
1.4.4 Shape-from-Defocus
Cameras have a limited depth-of-field. Only points at a particular
distance will be imaged with a sharp projection in the image plane.
Although often a nuisance, this effect can also be exploited because it
yields information on the distance to the camera. The level of defocus
has already been used to create depth maps. As points can be blurred
because they are closer or farther from the camera than at the position
of focus, shape-from-defocus methods will usually combine more than
a single image, taken from the same position but with different focal
lengths. This should disambiguate the depth.
1.4.5 Shape-from-Silhouettes
Shape-from-silhouettes is a passive, multi-vantage approach. Suppose
that an object stands on a turntable. At regular rotational intervals
an image is taken. In each of the images, the silhouette of the object
is determined. Initially, one has a virtual lump of clay, larger than the
object and fully containing it. From each camera orientation, the silhou-
ette forms a cone of projection rays, for which the intersection with this
virtual lump is calculated. The result of all these intersections yields an
approximate shape, a so-called visual hull. Figure 1.11 illustrates the
process.
One has to be careful that the silhouettes are extracted with good
precision. A way to ease this process is by providing a simple back-
ground, like a homogeneous blue or green cloth (‘blue keying’ or ‘green
keying’). Once a part of the lump has been removed, it can never be
retrieved in straightforward implementations of this idea. Therefore,
more refined, probabilistic approaches have been proposed to fend off
such dangers. Also, cavities that do not show up in any silhouette will
not be removed. For instance, the eye sockets in a face will not be
detected with such method and will remain filled up in the final model.
This can be solved by also extracting stereo depth from neighboring
Fig. 1.11 The first three images show different backprojections from the silhouette of a
teapot in three views. The intersection of these backprojections forms the visual hull of the
object, shown in the bottom right image. The more views are taken, the closer the visual
hull approaches the true shape, but cavities not visible in the silhouettes are not retrieved.
viewpoints and by combining the 3D information coming from both
methods.
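The silhouette-carving idea just described can be sketched as a simple loop over a block of candidate voxels: a voxel survives only if it projects inside the silhouette in every view. The projection function and the silhouette masks are assumed to be given (e.g., from calibrated or self-calibrated cameras); the names below are ours.

```python
import numpy as np

def carve_visual_hull(voxels, silhouettes, project):
    """Approximate the visual hull by silhouette carving.

    voxels      : (n, 3) array of candidate 3D points (the 'lump of clay').
    silhouettes : list of boolean images, one per view (True = object).
    project     : project(view_index, points) -> (n, 2) pixel coordinates
                  (x, y), assumed available from the calibrated cameras.
    Returns a boolean mask over the voxels: True = inside the visual hull.
    """
    inside = np.ones(len(voxels), dtype=bool)
    for i, sil in enumerate(silhouettes):
        h, w = sil.shape
        px = np.round(project(i, voxels)).astype(int)
        x, y = px[:, 0], px[:, 1]
        visible = (x >= 0) & (x < w) & (y >= 0) & (y < h)
        # A voxel is carved away as soon as it falls outside one silhouette
        # (or outside the image); cavities never seen in any silhouette
        # remain filled, as noted in the text.
        inside &= visible & sil[np.clip(y, 0, h - 1), np.clip(x, 0, w - 1)]
    return inside
```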
The hardware needed is minimal, and very low-cost shape-from-
silhouette systems can be produced. If multiple cameras are placed
around the object, the images can be taken all at once and the capture
time can be reduced. This will increase the price, and also the silhouette
extraction may become more complicated. If video cameras
are used, a dynamic scene like a moving person can be captured in 3D
over time (but note that synchronization issues are introduced). An
example is shown in Figure 1.12, where 15 video cameras were set up
in an outdoor environment.
Of course, in order to extract precise cones for the intersection, the
relative camera positions and their internal settings have to be known
precisely. This can be achieved with the same self-calibration methods
expounded in the following sections. Hence, also shape-from-silhouettes
can benefit from the presented ideas and this is all the more interesting
Fig. 1.12 (a) Fifteen cameras set up in an outdoor environment around a person, (b) a more detailed view of the visual hull at a specific moment of the action, (c) a detailed view of the visual hull textured by backprojecting the image colors, and (d) another view of the
visual hull with backprojected colors. Note how part of the sock area has been erroneously
carved away.
as this 3D extraction approach is among the most practically relevant
ones for dynamic scenes (‘motion capture’).
1.4.6 Hybrid Techniques
The aforementioned techniques often have complementary strengths
and weaknesses. Therefore, several systems try to exploit multiple tech-
niques in conjunction. A typical example is the combination of shape-
from-silhouettes with stereo as already hinted in the previous section.
Both techniques are passive and use multiple cameras. The visual hull
produced from the silhouettes provides a depth range in which stereo
can try to refine the surfaces in between the rims, in particular at the
cavities. Similarly, one can combine stereo with structured light. Rather
than trying to generate a depth map from the images pure, one can
project a random noise pattern, to make sure that there is enough tex-
ture. As still two cameras are used, the projected pattern does not have
to be analyzed in detail. Local pattern correlations may suffice to solve
the correspondence problem. One can project in the near-infrared, to
simultaneously take color images and retrieve the surface texture with-
out interference from the projected pattern. So far, the problem with
this has often been the weaker contrast obtained in the near-infrared
band. Many such integrated approaches can be thought of.
This said, there is no single 3D acquisition system to date that can
handle all types of objects or surfaces. Transparent or glossy surfaces
(e.g., glass, metals), fine structures (e.g., hair or wires), and too weak,
too busy, or too repetitive surface textures (e.g., identical tiles on a
wall) may cause problems, depending on the system that is being used.
The next section discusses still existing challenges in a bit more detail.
1.5 Challenges
The production of 3D models has been a popular research topic already
for a long time now, and important progress has indeed been made since
the early days. Nonetheless, the research community is well aware of
the fact that still much remains to be done. In this section we list some
of these challenges.
As seen in the previous subsections, there is a wide variety of tech-
niques for creating 3D models, but depending on the geometry and
material characteristics of the object or scene, one technique may be
much better suited than another. For example, untextured objects are
a nightmare for traditional stereo, but too much texture may interfere
with the patterns of structured-light techniques. Hence, one would seem
to need a battery of systems to deal with the variability of objects —
e.g., in a museum — to be modeled. As a matter of fact, having to
model the entire collections of diverse museums is a useful application
area to think about, as it poses many of the pending challenges, often
several at once. Another area is 3D city modeling, which has quickly
grown in importance over the last years. It is another extreme in terms
of conditions under which data have to be captured, in that cities rep-
resent an absolutely uncontrolled and large-scale environment. Also in
that application area, many problems remain to be resolved.
Here is a list of remaining challenges, which we do not claim to be
exhaustive:
• Many objects have an intricate shape, the scanning of which
requires high precision combined with great agility of the
scanner to capture narrow cavities and protrusions, deal with
self-occlusions, fine carvings, etc.
• The types of objects and materials that potentially have to be
handled — think of the museum example — are very diverse,
like shiny metal coins, woven textiles, stone or wooden sculp-
tures, ceramics, gems in jewellery and glass. No single tech-
nology can deal with all these surface types and for some of
these types of artifacts there are no satisfactory techniques
yet. Also, apart from the 3D shape the material characteris-
tics may need to be captured as well.
• The objects to be scanned range from tiny ones like a needle
to an entire construction or excavation site, landscape, or
city. Ideally, one would handle this range of scales with the
same techniques and similar protocols.
• For many applications, data collection may have to be
undertaken on-site under potentially adverse conditions or
implying transportation of equipment to remote or harsh
environments.
• Objects are sometimes too fragile or valuable to be touched
and need to be scanned ‘hands-off’. The scanner needs to be
moved around the object, without it being touched, using
portable systems.
• Masses of data often need to be captured, like in the museum
collection or city modeling examples. More efficient data cap-
ture and model building are essential if this is to be practical.
• Those undertaking the digitization may or may not be tech-
nically trained. Not all applications are to be found in indus-
try, and technically trained personnel may very well not
be around. This raises the need for intelligent devices that
ensure high-quality data through (semi-)automation, self-
diagnosis, and effective guidance of the operator.
• In many application areas the money that can be spent is
very limited and solutions therefore need to be relatively
cheap.
• Also, precision is a moving target in many applications and
as higher precisions are achieved, new applications present
themselves that push for going even beyond. Analyzing the
3D surface of paintings to study brush strokes is a case in
point.
These considerations about the particular conditions under which
models may need to be produced lead to a number of desirable tech-
nological developments for 3D data acquisition:
• Combined extraction of shape and surface
reflectance. Increasingly, 3D scanning technology is
aimed at also extracting high-quality surface reflectance
information. Yet, there still is an appreciable way to go
before high-precision geometry can be combined with
detailed surface characteristics like full-fledged BRDF
(Bidirectional Reflectance Distribution Function) or BTF
(Bidirectional Texture Function) information.
• In-hand scanning. The first truly portable scanning sys-
tems are already around. But the choice is still restricted,
especially when also surface reflectance information is
required and when the method ought to work with all types
of materials, including metals, glass, etc. Also, transportable
here is supposed to mean more than ‘can be dragged between
places’, i.e., rather the possibility to easily move the system
around the object, ideally also by hand. But there also is the
interesting alternative to take the objects to be scanned in
one’s hands, and to manipulate them such that all parts get
exposed to the fixed scanner. This is not always a desirable
option (e.g., in the case of very valuable or heavy pieces), but
has the definite advantage of exploiting the human agility
in presenting the object and in selecting optimal, additional
views.
• On-line scanning. The physical action of scanning and the
actual processing of the data often still are two separate
steps. This may create problems in that the completeness and
quality of the result can only be inspected after the scanning
session is over and the data are analyzed and combined at the
lab or the office. It may then be too late or too cumbersome
to take corrective actions, like taking a few additional scans.
It would be very desirable if the system would extract the
3D data on the fly, and would give immediate visual feed-
back. This should ideally include steps like the integration
and remeshing of partial scans. This would also be a great
help in planning where to take the next scan during scanning.
A refinement can then still be performed off-line.
• Opportunistic scanning. Not a single 3D acquisition tech-
nique is currently able to produce 3D models of even a large
majority of exhibits in a typical museum. Yet, they often have
complementary strengths and weaknesses. Untextured sur-
faces are a nightmare for passive techniques, but may be ideal
for structured light approaches. Ideally, scanners would auto-
matically adapt their strategy to the object at hand, based
on characteristics like spectral reflectance, texture spatial
frequency, surface smoothness, glossiness, etc. One strategy
would be to build a single scanner that can switch strategy
on-the-fly. Such a scanner may consist of multiple cameras
and projection devices, and by today’s technology could still
be small and light-weight.
• Multi-modal scanning. Scanning may not only combine
geometry and visual characteristics. Additional features like
non-visible wavelengths (UV, (N)IR) could have to be cap-
tured, as well as haptic impressions. The latter would then
also allow for a full replay to the public, where audiences can
hold even the most precious objects virtually in their hands,
and explore them with all their senses.
• Semantic 3D. Gradually, computer vision is getting to the point where scene understanding becomes feasible. Out of 2D
images, objects and scene types can be recognized. This will
in turn have a drastic effect on the way in which ‘low’-level
processes can be carried out. If high-level, semantic interpre-
tations can be fed back into ‘low’-level processes like motion
and depth extraction, these can benefit greatly. This strat-
egy ties in with the opportunistic scanning idea. Recognizing
what it is that is to be reconstructed in 3D (e.g., a car and
its parts) can help a system to decide how best to proceed,
resulting in increased speed, robustness, and accuracy. It can
provide strong priors about the expected shape and surface
characteristics.
• Off-the-shelf components. In order to keep 3D modeling
cheap, one would ideally construct the 3D reconstruction sys-
tems on the basis of off-the-shelf, consumer products. At least
as much as possible. This does not only reduce the price, but
also lets the systems surf on a wave of fast-evolving, mass-
market products. For instance, the resolution of still, digital
cameras is steadily on the increase, so a system based on
such camera(s) can be upgraded to higher quality without
much effort or investment. Moreover, as most users will be
acquainted with such components, the learning curve to use
the system is probably not as steep as with a totally novel,
dedicated technology.
Obviously, once 3D data have been acquired, further process-
ing steps are typically needed. These entail challenges of their own.
Improvements in automatic remeshing and decimation are definitely
still possible. Also solving large 3D puzzles automatically, preferably
exploiting shape in combination with texture information, would be
something in high demand from several application areas. Level-of-
detail (LoD) processing is another example. All these can also be
expected to greatly benefit from a semantic understanding of the data.
Surface curvature alone is a weak indicator of the importance of a shape
feature in LoD processing. Knowing one is at the edge of a salient, func-
tionally important structure may be a much better reason to keep it in
at many scales.
1.6 Conclusions
Given the above considerations, the 3D reconstruction of shapes
from multiple, uncalibrated images is one of the most promising
3D acquisition techniques. In terms of our taxonomy of techniques,
self-calibrating structure-from-motion is a passive, multi-vantage
point strategy. It offers high degrees of flexibility in that one can
freely move a camera around an object or scene. The camera can be
hand-held. Most people have a camera and know how to use it. Objects
or scenes can be small or large, assuming that the optics and the
amount of camera motion are appropriate. These methods also give
direct access to both shape and surface reflectance information, where
both can be aligned without special alignment techniques. Efficient
implementations of several subparts of such Structure-from-Motion
pipelines have been proposed lately, so that the on-line application
of such methods is gradually becoming a reality. Also, the required
hardware is minimal, and in many cases consumer type cameras will
suffice. This keeps prices for data capture relatively low.
2 Principles of Passive 3D Reconstruction
2.1 Introduction
In this section the basic principles underlying self-calibrating, passive
3D reconstruction are explained. More specifically, the central goal
is to arrive at a 3D reconstruction from the uncalibrated image data
alone. But, to understand how three-dimensional (3D) objects can be
reconstructed from two-dimensional (2D) images, one first needs to
know how the reverse process works: i.e., how images of a 3D object
arise. Section 2.2 therefore discusses the image formation process in a
camera and introduces the camera model which will be used through-
out the text. As will become clear this model incorporates internal
and external parameters related to the technical specifications of the
camera(s) and their location with respect to the objects in the scene.
Subsequent sections then set out to extract 3D models of the scene
without prior knowledge of these parameters, i.e., without the need
to calibrate the cameras internally or externally first. This reconstruc-
tion problem is formulated mathematically in Section 2.3 and a solu-
tion strategy is initiated. The different parts in this solution of the 3D
reconstruction problem are elaborated in the following sections. Along
the way, fundamental notions such as the correspondence problem,
the epipolar relation, and the fundamental matrix of an image pair
are introduced (Section 2.4), and the possible stratification of the
reconstruction process into Euclidean, metric, affine, and projective
reconstructions is explained (Section 2.5). Furthermore, self-calibration
equations are derived and their solution is discussed in Section 2.6.
Apart from the generic case, special camera motions are considered as
well (Section 2.7). In particular, camera translation and camera rota-
tion are discussed. These often occur in practice, but their systems of
self-calibration equations or reconstruction equations become singular.
Attention is paid also to the case of internally calibrated cameras and
the important notion and use of the essential matrix is explored for
that case.
As already mentioned in the preface, we follow a particular route —
somewhat different than usual but hopefully all the more intuitive and
self-contained — to develop the different ideas and to extract the cor-
responding results. Nonetheless, there are numerous relevant papers
that present alternative and complementary approaches which are not
explicitly referenced in the text. We provide, therefore, a complete Bib-
liography of all the papers. This precedes the cited References and read-
ers who want to gain in-depth knowledge of the field are encouraged to
look also at those papers.
As a note on widely used terminology in this domain, the word
camera is often to be interpreted as a certain viewpoint and viewing
direction — a field of view or image — and if mention is made of a
first, second, . . . camera then this can just as well refer to the same
camera being moved around to a first, second, . . . position.
2.2 Image Formation and Camera Model
2.2.1 The Pinhole Camera
The simplest model of the image formation process in a camera is that
of a pinhole camera or camera obscura. The camera obscura is no more than a black box, one side of which is punctured to yield a small
hole. The rays of light from the outside world that pass through the
hole and fall on the opposite side of the box there form a 2D image of
the 3D environment outside the box (called the scene), as is depicted
Fig. 2.1 In a pinhole camera or camera obscura an image of the scene is formed by the rays
of light that are reflected by the objects in the scene and fall through the center of projection
onto the opposite wall of the box, forming a photo-negative image of the scene. The photo-
positive image of the scene corresponds to the projection of the scene onto a hypothetical
image plane situated in front of the camera. It is this hypothetical plane which is typically
used in computer vision, in order to avoid sign reversals.
in Figure 2.1. Some art historians believe that the painter Vermeer
actually used a room-sized version of a camera obscura. Observe that
this pinhole image actually is the photo-negative image of the scene.
The photo-positive image one observes when watching a photograph
or a computer screen corresponds to the projection of the scene onto a
hypothetical plane that is situated in front of the camera obscura at the
same distance from the hole as the opposite wall on which the image is
actually formed. In the sequel, the term image plane will always refer to
this hypothetical plane in front of the camera. This hypothetical plane
is preferred to avoid sign reversals in the computations. The distance
between the center of projection (the hole) and the image plane is called
the focal length of the camera.
The amount of light that falls into the box through the small hole
is very limited. One can increase this amount of light by making the
hole bigger, but then rays coming from different 3D points can fall
onto the same point on the image, thereby causing blur. One way of
getting around this problem is by making use of lenses, which focus the
light. Apart from the introduction of geometric and chromatic aber-
rations, even the most perfect lens will come with a limited depth-of-
field. This means that only scene points within a limited depth range
are imaged sharply. Within that depth range the camera with lens
basically behaves like the pinhole model. The ‘hole’ in the box will
in the sequel be referred to as the center of projection or the camera
center, and the type of projection realized by this idealized model is
referred to as perspective projection.
It has to be noted that whereas in principle a single convex lens
might be used, real camera lenses are composed of multiple lenses, in
order to reduce deviations from the ideal model (i.e., to reduce the
aforementioned aberrations). A detailed discussion on this important
optical component is out of the scope of this tutorial, however.
2.2.2 Projection Equations for a Camera-Centered
Reference Frame
To translate the image formation process into mathematical formulas
we first introduce a reference frame for the 3D environment (also called
the world) containing the scene. The easiest is to fix it to the camera.
Figure 2.2 shows such a camera-centered reference frame. It is a right-
handed and orthonormal reference frame whose origin is at the center
of projection. Its Z-axis is the principal axis of the camera — i.e.,
the line through the center of projection and orthogonal to the image
Fig. 2.2 The camera-centered reference frame is fixed to the camera and aligned with its intrinsic directions, of which the principal axis is one. The coordinates of the projection m of a scene point M onto the image plane in a pinhole camera model with a camera-centered reference frame, as expressed by equation (2.1), are given with respect to the principal point p in the image.
plane — and the XY-plane is the plane through the center of projection and parallel to the image plane. The image plane is the plane with equation Z = f, where f denotes the focal length of the camera. The principal axis intersects the image plane in the principal point p.
The camera-centered reference frame induces an orthonormal uv
reference frame in the image plane, as depicted in Figure 2.2. The
image of a scene point M is the point m where the line through M and the origin of the camera-centered reference frame intersects the image plane. If M has coordinates (X, Y, Z) ∈ R³ with respect to the camera-centered reference frame, then an arbitrary point on the line through the origin and the scene point M has coordinates ρ (X, Y, Z) for some real number ρ. The point of intersection of this line with the image plane must satisfy the relation ρ Z = f, or equivalently, ρ = f / Z. Hence, the image m of the scene point M has coordinates (u, v, f), where

u = f X / Z   and   v = f Y / Z.   (2.1)
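As a small numerical illustration of equation (2.1), the sketch below (with names of our own choosing) projects points expressed in the camera-centered frame, assuming Z > 0, i.e., points in front of the camera.

```python
import numpy as np

def project_pinhole(points_cam, f):
    """Perspective projection of equation (2.1).

    points_cam : (n, 3) array of scene points (X, Y, Z) expressed in the
                 camera-centered reference frame, with Z > 0.
    f          : focal length, in the same metric units.
    Returns (n, 2) image coordinates (u, v), measured from the principal point.
    """
    P = np.asarray(points_cam, dtype=float)
    X, Y, Z = P[:, 0], P[:, 1], P[:, 2]
    return np.stack([f * X / Z, f * Y / Z], axis=1)

# A point at depth 2f projects halfway between the principal point and
# the projection of the same (X, Y) at depth f:
print(project_pinhole([[0.1, 0.2, 2.0]], f=1.0))   # [[0.05, 0.1]]
```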
Projections onto the image plane cannot be detected with infinite
precision. An image rather consists of physical cells capturing photons,
so-called picture elements, or pixels for short. Apart from some exotic
designs (e.g., hexagonal or log-polar cameras), these pixels are arranged
in a rectangular grid, i.e., according to rows and columns, as depicted
in Figure 2.3 (left).

Fig. 2.3 Left: In a digital image, the position of a point in the image is indicated by its
pixel coordinates. This corresponds to the way in which a digital image is read from a CCD.
Right: The coordinates (u, v) of the projection of a scene point in the image are defined
with respect to the principal point p. Pixel coordinates, on the other hand, are measured
with respect to the upper left corner of the image.

Pixel positions are typically indicated with a row and column number
measured with respect to the top left corner of the image. These
numbers are called the pixel coordinates of an image
point. We will denote them by (x, y), where the x-coordinate is mea-
sured horizontally and increasing to the right, and the y-coordinate is
measured vertically and increasing downwards. This choice has several
advantages:
• The way in which x- and y-coordinates are assigned to image
points corresponds directly to the way in which an image is
read out by several digital cameras with a CCD: starting at
the top left and reading line by line.
• The camera-centered reference frame for the world being
right-handed then implies that its Z-axis is pointing away
from the image into the scene (as opposed to into the camera).
Hence, the Z-coordinate of a scene point corresponds
to the "depth" of that point with respect to the camera,
which conceptually is nice because it is the big unknown to
be solved for in 3D reconstruction problems.
As a consequence, we are not so much interested in the metric coordinates
(u, v) indicating the projection m of a scene point M in the image
and given by formula (2.1), as in the corresponding row and column
numbers (x, y) of the underlying pixel. At the end of the day, it will
be these pixel coordinates to which we have access when analyzing the
image. Therefore we have to make the transition from (u, v)-coordinates
to pixel coordinates (x, y) explicit first.
In a camera-centered reference frame the X-axis is typically chosen
parallel to the rows and the Y-axis parallel to the columns of the rectangular
grid of pixels. In this way, the u- and v-axes induced in the
image plane have the same direction and sense as those in which the
pixel coordinates x and y of image points are measured. But, whereas
pixel coordinates are measured with respect to the top left corner of
the image, (u, v)-coordinates are measured with respect to the principal
point p. The first step in the transition from (u, v)- to (x, y)-coordinates
for an image point m thus is to apply offsets to each coordinate. To this
end, denote by p_u and p_v the metric distances, measured in the horizontal
and vertical directions, respectively, of the principal point p from
the upper left corner of the image (see Figure 2.3 (right)). With the
top left corner of the image as origin, the principal point now has coordinates
(p_u, p_v) and the perspective projection m of the scene point M,
as described by formula (2.1), will have coordinates

$$\tilde{u} = \frac{fX}{Z} + p_u \quad \text{and} \quad \tilde{v} = \frac{fY}{Z} + p_v.$$
These (ũ, ṽ)-coordinates of the image point m are still expressed in the
metric units of the camera-centered reference frame. To convert them
to pixel coordinates, one has to divide ũ and ṽ by the width and the
height of a pixel, respectively. Let m_u and m_v be the inverse of respectively
the pixel width and height, so that m_u and m_v indicate how many
pixels fit into one horizontal, respectively vertical, metric unit. The pixel
coordinates (x, y) of the projection m of the scene point M in the image
are thus given by

$$x = m_u f \frac{X}{Z} + p_u \quad \text{and} \quad y = m_v f \frac{Y}{Z} + p_v;$$

or equivalently,

$$x = \alpha_x \frac{X}{Z} + p_x \quad \text{and} \quad y = \alpha_y \frac{Y}{Z} + p_y, \qquad (2.2)$$

where α_x = m_u f and α_y = m_v f are the focal length expressed in number
of pixels for the x- and y-direction of the image, and (p_x, p_y) are the
pixel coordinates of the principal point. The ratio α_y/α_x = m_v/m_u, giving the
ratio of the pixel width with respect to the pixel height, is called the
aspect ratio of the pixels.
2.2.3 A Matrix Expression for Camera-Centered Projection
More elegant expressions for the projection equations (2.2) are obtained
if one uses extended pixel coordinates for the image points. In particular,
if a point m with pixel coordinates (x, y) in the image is represented by
the column vector m = (x, y, 1)^T, then formula (2.2) can be rewritten as:

$$Z\,\mathbf{m} = Z \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} \alpha_x & 0 & p_x \\ 0 & \alpha_y & p_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}. \qquad (2.3)$$
Observe that, if one interprets the extended pixel coordinates (x, y, 1)^T
of the image point m as a vector indicating a direction in the world,
then, since Z describes the "depth" in front of the camera at which the
corresponding scene point M is located, the 3 × 3-matrix

$$\begin{pmatrix} \alpha_x & 0 & p_x \\ 0 & \alpha_y & p_y \\ 0 & 0 & 1 \end{pmatrix}$$

represents the transformation that converts world measurements
(expressed in meters, centimeters, millimeters, ...) into the pixel metric
of the digital image. This matrix is called the calibration matrix
of the camera, and it is generally represented as the upper triangular
matrix

$$K = \begin{pmatrix} \alpha_x & s & p_x \\ 0 & \alpha_y & p_y \\ 0 & 0 & 1 \end{pmatrix}, \qquad (2.4)$$

where α_x and α_y are the focal length expressed in number of pixels for
the x- and y-directions in the image respectively, and with (p_x, p_y) the
pixel coordinates of the principal point. The additional scalar s in the
calibration matrix K is called the skew factor and models the situation
in which the pixels are parallelograms (i.e., not rectangular). It also
yields an approximation to the situation in which the physical imaging
plane is not perfectly perpendicular to the optical axis of the lens or
objective (as was assumed above). In fact, s is inversely proportional to
the tangent of the angle between the X- and the Y-axis of the camera-centered
reference frame. Consequently, s = 0 for digital cameras with
rectangular pixels.
Together, the entries α_x, α_y, s, p_x, and p_y of the calibration
matrix K describe the internal behavior of the camera and are therefore
called the internal parameters of the camera. Furthermore, the
projection equations (2.3) of a pinhole camera with respect to a camera-centered
reference frame for the scene are compactly written as:

$$\rho\,\mathbf{m} = K\,\mathbf{M}, \qquad (2.5)$$

where M = (X, Y, Z)^T are the coordinates of a scene point M with respect
to the camera-centered reference frame for the scene, m = (x, y, 1)^T are
the extended pixel coordinates of its projection m in the image, and K is
the calibration matrix of the camera. Furthermore, ρ is a positive real
number which actually represents the "depth" of the scene point M
in front of the camera, because, due to the structure of the calibration
matrix K, the third row in the matrix equality (2.5) reduces to
ρ = Z. Therefore ρ is called the projective depth of the scene point M
corresponding to the image point m.
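As a small illustration of equation (2.5), the following Python/NumPy sketch
projects a point that is given in the camera-centered reference frame into pixel
coordinates and reads off its projective depth. The function name and the example
calibration values are hypothetical and only serve to make the formula concrete.

    import numpy as np

    def project_camera_centered(M, K):
        """Project a scene point M = (X, Y, Z), expressed in the camera-centered
        frame, to pixel coordinates via equation (2.5): rho * m = K * M."""
        m = K @ np.asarray(M, dtype=float)
        rho = m[2]                # projective depth; equals Z for K of form (2.4)
        return m[:2] / rho        # pixel coordinates (x, y)

    # Example (made-up) calibration matrix: focal length in pixels and principal point.
    # K = np.array([[1000.0,    0.0, 320.0],
    #               [   0.0, 1000.0, 240.0],
    #               [   0.0,    0.0,   1.0]])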
2.2.4 The General Linear Camera Model
When more than one camera is used, or when the objects in the scene
are to be represented with respect to another, non-camera-centered
reference frame (called the world frame), then the position and orientation
of the camera in the scene are described by a point C, indicating
the center of projection, and a 3 × 3-rotation matrix R indicating
the orientation of the camera-centered reference frame with respect to
the world frame. More precisely, the column vectors r_i of the rotation
matrix R are the unit direction vectors of the coordinate axes of the
camera-centered reference frame, as depicted in Figure 2.4. As C and
R represent the setup of the camera in the world space, they are called
the external parameters of the camera.
Fig. 2.4 The position and orientation of the camera in the scene are given by a position
vector C and a 3 × 3-rotation matrix R. The projection m of a scene point M is then given
by formula (2.6).

The coordinates of a scene point M with respect to the camera-centered
reference frame are found by projecting the relative position
vector M − C orthogonally onto each of the coordinate axes of the
camera-centered reference frame. The column vectors r_i of the rotation
matrix R being the unit direction vectors of the coordinate axes of the
camera-centered reference frame, the coordinates of M with respect to
the camera-centered reference frame are given by the dot products of
the relative position vector M − C with the unit vectors r_i; or equivalently,
by premultiplying the column vector M − C with the transpose
of the orientation matrix R, viz. R^T(M − C). Hence, following
formula (2.5), the projection m of the scene point M in the image is
given by the (general) projection equations:

$$\rho\,\mathbf{m} = K R^T (\mathbf{M} - \mathbf{C}), \qquad (2.6)$$
where M = (X, Y, Z)^T are the coordinates of a scene point M with respect
to an (arbitrary) world frame, m = (x, y, 1)^T are the extended pixel coordinates
of its projection m in the image, K is the calibration matrix of
the camera, C is the position and R is the rotation matrix expressing
the orientation of the camera with respect to the world frame, and ρ is
a positive real number representing the projective depth of the scene
point M with respect to the camera.
Many authors prefer to use extended coordinates for scene points as
well. So, if (X, Y, Z, 1)^T are the extended coordinates of the scene
point M = (X, Y, Z)^T, then the projection equations (2.6) become

$$\rho\,\mathbf{m} = \left( K R^T \mid -K R^T \mathbf{C} \right) \begin{pmatrix} \mathbf{M} \\ 1 \end{pmatrix}. \qquad (2.7)$$

The 3 × 4-matrix P = (K R^T | −K R^T C) is called the projection matrix
of the camera.
Notice that, if only the 3 × 4-projection matrix P is known, it is
possible to retrieve the internal and external camera parameters from it.
Indeed, as is seen from formula (2.7), the upper left 3 × 3-submatrix of
P is formed by multiplying K and R^T. Its inverse is R K^{-1}, since R is a
rotation matrix and thus R^T = R^{-1}. Furthermore, K is a non-singular
upper triangular matrix and so is K^{-1}. In particular, R K^{-1} is the product
of an orthogonal matrix and an upper triangular one. Recall from
linear algebra that every 3 × 3-matrix of maximal rank can uniquely be
decomposed as a product of an orthogonal and a non-singular, upper
triangular matrix with positive diagonal entries by means of the QR-decomposition [3]
(with Q the orthogonal and R the upper triangular
matrix). Hence, given the 3 × 4-projection matrix P of a pinhole camera,
the calibration matrix K and the orientation matrix R of the
camera can easily be recovered from the inverse of the upper left 3 × 3-submatrix
of P by means of QR-decomposition. If K and R are known,
then the center of projection C is found by premultiplying the fourth
column of P with the matrix −R K^{-1}.
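The recipe just described translates directly into a few lines of code. The
Python/NumPy sketch below (the function name is chosen here for illustration)
recovers K, R, and C from a given projection matrix P via QR-decomposition of
the inverse of its left 3 × 3-submatrix; it assumes that submatrix is non-singular
and that P is scaled such that a proper rotation (det R = +1) results.

    import numpy as np

    def decompose_projection_matrix(P):
        """Split P = (K R^T | -K R^T C) into K, R and C (a minimal sketch)."""
        M = P[:, :3]                               # M = K R^T
        Q, U = np.linalg.qr(np.linalg.inv(M))      # inv(M) = R K^{-1} = Q U
        # QR is unique only up to the signs of U's diagonal; make the diagonal
        # positive so that K comes out with positive focal lengths.
        S = np.diag(np.sign(np.diag(U)))
        Q, U = Q @ S, S @ U                        # Q U is unchanged (S S = I)
        R = Q                                      # orientation of the camera
        K = np.linalg.inv(U)
        K = K / K[2, 2]                            # fix the scale so K[2,2] = 1
        C = -np.linalg.inv(M) @ P[:, 3]            # since last column = -K R^T C
        return K, R, C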
The camera model of formula (2.7) is usually referred to as the
general linear camera model. Taking a close look at this formula shows
how general the camera projection matrix P = (K R^T | −K R^T C) is as
a matrix, in fact. Apart from the fact that the 3 × 3-submatrix on
the left has to be of full rank, one cannot demand more than that it
has to be QR-decomposable, which holds for any such matrix. The
attentive reader may now object that, according to formula (2.4), the
calibration matrix K must have entry 1 at the third position in the
last row, whereas there is no such constraint for the upper triangular
matrix in a QR-decomposition. This would seem a further restriction
on the left 3 × 3-submatrix of P, but it can easily be lifted by observing
that the camera projection matrix P is actually only determined up to
a scalar factor. Indeed, due to the non-zero scalar factor ρ in the left-hand
side of formula (2.7), one can always ensure this property to hold.
Put differently, any 3 × 4-matrix whose upper left 3 × 3-submatrix is
non-singular can be interpreted as the projection matrix of a (linear)
pinhole camera.
2.2.5 Non-linear Distortions
The perspective projection model described in the previous sections is
linear in the sense that the scene point, the corresponding image point
and the center of projection are collinear, and that straight lines in the
scene do generate straight lines in the image. Perspective projection
therefore only models the linear effects in the image formation pro-
cess. Images taken by real cameras, on the other hand, also experience
non-linear deformations or distortions which make the simple linear
pinhole model inaccurate. The most important and best known non-
linear distortion is radial distortion. Figure 2.5(left) shows an example
Fig. 2.5 Left: An image exhibiting radial distortion. The vertical wall at the left of the
building appears bent in the image and the gutter on the frontal wall on the right appears
curved too. Right: The same image after removal of the radial distortion. Straight lines in
the scene now appear as straight lines in the image as well.
of a radially distorted image. Radial distortion is caused by a system-
atic variation of the optical magnification when radially moving away
from a certain point, called the center of distortion. The larger the dis-
tance between an image point and the center of distortion, the larger
the effect of the distortion. Thus, the effect of the distortion is mostly
visible near the edges of the image. This can clearly be seen in Fig-
ure 2.5 (left). Straight lines near the edges of the image are no longer
straight but are bent. For practical use, the center of radial distortion
can often be assumed to coincide with the principal point, which usu-
ally also coincides with the center of the image. But it should be noted
that these are only approximations and dependent on the accuracy
requirements, a more precise determination may be necessary [7].
Radial distortion is a non-linear effect and is typically modeled using
a Taylor expansion. Typically, only the even order terms play a role
in this expansion, i.e., the effect is symmetric around the center. The
effect takes place in the lens, hence mathematically the radial distortion
should be between the external and internal parameters of the pinhole
model. The model we will propose here follows this strategy. Let us
define
ρmu=
mux
muy
1
=RT(MC),
2.2 Image Formation and Camera Model 327
where M=(X,Y,Z)Tare the coordinates of a scene point with respect
to the world frame. The distance rof the point mufrom the optical axis
is then
r2=m2
ux +m2
uy.
We now define m_d as

$$\mathbf{m}_d = \begin{pmatrix} m_{dx} \\ m_{dy} \\ 1 \end{pmatrix} = \begin{pmatrix} (1 + \kappa_1 r^2 + \kappa_2 r^4 + \kappa_3 r^6 + \ldots)\, m_{ux} \\ (1 + \kappa_1 r^2 + \kappa_2 r^4 + \kappa_3 r^6 + \ldots)\, m_{uy} \\ 1 \end{pmatrix}. \qquad (2.8)$$

The lower order terms of this expansion are the most important ones
and typically one does not compute more than three parameters (κ_1,
κ_2, κ_3). Finally the projection m of the 3D point M is:

$$\mathbf{m} = K\,\mathbf{m}_d. \qquad (2.9)$$
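For concreteness, the following Python/NumPy sketch applies equations (2.8)
and (2.9) to project a single 3D point with radial distortion. The function
name is hypothetical, and the sketch assumes the point lies in front of the
camera and that at most three distortion coefficients are used.

    import numpy as np

    def project_with_radial_distortion(M, K, R, C, kappa=(0.0, 0.0, 0.0)):
        """Project a 3D point M with the distortion model of (2.8)-(2.9),
        where the distortion acts between the external and internal parameters."""
        m_u = R.T @ (np.asarray(M, float) - C)    # camera-centered coordinates
        m_u = m_u / m_u[2]                        # normalize so the third entry is 1
        r2 = m_u[0]**2 + m_u[1]**2                # squared distance from the optical axis
        k1, k2, k3 = kappa
        factor = 1.0 + k1*r2 + k2*r2**2 + k3*r2**3
        m_d = np.array([factor*m_u[0], factor*m_u[1], 1.0])
        m = K @ m_d                               # pixel coordinates (homogeneous)
        return m[:2]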
When the distortion parameters are known, the image can be undis-
torted in order to make all lines straight again and thus make the lin-
ear pinhole model valid. The undistorted version of Figure 2.5 (left) is
shown in the same Figure 2.5 (right).
The model described above puts the radial distortion parameters
between the external and linear internal parameters of the camera. In
the literature one often finds that the distortion is put on the left of the
internal parameters, i.e., a 3D point is first projected into the image
via the linear model and then shifted. Conceptually the latter model is
less suited than the one used here because putting the distortion at the
end makes it dependent on the internal parameters, especially on the
focal length. This means one has to re-estimate the radial distortion
parameters every time the focal length changes. This is not necessary
in the model as suggested here. This, however, does not mean that this
model is perfect. In reality the center of radial distortion (now assumed
to be the principal point) sometimes changes when the focal length is
altered. This effect cannot be modeled with this approach.
In the remainder of the text we will assume that radial distortion
has been removed from all images if present, unless stated otherwise.
2.2.6 Explicit Camera Calibration
As explained before a perspective camera can be described by its
internal and external parameters. The process of determining these
parameters is known as camera calibration. Accordingly, we make a
distinction between internal calibration and external calibration, also
known as pose estimation. For the determination of all parameters one
often uses the term complete calibration. Traditional 3D passive recon-
struction techniques had a separate, explicit camera calibration step.
As highlighted before, the difference with self-calibration techniques as
explained in this tutorial is that in the latter the same images used for
3D scene reconstruction are also used for camera calibration.
Traditional internal calibration procedures [1, 17, 18] extract the
camera parameters from a set of known 3D–2D correspondences, i.e., a
set of 3D coordinates with corresponding 2D coordinates in the image
for their projections. In order to easily obtain such 3D–2D correspon-
dences, they employ special calibration objects, like the ones displayed
in Figure 2.6. These objects contain easily recognizable markers. Inter-
nal calibration sometimes starts by fitting a linearized calibration model
to the 3D–2D correspondences, which is then improved with a subse-
quent non-linear optimization step.
Some applications do not suffer from unknown scales or effects of
projection, but their results would deteriorate under the influence of
non-linear effects like radial distortion. It is possible to only undo such
distortion without having to go through the entire process of internal
calibration.

Fig. 2.6 Calibration objects: 3D (left) and 2D (right).

One way is by detecting structures that are known to be
straight, but appear curved under the influence of the radial distortion.
Then, for each line (e.g., the curved roof top in the example of Figure 2.5
(left)), points are sampled along it and a straight line is fitted through
these data. If we want the distortion to vanish, the error consisting
of the sum of the distances of each point to their line should be zero.
Hence a non-linear minimization algorithm like Levenberg–Marquardt
is applied to these data. The algorithm is initialized with the distortion
parameters set to zero. At every iteration, new values are computed for
these parameters, the points are warped accordingly and new lines are
fitted. The algorithm stops when it converges to a solution where all
selected lines are straight (i.e., the resulting error is close to zero). The
resulting unwarped image can be seen on the right in Figure 2.5.
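The line-straightening procedure just described can be sketched as a small
non-linear least-squares problem. The Python code below is only an illustration
under simplifying assumptions: the correcting warp is parameterized directly in
pixel coordinates around an assumed center of distortion, the straight line per
point set is refitted at every evaluation, and SciPy's Levenberg-Marquardt
solver plays the role of the minimization algorithm mentioned above. All
function names are my own.

    import numpy as np
    from scipy.optimize import least_squares

    def warp_points(pts, kappa, center):
        """Apply a polynomial radial warp around 'center' to pixel points."""
        d = pts - center
        r2 = np.sum(d**2, axis=1, keepdims=True)
        factor = 1.0 + kappa[0]*r2 + kappa[1]*r2**2 + kappa[2]*r2**3
        return center + d * factor

    def line_residuals(kappa, line_point_sets, center):
        """For each set of points sampled along a 'should-be-straight' line,
        warp the points and return their distances to the best-fit line."""
        res = []
        for pts in line_point_sets:
            w = warp_points(np.asarray(pts, float), kappa, center)
            w = w - w.mean(axis=0)
            _, _, vt = np.linalg.svd(w, full_matrices=False)
            normal = vt[-1]                 # normal of the fitted line
            res.extend(w @ normal)          # signed point-to-line distances
        return np.asarray(res)

    # Usage sketch: start from zero distortion and let Levenberg-Marquardt refine.
    # line_point_sets = [np.array([...]), ...]   # points sampled along curved edges
    # center = np.array([width/2.0, height/2.0]) # assumed center of distortion
    # sol = least_squares(line_residuals, x0=np.zeros(3),
    #                     args=(line_point_sets, center), method='lm')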
2.3 The 3D Reconstruction Problem
The aim of passive 3D reconstruction is to recover the geometric structure
of a (static) scene from one or more of its images: Given a point m
in an image, determine the point M in the scene of which m is the projection.
Or, in mathematical parlance, given the pixel coordinates (x, y) of
a point m in a digital image, determine the world coordinates (X, Y, Z)
of the scene point M of which m is the projection in the image.

Fig. 2.7 3D reconstruction from one image is an underdetermined problem: a point m in the
image can be the projection of any world point M along the projecting ray of m.

As can be observed from the schematic representation of Figure 2.7, a point m in
the image can be the projection of any point M in the world space that
lies on the line through the center of projection and the image point m.
This line is called the projecting ray or the line of sight of the image
point m in the given camera. Thus, 3D reconstruction from one image
is an underdetermined problem.
On the other hand, if two images of the scene are available, the
position of a scene point M can be recovered from its projections m1 and
m2 in the images by triangulation: M is the point of intersection of the
projecting rays of m1 and m2, as depicted in Figure 2.8. This stereo setup
and the corresponding principle of triangulation have already been
introduced in Section 1. As already noted there, the 3D reconstruction
problem has not yet been solved, unless the internal and external
parameters of the cameras are known. Indeed, if we assume that the
images are corrected for radial distortion and other non-linear effects
and that the general linear pinhole camera model is applicable, then,
according to formula (2.6) in Section 2.2.4, the projection equations of
the first camera are modeled as:

$$\rho_1 \mathbf{m}_1 = K_1 R_1^T (\mathbf{M} - \mathbf{C}_1), \qquad (2.10)$$

where M = (X, Y, Z)^T are the coordinates of the scene point M with
respect to the world frame, m1 = (x1, y1, 1)^T are the extended pixel
coordinates of its projection m1 in the first image, K1 is the calibration
matrix of the first camera, C1 is the position and R1 is the orientation
of the first camera with respect to the world frame, and ρ1 is a positive
real number representing the projective depth of M with respect to the
first camera.

Fig. 2.8 Given two images of a static scene, the location of the scene point M can be recovered
from its projections m1 and m2 in the respective images by means of triangulation.

To find the projecting ray of an image point m1 in the
first camera, and therefore all points projecting onto m1 there, recall
from Section 2.2.3 that the calibration matrix K1 converts world measurements
(expressed in meters, centimeters, millimeters, etc.) into the
pixel metric of the digital image. Since m1 = (x1, y1, 1)^T are the extended
pixel coordinates of the point m1 in the first image, the direction of
the projecting ray of m1 in the camera-centered reference frame of the
first camera is given by the three-vector K1^{-1} m1. With respect to the
world frame, the direction vector of the projecting ray is R1 K1^{-1} m1, by
definition of R1. As the position of the first camera in the world frame
is given by the point C1, the parameter equations of the projecting ray
of m1 in the world frame are:

$$\mathbf{M} = \mathbf{C}_1 + \rho_1 R_1 K_1^{-1} \mathbf{m}_1 \quad \text{for some } \rho_1 \in \mathbb{R}. \qquad (2.11)$$

So, every scene point M satisfying equation (2.11) for some real number
ρ1 projects onto the point m1 in the first image. Notice that
expression (2.11) can be found directly by solving the projection equations
(2.10) for M. Clearly, the parameter equations (2.11) of the projecting
ray of a point m1 in the first image are only fully known provided
the calibration matrix K1 and the position C1 and orientation R1 of
the camera with respect to the world frame are known (i.e., when the
first camera is fully calibrated).
Similarly, the projection equations for the second camera are:

$$\rho_2 \mathbf{m}_2 = K_2 R_2^T (\mathbf{M} - \mathbf{C}_2), \qquad (2.12)$$

where m2 = (x2, y2, 1)^T are the extended pixel coordinates of M's projection
m2 in the second image, K2 is the calibration matrix of the second
camera, C2 is the position and R2 the orientation of the second camera
with respect to the world frame, and ρ2 is a positive real number representing
the projective depth of M with respect to the second camera.
Solving equation (2.12) for M yields:

$$\mathbf{M} = \mathbf{C}_2 + \rho_2 R_2 K_2^{-1} \mathbf{m}_2; \qquad (2.13)$$

and, if in this equation ρ2 is seen as a parameter, then formula (2.13)
gives just the parameter equations of the projecting ray of the image
point m2 in the second camera. Again, these parameter equations are
fully known only if K2, C2, and R2 are known (i.e., when the second
camera is fully calibrated). The system of six equations (2.10) and
(2.12) can be solved for the five unknowns X, Y, Z, ρ1, and ρ2. Observe
that this requires the system to be rank-deficient, which is guaranteed
if the points m1 and m2 are in correspondence (i.e., their projecting rays
intersect) and therefore special relations (in particular, the so-called
epipolar relations, which will be derived in the next section) hold.
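In practice the two projecting rays (2.11) and (2.13) will not intersect exactly
because of noise, so a least-squares solution is commonly used. The following
Python/NumPy sketch (function name chosen for illustration) solves for ρ1 and ρ2
from the ray equations of two fully calibrated cameras and returns the midpoint
of the closest points on the two rays.

    import numpy as np

    def triangulate(m1, K1, R1, C1, m2, K2, R2, C2):
        """Recover M from pixel points m1 = (x1, y1) and m2 = (x2, y2),
        following equations (2.11) and (2.13). A minimal least-squares sketch."""
        d1 = R1 @ np.linalg.inv(K1) @ np.array([m1[0], m1[1], 1.0])  # ray direction 1
        d2 = R2 @ np.linalg.inv(K2) @ np.array([m2[0], m2[1], 1.0])  # ray direction 2
        # Solve C1 + rho1*d1 = C2 + rho2*d2 in the least-squares sense.
        A = np.stack([d1, -d2], axis=1)          # 3x2 system matrix
        b = C2 - C1
        sol, *_ = np.linalg.lstsq(A, b, rcond=None)
        rho1, rho2 = sol
        # Midpoint of the closest points on the two rays.
        return 0.5 * ((C1 + rho1 * d1) + (C2 + rho2 * d2))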
When the cameras are not internally and externally calibrated, then
it is not immediately clear how to perform triangulation from the image
data alone. On the other hand, one intuitively feels that every image
of a static scene constrains in one way or another the shape and the
relative positioning of the objects in the world, even if no information
about the camera parameters is known. The key to the solution of the
3D reconstruction problem is found in understanding how the locations
m1 and m2 of the projections of a scene point M in different views are
related to each other. This relationship is explored in the next section.
2.4 The Epipolar Relation Between Two
Images of a Static Scene
2.4.1 The Fundamental Matrix
A point m1 in a first image of the scene is the projection of a scene
point M that can be at any position along the projecting ray of m1 in that
first camera. Therefore, the corresponding point m2 (i.e., the projection
of M) in a second image of the scene must lie on the projection ℓ2
of this projecting ray in the second image, as depicted in Figure 2.9.
To derive the equation of this projection ℓ2, suppose for a moment
that the internal and external parameters of both cameras are known.
Then, the projecting ray of the point m1 in the first camera is given by
formula (2.11), viz. M = C1 + ρ1 R1 K1^{-1} m1. Substituting the right-hand
side in the projection equations (2.12) of the second camera yields

$$\rho_2 \mathbf{m}_2 = \rho_1 K_2 R_2^T R_1 K_1^{-1} \mathbf{m}_1 + K_2 R_2^T (\mathbf{C}_1 - \mathbf{C}_2). \qquad (2.14)$$

Fig. 2.9 The point m2 in the second image corresponding to a point m1 in the first image
lies on the epipolar line ℓ2 which is the projection in the second image of the projecting ray
of m1 in the first camera.

The last term in this equation corresponds to the projection e2 of the
position C1 of the first camera in the second image:

$$\rho_{e_2} \mathbf{e}_2 = K_2 R_2^T (\mathbf{C}_1 - \mathbf{C}_2). \qquad (2.15)$$
e2 is called the epipole of the first camera in the second image. The
first term in the right-hand side of equation (2.14), on the other hand,
indicates the direction of the projecting ray (2.11) in the second image.
Indeed, recall from Section 2.3 that R1 K1^{-1} m1 is the direction vector
of the projecting ray of m1 with respect to the world frame. In the
camera-centered reference frame of the second camera, the coordinates
of this vector are R2^T R1 K1^{-1} m1. The point in the second image that
corresponds to this viewing direction is then given by K2 R2^T R1 K1^{-1} m1.
Put differently, K2 R2^T R1 K1^{-1} m1 are homogeneous coordinates for the
vanishing point of the projecting ray (2.11) in the second image, as can
be seen from Figure 2.10.
To simplify the notation, put A = K2 R2^T R1 K1^{-1}. Then A is an
invertible 3 × 3-matrix which, for every point m1 in the first image, gives
homogeneous coordinates A m1 for the vanishing point in the second
view of the projecting ray of m1 in the first camera. In the literature this
matrix is referred to as the infinite homography, because it corresponds
to the 2D projective transformation induced between the images by
the plane at infinity of the scene. More about this interpretation of the
matrix A can be found in Section 2.4.3.
Fig. 2.10 The epipole e2 of the first camera in the second image indicates the position in
the second image where the center of projection C1 of the first camera is observed. The
point A m1 in the second image is the vanishing point of the projecting ray of m1 in the
second image.

Formula (2.14) can now be rewritten as:

$$\rho_2 \mathbf{m}_2 = \rho_1 A \mathbf{m}_1 + \rho_{e_2} \mathbf{e}_2. \qquad (2.16)$$
Formula (2.16) algebraically expresses the geometrical observation that,
for a given point m1 in one image, the corresponding point m2 in another
image of the scene lies on the line ℓ2 through the epipole e2 and the
vanishing point A m1 of the projecting ray of m1 in the first camera
(cf. Figure 2.10). The line ℓ2 is called the epipolar line in the
second image corresponding to m1, and formula (2.16) is referred to as the
epipolar relation between corresponding image points. ℓ2 is the sought
projection in the second image of the entire projecting ray of m1 in the
first camera. It is important to realize that the epipolar line ℓ2 in the
second image depends on the point m1 in the first image. Put differently,
selecting another point m1 in the first image generically will result in
another epipolar line ℓ2 in the second view. However, as is seen from
formula (2.16), these epipolar lines all run through the epipole e2. This
was to be expected, of course, as all the projecting rays of the first
camera originate from the center of projection C1 of the first camera.
Hence, their projections in the second image, which, by definition,
are epipolar lines in the second view, must also all run through
the projection of C1 in the second image, which is just the epipole e2.
Figure 2.11 illustrates this observation graphically.

Fig. 2.11 All projecting rays of the first camera originate from its center of projection. Their
projections into the image plane of the second camera are therefore all seen to intersect in
the projection of this center of projection, i.e., at the epipole.
In the literature the epipolar relation (2.16) is usually expressed
in closed form. To this end, we fix some notations: for a 3-vector a =
(a1, a2, a3)^T ∈ R³, let [a]_× denote the skew-symmetric 3 × 3-matrix

$$[\mathbf{a}]_\times = \begin{pmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{pmatrix}, \qquad (2.17)$$

which represents the cross product with a; i.e., [a]_× v = a × v for all
3-vectors v ∈ R³. Observe that [a]_× has rank 2 if a is non-zero. The
epipolar relation states that, for a point m1 in the first image, its corresponding
point m2 in the second image must lie on the line through
the epipole e2 and the vanishing point A m1. Algebraically, this is
expressed by demanding that the 3-vectors m2, e2, and A m1 representing
homogeneous coordinates of the corresponding image points are
linearly dependent (cf. formula (2.16)). Recall from linear algebra that
this is equivalent to |m2 e2 Am1| = 0, where the vertical bars denote
the determinant of the 3 × 3-matrix whose columns are the specified
column vectors. Moreover, by definition of the cross product, this determinant
equals

$$|\mathbf{m}_2 \;\; \mathbf{e}_2 \;\; A\mathbf{m}_1| = \mathbf{m}_2^T (\mathbf{e}_2 \times A\mathbf{m}_1).$$

Expressing the cross product as a matrix multiplication then yields

$$|\mathbf{m}_2 \;\; \mathbf{e}_2 \;\; A\mathbf{m}_1| = \mathbf{m}_2^T [\mathbf{e}_2]_\times A\mathbf{m}_1.$$

Hence, the epipolar relation (2.16) is equivalently expressed by the
equation

$$\mathbf{m}_2^T F\,\mathbf{m}_1 = 0, \qquad (2.18)$$

where F = [e2]_× A is a 3 × 3-matrix, called the fundamental matrix of
the image pair, and with e2 the epipole in the second image and A the
invertible 3 × 3-matrix defined above [2, 4]. Note that, since [a]_× is a
rank 2 matrix, the fundamental matrix F also has rank 2.
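When the camera parameters are known, the fundamental matrix can be assembled
directly from its definition F = [e2]_× A, which is useful for checking the
epipolar relation (2.18) on synthetic data. The Python/NumPy sketch below uses
illustrative function names of my own choosing.

    import numpy as np

    def skew(a):
        """The skew-symmetric matrix [a]_x of equation (2.17)."""
        return np.array([[0.0, -a[2], a[1]],
                         [a[2], 0.0, -a[0]],
                         [-a[1], a[0], 0.0]])

    def fundamental_from_cameras(K1, R1, C1, K2, R2, C2):
        """Build F = [e2]_x A from known internal and external parameters."""
        A = K2 @ R2.T @ R1 @ np.linalg.inv(K1)    # the infinite homography
        e2 = K2 @ R2.T @ (C1 - C2)                # epipole, equation (2.15), up to scale
        return skew(e2) @ A

    # For corresponding points m1, m2 (extended coordinates), m2 @ F @ m1 should
    # then be (numerically close to) zero, in line with equation (2.18).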
2.4.2 Gymnastics with F
The closed form (2.18) of the epipolar relation has the following
advantages:

(1) The fundamental matrix F can, up to a non-zero scalar factor,
be computed from the image data alone.
Indeed, for each pair of corresponding points m1 and m2 in the
images, relation (2.18) yields one homogeneous linear equation
in the entries of the fundamental matrix F. Knowing
(at least) eight pairs of corresponding points between the
two images, the fundamental matrix F can, up to a non-zero
scalar factor, be computed from these point correspondences
in a linear manner. Moreover, by also exploiting the
fact that F has rank 2, the fundamental matrix F can even
be computed, up to a non-zero scalar factor, from seven point
correspondences between the images, albeit by a non-linear
algorithm, as the rank 2 condition involves a relation between
products of three entries of F. Different methods for efficient
and robust computation of F will be explained in Subsection 4.2.2
of Section 4 in Part 2 of this tutorial.
(2) Given F, the epipole e2 in the second image is the unique
3-vector with third coordinate equal to 1 satisfying F^T e2 = 0.
This observation follows immediately from the fact that F =
[e2]_× A and that [e2]_×^T e2 = −[e2]_× e2 = −e2 × e2 = 0.

(3) Similarly, the epipole e1 of the second camera in the first
image (i.e., the projection e1 of the position C2 of the second
camera in the first image) is the unique 3-vector with
third coordinate equal to 1 satisfying F e1 = 0.
According to formula (2.6) in Section 2.2.4, the projection
e1 of the position C2 of the second camera in
the first image is given by ρ_{e1} e1 = K1 R1^T (C2 − C1), with
ρ_{e1} a non-zero scalar factor. Since A = K2 R2^T R1 K1^{-1},
ρ_{e1} A e1 = K2 R2^T (C2 − C1) = −ρ_{e2} e2, and thus ρ_{e1} F e1 =
[e2]_× (ρ_{e1} A e1) = −[e2]_× (ρ_{e2} e2) = 0. Notice that this also
shows that the infinite homography A maps the epipole e1
in the first image onto the epipole e2 in the second image.

(4) Given a point m1 in the first image, the 3-vector F m1 yields
homogeneous coordinates for the epipolar line ℓ2 in the second
image corresponding to m1; i.e., ℓ2 ≃ F m1.
Recall that the epipolar relation m2^T F m1 = 0 expresses the
geometrical observation that the point m2 in the second
image, which corresponds to m1, lies on the line ℓ2 through
the epipole e2 and the point A m1, which by definition is the
epipolar line in the second image corresponding to m1. This
proves the claim.

(5) Similarly, given a point m2 in the second image, the 3-vector
F^T m2 yields homogeneous coordinates for the epipolar line ℓ1
in the first image corresponding to m2; i.e., ℓ1 ≃ F^T m2.
By interchanging the role of the two images in the reasoning
leading up to the epipolar relation derived above, one easily
sees that the epipolar line ℓ1 in the first image corresponding
to a point m2 in the second image is the line through the
epipole e1 in the first image and the vanishing point A^{-1} m2
in the first image of the projecting ray of m2 in the second
camera. The corresponding epipolar relation

$$|\mathbf{m}_1 \;\; \mathbf{e}_1 \;\; A^{-1}\mathbf{m}_2| = 0 \qquad (2.19)$$

expresses that m1 lies on that line. As A is an invertible
matrix, its determinant |A| is a non-zero scalar. Multiplying
the left-hand side of equality (2.19) with |A| yields

$$|A|\,|\mathbf{m}_1 \;\; \mathbf{e}_1 \;\; A^{-1}\mathbf{m}_2| = |A\mathbf{m}_1 \;\; A\mathbf{e}_1 \;\; \mathbf{m}_2| = \left|A\mathbf{m}_1 \;\; \left(-\tfrac{\rho_{e_2}}{\rho_{e_1}}\mathbf{e}_2\right) \;\; \mathbf{m}_2\right| = \tfrac{\rho_{e_2}}{\rho_{e_1}}\,|\mathbf{m}_2 \;\; \mathbf{e}_2 \;\; A\mathbf{m}_1| = \tfrac{\rho_{e_2}}{\rho_{e_1}}\,\mathbf{m}_2^T F\,\mathbf{m}_1,$$

because ρ_{e1} A e1 = −ρ_{e2} e2, as seen in item (3) above, and
|m2 e2 Am1| = m2^T (e2 × Am1) = m2^T F m1, by definition of the
fundamental matrix F (cf. formula (2.18)). Consequently,
the epipolar relation (2.19) is equivalent to m2^T F m1 = 0, and
the epipolar line ℓ1 in the first image corresponding to a
given point m2 in the second image has homogeneous coordinates
F^T m2. We could have concluded this directly from
formula (2.18) based on symmetry considerations as well.
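Items (2) and (3) are easy to put into practice: the epipoles are the null vectors
of F^T and F, which can be obtained from a singular value decomposition. The
Python/NumPy sketch below (function name chosen for illustration) does exactly
that and normalizes the third coordinate to 1.

    import numpy as np

    def epipoles_from_F(F):
        """Recover e1 and e2 from a (rank-2) fundamental matrix F:
        F e1 = 0 and F^T e2 = 0, cf. items (2) and (3) above."""
        _, _, vt = np.linalg.svd(F)
        e1 = vt[-1]; e1 = e1 / e1[2]          # right null vector of F
        _, _, vt = np.linalg.svd(F.T)
        e2 = vt[-1]; e2 = e2 / e2[2]          # right null vector of F^T
        return e1, e2

    # Epipolar lines follow items (4) and (5): for m1 = (x1, y1, 1) in the first
    # image, l2 = F @ m1 are homogeneous coordinates of its epipolar line in the
    # second image, and l1 = F.T @ m2 for a point m2 in the second image.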
2.4.3 Grasping the Infinite Homography
Before continuing our investigation of how to recover 3D information
about the scene from images alone, it is worth having a closer look at
the invertible matrix A introduced in Section 2.4.1 first. The matrix A
is defined algebraically as A = K2 R2^T R1 K1^{-1}, but it also has a clear
geometrical interpretation: the matrix A transfers vanishing points of
directions in the scene from the first image to the second one. Indeed,
consider a line L in the scene with direction vector V ∈ R³. The vanishing
point v1 of its projection ℓ1 in the first image is the point of
intersection of the line through the center of projection C1 and parallel
to L with the image plane of the first camera, as depicted in Figure 2.12.
Parameter equations of this line are M = C1 + τ V with τ a
scalar parameter, and the projection of every point M on this line in
the first image is given by K1 R1^T (M − C1) = τ K1 R1^T V. The vanishing
point v1 of the line ℓ1 in the first image thus satisfies the equation
ρ_{v1} v1 = K1 R1^T V for some non-zero scalar ρ_{v1}. Similarly, the vanishing
point v2 of the projection ℓ2 of the line L in the second image is given by
ρ_{v2} v2 = K2 R2^T V for some non-zero scalar ρ_{v2}. Conversely, given a vanishing
point v1 in the first image, the corresponding direction vector V
in the scene is V = ρ_{v1} R1 K1^{-1} v1 for some scalar ρ_{v1}, and its vanishing
point in the second image is given by ρ_{v2} v2 = ρ_{v1} K2 R2^T R1 K1^{-1} v1. As
A = K2 R2^T R1 K1^{-1}, this relation between the vanishing points v1 and
v2 can be simplified to ρ v2 = A v1, where ρ = ρ_{v2}/ρ_{v1} is a non-zero scalar
factor. Hence, if v1 is the vanishing point of a line in the first image,
then A v1 are homogeneous coordinates of the vanishing point of the
corresponding line in the second image, as was claimed. In particular,
the observation made in Section 2.4.1 that for any point m1 in the first
image A m1 are homogeneous coordinates for the vanishing point in the
second view of the projecting ray of m1 in the first camera is in fact
another instance of the same general property.

Fig. 2.12 The vanishing point v1 in the first image of the projection ℓ1 of a line L in the
scene is the point of intersection of the line through the center of projection C1 and parallel
to L with the image plane.
As is explained in more detail in Appendix A, in projective geometry,
direction vectors V in the scene are represented as points on the
plane at infinity of the scene. A vanishing point in an image then is
just the perspective projection onto the image plane of a point on the
plane at infinity in the scene. In this respect, the matrix A is a homography
matrix of the projective transformation that maps points from
the first image via the plane at infinity of the scene into the second
image. This explains why A is called the infinite homography in the
computer vision literature.
2.5 Two Image-Based 3D Reconstruction Up-Close
From an algebraic point of view, triangulation can be interpreted as
solving the two camera projection equations for the scene point M. Formulated
in this way, passive 3D reconstruction is seen as solving the
following problem: Given two images I1 and I2 of a static scene and
a set of corresponding image points m1 ∈ I1 and m2 ∈ I2 between these
images, determine a calibration matrix K1, a position C1 and an orientation
R1 for the first camera, and a calibration matrix K2, a position
C2 and an orientation R2 for the second camera, and for every pair of
corresponding image points m1 ∈ I1 and m2 ∈ I2 compute world coordinates
(X, Y, Z) of a scene point M such that:

$$\rho_1 \mathbf{m}_1 = K_1 R_1^T (\mathbf{M} - \mathbf{C}_1) \quad \text{and} \quad \rho_2 \mathbf{m}_2 = K_2 R_2^T (\mathbf{M} - \mathbf{C}_2). \qquad (2.20)$$

These equations are the point of departure for our further analysis. In
traditional stereo one would know K1, R1, C1, K2, R2, and C2. Then
formula (2.20) yields a system of six linear equations in five unknowns
from which the coordinates of M as well as the scalar factors ρ1 and ρ2
can be computed, as was explained in Section 2.3.
Here, however, we are interested in the question which information
can be salvaged in cases where our knowledge about the camera configuration
is incomplete. In the following sections, we will gradually
assume less and less information about the camera parameters to be
known. For each case, we will examine the damage to the precision with
which we can still reconstruct the 3D structure of the scene. As will
be seen, depending on what is still known about the camera setup, the
geometric uncertainty about the 3D reconstruction can range from a
Euclidean motion up to a 3D projectivity.
2.5.1 Euclidean 3D Reconstruction
Let us first assume that we do not know about the camera positions
and orientations relative to the world coordinate frame, but that we
only know the position and orientation of the second camera relative
to (the camera-centered reference frame of) the first one. We will also
assume that both cameras are internally calibrated so that the matrices
K1 and K2 are known as well. This case is relevant when, e.g., using a
hand-held stereo rig and taking a single stereo image pair.
A moment's reflection shows that it is not possible to determine
the world coordinates of M from the images in this case. Indeed, changing
the position and orientation of the world frame does not alter the
setup of the cameras in the scene, and consequently, does not alter the
images. Thus it is impossible to recover absolute information about
the cameras' external parameters in the real world from the projection
equations (2.20) alone, beyond what we already know (relative camera
pose). Put differently, one cannot hope for more than to recover the
3D structure of the scene up to a 3D Euclidean transformation of the
scene from the projection equations (2.20) alone.
On the other hand, the factor R1^T (M − C1) in the right-hand side of
the first projection equation in formula (2.20) is just a Euclidean transformation
of the scene. Thus, without loss of generality, we may replace
this factor by M′, because we just lost all hope of retrieving the 3D coordinates
of M more precisely than up to some unknown 3D Euclidean
transformation anyway. The first projection equation then simplifies to
ρ1 m1 = K1 M′. Solving M′ = R1^T (M − C1) for M gives M = R1 M′ + C1, and
substituting this into the second projection equation in (2.20) yields
ρ2 m2 = K2 R2^T R1 M′ + K2 R2^T (C1 − C2). Together,

$$\rho_1 \mathbf{m}_1 = K_1 \mathbf{M}' \quad \text{and} \quad \rho_2 \mathbf{m}_2 = K_2 R_2^T R_1 \mathbf{M}' + K_2 R_2^T (\mathbf{C}_1 - \mathbf{C}_2) \qquad (2.21)$$

constitute a system of equations which allows to recover the 3D structure
of the scene up to the 3D Euclidean transformation M′ = R1^T (M − C1).
Indeed, as the cameras are assumed to be internally calibrated, K1 and
K2 are known and the first equation in (2.21) can be solved for M′, viz.
M′ = ρ1 K1^{-1} m1. Plugging this new expression for M′ into the right-hand
side of the second equation in formula (2.21), one gets

$$\rho_2 \mathbf{m}_2 = \rho_1 K_2 R_2^T R_1 K_1^{-1} \mathbf{m}_1 + K_2 R_2^T (\mathbf{C}_1 - \mathbf{C}_2). \qquad (2.22)$$

Notice that this actually brings us back to the epipolar relation (2.16)
which was derived in Section 2.4.1. In this equation R2^T R1 and
R2^T (C1 − C2) represent, respectively, the relative orientation and the
relative position of the first camera with respect to the second one.
As these are assumed to be known too, equality (2.22) yields a system
of three linear equations from which the two unknown scalar factors
ρ1 and ρ2 can be computed. And, when ρ1 is found, the Euclidean
transformation M′ of the scene point M is found as well.
In summary, if the cameras are internally calibrated and the relative
position and orientation of the cameras are known, then for each
pair of corresponding points m1 and m2 in the images the Euclidean
transformation M′ of the underlying scene point M can be recovered from
the relations (2.21). Formula (2.21) is therefore referred to as a system
of Euclidean reconstruction equations for the scene, and the 3D points
M′ satisfying these equations constitute a Euclidean reconstruction of
the scene. As a matter of fact, better than Euclidean reconstruction is
often not needed, as the 3D shape of the objects in the scene is perfectly
retrieved, only not their position relative to the world coordinate
frame, which is completely irrelevant in many applications.
Notice how we have absorbed the unknown parameters into the new
coordinates M′. This is a strategy that will be used repeatedly in the next
sections. It is interesting to observe that M′ = R1^T (M − C1) are, in fact,
the coordinates of the scene point M with respect to the camera-centered
reference frame of the first camera, as was calculated in Section 2.2.4.
The Euclidean reconstruction equations (2.21) thus coincide with the
system of projection equations (2.20) if the world frame is the camera-centered
reference frame of the first camera (i.e., C1 = 0 and R1 = I3),
and the rotation matrix R2 then expresses the relative orientation of
the second camera with respect to the first one.
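To make the role of equation (2.22) concrete, the following Python/NumPy sketch
solves it for ρ1 and ρ2 in the least-squares sense and then recovers the
Euclidean-transformed point M′ of this section. The function name and the
shorthand arguments R_rel = R2^T R1 and t_rel = R2^T (C1 − C2) are my own
notational choices, not the text's.

    import numpy as np

    def euclidean_reconstruction(m1, m2, K1, K2, R_rel, t_rel):
        """Recover M' = R1^T (M - C1) from a correspondence (m1, m2) with
        internally calibrated cameras and known relative pose, cf. (2.21)-(2.22)."""
        m1h = np.array([m1[0], m1[1], 1.0])
        m2h = np.array([m2[0], m2[1], 1.0])
        # Equation (2.22): rho2 * m2 = rho1 * K2 R_rel K1^{-1} m1 + K2 t_rel
        a = K2 @ R_rel @ np.linalg.inv(K1) @ m1h
        b = K2 @ t_rel
        B = np.stack([-a, m2h], axis=1)             # 3x2 system in (rho1, rho2)
        sol, *_ = np.linalg.lstsq(B, b, rcond=None)
        rho1, rho2 = sol
        return rho1 * np.linalg.inv(K1) @ m1h       # M' = rho1 * K1^{-1} m1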
2.5.2 Metric 3D Reconstruction
Next consider a stereo setup as that of the previous section, but suppose
that we do not know the distance between the centers of projection
C1 and C2 any more. We do, however, still know the relative
orientation of the cameras and the direction along which the second
camera is shifted with respect to the first one. This means that
R2^T (C1 − C2) is only known up to a non-zero scalar factor. As the
cameras still are internally calibrated, the calibration matrices K1 and
K2 are known and it follows that the last term in the second of the
Euclidean reconstruction equations (2.21), viz. K2 R2^T (C1 − C2), can
only be determined up to an unknown scalar factor. According to formula
(2.15), K2 R2^T (C1 − C2) = ρ_{e2} e2 yields the epipole e2 of the first
camera in the second image. Knowledge of the direction in which the
second camera is shifted with respect to the first one, combined with
knowledge of K2, clearly also allows to determine e2. It is interesting
to notice, however, that, if a sufficient number of corresponding points
can be found between the images, then the fundamental matrix of the
image pair can be computed, up to a non-zero scalar factor, and the
epipole e2 can also be recovered that way (as explained in item (2) of
Section 2.4.2 and, in more detail, in Section 4 in Part 2 of this tutorial).
Having a sufficient number of point correspondences would thus eliminate
the need to know about the direction of camera shift beforehand.
On the other hand, the assumption that the inter-camera distance is
not known implies that the scalar factor ρ_{e2} in the last term of the
Euclidean reconstruction equations

$$\rho_1 \mathbf{m}_1 = K_1 \mathbf{M}' \quad \text{and} \quad \rho_2 \mathbf{m}_2 = K_2 R_2^T R_1 \mathbf{M}' + \rho_{e_2} \mathbf{e}_2 \qquad (2.23)$$

is unknown. This should not come as a surprise, since ρ_{e2} is the
projective depth of C1 in the second camera, and thus it is directly
related to the inter-camera distance. Algebraically, the six homogeneous
equations (2.23) do not suffice to solve for the six unknowns
M′, ρ1, ρ2, and ρ_{e2}. One only gets a solution up to an unknown scale.
But, using the absorption trick again, we introduce the new coordinates
M̄ = (1/ρ_{e2}) M′ = (1/ρ_{e2}) R1^T (M − C1), which is a 3D similarity transformation of
the original scene. Formula (2.23) then reduces to

$$\bar{\rho}_1 \mathbf{m}_1 = K_1 \bar{\mathbf{M}} \quad \text{and} \quad \bar{\rho}_2 \mathbf{m}_2 = K_2 R_2^T R_1 \bar{\mathbf{M}} + \mathbf{e}_2, \qquad (2.24)$$

where ρ̄1 = ρ1/ρ_{e2} and ρ̄2 = ρ2/ρ_{e2} are scalar factors expressing the projective
depth of the scene point underlying m1 and m2 in each camera relative to
the scale ρ_{e2} of the metric reconstruction of the scene. The coordinates
M̄ provide a 3D reconstruction of the scene point M up to an unknown
3D similarity, as expected.
We could also have seen this additional scaling issue coming intuitively.
If one were to scale a scene together with the cameras in it, then
this would have no impact on the images. In terms of the relative camera
positions, this would only change the distance between them, not
their relative orientations or the relative direction in which one camera
is displaced with respect to the other. The calibration matrices K1
and K2 would remain the same, since both the focal lengths and the
pixel sizes are supposed to be scaled by the same factor and the number
of pixels in the image is kept the same as well, so that the offsets
in the calibration matrices do not change. Again, as such changes are
not discernible in the images, having internally calibrated cameras and
external calibration only up to the exact distance between the cameras
leaves us with one unknown, but fixed, scale factor. Together with
the unknown Euclidean motion already present in the 3D reconstruction
derived in the previous section, this unknown scaling brings the
geometric uncertainty about the 3D scene up to an unknown 3D similarity
transformation. Such a reconstruction of the scene is commonly
referred to in the computer vision literature as a metric reconstruction
of the scene, and formula (2.24) is referred to as a system of metric
reconstruction equations. Although annoying, it should be noted that
fixing the overall unknown scale usually is the least of our worries in
practice, as indeed knowledge about a single distance or length in the
scene suffices to lift the uncertainty about scale.
2.5.3 Affine 3D Reconstruction
A further step toward our goal of 3D reconstruction from the image data
alone is to give up on knowledge of the internal camera parameters as
well. For the metric reconstruction equations (2.24) this implies that
the calibration matrices K1 and K2 are also unknown. As before, one
can perform a change of coordinates M̃ = K1 M̄ and replace M̄ in the
reconstruction equations (2.24) by M̄ = K1^{-1} M̃. This gives:

$$\tilde{\rho}_1 \mathbf{m}_1 = \tilde{\mathbf{M}} \quad \text{and} \quad \tilde{\rho}_2 \mathbf{m}_2 = A \tilde{\mathbf{M}} + \mathbf{e}_2, \qquad (2.25)$$

where ρ̃1 = ρ̄1, ρ̃2 = ρ̄2, and A = K2 R2^T R1 K1^{-1} is the infinite homography
introduced in Section 2.4.1. If the invertible matrix A is known,
then this system (2.25) can be solved for the scalars ρ̃1, ρ̃2, and, more
importantly, for M̃, as in the metric case (cf. Section 2.5.2). More on
how to extract A from image information only is to follow shortly. As
M̃ = K1 M̄ = (1/ρ_{e2}) K1 R1^T (M − C1) represents a 3D affine transformation of
the world space, formula (2.25) is referred to as a system of affine reconstruction
equations for the scene, and the 3D points M̃ satisfying these
equations constitute an affine reconstruction of the scene, i.e., a reconstruction
which is correct up to an unknown 3D affine transformation.
It suffices to know A and e2 in order to compute an affine reconstruction
of the scene. As explained in Section 2.4.2 and in more detail
in Section 4, e2 can be extracted from F, and F can be derived,
up to a non-zero scalar factor, from corresponding points between
the images. Since e2 is in the left nullspace of F, an unknown scalar
factor on F does not prevent the extraction of e2. Unfortunately,
determining A is not that easy in practice. A was defined as
A = K2 R2^T R1 K1^{-1}, where K1 and K2 are the calibration matrices and
R2^T R1 represents the relative orientation of the cameras. If this information
about the cameras is not available, then this formula cannot be
used to compute A. On the other hand, the fundamental matrix F of
the image pair has been defined in Section 2.4.1 as F = [e2]_× A. But,
unfortunately, the relation F = [e2]_× A does not define the matrix A
uniquely. Indeed, suppose A1 and A2 are 3 × 3-matrices such that
F = [e2]_× A1 and F = [e2]_× A2. Then [e2]_× (A1 − A2) = 0. As [e2]_× is
the skew-symmetric 3 × 3-matrix which represents the cross product
with the 3-vector e2, i.e., [e2]_× v = e2 × v for all v ∈ R³, the columns
of A1 and A2 can only differ by a scalar multiple of e2. In particular,
A1 = A2 + e2 a^T for some 3-vector a ∈ R³. Hence, the infinite homography
A cannot be recovered from point correspondences between the
images alone.
So, what other image information can then be used to determine A?
Recall from Section 2.4.3 that the matrix A transfers vanishing points
of directions in the scene from the first image to the second one. Each
pair of corresponding vanishing points in the images therefore yields
constraints on the infinite homography A. More precisely, if v1 and v2
are the vanishing points in, respectively, the first and the second image
of a particular direction in the scene, then ρ v2 = A v1 for some non-zero
scalar factor ρ. Since A is a 3 × 3-matrix and as each such constraint
brings three equations, but also adds one additional unknown ρ, at
least four such constraints are needed to determine the matrix A up
to a scalar factor ρ_A. This unknown factor does not form an obstacle
for the obtained matrix to be used in the affine reconstruction equations
(2.25), because by multiplying the first reconstruction equation
with the same factor and then absorbing it into M̃ in the right-hand side
of both equations, one still obtains an affine 3D reconstruction of the
scene.
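The constraint ρ v2 = A v1 can be turned into linear equations on the entries of
A by eliminating ρ with a cross product, v2 × (A v1) = 0, which gives two
independent equations per correspondence. The following Python/NumPy sketch
(a DLT-style estimator with an illustrative function name) solves for A, up to
a scalar factor, from four or more pairs of vanishing points given as
homogeneous 3-vectors.

    import numpy as np

    def infinite_homography_from_vanishing_points(v1s, v2s):
        """Estimate A (up to scale) from >= 4 corresponding vanishing points
        using the constraint v2 x (A v1) = 0."""
        rows = []
        for v1, v2 in zip(v1s, v2s):
            v1 = np.asarray(v1, float)
            x, y, w = np.asarray(v2, float)
            # two independent rows of the cross-product constraint
            rows.append(np.concatenate([np.zeros(3), -w * v1, y * v1]))
            rows.append(np.concatenate([w * v1, np.zeros(3), -x * v1]))
        _, _, vt = np.linalg.svd(np.asarray(rows))
        return vt[-1].reshape(3, 3)     # A, defined up to a non-zero scalar factor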
Identifying the vanishing points of four independent directions in an
image pair is rarely possible. More often, one has three dominant directions,
typically orthogonal to each other. This is the case for many built-up
environments. Fortunately, there is one direction which is always
available, namely that of the line passing through the positions C1 and C2
in the scene. The vanishing points of this line in the images are the
intersections of the line with each image plane. But these are just the
epipoles. So, the epipoles e1 and e2 in a pair of images are corresponding
vanishing points of the direction of the line through the camera
positions C1 and C2 in the scene, and therefore satisfy the relation

$$\rho_e \mathbf{e}_2 = A \mathbf{e}_1 \quad \text{for some } \rho_e \in \mathbb{R},$$

as we have already noted in Section 2.4.2, item (3). Consequently, if
the vanishing points of three independent directions in the scene can be
identified in the images, then the infinite homography A can be computed
up to a non-zero scalar factor; at least if none of the three directions
is that of the line connecting the centers of projection C1 and C2.
Notice that we have only absorbed K1 into the affine coordinates M̃
of the 3D reconstruction, but that K2 also can remain unknown, as it
is absorbed by A.
2.5.4 Projective 3D Reconstruction
Finally, we have arrived at the situation where we assume no knowledge
about the camera configuration or about the scene whatsoever. Instead,
we only assume that one can find point correspondences between the
images and extract the fundamental matrix of the image pair.
The main conclusion of the previous section is that, if no information
about the internal and external parameters of the cameras is available
and if insufficient vanishing points can be found and matched, then
the only factor that separates us from an affine 3D reconstruction of
the scene is the infinite homography A. Let us therefore investigate
whether some partial knowledge about A can still be retrieved from
general point correspondences between the images.
Recall from Section 2.4.1 that F = [e2]_× A and that the epipole e2
can uniquely be determined from the fundamental matrix F. It is not
difficult to verify that ([e2]_×)³ = −‖e2‖² [e2]_×, where ‖e2‖ denotes the
norm of the 3-vector e2. As F = [e2]_× A, it follows that

$$[\mathbf{e}_2]_\times \left( [\mathbf{e}_2]_\times F \right) = ([\mathbf{e}_2]_\times)^3 A = -\|\mathbf{e}_2\|^2 [\mathbf{e}_2]_\times A = -\|\mathbf{e}_2\|^2 F.$$

So, [e2]_× F is a 3 × 3-matrix which, when premultiplied with [e2]_×,
yields a non-zero scalar multiple of the fundamental matrix F. In
other words, up to a non-zero scalar factor, the 3 × 3-matrix [e2]_× F
could be a candidate for the unknown matrix A. Unfortunately, as
both the fundamental matrix F and [e2]_× have rank 2, the matrix
[e2]_× F is not invertible as A ought to be. But, recall from the previous
section that two matrices A1 and A2 satisfying F = [e2]_× A1
and F = [e2]_× A2 are related by A1 = A2 + e2 a^T for some 3-vector
a ∈ R³. This implies that the unknown matrix A must be of the form
A = (1/‖e2‖²) [e2]_× F + e2 a^T for some 3-vector a ∈ R³. It follows
that the invertible matrix A, needed for an affine reconstruction of
the scene, can only be recovered up to the three unknown components of
a. As we do not know them, the simplest thing to do is to put them to
zero or to make a random guess. This section analyzes what happens
to the reconstruction if we do just that.
The expression for A only takes on the particular form given above
in case F is obtained from camera calibration (i.e., from the rotation
and calibration matrices). In case F is to be computed from point
correspondences, as is the case here, it can only be determined up
to a non-zero scalar factor. Let F̂ be an estimate of the fundamental
matrix as obtained from point correspondences; then F = κ F̂ for some
non-zero scalar factor κ. Now define Â = (1/‖e2‖²) [e2]_× F̂. Then A =
κ Â + e2 a^T for some unknown 3-vector a ∈ R³. Notice that, as observed
before, the scalar factor κ between F and F̂ has no influence on the
pixel coordinates of e2, as derived from F̂ instead of F. Using Â for A
in the affine reconstruction equations (2.25), for corresponding image
points m1 and m2 we now solve the following system of linear equations:

$$\hat{\rho}_1 \mathbf{m}_1 = \hat{\mathbf{M}} \quad \text{and} \quad \hat{\rho}_2 \mathbf{m}_2 = \hat{A} \hat{\mathbf{M}} + \mathbf{e}_2, \qquad (2.26)$$

where ρ̂1 and ρ̂2 are non-zero scalar factors, and where the 3D points
M̂ constitute a 3D reconstruction of the scene which, as will be
demonstrated next, differs from the original scene by a (unique, but
unknown) 3D projective transformation. The set of 3D points M̂ is called
a projective 3D reconstruction of the scene and formula (2.26) is referred
to as a system of projective reconstruction equations. Figure 2.13 summarizes
the steps that have led up to these equations.
Projective 3D reconstruction from two uncalibrated images

Given: a set of point correspondences m1 ∈ I1 and m2 ∈ I2
between two uncalibrated images I1 and I2 of a static scene.

Objective: a projective 3D reconstruction M̂ of the scene.

Algorithm:
(1) Compute an estimate F̂ for the fundamental matrix (a).
(2) Compute the epipole e2 from F̂ (b).
(3) Compute the 3 × 3-matrix Â = (1/‖e2‖²) [e2]_× F̂.
(4) For each pair of corresponding image points m1 and m2,
    solve the following system of linear equations for M̂:
        ρ̂1 m1 = M̂ and ρ̂2 m2 = Â M̂ + e2
    (ρ̂1 and ρ̂2 are non-zero scalars).

(a) Cf. item (1) in Section 2.4.2. See also Subsections 4.2.2 and 4.2.3 in Section 4 in Part 2
of this tutorial.
(b) Cf. item (2) in Section 2.4.2. See also Subsection 4.3 in Section 4 in Part 2 of this
tutorial.

Fig. 2.13 A basic algorithm for projective 3D reconstruction from two uncalibrated images.
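The boxed algorithm maps directly onto a few lines of linear algebra. The
Python/NumPy sketch below (with a function name chosen here for illustration)
implements steps (2) to (4) for a single point correspondence, given an estimate
F̂ of the fundamental matrix; step (1), the estimation of F̂ itself, is deferred
to Part 2 of this tutorial.

    import numpy as np

    def projective_reconstruction(F_hat, m1, m2):
        """Return a projectively reconstructed point M_hat for the pixel
        correspondence (m1, m2), following Figure 2.13 (a minimal sketch)."""
        # Step (2): epipole e2 as the left null vector of F_hat, third coordinate 1.
        _, _, vt = np.linalg.svd(F_hat.T)
        e2 = vt[-1]; e2 = e2 / e2[2]
        # Step (3): A_hat = (1/||e2||^2) [e2]_x F_hat.
        e2x = np.array([[0.0, -e2[2], e2[1]],
                        [e2[2], 0.0, -e2[0]],
                        [-e2[1], e2[0], 0.0]])
        A_hat = e2x @ F_hat / np.dot(e2, e2)
        # Step (4): rho1*m1 = M_hat and rho2*m2 = A_hat M_hat + e2, i.e.
        # rho2*m2 - rho1*A_hat m1 = e2, a linear system in (rho1, rho2).
        m1h = np.array([m1[0], m1[1], 1.0])
        m2h = np.array([m2[0], m2[1], 1.0])
        B = np.stack([-A_hat @ m1h, m2h], axis=1)
        sol, *_ = np.linalg.lstsq(B, e2, rcond=None)
        rho1 = sol[0]
        return rho1 * m1h                           # M_hat = rho1 * m1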
In order to prove that the points M̂ obtained from equations (2.26)
do indeed constitute a projective 3D reconstruction of the scene,
we first express the equations (2.26) in terms of projection matrices
and extended coordinates (X̂, Ŷ, Ẑ, 1)^T for the 3D point M̂ =
(X̂, Ŷ, Ẑ)^T (cf. formula (2.7) in Section 2.2.4):

$$\hat{\rho}_1 \mathbf{m}_1 = (I_3 \mid \mathbf{0}) \begin{pmatrix} \hat{\mathbf{M}} \\ 1 \end{pmatrix} \quad \text{and} \quad \hat{\rho}_2 \mathbf{m}_2 = (\hat{A} \mid \mathbf{e}_2) \begin{pmatrix} \hat{\mathbf{M}} \\ 1 \end{pmatrix}. \qquad (2.27)$$
Similarly, the affine reconstruction equations (2.25) are written as:
˜ρ1m1=(I3|0)˜
M
1and ˜ρ2m2=(A|e2)˜
M
1,(2.28)
where ˜
M
1=(˜
X, ˜
Y,˜
Z,1)Tare the extended coordinates of the 3D point
˜
M=(˜
X, ˜
Y,˜
Z)T. Recall that the invertible matrix Ais of the form
A=κˆ
A+e2aTfor some non-zero scalar κand 3-vector aR3. The
last equality in (2.28) therefore is:
˜ρ2m2=(κˆ
A+e2aT|e2)˜
M
1=(ˆ
A|e2)κI30
aT1˜
M
1; (2.29)
and the first equality in (2.28) can be rewritten as:
κ˜ρ1m1=(I3|0)κI30
aT1˜
M
1.
Comparing these expressions to formula (2.27), it follows that
λˆ
M
1=κI30
aT1˜
M
1.(2.30)
for some non-zero scalar λRand that ˆρ1=(κ/λρ1and ˆρ2=
(1ρ2. Notice that one cannot simply go for a solution with λ=1,as
this implies that aT=0Tand therefore A=κˆ
A. This would mean that
we have been lucky enough to guess the correct infinite homography
matrix A(up to a scalar factor) right away. But this cannot be the case
according to our proposed choice : ˆ
Ahas rank 2 and thus is singular
whereas the correct matrix Ais invertible. Rather, eliminating λfrom
equation (2.30) gives:
ˆ
M=κ˜
M
aT˜
M+1
.
350 Principles of Passive 3D Reconstruction
Recall from Section 2.5.3 that ˜
M=1
ρe2K1RT
1(MC1). Substituting this
into the previous equation, one sees that
ˆ
M=κK1RT
1(MC1)
aTK1RT
1(MC1)+ρe2
is a projective transformation of the scene. Moreover, since ˜
M=K1¯
M
with ¯
M=1
ρe2RT
1(MC1) being the metric reconstruction of the scene
defined in Section 2.5.2, formula (2.30) can also be written as:
λˆ
M
1=κK10
aTK11¯
M
1; (2.31)
or, after elimination of λ,
ˆ
M=κK1¯
M
aTK1¯
M+1
,
which shows that ˆ
Malso is a projective transformation of ¯
M.
2.5.5 Taking Stock — Stratification
The goal put forward at the beginning of Section 2.5 was to recover
the three-dimensional geometrical structure of a scene from two images
of it, without necessarily having complete information about the inter-
nal and external parameters of the cameras. It was immediately seen
that one can only recover it up to a 3D Euclidean transformation if
only information about the relative camera poses is available for the
external camera parameters. If the precise inter-camera distance is also
unknown, 3D reconstruction is only possible up to a 3D similarity and
the 3D reconstruction is said to be metric. Moreover, if the calibra-
tion matrix of the first camera is unknown (and also of the second
camera for that matter), then at most an affine 3D reconstruction of
the scene is feasible. And, if the infinite homography Aintroduced in
Section 2.4.1 is unknown, then the scene can only be reconstructed up
to a 3D projective transformation. The latter case is very relevant, as
it applies to situations where the internal and external parameters of
the camera pair are unknown, but where point correspondences can be
found.
2.5 Two Image-Based 3D Reconstruction Up-Close 351
Projective
Affine
Metric
similarity
Original
Euclidean
Fig. 2.14 The aim of passive 3D reconstruction is to recover the 3D geometrical structure
of the scene from images. With a fully calibrated setup of two cameras a Euclidean recon-
struction can be achieved. The reconstruction degenerates to metric if the inter-camera
distance (sometimes referred to as baseline) is unknown. If the calibration matrices of the
cameras are unknown, then the scene structure can only be recovered up to a 3D affine
transformation; and, if no information about the camera setup is available, then only a
projective reconstruction is feasible.
Figure 2.14 illustrates these different situations. In the figure the
scene consists of a cube. In a metric reconstruction a cube is found, but
the actual size of the cube is undetermined. In an affine reconstruction
the original cube is reconstructed as a parallelepiped. Affine transfor-
mations preserve parallelism, but they do not preserve metric relations
such as angles and relative lengths. In a projective reconstruction the
scene appears as an (irregular) hexahedron, because projective trans-
formations only preserve incidence relations such as collinearity and
coplanarity, but parallelism or any metric information is not preserved.
Table 2.1 shows the mathematical expressions that constitute these geo-
metrical transformations. The mutual relations between the different
types of 3D reconstruction are often referred to as stratification of the
geometries. This term reflects that the transformations higher up in the
list are special types (subgroups) of the transformations lower down.
Obviously, the uncertainty about the reconstruction increases when
going down the list, as also corroborated by the number of degrees of
freedom in these transformations.
In the first instance, the conclusion of this section — namely that
without additional information about the (internal and external) cam-
era parameters the three-dimensional structure of the scene only can
be recovered from two images up to an unknown projective transforma-
tion — might come as a disappointment. From a mathematical point of
view, however, this should not come as a surprise, because algebraically
passive 3D reconstruction boils down to solving the reconstruction
352 Principles of Passive 3D Reconstruction
Table 2.1. The stratification of geometries.
Geometrical transf. Mathematical expression
Euclidean transf. M=RM+Twith Ra rotation matrix, TR3
similarity transf. M=κRM+Twith Ra rotation matrix, TR3,κR
affine transf. M=QM+Twith Qan invertible matrix, TR3
projective transf.
X=p11X+p12 Y+p13 Z+p14
p41X+p42 Y+p43 Z+p44
Y=p21X+p22 Y+p23 Z+p24
p41X+p42 Y+p43 Z+p44
Z=p31X+p32 Y+p33 Z+p34
p41X+p42 Y+p43 Z+p44
with P=(pij ) an invertible 4 ×4-matrix
equations (2.20) for the camera parameters and the scene points M.In
terms of projection matrices and extended coordinates (cf. formula (2.7)
in Section 2.2.4), the reconstruction equations (2.20) are formulated as:
ρ1m1=P1M
1and ρ2m2=P2M
1,(2.32)
where Pj=(KjRT
j|−KjRT
jCj) is the 3 ×4-projection matrix of the
j-th camera, with j=1 or j= 2, and M
1=(X,Y,Z,1)Tare the
extended coordinates of the scene point M=(X, Y, Z)T. Moreover, it
was observed in Section 2.2.4 that in the general linear camera model
any 3×4-matrix of maximal rank can be interpreted as the projection
matrix of a linear pinhole camera. Consequently, inserting an arbitrary
invertible 4 ×4-matrix and its inverse in the right-hand sides of the
projection equations (2.32) does not alter the image points m1and m2
in the left-hand sides of the equations and yields another — but equally
valid — decomposition of the reconstruction equations:
ρ1m1=P1H1HM
1and ρ2m2=P2H1HM
1;
or equivalently,
ˆρ1m1=ˆ
P1ˆ
M
1and ˆρ2m2=ˆ
P2ˆ
M
1,(2.33)
with ˆ
P1=P1H1and ˆ
P2=P2H1two 3 ×4-matrices of maximal
rank, λˆ
M
1=HM
1with λa non-zero scalar a 3D projective transfor-
2.5 Two Image-Based 3D Reconstruction Up-Close 353
mation of the scene, and ˆρ1=ρ1
λand ˆρ2=ρ2
λnon-zero scalar factors.
Clearly, formulas (2.33) can be interpreted as the projection equations
of scene points ˆ
Mthat, when observed by cameras with respective pro-
jection matrices ˆ
P1and ˆ
P2, yield the same set of image points m1and m2.
As Hcan be any invertible 4 ×4-matrix, it is clear that one cannot
hope to do better than recovering the 3D geometric structure of the
scene up to an arbitrary 3D projective transformation if no information
about the cameras is available. But, the longer analysis presented in the
previous sections and leading up to the same conclusion has provided
an explicit algorithm for projective 3D reconstruction, which will be
refined in the next sections.
Awareness of the stratification is also useful when additional infor-
mation on the scene (rather than the cameras) is available. One may
know some (relative) lengths or angles (including orthogonalities and
parallelisms). Exploiting such information can make it possible to
upgrade the geometric structure of the reconstruction to one with less
uncertainty, i.e., to one higher up in the stratification table. Based
on known lengths and angles, it may, for instance, become possible
to construct a 3D projective transformation matrix (homography) H
that converts the projective reconstruction ˆ
M, obtained from the image
data (cf. Figure 2.13 and formula (2.31) ), directly into a Euclidean
one. And, even if no metric information about the scene is available,
other geometrical relations that are known to exist in the scene may
be useful. In particular, we have already discussed the case of parallel
lines in three independent directions, that suffice to upgrade a pro-
jective reconstruction into an affine one through the three vanishing
points they deliver. Indeed, they allow to determine the three unknown
parameters aR3of the invertible matrix A(cf. Section 2.5.3). Know-
ing aallows to upgrade the projective 3D reconstruction ˆ
Mto the affine
3D reconstruction κ˜
Mby using forumla (2.30) in Section 2.5.4. How-
ever, one does not always need the projections of (at least) two parallel
lines in the image to compute a vanishing point. Alternatively, if in an
image three points can be identified that are the projections of collinear
scene points M1,M2, and M3of which the ratio d(M1,M2)
d(M1,M3)of their Euclidean
distances in the real world is known (e.g., three equidistant points in
the scene), then one can determine the vanishing point of this direction
354 Principles of Passive 3D Reconstruction
in the image using the cross ratio (cf. Appendix A). In Subsection 4.6.1
of Section 4 in Part 2 of this tutorial such possibilities for improving the
3D reconstruction will be explored further. In the next section, how-
ever, we will assume that no information about the scene is available
and we will investigate how more than two images can contribute to
better than a projective 3D reconstruction.
2.6 From Projective to Metric Using
More Than Two Images
In the geometric stratification of the previous section, we ended up
with a 3D projective reconstruction in case no prior camera calibration
information is available whatsoever and we have to work purely from
image correspondences. In this section we will investigate how and when
we can work our way up to a metric 3D reconstruction, if we were to
have more than just two uncalibrated images.
2.6.1 Projective Reconstruction and Projective Camera
Matrices from Multiple Images
Suppose we are given mimages I1,I2,... ,Imof a static scene and a set
of corresponding points mj∈I
jbetween the images (j∈{1,2,...,m}).
As in formula (2.20) the projection equations of the j-th camera are:
ρjmj=KjRT
j(MCj) for j∈{1,2,...,m}; (2.34)
or, in terms of projection matrices and extended coordinates as in for-
mula (2.32):
ρjmj=KjRT
j|−KjRT
jCjM
1for j∈{1,2,...,m}. (2.35)
Of course, when one has more than just two views, one can still extract
at least a projective reconstruction of the scene, as when one only had
two. Nonetheless, a note of caution is in place here. If we were to try
and build a projective reconstruction by pairing the first with each
of the other images separately and then combine the resulting projec-
tive reconstructions, this would in general not work. Indeed, one has
to ensure that the same projective distortion is obtained for each of
2.6 From Pro jective to Metric Using More Than Two Images 355
the reconstructions. That this will not automatically amount from a
pairwise reconstruct-and-then-combine procedure becomes clear if one
writes down the formulas explicitly. When the internal and external
parameters of the cameras are unknown, a projective 3D reconstruc-
tion of the scene can be computed from the first two images by the
procedure of Figure 2.13 in Section 2.5.4. In particular, for each point
correspondence m1∈I
1and m2∈I
2in the first two images, the recon-
structed 3D point ˆ
Mis the solution of the system of linear equations:
ˆρ1m1=ˆ
Mand ˆρ2m2=ˆ
A2ˆ
M+e2,(2.36)
where ˆρ1and ˆρ2are non-zero scalars (which depend on ˆ
Mand thus are
also unknown) and ˆ
A2=(1/e22)[e2]׈
F12 with ˆ
F12 an estimate of
the fundamental matrix between the first two images as computed from
point correspondences. The resulting points ˆ
Mconstitute a 3D recon-
struction of the scene which relates to the metric reconstruction ¯
Mby
the projective transformation:
λˆ
M
1=κK10
aTK11¯
M
1,(2.37)
as was demonstrated in Section 2.5.4 (formula (2.31)). Let Hbe the
homography matrix in the right-hand side of formula (2.37), viz.:
H=κK10
aTK11.(2.38)
For the other images a similar set of equations can be derived by pairing
each additional image with the first one. But, as mentioned before, one
has to be careful in order to end up with the same projective distortion
already resulting from the reconstruction based on the first two views.
Indeed, for an additional image Ijwith j3, blindly applying the pro-
cedure of Figure 2.13 in Section 2.5.4 to the first and the j-th images
without considering the second image or the already obtained projec-
tively reconstructed 3D points ˆ
Mwould indeed result in a 3D projective
reconstruction ˆ
M(j), but one which relates to the metric reconstruction ¯
M
by the projective transformation:
λˆ
M(j)
1=κ(j)K10
aT
(j)K11¯
M
1,
356 Principles of Passive 3D Reconstruction
in which the parameters κ(j)Rand a(j)R3are not related to the
parameters κand ain the projective transformation (2.37) and the
corresponding homography matrix Hin formula (2.38). A consistent
projective reconstruction implies that aT
(j)(j)=aT. So, when intro-
ducing additional images Ijwith j3 into the reconstruction process,
one cannot simply choose the matrix ˆ
Ajfor each new view indepen-
dently, but one has to make sure that the linear system of reconstruction
equations:
ˆρ1m1=ˆ
M
and ˆρjmj=ˆ
Ajˆ
M+ejfor j∈{2,...,m}(2.39)
is consistent, i.e., that it is such that the solutions ˆ
Msatisfy all recon-
struction equations at once, in as far as image projections mjare avail-
able. The correct way to proceed for the j-th image with j3 therefore
is to express ˆ
Ajmore generally as ˆ
Aj=κj(1/ej2)[ej]׈
F1j+ejaT
j
where κjRand ajR3are parameters such that the reconstruc-
tion equations ˆρjmj=ˆ
Ajˆ
M+ejhold for all reconstructed 3D points ˆ
M
obtained from formulas (2.36). Each image point mjbrings three linear
equations for the unknown parameters κjand aj, but also introduces
1 unknown scalar factor ˆρj. Hence, theoretically 2 point correspon-
dences between the first, the second and the j-th image would suffice
to determine κjand ajin a linear manner. However, the parameter-
ization of the matrix ˆ
Ajrelies on the availability of the fundamental
matrix F1jbetween the first and the j-th image. And, as explained
in Section 2.4.2, if F1jis to be estimated from point correspondences
too, then at least 7 point correspondences are needed. Therefore, from a
computational point of view it is better to estimate both ˆ
Aand e3from
the relation ˆρjmj=ˆ
Ajˆ
M+ejdirectly. Indeed, for each image point mj
this formula yields three linear equations in the nine unknown entries of
the matrix ˆ
Ajand the two unknown pixel coordinates of the epipole ej,
but it also introduces one unknown scalar factor ˆρj. Consequently, at
least 6 point correspondences are needed to uniquely determine ˆ
Ajand
ejin a linear manner. Moreover, since the fundamental matrix between
the first and the j-th image is defined in Section 2.4.1 as F1j=[ej]×Aj,
an estimate for Ajand ejimmediately implies an estimate for F1jas
well. At first sight this may seem a better approach to estimate the
2.6 From Pro jective to Metric Using More Than Two Images 357
fundamental matrix F1j, because it only needs 6 point correspondences
instead of 7, but one should realize that three images are involved here
(instead of only two images in Section 2.4.2). The relations between
three or more views of a static scene will be further explored in Sec-
tion 3 of Part 2 of this tutorial. More information on how to efficiently
compute ˆ
Ajin practice can be found in Subsection 4.4 of Section 4 in
Part 2 of this tutorial.
2.6.2 From Projective to Affine 3D Reconstruction
Now that the discussion of projective reconstruction is extended to the
case of more than two cameras, we next consider options to go beyond
this and upgrade to an affine or even a metric reconstruction. But
before doing so, we first make explicit the link between the plane at
infinity of the scene and the projective transformation matrix:
H=κK10
aTK11,(2.40)
which describes the transition from a metric to a projective reconstruc-
tion. Readers who are not very familiar with projective geometry can
find in Appendix A in Part 3 of this tutorial all necessary background
material that is used in this subsection, or they may skip this subsection
entirely and continue immediately with Section 2.6.3.
Recall that in projective geometry direction vectors are represented
as points on the plane at infinity of the world space. If we would be able
to identify the plane at infinity of the scene in the projective reconstruc-
tion, then by moving it to infinity, we can already upgrade the projec-
tive reconstruction to an affine one. In the projective reconstruction ˆ
Mof
Section 2.6.1, the plane at infinity of the scene is found as the plane with
equation aTˆ
Mκ= 0. Indeed, as explained in Appendix A, the plane
at infinity of the scene has homogeneous coordinates (0,0,0,1)Tin the
world space. Moreover, if a projective transformation whose action on
homogeneous coordinates of 3D points is represented by an invertible
4×4-homography matrix His applied to the world space, then homo-
geneous coordinates of planes are transformed by the inverse transpose
HT=(H1)T=(HT)1of the homography matrix H. For example,
358 Principles of Passive 3D Reconstruction
the 3D similarity transformation ¯
M=1
ρe2RT
1(MC1) is represented in
matrix notation by:
¯
M
1=1
ρe2RT
11
ρe2RT
1C1
0T1M
1.
The corresponding homography matrix is:
1
ρe2RT
11
ρe2RT
1C1
0T1
and its inverse transpose is:
ρe2RT
10
CT
11.
The plane at infinity of the scene has homogeneous coordinates
(0,0,0,1)Tand is mapped by this similarity transformation onto itself,
because
ρe2RT
10
CT
11
0
0
0
1
=
0
0
0
1
.
This illustrates again the general fact that Euclidean transformations,
similarity transformations and affine transformations of the world space
do not affect the plane at infinity of the scene. A projective transfor-
mation on the contrary generally does affect the plane at infinity. In
particular, the projective transformation λˆ
M
1=H¯
M
1defined by the
homography matrix (2.40) maps the plane at infinity of the scene to
the plane with homogeneous coordinates:
HT
0
0
0
1
=1
κKT
11
κa
0T1
0
0
0
1
=1
κa
1.
In other words, the plane at infinity of the scene is found in the
projective reconstruction ˆ
Mas the plane with equation 1
κaTˆ
M+1=0,
2.6 From Pro jective to Metric Using More Than Two Images 359
or equivalently, aTˆ
Mκ= 0. Given this geometric interpretation of
1
κa, it becomes even clearer why we need to keep these values the
same for the different choices of ˆ
Ajin the foregoing discussion about
building a consistent, multi-view projective reconstruction of the scene.
All two-view projective reconstructions need to share the same plane
at infinity for them to be consistent. This said, even if one keeps the
3D reconstruction consistent following the aforementioned method,
one still does not know Hor 1
κaT.
If the position of the plane at infinity of the scene were known in the
projective reconstruction, one could derive a projective transformation
which will turn the projective reconstruction into an affine one, i.e., to
put the plane at infinity really at infinity. Indeed, formula (2.37) can
be rewritten as:
λˆ
M
1=κK10
aTK11¯
M
1=κI30
aT1K1¯
M
1
=κI30
aT1˜
M
1,
where ˜
M=K1¯
Mis the affine 3D reconstruction that was introduced by
formula (2.25) in Section 2.5.3 (see also formula (2.30) in Section 2.5.4).
Knowing that 1
κaTˆ
M+ 1 = 0 is the equation of the plane at infinity
of the scene in the projective reconstruction, the previous equation can
also be written as:
λˆ
M
1=κI30
aT1˜
M
1=I30
1
κaT1κ˜
M
1.(2.41)
Observe that since ˜
Mis an affine 3D reconstruction of the scene, κ˜
Mis
an affine 3D reconstruction as well. Denoting π=1
κa, the plane at
infinity has equation πT
ˆ
M+ 1 = 0 and Equation (2.41) reads as:
λˆ
M
1=I30
πT
1κ˜
M
1.
This is the 3D projective transformation which maps the plane at
infinity in the affine reconstruction κ˜
Mto the plane with equation
360 Principles of Passive 3D Reconstruction
πT
ˆ
M+ 1 = 0 in the projective 3D reconstruction ˆ
M. Put differently,
if the plane at infinity of the scene can be identified as the plane
πT
ˆ
M+ 1 = 0 in the projective reconstruction ˆ
M(e.g., from directions
which are known to be parallel in the scene or from vanishing points
in the images), then the inverse projective transformation:
˜
λκ˜
M
1=I30
πT
1ˆ
M
1with ˜
λa non-zero scalar, (2.42)
or equivalently,
κ˜
M=ˆ
M
πTˆ
M+1
turns the projective 3D reconstruction ˆ
Minto the affine reconstruc-
tion κ˜
M. In mathematical parlance, the projective transformation (2.42)
maps the plane πT
ˆ
M+1=0 to infinity.
2.6.3 Recovering Metric Structure : Self-Calibration
Equations
If no information is available about where the plane at infinity is located
in the projective reconstruction ˆ
M, one can still upgrade the reconstruc-
tion, even to metric. Recall from formulas (2.39) that ˆρjmj=ˆ
Ajˆ
M+ej
for all j2 and from formula (2.37) that:
λˆ
M
1=κK10
aTK11¯
M
1.
Together, these equations yield
ˆρjmj=ˆ
Ajˆ
M+ej=(ˆ
Aj|ej)ˆ
M
1
=1
λ(ˆ
Aj|ej)κK10
aTK11¯
M
1
=1
λκˆ
AjK1+ejaTK1ej¯
M
1
or equivalently,
λˆρjmj=(κˆ
AjK1+ejaTK1)¯
M+ejfor all j2.
2.6 From Pro jective to Metric Using More Than Two Images 361
On the other hand, a similar computation as in the derivation of the
metric reconstruction equations (2.24) in Section 2.5.2 shows that
¯ρjmj=KjRT
jR1¯
M+ej,with ¯ρj=ρj
ρej
.
Comparing both equations, one sees that
κˆ
AjK1+ejaTK1=λjKjRT
jR1for all j2
and for some scalar λj. In Section 2.6.2 it is observed that the param-
eters κRand aR3determine the plane at infinity of the scene in
the projective reconstruction: π=1
κa. Making this reference to the
plane at infinity of the scene explicit in the previous equation, one gets
κj(ˆ
AjejπT
)K1=KjRT
jR1for all j2, (2.43)
where κj=κ/λjis a non-zero scalar. This equation has two interesting
consequences. First of all, multiplying both sides of the equality on the
right with the inverse of the calibration matrix K1gives
κj(ˆ
AjejπT
)=KjRT
jR1K1
1for all j2,
which, since the right-hand side of this equation is just the invertible
matrix Ajdefined in Section 2.4.1, yields an explicit relation between
the infinite homography Ajand the 3 ×3-matrices ˆ
Ajcomputed from
the images as described in Section 2.6.1, viz
Aj=κj(ˆ
AjejπT
) for all j2 (2.44)
with non-zero scalars κjR. And secondly, multiplying both sides in
equality (2.43) on the right with the inverse R1
1=RT
1of the rotation
matrix R1gives
κj(ˆ
AjejπT
)K1RT
1=KjRT
jfor all j2.
If one now multiplies both sides of this last equation with its transpose,
then
(KjRT
j)(KjRT
j)T=κ2
j(ˆ
AjejπT
)K1RT
1(K1RT
1)T(ˆ
AjejπT
)T
for all j2, which by RT
j=R1
jreduces to
KjKT
j=κ2
j(ˆ
AjejπT
)K1KT
1(ˆ
AjejπT
)T(2.45)
362 Principles of Passive 3D Reconstruction
for all j∈{2,...,m}. Equations (2.45) are the so-called self-calibration
or autocalibration equations [8] and all self-calibration methods essen-
tially are variations on solving these equations for the calibration matri-
ces Kjand the 3-vector πlocating the plane at infinity of the scene
in the projective reconstruction. The various methods may differ in the
constraints or the assumptions on the calibration matrices they employ,
however.
2.6.4 Scrutinizing the Self-Calibration Equations
The self-calibration equations (2.45) have a simple geometric interpre-
tation, which we will explore first before looking into ways for solving
them. Readers who are merely interested in the practice of 3D recon-
struction may skip this section and continue with Section 2.6.5.
2.6.4.1 Metric Structure and the Preservation of Angles
In Section 2.5.2 it was observed that if one wants to reconstruct a scene
from images only and if no absolute distance is given for any parts of
the scene, then one can never hope to do better than a metric recon-
struction, i.e., a 3D reconstruction which differs from the original scene
by an unknown 3D similarity transformation. Typical of a 3D simi-
larity transformation is that all distances in the scene are scaled by
a fixed scalar factor and that all angles are preserved. Moreover, for
a projective 3D reconstruction of the scene to be a metric one, it is
necessary and sufficient that the angles of any triangle formed by three
points in the reconstruction are equal to the corresponding angles in the
triangle formed by the original three scene points. The self-calibration
equations (2.45) enforce this condition in the projective reconstruction
at hand, as will be explained now.
Let M,P, and Qbe three arbitrary points in the scene. The angle
between the line segments [M,P] and [M,Q] in Euclidean 3-space is found
as the angle between the 3-vectors PMand QMin R3. This angle
is uniquely defined by its cosine, which is given by the formula:
cos(PM,QM)= PM,QM
PMQM,
2.6 From Pro jective to Metric Using More Than Two Images 363
where PM,QMdenotes the (standard) inner product; and,
PM=PM,PMand QM=QM,QMare the
norms of the 3-vectors PMand QMin R3.Now,PMis a direc-
tion vector for the line defined by the points Mand Pin the scene.
As explained in Appendix A, the vanishing point vjof the line MP in
the j-th image (j∈{1,2,...,m}) is given by ρvj vj=KjRT
j(PM),
where ρjmj=KjRT
j(MCj) are the projection equations of the j-th
camera, as defined by formula (2.34). Since Rjis a rotation matrix,
RT
j=R1
jand the 3-vector PMcan (theoretically) be recovered up
to scale from its vanishing point vjin the j-th image by the for-
mula PM=ρvj RjK1
jvj. Similarly, QMis a direction vector of the
line through the points Mand Qin the scene and the vanishing point
wjof this line in the j-th image is given by ρwj wj=KjRT
j(QM),
thus yielding QM=ρwj RjK1
jwj. By definition of the inner
product in R3,
PM,QM=(PM)T(QM)
=(ρvj RjK1
jvj)T(ρwj RjK1
jwj)
=ρvj ρwj vT
jKT
jK1
jwj
=ρvj ρwj vT
j(KjKT
j)1wj.
A similar calculation yields
PM2=(PM)T(PM)=ρ2
vj vT
j(KjKT
j)1vj
and QM2=(QM)T(QM)=ρ2
wj wT
j(KjKT
j)1wj.
Combining the previous expressions, one gets the following formula for
the cosine of the angle between the line segments [M,P] and [M,Q]inthe
scene:
cos(PM,QM)= vT
j(KjKT
j)1wj
vT
j(KjKT
j)1vjwT
j(KjKT
j)1wj
.(2.46)
This equation states that the angle between two lines in the scene can
be measured from a perspective image of the scene if the vanishing
points of these lines can be identified in the image and if the calibration
matrix Kjof the camera — or rather KjKT
j— is known.
364 Principles of Passive 3D Reconstruction
This should not come as a surprise, since vanishing points encode
3D directions in a perspective image. What is more interesting to
notice, however, is that the inner product of the scene is encoded in
the image by the symmetric matrix (KjKT
j)1. In the computer vision
literature this matrix is commonly denoted by ωjand referred to as the
image of the absolute conic in the j-th view [16]. We will not expand on
this interpretation right now, but the interested reader can find more
details on this matter in Appendix B. The fact that the calibration
matrix Kjappears in a formula relating measurements in the image to
measurements in the scene was to be expected. The factor RT
j(MCj)
in the right-hand side of the projection equations ρjmj=KjRT
j(MCj)
corresponds to a rigid motion of the scene, and hence does not have an
influence on angles and distances in the scene. The calibration matrix
Kj, on the other hand, is an upper triangular matrix, and thus intro-
duces scaling and skewing. When measuring scene angles and distances
from the image, one therefore has to undo this skewing and scaling
first by premultiplying the image coordinates with the inverse matrix
K1
j. And, last but not least, from the point of view of camera self-
calibration, formula (2.46) introduces additional constraints between
different images of the same scene, viz:
vT
i(KiKT
i)1wi
vT
i(KiKT
i)1viwT
i(KiKT
i)1wi
=vT
j(KjKT
j)1wj
vT
j(KjKT
j)1vjwT
j(KjKT
j)1wj
,
which must hold for every pair of images i, j ∈{1,2,...,m}and for
every two pairs of corresponding vanishing points vi,vjand wi,wjin
these images. Obviously, only m1 of these relations are independent:
vT
1(K1KT
1)1w1
vT
1(K1KT
1)1v1wT
1(K1KT
1)1w1
=vT
j(KjKT
j)1wj
vT
j(KjKT
j)1vjwT
j(KjKT
j)1wj
(2.47)
for every j2 and for every two pairs of corresponding vanishing
points v1,vjand w1,wjbetween the first and the j-th image. The
self-calibration equations (2.45) enforce these constraints in the projec-
tive reconstruction of the scene, as will be demonstrated next.
2.6 From Pro jective to Metric Using More Than Two Images 365
2.6.4.2 Infinity Homographies and the
Preservation of Angles
To see why the final claim in the last section holds, we have to rein-
terpret the underlying relation (2.46) in terms of projective geome-
try. Consider again the (arbitrary) scene points M,P, and Q. Their
extended coordinates, respectively, are M
1,P
1, and Q
1. As explained in
Appendix A, the direction vector PMof the line Lthrough the points
Mand Pin the scene corresponds in projective 3-space to the point of
intersection of the line through the projective points M
1and P
1with
the plane at infinity of the scene; and the vanishing point vjof Lin
the j-th image is the perspective projection of this point of intersection
onto the image plane of the j-th camera (j∈{1,2,...,m}). In particu-
lar, the vanishing points v1,v2,... , vmare corresponding points in the
images I1,I2, ... ,Im. Moreover, it was explained in Section 2.4.3 that
the matrix Aj=KjRT
jR1K1
1introduced in Section 2.4.1 actually is
a homography matrix representing the projective transformation that
maps (vanishing) points from the first image via the plane at infinity
of the scene onto the corresponding (vanishing) point in the j-th image
(j2), and therefore is referred to in the computer vision literature as
the infinite homography between the first and the j-th image. Explicitly,
Ajv1=ρjvjfor some non-zero scalar factor ρj. Similarly, the vanishing
points w1and wjof the line through Mand Qin, respectively, the first
and j-th image satisfy Ajw1=σjwjfor some non-zero scalar factor σj.
Using the infinite homography Aj=KjRT
jR1K1
1, the inner product
vT
j(KjKT
j)1wjin the j-th image can also be expressed in terms of the
corresponding vanishing points v1and w1in the first image:
vT
j(KjKT
j)1wj=1
ρj
Ajv1T
(KjKT
j)11
σj
Ajw1
=1
ρj
1
σj
vT
1AT
j(KjKT
j)1Ajw1.(2.48)
Using again Aj=KjRT
jR1K1
1and the fact that RT
j=R1
j, since
Rjis a rotation matrix, the 3 ×3-matrix AT
j(KjKT
j)1Ajin the
366 Principles of Passive 3D Reconstruction
right-hand side of the previous equality simplifies to:
AT
j(KjKT
j)1Aj=(KjRT
jR1K1
1)T(KT
jK1
j)(KjRT
jR1K1
1)
=(K1KT
1)1.
Equation (2.48) then reads
vT
1(K1KT
1)1w1=ρjσjvT
j(KjKT
j)1wj.
By similar calculations, one also finds that vT
1(K1KT
1)1v1=
ρ2
jvT
j(KjKT
j)1vjand wT
1(K1KT
1)1w1=σ2
jwT
j(KjKT
j)1wj. Together,
these three equalities establish the relations (2.47) in an almost triv-
ial manner. It is important to realize, however, that it actually is the
equality:
AT
j(KjKT
j)1Aj=(K1KT
1)1,(2.49)
which makes that the constraints (2.47) are satisfied. Our claim is that
the self-calibration equations (2.45) are nothing else but this fundamen-
tal relation (2.49) expressed in terms of the projective reconstruction
of the scene which was computed from the given images as described
in Section 2.6.1.
2.6.4.3 Equivalence of Self-Calibration and the
Preservation of Angles
Now, suppose that no information whatsoever on the scene is given,
but that we have computed matrices ˆ
Ajand epipoles ejfor each image
(j2) as well as a collection of 3D points ˆ
Msatisfying the projective
reconstruction equations (2.39), viz:
ˆρ1m1=ˆ
Mand ˆρjmj=ˆ
Ajˆ
M+ejfor j∈{2,...,m},
for given point correspondences m1,m2,...,mmbetween the images. The
3D points ˆ
Mhave been shown in Section 2.5.4 to form a projective
3D reconstruction of the scene. To prove our claim about the self-
calibration equations, we will follow the same line of reasoning as the
one that led to relation (2.49) above. Let ˆ
Mand ˆ
Pbe two arbitrary points
in the projective reconstruction. As explained in Appendix A, the van-
ishing point vjin the j-th image of the line ˆ
Lthrough the points ˆ
Mand
2.6 From Pro jective to Metric Using More Than Two Images 367
ˆ
Pin the 3D reconstruction is the projection of the point of intersection
of the line ˆ
Lwith the plane at infinity of the scene. Since no information
whatsoever on the scene is available, we do not know where the plane
at infinity of the scene is located in the reconstruction. However, we do
know that for every 3D line in the projective reconstruction, its vanish-
ing points in the respective images should be mapped onto each other
by the projective transformation which maps one image to another via
the plane at infinity of the scene. The idea now is to identify a plane in
the projective reconstruction which does exactly that. Suppose that the
(unknown) equation of this plane is πT
ˆ
M+ 1 = 0. The projective trans-
formation which maps the first image to the j-th one via this plane is
computed as follows: From the first projective reconstruction equation
ˆρ1m1=ˆ
Mit follows that parameter equations for the projecting ray of
an arbitrary image point m1in the first camera are given by ˆ
Mρm1
where ˆρRis a scalar parameter. This projecting ray intersects the
plane at infinity of the scene in the point ˆ
Mρm1that satifies the
equation πT
ˆ
M+ 1 = 0. Hence,
ˆρ=1
πT
m1
and ˆ
M=1
πT
m1
m1.
By the projective reconstruction equations (2.39), the projection mjof
this point ˆ
Min the j-th image for j2 satisfies ˆρjmj=ˆ
Ajˆ
M+ej.
Substituting the expression for ˆ
Min this equation yields
ˆρjmj=1
πT
m1
ˆ
Ajm1+ej,
or equivalently,
(πT
m1ρjmj=ˆ
Ajm1ej(πT
m1)=(ˆ
AjejπT
)m1.
Consequently, the 3 ×3-matrix ˆ
AjejπT
is a homography matrix of
the projective transformation which maps the first image to the j-th
one via the plane at infinity of the scene. In Section 2.4.3, on the other
hand, it was demonstrated that the invertible matrix Ajintroduced in
Section 2.4.1 also is a matrix for this infinite homography. Therefore,
ˆ
AjejπT
and Ajmust be equal up to a non-zero scalar factor, i.e.,
Aj=κj(ˆ
AjejπT
) for some non-zero scalar κjR. Notice that this
368 Principles of Passive 3D Reconstruction
is exactly the same expression for Ajas was found in formula (2.44) of
Section 2.6.3, but now it is obtained by a geometrical argument instead
of an algebraic one.
Clearly, for any plane πTˆ
M+ 1 = 0 the homography matrix ˆ
Aj
ejπTwill map the first image to the j-th one via that plane. But only
the plane at infinity of the scene will guarantee that the cosines
vT
1(K1KT
1)1w1
vT
1(K1KT
1)1v1wT
1(K1KT
1)1w1
and vT
j(KjKT
j)1wj
vT
j(KjKT
j)1vjwT
j(KjKT
j)1wj
(cf. formula (2.47) ) computed in each image j(j2) admit the same
value whenever v1is mapped onto vjand w1is mapped onto wjby the
homography ˆ
AjejπT. And, as was observed earlier in this section,
this will only be guaranteed if AT
j(KjKT
j)1Aj=(K1KT
1)1where
Ajnow has to be interpreted as the infinite homography mapping the
first image to the j-th image by the plane at infinity of the scene (cf.
formula (2.49) ). Since Aj=κj(ˆ
AjejπT
), the plane at infinity of
the scene is that plane πT
ˆ
M+ 1 = 0 in the projective reconstruction
for which
κ2
j(ˆ
AjejπT
)T(KjKT
j)1(ˆ
AjejπT
)=(K1KT
1)1
for all j∈{2,3,...,m}. This equality expresses that the intrinsic way
of measuring in the j-th image — represented by the symmetric matrix
(KjKT
j)1— must be compatible with the intrinsic way of mea-
suring in the first image — represented by the symmetric matrix
(K1KT
1)1— and, more importantly, that it is the infinite homogra-
phy between the first and the j-th image — represented by the matrix
(ˆ
AjejπT
) — which actually transforms the metric (KjKT
j)1into
the metric (K1KT
1)1. Finally, if both sides of this matrix equality are
inverted, one gets the relation:
1
κ2
j
(ˆ
AjejπT
)1KjKT
j(ˆ
AjejπT
)T=K1KT
1,
or, after solving for KjKT
j,
KjKT
j=κ2
j(ˆ
AjejπT
)K1KT
1(ˆ
AjejπT
)T,(2.50)
2.6 From Pro jective to Metric Using More Than Two Images 369
which must hold for all j∈{2,3,...,m}. Observe that these are exactly
the self-calibration equations (2.45), as we claimed earlier.
Intuitively, these equations state that if a plane πT
ˆ
M+1=0 in
the reconstruction is known or found to be the plane at infinity of the
scene and if a metric — represented by a symmetric, positive-definite
matrix K1KT
1— is induced in the first image, then by really map-
ping the plane πˆ
M+ 1 = 0 to infinity and by measuring in the j-th
image (j2) according to a metric induced from K1KT
1through for-
mula (2.50), the projective reconstruction will be upgraded to a metric
3D reconstruction of the scene. This is in accordance with our findings
in Section 2.6.2 that the projective transformation
κ¯
M=K1
1ˆ
M
πT
ˆ
M+1
transforms the projective reconstruction ˆ
Minto the metric reconstruc-
tion κ¯
M.
2.6.5 A Glimpse on Absolute Conics and Quadrics
The right-hand side of the self-calibration equations (2.50) can also be
written as:
KjKT
j=κ2
j(ˆ
AjejπT
)K1KT
1(ˆ
AjejπT
)T
=κ2
jˆ
AjejI3
πT
K1KT
1I3πˆ
AT
j
eT
j
=κ2
jˆ
AjejK1KT
1K1KT
1π
πT
K1KT
1πT
K1KT
1πˆ
AT
j
eT
j.
In the computer vision literature, the 3 ×3-matrix KjKT
jin the left-
hand side of the equality is commonly denoted by ω
jand referred to
as the dual image of the absolute conic in the j-th view [16]. It is the
mathematical dual of the image of the absolute conic ωjin the j-th
view. The 4 ×4-matrix
=K1KT
1K1KT
1π
πT
K1KT
1πT
K1KT
1π(2.51)
370 Principles of Passive 3D Reconstruction
in the right-hand side of the previous equality, on the other hand, is
in the computer vision literature usually referred to as the absolute
quadric [16]. The self-calibration equations (2.50) are compactly writ-
ten in this manner as:
ω
j=κ2
j(ˆ
Aj|ej)Ω
(ˆ
Aj|ej)T,(2.52)
where ω
j=KjKT
jand with Ωas defined above. We will not expand
on this interpretation in terms of projective geometry right now, but the
interested reader can find more details on this matter in Appendix B.
The main advantage of writing the self-calibration equations in the
form (2.52) is that it yields linear equations in the entries of ω
jand
the entries of Ω, which are easier to solve in practice than the non-
linear formulation (2.50). Moreover, the absolute quadric Ωencodes
both the plane at infinity of the scene and the internal calibration of
the first camera in a very concise fashion. Indeed,
=I30
πT
1K1KT
10
0T0 I30
πT
1T
,
from which it immediately follows that Ωhas rank 3 and that its
nullspace is spanned by the plane at infinity of the scene (πT
1)T.
Moreover, Ωcan also be decomposed as:
=K10
πT
K11I30
0T0 K10
πT
K11T
,
where the 4 ×4-matrix
K10
πT
K11=I30
πT
1K10
0T1,(2.53)
is the homography matrix of the projective transformation
λˆ
M
1=K10
πT
K11κ¯
M
1,
which relates the projective reconstruction ˆ
Mto the metric reconstruc-
tion κ¯
M, as discussed in Section 2.6.2. Hence, the rectifying homogra-
phy to update the projective reconstruction of the scene to a metric
2.6 From Pro jective to Metric Using More Than Two Images 371
one is directly available once the absolute quadric Ωhas been recov-
ered. It is interesting to observe that the decomposition (2.53) of the
rectifying homography exhibits the stratification of geometries as dis-
cussed in Section 2.5.5. The rightmost matrix in the right-hand side
of equation (2.53) represents the affine deformation induced on the
metric reconstruction κ¯
Mby including the uncertainty about the inter-
nal camera parameters K1in the 3D reconstruction, yielding the affine
reconstruction κ˜
M; and the leftmost matrix in the decomposition (2.53)
changes the plane at infinity to the plane πT
ˆ
M+ 1 = 0 in the projective
reconstruction ˆ
Mof the scene.
How Ωis computed in practice, given a projective reconstruction
of the scene is discussed in more detail in Subsection 4.6.2 of Section 4
in Part 2 of the tutorial. Therefore we will not continue with this topic
any further here, but instead we will address the question of how many
images are needed in order for the self-calibration equations to yield a
unique solution.
2.6.6 When do the Self-Calibration Equations Yield a
Unique Solution ?
The self-calibration equations yield additional constraints on the cal-
ibration matrices of the cameras and about the location of the plane
at infinity of the scene in a projective 3D reconstruction. But as was
already observed in Sections 2.5.4 and 2.5.5, if no additional informa-
tion about the internal and/or external calibration of the cameras or
about the Euclidean structure of the scene is available, then with only
two images one cannot hope for better than a projective reconstruction
of the scene. The question that remains unsolved up till now is: Is it
possible to recover a metric reconstruction of the scene with only the
images at our disposal; and, more importantly, how many images are
needed to obtain a unique solution?
Consider again the self-calibration equations (2.50) in their compact
formulation:
KjKT
j=κ2
j(ˆ
AjejπT
)K1KT
1(ˆ
AjejπT
)T(2.54)
for each j∈{2,3,...,m}. In these equations the calibration matrices Kj
all appear as KjKT
j. It is thus advantageous to take the entries of
372 Principles of Passive 3D Reconstruction
KjKT
jas the unknowns in the self-calibration equations instead of
expressing them in terms of the internal camera parameters consti-
tuting Kj(cf. formula (2.4)). As Kjis an invertible upper-triangular
matrix, KjKT
jis a positive-definite symmetric matrix. So, if KjKT
j
is known, the calibration matrix Kjitself can uniquely be obtained
from KjKT
jby Cholesky factorization [3]. Furthermore, each KjKT
j
is a symmetric 3 ×3-matrix whose (3,3)-th entry equals 1 (cf. for-
mula (2.4)). Consequently, each KjKT
jis completely characterized by
five scalar parameters, viz. the diagonal elements other than the (3,3)-
th one and the upper-triangular entries. Similarly, the scalar factors κ2
j
can be considered as being single variables in the self-calibration equa-
tions. Together with the three unknown components of the 3-vector π,
the number of unknowns in the self-calibration equations (2.54) for
mimages add up to 5m+(m1)+3=6m+ 2. On the other hand,
for mimages, formula (2.54) yields m1 matrix equations. Since
both sides of these equations are formed by symmetric 3 ×3-matrices,
each matrix equation induces only six different non-linear equations
in the components of KjKT
j, the components of πand the scalars
κ2
jfor j∈{1,2,...,m}. Hence, for mimages, the self-calibration equa-
tions (2.54) yield a system of 6(m1)=6m6 non-linear equations
in 6m+ 2 unknowns. Clearly, without additional constraints on the
unknowns, this system does not have a unique solution.
In practical situations, however, quantitative or qualitative infor-
mation about the cameras can be used to constrain the number of
solutions. Let us consider some examples.
Images obtained by the same or identical cameras.
If the images are obtained with one or more cameras whose
calibration matrices are the same, then K1=K2=... =
Km=Kand the self-calibration equations (2.54) reduce to
KKT=κ2
j(ˆ
AjejπT
)KKT(ˆ
AjejπT
)T
for all j∈{2,...,m}. In this case, only five internal cam-
era parameters — in practice, the five independent scalars
characterizing KKT— are to be determined, reducing the
number of unknowns to 5 + (m1)+3=m+ 7. On the
2.6 From Pro jective to Metric Using More Than Two Images 373
other hand, the self-calibration equations yield six equations
for each image other than the first one. If these equations
are independent for each view, a solution is determined
provided 6(m1) m+ 7. Consequently, if m3images
obtained by cameras with identical calibration matrices, then
the calibration matrix Kand the plane at infinity of the
scene can — in principle — be determined from the self-
calibrations equations and a metric reconstruction of the
scene can be obtained.
Images obtained by the same or identical cameras
with different focal lengths.
If only the focal length of the camera is varying between the
images, then four of the five internal parameters are the same
for all cameras, which brings the total number of unknowns
to 4 + m+(m1)+3=2m+ 6. Since the self-calibration
equations (2.54) bring six equations for each image other
than the first one, a solution is in principle determined pro-
vided 6(m1) 2m+ 6. In other words, when the focal
length of the camera is allowed to vary between the images,
then — in principle — a metric reconstruction of the scene
can be obtained from m3images.
Known aspect ratio and skew, but unknown and dif-
ferent focal length and principal point.
When the aspect ratio and the skew of the cameras are
known, but the focal length and the principal point of the
cameras are unknown and possibly different for each image,
only three internal parameters have to be determined for each
camera. This brings the number of unknowns for all mimages
to 3m+(m1)+3=4m+ 2. As formula (2.54) brings
six equations for each image other than the first one, a solu-
tion is in principle determined provided 6(m1) 4m+2.
In other words, when the aspect ratio and the skew of the cam-
eras are known, but the focal length and the principal point
of the cameras are unknown and allowed to vary between the
images, then — in principle — a metric reconstruction of
374 Principles of Passive 3D Reconstruction
the scene can be obtained from m4images. Note that the
case of square pixels, usual with digital cameras, is a special
case of this.
Rectangular pixels (and, hence, known skew) and
unknown, but fixed aspect ratio.
In case the skew of the pixels is known and if the aspect
ratio is identical for all cameras, but unknown, then the
total number of unknowns in the self-calibration equations is
1+3m+(m1)+3=4m+ 3. Because there are six self-
calibration equations for each image other than the first one,
a solution is in principle determined provided 6(m1)
4m+ 3. In other words, when the skew of the cameras is
known and if the aspect ratio is identical for all cameras, but
unknown, and if the focal length and the principal point of
the cameras are unknown and allowed to vary between the
images, then — in principle — a metric reconstruction of
the scene can be obtained from m5images.
Aspect ratio and skew identical, but unknown.
In the situation where the aspect ratio and the skew of the
cameras are identical, but unknown, two of the five inter-
nal parameters are the same for all cameras, which brings
the total number of unknowns to 2 + 3 m+(m1)+3=
4m+ 4. As formula (2.54) brings six equations for each
image other than the first one, a solution is in principle deter-
mined provided 6(m1) 4m+ 4. In other words, if the
aspect ratio and the skew are the same for each camera, but
unknown, and when the focal length and the principal point of
the camera are allowed to vary between the images, then —
in principle — a metric reconstruction of the scene can be
obtained from m5images.
Rectangular pixels or known skew.
If only the skew is known for each image, then four internal
parameters have to be determined for each camera, which
brings the total number of unknowns to 4m+(m1)+3=
5m+ 2. Since there are six self-calibration equations for each
2.6 From Pro jective to Metric Using More Than Two Images 375
image other than the first one, a solution is in principle deter-
mined provided 6(m1) 5m+ 2. In other words, when
only the skew is known for each image, but all the other
internal parameters of the camera are unknown and allowed
to vary between the images, then — in principle — a met-
ric reconstruction of the scene can be obtained from m8
images.
Aspect ratio unknown, but fixed.
When the aspect ratio of the pixels is the same for all cam-
eras, but its value is unknown, then of the five unknown
internal parameters of the cameras, one is the same for all
cameras, thus bringing the total number of unknowns to
1+4m+(m1)+3=5m+ 3. With six self-calibration
equations for each image other than the first one, a solu-
tion is in principle determined provided 6(m1) 5m+3.
In other words, if the aspect ratio of the pixels is the same
for each camera, but its value is unknown, and when all the
other internal parameters of the cameras are unknown and
allowed to vary between the images, then — in principle —
a metric reconstruction of the scene can be obtained from
m9images.
In conclusion, although the self-calibration equations (2.54) bring
too few equations to allow a unique solution for the calibration
matrix Kjof each camera and to uniquely identify the plane at infin-
ity of the scene in the projective 3D reconstruction, in most practical
situations a unique solution can be obtained by exploiting additional
constraints on the internal parameters of the cameras, provided a suffi-
cient number of images are available. The required minimum number of
images as given above for different situations only is indicative in that
it is correct, provided the resulting self-calibration equations are inde-
pendent. In most applications this will be the case. However, one should
always keep in mind that there do exist camera configurations and cam-
era motions for which the self-calibration equations become dependent
and the system is degenerate. Such special situations are referred to
in the literature as critical motion sequences. A detailed analysis of all
376 Principles of Passive 3D Reconstruction
these cases is beyond the scope of this text, but the interested reader
is referred to [9, 14, 15] for further information. Moreover, it must also
be emphasized that the self-calibration equations in (2.54) are not very
well suited for numerical computations. For practical use, their linear
formulation (2.52) in terms of the absolute quadric is recommended.
Subsection 4.6.2 of Section 4 in Part 2 of this tutorial discusses this
issue in more detail.
2.7 Some Important Special Cases
In the preceding sections the starting point was that no information
about the internal and external parameters of the cameras is available
and that the positions and orientations of the cameras can be com-
pletely arbitrary. In the last section, it was observed that one has to
assume some constraints on the internal parameters in order to solve the
self-calibration equations, thereby still leaving the external parameters
completely unknown at the outset. In practical applications, the lat-
ter might not always be the case either and quite a bit of knowledge
about the camera motion may be available. Situations in which the
object or the camera has purely translated in between the acquisition
of the images are quite common (e.g., a camera on rails), and so are
cases of pure camera rotation (e.g., a camera on a tripod). These simple
camera motions both offer opportunities and limitations, which will be
explored in this section.
Moreover, sometimes it makes more sense to first calibrate cameras
internally, and then to only recover 3D structure and external param-
eters from the images. An example is digital surveying (3D measuring
in large-scale environments like cities) with one or more fixed cameras
mounted on a van. These cameras can be internally calibrated before
driving off for a new surveying campaign, during which the internal
parameters can be expected to remain fixed.
And, last but not least, in practical applications it may often be use-
ful and important — apart from reconstructing the three-dimensional
structure of the scene — to obtain also information about the camera
positions and orientations or about the camera parameters underlying
the image data. In the computer vision literature this is referred to as
2.7 Some Important Special Cases 377
structure-and-motion. The 3D reconstruction process outlined in the
previous sections can provide such information too. The fundamental
concepts needed to retrieve camera pose information will be introduced
in the subsequent sections as well.
2.7.1 Camera Translation and Stereo Rigs
Suppose that in between the acquisition of the first and the second
image the camera only has translated and that the internal camera
parameters did not change. In that case, the orientation of the camera
has not changed — i.e., R1=R2=R— and the calibration matrices
are the same — i.e., K1=K2=K. Consequently, the invertible matrix
Aintroduced in Section 2.4.1 reduces to:
A=K2RT
2R1K1
1=KRTRK1=I3,
the 3 ×3-identity matrix I3, because Ris an orthogonal matrix and
thus Rt=R1. In other words, if one knows that the camera has
only translated between the acquisition of the images, then the infinite
homography Ais theoretically known to be the 3 ×3-identity matrix.
Recall from Section 2.5.3 that, if Ais known, an affine reconstruction
of the scene can be computed by solving the system of affine recon-
struction equations (2.25), viz:
˜ρ1m1=˜
Mand ˜ρ2m2=A˜
M+e2=˜
M+e2,
yielding six equations in five unknowns. Put differently, in case of a
pure camera translation an affine reconstruction of the scene can be
computed from two uncalibrated images.
An example of an image pair, taken with a camera that has trans-
lated parallel to the object, is shown in Figure 2.15. Two views of
the resulting affine 3D reconstruction are shown in Figure 2.16. The
results are quite convincing for the torsos, but the homogeneous wall
in the background could not be reconstructed. The difference lies in the
presence or absence of texture, respectively, and the related ease or dif-
ficulty in finding corresponding points. In the untextured background
all points look the same and the search for correspondences fails.
It is important to observe that without additional information about
the camera(s) or the scene, one can still do no better than an affine
378 Principles of Passive 3D Reconstruction
Fig. 2.15 A pair of stereo images for a scene with two torsos, where the cameras have
identical settings and are purely translated with respect to each other.
Fig. 2.16 Two views of a three-dimensional affine reconstruction obtained from the stereo
image pair of Figure 2.15.
reconstruction even if one gets additional, translated views. Indeed, the
self-calibration equations (2.45) derived in Section 2.6.3 are:
KjKT
j=κ2
j(ˆ
AjejπT
)K1KT
1(ˆ
AjejπT
)T
for all j∈{2,...,m}. In case of a translating camera with con-
stant internal parameters, K1=K2=... =Km=K,A2=A3=... =
Am=I3and π=(0,0,0)t, as the scene structure has been recovered
up to a 3D affine transformation (instead of only a general projective
one). Hence, the self-calibration equations reduce to:
KKT=κ2
j(I3ej0T)KKT(I3ej0T)T
2.7 Some Important Special Cases 379
for all j∈{2,...,m}; or equivalently, KKT=κ2
jKKT, which only
implies that all κ2
jmust be equal to 1, but do not yield any infor-
mation about the calibration matrix K. In summary, with a translating
camera an affine reconstruction of the scene can be obtained already
from two images, but self-calibration and metric reconstruction are not
possible.
A special case of camera translation, which is regularly used in
practical applications, is a stereo rig with two identical cameras (i.e.,
cameras having the same internal parameters) in the following config-
uration: The optical axes of the cameras are parallel and their image
planes are coplanar, with coincident x-axes. This special configuration
is depicted in Figure 2.17. The distance between the two centers of pro-
jection is called the baseline of the stereo rig and is denoted by b.To
simplify the mathematical analysis, we may, without loss of generality,
let the world frame coincide with the camera-centered reference frame
of the left or ‘first’ camera. The projection equations of the stereo rig
Fig. 2.17 Schematic representation of a simple stereo rig: two identical cameras having
coplanar image planes and parallel axes. In particular, their x-axes are aligned.
380 Principles of Passive 3D Reconstruction
then reduce to:
ρ1
x1
y1
1
=K
X
Y
Z
and ρ2
x2
y2
1
=K
Xb
Y
Z
,
where K=
αxsp
x
0αypy
001
is the calibration matrix of the
two cameras. The pixel coordinates (x1,y1) and (x2,y2) of the projec-
tions m1and m2of a scene point Mwhose coordinates with respect to
the world frame are (X,Y,Z), are given by:
$$x_1 = \alpha_x \frac{X}{Z} + s\,\frac{Y}{Z} + p_x, \qquad y_1 = \alpha_y \frac{Y}{Z} + p_y$$
and
$$x_2 = \alpha_x \frac{X - b}{Z} + s\,\frac{Y}{Z} + p_x, \qquad y_2 = \alpha_y \frac{Y}{Z} + p_y.$$
In particular, $y_1 = y_2$ and $x_1 = x_2 + \alpha_x \frac{b}{Z}$. In other words, corresponding points in the images are found on the same horizontal line (the epipolar line for this particular setup) and the horizontal distance between them, viz. $x_1 - x_2 = \alpha_x \frac{b}{Z}$, is inversely proportional to the Z-coordinate (i.e., the projective depth) of the underlying scene point M.
In the computer vision literature the difference $x_1 - x_2$ is called the disparity between the image points, and the projective depth Z is sometimes also referred to as the range of the scene point. In photography the resolution of an image is defined as the minimum distance two points in the image have to be apart in order to be visually distinguishable. As the range Z is inversely proportional to the disparity, it follows that beyond a certain distance, depth measurement will become very coarse. Human stereo depth perception, for example, which is based on a two-eye configuration similar to this stereo rig, is limited to distances of about 10 m. Beyond, depth impressions arise from other cues. In a stereo rig the disparity between corresponding image points, apart from being inversely proportional to the projective depth Z, also is directly proportional to the baseline b and to $\alpha_x$. Since $\alpha_x$ expresses the focal length of the cameras in number of pixels for the x-direction of the image (cf. formula (2.2) in Section 2.2.2), the depth resolution of a stereo rig can be increased by increasing one or both of these variables.
Upon such a change, the same distance will correspond to a larger disparity and therefore distance sampling gets finer. It should be noted, however, that one should strike a balance between increasing resolution and keeping as much of the scene as possible visible to both cameras. Indeed, when disparities get larger, chances of finding both projections within the images diminish. Figure 2.18 shows planes of equal disparity for two focal lengths and two baseline distances.
Fig. 2.18 Equal disparities correspond to points at equal distance to the stereo system (iso-depth planes): Rays that project onto points with a fixed disparity intersect in planes of constant range. At a given distance these planes get closer (i.e., sample distance more densely) if the baseline is increased (top-left to top-right), the focal length is increased (top-left to bottom-left), or both (top-left to bottom-right).
Notice how for this particular stereo setup points with identical disparities form planes at equal depth or distance from the stereo system. The same distance is seen to be sampled more finely after the focal length or the baseline has been increased. A smaller part of the scene is visible to both cameras upon such a change, certainly for those parts which are close.
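The disparity-to-range relation of the simple stereo rig can be applied directly in code. The minimal helper below (name and interface are ours, purely illustrative) converts disparities into ranges and makes the trade-off explicit: doubling the baseline b or the focal length $\alpha_x$ doubles the disparity produced by a point at a given range Z.

```python
import numpy as np

def range_from_disparity(x1, x2, alpha_x, b):
    """Range Z of a scene point seen by the simple stereo rig, using
    x1 - x2 = alpha_x * b / Z, i.e. Z = alpha_x * b / (x1 - x2).

    x1, x2  : x-coordinates (pixels) of corresponding points on the same scanline
    alpha_x : focal length expressed in pixels along the x-direction
    b       : baseline length (same unit as the returned range)
    """
    disparity = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return alpha_x * b / disparity

# Example: with alpha_x = 1000 pixels and b = 0.1 m, a disparity of 5 pixels
# corresponds to a range of 20 m; doubling the baseline doubles the disparity
# measured for the same point, so the same range is sampled twice as finely.
```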
Similar considerations are also relevant for more general relative camera displacements than pure translations: precision goes up as camera views differ more and the projecting rays intersect at larger angles (i.e., if images are taken under wide-baseline conditions). On the other hand, without intermediate views at one's disposal, there tend to be holes in the reconstruction, for points visible in only one view or in none.
2.7.2 Pure Rotation Around the Center of Projection
Another special case of camera motion is a camera that rotates around
the center of projection. This situation is depicted in Figure 2.19. The
point C denotes the center of projection. A first image I1 is recorded and then the camera is rotated around the center of projection to record the second image I2. A scene point M is projected in the images I1 and I2 onto the image points m1 and m2, respectively. As the figure shows, 3D reconstruction from point correspondences is not possible in this situation. The underlying scene point M is to be found at the intersection of the projecting rays of m1 and m2, but these coincide.
Fig. 2.19 Two images I1 and I2 are recorded by a camera which rotates around the center of projection C. A scene point M is projected in the images onto the image points m1 and m2, respectively.
This conclusion can also be obtained algebraically by investigating
the reconstruction equations for this case. Recall from Section 2.3 that
the projection equations for two cameras in general position are:
$$\rho_1 m_1 = K_1 R_1^T (M - C_1) \quad\text{and}\quad \rho_2 m_2 = K_2 R_2^T (M - C_2).$$
If the camera performs a pure rotation around the center of projection, then $C_1 = C_2 = C$, and the projection equations become:
$$\rho_1 m_1 = K_1 R_1^T (M - C) \quad\text{and}\quad \rho_2 m_2 = K_2 R_2^T (M - C).$$
Solving the first equation for the scene point M and substituting this into the second equation, as in Section 2.4.1, gives
$$\rho_2 m_2 = \rho_1\, K_2 R_2^T R_1 K_1^{-1} m_1,$$
or, using $A = K_2 R_2^T R_1 K_1^{-1}$ as before, one gets
$$\rho_2 m_2 = \rho_1\, A\, m_1. \qquad (2.55)$$
This equation establishes a direct relationship between the pixel coordinates of corresponding image points. In fact, the equation states that the second image I2 is a projective transformation of the first image I1 with the invertible matrix A introduced in Section 2.4.1 as homography matrix. This should not come as a surprise, because looking back at Figure 2.19 and forgetting for a moment the scene point M, one sees that image I2 is a perspective projection of image I1 with C as the center of projection. Hence, searching for corresponding points between two images obtained by a camera that has only rotated around the center of projection is quite simple: One only needs to determine the invertible matrix A. If the internal parameters of the cameras are known and if the relative orientation $R = R_1 R_2^T$ between the two cameras is known as well, then the infinite homography A can be computed directly from its definition $A = K_2 R_2^T R_1 K_1^{-1}$. However, when the internal and external camera parameters are not known, then A can be computed from four pairs of corresponding points between the images. Indeed, each corresponding point pair $(m_1, m_2) \in I_1 \times I_2$ must satisfy the relation $A m_1 = \rho\, m_2$ for some non-zero scalar factor ρ, and thus brings a system of three linear equations in the nine unknown components of the matrix A and the unknown scalar ρ. As these equations are homogeneous, at least four point correspondences are needed to determine the matrix A up to a non-zero scalar factor. We will not pursue these computational issues any further here, since they are discussed in detail in Subsection 4.2.4 of Section 4 in Part 2 of the tutorial. Once the matrix A is known, formula (2.55) says that for every point m1 in the first image, $A m_1$ are homogeneous coordinates of the corresponding point m2 in the second image.
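For completeness, here is a minimal sketch of how such a homography can be estimated linearly from point correspondences. It is the standard direct linear transformation, shown only as an illustration; the detailed and more robust procedure the text refers to is in Subsection 4.2.4 of Part 2, and the function below is ours.

```python
import numpy as np

def homography_from_points(m1, m2):
    """Estimate the 3x3 homography A with  A m1 ~ m2  from correspondences.

    m1, m2 : (N, 3) arrays of corresponding points in homogeneous coordinates,
             with N >= 4.
    Returns A up to a non-zero scalar factor.
    """
    rows = []
    for p1, p2 in zip(m1, m2):
        # A m1 ~ m2  <=>  m2 x (A m1) = 0, which yields two independent linear
        # equations per correspondence in the nine entries of A.
        x, y, w = p2
        rows.append(np.concatenate((np.zeros(3), -w * p1,  y * p1)))
        rows.append(np.concatenate((w * p1, np.zeros(3), -x * p1)))
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 3)   # null vector of the system, reshaped to 3x3
```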
Observe that formula (2.55) actually expresses the epipolar relation between the images I1 and I2. Indeed, in its homogeneous form (2.16) the epipolar relation between corresponding image points is:
$$\rho_2 m_2 = \rho_1\, A\, m_1 + \rho_{e_2} e_2,$$
where $\rho_{e_2} e_2 = K_2 R_2^T (C_1 - C_2)$ is the epipole of the first camera in the second image. For a rotating camera, $C_1 = C_2$, which implies that $\rho_{e_2} e_2 = 0$, and formula (2.55) results. On the other hand, as the infinite homography A can be determined from point correspondences between the images, one could hope for an affine 3D reconstruction of the scene, as explained in Section 2.5.3. Unfortunately, as we already know, this is not possible, because in this case the affine reconstruction equations (2.25), viz. $\tilde{\rho}_1 m_1 = \tilde{M}$ and $\tilde{\rho}_2 m_2 = A \tilde{M} + e_2$, reduce to $\tilde{\rho}_1 m_1 = \tilde{M}$ and $\tilde{\rho}_2 m_2 = A \tilde{M}$. Taking $\tilde{M} = \tilde{\rho}_1 m_1$ from the first equation and substituting it into the second one in order to compute the unknown scalar $\tilde{\rho}_1$ now brings back formula (2.55), which does not allow $\tilde{\rho}_1$ to be uniquely determined. This proves algebraically that 3D reconstruction from point correspondences is not possible in case the images are obtained by a camera that rotates around the center of projection.
Although 3D reconstruction is not possible from images acquired with a camera that rotates around the center of projection, it is possible to recover the internal camera parameters from the images alone. Indeed, as explained in Section 2.6.4, the self-calibration equations (2.45) result from the fundamental relation (2.49), viz:
$$A^T\, (K_2 K_2^T)^{-1}\, A = (K_1 K_1^T)^{-1}, \qquad (2.56)$$
where A is the infinite homography introduced in Section 2.4.1. In case of a rotating camera the matrix A can be determined up to a non-zero scalar factor if at least four pairs of corresponding points are identified in the images. Let $\hat{A}$ be such an estimate for A. Then $A = \kappa \hat{A}$ for some unknown scalar κ. Substitution in formula (2.56) and solving for $K_2 K_2^T$ yields $K_2 K_2^T = \kappa^2\, \hat{A}\, (K_1 K_1^T)\, \hat{A}^T$. If the internal camera parameters have remained constant during camera rotation, then $K_1 = K_2 = K$ and the self-calibration equations reduce to $K K^T = \kappa^2\, \hat{A}\, (K K^T)\, \hat{A}^T$. Since $\hat{A}$ is only determined up to a non-zero scalar factor, one may assume without loss of generality that its determinant equals 1. Taking determinants of both sides of the self-calibration equations, it follows that $\kappa^2 = 1$, because K is an invertible matrix and generally has non-unit determinant. Consequently, the self-calibration equations become $K K^T = \hat{A}\, (K K^T)\, \hat{A}^T$ and they yield a system of six linear equations in the five unknown entries of the symmetric matrix $K K^T$. The calibration matrix K itself can be recovered from $K K^T$ by Cholesky factorization [3], as explained in Section 2.6.6. In summary, with a camera that rotates around the center of projection 3D reconstruction of the scene is not possible, but self-calibration is [5].
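The sketch below turns this reasoning into a small routine: given homography estimates between the images of a rotating camera with constant internal parameters, it normalizes each estimate to unit determinant, solves the linear system $K K^T = \hat{A}\,(K K^T)\,\hat{A}^T$ for the symmetric matrix $K K^T$, and factors out an upper-triangular K. It is only an illustrative Python/NumPy implementation under the stated assumptions (data clean enough for $K K^T$ to come out positive definite), not the procedure of Section 2.6.6 itself.

```python
import numpy as np

def calibrate_from_rotation_homographies(H_list):
    """Self-calibration of a rotating camera with constant internal parameters.

    H_list : iterable of 3x3 homographies mapping image 1 to image j, each an
             estimate of the infinite homography up to a non-zero scale factor.
    Returns the calibration matrix K (upper triangular, K[2,2] = 1).
    """
    # Basis of symmetric 3x3 matrices parameterizing X = K K^T (6 unknowns).
    basis = []
    for i in range(3):
        for j in range(i, 3):
            E = np.zeros((3, 3))
            E[i, j] = E[j, i] = 1.0
            basis.append(E)

    blocks = []
    for H in H_list:
        A = H / np.cbrt(np.linalg.det(H))      # det(A) = 1, so kappa^2 = 1
        # The constraint X - A X A^T = 0 is linear in the entries of X:
        blocks.append(np.column_stack([(A @ E @ A.T - E).ravel() for E in basis]))
    M = np.vstack(blocks)                      # (9 * number of views) x 6

    _, _, Vt = np.linalg.svd(M)                # null vector gives the entries of X
    x = Vt[-1]
    X = np.zeros((3, 3))
    k = 0
    for i in range(3):
        for j in range(i, 3):
            X[i, j] = X[j, i] = x[k]
            k += 1
    if X[2, 2] < 0:                            # fix the overall sign of X = K K^T
        X = -X

    # Recover an upper-triangular K with K K^T = X (Cholesky-style factorization).
    L = np.linalg.cholesky(np.linalg.inv(X))   # lower triangular, L L^T = X^{-1}
    K = np.linalg.inv(L).T                     # upper triangular, K K^T = X
    return K / K[2, 2]
```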
2.7.3 Internally Calibrated Cameras and the Essential
Matrix
In some applications, the internal parameters of the cameras may be
known, through a (self-)calibration procedure applied prior to the cur-
rent processing of new input images. It will be demonstrated in this
section that when the internal parameters of the cameras are known,
but no information about the (absolute or relative) position and orien-
tation of the cameras is available, a metric 3D reconstruction from two
images is feasible.
2.7.3.1 Known Camera Matrices and 3D Reconstruction
Consider again the projection equations (2.20) for two cameras observ-
ing a static scene, as used in Section 2.5:
$$\rho_1 m_1 = K_1 R_1^T (M - C_1) \quad\text{and}\quad \rho_2 m_2 = K_2 R_2^T (M - C_2). \qquad (2.57)$$
If the calibration matrices K1 and K2 are known, then the (unbiased) perspective projections of the scene point M in each image plane can be retrieved as $q_1 = K_1^{-1} m_1$ and $q_2 = K_2^{-1} m_2$, respectively. By multiplying the metric 3D reconstruction equations (2.24) derived in Section 2.5.2, viz.:
$$\bar{\rho}_1 m_1 = K_1 \bar{M} \quad\text{and}\quad \bar{\rho}_2 m_2 = K_2 R_2^T R_1 \bar{M} + e_2,$$
on the left with $K_1^{-1}$ and $K_2^{-1}$, respectively, they simplify to
$$\bar{\rho}_1 q_1 = \bar{M} \quad\text{and}\quad \bar{\rho}_2 q_2 = R_2^T R_1 \bar{M} + q_e, \qquad (2.58)$$
where $q_e = K_2^{-1} e_2$ is the (unbiased) perspective projection of the position C1 of the first camera onto the image plane of the second camera. It is interesting to observe that, since the epipole e2 is defined by formula (2.15) in Section 2.4.1 as $\rho_{e_2} e_2 = K_2 R_2^T (C_1 - C_2)$, it follows that $\rho_{e_2} q_e = R_2^T (C_1 - C_2)$. In other words, $\rho_{e_2} q_e$ gives the position C1 of the first camera with respect to the camera-centered reference frame of the second camera, i.e., the relative position. And, as $\rho_{e_2}$ is the unknown scale of the metric reconstruction $\bar{M}$, $q_e$ in fact represents the translation direction between the two cameras in the metric 3D reconstruction of the scene.
Thus, the rotation matrix $R = R_2^T R_1$ is the only unknown factor in the 3D reconstruction equations (2.58) that separates us from a metric 3D reconstruction of the scene. In fact, R represents the orientation of the first camera in the camera-centered reference frame of the second one (cf. Section 2.2.4), i.e., the relative orientation of the cameras. It will be demonstrated below that the rotation matrix R can be recovered from the so-called essential matrix of this pair of calibrated images. But first the notion of essential matrix has to be defined.
2.7.3.2 The Essential Matrix
Recall from Section 2.4.1 that the epipolar relation (2.18) between corresponding image points is found by solving the first projection equation in formula (2.57) for M and substituting the resulting expression in the second projection equation, thus yielding
$$\rho_2 m_2 = \rho_1\, K_2 R_2^T R_1 K_1^{-1} m_1 + K_2 R_2^T (C_1 - C_2)$$
(cf. formula (2.14) in Section 2.4.1). Multiplying both sides of this equation on the left by $K_2^{-1}$ yields
$$\rho_2\, K_2^{-1} m_2 = \rho_1\, R_2^T R_1 K_1^{-1} m_1 + R_2^T (C_1 - C_2).$$
Introducing $q_1 = K_1^{-1} m_1$, $q_2 = K_2^{-1} m_2$, and $R = R_2^T R_1$ as above, one gets
$$\rho_2 q_2 = \rho_1 R q_1 + R_2^T (C_1 - C_2).$$
Let us denote the last term in this equation by t, then $t = R_2^T (C_1 - C_2) = \rho_{e_2} q_e$ represents the relative position of the first camera with respect to the second one, as explained above. The epipolar relation then is
$$\rho_2 q_2 = \rho_1 R q_1 + t. \qquad (2.59)$$
From an algebraic point of view, this equation expresses that the 3-vectors $q_2$, $R q_1$, and t are linearly dependent, and hence the determinant $|q_2\;\; t\;\; R q_1| = 0$. Following the same reasoning as in Section 2.4.1,
$$|q_2\;\; t\;\; R q_1| = q_2^T\, (t \times R q_1) = q_2^T\, [t]_\times R\, q_1,$$
where $[t]_\times$ is the skew-symmetric 3×3 matrix that represents the cross product with the 3-vector t. The 3×3 matrix $E = [t]_\times R$ is known in the literature as the essential matrix of the (calibrated) image pair [11] and the epipolar relation between the calibrated images is expressed by the equation:
$$q_2^T\, E\, q_1 = 0. \qquad (2.60)$$
Given enough corresponding projections q1 and q2, the essential matrix E can be recovered up to a non-zero scalar factor from this relation in a linear manner. In fact, since $E = [t]_\times R$ with t a 3-vector and R a 3×3 rotation matrix, the essential matrix E has six degrees of freedom. Consequently, five corresponding projections q1 and q2 suffice to compute E up to a non-zero scalar factor [13, 12]. How this can be achieved in practice will be discussed in Subsection 4.6.3 of Section 4 in Part 2 of the tutorial.
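Since relation (2.60) is linear in the entries of E, a simple (non-minimal) estimate can already be obtained from eight or more correspondences by solving a homogeneous linear system and then enforcing the singular-value structure derived in the next subsections. The sketch below is this linear variant only, written by us for illustration; it is not the five-point algorithm of [13, 12].

```python
import numpy as np

def estimate_essential_linear(q1, q2):
    """Linear estimation of the essential matrix from calibrated correspondences.

    q1, q2 : (N, 3) arrays of corresponding directions q = K^{-1} m, with N >= 8.
    Returns an estimate of E up to a non-zero scalar factor.
    """
    # Each correspondence gives one homogeneous equation  q2^T E q1 = 0,
    # which is linear in the nine entries of E.
    A = np.array([np.outer(p2, p1).ravel() for p1, p2 in zip(q1, q2)])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Project onto the set of essential matrices: two equal singular values, third zero.
    U, s, Vt = np.linalg.svd(E)
    sigma = 0.5 * (s[0] + s[1])
    return U @ np.diag([sigma, sigma, 0.0]) @ Vt
```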
2.7.3.3 The Mathematical Relationship Between E and F
For the sake of completeness, it might be useful to highlight the math-
ematical relationship between the essential matrix E and the funda-
mental matrix F. Since the content of this subsection is not essential
for understanding the remainder of the text, readers who are primarily
interested in the practice of 3D reconstruction can skip this subsection.
Recall from formula (2.60) that $q_2^T E q_1 = 0$ describes the epipolar relation between corresponding projections q1 and q2 in the image planes of the two cameras. Substituting $q_1 = K_1^{-1} m_1$ and $q_2 = K_2^{-1} m_2$ in formula (2.60) thus yields the epipolar relation between the two images in terms of the image (i.e., pixel) coordinates m1 and m2, viz.:
$$m_2^T\, K_2^{-T} E K_1^{-1}\, m_1 = 0. \qquad (2.61)$$
Comparison with the common form of the epipolar relation $m_2^T F m_1 = 0$ shows that the fundamental matrix F is a scalar multiple of the 3×3 matrix $K_2^{-T} E K_1^{-1}$. More precisely, recall from Section 2.4.1 that $F = [e_2]_\times A$ and that
$$m_2^T F m_1 = m_2^T [e_2]_\times A\, m_1 = m_2^T (e_2 \times A m_1) = |m_2\;\; e_2\;\; A m_1|.$$
Substituting $m_1 = K_1 q_1$, $m_2 = K_2 q_2$, and $e_2 = K_2 q_e$ in the previous expression gives
$$m_2^T F m_1 = |K_2 q_2\;\; K_2 q_e\;\; A K_1 q_1| = |K_2|\; |q_2\;\; q_e\;\; K_2^{-1} A K_1 q_1|.$$
Using $A = K_2 R_2^T R_1 K_1^{-1}$, $R = R_2^T R_1$, and $t = \rho_{e_2} q_e$, the right-hand side simplifies to
$$m_2^T F m_1 = \frac{|K_2|}{\rho_{e_2}}\, |q_2\;\; t\;\; R q_1| = \frac{|K_2|}{\rho_{e_2}}\, q_2^T (t \times R q_1) = \frac{|K_2|}{\rho_{e_2}}\, q_2^T [t]_\times R\, q_1 = \frac{|K_2|}{\rho_{e_2}}\, q_2^T E\, q_1.$$
Substituting $q_1 = K_1^{-1} m_1$ and $q_2 = K_2^{-1} m_2$ again, one finally gets
$$m_2^T F m_1 = \frac{|K_2|}{\rho_{e_2}}\, m_2^T K_2^{-T} E K_1^{-1} m_1 = m_2^T \left( \frac{|K_2|}{\rho_{e_2}}\, K_2^{-T} E K_1^{-1} \right) m_1.$$
Because this equality must hold for all 3-vectors m1 and m2, it follows that
$$F = \frac{|K_2|}{\rho_{e_2}}\, K_2^{-T} E K_1^{-1}; \quad\text{or equivalently,}\quad E = \frac{\rho_{e_2}}{|K_2|}\, K_2^T F K_1.$$
This precise relationship exists between the theoretical definitions of the fundamental matrix F and the essential matrix E only. In practice, however, the fundamental matrix F and the essential matrix E can only be recovered up to a non-zero scalar factor from point correspondences between the images. Therefore, it suffices to compute such an estimate for one of them and to use the relevant formula:
$$\hat{F} = K_2^{-T} \hat{E} K_1^{-1} \quad\text{or}\quad \hat{E} = K_2^T \hat{F} K_1 \qquad (2.62)$$
as an estimate for the other one, as could be inferred directly from formula (2.61).
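In code, the practical conversion of formula (2.62) is a one-liner in each direction. The functions below are merely our illustration and assume the inputs are NumPy arrays; both results are only defined up to a non-zero scalar factor.

```python
import numpy as np

def essential_from_fundamental(F, K1, K2):
    """Formula (2.62): essential-matrix estimate from a fundamental-matrix estimate."""
    return K2.T @ F @ K1

def fundamental_from_essential(E, K1, K2):
    """Formula (2.62) in the other direction: F_hat = K2^{-T} E_hat K1^{-1}."""
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)
```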
2.7.3.4 Recovering the Relative Camera Setup from the
Essential Matrix
Suppose an estimate $\hat{E}$ for the essential matrix has been computed from point correspondences between the images. We will now demonstrate how the relative setup of the cameras (i.e., the rotation matrix R and (the direction of) the translation vector t defining the essential matrix E) can be recovered from $\hat{E}$. Due to the homogeneous nature of the epipolar relation (2.60), $\hat{E}$ is a non-zero scalar multiple of the essential matrix $E = [t]_\times R$ defined above. Therefore, $\hat{E} = \lambda\, [t]_\times R$ for some non-zero scalar factor λ. Observe that $\hat{E}$ is a 3×3 matrix of rank 2. Indeed, $[t]_\times$ is skew-symmetric and thus has rank 2 if t is a non-zero 3-vector, whereas R is a rotation matrix and hence it is invertible. Moreover,
$$\hat{E}^T t = \lambda\, ([t]_\times R)^T t = \lambda\, R^T ([t]_\times)^T t = -\lambda\, R^T [t]_\times t = -\lambda\, R^T (t \times t) = 0,$$
which implies that the 3-vector t belongs to the left nullspace of $\hat{E}$.
Let $\hat{E} = U \Sigma V^T$ be the singular value decomposition of the matrix $\hat{E}$ [3]. Then Σ is a diagonal matrix of rank 2 and the left nullspace of $\hat{E}$ is spanned by the 3-vector $u_3$ constituting the third column of the orthogonal matrix U. As t belongs to the left nullspace of $\hat{E}$, t must be a scalar multiple of $u_3$. This already yields t, up to a scale.
Write $t = \mu\, u_3$ for some non-zero scalar µ. Then $\hat{E} = \lambda\, [t]_\times R = \kappa\, [u_3]_\times R$ with $\kappa = \lambda \mu$ a non-zero scalar factor. Furthermore, recall from linear algebra that the singular values of $\hat{E}$ are the square roots of the eigenvalues of the symmetric matrix $\hat{E} \hat{E}^T$. Now
$$\hat{E} \hat{E}^T = (\kappa\, [u_3]_\times R)(\kappa\, [u_3]_\times R)^T = \kappa^2\, [u_3]_\times R R^T ([u_3]_\times)^T = -\kappa^2\, ([u_3]_\times)^2,$$
where the last equality follows from the fact that R is a rotation matrix (so that $R R^T = I_3$, the 3×3 identity matrix) and that $[u_3]_\times$ is a skew-symmetric matrix (i.e., $([u_3]_\times)^T = -[u_3]_\times$). Because $U = [u_1\; u_2\; u_3]$ is an orthogonal matrix, the first two columns $u_1$ and $u_2$ of U are orthogonal to the third column $u_3$ and, as they are all unit vectors, it follows that
$$([u_3]_\times)^2 u_1 = u_3 \times (u_3 \times u_1) = -u_1 \quad\text{and}\quad ([u_3]_\times)^2 u_2 = u_3 \times (u_3 \times u_2) = -u_2.$$
Consequently,
$$\hat{E} \hat{E}^T u_1 = -\kappa^2\, ([u_3]_\times)^2 u_1 = \kappa^2 u_1 \quad\text{and}\quad \hat{E} \hat{E}^T u_2 = -\kappa^2\, ([u_3]_\times)^2 u_2 = \kappa^2 u_2.$$
In particular, $u_1$ and $u_2$ are eigenvectors of $\hat{E} \hat{E}^T$ with eigenvalue $\kappa^2$. Furthermore, $u_3$ is an eigenvector of $\hat{E} \hat{E}^T$ with eigenvalue 0, because
$$\hat{E} \hat{E}^T u_3 = -\kappa^2\, ([u_3]_\times)^2 u_3 = -\kappa^2\, u_3 \times (u_3 \times u_3) = 0.$$
Together this proves that the diagonal matrix Σ in the singular value decomposition of the matrix $\hat{E}$ equals
$$\Sigma = \begin{pmatrix} |\kappa| & 0 & 0 \\ 0 & |\kappa| & 0 \\ 0 & 0 & 0 \end{pmatrix} = |\kappa| \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad (2.63)$$
where $|\kappa|$ denotes the absolute value of κ. If we denote the columns of the orthogonal matrix V by $v_1$, $v_2$, and $v_3$, respectively, then $V = [v_1\; v_2\; v_3]$ and the singular value decomposition of $\hat{E}$ is given by:
$$\hat{E} = U \Sigma V^T = |\kappa|\, u_1 v_1^T + |\kappa|\, u_2 v_2^T.$$
As U and V are orthogonal matrices, and because $u_3$ and $v_3$ do not actively participate in the singular value decomposition of the rank-2 matrix $\hat{E}$, we can infer them to be $u_3 = u_1 \times u_2$ and $v_3 = v_1 \times v_2$.
On the other hand, $\hat{E} = \kappa\, [u_3]_\times R$ and our aim is to compute the unknown rotation matrix R. To this end, we will re-express the skew-symmetric matrix $[u_3]_\times$ in terms of the orthogonal matrix $U = [u_1\; u_2\; u_3]$. Recall that:
$$[u_3]_\times u_1 = u_3 \times u_1 = u_2, \qquad [u_3]_\times u_2 = u_3 \times u_2 = -u_1 \qquad\text{and}\qquad [u_3]_\times u_3 = u_3 \times u_3 = 0;$$
or, in matrix form,
$$[u_3]_\times U = [u_3]_\times\, [u_1\; u_2\; u_3] = [u_2\;\; -u_1\;\; 0] = [u_1\; u_2\; u_3] \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} = U \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
Consequently, $[u_3]_\times U = U Z$, or, equivalently, $[u_3]_\times = U Z U^T$ with
$$Z = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},$$
since U is an orthogonal matrix. The matrix $\hat{E} = \kappa\, [u_3]_\times R$ can now be rewritten as $\hat{E} = \kappa\, U Z U^T R$. Combining this expression with the singular value decomposition $\hat{E} = U \Sigma V^T$ yields $\kappa\, U Z U^T R = U \Sigma V^T$.
By some algebraic manipulations this equality can be simplified to
$$\begin{aligned}
\kappa\, U Z U^T R = U \Sigma V^T
\;&\Longleftrightarrow\; \kappa\, Z U^T R = \Sigma V^T &&\text{(multiplying on the left with $U^T$)}\\
\;&\Longleftrightarrow\; \kappa\, Z U^T = \Sigma V^T R^T &&\text{(multiplying on the right with $R^T$)}\\
\;&\Longleftrightarrow\; \kappa\, U Z^T = R\, V\, \Sigma^T &&\text{(taking transposes of both sides)}\\
\;&\Longleftrightarrow\; \kappa\, [u_1\; u_2\; u_3] \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} = |\kappa|\, R\, [v_1\; v_2\; v_3] \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} &&\text{(expanding $U$, $Z$, $V$ and $\Sigma$)}\\
\;&\Longleftrightarrow\; \kappa\, [-u_2\;\; u_1\;\; 0] = |\kappa|\, R\, [v_1\;\; v_2\;\; 0] &&\text{(matrix multiplication)}\\
\;&\Longleftrightarrow\; R v_1 = -\epsilon\, u_2 \;\text{ and }\; R v_2 = \epsilon\, u_1 &&\text{(equality of matrices)}
\end{aligned}$$
where $\epsilon = \frac{\kappa}{|\kappa|}$ equals 1 if κ is positive and −1 if κ is negative. Because R is a rotation matrix and since $v_3 = v_1 \times v_2$ and $u_3 = u_1 \times u_2$,
$$R v_3 = R (v_1 \times v_2) = (R v_1) \times (R v_2) = (-\epsilon\, u_2) \times (\epsilon\, u_1) = \epsilon^2\, u_1 \times u_2 = u_3.$$
But then
$$R\, [v_1\; v_2\; v_3] = [-\epsilon\, u_2\;\; \epsilon\, u_1\;\; u_3] = [u_1\; u_2\; u_3] \begin{pmatrix} 0 & \epsilon & 0 \\ -\epsilon & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$
or equivalently,
$$R\, V = U \begin{pmatrix} 0 & \epsilon & 0 \\ -\epsilon & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
This yields the following formula for the rotation matrix R:
$$R = U \begin{pmatrix} 0 & \epsilon & 0 \\ -\epsilon & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} V^T.$$
Observe that U and V are the orthogonal matrices in the singular value decomposition $\hat{E} = U \Sigma V^T$ of the matrix $\hat{E}$ which is computed from point correspondences in the images, and thus are known. The scalar $\epsilon$, on the other hand, equals $\epsilon = \frac{\kappa}{|\kappa|}$ where κ is the unknown scalar factor in $\hat{E} = \kappa\, [u_3]_\times R$ and hence is not known. But, as $\epsilon$ can take only the values 1 and −1, the previous formula yields two possible solutions for the rotation matrix R, viz.:
$$\hat{R} = U \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} V^T \quad\text{or}\quad \hat{R}' = U \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} V^T. \qquad (2.64)$$
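The derivation above translates directly into a few lines of code: compute the SVD of $\hat{E}$, read off $u_3$ as the translation direction (up to sign), and form the two candidate rotations of formula (2.64). The sketch below is our illustration of this recipe; the sign fix on the singular vectors simply ensures that the candidates are proper rotations (determinant +1) rather than reflections.

```python
import numpy as np

def decompose_essential(E_hat):
    """Candidate relative camera setups from an essential-matrix estimate.

    Returns (u3, R_a, R_b): the translation direction up to sign and the two
    candidate rotations of formula (2.64). The four candidate setups are then
    (+u3, R_a), (+u3, R_b), (-u3, R_a) and (-u3, R_b), cf. (2.68)/(2.70).
    """
    U, _, Vt = np.linalg.svd(E_hat)
    # Flipping the third left/right singular vector does not change U S V^T
    # (the third singular value is zero) but guarantees det(U) = det(V) = +1,
    # so that the matrices below are rotations and not reflections.
    if np.linalg.det(U) < 0:
        U[:, 2] *= -1.0
    if np.linalg.det(Vt) < 0:
        Vt[2, :] *= -1.0
    u3 = U[:, 2]                       # spans the left nullspace of E_hat
    W = np.array([[0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])    # the epsilon = +1 matrix in (2.64)
    R_a = U @ W @ Vt                   # epsilon = +1
    R_b = U @ W.T @ Vt                 # epsilon = -1
    return u3, R_a, R_b
```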
2.7.3.5 Euclidean 3D Reconstruction for a Known
Inter-Camera Distance
With the conclusion that the unknown relative rotation matrix R must be one of the two matrices in (2.64), it is now proven that a metric 3D reconstruction of the scene can be computed from two images by the metric reconstruction equations (2.58) (still assuming that the calibration matrices of both cameras are known). If the distance between the two camera positions C1 and C2 is known too, then a Euclidean
3D reconstruction of the scene can be computed. This is quite evident
from the fact that this distance allows us to fix the scale of the metric
reconstruction. In the remainder of this section, we will give a more
formal proof, as it also allows us to dwell further on the ambiguities
that persist.
Consider again the projection equations (2.57) for two cameras
observing a static scene:
$$\rho_1 m_1 = K_1 R_1^T (M - C_1) \quad\text{and}\quad \rho_2 m_2 = K_2 R_2^T (M - C_2).$$
If the calibration matrices K1 and K2 are known, then the (unbiased) perspective projections $q_1 = K_1^{-1} m_1$ and $q_2 = K_2^{-1} m_2$ of the scene point M in the respective image planes can be retrieved. Multiplying both equations on the left with $K_1^{-1}$ and $K_2^{-1}$, respectively, yields
$$\rho_1 q_1 = R_1^T (M - C_1) \quad\text{and}\quad \rho_2 q_2 = R_2^T (M - C_2). \qquad (2.65)$$
The right-hand side of the first equation, viz. $R_1^T (M - C_1)$, gives the 3D coordinates of the scene point M with respect to the camera-centered reference frame of the first camera (cf. Section 2.2.4). And similarly, the right-hand side of the second equation, viz. $R_2^T (M - C_2)$, gives the 3D coordinates of the scene point M with respect to the camera-centered reference frame of the second camera. As argued in Section 2.5.1, it is not possible to recover absolute information about the cameras' external parameters in the real world from the previous equations alone. Therefore, the strategy proposed in Section 2.5.1 is to reconstruct the scene with respect to the camera-centered reference frame of the first camera, thus yielding the Euclidean 3D reconstruction $M' = R_1^T (M - C_1)$. Solving this expression for M, gives $M = C_1 + R_1 M'$ and substituting these expressions in formulas (2.65) yields
$$\rho_1 q_1 = M' \quad\text{and}\quad \rho_2 q_2 = R_2^T R_1 M' + R_2^T (C_1 - C_2).$$
Using $R = R_2^T R_1$ and $t = R_2^T (C_1 - C_2)$ as in the previous subsections, one gets the following Euclidean 3D reconstruction equations for calibrated images:
$$\rho_1 q_1 = M' \quad\text{and}\quad \rho_2 q_2 = R M' + t. \qquad (2.66)$$
It was demonstrated in the previous subsection that, if an estimate $\hat{E}$ of the essential matrix E of the calibrated image pair is available, then an estimate for the rotation matrix R and for the direction of the translation vector t can be derived from a singular value decomposition of $\hat{E}$. More precisely, if $\hat{E} = U \Sigma V^T$ is a singular value decomposition of $\hat{E}$, then t is a scalar multiple of the unit 3-vector $u_3$ constituting the third column of the orthogonal matrix U, and R is one of the two matrices $\hat{R}$ or $\hat{R}'$ defined by the expressions in (2.64). Now, since $u_3$ is a unit vector, there are two possibilities for t, namely:
$$t = \|t\|\, u_3 \quad\text{or}\quad t = -\|t\|\, u_3. \qquad (2.67)$$
Together, expressions (2.64) and (2.67) yield four possible, but different, candidates for the relative setup of the two cameras, viz.:
$$(\hat{t}, \hat{R}), \quad (\hat{t}, \hat{R}'), \quad (-\hat{t}, \hat{R}), \quad\text{and}\quad (-\hat{t}, \hat{R}'). \qquad (2.68)$$
Observe that $\|t\|$ actually is the Euclidean distance between the two camera positions C1 and C2. Indeed, $t = R_2^T (C_1 - C_2)$ and thus $\|t\|^2 = t^T t = \|C_1 - C_2\|^2$. Hence, if the distance between the camera positions C1 and C2 is known, then each of the four possibilities in (2.68) together with the reconstruction equations (2.66) yields a Euclidean 3D reconstruction of the scene and one of these 3D reconstructions corresponds to a description of the scene in coordinates with respect to the camera-centered reference frame of the first camera. In particular, in the Euclidean reconstruction M′ of the scene computed from formula (2.66) the first camera is positioned at the origin and its orientation is given by the 3×3 identity matrix $I_3$. The position of the second camera in the 3D reconstruction M′, on the other hand, is given by
$$R_1^T (C_2 - C_1) = R_1^T R_2\, R_2^T (C_2 - C_1) = -\bigl(R_2^T R_1\bigr)^T R_2^T (C_1 - C_2) = -R^T t$$
and the orientation of the second camera in the 3D reconstruction M′ is given by $R_1^T R_2 = R^T$. The four possibilities for a setup of the cameras which are compatible with an estimated essential matrix $\hat{E}$ and which are listed in (2.68) correspond to four mirror-symmetric configurations, as depicted in Figure 2.20.
Fig. 2.20 From the singular value decomposition of (an estimate $\hat{E}$ of) the essential matrix E, four possibilities for the relative translation and rotation between the two cameras can be computed. In the figure, the first camera is depicted in grey and the other four cameras correspond to these four different possibilities. In particular, with the notations used in the text (cf. formula (2.68)), the four possible solutions for the camera setup, viz. $(\hat{t}, \hat{R})$, $(\hat{t}, \hat{R}')$, $(-\hat{t}, \hat{R})$ and $(-\hat{t}, \hat{R}')$, correspond to the blue, yellow, red and green camera, respectively.
Since the two possibilities in formula (2.67) differ only in sign, changing $\hat{t}$ into $-\hat{t}$ in the relative setup of the cameras results in a reversal of the baseline of the camera pair. And, it follows from the expressions in (2.64) that
$$\hat{R}' = \hat{R}\; V \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix} V^T,$$
where the matrix product
$$V \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix} V^T$$
in the right-hand side of the equation represents a rotation through 180° about the line joining the centers of projection. Hence, changing $\hat{R}$ into $\hat{R}'$ in the relative setup of the cameras results in rotating the second camera through 180° about the baseline of the camera pair. Moreover, given a pair m1 and m2 of corresponding points between the images, the reconstructed 3D point M′ will be in front of both cameras in only one of the four possibilities for the camera setup computed from $\hat{E}$. Hence, to identify the correct camera setup among the candidates (2.68) it suffices to test for a single reconstructed point M′ in which of the four possibilities it is in front of both the cameras (i.e., for which of the four candidates the projective depths ρ1 and ρ2 of M′ in the Euclidean reconstruction equations (2.66) are both positive) [6, 10].
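This test is easy to carry out in practice. Assuming the candidates produced by the decomposition sketched earlier (the helper functions and names below are ours, for illustration only), one triangulates a single correspondence with each candidate setup via equations (2.66)/(2.69) and keeps the candidate for which both projective depths are positive:

```python
import numpy as np
from itertools import product

def projective_depths(q1, q2, R, t):
    """Solve  rho2 * q2 = rho1 * R q1 + t  for the two depths (least squares)."""
    A = np.column_stack((R @ q1, -q2))   # 3 x 2 system in (rho1, rho2)
    rho, *_ = np.linalg.lstsq(A, -t, rcond=None)
    return rho[0], rho[1]

def select_camera_setup(q1, q2, u3, R_a, R_b):
    """Pick the candidate setup of (2.68)/(2.70) that places the reconstructed
    point in front of both cameras, i.e. with both depths positive."""
    for t, R in product((u3, -u3), (R_a, R_b)):
        rho1, rho2 = projective_depths(q1, q2, R, t)
        if rho1 > 0 and rho2 > 0:
            return R, t
    raise ValueError("no candidate places the point in front of both cameras")
```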
2.7.3.6 Metric 3D Reconstruction from Two Calibrated
Images
Finally, if the distance between the camera positions C1 and C2 is not known, then the transformation $\bar{M} = \frac{1}{\|t\|}\, M' = \frac{1}{\|t\|}\, R_1^T (M - C_1)$ still yields a metric 3D reconstruction of the scene at scale $\|t\| = \|C_1 - C_2\|$. Reconstruction equations for this metric reconstruction follow immediately from equations (2.66) by dividing both equations by $\|t\|$, viz.:
$$\bar{\rho}_1 q_1 = \bar{M} \quad\text{and}\quad \bar{\rho}_2 q_2 = R \bar{M} + u, \qquad (2.69)$$
where $u = \frac{t}{\|t\|}$. It follows from formula (2.67) that the 3-vector $u_3$ constituting the third column of the matrix U in a singular value decomposition of an estimate $\hat{E}$ of the essential matrix E yields two candidates for the unit vector u in the metric 3D reconstruction equations (2.69), viz. $u_3$ and $-u_3$. Together with the two possibilities for the rotation matrix R given by the expressions in (2.64), one gets the following four candidates for the relative setup of the two cameras in the metric reconstruction:
$$(u_3, \hat{R}), \quad (u_3, \hat{R}'), \quad (-u_3, \hat{R}), \quad\text{and}\quad (-u_3, \hat{R}'). \qquad (2.70)$$
As before, the correct camera setup can easily be identified among the candidates (2.70) by testing for a single reconstructed point $\bar{M}$ in which of the four possibilities it is in front of both the cameras (i.e., for which of the four candidates the projective depths $\bar{\rho}_1$ and $\bar{\rho}_2$ of $\bar{M}$ in the metric reconstruction equations (2.69) are both positive).
It is important to note the difference between these metric 3D recon-
structions and the metric 3D reconstruction described in Section 2.5.2:
Apart from different setups of the cameras, all four possible metric
reconstructions described here differ from the scene by a fixed scale,
which is the (unknown) distance between the two camera positions;
whereas for the metric 3D reconstruction in Section 2.5.2, nothing is
known or guaranteed about the actual scale of the reconstruction with
respect to the original scene.
References
[1] J. Y. Bouguet, “Camera calibration toolbox for matlab,” http://www.vision.caltech.edu/bouguetj/calib_doc/.
[2] O. Faugeras, “What can be seen in three dimensions with an uncalibrated
stereo rig,” in Computer Vision — (ECCV’92), pp. 563–578, vol. LNCS 588,
Berlin/Heidelberg/New York/Tokyo: Springer-Verlag, 1992.
[3] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD, USA: The Johns Hopkins University Press, 1996.
[4] R. Hartley, “Estimation of relative camera positions for uncalibrated cam-
eras,” in Computer Vision — (ECCV’92), pp. 579–587, vol. LNCS 588,
Berlin/Heidelberg/New York/Tokyo: Springer-Verlag, 1992.
[5] R. Hartley, “Self-calibration from multiple views with a rotating cam-
era,” in Computer Vision — (ECCV’94), pp. 471–478, vol. LNCS 800/801,
Berlin/Heidelberg/New York/Tokyo: Springer-Verlag, 1994.
[6] R. Hartley, “Cheirality,” International Journal of Computer Vision, vol. 26,
no. 1, pp. 41–61, 1998.
[7] R. Hartley and S. B. Kang, “Parameter-free radial distortion correction
with center of distortion estimation,” IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, vol. 29, no. 8, pp. 1309–1321, doi:10.1109/
TPAMI.2007.1147, June 2007.
[8] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision.
Cambridge University Press, ISBN: 0521540518, 2004.
[9] F. Kahl, B. Triggs, and K. Åström, “Critical motions for auto-calibration when
some intrinsic parameters can vary,” Journal of Mathematical Imaging and
Vision, vol. 13, no. 2, pp. 131–146, October 2000.
[10] H. Longuet-Higgins, “A computer algorithm for reconstructing a scene from
two projections,” Nature, vol. 293, no. 10, pp. 133–135, 1981.
[11] H. C. Longuet-Higgins, “A computer algorithm for reconstructing a scene from
two projections,” Nature, vol. 293, pp. 133–135, 1981.
[12] D. Nistér, “An efficient solution to the five-point relative pose problem,” IEEE
Transactions On Pattern Analysis and Machine Intelligence, vol. 26, no. 6,
pp. 756–777, June 2004.
[13] J. Philip, “A non-iterative algorithm for determining all essential matrices cor-
responding to five point Pairs,” The Photogrammetric Record, vol. 15, no. 88,
pp. 589–599, 1996.
[14] P. Sturm, “Critical motion sequences and conjugacy of ambiguous euclidean
reconstructions,” in Proceedings of the 10th Scandinavian Conference on Image
Analysis, Lappeenranta, Finland, vol. I, (M. Frydrych, J. Parkkinen, and
A. Visa, eds.), pp. 439–446, June 1997.
[15] P. Sturm, “Critical motion sequences for the self-calibration of cameras and
stereo systems with variable focal length,” in British Machine Vision Confer-
ence, Nottingham, England, pp. 63–72, September 1999.
[16] B. Triggs, “Autocalibration and the absolute quadric,” in CVPR ’97: Pro-
ceedings of the 1997 Conference on Computer Vision and Pattern Recognition
(CVPR ’97), pp. 609–614, Washington, DC, USA: IEEE Computer Society,
1997.
[17] R. Y. Tsai, “A versatile camera calibration technique for high-accuracy 3D
machine vision metrology using off-the-shelf TV cameras and lenses,” Radiom-
etry, pp. 221–244, 1992.
[18] Z. Zhang, “Flexible camera calibration by viewing a plane from unknown ori-
entations,” in ICCV, pp. 666–673, 1999.