Robust Autocalibration for a Surveillance Camera Network
Jingchen Liu, Robert T. Collins, and Yanxi Liu
The Pennsylvania State University
University Park, PA 16802, USA
{jingchen, rcollins, yanxi}@cse.psu.edu
Abstract
We propose a novel approach for multi-camera autocali-
bration by observing multiview surveillance video of pedes-
trians walking through the scene. Unlike existing methods,
we do NOT require tracking or explicit correspondences of
the same person across time/views. Instead, we take noisy
foreground blobs as the only input and rely on a joint opti-
mization framework with robust statistics to achieve accu-
rate calibration under challenging scenarios. First, each
individual camera is roughly calibrated into its local World
Coordinate System (lWCS) based on analysis of relative 3D
pedestrian height distribution. Then, all lWCSs are itera-
tively registered with respect to a shared global World Co-
ordinate System (gWCS) by incorporating robust matching
with a partial Direct Linear Transform (pDLT). As demon-
strated by extensive evaluation, our algorithm achieves sat-
isfactory results in various camera settings with up to mod-
erate crowd densities with a large proportion of foreground
outliers.
1. Introduction
The main goal of surveillance camera calibration is to
find the mapping relating objects in the 3D scene to their
projections in the 2D image plane[9]. This helps to infer
object locations as well as scales and orientations, and al-
lows for more accurate object detection and tracking. For
example, sampling-based pedestrian detection [1,4] yields
better performance when hypotheses are generated in 3D
and then projected into one or more image views. Pedes-
trian/face/object detection based on sliding windows can
also benefit from calibration, since the search over orienta-
tion and scale can be constrained to a small range, reducing
false positives [13].
In this paper, we present an automated calibration
method that enables smart sampling of object size and ori-
entation in all views given either a 2D location in one view
or 3D location in the scene. The method works directly on
noisy foreground observations collected by the surveillance
system, without any further information such as scene geometry
or tracklets from tracking.
Figure 1. Example frames for calibration: (top) original frame
overlaid with calibration results; (middle) noisy foreground
masks with major axes of inlier blobs; (bottom) registered
top-down view of the same blobs.
Most existing work on unsupervised surveillance
(pedestrian-based) camera calibration focuses on the single-
view case ([13, 7, 15, 12, 8]) and requires clean pedestrian
detections as well as explicit correspondences of the same
person at different locations in the scene. For example, [12]
proposes to detect leg-crossings for more accurate pedes-
trian height estimation; [13] requires the extraction of mul-
tiple control points on the contour of the pedestrian. [3,6]
and [14] adopt similar ideas and use a walking human to
calibrate a camera network. In all the above work, the cor-
respondence of the same person, if not manually labeled,
is obtained either by tracking, or under the assumption that
there is only one person in the view.
In some cases however, it can be very difficult to accu-
rately detect pedestrians prior to calibration, let alone track
them robustly through the scene. [11] takes noisy fore-
ground blobs as input and achieves camera calibration based
on the analysis of pedestrian height distribution, with no
correspondence information needed. However, the estimation
of focal length may not be very accurate, and it only ap-
plies to single views. To the best of our knowledge, the au-
tomatic extraction of cross-view correspondences in noisy
environments has not been addressed in the above work.
It is known that once clean correspondences (of points,
planes, objects in 2D/3D) are given, calibration is a well-
solved problem, e.g., using bundle adjustment[16]. The
main contribution of this work is to propose a novel frame-
work for unsupervised surveillance system calibration that
efficiently prunes outliers and estimates the calibration
based on a subset of inlier foreground correspondences dis-
covered through applying a series of robust statistics.
We address four major challenges that are commonly en-
countered but not fully considered in the existing literature
on surveillance-based calibration: (1) moderately crowded
scenes; (2) a large proportion of outliers from foreground ex-
traction; (3) large noise (variance) in foreground detections;
(4) no correspondence information across frames/views.
Similar to most surveillance work, we assume (1) there is
one single flat ground-plane and (2) people are almost ver-
tical, standing/walking on the ground plane.
2. Camera Model and the Coordinate System
Adopting a simplified CCD camera model with focal
length being the only intrinsic parameter, we calibrate each
view into its local World Coordinate System (lWCS), where
each camera has zero pan angle and is translated from the
local origin O_L by one unit along the Z-axis (thus the rel-
ative scale of the coordinate system is proportional to the
camera height above the ground). Camera orientation is
modeled by a tilt angle θ around the X-axis (θ ∈ (π/2, π)
for a downward-looking camera) and a roll angle ρ around
the Z-axis. The 3D-to-2D projection matrix is thus defined
by:

P_L = \begin{bmatrix} f & & \\ & f & \\ & & 1 \end{bmatrix}
      R_Z(\rho)\, R_X(\theta)
      \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix},    (1)

where, e.g., R_X(θ) is a 3D rotation around the X-axis by
angle θ.
It has been shown [10] that the local extrinsic parameters
(ρ, θ) can be estimated given the vertical vanishing point
v_0 = (v_x, v_y, 1)^T together with the focal length f as:

v_x x + v_y y + f^2 = 0    (2)
\rho = \operatorname{atan}(-v_x / v_y)    (3)
\theta = \operatorname{atan2}\!\left(\sqrt{v_x^2 + v_y^2},\, -f\right),    (4)

where Eqn. 2 is the equation of the horizon.
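To make the model concrete, the following is a minimal NumPy sketch of Eqns. 1-4: it recovers (ρ, θ) from a hypothesized focal length f and vertical vanishing point, then assembles P_L. The sign conventions follow our reading of the printed equations, and all function and variable names are illustrative rather than the authors' code.

```python
import numpy as np

def rot_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_z(rho):
    c, s = np.cos(rho), np.sin(rho)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def local_projection(f, v0):
    """Build P_L (Eqn. 1) from a focal length f and vertical vanishing point v0 = (vx, vy)."""
    vx, vy = v0
    rho = np.arctan2(-vx, vy)                    # roll angle, Eqn. 3
    theta = np.arctan2(np.hypot(vx, vy), -f)     # tilt angle in (pi/2, pi), Eqn. 4
    K = np.diag([f, f, 1.0])
    # the camera sits one unit above the local origin along the Z-axis
    T = np.array([[1., 0., 0., 0.],
                  [0., 1., 0., 0.],
                  [0., 0., 1., 1.]])
    return K @ rot_z(rho) @ rot_x(theta) @ T
```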
To relate all lWCSs, we choose a global World Coordi-
nate System (gWCS) that is aligned to the ground plane, so
that each lWCS can be registered with the gWCS by a 2D
translation and rotation within the ground plane (XY-plane),
as well as a relative scaling (proportional to the individual
camera height), as illustrated in Fig. 2.

Figure 2. Illustration of the global WCS (black) and the local
WCSs (blue and red), where the local Y-axis is coplanar with the
camera optical axis and all XY-planes lie within the ground-plane.

The final projection matrix of a camera is defined as

P = P_L \cdot P_G,    (5)

where P_G denotes the ground-plane alignment transforma-
tion,

P_G = \begin{bmatrix} s & & & \\ & s & & \\ & & s & \\ & & & 1 \end{bmatrix}
      \begin{bmatrix} \cos\alpha & -\sin\alpha & 0 & T_x \\
                      \sin\alpha & \cos\alpha & 0 & T_y \\
                      0 & 0 & 1 & 0 \\
                      0 & 0 & 0 & 1 \end{bmatrix}.    (6)
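A companion sketch (same assumptions and naming as above) forms the ground-plane alignment P_G of Eqn. 6 and the composed projection P = P_L · P_G of Eqn. 5:

```python
import numpy as np

def ground_alignment(s, alpha, tx, ty):
    """P_G (Eqn. 6): in-plane rotation and translation plus a relative scale s."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.diag([s, s, s, 1.0]) @ np.array([[ca, -sa, 0., tx],
                                               [sa,  ca, 0., ty],
                                               [0.,  0., 1., 0.],
                                               [0.,  0., 0., 1.]])

def full_projection(P_L, s, alpha, tx, ty):
    """P = P_L * P_G (Eqn. 5), mapping gWCS points into the image of one view."""
    return P_L @ ground_alignment(s, alpha, tx, ty)
```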
3. Camera Calibration
Our algorithm works on videos captured by multiple
cameras with overlapping views. The original frames are
preprocessed to generate foreground blobs. From these
noisy blobs in each single view, we first estimate the vertical
vanishing point and the maximum-likelihood focal length f
that recovers 3D blob heights resembling the real-world dis-
tribution of human heights using [11], thus estimating the
individual camera calibration matrix P_L^{(k)} that relates the
coordinate system of camera k to its local WCS. We then
iteratively and sequentially modify the global registration
(P_G) of each camera. The iteration usually converges in a
few rounds (2 to 3). During each iteration, we minimize the
re-projection error between the 2D blobs in the current view
and the joint set of global-world 3D blobs maintained by all
cameras, where the correspondence information is implic-
itly encoded in a robust-statistic error metric. We then ef-
ficiently solve for a final estimate of P_G via a partial Direct
Linear Transform (pDLT) in a reduced solution space. The
workflow of the algorithm is summarized in Alg. 1.
3.1. Framework
Following adaptive background subtraction on the input
video sequences, we merge connected foreground pixels to
form foreground blobs. We then fit an ellipse to each blob
and represent the blob by the two end points of the major
axis of the ellipse. Assuming each foreground blob corre-
sponds to a person in the 2D image plane, the two end points
approximately represent the pixel locations of the foot and
head of the person. Denote by b_n^{(k)} the nth 2D blob ex-
tracted from the kth view and B_n^{(k)} the corresponding 3D
blob in the gWCS. These 2D and 3D blobs can be repre-
sented in homogeneous coordinates as:

b_n^{(k)} = \begin{bmatrix} x_f & x_h \\ y_f & y_h \\ 1 & 1 \end{bmatrix},
\qquad
B_n^{(k)} = \begin{bmatrix} X & X \\ Y & Y \\ 0 & H \\ 1 & 1 \end{bmatrix},    (7)

where x_f, y_f, x_h, y_h are pixel locations of the foot and
head, and X, Y, H indicate the pedestrian's location (in
the ground plane) and height, respectively. The projective
matrix P^{(k)} (Eqn. 5) projects a 3D blob into 2D: b_n^{(k)} \propto
P^{(k)} \cdot B_n^{(k)}. Assuming upright pedestrians walking in the
ground plane, the degree of freedom (DoF) of \{B_n^{(k)}\} is
3. Thus B_n^{(k)} can be linearly solved given P_{3\times 4}^{(k)} and b_n^{(k)}.
Specifically, let

M = \begin{bmatrix}
x_f P_{31} - P_{11} & x_f P_{32} - P_{12} & 0 \\
y_f P_{31} - P_{21} & y_f P_{32} - P_{22} & 0 \\
x_h P_{31} - P_{11} & x_h P_{32} - P_{12} & x_h P_{33} - P_{13} \\
y_h P_{31} - P_{21} & y_h P_{32} - P_{22} & y_h P_{33} - P_{23}
\end{bmatrix},    (8)

t = \begin{bmatrix}
P_{14} - x_f P_{34} \\
P_{24} - y_f P_{34} \\
P_{14} - x_h P_{34} \\
P_{24} - y_h P_{34}
\end{bmatrix},    (9)

it can be proven that [X, Y, H]^T = (M^T M)^{-1} M^T t. We
denote such a backward projection from 2D to 3D as
B_n^{(k)} = \{P^{(k)}\}^{-1} b_n^{(k)}. Note that the \{\cdot\}^{-1} operator here
does not refer to the conventional matrix inversion.
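The backward projection can be implemented directly from Eqns. 8-9 as a 4×3 linear least-squares problem. The following NumPy sketch (illustrative, not the authors' implementation) recovers (X, Y, H) for one foot-head blob given a 3×4 projection matrix P:

```python
import numpy as np

def back_project_blob(P, xf, yf, xh, yh):
    """Recover [X, Y, H] of one foot-head blob from a 3x4 projection P (Eqns. 8-9).

    Assumes the foot lies on the ground plane (Z = 0) and the head is
    vertically above it at height H.
    """
    M = np.array([
        [xf * P[2, 0] - P[0, 0], xf * P[2, 1] - P[0, 1], 0.0],
        [yf * P[2, 0] - P[1, 0], yf * P[2, 1] - P[1, 1], 0.0],
        [xh * P[2, 0] - P[0, 0], xh * P[2, 1] - P[0, 1], xh * P[2, 2] - P[0, 2]],
        [yh * P[2, 0] - P[1, 0], yh * P[2, 1] - P[1, 1], yh * P[2, 2] - P[1, 2]],
    ])
    t = np.array([
        P[0, 3] - xf * P[2, 3],
        P[1, 3] - yf * P[2, 3],
        P[0, 3] - xh * P[2, 3],
        P[1, 3] - yh * P[2, 3],
    ])
    X, Y, H = np.linalg.lstsq(M, t, rcond=None)[0]   # equals (M^T M)^{-1} M^T t
    return X, Y, H
```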
We formulate the multi-camera calibration as a joint en-
ergy minimization problem. Among various choices of the
cost function for multi-camera calibration, we select the
widely used mean-squared image re-projection error as our
optimization goal, which can also be interpreted as maxi-
mum likelihood estimation under an assumption of Gaus-
sian noise [5]. The re-projection from view j to view k can
be expressed as:

b_n^{(k|j)} \propto P^{(k)} \{P^{(j)}\}^{-1} b_n^{(j)} \propto P^{(k)} B_n^{(j)},    (10)

where a 2D blob in view j is first back-projected into the
gWCS with operator \{P^{(j)}\}^{-1} and then projected into view
k under P^{(k)}. The overall re-projection error is defined by

\varepsilon = \sum_k \varepsilon^{(k)} = \sum_k \sum_j e\bigl(b^{(k)}, b^{(k|j)}\bigr)    (11)
            = \sum_k e\bigl(b^{(k)}, P^{(k)} B\bigr),    (12)
Alg. 1 Unsupervised multi-camera calibration.
input:  {b^{(k)} | k = 1, ..., K}
output: {P^{(k)} | k = 1, ..., K}
individual camera calibration:
    estimate {P_L^{(k)} | k = 1, ..., K} via [11]
joint camera network calibration:
    initialize 3D blobs:
        B^{(k)} = ∅,  k = 1, ..., K-1
        B^{(K)} = {P_L^{(K)}}^{-1} b^{(K)}   (i.e., P^{(K)} = P_L^{(K)})
    for m = 1, 2, ...
        for k = 1, ..., K
            1) B = [B^{(1)}, ..., B^{(k-1)}, B^{(k+1)}, ..., B^{(K)}]
            2) update P^{(k)} = P_L^{(k)} · P_G^{(k)} ← arg min {ε^{(k)} | b^{(k)}, B}
            3) update B^{(k)} ← {P^{(k)}}^{-1} b^{(k)}
            4) compute ε according to Eqn. 12
        if ε_m > ε_{m-1}: terminate
where B = (B^{(1)}, ..., B^{(K)}) is the set of 3D blobs con-
tributed by all views. Note that the self back-projection al-
ways has b_n^{(k|k)} = b_n^{(k)}; hence e(b^{(k)}, P^{(k)} B^{(k)}) is always 0.
e(·,·) is a robust matching error metric, defined in Sec. 3.3,
that measures the compatibility between the set of fore-
ground blobs b^{(k)} observed in view k and the set of re-
projected blobs from 3D (B), as contributed by all views,
where no correspondence information between b^{(k)} and B is
given.
Minimizing the above cost function directly is in-
tractable. However, as can be seen from Eqn. 12, we can
iteratively optimize the projection matrix for each view. To
initiate the lWCS-gWCS matching, we first align the gWCS
with one of the lWCSs (here we pick the last view indexed
by K) to obtain an initial set of 3D blobs. Then we se-
quentially calibrate each camera kby (1) inferring the cor-
respondences between b(k)and Bunder the presence of out-
liers and noise with a robust matching metric (Sec. 3.3) and
(2) optimizing the projective matrix P(k)given the 2D-3D
blob correspondences by partial direct linear transformation
(Sec. 3.4). Empirically, we observe that our multi-camera
calibration usually converges in no more than three itera-
tions, i.e., m3in Alg. 1.
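The alternating scheme of Alg. 1 can be summarized by the skeleton below. It reuses back_project_blob from the sketch above and takes the per-view registration step (Sec. 3.4) and the robust error metric (Sec. 3.3) as caller-supplied functions; it is a structural sketch under those assumptions, not the authors' code.

```python
def calibrate_network(blobs_2d, P_L, register_to_gwcs, reprojection_error):
    """Skeleton of Alg. 1.

    blobs_2d[k] : list of (xf, yf, xh, yh) blobs observed in view k
    P_L[k]      : single-view calibration of view k (Sec. 3.2)
    register_to_gwcs(P_L_k, blobs_k, B_others) -> P_k   # Sec. 3.4 (pDLT), supplied by caller
    reprojection_error(P_k, blobs_k, B_all) -> float    # Sec. 3.3 (Eqn. 14), supplied by caller
    """
    K = len(blobs_2d)
    P, B = [None] * K, [[] for _ in range(K)]
    P[K - 1] = P_L[K - 1]                    # align the gWCS with the last lWCS
    B[K - 1] = [back_project_blob(P[K - 1], *b) for b in blobs_2d[K - 1]]
    prev_err = float('inf')
    while True:
        for k in range(K):
            others = [b for j in range(K) if j != k for b in B[j]]
            P[k] = register_to_gwcs(P_L[k], blobs_2d[k], others)
            B[k] = [back_project_blob(P[k], *b) for b in blobs_2d[k]]
        err = sum(reprojection_error(P[k], blobs_2d[k], B) for k in range(K))
        if err >= prev_err:                  # typically converges within 2-3 rounds
            break
        prev_err = err
    return P
```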
3.2. Individual Camera Calibration to lWCS
This section explains how we obtain the input matrices
{P_L^{(k)}} w.r.t. the lWCS. As shown in Eqns. 1, 3, 4, this is equiv-
alent to estimating the focal length f and the vertical van-
ishing point v_0. We apply the method of [11] that first uses
RANSAC to find the vanishing point and then roughly es-
timates the focal length based on prior knowledge of the
3D-height distribution of inlier blobs. Note that we do not
need to assume a constant person height for calibration, as
is often done in algorithms for foot-head homography esti-
mation.
Figure 3. Vanishing point detection under different camera angles,
foreground blob sizes, and crowd densities. Green lines indicate
the major axes of inlier blobs. Outliers are marked with red lines.
Yellow dashed lines indicate the vanishing points.
The vanishing point estimation is carried out in homoge-
neous coordinates, and is robust in cases when a vanishing
point is close to infinity. Fig. 3 demonstrates a few exam-
ples of RANSAC-based vanishing point voting under dif-
ferent camera settings as well as varying foreground sizes
and densities. It is worth mentioning that many blobs cor-
responding to real pedestrians are classified as outliers be-
cause of region deformation due to partial detection or be-
ing merged with other people, especially in crowded scenes.
However our goal is not to detect all pedestrian foreground
blobs but to extract enough inlier blobs for the following
analysis.
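As an illustration of the RANSAC voting used here (the details follow [11]; the angular tolerance, iteration count, and names below are our own assumptions), a vanishing-point hypothesis can be formed from two blob major axes and scored by how many other axes point towards it:

```python
import numpy as np

def axis_line(foot, head):
    """Homogeneous image line through the foot and head pixels of one blob."""
    return np.cross(np.array([*foot, 1.0]), np.array([*head, 1.0]))

def ransac_vanishing_point(blobs, iters=500, angle_tol_deg=2.0):
    """blobs: list of (foot_xy, head_xy). Returns (v0_homogeneous, inlier_indices)."""
    tol = np.deg2rad(angle_tol_deg)
    lines = [axis_line(f, h) for f, h in blobs]
    best_v, best_inliers = None, []
    for _ in range(iters):
        i, j = np.random.choice(len(blobs), 2, replace=False)
        v = np.cross(lines[i], lines[j])         # intersection of the two axes
        if np.allclose(v, 0):
            continue                              # degenerate (parallel/identical) sample
        inliers = []
        for k, (f, h) in enumerate(blobs):
            axis = np.asarray(h, float) - np.asarray(f, float)
            vp_dir = v[:2] - v[2] * np.asarray(f, float)   # foot -> vanishing point direction
            cos = abs(axis @ vp_dir) / (np.linalg.norm(axis) * np.linalg.norm(vp_dir) + 1e-9)
            if np.arccos(np.clip(cos, -1.0, 1.0)) < tol:
                inliers.append(k)
        if len(inliers) > len(best_inliers):
            best_v, best_inliers = v, inliers
    return best_v, best_inliers
```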
Focal length estimation follows a hypothesize-and-test
process. Given a hypothesized focal length f, together
with the vanishing point v_0, we can recover the relative
3D heights H_i of each inlier blob b_i (w.r.t. the camera
height); a more efficient method would use the cross-ratio
invariance trick [11]. This process leverages the fact that the
distribution of human heights in 3D forms a very strong
cluster with |H_i - µ|/µ < λ, where µ is the average inlier
pedestrian height and λ = 0.1 [11]. Different hypotheses are
evaluated against a robust log-likelihood function defined as:
L(f) = \frac{1}{\mu^2} \sum_{i \in I} \max\{\lambda\mu - |H_i - \mu|, 0\}^2,    (13)

where I represents the set of RANSAC inliers, and outlier
candidates H_i that fall out of the height range of the major-
ity of the inlier blobs, e.g., H_i > (1 + λ)µ, are ignored. As
we sample the camera field of view (FoV) angle at a res-
olution of 1°, which is about the state-of-the-art accuracy
for pedestrian-based surveillance camera calibration [8], the
focal length f that produces the highest likelihood score ac-
cording to Eqn. 13 is selected as our initial estimate for the
multi-view calibration.
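The hypothesize-and-test search can be written compactly as below. robust_score transcribes Eqn. 13; the blob_heights callable is an assumed stand-in for the relative 3D height recovery of [11], and the FoV sampling range is illustrative.

```python
import numpy as np

def robust_score(heights, lam=0.1):
    """Eqn. 13: truncated-quadratic score of the recovered relative 3D heights."""
    heights = np.asarray(heights, float)
    mu = np.mean(heights)                              # average pedestrian height estimate
    resid = np.maximum(lam * mu - np.abs(heights - mu), 0.0)
    return np.sum(resid ** 2) / mu ** 2                # heights outside (1 +/- lam)*mu contribute 0

def estimate_focal(inlier_blobs, v0, image_width, blob_heights,
                   fov_deg=np.arange(20.0, 120.0, 1.0)):
    """Search the FoV at 1 degree resolution and keep the best focal length.

    blob_heights(blobs, v0, f) is a caller-supplied function that recovers the
    relative 3D height of each inlier blob for a hypothesized f (cf. [11]).
    """
    best_f, best_score = None, -np.inf
    for fov in np.deg2rad(fov_deg):
        f = 0.5 * image_width / np.tan(0.5 * fov)      # focal length for this FoV hypothesis
        score = robust_score(blob_heights(inlier_blobs, v0, f))
        if score > best_score:
            best_f, best_score = f, score
    return best_f
```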
3.3. Robust Distance Metric for Blob Matching
This section explains the cost function (Eqn. 11) for
cross-view lWCS matching. Recall that b^{(i)} denotes the set
of 2D blobs in view i for the entire sequence, B^{(i)} denotes
the set of 3D blobs back-projected from view i to the gWCS,
and b^{(i|j)} is the set of re-projected 2D blobs from view j
to i. The cost function is defined as the sum of the re-
projection errors of all pairs of 2D-3D blobs (denoted as
set l) extracted at the same time stamp.
Since the foreground blobs are noisy in the sense of
both false alarms and missed detections, the proportion
of ‘good’ pairs (two ‘good’ blobs extracted from different
views corresponding to the same person) is even smaller,
especially under crowded scenarios. We thus adopt the trun-
cated quadratic [2], which belongs to the robust statistics of
truncated least squares, defined as:

e(b^{(i)}, b^{(i|j)}) = \sum_{(b_{n_1}^{(i)}, B_{n_2}^{(j)}) \in l} \min\{ d(b_{n_1}^{(i)}, b_{n_2}^{(i|j)})^2, \tau^2 \},    (14)

where d(·,·) is the 4D Euclidean distance between two blobs
in pixels (the (x, y) coordinates of feet and head), and the error
tolerance is set to be τ² = (1/100)·W·H, where W and H
are the width and height of the image. We find this setting
yields satisfactorily consistent results for video sequences
with very different camera settings, as demonstrated in the
experiments section. We iteratively use the error tolerance
as a threshold to discover 'good' blob correspondences from
all possible pairs in l and re-estimate calibration parameters
based on these inliers.
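A direct transcription of Eqn. 14 for the blobs of one time stamp (NumPy sketch; the array layout is an assumption):

```python
import numpy as np

def robust_match_error(obs, reproj, width, height):
    """Eqn. 14: truncated-quadratic error between observed and re-projected blobs.

    obs, reproj: arrays of shape (N, 4) and (M, 4) holding (xf, yf, xh, yh)
    for blobs extracted at the same time stamp in the two roles.
    """
    tau2 = width * height / 100.0
    err = 0.0
    for b in np.asarray(obs, float):
        d2 = np.sum((np.asarray(reproj, float) - b) ** 2, axis=1)  # squared 4D distances
        err += np.sum(np.minimum(d2, tau2))                        # every pair in l contributes
    return err
```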
3.4. Multiview Calibration to gWCS by Partial DLT
This section describes the sequential registration of mul-
tiple lWCSs in Alg. 1. The goal is to estimate the cam-
era projection matrix P from an initial set of noisy inlier
2D-3D blob correspondences (l_in) and the results of the
single-view calibration (P_L). We propose an iterative pro-
cess based on a variant of the direct linear transform (DLT).
The algorithm iteratively estimates the global projection matrix
and refines the inlier correspondences l_in and P_L once P_G
has been updated. The overall optimization is summarized
in Alg. 2.
To solve the global transformation of the kth view, P^k =
P_L^k · P_G^k (Eqn. 5), from a linear system constructed from
inlier blob correspondences between the 2D blobs and 3D
blobs,

[x, y, z]^T \propto P_L^k \cdot P_G^k \cdot [X, Y, Z, 1]^T,    (15)

a straightforward approach would treat P as a general ma-
trix with 12 DoF and solve it using DLT. However, the DLT
solution is known to be overdetermined, as the real perspec-
tive matrix only has 11 DoF (up to a scale). More impor-
tantly, when the 2D-3D correspondences are noisy, DLT can
easily overfit the free-form solution, making it harder to dis-
tinguish inlier correspondences from outliers during the it-
erative process, which would further degrade the estimation
accuracy.
Alg. 2 Calibration to gWCS by Partial DLT
input:  {P_L^{(k)}}_init, b^{(k)}, B, l
output: P^{(k)}, {P_L^{(k)}}_updated, {l_in}_updated
initialization:
    randomly sample the initial correspondences: l_in ⊂ l
for m = 1, ..., M
    1) optimize α in P_G^{(k)}, given l_in, P_L^{(k)}
    2) optimize s, t_x, t_y in P_G^{(k)}, given l_in, P_L^{(k)}, α
    3) optimize f, P_L^{(k)}, given l_in, P_G^{(k)}
    4) compute ε^{(k)} and update l_in (Eqn. 14)
    if ε_m^{(k)} > ε_{m-1}^{(k)}: terminate
Therefore, we limit the DoF of the solution space
by estimating the five variables in Eqns. 1, 6, namely α, s, t_x, t_y, f,
sequentially, while fixing the vanishing point. The motiva-
tion is that (1) we assume the initial estimate of the vanishing
point v_0 is accurate enough; and (2) by reducing the DoF of the
projection matrix, we introduce a partial DLT here to solve
for subsets of parameters efficiently without suffering from
the 'over-determined' problem of DLT.
To estimate the ground-plane rotation angle α, we fix f
and P_L^k so that P_G^k has 7 DoF:

P_G^k \approx P_G^k|_7 = \begin{bmatrix}
m_{11} & m_{12} & 0 & m_{14} \\
m_{21} & m_{22} & 0 & m_{24} \\
0 & 0 & m_{33} & 0 \\
0 & 0 & 0 & 1 \end{bmatrix}.    (16)
P_G|_7 can be directly optimized by linearizing Eqn. 15 sim-
ilarly to the DLT (thus referred to as partial DLT), and the
rotation angle can be approximated as α = atan2(m_{21} -
m_{12}, m_{11} + m_{22}). We then fix α and f, so that s, t_x, t_y to-
gether can be directly linearly solved in the same way (see
the Appendix for the detailed derivation). Although the initial
estimate of the focal length f may not be very accurate, it con-
strains the search space to a small region for refining the
estimates. To optimize f, we fix the current estimates of the
other parameters and adopt pDLT again on the linear system

[x, y, z]^T \propto \begin{bmatrix} a_f & & \\ & a_f & \\ & & 1 \end{bmatrix}
\cdot P_L^k \cdot P_G^k \cdot [X, Y, Z, 1]^T,    (17)

with a_f being the only parameter. This suggests the optimal
focal length should be updated as f ← a_f · f. We then up-
date θ in P_L with the new f according to Eqn. 4 (so the vanishing
point remains the same) and re-evaluate the matching error
of Eqn. 14. We only accept the new f if it reduces the matching
error.
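Because Eqn. 17 is linear in the single unknown a_f once P = P_L · P_G is fixed, the focal refinement reduces to a one-parameter least-squares fit. A sketch (illustrative names; each foot or head correspondence contributes one 2D-3D point pair):

```python
import numpy as np

def refine_focal_scale(P, pts_2d, pts_3d):
    """Estimate a_f in Eqn. 17 from 2D-3D point correspondences under a fixed P = P_L * P_G.

    pts_2d: (N, 2) image points; pts_3d: (N, 3) gWCS points.
    The focal length is then updated as f <- a_f * f (accepted only if the
    matching error of Eqn. 14 decreases).
    """
    A, b = [], []
    for (x, y), X in zip(pts_2d, pts_3d):
        q = P @ np.append(np.asarray(X, float), 1.0)   # (q1, q2, q3) before the diag(a_f, a_f, 1) scaling
        A.extend([q[0] / q[2], q[1] / q[2]])            # a_f * q1/q3 should match x, a_f * q2/q3 should match y
        b.extend([x, y])
    a_f = np.linalg.lstsq(np.asarray(A)[:, None], np.asarray(b), rcond=None)[0][0]
    return a_f
```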
4. Experiments
We have conducted extensive evaluation on a synthe-
sized dataset for stress testing, and on four different pub-
lic sequences with various camera settings, crowd densities,
and background subtraction qualities: (a) indoor 4-person
sequence with three views of resolution 288 ×360 [1],
(b) outdoor campus sequence with two views of 960 ×
1280 [14], (c) PETS09 sparse crowd sequence with 4 views
(1,5,6,8), where view#1 is 576 ×768 and the rest are
576×720, and (d) PETS09 medium density crowd sequence
with two views (#1 and #2) of resolution 576 ×768 2.
We also provide quantitative comparison against [11] in
terms of focal length estimation, since to our best knowl-
edge, no other existing work estimates surveillance network
calibration without correspondence information.
4.1. Synthesized Dataset for Stress-Testing
We synthesize a dataset with three cameras of different
focal lengths (f_1 = 1000, f_2 = 1200, f_3 = 1000), view-
ing angles, ground plane locations, and heights. Multiple
pedestrians were synthesized in 3D as feet-head pairs with
a height variance of ±10%, and then projected into indi-
vidual views. In the stress test, we consider three sources
of input noise: location noise σ_0, foreground blob recall
rate r, and precision rate p. The location noise is produced as
zero-mean Gaussian noise added onto the original feet-head
pixel locations. The blob recall and precision indicate false
negatives and false positives and are simulated by randomly
removing inlier blobs and adding outlier blobs. The default
stress parameters are set as σ_0 = 5 (the standard deviation
of the Gaussian noise in pixels), r = 70% and p = 70%. We
then vary σ_0 from 1 to 10, and r, p from 1 to 0 respectively,
with the other two parameters fixed at the default level.
Figure 4. Synthesized 3-camera dataset with calibration results.
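For concreteness, the stress perturbation applied to a clean set of projected feet-head blobs can be sketched as follows (the uniform outlier generator and parameter names are our own assumptions, made only for illustration):

```python
import numpy as np

def perturb_blobs(blobs, sigma0=5.0, recall=0.7, precision=0.7,
                  img_w=1280, img_h=960, rng=None):
    """Add location noise, drop inliers (recall), and add outliers (precision).

    blobs: (N, 4) array of clean (xf, yf, xh, yh) projections.
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(blobs, float)
    noisy = clean + rng.normal(0.0, sigma0, clean.shape)        # feet-head location noise
    kept = noisy[rng.random(len(noisy)) < recall]               # simulate missed detections
    n_out = int(len(kept) * (1.0 - precision) / max(precision, 1e-6))  # false positives
    outliers = np.column_stack([
        rng.uniform(0, img_w, n_out), rng.uniform(0, img_h, n_out),
        rng.uniform(0, img_w, n_out), rng.uniform(0, img_h, n_out),
    ])
    return np.vstack([kept, outliers])
```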
Fig. 4 visualizes the calibration results. We also quanti-
tatively evaluate the performance of our algorithm by mea-
suring the average RMSE re-projection error in pixels as
well as the average relative focal length estimation error
e = |f - f_GT| / f_GT. The performance of our approach
under different stress levels is plotted in Fig. 5.
It can be seen that the re-projection error increases with
increasing feet-head location noise, but is relatively sta-
ble w.r.t. missing inliers and against outliers, which well
demonstrates the effectiveness of our robust estimation met-
ric. When the precision decreases from p = 0.2 to p = 0.1
in Fig. 5 (bottom), the proportion of outliers increases from
50% to 90% and results in a big performance drop. In all
other cases, our focal length estimation remains accurate
and performs better than a single-view based approach [11].
Figure 5. Performance under different levels of (a) location noise;
(b) foreground recall rate; (c) foreground precision rate. Top: average
re-projection error. Bottom: focal length error of our approach (red) and [11].
4.2. Real Sequences
As can be seen from the calibration results in Fig. 6, our
ground plane estimations are accurate in general (a black
grid mask overlaid on the original frame). For quantitative
evaluation, we manually labeled corresponding pedestrians
across views on 50 frames for each sequence. Fig. 7 com-
pares the algorithm re-projections with the groundtruth la-
bels. The view-specific re-projection error is summarized
in Tab. 1, where the E_in column indicates the matching er-
ror when pedestrians from other views are mapped into the
current view, and the E_out column indicates the matching
error when pedestrians from the current view are mapped
into other views. We show the RMSE both in pixels and
in terms of normalized distance, E_r = E / \sqrt{W \cdot H}. Note
that some labeling error exists on feet/head positions of the
groundtruth labels.
The focal length estimates for sequences with ground-
truth f_GT and relative error e are summarized in Tab. 2.
Again, compared with the baseline method using single-
view calibration, our method achieves better accuracy in
most cases. The second view of the campus sequence
reported the largest error; however, the cross-view re-
projection estimation and ground-plane estimation are still
acceptable. This can be explained as a case where inaccu-
rate focal length estimation cancels out the inaccurate van-
ishing point estimation, resulting in a small re-projection
error, which is the primary goal for surveillance calibration.
Sequence          View   E_in   E_out   E_in^r   E_out^r
indoor            1      13     16      4.0%     5.0%
                  2      23     12      7.1%     3.6%
                  3      14     22      4.3%     6.8%
campus            1      36     24      3.3%     2.2%
                  2      24     36      2.2%     3.3%
PETS09 (sparse)   1      10     34      1.6%     5.1%
                  2      29     12      4.7%     2.0%
                  3      13     11      2.0%     1.8%
                  4      16     12      2.5%     1.8%
PETS09 (dense)    1      18     8       2.6%     1.2%
                  2      8      18      1.2%     2.6%
Table 1. View-specific re-projection error on real sequences.
Sequence          View   f_GT   f_base   e_base   f      e
Campus            1      1057   1044     1%       1034   2%
                  2      1198   1545     29%      1427   19%
PETS09 (sparse)   1      1170   1084     7%       1218   4%
                  5      830    828      0.2%     828    0.2%
                  6      877    891      2%       869    1%
                  8      737    772      5%       772    5%
PETS09 (dense)    1      1170   950      19%      1067   9%
                  2      659    624      5%       634    4%
Table 2. Focal length estimation, where f_GT is the groundtruth
focal length, f_base and f are estimates computed by the
baseline [11] and our method, respectively, and e_base and e are
relative errors w.r.t. the groundtruth. Our method outperforms the
baseline in most cases.
Figure 6. Example calibration results for: (a) indoor 4-people sequence; (b) outdoor campus sequence; (c) PETS09 sparse; (d) PETS09
dense. For each sequence: Top: original frame overlaid with calibration results. Middle: noisy foreground masks overlaid with major
axes of inlier blobs. Bottom: registered top-down view of inlier blobs.
Figure 7. Re-projection evaluation. Manually labeled corresponding pedestrians in different views are plotted in straight lines with the
same color. The cross-view re-projections based on calibration are plotted in dashed lines.
5. Summary
We propose a novel framework for unsupervised surveil-
lance camera network calibration. We take as input noisy fore-
ground (pedestrian) blobs captured directly by the cameras,
without any cross-time or cross-view correspondence in-
formation. We first apply robust self-calibration to cali-
brate each camera w.r.t. its lWCS to reduce the DoF of the
projective transformation to be estimated later, while also
pruning a large proportion of outlier blob observations. We
then sequentially align all the lWCSs to a shared gWCS,
during which we introduce truncated least-squares as a ro-
bust error metric to iteratively determine inlier correspon-
dences, while applying a series of partial DLTs to solve
for a projective transformation. We demonstrate the ro-
bustness of our algorithm against different camera settings,
foreground qualities, outlier ratios and crowd densities via
extensive experiments on synthesized image sequences as
well as on publicly available real datasets.
Appendix: Groundplane Registration using
Partial DLT
Given a fixed ground-plane rotation α and local calibration
matrix P_L, we linearize the equation set by introducing an
auxiliary variable z to resolve the projective ambiguity:

[x·z, y·z, z]^T = P_L \cdot P_G \cdot [X, Y, Z, 1]^T,    (18)

with

P_L = \begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix},    (19)

P_G = \begin{bmatrix}
\cos\alpha \cdot s & -\sin\alpha \cdot s & 0 & t_x \\
\sin\alpha \cdot s & \cos\alpha \cdot s & 0 & t_y \\
0 & 0 & s & 0 \\
0 & 0 & 0 & 1 \end{bmatrix}.    (20)

Via Gaussian elimination on the auxiliary variable z, we can
reorganize Eqn. 18 to obtain 2 sets of constraints for each
2D-3D point-pair correspondence, so that the calibration
parameters [s, t_x, t_y] can be linearly solved:

\begin{bmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \end{bmatrix}
\cdot \begin{bmatrix} s \\ t_x \\ t_y \end{bmatrix}
= \begin{bmatrix} u_1 \\ u_2 \end{bmatrix},    (21)

where

m_{11} = (p_{11}X + p_{12}Y - p_{31}xX - p_{32}xY)\cos\alpha + (p_{12}X
       - p_{11}Y - p_{32}xX + p_{31}xY)\sin\alpha + (p_{13} - p_{33}x)Z,
m_{21} = (p_{21}X + p_{22}Y - p_{31}yX - p_{32}yY)\cos\alpha + (p_{22}X
       - p_{21}Y - p_{32}yX + p_{31}yY)\sin\alpha + (p_{23} - p_{33}y)Z,

and

m_{12} = p_{11} - p_{31}x,   m_{22} = p_{21} - p_{31}y,
m_{13} = p_{12} - p_{32}x,   m_{23} = p_{22} - p_{32}y,
u_1 = -p_{14} + p_{34}x,     u_2 = -p_{24} + p_{34}y.    (22)
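Stacking the two rows of Eqn. 21 for every inlier correspondence yields an ordinary least-squares problem in [s, t_x, t_y]. A sketch following the coefficients of Eqn. 22 (illustrative names, assuming α and P_L are already fixed):

```python
import numpy as np

def solve_scale_translation(P_L, alpha, pts_2d, pts_3d):
    """Linear solve for [s, tx, ty] (Eqns. 21-22), with alpha and P_L fixed.

    pts_2d: (N, 2) image points (x, y); pts_3d: (N, 3) gWCS points (X, Y, Z).
    """
    p = np.asarray(P_L, float)
    ca, sa = np.cos(alpha), np.sin(alpha)
    A, u = [], []
    for (x, y), (X, Y, Z) in zip(pts_2d, pts_3d):
        m11 = (p[0,0]*X + p[0,1]*Y - p[2,0]*x*X - p[2,1]*x*Y) * ca \
            + (p[0,1]*X - p[0,0]*Y - p[2,1]*x*X + p[2,0]*x*Y) * sa + (p[0,2] - p[2,2]*x) * Z
        m21 = (p[1,0]*X + p[1,1]*Y - p[2,0]*y*X - p[2,1]*y*Y) * ca \
            + (p[1,1]*X - p[1,0]*Y - p[2,1]*y*X + p[2,0]*y*Y) * sa + (p[1,2] - p[2,2]*y) * Z
        A.append([m11, p[0,0] - p[2,0]*x, p[0,1] - p[2,1]*x])   # row for the x constraint
        A.append([m21, p[1,0] - p[2,0]*y, p[1,1] - p[2,1]*y])   # row for the y constraint
        u.extend([-p[0,3] + p[2,3]*x, -p[1,3] + p[2,3]*y])
    s, tx, ty = np.linalg.lstsq(np.asarray(A), np.asarray(u), rcond=None)[0]
    return s, tx, ty
```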
References
[1] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. PAMI, 2011.
[2] M. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. CVIU, 63(1):75-104, 1996.
[3] T. Chen, A. Bimbo, F. Pernici, and G. Serra. Accurate self-calibration of two cameras by observations of a moving person on a ground plane. In Proc. AVSS, 2007.
[4] W. Ge and R. T. Collins. Crowd detection with a multiview sampler. In Proc. ECCV, 2010.
[5] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049, 2000.
[6] M. Hodlmoser and M. Kampel. Multiple camera self-calibration and 3D reconstruction using pedestrians. In Proc. ISVC, 2010.
[7] I. N. Junejo and H. Foroosh. Trajectory rectification and path modeling for video surveillance. In Proc. ICCV, 2007.
[8] N. Krahnstoever and P. R. Mendonca. Bayesian autocalibration for surveillance. In Proc. ICCV, pages 1858-1865, 2005.
[9] D. Liebowitz. Camera Calibration and Reconstruction of Geometry from Images. PhD thesis, University of Oxford, 2001.
[10] D. Liebowitz and A. Zisserman. Combining scene and auto-calibration constraints. In Proc. EuroGraphics, pages 293-300, 1999.
[11] J. Liu, R. T. Collins, and Y. Liu. Surveillance camera autocalibration based on pedestrian height distributions. In Proc. BMVC, 2011.
[12] F. Lv, T. Zhao, and R. Nevatia. Camera calibration from video of a walking human. PAMI, 28(9):1513-1518, 2006.
[13] B. Micusik and T. Pajdla. Simultaneous surveillance camera calibration and foot-head homology estimation from human detections. In Proc. CVPR, 2010.
[14] H. Possegger, R. Matthias, S. Sabine, M. Thomas, K. Manfred, R. P. M., and B. Horst. Unsupervised calibration of camera networks and virtual PTZ cameras. In Computer Vision Winter Workshop, 2012.
[15] D. Rother and K. A. Patwardhan. What can casual walkers tell us about the 3D scene. In Proc. ICCV, 2007.
[16] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment - a modern synthesis. Vision Algorithms: Theory and Practice, 2000.