Robust Autocalibration for a Surveillance Camera Network
Jingchen Liu, Robert T. Collins, and Yanxi Liu
The Pennsylvania State University
University Park, PA 16802, USA
{jingchen, rcollins, yanxi}@cse.psu.edu
Abstract
We propose a novel approach for multi-camera autocali-
bration by observing multiview surveillance video of pedes-
trians walking through the scene. Unlike existing methods,
we do NOT require tracking or explicit correspondences of
the same person across time/views. Instead, we take noisy
foreground blobs as the only input and rely on a joint opti-
mization framework with robust statistics to achieve accu-
rate calibration under challenging scenarios. First, each
individual camera is roughly calibrated into its local World
Coordinate System (lWCS) based on analysis of relative 3D
pedestrian height distribution. Then, all lWCSs are itera-
tively registered with respect to a shared global World Co-
ordinate System (gWCS) by incorporating robust matching
with a partial Direct Linear Transform (pDLT). As demon-
strated by extensive evaluation, our algorithm achieves sat-
isfactory results in various camera settings with up to mod-
erate crowd densities with a large proportion of foreground
outliers.
1. Introduction
The main goal of surveillance camera calibration is to
find the mapping relating objects in the 3D scene to their
projections in the 2D image plane[9]. This helps to infer
object locations as well as scales and orientations, and al-
lows for more accurate object detection and tracking. For
example, sampling-based pedestrian detection [1,4] yields
better performance when hypotheses are generated in 3D
and then projected into one or more image views. Pedes-
trian/face/object detection based on sliding windows can
also benefit from calibration, since the search over orienta-
tion and scale can be constrained to a small range, reducing
false positives [13].
In this paper, we present an automated calibration
method that enables smart sampling of object size and ori-
entation in all views given either a 2D location in one view
or 3D location in the scene. The method works directly on
noisy foreground observations collected by the surveillance
system, without any further information such as scene geometry
or tracklets from tracking.
Figure 1. Example frames for calibration: (top) original frame
overlaid with calibration results; (middle) noisy foreground
masks with major axes of inlier blobs; (bottom) registered
top-down view of the same blobs.
Most existing work on unsupervised surveillance
(pedestrian-based) camera calibration focuses on the single-
view case ([13, 7, 15, 12, 8]) and requires clean pedestrian
detections as well as explicit correspondences of the same
person at different locations in the scene. For example, [12]
proposes to detect leg-crossings for more accurate pedes-
trian height estimation; [13] requires the extraction of mul-
tiple control points on the contour of the pedestrian. [3,6]
and [14] adopt similar ideas and use a walking human to
calibrate a camera network. In all the above work, the cor-
respondence of the same person, if not manually labeled,
is obtained either by tracking, or under the assumption that
there is only one person in the view.
In some cases however, it can be very difficult to accu-
rately detect pedestrians prior to calibration, let alone track
them robustly through the scene. [11] takes noisy fore-
ground blobs as input and achieves camera calibration based
on the analysis of pedestrian height distribution, with no
correspondence information needed. However, the estimation
of focal length may not be very accurate, and it only ap-
plies to single views. To the best of our knowledge, the au-
tomatic extraction of cross-view correspondences in noisy
environments has not been addressed in the above work.
It is known that once clean correspondences (of points,
planes, objects in 2D/3D) are given, calibration is a well-
solved problem, e.g., using bundle adjustment[16]. The
main contribution of this work is to propose a novel frame-
work for unsupervised surveillance system calibration that
efficiently prunes outliers and estimates the calibration
based on a subset of inlier foreground correspondences dis-
covered through applying a series of robust statistics.
We address four major challenges that are commonly en-
countered but not fully considered in the existing literature
on surveillance-based calibration: (1) moderately crowded
scenes; (2) a large proportion of outliers from foreground ex-
traction; (3) large noise (variance) in foreground detections;
(4) no correspondence information across frames/views.
Similar to most surveillance work, we assume (1) there is
one single flat ground-plane and (2) people are almost ver-
tical, standing/walking on the ground plane.
2. Camera Model and the Coordinate System
Adopting a simplified CCD camera model with focal
length being the only intrinsic parameter, we calibrate each
view into its local World Coordinate System (lWCS), where
each camera has zero pan angle and is translated from the
local origin O_L by one unit along the Z-axis (thus the rel-
ative scale of the coordinate system is proportional to the
camera height above the ground). Camera orientation is
modeled by a tilt angle θ around the X-axis (θ ∈ (π/2, π)
for a downward-looking camera) and a roll angle ρ around
the Z-axis. The 3D-to-2D projection matrix is thus defined
by:

P_L = \begin{bmatrix} f & & \\ & f & \\ & & 1 \end{bmatrix}
      R_Z(\rho)\, R_X(\theta)
      \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix},    (1)

where, e.g., R_X(θ) is a 3D rotation around the X-axis by
angle θ.
It has been shown [10] that the local extrinsic parameters
(ρ, θ) can be estimated given the vertical vanishing point
v_0 = (v_x, v_y, 1)^T together with the focal length f as:

v_x x + v_y y + f^2 = 0    (2)
\rho = \operatorname{atan}(-v_x / v_y)    (3)
\theta = \operatorname{atan2}\!\left(\sqrt{v_x^2 + v_y^2},\, -f\right),    (4)

where Eqn. 2 is the equation of the horizon.
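To make the model concrete, the following is a minimal NumPy sketch of Eqns. 1-4: it recovers (ρ, θ) from a hypothesized focal length f and vertical vanishing point, then assembles P_L. The sign conventions follow our reading of the printed equations, and all function and variable names are illustrative rather than the authors' code.

```python
import numpy as np

def rot_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_z(rho):
    c, s = np.cos(rho), np.sin(rho)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def local_projection(f, v0):
    """Build P_L (Eqn. 1) from a focal length f and vertical vanishing point v0 = (vx, vy)."""
    vx, vy = v0
    rho = np.arctan2(-vx, vy)                    # roll angle, Eqn. 3
    theta = np.arctan2(np.hypot(vx, vy), -f)     # tilt angle in (pi/2, pi), Eqn. 4
    K = np.diag([f, f, 1.0])
    # the camera sits one unit above the local origin along the Z-axis
    T = np.array([[1., 0., 0., 0.],
                  [0., 1., 0., 0.],
                  [0., 0., 1., 1.]])
    return K @ rot_z(rho) @ rot_x(theta) @ T
```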
To relate all lWCSs, we choose a global World Coordi-
nate System (gWCS) that is aligned to the ground plane, so
that each lWCS can be registered with the gWCS by a 2D
translation and rotation within the ground plane (XY-plane),
as well as a relative scaling (proportional to the individual
camera height), as illustrated in Fig. 2.

Figure 2. Illustration of the global WCS (black) and the local
WCSs (blue and red), where the local Y-axis is coplanar with the
camera optical axis and all XY-planes lie within the ground-plane.

The final projection matrix of a camera is defined as

P = P_L \cdot P_G,    (5)

where P_G denotes the ground-plane alignment transforma-
tion,

P_G = \begin{bmatrix} s & & & \\ & s & & \\ & & s & \\ & & & 1 \end{bmatrix}
      \begin{bmatrix} \cos\alpha & -\sin\alpha & 0 & T_x \\
                      \sin\alpha & \cos\alpha & 0 & T_y \\
                      0 & 0 & 1 & 0 \\
                      0 & 0 & 0 & 1 \end{bmatrix}.    (6)
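A companion sketch (same assumptions and naming as above) forms the ground-plane alignment P_G of Eqn. 6 and the composed projection P = P_L · P_G of Eqn. 5:

```python
import numpy as np

def ground_alignment(s, alpha, tx, ty):
    """P_G (Eqn. 6): in-plane rotation and translation plus a relative scale s."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.diag([s, s, s, 1.0]) @ np.array([[ca, -sa, 0., tx],
                                               [sa,  ca, 0., ty],
                                               [0.,  0., 1., 0.],
                                               [0.,  0., 0., 1.]])

def full_projection(P_L, s, alpha, tx, ty):
    """P = P_L * P_G (Eqn. 5), mapping gWCS points into the image of one view."""
    return P_L @ ground_alignment(s, alpha, tx, ty)
```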
3. Camera Calibration
Our algorithm works on videos captured by multiple
cameras with overlapping views. The original frames are
preprocessed to generate foreground blobs. From these
noisy blobs in each single view, we first estimate the vertical
vanishing point and the maximum-likelihood focal length f
that recovers 3D blob heights resembling the real-world dis-
tribution of human heights using [11], thus estimating the
individual camera calibration matrix P_L^{(k)} that relates the
coordinate system of camera k to its local WCS. We then
iteratively and sequentially modify the global registration
(P_G) of each camera. The iteration usually converges in a
few rounds (2 to 3). During each iteration, we minimize the
re-projection error between the 2D blobs in the current view
and the joint set of global-world 3D blobs maintained by all
cameras, where the correspondence information is implic-
itly encoded in a robust-statistic error metric. We then ef-
ficiently solve for a final estimate of P_G via a partial Direct
Linear Transform (pDLT) in a reduced solution space. The
workflow of the algorithm is summarized in Alg. 1.
3.1. Framework
Following adaptive background subtraction on the input
video sequences, we merge connected foreground pixels to
form foreground blobs. We then fit an ellipse to each blob
and represent the blob by the two end points of the major
axis of the ellipse. Assuming each foreground blob corre-
sponds to a person in the 2D image plane, the two end points
approximately represent the pixel locations of the foot and
head of the person. Denote by b_n^{(k)} the nth 2D blob ex-
tracted from the kth view and B_n^{(k)} the corresponding 3D
blob in the gWCS. These 2D and 3D blobs can be repre-
sented in homogeneous coordinates as:

b_n^{(k)} = \begin{bmatrix} x_f & x_h \\ y_f & y_h \\ 1 & 1 \end{bmatrix},
\qquad
B_n^{(k)} = \begin{bmatrix} X & X \\ Y & Y \\ 0 & H \\ 1 & 1 \end{bmatrix},    (7)

where x_f, y_f, x_h, y_h are pixel locations of the foot and
head, and X, Y, H indicate the pedestrian's location (in
the ground plane) and height, respectively. The projective
matrix P^{(k)} (Eqn. 5) projects a 3D blob into 2D: b_n^{(k)} \propto
P^{(k)} \cdot B_n^{(k)}. Assuming upright pedestrians walking in the
ground plane, the degree of freedom (DoF) of \{B_n^{(k)}\} is
3. Thus B_n^{(k)} can be linearly solved given P_{3\times 4}^{(k)} and b_n^{(k)}.
Specifically, let

M = \begin{bmatrix}
x_f P_{31} - P_{11} & x_f P_{32} - P_{12} & 0 \\
y_f P_{31} - P_{21} & y_f P_{32} - P_{22} & 0 \\
x_h P_{31} - P_{11} & x_h P_{32} - P_{12} & x_h P_{33} - P_{13} \\
y_h P_{31} - P_{21} & y_h P_{32} - P_{22} & y_h P_{33} - P_{23}
\end{bmatrix},    (8)

t = \begin{bmatrix}
P_{14} - x_f P_{34} \\
P_{24} - y_f P_{34} \\
P_{14} - x_h P_{34} \\
P_{24} - y_h P_{34}
\end{bmatrix},    (9)

it can be proven that [X, Y, H]^T = (M^T M)^{-1} M^T t. We
denote such a backward projection from 2D to 3D as
B_n^{(k)} = \{P^{(k)}\}^{-1} b_n^{(k)}. Note that the \{\cdot\}^{-1} operator here
does not refer to the conventional matrix inversion.
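The backward projection can be implemented directly from Eqns. 8-9 as a 4×3 linear least-squares problem. The following NumPy sketch (illustrative, not the authors' implementation) recovers (X, Y, H) for one foot-head blob given a 3×4 projection matrix P:

```python
import numpy as np

def back_project_blob(P, xf, yf, xh, yh):
    """Recover [X, Y, H] of one foot-head blob from a 3x4 projection P (Eqns. 8-9).

    Assumes the foot lies on the ground plane (Z = 0) and the head is
    vertically above it at height H.
    """
    M = np.array([
        [xf * P[2, 0] - P[0, 0], xf * P[2, 1] - P[0, 1], 0.0],
        [yf * P[2, 0] - P[1, 0], yf * P[2, 1] - P[1, 1], 0.0],
        [xh * P[2, 0] - P[0, 0], xh * P[2, 1] - P[0, 1], xh * P[2, 2] - P[0, 2]],
        [yh * P[2, 0] - P[1, 0], yh * P[2, 1] - P[1, 1], yh * P[2, 2] - P[1, 2]],
    ])
    t = np.array([
        P[0, 3] - xf * P[2, 3],
        P[1, 3] - yf * P[2, 3],
        P[0, 3] - xh * P[2, 3],
        P[1, 3] - yh * P[2, 3],
    ])
    X, Y, H = np.linalg.lstsq(M, t, rcond=None)[0]   # equals (M^T M)^{-1} M^T t
    return X, Y, H
```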
We formulate the multi-camera calibration as a joint en-
ergy minimization problem. Among various choices of the
cost function for multi-camera calibration, we select the
widely used mean-squared image re-projection error as our
optimization goal, which can also be interpreted as maxi-
mum likelihood estimation under an assumption of Gaus-
sian noise [5]. The re-projection from view j to view k can
be expressed as:

b_n^{(k|j)} \propto P^{(k)} \{P^{(j)}\}^{-1} b_n^{(j)} \propto P^{(k)} B_n^{(j)},    (10)

where a 2D blob in view j is first back-projected into the
gWCS with operator \{P^{(j)}\}^{-1} and then projected into view
k under P^{(k)}. The overall re-projection error is defined by

\varepsilon = \sum_k \varepsilon^{(k)} = \sum_k \sum_j e\bigl(b^{(k)}, b^{(k|j)}\bigr)    (11)
            = \sum_k e\bigl(b^{(k)}, P^{(k)} B\bigr),    (12)
Alg. 1 Unsupervised multi-camera calibration.
input:  {b^{(k)} | k = 1, ..., K}
output: {P^{(k)} | k = 1, ..., K}
individual camera calibration:
    estimate {P_L^{(k)} | k = 1, ..., K} via [11]
joint camera network calibration:
    initialize 3D blobs:
        B^{(k)} = ∅,  k = 1, ..., K-1
        B^{(K)} = {P_L^{(K)}}^{-1} b^{(K)}   (i.e., P^{(K)} = P_L^{(K)})
    for m = 1, 2, ...
        for k = 1, ..., K
            1) B = [B^{(1)}, ..., B^{(k-1)}, B^{(k+1)}, ..., B^{(K)}]
            2) update P^{(k)} = P_L^{(k)} · P_G^{(k)} ← arg min {ε^{(k)} | b^{(k)}, B}
            3) update B^{(k)} ← {P^{(k)}}^{-1} b^{(k)}
            4) compute ε according to Eqn. 12
        if ε_m > ε_{m-1}: terminate
where B = (B^{(1)}, ..., B^{(K)}) is the set of 3D blobs con-
tributed by all views. Note that the self back-projection al-
ways has b_n^{(k|k)} = b_n^{(k)}; hence e(b^{(k)}, P^{(k)} B^{(k)}) is always 0.
e(·,·) is a robust matching error metric, defined in Sec. 3.3,
that measures the compatibility between the set of fore-
ground blobs b^{(k)} observed in view k and the set of re-
projected blobs from 3D (B), as contributed by all views,
where no correspondence information between b^{(k)} and B is
given.
Minimizing the above cost function directly is in-
tractable. However, as can be seen from Eqn. 12, we can
iteratively optimize the projection matrix for each view. To
initiate the lWCS-gWCS matching, we first align the gWCS
with one of the lWCSs (here we pick the last view indexed
by K) to obtain an initial set of 3D blobs. Then we se-
quentially calibrate each camera kby (1) inferring the cor-
respondences between b(k)and Bunder the presence of out-
liers and noise with a robust matching metric (Sec. 3.3) and
(2) optimizing the projective matrix P(k)given the 2D-3D
blob correspondences by partial direct linear transformation
(Sec. 3.4). Empirically, we observe that our multi-camera
calibration usually converges in no more than three itera-
tions, i.e., m3in Alg. 1.
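The alternating scheme of Alg. 1 can be summarized by the skeleton below. It reuses back_project_blob from the sketch above and takes the per-view registration step (Sec. 3.4) and the robust error metric (Sec. 3.3) as caller-supplied functions; it is a structural sketch under those assumptions, not the authors' code.

```python
def calibrate_network(blobs_2d, P_L, register_to_gwcs, reprojection_error):
    """Skeleton of Alg. 1.

    blobs_2d[k] : list of (xf, yf, xh, yh) blobs observed in view k
    P_L[k]      : single-view calibration of view k (Sec. 3.2)
    register_to_gwcs(P_L_k, blobs_k, B_others) -> P_k   # Sec. 3.4 (pDLT), supplied by caller
    reprojection_error(P_k, blobs_k, B_all) -> float    # Sec. 3.3 (Eqn. 14), supplied by caller
    """
    K = len(blobs_2d)
    P, B = [None] * K, [[] for _ in range(K)]
    P[K - 1] = P_L[K - 1]                    # align the gWCS with the last lWCS
    B[K - 1] = [back_project_blob(P[K - 1], *b) for b in blobs_2d[K - 1]]
    prev_err = float('inf')
    while True:
        for k in range(K):
            others = [b for j in range(K) if j != k for b in B[j]]
            P[k] = register_to_gwcs(P_L[k], blobs_2d[k], others)
            B[k] = [back_project_blob(P[k], *b) for b in blobs_2d[k]]
        err = sum(reprojection_error(P[k], blobs_2d[k], B) for k in range(K))
        if err >= prev_err:                  # typically converges within 2-3 rounds
            break
        prev_err = err
    return P
```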
3.2. Individual Camera Calibration to lWCS
This section explains how we obtain the input matrices
{P_L^{(k)}} w.r.t. the lWCS. As shown in Eqns. 1, 3, 4, this is equiv-
alent to estimating the focal length f and the vertical van-
ishing point v_0. We apply the method of [11] that first uses
RANSAC to find the vanishing point and then roughly es-
timates the focal length based on prior knowledge of the
3D-height distribution of inlier blobs. Note that we do not
need to assume a constant person height for calibration, as
is often done in algorithms for foot-head homography esti-
mation.
Figure 3. Vanishing point detection under different camera angles,
foreground blob sizes, and crowd densities. Green lines indicate
the major axes of inlier blobs. Outliers are marked with red lines.
Yellow dashed lines indicate the vanishing points.
The vanishing point estimation is carried out in homoge-
neous coordinates, and is robust in cases when a vanishing
point is close to infinity. Fig. 3 demonstrates a few exam-
ples of RANSAC-based vanishing point voting under dif-
ferent camera settings as well as varying foreground sizes
and densities. It is worth mentioning that many blobs cor-
responding to real pedestrians are classified as outliers be-
cause of region deformation due to partial detection or be-
ing merged with other people, especially in crowded scenes.
However our goal is not to detect all pedestrian foreground
blobs but to extract enough inlier blobs for the following
analysis.
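As an illustration of the RANSAC voting used here (the details follow [11]; the angular tolerance, iteration count, and names below are our own assumptions), a vanishing-point hypothesis can be formed from two blob major axes and scored by how many other axes point towards it:

```python
import numpy as np

def axis_line(foot, head):
    """Homogeneous image line through the foot and head pixels of one blob."""
    return np.cross(np.array([*foot, 1.0]), np.array([*head, 1.0]))

def ransac_vanishing_point(blobs, iters=500, angle_tol_deg=2.0):
    """blobs: list of (foot_xy, head_xy). Returns (v0_homogeneous, inlier_indices)."""
    tol = np.deg2rad(angle_tol_deg)
    lines = [axis_line(f, h) for f, h in blobs]
    best_v, best_inliers = None, []
    for _ in range(iters):
        i, j = np.random.choice(len(blobs), 2, replace=False)
        v = np.cross(lines[i], lines[j])         # intersection of the two axes
        if np.allclose(v, 0):
            continue                              # degenerate (parallel/identical) sample
        inliers = []
        for k, (f, h) in enumerate(blobs):
            axis = np.asarray(h, float) - np.asarray(f, float)
            vp_dir = v[:2] - v[2] * np.asarray(f, float)   # foot -> vanishing point direction
            cos = abs(axis @ vp_dir) / (np.linalg.norm(axis) * np.linalg.norm(vp_dir) + 1e-9)
            if np.arccos(np.clip(cos, -1.0, 1.0)) < tol:
                inliers.append(k)
        if len(inliers) > len(best_inliers):
            best_v, best_inliers = v, inliers
    return best_v, best_inliers
```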
Focal length estimation follows a hypothesize-and-test
process. Given a hypothesized focal length f, together
with the vanishing point v_0, we can recover the relative
3D heights H_i of each inlier blob b_i (w.r.t. the camera
height); a more efficient method would use the cross-ratio
invariance trick [11]. This process leverages the fact that the
distribution of human heights in 3D forms a very strong
cluster with |H_i - µ|/µ < λ, where µ is the average inlier
pedestrian height and λ = 0.1 [11]. Different hypotheses are
evaluated against a robust log-likelihood function defined as:
L(f) = \frac{1}{\mu^2} \sum_{i \in I} \max\{\lambda\mu - |H_i - \mu|, 0\}^2,    (13)

where I represents the set of RANSAC inliers, and outlier
candidates H_i that fall out of the height range of the major-
ity of the inlier blobs, e.g., H_i > (1 + λ)µ, are ignored. As
we sample the camera field of view (FoV) angle at a res-
olution of 1°, which is about the state-of-the-art accuracy
for pedestrian-based surveillance camera calibration [8], the
focal length f that produces the highest likelihood score ac-
cording to Eqn. 13 is selected as our initial estimate for the
multi-view calibration.
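The hypothesize-and-test search can be written compactly as below. robust_score transcribes Eqn. 13; the blob_heights callable is an assumed stand-in for the relative 3D height recovery of [11], and the FoV sampling range is illustrative.

```python
import numpy as np

def robust_score(heights, lam=0.1):
    """Eqn. 13: truncated-quadratic score of the recovered relative 3D heights."""
    heights = np.asarray(heights, float)
    mu = np.mean(heights)                              # average pedestrian height estimate
    resid = np.maximum(lam * mu - np.abs(heights - mu), 0.0)
    return np.sum(resid ** 2) / mu ** 2                # heights outside (1 +/- lam)*mu contribute 0

def estimate_focal(inlier_blobs, v0, image_width, blob_heights,
                   fov_deg=np.arange(20.0, 120.0, 1.0)):
    """Search the FoV at 1 degree resolution and keep the best focal length.

    blob_heights(blobs, v0, f) is a caller-supplied function that recovers the
    relative 3D height of each inlier blob for a hypothesized f (cf. [11]).
    """
    best_f, best_score = None, -np.inf
    for fov in np.deg2rad(fov_deg):
        f = 0.5 * image_width / np.tan(0.5 * fov)      # focal length for this FoV hypothesis
        score = robust_score(blob_heights(inlier_blobs, v0, f))
        if score > best_score:
            best_f, best_score = f, score
    return best_f
```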
3.3. Robust Distance Metric for Blob Matching
This section explains the cost function (Eqn. 11) for
cross-view lWCS matching. Recall that b^{(i)} denotes the set
of 2D blobs in view i for the entire sequence, B^{(i)} denotes
the set of 3D blobs back-projected from view i to the gWCS,
and b^{(i|j)} is the set of re-projected 2D blobs from view j
to i. The cost function is defined as the sum of the re-
projection errors of all pairs of 2D-3D blobs (denoted as
set l) extracted at the same time stamp.
Since the foreground blobs are noisy in the sense of
both false alarms and missed detections, the proportion
of ‘good’ pairs (two ‘good’ blobs extracted from different
views corresponding to the same person) is even smaller,
especially under crowded scenarios. We thus adopt the trun-
cated quadratic [2], which belongs to the robust statistics of
truncated least squares, defined as:

e(b^{(i)}, b^{(i|j)}) = \sum_{(b_{n_1}^{(i)}, B_{n_2}^{(j)}) \in l} \min\{ d(b_{n_1}^{(i)}, b_{n_2}^{(i|j)})^2, \tau^2 \},    (14)

where d(·,·) is the 4D Euclidean distance between two blobs
in pixels (the (x, y) coordinates of feet and head), and the error
tolerance is set to be τ² = (1/100)·W·H, where W and H
are the width and height of the image. We find this setting
yields satisfactorily consistent results for video sequences
with very different camera settings, as demonstrated in the
experiments section. We iteratively use the error tolerance
as a threshold to discover 'good' blob correspondences from
all possible pairs in l and re-estimate calibration parameters
based on these inliers.
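A direct transcription of Eqn. 14 for the blobs of one time stamp (NumPy sketch; the array layout is an assumption):

```python
import numpy as np

def robust_match_error(obs, reproj, width, height):
    """Eqn. 14: truncated-quadratic error between observed and re-projected blobs.

    obs, reproj: arrays of shape (N, 4) and (M, 4) holding (xf, yf, xh, yh)
    for blobs extracted at the same time stamp in the two roles.
    """
    tau2 = width * height / 100.0
    err = 0.0
    for b in np.asarray(obs, float):
        d2 = np.sum((np.asarray(reproj, float) - b) ** 2, axis=1)  # squared 4D distances
        err += np.sum(np.minimum(d2, tau2))                        # every pair in l contributes
    return err
```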
3.4. Multiview Calibration to gWCS by Partial DLT
This section describes the sequential registration of mul-
tiple lWCSs in Alg. 1. The goal is to estimate the cam-
era projection matrix P from an initial set of noisy inlier
2D-3D blob correspondences (l_in) and the results of the
single-view calibration (P_L). We propose an iterative pro-
cess based on a variant of the direct linear transform (DLT).
The algorithm iteratively estimates the global projection matrix
and refines the inlier correspondences l_in and P_L once P_G
has been updated. The overall optimization is summarized
in Alg. 2.
To solve the global transformation of the kth view, P^k =
P_L^k · P_G^k (Eqn. 5), from a linear system constructed from
inlier blob correspondences between the 2D blobs and 3D
blobs,

[x, y, z]^T \propto P_L^k \cdot P_G^k \cdot [X, Y, Z, 1]^T,    (15)

a straightforward approach would treat P as a general ma-
trix with 12 DoF and solve it using DLT. However, the DLT
solution is known to be overdetermined, as the real perspec-
tive matrix only has 11 DoF (up to a scale). More impor-
tantly, when the 2D-3D correspondences are noisy, DLT can
easily overfit the free-form solution, making it harder to dis-
tinguish inlier correspondences from outliers during the it-
erative process, which would further degrade the estimation
accuracy.
Alg. 2 Calibration to gWCS by Partial DLT
input:  {P_L^{(k)}}_init, b^{(k)}, B, l
output: P^{(k)}, {P_L^{(k)}}_updated, {l_in}_updated
initialization:
    randomly sample the initial correspondences: l_in ⊂ l
for m = 1, ..., M
    1) optimize α in P_G^{(k)}, given l_in, P_L^{(k)}
    2) optimize s, t_x, t_y in P_G^{(k)}, given l_in, P_L^{(k)}, α
    3) optimize f, P_L^{(k)}, given l_in, P_G^{(k)}
    4) compute ε^{(k)} and update l_in (Eqn. 14)
    if ε_m^{(k)} > ε_{m-1}^{(k)}: terminate
Therefore, we limit the DoF of the solution space
by estimating the five variables in Eqns. 1, 6, namely α, s, t_x, t_y, f,
sequentially, while fixing the vanishing point. The motiva-
tion is that (1) we assume the initial estimate of the vanishing
point v_0 is accurate enough; and (2) by reducing the DoF of the
projection matrix, we introduce a partial DLT here to solve
for subsets of parameters efficiently without suffering from
the 'over-determined' problem of DLT.
To estimate the ground-plane rotation angle α, we fix f
and P_L^k so that P_G^k has 7 DoF:

P_G^k \approx P_G^k|_7 = \begin{bmatrix}
m_{11} & m_{12} & 0 & m_{14} \\
m_{21} & m_{22} & 0 & m_{24} \\
0 & 0 & m_{33} & 0 \\
0 & 0 & 0 & 1 \end{bmatrix}.    (16)
P_G|_7 can be directly optimized by linearizing Eqn. 15 sim-
ilarly to the DLT (thus referred to as partial DLT), and the
rotation angle can be approximated as α = atan2(m_{21} -
m_{12}, m_{11} + m_{22}). We then fix α and f, so that s, t_x, t_y to-
gether can be directly linearly solved in the same way (see
the Appendix for the detailed derivation). Although the initial
estimate of the focal length f may not be very accurate, it con-
strains the search space to a small region for refining the
estimates. To optimize f, we fix the current estimates of the
other parameters and adopt pDLT again on the linear system

[x, y, z]^T \propto \begin{bmatrix} a_f & & \\ & a_f & \\ & & 1 \end{bmatrix}
\cdot P_L^k \cdot P_G^k \cdot [X, Y, Z, 1]^T,    (17)

with a_f being the only parameter. This suggests the optimal
focal length should be updated as f ← a_f · f. We then up-
date θ in P_L with the new f according to Eqn. 4 (so the vanishing
point remains the same) and re-evaluate the matching error
of Eqn. 14. We only accept the new f if it reduces the matching
error.
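Because Eqn. 17 is linear in the single unknown a_f once P = P_L · P_G is fixed, the focal refinement reduces to a one-parameter least-squares fit. A sketch (illustrative names; each foot or head correspondence contributes one 2D-3D point pair):

```python
import numpy as np

def refine_focal_scale(P, pts_2d, pts_3d):
    """Estimate a_f in Eqn. 17 from 2D-3D point correspondences under a fixed P = P_L * P_G.

    pts_2d: (N, 2) image points; pts_3d: (N, 3) gWCS points.
    The focal length is then updated as f <- a_f * f (accepted only if the
    matching error of Eqn. 14 decreases).
    """
    A, b = [], []
    for (x, y), X in zip(pts_2d, pts_3d):
        q = P @ np.append(np.asarray(X, float), 1.0)   # (q1, q2, q3) before the diag(a_f, a_f, 1) scaling
        A.extend([q[0] / q[2], q[1] / q[2]])            # a_f * q1/q3 should match x, a_f * q2/q3 should match y
        b.extend([x, y])
    a_f = np.linalg.lstsq(np.asarray(A)[:, None], np.asarray(b), rcond=None)[0][0]
    return a_f
```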
4. Experiments
We have conducted extensive evaluation on a synthe-
sized dataset for stress testing, and on four different pub-
lic sequences with various camera settings, crowd densities,
and background subtraction qualities: (a) indoor 4-person
sequence with three views of resolution 288 ×360 [1],
(b) outdoor campus sequence with two views of 960 ×
1280 [14], (c) PETS09 sparse crowd sequence with 4 views
(1,5,6,8), where view#1 is 576 ×768 and the rest are
576×720, and (d) PETS09 medium density crowd sequence
with two views (#1 and #2) of resolution 576 ×768 2.
We also provide quantitative comparison against [11] in
terms of focal length estimation, since to our best knowl-
edge, no other existing work estimates surveillance network
calibration without correspondence information.
4.1. Synthesized Dataset for Stress-Testing
We synthesize a dataset with three cameras of different
focal lengths (f_1 = 1000, f_2 = 1200, f_3 = 1000), view-
ing angles, ground plane locations, and heights. Multiple
pedestrians were synthesized in 3D as feet-head pairs with
a height variance of ±10%, and then projected into indi-
vidual views. In the stress test, we consider three sources
of input noise: location noise σ_0, foreground blob recall
rate r, and precision rate p. The location noise is produced as
zero-mean Gaussian noise added onto the original feet-head
pixel locations. The blob recall and precision indicate false
negatives and false positives and are simulated by randomly
removing inlier blobs and adding outlier blobs. The default
stress parameters are set as σ_0 = 5 (the standard deviation
of the Gaussian noise in pixels), r = 70% and p = 70%. We
then vary σ_0 from 1 to 10, and r, p from 1 to 0 respectively,
with the other two parameters fixed at the default level.
Figure 4. Synthesized 3-camera dataset with calibration results.
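For concreteness, the stress perturbation applied to a clean set of projected feet-head blobs can be sketched as follows (the uniform outlier generator and parameter names are our own assumptions, made only for illustration):

```python
import numpy as np

def perturb_blobs(blobs, sigma0=5.0, recall=0.7, precision=0.7,
                  img_w=1280, img_h=960, rng=None):
    """Add location noise, drop inliers (recall), and add outliers (precision).

    blobs: (N, 4) array of clean (xf, yf, xh, yh) projections.
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(blobs, float)
    noisy = clean + rng.normal(0.0, sigma0, clean.shape)        # feet-head location noise
    kept = noisy[rng.random(len(noisy)) < recall]               # simulate missed detections
    n_out = int(len(kept) * (1.0 - precision) / max(precision, 1e-6))  # false positives
    outliers = np.column_stack([
        rng.uniform(0, img_w, n_out), rng.uniform(0, img_h, n_out),
        rng.uniform(0, img_w, n_out), rng.uniform(0, img_h, n_out),
    ])
    return np.vstack([kept, outliers])
```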
Fig. 4 visualizes the calibration results. We also quanti-
tatively evaluate the performance of our algorithm by mea-
suring the average RMSE re-projection error in pixels as
well as the average relative focal length estimation error
e = |f - f_GT| / f_GT. The performance of our approach
under different stress levels is plotted in Fig. 5.
It can be seen that the re-projection error increases with
increasing feet-head location noise, but is relatively sta-
ble w.r.t. missing inliers and against outliers, which well
demonstrates the effectiveness of our robust estimation met-
ric. When the precision decreases from p = 0.2 to p = 0.1
in Fig. 5 (bottom), the proportion of outliers increases from
50% to 90% and results in a big performance drop. In all
other cases, our focal length estimation remains accurate
and performs better than a single-view based approach [11].
Figure 5. Performance under different levels of (a) location noise;
(b) foreground recall rate; (c) foreground precision rate. Top: average
re-projection error. Bottom: focal length error of our approach (red) and [11].
4.2. Real Sequences
As can be seen from the calibration results in Fig. 6, our
ground plane estimations are accurate in general (a black
grid mask overlaid on the original frame). For quantitative
evaluation, we manually labeled corresponding pedestrians
across views on 50 frames for each sequence. Fig. 7 com-
pares the algorithm re-projections with the groundtruth la-
bels. The view-specific re-projection error is summarized
in Tab. 1, where the E_in column indicates the matching er-
ror when pedestrians from other views are mapped into the
current view, and the E_out column indicates the matching
error when pedestrians from the current view are mapped
into other views. We show the RMSE both in pixels and
in terms of normalized distance, E_r = E / \sqrt{W \cdot H}. Note
that some labeling error exists on feet/head positions of the
groundtruth labels.
The focal length estimates for sequences with ground-
truth f_GT and relative error e are summarized in Tab. 2.
Again, compared with the baseline method using single-
view calibration, our method achieves better accuracy in
most cases. The second view of the campus sequence
reported the largest error; however, the cross-view re-
projection estimation and ground-plane estimation are still
acceptable. This can be explained as a case where inaccu-
rate focal length estimation cancels out the inaccurate van-
ishing point estimation, resulting in a small re-projection
error, which is the primary goal for surveillance calibration.
Sequence          View   E_in   E_out   E_in^r   E_out^r
indoor            1      13     16      4.0%     5.0%
                  2      23     12      7.1%     3.6%
                  3      14     22      4.3%     6.8%
campus            1      36     24      3.3%     2.2%
                  2      24     36      2.2%     3.3%
PETS09 (sparse)   1      10     34      1.6%     5.1%
                  2      29     12      4.7%     2.0%
                  3      13     11      2.0%     1.8%
                  4      16     12      2.5%     1.8%
PETS09 (dense)    1      18     8       2.6%     1.2%
                  2      8      18      1.2%     2.6%
Table 1. View-specific re-projection error on real sequences.
Sequence          View   f_GT   f_base   e_base   f      e
Campus            1      1057   1044     1%       1034   2%
                  2      1198   1545     29%      1427   19%
PETS09 (sparse)   1      1170   1084     7%       1218   4%
                  5      830    828      0.2%     828    0.2%
                  6      877    891      2%       869    1%
                  8      737    772      5%       772    5%
PETS09 (dense)    1      1170   950      19%      1067   9%
                  2      659    624      5%       634    4%
Table 2. Focal length estimation, where f_GT is the groundtruth
focal length, f_base and f are estimates computed by the
baseline [11] and our method, respectively, and e_base and e are
relative errors w.r.t. the groundtruth. Our method outperforms the
baseline in most cases.
Figure 6. Example calibration results for: (a) indoor 4-people sequence; (b) outdoor campus sequence; (c) PETS09 sparse; (d) PETS09
dense. For each sequence: Top: original frame overlaid with calibration results. Middle: noisy foreground masks overlaid with major
axes of inlier blobs. Bottom: registered top-down view of inlier blobs.
Figure 7. Re-projection evaluation. Manually labeled corresponding pedestrians in different views are plotted in straight lines with the
same color. The cross-view re-projections based on calibration are plotted in dashed lines.
5. Summary
We propose a novel framework for unsupervised surveil-
lance camera network calibration. We take as input noisy fore-
ground (pedestrian) blobs captured directly by the cameras,
without any cross-time or cross-view correspondence in-
formation. We first apply robust self-calibration to cali-
brate each camera w.r.t. its lWCS to reduce the DoF of the
projective transformation to be estimated later, while also
pruning a large proportion of outlier blob observations. We
then sequentially align all the lWCSs to a shared gWCS,
during which we introduce truncated least-squares as a ro-
bust error metric to iteratively determine inlier correspon-
dences, while applying a series of partial DLTs to solve
for a projective transformation. We demonstrate the ro-
bustness of our algorithm against different camera settings,
foreground qualities, outlier ratios and crowd densities via
extensive experiments on synthesized image sequences as
well as on publicly available real datasets.
Appendix: Groundplane Registration using
Partial DLT
Given a fixed ground-plane rotation α and local calibration
matrix P_L, we linearize the equation set by introducing an
auxiliary variable z to resolve the projective ambiguity:

[x·z, y·z, z]^T = P_L \cdot P_G \cdot [X, Y, Z, 1]^T,    (18)

with

P_L = \begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix},    (19)

P_G = \begin{bmatrix}
\cos\alpha \cdot s & -\sin\alpha \cdot s & 0 & t_x \\
\sin\alpha \cdot s & \cos\alpha \cdot s & 0 & t_y \\
0 & 0 & s & 0 \\
0 & 0 & 0 & 1 \end{bmatrix}.    (20)

Via Gaussian elimination on the auxiliary variable z, we can
reorganize Eqn. 18 to obtain 2 sets of constraints for each
2D-3D point-pair correspondence, so that the calibration
parameters [s, t_x, t_y] can be linearly solved:

\begin{bmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \end{bmatrix}
\cdot \begin{bmatrix} s \\ t_x \\ t_y \end{bmatrix}
= \begin{bmatrix} u_1 \\ u_2 \end{bmatrix},    (21)

where

m_{11} = (p_{11}X + p_{12}Y - p_{31}xX - p_{32}xY)\cos\alpha + (p_{12}X
       - p_{11}Y - p_{32}xX + p_{31}xY)\sin\alpha + (p_{13} - p_{33}x)Z,
m_{21} = (p_{21}X + p_{22}Y - p_{31}yX - p_{32}yY)\cos\alpha + (p_{22}X
       - p_{21}Y - p_{32}yX + p_{31}yY)\sin\alpha + (p_{23} - p_{33}y)Z,

and

m_{12} = p_{11} - p_{31}x,   m_{22} = p_{21} - p_{31}y,
m_{13} = p_{12} - p_{32}x,   m_{23} = p_{22} - p_{32}y,
u_1 = -p_{14} + p_{34}x,     u_2 = -p_{24} + p_{34}y.    (22)
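Stacking the two rows of Eqn. 21 for every inlier correspondence yields an ordinary least-squares problem in [s, t_x, t_y]. A sketch following the coefficients of Eqn. 22 (illustrative names, assuming α and P_L are already fixed):

```python
import numpy as np

def solve_scale_translation(P_L, alpha, pts_2d, pts_3d):
    """Linear solve for [s, tx, ty] (Eqns. 21-22), with alpha and P_L fixed.

    pts_2d: (N, 2) image points (x, y); pts_3d: (N, 3) gWCS points (X, Y, Z).
    """
    p = np.asarray(P_L, float)
    ca, sa = np.cos(alpha), np.sin(alpha)
    A, u = [], []
    for (x, y), (X, Y, Z) in zip(pts_2d, pts_3d):
        m11 = (p[0,0]*X + p[0,1]*Y - p[2,0]*x*X - p[2,1]*x*Y) * ca \
            + (p[0,1]*X - p[0,0]*Y - p[2,1]*x*X + p[2,0]*x*Y) * sa + (p[0,2] - p[2,2]*x) * Z
        m21 = (p[1,0]*X + p[1,1]*Y - p[2,0]*y*X - p[2,1]*y*Y) * ca \
            + (p[1,1]*X - p[1,0]*Y - p[2,1]*y*X + p[2,0]*y*Y) * sa + (p[1,2] - p[2,2]*y) * Z
        A.append([m11, p[0,0] - p[2,0]*x, p[0,1] - p[2,1]*x])   # row for the x constraint
        A.append([m21, p[1,0] - p[2,0]*y, p[1,1] - p[2,1]*y])   # row for the y constraint
        u.extend([-p[0,3] + p[2,3]*x, -p[1,3] + p[2,3]*y])
    s, tx, ty = np.linalg.lstsq(np.asarray(A), np.asarray(u), rcond=None)[0]
    return s, tx, ty
```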
References
[1] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. PAMI, 2011.
[2] M. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. CVIU, 63(1):75-104, 1996.
[3] T. Chen, A. Bimbo, F. Pernici, and G. Serra. Accurate self-calibration of two cameras by observations of a moving person on a ground plane. In Proc. AVSS, 2007.
[4] W. Ge and R. T. Collins. Crowd detection with a multiview sampler. In Proc. ECCV, 2010.
[5] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049, 2000.
[6] M. Hodlmoser and M. Kampel. Multiple camera self-calibration and 3D reconstruction using pedestrians. In Proc. ISVC, 2010.
[7] I. N. Junejo and H. Foroosh. Trajectory rectification and path modeling for video surveillance. In Proc. ICCV, 2007.
[8] N. Krahnstoever and P. R. Mendonca. Bayesian autocalibration for surveillance. In Proc. ICCV, pages 1858-1865, 2005.
[9] D. Liebowitz. Camera Calibration and Reconstruction of Geometry from Images. PhD thesis, University of Oxford, 2001.
[10] D. Liebowitz and A. Zisserman. Combining scene and auto-calibration constraints. In Proc. EuroGraphics, pages 293-300, 1999.
[11] J. Liu, R. T. Collins, and Y. Liu. Surveillance camera autocalibration based on pedestrian height distributions. In Proc. BMVC, 2011.
[12] F. Lv, T. Zhao, and R. Nevatia. Camera calibration from video of a walking human. PAMI, 28(9):1513-1518, 2006.
[13] B. Micusik and T. Pajdla. Simultaneous surveillance camera calibration and foot-head homology estimation from human detections. In Proc. CVPR, 2010.
[14] H. Possegger, R. Matthias, S. Sabine, M. Thomas, K. Manfred, R. P. M., and B. Horst. Unsupervised calibration of camera networks and virtual PTZ cameras. In Computer Vision Winter Workshop, 2012.
[15] D. Rother and K. A. Patwardhan. What can casual walkers tell us about the 3D scene. In Proc. ICCV, 2007.
[16] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment - a modern synthesis. Vision Algorithms: Theory and Practice, 2000.