DYNAMIC SUPER RESOLUTION OF DEPTH SEQUENCES WITH NON-RIGID MOTIONS
Kassem Al Ismaeil*, Djamila Aouada*, Bruno Mirbach†, Björn Ottersten*
*SnT - University of Luxembourg    †Advanced Engineering - IEE S.A.
{kassem.alismaeil, djamila.aouada, bjorn.ottersten}@uni.lu    bruno.mirbach@iee.lu
ABSTRACT
We enhance the resolution of depth videos acquired with low resolution time-of-flight cameras. To that end, we propose a new dedicated dynamic super-resolution method capable of accurately super-resolving a depth sequence containing one or multiple moving objects without strong constraints on their shape or motion, thus clearly outperforming existing super-resolution techniques, which perform poorly on depth data and are either restricted to global motions or imprecise due to an implicit estimation of motion. Our proposed approach is based on a new data model that leads to a robust registration of all depth frames after a dense upsampling. The textureless nature of depth images allows us to robustly handle sequences with multiple moving objects, as confirmed by our experiments.
Index Terms—Depth sequence, dynamic super-resolution,
motion estimation, upsampling, ToF data, moving object.
1. INTRODUCTION
Super resolution (SR) is the process of recovering a high resolution (HR) image from a set of captured low resolution (LR) frames. SR was originally defined for static scenes, i.e., scenes where the motion between the observed images is global, as opposed to dynamic scenes containing moving objects. The past two decades have witnessed tremendous work on SR for static scenes. As presented in [3], these algorithms, commonly referred to as classical SR, are numerically limited to small global motions, even for an increased number of LR frames. Moreover, they cannot handle scenes with moving objects, and treat the corresponding frames as outliers. As a solution to these major limitations, example-based SR algorithms have been proposed [4], as well as their combinations with classical multi-frame SR [5]. However, such algorithms depend on a heavy training phase, and the quality of the super-resolved image depends on the suitability of the training data. Relatively little attention has been given to the SR of dynamic scenes. Farsiu et al. [1] have proposed a dynamic shift and add model (dynamic S&A) as a mere extension of the static case [2], hence suffering from the same restrictions. Other methods [6, 7, 8, 9] tackle the problem of dynamic SR by segmenting the moving object before super-resolving it. Such methods do not handle pixels on the boundary of the object, causing major artifacts. In 2010, van Eekeren et al. [9] proposed an algorithm to solve the problem of boundary pixels; however, this algorithm is computationally heavy and based upon strong assumptions.
In 2009, dynamic SR models were proposed with an implicit motion estimation, e.g., steering kernels for SR (SKSR) [17]. While the idea is theoretically attractive, it is very impractical, as it relies on heavy computations and on many empirical parameters. Moreover, these methods are dedicated to 2D intensity sequences and fail strongly on depth data, because of its abrupt value changes around edges and its textureless nature. Such data, usually captured with a time-of-flight (ToF) camera, require a resolution enhancement. Fusion based methods have been proposed as a solution for dynamic depth scenes [10, 11, 12, 13, 14], where an HR 2D camera is coupled with an LR depth camera. These methods often suffer from texture copying problems, and require a perfect alignment and synchronization of the 2D and depth sequences. In this work, we propose to relax the limitations on scale and motion of SR algorithms for dynamic depth scenes containing one or multiple moving objects, without prior assumptions on their shape or motion, and without engaging in an additional learning stage. The proposed algorithm takes advantage of the textureless nature of depth data, leading to a robust median estimation without fusing with 2D data; hence, it avoids blurring and texture copying artifacts. This algorithm is based on a new data model that starts by densely upsampling the LR measurements for an accurate registration.
The organization of the paper starts by formulating the
problem of dynamic SR in Section 2. We then provide our
key concepts for a robust motion estimation in Section 3. In
Section 4, we propose a new data model that leads to a robust
dynamic depth SR algorithm. In Section 5, we experimentally compare its performance with state-of-the-art techniques using depth sequences. A conclusion is given in Section 6.
2. PROBLEM FORMULATION
The aim of dynamic SR algorithms is to estimate a sequence of $N$ HR images $\{\mathbf{x}_t\}_{t=1}^{N}$ of size $(m \times n)$ from an observed LR sequence $\{\mathbf{y}_t\}_{t=1}^{N}$, where each LR image $\mathbf{y}_t$ is of size $(m_0 \times n_0)$ pixels, with $n = r \cdot n_0$ and $m = r \cdot m_0$, such that $r$ is the SR factor. Every image $\mathbf{y}_t$ may be viewed as an LR noisy and deformed realization of $\mathbf{x}_{t_0}$ at the acquisition time $t$, with $t_0 \leq t$. Rearranging all images $\mathbf{x}_t$ and $\mathbf{y}_t$, $t = 1, \cdots, N$, in lexicographic order, i.e., as column vectors of lengths $mn$ and $m_0 n_0$, respectively, we consider the following classical data model:

$$\mathbf{y}_t = \mathbf{D}\mathbf{H}\mathbf{L}_t^{t_0}\mathbf{x}_{t_0} + \mathbf{n}_t, \quad t_0 \leq t \;\text{ and }\; t, t_0 \in [1, N] \subset \mathbb{N}^*, \qquad (1)$$

where $\mathbf{D}$ is a matrix of dimension $(m_0 n_0 \times mn)$ that represents the downsampling operator, and which we assume to be known and constant over time. The system blur is represented by the time and space invariant matrix $\mathbf{H}$. The vector $\mathbf{n}_t$ is an additive Laplacian noise at time $t$, as justified in [2]. The matrices $\mathbf{L}_t^{t_0}$ are $(mn \times mn)$ matrices corresponding to the geometric motion between the considered HR image $\mathbf{x}_{t_0}$ and the observed LR image $\mathbf{y}_t$ prior to its downsampling.
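To make the data model (1) concrete, here is a minimal NumPy sketch that simulates one LR depth observation from an HR frame. The operator choices are assumptions for illustration only: a Gaussian kernel stands in for the system blur $\mathbf{H}$, plain decimation for $\mathbf{D}$, and a global translation for $\mathbf{L}_t^{t_0}$ (the paper itself allows general non-rigid motion):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def simulate_lr_frame(x_hr, r=4, blur_sigma=1.0, dy=0.0, dx=0.0, noise_scale=0.01):
    """Sketch of y_t = D H L x_{t0} + n_t for one frame (all operators assumed)."""
    warped = shift(x_hr, (dy, dx), order=1, mode='nearest')  # L: geometric motion
    blurred = gaussian_filter(warped, sigma=blur_sigma)      # H: system blur
    decimated = blurred[::r, ::r]                            # D: downsampling by r
    noise = np.random.laplace(scale=noise_scale, size=decimated.shape)  # Laplacian n_t [2]
    return decimated + noise

# Example: a synthetic depth map, background at 2.5 m, a square object at 1 m
x = np.full((128, 128), 2.5)
x[40:80, 40:80] = 1.0
y = simulate_lr_frame(x, r=4, dy=1.5, dx=-0.5)  # one 32x32 LR observation
```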
The dynamic SR problem is simplified by reconstructing one HR image at a time using the full observed sequence. From now on, we fix the reference time to $t_0$, and focus on the reconstruction of $\mathbf{x}_{t_0}$ from $\{\mathbf{y}_t\}_{t=t_0}^{N}$. The operation may be repeated for $t_0 = 1, \cdots, N$. Based on the data model in (1), and using an $L_1$ norm between the observations and the model, the Maximum Likelihood (ML) estimate of $\mathbf{x}_{t_0}$ is obtained as follows:

$$\hat{\mathbf{x}}_{t_0} = \arg\min_{\mathbf{x}_{t_0}} \sum_{t=t_0}^{N} \left\| \mathbf{D}\mathbf{H}\mathbf{L}_t^{t_0}\mathbf{x}_{t_0} - \mathbf{y}_t \right\|_1. \qquad (2)$$
Using the same approach as in [2, 17], we consider that $\mathbf{H}$ and $\mathbf{L}_t^{t_0}$ are block circulant matrices. Therefore:

$$\mathbf{H}\mathbf{L}_t^{t_0} = \mathbf{L}_t^{t_0}\mathbf{H}. \qquad (3)$$
The minimization in (2) can therefore be decomposed into two steps: estimation of a blurred HR image $\mathbf{z}_{t_0} = \mathbf{H}\mathbf{x}_{t_0}$, followed by a deblurring step. In what follows, we assume that $\mathbf{y}_t$ is simply the noisy and decimated version of $\mathbf{z}_t$, without any geometric warp. We may thus write $\mathbf{L}_t^{t} = \mathbf{I}$, $\forall t$, $\mathbf{I}$ being the identity matrix; hence, $\mathbf{L}_t^{t_0}\mathbf{z}_{t_0} = \mathbf{z}_t = \mathbf{H}\mathbf{x}_t$. This operation can be assimilated to registering $\mathbf{z}_{t_0}$ to $\mathbf{z}_t$. We draw attention to the fact that in the case of static multi-frame SR, a set of observed LR images is considered instead of a sequence, i.e., there is no order between frames. Such an order becomes crucial in dynamic SR, because the estimation of motion, based on the optical flow paradigm, happens between consecutive frames only. An accurate dynamic SR estimation is consequently highly dependent on the accuracy of estimating the registration matrices $\mathbf{L}_t^{t-1}$, as well as $\mathbf{L}_t^{t_0}$. In the case of one moving object with a very small translational motion across a few frames, a subpixel motion estimation would be sufficient to guarantee a good HR image. This assumption is no longer valid if the object moves fast or the scene contains multiple objects moving with different motions. In this case, the SR process becomes more challenging, and a robust registration method using a dense optical flow is required. Most SR algorithms rely on a registration based on a pixel correspondence that is too coarse compared to the scale of details in the scene. It is therefore necessary to call upon a very accurate subpixel correspondence. In what follows, we argue that this accuracy is highly increased by an upsampling of the observed sequence, as presented in Section 3. We accordingly propose a new data formulation for dynamic depth SR and give its corresponding algorithm in Section 4.
3. MOTION ESTIMATION AND REGISTRATION
It has been shown in [16] that higher image resolutions help increase the accuracy of motion estimation, which justifies applying an upsampling framework to obtain higher scale images. Moreover, performing the registration process on upsampled images guarantees a better result, with a higher accuracy, than registering the LR images $\mathbf{y}_t$ and upsampling them afterwards. This is due to the fact that registration parameters are approximated by rounding the motion vectors, with an expected error of $\pm\frac{1}{2}$ pixel. The effect of this error is related to the size of the registered images; the upsampling process reduces it from $\pm\frac{1}{2m}$ in the LR case to $\pm\frac{1}{2rm}$. Hence, we propose to upsample the observed LR images even before registering them. Due to the specific nature of depth data, classical interpolation-based methods (e.g., bicubic) cannot be used, as they lead to jagged values and blurring effects, especially for boundary pixels. Thus, we propose to densely upsample $\mathbf{y}_t$, $t = 1, \ldots, N$, up to the size of the super-resolved image, $(m \times n)$. We define the resulting image as:
$$\mathbf{y}_t\!\uparrow\; = \mathbf{U}\cdot\mathbf{y}_t, \qquad (4)$$

where $\mathbf{U}$ is a dense upsampling matrix of size $(mn \times m_0 n_0)$, which we choose to be the transpose of $\mathbf{D}$, s.t., $\mathbf{U}\mathbf{D} = \mathbf{A}$, where $\mathbf{A}$ is a block circulant matrix that defines a new blurring matrix $\mathbf{B} = \mathbf{A}\mathbf{H}$. Therefore, we redefine $\mathbf{z}_t$ as:

$$\mathbf{z}_t = \mathbf{B}\mathbf{x}_t. \qquad (5)$$
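As a minimal sketch of this choice of $\mathbf{U}$ (assuming, for illustration, that $\mathbf{D}$ averages each $r \times r$ block, so that $\mathbf{U} = \mathbf{D}^{\top}$ amounts, up to a constant factor, to replicating every LR pixel into its block, i.e., nearest-neighbor upsampling; the paper only requires $\mathbf{D}$ to be known and constant):

```python
import numpy as np

def downsample_block_mean(img, r):
    """One possible D: average each (r x r) block (an assumption for this sketch)."""
    m0, n0 = img.shape[0] // r, img.shape[1] // r
    return img[:m0 * r, :n0 * r].reshape(m0, r, n0, r).mean(axis=(1, 3))

def upsample_dense(img_lr, r):
    """U = D^T (up to a constant factor): replicate every LR pixel into its
    (r x r) block, i.e., nearest-neighbor upsampling."""
    return np.repeat(np.repeat(img_lr, r, axis=0), r, axis=1)

# U D acts as the block-constant blur A: each (r x r) block becomes its mean.
y = np.random.rand(32, 32)
y_up = upsample_dense(y, r=5)   # (160 x 160), eq. (4)
```

Replication introduces no new depth values, which is what keeps depth discontinuities from being smeared the way bicubic interpolation would smear them.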
Since the optical flow approach works under the assumption of small motions, frames that are further from the reference frame $\mathbf{y}_{t_0}\!\uparrow$ introduce a higher registration error than those closer to it, and will thus be considered as outliers. The percentage of these outliers is related to two main factors: the speed of the moving objects and the length of the sequence $N$. For example, a long sequence with a fast moving object would most likely lead to more than 50% outliers, in which case the SR process fails even with a robust estimator of high breakdown value such as the median. To tackle this problem, we herein propose a new registration method based on a cumulative motion compensation.
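The 50% breakdown point of the median can be checked with a toy NumPy example (not from the paper): a depth pixel observed over nine frames survives three badly registered frames, but not five:

```python
import numpy as np

true_depth = 2.0
n_frames = 9
for n_outliers in (3, 5):              # below vs. above the 50% breakdown point
    stack = np.full(n_frames, true_depth)
    stack[:n_outliers] = 0.4           # badly registered frames hit the background
    print(f"{n_outliers}/{n_frames} outliers -> median = {np.median(stack):.2f}")
# 3/9 outliers -> median = 2.00 ;  5/9 outliers -> median = 0.40
```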
Considering two consecutive upsampled frames $\mathbf{y}_{t-1}\!\uparrow$ and $\mathbf{y}_t\!\uparrow$, the optimal registration solution is:

$$\hat{\mathbf{M}}_{t-1}^{t} = \arg\min_{\mathbf{M}} \Psi\left(\mathbf{y}_{t-1}\!\uparrow,\, \mathbf{y}_t\!\uparrow,\, \mathbf{M}\right), \qquad (6)$$

where $\Psi$ is a dense optical flow-related cost function and

$$\mathbf{y}_t\!\uparrow\; = \mathbf{M}_{t-1}^{t}\,\mathbf{y}_{t-1}\!\uparrow + \mathbf{v}_t. \qquad (7)$$
The vector $\mathbf{v}_t$ contains the innovation, which we assume negligible in this framework. In addition, similarly to [20], for analytical convenience, we assume that all pixels in $\mathbf{y}_t\!\uparrow$ originate from pixels in $\mathbf{y}_{t-1}\!\uparrow$ in a one-to-one mapping. Therefore, each row in $\mathbf{M}_{t-1}^{t}$ contains a single 1, at the position corresponding to the address of the source pixel in $\mathbf{y}_{t-1}\!\uparrow$. This bijective property implies that the matrix $\hat{\mathbf{M}}_{t-1}^{t}$ is an invertible permutation, s.t., $[\hat{\mathbf{M}}_{t-1}^{t}]^{-1} = \hat{\mathbf{M}}_{t}^{t-1}$. Furthermore, its estimate leads to the following registration to $\mathbf{y}_{t-1}$:

$$\tilde{\mathbf{y}}_t\!\uparrow\; = \hat{\mathbf{M}}_{t}^{t-1}\,\mathbf{y}_t\!\uparrow. \qquad (8)$$
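A sketch of the pairwise registration step (6)-(8): the permutation matrix $\hat{\mathbf{M}}_t^{t-1}$ is realized here as a dense backward warp. OpenCV's Farnebäck flow is used as one possible stand-in for the cost $\Psi$ (the paper does not commit to a particular flow estimator), and `register_to_previous` is an illustrative helper name, not the authors' code:

```python
import numpy as np
import cv2

def register_to_previous(y_prev_up, y_up):
    """Warp y_up onto y_prev_up's grid: an approximation of (8) with the
    permutation matrix M replaced by a dense backward flow and a remap."""
    def to_u8(d):  # rescale depth to 8-bit only for flow estimation
        d = (d - d.min()) / (np.ptp(d) + 1e-9)
        return (255 * d).astype(np.uint8)

    # Dense flow from y_prev_up to y_up (Farneback: one choice of Psi)
    flow = cv2.calcOpticalFlowFarneback(to_u8(y_prev_up), to_u8(y_up), None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = y_up.shape
    xx, yy = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (xx + flow[..., 0]).astype(np.float32)
    map_y = (yy + flow[..., 1]).astype(np.float32)
    # Nearest-neighbor sampling keeps depth edges sharp (no value mixing)
    return cv2.remap(y_up.astype(np.float32), map_x, map_y,
                     interpolation=cv2.INTER_NEAREST)
```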
We then need to define $\mathbf{y}_t^{t_0}\!\uparrow$, the registered version of $\mathbf{y}_t\!\uparrow$ with respect to the reference $\mathbf{y}_{t_0}\!\uparrow$. To that end, we use all the registered upsampled images $\tilde{\mathbf{y}}_t\!\uparrow$, as defined in (8), for $t > t_0$. We propose, similarly to our work in [16], a cumulative motion compensation approach with an additional improvement, where we further reduce the cumulated motion error by recomputing $\hat{\mathbf{M}}_{t-1}^{t}$ as follows:

$$\hat{\mathbf{M}}_{t-1}^{t} = \arg\min_{\mathbf{M}} \Psi\left(\tilde{\mathbf{y}}_{t-1}\!\uparrow,\, \mathbf{y}_t\!\uparrow,\, \mathbf{M}\right). \qquad (9)$$

We prove by induction the following registration relationship for non-consecutive frames:

$$\mathbf{y}_t^{t_0}\!\uparrow\; = \hat{\mathbf{M}}_t^{t_0}\,\mathbf{y}_t\!\uparrow\; = \underbrace{\hat{\mathbf{M}}_{t_0+1}^{t_0}\cdots\,\hat{\mathbf{M}}_t^{t-1}}_{(t-t_0)\ \text{times}}\cdot\;\mathbf{y}_t\!\uparrow. \qquad (10)$$

Considering the bijection simplification, we further write:

$$\hat{\mathbf{M}}_t^{t_0} \approx \mathbf{L}_{t_0}^{t} = [\mathbf{L}_t^{t_0}]^{-1}. \qquad (11)$$
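The cumulative composition in (10) can be sketched by chaining the pairwise displacement fields: for every pixel of the reference grid we follow the flow chain up to frame $t$ and sample the frame there. A minimal NumPy/SciPy version, assuming the pairwise flows `flows[k]` (from frame $k$ to frame $k+1$) have already been estimated, e.g., with the routine above:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def register_to_reference(frames_up, flows):
    """frames_up: list of upsampled frames y_t (2D arrays), t = t0..N.
    flows[k]: (H, W, 2) displacement field (dy, dx) from frame k to frame k+1.
    Returns the stack registered to frame 0, composing the pairwise maps
    as in eq. (10)."""
    h, w = frames_up[0].shape
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float64)
    coords = np.stack([yy, xx])               # track positions, start on the t0 grid
    registered = [frames_up[0]]
    for k, flow in enumerate(flows):
        # Evaluate the k -> k+1 flow at the current (subpixel) track positions
        dy = map_coordinates(flow[..., 0], coords, order=1, mode='nearest')
        dx = map_coordinates(flow[..., 1], coords, order=1, mode='nearest')
        coords = coords + np.stack([dy, dx])  # compose one more M-hat factor
        # Sample frame k+1 where the reference pixels have moved to
        registered.append(map_coordinates(frames_up[k + 1], coords,
                                          order=0, mode='nearest'))
    return np.stack(registered)               # (N - t0 + 1, H, W)
```

Composing the displacement fields once, rather than resampling the image $(t - t_0)$ times, mirrors the matrix product in (10) and avoids accumulating interpolation error.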
4. PROPOSED ALGORITHM
The subpixel accuracy in motion estimation induced by the combined upsampling and cumulative motion compensation proposed in Section 3 makes it feasible to handle a depth sequence with a moving object without using any prior information on its shape, rigidity, or motion. These advantages extend to the much more complex case of multiple moving objects. Indeed, the textureless nature of depth images categorizes them as images containing gross information only, i.e., with no texture information, as per Mallikarjuna et al.'s composite image model [21]. This property, combined with the impulsive SR noise $\mathbf{n}_t$, suggests that a temporal median estimator is a robust equivalent to the ML formulation of (2).
We reformulate the data model in (1) to introduce the upsampling strategy of Section 3. Combining (1), (3), (4), (5), and (11), we find¹:

$$\mathbf{y}_t^{t_0}\!\uparrow\; = \mathbf{z}_{t_0} + \mathbf{w}_t, \quad t_0 \leq t \;\text{ and }\; t, t_0 \in [1, N] \subset \mathbb{N}^*, \qquad (12)$$

where $\mathbf{w}_t = \hat{\mathbf{M}}_t^{t_0}\mathbf{U}\cdot\mathbf{n}_t$ is an additive Laplacian noise at $t$.

¹ A full proof of (12) will be provided in another paper.
This work was supported by the National Research Fund, Luxembourg, under the CORE project C11/BM/1204105/FAVE/Ottersten.
The estimation in (2) becomes:

$$\hat{\mathbf{z}}_{t_0} = \arg\min_{\mathbf{z}_{t_0}} \sum_{t=t_0}^{N} \left\| \mathbf{z}_{t_0} - \mathbf{y}_t^{t_0}\!\uparrow \right\|_1, \qquad (13)$$

which corresponds to the pixel-wise temporal median estimator, i.e., $\hat{\mathbf{z}}_{t_0} = \mathrm{med}_t\,\{\mathbf{y}_t^{t_0}\!\uparrow\}_{t=t_0}^{N}$. A simple image deblurring then recovers $\hat{\mathbf{x}}_{t_0}$ from $\hat{\mathbf{z}}_{t_0}$. We hence propose a new dynamic SR algorithm corresponding to this new SR estimation, which we refer to as Upsampling for Precise Super-Resolution (UP-SR), summarized below:
UP-SR algorithm
for $t_0 = 1, \cdots, N$ do
   1. Choose the reference frame $\mathbf{y}_{t_0}$.
   for $t$ s.t. $t_0 < t \leq N$ do
      2. Compute $\mathbf{y}_t\!\uparrow$ using (4).
      3. Estimate the registration matrix $\hat{\mathbf{M}}_t^{t_0}$ using (10).
      4. Compute $\mathbf{y}_t^{t_0}\!\uparrow$ using (10).
   end for
   5. Find $\hat{\mathbf{z}}_{t_0}$ by applying the median estimator (13).
   6. Deduce $\hat{\mathbf{x}}_{t_0}$ by deblurring.
end for
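Steps 5 and 6 reduce to a few lines once the registered stack is available. A sketch, assuming the registered, upsampled stack produced by the routines above, with a Wiener deconvolution and a small Gaussian PSF standing in for the unspecified deblurring of $\mathbf{B}$:

```python
import numpy as np
from skimage.restoration import wiener

def upsr_estimate(registered_stack, psf, balance=0.1):
    """Steps 5-6 of UP-SR: pixel-wise temporal median of the registered,
    upsampled stack (eq. (13)), followed by a generic deblurring step.
    registered_stack: (N, H, W) output of the cumulative registration."""
    z_hat = np.median(registered_stack, axis=0)   # robust ML under Laplacian noise
    return wiener(z_hat, psf, balance, clip=False)  # one generic choice of deblurrer

# Illustration with an assumed small Gaussian PSF for the unknown blur B
k = np.arange(-3, 4, dtype=float)
g = np.exp(-k**2 / 2.0)
psf = np.outer(g, g) / np.outer(g, g).sum()
stack = np.random.rand(9, 160, 160)               # stands in for {y_t^{t0} up}
x_hat = upsr_estimate(stack, psf)
```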
5. EXPERIMENTS
We test the performance of the proposed UP-SR algorithm on depth data acquired with a ToF camera. SR estimation is well suited to such data, which suffer from a very low resolution. We start with a simple case of one moving object (a hand) with a translational motion. We compare the performance of the proposed algorithm in two configurations: registration of the measured LR depth images $\mathbf{y}_t^{t_0}$, and registration of the densely upsampled depth images $\mathbf{y}_t^{t_0}\!\uparrow$, $t_0 < t < N$. Results show that in the latter case (Fig. 1(b)), the registration is more accurate and leads to sharper edges. Relying directly on LR images, however, leads to blurred edges (Fig. 1(a)), necessitating a special treatment or a segmentation step to reduce the artifacts caused by boundary pixels. This experimentally confirms the benefit of our upsampling strategy. Next, we tested UP-SR on a real sequence of LR depth images containing multiple moving objects. We mounted an LR ToF camera at a height of 2.5 meters, looking down at the ground, with two persons sitting on chairs sliding in two different directions. A sequence of 9 LR depth images, of size $(56 \times 61)$ pixels, was super-resolved with an SR scale factor $r = 5$ using UP-SR, 2D/depth fusion [13], SKSR [17], and dynamic S&A [1]. Visual results for one frame are given in Fig. 2 (c), (d), (e) and (f).
Fig. 1. UP-SR results: (a) proposed method using motion estimated from an LR sequence; (b) proposed method using motion estimated from a densely upsampled sequence ($r = 5$).
They clearly show that SKSR and dynamic S&A fail badly on depth data, mainly at boundary pixels, while 2D/depth fusion, although computationally efficient, often suffers from strong 2D texture copying on the final super-resolved depth frame. Fig. 2(f) shows the result of UP-SR, where we obtain clear, sharp edges in addition to an efficient removal of noisy pixel values. This is mostly due to the proposed subpixel motion estimation combined with an accurate registration, leading to a successful temporal fusion of the sequence. Finally, in order to provide a quantitative evaluation, we generated an LR depth sequence by downsampling an available HR depth sequence with a factor $r = 4$, and further degrading it with additive white Gaussian noise (AWGN) at signal to noise ratios (SNR) of 15, 25, 35, and 45 dB. We quantitatively compare our proposed algorithm with SKSR and dynamic S&A, testing these methods with the corresponding software provided in [18] and [19]. Since we have a known ground truth $\{\mathbf{x}_t\}_{t=1}^{N}$, we measure the quality of an estimated HR depth frame $\hat{\mathbf{x}}_{t_0}$ using the peak SNR (PSNR), defined as: $\mathrm{PSNR} = 10\log_{10}\frac{m \times n}{\|\mathbf{x}_{t_0} - \hat{\mathbf{x}}_{t_0}\|^2}$. The obtained results show the superiority of the UP-SR algorithm, which provides the best results among the discussed state-of-the-art SR methods across all noise levels. As illustrated in Fig. 3, it is not surprising that even for a very high noise level (SNR = 15 dB) the results are good. This is due to the key components of UP-SR, namely, its subpixel motion estimation and accurate multi-frame registration, combined with a robust median filtering that matches the textureless property of depth data. Therefore, our algorithm yields good quality depth images without having to call upon an additional regularization step. This type of treatment could not be applied to 2D images, where the results would be blurry with lost details. This may be explained by the fact that depth data generally fall under the model in (12).
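For completeness, the PSNR measure used above translates directly into code; note that, as written in the paper, it normalizes by the number of pixels $m \times n$ rather than by a peak intensity:

```python
import numpy as np

def psnr_paper(x_true, x_est):
    """PSNR = 10 log10( (m * n) / ||x_t0 - x_hat_t0||^2 ), as defined in Section 5."""
    m, n = x_true.shape
    return 10.0 * np.log10((m * n) / np.sum((x_true - x_est) ** 2))
```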
Fig. 2. UP-SR example of a dynamic depth scene ($r = 5$): (a) 2D image corresponding to the last frame. (b) Last frame of 9 LR $(56 \times 61)$ depth images. (c) 2D/depth fusion [13]. (d) SKSR [17]. (e) Dynamic S&A [1]. (f) Proposed UP-SR.

Fig. 3. PSNR for different SR methods applied to the moving hand sequence ($r = 4$).

6. CONCLUSION

A new algorithm has been presented to enhance the quality of LR depth videos of dynamic scenes containing one or multiple moving objects. This algorithm is based on the SR framework, without strong constraints on the objects' shape or motion. It takes advantage of the textureless nature of depth data to achieve a robust SR estimation after densely upsampling the LR frames. Experimental results on both synthetic and real ToF depth images showed that this new approach to SR, although conceptually simple, provides a more accurate motion estimation, leading it to greatly outperform existing methods such as fusion based techniques, SKSR, and dynamic S&A.
7. REFERENCES
[1] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, "Advances and Challenges in Super-Resolution", International Journal of Imaging Systems and Technology, vol. 14, pp. 47-57, 2004.
[2] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, "Fast and Robust Multi-Frame Super-Resolution", IEEE TIP, vol. 13, pp. 1327-1344, 2004.
[3] Z. Lin and H. Shum, "Fundamental Limits of Reconstruction-Based Superresolution Algorithms under Local Translation", IEEE PAMI, vol. 26, no. 1, Jan. 2004.
[4] O. Mac Aodha, N. Campbell, A. Nair, and G. Brostow, "Patch Based Synthesis for Single Depth Image Super-Resolution", ECCV 2012.
[5] D. Glasner, S. Bagon, M. Irani, “Super-Resolution from
a Single Image”, ICCV 2009.
[6] R. Hardie, T. Tuinstra, K. Barnard, J. Bognar, and E.
Armstrong, “High Resolution Image Reconstruction from
Digital Video with In-Scene Motion”, ICIP 1997, pp. 153-
156.
[7] S. Farsiu, M. Elad, and P. Milanfar, "Video-to-Video Dynamic Super-Resolution for Grayscale and Color Sequences", EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 61859.
[8] A. W. M. van Eekeren, K. Schutte, J. Dijk, D.-J. de Lange, and L. J. van Vliet, "Super-Resolution on Moving Objects and Background", ICIP 2006, pp. 153-156... wait
[9] A. W. M. van Eekeren, K. Schutte, and L. J. van Vliet,
“Multiframe Super-Resolution Reconstruction of Small
Moving Objects”, IEEE TIP, vol. 19, pp. 2901-2912, 2010.
[10] J. Diebel and S. Thrun, “An application of markov ran-
dom fields to range sensing”, NIPS 18, pp. 291-298, 2006.
[11] J. Kopf, M. Cohen, D. Lischinski, and M. Uyttendaele,
“Joint bilateral upsampling”, ACM TOG, 26(3), 2007.
[12] Q. Yang, R. Yang, J. Davis, and D. Nister, “Spatial-
depth super resolution for range images”, CVPR, 2007.
[13] F. Garcia, D. Aouada, B. Mirbach, T. Solignac, B. Ot-
tersten, “Real-time Hybrid ToF Multi-Camera Rig Fusion
System for Depth Map Enhancement”, IEEE CVPRW,
pp.1-8, 20-25, Jun. 2011.
[14] X. Xiang, G. Li, J. Tong, Z. Pan, “Fast and Simple Super
Resolution for Range Data”, IEEE Conf on Cyberworlds
(CW), pp.319-324, 20-22 Oct. 2010.
[15] F. Garcia, D. Aouada, B. Mirbach, B Ottersten, “Spatio-
Temporal ToF Data Enhancement by Fusion”, ICIP 2012,
pp.981-984.
[16] L. Xu, J. Jia, S. B. Kang, “Improving sub-pixel corre-
spondence through upsampling”, CVIU, vol 116, Issue 2,
February 2012, pp. 250-261, ISSN 1077-3142.
[17] H. Takeda, P. Milanfar, M. Protter, and M. Elad, “Super-
resolution without Explicit Subpixel Motion Estimation”,
IEEE TIP, Vol. 18, No. 9, September 2009.
[18] http://users.soe.ucsc.edu/htakeda/SpaceTimeSKR.htm
[19] http://users.soe.ucsc.edu/milanfar/software/superresolution.html
[20] M. Elad and A. Feuer, "Super-Resolution Reconstruction of Continuous Image Sequence", IEEE PAMI, vol. 21, no. 9, pp. 817-834, Sep. 1999.
[21] H. S. Mallikarjuna and L. F. Chaparro, "Iterative composite filtering for image restoration", IEEE PAMI, vol. 14, no. 6, pp. 674-678, Jun. 1992.