DYNAMIC SUPER RESOLUTION OF DEPTH SEQUENCES WITH NON-RIGID MOTIONS
Kassem Al Ismaeil⋆, Djamila Aouada⋆, Bruno Mirbach†, Björn Ottersten⋆
⋆SnT - University of Luxembourg    †Advanced Engineering - IEE S.A.
{kassem.alismaeil, djamila.aouada, bjorn.ottersten}@uni.lu    bruno.mirbach@iee.lu
ABSTRACT
We enhance the resolution of depth videos acquired with low resolution time-of-flight cameras. To that end, we propose a new dedicated dynamic super-resolution that is capable of accurately super-resolving a depth sequence containing one or multiple moving objects without strong constraints on their shape or motion, thus clearly outperforming existing super-resolution techniques, which perform poorly on depth data and are either restricted to global motions or imprecise due to an implicit estimation of motion. Our proposed approach is based on a new data model that leads to a robust registration of all depth frames after a dense upsampling. The textureless nature of depth images makes it possible to robustly handle sequences with multiple moving objects, as confirmed by our experiments.
Index Terms— Depth sequence, dynamic super-resolution, motion estimation, upsampling, ToF data, moving object.
1. INTRODUCTION
Super resolution (SR) is the process of recovering a high resolution (HR) image from a set of captured low resolution (LR) frames. SR has originally been defined for static scenes, i.e., scenes where the motion between the observed images is global, as opposed to dynamic scenes containing a moving object. The past two decades have witnessed tremendous work on SR for static scenes. As presented in [3], these algorithms, commonly referred to as classical SR, are numerically limited to small global motions, even for an increased number of LR frames. Moreover, they cannot handle scenes with moving objects and consider the corresponding frames as outliers. As a solution to these major limitations, example-based SR algorithms have been proposed [4], as well as their combinations with classical multi-frame SR [5]. However, such algorithms depend on a heavy training phase, and the quality of the super-resolved image is dependent on the suitability of the training data. Relatively little attention has been given to the SR of dynamic scenes. Farsiu et al. [1] have proposed a dynamic shift and add model (dynamic S&A) as a mere extension of the static case [2], hence suffering from the same restrictions. Other methods [6, 7, 8, 9] have been proposed to tackle the problem of dynamic SR by segmenting the moving object before super-resolving it. Such methods do not handle pixels on the boundary of the object, causing major artifacts. In 2010, van Eekeren et al. [9] proposed an algorithm to solve the problem of boundary pixels; however, this algorithm is computationally heavy and based upon strong assumptions.
In 2009, dynamic SR models with an implicit motion estimation were proposed, e.g., steering kernels for SR (SKSR) [17]. While the idea is theoretically attractive, it is very impractical as it relies on heavy computations and on many empirical parameters. Moreover, these methods are dedicated to 2D intensity sequences, and fail strongly on depth data because of its abrupt value changes around edges and its textureless nature. Such data, usually captured with a time-of-flight (ToF) camera, requires resolution enhancement. Fusion-based methods have been proposed as a solution for dynamic depth scenes [10, 11, 12, 13, 14], where an HR 2D camera is coupled with an LR depth camera. These methods often suffer from texture copying problems and require a perfect alignment and synchronization of the 2D and depth sequences.
In this work, we propose to relax the limitations on scale and motion of SR algorithms for dynamic depth scenes containing one or multiple moving objects, without prior assumptions on their shape or motion, and without engaging in an additional learning stage. The proposed algorithm takes advantage of the textureless nature of depth data, leading to a robust median estimation without fusing with 2D data, hence avoiding blurring and texture copying artifacts. This algorithm is based on a new data model that starts by densely upsampling the LR measurements for an accurate registration.
The organization of the paper is as follows: we formulate the problem of dynamic SR in Section 2. We then provide our key concepts for a robust motion estimation in Section 3. In Section 4, we propose a new data model that leads to a robust dynamic depth SR algorithm. In Section 5, we experimentally compare its performance with state-of-the-art techniques using depth sequences. A conclusion is given in Section 6.
2. PROBLEM FORMULATION
The aim of dynamic SR algorithms is to estimate a sequence of $N$ HR images $\{x_t\}_{t=1}^{N}$ of size $(m \times n)$ from an observed LR sequence $\{y_t\}_{t=1}^{N}$, where each LR image $y_t$ is of size $(m' \times n')$ pixels, with $n = r \cdot n'$ and $m = r \cdot m'$, such that $r$ is the SR factor. Every image $y_t$ may be viewed as an LR, noisy, and deformed realization of $x_{t_0}$ at the acquisition time $t$, with $t_0 \leq t$. Rearranging all images $x_t$ and $y_t$, $t = 1, \cdots, N$, in lexicographic order, i.e., as column vectors of lengths $mn$ and $m'n'$, respectively, we consider the following classical data model:

$$y_t = D H L_t^{t_0} x_{t_0} + n_t, \quad t_0 \leq t \ \text{and} \ t, t_0 \in [1, N] \cap \mathbb{N}, \qquad (1)$$
where $D$ is a matrix of dimension $(m'n' \times mn)$ that represents the downsampling operator, and which we assume to be known and constant over time. The system blur is represented by the time- and space-invariant matrix $H$. The vector $n_t$ is an additive Laplacian noise at time $t$, as justified in [2]. The matrices $L_t^{t_0}$ are $(mn \times mn)$ matrices corresponding to the geometric motion between the considered HR image $x_{t_0}$ and the observed LR image $y_t$ prior to its downsampling.
The dynamic SR problem is simplified by reconstructing one HR image at a time using the full observed sequence. From now on, we fix the reference time to $t_0$ and focus on the reconstruction of $x_{t_0}$ from $\{y_t\}_{t=t_0}^{N}$. The operation may be repeated for $t_0 = 1, \cdots, N$. Based on the data model in (1), and using an $L_1$ norm between the observations and the model, the Maximum Likelihood (ML) estimate of $x_{t_0}$ is obtained as follows:

$$\hat{x}_{t_0} = \arg\min_{x_{t_0}} \sum_{t=t_0}^{N} \left\| D H L_t^{t_0} x_{t_0} - y_t \right\|_1. \qquad (2)$$
Using the same approach as in [2, 17], we consider that $H$ and $L_t^{t_0}$ are block circulant matrices. Therefore:

$$H L_t^{t_0} = L_t^{t_0} H. \qquad (3)$$
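The commutation in (3) holds because circulant matrices are all diagonalized by the same DFT basis; the following quick numerical check illustrates this in the 1-D circulant case (the extension to block circulant matrices is analogous):

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(1)
H = circulant(rng.standard_normal(64))  # a circulant "blur"
L = circulant(rng.standard_normal(64))  # a circulant "warp"
# Any two circulant matrices of the same size commute: HL = LH.
assert np.allclose(H @ L, L @ H)
```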
The minimization in (2) can therefore be decomposed into two steps: the estimation of a blurred HR image $z_{t_0} = H x_{t_0}$, followed by a deblurring step. In what follows, we assume that $y_t$ is simply the noisy and decimated version of $z_t$, without any geometric warp. We may thus write $L_t^{t} = I$, $\forall t$, $I$ being the identity matrix; hence, $L_t^{t_0} z_{t_0} = z_t = H x_t$. This operation can be assimilated to registering $z_{t_0}$ to $z_t$. We draw attention to the fact that in the case of static multi-frame SR, a set of observed LR images is considered instead of a sequence, i.e., there is no order between frames. Such an order becomes crucial in dynamic SR because the estimation of motion, based on the optical flow paradigm, happens between consecutive frames only. An accurate dynamic SR estimation is consequently highly dependent on the accuracy of estimating the registration matrices $L_t^{t-1}$, as well as $L_t^{t_0}$.
In the case of one moving object with a very small translational motion across a few frames, a subpixel motion estimation would be sufficient to guarantee a good HR image. This assumption is no longer valid if the object moves fast or if the scene contains multiple objects moving with different motions. In this case, the SR process becomes more challenging, and a robust registration method based on a dense optical flow is required. Most SR algorithms rely on a registration whose pixel correspondence is too coarse compared to the scale of details in the scene. It is therefore necessary to call upon a very accurate subpixel correspondence. In what follows, we argue that this accuracy is greatly increased after an upsampling of the observed sequence, as presented in Section 3. We accordingly propose a new data formulation for dynamic depth SR and give its corresponding algorithm in Section 4.
3. MOTION ESTIMATION AND REGISTRATION
It has been shown in [16] that higher image resolutions help increase the accuracy of motion estimation, which justifies applying an upsampling framework to obtain higher scale images. Moreover, performing the registration process on upsampled images guarantees a better result, with a higher accuracy, than registering the LR images $y_t$ and upsampling them afterwards. This is due to the fact that registration parameters are approximated by rounding the motion vectors, with an expected error of $\pm\frac{1}{2}$ pixel. The effect of this error is relative to the size of the registered images: the upsampling process reduces it from $\pm\frac{1}{2m}$ in the LR case to $\pm\frac{1}{2rm}$. For instance, with an SR factor $r = 5$, the relative rounding error is reduced fivefold. Hence,
we propose to upsample the observed LR images even before registering them. Due to the specificities of depth data, classical interpolation-based methods (e.g., bicubic) cannot be used, as they lead to jagged values and blurring effects, especially for boundary pixels. Thus, we propose to densely upsample $y_t$, $t = 1, \ldots, N$, up to the size $(m \times n)$ of the super-resolved image. We define the resulting image as:

$$y_t\!\uparrow \; = U \cdot y_t, \qquad (4)$$

where $U$ is a dense upsampling matrix of size $(mn \times m'n')$, which we choose to be the transpose of $D$, s.t. $UD = A$, where $A$ is a block circulant matrix that defines a new blurring matrix $B = AH$. Therefore, we redefine $z_t$ as:

$$z_t = B x_t. \qquad (5)$$
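As a 1-D sketch of the operator algebra in (4): if $D$ keeps every $r$-th sample, its transpose $U = D^{\mathsf{T}}$ performs zero insertion, and $A = UD$ is a 0/1 mask. Note that a literal $D^{\mathsf{T}}$ only inserts zeros; how the inserted slots are densified before flow estimation is left abstract here, since the text above fixes only the algebraic choice $U = D^{\mathsf{T}}$.

```python
import numpy as np

def decimation_matrix(m, r):
    """Explicit D for a length-m signal: keeps every r-th sample."""
    D = np.zeros((m // r, m))
    D[np.arange(m // r), np.arange(0, m, r)] = 1.0
    return D

m, r = 12, 3
D = decimation_matrix(m, r)
U = D.T                      # upsampling operator of (4): zero insertion
y = 10.0 + np.arange(m // r)         # a toy LR depth line
y_up = U @ y                 # length-m, LR samples placed at every r-th slot
A = U @ D                    # 0/1 mask entering the new blur B = A H
assert np.allclose(D @ U, np.eye(m // r))   # D is a left inverse of U
```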
Since the optical flow approach works under the assumption of small motions, frames that are farther from the reference frame $y_{t_0}\!\uparrow$ introduce a higher registration error than those closer to it. They will thus be considered as outliers. The percentage of these outliers is related to two main factors: the speed of the moving objects and the length $N$ of the sequence. For example, a long sequence with a fast moving object would most likely lead to more than 50% of outliers, and the SR process would fail even with a robust estimator of high breakdown value such as the median estimator. To tackle this problem, we herein propose a new registration method based on a cumulative motion compensation.
Considering two consecutive upsampled frames $y_{t-1}\!\uparrow$ and $y_t\!\uparrow$, the optimal registration solution is:

$$\hat{M}_{t-1}^{t} = \arg\min_{M} \Psi\left(y_{t-1}\!\uparrow, \; y_t\!\uparrow, \; M\right), \qquad (6)$$

where $\Psi$ is a dense optical flow-related cost function and

$$y_t\!\uparrow \; = M_{t-1}^{t} \, y_{t-1}\!\uparrow + \, v_t. \qquad (7)$$
The vector $v_t$ contains the innovation, which we assume negligible in this framework. In addition, similarly to [20], and for analytical convenience, we assume that all pixels in $y_t\!\uparrow$ originate from pixels in $y_{t-1}\!\uparrow$ in a one-to-one mapping. Therefore, each row in $M_{t-1}^{t}$ contains a 1 at the position corresponding to the address of the source pixel in $y_{t-1}\!\uparrow$. This bijective property implies that the matrix $\hat{M}_{t-1}^{t}$ is an invertible permutation, s.t. $[\hat{M}_{t-1}^{t}]^{-1} = \hat{M}_{t}^{t-1}$. Furthermore, its estimate leads to the following registration to $y_{t-1}\!\uparrow$:

$$y_t^{t-1}\!\uparrow \; = \hat{M}_{t}^{t-1} \, y_t\!\uparrow. \qquad (8)$$
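Since each $\hat{M}_{t-1}^{t}$ is a permutation, it never needs to be stored as an $(mn \times mn)$ matrix: an index map of source addresses suffices, and the inverse $\hat{M}_{t}^{t-1}$ needed in (8) is a scatter of that map. A sketch under this index-map encoding (our representational assumption, not a detail given in the paper):

```python
import numpy as np

def apply_perm(src, y):
    """(M y)[i] = y[src[i]]: each output pixel copies its source pixel."""
    return y[src]

def invert_perm(src):
    """Index map of M^{-1} (the inverse permutation)."""
    inv = np.empty_like(src)
    inv[src] = np.arange(src.size)
    return inv

src = np.array([2, 0, 3, 1])              # toy M_{t-1}^t on 4 pixels
y = np.array([10.0, 11.0, 12.0, 13.0])
# Registering forward and then backward recovers the original frame.
assert np.allclose(apply_perm(invert_perm(src), apply_perm(src, y)), y)
```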
We then need to define $y_t^{t_0}\!\uparrow$, the registered version of $y_t\!\uparrow$ with respect to the reference $y_{t_0}\!\uparrow$. To that end, we use all the registered upsampled images $y_t\!\uparrow$, as defined in (8), for $t > t_0$. We propose, similarly to our work in [16], a cumulative motion compensation approach with an additional improvement, where we further reduce the cumulated motion error by recomputing $\hat{M}_{t-1}^{t}$ as follows:
$$\hat{M}_{t-1}^{t} = \arg\min_{M} \Psi\left(y_{t-1}^{t_0}\!\uparrow, \; y_t\!\uparrow, \; M\right). \qquad (9)$$
We prove by induction the following registration relationship for non-consecutive frames:

$$y_t^{t_0}\!\uparrow \; = \hat{M}_{t}^{t_0} \, y_t\!\uparrow \; = \underbrace{\hat{M}_{t_0+1}^{t_0} \cdots \hat{M}_{t}^{t-1}}_{(t - t_0)\ \text{times}} \cdot \; y_t\!\uparrow. \qquad (10)$$
Considering the bijection simplification, we further write:

$$\hat{M}_{t}^{t_0} \approx L_{t_0}^{t} = \left[L_t^{t_0}\right]^{-1}. \qquad (11)$$
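In the same index-map encoding as above, the chain product in (10) reduces to repeated array indexing, so the cumulative registration of a frame $(t - t_0)$ steps away from the reference costs a sequence of cheap lookups rather than matrix products. A sketch:

```python
import numpy as np

def compose(src_outer, src_inner):
    """Index map of M_outer @ M_inner:
    (M_outer M_inner y)[i] = y[src_inner[src_outer[i]]]."""
    return src_inner[src_outer]

def cumulative_map(step_maps):
    """Compose per-step maps [M_{t0+1}^{t0}, ..., M_t^{t-1}] into M_t^{t0},
    mirroring the product in (10)."""
    src = step_maps[-1]
    for s in reversed(step_maps[:-1]):
        src = compose(s, src)
    return src

rng = np.random.default_rng(2)
steps = [rng.permutation(6) for _ in range(3)]  # toy per-step permutations
y = np.arange(6, dtype=float)
direct = y
for s in reversed(steps):   # apply M_t^{t-1} first, as in (10)
    direct = direct[s]
assert np.allclose(direct, y[cumulative_map(steps)])
```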
4. PROPOSED ALGORITHM
The subpixel accuracy in motion estimation induced by the combined upsampling and cumulative motion compensation proposed in Section 3 makes it feasible to handle a depth sequence with a moving object without using any prior information on its shape, rigidity, or motion. These advantages extend to the much more complex case of multiple moving objects. Indeed, the textureless nature of depth images categorizes them as images containing gross information only, i.e., with no texture information, as per Mallikarjuna et al.'s composite image model [21]. This property, combined with the impulsive nature of the SR noise $n_t$, suggests that a temporal median estimator is a robust equivalent of the ML formulation in (2).
We reformulate the data model in (1) to introduce the upsampling strategy of Section 3. Combining (1), (3), (4), (5), and (11), we find¹:

$$y_t^{t_0}\!\uparrow \; = z_{t_0} + w_t, \quad t_0 \leq t \ \text{and} \ t, t_0 \in [1, N] \cap \mathbb{N}, \qquad (12)$$

where $w_t = \hat{M}_t^{t_0} \, U \cdot n_t$ is an additive Laplacian noise at time $t$.

¹ A full proof of (12) will be provided in another paper. This work was supported by the National Research Fund, Luxembourg, under the CORE project C11/BM/1204105/FAVE/Ottersten.
The estimation in (2) then becomes:

$$\hat{z}_{t_0} = \arg\min_{z_{t_0}} \sum_{t=t_0}^{N} \left\| z_{t_0} - y_t^{t_0}\!\uparrow \right\|_1, \qquad (13)$$

which corresponds to the pixel-wise temporal median estimator, i.e., $\hat{z}_{t_0} = \mathrm{med}_t \{ y_t^{t_0}\!\uparrow \}_{t=t_0}^{N}$. A simple image deblurring then recovers $\hat{x}_{t_0}$ from $\hat{z}_{t_0}$. We hence propose a new dynamic SR algorithm, corresponding to the presented SR estimation, which we refer to as Upsampling for Precise Super-Resolution (UP-SR), summarized below:
UP-SR algorithm
for $t_0 = 1, \cdots, N$:
    1. Choose the reference frame $y_{t_0}$.
    for $t$, s.t. $t_0 < t \leq N$, do
        2. Compute $y_t\!\uparrow$ using (4).
        3. Estimate the registration matrix $\hat{M}_t^{t_0}$ using (9) and (10).
        4. Compute $y_t^{t_0}\!\uparrow$ using (10).
    end do
    5. Find $\hat{z}_{t_0}$ by applying the median estimator (13).
    6. Deduce $\hat{x}_{t_0}$ by deblurring.
end for
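A compact end-to-end sketch of the UP-SR loop for one reference frame follows, assuming a dense optical flow routine is available. OpenCV's Farnebäck flow stands in for the cost $\Psi$ (the paper does not prescribe a specific flow method), nearest-neighbor interpolation is used as an illustrative dense upsampling, and the final deblurring step is omitted:

```python
import numpy as np
import cv2  # OpenCV; its Farneback flow is a stand-in for the cost Psi

def to_u8(img):
    """8-bit rescaling, since Farneback flow expects 8-bit grayscale input."""
    lo, hi = float(img.min()), float(img.max())
    return np.uint8(255.0 * (img - lo) / max(hi - lo, 1e-9))

def upsample(y, r):
    """Dense upsampling of (4); nearest-neighbor avoids the jagged depth
    edges produced by bicubic interpolation (illustrative choice)."""
    return cv2.resize(y, None, fx=r, fy=r, interpolation=cv2.INTER_NEAREST)

def register(ref, mov):
    """One registration step: warp `mov` onto `ref` with a dense flow."""
    flow = cv2.calcOpticalFlowFarneback(to_u8(ref), to_u8(mov), None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = ref.shape
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    return cv2.remap(mov.astype(np.float32),
                     (gx + flow[..., 0]).astype(np.float32),
                     (gy + flow[..., 1]).astype(np.float32),
                     cv2.INTER_NEAREST)

def up_sr(frames_lr, r, t0=0):
    """UP-SR for reference time t0: upsample every frame, register it
    cumulatively toward frame t0, and take the pixel-wise temporal
    median (13). The deblurring of step 6 is omitted in this sketch."""
    ref = upsample(frames_lr[t0], r)
    registered, prev = [ref], ref
    for y in frames_lr[t0 + 1:]:
        cur = register(prev, upsample(y, r))  # cumulative compensation (9)
        registered.append(cur)
        prev = cur
    return np.median(np.stack(registered), axis=0)  # z_hat of (13)
```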
5. EXPERIMENTS
We test the performance of the proposed UP-SR algorithm on depth data acquired with a ToF camera. SR estimation is particularly suitable for such data, as it suffers from a very low resolution. We start with the simple case of one moving object (a hand) with a translational motion. We compare the performance of the proposed algorithm in two cases: using registered measured LR depth images $y_t^{t_0}$, and using registered densely upsampled depth images $y_t^{t_0}\!\uparrow$, $t_0 < t \leq N$. Results show
that in the latter case (Fig. 1(b)) the registration is more accurate and leads to sharper edges. Relying directly on LR images, however, leads to blurred edges (Fig. 1(a)), necessitating a special treatment or a segmentation step to reduce the artifacts caused by the boundary pixels. This experimentally confirms the benefit of our upsampling strategy. Next, we tested UP-SR on a real sequence of LR depth images containing multiple moving objects. We mounted an LR ToF camera at a 2.5 meter height, looking down at the ground, with two persons sitting on chairs sliding in two different directions. A sequence of 9 LR depth images, of size ($56 \times 61$) pixels, was super-resolved with an SR scale factor $r = 5$ using UP-SR, 2D/depth fusion [13], SKSR [17], and dynamic S&A [1]. Visual results for one frame are given in Fig. 2 (c), (d), (e), and (f), clearly showing that SKSR and dynamic S&A fail badly
Fig. 1. UP-SR results: (a) proposed method using motion estimated from the LR sequence; (b) proposed method using motion estimated from the densely upsampled sequence ($r = 5$).
on depth data, mainly around boundary pixels, while 2D/depth fusion, although computationally efficient, often suffers from strong 2D texture copying in the final super-resolved depth frame. Fig. 2(f) shows the result of UP-SR, where we obtained clear, sharp edges in addition to an efficient removal of noisy pixel values. This is mostly due to the proposed subpixel motion estimation combined with an accurate registration, leading to a successful temporal fusion of the sequence. Finally, in order to provide a quantitative evaluation, we generated an LR depth sequence by downsampling an available HR depth sequence with a factor $r = 4$, and further degraded it with additive white Gaussian noise (AWGN) at signal-to-noise ratios (SNR) of 15, 25, 35, and 45 dB. We quantitatively compare our proposed algorithm with SKSR and dynamic S&A. We tested these methods using the corresponding software provided in [18] and [19]. Since we have a known ground
truth $\{x_t\}_{t=1}^{N}$, we measure the quality of an estimated HR depth frame $\hat{x}_{t_0}$ using the peak SNR (PSNR), defined as:

$$\mathrm{PSNR} = 10 \log_{10} \frac{m \times n}{\left\| x_{t_0} - \hat{x}_{t_0} \right\|^2}.$$
The obtained results show the superiority of the UP-SR algorithm: it provides the best results among the discussed state-of-the-art SR methods across all noise levels. As illustrated in Fig. 3, it is not surprising to see that the results remain good even for a very high noise level (SNR = 15 dB). This is due to the key components of UP-SR, namely, its subpixel motion estimation and accurate multi-frame registration, combined with a robust median filtering that matches the textureless property of depth data. Therefore, our algorithm yields good quality depth images without having to call upon an additional regularization step. Such a treatment could not be applied to 2D images, where the results would be blurry with lost details. This may be explained by the fact that depth data generally fall under the model in (12).
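For reference, the PSNR as defined above normalizes by the number of pixels rather than by a peak intensity; a direct transcription:

```python
import numpy as np

def psnr(x_ref, x_est):
    """PSNR of Section 5: 10 log10( (m*n) / ||x - x_hat||^2 )."""
    err = float(np.sum((x_ref - x_est) ** 2))
    return 10.0 * np.log10(x_ref.size / err)
```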
Fig. 2. UP-SR example of a dynamic depth scene ($r = 5$): (a) 2D image corresponding to the last frame. (b) Last frame of 9 LR ($56 \times 61$) depth images. (c) 2D/depth fusion [13]. (d) SKSR [17]. (e) Dynamic S&A [1]. (f) Proposed UP-SR.

Fig. 3. PSNR for different SR methods applied to the moving hand sequence ($r = 4$).

6. CONCLUSION

A new algorithm has been presented to enhance the quality of LR depth videos of dynamic scenes containing one or multiple moving objects. This algorithm is based on the SR framework without strong constraints on objects' shape or motion. It takes advantage of the textureless nature of depth data to achieve a robust SR estimation after densely upsampling the LR frames. Experimental results on both synthetic and real ToF depth images showed that this new approach to SR, although conceptually simple, provides more accurate motion estimation, and thereby greatly outperforms existing methods such as fusion-based techniques, SKSR, and dynamic S&A.
7. REFERENCES
[1] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, "Advances and Challenges in Super-Resolution", International Journal of Imaging Systems and Technology, vol. 14, pp. 47-57, 2004.
[2] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, "Fast and Robust Multi-Frame Super-Resolution", IEEE TIP, vol. 13, pp. 1327-1344, 2004.
[3] Z. Lin and H. Shum, "Fundamental Limits of Reconstruction-Based Superresolution Algorithms under Local Translation", IEEE PAMI, vol. 26, no. 1, Jan. 2004.
[4] O. M. Aodha, N. Campbell, A. Nair, and G. Brostow, "Patch Based Synthesis for Single Depth Image Super-Resolution", ECCV 2012.
[5] D. Glasner, S. Bagon, and M. Irani, "Super-Resolution from a Single Image", ICCV 2009.
[6] R. Hardie, T. Tuinstra, K. Barnard, J. Bognar, and E. Armstrong, "High Resolution Image Reconstruction from Digital Video with In-Scene Motion", ICIP 1997, pp. 153-156.
[7] S. Farsiu, M. Elad, and P. Milanfar, "Video-to-Video Dynamic Super-Resolution for Grayscale and Color Sequences", EURASIP Journal on Applied Signal Processing, vol. 2006, ID 61859.
[8] A. W. M. van Eekeren, K. Schutte, J. Dijk, D.-J. de Lange, and L. J. van Vliet, "Super-Resolution on Moving Objects and Background", ICIP 2006, pp. 2709-2712.
[9] A. W. M. van Eekeren, K. Schutte, and L. J. van Vliet, "Multiframe Super-Resolution Reconstruction of Small Moving Objects", IEEE TIP, vol. 19, pp. 2901-2912, 2010.
[10] J. Diebel and S. Thrun, "An Application of Markov Random Fields to Range Sensing", NIPS 18, pp. 291-298, 2006.
[11] J. Kopf, M. Cohen, D. Lischinski, and M. Uyttendaele, "Joint Bilateral Upsampling", ACM TOG, vol. 26, no. 3, 2007.
[12] Q. Yang, R. Yang, J. Davis, and D. Nister, "Spatial-Depth Super Resolution for Range Images", CVPR 2007.
[13] F. Garcia, D. Aouada, B. Mirbach, T. Solignac, and B. Ottersten, "Real-time Hybrid ToF Multi-Camera Rig Fusion System for Depth Map Enhancement", IEEE CVPRW, pp. 1-8, Jun. 2011.
[14] X. Xiang, G. Li, J. Tong, and Z. Pan, "Fast and Simple Super Resolution for Range Data", IEEE Conf. on Cyberworlds (CW), pp. 319-324, Oct. 2010.
[15] F. Garcia, D. Aouada, B. Mirbach, and B. Ottersten, "Spatio-Temporal ToF Data Enhancement by Fusion", ICIP 2012, pp. 981-984.
[16] L. Xu, J. Jia, and S. B. Kang, "Improving Sub-Pixel Correspondence through Upsampling", CVIU, vol. 116, no. 2, pp. 250-261, Feb. 2012.
[17] H. Takeda, P. Milanfar, M. Protter, and M. Elad, "Super-Resolution without Explicit Subpixel Motion Estimation", IEEE TIP, vol. 18, no. 9, Sep. 2009.
[18] http://users.soe.ucsc.edu/htakeda/SpaceTimeSKR.htm
[19] http://users.soe.ucsc.edu/milanfar/software/superresolution.html
[20] M. Elad and A. Feuer, "Super-Resolution Reconstruction of Continuous Image Sequence", IEEE PAMI, vol. 21, no. 9, pp. 817-834, Sep. 1999.
[21] H. S. Mallikarjuna and L. F. Chaparro, "Iterative Composite Filtering for Image Restoration", IEEE PAMI, vol. 14, no. 6, pp. 674-678, Jun. 1992.
The need for precise (subpixel accuracy) motion estimates in conventional super-resolution has limited its applicability to only video sequences with relatively simple motions such as global translational or affine displacements. In this paper, we introduce a novel framework for adaptive enhancement and spatiotemporal upscaling of videos containing complex activities without explicit need for accurate motion estimation. Our approach is based on multidimensional kernel regression, where each pixel in the video sequence is approximated with a 3-D local (Taylor) series, capturing the essential local behavior of its spatiotemporal neighborhood. The coefficients of this series are estimated by solving a local weighted least-squares problem, where the weights are a function of the 3-D space-time orientation in the neighborhood. As this framework is fundamentally based upon the comparison of neighboring pixels in both space and time, it implicitly contains information about the local motion of the pixels across time, therefore rendering unnecessary an explicit computation of motions of modest size. The proposed approach not only significantly widens the applicability of super-resolution methods to a broad variety of video sequences containing complex motions, but also yields improved overall performance. Using several examples, we illustrate that the developed algorithm has super-resolution capabilities that provide improved optical resolution in the output, while being able to work on general input video with essentially arbitrary motion.