High Quality Depth Map Upsampling for 3D-TOF Cameras
Jaesik Park†  Hyeongwoo Kim†∗  Yu-Wing Tai†  Michael S. Brown§  Inso Kweon†
†Korea Advanced Institute of Science and Technology (KAIST)
§National University of Singapore (NUS)
Abstract
This paper describes an application framework to per-
form high quality upsampling on depth maps captured from
a low-resolution and noisy 3D time-of-flight (3D-ToF) cam-
era that has been coupled with a high-resolution RGB cam-
era. Our framework is inspired by recent work that uses
nonlocal means filtering to regularize depth maps in order
to maintain fine detail and structure. Our framework ex-
tends this regularization with an additional edge weighting
scheme based on several image features based on the ad-
ditional high-resolution RGB input. Quantitative and qual-
itative results show that our method outperforms existing
approaches for 3D-ToF upsampling. We describe the com-
plete process for this system, including device calibration,
scene warping for input alignment, and even how the results
can be further processed using simple user markup.
1. Introduction
Active 3D time-of-flight (3D-ToF) cameras are becom-
ing a popular alternative to stereo-based range sensors.
Such 3D-ToF cameras use active sensing to capture 3D
range data at frame-rate as a per-pixel depth. A light source
from the camera emits a near-infrared wave which is then
reflected by the scene and is captured by a dedicated sen-
sor. Depending on the distance of the objects in the scene,
the captured light wave is delayed in phase compared to
the original emitted light wave. By measuring the phase
delay, the distance at each pixel can be estimated. The res-
olution of the depth map captured by 3D-ToF cameras is
relatively low; typically less than 1/4th the resolution of a
standard definition video camera. In addition, the captured
depth maps are often corrupted by significant amounts of
noise.
The goal of this paper is to estimate a high-quality, high-
resolution depth map from the 3D-ToF camera through upsampling
in the face of sensor noise. To aid this procedure, an aux-
iliary high-resolution conventional camera is coupled with
∗The first and second authors contributed equally to this
work.
Figure 1. (a) Low-resolution depth map (enlarged using nearest
neighbor upsampling), (b) high-resolution RGB image, (c) result
from [19], (d) our result. User scribble areas (blue) and the addi-
tional depth sample (red) are highlighted. The dark areas in (c) are
the areas without depth samples after registration. Full resolution
comparisons are provided in the supplemental materials.
the 3D-ToF camera to synchronously capture the scene. Re-
lated work [19, 4, 7] that also uses coupled device setups for
depth map upsampling has focused primarily on image fil-
tering techniques such as joint bilateral filtering [8, 12] or
its variations. Such filtering techniques can often over-smooth
results, especially in areas of fine structure.
We formulate the depth map upsampling problem using
constrained optimization. Our approach is inspired by the
recent success of nonlocal means regularization for depth
map construction from depth-from-defocus [9]. In particu-
lar, we describe how to formulate the problem into a least-
squares optimization that combines nonlocal means regu-
larization together with an edge weighting scheme that fur-
ther reinforces fine details. We also employ scene warping
to better align the low-resolution imagery to the auxiliary
camera input. While this work is more applied in nature,
the result is a system that is able to produce high-quality
upsampled depth maps superior in quality to prior work.
In addition, our approach can be easily extended to incor-
porate simple user markup to correct errors along disconti-
nuity boundaries without explicit image segmentation (e.g.
Figure 1).
2. Related Work
Previous work on depth map upsampling can be clas-
sified as either image fusion techniques that combine the
low-resolution depth map with the high-resolution image or
super-resolution techniques that merge multiple misaligned
low-resolution depth maps. Our approach falls into the
first category of image fusion which is the focus of the re-
lated work presented here. Image fusion approaches assume
there exists a joint occurrence between depth discontinuities
and image edges and that regions of homogeneous color have
similar 3D geometry [22,16]. Representative image fusion
approaches include [6,19,4,7]. In [6], Diebel and Thrun
performed upsampling using an MRF formulation with the
data term computed from the depth map and weights of the
smoothness terms between estimated high-resolution depth
samples derived from the high-resolution image. Yang et
al. [19] used joint bilateral filtering [8,12] to interpolate
the high-resolution depth values. Since filtering can often
over smooth the interpolated depth values, especially along
the depth discontinuity boundaries, they quantized the depth
values into several discrete layers. This work was later ex-
tended by [21] to use a stereo camera for better disconti-
nuity detection in order to avoid over smoothing of depth
boundaries. Chan et al. [4] introduced a noise-aware bi-
lateral filter that decides how to blend between the results
of standard upsampling or joint bilateral filtering depend-
ing on the depth map’s regional statistics. Dolson et al. [7]
also used a joint bilateral filter scheme, however, their ap-
proach includes additional time stamp information to main-
tain temporal coherence for the depth map upsampling in
video sequences.
The advantage of these bilateral filtering techniques is
they can be performed quickly; e.g. Chan et al. [4] reported
near real-time speeds using a GPU implementation. How-
ever, the downside is that they can still over-smooth fine
details. Work by [14] proposed a joint global mode fil-
ter based on global image histograms of the low-resolution
depth and high-resolution image. Our approach is more re-
lated to Diebel and Thrun [6] in that we formulate the prob-
lem using an MRF optimization scheme. However, our ap-
proach incorporates a nonlocal means (NLM) term in the
MRF to help preserve local structure. This additional NLM
term was inspired by the recent work by Favaro [9] which
demonstrated that NLM filtering is useful for maintaining fine
details even with noisy input data. Work in [11] has also
used the NLM to fuse the 3D point cloud and 2D image to
enhance the density of 3D points. We also include an addi-
tional weighting scheme based on several image-derived fea-
tures to further reinforce the preservation of fine detail. In
addition, we perform a warping step to better align the low-resolution and high-resolution input.

Figure 2. (a) Our imaging setup uses a 3D-ToF camera which captures images at 176×144 resolution, synchronized with a 1280×960 resolution RGB camera. (b) Our calibration configuration, which uses a planar calibration pattern with holes to allow the 3D-ToF camera to be calibrated.

Our experimental results on ground truth data show that our application framework can outperform existing techniques for the majority of scenes with various upsampling factors. Since our goal is high-quality depth maps, the need for some manual cleanup of the machine-vision input is unavoidable. Another advantage of our approach is that it can easily incorporate user markup to improve the results.
3. System Setup and Preprocessing
In this section, we describe our system and the prepro-
cessing step to register the 3D-ToF camera and conventional
camera and to perform an initial outlier rejection on the 3D-
ToF input.
3.1. System Configuration
Figure 2(a) shows our hardware configuration consisting
of a 3D-ToF camera and a high-resolution RGB camera. For
the depth camera, we use the SwissRanger™ SR4000 [1],
which captures a 176×144 depth map. For the RGB cam-
era, we use the Point Grey Research Flea RGB camera with
a resolution of 1280×960 pixels. Since the data captured
from the two cameras have slightly different viewpoints, we
need to register the cameras according to the depth values
from the low-resolution depth map.
3.2. Depth Map Registration
Let $\mathbf{X}_d = (X, Y, Z, 1)^T$ be a 3D homogeneous coordinate acquired by the 3D-ToF camera, and $\mathbf{x}_c = (u, v, 1)^T$ be the 2D homogeneous coordinate in the high-resolution RGB image. We can compute the projection of $\mathbf{X}_d$ onto $\mathbf{x}_c$ by:

$s\,\mathbf{x}_c = K\,[R\,|\,\mathbf{t}]\,\mathbf{X}_d, \qquad (1)$

where $s$ is a scale factor, $K$ contains the intrinsic parameters of the RGB camera, and $R$ and $\mathbf{t}$ are the rotation matrix and translation vector describing the relative pose of the RGB camera and the depth camera with respect to the 3D world coordinate frame.
To calibrate the two cameras’ parameters, we use the cal-
ibration method introduced by Zhang [20]. Since the 3D-
ToF camera cannot capture textures, we instead use a pla-
nar calibration pattern consisting of holes for our purpose
(Figure 2(b)). This unique calibration pattern allows us to
detect the positions on the planar surface that are observed
by the 3D-ToF camera. After camera calibration, for any
point, xt, on the low-resolution depth map with depth value
dt, we can compute its corresponding position in the high-
resolution RGB image by the following equation:
$s\,\mathbf{x}_c = K_c\,[R\,|\,\mathbf{t}]\,P_t^{-1}\,[\mathbf{x}_t d_t \;\; 1]^T, \qquad (2)$

where $P_t$ is the $4 \times 4$ projective transformation converting the world coordinate $\mathbf{X}_d$ into the local coordinate of the 3D-ToF camera. We obtain the scaling term $s$ by calcu-
lating the relative resolution between the depth camera and
the RGB camera. Since the depth map from the depth cam-
era is noisy, we impose a neighborhood smoothness regu-
larization using thin-plate splines to forward map the low-
resolution depth map to the high-resolution image.
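For illustration, the following Python sketch forward-maps a ToF pixel with its measured depth into the RGB image following the projection of Equations (1)-(2). It is not the authors' code; the intrinsic matrices, extrinsics, and the helper names backproject_tof and project_to_rgb are assumptions made for this example, and the thin-plate-spline regularization step is omitted.

```python
# Illustrative sketch (not the authors' implementation) of the ToF-to-RGB mapping.
import numpy as np

def backproject_tof(u, v, d, K_t):
    """Back-project ToF pixel (u, v) with metric depth d into the ToF camera frame."""
    x = (u - K_t[0, 2]) * d / K_t[0, 0]
    y = (v - K_t[1, 2]) * d / K_t[1, 1]
    return np.array([x, y, d])

def project_to_rgb(X_tof, K_c, R, t):
    """Project a 3D point in the ToF frame into RGB pixel coordinates (Eqs. 1-2)."""
    X_rgb = R @ X_tof + t            # rigid transform between the two cameras
    x = K_c @ X_rgb                  # pinhole projection
    return x[:2] / x[2]              # divide out the scale factor s

# Example with assumed (made-up) calibration values:
K_t = np.array([[250.0, 0.0, 88.0], [0.0, 250.0, 72.0], [0.0, 0.0, 1.0]])      # ToF intrinsics
K_c = np.array([[1050.0, 0.0, 640.0], [0.0, 1050.0, 480.0], [0.0, 0.0, 1.0]])  # RGB intrinsics
R, t = np.eye(3), np.array([0.05, 0.0, 0.0])                                   # assumed extrinsics
uv_rgb = project_to_rgb(backproject_tof(100, 60, 1.8, K_t), K_c, R, t)
```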
3.3. Outlier Detection
The depth map from the 3D-ToF camera contains depth
edges that are blurred by mixing the depth values of two dif-
ferent depth layers along depth boundaries. These blurred
depth boundaries are unreliable and should be removed be-
fore upsampling. For each pixel in the low-resolution depth
map, we compare the depth value of a pixel to the local
maximum depth and the local minimum depth within a
small local window (e.g. 9×9) in the low-resolution depth
map. The contrast between the local maximum and mini-
mum depth determines whether this local window contains
two different depth layers. If the depth value of a pixel is at
the middle of the two depth layers, we consider this pixel
as a boundary pixel. Since the input depth map is noisy, we
use an MRF [3] to clean up the noisy estimation as:

$E(l) = \sum_p \Big( O_p(l) + \lambda_{pq} \sum_{q \in N(p)} O_{pq}(l) \Big), \qquad (3)$

where $l \in \{0, 1\}$ is a binary label map indicating whether a pixel is an outlier or not, $O_p(l)$ is the data term defined by the extent of contrast within a small window, and $O_{pq}(l)$ is the smoothness term defined by the Hamming distance between $l_p$ and its neighbor $l_q$. This simple outlier rejection step
is performed on each input frame captured by the 3D-ToF
camera.
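A sketch of the boundary test described above is given below. It illustrates only the per-pixel contrast test (the data-term side); the MRF cleanup of Equation (3) via graph cuts is omitted, and the window size, contrast threshold, and mid-band fraction are assumed values.

```python
# Illustrative sketch of detecting blurred depth-boundary pixels (not the authors' code).
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def flag_boundary_pixels(depth, win=9, contrast_thresh=0.15, band=0.25):
    d_max = maximum_filter(depth, size=win)
    d_min = minimum_filter(depth, size=win)
    contrast = d_max - d_min                      # large where two depth layers meet
    # a pixel sitting in the middle of the two layers is treated as a boundary pixel
    mid = (depth > d_min + band * contrast) & (depth < d_max - band * contrast)
    return (contrast > contrast_thresh) & mid     # candidate outliers before MRF cleanup
```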
4. Optimization Framework
This section describes our optimization framework for
upsampling the low-resolution depth map given the aligned
sparse depth samples and the high-resolution RGB image.
Figure 3. Comparison of our result without (a) and with (b) the
NLM term. The same weighting scheme proposed in Section 4.2
is used for both (a) and (b). Although the usage of NLM does
not significantly affect the RMS error, it is important in generating
high quality depth maps especially along thin structure elements.
Similar to the previous image fusion approaches [6,19,7],
we assume there are co-occurrences of depth boundaries
and image boundaries.
4.1. Objective Function
We define the objective function for depth map upsam-
pling as follows:
$E(D) = E_d(D) + \lambda_s E_s(D) + \lambda_N E_{NLM}(D), \qquad (4)$

where $E_d(D)$ is the data term, $E_s(D)$ is the neighborhood smoothness term, and $E_{NLM}(D)$ is an NLM regularization term. The terms $\lambda_s$ and $\lambda_N$ are relative weights that balance the energy between the three terms. Note that the smoothness term and NLM term could be combined into a single term; however, we keep them separate here for the sake of clarity.
Our data term is defined according to the initial sparse depth map:

$E_d(D) = \sum_{p \in G} (D(p) - G(p))^2, \qquad (5)$

where $G$ is the set of pixels that have an initial depth value. Our smoothness term is defined as:

$E_s(D) = \sum_p \sum_{q \in N(p)} w_{pq} (D(p) - D(q))^2, \qquad (6)$

where $N(p)$ is the first-order neighborhood of $p$, and $w_{pq}$ is the confidence weighting which will be detailed in the following section. Combining Equation (5) and Equation (6)
forms a quadratic objective function which is similar to the
objective function in [13]. The work in [13] was designed
to propagate sparse color values to a gray high-resolution
image, which is similar in nature to our problem of propa-
gating sparse depth values to the high-resolution RGB im-
age.
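Because Equations (4)-(6) form a sparse quadratic objective, the upsampled depth map can be obtained by solving a sparse linear system. The sketch below illustrates this for the data and smoothness terms only; the NLM term of Equation (7) adds rows in the same fashion. It is not the authors' MATLAB implementation, and the precomputed weight arrays and the λs value are assumptions.

```python
# Minimal sketch of solving the quadratic objective (Eqs. 4-6) as a sparse least-squares problem.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def solve_depth(sparse_depth, mask, w_right, w_down, lam_s=0.2):
    """sparse_depth, mask: HxW initial depth samples and their validity mask.
    w_right, w_down: HxW confidence weights w_pq to the right/down neighbor."""
    H, W = sparse_depth.shape
    n = H * W
    idx = np.arange(n).reshape(H, W)

    # Smoothness rows: sqrt(lam_s * w_pq) * (D(p) - D(q)) = 0   (Eq. 6)
    p_ids = list(idx[:, :-1].ravel()) + list(idx[:-1, :].ravel())
    q_ids = list(idx[:, 1:].ravel()) + list(idx[1:, :].ravel())
    w = list(w_right[:, :-1].ravel()) + list(w_down[:-1, :].ravel())
    p_ids, q_ids = np.array(p_ids), np.array(q_ids)
    w_sqrt = np.sqrt(lam_s * np.array(w))

    # Data rows: D(p) = G(p) for pixels with an initial depth sample   (Eq. 5)
    data_idx = idx[mask]
    m_s, m_d = len(p_ids), len(data_idx)
    rows = np.concatenate([np.arange(m_s), np.arange(m_s), m_s + np.arange(m_d)])
    cols = np.concatenate([p_ids, q_ids, data_idx])
    vals = np.concatenate([w_sqrt, -w_sqrt, np.ones(m_d)])
    A = sp.coo_matrix((vals, (rows, cols)), shape=(m_s + m_d, n)).tocsr()
    b = np.concatenate([np.zeros(m_s), sparse_depth[mask]])

    D = spsolve(A.T @ A, A.T @ b)    # normal equations of the least-squares problem
    return D.reshape(H, W)
```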
The difference between our method and that of [13] is the
definition of wpq . Work in [13] defined wpq using intensity
difference between the first order neighborhood pixels to
preserve discontinuities. We further combine segmentation,
color information, and edge saliency as well as the bicubic
Figure 4. (a) Low-resolution depth map (enlarged using nearest neighbor upsampling). (b) High-resolution RGB image. (c) Color
segmentation by [17]. (d) Edge saliency map. (e) Guided depth map by using bicubic interpolation of (a). (f) Our upsampling result
without the guided depth map weighting, depth bleeding occurred in highly textured regions. (g) Our upsampling result with guided depth
map weighting. (h) Ground truth. We subsampled the depth value of a dataset from Middlebury to create the synthetic low-resolution depth
map. The magnification factor in this example is 5×. The sum of squared differences (SSD) between (f) and (g) compared to the ground
truth are 31.66 and 24.62, respectively. Note that the depth bleeding problem in highly textured regions has been improved.
upsampled depth map to define wpq . The reason for this
is that we find the first order neighborhood does not prop-
erly consider the image structure. As a result, propagated
color information in [13] was often prone to bleeding errors
near fine detail. In addition, we include a NLM regulariza-
tion term, which protects the thin structures by allowing the
pixels on the same nonlocal structure to reinforce each other
within a larger neighborhood. We define the NLM regular-
ization term using an anisotropic structural-aware filter [5]:

$E_{NLM}(D) = \sum_p \sum_{q \in A(p)} \kappa_{pq} (D(p) - D(q))^2, \qquad (7)$

where $A(p)$ is a local window (e.g. $11 \times 11$) in the high-resolution image and $\kappa_{pq}$ is the weight of the anisotropic structural-aware filter defined as:

$\kappa_{pq} = \frac{1}{2}\Big(\exp(-(p-q)^T \Sigma_p^{-1} (p-q)) + \exp(-(p-q)^T \Sigma_q^{-1} (p-q))\Big), \quad \Sigma_p = \frac{1}{|A|} \sum_{s \in A(p)} \nabla I(s) \nabla I(s)^T. \qquad (8)$

Here, $\nabla I(p) = (\nabla_x I(p), \nabla_y I(p))^T$ is the x- and y-image gradient vector at $p$, and $I$ is the high-resolution color image. The term $\Sigma_q$ is defined similarly to $\Sigma_p$. This anisotropic structural-aware filter measures how likely $p$ and $q$ are to lie on the same structure in the high-resolution RGB image, i.e. if $p$ and $q$ are on the same structure, $\kappa_{pq}$ will be large. This NLM filter essentially allows similar pixels to reinforce each other even if they are not first-order neighbors. To maintain the sparsity of the linear system, we remove neighborhood entries with $\kappa_{pq} < t$. A comparison showing the effectiveness of the NLM regularization is given in Figure 3.
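The structure-tensor weight of Equation (8) can be sketched as follows. The gradient operator, window size, and the small regularizer eps (added so that Σ stays invertible in flat regions) are assumptions not stated in the paper.

```python
# Illustrative sketch of the anisotropic structural-aware weight kappa_pq (Eq. 8).
import numpy as np
from scipy.ndimage import convolve

def structure_tensors(gray, win=11, eps=1e-4):
    """Per-pixel 2x2 structure tensor Sigma_p averaged over a win x win window."""
    gy, gx = np.gradient(gray.astype(np.float64))
    box = np.ones((win, win)) / float(win * win)
    Sxx = convolve(gx * gx, box)
    Sxy = convolve(gx * gy, box)
    Syy = convolve(gy * gy, box)
    Sigma = np.stack([np.stack([Sxx + eps, Sxy], axis=-1),
                      np.stack([Sxy, Syy + eps], axis=-1)], axis=-2)   # H x W x 2 x 2
    return Sigma

def kappa(p, q, Sigma):
    """Weight kappa_pq between pixel coordinates p = (y, x) and q = (y', x')."""
    d = np.array([p[0] - q[0], p[1] - q[1]], dtype=np.float64)
    quad_p = d @ np.linalg.inv(Sigma[p]) @ d
    quad_q = d @ np.linalg.inv(Sigma[q]) @ d
    return 0.5 * (np.exp(-quad_p) + np.exp(-quad_q))
```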
4.2. Confidence Weighting
In this section, we describe our confidence weighting
scheme for defining the weights wpq in Equation (6). The
value of wpq defines the spatial coherence of neighboring
pixels: the larger wpq is, the more likely the two neigh-
boring pixels have the same depth value. Our confi-
dence weighting is decomposed into four terms based on
color similarity (wc), segmentation (ws), edge saliency
(we), and a guided bicubic-interpolated depth map (wd).
The color similarity term is defined in the YUV color space as follows:

$w_c = \exp\Big(-\sum_{I \in \{Y, U, V\}} \frac{(I(p) - I(q))^2}{2\sigma_I^2}\Big), \qquad (9)$

where $\sigma_I$ controls the relative sensitivity of the different color channels.
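A minimal sketch of this color-similarity weight for one neighboring pixel pair follows; the RGB-to-YUV conversion matrix and the σI value are assumptions.

```python
# Sketch of the YUV color-similarity weight w_c (Eq. 9); conversion matrix and sigma_I assumed.
import numpy as np

RGB2YUV = np.array([[0.299, 0.587, 0.114],
                    [-0.147, -0.289, 0.436],
                    [0.615, -0.515, -0.100]])

def w_color(rgb_p, rgb_q, sigma_I=0.1):
    yuv_p, yuv_q = RGB2YUV @ rgb_p, RGB2YUV @ rgb_q      # per-pixel RGB values in [0, 1]
    return np.exp(-np.sum((yuv_p - yuv_q) ** 2) / (2.0 * sigma_I ** 2))
```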
Our second term is defined based on color segmentation
using the library provided in [17] to segment an image into
super pixels as shown in Figure 4(c). For the neighborhood
pixels that are not within the same super pixel, we give a
penalty term defined as:
$w_s = \begin{cases} 1 & \text{if } S_{co}(p) = S_{co}(q) \\ t_{se} & \text{otherwise} \end{cases} \qquad (10)$

where $S_{co}(\cdot)$ is the segmentation label and $t_{se}$ is a penalty factor with a value between 0 and 1. In our implementation, we empirically set it to 0.7.
Inspired by [2], we have also included a weight which
depends on the edge saliency response. Different from the
color similarity term, the edge saliency responses are de-
tected by a set of Gabor filters with different sizes and ori-
entations. The edge saliency map contains image struc-
tures rather than just color differences between neighbor-
hood pixels. We combine the responses of different Ga-
bor filters to form the edge saliency map as shown in Fig-
ure 4(d). Our weighting is computed as:

$w_e = \frac{1}{\sqrt{s_x(p)^2 + s_x(q)^2 + 1}}, \qquad (11)$

where $s_x(\cdot)$ is the value of the x-axis edge saliency map when $p$ and $q$ are x-axis neighbors.
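A sketch of building such an edge-saliency map from a small Gabor bank and evaluating the weight of Equation (11) is given below. The filter sizes, orientations, and the remaining Gabor parameters are assumptions, since the paper does not list them.

```python
# Illustrative edge-saliency map from a Gabor filter bank (parameters assumed), plus w_e (Eq. 11).
import cv2
import numpy as np

def edge_saliency(gray, ksizes=(9, 17), thetas=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    s = np.zeros_like(gray, dtype=np.float64)
    for k in ksizes:
        for th in thetas:
            kern = cv2.getGaborKernel((k, k), sigma=k / 3.0, theta=th,
                                      lambd=k / 2.0, gamma=0.5)
            s = np.maximum(s, np.abs(cv2.filter2D(gray.astype(np.float64), -1, kern)))
    return s / (s.max() + 1e-8)

def w_edge(s_p, s_q):
    # close to 1 away from salient edges, small across strong structures
    return 1.0 / np.sqrt(s_p ** 2 + s_q ** 2 + 1.0)
```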
Allowing the depth values to propagate freely with only
very sparse data constraint can lead to severe depth bleed-
ing. Here, we introduce the guided depth map to resolve
(a) (b) (c)
(d) (e) (f)
Figure 5. Depth map refinement via user markup. (a)(d) Color
image of small scale structure. (b)(e) Upsampled depth map before
user correction. The user scribble areas in (b), and the user added
depth samples in (e) are indicated by the yellow lines and dots
respectively. (c)(f) Refined depth maps.
this problem. The guided depth map weighting is similar
to the intensity weighting in a bilateral filter. Since we do
not have a depth sample at each high-resolution pixel loca-
tion, we use bicubic interpolation to obtain the guided depth
map, Dg, as shown in Figure 4(e). Similar to the bilateral
filter, we define the guided depth map weighting as follows:

$w_d = \exp\Big(-\frac{(D_g(p) - D_g(q))^2}{2\sigma_g^2}\Big). \qquad (12)$
Combining the weights defined in Equation (9) through Equation (12) by multiplication, we obtain the weight $w_{pq} = w_s w_c w_e w_d$. Note that, except for the edge saliency term, all the weightings defined in this subsection can also be applied to the weighting $\kappa_{pq}$ of the NLM regularization term via multiplication.
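The remaining weights and their combination can be sketched as follows. The segmentation penalty tse = 0.7 follows the text, while the σg value is an assumption; the guided depth map Dg itself would come from bicubic interpolation of the sparse samples (e.g. scipy.ndimage.zoom with order 3).

```python
# Sketch of the guided-depth weight w_d (Eq. 12), the segmentation weight w_s (Eq. 10),
# and the combined confidence weight w_pq = w_s * w_c * w_e * w_d; sigma_g is assumed.
import numpy as np

def w_guided(dg_p, dg_q, sigma_g=0.05):
    return np.exp(-(dg_p - dg_q) ** 2 / (2.0 * sigma_g ** 2))

def w_segment(label_p, label_q, t_se=0.7):
    return 1.0 if label_p == label_q else t_se

def combined_weight(w_s, w_c, w_e, w_d):
    return w_s * w_c * w_e * w_d
```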
4.3. User Adjustments
Since the goal is high-quality upsampling, it is inevitable
that some depth frames are going to require user touch up,
especially if the data is intended for media related applica-
tions. Our approach allows easy user corrections by direct
manipulation of the weighting term wpq or by adding addi-
tional sparse depth samples for error correction.
For the manipulation of the weighting term, we allow
the user to draw scribbles along fuzzy image boundaries,
or along the boundaries where the image contrast is low.
These fuzzy boundaries or low contrast boundaries repre-
sent difficult regions for segmentation and edge saliency
detection. As a result, they cause depth bleeding in the re-
constructed high-resolution depth map as illustrated in Fig-
ure 5(b). Within the scribble areas, we compute an alpha
matte based on the work by Wang et al. [18] for the two
different depth layers. An additional weighting term will
be added according to the estimated alpha values within the
[Figure 6(d) plot: PSNR (dB, roughly 35-42) versus upsampling scale (2×, 4×, 8×, 16×) for the color, edge, segment, and depth weighting terms individually and for their combination.]
Figure 6. A synthetic example for self-evaluation of our weighting terms. (a)(b) A synthetic image pair consisting of a high-resolution color image and a low-resolution depth image. (c) Our 4× upsampled depth map with the combined weighting term. (d) PSNR accuracy of the results with the combined weighting term and with each weighting term individually. The combined weighting term consistently produces the best results across different upsampling scales.
scribble areas. For two pixels $p$ and $q$ within the scribble areas, if they belong to the same depth layer, they should have the same or similar alpha value. Hence, our additional weighting term encoding this additional depth discontinuity information is defined as:

$\exp\Big(-\frac{(\alpha(p) - \alpha(q))^2}{2\sigma_\alpha^2}\Big), \qquad (13)$

where $\alpha(\cdot)$ denotes the estimated alpha values within the scribble areas. Figure 5(c) shows the effect of adding this alpha
weighting term. The scribble areas are indicated by the yel-
low lines in Figure 5(b).
Our second type of user correction allows the user to
draw or remove depth samples on the high-resolution depth
map directly. When adding a depth sample, the user can
simply pick a depth value from the computed depth map
and then assign this depth value to locations where depth
samples are “missing”. After adding the additional depth
samples, our algorithm generates the new depth map using
the new depth samples as a hard constraint in Equation (4).
The second row of Figure 5 shows an example of this user
correction. Note that for image filtering techniques, such
depth sample correction can be more complicated to incor-
porate since the effect of new depth samples can be filtered
by the original depth sample within a large local neighbor-
hood. Removal of depth samples can also cause a hole in
the result of image filtering techniques.
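One simple way to realize such a hard constraint is to give the user-added samples a very large data weight in the least-squares system; this is an assumption about the mechanism rather than the authors' exact implementation.

```python
# Sketch of folding user-added depth samples into the data term as (near-)hard constraints.
import numpy as np

def add_user_samples(sparse_depth, mask, data_weight, user_points, hard_weight=1e6):
    """user_points: list of (y, x, depth) values picked by the user on the HR depth map."""
    for (y, x, d) in user_points:
        sparse_depth[y, x] = d
        mask[y, x] = True
        data_weight[y, x] = hard_weight   # very large weight pins the solution at this pixel
    return sparse_depth, mask, data_weight
```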
4.4. Evaluation on the Weighting Terms
Our weighting term, wpq , is a combination of several
heuristic weighting terms. Here we provide some insight into
the relative effectiveness of each individual weighting term
Synthetic   Time (sec.)    Real-world            Time (sec.)
Art         21.60          Lion                  18.60
Books       26.47          Office with person    16.65
Mobius      24.07          Lounge                18.28
                           Classroom             19.00
Table 1. Running time of our algorithm for 8× upsampling. The upsampled depth map resolution is 1376×1088 for the synthetic examples and 1280×960 for the real-world examples. The algorithm was implemented in unoptimized MATLAB code.
and their combined effect as shown in Figure 6. Our exper-
iments found that using only the color similarity term can still
cause propagation errors. The edge cue is more effective
in preserving structure, but cannot entirely remove propa-
gation errors. The effect of the segmentation cue is sim-
ilar to the color cue as the segmentation is also based on
color information, but generally produces sharper boundary
with piecewise smoothed depth inside each segment. The
depth cue is good in avoiding propagation bleeding, but is
not effective along the depth boundaries because it ignores
the co-occurrence of image edges and depth edges. After
combining the four different cues together, the combined
weighting scheme shows the best results. The results pro-
duced with the combined weighting term effectively utilize
the structures in the high-resolution RGB image while avoid-
ing bleeding by including the depth cue, which is consistent
with the low-resolution depth map.
5. Results and Comparisons
We tested our approach using both synthetic examples
and real world examples as described in the following sec-
tions. The values of λs and λN are chosen as 0.2 and 0.1, re-
spectively, and they are fixed during our experiments. The
system configuration for the experiments is a 3 GHz CPU with 8 GB
RAM. We implemented our algorithm in MATLAB using its
built-in standard linear solver. The computation time is
summarized in Table 1.
5.1. Evaluations using the Middlebury stereo dataset
We use synthetic examples for quantitative comparisons
with the results from previous approaches [6,19,10]. The
depth maps from the Middlebury stereo datasets [15] are
used as the ground truth. We downsampled the ground truth
depth map by different factors to create the low-resolution
depth map. The original color image is used as the high-
resolution RGB image. We compare our results with bilin-
ear interpolation, MRF [6], bilateral filter [19], and a recent
work on guided image filter [10]. Since the previous ap-
proaches do not contain a user correction step, the results
generated by our method for these synthetic examples are
all based on our automatic method in Section 4.1 and Sec-
tion 4.2 for fair comparisons. Table 2 summarizes the RMSE
(root-mean-square error) against the ground truth under dif-
ferent magnification factors for different testing examples.
Our results consistently achieved the lowest RMSE among
all the test cases especially for large scale upsampling. The
qualitative comparison with the results from [6] and [19]
under 8×magnification factor can be found in Figure 7.
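The synthetic evaluation protocol can be sketched as follows; the nearest-neighbor downsampling used here to simulate the low-resolution input is an assumption about the exact subsampling scheme.

```python
# Sketch of the synthetic evaluation: downsample ground truth, upsample with a method, report RMSE.
import numpy as np

def rmse(est, gt):
    return float(np.sqrt(np.mean((est - gt) ** 2)))

def evaluate(gt_depth, upsample_fn, factor=8):
    low_res = gt_depth[::factor, ::factor]          # simulated low-resolution input (assumed scheme)
    est = upsample_fn(low_res, factor)              # method under test
    return rmse(est, gt_depth[:est.shape[0], :est.shape[1]])
```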
In terms of depth map quality, we found that the MRF
method in [6] produces the most blurred result. This is due
to its simple use of a neighborhood term which considers only
the image intensity difference as the neighborhood similar-
ity for depth propagation. The results from bilateral filtering
in [19] are comparable to ours with sharp depth discontinu-
ities in some of the test examples. However, since segmen-
tation and edge saliency are not considered, their results can
still suffer from depth bleeding in highly textured regions. We
also found that for the real world example in Figure 1, the
results from [19] tended to be blurry.
5.2. Robustness to Depth Noise
The depth maps captured by 3D-ToF cameras are always
noisy. We compare the robustness of our algorithm and
the previous algorithms by adding noise. We also compare
against the Noise-Aware bilateral filter approach in [4]. We
observe that the noise characteristics in a 3D-ToF camera
depend on the distance between the camera and the scene.
To simulate this effect, we add conditional Gaussian noise:

$p(x, k, \sigma_d) = k \exp\Big(-\frac{x^2}{2(1 + \sigma_d)^2}\Big), \qquad (14)$

where $\sigma_d$ is a value proportional to the depth value, and $k$
noise distribution of 3D-ToF camera is more complicated
than the Gaussian noise model, many previous depth map
upsampling algorithms do not consider the problem of noise
in the low-resolution depth map. This experiment therefore
attempts an objective comparison on the robustness of dif-
ferent algorithms with respect to noisy depth maps. The
results in terms of RMSE are summarized in Table 3.
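A sketch of simulating this depth-dependent noise is given below; the proportionality constant relating σd to depth and the interpretation of k as a per-pixel standard-deviation scale are assumptions.

```python
# Sketch of adding depth-dependent Gaussian noise in the spirit of Eq. (14); alpha and k are assumed.
import numpy as np

def add_tof_like_noise(depth, k=1.0, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    sigma_d = alpha * depth                          # sigma_d proportional to the depth value
    return depth + k * rng.normal(0.0, 1.0 + sigma_d)  # per-pixel std grows with distance
```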
5.3. Real World Examples
Figure 8 shows real-world examples of our approach.
Since the goal of our paper is to obtain high quality depth
maps, we include user corrections for the examples in the
top and middle row. We show our upsampled depth as well
as a novel view rendered by using our depth map. The mag-
nification factors for all these examples are 8×. These real
world examples are challenging with complicated bound-
aries and thin structures. Some of the objects contain al-
most identical colors but with different depth values. Our
approach is successful in distinguishing the various depth
layers with sharp boundaries. All results without user cor-
rections can be found in the supplemental materials.
Figure 7. Qualitative comparison on the Middlebury dataset. (a) MRF optimization [6]. (b) Bilateral filtering with subpixel refinement [19].
(c) Our results. The image resolution is enhanced by 8×. Note that we do not include any user correction in these synthetic testing cases.
The results are cropped for visualization; full-resolution comparisons are provided in the supplemental materials.
                 Art                          Books                        Mobius
               2×     4×     8×     16×     2×     4×     8×     16×     2×     4×     8×     16×
Bilinear       0.56   1.09   2.10   4.03    0.19   0.35   0.65   1.24    0.20   0.37   0.70   1.32
MRFs [6]       0.62   1.01   1.97   3.94    0.22   0.33   0.62   1.21    0.25   0.37   0.67   1.29
Bilateral [19] 0.57   0.70   1.50   3.69    0.30   0.45   0.64   1.45    0.39   0.48   0.69   1.14
Guided [10]    0.66   1.06   1.77   3.63    0.22   0.36   0.60   1.16    0.24   0.38   0.61   1.20
Ours           0.43   0.67   1.08   2.21    0.17   0.31   0.57   1.05    0.18   0.30   0.52   0.90
Table 2. Quantitative comparison on the Middlebury dataset. The error is measured in RMSE for four different magnification factors. Our algorithm performs the best among all compared algorithms. Note that no user correction is included in these synthetic testing examples.
6. Discussion and Summary
We have presented a framework to upsample a low-
resolution depth map from the 3D-ToF camera using an
auxiliary high-resolution RGB image. Our framework is
based on a least-squares optimization that combines several
weighting factors together with nonlocal means filtering to
maintain sharp depth boundaries and to prevent depth bleed-
ing during propagation. Although this work is admittedly
more engineering in nature, we believe it provides useful
insight into various weighting strategies for those working
with noisy range sensors. Moreover, experimental results
show that our method typically outperforms previous work
in terms of both RMSE and visual quality. In addition to the
automatic method, we have also discussed how to extend
our approach to incorporate user markup. Our user correc-
tion method is simple and intuitive and does not require any
additional modifications in order to solve the objective func-
tion defined in Section 4.1.
7. Acknowledgements
We are grateful to anonymous reviewers for their con-
structive comments. This research was partially sup-
ported by Samsung Advanced Institute of Technology
(RRA0109ZZ-61RF-1) and the National Strategic R&D
Program for Industrial Technology, Korea. Yu-Wing Tai
was supported by the National Research Foundation (NRF)
of Korea (2011-0013349) and the Ministry of Culture,
Sports and Tourism (MCST) and Korea Content Agency
(KOCCA) in the Culture Technology Research and Devel-
opment Program 2011. Michael S. Brown was supported
by the Singapore Academic Research Fund (AcRF) Tier 1
Grant (R-252-000-423-112).
References
[1] SwissRanger™ SR4000 data sheet, http://www.mesa-
imaging.ch/prodview4k.php.
[2] P. Bhat, C. L. Zitnick, M. F. Cohen, and B. Curless. Gra-
dientshop: A gradient-domain optimization framework for
image and video filtering. ACM Trans. Graph., 29(2), 2010.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate en-
ergy minimization via graph cuts. IEEE Trans. PAMI, 2001.
[4] D. Chan, H. Buisman, C. Theobalt, and S. Thrun. A noise
aware filter for real-time depth upsampling. In ECCV Work-
shop on Multicamera and Multimodal Sensor Fusion Algo-
rithms and Applications, 2008.
[5] J. Chen, C. Tang, and J. Wang. Noise brush: interactive
high quality image-noise separation. ACM Trans. Graphics,
28(5), 2009.
[6] J. Diebel and S. Thrun. An application of markov random
fields to range sensing. In NIPS, 2005.
[7] J. Dolson, J. Baek, C. Plagemann, and S. Thrun. Upsampling
range data in dynamic environments. In CVPR, 2010.
[8] E. Eisemann and F. Durand. Flash photography enhancement
via intrinsic relighting. ACM Trans. Graphics, 23(3):673–
678, 2004.
[9] P. Favaro. Recovering thin structures via nonlocal-means
regularization with application to depth from defocus. In
CVPR, 2010.
[10] K. He, J. Sun, and X. Tang. Guided image filtering. In ECCV,
2010.
[11] B. Huhle, T. Schairer, P. Jenke, and W. Strasser. Fusion
of range and color images for denoising and resolution en-
hancement with a non-local filter. CVIU, 114(12):1136–
1345, 2010.
                 Art                          Books                        Mobius
               2×     4×     8×     16×     2×     4×     8×     16×     2×     4×     8×     16×
Bilinear       3.09   3.59   4.39   5.91    2.91   3.12   3.34   3.71    3.21   3.45   3.62   4.00
MRFs [6]       1.62   2.54   3.85   5.70    1.34   2.08   2.85   3.54    1.47   2.29   3.09   3.81
Bilateral [19] 1.36   1.93   2.45   4.52    1.12   1.47   1.81   2.92    1.25   1.63   2.06   3.21
Guided [10]    1.92   2.40   3.32   5.08    1.60   1.82   2.31   3.06    1.77   2.03   2.60   3.34
NAFDU [4]      1.83   2.90   4.75   7.70    1.04   1.36   1.94   3.07    1.17   1.55   2.28   3.55
Ours           1.24   1.82   2.78   4.17    0.99   1.43   1.98   3.04    1.03   1.49   2.13   3.09
Table 3. Quantitative comparison on the Middlebury dataset with additive noise. Our algorithm achieves the lowest RMSE in most cases. Note that all these results are generated without any user correction. Better performance is possible after including user correction.
Figure 8. (a) Our input; the low-resolution depth maps are shown in the lower left corner (the ratio between the two images is preserved).
(b) Our results. User scribble areas (blue) and the additional depth samples (red) are highlighted. (c) Novel view rendering of our result.
Note that no user markup is required for our results in the third row. More results can be found in the supplemental materials.
[12] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele.
Joint bilateral upsampling. ACM Trans. Graphics, 26(3):96,
2007.
[13] A. Levin, D. Lischinski, and Y. Weiss. Colorization using
optimization. ACM Trans. Graphics, 23(3):689–694, 2004.
[14] D. Min, J. Lu, and M. Do. Depth video enhancement based
on weighted mode filtering. IEEE TIP, to appear.
[15] D. Scharstein and R. Szeliski. A taxonomy and evaluation of
dense two-frame stereo correspondence algorithms. IJCV,
47(1/2/3):7–42, 2002.
[16] A. Torralba and W. T. Freeman. Properties and applications
of shape recipes. In CVPR, pages 383–390, 2003.
[17] A. Vedaldi and B. Fulkerson. VLFeat: An open
and portable library of computer vision algorithms.
http://www.vlfeat.org/, 2008.
[18] J. Wang, M. Agrawala, and M. Cohen. Soft scissors: An
interactive tool for realtime high quality matting. SIG-
GRAPH’07.
[19] Q. Yang, R. Yang, J. Davis, and D. Nistér. Spatial-depth
super resolution for range images. In CVPR, 2007.
[20] Z. Zhang. A flexible new technique for camera calibration.
IEEE Trans. PAMI, 22(11):1330–1334, 2000.
[21] J. Zhu, L. Wang, R. Yang, and J. Davis. Fusion of time-
of-flight depth and stereo for high accuracy depth maps. In
CVPR, 2008.
[22] A. Zomet and S. Peleg. Multi-sensor super-resolution. In
WACV, pages 27–31, 2002.