High Quality Depth Map Upsampling for 3D-TOF Cameras
Jaesik Park†  Hyeongwoo Kim†∗  Yu-Wing Tai†  Michael S. Brown§  Inso Kweon†
†Korea Advanced Institute of Science and Technology (KAIST)
§National University of Singapore (NUS)
Abstract
This paper describes an application framework to per-
form high quality upsampling on depth maps captured from
a low-resolution and noisy 3D time-of-flight (3D-ToF) cam-
era that has been coupled with a high-resolution RGB cam-
era. Our framework is inspired by recent work that uses
nonlocal means filtering to regularize depth maps in order
to maintain fine detail and structure. Our framework ex-
tends this regularization with an additional edge weighting
scheme based on several image features based on the ad-
ditional high-resolution RGB input. Quantitative and qual-
itative results show that our method outperforms existing
approaches for 3D-ToF upsampling. We describe the com-
plete process for this system, including device calibration,
scene warping for input alignment, and even how the results
can be further processed using simple user markup.
1. Introduction
Active 3D time-of-flight (3D-ToF) cameras are becom-
ing a popular alternative to stereo-based range sensors.
Such 3D-ToF cameras use active sensing to capture 3D
range data at frame-rate as a per-pixel depth. A light source
from the camera emits a near-infrared wave which is then
reflected by the scene and is captured by a dedicated sen-
sor. Depending on the distance of the objects in the scene,
the captured light wave is delayed in phase compared to
the original emitted light wave. By measuring the phase
delay, the distance at each pixel can be estimated. The res-
olution of the depth map captured by 3D-ToF cameras is
relatively low; typically less than 1/4th the resolution of a
standard definition video camera. In addition, the captured
depth maps are often corrupted by significant amounts of
noise.
The goal of this paper is to estimate a high-quality, high-
resolution depth map from the 3D-ToF camera through upsampling
in the face of sensor noise. To aid this procedure, an aux-
iliary high-resolution conventional camera is coupled with
∗The first and second authors contributed equally to this
work.
Figure 1. (a) Low-resolution depth map (enlarged using nearest
neighbor upsampling), (b) high-resolution RGB image, (c) result
from [19], (d) our result. User scribble areas (blue) and the addi-
tional depth sample (red) are highlighted. The dark areas in (c) are
the areas without depth samples after registration. Full resolution
comparisons are provided in the supplemental materials.
the 3D-ToF camera to synchronously capture the scene. Re-
lated work [19, 4, 7] that also uses coupled device setups for
depth map upsampling has focused primarily on image fil-
tering techniques such as joint bilateral filtering [8, 12] or
its variations. Such filtering techniques can often over-smooth
results, especially in areas of fine structure.
We formulate the depth map upsampling problem using
constrained optimization. Our approach is inspired by the
recent success of nonlocal means regularization for depth
map construction from depth-from-defocus [9]. In particu-
lar, we describe how to formulate the problem into a least-
squares optimization that combines nonlocal means regu-
larization together with an edge weighting scheme that fur-
ther reinforces fine details. We also employ scene warping
to better align the low-resolution imagery to the auxiliary
camera input. While this work is more applied in nature,
the result is a system that is able to produce high-quality
upsampled depth maps superior in quality to prior work.
In addition, our approach can be easily extended to incor-
porate simple user markup to correct errors along disconti-
nuity boundaries without explicit image segmentation (e.g.
Figure 1).
2. Related Work
Previous work on depth map upsampling can be clas-
sified as either image fusion techniques that combine the
low-resolution depth map with the high-resolution image or
super-resolution techniques that merge multiple misaligned
low-resolution depth maps. Our approach falls into the
first category of image fusion which is the focus of the re-
lated work presented here. Image fusion approaches assume
there exists a joint occurrence between depth discontinuities
and image edges and that regions of homogeneous color have
similar 3D geometry [22,16]. Representative image fusion
approaches include [6,19,4,7]. In [6], Diebel and Thrun
performed upsampling using an MRF formulation with the
data term computed from the depth map and weights of the
smoothness terms between estimated high-resolution depth
samples derived from the high-resolution image. Yang et
al. [19] used joint bilateral filtering [8,12] to interpolate
the high-resolution depth values. Since filtering can often
over smooth the interpolated depth values, especially along
the depth discontinuity boundaries, they quantized the depth
values into several discrete layers. This work was later ex-
tended by [21] to use a stereo camera for better disconti-
nuity detection in order to avoid over smoothing of depth
boundaries. Chan et al. [4] introduced a noise-aware bi-
lateral filter that decides how to blend between the results
of standard upsampling or joint bilateral filtering depend-
ing on the depth map’s regional statistics. Dolson et al. [7]
also used a joint bilateral filter scheme, however, their ap-
proach includes additional time stamp information to main-
tain temporal coherence for the depth map upsampling in
video sequences.
The advantage of these bilateral filtering techniques is
they can be performed quickly; e.g. Chan et al. [4] reported
near real-time speeds using a GPU implementation. How-
ever, the downside is that they can still over-smooth fine
details. Work by [14] proposed a joint global mode fil-
ter based on global image histograms of the low-resolution
depth and high-resolution image. Our approach is more re-
lated to Diebel and Thrun [6] in that we formulate the prob-
lem using an MRF optimization scheme. However, our ap-
proach incorporates a nonlocal means (NLM) term in the
MRF to help preserve local structure. This additional NLM
term was inspired by the recent work by Favaro [9] which
demonstrated that NLM filtering is useful for maintaining fine
details even with noisy input data. Work in [11] has also
used the NLM to fuse the 3D point cloud and 2D image to
enhance the density of 3D points. We also include an addi-
tional weighting scheme based on several image-derived fea-
tures to further reinforce the preservation of fine detail. In
addition, we perform a warping step to better align the low-resolution and high-resolution input.

Figure 2. (a) Our imaging setup uses a 3D-ToF camera which captures images at 176×144 resolution, synchronized with a 1280×960 resolution RGB camera. (b) Our calibration configuration, which uses a planar calibration pattern with holes to allow the 3D-ToF camera to be calibrated.

Our experimental results on ground truth data show that our application framework can outperform existing techniques for the majority of scenes with various upsampling factors. Since our goal is high-quality depth maps, the need for some manual cleanup of the machine-vision input is unavoidable. Another advantage of our approach is that it can easily incorporate user markup to improve the results.
3. System Setup and Preprocessing
In this section, we describe our system and the prepro-
cessing step to register the 3D-ToF camera and conventional
camera and to perform an initial outlier rejection on the 3D-
ToF input.
3.1. System Configuration
Figure 2(a) shows our hardware configuration consisting
of a 3D-ToF camera and a high-resolution RGB camera. For
the depth camera, we use the SwissRanger™ SR4000 [1],
which captures a 176×144 depth map. For the RGB cam-
era, we use the Point Grey Research Flea RGB camera with
a resolution of 1280×960 pixels. Since the data captured
from the two cameras have slightly different viewpoints, we
need to register the cameras according to the depth values
from the low-resolution depth map.
3.2. Depth Map Registration
Let $\mathbf{X}_d = (X, Y, Z, 1)^T$ be a 3D homogeneous coordinate acquired by the 3D-ToF camera, and $\mathbf{x}_c = (u, v, 1)^T$ be the 2D homogeneous coordinate in the high-resolution RGB image. We can compute the projection of $\mathbf{X}_d$ onto $\mathbf{x}_c$ by:

$s\,\mathbf{x}_c = K\,[R\,|\,\mathbf{t}]\,\mathbf{X}_d, \qquad (1)$

where $s$ is a scale factor, $K$ contains the intrinsic parameters of the RGB camera, and $R$ and $\mathbf{t}$ are the rotation matrix and translation vector describing the relative pose of the RGB camera and the depth camera with respect to the 3D world coordinate frame.
To calibrate the two cameras’ parameters, we use the cal-
ibration method introduced by Zhang [20]. Since the 3D-
ToF camera cannot capture textures, we instead use a pla-
nar calibration pattern consisting of holes for our purpose
(Figure 2(b)). This unique calibration pattern allows us to
detect the positions on the planar surface that are observed
by the 3D-ToF camera. After camera calibration, for any
point, xt, on the low-resolution depth map with depth value
dt, we can compute its corresponding position in the high-
resolution RGB image by the following equation:
$s\,\mathbf{x}_c = K_c\,[R\,|\,\mathbf{t}]\,P_t^{-1}\,[\mathbf{x}_t d_t \;\; 1]^T, \qquad (2)$

where $P_t$ is the $4 \times 4$ projective transformation converting the world coordinate $\mathbf{X}_d$ into the local coordinate of the 3D-ToF camera. We obtain the scaling term $s$ by calcu-
lating the relative resolution between the depth camera and
the RGB camera. Since the depth map from the depth cam-
era is noisy, we impose a neighborhood smoothness regu-
larization using thin-plate splines to forward map the low-
resolution depth map to the high-resolution image.
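For illustration, the following Python sketch forward-maps a ToF pixel with its measured depth into the RGB image following the projection of Equations (1)-(2). It is not the authors' code; the intrinsic matrices, extrinsics, and the helper names backproject_tof and project_to_rgb are assumptions made for this example, and the thin-plate-spline regularization step is omitted.

```python
# Illustrative sketch (not the authors' implementation) of the ToF-to-RGB mapping.
import numpy as np

def backproject_tof(u, v, d, K_t):
    """Back-project ToF pixel (u, v) with metric depth d into the ToF camera frame."""
    x = (u - K_t[0, 2]) * d / K_t[0, 0]
    y = (v - K_t[1, 2]) * d / K_t[1, 1]
    return np.array([x, y, d])

def project_to_rgb(X_tof, K_c, R, t):
    """Project a 3D point in the ToF frame into RGB pixel coordinates (Eqs. 1-2)."""
    X_rgb = R @ X_tof + t            # rigid transform between the two cameras
    x = K_c @ X_rgb                  # pinhole projection
    return x[:2] / x[2]              # divide out the scale factor s

# Example with assumed (made-up) calibration values:
K_t = np.array([[250.0, 0.0, 88.0], [0.0, 250.0, 72.0], [0.0, 0.0, 1.0]])      # ToF intrinsics
K_c = np.array([[1050.0, 0.0, 640.0], [0.0, 1050.0, 480.0], [0.0, 0.0, 1.0]])  # RGB intrinsics
R, t = np.eye(3), np.array([0.05, 0.0, 0.0])                                   # assumed extrinsics
uv_rgb = project_to_rgb(backproject_tof(100, 60, 1.8, K_t), K_c, R, t)
```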
3.3. Outlier Detection
The depth map from the 3D-ToF camera contains depth
edges that are blurred by mixing the depth values of two dif-
ferent depth layers along depth boundaries. These blurred
depth boundaries are unreliable and should be removed be-
fore upsampling. For each pixel in the low-resolution depth
map, we compare the depth value of a pixel to the local
maximum depth and the local minimum depth within a
small local window (e.g. 9×9) in the low-resolution depth
map. The contrast between the local maximum and mini-
mum depth determines whether this local window contains
two different depth layers. If the depth value of a pixel is at
the middle of the two depth layers, we consider this pixel
as a boundary pixel. Since the input depth map is noisy, we
use an MRF [3] to clean up the noisy estimation as:

$E(l) = \sum_p \Big( O_p(l) + \lambda_{pq} \sum_{q \in N(p)} O_{pq}(l) \Big), \qquad (3)$

where $l \in \{0, 1\}$ is a binary label map indicating whether a pixel is an outlier or not, $O_p(l)$ is the data term defined by the extent of contrast within a small window, and $O_{pq}(l)$ is the smoothness term defined by the Hamming distance between $l_p$ and its neighbor $l_q$. This simple outlier rejection step
is performed on each input frame captured by the 3D-ToF
camera.
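A sketch of the boundary test described above is given below. It illustrates only the per-pixel contrast test (the data-term side); the MRF cleanup of Equation (3) via graph cuts is omitted, and the window size, contrast threshold, and mid-band fraction are assumed values.

```python
# Illustrative sketch of detecting blurred depth-boundary pixels (not the authors' code).
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def flag_boundary_pixels(depth, win=9, contrast_thresh=0.15, band=0.25):
    d_max = maximum_filter(depth, size=win)
    d_min = minimum_filter(depth, size=win)
    contrast = d_max - d_min                      # large where two depth layers meet
    # a pixel sitting in the middle of the two layers is treated as a boundary pixel
    mid = (depth > d_min + band * contrast) & (depth < d_max - band * contrast)
    return (contrast > contrast_thresh) & mid     # candidate outliers before MRF cleanup
```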
4. Optimization Framework
This section describes our optimization framework for
upsampling the low-resolution depth map given the aligned
sparse depth samples and the high-resolution RGB image.
Figure 3. Comparison of our result without (a) and with (b) the
NLM term. The same weighting scheme proposed in Section 4.2
is used for both (a) and (b). Although the usage of NLM does
not significantly affect the RMS error, it is important in generating
high quality depth maps especially along thin structure elements.
Similar to the previous image fusion approaches [6,19,7],
we assume there are co-occurrences of depth boundaries
and image boundaries.
4.1. Objective Function
We define the objective function for depth map upsam-
pling as follows:
$E(D) = E_d(D) + \lambda_s E_s(D) + \lambda_N E_{NLM}(D), \qquad (4)$

where $E_d(D)$ is the data term, $E_s(D)$ is the neighborhood smoothness term, and $E_{NLM}(D)$ is an NLM regularization term. The terms $\lambda_s$ and $\lambda_N$ are relative weights that balance the energy between the three terms. Note that the smoothness term and NLM term could be combined into a single term; however, we keep them separate here for the sake of clarity.
Our data term is defined according to the initial sparse depth map:

$E_d(D) = \sum_{p \in G} (D(p) - G(p))^2, \qquad (5)$

where $G$ is the set of pixels that have an initial depth value. Our smoothness term is defined as:

$E_s(D) = \sum_p \sum_{q \in N(p)} w_{pq} (D(p) - D(q))^2, \qquad (6)$

where $N(p)$ is the first-order neighborhood of $p$, and $w_{pq}$ is the confidence weighting which will be detailed in the following section. Combining Equation (5) and Equation (6)
forms a quadratic objective function which is similar to the
objective function in [13]. The work in [13] was designed
to propagate sparse color values to a gray high-resolution
image, which is similar in nature to our problem of propa-
gating sparse depth values to the high-resolution RGB im-
age.
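Because Equations (4)-(6) form a sparse quadratic objective, the upsampled depth map can be obtained by solving a sparse linear system. The sketch below illustrates this for the data and smoothness terms only; the NLM term of Equation (7) adds rows in the same fashion. It is not the authors' MATLAB implementation, and the precomputed weight arrays and the λs value are assumptions.

```python
# Minimal sketch of solving the quadratic objective (Eqs. 4-6) as a sparse least-squares problem.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def solve_depth(sparse_depth, mask, w_right, w_down, lam_s=0.2):
    """sparse_depth, mask: HxW initial depth samples and their validity mask.
    w_right, w_down: HxW confidence weights w_pq to the right/down neighbor."""
    H, W = sparse_depth.shape
    n = H * W
    idx = np.arange(n).reshape(H, W)

    # Smoothness rows: sqrt(lam_s * w_pq) * (D(p) - D(q)) = 0   (Eq. 6)
    p_ids = list(idx[:, :-1].ravel()) + list(idx[:-1, :].ravel())
    q_ids = list(idx[:, 1:].ravel()) + list(idx[1:, :].ravel())
    w = list(w_right[:, :-1].ravel()) + list(w_down[:-1, :].ravel())
    p_ids, q_ids = np.array(p_ids), np.array(q_ids)
    w_sqrt = np.sqrt(lam_s * np.array(w))

    # Data rows: D(p) = G(p) for pixels with an initial depth sample   (Eq. 5)
    data_idx = idx[mask]
    m_s, m_d = len(p_ids), len(data_idx)
    rows = np.concatenate([np.arange(m_s), np.arange(m_s), m_s + np.arange(m_d)])
    cols = np.concatenate([p_ids, q_ids, data_idx])
    vals = np.concatenate([w_sqrt, -w_sqrt, np.ones(m_d)])
    A = sp.coo_matrix((vals, (rows, cols)), shape=(m_s + m_d, n)).tocsr()
    b = np.concatenate([np.zeros(m_s), sparse_depth[mask]])

    D = spsolve(A.T @ A, A.T @ b)    # normal equations of the least-squares problem
    return D.reshape(H, W)
```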
The difference between our method and that of [13] is the
definition of wpq . Work in [13] defined wpq using intensity
difference between the first order neighborhood pixels to
preserve discontinuities. We further combine segmentation,
color information, and edge saliency as well as the bicubic
Figure 4. (a) Low-resolution depth map (enlarged using nearest neighbor upsampling). (b) High-resolution RGB image. (c) Color
segmentation by [17]. (d) Edge saliency map. (e) Guided depth map by using bicubic interpolation of (a). (f) Our upsampling result
without the guided depth map weighting, depth bleeding occurred in highly textured regions. (g) Our upsampling result with guided depth
map weighting. (h) Ground truth. We subsampled the depth value of a dataset from Middlebury to create the synthetic low-resolution depth
map. The magnification factor in this example is 5×. The sum of squared differences (SSD) between (f) and (g) compared to the ground
truth are 31.66 and 24.62, respectively. Note that the depth bleeding problem in highly textured regions has been improved.
upsampled depth map to define wpq . The reason for this
is that we find the first order neighborhood does not prop-
erly consider the image structure. As a result, propagated
color information in [13] was often prone to bleeding errors
near fine detail. In addition, we include a NLM regulariza-
tion term, which protects the thin structures by allowing the
pixels on the same nonlocal structure to reinforce each other
within a larger neighborhood. We define the NLM regular-
ization term using an anisotropic structural-aware filter [5]:

$E_{NLM}(D) = \sum_p \sum_{q \in A(p)} \kappa_{pq} (D(p) - D(q))^2, \qquad (7)$

where $A(p)$ is a local window (e.g. $11 \times 11$) in the high-resolution image and $\kappa_{pq}$ is the weight of the anisotropic structural-aware filter defined as:

$\kappa_{pq} = \frac{1}{2}\Big(\exp(-(p-q)^T \Sigma_p^{-1} (p-q)) + \exp(-(p-q)^T \Sigma_q^{-1} (p-q))\Big), \quad \Sigma_p = \frac{1}{|A|} \sum_{s \in A(p)} \nabla I(s) \nabla I(s)^T. \qquad (8)$

Here, $\nabla I(p) = (\nabla_x I(p), \nabla_y I(p))^T$ is the x- and y-image gradient vector at $p$, and $I$ is the high-resolution color image. The term $\Sigma_q$ is defined similarly to $\Sigma_p$. This anisotropic structural-aware filter measures how likely $p$ and $q$ are to lie on the same structure in the high-resolution RGB image, i.e. if $p$ and $q$ are on the same structure, $\kappa_{pq}$ will be large. This NLM filter essentially allows similar pixels to reinforce each other even if they are not first-order neighbors. To maintain the sparsity of the linear system, we remove neighborhood entries with $\kappa_{pq} < t$. A comparison showing the effectiveness of the NLM regularization is given in Figure 3.
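The structure-tensor weight of Equation (8) can be sketched as follows. The gradient operator, window size, and the small regularizer eps (added so that Σ stays invertible in flat regions) are assumptions not stated in the paper.

```python
# Illustrative sketch of the anisotropic structural-aware weight kappa_pq (Eq. 8).
import numpy as np
from scipy.ndimage import convolve

def structure_tensors(gray, win=11, eps=1e-4):
    """Per-pixel 2x2 structure tensor Sigma_p averaged over a win x win window."""
    gy, gx = np.gradient(gray.astype(np.float64))
    box = np.ones((win, win)) / float(win * win)
    Sxx = convolve(gx * gx, box)
    Sxy = convolve(gx * gy, box)
    Syy = convolve(gy * gy, box)
    Sigma = np.stack([np.stack([Sxx + eps, Sxy], axis=-1),
                      np.stack([Sxy, Syy + eps], axis=-1)], axis=-2)   # H x W x 2 x 2
    return Sigma

def kappa(p, q, Sigma):
    """Weight kappa_pq between pixel coordinates p = (y, x) and q = (y', x')."""
    d = np.array([p[0] - q[0], p[1] - q[1]], dtype=np.float64)
    quad_p = d @ np.linalg.inv(Sigma[p]) @ d
    quad_q = d @ np.linalg.inv(Sigma[q]) @ d
    return 0.5 * (np.exp(-quad_p) + np.exp(-quad_q))
```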
4.2. Confidence Weighting
In this section, we describe our confidence weighting
scheme for defining the weights wpq in Equation (6). The
value of wpq defines the spatial coherence of neighboring
pixels: the larger wpq is, the more likely the two neigh-
boring pixels have the same depth value. Our confi-
dence weighting is decomposed into four terms based on
color similarity (wc), segmentation (ws), edge saliency
(we), and a guided bicubic-interpolated depth map (wd).
The color similarity term is defined in the YUV color space as follows:

$w_c = \exp\Big(-\sum_{I \in \{Y, U, V\}} \frac{(I(p) - I(q))^2}{2\sigma_I^2}\Big), \qquad (9)$

where $\sigma_I$ controls the relative sensitivity of the different color channels.
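A minimal sketch of this color-similarity weight for one neighboring pixel pair follows; the RGB-to-YUV conversion matrix and the σI value are assumptions.

```python
# Sketch of the YUV color-similarity weight w_c (Eq. 9); conversion matrix and sigma_I assumed.
import numpy as np

RGB2YUV = np.array([[0.299, 0.587, 0.114],
                    [-0.147, -0.289, 0.436],
                    [0.615, -0.515, -0.100]])

def w_color(rgb_p, rgb_q, sigma_I=0.1):
    yuv_p, yuv_q = RGB2YUV @ rgb_p, RGB2YUV @ rgb_q      # per-pixel RGB values in [0, 1]
    return np.exp(-np.sum((yuv_p - yuv_q) ** 2) / (2.0 * sigma_I ** 2))
```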
Our second term is defined based on color segmentation
using the library provided in [17] to segment an image into
super pixels as shown in Figure 4(c). For the neighborhood
pixels that are not within the same super pixel, we give a
penalty term defined as:
$w_s = \begin{cases} 1 & \text{if } S_{co}(p) = S_{co}(q) \\ t_{se} & \text{otherwise} \end{cases} \qquad (10)$

where $S_{co}(\cdot)$ is the segmentation label and $t_{se}$ is a penalty factor with a value between 0 and 1. In our implementation, we empirically set it to 0.7.
Inspired by [2], we have also included a weight which
depends on the edge saliency response. Different from the
color similarity term, the edge saliency responses are de-
tected by a set of Gabor filters with different sizes and ori-
entations. The edge saliency map contains image struc-
tures rather than just color differences between neighbor-
hood pixels. We combine the responses of different Ga-
bor filters to form the edge saliency map as shown in Fig-
ure 4(d). Our weighting is computed as:

$w_e = \frac{1}{\sqrt{s_x(p)^2 + s_x(q)^2 + 1}}, \qquad (11)$

where $s_x(\cdot)$ is the value of the x-axis edge saliency map when $p$ and $q$ are x-axis neighbors.
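A sketch of building such an edge-saliency map from a small Gabor bank and evaluating the weight of Equation (11) is given below. The filter sizes, orientations, and the remaining Gabor parameters are assumptions, since the paper does not list them.

```python
# Illustrative edge-saliency map from a Gabor filter bank (parameters assumed), plus w_e (Eq. 11).
import cv2
import numpy as np

def edge_saliency(gray, ksizes=(9, 17), thetas=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    s = np.zeros_like(gray, dtype=np.float64)
    for k in ksizes:
        for th in thetas:
            kern = cv2.getGaborKernel((k, k), sigma=k / 3.0, theta=th,
                                      lambd=k / 2.0, gamma=0.5)
            s = np.maximum(s, np.abs(cv2.filter2D(gray.astype(np.float64), -1, kern)))
    return s / (s.max() + 1e-8)

def w_edge(s_p, s_q):
    # close to 1 away from salient edges, small across strong structures
    return 1.0 / np.sqrt(s_p ** 2 + s_q ** 2 + 1.0)
```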
Allowing the depth values to propagate freely with only
very sparse data constraint can lead to severe depth bleed-
ing. Here, we introduce the guided depth map to resolve
(a) (b) (c)
(d) (e) (f)
Figure 5. Depth map refinement via user markup. (a)(d) Color
image of small scale structure. (b)(e) Upsampled depth map before
user correction. The user scribble areas in (b), and the user added
depth samples in (e) are indicated by the yellow lines and dots
respectively. (c)(f) Refined depth maps.
this problem. The guided depth map weighting is similar
to the intensity weighting in a bilateral filter. Since we do
not have a depth sample at each high-resolution pixel loca-
tion, we use bicubic interpolation to obtain the guided depth
map, Dg, as shown in Figure 4(e). Similar to the bilateral
filter, we define the guided depth map weighting as follows:

$w_d = \exp\Big(-\frac{(D_g(p) - D_g(q))^2}{2\sigma_g^2}\Big). \qquad (12)$
Combining the weights defined in Equation (9) through Equation (12) by multiplication, we obtain the weight $w_{pq} = w_s w_c w_e w_d$. Note that, except for the edge saliency term, all the weightings defined in this subsection can also be applied to the weighting $\kappa_{pq}$ of the NLM regularization term via multiplication.
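The remaining weights and their combination can be sketched as follows. The segmentation penalty tse = 0.7 follows the text, while the σg value is an assumption; the guided depth map Dg itself would come from bicubic interpolation of the sparse samples (e.g. scipy.ndimage.zoom with order 3).

```python
# Sketch of the guided-depth weight w_d (Eq. 12), the segmentation weight w_s (Eq. 10),
# and the combined confidence weight w_pq = w_s * w_c * w_e * w_d; sigma_g is assumed.
import numpy as np

def w_guided(dg_p, dg_q, sigma_g=0.05):
    return np.exp(-(dg_p - dg_q) ** 2 / (2.0 * sigma_g ** 2))

def w_segment(label_p, label_q, t_se=0.7):
    return 1.0 if label_p == label_q else t_se

def combined_weight(w_s, w_c, w_e, w_d):
    return w_s * w_c * w_e * w_d
```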
4.3. User Adjustments
Since the goal is high-quality upsampling, it is inevitable
that some depth frames are going to require user touch up,
especially if the data is intended for media related applica-
tions. Our approach allows easy user corrections by direct
manipulation of the weighting term wpq or by adding addi-
tional sparse depth samples for error correction.
For the manipulation of the weighting term, we allow
the user to draw scribbles along fuzzy image boundaries,
or along the boundaries where the image contrast is low.
These fuzzy boundaries or low contrast boundaries repre-
sent difficult regions for segmentation and edge saliency
detection. As a result, they cause depth bleeding in the re-
constructed high-resolution depth map as illustrated in Fig-
ure 5(b). Within the scribble areas, we compute an alpha
matte based on the work by Wang et al. [18] for the two
different depth layers. An additional weighting term will
be added according to the estimated alpha values within the
[Figure 6(d) plot: PSNR (dB, roughly 35-42) versus upsampling scale (2×, 4×, 8×, 16×) for the color, edge, segment, and depth weighting terms individually and for their combination.]
Figure 6. A synthetic example for self-evaluation of our weighting terms. (a)(b) A synthetic image pair consisting of a high-resolution color image and a low-resolution depth image. (c) Our 4× upsampled depth map with the combined weighting term. (d) PSNR accuracy of the results with the combined weighting term and with each weighting term individually. The combined weighting term consistently produces the best results across different upsampling scales.
scribble areas. For two pixels $p$ and $q$ within the scribble areas, if they belong to the same depth layer, they should have the same or similar alpha value. Hence, our additional weighting term encoding this additional depth discontinuity information is defined as:

$\exp\Big(-\frac{(\alpha(p) - \alpha(q))^2}{2\sigma_\alpha^2}\Big), \qquad (13)$

where $\alpha(\cdot)$ denotes the estimated alpha values within the scribble areas. Figure 5(c) shows the effect of adding this alpha
weighting term. The scribble areas are indicated by the yel-
low lines in Figure 5(b).
Our second type of user correction allows the user to
draw or remove depth samples on the high-resolution depth
map directly. When adding a depth sample, the user can
simply pick a depth value from the computed depth map
and then assign this depth value to locations where depth
samples are “missing”. After adding the additional depth
samples, our algorithm generates the new depth map using
the new depth samples as a hard constraint in Equation (4).
The second row of Figure 5 shows an example of this user
correction. Note that for image filtering techniques, such
depth sample correction can be more complicated to incor-
porate since the effect of new depth samples can be filtered
by the original depth sample within a large local neighbor-
hood. Removal of depth samples can also cause a hole in
the result of image filtering techniques.
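One simple way to realize such a hard constraint is to give the user-added samples a very large data weight in the least-squares system; this is an assumption about the mechanism rather than the authors' exact implementation.

```python
# Sketch of folding user-added depth samples into the data term as (near-)hard constraints.
import numpy as np

def add_user_samples(sparse_depth, mask, data_weight, user_points, hard_weight=1e6):
    """user_points: list of (y, x, depth) values picked by the user on the HR depth map."""
    for (y, x, d) in user_points:
        sparse_depth[y, x] = d
        mask[y, x] = True
        data_weight[y, x] = hard_weight   # very large weight pins the solution at this pixel
    return sparse_depth, mask, data_weight
```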
4.4. Evaluation on the Weighting Terms
Our weighting term, wpq , is a combination of several
heuristic weighting terms. Here we provide some insight into
the relative effectiveness of each individual weighting term
Synthetic   Time (sec.)    Real-world            Time (sec.)
Art         21.60          Lion                  18.60
Books       26.47          Office with person    16.65
Mobius      24.07          Lounge                18.28
                           Classroom             19.00
Table 1. Running time of our algorithm for 8× upsampling. The upsampled depth map resolution is 1376×1088 for the synthetic examples and 1280×960 for the real-world examples. The algorithm was implemented in unoptimized MATLAB code.
and their combined effect as shown in Figure 6. Our exper-
iments found that using only the color similarity term can still
cause propagation errors. The edge cue is more effective
in preserving structure, but cannot entirely remove propa-
gation errors. The effect of the segmentation cue is sim-
ilar to the color cue as the segmentation is also based on
color information, but generally produces sharper boundary
with piecewise smoothed depth inside each segment. The
depth cue is good in avoiding propagation bleeding, but is
not effective along the depth boundaries because it ignores
the co-occurrence of image edges and depth edges. After
combining the four different cues together, the combined
weighting scheme shows the best results. The results pro-
duced with the combined weighting term effectively utilize
the structures in the high-resolution RGB image while avoid-
ing bleeding by including the depth cue, which is consistent
with the low-resolution depth map.
5. Results and Comparisons
We tested our approach using both synthetic examples
and real world examples as described in the following sec-
tions. The values of λs and λN are chosen as 0.2 and 0.1, re-
spectively, and they are fixed during our experiments. The
system configuration for the experiments is a 3 GHz CPU with 8 GB
RAM. We implemented our algorithm in MATLAB using its
built-in standard linear solver. The computation time is
summarized in Table 1.
5.1. Evaluations using the Middlebury stereo dataset
We use synthetic examples for quantitative comparisons
with the results from previous approaches [6,19,10]. The
depth maps from the Middlebury stereo datasets [15] are
used as the ground truth. We downsampled the ground truth
depth map by different factors to create the low-resolution
depth map. The original color image is used as the high-
resolution RGB image. We compare our results with bilin-
ear interpolation, MRF [6], bilateral filter [19], and a recent
work on guided image filter [10]. Since the previous ap-
proaches do not contain a user correction step, the results
generated by our method for these synthetic examples are
all based on our automatic method in Section 4.1 and Sec-
tion 4.2 for fair comparisons. Table 2 summarizes the RMSE
(root-mean-square error) against the ground truth under dif-
ferent magnification factors for different testing examples.
Our results consistently achieved the lowest RMSE among
all the test cases especially for large scale upsampling. The
qualitative comparison with the results from [6] and [19]
under 8×magnification factor can be found in Figure 7.
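The synthetic evaluation protocol can be sketched as follows; the nearest-neighbor downsampling used here to simulate the low-resolution input is an assumption about the exact subsampling scheme.

```python
# Sketch of the synthetic evaluation: downsample ground truth, upsample with a method, report RMSE.
import numpy as np

def rmse(est, gt):
    return float(np.sqrt(np.mean((est - gt) ** 2)))

def evaluate(gt_depth, upsample_fn, factor=8):
    low_res = gt_depth[::factor, ::factor]          # simulated low-resolution input (assumed scheme)
    est = upsample_fn(low_res, factor)              # method under test
    return rmse(est, gt_depth[:est.shape[0], :est.shape[1]])
```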
In terms of depth map quality, we found that the MRF
method in [6] produces the most blurred result. This is due
to its simple use of a neighborhood term which considers only
the image intensity difference as the neighborhood similar-
ity for depth propagation. The results from bilateral filtering
in [19] are comparable to ours with sharp depth discontinu-
ities in some of the test examples. However, since segmen-
tation and edge saliency are not considered, their results can
still suffer from depth bleeding in highly textured regions. We
also found that for the real world example in Figure 1, the
results from [19] tended to be blurry.
5.2. Robustness to Depth Noise
The depth maps captured by 3D-ToF cameras are always
noisy. We compare the robustness of our algorithm and
the previous algorithms by adding noise. We also compare
against the Noise-Aware bilateral filter approach in [4]. We
observe that the noise characteristics in a 3D-ToF camera
depend on the distance between the camera and the scene.
To simulate this effect, we add conditional Gaussian noise:

$p(x, k, \sigma_d) = k \exp\Big(-\frac{x^2}{2(1 + \sigma_d)^2}\Big), \qquad (14)$

where $\sigma_d$ is a value proportional to the depth value, and $k$
noise distribution of 3D-ToF camera is more complicated
than the Gaussian noise model, many previous depth map
upsampling algorithms do not consider the problem of noise
in the low-resolution depth map. This experiment therefore
attempts an objective comparison on the robustness of dif-
ferent algorithms with respect to noisy depth maps. The
results in terms of RMSE are summarized in Table 3.
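A sketch of simulating this depth-dependent noise is given below; the proportionality constant relating σd to depth and the interpretation of k as a per-pixel standard-deviation scale are assumptions.

```python
# Sketch of adding depth-dependent Gaussian noise in the spirit of Eq. (14); alpha and k are assumed.
import numpy as np

def add_tof_like_noise(depth, k=1.0, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    sigma_d = alpha * depth                          # sigma_d proportional to the depth value
    return depth + k * rng.normal(0.0, 1.0 + sigma_d)  # per-pixel std grows with distance
```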
5.3. Real World Examples
Figure 8 shows real-world examples of our approach.
Since the goal of our paper is to obtain high quality depth
maps, we include user corrections for the examples in the
top and middle row. We show our upsampled depth as well
as a novel view rendered by using our depth map. The mag-
nification factors for all these examples are 8×. These real
world examples are challenging with complicated bound-
aries and thin structures. Some of the objects contain al-
most identical colors but with different depth values. Our
approach is successful in distinguishing the various depth
layers with sharp boundaries. All results without user cor-
rections can be found in the supplemental materials.
Figure 7. Qualitative comparison on the Middlebury dataset. (a) MRF optimization [6]. (b) Bilateral filtering with subpixel refinement [19].
(c) Our results. The image resolution is enhanced by 8×. Note that we do not include any user correction in these synthetic testing cases.
The results are cropped for visualization; full-resolution comparisons are provided in the supplemental materials.
                 Art                          Books                        Mobius
               2×     4×     8×     16×     2×     4×     8×     16×     2×     4×     8×     16×
Bilinear       0.56   1.09   2.10   4.03    0.19   0.35   0.65   1.24    0.20   0.37   0.70   1.32
MRFs [6]       0.62   1.01   1.97   3.94    0.22   0.33   0.62   1.21    0.25   0.37   0.67   1.29
Bilateral [19] 0.57   0.70   1.50   3.69    0.30   0.45   0.64   1.45    0.39   0.48   0.69   1.14
Guided [10]    0.66   1.06   1.77   3.63    0.22   0.36   0.60   1.16    0.24   0.38   0.61   1.20
Ours           0.43   0.67   1.08   2.21    0.17   0.31   0.57   1.05    0.18   0.30   0.52   0.90
Table 2. Quantitative comparison on the Middlebury dataset. The error is measured in RMSE for four different magnification factors. Our algorithm performs the best among all compared algorithms. Note that no user correction is included in these synthetic testing examples.
6. Discussion and Summary
We have presented a framework to upsample a low-
resolution depth map from the 3D-ToF camera using an
auxiliary high-resolution RGB image. Our framework is
based on a least-squares optimization that combines several
weighting factors together with nonlocal means filtering to
maintain sharp depth boundaries and to prevent depth bleed-
ing during propagation. Although this work is admittedly
more engineering in nature, we believe it provides useful
insight into various weighting strategies for those working
with noisy range sensors. Moreover, experimental results
show that our method typically outperforms previous work
in terms of both RMSE and visual quality. In addition to the
automatic method, we have also discussed how to extend
our approach to incorporate user markup. Our user correc-
tion method is simple and intuitive and does not require any
additional modifications in order to solve the objective func-
tion defined in Section 4.1.
7. Acknowledgements
We are grateful to anonymous reviewers for their con-
structive comments. This research was partially sup-
ported by Samsung Advanced Institute of Technology
(RRA0109ZZ-61RF-1) and the National Strategic R&D
Program for Industrial Technology, Korea. Yu-Wing Tai
was supported by the National Research Foundation (NRF)
of Korea (2011-0013349) and the Ministry of Culture,
Sports and Tourism (MCST) and Korea Content Agency
(KOCCA) in the Culture Technology Research and Devel-
opment Program 2011. Michael S. Brown was supported
by the Singapore Academic Research Fund (AcRF) Tier 1
Grant (R-252-000-423-112).
References
[1] SwissRanger™ SR4000 data sheet, http://www.mesa-
imaging.ch/prodview4k.php.
[2] P. Bhat, C. L. Zitnick, M. F. Cohen, and B. Curless. Gra-
dientshop: A gradient-domain optimization framework for
image and video filtering. ACM Trans. Graph., 29(2), 2010.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate en-
ergy minimization via graph cuts. IEEE Trans. PAMI, 2001.
[4] D. Chan, H. Buisman, C. Theobalt, and S. Thrun. A noise
aware filter for real-time depth upsampling. In ECCV Work-
shop on Multicamera and Multimodal Sensor Fusion Algo-
rithms and Applications, 2008.
[5] J. Chen, C. Tang, and J. Wang. Noise brush: interactive
high quality image-noise separation. ACM Trans. Graphics,
28(5), 2009.
[6] J. Diebel and S. Thrun. An application of markov random
fields to range sensing. In NIPS, 2005.
[7] J. Dolson, J. Baek, C. Plagemann, and S. Thrun. Upsampling
range data in dynamic environments. In CVPR, 2010.
[8] E. Eisemann and F. Durand. Flash photography enhancement
via intrinsic relighting. ACM Trans. Graphics, 23(3):673–
678, 2004.
[9] P. Favaro. Recovering thin structures via nonlocal-means
regularization with application to depth from defocus. In
CVPR, 2010.
[10] K. He, J. Sun, and X. Tang. Guided image filtering. In ECCV,
2010.
[11] B. Huhle, T. Schairer, P. Jenke, and W. Strasser. Fusion
of range and color images for denoising and resolution en-
hancement with a non-local filter. CVIU, 114(12):1136–
1345, 2010.
                 Art                          Books                        Mobius
               2×     4×     8×     16×     2×     4×     8×     16×     2×     4×     8×     16×
Bilinear       3.09   3.59   4.39   5.91    2.91   3.12   3.34   3.71    3.21   3.45   3.62   4.00
MRFs [6]       1.62   2.54   3.85   5.70    1.34   2.08   2.85   3.54    1.47   2.29   3.09   3.81
Bilateral [19] 1.36   1.93   2.45   4.52    1.12   1.47   1.81   2.92    1.25   1.63   2.06   3.21
Guided [10]    1.92   2.40   3.32   5.08    1.60   1.82   2.31   3.06    1.77   2.03   2.60   3.34
NAFDU [4]      1.83   2.90   4.75   7.70    1.04   1.36   1.94   3.07    1.17   1.55   2.28   3.55
Ours           1.24   1.82   2.78   4.17    0.99   1.43   1.98   3.04    1.03   1.49   2.13   3.09
Table 3. Quantitative comparison on the Middlebury dataset with additive noise. Our algorithm achieves the lowest RMSE in most cases. Note that all these results are generated without any user correction. Better performance is possible after including user correction.
Figure 8. (a) Our input; the low-resolution depth maps are shown in the lower left corner (the ratio between the two images is preserved).
(b) Our results. User scribble areas (blue) and the additional depth samples (red) are highlighted. (c) Novel view rendering of our result.
Note that no user markup is required for our results in the third row. More results can be found in the supplemental materials.
[12] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele.
Joint bilateral upsampling. ACM Trans. Graphics, 26(3):96,
2007.
[13] A. Levin, D. Lischinski, and Y. Weiss. Colorization using
optimization. ACM Trans. Graphics, 23(3):689–694, 2004.
[14] D. Min, J. Lu, and M. Do. Depth video enhancement based
on weighted mode filtering. IEEE TIP, to appear.
[15] D. Scharstein and R. Szeliski. A taxonomy and evaluation of
dense two-frame stereo correspondence algorithms. IJCV,
47(1/2/3):7–42, 2002.
[16] A. Torralba and W. T. Freeman. Properties and applications
of shape recipes. In CVPR, pages 383–390, 2003.
[17] A. Vedaldi and B. Fulkerson. VLFeat: An open
and portable library of computer vision algorithms.
http://www.vlfeat.org/, 2008.
[18] J. Wang, M. Agrawala, and M. Cohen. Soft scissors: An
interactive tool for realtime high quality matting. SIG-
GRAPH’07.
[19] Q. Yang, R. Yang, J. Davis, and D. Nistér. Spatial-depth
super resolution for range images. In CVPR, 2007.
[20] Z. Zhang. A flexible new technique for camera calibration.
IEEE Trans. PAMI, 22(11):1330–1334, 2000.
[21] J. Zhu, L. Wang, R. Yang, and J. Davis. Fusion of time-
of-flight depth and stereo for high accuracy depth maps. In
CVPR, 2008.
[22] A. Zomet and S. Peleg. Multi-sensor super-resolution. In
WACV, pages 27–31, 2002.