Optimal Ridge Detection using Coverage Risk

Abstract and Figures

We introduce the concept of coverage risk as an error measure for density ridge estimation. The coverage risk generalizes the mean integrated square error to set estimation. We propose two risk estimators for the coverage risk and we show that we can select tuning parameters by minimizing the estimated risk. We study the rate of convergence for coverage risk and prove consistency of the risk estimators. We apply our method to three simulated datasets and to cosmology data. In all the examples, the proposed method successfully recovers the underlying density structure.
... This paper handles the same problem but explicitly estimates the manifold by the density ridge and generates new data by bootstrapping, which avoids the computational cost of MCMC sampling. Our method makes heavy use of ridge estimation; see related development in [25,17,11,9,10,16]. ...
... We use an oversmoothing parameter α, usually between 2 and 4, and good estimates can often be obtained across a wide range of α values. The authors of [9] gave a method of selecting h that minimizes coverage risk estimates. ...
... The search space of bandwidth is set as as minimum following Genovese et al. (2016). The maximum bandwidth value is set as Silverman's rule-of-thumb (Silverman, 1986) since this bandwidth selection is usually considered oversmoothing (Hall et al., 1991), and this idea was previously also used for ridge detection analysis (Chen et al., 2015). Removing low density data points (outliers) to infer the persistent homology features is recommended (Chazal et al., 2018), so we set the threshold to eliminate data points that is where is a kernel density function with the bandwidth parameter and is kernel density estimate using all data points. ...
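As a rough illustration of the bandwidth range described in the excerpt above, the sketch below computes a Silverman-type rule-of-thumb bandwidth and uses it as the upper end of a candidate grid. The univariate form of the rule, the linear spacing, and the names `silverman_bandwidth` and `bandwidth_grid` are assumptions made for this example. The lower bound is left as a caller-supplied parameter because the excerpt does not preserve its value.

```python
import math
import statistics


def silverman_bandwidth(sample):
    """Univariate rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5).

    Assumption: the excerpt does not say which variant of the rule is used;
    this is the common one-dimensional form.
    """
    n = len(sample)
    sd = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)  # quartile cut points
    spread = min(sd, (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-0.2)


def bandwidth_grid(h_min, h_max, size=10):
    """Evenly spaced candidate bandwidths from h_min to h_max (hypothetical helper)."""
    step = (h_max - h_min) / (size - 1)
    return [h_min + k * step for k in range(size)]


# Toy usage: h_min here is an arbitrary illustrative value, not the paper's choice.
sample = [math.sin(i) + 0.1 * i for i in range(200)]
h_max = silverman_bandwidth(sample)
candidates = bandwidth_grid(h_min=0.1 * h_max, h_max=h_max)
```

A tuning criterion such as an estimated coverage risk would then be evaluated at each candidate and the minimizer kept.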
... We use an oversmoothing parameter α, usually between 2 and 4, and good estimates can often be obtained across a wide range of α values. [8] gave a method to select h that minimizes coverage risk estimates. ...
Preprint
Probabilistic models of data sets often exhibit salient geometric structure. Such a phenomenon is summed up in the manifold distribution hypothesis, and can be exploited in probabilistic learning. Here we present normal-bundle bootstrap (NBB), a method that generates new data which preserve the geometric structure of a given data set. Inspired by algorithms for manifold learning and concepts in differential geometry, our method decomposes the underlying probability measure into a marginalized measure on a learned data manifold and conditional measures on the normal spaces. The algorithm estimates the data manifold as a density ridge, and constructs new data by bootstrapping projection vectors and adding them to the ridge. We apply our method to the inference of density ridge and related statistics, and data augmentation to reduce overfitting.
... More recent contributions regarding computational issues are due to Ozertem and Erdogmus (2011) and Sasaki et al. (2014, 2017). Asymptotic results for ridge detection using kernel estimators are derived in Chen et al. (2014, 2015b). show that kernel estimation of density ridges leads to consistent estimators under mild regularity assumptions and establish bounds for the Hausdorff distance between the true and estimated ridge. ...
Article
Both music and language are found in all known human societies, yet no studies have compared similarities and differences between song, speech, and instrumental music on a global scale. In this Registered Report, we analyzed two global datasets: (i) 300 annotated audio recordings representing matched sets of traditional songs, recited lyrics, conversational speech, and instrumental melodies from our 75 coauthors speaking 55 languages; and (ii) 418 previously published adult-directed song and speech recordings from 209 individuals speaking 16 languages. Of our six preregistered predictions, five were strongly supported: Relative to speech, songs use (i) higher pitch, (ii) slower temporal rate, and (iii) more stable pitches, while both songs and speech used similar (iv) pitch interval size and (v) timbral brightness. Exploratory analyses suggest that features vary along a “musi-linguistic” continuum when including instrumental melodies and recited lyrics. Our study provides strong empirical evidence of cross-cultural regularities in music and speech.
Article
This paper studies the linear convergence of the subspace constrained mean shift (SCMS) algorithm, a well-known algorithm for identifying a density ridge defined by a kernel density estimator. By arguing that the SCMS algorithm is a special variant of a subspace constrained gradient ascent (SCGA) algorithm with an adaptive step size, we derive the linear convergence of such SCGA algorithm. While the existing research focuses mainly on density ridges in the Euclidean space, we generalize density ridges and the SCMS algorithm to directional data. In particular, we establish the stability theorem of density ridges with directional data and prove the linear convergence of our proposed directional SCMS algorithm.
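To make the SCMS idea concrete, here is a minimal two-dimensional sketch under stated assumptions: a Gaussian kernel, a one-dimensional ridge, and the Hessian of the density estimate itself (some formulations use the log-density instead). Each step computes the ordinary mean-shift vector and keeps only its component along the Hessian eigenvector with the smallest eigenvalue. The function names and the toy data are illustrative and are not taken from the paper.

```python
import math


def scms_step(x, y, data, h):
    """One subspace constrained mean shift update of (x, y) in 2D.

    Assumes the point is close enough to the data that the kernel
    weights do not all underflow to zero.
    """
    sw = mx = my = 0.0     # sum of kernel weights and weighted coordinate sums
    hxx = hxy = hyy = 0.0  # KDE Hessian, up to a positive constant factor
    for px, py in data:
        dx, dy = x - px, y - py
        w = math.exp(-(dx * dx + dy * dy) / (2.0 * h * h))
        sw += w
        mx += w * px
        my += w * py
        hxx += w * (dx * dx / (h * h) - 1.0)
        hxy += w * (dx * dy / (h * h))
        hyy += w * (dy * dy / (h * h) - 1.0)
    # Ordinary mean-shift vector: weighted mean of the data minus the point.
    sx, sy = mx / sw - x, my / sw - y
    # theta is the direction of the largest Hessian eigenvalue; the
    # perpendicular direction belongs to the smallest eigenvalue.
    theta = 0.5 * math.atan2(2.0 * hxy, hxx - hyy)
    vx, vy = -math.sin(theta), math.cos(theta)
    # Keep only the component of the mean-shift vector along (vx, vy).
    proj = vx * sx + vy * sy
    return x + vx * proj, y + vy * proj


def scms(start, data, h, tol=1e-9, max_iter=500):
    """Iterate scms_step until the update is smaller than tol."""
    x, y = start
    for _ in range(max_iter):
        nx, ny = scms_step(x, y, data, h)
        if math.hypot(nx - x, ny - y) < tol:
            return nx, ny
        x, y = nx, ny
    return x, y


# Toy data: three parallel rows of points, symmetric about the line y = 0.
data = [(i / 20.0, dy) for i in range(21) for dy in (-0.1, 0.0, 0.1)]
ridge_point = scms((0.5, 0.08), data, h=0.2)
```

In this toy configuration the starting point slides across the band toward its center line while its position along the band is essentially unchanged, which is the behavior that distinguishes SCMS from plain mean shift.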
Article
We consider nonparametric estimation of the ridge of a probability density function for multivariate linear processes with long-range dependence. We derive functional limit theorems for estimated eigenvectors and eigenvalues of the Hessian matrix. We use these results to obtain the weak convergence for the estimated ridge and asymptotic simultaneous confidence regions.
Article
Evolution equations comprise a broad framework for describing the dynamics of a system in a general state space: when the state space is finite-dimensional, they give rise to systems of ordinary differential equations; for infinite-dimensional state spaces, they give rise to partial differential equations. Several modern statistical and machine learning methods concern the estimation of objects that can be formalized as solutions to evolution equations, in some appropriate state space, even if not stated as such. The corresponding equations, however, are seldom known exactly, and are empirically derived from data, often by means of nonparametric estimation. This induces uncertainties on the equations and their solutions that are challenging to quantify, and moreover the diversity and the specifics of each particular setting may obscure the path for a general approach. In this paper, we address the problem of constructing general yet tractable methods for quantifying such uncertainties, by means of asymptotic theory combined with bootstrap methodology. We demonstrate these procedures in important examples including gradient line estimation, diffusion tensor imaging tractography, and local principal component analysis. The bootstrap perspective is particularly appealing as it circumvents the need to simulate from stochastic (partial) differential equations that depend on (infinite-dimensional) unknowns. We assess the performance of the bootstrap procedure via simulations and find that it demonstrates good finite-sample coverage. © 2018, Institute of Mathematical Statistics. All rights reserved.
Article
In the context of estimating local modes of a conditional density based on kernel density estimators, we show that existing bandwidth selection methods developed for kernel density estimation are unsuitable for mode estimation. We propose two methods to select bandwidths tailored for mode estimation in the regression setting. Numerical studies using synthetic data and a real-life data set are carried out to demonstrate the performance of the proposed methods in comparison with several well received bandwidth selection methods for density estimation.
Article
The detection and characterization of filamentary structures in the cosmic web allows cosmologists to constrain parameters that dictate the evolution of the Universe. While many filament estimators have been proposed, they generally lack estimates of uncertainty, reducing their inferential power. In this paper, we demonstrate how one may apply the subspace constrained mean shift (SCMS) algorithm (Ozertem & Erdogmus 2011; Genovese et al. 2014) to uncover filamentary structure in galaxy data. The SCMS algorithm is a gradient ascent method that models filaments as density ridges, one-dimensional smooth curves that trace high-density regions within the point cloud. We also demonstrate how augmenting the SCMS algorithm with bootstrap-based methods of uncertainty estimation allows one to place uncertainty bands around putative filaments. We apply the SCMS first to the data set generated from the Voronoi model. The density ridges show strong agreement with the filaments from the Voronoi method. We then apply the SCMS method to data sets sampled from a P3M N-body simulation, with galaxy number densities consistent with SDSS and WFIRST-AFTA, and to LOWZ and CMASS data from the Baryon Oscillation Spectroscopic Survey (BOSS). To further assess the efficacy of SCMS, we compare the relative locations of BOSS filaments with galaxy clusters in the redMaPPer catalogue, and find that redMaPPer clusters are significantly closer (with p-values $<10^{-9}$) to SCMS-detected filaments than to randomly selected galaxies.
Article
Modal regression estimates the local modes of the distribution of Y given X = x, instead of the mean, as in the usual regression sense, and can hence reveal important structure missed by usual regression methods. We study a simple nonparametric method for modal regression, based on a kernel density estimate (KDE) of the joint distribution of Y and X. We derive asymptotic error bounds for this method, and propose techniques for constructing confidence sets and prediction sets. The latter is used to select the smoothing bandwidth of the underlying KDE. The idea behind modal regression is connected to many others, such as mixture regression and density ridge estimation, and we discuss these ties as well.
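The KDE-based approach to modal regression described above can be sketched with an update often called partial mean shift: the covariate value x is held fixed and only y is moved toward a local mode of the joint density estimate. The sketch below assumes a Gaussian kernel and a single bandwidth shared by X and Y. These choices, the function name `partial_mean_shift`, and the toy data are assumptions for illustration rather than the authors' exact procedure.

```python
import math


def partial_mean_shift(x, y0, xs, ys, h, tol=1e-10, max_iter=500):
    """Move y toward a local mode of the joint KDE while keeping x fixed."""
    y = y0
    for _ in range(max_iter):
        num = den = 0.0
        for xi, yi in zip(xs, ys):
            # Gaussian product kernel in (x, y) with a common bandwidth h.
            w = math.exp(-((x - xi) ** 2 + (y - yi) ** 2) / (2.0 * h * h))
            num += w * yi
            den += w
        y_new = num / den  # kernel-weighted mean of the responses
        if abs(y_new - y) < tol:
            return y_new
        y = y_new
    return y


# Toy data with two branches: for every X the response is either +1 or -1.
xs = [i / 20.0 for i in range(21)] * 2
ys = [1.0] * 21 + [-1.0] * 21

upper = partial_mean_shift(0.5, 0.8, xs, ys, h=0.2)   # started near the upper branch
lower = partial_mean_shift(0.5, -0.7, xs, ys, h=0.2)  # started near the lower branch
mean_response = sum(ys) / len(ys)                     # what a mean-based fit targets
```

On this toy data the two starting values settle on different branches, while the overall mean of the responses sits between them where there are no observations. This is the kind of structure the abstract says mean regression can miss.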
Article
The large sample theory of estimators for density modes is well-understood. In this paper we consider density ridges, which are a higher-dimensional extension of modes. Modes correspond to zero-dimensional, local high-density regions in point clouds. Density ridges correspond to $s$-dimensional, local high-density regions in point clouds. We establish three main results. First we show that, under appropriate regularity conditions, the local variation of the estimated ridge can be approximated by an empirical process. Second, we show that the distribution of the estimated ridge converges to a Gaussian process. Third, we establish that the bootstrap leads to valid confidence sets for density ridges.
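For orientation, one common way to formalize an $s$-dimensional density ridge is sketched below. The notation is an assumption of this sketch and may differ from the paper's: $p$ is a density on $\mathbb{R}^d$ with gradient $g(x)$, and its Hessian has eigenvalues $\lambda_1(x) \ge \dots \ge \lambda_d(x)$ with eigenvectors $v_1(x), \dots, v_d(x)$.

$$
V(x) = [\, v_{s+1}(x), \dots, v_d(x) \,], \qquad
R_s = \{\, x : V(x) V(x)^{T} g(x) = 0, \ \lambda_{s+1}(x) < 0 \,\}
$$

With $s = 0$ the condition reduces to $g(x) = 0$ with all eigenvalues negative, which is the usual characterization of a local mode and matches the description of modes as zero-dimensional ridges.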
Article
The generalized density is a product of a density function and a weight function. For example, the average local brightness of an astronomical image is the probability of finding a galaxy times the mean brightness of the galaxy. We propose a method for studying the geometric structure of generalized densities. In particular, we show how to find the modes and ridges of a generalized density function using a modification of the mean shift algorithm and its variant, subspace constrained mean shift. Our method can be used to perform clustering and to calculate a measure of connectivity between clusters. We establish consistency and rates of convergence for our estimator and apply the methods to data from two astronomical problems.
Article
In this study, we compare two vectorial tracing methods for 3D color images: (i) a conventional piecewise linear generalized cylinder algorithm that uses color and edge information and (ii) a principal curve tracing algorithm that uses the gradient and Hessian of a given density estimate. We tested the algorithms on synthetic and Brainbow datasets to show the effectiveness of the proposed algorithms. Results indicate that the proposed methods can successfully trace multiple axons in dense neighborhoods.
Article
Accurate road centerline extraction plays an important role in practical remote sensing applications. Most existing centerline extraction methods have many limitations when the classified image contains complicated objects such as curvilinear, close, or short extent features. To cope with these limitations, this study presents a novel accurate centerline extraction method that integrates tensor voting, principal curves, and the geodesic method. The proposed method consists of three main steps. Tensor voting is first used to extract feature points from the classified image. The extracted feature points are then projected onto the principal curves. Finally, the feature points are linked by the geodesic method to create the central line. The experimental results demonstrate that the proposed method, which is automatic, provides a comparatively accurate solution for centerline extraction from a classified image.