FIG. 6
A k-dimensional tree. The data are represented as a tree of nodes; each node has two daughter nodes, obtained by splitting the data in two along the axis with the largest extent. On the left are the nodes of the third level of the tree (the top level has one node; the second level has two). The right-hand plot shows level 5 of the tree. The individual data points in this two-dimensional space are shown as dots, while the bounding boxes of the nodes are shown as lines. The cached statistics, mean and covariance, are plotted as a large dot and an ellipse. Images courtesy of Robert Nichol.

Source publication
Article
Full-text available
The data complexity and volume of astronomical findings have increased in recent decades due to major technological improvements in instrumentation and data collection methods. The contemporary astronomer is flooded with terabytes of raw data that produce enormous multidimensional catalogs of objects (stars, galaxies, quasars, etc.) numbering in th...

Context in source publication

Context 1
... strategy is to use fast multiresolutional k-dimensional (mrKD) tree codes. Figure 6 shows an example of an mrKD tree, an optimal indexing scheme that uses the emerging technology of cached statistics in computer science to store sufficient statistics for the EM calculation at each node of the tree. For various counting queries, one does not need to traverse the whole tree, but can simply use these stored statistics to rapidly return the necessary count. ...
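
The idea can be illustrated with a short sketch (a minimal Python illustration of cached sufficient statistics, with hypothetical class and parameter names, not the original mrKD-tree code): each node stores the count, mean, and covariance of the points below it, and a range-count query is answered from the cached count whenever a node's bounding box lies entirely inside or outside the query region.

import numpy as np

class KDNode:
    """A k-d tree node caching sufficient statistics (count, mean, covariance)."""
    def __init__(self, points, leaf_size=16):
        self.lo = points.min(axis=0)            # bounding box of this node
        self.hi = points.max(axis=0)
        self.n = len(points)                    # cached count
        self.mean = points.mean(axis=0)         # cached mean
        d = points.shape[1]
        self.cov = np.cov(points, rowvar=False) if self.n > 1 else np.zeros((d, d))
        self.left = self.right = None
        if self.n > leaf_size:
            axis = np.argmax(self.hi - self.lo)         # split along the widest dimension
            order = np.argsort(points[:, axis])
            half = self.n // 2
            self.left = KDNode(points[order[:half]], leaf_size)
            self.right = KDNode(points[order[half:]], leaf_size)
        else:
            self.points = points                        # leaves keep the raw points

    def count_in_box(self, qlo, qhi):
        """Count points inside the axis-aligned query box [qlo, qhi]."""
        if np.any(self.hi < qlo) or np.any(self.lo > qhi):
            return 0                            # node entirely outside: prune
        if np.all(self.lo >= qlo) and np.all(self.hi <= qhi):
            return self.n                       # node entirely inside: use the cached count
        if self.left is None:                   # leaf: fall back to individual points
            inside = np.all((self.points >= qlo) & (self.points <= qhi), axis=1)
            return int(inside.sum())
        return self.left.count_in_box(qlo, qhi) + self.right.count_in_box(qlo, qhi)

A kernel-sum or EM E-step can prune in the same way, substituting the cached mean and covariance for the individual points whenever an entire node can be treated as a single block.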

Citations

... High-dimensional data sets have posed both statistical and computational challenges in recent decades (Babu and Djorgovski, 2004;Fan et al., 2014;Wainwright, 2014). In the "large n, small m" regime, where n refers to the problem dimension and m refers to the sample size, it is well known that obtaining consistent estimators is impossible unless the model is endowed with some additional structures. ...
Preprint
In this paper, we analyse the recovery properties of nonconvex regularized $M$-estimators, under the assumption that the true parameter is of soft sparsity. In the statistical aspect, we establish the recovery bound for any stationary point of the nonconvex regularized $M$-estimator, under restricted strong convexity and some regularity conditions on the loss function and the regularizer, respectively. In the algorithmic aspect, we slightly decompose the objective function and then solve the nonconvex optimization problem via the proximal gradient method, which is proved to achieve a linear convergence rate. In particular, we note that for commonly-used regularizers such as SCAD and MCP, a simpler decomposition is applicable thanks to our assumption on the regularizer, which helps to construct the estimator with better recovery performance. Finally, we demonstrate our theoretical consequences and the advantage of the assumption by several numerical experiments on the corrupted errors-in-variables linear regression model. Simulation results show remarkable consistency with our theory under high-dimensional scaling.
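
As a rough illustration of the algorithmic part, the sketch below shows one way a proximal gradient iteration with the MCP regularizer can look (a minimal sketch under assumptions not stated in the abstract: a plain least-squares loss, a fixed step size, and hypothetical function names; it is not the authors' code).

import numpy as np

def mcp_prox(z, lam, gamma, step):
    """Proximal operator of the MCP penalty (firm thresholding); assumes gamma > step."""
    absz = np.abs(z)
    out = np.where(absz <= step * lam, 0.0,
                   np.sign(z) * (absz - step * lam) / (1.0 - step / gamma))
    return np.where(absz > gamma * lam, z, out)

def proximal_gradient(X, y, lam=0.1, gamma=3.0, step=None, iters=500):
    """Proximal gradient for (1/2m)||y - Xb||^2 + MCP_{lam,gamma}(b)."""
    m, n = X.shape
    if step is None:
        step = m / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    b = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ b - y) / m           # gradient of the smooth loss
        b = mcp_prox(b - step * grad, lam, gamma, step)
    return b
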
... Most of the modern survey data sets are so information-rich that a wide variety of different scientific studies can be done with the same data. Therein lies their scientific potential (Djorgovski et al. 1997b, 2001abc, 2002, Babu & Djorgovski 2004, and many others). However, this requires some powerful, general tools for the exploration, visualization, and analysis of large survey data sets. ...
Article
Full-text available
Sky surveys represent a fundamental data basis for astronomy. We use them to map in a systematic way the universe and its constituents, and to discover new types of objects or phenomena. We review the subject, with an emphasis on the wide-field imaging surveys, placing them in a broader scientific and historical context. Surveys are the largest data generators in astronomy, propelled by the advances in information and computation technology, and have transformed the ways in which astronomy is done. We describe the variety and the general properties of surveys, the ways in which they may be quantified and compared, and offer some figures of merit that can be used to compare their scientific discovery potential. Surveys enable a very wide range of science; that is perhaps their key unifying characteristic. As new domains of the observable parameter space open up thanks to the advances in technology, surveys are often the initial step in their exploration. Science can be done with the survey data alone or a combination of different surveys, or with a targeted follow-up of potentially interesting selected sources. Surveys can be used to generate large, statistical samples of objects that can be studied as populations, or as tracers of larger structures. They can be also used to discover or generate samples of rare or unusual objects, and may lead to discoveries of some previously unknown types. We discuss a general framework of parameter spaces that can be used for an assessment and comparison of different surveys, and the strategies for their scientific exploration. As we move into the Petascale regime, an effective processing and scientific exploitation of such large data sets and data streams poses many challenges, some of which may be addressed in the framework of Virtual Observatory and Astroinformatics, with a broader application of data mining and knowledge discovery technologies.
... Considering the relatively short history of statistics, the interaction between statistics and astronomy has a longer history than one may imagine. Important statistical concepts and methods such as least squares were in fact developed by astronomers (Babu and Djorgovski, 2004). However, from the mid-19th century the relationship weakened, as astronomers focused more on astrophysics while statisticians turned to applications in agriculture and the biological sciences. ...
Article
Full-text available
Clusters of galaxies are a useful proxy to trace the distribution of mass in the universe. By measuring the mass of clusters of galaxies on different scales, one can follow the evolution of the mass distribution (Martínez and Saar, Statistics of the Galaxy Distribution, 2002). It can be shown that finding galaxy clusters is equivalent to finding density contour clusters (Hartigan, Clustering Algorithms, 1975): connected components of the level set S_c ≡ {f > c}, where f is a probability density function. Cuevas et al. (Can. J. Stat. 28, 367–382, 2000; Comput. Stat. Data Anal. 36, 441–459, 2001) proposed a nonparametric method for density contour clusters, attempting to find density contour clusters by the minimal spanning tree. While their algorithm is conceptually simple, it requires intensive computations for large datasets. We propose a more efficient clustering method based on their algorithm with the Fast Fourier Transform (FFT). The method is applied to a study of galaxy clustering on large astronomical sky survey data.
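
A rough sketch of FFT-accelerated level-set clustering in this spirit (an illustrative reading with hypothetical function names and parameters, not the authors' code): estimate the density on a grid by FFT-convolving a binned histogram with a Gaussian kernel, threshold at level c, and label the connected components.

import numpy as np
from scipy.signal import fftconvolve
from scipy.ndimage import label

def level_set_clusters(points, c, grid_size=256, bandwidth=0.05):
    """Connected components of the level set {f > c} of a grid-based 2-D KDE."""
    # Bin the points onto a regular grid.
    hist, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=grid_size)
    dx, dy = xedges[1] - xedges[0], yedges[1] - yedges[0]
    # Gaussian kernel sampled on the same grid spacing, truncated at 3 bandwidths.
    r = int(np.ceil(3 * bandwidth / min(dx, dy)))
    gx, gy = np.meshgrid(np.arange(-r, r + 1) * dx, np.arange(-r, r + 1) * dy, indexing="ij")
    kernel = np.exp(-(gx**2 + gy**2) / (2 * bandwidth**2))
    kernel /= kernel.sum() * dx * dy            # kernel now integrates to ~1 on the grid
    # Density estimate via FFT convolution -- the step that makes large samples tractable.
    density = fftconvolve(hist, kernel, mode="same") / len(points)
    # Density contour clusters: connected components of {f > c}.
    labels_grid, n_clusters = label(density > c)
    return density, labels_grid, n_clusters

Here label() groups adjacent grid cells above the threshold, so each connected component corresponds to one density contour cluster.
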
... To give some specific examples of challenges ahead, let us consider the general area of exploration of observable parameter spaces, which would be a typical VO activity in exploring the massive sky surveys and their federation, and clustering analysis in particular [12][13][14][15][16]. Generally, original image data are processed and catalogs of detected sources are derived, and many parameters (attributes) measured for each source. ...
Conference Paper
All sciences, including astronomy, are now entering the era of information abundance. The exponentially increasing volume and complexity of modern data sets promises to transform the scientific practice, but also poses a number of common technological challenges. The Virtual Observatory concept is the astronomical community's response to these challenges: it aims to harness the progress in information technology in the service of astronomy, and at the same time provide a valuable testbed for information technology and applied computer science. Challenges broadly fall into two categories: data handling (or "data farming"), including issues such as archives, intelligent storage, databases, interoperability, fast networks, etc., and data mining, data understanding, and knowledge discovery, which include issues such as automated clustering and classification, multivariate correlation searches, pattern recognition, visualization in highly hyperdimensional parameter spaces, etc., as well as various applications of machine learning in these contexts. Such techniques are forming a methodological foundation for science with massive and complex data sets in general, and are likely to have a much broader impact on the modern society, commerce, information economy, security, etc. There is a powerful emerging synergy between the computationally enabled science and the science-driven computing, which will drive the progress in science, scholarship, and many other venues in the 21st century.
... To give some specific examples of challenges ahead, let us consider the general area of exploration of observable parameter spaces, which would be a typical VO activity in exploring the massive sky surveys and their federation, and clustering analysis in particular [12][13][14][15][16]. Generally, original image data are processed and catalogs of detected sources are derived, and many parameters (attributes) measured for each source. ...
Article
All sciences, including astronomy, are now entering the era of information abundance. The exponentially increasing volume and complexity of modern data sets promises to transform the scientific practice, but also poses a number of common technological challenges. The Virtual Observatory concept is the astronomical community's response to these challenges: it aims to harness the progress in information technology in the service of astronomy, and at the same time provide a valuable testbed for information technology and applied computer science. Challenges broadly fall into two categories: data handling (or "data farming"), including issues such as archives, intelligent storage, databases, interoperability, fast networks, etc., and data mining, data understanding, and knowledge discovery, which include issues such as automated clustering and classification, multivariate correlation searches, pattern recognition, visualization in highly hyperdimensional parameter spaces, etc., as well as various applications of machine learning in these contexts. Such techniques are forming a methodological foundation for science with massive and complex data sets in general, and are likely to have a much broader impact on the modern society, commerce, information economy, security, etc. There is a powerful emerging synergy between the computationally enabled science and the science-driven computing, which will drive the progress in science, scholarship, and many other venues in the 21st century.
... Considering the relatively short history of statistics, the interaction between statistics and astronomy has a longer history than one may imagine. Important statistical concepts and methods such as least squares were in fact developed by astronomers (Babu and Djorgovski, 2004). However, from the mid-19th century the relationship weakened, as astronomers focused more on astrophysics while statisticians turned to applications in agriculture and the biological sciences. ...
Article
Clustering often plays an important role in analyzing data in the social and physical sciences. Cuevas et al. (2000, 2001) proposed a clustering method for density contour clusters defined (Hartigan, 1975) as connected components of the level set S_c ≡ {f > c}. They used unions of balls centered at data points to estimate the connected components of the level set and provided an algorithm to extract the connected components of the estimated level set. While their algorithm is conceptually simple, it requires intensive computations for large datasets. We propose a more efficient clustering method based on their algorithm. Instead of using data points, we use grid points as centers of the balls. As a result, the Fast Fourier Transform (FFT) can be employed to reduce the cost of original computations. The method is applied to a study of galaxy clustering on large astronomical sky survey data.
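
The grid-point variant can be sketched along the same lines (again with hypothetical names; an illustrative reading of the approach, not the authors' code): bin the data, mark occupied grid cells, obtain the union of balls by FFT-convolving the occupancy map with a ball indicator, and label the connected components.

import numpy as np
from scipy.signal import fftconvolve
from scipy.ndimage import label

def union_of_balls_clusters(points, radius, grid_size=256):
    """Connected components of a union of balls, approximated on a grid via FFT."""
    hist, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=grid_size)
    dx, dy = xedges[1] - xedges[0], yedges[1] - yedges[0]
    occupied = hist > 0                                   # grid cells containing data
    # Indicator of a ball of the given radius, sampled on the grid spacing.
    r = int(np.ceil(radius / min(dx, dy)))
    gx, gy = np.meshgrid(np.arange(-r, r + 1) * dx, np.arange(-r, r + 1) * dy, indexing="ij")
    ball = (gx**2 + gy**2 <= radius**2).astype(float)
    # A cell is covered if an occupied cell lies within one radius of it.
    covered = fftconvolve(occupied.astype(float), ball, mode="same") > 0.5
    labels_grid, n_clusters = label(covered)
    return labels_grid, n_clusters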