Fig 2 - uploaded by Katarzyna Małek
Schematic representation of the SVM algorithm classification process. We take as input the preselected training sample consisting of (in the case of this work) three distinct classes of objects. The SVM is taught how to distinguish one class from the others based on the discriminating properties chosen as feature vectors. Then, the classifier is trained by tuning the free parameters (C and γ). If the result reaches a high enough accuracy rate (the number of objects from the training sample that are correctly recognised by the classifier) without overfitting (the resulting hyperplane does not confine the sources of a specific type too tightly), it will be used to classify the unknown objects (test sample). If the accuracy is not satisfactory, a different parameter space (or training sample, if possible) is chosen to tune C and γ. After a number of iterations, which allow the classifier to reach a high enough efficiency level, a real sample can be classified using the discriminant hyperplanes.
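
The tuning loop described in the caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code (they used LIBSVM through the R e1071 interface). It assumes that γ is the width parameter of a Gaussian (RBF) kernel, which is LIBSVM's default kernel, and it searches an exponentially spaced grid of the kind suggested in the LIBSVM practical guide (Hsu et al. 2010). The names `rbf_kernel`, `log_grid`, `grid_search`, and `toy_evaluate` are hypothetical; in real use, the evaluation function would train an SVM for one (C, γ) pair and return its accuracy on held-out data.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def log_grid(lo_exp, hi_exp, step=2, base=2.0):
    """Exponentially spaced values base**lo_exp, ..., base**hi_exp."""
    return [base ** e for e in range(lo_exp, hi_exp + 1, step)]


def grid_search(evaluate, c_values, gamma_values):
    """Try every (C, gamma) pair and return (accuracy, C, gamma) of the best."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        acc = evaluate(c, gamma)
        if best is None or acc > best[0]:
            best = (acc, c, gamma)
    return best


def toy_evaluate(c, gamma):
    """Hypothetical stand-in for 'train an SVM, return validation accuracy'.

    It simply peaks at C = 2**3 and gamma = 2**-5 so the search has
    something to find; it does not train any classifier.
    """
    return 1.0 - 0.01 * abs(math.log2(c) - 3) - 0.01 * abs(math.log2(gamma) + 5)


c_values = log_grid(-5, 15)      # 2^-5, 2^-3, ..., 2^15
gamma_values = log_grid(-15, 3)  # 2^-15, 2^-13, ..., 2^3
best_acc, best_c, best_gamma = grid_search(toy_evaluate, c_values, gamma_values)
print(best_acc, best_c, best_gamma)
```

A larger γ makes `rbf_kernel` fall off faster with distance, so the decision boundary can wrap more tightly around individual training sources; this is the overfitting risk the caption warns about, and it is why accuracy should be judged on data not used for training.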


Source publication
Article
The aim of this work is to develop a comprehensive method for classifying sources in large sky surveys and we apply the techniques to the VIMOS Public Extragalactic Redshift Survey (VIPERS). Using the optical (u*, g', r', i') and NIR data (z', Ks), we develop a classifier, based on broad-band photometry, for identifying stars, AGNs and galaxies imp...

Context in source publication

Context 1
... schematic representation of the SVM algorithm classification process, beginning with choosing the training sample, tuning the C and γ parameters, self-checking of the classifier, and finally classifying the real sample, is shown in Fig. 2. For our analysis we used LIBSVM (Chang & Lin 2011), an integrated software for support vector classification, which allows for multiclass classification. We used R, a free software environment for statistical computing and graphics, with the e1071 interface package (Meyer 2001) installed. The successful application of an SVM algorithm requires a carefully selected training sample: a set of objects with confirmed classes which will serve as a template for distinguishing the sources whose class we want to determine. Since this work is focused on the selection of galaxies, AGNs, and stars, we select as a training sample a set of sources whose basic class (galaxy, AGN or star) was established with the highest reliability thanks to their high quality spectra (their redshift being measured with the highest confidence flag within the VIPERS or VVDS surveys). For these sources, the accurate photometric information from the CFHTLS wide survey and the WIRCam follow-up observations of the VIPERS/VVDS fields provided the colour information needed to create the discriminant vectors for training our SVM algorithm. We produced a model (the optimised C and γ parameters based on the training data), which predicts the target values of the test data given only the test data attributes (Hsu et al. 2010). As a galaxy training sample we used the sources with the best redshift measurements in both the W1 and W4 VIPERS fields (VIPERS Zflag = 4, corresponding to the highest confidence level of redshift measurements and thus of spectroscopic classification as a galaxy). It is useful to remember that VIPERS is preselected not only in magnitude (i < 22.5) but also in colours: (r − i) > 0.5 × (u* − g) or (r − i) > 0.7.
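
The VIPERS preselection quoted at the end of this excerpt is a simple boolean cut, which the following minimal sketch writes out. The function name `passes_vipers_preselection` and the example magnitudes are hypothetical; only the thresholds (i < 22.5, and the two colour conditions joined by "or") come from the text.

```python
def passes_vipers_preselection(u, g, r, i):
    """Return True if a source passes the magnitude and colour cuts.

    u, g, r, i are apparent magnitudes in the u*, g', r', i' bands.
    Cuts, as quoted in the text:
        i < 22.5  and  ((r - i) > 0.5 * (u - g)  or  (r - i) > 0.7)
    """
    bright_enough = i < 22.5
    colour_ok = (r - i) > 0.5 * (u - g) or (r - i) > 0.7
    return bright_enough and colour_ok


# Hypothetical example sources (magnitudes invented for illustration only).
print(passes_vipers_preselection(u=24.0, g=23.0, r=22.0, i=21.0))  # r - i = 1.0
print(passes_vipers_preselection(u=24.0, g=22.0, r=21.3, i=21.0))  # r - i = 0.3
print(passes_vipers_preselection(u=24.0, g=23.0, r=24.0, i=23.0))  # too faint
```

Because the training sample inherits this preselection, a classifier trained on it only sees sources from the selected region of colour space.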
We have divided the galaxy training set into i-based apparent magnitude-binned samples and trained the classifier on each subset. As a galaxy training sample we used 16 271 galaxies: 1884, 5483, 6778, and 3226 for the i apparent magnitude bins 19 to 20, 20 to 21, 21 to 22, and 22 to 22.5 (i < 22.5), respectively. Based on our initial tests, we decided to divide our galaxy sample into magnitude bins to separate more efficiently the different groups of galaxies seen in different i apparent magnitude ranges and so improve their classification. Figure 3 shows that galaxies in different magnitude bins occupy different areas of the colour–colour plots, partly because of their different redshift ranges and different morphologies. Given the small number of AGNs detected in the VIPERS fields with VIPERS Zflag = 14, we increased the AGN sample by using all AGNs which had at least a 99% confidence level of spectroscopic classification (VIPERS Zflag 13 and 14, in total 398 objects). AGN spectra are quite easy to recognise, so a lower flag on the quality of the measured redshift does not compromise the reliability of the classification as an AGN. There are two ways that an AGN can be observed in ...
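
The magnitude binning of the galaxy training set can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the bin edges (19, 20, 21, 22, 22.5) come from the text, but the treatment of sources that fall exactly on an edge is not recoverable from it, so half-open bins [lower, upper) are assumed here. The names `BIN_EDGES`, `magnitude_bin`, and `split_by_magnitude` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

# i-band bin edges from the text; half-open bins [lo, hi) are an assumption.
BIN_EDGES = [19.0, 20.0, 21.0, 22.0, 22.5]


def magnitude_bin(i_mag):
    """Return the bin index (0..3) for an i magnitude, or None if outside."""
    if not (BIN_EDGES[0] <= i_mag < BIN_EDGES[-1]):
        return None
    return bisect_right(BIN_EDGES, i_mag) - 1


def split_by_magnitude(sources):
    """Group (source_id, i_mag) pairs by bin so each bin can be trained separately."""
    bins = defaultdict(list)
    for source_id, i_mag in sources:
        index = magnitude_bin(i_mag)
        if index is not None:
            bins[index].append(source_id)
    return dict(bins)


# Hypothetical sources, invented for illustration only.
sample = [("a", 19.4), ("b", 20.0), ("c", 21.7), ("d", 22.3), ("e", 22.9)]
print(split_by_magnitude(sample))
```

One classifier would then be trained per bin, which matches the text's motivation that galaxies in different magnitude ranges occupy different regions of the colour–colour plots.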

Similar publications

Preprint
Deep generative models including generative adversarial networks (GANs) are powerful unsupervised tools in learning the distributions of data sets. Building a simple GAN architecture in PyTorch and training on the CANDELS data set, we generate galaxy images with the Hubble Space Telescope resolution starting from a noise vector. We proceed by modif...
Article
We report the PACS 100 μm/160 μm detections of a sample of 42 GALEX-selected and FIR-detected Lyman break galaxies (LBGs) at z ~ 1 located in the COSMOS field and analyze their ultra-violet (UV) to far-infrared (FIR) properties. The detection of these LBGs in the FIR indicates that they have a dust content high enough so that its emission can be dire...
Article
We describe the construction and general features of VIPERS, the VIMOS Public Extragalactic Redshift Survey. This 'Large Programme' has been using the ESO VLT with the aim of building a spectroscopic sample of ~100,000 galaxies with i_AB < 22.5 and 0.5 < z < 1.5. The survey covers a total area of ~24 deg² within the CFHTLS-Wide W1 and W4 fields. VIPER...
Article
We combine Hubble Space Telescope (HST) G102 and G141 near-IR (NIR) grism spectroscopy with HST/WFC3-UVIS, HST/WFC3-IR, and Spitzer/IRAC [3.6 μm] photometry to assemble a sample of massive (log(M_star/M_☉) ~ 11.0) and quenched (specific star formation rate < 0.01 Gyr⁻¹) galaxies at z ~ 1.5. Our sample of 41 galaxies is the largest with G102+G141 NI...

Citations

... The basis of this approach are SVMs, a supervised machine-learning technique that iteratively finds a hyperplane in the feature space that optimally discriminates two classes (Cortes & Vapnik 1995). SVMs have been applied successfully to various astrophysical tasks (Wadadekar 2005; Huertas-Company et al. 2008; Małek et al. 2013; Marton et al. 2016). If the data are not linearly separable (like in our case), the SVM attempts to find an optimum by minimizing the number of misclassifications and their distance to the decision boundary. ...
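
The trade-off described in this excerpt is commonly written as the soft-margin objective of Cortes & Vapnik (1995): minimise ½‖w‖² + C Σ max(0, 1 − yᵢ(w·xᵢ + b)), where the second term penalises each training point by how far it falls on the wrong side of its margin. The sketch below only evaluates that objective for a given linear decision function; it is not code from any of the cited papers, and the names `dot` and `soft_margin_objective` and the toy data are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def soft_margin_objective(w, b, c, points, labels):
    """Evaluate 0.5 * ||w||^2 + C * sum of hinge losses.

    labels must be +1 or -1. Each hinge loss max(0, 1 - y * (w.x + b)) is
    zero for points outside the margin on the correct side and grows
    linearly with the distance by which a point violates the margin.
    """
    margin_term = 0.5 * dot(w, w)
    slack = sum(
        max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(points, labels)
    )
    return margin_term + c * slack


# Toy data, invented for illustration: the third point lies on the wrong side.
points = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
labels = [1, -1, -1]
print(soft_margin_objective(w=[1.0, 0.0], b=0.0, c=2.0, points=points, labels=labels))
```

A larger C makes margin violations more expensive, so the optimiser accepts a narrower margin in exchange for fewer misclassified training points.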
Article
In unveiling the nature of the first stars, the main astronomical clue is the elemental compositions of the second generation of stars, observed as extremely metal-poor (EMP) stars, in the Milky Way. However, no observational constraint was available on their multiplicity, which is crucial for understanding early phases of galaxy formation. We develop a new data-driven method to classify observed EMP stars into mono- or multi-enriched stars with support vector machines. We also use our own nucleosynthesis yields of core-collapse supernovae with mixing fallback that can explain many of the observed EMP stars. Our method predicts, for the first time, that 31.8% ± 2.3% of 462 analyzed EMP stars are classified as mono-enriched. This means that the majority of EMP stars are likely multi-enriched, suggesting that the first stars were born in small clusters. Lower-metallicity stars are more likely to be enriched by a single supernova, most of which have high carbon enhancement. We also find that Fe, Mg, Ca, and C are the most informative elements for this classification. In addition, oxygen is very informative despite its low observability. Our data-driven method sheds a new light on solving the mystery of the first stars from the complex data set of Galactic archeology surveys.
... The basis of this approach are SVMs, a supervised machine learning technique that iteratively finds a hyperplane in the feature space that optimally discriminates two classes (Cortes & Vapnik 1995). SVMs have been applied successfully to various astrophysical tasks (Wadadekar 2005; Huertas-Company et al. 2008; Małek et al. 2013; Marton et al. 2016). If the data is not linearly separable (like in our case), the SVM attempts to find an optimum by minimizing the number of misclassifications and their distance to the decision boundary. ...
Preprint
In unveiling the nature of the first stars, the main astronomical clue is the elemental compositions of the second generation of stars, observed as extremely metal-poor (EMP) stars, in our Milky Way Galaxy. However, no observational constraint was available on their multiplicity, which is crucial for understanding early phases of galaxy formation. We develop a new data-driven method to classify observed EMP stars into mono- or multi-enriched stars with Support Vector Machines. We also use our own nucleosynthesis yields of core-collapse supernovae with mixing-fallback that can explain many of the observed EMP stars. Our method predicts, for the first time, that 31.8% ± 2.3% of 462 analyzed EMP stars are classified as mono-enriched. This means that the majority of EMP stars are likely multi-enriched, suggesting that the first stars were born in small clusters. Lower metallicity stars are more likely to be enriched by a single supernova, most of which have high carbon enhancement. We also find that Fe, Mg, Ca, and C are the most informative elements for this classification. In addition, oxygen is very informative despite its low observability. Our data-driven method sheds a new light on solving the mystery of the first stars from the complex data set of Galactic archaeology surveys.
... Solarz et al. (2012) used the infrared information to separate galaxies from stars and the accuracy reached 90% for galaxies and 98% for stars. Małek et al. (2013) trained an SVM classifier to classify stars, active galactic nuclei (AGNs) and galaxies using spectroscopically confirmed sources from the VIPERS and VVDS surveys. In the stellar spectral classification, A stars and G stars can be identified easily, while it was hard to identify O, B and K stars. ...
Preprint
Classification is valuable and necessary in spectral analysis, especially for data-driven mining. Along with the rapid development of spectral surveys, a variety of classification techniques have been successfully applied to astronomical data processing. However, it is difficult to select an appropriate classification method in practical scenarios due to the different algorithmic ideas and data characteristics. Here, we present the second work in the data mining series: a review of spectral classification techniques. This work also consists of three parts: a systematic overview of current literature, experimental analyses of commonly used classification algorithms, and the source codes used in this paper. Firstly, we carefully investigate the current classification methods in astronomical literature and organize these methods into ten types based on their algorithmic ideas. For each type of algorithm, the analysis is organized from the following three perspectives: (1) their current applications and usage frequencies in spectral classification are summarized; (2) their basic ideas are introduced and preliminarily analysed; (3) the advantages and caveats of each type of algorithm are discussed. Secondly, the classification performance of different algorithms on the unified data sets is analysed. Experimental data are selected from the LAMOST survey and SDSS survey. Six groups of spectral data sets are designed from data characteristics, data qualities and data volumes to examine the performance of these algorithms. Then the scores of nine basic algorithms are shown and discussed in the experimental analysis. Finally, source codes for the nine basic algorithms, written in Python, and manuals for usage and improvement are provided.
... Stars are usually selected in two ways in extragalactic surveys: by selecting point sources and by applying empirical color-color cuts (e.g., Daddi et al. 2004; Barro et al. 2009; Henrion et al. 2011; Małek et al. 2013). The former only works for bright sources because the morphological information is limited for faint sources. ...
Article
W-CDF-S, ELAIS-S1, and XMM-LSS will be three Deep-Drilling Fields (DDFs) of the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST), but their extensive multiwavelength data have not been fully utilized as done in the COSMOS field, another LSST DDF. To prepare for future science, we fit source spectral energy distributions (SEDs) from X-ray to far-infrared in these three fields mainly to derive galaxy stellar masses and star formation rates. We use CIGALE v2022.0, a code that has been regularly developed and evaluated, for the SED fitting. Our catalog includes 0.8 million sources covering 4.9 deg² in W-CDF-S, 0.8 million sources covering 3.4 deg² in ELAIS-S1, and 1.2 million sources covering 4.9 deg² in XMM-LSS. Besides fitting normal galaxies, we also select candidates that may host active galactic nuclei (AGNs) or are experiencing recent star formation variations and use models specifically designed for these sources to fit their SEDs; this increases the utility of our catalog for various projects in the future. We calibrate our measurements by comparison with those in well-studied smaller regions and briefly discuss the implications of our results. We also perform detailed tests of the completeness and purity of SED-selected AGNs. Our data can be retrieved from a public website.
... Stars are usually selected in two ways in extragalactic surveys: by selecting point sources and by applying empirical color-color cuts (e.g., Daddi et al. 2004; Barro et al. 2009; Henrion et al. 2011; Małek et al. 2013). The former only works for bright sources because morphological information is limited for faint sources. ...
Preprint
W-CDF-S, ELAIS-S1, and XMM-LSS will be three Deep-Drilling Fields (DDFs) of the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST), but their extensive multi-wavelength data have not been fully utilized as done in the COSMOS field, another LSST DDF. To prepare for future science, we fit source spectral energy distributions (SEDs) from X-ray to far-infrared in these three fields mainly to derive galaxy stellar masses and star-formation rates. We use CIGALE v2022.0, a code that has been regularly developed and evaluated, for the SED fitting. Our catalog includes 0.8 million sources covering 4.9 deg² in W-CDF-S, 0.8 million sources covering 3.4 deg² in ELAIS-S1, and 1.2 million sources covering 4.9 deg² in XMM-LSS. Besides fitting normal galaxies, we also select candidates that may host active galactic nuclei (AGNs) or are experiencing recent star-formation variations and use models specifically designed for these sources to fit their SEDs; this increases the utility of our catalog for various projects in the future. We calibrate our measurements by comparison with those in well-studied smaller regions and briefly discuss the implications of our results. We also perform detailed tests of the completeness and purity of SED-selected AGNs. Our data can be retrieved from a public website.
... use of so-called decision trees (Weir, Fayyad & Djorgovski 1995; Suchkov, Hanisch & Margon 2005; Ball et al. 2006; Vasconcellos et al. 2011; Sevilla-Noarbe & Etayo-Sotos 2015) and support vector machines (Fadely, Hogg & Willman 2012; Małek et al. 2013; Solarz et al. 2017). ...
Article
We present the results of a proof-of-concept experiment that demonstrates that deep learning can successfully be used for production-scale classification of compact star clusters detected in Hubble Space Telescope (HST) ultraviolet-optical imaging of nearby spiral galaxies (D ≲ 20 Mpc) in the Physics at High Angular Resolution in Nearby GalaxieS (PHANGS)–HST survey. Given the relatively small nature of existing, human-labelled star cluster samples, we transfer the knowledge of state-of-the-art neural network models for real-object recognition to classify star cluster candidates into four morphological classes. We perform a series of experiments to determine the dependence of classification performance on neural network architecture (ResNet18 and VGG19-BN), training data sets curated by either a single expert or three astronomers, and the size of the images used for training. We find that the overall classification accuracies are not significantly affected by these choices. The networks are used to classify star cluster candidates in the PHANGS–HST galaxy NGC 1559, which was not included in the training samples. The resulting prediction accuracies are 70 per cent, 40 per cent, 40–50 per cent, and 50–70 per cent for class 1, 2, 3 star clusters, and class 4 non-clusters, respectively. This performance is competitive with consistency achieved in previously published human and automated quantitative classification of star cluster candidate samples (70–80 per cent, 40–50 per cent, 40–50 per cent, and 60–70 per cent). The methods introduced herein lay the foundations to automate classification for star clusters at scale, and exhibit the need to prepare a standardized data set of human-labelled star cluster classifications, agreed upon by a full range of experts in the field, to further improve the performance of the networks introduced in this study.
... Some of these machine learning algorithms have been integrated into widely-used methods for image processing, such as the neural networks trained for star/galaxy separation in the automated source detection and photometry software SEXTRACTOR (Bertin & Arnouts 1996). Other applications of machine learning for image classification include the use of so-called decision trees (Weir et al. 1995; Suchkov et al. 2005; Ball et al. 2006; Vasconcellos et al. 2011; Sevilla-Noarbe & Etayo-Sotos 2015) and support vector machines (Fadely et al. 2012; Solarz et al. 2017; Małek et al. 2013). ...
Preprint
We present the results of a proof-of-concept experiment which demonstrates that deep learning can successfully be used for production-scale classification of compact star clusters detected in HST UV-optical imaging of nearby spiral galaxies in the PHANGS-HST survey. Given the relatively small and unbalanced nature of existing, human-labelled star cluster datasets, we transfer the knowledge of neural network models for real-object recognition to classify star cluster candidates into four morphological classes. We show that human classification is at the 66%:37%:40%:61% agreement level for the four classes considered. Our findings indicate that deep learning algorithms achieve 76%:63%:59%:70% for a star cluster sample within 4 Mpc < D < 10 Mpc. We tested the robustness of our deep learning algorithms to generalize to different cluster images using the first data obtained by PHANGS-HST of NGC 1559, which is more distant at D = 19 Mpc, and found that deep learning produces classification accuracies of 73%:42%:52%:67%. We furnish evidence for the robustness of these analyses by using two different neural network models for image classification, trained multiple times from the ground up to assess the variance and stability of our results. We quantified the importance of the NUV, U, B, V and I images for morphological classification with our deep learning models, and find that the V-band is the key contributor as human classifications are based on images taken in that filter. This work lays the foundations to automate classification for these objects at scale, and the creation of a standardized dataset.
... star-galaxy separation on colour-brightness and colour-colour diagrams (e.g. Małek et al. 2013). ...
Article
The second Gaia Data Release (DR2) contains astrometric and photometric data for more than 1.6 billion objects with mean Gaia G magnitude <20.7, including many Young Stellar Objects (YSOs) in different evolutionary stages. In order to explore the YSO population of the Milky Way, we combined the Gaia DR2 data base with Wide-field Infrared Survey Explorer (WISE) and Planck measurements and made an all-sky probabilistic catalogue of YSOs using machine learning techniques, such as Support Vector Machines, Random Forests, or Neural Networks. Our input catalogue contains 103 million objects from the DR2xAllWISE cross-match table. We classified each object into four main classes: YSOs, extragalactic objects, main-sequence stars, and evolved stars. At a 90 per cent probability threshold, we identified 1 129 295 YSO candidates. To demonstrate the quality and potential of our YSO catalogue, here we present two applications of it. (1) We explore the 3D structure of the Orion A star-forming complex and show that the spatial distribution of the YSOs classified by our procedure is in agreement with recent results from the literature. (2) We use our catalogue to classify published Gaia Science Alerts. As Gaia measures the sources at multiple epochs, it can efficiently discover transient events, including sudden brightness changes of YSOs caused by dynamic processes of their circumstellar disc. However, in many cases the physical nature of the published alert sources is not known. A cross-check with our new catalogue shows that about 30 per cent more of the published Gaia alerts can most likely be attributed to YSO activity. The catalogue can also be useful to identify YSOs among future Gaia alerts.
... A basic selection on object size relative to the PSF can be used to separate samples of spatially extended galaxies from point-like stars and quasars. Accurate object classification becomes challenging for ground-based imaging surveys at faint magnitudes, and accordingly, optimal use of morphological, color, and temporal information is an active area of research (e.g., Fadely et al. 2012; Małek et al. 2013; Bertin et al. 2015; Kim et al. 2015; Kim & Brunner 2017). Several object classification schemes have been applied to DES data for a variety of science cases (e.g., Chang et al. 2015; Reed et al. 2015; Soumagnac et al. 2015; Drlica-Wagner et al. 2018; Sevilla-Noarbe et al. 2018). ...
Article
We describe the first public data release of the Dark Energy Survey, DES DR1, consisting of reduced single-epoch images, co-added images, co-added source catalogs, and associated products and services assembled over the first 3 yr of DES science operations. DES DR1 is based on optical/near-infrared imaging from 345 distinct nights (2013 August to 2016 February) by the Dark Energy Camera mounted on the 4 m Blanco telescope at the Cerro Tololo Inter-American Observatory in Chile. We release data from the DES wide-area survey covering ~5000 deg² of the southern Galactic cap in five broad photometric bands, grizY. DES DR1 has a median delivered point-spread function of g = 1.12, r = 0.96, i = 0.88, z = 0.84, and Y = 0.90 arcsec FWHM, a photometric precision of <1% in all bands, and an astrometric precision of 151 mas. The median co-added catalog depth for a 1.95 arcsec diameter aperture at signal-to-noise ratio (S/N) = 10 is g = 24.33, r = 24.08, i = 23.44, z = 22.69, and Y = 21.44 mag. DES DR1 includes nearly 400 million distinct astronomical objects detected in ~10,000 co-add tiles of size 0.534 deg² produced from ~39,000 individual exposures. Benchmark galaxy and stellar samples contain ~310 million and ~80 million objects, respectively, following a basic object quality selection. These data are accessible through a range of interfaces, including query web clients, image cutout servers, Jupyter notebooks, and an interactive co-add image visualization tool. DES DR1 constitutes the largest photometric data set to date at the achieved depth and photometric precision.
... In addition, many large, multiband imaging surveys use morphology, such as for Sloan Digital Sky Survey (SDSS; Stoughton et al. 2002) and/or have incorporated colour information into their classifiers (see Ball et al. 2006 for SDSS as well, Hildebrandt et al. 2012 for CFHTLS or Saglia et al. 2012 for Pan-STARRS). Adopting a Bayesian approach to incorporate fits to stellar and galaxy templates has been shown to be a promising avenue (Fadely, Hogg & Willman 2012) as well as the use of infrared data to complement the optical band observations (Małek et al. 2013; Banerji et al. 2015; Kovács & Szapudi 2015). ...
Article
We perform a comparison of different approaches to star–galaxy classification using the broad-band photometric data from Year 1 of the Dark Energy Survey. This is done by performing a wide range of tests with and without external 'truth' information, which can be ported to other similar data sets. We make a broad evaluation of the performance of the classifiers in two science cases with DES data that are most affected by this systematic effect: large-scale structure and Milky Way studies. In general, even though the default morphological classifiers used for DES Y1 cosmology studies are sufficient to maintain a low level of systematic contamination from stellar misclassification, contamination can be reduced to the O(1 per cent) level by using multi-epoch and infrared information from external data sets. For Milky Way studies, the stellar sample can be augmented by ~20 per cent for a given flux limit. Reference catalogues used in this work are available at http://des.ncsa.illinois.edu/releases/y1a1.