Figure 2
As the noise is increased (a), our proposed penalty function (SOSlasso) allows us to recover the true coefficients more accurately than the group lasso (Glasso). Also, when α is large, the active groups are not sparse, and the standard overlapping group lasso outperforms the other methods. However, as α decreases, the method we propose outperforms the group lasso (b). (c) shows a toy sparsity pattern, with different colors denoting different overlapping groups.

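For reference, the penalty family at issue can be written down explicitly. The sketch below follows the verbal description given in a citing text further down (a weighted sum of a LASSO term and a root-of-sum-of-squares grouping term per set); the weights $\lambda_1, \lambda_2$ and the handling of overlaps are illustrative rather than the paper's exact formulation:

$$\Omega(w) \;=\; \sum_{g \in \mathcal{G}} \Big( \lambda_1 \, \lVert w_g \rVert_1 \;+\; \lambda_2 \, \lVert w_g \rVert_2 \Big),$$

where $\mathcal{G}$ is a collection of (possibly overlapping) groups. Dropping the $\ell_1$ term recovers the overlapping group lasso, which matches the behavior in panel (b): when the active groups are dense (large α), the extra $\ell_1$ term buys little, while for sparse active groups (small α) it lets SOSlasso outperform the group lasso.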

Source publication
Article
Full-text available
Multitask learning can be effective when features useful in one task are also useful for other tasks, and the group lasso is a standard method for selecting a common subset of features. In this paper, we are interested in a less restrictive form of multitask learning, wherein (1) the available features can be organized into subsets according to a n...

Similar publications

Article
Full-text available
The logistic loss function is often advocated in machine learning and statistics as a smooth and strictly convex surrogate for the 0-1 loss. In this paper we investigate the question of whether these smoothness and convexity properties make the logistic loss preferable to other widely considered options such as the hinge loss. We show that in contr...
Article
Full-text available
This paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing...
Article
Full-text available
Consider convex optimization problems subject to a large number of constraints. We focus on stochastic problems in which the objective takes the form of expected values and the feasible set is the intersection of a large number of convex sets. We propose a class of algorithms that perform both stochastic gradient descent and random feasibility upda...
Article
Full-text available
The problem of maximizing precision at the top of a ranked list, often dubbed Precision@k (prec@k), finds relevance in myriad learning applications such as ranking, multi-label classification, and learning with severe label imbalance. However, despite its popularity, there exist significant gaps in our understanding of this problem and its associat...
Article
Full-text available
In the era of deep learning, several unsupervised models have been developed to capture the key features in unlabeled handwritten data. Popular among them is the Restricted Boltzmann Machine (RBM). However, due to the novelty in handwritten multidialect data, the RBM may fail to generate an efficient representation. In this paper we propose a generat...

Citations

... The study conducted by Chatterjee et al. [2012] explored the use of sparse group Lasso as a specialized method for investigating regularization with a tree hierarchy. Additionally, several studies, including Rao et al. [2013], have successfully applied sparse group Lasso in the context of multitask learning. Moreover, Ahsen and Vidyasagar [2017] presented a comprehensive framework for analyzing error bounds of various techniques, encompassing Group Lasso, sparse group Lasso, and Group Lasso with tree overlap. ...
Preprint
This paper introduces a rigorous approach to establish the sharp minimax optimalities of both LASSO and SLOPE within the framework of double sparse structures, notably without relying on RIP-type conditions. Crucially, our findings illuminate that the achievement of these optimalities is fundamentally anchored in a sparse group normalization condition, complemented by several novel sparse group restricted eigenvalue (RE)-type conditions introduced in this study. We further provide a comprehensive comparative analysis of these eigenvalue conditions. Furthermore, we demonstrate that these conditions hold with high probability across a wide range of random matrices. Our exploration extends to the random design setting, where we prove the random design properties and optimal sample complexity under both weak-moment and sub-Gaussian distributions.
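For readers unfamiliar with SLOPE, its penalty is conventionally defined through a non-increasing weight sequence applied to the sorted absolute coefficients (this standard definition is added here for context; it is not taken from the preprint):

$$\mathrm{SLOPE}_{\lambda}(\beta) \;=\; \sum_{i=1}^{p} \lambda_i \, |\beta|_{(i)}, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0,$$

where $|\beta|_{(1)} \ge \dots \ge |\beta|_{(p)}$ are the entries of $\beta$ sorted by magnitude; taking all $\lambda_i$ equal recovers the LASSO.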
... Conversely, there is interest in sparsity penalties that promote intra-group sparsity, that is, ensuring groups within a vector are themselves sparse. Such penalties have been used in areas such as deep learning [4], computer vision [5], [6], and medicine [7], [8]. Previous work in this area includes the convex elitist [9], [10] or exclusive LASSO (E-LASSO) [11], and the nonconvex sparsity within and across groups (SWAG) [12] penalty, plus extensions that allow for overlapping groups of components [13]. ...
... It demonstrates that, for fixed λ, the solution to (8) can be discontinuous in y. Nevertheless, the solution remains stable in the sense that ∥x̂∥ ≤ ∥y∥. ...
Article
Full-text available
Penalty functions or regularization terms that promote structured solutions to optimization problems are of great interest in many fields. We introduce MEGS, a nonconvex structured sparsity penalty that promotes mutual exclusivity between components in solutions to optimization problems. This enforces, or promotes, 1-sparsity within arbitrary overlapping groups in a vector. The mutual exclusivity structure is represented by a matrix S. We discuss the design of S from engineering principles and show example use cases including the modeling of occlusions in 3D imaging and a total variation variant with uses in image restoration. We also demonstrate synergy between MEGS and other regularizers and propose an algorithm to efficiently solve problems regularized or constrained by MEGS.
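The abstract does not give the functional form of the MEGS penalty, but a toy reading of "mutual exclusivity encoded by a matrix S" can be sketched as follows; the quadratic form below is our hypothetical illustration, not the paper's definition:

```python
import numpy as np

def exclusivity_penalty(x: np.ndarray, S: np.ndarray) -> float:
    """Toy mutual-exclusivity penalty: sums |x_i| * S_ij * |x_j| over pairs.

    If S_ij > 0, components i and j are discouraged from being
    simultaneously nonzero; the penalty vanishes when, within every
    group encoded by S, at most one component is active (1-sparsity).
    This is an illustrative guess at the structure MEGS promotes,
    not the paper's actual definition.
    """
    a = np.abs(x)
    return float(a @ S @ a)

# Example: S couples components {0, 1} and {1, 2} (overlapping groups).
S = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(exclusivity_penalty(np.array([1.0, 0.0, 1.0]), S))  # 0.0: exclusive
print(exclusivity_penalty(np.array([1.0, 1.0, 0.0]), S))  # 2.0: penalized
```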
... In Sect. 5, we apply the novel algorithms to two important real problems, the group sparse linear regression problem and fuzzy C-means problem, which have wide applications (Boutsidis et al. 2014;Rademacher and Deshpande 2010;Woodruff et al. 2011;Dunn 1973;Dhillon et al. 2016;Nowak et al. 2013;Zhang et al. 2010;Shashua and Zass 2005). ...
Article
Full-text available
In this paper, we concentrate on exploring fast algorithms for the minimization of a non-increasing supermodular or non-supermodular function f subject to a cardinality constraint. For the non-supermodular minimization problem with weak supermodularity ratio r, we obtain a (1+ϵ)-approximation algorithm with adaptivity $O\big(\tfrac{n}{\epsilon}\log\tfrac{rn\cdot f(\emptyset)}{\epsilon\cdot OPT}\big)$ under the bi-criteria strategy, where OPT denotes the optimal objective value of the problem. That is, instead of selecting at most k elements to satisfy the constraint, the cardinality of the output may reach $\tfrac{k}{r}\log\tfrac{f(\emptyset)}{\epsilon\cdot OPT}$. Moreover, for the supermodular minimization problem, we propose two (1+ϵ)-approximation algorithms whose output solution X is of size $|X_0| + O\big(k\log\tfrac{f(X_0)}{\epsilon\cdot OPT}\big)$. The adaptivities of these two algorithms are $O\big(\log^2 n\cdot\log\tfrac{f(X_0)}{\epsilon\cdot OPT}\big)$ and $O\big(\log n\cdot\log\tfrac{f(X_0)}{\epsilon\cdot OPT}\big)$, where $X_0$ is an input set and OPT is the optimal value. Applications to group sparse linear regression problems and fuzzy C-means problems are studied at the end.
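To make the bi-criteria idea concrete, here is a minimal greedy sketch in Python; it is our illustration under simplifying assumptions (a known target value opt, fully sequential rather than low-adaptivity selection), not the paper's algorithm:

```python
def bicriteria_greedy(f, ground_set, k, opt, eps):
    """Greedy sketch for minimizing a non-increasing set function f.

    k is the nominal cardinality budget; under the bi-criteria
    relaxation the returned set may be larger than k. `opt` (the
    optimal value) is assumed known here purely for illustration;
    the paper's algorithms do not require it in this form and run
    with low adaptivity rather than fully sequentially.
    """
    ground = list(ground_set)
    X = set()
    while f(X) > (1 + eps) * opt and len(X) < len(ground):
        # Add the element with the largest marginal decrease of f.
        best = min((e for e in ground if e not in X), key=lambda e: f(X | {e}))
        X.add(best)
    return X

# Toy non-increasing objective: number of targets left uncovered.
targets = {1, 2, 3, 4}
covers = {"a": {1, 2}, "b": {2, 3}, "c": {4}}

def f(X):
    covered = set().union(*(covers[e] for e in X)) if X else set()
    return len(targets - covered)

X = bicriteria_greedy(f, covers, k=2, opt=0, eps=0.0)
print(sorted(X), "- size exceeds the nominal k = 2")  # ['a', 'b', 'c']
```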
... Multi-task learning is a learning method that mines information shared among tasks while training multiple related tasks. It can significantly improve the learning performance of an algorithm, and has been applied to many fields such as spam filtering [22], natural image classification [23], and the modeling, classification, and prediction of various diseases [24,25]. In the articles [26] and [27], using mutual inductive bias, multi-task learning obtains bias information to compensate for the lack of samples; it can simultaneously learn each task's unique feature information and the feature information shared by multiple tasks, effectively improving the generalization ability of the model. ...
Article
Full-text available
Background: Machine learning techniques and magnetic resonance imaging methods have been widely used in computer-aided diagnosis and prognosis of severe brain diseases such as schizophrenia, Alzheimer's disease, etc. Methods: In this paper, a regularized multi-task learning method for schizophrenia classification is proposed, and three MRI datasets of schizophrenia, collected from different data centers, are investigated. Firstly, slice extraction is used in image preprocessing. Then texture features of gray-level co-occurrence matrices are extracted from the above processed images. Finally, a p-norm regularized multi-task learning method is proposed to simultaneously learn the site-specific and site-shared features of the multi-site data, which can effectively discriminate schizophrenia patients from normal controls. Results: The classification error rate on 10 datasets can be reduced by 10% to 30%. Conclusions: The proposed method obtains excellent results and provides objective evidence for clinical diagnosis and treatment of schizophrenia.
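The abstract does not spell out the objective, but p-norm regularized multi-task learning is commonly written in roughly the following form; the notation below is ours and is only indicative of the general recipe:

$$\min_{W = [w_1, \dots, w_T]} \; \sum_{t=1}^{T} L\big(X_t w_t, \, y_t\big) \;+\; \lambda \sum_{j=1}^{d} \Big( \sum_{t=1}^{T} |W_{jt}|^{p} \Big)^{1/p},$$

where $t$ indexes sites (tasks), $j$ indexes features, and the mixed norm couples the per-site weight vectors so that features tend to be selected jointly across sites while the individual weights can still differ.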
... There are a variety of alternative ways to estimate the shared sparsity structure in MTL that could lead to different model selections, including one competing approach that uses the $\ell_{2,1}$ penalty to impose group-wise sparsity (Obozinski et al., 2006; Rao et al., 2013). The focus of our work is primarily on conducting valid inference for a class of plausible models, and some models and penalties are easier to work with than others. ...
Preprint
Multi-task learning is frequently used to model a set of related response variables from the same set of features, improving predictive performance and modeling accuracy relative to methods that handle each response variable separately. Despite the potential of multi-task learning to yield more powerful inference than single-task alternatives, prior work in this area has largely omitted uncertainty quantification. Our focus in this paper is a common multi-task problem in neuroimaging, where the goal is to understand the relationship between multiple cognitive task scores (or other subject-level assessments) and brain connectome data collected from imaging. We propose a framework for selective inference to address this problem, with the flexibility to: (i) jointly identify the relevant covariates for each task through a sparsity-inducing penalty, and (ii) conduct valid inference in a model based on the estimated sparsity structure. Our framework offers a new conditional procedure for inference, based on a refinement of the selection event that yields a tractable selection-adjusted likelihood. This gives an approximate system of estimating equations for maximum likelihood inference, solvable via a single convex optimization problem, and enables us to efficiently form confidence intervals with approximately the correct coverage. Applied to both simulated data and data from the Adolescent Brain Cognitive Development (ABCD) Study, our selective inference methods yield tighter confidence intervals than commonly used alternatives, such as data splitting. We also demonstrate through simulations that multi-task learning with selective inference can more accurately recover true signals than single-task methods.
... Finally, an exciting possibility for future work is to combine the brain kernel with other advanced statistical modeling techniques. Methods for modeling structured variability, such as topographic ICA (Manning et al., 2014), and methods for structured sparsity, such as GraphNet (Grosenick et al., 2013), sparse overlapping sets lasso (Rao et al., 2013), and dependent relevance determination (Wu et al., 2014, 2019), rely on capturing dependencies between nearby voxels. All such methods might thus be improved by using nearness in the functional embedding space provided by the brain kernel, instead of nearness in 3D Euclidean space inside the brain. ...
Article
Full-text available
A key problem in functional magnetic resonance imaging (fMRI) is to estimate spatial activity patterns from noisy high-dimensional signals. Spatial smoothing provides one approach to regularizing such estimates. However, standard smoothing methods ignore the fact that correlations in neural activity may fall off at different rates in different brain areas, or exhibit discontinuities across anatomical or functional boundaries. Moreover, such methods do not exploit the fact that widely separated brain regions may exhibit strong correlations due to bilateral symmetry or the network organization of brain regions. To capture this non-stationary spatial correlation structure, we introduce the brain kernel, a continuous covariance function for whole-brain activity patterns. We define the brain kernel in terms of a continuous nonlinear mapping from 3D brain coordinates to a latent embedding space, parametrized with a Gaussian process (GP). The brain kernel specifies the prior covariance between voxels as a function of the distance between their locations in embedding space. The GP mapping warps the brain nonlinearly so that highly correlated voxels are close together in latent space, and uncorrelated voxels are far apart. We estimate the brain kernel using resting-state fMRI data, and we develop an exact, scalable inference method based on block coordinate descent to overcome the challenges of high dimensionality (10-100K voxels). Finally, we illustrate the brain kernel's usefulness with applications to brain decoding and factor analysis with multiple task-based fMRI datasets.
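As a concrete reading of the construction, the following toy sketch (our illustration; the embedding function and hyperparameters are placeholders for the paper's GP-parametrized mapping) builds a covariance between voxels from distances in a learned embedding space rather than in 3D:

```python
import numpy as np

def brain_kernel(coords: np.ndarray, embed, lengthscale: float = 1.0) -> np.ndarray:
    """Toy brain-kernel construction.

    coords: (n, 3) voxel coordinates in the brain.
    embed:  a nonlinear map from 3D coordinates to a latent embedding
            space (the paper parametrizes this map with a Gaussian
            process; here it is any callable, as a stand-in).
    Returns an (n, n) covariance in which highly correlated voxels are
    those whose *embeddings* are close, not necessarily their 3D
    positions.
    """
    z = embed(coords)                                  # (n, m) latent locations
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)          # RBF in embedding space

# Stand-in embedding: a fixed random nonlinear feature map (illustrative).
rng = np.random.default_rng(0)
Wm = rng.normal(size=(3, 5))
K = brain_kernel(rng.normal(size=(10, 3)), embed=lambda c: np.tanh(c @ Wm))
print(K.shape)  # (10, 10), symmetric, unit diagonal
```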
... We may think of using statistical regression methods that, unlike the parallelogram-inspired methods, do not refer to some hidden function f. We thus tried lasso regression [52,46] with various parameters, but the results were very poor: the best accuracy we obtained was 42.8%, without normalisation and with an alpha value of 0.001. ...
Article
Analogical proportions are statements of the form 'a is to b as c is to d', formally denoted a:b::c:d. They are the basis of analogical reasoning which is often considered as an essential ingredient of human intelligence. For this reason, recognizing analogies in natural language has long been a research focus within the Natural Language Processing (NLP) community. With the emergence of word embedding models, a lot of progress has been made in NLP, essentially assuming that a word analogy like man:king::woman:queen is an instance of a parallelogram within the underlying vector space. In this paper, we depart from this assumption to adopt a machine learning approach, i.e., learning a substitute of the parallelogram model. To achieve our goal, we first review the formal modeling of analogical proportions, highlighting the properties which are useful from a machine learning perspective. For instance, the postulates supposed to govern such proportions entail that when a:b::c:d holds, seven permutations of a,b,c,d still constitute valid analogies. From a machine learning perspective, this provides guidelines to build training sets of positive and negative examples. Taking into account these properties for augmenting the set of positive and negative examples, we first implement word analogy classifiers using various machine learning techniques, then we approximate by regression an analogy completion function, i.e., a way to compute the missing word when we have the three other ones. Using a GloVe embedding, classifiers show very high accuracy when recognizing analogies, improving the state of the art on word analogy classification. Also, the regression processes usually lead to much more successful analogy completion than those derived from the parallelogram assumption.
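The data-augmentation step described above can be made concrete. Assuming the two standard postulates (symmetry of the "as" relation and central permutation), the closure of a single positive example a:b::c:d yields eight equivalent forms (the original plus the seven permutations the abstract mentions); a small sketch:

```python
def analogy_closure(a, b, c, d):
    """All forms equivalent to a:b::c:d under the standard postulates.

    Generators: symmetry            (a:b::c:d -> c:d::a:b)
                central permutation (a:b::c:d -> a:c::b:d).
    Their closure contains 8 quadruples: the original plus 7 more,
    usable as extra positive training examples.
    """
    seen, frontier = set(), {(a, b, c, d)}
    while frontier:
        q = frontier.pop()
        if q in seen:
            continue
        seen.add(q)
        w, x, y, z = q
        frontier.update({(y, z, w, x), (w, y, x, z)})  # symmetry, central perm.
    return seen

forms = analogy_closure("man", "king", "woman", "queen")
print(len(forms))  # 8
```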
... According to [25], [27], the sparse group LASSO model in (18) can be solved by a sparse group iteration algorithm. In this paper, we choose q = 1 as an example to verify the effectiveness of the proposed multichannel SAR imaging method for nonuniform under-sampling. ...
Article
Full-text available
The azimuth multichannel synthetic aperture radar (SAR) technology is capable of overcoming the minimum antenna area constraint and achieving high-resolution and wide-swath (HRWS) imaging. Generally speaking, the pulse repetition frequency (PRF) of spaceborne multichannel SAR systems should satisfy the azimuthal uniform sampling condition, but this is sometimes impossible due to the limitations of radar system timing, often summarized in a “coverage diagram”. For the Gaofen-3 system, the PRF of each channel at some beam positions is slightly less than that required for uniform sampling in the dual-channel mode, leading to nonuniform under-sampling and hence azimuth ambiguities in the recovered images. Although the ambiguous energy in Gaofen-3 images is not high in general, it is still noticeable amid the weak clutter surrounding strong targets. In this paper, a novel multichannel SAR imaging method for nonuniform under-sampling based on $L_{2,q}$ regularization ($0 < q \leq 1$) is proposed. By analyzing the causes of azimuth ambiguities in multichannel SAR systems, the imaging model is established, emphasizing the difference from conventional single-channel SAR. Then, we combine the multichannel SAR data processing operators with the group sparsity property to construct the novel imaging method. The group sparsity property is modeled by the $\ell_{2,q}$-norm, and the $L_{2,q}$ regularization problem can be solved via a sparse group thresholding function. It is shown that the proposed method can efficiently suppress the azimuth ambiguities caused by nonuniform under-sampling. Simulations and Gaofen-3 real data experiments are exploited to verify the effectiveness of the proposed method.
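For intuition about the "sparse group thresholding function" mentioned in the abstract, here is the standard proximal operator of the $\ell_{2,1}$ mixed norm (the q = 1 case), written as an illustrative sketch; the paper's operator for general $0 < q \leq 1$ differs:

```python
import numpy as np

def group_soft_threshold(x: np.ndarray, groups, lam: float) -> np.ndarray:
    """Proximal operator of lam * sum_g ||x_g||_2 (the l_{2,1} norm).

    Each group's coefficients shrink toward zero together; a whole
    group is zeroed out when its l2 norm falls below lam. This is the
    q = 1 instance of a sparse group thresholding rule.
    """
    out = np.zeros_like(x)
    for g in groups:                       # g: index list for one group
        norm = np.linalg.norm(x[g])
        if norm > lam:
            out[g] = (1.0 - lam / norm) * x[g]
    return out

x = np.array([3.0, 4.0, 0.1, -0.1])
print(group_soft_threshold(x, groups=[[0, 1], [2, 3]], lam=1.0))
# -> first group shrunk to [2.4, 3.2]; second group (norm ~0.14) zeroed.
```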
... In prior work, we defined the sparse-overlapping-sets (SOS) LASSO to meet these criteria (Rao et al., 2013, 2016). Voxels from all subjects are projected into a common reference space without interpolation or averaging. ...
... The cost for each set is the proportional weighted sum of the LASSO sparsity penalty and a grouping penalty formulated as the root of the sum of squared coefficients. Because this root is taken over units within a set, the penalty is smaller when nonzero coefficients occupy the same set than when they occupy different sets (Rao et al., 2013). Thus, SOS LASSO (https://zenodo.org/record/3609239) ...
... Our formulation allows different coefficients within a set (so that solutions can vary across individual participants) and does not require a prespecified sparsity pattern. These relationships, together with mathematical analysis of the regularizer, are explained further in Rao et al. (2013, 2016). ...
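Putting the description in the preceding snippets into code, a minimal sketch of the per-set cost looks as follows; the set indices, the weights, and the treatment of overlapping sets are illustrative (the released SOS LASSO code linked above is the authoritative implementation):

```python
import numpy as np

def sos_penalty(w: np.ndarray, sets, lam1: float, lam2: float) -> float:
    """Sketch of the SOS LASSO cost described above.

    For each (possibly overlapping) set of units, the cost is a
    weighted sum of a LASSO term (sum of absolute coefficients) and a
    grouping term (root of the sum of squared coefficients). Nonzero
    coefficients concentrated in few sets incur a smaller total
    grouping cost than the same coefficients spread over many sets.
    """
    total = 0.0
    for s in sets:                        # s: index list for one set
        total += lam1 * np.abs(w[s]).sum() + lam2 * np.linalg.norm(w[s])
    return total

sets = [[0, 1, 2], [3, 4, 5]]
w = np.zeros(6); w[[0, 1]] = 1.0          # two nonzeros in the same set
w2 = np.zeros(6); w2[[0, 3]] = 1.0        # same nonzeros, different sets
print(sos_penalty(w, sets, 1.0, 1.0),     # 2 + sqrt(2): cheaper
      sos_penalty(w2, sets, 1.0, 1.0))    # 2 + 2: penalized more
```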
Article
Full-text available
The human cortex encodes information in complex networks that can be anatomically dispersed and variable in their microstructure across individuals. Using simulations with neural network models, we show that contemporary statistical methods for functional brain imaging, including univariate contrast, searchlight multivariate pattern classification, and whole-brain decoding with L1 or L2 regularization, each have critical and complementary blind spots under these conditions. We then introduce the sparse-overlapping-sets (SOS) LASSO: a whole-brain multivariate approach that exploits structured sparsity to find network-distributed information, and show in simulation that it captures the advantages of other approaches while avoiding their limitations. When applied to fMRI data to find neural responses that discriminate visually presented faces from other visual stimuli, each method yields a different result, but existing approaches all support the canonical view that face perception engages localized areas in posterior occipital and temporal regions. In contrast, SOS LASSO uncovers a network spanning all four lobes of the brain. The result cannot reflect spurious selection of out-of-system areas, because decoding accuracy remains exceedingly high even when canonical face and place systems are removed from the dataset. When used to discriminate visual scenes from other stimuli, the same approach reveals a localized signal consistent with other methods, illustrating that SOS LASSO can detect both widely distributed and localized representational structure. Thus, structured sparsity can provide an unbiased method for testing claims of functional localization. For faces and possibly other domains, such decoding may reveal representations more widely distributed than previously suspected.

SIGNIFICANCE STATEMENT: Brain systems represent information as patterns of activation over neural populations connected in networks that can be widely distributed anatomically, variable across individuals, and intermingled with other networks. We show that four widespread statistical approaches to functional brain imaging have critical blind spots in this scenario and use simulations with neural network models to illustrate why. We then introduce a new approach designed specifically to find radically distributed representations in neural networks. In simulation and in fMRI data collected in the well-studied domain of face perception, the new approach discovers extensive signal missed by the other methods, suggesting that prior functional imaging work may have significantly underestimated the degree to which neuro-cognitive representations are distributed and variable across individuals.