Sparse Gaussian chain graphs with the spike-and-slab
LASSO: Algorithms and asymptotics
Yunyi Shen, Claudia Solís-Lemus, and Sameer K. Deshpande
July 15, 2022
Abstract
The Gaussian chain graph model simultaneously parametrizes (i) the direct effects of p predictors on q correlated outcomes and (ii) the residual partial covariance between pairs of outcomes. We introduce a new method for fitting sparse Gaussian chain graph models with spike-and-slab LASSO (SSL) priors. We develop an Expectation Conditional Maximization algorithm to obtain sparse estimates of the p × q matrix of direct effects and the q × q residual precision matrix. Our algorithm iteratively solves a sequence of penalized maximum likelihood problems with self-adaptive penalties that gradually filter out negligible regression coefficients and partial covariances. Because it adaptively penalizes model parameters, our method is seen to outperform fixed-penalty competitors on simulated data. We establish the posterior concentration rate for our model, buttressing our method's excellent empirical performance with strong theoretical guarantees. We use our method to reanalyze a dataset from a study of the effects of diet and residence type on the composition of the gut microbiome of elderly adults.
Laboratory for Information & Decision Systems, Massachusetts Institute of Technology. The work was
done while the author was at the University of Wisconsin–Madison.
Wisconsin Institute for Discovery & Dept. of Plant Pathology, University of Wisconsin–Madison. Correspondence to: solislemus@wisc.edu
Dept. of Statistics, University of Wisconsin–Madison. Correspondence to: sameer.deshpande@wisc.edu
arXiv:2207.07020v1 [stat.ME] 14 Jul 2022
1 Introduction
1.1 Motivation
There are between 10 and 100 trillion microorganisms living within each person's lower intestines. These bacteria, fungi, viruses, and other microbes constitute the human gut microbiome (Guinane and Cotter, 2013). Recent research suggests that the composition of the human gut microbiome can have a substantial effect on our health and well-being (Shreiner et al., 2015): microbes living in the gut play an integral role in our digestive and metabolic processes (Larsbrink et al., 2014; Belcheva et al., 2014); they can mediate our immune response to various diseases (Kamada and Núñez, 2014; Kim et al., 2017); and they can even influence disease pathogenesis and progression (Scher et al., 2013; Wang et al., 2011).
Additional emerging evidence suggests that the gut microbiome mediates the effects of
lifestyle factors such as diet and medication use on human health (Singh et al.,2017;Battson
et al.,2018;Hills Jr et al.,2019). That is, such lifestyle factors may first affect the com-
position of the gut microbiome, which in turn influences health outcomes. In fact, lifestyle
factors and medication use can impact the composition of the microbiome in direct and in-
direct ways. For instance, many antibiotics target and kill certain microbial species, thereby
directly affecting the abundances of the targeted species. However, by killing the targeted
species, the antibiotics may reduce the overall competition for nutrients, thereby allowing
non-targeted species to proliferate. In other words, by directly reducing the abundance of
certain targeted microbes, antibiotics may indirectly increase the abundance of other non-
targeted species. Our goal in this paper is to estimate such direct and indirect effects.
1.2 Sparse chain graph models
At a high level, the statistical challenge is to estimate the functional relationship between a vector of predictors x ∈ R^p and a vector of responses y ∈ R^q. In our application, we re-analyze a dataset from Claesson et al. (2012) containing n = 178 predictor-response pairs (x, y), where x contains measures of p = 11 factors related to diet, medication use, and residence type, and y contains the logit-transformed relative abundances of q = 14 different microbial taxa. Our goal is to uncover the direct and indirect effects of these factors on the abundance of each microbial taxon as well as any interactions between microbial taxa. The Gaussian chain graph model (Lauritzen and Wermuth, 1989; Frydenberg, 1990; Lauritzen and Richardson, 2002), which simultaneously parameterizes the direct effects of predictors on responses and the residual dependence structure between responses, is natural for these data. The model asserts that
$$y \mid \Psi, \Omega, x \sim \mathcal{N}\left(\Omega^{-1}\Psi^{\top}x,\ \Omega^{-1}\right), \qquad (1)$$
where Ψ is a p × q matrix and Ω is a symmetric, positive definite q × q matrix. As we detail in Section 2.1, the (j, k) entry of Ψ, ψ_{j,k}, quantifies the direct effect of the jth predictor X_j on the kth response Y_k. The (k, k') entry of Ω, ω_{k,k'}, encodes the residual conditional covariance between outcomes Y_k and Y_{k'} that remains after accounting for the direct effects of the predictors and all of the other response variables.
To fit the model in Equation (1), we must estimate pq + q(q+1)/2 unknown parameters. When the total number of unknown parameters is comparable to or larger than the sample size n, it is common to assume that the matrices Ψ and Ω are sparse. If ω_{k,k'} = 0, we can conclude that, after adjusting for the covariates and all other outcomes, outcomes Y_k and Y_{k'} are conditionally independent. If ψ_{j,k} = 0, we can conclude that X_j does not have a direct effect on the kth outcome variable Y_k. Furthermore, when ψ_{j,k} = 0, any marginal correlation between X_j and Y_k is due solely to X_j's direct effects on other outcomes Y_{k'} that are themselves conditionally correlated with Y_k.
1.3 Our contributions
We introduce the chain graph spike-and-slab LASSO (cgSSL) procedure for fitting the model in Equation (1) in a sparse fashion. At a high level, we place separate spike-and-slab LASSO priors (Ročková and George, 2018) on the entries of Ψ and on the off-diagonal entries of Ω in Equation (1). We derive an efficient Expectation Conditional Maximization algorithm to compute the maximum a posteriori (MAP) estimates of Ψ and Ω. Our algorithm is equivalent to solving a series of maximum likelihood problems with self-adaptive penalties. On synthetic data, we demonstrate that our algorithm displays excellent support recovery and estimation performance. We further establish the posterior contraction rate for each of Ψ, Ω, ΨΩ^{-1}, and XΨΩ^{-1}. Our contraction results imply that our proposed cgSSL procedure consistently estimates these quantities and also provides an upper bound for the minimax optimal rate of estimating these quantities in the Frobenius norm. To the best of our knowledge, ours are the first posterior contraction results for fitting sparse Gaussian chain graph models with element-wise priors on Ψ and Ω.
Here is an outline for the rest of our paper. We review the Gaussian chain graph model and the spike-and-slab LASSO in Section 2. We next introduce the cgSSL procedure in Section 3 and carefully derive our ECM algorithm for finding the MAP in Section 3.2. We present our asymptotic results in Section 4 before demonstrating the excellent finite sample performance of the cgSSL on several synthetic datasets in Section 5. We apply the cgSSL to our motivating gut microbiome data in Section 6. We conclude in Section 7 by outlining several avenues for future development.
2 Background
2.1 The Gaussian chain graph model
Graphical models are a convenient way to represent the dependence structure between several variables. Specifically, we can represent each variable as a node in a graph and we can draw edges to indicate conditional dependence between variables. Absence of an edge between two nodes indicates that the corresponding variables are conditionally independent given all of the other variables. In the context of our gut microbiome data, we can represent each predictor X_j with a node and each outcome Y_k with a node. We are primarily interested in detecting edges between predictors and outcomes and edges between outcomes. Figure 1a is a cartoon illustration of such a graphical model with p = 3 and q = 4. Note that we have not drawn any edges between the predictors, as such edges are not typically of primary interest.
Figure 1: Cartoon illustrations of a general graphical model (a) and a Gaussian chain graph model (b) with p = 3 covariates and q = 4 outcomes. Edges in both graphs encode conditional dependence relationships. The edge labels in (b) correspond to the non-zero parameters in Equation (1).
Without additional modeling assumptions, estimating a discrete graph like that in Figure 1a from n pairs of data (x_1, y_1), ..., (x_n, y_n) is a challenging task. The Gaussian chain graph model in Equation (1) translates the discrete graph estimation problem into a much more tractable continuous parameter estimation problem. Specifically, the model introduces two matrices, Ψ and Ω, and asserts that y | Ψ, Ω, x ∼ N(Ω^{-1}Ψ^⊤x, Ω^{-1}). Under the Gaussian chain graph model, X_j and Y_k are conditionally independent if and only if ψ_{j,k} = 0. Furthermore, Y_k and Y_{k'} are conditionally independent if and only if ω_{k,k'} = 0. In other words, by first estimating Ψ and Ω and then examining their supports, we can recover the underlying graphical model. Figure 1b reproduces the cartoon from Figure 1a with edges labelled by the corresponding non-zero parameters in Equation (1).
In the Gaussian chain graph model, the direct effect of X_j on Y_k is defined as
$$\mathbb{E}[Y_k \mid X_j = x_j + 1, Y_{-k}, X_{-j}, \Psi, \Omega] - \mathbb{E}[Y_k \mid X_j = x_j, Y_{-k}, X_{-j}, \Psi, \Omega] = \psi_{j,k}/\omega_{k,k}.$$
That is, fixing the values of all of the other covariates and all of the other outcomes, an increase of one unit in X_j is associated with a ψ_{j,k}/ω_{k,k} unit increase in the expectation of Y_k. Notice that the direct effect of X_j on Y_k is defined conditionally on the values of all of the other outcomes Y_{k'}. Because of this, the direct effect of X_j on Y_k is typically not equal to its marginal effect, which is defined as
$$\mathbb{E}[Y_k \mid X_j = x_j + 1, X_{-j}, \Psi, \Omega] - \mathbb{E}[Y_k \mid X_j = x_j, X_{-j}, \Psi, \Omega] = \beta_{j,k},$$
where β_{j,k} is the (j, k) entry of the matrix B = ΨΩ^{-1}. Notice that we can re-parametrize the Gaussian chain graph model in Equation (1) in terms of B:
$$y \mid B, \Omega, x \sim \mathcal{N}\left(B^{\top}x,\ \Omega^{-1}\right). \qquad (2)$$
We will refer to this re-parametrized model as the marginal regression model. There is a considerable literature on fitting sparse marginal regression models and we refer the reader to Deshpande et al. (2019) and references therein for a review.
Generally speaking, under (1), the supports of Ψ and B will be different. Specifically, it is possible for X_j to have a marginal effect but no direct effect on Y_k. For instance, in Figure 1, although X_3 does not directly affect Y_2, it may still be marginally correlated with Y_2 thanks to the conditional correlation between Y_2 and Y_3. That is, changing the value of X_3 can change the value of Y_3, which in turn changes the value of Y_2. Consequently, if we fit a sparse marginal regression model, we cannot generally expect to recover sparse estimates of the matrix of direct effects.
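To make this distinction concrete, the following small R sketch (our own illustration; the dimensions, values, and variable names are made up and are not from the paper's data) builds a sparse Ψ and a tridiagonal Ω and shows that the implied matrix of marginal effects B = ΨΩ^{-1} is nonetheless dense.

```r
p <- 3; q <- 4

# Direct effects: only X1 -> Y1 and X3 -> Y3 are non-zero
Psi <- matrix(0, nrow = p, ncol = q)
Psi[1, 1] <- 1.0
Psi[3, 3] <- -0.8

# Residual precision: tridiagonal, so adjacent outcomes are conditionally correlated
Omega <- diag(q)
Omega[abs(row(Omega) - col(Omega)) == 1] <- 0.4

B <- Psi %*% solve(Omega)   # marginal effects
round(B, 3)
# Psi[3, 2] is zero (no direct effect of X3 on Y2), but B[3, 2] is not:
# X3 affects Y3, which is conditionally correlated with Y2.
```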
2.2 Related works
Learning sparse chain graphs. McCarter and Kim (2014) proposed fitting sparse Gaussian chain graphical models by minimizing a penalized negative log-likelihood. They specifically proposed homogeneous L_1 penalties on the entries of Ψ and Ω and used cross-validation to set the penalty parameters for Ψ and Ω. Shen and Solis-Lemus (2021) developed a Bayesian version of that chain graphical LASSO and put a Gamma prior on the penalty parameters. In this way, they automatically learned the degree to which the entries ψ_{j,k} and ω_{k,k'} are shrunk to zero. Although these papers differ in how they determine the appropriate amount of penalization, both McCarter and Kim (2014) and Shen and Solis-Lemus (2021) deploy a single fixed penalty on all of the entries in Ψ and a single fixed penalty on all entries in Ω. With such fixed penalties, larger parameter estimates are shrunk towards zero as aggressively as the smaller parameter estimates, which can introduce substantial estimation bias.
Spike-and-slab variable selection with the EM algorithm. Spike-and-slab priors are the workhorses of sparse Bayesian modeling. As introduced by Mitchell and Beauchamp (1988), the spike-and-slab prior is a mixture of a point mass at 0 (the "spike") and a uniform distribution over a wide interval (the "slab"). George and McCulloch (1993) introduced a continuous relaxation of the original spike-and-slab prior, respectively replacing the point mass spike and uniform slab distributions with zero-mean Gaussians with extremely small and large variances. In this way, one may imagine generating all of the "essentially negligible" parameters in a model from the spike distribution and generating all of the "relevant" or "significant" parameters from the slab distribution. Despite their intuitive appeal, spike-and-slab priors usually produce extremely multimodal posterior distributions. In high dimensions, exploring these distributions with Markov chain Monte Carlo (MCMC) is computationally prohibitive.
In response, Ročková and George (2014) introduced EMVS, a fast EM algorithm targeting the maximum a posteriori (MAP) estimate of the regression parameters. They later extended EMVS, which used conditionally conjugate Gaussian spike and slab distributions, to use Laplacian spike and slab distributions in Ročková and George (2018). The resulting spike-and-slab LASSO (SSL) procedure demonstrated excellent empirical performance. At a high level, the SSL algorithm solves a sequence of L_1 penalized regression problems with self-adaptive penalties. The adaptive penalty mixing is key to the empirical success of the SSL (George and Ročková, 2020; Bai et al., 2021), as it facilitates shrinking larger parameter estimates to zero less aggressively than smaller parameter estimates.

Since Ročková and George (2014), the general EM technique for maximizing spike-and-slab posteriors has been successfully applied to many problems. For instance, Bai et al. (2020) introduced a grouped version of the SSL that adaptively shrinks groups of parameter values towards zero. Tang et al. (2017, 2018) similarly deployed the SSL and its grouped variant in generalized linear models. Outside of the single-outcome regression context, continuous spike-and-slab priors have been used to estimate sparse Gaussian graphical models (Li et al., 2019; Gan et al., 2019a,b), sparse factor models (Ročková and George, 2016), and biclusterings (Moran et al., 2021). Deshpande et al. (2019) introduced a multivariate SSL for estimating B and Ω in the marginal regression model in Equation (2). In each extension, the adaptive penalization performed by the EM algorithm resulted in support recovery and parameter estimation superior to that of fixed penalty methods.
The asymptotics of spike-and-slab variable selection. Beyond its excellent empirical performance, Ročková and George (2018)'s SSL enjoys strong theoretical support. Using general techniques proposed by Zhang and Zhang (2012) and Ghosal and van der Vaart (2017), they proved that, under mild conditions, the posterior induced by the SSL prior in high-dimensional, single-outcome linear regression contracts at a near minimax-optimal rate as n → ∞. Their contraction result implies that the MAP estimate returned by their EM algorithm is consistent and is, up to a log factor, rate-optimal. By directly applying Ghosal and van der Vaart (2017)'s general theory, Bai et al. (2020) extended these results to the group SSL posterior with an unknown variance.

In the context of Gaussian graphical models, Gan et al. (2019a) showed that the MAP estimator corresponding to placing spike-and-slab LASSO priors on the off-diagonal elements of a precision matrix is consistent. They did not, however, establish the contraction rate of the posterior. Ning et al. (2020) showed that the joint posterior distribution of (B, Ω) in the multivariate regression model in Equation (2) concentrates when using a group spike-and-slab prior with Laplace slab and point mass spike on B and a carefully selected prior on the eigendecomposition of Ω^{-1}. To the best of our knowledge, however, the asymptotic properties of the posterior formed by placing SSL priors on both the precision matrix Ω and the regression coefficients Ψ in Equation (1) have not yet been established.
3 Introducing the cgSSL
3.1 The cgSSL prior
To quantify the prior belief that many entries in Ψ are essentially negligible, we model each ψ_{j,k} as having been drawn either from a spike distribution, which is sharply concentrated around zero, or a slab distribution, which is much more diffuse. More specifically, we take the spike distribution to be Laplace(λ_0) and the slab distribution to be Laplace(λ_1), where 0 < λ_1 < λ_0 are fixed positive constants. This way, the spike distribution is much more heavily concentrated around zero than is the slab. We further let θ ∈ [0, 1] be the prior probability that each ψ_{j,k} is drawn from the slab and model the ψ_{j,k}'s as conditionally independent given θ. Thus, the prior density for Ψ, conditional on θ, is given by
$$\pi(\Psi \mid \theta) = \prod_{j=1}^{p}\prod_{k=1}^{q}\left[\theta\frac{\lambda_1}{2}e^{-\lambda_1|\psi_{j,k}|} + (1-\theta)\frac{\lambda_0}{2}e^{-\lambda_0|\psi_{j,k}|}\right]. \qquad (3)$$
Since Ω is symmetric, it is enough to specify a prior on the entries ω_{k,k'} where k ≤ k'. To this end, we begin by placing an entirely analogous spike-and-slab prior on the off-diagonal entries. That is, we model each ω_{k,k'} as being drawn from a Laplace(ξ_1), with probability η ∈ [0, 1], or a Laplace(ξ_0), with probability 1 − η, where 0 < ξ_1 < ξ_0. We similarly model each ω_{k,k'} as conditionally independent given η and place independent Exp(ξ_1) priors on the diagonal entries of Ω. We then truncate the resulting distribution of Ω | η to the cone of symmetric positive definite matrices, yielding the prior density
$$\pi(\Omega \mid \eta) \propto \left[\prod_{1 \leq k < k' \leq q}\left(\eta\frac{\xi_1}{2}e^{-\xi_1|\omega_{k,k'}|} + (1-\eta)\frac{\xi_0}{2}e^{-\xi_0|\omega_{k,k'}|}\right)\right]\times\left[\prod_{k=1}^{q}\xi_1 e^{-\xi_1\omega_{k,k}}\right]\times\mathbb{1}(\Omega \succ 0). \qquad (4)$$
Observe that 1 − θ and 1 − η respectively quantify the proportion of entries in Ψ and Ω that are essentially negligible. To model our uncertainty about these proportions, we place Beta priors on each of θ and η. Specifically, we independently model θ ∼ Beta(a_θ, b_θ) and η ∼ Beta(a_η, b_η), where a_θ, b_θ, a_η, b_η > 0 are fixed positive constants.
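For concreteness, the following short R sketch (our own illustration, not code from the authors' package; the hyperparameter values are arbitrary) evaluates the logarithm of the prior density in Equation (3) for a given Ψ and θ.

```r
# Log prior density of Psi given theta under Equation (3): every entry follows a
# two-component mixture of Laplace densities with rates lambda1 (slab) and lambda0 (spike).
log_prior_psi <- function(Psi, theta, lambda1, lambda0) {
  slab  <- theta       * (lambda1 / 2) * exp(-lambda1 * abs(Psi))
  spike <- (1 - theta) * (lambda0 / 2) * exp(-lambda0 * abs(Psi))
  sum(log(slab + spike))
}

# Example: a 2 x 2 Psi with two exact zeros, a small entry, and a large entry
log_prior_psi(matrix(c(0, 0.5, -2, 0), 2, 2), theta = 0.1, lambda1 = 1, lambda0 = 10)
```

Entries near zero pick up most of their density from the spike component, while large entries are supported almost entirely by the slab; this is the mechanism that later drives the adaptive penalties in Section 3.3.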
3.2 Targeting the MAP
Unfortunately, the posterior distribution of (Ψ, θ, Ω, η) | Y is analytically intractable. Further, it is generally high-dimensional and rather multimodal, rendering stochastic search techniques like Markov chain Monte Carlo computationally impractical. We instead follow Ročková and George (2018)'s example and focus on finding the maximum a posteriori (MAP) estimate of (Ψ, θ, Ω, η). Throughout, we assume that the columns of X have been centered and scaled to have norm √n.
To this end, we attempt to maximize the log posterior density
$$\begin{aligned}
\log \pi(\Psi, \theta, \Omega, \eta \mid Y) =\ & \frac{n}{2}\log|\Omega| - \frac{1}{2}\operatorname{tr}\left[(Y - X\Psi\Omega^{-1})\,\Omega\,(Y - X\Psi\Omega^{-1})^{\top}\right] \\
& + \sum_{j=1}^{p}\sum_{k=1}^{q}\log\left(\theta\lambda_1 e^{-\lambda_1|\psi_{j,k}|} + (1-\theta)\lambda_0 e^{-\lambda_0|\psi_{j,k}|}\right) \\
& + \sum_{k=1}^{q-1}\sum_{k'>k}\log\left(\eta\xi_1 e^{-\xi_1|\omega_{k,k'}|} + (1-\eta)\xi_0 e^{-\xi_0|\omega_{k,k'}|}\right) \\
& - \sum_{k=1}^{q}\xi_1\omega_{k,k} + \log\mathbb{1}(\Omega \succ 0) \\
& + (a_\theta - 1)\log(\theta) + (b_\theta - 1)\log(1-\theta) \\
& + (a_\eta - 1)\log(\eta) + (b_\eta - 1)\log(1-\eta). \qquad (5)
\end{aligned}$$
Optimizing the log posterior density directly is complicated by the non-concavity of log π(Ω | η). Instead, following Deshpande et al. (2019), we iteratively optimize a surrogate objective using an EM-like algorithm.

To motivate this approach, observe that we can obtain the prior density π(Ω | η) in Equation (4) by marginalizing an augmented prior
$$\pi(\Omega \mid \eta) = \int \pi(\Omega \mid \delta)\,\pi(\delta \mid \eta)\,d\delta,$$
where δ = {δ_{k,k'} : 1 ≤ k < k' ≤ q} is a collection of q(q−1)/2 i.i.d. Bernoulli(η) variables and
$$\pi(\Omega \mid \delta) \propto \left[\prod_{1 \leq k < k' \leq q}\left(\xi_1 e^{-\xi_1|\omega_{k,k'}|}\right)^{\delta_{k,k'}}\left(\xi_0 e^{-\xi_0|\omega_{k,k'}|}\right)^{1-\delta_{k,k'}}\right]\times\left[\prod_{k=1}^{q}\xi_1 e^{-\xi_1\omega_{k,k}}\right]\times\mathbb{1}(\Omega \succ 0).$$
In our augmented prior, δ_{k,k'} indicates whether ω_{k,k'} is drawn from the slab (δ_{k,k'} = 1) or the spike (δ_{k,k'} = 0).

The above marginalization immediately suggests an EM algorithm: rather than optimize log π(Ψ, θ, Ω, η | Y) directly, we can iteratively optimize a surrogate objective formed by marginalizing the augmented log posterior density. That is, starting from some initial guess (Ψ^{(0)}, θ^{(0)}, Ω^{(0)}, η^{(0)}), for t ≥ 1, the tth iteration of our algorithm consists of two steps. In the first step, we compute the surrogate objective
$$F^{(t)}(\Psi, \theta, \Omega, \eta) = \mathbb{E}_{\delta}\left[\log \pi(\Psi, \theta, \Omega, \eta, \delta \mid Y)\ \middle|\ \Psi = \Psi^{(t-1)}, \Omega = \Omega^{(t-1)}, \theta = \theta^{(t-1)}, \eta = \eta^{(t-1)}\right],$$
where the expectation is taken with respect to the conditional posterior distribution of the indicators δ given the current value of (Ψ, θ, Ω, η). Then in the second step, we maximize the surrogate objective and set (Ψ^{(t)}, θ^{(t)}, Ω^{(t)}, η^{(t)}) = arg max F^{(t)}(Ψ, θ, Ω, η).
It turns out that, given Ω and η, the indicators δ_{k,k'} are conditionally independent Bernoulli random variables whose means are easy to evaluate, making it simple to compute a closed form expression for the surrogate objective F^{(t)}. Unfortunately, maximizing F^{(t)} is still difficult. Consequently, similar to Deshpande et al. (2019), we carry out two conditional maximizations, first optimizing with respect to (Ψ, θ) while holding (Ω, η) fixed, and then optimizing with respect to (Ω, η) while holding (Ψ, θ) fixed. That is, in the second step of each iteration of our algorithm, we set
$$(\Psi^{(t)}, \theta^{(t)}) = \arg\max_{\Psi, \theta}\ F^{(t)}(\Psi, \theta, \Omega^{(t-1)}, \eta^{(t-1)}) \qquad (6)$$
$$(\Omega^{(t)}, \eta^{(t)}) = \arg\max_{\Omega, \eta}\ F^{(t)}(\Psi^{(t)}, \theta^{(t)}, \Omega, \eta). \qquad (7)$$
In summary, we propose finding the MAP estimate of (Ψ, θ, Ω, η) using an Expectation Conditional Maximization (ECM; Meng and Rubin, 1993) algorithm.

When we fix the values of Ω and η, the surrogate objective F^{(t)} is separable in Ψ and θ. That is, the objective function F^{(t)}(Ψ, θ, Ω^{(t-1)}, η^{(t-1)}) in Equation (6) can be written as the sum of a function of Ψ alone and a function of θ alone. This means that we can separately compute Ψ^{(t)} and θ^{(t)} while fixing (Ω, η) = (Ω^{(t-1)}, η^{(t-1)}). The objective function in Equation (7) is similarly separable and we can separately compute Ω^{(t)} and η^{(t)} while fixing (Ψ, θ) = (Ψ^{(t)}, θ^{(t)}). As we describe in Section S1 of the Supplementary Materials, computing θ^{(t)} and η^{(t)} is relatively straightforward; we compute θ^{(t)} with a simple Newton algorithm and there is a closed form expression for η^{(t)}. The main computational challenge is computing Ψ^{(t)} and Ω^{(t)}. In the next subsection, we detail how updating Ψ and Ω reduces to solving penalized likelihood problems with self-adaptive penalties.
3.3 Adaptive penalty mixing
Before describing how we compute Ψ^{(t)} and Ω^{(t)}, we introduce two important functions:
$$p^{\star}(x, \theta) = \frac{\theta\lambda_1 e^{-\lambda_1|x|}}{\theta\lambda_1 e^{-\lambda_1|x|} + (1-\theta)\lambda_0 e^{-\lambda_0|x|}}, \qquad q^{\star}(x, \eta) = \frac{\eta\xi_1 e^{-\xi_1|x|}}{\eta\xi_1 e^{-\xi_1|x|} + (1-\eta)\xi_0 e^{-\xi_0|x|}}.$$
For each 1 ≤ j ≤ p and 1 ≤ k ≤ q, p*(ψ_{j,k}, θ) is the conditional posterior probability that ψ_{j,k} was drawn from the Laplace(λ_1) slab distribution. Similarly, for 1 ≤ k < k' ≤ q, q*(ω_{k,k'}, η) is just the conditional posterior probability that ω_{k,k'} was drawn from the Laplace(ξ_1) slab. That is, q*(ω_{k,k'}, η) = E[δ_{k,k'} | Y, Ψ, Ω, θ, η].
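For intuition, here is a small R sketch of these two weighting functions (our own illustration, not code from the authors' package; the hyperparameter values below are arbitrary). Both are computed on the log scale to avoid underflow.

```r
# Conditional probability that a value x was drawn from the slab, given mixing
# weight theta and Laplace rates lambda1 (slab) and lambda0 (spike), lambda1 < lambda0.
p_star <- function(x, theta, lambda1, lambda0) {
  log_slab  <- log(theta)     + log(lambda1) - lambda1 * abs(x)
  log_spike <- log(1 - theta) + log(lambda0) - lambda0 * abs(x)
  1 / (1 + exp(log_spike - log_slab))
}

# q_star has exactly the same form with (eta, xi1, xi0) in place of (theta, lambda1, lambda0)
q_star <- function(x, eta, xi1, xi0) p_star(x, eta, xi1, xi0)

# Larger |x| receives a larger slab probability:
p_star(c(0, 0.1, 1), theta = 0.5, lambda1 = 1, lambda0 = 20)
```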
Updating Ψ. Fixing the value Ω = Ω^{(t-1)}, computing Ψ^{(t)} is equivalent to solving the following penalized optimization problem
$$\Psi^{(t)} = \arg\max_{\Psi}\left\{-\frac{1}{2}\operatorname{tr}\left[(Y\Omega - X\Psi)\,\Omega^{-1}\,(Y\Omega - X\Psi)^{\top}\right] + \sum_{j,k}\operatorname{pen}(\psi_{j,k};\theta)\right\}, \qquad (8)$$
where
$$\operatorname{pen}(\psi_{j,k};\theta) = \log\frac{\pi(\psi_{j,k}\mid\theta)}{\pi(0\mid\theta)} = -\lambda_1|\psi_{j,k}| + \log\frac{p^{\star}(0,\theta)}{p^{\star}(\psi_{j,k},\theta)}.$$
Note that the first term in the objective of Equation (8) can be obtained by distributing a factor of Ω through the quadratic form that appears in the log-likelihood (see Equations (S5) and (S7) of the Supplementary Materials for details).

Following arguments similar to those in Deshpande et al. (2019), the Karush-Kuhn-Tucker (KKT) condition for (8) tells us that
$$\psi^{(t)}_{j,k} = n^{-1}\left[|z_{j,k}| - \lambda^{\star}(\psi^{(t)}_{j,k},\theta)\right]_{+}\operatorname{sign}(z_{j,k}), \qquad (9)$$
where
$$z_{j,k} = n\psi^{(t)}_{j,k} + X_j^{\top}r_k + \sum_{k'\neq k}\frac{(\Omega^{-1})_{k,k'}}{(\Omega^{-1})_{k,k}}X_j^{\top}r_{k'}, \qquad r_{k'} = (Y\Omega - X\Psi^{(t)})_{k'},$$
$$\lambda^{\star}(\psi^{(t)}_{j,k},\theta) = \lambda_1\, p^{\star}(\psi^{(t)}_{j,k},\theta) + \lambda_0\left(1 - p^{\star}(\psi^{(t)}_{j,k},\theta)\right).$$
The KKT conditions suggest a natural coordinate-ascent strategy for computing Ψ^{(t)}: starting from some initial guess Ψ^0, we cyclically update the entries ψ_{j,k} by soft-thresholding ψ_{j,k} at λ*_{j,k}. During our cyclical coordinate ascent, whenever the current value of ψ_{j,k} is very large, the corresponding value of p*(ψ_{j,k}, θ) will be close to one, and the threshold λ* will be close to the slab penalty λ_1. On the other hand, when ψ_{j,k} is very small, the corresponding p* will be close to zero and the threshold λ* will be close to the spike penalty λ_0. Since λ_1 < λ_0, we are therefore able to apply a stronger penalty to the smaller entries of Ψ and a weaker penalty to the larger entries. As our cyclical coordinate ascent proceeds, we iteratively refine the thresholds λ*, thereby adaptively shrinking our estimates of ψ_{j,k}.

Before proceeding, we note that the quantity z_{j,k} depends not only on the inner product between X_j, the jth column of the design matrix, and the partial residual r_k but also on the inner products between X_j and all other partial residuals r_{k'} for k' ≠ k. Practically, this means that in our cyclical coordinate ascent algorithm, our estimate of the direct effect of predictor X_j on outcome Y_k can depend on how well we have fit all other outcomes Y_{k'}. Moreover, the entries of Ω^{-1} determine the degree to which ψ_{j,k} depends on the outcomes Y_{k'} for k' ≠ k. Specifically, if (Ω^{-1})_{k,k'} = 0, then we are unable to leverage information contained in Y_{k'} to inform our estimate of ψ_{j,k}.
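The following R sketch (our own, not the authors' implementation) carries out one full sweep of this cyclical coordinate ascent. It assumes, consistent with our reading of Equations (8) and (9) above, that the columns of X have norm √n, that Ω is held fixed, and that the response enters through the working matrix YΩ; the threshold is evaluated at the current value of each entry.

```r
# One sweep of adaptive soft-thresholding updates for Psi (a sketch of Equation (9)).
sweep_psi <- function(X, Y, Psi, Omega, theta, lambda1, lambda0) {
  n <- nrow(X); p <- ncol(X); q <- ncol(Y)
  Sigma  <- solve(Omega)                     # Omega^{-1}
  Ytilde <- Y %*% Omega                      # working response
  lambda_star <- function(psi) {             # self-adaptive threshold
    log_slab  <- log(theta)     + log(lambda1) - lambda1 * abs(psi)
    log_spike <- log(1 - theta) + log(lambda0) - lambda0 * abs(psi)
    pstar <- 1 / (1 + exp(log_spike - log_slab))
    lambda1 * pstar + lambda0 * (1 - pstar)
  }
  for (j in seq_len(p)) {
    for (k in seq_len(q)) {
      R <- Ytilde - X %*% Psi                # partial residuals r_1, ..., r_q
      w <- Sigma[k, ] / Sigma[k, k]          # weights (Omega^{-1})_{k,k'} / (Omega^{-1})_{k,k}
      z <- n * Psi[j, k] + sum(drop(crossprod(X[, j], R)) * w)
      thr <- lambda_star(Psi[j, k])
      Psi[j, k] <- sign(z) * max(abs(z) - thr, 0) / n
    }
  }
  Psi
}
```

Recomputing the full residual matrix inside the inner loop keeps the sketch short; an efficient implementation would instead update the residuals in place after each coordinate move.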
Updating Ω. Fixing Ψ = Ψ^{(t)} and letting S = n^{-1}Y^⊤Y and M = n^{-1}(XΨ)^⊤XΨ, we can compute Ω^{(t)} by solving
$$\Omega^{(t)} = \arg\max_{\Omega \succ 0}\left\{\frac{n}{2}\Big(\log|\Omega| - \operatorname{tr}(S\Omega) - \operatorname{tr}(M\Omega^{-1})\Big) - \sum_{k=1}^{q}\Big[\xi_1\omega_{k,k} + \sum_{k'>k}\xi^{\star}_{k,k'}|\omega_{k,k'}|\Big]\right\}, \qquad (10)$$
where ξ*_{k,k'} = ξ_1 q*(ω^{(t-1)}_{k,k'}, η^{(t-1)}) + ξ_0 (1 − q*(ω^{(t-1)}_{k,k'}, η^{(t-1)})).
The objective in Equation (10) is extremely similar to the conventional graphical LASSO (GLASSO; Friedman et al., 2008) objective. However, there are two crucial differences. First, because the conditional mean of Y depends on Ω in the Gaussian chain graph model (1), we have an additional term tr(MΩ^{-1}) that is absent in the GLASSO objective. Second, and more substantively, the objective in Equation (10) contains individualized penalties ξ*_{k,k'} on the off-diagonal entries of Ω. Here, the penalty ξ*_{k,k'} will be large (resp. small) whenever the previous estimate ω^{(t-1)}_{k,k'} is small (resp. large). In other words, as we run our ECM algorithm, we can refine the amount of penalization applied to each off-diagonal entry of Ω.

Although the objective in Equation (10) is somewhat different from the GLASSO objective, we can solve it by suitably modifying an existing GLASSO algorithm. Specifically, we solve the optimization problem in Equation (10) with a modified version of Hsieh et al. (2011)'s QUIC algorithm. Our solution repeatedly (i) forms a quadratic approximation of the objective, (ii) computes a suitable Newton direction, and (iii) follows that Newton direction for a step size chosen with an Armijo rule. In Section S2.4 of the Supplementary Materials, we show that the optimization problem in Equation (10) has a unique solution and that our modification of QUIC converges to that unique solution.
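To make the comparison with GLASSO concrete, here is a small R sketch (our own, with made-up inputs) that evaluates the objective in Equation (10), as reconstructed above, for a candidate Ω. The two departures from the standard GLASSO objective are visible directly in the code: the extra tr(MΩ^{-1}) term and the entry-specific off-diagonal penalties ξ*_{k,k'}.

```r
# Evaluate the Omega-update objective (10) at a candidate precision matrix Omega,
# given S = Y'Y/n, M = (X Psi)'(X Psi)/n, a matrix xi_star of entry-specific
# off-diagonal penalties, the diagonal penalty rate xi1, and the sample size n.
omega_objective <- function(Omega, S, M, xi_star, xi1, n) {
  eigvals <- eigen(Omega, symmetric = TRUE, only.values = TRUE)$values
  if (min(eigvals) <= 0) return(-Inf)        # outside the positive definite cone
  loglik <- (n / 2) * (sum(log(eigvals)) -
                       sum(diag(S %*% Omega)) -
                       sum(diag(M %*% solve(Omega))))
  pen <- xi1 * sum(diag(Omega)) +
         sum(xi_star[upper.tri(xi_star)] * abs(Omega[upper.tri(Omega)]))
  loglik - pen
}

# Toy inputs
q <- 4; n <- 100
S <- diag(q); M <- 0.1 * diag(q)
xi_star <- matrix(5, q, q)
omega_objective(diag(q), S, M, xi_star, xi1 = 0.01, n = n)
```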
3.4 Selecting the spike and slab penalties
The proposed ECM algorithm depends on two sets of hyperparameters. The first set, containing a_θ, b_θ, a_η, and b_η, encodes our initial beliefs about the overall proportion of non-negligible entries in Ψ and Ω. We set a_θ = 1, b_θ = pq, a_η = 1, and b_η = q, similar to Deshpande et al. (2019). The second set of hyperparameters consists of the spike and slab penalties λ_0, λ_1, ξ_0, and ξ_1. Rather than run the cgSSL with a single set of these penalties, we use Deshpande et al. (2019)'s path-following dynamic posterior exploration (DPE) strategy to obtain the MAP estimates corresponding to several different choices of spike penalties.
Specifically, we fix the slab penalties λ_1 and ξ_1 and specify grids of increasing spike penalties I_λ = {λ_0^{(1)} < ··· < λ_0^{(L)}} and I_ξ = {ξ_0^{(1)} < ··· < ξ_0^{(L)}}. We then run the cgSSL with warm starts for each combination of spike penalties, yielding a set of posterior modes {Ψ^{(s,t)}, θ^{(s,t)}, Ω^{(s,t)}, η^{(s,t)}} indexed by the choices (λ_0^{(s)}, ξ_0^{(t)}). To warm start the estimation of the mode corresponding to (λ_0^{(s)}, ξ_0^{(t)}), we first compute the modes found with (λ_0^{(s-1)}, ξ_0^{(t-1)}), (λ_0^{(s)}, ξ_0^{(t-1)}), and (λ_0^{(s-1)}, ξ_0^{(t)}). We evaluate the posterior density using (λ_0, ξ_0) = (λ_0^{(s)}, ξ_0^{(t)}) at each of the three previously computed modes and initialize at the mode with the largest density.
Following this DPE strategy provides a snapshot of the many different cgSSL posteriors. However, it can be computationally intensive, as we must run our ECM algorithm to convergence for every pair of spike penalties. Deshpande et al. (2019) introduced a faster variant, called dynamic conditional posterior exploration (DCPE), which we also implemented for the cgSSL. In DCPE, we first run our ECM algorithm with warm starts over the ladder I_λ while keeping Ω = I fixed. Then, fixing (Ψ, θ) at the final value from the first step, we run our ECM algorithm with warm starts over the ladder I_ξ. Finally, we run our ECM algorithm starting from the final estimates of the parameters obtained in the first two steps with (λ_0, ξ_0) = (λ_0^{(L)}, ξ_0^{(L)}). Generally speaking, DPE and DCPE trace different paths through the parameter space and typically return different final estimates.
When the spike and slab penalties are similar in size (i.e., λ_1 ≈ λ_0 and ξ_1 ≈ ξ_0), we noticed that our ECM algorithm would sometimes return very dense estimates of Ψ and diagonal estimates of Ω with very large diagonal entries. Essentially, when the spike and slab distributions are not too different, our ECM algorithm has a tendency to overfit the response with a dense Ψ, leaving very little residual variation to be quantified with Ω. On further investigation, we found that we could detect such pathological behavior by examining the condition number of the matrix Y − XΨ. To avoid propagating dense Ψ's and diagonal Ω's through the DPE and DCPE, we terminate our ECM early whenever the condition number of Y − XΨ exceeds 10n. We then set the corresponding Ψ^{(s)} = 0 and Ω^{(t)} = I and continue the dynamic exploration from that point. While this is an admittedly ad hoc heuristic, we have found that it works well in practice and note that Moran et al. (2019) utilized a similar strategy in the single-outcome high-dimensional linear regression setting with unknown variance.
The DPE and DCPE cgSSL procedures are implemented in the mSSL R(R Core Team,
2022) package, which is available at https://github.com/YunyiShen/mSSL. Note that this
package contains a new implementation of Deshpande et al. (2019)’s mSSL procedure as
well.
4 Asymptotic theory of cgSSL
If the Gaussian chain graph model in Equation (1) is well-specified, that is, if our data (x_i, y_i) are truly generated according to the model, will the posterior distribution of Ψ and Ω collapse to a point mass at the true data generating parameters as n → ∞? Such a collapse would, among other things, imply that the MAP estimate returned by the cgSSL procedure described in Section 3 is consistent, providing an asymptotic justification for its use. In this section, we answer the question affirmatively: under some mild assumptions and with some slight modifications, the cgSSL posterior concentrates around the truth. We further establish the rate of concentration, which quantifies the speed at which the posterior distribution shrinks to the true data generating parameters. We begin by briefly reviewing our general proof strategy before precisely stating our assumptions and results. Proofs of our main results are available in Section S5 of the Supplementary Materials.
4.1 Proof strategy
To establish the posterior concentration rate for Ψ and Ω, we followed Ning et al. (2020)
and Bai et al. (2020) and first showed that the posterior concentrates in log-affinity (see
Section S5.3 in the Supplementary Materials for details). Posterior concentration of the
individual parameters followed as a consequence. To show that the posterior concentrates
in log-affinity, we appealed to general results about posterior concentration for independent
but non-identically distributed observations. Specifically, we verified the three conditions
of Theorem 8.23 of Ghosal and van der Vaart (2017). First, we confirmed that the cgSSL prior introduced in Section 3.1 places enough prior probability mass in small neighborhoods around every possible choice of (Ψ, Ω). This was done by verifying that for each (Ψ, Ω), the prior probability contained in a small Kullback-Leibler ball around (Ψ, Ω) can be lower bounded by a function of the ball's radius (the so-called "KL-condition" in Lemma S2 of the Supplementary Materials). Then we studied a sequence of likelihood ratio tests defined on sieves of the parameter space that can correctly distinguish between parameter values that are sufficiently far away from each other in log-affinity. In particular, we bounded the error rate of such tests and then bounded the covering number of the sieves (Lemma S4 of the Supplementary Materials).
Ning et al. (2020) studied the sparse marginal regression model in Equation (2) instead of the sparse chain graph. Although these are somewhat different models, our overall proof strategy is quite similar to theirs. However, we pause here to highlight some important technical differences. First, Ning et al. (2020) placed a prior on Ω's eigendecomposition while we placed an arguably simpler and more natural element-wise prior on Ω. The second and more substantive difference is in how we bound the covering number of sieves of the underlying parameter space. Because Ning et al. (2020) specified exactly sparse priors on the elements of B = ΨΩ^{-1}, it was enough for them to carefully bound the covering number of exactly low-dimensional sets of the form A × {0}^r, where A is some subset of a multi-dimensional Euclidean space and r > 0 is a positive integer. In contrast, because we specified absolutely continuous priors on the elements of Ψ, we had to cover "effectively low-dimensional" sets of the form A × [−δ, δ]^r for small δ > 0. Our key lemma (Lemma S4 in the Supplementary Materials) provides sufficient conditions on δ for bounding the ε-packing number of such effectively low-dimensional sets using the ε'-packing number of A for a carefully chosen ε' > 0.
4.2 Contraction of cgSSL
In order to establish our posterior concentration results, we first assume that the data (x_1, y_1), ..., (x_n, y_n) were generated according to a Gaussian chain graph model with true parameters Ψ_0 and Ω_0. We need to make additional assumptions about the spectra of Ψ_0 and Ω_0 and on the dimensions n, p, and q.

A1 Ψ_0 and Ω_0 have bounded operator norm: that is, Ψ_0 ∈ T_0 = {Ψ : |||Ψ|||_2 < a_1} and Ω_0 ∈ H_0 = {Ω : |||Ω|||_2 ∈ [1/b_2, 1/b_1]}, where ||| · |||_2 is the operator norm and a_1, b_1, b_2 > 0 are fixed positive constants.

A2 Dimensionality: We assume that log(n) ≲ log(q); log(n) ≲ log(p); and
$$\max\{p, q, s^{\Omega}_0, s^{\Psi}_0\}\log(\max\{p, q\})/n \to 0,$$
where s^Ω_0 and s^Ψ_0 are the numbers of non-zero free parameters in Ω_0 and Ψ_0 respectively, and a_n ≲ b_n means that for sufficiently large n there exists a constant C independent of n such that a_n ≤ C b_n.

A3 Tuning the Ψ prior: We assume that (1 − θ)/θ ≍ (pq)^{2+a_0} for some a_0 > 0; λ_0 ≍ max{n, pq}^{2+b_0} for some b_0 > 1/2; and λ_1 ≍ 1/n.

A4 Tuning the Ω prior: We assume that (1 − η)/η ≍ max{Q, pq}^{2+a} for some a > 0, where Q = q(q−1)/2; ξ_0 ≍ max{Q, pq, n}^{4+b} for some b > 0; ξ_1 ≍ 1/n; and ξ_1 ≲ 1/max{Q, n}.
Before stating our main result, we pause to highlight two key differences between the above assumptions and the model introduced in Section 3.1. Although the prior in Section 3.1 restricts Ω to the positive-definite cone, Assumption A1 is slightly stronger, as it bounds the smallest eigenvalue of Ω away from zero. The stronger assumption ensures that the entries of XΨΩ^{-1} do not diverge in our theoretical analysis. We additionally restricted our theoretical analysis to the setting where the proportions of non-negligible parameters, θ and η, are fixed and known (Assumptions A3 and A4). We note that Ročková and George (2018) and Gan et al. (2019a) make similar assumptions in their theoretical analyses.
Theorem 1 (Posterior contraction of cgSSL). Under Assumptions A1–A4, there is a constant M_1 > 0, which does not depend on n, such that
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Psi, \Omega : \|X(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\|_F^2 \geq M_1 n\epsilon_n^2 \mid Y_1, \ldots, Y_n\right) \to 0 \qquad (11)$$
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Omega : \|\Omega - \Omega_0\|_F^2 \geq M_1\epsilon_n^2 \mid Y_1, \ldots, Y_n\right) \to 0 \qquad (12)$$
where ε_n = √(max{p, q, s^Ω_0, s^Ψ_0} log(max{p, q})/n). Note that ε_n → 0 as n → ∞.

A key step in proving Theorem 1 is Lemma 1, which shows that the cgSSL posterior does not place too much probability on Ψ's and Ω's with too many large entries. In order to state this lemma, we denote the effective dimensions of Ψ and Ω by |ν(Ψ)| and |ν(Ω)|. The effective dimension of Ψ (resp. Ω) counts the number of entries (resp. off-diagonal entries in the lower-triangle) whose absolute value exceeds the intersection point of the spike and slab prior densities.

Lemma 1 (Dimension recovery of cgSSL). For a sufficiently large constant C'_3 > 0, we have
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Psi : |\nu(\Psi)| > C'_3 s^{\star} \mid Y_1, \ldots, Y_n\right) \to 0 \qquad (13)$$
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Omega : |\nu(\Omega)| > C'_3 s^{\star} \mid Y_1, \ldots, Y_n\right) \to 0 \qquad (14)$$
where s* = max{p, q, s^Ω_0, s^Ψ_0}.
Lemma 1 essentially guarantees that the cgSSL posterior does not grossly overestimate the number of predictor-response and response-response edges in the underlying graphical model. Note that the result in Equation (11) shows that the vector containing the n evaluations of the regression function (i.e., the vector XΨΩ^{-1}) converges to the vector containing the evaluations of the true regression function Ω_0^{-1}Ψ_0^⊤x. Importantly, apart from Assumption A2 about the dimensions of X, we did not make any additional assumptions about the design matrix. The contraction rates for Ψ and ΨΩ^{-1}, however, depend critically on X. To state these results, denote the restricted eigenvalue of the design matrix as
$$\phi^2(s) = \inf_{A \in \mathbb{R}^{p\times q}:\ 0 \leq |\nu(A)| \leq s}\ \frac{\|XA\|_F^2}{n\|A\|_F^2}.$$
Corollary 1 (Recovery of regression coefficients in cgSSL). Under Assumptions A1–A4, there is some constant M_0 > 0, which does not depend on n, such that
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Psi, \Omega : \|\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1}\|_F^2 \geq \frac{M_0\epsilon_n^2}{\phi^2(s^{\Psi}_0 + C'_3 s^{\star})} \mid Y_1, \ldots, Y_n\right) \to 0 \qquad (15)$$
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Psi : \|\Psi - \Psi_0\|_F^2 \geq \frac{M_0\epsilon_n^2}{\min\{\phi^2(s^{\Psi}_0 + C'_3 s^{\star}),\ 1\}} \mid Y_1, \ldots, Y_n\right) \to 0. \qquad (16)$$
Corollary 1 shows that the posterior distribution of ΨΩ^{-1} can contract at a faster or slower rate than the posterior distributions of XΨΩ^{-1} and Ω, depending on the design matrix. In particular, when X is poorly conditioned, we might expect the rate to be slower. In contrast, the term min{φ²(s^Ψ_0 + C'_3 s*), 1} appearing in the denominator of the rate in Equation (16) implies that the posterior distribution of Ψ cannot concentrate at a faster rate than the posterior distributions of ΨΩ^{-1} and Ω, regardless of the design matrix. To develop some intuition about this phenomenon, notice that we can decompose the difference Ψ − Ψ_0 as
$$\Psi - \Psi_0 = (\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\Omega + \Psi_0\Omega_0^{-1}(\Omega - \Omega_0).$$
Roughly speaking, the decomposition suggests that in order to estimate Ψ well, we must be able to estimate both Ω and ΨΩ^{-1} well. In other words, estimating Ψ is at least as hard, statistically, as estimating Ω and ΨΩ^{-1}. Taken together, the two results in Corollary 1 suggest that while a carefully constructed design matrix can improve estimation of the matrix of marginal effects, B = ΨΩ^{-1}, it cannot generally improve estimation of the matrix of direct effects Ψ.
5 Synthetic experiments
We performed a simulation study to assess how well our two implementations of the cgSSL (cgSSL-DPE and cgSSL-DCPE) (i) recover the supports of Ψ and Ω and (ii) estimate each matrix. We compared both implementations of the cgSSL to several competitors: a fixed-penalty method (cgLASSO), which deploys a single penalty λ for the entries in Ψ and a single fixed penalty ξ for the entries in Ω; Shen and Solis-Lemus (2021)'s CAR-LASSO procedure (CAR), which puts Laplace priors on the entries of Ψ and Ω and a Gamma prior on the overall shrinkage strength; and Shen and Solis-Lemus (2021)'s adaptive CAR-LASSO (CAR-A), which puts individualized Laplace priors on the free parameters of Ψ and Ω. Note that the cgSSL and cgLASSO perform optimization while CAR and CAR-A run MCMC. Further, we selected the penalties in cgLASSO with 10-fold cross-validation. Additionally, cgLASSO and CAR apply the same amount of shrinkage to every element of Ψ and the same amount of shrinkage to every element of Ω. CAR-A, on the other hand, applies individualized shrinkage.

We simulated several synthetic datasets of various dimensions and with different sparsity patterns in Ω (Figure 2). Across all of these choices of dimension and Ω, we found that cgSSL-DPE achieved somewhat lower sensitivity but much higher precision in estimating the supports of both Ψ and Ω than the competing methods. Taken together, these findings suggest that while cgSSL-DPE tended to return fewer non-zero parameter estimates than the other methods, we can be much more certain that those parameters are truly non-zero. Put another way, although the other methods can recover more of the truly non-zero signal, they do so at the expense of making many more false positive identifications in the supports of Ψ and Ω than cgSSL-DPE.
5.1 Simulation design
We simulated data with three different choices of dimensions: (n, p, q) = (100, 10, 10), (100, 20, 30), and (400, 100, 30). For each choice of (n, p, q), we considered five different choices of Ω: (i) an AR(1) model for Ω^{-1} so that Ω is tri-diagonal; (ii) an AR(2) model for Ω^{-1} so that ω_{k,k'} = 0 whenever |k − k'| > 2; (iii) a block model in which Ω is block-diagonal with two dense q/2 × q/2 diagonal blocks; (iv) a star graph where the off-diagonal entry ω_{k,k'} = 0 unless k or k' is equal to 1; and (v) a dense model with all off-diagonal elements ω_{k,k'} = 2.

In the AR(1) model, we set (Ω^{-1})_{k,k'} = 0.7^{|k − k'|} so that ω_{k,k'} = 0 whenever |k − k'| > 1. In the AR(2) model, we set ω_{k,k} = 1, ω_{k-1,k} = ω_{k,k-1} = 0.5, and ω_{k-2,k} = ω_{k,k-2} = 0.25. For the block model, we partitioned Σ = Ω^{-1} into four q/2 × q/2 blocks and set all entries in the off-diagonal blocks of Σ to zero. We then set σ_{k,k} = 1 and σ_{k,k'} = 0.5 for 1 ≤ k ≠ k' ≤ q/2 and for q/2 + 1 ≤ k ≠ k' ≤ q. For the star graph, we set ω_{k,k} = 1, ω_{1,k} = ω_{k,1} = 0.1 for each k = 2, ..., q, and set the remaining off-diagonal elements of Ω equal to zero.

These five specifications of Ω (top row of Figure 2) correspond to rather different underlying graphical structures among the response variables (bottom row of Figure 2). The AR(1) model, for instance, represents an extremely sparse but regular structure while the AR(2) model is somewhat less sparse. While the star model and the AR(1) model contain the same number of edges, the underlying graphs have markedly different degree distributions. Compared to the AR(1), AR(2), and star models, the block model is considerably denser. We included the full model, which corresponds to a dense Ω, to assess how well all of the methods perform in a misspecified regime.

In total, we considered 15 combinations of dimensions (n, p, q) and Ω. For each combination, we generated Ψ by randomly selecting 20% of its entries to be non-zero. We drew the non-zero entries uniformly from a U(−2, 2) distribution. For each combination of (n, p, q), Ω, and Ψ, we generated 100 synthetic datasets from the Gaussian chain graph model (1). The entries of the design matrix X were independently drawn from a standard N(0, 1) distribution.
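As a concrete illustration of this data generating process, the following R sketch (our own, not the authors' simulation scripts) draws one synthetic dataset under the AR(1) specification of Ω with (n, p, q) = (100, 10, 10).

```r
set.seed(1)
n <- 100; p <- 10; q <- 10

# AR(1) specification: (Omega^{-1})_{k,k'} = 0.7^{|k - k'|}, so Omega is tridiagonal
Sigma <- 0.7^abs(outer(1:q, 1:q, "-"))
Omega <- solve(Sigma)

# Psi: 20% of entries non-zero, drawn uniformly from U(-2, 2)
Psi <- matrix(0, p, q)
nonzero <- sample(p * q, size = round(0.2 * p * q))
Psi[nonzero] <- runif(length(nonzero), -2, 2)

# Design matrix with independent standard normal entries
X <- matrix(rnorm(n * p), n, p)

# y_i | x_i ~ N(Omega^{-1} Psi' x_i, Omega^{-1}): mean plus correlated Gaussian noise
mean_Y <- X %*% Psi %*% Sigma                      # equals X Psi Omega^{-1}
Y <- mean_Y + matrix(rnorm(n * q), n, q) %*% chol(Sigma)
```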
Figure 2: Visualization of the supports of Ω for q = 10 under each of the five specifications (top) and the corresponding graphs (bottom). In the top row, gray cells indicate non-zero entries in Ω and white cells indicate zeros.
5.2 Results
To assess estimation performance, we computed the Frobenius norm between the estimated matrices and the true data generating matrices. To assess support recovery performance, we counted the number of elements in each of Ψ and Ω that were (i) correctly estimated as non-zero (true positives; TP); (ii) correctly estimated as zero (true negatives; TN); (iii) incorrectly estimated as non-zero (false positives; FP); and (iv) incorrectly estimated as zero (false negatives; FN). We report the sensitivity (TP/(TP + FN)) and precision (TP/(TP + FP)). Generally speaking, we prefer methods with high sensitivity and high precision. High sensitivity indicates that the method has correctly estimated most of the truly non-zero parameters as non-zero. High precision, on the other hand, indicates that most of the estimated non-zero parameters are truly non-zero. For brevity, we only report the average sensitivity, precision, and Frobenius errors for the (n, p, q) = (100, 10, 10) setting in Table 1. We observed qualitatively similar results for the other two settings of dimension and report average performance in those settings in Tables S2 and S3 of the Supplementary Materials.
Table 1: Average (sd) sensitivity, precision, and Frobenius error for Ψ and Ω when (n, p, q) = (100, 10, 10) for each specification of Ω across 100 simulated datasets. For each choice of Ω, the best performance is bold-faced.
                          Ψ recovery                         Ω recovery
Method           SEN       PREC      FROB          SEN       PREC      FROB
AR(1) model
cgLASSO 0.88 (0.08) 0.44 (0.15) 0.13 (0.16) 0.78 (0.37) 0.55 (0.31) 31.93 (22.08)
CAR 0.86 (0.06) 0.31 (0.03) 0.04 (0.01) 1 (0) 0.3 (0.03) 4.16 (1.18)
CAR-A 0.87 (0.06) 0.59 (0.07) 0.02 (0.01) 1 (0) 0.83 (0.1) 2.75 (1.59)
cgSSL-dcpe 0.64 (0.05) 0.8 (0.16) 0.08 (0.05) 0.94 (0.11) 0.96 (0.07) 6.32 (6.64)
cgSSL-dpe 0.65 (0.05) 0.99 (0.03) 0.04 (0.01) 1 (0) 0.97 (0.05) 2.49 (1.12)
AR(2) model
cgLASSO 1 (0.02) 0.22 (0.06) 0.17 (0.09) 0.84 (0.29) 0.55 (0.17) 2.7 (1.66)
CAR 0.9 (0.06) 0.34 (0.04) 0.03 (0.01) 0.98 (0.03) 0.57 (0.06) 0.58 (0.21)
CAR-A 0.89 (0.05) 0.67 (0.08) 0.02 (0.01) 1 (0.02) 0.91 (0.06) 0.46 (0.32)
cgSSL-dcpe 0.96 (0.06) 0.43 (0.12) 0.45 (0.28) 0.24 (0.3) 0.63 (0.14) 5 (0.98)
cgSSL-dpe 0.73 (0.05) 1 (0.01) 0.02 (0.01) 1 (0) 0.86 (0.06) 0.38 (0.21)
Block model
cgLASSO 0.95 (0.05) 0.39 (0.18) 0.13 (0.11) 0.73 (0.38) 0.78 (0.21) 5.15 (2.27)
CAR 0.89 (0.06) 0.31 (0.03) 0.03 (0.01) 0.95 (0.02) 0.61 (0.06) 1.89 (0.75)
CAR-A 0.87 (0.06) 0.57 (0.07) 0.03 (0.01) 0.86 (0.07) 0.93 (0.05) 2.97 (1.22)
cgSSL-dcpe 0.76 (0.06) 0.29 (0.02) 0.28 (0.02) 0.01 (0.03) 0.71 (0.39) 8.85 (0.2)
cgSSL-dpe 0.69 (0.07) 0.99 (0.02) 0.03 (0.01) 0.71 (0.06) 0.95 (0.05) 3.28 (1.17)
Star model
cgLASSO 0.96 (0.04) 0.48 (0.14) 0.04 (0.02) 0.36 (0.41) 0.2 (0.18) 0.86 (0.35)
CAR 0.91 (0.05) 0.34 (0.03) 0.02 (0) 0.55 (0.18) 0.25 (0.08) 0.57 (0.29)
CAR-A 0.91 (0.04) 0.57 (0.06) 0.02 (0.01) 0.22 (0.14) 0.46 (0.24) 0.57 (0.26)
cgSSL-dcpe 0.83 (0.04) 0.96 (0.05) 0.01 (0) 0.05 (0.09) 0.9 (0.24) 0.22 (0.12)
cgSSL-dpe 0.79 (0.06) 0.99 (0.03) 0.01 (0.01) 0.09 (0.13) 0.71 (0.29) 0.29 (0.19)
Dense model
cgLASSO 0.92 (0.04) 0.57 (0.07) 0.03 (0.01) 0.88 (0.32) 1 (0) 16.93 (32.74)
CAR 0.85 (0.06) 0.28 (0.03) 0.04 (0.01) 0.03 (0.02) 1 (0) 92.51 (1.74)
CAR-A 0.84 (0.06) 0.4 (0.04) 0.04 (0.01) 0 (0.01) 1 (0) 96.04 (1.21)
cgSSL-dcpe 0.82 (0.03) 0.84 (0.06) 0.02 (0) 0.01 (0.02) 1 (0) 99.93 (0.39)
cgSSL-dpe 0.72 (0.07) 0.93 (0.06) 0.03 (0.01) 0.05 (0.04) 1 (0) 99.99 (0.98)
In terms of identifying non-zero direct effects (i.e., estimating the support of Ψ), cgLASSO consistently achieves the highest sensitivity. On further inspection, we found that the penalties selected by 10-fold cross-validation tended to be quite small, meaning that cgLASSO returned many non-zero ψ̂_{j,k}'s. As the precision results indicate, many of cgLASSO's "discoveries" were in fact false positives. The other fixed penalty method, CAR, similarly displayed somewhat high sensitivity and low precision. Interestingly, for several choices of Ω, the precisions of cgLASSO and CAR for recovering the support of Ψ were less than 0.5. Such low precisions indicate that most of the returned non-zero estimates were in fact false positives. In contrast, the methods that deployed adaptive penalties (CAR-A and both implementations of the cgSSL) displayed higher precision in estimating the support of Ψ. In fact, at least for estimating the support of Ψ, cgSSL-DPE made almost no false positives.

We observed essentially the same phenomenon for Ω: although the cgSSL generally returned fewer non-zero estimates of ω_{k,k'}, the vast majority of these estimates were true positives. In a sense, the fixed penalty methods (cgLASSO and CAR) cast a very wide net when searching for non-zero signal in Ψ and Ω, leading to a large number of false positive identifications in the supports of these matrices. Adaptive penalty methods, on the other hand, are much more discerning.

In terms of estimation performance, we found that the fixed penalty methods (cgLASSO and CAR) tended to have much larger Frobenius error, reflecting the well-documented bias introduced by L_1 regularization. The one exception was the misspecified setting where Ω was dense. Interestingly, for the four sparse Ω's, we did not observe any method achieving high Frobenius error for Ω but low Frobenius error for Ψ. This finding helps substantiate our intuition about Corollary 1: namely, in order to estimate Ψ well, one must estimate Ω well. Finally, like Deshpande et al. (2019), we found that the dynamic conditional posterior exploration implementation of the cgSSL performed slightly worse than the dynamic posterior exploration implementation.
6 Real data experiments
Claesson et al. (2012) studied the gut microbiota of elderly individuals using data sequenced
from fecal samples taken from 178 subjects. They were primarily interested in understanding
differences in the gut microbiome composition across several residence types (in the commu-
nity, day-hospital, rehabilitation, or in long-term residential care) and across several different
types of diet. We refer the reader to the Supplementary Notes and Supplementary Table 3
of Claesson et al. (2012) for more details. They found that the gut microbiomes of residents
in long-term care facilities were considerably less diverse than those of residents dwelling
in the community. They additionally reported that diet had a large marginal effect on gut
microbe diversity but they did not examine conditional or direct effects, which might align
more closely with the underlying biological mechanism. In this section, we re-analyze their data using the cgSSL to try to estimate the direct effects of each type of diet and residence
type on gut microbiome composition.
We pre-processed the raw 16S rRNA data in the MG-RAST server (Keegan et al., 2016); please see Section S4 of the Supplementary Materials for more details on the pre-processing. In all, we had n = 178 observations of p = 11 predictors and q = 14 taxa. Figure 3 shows the graphical model estimated by cgSSL-DPE. In the figure, edges are colored according to the sign of the effect, with blue edges corresponding to negative conditional correlation and red edges corresponding to positive conditional correlation. The edge widths correspond to the absolute value of the parameter, with wider edges indicating larger parameter values. We found a large number of edges between the different species, suggesting that there was considerable conditional dependence between their abundances after adjusting for the covariates. In fact, we found only two non-zero entries in Ψ. We estimated that percutaneous endoscopic gastrostomy (PEG), in which a feeding tube is inserted into the abdomen, had a negative direct effect on the abundance of Veillonella, which is involved in lactose fermentation. Reassuringly for us, our finding aligns with those of Takeshita et al. (2011), which reported a negative effect of PEG on this genus. We additionally found that staying in a day hospital had a positive direct effect on Caloramator.
[Figure 3 graphic omitted in this text version. Its nodes are the 14 genera (Alistipes, Bacteroides, Barnesiella, Blautia, Butyrivibrio, Caloramator, Clostridium, Eubacterium, Faecalibacterium, Hespellia, Parabacteroides, Ruminococcus, Selenomonas, Veillonella) and the 11 predictors (Age, GenderMale, StratumDayHospital, StratumLong-term, StratumRehab, Diet1, Diet2, Diet3, Diet4, DietPEG, BMI); edge widths scale with the absolute value of the estimated coefficients (legend: 0.2, 0.4, 0.6, 0.8).]
Figure 3: The estimated graphical model underlying Claesson et al. (2012)'s gut microbiome dataset. Each edge is annotated with the absolute value of the corresponding conditional regression coefficient; red represents positive (conditional) dependence and blue represents negative (conditional) dependence.
Our results suggest that the large marginal effects reported by Claesson et al. (2012) are a by-product of only a few direct effects and substantial residual conditional dependence between species. For instance, because PEG has a direct effect on Veillonella, which is conditionally correlated with Clostridium, Butyrivibrio, and Blautia, PEG displays a marginal effect on each of these other genera. In this way, the cgSSL can provide a more nuanced understanding of the underlying biological mechanism than simply estimating the matrix of marginal effects B = ΨΩ^{-1}. We note, however, that Claesson et al. (2012)'s dataset does not contain an exhaustive set of environmental and patient lifestyle predictors. Accordingly, our re-analysis is limited in the sense that, were we able to incorporate additional predictors, the estimated graphical model might be quite different.
7 Discussion
In the Gaussian chain graph model in Equation (1), Ψ is a matrix containing all of the direct effects of p predictors on q outcomes, while Ω is the residual precision matrix that encodes the conditional dependence relationships between the outcomes that remain after adjusting for the predictors. We have introduced the cgSSL procedure for obtaining simultaneously sparse estimates of Ψ and Ω. In our procedure, we formally specify spike-and-slab LASSO priors on the free elements of Ψ and Ω and use an ECM algorithm to maximize the posterior density. Our ECM algorithm iteratively solves a penalized maximum likelihood problem with self-adaptive penalties. Across several simulated datasets, the cgSSL demonstrated excellent support recovery and estimation performance, substantially out-performing competitors that deployed constant shrinkage penalties. We further characterized the asymptotic properties of cgSSL posteriors, establishing posterior concentration rates under relatively mild assumptions. To the best of our knowledge, these are the first such results for sparse Gaussian chain graph models.
Although our main theoretical result (Theorem 1) implies that a slightly modified version of the cgSSL procedure from Section 3 is asymptotically consistent, quantifying finite sample posterior uncertainty remains challenging. Several authors have proposed extensions of Newton and Raftery (1994)'s weighted likelihood bootstrap for quantifying posterior uncertainty. Basically, these procedures work by repeatedly maximizing a randomized objective formed by carefully re-weighting each term in the log-likelihood and the log-prior. In fact, Nie and Ročková (2022) recently deployed this strategy to quantify uncertainty in SSL posteriors for single-outcome regression in high dimensions. A key ingredient in Nie and Ročková (2022) is the introduction of an additional random location shift in the prior to offset the tendency of the SSL to return exactly sparse parameter estimates. In our Gaussian chain graph problem, introducing a similar shift is challenging due to the constraint that Ω be positive definite. Overcoming this difficulty is the subject of on-going work.
In many applications, analysts encounter multiple outcomes of mixed type (i.e. continuous
and discrete). In its current form, the cgSSL is not applicable to these situations. It is possi-
ble, however, to extend the cgSSL to model outcomes of mixed type using a strategy similar
to one found in Kowal and Canale (2020), which modeled discrete variables as truncated
and transformed versions of latent Gaussian random variables.
Acknowledgements
The authors are grateful to Ray Bai for helpful comments on the theoretical results and to
Gemma Moran for feedback on an early draft of the manuscript.
This work was supported by the National Institute of Food and Agriculture, United States
Department of Agriculture, Hatch project 1023699. This work was also supported by the
Department of Energy [DE-SC0021016 to C.S.L.]. Support for S.K.D. was provided by the
University of Wisconsin–Madison, Office of the Vice Chancellor for Research and Graduate
Education with funding from the Wisconsin Alumni Research Foundation.
This research was performed using the compute resources and assistance of the UW–Madison
Center For High Throughput Computing (CHTC) in the Department of Computer Sciences.
The CHTC is supported by UW–Madison, the Advanced Computing Initiative, the Wiscon-
sin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National
Science Foundation, and is an active member of the OSG Consortium, which is supported
by the National Science Foundation and the U.S. Department of Energy’s Office of Science.
References
Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal
Statistical Society: Series B (Methodological), 44(2):139–160.
Bai, R., Moran, G. E., Antonelli, J. L., Chen, Y., and Boland, M. R. (2020). Spike-and-slab
group LASSOs for grouped regression and sparse generalized additive models. Journal of
the American Statistical Association.
Bai, R., Roˇckov´a, V., and George, E. I. (2021). Spike-and-slab meets LASSO: A review of the
spike-and-slab LASSO. In Tadesse, M. and Vannucci, M., editors, Handbook of Bayesian
Variable Selection. Routledge.
Banerjee, O., Ghaoui, L. E., and D’Aspremont, A. (2008). Model selection through sparse
maximum likelihood estimation for multivariate Gaussian or binary data. Journal of
Machine Learning Research, 9:485–516.
Battson, M. L., Lee, D. M., Weir, T. L., and Gentile, C. L. (2018). The gut microbiota
as a novel regulator of cardiovascular function and disease. The Journal of Nutritional
Biochemistry, 56:1–15.
Belcheva, A., Irrazabal, T., Robertson, S. J., Streutker, C., Maughan, H., Rubino, S.,
Moriyama, E. H., Copeland, J. K., Surendra, A., Kumar, S., et al. (2014). Gut mi-
crobial metabolism drives transformation of MSH2-deficient colon epithelial cells. Cell,
158(2):288–299.
Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymp-
totic theory of independence. Oxford university press.
Boyd, S. P. and Barratt, C. H. (1991). Linear controller design: limits of performance.
Prentice-Hall.
Claesson, M. J., Jeffery, I. B., Conde, S., Power, S. E., O’connor, E. M., Cusack, S., Harris,
H. M., Coakley, M., Lakshminarayanan, B., O’Sullivan, O., et al. (2012). Gut microbiota
composition correlates with diet and health in the elderly. Nature, 488(7410):178–184.
Deshpande, S. K., Roˇckov´a, V., and George, E. I. (2019). Simultaneous variable and co-
variance selection with the multivariate spike-and-slab LASSO. Journal of Computational
and Graphical Statistics, 28(4):921–931.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation
with the graphical LASSO. Biostatistics, 9(3):432–441.
Frydenberg, M. (1990). The chain graph Markov property. Scandinavian Journal of Statis-
tics, 17(4):333–353.
Gan, L., Narisetty, N. N., and Liang, F. (2019a). Bayesian regularization for graphical models
with unequal shrinkage. Journal of the American Statistical Association, 114(527):1218–
1231.
Gan, L., Yang, X., Narisetty, N. N., and Liang, F. (2019b). Bayesian joint estimation of mul-
tiple graphical models. Advances in Neural Information Processing Systems (NeurIPS).
George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal
of the American Statistical Association, 88(423):881–889.
George, E. I. and Ročková, V. (2020). Comment: Regularization via Bayesian penalty
mixing. Technometrics, 62(4):438–442.
Ghosal, S. and van der Vaart, A. (2017). Fundamentals of nonparametric Bayesian inference,
volume 44. Cambridge University Press.
Guinane, C. M. and Cotter, P. D. (2013). Role of the gut microbiota in health and chronic
gastrointestinal disease: understanding a hidden metabolic organ. Therapeutic Advances
in Gastroenterology, 6(4):295–308.
Hills Jr, R. D., Pontefract, B. A., Mishcon, H. R., Black, C. A., Sutton, S. C., and Theberge,
C. R. (2019). Gut microbiome: profound implications for diet and disease. Nutrients,
11(7):1613.
Hsieh, C. J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2011). Sparse inverse covariance
matrix estimation using quadratic approximation. In Advances in Neural Information
Processing Systems (NeurIPS).
Kamada, N. and Núñez, G. (2014). Regulation of the immune system by the resident
intestinal bacteria. Gastroenterology, 146(6):1477–1488.
Keegan, K. P., Glass, E. M., and Meyer, F. (2016). MG-RAST, a metagenomics service
for analysis of microbial community structure and function. In Microbial Environmental
Genomics (MEG), pages 207–233. Springer.
Kim, D., Zeng, M. Y., and Núñez, G. (2017). The interplay between host immune cells and
gut microbiota in chronic inflammatory diseases. Experimental & Molecular Medicine,
49(5):e339–e339.
Kowal, D. R. and Canale, A. (2020). Simultaneous transformation and rounding (STAR)
models for integer-valued data. Electronic Journal of Statistics, 14(1):1744–1772.
Larsbrink, J., Rogers, T. E., Hemsworth, G. R., McKee, L. S., Tauzin, A. S., Spadiut, O.,
Klinter, S., Pudlo, N. A., Urs, K., Koropatkin, N. M., et al. (2014). A discrete genetic locus
confers xyloglucan metabolism in select human gut Bacteroidetes. Nature, 506(7489):498–
502.
Lauritzen, S. L. and Richardson, T. S. (2002). Chain graph models and their causal inter-
pretations. Journal of the Royal Statistical Society: Series B, 64(3):321–348.
Lauritzen, S. L. and Wermuth, N. (1989). Graphical models for associations between vari-
ables, some of which are qualitative and some quantitative. The Annals of Statistics,
17(1):31–57.
Li, Z., Mccormick, T., and Clark, S. (2019). Bayesian joint spike-and-slab graphical LASSO.
In Proceedings of the 36th International Conference on Machine Learning (ICML), pages
3877–3885. PMLR.
McCarter, C. and Kim, S. (2014). On sparse Gaussian chain graph models. Advances in
Neural Information Processing Systems (NeurIPS).
Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm:
a general framework. Biometrika, 80(2):267–278.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression.
Journal of the American Statistical Association, 83(404):1023–1032.
Moran, G. E., Roˇckov´a, V., and George, E. I. (2019). Variance prior forms for high-
dimensional Bayesian variable selection. Bayesian Analysis, 14(4):1091–1119.
Moran, G. E., Roˇckov´a, V., and George, E. I. (2021). Spike-and-slab LASSO biclustering.
The Annals of Applied Statistics, 15(1):148–173.
Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the weighted
likelihood bootstrap. Journal of the Royal Statistical Society: Series B, 56(1):3–26.
Nie, L. and Roˇckov´a, V. (2022). Bayesian bootstrap spike-and-slab LASSO. Journal of the
American Statistical Association.
Ning, B., Jeong, S., and Ghosal, S. (2020). Bayesian linear regression for multivariate
responses under group sparsity. Bernoulli, 26(3):2353–2382.
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria.
Roˇckov´a, V. and George, E. I. (2014). EMVS: The EM approach to Bayesian variable
selection. Journal of the American Statistical Association, 109(506):828–846.
Roˇckov´a, V. and George, E. I. (2018). The spike-and-slab LASSO. Journal of the American
Statistical Association, 113(521):431–444.
Roˇckov´a, V. and George, E. I. (2016). Fast Bayesian factor analysis via automatic rotations
to sparsity. Journal of the American Statistical Association, 111(516):1608–1622.
Scher, J. U., Sczesnak, A., Longman, R. S., Segata, N., Ubeda, C., Bielski, C., Rostron,
T., Cerundolo, V., Pamer, E. G., Abramson, S. B., et al. (2013). Expansion of intestinal
prevotella copri correlates with enhanced susceptibility to arthritis. eLife, 2:e01202.
Shen, Y. and Solis-Lemus, C. (2021). Bayesian conditional auto-regressive LASSO models
to learn sparse microbial networks with predictors. arXiv preprint arXiv:2012.08397.
Shreiner, A. B., Kao, J. Y., and Young, V. B. (2015). The gut microbiome in health and in
disease. Current Opinion in Gastroenterology, 31(1):69.
Singh, R. K., Chang, H.-W., Yan, D., Lee, K. M., Ucmak, D., Wong, K., Abrouk, M., Farah-
nik, B., Nakamura, M., Zhu, T. H., et al. (2017). Influence of diet on the gut microbiome
and implications for human health. Journal of Translational Medicine, 15(1):1–17.
Takeshita, T., Yasui, M., Tomioka, M., Nakano, Y., Shimazaki, Y., and Yamashita, Y.
(2011). Enteral tube feeding alters the oral indigenous microbiota in elderly adults. Applied
and Environmental Microbiology, 77(19):6739–6745.
Tang, Z., Shen, Y., Li, Y., Zhang, X., Wen, J., Qian, C., Zhuang, W., Shi, X., and Yi,
N. (2018). Group spike-and-slab LASSO generalized linear models for disease prediction
and associated genes detected by incorporating pathway information. Bioinformatics,
34(6):901–910.
Tang, Z., Shen, Y., Zhang, X., and Yi, N. (2017). The spike-and-slab LASSO generalized
linear models for prediction and associated genes detection. Genetics, 205:77–88.
Wang, Z., Klipfell, E., Bennett, B. J., Koeth, R., Levison, B. S., DuGar, B., Feldstein, A. E.,
Britt, E. B., Fu, X., Chung, Y.-M., et al. (2011). Gut flora metabolism of phosphatidyl-
choline promotes cardiovascular disease. Nature, 472(7341):57–63.
Zhang, C. H. and Zhang, T. (2012). A General theory of concave regularization for high-
dimensional sparse estimation problems. Statistical Science, 27(4):576–593.
Supplementary Materials
In Section S1 we derive the Expectation Conditional Maximization (ECM) algorithm used to
find the maximum a posteriori (MAP) estimates of Ψ and Ω in the cgSSL model. One of the
conditional maximization steps of that algorithm involves solving a CGLASSO problem. We
introduce a new algorithm, cgQUIC, to solve the general CGLASSO problem in Section S2.
Specifically, we show that the problem has a unique global optimum (Theorem S1) and that our
cgQUIC algorithm converges to this optimum (Theorem S2). Then, we present additional
results from the simulation study described in Section 5 of the main text in Section S3. In
Section S4, we detail the preprocessing steps we took to prepare the gut microbiome data
for analysis with the cgSSL. Finally, we state and prove our main asymptotic results in
Section S5.
S1 The cgSSL algorithm
In this section, we provide full details of the Expectation Conditional Maximization (ECM)
algorithm that is used in the cgSSL procedure. We describe the algorithm for a fixed set of
spike-and-slab penalties (λ0, λ1, ξ0, ξ1) and a fixed set of hyperparameters (aθ, bθ, aη, bη). For
notational brevity, we let Θ = {Ψ, θ, Ω, η} denote the set of four parameters of interest.
Recall from Section 3.3 of the main text that we wish to maximize the log posterior density

$$
\begin{aligned}
\log \pi(\Theta \mid \mathbf{Y}) = \; & \frac{n}{2}\log|\Omega| - \frac{1}{2}\operatorname{tr}\left[(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})\Omega(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})^{\top}\right] \\
& + \sum_{j=1}^{p}\sum_{k=1}^{q}\log\left(\theta\lambda_1 e^{-\lambda_1|\psi_{j,k}|} + (1-\theta)\lambda_0 e^{-\lambda_0|\psi_{j,k}|}\right) \\
& + \sum_{k=1}^{q-1}\sum_{k'>k}\log\left(\eta\xi_1 e^{-\xi_1|\omega_{k,k'}|} + (1-\eta)\xi_0 e^{-\xi_0|\omega_{k,k'}|}\right) \\
& - \sum_{k=1}^{q}\xi_0\,\omega_{k,k} + \log \mathbb{1}(\Omega \succ 0) \\
& + (a_\theta - 1)\log(\theta) + (b_\theta - 1)\log(1-\theta) \\
& + (a_\eta - 1)\log(\eta) + (b_\eta - 1)\log(1-\eta)
\end{aligned}
\tag{S1}
$$
Instead of optimizing log π(Θ | Y) directly, we use an ECM algorithm and iteratively update
the surrogate objective

$$
F(\Theta) = \mathbb{E}_{\delta}\left[\log \pi(\Theta, \delta \mid \mathbf{Y}) \mid \Theta\right],
$$

where log π(Θ, δ | Y) is the log-density of the posterior in an augmented model involving the
spike-and-slab indicators δ = {δ_{k,k'} : 1 ≤ k < k' ≤ q}. Note that the expectation is taken
with respect to the conditional posterior distribution of δ given Θ. In our augmented model,
δ_{k,k'} indicates whether ω_{k,k'} was drawn from the spike (δ_{k,k'} = 0) or the slab (δ_{k,k'} = 1).
Given Θ and the data Y, these indicators are conditionally independent with

$$
\mathbb{E}[\delta_{k,k'} \mid \mathbf{Y}, \Theta] = \frac{\eta\xi_1 e^{-\xi_1|\omega_{k,k'}|}}{\eta\xi_1 e^{-\xi_1|\omega_{k,k'}|} + (1-\eta)\xi_0 e^{-\xi_0|\omega_{k,k'}|}}.
$$
The surrogate objective F(Θ) is given by

$$
\begin{aligned}
F(\Theta) = \; & \frac{n}{2}\log|\Omega| + \operatorname{tr}(\mathbf{Y}^{\top}\mathbf{X}\Psi) - \frac{1}{2}\operatorname{tr}(\mathbf{Y}^{\top}\mathbf{Y}\Omega) - \frac{1}{2}\operatorname{tr}\left((\mathbf{X}\Psi)^{\top}(\mathbf{X}\Psi)\Omega^{-1}\right) \\
& + \sum_{j,k}\log\left(\theta\lambda_1 e^{-\lambda_1|\psi_{j,k}|} + (1-\theta)\lambda_0 e^{-\lambda_0|\psi_{j,k}|}\right) - \sum_{k<k'}\xi^{\star}_{k,k'}|\omega_{k,k'}| - \xi_1\sum_{k=1}^{q}\omega_{k,k} \\
& + (a_\theta - 1)\log\theta + (b_\theta - 1)\log(1-\theta) + (a_\eta - 1)\log\eta + (b_\eta - 1)\log(1-\eta)
\end{aligned}
\tag{S2}
$$

where ξ*_{k,k'} = ξ1 q*_{k,k'} + ξ0(1 − q*_{k,k'}) and

$$
q^{\star}(x, \eta) = \frac{\eta\xi_1 e^{-\xi_1|x|}}{\eta\xi_1 e^{-\xi_1|x|} + (1-\eta)\xi_0 e^{-\xi_0|x|}}.
$$
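For concreteness, the following is a minimal Python sketch of the E-step quantity q*(x, η) above (the function name q_star is ours, not part of our software):

```python
import numpy as np

def q_star(x, eta, xi0, xi1):
    """Posterior probability that omega_{k,k'} = x was drawn from the slab,
    i.e. E[delta_{k,k'} | Y, Theta] computed in the E-step."""
    slab = eta * xi1 * np.exp(-xi1 * np.abs(x))
    spike = (1.0 - eta) * xi0 * np.exp(-xi0 * np.abs(x))
    return slab / (slab + spike)
```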
Our ECM algorithm iteratively computes F(Θ) based on the current value of Θ (the E-step)
and then updates the value of Θ by performing two conditional maximizations (the CM-step).
More specifically, for t ≥ 1, if Θ^(t−1) is the value of Θ at the start of the tth iteration,
in the E-step we compute

$$
F^{(t)}(\Theta) = \mathbb{E}_{\delta}\left[\log \pi(\Theta, \delta \mid \mathbf{Y}) \mid \Theta = \Theta^{(t-1)}\right].
$$

We then compute Θ^(t) by first optimizing F^(t)(Θ) with respect to (Ψ, θ) while fixing (Ω, η) =
(Ω^(t−1), η^(t−1)) to obtain (Ψ^(t), θ^(t)). We then optimize F^(t)(Θ) with respect to (Ω, η) while
fixing (Ψ, θ) = (Ψ^(t), θ^(t)) to obtain (Ω^(t), η^(t)). That is, in the CM-step we solve the following
optimization problems

$$
(\Psi^{(t)}, \theta^{(t)}) = \arg\max_{\Psi, \theta}\; F^{(t)}(\Psi, \theta, \Omega^{(t-1)}, \eta^{(t-1)}) \tag{S3}
$$

$$
(\Omega^{(t)}, \eta^{(t)}) = \arg\max_{\Omega, \eta}\; F^{(t)}(\Psi^{(t)}, \theta^{(t)}, \Omega, \eta). \tag{S4}
$$

Once we solve the optimization problems in Equations (S3) and (S4), we set
Θ^(t) = (Ψ^(t), θ^(t), Ω^(t), η^(t)).
Our ECM algorithm iterates between the E-step and the CM-step until the percentage change
in the estimated entries of Ψ and Ω or in the log posterior density falls below some user-defined
tolerance. In our implementation, we have found that a tolerance of 10^{-3} works well. The
following subsections detail how we carry out each conditional maximization step.
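The overall iteration structure can be summarized by the following Python sketch. It is only a skeleton: the functions update_psi_theta, update_omega_eta, and log_posterior are hypothetical placeholders for the two conditional maximizations in Equations (S3)–(S4) and the objective in Equation (S1).

```python
def cgssl_ecm(Y, X, init, update_psi_theta, update_omega_eta,
              log_posterior, tol=1e-3, max_iter=500):
    """Skeleton of the ECM loop: alternate the two CM-steps until the
    relative change in the log posterior falls below the tolerance."""
    Psi, theta, Omega, eta = init
    old_lp = log_posterior(Psi, theta, Omega, eta, Y, X)
    for _ in range(max_iter):
        # CM-step 1: maximize over (Psi, theta) with (Omega, eta) held fixed (Eq. S3)
        Psi, theta = update_psi_theta(Psi, theta, Omega, eta, Y, X)
        # CM-step 2: maximize over (Omega, eta) with (Psi, theta) held fixed (Eq. S4)
        Omega, eta = update_omega_eta(Psi, theta, Omega, eta, Y, X)
        new_lp = log_posterior(Psi, theta, Omega, eta, Y, X)
        # stop once the relative change in the log posterior is small
        if abs(new_lp - old_lp) < tol * (abs(old_lp) + 1e-12):
            break
        old_lp = new_lp
    return Psi, theta, Omega, eta
```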
S1.1 Updating Ψ and θ
Fixing (Ω, η) = (Ω^(t−1), η^(t−1)), observe that

$$
\begin{aligned}
F^{(t)}(\Psi, \theta, \Omega^{(t-1)}, \eta^{(t-1)}) &= -\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})\Omega(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})^{\top}\right] + \log\pi(\Psi, \theta) \\
&= -\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})\Omega\Omega^{-1}\Omega(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})^{\top}\right] + \log\pi(\Psi, \theta) \\
&= -\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y}\Omega - \mathbf{X}\Psi)\Omega^{-1}(\mathbf{Y}\Omega - \mathbf{X}\Psi)^{\top}\right] + \log\pi(\Psi, \theta)
\end{aligned}
\tag{S5}
$$
where

$$
\begin{aligned}
\log\pi(\Psi, \theta) = \; & \sum_{j=1}^{p}\sum_{k=1}^{q}\log\left(\theta\lambda_1 e^{-\lambda_1|\psi_{j,k}|} + (1-\theta)\lambda_0 e^{-\lambda_0|\psi_{j,k}|}\right) \\
& + (a_\theta - 1)\log(\theta) + (b_\theta - 1)\log(1-\theta).
\end{aligned}
\tag{S6}
$$
We solve the optimization problem in Equation (S5) using a coordinate ascent strategy that
iteratively updates Ψ (resp. θ) while holding θ (resp. Ψ) fixed. We run the coordinate ascent
until the relative change in every active ψ_{j,k} falls below the user-defined tolerance.
Updating θ given Ψ. Notice that the objective in Equation (S5) depends on θ only through
the log π(Ψ, θ) term. Accordingly, to update θ conditionally on Ψ, it is enough to maximize
the expression in Equation (S6) as a function of θ while keeping all ψ_{j,k} terms fixed. We use
Newton's method for this optimization and terminate once the Newton step size falls below
the user-defined tolerance.
Updating Ψ given θ. With θ fixed, optimizing Equation (S5) is equivalent to solving

$$
\begin{aligned}
\Psi^{(t)} &= \arg\max_{\Psi}\left\{-\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y}\Omega - \mathbf{X}\Psi)\Omega^{-1}(\mathbf{Y}\Omega - \mathbf{X}\Psi)^{\top}\right] + \log\pi(\Psi \mid \theta)\right\} \\
&= \arg\max_{\Psi}\left\{-\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y}\Omega - \mathbf{X}\Psi)\Omega^{-1}(\mathbf{Y}\Omega - \mathbf{X}\Psi)^{\top}\right] + \sum_{j,k}\log\frac{\pi(\psi_{j,k} \mid \theta)}{\pi(0 \mid \theta)}\right\} \\
&= \arg\max_{\Psi}\left\{-\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y}\Omega - \mathbf{X}\Psi)\Omega^{-1}(\mathbf{Y}\Omega - \mathbf{X}\Psi)^{\top}\right] + \sum_{j,k}\operatorname{pen}(\psi_{j,k} \mid \theta)\right\}
\end{aligned}
\tag{S7}
$$

where

$$
\operatorname{pen}(\psi_{j,k} \mid \theta) = \log\frac{\pi(\psi_{j,k} \mid \theta)}{\pi(0 \mid \theta)} = -\lambda_1|\psi_{j,k}| + \log\frac{p^{\star}(\psi_{j,k}, \theta)}{p^{\star}(0, \theta)}.
$$
Following essentially the same arguments as those in Deshpande et al. (2019) and using
the fact that the columns of X have norm √n, the Karush-Kuhn-Tucker (KKT) condition for
the optimization problem in the final line of Equation (S7) tells us that the optimizer Ψ* satisfies

$$
\psi^{\star}_{j,k} = n^{-1}\left[|z_{j,k}| - \lambda^{\star}(\psi^{\star}_{j,k}, \theta)\right]_{+}\operatorname{sign}(z_{j,k}), \tag{S8}
$$

where

$$
\begin{aligned}
z_{j,k} &= n\psi_{j,k} + X_j^{\top}r_k + \sum_{k' \neq k}\frac{(\Omega^{-1})_{k,k'}}{(\Omega^{-1})_{k,k}}X_j^{\top}r_{k'}, \\
r_{k'} &= (\mathbf{Y}\Omega - \mathbf{X}\Psi)_{k'}, \\
\lambda^{\star}(\psi^{\star}_{j,k}, \theta) &= \lambda_1 p^{\star}(\psi^{\star}_{j,k}, \theta) + \lambda_0\left(1 - p^{\star}(\psi^{\star}_{j,k}, \theta)\right).
\end{aligned}
$$

The KKT condition immediately suggests a cyclical coordinate ascent strategy for solving
the problem in Equation (S7) that involves soft-thresholding the running estimates of ψ_{j,k}.
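A minimal Python sketch of this soft-thresholding step is given below; the helper names p_star and soft_threshold_update are ours, introduced purely for illustration of the KKT condition (S8) under the assumption that z_{j,k} has already been computed.

```python
import numpy as np

def p_star(x, theta, lam0, lam1):
    """Conditional probability that psi_{j,k} = x was drawn from the slab."""
    slab = theta * lam1 * np.exp(-lam1 * np.abs(x))
    spike = (1.0 - theta) * lam0 * np.exp(-lam0 * np.abs(x))
    return slab / (slab + spike)

def soft_threshold_update(z_jk, psi_jk, theta, lam0, lam1, n):
    """One soft-thresholding update suggested by the KKT condition (S8)."""
    p = p_star(psi_jk, theta, lam0, lam1)
    lam_star = lam1 * p + lam0 * (1.0 - p)   # adaptive penalty lambda*(psi, theta)
    return np.sign(z_jk) * max(np.abs(z_jk) - lam_star, 0.0) / n
```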
Like Ročková and George (2018) and Deshpande et al. (2019), we can, however, obtain a
more refined characterization of the global mode Ψ̃ = (ψ̃_{j,k}):

$$
\tilde\psi_{j,k} = n^{-1}\left[|z_{j,k}| - \lambda^{\star}(\tilde\psi_{j,k}, \theta)\right]_{+}\operatorname{sign}(z_{j,k}) \times \mathbb{1}\left(|z_{j,k}| > \Delta_{j,k}\right),
$$

where

$$
\Delta_{j,k} = \inf_{t > 0}\left\{\frac{nt}{2} - \frac{\operatorname{pen}(t \mid \theta)}{(\Omega^{-1})_{k,k}\,t}\right\}.
$$

Though the exact thresholds Δ_{j,k} are difficult to compute, they can be bounded using an
analog of Theorem 2.1 of Ročková and George (2018) and Proposition 2 of Deshpande et al.
(2019). Specifically, suppose we have (λ_0 − λ_1) > 2\sqrt{n(\Omega^{-1})_{k,k}} and (λ^{\star}(0, θ) − λ_1)^2 >
2n(\Omega^{-1})_{k,k}\,p^{\star}(0, θ). Then we have Δ^L_{j,k} ≤ Δ_{j,k} ≤ Δ^U_{j,k}, where:

$$
\begin{aligned}
\Delta^{L}_{j,k} &= \sqrt{-2n\left((\Omega^{-1})_{k,k}\right)^{-1}\log p^{\star}(0, \theta) - \left((\Omega^{-1})_{k,k}\right)^{-2}d} + \lambda_1/(\Omega^{-1})_{k,k} \\
\Delta^{U}_{j,k} &= \sqrt{-2n\left((\Omega^{-1})_{k,k}\right)^{-1}\log p^{\star}(0, \theta)} + \lambda_1/(\Omega^{-1})_{k,k}
\end{aligned}
$$

where d = (λ^{\star}(δ_{c+}, θ) − λ_1)^2 − 2n(\Omega^{-1})_{k,k}\log p^{\star}(δ_{c+}, θ) and δ_{c+} is the largest root of
pen''(x | θ) = (Ω^{-1})_{k,k}.

Our refined characterization of Ψ̃ suggests a cyclical coordinate descent strategy that combines
hard thresholding at Δ_{j,k} and soft-thresholding at λ^{\star}_{j,k}.
Remark 1. Equation (S5) and our approach to solving the optimization problem are extremely
similar to Equation 3 and the coordinate ascent strategy used in Deshpande et al.
(2019), who fit sparse marginal multivariate linear models with spike-and-slab LASSO priors.
This is because if Y ∼ N(XΨΩ^{-1}, Ω^{-1}) in our chain graph model, then YΩ ∼ N(XΨ, Ω).
Thus, if we fix the value of Ω, we can use any computational strategy for estimating marginal
effects in the multivariate linear regression model to estimate Ψ by working with the transformed
data Ỹ = YΩ.
S1.2 Updating Ω and η

Fixing Ψ = Ψ^(t) and θ = θ^(t), we compute Ω^(t) and η^(t) by optimizing the function
$$
\begin{aligned}
F^{(t)}(\Psi^{(t)}, \theta^{(t)}, \Omega, \eta) = \; & \frac{n}{2}\log|\Omega| - \operatorname{tr}(S\Omega) - \operatorname{tr}(M\Omega^{-1}) - \sum_{k<k'}\xi^{\star}_{k,k'}|\omega_{k,k'}| - \xi\sum_{k=1}^{q}\omega_{k,k} \\
& + \left(a_\eta - 1 + \sum_{k<k'}q^{\star}_{k,k'}\right)\log(\eta) + \left(b_\eta - 1 + q(q-1)/2 - \sum_{k<k'}q^{\star}_{k,k'}\right)\log(1-\eta)
\end{aligned}
\tag{S9}
$$

where S = (1/n)Y^⊤Y and M = (1/n)(XΨ)^⊤XΨ.

We immediately observe that the expression in Equation (S9) is separable in Ω and η, meaning
that we can compute Ω^(t) and η^(t) separately. Specifically, we have

$$
\eta^{(t)} = \frac{a_\eta - 1 + \sum_{k<k'}q^{\star}_{k,k'}}{a_\eta + b_\eta - 2 + q(q-1)/2} \tag{S10}
$$

and

$$
\Omega^{(t)} = \arg\max_{\Omega \succ 0}\left\{\frac{n}{2}\log|\Omega| - \operatorname{tr}(S\Omega) - \operatorname{tr}(M\Omega^{-1}) - \sum_{k<k'}\xi^{\star}_{k,k'}|\omega_{k,k'}| - \xi\sum_{k=1}^{q}\omega_{k,k}\right\}. \tag{S11}
$$
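Since Equation (S10) is available in closed form, the η update is a one-liner; the following sketch (with the hypothetical name update_eta) simply transcribes it, taking as input the E-step inclusion probabilities q*_{k,k'} of the off-diagonal entries:

```python
def update_eta(q_star_offdiag, a_eta, b_eta, q):
    """Closed-form update (S10) for eta given the E-step slab probabilities."""
    total = sum(q_star_offdiag)                      # sum over k < k'
    return (a_eta - 1.0 + total) / (a_eta + b_eta - 2.0 + q * (q - 1) / 2.0)
```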
The objective function in Equation (S11) is similar to a graphical LASSO (GLASSO; Friedman
et al., 2008) problem insofar as both problems involve a term like −log|Ω| + tr(SΩ) and
separable L1 penalties on the off-diagonal elements of Ω. However, Equation (S11) includes
an additional term tr(MΩ^{-1}), which does not appear in the GLASSO. This term arises from
the entanglement of Ψ and Ω in the Gaussian chain graph model, and we accordingly
call the problem in Equation (S11) the CGLASSO problem. We solve this problem
by (i) forming a quadratic approximation of the objective, (ii) computing a suitable Newton
direction, and (iii) following that Newton direction for a suitable step size. We detail this
solution strategy in Section S2.
S2 Chain graphical LASSO with cgQUIC
Equation (S11) is a specific instantiation of what we term the "chain graphical LASSO"
(CGLASSO) problem, whose general form is

$$
\arg\min_{\Omega \succ 0}\left\{-\log|\Omega| + \operatorname{tr}(S\Omega) + \operatorname{tr}(M\Omega^{-1}) + \sum_{k,k'}\xi_{k,k'}|\omega_{k,k'}|\right\} \tag{S12}
$$

where S and M are symmetric positive semi-definite q × q matrices; Ω is a symmetric positive
definite q × q matrix; and the ξ_{k,k'}'s are symmetric non-negative penalty weights (i.e., we
have ξ_{k,k'} = ξ_{k',k}).
Notice that when M is the zero matrix, the CGLASSO problem reduces to a general GLASSO
problem, which admits several computational solutions. One well-known solution is to solve
the dual problem, which involves minimization of a log determinant under an L∞ constraint
(Banerjee et al., 2008).

Unfortunately, the dual form of the CGLASSO problem does not have such a simple form.
To wit, the dual of the CGLASSO problem with uniform penalty ξ is given by:

$$
\min_{\|U\|_{\infty} \leq \xi}\;\max_{\Omega \succ 0}\;\log|\Omega| - \operatorname{tr}\left[(S + U)\Omega\right] - \operatorname{tr}\left[M\Omega^{-1}\right]
$$

The inner optimization over Ω can be solved by setting the derivative to 0; the optimal
value of Ω solves a special case of the continuous time algebraic Riccati equation (CARE) (Boyd
and Barratt, 1991):

$$
\Omega - \Omega(S + U)\Omega + M = 0
$$

Unfortunately, this problem does not have a closed form solution and solving it numerically
in every step of the cgSSL is computationally prohibitive.
We instead solve the CGLASSO problem using a suitably modified version of Hsieh et al.
(2011)'s QUIC algorithm for solving the GLASSO problem. At a high level, instead of using
the first order gradient or solving the dual problem, the algorithm is based on Newton's method
and uses a quadratic approximation. Basically, we sequentially cycle over the parameters
ω_{k,k'} and update each parameter by following a Newton direction for a suitable step size. The
step size is chosen to ensure that our running estimate of Ω remains positive definite while
also ensuring sufficient decrease in the overall objective. We call our solution cgQUIC,
which we summarize in Algorithm 1.
To describe cgQUIC, we first define the "smooth" part of the CGLASSO objective as g(Ω)
and the full objective function as f(Ω):

$$
g(\Omega) = -\log|\Omega| + \operatorname{tr}(S\Omega) + \operatorname{tr}(M\Omega^{-1}), \qquad
f(\Omega) = g(\Omega) + \sum_{k,k'}\xi_{k,k'}|\omega_{k,k'}|. \tag{S13}
$$
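The two objectives in Equation (S13) are straightforward to evaluate; the sketch below (in Python with numpy, the function names being ours) shows how, assuming S, M, and the penalty matrix Ξ are supplied as dense arrays:

```python
import numpy as np

def g_smooth(Omega, S, M):
    """Smooth part of the CGLASSO objective: -log|Omega| + tr(S Omega) + tr(M Omega^{-1})."""
    _, logdet = np.linalg.slogdet(Omega)          # Omega assumed positive definite
    Omega_inv = np.linalg.inv(Omega)
    return -logdet + np.trace(S @ Omega) + np.trace(M @ Omega_inv)

def f_objective(Omega, S, M, Xi):
    """Full objective f(Omega) = g(Omega) + sum_{k,k'} xi_{k,k'} |omega_{k,k'}|."""
    return g_smooth(Omega, S, M) + np.sum(Xi * np.abs(Omega))
```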
The function g(Ω) is twice differentiable and strictly convex. To see this, observe that g(Ω)
is (up to scaling and constants) the negative log-likelihood of a Gaussian chain graph model with known Ψ. Its Hessian is
the Fisher information of Ω and is positive definite. Writing W = Ω^{-1}, the second-order Taylor
expansion of the smooth part g(Ω), based at Ω and evaluated at Ω + ∆ for a symmetric ∆, is

$$
\begin{aligned}
\bar g_{\Omega}(\Delta) = \; & -\log|\Omega| + \operatorname{tr}(S\Omega) + \operatorname{tr}(MW) \\
& + \operatorname{tr}(S\Delta) - \operatorname{tr}(W\Delta) - \operatorname{tr}(WMW\Delta) \\
& + \frac{1}{2}\operatorname{tr}(W\Delta W\Delta) + \operatorname{tr}(WMW\Delta W\Delta)
\end{aligned}
\tag{S14}
$$
S2.1 Newton Direction
We now consider the coordinate descent update for the variable Ω_{k,k'} with k ≤ k'. Let D denote
the current approximation of the Newton direction and let D' be the updated direction. To
preserve symmetry, we set D' = D + µ(e_k e_{k'}^⊤ + e_{k'} e_k^⊤). Our goal, then, is to find the optimal µ:

$$
\arg\min_{\mu}\left\{\bar g_{\Omega}\left(D + \mu(e_k e_{k'}^{\top} + e_{k'} e_k^{\top})\right) + 2\xi_{k,k'}\left|\omega_{k,k'} + D_{k,k'} + \mu\right|\right\} \tag{S15}
$$

We begin by substituting ∆ = D' into ḡ_Ω(∆). Note that terms not depending on µ do not
affect the line search. Compared to QUIC, we have two additional terms, tr(WMW∆) and
tr(WMW∆W∆). The first term turns out to be linear in µ and the second is quadratic in µ.
Algorithm 1: The cgQUIC algorithm for the CGLASSO problem

Data: S = Y^⊤Y/n, M = (XΨ)^⊤(XΨ)/n, regularization parameter matrix Ξ, initial Ω_0,
      inner stopping tolerance ε, parameters 0 < σ < 0.5, 0 < β < 1
Result: path of positive definite Ω_t that converges to arg min_Ω f(Ω), with
        f(Ω) = g(Ω) + Σ_{k,k'} ξ_{k,k'}|ω_{k,k'}|, where g(Ω) = −log|Ω| + tr(SΩ) + tr(MΩ^{-1})

Initialize W_0 = Ω_0^{-1}
for t = 1, 2, ... do
    D = 0, U = 0
    Q = M W_{t−1}
    while not converged do
        Partition the variables into fixed and free sets based on the gradient:
            S_fixed := {(k, k') : |∇_{k,k'} g(Ω)| < ξ_{k,k'} and ω_{k,k'} = 0}
            S_free  := {(k, k') : |∇_{k,k'} g(Ω)| ≥ ξ_{k,k'} or ω_{k,k'} ≠ 0}
        for (k, k') ∈ S_free do
            Calculate the Newton direction:
                b = S_{k,k'} − W_{k,k'} + w_k^⊤ D w_{k'} − w_k^⊤ M w_{k'} + w_{k'}^⊤ D W M w_k + w_k^⊤ D W M w_{k'}
                c = ω_{k,k'} + D_{k,k'}
                if k ≠ k' then
                    a = W_{k,k'}^2 + W_{k,k} W_{k',k'} + W_{k,k} w_{k'}^⊤ M w_{k'} + W_{k',k'} w_k^⊤ M w_k + 2 W_{k,k'} w_k^⊤ M w_{k'}
                else
                    a = W_{k,k}^2 + 2 W_{k,k} w_k^⊤ M w_k
                end
                µ = −c + [|c − b/a| − ξ_{k,k'}/a]_+ sign(c − b/a)
                D_{k,k'} += µ;  u_k += µ w_{k'};  u_{k'} += µ w_k
        end
    end
    for α = 1, β, β², ... do
        Compute the Cholesky decomposition of Ω_{t−1} + αD
        if Ω_{t−1} + αD is not positive definite then continue
        Compute f(Ω_{t−1} + αD)
        if f(Ω_{t−1} + αD) ≤ f(Ω_{t−1}) + ασδ, with δ = tr[∇g(Ω_{t−1})^⊤ D] + ||Ω_{t−1} + D||_{1,Ξ} − ||Ω_{t−1}||_{1,Ξ}
            then break
    end
    Ω_t = Ω_{t−1} + αD
    W_t = Ω_t^{-1}, computed using the Cholesky decomposition result
end
return {Ω_t}
To see this, first observe

$$
\begin{aligned}
\operatorname{tr}(WMW\Delta) &= \operatorname{tr}\left(WMW\left(D + \mu(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})\right)\right) \\
&= C + \mu\operatorname{tr}\left(WMW e_k e_{k'}^{\top} + WMW e_{k'}e_k^{\top}\right) \\
&= C + \mu\left(e_{k'}^{\top}WMW e_k + e_k^{\top}WMW e_{k'}\right) \\
&= C + 2\mu\, e_k^{\top}WMW e_{k'} \\
&= C + 2\mu\, w_k^{\top}M w_{k'}
\end{aligned}
\tag{S16}
$$

where w_k is the kth column of W = Ω^{-1} and C collects terms that do not depend on µ.
Furthermore, we have

$$
\begin{aligned}
\operatorname{tr}(WMW\Delta W\Delta) &= \operatorname{tr}\left[WMW\left(D + \mu(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})\right)W\left(D + \mu(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})\right)\right] \\
&= \operatorname{tr}\left[WMWDWD\right] + 2\mu\operatorname{tr}\left[WMW(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})WD\right] \\
&\quad + \mu^2\operatorname{tr}\left[(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})WMW(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})W\right] \\
&= C + 2\mu\left(w_{k'}^{\top}DWMw_k + w_k^{\top}DWMw_{k'}\right) \\
&\quad + \mu^2\left(W_{k,k}\,w_{k'}^{\top}Mw_{k'} + W_{k',k'}\,w_k^{\top}Mw_k + 2W_{k,k'}\,w_k^{\top}Mw_{k'}\right)
\end{aligned}
\tag{S17}
$$
By combining the above simplifications, we can minimize the objective with coordinate
descent. The update for ω_{k,k'} minimizes

$$
\begin{aligned}
& \frac{1}{2}\left[W_{k,k'}^2 + W_{k,k}W_{k',k'} + W_{k,k}w_{k'}^{\top}Mw_{k'} + W_{k',k'}w_k^{\top}Mw_k + 2W_{k,k'}w_k^{\top}Mw_{k'}\right]\mu^2 \\
& + \left[S_{k,k'} - W_{k,k'} + w_k^{\top}Dw_{k'} - w_k^{\top}Mw_{k'} + w_{k'}^{\top}DWMw_k + w_k^{\top}DWMw_{k'}\right]\mu \\
& + \xi_{k,k'}\left|\omega_{k,k'} + D_{k,k'} + \mu\right|
\end{aligned}
\tag{S18}
$$
The optimal solution (for off-diagonal ω_{k,k'}) is given by

$$
\mu = -c + \left[|c - b/a| - \xi_{k,k'}/a\right]_{+}\operatorname{sign}(c - b/a) \tag{S19}
$$

where

$$
\begin{aligned}
a &= W_{k,k'}^2 + W_{k,k}W_{k',k'} + W_{k,k}w_{k'}^{\top}Mw_{k'} + W_{k',k'}w_k^{\top}Mw_k + 2W_{k,k'}w_k^{\top}Mw_{k'} \\
b &= S_{k,k'} - W_{k,k'} + w_k^{\top}Dw_{k'} - w_k^{\top}Mw_{k'} + w_{k'}^{\top}DWMw_k + w_k^{\top}DWMw_{k'} \\
c &= \omega_{k,k'} + D_{k,k'}
\end{aligned}
$$
For diagonal entries, we take D' = D + µ e_k e_k^⊤; the two terms involving ∆ are then:

$$
\begin{aligned}
\operatorname{tr}(WMW\Delta) &= C + \mu\, w_k^{\top}Mw_k \\
\operatorname{tr}(WMW\Delta W\Delta) &= C + 2\mu\, w_k^{\top}DWMw_k + \mu^2\, W_{k,k}\, w_k^{\top}Mw_k
\end{aligned}
\tag{S20}
$$

Then we can take

$$
\begin{aligned}
a &= W_{k,k}^2 + 2W_{k,k}w_k^{\top}Mw_k \\
b &= S_{k,k} - W_{k,k} + w_k^{\top}Dw_k - w_k^{\top}Mw_k + 2w_k^{\top}DWMw_k \\
c &= \omega_{k,k} + D_{k,k}
\end{aligned}
$$

and use Equation (S19) to obtain the optimal µ and thus the updated Newton direction D'.
Note that computing the optimal µ requires repeated calculation of quantities like w_k^⊤ M w_{k'}
and w_k^⊤ D W M w_{k'}. To enable rapid computation, we track and update the values of U = DW
and Q = MW during our optimization.
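For illustration, a minimal Python sketch of the closed-form coordinate update (S19) is given below; it assumes the scalars a, b, c and the penalty weight have already been computed from the expressions above (the function name is ours):

```python
import numpy as np

def newton_coordinate_mu(a, b, c, xi):
    """Closed-form minimizer (S19) of the one-dimensional quadratic-plus-L1 problem
    that updates a single coordinate of the Newton direction."""
    shifted = c - b / a
    return -c + np.sign(shifted) * max(np.abs(shifted) - xi / a, 0.0)
```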
S2.2 Step Size
Like Hsieh et al. (2011), we use Armijo's rule to set a step size α that simultaneously ensures
our estimate of Ω remains positive definite and a sufficient decrease of our overall objective
function. We denote the Newton direction after a complete update over all active coordinates
as D (see Appendix S2.3 for active sets). We require our step size to satisfy the line search
condition (S21):

$$
f(\Omega + \alpha D) \leq f(\Omega) + \alpha\sigma\delta, \qquad \delta = \operatorname{tr}\left[\nabla g(\Omega)^{\top}D\right] + \|\Omega + D\|_{1,\Xi} - \|\Omega\|_{1,\Xi} \tag{S21}
$$
Three important properties can be established following Hsieh et al. (2011):

P1. The condition is satisfied for small enough α. This property follows exactly from
Proposition 1 of Hsieh et al. (2011).

P2. We have δ < 0 for all D ≠ 0, which ensures that the objective function decreases. This
property generally follows from Lemma 2 and Proposition 2 of Hsieh et al. (2011), which
require the Hessian of the smooth part g(Ω) to be positive definite. In our case the
Hessian of g(Ω) is the Fisher information of the chain graph model, ensuring its positive
definiteness.

P3. When Ω is close to the global optimum, the step size α = 1 will satisfy the line search
condition. To establish this, we follow the proof of Proposition 3 in Hsieh et al. (2011).
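The backtracking search implied by condition (S21) can be sketched as follows in Python; f and grad_g are assumed to evaluate the CGLASSO objective and the gradient of its smooth part (e.g., via the helpers sketched earlier), and the function name armijo_step is ours:

```python
import numpy as np

def armijo_step(Omega, D, f, grad_g, Xi, sigma=0.25, beta=0.5, max_backtracks=30):
    """Backtracking line search enforcing positive definiteness and the
    sufficient-decrease condition (S21)."""
    f0 = f(Omega)
    delta = np.trace(grad_g(Omega).T @ D) \
        + np.sum(Xi * np.abs(Omega + D)) - np.sum(Xi * np.abs(Omega))
    alpha = 1.0
    for _ in range(max_backtracks):
        candidate = Omega + alpha * D
        try:
            np.linalg.cholesky(candidate)          # positive definiteness check
        except np.linalg.LinAlgError:
            alpha *= beta
            continue
        if f(candidate) <= f0 + alpha * sigma * delta:
            return alpha, candidate
        alpha *= beta
    return alpha, Omega + alpha * D
```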
S2.3 Thresholding to Decide the Active Sets
Similar to the QUIC procedure, our algorithm does not need to update every ω_{k,k'} in each
iteration. We instead follow Hsieh et al. (2011) and only update those parameters exceeding
a certain threshold. More specifically, we can partition the parameters ω_{k,k'} into a fixed set
S_fixed, containing those parameters falling below the threshold, and a free set S_free, containing
those parameters exceeding the threshold. That is,

$$
\begin{aligned}
\omega_{k,k'} \in S_{\mathrm{fixed}} \quad &\text{if } |\nabla_{k,k'}\,g(\Omega)| \leq \xi_{k,k'} \text{ and } \omega_{k,k'} = 0, \\
\omega_{k,k'} \in S_{\mathrm{free}} \quad &\text{otherwise.}
\end{aligned}
\tag{S22}
$$
We can determine the free set S_free using the minimum-norm sub-gradient grad^S_{k,k'} f(Ω), which
is defined in Definition 2 of Hsieh et al. (2011). In our case ∇g(Ω) = S − Ω^{-1} − Ω^{-1}MΩ^{-1},
so the minimum-norm sub-gradient is

$$
\operatorname{grad}^{S}_{k,k'} f(\Omega) =
\begin{cases}
\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'} + \xi_{k,k'} & \text{if } \omega_{k,k'} > 0 \\
\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'} - \xi_{k,k'} & \text{if } \omega_{k,k'} < 0 \\
\operatorname{sign}\!\left(\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'}\right)\left[\left|\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'}\right| - \xi_{k,k'}\right]_{+} & \text{if } \omega_{k,k'} = 0
\end{cases}
\tag{S23}
$$
Note that the subgradient evaluated on the fixed set is always equal to 0. Thus, following
Lemma 4 in Hsieh et al. (2011), the elements of the fixed set do not change during our
coordinate descent procedure. It suffices, then, to only compute the Newton direction on
the free set and update those parameters.
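A simple sketch of the fixed/free partition in (S22), written in Python with numpy (the function name is ours), is:

```python
import numpy as np

def split_active_sets(Omega, grad_g_Omega, Xi):
    """Partition the entries of Omega into fixed and free sets as in (S22):
    an entry is fixed when it equals zero and its gradient is below the penalty;
    otherwise it is free and will be updated."""
    fixed = (Omega == 0) & (np.abs(grad_g_Omega) <= Xi)
    free = ~fixed
    return fixed, free
```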
S2.4 Unique minimizer
In this subsection, we show that the CGLASSO problem admits a unique minimizer. Our
proof largely follows the proofs of Lemma 3 and Theorem 1 of Hsieh et al. (2011) but makes
suitable modifications to account for the extra tr(MΩ^{-1}) term in the CGLASSO objective.

Theorem S1 (Unique minimizer). There is a unique global minimum for the CGLASSO
problem (S12).
We first show the entire sequence of iterates {Ω_t} lies in a particular, compact level set. To
this end, let

$$
U = \{\Omega \mid f(\Omega) \leq f(\Omega_0), \; \Omega \in S^{q}_{++}\}. \tag{S24}
$$
To see that all iterates lie in U, we need to check that the line search condition in
Equation (S21) has δ < 0. By directly applying Hsieh et al. (2011)'s Lemma 2 to g(Ω), we have
that

$$
\delta \leq -\operatorname{vec}(D)^{\top}\nabla^2 g(\Omega)\operatorname{vec}(D),
$$

where D is the Newton direction. Since g(Ω) (Equation (S13)) is strictly convex, ∇²g(Ω) is positive
definite, so the function value f(Ω_t) is always decreasing.

We now need to check that the level set is actually contained in a compact set, by suitably
adapting Lemma 3 of Hsieh et al. (2011).
Lemma S1. The level set U defined in (S24) is contained in the set {mI ⪯ Ω ⪯ NI} for
some constants m, N > 0, if we assume that the off-diagonal elements of Ξ and the diagonal
elements of S are positive.
Proof. We begin by showing that the largest eigenvalue of Ω is bounded by some constant
that does not depend on Ω. Recall that S and M are positive semi-definite. Since Ω is
positive definite, we have tr(SΩ) + tr(MΩ^{-1}) > 0 and ||Ω||_{1,Ξ} + tr(MΩ^{-1}) > 0.
Therefore we have

$$
\begin{aligned}
f(\Omega_0) &\geq f(\Omega) \geq -\log(|\Omega|) + \|\Omega\|_{1,\Xi} \\
f(\Omega_0) &\geq f(\Omega) \geq -\log(|\Omega|) + \operatorname{tr}(S\Omega)
\end{aligned}
\tag{S25}
$$
Since ||Ω||_2 is the largest eigenvalue of Ω, we have log(|Ω|) ≤ q log(||Ω||_2).

Using the assumption that each off-diagonal entry of Ξ is larger than some positive number ξ,
we know that

$$
\xi\sum_{k \neq k'}|\Omega_{k,k'}| \leq \|\Omega\|_{1,\Xi} \leq f(\Omega_0) + q\log(\|\Omega\|_2) \tag{S26}
$$

Similarly, we have

$$
\operatorname{tr}(S\Omega) \leq f(\Omega_0) + q\log(\|\Omega\|_2) \tag{S27}
$$
Let α = min_k(S_{k,k}) and β = max_{k≠k'} |S_{k,k'}|. We can split tr(SΩ) into two parts, which
can be further lower bounded:

$$
\operatorname{tr}(S\Omega) = \sum_{k}S_{k,k}\Omega_{k,k} + \sum_{k \neq k'}S_{k,k'}\Omega_{k,k'} \geq \alpha\operatorname{tr}(\Omega) - \beta\sum_{k \neq k'}|\Omega_{k,k'}| \tag{S28}
$$

Since ||Ω||_2 ≤ tr(Ω), by using Equation (S28) we have

$$
\alpha\|\Omega\|_2 \leq \alpha\operatorname{tr}(\Omega) \leq \operatorname{tr}(\Omega S) + \beta\sum_{k \neq k'}|\Omega_{k,k'}| \tag{S29}
$$
By combining Equations (S26), (S27), and (S29), we conclude that

$$
\alpha\|\Omega\|_2 \leq (1 + \beta/\xi)\left(f(\Omega_0) + q\log(\|\Omega\|_2)\right) \tag{S30}
$$

The left hand side, as a function of ||Ω||_2, grows much faster than the right hand side. Thus
||Ω||_2 can be bounded by a quantity N depending only on the value of f(Ω_0), α, β and ξ.

We now consider the smallest eigenvalue, denoted by a. We use the upper bound N on the
other eigenvalues to bound the determinant. Using the fact that f(Ω) is always decreasing
during the iterations, we have

$$
f(\Omega_0) \geq f(\Omega) \geq -\log(|\Omega|) \geq -\log(a) - (q-1)\log(N) \tag{S31}
$$

Thus m = e^{-f(\Omega_0)}N^{-q+1} is a lower bound for the smallest eigenvalue a.
We are now ready to prove Theorem S1 by showing the objective function is strongly convex
on a compact set.

Proof. Because of Lemma S1, the level set U contains all iterates produced by cgQUIC. The
set U is further contained in the compact set {mI ⪯ Ω ⪯ NI}. By the Weierstrass extreme
value theorem, the continuous function f(Ω) in (S13) attains its minimum on this set.

Further, the modified objective function is also strongly convex in its smooth part. This is
because tr(MΩ^{-1}) and tr(SΩ) are convex and −log(|Ω|) is strongly convex on this compact set. Since tr(MΩ^{-1})
is convex, the Hessian of the smooth part has the same lower bound as in Theorem 1 of
Hsieh et al. (2011). By following the argument in the proof of Theorem 1 of Hsieh et al.
(2011), we can show the objective function f(Ω) is strongly convex on the compact set
{mI ⪯ Ω ⪯ NI}, and thus has a unique minimizer.
We can further show that the cgQUIC procedure converges to the unique minimizer, using
the general results on quadratic approximation methods studied in Hsieh et al. (2011).

Theorem S2 (Convergence). cgQUIC converges to the global optimum.

Proof. cgQUIC is an example of the quadratic approximation method investigated in Section
4.1 of Hsieh et al. (2011), with a strongly convex smooth part g(Ω) in (S13). Convergence to
the global optimum follows from their Theorem 2.
S3 Synthetic experiment results
We now present the remaining results from our simulation experiments. These results are
qualitatively similar to those from the (n, p, q) = (100, 10, 10) setting presented in the main
text. Generally speaking, in terms of support recovery, the methods that deployed a single
fixed penalty (cgLASSO and CAR) displayed higher sensitivity but lower precision than both
cgSSL-DPE and cgSSL-DCPE. The only exception was when Ω was dense. Furthermore,
methods with adaptive penalties (both cgSSL procedures and CAR-A) tended to return
fewer non-zero estimates than the fixed-penalty methods. Most of these non-zero estimates
were in fact true positives. Across all settings of (n, p, q), cgSSL-DPE makes virtually no
false positive identifications in the support of Ψ. In terms of parameter estimation, the fixed
penalty methods tended to have larger Frobenius error in estimating both Ψ and Ω than
the cgSSL. Note that cgLASSO uses ten-fold cross-validation to set the two penalty levels.
Even with a parallel implementation and warm-starts, the full cgLASSO procedure did not
converge after 72 hours in the n = 400 setting.
Figure S4: Sensitivity, specificity and Frobenius loss of parameter estimation when p = 10, q = 10, n = 100.
Figure S5: Sensitivity, specificity and Frobenius loss of parameter estimation when p = 20, q = 30, n = 100.
Figure S6: Sensitivity, specificity and Frobenius loss of parameter estimation when p = 10, q = 30, n = 400. cgLASSO was not able to finish with dense Ω within 72 hours, so we omit that result.
Table S2: Sensitivity, precision, and Frobenius error for Ψ and Ω when (n, p, q) = (100, 20, 30)
for each specification of Ω. For each choice of Ω, the best performance is bold-faced.
                      Ψ recovery                          Ω recovery
Method        SEN        PREC        FROB        SEN        PREC        FROB
AR(1) model
cgLASSO 0.94 (0.09) 0.3 (0.14) 0.21 (0.21) 0.74 (0.42) 0.48 (0.36) 111.66 (66.15)
CAR 0.54 (0.05) 0.39 (0.03) 0.11 (0.02) 1 (0.01) 0.21 (0.01) 11.15 (2.35)
CAR-A 0.69 (0.03) 0.69 (0.04) 0.04 (0.01) 1 (0) 0.74 (0.06) 15.07 (4.99)
cgSSL-dcpe 0.66 (0.02) 0.82 (0.07) 0.08 (0.02) 0.94 (0.04) 0.68 (0.07) 34.1 (19.97)
cgSSL-dpe 0.69 (0.02) 1 (0.01) 0.02 (0) 1 (0) 0.82 (0.07) 4.87 (1.73)
AR(2) model
cgLASSO 0.94 (0.07) 0.3 (0.13) 0.17 (0.08) 0.98 (0.12) 0.18 (0.12) 14.44 (6.93)
CAR 0.42 (0.04) 0.37 (0.03) 0.15 (0.02) 0.38 (0.08) 0.25 (0.04) 15.8 (2.68)
CAR-A 0.64 (0.04) 0.76 (0.04) 0.07 (0.02) 0.91 (0.05) 0.87 (0.04) 3.16 (1.15)
cgSSL-dcpe 0.73 (0.02) 0.7 (0.06) 0.05 (0.01) 0.81 (0.09) 0.32 (0.05) 14.5 (4.21)
cgSSL-dpe 0.72 (0.02) 0.99 (0.01) 0.01 (0) 1 (0) 0.48 (0.03) 1.04 (0.35)
Block model
cgLASSO 0.92 (0.07) 0.4 (0.19) 0.51 (0.46) 0.62 (0.46) 0.93 (0.06) 27.48 (6.01)
CAR 0.51 (0.05) 0.4 (0.03) 0.12 (0.02) 0.53 (0.04) 0.68 (0.03) 12.96 (2.49)
CAR-A 0.66 (0.04) 0.64 (0.04) 0.06 (0.01) 0.47 (0.03) 0.96 (0.02) 23.14 (4.16)
cgSSL-dcpe 0.82 (0.1) 0.29 (0.19) 0.68 (0.19) 0.1 (0.28) 0.88 (0.1) 30.22 (2.11)
cgSSL-dpe 0.61 (0.02) 0.99 (0.02) 0.07 (0.02) 0.66 (0.36) 0.9 (0.05) 30.39 (4.02)
Star model
cgLASSO 0.91 (0.03) 0.45 (0.04) 0.08 (0.02) 0.7 (0.18) 0.31 (0.14) 6.45 (3.27)
CAR 0.45 (0.06) 0.41 (0.04) 0.14 (0.02) 0.32 (0.09) 0.12 (0.03) 4.36 (1.07)
CAR-A 0.69 (0.05) 0.68 (0.03) 0.06 (0.02) 0.31 (0.09) 0.35 (0.09) 2.83 (0.69)
cgSSL-dcpe 0.77 (0.02) 0.94 (0.03) 0.01 (0) 0.61 (0.21) 0.57 (0.08) 0.6 (0.21)
cgSSL-dpe 0.73 (0.02) 1 (0) 0.01 (0) 0.83 (0.08) 0.54 (0.1) 0.89 (0.32)
Dense model
cgLASSO 0.89 (0.04) 0.43 (0.05) 0.07 (0.03) 0.34 (0.41) 1 (0) 712.46 (354.87)
CAR 0.49 (0.06) 0.39 (0.03) 0.13 (0.02) 0.05 (0.01) 1 (0) 914.47 (6.45)
CAR-A 0.7 (0.04) 0.64 (0.04) 0.05 (0.01) 0.01 (0.01) 1 (0) 897.91 (5.95)
cgSSL-dcpe 0.77 (0.01) 0.99 (0.01) 0.01 (0) 0 (0.01) 1 (0) 900 (0.01)
cgSSL-dpe 0.72 (0.02) 1 (0.01) 0.01 (0.01) 0.03 (0.03) 1 (0) 901.45 (2.97)
Table S3: Sensitivity, precision, and Frobenius error for Ψ and Ω when (n, p, q) =
(400, 100, 30) for each specification of Ω. For each choice of Ω, the best performance is
bold-faced. The cgLASSO method was not able to finish within 72 hours.
                      Ψ recovery                          Ω recovery
Method        SEN        PREC        FROB        SEN        PREC        FROB
AR(1) model
cgLASSO 1 (0) 0.2 (0) 0.07 (0.11) 0.94 (0.23) 0.46 (0.14) 27.98 (40.98)
CAR 0.82 (0.02) 0.46 (0.01) 0.02 (0) 1 (0) 0.27 (0.02) 2.23 (0.51)
CAR-A 0.86 (0.01) 0.73 (0.02) 0.01 (0) 1 (0) 0.89 (0.05) 7.32 (1.48)
cgSSL-dcpe 0.74 (0.01) 0.89 (0.03) 0.07 (0) 1 (0) 0.46 (0.03) 70.84 (3.72)
cgSSL-dpe 0.87 (0.01) 0.99 (0) 0 (0) 1 (0) 0.78 (0.06) 3.42 (0.8)
AR(2) model
cgLASSO 1 (0) 0.2 (0) 0.15 (0.05) 0.79 (0.23) 0.63 (0.22) 10.82 (4.04)
CAR 0.85 (0.02) 0.5 (0.01) 0.01 (0) 0.98 (0.02) 0.49 (0.03) 0.38 (0.15)
CAR-A 0.89 (0.01) 0.77 (0.02) 0.01 (0) 1 (0.01) 0.94 (0.03) 1.22 (0.24)
cgSSL-dcpe 0.87 (0.01) 0.79 (0.14) 0.04 (0.02) 1 (0) 0.31 (0.05) 10.5 (6.66)
cgSSL-dpe 0.92 (0) 1 (0) 0 (0) 1 (0) 0.47 (0.03) 0.26 (0.08)
Block model
cgLASSO 1 (0) 0.2 (0) 0.44 (0.25) 0.87 (0.21) 0.97 (0.11) 10.05 (11.21)
CAR 0.84 (0.02) 0.46 (0.01) 0.02 (0) 0.71 (0.03) 0.76 (0.02) 3.36 (0.24)
CAR-A 0.88 (0.01) 0.7 (0.02) 0.01 (0) 0.75 (0.02) 0.99 (0.01) 4.13 (0.5)
cgSSL-dcpe 0.9 (0.01) 0.22 (0) 0.82 (0.04) 0 (0.01) 0.64 (NA) 29.51 (0.24)
cgSSL-dpe 0.86 (0.01) 0.99 (0.01) 0.01 (0) 0.98 (0.02) 0.98 (0.01) 1.44 (0.43)
Star model
cgLASSO 0.93 (0) 0.83 (0.02) 0.01 (0) 0.53 (0.41) 0.59 (0.45) 4.68 (3.43)
CAR 0.89 (0.01) 0.48 (0.01) 0.01 (0) 0.73 (0.09) 0.25 (0.03) 0.55 (0.1)
CAR-A 0.91 (0.01) 0.7 (0.02) 0.01 (0) 0.87 (0.07) 0.74 (0.06) 1.07 (0.18)
cgSSL-dcpe 0.88 (0) 1 (0) 0 (0) 1 (0) 0.89 (0.05) 0.29 (0.08)
cgSSL-dpe 0.89 (0) 1 (0) 0 (0) 1 (0) 0.9 (0.05) 0.27 (0.06)
Dense model
cgLASSO
CAR 0.87 (0.02) 0.39 (0.01) 0.01 (0) 0 (0) NaN (NA) 964.24 (9.25)
CAR-A 0.88 (0.01) 0.52 (0.01) 0.01 (0) 0 (0) NaN (NA) 964.08 (9.71)
cgSSL-dcpe 0.87 (0.01) 0.94 (0.02) 0.03 (0) 0.21 (0.02) 1 (0) 913.81 (2.33)
cgSSL-dpe 0.86 (0.01) 0.98 (0.01) 0.04 (0.01) 0.26 (0.01) 1 (0) 918.35 (4.57)
S4 Preprocessing for real data experiment
To conduct our reanalysis of Claesson et al. (2012)'s gut microbiome data, we preprocessed
the raw 16s-rRNAseq data following the workflow provided by the MG-RAST server (Keegan
et al., 2016). We first "annotated" the sequences to get genus counts (i.e., the number of segments
belonging to each genus). The annotation process compares the rRNA segments detected during
sequencing to the reference sequence of each genus of microbes, and then counts the number of
rRNA segments that match each genus. We used the MG-RAST server's default tuning
parameters during the annotation process. That is, we set the e-value to 5 and annotated
with 60% identity, an alignment length of 15 bp, and a minimal abundance of 10 reads.

Following standard practices for analyzing microbiome data, we transformed raw counts into
relative abundances. We selected genera with more than 0.5% relative abundance in more than
50 samples as the focal genera and aggregated all other genera into a reference group. We
further took the log-odds (with respect to the reference group described above) to stabilize
the variances (Aitchison, 1982) in order to fit our normal model.
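The transformation just described amounts to an additive log-ratio style map from counts to log-odds; a minimal Python sketch is shown below. The function name and input layout (a samples-by-genera count matrix and a list of focal column indices) are illustrative assumptions, not the exact code used in our pipeline.

```python
import numpy as np

def log_odds_transform(counts, focal_idx):
    """Convert raw genus counts to relative abundances, aggregate all non-focal
    genera into a reference group, and take log-odds against that reference."""
    counts = np.asarray(counts, dtype=float)           # samples x genera
    rel = counts / counts.sum(axis=1, keepdims=True)   # relative abundance
    focal = rel[:, focal_idx]
    reference = 1.0 - focal.sum(axis=1, keepdims=True) # aggregated reference group
    return np.log(focal / reference)
```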
S5 Proofs of posterior contraction for cgSSL
This section provides details on the posterior contraction results for the cgSSL. Our proof
was inspired by Ning et al. (2020) and Bai et al. (2020). We first show contraction in
log-affinity by verifying the Kullback-Leibler (KL) and test conditions following Ghosal and van der Vaart
(2017). Then we use the results in log-affinity to show recovery of the parameters.
To establish our results, we work with a slightly modified prior on Ω that has density

$$
f_{\Omega}(\Omega) \propto \prod_{k > k'}\left[(1-\eta)\frac{\xi_0}{2}\exp\left(-\xi_0|\omega_{k,k'}|\right) + \eta\frac{\xi_1}{2}\exp\left(-\xi_1|\omega_{k,k'}|\right)\right] \times \prod_{k}\xi\exp\left(-\xi\omega_{k,k}\right) \times \mathbb{1}(\Omega \succeq \tau I) \tag{S32}
$$

$$
f_{\Psi}(\Psi) = \prod_{j,k}\left[(1-\theta)\frac{\lambda_0}{2}\exp\left(-\lambda_0|\psi_{j,k}|\right) + \theta\frac{\lambda_1}{2}\exp\left(-\lambda_1|\psi_{j,k}|\right)\right] \tag{S33}
$$

where 0 < τ < 1/b_2. This way, τ is less than the lower bound of the smallest eigenvalue of
the true precision matrix Ω_0.
S5.1 The Kullback-Leibler condition
We need to verify that our prior places enough probability in small neighborhoods around
each of the possible values of the true parameters. These neighborhoods are defined in a KL
sense.

Lemma S2 (KL conditions). Let ε_n = \sqrt{\max\{p, q, s^{\Omega}_0, s^{\Psi}_0\}\log(\max\{p, q\})/n}. Then for all
true parameters (Ψ_0, Ω_0) we have

$$
-\log \Pi\left((\Psi, \Omega) : K(f_0, f) \leq n\epsilon_n^2, \; V(f_0, f) \leq n\epsilon_n^2\right) \leq C_1 n\epsilon_n^2.
$$

Further, let E_n be the event

$$
E_n = \left\{\mathbf{Y} : \int\int f/f_0\, d\Pi(\Psi)\, d\Pi(\Omega) \geq e^{-C_1 n\epsilon_n^2}\right\}.
$$

Then for all (Ψ_0, Ω_0), we have P_0(E^c_n) → 0 as n → ∞.
The last assertion, that P_0(E^c_n) → 0, follows from Lemma 8.1 of Ghosal and van der Vaart
(2017), so we focus on establishing the first assertion of the Lemma. To verify this
condition we need to bound the prior mass of certain events A. However, the truncation of
the prior on Ω makes computing these masses intractable. To overcome this, we first bound
the prior probability of events of the form A ∩ {Ω ⪰ τI} by observing that the prior on Ω can be
viewed as a particular conditional distribution.
Specifically, let Π̃ be the untruncated spike-and-slab LASSO prior with density

$$
\tilde f_{\Omega}(\Omega) = \prod_{k > k'}\left[(1-\eta)\frac{\xi_0}{2}\exp\left(-\xi_0|\omega_{k,k'}|\right) + \eta\frac{\xi_1}{2}\exp\left(-\xi_1|\omega_{k,k'}|\right)\right] \times \prod_{k}\xi\exp\left(-\xi\omega_{k,k}\right).
$$
The following Lemma shows that we can bound Π probabilities using Π̃ probabilities.

Lemma S3 (Bounds for the graphical prior). Let Π̃ be the untruncated version of the prior
on Ω. Then for all events A, for large enough n there is a number R that does not depend
on n such that

$$
\tilde\Pi(\Omega \succeq \tau I \mid A)\,\tilde\Pi(A) \leq \Pi\left(A \cap \{\Omega \succeq \tau I\}\right) \leq \exp\left(2\xi Q - \log(R)\right)\tilde\Pi(A) \tag{S34}
$$

where Q = q(q−1)/2 is the total number of free off-diagonal entries in Ω.
Proof. Consider an event of the form A ∩ {Ω ⪰ τI} ⊂ R^{q×q}. The prior mass Π(A ∩ {Ω ⪰ τI})
can be viewed as a conditional probability:

$$
\Pi\left(A \cap \{\Omega \succeq \tau I\}\right) = \tilde\Pi(A \mid \Omega \succeq \tau I) = \frac{\tilde\Pi(\Omega \succeq \tau I \mid A)\,\tilde\Pi(A)}{\tilde\Pi(\Omega \succeq \tau I)} \tag{S35}
$$

The lower bound follows because the denominator is bounded from above by 1.
For the upper bound, we first observe that

$$
\Pi\left(A \cap \{\Omega \succeq \tau I\}\right) = \tilde\Pi(A \mid \Omega \succeq \tau I) = \frac{\tilde\Pi(\Omega \succeq \tau I \mid A)\,\tilde\Pi(A)}{\tilde\Pi(\Omega \succeq \tau I)} \leq \left(\tilde\Pi(\Omega \succeq \tau I)\right)^{-1}\tilde\Pi(A) \tag{S36}
$$
To upper bound the probability in Equation (S35), we find a lower bound on the denominator
Π̃(Ω ⪰ τI). To this end, let

$$
G = \left\{\Omega : \omega_{k,k} > q - 1, \; |\omega_{k,k'}| \leq \frac{1-\tau}{q-1} \text{ for } k' \neq k\right\}
$$

and consider an Ω ∈ G. Since all of Ω's eigenvalues are real, they must each be contained
in at least one Gershgorin disc. Consider the kth Gershgorin disc, whose intersection with
the real line is an interval centered at ω_{k,k} with half-width Σ_{k'≠k}|ω_{k,k'}|. Any eigenvalue of
Ω that lies in this disc must be greater than

$$
\omega_{k,k} - \sum_{k' \neq k}|\omega_{k,k'}| > (q-1) - (1-\tau) \geq \tau.
$$

Thus, we have G ⊆ {Ω : Ω ⪰ τI}.
Since the entries of Ω are independent under Π̃, we compute

$$
\begin{aligned}
\tilde\Pi(G) &\geq \prod_{k}\int_{q-1}^{\infty}\xi\exp(-\xi\omega_{k,k})\,d\omega_{k,k} \times (1-\eta)^Q\prod_{k>k'}\int_{|\omega_{k,k'}| \leq \frac{1-\tau}{q-1}}\frac{\xi_0}{2}\exp(-\xi_0|\omega_{k,k'}|)\,d\omega_{k,k'} \\
&\geq \exp(-2\xi Q)(1-\eta)^Q\left[1 - \frac{\mathbb{E}|\omega_{k,k'}|}{\frac{1-\tau}{q-1}}\right]^Q \\
&= \exp(-2\xi Q)(1-\eta)^Q\left[1 - \frac{1}{\xi_0\frac{1-\tau}{q-1}}\right]^Q \\
&\geq \exp(-2\xi Q)\left(1 - \frac{1}{1 + K_1 Q^{2+a}}\right)^Q\left(1 - \frac{1}{K_3 Q^{2+b}(1-\tau)}\right)^Q \\
&\geq \exp\left(-2\xi Q + \log(R)\right),
\end{aligned}
\tag{S37}
$$

where R > 0 does not depend on n. Note that the first inequality holds by ignoring the
contribution to the probability from the slab distribution. The second inequality is Markov's
inequality and the third inequality follows from our assumptions about how ξ_0 and η are
tuned.
Let S^Ψ_0 and S^Ω_0 respectively denote the supports of Ψ_0 and Ω_0. Similarly, let s^Ψ_0 be the number
of true non-zero entries in Ψ_0 and let s^Ω_0 be the true number of non-zero off-diagonal entries
in Ω_0.
The KL divergence between a Gaussian chain graph model with parameters (Ψ_0, Ω_0) and
one with parameters (Ψ, Ω) is

$$
\frac{1}{n}K(f_0, f) = \mathbb{E}_0\left[\log\frac{f_0}{f}\right] = \frac{1}{2}\left(\log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega) + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega^{1/2}\left(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1}\right)^{\top}X_i^{\top}\right\|_2^2\right) \tag{S38}
$$

The KL variance is:

$$
\frac{1}{n}V(f_0, f) = \operatorname{Var}_0\left[\log\frac{f_0}{f}\right] = \frac{1}{2}\left(\operatorname{tr}\left((\Omega_0^{-1}\Omega)^2\right) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q\right) + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega_0^{-1/2}\Omega\left(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1}\right)^{\top}X_i^{\top}\right\|_2^2 \tag{S39}
$$
We need to lower bound the prior probability of the event

$$
\left\{(\Psi, \Omega) : K(f_0, f) \leq n\epsilon_n^2, \; V(f_0, f) \leq n\epsilon_n^2\right\}
$$

for large enough n.

We first obtain an upper bound on the average KL divergence and variance, so that the mass
of a smaller event can serve as a lower bound. To simplify the notation, we write ∆_Ψ = Ψ − Ψ_0
and ∆_Ω = Ω − Ω_0. We observe that ΨΩ^{-1} − Ψ_0Ω_0^{-1} = (∆_Ψ − Ψ_0Ω_0^{-1}∆_Ω)Ω^{-1}.

Using the fact that ||A − B||²_2 ≤ (||A||_2 + ||B||_2)² ≤ 2||A||²_2 + 2||B||²_2 for any two matrices A
and B, we obtain a simple upper bound:
$$
\begin{aligned}
\frac{1}{n}K(f_0, f) &= \frac{1}{2}\left(\log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega) + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega^{-1/2}\Delta_\Psi^{\top}X_i^{\top} - \Omega^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^{\top}X_i^{\top}\right\|_2^2\right) \\
&\leq \frac{1}{2}\left(\log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega)\right) + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^{\top}X_i^{\top}\right\|_2^2 + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega^{-1/2}\Delta_\Psi^{\top}X_i^{\top}\right\|_2^2 \\
&= \frac{1}{2}\left(\log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega)\right) + \frac{1}{n}\left\|\mathbf{X}\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega^{-1/2}\right\|_F^2 + \frac{1}{n}\left\|\mathbf{X}\Delta_\Psi\Omega^{-1/2}\right\|_F^2
\end{aligned}
\tag{S40}
$$

The last line holds because Ω^{-1/2}∆_Ψ^⊤X_i^⊤ is the (transposed) ith row of X∆_ΨΩ^{-1/2}.
Using the same inequality, we derive a similar upper bound for the average KL variance:

$$
\begin{aligned}
\frac{1}{n}V(f_0, f) &= \frac{1}{2}\left(\operatorname{tr}\left((\Omega_0^{-1}\Omega)^2\right) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q\right) + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega_0^{-1/2}\Delta_\Psi^{\top}X_i^{\top} - \Omega_0^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^{\top}X_i^{\top}\right\|_2^2 \\
&\leq \frac{1}{2}\left(\operatorname{tr}\left((\Omega_0^{-1}\Omega)^2\right) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q\right) + \frac{2}{n}\sum_{i=1}^{n}\left\|\Omega_0^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^{\top}X_i^{\top}\right\|_2^2 + \frac{2}{n}\sum_{i=1}^{n}\left\|\Omega_0^{-1/2}\Delta_\Psi^{\top}X_i^{\top}\right\|_2^2 \\
&= \frac{1}{2}\left(\operatorname{tr}\left((\Omega_0^{-1}\Omega)^2\right) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q\right) + \frac{2}{n}\left\|\mathbf{X}\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega_0^{-1/2}\right\|_F^2 + \frac{2}{n}\left\|\mathbf{X}\Delta_\Psi\Omega_0^{-1/2}\right\|_F^2
\end{aligned}
\tag{S41}
$$
Similar to Ning et al. (2020) and Bai et al. (2020), we find an event A_1 involving only Ω and
an event A_2 involving both Ω and Ψ such that (A_1 ∩ {Ω ≻ 0}) ∩ A_2 is a subset of the event
of interest {K/n ≤ ε²_n, V/n ≤ ε²_n}.
To this end, define

$$
\begin{aligned}
A_1 = \Big\{\Omega : \; & \frac{1}{2}\left(\operatorname{tr}\left((\Omega_0^{-1}\Omega)^2\right) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q\right) + \frac{2}{n}\left\|\mathbf{X}\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega_0^{-1/2}\right\|_F^2 \leq \epsilon_n^2/2\Big\} \\
\bigcap \Big\{\Omega : \; & \frac{1}{2}\left(\log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega)\right) + \frac{1}{n}\left\|\mathbf{X}\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega^{-1/2}\right\|_F^2 \leq \epsilon_n^2/2\Big\}
\end{aligned}
\tag{S42}
$$

and

$$
A_2 = \left\{(\Omega, \Psi) : \frac{1}{n}\left\|\mathbf{X}\Delta_\Psi\Omega_0^{-1/2}\right\|_F^2 \leq \frac{\epsilon_n^2}{2}, \; \frac{2}{n}\left\|\mathbf{X}\Delta_\Psi\Omega^{-1/2}\right\|_F^2 \leq \frac{\epsilon_n^2}{2}\right\} \tag{S43}
$$

We separately bound the prior probabilities Π(A_1) and Π(A_2 | A_1).
S5.1.1 Bounding the prior mass Π(A1)
The goal here is to find a proper lower bound of prior mass on A1. To do this, first consider
the set
A?
1={2X
k>k0|ω0,k,k0ωk,k0|+X
k|ω0,k,k ωk,k | n
c1p}
where c1>0 is a constant to be specified. Since the Frobenius norm is bounded by the
vectorized L1 norm, we immediately conclude that
A?
1k0kFn
c1p.
We now show that nk0kFn
c1po A1.
Since the Frobenius norm bounds the L2 operator norm, if ||0||Fn
c1pthen the
absolute value of the eigenvalues of 0are bounded by n
c1p.Further, because we have
assumed 0has bounded spectrum, the spectrum of = 0+0is bounded by λminn
c1p
and λmax +n
c1p.When nis large enough, these quantities are further bounded by λmin/2
and 2λmax.Thus, for nlarge enough, if ||0||Fn
c1p,then we know has bounded
spectrum.
Consequently, 1/2has bounded L2 operator norm. Using the fact that ||AB||Fmin(|||A|||2||B||F,|||B|||2||A||F),
we have for some constant c2not depending on n,
2
n||XΨ01
01/2
0||2
F2
n|||XΨ0|||2
2||1
01/2
0||2
F
2
n||X||2
F|||Ψ0|||2
2|||1
01/2
0||2
F
pc2
2||||2
F,
where we have used the fact that ||X||F=np. Thus ||||Fn
2c2pimplies
1
n||XΨ01
01/2
0||2
F2
n/4.
Similarly, for some constant c3, we have that
1
n||XΨ01
01/2||2
F1
n||X||2
F||Ψ0||2
2||1
01/2||2
F
pc2
3||||2
F
Thus we have ||||Fn
2c3pimplies 1
2n||XΨ01
01/2||2
F2
n/4
Using an argument from Ning et al. (2020), ||||Fn
2b2pn/2b2implies the following
two inequalities
1
2(tr((Ω1
0Ω)2)2 tr(Ω1
0Ω) + q)2
n/4
1
2(log(|0|
||)q+ tr(Ω1
0Ω)) 2
n/4.
Thus by taking c1= 2 max{c2, c3, b2}, we can conclude {||0||Fn
c1p}⊂A1. Thus
A?
1 A1
Since A?
1 { : ||0||Fn/c1p},we know that ˜
Π(Ω τI|A?
1) = 1.We can therefore
lower bound Π(A1) by Π(A
1 {τI}). Instead of calculating the latter probability
directly, we can lower bound it by observing
2X
k>k0|ω0,k,k0ωk,k0|+X
k|ω0,k,k ωk,k |
=2 X
(k,k0)S
0
|ω0,k,k0ωk,k0|+ 2 X
(k,k0)(S
0)c|ωk,k0|+X
k|ω0,k,k ωk,k |.
Consider the following events
B1=
X
(k,k0)S
0
|ω0,k,k0ωk,k0| n
6c1p
B2=
X
(k,k0)(S
0)c|ωk,k0| n
6c1p
B3=(X
k|ω0,k,k ωk,k | n
3c1p)
Let B=T3
i=1 Bi A
1 A1. Since the prior probability of Blower bounds Π(A1), we now
focus on estimating ˜
Π(B). Recall that the untruncated prior ˜
Π is separable. Consequently,
Π(A1 {τI})˜
Π(A1)˜
Π(B) =
3
Y
i=1
˜
Π(Bi)
We first bound the probability of B1.Note that we can use only the slab part of the prior
to bound this probability. A similar technique was used by Bai et al. (2020) (specifically in
their Equation D.18) and by Roˇckoa and George (2018). Specifically, we have
˜
Π(B1) = ZB1Y
(k,k0)S
0
π(ωk,k0|η)
Y
(k,k0)S
0Z|ω0,k,k0ωk,k0|≤ n
6s
0c1p
π(ωk,k0|η)k,k0
ηs
0Y
(k,k0)S
0Z|ω0,k,k0ωk,k0|≤ n
6s
0c1p
ξ1
2exp(ξ1|ωk,k0|)k,k0
ηs
0exp(ξ1X
(k,k0)S
0
|ω0,k,k0|)Y
(k,k0)S
0Z|ω0,k,k0ωk,k0|≤ n
6s
0c1p
ξ1
2exp(ξ1|ω0,k,k0ωk,k0|)k,k0
=ηs
0exp(ξ1||0,S
0||1)Y
(k,k0)S
0Z||≤ n
6s
0c1p
ξ1
2exp(ξ1||)d
ηs
0exp(ξ1||0,S
0||1)eξ1n
6c1s
0pξ1n
6s
0c1ps
0
The first inequality holds because the fact that |ω0,k,k0ωk,k0| n/(6s
0c1p) implies that
the sum less than n/(6c1p).The last inequality is a special case of Equation D.18 of Bai
et al. (2020).
For B2,we derive the lower bound using the spike component of the prior. To this end, let
Q=q(q1)/2 denote the number of off-diagonal entries of matrix Ω. We have
˜
Π(B2) = ZB2Y
(k,k0)(S
0)c
π(ωk,k0|η)
Y
(k,k0)(S
0)cZ|ωk,k0|≤ n
6(Qs
0)c1p
π(ωk,k0|η)
(1 η)Qs
0Y
(k,k0)(S
0)cZ|ωk,k0|≤ n
6(Qs
0)c1p
ξ0
2exp(ξ0|ωk,k0|)k,k0
(1 η)Qs
0Y
(k,k0)(S
0)c16(Qs
0)c1p
n
Eπ|ωk,k0|
= (1 η)Qs
016(Qs
0)c1p
nξ0Qs
0
&(1 η)Qs
011
Qs
0Qs
0
(1 η)Qs
0
To derive the last two lines, we used an argument similar to the one used by Bai et al. (2020)
to derive Equation D.22. That is, we used the assumption that ξ0max{Q, n, pq}4+b
for some b > 0 to conclude that n/ max{Q, n, pq}1/2+b1.This inequality allows us to
control the Qin the numerator. Since s
0grows slower than Q, we can lower bound the
above function some multiplier of the form (1 η)Qs
0.Thus, for large enough n, we have
6(Qs
0)c1p
nξ06(Qs
0)c1pn
pplog(q)Q2+b
=6c1
plog(q)
Qs0
Q2
n
Qb
Qs0
Q2
1
Qs0
The event B3only involves diagonal entries. The untruncated prior mass can be directly
bounded using the exponential distribution
˜
Π(B3) = ZB3
q
Y
k=1
π(ωk,k)
q
Y
k=1 Z|ω0,k,kωk ,k|≤ n
3qc1p
π(ωk,k)k,k
=
q
Y
k=1 Zω0,k,k+n
3qc1p
ω0,k,kn
3qc1p
ξexp(ξωk,k)k,k
q
Y
k=1 Zω0,k,k+n
3qc1p
ω0,k,k
ξexp(ξωk,k)k,k
= exp(ξ
q
X
i=1
ω0,k,k)Zn
3qc1p
0
ξexp(ξωk,k)k,k
exp(ξ
q
X
i=1
ω0,k,k)eξn
3c1qpξn
3qc1pq
Now we are ready to show that the log prior mass on Bcan be bounded by some C1n2
n. To
this end, consider the negative log probability
log(Π(A1 {τ I}))
3
X
i=1 log( ˜
Π(Bi))
.s
0log(η) + ξ1||0,S
0||1+ξ1n
6c1ps
0log ξ1n
6s
0c1p(Qs
0) log(1 η)
+ξX
k
ω0,k,k +ξn
3c1pqlog ξn
3qc1p
=log ηs
0(1 η)Qs
0+ξ1||0,S
0||1+ξ1n
6c1p+ξX
k
ω0,k,k +ξn
3c1p
s
0log ξ1n
6s
0c1pqlog ξn
3qc1p
The ξ1n
6c1pand ξn
3c1pterms are O(n/p).n2
nwhich goes to infinity. The 4th term is of
order qsince the diagonal entries is controlled by the largest eigenvalue of ,which was
assumed to be bounded.
ξ1||0,S
0||1ξ1s
0sup |ω0,k,k0|
is of order s
0as the entries of ω0,k,k0is controlled.
Without tuning of η, the first term log ηs
0(1 η)Qs
0has order of Q. But since
we assumed 1η
ηmax{Q, pq}2+afor some a > 0, we have K1max{Q, pq}2+a1η
η
K2max{Q, pq}2+a. That is, we have 1/(1+K2max{Q, pq}2+a)η1/(1+K1max{Q, pq}2+a).
We can derive a simple lower bound as
ηs
0(1 η)Qs
0(1 + K2max{Q, pq}2+a)s
0(1 η)Qs
0
(1 + K2max{Q, pq}2+a)s
011
1 + K1max{Q, pq}2+aQs
0
&(1 + K2max{Q, pq}2+a)s
0
The last line is because max{Q, pq}2+agrows faster than Qs
0.Thus (11
1+K1max{Q,pq}2+a)max{Q,pq}−s
0
can be bounded below by some constant.
log ηs
0(1 η)Qs
0.s
0log(1 + K2max{Q, pq}2+a).s
0log(max{Q, pq})
s
0log(max{q, p})max(p, q, s
0) log(max{q, p})
The last two terms can be treated in the same way, using the assumption ξ11/n and
ξ1/max{Q, n}.
s
0log ξ1n
6s
0c1p=s
0log 6s
0c1p
ξ1n
.s
0log n3/2s
0p
pmax{s
0, p, q}log(q)!
s
0log n3/2s
0
.s
0log(q2)
.n2
n
The third line holds because ppmax{s
0, p, q}and log(q)1,which together imply
that p/pmax{s
0, p, q}log(q)1. The fourth line follows from our assumption that
log(n).log(q) because s
0< q2. The last line uses the definition of n.
Finally, we have
qlog ξn
3qc1p=qlog 3qc1p
ξn
.qlog n1/2max{Q, n}qp
pmax{s
0, p, q}log(q)!
qlog n1/2max{Q, n}q
.qlog(q)
.n2
n
S5.1.2 Bounding the conditional probability Π(A2|A1)
To bound Π(A2|A1),we use a very similar strategy as the one above. The difference is that
we now focus on the matrix Ψ.We show that mass on a L1 norm ball serves as a lower bound
similar to that of Ω. To see that, using an argument from Ning et al. (2020), we show that
powers of and 0are bounded in operator norm. Thus the terms 1
n||XΨ1/2
0||2
Fand
2
n||XΨ1/2||2
Fthat appear in the KL condition are bounded by a constant multiplier of
n1||XΨ||2
F.Using the fact that the columns of Xhave norm n, we can found this norm:
||XΨ||Fn
p
X
j=1 ||Ψ,j,.||Fn
p
X
j=1
q
X
k=1 |ψj,k ψ0,j,k|
Thus to bound Π(A2|A1) from below, it suffices to bound Π(P|ψj,k ψ0,j,k | c4n) for
some fixed constant c4>0.
We separate the sum based on whether the true value is 0, similar to our treatment on Ω:
X
ij |ψj,k ψ0,j,k|=X
(j,k)SΨ
0
|ψj,k ψ0,j,k|+X
(j,k)(SΨ
0)c|ψj,k|
Using the same argument as in Ω, we can consider the events whose intersection is a subset
of A2:
B4=
X
(j,k)SΨ
0
|ψj,k ψ0,j,k| c4n
2
B5=
X
(j,k)(SΨ
0)c|ψj,k ψ0,j,k| c4n
2
We have B4 B5 A2.Since the elements of Ψ are a priori independent of each other and
of ,we compute
Π(A2|A1)Π(B4|A1)Π(B5|A1) = Π(B4)Π(B5)
We bound each of these terms using the same argument as in the previous subsection:
Π(B4) = ZB4Y
(j,k)SΨ
0
π(ψj,k|θ)
Y
(j,k)SΨ
0Z|ψj,kψ0,j,k |≤c4n
2sΨ
0
π(ψj,k|θ)j,k
θsΨ
0Y
(j,k)SΨ
0Z|ψj,kψ0,j,k |≤c4n
2sΨ
0
λ1
2exp(λ1|ψj,k|)j,k
θsΨ
0exp(λ1X
(j,k)SΨ
0
|ψ0,j,k|)Y
(j,k)SΨ
0Z|ψj,kψ0,j,k |≤c4n
2sΨ
0
λ1
2exp(λ1|ψj,k ψ0,j,k|)j,k
=θsΨ
0exp(λ1X
(j,k)SΨ
0
|ψ0,j,k|)Y
(j,k)SΨ
0Z||≤c4n
2sΨ
0
λ1
2exp(λ1||)d
θsΨ
0exp(λ1||Ψ0,SΨ
0||1)ec4λ1n
2sΨ
0c4n
2sΨ
0sΨ
0
Similarly, we have
Π(B5)(1 θ)pqsΨ
012(pq sΨ
0)c4
nλ0pqsΨ
0
&(1 θ)pqsΨ
0
From here we have
log(Π(A2|A1)) log(Π(B4)) log(Π(B5))
=log(θsΨ
0(1 θ)pqsΨ
0) + λ1||Ψ0,SΨ
0||1+λ1c4n
2s0
Ψlog c4n
2sΨ
0
Since Ψ0has bounded L2 operator norm, we know that the entries of Ψ0are all bounded.
Thusλ1||Ψ0,SΨ
0||1=O(sΨ
0).n2
n. The last two terms are O(n).n2
n.
For the first term, recall that we assumed 1θ
θ(pq)2+bfor some b > 0.That is, there
are constants M3and M4such that M3(pq)2+b1θ
θM4(pq)2+b1θ
θ. Since 1/(1 +
M4(pq)2+b)θ1/(1 + M3(pq)2+b),we compute
θsΨ
0(1 θ)pqsΨ
0(1 + M4(pq)2+b)sΨ
0(1 θ)pqsΨ
0
(1 + M4(pq)2+b)sΨ
011/(1 + M3(pq)2+b)pqsΨ
0
&(1 + M4(pq)2+b)sΨ
0
Note that the last line is due to the fact that (pq)2+bgrows faster than pq sΨ
0.Conse-
quently, the term 11/(1 + M3(pq)2+b)pqsΨ
0can be bounded from below by a constant
not depending on n. Thus,
log θsΨ
0(1 θ)pqsΨ
0.sΨ
0log(1 + M4(pq)2+b).sΨ
0log(pq).sΨ
0max{log(q),log(p)}
For the last term, we use the same argument as we did with Ω.
s0
Ψlog c4n
2sΨ
0=s0
Ψlog 2sΨ
0
c4n
.sΨ
0log n
plog(pq)!
sΨ
0log(n)
.n2
n
S5.2 Test condition
Before checking the test condition, we first show a dimension recovery result by bounding the
prior probability, with the effective dimension defined as the number of entries whose absolute
value exceeds the intersection point of the spike and slab components. We then find a suitable
vectorized L1-norm sieve in the resulting "lower-dimensional" parameter space. We construct
tests based on the supremum of a collection of single-alternative Neyman-Pearson likelihood
ratio tests over subsets of the sieve that are norm balls, and then show that the number of such
subsets needed to cover the sieve can be bounded appropriately.
S5.2.1 Dimension recovery
Unlike Ning et al. (2020), our prior assigns no mass to exactly sparse solutions. Nevertheless,
similar to Ročková and George (2018), we can define a notion of "effective sparsity" and a
generalized dimension. Intuitively, the generalized dimension counts how many coefficients are
drawn from the slab rather than the spike part of the prior. Formally, the generalized inclusion
functions ν_ψ and ν_ω for Ψ and Ω can be defined as

$$
\nu_\psi(\psi_{j,k}) = \mathbb{1}\left(|\psi_{j,k}| > \delta_\psi\right), \qquad \nu_\omega(\omega_{k,k'}) = \mathbb{1}\left(|\omega_{k,k'}| > \delta_\omega\right)
$$

where δ_ψ and δ_ω are the thresholds at which the spike and slab parts have the same density:

$$
\delta_\psi = \frac{1}{\lambda_0 - \lambda_1}\log\left[\frac{1-\theta}{\theta}\cdot\frac{\lambda_0}{\lambda_1}\right], \qquad
\delta_\omega = \frac{1}{\xi_0 - \xi_1}\log\left[\frac{1-\eta}{\eta}\cdot\frac{\xi_0}{\xi_1}\right].
$$

The generalized dimension is then defined as the number of entries that are included:

$$
|\nu(\Psi)| = \sum_{j,k}\nu_\psi(\psi_{j,k}), \qquad |\nu(\Omega)| = \sum_{k > k'}\nu_\omega(\omega_{k,k'}). \tag{S44}
$$

Note that we only count the off-diagonal entries in Ω.
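The thresholds and the generalized dimension above are easy to compute directly; the following Python sketch (with function names of our choosing) simply transcribes the definitions:

```python
import numpy as np

def inclusion_threshold(p0, p1, w):
    """Threshold where the spike and slab Laplace densities intersect:
    delta_psi with (p0, p1, w) = (lambda0, lambda1, theta),
    or delta_omega with (xi0, xi1, eta)."""
    return np.log((1.0 - w) * p0 / (w * p1)) / (p0 - p1)

def effective_dimension(A, threshold, off_diagonal_only=False):
    """Generalized dimension (S44): number of entries exceeding the threshold."""
    mask = np.abs(A) > threshold
    if off_diagonal_only:
        mask = np.triu(mask, k=1)      # only count off-diagonal entries of Omega
    return int(mask.sum())
```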
We are now ready to prove Lemma 1 from the main text. The main idea is to check the
posterior probability directly. Let B^Ψ_n = {Ψ : |ν(Ψ)| < r^Ψ_n} for some r^Ψ_n = C'_3 max{p, q, s^Ψ_0, s^Ω_0}
with C'_3 > C_1 from the KL condition. For Ω, let B^Ω_n = {Ω ⪰ τI : |ν(Ω)| < r^Ω_n} for
r^Ω_n = C'_3 max{p, q, s^Ψ_0, s^Ω_0} with some C'_3 > C_1 from the KL condition. We aim to show that
E_0Π(Ω ∈ (B^Ω_n)^c | Y_1, ..., Y_n) → 0 and E_0Π(Ψ ∈ (B^Ψ_n)^c | Y_1, ..., Y_n) → 0.

The marginal posteriors can be expressed using the log-likelihood ℓ_n:

$$
\begin{aligned}
\Pi(\Psi \in B^{\Psi}_n \mid Y_1, \ldots, Y_n) &= \frac{\int\int_{B^{\Psi}_n}\exp\left(\ell_n(\Psi, \Omega) - \ell_n(\Psi_0, \Omega_0)\right)d\Pi(\Psi)\,d\Pi(\Omega)}{\int\int\exp\left(\ell_n(\Psi, \Omega) - \ell_n(\Psi_0, \Omega_0)\right)d\Pi(\Psi)\,d\Pi(\Omega)} \\
\Pi(\Omega \in B^{\Omega}_n \mid Y_1, \ldots, Y_n) &= \frac{\int\int_{B^{\Omega}_n}\exp\left(\ell_n(\Psi, \Omega) - \ell_n(\Psi_0, \Omega_0)\right)d\Pi(\Psi)\,d\Pi(\Omega)}{\int\int\exp\left(\ell_n(\Psi, \Omega) - \ell_n(\Psi_0, \Omega_0)\right)d\Pi(\Psi)\,d\Pi(\Omega)}
\end{aligned}
\tag{S45}
$$
By using the result of KL condition (Lemma S2), we know the denominators are bounded
from below by eC1n2
nwith large probability. Thus, we focus now on upper bounding the
numerators beginning with Ψ.
Consider the numerator:
E0ZZ(BΨ
n)c
f/f0dΠ(Ψ)dΠ(Ω)=Z ZZ(BΨ
n)c
f/f0dΠ(Ψ)dΠ(Ω)f0dy
=ZZ(BΨ
n)cZfdydΠ(Ψ)dΠ(Ω)
Z(BΨ
n)c
dΠ(Ψ) = Π(|ν(Ψ)| rΨ
n)
We can bound the above display using the fact that when |ψj,k|> δψwe have π(ψj,k)<
2θλ1
2exp(λ1|ψj,k|), this is by definition of the effective dimension:
Π(|ν(Ψ)| rΨ
n)X
|S|>rΨ
n
(2θ)|S|Y
(j,k)SZ|ψj,k |ψ
λ1
2exp(λ1|ψj,k|)j,k Y
(j,k)/SZ|ψj,k|ψ
π(ψj,k)j,k
X
|S|>rΨ
n
(2θ)|S|
Using the assumption on θ, and the fact pq
k(epq/k)k(similar to Bai et al. (2020)’s
equation D.32), we can further upper bound the probability
Π(|ν(Ψ)| rΨ
n)X
|S|>rΨ
n
(2θ)|S|X
|S|>rΨ
n
(2
1 + M4(pq)2+b)|S|
pq
X
k=brΨ
nc+1 pq
k 2
M4(pq)2k
pq
X
k=brΨ
nc+1 2e
M4kpq k
<
pq
X
k=brΨ
nc+1 2e
M4(brΨ
nc+ 1)pq k
.(pq)(brΨ
nc+1)
exp((rΨ
n) log(pq)).
Taking rΨ
n=C0
3max{p, q, sΨ
0, s
0}for some C0
3> C1, we have:
Π(|ν(Ψ)| rΨ
n)exp(C0
3max{p, q, sΨ
0, s
0}log(pq))
Therefore,
E0Π((BΨ
n)c|Y1, . . . , Yn)E0Π((BΨ
n)c|Y1, . . . , Yn)IEn+P0(Ec
n),
where Enis the event in the KL condition. On En, the KL condition ensures that the
denominator in Equation (S45) is lower bounded by exp(C1n2
n) while the denominator is
upper bounded by exp(C0
3max{p, q, sΨ
0, s
0}log(pq))., Since P0(Ec
n) is o(1) per KL condition,
we have the upper bound
E0Π((BΨ
n)c|Y1, . . . , Yn)exp(C1n2
nC0
3max{p, q, sΨ
0, s
0}log(pq)) + o(1) 0
This completes the proof of the dimension recovery result for Ψ.
The workflow for Ω is very similar, except we need to use the upper bound for the graphical
prior in Equation (S34) to properly bound the prior mass.
We upper bound the numerator:
E0ZZ(B
n)c
f/f0dΠ(Ψ)dΠ(Ω)Z(B
n)c
dΠ(Ω) = Π(|ν(Ω)| r
n)exp(2ξQ log(R)) ˜
Π(|ν(Ω)| r
n)
We bound the above display using the fact that when |ωk,k0|> δωwe have π(ωk,k0)<
2ηξ1
2exp(ξ1|ωk,k0|). Note that this follows from the definition of the effective dimension.
We have
˜
Π(|ν(Ω)| r
n)X
|S|>r
n
(2η)|S|Y
(k,k0)SZ|ωk,k0|ω
ξ1
2exp(ξ1|ωk,k0|)k,k0Y
(k,k0)/SZ|ωk,k0|ω
π(ωk,k0)k,k0
X
|S|>r
n
(2η)|S|
By using the assumption on η, and the fact Q
k(eQ/k)k, we can further upper bound the
probability:
˜
Π(|ν(Ω)| r
n)X
|S|>r
n
(2η)|S|X
|S|>r
n
(2
1 + K4max{pq, Q}2+b)|S|
Q
X
k=br
nc+1 Q
k 2
K4max{pq, Q}2k
max{pq,Q}
X
k=br
nc+1 max{pq, Q}
k 2
K4max{pq, Q}2k
max{pq,Q}
X
k=br
nc+1 2e
K4kmax{pq, Q}k
<
max{pq,Q}
X
k=br
nc+1 2e
K4(br
nc+ 1) max{pq, Q}k
.max{pq, Q}(br
nc+1)
exp((r
n) log(max{pq, Q}))
Taking r
n=C0
3max{p, q, sΨ
0, s
0}and C0
3> C1, we have
˜
Π(|ν(Ω)| r
n)exp(C0
3max{p, q, sΨ
0, s
0}log(max{pq, Q})) exp(C3n2
n)
Thus, using the assumption $\xi \lesssim 1/\max\{Q, n\}$, for some $R_0$ not depending on $n$ we have
\[
\Pi(|\nu(\Omega)| \ge r^{\Omega}_n) \le \exp(-C_3 n\epsilon_n^2 + 2\xi Q\log(R)) \le \exp(-C_3 n\epsilon_n^2 + \log(R_0)).
\]
We therefore conclude that
\[
E_0\Pi((B^{\Omega}_n)^c \mid Y_1, \ldots, Y_n) \le E_0\left[\Pi((B^{\Omega}_n)^c \mid Y_1, \ldots, Y_n)\, \mathbb{I}_{E_n}\right] + P_0(E_n^c),
\]
where $E_n$ is the event in the KL condition. On $E_n$, the KL condition ensures that the denominator in Equation (S45) is lower bounded by $\exp(-C_1 n\epsilon_n^2)$, while the numerator is upper bounded by $\exp(-C_3' n\epsilon_n^2 + \log(R_0))$. Since $P_0(E_n^c)$ is $o(1)$ per the KL condition, we conclude
\[
E_0\Pi((B^{\Omega}_n)^c \mid Y_1, \ldots, Y_n) \le \exp\left(C_1 n\epsilon_n^2 - C_3' n\epsilon_n^2 + \log(R_0)\right) + o(1) \to 0.
\]
We pause now to reflect on how dimension recovery helps us establish contraction. Our end goal is to show that the posterior distribution contracts to the true value by first showing that any event on which the average log-affinity exceeds a given $\epsilon > 0$ has $o(1)$ posterior mass. For any such event, we can take a partition based on whether it intersects $B^{\Psi}_n$, $B^{\Omega}_n$, or their complements. Because the complements $(B^{\Psi}_n)^c$ and $(B^{\Omega}_n)^c$ have $o(1)$ posterior mass, the parts of the partition intersecting either complement also have $o(1)$ posterior mass. Thus, we only need to show that events that both exceed the log-affinity threshold $\epsilon > 0$ and recover the low-dimensional structure have $o(1)$ posterior mass. The recovery condition reduces the complexity of the events (in the parameter space) that we need to handle by reducing their effective dimension. We will make use of this low-dimensional structure when checking the test condition.
Formally, for every $\epsilon > 0$, we have
\[
\begin{aligned}
E_0\Pi&\left(\Psi, \Omega \succeq \tau I : \tfrac{1}{n}\textstyle\sum\rho(f_i, f_{0,i}) > \epsilon \,\Big|\, Y_1, \ldots, Y_n\right) \\
&\le E_0\Pi\left(\Psi \in B^{\Psi}_n, \Omega \succeq \tau I : \tfrac{1}{n}\textstyle\sum\rho(f_i, f_{0,i}) > \epsilon \,\Big|\, Y_1, \ldots, Y_n\right) + E_0\Pi((B^{\Psi}_n)^c \mid Y_1, \ldots, Y_n) \\
&\le E_0\Pi\left(\Psi \in B^{\Psi}_n, \Omega \in B^{\Omega}_n : \tfrac{1}{n}\textstyle\sum\rho(f_i, f_{0,i}) > \epsilon \,\Big|\, Y_1, \ldots, Y_n\right)
+ E_0\Pi((B^{\Psi}_n)^c \mid Y_1, \ldots, Y_n) + E_0\Pi((B^{\Omega}_n)^c \mid Y_1, \ldots, Y_n).
\end{aligned}
\]
The last two terms are $o(1)$, as proved above.
S5.2.2 Sieve

As shown in the previous section, we can concentrate on the events with proper dimension recovery, i.e. $\{\Psi \in B^{\Psi}_n, \Omega \in B^{\Omega}_n\}$. To apply Ghosal and van der Vaart (2017)'s general theory of posterior contraction on the event of proper dimension recovery (i.e. to show $E_0\Pi(\Psi \in B^{\Psi}_n, \Omega \in B^{\Omega}_n : \frac{1}{n}\sum\rho(f_i, f_{0,i}) > \epsilon \mid Y_1, \ldots, Y_n) \to 0$), we need to find a sieve that covers enough of the support of the prior. We will show that an L1-norm sieve is sufficient.

Formally, we will show that there exists a sieve $F_n$ such that, for some constant $C_2 > C_1 + 2$:
\[
\Pi(F_n^c) \le \exp(-C_2 n\epsilon_n^2). \tag{S46}
\]
Consider the sieve:
\[
F_n = \left\{\Psi \in B^{\Psi}_n, \Omega \in B^{\Omega}_n : \|\Psi\|_1 \le 2C_3 p,\ \|\Omega\|_1 \le 8C_3 q\right\} \tag{S47}
\]
for some large $C_3 > C_1 + 2 + \log(3)$, where $C_1$ is the constant in the KL condition. We have
\[
\Pi(F_n^c) \le \Pi(\|\Psi\|_1 > 2C_3 p) + \Pi\left((\|\Omega\|_1 > 8C_3 q) \cap \{\Omega \succeq \tau I\}\right).
\]
We upper bound each term similarly to Bai et al. (2020). By using the bound in Equation (S34), we know that
\[
\Pi\left((\|\Omega\|_1 > 8C_3 q) \cap \{\Omega \succeq \tau I\}\right) \le \exp(2\xi Q\log(R))\, \tilde{\Pi}(\|\Omega\|_1 > 8C_3 q).
\]
Since $\|\Omega\|_1 = 2\sum_{k > k'}|\omega_{k,k'}| + \sum_k |\omega_{k,k}|$, at least one of $2\sum_{k > k'}|\omega_{k,k'}|$ and $\sum_k |\omega_{k,k}|$ must exceed $8C_3 q/2$ whenever $\|\Omega\|_1 > 8C_3 q$. Thus, we can form an upper bound on the L1-norm probability
\[
\tilde{\Pi}(\|\Omega\|_1 > 8C_3 q) \le \tilde{\Pi}\left(\sum_{k > k'}|\omega_{k,k'}| > \frac{8C_3 q}{4}\right) + \tilde{\Pi}\left(\sum_{k}|\omega_{k,k}| > \frac{8C_3 q}{2}\right).
\]
To get an upper bound under $\tilde{\Pi}$, we can act as if all $\omega_{k,k'}$'s were drawn from the slab distribution. In that setting, $\sum_{k > k'}|\omega_{k,k'}|$ is Gamma distributed with shape parameter $Q$ and rate parameter $\xi_1$. By using an appropriate tail probability for the Gamma distribution (Boucheron et al. (2013), pp. 29) and the fact that $1 + x - \sqrt{1 + 2x} \ge (x-1)/2$, we compute
\[
\exp(2\xi Q\log(R))\, \tilde{\Pi}\left(\sum_{k > k'}|\omega_{k,k'}| > \frac{8C_3 q}{4}\right)
\le \exp\left[-Q\left(1 - \sqrt{1 + 2\,\frac{8C_3 q}{4\xi_1 Q}} + \frac{8C_3 q}{4\xi_1 Q}\right) + 2Q\log(R)\right]
\le \exp\left(-\frac{8C_3 q}{8\xi_1} + \frac{5}{2}Q\log(R)\right).
\]
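For completeness, the elementary inequality used in the display above holds for every $x \ge 0$:
\[
1 + x - \sqrt{1 + 2x} \ge \frac{x - 1}{2}
\;\Longleftrightarrow\;
\frac{x + 3}{2} \ge \sqrt{1 + 2x}
\;\Longleftrightarrow\;
\frac{(x + 3)^2}{4} - (1 + 2x) = \frac{(x - 1)^2 + 4}{4} \ge 0.
\]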
Since we have assumed $\xi_1 \lesssim 1/n$, for sufficiently large $n$ we have $n\epsilon_n^2 \gtrsim q\log(q)$. Consequently, $qn\epsilon_n^2 \gtrsim Q\log(q)$ and $Q = o(qn\epsilon_n^2)$, and we see that
\[
\frac{8C_3 q}{8\xi_1} - \frac{5}{2}Q\log(R)
\ge C_3(nq) - \frac{5}{2}Q\log(R)
\ge C_3(qn\epsilon_n^2) - \frac{5}{2}Q\log(R)
= C_3(qn\epsilon_n^2) - o(qn\epsilon_n^2)
\ge C_3 n\epsilon_n^2.
\]
The first-order term in $Q$ on the left-hand side can be ignored when $n$ is large, as the left-hand side is dominated by the term of order $Q\log(q)$. Note that we used the assumption that $\epsilon_n \to 0$. We further have
\[
\exp(2\xi Q\log(R))\, \tilde{\Pi}\left(\sum_{k > k'}|\omega_{k,k'}| > \frac{8C_3 q}{4}\right) \le \exp(-C_3 n\epsilon_n^2).
\]
For the diagonal, the sum follows a Gamma distribution with shape $q$ and rate $\xi$. We obtain a similar bound:
\[
\exp(2\xi Q\log(R))\, \tilde{\Pi}\left(\sum_{k}|\omega_{k,k}| > \frac{8C_3 q}{2}\right)
\le \exp(2Q\log(R))\, \exp\left[-q\left(1 - \sqrt{1 + 2\,\frac{8C_3 q}{2\xi q}} + \frac{8C_3 q}{2\xi q}\right)\right]
\le \exp\left(-\frac{8C_3 q}{4\xi} + Q\left(2 + \frac{q}{2Q}\right)\log(R)\right).
\]
Using the same argument as before and the fact that $\xi \lesssim 1/\max\{Q, n\}$, we have
\[
\frac{8C_3 q}{4\xi} - Q\left(2 + \frac{q}{2Q}\right)\log(R)
\ge 2C_3(\max\{Q, n\}\, q) - Q\left(2 + \frac{q}{2Q}\right)\log(R)
\ge C_3 qn\epsilon_n^2 - o(qn\epsilon_n^2)
\ge C_3 n\epsilon_n^2.
\]
The first-order term in $Q$ on the left-hand side can again be ignored when $n$ is large, as the left-hand side is dominated by the term of order $Q\log(q)$ and $q/Q \to 0$.
By combining the above results, we have:
\[
\begin{aligned}
\Pi\left((\|\Omega\|_1 > 8C_3 q) \cap \{\Omega \succeq \tau I\}\right)
&\le \exp(2\xi Q\log(R))\, \tilde{\Pi}(\|\Omega\|_1 > 8C_3 q) \\
&\le \exp(2\xi Q\log(R))\, \tilde{\Pi}\left(\sum_{k > k'}|\omega_{k,k'}| > \frac{8C_3 q}{4}\right)
+ \exp(2\xi Q\log(R))\, \tilde{\Pi}\left(\sum_{k}|\omega_{k,k}| > \frac{8C_3 q}{2}\right) \\
&\le 2\exp(-C_3 n\epsilon_n^2).
\end{aligned}
\tag{S48}
\]
The probability of $\|\Psi\|_1 > 2C_3 p$ can be bounded by the tail probability of a Gamma distribution with shape parameter $pq$ and rate parameter $\lambda_1$:
\[
\Pi(\|\Psi\|_1 > 2C_3 p)
\le \exp\left[-pq\left(1 - \sqrt{1 + 2\,\frac{2C_3 p}{pq\lambda_1}} + \frac{2C_3 p}{pq\lambda_1}\right)\right]
\le \exp\left[-pq\left(\frac{2C_3 p}{2pq\lambda_1} - \frac{1}{2}\right)\right]
\le \exp\left(-\frac{2C_3 p}{2\lambda_1} + \frac{pq}{2}\right).
\]
Using the same argument, we have $pn \ge pn\epsilon_n^2 \gtrsim pq\log(q)$, and thus $pq = o(pn\epsilon_n^2)$ for large $n$. Consequently,
\[
\exp\left(-\frac{2C_3 p}{2\lambda_1} + \frac{pq}{2}\right) \le \exp\left(-C_3\, pn\epsilon_n^2 + o(pn\epsilon_n^2)\right) \le \exp(-C_3 n\epsilon_n^2)
\]
and
\[
\Pi(\|\Psi\|_1 > 2C_3 p) \le \exp(-C_3 n\epsilon_n^2). \tag{S49}
\]
By combining the results from Equations (S48) and (S49), we conclude
\[
\Pi(F_n^c) \le 3\exp(-C_3 n\epsilon_n^2) = \exp(-C_3 n\epsilon_n^2 + \log(3)).
\]
With our choice of $C_3$, the above probability is asymptotically bounded from above by $\exp(-C_2 n\epsilon_n^2)$ for some $C_2 \ge C_1 + 2$.
S5.2.3 Tests around a representative point

To apply the general theory, we need to construct tests $\varphi_n$ such that, for some $M_2 > C_1 + 1$:
\[
E_{f_0}\varphi_n \lesssim e^{-M_2 n\epsilon_n^2/2},
\qquad
\sup_{f \in F_n : \rho(f_0, f) > M_2 n\epsilon_n^2} E_f(1 - \varphi_n) \lesssim e^{-M_2 n\epsilon_n^2},
\tag{S50}
\]
where $f = \prod_{i=1}^n N(X_i\Psi\Omega^{-1}, \Omega^{-1})$ while $f_0 = \prod_{i=1}^n N(X_i\Psi_0\Omega_0^{-1}, \Omega_0^{-1})$.

Instead of directly constructing $\varphi_n$ on the whole sieve, we use a method similar to Ning et al. (2020). That is, we construct tests against a representative point and show that these tests work well in a neighborhood of the representative point. We then take the supremum of these tests and show that the number of pieces needed to cover the entire sieve can be appropriately bounded.

For a representative point $f_1$, consider the Neyman-Pearson test for the single-point alternative $H_0: f = f_0$, $H_1: f = f_1$, namely $\phi_n = \mathbb{I}\{f_1/f_0 \ge 1\}$. If the average half-order Rényi divergence satisfies $-n^{-1}\log\left(\int\sqrt{f_0 f_1}\right) \ge \epsilon^2$, we will have:
\[
E_{f_0}(\phi_n) \le \int_{f_1 > f_0}\sqrt{f_1/f_0}\, f_0 \le \int\sqrt{f_1 f_0} \le e^{-n\epsilon^2},
\qquad
E_{f_1}(1 - \phi_n) \le \int_{f_0 > f_1}\sqrt{f_0/f_1}\, f_1 \le \int\sqrt{f_0 f_1} \le e^{-n\epsilon^2}.
\]
By Cauchy-Schwarz, for any alternative $f$ we can control the Type II error rate:
\[
E_f(1 - \phi_n) \le \left\{E_{f_1}(1 - \phi_n)\right\}^{1/2}\left\{E_{f_1}(f/f_1)^2\right\}^{1/2}.
\]
So long as the second factor grows at most like $e^{cn\epsilon^2}$ for a properly chosen small $c$, the full expression can be controlled. Thus we consider a neighborhood around the representative point that is small enough for the second factor to be bounded.
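Spelling out the Cauchy-Schwarz step: since $\phi_n$ is an indicator, $(1 - \phi_n)^2 = 1 - \phi_n$, and therefore
\[
E_f(1 - \phi_n) = \int (1 - \phi_n)\,\frac{f}{f_1}\, f_1\, dy
\le \left(\int (1 - \phi_n)^2 f_1\, dy\right)^{1/2}\left(\int \left(\frac{f}{f_1}\right)^2 f_1\, dy\right)^{1/2}
= \left\{E_{f_1}(1 - \phi_n)\right\}^{1/2}\left\{E_{f_1}(f/f_1)^2\right\}^{1/2}.
\]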
Consider every density with parameters satisfying
\[
\begin{aligned}
&|||\Omega|||_2 \le \|\Omega\|_1 \le 8C_3 q, \\
&\|\Psi_1 - \Psi\|_2 \le \|\Psi_1 - \Psi\|_1 \le \frac{1}{2C_3 np}, \\
&|||\Omega_1 - \Omega|||_2 \le \|\Omega_1 - \Omega\|_1 \le \frac{1}{8C_3 n\max\{p, q\}^{3/2}} \le \frac{1}{8C_3 nq^{3/2}}.
\end{aligned}
\tag{S51}
\]
We show that $E_{f_1}(f/f_1)^2$ is bounded on the above set when the parameters come from the sieve $F_n$.

Similar to Ning et al. (2020), denote $\Sigma_1 = \Omega_1^{-1}$ and $\Sigma = \Omega^{-1}$, as well as $\Sigma^\star_1 = \Omega^{1/2}\Sigma_1\Omega^{1/2}$, and let $\Delta_\Psi = \Psi - \Psi_1$ while $\Delta_\Omega = \Omega - \Omega_1$. Using the observation $\Psi\Omega^{-1} - \Psi_1\Omega_1^{-1} = (\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\Omega^{-1}$, we have
\[
\begin{aligned}
E_{f_1}(f/f_1)^2 &= |\Sigma^\star_1|^{-n/2}\, |2I - \Sigma^{\star\,-1}_1|^{-n/2}
\times \exp\left(\sum_{i=1}^n X_i(\Psi\Omega^{-1} - \Psi_1\Omega_1^{-1})\,\Omega^{1/2}(2\Sigma^\star_1 - I)^{-1}\Omega^{1/2}\,(\Psi\Omega^{-1} - \Psi_1\Omega_1^{-1})^\top X_i^\top\right) \\
&= |\Sigma^\star_1|^{-n/2}\, |2I - \Sigma^{\star\,-1}_1|^{-n/2}
\times \exp\left(\sum_{i=1}^n X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\,\Omega^{-1/2}(2\Sigma^\star_1 - I)^{-1}\Omega^{-1/2}\,(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)^\top X_i^\top\right).
\end{aligned}
\tag{S52}
\]
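The matrix identity invoked before the display can be checked directly from the definitions $\Delta_\Psi = \Psi - \Psi_1$ and $\Delta_\Omega = \Omega - \Omega_1$:
\[
(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\Omega^{-1}
= \left(\Psi - \Psi_1 - \Psi_1\Omega_1^{-1}(\Omega - \Omega_1)\right)\Omega^{-1}
= \Psi\Omega^{-1} - \Psi_1\Omega^{-1} - \Psi_1\Omega_1^{-1} + \Psi_1\Omega^{-1}
= \Psi\Omega^{-1} - \Psi_1\Omega_1^{-1}.
\]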
For the first factor we use a similar argument as in Ning et al. (2020) (after their Equation 5.9). Since $\Omega \in F_n$, we have $|||\Omega^{-1}|||_2 \lesssim 1$. The fact that $|||\Omega_1 - \Omega|||_2 \le \delta'_n = 1/(8C_3 nq^{3/2})$ implies
\[
|||\Sigma^\star_1 - I|||_2 \le |||\Omega|||_2\, |||\Omega_1^{-1} - \Omega^{-1}|||_2 \lesssim \delta'_n,
\]
and thus we can bound the spectrum of $\Sigma^\star_1$, i.e. $1 - \delta'_n \le \mathrm{eig}_1(\Sigma^\star_1) \le \cdots \le \mathrm{eig}_q(\Sigma^\star_1) \le 1 + \delta'_n$.
Thus
\[
\begin{aligned}
\left(\frac{1}{|\Sigma^\star_1|\,|2I - \Sigma^{\star\,-1}_1|}\right)^{n/2}
&= \exp\left(-\frac{n}{2}\sum_{i=1}^q\log\big(\mathrm{eig}_i(\Sigma^\star_1)\big) - \frac{n}{2}\sum_{i=1}^q\log\left(2 - \frac{1}{\mathrm{eig}_i(\Sigma^\star_1)}\right)\right) \\
&\le \exp\left(-\frac{nq}{2}\log(1 - \delta'_n) - \frac{nq}{2}\log\left(1 - \frac{\delta'_n}{1 - \delta'_n}\right)\right) \\
&\le \exp\left(\frac{nq}{2}\,\frac{\delta'_n}{1 - \delta'_n} + \frac{nq}{2}\,\frac{\delta'_n}{1 - 2\delta'_n}\right)
\lesssim \exp(nq\delta'_n) \le e.
\end{aligned}
\]
The conversion of the logarithms uses the fact that $1 - x^{-1} \le \log(x) \le x - 1$, and the final bound uses $nq\delta'_n = 1/(8C_3\sqrt{q}) \le 1$.
We can bound the log of the second factor of Equation (S52):
\[
|||\Omega^{-1}|||_2\, |||(2\Sigma^\star_1 - I)^{-1}|||_2 \sum_{i=1}^n \left\|X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\right\|_2^2
\le 2\sum_{i=1}^n \left\|X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\right\|_2^2.
\]
We can further bound the sum on the sieve:
\[
\begin{aligned}
\sum_{i=1}^n \left\|X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\right\|_2^2
&\le 2\sum_{i=1}^n \|X_i\Delta_\Psi\|_2^2 + 2\sum_{i=1}^n \|X_i\Psi_1\Omega_1^{-1}\Delta_\Omega\|_2^2 \\
&\le 2np\,|||\Delta_\Psi|||_2^2 + 2np\,|||\Psi_1|||_2^2\, |||\Omega_1^{-1}|||_2^2\, \|\Delta_\Omega\|_F^2 \\
&\le 2np\,\frac{1}{2C_3 np} + 2np\left(2C_3 p + \frac{1}{2C_3 np}\right)^2 \frac{1}{\tau^2}\,\frac{1}{(8C_3 n\max\{p, q\}^{3/2})^2} \\
&\le 2np\,\frac{1}{2C_3 np} + 2np\cdot 16C_3^2 p^2\, \frac{1}{\tau^2}\,\frac{1}{(8C_3 n\max\{p, q\}^{3/2})^2}
\lesssim 1.
\end{aligned}
\]
We bound the norm of $\Psi_1$ using the triangle inequality: $|||\Psi_1||| \le |||\Psi||| + |||\Psi_1 - \Psi||| \le 2C_3 p + 1/(2C_3 np)$. The first term is $O(1)$ and the second term is $O(1/q)$; by combining these results, we conclude that the second factor of Equation (S52) is bounded.
Thus, following the argument of Ning et al. (2020), the desired test $\varphi_n$ in Equation (S50) can be obtained as the maximum of all the tests $\phi_n$ described above.
S5.2.4 Pieces needed to cover the sieve

From here we can show the contraction in log-affinity $\rho(f, f_0)$. To finish the proof, we check that the number of sets described in Equation (S51) needed to cover the sieve $F_n$, denoted by $N$, can be bounded by $\exp(Cn\epsilon_n^2)$ for some suitable constant $C$.

The number $N$ is called a covering number of $F_n$. A closely related quantity is the packing number, which is defined as the maximum number of disjoint balls centered in a set, and which upper bounds the covering number. Both the covering number and the packing number can be used as measures of the complexity of a given set (Ghosal and van der Vaart, 2017).

The packing number of a set usually depends exponentially on the set's dimension. Because Ning et al. (2020) studied posteriors that place positive probability on exactly sparse parameters, they were able to directly bound the packing number of suitable low-dimensional sets. In our case, which uses an absolutely continuous prior, we instead need to control the packing number of "effectively low-dimensional" spaces.

Lemma S4 provides a sufficient condition under which the complexity (measured by packing number) of a set of "effectively sparse" vectors can be bounded by the complexity of a set of exactly sparse vectors.
Lemma S4 (packing a shallow cylinder in $L^p$). For a set of the form $E = A\times[-\delta, \delta]^{Q-s} \subset \mathbb{R}^Q$, where $A \subset \mathbb{R}^s$ (with $s > 0$ and $Q \ge s + 1$ integers), for $1 \le p < \infty$ and a given $T > 1$, if $\delta < \frac{\epsilon}{2[T(Q-s)]^{1/p}}$, we have the packing number bounds:
\[
D(\epsilon, A, \|\cdot\|_p) \le D(\epsilon, E, \|\cdot\|_p) \le D\left((1 - T^{-1})^{1/p}\epsilon, A, \|\cdot\|_p\right).
\]
Proof. The lower bound is trivial: observe that $A\times\{0\}^{Q-s} \subset E$ and that the packing number of $A\times\{0\}^{Q-s}$ is exactly the packing number of $A$. For the upper bound, we show that for each packing of $E$ we can slice that packing with the $0$-plane to form a packing of $A$ with the same number of balls but a smaller radius (see Figure S7 for an illustration).

We first show that any $L^p$ $\epsilon/2$-ball $B_\theta(\epsilon/2)$ centered in the set $E$ intersects the plane $\mathbb{R}^s\times\{0\}^{Q-s}$. Assume the center is $\theta = (x_1, \ldots, x_Q)$. It suffices to show that the center's distance to the plane is less than the radius of the ball. Since the center is in $E$, we have $|x_i| \le \delta$ for the last $Q - s$ coordinates. Denote the projection of the center onto the plane by $\theta_A = (x_1, \ldots, x_s, 0) \in A\times\{0\}^{Q-s}$. Then the $L^p$ distance from the center to the plane is
\[
\|\theta_A - \theta\|_p^p = \sum_{i=s+1}^{Q}|x_i|^p \le (Q - s)\delta^p < T^{-1}(\epsilon/2)^p.
\]
Next we show that the slice $B_\theta(\epsilon/2)\cap(\mathbb{R}^s\times\{0\}^{Q-s})$ is also a ball, centered at $\theta_A$, in the lower-dimensional plane. It suffices to show that its boundary is a sphere. Take a point $a$ from the intersection of the boundary of $B_\theta(\epsilon/2)$ with $\mathbb{R}^s\times\{0\}^{Q-s}$. The vector from the center to this point can be decomposed into the sum of two orthogonal components, namely the vector from $\theta_A$ to $a$ and the vector from $\theta_A$ to $\theta$, so that
\[
\|a - \theta_A\|_p^p + \|\theta_A - \theta\|_p^p = \|a - \theta\|_p^p = \epsilon^p/2^p,
\]
because $a - \theta_A$ has all zero entries in the last $Q - s$ coordinates and $\theta_A - \theta$ has all zero entries in the first $s$ coordinates. Thus any such point has a fixed distance to $\theta_A$, the projection of the center $\theta$ onto the plane of $A$. Notice that
\[
\|a - \theta_A\|_p^p = \epsilon^p/2^p - \|\theta_A - \theta\|_p^p,
\]
which is fixed. Thus the collection of such points $a$ forms a sphere in $A$'s plane.

From here, we can also lower bound the radius of the slice by $(1 - T^{-1})^{1/p}\epsilon/2$: since $\|\theta_A - \theta\|_p^p < T^{-1}(\epsilon/2)^p$, the radius satisfies $\|a - \theta_A\|_p > (1 - T^{-1})^{1/p}\epsilon/2$. Thus the smaller ball must lie within the slice, i.e.
\[
B_{\theta_A}\left((1 - T^{-1})^{1/p}\epsilon/2\right)\times\{0\}^{Q-s} \subset B_\theta(\epsilon/2)\cap(\mathbb{R}^s\times\{0\}^{Q-s}) \subset B_\theta(\epsilon/2). \tag{S53}
\]
That is, any $\epsilon/2$-ball centered in $E$ contains a corresponding $(1 - T^{-1})^{1/p}\epsilon/2$ lower-dimensional ball centered in $A$. With the above observations in hand, we can now prove the inequality by contradiction.

Suppose we have a packing of $E$, $\{\theta_1, \ldots, \theta_D\}$, where $D$ is larger than the packing number of $A$ appearing in the main result. By Equation (S53), the lower-dimensional balls $B_{\theta_{iA}}\left((1 - T^{-1})^{1/p}\epsilon/2\right)$ must also be disjoint. Since the centers $\theta_{iA} \in A$, these balls form a packing of $A$ with radius $\epsilon' = (1 - T^{-1})^{1/p}\epsilon$. That is, we would have found a packing with more balls than the packing number, yielding the desired contradiction. Thus we must have
\[
D \le D\left((1 - T^{-1})^{1/p}\epsilon, A, \|\cdot\|_p\right).
\]
Figure S7: A schematic of the argument used in the proof of the packing number lemma. We show two disjoint unit L1 balls (red), centered at $(0.8, 0, 0.5)$ and $(-0.3, -1, -0.2)$, both within $A\times[-0.5, 0.5]$ (with $A = [-1, 1]\times[-1, 1]$ shown as the middle plane). Their slices in the $z = 0$ plane (blue) also form L1 balls in $\mathbb{R}^2$ whose radii are lower bounded and whose centers lie within $A$, thus inducing a packing of the lower-dimensional set.
Now we can bound the logarithm of the covering number, $\log(N)$, similarly to Ning et al. (2020):
\[
\log(N) \le \log N\left(\frac{1}{2C_3 np}, \{\Psi \in B^{\Psi}_n : \|\Psi\|_1 \le 2C_3 p\}, \|\cdot\|_1\right)
+ \log N\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{\Omega \in B^{\Omega}_n : \|\Omega\|_1 \le 8C_3 q\}, \|\cdot\|_1\right).
\]
The two terms above can be treated in a similar way. Denote $\max\{p, q, s^{\Psi}_0, s^{\Omega}_0\} = s^\star$. There are multiple ways to allocate the effective zeros, which introduces the binomial coefficients
below:
\[
\begin{aligned}
&N\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{\Omega \in B^{\Omega}_n : \|\Omega\|_1 \le 8C_3 q\}, \|\cdot\|_1\right) \\
&\qquad\le \binom{Q}{C_3' s^\star}\, N\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{V \in \mathbb{R}^{Q+q} : |v_i| < \delta_\omega \text{ for } 1 \le i \le Q + q - C_3' s^\star,\ \|V\|_1 \le 8C_3 q\}, \|\cdot\|_1\right), \\
&N\left(\frac{1}{2C_3 np}, \{\Psi \in B^{\Psi}_n : \|\Psi\|_1 \le 2C_3 p\}, \|\cdot\|_1\right) \\
&\qquad\le \binom{pq}{C_3' s^\star}\, N\left(\frac{1}{2C_3 np}, \{V \in \mathbb{R}^{pq} : |v_i| < \delta_\psi \text{ for } 1 \le i \le pq - C_3' s^\star,\ \|V\|_1 \le 2C_3 p\}, \|\cdot\|_1\right).
\end{aligned}
\]
Note that $\Omega$ has $Q + q < 2Q$ free parameters. We first have
\[
\log\binom{Q}{C_3' s^\star} \lesssim s^\star\log(Q) \lesssim n\epsilon_n^2,
\qquad
\log\binom{pq}{C_3' s^\star} \lesssim s^\star\log(pq) \lesssim n\epsilon_n^2.
\]
We further bound the covering number using the result in Lemma S4. Observe that $\{V : |v_i| < \delta_\omega \text{ for } 1 \le i \le Q + q - C_3' s^\star,\ \|V\|_1 \le 8C_3 q\} \subset \{V' : \|V'\|_1 \le 8C_3 q\}\times[-\delta_\omega, \delta_\omega]^{Q + q - C_3' s^\star}$, where $V' \in \mathbb{R}^{C_3' s^\star}$, so that
\[
\begin{aligned}
N&\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{V : |v_i| < \delta_\omega \text{ for } 1 \le i \le Q + q - C_3' s^\star,\ \|V\|_1 \le 8C_3 q\}, \|\cdot\|_1\right) \\
&\le N\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{V' \in \mathbb{R}^{C_3' s^\star} : \|V'\|_1 \le 8C_3 q\}\times[-\delta_\omega, \delta_\omega]^{Q + q - C_3' s^\star}, \|\cdot\|_1\right).
\end{aligned}
\]
We check the condition of Lemma S4 (with $p = 1$ and $T = 2$); by our assumption on $\xi_0$, we have:
\[
(Q + q - C_3' s^\star)\delta_\omega \le 2Q\delta_\omega = \frac{2Q}{\xi_0 - \xi_1}\log\left(\frac{1 - \eta}{\eta}\,\frac{\xi_0}{\xi_1}\right)
\lesssim \frac{Q\log(\max\{p, q, n\})}{\max\{Q, pq, n\}^{4 + b/2}}
\le \frac{1}{\max\{Q, pq, n\}^{3 + b/2}}.
\]
The denominator dominates $C_3 n\max\{p, q\}^{3/2}$; thus for large enough $n$ we have $(Q + q - C_3' s^\star)\delta_\omega \le \frac{1}{32C_3 n\max\{p, q\}^{3/2}}$, and so by Lemma S4 we can control the covering number by the packing number:
\[
\begin{aligned}
\log N&\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{V : |v_i| < \delta_\omega \text{ for } 1 \le i \le Q + q - C_3' s^\star,\ \|V\|_1 \le 8C_3 q\}, \|\cdot\|_1\right) \\
&\le \log D\left(\frac{1}{16C_3 n\max\{p, q\}^{3/2}}, \{V' \in \mathbb{R}^{C_3' s^\star} : \|V'\|_1 \le 8C_3 q\}, \|\cdot\|_1\right)
\lesssim s^\star\log\left(128C_3^2\, q\, n\max\{p, q\}^{3/2}\right)
\lesssim n\epsilon_n^2.
\end{aligned}
\]
Similarly for $\Psi$,
\[
\begin{aligned}
N&\left(\frac{1}{2C_3 np}, \{V : |v_i| < \delta_\psi \text{ for } 1 \le i \le pq - C_3' s^\star,\ \|V\|_1 \le 2C_3 p\}, \|\cdot\|_1\right) \\
&\le N\left(\frac{1}{2C_3 np}, \{V' \in \mathbb{R}^{C_3' s^\star} : \|V'\|_1 \le 2C_3 p\}\times[-\delta_\psi, \delta_\psi]^{pq - C_3' s^\star}, \|\cdot\|_1\right).
\end{aligned}
\]
We again check the condition of Lemma S4 (again with $p = 1$ and $T = 2$):
\[
(pq - C_3' s^\star)\delta_\psi \le pq\,\delta_\psi = \frac{pq}{\lambda_0 - \lambda_1}\log\left(\frac{1 - \theta}{\theta}\,\frac{\lambda_0}{\lambda_1}\right)
\lesssim \frac{pq\log(\max\{p, q, n\})}{\max\{pq, n\}^{5/2 + b/2}}
\le \frac{1}{\max\{pq, n\}^{3/2 + b/2}}.
\]
The denominator dominates $2C_3 np$; thus for large enough $n$ we have $(pq - C_3' s^\star)\delta_\psi \le \frac{1}{4\cdot 2C_3 np}$. Thus, similarly to $\Omega$, we have:
\[
\begin{aligned}
\log N&\left(\frac{1}{2C_3 np}, \{V : |v_i| < \delta_\psi \text{ for } 1 \le i \le pq - C_3' s^\star,\ \|V\|_1 \le 2C_3 p\}, \|\cdot\|_1\right) \\
&\le \log D\left(\frac{1}{2\cdot 2C_3 np}, \{V' \in \mathbb{R}^{C_3' s^\star} : \|V'\|_1 \le 2C_3 p\}, \|\cdot\|_1\right)
\lesssim s^\star\log\left(4C_3 p\cdot 2C_3 np\right)
\lesssim n\epsilon_n^2.
\end{aligned}
\]
Thus we finally obtain the contraction under log-affinity.
S5.3 From log-affinity to $\Omega$ and $X\Psi\Omega^{-1}$

In this section we show the main result, Theorem 1, using the contraction under log-affinity. Denoting $\Psi - \Psi_0 = \Delta_\Psi$ and $\Omega - \Omega_0 = \Delta_\Omega$, the log-affinity $\frac{1}{n}\sum\rho(f_i, f_{0i})$ is
\[
\frac{1}{n}\sum\rho(f_i, f_{0i})
= -\log\frac{|\Omega^{-1}|^{1/4}\,|\Omega_0^{-1}|^{1/4}}{|(\Omega^{-1} + \Omega_0^{-1})/2|^{1/2}}
+ \frac{1}{8n}\sum X_i(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\left(\frac{\Omega^{-1} + \Omega_0^{-1}}{2}\right)^{-1}(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})^\top X_i^\top.
\]
Thus $\sum\rho(f_i, f_{0i}) \lesssim n\epsilon_n^2$ implies
\[
\begin{aligned}
-\log\frac{|\Omega^{-1}|^{1/4}\,|\Omega_0^{-1}|^{1/4}}{|(\Omega^{-1} + \Omega_0^{-1})/2|^{1/2}} &\lesssim \epsilon_n^2, \\
\frac{1}{8n}\sum X_i(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\left(\frac{\Omega^{-1} + \Omega_0^{-1}}{2}\right)^{-1}(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})^\top X_i^\top &\lesssim \epsilon_n^2.
\end{aligned}
\tag{S54}
\]
This is almost the same as Ning et al. (2020)'s Equations 5.11-5.12. We can directly apply the result from Ning et al. (2020)'s Equation 5.11, as it is the same as the first equation in Equation (S54). Because $\Psi_0$ and $\Omega^{-1}$ have bounded operator norms and because $\Delta_\Omega$ can be controlled, the cross-term is also controlled by $\epsilon_n$. The first part of Equation (S54) implies
\[
\|\Omega^{-1} - \Omega_0^{-1}\|_F^2 \lesssim \epsilon_n^2.
\]
Meanwhile, $\|\Omega^{-1} - \Omega_0^{-1}\|_F^2 \lesssim \epsilon_n^2$ implies that, for large enough $n$, $\Omega$'s L2 operator norm is bounded (since we assume bounds on $\Omega_0^{-1}$'s operator norm, and the difference cannot have very large eigenvalues that would push the sum toward a zero eigenvalue). Using the result $\|AB\|_F \le |||A|||_2\, \|B\|_F$, observing that $\Omega - \Omega_0 = \Omega(\Omega_0^{-1} - \Omega^{-1})\Omega_0$, and using the assumption that $\Omega_0$ has bounded L2 operator norm, we conclude that (S54) implies $\|\Omega - \Omega_0\|_F \lesssim \epsilon_n$.
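For concreteness, this step can be written as a single chain, applying the stated sub-multiplicativity fact once on each side:
\[
\|\Omega - \Omega_0\|_F = \|\Omega(\Omega_0^{-1} - \Omega^{-1})\Omega_0\|_F
\le |||\Omega|||_2\, |||\Omega_0|||_2\, \|\Omega_0^{-1} - \Omega^{-1}\|_F
\lesssim \epsilon_n.
\]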
Since $|||\Omega^{-1}|||_2$ is bounded for large enough $n$, we can directly apply an argument from Ning et al. (2020) (specifically the argument around their Equation 5.12) to conclude that the second part of (S54) implies:
\[
\epsilon_n^2 \gtrsim \frac{1}{8n}\sum\left\|X_i(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\right\|_2^2\, \Big|\Big|\Big|\frac{\Omega^{-1} + \Omega_0^{-1}}{2}\Big|\Big|\Big|_2^{-1}
\gtrsim \frac{1}{n}\sum\left\|X_i(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\right\|_2^2\Big/\sqrt{\epsilon_n^2 + 1}.
\]
Combining all of these results yields the desired result.
S5.4 Contraction of $\Psi$

Contraction of $\Psi$ requires more assumptions on the design matrix $X$. Similar to Ročková and George (2018) and Ning et al. (2020), we introduce the restricted eigenvalue
\[
\phi^2(\tilde{s}) = \inf\left\{\frac{\|XA\|_F^2}{n\|A\|_F^2} : 0 < |\nu(A)| \le \tilde{s}\right\}.
\]
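Spelled out, the definition is used below in the following direction: for any matrix $A$ whose effective dimension satisfies $0 < |\nu(A)| \le \tilde{s}$,
\[
\|XA\|_F^2 \ge n\,\phi^2(\tilde{s})\,\|A\|_F^2.
\]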
With this definition,
\[
\|X(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\|_F^2 \lesssim n\epsilon_n^2,
\qquad
\|\Omega - \Omega_0\|_F^2 \lesssim \epsilon_n^2
\]
implies the result in Equation (15) of the main text. Namely,
\[
\|\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1}\|_F^2 = \|(\Delta_\Psi - \Psi_0\Omega_0^{-1}\Delta_\Omega)\Omega^{-1}\|_F^2 \lesssim \frac{\epsilon_n^2}{\phi^2(s^{\Psi}_0 + C_3' s^\star)}.
\]
Since both $\Omega$ and $\Omega^{-1}$ have bounded operator norm when $\|\Omega - \Omega_0\|_F^2 \lesssim \epsilon_n^2$, for large enough $n$ we must have:
\[
\|\Delta_\Psi\|_F - \|\Psi_0\Omega_0^{-1}\Delta_\Omega\|_F \le \|\Delta_\Psi - \Psi_0\Omega_0^{-1}\Delta_\Omega\|_F \lesssim \epsilon_n\Big/\sqrt{\phi^2(s^{\Psi}_0 + C_3' s^\star)}.
\]
Since $\Psi_0$ and $\Omega_0^{-1}$ have bounded operator norm, $\|\Psi_0\Omega_0^{-1}\Delta_\Omega\|_F \lesssim \epsilon_n$, and we must have:
\[
\|\Delta_\Psi\|_F \lesssim \epsilon_n\Big/\sqrt{\min\{\phi^2(s^{\Psi}_0 + C_3' s^\star),\, 1\}}.
\]
Thus we can conclude
\[
\sup_{\Psi_0 \in \mathcal{T}_0,\ \Omega_0 \in \mathcal{H}_0} E_0\Pi\left(\|\Psi - \Psi_0\|_F^2 \ge \frac{M'\epsilon_n^2}{\min\{\phi^2(s^{\Psi}_0 + C_3' s^\star),\, 1\}} \,\Big|\, Y_1, \ldots, Y_n\right) \to 0.
\]