Content uploaded by Yunyi Shen
Author content
All content in this area was uploaded by Yunyi Shen on Jul 15, 2022
Content may be subject to copyright.
Available via license: CC BY 4.0
Sparse Gaussian chain graphs with the spike-and-slab
LASSO: Algorithms and asymptotics
Yunyi Shen∗, Claudia Solís-Lemus†, and Sameer K. Deshpande‡
July 15, 2022
Abstract
The Gaussian chain graph model simultaneously parametrizes (i) the direct effects
of p predictors on q correlated outcomes and (ii) the residual partial covariance
between pairs of outcomes. We introduce a new method for fitting sparse Gaussian chain
graph models with spike-and-slab LASSO (SSL) priors. We develop an Expectation
Conditional Maximization algorithm to obtain sparse estimates of the p × q matrix of
direct effects and the q × q residual precision matrix. Our algorithm iteratively solves a
sequence of penalized maximum likelihood problems with self-adaptive penalties that
gradually filter out negligible regression coefficients and partial covariances. Because it
adaptively penalizes model parameters, our method is seen to outperform fixed-penalty
competitors on simulated data. We establish the posterior concentration rate for our
model, buttressing our method’s excellent empirical performance with strong theoretical
guarantees. We use our method to reanalyze a dataset from a study of the effects
of diet and residence type on the composition of the gut microbiome of elderly adults.
∗Laboratory for Information & Decision Systems, Massachusetts Institute of Technology. The work was
done while the author was at the University of Wisconsin–Madison.
†Wisconsin Institute for Discovery & Dept. of Plant Pathology, University of Wisconsin–Madison. Correspondence to: solislemus@wisc.edu
‡Dept. of Statistics, University of Wisconsin–Madison. Correspondence to: sameer.deshpande@wisc.edu
arXiv:2207.07020v1 [stat.ME] 14 Jul 2022
1 Introduction
1.1 Motivation
There are between 10 and 100 trillion microorganisms living within each person’s lower
intestines. These bacteria, fungi, viruses, and other microbes constitute the human gut
microbiome (Guinane and Cotter, 2013). Recent research suggests that the composition of
the human gut microbiome can have a substantial effect on our health and well-being (Shreiner
et al., 2015): microbes living in the gut play an integral role in our digestive and metabolic
processes (Larsbrink et al., 2014; Belcheva et al., 2014); they can mediate our immune
response to various diseases (Kamada and Núñez, 2014; Kim et al., 2017); and they can even
influence disease pathogenesis and progression (Scher et al., 2013; Wang et al., 2011).
Additional emerging evidence suggests that the gut microbiome mediates the effects of
lifestyle factors such as diet and medication use on human health (Singh et al., 2017; Battson
et al., 2018; Hills Jr et al., 2019). That is, such lifestyle factors may first affect the
composition of the gut microbiome, which in turn influences health outcomes. In fact, lifestyle
factors and medication use can impact the composition of the microbiome in direct and
indirect ways. For instance, many antibiotics target and kill certain microbial species, thereby
directly affecting the abundances of the targeted species. However, by killing the targeted
species, the antibiotics may reduce the overall competition for nutrients, thereby allowing
non-targeted species to proliferate. In other words, by directly reducing the abundance of
certain targeted microbes, antibiotics may indirectly increase the abundance of other
non-targeted species. Our goal in this paper is to estimate such direct and indirect effects.
1.2 Sparse chain graph models
At a high level, the statistical challenge is to estimate the functional relationship between
a vector of predictors x ∈ R^p and a vector of responses y ∈ R^q. In our application, we
reanalyze a dataset from Claesson et al. (2012) containing n = 178 predictor-response pairs
(x, y), where x contains measures of p = 11 factors related to diet, medication use, and
residence type, and y contains the logit-transformed relative abundances of q = 14 different
microbial taxa. Our goal is to uncover the direct and indirect effects of these factors on the
abundance of each microbial taxon as well as any interactions between microbial taxa. The
Gaussian chain graph model (Lauritzen and Wermuth, 1989; Frydenberg, 1990; Lauritzen
and Richardson, 2002), which simultaneously parameterizes the direct effects of predictors
on responses and the residual dependence structure between responses, is natural for these
data. The model asserts that

y | Ψ, Ω, x ∼ N(Ω^{-1} Ψ^T x, Ω^{-1}),   (1)
where Ψ is a p × q matrix and Ω is a symmetric, positive definite q × q matrix. As we detail
in Section 2.1, the (j, k) entry of Ψ, ψ_{j,k}, quantifies the direct effect of the jth predictor
X_j on the kth response Y_k. The (k, k′) entry of Ω, ω_{k,k′}, encodes the residual conditional
covariance between outcomes Y_k and Y_{k′} that remains after accounting for the direct effects
of the predictors and all of the other response variables.
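To make the parametrization concrete, the following short simulation draws data from the model in Equation (1). The dimensions and parameter values are illustrative choices of ours, not the paper's data or settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: p = 3 predictors, q = 4 outcomes, n = 500 samples
p, q, n = 3, 4, 500

Psi = np.zeros((p, q))           # direct effects: mostly sparse
Psi[0, 0] = 1.0
Psi[2, 3] = -0.5

Omega = np.eye(q)                # residual precision matrix (symmetric PD)
Omega[1, 2] = Omega[2, 1] = 0.4  # Y2 and Y3 are conditionally dependent

X = rng.standard_normal((n, p))
Omega_inv = np.linalg.inv(Omega)

# y_i | x_i ~ N(Omega^{-1} Psi^T x_i, Omega^{-1}), i.e. Equation (1)
mean = X @ Psi @ Omega_inv       # n x q matrix of conditional means
chol = np.linalg.cholesky(Omega_inv)
Y = mean + rng.standard_normal((n, q)) @ chol.T

print(Y.shape)  # (500, 4)
```

With a sample this large, the empirical covariance of the residuals Y − mean should be close to Ω^{-1}, which is a quick way to check the parametrization.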
To fit the model in Equation (1), we must estimate pq + q(q + 1)/2 unknown parameters.
When the total number of unknown parameters is comparable to or larger than the sample
size n, it is common to assume that the matrices Ψ and Ω are sparse. If ω_{k,k′} = 0, we
can conclude that, after adjusting for the covariates and all other outcomes, outcomes Y_k
and Y_{k′} are conditionally independent. If ψ_{j,k} = 0, we can conclude that X_j does not have
a direct effect on the kth outcome variable Y_k. Furthermore, when ψ_{j,k} = 0, any marginal
correlation between X_j and Y_k is due solely to X_j’s direct effects on other outcomes Y_{k′} that
are themselves conditionally correlated with Y_k.
1.3 Our contributions
We introduce the chain graph spike-and-slab LASSO (cgSSL) procedure for fitting the model
in Equation (1) in a sparse fashion. At a high level, we place separate spike-and-slab LASSO
priors (Ročková and George, 2018) on the entries of Ψ and on the off-diagonal entries of Ω
in Equation (1). We derive an efficient Expectation Conditional Maximization algorithm to
compute the maximum a posteriori (MAP) estimates of Ψ and Ω. Our algorithm is equivalent
to solving a series of maximum likelihood problems with self-adaptive penalties. On synthetic
data, we demonstrate that our algorithm displays excellent support recovery and estimation
performance. We further establish the posterior contraction rate for each of Ψ, Ω, ΨΩ^{-1},
and XΨΩ^{-1}. Our contraction results imply that our proposed cgSSL procedure consistently
estimates these quantities and also provide an upper bound on the minimax optimal rate
of estimating these quantities in the Frobenius norm. To the best of our knowledge, ours are
the first posterior contraction results for fitting sparse Gaussian chain graph models with
element-wise priors on Ψ and Ω.
Here is an outline for the rest of our paper. We review the Gaussian chain graph model and
the spike-and-slab LASSO in Section 2. We next introduce the cgSSL procedure in Section 3
and carefully derive our ECM algorithm for finding the MAP in Section 3.2. We present our
asymptotic results in Section 4 before demonstrating the excellent finite sample performance
of the cgSSL on several synthetic datasets in Section 5. We apply the cgSSL to our motivating
gut microbiome data in Section 6. We conclude in Section 7 by outlining several avenues for
future development.
2 Background
2.1 The Gaussian chain graph model
Graphical models are a convenient way to represent the dependence structure between several
variables. Specifically, we can represent each variable as a node in a graph and we can draw
edges to indicate conditional dependence between variables. Absence of an edge between
two nodes indicates the corresponding variables are conditionally independent given all of the
other variables. In the context of our gut microbiome data, we can represent each predictor
X_j with a node and each outcome Y_k with a node. We are primarily interested in detecting
edges between predictors and outcomes and edges between outcomes. Figure 1a is a cartoon
illustration of such a graphical model with p = 3 and q = 4. Note that we have not drawn
any edges between the predictors, as such edges are not typically of primary interest.
Figure 1: Cartoon illustrations of a general graphical model (a) and a Gaussian chain graph
model (b) with p = 3 covariates and q = 4 outcomes. Edges in both graphs encode
conditional dependence relationships. The edge labels in (b) correspond to the non-zero
parameters in Equation (1).
Without additional modeling assumptions, estimating a discrete graph like that in Figure 1a
from n pairs of data (x_1, y_1), ..., (x_n, y_n) is a challenging task. The Gaussian chain graph
model in Equation (1) translates the discrete graph estimation problem into a much more
tractable continuous parameter estimation problem. Specifically, the model introduces two
matrices, Ψ and Ω, and asserts that y | Ψ, Ω, x ∼ N(Ω^{-1} Ψ^T x, Ω^{-1}). Under the Gaussian chain
graph model, X_j and Y_k are conditionally independent if and only if ψ_{j,k} = 0. Furthermore, Y_k and
Y_{k′} are conditionally independent if and only if ω_{k,k′} = 0. In other words, by first estimating
Ψ and Ω and then examining their supports, we can recover the underlying graphical model.
Figure 1b reproduces the cartoon from Figure 1a with edges labelled by the corresponding
non-zero parameters in Equation (1).
In the Gaussian chain graph model, the direct effect of X_j on Y_k is defined as

E[Y_k | X_j = x_j + 1, Y_{-k}, X_{-j}, Ψ, Ω] − E[Y_k | X_j = x_j, Y_{-k}, X_{-j}, Ψ, Ω] = −ψ_{j,k}/ω_{k,k}.

That is, fixing the values of all of the other covariates and all of the other outcomes, an
increase of one unit in X_j is associated with a −ψ_{j,k}/ω_{k,k} unit increase in the expectation of
Y_k. Notice that the direct effect of X_j on Y_k is defined conditionally on the values of all
other outcomes Y_{k′}. Because of this, the direct effect of X_j on Y_k is typically not equal to its
marginal effect, which is defined as

E[Y_k | X_j = x_j + 1, X_{-j}, Ψ, Ω] − E[Y_k | X_j = x_j, X_{-j}, Ψ, Ω] = β_{j,k},

where β_{j,k} is the (j, k) entry of the matrix B = ΨΩ^{-1}. Notice that we can re-parametrize the
Gaussian chain graph model in Equation (1) in terms of B:

y | B, Ω, x ∼ N(B^T x, Ω^{-1}).   (2)
We will refer to this re-parametrized model as the marginal regression model. There is a
considerable literature on fitting sparse marginal regression models and we refer the reader
to Deshpande et al. (2019) and references therein for a review.
Generally speaking, under (1), the supports of Ψ and B will be different. Specifically, it is
possible for X_j to have a marginal effect but no direct effect on Y_k. For instance, in Figure 1,
although X_3 does not directly affect Y_2, it may still be marginally correlated with Y_2 thanks
to the conditional correlation between Y_2 and Y_3. That is, changing the value of X_3 can
change the value of Y_3, which in turn changes the value of Y_2. Consequently, if we fit a sparse
marginal regression model, we cannot generally expect to recover sparse estimates of the
matrix of direct effects.
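A tiny numerical example makes this point concrete: in the hypothetical two-predictor, two-outcome setup below, Ψ is sparse but the matrix of marginal effects B = ΨΩ^{-1} is dense, because the residual dependence between the outcomes propagates each direct effect.

```python
import numpy as np

# Hypothetical example: X2 has no direct effect on Y1 (psi_{2,1} = 0),
# but Y1 and Y2 are conditionally correlated through Omega.
Psi = np.array([[1.0, 0.0],
                [0.0, 1.0]])      # sparse matrix of direct effects
Omega = np.array([[1.0, 0.5],
                  [0.5, 1.0]])    # residual precision: Y1, Y2 dependent

B = Psi @ np.linalg.inv(Omega)    # marginal effects, B = Psi Omega^{-1}
# B[1, 0] is nonzero even though Psi[1, 0] = 0
print(B)
```

So a sparse fit of the marginal model (2) would not recover the zero pattern of Ψ, exactly as the text argues.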
2.2 Related works
Learning sparse chain graphs. McCarter and Kim (2014) proposed fitting sparse Gaussian
chain graphical models by minimizing a penalized negative log-likelihood. They specifically
proposed homogeneous L1 penalties on the entries of Ψ and Ω and used cross-validation
to set the penalty parameters for Ψ and Ω. Shen and Solis-Lemus (2021) developed a Bayesian
version of that chain graphical LASSO and put a Gamma prior on the penalty parameters.
In this way, they automatically learned the degree to which the entries ψ_{j,k} and ω_{k,k′} are
shrunk to zero. Although these papers differ in how they determine the appropriate amount
of penalization, both McCarter and Kim (2014) and Shen and Solis-Lemus (2021) deploy a
single fixed penalty on all of the entries in Ψ and a single fixed penalty on all entries in Ω.
With such fixed penalties, larger parameter estimates are shrunk towards zero as aggressively
as the smaller parameter estimates, which can introduce substantial estimation bias.
Spike-and-slab variable selection with the EM algorithm. Spike-and-slab priors are
the workhorses of sparse Bayesian modeling. As introduced by Mitchell and Beauchamp
(1988), the spike-and-slab prior is a mixture of a point mass at 0 (the “spike”) and
a uniform distribution over a wide interval (the “slab”). George and McCulloch (1993)
introduced a continuous relaxation of the original spike-and-slab prior, respectively replacing
the point mass spike and uniform slab distributions with zero-mean Gaussians with extremely
small and large variances. In this way, one may imagine generating all of the “essentially
negligible” parameters in a model from the spike distribution and generating all of the
“relevant” or “significant” parameters from the slab distribution. Despite their intuitive
appeal, spike-and-slab priors usually produce extremely multimodal posterior distributions.
In high dimensions, exploring these distributions with Markov chain Monte Carlo (MCMC)
is computationally prohibitive.
In response, Ročková and George (2014) introduced EMVS, a fast EM algorithm targeting
the maximum a posteriori (MAP) estimate of the regression parameters. They later extended
EMVS, which used conditionally conjugate Gaussian spike and slab distributions, to use
Laplacian spike and slab distributions in Ročková and George (2018). The resulting spike-and-slab
LASSO (SSL) procedure demonstrated excellent empirical performance. At a high
level, the SSL algorithm solves a sequence of L1 penalized regression problems with self-adaptive
penalties. The adaptive penalty mixing is key to the empirical success of the SSL
(George and Ročková, 2020; Bai et al., 2021), as it facilitates shrinking larger parameter
estimates to zero less aggressively than smaller parameter estimates.
Since Ročková and George (2014), the general EM technique for maximizing spike-and-slab
posteriors has been successfully applied to many problems. For instance, Bai et al. (2020)
introduced a grouped version of the SSL that adaptively shrinks groups of parameter values
towards zero. Tang et al. (2017, 2018) similarly deployed the SSL and its grouped variant
in generalized linear models. Outside of the single-outcome regression context, continuous
spike-and-slab priors have been used to estimate sparse Gaussian graphical models (Li et al.,
2019; Gan et al., 2019a,b), sparse factor models (Ročková and George, 2016), and biclustering
models (Moran et al., 2021). Deshpande et al. (2019) introduced a multivariate SSL for
estimating B and Ω in the marginal regression model in Equation (2). In each extension,
the adaptive penalization performed by the EM algorithm resulted in support recovery and
parameter estimation superior to that of fixed penalty methods.
The asymptotics of spike-and-slab variable selection. Beyond its excellent empirical
performance, Ročková and George (2018)’s SSL enjoys strong theoretical support. Using
general techniques proposed by Zhang and Zhang (2012) and Ghosal and van der Vaart
(2017), they proved that, under mild conditions, the posterior induced by the SSL prior in
high-dimensional, single-outcome linear regression contracts at a near minimax-optimal rate
as n → ∞. Their contraction result implies that the MAP estimate returned by their EM
algorithm is consistent and is, up to a log factor, rate-optimal. By directly applying Ghosal
and van der Vaart (2017)’s general theory, Bai et al. (2020) extended these results to the
group SSL posterior with an unknown variance.

In the context of Gaussian graphical models, Gan et al. (2019a) showed that the MAP
estimator corresponding to placing spike-and-slab LASSO priors on the off-diagonal elements
of a precision matrix is consistent. They did not, however, establish the contraction rate of
the posterior. Ning et al. (2020) showed that the joint posterior distribution of (B, Ω) in
the multivariate regression model in Equation (2) concentrates when using a group spike-and-slab
prior with a Laplace slab and point mass spike on B and a carefully selected prior
on the eigendecomposition of Ω^{-1}. To the best of our knowledge, however, the asymptotic
properties of the posterior formed by placing SSL priors on both the precision matrix Ω and
the regression coefficients Ψ in Equation (1) have not yet been established.
3 Introducing the cgSSL
3.1 The cgSSL prior
To quantify the prior belief that many entries in Ψ are essentially negligible, we model each
ψ_{j,k} as having been drawn either from a spike distribution, which is sharply concentrated
around zero, or a slab distribution, which is much more diffuse. More specifically, we take
the spike distribution to be Laplace(λ_0) and the slab distribution to be Laplace(λ_1), where
0 < λ_1 ≪ λ_0 are fixed positive constants. This way, the spike distribution is much more
heavily concentrated around zero than is the slab. We further let θ ∈ [0, 1] be the prior
probability that each ψ_{j,k} is drawn from the slab and model the ψ_{j,k}’s as conditionally
independent given θ. Thus, the prior density for Ψ, conditional on θ, is given by
π(Ψ | θ) = ∏_{j=1}^{p} ∏_{k=1}^{q} [ θ (λ_1/2) e^{−λ_1 |ψ_{j,k}|} + (1 − θ) (λ_0/2) e^{−λ_0 |ψ_{j,k}|} ].   (3)
Since Ω is symmetric, it is enough to specify a prior on the entries ω_{k,k′} where k ≤ k′. To
this end, we begin by placing an entirely analogous spike-and-slab prior on the off-diagonal
entries. That is, we model each ω_{k,k′} as being drawn from a Laplace(ξ_1), with probability
η ∈ [0, 1], or a Laplace(ξ_0), with probability 1 − η, where 0 < ξ_1 ≪ ξ_0. We similarly model
each ω_{k,k′} as conditionally independent given η and place independent Exp(ξ) priors on
the diagonal entries of Ω. We then truncate the resulting distribution of Ω | η to the cone of
symmetric positive definite matrices, yielding the prior density
π(Ω | η) ∝ [ ∏_{1≤k<k′≤q} ( η (ξ_1/2) e^{−ξ_1 |ω_{k,k′}|} + (1 − η) (ξ_0/2) e^{−ξ_0 |ω_{k,k′}|} ) ] × [ ∏_{k=1}^{q} ξ e^{−ξ ω_{k,k}} ] × 1(Ω ≻ 0).   (4)
Observe that 1 − θ and 1 − η respectively quantify the proportions of entries in Ψ and Ω
that are essentially negligible. To model our uncertainty about these proportions, we place
Beta priors on each of θ and η. Specifically, we independently model θ ∼ Beta(a_θ, b_θ) and
η ∼ Beta(a_η, b_η), where a_θ, b_θ, a_η, b_η > 0 are fixed positive constants.
3.2 Targeting the MAP
Unfortunately, the posterior distribution of (Ψ, θ, Ω, η) | Y is analytically intractable. Further,
it is generally high-dimensional and rather multimodal, rendering stochastic search
techniques like Markov chain Monte Carlo computationally impractical. We instead follow
Ročková and George (2018)’s example and focus on finding the maximum a posteriori
(MAP) estimate of (Ψ, θ, Ω, η). Throughout, we assume that the columns of X have been
centered and scaled to have norm √n.
To this end, we attempt to maximize the log posterior density

log π(Ψ, θ, Ω, η | Y) = (n/2) log|Ω| − (1/2) tr[(Y − XΨΩ^{-1}) Ω (Y − XΨΩ^{-1})^T]
    + ∑_{j=1}^{p} ∑_{k=1}^{q} log( θ λ_1 e^{−λ_1 |ψ_{j,k}|} + (1 − θ) λ_0 e^{−λ_0 |ψ_{j,k}|} )
    + ∑_{k=1}^{q−1} ∑_{k′>k} log( η ξ_1 e^{−ξ_1 |ω_{k,k′}|} + (1 − η) ξ_0 e^{−ξ_0 |ω_{k,k′}|} )
    − ∑_{k=1}^{q} ξ ω_{k,k} + log 1(Ω ≻ 0)
    + (a_θ − 1) log(θ) + (b_θ − 1) log(1 − θ)
    + (a_η − 1) log(η) + (b_η − 1) log(1 − η).   (5)
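As a sanity check, the log posterior density in Equation (5) can be evaluated directly. The sketch below is our own illustration of the objective, not the mSSL implementation; it assumes the notation above, with `xi` the exponential rate on the diagonal of Ω and positive definiteness checked via a Cholesky factorization.

```python
import numpy as np

def log_posterior(Psi, theta, Omega, eta, X, Y,
                  lam0, lam1, xi0, xi1, xi,
                  a_th, b_th, a_eta, b_eta):
    """Unnormalized log posterior of Equation (5); an illustrative sketch."""
    n, q = Y.shape
    try:
        L = np.linalg.cholesky(Omega)      # also verifies Omega is PD
    except np.linalg.LinAlgError:
        return -np.inf                     # outside the PD cone: 1(Omega > 0) = 0
    logdet = 2.0 * np.log(np.diag(L)).sum()
    # log-likelihood; note (Y - X Psi Omega^{-1}) = (Y Omega - X Psi) Omega^{-1}
    R = Y - X @ Psi @ np.linalg.inv(Omega)
    ll = 0.5 * n * logdet - 0.5 * np.trace(R @ Omega @ R.T)
    # SSL prior on the entries of Psi
    lp = np.log(theta * lam1 * np.exp(-lam1 * np.abs(Psi))
                + (1 - theta) * lam0 * np.exp(-lam0 * np.abs(Psi))).sum()
    # SSL prior on off-diagonal entries of Omega, Exp(xi) prior on diagonals
    iu = np.triu_indices(q, k=1)
    w = np.abs(Omega[iu])
    lo = np.log(eta * xi1 * np.exp(-xi1 * w)
                + (1 - eta) * xi0 * np.exp(-xi0 * w)).sum()
    lo -= xi * np.trace(Omega)
    # Beta priors on theta and eta
    lb = ((a_th - 1) * np.log(theta) + (b_th - 1) * np.log1p(-theta)
          + (a_eta - 1) * np.log(eta) + (b_eta - 1) * np.log1p(-eta))
    return ll + lp + lo + lb
```

Evaluating this function along an optimization path is a cheap way to confirm that an ECM-style update never decreases the posterior density.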
Optimizing the log posterior density directly is complicated by the non-concavity of log π(Ω|η).
Instead, following Deshpande et al. (2019), we iteratively optimize a surrogate objective using
an EM-like algorithm.
To motivate this approach, observe that we can obtain the prior density π(Ω | η) in Equation (4)
by marginalizing an augmented prior:

π(Ω | η) = ∫ π(Ω | δ) π(δ | η) dδ,

where δ = {δ_{k,k′} : 1 ≤ k < k′ ≤ q} is a collection of q(q − 1)/2 i.i.d. Bernoulli(η) variables
and

π(Ω | δ) ∝ [ ∏_{1≤k<k′≤q} ( ξ_1 e^{−ξ_1 |ω_{k,k′}|} )^{δ_{k,k′}} ( ξ_0 e^{−ξ_0 |ω_{k,k′}|} )^{1−δ_{k,k′}} ] × [ ∏_{k=1}^{q} ξ e^{−ξ ω_{k,k}} ] × 1(Ω ≻ 0).

In our augmented prior, δ_{k,k′} indicates whether ω_{k,k′} is drawn from the slab (δ_{k,k′} = 1) or the
spike (δ_{k,k′} = 0).
The above marginalization immediately suggests an EM algorithm: rather than optimize
log π(Ψ, θ, Ω, η | Y) directly, we can iteratively optimize a surrogate objective formed by
marginalizing the augmented log posterior density. That is, starting from some initial guess
(Ψ^{(0)}, θ^{(0)}, Ω^{(0)}, η^{(0)}), for t ≥ 1, the tth iteration of our algorithm consists of two steps. In the
first step, we compute the surrogate objective

F^{(t)}(Ψ, θ, Ω, η) = E_{δ|·}[ log π(Ψ, θ, Ω, η, δ | Y) | Ψ = Ψ^{(t−1)}, Ω = Ω^{(t−1)}, θ = θ^{(t−1)}, η = η^{(t−1)} ],

where the expectation is taken with respect to the conditional posterior distribution of the
indicators δ given the current value of (Ψ, θ, Ω, η). Then, in the second step, we maximize
the surrogate objective and set (Ψ^{(t)}, θ^{(t)}, Ω^{(t)}, η^{(t)}) = argmax F^{(t)}(Ψ, θ, Ω, η).
It turns out that, given Ω and η, the indicators δ_{k,k′} are conditionally independent Bernoulli
random variables whose means are easy to evaluate, making it simple to compute a closed
form expression for the surrogate objective F^{(t)}. Unfortunately, maximizing F^{(t)} is still
difficult. Consequently, similar to Deshpande et al. (2019), we carry out two conditional
maximizations, first optimizing with respect to (Ψ, θ) while holding (Ω, η) fixed, and then
optimizing with respect to (Ω, η) while holding (Ψ, θ) fixed. That is, in the second step of each
iteration of our algorithm, we set

(Ψ^{(t)}, θ^{(t)}) = argmax_{Ψ, θ} F^{(t)}(Ψ, θ, Ω^{(t−1)}, η^{(t−1)})   (6)
(Ω^{(t)}, η^{(t)}) = argmax_{Ω, η} F^{(t)}(Ψ^{(t)}, θ^{(t)}, Ω, η).   (7)
In summary, we propose finding the MAP estimate of (Ψ, θ, Ω, η) using an Expectation
Conditional Maximization (ECM; Meng and Rubin, 1993) algorithm.

When we fix the values of Ω and η, the surrogate objective F^{(t)} is separable in Ψ and
θ. That is, the objective function F^{(t)}(Ψ, θ, Ω^{(t−1)}, η^{(t−1)}) in Equation (6) can be written
as the sum of a function of Ψ alone and a function of θ alone. This means that we can
separately compute Ψ^{(t)} and θ^{(t)} while fixing (Ω, η) = (Ω^{(t−1)}, η^{(t−1)}). The objective function
in Equation (7) is similarly separable and we can separately compute Ω^{(t)} and η^{(t)} while
fixing (Ψ, θ) = (Ψ^{(t)}, θ^{(t)}). As we describe in Section S1 of the Supplementary Materials,
computing θ^{(t)} and η^{(t)} is relatively straightforward; we compute θ^{(t)} with a simple Newton
algorithm and there is a closed form expression for η^{(t)}. The main computational challenge
is computing Ψ^{(t)} and Ω^{(t)}. In the next subsection, we detail how updating Ψ and Ω reduces
to solving penalized likelihood problems with self-adaptive penalties.
3.3 Adaptive penalty mixing
Before describing how we compute Ψ^{(t)} and Ω^{(t)}, we introduce two important functions:

p⋆(x, θ) = θ λ_1 e^{−λ_1 |x|} / [ θ λ_1 e^{−λ_1 |x|} + (1 − θ) λ_0 e^{−λ_0 |x|} ]
q⋆(x, η) = η ξ_1 e^{−ξ_1 |x|} / [ η ξ_1 e^{−ξ_1 |x|} + (1 − η) ξ_0 e^{−ξ_0 |x|} ]

For each 1 ≤ j ≤ p and 1 ≤ k ≤ q, p⋆(ψ_{j,k}, θ) is the conditional posterior probability that ψ_{j,k}
was drawn from the Laplace(λ_1) slab distribution. Similarly, for 1 ≤ k < k′ ≤ q, q⋆(ω_{k,k′}, η)
is just the conditional posterior probability that ω_{k,k′} was drawn from the Laplace(ξ_1) slab.
That is, q⋆(ω_{k,k′}, η) = E[δ_{k,k′} | Y, Ψ, Ω, θ, η].
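These two functions are cheap to evaluate, as the short sketch below illustrates for p⋆ and for the self-adaptive penalty λ⋆ it induces (the function names `p_star` and `lam_star` are ours; the parameter values are arbitrary illustrations).

```python
import numpy as np

def p_star(x, theta, lam0, lam1):
    """Conditional posterior probability that x was drawn from the
    Laplace(lam1) slab rather than the Laplace(lam0) spike."""
    slab = theta * lam1 * np.exp(-lam1 * np.abs(x))
    spike = (1.0 - theta) * lam0 * np.exp(-lam0 * np.abs(x))
    return slab / (slab + spike)

def lam_star(x, theta, lam0, lam1):
    """Self-adaptive penalty: interpolates between the mild slab penalty
    lam1 (for large |x|) and the harsh spike penalty lam0 (for small |x|)."""
    p = p_star(x, theta, lam0, lam1)
    return lam1 * p + lam0 * (1.0 - p)

print(lam_star(3.0, 0.5, lam0=20.0, lam1=1.0))   # close to lam1 = 1
print(lam_star(0.01, 0.5, lam0=20.0, lam1=1.0))  # close to lam0 = 20
```

The analogous computation with (η, ξ_0, ξ_1) gives q⋆ and the adaptive penalty ξ⋆ used in the Ω update.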
Updating Ψ. Fixing the value Ω = Ω^{(t−1)}, computing Ψ^{(t)} is equivalent to solving the
following penalized optimization problem

Ψ^{(t)} = argmax_Ψ { −(1/2) tr[(YΩ − XΨ) Ω^{-1} (YΩ − XΨ)^T] + ∑_{j,k} pen(ψ_{j,k}; θ) },   (8)

where

pen(ψ_{j,k}; θ) = log [ π(ψ_{j,k} | θ) / π(0 | θ) ] = −λ_1 |ψ_{j,k}| + log [ p⋆(ψ_{j,k}, θ) / p⋆(0, θ) ].

Note that the first term in the objective of Equation (8) can be obtained by distributing a
factor of Ω through the quadratic form that appears in the log-likelihood (see Equations (S5)
and (S7) of the Supplementary Materials for details).
Following arguments similar to those in Deshpande et al. (2019), the Karush-Kuhn-Tucker
(KKT) condition for (8) tells us that

ψ^{(t)}_{j,k} = n^{-1} [ |z_{j,k}| − λ⋆(ψ^{(t)}_{j,k}, θ) ]_{+} sign(z_{j,k}),   (9)

where

z_{j,k} = n ψ^{(t)}_{j,k} + X_j^T r_k + ∑_{k′≠k} [ (Ω^{-1})_{k,k′} / (Ω^{-1})_{k,k} ] X_j^T r_{k′}
r_{k′} = (YΩ − XΨ^{(t)})_{k′}
λ⋆(ψ^{(t)}_{j,k}, θ) = λ_1 p⋆(ψ^{(t)}_{j,k}, θ) + λ_0 (1 − p⋆(ψ^{(t)}_{j,k}, θ)).
The KKT conditions suggest a natural coordinate-ascent strategy for computing Ψ^{(t)}: starting
from some initial guess Ψ_0, we cyclically update the entries ψ_{j,k} by soft-thresholding z_{j,k}
at λ⋆_{j,k}. During our cyclical coordinate ascent, whenever the current value of ψ_{j,k} is very large,
the corresponding value of p⋆(ψ_{j,k}, θ) will be close to one, and the threshold λ⋆ will be close
to the slab penalty λ_1. On the other hand, when ψ_{j,k} is very small, the corresponding p⋆ will
be close to zero and the threshold λ⋆ will be close to the spike penalty λ_0. Since λ_1 ≪ λ_0,
we are therefore able to apply a stronger penalty to the smaller entries of Ψ and a weaker
penalty to the larger entries. As our cyclical coordinate ascent proceeds, we iteratively refine
the thresholds λ⋆, thereby adaptively shrinking our estimates of ψ_{j,k}.
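The cyclical coordinate ascent just described can be sketched as follows. This is a minimal illustration of the soft-thresholding update implied by the KKT condition (9) with θ held fixed; the function names are ours and this is not the package's optimized implementation.

```python
import numpy as np

def lam_star(x, theta, lam0, lam1):
    # adaptive penalty: lam1 * p + lam0 * (1 - p), p the slab probability p*(x, theta)
    slab = theta * lam1 * np.exp(-lam1 * abs(x))
    spike = (1.0 - theta) * lam0 * np.exp(-lam0 * abs(x))
    p = slab / (slab + spike)
    return lam1 * p + lam0 * (1.0 - p)

def update_Psi(Psi0, X, Y, Omega, theta, lam0, lam1, n_sweeps=100, tol=1e-8):
    """Cyclical coordinate ascent for the Psi update: soft-threshold z_{j,k}
    at the self-adaptive penalty lam*(psi_{j,k}, theta). A sketch only."""
    n, p = X.shape
    q = Omega.shape[0]
    Oinv = np.linalg.inv(Omega)
    Psi = Psi0.copy()
    R = Y @ Omega - X @ Psi                   # running residual matrix Y.Omega - X.Psi
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            for k in range(q):
                z = n * Psi[j, k] + X[:, j] @ R[:, k]
                for kp in range(q):           # pool information across outcomes
                    if kp != k:
                        z += (Oinv[k, kp] / Oinv[k, k]) * (X[:, j] @ R[:, kp])
                lam = lam_star(Psi[j, k], theta, lam0, lam1)
                new = np.sign(z) * max(abs(z) - lam, 0.0) / n   # Equation (9)
                R[:, k] += X[:, j] * (Psi[j, k] - new)          # keep residual in sync
                max_change = max(max_change, abs(new - Psi[j, k]))
                Psi[j, k] = new
        if max_change < tol:
            break
    return Psi
```

On a toy problem with a sparse true Ψ and Ω = I, the update drives the negligible entries to exactly zero while only mildly shrinking the large ones.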
Before proceeding, we note that the quantity z_{j,k} depends not only on the inner product
between X_j, the jth column of the design matrix, and the partial residual r_k but also
on the inner products between X_j and all other partial residuals r_{k′} for k′ ≠ k. Practically,
this means that in our cyclical coordinate ascent algorithm, our estimate of the direct effect
of predictor X_j on outcome Y_k can depend on how well we have fit all other outcomes Y_{k′}.
Moreover, the entries of Ω^{-1} determine the degree to which ψ_{j,k} depends on the outcomes Y_{k′}
for k′ ≠ k. Specifically, if (Ω^{-1})_{k,k′} = 0, then we are unable to leverage information contained
in Y_{k′} to inform our estimate of ψ_{j,k}.
Updating Ω. Fixing Ψ = Ψ^{(t)} and letting S = n^{-1} Y^T Y and M = n^{-1} (XΨ)^T XΨ, we can
compute Ω^{(t)} by solving

Ω^{(t)} = argmax_{Ω ≻ 0} { (n/2) log|Ω| − tr(SΩ) − tr(MΩ^{-1}) − ∑_{k=1}^{q} [ ξ ω_{k,k} + ∑_{k′>k} ξ⋆_{k,k′} |ω_{k,k′}| ] },   (10)

where ξ⋆_{k,k′} = ξ_1 q⋆(ω^{(t−1)}_{k,k′}, η^{(t−1)}) + ξ_0 (1 − q⋆(ω^{(t−1)}_{k,k′}, η^{(t−1)})).
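To make the criterion concrete, the sketch below evaluates the objective in Equation (10) for a candidate Ω. The function and argument names are ours (`Xi_star` holds the adaptive off-diagonal penalties ξ⋆_{k,k′}); an actual solver, such as the modified QUIC described below, maximizes this quantity rather than merely evaluating it.

```python
import numpy as np

def omega_objective(Omega, S, M, n, xi, Xi_star):
    """Objective of Equation (10) for a candidate precision matrix Omega.
    Xi_star is a q x q matrix of adaptive penalties (diagonal ignored)."""
    try:
        L = np.linalg.cholesky(Omega)     # enforces the PD constraint
    except np.linalg.LinAlgError:
        return -np.inf
    logdet = 2.0 * np.log(np.diag(L)).sum()
    iu = np.triu_indices(Omega.shape[0], k=1)
    pen = xi * np.trace(Omega) + (Xi_star[iu] * np.abs(Omega[iu])).sum()
    return (0.5 * n * logdet - np.trace(S @ Omega)
            - np.trace(M @ np.linalg.inv(Omega)) - pen)
```

Comparing this function with and without the tr(MΩ^{-1}) term shows exactly how the chain graph objective departs from the standard GLASSO criterion.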
The objective in Equation (10) is extremely similar to the conventional graphical LASSO
(GLASSO; Friedman et al., 2008) objective. However, there are two crucial differences. First,
because the conditional mean of Y depends on Ω in the Gaussian chain graph model (1), we
have an additional term tr(MΩ^{-1}) that is absent in the GLASSO objective. Second, and
more substantively, the objective in Equation (10) contains individualized penalties ξ⋆_{k,k′} on
the off-diagonal entries of Ω. Here, the penalty ξ⋆_{k,k′} will be large (resp. small) whenever
the previous estimate ω^{(t−1)}_{k,k′} is small (resp. large). In other words, as we run our ECM
algorithm, we can refine the amount of penalization applied to each off-diagonal entry in Ω.
Although the objective in Equation (10) is somewhat different from the GLASSO objective,
we can solve it by suitably modifying an existing GLASSO algorithm. Specifically, we
solve the optimization problem in Equation (10) with a modified version of Hsieh et al.
(2011)’s QUIC algorithm. Our solver repeatedly (i) forms a quadratic approximation
of the objective, (ii) computes a suitable Newton direction, and (iii) follows that Newton
direction for a step size chosen with an Armijo rule. In Section S2.4 of the Supplementary
Materials, we show that the optimization problem in Equation (10) has a unique solution
and that our modification of QUIC converges to that unique solution.
3.4 Selecting the spike and slab penalties
The proposed ECM algorithm depends on two sets of hyperparameters. The first set, containing
a_θ, b_θ, a_η, and b_η, encodes our initial beliefs about the overall proportion of non-negligible
entries in Ψ and Ω. We set a_θ = 1, b_θ = pq, a_η = 1, and b_η = q, similar to Deshpande et al.
(2019). The second set of hyperparameters consists of the spike and slab penalties λ_0, λ_1, ξ_0,
and ξ_1. Rather than run the cgSSL with a single set of these penalties, we use Deshpande et al.
(2019)’s path-following dynamic posterior exploration (DPE) strategy to obtain the MAP
estimates corresponding to several different choices of spike penalties.
Specifically, we fix the slab penalties λ_1 and ξ_1 and specify grids of increasing spike penalties
I_λ = {λ_0^{(1)} < ··· < λ_0^{(L)}} and I_ξ = {ξ_0^{(1)} < ··· < ξ_0^{(L)}}. We then run the cgSSL with warm starts for
each combination of spike penalties, yielding a set of posterior modes {(Ψ^{(s,t)}, θ^{(s,t)}, Ω^{(s,t)}, η^{(s,t)})}
indexed by the choices (λ_0^{(s)}, ξ_0^{(t)}). To warm start the estimation of the mode corresponding to
(λ_0^{(s)}, ξ_0^{(t)}), we first compute the modes found with (λ_0^{(s−1)}, ξ_0^{(t−1)}), (λ_0^{(s)}, ξ_0^{(t−1)}), and (λ_0^{(s−1)}, ξ_0^{(t)}).
We evaluate the posterior density using (λ_0, ξ_0) = (λ_0^{(s)}, ξ_0^{(t)}) at each of the three previously
computed modes and initialize at the mode with the largest density.
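The bookkeeping behind this warm-start scheme can be sketched structurally as below. Here `fit_ecm` and `log_post` are hypothetical stand-ins of ours for the ECM run and the posterior-density evaluation; they are not the mSSL API.

```python
def dpe(lam0_grid, xi0_grid, lam1, xi1, fit_ecm, log_post, init):
    """Dynamic posterior exploration over a grid of spike penalties,
    warm-starting each fit from the best of the three previously
    visited neighboring modes. A structural sketch only."""
    modes = {}
    for s, lam0 in enumerate(lam0_grid):
        for t, xi0 in enumerate(xi0_grid):
            # candidate warm starts: the three already-computed neighbors
            cands = [modes[key] for key in [(s - 1, t - 1), (s, t - 1), (s - 1, t)]
                     if key in modes] or [init]
            # initialize at the candidate with the largest posterior density
            start = max(cands, key=lambda m: log_post(m, lam0, xi0, lam1, xi1))
            modes[(s, t)] = fit_ecm(start, lam0, xi0, lam1, xi1)
    return modes
```

The returned dictionary holds one mode per pair of spike penalties, which is exactly the "snapshot" of posteriors that the DPE strategy provides.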
Following this DPE strategy provides a snapshot of the many different cgSSL posteriors.
However, it can be computationally intensive, as we must run our ECM algorithm to convergence
for every pair of spike penalties. Deshpande et al. (2019) introduced a faster variant,
called dynamic conditional posterior exploration (DCPE), which we also implemented for
the cgSSL. In DCPE, we first run our ECM algorithm with warm starts over the ladder I_λ
while keeping Ω = I fixed. Then, fixing (Ψ, θ) at the final value from the first step, we run
our ECM algorithm with warm starts over the ladder I_ξ. Finally, we run our ECM algorithm
starting from the final estimates of the parameters obtained in the first two steps with
(λ_0, ξ_0) = (λ_0^{(L)}, ξ_0^{(L)}). Generally speaking, DPE and DCPE trace different paths through the
parameter space and typically return different final estimates.
When the spike and slab penalties are similar in size (i.e., λ_1 ≈ λ_0 and ξ_1 ≈ ξ_0), we noticed that
our ECM algorithm would sometimes return very dense estimates of Ψ and diagonal estimates
of Ω with very large diagonal entries. Essentially, when the spike and slab distributions are
not too different, our ECM algorithm has a tendency to overfit the response with a dense
Ψ, leaving very little residual variation to be quantified with Ω. On further investigation, we
found that we could detect such pathological behavior by examining the condition number
of the matrix YΩ − XΨ. To avoid propagating dense Ψ’s and diagonal Ω’s through the
DPE and DCPE, we terminate our ECM algorithm early whenever the condition number of YΩ − XΨ
exceeds 10n. We then set the corresponding Ψ^{(s)} = 0 and Ω^{(t)} = I and continue the dynamic
exploration from that point. While this is an admittedly ad hoc heuristic, we have found that
it works well in practice and note that Moran et al. (2019) utilized a similar strategy in the
single-outcome high-dimensional linear regression setting with unknown variance.
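The early-termination check itself is a one-liner; the sketch below is our illustration of the heuristic (the function name is ours), not the package's internal code.

```python
import numpy as np

def should_terminate(Y, X, Psi, Omega, n):
    """Early-termination heuristic: stop the ECM run when the condition
    number of Y.Omega - X.Psi exceeds 10n, signalling a pathological
    dense fit of Psi with a near-degenerate residual matrix."""
    cond = np.linalg.cond(Y @ Omega - X @ Psi)
    return bool(cond > 10 * n)
```

A well-conditioned residual matrix passes the check, while a (numerically) rank-deficient one, as produced by an overfit dense Ψ, triggers termination.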
The DPE and DCPE cgSSL procedures are implemented in the mSSL R package (R Core Team,
2022), which is available at https://github.com/YunyiShen/mSSL. Note that this
package also contains a new implementation of Deshpande et al. (2019)’s mSSL procedure.
4 Asymptotic theory of cgSSL
If the Gaussian chain graph model in Equation (1) is well-specified – that is, if our data
(x_i, y_i) are truly generated according to the model – will the posterior distribution of Ψ
and Ω collapse to a point mass at the true data generating parameters as n → ∞? Such
a collapse would, among other things, imply that the MAP estimate returned by the cgSSL
procedure described in Section 3 is consistent, providing an asymptotic justification for its
use. In this section, we answer the question affirmatively: under some mild assumptions
and with some slight modifications, the cgSSL posterior concentrates around the truth. We
further establish the rate of concentration, which quantifies the speed at which the posterior
distribution shrinks to the true data generating parameters. We begin by briefly reviewing
our general proof strategy before precisely stating our assumptions and results. Proofs of
our main results are available in Section S5 of the Supplementary Materials.
4.1 Proof strategy
To establish the posterior concentration rate for Ψ and Ω, we followed Ning et al. (2020)
and Bai et al. (2020) and first showed that the posterior concentrates in log-affinity (see
Section S5.3 in the Supplementary Materials for details). Posterior concentration of the
individual parameters followed as a consequence. To show that the posterior concentrates
in log-affinity, we appealed to general results about posterior concentration for independent
but non-identically distributed observations. Specifically, we verified the three conditions
of Theorem 8.23 of Ghosal and van der Vaart (2017). First, we confirmed that the cgSSL
prior introduced in Section 3.1 places enough prior probability mass in small neighborhoods
around every possible choice of (Ψ, Ω). This was done by verifying that, for each (Ψ, Ω),
the prior probability contained in a small Kullback-Leibler ball around (Ψ, Ω) can be lower
bounded by a function of the ball’s radius (the so-called “KL-condition” in Lemma S2 of the
Supplementary Materials). Then we studied a sequence of likelihood ratio tests defined on
sieves of the parameter space that can correctly distinguish between parameter values that
are sufficiently far away from each other in log-affinity. In particular, we bounded the error
rate of such tests and then bounded the covering number of the sieves (Lemma S4 of the
Supplementary Materials).
Ning et al. (2020) studied the sparse marginal regression model in Equation (2) instead of the
sparse chain graph. Although these are somewhat different models, our overall proof strategy
is quite similar to theirs. However, we pause here to highlight some important technical
differences. First, Ning et al. (2020) placed a prior on Ω's eigendecomposition while we
placed an arguably simpler and more natural element-wise prior on Ω. The second and more
substantive difference is in how we bound the covering number of sieves of the underlying
parameter space. Because Ning et al. (2020) specified exactly sparse priors on the elements
of B = ΨΩ^{-1}, it was enough for them to carefully bound the covering number of exactly
low-dimensional sets of the form A × {0}^r, where A is some subset of a multi-dimensional
Euclidean space and r > 0 is a positive integer. In contrast, because we specified absolutely
continuous priors on the elements of Ψ, we had to cover "effectively low-dimensional" sets
of the form A × [−δ, δ]^r for small δ > 0. Our key lemma (Lemma S4 in the Supplementary
Materials) provides sufficient conditions on δ for bounding the ε-packing number of such
effectively low-dimensional sets using the ε′-packing number of A for a carefully chosen
ε′ > 0.
4.2 Contraction of cgSSL
In order to establish our posterior concentration results, we first assume that the data
(x_1, y_1), ..., (x_n, y_n) were generated according to a Gaussian chain graph model with true
parameters Ψ_0 and Ω_0. We need to make additional assumptions about the spectra of Ψ_0 and
Ω_0 and on the dimensions n, p, and q.

A1 Ψ_0 and Ω_0 have bounded operator norm: that is, Ψ_0 ∈ T_0 = {Ψ : |||Ψ|||_2 < a_1} and
Ω_0 ∈ H_0 = {Ω : |||Ω|||_2 ∈ [1/b_2, 1/b_1]}, where ||| · |||_2 is the operator norm and a_1, b_1, b_2 > 0
are fixed positive constants.

A2 Dimensionality: We assume that log(n) ≲ log(q); log(n) ≲ log(p); and

max{p, q, s_0^Ω, s_0^Ψ} log(max{p, q})/n → 0,

where s_0^Ω and s_0^Ψ are the numbers of non-zero free parameters in Ω_0 and Ψ_0, respectively;
and a_n ≲ b_n means that, for sufficiently large n, there exists a constant C independent of n
such that a_n ≤ C b_n.

A3 Tuning the Ψ prior: We assume that (1 − θ)/θ ∼ (pq)^{2+a_0} for some a_0 > 0; λ_0 ∼
max{n, pq}^{2+b_0} for some b_0 > 1/2; and λ_1 ≍ 1/n.

A4 Tuning the Ω prior: We assume that (1 − η)/η ∼ max{Q, pq}^{2+a} for some a > 0,
where Q = q(q − 1)/2; ξ_0 ∼ max{Q, pq, n}^{4+b} for some b > 0; ξ_1 ≍ 1/n; and ξ_1 ≲
1/max{Q, n}.
Before stating our main result, we pause to highlight two key differences between the above
assumptions and the model introduced in Section 3.1. Although the prior in Section 3.1 restricts
Ω to the positive-definite cone, Assumption A1 is slightly stronger as it bounds the smallest
eigenvalue of Ω away from zero. The stronger assumption ensures that the entries of XΨΩ^{-1}
do not diverge in our theoretical analysis. We additionally restricted our theoretical analysis
to the setting where the proportions of non-negligible parameters, θ and η, are fixed and
known (Assumptions A3 and A4). We note that Ročková and George (2018) and Gan et al. (2019a)
make similar assumptions in their theoretical analyses.
Theorem 1 (Posterior contraction of cgSSL). Under Assumptions A1–A4, there is a constant
M_1 > 0, which does not depend on n, such that

sup_{Ψ_0 ∈ T_0, Ω_0 ∈ H_0} E_0 Π( Ψ : ||X(ΨΩ^{-1} − Ψ_0Ω_0^{-1})||_F^2 ≥ M_1 n ε_n^2 | Y_1, ..., Y_n ) → 0,   (11)

sup_{Ψ_0 ∈ T_0, Ω_0 ∈ H_0} E_0 Π( Ω : ||Ω − Ω_0||_F^2 ≥ M_1 ε_n^2 | Y_1, ..., Y_n ) → 0,   (12)

where ε_n = √( max{p, q, s_0^Ω, s_0^Ψ} log(max{p, q})/n ). Note that ε_n → 0 as n → ∞.
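To get a feel for the rate in Theorem 1, the sketch below computes ε_n for the three simulation settings of Section 5. The sparsity levels `s_omega` and `s_psi` are hypothetical placeholders chosen for illustration, not values taken from the paper.

```python
import math

def contraction_rate(n, p, q, s_omega, s_psi):
    """eps_n = sqrt(max{p, q, s_Omega_0, s_Psi_0} * log(max{p, q}) / n),
    the posterior contraction rate from Theorem 1."""
    s_star = max(p, q, s_omega, s_psi)
    return math.sqrt(s_star * math.log(max(p, q)) / n)

# (n, p, q) triples from Section 5.1; the sparsity levels are hypothetical.
for n, p, q in [(100, 10, 10), (100, 20, 30), (400, 100, 30)]:
    print((n, p, q), round(contraction_rate(n, p, q, s_omega=2 * q, s_psi=p * q // 5), 3))
```

As Assumption A2 requires, the rate shrinks as n grows with the other quantities held fixed.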
A key step in proving Theorem 1 is Lemma 1, which shows that the cgSSL posterior does
not place too much probability on Ψ's and Ω's with too many large entries. In order to
state this lemma, we denote the effective dimensions of Ψ and Ω by |ν(Ψ)| and |ν(Ω)|. The
effective dimension of Ψ (resp. Ω) counts the number of entries (resp. off-diagonal entries in
the lower triangle) whose absolute value exceeds the intersection point of the spike and slab
prior densities.

Lemma 1 (Dimension recovery of cgSSL). For a sufficiently large constant C′_3 > 0, we have

sup_{Ψ_0 ∈ T_0, Ω_0 ∈ H_0} E_0 Π( Ψ : |ν(Ψ)| > C′_3 s⋆ | Y_1, ..., Y_n ) → 0,   (13)

sup_{Ψ_0 ∈ T_0, Ω_0 ∈ H_0} E_0 Π( Ω : |ν(Ω)| > C′_3 s⋆ | Y_1, ..., Y_n ) → 0,   (14)

where s⋆ = max{p, q, s_0^Ω, s_0^Ψ}.
Lemma 1 essentially guarantees that the cgSSL posterior does not grossly overestimate the
number of predictor-response and response-response edges in the underlying graphical model.
Note that the result in Equation (11) shows that the vector containing the n evaluations
of the regression function (i.e., the vector XΨΩ^{-1}) converges to the vector containing the
evaluations of the true regression function Ω_0^{-1}Ψ_0^⊤ x. Importantly, apart from Assumption A2
about the dimensions of X, we did not make any additional assumptions about the design
matrix. The contraction rates for Ψ and ΨΩ^{-1}, however, depend critically on X. To state
these results, define the restricted eigenvalue

φ²(s) = inf { ||XA||_F^2 / (n ||A||_F^2) : A ∈ R^{p×q}, 0 ≤ |ν(A)| ≤ s }.
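The restricted eigenvalue φ²(s) is an infimum over all matrices with effective dimension at most s and is not directly computable in general. The sketch below is an illustrative Monte Carlo upper bound (not the authors' procedure): minimizing the ratio over random sparse matrices shows how the quantity reflects the conditioning of the design.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi2_upper_bound(X, s, q=None, n_draws=2000):
    """Monte Carlo upper bound on phi^2(s) = inf ||XA||_F^2 / (n ||A||_F^2),
    where the infimum runs over p x q matrices A with at most s non-zero
    entries; here we simply minimize over random sparse draws of A."""
    n, p = X.shape
    q = q or p
    best = np.inf
    for _ in range(n_draws):
        A = np.zeros((p, q))
        k = int(rng.integers(1, s + 1))
        idx = rng.choice(p * q, size=k, replace=False)
        A.flat[idx] = rng.standard_normal(k)
        ratio = np.linalg.norm(X @ A) ** 2 / (n * np.linalg.norm(A) ** 2)
        best = min(best, ratio)
    return best

X = rng.standard_normal((100, 10))  # well-conditioned Gaussian design
print(phi2_upper_bound(X, s=5))     # bounded away from zero for this design
```

For an ill-conditioned design (e.g., nearly collinear columns of X), the same estimate collapses toward zero, which is the regime where the rates in Corollary 1 deteriorate.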
Corollary 1 (Recovery of regression coefficients in cgSSL). Under Assumptions A1–A4,
there is a constant M′ > 0, which does not depend on n, such that

sup_{Ψ_0 ∈ T_0, Ω_0 ∈ H_0} E_0 Π( ||ΨΩ^{-1} − Ψ_0Ω_0^{-1}||_F^2 ≥ M′ ε_n^2 / φ²(s_0^Ψ + C′_3 s⋆) | Y_1, ..., Y_n ) → 0,   (15)

sup_{Ψ_0 ∈ T_0, Ω_0 ∈ H_0} E_0 Π( ||Ψ − Ψ_0||_F^2 ≥ M′ ε_n^2 / min{φ²(s_0^Ψ + C′_3 s⋆), 1} | Y_1, ..., Y_n ) → 0.   (16)
Corollary 1 shows that the posterior distribution of ΨΩ^{-1} can contract at a faster or slower
rate than the posterior distributions of XΨΩ^{-1} and Ω, depending on the design matrix. In
particular, when X is poorly conditioned, we might expect the rate to be slower. In contrast,
the term min{φ²(s_0^Ψ + C′_3 s⋆), 1} appearing in the denominator of the rate in Equation (16)
implies that the posterior distribution of Ψ cannot concentrate at a faster rate than the
posterior distributions of ΨΩ^{-1} and Ω, regardless of the design matrix. To develop some
intuition about this phenomenon, notice that we can decompose the difference Ψ − Ψ_0 as

Ψ − Ψ_0 = (ΨΩ^{-1} − Ψ_0Ω_0^{-1})Ω + (Ψ_0Ω_0^{-1}(Ω − Ω_0)Ω^{-1})Ω.

Roughly speaking, the decomposition suggests that in order to estimate Ψ well, we must
be able to estimate both Ω and ΨΩ^{-1} well. In other words, estimating Ψ is at least as
hard, statistically, as estimating Ω and ΨΩ^{-1}. Taken together, the two results in Corollary 1
suggest that while a carefully constructed design matrix can improve estimation of the matrix
of marginal effects, B = ΨΩ^{-1}, it cannot generally improve estimation of the matrix of direct
effects Ψ.
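The decomposition above is a purely algebraic identity and can be checked numerically; the sketch below verifies it on randomly generated parameters (the dimensions are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 4, 3

def random_pd(q):
    """A random symmetric positive-definite q x q matrix."""
    M = rng.standard_normal((q, q))
    return M @ M.T + q * np.eye(q)

Psi, Psi0 = rng.standard_normal((p, q)), rng.standard_normal((p, q))
Omega, Omega0 = random_pd(q), random_pd(q)

# Marginal-effect matrices B = Psi Omega^{-1} and B0 = Psi0 Omega0^{-1}.
B = Psi @ np.linalg.inv(Omega)
B0 = Psi0 @ np.linalg.inv(Omega0)

# Psi - Psi0 = (B - B0) Omega + (B0 (Omega - Omega0) Omega^{-1}) Omega
lhs = Psi - Psi0
rhs = (B - B0) @ Omega + (B0 @ (Omega - Omega0) @ np.linalg.inv(Omega)) @ Omega
print(np.allclose(lhs, rhs))  # True
```

Expanding the right-hand side gives BΩ − B_0Ω + B_0Ω − B_0Ω_0 = Ψ − Ψ_0, which is exactly the left-hand side.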
5 Synthetic experiments
We performed a simulation study to assess how well our two implementations of cgSSL
(cgSSL-DPE and cgSSL-DCPE) (i) recover the supports of Ψ and Ω and (ii) estimate each
matrix. We compared both implementations of cgSSL to several competitors: a fixed-penalty
method (cgLASSO), which deploys a single penalty λ for the entries in Ψ and a single fixed
penalty ξ for the entries in Ω; Shen and Solis-Lemus (2021)'s CAR-LASSO procedure (CAR),
which places Laplace priors on the entries of Ψ and Ω and a Gamma prior on the overall
shrinkage strength; and Shen and Solis-Lemus (2021)'s adaptive CAR-LASSO (CAR-A), which
places individualized Laplace priors on the free parameters of Ψ and Ω. Note that cgSSL and
cgLASSO perform optimization while CAR and CAR-A run MCMC. Further, we selected the
penalties in cgLASSO with 10-fold cross-validation. Additionally, cgLASSO and CAR apply
the same amount of shrinkage to every element of Ψ and the same amount of shrinkage to
every element of Ω. CAR-A, on the other hand, applies individualized shrinkage.

We simulated several synthetic datasets of various dimensions and with different sparsity
patterns in Ω (Figure 2). Across all of these choices of dimension and Ω, we found that
cgSSL-DPE achieved somewhat lower sensitivity but much higher precision in estimating the
supports of both Ψ and Ω than the competing methods. Taken together, these findings
suggest that while cgSSL-DPE tended to return fewer non-zero parameter estimates than the
other methods, we can be much more confident that those parameters are truly non-zero. Put
another way, although the other methods can recover more of the truly non-zero signal, they
do so at the expense of making many more false positive identifications in the supports of Ψ
and Ω than cgSSL-DPE.
5.1 Simulation design
We simulated data with three different choices of dimensions: (n, p, q) = (100, 10, 10), (100, 20, 30),
and (400, 100, 30). For each choice of (n, p, q), we considered five different choices of Ω: (i)
an AR(1) model for Ω^{-1} so that Ω is tri-diagonal; (ii) an AR(2) model for Ω^{-1} so that
ω_{k,k′} = 0 whenever |k − k′| > 2; (iii) a block model in which Ω is block-diagonal with two
dense q/2 × q/2 diagonal blocks; (iv) a star graph in which the off-diagonal entry ω_{k,k′} = 0
unless k or k′ is equal to 1; and (v) a dense model with all off-diagonal elements ω_{k,k′} = 2.

In the AR(1) model, we set (Ω^{-1})_{k,k′} = 0.7^{|k−k′|} so that ω_{k,k′} = 0 whenever |k − k′| > 1. In
the AR(2) model, we set ω_{k,k} = 1, ω_{k−1,k} = ω_{k,k−1} = 0.5, and ω_{k−2,k} = ω_{k,k−2} = 0.25. For
the block model, we partitioned Σ = Ω^{-1} into four q/2 × q/2 blocks and set all entries in the
off-diagonal blocks of Σ to zero. We then set σ_{k,k} = 1 and σ_{k,k′} = 0.5 for 1 ≤ k ≠ k′ ≤ q/2
and for q/2 + 1 ≤ k ≠ k′ ≤ q. For the star graph, we set ω_{k,k} = 1, ω_{1,k} = ω_{k,1} = 0.1 for each
k = 2, ..., q, and set the remaining off-diagonal elements of Ω equal to zero.
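The five precision matrices can be constructed directly from these descriptions. The sketch below follows the formulas above; for the dense model, the diagonal value is not specified in this section, so a diagonally dominant placeholder is used to keep Ω positive definite.

```python
import numpy as np

def make_omega(q, kind):
    """Precision matrices for the five Section 5.1 scenarios."""
    d = np.abs(np.subtract.outer(np.arange(q), np.arange(q)))
    if kind == "ar1":
        # (Omega^{-1})_{k,k'} = 0.7^{|k-k'|}; its inverse is tri-diagonal.
        return np.linalg.inv(0.7 ** d)
    if kind == "ar2":
        # Diagonal 1, first off-diagonal 0.5, second off-diagonal 0.25.
        return (d == 0) * 1.0 + (d == 1) * 0.5 + (d == 2) * 0.25
    if kind == "block":
        # Sigma has two dense q/2 x q/2 blocks: diagonal 1, within-block 0.5.
        h = q // 2
        Sigma = np.eye(q)
        Sigma[:h, :h] = Sigma[h:, h:] = 0.5
        np.fill_diagonal(Sigma, 1.0)
        return np.linalg.inv(Sigma)
    if kind == "star":
        # Node 1 is a hub: omega_{1,k} = 0.1 for k = 2, ..., q.
        Om = np.eye(q)
        Om[0, 1:] = Om[1:, 0] = 0.1
        return Om
    if kind == "dense":
        # All off-diagonals equal 2; the diagonal value 2q is a hypothetical
        # choice (not given in this excerpt) guaranteeing positive definiteness.
        return np.full((q, q), 2.0) - 2.0 * np.eye(q) + 2.0 * q * np.eye(q)
    raise ValueError(kind)

Om = make_omega(10, "ar1")
print(np.max(np.abs(np.triu(Om, k=2))))  # ~0: tri-diagonal up to round-off
```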
These five specifications of Ω (top row of Figure 2) correspond to rather different underlying
graphical structures among the response variables (bottom row of Figure 2). The AR(1)
model, for instance, represents an extremely sparse but regular structure while the AR(2)
model is somewhat less sparse. While the star model and AR(1) model contain the same
number of edges, the underlying graphs have markedly different degree distributions. Compared
to the AR(1), AR(2), and star models, the block model is considerably denser. We
included the dense model, which corresponds to a fully connected graph, to assess how well all
of the methods perform in a misspecified regime.

In total, we considered 15 combinations of dimensions (n, p, q) and Ω. For each combination,
we generated Ψ by randomly selecting 20% of its entries to be non-zero. We drew the non-zero
entries from a U(−2, 2) distribution. For each combination of (n, p, q), Ω, and Ψ,
we generated 100 synthetic datasets from the Gaussian chain graph model (1). The entries
of the design matrix X were independently drawn from a standard N(0, 1) distribution.
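Putting these pieces together, one synthetic dataset can be drawn as follows. This is an illustrative sketch of the generative model (1) as described above, with Ω set to the identity as a placeholder for any of the five specifications.

```python
import numpy as np

def simulate_chain_graph(n, Psi, Omega, rng):
    """Draw (X, Y) from the Gaussian chain graph model:
    y_i | x_i ~ N(Omega^{-1} Psi^T x_i, Omega^{-1})."""
    p, q = Psi.shape
    X = rng.standard_normal((n, p))            # standard N(0, 1) design entries
    Sigma = np.linalg.inv(Omega)
    mean = X @ Psi @ Sigma                     # n x q matrix of conditional means
    L = np.linalg.cholesky(Sigma)
    noise = rng.standard_normal((n, q)) @ L.T  # rows have covariance Sigma
    return X, mean + noise

rng = np.random.default_rng(2)
p, q = 10, 10
Psi = np.zeros((p, q))
idx = rng.choice(p * q, size=p * q // 5, replace=False)  # 20% non-zero entries
Psi.flat[idx] = rng.uniform(-2, 2, size=idx.size)
Omega = np.eye(q)  # placeholder; any of the five precision matrices works here
X, Y = simulate_chain_graph(100, Psi, Omega, rng)
print(X.shape, Y.shape)  # (100, 10) (100, 10)
```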
Figure 2: Visualization of the supports of Ω for q = 10 under each of the five specifications
(top) and the corresponding graphs (bottom). In the top row, gray cells indicate non-zero
entries in Ω and white cells indicate zeros.
5.2 Results
To assess estimation performance, we computed the Frobenius norm of the difference between
the estimated matrices and the true data-generating matrices. To assess support recovery
performance, we counted the number of elements in each of Ψ and Ω that were (i) correctly
estimated as non-zero (true positives; TP); (ii) correctly estimated as zero (true negatives;
TN); (iii) incorrectly estimated as non-zero (false positives; FP); and (iv) incorrectly estimated
as zero (false negatives; FN). We report the sensitivity (TP/(TP + FN)) and precision
(TP/(TP + FP)). Generally speaking, we prefer methods with high sensitivity and high
precision. High sensitivity indicates that the method has correctly estimated most of the true
non-zero parameters as non-zero. High precision, on the other hand, indicates that most of the
estimated non-zero parameters are truly non-zero. For brevity, we only report the average
sensitivity, precision, and Frobenius errors for the (n, p, q) = (100, 10, 10) setting in Table 1.
We observed qualitatively similar results for the other two settings of dimension and report
average performance in those settings in Tables S2–S3 of the Supplementary Materials.
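These support-recovery summaries are straightforward to compute from an estimated matrix and the corresponding truth; a minimal sketch:

```python
import numpy as np

def support_metrics(est, truth):
    """Sensitivity TP/(TP+FN) and precision TP/(TP+FP) of the estimated
    support (non-zero pattern) relative to the true support."""
    e, t = est != 0, truth != 0
    tp = int(np.sum(e & t))    # correctly estimated as non-zero
    fp = int(np.sum(e & ~t))   # incorrectly estimated as non-zero
    fn = int(np.sum(~e & t))   # incorrectly estimated as zero
    sen = tp / (tp + fn) if tp + fn else float("nan")
    prec = tp / (tp + fp) if tp + fp else float("nan")
    return sen, prec

truth = np.array([[1.0, 0.0], [0.0, 2.0]])
est = np.array([[0.9, 0.1], [0.0, 1.8]])
print(support_metrics(est, truth))  # (1.0, 0.6666666666666666)
```

Here the estimate recovers both true signals (sensitivity 1.0) but includes one false positive, so only two of its three non-zero entries are correct (precision 2/3).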
Table 1: Average (sd) sensitivity, precision, and Frobenius error for Ψ and Ω when (n, p, q) =
(100, 10, 10) for each specification of Ω across 100 simulated datasets. For each choice of Ω,
the best performance is bold-faced.
Ψ recovery Ω recovery
Method SEN PREC FROB SEN PREC FROB
AR(1) model
cgLASSO 0.88 (0.08) 0.44 (0.15) 0.13 (0.16) 0.78 (0.37) 0.55 (0.31) 31.93 (22.08)
CAR 0.86 (0.06) 0.31 (0.03) 0.04 (0.01) 1 (0) 0.3 (0.03) 4.16 (1.18)
CAR-A 0.87 (0.06) 0.59 (0.07) 0.02 (0.01) 1 (0) 0.83 (0.1) 2.75 (1.59)
cgSSL-dcpe 0.64 (0.05) 0.8 (0.16) 0.08 (0.05) 0.94 (0.11) 0.96 (0.07) 6.32 (6.64)
cgSSL-dpe 0.65 (0.05) 0.99 (0.03) 0.04 (0.01) 1 (0) 0.97 (0.05) 2.49 (1.12)
AR(2) model
cgLASSO 1 (0.02) 0.22 (0.06) 0.17 (0.09) 0.84 (0.29) 0.55 (0.17) 2.7 (1.66)
CAR 0.9 (0.06) 0.34 (0.04) 0.03 (0.01) 0.98 (0.03) 0.57 (0.06) 0.58 (0.21)
CAR-A 0.89 (0.05) 0.67 (0.08) 0.02 (0.01) 1 (0.02) 0.91 (0.06) 0.46 (0.32)
cgSSL-dcpe 0.96 (0.06) 0.43 (0.12) 0.45 (0.28) 0.24 (0.3) 0.63 (0.14) 5 (0.98)
cgSSL-dpe 0.73 (0.05) 1 (0.01) 0.02 (0.01) 1 (0) 0.86 (0.06) 0.38 (0.21)
Block model
cgLASSO 0.95 (0.05) 0.39 (0.18) 0.13 (0.11) 0.73 (0.38) 0.78 (0.21) 5.15 (2.27)
CAR 0.89 (0.06) 0.31 (0.03) 0.03 (0.01) 0.95 (0.02) 0.61 (0.06) 1.89 (0.75)
CAR-A 0.87 (0.06) 0.57 (0.07) 0.03 (0.01) 0.86 (0.07) 0.93 (0.05) 2.97 (1.22)
cgSSL-dcpe 0.76 (0.06) 0.29 (0.02) 0.28 (0.02) 0.01 (0.03) 0.71 (0.39) 8.85 (0.2)
cgSSL-dpe 0.69 (0.07) 0.99 (0.02) 0.03 (0.01) 0.71 (0.06) 0.95 (0.05) 3.28 (1.17)
Star model
cgLASSO 0.96 (0.04) 0.48 (0.14) 0.04 (0.02) 0.36 (0.41) 0.2 (0.18) 0.86 (0.35)
CAR 0.91 (0.05) 0.34 (0.03) 0.02 (0) 0.55 (0.18) 0.25 (0.08) 0.57 (0.29)
CAR-A 0.91 (0.04) 0.57 (0.06) 0.02 (0.01) 0.22 (0.14) 0.46 (0.24) 0.57 (0.26)
cgSSL-dcpe 0.83 (0.04) 0.96 (0.05) 0.01 (0) 0.05 (0.09) 0.9 (0.24) 0.22 (0.12)
cgSSL-dpe 0.79 (0.06) 0.99 (0.03) 0.01 (0.01) 0.09 (0.13) 0.71 (0.29) 0.29 (0.19)
Dense model
cgLASSO 0.92 (0.04) 0.57 (0.07) 0.03 (0.01) 0.88 (0.32) 1 (0) 16.93 (32.74)
CAR 0.85 (0.06) 0.28 (0.03) 0.04 (0.01) 0.03 (0.02) 1 (0) 92.51 (1.74)
CAR-A 0.84 (0.06) 0.4 (0.04) 0.04 (0.01) 0 (0.01) 1 (0) 96.04 (1.21)
cgSSL-dcpe 0.82 (0.03) 0.84 (0.06) 0.02 (0) 0.01 (0.02) 1 (0) 99.93 (0.39)
cgSSL-dpe 0.72 (0.07) 0.93 (0.06) 0.03 (0.01) 0.05 (0.04) 1 (0) 99.99 (0.98)
In terms of identifying non-zero direct effects (i.e., estimating the support of Ψ), cgLASSO
consistently achieved the highest sensitivity. On further inspection, we found that the penalties
selected by 10-fold cross-validation tended to be quite small, meaning that cgLASSO
returned many non-zero ψ̂_{j,k}'s. As the precision results indicate, many of cgLASSO's "discoveries"
were in fact false positives. The other fixed-penalty method, CAR, similarly displayed
somewhat high sensitivity and low precision. Interestingly, for several choices of Ω, the
precisions of cgLASSO and CAR for recovering the support of Ψ were less than 0.5. Such low
precisions indicate that most of the returned non-zero estimates were in fact false positives.
In contrast, the methods that deployed adaptive penalties (CAR-A and both implementations
of cgSSL) displayed higher precision in estimating the support of Ψ. In fact, at least for
estimating the support of Ψ, cgSSL-DPE made almost no false positive identifications.
We observed essentially the same phenomenon for Ω: although the cgSSL generally returned
fewer non-zero estimates of ω_{k,k′}, the vast majority of these estimates were true positives. In
a sense, the fixed-penalty methods (cgLASSO and CAR) cast a very wide net when searching
for non-zero signal in Ψ and Ω, leading to a large number of false positive identifications in the
supports of these matrices. The adaptive-penalty methods, on the other hand, were much more
discerning.

In terms of estimation performance, we found that the fixed-penalty methods (cgLASSO and
CAR) tended to have much larger Frobenius error, reflecting the well-documented bias
introduced by L1 regularization. The one exception was in the misspecified setting where Ω
was dense. Interestingly, for the four sparse Ω's, we did not observe any method achieving
high Frobenius error for Ω but low Frobenius error for Ψ. This finding helps substantiate
our intuition about Corollary 1: namely, in order to estimate Ψ well, one must estimate
Ω well. Finally, like Deshpande et al. (2019), we found that the dynamic conditional posterior
exploration implementation of cgSSL performed slightly worse than the dynamic posterior
exploration implementation.
6 Real data experiments
Claesson et al. (2012) studied the gut microbiota of elderly individuals using data sequenced
from fecal samples taken from 178 subjects. They were primarily interested in understanding
differences in gut microbiome composition across several residence types (in the community,
day-hospital, rehabilitation, or long-term residential care) and across several different
types of diet. We refer the reader to the Supplementary Notes and Supplementary Table 3
of Claesson et al. (2012) for more details. They found that the gut microbiomes of residents
in long-term care facilities were considerably less diverse than those of residents dwelling
in the community. They additionally reported that diet had a large marginal effect on gut
microbe diversity, but they did not examine conditional or direct effects, which might align
more closely with the underlying biological mechanism. In this section, we re-analyze their
data using the cgSSL to estimate the direct effects of each type of diet and residence
type on gut microbiome composition.
We pre-processed the raw 16S rRNA data on the MG-RAST server (Keegan et al., 2016);
please see Section S4 of the Supplementary Materials for more details on the pre-processing.
In all, we had n = 178 observations of p = 11 predictors and q = 14 taxa. Figure 3 shows
the graphical model estimated by cgSSL-DPE. In the figure, edges are colored according to
the sign of the effect, with blue edges corresponding to negative conditional correlation and
red edges corresponding to positive conditional correlation. The edge widths correspond to
the absolute value of the parameter, with wider edges indicating larger parameter values.
We found a large number of edges between the different species, suggesting that there was
considerable conditional dependence between their abundances even after adjusting for the
covariates. In fact, we found only two non-zero entries in Ψ. We estimated that percutaneous
endoscopic gastrostomy (PEG), in which a feeding tube is inserted into the abdomen, had a
negative direct effect on the abundance of Veillonella, which is involved in lactose fermentation.
Reassuringly, our finding aligns with that of Takeshita et al. (2011), who reported a
negative effect of PEG on this genus. We additionally found that staying in a
day hospital had a positive direct effect on Caloramator.
[Figure 3 network rendered in the original; only its labels survive extraction. Taxon nodes: Alistipes, Bacteroides, Barnesiella, Blautia, Butyrivibrio, Caloramator, Clostridium, Eubacterium, Faecalibacterium, Hespellia, Parabacteroides, Ruminococcus, Selenomonas, Veillonella. Predictor nodes: Age, GenderMale, StratumDayHospital, StratumLong-term, StratumRehab, Diet1–Diet4, DietPEG, BMI. Edge-width legend (abs_weight): 0.2, 0.4, 0.6, 0.8.]
Figure 3: The estimated graphical model underlying Claesson et al. (2012)'s gut microbiome
dataset. Edge widths are proportional to the absolute values of the conditional regression
coefficients; red edges represent positive (conditional) dependence and blue edges represent
negative (conditional) dependence.
Our results suggest that the large marginal effects reported by Claesson et al. (2012) are a by-product
of only a few direct effects combined with substantial residual conditional dependence between
species. For instance, because PEG has a direct effect on Veillonella, which is conditionally
correlated with Clostridium, Butyrivibrio, and Blautia, PEG displays a marginal effect on
each of these other genera. In this way, the cgSSL can provide a more nuanced understanding
of the underlying biological mechanism than simply estimating the matrix of marginal effects
B = ΨΩ^{-1}. We note, however, that Claesson et al. (2012)'s dataset does not contain an
exhaustive set of environmental and patient lifestyle predictors. Accordingly, our re-analysis
is limited in the sense that, were we able to incorporate additional predictors, the estimated
graphical model might be quite different.
7 Discussion
In the Gaussian chain graph model in Equation (1), Ψ is a matrix containing all of the direct
effects of p predictors on q outcomes while Ω is the residual precision matrix that encodes
the conditional dependence relationships between the outcomes that remain after adjusting
for the predictors. We have introduced the cgSSL procedure for obtaining simultaneously
sparse estimates of Ψ and Ω. In our procedure, we formally specify spike-and-slab LASSO
priors on the free elements of Ψ and Ω and use an ECM algorithm to maximize the posterior
density. Our ECM algorithm iteratively solves a penalized maximum likelihood problem
with self-adaptive penalties. Across several simulated datasets, cgSSL demonstrated excellent
support recovery and estimation performance, substantially out-performing competitors
that deployed constant shrinkage penalties. We further characterized the asymptotic properties
of cgSSL posteriors, establishing posterior concentration rates under relatively mild
assumptions. To the best of our knowledge, these are the first such results for sparse Gaussian
chain graph models.
Although our main theoretical result (Theorem 1) implies that a slightly modified version of
the cgSSL procedure from Section 3 is asymptotically consistent, quantifying finite-sample
posterior uncertainty remains challenging. Several authors have proposed extensions of Newton
and Raftery (1994)'s weighted likelihood bootstrap for quantifying posterior uncertainty.
Basically, these procedures work by repeatedly maximizing a randomized objective formed
by carefully re-weighting each term in the log-likelihood and the log-prior. In fact, Nie and
Ročková (2022) recently deployed this strategy to quantify uncertainty in SSL posteriors for
single-outcome regression in high dimensions. A key ingredient in Nie and Ročková (2022) is
the introduction of an additional random location shift in the prior to offset the tendency of
the SSL to return exactly sparse parameter estimates. In our Gaussian chain graph problem,
introducing a similar shift is challenging due to the constraint that Ω be positive definite.
Overcoming this difficulty is the subject of ongoing work.
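To illustrate the basic idea behind the weighted likelihood bootstrap (Newton and Raftery, 1994) referenced above, the sketch below applies it to a toy single-outcome linear regression, where each re-weighted maximization reduces to weighted least squares. This is a simplified illustration, not the chain-graph extension discussed in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.standard_normal(n)

def wlb_draws(X, y, n_boot=500):
    """Weighted likelihood bootstrap for Gaussian linear regression:
    each draw maximizes the log-likelihood re-weighted by i.i.d. Exp(1)
    weights, which here is a weighted least-squares problem."""
    draws = []
    for _ in range(n_boot):
        w = rng.exponential(1.0, size=len(y))   # random observation weights
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        draws.append(beta)
    return np.array(draws)

draws = wlb_draws(X, y)
print(draws.std(axis=0))  # approximate posterior spread for each coefficient
```

The collection of maximizers approximates a posterior sample; the difficulty noted in the text is that an analogous scheme for (Ψ, Ω) must respect the positive-definiteness of Ω.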
In many applications, analysts encounter multiple outcomes of mixed type (i.e., continuous
and discrete). In its current form, the cgSSL is not applicable to these situations. It is possible,
however, to extend the cgSSL to model outcomes of mixed type using a strategy similar
to the one found in Kowal and Canale (2020), which modeled discrete variables as truncated
and transformed versions of latent Gaussian random variables.
Acknowledgements
The authors are grateful to Ray Bai for helpful comments on the theoretical results and to
Gemma Moran for feedback on an early draft of the manuscript.
This work was supported by the National Institute of Food and Agriculture, United States
Department of Agriculture, Hatch project 1023699. This work was also supported by the
Department of Energy [DE-SC0021016 to C.S.L.]. Support for S.K.D. was provided by the
University of Wisconsin–Madison, Office of the Vice Chancellor for Research and Graduate
Education with funding from the Wisconsin Alumni Research Foundation.
This research was performed using the compute resources and assistance of the UW–Madison
Center For High Throughput Computing (CHTC) in the Department of Computer Sciences.
The CHTC is supported by UW–Madison, the Advanced Computing Initiative, the Wiscon-
sin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National
Science Foundation, and is an active member of the OSG Consortium, which is supported
by the National Science Foundation and the U.S. Department of Energy’s Office of Science.
References
Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal
Statistical Society: Series B (Methodological), 44(2):139–160.
Bai, R., Moran, G. E., Antonelli, J. L., Chen, Y., and Boland, M. R. (2020). Spike-and-slab
group LASSOs for grouped regression and sparse generalized additive models. Journal of
the American Statistical Association.
Bai, R., Ročková, V., and George, E. I. (2021). Spike-and-slab meets LASSO: A review of the
spike-and-slab LASSO. In Tadesse, M. and Vannucci, M., editors, Handbook of Bayesian
Variable Selection. Routledge.
Banerjee, O., Ghaoui, L. E., and D’Aspremont, A. (2008). Model selection through sparse
maximum likelihood estimation for multivariate Gaussian or binary data. Journal of
Machine Learning Research, 9:485–516.
Battson, M. L., Lee, D. M., Weir, T. L., and Gentile, C. L. (2018). The gut microbiota
as a novel regulator of cardiovascular function and disease. The Journal of Nutritional
Biochemistry, 56:1–15.
Belcheva, A., Irrazabal, T., Robertson, S. J., Streutker, C., Maughan, H., Rubino, S.,
Moriyama, E. H., Copeland, J. K., Surendra, A., Kumar, S., et al. (2014). Gut mi-
crobial metabolism drives transformation of MSH2-deficient colon epithelial cells. Cell,
158(2):288–299.
Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic
theory of independence. Oxford University Press.
Boyd, S. P. and Barratt, C. H. (1991). Linear controller design: limits of performance.
Prentice-Hall.
Claesson, M. J., Jeffery, I. B., Conde, S., Power, S. E., O’connor, E. M., Cusack, S., Harris,
H. M., Coakley, M., Lakshminarayanan, B., O’Sullivan, O., et al. (2012). Gut microbiota
composition correlates with diet and health in the elderly. Nature, 488(7410):178–184.
Deshpande, S. K., Ročková, V., and George, E. I. (2019). Simultaneous variable and covariance
selection with the multivariate spike-and-slab LASSO. Journal of Computational
and Graphical Statistics, 28(4):921–931.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation
with the graphical LASSO. Biostatistics, 9(3):432–441.
Frydenberg, M. (1990). The chain graph Markov property. Scandinavian Journal of Statis-
tics, 17(4):333–353.
Gan, L., Narisetty, N. N., and Liang, F. (2019a). Bayesian regularization for graphical models
with unequal shrinkage. Journal of the American Statistical Association, 114(527):1218–
1231.
Gan, L., Yang, X., Narisetty, N. N., and Liang, F. (2019b). Bayesian joint estimation of mul-
tiple graphical models. Advances in Neural Information Processing Systems (NeurIPS).
George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal
of the American Statistical Association, 88(423):881–889.
George, E. I. and Ročková, V. (2020). Comment: Regularization via Bayesian penalty
mixing. Technometrics, 62(4):438–442.
Ghosal, S. and van der Vaart, A. (2017). Fundamentals of nonparametric Bayesian inference,
volume 44. Cambridge University Press.
Guinane, C. M. and Cotter, P. D. (2013). Role of the gut microbiota in health and chronic
gastrointestinal disease: understanding a hidden metabolic organ. Therapeutic Advances
in Gastroenterology, 6(4):295–308.
Hills Jr, R. D., Pontefract, B. A., Mishcon, H. R., Black, C. A., Sutton, S. C., and Theberge,
C. R. (2019). Gut microbiome: profound implications for diet and disease. Nutrients,
11(7):1613.
Hsieh, C. J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2011). Sparse inverse covariance
matrix estimation using quadratic approximation. In Advances in Neural Information
Processing Systems (NeurIPS).
Kamada, N. and Núñez, G. (2014). Regulation of the immune system by the resident
intestinal bacteria. Gastroenterology, 146(6):1477–1488.
Keegan, K. P., Glass, E. M., and Meyer, F. (2016). MG-RAST, a metagenomics service
for analysis of microbial community structure and function. In Microbial Environmental
Genomics (MEG), pages 207–233. Springer.
Kim, D., Zeng, M. Y., and Núñez, G. (2017). The interplay between host immune cells and
gut microbiota in chronic inflammatory diseases. Experimental & Molecular Medicine,
49(5):e339–e339.
Kowal, D. R. and Canale, A. (2020). Simultaneous transformation and rounding (STAR)
models for integer-valued data. Electronic Journal of Statistics, 14(1):1744–1772.
Larsbrink, J., Rogers, T. E., Hemsworth, G. R., McKee, L. S., Tauzin, A. S., Spadiut, O.,
Klinter, S., Pudlo, N. A., Urs, K., Koropatkin, N. M., et al. (2014). A discrete genetic locus
confers xyloglucan metabolism in select human gut Bacteroidetes. Nature, 506(7489):498–
502.
Lauritzen, S. L. and Richardson, T. S. (2002). Chain graph models and their causal inter-
pretations. Journal of the Royal Statistical Society: Series B, 64(3):321–348.
Lauritzen, S. L. and Wermuth, N. (1989). Graphical models for associations between vari-
ables, some of which are qualitative and some quantitative. The Annals of Statistics,
17(1):31–57.
Li, Z., Mccormick, T., and Clark, S. (2019). Bayesian joint spike-and-slab graphical LASSO.
In Proceedings of the 36th International Conference on Machine Learning (ICML), pages
3877–3885. PMLR.
McCarter, C. and Kim, S. (2014). On sparse Gaussian chain graph models. Advances in
Neural Information Processing Systems (NeurIPS).
Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm:
a general framework. Biometrika, 80(2):267–278.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression.
Journal of the American Statistical Association, 83(404):1023–1032.
Moran, G. E., Ročková, V., and George, E. I. (2019). Variance prior forms for high-dimensional
Bayesian variable selection. Bayesian Analysis, 14(4):1091–1119.
Moran, G. E., Ročková, V., and George, E. I. (2021). Spike-and-slab LASSO biclustering.
The Annals of Applied Statistics, 15(1):148–173.
Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the weighted
likelihood bootstrap. Journal of the Royal Statistical Society: Series B, 56(1):3–26.
Nie, L. and Ročková, V. (2022). Bayesian bootstrap spike-and-slab LASSO. Journal of the
American Statistical Association.
Ning, B., Jeong, S., and Ghosal, S. (2020). Bayesian linear regression for multivariate
responses under group sparsity. Bernoulli, 26(3):2353–2382.
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria.
Ročková, V. and George, E. I. (2014). EMVS: The EM approach to Bayesian variable
selection. Journal of the American Statistical Association, 109(506):828–846.
Ročková, V. and George, E. I. (2016). Fast Bayesian factor analysis via automatic rotations
to sparsity. Journal of the American Statistical Association, 111(516):1608–1622.
Ročková, V. and George, E. I. (2018). The spike-and-slab LASSO. Journal of the American
Statistical Association, 113(521):431–444.
Scher, J. U., Sczesnak, A., Longman, R. S., Segata, N., Ubeda, C., Bielski, C., Rostron,
T., Cerundolo, V., Pamer, E. G., Abramson, S. B., et al. (2013). Expansion of intestinal
prevotella copri correlates with enhanced susceptibility to arthritis. eLife, 2:e01202.
Shen, Y. and Solis-Lemus, C. (2021). Bayesian conditional auto-regressive LASSO models
to learn sparse microbial networks with predictors. arXiv preprint arXiv:2012.08397.
Shreiner, A. B., Kao, J. Y., and Young, V. B. (2015). The gut microbiome in health and in
disease. Current Opinion in Gastroenterology, 31(1):69.
Singh, R. K., Chang, H.-W., Yan, D., Lee, K. M., Ucmak, D., Wong, K., Abrouk, M., Farah-
nik, B., Nakamura, M., Zhu, T. H., et al. (2017). Influence of diet on the gut microbiome
and implications for human health. Journal of Translational Medicine, 15(1):1–17.
Takeshita, T., Yasui, M., Tomioka, M., Nakano, Y., Shimazaki, Y., and Yamashita, Y.
(2011). Enteral tube feeding alters the oral indigenous microbiota in elderly adults. Applied
and Environmental Microbiology, 77(19):6739–6745.
Tang, Z., Shen, Y., Li, Y., Zhang, X., Wen, J., Qian, C., Zhuang, W., Shi, X., and Yi,
N. (2018). Group spike-and-slab LASSO generalized linear models for disease prediction
and associated genes detected by incorporating pathway information. Bioinformatics,
34(6):901–910.
Tang, Z., Shen, Y., Zhang, X., and Yi, N. (2017). The spike-and-slab LASSO generalized
linear models for prediction and associated genes detection. Genetics, 205:77–88.
Wang, Z., Klipfell, E., Bennett, B. J., Koeth, R., Levison, B. S., DuGar, B., Feldstein, A. E.,
Britt, E. B., Fu, X., Chung, Y.-M., et al. (2011). Gut flora metabolism of phosphatidyl-
choline promotes cardiovascular disease. Nature, 472(7341):57–63.
Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regularization for high-dimensional
sparse estimation problems. Statistical Science, 27(4):576–593.
Supplementary Materials
In Section S1 we derive the Expectation Conditional Maximization (ECM) algorithm used to
find the maximum a posteriori (MAP) estimates of Ψ and Ω in the cgSSL model. One of the
conditional maximization steps of that algorithm involves solving a CGLASSO problem. We
introduce a new algorithm, cgQUIC, to solve the general CGLASSO problem in Section S2.
Specifically, we show that the problem has a unique global optimum (Theorem S1) and that our
cgQUIC algorithm converges to this optimum (Theorem S2). Then, we present additional
results from the simulation study described in Section 5 of the main text in Section S3. In
Section S4, we detail the preprocessing steps we took to prepare the gut microbiome data
for analysis with the cgSSL. Finally, we state and prove our main asymptotic results in
Section S5.
S1 The cgSSL algorithm
In this section, we provide full details of the Expectation Conditional Maximization (ECM)
algorithm that is used in the cgSSL procedure. We describe the algorithm for a fixed set of
spike-and-slab penalties (λ0, λ1, ξ0, ξ1) and a fixed set of hyperparameters (aθ, bθ, aη, bη). For
notational brevity, we will let Θ = {Ψ, θ, Ω, η} denote the set of four parameters of interest.
Recall from Section 3.3 of the main text that we wish to maximize the log posterior density
\begin{align}
\log \pi(\Theta \mid Y) ={}& \frac{n}{2}\log|\Omega| - \frac{1}{2}\operatorname{tr}\left[(Y - X\Psi\Omega^{-1})\,\Omega\,(Y - X\Psi\Omega^{-1})^{\top}\right] \nonumber\\
&+ \sum_{j=1}^{p}\sum_{k=1}^{q}\log\left(\theta\lambda_{1}e^{-\lambda_{1}|\psi_{j,k}|} + (1-\theta)\lambda_{0}e^{-\lambda_{0}|\psi_{j,k}|}\right) \nonumber\\
&+ \sum_{k=1}^{q-1}\sum_{k'>k}\log\left(\eta\xi_{1}e^{-\xi_{1}|\omega_{k,k'}|} + (1-\eta)\xi_{0}e^{-\xi_{0}|\omega_{k,k'}|}\right) \nonumber\\
&- \sum_{k=1}^{q}\xi\,\omega_{k,k} + \log\mathbb{1}(\Omega \succ 0) \nonumber\\
&+ (a_{\theta}-1)\log\theta + (b_{\theta}-1)\log(1-\theta) + (a_{\eta}-1)\log\eta + (b_{\eta}-1)\log(1-\eta).
\tag{S1}
\end{align}
Instead of optimizing log π(Θ|Y) directly, we use an ECM algorithm and iteratively update
the surrogate objective
\[
F(\Theta) = \mathbb{E}_{\delta\mid\cdot}\left[\log \pi(\Theta, \delta \mid Y) \mid \Theta\right],
\]
where log π(Θ, δ | Y) is the log-density of the posterior in an augmented model involving the
spike-and-slab indicators δ = {δ_{k,k'} : 1 ≤ k < k' ≤ q}. Note that the expectation is taken
with respect to the conditional posterior distribution of δ given Θ. In our augmented model,
δ_{k,k'} indicates whether ω_{k,k'} was drawn from the spike (δ_{k,k'} = 0) or the slab (δ_{k,k'} = 1).
Given Θ and the data Y, these indicators are conditionally independent with
\[
\mathbb{E}[\delta_{k,k'} \mid Y, \Theta] = \frac{\eta\xi_{1}e^{-\xi_{1}|\omega_{k,k'}|}}{\eta\xi_{1}e^{-\xi_{1}|\omega_{k,k'}|} + (1-\eta)\xi_{0}e^{-\xi_{0}|\omega_{k,k'}|}}.
\]
The surrogate objective F(Θ) is given by
\begin{align}
F(\Theta) ={}& \frac{n}{2}\log|\Omega| + \operatorname{tr}(Y^{\top}X\Psi) - \frac{1}{2}\operatorname{tr}(Y^{\top}Y\Omega) - \frac{1}{2}\operatorname{tr}\left((X\Psi)^{\top}(X\Psi)\,\Omega^{-1}\right) \nonumber\\
&+ \sum_{j=1}^{p}\sum_{k=1}^{q}\log\left(\theta\lambda_{1}e^{-\lambda_{1}|\psi_{j,k}|} + (1-\theta)\lambda_{0}e^{-\lambda_{0}|\psi_{j,k}|}\right) - \sum_{k<k'}\xi^{\star}_{k,k'}|\omega_{k,k'}| - \xi\sum_{k=1}^{q}\omega_{k,k} \nonumber\\
&+ (a_{\theta}-1)\log\theta + (b_{\theta}-1)\log(1-\theta) + (a_{\eta}-1)\log\eta + (b_{\eta}-1)\log(1-\eta)
\tag{S2}
\end{align}
where \xi^{\star}_{k,k'} = \xi_{1}q^{\star}_{k,k'} + \xi_{0}(1 - q^{\star}_{k,k'}) with q^{\star}_{k,k'} = q^{\star}(\omega_{k,k'}, \eta) and
\[
q^{\star}(x, \eta) = \frac{\eta\xi_{1}e^{-\xi_{1}|x|}}{\eta\xi_{1}e^{-\xi_{1}|x|} + (1-\eta)\xi_{0}e^{-\xi_{0}|x|}}.
\]
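The slab probability q⋆ is simple to compute directly. The sketch below (a minimal illustration with made-up penalty values, not the paper's packaged implementation) shows how a sharp spike (ξ0 much larger than ξ1) attributes entries near zero to the spike and large entries to the slab:

```python
import numpy as np

def q_star(x, eta, xi0, xi1):
    """Conditional probability that an entry with value x was drawn from the slab."""
    slab = eta * xi1 * np.exp(-xi1 * np.abs(x))
    spike = (1.0 - eta) * xi0 * np.exp(-xi0 * np.abs(x))
    return slab / (slab + spike)

# with xi0 = 20 and xi1 = 1, an entry at zero is almost surely spike,
# while an entry of magnitude 1 is almost surely slab
print(q_star(0.0, 0.5, 20.0, 1.0))  # ~0.048
print(q_star(1.0, 0.5, 20.0, 1.0))  # ~1.0
```

The same function, applied entrywise to the current Ω, yields the E-step quantities q⋆_{k,k'}.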
Our ECM algorithm iteratively computes F(Θ) based on the current value of Θ (the E-step)
and then updates the value of Θ by performing two conditional maximizations (the CM-step).
More specifically, for t ≥ 1, if Θ^{(t-1)} is the value of Θ at the start of the t-th iteration,
in the E-step we compute
\[
F^{(t)}(\Theta) = \mathbb{E}_{\delta\mid\cdot}\left[\log \pi(\Theta, \delta \mid Y) \mid \Theta = \Theta^{(t-1)}\right].
\]
We then compute Θ^{(t)} by first optimizing F^{(t)}(Θ) with respect to (Ψ, θ) while fixing (Ω, η) =
(Ω^{(t-1)}, η^{(t-1)}) to obtain (Ψ^{(t)}, θ^{(t)}), and then optimizing F^{(t)}(Θ) with respect to (Ω, η) while
fixing (Ψ, θ) = (Ψ^{(t)}, θ^{(t)}) to obtain (Ω^{(t)}, η^{(t)}). That is, in the CM-step we solve the following
optimization problems
\begin{align}
(\Psi^{(t)}, \theta^{(t)}) &= \arg\max_{\Psi, \theta}\ F^{(t)}(\Psi, \theta, \Omega^{(t-1)}, \eta^{(t-1)}) \tag{S3}\\
(\Omega^{(t)}, \eta^{(t)}) &= \arg\max_{\Omega, \eta}\ F^{(t)}(\Psi^{(t)}, \theta^{(t)}, \Omega, \eta). \tag{S4}
\end{align}
Once we solve the optimization problems in Equations (S3) and (S4), we set
Θ^{(t)} = (Ψ^{(t)}, θ^{(t)}, Ω^{(t)}, η^{(t)}).
Our ECM algorithm iterates between the E-step and CM-step until the percentage change
in the estimated entries of Ψ and Ω, or in the log posterior density, falls below some user-defined
tolerance. In our implementation, we have found that a tolerance of 10^{-3} works well. The
following subsections detail how we carry out each conditional maximization step.
S1.1 Updating Ψ and θ
Fixing (Ω, η) = (Ω^{(t-1)}, η^{(t-1)}), observe that
\begin{align}
F^{(t)}(\Psi, \theta, \Omega^{(t-1)}, \eta^{(t-1)}) &= -\frac{1}{2}\operatorname{tr}\left[(Y - X\Psi\Omega^{-1})\,\Omega\,(Y - X\Psi\Omega^{-1})^{\top}\right] + \log\pi(\Psi, \theta) \nonumber\\
&= -\frac{1}{2}\operatorname{tr}\left[(Y - X\Psi\Omega^{-1})\,\Omega\Omega^{-1}\Omega\,(Y - X\Psi\Omega^{-1})^{\top}\right] + \log\pi(\Psi, \theta) \nonumber\\
&= -\frac{1}{2}\operatorname{tr}\left[(Y\Omega - X\Psi)\,\Omega^{-1}\,(Y\Omega - X\Psi)^{\top}\right] + \log\pi(\Psi, \theta)
\tag{S5}
\end{align}
where
\begin{align}
\log\pi(\Psi, \theta) = \sum_{j=1}^{p}\sum_{k=1}^{q}\log\left(\theta\lambda_{1}e^{-\lambda_{1}|\psi_{j,k}|} + (1-\theta)\lambda_{0}e^{-\lambda_{0}|\psi_{j,k}|}\right) + (a_{\theta}-1)\log\theta + (b_{\theta}-1)\log(1-\theta).
\tag{S6}
\end{align}
We solve the optimization problem in Equation (S5) using a coordinate ascent strategy that
iteratively updates Ψ (resp. θ) while holding θ (resp. Ψ) fixed. We run the coordinate ascent
until the relative change in every active ψ_{j,k} falls below the user-defined tolerance.
Updating θ given Ψ. Notice that the objective in Equation (S5) depends on θ only through
the log π(Ψ, θ) term. Accordingly, to update θ conditionally on Ψ, it is enough to maximize
the expression in Equation (S6) as a function of θ while keeping all ψ_{j,k} terms fixed. We use
Newton's method for this optimization and terminate once the Newton step size falls below
the user-defined tolerance.
Updating Ψ given θ. With θ fixed, optimizing Equation (S5) is equivalent to solving
\begin{align}
\Psi^{(t)} &= \arg\max_{\Psi}\left\{-\frac{1}{2}\operatorname{tr}\left[(Y\Omega - X\Psi)\Omega^{-1}(Y\Omega - X\Psi)^{\top}\right] + \log\pi(\Psi \mid \theta)\right\} \nonumber\\
&= \arg\max_{\Psi}\left\{-\frac{1}{2}\operatorname{tr}\left[(Y\Omega - X\Psi)\Omega^{-1}(Y\Omega - X\Psi)^{\top}\right] + \sum_{j,k}\log\frac{\pi(\psi_{j,k} \mid \theta)}{\pi(0 \mid \theta)}\right\} \nonumber\\
&= \arg\max_{\Psi}\left\{-\frac{1}{2}\operatorname{tr}\left[(Y\Omega - X\Psi)\Omega^{-1}(Y\Omega - X\Psi)^{\top}\right] + \sum_{j,k}\operatorname{pen}(\psi_{j,k} \mid \theta)\right\}
\tag{S7}
\end{align}
where
\[
\operatorname{pen}(\psi_{j,k} \mid \theta) = \log\frac{\pi(\psi_{j,k} \mid \theta)}{\pi(0 \mid \theta)} = -\lambda_{1}|\psi_{j,k}| + \log\frac{p^{\star}(\psi_{j,k}, \theta)}{p^{\star}(0, \theta)}
\]
and p^{\star}(x, \theta) = \theta\lambda_{1}e^{-\lambda_{1}|x|}\big/\left(\theta\lambda_{1}e^{-\lambda_{1}|x|} + (1-\theta)\lambda_{0}e^{-\lambda_{0}|x|}\right) is the conditional probability that an entry with value x was drawn from the slab.
Following essentially the same arguments as those in Deshpande et al. (2019) and using
the fact that the columns of X are scaled so that X_{j}^{\top}X_{j} = n, the Karush-Kuhn-Tucker (KKT)
condition for the optimization problem in the final line of Equation (S7) tells us that the
optimizer Ψ* satisfies
\begin{align}
\psi^{*}_{j,k} = n^{-1}\left(|z_{j,k}| - \lambda^{\star}(\psi^{*}_{j,k}, \theta)\right)_{+}\operatorname{sign}(z_{j,k}),
\tag{S8}
\end{align}
where
\begin{align*}
z_{j,k} &= n\psi^{*}_{j,k} + X_{j}^{\top}r_{k} + \sum_{k' \neq k}\frac{(\Omega^{-1})_{k,k'}}{(\Omega^{-1})_{k,k}}X_{j}^{\top}r_{k'},\\
r_{k'} &= (Y\Omega - X\Psi^{*})_{k'} \text{ (the } k'\text{-th column of the residual matrix), and}\\
\lambda^{\star}(\psi^{*}_{j,k}, \theta) &= \lambda_{1}p^{\star}(\psi^{*}_{j,k}, \theta) + \lambda_{0}\left(1 - p^{\star}(\psi^{*}_{j,k}, \theta)\right).
\end{align*}
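The self-adaptive soft-thresholding in Equation (S8) can be sketched in a few lines. This is an illustration with made-up penalty values, not the paper's implementation; `kkt_update` is a hypothetical helper name, and p⋆ denotes the slab probability θλ1e^{-λ1|x|}/(θλ1e^{-λ1|x|} + (1-θ)λ0e^{-λ0|x|}):

```python
import numpy as np

def p_star(x, theta, lam0, lam1):
    """Probability that a coefficient with value x came from the slab component."""
    slab = theta * lam1 * np.exp(-lam1 * abs(x))
    spike = (1.0 - theta) * lam0 * np.exp(-lam0 * abs(x))
    return slab / (slab + spike)

def kkt_update(z, psi, n, theta, lam0, lam1):
    """One soft-threshold refinement of psi from the KKT condition (S8)."""
    p = p_star(psi, theta, lam0, lam1)
    lam_star = lam1 * p + lam0 * (1.0 - p)   # self-adaptive penalty
    return np.sign(z) * max(abs(z) - lam_star, 0.0) / n

# a large |z| survives the adaptive penalty; a small |z| is shrunk exactly to zero
print(kkt_update(50.0, 0.0, 100, 0.5, 20.0, 1.0))  # > 0
print(kkt_update(10.0, 0.0, 100, 0.5, 20.0, 1.0))  # 0.0
```

Because the penalty λ⋆ is evaluated at the current coefficient value, coefficients that are already large are penalized at the gentle slab rate while small coefficients face the aggressive spike rate.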
The KKT condition immediately suggests a cyclical coordinate ascent strategy for solving
the problem in Equation (S7) that involves soft thresholding the running estimates of ψ_{j,k}.
Like Ročková and George (2018) and Deshpande et al. (2019), we can, however, obtain a
more refined characterization of the global mode \tilde{\Psi} = (\tilde{\psi}_{j,k}):
\[
\tilde{\psi}_{j,k} = n^{-1}\left[|z_{j,k}| - \lambda^{\star}(\tilde{\psi}_{j,k}, \theta)\right]_{+}\operatorname{sign}(z_{j,k}) \times \mathbb{1}\left(|z_{j,k}| > \Delta_{j,k}\right),
\]
where
\[
\Delta_{j,k} = \inf_{t>0}\left\{\frac{nt}{2} - \frac{\operatorname{pen}(t \mid \theta)}{(\Omega^{-1})_{k,k}\,t}\right\}.
\]
Though the exact thresholds Δ_{j,k} are difficult to compute, they can be bounded using
analogs of Theorem 2.1 of Ročková and George (2018) and Proposition 2 of Deshpande et al.
(2019). Specifically, suppose we have (\lambda_{0} - \lambda_{1}) > 2\sqrt{n(\Omega^{-1})_{k,k}} and (\lambda^{\star}(0, \theta) - \lambda_{1})^{2} >
-2n(\Omega^{-1})_{k,k}\log p^{\star}(0, \theta). Then we have \Delta^{L}_{j,k} \le \Delta_{j,k} \le \Delta^{U}_{j,k}, where
\begin{align*}
\Delta^{L}_{j,k} &= \sqrt{-2n\left((\Omega^{-1})_{k,k}\right)^{-1}\log p^{\star}(0, \theta) - \left((\Omega^{-1})_{k,k}\right)^{-2}d} + \lambda_{1}/(\Omega^{-1})_{k,k}\\
\Delta^{U}_{j,k} &= \sqrt{-2n\left((\Omega^{-1})_{k,k}\right)^{-1}\log p^{\star}(0, \theta)} + \lambda_{1}/(\Omega^{-1})_{k,k},
\end{align*}
where d = -(\lambda^{\star}(\delta_{c+}, \theta) - \lambda_{1})^{2} - 2n(\Omega^{-1})_{k,k}\log p^{\star}(\delta_{c+}, \theta) and \delta_{c+} is the largest root of
\operatorname{pen}''(x \mid \theta) = (\Omega^{-1})_{k,k}.
Our refined characterization of \tilde{\Psi} suggests a cyclical coordinate ascent strategy that combines hard thresholding at \Delta_{j,k} with soft thresholding at \lambda^{\star}_{j,k}.
Remark 1. Equation (S5) and our approach to solving the optimization problem are extremely
similar to Equation 3 and the coordinate ascent strategy used in Deshpande et al.
(2019), who fit sparse marginal multivariate linear models with spike-and-slab LASSO priors.
This is because if Y ∼ N(XΨΩ^{-1}, Ω^{-1}) in our chain graph model, then YΩ ∼ N(XΨ, Ω).
Thus, if we fix the value of Ω, we can use any computational strategy for estimating marginal
effects in the multivariate linear regression model to estimate Ψ by working with the transformed
data YΩ.
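The distributional identity in Remark 1 is easy to check by simulation. The sketch below uses an arbitrary 3 × 3 precision matrix (illustrative values, not from the paper) and verifies that the transformed responses YΩ have the stated mean and covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
# an arbitrary positive definite precision matrix Omega (illustrative values)
Omega = np.array([[2.0, 0.5, 0.0],
                  [0.5, 2.0, 0.5],
                  [0.0, 0.5, 2.0]])
Sigma = np.linalg.inv(Omega)
mu = np.array([1.0, -1.0, 0.5])        # plays the role of a row of X @ Psi
# chain graph sampling model: rows of Y are N(mu @ Omega^{-1}, Omega^{-1})
Y = rng.multivariate_normal(mu @ Sigma, Sigma, size=n)
Z = Y @ Omega                          # transformed responses
print(np.abs(Z.mean(axis=0) - mu).max())              # ~0: E[Y Omega] = X Psi
print(np.abs(np.cov(Z, rowvar=False) - Omega).max())  # ~0: Cov(Y Omega) = Omega
```

The covariance check follows from Cov(YΩ) = Ω Ω^{-1} Ω = Ω, which is exactly why the transformed problem looks like a marginal multivariate regression.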
S1.2 Updating Ω and η
Fixing Ψ = Ψ^{(t)} and θ = θ^{(t)}, we compute Ω^{(t)} and η^{(t)} by optimizing the function
\begin{align}
F^{(t)}(\Psi^{(t)}, \theta^{(t)}, \Omega, \eta) ={}& \frac{n}{2}\left[\log|\Omega| - \operatorname{tr}(S\Omega) - \operatorname{tr}(M\Omega^{-1})\right] - \sum_{k<k'}\xi^{\star}_{k,k'}|\omega_{k,k'}| - \xi\sum_{k=1}^{q}\omega_{k,k} \nonumber\\
&+ \left(a_{\eta} - 1 + \sum_{k<k'}q^{\star}_{k,k'}\right)\log(\eta) + \left(b_{\eta} - 1 + q(q-1)/2 - \sum_{k<k'}q^{\star}_{k,k'}\right)\log(1-\eta)
\tag{S9}
\end{align}
where S = \frac{1}{n}Y^{\top}Y and M = \frac{1}{n}(X\Psi)^{\top}X\Psi.
We immediately observe that the expression in Equation (S9) is separable in Ω and η, meaning
that we can compute Ω^{(t)} and η^{(t)} separately. Specifically, we have
\begin{align}
\eta^{(t)} = \frac{a_{\eta} - 1 + \sum_{k<k'}q^{\star}_{k,k'}}{a_{\eta} + b_{\eta} - 2 + q(q-1)/2}
\tag{S10}
\end{align}
and
\begin{align}
\Omega^{(t)} = \arg\max_{\Omega \succ 0}\left\{\frac{n}{2}\left[\log|\Omega| - \operatorname{tr}(S\Omega) - \operatorname{tr}(M\Omega^{-1})\right] - \sum_{k<k'}\xi^{\star}_{k,k'}|\omega_{k,k'}| - \xi\sum_{k=1}^{q}\omega_{k,k}\right\}.
\tag{S11}
\end{align}
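The η update (S10) is an exact closed form; the sketch below (a hypothetical helper name with made-up inputs) shows that with a uniform Beta(1, 1) prior it reduces to the average slab probability over the Q = q(q-1)/2 free off-diagonal entries:

```python
def eta_update(q_star_sum, q, a_eta, b_eta):
    """Closed-form CM update (S10) for eta given the summed slab probabilities."""
    Q = q * (q - 1) / 2.0          # number of free off-diagonal entries
    return (a_eta - 1.0 + q_star_sum) / (a_eta + b_eta - 2.0 + Q)

# with q = 5 there are Q = 10 free entries; summed slab probability 5 gives eta = 0.5
print(eta_update(5.0, 5, 1.0, 1.0))  # 0.5
```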
The objective function in Equation (S11) is similar to a graphical LASSO (GLASSO; Friedman
et al., 2008) problem insofar as both problems involve a term like log|Ω| + tr(SΩ) and
separable L1 penalties on the off-diagonal elements of Ω. However, Equation (S11) includes
an additional term tr(MΩ^{-1}), which does not appear in the GLASSO. This term arises
through the entanglement of Ψ and Ω in the Gaussian chain graph model, and we accordingly
call the problem in Equation (S11) the CGLASSO problem. We solve this problem
by (i) forming a quadratic approximation of the objective, (ii) computing a suitable Newton
direction, and (iii) following that Newton direction for a suitable step size. We detail this
solution strategy in Section S2.
S2 Chain graphical LASSO with cgQUIC
Equation (S11) is a specific instantiation of what we term the "chain graphical LASSO"
(CGLASSO) problem, whose general form is
\begin{align}
\arg\min_{\Omega}\left\{-\log|\Omega| + \operatorname{tr}(S\Omega) + \operatorname{tr}(M\Omega^{-1}) + \sum_{k,k'}\xi_{k,k'}|\omega_{k,k'}|\right\}
\tag{S12}
\end{align}
where S and M are symmetric positive semi-definite q × q matrices; Ω is a symmetric positive
definite q × q matrix; and the ξ_{k,k'} are symmetric non-negative penalty weights (i.e., we
have ξ_{k,k'} = ξ_{k',k}).
Notice that when M is the zero matrix, the CGLASSO problem reduces to a general GLASSO
problem, which admits several computational solutions. One well-known solution is to solve
the dual problem, which involves minimizing a log-determinant under an L∞ constraint
(Banerjee et al., 2008).
Unfortunately, the dual form of the CGLASSO problem does not have such a simple form.
To wit, the dual of the CGLASSO problem with uniform penalty ξ is given by
\[
\min_{\|U\|_{\infty} < \xi}\ \max_{\Omega \succ 0}\ \log|\Omega| - \operatorname{tr}\left[(S + U)\Omega\right] - \operatorname{tr}\left[M\Omega^{-1}\right].
\]
The inner optimization over Ω can be solved by setting the derivative to zero; the optimal
value of Ω solves a special case of the continuous-time algebraic Riccati equation (CARE) (Boyd
and Barratt, 1991):
\[
\Omega - \Omega(S + U)\Omega + M = 0.
\]
Unfortunately, this equation does not have a closed-form solution, and solving it numerically
at every step of the cgSSL is computationally prohibitive.
We instead solve the CGLASSO problem using a suitably modified version of Hsieh et al.
(2011)'s QUIC algorithm for the GLASSO problem. At a high level, instead of relying on
first-order gradient information or solving the dual problem, the algorithm is based on Newton's
method and a quadratic approximation of the objective. We sequentially cycle over the parameters
ω_{k,k'} and update each parameter by following a Newton direction for a suitable step size. The
step size is chosen to ensure that our running estimate of Ω remains positive definite while
also ensuring sufficient decrease in the overall objective. We call our solution cgQUIC,
which we summarize in Algorithm 1.

To describe cgQUIC, we first define the "smooth" part of the CGLASSO objective as g(Ω)
and the full objective as f(Ω):
\begin{align}
g(\Omega) = -\log|\Omega| + \operatorname{tr}(S\Omega) + \operatorname{tr}(M\Omega^{-1}), \qquad
f(\Omega) = g(\Omega) + \sum_{k,k'}\xi_{k,k'}|\omega_{k,k'}|.
\tag{S13}
\end{align}
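For concreteness, the objective (S13) can be evaluated directly. The sketch below is not the paper's implementation; it uses `slogdet` for a numerically stable log-determinant:

```python
import numpy as np

def g(Omega, S, M):
    """Smooth part of the CGLASSO objective (S13)."""
    sign, logdet = np.linalg.slogdet(Omega)
    return -logdet + np.trace(S @ Omega) + np.trace(M @ np.linalg.inv(Omega))

def f(Omega, S, M, Xi):
    """Full CGLASSO objective: smooth part plus weighted L1 penalty."""
    return g(Omega, S, M) + np.sum(Xi * np.abs(Omega))

I2 = np.eye(2)
print(f(I2, I2, I2, 0.1 * np.ones((2, 2))))  # ~4.2 (= 0 + 2 + 2 + 0.2)
```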
The function g(Ω) is twice differentiable and strictly convex. To see this, observe that g(Ω)
is, up to constants, the negative log-likelihood of a Gaussian chain graph model with known Ψ;
its Hessian is therefore the Fisher information of Ω, which is positive definite. Writing
W = Ω^{-1}, the second-order Taylor expansion of the smooth part g around Ω, evaluated at
Ω + ∆ for a symmetric ∆, is
\begin{align}
\bar{g}_{\Omega}(\Delta) ={}& -\log|\Omega| + \operatorname{tr}(S\Omega) + \operatorname{tr}(MW) + \operatorname{tr}(S\Delta) - \operatorname{tr}(W\Delta) - \operatorname{tr}(WMW\Delta) \nonumber\\
&+ \frac{1}{2}\operatorname{tr}(W\Delta W\Delta) + \operatorname{tr}(WMW\Delta W\Delta).
\tag{S14}
\end{align}
S2.1 Newton Direction
We now consider the coordinate descent update for the entry ω_{k,k'} with k ≤ k'. Let D denote
the current approximation of the Newton direction and let D' be the updated direction. To
preserve symmetry, we set D' = D + µ(e_{k}e_{k'}^{\top} + e_{k'}e_{k}^{\top}). Our goal, then, is to find the optimal
µ:
\begin{align}
\arg\min_{\mu}\left\{\bar{g}_{\Omega}\left(D + \mu(e_{k}e_{k'}^{\top} + e_{k'}e_{k}^{\top})\right) + 2\xi_{k,k'}\left|\omega_{k,k'} + D_{k,k'} + \mu\right|\right\}.
\tag{S15}
\end{align}
We begin by substituting ∆ = D' into \bar{g}_{\Omega}(\Delta). Note that terms not depending on µ do not
affect the line search. Compared to QUIC, we have two additional terms, tr(WMW∆) and
tr(WMW∆W∆); the first turns out to be linear in µ and the second is quadratic in µ.
Algorithm 1: The cgQUIC algorithm for the CGLASSO problem

Data: S = Y^⊤Y/n, M = (XΨ)^⊤(XΨ)/n, regularization parameter matrix Ξ, initial Ω_0, inner stopping tolerance ε, line search parameters 0 < σ < 0.5 and 0 < β < 1.
Result: a path of positive definite Ω_t that converges to arg min_Ω f(Ω), where f(Ω) = g(Ω) + Σ_{k,k'} ξ_{k,k'}|ω_{k,k'}| and g(Ω) = -log|Ω| + tr(SΩ) + tr(MΩ^{-1}).

Initialize W_0 = Ω_0^{-1}
for t = 1, 2, ... do
    D = 0, U = 0
    Q = M W_{t-1}
    while not converged do
        Partition the variables into fixed and free sets based on the gradient:
            S_fixed := {(k, k') : |∇_{k,k'} g(Ω)| < ξ_{k,k'} and ω_{k,k'} = 0}
            S_free := {(k, k') : |∇_{k,k'} g(Ω)| ≥ ξ_{k,k'} or ω_{k,k'} ≠ 0}
        for (k, k') ∈ S_free do
            Calculate the Newton direction:
                b = S_{k,k'} - W_{k,k'} + w_k^⊤ D w_{k'} - w_k^⊤ M w_{k'} + w_{k'}^⊤ D W M w_k + w_k^⊤ D W M w_{k'}
                c = ω_{k,k'} + D_{k,k'}
                if k ≠ k' then
                    a = W_{k,k'}^2 + W_{k,k} W_{k',k'} + W_{k,k} w_{k'}^⊤ M w_{k'} + W_{k',k'} w_k^⊤ M w_k + 2 W_{k,k'} w_k^⊤ M w_{k'}
                else
                    a = W_{k,k}^2 + 2 W_{k,k} w_k^⊤ M w_k
                end
            µ = -c + [|c - b/a| - ξ_{k,k'}/a]_+ sign(c - b/a)
            D_{k,k'} += µ;  u_k += µ w_{k'};  u_{k'} += µ w_k
        end
    end
    for α = 1, β, β^2, ... do
        Compute the Cholesky decomposition of Ω_{t-1} + αD*
        if Ω_{t-1} + αD* is not positive definite then continue
        Compute f(Ω_{t-1} + αD*)
        if f(Ω_{t-1} + αD*) ≤ f(Ω_{t-1}) + ασδ, where δ = tr[∇g(Ω_{t-1})^⊤ D*] + ‖Ω_{t-1} + D*‖_{1,Ξ} - ‖Ω_{t-1}‖_{1,Ξ}, then break
    end
    Ω_t = Ω_{t-1} + αD*;  W_t = Ω_t^{-1} using the Cholesky decomposition result
end
return {Ω_t}
To see this, first observe that
\begin{align}
-\operatorname{tr}(WMW\Delta) &= -\operatorname{tr}\left(WMW\left(D + \mu(e_{k}e_{k'}^{\top} + e_{k'}e_{k}^{\top})\right)\right) \nonumber\\
&= C - \mu\operatorname{tr}\left(WMWe_{k}e_{k'}^{\top} + WMWe_{k'}e_{k}^{\top}\right) \nonumber\\
&= C - \mu\left(e_{k'}^{\top}WMWe_{k} + e_{k}^{\top}WMWe_{k'}\right) \nonumber\\
&= C - 2\mu\,e_{k}^{\top}WMWe_{k'} \nonumber\\
&= C - 2\mu\,w_{k}^{\top}Mw_{k'}
\tag{S16}
\end{align}
where C collects terms that do not depend on µ and w_{k} denotes the k-th column of W = Ω^{-1}.
Furthermore, we have
\begin{align}
\operatorname{tr}(WMW\Delta W\Delta) ={}& \operatorname{tr}\left[WMW\left(D + \mu(e_{k}e_{k'}^{\top} + e_{k'}e_{k}^{\top})\right)W\left(D + \mu(e_{k}e_{k'}^{\top} + e_{k'}e_{k}^{\top})\right)\right] \nonumber\\
={}& C + 2\mu\operatorname{tr}\left[DWMW(e_{k}e_{k'}^{\top} + e_{k'}e_{k}^{\top})W\right] \nonumber\\
&+ \mu^{2}\operatorname{tr}\left[(e_{k}e_{k'}^{\top} + e_{k'}e_{k}^{\top})WMW(e_{k}e_{k'}^{\top} + e_{k'}e_{k}^{\top})W\right] \nonumber\\
={}& C + 2\mu\left(w_{k'}^{\top}DWMw_{k} + w_{k}^{\top}DWMw_{k'}\right) \nonumber\\
&+ \mu^{2}\left(W_{k,k}w_{k'}^{\top}Mw_{k'} + W_{k',k'}w_{k}^{\top}Mw_{k} + 2W_{k,k'}w_{k}^{\top}Mw_{k'}\right).
\tag{S17}
\end{align}
By combining the above simplifications, we can minimize the objective with coordinate
descent. For ω_{k,k'}, the µ-dependent part of the objective is
\begin{align}
&\frac{1}{2}\left[W_{k,k'}^{2} + W_{k,k}W_{k',k'} + W_{k,k}w_{k'}^{\top}Mw_{k'} + W_{k',k'}w_{k}^{\top}Mw_{k} + 2W_{k,k'}w_{k}^{\top}Mw_{k'}\right]\mu^{2} \nonumber\\
&+ \left[S_{k,k'} - W_{k,k'} + w_{k}^{\top}Dw_{k'} - w_{k}^{\top}Mw_{k'} + w_{k'}^{\top}DWMw_{k} + w_{k}^{\top}DWMw_{k'}\right]\mu \nonumber\\
&+ \xi_{k,k'}\left|\omega_{k,k'} + D_{k,k'} + \mu\right|.
\tag{S18}
\end{align}
The optimal solution (for off-diagonal ω_{k,k'}) is given by
\begin{align}
\mu = -c + \left[|c - b/a| - \xi_{k,k'}/a\right]_{+}\operatorname{sign}(c - b/a)
\tag{S19}
\end{align}
where
\begin{align*}
a &= W_{k,k'}^{2} + W_{k,k}W_{k',k'} + W_{k,k}w_{k'}^{\top}Mw_{k'} + W_{k',k'}w_{k}^{\top}Mw_{k} + 2W_{k,k'}w_{k}^{\top}Mw_{k'}\\
b &= S_{k,k'} - W_{k,k'} + w_{k}^{\top}Dw_{k'} - w_{k}^{\top}Mw_{k'} + w_{k'}^{\top}DWMw_{k} + w_{k}^{\top}DWMw_{k'}\\
c &= \omega_{k,k'} + D_{k,k'}.
\end{align*}
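The update (S19) is a shifted soft-thresholding: substituting ν = c + µ turns the one-dimensional problem into a standard LASSO step. A self-contained sketch (hypothetical helper name), checked against a brute-force grid search on the same objective:

```python
import numpy as np

def coordinate_mu(a, b, c, xi):
    """Minimizer (S19) of 0.5*a*mu^2 + b*mu + xi*|c + mu| over mu (a > 0)."""
    t = c - b / a
    return -c + np.sign(t) * max(abs(t) - xi / a, 0.0)

# verify on made-up scalars against a dense grid search
a, b, c, xi = 2.0, 0.0, 1.0, 1.0
obj = lambda mu: 0.5 * a * mu**2 + b * mu + xi * abs(c + mu)
grid = np.linspace(-3, 3, 600001)
print(coordinate_mu(a, b, c, xi))   # -0.5
print(grid[np.argmin(obj(grid))])   # ~-0.5
```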
For diagonal entries, we take D' = D + \mu e_{k}e_{k}^{\top}; the two terms involving M are then
\begin{align}
-\operatorname{tr}(WMW\Delta) = C - \mu\,w_{k}^{\top}Mw_{k}, \qquad
\operatorname{tr}(WMW\Delta W\Delta) = C + 2\mu\,w_{k}^{\top}DWMw_{k} + \mu^{2}W_{k,k}w_{k}^{\top}Mw_{k}.
\tag{S20}
\end{align}
Then we can take
\begin{align*}
a &= W_{k,k}^{2} + 2W_{k,k}w_{k}^{\top}Mw_{k}\\
b &= S_{k,k} - W_{k,k} + w_{k}^{\top}Dw_{k} - w_{k}^{\top}Mw_{k} + 2w_{k}^{\top}DWMw_{k}\\
c &= \omega_{k,k} + D_{k,k}
\end{align*}
and use Equation (S19) to obtain the optimal µ and thus the updated Newton direction D'.
Note that computing the optimal µ requires repeated calculation of quantities like w_{k}^{\top}Mw_{k'}
and w_{k}^{\top}DWMw_{k'}. To enable rapid computation, we track and update the values of U = DW
and Q = MW during our optimization.
S2.2 Step Size
Like Hsieh et al. (2011), we use Armijo's rule to set a step size α that simultaneously ensures
that our estimate of Ω remains positive definite and that the overall objective decreases
sufficiently. We denote the Newton direction after a complete update over all active coordinates
as D* (see Section S2.3 for the active sets). We require our step size to satisfy the line search
condition
\begin{align}
f(\Omega + \alpha D^{*}) \le f(\Omega) + \alpha\sigma\delta, \qquad \delta = \operatorname{tr}\left[\nabla g(\Omega)^{\top}D^{*}\right] + \|\Omega + D^{*}\|_{1,\Xi} - \|\Omega\|_{1,\Xi}.
\tag{S21}
\end{align}
Three important properties can be established following Hsieh et al. (2011):

P1. The condition is satisfied for all small enough α. This property follows exactly as in
Proposition 1 of Hsieh et al. (2011).

P2. We have δ < 0 for all Ω ≻ 0, which ensures that the objective function decreases. This
property follows from Lemma 2 and Proposition 2 of Hsieh et al. (2011), which
require the Hessian of the smooth part g(Ω) to be positive definite. In our case, the
Hessian of g(Ω) is the Fisher information of the chain graph model, ensuring its positive
definiteness.

P3. When Ω is close to the global optimum, the step size α = 1 satisfies the line search
condition. To establish this, we follow the proof of Proposition 3 in Hsieh et al. (2011).
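A sketch of the backtracking search enforcing condition (S21), with the objective and smooth-part gradient passed in as inputs (σ and β are illustrative defaults; the toy example sets M = 0 and Ξ = 0, so the objective reduces to a GLASSO-type problem):

```python
import numpy as np

def armijo_step(f, grad_g, Omega, D, Xi, sigma=0.25, beta=0.5, max_halvings=50):
    """Backtracking line search: keep Omega + alpha*D positive definite and satisfy (S21)."""
    l1 = lambda A: np.sum(Xi * np.abs(A))
    delta = np.trace(grad_g.T @ D) + l1(Omega + D) - l1(Omega)
    alpha = 1.0
    for _ in range(max_halvings):
        cand = Omega + alpha * D
        try:
            np.linalg.cholesky(cand)        # positive definiteness check
        except np.linalg.LinAlgError:
            alpha *= beta
            continue
        if f(cand) <= f(Omega) + alpha * sigma * delta:
            return alpha
        alpha *= beta
    return alpha

# toy objective with S = I, M = 0, Xi = 0, started away from its optimum
S = np.eye(2)
fobj = lambda A: -np.linalg.slogdet(A)[1] + np.trace(S @ A)
Omega = 2.0 * np.eye(2)
grad = S - np.linalg.inv(Omega)             # gradient of the smooth part
print(armijo_step(fobj, grad, Omega, -np.eye(2), np.zeros((2, 2))))  # 1.0
```

In this toy case the full Newton-like step (α = 1) already satisfies the sufficient-decrease condition, which is the behavior property P3 predicts near the optimum.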
S2.3 Thresholding to Decide the Active Sets
Similar to the QUIC procedure, our algorithm does not need to update every ω_{k,k'} in each
iteration. We instead follow Hsieh et al. (2011) and only update those parameters exceeding
a certain threshold. More specifically, we partition the parameters ω_{k,k'} into a fixed set
S_{\text{fixed}}, containing those parameters falling below the threshold, and a free set S_{\text{free}}, containing
those parameters exceeding the threshold. That is,
\begin{align}
\omega_{k,k'} \in S_{\text{fixed}} \text{ if } |\nabla_{k,k'}g(\Omega)| \le \xi_{k,k'} \text{ and } \omega_{k,k'} = 0; \qquad \omega_{k,k'} \in S_{\text{free}} \text{ otherwise}.
\tag{S22}
\end{align}
We can determine the free set S_{\text{free}} using the minimum-norm sub-gradient \operatorname{grad}^{S}_{k,k'}f(\Omega), which
is defined in Definition 2 of Hsieh et al. (2011). In our case \nabla g(\Omega) = S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1},
so the minimum-norm sub-gradient is
\begin{align}
\operatorname{grad}^{S}_{k,k'}f(\Omega) =
\begin{cases}
\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'} + \xi_{k,k'} & \text{if } \omega_{k,k'} > 0\\
\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'} - \xi_{k,k'} & \text{if } \omega_{k,k'} < 0\\
\operatorname{sign}\left(\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'}\right)\left[\left|\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'}\right| - \xi_{k,k'}\right]_{+} & \text{if } \omega_{k,k'} = 0.
\end{cases}
\tag{S23}
\end{align}
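The three cases of (S23) can be computed in a vectorized way; a minimal sketch for small dense matrices (hypothetical helper name, illustrative inputs):

```python
import numpy as np

def min_norm_subgrad(Omega, S, M, Xi):
    """Minimum-norm subgradient (S23); its zero entries identify the fixed set."""
    W = np.linalg.inv(Omega)
    G = S - W - W @ M @ W                     # gradient of the smooth part g
    thresholded = np.sign(G) * np.maximum(np.abs(G) - Xi, 0.0)
    return np.where(Omega > 0, G + Xi, np.where(Omega < 0, G - Xi, thresholded))

Omega = np.eye(2)
S = np.array([[2.0, 0.3], [0.3, 2.0]])
M = np.zeros((2, 2))
Xi = 0.5 * np.ones((2, 2))
sub = min_norm_subgrad(Omega, S, M, Xi)
print(sub[0, 1])   # 0.0 -> the (1, 2) entry belongs to the fixed set
print(sub[0, 0])   # 1.5 -> the diagonal entry is free
```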
Note that the minimum-norm sub-gradient is identically zero on the fixed set. Thus, following
Lemma 4 of Hsieh et al. (2011), the elements of the fixed set do not change during our
coordinate descent procedure. It suffices, then, to compute the Newton direction only on
the free set and to update only those parameters.
S2.4 Unique minimizer
In this subsection, we show that the CGLASSO problem admits a unique minimizer. Our
proof largely follows the proofs of Lemma 3 and Theorem 1 of Hsieh et al. (2011) but makes
suitable modifications to account for the extra tr(MΩ^{-1}) term in the CGLASSO objective.

Theorem S1 (Unique minimizer). There is a unique global minimum for the CGLASSO
problem (S12).
We first show that the entire sequence of iterates {Ω_t} lies in a particular compact level set. To
this end, let
\begin{align}
U = \left\{\Omega \in S^{q}_{++} : f(\Omega) \le f(\Omega_{0})\right\}.
\tag{S24}
\end{align}
To see that all iterates lie in U, we need to check that the line search condition in Equation
(S21) has δ < 0. By directly applying Lemma 2 of Hsieh et al. (2011) to g(Ω), we have
\[
\delta \le -\operatorname{vec}(D^{*})^{\top}\nabla^{2}g(\Omega)\operatorname{vec}(D^{*})
\]
where D* is the Newton direction. Since g(Ω) in Equation (S13) is strictly convex, \nabla^{2}g(\Omega) is positive
definite, so the objective value f(Ω_t) is always decreasing.

Now we need to check that the level set is actually contained in a compact set, by suitably
adapting Lemma 3 of Hsieh et al. (2011).
Lemma S1. Assume that the off-diagonal elements of Ξ and the diagonal elements of S are
positive. Then the level set U defined in (S24) is contained in the set {mI ⪯ Ω ⪯ NI} for
some constants m, N > 0.
Proof. We begin by showing that the largest eigenvalue of Ω is bounded by some constant
that does not depend on Ω. Recall that S and M are positive semi-definite. Since Ω is
positive definite, we have tr(SΩ) + tr(MΩ^{-1}) ≥ 0 and ‖Ω‖_{1,Ξ} + tr(MΩ^{-1}) ≥ 0.
Therefore we have
\begin{align}
f(\Omega_{0}) > f(\Omega) \ge -\log|\Omega| + \|\Omega\|_{1,\Xi}, \qquad
f(\Omega_{0}) > f(\Omega) \ge -\log|\Omega| + \operatorname{tr}(S\Omega).
\tag{S25}
\end{align}
Since ‖Ω‖₂ is the largest eigenvalue of Ω, we have log|Ω| ≤ q log(‖Ω‖₂).
Using the assumption that the off-diagonal entries of Ξ are larger than some positive number ξ,
we know that
\begin{align}
\xi\sum_{k \neq k'}|\Omega_{k,k'}| \le \|\Omega\|_{1,\Xi} \le f(\Omega_{0}) + q\log(\|\Omega\|_{2}).
\tag{S26}
\end{align}
Similarly, we have
\begin{align}
\operatorname{tr}(S\Omega) \le f(\Omega_{0}) + q\log(\|\Omega\|_{2}).
\tag{S27}
\end{align}
Let α = \min_{k}(S_{k,k}) and β = \max_{k \neq k'}|S_{k,k'}|. We can split tr(SΩ) into two parts, which
can be further lower bounded:
\begin{align}
\operatorname{tr}(S\Omega) = \sum_{k}S_{k,k}\Omega_{k,k} + \sum_{k \neq k'}S_{k,k'}\Omega_{k,k'} \ge \alpha\operatorname{tr}(\Omega) - \beta\sum_{k \neq k'}|\Omega_{k,k'}|.
\tag{S28}
\end{align}
Since ‖Ω‖₂ ≤ tr(Ω), by using Equation (S28) we have
\begin{align}
\alpha\|\Omega\|_{2} \le \alpha\operatorname{tr}(\Omega) \le \operatorname{tr}(S\Omega) + \beta\sum_{k \neq k'}|\Omega_{k,k'}|.
\tag{S29}
\end{align}
By combining Equations (S26), (S27), and (S29), we conclude that
\begin{align}
\alpha\|\Omega\|_{2} \le (1 + \beta/\xi)\left(f(\Omega_{0}) + q\log(\|\Omega\|_{2})\right).
\tag{S30}
\end{align}
The left-hand side grows linearly in ‖Ω‖₂, much faster than the logarithmic growth of the
right-hand side. Thus ‖Ω‖₂ can be bounded by a constant N depending only on the value of
f(Ω₀), α, β, and ξ.
We now consider the smallest eigenvalue, denoted by a. We use the upper bound N on the
other eigenvalues to bound the determinant. Using the fact that f(Ω) is always decreasing
across iterations, we have
\begin{align}
f(\Omega_{0}) > f(\Omega) > -\log|\Omega| \ge -\log(a) - (q-1)\log(N).
\tag{S31}
\end{align}
Thus m = e^{-f(\Omega_{0})}N^{-q+1} is a lower bound for the smallest eigenvalue a.
We are now ready to prove Theorem S1 by showing that the objective function is strongly
convex on a compact set.

Proof. By Lemma S1, the level set U contains all iterates produced by cgQUIC and is itself
contained in the compact set {mI ⪯ Ω ⪯ NI}. By the Weierstrass extreme value theorem,
the continuous function f(Ω) in (S13) attains its minimum on this set.

Further, the objective function is strongly convex in its smooth part: tr(MΩ^{-1}) and tr(SΩ)
are convex and -log|Ω| is strongly convex on this set. Since tr(MΩ^{-1}) is convex, the Hessian
of the smooth part has the same lower bound as in Theorem 1 of Hsieh et al. (2011). By
following the argument in the proof of Theorem 1 of Hsieh et al. (2011), we can show that
the objective function f(Ω) is strongly convex on the compact set {mI ⪯ Ω ⪯ NI}, and thus
has a unique minimizer.
We can further show that the cgQUIC procedure converges to the unique minimizer, using
the general results on quadratic approximation methods studied in Hsieh et al. (2011).

Theorem S2 (Convergence). cgQUIC converges to the global optimum.

Proof. cgQUIC is an instance of the quadratic approximation method investigated in Section
4.1 of Hsieh et al. (2011), with a strongly convex smooth part g(Ω) in (S13). Convergence to
the global optimum follows from their Theorem 2.
S3 Synthetic experiment results
We now present the remaining results from our simulation experiments. These results are
qualitatively similar to those from the (n, p, q) = (100, 10, 10) setting presented in the main
text. Generally speaking, in terms of support recovery, the methods that deployed a single
fixed penalty (cgLASSO and CAR) displayed higher sensitivity but lower precision than both
cgSSL-DPE and cgSSL-DCPE. The only exception was when Ω was dense. Furthermore,
methods with adaptive penalties (both cgSSL procedures and CAR-A) tended to return
fewer non-zero estimates than the fixed-penalty methods; most of these non-zero estimates
were in fact true positives. Across all settings of (n, p, q), cgSSL-DPE makes virtually no
false positive identifications in the support of Ψ. In terms of parameter estimation, the fixed-penalty
methods tended to have larger Frobenius error in estimating both Ψ and Ω than
the cgSSL. Note that cgLASSO uses ten-fold cross-validation to set the two penalty levels.
Even with a parallel implementation and warm starts, the full cgLASSO procedure did not
converge after 72 hours in the n = 400 setting.
Figure S4: Sensitivity, specificity, and Frobenius loss of parameter estimates when p = 10, q = 10, n = 100.
Figure S5: Sensitivity, specificity, and Frobenius loss of parameter estimates when p = 20, q = 30, n = 100.
Figure S6: Sensitivity, specificity, and Frobenius loss of parameter estimates when p = 10, q = 30, n = 400. cgLASSO was not able to finish with dense Ω within 72 hours, so we omit its results.
Table S2: Sensitivity, precision, and Frobenius error for Ψ and Ω when (n, p, q) = (100, 20, 30) for each specification of Ω. For each choice of Ω, the best performance is bold-faced.

| Method | Ψ SEN | Ψ PREC | Ψ FROB | Ω SEN | Ω PREC | Ω FROB |
|---|---|---|---|---|---|---|
| *AR(1) model* | | | | | | |
| cgLASSO | 0.94 (0.09) | 0.3 (0.14) | 0.21 (0.21) | 0.74 (0.42) | 0.48 (0.36) | 111.66 (66.15) |
| CAR | 0.54 (0.05) | 0.39 (0.03) | 0.11 (0.02) | 1 (0.01) | 0.21 (0.01) | 11.15 (2.35) |
| CAR-A | 0.69 (0.03) | 0.69 (0.04) | 0.04 (0.01) | 1 (0) | 0.74 (0.06) | 15.07 (4.99) |
| cgSSL-dcpe | 0.66 (0.02) | 0.82 (0.07) | 0.08 (0.02) | 0.94 (0.04) | 0.68 (0.07) | 34.1 (19.97) |
| cgSSL-dpe | 0.69 (0.02) | 1 (0.01) | 0.02 (0) | 1 (0) | 0.82 (0.07) | 4.87 (1.73) |
| *AR(2) model* | | | | | | |
| cgLASSO | 0.94 (0.07) | 0.3 (0.13) | 0.17 (0.08) | 0.98 (0.12) | 0.18 (0.12) | 14.44 (6.93) |
| CAR | 0.42 (0.04) | 0.37 (0.03) | 0.15 (0.02) | 0.38 (0.08) | 0.25 (0.04) | 15.8 (2.68) |
| CAR-A | 0.64 (0.04) | 0.76 (0.04) | 0.07 (0.02) | 0.91 (0.05) | 0.87 (0.04) | 3.16 (1.15) |
| cgSSL-dcpe | 0.73 (0.02) | 0.7 (0.06) | 0.05 (0.01) | 0.81 (0.09) | 0.32 (0.05) | 14.5 (4.21) |
| cgSSL-dpe | 0.72 (0.02) | 0.99 (0.01) | 0.01 (0) | 1 (0) | 0.48 (0.03) | 1.04 (0.35) |
| *Block model* | | | | | | |
| cgLASSO | 0.92 (0.07) | 0.4 (0.19) | 0.51 (0.46) | 0.62 (0.46) | 0.93 (0.06) | 27.48 (6.01) |
| CAR | 0.51 (0.05) | 0.4 (0.03) | 0.12 (0.02) | 0.53 (0.04) | 0.68 (0.03) | 12.96 (2.49) |
| CAR-A | 0.66 (0.04) | 0.64 (0.04) | 0.06 (0.01) | 0.47 (0.03) | 0.96 (0.02) | 23.14 (4.16) |
| cgSSL-dcpe | 0.82 (0.1) | 0.29 (0.19) | 0.68 (0.19) | 0.1 (0.28) | 0.88 (0.1) | 30.22 (2.11) |
| cgSSL-dpe | 0.61 (0.02) | 0.99 (0.02) | 0.07 (0.02) | 0.66 (0.36) | 0.9 (0.05) | 30.39 (4.02) |
| *Star model* | | | | | | |
| cgLASSO | 0.91 (0.03) | 0.45 (0.04) | 0.08 (0.02) | 0.7 (0.18) | 0.31 (0.14) | 6.45 (3.27) |
| CAR | 0.45 (0.06) | 0.41 (0.04) | 0.14 (0.02) | 0.32 (0.09) | 0.12 (0.03) | 4.36 (1.07) |
| CAR-A | 0.69 (0.05) | 0.68 (0.03) | 0.06 (0.02) | 0.31 (0.09) | 0.35 (0.09) | 2.83 (0.69) |
| cgSSL-dcpe | 0.77 (0.02) | 0.94 (0.03) | 0.01 (0) | 0.61 (0.21) | 0.57 (0.08) | 0.6 (0.21) |
| cgSSL-dpe | 0.73 (0.02) | 1 (0) | 0.01 (0) | 0.83 (0.08) | 0.54 (0.1) | 0.89 (0.32) |
| *Dense model* | | | | | | |
| cgLASSO | 0.89 (0.04) | 0.43 (0.05) | 0.07 (0.03) | 0.34 (0.41) | 1 (0) | 712.46 (354.87) |
| CAR | 0.49 (0.06) | 0.39 (0.03) | 0.13 (0.02) | 0.05 (0.01) | 1 (0) | 914.47 (6.45) |
| CAR-A | 0.7 (0.04) | 0.64 (0.04) | 0.05 (0.01) | 0.01 (0.01) | 1 (0) | 897.91 (5.95) |
| cgSSL-dcpe | 0.77 (0.01) | 0.99 (0.01) | 0.01 (0) | 0 (0.01) | 1 (0) | 900 (0.01) |
| cgSSL-dpe | 0.72 (0.02) | 1 (0.01) | 0.01 (0.01) | 0.03 (0.03) | 1 (0) | 901.45 (2.97) |
Table S3: Sensitivity, precision, and Frobenius error for Ψ and Ω when (n, p, q) = (400, 100, 30) for each specification of Ω. For each choice of Ω, the best performance is bold-faced. The cgLASSO method was not able to finish within 72 hours for the dense model.

| Method | Ψ SEN | Ψ PREC | Ψ FROB | Ω SEN | Ω PREC | Ω FROB |
|---|---|---|---|---|---|---|
| *AR(1) model* | | | | | | |
| cgLASSO | 1 (0) | 0.2 (0) | 0.07 (0.11) | 0.94 (0.23) | 0.46 (0.14) | 27.98 (40.98) |
| CAR | 0.82 (0.02) | 0.46 (0.01) | 0.02 (0) | 1 (0) | 0.27 (0.02) | 2.23 (0.51) |
| CAR-A | 0.86 (0.01) | 0.73 (0.02) | 0.01 (0) | 1 (0) | 0.89 (0.05) | 7.32 (1.48) |
| cgSSL-dcpe | 0.74 (0.01) | 0.89 (0.03) | 0.07 (0) | 1 (0) | 0.46 (0.03) | 70.84 (3.72) |
| cgSSL-dpe | 0.87 (0.01) | 0.99 (0) | 0 (0) | 1 (0) | 0.78 (0.06) | 3.42 (0.8) |
| *AR(2) model* | | | | | | |
| cgLASSO | 1 (0) | 0.2 (0) | 0.15 (0.05) | 0.79 (0.23) | 0.63 (0.22) | 10.82 (4.04) |
| CAR | 0.85 (0.02) | 0.5 (0.01) | 0.01 (0) | 0.98 (0.02) | 0.49 (0.03) | 0.38 (0.15) |
| CAR-A | 0.89 (0.01) | 0.77 (0.02) | 0.01 (0) | 1 (0.01) | 0.94 (0.03) | 1.22 (0.24) |
| cgSSL-dcpe | 0.87 (0.01) | 0.79 (0.14) | 0.04 (0.02) | 1 (0) | 0.31 (0.05) | 10.5 (6.66) |
| cgSSL-dpe | 0.92 (0) | 1 (0) | 0 (0) | 1 (0) | 0.47 (0.03) | 0.26 (0.08) |
| *Block model* | | | | | | |
| cgLASSO | 1 (0) | 0.2 (0) | 0.44 (0.25) | 0.87 (0.21) | 0.97 (0.11) | 10.05 (11.21) |
| CAR | 0.84 (0.02) | 0.46 (0.01) | 0.02 (0) | 0.71 (0.03) | 0.76 (0.02) | 3.36 (0.24) |
| CAR-A | 0.88 (0.01) | 0.7 (0.02) | 0.01 (0) | 0.75 (0.02) | 0.99 (0.01) | 4.13 (0.5) |
| cgSSL-dcpe | 0.9 (0.01) | 0.22 (0) | 0.82 (0.04) | 0 (0.01) | 0.64 (NA) | 29.51 (0.24) |
| cgSSL-dpe | 0.86 (0.01) | 0.99 (0.01) | 0.01 (0) | 0.98 (0.02) | 0.98 (0.01) | 1.44 (0.43) |
| *Star model* | | | | | | |
| cgLASSO | 0.93 (0) | 0.83 (0.02) | 0.01 (0) | 0.53 (0.41) | 0.59 (0.45) | 4.68 (3.43) |
| CAR | 0.89 (0.01) | 0.48 (0.01) | 0.01 (0) | 0.73 (0.09) | 0.25 (0.03) | 0.55 (0.1) |
| CAR-A | 0.91 (0.01) | 0.7 (0.02) | 0.01 (0) | 0.87 (0.07) | 0.74 (0.06) | 1.07 (0.18) |
| cgSSL-dcpe | 0.88 (0) | 1 (0) | 0 (0) | 1 (0) | 0.89 (0.05) | 0.29 (0.08) |
| cgSSL-dpe | 0.89 (0) | 1 (0) | 0 (0) | 1 (0) | 0.9 (0.05) | 0.27 (0.06) |
| *Dense model* | | | | | | |
| cgLASSO | | | | | | |
| CAR | 0.87 (0.02) | 0.39 (0.01) | 0.01 (0) | 0 (0) | NaN (NA) | 964.24 (9.25) |
| CAR-A | 0.88 (0.01) | 0.52 (0.01) | 0.01 (0) | 0 (0) | NaN (NA) | 964.08 (9.71) |
| cgSSL-dcpe | 0.87 (0.01) | 0.94 (0.02) | 0.03 (0) | 0.21 (0.02) | 1 (0) | 913.81 (2.33) |
| cgSSL-dpe | 0.86 (0.01) | 0.98 (0.01) | 0.04 (0.01) | 0.26 (0.01) | 1 (0) | 918.35 (4.57) |
S4 Preprocessing for real data experiment
To conduct our reanalysis of Claesson et al. (2012)'s gut microbiome data, we preprocessed
the raw 16S rRNA-seq data following the workflow provided by the MG-RAST server (Keegan
et al., 2016). We first "annotated" the sequences to obtain genus counts (i.e., the number of
segments belonging to each genus). The annotation process compares the rRNA segments
detected during sequencing to the reference sequence of each genus of microbes and counts
the number of rRNA segments matching each genus. We used the MG-RAST server's default
tuning parameters during the annotation process. That is, we set the e-value to 5 and annotated
with 60% identity, an alignment length of 15 bp, and a minimal abundance of 10 reads.
Following standard practice for analyzing microbiome data, we transformed raw counts into
relative abundances. We selected genera with more than 0.5% relative abundance in more than
50 samples as the focal genera and aggregated all other genera into a reference group. We
further took log-odds (with respect to the reference group described above) to stabilize
the variances (Aitchison, 1982) in order to fit our normal model.
S5 Proofs of posterior contraction for cgSSL
This section provides details on the posterior contraction results for the cgSSL. Our proof
is inspired by Ning et al. (2020) and Bai et al. (2020). We first show contraction in
log-affinity by verifying the Kullback-Leibler (KL) and testing conditions following Ghosal and
van der Vaart (2017). We then use the log-affinity results to show recovery of the parameters.
To establish our results, we work with a slightly modified prior on Ω that has density
\begin{align}
f_{\Omega}(\Omega) \propto{}& \prod_{k>k'}\left[(1-\eta)\frac{\xi_{0}}{2}\exp\left(-\xi_{0}|\omega_{k,k'}|\right) + \eta\frac{\xi_{1}}{2}\exp\left(-\xi_{1}|\omega_{k,k'}|\right)\right] \nonumber\\
&\times \prod_{k}\xi\exp\left(-\xi\omega_{k,k}\right) \times \mathbb{1}\left(\Omega \succeq \tau I\right)
\tag{S32}
\end{align}
with the prior on Ψ given by
\begin{align}
f_{\Psi}(\Psi) = \prod_{j,k}\left[(1-\theta)\frac{\lambda_{0}}{2}\exp\left(-\lambda_{0}|\psi_{j,k}|\right) + \theta\frac{\lambda_{1}}{2}\exp\left(-\lambda_{1}|\psi_{j,k}|\right)\right],
\tag{S33}
\end{align}
where 0 < τ < 1/b₂. This way, τ is less than the lower bound on the smallest eigenvalue of
the true precision matrix Ω₀.
S5.1 The Kullback-Leibler condition
We need to verify that our prior places enough probability in small neighborhoods around
each of the possible values of the true parameters. These neighborhoods are defined in a KL
sense.
Lemma S2 (KL conditions). Let \epsilon_{n} = \sqrt{\max\{p, q, s^{\Omega}_{0}, s^{\Psi}_{0}\}\log(\max\{p, q\})/n}. Then for all
true parameters (\Psi_{0}, \Omega_{0}) we have
\[
-\log \Pi\left(\left\{(\Psi, \Omega) : K(f_{0}, f) \le n\epsilon_{n}^{2},\ V(f_{0}, f) \le n\epsilon_{n}^{2}\right\}\right) \le C_{1}n\epsilon_{n}^{2}.
\]
Further, let E_{n} be the event
\[
E_{n} = \left\{Y : \iint f/f_{0}\, d\Pi(\Psi)\, d\Pi(\Omega) \ge e^{-C_{1}n\epsilon_{n}^{2}}\right\}.
\]
Then for all (\Psi_{0}, \Omega_{0}), we have P_{0}(E_{n}^{c}) \to 0 as n \to \infty.
The last assertion that P_{0}(E_{n}^{c}) \to 0 follows from Lemma 8.1 of Ghosal and van der Vaart
(2017), so we now focus on establishing the first assertion of the lemma. To verify this
condition we need to bound the prior mass of certain events A. However, the truncation of
the prior on Ω makes computing these masses intractable. To overcome this, we first bound
the prior probability of events of the form A ∩ {Ω ⪰ τI} by observing that the prior on Ω can be
viewed as a particular conditional distribution.
Specifically, let \tilde{\Pi} be the untruncated spike-and-slab LASSO prior with density
\[
\tilde{f}(\Omega) = \prod_{k>k'}\left[(1-\eta)\frac{\xi_{0}}{2}\exp\left(-\xi_{0}|\omega_{k,k'}|\right) + \eta\frac{\xi_{1}}{2}\exp\left(-\xi_{1}|\omega_{k,k'}|\right)\right] \times \prod_{k}\xi\exp\left(-\xi\omega_{k,k}\right).
\]
The following lemma shows that we can bound Π probabilities using \tilde{\Pi} probabilities.

Lemma S3 (Bounds of the graphical prior). Let \tilde{\Pi} be the untruncated version of the prior
on Ω. Then for every event A and all large enough n, there is a number R > 0 that does not
depend on n such that
\begin{align}
\tilde{\Pi}(\Omega \succeq \tau I \mid A)\,\tilde{\Pi}(A) \le \Pi_{\Omega}\left(A \cap \{\Omega \succeq \tau I\}\right) \le \exp\left(2\xi Q - \log(R)\right)\tilde{\Pi}(A)
\tag{S34}
\end{align}
where Q = q(q-1)/2 is the total number of free off-diagonal entries in Ω.
Proof. Consider an event of the form A \cap \{\Omega \succeq \tau I\} \subset \mathbb{R}^{q \times q}. The prior mass \Pi_{\Omega}(A \cap \{\Omega \succeq \tau I\})
can be viewed as a conditional probability:
\begin{align}
\Pi_{\Omega}\left(A \cap \{\Omega \succeq \tau I\}\right) = \tilde{\Pi}(A \mid \Omega \succeq \tau I) = \frac{\tilde{\Pi}(\Omega \succeq \tau I \mid A)\,\tilde{\Pi}(A)}{\tilde{\Pi}(\Omega \succeq \tau I)}.
\tag{S35}
\end{align}
The lower bound in (S34) follows because the denominator is bounded from above by 1.
For the upper bound, we first observe that
$$\Pi_\Omega(A \cap \{\Omega \succ \tau I\}) = \tilde{\Pi}(A \mid \Omega \succ \tau I) = \frac{\tilde{\Pi}(\Omega \succ \tau I \mid A)\,\tilde{\Pi}(A)}{\tilde{\Pi}(\Omega \succ \tau I)} \le \left(\tilde{\Pi}(\Omega \succ \tau I)\right)^{-1}\tilde{\Pi}(A) \quad (S36)$$
To upper bound the probability in Equation (S35), we find a lower bound on the denominator $\tilde{\Pi}(\Omega \succ \tau I)$. To this end, let
$$\mathcal{G} = \left\{ \Omega : \omega_{k,k} > q-1,\ |\omega_{k,k'}| \le 1 - \frac{\tau}{q-1} \text{ for } k' \ne k \right\}$$
and consider an $\Omega \in \mathcal{G}$. Since all of $\Omega$'s eigenvalues are real, they must each be contained in at least one Gershgorin disc. Consider the $k$th Gershgorin disc, whose intersection with the real line is an interval centered at $\omega_{k,k}$ with half-width $\sum_{k' \ne k}|\omega_{k,k'}|$. Any eigenvalue of $\Omega$ that lies in this disc must be greater than
$$\omega_{k,k} - \sum_{k' \ne k}|\omega_{k,k'}| > (q-1) - (q-1-\tau) = \tau.$$
Thus, we have $\mathcal{G} \subset \{\Omega \succ \tau I\}$.
Since the entries of $\Omega$ are independent under $\tilde{\Pi}$, we compute
$$\begin{aligned}
\tilde{\Pi}(\mathcal{G}) &\ge \prod_k \int_{q-1}^\infty \xi\exp(-\xi\omega_{k,k})\,d\omega_{k,k} \times (1-\eta)^Q \prod_{k>k'} \int_{|\omega_{k,k'}| \le 1-\frac{\tau}{q-1}} \frac{\xi_0}{2}\exp(-\xi_0|\omega_{k,k'}|)\,d\omega_{k,k'} \\
&\ge \exp(-2\xi Q)(1-\eta)^Q \left[ 1 - \frac{E|\omega_{k,k'}|}{1-\frac{\tau}{q-1}} \right]^Q = \exp(-2\xi Q)(1-\eta)^Q \left[ 1 - \frac{1}{\xi_0\left(1-\frac{\tau}{q-1}\right)} \right]^Q \\
&\ge \exp(-2\xi Q)\left[ 1 - \frac{1}{1+K_1 Q^{2+a}} \right]^Q \left[ 1 - \frac{1}{K_3 Q^{2+b}(1-\tau)} \right]^Q \\
&\ge \exp(-2\xi Q + \log(R)),
\end{aligned} \quad (S37)$$
where $R > 0$ does not depend on $n$. Note that the first inequality holds by ignoring the contribution to the probability from the slab distribution. The second inequality is Markov's inequality, and the third inequality follows from our assumptions about how $\xi_0$ and $\eta$ are tuned.
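As a quick numerical illustration of the Gershgorin argument above, the sketch below draws a matrix from a set of this form and confirms that its smallest eigenvalue exceeds $\tau$; the dimension and threshold are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
q, tau = 6, 0.5  # hypothetical dimension and threshold

# Draw a symmetric matrix from the set G: diagonal entries above q - 1,
# off-diagonal entries at most 1 - tau/(q-1) in absolute value.
bound = 1.0 - tau / (q - 1)
Omega = rng.uniform(-bound, bound, size=(q, q))
Omega = (Omega + Omega.T) / 2.0
np.fill_diagonal(Omega, (q - 1) + rng.uniform(0.0, 1.0, size=q))

# Gershgorin: every eigenvalue exceeds omega_kk - sum_{k' != k}|omega_kk'|,
# which is greater than (q-1) - (q-1-tau) = tau on G.
min_eig = np.linalg.eigvalsh(Omega).min()
```

The minimum eigenvalue reported by `eigvalsh` is strictly larger than `tau`, consistent with $\mathcal{G} \subset \{\Omega \succ \tau I\}$.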
Let $S_0^\Psi$ and $S_0^\Omega$ respectively denote the supports of $\Psi_0$ and $\Omega_0$. Similarly, let $s_0^\Psi$ be the number of true non-zero entries in $\Psi_0$ and let $s_0^\Omega$ be the true number of non-zero off-diagonal entries in $\Omega_0$.
The KL divergence between a Gaussian chain graph model with parameters $(\Psi_0,\Omega_0)$ and one with parameters $(\Psi,\Omega)$ is
$$\frac{1}{n}K(f_0,f) = E_0\left[\log\frac{f_0}{f}\right] = \frac{1}{2}\left( \log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega) + \frac{1}{n}\sum_{i=1}^n \|\Omega^{1/2}(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})^\top X_i^\top\|_2^2 \right) \quad (S38)$$
The KL variance is:
$$\frac{1}{n}V(f_0,f) = \operatorname{Var}_0\left[\log\frac{f_0}{f}\right] = \frac{1}{2}\left( \operatorname{tr}((\Omega_0^{-1}\Omega)^2) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q \right) + \frac{1}{n}\sum_{i=1}^n \|\Omega_0^{-1/2}\Omega(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})^\top X_i^\top\|_2^2 \quad (S39)$$
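The matrix form of (S38) can be checked numerically against the standard per-observation Gaussian KL formula; the sketch below does this for a small, arbitrary instance (all dimensions and parameter values are hypothetical, chosen only to make the precision matrices well conditioned).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 40, 3, 2  # hypothetical small instance

X = rng.normal(size=(n, p))
Psi0, Psi = rng.normal(size=(p, q)), rng.normal(size=(p, q))
A0, A = rng.normal(size=(q, q)), rng.normal(size=(q, q))
Omega0 = A0 @ A0.T + q * np.eye(q)  # "true" precision
Omega = A @ A.T + q * np.eye(q)     # candidate precision

Omega0_inv, Omega_inv = np.linalg.inv(Omega0), np.linalg.inv(Omega)
D = Psi @ Omega_inv - Psi0 @ Omega0_inv  # difference of mean-coefficient matrices

# Right-hand side of (S38), with the quadratic term written via a Cholesky
# factor L (Omega = L L^T) so that sum_i ||Omega^{1/2} D^T X_i^T||^2 = ||X D L||_F^2.
L = np.linalg.cholesky(Omega)
_, logdet0 = np.linalg.slogdet(Omega0)
_, logdet1 = np.linalg.slogdet(Omega)
kl_formula = 0.5 * n * (logdet0 - logdet1 - q + np.trace(Omega0_inv @ Omega)) \
             + 0.5 * np.linalg.norm(X @ D @ L, "fro") ** 2

# Direct sum of per-observation KL divergences
# KL(N(X_i Psi0 Omega0^{-1}, Omega0^{-1}) || N(X_i Psi Omega^{-1}, Omega^{-1})).
kl_direct = 0.0
for i in range(n):
    diff = X[i] @ D
    kl_direct += 0.5 * (np.trace(Omega0_inv @ Omega) - q
                        + diff @ Omega @ diff + logdet0 - logdet1)
```

The two computations agree term by term, since $\operatorname{tr}(\Omega\Omega_0^{-1}) = \operatorname{tr}(\Omega_0^{-1}\Omega)$ and the quadratic form equals the squared norm in (S38).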
We need to lower bound the prior probability of the event
$$\{(\Psi,\Omega) : K(f_0,f) \le n\epsilon_n^2,\ V(f_0,f) \le n\epsilon_n^2\}$$
for large enough $n$.
We first obtain upper bounds on the average KL divergence and variance so that the mass of a simpler event can serve as a lower bound. To simplify the notation, we write $\Delta_\Psi = \Psi - \Psi_0$ and $\Delta_\Omega = \Omega - \Omega_0$. We observe that $\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1} = (\Delta_\Psi - \Psi_0\Omega_0^{-1}\Delta_\Omega)\Omega^{-1}$.
Using the fact that $\|A - B\|_2^2 \le (\|A\|_2 + \|B\|_2)^2 \le 2\|A\|_2^2 + 2\|B\|_2^2$ for any two matrices $A$ and $B$, we obtain a simple upper bound:
$$\begin{aligned}
\frac{1}{n}K(f_0,f) &= \frac{1}{2}\left( \log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega) + \frac{1}{n}\sum_{i=1}^n \|\Omega^{-1/2}\Delta_\Psi^\top X_i^\top - \Omega^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^\top X_i^\top\|_2^2 \right) \\
&\le \frac{1}{2}\left( \log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega) \right) + \frac{1}{n}\sum_{i=1}^n \|\Omega^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^\top X_i^\top\|_2^2 + \frac{1}{n}\sum_{i=1}^n \|\Omega^{-1/2}\Delta_\Psi^\top X_i^\top\|_2^2 \\
&= \frac{1}{2}\left( \log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega) \right) + \frac{1}{n}\|X\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega^{-1/2}\|_F^2 + \frac{1}{n}\|X\Delta_\Psi\Omega^{-1/2}\|_F^2
\end{aligned} \quad (S40)$$
The last line holds because $\Omega^{-1/2}\Delta_\Psi^\top X_i^\top$ is the transpose of the $i$th row of $X\Delta_\Psi\Omega^{-1/2}$, and similarly for the $\Delta_\Omega$ term.
Using the same inequality, we derive a similar upper bound for the average KL variance:
$$\begin{aligned}
\frac{1}{n}V(f_0,f) &= \frac{1}{2}\left( \operatorname{tr}((\Omega_0^{-1}\Omega)^2) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q \right) + \frac{1}{n}\sum_{i=1}^n \|\Omega_0^{-1/2}\Delta_\Psi^\top X_i^\top - \Omega_0^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^\top X_i^\top\|_2^2 \\
&\le \frac{1}{2}\left( \operatorname{tr}((\Omega_0^{-1}\Omega)^2) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q \right) + \frac{2}{n}\sum_{i=1}^n \|\Omega_0^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^\top X_i^\top\|_2^2 + \frac{2}{n}\sum_{i=1}^n \|\Omega_0^{-1/2}\Delta_\Psi^\top X_i^\top\|_2^2 \\
&= \frac{1}{2}\left( \operatorname{tr}((\Omega_0^{-1}\Omega)^2) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q \right) + \frac{2}{n}\|X\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega_0^{-1/2}\|_F^2 + \frac{2}{n}\|X\Delta_\Psi\Omega_0^{-1/2}\|_F^2
\end{aligned} \quad (S41)$$
Similar to Ning et al. (2020) and Bai et al. (2020), we find an event $\mathcal{A}_1$ involving only $\Delta_\Omega$ and an event $\mathcal{A}_2$ involving both $\Delta_\Omega$ and $\Delta_\Psi$ such that $(\mathcal{A}_1 \cap \{\Omega \succ \tau I\}) \cap \mathcal{A}_2$ is a subset of the event of interest $\{K/n \le \epsilon_n^2,\ V/n \le \epsilon_n^2\}$.
To this end, define
$$\begin{aligned}
\mathcal{A}_1 = &\left\{ \Omega : \frac{1}{2}\left( \operatorname{tr}((\Omega_0^{-1}\Omega)^2) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q \right) + \frac{2}{n}\|X\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega_0^{-1/2}\|_F^2 \le \epsilon_n^2/2 \right\} \\
&\cap \left\{ \Omega : \frac{1}{2}\left( \log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega) \right) + \frac{1}{n}\|X\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega^{-1/2}\|_F^2 \le \epsilon_n^2/2 \right\}
\end{aligned} \quad (S42)$$
and
$$\mathcal{A}_2 = \left\{ (\Omega,\Psi) : \frac{2}{n}\|X\Delta_\Psi\Omega_0^{-1/2}\|_F^2 \le \frac{\epsilon_n^2}{2},\ \frac{1}{n}\|X\Delta_\Psi\Omega^{-1/2}\|_F^2 \le \frac{\epsilon_n^2}{2} \right\} \quad (S43)$$
We separately bound the prior probabilities $\Pi(\mathcal{A}_1)$ and $\Pi(\mathcal{A}_2 \mid \mathcal{A}_1)$.
S5.1.1 Bounding the prior mass $\Pi(\mathcal{A}_1)$

The goal here is to find a suitable lower bound on the prior mass of $\mathcal{A}_1$. To do this, first consider the set
$$\mathcal{A}_1^\star = \left\{ 2\sum_{k>k'}|\omega_{0,k,k'} - \omega_{k,k'}| + \sum_k |\omega_{0,k,k} - \omega_{k,k}| \le \frac{\epsilon_n}{c_1\sqrt{p}} \right\}$$
where $c_1 > 0$ is a constant to be specified. Since the Frobenius norm is bounded by the vectorized L1 norm, we immediately conclude that
$$\mathcal{A}_1^\star \subset \left\{ \|\Omega_0 - \Omega\|_F \le \frac{\epsilon_n}{c_1\sqrt{p}} \right\}.$$
We now show that $\left\{ \|\Omega_0 - \Omega\|_F \le \frac{\epsilon_n}{c_1\sqrt{p}} \right\} \subset \mathcal{A}_1$.
Since the Frobenius norm bounds the L2 operator norm, if $\|\Omega_0 - \Omega\|_F \le \epsilon_n/(c_1\sqrt{p})$ then the absolute values of the eigenvalues of $\Omega - \Omega_0$ are bounded by $\epsilon_n/(c_1\sqrt{p})$. Further, because we have assumed $\Omega_0$ has bounded spectrum, the spectrum of $\Omega = \Omega_0 + \Omega - \Omega_0$ is bounded between $\lambda_{\min} - \epsilon_n/(c_1\sqrt{p})$ and $\lambda_{\max} + \epsilon_n/(c_1\sqrt{p})$. When $n$ is large enough, these quantities are further bounded by $\lambda_{\min}/2$ and $2\lambda_{\max}$. Thus, for $n$ large enough, if $\|\Omega_0 - \Omega\|_F \le \epsilon_n/(c_1\sqrt{p})$, then $\Omega$ has bounded spectrum.
Consequently, $\Omega^{-1/2}$ has bounded L2 operator norm. Using the fact that $\|AB\|_F \le \min(|||A|||_2\|B\|_F,\ |||B|||_2\|A\|_F)$, we have, for some constant $c_2$ not depending on $n$,
$$\begin{aligned}
\frac{2}{n}\|X\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega_0^{-1/2}\|_F^2 &\le \frac{2}{n}|||X\Psi_0|||_2^2\,\|\Omega_0^{-1}\Delta_\Omega\Omega_0^{-1/2}\|_F^2 \\
&\le \frac{2}{n}\|X\|_F^2\,|||\Psi_0|||_2^2\,\|\Omega_0^{-1}\Delta_\Omega\Omega_0^{-1/2}\|_F^2 \\
&\le p c_2^2 \|\Delta_\Omega\|_F^2,
\end{aligned}$$
where we have used the fact that $\|X\|_F = \sqrt{np}$. Thus $\|\Delta_\Omega\|_F \le \frac{\epsilon_n}{2c_2\sqrt{p}}$ implies
$$\frac{2}{n}\|X\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega_0^{-1/2}\|_F^2 \le \epsilon_n^2/4.$$
Similarly, for some constant $c_3$, we have that
$$\frac{1}{n}\|X\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega^{-1/2}\|_F^2 \le \frac{1}{n}\|X\|_F^2\,|||\Psi_0|||_2^2\,\|\Omega_0^{-1}\Delta_\Omega\Omega^{-1/2}\|_F^2 \le p c_3^2\|\Delta_\Omega\|_F^2.$$
Thus $\|\Delta_\Omega\|_F \le \frac{\epsilon_n}{2c_3\sqrt{p}}$ implies $\frac{1}{n}\|X\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega^{-1/2}\|_F^2 \le \epsilon_n^2/4$.
Using an argument from Ning et al. (2020), $\|\Delta_\Omega\|_F \le \frac{\epsilon_n}{2b_2\sqrt{p}} \le \frac{\epsilon_n}{2b_2}$ implies the following two inequalities:
$$\frac{1}{2}\left( \operatorname{tr}((\Omega_0^{-1}\Omega)^2) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q \right) \le \epsilon_n^2/4$$
$$\frac{1}{2}\left( \log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega) \right) \le \epsilon_n^2/4.$$
Thus, by taking $c_1 = 2\max\{c_2, c_3, b_2\}$, we can conclude that $\{\|\Omega_0 - \Omega\|_F \le \frac{\epsilon_n}{c_1\sqrt{p}}\} \subset \mathcal{A}_1$, and hence $\mathcal{A}_1^\star \subset \mathcal{A}_1$.
Since $\mathcal{A}_1^\star \subset \{\Omega : \|\Omega_0 - \Omega\|_F \le \epsilon_n/(c_1\sqrt{p})\}$, we know that $\tilde{\Pi}(\Omega \succ \tau I \mid \mathcal{A}_1^\star) = 1$ for large enough $n$. We can therefore lower bound $\Pi(\mathcal{A}_1)$ by $\Pi(\mathcal{A}_1^\star \cap \{\Omega \succ \tau I\})$. Instead of calculating the latter probability directly, we can lower bound it by observing
directly, we can lower bound it by observing
2X
k>k0|ω0,k,k0−ωk,k0|+X
k|ω0,k,k −ωk,k |
=2 X
(k,k0)∈SΩ
0
|ω0,k,k0−ωk,k0|+ 2 X
(k,k0)∈(SΩ
0)c|ωk,k0|+X
k|ω0,k,k −ωk,k |.
58
Consider the following events:
$$\mathcal{B}_1 = \left\{ \sum_{(k,k')\in S_0^\Omega} |\omega_{0,k,k'} - \omega_{k,k'}| \le \frac{\epsilon_n}{6c_1\sqrt{p}} \right\}$$
$$\mathcal{B}_2 = \left\{ \sum_{(k,k')\in (S_0^\Omega)^c} |\omega_{k,k'}| \le \frac{\epsilon_n}{6c_1\sqrt{p}} \right\}$$
$$\mathcal{B}_3 = \left\{ \sum_k |\omega_{0,k,k} - \omega_{k,k}| \le \frac{\epsilon_n}{3c_1\sqrt{p}} \right\}$$
Let $\mathcal{B} = \bigcap_{i=1}^3 \mathcal{B}_i \subset \mathcal{A}_1^\star \subset \mathcal{A}_1$. Since the prior probability of $\mathcal{B}$ lower bounds $\Pi(\mathcal{A}_1)$, we now focus on estimating $\tilde{\Pi}(\mathcal{B})$. Recall that the untruncated prior $\tilde{\Pi}$ is separable. Consequently,
$$\Pi(\mathcal{A}_1 \cap \{\Omega \succ \tau I\}) \ge \tilde{\Pi}(\mathcal{A}_1^\star) \ge \tilde{\Pi}(\mathcal{B}) = \prod_{i=1}^3 \tilde{\Pi}(\mathcal{B}_i)$$
We first bound the probability of $\mathcal{B}_1$. Note that we can use only the slab part of the prior to bound this probability. A similar technique was used by Bai et al. (2020) (specifically in their Equation D.18) and by Ročková and George (2018). Specifically, we have
$$\begin{aligned}
\tilde{\Pi}(\mathcal{B}_1) &= \int_{\mathcal{B}_1} \prod_{(k,k')\in S_0^\Omega} \pi(\omega_{k,k'} \mid \eta)\,d\mu \\
&\ge \prod_{(k,k')\in S_0^\Omega} \int_{|\omega_{0,k,k'} - \omega_{k,k'}| \le \frac{\epsilon_n}{6 s_0^\Omega c_1\sqrt{p}}} \pi(\omega_{k,k'} \mid \eta)\,d\omega_{k,k'} \\
&\ge \eta^{s_0^\Omega} \prod_{(k,k')\in S_0^\Omega} \int_{|\omega_{0,k,k'} - \omega_{k,k'}| \le \frac{\epsilon_n}{6 s_0^\Omega c_1\sqrt{p}}} \frac{\xi_1}{2}\exp(-\xi_1|\omega_{k,k'}|)\,d\omega_{k,k'} \\
&\ge \eta^{s_0^\Omega} \exp\Big(-\xi_1 \sum_{(k,k')\in S_0^\Omega} |\omega_{0,k,k'}|\Big) \prod_{(k,k')\in S_0^\Omega} \int_{|\omega_{0,k,k'} - \omega_{k,k'}| \le \frac{\epsilon_n}{6 s_0^\Omega c_1\sqrt{p}}} \frac{\xi_1}{2}\exp(-\xi_1|\omega_{0,k,k'} - \omega_{k,k'}|)\,d\omega_{k,k'} \\
&= \eta^{s_0^\Omega} \exp(-\xi_1\|\Omega_{0,S_0^\Omega}\|_1) \prod_{(k,k')\in S_0^\Omega} \int_{|\Delta| \le \frac{\epsilon_n}{6 s_0^\Omega c_1\sqrt{p}}} \frac{\xi_1}{2}\exp(-\xi_1|\Delta|)\,d\Delta \\
&\ge \eta^{s_0^\Omega} \exp(-\xi_1\|\Omega_{0,S_0^\Omega}\|_1)\, e^{-\frac{\xi_1\epsilon_n}{6c_1\sqrt{p}}} \left( \frac{\xi_1\epsilon_n}{6 s_0^\Omega c_1\sqrt{p}} \right)^{s_0^\Omega}
\end{aligned}$$
The first inequality holds because $|\omega_{0,k,k'} - \omega_{k,k'}| \le \epsilon_n/(6 s_0^\Omega c_1\sqrt{p})$ for every entry implies that the sum is less than $\epsilon_n/(6 c_1\sqrt{p})$. The last inequality is a special case of Equation D.18 of Bai et al. (2020).
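The final inequality in the display above lower bounds the Laplace interval mass $\int_{|\Delta|\le\delta}\frac{\xi}{2}e^{-\xi|\Delta|}\,d\Delta = 1 - e^{-\xi\delta}$ by $\xi\delta e^{-\xi\delta}$; a quick numerical check of this bound (with arbitrary rates and radii) is sketched below.

```python
import math

# Closed-form interval mass of a Laplace(rate) density around 0, and the
# smaller quantity used in the display above (cf. Bai et al.'s Eq. D.18).
def laplace_interval_mass(rate, delta):
    return 1.0 - math.exp(-rate * delta)

def lower_bound(rate, delta):
    return rate * delta * math.exp(-rate * delta)

# Arbitrary (rate, delta) pairs; the bound 1 - e^{-t} >= t e^{-t} holds for
# all t >= 0 since it is equivalent to e^t - 1 >= t.
checks = [(0.01, 0.5), (1.0, 0.1), (5.0, 2.0), (100.0, 1e-3)]
ok = all(laplace_interval_mass(r, d) >= lower_bound(r, d) for r, d in checks)
```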
For $\mathcal{B}_2$, we derive the lower bound using the spike component of the prior. Recall that $Q = q(q-1)/2$ denotes the number of off-diagonal entries of $\Omega$. We have
$$\begin{aligned}
\tilde{\Pi}(\mathcal{B}_2) &= \int_{\mathcal{B}_2} \prod_{(k,k')\in (S_0^\Omega)^c} \pi(\omega_{k,k'} \mid \eta)\,d\mu \\
&\ge \prod_{(k,k')\in (S_0^\Omega)^c} \int_{|\omega_{k,k'}| \le \frac{\epsilon_n}{6(Q - s_0^\Omega)c_1\sqrt{p}}} \pi(\omega_{k,k'} \mid \eta)\,d\omega_{k,k'} \\
&\ge (1-\eta)^{Q - s_0^\Omega} \prod_{(k,k')\in (S_0^\Omega)^c} \int_{|\omega_{k,k'}| \le \frac{\epsilon_n}{6(Q - s_0^\Omega)c_1\sqrt{p}}} \frac{\xi_0}{2}\exp(-\xi_0|\omega_{k,k'}|)\,d\omega_{k,k'} \\
&\ge (1-\eta)^{Q-s_0^\Omega} \prod_{(k,k')\in (S_0^\Omega)^c} \left( 1 - \frac{6(Q - s_0^\Omega)c_1\sqrt{p}}{\epsilon_n} E_\pi|\omega_{k,k'}| \right) \\
&= (1-\eta)^{Q-s_0^\Omega}\left( 1 - \frac{6(Q-s_0^\Omega)c_1\sqrt{p}}{\epsilon_n \xi_0} \right)^{Q - s_0^\Omega} \\
&\gtrsim (1-\eta)^{Q-s_0^\Omega}\left( 1 - \frac{1}{Q - s_0^\Omega} \right)^{Q-s_0^\Omega} \gtrsim (1-\eta)^{Q-s_0^\Omega}
\end{aligned}$$
To derive the last two lines, we used an argument similar to the one used by Bai et al. (2020) to derive their Equation D.22. That is, we used the assumption that $\xi_0 \asymp \max\{Q, n, pq\}^{4+b}$ for some $b > 0$ to conclude that $\sqrt{n}/\max\{Q, n, pq\}^{1/2+b} \le 1$. This inequality allows us to control the $Q$ in the numerator. Since $s_0^\Omega$ grows more slowly than $Q$, we can lower bound the above expression by some multiple of $(1-\eta)^{Q-s_0^\Omega}$. Thus, for large enough $n$, we have
$$\frac{6(Q - s_0^\Omega)c_1\sqrt{p}}{\epsilon_n \xi_0} \le \frac{6(Q-s_0^\Omega)c_1\sqrt{p}\sqrt{n}}{\sqrt{p\log(q)}\,Q^{2+b}} = \frac{6c_1}{\sqrt{\log(q)}}\cdot\frac{Q - s_0^\Omega}{Q^2}\cdot\frac{\sqrt{n}}{Q^b} \le \frac{Q-s_0^\Omega}{Q^2} \le \frac{1}{Q - s_0^\Omega}$$
The event $\mathcal{B}_3$ only involves diagonal entries. The untruncated prior mass can be bounded directly using the exponential distribution:
$$\begin{aligned}
\tilde{\Pi}(\mathcal{B}_3) &= \int_{\mathcal{B}_3} \prod_{k=1}^q \pi(\omega_{k,k})\,d\mu \\
&\ge \prod_{k=1}^q \int_{|\omega_{0,k,k} - \omega_{k,k}| \le \frac{\epsilon_n}{3qc_1\sqrt{p}}} \xi\exp(-\xi\omega_{k,k})\,d\omega_{k,k} \\
&= \prod_{k=1}^q \int_{\omega_{0,k,k} - \frac{\epsilon_n}{3qc_1\sqrt{p}}}^{\omega_{0,k,k} + \frac{\epsilon_n}{3qc_1\sqrt{p}}} \xi\exp(-\xi\omega_{k,k})\,d\omega_{k,k} \\
&\ge \prod_{k=1}^q \int_{\omega_{0,k,k}}^{\omega_{0,k,k} + \frac{\epsilon_n}{3qc_1\sqrt{p}}} \xi\exp(-\xi\omega_{k,k})\,d\omega_{k,k} \\
&= \exp\Big(-\xi\sum_{k=1}^q \omega_{0,k,k}\Big) \left( \int_0^{\frac{\epsilon_n}{3qc_1\sqrt{p}}} \xi\exp(-\xi\omega)\,d\omega \right)^q \\
&\ge \exp\Big(-\xi\sum_{k=1}^q \omega_{0,k,k}\Big)\, e^{-\frac{\xi\epsilon_n}{3c_1\sqrt{p}}} \left( \frac{\xi\epsilon_n}{3qc_1\sqrt{p}} \right)^q
\end{aligned}$$
Now we are ready to show that the negative log prior mass of $\mathcal{B}$ can be bounded by some $C_1 n\epsilon_n^2$. To this end, consider the negative log probability:
$$\begin{aligned}
&-\log(\Pi(\mathcal{A}_1 \cap \{\Omega \succ \tau I\})) \le \sum_{i=1}^3 -\log(\tilde{\Pi}(\mathcal{B}_i)) \\
&\lesssim -s_0^\Omega\log(\eta) + \xi_1\|\Omega_{0,S_0^\Omega}\|_1 + \frac{\xi_1\epsilon_n}{6c_1\sqrt{p}} - s_0^\Omega\log\left( \frac{\xi_1\epsilon_n}{6s_0^\Omega c_1\sqrt{p}} \right) - (Q - s_0^\Omega)\log(1-\eta) + \xi\sum_k \omega_{0,k,k} + \frac{\xi\epsilon_n}{3c_1\sqrt{p}} - q\log\left( \frac{\xi\epsilon_n}{3qc_1\sqrt{p}} \right) \\
&= -\log\left( \eta^{s_0^\Omega}(1-\eta)^{Q - s_0^\Omega} \right) + \xi_1\|\Omega_{0,S_0^\Omega}\|_1 + \frac{\xi_1\epsilon_n}{6c_1\sqrt{p}} + \xi\sum_k \omega_{0,k,k} + \frac{\xi\epsilon_n}{3c_1\sqrt{p}} - s_0^\Omega\log\left( \frac{\xi_1\epsilon_n}{6 s_0^\Omega c_1\sqrt{p}} \right) - q\log\left( \frac{\xi\epsilon_n}{3qc_1\sqrt{p}} \right)
\end{aligned}$$
The $\frac{\xi_1\epsilon_n}{6c_1\sqrt{p}}$ and $\frac{\xi\epsilon_n}{3c_1\sqrt{p}}$ terms are $O(\epsilon_n/\sqrt{p})$, which is dominated by $n\epsilon_n^2$ since $n\epsilon_n^2 \to \infty$. The fourth term, $\xi\sum_k\omega_{0,k,k}$, is of order $q$, since the diagonal entries are controlled by the largest eigenvalue of $\Omega_0$, which was assumed to be bounded. The second term satisfies
$$\xi_1\|\Omega_{0,S_0^\Omega}\|_1 \le \xi_1 s_0^\Omega \sup|\omega_{0,k,k'}|,$$
which is of order $s_0^\Omega$, as the entries $\omega_{0,k,k'}$ are bounded.
Without tuning $\eta$, the first term $-\log\left(\eta^{s_0^\Omega}(1-\eta)^{Q-s_0^\Omega}\right)$ has order $Q$. But since we assumed $\frac{1-\eta}{\eta} \asymp \max\{Q, pq\}^{2+a}$ for some $a > 0$, we have $K_1\max\{Q,pq\}^{2+a} \le \frac{1-\eta}{\eta} \le K_2\max\{Q,pq\}^{2+a}$. That is, $1/(1 + K_2\max\{Q,pq\}^{2+a}) \le \eta \le 1/(1 + K_1\max\{Q,pq\}^{2+a})$.
We can derive a simple lower bound as
$$\begin{aligned}
\eta^{s_0^\Omega}(1-\eta)^{Q - s_0^\Omega} &\ge (1 + K_2\max\{Q,pq\}^{2+a})^{-s_0^\Omega}(1-\eta)^{Q-s_0^\Omega} \\
&\ge (1 + K_2\max\{Q,pq\}^{2+a})^{-s_0^\Omega}\left( 1 - \frac{1}{1 + K_1\max\{Q,pq\}^{2+a}} \right)^{Q - s_0^\Omega} \\
&\gtrsim (1 + K_2\max\{Q,pq\}^{2+a})^{-s_0^\Omega}
\end{aligned}$$
The last line holds because $\max\{Q,pq\}^{2+a}$ grows faster than $Q - s_0^\Omega$, so $\left(1 - \frac{1}{1+K_1\max\{Q,pq\}^{2+a}}\right)^{Q - s_0^\Omega}$ can be bounded below by a constant. Consequently,
$$-\log\left( \eta^{s_0^\Omega}(1-\eta)^{Q-s_0^\Omega} \right) \lesssim s_0^\Omega\log(1 + K_2\max\{Q,pq\}^{2+a}) \lesssim s_0^\Omega\log(\max\{Q,pq\}) \lesssim s_0^\Omega\log(\max\{q,p\}) \le \max\{p, q, s_0^\Omega\}\log(\max\{q,p\}) \le n\epsilon_n^2$$
The last two terms can be treated in the same way, using the assumptions $\xi_1 \asymp 1/n$ and $\xi \asymp 1/\max\{Q, n\}$. First,
$$\begin{aligned}
-s_0^\Omega\log\left( \frac{\xi_1\epsilon_n}{6s_0^\Omega c_1\sqrt{p}} \right) &= s_0^\Omega\log\left( \frac{6 s_0^\Omega c_1\sqrt{p}}{\xi_1\epsilon_n} \right) \\
&\lesssim s_0^\Omega\log\left( \frac{n^{3/2} s_0^\Omega\sqrt{p}}{\sqrt{\max\{s_0^\Omega, p, q\}\log(q)}} \right) \\
&\le s_0^\Omega\log\left( n^{3/2} s_0^\Omega \right) \\
&\lesssim s_0^\Omega\log(q^2) \\
&\lesssim n\epsilon_n^2
\end{aligned}$$
The third line holds because $\sqrt{p} \le \sqrt{\max\{s_0^\Omega, p, q\}}$ and $\log(q) \ge 1$, which together imply that $\sqrt{p}/\sqrt{\max\{s_0^\Omega, p, q\}\log(q)} \le 1$. The fourth line follows from our assumption that $\log(n) \lesssim \log(q)$ and the fact that $s_0^\Omega < q^2$. The last line uses the definition of $\epsilon_n$.
Finally, we have
$$\begin{aligned}
-q\log\left( \frac{\xi\epsilon_n}{3qc_1\sqrt{p}} \right) &= q\log\left( \frac{3qc_1\sqrt{p}}{\xi\epsilon_n} \right) \\
&\lesssim q\log\left( \frac{n^{1/2}\max\{Q,n\}\,q\sqrt{p}}{\sqrt{\max\{s_0^\Omega, p, q\}\log(q)}} \right) \\
&\le q\log\left( n^{1/2}\max\{Q,n\}\,q \right) \\
&\lesssim q\log(q) \\
&\lesssim n\epsilon_n^2
\end{aligned}$$
S5.1.2 Bounding the conditional probability $\Pi(\mathcal{A}_2 \mid \mathcal{A}_1)$

To bound $\Pi(\mathcal{A}_2 \mid \mathcal{A}_1)$, we use a strategy very similar to the one above. The difference is that we now focus on the matrix $\Psi$. We show that the mass of an L1 norm ball serves as a lower bound, similar to the case of $\Omega$. To see this, using an argument from Ning et al. (2020), we note that powers of $\Omega$ and $\Omega_0$ are bounded in operator norm. Thus the terms $\frac{2}{n}\|X\Delta_\Psi\Omega_0^{-1/2}\|_F^2$ and $\frac{1}{n}\|X\Delta_\Psi\Omega^{-1/2}\|_F^2$ that appear in the KL condition are bounded by a constant multiple of $n^{-1}\|X\Delta_\Psi\|_F^2$. Using the fact that the columns of $X$ have norm $\sqrt{n}$, we can bound this norm:
$$\|X\Delta_\Psi\|_F \le \sqrt{n}\sum_{j=1}^p \|\Delta_{\Psi,j,\cdot}\|_F \le \sqrt{n}\sum_{j=1}^p\sum_{k=1}^q |\psi_{j,k} - \psi_{0,j,k}|$$
Thus, to bound $\Pi(\mathcal{A}_2 \mid \mathcal{A}_1)$ from below, it suffices to bound $\Pi\left( \sum_{j,k}|\psi_{j,k} - \psi_{0,j,k}| \le c_4\epsilon_n \right)$ for some fixed constant $c_4 > 0$.
We separate the sum based on whether the true value is 0, similar to our treatment of $\Omega$:
$$\sum_{j,k}|\psi_{j,k} - \psi_{0,j,k}| = \sum_{(j,k)\in S_0^\Psi}|\psi_{j,k} - \psi_{0,j,k}| + \sum_{(j,k)\in (S_0^\Psi)^c}|\psi_{j,k}|$$
Using the same argument as for $\Omega$, we can consider the events whose intersection is a subset of $\mathcal{A}_2$:
$$\mathcal{B}_4 = \left\{ \sum_{(j,k)\in S_0^\Psi}|\psi_{j,k} - \psi_{0,j,k}| \le \frac{c_4\epsilon_n}{2} \right\}$$
$$\mathcal{B}_5 = \left\{ \sum_{(j,k)\in (S_0^\Psi)^c}|\psi_{j,k}| \le \frac{c_4\epsilon_n}{2} \right\}$$
We have $\mathcal{B}_4 \cap \mathcal{B}_5 \subset \mathcal{A}_2$. Since the elements of $\Psi$ are a priori independent of each other and of $\Omega$, we compute
$$\Pi(\mathcal{A}_2 \mid \mathcal{A}_1) \ge \Pi(\mathcal{B}_4 \cap \mathcal{B}_5 \mid \mathcal{A}_1) = \Pi(\mathcal{B}_4)\Pi(\mathcal{B}_5)$$
We bound each of these terms using the same argument as in the previous subsection:
$$\begin{aligned}
\Pi(\mathcal{B}_4) &= \int_{\mathcal{B}_4} \prod_{(j,k)\in S_0^\Psi} \pi(\psi_{j,k} \mid \theta)\,d\mu \\
&\ge \prod_{(j,k)\in S_0^\Psi} \int_{|\psi_{j,k} - \psi_{0,j,k}| \le \frac{c_4\epsilon_n}{2s_0^\Psi}} \pi(\psi_{j,k} \mid \theta)\,d\psi_{j,k} \\
&\ge \theta^{s_0^\Psi} \prod_{(j,k)\in S_0^\Psi} \int_{|\psi_{j,k} - \psi_{0,j,k}| \le \frac{c_4\epsilon_n}{2 s_0^\Psi}} \frac{\lambda_1}{2}\exp(-\lambda_1|\psi_{j,k}|)\,d\psi_{j,k} \\
&\ge \theta^{s_0^\Psi}\exp\Big( -\lambda_1\sum_{(j,k)\in S_0^\Psi}|\psi_{0,j,k}| \Big) \prod_{(j,k)\in S_0^\Psi} \int_{|\psi_{j,k}-\psi_{0,j,k}| \le \frac{c_4\epsilon_n}{2s_0^\Psi}} \frac{\lambda_1}{2}\exp(-\lambda_1|\psi_{j,k}-\psi_{0,j,k}|)\,d\psi_{j,k} \\
&= \theta^{s_0^\Psi}\exp(-\lambda_1\|\Psi_{0,S_0^\Psi}\|_1) \prod_{(j,k)\in S_0^\Psi} \int_{|\Delta| \le \frac{c_4\epsilon_n}{2 s_0^\Psi}} \frac{\lambda_1}{2}\exp(-\lambda_1|\Delta|)\,d\Delta \\
&\ge \theta^{s_0^\Psi}\exp(-\lambda_1\|\Psi_{0,S_0^\Psi}\|_1)\, e^{-\frac{c_4\lambda_1\epsilon_n}{2}} \left( \frac{c_4\lambda_1\epsilon_n}{2 s_0^\Psi} \right)^{s_0^\Psi}
\end{aligned}$$
Similarly, we have
$$\Pi(\mathcal{B}_5) \ge (1-\theta)^{pq - s_0^\Psi}\left( 1 - \frac{2(pq - s_0^\Psi)c_4}{\epsilon_n\lambda_0} \right)^{pq - s_0^\Psi} \gtrsim (1-\theta)^{pq - s_0^\Psi}$$
From here we have
$$\begin{aligned}
-\log(\Pi(\mathcal{A}_2 \mid \mathcal{A}_1)) &\le -\log(\Pi(\mathcal{B}_4)) - \log(\Pi(\mathcal{B}_5)) \\
&= -\log\left( \theta^{s_0^\Psi}(1-\theta)^{pq - s_0^\Psi} \right) + \lambda_1\|\Psi_{0,S_0^\Psi}\|_1 + \frac{\lambda_1 c_4\epsilon_n}{2} - s_0^\Psi\log\left( \frac{c_4\lambda_1\epsilon_n}{2 s_0^\Psi} \right)
\end{aligned}$$
Since $\Psi_0$ has bounded L2 operator norm, we know that the entries of $\Psi_0$ are all bounded. Thus $\lambda_1\|\Psi_{0,S_0^\Psi}\|_1 = O(s_0^\Psi) \lesssim n\epsilon_n^2$, and the third term is $O(\epsilon_n) \lesssim n\epsilon_n^2$.
For the first term, recall that we assumed $\frac{1-\theta}{\theta} \asymp (pq)^{2+b}$ for some $b > 0$. That is, there are constants $M_3$ and $M_4$ such that $M_3(pq)^{2+b} \le \frac{1-\theta}{\theta} \le M_4(pq)^{2+b}$. Since $1/(1 + M_4(pq)^{2+b}) \le \theta \le 1/(1 + M_3(pq)^{2+b})$, we compute
$$\begin{aligned}
\theta^{s_0^\Psi}(1-\theta)^{pq - s_0^\Psi} &\ge (1 + M_4(pq)^{2+b})^{-s_0^\Psi}(1-\theta)^{pq - s_0^\Psi} \\
&\ge (1 + M_4(pq)^{2+b})^{-s_0^\Psi}\left( 1 - 1/(1 + M_3(pq)^{2+b}) \right)^{pq - s_0^\Psi} \\
&\gtrsim (1 + M_4(pq)^{2+b})^{-s_0^\Psi}
\end{aligned}$$
Note that the last line is due to the fact that $(pq)^{2+b}$ grows faster than $pq - s_0^\Psi$. Consequently, the term $\left(1 - 1/(1 + M_3(pq)^{2+b})\right)^{pq - s_0^\Psi}$ can be bounded from below by a constant not depending on $n$. Thus,
$$-\log\left( \theta^{s_0^\Psi}(1-\theta)^{pq-s_0^\Psi} \right) \lesssim s_0^\Psi\log(1 + M_4(pq)^{2+b}) \lesssim s_0^\Psi\log(pq) \lesssim s_0^\Psi\max\{\log(q), \log(p)\}$$
For the last term, we use the same argument as we did with $\Omega$:
$$\begin{aligned}
-s_0^\Psi\log\left( \frac{c_4\lambda_1\epsilon_n}{2 s_0^\Psi} \right) &= s_0^\Psi\log\left( \frac{2 s_0^\Psi}{c_4\lambda_1\epsilon_n} \right) \\
&\lesssim s_0^\Psi\log\left( \frac{n^{3/2} s_0^\Psi}{\sqrt{\log(pq)}} \right) \\
&\lesssim s_0^\Psi\log(n) \\
&\lesssim n\epsilon_n^2
\end{aligned}$$
S5.2 Test condition

To simplify the parameter space considered in the test condition, we first prove a dimension recovery result by bounding the prior probability, with the effective dimension defined as the number of entries whose absolute value is larger than the intersection point of the spike and slab components. Then we find an appropriate vectorized L1 norm sieve in the resulting "lower-dimensional" parameter space. We construct tests based on the supremum of a collection of single-alternative Neyman-Pearson likelihood ratio tests over subsets of the sieve that are norm balls, and then show that the number of such subsets needed to cover the sieve can be bounded appropriately.
S5.2.1 Dimension recovery

Unlike Ning et al. (2020), our prior assigns no mass to exactly sparse solutions. Nevertheless, similar to Ročková and George (2018), we can define a notion of "effective sparsity" and a generalized dimension. Intuitively, the generalized dimension counts how many coefficients are drawn from the slab rather than the spike part of the prior. Formally, the generalized inclusion functions $\nu_\psi$ and $\nu_\omega$ for $\Psi$ and $\Omega$ are defined as:
$$\nu_\psi(\psi_{j,k}) = \mathbb{1}(|\psi_{j,k}| > \delta_\psi), \qquad \nu_\omega(\omega_{k,k'}) = \mathbb{1}(|\omega_{k,k'}| > \delta_\omega)$$
where $\delta_\psi$ and $\delta_\omega$ are the thresholds at which the spike and slab parts have the same density:
$$\delta_\psi = \frac{1}{\lambda_0 - \lambda_1}\log\left( \frac{1-\theta}{\theta}\cdot\frac{\lambda_0}{\lambda_1} \right), \qquad \delta_\omega = \frac{1}{\xi_0 - \xi_1}\log\left( \frac{1-\eta}{\eta}\cdot\frac{\xi_0}{\xi_1} \right)$$
The generalized dimension can then be defined as the number of entries that are included:
$$|\nu(\Psi)| = \sum_{j,k}\nu_\psi(\psi_{j,k}), \qquad |\nu(\Omega)| = \sum_{k>k'}\nu_\omega(\omega_{k,k'}) \quad (S44)$$
Note that we only count the off-diagonal entries of $\Omega$.
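The thresholds and generalized dimension defined above are straightforward to compute; a minimal sketch with hypothetical hyperparameters (not values from the paper):

```python
import numpy as np

def laplace_pdf(x, rate):
    return 0.5 * rate * np.exp(-rate * np.abs(x))

def ssl_threshold(rate0, rate1, slab_prob):
    """Point where the weighted spike and slab densities intersect (cf. (S44))."""
    return np.log(((1 - slab_prob) / slab_prob) * (rate0 / rate1)) / (rate0 - rate1)

def effective_dim(M, delta):
    """Generalized dimension: number of entries exceeding the threshold."""
    return int(np.sum(np.abs(M) > delta))

# Hypothetical hyperparameters for illustration.
lam0, lam1, theta = 30.0, 0.5, 0.05
delta_psi = ssl_threshold(lam0, lam1, theta)

# At delta_psi, the weighted spike and slab components agree by construction.
spike = (1 - theta) * laplace_pdf(delta_psi, lam0)
slab = theta * laplace_pdf(delta_psi, lam1)

Psi = np.array([[0.0, 2.5], [-0.01, 1.2], [0.3, 0.0]])
nu = effective_dim(Psi, delta_psi)  # three entries exceed the threshold here
```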
We are now ready to prove Lemma 1 from the main text. The main idea is to check the posterior probability directly. Let $\mathcal{B}_n^\Psi = \{\Psi : |\nu(\Psi)| < r_n^\Psi\}$ for $r_n^\Psi = C_3'\max\{p, q, s_0^\Psi, s_0^\Omega\}$ with $C_3' > C_1$ from the KL condition. For $\Omega$, let $\mathcal{B}_n^\Omega = \{\Omega \succ \tau I : |\nu(\Omega)| < r_n^\Omega\}$ for $r_n^\Omega = C_3'\max\{p, q, s_0^\Psi, s_0^\Omega\}$ with the same $C_3' > C_1$. We aim to show that $E_0\Pi(\Omega \in (\mathcal{B}_n^\Omega)^c \mid Y_1,\dots,Y_n) \to 0$ and $E_0\Pi(\Psi \in (\mathcal{B}_n^\Psi)^c \mid Y_1,\dots,Y_n) \to 0$.
The marginal posteriors can be expressed using the log-likelihood $\ell_n$:
$$\Pi(\Psi \in (\mathcal{B}_n^\Psi)^c \mid Y_1,\dots,Y_n) = \frac{\iint_{(\mathcal{B}_n^\Psi)^c} \exp(\ell_n(\Psi,\Omega) - \ell_n(\Psi_0,\Omega_0))\,d\Pi(\Psi)\,d\Pi(\Omega)}{\iint \exp(\ell_n(\Psi,\Omega) - \ell_n(\Psi_0,\Omega_0))\,d\Pi(\Psi)\,d\Pi(\Omega)}$$
$$\Pi(\Omega \in (\mathcal{B}_n^\Omega)^c \mid Y_1,\dots,Y_n) = \frac{\iint_{(\mathcal{B}_n^\Omega)^c} \exp(\ell_n(\Psi,\Omega) - \ell_n(\Psi_0,\Omega_0))\,d\Pi(\Psi)\,d\Pi(\Omega)}{\iint \exp(\ell_n(\Psi,\Omega) - \ell_n(\Psi_0,\Omega_0))\,d\Pi(\Psi)\,d\Pi(\Omega)} \quad (S45)$$
By the KL condition (Lemma S2), the denominators are bounded from below by $e^{-C_1 n\epsilon_n^2}$ with large probability. Thus, we now focus on upper bounding the numerators, beginning with $\Psi$.
Consider the numerator:
$$\begin{aligned}
E_0\left[ \iint_{(\mathcal{B}_n^\Psi)^c} f/f_0\,d\Pi(\Psi)\,d\Pi(\Omega) \right] &= \int\left( \iint_{(\mathcal{B}_n^\Psi)^c} f/f_0\,d\Pi(\Psi)\,d\Pi(\Omega) \right) f_0\,dy \\
&= \iint_{(\mathcal{B}_n^\Psi)^c}\left( \int f\,dy \right)d\Pi(\Psi)\,d\Pi(\Omega) \\
&\le \int_{(\mathcal{B}_n^\Psi)^c}d\Pi(\Psi) = \Pi(|\nu(\Psi)| \ge r_n^\Psi)
\end{aligned}$$
We can bound the above display using the fact that when $|\psi_{j,k}| > \delta_\psi$ we have $\pi(\psi_{j,k}) < 2\theta\frac{\lambda_1}{2}\exp(-\lambda_1|\psi_{j,k}|)$; this follows from the definition of the effective dimension:
$$\begin{aligned}
\Pi(|\nu(\Psi)| \ge r_n^\Psi) &\le \sum_{|S| > r_n^\Psi} (2\theta)^{|S|}\prod_{(j,k)\in S}\int_{|\psi_{j,k}| > \delta_\psi}\frac{\lambda_1}{2}\exp(-\lambda_1|\psi_{j,k}|)\,d\psi_{j,k}\prod_{(j,k)\notin S}\int_{|\psi_{j,k}| < \delta_\psi}\pi(\psi_{j,k})\,d\psi_{j,k} \\
&\le \sum_{|S| > r_n^\Psi}(2\theta)^{|S|}
\end{aligned}$$
Using the assumption on $\theta$ and the fact that $\binom{pq}{k} \le (epq/k)^k$ (similar to Bai et al. (2020)'s Equation D.32), we can further upper bound the probability:
$$\begin{aligned}
\Pi(|\nu(\Psi)| \ge r_n^\Psi) \le \sum_{|S| > r_n^\Psi}(2\theta)^{|S|} &\le \sum_{|S| > r_n^\Psi}\left( \frac{2}{1 + M_4(pq)^{2+b}} \right)^{|S|} \\
&\le \sum_{k=\lfloor r_n^\Psi\rfloor + 1}^{pq}\binom{pq}{k}\left( \frac{2}{M_4(pq)^2} \right)^k \\
&\le \sum_{k=\lfloor r_n^\Psi\rfloor+1}^{pq}\left( \frac{2e}{M_4 k\, pq} \right)^k \\
&< \sum_{k=\lfloor r_n^\Psi\rfloor+1}^{pq}\left( \frac{2e}{M_4(\lfloor r_n^\Psi\rfloor + 1)pq} \right)^k \\
&\lesssim (pq)^{-(\lfloor r_n^\Psi\rfloor + 1)} \\
&\le \exp(-r_n^\Psi\log(pq)).
\end{aligned}$$
Taking $r_n^\Psi = C_3'\max\{p, q, s_0^\Psi, s_0^\Omega\}$ for some $C_3' > C_1$, we have:
$$\Pi(|\nu(\Psi)| \ge r_n^\Psi) \le \exp(-C_3'\max\{p, q, s_0^\Psi, s_0^\Omega\}\log(pq))$$
Therefore,
$$E_0\Pi((\mathcal{B}_n^\Psi)^c \mid Y_1,\dots,Y_n) \le E_0\left[ \Pi((\mathcal{B}_n^\Psi)^c \mid Y_1,\dots,Y_n)\,\mathbb{I}_{E_n} \right] + P_0(E_n^c),$$
where $E_n$ is the event in the KL condition. On $E_n$, the KL condition ensures that the denominator in Equation (S45) is lower bounded by $\exp(-C_1 n\epsilon_n^2)$, while the numerator is upper bounded by $\exp(-C_3'\max\{p,q,s_0^\Psi,s_0^\Omega\}\log(pq))$. Since $P_0(E_n^c)$ is $o(1)$ per the KL condition, we have the upper bound
$$E_0\Pi((\mathcal{B}_n^\Psi)^c \mid Y_1,\dots,Y_n) \le \exp(C_1 n\epsilon_n^2 - C_3'\max\{p,q,s_0^\Psi,s_0^\Omega\}\log(pq)) + o(1) \to 0$$
This completes the proof of the dimension recovery result for $\Psi$.
The workflow for $\Omega$ is very similar, except that we need to use the upper bound on the graphical prior in Equation (S34) to properly bound the prior mass.
We upper bound the numerator:
$$E_0\left[ \iint_{(\mathcal{B}_n^\Omega)^c} f/f_0\,d\Pi(\Psi)\,d\Pi(\Omega) \right] \le \int_{(\mathcal{B}_n^\Omega)^c}d\Pi(\Omega) = \Pi(|\nu(\Omega)| \ge r_n^\Omega) \le \exp(2\xi Q - \log(R))\,\tilde{\Pi}(|\nu(\Omega)| \ge r_n^\Omega)$$
We bound the above display using the fact that when $|\omega_{k,k'}| > \delta_\omega$ we have $\pi(\omega_{k,k'}) < 2\eta\frac{\xi_1}{2}\exp(-\xi_1|\omega_{k,k'}|)$. Note that this follows from the definition of the effective dimension.
We have
$$\begin{aligned}
\tilde{\Pi}(|\nu(\Omega)| \ge r_n^\Omega) &\le \sum_{|S| > r_n^\Omega}(2\eta)^{|S|}\prod_{(k,k')\in S}\int_{|\omega_{k,k'}|>\delta_\omega}\frac{\xi_1}{2}\exp(-\xi_1|\omega_{k,k'}|)\,d\omega_{k,k'}\prod_{(k,k')\notin S}\int_{|\omega_{k,k'}|<\delta_\omega}\pi(\omega_{k,k'})\,d\omega_{k,k'} \\
&\le \sum_{|S| > r_n^\Omega}(2\eta)^{|S|}
\end{aligned}$$
Using the assumption on $\eta$ and the fact that $\binom{Q}{k} \le (eQ/k)^k$, we can further upper bound the probability:
$$\begin{aligned}
\tilde{\Pi}(|\nu(\Omega)| \ge r_n^\Omega) \le \sum_{|S| > r_n^\Omega}(2\eta)^{|S|} &\le \sum_{|S| > r_n^\Omega}\left( \frac{2}{1 + K_4\max\{pq, Q\}^{2+b}} \right)^{|S|} \\
&\le \sum_{k=\lfloor r_n^\Omega\rfloor+1}^{Q}\binom{Q}{k}\left( \frac{2}{K_4\max\{pq,Q\}^2} \right)^k \\
&\le \sum_{k=\lfloor r_n^\Omega\rfloor+1}^{\max\{pq,Q\}}\binom{\max\{pq,Q\}}{k}\left( \frac{2}{K_4\max\{pq,Q\}^2} \right)^k \\
&\le \sum_{k=\lfloor r_n^\Omega\rfloor+1}^{\max\{pq,Q\}}\left( \frac{2e}{K_4 k\max\{pq,Q\}} \right)^k \\
&< \sum_{k=\lfloor r_n^\Omega\rfloor+1}^{\max\{pq,Q\}}\left( \frac{2e}{K_4(\lfloor r_n^\Omega\rfloor + 1)\max\{pq,Q\}} \right)^k \\
&\lesssim \max\{pq, Q\}^{-(\lfloor r_n^\Omega\rfloor+1)} \\
&\le \exp(-r_n^\Omega\log(\max\{pq,Q\}))
\end{aligned}$$
Taking $r_n^\Omega = C_3'\max\{p,q,s_0^\Psi,s_0^\Omega\}$ with $C_3' > C_1$, we have
$$\tilde{\Pi}(|\nu(\Omega)| \ge r_n^\Omega) \le \exp(-C_3'\max\{p,q,s_0^\Psi,s_0^\Omega\}\log(\max\{pq,Q\})) \le \exp(-C_3 n\epsilon_n^2)$$
Thus, using the assumption $\xi \asymp 1/\max\{Q,n\}$, for some $R'$ not depending on $n$, we have
$$\Pi(|\nu(\Omega)| \ge r_n^\Omega) \le \exp(-C_3 n\epsilon_n^2 + 2\xi Q - \log(R)) \le \exp(-C_3 n\epsilon_n^2 + \log(R'))$$
We therefore conclude that
$$E_0\Pi((\mathcal{B}_n^\Omega)^c \mid Y_1,\dots,Y_n) \le E_0\left[ \Pi((\mathcal{B}_n^\Omega)^c \mid Y_1,\dots,Y_n)\,\mathbb{I}_{E_n} \right] + P_0(E_n^c),$$
where $E_n$ is the event in the KL condition. On $E_n$, the KL condition ensures that the denominator in Equation (S45) is lower bounded by $\exp(-C_1 n\epsilon_n^2)$, while the numerator is upper bounded by $\exp(-C_3 n\epsilon_n^2 + \log(R'))$. Since $P_0(E_n^c)$ is $o(1)$ per the KL condition, we conclude
$$E_0\Pi((\mathcal{B}_n^\Omega)^c \mid Y_1,\dots,Y_n) \le \exp(C_1 n\epsilon_n^2 - C_3 n\epsilon_n^2 + \log(R')) + o(1) \to 0$$
We pause now to reflect on how dimension recovery helps us establish contraction. Our end goal is to show that the posterior distribution contracts to the true value, by first showing that any event with log-affinity difference larger than a given $\epsilon > 0$ has $o(1)$ posterior mass. For any such event, we can take a partition based on whether it intersects $\mathcal{B}_n^\Psi$, $\mathcal{B}_n^\Omega$, or their complements. Because the complements $(\mathcal{B}_n^\Psi)^c$ and $(\mathcal{B}_n^\Omega)^c$ have $o(1)$ posterior mass, the pieces of the partition that intersect either complement also have $o(1)$ posterior mass. Thus, we only need to show that events that have log-affinity difference larger than any given $\epsilon > 0$ and that recover the low-dimensional structure have $o(1)$ posterior mass. The recovery condition reduces the complexity of the events (in the parameter space) that we need to handle by reducing their effective dimension. We will make use of this low-dimensional structure when checking the test condition.
Formally, for every $\epsilon > 0$, we have
$$\begin{aligned}
&E_0\Pi\left( \Psi, \Omega \succ \tau I : \frac{1}{n}\sum\rho(f_i, f_{0,i}) > \epsilon \,\Big|\, Y_1,\dots,Y_n \right) \\
&\le E_0\Pi\left( \Psi \in \mathcal{B}_n^\Psi, \Omega \succ \tau I : \frac{1}{n}\sum\rho(f_i,f_{0,i}) > \epsilon \,\Big|\, Y_1,\dots,Y_n \right) + E_0\Pi((\mathcal{B}_n^\Psi)^c \mid Y_1,\dots,Y_n) \\
&\le E_0\Pi\left( \Psi \in \mathcal{B}_n^\Psi, \Omega \in \mathcal{B}_n^\Omega : \frac{1}{n}\sum\rho(f_i,f_{0,i}) > \epsilon \,\Big|\, Y_1,\dots,Y_n \right) + E_0\Pi((\mathcal{B}_n^\Psi)^c \mid Y_1,\dots,Y_n) + E_0\Pi((\mathcal{B}_n^\Omega)^c \mid Y_1,\dots,Y_n)
\end{aligned}$$
The last two terms are $o(1)$, as proved above.
S5.2.2 Sieve

As shown in the previous section, we can concentrate on events with proper dimension recovery, i.e. $\{\Psi \in \mathcal{B}_n^\Psi, \Omega \in \mathcal{B}_n^\Omega\}$. To apply Ghosal and van der Vaart (2017)'s general theory of posterior contraction and establish contraction on this event (i.e. $E_0\Pi(\Psi \in \mathcal{B}_n^\Psi, \Omega \in \mathcal{B}_n^\Omega : \frac{1}{n}\sum\rho(f_i, f_{0,i}) > \epsilon \mid Y_1,\dots,Y_n) \to 0$), we need to find a sieve that covers enough of the support of the prior. We will show that an L1 norm sieve is sufficient. Formally, we will show that there exists a sieve $\mathcal{F}_n$ such that for some constant $C_2 > C_1 + 2$:
$$\Pi(\mathcal{F}_n^c) \le \exp(-C_2 n\epsilon_n^2) \quad (S46)$$
Consider the sieve:
$$\mathcal{F}_n = \left\{ \Psi \in \mathcal{B}_n^\Psi, \Omega \in \mathcal{B}_n^\Omega : \|\Psi\|_1 \le 2C_3 p,\ \|\Omega\|_1 \le 8C_3 q \right\} \quad (S47)$$
for some large $C_3 > C_1 + 2 + \log(3)$, where $C_1$ is the constant in the KL condition. We have
$$\Pi(\mathcal{F}_n^c) \le \Pi(\|\Psi\|_1 > 2C_3 p) + \Pi((\|\Omega\|_1 > 8C_3 q) \cap \{\Omega \succ \tau I\})$$
We upper bound each term similarly to Bai et al. (2020). By the bound in Equation (S34), we know that
$$\Pi((\|\Omega\|_1 > 8C_3 q) \cap \{\Omega \succ \tau I\}) \le \exp(2\xi Q - \log(R))\,\tilde{\Pi}(\|\Omega\|_1 > 8C_3 q).$$
Since $\|\Omega\|_1 = 2\sum_{k>k'}|\omega_{k,k'}| + \sum_k|\omega_{k,k}|$, at least one of these two sums must exceed $8C_3q/2$ on this event. Thus, we can form an upper bound on the L1 norm probability:
$$\tilde{\Pi}(\|\Omega\|_1 > 8C_3 q) \le \tilde{\Pi}\left( \sum_{k>k'}|\omega_{k,k'}| > \frac{8C_3 q}{4} \right) + \tilde{\Pi}\left( \sum_k|\omega_{k,k}| > \frac{8C_3 q}{2} \right).$$
To get an upper bound under $\tilde{\Pi}$, we can act as if all the $\omega_{k,k'}$'s were drawn from the slab distribution. In that setting, $\sum_{k>k'}|\omega_{k,k'}|$ is Gamma distributed with shape parameter $Q$ and rate parameter $\xi_1$. By using an appropriate tail probability bound for the Gamma distribution
(Boucheron et al. (2013), p. 29) and the fact that $1 + x - \sqrt{1+2x} \ge (x-1)/2$, we compute
$$\begin{aligned}
\exp(2\xi Q - \log(R))\,\tilde{\Pi}\left( \sum_{k>k'}|\omega_{k,k'}| > 8C_3q/4 \right) &\le \exp\left[ -Q\left( 1 - \sqrt{1 + \frac{2\cdot 8C_3 q}{4Q\xi_1}} + \frac{8C_3 q}{4Q\xi_1} \right) + 2\xi Q - \log(R) \right] \\
&\le \exp\left( -\frac{8C_3 q}{8\xi_1} + \frac{5}{2}Q - \log(R) \right)
\end{aligned}$$
using, in the second line, that $2\xi Q \le 2Q$. Since we have assumed $\xi_1 \asymp 1/n$, for sufficiently large $n$ we have $n\epsilon_n^2 \ge q\log(q)$. Consequently, $qn\epsilon_n^2 \ge Q\log(q)$, $Q = o(qn\epsilon_n^2)$, and we see that
$$\frac{8C_3 q}{8\xi_1} - \frac{5}{2}Q - \log(R) \gtrsim C_3 nq - \frac{5}{2}Q - \log(R) \ge C_3 qn\epsilon_n^2 - \frac{5}{2}Q - \log(R) = C_3 qn\epsilon_n^2 - o(qn\epsilon_n^2) \ge C_3 n\epsilon_n^2$$
The first order term in $Q$ on the left-hand side can be ignored for large $n$, as the left-hand side is dominated by the $Q\log(q)$ term; note that we used the assumption that $\epsilon_n \to 0$. We therefore have
$$\exp(2\xi Q - \log(R))\,\tilde{\Pi}\left( \sum_{k>k'}|\omega_{k,k'}| > 8C_3 q/4 \right) \le \exp(-C_3 n\epsilon_n^2)$$
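The elementary inequality $1 + x - \sqrt{1+2x} \ge (x-1)/2$ used in the Gamma tail computation above is easy to confirm numerically; the gap simplifies to $x/2 + 3/2 - \sqrt{1+2x}$, which has no real roots and is positive at $x = 0$.

```python
import numpy as np

# Numerical check of 1 + x - sqrt(1 + 2x) >= (x - 1)/2 over a wide grid x >= 0.
x = np.linspace(0.0, 1000.0, 100001)
lhs = 1.0 + x - np.sqrt(1.0 + 2.0 * x)
rhs = (x - 1.0) / 2.0
gap = lhs - rhs  # equals x/2 + 3/2 - sqrt(1 + 2x), strictly positive
```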
For the diagonal, the sum follows a Gamma distribution with shape $q$ and rate $\xi$. We obtain a similar bound:
$$\begin{aligned}
\exp(2\xi Q - \log(R))\,\tilde{\Pi}\left( \sum_k|\omega_{k,k}| > 8C_3q/2 \right) &\le \exp(2\xi Q - \log(R))\exp\left[ -q\left( 1 - \sqrt{1 + \frac{2\cdot 8C_3 q}{2q\xi}} + \frac{8C_3 q}{2q\xi} \right) \right] \\
&\le \exp\left( -\frac{8C_3 q}{4\xi} + 2Q + \frac{q}{2} - \log(R) \right)
\end{aligned}$$
Using the same argument as before and the fact that $\xi \asymp 1/\max\{Q,n\}$, we have
$$\frac{8C_3 q}{4\xi} - 2Q - \frac{q}{2} + \log(R) \gtrsim 2C_3\max\{Q,n\}q - 2Q - \frac{q}{2} \ge C_3 qn\epsilon_n^2 - o(qn\epsilon_n^2) \ge C_3 n\epsilon_n^2$$
The first order term in $Q$ can again be ignored for large $n$, as the left-hand side is dominated by the $Q\log(q)$ term and $q/Q \to 0$.
By combining the above results, we have:
$$\begin{aligned}
\Pi((\|\Omega\|_1 > 8C_3q) \cap \{\Omega \succ \tau I\}) &\le \exp(2\xi Q - \log(R))\,\tilde{\Pi}(\|\Omega\|_1 > 8C_3 q) \\
&\le \exp(2\xi Q - \log(R))\,\tilde{\Pi}\left( \sum_{k>k'}|\omega_{k,k'}| > \frac{8C_3q}{4} \right) + \exp(2\xi Q - \log(R))\,\tilde{\Pi}\left( \sum_k|\omega_{k,k}| > \frac{8C_3q}{2} \right) \\
&\le 2\exp(-C_3 n\epsilon_n^2)
\end{aligned} \quad (S48)$$
The probability of $\|\Psi\|_1 > 2C_3 p$ can be bounded by the tail probability of a Gamma distribution with shape parameter $pq$ and rate parameter $\lambda_1$:
$$\begin{aligned}
\Pi(\|\Psi\|_1 > 2C_3 p) &\le \exp\left[ -pq\left( 1 - \sqrt{1 + \frac{2\cdot 2C_3 p}{pq\lambda_1}} + \frac{2C_3 p}{pq\lambda_1} \right) \right] \\
&\le \exp\left[ -pq\left( \frac{2C_3 p}{2pq\lambda_1} - \frac{1}{2} \right) \right] \\
&\le \exp\left( -\frac{2C_3 p}{2\lambda_1} + \frac{pq}{2} \right)
\end{aligned}$$
Using the same argument, we have $pn \ge pn\epsilon_n^2 \ge pq\log(q)$ and thus $pq = o(pn\epsilon_n^2)$ for large $n$. Consequently,
$$\exp\left( -\frac{2C_3 p}{2\lambda_1} + \frac{pq}{2} \right) \le \exp\left( -C_3 pn\epsilon_n^2 + o(pn\epsilon_n^2) \right) \le \exp(-C_3 n\epsilon_n^2)$$
and
$$\Pi(\|\Psi\|_1 > 2C_3 p) \le \exp(-C_3 n\epsilon_n^2) \quad (S49)$$
By combining the results of Equations (S48) and (S49), we conclude
$$\Pi(\mathcal{F}_n^c) \le 3\exp(-C_3 n\epsilon_n^2) = \exp(-C_3 n\epsilon_n^2 + \log(3)).$$
With our choice of $C_3$, the above probability is asymptotically bounded from above by $\exp(-C_2 n\epsilon_n^2)$ for some $C_2 \ge C_1 + 2$.
S5.2.3 Tests around a representative point

To apply the general theory, we need to construct tests $\varphi_n$ such that, for some $M_2 > C_1 + 1$:
$$E_{f_0}\varphi_n \lesssim e^{-M_2 n\epsilon_n^2/2}, \qquad \sup_{f \in \mathcal{F}_n : \rho(f_0,f) > M_2\epsilon_n^2} E_f(1 - \varphi_n) \lesssim e^{-M_2 n\epsilon_n^2} \quad (S50)$$
where $f = \prod_{i=1}^n N(X_i\Psi\Omega^{-1}, \Omega^{-1})$ while $f_0 = \prod_{i=1}^n N(X_i\Psi_0\Omega_0^{-1}, \Omega_0^{-1})$.
Instead of directly constructing $\varphi_n$ on the whole sieve, we use a method similar to Ning et al. (2020). That is, we construct tests against a representative point and show that these tests work well in a neighborhood of the representative point. We then take the supremum of these tests and show that the number of pieces needed to cover the entire sieve can be appropriately bounded.
For a representative point $f_1$, consider the Neyman-Pearson test for the single-point alternative $H_0 : f = f_0$, $H_1 : f = f_1$, namely $\phi_n = \mathbb{I}\{f_1/f_0 \ge 1\}$. If the average half-order Rényi divergence satisfies $-n^{-1}\log\left( \int\sqrt{f_0 f_1}\,d\mu \right) \ge \epsilon^2$, we will have:
$$E_{f_0}(\phi_n) \le \int_{f_1 > f_0}\sqrt{f_1/f_0}\,f_0\,d\mu \le \int\sqrt{f_1 f_0}\,d\mu \le e^{-n\epsilon^2}$$
$$E_{f_1}(1 - \phi_n) \le \int_{f_0 > f_1}\sqrt{f_0/f_1}\,f_1\,d\mu \le \int\sqrt{f_0 f_1}\,d\mu \le e^{-n\epsilon^2}$$
By Cauchy-Schwarz, for any alternative $f$ we can control the Type II error rate:
$$E_f(1 - \phi_n) \le \{E_{f_1}(1 - \phi_n)\}^{1/2}\{E_{f_1}(f/f_1)^2\}^{1/2}$$
So long as the second factor grows at most like $e^{cn\epsilon^2}$ for some suitably small $c$, the full expression can be controlled. Thus we consider neighborhoods around the representative point small enough that the second factor can indeed be bounded. Consider every density with parameters satisfying
$$\begin{aligned}
&|||\Omega|||_2 \le \|\Omega\|_1 \le 8C_3 q, \\
&\|\Psi_1 - \Psi\|_2 \le \|\Psi_1 - \Psi\|_1 \le \frac{1}{\sqrt{2C_3 np}}, \\
&|||\Omega_1 - \Omega|||_2 \le \|\Omega_1 - \Omega\|_1 \le \frac{1}{8C_3 n\max\{p,q\}^{3/2}} \le \frac{1}{8C_3 nq^{3/2}}
\end{aligned} \quad (S51)$$
We show that $E_{f_1}(f/f_1)^2$ is bounded on the above set when the parameters come from the sieve $\mathcal{F}_n$. Similar to Ning et al. (2020), denote $\Sigma_1 = \Omega_1^{-1}$, $\Sigma = \Omega^{-1}$, and $\Sigma_1^\star = \Omega^{1/2}\Sigma_1\Omega^{1/2}$, and let $\Delta_\Psi = \Psi - \Psi_1$ and $\Delta_\Omega = \Omega - \Omega_1$. Using the observation $\Psi\Omega^{-1} - \Psi_1\Omega_1^{-1} = (\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\Omega^{-1}$, we have
$$\begin{aligned}
E_{f_1}(f/f_1)^2 &= |\Sigma_1^\star|^{n/2}|2I - \Sigma_1^{\star-1}|^{-n/2} \times \exp\left( \sum_{i=1}^n X_i(\Psi\Omega^{-1} - \Psi_1\Omega_1^{-1})\Omega^{1/2}(2\Sigma_1^\star - I)^{-1}\Omega^{1/2}(\Psi\Omega^{-1} - \Psi_1\Omega_1^{-1})^\top X_i^\top \right) \\
&= |\Sigma_1^\star|^{n/2}|2I - \Sigma_1^{\star-1}|^{-n/2} \times \exp\left( \sum_{i=1}^n X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\Omega^{-1/2}(2\Sigma_1^\star - I)^{-1}\Omega^{-1/2}(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)^\top X_i^\top \right)
\end{aligned} \quad (S52)$$
For the first factor we use an argument similar to the one in Ning et al. (2020) (after their Equation 5.9). Since $\Omega \in \mathcal{F}_n$, we have $|||\Omega^{-1}|||_2 \le 1/\tau$. The fact that $|||\Omega_1 - \Omega|||_2 \le \delta_n' = 1/(8C_3 nq^{3/2})$ implies
$$|||\Sigma_1^\star - I|||_2 \le |||\Omega^{-1}|||_2\,|||\Omega_1 - \Omega|||_2 \le \delta_n'/\tau$$
and thus we can bound the spectrum of $\Sigma_1^\star$, i.e. $1 - \delta_n'/\tau \le \mathrm{eig}_1(\Sigma_1^\star) \le \mathrm{eig}_q(\Sigma_1^\star) \le 1 + \delta_n'/\tau$. Thus
$$\begin{aligned}
\left( \frac{|\Sigma_1^\star|}{|2I - \Sigma_1^{\star-1}|} \right)^{n/2} &= \exp\left( \frac{n}{2}\sum_{i=1}^q\log(\mathrm{eig}_i(\Sigma_1^\star)) - \frac{n}{2}\sum_{i=1}^q\log\left( 2 - \frac{1}{\mathrm{eig}_i(\Sigma_1^\star)} \right) \right) \\
&\le \exp\left( \frac{nq}{2}\log(1 + \delta_n'/\tau) - \frac{nq}{2}\log\left( \frac{1 - 2\delta_n'/\tau}{1 - \delta_n'/\tau} \right) \right) \\
&\le \exp\left( \frac{nq}{2}\,\delta_n'/\tau + \frac{nq}{2}\cdot\frac{\delta_n'/\tau}{1 - 2\delta_n'/\tau} \right) \\
&\le \exp(2nq\delta_n'/\tau) \le e
\end{aligned}$$
for large enough $n$, since $nq\delta_n' = 1/(8C_3\sqrt{q})$ is bounded. The third inequality is due to the fact that $1 - x^{-1} \le \log(x) \le x - 1$.
We can bound the logarithm of the second factor of Equation (S52) by
$$|||\Omega^{-1}|||_2\,|||(2\Sigma_1^\star - I)^{-1}|||_2\sum_{i=1}^n\|X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\|_2^2 \le \frac{2}{\tau}\sum_{i=1}^n\|X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\|_2^2$$
We can further bound the sum on the sieve:
$$\begin{aligned}
\sum_{i=1}^n\|X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\|_2^2 &\le 2\sum_{i=1}^n\|X_i\Delta_\Psi\|_2^2 + 2\sum_{i=1}^n\|X_i\Psi_1\Omega_1^{-1}\Delta_\Omega\|_2^2 \\
&\le 2np\,|||\Delta_\Psi|||_2^2 + 2np\,|||\Psi_1|||_2^2\,|||\Omega_1^{-1}|||_2^2\,\|\Delta_\Omega\|_F^2 \\
&\le 2np\cdot\frac{1}{2C_3 np} + 2np\left( 2C_3 p + \frac{1}{\sqrt{2C_3 np}} \right)^2\frac{1}{\tau^2}\cdot\frac{1}{(8C_3 n\max\{p,q\}^{3/2})^2} \\
&\le 2np\cdot\frac{1}{2C_3np} + 2np\cdot 16C_3^2 p^2\cdot\frac{1}{\tau^2}\cdot\frac{1}{(8C_3 n\max\{p,q\}^{3/2})^2} \\
&\lesssim 1
\end{aligned}$$
We bound the norm of $\Psi_1$ using the triangle inequality: $|||\Psi_1|||_2 \le |||\Psi|||_2 + |||\Psi_1 - \Psi|||_2 \le 2C_3 p + 1/\sqrt{2C_3 np}$. The first term in the final bound is $O(1)$ and the second is $o(1)$; combining these results, we conclude that the second factor of Equation (S52) is bounded.
Thus, following the argument of Ning et al. (2020), the desired test $\varphi_n$ in Equation (S50) can be obtained as the maximum of all tests $\phi_n$ described above.
S5.2.4 Pieces needed to cover the sieve

From here we show the contraction in log-affinity $\rho(f, f_0)$. To finish the proof, we check that the number of sets of the form described in Equation (S51) needed to cover the sieve $\mathcal{F}_n$, denoted by $N^*$, can be bounded by $\exp(Cn\epsilon_n^2)$ for some suitable constant $C$.
The number $N^*$ is called a covering number of $\mathcal{F}_n$. A closely related quantity is the packing number, which is defined as the maximum number of disjoint balls centered in a set, and which upper bounds the covering number. Both the covering number and the packing number can be used as measures of the complexity of a given set (Ghosal and van der Vaart, 2017).
The packing number of a set usually depends exponentially on the set's dimension. Because Ning et al. (2020) studied posteriors that place positive probability on exactly sparse parameters, they were able to directly bound the packing number of suitable low-dimensional sets. In our case, with an absolutely continuous prior, we need instead to control the packing number of "effectively low-dimensional" spaces.
Lemma S4 provides a sufficient condition under which the complexity (measured by the packing number) of a set of "effectively sparse" vectors can be bounded by the complexity of a set of exactly sparse vectors.
Lemma S4 (Packing a shallow cylinder in Lp). Consider a set of the form $E = A\times[-\delta,\delta]^{Q-s} \subset \mathbb{R}^Q$ where $A \subset \mathbb{R}^s$ (with integers $s > 0$ and $Q \ge s+1$). For $1 \le p < \infty$ and a given $T > 1$, if $\delta < \frac{\epsilon}{2[T(Q-s)]^{1/p}}$, then the packing numbers satisfy:
$$D(\epsilon, A, \|\cdot\|_p) \le D(\epsilon, E, \|\cdot\|_p) \le D((1 - T^{-1})^{1/p}\epsilon, A, \|\cdot\|_p)$$
Proof. The lower bound is immediate upon observing that $A\times\{0\}^{Q-s} \subset E$ and that the packing number of $A\times\{0\}^{Q-s}$ is exactly the packing number of $A$. For the upper bound, we show that every packing of $E$ can be sliced with the zero-plane to form a packing of $A$ with the same number of balls but a smaller radius (see Figure S7 for an illustration).
We first show that any Lp $\epsilon/2$-ball $B_\theta(\epsilon/2)$ centered in the set $E$ intersects the plane $\mathbb{R}^s\times\{0\}^{Q-s}$. Let the center be $\theta = (x_1,\dots,x_Q)$. It suffices to show that the center's distance to the plane is less than the radius of the ball. Since the center is in $E$, we have $|x_i| \le \delta$ for the last $Q-s$ coordinates. Denote the projection of the center onto the plane by $\theta_A = (x_1,\dots,x_s,0,\dots,0) \in A\times\{0\}^{Q-s}$. Then the Lp distance from the center to the plane satisfies
$$\|\theta_A - \theta\|_p^p = \sum_{i=s+1}^Q|x_i|^p \le (Q-s)\delta^p < T^{-1}(\epsilon/2)^p$$
Next we show that the slice $B_\theta(\epsilon/2)\cap(\mathbb{R}^s\times\{0\}^{Q-s})$ is also a ball centered at $\theta_A$ in the lower-dimensional plane. It suffices to show that its boundary is a sphere. Take a point $a$ from the boundary of the slice. The vector from the center $\theta$ to $a$ decomposes into the sum of two components with disjoint supports, namely the vector from $\theta_A$ to $a$ and the vector from $\theta$ to $\theta_A$; in this case
$$\|a - \theta_A\|_p^p + \|\theta_A - \theta\|_p^p = \|a - \theta\|_p^p = (\epsilon/2)^p$$
because $a - \theta_A$ has all zero entries in the last $Q-s$ coordinates and $\theta_A - \theta$ has all zero entries in the first $s$ coordinates. Thus any such point has a fixed distance
$$\|a - \theta_A\|_p^p = (\epsilon/2)^p - \|\theta_A - \theta\|_p^p$$
to $\theta_A$, the projection of the center $\theta$ on the plane of $A$; the collection of such points $a$ therefore forms a sphere in $A$'s plane.
From here, we can also lower bound the radius of the slice by $(1 - T^{-1})^{1/p}\epsilon/2$: since $\|\theta_A - \theta\|_p^p < T^{-1}(\epsilon/2)^p$, we have $\|a - \theta_A\|_p > (1 - T^{-1})^{1/p}\epsilon/2$. Thus the smaller ball must lie within the slice, i.e.
$$B_{\theta_A}((1 - T^{-1})^{1/p}\epsilon/2)\times\{0\}^{Q-s} \subset B_\theta(\epsilon/2)\cap(\mathbb{R}^s\times\{0\}^{Q-s}) \subset B_\theta(\epsilon/2) \quad (S53)$$
That is, any $\epsilon/2$-ball centered in $E$ contains a corresponding $(1-T^{-1})^{1/p}\epsilon/2$ lower-dimensional ball centered in $A$. With the above observations in hand, we can now prove the inequality by contradiction.
Suppose we have a packing $\{\theta_1,\dots,\theta_D\}$ of $E$ where $D$ is larger than the packing number of $A$ in the statement of the lemma. By Equation (S53), the lower-dimensional balls $B_{\theta_{iA}}((1 - T^{-1})^{1/p}\epsilon/2)$ must also be disjoint. Since the centers $\theta_{iA} \in A$, these balls form a packing of $A$ with radius $\epsilon' = (1 - T^{-1})^{1/p}\epsilon$. That is, we would have a packing with more balls than the packing number, yielding the desired contradiction. Thus we must have
$$D \le D((1 - T^{-1})^{1/p}\epsilon, A, \|\cdot\|_p)$$
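The key step in the proof, the additivity of $p$-th powers of Lp norms for vectors with disjoint supports, is easy to confirm numerically; a minimal sketch with arbitrary dimensions and exponent:

```python
import numpy as np

rng = np.random.default_rng(3)
s, Q, p_norm = 3, 8, 1.5  # hypothetical split and Lp exponent for illustration

def lp_pow(v, p):
    """||v||_p^p for a vector v."""
    return float(np.sum(np.abs(v) ** p))

# u is supported on the first s coordinates, v on the last Q - s coordinates,
# mimicking a - theta_A and theta_A - theta in the proof above.
u = np.concatenate([rng.normal(size=s), np.zeros(Q - s)])
v = np.concatenate([np.zeros(s), rng.normal(size=Q - s)])

# Disjoint supports make the p-th powers of the Lp norms additive.
additive = np.isclose(lp_pow(u + v, p_norm), lp_pow(u, p_norm) + lp_pow(v, p_norm))
```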
Figure S7: A schematic of the argument used in the proof of the packing number lemma. Two disjoint unit L1 balls (red) are centered at $(0.8, 0, 0.5)$ and $(-0.3, 1, -0.2)$, both within $A\times[-0.5, 0.5]$ (with $A = [-1,1]\times[-1,1]$ shown as the middle plane). Their slices in the $z = 0$ plane (blue) also form L1 balls in $\mathbb{R}^2$ whose radii are bounded from below and whose centers lie within $A$, thus inducing a packing of the lower-dimensional set.
Now we can bound the logarithm of the covering number $\log(N^*)$ similarly to Ning et al. (2020):
$$\begin{aligned}
\log(N^*) \le &\log N\left( \frac{1}{\sqrt{2C_3 np}}, \{\Psi \in \mathcal{B}_n^\Psi : \|\Psi\|_1 \le 2C_3 p\}, \|\cdot\|_1 \right) \\
&+ \log N\left( \frac{1}{8C_3 n\max\{p,q\}^{3/2}}, \{\Omega \in \mathcal{B}_n^\Omega : \|\Omega\|_1 \le 8C_3 q\}, \|\cdot\|_1 \right)
\end{aligned}$$
The two terms above can be treated in a similar way. Denote $s^\star = \max\{p, q, s_0^\Psi, s_0^\Omega\}$. There are multiple ways to allocate the effective zeros, which introduces the binomial coefficients below:
$$N\left( \frac{1}{8C_3n\max\{p,q\}^{3/2}}, \{\Omega \in \mathcal{B}_n^\Omega : \|\Omega\|_1 \le 8C_3q\}, \|\cdot\|_1 \right) \le \binom{Q}{C_3's^\star}N\left( \frac{1}{8C_3n\max\{p,q\}^{3/2}}, \{V \in \mathbb{R}^{Q+q} : |v_i| < \delta_\omega \text{ for } 1 \le i \le Q + q - C_3's^\star,\ \|V\|_1 \le 8C_3q\}, \|\cdot\|_1 \right)$$
$$N\left( \frac{1}{\sqrt{2C_3np}}, \{\Psi \in \mathcal{B}_n^\Psi : \|\Psi\|_1 \le 2C_3p\}, \|\cdot\|_1 \right) \le \binom{pq}{C_3's^\star}N\left( \frac{1}{\sqrt{2C_3np}}, \{V \in \mathbb{R}^{pq} : |v_i| < \delta_\psi \text{ for } 1 \le i \le pq - C_3's^\star,\ \|V\|_1 \le 2C_3p\}, \|\cdot\|_1 \right)$$
Note that $\Omega$ has $Q + q < 2Q$ free parameters. We first have
$$\log\binom{Q}{C_3's^\star} \lesssim s^\star\log(Q) \lesssim n\epsilon_n^2, \qquad \log\binom{pq}{C_3's^\star} \lesssim s^\star\log(pq) \lesssim n\epsilon_n^2$$
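The first inequality in each display is the standard bound $\log\binom{Q}{k} \leq k\log Q$, which holds because $\binom{Q}{k} \leq Q^k$. A quick numeric check, purely illustrative:

```python
import math

# Check log C(Q, k) <= k log Q, the elementary bound behind
# log C(Q, C'_3 s*) <~ s* log(Q).
checks = [
    math.log(math.comb(Q, k)) <= k * math.log(Q)
    for Q in (10, 100, 1000)
    for k in range(1, 8)
]
print(all(checks))  # True
```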
We further bound the covering number using the result in Lemma S4. Observe that $\{||V||_1 \leq 8C_3q\} \cap \{|v_i| < \delta_\omega \text{ for } 1 \leq i \leq Q+q-C_3's^\star\} \subset \{||V'||_1 \leq 8C_3q\} \times [-\delta_\omega, \delta_\omega]^{Q+q-C_3's^\star}$, where $V' \in \mathbb{R}^{C_3's^\star}$. We have
$$N\left(\frac{1}{8C_3n\max\{p,q\}^{3/2}},\, \{V : |v_i| < \delta_\omega \text{ for } 1 \leq i \leq Q+q-C_3's^\star,\, ||V||_1 \leq 8C_3q\},\, ||\cdot||_1\right) \leq N\left(\frac{1}{8C_3n\max\{p,q\}^{3/2}},\, \{V' \in \mathbb{R}^{C_3's^\star} : ||V'||_1 \leq 8C_3q\} \times [-\delta_\omega, \delta_\omega]^{Q+q-C_3's^\star},\, ||\cdot||_1\right)$$
We check the condition of Lemma S4 (with $p = 1$ and $T = 2$). By our assumption on $\xi_0$, we have:
$$(Q+q-C_3's^\star)\delta_\omega \leq 2Q\delta_\omega = \frac{2Q}{\xi_0-\xi_1}\log\left(\frac{1-\eta}{\eta}\frac{\xi_0}{\xi_1}\right) \lesssim \frac{Q\log(\max\{p,q,n\})}{\max\{Q,pq,n\}^{4+b/2}} \leq \frac{1}{\max\{Q,pq,n\}^{3+b/2}}$$
The denominator dominates $C_3n\max\{p,q\}^{3/2}$; thus for large enough $n$ we have $(Q+q-C_3's^\star)\delta_\omega \leq \frac{1}{32C_3n\max\{p,q\}^{3/2}}$, so by Lemma S4 we can control the covering number by the packing number:
$$\log N\left(\frac{1}{8C_3n\max\{p,q\}^{3/2}},\, \{V : |v_i| < \delta_\omega \text{ for } 1 \leq i \leq Q+q-C_3's^\star,\, ||V||_1 \leq 8C_3q\},\, ||\cdot||_1\right) \leq \log D\left(\frac{1}{16C_3n\max\{p,q\}^{3/2}},\, \{V' \in \mathbb{R}^{C_3's^\star} : ||V'||_1 \leq 8C_3q\},\, ||\cdot||_1\right) \lesssim s^\star\log\left(128C_3^2qn\max\{p,q\}^{3/2}\right) \lesssim n\epsilon_n^2$$
Similarly for $\Psi$,
$$N\left(\frac{1}{\sqrt{2}C_3np},\, \{V : |v_i| < \delta_\psi \text{ for } 1 \leq i \leq pq-C_3's^\star,\, ||V||_1 \leq 2C_3p\},\, ||\cdot||_1\right) \leq N\left(\frac{1}{\sqrt{2}C_3np},\, \{V' \in \mathbb{R}^{C_3's^\star} : ||V'||_1 \leq 2C_3p\} \times [-\delta_\psi, \delta_\psi]^{pq-C_3's^\star},\, ||\cdot||_1\right)$$
We again check the condition of Lemma S4 (again with $p = 1$ and $T = 2$):
$$(pq-C_3's^\star)\delta_\psi \leq pq\delta_\psi = \frac{pq}{\lambda_0-\lambda_1}\log\left(\frac{1-\theta}{\theta}\frac{\lambda_0}{\lambda_1}\right) \lesssim \frac{pq\log(\max\{p,q,n\})}{\max\{pq,n\}^{5/2+b/2}} \leq \frac{1}{\max\{pq,n\}^{3/2+b/2}}$$
The denominator dominates $\sqrt{2}C_3np$; thus for large enough $n$ we have $(pq-C_3's^\star)\delta_\psi \leq \frac{1}{4\sqrt{2}C_3np}$. Thus, similarly to $\Omega$, we have:
$$\log N\left(\frac{1}{\sqrt{2}C_3np},\, \{V : |v_i| < \delta_\psi \text{ for } 1 \leq i \leq pq-C_3's^\star,\, ||V||_1 \leq 2C_3p\},\, ||\cdot||_1\right) \leq \log D\left(\frac{1}{2\sqrt{2}C_3np},\, \{V' \in \mathbb{R}^{C_3's^\star} : ||V'||_1 \leq 2C_3p\},\, ||\cdot||_1\right) \lesssim s^\star\log\left(4\sqrt{2}C_3^2np^2\right) \lesssim n\epsilon_n^2$$
Thus we finally get the contraction under log-affinity.
S5.3 From log-affinity to $\Omega$ and $X\Psi\Omega^{-1}$
In this section we show the main result, Theorem 1, using the contraction under log-affinity. Denoting $\Psi - \Psi_0 = \Delta_\Psi$ and $\Omega - \Omega_0 = \Delta_\Omega$, the log-affinity $\frac{1}{n}\sum_i\rho(f_i, f_{0i})$ is
$$\frac{1}{n}\sum_i\rho(f_i, f_{0i}) = -\log\frac{|\Omega^{-1}|^{1/4}|\Omega_0^{-1}|^{1/4}}{|(\Omega^{-1}+\Omega_0^{-1})/2|^{1/2}} + \frac{1}{8n}\sum_i X_i\left(\Psi\Omega^{-1}-\Psi_0\Omega_0^{-1}\right)\left(\frac{\Omega^{-1}+\Omega_0^{-1}}{2}\right)^{-1}\left(\Psi\Omega^{-1}-\Psi_0\Omega_0^{-1}\right)^\top X_i^\top$$
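Each summand in this log-affinity is the Bhattacharyya distance between two Gaussian densities. As an illustrative sanity check, the univariate specialization of the formula, with variances standing in for $\Omega^{-1}$ and $\Omega_0^{-1}$, can be compared against direct numerical integration of $-\log\int\sqrt{f f_0}$. The numeric values below are assumptions for the demonstration, not quantities from the model.

```python
import math

# Two univariate Gaussians standing in for one outcome of f_i and f_{0i}.
mu, s2 = 0.5, 1.0      # mean and variance of f
mu0, s02 = -0.3, 2.25  # mean and variance of f_0

def pdf(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

# Numerical -log integral of sqrt(f * f0) via the trapezoid rule.
h, lo, hi = 1e-3, -20.0, 20.0
xs = [lo + i * h for i in range(int((hi - lo) / h) + 1)]
vals = [math.sqrt(pdf(x, mu, s2) * pdf(x, mu0, s02)) for x in xs]
integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
rho_numeric = -math.log(integral)

# Closed form matching the display: determinant term + (1/8) quadratic term.
sbar = (s2 + s02) / 2
rho_closed = (-math.log(s2 ** 0.25 * s02 ** 0.25 / sbar ** 0.5)
              + (mu - mu0) ** 2 / (8 * sbar))

print(abs(rho_numeric - rho_closed) < 1e-8)  # True
```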
Thus $\sum_i\rho(f_i, f_{0i}) \lesssim n\epsilon_n^2$ implies
$$-\log\frac{|\Omega^{-1}|^{1/4}|\Omega_0^{-1}|^{1/4}}{|(\Omega^{-1}+\Omega_0^{-1})/2|^{1/2}} \lesssim \epsilon_n^2, \qquad \frac{1}{8n}\sum_i X_i\left(\Psi\Omega^{-1}-\Psi_0\Omega_0^{-1}\right)\left(\frac{\Omega^{-1}+\Omega_0^{-1}}{2}\right)^{-1}\left(\Psi\Omega^{-1}-\Psi_0\Omega_0^{-1}\right)^\top X_i^\top \lesssim \epsilon_n^2 \tag{S54}$$
This is almost the same as Equations 5.11–5.12 of Ning et al. (2020). We can directly apply the result from their Equation 5.11, as it is the same as the first inequality in Equation (S54). Because $\Psi_0$ and $\Omega^{-1}$ have bounded operator norms and because $\Delta_\Omega$ can be controlled, the cross-term is also controlled by $\epsilon_n$. The first part of Equation (S54) implies
$$||\Omega^{-1}-\Omega_0^{-1}||_F^2 \lesssim \epsilon_n^2.$$
Meanwhile, $||\Omega^{-1}-\Omega_0^{-1}||_F^2 \lesssim \epsilon_n^2$ implies that for large enough $n$, the $L_2$ operator norm of $\Omega$ is bounded (since we assume bounds on the operator norm of $\Omega_0^{-1}$, and the difference cannot have eigenvalues so large that the sum has a zero eigenvalue). Using the bound $||AB||_F \leq |||A|||_2||B||_F$ together with the identity $\Omega_0 - \Omega = \Omega(\Omega^{-1}-\Omega_0^{-1})\Omega_0$, and by the assumption that $\Omega_0$ has bounded $L_2$ operator norm, we conclude that (S54) implies $||\Omega-\Omega_0||_F \lesssim \epsilon_n$.
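The identity $\Omega_0 - \Omega = \Omega(\Omega^{-1}-\Omega_0^{-1})\Omega_0$ used here follows by expanding the product; a minimal numeric check on assumed $2\times 2$ positive definite matrices (pure Python, values chosen only for illustration):

```python
# Verify Omega0 - Omega == Omega (Omega^{-1} - Omega0^{-1}) Omega0
# on small hand-picked positive definite matrices (2x2 case).

def mm(A, B):
    # 2x2 matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(A):
    # inverse of a 2x2 matrix
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def sub(A, B):
    return [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]

Omega = [[2.0, 0.5], [0.5, 1.5]]
Omega0 = [[1.8, 0.3], [0.3, 2.2]]

lhs = sub(Omega0, Omega)
rhs = mm(mm(Omega, sub(inv2(Omega), inv2(Omega0))), Omega0)

match = all(abs(lhs[i][j] - rhs[i][j]) < 1e-12
            for i in range(2) for j in range(2))
print(match)  # True
```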
Since $|||\Omega^{-1}|||_2$ is bounded for large enough $n$, we can directly apply an argument from Ning et al. (2020) (specifically the argument around their Equation 5.12) to conclude that the second part of (S54) implies:
$$\epsilon_n^2 \gtrsim \frac{1}{8n}\sum_i\left|\left|X_i\left(\Psi\Omega^{-1}-\Psi_0\Omega_0^{-1}\right)\right|\right|_2^2\,\Bigg|\Bigg|\Bigg|\frac{\Omega^{-1}+\Omega_0^{-1}}{2}\Bigg|\Bigg|\Bigg|_2^{-1} \gtrsim \frac{1}{n}\sum_i\left|\left|X_i\left(\Psi\Omega^{-1}-\Psi_0\Omega_0^{-1}\right)\right|\right|_2^2\Big/\sqrt{\epsilon_n^2+1}$$
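The first lower bound above uses the elementary fact that $v^\top M^{-1} v \geq ||v||_2^2\,|||M|||_2^{-1}$ for symmetric positive definite $M$. A small illustrative check on assumed values ($2\times 2$, pure Python):

```python
import math

# Check v^T M^{-1} v >= ||v||^2 / |||M|||_2 for a symmetric PD matrix M.
M = [[2.0, 0.6], [0.6, 1.2]]
v = (0.7, -0.4)

det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
Minv = [[M[1][1] / det, -M[0][1] / det],
        [-M[1][0] / det, M[0][0] / det]]

quad = sum(v[i] * Minv[i][j] * v[j] for i in range(2) for j in range(2))

# Largest eigenvalue (operator norm) of a symmetric 2x2 matrix.
mean = (M[0][0] + M[1][1]) / 2
disc = math.sqrt(((M[0][0] - M[1][1]) / 2) ** 2 + M[0][1] ** 2)
op_norm = mean + disc

print(quad >= (v[0] ** 2 + v[1] ** 2) / op_norm)  # True
```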
Combining all of these results yields the desired result.
S5.4 Contraction of $\Psi$

Contraction of $\Psi$ requires more assumptions on the design matrix $X$. Similarly to Ročková and George (2018) and Ning et al. (2020), we introduce the restricted eigenvalue
$$\phi^2(\tilde{s}) = \inf\left\{\frac{||XA||_F^2}{n||A||_F^2} : 0 \leq |\nu(A)| \leq \tilde{s}\right\}$$
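For small problems the restricted eigenvalue can be computed by brute force. The sketch below assumes $\nu(A)$ denotes the set of nonzero rows of $A$ (our reading of the notation), in which case $\phi^2(\tilde{s})$ reduces to the smallest eigenvalue of $X_S^\top X_S/n$ over column subsets $S$ with $|S| \leq \tilde{s}$; the design matrix here is made up purely for illustration.

```python
import itertools
import math

# Hypothetical small design matrix X (n = 4 observations, p = 3 predictors).
X = [[1.0, 0.2, -0.5],
     [0.3, 1.1, 0.4],
     [-0.7, 0.5, 1.0],
     [0.2, -0.3, 0.8]]
n, p = len(X), len(X[0])

def gram(cols):
    # (X_S)^T X_S for the column subset S
    return [[sum(X[i][a] * X[i][b] for i in range(n)) for b in cols]
            for a in cols]

def lambda_min(G):
    # smallest eigenvalue of a 1x1 or symmetric 2x2 matrix
    if len(G) == 1:
        return G[0][0]
    a, b, c = G[0][0], G[0][1], G[1][1]
    return (a + c) / 2 - math.sqrt(((a - c) / 2) ** 2 + b ** 2)

def phi2(s):
    # Brute-force restricted eigenvalue over all supports of size <= s
    return min(lambda_min(gram(S)) / n
               for k in range(1, s + 1)
               for S in itertools.combinations(range(p), k))

print(phi2(2))  # smallest restricted Gram eigenvalue over supports of size <= 2
```

Larger supports can only shrink the minimum, so $\phi^2$ is nonincreasing in $\tilde{s}$, which the brute-force values reflect.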
With this definition,
$$||X(\Psi\Omega^{-1}-\Psi_0\Omega_0^{-1})||_F^2 \lesssim n\epsilon_n^2, \qquad ||\Omega-\Omega_0||_F^2 \lesssim \epsilon_n^2$$
implies the result in Equation (15) of the main text. Namely,
$$||\Psi\Omega^{-1}-\Psi_0\Omega_0^{-1}||_F^2 = ||(\Delta_\Psi-\Psi_0\Omega^{-1}\Delta_\Omega)\Omega^{-1}||_F^2 \lesssim \epsilon_n^2\big/\phi^2(s_0^\Psi+C_3's^\star)$$
Since both $\Omega$ and $\Omega^{-1}$ have bounded operator norms when $||\Omega-\Omega_0||_F^2 \lesssim \epsilon_n^2$, for large enough $n$ we must have:
$$||\Delta_\Psi||_F - ||\Psi_0\Omega^{-1}\Delta_\Omega||_F \leq ||\Delta_\Psi-\Psi_0\Omega^{-1}\Delta_\Omega||_F \lesssim \epsilon_n\Big/\sqrt{\phi^2(s_0^\Psi+C_3's^\star)}$$
Since $\Psi_0$ and $\Omega^{-1}$ have bounded operator norms, $||\Psi_0\Omega^{-1}\Delta_\Omega||_F \lesssim \epsilon_n$, and we must have:
$$||\Delta_\Psi||_F \lesssim \epsilon_n\Big/\sqrt{\min\{\phi^2(s_0^\Psi+C_3's^\star),\, 1\}}$$
Thus we can conclude
$$\sup_{\Psi\in\mathcal{T}_0,\,\Omega\in\mathcal{H}_0}\mathbb{E}_0\,\Pi\left(||\Psi-\Psi_0||_F^2 \geq \frac{M'\epsilon_n^2}{\min\{\phi^2(s_0^\Psi+C_3's^\star),\, 1\}}\right) \to 0$$