Sparse Gaussian chain graphs with the spike-and-slab
LASSO: Algorithms and asymptotics
Yunyi Shen, Claudia Solís-Lemus, and Sameer K. Deshpande
July 15, 2022
Abstract
The Gaussian chain graph model simultaneously parametrizes (i) the direct effects of p predictors on q correlated outcomes and (ii) the residual partial covariance between pairs of outcomes. We introduce a new method for fitting sparse Gaussian chain graph models with spike-and-slab LASSO (SSL) priors. We develop an Expectation Conditional Maximization algorithm to obtain sparse estimates of the p × q matrix of direct effects and the q × q residual precision matrix. Our algorithm iteratively solves a sequence of penalized maximum likelihood problems with self-adaptive penalties that gradually filter out negligible regression coefficients and partial covariances. Because it adaptively penalizes model parameters, our method is seen to outperform fixed-penalty competitors on simulated data. We establish the posterior concentration rate for our model, buttressing our method's excellent empirical performance with strong theoretical guarantees. We use our method to reanalyze a dataset from a study of the effects of diet and residence type on the composition of the gut microbiome of elderly adults.
Laboratory for Information & Decision Systems, Massachusetts Institute of Technology. The work was
done while the author was at the University of Wisconsin–Madison.
Wisconsin Institute for Discovery & Dept. of Plant Pathology, University of Wisconsin–Madison. Correspondence to: solislemus@wisc.edu
Dept. of Statistics, University of Wisconsin–Madison. Correspondence to: sameer.deshpande@wisc.edu
arXiv:2207.07020v1 [stat.ME] 14 Jul 2022
1 Introduction
1.1 Motivation
There are between 10 and 100 trillion microorganisms living within each person's lower intestines. These bacteria, fungi, viruses, and other microbes constitute the human gut microbiome (Guinane and Cotter, 2013). Recent research suggests that the composition of the human gut microbiome can have a substantial effect on our health and well-being (Shreiner et al., 2015): microbes living in the gut play an integral role in our digestive and metabolic processes (Larsbrink et al., 2014; Belcheva et al., 2014); they can mediate our immune response to various diseases (Kamada and Núñez, 2014; Kim et al., 2017); and they can even influence disease pathogenesis and progression (Scher et al., 2013; Wang et al., 2011).
Additional emerging evidence suggests that the gut microbiome mediates the effects of
lifestyle factors such as diet and medication use on human health (Singh et al.,2017;Battson
et al.,2018;Hills Jr et al.,2019). That is, such lifestyle factors may first affect the com-
position of the gut microbiome, which in turn influences health outcomes. In fact, lifestyle
factors and medication use can impact the composition of the microbiome in direct and in-
direct ways. For instance, many antibiotics target and kill certain microbial species, thereby
directly affecting the abundances of the targeted species. However, by killing the targeted
species, the antibiotics may reduce the overall competition for nutrients, thereby allowing
non-targeted species to proliferate. In other words, by directly reducing the abundance of
certain targeted microbes, antibiotics may indirectly increase the abundance of other non-
targeted species. Our goal in this paper is to estimate such direct and indirect effects.
1.2 Sparse chain graph models
At a high level, the statistical challenge is to estimate the functional relationship between a vector of predictors x ∈ R^p and a vector of responses y ∈ R^q. In our application, we re-analyze a dataset from Claesson et al. (2012) containing n = 178 predictor-response pairs (x, y), where x contains measures of p = 11 factors related to diet, medication use, and residence type, and y contains the logit-transformed relative abundances of q = 14 different microbial taxa. Our goal is to uncover the direct and indirect effects of these factors on the abundance of each microbial taxon as well as any interactions between microbial taxa. The Gaussian chain graph model (Lauritzen and Wermuth, 1989; Frydenberg, 1990; Lauritzen and Richardson, 2002), which simultaneously parameterizes the direct effects of predictors on responses and the residual dependence structure between responses, is natural for these data. The model asserts that
$$y \mid \Psi, \Omega, x \sim \mathcal{N}\left(\Omega^{-1}\Psi^{\top}x,\ \Omega^{-1}\right), \qquad (1)$$
where Ψ is a p × q matrix and Ω is a symmetric, positive definite q × q matrix. As we detail in Section 2.1, the (j, k) entry of Ψ, ψ_{j,k}, quantifies the direct effect of the jth predictor X_j on the kth response Y_k. The (k, k') entry of Ω, ω_{k,k'}, encodes the residual conditional covariance between outcomes Y_k and Y_{k'} that remains after accounting for the direct effects of the predictors and all of the other response variables.
To fit the model in Equation (1), we must estimate pq + q(q+1)/2 unknown parameters. When the total number of unknown parameters is comparable to or larger than the sample size n, it is common to assume that the matrices Ψ and Ω are sparse. If ω_{k,k'} = 0, we can conclude that, after adjusting for the covariates and all other outcomes, outcomes Y_k and Y_{k'} are conditionally independent. If ψ_{j,k} = 0, we can conclude that X_j does not have a direct effect on the kth outcome variable Y_k. Furthermore, when ψ_{j,k} = 0, any marginal correlation between X_j and Y_k is due solely to X_j's direct effects on other outcomes Y_{k'} that are themselves conditionally correlated with Y_k.
1.3 Our contributions
We introduce the chain graph spike-and-slab LASSO (cgSSL) procedure for fitting the model in Equation (1) in a sparse fashion. At a high level, we place separate spike-and-slab LASSO priors (Ročková and George, 2018) on the entries of Ψ and on the off-diagonal entries of Ω in Equation (1). We derive an efficient Expectation Conditional Maximization algorithm to compute the maximum a posteriori (MAP) estimates of Ψ and Ω. Our algorithm is equivalent to solving a series of maximum likelihood problems with self-adaptive penalties. On synthetic data, we demonstrate that our algorithm displays excellent support recovery and estimation performance. We further establish the posterior contraction rate for each of Ψ, Ω, ΨΩ^{-1}, and XΨΩ^{-1}. Our contraction results imply that our proposed cgSSL procedure consistently estimates these quantities and also provides an upper bound for the minimax optimal rate of estimating these quantities in the Frobenius norm. To the best of our knowledge, ours are the first posterior contraction results for fitting sparse Gaussian chain graph models with element-wise priors on Ψ and Ω.
Here is an outline for the rest of our paper. We review the Gaussian chain graph model and the spike-and-slab LASSO in Section 2. We next introduce the cgSSL procedure in Section 3 and carefully derive our ECM algorithm for finding the MAP in Section 3.2. We present our asymptotic results in Section 4 before demonstrating the excellent finite sample performance of the cgSSL on several synthetic datasets in Section 5. We apply the cgSSL to our motivating gut microbiome data in Section 6. We conclude in Section 7 by outlining several avenues for future development.
2 Background
2.1 The Gaussian chain graph model
Graphical models are a convenient way to represent the dependence structure between several variables. Specifically, we can represent each variable as a node in a graph and we can draw edges to indicate conditional dependence between variables. Absence of an edge between two nodes indicates that the corresponding variables are conditionally independent given all of the other variables. In the context of our gut microbiome data, we can represent each predictor X_j with a node and each outcome Y_k with a node. We are primarily interested in detecting edges between predictors and outcomes and edges between outcomes. Figure 1a is a cartoon illustration of such a graphical model with p = 3 and q = 4. Note that we have not drawn any edges between the predictors, as such edges are not typically of primary interest.
Figure 1: Cartoon illustrations of a general graphical model (a) and a Gaussian chain graph model (b) with p = 3 covariates and q = 4 outcomes. Edges in both graphs encode conditional dependence relationships. The edge labels in (b) correspond to the non-zero parameters in Equation (1).
Without additional modeling assumptions, estimating a discrete graph like that in Figure 1a from n pairs of data (x_1, y_1), ..., (x_n, y_n) is a challenging task. The Gaussian chain graph model in Equation (1) translates the discrete graph estimation problem into a much more tractable continuous parameter estimation problem. Specifically, the model introduces two matrices, Ψ and Ω, and asserts that y | Ψ, Ω, x ∼ N(Ω^{-1}Ψ^⊤x, Ω^{-1}). Under the Gaussian chain graph model, X_j and Y_k are conditionally independent if and only if ψ_{j,k} = 0. Furthermore, Y_k and Y_{k'} are conditionally independent if and only if ω_{k,k'} = 0. In other words, by first estimating Ψ and Ω and then examining their supports, we can recover the underlying graphical model. Figure 1b reproduces the cartoon from Figure 1a with edges labelled by the corresponding non-zero parameters in Equation (1).
In the Gaussian chain graph model, the direct effect of X_j on Y_k is defined as
$$\mathbb{E}[Y_k \mid X_j = x_j + 1, Y_{-k}, X_{-j}, \Psi, \Omega] - \mathbb{E}[Y_k \mid X_j = x_j, Y_{-k}, X_{-j}, \Psi, \Omega] = \psi_{j,k}/\omega_{k,k}.$$
That is, fixing the values of all of the other covariates and all of the other outcomes, an increase of one unit in X_j is associated with a ψ_{j,k}/ω_{k,k} unit increase in the expectation of Y_k. Notice that the direct effect of X_j on Y_k is defined conditionally on the values of all of the other outcomes Y_{k'}. Because of this, the direct effect of X_j on Y_k is typically not equal to its marginal effect, which is defined as
$$\mathbb{E}[Y_k \mid X_j = x_j + 1, X_{-j}, \Psi, \Omega] - \mathbb{E}[Y_k \mid X_j = x_j, X_{-j}, \Psi, \Omega] = \beta_{j,k},$$
where β_{j,k} is the (j, k) entry of the matrix B = ΨΩ^{-1}. Notice that we can re-parametrize the Gaussian chain graph model in Equation (1) in terms of B:
$$y \mid B, \Omega, x \sim \mathcal{N}\left(B^{\top}x,\ \Omega^{-1}\right). \qquad (2)$$
We will refer to this re-parametrized model as the marginal regression model. There is a considerable literature on fitting sparse marginal regression models and we refer the reader to Deshpande et al. (2019) and references therein for a review.
Generally speaking, under (1), the supports of Ψ and B will be different. Specifically, it is possible for X_j to have a marginal effect but no direct effect on Y_k. For instance, in Figure 1, although X_3 does not directly affect Y_2, it may still be marginally correlated with Y_2 thanks to the conditional correlation between Y_2 and Y_3. That is, changing the value of X_3 can change the value of Y_3, which in turn changes the value of Y_2. Consequently, if we fit a sparse marginal regression model, we cannot generally expect to recover sparse estimates of the matrix of direct effects.
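To make this distinction concrete, the following small R sketch (our own illustration; the dimensions, values, and variable names are made up and are not from the paper's data) builds a sparse Ψ and a tridiagonal Ω and shows that the implied matrix of marginal effects B = ΨΩ^{-1} is nonetheless dense.

```r
p <- 3; q <- 4

# Direct effects: only X1 -> Y1 and X3 -> Y3 are non-zero
Psi <- matrix(0, nrow = p, ncol = q)
Psi[1, 1] <- 1.0
Psi[3, 3] <- -0.8

# Residual precision: tridiagonal, so adjacent outcomes are conditionally correlated
Omega <- diag(q)
Omega[abs(row(Omega) - col(Omega)) == 1] <- 0.4

B <- Psi %*% solve(Omega)   # marginal effects
round(B, 3)
# Psi[3, 2] is zero (no direct effect of X3 on Y2), but B[3, 2] is not:
# X3 affects Y3, which is conditionally correlated with Y2.
```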
2.2 Related works
Learning sparse chain graphs. McCarter and Kim (2014) proposed fitting sparse Gaussian chain graphical models by minimizing a penalized negative log-likelihood. They specifically proposed homogeneous L_1 penalties on the entries of Ψ and Ω and used cross-validation to set the penalty parameters for Ψ and Ω. Shen and Solis-Lemus (2021) developed a Bayesian version of that chain graphical LASSO and put a Gamma prior on the penalty parameters. In this way, they automatically learned the degree to which the entries ψ_{j,k} and ω_{k,k'} are shrunk to zero. Although these papers differ in how they determine the appropriate amount of penalization, both McCarter and Kim (2014) and Shen and Solis-Lemus (2021) deploy a single fixed penalty on all of the entries in Ψ and a single fixed penalty on all entries in Ω. With such fixed penalties, larger parameter estimates are shrunk towards zero as aggressively as the smaller parameter estimates, which can introduce substantial estimation bias.
Spike-and-slab variable selection with the EM algorithm. Spike-and-slab priors are the workhorses of sparse Bayesian modeling. As introduced by Mitchell and Beauchamp (1988), the spike-and-slab prior is a mixture of a point mass at 0 (the "spike") and a uniform distribution over a wide interval (the "slab"). George and McCulloch (1993) introduced a continuous relaxation of the original spike-and-slab prior, respectively replacing the point mass spike and uniform slab distributions with zero-mean Gaussians with extremely small and large variances. In this way, one may imagine generating all of the "essentially negligible" parameters in a model from the spike distribution and generating all of the "relevant" or "significant" parameters from the slab distribution. Despite their intuitive appeal, spike-and-slab priors usually produce extremely multimodal posterior distributions. In high dimensions, exploring these distributions with Markov chain Monte Carlo (MCMC) is computationally prohibitive.
In response, Ročková and George (2014) introduced EMVS, a fast EM algorithm targeting the maximum a posteriori (MAP) estimate of the regression parameters. They later extended EMVS, which used conditionally conjugate Gaussian spike and slab distributions, to use Laplacian spike and slab distributions in Ročková and George (2018). The resulting spike-and-slab LASSO (SSL) procedure demonstrated excellent empirical performance. At a high level, the SSL algorithm solves a sequence of L_1 penalized regression problems with self-adaptive penalties. The adaptive penalty mixing is key to the empirical success of the SSL (George and Ročková, 2020; Bai et al., 2021), as it facilitates shrinking larger parameter estimates to zero less aggressively than smaller parameter estimates.

Since Ročková and George (2014), the general EM technique for maximizing spike-and-slab posteriors has been successfully applied to many problems. For instance, Bai et al. (2020) introduced a grouped version of the SSL that adaptively shrinks groups of parameter values towards zero. Tang et al. (2017, 2018) similarly deployed the SSL and its grouped variant in generalized linear models. Outside of the single-outcome regression context, continuous spike-and-slab priors have been used to estimate sparse Gaussian graphical models (Li et al., 2019; Gan et al., 2019a,b), sparse factor models (Ročková and George, 2016), and biclusterings (Moran et al., 2021). Deshpande et al. (2019) introduced a multivariate SSL for estimating B and Ω in the marginal regression model in Equation (2). In each extension, the adaptive penalization performed by the EM algorithm resulted in support recovery and parameter estimation superior to that of fixed penalty methods.
The asymptotics of spike-and-slab variable selection. Beyond its excellent empirical performance, Ročková and George (2018)'s SSL enjoys strong theoretical support. Using general techniques proposed by Zhang and Zhang (2012) and Ghosal and van der Vaart (2017), they proved that, under mild conditions, the posterior induced by the SSL prior in high-dimensional, single-outcome linear regression contracts at a near minimax-optimal rate as n → ∞. Their contraction result implies that the MAP estimate returned by their EM algorithm is consistent and is, up to a log factor, rate-optimal. By directly applying Ghosal and van der Vaart (2017)'s general theory, Bai et al. (2020) extended these results to the group SSL posterior with an unknown variance.

In the context of Gaussian graphical models, Gan et al. (2019a) showed that the MAP estimator corresponding to placing spike-and-slab LASSO priors on the off-diagonal elements of a precision matrix is consistent. They did not, however, establish the contraction rate of the posterior. Ning et al. (2020) showed that the joint posterior distribution of (B, Ω) in the multivariate regression model in Equation (2) concentrates when using a group spike-and-slab prior with Laplace slab and point mass spike on B and a carefully selected prior on the eigendecomposition of Ω^{-1}. To the best of our knowledge, however, the asymptotic properties of the posterior formed by placing SSL priors on both the precision matrix Ω and the regression coefficients Ψ in Equation (1) have not yet been established.
3 Introducing the cgSSL
3.1 The cgSSL prior
To quantify the prior belief that many entries in Ψ are essentially negligible, we model each ψ_{j,k} as having been drawn either from a spike distribution, which is sharply concentrated around zero, or a slab distribution, which is much more diffuse. More specifically, we take the spike distribution to be Laplace(λ_0) and the slab distribution to be Laplace(λ_1), where 0 < λ_1 < λ_0 are fixed positive constants. This way, the spike distribution is much more heavily concentrated around zero than is the slab. We further let θ ∈ [0, 1] be the prior probability that each ψ_{j,k} is drawn from the slab and model the ψ_{j,k}'s as conditionally independent given θ. Thus, the prior density for Ψ, conditional on θ, is given by
$$\pi(\Psi \mid \theta) = \prod_{j=1}^{p}\prod_{k=1}^{q}\left[\theta\frac{\lambda_1}{2}e^{-\lambda_1|\psi_{j,k}|} + (1-\theta)\frac{\lambda_0}{2}e^{-\lambda_0|\psi_{j,k}|}\right]. \qquad (3)$$
Since Ω is symmetric, it is enough to specify a prior on the entries ω_{k,k'} where k ≤ k'. To this end, we begin by placing an entirely analogous spike-and-slab prior on the off-diagonal entries. That is, we model each ω_{k,k'} as being drawn from a Laplace(ξ_1), with probability η ∈ [0, 1], or a Laplace(ξ_0), with probability 1 − η, where 0 < ξ_1 < ξ_0. We similarly model each ω_{k,k'} as conditionally independent given η and place independent Exp(ξ_1) priors on the diagonal entries of Ω. We then truncate the resulting distribution of Ω | η to the cone of symmetric positive definite matrices, yielding the prior density
$$\pi(\Omega \mid \eta) \propto \left[\prod_{1 \leq k < k' \leq q}\left(\eta\frac{\xi_1}{2}e^{-\xi_1|\omega_{k,k'}|} + (1-\eta)\frac{\xi_0}{2}e^{-\xi_0|\omega_{k,k'}|}\right)\right]\times\left[\prod_{k=1}^{q}\xi_1 e^{-\xi_1\omega_{k,k}}\right]\times\mathbb{1}(\Omega \succ 0). \qquad (4)$$
Observe that 1 − θ and 1 − η respectively quantify the proportion of entries in Ψ and Ω that are essentially negligible. To model our uncertainty about these proportions, we place Beta priors on each of θ and η. Specifically, we independently model θ ∼ Beta(a_θ, b_θ) and η ∼ Beta(a_η, b_η), where a_θ, b_θ, a_η, b_η > 0 are fixed positive constants.
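For concreteness, the following short R sketch (our own illustration, not code from the authors' package; the hyperparameter values are arbitrary) evaluates the logarithm of the prior density in Equation (3) for a given Ψ and θ.

```r
# Log prior density of Psi given theta under Equation (3): every entry follows a
# two-component mixture of Laplace densities with rates lambda1 (slab) and lambda0 (spike).
log_prior_psi <- function(Psi, theta, lambda1, lambda0) {
  slab  <- theta       * (lambda1 / 2) * exp(-lambda1 * abs(Psi))
  spike <- (1 - theta) * (lambda0 / 2) * exp(-lambda0 * abs(Psi))
  sum(log(slab + spike))
}

# Example: a 2 x 2 Psi with two exact zeros, a small entry, and a large entry
log_prior_psi(matrix(c(0, 0.5, -2, 0), 2, 2), theta = 0.1, lambda1 = 1, lambda0 = 10)
```

Entries near zero pick up most of their density from the spike component, while large entries are supported almost entirely by the slab; this is the mechanism that later drives the adaptive penalties in Section 3.3.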
3.2 Targeting the MAP
Unfortunately, the posterior distribution of (Ψ, θ, Ω, η) | Y is analytically intractable. Further, it is generally high-dimensional and rather multimodal, rendering stochastic search techniques like Markov chain Monte Carlo computationally impractical. We instead follow Ročková and George (2018)'s example and focus on finding the maximum a posteriori (MAP) estimate of (Ψ, θ, Ω, η). Throughout, we assume that the columns of X have been centered and scaled to have norm √n.
To this end, we attempt to maximize the log posterior density
$$\begin{aligned}
\log \pi(\Psi, \theta, \Omega, \eta \mid Y) =\ & \frac{n}{2}\log|\Omega| - \frac{1}{2}\operatorname{tr}\left[(Y - X\Psi\Omega^{-1})\,\Omega\,(Y - X\Psi\Omega^{-1})^{\top}\right] \\
& + \sum_{j=1}^{p}\sum_{k=1}^{q}\log\left(\theta\lambda_1 e^{-\lambda_1|\psi_{j,k}|} + (1-\theta)\lambda_0 e^{-\lambda_0|\psi_{j,k}|}\right) \\
& + \sum_{k=1}^{q-1}\sum_{k'>k}\log\left(\eta\xi_1 e^{-\xi_1|\omega_{k,k'}|} + (1-\eta)\xi_0 e^{-\xi_0|\omega_{k,k'}|}\right) \\
& - \sum_{k=1}^{q}\xi_1\omega_{k,k} + \log\mathbb{1}(\Omega \succ 0) \\
& + (a_\theta - 1)\log(\theta) + (b_\theta - 1)\log(1-\theta) \\
& + (a_\eta - 1)\log(\eta) + (b_\eta - 1)\log(1-\eta). \qquad (5)
\end{aligned}$$
Optimizing the log posterior density directly is complicated by the non-concavity of log π(Ω | η). Instead, following Deshpande et al. (2019), we iteratively optimize a surrogate objective using an EM-like algorithm.

To motivate this approach, observe that we can obtain the prior density π(Ω | η) in Equation (4) by marginalizing an augmented prior
$$\pi(\Omega \mid \eta) = \int \pi(\Omega \mid \delta)\,\pi(\delta \mid \eta)\,d\delta,$$
where δ = {δ_{k,k'} : 1 ≤ k < k' ≤ q} is a collection of q(q−1)/2 i.i.d. Bernoulli(η) variables and
$$\pi(\Omega \mid \delta) \propto \left[\prod_{1 \leq k < k' \leq q}\left(\xi_1 e^{-\xi_1|\omega_{k,k'}|}\right)^{\delta_{k,k'}}\left(\xi_0 e^{-\xi_0|\omega_{k,k'}|}\right)^{1-\delta_{k,k'}}\right]\times\left[\prod_{k=1}^{q}\xi_1 e^{-\xi_1\omega_{k,k}}\right]\times\mathbb{1}(\Omega \succ 0).$$
In our augmented prior, δ_{k,k'} indicates whether ω_{k,k'} is drawn from the slab (δ_{k,k'} = 1) or the spike (δ_{k,k'} = 0).

The above marginalization immediately suggests an EM algorithm: rather than optimize log π(Ψ, θ, Ω, η | Y) directly, we can iteratively optimize a surrogate objective formed by marginalizing the augmented log posterior density. That is, starting from some initial guess (Ψ^{(0)}, θ^{(0)}, Ω^{(0)}, η^{(0)}), for t ≥ 1, the tth iteration of our algorithm consists of two steps. In the first step, we compute the surrogate objective
$$F^{(t)}(\Psi, \theta, \Omega, \eta) = \mathbb{E}_{\delta}\left[\log \pi(\Psi, \theta, \Omega, \eta, \delta \mid Y)\ \middle|\ \Psi = \Psi^{(t-1)}, \Omega = \Omega^{(t-1)}, \theta = \theta^{(t-1)}, \eta = \eta^{(t-1)}\right],$$
where the expectation is taken with respect to the conditional posterior distribution of the indicators δ given the current value of (Ψ, θ, Ω, η). Then in the second step, we maximize the surrogate objective and set (Ψ^{(t)}, θ^{(t)}, Ω^{(t)}, η^{(t)}) = arg max F^{(t)}(Ψ, θ, Ω, η).
It turns out that, given Ω and η, the indicators δ_{k,k'} are conditionally independent Bernoulli random variables whose means are easy to evaluate, making it simple to compute a closed form expression for the surrogate objective F^{(t)}. Unfortunately, maximizing F^{(t)} is still difficult. Consequently, similar to Deshpande et al. (2019), we carry out two conditional maximizations, first optimizing with respect to (Ψ, θ) while holding (Ω, η) fixed, and then optimizing with respect to (Ω, η) while holding (Ψ, θ) fixed. That is, in the second step of each iteration of our algorithm, we set
$$(\Psi^{(t)}, \theta^{(t)}) = \arg\max_{\Psi, \theta}\ F^{(t)}(\Psi, \theta, \Omega^{(t-1)}, \eta^{(t-1)}) \qquad (6)$$
$$(\Omega^{(t)}, \eta^{(t)}) = \arg\max_{\Omega, \eta}\ F^{(t)}(\Psi^{(t)}, \theta^{(t)}, \Omega, \eta). \qquad (7)$$
In summary, we propose finding the MAP estimate of (Ψ, θ, Ω, η) using an Expectation Conditional Maximization (ECM; Meng and Rubin, 1993) algorithm.

When we fix the values of Ω and η, the surrogate objective F^{(t)} is separable in Ψ and θ. That is, the objective function F^{(t)}(Ψ, θ, Ω^{(t-1)}, η^{(t-1)}) in Equation (6) can be written as the sum of a function of Ψ alone and a function of θ alone. This means that we can separately compute Ψ^{(t)} and θ^{(t)} while fixing (Ω, η) = (Ω^{(t-1)}, η^{(t-1)}). The objective function in Equation (7) is similarly separable and we can separately compute Ω^{(t)} and η^{(t)} while fixing (Ψ, θ) = (Ψ^{(t)}, θ^{(t)}). As we describe in Section S1 of the Supplementary Materials, computing θ^{(t)} and η^{(t)} is relatively straightforward; we compute θ^{(t)} with a simple Newton algorithm and there is a closed form expression for η^{(t)}. The main computational challenge is computing Ψ^{(t)} and Ω^{(t)}. In the next subsection, we detail how updating Ψ and Ω reduces to solving penalized likelihood problems with self-adaptive penalties.
3.3 Adaptive penalty mixing
Before describing how we compute Ψ^{(t)} and Ω^{(t)}, we introduce two important functions:
$$p^{\star}(x, \theta) = \frac{\theta\lambda_1 e^{-\lambda_1|x|}}{\theta\lambda_1 e^{-\lambda_1|x|} + (1-\theta)\lambda_0 e^{-\lambda_0|x|}}, \qquad q^{\star}(x, \eta) = \frac{\eta\xi_1 e^{-\xi_1|x|}}{\eta\xi_1 e^{-\xi_1|x|} + (1-\eta)\xi_0 e^{-\xi_0|x|}}.$$
For each 1 ≤ j ≤ p and 1 ≤ k ≤ q, p*(ψ_{j,k}, θ) is the conditional posterior probability that ψ_{j,k} was drawn from the Laplace(λ_1) slab distribution. Similarly, for 1 ≤ k < k' ≤ q, q*(ω_{k,k'}, η) is just the conditional posterior probability that ω_{k,k'} was drawn from the Laplace(ξ_1) slab. That is, q*(ω_{k,k'}, η) = E[δ_{k,k'} | Y, Ψ, Ω, θ, η].
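For intuition, here is a small R sketch of these two weighting functions (our own illustration, not code from the authors' package; the hyperparameter values below are arbitrary). Both are computed on the log scale to avoid underflow.

```r
# Conditional probability that a value x was drawn from the slab, given mixing
# weight theta and Laplace rates lambda1 (slab) and lambda0 (spike), lambda1 < lambda0.
p_star <- function(x, theta, lambda1, lambda0) {
  log_slab  <- log(theta)     + log(lambda1) - lambda1 * abs(x)
  log_spike <- log(1 - theta) + log(lambda0) - lambda0 * abs(x)
  1 / (1 + exp(log_spike - log_slab))
}

# q_star has exactly the same form with (eta, xi1, xi0) in place of (theta, lambda1, lambda0)
q_star <- function(x, eta, xi1, xi0) p_star(x, eta, xi1, xi0)

# Larger |x| receives a larger slab probability:
p_star(c(0, 0.1, 1), theta = 0.5, lambda1 = 1, lambda0 = 20)
```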
Updating Ψ. Fixing the value Ω = Ω^{(t-1)}, computing Ψ^{(t)} is equivalent to solving the following penalized optimization problem
$$\Psi^{(t)} = \arg\max_{\Psi}\left\{-\frac{1}{2}\operatorname{tr}\left[(Y\Omega - X\Psi)\,\Omega^{-1}\,(Y\Omega - X\Psi)^{\top}\right] + \sum_{j,k}\operatorname{pen}(\psi_{j,k};\theta)\right\}, \qquad (8)$$
where
$$\operatorname{pen}(\psi_{j,k};\theta) = \log\frac{\pi(\psi_{j,k}\mid\theta)}{\pi(0\mid\theta)} = -\lambda_1|\psi_{j,k}| + \log\frac{p^{\star}(0,\theta)}{p^{\star}(\psi_{j,k},\theta)}.$$
Note that the first term in the objective of Equation (8) can be obtained by distributing a factor of Ω through the quadratic form that appears in the log-likelihood (see Equations (S5) and (S7) of the Supplementary Materials for details).

Following arguments similar to those in Deshpande et al. (2019), the Karush-Kuhn-Tucker (KKT) condition for (8) tells us that
$$\psi^{(t)}_{j,k} = n^{-1}\left[|z_{j,k}| - \lambda^{\star}(\psi^{(t)}_{j,k},\theta)\right]_{+}\operatorname{sign}(z_{j,k}), \qquad (9)$$
where
$$z_{j,k} = n\psi^{(t)}_{j,k} + X_j^{\top}r_k + \sum_{k'\neq k}\frac{(\Omega^{-1})_{k,k'}}{(\Omega^{-1})_{k,k}}X_j^{\top}r_{k'}, \qquad r_{k'} = (Y\Omega - X\Psi^{(t)})_{k'},$$
$$\lambda^{\star}(\psi^{(t)}_{j,k},\theta) = \lambda_1\, p^{\star}(\psi^{(t)}_{j,k},\theta) + \lambda_0\left(1 - p^{\star}(\psi^{(t)}_{j,k},\theta)\right).$$
The KKT conditions suggest a natural coordinate-ascent strategy for computing Ψ^{(t)}: starting from some initial guess Ψ^0, we cyclically update the entries ψ_{j,k} by soft-thresholding ψ_{j,k} at λ*_{j,k}. During our cyclical coordinate ascent, whenever the current value of ψ_{j,k} is very large, the corresponding value of p*(ψ_{j,k}, θ) will be close to one, and the threshold λ* will be close to the slab penalty λ_1. On the other hand, when ψ_{j,k} is very small, the corresponding p* will be close to zero and the threshold λ* will be close to the spike penalty λ_0. Since λ_1 < λ_0, we are therefore able to apply a stronger penalty to the smaller entries of Ψ and a weaker penalty to the larger entries. As our cyclical coordinate ascent proceeds, we iteratively refine the thresholds λ*, thereby adaptively shrinking our estimates of ψ_{j,k}.

Before proceeding, we note that the quantity z_{j,k} depends not only on the inner product between X_j, the jth column of the design matrix, and the partial residual r_k but also on the inner products between X_j and all other partial residuals r_{k'} for k' ≠ k. Practically, this means that in our cyclical coordinate ascent algorithm, our estimate of the direct effect of predictor X_j on outcome Y_k can depend on how well we have fit all other outcomes Y_{k'}. Moreover, the entries of Ω^{-1} determine the degree to which ψ_{j,k} depends on the outcomes Y_{k'} for k' ≠ k. Specifically, if (Ω^{-1})_{k,k'} = 0, then we are unable to leverage information contained in Y_{k'} to inform our estimate of ψ_{j,k}.
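The following R sketch (our own, not the authors' implementation) carries out one full sweep of this cyclical coordinate ascent. It assumes, consistent with our reading of Equations (8) and (9) above, that the columns of X have norm √n, that Ω is held fixed, and that the response enters through the working matrix YΩ; the threshold is evaluated at the current value of each entry.

```r
# One sweep of adaptive soft-thresholding updates for Psi (a sketch of Equation (9)).
sweep_psi <- function(X, Y, Psi, Omega, theta, lambda1, lambda0) {
  n <- nrow(X); p <- ncol(X); q <- ncol(Y)
  Sigma  <- solve(Omega)                     # Omega^{-1}
  Ytilde <- Y %*% Omega                      # working response
  lambda_star <- function(psi) {             # self-adaptive threshold
    log_slab  <- log(theta)     + log(lambda1) - lambda1 * abs(psi)
    log_spike <- log(1 - theta) + log(lambda0) - lambda0 * abs(psi)
    pstar <- 1 / (1 + exp(log_spike - log_slab))
    lambda1 * pstar + lambda0 * (1 - pstar)
  }
  for (j in seq_len(p)) {
    for (k in seq_len(q)) {
      R <- Ytilde - X %*% Psi                # partial residuals r_1, ..., r_q
      w <- Sigma[k, ] / Sigma[k, k]          # weights (Omega^{-1})_{k,k'} / (Omega^{-1})_{k,k}
      z <- n * Psi[j, k] + sum(drop(crossprod(X[, j], R)) * w)
      thr <- lambda_star(Psi[j, k])
      Psi[j, k] <- sign(z) * max(abs(z) - thr, 0) / n
    }
  }
  Psi
}
```

Recomputing the full residual matrix inside the inner loop keeps the sketch short; an efficient implementation would instead update the residuals in place after each coordinate move.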
Updating Ω. Fixing Ψ = Ψ^{(t)} and letting S = n^{-1}Y^⊤Y and M = n^{-1}(XΨ)^⊤XΨ, we can compute Ω^{(t)} by solving
$$\Omega^{(t)} = \arg\max_{\Omega \succ 0}\left\{\frac{n}{2}\Big(\log|\Omega| - \operatorname{tr}(S\Omega) - \operatorname{tr}(M\Omega^{-1})\Big) - \sum_{k=1}^{q}\Big[\xi_1\omega_{k,k} + \sum_{k'>k}\xi^{\star}_{k,k'}|\omega_{k,k'}|\Big]\right\}, \qquad (10)$$
where ξ*_{k,k'} = ξ_1 q*(ω^{(t-1)}_{k,k'}, η^{(t-1)}) + ξ_0 (1 − q*(ω^{(t-1)}_{k,k'}, η^{(t-1)})).
The objective in Equation (10) is extremely similar to the conventional graphical LASSO (GLASSO; Friedman et al., 2008) objective. However, there are two crucial differences. First, because the conditional mean of Y depends on Ω in the Gaussian chain graph model (1), we have an additional term tr(MΩ^{-1}) that is absent in the GLASSO objective. Second, and more substantively, the objective in Equation (10) contains individualized penalties ξ*_{k,k'} on the off-diagonal entries of Ω. Here, the penalty ξ*_{k,k'} will be large (resp. small) whenever the previous estimate ω^{(t-1)}_{k,k'} is small (resp. large). In other words, as we run our ECM algorithm, we can refine the amount of penalization applied to each off-diagonal entry of Ω.

Although the objective in Equation (10) is somewhat different from the GLASSO objective, we can solve it by suitably modifying an existing GLASSO algorithm. Specifically, we solve the optimization problem in Equation (10) with a modified version of Hsieh et al. (2011)'s QUIC algorithm. Our solution repeatedly (i) forms a quadratic approximation of the objective, (ii) computes a suitable Newton direction, and (iii) follows that Newton direction for a step size chosen with an Armijo rule. In Section S2.4 of the Supplementary Materials, we show that the optimization problem in Equation (10) has a unique solution and that our modification of QUIC converges to that unique solution.
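To make the comparison with GLASSO concrete, here is a small R sketch (our own, with made-up inputs) that evaluates the objective in Equation (10), as reconstructed above, for a candidate Ω. The two departures from the standard GLASSO objective are visible directly in the code: the extra tr(MΩ^{-1}) term and the entry-specific off-diagonal penalties ξ*_{k,k'}.

```r
# Evaluate the Omega-update objective (10) at a candidate precision matrix Omega,
# given S = Y'Y/n, M = (X Psi)'(X Psi)/n, a matrix xi_star of entry-specific
# off-diagonal penalties, the diagonal penalty rate xi1, and the sample size n.
omega_objective <- function(Omega, S, M, xi_star, xi1, n) {
  eigvals <- eigen(Omega, symmetric = TRUE, only.values = TRUE)$values
  if (min(eigvals) <= 0) return(-Inf)        # outside the positive definite cone
  loglik <- (n / 2) * (sum(log(eigvals)) -
                       sum(diag(S %*% Omega)) -
                       sum(diag(M %*% solve(Omega))))
  pen <- xi1 * sum(diag(Omega)) +
         sum(xi_star[upper.tri(xi_star)] * abs(Omega[upper.tri(Omega)]))
  loglik - pen
}

# Toy inputs
q <- 4; n <- 100
S <- diag(q); M <- 0.1 * diag(q)
xi_star <- matrix(5, q, q)
omega_objective(diag(q), S, M, xi_star, xi1 = 0.01, n = n)
```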
3.4 Selecting the spike and slab penalties
The proposed ECM algorithm depends on two sets of hyperparameters. The first set, containing a_θ, b_θ, a_η, and b_η, encodes our initial beliefs about the overall proportion of non-negligible entries in Ψ and Ω. We set a_θ = 1, b_θ = pq, a_η = 1, and b_η = q, similar to Deshpande et al. (2019). The second set of hyperparameters consists of the spike and slab penalties λ_0, λ_1, ξ_0, and ξ_1. Rather than run the cgSSL with a single set of these penalties, we use Deshpande et al. (2019)'s path-following dynamic posterior exploration (DPE) strategy to obtain the MAP estimates corresponding to several different choices of spike penalties.
Specifically, we fix the slab penalties λ_1 and ξ_1 and specify grids of increasing spike penalties I_λ = {λ_0^{(1)} < ··· < λ_0^{(L)}} and I_ξ = {ξ_0^{(1)} < ··· < ξ_0^{(L)}}. We then run the cgSSL with warm starts for each combination of spike penalties, yielding a set of posterior modes {Ψ^{(s,t)}, θ^{(s,t)}, Ω^{(s,t)}, η^{(s,t)}} indexed by the choices (λ_0^{(s)}, ξ_0^{(t)}). To warm start the estimation of the mode corresponding to (λ_0^{(s)}, ξ_0^{(t)}), we first compute the modes found with (λ_0^{(s-1)}, ξ_0^{(t-1)}), (λ_0^{(s)}, ξ_0^{(t-1)}), and (λ_0^{(s-1)}, ξ_0^{(t)}). We evaluate the posterior density using (λ_0, ξ_0) = (λ_0^{(s)}, ξ_0^{(t)}) at each of the three previously computed modes and initialize at the mode with the largest density.
Following this DPE strategy provides a snapshot of the many different cgSSL posteriors. However, it can be computationally intensive, as we must run our ECM algorithm to convergence for every pair of spike penalties. Deshpande et al. (2019) introduced a faster variant, called dynamic conditional posterior exploration (DCPE), which we also implemented for the cgSSL. In DCPE, we first run our ECM algorithm with warm starts over the ladder I_λ while keeping Ω = I fixed. Then, fixing (Ψ, θ) at the final value from the first step, we run our ECM algorithm with warm starts over the ladder I_ξ. Finally, we run our ECM algorithm starting from the final estimates of the parameters obtained in the first two steps with (λ_0, ξ_0) = (λ_0^{(L)}, ξ_0^{(L)}). Generally speaking, DPE and DCPE trace different paths through the parameter space and typically return different final estimates.
When the spike and slab penalties are similar in size (i.e., λ_1 ≈ λ_0 and ξ_1 ≈ ξ_0), we noticed that our ECM algorithm would sometimes return very dense estimates of Ψ and diagonal estimates of Ω with very large diagonal entries. Essentially, when the spike and slab distributions are not too different, our ECM algorithm has a tendency to overfit the response with a dense Ψ, leaving very little residual variation to be quantified with Ω. On further investigation, we found that we could detect such pathological behavior by examining the condition number of the matrix Y − XΨ. To avoid propagating dense Ψ's and diagonal Ω's through the DPE and DCPE, we terminate our ECM early whenever the condition number of Y − XΨ exceeds 10n. We then set the corresponding Ψ^{(s)} = 0 and Ω^{(t)} = I and continue the dynamic exploration from that point. While this is an admittedly ad hoc heuristic, we have found that it works well in practice and note that Moran et al. (2019) utilized a similar strategy in the single-outcome high-dimensional linear regression setting with unknown variance.
The DPE and DCPE cgSSL procedures are implemented in the mSSL R(R Core Team,
2022) package, which is available at https://github.com/YunyiShen/mSSL. Note that this
package contains a new implementation of Deshpande et al. (2019)’s mSSL procedure as
well.
4 Asymptotic theory of cgSSL
If the Gaussian chain graph model in Equation (1) is well-specified, that is, if our data (x_i, y_i) are truly generated according to the model, will the posterior distribution of Ψ and Ω collapse to a point mass at the true data generating parameters as n → ∞? Such a collapse would, among other things, imply that the MAP estimate returned by the cgSSL procedure described in Section 3 is consistent, providing an asymptotic justification for its use. In this section, we answer the question affirmatively: under some mild assumptions and with some slight modifications, the cgSSL posterior concentrates around the truth. We further establish the rate of concentration, which quantifies the speed at which the posterior distribution shrinks to the true data generating parameters. We begin by briefly reviewing our general proof strategy before precisely stating our assumptions and results. Proofs of our main results are available in Section S5 of the Supplementary Materials.
4.1 Proof strategy
To establish the posterior concentration rate for Ψ and Ω, we followed Ning et al. (2020)
and Bai et al. (2020) and first showed that the posterior concentrates in log-affinity (see
Section S5.3 in the Supplementary Materials for details). Posterior concentration of the
individual parameters followed as a consequence. To show that the posterior concentrates
in log-affinity, we appealed to general results about posterior concentration for independent
but non-identically distributed observations. Specifically, we verified the three conditions
of Theorem 8.23 of Ghosal and van der Vaart (2017). First, we confirmed that the cgSSL prior introduced in Section 3.1 places enough prior probability mass in small neighborhoods around every possible choice of (Ψ, Ω). This was done by verifying that for each (Ψ, Ω), the prior probability contained in a small Kullback-Leibler ball around (Ψ, Ω) can be lower bounded by a function of the ball's radius (the so-called "KL-condition" in Lemma S2 of the Supplementary Materials). Then we studied a sequence of likelihood ratio tests defined on sieves of the parameter space that can correctly distinguish between parameter values that are sufficiently far away from each other in log-affinity. In particular, we bounded the error rate of such tests and then bounded the covering number of the sieves (Lemma S4 of the Supplementary Materials).
Ning et al. (2020) studied the sparse marginal regression model in Equation (2) instead of the sparse chain graph. Although these are somewhat different models, our overall proof strategy is quite similar to theirs. However, we pause here to highlight some important technical differences. First, Ning et al. (2020) placed a prior on Ω's eigendecomposition while we placed an arguably simpler and more natural element-wise prior on Ω. The second and more substantive difference is in how we bound the covering number of sieves of the underlying parameter space. Because Ning et al. (2020) specified exactly sparse priors on the elements of B = ΨΩ^{-1}, it was enough for them to carefully bound the covering number of exactly low-dimensional sets of the form A × {0}^r, where A is some subset of a multi-dimensional Euclidean space and r > 0 is a positive integer. In contrast, because we specified absolutely continuous priors on the elements of Ψ, we had to cover "effectively low-dimensional" sets of the form A × [−δ, δ]^r for small δ > 0. Our key lemma (Lemma S4 in the Supplementary Materials) provides sufficient conditions on δ for bounding the ε-packing number of such effectively low-dimensional sets using the ε'-packing number of A for a carefully chosen ε' > 0.
4.2 Contraction of cgSSL
In order to establish our posterior concentration results, we first assume that the data (x_1, y_1), ..., (x_n, y_n) were generated according to a Gaussian chain graph model with true parameters Ψ_0 and Ω_0. We need to make additional assumptions about the spectra of Ψ_0 and Ω_0 and on the dimensions n, p, and q.

A1 Ψ_0 and Ω_0 have bounded operator norm: that is, Ψ_0 ∈ T_0 = {Ψ : |||Ψ|||_2 < a_1} and Ω_0 ∈ H_0 = {Ω : |||Ω|||_2 ∈ [1/b_2, 1/b_1]}, where ||| · |||_2 is the operator norm and a_1, b_1, b_2 > 0 are fixed positive constants.

A2 Dimensionality: We assume that log(n) ≲ log(q); log(n) ≲ log(p); and
$$\max\{p, q, s^{\Omega}_0, s^{\Psi}_0\}\log(\max\{p, q\})/n \to 0,$$
where s^Ω_0 and s^Ψ_0 are the numbers of non-zero free parameters in Ω_0 and Ψ_0 respectively, and a_n ≲ b_n means that for sufficiently large n there exists a constant C independent of n such that a_n ≤ C b_n.

A3 Tuning the Ψ prior: We assume that (1 − θ)/θ ≍ (pq)^{2+a_0} for some a_0 > 0; λ_0 ≍ max{n, pq}^{2+b_0} for some b_0 > 1/2; and λ_1 ≍ 1/n.

A4 Tuning the Ω prior: We assume that (1 − η)/η ≍ max{Q, pq}^{2+a} for some a > 0, where Q = q(q−1)/2; ξ_0 ≍ max{Q, pq, n}^{4+b} for some b > 0; ξ_1 ≍ 1/n; and ξ_1 ≲ 1/max{Q, n}.
Before stating our main result, we pause to highlight two key differences between the above assumptions and the model introduced in Section 3.1. Although the prior in Section 3.1 restricts Ω to the positive-definite cone, Assumption A1 is slightly stronger, as it bounds the smallest eigenvalue of Ω away from zero. The stronger assumption ensures that the entries of XΨΩ^{-1} do not diverge in our theoretical analysis. We additionally restricted our theoretical analysis to the setting where the proportions of non-negligible parameters, θ and η, are fixed and known (Assumptions A3 and A4). We note that Ročková and George (2018) and Gan et al. (2019a) make similar assumptions in their theoretical analyses.
Theorem 1 (Posterior contraction of cgSSL). Under Assumptions A1–A4, there is a constant M_1 > 0, which does not depend on n, such that
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Psi, \Omega : \|X(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\|_F^2 \geq M_1 n\epsilon_n^2 \mid Y_1, \ldots, Y_n\right) \to 0 \qquad (11)$$
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Omega : \|\Omega - \Omega_0\|_F^2 \geq M_1\epsilon_n^2 \mid Y_1, \ldots, Y_n\right) \to 0 \qquad (12)$$
where ε_n = √(max{p, q, s^Ω_0, s^Ψ_0} log(max{p, q})/n). Note that ε_n → 0 as n → ∞.

A key step in proving Theorem 1 is Lemma 1, which shows that the cgSSL posterior does not place too much probability on Ψ's and Ω's with too many large entries. In order to state this lemma, we denote the effective dimensions of Ψ and Ω by |ν(Ψ)| and |ν(Ω)|. The effective dimension of Ψ (resp. Ω) counts the number of entries (resp. off-diagonal entries in the lower-triangle) whose absolute value exceeds the intersection point of the spike and slab prior densities.

Lemma 1 (Dimension recovery of cgSSL). For a sufficiently large constant C'_3 > 0, we have
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Psi : |\nu(\Psi)| > C'_3 s^{\star} \mid Y_1, \ldots, Y_n\right) \to 0 \qquad (13)$$
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Omega : |\nu(\Omega)| > C'_3 s^{\star} \mid Y_1, \ldots, Y_n\right) \to 0 \qquad (14)$$
where s* = max{p, q, s^Ω_0, s^Ψ_0}.
Lemma 1 essentially guarantees that the cgSSL posterior does not grossly overestimate the number of predictor-response and response-response edges in the underlying graphical model. Note that the result in Equation (11) shows that the vector containing the n evaluations of the regression function (i.e., the vector XΨΩ^{-1}) converges to the vector containing the evaluations of the true regression function Ω_0^{-1}Ψ_0^⊤x. Importantly, apart from Assumption A2 about the dimensions of X, we did not make any additional assumptions about the design matrix. The contraction rates for Ψ and ΨΩ^{-1}, however, depend critically on X. To state these results, denote the restricted eigenvalue of the design matrix as
$$\phi^2(s) = \inf_{A \in \mathbb{R}^{p\times q}:\ 0 \leq |\nu(A)| \leq s}\ \frac{\|XA\|_F^2}{n\|A\|_F^2}.$$
Corollary 1 (Recovery of regression coefficients in cgSSL). Under Assumptions A1–A4, there is some constant M_0 > 0, which does not depend on n, such that
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Psi, \Omega : \|\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1}\|_F^2 \geq \frac{M_0\epsilon_n^2}{\phi^2(s^{\Psi}_0 + C'_3 s^{\star})} \mid Y_1, \ldots, Y_n\right) \to 0 \qquad (15)$$
$$\sup_{\Psi_0 \in \mathcal{T}_0,\,\Omega_0 \in \mathcal{H}_0} \mathbb{E}_0\,\Pi\left(\Psi : \|\Psi - \Psi_0\|_F^2 \geq \frac{M_0\epsilon_n^2}{\min\{\phi^2(s^{\Psi}_0 + C'_3 s^{\star}),\ 1\}} \mid Y_1, \ldots, Y_n\right) \to 0. \qquad (16)$$
Corollary 1 shows that the posterior distribution of ΨΩ^{-1} can contract at a faster or slower rate than the posterior distributions of XΨΩ^{-1} and Ω, depending on the design matrix. In particular, when X is poorly conditioned, we might expect the rate to be slower. In contrast, the term min{φ²(s^Ψ_0 + C'_3 s*), 1} appearing in the denominator of the rate in Equation (16) implies that the posterior distribution of Ψ cannot concentrate at a faster rate than the posterior distributions of ΨΩ^{-1} and Ω, regardless of the design matrix. To develop some intuition about this phenomenon, notice that we can decompose the difference Ψ − Ψ_0 as
$$\Psi - \Psi_0 = (\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\Omega + \Psi_0\Omega_0^{-1}(\Omega - \Omega_0).$$
Roughly speaking, the decomposition suggests that in order to estimate Ψ well, we must be able to estimate both Ω and ΨΩ^{-1} well. In other words, estimating Ψ is at least as hard, statistically, as estimating Ω and ΨΩ^{-1}. Taken together, the two results in Corollary 1 suggest that while a carefully constructed design matrix can improve estimation of the matrix of marginal effects, B = ΨΩ^{-1}, it cannot generally improve estimation of the matrix of direct effects Ψ.
5 Synthetic experiments
We performed a simulation study to assess how well our two implementations of the cgSSL (cgSSL-DPE and cgSSL-DCPE) (i) recover the supports of Ψ and Ω and (ii) estimate each matrix. We compared both implementations of the cgSSL to several competitors: a fixed-penalty method (cgLASSO), which deploys a single penalty λ for the entries in Ψ and a single fixed penalty ξ for the entries in Ω; Shen and Solis-Lemus (2021)'s CAR-LASSO procedure (CAR), which puts Laplace priors on the entries of Ψ and Ω and a Gamma prior on the overall shrinkage strength; and Shen and Solis-Lemus (2021)'s adaptive CAR-LASSO (CAR-A), which puts individualized Laplace priors on the free parameters of Ψ and Ω. Note that the cgSSL and cgLASSO perform optimization while CAR and CAR-A run MCMC. Further, we selected the penalties in cgLASSO with 10-fold cross-validation. Additionally, cgLASSO and CAR apply the same amount of shrinkage to every element of Ψ and the same amount of shrinkage to every element of Ω. CAR-A, on the other hand, applies individualized shrinkage.

We simulated several synthetic datasets of various dimensions and with different sparsity patterns in Ω (Figure 2). Across all of these choices of dimension and Ω, we found that cgSSL-DPE achieved somewhat lower sensitivity but much higher precision in estimating the supports of both Ψ and Ω than the competing methods. Taken together, these findings suggest that while cgSSL-DPE tended to return fewer non-zero parameter estimates than the other methods, we can be much more certain that those parameters are truly non-zero. Put another way, although the other methods can recover more of the truly non-zero signal, they do so at the expense of making many more false positive identifications in the supports of Ψ and Ω than cgSSL-DPE.
5.1 Simulation design
We simulated data with three different choices of dimensions: (n, p, q) = (100, 10, 10), (100, 20, 30), and (400, 100, 30). For each choice of (n, p, q), we considered five different choices of Ω: (i) an AR(1) model for Ω^{-1} so that Ω is tri-diagonal; (ii) an AR(2) model for Ω^{-1} so that ω_{k,k'} = 0 whenever |k − k'| > 2; (iii) a block model in which Ω is block-diagonal with two dense q/2 × q/2 diagonal blocks; (iv) a star graph where the off-diagonal entry ω_{k,k'} = 0 unless k or k' is equal to 1; and (v) a dense model with all off-diagonal elements ω_{k,k'} = 2.

In the AR(1) model, we set (Ω^{-1})_{k,k'} = 0.7^{|k − k'|} so that ω_{k,k'} = 0 whenever |k − k'| > 1. In the AR(2) model, we set ω_{k,k} = 1, ω_{k-1,k} = ω_{k,k-1} = 0.5, and ω_{k-2,k} = ω_{k,k-2} = 0.25. For the block model, we partitioned Σ = Ω^{-1} into four q/2 × q/2 blocks and set all entries in the off-diagonal blocks of Σ to zero. We then set σ_{k,k} = 1 and σ_{k,k'} = 0.5 for 1 ≤ k ≠ k' ≤ q/2 and for q/2 + 1 ≤ k ≠ k' ≤ q. For the star graph, we set ω_{k,k} = 1, ω_{1,k} = ω_{k,1} = 0.1 for each k = 2, ..., q, and set the remaining off-diagonal elements of Ω equal to zero.

These five specifications of Ω (top row of Figure 2) correspond to rather different underlying graphical structures among the response variables (bottom row of Figure 2). The AR(1) model, for instance, represents an extremely sparse but regular structure while the AR(2) model is somewhat less sparse. While the star model and the AR(1) model contain the same number of edges, the underlying graphs have markedly different degree distributions. Compared to the AR(1), AR(2), and star models, the block model is considerably denser. We included the full model, which corresponds to a dense Ω, to assess how well all of the methods perform in a misspecified regime.

In total, we considered 15 combinations of dimensions (n, p, q) and Ω. For each combination, we generated Ψ by randomly selecting 20% of its entries to be non-zero. We drew the non-zero entries uniformly from a U(−2, 2) distribution. For each combination of (n, p, q), Ω, and Ψ, we generated 100 synthetic datasets from the Gaussian chain graph model (1). The entries of the design matrix X were independently drawn from a standard N(0, 1) distribution.
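As a concrete illustration of this data generating process, the following R sketch (our own, not the authors' simulation scripts) draws one synthetic dataset under the AR(1) specification of Ω with (n, p, q) = (100, 10, 10).

```r
set.seed(1)
n <- 100; p <- 10; q <- 10

# AR(1) specification: (Omega^{-1})_{k,k'} = 0.7^{|k - k'|}, so Omega is tridiagonal
Sigma <- 0.7^abs(outer(1:q, 1:q, "-"))
Omega <- solve(Sigma)

# Psi: 20% of entries non-zero, drawn uniformly from U(-2, 2)
Psi <- matrix(0, p, q)
nonzero <- sample(p * q, size = round(0.2 * p * q))
Psi[nonzero] <- runif(length(nonzero), -2, 2)

# Design matrix with independent standard normal entries
X <- matrix(rnorm(n * p), n, p)

# y_i | x_i ~ N(Omega^{-1} Psi' x_i, Omega^{-1}): mean plus correlated Gaussian noise
mean_Y <- X %*% Psi %*% Sigma                      # equals X Psi Omega^{-1}
Y <- mean_Y + matrix(rnorm(n * q), n, q) %*% chol(Sigma)
```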
Figure 2: Visualization of the supports of Ω for q = 10 under each of the five specifications (top) and the corresponding graphs (bottom). In the top row, gray cells indicate non-zero entries in Ω and white cells indicate zeros.
5.2 Results
To assess estimation performance, we computed the Frobenius norm between the estimated matrices and the true data generating matrices. To assess support recovery performance, we counted the number of elements in each of Ψ and Ω that were (i) correctly estimated as non-zero (true positives; TP); (ii) correctly estimated as zero (true negatives; TN); (iii) incorrectly estimated as non-zero (false positives; FP); and (iv) incorrectly estimated as zero (false negatives; FN). We report the sensitivity (TP/(TP + FN)) and precision (TP/(TP + FP)). Generally speaking, we prefer methods with high sensitivity and high precision. High sensitivity indicates that the method has correctly estimated most of the truly non-zero parameters as non-zero. High precision, on the other hand, indicates that most of the estimated non-zero parameters are truly non-zero. For brevity, we only report the average sensitivity, precision, and Frobenius errors for the (n, p, q) = (100, 10, 10) setting in Table 1. We observed qualitatively similar results for the other two settings of dimension and report average performance in those settings in Tables S2 and S3 of the Supplementary Materials.
Table 1: Average (sd) sensitivity, precision, and Frobenius error for Ψ and Ω when (n, p, q) = (100, 10, 10) for each specification of Ω across 100 simulated datasets. For each choice of Ω, the best performance is bold-faced.
                          Ψ recovery                         Ω recovery
Method           SEN       PREC      FROB          SEN       PREC      FROB
AR(1) model
cgLASSO 0.88 (0.08) 0.44 (0.15) 0.13 (0.16) 0.78 (0.37) 0.55 (0.31) 31.93 (22.08)
CAR 0.86 (0.06) 0.31 (0.03) 0.04 (0.01) 1 (0) 0.3 (0.03) 4.16 (1.18)
CAR-A 0.87 (0.06) 0.59 (0.07) 0.02 (0.01) 1 (0) 0.83 (0.1) 2.75 (1.59)
cgSSL-dcpe 0.64 (0.05) 0.8 (0.16) 0.08 (0.05) 0.94 (0.11) 0.96 (0.07) 6.32 (6.64)
cgSSL-dpe 0.65 (0.05) 0.99 (0.03) 0.04 (0.01) 1 (0) 0.97 (0.05) 2.49 (1.12)
AR(2) model
cgLASSO 1 (0.02) 0.22 (0.06) 0.17 (0.09) 0.84 (0.29) 0.55 (0.17) 2.7 (1.66)
CAR 0.9 (0.06) 0.34 (0.04) 0.03 (0.01) 0.98 (0.03) 0.57 (0.06) 0.58 (0.21)
CAR-A 0.89 (0.05) 0.67 (0.08) 0.02 (0.01) 1 (0.02) 0.91 (0.06) 0.46 (0.32)
cgSSL-dcpe 0.96 (0.06) 0.43 (0.12) 0.45 (0.28) 0.24 (0.3) 0.63 (0.14) 5 (0.98)
cgSSL-dpe 0.73 (0.05) 1 (0.01) 0.02 (0.01) 1 (0) 0.86 (0.06) 0.38 (0.21)
Block model
cgLASSO 0.95 (0.05) 0.39 (0.18) 0.13 (0.11) 0.73 (0.38) 0.78 (0.21) 5.15 (2.27)
CAR 0.89 (0.06) 0.31 (0.03) 0.03 (0.01) 0.95 (0.02) 0.61 (0.06) 1.89 (0.75)
CAR-A 0.87 (0.06) 0.57 (0.07) 0.03 (0.01) 0.86 (0.07) 0.93 (0.05) 2.97 (1.22)
cgSSL-dcpe 0.76 (0.06) 0.29 (0.02) 0.28 (0.02) 0.01 (0.03) 0.71 (0.39) 8.85 (0.2)
cgSSL-dpe 0.69 (0.07) 0.99 (0.02) 0.03 (0.01) 0.71 (0.06) 0.95 (0.05) 3.28 (1.17)
Star model
cgLASSO 0.96 (0.04) 0.48 (0.14) 0.04 (0.02) 0.36 (0.41) 0.2 (0.18) 0.86 (0.35)
CAR 0.91 (0.05) 0.34 (0.03) 0.02 (0) 0.55 (0.18) 0.25 (0.08) 0.57 (0.29)
CAR-A 0.91 (0.04) 0.57 (0.06) 0.02 (0.01) 0.22 (0.14) 0.46 (0.24) 0.57 (0.26)
cgSSL-dcpe 0.83 (0.04) 0.96 (0.05) 0.01 (0) 0.05 (0.09) 0.9 (0.24) 0.22 (0.12)
cgSSL-dpe 0.79 (0.06) 0.99 (0.03) 0.01 (0.01) 0.09 (0.13) 0.71 (0.29) 0.29 (0.19)
Dense model
cgLASSO 0.92 (0.04) 0.57 (0.07) 0.03 (0.01) 0.88 (0.32) 1 (0) 16.93 (32.74)
CAR 0.85 (0.06) 0.28 (0.03) 0.04 (0.01) 0.03 (0.02) 1 (0) 92.51 (1.74)
CAR-A 0.84 (0.06) 0.4 (0.04) 0.04 (0.01) 0 (0.01) 1 (0) 96.04 (1.21)
cgSSL-dcpe 0.82 (0.03) 0.84 (0.06) 0.02 (0) 0.01 (0.02) 1 (0) 99.93 (0.39)
cgSSL-dpe 0.72 (0.07) 0.93 (0.06) 0.03 (0.01) 0.05 (0.04) 1 (0) 99.99 (0.98)
In terms of identifying non-zero direct effects (i.e., estimating the support of Ψ), cgLASSO consistently achieves the highest sensitivity. On further inspection, we found that the penalties selected by 10-fold cross-validation tended to be quite small, meaning that cgLASSO returned many non-zero ψ̂_{j,k}'s. As the precision results indicate, many of cgLASSO's "discoveries" were in fact false positives. The other fixed penalty method, CAR, similarly displayed somewhat high sensitivity and low precision. Interestingly, for several choices of Ω, the precisions of cgLASSO and CAR for recovering the support of Ψ were less than 0.5. Such low precisions indicate that most of the returned non-zero estimates were in fact false positives. In contrast, the methods that deployed adaptive penalties (CAR-A and both implementations of the cgSSL) displayed higher precision in estimating the support of Ψ. In fact, at least for estimating the support of Ψ, cgSSL-DPE made almost no false positives.

We observed essentially the same phenomenon for Ω: although the cgSSL generally returned fewer non-zero estimates of ω_{k,k'}, the vast majority of these estimates were true positives. In a sense, the fixed penalty methods (cgLASSO and CAR) cast a very wide net when searching for non-zero signal in Ψ and Ω, leading to a large number of false positive identifications in the supports of these matrices. Adaptive penalty methods, on the other hand, are much more discerning.

In terms of estimation performance, we found that the fixed penalty methods (cgLASSO and CAR) tended to have much larger Frobenius error, reflecting the well-documented bias introduced by L_1 regularization. The one exception was the misspecified setting where Ω was dense. Interestingly, for the four sparse Ω's, we did not observe any method achieving high Frobenius error for Ω but low Frobenius error for Ψ. This finding helps substantiate our intuition about Corollary 1: namely, in order to estimate Ψ well, one must estimate Ω well. Finally, like Deshpande et al. (2019), we found that the dynamic conditional posterior exploration implementation of the cgSSL performed slightly worse than the dynamic posterior exploration implementation.
6 Real data experiments
Claesson et al. (2012) studied the gut microbiota of elderly individuals using data sequenced
from fecal samples taken from 178 subjects. They were primarily interested in understanding
differences in the gut microbiome composition across several residence types (in the commu-
nity, day-hospital, rehabilitation, or in long-term residential care) and across several different
types of diet. We refer the reader to the Supplementary Notes and Supplementary Table 3
of Claesson et al. (2012) for more details. They found that the gut microbiomes of residents
in long-term care facilities were considerably less diverse than those of residents dwelling
in the community. They additionally reported that diet had a large marginal effect on gut
microbe diversity but they did not examine conditional or direct effects, which might align
more closely with the underlying biological mechanism. In this section, we re-analyze their data using the cgSSL to try to estimate the direct effects of each type of diet and residence
type on gut microbiome composition.
We pre-processed the raw 16S rRNA data in the MG-RAST server (Keegan et al., 2016); please see Section S4 of the Supplementary Materials for more details on the pre-processing. In all, we had n = 178 observations of p = 11 predictors and q = 14 taxa. Figure 3 shows the graphical model estimated by cgSSL-DPE. In the figure, edges are colored according to the sign of the effect, with blue edges corresponding to negative conditional correlation and red edges corresponding to positive conditional correlation. The edge widths correspond to the absolute value of the parameter, with wider edges indicating larger parameter values. We found a large number of edges between the different species, suggesting that there was considerable conditional dependence between their abundances after adjusting for the covariates. In fact, we found only two non-zero entries in Ψ. We estimated that percutaneous endoscopic gastrostomy (PEG), in which a feeding tube is inserted into the abdomen, had a negative direct effect on the abundance of Veillonella, which is involved in lactose fermentation. Reassuringly for us, our finding aligns with those of Takeshita et al. (2011), which reported a negative effect of PEG on this genus. We additionally found that staying in a day hospital had a positive direct effect on Caloramator.
[Figure 3 graphic omitted in this text version. Its nodes are the 14 genera (Alistipes, Bacteroides, Barnesiella, Blautia, Butyrivibrio, Caloramator, Clostridium, Eubacterium, Faecalibacterium, Hespellia, Parabacteroides, Ruminococcus, Selenomonas, Veillonella) and the 11 predictors (Age, GenderMale, StratumDayHospital, StratumLong-term, StratumRehab, Diet1, Diet2, Diet3, Diet4, DietPEG, BMI); edge widths scale with the absolute value of the estimated coefficients (legend: 0.2, 0.4, 0.6, 0.8).]
Figure 3: The estimated graphical model underlying Claesson et al. (2012)'s gut microbiome dataset. Each edge is annotated with the absolute value of the corresponding conditional regression coefficient; red represents positive (conditional) dependence and blue represents negative (conditional) dependence.
Our results suggest that the large marginal effects reported by Claesson et al. (2012) are a by-product of only a few direct effects and substantial residual conditional dependence between species. For instance, because PEG has a direct effect on Veillonella, which is conditionally correlated with Clostridium, Butyrivibrio, and Blautia, PEG displays a marginal effect on each of these other genera. In this way, the cgSSL can provide a more nuanced understanding of the underlying biological mechanism than simply estimating the matrix of marginal effects B = ΨΩ^{-1}. We note, however, that Claesson et al. (2012)'s dataset does not contain an exhaustive set of environmental and patient lifestyle predictors. Accordingly, our re-analysis is limited in the sense that, were we able to incorporate additional predictors, the estimated graphical model might be quite different.
7 Discussion
In the Gaussian chain graph model in Equation (1), Ψ is a matrix containing all of the direct effects of p predictors on q outcomes, while Ω is the residual precision matrix that encodes the conditional dependence relationships between the outcomes that remain after adjusting for the predictors. We have introduced the cgSSL procedure for obtaining simultaneously sparse estimates of Ψ and Ω. In our procedure, we formally specify spike-and-slab LASSO priors on the free elements of Ψ and Ω and use an ECM algorithm to maximize the posterior density. Our ECM algorithm iteratively solves a penalized maximum likelihood problem with self-adaptive penalties. Across several simulated datasets, the cgSSL demonstrated excellent support recovery and estimation performance, substantially out-performing competitors that deployed constant shrinkage penalties. We further characterized the asymptotic properties of cgSSL posteriors, establishing posterior concentration rates under relatively mild assumptions. To the best of our knowledge, these are the first such results for sparse Gaussian chain graph models.
Although our main theoretical result (Theorem 1) implies that a slightly modified version of the cgSSL procedure from Section 3 is asymptotically consistent, quantifying finite sample posterior uncertainty remains challenging. Several authors have proposed extensions of Newton and Raftery (1994)'s weighted likelihood bootstrap for quantifying posterior uncertainty. Basically, these procedures work by repeatedly maximizing a randomized objective formed by carefully re-weighting each term in the log-likelihood and the log-prior. In fact, Nie and Ročková (2022) recently deployed this strategy to quantify uncertainty in SSL posteriors for single-outcome regression in high dimensions. A key ingredient in Nie and Ročková (2022) is the introduction of an additional random location shift in the prior to offset the tendency of the SSL to return exactly sparse parameter estimates. In our Gaussian chain graph problem, introducing a similar shift is challenging due to the constraint that Ω be positive definite. Overcoming this difficulty is the subject of on-going work.
In many applications, analysts encounter multiple outcomes of mixed type (i.e. continuous
and discrete). In its current form, the cgSSL is not applicable to these situations. It is possi-
ble, however, to extend the cgSSL to model outcomes of mixed type using a strategy similar
to one found in Kowal and Canale (2020), which modeled discrete variables as truncated
and transformed versions of latent Gaussian random variables.
Acknowledgements
The authors are grateful to Ray Bai for helpful comments on the theoretical results and to
Gemma Moran for feedback on an early draft of the manuscript.
This work was supported by the National Institute of Food and Agriculture, United States
Department of Agriculture, Hatch project 1023699. This work was also supported by the
Department of Energy [DE-SC0021016 to C.S.L.]. Support for S.K.D. was provided by the
University of Wisconsin–Madison, Office of the Vice Chancellor for Research and Graduate
Education with funding from the Wisconsin Alumni Research Foundation.
This research was performed using the compute resources and assistance of the UW–Madison
Center For High Throughput Computing (CHTC) in the Department of Computer Sciences.
The CHTC is supported by UW–Madison, the Advanced Computing Initiative, the Wiscon-
sin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National
Science Foundation, and is an active member of the OSG Consortium, which is supported
by the National Science Foundation and the U.S. Department of Energy’s Office of Science.
References
Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal
Statistical Society: Series B (Methodological), 44(2):139–160.
Bai, R., Moran, G. E., Antonelli, J. L., Chen, Y., and Boland, M. R. (2020). Spike-and-slab
group LASSOs for grouped regression and sparse generalized additive models. Journal of
the American Statistical Association.
Bai, R., Roˇckov´a, V., and George, E. I. (2021). Spike-and-slab meets LASSO: A review of the
spike-and-slab LASSO. In Tadesse, M. and Vannucci, M., editors, Handbook of Bayesian
Variable Selection. Routledge.
Banerjee, O., Ghaoui, L. E., and D’Aspremont, A. (2008). Model selection through sparse
maximum likelihood estimation for multivariate Gaussian or binary data. Journal of
Machine Learning Research, 9:485–516.
Battson, M. L., Lee, D. M., Weir, T. L., and Gentile, C. L. (2018). The gut microbiota
as a novel regulator of cardiovascular function and disease. The Journal of Nutritional
Biochemistry, 56:1–15.
Belcheva, A., Irrazabal, T., Robertson, S. J., Streutker, C., Maughan, H., Rubino, S.,
Moriyama, E. H., Copeland, J. K., Surendra, A., Kumar, S., et al. (2014). Gut mi-
crobial metabolism drives transformation of MSH2-deficient colon epithelial cells. Cell,
158(2):288–299.
Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymp-
totic theory of independence. Oxford university press.
Boyd, S. P. and Barratt, C. H. (1991). Linear controller design: limits of performance.
Prentice-Hall.
Claesson, M. J., Jeffery, I. B., Conde, S., Power, S. E., O’connor, E. M., Cusack, S., Harris,
H. M., Coakley, M., Lakshminarayanan, B., O’Sullivan, O., et al. (2012). Gut microbiota
composition correlates with diet and health in the elderly. Nature, 488(7410):178–184.
Deshpande, S. K., Roˇckov´a, V., and George, E. I. (2019). Simultaneous variable and co-
variance selection with the multivariate spike-and-slab LASSO. Journal of Computational
and Graphical Statistics, 28(4):921–931.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation
with the graphical LASSO. Biostatistics, 9(3):432–441.
Frydenberg, M. (1990). The chain graph Markov property. Scandinavian Journal of Statis-
tics, 17(4):333–353.
Gan, L., Narisetty, N. N., and Liang, F. (2019a). Bayesian regularization for graphical models
with unequal shrinkage. Journal of the American Statistical Association, 114(527):1218–
1231.
Gan, L., Yang, X., Narisetty, N. N., and Liang, F. (2019b). Bayesian joint estimation of mul-
tiple graphical models. Advances in Neural Information Processing Systems (NeurIPS).
George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal
of the American Statistical Association, 88(423):881–889.
George, E. I. and Ročková, V. (2020). Comment: Regularization via Bayesian penalty
mixing. Technometrics, 62(4):438–442.
Ghosal, S. and van der Vaart, A. (2017). Fundamentals of nonparametric Bayesian inference,
volume 44. Cambridge University Press.
Guinane, C. M. and Cotter, P. D. (2013). Role of the gut microbiota in health and chronic
gastrointestinal disease: understanding a hidden metabolic organ. Therapeutic Advances
in Gastroenterology, 6(4):295–308.
Hills Jr, R. D., Pontefract, B. A., Mishcon, H. R., Black, C. A., Sutton, S. C., and Theberge,
C. R. (2019). Gut microbiome: profound implications for diet and disease. Nutrients,
11(7):1613.
Hsieh, C. J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2011). Sparse inverse covariance
matrix estimation using quadratic approximation. In Advances in Neural Information
Processing Systems (NeurIPS).
Kamada, N. and Núñez, G. (2014). Regulation of the immune system by the resident
intestinal bacteria. Gastroenterology, 146(6):1477–1488.
Keegan, K. P., Glass, E. M., and Meyer, F. (2016). MG-RAST, a metagenomics service
for analysis of microbial community structure and function. In Microbial Environmental
Genomics (MEG), pages 207–233. Springer.
Kim, D., Zeng, M. Y., and Núñez, G. (2017). The interplay between host immune cells and
gut microbiota in chronic inflammatory diseases. Experimental & Molecular Medicine,
49(5):e339–e339.
Kowal, D. R. and Canale, A. (2020). Simultaneous transformation and rounding (STAR)
models for integer-valued data. Electronic Journal of Statistics, 14(1):1744–1772.
Larsbrink, J., Rogers, T. E., Hemsworth, G. R., McKee, L. S., Tauzin, A. S., Spadiut, O.,
Klinter, S., Pudlo, N. A., Urs, K., Koropatkin, N. M., et al. (2014). A discrete genetic locus
confers xyloglucan metabolism in select human gut Bacteroidetes. Nature, 506(7489):498–
502.
Lauritzen, S. L. and Richardson, T. S. (2002). Chain graph models and their causal inter-
pretations. Journal of the Royal Statistical Society: Series B, 64(3):321–348.
Lauritzen, S. L. and Wermuth, N. (1989). Graphical models for associations between vari-
ables, some of which are qualitative and some quantitative. The Annals of Statistics,
17(1):31–57.
Li, Z., Mccormick, T., and Clark, S. (2019). Bayesian joint spike-and-slab graphical LASSO.
In Proceedings of the 36th International Conference on Machine Learning (ICML), pages
3877–3885. PMLR.
McCarter, C. and Kim, S. (2014). On sparse Gaussian chain graph models. Advances in
Neural Information Processing Systems (NeurIPS).
Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm:
a general framework. Biometrika, 80(2):267–278.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression.
Journal of the American Statistical Association, 83(404):1023–1032.
Moran, G. E., Roˇckov´a, V., and George, E. I. (2019). Variance prior forms for high-
dimensional Bayesian variable selection. Bayesian Analysis, 14(4):1091–1119.
Moran, G. E., Roˇckov´a, V., and George, E. I. (2021). Spike-and-slab LASSO biclustering.
The Annals of Applied Statistics, 15(1):148–173.
Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the weighted
likelihood bootstrap. Journal of the Royal Statistical Society: Series B, 56(1):3–26.
Nie, L. and Roˇckov´a, V. (2022). Bayesian bootstrap spike-and-slab LASSO. Journal of the
American Statistical Association.
Ning, B., Jeong, S., and Ghosal, S. (2020). Bayesian linear regression for multivariate
responses under group sparsity. Bernoulli, 26(3):2353–2382.
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria.
Roˇckov´a, V. and George, E. I. (2014). EMVS: The EM approach to Bayesian variable
selection. Journal of the American Statistical Association, 109(506):828–846.
Roˇckov´a, V. and George, E. I. (2018). The spike-and-slab LASSO. Journal of the American
Statistical Association, 113(521):431–444.
Roˇckov´a, V. and George, E. I. (2016). Fast Bayesian factor analysis via automatic rotations
to sparsity. Journal of the American Statistical Association, 111(516):1608–1622.
Scher, J. U., Sczesnak, A., Longman, R. S., Segata, N., Ubeda, C., Bielski, C., Rostron,
T., Cerundolo, V., Pamer, E. G., Abramson, S. B., et al. (2013). Expansion of intestinal
prevotella copri correlates with enhanced susceptibility to arthritis. eLife, 2:e01202.
Shen, Y. and Solis-Lemus, C. (2021). Bayesian conditional auto-regressive LASSO models
to learn sparse microbial networks with predictors. arXiv preprint arXiv:2012.08397.
Shreiner, A. B., Kao, J. Y., and Young, V. B. (2015). The gut microbiome in health and in
disease. Current Opinion in Gastroenterology, 31(1):69.
Singh, R. K., Chang, H.-W., Yan, D., Lee, K. M., Ucmak, D., Wong, K., Abrouk, M., Farah-
nik, B., Nakamura, M., Zhu, T. H., et al. (2017). Influence of diet on the gut microbiome
and implications for human health. Journal of Translational Medicine, 15(1):1–17.
Takeshita, T., Yasui, M., Tomioka, M., Nakano, Y., Shimazaki, Y., and Yamashita, Y.
(2011). Enteral tube feeding alters the oral indigenous microbiota in elderly adults. Applied
and Environmental Microbiology, 77(19):6739–6745.
Tang, Z., Shen, Y., Li, Y., Zhang, X., Wen, J., Qian, C., Zhuang, W., Shi, X., and Yi,
N. (2018). Group spike-and-slab LASSO generalized linear models for disease prediction
and associated genes detected by incorporating pathway information. Bioinformatics,
34(6):901–910.
Tang, Z., Shen, Y., Zhang, X., and Yi, N. (2017). The spike-and-slab LASSO generalized
linear models for prediction and associated genes detection. Genetics, 205:77–88.
Wang, Z., Klipfell, E., Bennett, B. J., Koeth, R., Levison, B. S., DuGar, B., Feldstein, A. E.,
Britt, E. B., Fu, X., Chung, Y.-M., et al. (2011). Gut flora metabolism of phosphatidyl-
choline promotes cardiovascular disease. Nature, 472(7341):57–63.
Zhang, C. H. and Zhang, T. (2012). A General theory of concave regularization for high-
dimensional sparse estimation problems. Statistical Science, 27(4):576–593.
Supplementary Materials
In Section S1 we derive the Expectation Conditional Maximization (ECM) algorithm used to
find the maximum a posteriori (MAP) estimates of Ψ and Ω in the cgSSL model. One of the
conditional maximization steps of that algorithm involves solving a CGLASSO problem. We
introduce a new algorithm, cgQUIC, to solve the general CGLASSO problem in Section S2.
Specifically, we show that the problem has a unique global optimum (Theorem S1) and that our
cgQUIC algorithm converges to this optimum (Theorem S2). Then, we present additional
results from the simulation study described in Section 5 of the main text in Section S3. In
Section S4, we detail the preprocessing steps we took to prepare the gut microbiome data
for analysis with the cgSSL. Finally, we state and prove our main asymptotic results in
Section S5.
S1 The cgSSL algorithm
In this section, we provide full details of the Expectation Conditional Maximization (ECM)
algorithm that is used in the cgSSL procedure. We describe the algorithm for a fixed set of
spike-and-slab penalties (λ0, λ1, ξ0, ξ1) and a fixed set of hyperparameters (aθ, bθ, aη, bη). For
notational brevity, we let Θ = {Ψ, θ, Ω, η} denote the set of four parameters of interest.
Recall from Section 3.3 of the main text that we wish to maximize the log posterior density

$$
\begin{aligned}
\log \pi(\Theta \mid \mathbf{Y}) = \; & \frac{n}{2}\log|\Omega| - \frac{1}{2}\operatorname{tr}\left[(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})\Omega(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})^{\top}\right] \\
& + \sum_{j=1}^{p}\sum_{k=1}^{q}\log\left(\theta\lambda_1 e^{-\lambda_1|\psi_{j,k}|} + (1-\theta)\lambda_0 e^{-\lambda_0|\psi_{j,k}|}\right) \\
& + \sum_{k=1}^{q-1}\sum_{k'>k}\log\left(\eta\xi_1 e^{-\xi_1|\omega_{k,k'}|} + (1-\eta)\xi_0 e^{-\xi_0|\omega_{k,k'}|}\right) \\
& - \sum_{k=1}^{q}\xi_0\,\omega_{k,k} + \log \mathbb{1}(\Omega \succ 0) \\
& + (a_\theta - 1)\log(\theta) + (b_\theta - 1)\log(1-\theta) \\
& + (a_\eta - 1)\log(\eta) + (b_\eta - 1)\log(1-\eta)
\end{aligned}
\tag{S1}
$$
Instead of optimizing log π(Θ | Y) directly, we use an ECM algorithm and iteratively update
the surrogate objective

$$
F(\Theta) = \mathbb{E}_{\delta}\left[\log \pi(\Theta, \delta \mid \mathbf{Y}) \mid \Theta\right],
$$

where log π(Θ, δ | Y) is the log-density of the posterior in an augmented model involving the
spike-and-slab indicators δ = {δ_{k,k'} : 1 ≤ k < k' ≤ q}. Note that the expectation is taken
with respect to the conditional posterior distribution of δ given Θ. In our augmented model,
δ_{k,k'} indicates whether ω_{k,k'} was drawn from the spike (δ_{k,k'} = 0) or the slab (δ_{k,k'} = 1).
Given Θ and the data Y, these indicators are conditionally independent with

$$
\mathbb{E}[\delta_{k,k'} \mid \mathbf{Y}, \Theta] = \frac{\eta\xi_1 e^{-\xi_1|\omega_{k,k'}|}}{\eta\xi_1 e^{-\xi_1|\omega_{k,k'}|} + (1-\eta)\xi_0 e^{-\xi_0|\omega_{k,k'}|}}.
$$
The surrogate objective F(Θ) is given by

$$
\begin{aligned}
F(\Theta) = \; & \frac{n}{2}\log|\Omega| + \operatorname{tr}(\mathbf{Y}^{\top}\mathbf{X}\Psi) - \frac{1}{2}\operatorname{tr}(\mathbf{Y}^{\top}\mathbf{Y}\Omega) - \frac{1}{2}\operatorname{tr}\left((\mathbf{X}\Psi)^{\top}(\mathbf{X}\Psi)\Omega^{-1}\right) \\
& + \sum_{j,k}\log\left(\theta\lambda_1 e^{-\lambda_1|\psi_{j,k}|} + (1-\theta)\lambda_0 e^{-\lambda_0|\psi_{j,k}|}\right) - \sum_{k<k'}\xi^{\star}_{k,k'}|\omega_{k,k'}| - \xi_1\sum_{k=1}^{q}\omega_{k,k} \\
& + (a_\theta - 1)\log\theta + (b_\theta - 1)\log(1-\theta) + (a_\eta - 1)\log\eta + (b_\eta - 1)\log(1-\eta)
\end{aligned}
\tag{S2}
$$

where ξ*_{k,k'} = ξ1 q*_{k,k'} + ξ0(1 − q*_{k,k'}) and

$$
q^{\star}(x, \eta) = \frac{\eta\xi_1 e^{-\xi_1|x|}}{\eta\xi_1 e^{-\xi_1|x|} + (1-\eta)\xi_0 e^{-\xi_0|x|}}.
$$
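For concreteness, the following is a minimal Python sketch of the E-step quantity q*(x, η) above (the function name q_star is ours, not part of our software):

```python
import numpy as np

def q_star(x, eta, xi0, xi1):
    """Posterior probability that omega_{k,k'} = x was drawn from the slab,
    i.e. E[delta_{k,k'} | Y, Theta] computed in the E-step."""
    slab = eta * xi1 * np.exp(-xi1 * np.abs(x))
    spike = (1.0 - eta) * xi0 * np.exp(-xi0 * np.abs(x))
    return slab / (slab + spike)
```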
Our ECM algorithm iteratively computes F(Θ) based on the current value of Θ (the E-step)
and then updates the value of Θ by performing two conditional maximizations (the CM-step).
More specifically, for t ≥ 1, if Θ^(t−1) is the value of Θ at the start of the tth iteration,
in the E-step we compute

$$
F^{(t)}(\Theta) = \mathbb{E}_{\delta}\left[\log \pi(\Theta, \delta \mid \mathbf{Y}) \mid \Theta = \Theta^{(t-1)}\right].
$$

We then compute Θ^(t) by first optimizing F^(t)(Θ) with respect to (Ψ, θ) while fixing (Ω, η) =
(Ω^(t−1), η^(t−1)) to obtain (Ψ^(t), θ^(t)). We then optimize F^(t)(Θ) with respect to (Ω, η) while
fixing (Ψ, θ) = (Ψ^(t), θ^(t)) to obtain (Ω^(t), η^(t)). That is, in the CM-step we solve the following
optimization problems

$$
(\Psi^{(t)}, \theta^{(t)}) = \arg\max_{\Psi, \theta}\; F^{(t)}(\Psi, \theta, \Omega^{(t-1)}, \eta^{(t-1)}) \tag{S3}
$$

$$
(\Omega^{(t)}, \eta^{(t)}) = \arg\max_{\Omega, \eta}\; F^{(t)}(\Psi^{(t)}, \theta^{(t)}, \Omega, \eta). \tag{S4}
$$

Once we solve the optimization problems in Equations (S3) and (S4), we set
Θ^(t) = (Ψ^(t), θ^(t), Ω^(t), η^(t)).
Our ECM algorithm iterates between the E-step and the CM-step until the percentage change
in the estimated entries of Ψ and Ω or in the log posterior density falls below some user-defined
tolerance. In our implementation, we have found that a tolerance of 10^{-3} works well. The
following subsections detail how we carry out each conditional maximization step.
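The overall iteration structure can be summarized by the following Python sketch. It is only a skeleton: the functions update_psi_theta, update_omega_eta, and log_posterior are hypothetical placeholders for the two conditional maximizations in Equations (S3)–(S4) and the objective in Equation (S1).

```python
def cgssl_ecm(Y, X, init, update_psi_theta, update_omega_eta,
              log_posterior, tol=1e-3, max_iter=500):
    """Skeleton of the ECM loop: alternate the two CM-steps until the
    relative change in the log posterior falls below the tolerance."""
    Psi, theta, Omega, eta = init
    old_lp = log_posterior(Psi, theta, Omega, eta, Y, X)
    for _ in range(max_iter):
        # CM-step 1: maximize over (Psi, theta) with (Omega, eta) held fixed (Eq. S3)
        Psi, theta = update_psi_theta(Psi, theta, Omega, eta, Y, X)
        # CM-step 2: maximize over (Omega, eta) with (Psi, theta) held fixed (Eq. S4)
        Omega, eta = update_omega_eta(Psi, theta, Omega, eta, Y, X)
        new_lp = log_posterior(Psi, theta, Omega, eta, Y, X)
        # stop once the relative change in the log posterior is small
        if abs(new_lp - old_lp) < tol * (abs(old_lp) + 1e-12):
            break
        old_lp = new_lp
    return Psi, theta, Omega, eta
```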
S1.1 Updating Ψ and θ
Fixing (Ω, η) = (Ω^(t−1), η^(t−1)), observe that

$$
\begin{aligned}
F^{(t)}(\Psi, \theta, \Omega^{(t-1)}, \eta^{(t-1)}) &= -\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})\Omega(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})^{\top}\right] + \log\pi(\Psi, \theta) \\
&= -\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})\Omega\Omega^{-1}\Omega(\mathbf{Y} - \mathbf{X}\Psi\Omega^{-1})^{\top}\right] + \log\pi(\Psi, \theta) \\
&= -\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y}\Omega - \mathbf{X}\Psi)\Omega^{-1}(\mathbf{Y}\Omega - \mathbf{X}\Psi)^{\top}\right] + \log\pi(\Psi, \theta)
\end{aligned}
\tag{S5}
$$
where

$$
\begin{aligned}
\log\pi(\Psi, \theta) = \; & \sum_{j=1}^{p}\sum_{k=1}^{q}\log\left(\theta\lambda_1 e^{-\lambda_1|\psi_{j,k}|} + (1-\theta)\lambda_0 e^{-\lambda_0|\psi_{j,k}|}\right) \\
& + (a_\theta - 1)\log(\theta) + (b_\theta - 1)\log(1-\theta).
\end{aligned}
\tag{S6}
$$
We solve the optimization problem in Equation (S5) using a coordinate ascent strategy that
iteratively updates Ψ (resp. θ) while holding θ (resp. Ψ) fixed. We run the coordinate ascent
until the relative change in every active ψ_{j,k} falls below the user-defined tolerance.
Updating θ given Ψ. Notice that the objective in Equation (S5) depends on θ only through
the log π(Ψ, θ) term. Accordingly, to update θ conditionally on Ψ, it is enough to maximize
the expression in Equation (S6) as a function of θ while keeping all ψ_{j,k} terms fixed. We use
Newton's method for this optimization and terminate once the Newton step size falls below
the user-defined tolerance.
Updating Ψ given θ. With θ fixed, optimizing Equation (S5) is equivalent to solving

$$
\begin{aligned}
\Psi^{(t)} &= \arg\max_{\Psi}\left\{-\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y}\Omega - \mathbf{X}\Psi)\Omega^{-1}(\mathbf{Y}\Omega - \mathbf{X}\Psi)^{\top}\right] + \log\pi(\Psi \mid \theta)\right\} \\
&= \arg\max_{\Psi}\left\{-\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y}\Omega - \mathbf{X}\Psi)\Omega^{-1}(\mathbf{Y}\Omega - \mathbf{X}\Psi)^{\top}\right] + \sum_{j,k}\log\frac{\pi(\psi_{j,k} \mid \theta)}{\pi(0 \mid \theta)}\right\} \\
&= \arg\max_{\Psi}\left\{-\frac{1}{2}\operatorname{tr}\left[(\mathbf{Y}\Omega - \mathbf{X}\Psi)\Omega^{-1}(\mathbf{Y}\Omega - \mathbf{X}\Psi)^{\top}\right] + \sum_{j,k}\operatorname{pen}(\psi_{j,k} \mid \theta)\right\}
\end{aligned}
\tag{S7}
$$

where

$$
\operatorname{pen}(\psi_{j,k} \mid \theta) = \log\frac{\pi(\psi_{j,k} \mid \theta)}{\pi(0 \mid \theta)} = -\lambda_1|\psi_{j,k}| + \log\frac{p^{\star}(\psi_{j,k}, \theta)}{p^{\star}(0, \theta)}.
$$
Following essentially the same arguments as those in Deshpande et al. (2019) and using
the fact that the columns of X have norm √n, the Karush-Kuhn-Tucker (KKT) condition for
the optimization problem in the final line of Equation (S7) tells us that the optimizer Ψ* satisfies

$$
\psi^{\star}_{j,k} = n^{-1}\left[|z_{j,k}| - \lambda^{\star}(\psi^{\star}_{j,k}, \theta)\right]_{+}\operatorname{sign}(z_{j,k}), \tag{S8}
$$

where

$$
\begin{aligned}
z_{j,k} &= n\psi_{j,k} + X_j^{\top}r_k + \sum_{k' \neq k}\frac{(\Omega^{-1})_{k,k'}}{(\Omega^{-1})_{k,k}}X_j^{\top}r_{k'}, \\
r_{k'} &= (\mathbf{Y}\Omega - \mathbf{X}\Psi)_{k'}, \\
\lambda^{\star}(\psi^{\star}_{j,k}, \theta) &= \lambda_1 p^{\star}(\psi^{\star}_{j,k}, \theta) + \lambda_0\left(1 - p^{\star}(\psi^{\star}_{j,k}, \theta)\right).
\end{aligned}
$$

The KKT condition immediately suggests a cyclical coordinate ascent strategy for solving
the problem in Equation (S7) that involves soft-thresholding the running estimates of ψ_{j,k}.
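A minimal Python sketch of this soft-thresholding step is given below; the helper names p_star and soft_threshold_update are ours, introduced purely for illustration of the KKT condition (S8) under the assumption that z_{j,k} has already been computed.

```python
import numpy as np

def p_star(x, theta, lam0, lam1):
    """Conditional probability that psi_{j,k} = x was drawn from the slab."""
    slab = theta * lam1 * np.exp(-lam1 * np.abs(x))
    spike = (1.0 - theta) * lam0 * np.exp(-lam0 * np.abs(x))
    return slab / (slab + spike)

def soft_threshold_update(z_jk, psi_jk, theta, lam0, lam1, n):
    """One soft-thresholding update suggested by the KKT condition (S8)."""
    p = p_star(psi_jk, theta, lam0, lam1)
    lam_star = lam1 * p + lam0 * (1.0 - p)   # adaptive penalty lambda*(psi, theta)
    return np.sign(z_jk) * max(np.abs(z_jk) - lam_star, 0.0) / n
```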
Like Ročková and George (2018) and Deshpande et al. (2019), we can, however, obtain a
more refined characterization of the global mode Ψ̃ = (ψ̃_{j,k}):

$$
\tilde\psi_{j,k} = n^{-1}\left[|z_{j,k}| - \lambda^{\star}(\tilde\psi_{j,k}, \theta)\right]_{+}\operatorname{sign}(z_{j,k}) \times \mathbb{1}\left(|z_{j,k}| > \Delta_{j,k}\right),
$$

where

$$
\Delta_{j,k} = \inf_{t > 0}\left\{\frac{nt}{2} - \frac{\operatorname{pen}(t \mid \theta)}{(\Omega^{-1})_{k,k}\,t}\right\}.
$$

Though the exact thresholds Δ_{j,k} are difficult to compute, they can be bounded using an
analog of Theorem 2.1 of Ročková and George (2018) and Proposition 2 of Deshpande et al.
(2019). Specifically, suppose we have (λ_0 − λ_1) > 2\sqrt{n(\Omega^{-1})_{k,k}} and (λ^{\star}(0, θ) − λ_1)^2 >
2n(\Omega^{-1})_{k,k}\,p^{\star}(0, θ). Then we have Δ^L_{j,k} ≤ Δ_{j,k} ≤ Δ^U_{j,k}, where:

$$
\begin{aligned}
\Delta^{L}_{j,k} &= \sqrt{-2n\left((\Omega^{-1})_{k,k}\right)^{-1}\log p^{\star}(0, \theta) - \left((\Omega^{-1})_{k,k}\right)^{-2}d} + \lambda_1/(\Omega^{-1})_{k,k} \\
\Delta^{U}_{j,k} &= \sqrt{-2n\left((\Omega^{-1})_{k,k}\right)^{-1}\log p^{\star}(0, \theta)} + \lambda_1/(\Omega^{-1})_{k,k}
\end{aligned}
$$

where d = (λ^{\star}(δ_{c+}, θ) − λ_1)^2 − 2n(\Omega^{-1})_{k,k}\log p^{\star}(δ_{c+}, θ) and δ_{c+} is the largest root of
pen''(x | θ) = (Ω^{-1})_{k,k}.

Our refined characterization of Ψ̃ suggests a cyclical coordinate descent strategy that combines
hard thresholding at Δ_{j,k} and soft-thresholding at λ^{\star}_{j,k}.
Remark 1. Equation (S5) and our approach to solving the optimization problem are extremely
similar to Equation 3 and the coordinate ascent strategy used in Deshpande et al.
(2019), who fit sparse marginal multivariate linear models with spike-and-slab LASSO priors.
This is because if Y ∼ N(XΨΩ^{-1}, Ω^{-1}) in our chain graph model, then YΩ ∼ N(XΨ, Ω).
Thus, if we fix the value of Ω, we can use any computational strategy for estimating marginal
effects in the multivariate linear regression model to estimate Ψ by working with the transformed
data Ỹ = YΩ.
S1.2 Updating Ω and η

Fixing Ψ = Ψ^(t) and θ = θ^(t), we compute Ω^(t) and η^(t) by optimizing the function
$$
\begin{aligned}
F^{(t)}(\Psi^{(t)}, \theta^{(t)}, \Omega, \eta) = \; & \frac{n}{2}\log|\Omega| - \operatorname{tr}(S\Omega) - \operatorname{tr}(M\Omega^{-1}) - \sum_{k<k'}\xi^{\star}_{k,k'}|\omega_{k,k'}| - \xi\sum_{k=1}^{q}\omega_{k,k} \\
& + \left(a_\eta - 1 + \sum_{k<k'}q^{\star}_{k,k'}\right)\log(\eta) + \left(b_\eta - 1 + q(q-1)/2 - \sum_{k<k'}q^{\star}_{k,k'}\right)\log(1-\eta)
\end{aligned}
\tag{S9}
$$

where S = (1/n)Y^⊤Y and M = (1/n)(XΨ)^⊤XΨ.

We immediately observe that the expression in Equation (S9) is separable in Ω and η, meaning
that we can compute Ω^(t) and η^(t) separately. Specifically, we have

$$
\eta^{(t)} = \frac{a_\eta - 1 + \sum_{k<k'}q^{\star}_{k,k'}}{a_\eta + b_\eta - 2 + q(q-1)/2} \tag{S10}
$$

and

$$
\Omega^{(t)} = \arg\max_{\Omega \succ 0}\left\{\frac{n}{2}\log|\Omega| - \operatorname{tr}(S\Omega) - \operatorname{tr}(M\Omega^{-1}) - \sum_{k<k'}\xi^{\star}_{k,k'}|\omega_{k,k'}| - \xi\sum_{k=1}^{q}\omega_{k,k}\right\}. \tag{S11}
$$
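Since Equation (S10) is available in closed form, the η update is a one-liner; the following sketch (with the hypothetical name update_eta) simply transcribes it, taking as input the E-step inclusion probabilities q*_{k,k'} of the off-diagonal entries:

```python
def update_eta(q_star_offdiag, a_eta, b_eta, q):
    """Closed-form update (S10) for eta given the E-step slab probabilities."""
    total = sum(q_star_offdiag)                      # sum over k < k'
    return (a_eta - 1.0 + total) / (a_eta + b_eta - 2.0 + q * (q - 1) / 2.0)
```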
The objective function in Equation (S11) is similar to a graphical LASSO (GLASSO; Friedman
et al., 2008) problem insofar as both problems involve a term like −log|Ω| + tr(SΩ) and
separable L1 penalties on the off-diagonal elements of Ω. However, Equation (S11) includes
an additional term tr(MΩ^{-1}), which does not appear in the GLASSO. This term arises from
the entanglement of Ψ and Ω in the Gaussian chain graph model, and we accordingly
call the problem in Equation (S11) the CGLASSO problem. We solve this problem
by (i) forming a quadratic approximation of the objective, (ii) computing a suitable Newton
direction, and (iii) following that Newton direction for a suitable step size. We detail this
solution strategy in Section S2.
S2 Chain graphical LASSO with cgQUIC
Equation (S11) is a specific instantiation of what we term the "chain graphical LASSO"
(CGLASSO) problem, whose general form is

$$
\arg\min_{\Omega \succ 0}\left\{-\log|\Omega| + \operatorname{tr}(S\Omega) + \operatorname{tr}(M\Omega^{-1}) + \sum_{k,k'}\xi_{k,k'}|\omega_{k,k'}|\right\} \tag{S12}
$$

where S and M are symmetric positive semi-definite q × q matrices; Ω is a symmetric positive
definite q × q matrix; and the ξ_{k,k'}'s are symmetric non-negative penalty weights (i.e., we
have ξ_{k,k'} = ξ_{k',k}).
Notice that when M is the zero matrix, the CGLASSO problem reduces to a general GLASSO
problem, which admits several computational solutions. One well-known solution is to solve
the dual problem, which involves minimization of a log determinant under an L∞ constraint
(Banerjee et al., 2008).

Unfortunately, the dual form of the CGLASSO problem does not have such a simple form.
To wit, the dual of the CGLASSO problem with uniform penalty ξ is given by:

$$
\min_{\|U\|_{\infty} \leq \xi}\;\max_{\Omega \succ 0}\;\log|\Omega| - \operatorname{tr}\left[(S + U)\Omega\right] - \operatorname{tr}\left[M\Omega^{-1}\right]
$$

The inner optimization over Ω can be solved by setting the derivative to 0; the optimal
value of Ω solves a special case of the continuous time algebraic Riccati equation (CARE) (Boyd
and Barratt, 1991):

$$
\Omega - \Omega(S + U)\Omega + M = 0
$$

Unfortunately, this problem does not have a closed form solution and solving it numerically
in every step of the cgSSL is computationally prohibitive.
We instead solve the CGLASSO problem using a suitably modified version of Hsieh et al.
(2011)'s QUIC algorithm for solving the GLASSO problem. At a high level, instead of using
the first order gradient or solving the dual problem, the algorithm is based on Newton's method
and uses a quadratic approximation. Basically, we sequentially cycle over the parameters
ω_{k,k'} and update each parameter by following a Newton direction for a suitable step size. The
step size is chosen to ensure that our running estimate of Ω remains positive definite while
also ensuring sufficient decrease in the overall objective. We call our solution cgQUIC,
which we summarize in Algorithm 1.
To describe cgQUIC, we first define the "smooth" part of the CGLASSO objective as g(Ω)
and the full objective function as f(Ω):

$$
g(\Omega) = -\log|\Omega| + \operatorname{tr}(S\Omega) + \operatorname{tr}(M\Omega^{-1}), \qquad
f(\Omega) = g(\Omega) + \sum_{k,k'}\xi_{k,k'}|\omega_{k,k'}|. \tag{S13}
$$
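The two objectives in Equation (S13) are straightforward to evaluate; the sketch below (in Python with numpy, the function names being ours) shows how, assuming S, M, and the penalty matrix Ξ are supplied as dense arrays:

```python
import numpy as np

def g_smooth(Omega, S, M):
    """Smooth part of the CGLASSO objective: -log|Omega| + tr(S Omega) + tr(M Omega^{-1})."""
    _, logdet = np.linalg.slogdet(Omega)          # Omega assumed positive definite
    Omega_inv = np.linalg.inv(Omega)
    return -logdet + np.trace(S @ Omega) + np.trace(M @ Omega_inv)

def f_objective(Omega, S, M, Xi):
    """Full objective f(Omega) = g(Omega) + sum_{k,k'} xi_{k,k'} |omega_{k,k'}|."""
    return g_smooth(Omega, S, M) + np.sum(Xi * np.abs(Omega))
```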
The function g(Ω) is twice differentiable and strictly convex. To see this, observe that g(Ω)
is (up to scaling and constants) the negative log-likelihood of a Gaussian chain graph model with known Ψ. Its Hessian is
the Fisher information of Ω and is positive definite. Writing W = Ω^{-1}, the second-order Taylor
expansion of the smooth part g(Ω), based at Ω and evaluated at Ω + ∆ for a symmetric ∆, is

$$
\begin{aligned}
\bar g_{\Omega}(\Delta) = \; & -\log|\Omega| + \operatorname{tr}(S\Omega) + \operatorname{tr}(MW) \\
& + \operatorname{tr}(S\Delta) - \operatorname{tr}(W\Delta) - \operatorname{tr}(WMW\Delta) \\
& + \frac{1}{2}\operatorname{tr}(W\Delta W\Delta) + \operatorname{tr}(WMW\Delta W\Delta)
\end{aligned}
\tag{S14}
$$
S2.1 Newton Direction
We now consider the coordinate descent update for the variable Ω_{k,k'} with k ≤ k'. Let D denote
the current approximation of the Newton direction and let D' be the updated direction. To
preserve symmetry, we set D' = D + µ(e_k e_{k'}^⊤ + e_{k'} e_k^⊤). Our goal, then, is to find the optimal µ:

$$
\arg\min_{\mu}\left\{\bar g_{\Omega}\left(D + \mu(e_k e_{k'}^{\top} + e_{k'} e_k^{\top})\right) + 2\xi_{k,k'}\left|\omega_{k,k'} + D_{k,k'} + \mu\right|\right\} \tag{S15}
$$

We begin by substituting ∆ = D' into ḡ_Ω(∆). Note that terms not depending on µ do not
affect the line search. Compared to QUIC, we have two additional terms, tr(WMW∆) and
tr(WMW∆W∆). The first term turns out to be linear in µ and the second is quadratic in µ.
Algorithm 1: The cgQUIC algorithm for the CGLASSO problem

Data: S = Y^⊤Y/n, M = (XΨ)^⊤(XΨ)/n, regularization parameter matrix Ξ, initial Ω_0,
      inner stopping tolerance ε, parameters 0 < σ < 0.5, 0 < β < 1
Result: path of positive definite Ω_t that converges to arg min_Ω f(Ω), with
        f(Ω) = g(Ω) + Σ_{k,k'} ξ_{k,k'}|ω_{k,k'}|, where g(Ω) = −log|Ω| + tr(SΩ) + tr(MΩ^{-1})

Initialize W_0 = Ω_0^{-1}
for t = 1, 2, ... do
    D = 0, U = 0
    Q = M W_{t−1}
    while not converged do
        Partition the variables into fixed and free sets based on the gradient:
            S_fixed := {(k, k') : |∇_{k,k'} g(Ω)| < ξ_{k,k'} and ω_{k,k'} = 0}
            S_free  := {(k, k') : |∇_{k,k'} g(Ω)| ≥ ξ_{k,k'} or ω_{k,k'} ≠ 0}
        for (k, k') ∈ S_free do
            Calculate the Newton direction:
                b = S_{k,k'} − W_{k,k'} + w_k^⊤ D w_{k'} − w_k^⊤ M w_{k'} + w_{k'}^⊤ D W M w_k + w_k^⊤ D W M w_{k'}
                c = ω_{k,k'} + D_{k,k'}
                if k ≠ k' then
                    a = W_{k,k'}^2 + W_{k,k} W_{k',k'} + W_{k,k} w_{k'}^⊤ M w_{k'} + W_{k',k'} w_k^⊤ M w_k + 2 W_{k,k'} w_k^⊤ M w_{k'}
                else
                    a = W_{k,k}^2 + 2 W_{k,k} w_k^⊤ M w_k
                end
                µ = −c + [|c − b/a| − ξ_{k,k'}/a]_+ sign(c − b/a)
                D_{k,k'} += µ;  u_k += µ w_{k'};  u_{k'} += µ w_k
        end
    end
    for α = 1, β, β², ... do
        Compute the Cholesky decomposition of Ω_{t−1} + αD
        if Ω_{t−1} + αD is not positive definite then continue
        Compute f(Ω_{t−1} + αD)
        if f(Ω_{t−1} + αD) ≤ f(Ω_{t−1}) + ασδ, with δ = tr[∇g(Ω_{t−1})^⊤ D] + ||Ω_{t−1} + D||_{1,Ξ} − ||Ω_{t−1}||_{1,Ξ}
            then break
    end
    Ω_t = Ω_{t−1} + αD
    W_t = Ω_t^{-1}, computed using the Cholesky decomposition result
end
return {Ω_t}
To see this, first observe

$$
\begin{aligned}
\operatorname{tr}(WMW\Delta) &= \operatorname{tr}\left(WMW\left(D + \mu(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})\right)\right) \\
&= C + \mu\operatorname{tr}\left(WMW e_k e_{k'}^{\top} + WMW e_{k'}e_k^{\top}\right) \\
&= C + \mu\left(e_{k'}^{\top}WMW e_k + e_k^{\top}WMW e_{k'}\right) \\
&= C + 2\mu\, e_k^{\top}WMW e_{k'} \\
&= C + 2\mu\, w_k^{\top}M w_{k'}
\end{aligned}
\tag{S16}
$$

where w_k is the kth column of W = Ω^{-1} and C collects terms that do not depend on µ.
Furthermore, we have

$$
\begin{aligned}
\operatorname{tr}(WMW\Delta W\Delta) &= \operatorname{tr}\left[WMW\left(D + \mu(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})\right)W\left(D + \mu(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})\right)\right] \\
&= \operatorname{tr}\left[WMWDWD\right] + 2\mu\operatorname{tr}\left[WMW(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})WD\right] \\
&\quad + \mu^2\operatorname{tr}\left[(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})WMW(e_k e_{k'}^{\top} + e_{k'}e_k^{\top})W\right] \\
&= C + 2\mu\left(w_{k'}^{\top}DWMw_k + w_k^{\top}DWMw_{k'}\right) \\
&\quad + \mu^2\left(W_{k,k}\,w_{k'}^{\top}Mw_{k'} + W_{k',k'}\,w_k^{\top}Mw_k + 2W_{k,k'}\,w_k^{\top}Mw_{k'}\right)
\end{aligned}
\tag{S17}
$$
By combining the above simplifications, we can minimize the objective with coordinate
descent. The update for ω_{k,k'} minimizes

$$
\begin{aligned}
& \frac{1}{2}\left[W_{k,k'}^2 + W_{k,k}W_{k',k'} + W_{k,k}w_{k'}^{\top}Mw_{k'} + W_{k',k'}w_k^{\top}Mw_k + 2W_{k,k'}w_k^{\top}Mw_{k'}\right]\mu^2 \\
& + \left[S_{k,k'} - W_{k,k'} + w_k^{\top}Dw_{k'} - w_k^{\top}Mw_{k'} + w_{k'}^{\top}DWMw_k + w_k^{\top}DWMw_{k'}\right]\mu \\
& + \xi_{k,k'}\left|\omega_{k,k'} + D_{k,k'} + \mu\right|
\end{aligned}
\tag{S18}
$$
The optimal solution (for off-diagonal ω_{k,k'}) is given by

$$
\mu = -c + \left[|c - b/a| - \xi_{k,k'}/a\right]_{+}\operatorname{sign}(c - b/a) \tag{S19}
$$

where

$$
\begin{aligned}
a &= W_{k,k'}^2 + W_{k,k}W_{k',k'} + W_{k,k}w_{k'}^{\top}Mw_{k'} + W_{k',k'}w_k^{\top}Mw_k + 2W_{k,k'}w_k^{\top}Mw_{k'} \\
b &= S_{k,k'} - W_{k,k'} + w_k^{\top}Dw_{k'} - w_k^{\top}Mw_{k'} + w_{k'}^{\top}DWMw_k + w_k^{\top}DWMw_{k'} \\
c &= \omega_{k,k'} + D_{k,k'}
\end{aligned}
$$
For diagonal entries, we take D' = D + µ e_k e_k^⊤; the two terms involving ∆ are then:

$$
\begin{aligned}
\operatorname{tr}(WMW\Delta) &= C + \mu\, w_k^{\top}Mw_k \\
\operatorname{tr}(WMW\Delta W\Delta) &= C + 2\mu\, w_k^{\top}DWMw_k + \mu^2\, W_{k,k}\, w_k^{\top}Mw_k
\end{aligned}
\tag{S20}
$$

Then we can take

$$
\begin{aligned}
a &= W_{k,k}^2 + 2W_{k,k}w_k^{\top}Mw_k \\
b &= S_{k,k} - W_{k,k} + w_k^{\top}Dw_k - w_k^{\top}Mw_k + 2w_k^{\top}DWMw_k \\
c &= \omega_{k,k} + D_{k,k}
\end{aligned}
$$

and use Equation (S19) to obtain the optimal µ and thus the updated Newton direction D'.
Note that computing the optimal µ requires repeated calculation of quantities like w_k^⊤ M w_{k'}
and w_k^⊤ D W M w_{k'}. To enable rapid computation, we track and update the values of U = DW
and Q = MW during our optimization.
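For illustration, a minimal Python sketch of the closed-form coordinate update (S19) is given below; it assumes the scalars a, b, c and the penalty weight have already been computed from the expressions above (the function name is ours):

```python
import numpy as np

def newton_coordinate_mu(a, b, c, xi):
    """Closed-form minimizer (S19) of the one-dimensional quadratic-plus-L1 problem
    that updates a single coordinate of the Newton direction."""
    shifted = c - b / a
    return -c + np.sign(shifted) * max(np.abs(shifted) - xi / a, 0.0)
```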
S2.2 Step Size
Like Hsieh et al. (2011), we use Armijo's rule to set a step size α that simultaneously ensures
our estimate of Ω remains positive definite and a sufficient decrease of our overall objective
function. We denote the Newton direction after a complete update over all active coordinates
as D (see Appendix S2.3 for active sets). We require our step size to satisfy the line search
condition (S21):

$$
f(\Omega + \alpha D) \leq f(\Omega) + \alpha\sigma\delta, \qquad \delta = \operatorname{tr}\left[\nabla g(\Omega)^{\top}D\right] + \|\Omega + D\|_{1,\Xi} - \|\Omega\|_{1,\Xi} \tag{S21}
$$
Three important properties can be established following Hsieh et al. (2011):

P1. The condition is satisfied for small enough α. This property follows exactly from
Proposition 1 of Hsieh et al. (2011).

P2. We have δ < 0 for all D ≠ 0, which ensures that the objective function decreases. This
property generally follows from Lemma 2 and Proposition 2 of Hsieh et al. (2011), which
require the Hessian of the smooth part g(Ω) to be positive definite. In our case the
Hessian of g(Ω) is the Fisher information of the chain graph model, ensuring its positive
definiteness.

P3. When Ω is close to the global optimum, the step size α = 1 will satisfy the line search
condition. To establish this, we follow the proof of Proposition 3 in Hsieh et al. (2011).
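The backtracking search implied by condition (S21) can be sketched as follows in Python; f and grad_g are assumed to evaluate the CGLASSO objective and the gradient of its smooth part (e.g., via the helpers sketched earlier), and the function name armijo_step is ours:

```python
import numpy as np

def armijo_step(Omega, D, f, grad_g, Xi, sigma=0.25, beta=0.5, max_backtracks=30):
    """Backtracking line search enforcing positive definiteness and the
    sufficient-decrease condition (S21)."""
    f0 = f(Omega)
    delta = np.trace(grad_g(Omega).T @ D) \
        + np.sum(Xi * np.abs(Omega + D)) - np.sum(Xi * np.abs(Omega))
    alpha = 1.0
    for _ in range(max_backtracks):
        candidate = Omega + alpha * D
        try:
            np.linalg.cholesky(candidate)          # positive definiteness check
        except np.linalg.LinAlgError:
            alpha *= beta
            continue
        if f(candidate) <= f0 + alpha * sigma * delta:
            return alpha, candidate
        alpha *= beta
    return alpha, Omega + alpha * D
```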
S2.3 Thresholding to Decide the Active Sets
Similar to the QUIC procedure, our algorithm does not need to update every ω_{k,k'} in each
iteration. We instead follow Hsieh et al. (2011) and only update those parameters exceeding
a certain threshold. More specifically, we can partition the parameters ω_{k,k'} into a fixed set
S_fixed, containing those parameters falling below the threshold, and a free set S_free, containing
those parameters exceeding the threshold. That is,

$$
\begin{aligned}
\omega_{k,k'} \in S_{\mathrm{fixed}} \quad &\text{if } |\nabla_{k,k'}\,g(\Omega)| \leq \xi_{k,k'} \text{ and } \omega_{k,k'} = 0, \\
\omega_{k,k'} \in S_{\mathrm{free}} \quad &\text{otherwise.}
\end{aligned}
\tag{S22}
$$
We can determine the free set S_free using the minimum-norm sub-gradient grad^S_{k,k'} f(Ω), which
is defined in Definition 2 of Hsieh et al. (2011). In our case ∇g(Ω) = S − Ω^{-1} − Ω^{-1}MΩ^{-1},
so the minimum-norm sub-gradient is

$$
\operatorname{grad}^{S}_{k,k'} f(\Omega) =
\begin{cases}
\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'} + \xi_{k,k'} & \text{if } \omega_{k,k'} > 0 \\
\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'} - \xi_{k,k'} & \text{if } \omega_{k,k'} < 0 \\
\operatorname{sign}\!\left(\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'}\right)\left[\left|\left(S - \Omega^{-1} - \Omega^{-1}M\Omega^{-1}\right)_{k,k'}\right| - \xi_{k,k'}\right]_{+} & \text{if } \omega_{k,k'} = 0
\end{cases}
\tag{S23}
$$
Note that the subgradient evaluated on the fixed set is always equal to 0. Thus, following
Lemma 4 in Hsieh et al. (2011), the elements of the fixed set do not change during our
coordinate descent procedure. It suffices, then, to only compute the Newton direction on
the free set and update those parameters.
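A simple sketch of the fixed/free partition in (S22), written in Python with numpy (the function name is ours), is:

```python
import numpy as np

def split_active_sets(Omega, grad_g_Omega, Xi):
    """Partition the entries of Omega into fixed and free sets as in (S22):
    an entry is fixed when it equals zero and its gradient is below the penalty;
    otherwise it is free and will be updated."""
    fixed = (Omega == 0) & (np.abs(grad_g_Omega) <= Xi)
    free = ~fixed
    return fixed, free
```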
S2.4 Unique minimizer
In this subsection, we show that the CGLASSO problem admits a unique minimizer. Our
proof largely follows the proofs of Lemma 3 and Theorem 1 of Hsieh et al. (2011) but makes
suitable modifications to account for the extra tr(MΩ^{-1}) term in the CGLASSO objective.

Theorem S1 (Unique minimizer). There is a unique global minimum for the CGLASSO
problem (S12).
We first show the entire sequence of iterates {Ω_t} lies in a particular, compact level set. To
this end, let

$$
U = \{\Omega \mid f(\Omega) \leq f(\Omega_0), \; \Omega \in S^{q}_{++}\}. \tag{S24}
$$
To see that all iterates lie in U, we need to check that the line search condition in
Equation (S21) has δ < 0. By directly applying Hsieh et al. (2011)'s Lemma 2 to g(Ω), we have
that

$$
\delta \leq -\operatorname{vec}(D)^{\top}\nabla^2 g(\Omega)\operatorname{vec}(D),
$$

where D is the Newton direction. Since g(Ω) (Equation (S13)) is strictly convex, ∇²g(Ω) is positive
definite, so the function value f(Ω_t) is always decreasing.

We now need to check that the level set is actually contained in a compact set, by suitably
adapting Lemma 3 of Hsieh et al. (2011).
Lemma S1. The level set U defined in (S24) is contained in the set {mI ⪯ Ω ⪯ NI} for
some constants m, N > 0, if we assume that the off-diagonal elements of Ξ and the diagonal
elements of S are positive.
Proof. We begin by showing that the largest eigenvalue of Ω is bounded by some constant
that does not depend on Ω. Recall that S and M are positive semi-definite. Since Ω is
positive definite, we have tr(SΩ) + tr(MΩ^{-1}) > 0 and ||Ω||_{1,Ξ} + tr(MΩ^{-1}) > 0.
Therefore we have

$$
\begin{aligned}
f(\Omega_0) &\geq f(\Omega) \geq -\log(|\Omega|) + \|\Omega\|_{1,\Xi} \\
f(\Omega_0) &\geq f(\Omega) \geq -\log(|\Omega|) + \operatorname{tr}(S\Omega)
\end{aligned}
\tag{S25}
$$
Since ||Ω||_2 is the largest eigenvalue of Ω, we have log(|Ω|) ≤ q log(||Ω||_2).

Using the assumption that each off-diagonal entry of Ξ is larger than some positive number ξ,
we know that

$$
\xi\sum_{k \neq k'}|\Omega_{k,k'}| \leq \|\Omega\|_{1,\Xi} \leq f(\Omega_0) + q\log(\|\Omega\|_2) \tag{S26}
$$

Similarly, we have

$$
\operatorname{tr}(S\Omega) \leq f(\Omega_0) + q\log(\|\Omega\|_2) \tag{S27}
$$
Let α = min_k(S_{k,k}) and β = max_{k≠k'} |S_{k,k'}|. We can split tr(SΩ) into two parts, which
can be further lower bounded:

$$
\operatorname{tr}(S\Omega) = \sum_{k}S_{k,k}\Omega_{k,k} + \sum_{k \neq k'}S_{k,k'}\Omega_{k,k'} \geq \alpha\operatorname{tr}(\Omega) - \beta\sum_{k \neq k'}|\Omega_{k,k'}| \tag{S28}
$$

Since ||Ω||_2 ≤ tr(Ω), by using Equation (S28) we have

$$
\alpha\|\Omega\|_2 \leq \alpha\operatorname{tr}(\Omega) \leq \operatorname{tr}(\Omega S) + \beta\sum_{k \neq k'}|\Omega_{k,k'}| \tag{S29}
$$
By combining Equations (S26), (S27), and (S29), we conclude that

$$
\alpha\|\Omega\|_2 \leq (1 + \beta/\xi)\left(f(\Omega_0) + q\log(\|\Omega\|_2)\right) \tag{S30}
$$

The left hand side, as a function of ||Ω||_2, grows much faster than the right hand side. Thus
||Ω||_2 can be bounded by a quantity N depending only on the value of f(Ω_0), α, β and ξ.

We now consider the smallest eigenvalue, denoted by a. We use the upper bound N on the
other eigenvalues to bound the determinant. Using the fact that f(Ω) is always decreasing
during the iterations, we have

$$
f(\Omega_0) \geq f(\Omega) \geq -\log(|\Omega|) \geq -\log(a) - (q-1)\log(N) \tag{S31}
$$

Thus m = e^{-f(\Omega_0)}N^{-q+1} is a lower bound for the smallest eigenvalue a.
We are now ready to prove Theorem S1 by showing the objective function is strongly convex
on a compact set.

Proof. Because of Lemma S1, the level set U contains all iterates produced by cgQUIC. The
set U is further contained in the compact set {mI ⪯ Ω ⪯ NI}. By the Weierstrass extreme
value theorem, the continuous function f(Ω) in (S13) attains its minimum on this set.

Further, the modified objective function is also strongly convex in its smooth part. This is
because tr(MΩ^{-1}) and tr(SΩ) are convex and −log(|Ω|) is strongly convex on this compact set. Since tr(MΩ^{-1})
is convex, the Hessian of the smooth part has the same lower bound as in Theorem 1 of
Hsieh et al. (2011). By following the argument in the proof of Theorem 1 of Hsieh et al.
(2011), we can show the objective function f(Ω) is strongly convex on the compact set
{mI ⪯ Ω ⪯ NI}, and thus has a unique minimizer.
We can further show that the cgQUIC procedure converges to the unique minimizer, using
the general results on quadratic approximation methods studied in Hsieh et al. (2011).

Theorem S2 (Convergence). cgQUIC converges to the global optimum.

Proof. cgQUIC is an example of the quadratic approximation method investigated in Section
4.1 of Hsieh et al. (2011), with a strongly convex smooth part g(Ω) in (S13). Convergence to
the global optimum follows from their Theorem 2.
S3 Synthetic experiment results
We now present the remaining results from our simulation experiments. These results are
qualitatively similar to those from the (n, p, q) = (100, 10, 10) setting presented in the main
text. Generally speaking, in terms of support recovery, the methods that deployed a single
fixed penalty (cgLASSO and CAR) displayed higher sensitivity but lower precision than both
cgSSL-DPE and cgSSL-DCPE. The only exception was when Ω was dense. Furthermore,
methods with adaptive penalties (both cgSSL procedures and CAR-A) tended to return
fewer non-zero estimates than the fixed-penalty methods. Most of these non-zero estimates
were in fact true positives. Across all settings of (n, p, q), cgSSL-DPE makes virtually no
false positive identifications in the support of Ψ. In terms of parameter estimation, the fixed
penalty methods tended to have larger Frobenius error in estimating both Ψ and Ω than
the cgSSL. Note that cgLASSO uses ten-fold cross-validation to set the two penalty levels.
Even with a parallel implementation and warm-starts, the full cgLASSO procedure did not
converge after 72 hours in the n = 400 setting.
Figure S4: Sensitivity, specificity and Frobenius loss of parameter estimation when p = 10, q = 10, n = 100.
Figure S5: Sensitivity, specificity and Frobenius loss of parameter estimation when p = 20, q = 30, n = 100.
Figure S6: Sensitivity, specificity and Frobenius loss of parameter estimation when p = 10, q = 30, n = 400. cgLASSO was not able to finish with dense Ω within 72 hours, so we omit that result.
Table S2: Sensitivity, precision, and Frobenius error for Ψ and Ω when (n, p, q) = (100, 20, 30)
for each specification of Ω. For each choice of Ω, the best performance is bold-faced.
                      Ψ recovery                          Ω recovery
Method        SEN        PREC        FROB        SEN        PREC        FROB
AR(1) model
cgLASSO 0.94 (0.09) 0.3 (0.14) 0.21 (0.21) 0.74 (0.42) 0.48 (0.36) 111.66 (66.15)
CAR 0.54 (0.05) 0.39 (0.03) 0.11 (0.02) 1 (0.01) 0.21 (0.01) 11.15 (2.35)
CAR-A 0.69 (0.03) 0.69 (0.04) 0.04 (0.01) 1 (0) 0.74 (0.06) 15.07 (4.99)
cgSSL-dcpe 0.66 (0.02) 0.82 (0.07) 0.08 (0.02) 0.94 (0.04) 0.68 (0.07) 34.1 (19.97)
cgSSL-dpe 0.69 (0.02) 1 (0.01) 0.02 (0) 1 (0) 0.82 (0.07) 4.87 (1.73)
AR(2) model
cgLASSO 0.94 (0.07) 0.3 (0.13) 0.17 (0.08) 0.98 (0.12) 0.18 (0.12) 14.44 (6.93)
CAR 0.42 (0.04) 0.37 (0.03) 0.15 (0.02) 0.38 (0.08) 0.25 (0.04) 15.8 (2.68)
CAR-A 0.64 (0.04) 0.76 (0.04) 0.07 (0.02) 0.91 (0.05) 0.87 (0.04) 3.16 (1.15)
cgSSL-dcpe 0.73 (0.02) 0.7 (0.06) 0.05 (0.01) 0.81 (0.09) 0.32 (0.05) 14.5 (4.21)
cgSSL-dpe 0.72 (0.02) 0.99 (0.01) 0.01 (0) 1 (0) 0.48 (0.03) 1.04 (0.35)
Block model
cgLASSO 0.92 (0.07) 0.4 (0.19) 0.51 (0.46) 0.62 (0.46) 0.93 (0.06) 27.48 (6.01)
CAR 0.51 (0.05) 0.4 (0.03) 0.12 (0.02) 0.53 (0.04) 0.68 (0.03) 12.96 (2.49)
CAR-A 0.66 (0.04) 0.64 (0.04) 0.06 (0.01) 0.47 (0.03) 0.96 (0.02) 23.14 (4.16)
cgSSL-dcpe 0.82 (0.1) 0.29 (0.19) 0.68 (0.19) 0.1 (0.28) 0.88 (0.1) 30.22 (2.11)
cgSSL-dpe 0.61 (0.02) 0.99 (0.02) 0.07 (0.02) 0.66 (0.36) 0.9 (0.05) 30.39 (4.02)
Star model
cgLASSO 0.91 (0.03) 0.45 (0.04) 0.08 (0.02) 0.7 (0.18) 0.31 (0.14) 6.45 (3.27)
CAR 0.45 (0.06) 0.41 (0.04) 0.14 (0.02) 0.32 (0.09) 0.12 (0.03) 4.36 (1.07)
CAR-A 0.69 (0.05) 0.68 (0.03) 0.06 (0.02) 0.31 (0.09) 0.35 (0.09) 2.83 (0.69)
cgSSL-dcpe 0.77 (0.02) 0.94 (0.03) 0.01 (0) 0.61 (0.21) 0.57 (0.08) 0.6 (0.21)
cgSSL-dpe 0.73 (0.02) 1 (0) 0.01 (0) 0.83 (0.08) 0.54 (0.1) 0.89 (0.32)
Dense model
cgLASSO 0.89 (0.04) 0.43 (0.05) 0.07 (0.03) 0.34 (0.41) 1 (0) 712.46 (354.87)
CAR 0.49 (0.06) 0.39 (0.03) 0.13 (0.02) 0.05 (0.01) 1 (0) 914.47 (6.45)
CAR-A 0.7 (0.04) 0.64 (0.04) 0.05 (0.01) 0.01 (0.01) 1 (0) 897.91 (5.95)
cgSSL-dcpe 0.77 (0.01) 0.99 (0.01) 0.01 (0) 0 (0.01) 1 (0) 900 (0.01)
cgSSL-dpe 0.72 (0.02) 1 (0.01) 0.01 (0.01) 0.03 (0.03) 1 (0) 901.45 (2.97)
Table S3: Sensitivity, precision, and Frobenius error for Ψ and Ω when (n, p, q) =
(400, 100, 30) for each specification of Ω. For each choice of Ω, the best performance is
bold-faced. The cgLASSO method was not able to finish within 72 hours.
                      Ψ recovery                          Ω recovery
Method        SEN        PREC        FROB        SEN        PREC        FROB
AR(1) model
cgLASSO 1 (0) 0.2 (0) 0.07 (0.11) 0.94 (0.23) 0.46 (0.14) 27.98 (40.98)
CAR 0.82 (0.02) 0.46 (0.01) 0.02 (0) 1 (0) 0.27 (0.02) 2.23 (0.51)
CAR-A 0.86 (0.01) 0.73 (0.02) 0.01 (0) 1 (0) 0.89 (0.05) 7.32 (1.48)
cgSSL-dcpe 0.74 (0.01) 0.89 (0.03) 0.07 (0) 1 (0) 0.46 (0.03) 70.84 (3.72)
cgSSL-dpe 0.87 (0.01) 0.99 (0) 0 (0) 1 (0) 0.78 (0.06) 3.42 (0.8)
AR(2) model
cgLASSO 1 (0) 0.2 (0) 0.15 (0.05) 0.79 (0.23) 0.63 (0.22) 10.82 (4.04)
CAR 0.85 (0.02) 0.5 (0.01) 0.01 (0) 0.98 (0.02) 0.49 (0.03) 0.38 (0.15)
CAR-A 0.89 (0.01) 0.77 (0.02) 0.01 (0) 1 (0.01) 0.94 (0.03) 1.22 (0.24)
cgSSL-dcpe 0.87 (0.01) 0.79 (0.14) 0.04 (0.02) 1 (0) 0.31 (0.05) 10.5 (6.66)
cgSSL-dpe 0.92 (0) 1 (0) 0 (0) 1 (0) 0.47 (0.03) 0.26 (0.08)
Block model
cgLASSO 1 (0) 0.2 (0) 0.44 (0.25) 0.87 (0.21) 0.97 (0.11) 10.05 (11.21)
CAR 0.84 (0.02) 0.46 (0.01) 0.02 (0) 0.71 (0.03) 0.76 (0.02) 3.36 (0.24)
CAR-A 0.88 (0.01) 0.7 (0.02) 0.01 (0) 0.75 (0.02) 0.99 (0.01) 4.13 (0.5)
cgSSL-dcpe 0.9 (0.01) 0.22 (0) 0.82 (0.04) 0 (0.01) 0.64 (NA) 29.51 (0.24)
cgSSL-dpe 0.86 (0.01) 0.99 (0.01) 0.01 (0) 0.98 (0.02) 0.98 (0.01) 1.44 (0.43)
Star model
cgLASSO 0.93 (0) 0.83 (0.02) 0.01 (0) 0.53 (0.41) 0.59 (0.45) 4.68 (3.43)
CAR 0.89 (0.01) 0.48 (0.01) 0.01 (0) 0.73 (0.09) 0.25 (0.03) 0.55 (0.1)
CAR-A 0.91 (0.01) 0.7 (0.02) 0.01 (0) 0.87 (0.07) 0.74 (0.06) 1.07 (0.18)
cgSSL-dcpe 0.88 (0) 1 (0) 0 (0) 1 (0) 0.89 (0.05) 0.29 (0.08)
cgSSL-dpe 0.89 (0) 1 (0) 0 (0) 1 (0) 0.9 (0.05) 0.27 (0.06)
Dense model
cgLASSO
CAR 0.87 (0.02) 0.39 (0.01) 0.01 (0) 0 (0) NaN (NA) 964.24 (9.25)
CAR-A 0.88 (0.01) 0.52 (0.01) 0.01 (0) 0 (0) NaN (NA) 964.08 (9.71)
cgSSL-dcpe 0.87 (0.01) 0.94 (0.02) 0.03 (0) 0.21 (0.02) 1 (0) 913.81 (2.33)
cgSSL-dpe 0.86 (0.01) 0.98 (0.01) 0.04 (0.01) 0.26 (0.01) 1 (0) 918.35 (4.57)
S4 Preprocessing for real data experiment
To conduct our reanalysis of Claesson et al. (2012)'s gut microbiome data, we preprocessed
the raw 16s-rRNAseq data following the workflow provided by the MG-RAST server (Keegan
et al., 2016). We first "annotated" the sequences to get genus counts (i.e., the number of segments
belonging to each genus). The annotation process compares the rRNA segments detected during
sequencing to the reference sequence of each genus of microbes, and then counts the number of
rRNA segments that match each genus. We used the MG-RAST server's default tuning
parameters during the annotation process. That is, we set the e-value to 5 and annotated
with 60% identity, an alignment length of 15 bp, and a minimal abundance of 10 reads.

Following standard practices for analyzing microbiome data, we transformed raw counts into
relative abundances. We selected genera with more than 0.5% relative abundance in more than
50 samples as the focal genera and aggregated all other genera into a reference group. We
further took the log-odds (with respect to the reference group described above) to stabilize
the variances (Aitchison, 1982) in order to fit our normal model.
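The transformation just described amounts to an additive log-ratio style map from counts to log-odds; a minimal Python sketch is shown below. The function name and input layout (a samples-by-genera count matrix and a list of focal column indices) are illustrative assumptions, not the exact code used in our pipeline.

```python
import numpy as np

def log_odds_transform(counts, focal_idx):
    """Convert raw genus counts to relative abundances, aggregate all non-focal
    genera into a reference group, and take log-odds against that reference."""
    counts = np.asarray(counts, dtype=float)           # samples x genera
    rel = counts / counts.sum(axis=1, keepdims=True)   # relative abundance
    focal = rel[:, focal_idx]
    reference = 1.0 - focal.sum(axis=1, keepdims=True) # aggregated reference group
    return np.log(focal / reference)
```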
S5 Proofs of posterior contraction for cgSSL
This section provides details on the posterior contraction results for the cgSSL. Our proof
was inspired by Ning et al. (2020) and Bai et al. (2020). We first show contraction in
log-affinity by verifying the Kullback-Leibler (KL) and test conditions following Ghosal and van der Vaart
(2017). Then we use the results in log-affinity to show recovery of the parameters.
To establish our results, we work with a slightly modified prior on Ω that has density

$$
f_{\Omega}(\Omega) \propto \prod_{k > k'}\left[(1-\eta)\frac{\xi_0}{2}\exp\left(-\xi_0|\omega_{k,k'}|\right) + \eta\frac{\xi_1}{2}\exp\left(-\xi_1|\omega_{k,k'}|\right)\right] \times \prod_{k}\xi\exp\left(-\xi\omega_{k,k}\right) \times \mathbb{1}(\Omega \succeq \tau I) \tag{S32}
$$

$$
f_{\Psi}(\Psi) = \prod_{j,k}\left[(1-\theta)\frac{\lambda_0}{2}\exp\left(-\lambda_0|\psi_{j,k}|\right) + \theta\frac{\lambda_1}{2}\exp\left(-\lambda_1|\psi_{j,k}|\right)\right] \tag{S33}
$$

where 0 < τ < 1/b_2. This way, τ is less than the lower bound of the smallest eigenvalue of
the true precision matrix Ω_0.
S5.1 The Kullback-Leibler condition
We need to verify that our prior places enough probability in small neighborhoods around
each of the possible values of the true parameters. These neighborhoods are defined in a KL
sense.

Lemma S2 (KL conditions). Let ε_n = \sqrt{\max\{p, q, s^{\Omega}_0, s^{\Psi}_0\}\log(\max\{p, q\})/n}. Then for all
true parameters (Ψ_0, Ω_0) we have

$$
-\log \Pi\left((\Psi, \Omega) : K(f_0, f) \leq n\epsilon_n^2, \; V(f_0, f) \leq n\epsilon_n^2\right) \leq C_1 n\epsilon_n^2.
$$

Further, let E_n be the event

$$
E_n = \left\{\mathbf{Y} : \int\int f/f_0\, d\Pi(\Psi)\, d\Pi(\Omega) \geq e^{-C_1 n\epsilon_n^2}\right\}.
$$

Then for all (Ψ_0, Ω_0), we have P_0(E^c_n) → 0 as n → ∞.
The last assertion, that P_0(E^c_n) → 0, follows from Lemma 8.1 of Ghosal and van der Vaart
(2017), so we focus on establishing the first assertion of the Lemma. To verify this
condition we need to bound the prior mass of certain events A. However, the truncation of
the prior on Ω makes computing these masses intractable. To overcome this, we first bound
the prior probability of events of the form A ∩ {Ω ⪰ τI} by observing that the prior on Ω can be
viewed as a particular conditional distribution.
Specifically, let Π̃ be the untruncated spike-and-slab LASSO prior with density

$$
\tilde f_{\Omega}(\Omega) = \prod_{k > k'}\left[(1-\eta)\frac{\xi_0}{2}\exp\left(-\xi_0|\omega_{k,k'}|\right) + \eta\frac{\xi_1}{2}\exp\left(-\xi_1|\omega_{k,k'}|\right)\right] \times \prod_{k}\xi\exp\left(-\xi\omega_{k,k}\right).
$$
The following Lemma shows that we can bound Π probabilities using Π̃ probabilities.

Lemma S3 (Bounds for the graphical prior). Let Π̃ be the untruncated version of the prior
on Ω. Then for all events A, for large enough n there is a number R that does not depend
on n such that

$$
\tilde\Pi(\Omega \succeq \tau I \mid A)\,\tilde\Pi(A) \leq \Pi\left(A \cap \{\Omega \succeq \tau I\}\right) \leq \exp\left(2\xi Q - \log(R)\right)\tilde\Pi(A) \tag{S34}
$$

where Q = q(q−1)/2 is the total number of free off-diagonal entries in Ω.
Proof. Consider an event of the form A ∩ {Ω ⪰ τI} ⊂ R^{q×q}. The prior mass Π(A ∩ {Ω ⪰ τI})
can be viewed as a conditional probability:

$$
\Pi\left(A \cap \{\Omega \succeq \tau I\}\right) = \tilde\Pi(A \mid \Omega \succeq \tau I) = \frac{\tilde\Pi(\Omega \succeq \tau I \mid A)\,\tilde\Pi(A)}{\tilde\Pi(\Omega \succeq \tau I)} \tag{S35}
$$

The lower bound follows because the denominator is bounded from above by 1.
For the upper bound, we first observe that

$$
\Pi\left(A \cap \{\Omega \succeq \tau I\}\right) = \tilde\Pi(A \mid \Omega \succeq \tau I) = \frac{\tilde\Pi(\Omega \succeq \tau I \mid A)\,\tilde\Pi(A)}{\tilde\Pi(\Omega \succeq \tau I)} \leq \left(\tilde\Pi(\Omega \succeq \tau I)\right)^{-1}\tilde\Pi(A) \tag{S36}
$$
To upper bound the probability in Equation (S35), we find a lower bound on the denominator
Π̃(Ω ⪰ τI). To this end, let

$$
G = \left\{\Omega : \omega_{k,k} > q - 1, \; |\omega_{k,k'}| \leq \frac{1-\tau}{q-1} \text{ for } k' \neq k\right\}
$$

and consider an Ω ∈ G. Since all of Ω's eigenvalues are real, they must each be contained
in at least one Gershgorin disc. Consider the kth Gershgorin disc, whose intersection with
the real line is an interval centered at ω_{k,k} with half-width Σ_{k'≠k}|ω_{k,k'}|. Any eigenvalue of
Ω that lies in this disc must be greater than

$$
\omega_{k,k} - \sum_{k' \neq k}|\omega_{k,k'}| > (q-1) - (1-\tau) \geq \tau.
$$

Thus, we have G ⊆ {Ω : Ω ⪰ τI}.
Since the entries of Ω are independent under Π̃, we compute

$$
\begin{aligned}
\tilde\Pi(G) &\geq \prod_{k}\int_{q-1}^{\infty}\xi\exp(-\xi\omega_{k,k})\,d\omega_{k,k} \times (1-\eta)^Q\prod_{k>k'}\int_{|\omega_{k,k'}| \leq \frac{1-\tau}{q-1}}\frac{\xi_0}{2}\exp(-\xi_0|\omega_{k,k'}|)\,d\omega_{k,k'} \\
&\geq \exp(-2\xi Q)(1-\eta)^Q\left[1 - \frac{\mathbb{E}|\omega_{k,k'}|}{\frac{1-\tau}{q-1}}\right]^Q \\
&= \exp(-2\xi Q)(1-\eta)^Q\left[1 - \frac{1}{\xi_0\frac{1-\tau}{q-1}}\right]^Q \\
&\geq \exp(-2\xi Q)\left(1 - \frac{1}{1 + K_1 Q^{2+a}}\right)^Q\left(1 - \frac{1}{K_3 Q^{2+b}(1-\tau)}\right)^Q \\
&\geq \exp\left(-2\xi Q + \log(R)\right),
\end{aligned}
\tag{S37}
$$

where R > 0 does not depend on n. Note that the first inequality holds by ignoring the
contribution to the probability from the slab distribution. The second inequality is Markov's
inequality and the third inequality follows from our assumptions about how ξ_0 and η are
tuned.
Let S^Ψ_0 and S^Ω_0 respectively denote the supports of Ψ_0 and Ω_0. Similarly, let s^Ψ_0 be the number
of true non-zero entries in Ψ_0 and let s^Ω_0 be the true number of non-zero off-diagonal entries
in Ω_0.
The KL divergence between a Gaussian chain graph model with parameters (Ψ_0, Ω_0) and
one with parameters (Ψ, Ω) is

$$
\frac{1}{n}K(f_0, f) = \mathbb{E}_0\left[\log\frac{f_0}{f}\right] = \frac{1}{2}\left(\log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega) + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega^{1/2}\left(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1}\right)^{\top}X_i^{\top}\right\|_2^2\right) \tag{S38}
$$

The KL variance is:

$$
\frac{1}{n}V(f_0, f) = \operatorname{Var}_0\left[\log\frac{f_0}{f}\right] = \frac{1}{2}\left(\operatorname{tr}\left((\Omega_0^{-1}\Omega)^2\right) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q\right) + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega_0^{-1/2}\Omega\left(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1}\right)^{\top}X_i^{\top}\right\|_2^2 \tag{S39}
$$
We need to lower bound the prior probability of the event

$$
\left\{(\Psi, \Omega) : K(f_0, f) \leq n\epsilon_n^2, \; V(f_0, f) \leq n\epsilon_n^2\right\}
$$

for large enough n.

We first obtain an upper bound on the average KL divergence and variance, so that the mass
of a smaller event can serve as a lower bound. To simplify the notation, we write ∆_Ψ = Ψ − Ψ_0
and ∆_Ω = Ω − Ω_0. We observe that ΨΩ^{-1} − Ψ_0Ω_0^{-1} = (∆_Ψ − Ψ_0Ω_0^{-1}∆_Ω)Ω^{-1}.

Using the fact that ||A − B||²_2 ≤ (||A||_2 + ||B||_2)² ≤ 2||A||²_2 + 2||B||²_2 for any two matrices A
and B, we obtain a simple upper bound:
$$
\begin{aligned}
\frac{1}{n}K(f_0, f) &= \frac{1}{2}\left(\log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega) + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega^{-1/2}\Delta_\Psi^{\top}X_i^{\top} - \Omega^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^{\top}X_i^{\top}\right\|_2^2\right) \\
&\leq \frac{1}{2}\left(\log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega)\right) + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^{\top}X_i^{\top}\right\|_2^2 + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega^{-1/2}\Delta_\Psi^{\top}X_i^{\top}\right\|_2^2 \\
&= \frac{1}{2}\left(\log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega)\right) + \frac{1}{n}\left\|\mathbf{X}\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega^{-1/2}\right\|_F^2 + \frac{1}{n}\left\|\mathbf{X}\Delta_\Psi\Omega^{-1/2}\right\|_F^2
\end{aligned}
\tag{S40}
$$

The last line holds because Ω^{-1/2}∆_Ψ^⊤X_i^⊤ is the (transposed) ith row of X∆_ΨΩ^{-1/2}.
Using the same inequality, we derive a similar upper bound for the average KL variance:

$$
\begin{aligned}
\frac{1}{n}V(f_0, f) &= \frac{1}{2}\left(\operatorname{tr}\left((\Omega_0^{-1}\Omega)^2\right) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q\right) + \frac{1}{n}\sum_{i=1}^{n}\left\|\Omega_0^{-1/2}\Delta_\Psi^{\top}X_i^{\top} - \Omega_0^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^{\top}X_i^{\top}\right\|_2^2 \\
&\leq \frac{1}{2}\left(\operatorname{tr}\left((\Omega_0^{-1}\Omega)^2\right) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q\right) + \frac{2}{n}\sum_{i=1}^{n}\left\|\Omega_0^{-1/2}\Delta_\Omega\Omega_0^{-1}\Psi_0^{\top}X_i^{\top}\right\|_2^2 + \frac{2}{n}\sum_{i=1}^{n}\left\|\Omega_0^{-1/2}\Delta_\Psi^{\top}X_i^{\top}\right\|_2^2 \\
&= \frac{1}{2}\left(\operatorname{tr}\left((\Omega_0^{-1}\Omega)^2\right) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q\right) + \frac{2}{n}\left\|\mathbf{X}\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega_0^{-1/2}\right\|_F^2 + \frac{2}{n}\left\|\mathbf{X}\Delta_\Psi\Omega_0^{-1/2}\right\|_F^2
\end{aligned}
\tag{S41}
$$
Similar to Ning et al. (2020) and Bai et al. (2020), we find an event A_1 involving only Ω and
an event A_2 involving both Ω and Ψ such that (A_1 ∩ {Ω ≻ 0}) ∩ A_2 is a subset of the event
of interest {K/n ≤ ε²_n, V/n ≤ ε²_n}.
To this end, define

$$
\begin{aligned}
A_1 = \Big\{\Omega : \; & \frac{1}{2}\left(\operatorname{tr}\left((\Omega_0^{-1}\Omega)^2\right) - 2\operatorname{tr}(\Omega_0^{-1}\Omega) + q\right) + \frac{2}{n}\left\|\mathbf{X}\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega_0^{-1/2}\right\|_F^2 \leq \epsilon_n^2/2\Big\} \\
\bigcap \Big\{\Omega : \; & \frac{1}{2}\left(\log\frac{|\Omega_0|}{|\Omega|} - q + \operatorname{tr}(\Omega_0^{-1}\Omega)\right) + \frac{1}{n}\left\|\mathbf{X}\Psi_0\Omega_0^{-1}\Delta_\Omega\Omega^{-1/2}\right\|_F^2 \leq \epsilon_n^2/2\Big\}
\end{aligned}
\tag{S42}
$$

and

$$
A_2 = \left\{(\Omega, \Psi) : \frac{1}{n}\left\|\mathbf{X}\Delta_\Psi\Omega_0^{-1/2}\right\|_F^2 \leq \frac{\epsilon_n^2}{2}, \; \frac{2}{n}\left\|\mathbf{X}\Delta_\Psi\Omega^{-1/2}\right\|_F^2 \leq \frac{\epsilon_n^2}{2}\right\} \tag{S43}
$$

We separately bound the prior probabilities Π(A_1) and Π(A_2 | A_1).
S5.1.1 Bounding the prior mass Π(A1)
The goal here is to find a proper lower bound of prior mass on A1. To do this, first consider
the set
A?
1={2X
k>k0|ω0,k,k0ωk,k0|+X
k|ω0,k,k ωk,k | n
c1p}
where c1>0 is a constant to be specified. Since the Frobenius norm is bounded by the
vectorized L1 norm, we immediately conclude that
A?
1k0kFn
c1p.
We now show that nk0kFn
c1po A1.
Since the Frobenius norm bounds the L2 operator norm, if ||0||Fn
c1pthen the
absolute value of the eigenvalues of 0are bounded by n
c1p.Further, because we have
assumed 0has bounded spectrum, the spectrum of = 0+0is bounded by λminn
c1p
and λmax +n
c1p.When nis large enough, these quantities are further bounded by λmin/2
and 2λmax.Thus, for nlarge enough, if ||0||Fn
c1p,then we know has bounded
spectrum.
Consequently, 1/2has bounded L2 operator norm. Using the fact that ||AB||Fmin(|||A|||2||B||F,|||B|||2||A||F),
we have for some constant c2not depending on n,
2
n||XΨ01
01/2
0||2
F2
n|||XΨ0|||2
2||1
01/2
0||2
F
2
n||X||2
F|||Ψ0|||2
2|||1
01/2
0||2
F
pc2
2||||2
F,
where we have used the fact that ||X||F=np. Thus ||||Fn
2c2pimplies
1
n||XΨ01
01/2
0||2
F2
n/4.
Similarly, for some constant c3, we have that
1
n||XΨ01
01/2||2
F1
n||X||2
F||Ψ0||2
2||1
01/2||2
F
pc2
3||||2
F
Thus we have ||||Fn
2c3pimplies 1
2n||XΨ01
01/2||2
F2
n/4
Using an argument from Ning et al. (2020), ||||Fn
2b2pn/2b2implies the following
two inequalities
1
2(tr((Ω1
0Ω)2)2 tr(Ω1
0Ω) + q)2
n/4
1
2(log(|0|
||)q+ tr(Ω1
0Ω)) 2
n/4.
Thus by taking c1= 2 max{c2, c3, b2}, we can conclude {||0||Fn
c1p}⊂A1. Thus
A?
1 A1
Since A?
1 { : ||0||Fn/c1p},we know that ˜
Π(Ω τI|A?
1) = 1.We can therefore
lower bound Π(A1) by Π(A
1 {τI}). Instead of calculating the latter probability
directly, we can lower bound it by observing
2X
k>k0|ω0,k,k0ωk,k0|+X
k|ω0,k,k ωk,k |
=2 X
(k,k0)S
0
|ω0,k,k0ωk,k0|+ 2 X
(k,k0)(S
0)c|ωk,k0|+X
k|ω0,k,k ωk,k |.
Consider the following events
B1=
X
(k,k0)S
0
|ω0,k,k0ωk,k0| n
6c1p
B2=
X
(k,k0)(S
0)c|ωk,k0| n
6c1p
B3=(X
k|ω0,k,k ωk,k | n
3c1p)
Let B=T3
i=1 Bi A
1 A1. Since the prior probability of Blower bounds Π(A1), we now
focus on estimating ˜
Π(B). Recall that the untruncated prior ˜
Π is separable. Consequently,
Π(A1 {τI})˜
Π(A1)˜
Π(B) =
3
Y
i=1
˜
Π(Bi)
We first bound the probability of B1.Note that we can use only the slab part of the prior
to bound this probability. A similar technique was used by Bai et al. (2020) (specifically in
their Equation D.18) and by Roˇckoa and George (2018). Specifically, we have
˜
Π(B1) = ZB1Y
(k,k0)S
0
π(ωk,k0|η)
Y
(k,k0)S
0Z|ω0,k,k0ωk,k0|≤ n
6s
0c1p
π(ωk,k0|η)k,k0
ηs
0Y
(k,k0)S
0Z|ω0,k,k0ωk,k0|≤ n
6s
0c1p
ξ1
2exp(ξ1|ωk,k0|)k,k0
ηs
0exp(ξ1X
(k,k0)S
0
|ω0,k,k0|)Y
(k,k0)S
0Z|ω0,k,k0ωk,k0|≤ n
6s
0c1p
ξ1
2exp(ξ1|ω0,k,k0ωk,k0|)k,k0
=ηs
0exp(ξ1||0,S
0||1)Y
(k,k0)S
0Z||≤ n
6s
0c1p
ξ1
2exp(ξ1||)d
ηs
0exp(ξ1||0,S
0||1)eξ1n
6c1s
0pξ1n
6s
0c1ps
0
The first inequality holds because the fact that |ω0,k,k0ωk,k0| n/(6s
0c1p) implies that
the sum less than n/(6c1p).The last inequality is a special case of Equation D.18 of Bai
et al. (2020).
For B2,we derive the lower bound using the spike component of the prior. To this end, let
Q=q(q1)/2 denote the number of off-diagonal entries of matrix Ω. We have
˜
Π(B2) = ZB2Y
(k,k0)(S
0)c
π(ωk,k0|η)
Y
(k,k0)(S
0)cZ|ωk,k0|≤ n
6(Qs
0)c1p
π(ωk,k0|η)
(1 η)Qs
0Y
(k,k0)(S
0)cZ|ωk,k0|≤ n
6(Qs
0)c1p
ξ0
2exp(ξ0|ωk,k0|)k,k0
(1 η)Qs
0Y
(k,k0)(S
0)c16(Qs
0)c1p
n
Eπ|ωk,k0|
= (1 η)Qs
016(Qs
0)c1p
nξ0Qs
0
&(1 η)Qs
011
Qs
0Qs
0
(1 η)Qs
0
To derive the last two lines, we used an argument similar to the one used by Bai et al. (2020)
to derive Equation D.22. That is, we used the assumption that ξ0max{Q, n, pq}4+b
for some b > 0 to conclude that n/ max{Q, n, pq}1/2+b1.This inequality allows us to
control the Qin the numerator. Since s
0grows slower than Q, we can lower bound the
above function some multiplier of the form (1 η)Qs
0.Thus, for large enough n, we have
6(Qs
0)c1p
nξ06(Qs
0)c1pn
pplog(q)Q2+b
=6c1
plog(q)
Qs0
Q2
n
Qb
Qs0
Q2
1
Qs0
The event B3only involves diagonal entries. The untruncated prior mass can be directly
bounded using the exponential distribution
˜
Π(B3) = ZB3
q
Y
k=1
π(ωk,k)
q
Y
k=1 Z|ω0,k,kωk ,k|≤ n
3qc1p
π(ωk,k)k,k
=
q
Y
k=1 Zω0,k,k+n
3qc1p
ω0,k,kn
3qc1p
ξexp(ξωk,k)k,k
q
Y
k=1 Zω0,k,k+n
3qc1p
ω0,k,k
ξexp(ξωk,k)k,k
= exp(ξ
q
X
i=1
ω0,k,k)Zn
3qc1p
0
ξexp(ξωk,k)k,k
exp(ξ
q
X
i=1
ω0,k,k)eξn
3c1qpξn
3qc1pq
Now we are ready to show that the log prior mass on Bcan be bounded by some C1n2
n. To
this end, consider the negative log probability
log(Π(A1 {τ I}))
3
X
i=1 log( ˜
Π(Bi))
.s
0log(η) + ξ1||0,S
0||1+ξ1n
6c1ps
0log ξ1n
6s
0c1p(Qs
0) log(1 η)
+ξX
k
ω0,k,k +ξn
3c1pqlog ξn
3qc1p
=log ηs
0(1 η)Qs
0+ξ1||0,S
0||1+ξ1n
6c1p+ξX
k
ω0,k,k +ξn
3c1p
s
0log ξ1n
6s
0c1pqlog ξn
3qc1p
The ξ1n
6c1pand ξn
3c1pterms are O(n/p).n2
nwhich goes to infinity. The 4th term is of
order qsince the diagonal entries is controlled by the largest eigenvalue of ,which was
assumed to be bounded.
ξ1||0,S
0||1ξ1s
0sup |ω0,k,k0|
is of order s
0as the entries of ω0,k,k0is controlled.
Without tuning of η, the first term log ηs
0(1 η)Qs
0has order of Q. But since
we assumed 1η
ηmax{Q, pq}2+afor some a > 0, we have K1max{Q, pq}2+a1η
η
K2max{Q, pq}2+a. That is, we have 1/(1+K2max{Q, pq}2+a)η1/(1+K1max{Q, pq}2+a).
We can derive a simple lower bound as
ηs
0(1 η)Qs
0(1 + K2max{Q, pq}2+a)s
0(1 η)Qs
0
(1 + K2max{Q, pq}2+a)s
011
1 + K1max{Q, pq}2+aQs
0
&(1 + K2max{Q, pq}2+a)s
0
The last line is because max{Q, pq}2+agrows faster than Qs
0.Thus (11
1+K1max{Q,pq}2+a)max{Q,pq}−s
0
can be bounded below by some constant.
log ηs
0(1 η)Qs
0.s
0log(1 + K2max{Q, pq}2+a).s
0log(max{Q, pq})
s
0log(max{q, p})max(p, q, s
0) log(max{q, p})
The last two terms can be treated in the same way, using the assumption ξ11/n and
ξ1/max{Q, n}.
s
0log ξ1n
6s
0c1p=s
0log 6s
0c1p
ξ1n
.s
0log n3/2s
0p
pmax{s
0, p, q}log(q)!
s
0log n3/2s
0
.s
0log(q2)
.n2
n
The third line holds because ppmax{s
0, p, q}and log(q)1,which together imply
that p/pmax{s
0, p, q}log(q)1. The fourth line follows from our assumption that
log(n).log(q) because s
0< q2. The last line uses the definition of n.
Finally, we have
qlog ξn
3qc1p=qlog 3qc1p
ξn
.qlog n1/2max{Q, n}qp
pmax{s
0, p, q}log(q)!
qlog n1/2max{Q, n}q
.qlog(q)
.n2
n
S5.1.2 Bounding the conditional probability Π(A2|A1)
To bound Π(A2|A1),we use a very similar strategy as the one above. The difference is that
we now focus on the matrix Ψ.We show that mass on a L1 norm ball serves as a lower bound
similar to that of Ω. To see that, using an argument from Ning et al. (2020), we show that
powers of and 0are bounded in operator norm. Thus the terms 1
n||XΨ1/2
0||2
Fand
2
n||XΨ1/2||2
Fthat appear in the KL condition are bounded by a constant multiplier of
n1||XΨ||2
F.Using the fact that the columns of Xhave norm n, we can found this norm:
||XΨ||Fn
p
X
j=1 ||Ψ,j,.||Fn
p
X
j=1
q
X
k=1 |ψj,k ψ0,j,k|
Thus to bound Π(A2|A1) from below, it suffices to bound Π(P|ψj,k ψ0,j,k | c4n) for
some fixed constant c4>0.
We separate the sum based on whether the true value is 0, similar to our treatment on Ω:
X
ij |ψj,k ψ0,j,k|=X
(j,k)SΨ
0
|ψj,k ψ0,j,k|+X
(j,k)(SΨ
0)c|ψj,k|
Using the same argument as in Ω, we can consider the events whose intersection is a subset
of A2:
B4=
X
(j,k)SΨ
0
|ψj,k ψ0,j,k| c4n
2
B5=
X
(j,k)(SΨ
0)c|ψj,k ψ0,j,k| c4n
2
We have B4 B5 A2.Since the elements of Ψ are a priori independent of each other and
of ,we compute
Π(A2|A1)Π(B4|A1)Π(B5|A1) = Π(B4)Π(B5)
We bound each of these terms using the same argument as in the previous subsection:
Π(B4) = ZB4Y
(j,k)SΨ
0
π(ψj,k|θ)
Y
(j,k)SΨ
0Z|ψj,kψ0,j,k |≤c4n
2sΨ
0
π(ψj,k|θ)j,k
θsΨ
0Y
(j,k)SΨ
0Z|ψj,kψ0,j,k |≤c4n
2sΨ
0
λ1
2exp(λ1|ψj,k|)j,k
θsΨ
0exp(λ1X
(j,k)SΨ
0
|ψ0,j,k|)Y
(j,k)SΨ
0Z|ψj,kψ0,j,k |≤c4n
2sΨ
0
λ1
2exp(λ1|ψj,k ψ0,j,k|)j,k
=θsΨ
0exp(λ1X
(j,k)SΨ
0
|ψ0,j,k|)Y
(j,k)SΨ
0Z||≤c4n
2sΨ
0
λ1
2exp(λ1||)d
θsΨ
0exp(λ1||Ψ0,SΨ
0||1)ec4λ1n
2sΨ
0c4n
2sΨ
0sΨ
0
Similarly, we have
Π(B5)(1 θ)pqsΨ
012(pq sΨ
0)c4
nλ0pqsΨ
0
&(1 θ)pqsΨ
0
From here we have
log(Π(A2|A1)) log(Π(B4)) log(Π(B5))
=log(θsΨ
0(1 θ)pqsΨ
0) + λ1||Ψ0,SΨ
0||1+λ1c4n
2s0
Ψlog c4n
2sΨ
0
Since Ψ0has bounded L2 operator norm, we know that the entries of Ψ0are all bounded.
Thusλ1||Ψ0,SΨ
0||1=O(sΨ
0).n2
n. The last two terms are O(n).n2
n.
For the first term, recall that we assumed 1θ
θ(pq)2+bfor some b > 0.That is, there
are constants M3and M4such that M3(pq)2+b1θ
θM4(pq)2+b1θ
θ. Since 1/(1 +
M4(pq)2+b)θ1/(1 + M3(pq)2+b),we compute
θsΨ
0(1 θ)pqsΨ
0(1 + M4(pq)2+b)sΨ
0(1 θ)pqsΨ
0
(1 + M4(pq)2+b)sΨ
011/(1 + M3(pq)2+b)pqsΨ
0
&(1 + M4(pq)2+b)sΨ
0
Note that the last line is due to the fact that (pq)2+bgrows faster than pq sΨ
0.Conse-
quently, the term 11/(1 + M3(pq)2+b)pqsΨ
0can be bounded from below by a constant
not depending on n. Thus,
log θsΨ
0(1 θ)pqsΨ
0.sΨ
0log(1 + M4(pq)2+b).sΨ
0log(pq).sΨ
0max{log(q),log(p)}
For the last term, we use the same argument as we did with Ω.
s0
Ψlog c4n
2sΨ
0=s0
Ψlog 2sΨ
0
c4n
.sΨ
0log n
plog(pq)!
sΨ
0log(n)
.n2
n
S5.2 Test condition
Before checking the test condition, we first show a dimension recovery result by bounding the
prior probability, with the effective dimension defined as the number of entries whose absolute
value exceeds the intersection point of the spike and slab components. We then find a suitable
vectorized L1-norm sieve in the resulting "lower-dimensional" parameter space. We construct
tests based on the supremum of a collection of single-alternative Neyman-Pearson likelihood
ratio tests over subsets of the sieve that are norm balls, and then show that the number of such
subsets needed to cover the sieve can be bounded appropriately.
S5.2.1 Dimension recovery
Unlike Ning et al. (2020), our prior assigns no mass to exactly sparse solutions. Nevertheless,
similar to Ročková and George (2018), we can define a notion of "effective sparsity" and a
generalized dimension. Intuitively, the generalized dimension counts how many coefficients are
drawn from the slab rather than the spike part of the prior. Formally, the generalized inclusion
functions ν_ψ and ν_ω for Ψ and Ω can be defined as

$$
\nu_\psi(\psi_{j,k}) = \mathbb{1}\left(|\psi_{j,k}| > \delta_\psi\right), \qquad \nu_\omega(\omega_{k,k'}) = \mathbb{1}\left(|\omega_{k,k'}| > \delta_\omega\right)
$$

where δ_ψ and δ_ω are the thresholds at which the spike and slab parts have the same density:

$$
\delta_\psi = \frac{1}{\lambda_0 - \lambda_1}\log\left[\frac{1-\theta}{\theta}\cdot\frac{\lambda_0}{\lambda_1}\right], \qquad
\delta_\omega = \frac{1}{\xi_0 - \xi_1}\log\left[\frac{1-\eta}{\eta}\cdot\frac{\xi_0}{\xi_1}\right].
$$

The generalized dimension is then defined as the number of entries that are included:

$$
|\nu(\Psi)| = \sum_{j,k}\nu_\psi(\psi_{j,k}), \qquad |\nu(\Omega)| = \sum_{k > k'}\nu_\omega(\omega_{k,k'}). \tag{S44}
$$

Note that we only count the off-diagonal entries in Ω.
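The thresholds and the generalized dimension above are easy to compute directly; the following Python sketch (with function names of our choosing) simply transcribes the definitions:

```python
import numpy as np

def inclusion_threshold(p0, p1, w):
    """Threshold where the spike and slab Laplace densities intersect:
    delta_psi with (p0, p1, w) = (lambda0, lambda1, theta),
    or delta_omega with (xi0, xi1, eta)."""
    return np.log((1.0 - w) * p0 / (w * p1)) / (p0 - p1)

def effective_dimension(A, threshold, off_diagonal_only=False):
    """Generalized dimension (S44): number of entries exceeding the threshold."""
    mask = np.abs(A) > threshold
    if off_diagonal_only:
        mask = np.triu(mask, k=1)      # only count off-diagonal entries of Omega
    return int(mask.sum())
```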
We are now ready to prove Lemma 1 from the main text. The main idea is to check the
posterior probability directly. Let B^Ψ_n = {Ψ : |ν(Ψ)| < r^Ψ_n} for some r^Ψ_n = C'_3 max{p, q, s^Ψ_0, s^Ω_0}
with C'_3 > C_1 from the KL condition. For Ω, let B^Ω_n = {Ω ⪰ τI : |ν(Ω)| < r^Ω_n} for
r^Ω_n = C'_3 max{p, q, s^Ψ_0, s^Ω_0} with some C'_3 > C_1 from the KL condition. We aim to show that
E_0Π(Ω ∈ (B^Ω_n)^c | Y_1, ..., Y_n) → 0 and E_0Π(Ψ ∈ (B^Ψ_n)^c | Y_1, ..., Y_n) → 0.

The marginal posteriors can be expressed using the log-likelihood ℓ_n:

$$
\begin{aligned}
\Pi(\Psi \in B^{\Psi}_n \mid Y_1, \ldots, Y_n) &= \frac{\int\int_{B^{\Psi}_n}\exp\left(\ell_n(\Psi, \Omega) - \ell_n(\Psi_0, \Omega_0)\right)d\Pi(\Psi)\,d\Pi(\Omega)}{\int\int\exp\left(\ell_n(\Psi, \Omega) - \ell_n(\Psi_0, \Omega_0)\right)d\Pi(\Psi)\,d\Pi(\Omega)} \\
\Pi(\Omega \in B^{\Omega}_n \mid Y_1, \ldots, Y_n) &= \frac{\int\int_{B^{\Omega}_n}\exp\left(\ell_n(\Psi, \Omega) - \ell_n(\Psi_0, \Omega_0)\right)d\Pi(\Psi)\,d\Pi(\Omega)}{\int\int\exp\left(\ell_n(\Psi, \Omega) - \ell_n(\Psi_0, \Omega_0)\right)d\Pi(\Psi)\,d\Pi(\Omega)}
\end{aligned}
\tag{S45}
$$
By using the result of KL condition (Lemma S2), we know the denominators are bounded
from below by eC1n2
nwith large probability. Thus, we focus now on upper bounding the
numerators beginning with Ψ.
Consider the numerator:
E0ZZ(BΨ
n)c
f/f0dΠ(Ψ)dΠ(Ω)=Z ZZ(BΨ
n)c
f/f0dΠ(Ψ)dΠ(Ω)f0dy
=ZZ(BΨ
n)cZfdydΠ(Ψ)dΠ(Ω)
Z(BΨ
n)c
dΠ(Ψ) = Π(|ν(Ψ)| rΨ
n)
We can bound the above display using the fact that when |ψj,k|> δψwe have π(ψj,k)<
2θλ1
2exp(λ1|ψj,k|), this is by definition of the effective dimension:
Π(|ν(Ψ)| rΨ
n)X
|S|>rΨ
n
(2θ)|S|Y
(j,k)SZ|ψj,k |ψ
λ1
2exp(λ1|ψj,k|)j,k Y
(j,k)/SZ|ψj,k|ψ
π(ψj,k)j,k
X
|S|>rΨ
n
(2θ)|S|
Using the assumption on θ, and the fact pq
k(epq/k)k(similar to Bai et al. (2020)’s
equation D.32), we can further upper bound the probability
Π(|ν(Ψ)| rΨ
n)X
|S|>rΨ
n
(2θ)|S|X
|S|>rΨ
n
(2
1 + M4(pq)2+b)|S|
pq
X
k=brΨ
nc+1 pq
k 2
M4(pq)2k
pq
X
k=brΨ
nc+1 2e
M4kpq k
<
pq
X
k=brΨ
nc+1 2e
M4(brΨ
nc+ 1)pq k
.(pq)(brΨ
nc+1)
exp((rΨ
n) log(pq)).
Taking rΨ
n=C0
3max{p, q, sΨ
0, s
0}for some C0
3> C1, we have:
Π(|ν(Ψ)| rΨ
n)exp(C0
3max{p, q, sΨ
0, s
0}log(pq))
Therefore,
E0Π((BΨ
n)c|Y1, . . . , Yn)E0Π((BΨ
n)c|Y1, . . . , Yn)IEn+P0(Ec
n),
where Enis the event in the KL condition. On En, the KL condition ensures that the
denominator in Equation (S45) is lower bounded by exp(C1n2
n) while the denominator is
upper bounded by exp(C0
3max{p, q, sΨ
0, s
0}log(pq))., Since P0(Ec
n) is o(1) per KL condition,
we have the upper bound
E0Π((BΨ
n)c|Y1, . . . , Yn)exp(C1n2
nC0
3max{p, q, sΨ
0, s
0}log(pq)) + o(1) 0
This completes the proof of the dimension recovery result for Ψ.
The workflow for Ω is very similar, except we need to use the upper bound for the graphical
prior in Equation (S34) to properly bound the prior mass.
We upper bound the numerator:
E0ZZ(B
n)c
f/f0dΠ(Ψ)dΠ(Ω)Z(B
n)c
dΠ(Ω) = Π(|ν(Ω)| r
n)exp(2ξQ log(R)) ˜
Π(|ν(Ω)| r
n)
We bound the above display using the fact that when |ωk,k0|> δωwe have π(ωk,k0)<
2ηξ1
2exp(ξ1|ωk,k0|). Note that this follows from the definition of the effective dimension.
We have
˜
Π(|ν(Ω)| r
n)X
|S|>r
n
(2η)|S|Y
(k,k0)SZ|ωk,k0|ω
ξ1
2exp(ξ1|ωk,k0|)k,k0Y
(k,k0)/SZ|ωk,k0|ω
π(ωk,k0)k,k0
X
|S|>r
n
(2η)|S|
By using the assumption on η, and the fact Q
k(eQ/k)k, we can further upper bound the
probability:
˜
Π(|ν(Ω)| r
n)X
|S|>r
n
(2η)|S|X
|S|>r
n
(2
1 + K4max{pq, Q}2+b)|S|
Q
X
k=br
nc+1 Q
k 2
K4max{pq, Q}2k
max{pq,Q}
X
k=br
nc+1 max{pq, Q}
k 2
K4max{pq, Q}2k
max{pq,Q}
X
k=br
nc+1 2e
K4kmax{pq, Q}k
<
max{pq,Q}
X
k=br
nc+1 2e
K4(br
nc+ 1) max{pq, Q}k
.max{pq, Q}(br
nc+1)
exp((r
n) log(max{pq, Q}))
Taking r
n=C0
3max{p, q, sΨ
0, s
0}and C0
3> C1, we have
˜
Π(|ν(Ω)| r
n)exp(C0
3max{p, q, sΨ
0, s
0}log(max{pq, Q})) exp(C3n2
n)
Thus, using the assumption $\xi \lesssim 1/\max\{Q, n\}$, for some $R_0$ not depending on $n$ we have
\[
\Pi(|\nu(\Omega)| \ge r^{\Omega}_n) \le \exp(-C_3 n\epsilon_n^2 + 2\xi Q\log(R)) \le \exp(-C_3 n\epsilon_n^2 + \log(R_0)).
\]
We therefore conclude that
\[
E_0\Pi((B^{\Omega}_n)^c \mid Y_1, \ldots, Y_n) \le E_0\left[\Pi((B^{\Omega}_n)^c \mid Y_1, \ldots, Y_n)\, \mathbb{I}_{E_n}\right] + P_0(E_n^c),
\]
where $E_n$ is the event in the KL condition. On $E_n$, the KL condition ensures that the denominator in Equation (S45) is lower bounded by $\exp(-C_1 n\epsilon_n^2)$, while the numerator is upper bounded by $\exp(-C_3' n\epsilon_n^2 + \log(R_0))$. Since $P_0(E_n^c)$ is $o(1)$ per the KL condition, we conclude
\[
E_0\Pi((B^{\Omega}_n)^c \mid Y_1, \ldots, Y_n) \le \exp\left(C_1 n\epsilon_n^2 - C_3' n\epsilon_n^2 + \log(R_0)\right) + o(1) \to 0.
\]
We pause now to reflect on how dimension recovery helps us establish contraction. Our end goal is to show that the posterior distribution contracts to the true value by first showing that any event on which the average log-affinity exceeds a given $\epsilon > 0$ has $o(1)$ posterior mass. For any such event, we can take a partition based on whether it intersects $B^{\Psi}_n$, $B^{\Omega}_n$, or their complements. Because the complements $(B^{\Psi}_n)^c$ and $(B^{\Omega}_n)^c$ have $o(1)$ posterior mass, the parts of the partition intersecting either complement also have $o(1)$ posterior mass. Thus, we only need to show that events that both exceed the log-affinity threshold $\epsilon > 0$ and recover the low-dimensional structure have $o(1)$ posterior mass. The recovery condition reduces the complexity of the events (in the parameter space) that we need to handle by reducing their effective dimension. We will make use of this low-dimensional structure when checking the test condition.
Formally, for every $\epsilon > 0$, we have
\[
\begin{aligned}
E_0\Pi&\left(\Psi, \Omega \succeq \tau I : \tfrac{1}{n}\textstyle\sum\rho(f_i, f_{0,i}) > \epsilon \,\Big|\, Y_1, \ldots, Y_n\right) \\
&\le E_0\Pi\left(\Psi \in B^{\Psi}_n, \Omega \succeq \tau I : \tfrac{1}{n}\textstyle\sum\rho(f_i, f_{0,i}) > \epsilon \,\Big|\, Y_1, \ldots, Y_n\right) + E_0\Pi((B^{\Psi}_n)^c \mid Y_1, \ldots, Y_n) \\
&\le E_0\Pi\left(\Psi \in B^{\Psi}_n, \Omega \in B^{\Omega}_n : \tfrac{1}{n}\textstyle\sum\rho(f_i, f_{0,i}) > \epsilon \,\Big|\, Y_1, \ldots, Y_n\right)
+ E_0\Pi((B^{\Psi}_n)^c \mid Y_1, \ldots, Y_n) + E_0\Pi((B^{\Omega}_n)^c \mid Y_1, \ldots, Y_n).
\end{aligned}
\]
The last two terms are $o(1)$, as proved above.
S5.2.2 Sieve

As shown in the previous section, we can concentrate on the events with proper dimension recovery, i.e. $\{\Psi \in B^{\Psi}_n, \Omega \in B^{\Omega}_n\}$. To apply Ghosal and van der Vaart (2017)'s general theory of posterior contraction on the event of proper dimension recovery (i.e. to show $E_0\Pi(\Psi \in B^{\Psi}_n, \Omega \in B^{\Omega}_n : \frac{1}{n}\sum\rho(f_i, f_{0,i}) > \epsilon \mid Y_1, \ldots, Y_n) \to 0$), we need to find a sieve that covers enough of the support of the prior. We will show that an L1-norm sieve is sufficient.

Formally, we will show that there exists a sieve $F_n$ such that, for some constant $C_2 > C_1 + 2$:
\[
\Pi(F_n^c) \le \exp(-C_2 n\epsilon_n^2). \tag{S46}
\]
Consider the sieve:
\[
F_n = \left\{\Psi \in B^{\Psi}_n, \Omega \in B^{\Omega}_n : \|\Psi\|_1 \le 2C_3 p,\ \|\Omega\|_1 \le 8C_3 q\right\} \tag{S47}
\]
for some large $C_3 > C_1 + 2 + \log(3)$, where $C_1$ is the constant in the KL condition. We have
\[
\Pi(F_n^c) \le \Pi(\|\Psi\|_1 > 2C_3 p) + \Pi\left((\|\Omega\|_1 > 8C_3 q) \cap \{\Omega \succeq \tau I\}\right).
\]
We upper bound each term similarly to Bai et al. (2020). By using the bound in Equation (S34), we know that
\[
\Pi\left((\|\Omega\|_1 > 8C_3 q) \cap \{\Omega \succeq \tau I\}\right) \le \exp(2\xi Q\log(R))\, \tilde{\Pi}(\|\Omega\|_1 > 8C_3 q).
\]
Since $\|\Omega\|_1 = 2\sum_{k > k'}|\omega_{k,k'}| + \sum_k |\omega_{k,k}|$, at least one of $2\sum_{k > k'}|\omega_{k,k'}|$ and $\sum_k |\omega_{k,k}|$ must exceed $8C_3 q/2$ whenever $\|\Omega\|_1 > 8C_3 q$. Thus, we can form an upper bound on the L1-norm probability
\[
\tilde{\Pi}(\|\Omega\|_1 > 8C_3 q) \le \tilde{\Pi}\left(\sum_{k > k'}|\omega_{k,k'}| > \frac{8C_3 q}{4}\right) + \tilde{\Pi}\left(\sum_{k}|\omega_{k,k}| > \frac{8C_3 q}{2}\right).
\]
To get an upper bound under $\tilde{\Pi}$, we can act as if all $\omega_{k,k'}$'s were drawn from the slab distribution. In that setting, $\sum_{k > k'}|\omega_{k,k'}|$ is Gamma distributed with shape parameter $Q$ and rate parameter $\xi_1$. By using an appropriate tail probability for the Gamma distribution (Boucheron et al. (2013), pp. 29) and the fact that $1 + x - \sqrt{1 + 2x} \ge (x-1)/2$, we compute
\[
\exp(2\xi Q\log(R))\, \tilde{\Pi}\left(\sum_{k > k'}|\omega_{k,k'}| > \frac{8C_3 q}{4}\right)
\le \exp\left[-Q\left(1 - \sqrt{1 + 2\,\frac{8C_3 q}{4\xi_1 Q}} + \frac{8C_3 q}{4\xi_1 Q}\right) + 2Q\log(R)\right]
\le \exp\left(-\frac{8C_3 q}{8\xi_1} + \frac{5}{2}Q\log(R)\right).
\]
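For completeness, the elementary inequality used in the display above holds for every $x \ge 0$:
\[
1 + x - \sqrt{1 + 2x} \ge \frac{x - 1}{2}
\;\Longleftrightarrow\;
\frac{x + 3}{2} \ge \sqrt{1 + 2x}
\;\Longleftrightarrow\;
\frac{(x + 3)^2}{4} - (1 + 2x) = \frac{(x - 1)^2 + 4}{4} \ge 0.
\]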
Since we have assumed $\xi_1 \lesssim 1/n$, for sufficiently large $n$ we have $n\epsilon_n^2 \gtrsim q\log(q)$. Consequently, $qn\epsilon_n^2 \gtrsim Q\log(q)$ and $Q = o(qn\epsilon_n^2)$, and we see that
\[
\frac{8C_3 q}{8\xi_1} - \frac{5}{2}Q\log(R)
\ge C_3(nq) - \frac{5}{2}Q\log(R)
\ge C_3(qn\epsilon_n^2) - \frac{5}{2}Q\log(R)
= C_3(qn\epsilon_n^2) - o(qn\epsilon_n^2)
\ge C_3 n\epsilon_n^2.
\]
The first-order term in $Q$ on the left-hand side can be ignored when $n$ is large, as the left-hand side is dominated by the term of order $Q\log(q)$. Note that we used the assumption that $\epsilon_n \to 0$. We further have
\[
\exp(2\xi Q\log(R))\, \tilde{\Pi}\left(\sum_{k > k'}|\omega_{k,k'}| > \frac{8C_3 q}{4}\right) \le \exp(-C_3 n\epsilon_n^2).
\]
For the diagonal, the sum follows a Gamma distribution with shape $q$ and rate $\xi$. We obtain a similar bound:
\[
\exp(2\xi Q\log(R))\, \tilde{\Pi}\left(\sum_{k}|\omega_{k,k}| > \frac{8C_3 q}{2}\right)
\le \exp(2Q\log(R))\, \exp\left[-q\left(1 - \sqrt{1 + 2\,\frac{8C_3 q}{2\xi q}} + \frac{8C_3 q}{2\xi q}\right)\right]
\le \exp\left(-\frac{8C_3 q}{4\xi} + Q\left(2 + \frac{q}{2Q}\right)\log(R)\right).
\]
Using the same argument as before and the fact that $\xi \lesssim 1/\max\{Q, n\}$, we have
\[
\frac{8C_3 q}{4\xi} - Q\left(2 + \frac{q}{2Q}\right)\log(R)
\ge 2C_3(\max\{Q, n\}\, q) - Q\left(2 + \frac{q}{2Q}\right)\log(R)
\ge C_3 qn\epsilon_n^2 - o(qn\epsilon_n^2)
\ge C_3 n\epsilon_n^2.
\]
The first-order term in $Q$ on the left-hand side can again be ignored when $n$ is large, as the left-hand side is dominated by the term of order $Q\log(q)$ and $q/Q \to 0$.
By combining the above results, we have:
\[
\begin{aligned}
\Pi\left((\|\Omega\|_1 > 8C_3 q) \cap \{\Omega \succeq \tau I\}\right)
&\le \exp(2\xi Q\log(R))\, \tilde{\Pi}(\|\Omega\|_1 > 8C_3 q) \\
&\le \exp(2\xi Q\log(R))\, \tilde{\Pi}\left(\sum_{k > k'}|\omega_{k,k'}| > \frac{8C_3 q}{4}\right)
+ \exp(2\xi Q\log(R))\, \tilde{\Pi}\left(\sum_{k}|\omega_{k,k}| > \frac{8C_3 q}{2}\right) \\
&\le 2\exp(-C_3 n\epsilon_n^2).
\end{aligned}
\tag{S48}
\]
The probability of $\|\Psi\|_1 > 2C_3 p$ can be bounded by the tail probability of a Gamma distribution with shape parameter $pq$ and rate parameter $\lambda_1$:
\[
\Pi(\|\Psi\|_1 > 2C_3 p)
\le \exp\left[-pq\left(1 - \sqrt{1 + 2\,\frac{2C_3 p}{pq\lambda_1}} + \frac{2C_3 p}{pq\lambda_1}\right)\right]
\le \exp\left[-pq\left(\frac{2C_3 p}{2pq\lambda_1} - \frac{1}{2}\right)\right]
\le \exp\left(-\frac{2C_3 p}{2\lambda_1} + \frac{pq}{2}\right).
\]
Using the same argument, we have $pn \ge pn\epsilon_n^2 \gtrsim pq\log(q)$, and thus $pq = o(pn\epsilon_n^2)$ for large $n$. Consequently,
\[
\exp\left(-\frac{2C_3 p}{2\lambda_1} + \frac{pq}{2}\right) \le \exp\left(-C_3\, pn\epsilon_n^2 + o(pn\epsilon_n^2)\right) \le \exp(-C_3 n\epsilon_n^2)
\]
and
\[
\Pi(\|\Psi\|_1 > 2C_3 p) \le \exp(-C_3 n\epsilon_n^2). \tag{S49}
\]
By combining the results from Equations (S48) and (S49), we conclude
\[
\Pi(F_n^c) \le 3\exp(-C_3 n\epsilon_n^2) = \exp(-C_3 n\epsilon_n^2 + \log(3)).
\]
With our choice of $C_3$, the above probability is asymptotically bounded from above by $\exp(-C_2 n\epsilon_n^2)$ for some $C_2 \ge C_1 + 2$.
S5.2.3 Tests around a representative point

To apply the general theory, we need to construct tests $\varphi_n$ such that, for some $M_2 > C_1 + 1$:
\[
E_{f_0}\varphi_n \lesssim e^{-M_2 n\epsilon_n^2/2},
\qquad
\sup_{f \in F_n : \rho(f_0, f) > M_2 n\epsilon_n^2} E_f(1 - \varphi_n) \lesssim e^{-M_2 n\epsilon_n^2},
\tag{S50}
\]
where $f = \prod_{i=1}^n N(X_i\Psi\Omega^{-1}, \Omega^{-1})$ while $f_0 = \prod_{i=1}^n N(X_i\Psi_0\Omega_0^{-1}, \Omega_0^{-1})$.

Instead of directly constructing $\varphi_n$ on the whole sieve, we use a method similar to Ning et al. (2020). That is, we construct tests against a representative point and show that these tests work well in a neighborhood of the representative point. We then take the supremum of these tests and show that the number of pieces needed to cover the entire sieve can be appropriately bounded.

For a representative point $f_1$, consider the Neyman-Pearson test for the single-point alternative $H_0: f = f_0$, $H_1: f = f_1$, namely $\phi_n = \mathbb{I}\{f_1/f_0 \ge 1\}$. If the average half-order Rényi divergence satisfies $-n^{-1}\log\left(\int\sqrt{f_0 f_1}\right) \ge \epsilon^2$, we will have:
\[
E_{f_0}(\phi_n) \le \int_{f_1 > f_0}\sqrt{f_1/f_0}\, f_0 \le \int\sqrt{f_1 f_0} \le e^{-n\epsilon^2},
\qquad
E_{f_1}(1 - \phi_n) \le \int_{f_0 > f_1}\sqrt{f_0/f_1}\, f_1 \le \int\sqrt{f_0 f_1} \le e^{-n\epsilon^2}.
\]
By Cauchy-Schwarz, for any alternative $f$ we can control the Type II error rate:
\[
E_f(1 - \phi_n) \le \left\{E_{f_1}(1 - \phi_n)\right\}^{1/2}\left\{E_{f_1}(f/f_1)^2\right\}^{1/2}.
\]
So long as the second factor grows at most like $e^{cn\epsilon^2}$ for a properly chosen small $c$, the full expression can be controlled. Thus we consider a neighborhood around the representative point that is small enough for the second factor to be bounded.
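Spelling out the Cauchy-Schwarz step: since $\phi_n$ is an indicator, $(1 - \phi_n)^2 = 1 - \phi_n$, and therefore
\[
E_f(1 - \phi_n) = \int (1 - \phi_n)\,\frac{f}{f_1}\, f_1\, dy
\le \left(\int (1 - \phi_n)^2 f_1\, dy\right)^{1/2}\left(\int \left(\frac{f}{f_1}\right)^2 f_1\, dy\right)^{1/2}
= \left\{E_{f_1}(1 - \phi_n)\right\}^{1/2}\left\{E_{f_1}(f/f_1)^2\right\}^{1/2}.
\]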
Consider every density with parameters satisfying
\[
\begin{aligned}
&|||\Omega|||_2 \le \|\Omega\|_1 \le 8C_3 q, \\
&\|\Psi_1 - \Psi\|_2 \le \|\Psi_1 - \Psi\|_1 \le \frac{1}{2C_3 np}, \\
&|||\Omega_1 - \Omega|||_2 \le \|\Omega_1 - \Omega\|_1 \le \frac{1}{8C_3 n\max\{p, q\}^{3/2}} \le \frac{1}{8C_3 nq^{3/2}}.
\end{aligned}
\tag{S51}
\]
We show that $E_{f_1}(f/f_1)^2$ is bounded on the above set when the parameters come from the sieve $F_n$.

Similar to Ning et al. (2020), denote $\Sigma_1 = \Omega_1^{-1}$ and $\Sigma = \Omega^{-1}$, as well as $\Sigma^\star_1 = \Omega^{1/2}\Sigma_1\Omega^{1/2}$, and let $\Delta_\Psi = \Psi - \Psi_1$ while $\Delta_\Omega = \Omega - \Omega_1$. Using the observation $\Psi\Omega^{-1} - \Psi_1\Omega_1^{-1} = (\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\Omega^{-1}$, we have
\[
\begin{aligned}
E_{f_1}(f/f_1)^2 &= |\Sigma^\star_1|^{-n/2}\, |2I - \Sigma^{\star\,-1}_1|^{-n/2}
\times \exp\left(\sum_{i=1}^n X_i(\Psi\Omega^{-1} - \Psi_1\Omega_1^{-1})\,\Omega^{1/2}(2\Sigma^\star_1 - I)^{-1}\Omega^{1/2}\,(\Psi\Omega^{-1} - \Psi_1\Omega_1^{-1})^\top X_i^\top\right) \\
&= |\Sigma^\star_1|^{-n/2}\, |2I - \Sigma^{\star\,-1}_1|^{-n/2}
\times \exp\left(\sum_{i=1}^n X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\,\Omega^{-1/2}(2\Sigma^\star_1 - I)^{-1}\Omega^{-1/2}\,(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)^\top X_i^\top\right).
\end{aligned}
\tag{S52}
\]
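The matrix identity invoked before the display can be checked directly from the definitions $\Delta_\Psi = \Psi - \Psi_1$ and $\Delta_\Omega = \Omega - \Omega_1$:
\[
(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\Omega^{-1}
= \left(\Psi - \Psi_1 - \Psi_1\Omega_1^{-1}(\Omega - \Omega_1)\right)\Omega^{-1}
= \Psi\Omega^{-1} - \Psi_1\Omega^{-1} - \Psi_1\Omega_1^{-1} + \Psi_1\Omega^{-1}
= \Psi\Omega^{-1} - \Psi_1\Omega_1^{-1}.
\]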
For the first factor we use a similar argument as in Ning et al. (2020) (after their Equation 5.9). Since $\Omega \in F_n$, we have $|||\Omega^{-1}|||_2 \lesssim 1$. The fact that $|||\Omega_1 - \Omega|||_2 \le \delta'_n = 1/(8C_3 nq^{3/2})$ implies
\[
|||\Sigma^\star_1 - I|||_2 \le |||\Omega|||_2\, |||\Omega_1^{-1} - \Omega^{-1}|||_2 \lesssim \delta'_n,
\]
and thus we can bound the spectrum of $\Sigma^\star_1$, i.e. $1 - \delta'_n \le \mathrm{eig}_1(\Sigma^\star_1) \le \cdots \le \mathrm{eig}_q(\Sigma^\star_1) \le 1 + \delta'_n$.
Thus
\[
\begin{aligned}
\left(\frac{1}{|\Sigma^\star_1|\,|2I - \Sigma^{\star\,-1}_1|}\right)^{n/2}
&= \exp\left(-\frac{n}{2}\sum_{i=1}^q\log\big(\mathrm{eig}_i(\Sigma^\star_1)\big) - \frac{n}{2}\sum_{i=1}^q\log\left(2 - \frac{1}{\mathrm{eig}_i(\Sigma^\star_1)}\right)\right) \\
&\le \exp\left(-\frac{nq}{2}\log(1 - \delta'_n) - \frac{nq}{2}\log\left(1 - \frac{\delta'_n}{1 - \delta'_n}\right)\right) \\
&\le \exp\left(\frac{nq}{2}\,\frac{\delta'_n}{1 - \delta'_n} + \frac{nq}{2}\,\frac{\delta'_n}{1 - 2\delta'_n}\right)
\lesssim \exp(nq\delta'_n) \le e.
\end{aligned}
\]
The conversion of the logarithms uses the fact that $1 - x^{-1} \le \log(x) \le x - 1$, and the final bound uses $nq\delta'_n = 1/(8C_3\sqrt{q}) \le 1$.
We can bound the log of the second factor of Equation (S52):
\[
|||\Omega^{-1}|||_2\, |||(2\Sigma^\star_1 - I)^{-1}|||_2 \sum_{i=1}^n \left\|X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\right\|_2^2
\le 2\sum_{i=1}^n \left\|X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\right\|_2^2.
\]
We can further bound the sum on the sieve:
\[
\begin{aligned}
\sum_{i=1}^n \left\|X_i(\Delta_\Psi - \Psi_1\Omega_1^{-1}\Delta_\Omega)\right\|_2^2
&\le 2\sum_{i=1}^n \|X_i\Delta_\Psi\|_2^2 + 2\sum_{i=1}^n \|X_i\Psi_1\Omega_1^{-1}\Delta_\Omega\|_2^2 \\
&\le 2np\,|||\Delta_\Psi|||_2^2 + 2np\,|||\Psi_1|||_2^2\, |||\Omega_1^{-1}|||_2^2\, \|\Delta_\Omega\|_F^2 \\
&\le 2np\,\frac{1}{2C_3 np} + 2np\left(2C_3 p + \frac{1}{2C_3 np}\right)^2 \frac{1}{\tau^2}\,\frac{1}{(8C_3 n\max\{p, q\}^{3/2})^2} \\
&\le 2np\,\frac{1}{2C_3 np} + 2np\cdot 16C_3^2 p^2\, \frac{1}{\tau^2}\,\frac{1}{(8C_3 n\max\{p, q\}^{3/2})^2}
\lesssim 1.
\end{aligned}
\]
We bound the norm of $\Psi_1$ using the triangle inequality: $|||\Psi_1||| \le |||\Psi||| + |||\Psi_1 - \Psi||| \le 2C_3 p + 1/(2C_3 np)$. The first term is $O(1)$ and the second term is $O(1/q)$; by combining these results, we conclude that the second factor of Equation (S52) is bounded.
Thus, following the argument of Ning et al. (2020), the desired test $\varphi_n$ in Equation (S50) can be obtained as the maximum of all the tests $\phi_n$ described above.
S5.2.4 Pieces needed to cover the sieve

From here we can show the contraction in log-affinity $\rho(f, f_0)$. To finish the proof, we check that the number of sets described in Equation (S51) needed to cover the sieve $F_n$, denoted by $N$, can be bounded by $\exp(Cn\epsilon_n^2)$ for some suitable constant $C$.

The number $N$ is called a covering number of $F_n$. A closely related quantity is the packing number, which is defined as the maximum number of disjoint balls centered in a set, and which upper bounds the covering number. Both the covering number and the packing number can be used as measures of the complexity of a given set (Ghosal and van der Vaart, 2017).

The packing number of a set usually depends exponentially on the set's dimension. Because Ning et al. (2020) studied posteriors that place positive probability on exactly sparse parameters, they were able to directly bound the packing number of suitable low-dimensional sets. In our case, which uses an absolutely continuous prior, we instead need to control the packing number of "effectively low-dimensional" spaces.

Lemma S4 provides a sufficient condition under which the complexity (measured by packing number) of a set of "effectively sparse" vectors can be bounded by the complexity of a set of exactly sparse vectors.
Lemma S4 (packing a shallow cylinder in $L^p$). For a set of the form $E = A\times[-\delta, \delta]^{Q-s} \subset \mathbb{R}^Q$, where $A \subset \mathbb{R}^s$ (with $s > 0$ and $Q \ge s + 1$ integers), for $1 \le p < \infty$ and a given $T > 1$, if $\delta < \frac{\epsilon}{2[T(Q-s)]^{1/p}}$, we have the packing number bounds:
\[
D(\epsilon, A, \|\cdot\|_p) \le D(\epsilon, E, \|\cdot\|_p) \le D\left((1 - T^{-1})^{1/p}\epsilon, A, \|\cdot\|_p\right).
\]
Proof. The lower bound is trivial: observe that $A\times\{0\}^{Q-s} \subset E$ and that the packing number of $A\times\{0\}^{Q-s}$ is exactly the packing number of $A$. For the upper bound, we show that for each packing of $E$ we can slice that packing with the $0$-plane to form a packing of $A$ with the same number of balls but a smaller radius (see Figure S7 for an illustration).

We first show that any $L^p$ $\epsilon/2$-ball $B_\theta(\epsilon/2)$ centered in the set $E$ intersects the plane $\mathbb{R}^s\times\{0\}^{Q-s}$. Assume the center is $\theta = (x_1, \ldots, x_Q)$. It suffices to show that the center's distance to the plane is less than the radius of the ball. Since the center is in $E$, we have $|x_i| \le \delta$ for the last $Q - s$ coordinates. Denote the projection of the center onto the plane by $\theta_A = (x_1, \ldots, x_s, 0) \in A\times\{0\}^{Q-s}$. Then the $L^p$ distance from the center to the plane is
\[
\|\theta_A - \theta\|_p^p = \sum_{i=s+1}^{Q}|x_i|^p \le (Q - s)\delta^p < T^{-1}(\epsilon/2)^p.
\]
Next we show that the slice $B_\theta(\epsilon/2)\cap(\mathbb{R}^s\times\{0\}^{Q-s})$ is also a ball, centered at $\theta_A$, in the lower-dimensional plane. It suffices to show that its boundary is a sphere. Take a point $a$ from the intersection of the boundary of $B_\theta(\epsilon/2)$ with $\mathbb{R}^s\times\{0\}^{Q-s}$. The vector from the center to this point can be decomposed into the sum of two orthogonal components, namely the vector from $\theta_A$ to $a$ and the vector from $\theta_A$ to $\theta$, so that
\[
\|a - \theta_A\|_p^p + \|\theta_A - \theta\|_p^p = \|a - \theta\|_p^p = \epsilon^p/2^p,
\]
because $a - \theta_A$ has all zero entries in the last $Q - s$ coordinates and $\theta_A - \theta$ has all zero entries in the first $s$ coordinates. Thus any such point has a fixed distance to $\theta_A$, the projection of the center $\theta$ onto the plane of $A$. Notice that
\[
\|a - \theta_A\|_p^p = \epsilon^p/2^p - \|\theta_A - \theta\|_p^p,
\]
which is fixed. Thus the collection of such points $a$ forms a sphere in $A$'s plane.

From here, we can also lower bound the radius of the slice by $(1 - T^{-1})^{1/p}\epsilon/2$: since $\|\theta_A - \theta\|_p^p < T^{-1}(\epsilon/2)^p$, the radius satisfies $\|a - \theta_A\|_p > (1 - T^{-1})^{1/p}\epsilon/2$. Thus the smaller ball must lie within the slice, i.e.
\[
B_{\theta_A}\left((1 - T^{-1})^{1/p}\epsilon/2\right)\times\{0\}^{Q-s} \subset B_\theta(\epsilon/2)\cap(\mathbb{R}^s\times\{0\}^{Q-s}) \subset B_\theta(\epsilon/2). \tag{S53}
\]
That is, any $\epsilon/2$-ball centered in $E$ contains a corresponding $(1 - T^{-1})^{1/p}\epsilon/2$ lower-dimensional ball centered in $A$. With the above observations in hand, we can now prove the inequality by contradiction.

Suppose we have a packing of $E$, $\{\theta_1, \ldots, \theta_D\}$, where $D$ is larger than the packing number of $A$ appearing in the main result. By Equation (S53), the lower-dimensional balls $B_{\theta_{iA}}\left((1 - T^{-1})^{1/p}\epsilon/2\right)$ must also be disjoint. Since the centers $\theta_{iA} \in A$, these balls form a packing of $A$ with radius $\epsilon' = (1 - T^{-1})^{1/p}\epsilon$. That is, we would have found a packing with more balls than the packing number, yielding the desired contradiction. Thus we must have
\[
D \le D\left((1 - T^{-1})^{1/p}\epsilon, A, \|\cdot\|_p\right).
\]
Figure S7: A schematic of the argument used in the proof of the packing number lemma. We show two disjoint unit L1 balls (red), centered at $(0.8, 0, 0.5)$ and $(-0.3, -1, -0.2)$, both within $A\times[-0.5, 0.5]$ (with $A = [-1, 1]\times[-1, 1]$ shown as the middle plane). Their slices in the $z = 0$ plane (blue) also form L1 balls in $\mathbb{R}^2$ whose radii are lower bounded and whose centers lie within $A$, thus inducing a packing of the lower-dimensional set.
Now we can bound the logarithm of the covering number, $\log(N)$, similarly to Ning et al. (2020):
\[
\log(N) \le \log N\left(\frac{1}{2C_3 np}, \{\Psi \in B^{\Psi}_n : \|\Psi\|_1 \le 2C_3 p\}, \|\cdot\|_1\right)
+ \log N\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{\Omega \in B^{\Omega}_n : \|\Omega\|_1 \le 8C_3 q\}, \|\cdot\|_1\right).
\]
The two terms above can be treated in a similar way. Denote $\max\{p, q, s^{\Psi}_0, s^{\Omega}_0\} = s^\star$. There are multiple ways to allocate the effective zeros, which introduces the binomial coefficients
below:
\[
\begin{aligned}
&N\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{\Omega \in B^{\Omega}_n : \|\Omega\|_1 \le 8C_3 q\}, \|\cdot\|_1\right) \\
&\qquad\le \binom{Q}{C_3' s^\star}\, N\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{V \in \mathbb{R}^{Q+q} : |v_i| < \delta_\omega \text{ for } 1 \le i \le Q + q - C_3' s^\star,\ \|V\|_1 \le 8C_3 q\}, \|\cdot\|_1\right), \\
&N\left(\frac{1}{2C_3 np}, \{\Psi \in B^{\Psi}_n : \|\Psi\|_1 \le 2C_3 p\}, \|\cdot\|_1\right) \\
&\qquad\le \binom{pq}{C_3' s^\star}\, N\left(\frac{1}{2C_3 np}, \{V \in \mathbb{R}^{pq} : |v_i| < \delta_\psi \text{ for } 1 \le i \le pq - C_3' s^\star,\ \|V\|_1 \le 2C_3 p\}, \|\cdot\|_1\right).
\end{aligned}
\]
Note that $\Omega$ has $Q + q < 2Q$ free parameters. We first have
\[
\log\binom{Q}{C_3' s^\star} \lesssim s^\star\log(Q) \lesssim n\epsilon_n^2,
\qquad
\log\binom{pq}{C_3' s^\star} \lesssim s^\star\log(pq) \lesssim n\epsilon_n^2.
\]
We further bound the covering number using the result in Lemma S4. Observe that $\{V : |v_i| < \delta_\omega \text{ for } 1 \le i \le Q + q - C_3' s^\star,\ \|V\|_1 \le 8C_3 q\} \subset \{V' : \|V'\|_1 \le 8C_3 q\}\times[-\delta_\omega, \delta_\omega]^{Q + q - C_3' s^\star}$, where $V' \in \mathbb{R}^{C_3' s^\star}$, so that
\[
\begin{aligned}
N&\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{V : |v_i| < \delta_\omega \text{ for } 1 \le i \le Q + q - C_3' s^\star,\ \|V\|_1 \le 8C_3 q\}, \|\cdot\|_1\right) \\
&\le N\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{V' \in \mathbb{R}^{C_3' s^\star} : \|V'\|_1 \le 8C_3 q\}\times[-\delta_\omega, \delta_\omega]^{Q + q - C_3' s^\star}, \|\cdot\|_1\right).
\end{aligned}
\]
We check the condition of Lemma S4 (with $p = 1$ and $T = 2$); by our assumption on $\xi_0$, we have:
\[
(Q + q - C_3' s^\star)\delta_\omega \le 2Q\delta_\omega = \frac{2Q}{\xi_0 - \xi_1}\log\left(\frac{1 - \eta}{\eta}\,\frac{\xi_0}{\xi_1}\right)
\lesssim \frac{Q\log(\max\{p, q, n\})}{\max\{Q, pq, n\}^{4 + b/2}}
\le \frac{1}{\max\{Q, pq, n\}^{3 + b/2}}.
\]
The denominator dominates $C_3 n\max\{p, q\}^{3/2}$; thus for large enough $n$ we have $(Q + q - C_3' s^\star)\delta_\omega \le \frac{1}{32C_3 n\max\{p, q\}^{3/2}}$, and so by Lemma S4 we can control the covering number by the packing number:
\[
\begin{aligned}
\log N&\left(\frac{1}{8C_3 n\max\{p, q\}^{3/2}}, \{V : |v_i| < \delta_\omega \text{ for } 1 \le i \le Q + q - C_3' s^\star,\ \|V\|_1 \le 8C_3 q\}, \|\cdot\|_1\right) \\
&\le \log D\left(\frac{1}{16C_3 n\max\{p, q\}^{3/2}}, \{V' \in \mathbb{R}^{C_3' s^\star} : \|V'\|_1 \le 8C_3 q\}, \|\cdot\|_1\right)
\lesssim s^\star\log\left(128C_3^2\, q\, n\max\{p, q\}^{3/2}\right)
\lesssim n\epsilon_n^2.
\end{aligned}
\]
Similarly for $\Psi$,
\[
\begin{aligned}
N&\left(\frac{1}{2C_3 np}, \{V : |v_i| < \delta_\psi \text{ for } 1 \le i \le pq - C_3' s^\star,\ \|V\|_1 \le 2C_3 p\}, \|\cdot\|_1\right) \\
&\le N\left(\frac{1}{2C_3 np}, \{V' \in \mathbb{R}^{C_3' s^\star} : \|V'\|_1 \le 2C_3 p\}\times[-\delta_\psi, \delta_\psi]^{pq - C_3' s^\star}, \|\cdot\|_1\right).
\end{aligned}
\]
We again check the condition of Lemma S4 (again with $p = 1$ and $T = 2$):
\[
(pq - C_3' s^\star)\delta_\psi \le pq\,\delta_\psi = \frac{pq}{\lambda_0 - \lambda_1}\log\left(\frac{1 - \theta}{\theta}\,\frac{\lambda_0}{\lambda_1}\right)
\lesssim \frac{pq\log(\max\{p, q, n\})}{\max\{pq, n\}^{5/2 + b/2}}
\le \frac{1}{\max\{pq, n\}^{3/2 + b/2}}.
\]
The denominator dominates $2C_3 np$; thus for large enough $n$ we have $(pq - C_3' s^\star)\delta_\psi \le \frac{1}{4\cdot 2C_3 np}$. Thus, similarly to $\Omega$, we have:
\[
\begin{aligned}
\log N&\left(\frac{1}{2C_3 np}, \{V : |v_i| < \delta_\psi \text{ for } 1 \le i \le pq - C_3' s^\star,\ \|V\|_1 \le 2C_3 p\}, \|\cdot\|_1\right) \\
&\le \log D\left(\frac{1}{2\cdot 2C_3 np}, \{V' \in \mathbb{R}^{C_3' s^\star} : \|V'\|_1 \le 2C_3 p\}, \|\cdot\|_1\right)
\lesssim s^\star\log\left(4C_3 p\cdot 2C_3 np\right)
\lesssim n\epsilon_n^2.
\end{aligned}
\]
Thus we finally obtain the contraction under log-affinity.
S5.3 From log-affinity to $\Omega$ and $X\Psi\Omega^{-1}$

In this section we show the main result, Theorem 1, using the contraction under log-affinity. Denoting $\Psi - \Psi_0 = \Delta_\Psi$ and $\Omega - \Omega_0 = \Delta_\Omega$, the log-affinity $\frac{1}{n}\sum\rho(f_i, f_{0i})$ is
\[
\frac{1}{n}\sum\rho(f_i, f_{0i})
= -\log\frac{|\Omega^{-1}|^{1/4}\,|\Omega_0^{-1}|^{1/4}}{|(\Omega^{-1} + \Omega_0^{-1})/2|^{1/2}}
+ \frac{1}{8n}\sum X_i(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\left(\frac{\Omega^{-1} + \Omega_0^{-1}}{2}\right)^{-1}(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})^\top X_i^\top.
\]
Thus $\sum\rho(f_i, f_{0i}) \lesssim n\epsilon_n^2$ implies
\[
\begin{aligned}
-\log\frac{|\Omega^{-1}|^{1/4}\,|\Omega_0^{-1}|^{1/4}}{|(\Omega^{-1} + \Omega_0^{-1})/2|^{1/2}} &\lesssim \epsilon_n^2, \\
\frac{1}{8n}\sum X_i(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\left(\frac{\Omega^{-1} + \Omega_0^{-1}}{2}\right)^{-1}(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})^\top X_i^\top &\lesssim \epsilon_n^2.
\end{aligned}
\tag{S54}
\]
This is almost the same as Ning et al. (2020)'s Equations 5.11-5.12. We can directly apply the result from Ning et al. (2020)'s Equation 5.11, as it is the same as the first equation in Equation (S54). Because $\Psi_0$ and $\Omega^{-1}$ have bounded operator norms and because $\Delta_\Omega$ can be controlled, the cross-term is also controlled by $\epsilon_n$. The first part of Equation (S54) implies
\[
\|\Omega^{-1} - \Omega_0^{-1}\|_F^2 \lesssim \epsilon_n^2.
\]
Meanwhile, $\|\Omega^{-1} - \Omega_0^{-1}\|_F^2 \lesssim \epsilon_n^2$ implies that, for large enough $n$, $\Omega$'s L2 operator norm is bounded (since we assume bounds on $\Omega_0^{-1}$'s operator norm, and the difference cannot have very large eigenvalues that would push the sum toward a zero eigenvalue). Using the result $\|AB\|_F \le |||A|||_2\, \|B\|_F$, observing that $\Omega - \Omega_0 = \Omega(\Omega_0^{-1} - \Omega^{-1})\Omega_0$, and using the assumption that $\Omega_0$ has bounded L2 operator norm, we conclude that (S54) implies $\|\Omega - \Omega_0\|_F \lesssim \epsilon_n$.
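For concreteness, this step can be written as a single chain, applying the stated sub-multiplicativity fact once on each side:
\[
\|\Omega - \Omega_0\|_F = \|\Omega(\Omega_0^{-1} - \Omega^{-1})\Omega_0\|_F
\le |||\Omega|||_2\, |||\Omega_0|||_2\, \|\Omega_0^{-1} - \Omega^{-1}\|_F
\lesssim \epsilon_n.
\]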
Since $|||\Omega^{-1}|||_2$ is bounded for large enough $n$, we can directly apply an argument from Ning et al. (2020) (specifically the argument around their Equation 5.12) to conclude that the second part of (S54) implies:
\[
\epsilon_n^2 \gtrsim \frac{1}{8n}\sum\left\|X_i(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\right\|_2^2\, \Big|\Big|\Big|\frac{\Omega^{-1} + \Omega_0^{-1}}{2}\Big|\Big|\Big|_2^{-1}
\gtrsim \frac{1}{n}\sum\left\|X_i(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\right\|_2^2\Big/\sqrt{\epsilon_n^2 + 1}.
\]
Combining all of these results yields the desired result.
S5.4 Contraction of $\Psi$

Contraction of $\Psi$ requires more assumptions on the design matrix $X$. Similar to Ročková and George (2018) and Ning et al. (2020), we introduce the restricted eigenvalue
\[
\phi^2(\tilde{s}) = \inf\left\{\frac{\|XA\|_F^2}{n\|A\|_F^2} : 0 < |\nu(A)| \le \tilde{s}\right\}.
\]
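Spelled out, the definition is used below in the following direction: for any matrix $A$ whose effective dimension satisfies $0 < |\nu(A)| \le \tilde{s}$,
\[
\|XA\|_F^2 \ge n\,\phi^2(\tilde{s})\,\|A\|_F^2.
\]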
With this definition,
\[
\|X(\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1})\|_F^2 \lesssim n\epsilon_n^2,
\qquad
\|\Omega - \Omega_0\|_F^2 \lesssim \epsilon_n^2
\]
implies the result in Equation (15) of the main text. Namely,
\[
\|\Psi\Omega^{-1} - \Psi_0\Omega_0^{-1}\|_F^2 = \|(\Delta_\Psi - \Psi_0\Omega_0^{-1}\Delta_\Omega)\Omega^{-1}\|_F^2 \lesssim \frac{\epsilon_n^2}{\phi^2(s^{\Psi}_0 + C_3' s^\star)}.
\]
Since both $\Omega$ and $\Omega^{-1}$ have bounded operator norm when $\|\Omega - \Omega_0\|_F^2 \lesssim \epsilon_n^2$, for large enough $n$ we must have:
\[
\|\Delta_\Psi\|_F - \|\Psi_0\Omega_0^{-1}\Delta_\Omega\|_F \le \|\Delta_\Psi - \Psi_0\Omega_0^{-1}\Delta_\Omega\|_F \lesssim \epsilon_n\Big/\sqrt{\phi^2(s^{\Psi}_0 + C_3' s^\star)}.
\]
Since $\Psi_0$ and $\Omega_0^{-1}$ have bounded operator norm, $\|\Psi_0\Omega_0^{-1}\Delta_\Omega\|_F \lesssim \epsilon_n$, and we must have:
\[
\|\Delta_\Psi\|_F \lesssim \epsilon_n\Big/\sqrt{\min\{\phi^2(s^{\Psi}_0 + C_3' s^\star),\, 1\}}.
\]
Thus we can conclude
\[
\sup_{\Psi_0 \in \mathcal{T}_0,\ \Omega_0 \in \mathcal{H}_0} E_0\Pi\left(\|\Psi - \Psi_0\|_F^2 \ge \frac{M'\epsilon_n^2}{\min\{\phi^2(s^{\Psi}_0 + C_3' s^\star),\, 1\}} \,\Big|\, Y_1, \ldots, Y_n\right) \to 0.
\]