Leveraging graphical model techniques to study evolution on phylogenetic networks

May 2024

License
CC BY 4.0

Authors:

Cécile Ané

University of Wisconsin–Madison

Figure S3: Boxplots (with means as points) showing the distribution of cluster sizes in the join-graph structuring cluster graph U * and in the clique tree U from Fig. 7. The factor graph has clusters of size between 1 and 3 (not displayed). The time for 100 iterations (defined in Fig. 7) was benchmarked over 20 replicates on a MacBook Pro M2 2022, and divided by the number of messages per 100 iterations to obtain an estimate of the mean time per belief update (vertical axis).

LEVERAGING GRAPHICAL MODEL TECHNIQUES TO STUDY

EVOLUTION ON PHYLOGENETIC NETWORKS

Benjamin Teo

Department of Statistics

University of Wisconsin-Madison

Paul Bastide

IMAG, Université de Montpellier,

CNRS

Cécile Ané

Departments of Statistics and of Botany

University of Wisconsin-Madison

ABSTRACT

The evolution of molecular and phenotypic traits is commonly modelled using Markov processes along a rooted phylogeny. This phylogeny can be a tree, or a network if it includes reticulations, representing events such as hybridization or admixture. Computing the likelihood of data observed at the leaves is costly as the size and complexity of the phylogeny grows. Efficient algorithms exist for trees, but cannot be applied to networks. We show that a vast array of models for trait evolution along phylogenetic networks can be reformulated as graphical models, for which efficient belief propagation algorithms exist. We provide a brief review of belief propagation on general graphical models, then focus on linear Gaussian models for continuous traits. We show how belief propagation techniques can be applied for exact or approximate (but more scalable) likelihood and gradient calculations, and prove novel results for efficient parameter inference of some models. We highlight the possible fruitful interactions between graphical models and phylogenetic methods. For example, approximate likelihood approaches have the potential to greatly reduce computational costs for phylogenies with reticulations.

Keywords belief propagation, cluster graph, admixture graph, trait evolution, Brownian motion, linear Gaussian

Contents

1 Introduction

2 Complexity of the phylogenetic likelihood calculation

2.1 The pruning algorithm

2.2 Continuous traits on trees: the lazy way

2.3 BP for continuous traits on trees

2.4 From trees to networks

2.5 Current network approaches for discrete traits

2.6 Current network approaches for continuous traits

3 Continuous trait evolution on a phylogenetic network

3.1 Linear Gaussian models

3.2 Evolutionary models along one lineage

3.3 Evolutionary models at reticulations

3.4 Evolutionary models with interacting populations

arXiv:2405.09327v1 [q-bio.PE] 15 May 2024

4 A short review of graphical models and belief propagation

4.1 Graphical models

4.2 Phylogenetic examples of graphical models

4.3 Belief Propagation

4.4 BP for Gaussian models

5 Scalable approximate inference with loopy BP

5.1 Calibration

5.2 Likelihood approximation

5.3 Scalability versus accuracy: choice of cluster graph complexity

6 Leveraging BP for efficient parameter inference

6.1 BP for fast likelihood computation

6.2 BP for fast gradient computation

6.3 BP for direct Bayesian parameter inference

7 Challenges and Extensions

7.1 Degeneracy

7.2 Loopy BP is promising for discrete traits

A Recasting SnappNet as BP

B Bounding the moralized network’s treewidth

C Approximation quality with loopy BP

D Gradient and parameter estimates under the BM

D.1 The homogeneous BM model

D.2 Belief Propagation

D.3 Gradient computation and analytical formula for parameter estimates

D.4 Analytical formula for phylogenetic regression

E Regularizing initial beliefs

F Handling deterministic factors

F.1 Substitution

F.2 Generalized canonical form

1 Introduction

In phylogenetics, data are observed at the leaves of a phylogeny: a directed acyclic graph representing the historical relationships between species, populations or individuals of interest, with branch lengths representing evolutionary time and internal nodes representing divergence (e.g. speciation) or merging (e.g. introgression) events. Stochastic processes are used to model the evolution of traits over time along this phylogeny. In this work, we consider traits that may be multivariate, discrete and/or continuous, with a focus on continuous traits. These models are used to infer evolutionary dynamics and historical correlation between traits, predict unobserved traits at ancestral nodes or extant leaves, or estimate phylogenies from rich data sets.

Calculating the likelihood is no easy task because the traits at ancestral nodes are unobserved and need to be integrated out. This problem is very well studied for phylogenetic trees, with efficient solutions for both discrete and continuous traits. Admixture graphs and phylogenetic networks with reticulations are now gaining traction due to growing empirical evidence for gene flow, hybridization and admixture. Yet many methods and tools for these networks could be improved towards more efficient likelihood calculations.

The vast majority of evolutionary models used in phylogenetics make a Markov assumption, in that the trait distribution at all nodes (observed at the tips and unobserved at internal nodes) can be expressed by a set of local models. At the root, this model describes the prior distribution of the ancestral trait. For each node in the phylogeny, a local transition model describes the trait distribution at this node conditional on the trait(s) at its parent node(s). As each local model can be specified individually with its own set of parameters, the overall evolutionary model can be very flexible, including possible shifts in rates, constraints, and mode of evolution across different clades. Other models do not make a Markov assumption, such as models that combine a backwards-in-time coalescent process for gene trees and a forward-in-time mutation process along gene trees. We show here that some of these models can still be expressed as a product of local conditional distributions, over a graph that is more complex than the initial phylogeny.

These evolutionary models are special cases of graphical models, also known as Bayesian networks, which have been heavily studied. The task of calculating the likelihood of the observed data has received a lot of attention, including algorithms for efficient approximations when the network is too complex to calculate the likelihood exactly. Another well-studied task is that of predicting the state of unobserved variables (ancestral states in phylogenetics) conditional on the observed data. We argue here that the field of phylogenetics could greatly benefit from applying and expanding knowledge from graphical models for the study and use of phylogenetic networks.

In section 2 we review the challenge brought by phylogenetic models in which only tip data are observed, and the techniques currently used for efficient likelihood calculations for phylogenetic models on trees and networks. In section 3 we focus on the general Gaussian models for the evolution of a continuous trait, possibly multivariate to capture evolutionary correlations between traits. On reticulate phylogenies, these models need to describe the trait of admixed populations conditional on their parental populations. Turning to graphical models in section 4, we describe their general formulation and show that many phylogenetic models can be expressed as special cases, from known examples (on gene trees) to less obvious examples (using the coalescent process on species trees, or species networks). We then provide a short review of belief propagation, a core technique to perform inference on graphical models, first in its general form and then specialized for continuous traits in linear Gaussian models. In section 5 we describe loopy belief propagation, a technique to perform approximate inference in graphical models when exact inference does not scale. As far as we know, loopy belief propagation has never been used in phylogenetics. Section 6 describes leveraging BP for parameter inference: fast calculations of the likelihood and its gradient can be used in any likelihood-based framework, frequentist or Bayesian. Finally, section 7 discusses future challenges for the application and extension of graphical model techniques in phylogenetics. These techniques offer a range of avenues to expand the phylogeneticists’ toolbox for fitting evolutionary models on phylogenetic networks, from approximate inference methods that are more scalable, to algorithms for fast gradient computation for better parameter inference.

2 Complexity of the phylogenetic likelihood calculation

2.1 The pruning algorithm

Felsenstein’s pruning algorithm Felsenstein [1973, 1981] launched the era of model-based phylogenetic inference, now rich with complex models to account for a large array of biological processes, including DNA and protein substitution models, variation of their substitution rates across genomic loci, lineages and time, and evolutionary models for continuous traits and geographic distributions. The pruning algorithm provided the key to calculating the likelihood of these models along a phylogenetic tree in a practically feasible way. The basis of this algorithm, which extends to tasks beyond likelihood calculation, was discovered in other areas and given other names, such as the sum-product algorithm, message passing, and belief propagation (BP).

The pruning algorithm, which is a form of BP, computes the full likelihood of all the observed taxa by traversing the phylogenetic tree once, taking advantage of the Markov property, by which the evolution of the trait of interest along a daughter lineage is independent of its past evolution, given knowledge of the parent’s state. The idea is to traverse the tree and calculate the likelihood of the descendant leaves of an ancestral species conditional on its state, from similar likelihoods calculated for each of its children. If the trait is discrete with 4 states for example (as for DNA), then this entails keeping track of 4 likelihood values at each ancestral species. If the trait is continuous with a Gaussian distribution, e.g. from a Brownian motion (BM) or an Ornstein-Uhlenbeck (OU) process Hansen [1997], then the likelihood at an ancestral species is a simple function of its state that can be concisely parametrized by quantities akin to the posterior mean and variance conditional on descendant leaves. Felsenstein’s independent contrasts (IC) Felsenstein [1985] also capture these partial posterior quantities and can be viewed as a special implementation of BP for likelihood calculation.

BP is used ubiquitously for the analysis of discrete traits, such as for DNA substitution models (e.g. in RAxML Stamatakis [2014], IQ-TREE Nguyen et al. [2015], MrBayes Ronquist and Huelsenbeck [2003]) or for discrete morphological traits in comparative methods (e.g. in phytools Revell [2012], BayesTraits Pagel et al. [2004], corHMM Boyko and Beaulieu [2021], Boyko et al. [2023], RevBayes Höhna et al. [2016]). For discrete traits, there is simply no feasible alternative. On a tree with 20 taxa and 19 ancestral species, the naive calculation of the likelihood at a given DNA site would require the calculation and summation of $4^{19}$, or over 274 billion, likelihoods, one for each nucleotide assignment at the 19 ancestral species. This calculation would need to be repeated for each site in the alignment, then repeated all over during the search for a well-fitting phylogenetic tree.
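To make the contrast concrete, the following minimal sketch compares the pruning recursion with the naive sum over all ancestral assignments. It is an illustration only, not code from any of the packages cited above: the 4-taxon tree, its branch lengths, the tip states and the choice of the Jukes-Cantor substitution model with a uniform root distribution are all assumptions made for the example.

```python
import itertools
import math

STATES = "ACGT"

def jc69(t):
    """Jukes-Cantor transition probabilities for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

# Hypothetical rooted tree: child -> (parent, branch length).
PARENT = {"a": ("x", 0.1), "b": ("x", 0.2), "c": ("y", 0.3), "d": ("y", 0.1),
          "x": ("root", 0.05), "y": ("root", 0.15)}
POSTORDER = ["a", "b", "c", "d", "x", "y", "root"]  # children before parents
TIPS = {"a": "A", "b": "A", "c": "G", "d": "T"}      # hypothetical data at one site

def pruning_likelihood():
    """One post-order traversal, keeping 4 partial likelihoods per node."""
    partial = {n: ([float(s == TIPS[n]) for s in STATES] if n in TIPS else [1.0] * 4)
               for n in POSTORDER}
    for node in POSTORDER:
        if node == "root":
            continue
        parent, t = PARENT[node]
        P = jc69(t)
        for i in range(4):
            # Message from node to its parent: sum over the states of node.
            partial[parent][i] *= sum(P[i][j] * partial[node][j] for j in range(4))
    return sum(0.25 * value for value in partial["root"])  # uniform root prior

def brute_force_likelihood():
    """Naive sum over all 4^(number of internal nodes) ancestral assignments."""
    internal = [n for n in POSTORDER if n not in TIPS]
    total = 0.0
    for assignment in itertools.product(range(4), repeat=len(internal)):
        state = dict(zip(internal, assignment))
        state.update({n: STATES.index(s) for n, s in TIPS.items()})
        prob = 0.25
        for child, (parent, t) in PARENT.items():
            prob *= jc69(t)[state[parent]][state[child]]
        total += prob
    return total
```

On this small tree the naive sum has only $4^3 = 64$ terms, so both functions can be run and compared. The number of terms in the naive sum grows exponentially with the number of ancestral nodes, whereas the pruning recursion visits each edge once.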

2.2 Continuous traits on trees: the lazy way

For continuous traits under a Gaussian model (including the Brownian motion), BP is not used as ubiquitously because a multivariate Gaussian distribution can be nicely captured by its mean and covariance matrix: the multivariate Gaussian formula can serve as an alternative. For example, for one trait $Y$ with ancestral state $\mu$ at the root of the phylogeny, the phylogenetic covariance $\Sigma$ between the taxa at the leaves can be obtained from the branch lengths in the tree. Under a BM, the covariance $\mathrm{cov}(Y_i, Y_j)$ between taxa $i$ and $j$ is $\Sigma_{ij} = \sigma^2 t_{ij}$ where $t_{ij}$ is the length between the root and their most recent common ancestor. The likelihood of the observed traits at the $n$ leaves can then be calculated using matrix and vector multiplication techniques as

$$(2\pi)^{-n/2} \det(\Sigma)^{-1/2} \exp\left(-\tfrac{1}{2}(Y-\mu)^\top \Sigma^{-1} (Y-\mu)\right). \tag{1}$$

This alternative to BP has the disadvantage of requiring the inversion of the covariance matrix $\Sigma$, a task whose computing time typically grows as $O(m^3)$ for a matrix of size $m \times m$. It also has the disadvantage that $\Sigma$ needs to be calculated and stored in memory in the first place. For multivariate observations of $p$ traits on each of $n$ taxa, the covariance matrix has size $m = pn$ so the typical calculation cost of (1) is then $O(p^3 n^3)$, which can quickly become very large. For example, with only 30 taxa and 10 traits, $\Sigma$ is a $300 \times 300$ matrix. Studies with large $n$ and/or large $p$ are now frequent, especially from geometric morphometric data with $p$ over 100 typically (e.g. Hedrick [2023]) or with expression data on $p > 1000$ genes easily, that also require more complex models to account for variation (e.g. within species, between organs, between batches) Dunn et al. [2013], Shafer [2019]. Studies with a large number $n$ of taxa are now frequent (e.g. $n > 5{,}000$ in birds and mammals Jetz et al. [2012], Upham et al. [2019]) and virus phylogenies can be massive (e.g. $n > 1000$ and $p = 3$ virulence traits in HIV Hassler et al. [2022a], or $n > 500{,}000$ SARS-CoV-2 strains De Maio et al. [2023]).
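The matrix-based calculation in (1) can be sketched as follows for a univariate BM. This is a minimal illustration under stated assumptions: the 3-taxon tree and its branch lengths are hypothetical, and a hand-written Cholesky factorization stands in for the linear algebra routines that real software would call. The factorization is the step whose cost grows cubically with the matrix size.

```python
import math

# Hypothetical rooted tree: child -> (parent, branch length).
PARENT = {"a": ("x", 1.0), "b": ("x", 1.0), "x": ("root", 0.5), "c": ("root", 1.5)}

def path_to_root(node):
    """Edges between a node and the root, keyed by their child node."""
    edges = {}
    while node in PARENT:
        parent, length = PARENT[node]
        edges[node] = length
        node = parent
    return edges

def bm_covariance(tips, sigma2):
    """Sigma_ij = sigma2 * t_ij, with t_ij the length shared by the two root-to-tip paths."""
    paths = [path_to_root(t) for t in tips]
    return [[sigma2 * sum(l for e, l in pi.items() if e in pj) for pj in paths]
            for pi in paths]

def cholesky(S):
    """Lower-triangular L with L L^T = S; cubic cost in the size of S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def gaussian_loglik(y, mu, S):
    """Logarithm of equation (1) for data y, root state mu and covariance S."""
    n = len(y)
    L = cholesky(S)
    z = []  # forward substitution: solve L z = y - mu
    for i in range(n):
        z.append((y[i] - mu - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + logdet + sum(v * v for v in z))
```

For example, `gaussian_loglik([1.0, -1.0, 2.0], 0.0, bm_covariance(["a", "b", "c"], 2.0))` evaluates the log-likelihood for hypothetical tip values and a variance rate of 2.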

In these cases with large data size, the matrix-based alternative to BP is prone to numerical inaccuracy and numerical instability in addition to the increased computational time, because it is hard to accurately invert a large matrix. Even when the matrix is of moderate size, numerical inaccuracy can arise when the matrix is “ill-conditioned”. These problems were identified under OU models on phylogenetic trees that have closely-related sister taxa, or under early-burst (EB) models with strong morphological diversification early on during the group radiation, and much slowed-down evolution later on Adams and Collyer [2017], Jhwueng and O’Meara [2020], Bartoszek et al. [2023].

For some simple models, the large $np \times np$ covariance matrix can be decomposed as a Kronecker product of a $p \times p$ trait covariance and an $n \times n$ phylogenetic covariance. This decomposition can simplify the complexity of calculating the likelihood. However, this decomposition is not available under many models, such as the multivariate Brownian motion with shifts in the evolutionary rates (e.g. Caetano and Harmon [2019]) or the multivariate Ornstein-Uhlenbeck model with non-scalar rate or selection matrices Bartoszek et al. [2012], Clavel et al. [2015].
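The gain from the Kronecker structure can be made explicit with standard identities. The notation below is an assumption introduced for illustration: $R$ denotes the $p \times p$ trait covariance, $C$ the $n \times n$ phylogenetic covariance, and $D$ the $n \times p$ matrix of deviations of the observations from their means, stacked trait by trait into a vector of length $np$ (the order of the Kronecker factors depends on this stacking convention).

```latex
\begin{align*}
\Sigma &= R \otimes C, \qquad
\Sigma^{-1} = R^{-1} \otimes C^{-1}, \qquad
\det(\Sigma) = \det(R)^{n}\,\det(C)^{p},\\
\log L &= -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log\det(R) - \frac{p}{2}\log\det(C)
          - \frac{1}{2}\operatorname{tr}\!\left(R^{-1} D^\top C^{-1} D\right).
\end{align*}
```

Under these assumptions, only $R$ and $C$ need to be factorized, at a cost of order $p^3 + n^3$ rather than $p^3 n^3$.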

2.3 BP for continuous traits on trees

To bypass the complexity of matrix inversion, Felsenstein pioneered IC to test for phylogenetic correlation between traits, assuming a BM model on a tree Felsenstein [1985]. Many authors then used BP approaches to handle Gaussian models beyond the BM FitzJohn [2012], Freckleton [2012], Cybis et al. [2015], Goolsby et al. [2017]. Notably, Ho and Ané [2014] describe a fast algorithm that can be used for non-Gaussian models as well. Most recently, Mitov et al. [2020] highlighted that BP can be applied to a large class of Gaussian models, including the BM and the OU process with shifts and variation of rates and selection regimes across branches. Software packages that use these fast BP algorithms include phylolm Ho and Ané [2014], Rphylopars Goolsby et al. [2017], BEAST Hassler et al. [2023] or the most recent versions of hOUwie Boyko et al. [2023] and mvSLOUCH Bartoszek et al. [2023].
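As a hedged illustration of how such a post-order recursion avoids the covariance matrix altogether, the sketch below computes the log-likelihood of a univariate BM with a fixed root state on a tree. It is not the algorithm of any specific package cited above. The tree, the function names and the representation of each message as a scaled Gaussian `exp(logc) * N(x; m, s)` are assumptions made for the example, and all branch lengths are assumed positive.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def combine(messages):
    """Multiply the messages exp(logc) * N(x; m, s) sent by the children of a node."""
    logc, m, s = messages[0]
    for logc2, m2, s2 in messages[1:]:
        # N(x; m, s) N(x; m2, s2) = N(m; m2, s + s2) N(x; new mean, new variance)
        logc += logc2 + log_normal_pdf(m, m2, s + s2)
        precision = 1.0 / s + 1.0 / s2
        m = (m / s + m2 / s2) / precision
        s = 1.0 / precision
    return logc, m, s

def message_to_parent(v, children, lengths, tips, sigma2):
    """Likelihood of the tips below v, as a function of the state at the parent of v."""
    if v in tips:
        logc, m, s = 0.0, tips[v], 0.0
    else:
        logc, m, s = combine([message_to_parent(c, children, lengths, tips, sigma2)
                              for c in children[v]])
    return logc, m, s + sigma2 * lengths[v]  # BM variance added along the edge above v

def bm_loglik_bp(root, children, lengths, tips, mu, sigma2):
    """Log-likelihood of the tip data for a root state fixed at mu."""
    logc, m, s = combine([message_to_parent(c, children, lengths, tips, sigma2)
                          for c in children[root]])
    return logc + log_normal_pdf(mu, m, s)

# Hypothetical 3-taxon tree: node -> children, and node -> length of its parent edge.
CHILDREN = {"root": ["x", "c"], "x": ["a", "b"]}
LENGTHS = {"a": 1.0, "b": 1.0, "x": 0.5, "c": 1.5}
```

Each edge is visited once and only three numbers are stored per node, so the cost is linear in the number of taxa and no matrix is inverted.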

All the methods cited above only use the first post-order tree traversal of BP to compute the likelihood. A second preorder traversal allows, in the Gaussian case, for the computation of the distribution of all internal nodes conditionally on the model and on the trait values at the tips. These distributions can then be used for, e.g., ancestral state reconstruction Lartillot [2014], expectation-maximization algorithms for shift detection in the optimal values of an OU Bastide et al. [2018a], or the computation of the gradient of the likelihood in the BM Zhang et al. [2021], Fisher et al. [2021] or general Gaussian model Bastide et al. [2021]. Such BP techniques have also been used for taking gradients of the likelihood with respect to branch lengths in sequence evolution models Ji et al. [2020] or for phylogenetic factor analysis Tolkoff et al. [2018], Hassler et al. [2022b].

2.4 From trees to networks

So far, Felsenstein’s pruning algorithm and related BP approaches have mostly been restricted to phylogenetic trees. There is now ample evidence that reticulation is ubiquitous in all domains of life from biological processes such as lateral gene transfer, hybridization, introgression and gene flow between populations. Networks are recognized to be better than trees for representing the phylogenetic history of species and populations in many groups. Although current studies using networks have few taxa, typically between 10 and 20 (e.g. Nielsen et al. [2023]), they tend to have increasingly more tips as network inference methods become more scalable (e.g. $n = 39$ languages in Neureiter et al. [2022]). As viruses are known to be affected by recombination, we also expect future virus studies to use large network phylogenies Ignatieva et al. [2022], so that BP will become essential for network studies too. In this work, we describe approaches currently used for trait evolution on phylogenetic networks. We argue that the field of evolutionary biology would benefit from applying BP approaches on networks more systematically. Transferring knowledge from the mature and rich literature on BP would advance evolutionary biology research when phylogenetic networks are used.

2.5 Current network approaches for discrete traits

For discrete traits on general networks, very few approaches use BP techniques as far as we know. For DNA data, for example, PhyLiNC Allen-Savietta [2020] and NetRAX Lutteropp et al. [2022] extend the typical tree-based model to general networks, assuming no incomplete lineage sorting. That is, each site is assumed to evolve along one of the trees displayed in the network, chosen according to inheritance probabilities at reticulate edges. PhyLiNC assumes independent (unlinked) sites. NetRAX assumes independent loci, which may have a single site each. Each locus may have its own set of branch lengths and substitution model parameters. Both methods calculate the likelihood of a network by extracting its displayed trees and then applying BP on each tree. Similarly, comparative methods for binary and multi-state traits implemented in PhyloNetworks also extract displayed trees and then apply BP on each displayed tree Karimi et al. [2020]. While these approaches use BP on each displayed tree, a network with $h$ reticulations can have up to $2^h$ displayed trees. This leads to a computational bottleneck when the number of reticulations increases.
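The exponential growth can be seen in a short sketch that enumerates displayed trees by keeping one parent edge at each hybrid node. The network below is hypothetical (three hybrid nodes with two parent edges each, and made-up inheritance probabilities), and the function is an illustration rather than the implementation used by PhyLiNC, NetRAX or PhyloNetworks.

```python
import itertools

# Hypothetical network: hybrid node -> its parent edges as (parent, inheritance probability).
HYBRID_PARENTS = {
    "h1": [("u1", 0.7), ("u2", 0.3)],
    "h2": [("u3", 0.9), ("u4", 0.1)],
    "h3": [("u5", 0.5), ("u6", 0.5)],
}

def displayed_trees(hybrid_parents):
    """Yield (kept parent of each hybrid node, probability of that displayed tree)."""
    hybrids = sorted(hybrid_parents)
    for choice in itertools.product(*(hybrid_parents[h] for h in hybrids)):
        kept = {h: parent for h, (parent, _) in zip(hybrids, choice)}
        weight = 1.0
        for _, gamma in choice:
            weight *= gamma  # product of inheritance probabilities of the kept edges
        yield kept, weight
```

Under the displayed-tree model described above, the likelihood of a site is the sum, over the displayed trees, of the tree probability times the tree likelihood obtained by BP. The loop therefore runs $2^h$ times when each of the $h$ hybrid nodes has two parents.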

BP approaches have also been used for models with incomplete lineage sorting, modelled by the coalescent Kingman [1982]. Notably, SNAPP models the evolution of unlinked biallelic markers along a species tree, accounting for incomplete lineage sorting Bryant et al. [2012]. This method was recently made faster with SNAPPER Stoltz et al. [2020] and extended to phylogenetic networks with SnappNet Rabier et al. [2021]. The coalescent process introduces the challenge that each site may evolve along any tree, depending on past coalescent events. SNAPP introduced a way to bypass the difficulties of handling coalescent histories and hence decrease computation time. After we describe BP for general graphical models, we recast this innovation as BP on a graphical model formulation of the problem.

BP was also used to calculate the likelihood of the joint sample frequency spectrum (SFS). To account for incomplete lineage sorting on a tree, Kamm et al. [2017] use the continuous-time Moran model to reduce computational complexity, and assume that each site undergoes at most one mutation. In momi2, Kamm et al. [2020] extend the approach to phylogenetic networks by assuming a pulse of admixture at reticulations. The associated graphical model is much simpler than that required by SNAPP or SnappNet thanks to the assumption of no recurrent mutation.

2.6 Current network approaches for continuous traits

Compared to the rich toolkit available for the analysis of continuous traits on trees, the toolkit for phylogenetic networks is still limited. PhyloNetworks includes comparative methods on networks Solís-Lemus et al. [2017], implemented in Julia Bezanson et al. [2017]. These methods extend phylogenetic ANOVA to networks, for a continuous response trait predicted by any number of continuous or categorical traits, with residual variation being phylogenetically correlated. So far, the models available in PhyloNetworks include the BM, Pagel’s $\lambda$, possible within-species variation, and shifts at reticulations to model transgressive evolution Bastide et al. [2018b], Teo et al. [2023]. However, all calculations are based on working with the full covariance matrix, without BP. TreeMix Pickrell and Pritchard [2012], ADMIXTOOLS Patterson et al. [2012], Maier et al. [2023], poolfstat Gautier et al. [2022] and AdmixtureBayes Nielsen et al. [2023] use allele frequency as a continuous trait. They model its evolution along a network, or admixture graph, using a Gaussian model in which the evolutionary rate variance is affected by the ancestral allele frequency Soraggi and Wiuf [2019], Lipson [2020]. Again, these methods work with the phylogenetic covariance matrix, rather than BP approaches. They also consider subsets of up to 4 taxa at a time via $f$-statistics, which simplifies the likelihood calculation. To identify selection and adaptation on a network, PolyGraph Racimo et al. [2018] and GRoSS Refoyo-Martínez et al. [2019] assume a similar model and use the full covariance matrix. In summary, BP has yet to be used for continuous trait evolution on networks.

3 Continuous trait evolution on a phylogenetic network

We now present phylogenetic models for the evolution of continuous traits, to which we apply BP later. We generalize the framework in Mitov et al. [2020] and Bastide et al. [2021] from trees to networks, and we extend the network model in Bastide et al. [2018b] from the BM to more general evolutionary models. We consider a multivariate trait $X$ consisting of $p$ continuous traits, and model their correlation over time. Our model ignores the potential effects of incomplete lineage sorting on $X$, a reasonable assumption for highly polygenic traits.

3.1 Linear Gaussian models

Most random processes used to model continuous trait evolution on a phylogenetic tree are extensions of the BM to capture processes such as evolutionary trends, adaptation, and variation in rates across lineages. In its most general form, the linear Gaussian evolutionary model on a tree (referred to as the GLInv family in Mitov et al. [2020]) assumes that the trait $X_v$ at node $v$ has the following distribution conditional on its parent $\mathrm{pa}(v)$

$$X_v \mid X_{\mathrm{pa}(v)} \sim \mathcal{N}(q_v X_{\mathrm{pa}(v)} + \omega_v, V_v) \tag{2}$$

where the actualization matrix $q_v$, the trend vector $\omega_v$ and the covariance matrix $V_v$ are appropriately sized and do not depend on trait values $X_{\mathrm{pa}(v)}$. When the tree is replaced by a network, a node $v$ can have multiple parents $\mathrm{pa}(v)$. In this case, we can write $X_{\mathrm{pa}(v)}$ as the vector formed by stacking the elements of $\{X_u \mid u \in \mathrm{pa}(v)\}$ vertically, with length equal to the number of traits times the number of parents of $v$. In the following, we show that (2), already used on trees, can easily be extended to networks, to describe both evolutionary models along one lineage and a merging rule at reticulation events.

3.2 Evolutionary models along one lineage

For a tree node $v$ with parent node $u$, we need to describe the evolutionary process along one lineage, graphically modelled by the tree edge $e = (u, v)$. It is well known that a wide range of evolutionary models can fit in the general form (2) Mitov et al. [2020], Bastide et al. [2021]. For instance, the BM with variance rate $\Sigma$ (a variance-covariance matrix for a multivariate trait) is described by (2) where $q_v$ is the $p \times p$ identity matrix $I_p$, there is no trend $\omega_v = 0$ and the variance is proportional to the edge length $\ell(e)$: $V_v = \ell(e)\Sigma$.

Allowing for rate variation amounts to letting the variance rate vary across edges $\Sigma = \Sigma(e)$. For example, the Early Burst (EB) model assumes that the variance rate at any given point in the phylogeny depends on the time $t$ from the root to that point, as:

$$\Sigma(t) = \Sigma_0 e^{bt}.$$

For this time $t$ to be well-defined on a reticulate network, the network needs to be time-consistent (distinct paths from the root to a node all share the same length). The rate $b$ is a rate of variance decay if it is negative, as expected during adaptive radiations, with a burst of variation near the root (hence Early Burst) before a slow-down of trait evolution Harmon et al. [2010]. When $b > 0$, this model is called “accelerating rate” (AC) Blomberg et al. [2003]. Clavel and Morlon [2017] used a flexible extension of this model (on a tree), replacing $t$ by one or more covariates that are known functions of time, such as the average global temperature and other environmental variables:

$$\Sigma(t) = \tilde{\sigma}(t, T_1(t), \cdots, T_k(t)).$$

Then, the variance accumulated along edge $e = (u, v)$ is given by

$$V_v = \int_{t(u)}^{t(v)} \Sigma(t)\,dt.$$

In the particular case of the EB model, we get

$$V_v = \Sigma_0 e^{bt(u)} (e^{b\ell(e)} - 1)/b.$$
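The EB closed form can be checked numerically against the integral that defines $V_v$. The sketch below is a univariate illustration with hypothetical parameter values. The function names are introduced here for the example, and the midpoint rule is used only as an independent check, not as a recommended way to compute the variance.

```python
import math

def eb_edge_variance(sigma0_sq, b, t_u, length):
    """Closed-form EB variance accumulated along an edge starting at time t_u."""
    if b == 0.0:
        return sigma0_sq * length  # BM limit of the EB model
    return sigma0_sq * math.exp(b * t_u) * math.expm1(b * length) / b

def eb_edge_variance_numeric(sigma0_sq, b, t_u, length, steps=10000):
    """Midpoint-rule approximation of the integral of sigma0_sq * exp(b t) over the edge."""
    h = length / steps
    return h * sum(sigma0_sq * math.exp(b * (t_u + (k + 0.5) * h)) for k in range(steps))
```

With a negative rate such as `b = -1.5`, an edge that starts later in the phylogeny accumulates less variance than an edge of the same length near the root, which is the early-burst pattern described above.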

Allowing for shifts in the trait value, perhaps due to jumps or cladogenesis, amounts to including $\omega_v \neq 0$ for some $v$.

Adaptive evolution is typically modelled by the OU process, which includes a parameter $A_e$ for the strength of selection along edge $e$. This selection strength is often assumed constant across edges, and is typically denoted as $\alpha$ for a univariate trait. The OU process also includes a primary optimum value $\theta_e$, which may vary across edges when we are interested in detecting shifts in the adaptive regime across the phylogeny. Under the OU model, the trait evolves along edge $e$ with random drift and a tendency towards $\theta_e$:

$$dX^{(e)}(t) = A_e(\theta_e - X^{(e)}(t))\,dt + R_e\,dB(t)$$

where $B$ is a standard BM and the drift variance is $\Sigma_e = R_e R_e^\top$. Then, conditional on the starting value at the start of $e$, the end value $X_v$ is linear Gaussian as in (2) with actualization $q_v = e^{-\ell(e)A_e}$, trend $\omega_v = (I - e^{-\ell(e)A_e})\theta_e$ and variance

$$V_v = \int_0^{\ell(e)} e^{-sA_e} \Sigma_e e^{-sA_e^\top}\,ds = S_e - e^{-\ell(e)A_e} S_e e^{-\ell(e)A_e^\top}$$

where $S_e$ is the stationary variance matrix. These equations simplify greatly if $A_e$ and $\Sigma_e$ commute, such as if $A_e$ is scalar of the form $\alpha_e I_p$, including when the process is univariate. In this case,

$$V_v = (1 - e^{-2\alpha_e\ell(e)})\Sigma_e/(2\alpha_e).$$

Shifts in adaptive regimes can be modelled by shifts in any of the parameters $\theta_e$, $A_e$ or $\Sigma_e$ across edges.
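For a univariate trait, the OU quantities entering (2) reduce to three scalars per edge. The sketch below is a minimal illustration with hypothetical parameter values. The helper `compose`, which chains the linear Gaussian models of two consecutive edges, is introduced here only to check that two edges with the same parameters behave like a single edge of their total length.

```python
import math

def ou_edge(alpha, theta, sigma2, length):
    """Univariate OU edge: actualization q, trend omega and variance V as in (2)."""
    q = math.exp(-alpha * length)
    omega = (1.0 - q) * theta
    v = -math.expm1(-2.0 * alpha * length) * sigma2 / (2.0 * alpha)
    return q, omega, v

def compose(first, second):
    """Linear Gaussian model obtained by following edge `first`, then edge `second`."""
    q1, w1, v1 = first
    q2, w2, v2 = second
    return q2 * q1, q2 * w1 + w2, q2 * q2 * v1 + v2
```

As the edge length grows, `q` tends to 0, `omega` tends to `theta` and `v` tends to the stationary variance `sigma2 / (2 * alpha)`, so the trait forgets its starting value.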

Finally, variation within species, including measurement error, can be easily modelled by grafting one or more edges at each species node, to model the fact that the measurement taken from an individual may differ from the true species mean. The model for within-species variation, then, should also follow (2), by which an individual value is assumed to be normally distributed with a mean that depends linearly on the species mean, and a variance independent of the species mean, although this variance can vary across species. Most typically, observations from species $v$ are modelled using $q = I_p$, $\omega = 0$ and some phenotypic variance to be estimated, that may or may not be tied to the evolutionary variance parameter from the phylogenetic model across species. This additional observation layer can also be used for factor analysis, where the unobserved latent trait evolving on the network has smaller dimension than the observed trait. In that case, $q$ is a rectangular matrix, representing the loading matrix Tolkoff et al. [2018], Hassler et al. [2022b].

3.3 Evolutionary models at reticulations

For a continuous trait and a hybrid node $h$, Bastide et al. [2018b] and Pickrell and Pritchard [2012] assumed that $X_h$ is a weighted average of its immediate parents, using their state immediately before the reticulation event. Specifically, if $h$ has parent edges $e_1, \dots, e_m$, and if we denote by $X_{e_k}$ the state at the end of edge $e_k$ right before the reticulation event ($1 \le k \le m$), then the weighted-average model assumes that

$$X_h = \sum_{e_k \text{ parent of } h} \gamma(e_k)\,X_{e_k}. \tag{3}$$

This model is a reasonable null model for polygenic traits, reﬂecting the typical observation that hybrid species show

intermediate phenotypes. In this model, the biological process underlying the reticulation event (such as gene ﬂow

versus hybrid speciation) does not need to be known. Only the proportion of the genome inherited by each parent,

γ(ek)

, needs to be known. Compared to the evolutionary time scale of the phylogeny, the reticulation event is assumed

to be instantaneous.

To describe this process as a graphical model, we may add a degree-2 node at the end of each hybrid edge $e$ to store the value $X_e$, so as to separate the description of the evolutionary process along each edge from the description of the process at a reticulation event. With these extra degree-2 nodes, the weighted-average model (3) corresponds to the linear Gaussian model (2) with no trend $\omega_h = 0$, no variance $V_h = 0$, and with actualization $q_h = [\gamma(e_1) I_p \;\dots\; \gamma(e_m) I_p]$ made of scalar diagonal blocks.
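The block structure of the actualization matrix can be illustrated with a short sketch. The helper names `hybrid_actualization` and `hybrid_value` are hypothetical, and matrices are stored as nested lists to keep the example dependency-free.

```python
def hybrid_actualization(gammas, p):
    """Build q_h = [gamma_1 I_p ... gamma_m I_p] as a p x (m*p) nested list."""
    if abs(sum(gammas) - 1.0) > 1e-9:
        raise ValueError("inheritance weights must sum to 1")
    return [[g if col == row else 0.0 for g in gammas for col in range(p)]
            for row in range(p)]

def hybrid_value(gammas, parent_states):
    """Weighted average X_h = sum_k gamma(e_k) X_{e_k} (no trend, no variance)."""
    p = len(parent_states[0])
    q = hybrid_actualization(gammas, p)
    # stack the parental states (X_{e_1}, ..., X_{e_m}) into one long vector
    stacked = [x for state in parent_states for x in state]
    return [sum(q[i][j] * stacked[j] for j in range(len(stacked)))
            for i in range(p)]

# two parents with weights 0.4 and 0.6, and p = 2 traits
x_h = hybrid_value([0.4, 0.6], [[1.0, 0.0], [2.0, 10.0]])
```

Multiplying the stacked parental vector by `q_h` reproduces the trait-by-trait weighted average of equation (3).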

Several extensions of this hybrid model can be considered. Bastide et al. [2018b] modelled transgressive evolution with a shift $\omega_h \neq 0$, for the hybrid population to differ from the weighted average of its immediate parents, even possibly taking a value outside their range. Jhwueng and O'Meara [2015] considered transgressive shifts at each hybrid node as random variables with a common variance, corresponding to a model with $\omega_h = 0$ but non-zero variance $V_h$.

More generally, we may consider models in which the hybrid value is any linear combination of its immediate parents $q_v X_{pa(v)}$ as in (2). A biologically relevant model could consider $q_v$ to be diagonal, with, on the diagonal, parental weights $\gamma(e, j)$ that may depend on the trait $j$ instead of being shared across all $p$ traits.

We may also consider both a fixed transgressive shift $\omega_h \neq 0$ and an additional hybrid variance $V_h$. For both of these components to be identifiable in the typical case when we observe a unique realization of the trait evolution, the model would need extra assumptions to induce sparsity. For example, we may assume that $V_h$ is shared across all reticulations and is given an informative prior, to capture small variations around the parental weighted average. We may also need a sparse model on the set of $\omega_h$ parameters, e.g. letting $\omega_h \neq 0$ only at a few candidate reticulations $h$, chosen based on external domain knowledge.

For a continuous trait known to be controlled by a single gene, we may prefer a model similar to the discrete trait model presented later in Example 2, by which $X_h$ takes the value of one of its immediate parents $X_e$ with probability $\gamma(e)$. This model would no longer be linear Gaussian, unless we condition on which parent is being inherited at each reticulation. Such conditioning would reduce the phylogeny to one of its displayed trees. But it would require other techniques to integrate over all parental assignments to each hybrid population, such as Markov Chain Monte Carlo or Expectation Maximization.

3.4 Evolutionary models with interacting populations

Models have been proposed in which the evolution of $X^{(e)}(t)$ along one edge $e$ depends on the state on other edges existing at the same time Drury et al. [2016], Manceau et al. [2017], Bartoszek et al. [2017], Duchen et al. [2020]. These models can describe "phenotype matching" that may arise from ecological interactions (mutualism, competition) or demographic interactions (migration), in which traits across species converge to or diverge from one another. To express this coevolution, we consider the set $E(t)$ of edges contemporary to one another at time $t$ and divide the phylogeny into epochs: time intervals $[\tau_i, \tau_{i+1}]$ during which the set $E(t)$ of interacting lineages is constant, denoted $E_i$. Within each epoch (i.e. $t \in [\tau_i, \tau_{i+1}]$), the vector of all traits $(X^{(e)}(t))_{e \in E_i}$ is modelled by a linear stochastic differential equation. Since its mean is linear in and its variance independent of the starting value $(X^{(e)}(\tau_i))_{e \in E_i}$, these models are linear Gaussian Manceau et al. [2017], Bartoszek et al. [2017]. In fact, they can be expressed by (2) on a supergraph of the original phylogeny, in which an edge $(u, v)$ is added if $u$ is at the start $\tau_i$ of some epoch, $v$ is at the end $\tau_{i+1}$, and if the mean of $X_v$ conditional on all traits at time $\tau_i$ has a non-zero coefficient for $X_u$. The specific form of $q_v$, $\omega_v$ and $V_v$ in (2) depends on the specific interaction model, and may be more complex than the merging rule (3).

4 A short review of graphical models and belief propagation

Implementing BP techniques on general networks is more complex than on trees. To explain why, we review here the

main ideas of graphical models and belief propagation for likelihood calculation.

4.1 Graphical models

A probabilistic graphical model is a graph representation of a probability distribution. Each node in the graph represents

a random variable, typically univariate but possibly multivariate. We focus here on graphical models with directed

edges. Edges represent dependencies between variables, where the direction is typically used to represent causation.

The graph expresses conditional independencies satisﬁed by the joint distribution of all the variables at all nodes in

the graph. Given the directional nature of evolution and inheritance, models for trait evolution on a phylogeny are

often readily formulated as directed graphical models. Höhna et al. [2014] demonstrate the utility of representing

phylogenetic models as graphical models for exposing assumptions, and for interpretation and implementation. They

present a range of examples common in evolutionary biology, with a focus on how graphical models facilitate greater

modularity and transparency. Here we focus on the computational gains that BP allows on graphical models.

A directed graphical model consists of a directed acyclic graph (DAG) $G$ and a set of conditional distributions, one for each node in $G$. At a node $v$ with parent nodes $pa(v)$, the distribution of variable $X_v$ conditional on its set of parent variables $X_{pa(v)} = \{X_u;\, u \in pa(v)\}$ is given by a factor $\phi_v$, which is a function whose scope is the set of variables from $v$ and $pa(v)$. For each node $v$, the set formed by this node and its parents $\{v\} \cup pa(v)$ is called a node family. If $V$ denotes the vertex set of $G$, then the set of factors $\{\phi_v, v \in V\}$ defines the joint density of the graphical model as

$$p_\theta(X_v;\, v \in V) = \prod_{v \in V} \phi_v(X_v \mid X_u, \theta;\, u \in pa(v)) \tag{4}$$

where we add the possible dependence of factors on model parameters $\theta$. This factor formulation implies that, conditional on its parents, $X_v$ is independent of any non-descendant node (e.g. "grandparents") Koller and Friedman [2009].
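The factorization in (4) can be sketched on a toy example. The 3-node DAG, the node names and the flip probability 0.1 below are assumptions chosen only for illustration, with binary states to keep the tables small.

```python
from itertools import product

# Hypothetical DAG: rho -> a and rho -> b, each node with states 0/1.
parents = {"rho": (), "a": ("rho",), "b": ("rho",)}

def phi_root(x, pa):
    """Prior at the root: uniform over the two states."""
    return 0.5

def phi_child(x, pa):
    """Child keeps its parent's state with probability 0.9."""
    return 0.9 if x == pa[0] else 0.1

factors = {"rho": phi_root, "a": phi_child, "b": phi_child}

def joint(assign):
    """Product over nodes v of phi_v(x_v | x_pa(v)), as in equation (4)."""
    prob = 1.0
    for v, pa in parents.items():
        prob *= factors[v](assign[v], tuple(assign[u] for u in pa))
    return prob

nodes = list(parents)
total = sum(joint(dict(zip(nodes, states)))
            for states in product((0, 1), repeat=len(nodes)))
```

Summing `joint` over all assignments gives 1, as expected for a product of conditional distributions on a DAG.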

4.2 Phylogenetic examples of graphical models

Example 1 (BM on a tree). First consider the phylogenetic tree $T$ in Fig. 1a. The graphical model for the node states under a BM, whose parameters $\theta$ are the trait evolutionary variance rate $\sigma^2$, the ancestral state at the root $x_\rho$ and edge lengths $\ell_i$, has the same topology as $T$. On a tree, each node family consists of a node and its single parent, or the root $\rho$ by itself. The factor $\phi_\rho$ at the root may be deterministic, as when $x_\rho$ is a fixed parameter of the model, or may be a prior distribution on $x_\rho$.

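[Figure 1 panels: (a) tree $T = G$ with root $x_\rho$, internal nodes $x_5$ and $x_6$, leaves $x_1, \dots, x_4$, and edge lengths $\ell_1, \dots, \ell_6$; (b) clique tree $U$ with clusters $\{x_\rho\}$, $\{x_5, x_\rho\}$, $\{x_6, x_\rho\}$, $\{x_1, x_5\}$, $\{x_2, x_5\}$, $\{x_3, x_6\}$, $\{x_4, x_6\}$ and sepsets $x_\rho$, $x_5$, $x_6$.]

Figure 1: Example graphical model on a phylogenetic tree with factors defined by the BM. The joint distribution of all variables at all nodes is given by the product of factors $\prod_v \phi_v$, where $\phi_v$ is the distribution of $x_v$ conditional on its parent variable $x_{pa(v)}$: $N(x_{pa(v)}, \sigma^2 \ell_v)$ under the BM. (a) Phylogenetic tree $T$. (b) Clique tree $U$ for the graphical model. Its nodes are clusters of variables in $G$ (ellipses). Each edge is labelled by a sepset (squares): a subset of variables shared by adjacent clusters.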

Example 2 (Discrete trait on a network). For a second example, we will consider a reticulate phylogeny. A rooted phylogenetic network is a DAG with a single root, and taxon-labelled leaves (or tips). A node with at most one parent is called a tree node and its incoming edge is a tree edge. A node with multiple parents is called a hybrid node, and represents a population (or species more generally) with mixed ancestry. An edge $e = (u, h)$ going into a hybrid node $h$ is called a hybrid edge. It is assigned an inheritance probability $\gamma(e) > 0$ that represents the proportion of the genome that was inherited from the parent population $u$ (via edge $e$). At each hybrid node $h$ we must have $\sum_{u \in pa(h)} \gamma((u, h)) = 1$. The phylogenetic network $N$ in Fig. 2a has one hybrid node $x_5$ whose genetic makeup comes from $x_4$ with proportion 0.4 and from $x_6$ with proportion 0.6.

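[Figure 2 panels: (a) network $N = G$ with root $x_\rho$, internal nodes $x_4$, $x_5$, $x_6$, leaves $x_1$, $x_2$, $x_3$, and hybrid edges labelled $\gamma = 0.4$ and $\gamma = 0.6$; (b) clique tree $U$ with clusters $\{x_4, x_6, x_\rho\}$, $\{x_5, x_4, x_6\}$, $\{x_1, x_4\}$, $\{x_2, x_5\}$, $\{x_3, x_6\}$ and sepsets $\{x_4, x_6\}$, $x_4$, $x_5$, $x_6$; (c) cluster graph $U^*$ with clusters $\{x_\rho\}$, $\{x_4, x_\rho\}$, $\{x_6, x_\rho\}$, $\{x_5, x_4, x_6\}$ and sepsets $x_\rho$, $x_\rho$, $x_4$, $x_6$.]

Figure 2: (a) Phylogenetic network $N$ with hybrid edges shown in blue. $N$ displays two trees, depending on which hybrid edge is retained. One tree, with sister taxa 1 and 2, has probability 0.4. The other tree, with sister taxa 2 and 3, is displayed with probability 0.6. The distribution of the hybrid node $x_5$ depends on both its parents, and induces a factor cluster $\{x_4, x_5, x_6\}$ of size 3 in $U$ and $U^*$. (b) Clique tree $U$ for the graphical model. (c) Cluster graph $U^*$ (leaf clusters not shown) for the same graphical model in which $\{x_4, x_6, x_\rho\}$ is replaced by smaller clusters $\{x_4, x_\rho\}$, $\{x_6, x_\rho\}$ and $\{x_\rho\}$ that induce a cycle.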

For a discrete trait $X$, the traditional model of evolution on a tree can be extended to a network $N$ as follows. Along each edge $e$, $X$ evolves according to a Markov process with some transition rate matrix $Q$ for an amount of time $\ell(e)$ that depends on the edge. At a tree node, the state of $X$ at the end of its parent edge is passed as the starting value to each daughter lineage, as in the traditional tree model. At reticulations, we follow previous authors to model the value at a hybrid node $h$ Karimi et al. [2020], Allen-Savietta [2020], Lutteropp et al. [2022]. Let $X_e$ denote the state at the end of edge $e$. If $h$ has $m$ parent edges $e_1, \cdots, e_m$, then $X_h$ is assumed to take value $x_{e_k}$ with probability $\gamma(e_k)$. This model reflects the idea that the trait is controlled by unknown genes, but the proportion of genes inherited from each parent is known. Incomplete lineage sorting, which can lead to hemiplasy for a trait Avise and Robinson [2008], is unaccounted for. Similar to Example 1, the graphical model uses the topology of the network $N$.

To describe the factors of this graphical model and simplify notations, consider the case when $X$ is binary with states 0 and 1. For a tree node $v$, the factor $\phi_v$ can be represented by the $2 \times 2$ matrix $\exp(\ell(e) Q)$, where $e$ is the parent edge of $v$. For a hybrid node $h$ with $m$ parents $p_1, \cdots, p_m$ and edges $e_k = (p_k, h)$ with $\gamma(e_k) = \gamma_k$, the factor $\phi_h$ has scope $(X_h, X_{p_1}, \cdots, X_{p_m})$, and can be described by a $2 \times 2^m$ matrix to store the conditional probabilities $P(X_h = j \mid X_{p_1} = i_1, \cdots, X_{p_m} = i_m)$. This is a $2 \times 4$ matrix in the typical case when $h$ is admixed from $m = 2$ parental populations. With $m = 2$ and with parental values $(X_{p_1}, X_{p_2})$ ordered as $((0,0), (0,1), (1,0), (1,1))$, then

$$\phi_h = \begin{bmatrix} 1 & \gamma_1 & \gamma_2 & 0 \\ 0 & \gamma_2 & \gamma_1 & 1 \end{bmatrix}.$$

[Figure 3 panels (a)-(c): node and clique labels are not recoverable from the text extraction.]

Figure 3: (a) Phylogenetic species tree $T$, used to generate gene genealogies for $\underline{n}_1$ and $\underline{n}_2$ individuals sampled from species 1 and 2 respectively. (b) Graph $G$ for the graphical model associated with the evolution of a binary trait on a gene tree drawn from the multispecies coalescent model. $G$ is a DAG with 2 sources (roots) $\underline{n}_1$ and $\underline{n}_2$, and 2 sinks (leaves) $\underline{r}_1$ and $\underline{r}_2$. The model restricted to variables $\underline{n}$ and $\overline{n}$ (in black) can be described by the subgraph $G_n$ whose nodes and edges are in black. It is a tree similar to $T$ but with reversed edge directions. (c) Clique tree $U$ for $G$. Note that the 6-variable clique is overparametrized because $\overline{n}_1 + \overline{n}_2 = \underline{n}_3$ and $\overline{r}_1 + \overline{r}_2 = \underline{r}_3$, but reflects the symmetry of the model.

Example 3 (Binary trait with ILS on a tree). Our final example is a case that accounts for incomplete lineage sorting (ILS), when the graph $G$ for the graphical model is constructed from but not identical to the phylogeny. Consider the species tree $T$ in Fig. 3a, a sample of one or more individuals from each species (1 and 2), and a gene tree (or genealogy) generated according to the multispecies coalescent model along $T$ Kingman [1982], Rannala and Yang [2003]. Finally, consider a binary trait evolving along this gene tree, with states (or alleles) "black" and "red" to re-use terminology by Bryant et al. [2012]. The observations from this model are the number of individuals $\underline{r}_i$ with the red allele among the $\underline{n}_i$ individuals sampled from each species $i$. Bryant et al. [2012] discovered conditional independencies in this model by considering and conditioning on the total number ($n$) and number of red alleles ($r$) ancestral to the sampled individuals at the beginning of each edge, $\overline{n}_e$ and $\overline{r}_e$; and at the end of each edge, $\underline{n}_e$ and $\underline{r}_e$. Here, we formulate this evolutionary model as a graphical model. Its graph $G$ is different from the original phylogenetic tree, as illustrated in Fig. 3b.

If we only consider the ancestral number of individuals $n$, then the graph $G_n$ for the associated graphical model is as follows, thanks to the description of the coalescent model going back in time. For each edge $e$ in $T$, an edge is created in $G_n$ but with the reversed direction (black subgraph in Fig. 3). On this edge, the coalescent edge factor $\phi_{\overline{n}_e} = P(\overline{n}_e \mid \underline{n}_e)$ was derived by Tavaré [1984] and is given in [Bryant et al., 2012, eq. (6)]. Each internal node $v$ is triplicated in $G_n$ to hold the variables $\underline{n}_e$, $\overline{n}_{c_1}$ and $\overline{n}_{c_2}$, where $e$ denotes the parent edge of $v$ and $c_1, c_2$ denote its child edges (assuming that $v$ has only 2 children, without loss of generality). These nodes are then connected in $G_n$ by edges from each $\overline{n}_{c_i}$ to $\underline{n}_e$. The speciation factor $\phi_{\underline{n}_e} = \mathbb{1}_{\{\overline{n}_{c_1} + \overline{n}_{c_2}\}}(\underline{n}_e)$ expresses the relationship $\underline{n}_e = \overline{n}_{c_1} + \overline{n}_{c_2}$. Overall, $G_n$ is a tree with a single sink (leaf), multiple sources (roots), and data at the roots.

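To calculate the likelihood of the data, we add the number of red alleles $r$ ancestral to the sampled individuals. The full graph $G$ (Fig. 3b) contains $G_n$, with extra nodes for the $r$ variables, and extra edges to model the process along edges and at speciations. The node family for $\underline{r}_e$ includes $\overline{r}_e$ and both $\underline{n}_e$ and $\overline{n}_e$. The mutation edge factor $\phi_{\underline{r}_e} = P(\underline{r}_e \mid \overline{r}_e, \underline{n}_e, \overline{n}_e)$ was derived by Griffiths and Tavaré [1994] using both the coalescent and mutation processes, and is given in [Bryant et al., 2012, eq. (16)]. For edge $e$ with child edges $c_1$ and $c_2$, the speciation factors for red alleles $\phi_{\overline{r}_{c_1}} = P(\overline{r}_{c_1} \mid \overline{n}_{c_1}, \underline{n}_e, \underline{r}_e)$ and $\phi_{\overline{r}_{c_2}} = \mathbb{1}_{\{\underline{r}_e - \overline{r}_{c_1}\}}(\overline{r}_{c_2})$ describe a hypergeometric distribution where $\overline{n}_{c_1}$ individuals, $\overline{r}_{c_1}$ of which are red, are sampled from a pool of $\underline{n}_e$ individuals, $\underline{r}_e$ of which are red, and $\overline{r}_{c_2} = \underline{r}_e - \overline{r}_{c_1}$.

Given this graphical model description, the likelihood calculation described in Bryant et al. [2012] corresponds to BP along graph $G$, as we will illustrate later.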
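The hypergeometric speciation factor for red alleles described above can be written in a few lines. This is a sketch: the function name and plain argument names (`r_c1`, `n_c1`, `n_e`, `r_e`) are assumptions that mirror the sampling description, in which `n_c1` lineages, `r_c1` of them red, are drawn from a pool of `n_e` lineages, `r_e` of them red.

```python
from math import comb

def speciation_red_factor(r_c1, n_c1, n_e, r_e):
    """Hypergeometric probability of r_c1 red lineages among n_c1 sampled.

    The pool holds n_e lineages, r_e of which are red. Impossible
    configurations get probability 0.
    """
    if not (0 <= r_c1 <= n_c1 and r_c1 <= r_e and n_c1 - r_c1 <= n_e - r_e):
        return 0.0
    return comb(r_e, r_c1) * comb(n_e - r_e, n_c1 - r_c1) / comb(n_e, n_c1)

# distribution of the number of red lineages going to child edge c1
probs = [speciation_red_factor(r, 2, 4, 2) for r in range(3)]
```

The number of red lineages in the other child edge is then determined as `r_e - r_c1`, which is what the indicator factor encodes.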

This framework can be extended to the case when the phylogeny is reticulate, with additional edges in $G$, and hybridization factors to model the process at hybrid nodes for the $n$ and $r$ variables, illustrated on an example in SM section A. The likelihood calculations used in SnappNet and described in Rabier et al. [2021] correspond to BP along this graph $G$.

4.3 Belief Propagation

BP is a framework for efficiently computing various integrals of the factored density $p_\theta$ by grouping nodes and their associated variables into clusters and integrating them out according to rules along a clique tree (also known as a junction tree, join tree, or tree decomposition) or, more generally, along a cluster graph.

4.3.1 Cluster graphs and Clique trees

Definition 1 (cluster graph and clique tree). Let $\Phi = \{\phi_v, v \in V\}$ be the factors of a graphical model on graph $G$ and let $U = (\mathcal{V}, \mathcal{E})$ be an undirected graph whose nodes $\mathcal{C}_i \in \mathcal{V}$, called clusters, consist of sets of variables in the scope of $\Phi$. $U$ is a cluster graph for $\Phi$ if it satisfies the following properties:

1. (family-preserving) There exists a map $\alpha: \Phi \to \mathcal{V}$ such that for each factor $\phi_v$, its scope (node family for node $v$ in the graphical model) is a subset of the cluster $\alpha(\phi_v)$.

2. (edge-labeled) Each edge $\{\mathcal{C}_i, \mathcal{C}_j\}$ is labelled with a non-empty sepset $\mathcal{S}_{i,j}$ ("separating set") such that $\mathcal{S}_{i,j} \subseteq \mathcal{C}_i \cap \mathcal{C}_j$.

3. (running intersection) For each variable $x$ in the scope of $\Phi$, $\mathcal{E}_x \subseteq \mathcal{E}$, the set of edges with $x$ in their sepsets, forms a tree that spans $\mathcal{V}_x \subseteq \mathcal{V}$, the set of clusters that contain $x$.

If $U$ is acyclic, then $U$ is called a clique tree and we refer to its nodes as cliques. In this case, properties 2 and 3 imply that $\mathcal{S}_{i,j} = \mathcal{C}_i \cap \mathcal{C}_j$.
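The running intersection property can be checked mechanically. The sketch below uses a hypothetical encoding (cluster names mapped to sets of variable names, edges mapped to sepsets) and applies it to the clique tree of Fig. 1b, with `xr` standing for $x_\rho$.

```python
def running_intersection_ok(clusters, edges):
    """Check property 3 for clusters {name: set} and edges {(i, j): sepset}."""
    variables = set().union(*clusters.values())
    for x in variables:
        nodes = {c for c, scope in clusters.items() if x in scope}
        links = [e for e, sep in edges.items() if x in sep]
        # a tree spanning `nodes` has exactly len(nodes) - 1 edges ...
        if len(links) != len(nodes) - 1:
            return False
        # ... and connects all of them (depth-first search over `links`)
        reached, frontier = set(), [next(iter(nodes))]
        while frontier:
            c = frontier.pop()
            if c in reached:
                continue
            reached.add(c)
            frontier += [j if i == c else i for i, j in links if c in (i, j)]
        if reached != nodes:
            return False
    return True

clusters = {"rho": {"xr"}, "5": {"x5", "xr"}, "6": {"x6", "xr"},
            "1": {"x1", "x5"}, "2": {"x2", "x5"},
            "3": {"x3", "x6"}, "4": {"x4", "x6"}}
edges = {("rho", "5"): {"xr"}, ("rho", "6"): {"xr"},
         ("5", "1"): {"x5"}, ("5", "2"): {"x5"},
         ("6", "3"): {"x6"}, ("6", "4"): {"x6"}}
ok = running_intersection_ok(clusters, edges)
```

Dropping one of the edges labelled with `xr` breaks the property, because the clusters containing $x_\rho$ are then no longer connected.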

A clique tree $U$ is shown in Fig. 1b for the BM model from Example 1, on the tree $T$ in Fig. 1a. To check the running intersection property for $x_5$, for example, we extract the graph defined by edges with $x_5$ in their sepsets (squares). There are 2 such edges. They induce a subtree of $U$ that connects all 3 clusters (ellipses) containing $x_5$, as desired. More generally, when the graphical model is defined on a tree $T$, a corresponding clique tree $U$ is easily constructed, where cliques in $U$ correspond to edges in $T$, and edges in $U$ correspond to nodes in $T$. Multiple clique trees can be constructed for a given graphical model. In this example, the clique $\{x_\rho\}$ (shown at the top) could be suppressed, because it is a subset of adjacent cliques.

For the network $N$ in Fig. 2a and the evolution of a discrete trait in Example 2, one possible clique tree $U$ is shown in Fig. 2b. Note that $x_5$, $x_4$ and $x_6$ have to appear together in at least one of the clusters for the clique tree to be family-preserving (property 1), because $x_4$ and $x_6$ are partners with a common child $x_5$ whose distribution depends on both of their states.

We ﬁrst focus on clique trees, which provide a structure for the exact likelihood calculation. In section 5 we discuss the

advantages of cluster graphs, to approximate the likelihood at a lower computational cost.

4.3.2 Evidence

To calculate the likelihood of the data, or the marginal distribution of the traits at some node conditional on the data, we inject evidence into the model, in one of two equivalent ways. For each observed value $\mathrm{x}_{v,t}$ of the $t$-th trait $x_{v,t}$ at node $v$, we add to the model the indicator function $\mathbb{1}_{\{\mathrm{x}_{v,t}\}}(x_{v,t})$ as an additional factor. Equivalently, we can plug in the observed value $\mathrm{x}_{v,t}$ in place of the variable $x_{v,t}$ in all factors where $x_{v,t}$ appears, and then drop $x_{v,t}$ from the scope of all these factors. The second approach is more tractable than the first because it avoids the degenerate zero-variance Dirac distribution. But it requires careful bookkeeping of the scope and re-parametrization of each factor with missing data, when some traits but not all are observed at some nodes. Below, we assume that the factors and their scopes have been modified to absorb evidence from the data.

4.3.3 Belief update message passing

There are multiple equivalent algorithms to perform BP. We focus here on the belief update algorithm. It assigns a belief to each cluster and to each sepset in the cluster graph. After running the algorithm, each belief should provide the marginal probability of the variables in its scope and of the observed data, with all other variables integrated out as desired to calculate the likelihood. The belief $\beta_i$ of cluster $\mathcal{C}_i$ is initialized as the product of all factors assigned to that cluster:

$$\psi_i = \prod_{\phi;\,\alpha(\phi) = \mathcal{C}_i} \phi \quad \text{for cluster } \mathcal{C}_i \tag{5}$$

The belief $\mu_{i,j}$ of an edge between cluster $\mathcal{C}_i$ and $\mathcal{C}_j$ is initialized to the constant function 1. These beliefs are then updated iteratively by passing messages. Passing a message from $\mathcal{C}_i$ to $\mathcal{C}_j$ along an edge with sepset $\mathcal{S}_{i,j}$ corresponds to passing information about the marginal distribution of the variables in $\mathcal{S}_{i,j}$ as shown in Algorithm 1.

Algorithm 1 Belief propagation: message passing along an edge from $\mathcal{C}_i$ to $\mathcal{C}_j$ with sepset $\mathcal{S}_{i,j}$.

1: compute the message $\tilde\mu_{i \to j} = \int_{\mathcal{C}_i \setminus \mathcal{S}_{i,j}} \beta_i \, d(\mathcal{C}_i \setminus \mathcal{S}_{i,j})$, that is, the marginal probability of $\mathcal{S}_{i,j}$ based on belief $\beta_i$, by integrating all other variables in $\mathcal{C}_i$,

2: update the cluster belief about $\mathcal{C}_j$: $\beta_j \leftarrow \beta_j \, \tilde\mu_{i \to j} / \mu_{i,j}$,

3: update the edge belief about $\mathcal{S}_{i,j}$: $\mu_{i,j} \leftarrow \tilde\mu_{i \to j}$.

If $U$ is a clique tree, then all beliefs converge to the true marginal probability of their variables and of the observed data, after traversing $U$ only twice: once to pass messages from leaf cliques towards some root clique, and then back from the root clique to the leaf cliques. If our goal is to calculate the likelihood, then one traversal is sufficient. Once the root clique has received messages from all its neighboring cliques, we can marginalize over all its variables (similar to step 1) to obtain the probability of the observed data only, which is the likelihood. The second traversal is necessary to obtain the marginal probability of all variables, such as if one is interested in the posterior distribution of ancestral states conditional on the observed data.

Some equivalent formulations of BP only store sepset messages, and avoid storing cluster beliefs. This strategy requires less memory but more computing time if $U$ is traversed multiple times.
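The three steps of a belief update can be sketched for discrete variables with beliefs stored as dictionaries. The data layout, the function name `send_message` and the toy numbers below are assumptions made for illustration; the sketch also assumes the current sepset belief is non-zero wherever it is used as a divisor.

```python
def send_message(beta_i, scope_i, beta_j, scope_j, mu_ij, sepset):
    """One belief update from cluster i to cluster j (discrete variables).

    beta_i and beta_j map tuples of states (ordered as in scope_i, scope_j)
    to values; mu_ij maps tuples of sepset states to values.
    """
    pos_i = [scope_i.index(x) for x in sepset]
    pos_j = [scope_j.index(x) for x in sepset]
    # step 1: marginalize beta_i onto the sepset variables
    msg = {}
    for states, val in beta_i.items():
        key = tuple(states[k] for k in pos_i)
        msg[key] = msg.get(key, 0.0) + val
    # step 2: multiply beta_j by the message, divide by the old sepset belief
    new_beta_j = {}
    for states, val in beta_j.items():
        key = tuple(states[k] for k in pos_j)
        new_beta_j[states] = val * msg[key] / mu_ij[key]
    # step 3: the message becomes the new sepset belief
    return new_beta_j, msg

# Toy chain xr -> x5 -> x1 with binary states and x1 observed in state 1.
scope_1, scope_5 = ("x5",), ("x5", "xr")
beta_1 = {(0,): 0.1, (1,): 0.9}                   # P(x1 = 1 | x5)
beta_5 = {(x5, xr): 0.5 * (0.9 if x5 == xr else 0.1)
          for x5 in (0, 1) for xr in (0, 1)}      # P(x5 | xr) P(xr)
mu = {(0,): 1.0, (1,): 1.0}
beta_5, mu = send_message(beta_1, scope_1, beta_5, scope_5, mu, ("x5",))
likelihood = sum(beta_5.values())                 # marginalize the root clique
```

After the single rootward message, summing the receiving belief over all its states gives the probability of the observed value, here 0.5 by symmetry of the toy model.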

Example 4 (link to IC). Continuing on Example 1 on the tree in Fig. 1, the conditional distribution of $x_v$ at a non-root node $v$ corresponds to a factor $\phi_v$ for the BM model along edge $(pa(v), v)$. This factor is assigned to clique $\mathcal{C}_v = \{x_{pa(v)}, x_v\}$ to initialize the belief $\beta_v$. If $v$ is a leaf in $T$, then $\beta_v$ is further multiplied by the indicator function at the value $\mathrm{x}_v$ observed at $v$, such that the belief of clique $\mathcal{C}_v$ can be expressed as a function of the leaf's parent state only: $\phi_v(x_{pa(v)}) = P(\mathrm{x}_v \mid x_{pa(v)})$. The prior distribution $\phi(x_\rho)$ at the root $\rho$ (which can be an indicator function if the root value is fixed as a model parameter) can be assigned to any clique containing $x_\rho$. In Fig. 1, $U$ includes a clique $\mathcal{C}_\rho = \{x_\rho\}$ drawn at the top, to which we assign the root prior $\phi_\rho(x_\rho)$ and which we will use as the root of $U$. Since $U$ is a clique tree, BP converges after traversing $U$ twice: from the tips to $\mathcal{C}_\rho$ and then back to the tips. IC Felsenstein [1973, 1985] implements the first "rootwards" traversal of BP. For example, the belief of clique $\{x_5, x_\rho\}$ after receiving messages (steps 1-3) from both of its daughter cliques is, written here for $\sigma^2 = 1$, the function

$$\beta_5(x_5, x_\rho) = \exp\left(-\frac{(x_\rho - x_5)^2}{2\ell_5} - \frac{(x_5 - x_5^*)^2}{2 v_5^*} + g_5^*\right)$$

where

$$x_5^* = \frac{\ell_2 \mathrm{x}_1 + \ell_1 \mathrm{x}_2}{\ell_1 + \ell_2}, \quad v_5^* = \frac{\ell_1 \ell_2}{\ell_1 + \ell_2}, \quad \text{and} \quad g_5^* = -\frac{(\mathrm{x}_2 - \mathrm{x}_1)^2}{2(\ell_1 + \ell_2)} - \log\left((2\pi)^{3/2} \sqrt{\ell_1 \ell_2 \ell_5}\right)$$

are quantities calculated for IC: $x_5^*$ corresponds to the estimated ancestral state at node 5, $v_5^*$ corresponds to the extra length added to $\ell_5$ when pruning the daughters of node 5, and $g_5^*$ captures the contrast $(\mathrm{x}_2 - \mathrm{x}_1)/\sqrt{\ell_1 + \ell_2}$ below node 5. At this stage of BP, $\beta_5(x_5, x_\rho)$ can be interpreted as $P(\mathrm{x}_1, \mathrm{x}_2, x_5 \mid x_\rho)$ such that the message $\tilde\mu_{5 \to \rho}(x_\rho)$ sent from $\{x_5, x_\rho\}$ to the root clique $\mathcal{C}_\rho$ is the partial likelihood $P(\mathrm{x}_1, \mathrm{x}_2 \mid x_\rho)$ after $x_5$ is integrated out. The first pass is complete when $\mathcal{C}_\rho$ has received messages from all its neighbors. Its final belief is then $\beta_\rho(x_\rho) = P(\mathrm{x}_1, \cdots, \mathrm{x}_4 \mid x_\rho)\,\phi_\rho(x_\rho)$. If $x_\rho$ is a fixed model parameter, then this is the likelihood. Otherwise, we get the likelihood by integrating out $x_\rho$ in $\beta_\rho(x_\rho)$.
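The closed form of the belief at clique $\{x_5, x_\rho\}$ can be checked numerically against the product of the three BM factors it absorbs. This is a sketch assuming $\sigma^2 = 1$; the normalizing constant is written with the square root of $\ell_1 \ell_2 \ell_5$, which is what the product of three normal densities gives. Function names are illustrative.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def belief_5(x5, xr, x1, x2, l1, l2, l5):
    """Closed form of beta_5 after both daughter messages (sigma^2 = 1)."""
    xs = (l2 * x1 + l1 * x2) / (l1 + l2)        # IC ancestral state estimate
    vs = l1 * l2 / (l1 + l2)                    # extra length added to l5
    gs = (-(x2 - x1) ** 2 / (2 * (l1 + l2))
          - math.log((2 * math.pi) ** 1.5 * math.sqrt(l1 * l2 * l5)))
    return math.exp(-(xr - x5) ** 2 / (2 * l5) - (x5 - xs) ** 2 / (2 * vs) + gs)

def belief_5_direct(x5, xr, x1, x2, l1, l2, l5):
    """Product of the BM factors P(x5 | xr) P(x1 | x5) P(x2 | x5)."""
    return (normal_pdf(x5, xr, l5) * normal_pdf(x1, x5, l1)
            * normal_pdf(x2, x5, l2))

args = (0.3, -0.2, 1.0, 2.5, 0.7, 1.3, 0.4)
closed, direct = belief_5(*args), belief_5_direct(*args)
```

The two values agree up to floating-point error for any choice of states and positive edge lengths.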

In Example 2 on a network (Fig. 2), we label the cliques in $U$ as follows: $\mathcal{C}_v = \{x_v, x_{pa(v)}\}$ for leaves $v = 1, 2, 3$; $\mathcal{C}_5 = \{x_5, x_4, x_6\}$ for hybrid node $v = 5$ and its parents; and $\mathcal{C}_\rho = \{x_4, x_6, x_\rho\}$. To initialize beliefs, we assign $\phi_v$ to $\mathcal{C}_v$ for $v = 1, 2, 3, 5$, and $\phi_4$ and $\phi_6$ are both assigned to $\mathcal{C}_\rho$. Unlike in Example 1 on a tree, a clique may correspond to more than a single edge in $N$. This is expected at a hybrid node $h$, because the factor describing its conditional distribution needs to contain $x_h$ and both of its parents. But for $U$ to be a clique tree, the root clique $\mathcal{C}_\rho$ also has to contain the factors from 2 edges in $N$. Also, unlike for trees, sepsets may contain more than a single node. Here, the two large cliques are separated by $\{x_4, x_6\}$ so they will send messages $\tilde\mu(x_4, x_6)$ about the joint distribution of these two variables. In this binary trait setting, these messages and sepset belief can be stored as $2 \times 2$ arrays, and the 3-node clique beliefs can be stored as arrays of $2^3$ values. As they involve more variables than when $N$ is a tree (in which case BP would store only 2 values at each sepset), storing and updating them requires more computing time and memory.

More generally, we see that the computational complexity of BP scales with the size of the cliques and sepsets. This complexity may become prohibitive on a more complex phylogenetic network, even for a simple binary trait without ILS, if the size of the largest cluster in $U$ is too large, a topic that we explore later.

Example 3 illustrates the fact that beliefs cannot always be interpreted as partial (or full) likelihoods at every step of BP, unlike in Examples 1 and 2. For example, consider the first iteration of BP, with the tip clique $\mathcal{C}_1$ containing $(\underline{n}_1, \underline{r}_1)$ (Fig. 3) sending a message to its large neighbor clique. The belief of $\mathcal{C}_1$ is initialized with the factors $\phi_{\overline{n}_1}$ and $\phi_{\underline{r}_1}$, which are the probabilities of $\overline{n}_1$ and of $\underline{r}_1$ conditional on their parents in graph $G$. From fixing $(\underline{n}_1, \underline{r}_1)$ to their observed values, the message sent by $\mathcal{C}_1$ in step 1 is

$$\tilde\mu(\overline{n}_1, \overline{r}_1) = P(\overline{n}_1 \mid \underline{n}_1)\, P(\underline{r}_1 \mid \overline{r}_1, \overline{n}_1, \underline{n}_1).$$

This message is the quantity denoted by $F^T(n, r)$ in Bryant et al. [2012]. It is not a partial likelihood, because it is not

the likelihood of some partial subset of the data conditional on some ancestral values in the phylogeny. Intuitively, this

is because nodes with data below

include both

and

, yet

does not include

. Information about

will be

passed to the root of $U$ separately. More generally, during the first traversal of $U$, each sepset belief corresponds to an $F$ value in Bryant et al. [2012]: $F^T$ for sepsets at the top of a branch $(\overline{n}_e, \overline{r}_e)$, and $F^B$ for sepsets at the bottom of a branch $(\underline{n}_e, \underline{r}_e)$. The beauty of BP on a clique tree is that beliefs are guaranteed to converge to the likelihood of the full data, conditional on the state of the clique variables. After messages are passed down from the root to $\mathcal{C}_1$, the updated belief of $\mathcal{C}_1$ will indeed be the likelihood of the full data conditional on $\overline{n}_1$ and $\overline{r}_1$.

4.3.4 Clique tree construction

For a given graphical model on $G$, there are many possible clique trees and cluster graphs. For running BP, it is advantageous to have small clusters and small sepsets. Indeed, clusters and sepsets with fewer variables require less memory to store beliefs, and less computing time to run steps 1 (integration) and 2 (belief update). Ideally, we would like to find the best clique tree: one whose largest clique is of the smallest size possible. For a general graph $G$, finding this best clique tree is hard but good heuristics exist Koller and Friedman [2009].

The first step is to create the moralized graph from $G$. This is done by connecting all nodes that share a common child, and then undirecting all edges. We can then triangulate the moralized graph, that is, build a new graph by adding edges to it such that the new graph is chordal (any cycle of 4 or more nodes includes a chord). This is the hard step, if one wants to find a triangulation with the smallest maximum clique size. An efficient heuristic is the greedy minimum-fill heuristic Rose [1972], Fishelson and Geiger [2003]. The cliques in $U$ are then taken as the maximal cliques in the triangulated graph Blair and Peyton [1993]. Finally, the edges in $U$ are formed such that $U$ becomes a tree and such that the sum of the sepset sizes is maximum, by finding a maximum spanning tree using Kruskal's algorithm or Prim's algorithm Cormen et al. [2009]. All these steps have polynomial complexity.
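The construction steps above (moralization, greedy min-fill triangulation, maximal cliques, maximum spanning tree) can be sketched in plain Python. The function names, the tie-breaking by sorted node name and the encoding of the network of Fig. 2a (with `xr` for $x_\rho$) are assumptions for illustration, not a reference implementation.

```python
from itertools import combinations

def moralize(parents):
    """Undirected adjacency of the moral graph of a DAG {node: parents}."""
    adj = {v: set() for v in parents}
    for v, pa in parents.items():
        # link each parent to its child, then marry the co-parents
        pairs = [(u, v) for u in pa] + list(combinations(pa, 2))
        for a, b in pairs:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def min_fill_cliques(adj):
    """Greedy min-fill elimination; returns maximal cliques of the result."""
    adj = {v: set(nb) for v, nb in adj.items()}
    found = []
    while adj:
        def fill(v):  # number of edges needed to connect v's neighbors
            return sum(1 for a, b in combinations(adj[v], 2) if b not in adj[a])
        v = min(sorted(adj), key=fill)
        nbrs = adj.pop(v)
        found.append(frozenset(nbrs | {v}))
        for a, b in combinations(nbrs, 2):
            adj[a].add(b)
            adj[b].add(a)
        for a in nbrs:
            adj[a].discard(v)
    return [c for c in found if not any(c < d for d in found)]

def clique_tree(cliques):
    """Kruskal maximum spanning tree weighted by sepset size."""
    root = list(range(len(cliques)))
    def find(i):
        while root[i] != i:
            i = root[i]
        return i
    pairs = sorted(combinations(range(len(cliques)), 2),
                   key=lambda e: -len(cliques[e[0]] & cliques[e[1]]))
    tree = []
    for i, j in pairs:
        sepset = cliques[i] & cliques[j]
        if sepset and find(i) != find(j):
            root[find(i)] = find(j)
            tree.append((cliques[i], cliques[j], sepset))
    return tree

parents = {"xr": [], "x4": ["xr"], "x6": ["xr"], "x5": ["x4", "x6"],
           "x1": ["x4"], "x2": ["x5"], "x3": ["x6"]}
cliques = min_fill_cliques(moralize(parents))
tree = clique_tree(cliques)
```

On this network the sketch recovers the five cliques of Fig. 2b, with the two 3-variable cliques joined by the sepset $\{x_4, x_6\}$.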

4.4 BP for Gaussian models

Before discussing BP on cluster graphs that are not clique trees, we focus on BP updates for the evolutionary models presented in section 3. On a phylogenetic network $N$, the joint distribution of all present and ancestral species $(X_v)_{v \in N}$ is multivariate Gaussian precisely when it comes from a graphical model on $N$ whose factors $\phi_v$ are linear Gaussian Koller and Friedman [2009]. The factor at node $v$ is linear Gaussian if, conditional on its parents, $X_v$ is Gaussian with a mean that is linear in the parental values and a variance independent of parental values, hence the term GLInv used by Mitov et al. [2020]. In other words, for the joint process to be Gaussian, each factor $\phi_v(x_v \mid x_{pa(v)})$ should be of the form (2).

Such models have been called Gaussian Bayesian networks or graphical Gaussian networks, and are special cases of Gaussian processes (on a graph). These Gaussian models are convenient for BP because linear Gaussian factors have a convenient parametrization that allows for a compact representation of beliefs and belief update operations. Namely, the factor giving the conditional distribution $\phi_v(x_v \mid x_{pa(v)})$ from (2) can be expressed in a canonical form as the exponential of a quadratic form:

$$C(x; K, h, g) = \exp\left(-\tfrac{1}{2} x^\top K x + h^\top x + g\right). \tag{6}$$

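For example, if we think of $\phi_v(x_v \mid x_{pa(v)})$ as a function of $x_v$ primarily, we may use the parametrization $C(x_v; K, h, g)$ with

$$K = V_v^{-1}, \quad h = V_v^{-1}\left(q_v x_{pa(v)} + \omega_v\right), \quad \text{and} \quad g = -\tfrac{1}{2}\left(\log|2\pi V_v| + \|q_v x_{pa(v)} + \omega_v\|^2_{V_v^{-1}}\right)$$

where $\|y\|^2_M$ denotes $y^\top M y$. We can also express $\phi_v$ as a canonical form over its full scope

$$\phi_v(x_v \mid x_{pa(v)}) = C\left(\begin{bmatrix} x_v \\ x_{pa(v)} \end{bmatrix}; K_v, h_v, g_v\right)$$

with

$$K_v = \begin{bmatrix} V_v^{-1} & -V_v^{-1} q_v \\ -q_v^\top V_v^{-1} & q_v^\top V_v^{-1} q_v \end{bmatrix} = \begin{bmatrix} I \\ -q_v^\top \end{bmatrix} V_v^{-1} \begin{bmatrix} I & -q_v \end{bmatrix}, \quad h_v = \begin{bmatrix} V_v^{-1} \omega_v \\ -q_v^\top V_v^{-1} \omega_v \end{bmatrix}, \quad g_v = -\tfrac{1}{2}\left(\log|2\pi V_v| + \|\omega_v\|^2_{V_v^{-1}}\right). \tag{7}$$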
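In the univariate case with a single parent, the full-scope canonical form (7) can be written out and compared with the normal density it represents. This is a sketch with hypothetical function names; `q`, `omega` and `V` are the scalar actualization, trend and variance.

```python
import math

def canonical_full_scope(q, omega, V):
    """(K_v, h_v, g_v) of (7) for scalar x_v with one scalar parent."""
    K = [[1 / V, -q / V],
         [-q / V, q * q / V]]
    h = [omega / V, -q * omega / V]
    g = -0.5 * (math.log(2 * math.pi * V) + omega ** 2 / V)
    return K, h, g

def eval_canonical(x, K, h, g):
    """exp(-x'Kx/2 + h'x + g) for a length-2 vector x = (x_v, x_pa)."""
    quad = sum(x[i] * K[i][j] * x[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad + h[0] * x[0] + h[1] * x[1] + g)

K, h, g = canonical_full_scope(q=0.8, omega=0.5, V=2.0)
value = eval_canonical([1.2, -0.4], K, h, g)
```

The value equals the density of $N(q\,x_{pa} + \omega, V)$ evaluated at $x_v$, here with $x_v = 1.2$ and $x_{pa} = -0.4$. Note that $K_v$ has rank 1 in this case, so it is not invertible on the full scope.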

If $v$ is a leaf with fully observed data, then we need to plug in the data $\mathrm{x}_v$ into $\phi_v$ and consider this factor as a function of $x_{pa(v)}$ only. We can express $\phi_v(\mathrm{x}_v \mid x_{pa(v)})$ as the canonical form $C(x_{pa(v)}; K, h, g)$ with

$$K = q_v^\top V_v^{-1} q_v, \quad h = q_v^\top V_v^{-1} (\mathrm{x}_v - \omega_v), \quad \text{and} \quad g = -\tfrac{1}{2}\left(\log|2\pi V_v| + \|\mathrm{x}_v - \omega_v\|^2_{V_v^{-1}}\right).$$

If data are partially observed at leaf $v$, the same principle applies. We can plug in the observed traits into $\phi_v$ and express $\phi_v$ as a canonical form over its reduced scope: $x_{pa(v)}$ and any unobserved $x_{v,t}$. Some quadratic terms captured by $K_v$ on the full scope become linear or constant terms after plugging in the data, and some linear terms captured by $h_v$ on the full scope become constant terms in the canonical form on the reduced scope.

An important property of this canonical form is its closure under the belief update operations: marginalization (step 1)

and factor product (step 2). Indeed, the product of two canonical forms with the same scope satisﬁes:

C(x;K1, h1, g1)C(x;K2, h2, g2) = C(x;K1+K2, h1+h2, g1+g2).

Now consider marginalizing a factor $C(x; K, h, g)$ to a subvector $x^*$, by integrating out the elements $x \setminus x^*$. Let $K_S$ and $K_I$ be the submatrices of $K$ that correspond to $x^*$ (Scope of marginal or Sepset) and $x \setminus x^*$ (variables to be Integrated out), and let $K_{S,I} = K_{I,S}^\top$ be the cross-terms. If $K_I$ is invertible, then:

$$\int C(x; K, h, g)\, d(x \setminus x^*) = C(x^*; K^*, h^*, g^*)$$

where

$$K^* = K_S - K_{S,I} K_I^{-1} K_{I,S}, \qquad h^* = h_S - K_{S,I} K_I^{-1} h_I,$$

with $h_S$ and $h_I$ defined as the subvectors of $h$ corresponding to $x^*$ and $x \setminus x^*$ respectively, and

$$g^* = g + \left(\log|2\pi K_I^{-1}| + \|h_I\|^2_{K_I^{-1}}\right)/2.$$
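The marginalization formulas above can be checked numerically. The sketch below is illustrative only: `marginalize` and the example covariance `Sigma` and mean `m` are names and values assumed here. It writes a normalized Gaussian density in canonical form, integrates out one coordinate, and recovers the canonical parameters of the marginal.

```python
import numpy as np

def marginalize(K, h, g, keep):
    """Integrate out every coordinate not listed in `keep`
    from the canonical form C(x; K, h, g)."""
    S = np.asarray(keep)
    I = np.setdiff1d(np.arange(K.shape[0]), S)
    K_S, K_SI, K_I = K[np.ix_(S, S)], K[np.ix_(S, I)], K[np.ix_(I, I)]
    KI_inv_KIS = np.linalg.solve(K_I, K_SI.T)     # K_I^{-1} K_{I,S}
    KI_inv_hI = np.linalg.solve(K_I, h[I])        # K_I^{-1} h_I
    # log|2 pi K_I^{-1}| = -log|K_I / (2 pi)|
    logdet = -np.linalg.slogdet(K_I / (2 * np.pi))[1]
    K_star = K_S - K_SI @ KI_inv_KIS
    h_star = h[S] - K_SI @ KI_inv_hI
    g_star = g + 0.5 * (logdet + h[I] @ KI_inv_hI)
    return K_star, h_star, g_star

# A normalized 3-dimensional Gaussian N(m, Sigma) in canonical form.
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
m = np.array([1.0, -1.0, 0.5])
K = np.linalg.inv(Sigma)
h = K @ m
g = -0.5 * (np.linalg.slogdet(2 * np.pi * Sigma)[1] + m @ K @ m)

keep = [0, 2]                                     # integrate out coordinate 1
K_star, h_star, g_star = marginalize(K, h, g, keep)
```

Because the input is a normalized density, the output should be the canonical form of the normalized marginal $N(m_S, \Sigma_S)$: `inv(K_star)` matches the corresponding block of `Sigma`, and `g_star` matches the normalizing constant of that marginal.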

If the factors of a Gaussian network are non-deterministic, then each belief can be parametrized by its canonical form, and the above equations can be applied to update the cluster and sepset beliefs for BP (Algorithm 1). For cluster $C_i$, let $(K_i, h_i, g_i)$ parametrize its belief $\beta_i$. For sepset $S_{i,j}$, let $(K_{i,j}, h_{i,j}, g_{i,j})$ parametrize its belief $\mu_{i,j}$. Also, for step 1 of BP, let $(K_{i\to j}, h_{i\to j}, g_{i\to j})$ parametrize the message $\tilde\mu_{i\to j}$ sent from $C_i$ to $C_j$. Then BP updates can be expressed as shown in Algorithm 2.

In step 1, $K_S$ and $K_I$ are the submatrices of $K_i$ that correspond to $S_{i,j}$ and $C_i \setminus S_{i,j}$. Similarly, $h_S$ and $h_I$ are subvectors of $h_i$. In step 2, $\mathrm{ext}(K_{i\to j} - K_{i,j})$ extends $K_{i\to j} - K_{i,j}$ to the same scope as $K_j$ by padding it with zero-rows and zero-columns for $C_j \setminus S_{i,j}$. Similarly, $\mathrm{ext}(h_{i\to j} - h_{i,j})$ extends $h_{i\to j} - h_{i,j}$ to scope $C_j$ with zero entries on rows for $C_j \setminus S_{i,j}$.

If the phylogeny is a tree, performing these updates from the tips to the root corresponds to the recursive equations (9),

(10) and (11) of Mitov et al. [2020], and to the propagation formulas (A.3)-(A.8) of Bastide et al. [2021], who both

considered the general linear Gaussian model (2).

At any point, a belief $C(x; K, h, g)$ gives a local estimate of the conditional mean ($K^{-1}h$) and conditional variance ($K^{-1}$) of trait $X$ given data $Y$, for $K \succ 0$. An exact belief, such that $C(x; K, h, g) \propto p_\theta(x \mid Y)$, gives exact conditional estimates, that is: $E(X \mid Y) = K^{-1}h$ and $\mathrm{var}(X \mid Y) = K^{-1}$.


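Algorithm 2 Gaussian belief propagation: from $C_i$ to $C_j$ with sepset $S_{i,j}$.

1: compute message $\tilde\mu_{i\to j}$:

$$K_{i\to j} = K_S - K_{S,I} K_I^{-1} K_{I,S}$$
$$h_{i\to j} = h_S - K_{S,I} K_I^{-1} h_I$$
$$g_{i\to j} = g_i + \left(\log|2\pi K_I^{-1}| + \|h_I\|^2_{K_I^{-1}}\right)/2$$

2: update the cluster belief $\beta_j$ about $C_j$:

$$K_j \leftarrow K_j + \mathrm{ext}(K_{i\to j} - K_{i,j})$$
$$h_j \leftarrow h_j + \mathrm{ext}(h_{i\to j} - h_{i,j})$$
$$g_j \leftarrow g_j + g_{i\to j} - g_{i,j}$$

3: update the edge belief $\mu_{i,j}$ about $S_{i,j}$:

$$K_{i,j} \leftarrow K_{i\to j}$$
$$h_{i,j} \leftarrow h_{i\to j}$$
$$g_{i,j} \leftarrow g_{i\to j}$$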
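The three steps of Algorithm 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: beliefs are stored as tuples `(K, h, g)`, the helper names `marginalize` and `propagate` are hypothetical, and the toy model is a univariate chain $x_1 \to x_2 \to x_3$ with unit variances, split into clusters $(x_1, x_2)$ and $(x_2, x_3)$ with sepset $x_2$.

```python
import numpy as np

def marginalize(K, h, g, S):
    """Step 1: integrate out the coordinates of the cluster not listed in S."""
    I = np.setdiff1d(np.arange(K.shape[0]), S)
    K_SI, K_I = K[np.ix_(S, I)], K[np.ix_(I, I)]
    KinvK = np.linalg.solve(K_I, K_SI.T)                # K_I^{-1} K_{I,S}
    Kinvh = np.linalg.solve(K_I, h[I])                  # K_I^{-1} h_I
    logdet = -np.linalg.slogdet(K_I / (2 * np.pi))[1]   # log|2 pi K_I^{-1}|
    return (K[np.ix_(S, S)] - K_SI @ KinvK,
            h[S] - K_SI @ Kinvh,
            g + 0.5 * (logdet + h[I] @ Kinvh))

def propagate(beta_i, mu_ij, beta_j, S_in_i, S_in_j):
    """Send a message from cluster i to cluster j.
    S_in_i / S_in_j: positions of the sepset variables in each cluster."""
    K_m, h_m, g_m = marginalize(*beta_i, S_in_i)        # step 1
    K_j, h_j, g_j = beta_j[0].copy(), beta_j[1].copy(), beta_j[2]
    K_j[np.ix_(S_in_j, S_in_j)] += K_m - mu_ij[0]       # step 2, ext(...) by indexing
    h_j[S_in_j] += h_m - mu_ij[1]
    g_j = g_j + g_m - mu_ij[2]
    return (K_j, h_j, g_j), (K_m, h_m, g_m)             # step 3: new sepset belief

# Toy model (assumed): x1 ~ N(0,1), x2 | x1 ~ N(x1,1), x3 | x2 ~ N(x2,1).
c = -0.5 * np.log(2 * np.pi)                     # log-normalizer of a unit-variance factor
edge = np.array([[1.0, -1.0], [-1.0, 1.0]])      # precision of N(child; parent, 1)
beta_1 = (np.diag([1.0, 0.0]) + edge, np.zeros(2), 2 * c)   # scope (x1, x2)
beta_2 = (edge.copy(), np.zeros(2), c)                      # scope (x2, x3)
mu_12 = (np.zeros((1, 1)), np.zeros(1), 0.0)                # sepset belief initialized to 1

beta_2, mu_12 = propagate(beta_1, mu_12, beta_2, S_in_i=[1], S_in_j=[0])
Sigma_23 = np.linalg.inv(beta_2[0])              # implied covariance of (x2, x3)
```

After this single message, the belief of the second cluster should be the exact joint density of $(x_2, x_3)$, whose covariance matrix has entries 2, 2, 2, 3. Padding with zeros (the `ext` operation) is implemented here by adding only to the rows and columns of the sepset variables.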

5 Scalable approximate inference with loopy BP

The previous examples focused on clique trees and the exact calculation of the likelihood. We now turn to the use of

cluster graphs with cycles, or loopy cluster graphs, such as in Fig. 2(c) or Fig. 4(c-d). BP on a loopy cluster graph,

abbreviated as loopy BP, can approximate the likelihood and posterior distributions of ancestral values, and can be

more computationally efﬁcient than BP on a clique tree.

Figure 4: (a) Admixture graph $N$ from [Lazaridis et al., 2014, Fig. 3] with $h = 4$ reticulations (hybrid edges are coloured). $N$ has one non-trivial biconnected component (blob), induced by all its internal nodes except for the root. This blob contains all 4 reticulations so $N$ has level $\ell = 4$. (b)-(d) Various cluster graphs for the moralized blob: (b) clique tree, (c) join-graph structuring with the maximum cluster size set to 3, (d) LTRIP using the set of node families in the blob. Here sepsets (not shown) are the intersection of their incident clusters, and are small with 1 node only in (c) and (d). Purple boxes and edges: clusters and sepsets that contain node 8. Red text: hybrid families.

5.1 Calibration

Updating beliefs on a loopy cluster graph uses Algorithm 1 in the same way as on a clique tree. A cluster graph is said to be calibrated when its normalized beliefs have converged (i.e. are unchanged by Algorithm 1 along any edge). For calibration, neighboring clusters $C_i$ and $C_j$ must have beliefs that are marginally consistent over the variables in their sepset $S_{i,j}$:

$$\int \beta_i \, d(C_i \setminus S_{i,j}) = \tilde\mu_{i\to j} \propto \mu_{i,j} \propto \tilde\mu_{j\to i} = \int \beta_j \, d(C_j \setminus S_{i,j}).$$

On a clique tree, calibration can be guaranteed at the end of a finite sequence of messages passed. Clique and sepset beliefs are then proportional to the posterior distribution over their variables, and can be integrated to compute the common normalization constant $\kappa = \kappa_i = \int \beta_i \, dC_i = \kappa_{j,k} = \int \mu_{j,k} \, dS_{j,k}$, which equals the likelihood. For loopy BP, calibration is not guaranteed. If it is attained, then we can similarly view cluster and sepset beliefs as unnormalized approximations of the posterior distribution over their variables, though the $\kappa_i$'s and $\kappa_{j,k}$'s may differ, grow unboundedly, and generally do not equal or estimate the likelihood. Gaussian models enjoy the remarkable property that, if calibration can be attained on a cluster graph, then the approximate posterior means (ancestral values) are guaranteed to be exact. In contrast, the posterior variances are generally inexact, and are typically underestimated Weiss and Freeman [1999], Wainwright et al. [2003], Malioutov et al. [2006], although we found them overestimated in our phylogenetic examples below (Fig. 7).

Successful calibration depends on various aspects such as the features of the loops in the cluster graph, the factors in the

model, and the scheduling of messages. For beliefs to converge, a proper message schedule requires that a message is

passed along every sepset, in each direction, inﬁnitely often (until stopping criteria are met) Malioutov et al. [2006].

Multiple scheduling schemes have been devised to help reach calibration more often and more accurately. These can

be data-independent (e.g. choosing a list of trees nested in the cluster graph that together cover all clusters and edges,

then iteratively traversing each tree in both directions Wainwright et al. [2003]) or adaptive (e.g. prioritizing messages

between clusters that are further from calibration Elidan et al. [2006], Sutton and McCallum [2007], Knoll et al. [2015],

Aksenov et al. [2020]).

5.2 Likelihood approximation

To approximate the log-likelihood $\mathrm{LL}(\theta) = \log \int p_\theta(x)\,dx$ from calibrated beliefs on cluster graph $U^* = (V^*, E^*)$, denoted together as $q = \{\beta_i, \mu_{i,j};\ C_i \in V^*, \{C_i, C_j\} \in E^*\}$, we can use the factored energy functional Koller and Friedman [2009]:

$$\tilde F(p_\theta, q) = \sum_{C_i \in V^*} E_{\beta_i}(\log \psi_i) + \sum_{C_i \in V^*} H(\beta_i) - \sum_{\{C_i, C_j\} \in E^*} H(\mu_{i,j}). \qquad (8)$$

Recall that $\psi_i$ is the product of factors $\phi_v$ assigned to cluster $C_i$. Here $E_{\beta_i}$ denotes the expectation with respect to $\beta_i$ normalized to a probability distribution. $H(\beta_i)$ and $H(\mu_{i,j})$ denote the entropy of the distributions defined by normalizing $\beta_i$ and $\mu_{i,j}$ respectively. $\tilde F(p_\theta, q)$ has the advantage of involving local integrals that can be calculated easily: each over the scope of a single cluster or sepset. The justification for $\tilde F(p_\theta, q)$ comes from two approximations.

First, following the expectation-maximization (EM) decomposition, $\mathrm{LL}(\theta)$ can be approximated by the evidence lower bound (ELBO) used for variational inference Ranganath et al. [2014]. For any distribution $q$ over the full set of variables, which are here the unobserved (latent) variables after absorbing evidence from the data, we have

$$\mathrm{LL}(\theta) \geq \mathrm{ELBO}(p_\theta, q) = E_q(\log p_\theta) + H(q).$$

The gap $\mathrm{LL}(\theta) - \mathrm{ELBO}(p_\theta, q)$ is the Kullback-Leibler divergence between $q$ and $p_\theta$ normalized to the distribution of the unobserved variables conditional on the observed data. The first approximation comes from minimizing this gap over a class of distributions $q$ that does not necessarily include the true conditional distribution. The second approximation comes from pretending that for a given distribution $q$ with a belief factorization

$$q \propto \frac{\prod_{C_i \in V^*} \beta_i}{\prod_{\{C_i, C_j\} \in E^*} \mu_{i,j}},$$

its marginal over a given cluster (or a given sepset) is equal to the normalized belief of that cluster (or sepset), simplifying $E_q(\log \psi_i)$ to $E_{\beta_i}(\log \psi_i)$ and simplifying $E_q(-\log \beta_i)$ to $H(\beta_i)$. This simplification leads to the more tractable $\tilde F(p_\theta, q)$, in which each integral is of lower dimension, within the scope of a single cluster or sepset.
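As a brief worked sketch of the two ingredients above (the symbols $m$, $\Sigma$ and the subscript $\psi$ are notation introduced here for illustration, and the formulas are the standard Gaussian identities rather than expressions quoted from the paper): writing $p_\theta(x \mid Y) = p_\theta(x)\, e^{-\mathrm{LL}(\theta)}$ for the normalized conditional density, the ELBO gap follows from

$$\mathrm{ELBO}(p_\theta, q) = E_q\left[\log \frac{p_\theta(x)}{q(x)}\right] = \mathrm{LL}(\theta) + E_q\left[\log \frac{p_\theta(x \mid Y)}{q(x)}\right] = \mathrm{LL}(\theta) - \mathrm{KL}\left(q \,\|\, p_\theta(\cdot \mid Y)\right).$$

For a Gaussian belief with canonical parameters $(K, h, g)$ and $K \succ 0$, the normalized belief has mean $m = K^{-1}h$ and variance $\Sigma = K^{-1}$, so each entropy term in (8) is

$$H = \tfrac{1}{2} \log\left|2\pi e\, K^{-1}\right|,$$

and if the assigned factors multiply to $\psi_i = C(x; K_\psi, h_\psi, g_\psi)$, then each energy term is

$$E_{\beta_i}(\log \psi_i) = -\tfrac{1}{2}\left(\mathrm{tr}(K_\psi \Sigma) + m^\top K_\psi m\right) + h_\psi^\top m + g_\psi.$$

Every term therefore only requires the inverse and log-determinant of one cluster or sepset precision matrix.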

5.3 Scalability versus accuracy: choice of cluster graph complexity

5.3.1 Scalability, treewidth and phylogenetic network complexity

Figure 5: Two binary networks with a hybrid ladder and $h = \ell = 2$. $N_1$ (left) satisfies (A2) of Proposition 1 and has treewidth $t = 3$. $N_2$ (right) does not meet (A2) (see red/purple annotations: $u_1$ and $u_2$ are adjacent; $u_3$ is a descendant of $u_2$) and has treewidth $t = 2$. Stacking more hybrid ladders in the same way above $a$ and $b$ increases $h$ and $\ell$ but leaves $N_2^m$ outerplanar, keeping $t = 2$.

At the cost of exactness, loopy cluster graphs can offer greater computational scalability than clique trees because they allow for smaller cluster sizes, which reduces the complexity associated with belief updates. For example, consider a Gaussian model for $p$ traits: $\dim(x_v) = p$ at all nodes $v$ in the network. For a clique tree $U$ with $m$ cliques and maximum clique size $k$, passing a message between neighbor cliques has complexity $O(p^3 k^3)$ and calibrating $U$ has complexity $O(m p^3 k^3)$. Now consider a cluster graph $U^*$ with $m^*$ clusters, $O(m^*)$ edges, and maximum cluster size $k^* < k$. Then passing a message between neighbor clusters of $U^*$ has complexity $O(p^3 k^{*3})$ so it is faster than on $U$. But calibrating $U^*$ now requires more belief updates because each edge needs to be traversed more than twice. If each edge is traversed in both directions $b$ times to reach convergence, then calibrating $U^*$ has complexity $O(b\, m^* p^3 k^{*3})$. So $U^*$ has smaller clusters than $U$, and if $(k/k^*)^3 \gg b\, m^*/m$, then loopy BP on $U^*$ runs faster than BP on $U$. Loopy BP could be particularly advantageous for complex networks whose clique trees have large clusters.
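The trade-off between cluster size and number of passes can be made concrete with a small calculation. In this sketch, the function names are hypothetical, constants hidden by the $O(\cdot)$ notation are ignored, $k = 54$ and $k^* = 10$ are the values reported in Fig. 7 for the network from Müller et al. [2022], and the values of `m`, `m_star` and `b` are invented purely for illustration.

```python
def loopy_speedup(m, k, m_star, k_star, b):
    """Ratio of the clique-tree cost m * k^3 to the loopy cost b * m_star * k_star^3.
    The common factor p^3 cancels. A ratio above 1 favors loopy BP."""
    return (m * k**3) / (b * m_star * k_star**3)

def break_even_passes(m, k, m_star, k_star):
    """Largest number of passes b for which loopy BP is no slower,
    under the same simplified cost model."""
    return (m / m_star) * (k / k_star) ** 3

# k and k_star from Fig. 7; m, m_star and b are hypothetical.
ratio = loopy_speedup(m=100, k=54, m_star=300, k_star=10, b=50)
b_max = break_even_passes(m=100, k=54, m_star=300, k_star=10)
```

Under these assumed values, loopy BP would only break even at roughly 52 passes, which illustrates why the gain depends as much on the number of iterations needed for calibration as on the reduction in cluster size.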

Cluster graph construction determines the balance between scalability and approximation quality. At one end of the spectrum, the most scalable and least accurate are the factor graphs, also known as Bethe cluster graphs Yedidia et al. [2005]. A factor graph has one cluster per factor $\phi_v$ and one cluster per variable, and so has the smallest possible maximum cluster size $k^*$, with each sepset reduced to a single variable. Various algorithms have been proposed for constructing cluster graphs along the spectrum (e.g. LTRIP Streicher and du Preez [2017]) (Fig. 4). Notably, join-graph structuring Mateescu et al. [2010] spans the whole spectrum because it is controlled by a user-defined maximum cluster size $k^*$, which can be varied from its smallest possible value to a value large enough to obtain a clique tree.

At the other end of the spectrum, the best maximum clique size is $1 + \mathrm{tw}(G^m)$, where $\mathrm{tw}(G^m)$ is the treewidth of the moralized graph. Loopy BP becomes interesting when $\mathrm{tw}(G^m)$ is large, making exact BP costly. Unfortunately, determining the treewidth of a general graph is NP-hard Arnborg et al. [1987], Bodlaender and Koster [2010]. Heuristics such as greedy minimum-fill or nested dissection Strasser [2017], Hamann and Strasser [2018] can be used to obtain clique trees whose maximum clique size $k$ is near the optimum $1 + \mathrm{tw}(G^m)$.
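To make the greedy minimum-fill heuristic concrete, here is a minimal pure-Python sketch that moralizes a rooted network and returns the size of the largest clique created during elimination, which is an upper bound on $1 + \mathrm{tw}(G^m)$. The input format (a dictionary mapping each node to its list of parents), the function names, and the two small example networks are assumptions made for this illustration. Ties are broken by insertion order, which need not match other implementations.

```python
from itertools import combinations

def moralize(parents):
    """Undirected moral graph from a dict: node -> list of parent nodes."""
    adj = {}
    for v, pa in parents.items():
        adj.setdefault(v, set())
        for u in pa:
            adj.setdefault(u, set())
            adj[v].add(u)
            adj[u].add(v)
        for a, b in combinations(pa, 2):      # connect ("marry") co-parents
            adj[a].add(b)
            adj[b].add(a)
    return adj

def greedy_minfill_max_clique(adj):
    """Upper bound on (1 + treewidth) from a greedy min-fill elimination order."""
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy

    def fill(v):
        # number of edges that eliminating v would add among its neighbors
        return sum(1 for a, b in combinations(adj[v], 2) if b not in adj[a])

    max_clique = 0
    while adj:
        v = min(adj, key=fill)
        nb = adj.pop(v)
        max_clique = max(max_clique, len(nb) + 1)
        for a, b in combinations(nb, 2):          # add fill-in edges
            adj[a].add(b)
            adj[b].add(a)
        for a in nb:
            adj[a].discard(v)
    return max_clique

# A small tree and a small network with one hybrid node v (both hypothetical).
tree = {"r": [], "a": ["r"], "b": ["r"], "c": ["a"]}
level1 = {"r": [], "u1": ["r"], "u2": ["r"], "v": ["u1", "u2"],
          "x": ["u1"], "y": ["u2"], "z": ["v"]}
k_tree = greedy_minfill_max_clique(moralize(tree))
k_level1 = greedy_minfill_max_clique(moralize(level1))
```

For these two examples the heuristic returns 2 and 3, that is, treewidth bounds of 1 for the tree and 2 for the network with a single reticulation.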

Different cluster graph algorithms could potentially be applied to the different biconnected components, or blobs Gusfield et al. [2007] (e.g. LTRIP for one blob, clique tree for another), perhaps based on a blob's attributes that are easy to compute. To choose between loopy versus exact BP, or between different cluster graph constructions more generally, one could use traditional complexity measures of phylogenetic networks as potential predictors of cost-effectiveness. For example, the reticulation number $h$ is straightforward to compute. In a binary network, where all internal non-root nodes have degree 3, $h$ is simply the number of hybrid nodes. More generally $h = |\{\text{hybrid edges}\}| - |\{\text{hybrid nodes}\}|$ Van Iersel et al. [2010]. The level of a network is the maximum reticulation number within a blob Gambette et al. [2009]. The network's level ought to predict treewidth better than $h$ because a graph's treewidth equals the maximum treewidth of its blobs Bodlaender [1998], and moralizing the network does not affect its nodes' blob membership. These phylogenetic complexity measures do not predict treewidth perfectly Scornavacca and Weller [2022] except in simple cases as shown below, proved in SM section B.

Proposition 1. Let $N$ be a binary phylogenetic network with $h$ hybrid nodes, level $\ell$, and let $t$ be the treewidth of the moralized network $N^m$ obtained from $N$. For simplicity, assume that $N$ has no parallel edges and no degree-2 nodes other than the root.

(A0) If $\ell = 0$ then $h = 0$ and $t = 1$.

(A1) If $\ell = 1$ then $h \geq 1$ and $t = 2$.

(A2) Let $v_1$ be a hybrid node with non-adjacent parents $u_1, u_2$. If $v_1$ has a descendant hybrid node $v_2$ such that one of its parents is not a descendant of either $u_1$ or $u_2$, then $\ell \geq 2$ and $t \geq 3$.
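The reticulation number $h = |\{\text{hybrid edges}\}| - |\{\text{hybrid nodes}\}|$ used above is simple to compute from a parent list. The sketch below assumes a network stored as a dictionary mapping each node to its list of parents. The function name and the two example networks are hypothetical.

```python
def reticulation_number(parents):
    """h = (number of hybrid edges) - (number of hybrid nodes),
    where a hybrid node is any node with 2 or more parents and a
    hybrid edge is any edge entering a hybrid node."""
    hybrid_nodes = [v for v, pa in parents.items() if len(pa) >= 2]
    hybrid_edges = sum(len(parents[v]) for v in hybrid_nodes)
    return hybrid_edges - len(hybrid_nodes)

# Binary network with one hybrid node v: h equals the number of hybrid nodes.
binary_net = {"r": [], "u1": ["r"], "u2": ["r"], "v": ["u1", "u2"], "z": ["v"]}
# Non-binary network: hybrid node w has 3 parents and counts for 2 reticulations.
nonbinary_net = {"r": [], "a": ["r"], "b": ["r"], "c": ["r"], "w": ["a", "b", "c"]}

h_binary = reticulation_number(binary_net)
h_nonbinary = reticulation_number(nonbinary_net)
```

Computing the level would additionally require the biconnected components of the network, applying the same count within each blob and taking the maximum.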

Level-1 networks have received much attention in phylogenetics because they are identifiable under various models under some mild restrictions Solís-Lemus and Ané [2016], Baños [2019], Gross et al. [2021], Xu and Ané [2023]. Several inference methods limit the search to level-1 networks Solís-Lemus and Ané [2016], Oldman et al. [2016], Allman et al. [2019], Kong et al. [2022]. Since moralized level-1 networks have treewidth 2, exact BP is guaranteed to be efficient on them.


[Figure 6 panels. Horizontal axes: $h$: number of hybrids (A), level (B). Vertical axis: max cluster size upper bound. Legend: $n$: number of tips; Network: empirical, simulated.]

Figure 6: We observe a positive sublinear relationship between a maximum clique size upper bound (from the greedy minfill heuristic) and the number of hybrids (A) or network level (B) on a combined sample of 11 empirical networks and 2509 simulated birth-death-hybridization networks. The empirical networks were sampled from [Maier et al., 2023, Figs. 3a-c (left), 4a-c (left)] (reported as estimated by Bergström et al. [2020], Librado et al. [2021], Hajdinjak et al. [2021], Lipson et al. [2020], Wang et al. [2021], Sikora et al. [2019]), [Lazaridis et al., 2014, Fig. 3], [Nielsen et al., 2023, Fig. 3 (left)], [Sun et al., 2023, Fig. 4c], [Müller et al., 2022, Fig. 1a], [Neureiter et al., 2022, Fig. 5a]; fit by these authors using ADMIXTOOLS Patterson et al. [2012], Maier et al. [2023], admixturegraph Leppälä et al. [2017], OrientAGraph Molloy et al. [2021], contacTrees Neureiter et al. [2022], Recombination Müller et al. [2022], AdmixtureBayes Nielsen et al. [2023]. The simulated networks were obtained by subsampling 10 networks per parameter scenario simulated by Justison and Heath [2024], then filtering out networks of treewidth 1 (trees, possibly with parallel hybrid edges).

Beyond level-1, a network has a hybrid ladder (also called stack Semple and Simpson [2018]) if a hybrid node has a hybrid child node. By Proposition 1, a hybrid ladder has the potential to increase treewidth of the moralized network and decrease BP scalability, if the remaining conditions in (A2) are met. Related results in Chaplick et al. [2023] are for undirected graphs that do not require prior moralization, and contain ladders defined as regular $2 \times L$ grids. Their Observation 1, that a graph containing a non-disconnecting grid ladder of length $L \geq 2$ has treewidth at least 3, relies on a similar argument as for (A2). However, structures leading to the conditions in (A2) are more general, even before moralization. It may be interesting to extend some of the results from Chaplick et al. [2023] to moralized hybrid ladders in rooted networks.

In Fig. 5 (right), $N_2$ has a hybrid ladder that does not meet all conditions of (A2), and has $t = 2$. Generally, outerplanar networks have treewidth at most 2 Bodlaender [1998], and if bicombining (hybrid nodes have exactly 2 parents), remain outerplanar after moralization. Networks in which no hybrid node is the descendant of another hybrid node in the same blob are called galled networks Huson et al. [2010]. They provide more tractability to solve the cluster containment problem Huson et al. [2009]. Galled networks never meet the assumptions of (A2), so it would be interesting to study their treewidth after moralization.

We performed an empirical investigation of how $h$ and $\ell$ can predict the treewidth $t$ of the moralized network. Fig. 6 shows that $t$ correlates with $h$ and $\ell$, on networks estimated from real data using various inference methods and on networks simulated under the biologically realistic birth-death-hybridization model Justison and Heath [2024], Justison et al. [2023], especially for complex networks. For networks with hundreds of tips (Thorson et al. [2023] lists several studies of this size), large maximum clique sizes $k \geq 30$ are not uncommon. In contrast, a Bethe cluster graph would have maximum cluster size $k^* = 3$, so that $(k/k^*)^3 \geq 10^3$ would provide a large computational gain for loopy BP to be considered.

5.3.2 Approximation quality with loopy BP

We simulated data on a complex graph (40 tips, 361 hybrids) [Müller et al., 2022, Fig. 1a] and a simpler graph (12 tips, 12 hybrids) [Lipson et al., 2020, Extended Data Fig. 4], then compared estimates from exact and loopy BP. For both networks, edges of length 0 were assigned the minimum non-zero edge length after suppressing any non-root degree-2 nodes. Trait values $x = (x_1, \ldots, x_n)$ at the tips were simulated from a BM with rate $\sigma^2 = 1$ and $x_\rho = 0$ at the root.

Figure 7 shows the exact and approximate log-likelihood and conditional mean and variance of $x_\rho$ assuming a BM with rate $\sigma^2 = 1$ but improper prior $x_\rho \sim N(0, \infty)$, using a greedy minimum-fill clique tree $U$ and a cluster graph $U^*$. Using a factor graph, calibration failed for the complex network (SM section C, Fig. S2), so we used join-graph structuring to build $U^*$. $U$ can be calibrated in one iteration and the calculated quantities are exact (horizontal lines). In contrast, $U^*$ requires multiple passes and gives approximations. Calibration required more iterations on the complex network ($h = 361$) than on the simpler network ($h = 12$), as expected. But for both networks, the factored energy (8) approximated the log-likelihood very well. The distribution of the root state $x_\rho$ conditional on the data seems more difficult to approximate. The conditional mean was correctly estimated but required more iterations than the log-likelihood approximation on the complex network. The conditional variance was severely overestimated on the complex network and very slightly overestimated on the simpler network. As desired, the average computing time per belief update was lower on $U^*$, although modestly so due to the clique tree $U$ having many small clusters of size similar to those in $U^*$ (Fig. S3).

[Figure 7 panels. Rows: Lipson et al. (2020b): $n = 12$, $\ell = 12$, $h = 12$, $k = 6$, $k^* = 3$; Müller et al. (2022): $n = 40$, $\ell = 358$, $h = 361$, $k = 54$, $k^* = 10$. Columns: $E(X_\rho \mid \text{data})$, $\mathrm{Var}(X_\rho \mid \text{data})$, factored energy. Horizontal axis: number of iterations. Legend: clique tree; join-graph str, R1; join-graph str, R2.]

Figure 7: Accuracy of loopy BP. Approximation of the conditional distribution of the root state $X_\rho$ (left and center) and log-likelihood (right) using a greedy minimum-fill clique tree $U$ and a join-graph structuring cluster graph $U^*$ for two networks of varying complexity Müller et al. [2022], Lipson et al. [2020] as measured by their number of tips ($n$), level ($\ell$), number of hybrids ($h$), maximum clique size ($k$), and maximum cluster size ($k^*$). For $U$, estimates are exact after one iteration and shown as horizontal red lines. For $U^*$, estimates are shown over 20 (first row), 50 or 200 (second row) iterations. Each iteration consists of two passes through each spanning tree in a minimal set that jointly covers $U^*$. In each plot, the two curves correspond to two different regularizations of initial beliefs (SM section E, dotted: algorithm R1, solid: algorithm R2).

6 Leveraging BP for efﬁcient parameter inference

6.1 BP for fast likelihood computation

In some particularly simple models, such as the BM on a tree, fast algorithms such as IC Felsenstein [1985] or phylolm Ho and Ané [2014] can directly calculate the best-fitting parameters that maximize the restricted likelihood (REML), in

a single tree traversal avoiding numerical optimization. For more general models, such closed-form estimates are not

available. One product of BP is the likelihood of any ﬁxed set of model parameters. BP can hence be simply used as a

fast algorithm for likelihood computation, which can then be exploited by any statistical estimation technique, in a

Bayesian or frequentist framework. Most of the tools cited in section 2.3 use either direct numerical optimization of the

likelihood Mitov et al. [2019], Boyko et al. [2023], Bartoszek et al. [2023] or sampling techniques such as Markov

Chain Monte Carlo (MCMC) Pybus et al. [2012], FitzJohn [2012] for parameter inference.

BP also outputs the trait distribution at internal, unobserved nodes conditioned on the observed data at the tips. In

addition to providing a tool for efﬁcient ancestral state reconstruction, these conditional means and variances can be

used for parameter inference, with approaches based on latent variable models such as Expectation Maximization (EM)

Bastide et al. [2018a], or Gibbs sampling schemes Cybis et al. [2015]. Although not currently used in the ﬁeld of


evolutionary biology to our knowledge, approaches based on approximate EM algorithms Heskes et al. [2003] and

relying on loopy BP could also be used.

6.2 BP for fast gradient computation

As we show below, the conditional means and variances at ancestral nodes can be used to efﬁciently compute the

gradient of the likelihood Salakhutdinov et al. [2003]. The gradient of the likelihood can help speed up inference in many

different statistical frameworks Barber [2012]. In a phylogenetic context, gradients have been used to improve maximum

likelihood estimation Ji et al. [2020], Bayesian estimation through Hamiltonian Monte Carlo (HMC) approaches Zhang

et al. [2021], Fisher et al. [2021], Bastide et al. [2021], or variational Bayes approximations Fourment and Darling

[2019]. Although automatic differentiation can be used on trees for some models Swanepoel et al. [2022], direct

computations of the gradient using BP-like algorithms have been shown to be more efﬁcient in some contexts Fourment

et al. [2023]. After recalling Fisher’s identity to calculate gradients after BP calibration, we illustrate its use on the

BM model (univariate or multivariate) where it allows for the derivation of a new analytical formula for the REML

parameter estimates.

6.2.1 Gradient Computation with Fisher’s Identity

In a phylogenetic context, latent variables are usually internal nodes, while observed variables are leaves. We write $Y = \{X_{v,j} : \text{trait } j \text{ observed at } v \in V\}$ for the set of observed variables. Fisher's identity provides a way to link the gradient of the log-likelihood of the data $\mathrm{LL}(\theta) = \log p_\theta(Y)$ at parameter $\theta$, with the distribution of all the variables conditional on the observations $Y$. We refer to [Cappé et al., 2005, chap. 10] or [Barber, 2012, chap. 11] for general introductions on Markov models with latent variables. Under broad assumptions, Fisher's identity states (see Proposition 10.1.6 in Cappé et al. [2005], or Section 11.6 in Barber [2012]):

$$\nabla_{\theta'}\left[\log p_{\theta'}(Y)\right]\big|_{\theta'=\theta} = E_\theta\left[\nabla_{\theta'}\left[\log p_{\theta'}(X_v; v \in V)\right]\big|_{\theta'=\theta} \,\Big|\, Y\right],$$

where $\nabla_{\theta'}[f(\theta')]|_{\theta'=\theta}$ denotes the gradient of $f$ with respect to the generic parameters $\theta'$ and evaluated at $\theta' = \theta$, and $E_\theta[\,\bullet \mid Y]$ the expectation conditional on the observed data under the model parametrized by $\theta$, which is precisely where the output from BP can be used. Plugging in the factor decomposition from the graphical model (4) we get:

$$\nabla_{\theta'}\left[\log p_{\theta'}(Y)\right]\big|_{\theta'=\theta} = \sum_{v \in V} E_\theta\left[\nabla_{\theta'}\left[\log \phi_v(X_v \mid X_u, \theta'; u \in \mathrm{pa}(v))\right]\big|_{\theta'=\theta} \,\Big|\, Y\right]. \qquad (9)$$

While (9) applies to the full vector of all model parameters, it can also be applied to take the gradient with respect to a single parameter of interest, keeping the other parameters fixed. For instance, we can focus on one rate matrix of a BM model, or one primary optimum of an OU model. Special care needs to be taken for gradients with respect to structured matrices, such as variance matrices that need to be symmetric (see e.g. Bastide et al. [2021]) or with a sparse inverse under structural equation modeling for high-dimensional traits Thorson and van der Bijl [2023].

For models where the conditional expectation of the factor in

(9)

has a simple form, this formula is the key to an

efﬁcient gradient computation. In particular, for discrete traits as in Example 2, the expectation becomes a sum of a

manageable number of terms, local to a cluster, weighted by the normalized cluster belief after calibration [Koller and

Friedman, 2009, ch. 19].
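Fisher's identity (9) can be checked on the smallest possible Gaussian example. The sketch below is illustrative only and rests on assumptions stated here: a single latent root $X \sim N(\mu, 1)$, a single observed leaf $Y \mid X \sim N(X, t)$, and hypothetical function names. The only factor that depends on $\mu$ is the root factor, whose log-gradient is $X - \mu$, so the identity reduces to $E(X \mid Y) - \mu$, with the conditional mean read off from the canonical form of the belief about $X$.

```python
import math

def loglik(mu, y, t):
    """Exact log-likelihood: marginally, Y ~ N(mu, 1 + t)."""
    var = 1.0 + t
    return -0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)

def fisher_gradient(mu, y, t):
    """Gradient of loglik with respect to mu via Fisher's identity."""
    K = 1.0 + 1.0 / t          # precision of the belief about X after absorbing y
    h = mu + y / t             # potential vector of that belief
    posterior_mean = h / K     # E(X | Y = y) = K^{-1} h
    return posterior_mean - mu # E[ d/dmu log N(X; mu, 1) | Y ] = E(X | Y) - mu

def finite_difference_gradient(mu, y, t, eps=1e-6):
    """Central finite-difference approximation, used as an independent check."""
    return (loglik(mu + eps, y, t) - loglik(mu - eps, y, t)) / (2 * eps)

mu, y, t = 0.3, 1.7, 0.5       # hypothetical parameter, observation, branch length
grad_fisher = fisher_gradient(mu, y, t)
grad_numeric = finite_difference_gradient(mu, y, t)
```

Both gradients should equal $(y - \mu)/(1 + t)$. On a network, the same computation applies factor by factor, with each conditional moment taken from a calibrated cluster belief.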

6.2.2 Gradient computation for linear Gaussian models

For linear Gaussian models

(2)

, log-factors can be written as quadratic forms

(6)

, so their derivatives have analytical

formulas (see SM section D). The conditional expectation in

(9)

then only depends on the joint ﬁrst and second order

moments of the variables

(Xv, Xpa(v))

in a cluster, which are known as soon as the beliefs are calibrated. When the

graph is a tree, Bastide et al. [2021] exploited this formula to derive gradients in the general linear Gaussian case.

However, they did not use the complete factor decomposition

(4)

, but instead an ad-hoc decomposition that only works

when the graph is a (binary) tree, and exploits the split partitions deﬁned by the tree. In contrast, the present approach

gives a recipe for the efﬁcient gradient computation of any linear Gaussian model on any network, as soon as beliefs are

calibrated.

In the special case where the process is a homogeneous BM (univariate or multivariate) on a network with a weighted-average merging rule (3), a constant rate, no missing data at the tips, and, if present, within-species variation that is proportional to this rate, then the gradient with respect to the rate takes a particularly simple form. Setting this gradient to zero, we find an analytical formula for the REML estimate of the rate and for the ML estimate of the ancestral mean $\mu_\rho$ (SM section D.3). In a phylogenetic regression setting, a similar formula can be found for the ML estimate of coefficients (SM section D.4). Efficient algorithms such as IC and phylolm already exist to compute these quantities on a tree, in a single traversal. Here, our formulas need two traversals but remain linear in the number of tips, and because they rely on a general BP formulation, they apply to networks with reticulations. Fisher's identity and BP hence offer a general method for gradient computation, and could lead to analytical formulas for other simple models. Such efficient formulas could alleviate numerical instabilities observed in software such as mvSLOUCH, which experienced a significant failure rate for the BM on trees with a large number of traits Bartoszek et al. [2024].

6.2.3 Hessian computation with Louis’s identity

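Using similar techniques, the Hessian of the log-likelihood with respect to the parameters can also be obtained as a conditional expectation of the Hessian of the complete log-likelihood:

$$\left\{\nabla^2_{\theta'}\left[\log p_{\theta'}(Y)\right] + \nabla_{\theta'}\left[\log p_{\theta'}(Y)\right]\left[\nabla_{\theta'}\left[\log p_{\theta'}(Y)\right]\right]^\top\right\}\Big|_{\theta'=\theta} = E_\theta\left[\left\{\nabla^2_{\theta'}\left[\log p_{\theta'}(X_v; v \in V)\right] + \nabla_{\theta'}\left[\log p_{\theta'}(X_v; v \in V)\right]\left[\nabla_{\theta'}\left[\log p_{\theta'}(X_v; v \in V)\right]\right]^\top\right\}\Big|_{\theta'=\theta} \,\Big|\, Y\right].$$

This so-called Louis identity Cappé et al. [2005] also simplifies under the factor decomposition (4), and leads to tractable formulas in simple Gaussian or discrete cases.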

6.3 BP for direct Bayesian parameter inference

Likelihood or gradient-based approaches require careful analytical computations to get exact formulas in any new

model within the class of linear Gaussian graphical models, depending on the parameters of interest Bastide et al.

[2021]. One way to alleviate this problem is to use a Bayesian framework, and expand the graphical model to include

both the phylogenetic network and the evolutionary parameters, which are seen as random variables themselves, as e.g., in Höhna et al. [2014]. Then, inferring parameters amounts to learning their conditional distribution in this larger

graphical model. In this setting, the output of interest from BP is not the likelihood but the distribution of random

variables (evolutionary parameters primarily) conditional on the observed data.

Exact computation may not be possible in this extended graphical model, because it is typically not linear Gaussian

and the graph’s treewidth can be much larger than that of the phylogenetic network, when one parameter (e.g. the

evolutionary rate) affects multiple node families. Therefore, approximations may need to be used. For example,

“black box” optimization techniques rely on variational approaches to reach a tractable approximation of the posterior

distribution of model parameters Ranganath et al. [2014]. The conditional distribution of unobserved variables, provided

by BP, facilitates the noisy approximation of the variational gradient that can be used to speed up the optimization of

the variational Bayes approximation.

7 Challenges and Extensions

7.1 Degeneracy

While our implementation provides a proof-of-concept, various technical challenges still need to be solved. Much of

the literature on BP focuses on factor graphs, which failed to converge for one of our example phylogenetic networks.

More work is needed to better understand the convergence and accuracy of alternative cluster graphs, and on other

choices that can substantially affect loopy BP’s efﬁciency, such as scheduling. Below, we focus on implementation

challenges due to degeneracies.

For the message $\tilde\mu_{i\to j}$ to be well-defined in step 1 of Gaussian BP, the belief of the sending cluster must have a precision matrix $K$ in (6) with a full-rank submatrix with respect to the variables to be integrated out ($K_I$ in Algorithm 2). This condition can fail under realistic phylogenetic models, due to two different types of degeneracy.

The first type arises from deterministic factors, when $V_v = 0$ in (2) and $X_v$ is determined by the states at parent nodes $X_{\mathrm{pa}(v)}$ without noise, e.g. when all of $v$'s parent branches have length 0 in standard phylogenetic models. This is expected at hybridization events when both parents have sampled descendants in the phylogeny, because the parents and the hybrid need to be contemporaneous. This situation is also common in admixture graphs Maier et al. [2023] due to a lack of identifiability of hybrid edge lengths from $f$-statistics, leading to a "zipped-up" estimated network in which the estimable composite length parameter is assigned to the hybrid's child edge Xu and Ané [2023]. With this degeneracy, $X_v$ has infinite precision given its parents, that is, its conditional precision matrix has some infinite values. The complications are technical, but not numerical. For example, one can use a generalized canonical form that includes a Dirac distribution to capture the deterministic equation of $X_v$ given $X_{\mathrm{pa}(v)}$ from (2). Then BP operations need to be extended to these generalized canonical forms, as done in Schoeman et al. [2022] (illustrated in SM section F). One could also modify the network by contracting internal tree edges of length 0. At hybrid nodes, adding a small variance to $V_v$ would be an approximate yet biologically realistic approach.
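To make the deterministic case concrete, here is a small sketch for a scalar trait. It assumes a Brownian motion with variance rate `sigma2` in which the hybrid's value is the inheritance-weighted average of its parents' values after evolution along each parent edge, so that the conditional variance is `sigma2 * sum(gamma^2 * length)`. This formula, the function names and the `jitter` argument are assumptions made for illustration only.

```python
def hybrid_conditional_variance(gammas, lengths, sigma2=1.0, jitter=0.0):
    """Variance of a hybrid's trait given its parents, under the assumed
    weighted-average Brownian motion model: sigma2 * sum(gamma^2 * length).
    `jitter` is an optional small variance added to avoid exact degeneracy.
    """
    if abs(sum(gammas) - 1.0) > 1e-9:
        raise ValueError("inheritance weights should sum to 1")
    return sigma2 * sum(g * g * l for g, l in zip(gammas, lengths)) + jitter


def conditional_precision(variance):
    """Return 1/variance, or infinity for a deterministic factor."""
    return float("inf") if variance == 0.0 else 1.0 / variance


# Both parent edges have length 0: the hybrid is a deterministic function
# of its parents and the precision is infinite.
v_exact = hybrid_conditional_variance([0.3, 0.7], [0.0, 0.0])
# A small added variance gives a large but finite precision.
v_jitter = hybrid_conditional_variance([0.3, 0.7], [0.0, 0.0], jitter=1e-6)
```

The jitter trades exactness for a factor that fits the standard canonical form; how small it can be before causing numerical trouble is not addressed by this sketch.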

The second type of degeneracy arises when the precision submatrix $K_I$ is finite but not of full rank. In phylogenetic models, this is frequent at initialization (5). For example, consider a cluster of 3 nodes: a hybrid $v$ and its 2 parents. By (7) we see that $\mathrm{rank}(K_v) \le p$. So at initialization with belief $\phi_v$, $K_I$ is degenerate if we seek to integrate out $|I| = 2$ nodes, because $K_I$ then has dimension $2p$ but rank at most $p$. This would occur if the cluster is adjacent to a sepset containing only one parent of $v$. This situation is typical of factor graphs. Initial beliefs would also be degenerate with $K = 0$ for any cluster that is not assigned any factor by (5). This may occur if there are more clusters than node families, or if the graph has nested redundant clusters (e.g. from join-graph structuring). In some cases, a schedule may avoid these degeneracies, guaranteeing a well-defined message at each BP update. On a clique tree, a schedule based on a postorder traversal has this guarantee, provided that all $p$ traits are observed at all leaves or that a trait at a node is removed from scope if it is unobserved at all of that node's descendants. But generally, it is unclear how to find such a schedule. Another approach is to simply skip a BP update if its message is ill-defined, though there is no guarantee that the sending cluster will eventually have a well-behaved belief to pass the message later. A robust option is to regularize cluster beliefs, right after initialization (5) or during BP, by increasing some diagonal elements of $K$ to make $K_I$ of full rank. To maintain the probability model, this cluster belief regularization is balanced by a similar modification to a corresponding sepset. SM section E describes two such approaches that appear to work well in practice, although theoretical guarantees have not been established.
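The balancing idea can be sketched on the hybrid example, assuming a scalar trait ($p = 1$), a cluster holding a hybrid and its two parents, and two adjacent sepsets that each contain one parent. The helper names and the value of `eps` are hypothetical, and this toy scheme is not claimed to match either of the approaches in SM section E. Adding `eps` to a diagonal entry of the cluster precision and to the matching sepset precision leaves the ratio of cluster to sepset beliefs unchanged, while making the submatrix needed for the other message invertible.

```python
def family_precision(gammas, variance):
    """Precision matrix of exp(-(x_v - sum_e gamma_e x_pa(e))^2 / (2 variance)),
    with variables ordered (hybrid v, parent 1, parent 2). It has rank 1."""
    a = [1.0] + [-g for g in gammas]
    return [[ai * aj / variance for aj in a] for ai in a]


def det2(K, i, j):
    """Determinant of the 2x2 submatrix of K on variables i and j."""
    return K[i][i] * K[j][j] - K[i][j] * K[j][i]


def regularize(K_cluster, k_sepset, shared_index, eps):
    """Add eps to the cluster's diagonal entry for the shared variable and
    to the (scalar) sepset precision, so that the ratio of cluster to
    sepset beliefs, hence the represented distribution, is unchanged."""
    K_new = [row[:] for row in K_cluster]
    K_new[shared_index][shared_index] += eps
    return K_new, k_sepset + eps


K = family_precision([0.5, 0.5], variance=1.0)
# The message toward the sepset {parent 2} integrates out I = {v, parent 1}.
det_before = det2(K, 0, 1)  # K_I is singular at initialization
# Regularize along the other sepset, {parent 1}, whose initial precision is 0.
K_reg, k_sep = regularize(K, 0.0, shared_index=1, eps=0.1)
det_after = det2(K_reg, 0, 1)  # equals eps / variance, so K_I is invertible
```

The regularized entry belongs to the variable shared with one sepset, and it is the message sent across the other sepset that becomes well-defined.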

7.2 Loopy BP is promising for discrete traits

We focused on Gaussian models in this paper, for which the 'lazy' matrix approach is polynomial. For discrete trait models, the computational gains from loopy BP can be much greater, because alternative approaches are not polynomial on general networks. For a trait with $c$ states ($c = 2$ for a binary trait as in Example 2), passing a message has complexity $O(c^k)$ where $k$ is the sending cluster size. Thus, cluster graphs with small clusters can bring exponential computational gains. Even exact BP can bring significant computational gains to existing approaches that rely on other means to reduce complexity. For example, the model without ILS used in Lutteropp et al. [2022], Allen-Savietta [2020] is a mixture model, so the network likelihood can be calculated as a weighted average of tree likelihoods for which exact BP takes linear time. This approach scales exponentially with $h$ because there are typically $O(2^h)$ trees displayed in a network. In contrast, the complexity of BP on a clique tree of maximum clique size $k$ is $O(n c^k)$, thus parametrized by the treewidth $t$ of the moralized network instead of $h$ ($t = k - 1$ for an optimal clique tree). Given our empirical evidence that $t$ grows more slowly than $h$ or the network's level $\ell$ in biologically-realistic networks (Fig. 6), exact BP could achieve significant computational gains and loopy BP substantially more.
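The two cost regimes can be illustrated with a short sketch. It assumes a cluster belief stored as a table over $k$ discrete variables with $c$ states each, and it compares operation counts up to constants, taking $h$ to be the number of reticulations and $n$ the number of nodes. The function names are hypothetical and the counts are orders of magnitude, not benchmarks.

```python
from itertools import product


def discrete_message(belief, k, c, keep):
    """Sum a cluster belief table over the variables not listed in `keep`.

    `belief` maps each k-tuple of states (0..c-1) to a nonnegative value.
    Returns the message table and the number of entries visited (c**k).
    """
    message = {}
    visited = 0
    for states in product(range(c), repeat=k):
        visited += 1
        key = tuple(states[i] for i in keep)
        message[key] = message.get(key, 0.0) + belief[states]
    return message, visited


def displayed_tree_cost(n, h):
    """Order of the cost of averaging over 2**h displayed trees, linear in n each."""
    return n * 2 ** h


def clique_tree_cost(n, c, k):
    """Order of the cost of exact BP on a clique tree with maximum clique size k."""
    return n * c ** k


# Binary trait, cluster of 3 variables, message about the first variable.
uniform = {s: 1.0 for s in product(range(2), repeat=3)}
msg, n_visited = discrete_message(uniform, k=3, c=2, keep=[0])

# 4 states, 50 nodes, 10 reticulations, maximum clique size 4 (treewidth 3).
cost_mixture = displayed_tree_cost(50, 10)
cost_bp = clique_tree_cost(50, 4, 4)
```

Which approach is cheaper depends on how $k$ compares with $h$ for the network at hand; the example values above are arbitrary.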

A BP approach is already used in momi2 by Kamm et al. [2020], who use a clique tree built from a node ordering by age from youngest to oldest, to get conditional likelihoods of the derived allele count under a Moran model (without mutation). The mutation-with-ILS model in SnappNet can also be reframed as a graphical model on a graph expanded from the phylogenetic network (as shown in Example 3 and SM section A). Accordingly, the BP-like algorithm in Rabier et al. [2021] has complexity controlled by the network's scanwidth, a parameter introduced by Berry et al. [2020]. Using regular BP on clique trees closer to optimal, and loopy BP on cluster graphs, may help speed up computations even more.

Also related is the algorithm in Scornavacca and Weller [2022], who use a clique tree to solve a parsimony problem. In

this non-probabilistic setting, it is unclear how cluster graphs could be leveraged to speed up algorithms as they do in

loopy BP.

To deal with computational intractability, the most widely-used probabilistic methods to infer networks from DNA sequences are based on composite likelihoods Solís-Lemus and Ané [2016], Yu and Nakhleh [2015] or summary statistics like $f$-statistics Maier et al. [2023], Nielsen et al. [2023], leading to a lack of identifiability for parts of the network topology and some of its parameters Solís-Lemus and Ané [2016], Baños [2019], Xu and Ané [2023], Ané et al. [2024], Allman et al. [2024], Rhodes et al. [2024]. These identifiability issues should be alleviated if using the full data becomes tractable thanks to exact or loopy BP.

Supplementary Material

Technical derivations are available in the Supplementary Material (SM). Code to reproduce Figures 4a, 6 and 7 is available at https://github.com/bstkj/graphicalmodels_for_phylogenetics_code. A Julia package for Gaussian BP on phylogenetic networks is available at https://github.com/cecileane/PhyloGaussianBeliefProp.jl.

Acknowledgements

This work was supported in part by the National Science Foundation (DMS 2023239 to C.A.) and by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. C.A. visited P.B. at the University of Montpellier thanks to support from the I-SITE MUSE through the Key Initiative "Data and Life Sciences".

References

D. C. Adams and M. L. Collyer. Multivariate phylogenetic comparative methods: Evaluations, comparisons, and

recommendations. Systematic Biology, 67(1):14–31, 2017. doi: 10.1093/sysbio/syx055.

V. Aksenov, D. Alistarh, and J. H. Korhonen. Scalable belief propagation via relaxed scheduling. Advances in Neural Information Processing Systems, 33:22361–22372, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/fdb2c3bab9d0701c4a050a4d8d782c7f-Paper.pdf.

C. Allen-Savietta. Estimating Phylogenetic Networks from Concatenated Sequence Alignments. PhD thesis, University of Wisconsin-Madison, 2020. URL https://ezproxy.library.wisc.edu/login?url=https://www.proquest.com/dissertations-theses/estimating-phylogenetic-networks-concatenated/docview/2476856270/se-2.

E. S. Allman, H. Baños, and J. A. Rhodes. NANUQ: a method for inferring species networks from gene trees under the coalescent model. Algorithms for Molecular Biology, 14:24, 2019. doi: 10.1186/s13015-019-0159-2.

E. S. Allman, H. Baños, M. Garrote-Lopez, and J. A. Rhodes. Identifiability of level-1 species networks from gene tree quartets. arXiv, 2024. doi: 10.48550/arXiv.2401.06290.

C. Ané, J. Fogg, E. S. Allman, H. Baños, and J. A. Rhodes. Anomalous networks under the multispecies coalescent: theory and prevalence. Journal of Mathematical Biology, 88:29, 2024. doi: 10.1007/s00285-024-02050-7.

S. Arnborg, D. G. Corneil, and A. Proskurowski. Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic Discrete Methods, 8(2):277–284, 1987. doi: 10.1137/0608024.

J. C. Avise and T. J. Robinson. Hemiplasy: A new term in the lexicon of phylogenetics. Systematic Biology, 57(3):

503–507, 2008. doi: 10.1080/10635150802164587.

H. Baños. Identifying species network features from gene tree quartets. Bulletin of Mathematical Biology, 81(2):494–534, 2019. doi: 10.1007/s11538-018-0485-4.

D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012. doi: 10.1017/

CBO9780511804779. URL http://www.cs.ucl.ac.uk/staff/d.barber/brml/.

K. Bartoszek, J. Pienaar, P. Mostad, S. Andersson, and T. F. Hansen. A phylogenetic comparative method for

studying multivariate adaptation. Journal of Theoretical Biology, 314:204–215, 2012. ISSN 0022-5193. doi:

10.1016/j.jtbi.2012.08.005.

K. Bartoszek, S. Glémin, I. Kaj, and M. Lascoux. Using the Ornstein-Uhlenbeck process to model the evolution of interacting populations. Journal of Theoretical Biology, 429:35–45, 2017. doi: 10.1016/j.jtbi.2017.06.011.

K. Bartoszek, J. F. Gonzalez, V. Mitov, J. Pienaar, M. Piwczyński, R. Puchałka, K. Spalik, and K. L. Voje. Model selection performance in phylogenetic comparative methods under multivariate Ornstein-Uhlenbeck models of trait evolution. Systematic Biology, 72(2):275–293, 2023. doi: 10.1093/sysbio/syac079.

K. Bartoszek, J. Fuentes-González, V. Mitov, J. Pienaar, M. Piwczyński, R. Puchałka, K. Spalik, and K. L. Voje. Analytical advances alleviate model misspecification in non-Brownian multivariate comparative methods. Evolution, 78(3):389–400, Mar. 2024. ISSN 0014-3820. doi: 10.1093/evolut/qpad185. URL https://doi.org/10.1093/evolut/qpad185.

P. Bastide, C. Ané, S. Robin, and M. Mariadassou. Inference of Adaptive Shifts for Multivariate Correlated Traits. Systematic Biology, 67(4):662–680, July 2018a. ISSN 1063-5157. doi: 10.1093/sysbio/syy005.

P. Bastide, C. Solís-Lemus, R. Kriebel, K. William Sparks, and C. Ané. Phylogenetic comparative methods on phylogenetic networks with reticulations. Systematic Biology, 67(5):800–820, 2018b. doi: 10.1093/sysbio/syy033.

P. Bastide, L. S. T. Ho, G. Baele, P. Lemey, and M. A. Suchard. Efficient Bayesian inference of general Gaussian models on large phylogenetic trees. The Annals of Applied Statistics, 15(2):971–997, 2021. doi: 10.1214/20-AOAS1419.

A. Bergström, L. Frantz, R. Schmidt, E. Ersmark, O. Lebrasseur, L. Girdland-Flink, A. T. Lin, J. Storå, K.-G. Sjögren, D. Anthony, et al. Origins and genetic legacy of prehistoric dogs. Science, 370(6516):557–564, 2020. doi: 10.1126/science.aba9572.

V. Berry, C. Scornavacca, and M. Weller. Scanning phylogenetic networks is NP-hard. In SOFSEM 2020: Theory and Practice of Computer Science: 46th International Conference on Current Trends in Theory and Practice of Informatics, SOFSEM 2020, Limassol, Cyprus, January 20–24, 2020, Proceedings 46, pages 519–530, 2020. doi: 10.1007/978-3-030-38919-2_42.

J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah. Julia: A fresh approach to numerical computing. SIAM Review,

59(1):65–98, 2017. doi: 10.1137/141000671.

T. Biedl. On triangulating k-outerplanar graphs. Discrete Applied Mathematics, 181:275–279, 2015. doi: 10.1016/j.dam.2014.10.017.

J. R. Blair and B. Peyton. An introduction to chordal graphs and clique trees. In Graph theory and sparse matrix computation, pages 1–29, 1993. doi: 10.1007/978-1-4613-8369-7_1.

S. P. Blomberg, T. Garland Jr., and A. R. Ives. Testing for phylogenetic signal in comparative data: Behavioral traits are more labile. Evolution, 57(4):717–745, 2003. doi: 10.1111/j.0014-3820.2003.tb00285.x.

H. L. Bodlaender. A partial k-arboretum of graphs with bounded treewidth. Theoretical Computer Science, 209(1):

1–45, 1998. ISSN 0304-3975. doi: 10.1016/S0304-3975(97)00228-4.

H. L. Bodlaender and A. M. Koster. Treewidth computations I. Upper bounds. Information and Computation, 208(3):259–275, 2010. doi: 10.1016/j.ic.2009.03.008.

J. D. Boyko and J. M. Beaulieu. Generalized hidden Markov models for phylogenetic comparative datasets. Methods in

Ecology and Evolution, 12(3):468–478, 2021. doi: 10.1111/2041-210X.13534.

J. D. Boyko, B. C. O’Meara, and J. M. Beaulieu. A novel method for jointly modeling the evolution of discrete and

continuous traits. Evolution, 77(3):836–851, 2023. doi: 10.1093/evolut/qpad002.

D. Bryant, R. Bouckaert, J. Felsenstein, N. A. Rosenberg, and A. RoyChoudhury. Inferring species trees directly from

biallelic genetic markers: Bypassing gene trees in a full coalescent analysis. Molecular Biology and Evolution, 29(8):

1917–1932, 2012. doi: 10.1093/molbev/mss086.

D. S. Caetano and L. J. Harmon. Estimating correlated rates of trait evolution with uncertainty. Systematic Biology, 68

(3):412–429, 2019. doi: 10.1093/sysbio/syy067.

O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer Series in Statistics. Springer New York, New York, NY, 2005. ISBN 978-0-387-40264-2. doi: 10.1007/0-387-28982-8.

S. Chaplick, S. Kelk, R. Meuwese, M. Mihalák, and G. Stamoulis. Snakes and ladders: A treewidth story. In D. Paulusma and B. Ries, editors, Graph-Theoretic Concepts in Computer Science, pages 187–200, 2023. ISBN 978-3-031-43380-1. doi: 10.1007/978-3-031-43380-1_14.

J. Clavel and H. Morlon. Accelerated body size evolution during cold climatic periods in the Cenozoic. Proceedings of the National Academy of Sciences, 114(16):4183–4188, 2017. doi: 10.1073/pnas.1606868114.

J. Clavel, G. Escarguel, and G. Merceron. mvMORPH: an R package for fitting multivariate evolutionary models to morphometric data. Methods in Ecology and Evolution, 6(11):1311–1319, 2015. doi: 10.1111/2041-210X.12420.

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Third Edition. MIT Press, 3rd edition, 2009. ISBN 0262033844. doi: 10.5555/1614191.

G. B. Cybis, J. S. Sinsheimer, T. Bedford, A. E. Mather, P. Lemey, and M. A. Suchard. Assessing phenotypic correlation

through the multivariate phylogenetic latent liability model. The Annals of Applied Statistics, 9(2):969–991, 2015.

ISSN 1932-6157. doi: 10.1214/15-AOAS821. URL http://projecteuclid.org/euclid.aoas/1437397120.

N. De Maio, P. Kalaghatgi, Y. Turakhia, R. Corbett-Detig, B. Q. Minh, and N. Goldman. Maximum likelihood

pandemic-scale phylogenetics. Nature Genetics, 55(5):746–752, 2023. doi: 10.1038/s41588-023-01368-0.

J. Drury, J. Clavel, M. Manceau, and H. Morlon. Estimating the effect of competition on trait evolution using maximum

likelihood inference. Systematic Biology, 65(4):700–710, 2016. doi: 10.1093/sysbio/syw020.

P. Duchen, S. Hautphenne, L. Lehmann, and N. Salamin. Linking micro and macroevolution in the presence of

migration. Journal of Theoretical Biology, 486:110087, 2020. doi: 10.1016/j.jtbi.2019.110087.

C. W. Dunn, X. Luo, and Z. Wu. Phylogenetic analysis of gene expression. Integrative and Comparative Biology, 53

(5):847–856, 2013. doi: 10.1093/icb/ict068.

G. Elidan, I. McGraw, and D. Koller. Residual belief propagation: informed scheduling for asynchronous message passing. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 165–173, 2006. ISBN 0974903922.

J. Felsenstein. Maximum-likelihood estimation of evolutionary trees from continuous characters. American journal of

human genetics, 25(5):471, 1973.

J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular

Evolution, 17(6):368–376, 1981. doi: 10.1007/BF01734359.

J. Felsenstein. Phylogenies and the comparative method. The American Naturalist, 125(1):1–15, 1985.

M. Fishelson and D. Geiger. Optimizing exact genetic linkage computations. In Proceedings of the seventh annual international conference on Research in computational molecular biology, pages 114–121, 2003. doi: 10.1145/640075.640089.

A. A. Fisher, X. Ji, P. Lemey, and M. A. Suchard. Relaxed random walks at scale. Systematic Biology, 70(2):258–267,

July 2021. doi: 10.1093/sysbio/syaa056.

R. G. FitzJohn. Diversitree: comparative phylogenetic analyses of diversification in R. Methods in Ecology and Evolution, 3(6):1084–1092, 2012. doi: 10.1111/j.2041-210X.2012.00234.x.

M. Fourment and A. E. Darling. Evaluating probabilistic programming and fast variational bayesian inference in

phylogenetics. PeerJ, 7:e8272, Dec. 2019. doi: 10.7717/peerj.8272.

M. Fourment, C. J. Swanepoel, J. G. Galloway, X. Ji, K. Gangavarapu, M. A. Suchard, and F. A. Matsen IV. Automatic

Differentiation is no Panacea for Phylogenetic Gradient Computation. Genome Biology and Evolution, 15(6):evad099,

June 2023. ISSN 1759-6653. doi: 10.1093/gbe/evad099. URL https://doi.org/10.1093/gbe/evad099.

R. P. Freckleton. Fast likelihood calculations for comparative analyses. Methods in Ecology and Evolution, 3(5):940–947, 2012. doi: 10.1111/j.2041-210X.2012.00220.x.

P. Gambette, V. Berry, and C. Paul. The structure of level-k phylogenetic networks. In Annual Symposium on Combinatorial Pattern Matching, pages 289–300, 2009. doi: 10.1007/978-3-642-02441-2_26.

M. Gautier, R. Vitalis, L. Flori, and A. Estoup. f-Statistics estimation and admixture graph construction with Pool-Seq or allele count data using the R package poolfstat. Molecular Ecology Resources, 22(4):1394–1416, 2022. ISSN 1755-0998. doi: 10.1111/1755-0998.13557. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13557.

E. W. Goolsby, J. Bruggeman, and C. Ané. Rphylopars: fast multivariate phylogenetic comparative methods for missing data and within-species variation. Methods in Ecology and Evolution, 8(1):22–27, 2017. doi: 10.1111/2041-210X.12612.

R. C. Griffiths and S. Tavaré. Ancestral inference in population genetics. Statistical Science, 9(3):307–319, 1994. doi: 10.1214/ss/1177010378.

E. Gross, L. van Iersel, R. Janssen, M. Jones, C. Long, and Y. Murakami. Distinguishing level-1 phylogenetic

networks on the basis of data generated by Markov processes. Journal of Mathematical Biology, 83:32, 2021. doi:

10.1007/s00285-021-01653-8.

D. Gusfield, V. Bansal, V. Bafna, and Y. S. Song. A decomposition theory for phylogenetic networks and incompatible characters. Journal of Computational Biology, 14(10):1247–1272, 2007. doi: 10.1089/cmb.2006.0137.

M. Hajdinjak, F. Mafessoni, L. Skov, B. Vernot, A. Hübner, Q. Fu, E. Essel, S. Nagel, B. Nickel, J. Richter, et al. Initial Upper Palaeolithic humans in Europe had recent Neanderthal ancestry. Nature, 592(7853):253–257, 2021. doi: 10.1038/s41586-021-03335-3.

M. Hamann and B. Strasser. Graph bisection with pareto optimization. Journal of Experimental Algorithmics (JEA),

23:1–34, 2018. doi: 10.1145/3173045.

T. F. Hansen. Stabilizing selection and the comparative analysis of adaptation. Evolution, 51(5):1341–1351, 1997. doi:

10.1111/j.1558-5646.1997.tb01457.x.

L. J. Harmon, J. B. Losos, T. J. Davies, R. G. Gillespie, J. L. Gittleman, W. B. Jennings, K. H. Kozak, M. A. McPeek, F. Moreno-Roark, T. J. Near, A. Purvis, R. E. Ricklefs, D. Schluter, J. A. Schulte II, O. Seehausen, B. L. Sidlauskas, O. Torres-Carvajal, J. T. Weir, and A. Ø. Mooers. Early bursts of body size and shape evolution are rare in comparative data. Evolution, 64(8):2385–2396, 2010. doi: 10.1111/j.1558-5646.2010.01025.x.

D. A. Harville. Bayesian inference for variance components using only error contrasts. Biometrika, 61(2):383–385,

1974. doi: 10.1093/biomet/61.2.383.

G. Hassler, M. R. Tolkoff, W. L. Allen, L. S. T. Ho, P. Lemey, and M. A. Suchard. Inferring phenotypic trait evolution

on large trees with many incomplete measurements. Journal of the American Statistical Association, 117(538):

678–692, 2022a. doi: 10.1080/01621459.2020.1799812.

G. W. Hassler, B. Gallone, L. Aristide, W. L. Allen, M. R. Tolkoff, A. J. Holbrook, G. Baele, P. Lemey, and M. A. Suchard. Principled, practical, flexible, fast: A new approach to phylogenetic factor analysis. Methods in Ecology and Evolution, 13(10):2181–2197, 2022b. doi: 10.1111/2041-210X.13920.

G. W. Hassler, A. F. Magee, Z. Zhang, G. Baele, P. Lemey, X. Ji, M. Fourment, and M. A. Suchard. Data integration in Bayesian phylogenetics. Annual Review of Statistics and Its Application, 10(1):353–377, 2023. doi: 10.1146/annurev-statistics-033021-112532.

B. P. Hedrick. Dots on a screen: The past, present, and future of morphometrics in the study of nonavian dinosaurs. The

Anatomical Record, pages 1–22, 2023. doi: 10.1002/ar.25183.

T. Heskes, O. Zoeter, and W. Wiegerinck. Approximate Expectation Maximization. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003. URL https://proceedings.neurips.cc/paper_files/paper/2003/hash/8208974663db80265e9bfe7b222dcb18-Abstract.html.

L. S. T. Ho and C. Ané. A linear-time algorithm for Gaussian and non-Gaussian trait evolution models. Systematic Biology, 63(3):397–408, 2014. doi: 10.1093/sysbio/syu005.

S. Höhna, T. A. Heath, B. Boussau, M. J. Landis, F. Ronquist, and J. P. Huelsenbeck. Probabilistic graphical model representation in phylogenetics. Systematic Biology, 63(5):753–771, 2014. doi: 10.1093/sysbio/syu039.

S. Höhna, M. J. Landis, T. A. Heath, B. Boussau, N. Lartillot, B. R. Moore, J. P. Huelsenbeck, and F. Ronquist. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Systematic Biology, 65(4):726–736, 2016. doi: 10.1093/sysbio/syw021.

D. H. Huson, R. Rupp, V. Berry, P. Gambette, and C. Paul. Computing galled networks from real data. Bioinformatics,

25(12):i85–i93, 05 2009. ISSN 1367-4803. doi: 10.1093/bioinformatics/btp217.

D. H. Huson, R. Rupp, and C. Scornavacca. Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge

University Press, Cambridge, 2010. doi: 10.1017/CBO9780511974076.

A. Ignatieva, J. Hein, and P. A. Jenkins. Ongoing recombination in SARS-CoV-2 revealed through genealogical

reconstruction. Molecular Biology and Evolution, 39(2):msac028, 2022. doi: 10.1093/molbev/msac028.

W. Jetz, G. H. Thomas, J. B. Joy, K. Hartmann, and A. O. Mooers. The global diversity of birds in space and time.

Nature, 491(7424):444–448, 2012. doi: 10.1038/nature11631.

D.-C. Jhwueng and B. C. O’Meara. Trait evolution on phylogenetic networks. bioRxiv, 2015. doi: 10.1101/023986.

D.-C. Jhwueng and B. C. O’Meara. On the matrix condition of phylogenetic tree. Evolutionary Bioinformatics, 16:

1176934320901721, 2020. doi: 10.1177/1176934320901721.

X. Ji, Z. Zhang, A. Holbrook, A. Nishimura, G. Baele, A. Rambaut, P. Lemey, and M. A. Suchard. Gradients do grow

on trees: A linear-time o(n)-dimensional gradient for statistical phylogenetics. Molecular Biology and Evolution, 37

(10):3047–3060, May 2020. doi: 10.1093/molbev/msaa130.

J. A. Justison and T. A. Heath. Exploring the distribution of phylogenetic networks generated under a birth-death-

hybridization process. Bulletin of the Society of Systematic Biologists, 2(3):1–22, 2024. doi: 10.18061/bssb.v2i3.9285.

J. A. Justison, C. Solis-Lemus, and T. A. Heath. Siphynetwork: An R package for simulating phylogenetic networks.

Methods in Ecology and Evolution, 14(7):1687–1698, 2023. doi: 10.1111/2041-210X.14116.

J. A. Kamm, J. Terhorst, and Y. S. Song. Efficient computation of the joint sample frequency spectra for multiple populations. Journal of Computational and Graphical Statistics, 26(1):182–194, 2017. doi: 10.1080/10618600.2016.1159212.

J. A. Kamm, J. Terhorst, R. Durbin, and Y. S. Song. Efficiently inferring the demographic history of many populations with allele count data. Journal of the American Statistical Association, 115(531):1472–1487, 2020. doi: 10.1080/01621459.2019.1635482.

N. Karimi, C. E. Grover, J. P. Gallagher, J. F. Wendel, C. Ané, and D. A. Baum. Reticulate evolution helps explain apparent homoplasy in floral biology and pollination in baobabs (Adansonia; Bombacoideae; Malvaceae). Systematic Biology, 69(3):462–478, 2020. doi: 10.1093/sysbio/syz073.

J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982. doi:

10.2307/3213548.

C. Knoll, M. Rath, S. Tschiatschek, and F. Pernkopf. Message scheduling methods for belief propagation. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part II 15, pages 295–310, 2015. doi: 10.1007/978-3-319-23525-7_18.

D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009. ISBN

9780262013192.

S. Kong, D. L. Swofford, and L. S. Kubatko. Inference of phylogenetic networks from sequence data using composite

likelihood. bioRxiv, 2022. doi: 10.1101/2022.11.14.516468.

N. Lartillot. A phylogenetic Kalman filter for ancestral trait reconstruction using molecular data. Bioinformatics, 30(4):488–496, 2014. ISSN 1367-4803. doi: 10.1093/bioinformatics/btt707.

I. Lazaridis, N. Patterson, A. Mittnik, G. Renaud, S. Mallick, K. Kirsanow, P. H. Sudmant, J. G. Schraiber, S. Castellano, M. Lipson, et al. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature, 513(7518):409–413, 2014. doi: 10.1038/nature13673.

K. Leppä, S. V. Nielsen, and T. Mailund. admixturegraph: an R package for admixture graph manipulation and fitting. Bioinformatics, 33(11):1738–1740, 2017. doi: 10.1093/bioinformatics/btx048.

P. Librado, N. Khan, A. Fages, M. A. Kusliy, T. Suchan, L. Tonasso-Calvière, S. Schiavinato, D. Alioglu, A. Fromentier, A. Perdereau, et al. The origins and spread of domestic horses from the Western Eurasian steppes. Nature, 598(7882):634–640, 2021. doi: 10.1038/s41586-021-04018-9.

M. Lipson. Applying f4-statistics and admixture graphs: Theory and examples. Molecular Ecology Resources, 20(6):1658–1667, 2020. doi: 10.1111/1755-0998.13230.

M. Lipson, I. Ribot, S. Mallick, N. Rohland, I. Olalde, N. Adamski, N. Broomandkhoshbacht, A. M. Lawson, S. López, J. Oppenheimer, et al. Ancient West African foragers in the context of African population history. Nature, 577(7792):665–670, 2020. doi: 10.1038/s41586-020-1929-1.

S. Lutteropp, C. Scornavacca, A. M. Kozlov, B. Morel, and A. Stamatakis. NetRAX: accurate and fast maximum likelihood phylogenetic network inference. Bioinformatics, 38(15):3725–3733, 2022. doi: 10.1093/bioinformatics/btac396.

J. R. Magnus and H. Neudecker. Symmetry, 0-1 matrices and Jacobians: A review. Econometric Theory, 2(2):157–190,

1986. ISSN 0266-4666. doi: 10.1017/S0266466600011476.

R. Maier, P. Flegontov, O. Flegontova, U. Isildak, P. Changmai, and D. Reich. On the limits of fitting complex models of population history to f-statistics. eLife, 12:e85492, 2023. doi: 10.7554/elife.85492.

D. M. Malioutov, J. K. Johnson, and A. S. Willsky. Walk-sums and belief propagation in gaussian graphical models.

The Journal of Machine Learning Research, 7:2031–2064, 2006.

M. Manceau, A. Lambert, and H. Morlon. A unifying comparative phylogenetic framework including traits coevolving

across interacting lineages. Systematic Biology, 66(4):551–568, 2017. doi: 10.1093/sysbio/syw115.

R. Mateescu, K. Kask, V. Gogate, and R. Dechter. Join-graph propagation algorithms. Journal of Artificial Intelligence Research, 37:279–328, 2010. doi: 10.1613/jair.2842.

V. Mitov, K. Bartoszek, and T. Stadler. Automatic generation of evolutionary hypotheses using mixed Gaussian

phylogenetic models. Proceedings of the National Academy of Sciences, page 201813823, Aug. 2019. ISSN

0027-8424. doi: 10.1073/pnas.1813823116.

V. Mitov, K. Bartoszek, G. Asimomitis, and T. Stadler. Fast likelihood calculation for multivariate Gaussian phylogenetic

models with shifts. Theoretical Population Biology, 131:66–78, 2020. doi: 10.1016/j.tpb.2019.11.005.

E. K. Molloy, A. Durvasula, and S. Sankararaman. Advancing admixture graph estimation via maximum likelihood

network orientation. Bioinformatics, 37(Supplement 1):i142–i150, 2021. doi: 10.1093/bioinformatics/btab267.

N. F. Müller, K. E. Kistler, and T. Bedford. A Bayesian approach to infer recombination patterns in coronaviruses. Nature Communications, 13(1):4186, 2022. doi: 10.1038/s41467-022-31749-8.

N. Neureiter, P. Ranacher, N. Efrat-Kowalsky, G. A. Kaiping, R. Weibel, P. Widmer, and R. R. Bouckaert. Detecting contact in language trees: a Bayesian phylogenetic model with horizontal transfer. Humanities and Social Sciences Communications, 9:205, 2022. doi: 10.1057/s41599-022-01211-7.

L.-T. Nguyen, H. A. Schmidt, A. Von Haeseler, and B. Q. Minh. IQ-TREE: A fast and effective stochastic algorithm

for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution, 32(1):268–274, 2015. doi:

10.1093/molbev/msu300.

S. V. Nielsen, A. H. Vaughn, K. Leppä, M. J. Landis, T. Mailund, and R. Nielsen. Bayesian inference of admixture graphs on Native American and Arctic populations. PLOS Genetics, 19(2):1–22, 2023. doi: 10.1371/journal.pgen.1010410.

J. Oldman, T. Wu, L. van Iersel, and V. Moulton. TriLoNet: Piecing together small networks to reconstruct reticulate

evolutionary histories. Molecular Biology and Evolution, 33(8):2151–2162, 2016. doi: 10.1093/molbev/msw068.

M. Pagel, A. Meade, and D. Barker. Bayesian estimation of ancestral character states on phylogenies. Systematic

Biology, 53(5):673–684, 2004. doi: 10.1080/10635150490522232.

N. Patterson, P. Moorjani, Y. Luo, S. Mallick, N. Rohland, Y. Zhan, T. Genschoreck, T. Webster, and D. Reich. Ancient

admixture in human history. Genetics, 192(3):1065–1093, 2012. doi: 10.1534/genetics.112.145037.

J. K. Pickrell and J. K. Pritchard. Inference of population splits and mixtures from genome-wide allele frequency data.

PLOS Genetics, 8(11):1–17, 2012. doi: 10.1371/journal.pgen.1002967.

O. G. Pybus, M. A. Suchard, P. Lemey, F. J. Bernardin, A. Rambaut, F. W. Crawford, R. R. Gray, N. Arinaminpathy, S. L.

Stramer, M. P. Busch, and E. L. Delwart. Unifying the spatial epidemiology and molecular evolution of emerging

epidemics. Proceedings of the National Academy of Sciences, 109(37):15066–15071, Sept. 2012. ISSN 0027-8424.

doi: 10.1073/pnas.1206598109.

C.-E. Rabier, V. Berry, M. Stoltz, J. D. Santos, W. Wang, J.-C. Glaszmann, F. Pardi, and C. Scornavacca. On the

inference of complex phylogenetic networks by Markov Chain Monte-Carlo. PLOS Computational Biology, 17(9):

1–39, 2021. doi: 10.1371/journal.pcbi.1008380.

F. Racimo, J. J. Berg, and J. K. Pickrell. Detecting polygenic adaptation in admixture graphs. Genetics, 208(4):

1565–1584, 2018. doi: 10.1534/genetics.117.300489.

R. Ranganath, S. Gerrish, and D. Blei. Black Box Variational Inference. In S. Kaski and J. Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 814–822, Reykjavik, Iceland, Apr. 2014. PMLR. URL https://proceedings.mlr.press/v33/ranganath14.html.

B. Rannala and Z. Yang. Bayes estimation of species divergence times and ancestral population sizes using DNA

sequences from multiple loci. Genetics, 164(4):1645–1656, 2003. doi: 10.1093/genetics/164.4.1645.

A. Refoyo-Martínez, R. R. da Fonseca, K. Halldórsdóttir, E. Árnason, T. Mailund, and F. Racimo. Identifying loci under positive selection in complex population histories. Genome Research, 29(9):1506–1520, 2019. doi: 10.1101/gr.246777.118.

L. J. Revell. phytools: An R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution, 3(2):217–223, 2012. doi: 10.1111/j.2041-210X.2011.00169.x.

J. A. Rhodes, H. Baños, J. Xu, and C. Ané. Identifying circular orders for blobs in phylogenetic networks. arXiv, 2024. doi: 10.48550/arXiv.2402.11693.

F. Ronquist and J. Huelsenbeck. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19:

1572–1574, 2003. doi: 10.1093/bioinformatics/btg180.

D. J. Rose. A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations. In Graph theory and computing, pages 183–217. Elsevier, 1972. doi: 10.1016/B978-1-4832-3187-7.50018-0.

R. Salakhutdinov, S. Roweis, and Z. Ghahramani. Optimization with EM and Expectation-Conjugate-Gradient. In

Proceedings of the 20th International Conference on Machine Learning (ICML-03). AAAI Press, 2003.

J. C. Schoeman, C. E. van Daalen, and J. A. du Preez. Degenerate gaussian factors for probabilistic inference.

International Journal of Approximate Reasoning, 143:159–191, 2022. doi: 10.1016/j.ijar.2022.01.008.

C. Scornavacca and M. Weller. Treewidth-based algorithms for the small parsimony problem on networks. Algorithms

for Molecular Biology, 17:15, 2022. doi: 10.1186/s13015-022-00216-w.

C. Semple and J. Simpson. When is a phylogenetic network simply an amalgamation of two trees? Bulletin of

Mathematical Biology, 80:2338–2348, 2018. doi: 10.1007/s11538-018-0463-x.

M. E. R. Shafer. Cross-species analysis of single-cell transcriptomic data. Frontiers in Cell and Developmental Biology,

7, 2019. doi: 10.3389/fcell.2019.00175.

M. Sikora, V. V. Pitulko, V. C. Sousa, M. E. Allentoft, L. Vinner, S. Rasmussen, A. Margaryan, P. de Barros Damgaard,

C. de la Fuente, G. Renaud, et al. The population history of northeastern siberia since the pleistocene. Nature, 570

(7760):182–188, 2019. doi: 10.1038/s41586-019-1279-z.

C. Solís-Lemus and C. Ané. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genetics, 12(3):e1005896, 2016. doi: 10.1371/journal.pgen.1005896.


C. Solís-Lemus, P. Bastide, and C. Ané. PhyloNetworks: A package for phylogenetic networks. Molecular Biology and Evolution, 34(12):3292–3298, 2017. doi: 10.1093/molbev/msx235.

S. Soraggi and C. Wiuf. General theory for stochastic admixture graphs and f-statistics. Theoretical Population Biology,

125:56–66, 2019. doi: 10.1016/j.tpb.2018.12.002.

A. Stamatakis. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9):1312–1313, 2014. doi: 10.1093/bioinformatics/btu033.

M. Stoltz, B. Baeumer, R. Bouckaert, C. Fox, G. Hiscott, and D. Bryant. Bayesian inference of species trees using

diffusion models. Systematic Biology, 70(1):145–161, 2020. doi: 10.1093/sysbio/syaa051.

B. Strasser. Computing tree decompositions with ﬂowcutter: PACE 2017 submission. CoRR, abs/1709.08949, 2017.

URL http://arxiv.org/abs/1709.08949.

S. Streicher and J. du Preez. Graph coloring: Comparing cluster graphs to factor graphs. In Proceedings of the ACM

Multimedia 2017 Workshop on South African Academic Participation, pages 35–42, 2017. doi: 10.1145/3132711.3132717.

X. Sun, Y.-C. Liu, M. P. Tiunov, D. O. Gimranov, Y. Zhuang, Y. Han, C. A. Driscoll, Y. Pang, C. Li, Y. Pan, et al.

Ancient dna reveals genetic admixture in china during tiger evolution. Nature ecology & evolution, 7(11):1914–1929,

2023. doi: 10.1038/s41559-023-02185-8.

C. Sutton and A. McCallum. Improved dynamic schedules for belief propagation. In Proceedings of the Twenty-Third

Conference on Uncertainty in Artificial Intelligence, pages 376–383, 2007. ISBN 0974903930.

C. Swanepoel, M. Fourment, X. Ji, H. Nasif, M. A. Suchard, F. A. Matsen IV, and A. Drummond. TreeFlow: probabilistic

programming and automatic differentiation for phylogenetics. arXiv e-print, 2022. doi: 10.48550/arXiv.2211.05220.

URL http://arxiv.org/abs/2211.05220.

S. Tavaré. Line-of-descent and genealogical processes, and their applications in population genetics models. Theoretical Population Biology, 26(2):119–164, 1984. doi: 10.1016/0040-5809(84)90027-3.

B. Teo, J. P. Rose, P. Bastide, and C. Ané. Accounting for within-species variation in continuous trait evolution on a phylogenetic network. Bulletin of the Society of Systematic Biologists, 2(3):1–29, 2023. doi: 10.18061/bssb.v2i3.8977.

J. T. Thorson and W. van der Bijl. phylosem: A fast and simple R package for phylogenetic inference and trait

imputation using phylogenetic structural equation models. Journal of Evolutionary Biology, 36(10):1357–1364,

2023. doi: 10.1111/jeb.14234.

J. T. Thorson, A. A. Maureaud, R. Frelat, B. Mérigot, J. S. Bigman, S. T. Friedman, M. L. D. Palomares, M. L. Pinsky,

S. A. Price, and P. Wainwright. Identifying direct and indirect associations among traits by merging phylogenetic

comparative methods and structural equation models. Methods in Ecology and Evolution, 14(5):1259–1275, 2023.

doi: 10.1111/2041-210X.14076.

M. R. Tolkoff, M. E. Alfaro, G. Baele, P. Lemey, and M. A. Suchard. Phylogenetic Factor Analysis. Systematic Biology,

67(3):384–399, Aug. 2018. ISSN 1063-5157. doi: 10.1093/sysbio/syx066.

N. S. Upham, J. A. Esselstyn, and W. Jetz. Inferring the mammal tree: Species-level sets of phylogenies for questions

in ecology, evolution, and conservation. PLOS Biology, 17(12):1–44, 2019. doi: 10.1371/journal.pbio.3000494.

L. Van Iersel, S. Kelk, R. Rupp, and D. Huson. Phylogenetic networks do not need to be complex: using fewer

reticulations to represent conflicting clusters. Bioinformatics, 26(12):i124–i131, 2010. doi: 10.1093/bioinformatics/btq202.

M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of

sum-product and related algorithms. IEEE Transactions on information theory, 49(5):1120–1146, 2003. doi:

10.1109/TIT.2003.810642.

C.-C. Wang, H.-Y. Yeh, A. N. Popov, H.-Q. Zhang, H. Matsumura, K. Sirak, O. Cheronet, A. Kovalev, N. Rohland,

A. M. Kim, et al. Genomic insights into the formation of human populations in east asia. Nature, 591(7850):413–419,

2021. doi: 10.1038/s41586-021-03336-2.

Y. Weiss and W. Freeman. Correctness of belief propagation in gaussian graphical models of arbitrary topology.

Advances in neural information processing systems, 12, 1999.

J. Xu and C. Ané. Identifiability of local and global features of phylogenetic networks from average distances. Journal

of Mathematical Biology, 86(1):12, 2023. doi: 10.1007/s00285-022-01847-8.

J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation

algorithms. IEEE Transactions on information theory, 51(7):2282–2312, 2005. doi: 10.1109/TIT.2005.850085.


Y. Yu and L. Nakhleh. A maximum pseudo-likelihood approach for phylogenetic networks. BMC Genomics, 16(10):

S10, 2015. doi: 10.1186/1471-2164-16-S10-S10.

Z. Zhang, A. Nishimura, P. Bastide, X. Ji, R. P. Payne, P. Goulder, P. Lemey, and M. A. Suchard. Large-scale inference

of correlation among mixed-type biological traits with phylogenetic multivariate probit models. The Annals of

Applied Statistics, 15(1):230–251, Mar. 2021. doi: 10.1214/20-aoas1394.

SM: Leveraging graphical model techniques to study evolution on phylogenetic networks

SUPPLEMENTARY MATERIAL

A Recasting SnappNet as BP

SnappNet [Rabier et al., 2021] extends the model described in SNAPP [Bryant et al., 2012] to binary phylogenetic networks with reticulations. In the main text, their model is considered along a 2-taxon phylogenetic tree and Fig. 3b shows that the graphical model has a more complicated graph. The same applies in the presence of reticulations (see Fig. S1 for a 2-taxon phylogeny with 1 reticulation). In addition to the coalescent and speciation factors described in Example 3, we also need to describe hybridization factors. Consider an edge $e$ that is the child of a hybrid node, whose parent hybrid edges $p_1$ and $p_2$ have inheritance probabilities $\gamma_1$ and $\gamma_2$. The hybridization factors for the total allele count, $\phi_{n_{p_1}} = P(n_{p_1} \mid n_e)$ and $\phi_{n_{p_2}} = \mathbb{1}_{\{n_e - n_{p_1}\}}(n_{p_2})$, describe a binomial distribution for each $n_{p_i}$ ($i = 1, 2$) with $n_{p_1} + n_{p_2} = n_e$, because each of the $n_e$ individuals has a $\gamma_i$ chance of being assigned to edge $p_i$ ($i = 1, 2$). The hybridization factor for the red allele count is simply $\phi_{r_e} = \mathbb{1}_{\{r_{p_1} + r_{p_2}\}}(r_e)$ because $r_e = r_{p_1} + r_{p_2}$.
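The hybridization factors for the total allele count can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the function names `hybridization_factor` and `indicator_factor`, and the values `n_e = 4` and `gamma1 = 0.6`, are hypothetical and are not taken from SnappNet.

```python
from math import comb


def hybridization_factor(n_p1: int, n_e: int, gamma1: float) -> float:
    """Binomial probability P(n_p1 | n_e): each of the n_e individuals is
    assigned to parent edge p1 with probability gamma1 (hypothetical helper)."""
    if not 0 <= n_p1 <= n_e:
        return 0.0
    return comb(n_e, n_p1) * gamma1**n_p1 * (1.0 - gamma1) ** (n_e - n_p1)


def indicator_factor(n_p2: int, n_e: int, n_p1: int) -> float:
    """Deterministic factor: n_p2 must equal n_e - n_p1."""
    return 1.0 if n_p2 == n_e - n_p1 else 0.0


# Assumed example values, for illustration only.
n_e, gamma1 = 4, 0.6

# Joint factor over (n_p1, n_p2): non-zero only when n_p1 + n_p2 = n_e.
joint = {
    (a, b): hybridization_factor(a, n_e, gamma1) * indicator_factor(b, n_e, a)
    for a in range(n_e + 1)
    for b in range(n_e + 1)
}
total = sum(joint.values())  # the product of the two factors sums to 1
```

Because the second factor is an indicator, the joint table is supported on the anti-diagonal $n_{p_1} + n_{p_2} = n_e$, and the marginal of $n_{p_2}$ is binomial with success probability $1 - \gamma_1 = \gamma_2$.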

[Figure S1, panels (a) to (c): image not reproduced. In panel (a), the two hybrid edges carry inheritance probabilities $\gamma = 0.6$ and $\gamma = 0.4$.]

Figure S1: (a) Phylogenetic network $N$ with hybrid edges in blue. (b) Graph $G$ for the graphical model associated with $N$; $G$ is a DAG with two roots and two leaves. (c) Clique tree $U$ for $G$, with clusters $C_i$ in grey and sepsets $S_{i,j}$ in orange. To reduce clutter and simplify notations in this figure, each $n$ variable and the corresponding $r$ variable are abbreviated by the same label and are distinguished by colours ($n$'s in black, $r$'s in red).

We show that applying the SnappNet algorithm to the network in Fig. S1(a) is equivalent to BP on the clique tree in Fig. S1(c). To start, we assign the initial beliefs $\phi_{n_1}\phi_{r_1}$ to cluster $C_1$; $\phi_{n_2}\phi_{r_2}$ to $C_4$; $\phi_{n_\rho}\phi_{r_\rho}\phi_{r_5}\phi_{r_4}$ to $C_7$; $\phi_{n_5}\phi_{n_3}\phi_{r_5}\phi_{r_3}$ to $C_3$; $\phi_{n_4}\phi_{r_3}\phi_{r_2}$ to $C_5$; and $\phi_{n_4}\phi_{r_4}$ to $C_6$. Finally, all hybridization factors $\phi_{n_5}\phi_{n_3}\phi_{r_1}$ are assigned to $C_2$. Each sepset is assigned an initial belief of 1. The total allele counts $n_1, n_2$ (fixed by design) and the observed red allele counts $r_1, r_2$ at the tips are absorbed as evidence into the mutation factors $\phi_{r_1}$, $\phi_{r_2}$ and coalescent factors $\phi_{n_1}$, $\phi_{n_2}$ for the terminal edges. This is denoted as $\phi[\cdot]$, with $[\cdot]$ containing the evidence absorbed. BP messages are then


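passed on the clique tree from $C_1$ and $C_4$ (considered as leaves) towards $C_7$ (considered as root) as follows:

$$
\begin{aligned}
\tilde\mu_{1\to 2} &= \phi_{n_1}[n_1]\,\phi_{r_1}[n_1, r_1] = F_{1}(S_{1,2}) \\
\tilde\mu_{2\to 3} &= \sum_{C_2 \setminus S_{2,3}} \phi_{n_5}\phi_{n_3}\phi_{r_1} \cdot \tilde\mu_{1\to 2} = F_{5,3}(S_{2,3}) \\
\tilde\mu_{3\to 5} &= \sum_{C_3 \setminus S_{3,5}} \phi_{n_5}\phi_{n_3}\phi_{r_5}\phi_{r_3} \cdot \tilde\mu_{2\to 3} = F_{5,3}(S_{3,5}) \\
\tilde\mu_{4\to 5} &= \phi_{n_2}[n_2]\,\phi_{r_2}[n_2, r_2] = F_{2}(S_{4,5}) \\
\tilde\mu_{5\to 6} &= \sum_{C_5 \setminus S_{5,6}} \phi_{n_4}\phi_{r_3}\phi_{r_2} \cdot \tilde\mu_{3\to 5}\,\tilde\mu_{4\to 5} = F_{5,4}(S_{5,6}) \\
\tilde\mu_{6\to 7} &= \sum_{C_6 \setminus S_{6,7}} \phi_{n_4}\phi_{r_4} \cdot \tilde\mu_{5\to 6} = F_{5,4}(S_{6,7}) \\
\beta_7^{\text{final}} &= \phi_{n_\rho}\phi_{r_\rho}\phi_{r_5}\phi_{r_4} \cdot \tilde\mu_{6\to 7} \\
\sum_{C_7} \beta_7^{\text{final}} &= \sum_{C_7 \setminus S_{6,7}} \phi_{r_\rho} \sum_{S_{6,7}} \phi_{n_\rho}\phi_{r_5}\phi_{r_4} \cdot \tilde\mu_{6\to 7} = \sum_{C_7 \setminus S_{6,7}} \phi_{r_\rho}\, F_{\rho}(C_7 \setminus S_{6,7})
\end{aligned}
$$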

where the $F$ functions are defined in Rabier et al. [2021] for different population interface sets (each population interface is the top or bottom of some branch). That this correspondence does not always hold (e.g. suppose the messages were sent towards a different cluster instead) highlights that BP is more general. The $F$s recursively compose the likelihood according to rules described in Rabier et al. [2021], and can be expressed at the top level as $\sum_{n_\rho, r_\rho} \phi_{r_\rho} F_\rho$. This is precisely the quantity obtained from marginalizing $\beta_7^{\text{final}}$ above, the final belief of $C_7$.

B Bounding the moralized network’s treewidth

Proof of Proposition 1. Using the notations in the main text, let $N$ be a binary phylogenetic network with $h$ hybrid nodes, level $\ell$, no parallel edges and no degree-2 nodes other than the root. Let $t$ be the treewidth of the moralized graph $N^m$ obtained from $N$.

(A0) is well-known: $t = 1$ exactly when $N$ is a tree. If $\ell = 1$ then $N$ has at least one non-trivial blob and every such blob is a cycle. So $N^m$ has outerplanar blobs and $t = 2$ [Biedl, 2015], proving (A1).

Now consider hybrid nodes $v_1$ and $v_2$ as in (A2). Let $u_3, u_4$ be the parents of $v_2$ such that $u_3$ is not a descendant of $u_1$ or $u_2$ (see Fig. 5, in which $u_4 = v_1$). Then there must be a path $p_4$ from $v_1$ to $v_2$ through $u_4$ since $v_2$ is a descendant of $v_1$. For a directed path $p$, let $p^u$ denote the corresponding undirected path. Let $w$ be a strict common ancestor of $u_1$ and $u_2$ such that there exist disjoint paths $p_1, p_2$, with $p_i$ from $w$ to $u_i$. Such $w$ exists because $u_1$ has a parent other than $u_2$ and vice versa. Let $C$ be the cycle in $N^m$ formed by concatenating $p_1^u$, $p_2^u$ and the moral edge $\{u_1, u_2\}$. Next, pick any path $p_3$ from the root of $N$ to $u_3$. If $p_3$ does not share any node with $C$, then we can find a common ancestor $\tilde w$ of $w$ and $u_3$, and paths $\tilde p_w$ and $\tilde p_3$ from $\tilde w$ to $w$ and $u_3$ respectively, that do not intersect $p_1$ nor $p_2$. Then we can see that $N^m$ contains the complete graph on $\{w, u_1, u_2, v_1\}$ as a graph minor, by contracting $\tilde p_w^u + \tilde p_3^u + \{u_3, v_2\} + p_4^u$ into a single edge between $w$ and $v_1$. If instead $p_3$ intersects $C$, then let $w'$ be the lowest node at which $p_3$ and $C$ intersect. Then $w' \neq u_1$ because otherwise $u_3$ would be its descendant. Similarly $w' \neq u_2$. Let $p_3'$ denote the subpath of $p_3$ from $w'$ to $u_3$. Then $N^m$ contains the complete graph on $\{w', u_1, u_2, v_1\}$ as a graph minor, as $C$ can be contracted into the cycle $\{w', u_1, u_2\}$ and $p_3'^{\,u} + \{u_3, v_2\} + p_4^u$ can be contracted into an edge $\{w', v_1\}$. In both cases, $N^m$ contains the complete graph on 4 nodes as a graph minor, therefore its treewidth is $t \geq 3$ [Bodlaender, 1998]. Also, in both cases $v_1$ and $v_2$ are in a common undirected cycle in $N$, so in the same blob and $\ell \geq 2$.


C Approximation quality with loopy BP

[Figure S2: image not reproduced. Panels plot E(Xρ | data), Var(Xρ | data) and the factored energy against the number of iterations, for the clique tree, the join-graph structuring cluster graph and the factor graph. Panel titles: Lipson et al. (2020b): n = 12; Müller et al. (2022): n = 40; Müller et al. (2022): n = 40, difference from clique tree estimate shown on log-modulus transformation scale, with Δ = E(Xρ | data) − 15.7, Δ = Var(Xρ | data) − 1425.7 and Δ = Factored energy − (−85).]

Figure S2: Comparing the accuracy of loopy BP between different cluster graphs: built from join-graph structuring $U^*$ as in Fig. 7 (black), or a factor graph (purple). For both, initial beliefs are regularized using algorithm R4. The true values, obtained using a clique tree, are shown in red. The plots in the first row are for the simpler phylogenetic network, and the plots in the other rows are for the complex phylogeny. The last row shows the difference $\Delta$ between the loopy BP estimate and the true value, displayed on the log-modulus scale using the transformation $\mathrm{sign}(\Delta)\log(1 + |\Delta|)$. For the simpler phylogenetic network, convergence speed and accuracy are similar between $U^*$ and the factor graph, which is unsurprising given their similarly small cluster sizes ($\leq 3$). For the complex network, the factor graph did not reach calibration as its iterates diverged for the conditional mean and factored energy.
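The log-modulus transformation used in the last row of Fig. S2 can be sketched as follows. This is a minimal illustration, assuming the natural logarithm (the base is not stated in the caption); the function name `log_modulus` is hypothetical.

```python
import math


def log_modulus(delta: float) -> float:
    """sign(delta) * log(1 + |delta|), assuming the natural logarithm.

    The transform is odd, maps 0 to 0, and compresses large deviations
    of either sign, so diverging iterates remain displayable."""
    return math.copysign(math.log1p(abs(delta)), delta)


# Example deviations between a loopy BP estimate and the exact value
# (made-up numbers, for illustration only).
deviations = [-1e12, -10.0, 0.0, 10.0, 1e12]
transformed = [log_modulus(d) for d in deviations]
```

Unlike a plain logarithm, this transformation is defined for negative and zero differences, which is why it suits signed errors.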


[Figure S3: image not reproduced. Horizontal axis: cluster size (log-scale); vertical axis: mean time (µs) per message; networks: Müller (n = 40) and Lipson (n = 12).]

Figure S3: Boxplots (with means as points) showing the distribution of cluster sizes in the join-graph structuring cluster graph $U^*$ and in the clique tree $U$ from Fig. 7. The factor graph has clusters of size between 1 and 3 (not displayed). The time for 100 iterations (defined in Fig. 7) was benchmarked over 20 replicates on a MacBook Pro M2 2022, and divided by the number of messages per 100 iterations to obtain an estimate of the mean time per belief update (vertical axis).

D Gradient and parameter estimates under the BM

D.1 The homogeneous BM model

We consider here the simple case of a multivariate BM of dimension $p$ on a network: with $V_v = \ell(e)\Sigma$ at a tree node $v$ with parent edge $e$. At a hybrid node $h$, we assume a weighted average merging rule as in (3.2) with a possible extra hybrid variance proportional to $\Sigma$: $V_h = \tilde\ell(h)\Sigma$ for some scalar $\tilde\ell(h) \geq 0$. To simplify equations, we define $\tilde\ell(v) = 0$ if $v$ is a tree node. Then for each node $v \in V$ the Gaussian linear model (3.1) simplifies to:

$$
X_v \mid X_{\mathrm{pa}(v)} \sim \mathcal{N}\Big( \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \,;\ \ell(v)\Sigma \Big), \tag{SM-1}
$$

with $\gamma_{uv}$ the inheritance probability associated with the branch going from $u$ to $v$, and

$$
\ell(v) = \tilde\ell(v) + \sum_{u \in \mathrm{pa}(v)} \gamma_{uv}^2\, \ell(e_{u \to v}).
$$
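As a small numerical illustration of the effective length $\ell(v)$ entering (SM-1), the sketch below evaluates the sum of squared inheritance probabilities times parent edge lengths, plus the optional extra hybrid term. The function name `effective_length` and the example values are assumptions made for illustration.

```python
def effective_length(parent_edges, hybrid_extra=0.0):
    """Scalar l(v) such that Var(X_v | parents) = l(v) * Sigma.

    parent_edges: list of (gamma_uv, edge_length) pairs, one per parent u.
    hybrid_extra: extra hybrid variance scalar (0 at a tree node).
    Hypothetical helper, not part of any package."""
    total_gamma = sum(gamma for gamma, _ in parent_edges)
    if abs(total_gamma - 1.0) > 1e-9:
        raise ValueError("inheritance probabilities must sum to 1")
    return hybrid_extra + sum(gamma**2 * length for gamma, length in parent_edges)


# Tree node: a single parent with gamma = 1, so l(v) is the edge length.
tree_length = effective_length([(1.0, 2.5)])

# Hybrid node with two parents (assumed values) and extra hybrid variance 0.1:
# 0.1 + 0.6^2 * 1.0 + 0.4^2 * 2.0 = 0.78
hybrid_length = effective_length([(0.6, 1.0), (0.4, 2.0)], hybrid_extra=0.1)
```

At a tree node the formula reduces to the parent edge length, which is consistent with $V_v = \ell(e)\Sigma$.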

At the root, we assume a prior variance proportional to $\Sigma$: $X_\rho \sim \mathcal{N}(\mu_\rho; \ell(\rho)\Sigma)$, which may be improper (and degenerate) with infinite variance $\ell(\rho) = \infty$, or with $\ell(\rho) = 0$.

This model can also accommodate within-species variation, by considering each individual as one leaf in the phylogeny, whose parent node corresponds to the species to which the individual belongs. The edge $e$ from the species to the individual is assigned length $\ell(e) = w$ and variance proportional to $\Sigma$ conditional on the parent node (species average): $\ell(e)\Sigma$. This model, then, assumes equal phenotypic (within species) correlation and evolutionary (between species) correlation between the $p$ traits. The derivations below assume a fixed variance ratio $w$, to be estimated separately.

All results in this section use this homogeneous BM model, and make the following assumption.

Assumption 1. At each leaf, the trait vector (of length $p$) is either fully observed or fully missing, i.e. there are no partially observed nodes.

D.2 Belief Propagation

Gaussian BP Algorithm 2 can be applied in the simple BM case to get the calibrated beliefs, with two traversals of

a clique tree (or convergence with inﬁnitely many traversals of a cluster graph). The following result states that the

conditional moments of all the nodes obtained from this calibration have a very special form. We will use it to derive

analytical formulas for the maximum likelihood estimators of the parameters of the BM.

Proposition 2. Assume the homogeneous BM (SM-1) and Assumption 1. The expectation of the trait at each node conditional on the observed data does not depend on the assumed $\Sigma$ parameter. In addition, the conditional variance matrix of a node trait, and its conditional covariance matrix with the trait of any of its parents, are proportional to $\Sigma$.


To prove this proposition, we need the following technical lemma, which we prove later.

Lemma 3. Consider a homogeneous $p$-dimensional BM on a network and Assumption 1. At each iteration of the calibration, each cluster and sepset of $s$ nodes has a belief whose canonical parameters are of the form:

$$
K = J \otimes \Sigma^{-1} \quad\text{and}\quad h = (I_s \otimes \Sigma^{-1}) \begin{bmatrix} m_1 \\ \vdots \\ m_s \end{bmatrix} = (I_s \otimes \Sigma^{-1})\,\mathrm{vec}(M) = \mathrm{vec}(\Sigma^{-1} M) \tag{SM-2}
$$

for some $s \times s$ symmetric matrix $J$ and vectors $m_i$ ($i = 1, \ldots, s$) of size $p$, where $M$ is the $p \times s$ matrix with $m_i$ on column $i$, and where $\mathrm{vec}$ denotes the vectorization operation formed by stacking columns. Further, $M$ depends linearly on the data $Y$ (from stacking the trait vectors at the tips) and $\mu_\rho$, separately across traits, in the sense that

$$
m_i = (w_i \otimes I_p) \begin{bmatrix} \mu_\rho \\ Y \end{bmatrix} \tag{SM-3}
$$

for some $1 \times (n+1)$ vector of weights $w_i$. $J$ and the vectors $w_i$ ($i = 1, \ldots, s$) are independent of the variance rate $\Sigma$, the data $Y$ and $\mu_\rho$. They only depend on the network, the chosen cluster graph, the chosen cluster or sepset in this graph, and the iteration number.

Proof of Proposition 2. First note that, for any belief with form (SM-2), the mean of the associated normalized Gaussian distribution can be expressed as follows. Let the vector $\mu_j$ of size $p$ be the mean for the node indexed $j$. Then

$$
\begin{bmatrix} \mu_1 \\ \vdots \\ \mu_s \end{bmatrix} = K^{-1} h = (J^{-1} \otimes I_p) \begin{bmatrix} m_1 \\ \vdots \\ m_s \end{bmatrix} = (J^{-1} \otimes I_p)\,\mathrm{vec}(M) = \mathrm{vec}(M J^{-1}).
$$

Let $E$ be the $p \times s$ matrix of means with $\mu_j$ on column $j$. Then the expression above simplifies to

$$
E = M J^{-1}.
$$

Assume Lemma 3, from which we re-use notations here. Let $C$ be a cluster, and $J$ and $M$ be its matrices from (SM-2). For any node $v \in C$, let $k_C(v)$ be the index of $v$ in $C$'s matrices. Then, writing $[J^{-1}]_{\bullet k}$ for the $k$th column vector of $J^{-1}$, we get:

$$
\mathrm{E}[X_v \mid Y] = \mu_{k_C(v)} = M [J^{-1}]_{\bullet k_C(v)} = E_v, \tag{SM-4}
$$

where $E_v$ denotes the column of $E$ for node $v$ (i.e. the conditional expectation of its trait), and does not depend on $\Sigma$. Assuming that calibration is reached, $E_v$ does not depend on the cluster $C$ (or sepset) containing $v$. Further, note that (SM-4) is exact on any cluster graph at calibration, not simply approximate, because we are using a Gaussian graphical model [Weiss and Freeman, 1999].

Similarly, for nodes $u, v$ in $C$:

$$
\begin{aligned}
\mathrm{var}[X_v \mid Y] &= [K^{-1}]_{k_C(v) k_C(v)} = [J^{-1}]_{k_C(v) k_C(v)}\, \Sigma, \\
\mathrm{cov}[X_v, X_u \mid Y] &= [K^{-1}]_{k_C(v) k_C(u)} = [J^{-1}]_{k_C(v) k_C(u)}\, \Sigma.
\end{aligned} \tag{SM-5}
$$

Therefore, their conditional variances and covariances are proportional to Σ.
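The identities behind (SM-4) and (SM-5) can be checked numerically on an arbitrary belief of form (SM-2). The sketch below is only a sanity check under assumptions: the matrices `J`, `M` and `Sigma` are randomly generated stand-ins, not beliefs obtained from an actual network, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
s, p = 3, 2  # assumed sizes: s nodes in the cluster, p traits

# Hypothetical belief: a symmetric positive definite J and any p x s matrix M.
A = rng.normal(size=(s, s))
J = A @ A.T + s * np.eye(s)
M = rng.normal(size=(p, s))


def moments(Sigma):
    """Mean (as a p x s matrix) and covariance of the Gaussian with
    canonical parameters K = J (x) Sigma^{-1} and h = vec(Sigma^{-1} M)."""
    Sigma_inv = np.linalg.inv(Sigma)
    K = np.kron(J, Sigma_inv)
    h = (Sigma_inv @ M).flatten(order="F")  # vec stacks columns
    cov = np.linalg.inv(K)
    mean = cov @ h
    return mean.reshape((p, s), order="F"), cov


B = rng.normal(size=(p, p))
Sigma = B @ B.T + p * np.eye(p)

E_sigma, cov_sigma = moments(Sigma)
E_identity, _ = moments(np.eye(p))  # same J and M, but Sigma = I_p
J_inv = np.linalg.inv(J)
```

In this check, `E_sigma` coincides with `M @ J_inv` and with `E_identity`, and `cov_sigma` equals `np.kron(J_inv, Sigma)`, in line with the means being free of $\Sigma$ and the (co)variances being proportional to it.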

In the following, with a slight abuse of notation, for any two nodes $u, v$, we will write $[J^{-1}]_{uv} = [J^{-1}]_{k_C(u) k_C(v)}$ and $[K^{-1}]_{uv} = [K^{-1}]_{k_C(u) k_C(v)}$ for the submatrices corresponding to the indices for $u$ and $v$ in $C$.

Since (SM-4) requires inverting $J$, whose size $s$ depends on the cluster, calculating the conditional means $E$ typically has complexity $O(s^3)$. As the $J$ and $M$ matrices appearing in (SM-4) and (SM-5) do not depend on $\Sigma$ (Lemma 3), they can be computed by running BP with any $\Sigma$ value, and we have the following.

Corollary 4. The $J$ and $M$ matrices in Lemma 3, used in (SM-4) and (SM-5), are obtained as a direct output of BP using $\Sigma = I_p$ to calibrate the cluster graph.

Using (SM-3) in Lemma 3 and the derivation of vectors $w_i$ at each BP update, given in the proof below, we obtain the following result.

Corollary 5. For each cluster, the weights $w_i$ appearing in (SM-3) can be obtained alongside BP for any trait using updates (SM-7) and (SM-8) below, until convergence of all $w_i$ weight vectors and $J$ matrices. These quantities can then be used to obtain conditional expectations and conditional (co)variances for any trait using (SM-3), (SM-4) and (SM-5).


For example, this result implies that obtaining calibrated conditional expectations for a large-dimensional trait can be done without handling large $p \times p$ matrices: by first calculating the $w_i$'s and $J$ for each cluster until convergence, and then re-using them repeatedly for each of the $p$ traits separately (without re-calibration).

Proof of Lemma 3.

We now show that the properties stated in Lemma 3 hold for each factor at initialization, and

continue to hold after each step of Algorithm 2: belief initialization, evidence absorption, and propagation.

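Factor Initialization. Using the notations from the main text, each factor $\phi_v(x_v \mid x_{\mathrm{pa}(v)})$ has canonical form over its full scope:

$$
\phi_v(x_v \mid x_{\mathrm{pa}(v)}) = \mathrm{C}\left( \begin{bmatrix} x_v \\ x_{\mathrm{pa}(v)} \end{bmatrix}; K_v, h_v, g_v \right).
$$

For any internal node or any leaf $v$ before evidence absorption, we get from (4.4) and (SM-1):

$$
K_v = J \otimes \Sigma^{-1} \ \text{ and } \ h_v = 0, \quad\text{where}\quad J = \frac{1}{\ell(v)} \begin{bmatrix} 1 & -\gamma^\top \\ -\gamma & \gamma\gamma^\top \end{bmatrix} \tag{SM-6}
$$

and $\gamma^\top = (\gamma_{uv};\ u \in \mathrm{pa}(v))$.

Hence all the node family factors have form (SM-2) at initialization and neither $J$ nor any $m_i = 0$ depend on the data. We can initialize $w_i = 0$ for each $i$ in the factor's scope.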
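To make (SM-6) concrete, the following sketch builds the $J$ matrix of a node family factor and checks that the quadratic form of $K_v = J \otimes \Sigma^{-1}$ matches the Gaussian residual form implied by (SM-1). The helper name `family_J` and all numerical values are assumptions chosen for illustration; NumPy is assumed to be available.

```python
import numpy as np


def family_J(gammas, ell_v):
    """J = (1 / l(v)) * [[1, -gamma^T], [-gamma, gamma gamma^T]] as in (SM-6).
    Variables are ordered with the child v first, then its parents."""
    gamma = np.asarray(gammas, dtype=float).reshape(-1, 1)
    top = np.hstack([np.ones((1, 1)), -gamma.T])
    bottom = np.hstack([-gamma, gamma @ gamma.T])
    return np.vstack([top, bottom]) / ell_v


# Assumed example: a hybrid node with two parents and p = 2 traits.
gammas, ell_v = [0.6, 0.4], 0.78
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

J = family_J(gammas, ell_v)
K = np.kron(J, np.linalg.inv(Sigma))

# Arbitrary trait values for the child and its two parents.
rng = np.random.default_rng(1)
x_v, x_u1, x_u2 = rng.normal(size=(3, 2))
x = np.concatenate([x_v, x_u1, x_u2])

# Residual of the child around the weighted average of its parents.
resid = x_v - (gammas[0] * x_u1 + gammas[1] * x_u2)
quad_canonical = x @ K @ x
quad_direct = resid @ np.linalg.inv(ell_v * Sigma) @ resid
```

The two quadratic forms agree, which is the reason the factor needs no linear term ($h_v = 0$) before evidence is absorbed.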

At the root $\rho$, the formulas above still hold using that $\mathrm{pa}(\rho)$ is empty and $\gamma$ has length 0, for any $0 < \ell(\rho) < +\infty$ and also for $\ell(\rho) = +\infty$ in which case $J = [0]$, independent of the data. For $h_\rho$, we have $h_\rho = \frac{1}{\ell(\rho)} \Sigma^{-1} \mu_\rho$, which satisfies (SM-2) with $m_\rho = \mu_\rho / \ell(\rho)$. $m_\rho$ is linear in $\mu_\rho$ and satisfies (SM-3) with $w_\rho = (\frac{1}{\ell(\rho)}, 0, \ldots, 0)$ (or simply $w_\rho = 0$ if $\ell(\rho) = +\infty$). If $\ell(\rho) = 0$, the root factor is not assigned to any cluster at initialization because it is instead handled during evidence absorption below. This is because $\ell(\rho) = 0$ implies that $X_\rho$ is fixed to the $\mu_\rho$ value, and this is handled similarly to leaves fixed at their observed values.

Belief Initialization. Now consider a cluster $C$ that is assigned factors $\phi_v$ for $k \geq 0$ nodes $\{v_1, \ldots, v_k\}$, by (4.2). Before this assignment, the belief for $C$ (and for all sepsets) is set to the constant function equal to 1, which trivially satisfies (SM-2) with $J = 0$ and every $m_i = 0$, and satisfies (SM-3) with every $w_i = 0$. To assign factor $\phi_{v_j}$ to $C$, we first extend the scope of $\phi_{v_j}$ to the scope of $C$ (re-ordering the rows and columns of $K_{v_j}$ and $h_{v_j}$ to match that of $C$). We then multiply $C$'s belief by the extended factor. So we now prove that (SM-2) is preserved by these two operations: extension and multiplication.

Belief Extension. Consider extending the scope of a belief with parameters $(K, h, g)$ satisfying (SM-2) to include all $p$ traits of an extra $(s+1)$th node. Without loss of generality, we assign these extra variables the last $p$ indices. Then the canonical parameters of the extended belief can be written as

$$
\tilde K = \begin{bmatrix} J & 0 \\ 0 & 0 \end{bmatrix} \otimes \Sigma^{-1} \quad\text{and}\quad \tilde h = (I_{s+1} \otimes \Sigma^{-1}) \begin{bmatrix} m_1 \\ \vdots \\ m_s \\ 0 \end{bmatrix}
$$

and continue to be of form (SM-2). $\tilde J$ continues to be independent of $\Sigma$ and of the data. All $m_i$ vectors involved in $\tilde h$ continue to be independent of $\Sigma$ and linear in the data according to (SM-3): with $w_i$ unchanged for $i \leq s$ and $w_i = 0$ for $i = s + 1$.

Beliefs Product and Quotient. Next, if $(K, h, g)$ and $(K', h', g')$ are the parameters of two beliefs on the same scope satisfying Lemma 3, then their product also satisfies Lemma 3 because the canonical form of the product has parameters $(K + K', h + h', g + g')$, and can be expressed with (SM-2) using $J + J'$ and $m_j + m'_j$. (SM-3) continues to hold using weight vectors $w_j + w'_j$. Similarly, the ratio of the two beliefs has parameters $(K - K', h - h', g - g')$ and continues to satisfy Lemma 3.

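Evidence Absorption. Assume that a belief satisfies Lemma 3, and that we want to absorb the evidence from one node $1 \leq u \leq s$. This node $u$ can be a leaf, or the root if $\ell(\rho) = 0$. If $u = \rho$ then the data to be absorbed is $x_u = \mu_\rho$. We need to express the canonical form of the belief as a function of $x_{-u} = [x_1^\top, \ldots, x_{u-1}^\top, x_{u+1}^\top, \ldots, x_s^\top]^\top$ only, letting the data $x_u$ appear in the canonical parameters. By Assumption 1, $x_u$ is of full length $p$, which maintains the block structure. We have:

$$
\begin{aligned}
\begin{bmatrix} x_1^\top \cdots x_s^\top \end{bmatrix} (J \otimes \Sigma^{-1}) \begin{bmatrix} x_1 \\ \vdots \\ x_s \end{bmatrix}
&= x_{-u}^\top (J_{-u} \otimes \Sigma^{-1}) x_{-u} + 2 \sum_{t \neq u} x_u^\top J_{ut} \Sigma^{-1} x_t + x_u^\top J_{uu} \Sigma^{-1} x_u \\
&= x_{-u}^\top (J_{-u} \otimes \Sigma^{-1}) x_{-u} + 2 (J_{-u,u} \otimes x_u)^\top (I_{s-1} \otimes \Sigma^{-1}) x_{-u} + x_u^\top J_{uu} \Sigma^{-1} x_u,
\end{aligned}
$$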


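where $J_{-u}$ is the $(s-1) \times (s-1)$ matrix $J$ without the row and column for $u$; and $J_{-u,u}$ is the $(s-1) \times 1$ column vector of $J$ for $u$ without the row entry for $u$. Likewise:

$$
\begin{bmatrix} m_1^\top \cdots m_s^\top \end{bmatrix} (I_s \otimes \Sigma^{-1}) \begin{bmatrix} x_1 \\ \vdots \\ x_s \end{bmatrix} = m_{-u}^\top (I_{s-1} \otimes \Sigma^{-1}) x_{-u} + m_u^\top \Sigma^{-1} x_u,
$$

where $m_{-u}$ is similarly defined as $x_{-u}$, so that the canonical form of the factor satisfies:

$$
\log \mathrm{C}(x_{-u}; K_{-u}, h_{-u}, g_{-u}) = -\frac{1}{2} x_{-u}^\top (J_{-u} \otimes \Sigma^{-1}) x_{-u} + (m_{-u} - J_{-u,u} \otimes x_u)^\top (I_{s-1} \otimes \Sigma^{-1}) x_{-u} + g_{-u},
$$

where $g_{-u}$ does not depend on $x_{-u}$, and

$$
K_{-u} = (J_{-u} \otimes \Sigma^{-1}) \quad\text{and}\quad h_{-u} = (I_{s-1} \otimes \Sigma^{-1})(m_{-u} - J_{-u,u} \otimes x_u)
$$

have form (SM-2). $J_{-u}$ continues to be independent of $\Sigma$ and of the data and $\mu_\rho$. All vectors involved in $h_{-u}$ continue to be independent of $\Sigma$ and linear in the data, with linear dependence on $x_u$ introduced in this step. Namely, (SM-3) holds with weight vector $w_j$ updated to:

$$
w_j - J_{j,u}\, e_u \quad\text{where}\quad e_u = (0, \ldots, 0, 1, 0, \ldots) \tag{SM-7}
$$

is the basis (row) vector of $\mathbb{R}^{n+1}$ with coordinate 1 at the position indexing tip $u$.

Note that, for a tip $v$ with data on all $p$ traits, we recover (4.4) for the factor associated with the external edge to $v$, whose scope is reduced to $x_{\mathrm{pa}(v)}$ after absorbing the evidence from $x_v$ (with $s = 2$):

$$
K_v = \frac{1}{\ell(v)} \Sigma^{-1}, \quad h_v = \frac{1}{\ell(v)} \Sigma^{-1} x_v, \quad\text{and}\quad g_v = -\frac{1}{2} \left( \log |2\pi \ell(v) \Sigma| + \frac{1}{\ell(v)} \| x_v \|^2_{\Sigma^{-1}} \right).
$$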

Propagation. Next, we show that beliefs continue to satisfy Lemma 3 after any propagation step of Algorithm 2. The first propagation step consists of marginalizing a belief, to calculate the message $\tilde\mu_{i \to j}$ from cluster $C_i$ to cluster $C_j$. Suppose that a belief with parameters $(K, h, g)$ satisfies (SM-2), and that we marginalize out all traits of one or more nodes in its scope. Let $I$ be the indices corresponding to nodes (or their traits, depending on the context, with some abuse of notation) to be marginalized and $S$ the indices corresponding to the remaining nodes (or their traits). Then, the marginal belief has canonical parameters $(\tilde K, \tilde h)$ with:

$$
\begin{aligned}
\tilde K &= K_S - K_{S,I} K_I^{-1} K_{I,S} \\
&= J_S \otimes \Sigma^{-1} - (J_{S,I} \otimes \Sigma^{-1})(J_I^{-1} \otimes \Sigma)(J_{I,S} \otimes \Sigma^{-1}) \\
&= (J_S - J_{S,I} J_I^{-1} J_{I,S}) \otimes \Sigma^{-1} = \tilde J \otimes \Sigma^{-1}
\end{aligned}
$$

and

$$
\tilde h = h_S - K_{S,I} K_I^{-1} h_I = h_S - (J_{S,I} \otimes \Sigma^{-1})(J_I^{-1} \otimes \Sigma) h_I = h_S - (J_{S,I} J_I^{-1} \otimes I_p) h_I = \begin{bmatrix} \Sigma^{-1} \tilde m_1 \\ \vdots \\ \Sigma^{-1} \tilde m_s \end{bmatrix}
$$

where, for $j \in S$:

$$
\tilde m_j = m_j - \sum_{i \in I} \left[ J_{S,I} J_I^{-1} \right]_{ji} m_i.
$$

So (SM-3) holds with updated weights:

$$
\tilde w_j = w_j - \sum_{i \in I} \left[ J_{S,I} J_I^{-1} \right]_{ji} w_i, \tag{SM-8}
$$

and the marginalized belief (message) is still of the form (SM-2) and continues to satisfy Lemma 3. The remaining propagation steps consist of dividing the message by the current sepset belief; extending the resulting quotient to the scope of the receiving cluster; and multiplying the receiving cluster's current belief with the extended quotient. Each of these steps was already proved to preserve the properties of Lemma 3, therefore the receiving cluster's new belief still satisfies Lemma 3. The sepset belief does too because it is updated with the message that was passed.
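The marginalization step can also be verified numerically: a Schur complement computed on the small $s \times s$ matrix $J$ gives the same message as the Schur complement on the full $sp \times sp$ precision matrix $K$. In the sketch below, `J`, `M`, `Sigma` and the index sets `keep` and `drop` are arbitrary assumptions used for illustration, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(2)
s, p = 4, 2  # assumed sizes
A = rng.normal(size=(s, s))
J = A @ A.T + s * np.eye(s)
M = rng.normal(size=(p, s))
B = rng.normal(size=(p, p))
Sigma_inv = np.linalg.inv(B @ B.T + p * np.eye(p))

keep, drop = [0, 2], [1, 3]  # node indices S (kept) and I (marginalized)

# Node-level update, involving only matrices of size at most s x s.
G = J[np.ix_(keep, drop)] @ np.linalg.inv(J[np.ix_(drop, drop)])  # J_{S,I} J_I^{-1}
J_tilde = J[np.ix_(keep, keep)] - G @ J[np.ix_(drop, keep)]
M_tilde = M[:, keep] - M[:, drop] @ G.T  # column j is m_j - sum_i G[j, i] m_i

# Trait-level update on the full canonical parameters, for comparison.
K = np.kron(J, Sigma_inv)
h = (Sigma_inv @ M).flatten(order="F")


def trait_idx(nodes):
    """Indices of all p traits of the given nodes in the stacked vector."""
    return np.concatenate([np.arange(p * k, p * (k + 1)) for k in nodes])


iS, iI = trait_idx(keep), trait_idx(drop)
K_SI_KI_inv = K[np.ix_(iS, iI)] @ np.linalg.inv(K[np.ix_(iI, iI)])
K_tilde = K[np.ix_(iS, iS)] - K_SI_KI_inv @ K[np.ix_(iI, iS)]
h_tilde = h[iS] - K_SI_KI_inv @ h[iI]
```

Here `K_tilde` equals `np.kron(J_tilde, Sigma_inv)` and `h_tilde` equals the vectorization of `Sigma_inv @ M_tilde`, so the message keeps the form (SM-2). The same linear map `G` applied to the weight vectors gives update (SM-8).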


D.3 Gradient computation and analytical formula for parameter estimates

D.3.1 Gradients of factors

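When the factors are linear Gaussian as in (3.1), their derivative with respect to any vector of parameters $\theta$ can be written as:

$$
\begin{aligned}
\nabla_\theta \log \phi_v(X_v \mid X_{\mathrm{pa}(v)}, \theta) ={}& \frac{\partial\, [q_v X_{\mathrm{pa}(v)} + \omega_v]^\top}{\partial \theta}\, V_v^{-1} \left( X_v - q_v X_{\mathrm{pa}(v)} - \omega_v \right) \\
&+ \frac{1}{2} \frac{\partial\, \mathrm{vech}(V_v^{-1})^\top}{\partial \theta}\, \mathrm{vech}\left( V_v - (X_v - q_v X_{\mathrm{pa}(v)} - \omega_v)(X_v - q_v X_{\mathrm{pa}(v)} - \omega_v)^\top \right),
\end{aligned} \tag{SM-9}
$$

where $\mathrm{vech}$ is the symmetric vectorization operation [Magnus and Neudecker, 1986]. In the BM case, (3.1) simplifies to (SM-1), so that, for non-root nodes:

$$
\begin{aligned}
\nabla_\theta \log \phi_v(X_v \mid X_{\mathrm{pa}(v)}, \theta) ={}& \frac{\partial \big[ \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \big]^\top}{\partial \theta}\, \ell(v)^{-1} \Sigma^{-1} \Big( X_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \Big) \\
&+ \frac{1}{2} \frac{\partial\, \mathrm{vech}(\ell(v)^{-1} \Sigma^{-1})^\top}{\partial \theta}\, \mathrm{vech}\Big( \ell(v) \Sigma - \Big( X_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \Big) \Big( X_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \Big)^\top \Big),
\end{aligned}
$$

and, for the root $\rho$, assuming $0 < \ell(\rho) < +\infty$,

$$
\nabla_\theta [\log \phi_\rho(X_\rho \mid \theta)] = \frac{\partial [\mu_\rho]^\top}{\partial \theta}\, \ell(\rho)^{-1} \Sigma^{-1} (X_\rho - \mu_\rho) + \frac{1}{2} \frac{\partial\, \mathrm{vech}(\ell(\rho)^{-1} \Sigma^{-1})^\top}{\partial \theta}\, \mathrm{vech}\left( \ell(\rho) \Sigma - (X_\rho - \mu_\rho)(X_\rho - \mu_\rho)^\top \right).
$$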

D.3.2 Estimation of µρ

Note that $\mu_\rho$ has no impact on the model and need not be estimated if $\ell(\rho) = \infty$ (improper flat prior). We assume here that $0 < \ell(\rho) < +\infty$, and will consider the case $\ell(\rho) = 0$ later. Only the root factor depends on $\mu_\rho$. Taking its gradient with respect to $\mu_\rho$, we get:

$$
\nabla_{\mu_\rho} [\log \phi_\rho(X_\rho \mid \theta)] = \ell(\rho)^{-1} \Sigma^{-1} (X_\rho - \mu_\rho).
$$

To apply Fisher's formula (6.1), we take the expectation $\mathrm{E}_\theta[\,\bullet \mid Y]$ of this gradient conditional on all the data $Y$:

$$
\nabla_{\mu'_\rho} [\log p_{\theta'}(Y)] \big|_{\mu'_\rho = \mu_\rho} = \mathrm{E}_\theta \Big[ \nabla_{\mu'_\rho} [\log \phi_\rho(X_\rho \mid \theta')] \big|_{\mu'_\rho = \mu_\rho} \,\Big|\, Y \Big] = \ell(\rho)^{-1} \Sigma^{-1} \left( \mathrm{E}_\theta[X_\rho \mid Y] - \mu_\rho \right).
$$

Setting this gradient to 0, we get:

$$
\hat\mu_\rho = \mathrm{E}_\theta[X_\rho \mid Y] = E_\rho \tag{SM-10}
$$

where $C$ is any cluster containing $\rho$ in its scope, and $J$ and $M$ are the matrices in Lemma 3 for its belief (with $E = M J^{-1}$ as in (SM-4)). Note that by Lemma 3, this estimate is independent of the assumed $\Sigma$ used during calibration. This procedure corresponds to maximum likelihood estimation under the assumption that $\ell(\rho)$ is known. Under this model, $\mu_\rho$ represents the ancestral state at time $\ell(\rho)$ prior to the root node $\rho$, which is typically taken as the most recent common ancestor of the sampled leaves. This is equivalent to considering an extra root edge of length $\ell(\rho)$ above $\rho$, whose parent node has ancestral state $\mu_\rho$. Then $\hat\mu_\rho$ is a maximum likelihood estimate of this ancestral state, or an approximation thereof if a cluster graph is used instead of a clique tree. Note that, in a Bayesian setting, when fixing $\ell(\rho)$ to a given value, and fixing $\mu_\rho = 0$, this model can be seen as setting a Gaussian prior on the value at the root of the tree. This is the model used e.g. in BEAST [Fisher et al., 2021].

D.3.3 Estimation of Σ

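We now take the gradient with respect to the vectorized precision parameter $P = \mathrm{vech}(\Sigma^{-1})$, of length $p(p+1)/2$. For $v \neq \rho$, we get:

$$
\nabla_P \log \phi_v(X_v \mid X_{\mathrm{pa}(v)}, \theta) = \frac{1}{2} \ell(v)^{-1}\, \mathrm{vech}\Big( \ell(v) \Sigma - \Big( X_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \Big) \Big( X_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \Big)^\top \Big)
$$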


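and, for the root $\rho$:

$$
\nabla_P [\log \phi_\rho(X_\rho \mid \theta)] = \frac{1}{2} \ell(\rho)^{-1}\, \mathrm{vech}\left( \ell(\rho) \Sigma - (X_\rho - \mu_\rho)(X_\rho - \mu_\rho)^\top \right).
$$

Applying again Fisher's formula (6.1), we get:

$$
\nabla_{P'} [\log p_{\theta'}(Y)] \big|_{P' = P} = \frac{1}{2} \sum_{v \in V} \mathrm{vech}\left( \Sigma - \ell(v)^{-1} F_v \right),
$$

where $F_v$ is derived next, using that $\mathrm{E}[Z Z^\top] = \mathrm{var}[Z] + \mathrm{E}[Z]\,\mathrm{E}[Z]^\top$ and using (SM-4) and (SM-5) on a cluster $C$ containing $v$ and its parents in its scope, with $J_v$ and $M_v$ from Lemma 3 for $C$. For $v \neq \rho$, we get:

$$
\begin{aligned}
F_v &= \mathrm{E}_\theta \Big[ \Big( X_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \Big) \Big( X_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \Big)^\top \,\Big|\, Y \Big] \\
&= \mathrm{var}_\theta \Big[ X_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \,\Big|\, Y \Big] + \mathrm{E}_\theta \Big[ X_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \,\Big|\, Y \Big]\, \mathrm{E}_\theta \Big[ X_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} X_u \,\Big|\, Y \Big]^\top \\
&= \Big( [J_v^{-1}]_{vv} + \sum_{u_1, u_2 \in \mathrm{pa}(v)} \gamma_{u_1 v} \gamma_{u_2 v} [J_v^{-1}]_{u_1 u_2} - 2 \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} [J_v^{-1}]_{vu} \Big) \Sigma \\
&\quad + \Big( E_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} E_u \Big) \Big( E_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} E_u \Big)^\top.
\end{aligned}
$$

For the root $\rho$:

$$
\begin{aligned}
F_\rho &= \mathrm{E}_\theta \left[ (X_\rho - \mu_\rho)(X_\rho - \mu_\rho)^\top \mid Y \right] \\
&= \mathrm{var}_\theta [X_\rho - \mu_\rho \mid Y] + \mathrm{E}_\theta [X_\rho - \mu_\rho \mid Y]\, \mathrm{E}_\theta [X_\rho - \mu_\rho \mid Y]^\top \\
&= [J_\rho^{-1}]_{\rho\rho}\, \Sigma + (E_\rho - \mu_\rho)(E_\rho - \mu_\rho)^\top.
\end{aligned}
$$

Setting this gradient to 0 with respect to $\Sigma$, we get the following maximum likelihood estimate for the rate matrix:

$$
\begin{aligned}
\hat\Sigma ={}& \Big[ \ell(\rho)^{-1} (E_\rho - \hat\mu_\rho)(E_\rho - \hat\mu_\rho)^\top + \sum_{v \in V,\, v \neq \rho} \ell(v)^{-1} \Big( E_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} E_u \Big) \Big( E_v - \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} E_u \Big)^\top \Big] \\
&\times \Big[ \sum_{v \in V} \Big( 1 - \ell(v)^{-1} \Big( [J_v^{-1}]_{vv} + \sum_{u_1, u_2 \in \mathrm{pa}(v)} \gamma_{u_1 v} \gamma_{u_2 v} [J_v^{-1}]_{u_1 u_2} - 2 \sum_{u \in \mathrm{pa}(v)} \gamma_{uv} [J_v^{-1}]_{vu} \Big) \Big) \Big]^{-1}
\end{aligned} \tag{SM-11}
$$

where we use the convention that a sum over an empty set (here $\mathrm{pa}(\rho)$) is 0.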

Note that this formula only uses the calibrated moments computed at each cluster. After calibration, then, calculating $\hat\Sigma$ with SM-11 has complexity $O(|V|(k^3+p^2))$ where $k$ is the maximum cluster size, since SM-11 requires inverting at most $|V|$ matrices of size at most $k\times k$ and the crossproduct of at most $|V|$ vectors of size $p$. The final product is a scalar scaling of a $p\times p$ matrix. Calibrating the clique tree or cluster graph is more complex, because each BP update has complexity up to $O(k^3p^3)$. If the phylogeny is a tree, a clique tree has $k=2$ and $|V|=2n-1$, so that SM-11 has complexity linear in the number of tips. While PIC can get these estimates in only one traversal of the tree, this formula requires two traversals of the clique tree, but is more general as it applies to any phylogenetic network.
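A dense numerical check of (SM-11) in the univariate case ($p=1$) may help fix ideas. The sketch below is not BP: it obtains the conditional means $E_v$ and the scalar factors multiplying $\Sigma$ in $F_v$ by brute-force Gaussian conditioning on the joint distribution of all nodes, which is what a calibrated clique tree would provide, and then compares (SM-11) with the direct ML estimate $(y-m)^\top C^{-1}(y-m)/n$. Everything specific here is an assumption made for illustration: the function name, the toy network, its edge lengths, the data, a known root mean $\mu_\rho$ with $\ell(\rho)>0$, and $\ell(v)=\sum_u\gamma_{uv}^2\ell(e_{uv})$ at the hybrid node.

```python
import numpy as np

def sm11_dense_check(parents, ell, tips, y, mu_root):
    """Evaluate (SM-11) for a univariate BM using dense Gaussian conditioning.

    parents[v]: list of (u, gamma_uv) pairs; nodes are in preorder, node 0 is the root.
    ell[v]: innovation variance of node v per unit rate (ell[0] is the root edge length).
    Returns (estimate from SM-11, direct ML estimate with known root mean).
    """
    ell = np.asarray(ell, dtype=float)
    y = np.asarray(y, dtype=float)
    n_nodes = len(parents)
    A = np.zeros((n_nodes, n_nodes))
    for v, pa in enumerate(parents):
        for u, gamma in pa:
            A[v, u] = gamma
    I_A = np.eye(n_nodes) - A             # maps node states to innovations
    B = np.linalg.inv(I_A)
    m = np.zeros(n_nodes)
    m[0] = mu_root                        # only the root innovation has a nonzero mean
    mean = B @ m
    cov = B @ np.diag(ell) @ B.T          # joint covariance for unit variance rate

    hidden = [v for v in range(n_nodes) if v not in tips]
    C_tt = cov[np.ix_(tips, tips)]
    C_ht = cov[np.ix_(hidden, tips)]
    resid = y - mean[tips]

    # Conditional means E_v and unit-rate conditional covariance given the tips.
    E = np.zeros(n_nodes)
    E[tips] = y
    E[hidden] = mean[hidden] + C_ht @ np.linalg.solve(C_tt, resid)
    S = np.zeros((n_nodes, n_nodes))
    S[np.ix_(hidden, hidden)] = (cov[np.ix_(hidden, hidden)]
                                 - C_ht @ np.linalg.solve(C_tt, C_ht.T))

    r = I_A @ E - m                       # E_v - sum_u gamma_uv E_u (minus mu at the root)
    c = np.diag(I_A @ S @ I_A.T)          # scalar multiplying Sigma in F_v
    sigma2_sm11 = np.sum(r**2 / ell) / np.sum(1.0 - c / ell)
    sigma2_direct = resid @ np.linalg.solve(C_tt, resid) / len(tips)
    return sigma2_sm11, sigma2_direct

# Toy network: root 0, internal nodes 1 and 2, hybrid node 3 (parents 1 and 2),
# tips 4 (child of 1), 5 (child of 3) and 6 (child of 2).
parents = [[], [(0, 1.0)], [(0, 1.0)], [(1, 0.5), (2, 0.5)],
           [(1, 1.0)], [(3, 1.0)], [(2, 1.0)]]
ell = [0.5, 1.0, 0.8, 0.45, 0.6, 1.2, 0.9]
tips = [4, 5, 6]
y = [1.2, 0.4, -0.7]
sm11, direct = sm11_dense_check(parents, ell, tips, y, mu_root=0.3)
print(sm11, direct)
```

The dense conditioning costs $O(|V|^3)$ and is only meant as a reference against which a BP-based implementation could be compared on small examples.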

D.3.4 ML and REML estimation

Restricted maximum likelihood (REML) estimation can be framed as integrating out fixed effects [Harville, 1974], here $\mu_\rho$, to estimate covariance parameters, here the BM variance rate $\Sigma$. This model corresponds to placing an improper prior on the root using $\ell(\rho) = +\infty$, in which case $\mu_\rho$ is irrelevant. Then (SM-11) remains valid (with vanishing terms for the root) and gives an analytical formula for the REML estimate of $\Sigma$.


For maximum likelihood (ML) estimation of $\mu_\rho$, considered as the state $X_\rho$ at the root node $\rho$, we need to consider the case $\ell(\rho)=0$ to fix $X_\rho=\mu_\rho$. Under $\ell(\rho)=0$, (SM-10) cannot be calculated because $X_\rho=\mu_\rho$ was absorbed as evidence and removed from scope. Instead, we note that under an improper root with infinite variance, the posterior density of the root trait conditional on all the tips is proportional to the likelihood: $p(X_\rho \mid Y) \propto p(Y \mid X_\rho)\times p(X_\rho)$, because $p(X_\rho)\equiv 1$ under an improper prior on $X_\rho$. Therefore, maximizing the likelihood $p(Y \mid X_\rho)$ in the root parameter $X_\rho=\mu_\rho$ amounts to maximizing the density $p(X_\rho \mid Y)$ in $X_\rho$. This density is Gaussian with expectation $E_\rho$ by (SM-4), so its maximum is attained at $\hat\mu_\rho = \mathrm{E}_\theta[X_\rho \mid Y] = E_\rho$. In summary, the ML estimate of $\mu_\rho$ is still given by (SM-10), but calculated by running BP under an improper prior at the root.

D.4 Analytical formula for phylogenetic regression

In the previous section, we derived analytical formulas (SM-10) and (SM-11) for estimating the parameters of a homogeneous multivariate BM on a phylogenetic network, using the output of only one BP calibration thanks to Corollary 4.

Instead of fitting a multivariate process, it is often of interest to look at the distribution of one particular trait conditional on all others. This phylogenetic regression setting is for instance used on a network in Bastide et al. [2018b]. Writing $V$ the (univariate) trait of interest measured at the $n$ tips of a network, and $U$ the $p\times n$ matrix of regressors, we are interested in the model:

$$V = U^\top\beta + \epsilon, \tag{SM-12}$$

with $\beta$ a vector of $p$ coefficients, and $\epsilon$ a vector of residuals with expectation $0$ and a variance-covariance matrix that is given by a univariate BM on the network with variance rate $\sigma^2$, and a root fixed to $0$. The $v$th column $U_{\bullet v}$ corresponds to the predictors at leaf $v$ and will be denoted as $U_v$.

In this setting, explicit maximum likelihood estimators for $\beta$ and $\sigma^2$ are available, but they involve the inverse of an $n\times n$ matrix, with $O(n^3)$ complexity. Our goal is to get these estimators in linear time.

D.4.1 Parameter estimation using the joint distribution

To build on section D.3, we first look at the joint distribution of the response and predictors $V$ and $U$. Setting the intercept aside, we slightly rewrite model SM-12 (with a slight change of notation for $U$ and $p$) to:

$$V = \alpha\mathbf{1} + U^\top\beta + \epsilon, \tag{SM-13}$$

with $\alpha$ a scalar, $\mathbf{1}$ the vector of ones, $\beta$ a vector of $p$ coefficients, and $\epsilon$ a vector of residuals with expectation $0$ and a variance-covariance matrix given by a univariate BM on the network with variance rate $\sigma^2$, and a root fixed to $0$.

Assuming that the joint trait $X=(V,U)$, of dimension $p+1$, is jointly Gaussian and evolving on the network with variance rate $\Sigma_X = \begin{pmatrix}\Sigma_{VV} & \Sigma_{VU}\\ \Sigma_{UV} & \Sigma_{UU}\end{pmatrix}$, we obtain the regression model above with

$$\beta = \Sigma_{UU}^{-1}\Sigma_{UV} \quad\text{and}\quad \sigma^2 = \Sigma_{VV} - \Sigma_{VU}\Sigma_{UU}^{-1}\Sigma_{UV}. \tag{SM-14}$$

This is because a joint BM evolution for $X$ implies that the evolutionary changes in $V$ and $U$ along each branch $e$, $(\Delta V)_e$ and $(\Delta U)_e$, are jointly Gaussian $\mathcal{N}(0, \ell(e)\Sigma_X)$ and independent of previous evolutionary changes. By classical Gaussian conditioning, this means that $(\Delta V)_e = (\Delta U)_e^\top\beta + (\Delta\epsilon)_e$ where $(\Delta\epsilon)_e \sim \mathcal{N}(0, \ell(e)\sigma^2)$ and independent of $(\Delta U)_e$. At a hybrid node, the merging rule holds for both $V$ and $U$ with the same inheritance weights, so by induction on the nodes (in preorder) we get that $V_u = \alpha + U_u^\top\beta + \epsilon_u$ at every node $u$ in the network, with $\alpha = V_\rho - U_\rho^\top\beta$, and with $\epsilon$ following a BM process with variance rate $\sigma^2$ starting at $\epsilon_\rho = 0$. Therefore SM-13 holds at the tips.

Consequently, we can apply formulas (SM-10) and (SM-11) to get maximum likelihood (or REML) estimates $\hat\mu_X$ and $\hat\Sigma_X$ of the joint expectation and variance rate matrix of $X$. We can then plug in these estimates in SM-14 to get:

$$\hat\alpha = \hat\mu_V - \hat\Sigma_{VU}\hat\Sigma_{UU}^{-1}\hat\mu_U, \quad \hat\beta = \hat\Sigma_{UU}^{-1}\hat\Sigma_{UV}, \quad\text{and}\quad \hat\sigma^2 = \hat\Sigma_{VV} - \hat\Sigma_{VU}\hat\Sigma_{UU}^{-1}\hat\Sigma_{UV}, \tag{SM-15}$$

where $\hat\mu_V$ and $\hat\mu_U$ are, respectively, the scalar and vector of size $p$ extracted from $\hat\mu_X$ for traits $V$ and $U$, and, similarly, $\hat\Sigma_{VU}$, $\hat\Sigma_{UV}$, $\hat\Sigma_{VV}$ and $\hat\Sigma_{UU}$ are the sub-matrices of dimension $1\times p$, $p\times 1$, $1\times 1$ and $p\times p$ extracted from $\hat\Sigma_X$. As calculating $\hat\mu_X$ and $\hat\Sigma_X$ via (SM-10) and (SM-11) has complexity $O(|V|(k^3+p^3))$ where $|V|$ is the number of nodes in the network and $k$ is the maximum cluster size, obtaining $\hat\alpha$, $\hat\beta$ and $\hat\sigma^2$ with SM-15 has that same complexity, which can be much smaller than $O(n^3)$. If the phylogeny is a tree, this complexity depends linearly on $n$.
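The plug-in step of (SM-15) is a small piece of linear algebra once the joint estimates are available. The sketch below assumes that the estimates of the joint mean and variance rate have already been obtained (for instance from SM-10 and SM-11) and that trait $V$ is stored first; the function name and the numerical values are hypothetical.

```python
import numpy as np

def regression_from_joint(mu_X, Sigma_X):
    """Plug joint estimates into (SM-15); trait V is assumed to be stored first."""
    mu_X = np.asarray(mu_X, dtype=float)
    Sigma_X = np.asarray(Sigma_X, dtype=float)
    mu_V, mu_U = mu_X[0], mu_X[1:]
    S_VV = Sigma_X[0, 0]
    S_UV = Sigma_X[1:, 0]
    S_UU = Sigma_X[1:, 1:]
    beta = np.linalg.solve(S_UU, S_UV)    # Sigma_UU^{-1} Sigma_UV
    alpha = mu_V - beta @ mu_U            # Sigma_VU Sigma_UU^{-1} mu_U equals beta^T mu_U
    sigma2 = S_VV - S_UV @ beta           # Schur complement of Sigma_UU in Sigma_X
    return alpha, beta, sigma2

# Hypothetical joint estimates for one response and p = 2 predictors.
mu_hat = np.array([1.0, 0.5, -0.2])
Sigma_hat = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.0, 0.2],
                      [0.3, 0.2, 0.5]])
alpha, beta, sigma2 = regression_from_joint(mu_hat, Sigma_hat)
print(alpha, beta, sigma2)
```

Solving the $p\times p$ system instead of forming the inverse explicitly keeps this step at $O(p^3)$, in line with the complexity stated above.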


D.4.2 Direct parameter estimation using the marginal distribution

Going back to model (SM-12), we do not assume that $X=(V,U)$ is jointly Gaussian and make no assumption about $U$. The distribution assumption is solely on the residual $\epsilon$. Model SM-12 then amounts to a trait $Y^\beta = V - U^\top\beta$ at the tips (for a given $\beta$) evolving under a homogeneous univariate BM model with variance $\sigma^2$. We denote by $X$ the corresponding trait at all network nodes, whose values $Y^\beta$ at the tips depend on $\beta$.

We can apply Fisher's formula (6.1) to this model, taking the derivative with respect to $\beta$:

$$\nabla_{\beta'}\big[\log p(V - U^\top\beta')\big]\Big|_{\beta'=\beta} = \sum_{v\in V}\mathrm{E}_\theta\Big[\nabla_{\beta'}\big[\log\phi_v(X_v \mid X_{\mathrm{pa}(v)}, \theta')\big]\Big|_{\beta'=\beta} \,\Big|\, V - U^\top\beta\Big] \tag{SM-16}$$

In this sum, the only factors $\phi_v$ that depend on $\beta'$ are the factors at the tips. In phylogenies, leaves are typically constrained to have a single parent, although extending our derivation to the case of hybrid leaves would be straightforward. For a leaf $v$ with parent $\mathrm{pa}(v)=\{u\}$, we have: $Y_v^{\beta'} \mid X_{\mathrm{pa}(v)} \sim \mathcal{N}(X_u;\, \sigma^2\ell(v))$, and $Y_v^{\beta'} = V_v - U_v^\top\beta'$, so that $\phi_v(Y_v^{\beta'} \mid X_{\mathrm{pa}(v)}, \beta') = \phi_v(V_v \mid U_v^\top\beta' + X_u, \beta')$. Using the Gaussian derivative formula (SM-9), we get:

$$\begin{aligned}
\nabla_{\beta'}\big[\log\phi_v(Y_v^{\beta'} \mid X_{\mathrm{pa}(v)}, \theta')\big]\Big|_{\beta'=\beta} &= \nabla_{\beta'}\big[\log\phi_v(V_v \mid U_v^\top\beta' + X_u, \theta')\big]\Big|_{\beta'=\beta} \\
&= \Bigg[\frac{\partial(X_u + U_v^\top\beta')}{\partial\beta'}\Bigg]^{\top}\Bigg|_{\beta'=\beta}\,[\sigma^2\ell(v)]^{-1}\big(V_v - (X_u + U_v^\top\beta)\big) \\
&= U_v\,[\sigma^2\ell(v)]^{-1}\big(V_v - U_v^\top\beta - X_u\big).
\end{aligned}$$

Using Lemma 3, the expectation of $X_u$ conditional on the observed values at the tips, $\mathrm{E}_\theta(X_u \mid Y^\beta)$, is linear in the data so that by (SM-4):

$$\mathrm{E}_\theta(X_u \mid Y^\beta) = \mathrm{E}_\theta(X_u \mid Y^\beta)^\top = [J_u^{-1}]_{u\bullet}(M_u^V)^\top - [J_u^{-1}]_{u\bullet}(M_u^U)^\top\beta = E_u^V - (E_u^U)^\top\beta,$$

where $M_u^V$ and $M_u^U$ denote, respectively, the BP quantities of Lemma 3 when applied to the traits $V$ and $U$ separately. Note that $M_u^U$ can also be obtained by running BP on each of the $p$ rows of $U$ independently, because $J_u$ does not depend on the data and $M_u$ depends linearly on the data. As $V$ is a trait of dimension 1, $[J_u^{-1}]_{u\bullet}$ is a row vector whose size is the number of nodes in the chosen cluster containing $u$; and $[J_u^{-1}]_{u\bullet}(M_u^V)^\top = E_u^V$ is a scalar. Also, $[J_u^{-1}]_{u\bullet}(M_u^U)^\top = (E_u^U)^\top$ is a row-matrix of size $1\times p$, so that $[J_u^{-1}]_{u\bullet}(M_u^U)^\top\beta = (E_u^U)^\top\beta$ is also a scalar. We can hence write, for leaf $v$ with parent $u$:

$$\mathrm{E}_\theta\Big[\nabla_{\beta'}\big[\log\phi_v(X_v \mid X_u, \theta')\big]\Big|_{\beta'=\beta} \,\Big|\, Y^\beta\Big] = U_v\,[\sigma^2\ell(v)]^{-1}\Big(V_v - U_v^\top\beta - \big(E_u^V - (E_u^U)^\top\beta\big)\Big).$$

Taking the sum and cancelling the gradient in $\beta$, we get:

$$\hat\beta = \Bigg(\sum_{\text{leaf } v}\ell(v)^{-1}\,U_v\big(U_v - E^U_{\mathrm{pa}(v)}\big)^{\top}\Bigg)^{-1}\sum_{\text{leaf } v}\ell(v)^{-1}\,U_v\big(V_v - E^V_{\mathrm{pa}(v)}\big). \tag{SM-17}$$

The weights $\ell(v)^{-1}$ come from the factor $[\sigma^2\ell(v)]^{-1}$ in the gradient above, in which $\sigma^2$ cancels. Note that the first term of the product involves the inversion of a $p\times p$ matrix, and that this formula outputs a vector of size $p$. To get all the quantities needed in this formula, we just need one BP calibration of the cluster graph with multivariate traits $(V,U)$ to get the conditional means and variances, which can be done efficiently using only univariate traits thanks to Corollary 5.

Finally, to get an estimator of the residual variance $\sigma^2$, we can run another BP calibration, taking $\hat\epsilon = V - U^\top\hat\beta$ as the tip trait values, and then use the formulas from the previous section. Using an infinite root variance for this last BP traversal gives us the REML estimate of the variance.

If the phylogeny is a tree, this algorithm involves $p+2$ univariate BP calibrations, each requiring two traversals of the tree, sums of $O(n)$ terms in SM-17 and other formulas, and a $p\times p$ matrix inversion, so calculating $\hat\beta$ and $\hat\sigma^2$ is linear in the number of tips. Comparatively, the algorithm used in the R package phylolm [Ho and Ané, 2014] only needs one multivariate traversal of the tree. Our algorithm is more general however, as it applies to any phylogenetic network and to any associated cluster graph.
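As a sanity check of (SM-17), the sketch below compares it with the usual generalized least squares estimate $(UC^{-1}U^\top)^{-1}UC^{-1}V$ on a small tree with unequal leaf edge lengths. It is not BP: the conditional expectations at the parents of the leaves are obtained by dense Gaussian conditioning, standing in for the output of a calibrated clique tree. Each leaf term is weighted by $1/\ell(v)$, as implied by the factor $[\sigma^2\ell(v)]^{-1}$ in the gradient. The function name, the tree, its edge lengths and the simulated data are hypothetical.

```python
import numpy as np

def sm17_dense_check(parent, ell, tips, V, U):
    """Compare (SM-17) with GLS on a tree whose root (node 0) is fixed to 0.

    parent[v]: index of the parent of node v (None for the root).
    ell[v]: length of the edge above node v (ignored for the root).
    V: (n,) response at the tips; U: (p, n) predictors at the tips.
    """
    ell = np.asarray(ell, dtype=float)
    n_nodes = len(parent)
    A = np.zeros((n_nodes, n_nodes))
    for v, u in enumerate(parent):
        if u is not None:
            A[v, u] = 1.0
    B = np.linalg.inv(np.eye(n_nodes) - A)
    D = np.diag(ell)
    D[0, 0] = 0.0                          # root fixed: no variance at the root
    cov = B @ D @ B.T                      # unit-rate BM covariance of all nodes

    pa_tips = [parent[v] for v in tips]
    C_tt = cov[np.ix_(tips, tips)]
    C_pt = cov[np.ix_(pa_tips, tips)]

    def parent_means(data):
        # E[X_pa(v) | tips = data] for each tip v (zero-mean Gaussian conditioning)
        return C_pt @ np.linalg.solve(C_tt, data)

    Ut = U.T                               # (n, p): row v holds U_v
    EV = parent_means(V)                   # E^V_pa(v) for each tip
    EU = parent_means(Ut)                  # row v holds E^U_pa(v)

    w = 1.0 / ell[tips]                    # weights 1 / ell(v)
    lhs = Ut.T @ (w[:, None] * (Ut - EU))  # sum_v U_v (U_v - E^U_pa(v))^T / ell(v)
    rhs = Ut.T @ (w * (V - EV))            # sum_v U_v (V_v - E^V_pa(v)) / ell(v)
    beta_sm17 = np.linalg.solve(lhs, rhs)
    beta_gls = np.linalg.solve(Ut.T @ np.linalg.solve(C_tt, Ut),
                               Ut.T @ np.linalg.solve(C_tt, V))
    return beta_sm17, beta_gls

# Tree: root 0, internal nodes 1 and 2, tips 3 and 4 below node 1, tips 5 and 6 below node 2.
parent = [None, 0, 0, 1, 1, 2, 2]
ell = [0.0, 1.0, 0.5, 0.3, 0.7, 0.4, 1.1]
tips = [3, 4, 5, 6]
rng = np.random.default_rng(0)
U = rng.normal(size=(2, 4))
V = rng.normal(size=4)
beta_sm17, beta_gls = sm17_dense_check(parent, ell, tips, V, U)
print(beta_sm17, beta_gls)
```

The GLS line inverts an $n\times n$ matrix, which is exactly the cost that (SM-17) avoids when the conditional expectations come from BP.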

E Regularizing initial beliefs

At initialization, each factor is assigned to a cluster whose scope includes all nodes from that factor. Then the initial belief $\beta_i$ of a cluster $\mathcal{C}_i$ is the product of all factors assigned to it by (4.2). Sepsets are not assigned any factors so their beliefs $\mu_{i,j}$ are initialized to 1. This assignment guarantees that the joint density $p_\theta$ of the graphical model equals the following quantity at initialization:

$$\frac{\prod_{\mathcal{C}_i\in\mathcal{V}^*}\beta_i}{\prod_{\{\mathcal{C}_i,\mathcal{C}_j\}\in\mathcal{E}^*}\mu_{i,j}}. \tag{SM-18}$$

(SM-18) is called the graph invariant because BP modifies cluster and sepset beliefs without changing the value of this quantity, and hence keeps it equal to $p_\theta$ [Koller and Friedman, 2009]. Initialization with (4.2) can lead to degenerate messages, as highlighted in section 7.1. However, other belief assignments are permitted, provided that (SM-18) equals $p_\theta$ at initialization. Modifying beliefs between BP iterations is also permitted, provided that (SM-18) is unchanged.

Regularization modifies the belief precisions to make them non-degenerate. To maintain the graph invariant, every modification to a cluster belief is balanced by a modification to an adjacent sepset belief. We describe two basic regularization algorithms below, but many others could also be considered.

Algorithm R3 Regularization along variable subtrees

1: for all variable $x$ do
2:     $T_x \leftarrow$ subtree induced by all clusters containing $x$
3:     fix $\epsilon > 0$
4:     for all sepsets and for all but one cluster in $T_x$ do
5:         add $\epsilon$ to the diagonal entry of its belief's precision matrix corresponding to $x$

Algorithm R4 Regularization on a schedule

1: Choose an ordering of clusters: $\mathcal{C}_1,\ldots,\mathcal{C}_{|V|}$
2: For each cluster $\mathcal{C}_i$ and each neighbor $\mathcal{C}_j$ of $\mathcal{C}_i$, set $i\to j$ as unvisited
3: for all $i = 1,\ldots,|V|$ do
4:     for all neighbor $\mathcal{C}_j$ of $\mathcal{C}_i$ do
5:         if $j\to i$ is unvisited then
6:             fix $\epsilon > 0$
7:             add $\epsilon I$ to the precision matrix of the sepset $\mathcal{S}_{i,j}$
8:             add $\epsilon$ to the diagonal entry of $\mathcal{C}_i$'s precision matrix corresponding to each variable in $\mathcal{S}_{i,j}$
9:             mark $j\to i$ as visited
10:     for all neighbor $\mathcal{C}_k$ of $\mathcal{C}_i$ do
11:         if $i\to k$ is unvisited then
12:             propagate belief from $\mathcal{C}_i$ to $\mathcal{C}_k$ by Algorithm 2
13:             mark $i\to k$ as visited

In Algorithm R3, each modified belief is multiplied by a regularization factor $\exp\left(-\frac{1}{2}\epsilon x^2\right)$. The graph invariant is satisfied because $T_x$ must be a tree (by the running intersection property), so the same number of clusters and sepsets are modified and the regularization factors cancel out in (SM-18). In Algorithm R4, the same argument applies to modifications on lines 7 and 8, which cancel out in (SM-18) so the graph invariant is maintained. It is also maintained on line 12, which uses BP.

The choice of the regularization constant $\epsilon$ is not specified above, but should be adapted to the magnitude of entries in the affected precision matrices.
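The cancellation argument for Algorithm R3 can be checked numerically on precision matrices alone, since the regularization factor only changes the quadratic part of each canonical form. In the sketch below, the precision of the graph invariant is represented as the sum of cluster precisions minus the sum of sepset precisions, each embedded in the full variable ordering. The three-cluster graph, its initial precisions, the value of $\epsilon$, the choice of which cluster is left untouched, and all function names are hypothetical.

```python
import numpy as np

def embed(scope, K, variables):
    """Embed a precision matrix over `scope` into the full variable ordering."""
    idx = [variables.index(s) for s in scope]
    full = np.zeros((len(variables), len(variables)))
    full[np.ix_(idx, idx)] = K
    return full

def invariant_precision(clusters, sepsets, variables):
    """Precision of (SM-18): cluster precisions add up, sepset precisions subtract."""
    total = np.zeros((len(variables), len(variables)))
    for scope, K in clusters:
        total += embed(scope, K, variables)
    for scope, K in sepsets:
        total -= embed(scope, K, variables)
    return total

def regularize_r3(clusters, sepsets, x, eps):
    """Lines 4-5 of R3 for one variable x: add eps for x in every sepset of T_x
    and in all but one cluster of T_x (here, the first one found is skipped)."""
    new_clusters, skipped = [], False
    for scope, K in clusters:
        K = K.copy()
        if x in scope:
            if skipped:
                K[scope.index(x), scope.index(x)] += eps
            else:
                skipped = True
        new_clusters.append((scope, K))
    new_sepsets = []
    for scope, K in sepsets:
        K = K.copy()
        if x in scope:
            K[scope.index(x), scope.index(x)] += eps
        new_sepsets.append((scope, K))
    return new_clusters, new_sepsets

variables = ["a", "x", "b", "c"]
K0 = np.array([[1.0, -1.0], [-1.0, 1.0]])      # degenerate (rank 1) precision
clusters = [(("a", "x"), K0.copy()), (("x", "b"), K0.copy()), (("x", "c"), K0.copy())]
sepsets = [(("x",), np.zeros((1, 1))), (("x",), np.zeros((1, 1)))]  # T_x is a path

before = invariant_precision(clusters, sepsets, variables)
reg_clusters, reg_sepsets = regularize_r3(clusters, sepsets, "x", eps=0.1)
after = invariant_precision(reg_clusters, reg_sepsets, variables)
print(np.allclose(before, after))
```

Two clusters and two sepsets receive $\epsilon$ in this example, so the additions cancel in the invariant while the modified beliefs become non-degenerate.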

Both algorithms performed comparably well on the join-graph structuring cluster graphs used in Fig. 7 and Fig. S2. On

the factor graph for the complex network, however, Algorithm R4 was found to work better than R3. Namely, beliefs

remained persistently degenerate after initial regularization with R3, such that the estimated conditional means and

factored energy could not be computed.

Both algorithms are illustrated in Figure S4.


Figure S4: Applying algorithms R3 and R4 on the cluster graph from Fig. 4(d), for a univariate BM model with mean 0 and variance rate 1, edge lengths of 1 in the original network and inheritance probabilities of 0.5. Cluster/sepset precision matrices have rows labelled by variables to show the nodes in scope. Precision matrices show entries before regularization (black) and after one pass through the outermost loop of the algorithm (coloured adjustments). Left: regularization R3 starting with variable $x_{10}$. Right: regularization R4 starting with cluster $\{8,10\}$, assuming that it is the first cluster scheduled to be processed. For R4, we differentiate the effects of lines 3-8 (blue) and lines 9-12 (red). For example, the resulting precision matrix for sepset $\{x_{10}\}$ is $[3\tilde\epsilon]$ after summing these effects, where $\tilde\epsilon = \epsilon + o(\epsilon)$.

F Handling deterministic factors

This section illustrates two approaches to running BP in the presence of a deterministic Gaussian factor that arises

because the state at a hybrid node is a linear combination of its parents’ states.

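Let $X$ be a univariate continuous trait evolving on the 3-taxon network in Fig. 2a (reproduced in Fig. S5a) under a BM model with ancestral state 0 at the root and variance rate $\sigma^2$. For simplicity, we assume that tree edges have length 1, hybrid edges have length 0, and inheritance probabilities are 1/2. The conditional distribution for each node given its parents is non-deterministic and can be expressed in a canonical form, except for $X_5$ at the hybrid node. Because of 0-length hybrid edges, we have the deterministic relationship: $X_5 = (X_4+X_6)/2$. After absorbing evidence $\mathrm{x}_1, \mathrm{x}_2, \mathrm{x}_3$ at the tips and fixing $x_\rho = 0$, the factors are:

$$\begin{aligned}
\phi_1 &= \mathrm{C}(x_4;\ \sigma^{-2},\ \sigma^{-2}\mathrm{x}_1,\ g_1) & \phi_2 &= \mathrm{C}(x_5;\ \sigma^{-2},\ \sigma^{-2}\mathrm{x}_2,\ g_2) & \phi_3 &= \mathrm{C}(x_6;\ \sigma^{-2},\ \sigma^{-2}\mathrm{x}_3,\ g_3) \\
\phi_4 &= \mathrm{C}(x_4;\ \sigma^{-2},\ 0,\ g_4) & \phi_6 &= \mathrm{C}(x_6;\ \sigma^{-2},\ 0,\ g_6) & \phi_5 &= \delta\big(x_5 - (x_4+x_6)/2\big)
\end{aligned}$$

where $g_i$ normalizes $\phi_i$ to a valid probability density and $\delta(\cdot)$ denotes a Dirac distribution at 0. $\mathcal{U}$ in Fig. 2b (reproduced in Fig. S5b) remains a valid clique tree for this model. We index the cliques in $\mathcal{U}$ as $\mathcal{C}_1=\{x_1,x_4\}$, $\mathcal{C}_2=\{x_2,x_5\}$, $\mathcal{C}_3=\{x_3,x_6\}$, $\mathcal{C}_4=\{x_5,x_4,x_6\}$, $\mathcal{C}_5=\{x_4,x_6,x_\rho\}$. We set $\mathcal{C}_5$ as the root clique and assume the following factor assignment for $\mathcal{U}$: $\phi_1\mapsto\mathcal{C}_1$, $\phi_2\mapsto\mathcal{C}_2$, $\phi_3\mapsto\mathcal{C}_3$, $\phi_5\mapsto\mathcal{C}_4$, $\{\phi_4,\phi_6\}\mapsto\mathcal{C}_5$.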

F.1 Substitution

The substitution approach removes the Dirac factor $\phi_5$ by removing $x_5$ from the model, substituting it by $(x_4+x_6)/2$ where needed. Since $\phi_2$ has scope $\{x_5\}$, it is reparametrized as $\phi_2'$ on scope $\{x_4,x_6\}$:

$$\phi_2' = \mathrm{C}\left(\begin{bmatrix}x_4\\x_6\end{bmatrix};\ \frac{1}{4\sigma^2}\begin{bmatrix}1&1\\1&1\end{bmatrix},\ \frac{\mathrm{x}_2}{2\sigma^2}\begin{bmatrix}1\\1\end{bmatrix},\ g_2\right).$$

For the simple univariate BM, it is well known in the admixture graph literature [Pickrell and Pritchard, 2012] that this substitution corresponds to using a modified network $N'$ in which hybrid edges do not all have length 0 (Fig. S5c). $N'$ is built from the original network $N$ by removing the hybrid node 5 and connecting its parents (nodes 4 and 6) to its child (node 2) with edges of lengths $\ell_4=\ell_6=2$ for example (to ensure that $\gamma_4^2\ell_4+\gamma_6^2\ell_6$ equals the length of the original child edge to node 2). A clique tree $\mathcal{U}'$ for $N'$ can be obtained from $\mathcal{U}$ by replacing $\mathcal{C}_2$ and $\mathcal{C}_4$ with $\mathcal{C}_4'=\{x_2,x_4,x_6\}$ (Fig. S5d). Factor $\phi_2'$ is assigned to $\mathcal{C}_4'$ while the other factor assignments stay the same.


Figure S5: (a) Network $N$ from Fig. 2a. (b) Clique tree $\mathcal{U}$ from Fig. 2b. (c) Network $N'$ obtained by removing the hybrid node 5 from $N$ in (a). The BM model on $N$ leads to the same probability model for the nodes in $N'$ as the BM model on $N'$, given a valid assignment of hybrid edge lengths in $N'$ (see text). (d) Clique tree $\mathcal{U}'$ for $N'$ after moralization.

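Standard BP can be used for the BM model on $N'$ because all factors are non-degenerate. After one postorder traversal of $\mathcal{U}'$, the message $\tilde\mu_{4'\to5}$, final belief $\beta_5$ and log-likelihood $\mathrm{LL}(\sigma^2) = \log p_{\sigma^2}(\mathrm{x}_1,\mathrm{x}_2,\mathrm{x}_3)$ are:

$$\tilde\mu_{4'\to5} = \psi_{4'}\,\tilde\mu_{1\to4'}\,\tilde\mu_{3\to4'} = \phi_2'\,\phi_1\,\phi_3 = \mathrm{C}\left(\begin{bmatrix}x_4\\x_6\end{bmatrix};\ \frac{1}{4\sigma^2}\begin{bmatrix}5&1\\1&5\end{bmatrix},\ \frac{1}{2\sigma^2}\begin{bmatrix}2\mathrm{x}_1+\mathrm{x}_2\\2\mathrm{x}_3+\mathrm{x}_2\end{bmatrix},\ \sum_{i=1}^{3}g_i\right)$$

$$\beta_5 = \psi_5\,\tilde\mu_{4'\to5} = \phi_4\,\phi_6\,\tilde\mu_{4'\to5} = \mathrm{C}\left(\begin{bmatrix}x_4\\x_6\end{bmatrix};\ K=\frac{1}{4\sigma^2}\begin{bmatrix}9&1\\1&9\end{bmatrix},\ h=\frac{1}{2\sigma^2}\begin{bmatrix}2\mathrm{x}_1+\mathrm{x}_2\\2\mathrm{x}_3+\mathrm{x}_2\end{bmatrix},\ g=\sum_{i=1,\,i\neq5}^{6}g_i\right)$$

$$\mathrm{LL}(\sigma^2) = \log\int\beta_5\,dx_4\,dx_6 = \sum_{i=1,\,i\neq5}^{6}g_i + \frac{1}{2}\Big(\log\big|2\pi K^{-1}\big| + h^\top K^{-1}h\Big),$$

with $K$ and $h$ the precision matrix and potential vector of $\beta_5$ given above.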

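We can still recover the conditional distribution of $X_5$ from $\beta_5$, because $\beta_5$ has scope $\{x_4,x_6\}$. Let $(K,h,g)$ be the parameters of the canonical form of $\beta_5$, given above. After the postorder traversal of $\mathcal{U}'$, $\beta_5$ contains information from all the tips such that the distribution of $(X_4,X_6)$ conditional on the data $(\mathrm{x}_1,\mathrm{x}_2,\mathrm{x}_3)$ is $\mathcal{N}(K^{-1}h,\ K^{-1})$. Since $X_5 = \gamma^\top\begin{bmatrix}X_4\\X_6\end{bmatrix}$ with $\gamma^\top = [1/2,\ 1/2]$, we get

$$X_5 \mid (\mathrm{x}_1,\mathrm{x}_2,\mathrm{x}_3) \sim \mathcal{N}\big(\gamma^\top K^{-1}h,\ \gamma^\top K^{-1}\gamma\big) = \mathcal{N}\left(\sum_{i=1}^{3}\mathrm{x}_i/5,\ \sigma^2/5\right).$$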
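The closed forms obtained with the substitution approach can be verified against a brute-force computation on the joint Gaussian distribution of the original network. The sketch below evaluates the log-likelihood and the conditional distribution of $X_5$ from the canonical parameters $(K,h,g)$ of $\beta_5$, then recomputes them by dense Gaussian conditioning. The tip values and $\sigma^2$ are hypothetical, and the explicit expression used for $\sum g_i$ is derived here from the requirement that each $g_i$ normalizes a Gaussian density with variance $\sigma^2$ (it is not spelled out in the text).

```python
import numpy as np

sigma2 = 0.7
x1, x2, x3 = 1.3, -0.4, 0.8               # hypothetical tip values

# Final belief beta_5 in canonical form, as given in the text.
K = np.array([[9.0, 1.0], [1.0, 9.0]]) / (4 * sigma2)
h = np.array([2 * x1 + x2, 2 * x3 + x2]) / (2 * sigma2)
# g_1, g_2, g_3 normalize the tip factors; g_4 and g_6 normalize the factors of x4 and x6.
g = -(x1**2 + x2**2 + x3**2) / (2 * sigma2) - 2.5 * np.log(2 * np.pi * sigma2)

K_inv = np.linalg.inv(K)
ll_bp = g + 0.5 * (np.log(np.linalg.det(2 * np.pi * K_inv)) + h @ K_inv @ h)
gamma = np.array([0.5, 0.5])
mean_x5 = gamma @ K_inv @ h               # gamma^T K^{-1} h
var_x5 = gamma @ K_inv @ gamma            # gamma^T K^{-1} gamma

# Brute force: each row writes a node as a combination of the independent
# N(0, sigma2) innovations (e4, e6, e1, e2, e3).
L = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],  # X4
              [0.0, 1.0, 0.0, 0.0, 0.0],  # X6
              [0.5, 0.5, 0.0, 0.0, 0.0],  # X5 = (X4 + X6) / 2
              [1.0, 0.0, 1.0, 0.0, 0.0],  # X1 = X4 + e1
              [0.5, 0.5, 0.0, 1.0, 0.0],  # X2 = X5 + e2
              [0.0, 1.0, 0.0, 0.0, 1.0]]) # X3 = X6 + e3
cov = sigma2 * L @ L.T
tips = [3, 4, 5]
data = np.array([x1, x2, x3])
C_tt = cov[np.ix_(tips, tips)]
ll_direct = -0.5 * (np.log(np.linalg.det(2 * np.pi * C_tt))
                    + data @ np.linalg.solve(C_tt, data))
w = np.linalg.solve(C_tt, cov[tips, 2])
mean_direct = w @ data                    # E[X5 | tips]
var_direct = cov[2, 2] - cov[2, tips] @ w # var[X5 | tips]

print(ll_bp, ll_direct)
print(mean_x5, mean_direct, var_x5, var_direct)
```

Both routes should agree up to floating-point error, and the conditional mean and variance of $X_5$ should match $\sum_i \mathrm{x}_i/5$ and $\sigma^2/5$.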

F.2 Generalized canonical form

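A more general approach generalizes canonical form operations to include Dirac distributions without modifying the original set of factors and clique tree, as demonstrated in Schoeman et al. [2022]. Crucially, they derived message passing operations (evidence absorption, marginalization, factor product, etc.) for a generalized canonical form:

$$\mathcal{D}(x;\ Q, R, \Lambda, h, c, g) := \mathrm{C}(Q^\top x;\ \Lambda, h, g)\cdot\delta(R^\top x - c)$$

where $x$ is an $n$-dimensional vector, $\Lambda\succeq0$ is a $(n-k)\times(n-k)$ diagonal matrix, and $Q$ and $R$ are matrices of dimension $n\times(n-k)$ and $n\times k$ respectively, that are orthonormal and orthogonal to each other, that is: $Q^\top Q = I_{n-k}$, $R^\top R = I_k$, and $Q^\top R = 0$. If $Q$ is square (thus invertible) then $R$ is empty and the Dirac $\delta(\cdot)$ term is dropped or defined as 1. The same applies to the $\mathrm{C}(\cdot)$ term if $R$ is square (and $Q$ is empty). Non-deterministic linear Gaussian factors are represented in generalized canonical form with $Q$ square from the eigendecomposition of $K = Q\Lambda Q^\top$. Thus, we can run BP on $\mathcal{U}$, converting beliefs or messages to generalized canonical form as needed.

Running BP according to a postorder traversal of $\mathcal{U}$, the message $\tilde\mu_{4\to5}$ involves a degenerate component:

$$\tilde\mu_{4\to5} = \int\psi_4\prod_{i=1}^{3}\tilde\mu_{i\to4}\,dx_5 = \int\phi_5\prod_{i=1}^{3}\phi_i\,dx_5.$$

To compute $\tilde\mu_{4\to5}$, we first convert $\phi_5$ and $\prod_{i=1}^{3}\phi_i$ to generalized canonical forms:

$$\phi_5 = \mathcal{D}\left(\begin{bmatrix}x_4\\x_6\\x_5\end{bmatrix};\ \begin{bmatrix}1/w_1&1/w_2\\1/w_1&-1/w_2\\1/w_1&0\end{bmatrix},\ \begin{bmatrix}1/(2w_0)\\1/(2w_0)\\-1/w_0\end{bmatrix},\ \Lambda_5=\begin{bmatrix}0&0\\0&0\end{bmatrix},\ \begin{bmatrix}0\\0\end{bmatrix},\ 0,\ 0\right)$$

$$\prod_{i=1}^{3}\phi_i = \mathcal{D}\left(\begin{bmatrix}x_4\\x_6\\x_5\end{bmatrix};\ I_3,\ -,\ \sigma^{-2}I_3,\ \sigma^{-2}\begin{bmatrix}\mathrm{x}_1\\\mathrm{x}_3\\\mathrm{x}_2\end{bmatrix},\ -,\ \sum_{i=1}^{3}g_i\right)$$

where $(w_0,w_1,w_2) = (\sqrt{6/4},\sqrt{3},\sqrt{2})$ are normalization constants for the respective columns, and the dashes indicate that the $\delta(\cdot)$ part is dropped. By Schoeman et al. [2022, Algorithm 3], their product evaluates to:

$$\phi_5\prod_{i=1}^{3}\phi_i = \mathcal{D}\left(\begin{bmatrix}x_4\\x_6\\x_5\end{bmatrix};\ Q=\begin{bmatrix}1/w_1&1/w_2\\1/w_1&-1/w_2\\1/w_1&0\end{bmatrix},\ R=\begin{bmatrix}1/(2w_0)\\1/(2w_0)\\-1/w_0\end{bmatrix},\ \sigma^{-2}I_2,\ h=\sigma^{-2}\begin{bmatrix}(\mathrm{x}_1+\mathrm{x}_2+\mathrm{x}_3)/w_1\\(\mathrm{x}_1-\mathrm{x}_3)/w_2\end{bmatrix},\ 0,\ \sum_{i=1}^{3}g_i\right)$$

We partition $Q = \begin{bmatrix}Q_{4,6}\\Q_5\end{bmatrix}$ and $R = \begin{bmatrix}R_{4,6}\\R_5\end{bmatrix}$ by separating the first two rows from the last row. Then by Schoeman et al. [2022, Algorithm 2], integrating out $x_5$ yields:

$$\int\mathcal{D}\left(\begin{bmatrix}x_4\\x_6\\x_5\end{bmatrix};\ \begin{bmatrix}Q_{4,6}\\Q_5\end{bmatrix},\ \begin{bmatrix}R_{4,6}\\R_5\end{bmatrix},\ \sigma^{-2}I_2,\ h,\ 0,\ \sum_{i=1}^{3}g_i\right)dx_5 = \mathcal{D}\left(\begin{bmatrix}x_4\\x_6\end{bmatrix};\ Q_{4\to5},\ -,\ \Lambda_{4\to5},\ h_{4\to5},\ -,\ \sum_{i=1}^{3}g_i\right)$$

where, following notations from Schoeman et al. [2022] for intermediate quantities in their Algorithm 3:

$$\begin{aligned}
U &= I_2, \quad W = [1] \\
F &= \big(W(R_5W)^{+}Q_5\big)^\top = \begin{bmatrix}-w_0/w_1\\0\end{bmatrix} \\
G &= \big(Q_{4,6}^\top - FR_{4,6}^\top\big)U = \begin{bmatrix}3/(2w_1)&3/(2w_1)\\1/w_2&-1/w_2\end{bmatrix} \\
Z\Lambda_{4\to5}Z^\top &= \mathrm{SVD}\big(G^\top(\sigma^{-2}I_2)G\big) = \mathrm{SVD}\left(\frac{1}{4\sigma^2}\begin{bmatrix}5&1\\1&5\end{bmatrix}\right) \\
\text{therefore } Z &= \begin{bmatrix}1/w_2&1/w_2\\1/w_2&-1/w_2\end{bmatrix} \text{ and } \Lambda_{4\to5} = \frac{1}{2\sigma^2}\begin{bmatrix}3&0\\0&2\end{bmatrix} \\
Q_{4\to5} &= UZ = Z \\
h_{4\to5} &= Z^\top G^\top h = \frac{\sigma^{-2}}{w_2}\begin{bmatrix}\mathrm{x}_1+\mathrm{x}_2+\mathrm{x}_3\\\mathrm{x}_1-\mathrm{x}_3\end{bmatrix}
\end{aligned}$$

This generalized canonical form can be rewritten as a standard canonical form:

$$\begin{aligned}
\tilde\mu_{4\to5} &= \mathrm{C}\left(Q_{4\to5}^\top\begin{bmatrix}x_4\\x_6\end{bmatrix};\ \Lambda_{4\to5},\ h_{4\to5},\ \sum_{i=1}^{3}g_i\right) = \mathrm{C}\left(\begin{bmatrix}x_4\\x_6\end{bmatrix};\ Q_{4\to5}\Lambda_{4\to5}Q_{4\to5}^\top,\ Q_{4\to5}h_{4\to5},\ \sum_{i=1}^{3}g_i\right) \\
&= \mathrm{C}\left(\begin{bmatrix}x_4\\x_6\end{bmatrix};\ \frac{1}{4\sigma^2}\begin{bmatrix}5&1\\1&5\end{bmatrix},\ \frac{1}{2\sigma^2}\begin{bmatrix}2\mathrm{x}_1+\mathrm{x}_2\\2\mathrm{x}_3+\mathrm{x}_2\end{bmatrix},\ \sum_{i=1}^{3}g_i\right)
\end{aligned}$$

which agrees with $\tilde\mu_{4'\to5}$ from the substitution approach in F.1. Since the remaining operations to compute $\beta_5 = \psi_5\tilde\mu_{4\to5}$ and $\mathrm{LL}(\sigma^2) = \log\int\beta_5\,dx_4\,dx_6$ involve non-deterministic canonical forms, it is clear that they both evaluate to the same quantity as when using the substitution approach above.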
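The intermediate quantities of this marginalization can be reproduced numerically. The sketch below follows the formulas as written in this section, with rows ordered as $(x_4, x_6, x_5)$; it is not an implementation of the algorithms of Schoeman et al. [2022]. As assumptions, it uses `numpy.linalg.eigh` where the text writes SVD (the matrix is symmetric positive semidefinite, so the two decompositions coincide up to ordering and signs), and the tip values and $\sigma^2$ are hypothetical.

```python
import numpy as np

sigma2 = 0.7
x1, x2, x3 = 1.3, -0.4, 0.8                  # hypothetical tip values
w0, w1, w2 = np.sqrt(6 / 4), np.sqrt(3), np.sqrt(2)

# Generalized canonical form of the product phi_5 * phi_1 * phi_2 * phi_3.
Q = np.array([[1 / w1,  1 / w2],
              [1 / w1, -1 / w2],
              [1 / w1,  0.0]])
R = np.array([[1 / (2 * w0)], [1 / (2 * w0)], [-1 / w0]])
Lam = np.eye(2) / sigma2
h = np.array([(x1 + x2 + x3) / w1, (x1 - x3) / w2]) / sigma2

# Partition: first two rows (x4, x6) versus last row (x5).
Q46, Q5 = Q[:2], Q[2:]
R46, R5 = R[:2], R[2:]
U = np.eye(2)
W = np.array([[1.0]])
F = (W @ np.linalg.pinv(R5 @ W) @ Q5).T      # shape (2, 1)
G = (Q46.T - F @ R46.T) @ U                  # shape (2, 2)
lam, Z = np.linalg.eigh(G.T @ Lam @ G)       # eigenvalues in ascending order
Q_msg = U @ Z
h_msg = Z.T @ G.T @ h

# Back to a standard canonical form on (x4, x6).
K_std = Q_msg @ np.diag(lam) @ Q_msg.T
h_std = Q_msg @ h_msg

# Message from the substitution approach, for comparison.
K_sub = np.array([[5.0, 1.0], [1.0, 5.0]]) / (4 * sigma2)
h_sub = np.array([2 * x1 + x2, 2 * x3 + x2]) / (2 * sigma2)
print(np.allclose(K_std, K_sub), np.allclose(h_std, h_sub))
```

The comparison is insensitive to the ordering and sign of the eigenvectors returned by `eigh`, because only the products $Q_{4\to5}\Lambda_{4\to5}Q_{4\to5}^\top$ and $Q_{4\to5}h_{4\to5}$ enter the standard canonical form.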

ResearchGate has not been able to resolve any citations for this publication.

Exploring the Distribution of Phylogenetic Networks Generated Under a Birth-Death-Hybridization Process

Article

Full-text available

Mar 2024

Gene-flow processes such as hybridization and introgression play important roles in shaping diversity across the tree of life. Recent studies extending birth-death models have made it possible to investigate patterns of reticulation in a macroevolutionary context. These models allow for different macroevolutionary patterns of gene flow events that can either add, maintain, or remove lineages—with the gene flow itself possibly being dependent on the relatedness between species—thus creating complex diversification scenarios. Further, many reticulate phylogenetic inference methods assume specific reticulation structures or phylogenies belonging to certain network classes. However, the distributions of phylogenetic networks under reticulate birth-death processes are poorly characterized, and it is unknown whether they violate common methodological assumptions. We use simulation techniques to explore phylogenetic network space under a birth-death-hybridization process where the hybridization rate can have a linear dependence on genetic distance. Specifically, we measured the number of lineages through time and role of hybridization in diversification along with the proportion of phylogenetic networks that belong to commonly used network classes (e.g., tree-child, tree-based, or level-1 networks). We find that the growth of phylogenetic networks and class membership are largely affected by assumptions about macroevolutionary patterns of gene flow. In accordance with previous studies, a lower proportion of networks belonged to these classes based on type and density of reticulate events. However, under a birth-death-hybridization process, these factors form an antagonistic relationship; the type of reticulation events that cause high membership proportions also lead to the highest reticulation density, consequently lowering the overall proportion of phylogenies in some classes. 
Further, we observed that genetic distance–dependent gene flow and incomplete sampling increase the proportion of class membership, primarily due to having fewer reticulate events. Our results can inform studies if their biological expectations of gene flow are associated with evolutionary histories that satisfy the assumptions of current methodology and aid in finding phylogenetic classes that are relevant for methods development.

Anomalous networks under the multispecies coalescent: theory and prevalence

Article

Full-text available

Feb 2024
J MATH BIOL

Reticulations in a phylogenetic network represent processes such as gene flow, admixture, recombination and hybrid speciation. Extending definitions from the tree setting, an anomalous network is one in which some unrooted tree topology displayed in the network appears in gene trees with a lower frequency than a tree not displayed in the network. We investigate anomalous networks under the Network Multispecies Coalescent Model with possible correlated inheritance at reticulations. Focusing on subsets of 4 taxa, we describe a new algorithm to calculate quartet concordance factors on networks of any level, faster than previous algorithms because of its focus on 4 taxa. We then study topological properties required for a 4-taxon network to be anomalous, uncovering the key role of \(3_2\)-cycles: cycles of 3 edges parent to a sister group of 2 taxa. Under the model of common inheritance, that is, when each gene tree coalesces within a species tree displayed in the network, we prove that 4-taxon networks are never anomalous. Under independent and various levels of correlated inheritance, we use simulations under realistic parameters to quantify the prevalence of anomalous 4-taxon networks, finding that truly anomalous networks are rare. At the same time, however, we find a significant fraction of networks close enough to the anomaly zone to appear anomalous, when considering the quartet concordance factors observed from a few hundred genes. These apparent anomalies may challenge network inference methods.

Accounting for Within-Species Variation in Continuous Trait Evolution on a Phylogenetic Network

Article

Full-text available

Oct 2023

Within-species trait variation may be the result of genetic variation, environmental variation, or measurement error, for example. In phylogenetic comparative studies, failing to account for within-species variation has many adverse effects, such as increased error in testing hypotheses about evolutionary correlations, biased estimates of evolutionary rates, and inaccurate inference of the mode of evolution. These adverse effects were demonstrated in studies that considered a tree-like underlying phylogeny. Comparative methods on phylogenetic networks are still in their infancy. The impact of within-species variation on network-based methods has not been studied. Here, we introduce a phylogenetic linear model in which the phylogeny can be a network to account for within-species variation in the continuous response trait assuming equal within-species variances across species. We show how inference based on the individual values can be reduced to a problem using species-level summaries, even when the within-species variance is estimated. Our method performs well under various simulation settings and is robust when within-species variances are unequal across species. When phenotypic (within-species) correlations differ from evolutionary (between-species) correlations, estimates of evolutionary coefficients are pulled towards the phenotypic coefficients for all methods we tested. Also, evolutionary rates are either underestimated or overestimated, depending on the mismatch between phenotypic and evolutionary relationships. We applied our method to morphological and geographical data from Polemonium. We find a strong negative correlation of leaflet size with elevation, despite a positive correlation within species. Our method can explore the role of gene flow in trait evolution by comparing the fit of a network to that of a tree. 
We find marginal evidence for leaflet size being affected by gene flow and support for previous observations on the challenges of using individual continuous traits to infer inheritance weights at reticulations. Our method is freely available in the Julia package PhyloNetworks.

Ancient DNA reveals genetic admixture in China during tiger evolution

Article

Full-text available

Aug 2023
Nat. Ecol. Evol.

The tiger (Panthera tigris) is a charismatic megafauna species that originated and diversified in Asia and probably experienced population contraction and expansion during the Pleistocene, resulting in low genetic diversity of modern tigers. However, little is known about patterns of genomic diversity in ancient populations. Here we generated whole-genome sequences from ancient or historical (100–10,000 yr old) specimens collected across mainland Asia, including a 10,600-yr-old Russian Far East specimen (RUSA21, 8× coverage) plus six ancient mitogenomes, 14 South China tigers (0.1–12×) and three Caspian tigers (4–8×). Admixture analysis showed that RUSA21 clustered within modern Northeast Asian phylogroups and partially derived from an extinct Late Pleistocene lineage. While some of the 8,000–10,000-yr-old Russian Far East mitogenomes are basal to all tigers, one 2,000-yr-old specimen resembles present Amur tigers. Phylogenomic analyses suggested that the Caspian tiger probably dispersed from an ancestral Northeast Asian population and experienced gene flow from southern Bengal tigers. Lastly, genome-wide monophyly supported the South China tiger as a distinct subspecies, albeit with mitochondrial paraphyly, hence resolving its longstanding taxonomic controversy. The distribution of mitochondrial haplogroups corroborated by biogeographical modelling suggested that Southwest China was a Late Pleistocene refugium for a relic basal lineage. As suitable habitat returned, admixture between divergent lineages of South China tigers took place in Eastern China, promoting the evolution of other northern subspecies. Altogether, our analysis of ancient genomes sheds light on the evolutionary history of tigers and supports the existence of nine modern subspecies.

Automatic Differentiation is no Panacea for Phylogenetic Gradient Computation

Article

Full-text available

Jun 2023

Gradients of probabilistic model likelihoods with respect to their parameters are essential for modern computational statistics and machine learning. These calculations are readily available for arbitrary models via "automatic differentiation" implemented in general-purpose machine-learning libraries such as TensorFlow and PyTorch. Although these libraries are highly optimized, it is not clear if their general-purpose nature will limit their algorithmic complexity or implementation speed for the phylogenetic case compared to phylogenetics-specific code. In this paper, we compare six gradient implementations of the phylogenetic likelihood functions, in isolation and also as part of a variational inference procedure. We find that although automatic differentiation can scale approximately linearly in tree size, it is much slower than the carefully-implemented gradient calculation for tree likelihood and ratio transformation operations. We conclude that a mixed approach combining phylogenetic libraries with machine learning libraries will provide the optimal combination of speed and model exibility moving forward.

SiPhyNetwork : An R package for simulating phylogenetic networks

Article

Full-text available

May 2023
Methods Ecol. Evol.

Gene flow is increasingly recognized as an important macroevolutionary process. The many mechanisms that contribute to gene flow (e.g. introgression, hybridization, lateral gene transfer) uniquely affect the diversification of dynamics of species, making it important to be able to account for these idiosyncrasies when constructing phylogenetic models. Existing phylogenetic‐network simulators for macroevolution are limited in the ways they model gene flow. We present SiPhyNetwork , an R package for simulating phylogenetic networks under a birth–death‐hybridization process. Our package unifies the existing birth–death‐hybridization models while also extending the toolkit for modelling gene flow. This tool can create patterns of reticulation such as hybridization, lateral gene transfer, and introgression. Specifically, we model different reticulate events by allowing events to either add, remove or keep constant the number of lineages. Additionally, we allow reticulation events to be trait dependent, creating the ability to model the expanse of isolating mechanisms that prevent gene flow. This tool makes it possible for researchers to model many of the complex biological factors associated with gene flow in a phylogenetic context.

On the limits of fitting complex models of population history to f-statistics

Article

Apr 2023
eLife

Our understanding of population history in deep time has been assisted by fitting admixture graphs (AGs) to data: models that specify the ordering of population splits and mixtures, which, along with the amount of genetic drift and the proportions of mixture, is the only information needed to predict the patterns of allele frequency correlation among populations. The space of possible AGs relating populations is vast, and thus most published studies have identified fitting AGs through a manual process driven by prior hypotheses, leaving the majority of alternative models unexplored. Here, we develop a method for systematically searching the space of all AGs that can incorporate non-genetic information in the form of topology constraints. We implement this findGraphs tool within a software package, ADMIXTOOLS 2, which is a reimplementation of the ADMIXTOOLS software with new features and large performance gains. We apply this methodology to identify alternative models to AGs that played key roles in eight publications and find that in nearly all cases many alternative models fit nominally or significantly better than the published one. Our results suggest that strong claims about population history from AGs should only be made when all well-fitting and temporally plausible models share common topological features. Our re-evaluation of published data also provides insight into the population histories of humans, dogs, and horses, identifying features that are stable across the models we explored, as well as scenarios of population relationships that differ in important ways from models that have been highlighted in the literature.

Maximum likelihood pandemic-scale phylogenetics

Article

Apr 2023
Nat Genet

Phylogenetics has a crucial role in genomic epidemiology. Enabled by unparalleled volumes of genome sequence data generated to study and help contain the COVID-19 pandemic, phylogenetic analyses of SARS-CoV-2 genomes have shed light on the virus’s origins, spread, and the emergence and reproductive success of new variants. However, most phylogenetic approaches, including maximum likelihood and Bayesian methods, cannot scale to the size of the datasets from the current pandemic. We present ‘MAximum Parsimonious Likelihood Estimation’ (MAPLE), an approach for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. MAPLE infers SARS-CoV-2 phylogenies more accurately than existing maximum likelihood approaches while running up to thousands of times faster, and requiring at least 100 times less memory on large datasets. This extends the reach of genomic epidemiology, allowing the continued use of accurate phylogenetic, phylogeographic and phylodynamic analyses on datasets of millions of genomes.

Identifiability of Level-1 Species Networks from Gene Tree Quartets

Article

Jan 2024

When hybridization or other forms of lateral gene transfer have occurred, evolutionary relationships of species are better represented by phylogenetic networks than by trees. While inference of such networks remains challenging, several recently proposed methods are based on quartet concordance factors -- the probabilities that a tree relating a gene sampled from the species displays the possible 4-taxon relationships. Building on earlier results, we investigate what level-1 network features are identifiable from concordance factors under the network multispecies coalescent model. We obtain results on both topological features of the network, and numerical parameters, uncovering a number of failures of identifiability related to 3-cycles in the network.

phylosem: A fast and simple R package for phylogenetic inference and trait imputation using phylogenetic structural equation models

Article

Oct 2023

Phylogenetic comparative methods (PCMs) can be used to study evolutionary relationships and trade-offs among species traits. Analysts using PCM may want to (1) include latent variables, (2) estimate complex trait interdependencies, (3) predict missing trait values, (4) condition predicted traits upon phylogenetic correlations and (5) estimate relationships as slope parameters that can be compared with alternative regression methods. The Comprehensive R Archive Network (CRAN) includes well-documented software for phylogenetic linear models (phylolm), phylogenetic path analysis (phylopath), phylogenetic trait imputation (Rphylopars) and structural equation models (sem), but none of these can simultaneously accomplish all five analytical goals. We therefore introduce a new package phylosem for phylogenetic structural equation models (PSEM) and summarize features and interface. We also describe new analytical options, where users can specify any combination of Ornstein-Uhlenbeck, Pagel's-δ and Pagel's-λ transformations for species covariance. For the first time, we show that PSEM exactly reproduces estimates (and standard errors) for simplified cases that are feasible in sem, phylopath, phylolm and Rphylopars and demonstrate the approach by replicating a well-known case study involving trade-offs in plant energy budgets.

Abstract: We develop a new R-package phylosem that provides a simple interface for phylogenetic structural equation models. We identify and visualize five desirable features (coloured ellipses and labelled using matching coloured boxes), and note how four existing R-packages (grey boxes) each address different combinations of these five features. In this paper, we then outline how phylosem incorporates all five features.
