Running head: NETWORK SIMILARITY
A graph-theory based similarity metric enables comparison of sub-population psychometric
networks
Esther Ulitzsch1, Saurabh Khanna2, Mijke Rhemtulla3, and Benjamin W. Domingue2
1IPN—Leibniz Institute for Science and Mathematics Education
2Stanford Graduate School of Education
3University of California, Davis
Author Note
Correspondence concerning this article should be sent to Esther Ulitzsch,
IPN—Leibniz Institute for Science and Mathematics Education, Educational Measurement,
Olshausenstraße 62, 24118 Kiel, Germany, phone: +49-431-880-1704, email:
ulitzsch@leibniz-ipn.de. Extra materials for this article can be found in the OSF and are
available via the following link: https://osf.io/guxf8/. This work was supported by the
Jacobs Foundation.
Abstract
Network psychometrics leverages pairwise Markov random fields to depict conditional
dependencies among a set of psychological variables as undirected edge-weighted graphs.
Researchers often intend to compare such psychometric networks across sub-populations,
and recent methodological advances provide invariance tests of differences in
sub-population networks. What remains missing, though, is an analogue to an effect size
measure that quantifies differences in psychometric networks. We address this gap by
complementing recent advances for investigating whether psychometric networks differ,
with an intuitive similarity measure quantifying the extent to which networks differ. To
this end, we build on graph-theoretic approaches and propose a similarity measure based
on the Frobenius norm of differences in psychometric networks’ weighted adjacency
matrices. To assess this measure’s utility for quantifying differences between psychometric
networks, we study how it captures differences in sub-population network models implied
by both latent variable models and Gaussian graphical models. We show that a wide array
of network differences translates intuitively into the proposed measure, while the same does
not hold true for customary correlation-based comparisons. In a simulation study on
finite-sample behavior, we show that the proposed measure yields trustworthy results when
population networks differ and sample sizes are sufficiently large, but fails to identify exact
similarity when population networks are the same. From these results, we derive a strong
recommendation to use the measure only as a complement to a significance test for network
similarity. We illustrate potential insights from quantifying psychometric network
similarities through cross-country comparisons of human values networks.
Keywords: network models; group comparisons; graph similarity
1 Introduction
Network psychometrics leverages pairwise Markov random fields to depict conditional
dependencies among a set of psychological variables as undirected edge-weighted graphs
and provides tools for exploring the relationships among the studied observables. In
psychological research, psychometric network models have facilitated a more nuanced
understanding of the interplay of, among others, psychopathological symptoms (e.g. Fried
et al., 2015; Isvoranu et al., 2016; McNally et al., 2015), attitudes and beliefs (Dalege et al.,
2016), or different aspects of health-related quality of life (Kossakowski et al., 2016), and
may pose a viable alternative to latent variable or common cause modeling of psychological
and behavioral data (Borsboom, 2017; Borsboom & Cramer, 2013; Cramer et al., 2010;
Hofmann et al., 2016; McNally et al., 2017), especially when the assumption of a common
cause is not tenable and/or in the absence of strong prior theory on how variables are
related to each other.
A wide range of commonly encountered psychological research questions involves
comparisons across multiple groups. Examples include comparisons between clinical and
non-clinical populations, treatment groups and their control counterparts, groups differing
in their exposure to risk factors, and different cultural groups. When using network models
to address such research questions, researchers are typically interested in whether and to
what extent networks differ across sub-populations. While earlier work conducting such
comparisons predominantly relied on visual inspection of networks (Bringmann et al., 2013;
Koenders et al., 2015; Wigman et al., 2015) or comparisons in terms of some selected
features such as the strength of single edges or centrality indicators of nodes (Birkeland
et al., 2017; Forbes et al., 2021), recent methodological advances provided invariance tests
(Haslbeck, 2022; van Borkulo et al., 2022; Williams et al., 2020) that test for evidence of
differences in sub-population networks and meta-analytic techniques that support
aggregating psychometric networks across different samples (Epskamp et al., 2022). What
is missing, however, is an analogue to an effect size measure that quantifies differences in
psychometric networks. In this study, we aim to fill this gap and complement recently
developed methods for investigating whether psychometric networks differ from each other
with an easily applicable similarity measure for quantifying the extent to which
psychometric networks differ.
To this end, we suggest capitalizing on established graph-theoretical measures for
determining the degree of similarity among graphs, and evaluate whether these may serve
as standardized measures that—in analogy to effect size measures—quantify the overall
difference between psychometric networks. Having such measures at hand opens the path
for, among others, evaluating the fiercely debated replicability of psychometric networks
across different samples (see Forbes et al., 2021; Fried et al., 2018; Jones et al., 2021;
Williams, 2022, for discussions) by gauging differences among original and replicated
networks, quantifying the degree of change an intervention induces in the psychometric
networks of experimental versus control groups, or investigating whether, say, structurally
different clinical samples are as dissimilar to each other as they are to a non-clinical sample.
In what follows, Sections 2 and 3 provide concise overviews of psychometric network
models and previous work on comparisons of sub-population networks. Section 4 then
provides an overview of established graph-theoretical measures for quantifying differences
among graphs. Based on theoretical considerations regarding the applicability of these
measures to psychometric networks, we identify the Frobenius norm of differences in
psychometric networks’ weighted adjacency matrices as a potentially suitable candidate
measure for quantifying psychometric network differences and use it to derive a normalized
similarity measure. To study its utility for quantifying differences between psychometric
networks, Section 5 investigates whether various differences in both population network
models implied by latent variable models and Gaussian graphical models translate
intuitively into the Frobenius norm-based similarity measure. We contrast its behavior
against the correlation between lower triangulars of networks’ weighted adjacency matrices,
which is the current ad hoc (as it has not been formally evaluated) method of choice for
quantifying the similarity between sub-population networks. We show that a wide array of
network differences translates intuitively into the suggested Frobenius norm-based
similarity measure, while the same does not hold true for correlation-based comparisons.
Section 6 studies the Frobenius norm-based similarity measure’s finite-sample behavior. We
show that it yields trustworthy results when population networks differ and sample sizes
are sufficiently large, but fails to identify exact similarity when population networks are the
same. From the simulations’ results, Section 7 then derives initial guidelines for
interpreting the magnitude of the Frobenius norm-based similarity measure. Based on our
investigations of the measure’s finite-sample behavior, we strongly recommend using the
measure only in combination with a significance test for network similarity, and
quantifying the degree of similarity only when the significance test rejects the null
hypothesis that the networks are equal. Section 8 illustrates the insights that can be
gained from quantifying network similarities by conducting cross-country comparisons of
human values networks.
2 Psychometric Network Models
Network psychometrics makes use of pairwise Markov random fields (Murphy, 2012) to
depict conditional dependencies among a set of psychological variables (e.g., symptoms) as
undirected edge-weighted graphs G = (V, E, w) (see Bondy & Murty, 2008; Schulz et al.,
2022). Here, G = (V, E, w) is a tuple (i.e., an immutable ordered sequence) where the set
of nodes V = {v1, v2, ..., vp} denotes the variables of the p-node network model and the
edges E ⊆ [V]² (i.e., the set of all two-element subsets of V) denote the connections among
them. Pairwise Markov random fields encode conditional independence associations, such
that e = {j, k} ∈ E when the variables j and k are not conditionally independent after
conditioning on all other variables. Conversely, e = {j, k} ∉ E indicates that variables j
and k are independent after controlling for all other variables in the network. The weight
function w: E → ℝ assigns to each edge e = {j, k} ∈ E a weight encoding the strength of
conditional dependence between nodes j and k. The weight of an edge e = {j, k} is
denoted by w({j, k}). The structure of G can be represented using its p × p weighted
adjacency matrix A = A(G), with entry a_jk given by

a_jk = w({j, k}) if {j, k} ∈ E, and a_jk = 0 otherwise. (1)
The most common types of Markov random fields used in network
psychometrics are Gaussian graphical models (GGMs; Lauritzen, 1996) for multivariate
normally distributed data, Ising models for binary data, and mixed graphical models for
data containing variables from different distribution families. In the present study, we will
use GGMs to investigate the utility of graph-theoretical measures for quantifying
differences between psychometric networks. In applications of psychometric networks,
continuous data are commonly encountered, and GGMs are the customary method of
choice for their analysis.
In GGMs, edge weights represent non-zero partial correlation coefficients. More
specifically, let y denote a set of p random variables, constituting the nodes of the GGM.
It is assumed that y is centered and follows a multivariate normal distribution,

y ∼ N_p(0, Σ), (2)

with Σ giving the variance-covariance matrix. Partial correlations ω_jk between variables j
and k can be obtained directly from the precision matrix Θ = Σ⁻¹ as

ω_jk = −θ_jk / (√θ_jj · √θ_kk), j ≠ k. (3)

Then, the p × p partial correlation matrix constitutes the weighted adjacency matrix A
of the GGM.
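The mapping from a covariance matrix to a GGM’s weighted adjacency matrix (Equations 2 and 3) can be sketched as follows. The paper’s own implementation is in R; this Python/NumPy translation is our illustrative sketch, and the function name is our own.

```python
import numpy as np

def partial_correlations(sigma):
    """Weighted adjacency matrix of a GGM (Equation 3): partial
    correlations derived from the precision matrix Theta = Sigma^-1."""
    theta = np.linalg.inv(np.asarray(sigma, dtype=float))
    d = np.sqrt(np.diag(theta))
    # omega_jk = -theta_jk / (sqrt(theta_jj) * sqrt(theta_kk))
    omega = -theta / np.outer(d, d)
    np.fill_diagonal(omega, 0.0)  # the network has no self-loops
    return omega

# Three equicorrelated variables (r = .50): each pairwise partial
# correlation, controlling for the third variable, equals r/(1+r) = 1/3.
sigma = np.array([[1.0, 0.5, 0.5],
                  [0.5, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])
A = partial_correlations(sigma)
```

The resulting matrix is symmetric with a zero diagonal, matching the undirected, loop-free structure of G.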
3 Comparing Psychometric Network Models
Methods for comparing sub-population networks have received increasing attention in
recent years (e.g. Epskamp et al., 2022; Haslbeck, 2022; van Borkulo et al., 2022; Williams
et al., 2020). So far, this rapidly evolving stream of research has mainly been concerned
with developing significance tests focused on identification of evidence for differences across
sub-population networks.
Among these, the Network Comparison Test (NCT; van Borkulo et al., 2022) is the
most widely used and has become a customary tool in applied network psychometrics. The
NCT is a three-step, resampling-based permutation test. First, networks for two groups of
interest are estimated separately, and some test statistic summarizing key structural
differences among them—such as the difference in the networks’ sums of edge weights or
the largest difference in edge weights—is obtained. Second, a reference distribution is
created by pooling the two samples, resampling according to the original sample sizes, and,
for each resample, obtaining the chosen test statistic. Finally, the significance of the
empirical test statistic is assessed by comparing it to the reference distribution. Further
recent methodological advances cover network comparisons across more than two groups
(Haslbeck, 2022), Bayesian network comparisons (Williams et al., 2020), and partial
pooling of networks across multiple samples (Epskamp et al., 2022).
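The NCT’s three-step logic can be sketched as follows. This is a deliberately simplified, hypothetical illustration: it uses unregularized sample partial correlations and the maximum absolute edge-weight difference as the test statistic, whereas the actual NCT estimates regularized networks; all function names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def pcor(data):
    """Unregularized sample partial correlations (a simplification)."""
    theta = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(theta))
    omega = -theta / np.outer(d, d)
    np.fill_diagonal(omega, 0.0)
    return omega

def nct_sketch(x, y, n_perm=500):
    # Step 1: estimate both networks and the observed test statistic
    observed = np.max(np.abs(pcor(x) - pcor(y)))
    # Step 2: pool the samples, resample according to the original
    # group sizes, and recompute the statistic for each resample
    pooled, n1 = np.vstack([x, y]), len(x)
    ref = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        ref.append(np.max(np.abs(pcor(pooled[idx[:n1]]) - pcor(pooled[idx[n1:]]))))
    # Step 3: compare the observed statistic to the reference distribution
    p_value = (1 + sum(r >= observed for r in ref)) / (1 + n_perm)
    return observed, p_value
```

Under the null of equal population networks, the permutation p-value is approximately uniform; with large network differences and sufficient sample sizes it approaches its lower bound of 1/(1 + n_perm).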
Naturally, the power of significance tests for comparing psychometric networks
increases with increasing sample size, and even infinitesimal differences between networks
will eventually become significant. Hence, effect size measures that facilitate investigating
whether observed differences in networks are of practical relevance are urgently needed.
Methods for quantifying the extent to which sub-population networks differ, however, have
received far less attention. In applied research, NCTs are often complemented with
(rank-order) correlations of lower triangulars of the network pair’s weighted adjacency
matrices (as in Bereznowski et al., 2021; Burger et al., 2020; Fritz et al., 2018; Maccallum
et al., 2017; Schlegl et al., 2021; Van Loo et al., 2018) or the comparison of specific features
of the network such as node centrality indices (as in Beard et al., 2016; Kossakowski et al.,
2016; Schlegl et al., 2021; Shim et al., 2021). Nevertheless, to date, there are no procedures
accepted as a standard by the psychometric network community for quantifying differences
in sub-population networks, and the suitability of such current practices has not yet been
investigated. Obviously, the comparison of specific features such as node centralities is
sensitive to only some aspects of possible network differences. Likewise, Brusco (2004),
Brusco and Cradit (2005), and Hubert (1978) pointed out that correlation-based
comparisons are able to capture only the presence of linear association in weighted
adjacency matrices, but might overlook other types of structural similarities. In fact, as
will be illustrated below, current correlation-based practice for quantifying differences in
psychometric networks may yield counter-intuitive conclusions or fail to exhibit sensitivity
to specific network differences. To fill this gap, we suggest borrowing from
graph-theoretical literature, where determining the similarity (or distance) between graphs
is a well-known and well-studied problem.
4 Bringing Graph-Theoretical Distance Measures Into Psychometric Network
Modeling
Graph-theoretical distance measures aim to compress differences between graphs into a
real-valued measure that converges to zero as pairs of graphs approach isomorphism, i.e., a
one-to-one correspondence mapping the graphs’ node, edge, and, if applicable, edge-weight
sets onto each other, with zero indicating exact similarity. Figure 1 provides an illustration of isomorphic and
non-isomorphic pairs of unweighted, undirected graphs. In the isomorphic pair, the two
depicted graphs exhibit exact similarity in that there exists a one-to-one mapping of the
vertices of graph 1 to those of graph 2 such that there is an edge between vertices jand k
in graph 1 if and only if there is an edge between the corresponding vertices in graph 2.
For the non-isomorphic pair, no such mapping exists, meaning that the two graphs do not
exhibit exact similarity, and researchers may want to employ graph-theoretical distance
measures to quantify how far the pair of graphs is from isomorphism.
Graph-theoretical distance measures find application in a broad range of fields
where the to-be-studied objects can be expressed as graphs, ranging from neuroscience,
e.g., for quantifying differences in brain connectivity among different subgroups (Sporns
et al., 2004), through computer vision for quantifying differences in images (Wilson & Zhu,
2008) and social network analysis, e.g., to study differences in flow of information (Koutra
et al., 2013), to communication science, e.g., to study re-tweet patterns (Bovet & Makse,
2019; Van Vliet et al., 2021).
Figure 1. Examples of isomorphic and non-isomorphic pairs of graphs: (a) an isomorphic
pair of graphs; (b) a non-isomorphic pair of graphs. In the isomorphic pair, the bijection
b → f, c → g, a → h, and d → e maps the vertices of graph 1 to the vertices of graph 2
such that any two vertices j and k of graph 1 are adjacent in graph 1 if and only if f(j)
and f(k) are adjacent in graph 2. For the non-isomorphic pair, no such mapping exists.
A great variety of measures exists (see Tantardini et al., 2019; Wills & Meyer, 2020,
for an overview). The choice of measure should be guided by the properties of the studied
graphs. Tantardini et al. (2019) provided a three-dimensional taxonomy informing this
choice. First, measures can be distinguished based on whether they are dependent on node
correspondence. In the case of known node correspondence, the two graphs in question
have the same node set, and the pairwise correspondence between nodes is known (i.e., no
mapping is needed). Second, measures can be distinguished based on whether they are
applicable to unweighted graphs only or generalize to weighted graphs. Third, measures
can be distinguished based on whether they are applicable to undirected graphs only or
generalize to directed graphs.
Based on this taxonomy, graph-theoretical measures used to compare psychometric
networks should fulfill the following criteria. First, sub-population comparisons of
psychometric networks are commonly performed on the same set of variables. Hence, a
known node correspondence method can be used. Second, as psychometric networks are
always weighted, measures used for their comparisons should generalize to weighted graphs.
Third, while most psychometric networks are undirected, directed networks exist in the
context of longitudinal and time-series data. These so-called temporal networks are
directed networks of regression coefficients that depict the lagged associations between
variables (e.g., affective states) from one measurement point to the next (see Epskamp,
2020; Epskamp, van Borkulo, et al., 2018; Epskamp, Waldorp, et al., 2018). Thus,
measures should ideally generalize to directed graphs.
Norm-based graph distance measures satisfy these criteria (Tantardini et al., 2019).
These distance measures are based on the difference of the p × p weighted adjacency
matrices A_n and A_m of two graphs. To compress the structure of the obtained matrix
A_n − A_m into a single measure, any matrix norm can be used. The standard choice
(Wills & Meyer, 2020) is the (normalized) Frobenius norm:

d_F(n, m) = (1/√(p/2)) ‖A_n − A_m‖_F = (1/√(p/2)) √( Σ_{j=1}^{p} Σ_{k=1}^{p} |a^n_jk − a^m_jk|² ). (4)

The constant 1/√(p/2) normalizes the Frobenius norm by the size of the to-be-compared
weighted adjacency matrices, thereby ensuring that values of d_F(n, m) are comparable
across pairs of networks with different numbers of nodes.
Besides accommodating the characteristics of psychometric networks, the Frobenius
norm of the difference between two adjacency matrices comes with two further major
advantages over other commonly used graph-theoretical distance measures. First, the
Frobenius norm of the difference between two adjacency matrices is a proper metric in the
mathematical sense (Wills & Meyer, 2020).¹ The second major advantage lies in its
simplicity (Martinez & Chavez, 2018). This is an important property, as more complex
measures, which have been developed for networks capturing flows or distances, may not
be applicable to the context of network psychometrics.
Note that the distance measure d_F(n, m) is unbounded on [0, ∞), with 0 indicating
exact similarity. To facilitate interpretability, following Koutra et al. (2013), we use the
distance-to-similarity transformation

s_F(n, m) = 1 / (1 + d_F(n, m)). (5)

This transformation ensures that s_F(n, m) is bounded within the interval (0, 1]. Note
that the transformation yields a similarity measure instead of a distance measure: for
exact similarity, the Frobenius norm-based similarity measure s_F(n, m) evaluates to 1,
not 0. Since d_F(n, m) is a proper metric satisfying the identity of indiscernibles,
s_F(n, m) equals 1 if and only if the compared networks are isomorphic. An example
calculation of s_F is provided in Appendix A. An R function for obtaining the suggested
measure is provided in Appendix B.
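Equations 4 and 5 are straightforward to implement. The paper provides an R function in Appendix B; the following Python/NumPy sketch is our own equivalent, with hypothetical function names.

```python
import numpy as np

def d_frobenius(a_n, a_m):
    """Normalized Frobenius distance between two p x p weighted
    adjacency matrices (Equation 4)."""
    p = a_n.shape[0]
    return np.linalg.norm(a_n - a_m, ord="fro") / np.sqrt(p / 2)

def s_frobenius(a_n, a_m):
    """Distance-to-similarity transformation (Equation 5):
    bounded in (0, 1], with 1 indicating isomorphic networks."""
    return 1.0 / (1.0 + d_frobenius(a_n, a_m))
```

Identical networks yield s_F = 1, and the measure is symmetric in its two arguments because d_F is a metric.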
5 Illustrations and Experiments with Population Networks
For illustrating and evaluating the proposed Frobenius norm-based measure s_F, we
investigated whether and how it captures various differences in psychometric networks. To
¹That is, it satisfies a) identity of indiscernibles (i.e., d_F(n, m) = 0 iff A_n = A_m),
b) symmetry (i.e., d_F(n, m) = d_F(m, n)), and c) the triangle inequality (i.e.,
d_F(n, m) ≤ d_F(n, l) + d_F(l, m)) (Van Loan & Golub, 1996, p. 56).
this end, we investigated differences between population networks (i.e., we considered true
partial correlation matrices) in two studies.
In Study I, we assessed how s_F captures differences in networks implied by latent
variable models. Psychometric networks can fairly well capture patterns implied by latent
variable models (Golino & Epskamp, 2017; van der Maas et al., 2006). Due to their
ubiquity and long-standing application in psychological research, we believe that there is a
common understanding in the research community of what constitutes a mild or severe
difference among latent variable models. In fact, many studies in the context of
confirmatory factor analysis leverage this common understanding when investigating
misspecifications, e.g., for deriving fit index cut-offs (see Garnier-Villarreal & Jorgensen,
2020; McNeish & Wolf, 2021, for recent examples). We aimed to use common intuition
about latent variable models and the common understanding of the severity of differences
among them to generate insights into the resultant functioning of s_F. The hope in doing
so is that, when inspecting s_F as a function of the varying extent of differences between
sub-population confirmatory factor analysis models, readers can draw on this
understanding to build initial intuition on values of s_F that indicate mild or severe
differences. We considered two commonly studied scenarios of group differences in latent
variable models: a) differences in latent correlations and b) the absence versus presence of
cross-loadings.
In Study II, we investigated how s_F quantifies differences in GGMs. We aimed to
cover a broad range of quantitative (in the sense that networks share the same set of edges,
but differ in edge weights) and qualitative differences (in the sense that networks differ in
structure), considering both network pairs differing in a single edge or edge weight only
and differences spread out across the whole network. To this end, network pairs were
created by first generating network 1 and then deriving network 2 from network 1 through
various types of manipulations. We considered three sets of scenarios. First, we
investigated manipulations of varying severeness of network 1’s strongest edge weight.
Second, we gradually decreased similarity between network 1 and network 2 by increasing
the proportion of network 1’s edge weights affected by manipulations of varying severeness.
Third, we studied how micro-manipulations of edge weights spreading across the whole
network are captured by s_F.
Whenever possible, we also benchmarked our metric’s performance against that of the
correlation between vectorized strictly lower triangulars of the networks’ weighted
adjacency matrices, denoted by s_cor, which is the current method of choice for
determining the similarity between sub-population networks. Note that values of s_F and
s_cor are not directly comparable, as they are located on different scales.
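For completeness, the correlation-based benchmark s_cor can be sketched as follows (our own Python illustration; the paper’s analyses were run in R):

```python
import numpy as np

def s_cor(a_n, a_m):
    """Pearson correlation between the vectorized strictly lower
    triangulars of two weighted adjacency matrices."""
    rows, cols = np.tril_indices(a_n.shape[0], k=-1)
    return np.corrcoef(a_n[rows, cols], a_m[rows, cols])[0, 1]
```

Note that np.corrcoef returns nan when one vector of edge weights has zero variance, which mirrors the situation reported below for networks whose edge weights are all equal.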
R code for our investigations is provided in the OSF repository accompanying this
study.
5.1 Study I: Comparing Networks Implied by Latent Variable Models
5.1.1 Method. For generating population networks, we obtained partial
correlation matrices from model-implied variance-covariance matrices of confirmatory
factor analysis models. That is, for the observed variables y of the networks we assumed

y = ν + Λη + ε, (6)

where ν is a p × 1 vector containing intercepts, Λ is a p × l matrix of factor loadings, η is
an l × 1 vector of latent factors, and ε denotes the vector of multivariate normally
distributed residuals with zero mean vector and covariance matrix Ψ (Bollen, 1989). For
all models, we generated uncorrelated residuals. All intercepts were set to zero, assuming
that the mean structure is saturated and completely reflected in the intercepts, that is,
E(y) = ν = 0. The model-implied covariance structure Σ of the observed variables is
given by

Σ = ΛΦΛᵀ + Ψ, (7)

where Φ is the l × l covariance matrix of the latent factors. From the inverse of the
model-implied covariance matrix, we constructed the partial correlation matrix according
to Equation 3.
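The construction described above can be sketched end-to-end. The following Python sketch (our own translation; the paper’s code is in R) builds the partial correlation network implied by a two-factor model with unit-variance indicators; the function name and default values are illustrative.

```python
import numpy as np

def cfa_implied_network(p=10, rho=0.50, lam=0.70):
    """Partial correlation network implied by a two-factor CFA model
    (Equations 6, 7, and 3), with unit-variance indicators."""
    Lambda = np.zeros((p, 2))
    Lambda[: p // 2, 0] = lam          # first half loads on factor 1
    Lambda[p // 2 :, 1] = lam          # second half loads on factor 2
    Phi = np.array([[1.0, rho], [rho, 1.0]])
    Psi = (1.0 - lam**2) * np.eye(p)   # residual variances so Var(y) = 1
    Sigma = Lambda @ Phi @ Lambda.T + Psi   # Equation 7
    Theta = np.linalg.inv(Sigma)
    d = np.sqrt(np.diag(Theta))
    A = -Theta / np.outer(d, d)             # Equation 3
    np.fill_diagonal(A, 0.0)
    return A
```

With ρ = 1 all edge weights coincide, whereas ρ < 1 yields stronger within-factor than between-factor edges, as illustrated in the exemplar networks below.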
Differences in latent correlations. We generated pairs of networks implied by
two-dimensional confirmatory factor analysis models with different latent correlations ρ
(see Figure 2). For all models, we generated the observed variables y with unit variance
and set all non-zero factor loadings to λ = 0.70. We assumed half of the observed variables
to load on the first factor and the other half on the second factor. All generated pairs of
networks comprised the special case of ρ = 1 (network 1), implying the same covariance
matrix (and, consequently, the same network) as a unidimensional model, and a network
implied by a model with ρ ≠ 1 (network 2). We varied the number of nodes (p = 10;
p = 20) as well as the correlation between the two latent factors of network 2, evaluating
the boundary values of −.99 and .99 as well as the sequence from .95 to −.95 in steps of
.05. Doing so gradually increases differences in networks derived from models with ρ = 1
and ρ ≠ 1, with ρ = −.99 inducing the largest difference.
Figure 2. Confirmatory factor analysis models for two groups with different latent
correlations (group 1: ρ = 1; group 2: ρ ≠ 1).
To help intuition, Figure 3 illustrates exemplar networks implied by models with
ρ = 1 (Figure 3a), ρ = .50 (Figure 3b), and ρ = −.90 (Figure 3c). As was to be expected,
the network with ρ = .50 exhibited greater proximity to the network with ρ = 1 than the
network with ρ = −.90. The model with ρ = 1 implied all partial correlations (i.e., edge
weights) to be equal (.10 in the present case). When ρ ≠ 1, edge strengths between nodes
belonging to the same factor were stronger (.19 for ρ = .50 and .14 for ρ = −.90) compared
to those of the model with perfectly correlated latent factors; and edge weights between
nodes belonging to different factors were considerably weaker when ρ = .50 (.02) and
negative when ρ = −.90 (−.07), mirroring the structure of the respective confirmatory
factor analysis models.
Note that in this set of comparisons, s_cor could not be obtained due to the lack of
variability in edge weights of the ρ = 1 network. Hence, networks were only compared
using s_F.
Figure 3. Psychometric networks implied by confirmatory factor analysis models with
different correlations ρ among latent factors: a) ρ = 1, b) ρ = .50, c) ρ = −.90. Line
thickness represents the absolute size of the edge weights. Red and blue denote negative
and positive edge weights, respectively.
Differences in cross-loadings. For evaluating how s_F and the
correlation-based benchmark s_cor capture differences in loading patterns, we compared
networks implied by two-dimensional models with the same latent correlation (ρ = −.50;
ρ = 0; ρ = .50), but differing in whether (network 2) or not (network 1) there were
cross-loadings (see Figure 4). For network 1, we generated the observed variables y with
unit variance and set all non-zero factor loadings to λ = 0.50. For network 2, we added
cross-loadings to this model, while keeping main factor loadings and error variances
constant across both models. We varied the number of cross-loadings (1; 2) as well as
their size λ, evaluating the sequence λ = 0.05 to λ = 0.50 in steps of 0.05. We, again,
varied the number of nodes (p = 10; p = 20).
Figure 4. Confirmatory factor analysis models for two groups with different loading
patterns.
Figure 5 depicts exemplar network pairs implied by two-dimensional models with
correlations between latent factors of −.50 (Figures 5a and 5b) and .50 (Figures 5c and
5d). Networks b) and d) are implied by confirmatory factor analysis models with high
cross-loadings (λ = 0.50) for variables y6 and y7.
The most pronounced differences between networks implied by models with and
without cross-loadings were observable for weights of edges connecting variables y6 and y7
to variables from different primary factors (i.e., y1 to y5). In the network pair with
negatively correlated latent factors, weights for these edges were negative (−.03) when no
cross-loadings were present but positive (.02) for the network obtained from the model
with cross-loadings. In the network pair with positively correlated latent factors, the
introduction of cross-loadings resulted in higher weights for these edges (.07 as compared
to .03).
5.1.2 Results.
Differences in latent correlations. Figure 6 displays sFas a function of the
correlation between latent factors in the network with ρ6= 1 and the number of nodes p.
Recall that all comparisons involved p-node network pairs with ρ= 1 for network 1 and
ρ6= 1 for network 2. As expected, sFincreases to unity as ρ1. Declines in sFassociated
with declines in ρwere more pronounced for smaller networks (e.g., for ρ=.90,sFwas
.65 for p= 10, but .74 for p= 20), and steepness of changes in sFwas greater the closer ρ
NETWORK SIMILARITY 17
Figure 5. Psychometric networks implied by different latent variable models. a)
two-dimensional model without cross-loadings with a correlation between latent factors of
-.50, b) two-dimensional model with high cross-loadings of variables y6 and y7 with a
correlation between latent factors of -.50, c) two-dimensional model without cross-loadings
with a correlation between latent factors of .50, d) two-dimensional model with high
cross-loadings of variables y6 and y7 with a correlation between latent factors of .50. Line
thickness represents the absolute size of the edge weights. Red and blue denote negative
and positive edge weights, respectively.
in network 2 was to the boundaries of -.99 and .99. Across all conditions, sF ranged from
.62 (p = 10, ρ = -.99) to .98 (p = 10, ρ = .99).
Differences in cross-loadings. Results for sF and the scor baseline are given in
Figure 7. The behavior of both measures is displayed as a function of the number of nodes,
the correlation between latent factors, the number of cross-loadings, and λ*. sF captured
differences in loading patterns in that it decreased steadily as a function of the size of λ*
and tended to be smaller for a higher number of cross-loadings and a smaller number of
nodes. For a single small cross-loading (λ* = 0.05), sF was close to one (i.e., .99),
indicating almost perfect similarity between the networks. Two high cross-loadings
(λ* = 0.50), in contrast, resulted in sF around .90 for p = 10 and .93 for p = 20.
Figure 6. Similarity in terms of the Frobenius norm-based measure sF between two
network models implied by two-dimensional confirmatory factor analysis models with
different correlations between latent factors. The correlation between factors was set to 1
for network 1 and varied for network 2. ρ gives the correlation for network 2; p denotes the
number of nodes.
The alternative scor metric also captured differences in the number and size of the
cross-loadings. However, the magnitude of scor was highly contingent on correlations
between latent factors, and rapidly decreased for high positive correlations. For instance,
while two high cross-loadings (λ* = 0.50) in a network with p = 10 resulted in scor = 0.95
for ρ = -.50, scor was as low as .88 when ρ = .50. These sensitivities are counter-intuitive
as the correlation between latent factors was kept constant within network pairs, and can
be attributed to decreased variability in the edge weights of the network without
cross-loadings when ρ had large, positive values (compare Figures 5a and 5c). sF, in
contrast, did not exhibit such sensitivities to the correlation between latent factors.
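For reference, the correlation-based baseline can be sketched in a few lines. The sketch assumes that scor is the Pearson correlation between the strictly lower-triangular entries of the two weighted adjacency matrices, consistent with the variability argument used when interpreting scor in this paper.

```python
import numpy as np

def s_cor(A1, A2):
    """Correlation-based comparison: Pearson correlation of the strictly
    lower-triangular edge weights (assumed definition of s_cor)."""
    idx = np.tril_indices_from(A1, k=-1)
    return np.corrcoef(A1[idx], A2[idx])[0, 1]

# a uniformly rescaled copy of a network is invisible to a correlation
rng = np.random.default_rng(1)
A = rng.normal(0.0, 0.15, (10, 10))
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)
print(round(s_cor(A, 0.8 * A), 6))   # → 1.0
```

This scale-invariance is exactly the limitation that surfaces again in the micro-difference scenarios below, where dampening all edge weights leaves scor at 1.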
5.2 Study II: Comparing Gaussian Graphical Models
5.2.1 Method. For investigating the capability of sF and the scor baseline to
capture differences in networks implied by GGMs, we first generated population networks
according to the procedure outlined in Wysocki and Rhemtulla (2021), varying the number
of nodes (p = 10; p = 20) and network density (i.e., the proportion of possible edges in the
Figure 7. Similarity in terms of the Frobenius norm-based measure sF and
correlation-based comparisons scor of network models implied by confirmatory factor
analysis models with different loading patterns. No cross-loadings were added for network
1. The size (0.50λ*) and number of cross-loadings were varied for network 2. ρ gives the
correlation between factors for both networks; p denotes the number of nodes.
network actually present, regardless of the edges’ weights), considering values from .10 to
.90 in increments of .20. For each of the 2×5 = 10 conditions, 100 population network
pairs were generated.
We started by generating population network 1. To this end, for each of the 100
replications, we generated a random weighted adjacency matrix—constituting network
1—using the R package BDgraph (Mohammadi & Wit, 2015b). For creating the graph
structure, BDgraph randomly samples edges from a binomial distribution according to the
specified network density. Then, using a G-Wishart distribution WG(df, Ip), a
precision matrix Θ is generated according to the graph structure. Here, Ip represents a
p × p identity matrix and df gives the degrees of freedom of the G-Wishart distribution
(Mohammadi & Wit, 2015a).
We obtained edge weights from the simulated precision matrix according to
Equation 3. The degrees of freedom of the G-Wishart distribution determine the degree of
shrinkage of the precision matrix towards the identity matrix (Hsu et al., 2012), and thus
control the size of the edge weights. Note that the same degrees of freedom of the
G-Wishart distribution will result in different partial correlations for different network
densities and sizes. To achieve comparable distributions of edge weights across all networks,
we set the degrees of freedom to correspond to a targeted range of partial correlation values
of -.35 to .35 for each combination of network density and size. For all considered network
densities and sizes, we obtained edge weights distributed normally around zero, with a
standard deviation of around .15.
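The generation procedure itself used BDgraph in R. As a rough Python stand-in (Bernoulli edge sampling at the target density and a diagonally dominant precision matrix instead of a G-Wishart draw, both simplifying assumptions of this sketch), the edge-weight construction looks like:

```python
import numpy as np

def random_ggm(p, density, weight_sd=0.15, seed=0):
    """Simplified stand-in for the BDgraph procedure: sample edges from a
    Bernoulli(density) distribution, fill a precision matrix, force
    positive definiteness via diagonal dominance, and standardize to
    partial correlations (the paper's Equation 3)."""
    rng = np.random.default_rng(seed)
    K = np.zeros((p, p))
    iu = np.triu_indices(p, k=1)
    n_pairs = len(iu[0])
    present = rng.random(n_pairs) < density
    K[iu] = np.where(present, rng.normal(0.0, weight_sd, n_pairs), 0.0)
    K = K + K.T
    np.fill_diagonal(K, np.abs(K).sum(axis=1) + 1.0)   # ensure PD
    d = np.sqrt(np.diag(K))
    W = -K / np.outer(d, d)                            # partial correlations
    np.fill_diagonal(W, 0.0)
    return W

W = random_ggm(p=10, density=0.5)
```

Unlike a G-Wishart draw, the diagonal-dominance trick shrinks the partial correlations somewhat, so it reproduces the shape of the procedure but not its exact edge-weight distribution.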
Finally, for obtaining network 2, the generated network 1 was duplicated and the
duplicate’s edges and/or edge weights were manipulated.² We considered different types of
manipulations, described in greater detail below. On the rare occasions that the
manipulations yielded a non-positive definite correlation matrix, a new network 1 was
generated. For each of the 100 replications, we computed similarities between network 1
and network 2 in terms of sF and scor. In the reporting of results, we focus on the median
of these quantities across the 100 replications.
Networks differing in single edge weights. In our first set of scenarios, we
considered differences in single edge weights. To this end, in the network 2 duplicate, we
manipulated the edge weight with the highest absolute value, either a) halving it (i.e.,
multiplying it by 0.50), b) removing the corresponding edge, or c) switching its sign (i.e.,
multiplying it by -1).
Networks differing in multiple edge weights. In our second set of scenarios,
we investigated differences in multiple edge weights. To this end, for manipulating the
network 2 duplicate, we randomly selected a subset of the edges, considering proportions
² Recall that comparisons of sub-population networks are commonly performed on networks with the same
node set and that both sF and scor assume known node correspondence. Hence, all induced differences
concerned edges and edge weights only.
from .10 to .90 in increments of .10, and either a) halved the selected edges’ weights, b)
removed the respective edges, or c) switched the signs of their weights.
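Both the single- and multiple-edge scenarios can be sketched with one numpy helper (the function and argument names below are hypothetical, not from the paper's code):

```python
import numpy as np

def manipulate(A, proportion, kind, seed=0):
    """Derive network 2 from a duplicate of network 1 by halving,
    removing, or sign-switching a random subset of the present edges."""
    rng = np.random.default_rng(seed)
    B = A.copy()
    rows, cols = np.triu_indices_from(A, k=1)
    present = np.flatnonzero(A[rows, cols] != 0)
    n = int(round(proportion * len(present)))
    chosen = rng.choice(present, size=n, replace=False)
    factor = {"half": 0.5, "remove": 0.0, "sign_switch": -1.0}[kind]
    r, c = rows[chosen], cols[chosen]
    B[r, c] *= factor
    B[c, r] *= factor        # keep the adjacency matrix symmetric
    return B

A = np.array([[0.0, 0.2, -0.1],
              [0.2, 0.0, 0.3],
              [-0.1, 0.3, 0.0]])
B = manipulate(A, proportion=1.0, kind="sign_switch")   # flips every edge sign
```

Setting `proportion` so that exactly one edge is selected recovers the single-edge scenarios of the previous subsection.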
Note that scenario a) and, even more so, scenario b) created network pairs strongly
differing in their connectivity (i.e., the weighted absolute sum of all edges in the network;
van Borkulo et al., 2022). In scenario a), network pairs only differed in connectivity, while
the overall structure of the network remained intact. Differences in connectivity were
particularly pronounced when network 1 was dense and a large proportion of the edge
weights of network 2 was affected by the manipulation. For instance, for scenario b), under
conditions with a density of .70, network 1 exhibited an average connectivity value twice as
high as the connectivity of network 2 when 50% of the original edge weights were removed,
and ten times as high when 90% of the original edge weights were removed. For scenario
a), under conditions with a density of .70, the average connectivity of network 1 was 1.3
times higher than the connectivity of network 2 when 50% of the original edge weights were
halved, and 1.8 times higher when 90% of the original edge weights were halved. Because
differences in network connectivity are a common property of interest, and, especially in
clinical applications, changes in connectivity over time as well as differences between
clinical and non-clinical samples are frequently studied (e.g., Beard et al.,
2016; Kim et al., 2014), it is important to ensure that similarity measures are sensitive to
differences in connectivity.
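Connectivity, as defined above, is computed directly from the weighted adjacency matrix. A minimal sketch, which also reproduces the 1.25 ratio obtained when every edge weight is dampened by 0.80:

```python
import numpy as np

def connectivity(A):
    """Weighted absolute sum of all edges (cf. van Borkulo et al., 2022),
    counting each undirected edge once via the upper triangle."""
    iu = np.triu_indices_from(A, k=1)
    return np.abs(A[iu]).sum()

A = np.array([[0.0, 0.2, -0.1],
              [0.2, 0.0, 0.3],
              [-0.1, 0.3, 0.0]])

# global dampening by 0.80 rescales connectivity by exactly 1/0.8
ratio = connectivity(A) / connectivity(0.8 * A)
print(round(ratio, 4))   # → 1.25
```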
Micro-differences spread across the networks. In our third set of scenarios,
we investigated micro-differences in edge weights that were spread out across the
whole network. To this end, for manipulating the network 2 duplicate, we either slightly
dampened all edge weights by multiplying them by 0.80, or “jittered” them by making half
of them slightly larger (multiplying them by 1.25) and half of them slightly smaller in size
(multiplying them by 0.80).
Note that dampening edge weights led to differences in average connectivity, with
the connectivity of network 1 being 1.25 times higher than the connectivity of network 2
across all conditions.
5.2.2 Results.
Networks differing in single edge weights. Results for our manipulations of
single edge weights as a function of the number of nodes and network 1 density are given in
Figure 8. Both measures exhibited intuitive behavior in that they decreased with
increasing severity of the manipulation, i.e., given network size and density, both measures
were smallest when the strongest edge weight was multiplied by -1 and largest when the
strongest edge weight was multiplied by 0.50.
Dampening the strongest edge weight (i.e., multiplying it by 0.50) led to values for
sF and scor that were fairly constant as a function of density, but differed for
networks of different size. The dependency on network size was to be expected, as a
single-edge difference constitutes a more severe difference in a smaller network than in a larger
one. sF was around .89 for p = 10 and .92 for p = 20 across all density levels.
Correlation-based comparisons, in contrast, were not markedly sensitive to single edges
being dampened, yielding values around .97 and .99 for p = 10 and p = 20, respectively.
Both measures were sensitive to the edge with the strongest weight being missing.
Patterns of sensitivity, however, differed. Again, sF remained fairly constant as a function
of density and exhibited sensitivity to network size, yielding values around .79 for p = 10
and .85 for p = 20 across all density levels. Correlation-based comparisons, in contrast,
were highly sensitive to the networks’ densities. These effects were stronger in smaller as
compared to larger networks. For instance, while scor was .93 for p = 10 and a density of
.90, it was as low as .68 for a density of .10. This is due to the fact that, given the increased
number of zero entries, values in the strictly lower triangular part are less variable in networks
with lower density. As such, comparable sensitivities of scor can be expected for dense
networks in which both networks possess low rather than high variability in their edge
weights.
These patterns were further exacerbated for sign switches of the strongest edge
weight. Values of scor were highly sensitive to density and rapidly increased with increasing
density levels. For instance, scor was .73 for p = 10 and a density of .90, but as low as -.09
for a density of .10. This is remarkable, as researchers may falsely conclude from such a
low correlation that two networks that share the same (small) set of edges but exhibit a
strong difference in a single edge weight are highly dissimilar. Values for sF, in contrast,
were relatively insensitive to density, yielding values around .66 for p = 10 and .74 for
p = 20 across all density levels.
Figure 8. Similarity in terms of the Frobenius norm-based measure sF and
correlation-based comparisons scor between network models differing in single edge weights
across 100 simulated pairs of networks with different size and density. Lines are smoothed
to accommodate small differences in average ranges of network 1’s edge weights across
different conditions. p denotes the number of nodes. Note that the y-axis is truncated at
0.30 for comparability across experiments, but values for scor fall below 0.30.
Networks differing in multiple edge weights. Results for our manipulations
of multiple edge weights as a function of network size and network 1 density as well as the
proportion of edge weights affected by our manipulations are displayed in Figure 9. For
simplicity, only results for densities of .30 and .90 are displayed. Again, both measures
behaved intuitively in that they decreased with increasing severity of the manipulation
in terms of both the type of edge weight manipulation and the proportion of affected edge
weights. Nevertheless, mirroring the results for manipulations of single edge weights, scor was
almost insensitive to edge weights being halved, ranging from .94 to .98 across all
conditions, as compared to a range of .72 to .93 for sF.
For the same type of manipulation, similarities were slightly lower for larger as
compared to smaller networks. For small networks with p= 10, both sFand scor were
somewhat higher in sparse as compared to dense networks. This behavior was to be
expected, because given the same proportion of manipulated edges, for more dense
networks, a higher number of edge weights were affected by the manipulations.
Both measures exhibited a steeper decline with an increasing proportion of affected
edge weights the more severe the type of manipulation. This effect, however, was much
more pronounced for scor. For instance, in small, dense networks (p = 10 and a density of
.90), sF took values of .77 and .61 when 20% and 90% of the edges were missing,
respectively, and .64 and .44 when 20% and 90% of the edge weights were multiplied by
-1. scor, in contrast, was .91 and .29 when 20% and 90% of the edges were missing, and
took values of .65 and -.82 when 20% and 90% of the edge weights were multiplied by -1.
Micro-differences spread across the network. Figure 10 gives results for
micro-manipulations of edge weights spread across the whole network as a function of the
number of nodes and network 1 density. Recall that when all edge weights were slightly
dampened (i.e., multiplied by 0.80), edge weights were merely rescaled. As such, scor was
incapable of capturing these differences, equaling 1 across all conditions. Likewise,
scor was almost insensitive to edge weights being “jittered”, remaining above .98 across all
Figure 9. Similarity in terms of the Frobenius norm-based measure sF and
correlation-based comparisons scor between network models differing in multiple edge
weights across 100 simulated pairs of networks with different size and density. Lines are
smoothed to accommodate small differences in average ranges of network 1’s edge weights
across different conditions. p denotes the number of nodes. Note that the y-axis is
truncated at 0.30 for comparability across experiments, but values for scor fall below 0.30.
conditions. sF yielded intuitive results in that it slowly decreased as density increased
(note that for more dense networks, more edge weights were affected by the manipulation).
For instance, sF was .91 in a network with p = 10 and a low density of .10, and .87 when
density was .90. Results were almost identical when edge weights were dampened.
6 Investigating Finite-Sample Behavior
One challenge with sF is that its sampling distribution may not always be centered around
the population value. This is due to the fact that it is more likely for a finite-sample
network pair to exhibit less extreme similarity or dissimilarity than their population
Figure 10. Similarity in terms of the Frobenius norm-based measure sF and
correlation-based comparisons scor between network models with micro-differences in edge
weights across 100 simulated pairs of networks with different size and density. Lines are
smoothed to accommodate small differences in average ranges of network 1’s edge weights
across different conditions. p denotes the number of nodes.
counterparts,³ especially when population similarity is extremely high or extremely low,
respectively. To illustrate this for the case of sF, consider the case of two identical
sub-population networks with non-zero density. For identical sub-population networks, the
population matrix An − Am has zero entries only. For any given cell, it is highly unlikely
that for both networks, the exact same sample estimate is obtained for two given
sub-samples. With sF being constructed based on absolute values, the first moment of the
sF sampling distribution can, therefore, be assumed to be smaller than the population
value, underestimating similarity. Likewise, extreme values in the matrix An − Am,
stemming, for instance, from extreme negative edge weights in sub-population network n
and extreme positive edge weights in sub-population network m, are likely to be less
extreme for two given sub-samples, resulting in sF values higher than their population
counterparts. To investigate the degree to which this property may distort conclusions on
³ Note that this also holds true for scor.
network similarity, we studied sF’s finite-sample behavior.
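The boundary argument can be illustrated with a few lines of numpy: for identical population networks the difference matrix is exactly zero, but any independent estimation noise makes the observed Frobenius norm strictly positive. This sketch uses the raw norm rather than sF itself, whose exact normalization is defined earlier in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 10

# identical sub-population networks: the population difference is the zero matrix
A = np.zeros((p, p))

# independent (symmetric) estimation noise for the two sub-samples
E1 = rng.normal(0.0, 0.02, (p, p)); E1 = (E1 + E1.T) / 2
E2 = rng.normal(0.0, 0.02, (p, p)); E2 = (E2 + E2.T) / 2

diff = (A + E1) - (A + E2)
# the population norm is 0 but the sample norm is not, so a similarity
# measure built on |An - Am| must underestimate similarity at this boundary
print(np.linalg.norm(diff, "fro") > 0.0)   # → True
```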
6.1 Method
We studied the finite-sample behavior of sF for both (a) finite-sample network pairs
stemming from sub-population networks with exact similarity and (b) finite-sample
network pairs stemming from different sub-population networks. For the latter, we studied
all scenarios considered in Study II of our experiments with population networks. For both
cases, we varied the number of nodes (p = 10; p = 20), network density, considering values
from .10 to .90 in increments of .20, and the sub-sample size (N = 200; N = 500;
N = 1000; N = 2000; N = 5000; N = 10000), assuming sample sizes for both sub-samples
to be the same. For each condition, we employed 500 replications and investigated the
median and interquartile range of sF across replications.
6.1.1 Finite-sample behavior for sub-population networks with exact
similarity. For studying the finite-sample behavior of sF for finite-sample network pairs
stemming from sub-population networks with exact similarity, for each replication, we first
generated a population network according to the procedure described above. We then
simulated data according to the generated population network for two independent samples
by drawing values for N observations from the multivariate normal distribution implied by
the generated population network. A separate GGM was fit to each generated data set.
For estimation, we used the EBICglasso method (see Epskamp & Fried, 2018, for an
introduction) as implemented in the R package qgraph (Epskamp et al., 2012), setting
γ = 0.50. Finally, we obtained sF for the two estimated GGMs.
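In Python, the simulate-then-estimate loop could be approximated as follows. Note that this sketch replaces EBICglasso with an unregularized inverse of the sample covariance (no penalty, no EBIC model selection), which differs from the qgraph estimator actually used; the population structure is likewise a hypothetical placeholder.

```python
import numpy as np

def sample_partial_correlations(X):
    """Crude stand-in for EBICglasso: invert the sample covariance and
    standardize to partial correlations (no regularization or EBIC)."""
    K = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(K))
    W = -K / np.outer(d, d)
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(0)
p, N = 10, 2000
Sigma = np.eye(p)
Sigma[0, 1] = Sigma[1, 0] = 0.3     # hypothetical population structure

# two independent sub-samples drawn from the SAME population network
X1 = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
X2 = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
W1 = sample_partial_correlations(X1)
W2 = sample_partial_correlations(X2)
# W1 and W2 differ slightly even though the populations are identical,
# which is precisely the source of the boundary bias discussed above
```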
6.1.2 Finite-sample behavior for different sub-population networks. To
study the finite-sample behavior of sF for finite-sample network pairs stemming from
different sub-population networks, for each replication, we first generated population
network pairs according to the procedure and scenarios described in Section 5.2. Then, we
simulated two data sets of N observations according to the multivariate normal
distributions implied by the two generated population networks. Again, we fit a separate
GGM to each generated data set and obtained sF for the two estimated GGMs.
6.2 Results
6.2.1 Finite-sample behavior for sub-population networks with exact
similarity. Figure 11 displays the median and interquartile ranges of the Frobenius
norm-based measure sF for finite-sample networks stemming from the same population
network with varying size and density across 500 replications. The color of the solid lines
(median) and shaded areas (interquartile range) denotes the sub-sample size. The dashed
gray line marks the true population value of sF = 1. Figure 11 highlights three important
points. First, for small sub-sample sizes, estimates of sF exhibited a strong downward bias
(i.e., researchers would falsely conclude that two finite-sample networks stemming from the
same population network exhibit low similarity). Second, this effect was aggravated by
high network density. Third, and most importantly, when the population network was
dense, sF did not approach the true value of 1 even for very large sub-sample sizes of
N = 10000, such that even with very large samples researchers would conclude that two
finite-sample networks stemming from the same population network exhibit high, but not
perfect, similarity.
6.2.2 Finite-sample behavior for different sub-population networks.
Figures 12 to 14 illustrate the finite-sample behavior of sF for samples stemming from
sub-population networks differing according to the scenarios studied in Section 5.2. In each
figure, the dashed gray line marks the true population value.
Contrasting these figures against Figure 11 illustrates that sF behaved markedly
differently when sub-population networks differed, with the most important difference being
that—regardless of the type and severity of true network differences—estimates of sF
approached the true population value with increasing sub-sample size and yielded
trustworthy conclusions on network similarity when sub-sample size was sufficiently large
Figure 11. Median and interquartile ranges of the Frobenius norm-based measure sF for
finite-sample networks stemming from the same population network with varying size and
density across 500 simulated pairs of networks estimated from data with varying
sub-sample sizes. The dashed gray line marks the true population value of sF = 1. p
denotes the number of nodes. N denotes the sub-sample size.
(>2000).
When sub-sample size was small, however, sF exhibited a severe downward bias
when the population networks exhibited high similarity (e.g., when the strongest edge
weight was halved, see the first row in Figure 12, or when all edge weights were mildly
dampened, see the second row in Figure 14), especially when network density was high. In
contrast, when true similarity was very low (e.g., when the sign of a large proportion of
edge weights was switched in large networks, see the panel in the third row and fourth
column in Figure 13), sF exhibited an upward bias under small sub-sample conditions. That
is, researchers would falsely conclude that two finite-sample networks stemming from
population networks with low similarity exhibit a comparably higher degree of similarity.
In short, under small-sample conditions, estimates of sF were not trustworthy in that they
tended to be less extreme than their population counterparts, such that networks with very
high population similarity appeared less similar, and networks with very low population
similarity appeared more similar, than they truly were.
Figure 12. Median and interquartile ranges of the Frobenius norm-based measure sF for
finite-sample networks stemming from population networks differing in single edge weights
across 500 simulated pairs of networks estimated from data with varying sub-sample sizes.
The dashed gray line marks the true population value. Density gives the density of
population network 1. Network 2 was derived from network 1 by manipulating single edge
weights. p denotes the number of nodes. N denotes the sub-sample size.
7 Interim Summary: Deriving Guidelines
We derive guidelines on the recommended scope of application and interpretation of sF
from our experiments conducted in Sections 5 (investigating population networks) and 6
(investigating finite-sample behavior).
When using sF to investigate network similarity in finite samples, researchers should
be aware that (a) when true population networks exhibit exact similarity, sF does not
approach 1 even for very large sub-sample sizes and (b) when, in contrast, true population
Figure 13. Median and interquartile ranges of the Frobenius norm-based measure sF for
finite-sample networks stemming from population networks differing in multiple edge
weights across 500 simulated pairs of networks estimated from data with varying
sub-sample sizes. The dashed gray line marks the true population value. Density gives the
density of population network 1. Network 2 was derived from network 1 by
manipulating multiple edge weights. p denotes the number of nodes. N denotes the
sub-sample size.
networks do not exhibit exact similarity, finite-sample estimates of sF do approach the true
population value, but only when sub-sample sizes are sufficiently large. Based on these
results, we strongly recommend two rules of use. First, researchers should pair sF with
significance tests for network comparisons, such as the NCT, and calculate sF only after
concluding that population networks indeed differ. Second, if the conducted test for
network comparison evidences differences in sub-population networks, we recommend
interpreting sF only if sub-sample sizes are sufficiently large, i.e., > 2000.
For conditions where these requirements are met, we derive initial guidelines on
interpreting the degree of similarity between sub-population networks from our
Figure 14. Median and interquartile ranges of the Frobenius norm-based measure sF for
finite-sample networks stemming from population networks with micro-differences in edge
weights across 500 simulated pairs of networks estimated from data with varying
sub-sample sizes. The dashed gray line marks the true population value. Density gives the
density of population network 1. Network 2 was derived from network 1 by dampening or
jittering edge weights. p denotes the number of nodes. N denotes the sub-sample size.
experiments with population networks. Recall that these served to provide intuition for the
scale of sF by investigating how scenarios of varying differences in psychometric networks
translate into sF. Note that the values of sF obtained for the most extreme scenarios of
large, dense networks with a large proportion of edge weights differing in sign were slightly
above .40, which means that such values are already indicative of very low similarity. We,
therefore, map values of sF from 1 to .50 in decrements of .10 onto the investigated scenarios
of network differences. Having an analog scenario for their obtained values of sF
may aid researchers in evaluating the severity and practical significance of differences in
their studied networks.
For each value of sF, Table 1 provides one example from Study I (comparing
networks implied by latent variable models) and one from Study II (comparing networks
implied by GGMs). For examples from Study I, we focus on differences in latent
correlations and provide the non-zero correlation of network 2 for which the respective sF
value has been obtained. For examples from Study II, we focus on networks with a medium
density of .50 differing in multiple edge weights and provide the type of manipulation and
proportion of affected edge weights for which the respective sF value has been obtained.
For instance, for sF = .90, researchers can infer that the two compared networks are as
similar as pairs of networks implied by two-dimensional latent variable models where
latent factors are perfectly correlated in network 1 and exhibit a correlation of .95 in
network 2. For a value of sF = .60, researchers could conclude that their to-be-compared
networks are as similar as a pair of large networks (p = 20) with a medium density of .50
that differ in sign in 30% of their edge weights. For further contextualizing these exemplar
scenarios, it should be kept in mind that our investigations of networks implied by GGMs
focused on networks with weights ranging from -.35 to .35, and that the values of sF
obtained for the different scenarios refer to the comparison of population networks.
8 Empirical Example
The empirical example serves (a) to illustrate the insights that can be gained from
quantifying network similarities, (b) to explore agreement and disagreement between sF
and scor in empirical data, and (c) to provide further intuition for network
similarities by giving examples of empirical networks with varying degrees of similarity in
terms of sF and scor. To these ends, we investigated patterns of similarities in psychometric
networks of human values across European countries.
Table 1
Exemplar scenarios for different values of sF

sF     Study I                    Study II
       p = 10      p = 20        p = 10                              p = 20
1      exact similarity
.90    ρ = .95     ρ = .95       20% of edges missing                10% of edges missing
.80    ρ = .80     ρ = .75       90% of edges missing                60% of edges missing
.70    ρ = .05     ρ = -.99      20% of edge weights sign-switched   10% of edge weights sign-switched
.60    ρ = -.99    —             40% of edge weights sign-switched   30% of edge weights sign-switched
.50    —           —             60% of edge weights sign-switched   —

Notes: Study I investigated differences between networks implied by latent
variable models. Study II investigated differences between networks implied by
Gaussian graphical models. In the displayed conditions of Study I, the
correlation between factors ρ was set to 1 for network 1 and varied for network
2. In the displayed conditions of Study II, network 1 had a medium density of
.50 and network 2 was derived from network 1 via the described manipulations.
p denotes the number of nodes. A dash indicates that the respective sF value
(or a smaller one) has not been observed in the studied scenario.
8.1 Method
We used data from the biennial European Social Survey (ESS) 2018.⁴ We focused on
21 items measuring basic human values, derived from Schwartz’s (1992) seminal theory, to
investigate patterns of similarities in human values across European countries. The ESS
basic human values scale includes verbal portraits of 21 different people, each reflecting the
importance of a value (see Table 3). For example, “Thinking up new ideas and being
creative is important to her. She likes to do things in her own original way” describes a
person for whom self-direction values are important. For each portrait, respondents are
asked to rate their likeness to the described person on a six-point Likert scale (1: “not like
me at all”; 6: “very much like me”), and their own values are inferred from these
self-reported similarities (see, e.g., Davidov, 2008, for further details).
We analyzed data from 27 European country samples (see Table 2).
⁴ Data can be retrieved from the ESS data portal via https://ess-search.nsd.no/
Table 2
Overview of Analyzed European Countries
Country Region N
Austria Western 2373
Belgium Western 1724
Bulgaria Eastern 1503
Switzerland Western 1423
Cyprus Southern 757
Czech Republic Eastern 2180
Germany Western 2246
Denmark Northern 469
Estonia Eastern 879
Spain Southern 1479
Finland Northern 1669
France Western 1804
UK Western 2122
Croatia Eastern 1679
Hungary Eastern 1574
Ireland Western 2082
Iceland Northern 787
Italy Southern 2457
Lithuania Eastern 1497
Latvia Eastern 805
Netherlands Western 1591
Norway Northern 1353
Poland Eastern 1296
Portugal Southern 994
Sweden Northern 1462
Slovenia Eastern 1226
Slovakia Eastern 994
A separate GGM was fit to data from each country. For estimation, we used the
same set-up as in the simulation study conducted in Section 6. For each of the
(27 choose 2) = 351 country pairs, we first tested for network invariance using the R package
NetworkComparisonTest (van Borkulo et al., 2017). The resampling-based network
invariance test evaluates the null hypothesis that, for a given pair of networks, all edges are
equal by evaluating whether the largest difference between corresponding edges of the
compared networks is significantly different from zero (van Borkulo et al., 2022). We
employed a significance level of α = .05 with a Bonferroni correction to account for the
multiple pairwise comparisons.
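As a quick check on the arithmetic of this correction, the number of pairwise tests and the Bonferroni-adjusted per-test threshold can be computed as follows (a minimal Python sketch for illustration; the analyses themselves were run in R):

```python
from math import comb

# Number of unordered country pairs among the 27 samples: C(27, 2)
n_tests = comb(27, 2)

# Bonferroni-adjusted per-test significance threshold
alpha_adjusted = 0.05 / n_tests

print(n_tests)         # 351
print(alpha_adjusted)  # ≈ 1.42e-4
```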
We computed and compared pairwise country network similarities in terms of sF
and scor only for network pairs for which the null hypothesis of all edges being equal was
rejected. Data and R code are provided in the OSF repository accompanying this study.
Table 3
Items of the European Social Survey Human Values Scale
Value Name Content
Universalism
UN1 Important that people are treated equally and have equal
opportunities
UN2 Important to understand different people
UN3 Important to care for nature and environment
Benevolence BE1 Important to help people and care for others’ well-being
BE2 Important to be loyal to friends and devote to people close
Conformity CO1 Important to do what is told and follow rules
CO2 Important to behave properly
Tradition TR1 Important to be humble and modest, not draw attention
TR2 Important to follow traditions and customs
Security SE1 Important to live in secure and safe surroundings
SE2 Important that government is strong and ensures safety
Power PO1 Important to be rich, have money and expensive things
PO2 Important to get respect from others
Achievement AC1 Important to show abilities and be admired
AC2 Important to be successful and that people recognise
achievements
Hedonism HE1 Important to have a good time
HE2 Important to seek fun and things that give pleasure
Stimulation ST1 Important to try new and different things in life
ST2 Important to seek adventures and have an exciting life
Self-direction SD1 Important to think new ideas and being creative
SD2 Important to make own decisions and be free
8.2 Results
Of the 351 country pairs, 297 exhibited significant differences in human values networks.
For these country pairs, both sF and scor indicated considerable variability in
similarities across European countries (range sF: [.66; .80]; range scor: [.34; .83]).
Both the network invariance test and the two similarity measures suggested that Western
(proportion of significant network invariance tests: .57; median sF: .75; median scor:
.76) and Northern European (proportion of significant network invariance tests: .50;
median sF: .75; median scor: .74) countries shared strong similarities in human values
networks among each other: network pairs tended to show no significant differences or, if
they did, exhibited high similarities. Eastern (proportion of significant network
invariance tests: .89; median sF: .71; median scor: .57) and, to a lesser extent, Southern
(proportion of significant network invariance tests: .67; median sF: .71; median scor:
.62) European countries had more distinctive networks; that is, they tended to exhibit
significant differences more often and to be less similar to the other countries in their
group. This pattern held for both similarity measures and is illustrated in Figure 15,
which displays human values network similarities for the 27 investigated European
countries in terms of sF and scor. Blue lines indicate that human values networks did not
significantly differ and are identical in Figures 15a and 15b. For human values network
pairs with significant differences, displayed in red, thicker lines indicate higher
similarity in terms of the respective measure. For readability, only the top 20 percent of
network similarities is displayed for each measure.
Figure 15 also illustrates some disagreement between the measures. For instance,
Italy’s human values network was significantly different from all other networks, as
indicated by the absence of blue lines connecting Italy with other countries. While for
sF none of Italy’s similarities to other countries was in the top 20 percent, scor
indicated high similarity of Italy’s network with a few other European countries.
sF and scor exhibited a high, but not perfect, rank-order correlation of .89. To
illustrate the differences in rank orders, Figures 16 and 17 display the human values
networks with the highest and lowest similarities for the two measures. Of all network
pairs with significant invariance tests, both measures yielded the highest values for the
similarity between the networks of Switzerland and Germany (sF = .80; scor = .83). To further
[Figure 15 graphic: two network-similarity graphs over the 27 countries, panel (a) sF and panel (b) scor; country node labels omitted.]
Figure 15. Human values network similarities for 27 European countries for the two
similarity measures. Blue lines indicate that human values networks did not significantly
differ. For human values network pairs with significant differences, displayed in red,
thicker lines indicate higher similarity in terms of the respective measure. Only the top
20 percent of network similarities is displayed for each measure; that is, an absent line
indicates that the network invariance test was significant and the similarity value was
not in the top 20 percent.
investigate sources of deviations from perfect similarity between the two human values
networks, we inspected the distribution of non-zero differences in edge weights of the
German and Swiss human values networks, displayed in Figure 18. As can be seen,
differences oscillated around zero, indicating that deviations from perfect similarity
between the two networks were driven by small differences spread across the whole network
rather than strongly pronounced differences in single edge weights.
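Such a distribution of non-zero edge weight differences can be read directly off the two weighted adjacency matrices. The following Python sketch illustrates the computation on small hypothetical 4-node matrices (the actual analysis used the estimated 21-node GGMs in R):

```python
# Toy 4-node weighted adjacency matrices (symmetric, unit diagonal),
# standing in for two estimated GGMs; all values are hypothetical.
A = [[1.0, 0.2, 0.0, 0.1],
     [0.2, 1.0, 0.3, 0.0],
     [0.0, 0.3, 1.0, 0.0],
     [0.1, 0.0, 0.0, 1.0]]
B = [[1.0, 0.1, 0.0, 0.1],
     [0.1, 1.0, 0.4, 0.0],
     [0.0, 0.4, 1.0, 0.1],
     [0.1, 0.0, 0.1, 1.0]]

def nonzero_edge_weight_differences(a, b):
    """Collect a_ij - b_ij over the strictly lower triangle,
    keeping only pairs whose edge weights actually differ."""
    p = len(a)
    return [a[i][j] - b[i][j]
            for i in range(p) for j in range(i)
            if a[i][j] != b[i][j]]

diffs = nonzero_edge_weight_differences(A, B)
print(diffs)  # three differences of magnitude 0.1
```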
[Figure 16 graphic: human values networks of Switzerland and Germany over the 21 value items (UN1–SD2).]
Figure 16. Significantly different human values networks with the highest similarity for
both similarity measures (sF = .80; scor = .83).
[Figure 17 graphic: human values networks of Iceland, Bulgaria, Portugal, and Slovakia over the 21 value items (UN1–SD2).]
Figure 17. Significantly different human values networks with the lowest similarity for
the two similarity measures. sF was lowest for the networks of Iceland and Bulgaria
(sF = .66; scor = .38), while scor was lowest for the networks of Slovakia and
Portugal (sF = .68; scor = .34).
Iceland and Bulgaria (see Figure 17) displayed the lowest similarity in terms of sF
(sF = .66; scor = .38). With 18 and 60 out of 210 possible edges present, corresponding to
densities of .09 and .29, these were among the networks with the lowest and highest
density, respectively, and only 13 of Iceland’s 18 edges were also included in Bulgaria’s
network. Slovakia and Portugal were the least similar in terms of scor (sF = .68;
scor = .34). With 36 and 31 edges, corresponding to densities of .17 and .15, both
networks were relatively sparse, and only 13 of Portugal’s 31 edges were also included in
Slovakia’s network. Note that, again, the differences within each measure for these two
pairs were rather small. That is, overall, the two measures agreed that the networks of
Iceland and Bulgaria were about as dissimilar as those of Slovakia and Portugal.
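The densities reported here follow directly from the edge counts: with p = 21 nodes there are p(p − 1)/2 = 210 possible edges. A minimal Python check (for illustration; the analyses themselves used R):

```python
def density(n_edges, p):
    """Edge density of an undirected network with p nodes."""
    return n_edges / (p * (p - 1) / 2)

p = 21                          # items in the human values networks
assert p * (p - 1) // 2 == 210  # possible edges among 21 nodes

print(round(density(18, p), 2))  # Iceland:  0.09
print(round(density(60, p), 2))  # Bulgaria: 0.29
print(round(density(36, p), 2))  # Slovakia: 0.17
print(round(density(31, p), 2))  # Portugal: 0.15
```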
[Figure 18 graphic: histogram of non-zero edge weight differences (Germany − Switzerland) on the x-axis, ranging from −0.2 to 0.2, against the number of edge weights on the y-axis.]
Figure 18. Distribution of non-zero differences in edge weights of the German and Swiss
human values networks.
9 Discussion
The present study aimed to provide a similarity measure for quantifying differences in
psychometric networks. To this end, we derived a similarity measure based on the
Frobenius norm of differences in psychometric networks’ weighted adjacency matrices. The
measure originates in a commonly used graph-theoretical measure for determining
similarity of graphs and accommodates the specifics of psychometric networks. We
illustrated and evaluated the proposed similarity measure sF by studying a wide range of
differences in pairs of population network models implied by latent variable models and
by GGMs, as well as sF’s finite-sample behavior.
In our evaluations of population network pairs, we showed that the studied
scenarios were captured in an intuitive manner by sF, while the same did not hold true for
customary correlation-based comparisons. Our evaluations further underlined pitfalls of
currently used correlation-based comparisons (see also Brusco, 2004; Brusco & Cradit,
2005; Hubert, 1978) and, as such, the need for a measure that is more tailored towards the
context of psychometric networks. First, correlation-based comparisons do not allow
comparisons with empty networks due to lack of variance in the entries of the comparison
network’s weighted adjacency matrix. Second, correlation-based comparisons are highly
sensitive to the variability of the entries of the strictly lower triangular parts of
networks’ weighted adjacency matrices. Among other things, this implies different conclusions concerning similarity
between networks for the same type of difference when networks possess low as compared
to high homogeneity of edge weights, or low as compared to high density. Third,
correlation-based comparisons are not sensitive to edge weights being dampened. This
may, for instance, be the case when a clinical intervention reduces symptom dependencies
compared to the baseline. Fourth, on a more general note, the correlation of the strictly
lower triangulars of networks’ weighted adjacency matrices is not a true metric, meaning
that a correlation of 1 does not necessarily imply equality of psychometric networks. This
is different for sF, which allows drawing such conclusions when sF= 1.
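This fourth point can be illustrated with a toy example: if one network is a dampened duplicate of another (here, all edge weights halved; the 3-node matrices are hypothetical), the correlation of the strictly lower triangular parts equals 1 even though the networks clearly differ, whereas sF stays below 1. A Python sketch:

```python
import math

def s_f(a1, a2, p):
    """sF = 1 / (1 + ||A1 - A2||_F / sqrt(p/2)), as in Appendix B."""
    frob = math.sqrt(sum((a1[i][j] - a2[i][j]) ** 2
                         for i in range(p) for j in range(p)))
    return 1 / (1 + frob / math.sqrt(p / 2))

def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

A1 = [[1.0, 0.4, 0.2], [0.4, 1.0, 0.1], [0.2, 0.1, 1.0]]
# A2 is a "dampened duplicate" of A1: every edge weight halved
A2 = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.05], [0.1, 0.05, 1.0]]

lower1 = [A1[i][j] for i in range(3) for j in range(i)]
lower2 = [A2[i][j] for i in range(3) for j in range(i)]

r = pearson(lower1, lower2)
sf = s_f(A1, A2, 3)
print(r)   # 1.0 (up to floating point): the correlation is blind to dampening
print(sf)  # clearly below 1: sF registers the difference
```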
In studying the finite-sample behavior of sF, we found sF to be consistent (albeit biased
for small sub-sample sizes) only if population networks did not exhibit exact similarity.
We therefore strongly recommend (a) using sF only as a complement to significance tests
for network comparisons, such as the NCT, and interpreting it only if there is evidence
for differences in sub-population networks, and (b) interpreting sF only if sub-sample
sizes are sufficiently large, i.e., > 2000. To guide the interpretation of sF when these
conditions are met, we summarized the results of our investigations of sub-population
networks by providing exemplar scenarios for different values of sF.
A further aspect to keep in mind when interpreting sF is that it is a measure of global
similarity that compresses structural similarities and differences between the
to-be-compared networks into a single score. As such, from mere inspection of sF, it is
not possible to determine whether networks have low similarity because of large
differences in some local area of the networks (in the most extreme case, a single edge
weight) or due to small differences spread throughout the networks. A fairly
straightforward follow-up analysis to better understand the determinants of an observed
sF score is an assessment of the distribution of edge weight differences (see Figure 18
from the empirical application for an example).
In an empirical application based on cross-country comparisons of human values
networks, we showcased the potential insights from quantifying network similarity. We
illustrated that sF can be employed for both normative (e.g., the human values networks
of countries A and B exhibit low similarity) and ipsative evaluations (e.g., the networks
of countries A and B exhibit the same degree of similarity as the networks of countries C
and D) of network similarity.
In motivating and evaluating sF, we focused on the quantification of similarity
between sub-population psychometric networks. Another use case for sF that we view as
worthwhile is in simulation studies evaluating the statistical performance of network
estimation techniques. Specificity, sensitivity, and scor are commonly employed outcome
measures in psychometric network simulations (e.g., Isvoranu et al., 2021; van Borkulo
et al., 2014). For a more nuanced evaluation of bias in network estimation, researchers
may consider complementing these with sF between the data-generating and estimated
networks.
Note that although sF was illustrated on GGMs, it is also applicable to other types
of Markov random fields, e.g., Ising models or mixed graphical models. This is because sF
uses the weighted adjacency matrices of networks that have already been estimated and is,
therefore, not bound to specific models or estimation techniques used to obtain the
networks. Nevertheless, separate interpretation guidelines may need to be established, as
bounds and typical ranges of differences in edge weights may differ markedly between
different types of Markov random fields. Recall that GGMs’ edge weights are partial
correlations bounded between −1 and 1. That is, the largest possible contribution of a
single absolute difference between corresponding edge weights to sF is 2. This is
different for Ising models, where edge weights are not bounded and magnitudes can exceed
unity. For instance, in an application of the Ising model to political opinion data,
Brusco et al. (2022) reported maximum edge weights as high as roughly 3.5. For edge
weights of this size, one can easily imagine scenarios where the contribution of a single
absolute difference between corresponding edge weights to sF far exceeds 2, e.g., when
the corresponding edge weight in the comparison network is close to zero or negative.
Consequently, sF used to study differences between Ising models may take values that are
much smaller than sF applied in the context of GGMs.
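The effect of such unbounded edge weights on sF can be illustrated with a toy example (Python; the 3-node networks are hypothetical, and only the edge weight scales are taken from the discussion above): a single edge difference at the maximal GGM scale of 2 versus one at an Ising-type scale of 3.5.

```python
import math

def s_f(a1, a2, p):
    """sF from the normalized Frobenius norm (cf. Appendix B)."""
    frob = math.sqrt(sum((a1[i][j] - a2[i][j]) ** 2
                         for i in range(p) for j in range(p)))
    return 1 / (1 + frob / math.sqrt(p / 2))

def single_edge_net(p, w):
    """p-node network whose only (symmetric) edge has weight w."""
    a = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    a[0][1] = a[1][0] = w
    return a

p = 3
# GGM: partial correlations are bounded, so one edge can differ by at most |1 - (-1)| = 2
sf_ggm = s_f(single_edge_net(p, 1.0), single_edge_net(p, -1.0), p)
# Ising: weights are unbounded; a weight of 3.5 against an absent edge differs by 3.5
sf_ising = s_f(single_edge_net(p, 3.5), single_edge_net(p, 0.0), p)

print(sf_ggm, sf_ising)  # the Ising-scale difference yields the smaller sF
```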
9.1 Limitations and Future Directions
The interpretation guidelines derived from simulation results provide initial guidance for
interpreting obtained similarities. Nevertheless, we point out that these guidelines are
preliminary and should be subject to further refinement once a broader range of scenarios
has been investigated and subject-matter expertise on typical similarity values in different
fields of application of psychometric networks has accumulated.
Note that the finite-sample behavior of sF not only constrains its applicability to
research contexts where network comparison tests indicate differences in sub-population
networks and sub-sample sizes are sufficiently large, but also impedes quantifying
uncertainty in similarity estimates. While bootstrapping seems like an obvious means to
this end, its employment is not straightforward, as bootstrap resamples may often
yield similarities below the sample point estimate. Hence, bootstrapped confidence
intervals for sF can be assumed to underestimate similarity, especially when sF is high.
Developing techniques for quantifying uncertainty in similarity estimates remains an
important task for future research.
The present study focused on using sFas an effect size measure analogue for
quantifying psychometric network differences. Having a measure for quantifying network
similarities, however, also opens the path for uncovering groups of similar networks. More
specifically, when multiple networks are compared (as in, e.g., Fried et al., 2018, who
conducted a cross-cultural multi-site study of post-traumatic stress disorder symptom
networks), the result can, again, be depicted as a complete graph, with nodes denoting networks
and edge weights quantifying similarity among them. Then, graph-modeled data clustering
techniques such as spectral clustering (Von Luxburg, 2007) or cluster editing (Böcker &
Baumbach, 2013) can be used to uncover clusters of networks that are similar to each
other, such that researchers may identify sets of sub-populations that share common
network structures. When comparing symptom networks, researchers may uncover groups
of sub-populations that differ in inter-relationships among symptoms and may design
group-specific interventions. Likewise, clustering idiographic networks (see Epskamp,
van Borkulo, et al., 2018, for an introduction) may uncover sub-populations of, say,
patients that have different network structures and may require different treatment.
Evaluating different clustering techniques and exploring their potential for identifying
subgroups of similar networks pose interesting topics for future research.
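In practice, spectral clustering or cluster editing would be the methods of choice. As a minimal self-contained illustration of the general idea, the Python sketch below computes pairwise sF values among three hypothetical 3-node networks, thresholds them (the cutoff of .8 is an arbitrary assumption, not a value from the paper), and reads clusters off the connected components of the resulting similarity graph:

```python
import math

def s_f(a1, a2, p):
    """sF based on the normalized Frobenius norm (cf. Appendix B)."""
    frob = math.sqrt(sum((a1[i][j] - a2[i][j]) ** 2
                         for i in range(p) for j in range(p)))
    return 1 / (1 + frob / math.sqrt(p / 2))

def clusters_by_threshold(networks, p, threshold):
    """Group networks whose pairwise sF exceeds `threshold` via
    connected components (union-find) of the thresholded similarity graph."""
    n = len(networks)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if s_f(networks[i], networks[j], p) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Three hypothetical 3-node networks: the first two nearly identical,
# the third clearly different from both.
n1 = [[1.0, 0.30, 0.0], [0.30, 1.0, 0.2], [0.0, 0.2, 1.0]]
n2 = [[1.0, 0.25, 0.0], [0.25, 1.0, 0.2], [0.0, 0.2, 1.0]]
n3 = [[1.0, -0.40, 0.5], [-0.40, 1.0, -0.3], [0.5, -0.3, 1.0]]

clusters = clusters_by_threshold([n1, n2, n3], p=3, threshold=0.8)
print(clusters)  # [[0, 1], [2]]
```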
We point out that the present study is aimed at initiating discussions on urgently
needed effect size measures for quantifying differences in sub-population psychometric
networks rather than providing definitive solutions. The proposed sF captures differences
in sub-population networks in an intuitive manner. It is, however, limited by a sampling
distribution that is often not centered around the true population value. Future research may
develop and explore alternative measures.
The concordance indices for comparing proximity matrices developed by Hubert (1978),
for instance, may be a promising starting point. Intuitively speaking, these indices
capture concordance in the ordering of the entries of proximity matrices, i.e., in the
present application, concordance in the ordering of the to-be-compared psychometric
networks’ edge weights. Different indices exist that are sensitive to different aspects of concordance in
ordering. In psychological research, these indices have been applied to compare confusion
matrices (Brusco, 2004) as well as in the context of comparing submatrices of
multitrait-multimethod correlation matrices (Hubert & Baker, 1979), and recently gained
attention in network psychometric applications as a means of easing the interpretability
of weighted adjacency matrices by re-ordering items in a meaningful way (Brusco et al.,
2022). Their utility as effect size measures for quantifying differences in psychometric
networks, however, has not been studied yet.⁵ A potential limitation of concordance indices
may be that they capture differences related to ordinal properties of the to-be-compared
matrices. That is, concordance indices can be expected not to be sensitive to scenarios
where differences between networks are not reflected in the ordering, e.g., when one network is
a dampened duplicate of the other. Likewise, in analogy to drawing on graph-theoretical
literature for quantifying differences in psychometric networks, concordance indices would
need to be selected with care, as not all indices may capture aspects that are meaningful in
the context of network psychometrics. Concordance indices may, however, have a more
favorable sampling distribution that is centered around the population value and, if so,
could be a viable alternative when network comparison tests are not significant or when
sub-sample sizes are small. A further advantage is that generalized concordance indices
exist that allow for comparing more than two matrices (Hubert, 1987), while, in its current
form, sF can only be employed for pairwise comparisons.
Finally, we focused on the special case of comparing sub-population networks
comprising the same set of variables, assuming known node correspondence. Future
research may expand on comparisons of psychometric networks comprising different sets of
⁵ Hubert (1987) also presented permutation-based significance tests for concordance indices. These
compare the obtained concordance index value to a reference distribution of no concordance between
proximity matrices. Note, however, that these tests are based on permutation of matrix entries, while the
entries themselves are treated as fixed. Hence, when applied in the context of network psychometrics,
permutation-based significance tests for concordance indices would neglect the uncertainty of edge weight
estimates due to sampling variation and, in our view, therefore do not pose an alternative to existing
network comparison tests.
variables, e.g., when comparing networks using different measures for similar constructs.
Before evaluating measures applicable to compare networks with unknown node
correspondence, however, conceptual discussions are needed on the specific patterns of
similarity among psychometric networks with unknown node correspondence that applied
researchers would deem meaningful and may want to capture.
Extra Material
Extra materials for this article can be found in the OSF and are available via the following
link: https://osf.io/guxf8/
Appendix A
Example Calculation for sF
To illustrate the calculation of sF, consider two networks n and m with p = 3 nodes
and weighted adjacency matrices

An = [ 1.0  0.3  0.2          Am = [ 1.0  0.1  0.4
       0.3  1.0  0.1                 0.1  1.0  0.0
       0.2  0.1  1.0 ]               0.4  0.0  1.0 ].

The normalized Frobenius norm of the difference between these two adjacency
matrices is

dF(n, m) = (1/√(3/2)) · √( 3·|1.0 − 1.0|² + 2·|0.3 − 0.1|² + 2·|0.2 − 0.4|² + 2·|0.1 − 0.0|² )
         = (1/√(3/2)) · √0.18 ≈ 0.35.

Then, sF = 1/(1 + dF(n, m)) ≈ 0.74.
Appendix B
R Function for Obtaining sF
sF <- function(n1, n2, p){
  # n1: weighted adjacency matrix for network 1
  # n2: weighted adjacency matrix for network 2
  # p: number of nodes
  return(1 / (1 + (norm(n1 - n2, type = "F") / sqrt(p / 2))))
}
Figure B1. R function for obtaining sF.
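For readers working outside R, the same computation can be ported, for instance, to Python and checked against the worked example in Appendix A (an illustrative sketch, not the authors’ code):

```python
import math

def s_f(n1, n2, p):
    """Python port of the R function above:
    sF = 1 / (1 + ||n1 - n2||_F / sqrt(p / 2))."""
    frob = math.sqrt(sum((n1[i][j] - n2[i][j]) ** 2
                         for i in range(p) for j in range(p)))
    return 1 / (1 + frob / math.sqrt(p / 2))

# Weighted adjacency matrices from the worked example in Appendix A
An = [[1.0, 0.3, 0.2], [0.3, 1.0, 0.1], [0.2, 0.1, 1.0]]
Am = [[1.0, 0.1, 0.4], [0.1, 1.0, 0.0], [0.4, 0.0, 1.0]]

print(round(s_f(An, Am, 3), 2))  # 0.74, matching Appendix A
```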
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
The topic of replicability has recently captivated the emerging field of network psychometrics. Although methodological practice (e.g., p-hacking) has been identified as a root cause of unreliable research findings in psychological science, the statistical model itself has come under attack in the partial correlation network literature. In a motivating example, I first describe how sampling variability inherent to partial correlations can merely give the appearance of unreliability. For example, when going from zero-order to partial correlations there is necessarily more sampling variability that translates into reduced statistical power. I then introduce novel methodology for deriving expected network replicability (ENR), wherein replication is modeled with the Poisson-binomial distribution. This analytic solution can be used with the Pearson, Spearman, Kendall, and polychoric partial correlation coefficient. I first employed the method to estimate ENR for a variety of data sets from the network literature. Here it was determined that partial correlation networks do not have inherent limitations, given current estimates of replicability were consistent with ENR. I then highlighted sources that can reduce replicability, that is, when going from continuous to ordinal data with few categories and employing a multiple comparisons correction. To address these challenges, I described a strategy for using the proposed method to plan for network replication. I end with recommendations that include the importance of the network literature repositioning itself with gold-standard approaches for assessing replication, including explicit consideration of Type I and Type II error rates. The method for computing ENR is implemented in the R package GGMnonreg. (PsycInfo Database Record (c) 2022 APA, all rights reserved).
Article
Full-text available
The network approach, in which psychological constructs are modeled in terms of interactions between their constituent factors, have rapidly gained popularity in psychology. Applications of such network approaches to various psychological constructs have recently moved from a descriptive stance, in which the goal is to estimate the network structure, to a more comparative stance, in which the goal is to compare network structures across groups. However, the statistical tools to do so are lacking. In this article, we present the network comparison test (NCT). NCT is a statistical test that compares two network structures on three types of characteristics. Performance of NCT is evaluated by means of a simulation study. Simulated data shows that NCT performs well in various circumstances for all three tests: when the groups are simulated to be similar, the error rate (i.e., NCT indicating that they are different, while the simulated networks are similar) is adequately low, and when the groups are simulated to be different, the ability to detect a difference is sufficiently high when the difference between simulated networks and the sample size are substantial. We illustrate NCT by comparing depression symptom networks of males and females. Possible extensions of NCT are discussed.
Article
Full-text available
This study aimed to investigate direct relationships of work addiction symptoms with dimensions of work engagement. We used three samples in which work addiction was measured with the Bergen Work Addiction Scale and work engagement was measured with the Utrecht Work Engagement Scale. One sample comprised responses from working Norwegians ( n 1 = 776), and two samples comprised responses from working Poles ( n 2 = 719; n 3 = 715). We jointly estimated three networks using the fused graphic lasso method. Additionally, we estimated the stability of each network, node centrality, and node predictability and quantitatively compared all networks. The results showed that absorption and mood modification could constitute a bridge between work addiction and work engagement. It suggests that further investigation of properties of absorption and mood modification might be crucial for answering the question of how engaged workers become addicted to work.
Article
Full-text available
Posttraumatic stress disorder (PTSD) researchers have increasingly used psychological network models to investigate PTSD symptom interactions, as well as to identify central driver symptoms. It is unclear, however, how generalizable such results are. We have developed a meta-analytic framework for aggregating network studies while taking between-study heterogeneity into account and applied this framework in the first-ever meta-analytic study of PTSD symptom networks. We analyzed the correlational structures of 52 different samples with a total sample size of n = 29,561 and estimated a single pooled network model underlying the data sets, investigated the scope of between-study heterogeneity, and assessed the performance of network models estimated from single studies. Our main findings are that: (a) We identified large between-study heterogeneity, indicating that it should be expected for networks of single studies to not perfectly align with one-another, and meta-analytic approaches are vital for the study of PTSD networks. (b) While several clear symptom-links, interpretable clusters, and significant differences between strength of edges and centrality of nodes can be identified in the network, no single or small set of nodes that clearly played a more central role than other nodes could be pinpointed, except for the symptom "amnesia" that was clearly the least central symptom. (c) Despite large between-study heterogeneity, we found that network models estimated from single samples can lead to similar network structures as the pooled network model. We discuss the implications of these findings for both the PTSD literature as well as methodological literature on network psychometrics. (PsycInfo Database Record (c) 2021 APA, all rights reserved).
Article
Full-text available
Model fit assessment is a central component of evaluating confirmatory factor analysis models and the validity of psychological assessments. Fit indices remain popular and researchers often judge fit with fixed cutoffs derived by Hu and Bentler (1999). Despite their overwhelming popularity, methodological studies have cautioned against fixed cutoffs, noting that the meaning of fit indices varies based on a complex interaction of model characteristics like factor reliability, number of items, and number of factors. Criticism of fixed cutoffs stems primarily from the fact that they were derived from one specific confirmatory factor analysis model and lack generalizability. To address this, we propose a simulation-based method called dynamic fit index cutoffs such that derivation of cutoffs is adaptively tailored to the specific model and data characteristics being evaluated. Unlike previously proposed simulation-based techniques, our method removes existing barriers to implementation by providing an open-source, Web based Shiny software application that automates the entire process so that users neither need to manually write any software code nor be knowledgeable about foundations of Monte Carlo simulation. Additionally, we extend fit index cutoff derivations to include sets of cutoffs for multiple levels of misspecification. In doing so, fit indices can more closely resemble their originally intended purpose as effect sizes quantifying misfit rather than improperly functioning as ad hoc hypothesis tests. We also provide an approach specifically designed for the nuances of 1-factor models, which have received surprisingly little attention in the literature despite frequent substantive interests in unidimensionality. (PsycInfo Database Record (c) 2021 APA, all rights reserved).
Article
Full-text available
Statistical network models such as the Gaussian Graphical Model and the Ising model have become popular tools to analyze multivariate psychological datasets. In many applications, the goal is to compare such network models across groups. In this paper, I introduce a method to estimate group differences in network models that is based on moderation analysis. This method is attractive because it allows one to make comparisons across more than two groups for all parameters within a single model and because it is implemented for all commonly used cross-sectional network models. Next to introducing the method, I evaluate the performance of the proposed method and existing approaches in a simulation study. Finally, I provide a fully reproducible tutorial on how to use the proposed method to compare a network model across three groups using the R-package mgm.
Article
Full-text available
A growing number of publications focus on estimating Gaussian graphical models (GGM, networks of partial correlation coefficients). At the same time, generalizibility and replicability of these highly parameterized models are debated, and sample sizes typically found in datasets may not be sufficient for estimating the underlying network structure. In addition, while recent work emerged that aims to compare networks based on different samples, these studies do not take potential cross-study heterogeneity into account. To this end, this paper introduces methods for estimating GGMs by aggregating over multiple datasets. We first introduce a general maximum likelihood estimation modeling framework in which all discussed models are embedded. This modeling framework is subsequently used to introduce meta-analytic Gaussian network aggregation (MAGNA). We discuss two variants: fixed-effects MAGNA, in which heterogeneity across studies is not taken into account, and random-effects MAGNA, which models sample correlations and takes heterogeneity into account. We assess the performance of MAGNA in large-scale simulation studies. Finally, we exemplify the method using four datasets of post-traumatic stress disorder (PTSD) symptoms, and summarize findings from a larger meta-analysis of PTSD symptom.
Article
Social scientists have long studied international differences in political culture and communication. An influential strand of theory within political science argues that different types of political systems generate different parliamentary cultures: Systems with proportional representation generate cross-party cohesion, whereas majoritarian systems generate division. To contribute to this long-standing discussion, we study parliamentarian retweets across party lines using a database of 2.3 million retweets by 4,018 incumbent parliamentarians across 19 countries during 2018. We find that there is at most a tenuous relationship between democratic systems and cross-party retweeting: Majoritarian systems are not unequivocally more divisive than proportional systems. Moreover, we find important qualitative differences: Countries are not only more or less divisive, but they are cohesive and divisive in different ways. To capture this complexity, we complement our quantitative analysis with Visual Network Analysis to identify four types of network structures: divided, bipolar, fringe party, and cohesive.
Article
The Ising model has received significant attention in network psychometrics during the past decade. A popular estimation procedure is IsingFit, which uses nodewise l1-regularized logistic regression along with the extended Bayesian information criterion to establish the edge weights for the network. In this paper, we report the results of a simulation study comparing IsingFit to two alternative approaches: (1) a nonregularized nodewise stepwise logistic regression method, and (2) a recently proposed global l1-regularized logistic regression method that estimates all edge weights in a single stage, thus circumventing the need for nodewise estimation. MATLAB scripts for the methods are provided as supplemental material. The global l1-regularized logistic regression method generally provided greater accuracy and sensitivity than IsingFit, at the expense of lower specificity and much greater computation time. The stepwise approach showed considerable promise. Relative to the l1-regularized approaches, the stepwise method provided better average specificity for all experimental conditions, as well as comparable accuracy and sensitivity at the largest sample size.
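The nodewise scheme common to these estimators can be sketched as follows. This simplified Python version uses scikit-learn's l1-penalized logistic regression with a fixed penalty and averages the two estimates per edge; IsingFit instead selects the penalty via the extended BIC, and an AND/OR rule is the usual symmetrization choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nodewise_ising(X, C=1.0):
    # Regress each binary node on all others with an l1 penalty;
    # the coefficients are the (asymmetric) edge-weight estimates,
    # which are then symmetrized by averaging.  Simplified sketch:
    # no EBIC tuning of the penalty, no AND-rule.
    n, p = X.shape
    W = np.zeros((p, p))
    for j in range(p):
        others = np.delete(np.arange(p), j)
        clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
        clf.fit(X[:, others], X[:, j])
        W[j, others] = clf.coef_[0]
    return (W + W.T) / 2.0

# toy binary data: three noisy copies of a common binary source
rng = np.random.default_rng(1)
z = rng.integers(0, 2, size=(300, 1))
flip = rng.random((300, 3)) < 0.1  # 10% of entries are flipped
X = np.abs(np.repeat(z, 3, axis=1) - flip.astype(int))
W = nodewise_ising(X)
```

The global single-stage approach discussed in the abstract replaces the loop over nodes with one joint optimization over all edge weights, which is why its estimates need no symmetrization step.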
Article
Objective This study applied network analyses to illustrate patterns of associations between cancer-related physical and psychological symptoms (CPPS) and quality of life (QOL) before and after surgery. Methods Participants were 256 gastric cancer patients admitted for curative resection surgery at the surgical department of a teaching hospital in Korea between May 2016 and November 2017. Participants completed a survey including the MD Anderson Symptom Inventory, the Hospital Anxiety and Depression Scale, and the Functional Assessment of Cancer Therapy-Gastric Cancer before surgery (T0), one week after surgery (T1), and 3–6 months after surgery (T2). Results The three networks featured several salient connections of varying magnitude between CPPS and QOL across all time points. In particular, anxiety was tightly connected to emotional wellbeing (EWB) across all time points and to physical wellbeing (PWB) at T1. Depression, in contrast, was connected to functional wellbeing at T0 and T2, gastric cancer concerns (CS) at T1, and PWB at T2. Distress and sadness were the most central symptoms in the three networks; other central symptoms included shortness of breath at T0, fatigue at T0 and T1, and PWB and CS at T2. Anxiety, depression, and EWB served as bridges connecting CPPS to QOL across all time points with varying degrees of importance, as did PWB at T1 and T2. Conclusions Treating psychological distress and enhancing EWB and PWB may be high-impact intervention strategies throughout the cancer trajectory.