
On applications of marginal models for categorical data

February 2004
METRON LXII(1):15-37


Source
RePEc

Authors:

Tamás Rudas

Eötvös Loránd University

Wicher Bergsma

The London School of Economics and Political Science

The paper considers marginal models for categorical data and after reviewing the most important theoretical results concerning the definition, estimation and testing of such models, discusses a number of common statistical problems. These examples include, among others, the analysis of repeated measurements, panel studies and missing data. Fitting marginal models in these cases has the potential of providing the researcher with substantial new insight. The examples illustrate that the marginal modeling approach may be used more widely than thought before. One of the examples shows how graphical models associated with directed acyclic graphs can be parameterized. A general algorithm is presented to compute maximum likelihood estimates under marginal models.


METRON - International Journal of Statistics

2004, vol. LXII, n. 1, pp. 15-37

TAMÁS RUDAS – WICHER P. BERGSMA

On applications of marginal models for categorical data

Summary - The paper considers marginal models for categorical data and after reviewing the most important theoretical results concerning the definition, estimation and testing of such models, discusses a number of common statistical problems. These examples include, among others, the analysis of repeated measurements, panel studies and missing data. Fitting marginal models in these cases has the potential of providing the researcher with substantial new insight. The examples illustrate that the marginal modeling approach may be used more widely than thought before. One of the examples shows how graphical models associated with directed acyclic graphs can be parameterized. A general algorithm is presented to compute maximum likelihood estimates under marginal models.

Key Words - Graphical models; Log-linear models; Marginal models; Maximum likelihood estimation; Missing data; Repeated measurements.

1. Introduction

During the past decade, a fair number of papers applying marginal models to medical (Balagtes, Becker, Lang, (1995), Molenberghs, Lesaffre, (1999)) and sociological (Becker, (1994), Becker, Minick, Yang, (1998)) data, parallel to papers exploring components of the theory of marginal modeling (Lang, Agresti, (1994), Glonek, McCullagh, (1995), Bergsma, (1997), Lang, McDonald, Smith, (1999), Colombi, Forcina, (2001), Bartolucci, Forcina, Dardadoni, (2001), Bartolucci, Forcina, (2002), Bergsma, Rudas, (2002a), Bergsma, Rudas, (2002b)) have been published. In its most general form, a marginal model, when applied to a multivariate statistical problem, imposes structural restrictions on certain marginals (i.e., subsets) of the original variables. When the variables are categorical, the models for the marginals are usually of the log-linear or of the log-affine type. Such models are most conveniently formulated by restricting the values of appropriately defined parameters. Therefore, the existence, flexibility and interpretability of marginal models depend largely on the parameters that are used to formulate the model.

Received October 2003 and revised December 2003.

The present paper, based on recent theoretical developments (Bergsma, Rudas, (2002a)), illustrates the applicability of a large class of marginal models to a variety of statistical problems. This class of models is based on restricting the values of certain marginal parameters of the joint distribution in a contingency table. This is a flexible class of parameters that generalizes earlier approaches to defining marginal parameters (Glonek, McCullagh, (1995), Glonek, (1996), Kauermann, (1997)). Certain combinatorial properties of the variables involved imply smoothness of the parameterization and variation independence of its components. These properties are essential in interpretation, and they imply the existence of a large class of models and the applicability of standard large sample theory for estimation and testing.

The present paper contains almost no proofs. Section 2 gives a somewhat informal exposition of the theory referred to above and Section 3 considers applications of marginal models to a number of common statistical problems. These include measuring the effect of a treatment, panel studies and Markov chains, data fusion, missing data, joint treatment of the sampling and statistical models, and graphical models. It is not the goal of the present paper to explore the analyses made possible by the marginal approach in any depth; rather, the aim is to illustrate that fitting marginal models and the interpretation of carefully defined parameters may yield new insight into the above problems and, often, appears to be the appropriate strategy. Finally, Section 4 describes an algorithm to fit marginal log-linear or log-affine models. Much of the theory and applications discussed in the present paper extend in a natural way to problems involving continuous data, but these generalizations will not be considered here.

2. Theory

The class of marginal models applied in this paper is based on marginal log-linear parameters. These are obtained as ordinary log-linear parameters (Agresti, (1990)), but they are computed not from the entire contingency table but from a marginal of it. A marginal log-linear parameter is therefore characterized by two subsets of the variables: one to which we first marginalize, and a subset of this one, to which the parameter applies. For example, for variables $A, B, C, D$, $\lambda^{ABC}_{i**}$ is a marginal log-linear parameter. The marginal which it pertains to is $ABC$, and this is shown in the superscript. Within the $ABC$ marginal, the parameter represents the log-linear effect of category $i$ of variable $A$. Note that the ordinary log-linear parameter of the variable $A$ in category $i$ is $\lambda^{A}_{i}$ which, as a marginal log-linear parameter, is denoted by $\lambda^{ABCD}_{i***}$, as it is computed from the entire $ABCD$ table, not from a marginal of it. In this paper, only marginal log-linear parameters are considered, that is, the superscript always refers to the marginal from which the parameter is computed.

The usual log-linear parameters can be interpreted (Bishop, Fienberg, Holland, (1975)) as measuring average conditional association among the variables involved, conditioned on all other variables, with the average taken over all possible categories of the conditioning variables. Every marginal log-linear parameter pertains to a certain marginal and the average conditional association is measured within this marginal. For example, when all the variables are binary,

$$\lambda^{ABC}_{1**} = \frac{1}{4} \sum_{j,k} \log \left( \frac{p_{1jk+}}{p_{2jk+}} \right)^{1/2},$$

where $p_{ijkl}$ is a cell probability, either theoretical or observed or estimated, and "+" is a marginalization operator. That is, $\lambda^{ABC}_{1**}$ is related to the average conditional log odds of category 1 of $A$ versus category 2, conditioned on and averaged over all categories of $B$ and $C$, in the $ABC$ marginal of the $ABCD$ table. Similarly,

$$\lambda^{ABC}_{11*} = \frac{1}{2} \sum_{k} \log \left( \frac{p_{11k+}\, p_{22k+}}{p_{12k+}\, p_{21k+}} \right)^{1/4},$$

that is, the marginal log-linear parameter $\lambda^{ABC}_{ij*}$ of the $ABCD$ table is related to the average conditional log odds ratio between $A$ and $B$, conditioned on and averaged over $C$, after marginalization over $D$. Throughout the paper, positivity of the cell frequencies is assumed.
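The two binary-variable formulas above can be checked numerically. The following is a minimal sketch, assuming a 2×2×2×2 table stored as a dictionary keyed by 0-based cell indices (index 0 stands for category 1); the helper names and the example table are hypothetical and are not taken from the paper.

```python
import math
from itertools import product

def marginalize(p, keep):
    """Sum a table {cell tuple: probability} over all positions not in keep."""
    out = {}
    for cell, value in p.items():
        key = tuple(cell[pos] for pos in keep)
        out[key] = out.get(key, 0.0) + value
    return out

def lambda_A_in_ABC(p):
    """lambda^{ABC}_{1**}: mean over (j, k) of log (p_{1jk+} / p_{2jk+})^{1/2}."""
    m = marginalize(p, keep=(0, 1, 2))  # ABC marginal of the ABCD table
    terms = [0.5 * math.log(m[(0, j, k)] / m[(1, j, k)])
             for j, k in product(range(2), repeat=2)]
    return sum(terms) / 4

def lambda_AB_in_ABC(p):
    """lambda^{ABC}_{11*}: mean over k of a quarter of the conditional log odds ratio."""
    m = marginalize(p, keep=(0, 1, 2))
    terms = [0.25 * math.log(m[(0, 0, k)] * m[(1, 1, k)]
                             / (m[(0, 1, k)] * m[(1, 0, k)]))
             for k in range(2)]
    return sum(terms) / 2

# Hypothetical strictly positive table, normalized to probabilities.
weights = {cell: 1.0 + cell[0] + 2 * cell[1] * cell[2] + cell[3]
           for cell in product(range(2), repeat=4)}
total = sum(weights.values())
p = {cell: w / total for cell, w in weights.items()}

# Under the uniform distribution every log-linear effect vanishes.
uniform = {cell: 1 / 16 for cell in product(range(2), repeat=4)}

print(lambda_A_in_ABC(p), lambda_AB_in_ABC(p))
print(lambda_A_in_ABC(uniform), lambda_AB_in_ABC(uniform))
```

Because both parameters are built from ratios of marginal probabilities, the same functions could be applied to unnormalized frequencies.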

As a common way to refer to the various possibilities, the subset in the superscript of a marginal log-linear parameter will be called the marginal and the variables whose indices appear in the subscript will be called the effect which is measured by the parameter.

The marginal parameters defined here include the ordinary log-linear parameters (those parameters which have the set of all variables as the relevant marginal), the multivariate logistic parameters of Glonek, McCullagh (1995) (those parameters for which the effect variables coincide with the marginal variables) and a mixture of these considered by Glonek (1996).

The marginal log-linear parameters can be used in several different ways to parameterize the distribution on a contingency table. The parameter selection can be done in two steps. The substantive problem at hand determines which marginals of the contingency table are of interest. In the first step, arrange these marginals in a hierarchical ordering, i.e. in such a way that no marginal contains one which comes later in the sequence. It is easy to see that such an ordering always exists. If a certain rule is followed when selecting the subsets that are effects within the marginals then, as will be seen later, the resulting parameters will have desirable properties. This rule says that for every marginal, only such subsets of it should be included as an effect that are not subsets of any of the previous marginals. Marginal log-linear parameters defined by this rule are called hierarchical.

For example, if for three variables $A, B, C$ the marginals of interest are $AB$ and $BC$, a hierarchical ordering is

$$AB, \; BC.$$

Then, in the second step, for the $AB$ marginal, the marginal log-linear parameters may pertain to the effects $\emptyset, A, B, AB$, and for the $BC$ marginal, the effects can be $C$ and $BC$. Thus, one possible set of hierarchical marginal log-linear parameters implied by the above ordering of the relevant marginals contains the following parameters:

$$\lambda^{AB}_{**}, \; \lambda^{AB}_{i*}, \; \lambda^{AB}_{ij}, \; \lambda^{BC}_{jk}.$$

The following are also hierarchical marginal log-linear parameters, based on the same ordering:

$$\lambda^{AB}_{**}, \; \lambda^{AB}_{i*}, \; \lambda^{AB}_{*j}, \; \lambda^{BC}_{*k}, \; \lambda^{BC}_{jk}.$$

Notice, however, that the above parameters are not complete, i.e. they do not constitute a parameterization (see below). If the other hierarchical ordering of the relevant marginals,

$$BC, \; AB,$$

is selected, the resulting parameters may be as follows:

$$\lambda^{BC}_{**}, \; \lambda^{BC}_{j*}, \; \lambda^{BC}_{*k}, \; \lambda^{BC}_{jk}, \; \lambda^{AB}_{i*}, \; \lambda^{AB}_{ij}.$$

As is seen above, the possible choices of the hierarchical marginal log-linear parameters depend on the ordering of the relevant marginals. For example, $\lambda^{BC}_{j*}$ is allowed in the second ordering but not in the first one. Note that hierarchy of these parameters refers to a property of the ordering of the marginals that determines which parameters are allowed, not to the choice of the effects within the marginals (as is the case with classical hierarchical log-linear models).

A set of hierarchical marginal log-linear parameters can be completed to a parameterization of the distribution on the contingency table. To do so, the list of marginals has to be completed by adding the entire set of variables as the last one in the hierarchical ordering, and every effect not yet present has to be included as a new parameter within the first marginal where this is possible (Bergsma, Rudas, (2002a)). For example, the second set of parameters above can be completed as

$$\lambda^{AB}_{**}, \; \lambda^{AB}_{i*}, \; \lambda^{AB}_{*j}, \; \lambda^{AB}_{ij}, \; \lambda^{BC}_{*k}, \; \lambda^{BC}_{jk}, \; \lambda^{ABC}_{i*k}, \; \lambda^{ABC}_{ijk}.$$
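The completion rule described above (every effect is attached to the first marginal in the ordering that contains it) can be sketched in a few lines. This is only an illustration under the assumption that variables are single letters and marginals are given as strings; the function names are hypothetical.

```python
from itertools import combinations

def subsets(variables):
    """All subsets of a marginal, including the empty set, as frozensets."""
    items = sorted(variables)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def complete_parameterization(marginals, all_variables):
    """Map each effect to the first marginal of the ordering that contains it."""
    ordering = [frozenset(m) for m in marginals]
    full = frozenset(all_variables)
    if full not in ordering:
        ordering.append(full)  # the full table closes the ordering
    # Hierarchical ordering: no marginal contains one that comes later.
    for pos, marginal in enumerate(ordering):
        if any(later.issubset(marginal) for later in ordering[pos + 1:]):
            raise ValueError("ordering is not hierarchical")
    allocation = {}
    for marginal in ordering:
        for effect in subsets(marginal):
            allocation.setdefault(effect, marginal)  # keep the first marginal only
    return allocation

allocation = complete_parameterization(["AB", "BC"], "ABC")
for effect, marginal in allocation.items():
    print("".join(sorted(effect)) or "(empty)", "->", "".join(sorted(marginal)))
```

For the ordering $AB$, $BC$ this attaches the effects $\emptyset, A, B, AB$ to $AB$, the effects $C, BC$ to $BC$, and $AC, ABC$ to the full table, which corresponds to the eight parameters of the completed list in the text.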

Note that, for simplicity, parameterization refers here to the parameterization of a frequency (rather than probability) distribution and for every parameter includes all linearly independent choices of the indices, i.e. for binary variables every parameter in the above list refers to one value. To obtain a parameterization of a probability distribution, the main effect (i.e., $\lambda^{AB}_{**}$ above) has to be omitted.

The ordinary log-linear parameterization and the one based on the multivariate logistic transform (Glonek, McCullagh, (1995)) are both hierarchical marginal log-linear parameterizations.

Now certain desirable properties of marginal log-linear parameters and of the statistical models derived from them will be studied. Note that these properties extend to the hierarchical marginal log-linear parameterization if parameters with these properties are completed by the above procedure to yield a parameterization.

The properties studied will be smoothness of parameters, variation independence, existence of log-linear or log-affine marginal models defined by restricting a set of parameters and the applicability of standard large sample theory to these models. Proofs of the results to follow can be found in Bergsma, Rudas (2002a).

Smoothness of the parameters considered essentially means a one-to-one and differentiable correspondence between the vector of parameters and the vector of cell probabilities. Smoothness is important in the interpretation of the parameters and in studying the dimension of a model which, in turn, is crucial for testing the fit of the model.

Theorem 1. Hierarchical marginal log-linear parameters are smooth for strictly positive frequency distributions on the contingency table.

The above result establishes only a sufficient condition of smoothness of marginal log-linear parameters, but it can be shown that if the same effect appears among marginal log-linear parameters within different marginals, then these parameters cannot be smooth.

The next property to consider is variation independence of the parameters. Variation independence means that the joint range of the parameters is the Cartesian product of the separate ranges of the parameters involved. Lack of variation independence may lead to the definition of non-existing (empty) models and makes the separate interpretation of the parameters misleading. To illustrate the importance of variation independence of the parameters, consider the following marginal log-linear parameters and their prescribed values for three binary variables:

λA

∗=log 8,λ

1=0,λ

1=0,

$$\lambda^{AB}_{11} = (1/4)\log 9, \quad \lambda^{AC}_{11} = (1/4)\log 9, \quad \lambda^{BC}_{11} = (1/4)\log(1/9).$$

The above prescribed values are all within the ranges of the respective parameters. In spite of this, these values are not within the combined range of the parameters, that is, no distribution exists with these parameters. To see this, notice that the prescriptions imply that there are 8 observations, the one-way marginals are uniform $(4, 4)$ and the first two two-way marginals have an odds ratio equal to 9, while the third two-way marginal has an odds ratio equal to 1/9. This completely specifies the two-way marginals of the table, but they are not compatible: there is no (non-negative) three-way table with these two-way marginals. This can be seen either by establishing a contradiction implied by the assumptions or by considering the correlation matrix and establishing that it is not positive definite. The parameters involved in this case are not variation independent and the definition above specifies a non-existing distribution or, in other words, the prescriptions define an empty model. Note that for this example of potential contradiction, neither the specification of the value of $\lambda^{A}_{*}$ nor the multipliers of $(1/4)$ were necessary.
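A compatibility check of this kind can be sketched directly for a 2×2×2 table: once the three two-way marginals are fixed, every cell is determined by the single free value $n_{111}$, so it suffices to see whether any value of $n_{111}$ keeps all eight cells non-negative. The sketch below is an illustration only; the function name is hypothetical, and the marginals used (8 observations, uniform one-way marginals, pairwise odds ratios 9, 9 and 1/9) are assumed illustrative values.

```python
def feasible_n111_interval(ab, ac, bc):
    """Range of n_111 giving a non-negative 2x2x2 table with the given margins.

    ab[i][j], ac[i][k], bc[j][k] are 2x2 marginal tables (index 0 = category 1),
    assumed to agree on their one-way margins. With a = n_111, four cells have
    the form (constant - a) and four have the form (constant + a).
    """
    total = sum(map(sum, ab))
    row_a = ab[0][0] + ab[0][1]  # n_{1++}
    row_b = ab[0][0] + ab[1][0]  # n_{+1+}
    row_c = ac[0][0] + ac[1][0]  # n_{++1}
    # Cells n_112, n_121, n_211, n_222 decrease in a: upper bounds on a.
    upper = min(ab[0][0], ac[0][0], bc[0][0],
                total - (row_a + row_b + row_c) + ab[0][0] + ac[0][0] + bc[0][0])
    # Cells n_111, n_122, n_212, n_221 increase in a: lower bounds on a.
    lower = max(0,
                ab[0][0] + ac[0][0] - row_a,
                ab[0][0] + bc[0][0] - row_b,
                ac[0][0] + bc[0][0] - row_c)
    return lower, upper

# Uniform one-way margins (4, 4) with 8 observations.
odds_ratio_9 = [[3, 1], [1, 3]]
odds_ratio_1_9 = [[1, 3], [3, 1]]

low, high = feasible_n111_interval(odds_ratio_9, odds_ratio_9, odds_ratio_1_9)
print("odds ratios 9, 9, 1/9:", "empty" if low > high else (low, high))

low, high = feasible_n111_interval(odds_ratio_9, odds_ratio_9, odds_ratio_9)
print("odds ratios 9, 9, 9:", "empty" if low > high else (low, high))
```

An empty interval means that no non-negative three-way table has the prescribed two-way marginals; a non-empty interval lists the admissible values of $n_{111}$.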

To see how the lack of variation independence makes the separate interpretation of the parameters invalid, consider a simple 2×2 treatment by outcome experiment in two groups, say, men and women. Suppose the following data are observed:

Men
Outcome   Treatment   Control
good      10          5
bad       40          45

Women
Outcome   Treatment   Control
good      30          20
bad       20          30

If the measure of the effect of the treatment is the difference in the proportion of positive outcomes among treated and among control subjects, then this measure is .1 for men and .2 for women. Is then the treatment twice as effective for women as for men, as the numerical values suggest? The answer is, of course, not necessarily, because, given the marginals, the maximum value of this measure is .3 for men and 1 for women. That is, the treatment is one third as useful for men and only one fifth as useful for women as it could be. The measure of treatment effect used here is not variation independent from the marginal distributions; therefore, it lacks calibration and cannot be interpreted without paying attention to the other parameters.
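The arithmetic behind this comparison can be reproduced with a short sketch. The function names are hypothetical; the maximum is obtained by placing as many good outcomes as the fixed marginals allow in the treatment column.

```python
def risk_difference(table):
    """table = [[good_treated, good_control], [bad_treated, bad_control]]."""
    treated = table[0][0] + table[1][0]
    control = table[0][1] + table[1][1]
    return table[0][0] / treated - table[0][1] / control

def max_risk_difference(table):
    """Largest difference attainable with the observed row and column totals."""
    treated = table[0][0] + table[1][0]
    control = table[0][1] + table[1][1]
    good = table[0][0] + table[0][1]
    good_treated = min(good, treated)   # as many good outcomes as possible under treatment
    good_control = good - good_treated  # the remainder falls in the control column
    return good_treated / treated - good_control / control

men = [[10, 5], [40, 45]]
women = [[30, 20], [20, 30]]

for label, table in (("men", men), ("women", women)):
    observed = risk_difference(table)
    attainable = max_risk_difference(table)
    print(label, round(observed, 3), round(attainable, 3),
          round(observed / attainable, 3))
```

The last printed column is the observed difference as a fraction of its attainable maximum, which is the calibrated comparison made in the text.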

To assure variation independence of hierarchical marginal log-linear parameters, a generalization of the classical decomposability concept is needed. An ordering of a class of incomparable marginals is decomposable (Haberman, (1974), Lauritzen, Speed, Vijayan, (1984)) if it consists of two subsets only or every subset has the property that its intersection with the union of the previous subsets is equal to its intersection with one of the previous subsets. A hierarchical ordering of subsets consisting of $t$ marginals is ordered decomposable if for every $3 \le u \le t$, the maximal ones from among the first $u$ subsets have a decomposable ordering.

For example, $AB, BC, ABC$ is ordered decomposable but $AB, BC, AC, ABC$ is not. Ordered decomposability in fact does depend on the ordering. For variables $A, B, C, D$, the ordering $AB, BC, ABC, ACD$ is ordered decomposable but the ordering $AB, BC, ACD, ABC$ is not.
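The definition can be checked mechanically on small examples. The following sketch assumes one reading of the definition, namely that the maximal subsets "have a decomposable ordering" when at least one of their orderings satisfies the intersection property; it searches all permutations, so it is only suitable for a handful of marginals. All function names are hypothetical.

```python
from itertools import permutations

def has_running_intersection(sets):
    """Check the intersection property for one given ordering of the subsets."""
    for k in range(2, len(sets)):
        seen = frozenset().union(*sets[:k])
        overlap = sets[k] & seen
        if not any(overlap == sets[k] & earlier for earlier in sets[:k]):
            return False
    return True

def is_decomposable(sets):
    """True if some ordering of the subsets has the intersection property."""
    if len(sets) <= 2:
        return True
    return any(has_running_intersection(list(order))
               for order in permutations(sets))

def maximal(sets):
    """Subsets that are not strictly contained in another member."""
    return [s for s in set(sets)
            if not any(s != t and s.issubset(t) for t in sets)]

def is_ordered_decomposable(marginals):
    """Apply the decomposability check to every initial segment of length >= 3."""
    ordering = [frozenset(m) for m in marginals]
    return all(is_decomposable(maximal(ordering[:u]))
               for u in range(3, len(ordering) + 1))

for example in (["AB", "BC", "ABC"], ["AB", "BC", "AC", "ABC"],
                ["AB", "BC", "ABC", "ACD"], ["AB", "BC", "ACD", "ABC"]):
    print(example, is_ordered_decomposable(example))
```

The four orderings are the ones discussed in the text.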

Theorem 2. The components of a hierarchical marginal log-linear parameterization are variation independent if and only if the ordering of the marginals involved is ordered decomposable.

Marginal log-linear parameters derived from marginals in a hierarchical and ordered decomposable order will be called hierarchical and ordered decomposable marginal log-linear parameters.

In the sequel, statistical models defined by restrictions on marginal log-linear parameters will be considered. In this context, the parameters pertain to the expectations of cell frequencies under Poisson sampling. A log-linear marginal model is defined by assuming that certain linear combinations of marginal log-linear parameters are equal to zero. Such models are never empty, as they always contain the uniform distribution. A log-affine marginal model is defined by assuming that certain linear combinations of marginal log-linear parameters are equal to given constants. Examples of such models will be considered in the next section.

The existence of log-affine marginal models is, in general, a difficult question. In fact, the example used above to illustrate the importance of variation independence is a log-affine marginal model which is empty, i.e. does not exist. The following result shows that the conclusion suggested by that example is true in general.

Theorem 3. A log-affine marginal model defined by restrictions of variation independent parameters is not empty.

This implies that log-affine marginal models based on ordered decomposable hierarchical marginal log-linear parameters always exist, i.e., include at least one distribution.

The last desirable property we discuss here is the applicability of standard large sample theory. This is, again, not as straightforward as everyday statistical practice may appear to suggest. For example, consider a 2 × 2 × 2 table and the model assuming that $\lambda^{AB}_{11} = 0$, $\lambda^{ABC}_{111} = 0$, $\lambda^{ABC}_{11*} = 0$. The first condition specifies marginal independence of variables $A$ and $B$, and the last two imply that $A$ and $B$ are conditionally independent given $C$. Dawid (1980) showed that the three assumptions imply that either $A$ is independent of both $B$ and $C$ jointly or $B$ is independent of both $A$ and $C$ jointly or both of these hold true. In the latter case, however large the sample is, the likelihood has, with positive probability, local maxima on both branches of the model and the likelihood ratio statistic is, asymptotically, distributed as the minimum of two chi-squared variables rather than having an asymptotic chi-squared distribution.

If, however, the model is based on appropriately selected marginal log-linear parameters, the standard asymptotic theory applies.

Theorem 4. Suppose a non-empty log-affine marginal model is based on smooth parameters. Then, under Poisson or multinomial sampling:

a. The probability that the maximum likelihood estimate $\hat{\pi}$ of the true probability $\pi$ exists and is a stationary point of the likelihood equation tends to 1, as the sample size goes to infinity.

b. The asymptotic distribution of $N^{1/2}(\hat{\pi} - \pi)$ is normal, with zero expectation, where $N$ is the sample size.

c. The likelihood ratio statistic has an asymptotic chi-squared distribution with the number of degrees of freedom being equal to the number of linearly independent restrictions.

If the log-affine marginal model is based on hierarchical ordered decomposable marginal log-linear parameters, then the above asymptotic results hold true. In other situations, standard asymptotic theory may or may not apply, depending on the true population parameters.

3. Applications

Once in possession of the theoretical results outlined above, one finds several important statistical problems where marginal log-linear or log-affine models may be applied. In this section, some of these situations are reviewed. Here, we formulate the relevant marginal models and investigate their properties using the results given in the previous section. Issues related to estimation of these models will be considered in the next section.


3.1. Repeated measurements

One of the most widely used experimental designs in the medical and behavioral sciences to measure the effect of a treatment is to observe the same individuals before and after it. In this design, the variables observed before and after the treatment are related because they are observed on the same individuals. If the variables are categorical, the observations before and after the treatment should be considered as marginals of the same contingency table (Hagenaars, (1990)). We now outline some potentially useful repeated measurements models and use the theory of the previous section to show that these models are well-behaved.

If the same characteristic is measured before (variable $A$) and after (variable $D$) treatment and the hypothesis is that the distributions before and after treatment are the same (no effect of the treatment), the statistical model in the $A \times D$ table is defined by

$$\lambda^{A}_{i} = \lambda^{D}_{i}, \quad \text{for all } i.$$

This is a log-linear marginal model and the marginals involved have a hierarchical and ordered decomposable ordering, e.g., $A, D$. It immediately follows that the model exists and that standard large sample theory applies.

In fact, this is the model of marginal homogeneity.
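As an illustration of the restriction, the one-way marginal log-linear parameters of a before-by-after table can be computed as centered logarithms of the marginal totals and then compared. This sketch assumes the usual effect-coding convention (parameters summing to zero within a marginal); the function names and the example table are hypothetical.

```python
import math

def one_way_parameters(margin):
    """Effect-coded one-way log-linear parameters of a single marginal."""
    logs = [math.log(x) for x in margin]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]

def margins(table):
    """Row totals (before, A) and column totals (after, D) of a square table."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    return rows, cols

# Hypothetical table: rows index A (before), columns index D (after).
# It is not symmetric, yet its row and column totals coincide.
table = [[20, 10, 5],
         [5, 30, 10],
         [10, 5, 25]]

rows, cols = margins(table)
lambda_A = one_way_parameters(rows)
lambda_D = one_way_parameters(cols)
print(lambda_A)
print(lambda_D)
```

Equality of the two printed lists is the restriction $\lambda^{A}_{i} = \lambda^{D}_{i}$ evaluated on observed frequencies; fitting the model by maximum likelihood is a separate step that this sketch does not attempt.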

When two variables are measured before ($A$, $B$) and also two after ($D$, $E$) the treatment, an interesting model assumes that $A$ and $B$ are independent and $D$ and $E$ are independent. Note that this model may be meaningful whether the same or different characteristics are measured before and after treatment. This model assumes that

$$\lambda^{AB}_{ij} = 0, \quad \lambda^{DE}_{lm} = 0, \quad \text{for all } i, j \text{ and } l, m.$$

Here, the marginals involved are $AB$, $DE$ and they are ordered decomposable. The related hierarchical marginal log-linear parameters, which may be arbitrarily restricted, are $\lambda^{AB}_{**}$, $\lambda^{AB}_{i*}$, $\lambda^{AB}_{*j}$, $\lambda^{AB}_{ij}$, $\lambda^{DE}_{l*}$, $\lambda^{DE}_{*m}$, $\lambda^{DE}_{lm}$. Therefore, the model is a marginal log-linear model based on hierarchical and ordered decomposable parameters, and consequently standard large sample theory applies to this model.

A log-affine marginal model based on the same parameters which is relevant here is the one assuming that the ratio of the marginal odds ratios, as measures of association, between $D$ and $E$ and between $A$ and $B$ is equal to a specified constant or, equivalently, that the difference of association, as measured by log-linear parameters, is equal to a specified value. This leads to the model specified by $\lambda^{AB}_{11} - \lambda^{DE}_{11} = c$. This model exists for all $c$ and standard large sample theory applies.

If there are several variables measured before and after the treatment, arbitrary linear or affine assumptions about some of the marginal log-linear parameters pertaining to effects within the marginal of the before-treatment variables and about some of those pertaining to effects within the marginal of after-treatment variables will obey standard asymptotic theory, if the model is not empty. The latter condition holds in the linear case and in the affine case it holds if the effects for both marginals are decomposable.

There are, however, more general cases covered by the available theory. Restrictions considering the association between before and after treatment variables can also be included. For example, if there are three variables $A$, $B$ and $C$ measured before and three variables $D$, $E$ and $F$ measured after the treatment, the model with the following restrictions

$$\lambda^{ABC}_{ijk} = 0, \quad \lambda^{DEF}_{lmn} = 0, \quad \lambda^{ABCD}_{i**l} = 0, \quad \lambda^{ABCD}_{ij*l} = 0, \quad \lambda^{ABCD}_{i*kl} = 0, \quad \lambda^{ABCD}_{ijkl} = 0,$$

for all $i, j, k, l, m$ and $n$

means that there is no second order association among the before-treatment variables and among the after-treatment variables and $A$ and $D$ are conditionally independent, given the other before treatment variables. This model can be obtained by restricting the parameters derived from the ordering of marginals $ABC, ABCD, DEF$. Since this ordering is hierarchical and ordered decomposable, the model exists and standard asymptotic theory applies.

All this extends in a natural way to designs where several measurements are taken over the same individuals, either with certain treatments applied between the measurements or time passing by between the measurements.

A further application of marginal models is needed when measurements on the same characteristic are taken repeatedly to reduce the effect of measurement error. For example, the blood pressure of a person may be measured on three consecutive days, different tests of the same mental ability may be administered to the same person, or different questions measuring the same attitude may be included in a questionnaire. Suppose $A_1$ and $A_2$ are the first and second measurements of the same characteristic, and $B_1$ and $B_2$ are the first and second measurements of another characteristic. Then, $A_1$ and $A_2$ should be essentially (disregarding error related to imprecise measurement or to temporary fluctuations of the quantity measured) identical, just like $B_1$ and $B_2$. In addition to certain marginal homogeneity restrictions, this would also imply that the association between $A_i$ and $B_j$ is the same for every combination of $i, j = 1, 2$. This leads to the following model:

$$\lambda^{A_1}_{i} = \lambda^{A_2}_{i}, \quad \lambda^{B_1}_{j} = \lambda^{B_2}_{j}, \quad \lambda^{A_1A_2}_{i_1i_2} = r_{i_1i_2}(A), \quad \lambda^{B_1B_2}_{j_1j_2} = r_{j_1j_2}(B),$$

$$\lambda^{A_1B_1}_{ij} = \lambda^{A_1B_2}_{ij} = \lambda^{A_2B_1}_{ij} = \lambda^{A_2B_2}_{ij},$$

for all possible values of the indices, where $r_{i_1i_2}(A)$ and $r_{j_1j_2}(B)$ represent the strength of association between the first and second measurements of the characteristics $A$ and $B$, respectively, and should be selected based on the magnitude and distribution of error which is usually present, or acceptable, when those measurements are performed. For this model, standard asymptotic theory holds.

3.2. Panel studies and Markov chains

In this setup, again, the same individuals are observed several times on the same variables. However, interest lies not so much in whether the distributions at the different time points are identical or different; rather, the pattern of change is of interest. A frequently investigated hypothesis is that of a $k$-th order Markov chain, that is, the conditional distribution of the variables measured at time point $t$ depends only on the positions at the $k$ preceding time points. This is, of course, a log-linear model but the related hypotheses discussed below are of the marginal type.

The estimation of the transition probabilities is often among the goals of the analysis of panel data. Parallel to the Markov hypothesis, one may be interested in modeling whether or not the distributions in the previous waves of the panel influence the association between the distributions in the last two waves. The pattern of association between the distributions in the $t$-th and $(t-1)$-st time points can be captured by the conditional odds ratios or log-linear parameters of the joint distribution of the variables measured at these two time points. If this only depends on the distribution at the $(t-k-1)$-st, ..., $(t-2)$-nd time points, the process generating the data has a $k$-th order memory.

Therefore, to test the hypothesis of a first order memory against that of a second order memory, one needs at least four waves of the panel. In this case, the hypothesis that one has a first order memory, given that the memory is of second order (saturated in the present case) can be formulated as

$$\lambda^{A_1A_2A_3A_4}_{**i_3i_4} = \lambda^{A_2A_3A_4}_{*i_3i_4}, \quad \lambda^{A_1A_2A_3A_4}_{*i_2i_3i_4} = \lambda^{A_2A_3A_4}_{i_2i_3i_4}, \quad \lambda^{A_1A_2A_3A_4}_{i_1i_2i_3i_4} = 0,$$

where $A_i$ denotes the variable(s) measured at the $i$-th time point. This model asserts that the association between $A_3$ and $A_4$ depends only on $A_2$ and not on $A_1$. The association is measured by the appropriate marginal log-linear parameters (or, equivalently, by the appropriate marginal conditional odds ratios). The conditions imply that the marginal log-linear parameters (or, equivalently, the marginal conditional odds ratios) are the same if conditioned on $A_2$ only or on both $A_2$ and $A_1$. This is a collapsibility condition (Whittemore, (1980)) and is also a marginal log-linear model. The marginal log-linear parameters in it are not contained in any hierarchical marginal log-linear parameterization because, for example, the $\{A_3, A_4\}$ effect appears in two marginals. Therefore, the statistical properties of this model cannot be obtained from the results of this paper. In fact, it can be shown (Bergsma, Rudas, (2002a)) that the above parameters cannot be parts of a smooth marginal log-linear parameterization and the Jacobian of any marginal log-linear parameterization containing the above parameters is singular at the uniform distribution. Note that the same applies to any similar collapsibility restriction (see also Davis, (1989)).

If the process is known to have a, say, one step memory, testing stationarity (with respect to conditional association between neighbors) requires fitting the following model:

$$\lambda^{A_1A_2A_3}_{*jk} = \lambda^{A_2A_3A_4}_{*jk}, \quad \lambda^{A_1A_2A_3}_{jkl} = \lambda^{A_2A_3A_4}_{jkl},$$

for every $j$, $k$ and $l$, if four waves are available. The model says that the conditional association between $A_2$ and $A_3$ when $A_1$ is given is the same as that between $A_3$ and $A_4$ when $A_2$ is given. This is a marginal log-linear model, hierarchical, ordered decomposable and standard large sample theory applies. The related log-affine marginal model, in which the above marginal log-linear parameters have prescribed values (for example, as in small area estimation), also exists and has standard large sample behavior.

3.3. Incomplete data

There are various statistical problems requiring the analysis of an incomplete set of data. Incomplete data may arise unintentionally or intentionally, in surveys or in censuses, in data collection or in secondary analysis problems.

The most common source of unintentionally missing data due to data collection is that some of the respondents in a survey or census fail to respond to certain questions in a questionnaire (item nonresponse), or to the entire questionnaire (unit nonresponse). The problem of coverage error (parts of the population being omitted from the sample frame) leads to incomplete data similar to unit nonresponse. The information collected is intentionally incomplete when, to reduce the burden of the respondents with respect to time and invasion of privacy, a long questionnaire is split into shorter overlapping parts and every respondent is only asked questions in one of the parts. Such split designs may be applied both in surveys and censuses. In secondary data analysis it may happen that no available data set contains all the necessary information and the researcher has to rely on several previously collected sets of data. This leads to a problem similar to analyzing data arising as a result of a split design, with the additional problem that the separate sampling procedures behind the separate sets of data make even the existence of a common underlying population distribution questionable.

When the data are categorical, the available information, in all the above

cases, can be considered as being marginal distributions of a higher dimen-

sional contingency table (the one that would contain all variables of interest).

Depending on the actual circumstances, the distribution on the entire table (the

complete data) would apply to the entire population or to a sample from it

or may not exist at all. The ﬁrst step of the analysis in all these cases is to

ﬁnd out whether such a joint distribution exists and if several such distribu-

tions exist, select one according to some optimality criterion. Depending on

the circumstances, such a procedure may be called an extension of measures,

estimation or data fusion.

Notice that this scheme also covers model-based estimation of the joint

distribution, when the sufﬁcient statistics are certain marginal distributions,

like, e.g., with log-linear models. Here, the information is not incomplete in

the sense that the entire table may have been observed but only certain aspects

(the sufﬁcient statistics) are relevant for further analysis.

If the distributions on a set of marginals that are incomparable with respect to inclusion are given and they are weakly compatible, that is, they coincide on the intersections of the marginals, decomposability implies that there always exists an extension (in fact, usually infinitely many). If the system is not decomposable, it depends on the actual marginals whether or not weak compatibility implies strong compatibility (Darroch, Lauritzen, Speed, (1980), Kellerer, (1964)). This classical theory, however, does not cover cases when information with respect to a more complex system of marginals is available, and this is where the results of this paper are relevant. If some of the marginals for which observations are available are contained in each other, classical decomposability and the extension procedure based on it are of no help. This is the case, among others, in the common missing data situation when there are respondents who actually did respond to all questions, implying that observations are available not only for some marginals but also for the entire table. Note that our approach here to handling missing data problems is fundamentally different from the standard approach based on imputation techniques (Little, Rubin, (1987)).
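For the classical decomposable case, weak compatibility and one possible extension can be sketched in a few lines. The example assumes two hypothetical 2×2 tables for the AB and AC marginals with a strictly positive A marginal; the function names are illustrative. The extension built here is the one in which B and C are conditionally independent given A, which is only one of the usually infinitely many extensions.

```python
import numpy as np

def weakly_compatible(p_ab, p_ac, tol=1e-10):
    """True if the AB and AC tables agree on their intersection, the A marginal."""
    return bool(np.allclose(p_ab.sum(axis=1), p_ac.sum(axis=1), atol=tol))

def decomposable_extension(p_ab, p_ac):
    """One joint ABC table with the given AB and AC marginals (B, C independent given A)."""
    p_a = p_ab.sum(axis=1)
    return p_ab[:, :, None] * p_ac[:, None, :] / p_a[:, None, None]

# Hypothetical marginal distributions; both have A marginal (0.4, 0.6).
p_ab = np.array([[0.10, 0.30], [0.20, 0.40]])
p_ac = np.array([[0.25, 0.15], [0.35, 0.25]])
# A table with a different A marginal, (0.5, 0.5), for contrast.
p_ac_other = np.array([[0.30, 0.20], [0.30, 0.20]])

joint = decomposable_extension(p_ab, p_ac)
print(weakly_compatible(p_ab, p_ac), weakly_compatible(p_ab, p_ac_other))
```

Summing `joint` over its last axis returns `p_ab`, and summing over its middle axis returns `p_ac`, so both observed marginals are reproduced.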

In this more general case, the following procedure may be applied. Con-

sider all marginals for which information is available and their intersections.

Order these hierarchically and construct the hierarchical marginal log-linear pa-

rameterization. Determine the values of those parameters, for which this is

possible using the given information. If for a certain marginal different sources

of information are available, for example both A and AB are observed, consider a pooled estimate for the distribution on A. Set those marginal log-linear

parameters for which no information is available to arbitrary values, for exam-

ple to zero. Then, as described in Bergsma, Rudas (2002a), a generalization

of the iterative proportional scaling algorithm can be used to reconstruct the

entire distribution.

To illustrate the procedure, assume that for a three-way table, observations are available for the AB, AC and ABC marginals. Adding intersections and putting the marginals in a hierarchical order yields A, B, AB, AC, ABC.

The distribution on the A marginal is estimated by pooling data from all three original distributions; the B marginal is estimated by pooling data from the AB and ABC marginals. The odds ratios (Rudas, (1998a)) in the AB marginal are estimated by pooling the original AB and ABC data sets. That is, estimates for the one-way marginals and the odds ratios of the AB marginal table or, equivalently, estimates of the marginal log-linear parameters $\lambda^{A}_{i}$, $\lambda^{B}_{j}$ and $\lambda^{AB}_{ij}$ are obtained, and combining these by the iterative scaling procedure yields our estimate for the AB marginal distribution.
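The first step of this illustration can be sketched as follows. The counts are hypothetical, pooling is done by simply adding counts (which assumes the three sources sample the same population in the same way), and `fit_margins` is an illustrative name for ordinary iterative proportional scaling of a two-way table. The scaling changes the margins but leaves the odds ratio of the starting table untouched.

```python
import numpy as np

def fit_margins(start, row_target, col_target, tol=1e-12, max_iter=1000):
    """Iterative proportional scaling: keep the odds ratios of `start`, match both margins."""
    table = start.astype(float).copy()
    for _ in range(max_iter):
        table *= (row_target / table.sum(axis=1))[:, None]
        table *= (col_target / table.sum(axis=0))[None, :]
        if np.max(np.abs(table.sum(axis=1) - row_target)) < tol:
            break
    return table

# Hypothetical counts from three separate sources.
n_ab = np.array([[20.0, 30.0], [25.0, 25.0]])            # AB sample
n_ac = np.array([[40.0, 20.0], [10.0, 30.0]])            # AC sample
n_abc = np.array([[[5.0, 10.0], [10.0, 5.0]],
                  [[8.0, 12.0], [6.0, 4.0]]])            # ABC sample, axes (A, B, C)

# A is observed in all three sources, B only in the AB and ABC sources.
a_counts = n_ab.sum(axis=1) + n_ac.sum(axis=1) + n_abc.sum(axis=(1, 2))
b_counts = n_ab.sum(axis=0) + n_abc.sum(axis=(0, 2))
p_a = a_counts / a_counts.sum()
p_b = b_counts / b_counts.sum()

# The AB odds ratio is carried by the pooled AB counts.
ab_pooled = n_ab + n_abc.sum(axis=2)
p_ab = fit_margins(ab_pooled, p_a, p_b)

def odds_ratio(t):
    return t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0])

print(p_ab, odds_ratio(p_ab), odds_ratio(ab_pooled))
```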

Next, the distribution of the AC marginal is obtained by taking into account the already estimated A marginal distribution and the conditional distribution of C given A, which is obtained by pooling data from the original AC and ABC marginals (yielding the $\lambda^{AC}_{*k}$ and $\lambda^{AC}_{ik}$ marginal log-linear parameters).

Finally, to estimate the ABC marginal, the already estimated AB and AC marginal distributions are combined with the conditional distribution of B, given A and C, which is taken from the original ABC marginal (i.e. the $\lambda^{ABC}_{*jk}$, $\lambda^{ABC}_{ijk}$ marginal log-linear parameters are estimated).

The procedure may not yield a joint distribution but if the marginals (in-

cluding intersections) have an ordered decomposable ordering, just like in the

present example, there will always be a common extension to the marginals.

In the following example, a certain part of the information available needs

to be discarded. Suppose that one is interested in reconstructing or estimating

the joint distribution of variables A,B,C. The AB marginal was observed in

a simple random sample, and the AC marginal in a sample which was stratiﬁed

according to A. But the stratiﬁcation in the latter data collection procedure was

based on information which may not be reliable, for example outdated census

data.

In this situation, one would use the AB sample to estimate the joint distribution of these two variables (disregarding the A marginal in AC) and the AC sample to estimate the C marginal and the interaction between A and C. That is, the information with respect to the distribution of A is taken entirely from the AB sample. Therefore, estimates are available for the following marginal log-linear parameters:

$$\lambda^{AB}_{i*},\ \lambda^{AB}_{*j},\ \lambda^{AB}_{ij},\ \lambda^{AC}_{*k},\ \lambda^{AC}_{ik}.$$

Because these marginal log-linear parameters are ordered decomposable, there always exists a three dimensional distribution with these parameters.

Note, however, that if, additionally, observations are also available on the joint distribution of BC, no component of that, not even the association between B and C, is guaranteed to be strongly compatible (i.e., yielding a joint distribution) with the information obtained from the first two samples, because ordered decomposability is lost.

As outlined above, the applicability of the theory of marginal models to

missing data problems comes from the fact that the different observed data

patterns can be considered as information pertaining to various marginals of

the joint distribution. If, for example, three variables, A, B, C, are observed, but for some respondents the observation on A is missing and for the remaining ones the observation on B is missing, the observed data patterns are BC and

AC. This problem, quite differently from the standard approach (Little, Rubin,

1987), can be viewed from the point of view of data fusion: one may try to

reconstruct the joint distribution with these marginals.

In the present example, the marginal log-linear parameters which cannot be estimated from the data pertain to the AB and ABC effects. Therefore, to be able to estimate the joint distribution, in order to have hierarchy, either $\lambda^{AB}_{ij}$ or $\lambda^{ABC}_{ij*}$, and also $\lambda^{ABC}_{ijk}$, need to be given certain values. While the most straightforward assumption is that these marginal log-linear parameters are equal to zero, this assumption will have different implications depending on the choice of parameters to which it is applied. If the parameters selected are $\lambda^{AB}_{ij}$ and $\lambda^{ABC}_{ijk}$, then assuming they are equal to zero means that A and B are marginally independent, while if the same assumption is applied to $\lambda^{ABC}_{ij*}$ and $\lambda^{ABC}_{ijk}$, then A and B are conditionally independent, given C. If it is only the true response to a question that decides whether or not the response is given, the observed values for A and the observed values for B both are random samples from their respective distributions in the first case. In the second case, this is only true within fixed categories of C.

3.4. Joint treatment of the sampling and statistical models

A statistical model may be viewed as a subset of the possible distributions

and a statistical hypothesis assumes that the true distribution belongs to this

subset. When the model is parametric, it restricts some of the parameters of the

distribution. The restricted parameters need to be estimated from the data in

such a way that the resulting estimates fulﬁll the requirements of the model and

the parameters not restricted by the model are estimated from the data without

further restrictions. A sampling model, on the other hand, assuming ﬁnite

population size, assigns probabilities to the possible samples (subsets of the

population), often in a way that it excludes certain samples from consideration.

In many practical situations this implies specifying certain parameters of the

observed distribution (e.g., as in stratiﬁed sampling some of the marginals are

kept ﬁxed). Then, these parameters should not be estimated from the data, even

if the statistical model, without consideration of the sampling model, would call

for estimating these parameters. Rather, the estimates should only be sought

among distributions fulﬁlling both the model and the sampling restrictions.

Therefore, the resulting model is the intersection of the statistical and of

the sampling model. Considering any model of interest as being the intersection

of other (possibly simpler) models may prove useful both from a conceptual

and from a computational point of view, as it was illustrated for log-linear

models by Rudas (1998b, 2002).

The parameter estimates obtained under the combined restrictions are the

estimates in the statistical model with the sampling model taken into account.

Many of the popular sampling models restrict the values of certain marginal

log-linear parameters and if the statistical model is also a marginal model, the

above combination can be carried out easily and the relationship between the

two models becomes apparent, while with other approaches potential conﬂicts

may not be easy to recognize.

As a first example, consider simple random sampling with fixed sample size $N$. From all possible samples (subsets of the population) only those of size $N$ are considered and no further restriction applies (that is, these samples have equal probabilities). This restriction is equivalent to $\lambda^{\emptyset} = \log N$. When a marginal log-linear or log-affine statistical model is estimated with this sampling scheme, the combined model is a log-affine marginal model. If the statistical model is independence in a two-way table, the joint restrictions are

$$\lambda^{\emptyset} = \log N, \qquad \lambda^{AB}_{ij} = 0 \quad \text{for all } i, j,$$

and this is a log-affine marginal model. It is easy to see that if the statistical model is defined by log-linear or log-affine restrictions on a hierarchical marginal log-linear parameterization and the overall effect $\lambda^{\emptyset}$ does not appear among those restricted by the statistical model, then adding the multinomial constraint $\lambda^{\emptyset} = \log N$ does not affect the properties of the model.

As another example, consider a case-control study, where cases (e.g. pa-

tients to be given a certain treatment) enter the study as a result of a process

not under the control of the experimenter, but the design calls for selecting

one control person for every case by a certain procedure. The status of both

cases and controls is measured before any treatment is applied and after the

treatment was applied. Here, the design speciﬁes that the case-control marginal

is uniform, while the total, the status marginal and the association between

the two variables are unrestricted. That is, if A is the case-control variable, $\lambda^{A}_{i} = 0$.

In the case of stratified sampling, the distribution of a (group of) variable(s) is fixed by sampling design. If the frequency of variable A is fixed, say $N_i > 0$ in category $i$, this is equivalent to $\lambda^{A}_{*} + \lambda^{A}_{i} = \log N_i$ and this restriction should be added to the restrictions imposed by the statistical model.
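The values that such a design prescribes can be computed directly. The sketch below assumes effect (zero-sum) coding, under which the overall term of a one-way marginal is the mean of the log frequencies; the stratum sizes and the function name are hypothetical. It also reproduces the case-control situation, where a uniform marginal gives zero main effects.

```python
import numpy as np

def one_way_parameters(frequencies):
    """Effect-coded parameters of a one-way marginal with the given expected frequencies."""
    log_freq = np.log(np.asarray(frequencies, dtype=float))
    overall = log_freq.mean()
    return overall, log_freq - overall

# Hypothetical stratum sizes N_i fixed by a stratified design.
strata = np.array([200.0, 300.0, 500.0])
overall, main = one_way_parameters(strata)

# Matched case-control design: the case-control marginal is uniform.
cc_overall, cc_main = one_way_parameters([150.0, 150.0])

print(overall + main, np.log(strata))   # the two agree entry by entry
print(cc_main)                          # all zeros
```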

To illustrate the possible conflicts that may arise, assume now that the variables A, B, C are observed and the statistical model of interest prescribes the marginal distribution of C and the marginal odds ratios of AC and of BC. If the available data are obtained from a stratified sample, where stratification prescribed the AB marginal and the stratification was based on reliable information concerning the distribution of the AB marginal (for example, a recent census), then the combination of the sampling and statistical models prescribes the AB marginal distribution and the AC and BC marginal odds ratios, and the further parameters of the distribution are to be estimated. This is a log-affine marginal model and, as the parameters are hierarchical but not ordered decomposable, it may be empty, depending on the actual values of the parameters. But the parameters are smooth and therefore, if the model is nonempty, standard asymptotic theory applies to this model.

The advantage of the marginal modelling approach is that the combinatorial properties of the class of parameters restricted by either one of the models decide the statistical properties of the resulting model.

3.5. Graphical models

Graphical log-linear models (Darroch, Lauritzen, Speed, (1980)) use graphs

to model the association structure of multivariate distributions. The nodes of

the graph are identiﬁed with the variables involved, and two variables not

connected by an edge are assumed to be conditionally independent, given all

other variables. These are conditional independence statements involving all

variables. Models pertaining to the joint distribution of variables are also

associated with directed acyclic graphs (Lauritzen, (1996)). In this case, a

variable is assumed to be conditionally independent from its nondescendants,

given its parents, where nondescendants are those nodes into which no directed

path leads from the variable and parents are the nodes from which arrows point to the variable. Graphical models based on directed acyclic graphs therefore assume conditional independencies which do not involve all variables but, rather, certain marginal distributions pertaining to subsets of the variables.

Consequently, graphical models based on directed acyclic graphs are mar-

ginal log-linear models. For example, consider the directed acyclic graph in

Figure 1.

This graph implies the following conditional independencies:

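$$C \perp\!\!\!\perp BDE \mid A$$

$$D \perp\!\!\!\perp C \mid AB$$

$$E \perp\!\!\!\perp ABCF \mid D$$

$$F \perp\!\!\!\perp ABE \mid CD.$$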

The model defined by the above graph or, equivalently, by the above conditional independencies is a marginal model. It will be illustrated now how the marginal log-linear parameters of this model can be used to parameterize the distributions in it. The parameters involved are associated with the arrows in the graph defining the model and their values can be given intuitively appealing interpretations.

Figure 1. A directed acyclic graph.

The parameterization of graphical log-linear models defined by directed acyclic graphs is based on the factorization of the distributions in such models given in Lauritzen (1996). The factorization involves functions depending only on subsets of the variables which consist of a node and its parents:

$$p(\omega) = \prod_{\alpha \in V} f_\alpha\left(\omega_{\{\alpha\} \cup \mathrm{pa}(\alpha)}\right),$$

where $\omega$ is a cell of the table, $V$ is the set of variables forming the table, $\mathrm{pa}(\alpha)$ is the set of variables that are parents of $\alpha$ and, for $W \subseteq V$, $\omega_W$ is a projection operator.
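The factorization can be made concrete with a small numerical sketch. It assumes binary variables, takes each factor to be a conditional probability table of a node given its parents (one admissible choice of the functions in the product), and uses the parent sets A→B, A→C, (A,B)→D, D→E, (C,D)→F of the example. The random tables and the helper name are hypothetical. The check at the end verifies one of the implied independencies, D independent of C given A and B, in the ABCD marginal.

```python
import numpy as np

rng = np.random.default_rng(1)

def conditional(shape):
    """Random conditional table; the last axis (the child) sums to one."""
    return rng.dirichlet(np.ones(shape[-1]), size=shape[:-1] or None)

f_a = conditional((2,))         # distribution of A
f_b = conditional((2, 2))       # axes (A, B): B given A
f_c = conditional((2, 2))       # axes (A, C): C given A
f_d = conditional((2, 2, 2))    # axes (A, B, D): D given A and B
f_e = conditional((2, 2))       # axes (D, E): E given D
f_f = conditional((2, 2, 2))    # axes (C, D, F): F given C and D

# Product over the nodes of the factors on node-and-parents subsets.
p = np.einsum("a,ab,ac,abd,de,cdf->abcdef", f_a, f_b, f_c, f_d, f_e, f_f)

# D independent of C given AB: p(abcd) p(ab) = p(abc) p(abd).
p_abcd = p.sum(axis=(4, 5))
p_ab = p_abcd.sum(axis=(2, 3))
p_abc = p_abcd.sum(axis=3)
p_abd = p_abcd.sum(axis=2)
lhs = p_abcd * p_ab[:, :, None, None]
rhs = p_abc[:, :, :, None] * p_abd[:, :, None, :]

print(np.isclose(p.sum(), 1.0), np.allclose(lhs, rhs))
```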

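The subsets entering the factorization for the above example are, in a hierarchical order, the following:

A, AB, AC, ABD, DE, CDF,

where a node is always preceded by its parent(s). The hierarchical marginal log-linear parameters are the following:

$$\lambda^{A}_{*},\ \lambda^{A}_{i},\ \lambda^{AB}_{*j},\ \lambda^{AB}_{ij},\ \lambda^{AC}_{*k},\ \lambda^{AC}_{ik},\ \lambda^{ABD}_{**l},\ \lambda^{ABD}_{i*l},$$

$$\lambda^{ABD}_{*jl},\ \lambda^{ABD}_{ijl},\ \lambda^{DE}_{*m},\ \lambda^{DE}_{lm},\ \lambda^{CDF}_{**n},\ \lambda^{CDF}_{*ln},\ \lambda^{CDF}_{k*n},\ \lambda^{CDF}_{kln}.$$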

It is easy to see that the distribution, assuming its positivity, has the desired conditional independence properties if and only if it has a parameterization as above, with all marginal log-linear parameters pertaining to effects not appearing in the list above set to zero. Therefore, the distributions in the graphical model are parameterized by the marginal log-linear parameters pertaining to the nodes-and-their-parent(s) type subsets of the variables.
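Both the conditional independence statements and the node-and-parents subsets can be read off mechanically from the parent sets. The following sketch assumes the parent sets implied by the subsets A, AB, AC, ABD, DE, CDF; the dictionary and function names are illustrative. For each node it returns the nondescendants that are not parents, together with the conditioning set of parents.

```python
# Parent sets read off the subsets A, AB, AC, ABD, DE, CDF (child listed last).
parents = {"A": "", "B": "A", "C": "A", "D": "AB", "E": "D", "F": "CD"}

def descendants(node):
    """All nodes reachable from `node` along directed paths."""
    children = [v for v, pa in parents.items() if node in pa]
    found = set(children)
    for child in children:
        found |= descendants(child)
    return found

def local_markov(node):
    """(nondescendants other than parents, parents) for the statement about `node`."""
    others = set(parents) - {node} - descendants(node) - set(parents[node])
    return "".join(sorted(others)), parents[node]

statements = {v: local_markov(v) for v in parents}
subsets = ["".join(sorted(pa + v)) for v, pa in parents.items()]

for node, (others, given) in statements.items():
    if others:
        print(f"{node} independent of {others} given {given}")
print(subsets)
```

The statement obtained for B (independence from C given A) does not appear separately in the list given earlier because it is already implied by the statement for C.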

The parameterization presented here consists of parameters with a straightforward interpretation. For example, $\lambda^{A}_{i}$ is the effect of variable A, $\lambda^{AB}_{ij}$ is the effect of A on B, etc. Note that $\lambda^{ABD}_{ijl}$ is a measure of the joint effect of A and B on D, in addition to their separate effects, the existence of which is implied by the presence of a directed triangle containing these variables. Note that this interpretation of the meaning of $\lambda^{ABD}_{ijl}$ is justified because A and B precede D and therefore the association among them may be interpreted as an effect.

The parameters are most easily interpreted when all variables are binary, as

in this case they have a single numerical value. In other cases, the parameters

are vector valued and this reﬂects the way in which effects are measured in

the log-linear tradition.

The approach to parameterize graphical models based on directed acyclic

graphs presented here facilitates associating values with the arrows in the graph

in a meaningful way. Many potential users of graphical modeling may find this useful, and it gives graphical modeling a chance to compete with the popular LISREL (Jöreskog, (1997)) approach that in a similar but different

context provides the user with numbers assigned to the arrows, representing

the strengths of effects. In LISREL, the numbers are regression coefﬁcients

in marginal regression equations but these equations are being deﬁned by the

user without the opportunity to check their consistency or implications. In the

approach outlined here, the numbers are values of marginal log-linear parame-

ters. The models are speciﬁed with respect to the entire joint distribution using

graphs, all the implications can be read off from the graph and by restrict-

ing attention to directed acyclic graphs, contradicting model speciﬁcations are

impossible.

4. Fitting marginal models

The models discussed in this paper can be specified by the constraint

$$h(\mu) = 0, \tag{1}$$

where $\mu = \log m$ is the vector of log expected cell frequencies and

$$h(\mu) = B \log A \exp(\mu) - v \tag{2}$$

for certain fixed matrices $A$ and $B$ and a vector $v$. Under Poisson sampling, the kernel of the unrestricted log-likelihood is given as

$$L_n(\mu) = \sum_i \left( n_i \mu_i - \exp(\mu_i) \right).$$

Notice that when conditioned on the sample size, the same estimates are obtained by maximizing the kernel as for multinomial sampling. Maximum likelihood estimation under a statistical model is a constrained optimization problem. If $\hat\mu$, the maximum likelihood estimate (MLE), exists, and if the matrices $A$ and $B$ possess certain regularity properties, the MLE is a saddle point of the Lagrange function

$$L(\mu, \lambda) = L_n(\mu) - \lambda' h(\mu),$$

where $\lambda$ is a vector of Lagrange multipliers.
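To make the roles of $A$, $B$ and $v$ concrete, the sketch below encodes marginal homogeneity in a 2×2 table, a model chosen here purely for illustration. Cells are assumed to be ordered (1,1), (1,2), (2,1), (2,2): the rows of `A` form the two marginal frequencies to be compared, `B` takes the difference of their logarithms, and `v` is zero. The tables are hypothetical.

```python
import numpy as np

# Marginal frequencies m_{1+} and m_{+1} as linear combinations of the cells.
A = np.array([[1.0, 1.0, 0.0, 0.0],    # m_{1+} = m_11 + m_12
              [1.0, 0.0, 1.0, 0.0]])   # m_{+1} = m_11 + m_21
B = np.array([[1.0, -1.0]])            # log m_{1+} - log m_{+1}
v = np.array([0.0])

def h(mu):
    """Constraint function h(mu) = B log(A exp(mu)) - v."""
    return B @ np.log(A @ np.exp(mu)) - v

def poisson_kernel(mu, n):
    """Kernel of the unrestricted Poisson log-likelihood."""
    return float(np.sum(n * mu - np.exp(mu)))

homogeneous = np.log(np.array([30.0, 15.0, 15.0, 40.0]))   # both margins equal 45
skewed = np.log(np.array([30.0, 25.0, 5.0, 40.0]))         # margins 55 and 35

n = np.array([30.0, 25.0, 5.0, 40.0])                      # hypothetical observed counts
print(h(homogeneous), h(skewed))
print(poisson_kernel(np.log(n), n) > poisson_kernel(np.log(n) + 0.1, n))
```

The last line illustrates that, without constraints, the kernel is maximized at $\mu = \log n$; the constraint is what moves the estimate away from the observed table.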

Aitchison, Silvey (1958) proposed a Fisher scoring method to find the saddle point of $L(\mu, \lambda)$, searching in the product space of the Lagrange multiplier vector and the vector of expected frequencies. A drawback of such an approach is that it does not distinguish between local maxima, local minima or saddle points of the likelihood function subject to the constraints, that is, the algorithm may converge to any stationary point depending on the starting point. Only in certain special cases, for example ordinary log-linear models (Haberman, (1974)), is there only one stationary point, which is the maximum of the likelihood. An improved approach (Fletcher, (1970), Rapcsák, (2000)) is based on a so-called exact penalty function $P_c(\mu)$, which has the MLE $\hat\mu$ as an unconstrained maximum. The function depends on a penalty parameter $c > 0$ which must be taken sufficiently large. The advantage is that standard optimization algorithms can be used to maximize $P_c(\mu)$, which is not possible with the Aitchison-Silvey approach. Furthermore, the search is done in the original parameter space of $\mu$, rather than the product space of the $\lambda$ and $\mu$ parameter spaces, which also simplifies the search.

The function $P_c(\mu)$ is derived from $L(\mu, \lambda)$ by (i) writing the Lagrange multiplier as a function of $\mu$ and (ii) adding a penalty term which penalizes deviations from the model constraint $h(\mu) = 0$. The Lagrange multiplier, as a function of $\mu$, is determined by differentiating $L(\mu, \lambda)$ with respect to $\mu$, equating the result to zero, and solving for $\lambda$. A possible solution for $\lambda$, with the Jacobian of $h(\mu)$ given by

$$H(\mu) = \frac{\partial h(\mu)}{\partial \mu'} = B D^{-1}_{Am} A D_m,$$

is obtained as

$$\lambda(\mu) = (H D^{-1}_m H')^{-1} H D^{-1}_m (n - m),$$

where $H = H(\mu)$ and $D_x$ is the diagonal matrix with the vector $x$ on the main diagonal. A suitable nonnegative penalty term is the quadratic function $h(\mu)' (H D^{-1}_m H')^{-1} h(\mu)$. Because $P_c(\mu)$ is to be maximized, this term enters with a negative sign, and an appropriate exact penalty function has the form

$$P_c(\mu) = L(\mu, \lambda(\mu)) - \tfrac{1}{2}\, c\, h(\mu)' (H D^{-1}_m H')^{-1} h(\mu),$$

where $c$ is some positive constant. Then we have (Rapcsák, (2000)):

Theorem 5. There exists a $c^* > 0$ such that, for every $c > c^*$, $\hat\mu$ is an unconstrained local maximum of $P_c(\mu)$.

Thus, standard optimization algorithms can be used to find $\hat\mu$. However, a large enough value of $c$ needs to be selected. Initially, one can start with any value of $c$ greater than one. If it is found that for the iterated estimates the penalty term does not go to zero, the penalty parameter must be increased. When a sufficiently large penalty parameter has been found, the algorithm will converge to a local maximum of the likelihood. If there is some doubt whether this is the global maximum, the procedure must be repeated with different starting values.

The standard Newton approach involves complicated derivatives, making it impractical. However, a modified quasi-Newton approach which is based on simplified first and second derivatives of $P_c(\mu)$ can be used instead. It can be shown that the derivative of $P_c(\mu)$ with the derivative of $\lambda(\mu)$ replaced by its expected value is given by

$$d(\mu) = n - m - H' \lambda(\mu) - (c - 1) H' (H D^{-1}_m H')^{-1} h(\mu).$$

In spite of the simplification, $d(\mu)$ is still a valid search direction for $\hat\mu$. The expected value of the negative second derivative matrix evaluated under the model $h(\mu) = 0$ is

$$F_c(\mu) = D_m + (c - 2) H' (H D^{-1}_m H')^{-1} H$$

and for $c > 1$, this matrix is positive definite.

For sufficiently large $c > 1$, an algorithm then is

$$\mu^{(0)} = \log n$$

$$\mu^{(k+1)} = \mu^{(k)} + \mathrm{step}^{(k)} F_c(\mu^{(k)})^{-1} d(\mu^{(k)})$$

where $\mathrm{step}^{(k)} \in (0, 1]$ is a step size chosen such that $P_c(\mu^{(k+1)}) > P_c(\mu^{(k)})$ and if $n_i = 0$, then it is replaced by a small positive quantity, say $10^{-50}$. The above algorithm will converge to $\hat\mu$ if the starting estimate $\log n$ is sufficiently close to it. Otherwise a different starting estimate may need to be tried.
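A minimal sketch of this iteration is given below, under several assumptions that are not from the paper: the function and variable names are illustrative, $c = 2$ is used as a default, the step size is found by simple halving, and the update is written as an ascent step ($\mu$ plus the step times $F_c^{-1} d$) with the penalty term subtracted in $P_c$, which is the sign convention under which $P_c$ increases when $F_c$ is positive definite. The example fits independence in a hypothetical 2×2 table, where the maximum likelihood estimate is known in closed form (products of the observed margins divided by the total).

```python
import numpy as np

def fit_marginal_model(n, A, B, v, c=2.0, max_iter=200, tol=1e-10):
    """Sketch of the penalty-function iteration for the constraint B log(A m) = v."""
    n = np.asarray(n, dtype=float)
    n = np.where(n > 0, n, 1e-50)                   # zero counts replaced as in the text

    def parts(mu):
        m = np.exp(mu)
        Am = A @ m
        h = B @ np.log(Am) - v
        H = B @ (A * m[None, :] / Am[:, None])      # B D_{Am}^{-1} A D_m
        M = np.linalg.inv(H @ (H / m[None, :]).T)   # (H D_m^{-1} H')^{-1}
        lam = M @ H @ ((n - m) / m)                 # Lagrange multiplier lambda(mu)
        return m, h, H, M, lam

    def penalty(mu):
        m, h, H, M, lam = parts(mu)
        return np.sum(n * mu - m) - lam @ h - 0.5 * c * h @ M @ h

    mu = np.log(n)
    for _ in range(max_iter):
        m, h, H, M, lam = parts(mu)
        d = n - m - H.T @ lam - (c - 1.0) * H.T @ M @ h
        F = np.diag(m) + (c - 2.0) * H.T @ M @ H
        direction = np.linalg.solve(F, d)
        if np.max(np.abs(direction)) < tol:
            break
        step, current = 1.0, penalty(mu)
        # Halve the step until the penalty function increases.
        while step > 1e-8 and not penalty(mu + step * direction) > current:
            step /= 2.0
        if step <= 1e-8:
            break                                   # no further numerical improvement
        mu = mu + step * direction
    return np.exp(mu)

# Independence in a 2x2 table: the log odds ratio is constrained to zero.
n = np.array([30.0, 10.0, 20.0, 40.0])              # cells (1,1), (1,2), (2,1), (2,2)
A = np.eye(4)
B = np.array([[1.0, -1.0, -1.0, 1.0]])
v = np.zeros(1)
fitted = fit_marginal_model(n, A, B, v)
expected = np.outer([40.0, 60.0], [50.0, 50.0]).ravel() / 100.0
print(fitted, expected)
```

For models with a nontrivial $A$, such as marginal homogeneity, the same function applies in principle, but the choice of $c$ and of the starting value may need more care, as the text notes.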

Acknowledgments

Rudas’s research was supported in part by Grant No. OTKA T-032213 from the Hungarian

National Science Foundation. Bergsma’s research was supported by The Netherlands Organization

for Scientiﬁc Research (NWO), Project Number 400-20-001.

REFERENCES

Agresti, A. (1990) Categorical Data Analysis, Wiley, New York.

Aitchison, J. and Silvey, S. D. (1958) Maximum likelihood estimation of parameters subject to

restraints, Ann. Math. Stat., 29, 813-828.

Balagtas, C. C., Becker, M. P., and Lang, J. B. (1995) Marginal modelling of categorical data

from crossover experiments, Applied Statistics, 44, 63-77.

Bartolucci, F. and Forcina, A. (2002) Extended RC association models allowing for order restrictions and marginal modeling, J. Amer. Statist. Assoc., 97(460), 1192-1199.

Bartolucci, F., Forcina, A., and Dardanoni, V. (2001) Positive quadrant dependence and marginal modeling in two-way tables with ordered margins, J. Amer. Statist. Assoc., 96(456), 1497-1505.

Becker, M. P. (1994) Analysis of repeated categorical measurements using models for marginal

distributions: an application to trends in attitudes on legalized abortion, In Marsden, P. V.

(ed.) Sociological Methodology, 24, 229-265, Blackwell, Oxford.

Becker, M. P., Minick, S., and Yang, I. (1998) Speciﬁcations of models for cross-classiﬁed

counts: comparisons of the log-linear model and marginal model perspectives, Sociological

Methods and Research, 26, 511-529.

Bergsma, W. P. (1997) Marginal Models for Categorical Data, Tilburg University Press, Tilburg.

Bergsma, W. P. and Rudas, T. (2002a) Marginal models for categorical data, Ann. Stat., 30,

140-159.

Bergsma, W. P. and Rudas, T. (2002b) Modeling conditional and marginal association in contingency tables, Ann. Fac. Sci. Toulouse Math., 11(6), 443-454.

Bishop, Y. V. V., Fienberg, S. E., and Holland, P. W. (1975) Discrete Multivariate Analysis,

MIT Press, Cambridge, MA.

Colombi, R. and Forcina, A. (2001) Marginal regression models for the analysis of positive association of ordinal response variables, Biometrika, 88(4), 1007-1019.

Darroch, J. N., Lauritzen, S. L., and Speed, T. P. (1980) Markov fields and log-linear models for contingency tables, Ann. Stat., 8, 539-552.

Davis, L. J. (1989) Intersection union tests for strict collapsibility in three-dimensional contingency

tables, Ann. Stat., 17, 1693-1708.

Dawid, A. P. (1980) Conditional independence for statistical operations, Ann. Stat., 8, 598-617.

Fletcher, R. (1970) A class of methods for nonlinear programming with termination and conver-

gence properties, In Abadie, J. Wolfe, P. (eds.) Integer and nonlinear programming, North

Holland, Amsterdam.

Glonek, G. J. N. (1996) A class of regression models for multivariate responses, Biometrika, 83,

15-28.

Glonek, G. J. N. and McCullagh, P. (1995) Multivariate logistic models, J. Roy. Statist. Soc., Ser.

B, 57, 533-546.

Haberman, S. J. (1974) The Analysis of Frequency Data, University of Chicago Press, Chicago.

Hagenaars, J. A. (1990) Categorical Longitudinal Data, Sage, Newbury Park.

Jöreskog, K. G. (1997) Structural equation models in the social sciences: specification, estimation, and testing, In Krishnaiah, P. R. (ed.) Applications of Statistics, 267-287, North-Holland, Amsterdam.

Kauermann, G. (1997) A note on multivariate logistic models for contingency tables, Austr. J. Stat.,

39, 261-276.

Kellerer, H. G. (1964) Verteilungsfunktionen mit gegebenen Marginalverteilungen, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 3, 247-270.

Lang, J. B. and Agresti, A. (1994) Simultaneously modelling the joint and marginal distributions

of multivariate categorical responses, J. Am. Stat. Assoc., 89, 625-632.

Lang, J. B., McDonald, J. W., and Smith, P. W. F. (1999) Association-marginal modelling of

multivariate categorical responses: a maximum likelihood approach, J. Am. Stat. Assoc., 94,

1161-1171.

Lauritzen, S. L. (1996) Graphical Models, Clarendon Press, Oxford.

Lauritzen, S. L., Speed, T. P., and Vijayan, K. (1984) Decomposable graphs and hypergraphs,

J. Austr. Math. Soc., Ser. A, 36, 12-29.

Little, R. J. and Rubin, D. (1987) Statistical Analysis with Missing Data, Wiley, New York.

Molenberghs, G. and Lesaffre, E. (1999) Marginal modelling of multivariate categorical data,

Statistics in Medicine, 18, 2237-2255.

Rapcsák, T. (2000) Global Lagrange multiplier rule and smooth exact penalty functions for equality constraints, In Di Pillo, G., Giannesi, F. (eds.) Nonlinear Optimization and Related Topics, Kluwer, 351-368.

Rudas, T. (1998a) Odds Ratios in the Analysis of Contingency Tables, Sage, Thousand Oaks.

Rudas, T. (1998b) A new algorithm for the maximum likelihood estimation of graphical log-linear models, Computational Statistics, 13(9), 529-537.

Rudas, T. (2002) Canonical representation of log-linear models, Communications in Statistics (Theory and Methods), 31(12), 2311-2323.

TAMÁS RUDAS

Department of Statistics

Faculty of Social Sciences

Eötvös Loránd University

H-1117 Budapest

Pázmány Péter sétány 1/A (Hungary)

rudas@tarki.hu

WICHER P. BERGSMA

EURANDOM, ofﬁce LG 1.37

P.O. Box 513

5600 MB Eindhoven (The Netherlands)

bergsma@eurandom.tue.nl

Grafikus modellek társadalomtudományi alkalmazása mobilitási adatokon (PhD-disszertáció)

Thesis

Full-text available

Jan 2009

Renáta Németh

An application of marginal log-linear models to examine changes in social mobility in Hungary during the transition period

Conference Paper

Full-text available

Jan 2004

Renáta Németh

The paper analyzes social mobility data for Hungary from the years 1987, 1992 and 1999. The main focus is put on testing Treiman’s modernization hypothesis that was posed in 1970 and is still widely cited today in the context of transition. The objective variables involved in the hypothesis are completed with subjective social position and subjective intergenerational mobility, aiming to take into account perception as a pathway through which objective factors have an impact on individual behavior. The fitted models are graphical models based on directed acyclic graphs and the values of marginal log-linear parameters (as proposed in Rudas, Bergsma, 2004) are used to gain insight into the strengths of associations. The main findings include that according to some parameters, a downward mobility trend prevailed between 1987 and 1992 as opposed to the upward trend between 1992 and 1999. That is, when investigating transition process we should distinguish these two periods.

Marginal Models: an Overview

Preprint

Full-text available

Apr 2023

Marginal models involve restrictions on the conditional and marginal association structure of a set of categorical variables. They generalize log-linear models for contingency tables, which are the fundamental tools for modelling the conditional association structure. This chapter gives an overview of the development of marginal models during the past 20 years. After providing some motivating examples, the first few sections focus on the definition and characteristics of marginal models. Specifically, we show how their fundamental properties can be understood from the properties of marginal log-linear parameterizations. Algorithms for estimating marginal models are discussed, focussing on the maximum likelihood and the generalized estimating equations approaches. It is shown how marginal models can help to understand directed graphical and path models, and a description is given of marginal models with latent variables.

Work-to-family spillover: Gender differences in Hungary

Article

Full-text available

Aug 2016

It is crucial to understand the role that labor market positions might play in creating gender differences in work–life balance. One theoretical approach to understanding this relationship is the spillover theory. The spillover theory argues that an individual’s life domains are integrated; meaning that well-being can be transmitted between life domains. Based on data collected in Hungary in 2014, this paper shows that work-to-family spillover does not affect both genders the same way. The effect of work on family life tends to be more negative for women than for men. Two explanations have been formulated in order to understand this gender inequality. According to the findings of the analysis, gender is conditionally independent of spillover if financial status and flexibility of work are also incorporated into the analysis. This means that the relative disadvantage for women in terms of spillover can be attributed to their lower financial status and their relatively low access to flexible jobs. In other words, the gender inequalities in work-to-family spillover are deeply affected by individual labor market positions. The observation of the labor market’s effect on work–life balance is especially important in Hungary since Hungary has one of the least flexible labor arrangements in Europe. A marginal log-linear model, which is a method for categorical multivariate analysis, has been applied in this analysis.

Maximum Augmented Empirical Likelihood Estimation of Categorical Marginal Models for Large Sparse Contingency Tables

Article

Sep 2023

Categorical marginal models (CMMs) are flexible tools for modelling dependent or clustered categorical data, when the dependencies themselves are not of interest. A major limitation of maximum likelihood (ML) estimation of CMMs is that the size of the contingency table increases exponentially with the number of variables, so even for a moderate number of variables, say between 10 and 20, ML estimation can become computationally infeasible. An alternative method, which retains the optimal asymptotic efficiency of ML, is maximum empirical likelihood (MEL) estimation. However, we show that MEL tends to break down for large, sparse contingency tables. As a solution, we propose a new method, which we call maximum augmented empirical likelihood (MAEL) estimation and which involves augmentation of the empirical likelihood support with a number of well-chosen cells. Simulation results show good finite sample performance for very large contingency tables.

Marginal Models: An Overview

Chapter

Apr 2023

Loglinear models in SPSS & Marginal loglinear and graphical models in R

Technical Report

Jan 2012

Renáta Németh

Probability Based Independence Sampler for Bayesian Quantitative Learning in Graphical Log-Linear Marginal Models

Preprint

Jul 2018

Bayesian methods for graphical log-linear marginal models have not been developed to the same extent as traditional frequentist approaches. In this work, we introduce a novel Bayesian approach for quantitative learning for such models. These models belong to curved exponential families that are difficult to handle from a Bayesian perspective. Furthermore, the likelihood cannot be analytically expressed as a function of the marginal log-linear interactions, but only in terms of cell counts or probabilities. Posterior distributions cannot be directly obtained, and MCMC methods are needed. Finally, a well-defined model requires parameter values that lead to compatible marginal probabilities. Hence, any MCMC should account for this important restriction. We construct a fully automatic and efficient MCMC strategy for quantitative learning for graphical log-linear marginal models that handles these problems. While the prior is expressed in terms of the marginal log-linear interactions, we build an MCMC algorithm that employs a proposal on the probability parameter space. The corresponding proposal on the marginal log-linear interactions is obtained via parameter transformation. With this strategy, the sampler moves only within the desired target space. At each step, we directly work with well-defined probability distributions. Moreover, we can exploit a conditional conjugate setup to build an efficient proposal on probability parameters. The proposed methodology is illustrated by a simulation study and a real dataset.

What’s Next?

Chapter

Mar 2018

Tamás Rudas

Homogeneous Linear Predictor Models Specified by Constraints on the Goodman and Kruskal Tau Index

Thesis

Jan 2008

Elena Siletti

Multivariate Logistic Models

Article

Sep 1995

When data composed of several categorical responses together with categorical or continuous predictors are observed, the multivariate logistic transform introduced by McCullagh and Nelder can be used to define a class of regression models that is, in many applications, particularly suitable for relating the joint distribution of the responses to predictors. In this paper we give a general definition of this class of models and study their properties. A computational scheme for performing maximum likelihood estimation for data sets of moderate size is described and a system of model formulae that succinctly define particular models is introduced. Applications of these models to longitudinal problems are illustrated by numerical examples.

Association-marginal modeling of multivariate categorical responses: A maximum likelihood approach

Article

Dec 1999

Generalized log-linear models can be used to describe the association structure and/or the marginal distributions of multivariate categorical responses. We simultaneously model the association structure and marginal distributions using association-marginal (AM) models, which are specially formulated generalized log-linear models that combine two models: an association (A) model, which describes the association among all the responses; and a marginal (M) model, which describes the marginal distributions of the responses. Because the model's composite link function is not required to be invertible, a large class of models can be entertained and model specification is typically straightforward. We propose a "mixed freedom/constraint" parameterization that exploits the special structure of an AM model. Using this parameterization, maximum likelihood fitting is straightforward and typically feasible for large, sparse tables. When a parsimonious association model is used, the size of the fitting problem is substantially reduced, and some of the problems associated with sampling 0's are avoided. We compare the asymptotic behavior of AM model parameter estimators assuming product-multinomial and Poisson sampling. For computational convenience, the product-multinomial variances are obtained by adjusting the Poisson variances. We propose a conditional score statistic for AM model assessment. The proposed maximum likelihood methods are illustrated through an analysis of marijuana use data from five waves of the National Youth Survey.

Structural equation models in the social sciences: Specification, estimation and testing

Article

Jan 1977

Karl G. Jöreskog

Analysis of Cross-Classifications of Counts Using Models for Marginal Distributions: An Application to Trends in Attitudes on Legalized Abortion

Article

Jan 1994

Mark P. Becker

Models parameterized in terms of linear models for marginal logits and linear models for marginal log-odds ratios provide a useful framework for the analysis of cross-classifications of counts when there is interest in comparing marginal distributions, or studying changes in marginal distributions. Examples of such cross-classifications include tabulations of responses to a collection of items from a questionnaire, and contingency tables cross-classifying the repeated measurements of a categorical response variable from longitudinal studies. The example used for illustrative purposes in this chapter is based on four items common to the National Opinion Research Center's (NORC) 1965 SRS870, 1975 General Social Survey (GSS), and 1985 GSS. Each of the items asked respondents to indicate whether or not they approved of the availability of legalized abortions for women in a specific situation. The utility of the marginal models approach to analyzing repeated categorical measurements is demonstrated through comparisons with two analyses based on conventional log-linear models. An algorithm that can be used to fit marginal models by the method of maximum likelihood is described in the appendix to this chapter.

Marginal Models for Categorical Data

Article

Jan 1997

Wicher Bergsma

A new algorithm for the maximum likelihood estimation of graphical log-linear models

Article

Jan 1998

Tamás Rudas

Graphical log-linear models for contingency tables are characterized as intersections of simpler conditional independence models. This suggests an algorithm to compute maximum likelihood estimates by iteratively imposing the conditional independences on the data. A convergence proof is given and this algorithm is compared to iterative proportional fitting, which is the standard method of computing maximum likelihood estimates under log-linear models. The algorithm proposed here is intuitively appealing and its main idea can be applied to the estimation of other statistical models.

Marginal Modelling of Categorical Data from Crossover Experiments

Article

Mar 1995
J R STAT SOC C-APPL

Marginal models provide a useful framework for the analysis of crossover experiments when the response variable is categorical. We use the three-treatment, three-period crossover experiment with a binary outcome variable to demonstrate how marginal models can be used to perform a likelihood-based analysis of multiple-period crossover experiments. Other designs are discussed in less detail. Maximum likelihood estimation is performed using a constraint equation specification of the marginal model. Data from a crossover trial comparing treatments for primary dysmenorrhoea are used to demonstrate the utility of marginal models in analysing crossover data.

Global Lagrange multiplier rule and smooth exact penalty functions for equality constraints

Article

Jan 2000

Tamás Rapcsák

Smooth exact penalty functions based on Courant’s and Fletcher’s ideas are reconsidered. After a short survey, the original ideas are combined with the global Lagrange multiplier rule formulated by the first and second covariant derivatives of the objective function with respect to the induced Riemannian metric of the constraint manifold. The tensor approach is described by the usual tools of nonlinear optimization giving a clearer geometric background of these methods.

Marginal regression models for the analysis of positive association

Article

Dec 2001
BIOMETRIKA

Given a set of discrete response variables, some of which are ordinal, and an arbitrary set of discrete explanatory variables, we propose a simple matrix formulation for parameterising the saturated model as in Glonek (1996). This is such that, within a hierarchical structure, marginal logits and log-odds ratios of various possible types, together with the remaining log-linear interactions of high order, may be modelled by equality and inequality constraints. Inequality constraints are particularly relevant for specifying models of positive association. Efficient algorithms are provided for computing maximum likelihood estimates under such constraints. The asymptotic distribution of the likelihood ratio test is derived and an extension of the usual analysis of deviance is outlined which incorporates inequality constraints.

Categorical Data Analysis 2nd Edn

Book

Jan 1990

Agresti
