Pólya-Gamma Data Augmentation to address Non-conjugacy in the
Bayesian Estimation of Mixed Multinomial Logit Models
April 13, 2019
Prateek Bansal*
School of Civil and Environmental Engineering
Cornell University, United States
pb422@cornell.edu
Rico Krueger*
Research Centre for Integrated Transport Innovation, School of Civil and Environmental Engineering,
UNSW Australia, Sydney NSW 2052, Australia
r.krueger@student.unsw.edu.au
Michel Bierlaire
Transport and Mobility Laboratory, School of Architecture, Civil and Environmental Engineering,
Ecole Polytechnique Fédérale de Lausanne, Station 18, Lausanne 1015, Switzerland
michel.bierlaire@epfl.ch
Ricardo A. Daziano
School of Civil and Environmental Engineering
Cornell University, United States
daziano@cornell.edu
Taha H. Rashidi
Research Centre for Integrated Transport Innovation, School of Civil and Environmental Engineering,
UNSW Australia, Sydney NSW 2052, Australia
rashidi@unsw.edu.au
*These authors contributed equally to this work.
arXiv:1904.07688v1 [stat.ML] 13 Apr 2019
Abstract
The standard Gibbs sampler for Mixed Multinomial Logit (MMNL) models involves sampling from the conditional densities of the utility parameters using the Metropolis-Hastings (MH) algorithm, due to the unavailability of a conjugate prior for the logit kernel. To address this non-conjugacy concern, we propose the application of the Pólya-Gamma data augmentation (PG-DA) technique to MMNL estimation. The posterior estimates of the augmented and the default Gibbs sampler are similar for the two-alternative scenario (binary choice), but we encounter empirical identification issues in the case of more alternatives ($J \geq 3$).
1 Mixed multinomial logit model
The mixed multinomial logit (MMNL) model (McFadden and Train, 2000) is established as follows: We consider a standard discrete choice setup, in which on choice occasion $t \in \{1, \dots, T\}$, a decision-maker $n \in \{1, \dots, N\}$ derives utility $U_{ntj} = V(X_{ntj}, \Gamma_n) + \varepsilon_{ntj}$ from alternative $j \in \{1, \dots, J\}$. Here, $V()$ denotes the representative utility, $X_{ntj}$ is a row-vector of covariates, $\Gamma_n$ is a collection of taste parameters, and $\varepsilon_{ntj}$ is a stochastic disturbance. The assumption $\varepsilon_{ntj} \sim \text{Gumbel}(0,1)$ leads to a multinomial logit (MNL) kernel such that the probability that decision-maker $n$ chooses alternative $j$ on choice occasion $t$ is

$$P(y_{nt} = j \mid X_{nt}, \Gamma_n) = \frac{\exp\{V(X_{ntj}, \Gamma_n)\}}{\sum_{k=1}^{J} \exp\{V(X_{ntk}, \Gamma_n)\}}, \qquad (1)$$
where $y_{nt}$ captures the observed choice. The choice probability can be iterated over choice scenarios to obtain the probability of observing a decision-maker's sequence of choices $y_n$:

$$P(y_n \mid X_n, \Gamma_n) = \prod_{t=1}^{T} P(y_{nt} = j \mid X_{nt}, \Gamma_n). \qquad (2)$$
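For concreteness, equations (1) and (2) can be sketched numerically. The dimensions, random utilities, and choice sequence below are illustrative placeholders, not quantities from the paper:

```python
import numpy as np

def mnl_probs(V):
    """MNL choice probabilities, eq. (1): row-wise softmax of the
    representative utilities V with shape (T, J), stabilized by
    subtracting the row maximum before exponentiating."""
    V = V - V.max(axis=1, keepdims=True)
    expV = np.exp(V)
    return expV / expV.sum(axis=1, keepdims=True)

def sequence_prob(V, y):
    """Probability of a decision-maker's observed choice sequence,
    eq. (2): product over choice occasions t of P(y_nt | .)."""
    P = mnl_probs(V)
    T = V.shape[0]
    return np.prod(P[np.arange(T), y])

# illustrative example: T = 3 choice occasions, J = 2 alternatives
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 2))
y = np.array([0, 1, 0])   # hypothetical observed choices
p = sequence_prob(V, y)
```

Because the kernel is a softmax, adding a common constant to all utilities in a choice set leaves the probabilities in (1) and hence the sequence probability in (2) unchanged; only utility differences matter.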
We consider a general utility specification under which tastes $\Gamma_n$ are partitioned into fixed taste parameters $\alpha$, which are invariant across decision-makers, and random taste parameters $\beta_n$, which are individual-specific, such that $\Gamma_n = (\alpha^\top \; \beta_n^\top)^\top$, whereby $\alpha$ and $\beta_n$ are vectors of lengths $L$ and $K$, respectively. Analogously, the row-vector of covariates $X_{ntj}$ is partitioned into attributes $X_{ntj,F}$, which pertain to the fixed parameters $\alpha$, as well as into attributes $X_{ntj,R}$, which pertain to the individual-specific parameters $\beta_n$, such that $X_{ntj} = (X_{ntj,F} \; X_{ntj,R})$. For simplicity, we assume that the representative utility is linear-in-parameters, i.e.

$$V(X_{ntj}, \Gamma_n) = X_{ntj} \Gamma_n = X_{ntj,F} \alpha + X_{ntj,R} \beta_n. \qquad (3)$$
The distribution of tastes $\beta_{1:N}$ is assumed to be multivariate normal, i.e. $\beta_n \sim N(\zeta, \Omega)$ for $n = 1, \dots, N$, where $\zeta$ is a mean vector and $\Omega$ is a covariance matrix. In a fully Bayesian setup, the invariant (across individuals) parameters $\alpha$, $\zeta$, $\Omega$ are also considered to be random parameters and are thus given priors. We use normal priors for the fixed parameters $\alpha$ and for the mean vector $\zeta$. Following Tan (2017) and Akinc and Vandebroek (2018), we employ Huang's half-t prior (Huang and Wand, 2013) for the covariance matrix $\Omega$, as this prior specification exhibits superior noninformativity properties compared to other prior specifications for covariance matrices. In particular, Akinc and Vandebroek (2018) show that Huang's half-t prior outperforms the inverse Wishart prior, which is often employed in fully Bayesian specifications of MMNL models (e.g. Train, 2009), in terms of parameter recovery.
Stated succinctly, the generative process of the fully Bayesian MMNL model is:
$$\alpha \mid \lambda_0, \Xi_0 \sim N(\lambda_0, \Xi_0) \qquad (4)$$
$$\zeta \mid \mu_0, \Sigma_0 \sim N(\mu_0, \Sigma_0) \qquad (5)$$
$$a_k \mid A_k \sim \text{Gamma}\left(\tfrac{1}{2}, \tfrac{1}{A_k^2}\right), \quad k = 1, \dots, K, \qquad (6)$$
$$\Omega \mid \nu, a \sim \text{IW}\left(\nu + K - 1, \; 2\nu \, \text{diag}(a)\right), \quad a = (a_1 \; \dots \; a_K)^\top \qquad (7)$$
$$\beta_n \mid \zeta, \Omega \sim N(\zeta, \Omega), \quad n = 1, \dots, N, \qquad (8)$$
$$y_{nt} \mid \alpha, \beta_n, X_{nt} \sim \text{MNL}(\alpha, \beta_n, X_{nt}), \quad n = 1, \dots, N, \; t = 1, \dots, T, \qquad (9)$$
where (6) and (7) induce Huang's half-t prior (Huang and Wand, 2013). $\{\lambda_0, \Xi_0, \mu_0, \Sigma_0, \nu, A_{1:K}\}$ are known hyper-parameters, and $\theta = \{\alpha, \zeta, \Omega, a, \beta_{1:N}\}$ is a collection of model parameters whose posterior distribution we wish to estimate.
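The generative process (4)–(9) can be simulated forward in a few lines. The dimensions, hyper-parameter values, and the integer-degrees-of-freedom inverse-Wishart draw below are illustrative assumptions for a sketch, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative dimensions and hyper-parameters (assumptions)
N, T, J, L, K = 50, 5, 2, 1, 2
nu = 2.0
A = np.ones(K)                        # A_1:K of eq. (6)
lam0, Xi0 = np.zeros(L), np.eye(L)    # hyper-parameters of eq. (4)
mu0, Sig0 = np.zeros(K), np.eye(K)    # hyper-parameters of eq. (5)

def rinvwishart(df, scale, rng):
    """Inverse-Wishart(df, scale) draw via an integer-df Wishart of the
    inverted scale matrix (sum of outer products of normal vectors)."""
    Z = rng.multivariate_normal(np.zeros(scale.shape[0]),
                                np.linalg.inv(scale), size=int(df))
    return np.linalg.inv(Z.T @ Z)

alpha = rng.multivariate_normal(lam0, Xi0)                    # eq. (4)
zeta = rng.multivariate_normal(mu0, Sig0)                     # eq. (5)
a = rng.gamma(shape=0.5, scale=A ** 2)                        # eq. (6), rate 1/A_k^2
Omega = rinvwishart(nu + K - 1, 2.0 * nu * np.diag(a), rng)   # eq. (7)
beta = rng.multivariate_normal(zeta, Omega, size=N)           # eq. (8)

# eq. (9): representative utilities, eq. (3), and MNL choices
X_F = rng.normal(size=(N, T, J, L))
X_R = rng.normal(size=(N, T, J, K))
V = (np.einsum('ntjl,l->ntj', X_F, alpha)
     + np.einsum('ntjk,nk->ntj', X_R, beta))
expV = np.exp(V - V.max(axis=2, keepdims=True))
P = expV / expV.sum(axis=2, keepdims=True)
y = np.array([[rng.choice(J, p=P[n, t]) for t in range(T)]
              for n in range(N)])
```

Such a forward simulation is also the usual way to generate synthetic data for the Monte Carlo comparison of samplers discussed later.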
2 Pólya–Gamma data augmentation
The default Gibbs sampler for posterior inference in MMNL models involves Metropolis steps to take draws from the conditional densities of the utility parameters ($\beta_n$ and $\alpha$) due to the unavailability of a conjugate prior for the MNL kernel. MCMC estimation of binary and multinomial logistic regression models encounters a similar issue of non-conjugacy. Pólya-Gamma data augmentation (PG-DA) is the state-of-the-art technique to handle non-conjugacy in the MCMC estimation of binary logistic regression models (Polson et al., 2013). PG-DA augments the Gibbs sampler by introducing an additional Pólya-Gamma distributed latent variable, which circumvents the need for the Metropolis algorithm by ensuring conjugate updates. Polson et al. (2013) also extend PG-DA to the multinomial logistic regression model. Yet, this extension requires all utility (or link function) parameters to be alternative-specific. We use the same idea in deriving a PG-DA-based Gibbs sampler for MMNL, but we have to impose the same restriction on the utility specification, i.e. replace $\Gamma_n$ by $\Gamma_{nj}$.
2.1 Augmented Gibbs Sampler
The representative utility is: $V_{ntj} = X_{ntj} \Gamma_{nj} = X_{ntj,F} \alpha_j + X_{ntj,R} \beta_{nj}$, where $\beta_{nj} \sim N(\zeta_j, \Omega)$. The hyper-parameters remain the same, but the model parameters are $\theta = \{\alpha_{1:J}, \zeta_{1:J}, \Omega, a_{1:K}, \beta_{\{1:N,1:J\}}\}$.
Adhering to the original notation, we can write the joint distribution of the data and the model
parameters:
$$P(y_{1:N}, \theta) = P(\Omega \mid \omega, B) \prod_{n=1}^{N} P(y_n \mid X_n, \Gamma_n) \prod_{j=1}^{J} P(\alpha_j \mid \lambda_0, \Xi_0) \, P(\zeta_j \mid \mu_0, \Sigma_0) \prod_{k=1}^{K} P(a_k \mid s, r_k) \prod_{n=1}^{N} \prod_{j=1}^{J} P(\beta_{nj} \mid \zeta_j, \Omega). \qquad (10)$$
Algorithm 1 presents the augmented Gibbs sampler for the MMNL model. The conditional densities of $a_{1:K}$, $\Omega$, and $\zeta_{1:J}$ are similar to those of the Allenby-Train procedure (Akinc and Vandebroek, 2018). The next subsection details the derivation of the conditional densities of $\beta_{\{1:N,1:J\}}$ and $\alpha_{1:J}$.
for (iteration in 1 to max-iteration) do
    Sample $a_k$ for all $k$ from $\text{Gamma}\left(\frac{\nu + K}{2}, \; \frac{1}{A_k^2} + \nu \left(\Omega^{-1}\right)_{kk}\right)$;
    Sample $\Omega$ from $\text{IW}\left(\nu + NJ + K - 1, \; 2\nu \, \text{diag}(a) + \sum_{n=1}^{N} \sum_{j=1}^{J} (\beta_{nj} - \zeta_j)(\beta_{nj} - \zeta_j)^\top\right)$;
    for (i in 1 to J) do
        Sample $\zeta_i$ from $N\left(\frac{1}{N} \sum_{n=1}^{N} \beta_{ni}, \; \frac{\Omega}{N}\right)$;
        Sample $\beta_{ni}$ for all $n$ using equation (15);
        Update $\eta_{nti}$ and $L_{nti}$ for all $n$, $t$ using equation (12);
        Sample $\alpha_i$ using equation (16);
        Update $\eta_{nti}$ and $L_{nti}$ for all $n$, $t$ using equation (12);
        Sample $\phi_{nti}$ for all $n$, $t$ from $\text{PG}(1, \eta_{nti})$;
    end
end
Algorithm 1: Pólya-Gamma augmented Gibbs sampler for the MMNL model
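Algorithm 1 requires draws $\phi \sim \text{PG}(1, \eta)$. One simple, approximate way to obtain them is to truncate the infinite-sum representation of the Pólya-Gamma distribution given by Polson et al. (2013); the truncation level below is an illustrative assumption (exact samplers exist, e.g. in Polson et al.'s accompanying software):

```python
import numpy as np

def rpolyagamma(b, z, size, rng, n_terms=200):
    """Approximate PG(b, z) draws by truncating the sum representation
    PG(b, z) = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + z^2 / (4 pi^2)),
    with g_k ~ Gamma(b, 1) independent (Polson et al., 2013)."""
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + z ** 2 / (4.0 * np.pi ** 2)
    g = rng.gamma(shape=b, scale=1.0, size=(size, n_terms))
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

rng = np.random.default_rng(2)
z = 1.5
draws = rpolyagamma(1.0, z, size=200_000, rng=rng)
# sanity check against the known mean E[PG(1, z)] = tanh(z / 2) / (2 z)
```

Truncation biases the draws slightly downward, but with a few hundred terms the bias is far below Monte Carlo error for the scales arising here.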
2.2 Conditional distributions of $\beta_{nj}$ and $\alpha_j$
Using Holmes et al. (2006), we can convert the multinomial logit likelihood expression to the binary logit likelihood:

$$P(\beta_{nj} \mid y_{1:N}, \theta_{-\beta_{nj}}) \propto P(\beta_{nj} \mid \zeta_j, \Omega) \prod_{t=1}^{T} \left(\frac{\exp(\eta_{ntj})}{1 + \exp(\eta_{ntj})}\right)^{y_{ntj}} \left(\frac{1}{1 + \exp(\eta_{ntj})}\right)^{(1 - y_{ntj})} \propto P(\beta_{nj} \mid \zeta_j, \Omega) \prod_{t=1}^{T} \frac{\exp(\eta_{ntj})^{y_{ntj}}}{1 + \exp(\eta_{ntj})} \qquad (11)$$

where $\theta_{-\beta_{nj}}$ is the resulting parameter vector after removing $\beta_{nj}$ and

$$\eta_{ntj} = V_{ntj} - L_{ntj}; \qquad L_{ntj} = \ln\left(\sum_{k=1, k \neq j}^{J} \exp(V_{ntk})\right) \qquad (12)$$
We now introduce a Pólya-Gamma distributed auxiliary variable $\phi_{ntk} \sim \text{PG}(1, 0)$ $\forall n, t, k$ and $\kappa_{ntk} = y_{ntk} - \frac{1}{2}$. Now consider the identity derived by Polson et al. (2013):

$$\frac{\exp(\eta_{ntk})^{y_{ntk}}}{1 + \exp(\eta_{ntk})} = \frac{\exp(\kappa_{ntk} \eta_{ntk})}{2} \int_0^\infty \exp\left(-\frac{\eta_{ntk}^2 \phi_{ntk}}{2}\right) P(\phi_{ntk}) \, d\phi_{ntk} \qquad (13)$$
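Identity (13) can be verified numerically: for $\phi \sim \text{PG}(1,0)$ the integral is the Laplace transform $E[\exp(-\eta^2 \phi / 2)] = 1/\cosh(\eta/2)$ (Polson et al., 2013), which reduces the right-hand side to a closed form. The grid of $\eta$ values below is an arbitrary illustration:

```python
import numpy as np

def lhs(eta, y):
    """Left-hand side of identity (13): the binary logit likelihood term."""
    return np.exp(eta * y) / (1.0 + np.exp(eta))

def rhs(eta, y):
    """Right-hand side of (13) with the PG(1, 0) Laplace transform
    E[exp(-eta^2 phi / 2)] = 1 / cosh(eta / 2) substituted in."""
    kappa = y - 0.5
    return np.exp(kappa * eta) / (2.0 * np.cosh(eta / 2.0))

etas = np.linspace(-4.0, 4.0, 101)
```

The two sides agree for both $y = 0$ and $y = 1$, which is exactly why conditioning on $\phi$ turns the logit likelihood into a Gaussian kernel in $\eta$.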
The conditional density of $\beta_{nj}$ is:

$$P(\beta_{nj} \mid y_{1:N}, \theta_{-\beta_{nj}}, \phi) \propto \exp\left(-\frac{1}{2} (\beta_{nj} - \zeta_j)^\top \Omega^{-1} (\beta_{nj} - \zeta_j)\right) \prod_{t=1}^{T} \exp\left(\kappa_{ntj} X_{ntj,R} \beta_{nj} - \frac{\left(X_{ntj,F} \alpha_j + X_{ntj,R} \beta_{nj} - L_{ntj}\right)^2 \phi_{ntj}}{2}\right) \qquad (14)$$
The conditional distribution of $\beta_{nj}$ is:

$$\beta_{nj} \mid y_{1:N}, \theta_{-\beta_{nj}}, \phi \sim N\left(\left(\Omega^{-1} + \sum_{t=1}^{T} \phi_{ntj} X_{ntj,R}^\top X_{ntj,R}\right)^{-1} \left(\Omega^{-1} \zeta_j + \sum_{t=1}^{T} X_{ntj,R}^\top \left(\kappa_{ntj} - \phi_{ntj} \left(X_{ntj,F} \alpha_j - L_{ntj}\right)\right)\right), \; \left(\Omega^{-1} + \sum_{t=1}^{T} \phi_{ntj} X_{ntj,R}^\top X_{ntj,R}\right)^{-1}\right) \qquad (15)$$
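The update (15) is a standard Gaussian conjugate draw. A sketch for a single decision-maker $n$ and alternative $j$ might look as follows; all inputs here are illustrative random placeholders (in the sampler, $\phi_{ntj}$ would be actual PG(1, $\eta_{ntj}$) draws):

```python
import numpy as np

def draw_beta_nj(X_R, X_F, alpha_j, L_nj, kappa_nj, phi_nj,
                 zeta_j, Omega_inv, rng):
    """One draw from the conditional in eq. (15) for a single (n, j).
    X_R: (T, K) random-taste covariates, X_F: (T, L) fixed covariates,
    L_nj: (T,) log-sum terms, kappa_nj, phi_nj: (T,) PG quantities."""
    # posterior precision: Omega^{-1} + sum_t phi_t x_R^T x_R
    prec = Omega_inv + (X_R * phi_nj[:, None]).T @ X_R
    cov = np.linalg.inv(prec)
    cov = 0.5 * (cov + cov.T)  # symmetrize against round-off
    # posterior mean: cov (Omega^{-1} zeta + sum_t x_R^T (kappa - phi (x_F alpha - L)))
    resid = kappa_nj - phi_nj * (X_F @ alpha_j - L_nj)
    mean = cov @ (Omega_inv @ zeta_j + X_R.T @ resid)
    return rng.multivariate_normal(mean, cov), mean, cov

rng = np.random.default_rng(3)
T, K, L = 5, 2, 1
X_R = rng.normal(size=(T, K)); X_F = rng.normal(size=(T, L))
alpha_j = rng.normal(size=L); L_nj = rng.normal(size=T)
kappa_nj = rng.choice([-0.5, 0.5], size=T)     # y - 1/2
phi_nj = rng.gamma(1.0, 0.25, size=T)          # placeholder for PG(1, eta) draws
zeta_j = np.zeros(K); Omega_inv = np.eye(K)
beta_draw, mean, cov = draw_beta_nj(X_R, X_F, alpha_j, L_nj, kappa_nj,
                                    phi_nj, zeta_j, Omega_inv, rng)
```

The draw for $\alpha_j$ in equation (16) has exactly the same precision-weighted form, with $X_{ntj,F}$, $\Xi_0^{-1}$, $\lambda_0$, and a double sum over $n$ and $t$ in place of the corresponding quantities here.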
The conditional density of $\alpha_j$ can be derived similarly:

$$\alpha_j \mid y_{1:N}, \theta_{-\alpha_j}, \phi \sim N\left(\left(\Xi_0^{-1} + \sum_{n=1}^{N} \sum_{t=1}^{T} \phi_{ntj} X_{ntj,F}^\top X_{ntj,F}\right)^{-1} \left(\Xi_0^{-1} \lambda_0 + \sum_{n=1}^{N} \sum_{t=1}^{T} X_{ntj,F}^\top \left(\kappa_{ntj} - \phi_{ntj} \left(X_{ntj,R} \beta_{nj} - L_{ntj}\right)\right)\right), \; \left(\Xi_0^{-1} + \sum_{n=1}^{N} \sum_{t=1}^{T} \phi_{ntj} X_{ntj,F}^\top X_{ntj,F}\right)^{-1}\right) \qquad (16)$$
2.3 Discussion
We test the performance of the PG-DA-based Gibbs sampler against the Metropolis-based Gibbs sampler in a Monte Carlo study. We first consider the MNL model ($\Gamma_{nj} = \alpha_j$), where both samplers perform equally well. For the MMNL model, the posterior estimates of the proposed PG-DA approach and the default Gibbs sampler are similar for $J = 2$, but we encounter an explosion of the conditional distribution parameters in the case of more alternatives ($J \geq 3$). Results for $J = 2$ and MATLAB code are available upon request.
This appears to be an issue of empirical identification caused by the large number of model parameters. Before the PG-DA sampler diverges, the representative utilities are either very small or very large in magnitude for all alternatives across all observations. Therefore, instead of the actual magnitudes of the utilities, their relative scales determine the choice probabilities. Thus, the algorithm might have a tendency to increase the relative scale of the latent utilities by increasing the scale of the parameters.
In fact, prior to divergence the probability estimates of the chosen and non-chosen alternatives are close to one and zero, respectively. We speculate that such behavior might be a consequence of the large number of model parameters, which might allow the algorithm to find a parameter configuration that fits the data very well (in terms of choice probabilities). Once the algorithm finds that configuration, it starts to increase the relative scale between the utilities (thus allowing the chosen alternatives to have probability close to one), causing the parameter explosion.
As future research, stick-breaking constructions can be explored to adopt PG-DA in the MCMC estimation of MMNL models while keeping a parsimonious model specification, i.e. with generic utility parameters (Linderman et al., 2015; Zhang and Zhou, 2017). However, before adopting these constructions, their consistency with microeconomic theory needs to be established.
References
Akinc, D. and Vandebroek, M. (2018). Bayesian estimation of mixed logit models: Selecting an appropriate prior for the covariance matrix. Journal of Choice Modelling, 29:133–151.
Holmes, C. C., Held, L., et al. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145–168.
Huang, A. and Wand, M. P. (2013). Simple marginally noninformative prior distributions for covariance matrices. Bayesian Analysis, 8(2):439–452.
Linderman, S., Johnson, M., and Adams, R. P. (2015). Dependent multinomial models made easy: Stick-breaking with the Pólya-Gamma augmentation. In Advances in Neural Information Processing Systems, pages 3456–3464.
McFadden, D. and Train, K. (2000). Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5):447–470.
Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349.
Tan, L. S. L. (2017). Stochastic variational inference for large-scale discrete choice models using adaptive batch sizes. Statistics and Computing, 27(1):237–257.
Train, K. E. (2009). Discrete Choice Methods with Simulation. Cambridge University Press, 2nd edition.
Zhang, Q. and Zhou, M. (2017). Permuted and augmented stick-breaking Bayesian multinomial regression. The Journal of Machine Learning Research, 18(1):7479–7511.