Pólya-Gamma Data Augmentation to address Non-conjugacy in the
Bayesian Estimation of Mixed Multinomial Logit Models
April 13, 2019
Prateek Bansal*
School of Civil and Environmental Engineering
Cornell University, United States
pb422@cornell.edu
Rico Krueger*
Research Centre for Integrated Transport Innovation, School of Civil and Environmental Engineering,
UNSW Australia, Sydney NSW 2052, Australia
r.krueger@student.unsw.edu.au
Michel Bierlaire
Transport and Mobility Laboratory, School of Architecture, Civil and Environmental Engineering,
Ecole Polytechnique Fédérale de Lausanne, Station 18, Lausanne 1015, Switzerland
michel.bierlaire@epfl.ch
Ricardo A. Daziano
School of Civil and Environmental Engineering
Cornell University, United States
daziano@cornell.edu
Taha H. Rashidi
Research Centre for Integrated Transport Innovation, School of Civil and Environmental Engineering,
UNSW Australia, Sydney NSW 2052, Australia
rashidi@unsw.edu.au
*These authors contributed equally to this work.
arXiv:1904.07688v1 [stat.ML] 13 Apr 2019
Abstract
The standard Gibbs sampler for Mixed Multinomial Logit (MMNL) models samples from the conditional densities of the utility parameters using the Metropolis-Hastings (MH) algorithm, because no conjugate prior is available for the logit kernel. To address this non-conjugacy concern, we propose the application of the Pólya-Gamma data augmentation (PG-DA) technique to MMNL estimation. The posterior estimates of the augmented and the default Gibbs samplers are similar in the two-alternative scenario (binary choice), but we encounter empirical identification issues in the case of more alternatives ($J \geq 3$).
1 Mixed multinomial logit model
The mixed multinomial logit (MMNL) model (McFadden and Train, 2000) is established as follows: We consider a standard discrete choice setup, in which on choice occasion $t \in \{1, \dots, T\}$, a decision-maker $n \in \{1, \dots, N\}$ derives utility $U_{ntj} = V(X_{ntj}, \Gamma_n) + \varepsilon_{ntj}$ from alternative $j \in \{1, \dots, J\}$. Here, $V()$ denotes the representative utility, $X_{ntj}$ is a row-vector of covariates, $\Gamma_n$ is a collection of taste parameters, and $\varepsilon_{ntj}$ is a stochastic disturbance. The assumption $\varepsilon_{ntj} \sim \text{Gumbel}(0, 1)$ leads to a multinomial logit (MNL) kernel, such that the probability that decision-maker $n$ chooses alternative $j$ on choice occasion $t$ is

$$P(y_{nt} = j \mid X_{ntj}, \Gamma_n) = \frac{\exp\{V(X_{ntj}, \Gamma_n)\}}{\sum_{k=1}^{J} \exp\{V(X_{ntk}, \Gamma_n)\}}, \quad (1)$$

where $y_{nt}$ captures the observed choice. The choice probability can be iterated over choice occasions to obtain the probability of observing a decision-maker's sequence of choices $y_n$:

$$P(y_n \mid X_n, \Gamma_n) = \prod_{t=1}^{T} P(y_{nt} = j \mid X_{nt}, \Gamma_n). \quad (2)$$
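For concreteness, the choice probability (1) and the sequence probability (2) can be sketched in NumPy; the function and variable names below are our own illustration, not taken from the paper:

```python
import numpy as np

def mnl_choice_probs(X_nt, gamma):
    """MNL probabilities for one choice occasion, eq. (1).

    X_nt: (J, D) matrix of covariate row-vectors; gamma: (D,) taste vector."""
    v = X_nt @ gamma          # representative utilities V(X_ntj, Gamma_n)
    v = v - v.max()           # subtract the max for numerical stability
    p = np.exp(v)
    return p / p.sum()

def sequence_prob(X_n, y_n, gamma):
    """Probability of a decision-maker's observed choice sequence y_n, eq. (2)."""
    return float(np.prod([mnl_choice_probs(X_n[t], gamma)[y_n[t]]
                          for t in range(len(y_n))]))
```

With all utilities equal, the probabilities reduce to $1/J$ per alternative, which is a quick sanity check on an implementation.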
We consider a general utility specification under which the tastes $\Gamma_n$ are partitioned into fixed taste parameters $\alpha$, which are invariant across decision-makers, and random taste parameters $\beta_n$, which are individual-specific, such that $\Gamma_n = \begin{bmatrix} \alpha^\top & \beta_n^\top \end{bmatrix}^\top$, whereby $\alpha$ and $\beta_n$ are vectors of lengths $L$ and $K$, respectively. Analogously, the row-vector of covariates $X_{ntj}$ is partitioned into attributes $X_{ntj,F}$, which pertain to the fixed parameters $\alpha$, as well as into attributes $X_{ntj,R}$, which pertain to the individual-specific parameters $\beta_n$, such that $X_{ntj} = \begin{bmatrix} X_{ntj,F} & X_{ntj,R} \end{bmatrix}$. For simplicity, we assume that the representative utility is linear-in-parameters, i.e.

$$V(X_{ntj}, \Gamma_n) = X_{ntj} \Gamma_n = X_{ntj,F}\,\alpha + X_{ntj,R}\,\beta_n. \quad (3)$$
The distribution of tastes $\beta_{1:N}$ is assumed to be multivariate normal, i.e. $\beta_n \sim N(\zeta, \Omega)$ for $n = 1, \dots, N$, where $\zeta$ is a mean vector and $\Omega$ is a covariance matrix. In a fully Bayesian setup, the invariant (across individuals) parameters $\alpha$, $\zeta$, $\Omega$ are also considered to be random parameters and are thus given priors. We use normal priors for the fixed parameters $\alpha$ and for the mean vector $\zeta$. Following Tan (2017) and Akinc and Vandebroek (2018), we employ Huang's half-t prior (Huang and Wand, 2013) for the covariance matrix $\Omega$, as this prior specification exhibits superior noninformativity properties compared to other prior specifications for covariance matrices. In particular, Akinc and Vandebroek (2018) show that Huang's half-t prior outperforms the inverse Wishart prior, which is often employed in fully Bayesian specifications of MMNL models (e.g. Train, 2009), in terms of parameter recovery.
Stated succinctly, the generative process of the fully Bayesian MMNL model is:

$$\alpha \mid \lambda_0, \Xi_0 \sim N(\lambda_0, \Xi_0) \quad (4)$$
$$\zeta \mid \mu_0, \Sigma_0 \sim N(\mu_0, \Sigma_0) \quad (5)$$
$$a_k \mid A_k \sim \text{Gamma}\left(\frac{1}{2}, \frac{1}{A_k^2}\right), \quad k = 1, \dots, K, \quad (6)$$
$$\Omega \mid \nu, a \sim \text{IW}\left(\nu + K - 1, \; 2\nu\,\text{diag}(a)\right), \quad a = \begin{bmatrix} a_1 & \dots & a_K \end{bmatrix}^\top \quad (7)$$
$$\beta_n \mid \zeta, \Omega \sim N(\zeta, \Omega), \quad n = 1, \dots, N, \quad (8)$$
$$y_{nt} \mid \alpha, \beta_n, X_{nt} \sim \text{MNL}(\alpha, \beta_n, X_{nt}), \quad n = 1, \dots, N, \; t = 1, \dots, T, \quad (9)$$

where (6) and (7) induce Huang's half-t prior (Huang and Wand, 2013). $\{\lambda_0, \Xi_0, \mu_0, \Sigma_0, \nu, A_{1:K}\}$ are known hyper-parameters, and $\theta = \{\alpha, \zeta, \Omega, a, \beta_{1:N}\}$ is the collection of model parameters whose posterior distribution we wish to estimate.
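A single draw from the taste-related part of this generative process, (5)-(8), can be sketched as follows; the hyper-parameter defaults ($\nu = 2$, $A_k = 10^3$, standard-normal prior for $\zeta$) are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.stats import invwishart

def sample_mmnl_prior(N, K, nu=2.0, A=1.0e3, mu0=None, Sigma0=None, rng=None):
    """One draw of (zeta, a, Omega, beta_1:N) from equations (5)-(8)."""
    rng = np.random.default_rng() if rng is None else rng
    mu0 = np.zeros(K) if mu0 is None else mu0
    Sigma0 = np.eye(K) if Sigma0 is None else Sigma0
    zeta = rng.multivariate_normal(mu0, Sigma0)                    # eq. (5)
    a = rng.gamma(shape=0.5, scale=A**2, size=K)                   # eq. (6): Gamma(1/2, rate 1/A_k^2)
    Omega = np.atleast_2d(invwishart.rvs(df=nu + K - 1,
                                         scale=2.0 * nu * np.diag(a),
                                         random_state=rng))        # eq. (7)
    beta = rng.multivariate_normal(zeta, Omega, size=N)            # eq. (8)
    return zeta, a, Omega, beta
```

Note that NumPy's `gamma` is parameterized by shape and scale, so the rate $1/A_k^2$ in (6) becomes `scale=A**2`.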
2 Pólya–Gamma data augmentation
The default Gibbs sampler for posterior inference in MMNL models involves Metropolis steps to take draws from the conditional densities of the utility parameters ($\beta_n$ and $\alpha$) due to the unavailability of a conjugate prior for the MNL kernel. MCMC estimation of binary and multinomial logistic regression models encounters a similar non-conjugacy issue. Pólya-Gamma data augmentation (PG-DA) is the state-of-the-art technique for handling non-conjugacy in MCMC estimation of binary logistic regression models (Polson et al., 2013). PG-DA augments the Gibbs sampler with an additional Pólya-Gamma distributed latent variable, which circumvents the need for the Metropolis algorithm by ensuring conjugate updates. Polson et al. (2013) also extend PG-DA to the multinomial logistic regression model. Yet, this extension requires all utility (or link function) parameters to be alternative-specific. We use the same idea in deriving a PG-DA-based Gibbs sampler for MMNL, but we have to impose the same restriction on the utility specification, i.e. replace $\Gamma_n$ by $\Gamma_{nj}$.
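Polson et al. (2013) give an exact alternating-series sampler for the Pólya-Gamma distribution (implemented in their BayesLogit package); a simple, approximate alternative is to truncate the sum-of-gammas representation $\text{PG}(1, c) = \frac{1}{2\pi^2}\sum_{k \geq 1} \frac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)}$ with $g_k \sim \text{Gamma}(1, 1)$. The sketch below uses this truncation and is not the exact sampler:

```python
import numpy as np

def sample_pg1(c, size=1, n_terms=200, rng=None):
    """Approximate PG(1, c) draws via the truncated sum-of-gammas
    representation of Polson et al. (2013).  The truncation introduces a
    small downward bias; the exact alternating-series sampler is
    preferable in practice."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2)
    g = rng.gamma(1.0, 1.0, size=(size, n_terms))  # g_k ~ Gamma(1, 1)
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)
```

A useful check: the mean of $\text{PG}(1, c)$ is $\tanh(c/2)/(2c)$, which the truncated sampler should match closely.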
2.1 Augmented Gibbs Sampler
The representative utility is $V_{ntj} = X_{ntj}\Gamma_{nj} = X_{ntj,F}\,\alpha_j + X_{ntj,R}\,\beta_{nj}$, where $\beta_{nj} \sim N(\zeta_j, \Omega)$. The hyper-parameters remain the same, but the model parameters are $\theta = \{\alpha_{1:J}, \zeta_{1:J}, \Omega, a_{1:K}, \beta_{\{1:N,1:J\}}\}$.
Adhering to the original notation, we can write the joint distribution of the data and the model parameters:

$$P(y_{1:N}, \theta) = P(\Omega \mid \omega, B) \prod_{n=1}^{N} P(y_n \mid X_n, \Gamma_n) \prod_{j=1}^{J} P(\alpha_j \mid \lambda_0, \Xi_0)\,P(\zeta_j \mid \mu_0, \Sigma_0) \prod_{k=1}^{K} P(a_k \mid s, r_k) \prod_{n=1}^{N} \prod_{j=1}^{J} P(\beta_{nj} \mid \zeta_j, \Omega). \quad (10)$$
Algorithm 1 presents the augmented Gibbs sampler for the MMNL model. The conditional densities of $a_{1:K}$, $\Omega$, and $\zeta_{1:J}$ are similar to those of the Allenby-Train procedure (Akinc and Vandebroek, 2018). The next subsection details the derivation of the conditional densities of $\beta_{\{1:N,1:J\}}$ and $\alpha_{1:J}$.
for (iteration in 1 to max-iteration) do
    Sample $a_k$ for all $k$ from $\text{Gamma}\left(\frac{\nu + K}{2}, \; \frac{1}{A_k^2} + \nu\left(\Omega^{-1}\right)_{kk}\right)$;
    Sample $\Omega$ from $\text{IW}\left(\nu + NJ + K - 1, \; 2\nu\,\text{diag}(a) + \sum_{n=1}^{N}\sum_{j=1}^{J}(\beta_{nj} - \zeta_j)(\beta_{nj} - \zeta_j)^\top\right)$;
    for (i in 1 to J) do
        Sample $\zeta_i$ from $N\left(\frac{1}{N}\sum_{n=1}^{N}\beta_{ni}, \; \frac{\Omega}{N}\right)$;
        Sample $\beta_{ni}$ for all $n$ using equation (15);
        Update $\eta_{nti}$ and $L_{nti}$ for all $n, t, i$ using equation (12);
        Sample $\alpha_i$ using equation (16);
        Update $\eta_{nti}$ and $L_{nti}$ for all $n, t, i$ using equation (12);
        Sample $\phi_{nti}$ for all $n, t$ from $\text{PG}(1, \eta_{nti})$;
    end
end
Algorithm 1: Pólya-Gamma augmented Gibbs sampler for the MMNL model
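The conjugate updates for $a$, $\Omega$, and $\zeta$ at the top of Algorithm 1 can be sketched as follows; the array layouts (`beta` as an $N \times J \times K$ array, etc.) are our own illustrative convention:

```python
import numpy as np
from scipy.stats import invwishart

def update_a_omega_zeta(beta, zeta, Omega, nu, A, rng):
    """Conjugate Gibbs updates for a_{1:K}, Omega and zeta_{1:J}.

    beta: (N, J, K) current draws of beta_nj; zeta: (J, K); Omega: (K, K);
    nu and A are the Huang half-t hyper-parameters."""
    N, J, K = beta.shape
    Om_inv = np.linalg.inv(Omega)
    # a_k ~ Gamma((nu + K)/2, rate = 1/A_k^2 + nu * (Omega^-1)_kk)
    a = rng.gamma(shape=(nu + K) / 2.0,
                  scale=1.0 / (1.0 / A ** 2 + nu * np.diag(Om_inv)))
    # Omega ~ IW(nu + N*J + K - 1, 2*nu*diag(a) + sum_nj residual outer products)
    resid = beta - zeta[None, :, :]
    S = np.einsum('njk,njl->kl', resid, resid)
    Omega = np.atleast_2d(invwishart.rvs(df=nu + N * J + K - 1,
                                         scale=2.0 * nu * np.diag(a) + S,
                                         random_state=rng))
    # zeta_j ~ N(mean_n beta_nj, Omega / N)
    zeta = np.stack([rng.multivariate_normal(beta[:, j].mean(axis=0), Omega / N)
                     for j in range(J)])
    return a, Omega, zeta
```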
2.2 Conditional distributions of βn j and αj
Using Holmes et al. (2006), we can convert the multinomial logit likelihood expression into a binary logit likelihood:

$$P(\beta_{nj} \mid y_{1:N}, \theta_{-\beta_{nj}}) \propto P(\beta_{nj} \mid \zeta_j, \Omega) \prod_{t=1}^{T} \left[\frac{\exp(\eta_{ntj})}{1 + \exp(\eta_{ntj})}\right]^{y_{ntj}} \left[\frac{1}{1 + \exp(\eta_{ntj})}\right]^{(1 - y_{ntj})} = P(\beta_{nj} \mid \zeta_j, \Omega) \prod_{t=1}^{T} \frac{\exp(\eta_{ntj})^{y_{ntj}}}{1 + \exp(\eta_{ntj})} \quad (11)$$

where $\theta_{-\beta_{nj}}$ is the parameter vector that results after removing $\beta_{nj}$, and

$$\eta_{ntj} = V_{ntj} - L_{ntj}; \qquad L_{ntj} = \ln\left(\sum_{k=1, k \neq j}^{J} \exp(V_{ntk})\right). \quad (12)$$
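The quantities in (12) can be computed in a vectorised, numerically stable way by working on the log scale; the array layout (last axis indexing the $J$ alternatives) is our own convention:

```python
import numpy as np

def eta_and_L(V):
    """Compute L_ntj = ln(sum_{k != j} exp(V_ntk)) and eta_ntj = V_ntj - L_ntj,
    as in eq. (12).  V: array of representative utilities with the J
    alternatives on the last axis."""
    total = np.logaddexp.reduce(V, axis=-1, keepdims=True)  # ln sum_k exp(V_ntk)
    L = total + np.log1p(-np.exp(V - total))                # subtract alternative j
    return V - L, L
```

Subtracting alternative $j$'s own term on the log scale avoids overflow when utilities are large, which matters given the divergence behaviour discussed in Section 2.3.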
We now introduce a Pólya-Gamma distributed auxiliary variable $\phi_{ntk} \sim \text{PG}(1, 0)$ $\forall\, n, t, k$ and define $\kappa_{ntk} = y_{ntk} - \frac{1}{2}$. Now consider the identity derived by Polson et al. (2013):

$$\frac{\exp(\eta_{ntk})^{y_{ntk}}}{1 + \exp(\eta_{ntk})} = \frac{\exp(\kappa_{ntk}\,\eta_{ntk})}{2} \int_{0}^{\infty} \exp\left(-\frac{\eta_{ntk}^2\,\phi_{ntk}}{2}\right) P(\phi_{ntk})\, d\phi_{ntk}. \quad (13)$$
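Identity (13) can be verified numerically: since the Laplace transform of a $\text{PG}(1, 0)$ variable is $E[\exp(-c^2\phi/2)] = 1/\cosh(c/2)$ (Polson et al., 2013), the right-hand side has a closed form, and the two sides should agree exactly:

```python
import numpy as np

def lhs(eta, y):
    """Left-hand side of eq. (13): the binary-logit likelihood contribution."""
    return np.exp(eta * y) / (1.0 + np.exp(eta))

def rhs(eta, y):
    """Right-hand side of eq. (13), with the PG(1, 0) integral evaluated via
    the Laplace transform E[exp(-eta^2 * phi / 2)] = 1 / cosh(eta / 2)."""
    kappa = y - 0.5
    return 0.5 * np.exp(kappa * eta) / np.cosh(eta / 2.0)
```

Algebraically, $\frac{1}{2} e^{(y - 1/2)\eta} / \cosh(\eta/2) = e^{y\eta}/(1 + e^{\eta})$, which is exactly the binary logit kernel.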
The conditional density of $\beta_{nj}$ is:

$$P(\beta_{nj} \mid y_{1:N}, \theta_{-\beta_{nj}}, \phi) \propto \exp\left(-\frac{1}{2}(\beta_{nj} - \zeta_j)^\top \Omega^{-1} (\beta_{nj} - \zeta_j)\right) \cdot \prod_{t=1}^{T} \exp\left(\kappa_{ntj}\,X_{ntj,R}\,\beta_{nj} - \frac{\left(X_{ntj,F}\,\alpha_j + X_{ntj,R}\,\beta_{nj} - L_{ntj}\right)^2 \phi_{ntj}}{2}\right) \quad (14)$$
The conditional distribution of $\beta_{nj}$ is:

$$\beta_{nj} \mid y_{1:N}, \theta_{-\beta_{nj}}, \phi \sim N\left( \left(\Omega^{-1} + \sum_{t=1}^{T} \phi_{ntj}\,X_{ntj,R}^\top X_{ntj,R}\right)^{-1} \left(\Omega^{-1}\zeta_j + \sum_{t=1}^{T} X_{ntj,R}^\top \left[\kappa_{ntj} - \phi_{ntj}\left(X_{ntj,F}\,\alpha_j - L_{ntj}\right)\right]\right), \; \left(\Omega^{-1} + \sum_{t=1}^{T} \phi_{ntj}\,X_{ntj,R}^\top X_{ntj,R}\right)^{-1} \right) \quad (15)$$
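A draw from (15) for a single $(n, j)$ pair can be sketched as below; the argument names and shapes are our own illustrative convention:

```python
import numpy as np

def draw_beta_nj(XR, XF, alpha_j, L, kappa, phi, zeta_j, Omega_inv, rng):
    """One draw from the Gaussian conditional of beta_nj in eq. (15).

    XR: (T, K) random-taste covariates; XF: (T, Lf) fixed-taste covariates;
    L, kappa, phi: length-T vectors for this (n, j) pair."""
    prec = Omega_inv + XR.T @ (phi[:, None] * XR)          # posterior precision
    cov = np.linalg.inv(prec)                              # posterior covariance
    b = Omega_inv @ zeta_j + XR.T @ (kappa - phi * (XF @ alpha_j - L))
    return rng.multivariate_normal(cov @ b, cov)
```

Since all $\phi_{ntj} > 0$, the precision matrix is positive definite whenever $\Omega^{-1}$ is, so the conditional is a proper Gaussian.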
The conditional density of $\alpha_j$ can be derived similarly:

$$\alpha_j \mid y_{1:N}, \theta_{-\alpha_j}, \phi \sim N\left( \left(\Xi_0^{-1} + \sum_{n=1}^{N}\sum_{t=1}^{T} \phi_{ntj}\,X_{ntj,F}^\top X_{ntj,F}\right)^{-1} \left(\Xi_0^{-1}\lambda_0 + \sum_{n=1}^{N}\sum_{t=1}^{T} X_{ntj,F}^\top \left[\kappa_{ntj} - \phi_{ntj}\left(X_{ntj,R}\,\beta_{nj} - L_{ntj}\right)\right]\right), \; \left(\Xi_0^{-1} + \sum_{n=1}^{N}\sum_{t=1}^{T} \phi_{ntj}\,X_{ntj,F}^\top X_{ntj,F}\right)^{-1} \right) \quad (16)$$
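The $\alpha_j$ update (16) follows the same pattern as (15) but sums over both decision-makers and choice occasions; again the array layouts are our own illustrative convention:

```python
import numpy as np

def draw_alpha_j(XF, XR, beta_j, L, kappa, phi, lam0, Xi0_inv, rng):
    """One draw from the Gaussian conditional of alpha_j in eq. (16).

    XF: (N, T, Lf), XR: (N, T, K), beta_j: (N, K); L, kappa, phi: (N, T)."""
    Xf = XF.reshape(-1, XF.shape[-1])                      # stack the (n, t) index
    w = phi.reshape(-1)
    prec = Xi0_inv + Xf.T @ (w[:, None] * Xf)              # posterior precision
    cov = np.linalg.inv(prec)
    xb = np.einsum('ntk,nk->nt', XR, beta_j)               # X_ntj,R beta_nj per (n, t)
    b = Xi0_inv @ lam0 + Xf.T @ (kappa - phi * (xb - L)).reshape(-1)
    return rng.multivariate_normal(cov @ b, cov)
```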
2.3 Discussion
We test the performance of the PG-DA-based Gibbs sampler against the Metropolis-based Gibbs sampler in a Monte Carlo study. We first consider the MNL model ($\Gamma_{nj} = \alpha_j$), where both samplers perform equally well. For the MMNL model, the posterior estimates of the proposed PG-DA approach and the default Gibbs sampler are similar for $J = 2$, but we encounter an explosion of the conditional distribution parameters in the case of more alternatives ($J \geq 3$). Results for $J = 2$ and MATLAB code are available upon request.
This appears to be an issue of empirical identification caused by the large number of model parameters. Before the PG-DA sampler diverges, the representative utilities are either very small or very large in magnitude for all alternatives across all observations. Since the choice probabilities are determined by the relative rather than the absolute scale of the utilities, the algorithm may have a tendency to increase the relative scale of the latent utilities by inflating the scale of the parameters.
In fact, prior to divergence the probability estimates of the chosen and non-chosen alternatives are close to one and zero, respectively. We speculate that such behavior might be a consequence of the large number of model parameters, which might allow the algorithm to find a parameter configuration that fits the data very well (in terms of choice probabilities). Once the algorithm finds that configuration, it starts to increase the relative scale between the utilities (thus allowing the chosen alternatives to have probabilities close to one), causing the parameter explosion.
As future research, stick-breaking constructions (Linderman et al., 2015; Zhang and Zhou, 2017) can be explored to adopt PG-DA in MCMC estimation of MMNL while keeping a parsimonious model specification, i.e. with generic utility parameters. However, before adopting these constructions, their consistency with microeconomic theory needs to be established.
References
Akinc, D. and Vandebroek, M. (2018). Bayesian estimation of mixed logit models: Selecting an appropriate prior for the covariance matrix. Journal of Choice Modelling, 29:133–151.
Holmes, C. C., Held, L., et al. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145–168.
Huang, A. and Wand, M. P. (2013). Simple marginally noninformative prior distributions for covariance matrices. Bayesian Analysis, 8(2):439–452.
Linderman, S., Johnson, M., and Adams, R. P. (2015). Dependent multinomial models made easy: Stick-breaking with the Pólya-Gamma augmentation. In Advances in Neural Information Processing Systems, pages 3456–3464.
McFadden, D. and Train, K. (2000). Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5):447–470.
Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349.
Tan, L. S. L. (2017). Stochastic variational inference for large-scale discrete choice models using adaptive batch sizes. Statistics and Computing, 27(1):237–257.
Train, K. E. (2009). Discrete Choice Methods with Simulation. Cambridge University Press, 2nd edition.
Zhang, Q. and Zhou, M. (2017). Permuted and augmented stick-breaking Bayesian multinomial regression. The Journal of Machine Learning Research, 18(1):7479–7511.