Pólya-Gamma Data Augmentation to address Non-conjugacy in the
Bayesian Estimation of Mixed Multinomial Logit Models
April 13, 2019
Prateek Bansal*
School of Civil and Environmental Engineering
Cornell University, United States
pb422@cornell.edu
Rico Krueger*
Research Centre for Integrated Transport Innovation, School of Civil and Environmental Engineering,
UNSW Australia, Sydney NSW 2052, Australia
r.krueger@student.unsw.edu.au
Michel Bierlaire
Transport and Mobility Laboratory, School of Architecture, Civil and Environmental Engineering,
Ecole Polytechnique Fédérale de Lausanne, Station 18, Lausanne 1015, Switzerland
michel.bierlaire@epfl.ch
Ricardo A. Daziano
School of Civil and Environmental Engineering
Cornell University, United States
daziano@cornell.edu
Taha H. Rashidi
Research Centre for Integrated Transport Innovation, School of Civil and Environmental Engineering,
UNSW Australia, Sydney NSW 2052, Australia
rashidi@unsw.edu.au
*These authors contributed equally to this work.
arXiv:1904.07688v1 [stat.ML] 13 Apr 2019
Abstract
The standard Gibbs sampler for Mixed Multinomial Logit (MMNL) models involves sampling from the conditional densities of the utility parameters using the Metropolis-Hastings (MH) algorithm, due to the unavailability of a conjugate prior for the logit kernel. To address this non-conjugacy concern, we propose the application of the Pólya-Gamma data augmentation (PG-DA) technique to MMNL estimation. The posterior estimates of the augmented and the default Gibbs sampler are similar for the two-alternative scenario (binary choice), but we encounter empirical identification issues in the case of more alternatives ($J \geq 3$).
1 Mixed multinomial logit model
The mixed multinomial logit (MMNL) model (McFadden and Train, 2000) is established as follows: We consider a standard discrete choice setup, in which on choice occasion $t \in \{1, \dots, T\}$, a decision-maker $n \in \{1, \dots, N\}$ derives utility $U_{ntj} = V(X_{ntj}, \Gamma_n) + \varepsilon_{ntj}$ from alternative $j \in \{1, \dots, J\}$. Here, $V()$ denotes the representative utility, $X_{ntj}$ is a row-vector of covariates, $\Gamma_n$ is a collection of taste parameters, and $\varepsilon_{ntj}$ is a stochastic disturbance. The assumption $\varepsilon_{ntj} \sim \text{Gumbel}(0,1)$ leads to a multinomial logit (MNL) kernel such that the probability that decision-maker $n$ chooses alternative $j$ on choice occasion $t$ is

$$P(y_{nt} = j \mid X_{nt}, \Gamma_n) = \frac{\exp\{V(X_{ntj}, \Gamma_n)\}}{\sum_{k=1}^{J} \exp\{V(X_{ntk}, \Gamma_n)\}}, \qquad (1)$$
where $y_{nt}$ captures the observed choice. The choice probability can be iterated over choice scenarios to obtain the probability of observing a decision-maker's sequence of choices $y_n$:

$$P(y_n \mid X_n, \Gamma_n) = \prod_{t=1}^{T} P(y_{nt} = j \mid X_{nt}, \Gamma_n). \qquad (2)$$
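For concreteness, equations (1) and (2) can be sketched numerically. The dimensions, random utilities, and choice sequence below are illustrative placeholders, not quantities from the paper:

```python
import numpy as np

def mnl_probs(V):
    """MNL choice probabilities, eq. (1): row-wise softmax of the
    representative utilities V with shape (T, J), stabilized by
    subtracting the row maximum before exponentiating."""
    V = V - V.max(axis=1, keepdims=True)
    expV = np.exp(V)
    return expV / expV.sum(axis=1, keepdims=True)

def sequence_prob(V, y):
    """Probability of a decision-maker's observed choice sequence,
    eq. (2): product over choice occasions t of P(y_nt | .)."""
    P = mnl_probs(V)
    T = V.shape[0]
    return np.prod(P[np.arange(T), y])

# illustrative example: T = 3 choice occasions, J = 2 alternatives
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 2))
y = np.array([0, 1, 0])   # hypothetical observed choices
p = sequence_prob(V, y)
```

Because the kernel is a softmax, adding a common constant to all utilities in a choice set leaves the probabilities in (1) and hence the sequence probability in (2) unchanged; only utility differences matter.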
We consider a general utility specification under which tastes $\Gamma_n$ are partitioned into fixed taste parameters $\alpha$, which are invariant across decision-makers, and random taste parameters $\beta_n$, which are individual-specific, such that $\Gamma_n = (\alpha^\top \; \beta_n^\top)^\top$, whereby $\alpha$ and $\beta_n$ are vectors of lengths $L$ and $K$, respectively. Analogously, the row-vector of covariates $X_{ntj}$ is partitioned into attributes $X_{ntj,F}$, which pertain to the fixed parameters $\alpha$, as well as into attributes $X_{ntj,R}$, which pertain to the individual-specific parameters $\beta_n$, such that $X_{ntj} = (X_{ntj,F} \; X_{ntj,R})$. For simplicity, we assume that the representative utility is linear-in-parameters, i.e.

$$V(X_{ntj}, \Gamma_n) = X_{ntj} \Gamma_n = X_{ntj,F} \alpha + X_{ntj,R} \beta_n. \qquad (3)$$
The distribution of tastes $\beta_{1:N}$ is assumed to be multivariate normal, i.e. $\beta_n \sim N(\zeta, \Omega)$ for $n = 1, \dots, N$, where $\zeta$ is a mean vector and $\Omega$ is a covariance matrix. In a fully Bayesian setup, the invariant (across individuals) parameters $\alpha$, $\zeta$, $\Omega$ are also considered to be random parameters and are thus given priors. We use normal priors for the fixed parameters $\alpha$ and for the mean vector $\zeta$. Following Tan (2017) and Akinc and Vandebroek (2018), we employ Huang's half-t prior (Huang and Wand, 2013) for the covariance matrix $\Omega$, as this prior specification exhibits superior noninformativity properties compared to other prior specifications for covariance matrices. In particular, Akinc and Vandebroek (2018) show that Huang's half-t prior outperforms the inverse Wishart prior, which is often employed in fully Bayesian specifications of MMNL models (e.g. Train, 2009), in terms of parameter recovery.
Stated succinctly, the generative process of the fully Bayesian MMNL model is:
$$\alpha \mid \lambda_0, \Xi_0 \sim N(\lambda_0, \Xi_0) \qquad (4)$$
$$\zeta \mid \mu_0, \Sigma_0 \sim N(\mu_0, \Sigma_0) \qquad (5)$$
$$a_k \mid A_k \sim \text{Gamma}\left(\tfrac{1}{2}, \tfrac{1}{A_k^2}\right), \quad k = 1, \dots, K, \qquad (6)$$
$$\Omega \mid \nu, a \sim \text{IW}\left(\nu + K - 1, \; 2\nu \, \text{diag}(a)\right), \quad a = (a_1 \; \dots \; a_K)^\top \qquad (7)$$
$$\beta_n \mid \zeta, \Omega \sim N(\zeta, \Omega), \quad n = 1, \dots, N, \qquad (8)$$
$$y_{nt} \mid \alpha, \beta_n, X_{nt} \sim \text{MNL}(\alpha, \beta_n, X_{nt}), \quad n = 1, \dots, N, \; t = 1, \dots, T, \qquad (9)$$
where (6) and (7) induce Huang's half-t prior (Huang and Wand, 2013). $\{\lambda_0, \Xi_0, \mu_0, \Sigma_0, \nu, A_{1:K}\}$ are known hyper-parameters, and $\theta = \{\alpha, \zeta, \Omega, a, \beta_{1:N}\}$ is a collection of model parameters whose posterior distribution we wish to estimate.
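The generative process (4)–(9) can be simulated forward in a few lines. The dimensions, hyper-parameter values, and the integer-degrees-of-freedom inverse-Wishart draw below are illustrative assumptions for a sketch, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative dimensions and hyper-parameters (assumptions)
N, T, J, L, K = 50, 5, 2, 1, 2
nu = 2.0
A = np.ones(K)                        # A_1:K of eq. (6)
lam0, Xi0 = np.zeros(L), np.eye(L)    # hyper-parameters of eq. (4)
mu0, Sig0 = np.zeros(K), np.eye(K)    # hyper-parameters of eq. (5)

def rinvwishart(df, scale, rng):
    """Inverse-Wishart(df, scale) draw via an integer-df Wishart of the
    inverted scale matrix (sum of outer products of normal vectors)."""
    Z = rng.multivariate_normal(np.zeros(scale.shape[0]),
                                np.linalg.inv(scale), size=int(df))
    return np.linalg.inv(Z.T @ Z)

alpha = rng.multivariate_normal(lam0, Xi0)                    # eq. (4)
zeta = rng.multivariate_normal(mu0, Sig0)                     # eq. (5)
a = rng.gamma(shape=0.5, scale=A ** 2)                        # eq. (6), rate 1/A_k^2
Omega = rinvwishart(nu + K - 1, 2.0 * nu * np.diag(a), rng)   # eq. (7)
beta = rng.multivariate_normal(zeta, Omega, size=N)           # eq. (8)

# eq. (9): representative utilities, eq. (3), and MNL choices
X_F = rng.normal(size=(N, T, J, L))
X_R = rng.normal(size=(N, T, J, K))
V = (np.einsum('ntjl,l->ntj', X_F, alpha)
     + np.einsum('ntjk,nk->ntj', X_R, beta))
expV = np.exp(V - V.max(axis=2, keepdims=True))
P = expV / expV.sum(axis=2, keepdims=True)
y = np.array([[rng.choice(J, p=P[n, t]) for t in range(T)]
              for n in range(N)])
```

Such a forward simulation is also the usual way to generate synthetic data for the Monte Carlo comparison of samplers discussed later.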
2 Pólya–Gamma data augmentation
The default Gibbs sampler for posterior inference in MMNL models involves Metropolis steps to take draws from the conditional densities of the utility parameters ($\beta_n$ and $\alpha$) due to the unavailability of a conjugate prior for the MNL kernel. MCMC estimation of binary and multinomial logistic regression models encounters a similar issue of non-conjugacy. Pólya-Gamma data augmentation (PG-DA) is the state-of-the-art technique to handle non-conjugacy in the MCMC estimation of binary logistic regression models (Polson et al., 2013). PG-DA augments the Gibbs sampler by introducing an additional Pólya-Gamma distributed latent variable, which circumvents the need for the Metropolis algorithm by ensuring conjugate updates. Polson et al. (2013) also extend PG-DA to the multinomial logistic regression model. Yet, this extension requires all utility (or link function) parameters to be alternative-specific. We use the same idea in deriving a PG-DA-based Gibbs sampler for MMNL, but we have to impose the same restriction on the utility specification, i.e. replace $\Gamma_n$ by $\Gamma_{nj}$.
2.1 Augmented Gibbs Sampler
The representative utility is: $V_{ntj} = X_{ntj} \Gamma_{nj} = X_{ntj,F} \alpha_j + X_{ntj,R} \beta_{nj}$, where $\beta_{nj} \sim N(\zeta_j, \Omega)$. The hyper-parameters remain the same, but the model parameters are $\theta = \{\alpha_{1:J}, \zeta_{1:J}, \Omega, a_{1:K}, \beta_{\{1:N,1:J\}}\}$.
Adhering to the original notation, we can write the joint distribution of the data and the model
parameters:
$$P(y_{1:N}, \theta) = P(\Omega \mid \omega, B) \prod_{n=1}^{N} P(y_n \mid X_n, \Gamma_n) \prod_{j=1}^{J} P(\alpha_j \mid \lambda_0, \Xi_0) \, P(\zeta_j \mid \mu_0, \Sigma_0) \prod_{k=1}^{K} P(a_k \mid s, r_k) \prod_{n=1}^{N} \prod_{j=1}^{J} P(\beta_{nj} \mid \zeta_j, \Omega). \qquad (10)$$
Algorithm 1 presents the augmented Gibbs sampler for the MMNL model. The conditional densities of $a_{1:K}$, $\Omega$, and $\zeta_{1:J}$ are similar to those of the Allenby-Train procedure (Akinc and Vandebroek, 2018). The next subsection details the derivation of the conditional densities of $\beta_{\{1:N,1:J\}}$ and $\alpha_{1:J}$.
for (iteration in 1 to max-iteration) do
    Sample $a_k$ for all $k$ from $\text{Gamma}\left(\frac{\nu + K}{2}, \; \frac{1}{A_k^2} + \nu \left(\Omega^{-1}\right)_{kk}\right)$;
    Sample $\Omega$ from $\text{IW}\left(\nu + NJ + K - 1, \; 2\nu \, \text{diag}(a) + \sum_{n=1}^{N} \sum_{j=1}^{J} (\beta_{nj} - \zeta_j)(\beta_{nj} - \zeta_j)^\top\right)$;
    for (i in 1 to J) do
        Sample $\zeta_i$ from $N\left(\frac{1}{N} \sum_{n=1}^{N} \beta_{ni}, \; \frac{\Omega}{N}\right)$;
        Sample $\beta_{ni}$ for all $n$ using equation (15);
        Update $\eta_{nti}$ and $L_{nti}$ for all $n$, $t$ using equation (12);
        Sample $\alpha_i$ using equation (16);
        Update $\eta_{nti}$ and $L_{nti}$ for all $n$, $t$ using equation (12);
        Sample $\phi_{nti}$ for all $n$, $t$ from $\text{PG}(1, \eta_{nti})$;
    end
end
Algorithm 1: Pólya-Gamma augmented Gibbs sampler for the MMNL model
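Algorithm 1 requires draws $\phi \sim \text{PG}(1, \eta)$. One simple, approximate way to obtain them is to truncate the infinite-sum representation of the Pólya-Gamma distribution given by Polson et al. (2013); the truncation level below is an illustrative assumption (exact samplers exist, e.g. in Polson et al.'s accompanying software):

```python
import numpy as np

def rpolyagamma(b, z, size, rng, n_terms=200):
    """Approximate PG(b, z) draws by truncating the sum representation
    PG(b, z) = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + z^2 / (4 pi^2)),
    with g_k ~ Gamma(b, 1) independent (Polson et al., 2013)."""
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + z ** 2 / (4.0 * np.pi ** 2)
    g = rng.gamma(shape=b, scale=1.0, size=(size, n_terms))
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

rng = np.random.default_rng(2)
z = 1.5
draws = rpolyagamma(1.0, z, size=200_000, rng=rng)
# sanity check against the known mean E[PG(1, z)] = tanh(z / 2) / (2 z)
```

Truncation biases the draws slightly downward, but with a few hundred terms the bias is far below Monte Carlo error for the scales arising here.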
2.2 Conditional distributions of $\beta_{nj}$ and $\alpha_j$
Using Holmes et al. (2006), we can convert the multinomial logit likelihood expression to the binary logit likelihood:

$$P(\beta_{nj} \mid y_{1:N}, \theta_{-\beta_{nj}}) \propto P(\beta_{nj} \mid \zeta_j, \Omega) \prod_{t=1}^{T} \left(\frac{\exp(\eta_{ntj})}{1 + \exp(\eta_{ntj})}\right)^{y_{ntj}} \left(\frac{1}{1 + \exp(\eta_{ntj})}\right)^{(1 - y_{ntj})} \propto P(\beta_{nj} \mid \zeta_j, \Omega) \prod_{t=1}^{T} \frac{\exp(\eta_{ntj})^{y_{ntj}}}{1 + \exp(\eta_{ntj})} \qquad (11)$$

where $\theta_{-\beta_{nj}}$ is the resulting parameter vector after removing $\beta_{nj}$ and

$$\eta_{ntj} = V_{ntj} - L_{ntj}; \qquad L_{ntj} = \ln\left(\sum_{k=1, k \neq j}^{J} \exp(V_{ntk})\right) \qquad (12)$$
We now introduce a Pólya-Gamma distributed auxiliary variable $\phi_{ntk} \sim \text{PG}(1, 0)$ $\forall n, t, k$ and $\kappa_{ntk} = y_{ntk} - \frac{1}{2}$. Now consider the identity derived by Polson et al. (2013):

$$\frac{\exp(\eta_{ntk})^{y_{ntk}}}{1 + \exp(\eta_{ntk})} = \frac{\exp(\kappa_{ntk} \eta_{ntk})}{2} \int_0^\infty \exp\left(-\frac{\eta_{ntk}^2 \phi_{ntk}}{2}\right) P(\phi_{ntk}) \, d\phi_{ntk} \qquad (13)$$
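Identity (13) can be verified numerically: for $\phi \sim \text{PG}(1,0)$ the integral is the Laplace transform $E[\exp(-\eta^2 \phi / 2)] = 1/\cosh(\eta/2)$ (Polson et al., 2013), which reduces the right-hand side to a closed form. The grid of $\eta$ values below is an arbitrary illustration:

```python
import numpy as np

def lhs(eta, y):
    """Left-hand side of identity (13): the binary logit likelihood term."""
    return np.exp(eta * y) / (1.0 + np.exp(eta))

def rhs(eta, y):
    """Right-hand side of (13) with the PG(1, 0) Laplace transform
    E[exp(-eta^2 phi / 2)] = 1 / cosh(eta / 2) substituted in."""
    kappa = y - 0.5
    return np.exp(kappa * eta) / (2.0 * np.cosh(eta / 2.0))

etas = np.linspace(-4.0, 4.0, 101)
```

The two sides agree for both $y = 0$ and $y = 1$, which is exactly why conditioning on $\phi$ turns the logit likelihood into a Gaussian kernel in $\eta$.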
The conditional density of $\beta_{nj}$ is:

$$P(\beta_{nj} \mid y_{1:N}, \theta_{-\beta_{nj}}, \phi) \propto \exp\left(-\frac{1}{2} (\beta_{nj} - \zeta_j)^\top \Omega^{-1} (\beta_{nj} - \zeta_j)\right) \prod_{t=1}^{T} \exp\left(\kappa_{ntj} X_{ntj,R} \beta_{nj} - \frac{\left(X_{ntj,F} \alpha_j + X_{ntj,R} \beta_{nj} - L_{ntj}\right)^2 \phi_{ntj}}{2}\right) \qquad (14)$$
The conditional distribution of $\beta_{nj}$ is:

$$\beta_{nj} \mid y_{1:N}, \theta_{-\beta_{nj}}, \phi \sim N\left(\left(\Omega^{-1} + \sum_{t=1}^{T} \phi_{ntj} X_{ntj,R}^\top X_{ntj,R}\right)^{-1} \left(\Omega^{-1} \zeta_j + \sum_{t=1}^{T} X_{ntj,R}^\top \left(\kappa_{ntj} - \phi_{ntj} \left(X_{ntj,F} \alpha_j - L_{ntj}\right)\right)\right), \; \left(\Omega^{-1} + \sum_{t=1}^{T} \phi_{ntj} X_{ntj,R}^\top X_{ntj,R}\right)^{-1}\right) \qquad (15)$$
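The update (15) is a standard Gaussian conjugate draw. A sketch for a single decision-maker $n$ and alternative $j$ might look as follows; all inputs here are illustrative random placeholders (in the sampler, $\phi_{ntj}$ would be actual PG(1, $\eta_{ntj}$) draws):

```python
import numpy as np

def draw_beta_nj(X_R, X_F, alpha_j, L_nj, kappa_nj, phi_nj,
                 zeta_j, Omega_inv, rng):
    """One draw from the conditional in eq. (15) for a single (n, j).
    X_R: (T, K) random-taste covariates, X_F: (T, L) fixed covariates,
    L_nj: (T,) log-sum terms, kappa_nj, phi_nj: (T,) PG quantities."""
    # posterior precision: Omega^{-1} + sum_t phi_t x_R^T x_R
    prec = Omega_inv + (X_R * phi_nj[:, None]).T @ X_R
    cov = np.linalg.inv(prec)
    cov = 0.5 * (cov + cov.T)  # symmetrize against round-off
    # posterior mean: cov (Omega^{-1} zeta + sum_t x_R^T (kappa - phi (x_F alpha - L)))
    resid = kappa_nj - phi_nj * (X_F @ alpha_j - L_nj)
    mean = cov @ (Omega_inv @ zeta_j + X_R.T @ resid)
    return rng.multivariate_normal(mean, cov), mean, cov

rng = np.random.default_rng(3)
T, K, L = 5, 2, 1
X_R = rng.normal(size=(T, K)); X_F = rng.normal(size=(T, L))
alpha_j = rng.normal(size=L); L_nj = rng.normal(size=T)
kappa_nj = rng.choice([-0.5, 0.5], size=T)     # y - 1/2
phi_nj = rng.gamma(1.0, 0.25, size=T)          # placeholder for PG(1, eta) draws
zeta_j = np.zeros(K); Omega_inv = np.eye(K)
beta_draw, mean, cov = draw_beta_nj(X_R, X_F, alpha_j, L_nj, kappa_nj,
                                    phi_nj, zeta_j, Omega_inv, rng)
```

The draw for $\alpha_j$ in equation (16) has exactly the same precision-weighted form, with $X_{ntj,F}$, $\Xi_0^{-1}$, $\lambda_0$, and a double sum over $n$ and $t$ in place of the corresponding quantities here.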
The conditional density of $\alpha_j$ can be derived similarly:

$$\alpha_j \mid y_{1:N}, \theta_{-\alpha_j}, \phi \sim N\left(\left(\Xi_0^{-1} + \sum_{n=1}^{N} \sum_{t=1}^{T} \phi_{ntj} X_{ntj,F}^\top X_{ntj,F}\right)^{-1} \left(\Xi_0^{-1} \lambda_0 + \sum_{n=1}^{N} \sum_{t=1}^{T} X_{ntj,F}^\top \left(\kappa_{ntj} - \phi_{ntj} \left(X_{ntj,R} \beta_{nj} - L_{ntj}\right)\right)\right), \; \left(\Xi_0^{-1} + \sum_{n=1}^{N} \sum_{t=1}^{T} \phi_{ntj} X_{ntj,F}^\top X_{ntj,F}\right)^{-1}\right) \qquad (16)$$
2.3 Discussion
We test the performance of the PG-DA-based Gibbs sampler against the Metropolis-based Gibbs sampler in a Monte Carlo study. We first consider the MNL model ($\Gamma_{nj} = \alpha_j$), where both samplers perform equally well. For the MMNL model, the posterior estimates of the proposed PG-DA approach and the default Gibbs sampler are similar for $J = 2$, but we encounter an explosion of the conditional distribution parameters in the case of more alternatives ($J \geq 3$). Results for $J = 2$ and MATLAB code are available upon request.
This appears to be an issue of empirical identification caused by the large number of model parameters. Before the PG-DA sampler diverges, the representative utilities are either very small or very large in magnitude for all alternatives across all observations. Therefore, instead of the actual magnitudes of the utilities, their relative scales determine the choice probabilities. Thus, the algorithm might have a tendency to increase the relative scale of the latent utilities by increasing the scale of the parameters.
In fact, prior to divergence the probability estimates of the chosen and non-chosen alternatives are close to one and zero, respectively. We speculate that such behavior might be a consequence of the large number of model parameters, which might allow the algorithm to find a parameter configuration that fits the data very well (in terms of choice probabilities). Once the algorithm finds that configuration, it starts to increase the relative scale between the utilities (thus allowing the chosen alternatives to have probability close to one), causing the parameter explosion.
As future research, stick-breaking constructions can be explored to adopt PG-DA in the MCMC estimation of MMNL models while keeping a parsimonious model specification, i.e. with generic utility parameters (Linderman et al., 2015; Zhang and Zhou, 2017). However, before adopting these constructions, their consistency with microeconomic theory needs to be established.
References
Akinc, D. and Vandebroek, M. (2018). Bayesian estimation of mixed logit models: Selecting an appropriate prior for the covariance matrix. Journal of Choice Modelling, 29:133–151.
Holmes, C. C., Held, L., et al. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145–168.
Huang, A. and Wand, M. P. (2013). Simple marginally noninformative prior distributions for covariance matrices. Bayesian Analysis, 8(2):439–452.
Linderman, S., Johnson, M., and Adams, R. P. (2015). Dependent multinomial models made easy: Stick-breaking with the Pólya-Gamma augmentation. In Advances in Neural Information Processing Systems, pages 3456–3464.
McFadden, D. and Train, K. (2000). Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5):447–470.
Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349.
Tan, L. S. L. (2017). Stochastic variational inference for large-scale discrete choice models using adaptive batch sizes. Statistics and Computing, 27(1):237–257.
Train, K. E. (2009). Discrete Choice Methods with Simulation. Cambridge University Press, 2nd edition.
Zhang, Q. and Zhou, M. (2017). Permuted and augmented stick-breaking Bayesian multinomial regression. The Journal of Machine Learning Research, 18(1):7479–7511.