Supplementary materials for this article are available at https://doi.org/10.1007/s13253-019-00370-6.
Post-stratified Probability-Proportional-to-Size
Sampling from Stratified Populations
Omer Ozturk
This paper develops statistical inference based on a post-stratified
probability-proportional-to-size (pp) sample from a finite population. A pp sample selects the sample
units with selection probabilities proportional to their size and measures them for the
characteristic of interest. For each measured unit, the pp sample further creates position
information (rank) in a comparison set of size M. The sample is then post-stratified
into ranking classes based on this position information in the comparison set. A pp
sample is extended to stratified populations by selecting a pp sample from each stratum
population to form the stratified pp sample. Using this stratified pp sample, we construct
unbiased and Rao–Blackwell estimators for the mean of the stratified populations.
Different sample size allocation procedures for stratum sample sizes are investigated.
The new sampling design is applied to apple production data to estimate the total apple
production in Turkey.
Supplementary materials accompanying this paper appear online.
Key Words: Rao–Blackwell estimator; Stratified sampling; Post-stratified sample;
Neyman allocation; Probability-proportional-to-size sampling.
1. INTRODUCTION
In settings where the characteristic of interest Y is approximately proportional to a known
positive auxiliary variable X, probability-proportional-to-size (pps) sampling is
preferable to simple random sampling. In a pps sample, sample units are selected with
probability proportional to the size of X, which gives important (large) units in the
population a higher chance of being included in the sample. Since the X-variable is approximately
proportional to the Y-variable and its values are available for all population units, it also provides
information on the relative position (rank) of the Y-value of a unit in a comparison set. This
position information can be used to induce more structure in the sample by post-stratifying
the sample into different ranking groups.
O. Ozturk, Department of Statistics, The Ohio State University, 1958 Neil Avenue, Columbus, OH 43210,
USA (E-mail: omer@stat.osu.edu).
© 2019 International Biometric Society
Journal of Agricultural, Biological, and Environmental Statistics, Volume 24, Number 4, Pages 693–718
https://doi.org/10.1007/s13253-019-00370-6
Table 1. Population characteristics of the apple production data (in 1000 kg). θ_l, τ_l, N_l and ρ_l are the mean,
standard deviation, population size and correlation coefficient between the X- and Y-variables for
stratum l, respectively.
Strata (l)                      θ_l        τ_l         N_l    ρ_l
Marmara (l=1)                   1536.8     6425        106    0.816
Aegean (l=2)                    2233.7     11,604.9    105    0.856
Mediterranean (l=3)             9384.31    29,907.5    94     0.901
Black Sea (l=4)                 967        2389.7      204    0.713
Central Anatolia (l=5)          5588       28,643.4    171    0.986
Eastern Anatolia (l=6)          631.4      1171        103    0.885
Southeastern Anatolia (l=7)     72.4       111.3       68     0.917
All regions combined            2940.456   17,135      851    0.916
One natural setting for post-stratified pps sampling is given in Kadilar and Cingi (2003)
for the estimation of apple production. The population consists of apple-producing localities
in Turkey. Apples are produced in seven geographical regions of Turkey: Marmara,
Aegean, Mediterranean, Central Anatolia, Black Sea, Eastern Anatolia and Southeastern
Anatolia. These regions have different weather patterns, and apple production
varies from one region to another. The Turkish Statistical Institute collected data from these
regions to estimate the total apple production in 2002. The data set contains two variables,
apple production (Y, in 1000 kg) and the number of apple trees (X), for each locality in each
region. The values of the X-variable are available for all population units in the database prior to
sampling. The entire population consists of seven subpopulations and thus has a stratified population
structure. The main characteristics of the population are presented in Table 1.
The entire population with all regions combined contains 851 units (townships). The
correlation coefficient between X and Y is 0.916. The population means (standard deviations)
of the X- and Y-variables are 37,732.667 (145,031.7) and 2940.456 (17,135), respectively. It
is clear from Table 1 that the stratum populations have different means and variances. Hence,
there are large within- and between-strata variations. It is reasonable to assume that apple
production Y is approximately proportional to the number of apple trees X in each locality,
so the use of a pps sample is appropriate. In addition to the structure imposed by
unequal probability sampling, the number of apple trees (X) can provide information about
the relative position of the apple production (Y) of each sampling unit in the pps sample
among a small comparison set of localities.
All 851 localities in the apple production data serve as our finite population in this
paper. We select a pps sample with selection probabilities proportional to the number of
apple trees and measure the apple production. For each unit in the pps sample, we select
another pps sample to construct a comparison set and determine the rank of the measured
unit, without measurement, using the X-variable. The pps sample is then post-stratified based
on the ranking information from the comparison sets. The theory of stratified sampling
suggests that the post-stratified pps sample improves on the initial pps sample. This paper
provides a foundation for such a claim.
Position information has been used successfully in ranked set sampling (rss) and judgment
post-stratified sampling (jps) designs. Both of these designs determine the rank of each
measured unit in a comparison set of size M. In addition to the information it carries
itself, each unit in these designs provides information, through its rank, about the other
unmeasured units in the comparison set. Since the construction of the comparison set and
the ranking of the units in it come at no additional cost, this additional information is
essentially free.
For the construction of an rss sample of size n, we first need to determine the cycle
size d and set size M, with n = dM. We then select nM localities from the apple production
population and partition them into n disjoint comparison sets, each having M units. Units
in each comparison set are ranked without measurement using the X-variable, and the value
of Y associated with the hth ranked X, $Y_{[h]j}$, is measured in d different comparison sets,
h = 1, ..., M. The measured values $Y_{[h]j}$; h = 1, ..., M; j = 1, ..., d, are called a ranked
set sample.
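The ranked set sampling construction above can be sketched in a few lines of Python. This is only an illustration: the function name `ranked_set_sample`, the simulated values of x and y, and the use of simple random sampling without replacement to draw the nM units are assumptions that do not come from the paper.

```python
import random

def ranked_set_sample(x, y, d, M, rng):
    """Return a ranked set sample as (rank h, cycle j, measured y) triples.

    Ranking inside each comparison set uses x only; y is read for a single
    unit per set, mimicking ranking without measurement.
    """
    n = d * M
    # Assumption: the n*M units are drawn by simple random sampling without
    # replacement, so the n comparison sets are disjoint.
    units = rng.sample(range(len(x)), n * M)
    sample = []
    for s in range(n):
        comparison_set = units[s * M:(s + 1) * M]
        ranked = sorted(comparison_set, key=lambda u: x[u])
        h = s % M + 1    # rank measured in this set (1..M)
        j = s // M + 1   # cycle index (1..d)
        sample.append((h, j, y[ranked[h - 1]]))
    return sample

rng = random.Random(2019)
x = [rng.expovariate(1 / 100) for _ in range(200)]   # hypothetical size variable
y = [15 * xi + rng.gauss(0, 10) for xi in x]          # hypothetical response
rss = ranked_set_sample(x, y, d=2, M=3, rng=rng)
```

Each rank h = 1, ..., M is measured exactly d times, which is what distinguishes this design from the jps design, where the number of units per rank is random.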
For the construction of a jps sample of size n, we start with a simple random sample of
size n and measure all of its units, $Y_i$; i = 1, ..., n. For each measured unit in this sample, we
select an additional M−1 units from the population to form a comparison set of size M. The
rank $R_i$ of the measured unit in each of these comparison sets is determined. The pairs
$(Y_i, R_i)$, i = 1, ..., n, constitute a jps sample.
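A minimal Python sketch of the jps construction follows. The function name `jps_sample` and the simulated population are hypothetical, and drawing the M−1 extra units without replacement from the remaining units is an assumption, since the text does not specify it.

```python
import random

def jps_sample(x, y, n, M, rng):
    """Return a judgment post-stratified sample as (Y_i, R_i) pairs."""
    N = len(x)
    measured = rng.sample(range(N), n)   # simple random sample, all units measured
    pairs = []
    for u in measured:
        # Assumption: the M-1 extra units are drawn without replacement
        # from the units other than u.
        candidates = [v for v in range(N) if v != u]
        others = rng.sample(candidates, M - 1)
        rank = 1 + sum(x[v] < x[u] for v in others)   # rank of u by X in its set
        pairs.append((y[u], rank))
    return pairs

rng = random.Random(7)
x = [rng.expovariate(1 / 100) for _ in range(100)]   # hypothetical size variable
y = [15 * xi + rng.gauss(0, 10) for xi in x]          # hypothetical response
jps = jps_sample(x, y, n=10, M=3, rng=rng)
```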
The rss sampling design has attracted considerable interest among researchers in the finite
population setting. Patil et al. (1995) used a ranked set sample to estimate the population mean
for a population of size N when the sample is constructed without replacement. Deshpande
et al. (2006) described three different sampling designs and constructed nonparametric
confidence intervals for population quantiles. Al-Saleh and Samawi (2007), Ozdemir and
Gokpinar (2007, 2008), Gokpinar and Ozdemir (2010), Ozturk and Jafari Jozani (2013),
Frey (2011) and Ozturk (2014a, 2016a) computed inclusion probabilities and constructed
Horvitz–Thompson-type estimators for the population mean and total based on a ranked set
sample. Ozturk and Bayramoglu Kavlak (2018, 2019) developed statistical inference based
on a superpopulation model using ranked set sample data. These papers show that the
rss design yields a substantial improvement in efficiency over the usual simple
random sampling design.
The jps sampling design was originally developed for an infinite population setting in
MacEachern et al. (2004). In recent years, considerable attention has been given to
jps sampling. Wang et al. (2006) developed a class of estimators for the population
mean using the concomitants of multivariate order statistics. Wang et al. (2008) imposed a
stochastic ordering constraint among judgment classes to improve the efficiency of the estimator of
the population mean. Frey and Ozturk (2011) replaced the stochastic ordering constraint with a
weaker ordering condition in which the judgment class cumulative distribution functions (cdfs)
can be no more extreme than the cdfs of the true order statistics. In a follow-up paper, Frey
(2012) combined this weaker ordering condition with the stochastic ordering constraint to
construct a better estimator for the population mean. Frey and Feeman (2012, 2013) constructed
optimal estimators within a class of unbiased estimators for the population mean and variance.
In the finite population setting, Ozturk (2016b) developed estimators for the population mean based
on a jps sample and showed that the estimator needs a finite population correction
factor similar to the one used in simple random sampling.
In this paper, we look at the jps sampling design from a different perspective. We construct
the jps sample using a probability-proportional-to-size sampling design. We first construct
a pps sample from a finite population. This pps sample is then post-stratified based on
the relative positions of its units in comparison sets. Even though it may be possible to construct a
different, presumably more efficient, type of estimator based on the full covariate information,
it is not considered in this paper. Section 2 introduces the post-stratified pps (pp) sample
in a finite population setting and constructs unbiased estimators for the population mean
and its variance. Section 3 constructs a Rao–Blackwell estimator by conditioning on the
measured values of the pps sample. Section 4 extends pp sampling to a stratified population,
constructs unbiased and Rao–Blackwell estimators for the population mean, and
considers four different sample size allocation procedures to minimize the variance of the
estimator under a cost model and different stratum population structures. Section 5 provides
empirical evidence to investigate the properties of the proposed estimators and compares
them with their competitors. Section 6 applies the proposed sampling design and estimators
to apple production data in Turkey. Section 7 provides some concluding remarks. All proofs
are given in the supplementary material.
2. PROBABILITY-PROPORTIONAL-TO-SIZE POST-STRATIFIED
SAMPLING
We consider a finite population of size N, $P_N = \{u_1, \ldots, u_N\}$. Each population unit
$u_i$ possesses two characteristics Y and X, where Y is the characteristic of interest and X
is an auxiliary size variable. In this population, the actual values of the Y- and X-variables are
denoted by $y_1, \ldots, y_N$ and $x_1, \ldots, x_N$, respectively. We assume that the characteristic X
is approximately proportional to the characteristic Y. The population mean and variance of
Y are denoted by
$$\theta = \frac{1}{N}\sum_{i=1}^{N} y_i \quad\text{and}\quad \gamma^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \theta)^2.$$
From this population, we select a probability-proportional-to-size sample of size n with
replacement. Note that we use design-based inference. Hence, the values of Y and X in the
population are non-random constants, and the sampling variation is induced by the selection
probabilities of the units. Let $W_i$ be the indicator function
$$W_i = \begin{cases} 1 & \text{if unit } i \text{ is selected} \\ 0 & \text{otherwise} \end{cases}$$
with $P(W_i = 1) = \pi_i$, where $\pi_i$ is proportional to the value of X on $u_i$, $\pi_i \propto x_i$.
We then write $Y_i = W_i y_i$. In this expression, even though $y_i$ is a constant, $Y_i$ is a random
variable since $W_i$ has a Bernoulli distribution with success probability $\pi_i$. In the remainder
of the paper, we reserve the capital letter Y for the random variable and the lowercase letter
($y_i$) for the constant value of Y on unit $u_i$. The pps sample then consists of the triplets
$(u_i, Y_i, \pi_i)$; $i = 1, \ldots, n$.
We now induce more structure in this pps sample to improve its information content.
For each selected unit $u_i$ in the pps sample, we select an additional M−1 units using pps
sampling with replacement to form a comparison set of size M,
$$S_i = \{u_i, u_1, \ldots, u_{M-1}\}, \quad i = 1, \ldots, n.$$
The units in comparison set $S_i$ are ranked with respect to the size variable X, and the rank
of $u_i$, $R_i$, is recorded. The pps sample is then augmented with this ranking information,
$(u_i, Y_i, \pi_i, R_i)$; $i = 1, \ldots, n$. Each unit $u_i$ in the augmented pps sample carries two pieces of
information. The first is the value ($Y_i = W_i y_i$) of the characteristic Y. The second
is the relative position (rank $R_i$) of $u_i$ among all M units in the comparison set $S_i$. The rank
$R_i$ is obtained at no additional cost since the size variable X is available in the sampling
frame prior to sampling. Since the comparison sets are constructed by pps sampling with
replacement, the same unit may appear more than once in $S_i$. If this
happens, ties are broken at random to rank the units in the comparison set $S_i$. Even under
perfect ranking, ties can create ranking error in the comparison sets since they are broken
at random.
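As a rough illustration of how the augmented pps sample could be drawn, the Python sketch below selects the measured units and their comparison sets by pps sampling with replacement and breaks ties at random. The function name `augmented_pps_sample` and the simulated population are hypothetical.

```python
import random

def augmented_pps_sample(x, y, n, M, rng):
    """Return the augmented pps sample as (unit index, Y_i, pi_i, R_i) tuples."""
    N = len(x)
    total_x = sum(x)
    pi = [xi / total_x for xi in x]                 # pi_i proportional to x_i
    units = rng.choices(range(N), weights=x, k=n)   # pps draws with replacement
    sample = []
    for u in units:
        # M-1 additional units for the comparison set, again by pps sampling.
        others = rng.choices(range(N), weights=x, k=M - 1)
        smaller = sum(x[v] < x[u] for v in others)
        ties = sum(x[v] == x[u] for v in others)
        # Ties (e.g. the same unit drawn again) are broken at random.
        rank = 1 + smaller + rng.randint(0, ties)
        sample.append((u, y[u], pi[u], rank))
    return sample

rng = random.Random(42)
x = [rng.expovariate(1 / 100) for _ in range(150)]   # hypothetical size variable
y = [15 * xi + rng.gauss(0, 10) for xi in x]          # hypothetical response
sample = augmented_pps_sample(x, y, n=8, M=5, rng=rng)
```

Only the measured unit needs a Y-value; the other M−1 units contribute through their X-values alone.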
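In the augmented pps sample, if we ignore the ranks $R_i$; $i = 1, \ldots, n$, the sample
$(Y_i, \pi_i)$ becomes a pps sample. Based on this pps sample, an unbiased estimator of the
population mean θ and its variance are given by
$$\bar{Y}_{pps,n} = \frac{1}{Nn}\sum_{i=1}^{N} \frac{W_i y_i}{\pi_i}, \qquad \sigma^2_{pps,n} = \mathrm{Var}(\bar{Y}_{pps,n}) = \frac{1}{n}\sum_{k=1}^{N} \pi_k \left(\frac{y_k}{N\pi_k} - \theta\right)^2. \tag{1}$$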
An unbiased estimator of $\sigma^2_{pps,n}$ is available in the literature (Thompson 2002, page 52), and
an approximate (1−α)100% confidence interval for the population mean is given by
$$\bar{Y}_{pps,n} \pm t_{n-1,\alpha/2}\,\hat{\sigma}_{pps,n}, \qquad \hat{\sigma}^2_{pps,n} = \frac{1}{n(n-1)}\sum_{i=1}^{n} \left(\frac{y_i}{\pi_i N} - \bar{Y}_{pps,n}\right)^2,$$
where $t_{df,a}$ is the ath upper quantile of the t-distribution with df degrees of freedom.
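The estimator, its variance estimate and the confidence interval can be computed directly from the sampled pairs. The Python sketch below is illustrative only: the function name `pps_mean_estimate` and the sample values are made up, and the t quantile is supplied by hand because the standard library does not provide one.

```python
import math

def pps_mean_estimate(ys, pis, N):
    """pps estimate of the mean and its standard error (with-replacement sample)."""
    n = len(ys)
    terms = [y / (p * N) for y, p in zip(ys, pis)]   # y_i / (pi_i N)
    estimate = sum(terms) / n
    var_hat = sum((t - estimate) ** 2 for t in terms) / (n * (n - 1))
    return estimate, math.sqrt(var_hat)

# Hypothetical sample of n = 10 draws from a population of N = 50 units.
ys = [12.0, 30.0, 7.5, 18.0, 41.0, 9.0, 22.0, 15.0, 27.0, 11.0]
pis = [0.012, 0.031, 0.008, 0.017, 0.040, 0.010, 0.021, 0.016, 0.026, 0.011]
estimate, se = pps_mean_estimate(ys, pis, N=50)

t_crit = 2.262   # upper 0.025 quantile of t with 9 degrees of freedom (from a t-table)
ci = (estimate - t_crit * se, estimate + t_crit * se)
```

When Y is exactly proportional to X, every term $y_i/(\pi_i N)$ equals θ, so the estimate is exact and the estimated variance is zero; this is why pps sampling pays off when the proportionality is strong.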
In a pps sample, the probability mass function (pmf) and cumulative distribution function
(cdf) of $Y_i$ are given by
$$f(y) = f_{Y_i}(y) = P(Y_i = y) = \sum_{i=1}^{N} P(W_i = 1) I(y_i = y) = \sum_{j=1}^{N} \pi_j I(y_j = y)$$
and
$$F(y) = F_{Y_i}(y) = \sum_{i: y_i \le y} \sum_{j=1}^{N} \pi_j I(y_j = y_i),$$
where I(a) is an indicator function. From the above equation, we also observe that the $Y_i$
are independent and identically distributed discrete random variables. For independent discrete
random variables, the cdf and pmf of the hth-order statistic in a set of size M are given by
$$F_{(h:M)}(y) = \sum_{k=h}^{M} \binom{M}{k} F^k(y)\{1 - F(y)\}^{M-k} \quad\text{and}\quad f_{(h:M)}(y) = F_{(h:M)}(y) - F_{(h:M)}(y-),$$
where $F_{(h:M)}(y-)$ is the left limit at y.
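The order-statistic formulas can be evaluated numerically for any discrete distribution. The sketch below is a hedged illustration: `order_stat_cdf`, `order_stat_pmf` and the three-point support are hypothetical and only demonstrate the binomial-sum form of the cdf and the jump form of the pmf.

```python
from math import comb

def order_stat_cdf(F, h, M):
    """F_(h:M)(y) evaluated from the parent cdf value F = F(y)."""
    return sum(comb(M, k) * F ** k * (1 - F) ** (M - k) for k in range(h, M + 1))

def order_stat_pmf(support, h, M):
    """pmf of the h-th order statistic.

    support: list of (value, prob) pairs sorted by value, probs summing to one.
    """
    pmf, F_prev, G_prev = {}, 0.0, 0.0
    for value, prob in support:
        F = min(F_prev + prob, 1.0)    # guard against rounding above 1
        G = order_stat_cdf(F, h, M)
        pmf[value] = G - G_prev        # F_(h:M)(y) - F_(h:M)(y-)
        F_prev, G_prev = F, G
    return pmf

support = [(10.0, 0.5), (20.0, 0.3), (30.0, 0.2)]   # hypothetical pmf f(y)
pmf_max = order_stat_pmf(support, h=3, M=3)          # largest of M = 3 draws
```

For h = M the cdf reduces to $F^M(y)$, so in this example the probability that the maximum equals the smallest value is $0.5^3$.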
In the augmented pps sample, the ranks can be used to post-stratify the pps sample
based on the relative positions (ranks) of the units in the comparison sets. The ranks $R_i$; $i = 1, \ldots, n$,
are independent and identically distributed (iid) discrete uniform random variables on the integers
1, ..., M. For large values of M, the post-stratified sample may contain many empty
ranking groups, which usually increase the variance of the estimators.
Without loss of generality, we drop $u_i$ from the notation of the augmented pps sample. The
new sample is called the post-stratified pps (pp) sample, and it contains the triplets
$(Y_i, \pi_i, R_i)$, $i = 1, \ldots, n$.
To reduce the likelihood of empty ranking classes, we reduce the number of ranking
groups from M to d, 1 ≤ d ≤ M, where d is the number of post-stratified ranking groups
and H = M/d is the number of ranks in each ranking group. Let
$$D_r = \{(r-1)H + 1, \ldots, (r-1)H + H\}; \quad r = 1, \ldots, d; \qquad \bigcup_{r=1}^{d} D_r = \{1, \ldots, M\},$$
where the sets $D_r$; $r = 1, \ldots, d$, form a partition of the integers 1, ..., M. For example, if
M = 9 and d = 3, then $D_1 = \{1,2,3\}$, $D_2 = \{4,5,6\}$ and $D_3 = \{7,8,9\}$ form a partition of the
integers 1, ..., 9. Using these partition sets, we stratify the sample into d strata based on the
membership of $R_i$ in the sets $D_r$; $r = 1, \ldots, d$. Large values of d create more structure in
the sample, but may lead to many empty strata and more uncertainty in the estimators. For
notational convenience, we relabel the pp sample as $(Z_{i,r}, \pi_i)$; $i = 1, \ldots, n$; $r = 1, \ldots, d$,
where
$$Z_{i,r} = Y_i I(R_i \in D_r); \quad i = 1, \ldots, n; \; r = 1, \ldots, d.$$
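The partition of the ranks and the relabelled values $Z_{i,r}$ are easy to compute. The following Python sketch is illustrative: the helper names `rank_classes` and `z_matrix` are hypothetical, and it assumes that d divides M.

```python
def rank_classes(M, d):
    """Partition the ranks 1..M into d consecutive classes D_1, ..., D_d."""
    if M % d != 0:
        raise ValueError("this sketch assumes that d divides M")
    H = M // d
    return [list(range((r - 1) * H + 1, (r - 1) * H + H + 1)) for r in range(1, d + 1)]

def z_matrix(ys, ranks, M, d):
    """Z[i][r-1] = Y_i * I(R_i in D_r)."""
    classes = rank_classes(M, d)
    return [[y if R in D else 0.0 for D in classes] for y, R in zip(ys, ranks)]

classes = rank_classes(9, 3)   # the M = 9, d = 3 example from the text
Z = z_matrix([5.0, 8.0, 2.0], ranks=[2, 9, 4], M=9, d=3)
```

Each row of `Z` has at most one non-zero entry, located in the column of the ranking class that contains $R_i$.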
The $Z_{i,r}$ are independent but not identically distributed random variables. The conditional
distribution of $Z_{1,r}$ given that the rank $R_1$ belongs to the set $D_r$ is given by
$$P(Z_{1,r} = z_1 \mid R_1 \in D_r) = \frac{1}{H}\sum_{h \in D_r} f_{[h:M]}(z_1),$$
where $f_{[h:M]}(z)$ is the pmf of the hth-order statistic $Y_{[h:M]}$ in a pps sample when the
comparison set is ordered based on the X-variable. We note that the rank $R_i$ may not be equal
to the rank of $Y_i$ in the comparison set $S_i$ since the units are ranked based on the X-variable.
Hence, we use square brackets to denote the possibility of ranking error. If the units are
ranked based on the Y-variable, the comparison sets may still contain repeated observations,
since the units are selected with replacement. In this case, ranking error may be relatively
small if the population size N is large relative to the set size M. In this paper, unless stated
otherwise, we consider a ranking procedure based on the characteristic X.
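Using the conditional distribution of $Z_{1,r}$ given that $Y_1$ has a rank in the set $D_r$, we define
the conditional mean and variance of $Z_{1,r}/\pi_1$ as follows:
$$\bar{\mu}_r = E\left(\frac{Z_{1,r}}{\pi_1} \,\Big|\, R_1 \in D_r\right) = \frac{1}{H}\sum_{h \in D_r} E\left(\frac{Y_{[h:M],1}}{\pi_{[h:M],1}}\right) = \frac{1}{H}\sum_{h \in D_r} \mu_{[h:M]} \tag{2}$$
and
$$\mathrm{Var}\left(\frac{Z_{1,r}}{\pi_1} \,\Big|\, R_1 \in D_r\right) = \frac{1}{H}\sum_{h \in D_r} \sigma^2_{[h:M]} + \frac{1}{H}\sum_{h \in D_r} \left(\mu_{[h:M]} - \bar{\mu}_r\right)^2 = \frac{1}{H}(\sigma_r^2 + \tau_r^2), \tag{3}$$
where
$$\sigma_r^2 = \sum_{h \in D_r} \sigma^2_{[h:M]}; \qquad \tau_r^2 = \sum_{h \in D_r} \left(\mu_{[h:M]} - \bar{\mu}_r\right)^2,$$
$$\mu_{[h:M]} = E\left(\frac{Y_{[h:M],1}}{\pi_{[h:M],1}}\right) \quad\text{and}\quad \sigma^2_{[h:M]} = \mathrm{Var}\left(\frac{Y_{[h:M],1}}{\pi_{[h:M],1}}\right).$$
Let
$$J_r = \begin{cases} 1/n_r & \text{if } n_r > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$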
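We now construct an estimator for the population mean θ,
$$\bar{Y}_{pp,n} = \frac{1}{N d_n}\sum_{r=1}^{d} I_r J_r \sum_{i=1}^{n} \frac{Z_{i,r}}{\pi_i} = \sum_{r=1}^{d} a_r \bar{Z}_r, \qquad \bar{Z}_r = \frac{J_r}{N}\sum_{i=1}^{n} \frac{Z_{i,r}}{\pi_i}, \qquad a_r = \frac{I_r}{d_n},$$
where $n_r$ is the number of observations in ranking class r, $I_r = I(n_r > 0)$ and $d_n = \sum_{r=1}^{d} I_r$.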
We note that $\bar{Z}_r$ is a pps estimator based on the sample observations having membership in
set $D_r$. Hence, the estimator $\bar{Y}_{pp,n}$ is a weighted average of pps estimators from the ranking
groups. The weights $a_r$; $r = 1, \ldots, d$, are used as an adjustment to create an unbiased
estimator for θ.
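A minimal Python sketch of the post-stratified estimator is given below. The function name `pp_estimate` and the three-unit sample are hypothetical, and the code assumes that d divides M so that H = M/d is an integer.

```python
def pp_estimate(ys, pis, ranks, N, M, d):
    """Post-stratified pps estimate of the population mean.

    Averages y_i / pi_i within each non-empty ranking class, then averages
    the class means over the d_n non-empty classes and divides by N.
    """
    H = M // d                      # ranks per class (assumes d divides M)
    sums = [0.0] * d
    counts = [0] * d
    for y, p, R in zip(ys, pis, ranks):
        r = (R - 1) // H            # zero-based index of the class containing R
        sums[r] += y / p
        counts[r] += 1
    nonempty = [r for r in range(d) if counts[r] > 0]   # classes with I_r = 1
    d_n = len(nonempty)
    return sum(sums[r] / counts[r] for r in nonempty) / (N * d_n)

# Hypothetical sample: y/pi equals 10, 30 and 40; ranks 1, 1 and 2 with M = d = 2.
estimate = pp_estimate(ys=[1.0, 3.0, 8.0], pis=[0.1, 0.1, 0.2],
                       ranks=[1, 1, 2], N=10, M=2, d=2)
```

In this toy example the class means of $y_i/\pi_i$ are 20 and 40, so the estimate is (20 + 40)/(10 × 2) = 3. If all three units fell into the same class, $d_n$ would equal 1 and the estimator would reduce to the ordinary pps estimator.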
Note that $n_r$, $I_r$ and $d_n$ are random variables. The vector of ranking class sample sizes
$\mathbf{n} = (n_1, \ldots, n_d)$ is a d-dimensional multinomial random vector with parameters
$n = n_1 + \cdots + n_d$ and success probability vector (1/d, ..., 1/d). Using this multinomial random
vector, we establish the following results, proofs of which are given in Ozturk (2014b)
and Dastbaravarde et al. (2016).
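Theorem 1. Let $\mathbf{n} = (n_1, \ldots, n_d)$ be a multinomial random vector with success
probability vector (1/d, ..., 1/d). The following equalities hold:
i. $E\left(\dfrac{I_1}{d_n}\right) = \dfrac{1}{d}$
ii. $\mathrm{Var}\left(\dfrac{I_1}{d_n}\right) = \dfrac{1}{d^2}\sum_{k=1}^{d-1} \left(\dfrac{k}{d}\right)^{n-1}$
iii. $\mathrm{Cov}\left(\dfrac{I_1}{d_n}, \dfrac{I_2}{d_n}\right) = -\dfrac{1}{d-1}\mathrm{Var}\left(\dfrac{I_1}{d_n}\right)$
iv. $E\left(\dfrac{I_1 J_1}{d_n^2}\right) = \dfrac{1}{d^n}\left\{\dfrac{1}{n} + \sum_{k=2}^{d}\sum_{j=1}^{k-1}\sum_{m=1}^{n-k+1} \dfrac{(-1)^{j-1}}{k^2 m}\dbinom{d-1}{k-1}\dbinom{k-1}{j-1}\dbinom{n}{m}(k-j)^{n-m}\right\}.$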
Note that the expected values, variance and covariance in Theorem 1 do not depend on population
characteristics. They depend only on the design parameter d and sample size n and
hence can be computed once, prior to sampling. We next show that
$\bar{Y}_{pp,n}$ is an unbiased estimator for θ and provide a closed-form expression for its variance.
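Because these quantities depend only on d and n, they can be checked by brute force for small designs. The Python sketch below enumerates all $d^n$ equally likely assignments of n units to d ranking classes and compares the exact moments with the closed forms of Theorem 1; the function names are hypothetical, and the enumeration is only feasible for small n and d.

```python
from fractions import Fraction
from itertools import product
from math import comb

def enumerate_moments(d, n):
    """Exact E(I1/dn), Var(I1/dn), Cov(I1/dn, I2/dn), E(I1*J1/dn^2) for d >= 2."""
    p = Fraction(1, d ** n)          # probability of each assignment
    e_a1 = e_a1sq = e_a1a2 = e_ij = Fraction(0)
    for assign in product(range(d), repeat=n):
        counts = [assign.count(r) for r in range(d)]
        d_n = sum(c > 0 for c in counts)
        a1 = Fraction(int(counts[0] > 0), d_n)
        a2 = Fraction(int(counts[1] > 0), d_n)
        e_a1 += p * a1
        e_a1sq += p * a1 * a1
        e_a1a2 += p * a1 * a2
        if counts[0] > 0:
            e_ij += p * Fraction(1, counts[0] * d_n ** 2)
    # By symmetry E(I2/dn) = E(I1/dn), which is used in the covariance.
    return e_a1, e_a1sq - e_a1 ** 2, e_a1a2 - e_a1 ** 2, e_ij

def var_closed(d, n):
    """Closed form (ii) of Theorem 1."""
    return Fraction(1, d * d) * sum(Fraction(k, d) ** (n - 1) for k in range(1, d))

def e_ij_closed(d, n):
    """Closed form (iv) of Theorem 1."""
    total = Fraction(1, n)
    for k in range(2, d + 1):
        for j in range(1, k):
            for m in range(1, n - k + 2):
                total += (Fraction((-1) ** (j - 1), k * k * m)
                          * comb(d - 1, k - 1) * comb(k - 1, j - 1)
                          * comb(n, m) * (k - j) ** (n - m))
    return total / d ** n

mean, var, cov, e_ij = enumerate_moments(d=3, n=3)
```

For d = 3 and n = 3 the enumeration gives a mean of 1/3, a variance of 5/81 and $E(I_1 J_1/d_n^2) = 13/108$, in agreement with parts i, ii and iv, while the covariance equals minus half the variance as in part iii.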
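Theorem 2. Let $(Z_{i,r}, \pi_i)$; $i = 1, \ldots, n$; $r = 1, \ldots, d$, be a post-stratified
probability-proportional-to-size sample from a finite population. The estimator $\bar{Y}_{pp,n}$ is unbiased for
the population mean θ and its variance $\sigma^2_{pp,n} = \mathrm{Var}(\bar{Y}_{pp,n})$ is given by
$$\sigma^2_{pp,n} = \frac{d}{N^2(d-1)}\mathrm{Var}\left(\frac{I_1}{d_n}\right)\sum_{r=1}^{d} (\bar{\mu}_r - N\theta)^2 + \frac{E\left(I_1 J_1/d_n^2\right)}{N^2 H}\sum_{r=1}^{d}\sum_{h \in D_r} \left\{(\mu_{[h:M]} - \bar{\mu}_r)^2 + \sigma^2_{[h:M]}\right\}.$$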
There are two types of variation contributing to the variance of the estimator $\bar{Y}_{pp,n}$:
variation due to differences among population units and variation due to differences
among the ranking class sample sizes $n_r$, $r = 1, \ldots, d$. The ranking class sample size variation
is quantified by the expressions $\mathrm{Var}(I_1/d_n)$ and $E(I_1 J_1/d_n^2)$, where $J_1$ is defined in Eq. (4).
For large sample size n, we can establish the following limits:
$$\lim_{n \to \infty} n\,\mathrm{Var}\left(\frac{I_1}{d_n}\right) = 0 \quad\text{and}\quad \lim_{n \to \infty} n\,E(I_1 J_1/d_n^2) = 1/d.$$
Using these two limits, the variance of $\sqrt{n}(\bar{Y}_{pp,n} - \theta)$ can be reduced to a simple form:
$$\mathrm{Var}\left\{\sqrt{n}(\bar{Y}_{pp,n} - \theta)\right\} \approx \frac{1}{N^2 H d}\sum_{r=1}^{d}\sum_{h \in D_r} \left\{(\mu_{[h:M]} - \bar{\mu}_r)^2 + \sigma^2_{[h:M]}\right\} = \frac{1}{N^2 d H}\sum_{r=1}^{d} (\sigma_r^2 + \tau_r^2).$$
The large-sample approximation of the variance of the estimator shows that it is partitioned
into two pieces, within- and between-ranking-group variation. This is similar to the
partition of the variation in a stratified sample, where the variance is decomposed into within- and
between-strata variation.
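We now construct a conditionally unbiased estimator for the variance of $\bar{Y}_{pp,n}$, given that
at least one of the groups has at least two measured units. Let
$$J_r^* = \begin{cases} 1/(n_r - 1) & \text{if } n_r > 1 \\ 0 & \text{otherwise,} \end{cases} \tag{5}$$
$$U_1 = \frac{d}{d_n^*}\sum_{r=1}^{d}\sum_{i=1}^{n}\sum_{j \ne i}^{n} I_r^* J_r J_r^* \left(\frac{Z_{i,r}}{\pi_i} - \frac{Z_{j,r}}{\pi_j}\right)^2, \tag{6}$$
$$U_2 = \frac{1}{d_n^2}\sum_{r=1}^{d}\sum_{t \ne r}^{d}\sum_{i=1}^{n}\sum_{j=1}^{n} I_r I_t J_r J_t \left(\frac{Z_{i,r}}{\pi_i} - \frac{Z_{j,t}}{\pi_j}\right)^2, \tag{7}$$
where $I_r^* = I(n_r > 1)$ and $d_n^* = \sum_{r=1}^{d} I_r^*$.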
Theorem 3. Let $(Z_{i,r}, \pi_i)$; $i = 1, \ldots, n$; $r = 1, \ldots, d$, be a post-stratified
probability-proportional-to-size sample from a finite population. Assume that there is at least one set
$D_r$ that contains at least two observations. A conditionally unbiased estimator for $\sigma^2_{pp,n}$ is
then given by
$$\hat{\sigma}^2_{pp,n} = \frac{U_1}{2}\left\{E\left(\frac{I_1^2 J_1}{d_n^2}\right) - \mathrm{Var}\left(\frac{I_1}{d_n}\right)\right\} + \frac{U_2}{2(d-1)}\,\frac{\mathrm{Var}(I_1/d_n)}{E\left(I_1 I_2/d_n^2\right)},$$
where $E(I_1 I_2/d_n^2) = -\mathrm{Var}(I_1/d_n)/(d-1) + 1/d^2$.
Theorem 3 holds for any n as long as there exists a set $D_r$ with at least two observations.
We can then construct an approximate (1−α)100% confidence interval for the population
mean θ for moderate sample sizes,
$$\bar{Y}_{pp,n} \pm t_{n-d_n,\alpha/2}\sqrt{\hat{\sigma}^2_{pp,n}},$$
where the degrees of freedom $df = n - d_n$ are suggested to account for the heterogeneity among
ranking classes.
3. RAO–BLACKWELL ESTIMATOR
The post-stratified probability-proportional-to-size sample estimator can be considered
as a conditional estimator for given values of the sample units $u_i$, $i = 1, \ldots, n$. Let
$R = \{R_1, \ldots, R_n\}$ be the conditional ranks of the n units given $S = (u_1, \ldots, u_n)$. The estimator
$\bar{Y}_{pp,n}$ is constructed based on just one realization of the ranks $R_i$, $i = 1, \ldots, n$, given the
sample units,
$$\bar{Y}_{pp,n}(R) = \frac{1}{N d_n}\sum_{r=1}^{d} I_r J_r \sum_{i=1}^{n} \left.\frac{Z_{i,r}}{\pi_i}\,\right|\, u_1, \ldots, u_n,$$
where the notation R highlights that this estimator depends on the realization of the conditional
ranks for the given sample unit vector S. For a given sample unit vector S, one can obtain many
realizations of the ranks by constructing different comparison sets from the population. Each
of these realizations leads to a different estimator. We then use the Rao–Blackwell theorem to
combine all these estimators,
$$\bar{Y}_{RB,n} = E_R\left\{\bar{Y}_{pp,n}(R)\right\},$$
where the expectation is taken over the conditional distribution of the ranks $R_i$; $i = 1, \ldots, n$,
given the sample units $u_i$, $i = 1, \ldots, n$. The construction of the Rao–Blackwell estimator
requires the computation of the conditional expectation of the post-stratified
probability-proportional-to-size sample estimator over the conditional distribution of the ranks given the set
of sample units S. Even though we are unable to find a closed-form expression for this
expectation, we provide an algorithm to approximate it.
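Algorithm 1.
I. Select an integer Q. For $q = 1, \ldots, Q$, construct comparison sets
$S_i^q = \{u_i, u_2^q, \ldots, u_M^q\}$; $i = 1, \ldots, n$, where $u_t^q$; $t = 2, \ldots, M$, are the unmeasured units
selected from the population using pps sampling to form the comparison set $S_i^q$.
II. Using the comparison sets in step I, compute $R^q = (R_1^q, \ldots, R_n^q)$ and
$$J_r^q = \begin{cases} 1/n_r^q & \text{if } n_r^q > 0 \\ 0 & \text{otherwise;} \end{cases} \qquad I_{i,r}^q = I(R_i^q \in D_r); \qquad n_r^q = \sum_{i=1}^{n} I_{i,r}^q;$$
$$I_r^q = I(n_r^q > 0); \qquad d_n^q = \sum_{r=1}^{d} I_r^q; \qquad \bar{Y}_{pp,n}^q = \sum_{r=1}^{d} \frac{I_r^q J_r^q}{d_n^q}\sum_{i=1}^{n} \frac{Y_i I(R_i^q \in D_r)}{N\pi_i}.$$
III. Approximate $\bar{Y}_{RB,n}$ from
$$\tilde{Y}_{RB,n} \approx \frac{1}{Q}\sum_{q=1}^{Q} \bar{Y}_{pp,n}^q.$$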
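The three steps of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the author's implementation: the function names, the simulated population and the choice Q = 200 are hypothetical, d is assumed to divide M, and ties in the comparison sets are broken at random as described in Section 2.

```python
import random

def pp_estimate(ys, pis, ranks, N, M, d):
    """Post-stratified pps estimate for one realization of the ranks."""
    H = M // d
    sums, counts = [0.0] * d, [0] * d
    for y, p, R in zip(ys, pis, ranks):
        r = (R - 1) // H
        sums[r] += y / p
        counts[r] += 1
    nonempty = [r for r in range(d) if counts[r] > 0]
    return sum(sums[r] / counts[r] for r in nonempty) / (N * len(nonempty))

def rao_blackwell_estimate(units, ys, pis, x, M, d, Q, rng):
    """Monte Carlo approximation of the Rao-Blackwell estimator."""
    N = len(x)
    total = 0.0
    for _ in range(Q):
        ranks = []
        for u in units:
            # Step I: M-1 unmeasured units drawn by pps sampling with replacement.
            others = rng.choices(range(N), weights=x, k=M - 1)
            smaller = sum(x[v] < x[u] for v in others)
            ties = sum(x[v] == x[u] for v in others)
            # Step II: rank of the measured unit, ties broken at random.
            ranks.append(1 + smaller + rng.randint(0, ties))
        total += pp_estimate(ys, pis, ranks, N, M, d)
    return total / Q   # Step III: average over the Q replications

rng = random.Random(11)
x = [rng.expovariate(1 / 100) for _ in range(300)]             # hypothetical sizes
y = [15 * xi + 5 * rng.gauss(0, 1) * xi ** 0.5 for xi in x]    # hypothetical responses
total_x = sum(x)
units = rng.choices(range(300), weights=x, k=12)               # measured pps sample
rb = rao_blackwell_estimate(units, [y[u] for u in units],
                            [x[u] / total_x for u in units], x, M=4, d=2, Q=200, rng=rng)
```

Only the comparison sets are redrawn in each replication; the measured values $Y_i$ and the probabilities $\pi_i$ stay fixed, so no additional measurements are needed.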
The algorithm does not provide an estimate for the variance of the Rao–Blackwell estimator.
We use the jackknife variance estimator to assess the sampling variation. To construct the
jackknife variance, for given Q sets of ranks $R^q$, $q = 1, \ldots, Q$, we compute n Rao–Blackwell
estimators $\bar{Y}_{RB,n}^{(-i)}$, $i = 1, \ldots, n$, where $\bar{Y}_{RB,n}^{(-i)}$ is the Rao–Blackwell estimator after
the ith unit is removed from the sample. We now create the jackknife replicates
$$s_i = n\bar{Y}_{RB,n} - (n-1)\bar{Y}_{RB,n}^{(-i)}, \quad i = 1, \ldots, n.$$
The jackknife variance estimate is then given by
$$\hat{\sigma}_J^2 = \frac{1}{n(n-1)}\sum_{i=1}^{n} (s_i - \bar{s})^2,$$
where $\bar{s} = \sum_{i=1}^{n} s_i/n$. An approximate (1−α)100% confidence interval for the population
mean θ based on the Rao–Blackwell estimator is given by
$$\bar{Y}_{RB,n} \pm t_{n-1,\alpha/2}\sqrt{\hat{\sigma}_J^2}.$$
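The jackknife computation can be sketched as below, reusing the Q stored rank vectors when a unit is deleted. The function names and the four-unit sample are hypothetical, and the code assumes that d divides M.

```python
def pp_estimate(ys, pis, ranks, N, M, d):
    """Post-stratified pps estimate for one realization of the ranks."""
    H = M // d
    sums, counts = [0.0] * d, [0] * d
    for y, p, R in zip(ys, pis, ranks):
        r = (R - 1) // H
        sums[r] += y / p
        counts[r] += 1
    nonempty = [r for r in range(d) if counts[r] > 0]
    return sum(sums[r] / counts[r] for r in nonempty) / (N * len(nonempty))

def rb_from_ranks(ys, pis, rank_sets, N, M, d):
    """Average of the pp estimates over the Q stored rank vectors."""
    return sum(pp_estimate(ys, pis, r, N, M, d) for r in rank_sets) / len(rank_sets)

def jackknife_variance(ys, pis, rank_sets, N, M, d):
    """Jackknife variance estimate built from leave-one-out estimators."""
    n = len(ys)
    full = rb_from_ranks(ys, pis, rank_sets, N, M, d)
    replicates = []
    for i in range(n):
        keep = [k for k in range(n) if k != i]
        loo = rb_from_ranks([ys[k] for k in keep], [pis[k] for k in keep],
                            [[r[k] for k in keep] for r in rank_sets], N, M, d)
        replicates.append(n * full - (n - 1) * loo)   # s_i
    s_bar = sum(replicates) / n
    return sum((s - s_bar) ** 2 for s in replicates) / (n * (n - 1))

# Hypothetical sample of n = 4 units with Q = 2 stored rank vectors.
ys = [12.0, 30.0, 7.5, 18.0]
pis = [0.012, 0.031, 0.008, 0.017]
rank_sets = [[1, 2, 1, 2], [2, 2, 1, 1]]
var_j = jackknife_variance(ys, pis, rank_sets, N=50, M=2, d=2)
```

With a single ranking class (d = 1) the replicates $s_i$ reduce to the individual terms $y_i/(\pi_i N)$, so the jackknife variance coincides with the usual pps variance estimate.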
4. POST-STRATIFIED PROBABILITY-PROPORTIONAL-TO-SIZE
SAMPLES FROM STRATIFIED POPULATIONS
In this section, we extend the post-stratified probability-proportional-to-size sample to
stratified populations. We assume that the main population is divided into L disjoint subpopulations
$P_{N_l} = \{u_{1,l}, \ldots, u_{N_l,l}\}$, where $N_l$ is the population size of the lth stratum population,
$l = 1, \ldots, L$. The stratum population means, variances and totals are defined as
$$\theta_l = \frac{1}{N_l}\sum_{i=1}^{N_l} y_{i,l}; \qquad \gamma^2_{N_l} = \frac{1}{N_l}\sum_{i=1}^{N_l} (y_{i,l} - \theta_l)^2; \qquad t_{N_l} = N_l\theta_l; \qquad l = 1, \ldots, L,$$
$$\theta = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{N_l} y_{i,l},$$
where $y_{i,l}$ is the value of Y on unit $u_{i,l}$ in stratum population $P_{N_l}$ and $N = N_1 + \cdots + N_L$. In
this population, we wish to draw inference on the parameter θ. To construct a post-stratified pps
sample from this stratified population, we select a post-stratified pps sample with sample
size $n_l$ and set size $M_l$ from each stratum population. We combine these samples to form
the post-stratified pps stratified sample (str), $(Y_{i,l}, \pi_{i,l}, R_{i,l})$; $i = 1, \ldots, n_l$; $l = 1, \ldots, L$,
where $Y_{i,l}$ is the value of Y on unit $u_{i,l}$, $\pi_{i,l}$ is the selection probability of the unit $u_{i,l}$ and
$R_{i,l}$ is the rank of the unit $u_{i,l}$ in the comparison set of size $M_l$ from stratum population
l. Using this stratified sample, we construct an estimator for the population mean θ,
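$$\bar{Y}_{str} = \sum_{l=1}^{L} \frac{N_l}{N}\bar{Y}_{pp,n_l} = \sum_{l=1}^{L} \frac{N_l}{N}\sum_{r=1}^{d_l} \frac{I_{r,l} J_{r,l}}{N_l d_{n_l}}\sum_{i=1}^{n_l} \frac{Z_{i,r,l}}{\pi_{i,l}}; \qquad n = n_1 + \cdots + n_L,$$
$$J_{r,l} = \begin{cases} 1/n_{r,l} & \text{if } n_{r,l} > 0 \\ 0 & \text{otherwise,} \end{cases}$$
where $n_{r,l}$ is the number of observations in ranking group r of stratum l, $Z_{i,r,l} = Y_{i,l} I(R_{i,l} \in D_{r,l})$,
$D_{r,l} = \{(r-1)H_l + 1, \ldots, (r-1)H_l + H_l\}$, $I_{r,l} = I(n_{r,l} > 0)$, $d_{n_l} = \sum_{r=1}^{d_l} I(n_{r,l} > 0)$,
$H_l = M_l/d_l$, and $d_l$ is the number of ranking groups for stratum l; $l = 1, \ldots, L$. We
use the notation $(Z_{i,r,l}, \pi_{i,l})$, $i = 1, \ldots, n_l$; $r = 1, \ldots, d_l$; $l = 1, \ldots, L$, to denote the
post-stratified pps sample from a stratified population.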
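Theorem 4. Let $(Z_{i,r,l}, \pi_{i,l})$; $i = 1, \ldots, n_l$; $r = 1, \ldots, d_l$; $l = 1, \ldots, L$, be a
post-stratified pps sample from a stratified population. The estimator $\bar{Y}_{str}$ is unbiased for the
population mean θ and its variance $\sigma^2_{str} = \mathrm{Var}(\bar{Y}_{str})$ is given by
$$\sigma^2_{str} = \sum_{l=1}^{L} \frac{N_l^2}{N^2}\sigma^2_{pp,n_l},$$
$$\sigma^2_{pp,n_l} = \frac{d_l}{N_l^2(d_l-1)}\mathrm{Var}\left(\frac{I_{1,l}}{d_{n_l}}\right)\sum_{r=1}^{d_l} (\bar{\mu}_{r,l} - N_l\theta_l)^2 + \frac{E\left(I_{1,l} J_{1,l}/d_{n_l}^2\right)}{N_l^2 H_l}\sum_{r=1}^{d_l}\sum_{h \in D_{r,l}} \left\{(\mu_{[h:M_l]} - \bar{\mu}_{r,l})^2 + \sigma^2_{[h:M_l]}\right\},$$
where $\bar{\mu}_{r,l} = \sum_{h \in D_{r,l}} \mu_{[h:M_l]}/H_l$.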
An unbiased estimator for the population total can be constructed from $T_{str} = N\bar{Y}_{str}$.
The variance of $T_{str}$ follows from Theorem 4, $\mathrm{Var}(T_{str}) = N^2\sigma^2_{str}$.
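Corollary 1. Let $n_0 = \min(n_1, \ldots, n_L)$ and $\lambda_l = \lim_{n_l \to \infty} n_l/n > 0$ as $n_0$ goes to
infinity. The asymptotic variance of $\sqrt{n}(\bar{Y}_{str} - \theta) = \sum_{l=1}^{L} \sqrt{n/n_l}\,(N_l/N)\left[\sqrt{n_l}(\bar{Y}_{pp,n_l} - \theta_l)\right]$ is given by
$$\sigma_\lambda^2 = \sum_{l=1}^{L} \frac{1}{N^2 d_l H_l \lambda_l}\sum_{r=1}^{d_l} (\sigma^2_{r,l} + \tau^2_{r,l}),$$
where $\sigma^2_{r,l} = \sum_{h \in D_{r,l}} \sigma^2_{[h:M_l]}$, $\tau^2_{r,l} = \sum_{h \in D_{r,l}} (\mu_{[h:M_l]} - \bar{\mu}_{r,l})^2$ and $\bar{\mu}_{r,l} = \sum_{h \in D_{r,l}} \mu_{[h:M_l]}/H_l$.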
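A conditionally unbiased estimator for $\sigma^2_{str}$ can be established from Theorem 3, given that
there is at least one set $D_{r,l}$ having at least two observations in each stratum sample:
$$\hat{\sigma}^2_{str} = \sum_{l=1}^{L} \frac{N_l^2}{N^2}\hat{\sigma}^2_{pp,n_l},$$
$$\hat{\sigma}^2_{pp,n_l} = \frac{U_{1,l}}{2}\left\{E\left(\frac{I_{1,l}^2 J_{1,l}}{d_{n_l}^2}\right) - \mathrm{Var}\left(\frac{I_{1,l}}{d_{n_l}}\right)\right\} + \frac{U_{2,l}}{2(d_l-1)}\,\frac{\mathrm{Var}(I_{1,l}/d_{n_l})}{E\left(I_{1,l} I_{2,l}/d_{n_l}^2\right)},$$
where $U_{1,l}$ and $U_{2,l}$ are the expressions $U_1$ and $U_2$ in Eqs. (6) and (7) for stratum l,
respectively. An approximate (1−α)100% confidence interval for the population mean θ can
be constructed from
$$\bar{Y}_{str} \pm t_{df,\alpha/2}\,\hat{\sigma}_{str},$$
where $df = \sum_{l=1}^{L} n_l - \sum_{l=1}^{L} d_{n_l}$.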
A post-stratified pps sample from a stratified population consists of L different
post-stratified probability-proportional-to-size samples, one from each stratum population. The
stratum populations usually have different means and variances. For a fixed sample size n,
$n = n_1 + \cdots + n_L$, the information content of the stratified sample depends on the stratum
sample sizes $n_l$; $l = 1, \ldots, L$. For a finite sample size n, it is challenging to investigate
the relationship between the stratum sample sizes and the information content of the sample.
To ease the computation, we look at four different sample size allocations for large sample
sizes.
The equal allocation procedure selects an equal number of observations from each stratum
population, $n_l = n/L$, $l = 1, \ldots, L$. Under this allocation scheme, the asymptotic variance
of $\sqrt{n}(\bar{Y}_{str} - \theta)$ reduces to
$$\sigma_\lambda^2(E) = \sum_{l=1}^{L} \frac{L}{N^2 d_l H_l}\sum_{r=1}^{d_l} (\sigma^2_{r,l} + \tau^2_{r,l}),$$
where E in $\sigma_\lambda^2(E)$ denotes equal allocation.
In certain cases, it may be reasonable to select sample sizes proportional to the stratum
population sizes, $n_l = n(N_l/N)$. Under proportional (P) allocation, the asymptotic variance
of the estimator reduces to
$$\sigma_\lambda^2(P) = \sum_{l=1}^{L} \frac{1}{N d_l H_l N_l}\sum_{r=1}^{d_l} (\sigma^2_{r,l} + \tau^2_{r,l}).$$
Optimal (Neyman) allocation minimizes the variance of the estimator with respect to the
sample sizes $n_l$ subject to the constraint that the sum of the stratum sample sizes equals n.
Using a Lagrange multiplier, we can show that the Neyman (N) allocation sample sizes are given
by
$$n_m = n\,\frac{\sqrt{\sum_{r=1}^{d_m} (\sigma^2_{r,m} + \tau^2_{r,m})}\Big/\sqrt{d_m H_m}}{\sum_{l=1}^{L} \sqrt{\sum_{r=1}^{d_l} (\sigma^2_{r,l} + \tau^2_{r,l})}\Big/\sqrt{d_l H_l}}; \quad m = 1, \ldots, L.$$
Under Neyman allocation, the asymptotic variance simplifies to
$$\sigma_\lambda^2(N) = \frac{1}{N^2}\left\{\sum_{l=1}^{L} \sqrt{\frac{\sum_{r=1}^{d_l} (\sigma^2_{r,l} + \tau^2_{r,l})}{d_l H_l}}\right\}^2.$$
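The three allocation rules can be compared numerically once the stratum quantities $A_l^2 = \sum_r (\sigma^2_{r,l} + \tau^2_{r,l})$ are available. The Python sketch below uses made-up values of $A_l^2$, $M_l = d_l H_l$ and $N_l$ for three strata; the function names are hypothetical.

```python
import math

def asymptotic_variance(A2, M, lam, N):
    """sum_l A_l^2 / (N^2 M_l lambda_l), with M_l = d_l H_l."""
    return sum(a / (N ** 2 * m * l) for a, m, l in zip(A2, M, lam))

def allocations(A2, M, N_l):
    """Sampling fractions lambda_l = n_l / n under the three allocation rules."""
    L, N = len(A2), sum(N_l)
    w = [math.sqrt(a / m) for a, m in zip(A2, M)]   # A_l / sqrt(d_l H_l)
    return {
        "equal": [1 / L] * L,
        "proportional": [Nl / N for Nl in N_l],
        "neyman": [wl / sum(w) for wl in w],
    }

# Hypothetical stratum quantities for L = 3 strata.
A2 = [4.0e6, 9.0e6, 1.0e6]
M = [3, 3, 3]
N_l = [100, 200, 50]
N = sum(N_l)

lam = allocations(A2, M, N_l)
variances = {name: asymptotic_variance(A2, M, l, N) for name, l in lam.items()}
n = 60
n_neyman = [n * l for l in lam["neyman"]]   # stratum sample sizes before rounding
```

With these inputs the Neyman fractions are 1/3, 1/2 and 1/6, giving stratum sample sizes of 20, 30 and 10, and the resulting asymptotic variance is no larger than under equal or proportional allocation.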
Sampling cost is also a limiting factor in sample size determination when the
budget is constrained. In this case, it is desirable to minimize the variance with respect to the
stratum sample sizes for a given cost function and budget. A simple cost function for this
setting can be constructed as
$$C_T = C_0 + \sum_{l=1}^{L} (c_l + r_l) n_l, \tag{8}$$
where $C_T$ is the total cost, $C_0$ is the overhead cost, $c_l$ is the cost of measuring a single unit
from stratum population l and $r_l$ is the cost of obtaining the rank of a measured unit in a
comparison set in stratum l. For settings where post-stratified probability-proportional-to-size
sampling is appropriate, it is reasonable to assume that $r_l$ is relatively small since the
values of the X-variable are available for all population units. Under the cost model in Eq. (8),
the asymptotic variance of the estimator is minimized for
$$n_m = n\,\frac{\sqrt{\sum_{r=1}^{d_m} (\sigma^2_{r,m} + \tau^2_{r,m})/\{M_m(c_m + r_m)\}}}{\sum_{l=1}^{L} \sqrt{\sum_{r=1}^{d_l} (\sigma^2_{r,l} + \tau^2_{r,l})/\{M_l(c_l + r_l)\}}}; \quad m = 1, \ldots, L.$$
For the cost function $C_T$, the variance of the optimal estimator simplifies to
$$\sigma_\lambda^2(C) = \frac{1}{N^2}\left\{\sum_{l=1}^{L} \sqrt{\frac{\sum_{r=1}^{d_l} (\sigma^2_{r,l} + \tau^2_{r,l})}{M_l(c_l + r_l)}}\right\}^2.$$
The equal and proportional allocations are relatively easy to implement. The difference
between the variances of the estimators under equal and proportional allocations can be
written as follows:
$$\sigma_\lambda^2(E) - \sigma_\lambda^2(P) = \sum_{l=1}^{L} \frac{L A_l^2}{N M_l}\,\frac{N_l - \bar{N}}{N N_l}; \qquad A_l^2 = \sum_{r=1}^{d_l} (\sigma^2_{r,l} + \tau^2_{r,l}); \qquad \bar{N} = \sum_{l=1}^{L} N_l/L.$$
It is reasonable to assume that $A_l^2$ is an increasing function of the population variance
of stratum l, $\tau_l^2$. We then expect the difference between the variances under equal and
proportional allocation to be positive when large stratum populations (large $N_l$) have large
variances (large $\tau_l^2$ or large $A_l^2$). In this case, the proportional allocation procedure samples
more data from a stratum population having large population size and variance, which reduces
the contribution of this stratum sample to the variance of the estimator. We note that
implementing the equal and proportional allocations does not require point
estimates of the population variances. Choosing between them only requires knowing whether the
larger populations have larger variances, which may be less restrictive than knowing point
estimates of the population variances.
The Neyman allocation is optimal. Hence, it yields a smaller variance than both equal
and proportional allocations. On the other hand, the computation of the stratum sample sizes
requires that $A_l^2$ be known prior to the construction of the sample. For settings where
$M_l = d_l H_l \equiv M$ for all stratum samples and the stratum population variances are known
from previous studies (or can be estimated from pilot studies), the Neyman allocation
can be approximated by
$$n_m = \frac{n A_m}{\sum_{l=1}^{L} A_l} \approx \frac{n\hat{\tau}_m}{\sum_{l=1}^{L} \hat{\tau}_l}; \quad m = 1, \ldots, L,$$
where $\hat{\tau}_l^2$ is the estimate of the variance of stratum population l.
5. EFFICIENCY COMPARISON OF THE ESTIMATORS
In this section, we provide empirical evidence about the efficiency of the proposed estimators using several populations where probability-proportional-to-size sampling would be a natural choice. In these populations, the values of the Y-variable are proportional to the values of the X-variable. A small percentage of the population units have extreme values in both the Y- and X-variables, with different proportionality constants. The units that produce extreme values usually behave differently from the other units in the population. They have larger variance, and the slope of the regression fit of Y on X for these units would be larger than the slope of the regression fit of Y on X for the remaining population units. For example, if we sample farms to estimate crop production (Y), the farm population can be divided into two parts: farms of small or normal size in acres (X), and mega farms that have extremely large X-values. The percentage of mega farms would be small, but they may have larger variance in the Y- and X-variables, and the regression fit of Y on X may have a larger slope. For our empirical investigation, we generate this type of population structure using the model below.
I. For a fixed population size $N$, generate the size variable $X$ from an exponential distribution with mean 100 and order these $N$ random numbers from smallest to largest, $x_{(1)}<\cdots<x_{(N)}$, where $x_{(i)}$ is the $i$th smallest of the $x$-values.
II. Let $N^*$ be the largest integer such that $N^*\le N\omega$. Generate the Y-values from
\[
y_{[i]}=\begin{cases}15x_{(i)}+\tau\sqrt{x_{(i)}}\,\epsilon_i, & i=1,\ldots,N^*,\\[2pt] 45x_{(i)}+2\tau\sqrt{x_{(i)}}\,\epsilon_i, & i=N^*+1,\ldots,N,\end{cases}\tag{9}
\]
where $\epsilon_i$ is generated from a normal distribution with mean zero and variance 1, and $y_{[i]}$ is the value of the Y-variable that corresponds to $x_{(i)}$.
In model (9), the quantity $1-\omega$ is the proportion of population units that produce extreme measurements on the Y- and X-variables. It corresponds to the proportion of mega farms in the crop production example. In the simulation study, we used R (R Core Team 2018) to generate the population values. We set the random generator seed to 1 (set.seed(1)) so that the same population values can be re-created. Using model (9), populations of size $N=2000$ are generated with $\omega=0.70, 0.80, 0.90, 0.95$ and $\tau=50, 200, 400, 500, 800$ to establish different correlation structures between the Y- and X-variables. The sample size is $n=30$ in the first phase of the simulation study. The number of replications in the Rao–Blackwell estimator is taken to be $Q=50$. For each population, we used the following pairs of set size and number of groups $(M,d)$:
$(M,d)=(4,4), (5,5), (10,2), (10,5), (30,3), (60,3)$.
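The population model in Eq. (9) can be sketched in a few lines. The version below is written in Python with the standard library rather than R, so a seed of 1 does not reproduce the populations used in the paper, and the resulting correlations will differ from the tabulated $\rho$ values. The function names `generate_population` and `pearson` are hypothetical.

```python
import math
import random


def generate_population(N=2000, omega=0.7, tau=50.0, seed=1):
    """Generate (x, y) following steps I and II of model (9)."""
    rng = random.Random(seed)
    # Step I: exponential sizes with mean 100, sorted from smallest to largest.
    x = sorted(rng.expovariate(1.0 / 100.0) for _ in range(N))
    # Step II: N* is the largest integer not exceeding N * omega
    # (the small constant guards against floating-point round-off).
    n_star = math.floor(N * omega + 1e-9)
    y = []
    for i, xi in enumerate(x, start=1):
        eps = rng.gauss(0.0, 1.0)
        if i <= n_star:
            y.append(15.0 * xi + tau * math.sqrt(xi) * eps)
        else:
            # Units with the largest x-values: steeper slope, doubled noise scale.
            y.append(45.0 * xi + 2.0 * tau * math.sqrt(xi) * eps)
    return x, y


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x, y = generate_population(N=2000, omega=0.7, tau=50.0, seed=1)
print(len(x), round(pearson(x, y), 3))
```

Increasing `tau` weakens the correlation between the two variables, which is how the different correlation structures in the simulation are obtained.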
For the populations generated by the model in Eq. (9), a ratio estimator in double sampling could be an alternative estimator for the population mean or total, using the size variable X as an auxiliary variable. Double sampling first selects $nM$ units from the population and measures only the X-variable. In the second stage, it selects a subsample of size $n$ from the $nM$ selected units and measures both the Y- and X-variables. The ratio estimator is then given by
\[
\bar Y_R=\frac{\sum_{i=1}^{n}Y_i}{\sum_{i=1}^{n}X_i}\;\frac{1}{nM}\sum_{j=1}^{nM}X_j.
\]
This estimator is biased, and its approximate mean square error (MSE) is given in Section 14.1 of Thompson (2002). We compare the Rao–Blackwell estimator ($\bar Y_{RB,n}$) with the probability-proportional-to-size ($\bar Y_{pps,n}$), post-stratified probability-proportional-to-size ($\bar Y_{pp,n}$) and ratio ($\bar Y_R$) estimators. The efficiencies of these estimators are defined as follows:
\[
RE_1=\frac{\mathrm{Var}(\bar Y_{pps,n})}{\mathrm{Var}(\bar Y_{RB,n})},\qquad
RE_2=\frac{\mathrm{Var}(\bar Y_{pp,n})}{\mathrm{Var}(\bar Y_{RB,n})},\qquad
RE_3=\frac{\mathrm{MSE}(\bar Y_R)}{\mathrm{Var}(\bar Y_{RB,n})}.
\]
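The double-sampling ratio estimator and the simulated mean square error in the numerator of $RE_3$ can be sketched as follows. The sketch assumes both phases use simple random sampling without replacement, which this section does not state explicitly; the function names are hypothetical.

```python
import random


def double_sampling_ratio_estimate(x, y, n, M, rng):
    """Ratio estimate of the population mean of y under double sampling.

    Phase 1 draws n*M units and uses only their x-values; phase 2 draws
    n of those units and uses both x and y. Simple random sampling without
    replacement is assumed in both phases.
    """
    first = rng.sample(range(len(x)), n * M)
    second = rng.sample(first, n)
    ratio = sum(y[i] for i in second) / sum(x[i] for i in second)
    x_bar_first = sum(x[j] for j in first) / len(first)
    return ratio * x_bar_first


def mean_square_error(estimates, target):
    """Average squared deviation of simulated estimates from the target."""
    return sum((e - target) ** 2 for e in estimates) / len(estimates)


# Tiny illustration: y is exactly proportional to x and the first phase
# covers the whole population, so the estimate equals the mean of y.
x = [float(i) for i in range(1, 21)]
y = [3.0 * v for v in x]
est = double_sampling_ratio_estimate(x, y, n=5, M=4, rng=random.Random(0))
print(est, sum(y) / len(y))
```

An efficiency such as $RE_3$ is then the ratio of `mean_square_error` for the ratio estimates to the simulated variance of the competing estimator over the same replications.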
Tables 2 and 3 present the efficiency values $RE_1$, $RE_2$ and $RE_3$ for the finite populations generated by the model in Eq. (9). In all these simulation settings, the Rao–Blackwell estimator ($\bar Y_{RB,n}$) is better than the ratio estimator ($\bar Y_R$), that is, $RE_3>1$. The efficiency is higher for larger values of $\rho$ and decreases with lower values of $\rho$. The smallest efficiency gain is $RE_3=1.102$ in Table 3 when $M=60$, $d=3$, $\tau=800$, $\omega=0.95$ and $\rho=0.377$.
The efficiency of the Rao–Blackwell estimator with respect to the probability-proportional-to-size (pps) estimator depends on the correlation coefficient $\rho$ and the proportion of population units with extreme X- and Y-values ($1-\omega$). For large values of $\rho$ ($\ge 0.5$), the Rao–Blackwell estimator is generally superior to the pps estimator ($RE_1\ge 1$). If the correlation coefficient is less than 0.5, the Rao–Blackwell estimator is slightly less efficient than the pps estimator.
If the population has a smaller number of units with extreme X- and Y-measurements, the Rao–Blackwell estimator tends to have higher efficiency relative to the pps estimator for a comparable correlation coefficient $\rho$. For example, in Table 2 the $RE_1$ values when $\omega=0.8$ and $\rho=0.949$ and 0.854 are usually higher than the $RE_1$ values when $\omega=0.7$ and $\rho=0.966$ and 0.856. These numerical values indicate that the Rao–Blackwell and post-stratified probability-proportional-to-size (pp) sample estimators can handle extreme observations better than their competitor, the ratio estimator. As expected, the Rao–Blackwell estimator is always superior to the pp sample estimator ($RE_2>1$).
Tables 4 and 5 present efficiency results for a larger sample size ($n=120$), different set sizes ($M=5, 10, 60, 2000$) and numbers of groups ($d$) when $\omega=0.7, 0.9$. The efficiency values slightly increase with sample size for the same values of $M$, $d$, $\rho$ and $\omega$. For example, in Table 2 when $\omega=0.7$ and $\rho=0.966$, the $RE_1$ values are 2.193 and 4.188 for $(M,d)=(5,5)$ and $(60,3)$, respectively. The $RE_1$ values in Table 4 for the same simulation settings are 2.353 and 4.449. Efficiencies in Tables 4 and 5 also depend on the selection of the set size $M$ and the number of groups. Larger set sizes suggest better efficiency for large $\rho$. For example, the value of $RE_1$ is the largest (5.964) when $\rho=0.966$, $M=2000$ and $d=20$ in Table 4. On the other hand, the value of $RE_1$ is the largest (5.054) when $\rho=0.907$, $M=2000$ and $d=10$ in Table 5. This suggests that the selection of $M$ and $d$ depends on the within-set ranking quality through the correlation coefficient between the X- and Y-variables. This relationship is further investigated in Figs. 1 and 2.
Figures 1 and 2 present plots of the $RE_3$ values with respect to $d$ for different set sizes $M$ in nine panels for $\omega=0.7$ and $\omega=0.90$, respectively. The displays in each row fix
Table 2. Efficiency comparison of the proposed estimators with probability-proportional-to-size and ratio estimators.
ω τ ρ M d RE1 RE2 RE3 τ ρ M d RE1 RE2 RE3
0.70 50 0.966 4 4 1.993 1.447 6.703 400 0.657 10 5 0.999 1.154 1.281
50 0.966 5 5 2.193 1.564 6.779 400 0.657 30 3 1.036 1.049 1.366
50 0.966 10 2 2.081 1.277 4.433 400 0.657 60 3 1.037 1.039 1.395
50 0.966 10 5 2.610 1.579 5.832 500 0.574 4 4 1.005 1.110 1.382
50 0.966 30 3 3.563 1.466 5.920 500 0.574 5 5 1.008 1.156 1.404
50 0.966 60 3 4.188 1.469 6.523 500 0.574 10 2 1.021 1.024 1.366
200 0.856 4 4 1.163 1.163 2.214 500 0.574 10 5 0.977 1.149 1.221
200 0.856 5 5 1.179 1.209 2.181 500 0.574 30 3 1.008 1.044 1.314
200 0.856 10 2 1.187 1.063 1.834 500 0.574 60 3 1.007 1.035 1.344
200 0.856 10 5 1.168 1.198 1.747 800 0.401 4 4 0.984 1.103 1.274
200 0.856 30 3 1.256 1.085 1.761 800 0.401 5 5 0.985 1.150 1.304
200 0.856 60 3 1.275 1.071 1.784 800 0.401 10 2 0.999 1.020 1.306
400 0.657 4 4 1.024 1.116 1.480 800 0.401 10 5 0.952 1.142 1.155
400 0.657 5 5 1.028 1.161 1.495 800 0.401 30 3 0.976 1.039 1.257
400 0.657 10 2 1.041 1.029 1.420 800 0.401 60 3 0.972 1.030 1.286
0.80 50 0.949 4 4 2.279 1.567 8.075 400 0.680 10 5 1.048 1.154 1.484
50 0.949 5 5 2.442 1.602 7.192 400 0.680 30 3 1.052 1.038 1.407
50 0.949 10 2 2.961 1.533 7.318 400 0.680 60 3 1.040 1.032 1.359
50 0.949 10 5 3.071 1.585 7.580 500 0.605 4 4 1.021 1.116 1.484
50 0.949 30 3 2.851 1.284 5.524 500 0.605 5 5 1.026 1.178 1.459
50 0.949 60 3 2.625 1.220 5.173 500 0.605 10 2 1.039 1.036 1.402
200 0.854 4 4 1.231 1.163 2.485 500 0.605 10 5 1.015 1.121 1.342
200 0.854 5 5 1.229 1.238 2.286 500 0.605 30 3 0.999 1.040 1.327
200 0.854 10 2 1.303 1.099 2.259 500 0.605 60 3 1.013 1.033 1.327
200 0.854 10 5 1.290 1.223 2.201 800 0.445 4 4 1.013 1.117 1.335
200 0.854 30 3 1.259 1.083 1.916 800 0.445 5 5 0.999 1.162 1.311
200 0.854 60 3 1.258 1.069 1.905 800 0.445 10 2 1.010 1.019 1.219
400 0.680 4 4 1.053 1.159 1.639 800 0.445 10 5 0.968 1.159 1.246
400 0.680 5 5 1.051 1.167 1.553 800 0.445 30 3 0.975 1.046 1.240
400 0.680 10 2 1.080 1.041 1.510 800 0.445 60 3 0.967 1.030 1.202
The finite population is constructed from the model in Eq. (9) with ω = 0.70 and 0.80, sample size n = 30, population size N = 2000, and correlation coefficient ρ between the X- and Y-variables. The efficiencies are $RE_1=\mathrm{Var}(\bar Y_{pps,n})/\mathrm{Var}(\bar Y_{RB,n})$, $RE_2=\mathrm{Var}(\bar Y_{pp,n})/\mathrm{Var}(\bar Y_{RB,n})$ and $RE_3=\mathrm{MSE}(\bar Y_R)/\mathrm{Var}(\bar Y_{RB,n})$. Variances and mean square errors (MSE) are computed from 5000 simulation replications.
$\rho$ and $\omega$ and change the sample size from $n=30$ to $n=120$. The displays in each column fix the sample size and $\omega$ and vary the correlation coefficient. The plots in Fig. 1 suggest that for the large sample size $n=120$ and a large correlation coefficient between the X- and Y-variables ($\rho>0.90$), we can select $M$ as large as possible with $d$ between 6 and 15. On the other hand, for the moderately large sample size $n=60$, $M=300$ with the number of groups $d$ between 6 and 10 seems to be slightly better than $M=2000$. For lower correlation coefficients ($\rho<0.90$), the selection of $M$ does not make a big difference: all efficiency curves are quite close to each other. In this case, the number of groups $d$ should not be larger than 5 for any $M$.
Similar results also hold in Fig. 2. The only difference is that the efficiency values $RE_3$ are much higher. This indicates that the proposed pp and Rao–Blackwell estimators can
Table 3. Efficiency comparison of the proposed estimators with probability-proportional-to-size and ratio estimators.
ω τ ρ M d RE1 RE2 RE3 τ ρ M d RE1 RE2 RE3
0.90 50 0.907 4 4 2.038 1.443 7.825 400 0.629 10 5 1.033 1.129 1.520
50 0.907 5 5 2.210 1.561 7.854 400 0.629 30 3 1.034 1.045 1.447
50 0.907 10 2 1.946 1.279 6.143 400 0.629 60 3 1.038 1.032 1.411
50 0.907 10 5 2.692 1.588 8.307 500 0.553 4 4 1.029 1.125 1.352
50 0.907 30 3 3.723 1.577 9.571 500 0.553 5 5 1.014 1.204 1.413
50 0.907 60 3 4.449 1.474 11.050 500 0.553 10 2 1.024 1.026 1.336
200 0.808 4 4 1.190 1.180 2.480 500 0.553 10 5 1.002 1.148 1.259
200 0.808 5 5 1.224 1.241 2.478 500 0.553 30 3 1.010 1.039 1.298
200 0.808 10 2 1.196 1.068 2.158 500 0.553 60 3 1.010 1.035 1.299
200 0.808 10 5 1.220 1.209 2.286 800 0.393 4 4 0.993 1.119 1.198
200 0.808 30 3 1.298 1.092 2.207 800 0.393 5 5 0.987 1.150 1.223
200 0.808 60 3 1.279 1.074 2.301 800 0.393 10 2 0.996 1.031 1.246
400 0.629 4 4 1.036 1.128 1.569 800 0.393 10 5 0.986 1.134 1.131
400 0.629 5 5 1.044 1.150 1.463 800 0.393 30 3 0.986 1.034 1.150
400 0.629 10 2 1.050 1.039 1.497 800 0.393 60 3 0.986 1.022 1.159
0.95 50 0.870 4 4 1.710 1.368 7.306 400 0.592 10 5 1.001 1.123 1.366
50 0.870 5 5 1.822 1.436 7.334 400 0.592 30 3 1.012 1.048 1.392
50 0.870 10 2 1.354 1.125 4.700 400 0.592 60 3 1.013 1.034 1.262
50 0.870 10 5 2.200 1.523 7.887 500 0.521 4 4 1.007 1.110 1.335
50 0.870 30 3 1.927 1.189 5.863 500 0.521 5 5 1.006 1.134 1.255
50 0.870 60 3 1.811 1.125 5.421 500 0.521 10 2 0.993 1.017 1.281
200 0.766 4 4 1.151 1.196 2.444 500 0.521 10 5 0.999 1.140 1.295
200 0.766 5 5 1.145 1.201 2.244 500 0.521 30 3 0.983 1.032 1.241
200 0.766 10 2 1.079 1.040 1.944 500 0.521 60 3 0.981 1.031 1.213
200 0.766 10 5 1.153 1.180 2.127 800 0.377 4 4 0.996 1.132 1.208
200 0.766 30 3 1.160 1.052 1.930 800 0.377 5 5 0.983 1.134 1.152
200 0.766 60 3 1.125 1.054 1.927 800 0.377 10 2 1.003 1.021 1.140
400 0.592 4 4 1.014 1.129 1.431 800 0.377 10 5 0.973 1.145 1.178
400 0.592 5 5 1.009 1.183 1.420 800 0.377 30 3 0.956 1.019 1.149
400 0.592 10 2 1.021 1.030 1.377 800 0.377 60 3 0.952 1.034 1.102
The finite population is constructed from the model in Eq. (9) with ω = 0.90 and 0.95, sample size n = 30, population size N = 2000, and correlation coefficient ρ between the X- and Y-variables. The efficiencies are $RE_1=\mathrm{Var}(\bar Y_{pps,n})/\mathrm{Var}(\bar Y_{RB,n})$, $RE_2=\mathrm{Var}(\bar Y_{pp,n})/\mathrm{Var}(\bar Y_{RB,n})$ and $RE_3=\mathrm{MSE}(\bar Y_R)/\mathrm{Var}(\bar Y_{RB,n})$. Variances and mean square errors (MSE) are computed from 5000 simulation replications.
be better alternatives to a ratio estimator for populations producing extreme X- and Y-values, for the smaller values of $1-\omega$.
6. APPLICATION TO APPLE PRODUCTION DATA
In this section, we apply the proposed sampling designs to apple production data without stratification. Since the efficiency of the point estimators is discussed in the previous section, we investigate the properties of the confidence intervals.
We performed a simulation study to compare the efficiency of the confidence intervals for the population mean based on the Rao–Blackwell ($\bar Y_{RB,n}$), pp ($\bar Y_{pp,n}$) and pps ($\bar Y_{pps,n}$) estimators. The simulation study considered the set and group size combinations
$(M,d)=(6,6), (7,7), (8,8), (9,9), (10,10), (30,3), (30,5), (300,3), (300,5), (300,10),$
Table 4. Efficiency comparison of the proposed estimators with probability-proportional-to-size and ratio estimators.
τ ρ M d RE1 RE2 RE3 τ ρ M d RE1 RE2 RE3
50 0.966 5 5 2.353 1.456 7.398 400 0.657 60 10 1.048 1.062 1.339
50 0.966 10 10 2.760 1.408 5.784 400 0.657 2000 2 1.048 1.002 1.267
50 0.966 60 2 2.047 1.114 3.123 400 0.657 2000 4 1.027 1.006 1.250
50 0.966 60 3 4.449 1.380 6.940 400 0.657 2000 5 1.050 1.014 1.297
50 0.966 60 4 3.325 1.226 4.719 400 0.657 2000 10 1.027 1.014 1.381
50 0.966 60 5 3.950 1.302 6.496 400 0.657 2000 20 0.985 1.088 1.285
50 0.966 60 10 4.015 1.328 7.143 500 0.574 5 5 1.026 1.014 1.530
50 0.966 2000 2 1.808 1.029 2.445 500 0.574 10 10 1.020 1.082 1.246
50 0.966 2000 4 3.002 1.035 4.227 500 0.574 60 2 1.032 0.999 1.406
50 0.966 2000 5 3.331 1.028 5.063 500 0.574 60 3 1.044 1.017 1.296
50 0.966 2000 10 4.185 1.077 5.386 500 0.574 60 4 1.042 1.019 1.312
50 0.966 2000 20 5.964 1.258 8.858 500 0.574 60 5 1.038 1.022 1.469
200 0.856 5 5 1.208 1.126 2.160 500 0.574 60 10 1.017 1.055 1.464
200 0.856 10 10 1.237 1.101 2.034 500 0.574 2000 2 1.018 0.999 1.303
200 0.856 60 2 1.195 1.022 1.722 500 0.574 2000 4 1.035 1.002 1.309
200 0.856 60 3 1.300 1.029 1.763 500 0.574 2000 5 1.027 1.009 1.349
200 0.856 60 4 1.215 1.029 1.542 500 0.574 2000 10 1.030 1.012 1.395
200 0.856 60 5 1.285 1.042 1.904 500 0.574 2000 20 0.923 1.090 1.123
200 0.856 60 10 1.322 1.079 1.701 800 0.401 5 5 1.021 1.009 1.234
200 0.856 2000 2 1.153 1.002 1.295 800 0.401 10 10 1.001 1.054 1.362
200 0.856 2000 4 1.281 1.009 1.709 800 0.401 60 2 1.008 1.000 1.394
200 0.856 2000 5 1.251 1.013 1.608 800 0.401 60 3 1.025 1.004 1.262
200 0.856 2000 10 1.289 1.005 1.501 800 0.401 60 4 0.998 1.008 1.340
200 0.856 2000 20 1.160 1.090 1.571 800 0.401 60 5 0.998 1.017 1.275
400 0.657 5 5 1.058 1.038 1.533 800 0.401 60 10 1.007 1.059 1.390
400 0.657 10 10 1.047 1.074 1.492 800 0.401 2000 2 0.996 1.000 1.279
400 0.657 60 2 1.064 1.013 1.329 800 0.401 2000 4 1.011 1.001 1.433
400 0.657 60 3 1.069 1.001 1.526 800 0.401 2000 5 0.966 1.004 1.232
400 0.657 60 4 1.057 1.005 1.318 800 0.401 2000 10 0.937 1.015 1.167
400 0.657 60 5 1.073 1.031 1.440 800 0.401 2000 20 0.894 1.052 1.083
The finite population is constructed from the model in Eq. (9) with ω = 0.70, sample size n = 120, population size N = 2000, and correlation coefficient ρ between the X- and Y-variables. The efficiencies are $RE_1=\mathrm{Var}(\bar Y_{pps,n})/\mathrm{Var}(\bar Y_{RB,n})$, $RE_2=\mathrm{Var}(\bar Y_{pp,n})/\mathrm{Var}(\bar Y_{RB,n})$ and $RE_3=\mathrm{MSE}(\bar Y_R)/\mathrm{Var}(\bar Y_{RB,n})$. Variances and mean square errors (MSE) are computed from 5000 simulation replications.
(600,3), (600,5), (600,10). We considered two different sample sizes, $n=60, 120$. The simulation size is taken to be 1000. The Rao–Blackwell estimator is computed with fifty replications, $Q=50$. The efficiencies of the confidence intervals are defined as the ratios of the average squared lengths,
\[
RE_4=\frac{\sum_{i=1}^{1000}L_{pp,i}^2}{\sum_{i=1}^{1000}L_{RB,i}^2},\qquad
RE_5=\frac{\sum_{i=1}^{1000}L_{pps,i}^2}{\sum_{i=1}^{1000}L_{RB,i}^2},
\]
where $L_{pp,i}$, $L_{RB,i}$ and $L_{pps,i}$ are the lengths of the confidence intervals in the $i$th replication based on the point estimators $\bar Y_{pp,n}$, $\bar Y_{RB,n}$ and $\bar Y_{pps,n}$, respectively, and are given by
\[
L_{pp,i}=2t_{n-d_n}\hat\sigma_{pp,n,i},\qquad L_{RB,i}=2t_{n-1}\hat\sigma_{RB,n,i},\qquad L_{pps,i}=2t_{n-d_n}\hat\sigma_{pps,n,i}.
\]
Table 5. Efficiency comparison of the proposed estimators with probability-proportional-to-size and ratio estimators.
τ ρ M d RE1 RE2 RE3 τ ρ M d RE1 RE2 RE3
50 0.907 5 5 2.152 1.395 7.805 400 0.629 60 10 1.095 1.063 1.504
50 0.907 10 10 2.711 1.453 9.264 400 0.629 2000 2 1.037 1.000 1.402
50 0.907 60 2 1.748 1.103 5.312 400 0.629 2000 4 1.072 0.999 1.374
50 0.907 60 3 4.899 1.440 14.821 400 0.629 2000 5 1.092 1.016 1.484
50 0.907 60 4 3.233 1.247 9.029 400 0.629 2000 10 1.069 1.013 1.288
50 0.907 60 5 3.545 1.242 9.746 400 0.629 2000 20 1.029 1.061 1.342
50 0.907 60 10 4.363 1.282 11.177 500 0.553 5 5 1.031 1.027 1.495
50 0.907 2000 2 1.693 1.010 4.899 500 0.553 10 10 1.049 1.090 1.454
50 0.907 2000 4 2.767 1.037 6.999 500 0.553 60 2 1.027 1.005 1.363
50 0.907 2000 5 3.228 1.024 7.918 500 0.553 60 3 1.064 1.004 1.423
50 0.907 2000 10 5.054 1.109 14.088 500 0.553 60 4 1.043 1.018 1.305
50 0.907 2000 20 4.738 1.178 13.567 500 0.553 60 5 1.043 1.031 1.454
200 0.808 5 5 1.234 1.065 2.813 500 0.553 60 10 1.031 1.048 1.373
200 0.808 10 10 1.213 1.111 2.250 500 0.553 2000 2 1.046 1.001 1.338
200 0.808 60 2 1.189 1.021 1.986 500 0.553 2000 4 1.018 1.004 1.280
200 0.808 60 3 1.422 1.058 2.350 500 0.553 2000 5 1.027 1.005 1.243
200 0.808 60 4 1.311 1.042 2.266 500 0.553 2000 10 1.013 1.014 1.214
200 0.808 60 5 1.278 1.049 2.403 500 0.553 2000 20 0.925 1.053 1.132
200 0.808 60 10 1.362 1.103 2.462 800 0.393 5 5 1.025 1.007 1.223
200 0.808 2000 2 1.148 1.004 1.856 800 0.393 10 10 1.004 1.072 1.158
200 0.808 2000 4 1.247 1.003 2.024 800 0.393 60 2 1.004 1.002 1.168
200 0.808 2000 5 1.333 1.006 2.135 800 0.393 60 3 1.017 1.013 1.150
200 0.808 2000 10 1.280 1.017 2.161 800 0.393 60 4 1.002 1.004 1.135
200 0.808 2000 20 1.357 1.102 2.325 800 0.393 60 5 1.006 1.039 1.090
400 0.629 5 5 1.070 1.017 1.537 800 0.393 60 10 0.965 1.061 1.099
400 0.629 10 10 1.065 1.101 1.381 800 0.393 2000 2 1.003 1.004 1.349
400 0.629 60 2 1.058 1.008 1.429 800 0.393 2000 4 0.997 1.005 1.270
400 0.629 60 3 1.051 1.021 1.554 800 0.393 2000 5 0.978 1.002 1.139
400 0.629 60 4 1.060 1.010 1.610 800 0.393 2000 10 0.970 1.029 1.081
400 0.629 60 5 1.073 1.040 1.363 800 0.393 2000 20 0.909 1.099 1.248
The finite population is constructed from the model in Eq. (9) with ω = 0.90, sample size n = 120, population size N = 2000, and correlation coefficient ρ between the X- and Y-variables. The efficiencies are $RE_1=\mathrm{Var}(\bar Y_{pps,n})/\mathrm{Var}(\bar Y_{RB,n})$, $RE_2=\mathrm{Var}(\bar Y_{pp,n})/\mathrm{Var}(\bar Y_{RB,n})$ and $RE_3=\mathrm{MSE}(\bar Y_R)/\mathrm{Var}(\bar Y_{RB,n})$. Variances and mean square errors (MSE) are computed from 5000 simulation replications.
The notations $\hat\sigma_{pp,n,i}$, $\hat\sigma_{RB,n,i}$ and $\hat\sigma_{pps,n,i}$ denote the variance estimates of the estimators in the $i$th replication of the simulation study. Values of $RE_4$ and $RE_5$ greater than 1 indicate that the confidence intervals in the denominators are, on average, shorter than the ones in the numerators. Table 6 presents the efficiencies and the coverage probabilities of the 95% confidence intervals.
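The interval efficiencies and coverage probabilities reported in Table 6 are simple summaries over simulation replications. A minimal sketch is given below, assuming the interval lengths, centres and half-widths from each replication have already been computed; the function names and the toy inputs are hypothetical.

```python
def interval_efficiency(lengths_numerator, lengths_rb):
    """Ratio of summed squared interval lengths, as in RE4 and RE5.

    A value above 1 means the intervals in the denominator (here the
    Rao-Blackwell intervals) are shorter on average in squared length.
    """
    num = sum(length ** 2 for length in lengths_numerator)
    den = sum(length ** 2 for length in lengths_rb)
    return num / den


def coverage(centres, half_widths, target):
    """Fraction of intervals [centre - h, centre + h] that contain target."""
    hits = sum(
        1 for c, h in zip(centres, half_widths) if c - h <= target <= c + h
    )
    return hits / len(centres)


# Toy inputs: two replications for the efficiency, three for the coverage.
print(interval_efficiency([2.0, 2.0], [1.0, 1.0]))
print(coverage([0.0, 1.0, 5.0], [1.0, 1.0, 1.0], target=0.5))
```

Because both sums run over the same number of replications, the ratio of sums equals the ratio of average squared lengths.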
It is clear that the values of $RE_5$ are all greater than 1. Hence, the confidence intervals based on the Rao–Blackwell estimator are shorter than the intervals constructed from the pps estimator. The efficiency values slightly increase with the sample size $n=120$. Since the correlation coefficient ($\rho=0.916$) between the X- and Y-variables is relatively high, larger set sizes $M$ tend to produce shorter intervals based on the Rao–Blackwell estimator.
[Figure 1: nine panels plotting RE3 (vertical axis) against the number of groups d (horizontal axis, 2 to 14), with one curve for each set size M = 30, 60, 300, 2000. Rows: ρ = 0.966, 0.856, 0.657; columns: n = 30, 60, 120; ω = 0.7 in all panels.]
Figure 1. Efficiency plots ($RE_3=\mathrm{MSE}(\bar Y_{R,n})/\mathrm{Var}(\bar Y_{RB,n})$) of the Rao–Blackwell estimator with respect to the ratio estimator in double sampling for different values of $M$, $d$ and $n$ when $\omega=0.7$.
The efficiency ($RE_4>1$) of the jackknife confidence interval based on the Rao–Blackwell estimator is higher when the set size $M$ is less than or equal to 10. For large values of $M$, the $RE_4$ values are around 1, indicating that the jackknife confidence intervals are as good as or slightly less efficient than the pp-based confidence intervals.
The coverage probabilities for all three confidence intervals are slightly lower than the nominal coverage probability 0.95 when $n=60$. Since the confidence intervals are constructed based on a normal approximation, this may be due to the effect of the sample size. Table 6 shows that the coverage probabilities are quite close to 0.95 when $n=120$.
We performed another simulation study to investigate the efficiency of stratified pp estimators under equal, proportional and Neyman allocations. In this part of the simulation, pp samples are constructed from the stratified apple production data in Table 1. As we observe from the table, there is large variation between stratum populations. Hence, the use of a stratified pp sample would be appropriate. To determine the sample sizes in Neyman allocation, we used the population standard deviations in Table 1. The simulation study
[Figure 2: nine panels plotting RE3 (vertical axis) against the number of groups d (horizontal axis, 2 to 14), with one curve for each set size M = 30, 60, 300, 2000. Rows: ρ = 0.907, 0.808, 0.629; columns: n = 30, 60, 120; ω = 0.9 in all panels.]
Figure 2. Efficiency plots ($RE_3=\mathrm{MSE}(\bar Y_{R,n})/\mathrm{Var}(\bar Y_{RB,n})$) of the Rao–Blackwell estimator with respect to the ratio estimator in double sampling for different values of $M$, $d$, $\rho$ and $n$ when $\omega=0.90$.
considered two designs, stratified pps and stratified pp sampling, with sample sizes $n=140, 210, 280$. For the stratified pp sampling design, we computed two estimators, $\bar Y_{str}$ and the Rao–Blackwell estimator $\bar Y_{str,RB}$,
\[
\bar Y_{str,RB}=\sum_{l=1}^{L}\frac{N_l}{N}\,\bar Y_{RB,n_l},
\]
where $\bar Y_{RB,n_l}$ is the Rao–Blackwell estimator from stratum population $l$. The variance of $\bar Y_{str,RB}$ is denoted by $\sigma^2_{\lambda,RB}(E)$, $\sigma^2_{\lambda,RB}(P)$ and $\sigma^2_{\lambda,RB}(N)$ for equal, proportional and Neyman allocations, respectively. To approximate the Rao–Blackwell estimator, the number of replications, $Q$, is set to either 10 or 50.
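The stratified estimator is a population-size-weighted combination of the stratum estimates. A minimal sketch follows, assuming the stratum-level estimates (for example, Rao–Blackwell estimates) are already available; the function name and the numbers are hypothetical.

```python
def stratified_mean(sizes, stratum_estimates):
    """Combine stratum estimates with weights N_l / N."""
    N = sum(sizes)
    return sum(N_l * est for N_l, est in zip(sizes, stratum_estimates)) / N


# Three hypothetical strata with sizes 100, 300, 600.
print(stratified_mean([100, 300, 600], [10.0, 20.0, 30.0]))
```

The same function applies to the stratified pps estimator by passing stratum-level pps estimates instead.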
For comparison purposes, we also considered the estimator of θbased on stratified pps
sample
Table 6. The efficiencies and coverage probabilities (Cov) of the confidence intervals.
n M d RE4 RE5 Cov(RB) Cov(pp) Cov(pps)
60 6 6 1.172 1.333 0.937 0.938 0.931
60 7 7 1.200 1.344 0.926 0.927 0.933
60 8 8 1.220 1.348 0.919 0.919 0.931
60 9 3 1.108 1.363 0.920 0.923 0.936
60 9 9 1.243 1.366 0.935 0.927 0.934
60 10 10 1.268 1.353 0.927 0.925 0.937
60 30 3 1.062 1.396 0.941 0.935 0.944
60 30 5 1.092 1.401 0.930 0.931 0.935
60 300 3 0.998 1.410 0.932 0.929 0.931
60 300 5 0.994 1.353 0.941 0.934 0.942
60 300 10 1.006 1.283 0.906 0.905 0.937
60 600 3 0.987 1.419 0.931 0.932 0.937
60 600 5 0.991 1.362 0.928 0.928 0.936
60 600 10 0.943 1.204 0.930 0.922 0.937
120 6 6 1.125 1.338 0.949 0.948 0.947
120 7 7 1.125 1.359 0.933 0.949 0.938
120 8 8 1.134 1.371 0.963 0.955 0.960
120 9 3 1.093 1.362 0.943 0.945 0.950
120 9 9 1.147 1.385 0.956 0.953 0.963
120 10 10 1.162 1.388 0.945 0.945 0.940
120 30 3 1.057 1.405 0.940 0.944 0.936
120 30 5 1.077 1.449 0.933 0.934 0.935
120 300 3 1.013 1.475 0.936 0.939 0.938
120 300 5 1.008 1.423 0.942 0.938 0.941
120 300 10 1.017 1.443 0.934 0.932 0.931
120 600 3 1.004 1.471 0.949 0.950 0.961
120 600 5 1.006 1.439 0.941 0.935 0.945
120 600 10 1.006 1.453 0.939 0.940 0.951
The RE4is the ratio of the average squared lengths of the confidence intervals based on post-stratified probability-
proportional-to-size (pp) and Rao–Blackwell (RB) estimators. The RE5is the ratio of the average squared lengths
of the confidence intervals based on probability-proportional-to-size (pps) and RB estimators.
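\[
\breve Y_{str}=\sum_{l=1}^{L}\frac{N_l}{N}\,\bar Y_{pps,n_l},
\]
where $\bar Y_{pps,n_l}$ is the pps estimator in Eq. (1) from stratum population $l$. The variance of $\sqrt{n}(\breve Y_{str}-\theta)$, similar to the ones in stratified pp sampling, can be computed for equal, proportional and Neyman allocations:
\[
\breve\sigma^2_\lambda(E)=\sum_{l=1}^{L}\frac{L\breve A_l^2}{N^2};\qquad
\breve\sigma^2_\lambda(P)=\sum_{l=1}^{L}\frac{\breve A_l^2}{NN_l};\qquad
\breve\sigma^2_\lambda(N)=\Bigl(\sum_{l=1}^{L}\frac{\breve A_l}{N}\Bigr)^2,
\]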
where
\[
\breve A_l^2=\frac{1}{n}\sum_{k=1}^{N}\pi_k\Bigl(\frac{y_k}{N_l\pi_k}-\theta_l\Bigr)^2.
\]
Stratified samples for both pps and pp designs are constructed using equal, proportional and Neyman allocations. The simulation size is taken to be 50,000. Table 7 presents the relative efficiencies of the stratified pps and pp estimators with respect to the Rao–Blackwell estimators.
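Table 7. Relative efficiencies of the estimators and coverage probabilities of the confidence interval of the parameter θ for equal (E), proportional (P) and Neyman (N) allocation procedures.
Columns 3 to 5 (stratified pps): $\breve\sigma^2_\lambda(E)/\sigma^2_{\lambda,RB}(E)$, $\breve\sigma^2_\lambda(P)/\sigma^2_{\lambda,RB}(P)$, $\breve\sigma^2_\lambda(N)/\sigma^2_{\lambda,RB}(N)$.
Columns 6 to 8 (stratified pp): $\sigma^2_\lambda(E)/\sigma^2_{\lambda,RB}(E)$, $\sigma^2_\lambda(P)/\sigma^2_{\lambda,RB}(P)$, $\sigma^2_\lambda(N)/\sigma^2_{\lambda,RB}(N)$.
Columns 9 and 10 (Rao–Blackwell): $\sigma^2_{\lambda,RB}(E)/\sigma^2_{\lambda,RB}(N)$, $\sigma^2_{\lambda,RB}(P)/\sigma^2_{\lambda,RB}(N)$.
Columns 11 to 13 (stratified pp): Cov(E), Cov(P), Cov(N).
n Q pps(E) pps(P) pps(N) pp(E) pp(P) pp(N) RB(E/N) RB(P/N) Cov(E) Cov(P) Cov(N)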
140 10 1.742 1.741 1.549 1.356 1.351 1.179 1.351 1.553 0.936 0.927 0.942
140 50 1.864 1.790 1.577 1.372 1.378 1.185 1.331 1.525 0.937 0.928 0.941
210 10 1.907 1.913 1.709 1.269 1.341 1.166 1.418 1.500 0.942 0.937 0.946
210 50 1.932 1.901 1.653 1.330 1.322 1.205 1.319 1.452 0.939 0.938 0.938
280 10 1.967 1.899 1.731 1.266 1.237 1.209 1.328 1.514 0.942 0.938 0.945
280 50 2.060 1.928 1.683 1.278 1.284 1.239 1.256 1.412 0.947 0.938 0.938
It is clear that the Rao–Blackwell estimator for each allocation procedure provides a substantial improvement over the stratified pps and pp estimators. As expected, the efficiency increases with the sample size $n$, but increasing the number of replications $Q$ in the Rao–Blackwell estimator from 10 to 50 does not improve the efficiency significantly. The Neyman allocation dominates the other allocation procedures, as expected. For this population, equal allocation has higher efficiency than proportional allocation. This is consistent with the expression for $\sigma^2_\lambda(E)-\sigma^2_\lambda(P)$: the difference can be negative for populations in which smaller stratum populations have larger variances. A close inspection of the apple production data in Table 1 indicates that the largest stratum population variance belongs to the second smallest stratum. Hence, for this population, equal allocation is better than proportional allocation. The coverage probabilities of the confidence interval of $\theta$ based on the stratified pp estimator are relatively close to the nominal value of 0.95.
7. CONCLUDING REMARKS
In many survey sampling studies, in addition to the variable of interest, the population units have a known auxiliary variable. This auxiliary variable is often proportional to the variable under study. If the population has strong heterogeneity among its members, such as extremely large values for some population units, a pps sample provides an estimator of the population mean with smaller variance than a simple random sample estimator of the same size. In a pps sample, sample units are selected with selection probabilities proportional to the size of the auxiliary variable. Since the auxiliary variable is highly correlated with the variable of interest, it also provides information about the relative position of the units in a comparison set with respect to the variable of interest. In this paper, we used this position information to construct a post-stratified pps sample. The new sample creates post-strata among the sample units of a pps sample. Hence, the estimators of the population mean have a smaller variance than the estimator from a pps sample of the same size. The post-stratification of the pps sample is performed by conditioning on the comparison sets. We use the Rao–Blackwell theorem to improve the post-stratified pps sample estimator. The new sampling design is naturally extended to stratified populations. The efficiency of the estimator of the population mean is empirically evaluated in a stratified population.
[Received February 2019. Accepted July 2019. Published Online July 2019.]
REFERENCES
Al-Saleh, M. F. and Samawi, H. (2007). A note on Inclusion Probability in Ranked Set Sampling for finite population. Test, 16, 198–209.
Dastbaravarde, A., Arghami, N. R., and Sarmad, M. (2016). Some theoretical results concerning non parametric estimation by using a judgement poststratification sample. Communications in Statistics - Theory and Methods, 45, 2181–2203.
Deshpande, J. V., Frey, J., and Ozturk, O. (2006). Nonparametric ranked set-sampling confidence intervals for a finite population. Environmental and Ecological Statistics, 13, 25–40.
Frey, J. (2011). A note on ranked-set sampling using a covariate. Journal of Statistical Planning and Inference,
141, 809–816.
— (2012). Constrained nonparametric estimation of the mean and the CDF using ranked-set sampling with a
covariate. Annals of the Institute of Statistical Mathematics, 64, 439–456.
Frey, J. and Feeman, T.G. (2012). An improved mean estimator for judgement post-stratification. Computational
Statistics and Data Analysis, 56, 418–426.
— (2013). Variance estimation using judgement post-stratification. Annals of the Institute of Statistical Mathematics, 65, 551–569.
Frey, J. and Ozturk, O. (2011). Constrained estimation using judgement post-stratification. Annals of the Institute
of Statistical Mathematics, 63, 769–789.
Gokpinar, F. and Ozdemir, Y.A. (2010). Generalization of inclusion probabilities in ranked set sampling. Hacettepe
Journal of Mathematics and Statistics, 39, 89–95.
Kadilar, C. and Cingi, H. (2003). Ratio estimators in stratified random sampling. Biometrical Journal, 45, 218–225.
MacEachern, S. N., Stasny, E. A., and Wolfe, D. A. (2004). Judgment post-stratification with imprecise rankings. Biometrics, 60, 207–215.
Ozdemir, Y. A. and Gokpinar, F. (2008). A new formula for inclusion probabilities in median ranked set sampling. Communications in Statistics - Theory and Methods, 37, 2022–2033.
— (2007). A generalized formula for inclusion probabilities in ranked set sampling. Hacettepe Journal of Mathe-
matics and Statistics, 36, 89–99.
Ozturk, O. (2014a). Estimation of population mean and total in finite population setting using multiple auxiliary
variables. Journal of Agricultural, Biological and Environmental Statistics, 19, 161–184.
— (2014b). Statistical inference for population quantiles and variance in judgment post-stratified samples, Com-
putational Statistics and Data Analysis, 77, 188–205.
— (2016a). Estimation of a finite population mean and total using population ranks of sample units. Journal of
Agricultural, Biological and Environmental Statistics, 21, 181–202.
— (2016b). Statistical inference based on judgment post-stratified samples in finite population. Survey Methodol-
ogy, 42, 239–262.
Ozturk, O. and Bayramoglu Kavlak, K. (2018). Model based inference using ranked set samples. Survey Methodology, 44, 1–16.
— (2019). Statistical inference using stratified ranked set samples from finite populations, Chapter 12, pages,
157–170. Ranked Set Sampling: 65 Years Improving the Accuracy in Data Gathering edited by Bouza and
Al-Omari, Elsevier, San Diego, USA.
Ozturk, O. and Jafari Jozani, M. (2013). Inclusion Probabilities in Partially Rank Ordered Set Sampling. Compu-
tational Statistics and Data Analysis, 69, 122–132.
Patil, G.P., Sinha, A.K., and Taillie, C. (1995). Finite population corrections for ranked set sampling. Annals of the
Institute of Statistical Mathematics, 47, 621–636.
R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. URL https://www.R-project.org/.
Thompson, S.K. (2002). Sampling, 2nd edition, Wiley, New York.
Wang, X., Stokes, L., Lim, J., and Chen, M. (2006). Concomitants of multivariate order statistics with application to judgment post-stratification. Journal of the American Statistical Association, 101, 1693–1704.
Wang, X., Lim, J., and Stokes, S. L. (2008). A nonparametric mean estimator for judgement post-stratified data. Biometrics, 64, 355–363.