Post-stratified Probability-Proportional-to-Size Sampling from Stratified Populations

July 2019
Journal of Agricultural Biological and Environmental Statistics 24(4)

This paper develops statistical inference based on a post-stratified probability-proportional-to-size (pp) sample from a finite population. A pp sample selects the sample units with selection probabilities proportional to their size and measures them for the characteristic of interest. For each measured unit, the pp sample further creates position information (rank) in a comparison set of size M. The sample is then post-stratified into ranking classes based on their position information in the comparison set. A pp sample is expanded to stratified populations by selecting a pp sample from each stratum population to form the stratified pp sample. Using this stratified pp sample, we construct unbiased and Rao–Blackwell estimators for the mean of the stratified populations. Different sample size allocation procedures for stratum sample sizes are investigated. The new sampling design is applied to apple production data to estimate the total apple production in Turkey.

Supplementary materials for this article are available at https://doi.org/10.1007/s13253-019-00370-6.

Omer Ozturk

Supplementary materials accompanying this paper appear online.

Key Words: Rao–Blackwell estimator; Stratified sampling; Post-stratified sample; Neyman allocation; Probability-proportional-to-size sampling.

1. INTRODUCTION

In settings where the characteristic of interest Y is approximately proportional to a known positive auxiliary variable X, probability-proportional-to-size (pps) sampling is preferable to simple random sampling. In a pps sample, units are selected with probability proportional to their value of X, which gives the important (large) units in the population a higher chance of being included in the sample. Since X is approximately proportional to Y and its values are available for all population units, it also provides information on the relative position (rank) of the Y-value of a unit in a comparison set. This position information can be used to induce more structure in the sample by post-stratifying it into different ranking groups.

O. Ozturk (✉), Department of Statistics, The Ohio State University, 1958 Neil Avenue, Columbus, OH 43210, USA (e-mail: omer@stat.osu.edu).

Journal of Agricultural, Biological, and Environmental Statistics, Volume 24, Number 4, Pages 693–718

https://doi.org/10.1007/s13253-019-00370-6

Table 1. Population characteristics of the apple production data (in 1000 kg). $\theta_l$, $\tau_l$, $N_l$ and $\rho_l$ are the mean, standard deviation, population size and correlation coefficient between the X- and Y-variables for stratum l, respectively.

| Strata (l) | $\theta_l$ | $\tau_l$ | $N_l$ | $\rho_l$ |
|---|---|---|---|---|
| Marmara (l = 1) | 1536.8 | 6425 | 106 | 0.816 |
| Aegean (l = 2) | 2233.7 | 11,604.9 | 105 | 0.856 |
| Mediterranean (l = 3) | 9384.31 | 29,907.5 | 94 | 0.901 |
| Black Sea (l = 4) | 967 | 2389.7 | 204 | 0.713 |
| Central Anatolia (l = 5) | 5588 | 28,643.4 | 171 | 0.986 |
| Eastern Anatolia (l = 6) | 631.4 | 1171 | 103 | 0.885 |
| Southeastern Anatolia (l = 7) | 72.4 | 111.3 | 68 | 0.917 |
| All regions combined | 2940.456 | 17,135 | 851 | 0.916 |

One natural setting for post-stratified pps sampling is given in Kadilar and Cingi (2003) for the estimation of apple production. The population consists of apple-producing localities in Turkey. Apples in Turkey are produced in seven geographical regions: Marmara, Aegean, Mediterranean, Central Anatolia, Black Sea, Eastern Anatolia and Southeastern Anatolia. These regions have different weather patterns, and apple production varies from one region to another. The Turkish Statistical Institute collected data from these regions to estimate the total apple production in 2002. The data set contains two variables for each locality in each region: apple production (Y, in 1000 kg) and the number of apple trees (X). The values of X are available for all population units in the database prior to sampling. The entire population has seven subpopulations and fits a stratified population structure. The main characteristics of the population are presented in Table 1.

The entire population with all regions combined contains 851 units (townships). The correlation coefficient between X and Y is 0.916. The population means (standard deviations) of X and Y are 37,732.667 (145,031.7) and 2940.456 (17,135), respectively. It is clear from Table 1 that the stratum populations have different means and variances. Hence, there is large within- and between-strata variation. It is reasonable to assume that apple production Y is approximately proportional to the number of apple trees X in each locality, so a pps sample is appropriate. In addition to the structure imposed by unequal probability sampling, the number of apple trees (X) provides information about the relative position of the apple production (Y) of each sampled unit within a small comparison set of localities.

All 851 localities in the apple production data serve as our finite population in this paper. We select a pps sample with selection probabilities proportional to the number of apple trees and measure the apple production. For each unit in the pps sample, we select another pps sample to construct a comparison set and determine the rank of the measured unit, without measurement, using the X-variable. The pps sample is then post-stratified based on the ranks from the comparison sets. The theory of stratified sampling suggests that the post-stratified pps sample improves on the initial pps sample. This paper provides a foundation for this claim.

The position information is successfully used in ranked set sampling (rss) and judgment post-stratified sampling (jps) designs. Both of these designs determine the rank of each measured unit in a comparison set of size M. Each measured unit, in addition to the information it carries, provides information through its rank about the other unmeasured units in the comparison set. Since the construction of the comparison set and the ranking of its units come at no additional cost, this additional information is essentially free.

To construct an rss sample of size n, we first determine the cycle size d and the set size M, with n = dM. We then select nM localities from the apple production population and partition them into n disjoint comparison sets, each having M units. Units in each comparison set are ranked without measurement using the X-variable, and the value of Y associated with the hth ranked X, $Y_{[h]j}$, is measured in d different comparison sets, h = 1, ..., M. The measured values $Y_{[h]j}$, h = 1, ..., M; j = 1, ..., d, are called a ranked set sample.

To construct a jps sample of size n, we start with a simple random sample of size n and measure all of its units, $Y_i$; i = 1, ..., n. For each measured unit in this sample, we select M − 1 additional units from the population to form a comparison set of size M. The rank $R_i$ of the measured unit in each of these comparison sets is determined. The pairs $(Y_i, R_i)$, i = 1, ..., n, constitute a jps sample.

The rss design has attracted the interest of many researchers in the finite population setting. Patil et al. (1995) used a ranked set sample to estimate the population mean for a population of size N when the sample is constructed without replacement. Deshpande et al. (2006) described three different sampling designs and constructed nonparametric confidence intervals for population quantiles. Al-Saleh and Samawi (2007), Ozdemir and Gokpinar (2007, 2008), Gokpinar and Ozdemir (2010), Ozturk and Jafari Jozani (2013), Frey (2011) and Ozturk (2014a, 2016a) computed inclusion probabilities and constructed Horvitz–Thompson-type estimators for the population mean and total based on a ranked set sample. Ozturk and Bayramoglu Kavlak (2018, 2019) developed statistical inference based on a superpopulation model using ranked set sample data. These papers show that the rss design yields a substantial improvement in efficiency over the usual simple random sampling design.

The jps sampling design was originally developed for an infinite population setting in MacEachern et al. (2004). In recent years, considerable attention has been given to research on jps sampling. Wang et al. (2006) developed a class of estimators for the population mean using the concomitants of multivariate order statistics. Wang et al. (2008) imposed a stochastic ordering constraint among judgment classes to improve the efficiency of the estimator of the population mean. Frey and Ozturk (2011) replaced the stochastic ordering constraint with a weaker ordering condition in which the judgment class cumulative distribution functions (cdf) can be no more extreme than the cdfs of the true order statistics. In a follow-up paper, Frey (2012) combined this weaker ordering condition with the stochastic ordering constraint to construct a better estimator for the population mean. Frey and Feeman (2012, 2013) constructed optimal estimators within a class of unbiased estimators for the population mean and variance. In the finite population setting, Ozturk (2016b) developed estimators for the population mean based on a jps sample and showed that the estimator needs a finite population correction factor similar to the one used in simple random sampling.

In this paper, we look at the jps sampling design from a different perspective. We construct the jps sample using a probability-proportional-to-size sampling design. We first construct a pps sample from a finite population. This pps sample is then post-stratified based on the relative positions of its units in comparison sets. Even though it may be possible to construct a different type of estimator, presumably more efficient, based on full covariate information, it is not considered in this paper. Section 2 introduces the post-stratified pps (pp) sample in a finite population setting and constructs unbiased estimators for the population mean and its variance. Section 3 constructs the Rao–Blackwell estimator by conditioning on the measured values of the pps sample. Section 4 extends pp sampling to a stratified population, constructs unbiased and Rao–Blackwell estimators for the population mean, and considers four different sample size allocation procedures to minimize the variance of the estimator under a cost model and different stratum population structures. Section 5 provides empirical evidence to investigate the properties of the proposed estimators and compares them with their competitors. Section 6 applies the proposed sampling design and estimators to apple production data in Turkey. Section 7 provides some concluding remarks. All proofs are given in the supplementary material.

2. PROBABILITY-PROPORTIONAL-TO-SIZE POST-STRATIFIED SAMPLING

We consider a finite population of size $N$, $P^N = \{u_1, \ldots, u_N\}$. Each population unit $u_i$ possesses two characteristics $Y$ and $X$, where $Y$ is the characteristic of interest and $X$ is an auxiliary size variable. In this population, the actual values of the $Y$- and $X$-variables are denoted by $y_1, \ldots, y_N$ and $x_1, \ldots, x_N$, respectively. We assume that the characteristic $X$ is approximately proportional to the characteristic $Y$. The population mean and variance of $Y$ are denoted by

$$\theta = \frac{1}{N}\sum_{i=1}^{N} y_i \quad\text{and}\quad \gamma^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \theta)^2.$$

From this population, we select a probability-proportional-to-size sample of size $n$ with replacement. Note that we use design-based inference. Hence, the values of $Y$ and $X$ in the population are non-random constants, and the sampling variation is induced by the selection probabilities of the units. Let $W_i$ be the indicator function

$$W_i = \begin{cases} 1 & \text{if unit } i \text{ is selected} \\ 0 & \text{otherwise} \end{cases}$$

with $P(W_i = 1) = \pi_i$, where $\pi_i$ is proportional to the value of $X$ on $u_i$, $\pi_i \propto x_i$. We then write $Y_i = W_i y_i$. In this expression, even though $y_i$ is a constant, $Y_i$ is a random variable since $W_i$ has a Bernoulli distribution with success probability $\pi_i$. In the remainder of the paper, we reserve the capital letter $Y$ for a random variable and the lowercase letter ($y_i$) for the constant value of $Y$ on unit $u_i$. The pps sample then consists of the triplets $(u_i, Y_i, \pi_i)$; $i = 1, \ldots, n$.

We now induce more structure in this pps sample to improve its information content. For each selected unit $u_i$ in the pps sample, we select $M-1$ additional units using pps sampling with replacement to form a comparison set of size $M$,

$$S_i = \{u_i, u_1, \ldots, u_{M-1}\}, \quad i = 1, \ldots, n.$$

The units in the comparison set $S_i$ are ranked with respect to the size variable $X$, and the rank $R_i$ of $u_i$ is recorded. The pps sample is then augmented with this ranking information, $(u_i, Y_i, \pi_i, R_i)$; $i = 1, \ldots, n$. Each unit $u_i$ in the augmented pps sample carries two pieces of information. The first is the value ($Y_i = W_i y_i$) of the characteristic $Y$. The second is the relative position (rank $R_i$) of $u_i$ among all $M$ units in the comparison set $S_i$. The rank $R_i$ is obtained at no additional cost since the size variable $X$ is available in the sampling frame prior to sampling. Since the comparison sets are constructed with a with-replacement pps design, the same unit may appear more than once in $S_i$. If this happens, ties are broken at random to rank the units in the comparison set $S_i$. Even under perfect ranking, ties can create ranking error in the comparison sets since they are broken at random.
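The construction of the augmented pps sample can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `augmented_pps_sample`, the toy values of `x` and `y`, and the seed are made up here, and draw probabilities are taken to be $x_i/\sum_j x_j$.

```python
import random

def augmented_pps_sample(x, y, n, M, rng):
    """Return a list of (unit index, Y_i, pi_i, R_i) for a with-replacement pps sample."""
    N = len(x)
    total_x = sum(x)
    pi = [xi / total_x for xi in x]          # draw probabilities, pi_i proportional to x_i
    sample = []
    for i in rng.choices(range(N), weights=pi, k=n):
        # M - 1 additional units, again drawn by pps with replacement
        others = rng.choices(range(N), weights=pi, k=M - 1)
        below = sum(x[k] < x[i] for k in others)
        ties = sum(x[k] == x[i] for k in others)
        rank = below + 1 + rng.randint(0, ties)   # ties on X are broken at random
        sample.append((i, y[i], pi[i], rank))
    return sample

rng = random.Random(2019)                  # arbitrary seed
x = [12, 40, 7, 95, 33, 60]                # made-up size values
y = [1.1, 3.9, 0.8, 9.7, 3.0, 6.2]         # made-up responses
for unit in augmented_pps_sample(x, y, n=4, M=3, rng=rng):
    print(unit)
```

Only the sampled unit is measured; the other members of each comparison set contribute through the rank alone, which uses `x` and never `y`.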

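In the augmented pps sample, if we ignore the ranks $R_i$; $i = 1, \ldots, n$, the sample $(Y_i, \pi_i)$ becomes a pps sample. Based on this pps sample, an unbiased estimator of the population mean $\theta$ and its variance are given by

$$\bar{Y}_{pps,n} = \frac{1}{nN}\sum_{i=1}^{n} \frac{W_i y_i}{\pi_i}, \qquad \sigma^2_{pps,n} = \mathrm{Var}(\bar{Y}_{pps,n}) = \frac{1}{n}\sum_{k=1}^{N} \pi_k \left(\frac{y_k}{N\pi_k} - \theta\right)^2. \tag{1}$$

An unbiased estimator of $\sigma^2_{pps,n}$ is available in the literature (Thompson 2002, page 52), and an approximate $(1-\alpha)100\%$ confidence interval for the population mean is given by

$$\bar{Y}_{pps,n} \pm t_{n-1,\alpha/2}\,\hat{\sigma}_{pps,n}, \qquad \hat{\sigma}^2_{pps,n} = \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{y_i}{\pi_i N} - \bar{Y}_{pps,n}\right)^2,$$

where $t_{df,a}$ is the $a$th upper quantile of the $t$-distribution with $df$ degrees of freedom.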
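As a hedged illustration of Eq. (1) and the variance estimator, the sketch below draws a with-replacement pps sample and returns the point estimate and its standard error. The function name and the toy population are assumptions made for this example; the $t$ quantile needed for the interval is left to a statistics library.

```python
import math
import random

def pps_mean_estimate(x, y, n, rng):
    """Return (estimate, standard error) from a with-replacement pps sample of size n."""
    N = len(x)
    total_x = sum(x)
    pi = [xi / total_x for xi in x]                  # draw probabilities, pi_i proportional to x_i
    draws = rng.choices(range(N), weights=pi, k=n)   # sampling with replacement
    terms = [y[i] / (N * pi[i]) for i in draws]      # y_i / (N * pi_i), as in Eq. (1)
    estimate = sum(terms) / n
    var_hat = sum((t - estimate) ** 2 for t in terms) / (n * (n - 1))
    return estimate, math.sqrt(var_hat)

# Made-up population in which y is exactly proportional to x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
est, se = pps_mean_estimate(x, y, n=5, rng=random.Random(0))
print(est, se)   # est equals the population mean 5 up to rounding; se is essentially zero
```

When $Y$ is exactly proportional to $X$, every term $y_i/(N\pi_i)$ equals $\theta$, so the estimator has no sampling variation. This is the motivation for pps sampling given in the introduction.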

In a pps sample, the probability mass function (pmf) and cumulative distribution function (cdf) of $Y_i$ are given by

$$f(y) = f_{Y_i}(y) = P(Y_i = y) = \sum_{i=1}^{N} P(W_i = 1) I(y_i = y) = \sum_{j=1}^{N} \pi_j I(y_j = y)$$

and

$$F(y) = F_{Y_i}(y) = \sum_{i:\, y_i \le y}\ \sum_{j=1}^{N} \pi_j I(y_j = y_i),$$

where $I(a)$ is an indicator function. From the above equations, we also observe that the $Y_i$s are independent identically distributed discrete random variables. For independent discrete random variables, the cdf and pmf of the $h$th-order statistic in a set of size $M$ are given by

$$F_{(h:M)}(y) = \sum_{k=h}^{M} \binom{M}{k} F^k(y)\{1 - F(y)\}^{M-k} \quad\text{and}\quad f_{(h:M)}(y) = F_{(h:M)}(y) - F_{(h:M)}(y-),$$

where $F_{(h:M)}(y-)$ is the left limit at $y$.
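The order-statistic formulas translate directly into code. The following minimal sketch assumes the parent cdf values $F(y)$ and $F(y-)$ have already been computed; the function names and the two-point example distribution are hypothetical.

```python
from math import comb

def order_stat_cdf(F, h, M):
    """cdf of the h-th order statistic of M iid draws, evaluated where the parent cdf equals F."""
    return sum(comb(M, k) * F**k * (1 - F) ** (M - k) for k in range(h, M + 1))

def order_stat_pmf(F_at_y, F_below_y, h, M):
    """pmf at y: jump of the order-statistic cdf between the left limit and y."""
    return order_stat_cdf(F_at_y, h, M) - order_stat_cdf(F_below_y, h, M)

# Hypothetical two-point parent distribution: P(Y = y1) = 0.3, P(Y = y2) = 0.7, y1 < y2.
print(order_stat_pmf(0.3, 0.0, h=1, M=2))   # P(minimum of two draws equals y1) = 1 - 0.7**2
```

A quick sanity check is that the $M$ order-statistic cdfs sum to $M F(y)$, since each draw is equally likely to occupy each position.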

In the augmented pps sample, the ranks can be used to post-stratify the pps sample based on the relative positions (ranks) of the units in their comparison sets. The ranks $R_i$; $i = 1, \ldots, n$, are independent identically distributed (iid) discrete uniform random variables on the integers $1, \ldots, M$. For large values of $M$, the post-stratified sample may contain many empty ranking groups, which usually increase the variance of the estimators. Without loss of generality, we drop the notation $u_i$ from the augmented pps sample. The new sample is called a post-stratified pps (pp) sample and contains the triplets $(Y_i, \pi_i, R_i)$, $i = 1, \ldots, n$.

To reduce the likelihood of empty ranking classes, we reduce the number of ranking groups from $M$ to $d$, $1 \le d \le M$, where $d$ is the number of post-stratified ranking groups and $H$ is the number of ranks in each ranking group, so that $M = dH$. Let

$$D_r = \{(r-1)H + 1, \ldots, (r-1)H + H\}; \quad r = 1, \ldots, d; \quad \bigcup_{r=1}^{d} D_r = \{1, \ldots, M\},$$

where the sets $D_r$; $r = 1, \ldots, d$, form a partition of the integers $1, \ldots, M$. For example, if $M = 9$ and $d = 3$, then $D_1 = \{1,2,3\}$, $D_2 = \{4,5,6\}$ and $D_3 = \{7,8,9\}$ form a partition of the integers $1, \ldots, 9$. Using these partition sets, we stratify the sample into $d$ strata based on the membership of $R_i$ in the set $D_r$; $r = 1, \ldots, d$. Large values of $d$ create more structure in the sample, but may lead to many empty strata and more uncertainty in the estimators. For notational convenience, we relabel the pp sample as $(Z_{i,r}, \pi_i)$; $i = 1, \ldots, n$; $r = 1, \ldots, d$, where

$$Z_{i,r} = Y_i I(R_i \in D_r); \quad i = 1, \ldots, n; \quad r = 1, \ldots, d.$$
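The partition of ranks into groups is simple to compute. The sketch below is illustrative only; the helper names `rank_groups` and `group_of` are not from the paper, and it assumes that $d$ divides $M$.

```python
def rank_groups(M, d):
    """Return the sets D_1, ..., D_d as lists of ranks, assuming M = d * H."""
    if M % d != 0:
        raise ValueError("d must divide M so that every group has H = M / d ranks")
    H = M // d
    return [list(range((r - 1) * H + 1, r * H + 1)) for r in range(1, d + 1)]

def group_of(rank, M, d):
    """Return the index r (1-based) of the group D_r that contains the given rank."""
    H = M // d
    return (rank - 1) // H + 1

print(rank_groups(9, 3))   # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(group_of(4, 9, 3))   # 2
```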

The $Z_{i,r}$s are independent but not identically distributed random variables. The conditional distribution of $Z_{1,r}$ given that the rank $R_1$ belongs to the set $D_r$ is given by

$$P(Z_{1,r} = z_1 \mid R_1 \in D_r) = \frac{1}{H}\sum_{h \in D_r} f_{[h:M]}(z_1),$$

where $f_{[h:M]}(z)$ is the pmf of the $h$th-order statistic $Y_{[h:M]}$ in a pps sample when the comparison set is ordered based on the $X$-variable. We note that the rank $R_i$ may not be equal to the rank of $Y_i$ in the comparison set $S_i$ since the units are ranked based on the $X$-variable. Hence, we use square brackets to denote the possibility of ranking error. If the units are ranked based on the $Y$-variable, the comparison sets may still contain repeated observations, since the units are selected with replacement. In this case, the ranking error may be relatively small if the population size $N$ is large with respect to the set size $M$. In this paper, unless stated otherwise, we consider a ranking procedure based on the characteristic $X$.

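Using the conditional distribution of $Z_{1,r}$ given that $Y_1$ has a rank in the set $D_r$, we define the conditional mean and variance of $Z_{1,r}/\pi_1$ as follows:

$$\bar{\mu}_r = E\left(\frac{Z_{1,r}}{\pi_1} \,\Big|\, R_1 \in D_r\right) = \frac{1}{H}\sum_{h \in D_r} E\left(\frac{Y_{[h:M],1}}{\pi_{[h:M],1}}\right) = \frac{1}{H}\sum_{h \in D_r} \mu_{[h:M]} \tag{2}$$

and

$$\mathrm{var}\left(\frac{Z_{1,r}}{\pi_1} \,\Big|\, R_1 \in D_r\right) = \frac{1}{H}\sum_{h \in D_r} \sigma^2_{[h:M]} + \frac{1}{H}\sum_{h \in D_r}\left(\mu_{[h:M]} - \bar{\mu}_r\right)^2 = \frac{1}{H}(\sigma^2_r + \tau^2_r), \tag{3}$$

where

$$\sigma^2_r = \sum_{h \in D_r} \sigma^2_{[h:M]}; \quad \tau^2_r = \sum_{h \in D_r}\left(\mu_{[h:M]} - \bar{\mu}_r\right)^2,$$

$$\mu_{[h:M]} = E\left(\frac{Y_{[h:M],1}}{\pi_{[h:M],1}}\right) \quad\text{and}\quad \sigma^2_{[h:M]} = \mathrm{Var}\left(\frac{Y_{[h:M],1}}{\pi_{[h:M],1}}\right).$$

Let

$$J_r = \begin{cases} 1/n_r & \text{if } n_r > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$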

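We now construct an estimator for the population mean $\theta$,

$$\bar{Y}_{pp,n} = \frac{1}{N}\sum_{r=1}^{d} \frac{I_r J_r}{d_n}\sum_{i=1}^{n} \frac{Z_{i,r}}{\pi_i} = \sum_{r=1}^{d} a_r \bar{Z}_r, \qquad \bar{Z}_r = \frac{J_r}{N}\sum_{i=1}^{n} \frac{Z_{i,r}}{\pi_i}, \qquad a_r = \frac{I_r}{d_n},$$

where $n_r$ is the number of observations in ranking class $r$, $I_r = I(n_r > 0)$ and $d_n = \sum_{r=1}^{d} I_r$. We note that $\bar{Z}_r$ is a pps estimator based on the sample observations having membership in the set $D_r$. Hence, the estimator $\bar{Y}_{pp,n}$ is a weighted average of the pps estimators from the ranking groups. The weights $a_r$; $r = 1, \ldots, d$, are used as an adjustment to create an unbiased estimator for $\theta$.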
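To make the estimator concrete, the sketch below averages the class-level pps estimators $\bar{Z}_r$ over the non-empty ranking classes. The function name and the numeric inputs are made up for illustration; the inputs are the measured values, draw probabilities and ranks of the $n$ sampled units.

```python
def pp_mean_estimate(y, pi, ranks, N, M, d):
    """Post-stratified pps estimate of the population mean from n sampled units."""
    H = M // d                      # ranks per class, assuming M = d * H
    sums = [0.0] * d                # per-class sum of y_i / pi_i
    counts = [0] * d                # n_r, number of units whose rank falls in D_r
    for yi, p, r in zip(y, pi, ranks):
        g = (r - 1) // H            # zero-based index of the class containing rank r
        sums[g] += yi / p
        counts[g] += 1
    # Z-bar_r for the non-empty classes only (I_r = 1)
    class_means = [s / (N * c) for s, c in zip(sums, counts) if c > 0]
    # equal weights a_r = 1 / d_n over the d_n non-empty classes
    return sum(class_means) / len(class_means)

# Made-up inputs: three sampled units from a population of N = 4, M = 4, d = 2.
print(pp_mean_estimate([2.0, 4.0, 6.0], [0.1, 0.4, 0.3], [1, 2, 4], N=4, M=4, d=2))
```

In this toy call the first class has mean $(20 + 10)/(4 \cdot 2) = 3.75$ and the second $20/4 = 5$, so the estimate is their simple average.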

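Note that $n_r$, $I_r$ and $d_n$ are random variables. The vector of ranking class sample sizes $\mathbf{n} = (n_1, \ldots, n_d)$ is a $d$-dimensional multinomial random vector with parameter $n = n_1 + \cdots + n_d$ and success probability vector $(1/d, \ldots, 1/d)$. Using this multinomial random vector, we establish the following results, proofs of which are given in Ozturk (2014b) and Dastbaravarde et al. (2016).

Theorem 1. Let $\mathbf{n} = (n_1, \ldots, n_d)$ be a multinomial random vector with success probability vector $(1/d, \ldots, 1/d)$. The following equalities hold:

i. $E\left(\dfrac{I_1}{d_n}\right) = 1/d$

ii. $\mathrm{Var}\left(\dfrac{I_1}{d_n}\right) = \dfrac{1}{d^2}\sum_{k=1}^{d-1}\left(\dfrac{k}{d}\right)^{n-1}$

iii. $\mathrm{Cov}\left(\dfrac{I_1}{d_n}, \dfrac{I_2}{d_n}\right) = -\dfrac{1}{d-1}\mathrm{Var}\left(\dfrac{I_1}{d_n}\right)$

iv. $E\left(\dfrac{I_1 J_1}{d_n^2}\right) = \dfrac{1}{d^n}\left\{\dfrac{1}{n} + \sum_{k=2}^{d}\sum_{j=1}^{k-1}\sum_{m=1}^{n-k+1} \dfrac{(-1)^{j-1}}{k^2 m}\dbinom{d-1}{k-1}\dbinom{k-1}{j-1}\dbinom{n}{m}(k-j)^{n-m}\right\}.$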

Note that the expected values, variance and covariance in Theorem 1 do not depend on population characteristics. They depend only on the design parameter $d$ and the sample size $n$ and hence can be computed once, prior to sampling. We next show that $\bar{Y}_{pp,n}$ is an unbiased estimator for $\theta$ and provide a closed-form expression for its variance.
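Because the quantities in Theorem 1 depend only on $n$ and $d$, they can be tabulated ahead of time. The following sketch evaluates parts (i), (ii) and (iv) with exact rational arithmetic and, as a check for small cases, compares them with a brute-force enumeration of all $d^n$ equally likely class assignments. The function names are chosen for this example, and the enumeration is a verification device rather than part of the paper's method.

```python
from fractions import Fraction
from itertools import product
from math import comb

def theorem1(n, d):
    """Closed forms (i), (ii) and (iv) of Theorem 1 as exact fractions."""
    mean = Fraction(1, d)
    var = Fraction(1, d**2) * sum(Fraction(k, d) ** (n - 1) for k in range(1, d))
    inner = Fraction(1, n)
    for k in range(2, d + 1):
        for j in range(1, k):
            for m in range(1, n - k + 2):
                inner += (Fraction((-1) ** (j - 1), k * k * m)
                          * comb(d - 1, k - 1) * comb(k - 1, j - 1)
                          * comb(n, m) * (k - j) ** (n - m))
    return mean, var, inner / d**n

def brute_force(n, d):
    """The same three moments by enumerating every assignment of n units to d classes."""
    m1 = m2 = e_ij = Fraction(0)
    for labels in product(range(d), repeat=n):
        n1 = labels.count(0)             # size of ranking class 1
        dn = len(set(labels))            # number of non-empty classes
        a1 = Fraction(1 if n1 > 0 else 0, dn)
        m1 += a1
        m2 += a1 * a1
        if n1 > 0:
            e_ij += Fraction(1, n1 * dn * dn)
    total = d**n
    m1, m2, e_ij = m1 / total, m2 / total, e_ij / total
    return m1, m2 - m1 * m1, e_ij

print(theorem1(3, 3))      # (Fraction(1, 3), Fraction(5, 81), Fraction(13, 108))
print(brute_force(3, 3))   # same values from enumeration
```

The enumeration grows as $d^n$, so it is practical only for very small designs; the closed forms are what one would use in applications.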

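Theorem 2. Let $(Z_{i,r}, \pi_i)$; $i = 1, \ldots, n$; $r = 1, \ldots, d$, be a post-stratified probability-proportional-to-size sample from a finite population. The estimator $\bar{Y}_{pp,n}$ is unbiased for the population mean $\theta$, and its variance $\sigma^2_{pp,n} = \mathrm{Var}(\bar{Y}_{pp,n})$ is given by

$$\sigma^2_{pp,n} = \frac{d}{N^2(d-1)}\mathrm{Var}\left(\frac{I_1}{d_n}\right)\sum_{r=1}^{d}(\bar{\mu}_r - N\theta)^2 + E\left(\frac{I_1 J_1}{d_n^2}\right)\frac{1}{N^2 H}\sum_{r=1}^{d}\sum_{h \in D_r}\left\{(\mu_{[h:M]} - \bar{\mu}_r)^2 + \sigma^2_{[h:M]}\right\}.$$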

Two types of variation contribute to the variance of the estimator $\bar{Y}_{pp,n}$: variation due to differences among population units and variation due to differences among the ranking class sample sizes $n_r$, $r = 1, \ldots, d$. The ranking class sample size variation is quantified by the expressions $\mathrm{Var}(I_1/d_n)$ and $E(I_1 J_1/d_n^2)$, where $J_1$ is defined in Eq. (4). For large sample size $n$, we can establish the following limits:

$$\lim_{n\to\infty} n\,\mathrm{Var}\left(\frac{I_1}{d_n}\right) = 0 \quad\text{and}\quad \lim_{n\to\infty} n\,E(I_1 J_1/d_n^2) = 1/d.$$

Using these two limits, the variance of $\sqrt{n}(\bar{Y}_{pp,n} - \theta)$ reduces to the simple form

$$\mathrm{Var}\left\{\sqrt{n}(\bar{Y}_{pp,n} - \theta)\right\} \approx \frac{1}{N^2 H d}\sum_{r=1}^{d}\sum_{h \in D_r}\left\{(\mu_{[h:M]} - \bar{\mu}_r)^2 + \sigma^2_{[h:M]}\right\} = \frac{1}{N^2 d H}\sum_{r=1}^{d}(\sigma^2_r + \tau^2_r).$$

The large-sample approximation of the variance of the estimator shows that it is partitioned into two pieces, within and between ranking group variation. This is similar to the partition of the variation in a stratified sample, where the variance is decomposed into within- and between-strata variation.

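We now construct a conditionally unbiased estimator for the variance of $\bar{Y}_{pp,n}$ given that at least one of the groups has at least two measured units. Let

$$J^*_r = \begin{cases} 1/(n_r - 1) & \text{if } n_r > 1 \\ 0 & \text{otherwise,} \end{cases} \tag{5}$$

$$U_1 = \frac{d}{N^2}\sum_{r=1}^{d}\sum_{i=1}^{n}\sum_{j \ne i}^{n} \frac{I^*_r J_r J^*_r}{d^*_n}\left(\frac{Z_{i,r}}{\pi_i} - \frac{Z_{j,r}}{\pi_j}\right)^2, \tag{6}$$

$$U_2 = \sum_{r=1}^{d}\sum_{t \ne r}^{d}\sum_{i=1}^{n}\sum_{j=1}^{n} \frac{I_r I_t J_r J_t}{N^2 d_n^2}\left(\frac{Z_{i,r}}{\pi_i} - \frac{Z_{j,t}}{\pi_j}\right)^2, \tag{7}$$

where $I^*_r = I(n_r > 1)$ and $d^*_n = \sum_{r=1}^{d} I^*_r$.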

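Theorem 3. Let $(Z_{i,r}, \pi_i)$; $i = 1, \ldots, n$; $r = 1, \ldots, d$, be a post-stratified probability-proportional-to-size sample from a finite population. Assume that there is at least one set $D_r$ that contains at least two observations. A conditionally unbiased estimator for $\sigma^2_{pp,n}$ is then given by

$$\hat{\sigma}^2_{pp,n} = \frac{U_1}{2}\left\{E\left(\frac{I_1^2 J_1}{d_n^2}\right) - \mathrm{Var}\left(\frac{I_1}{d_n}\right)\right\} + \frac{U_2}{2(d-1)}\,\frac{\mathrm{Var}(I_1/d_n)}{E\left(I_1 I_2/d_n^2\right)},$$

where $E(I_1 I_2/d_n^2) = -\mathrm{Var}(I_1/d_n)/(d-1) + 1/d^2$.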

Theorem 3 holds for any $n$ as long as there exists a set $D_r$ with at least two observations. We can then construct an approximate $(1-\alpha)100\%$ confidence interval for the population mean $\theta$ for moderate sample sizes,

$$\bar{Y}_{pp,n} \pm t_{n-d_n,\alpha/2}\sqrt{\hat{\sigma}^2_{pp,n}},$$

where the degrees of freedom $df = n - d_n$ are suggested to account for the heterogeneity among ranking classes.

3. RAO–BLACKWELL ESTIMATOR

The post-stratified probability-proportional-to-size sample estimator can be considered as a conditional estimator for given values of the sample units $u_i$, $i = 1, \ldots, n$. Let $R = \{R_1, \ldots, R_n\}$ be the conditional ranks of the $n$ units given $S = (u_1, \ldots, u_n)$. The estimator $\bar{Y}_{pp,n}$ is constructed based on just one realization of the ranks $R_i$, $i = 1, \ldots, n$, given the sample units,

$$\bar{Y}_{pp,n}(R) = \frac{1}{N}\sum_{r=1}^{d}\frac{I_r J_r}{d_n}\sum_{i=1}^{n}\frac{Z_{i,r}}{\pi_i}\ \Big|\ u_1, \ldots, u_n,$$

where the notation $R$ highlights that this estimator depends on the realization of the conditional ranks for the given sample unit vector $S$. For a given sample unit vector $S$, one can obtain many realizations of the ranks by constructing different comparison sets from the population. Each of these realizations leads to a different estimator. We then use the Rao–Blackwell theorem to combine all these estimators,

$$\bar{Y}_{RB,n} = E_R\left\{\bar{Y}_{pp,n}(R)\right\},$$

where the expectation is taken over the conditional distribution of the ranks $R_i$; $i = 1, \ldots, n$, given the sample units $u_i$; $i = 1, \ldots, n$. The construction of the Rao–Blackwell estimator requires the computation of the conditional expectation of the post-stratified probability-proportional-to-size sample estimator over the conditional distribution of the ranks given the set of sample units $S$. Even though we are unable to find a closed-form expression for this expectation, we provide an algorithm to approximate it.

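Algorithm 1.

I. Select an integer $Q$. For $q = 1, \ldots, Q$, construct comparison sets $S^q_i = \{u_i, u^q_2, \ldots, u^q_M\}$; $i = 1, \ldots, n$, where $u^q_t$; $t = 2, \ldots, M$, are the unmeasured units selected from the population using pps sampling to form the comparison set $S^q_i$.

II. Using the comparison sets in step I, compute $R^q = (R^q_1, \ldots, R^q_n)$ and

$$J^q_r = \begin{cases} 1/n^q_r & \text{if } n^q_r > 0 \\ 0 & \text{otherwise;}\end{cases} \quad I^q_{i,r} = I(R^q_i \in D_r); \quad n^q_r = \sum_{i=1}^{n} I^q_{i,r};$$

$$I^q_r = I(n^q_r > 0); \quad d^q_n = \sum_{r=1}^{d} I^q_r; \quad \bar{Y}^q_{pp,n} = \sum_{r=1}^{d}\frac{I^q_r J^q_r}{d^q_n}\sum_{i=1}^{n}\frac{Y_i I(R^q_i \in D_r)}{N\pi_i}.$$

III. Approximate $\bar{Y}_{RB,n}$ by

$$\bar{Y}_{RB,n} \approx \frac{1}{Q}\sum_{q=1}^{Q}\bar{Y}^q_{pp,n}.$$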
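A minimal Monte Carlo sketch of Algorithm 1 is given below. It assumes draw probabilities $\pi_i = x_i/\sum_j x_j$, ranking on $X$ with ties broken at random, and $M = dH$; the function name `rb_estimate`, its argument names and the toy data are illustrative choices, not taken from the paper.

```python
import random

def rb_estimate(sample_idx, x, y, M, d, Q, rng):
    """Approximate the Rao-Blackwell estimator by averaging Q post-stratified estimates."""
    N = len(x)
    total_x = sum(x)
    pi = [xi / total_x for xi in x]
    H = M // d
    acc = 0.0
    for _ in range(Q):                                   # step I: a fresh set of comparison sets
        sums, counts = [0.0] * d, [0] * d
        for i in sample_idx:
            others = rng.choices(range(N), weights=pi, k=M - 1)
            below = sum(x[k] < x[i] for k in others)
            ties = sum(x[k] == x[i] for k in others)
            rank = below + 1 + rng.randint(0, ties)      # step II: rank of u_i, ties at random
            g = (rank - 1) // H
            sums[g] += y[i] / (N * pi[i])
            counts[g] += 1
        class_means = [s / c for s, c in zip(sums, counts) if c > 0]
        acc += sum(class_means) / len(class_means)       # post-stratified estimate for this q
    return acc / Q                                       # step III: average over q

# Toy population with y exactly proportional to x; the sampled unit indices are fixed.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
print(rb_estimate([0, 2, 4, 4], x, y, M=4, d=2, Q=50, rng=random.Random(7)))
```

The measured units in `sample_idx` stay fixed across the $Q$ replications; only the unmeasured members of the comparison sets, and hence the ranks, are redrawn.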

The algorithm does not provide an estimate for the variance of the Rao–Blackwell estimator. We use the jackknife variance estimator to assess the sampling variation. To construct the jackknife variance, for the given $Q$ sets of ranks $R^q$, $q = 1, \ldots, Q$, we compute $n$ Rao–Blackwell estimators $\bar{Y}^{(-i)}_{RB,n}$, $i = 1, \ldots, n$, where $\bar{Y}^{(-i)}_{RB,n}$ is the Rao–Blackwell estimator after the $i$th unit is removed from the sample. We now create the jackknife replicates

$$s_i = n\bar{Y}_{RB,n} - (n-1)\bar{Y}^{(-i)}_{RB,n}, \quad i = 1, \ldots, n.$$

The jackknife variance estimate is then given by

$$\hat{\sigma}^2_J = \frac{1}{n(n-1)}\sum_{i=1}^{n}(s_i - \bar{s})^2,$$

where $\bar{s} = \sum_{i=1}^{n} s_i/n$. An approximate $(1-\alpha)100\%$ confidence interval for the population mean $\theta$ based on the Rao–Blackwell estimator is given by

$$\bar{Y}_{RB,n} \pm t_{n-1,\alpha/2}\sqrt{\hat{\sigma}^2_J}.$$
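The jackknife recipe can be sketched generically. In the code below, `estimator` is a hypothetical callable that maps a list of sampled units to an estimate; in the paper's setting it would be the Rao–Blackwell estimator evaluated with the same $Q$ sets of ranks, which this simplified sketch does not enforce.

```python
def jackknife_variance(estimator, sample):
    """Jackknife variance estimate built from leave-one-out pseudo-values s_i."""
    n = len(sample)
    full = estimator(sample)
    pseudo = [
        n * full - (n - 1) * estimator(sample[:i] + sample[i + 1:])
        for i in range(n)
    ]
    s_bar = sum(pseudo) / n
    return sum((s - s_bar) ** 2 for s in pseudo) / (n * (n - 1))

# Sanity check with the sample mean, where the result is the usual s^2 / n.
def mean(values):
    return sum(values) / len(values)

print(jackknife_variance(mean, [1.0, 2.0, 3.0, 4.0]))   # 5/12, about 0.4167
```

For the sample mean the pseudo-values reproduce the data, which is why the jackknife variance coincides with $s^2/n$ in this check.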

4. POST-STRATIFIED PROBABILITY-PROPORTIONAL-TO-SIZE SAMPLES FROM STRATIFIED POPULATIONS

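In this section, we extend the post-stratified probability-proportional-to-size sample to stratified populations. We assume that the main population is divided into $L$ disjoint subpopulations $P^{N_l} = \{u_{1,l}, \ldots, u_{N_l,l}\}$, where $N_l$ is the population size of the $l$th stratum population, $l = 1, \ldots, L$. The stratum population means, variances and totals and the overall population mean are defined as

$$\theta_l = \frac{1}{N_l}\sum_{i=1}^{N_l} y_{i,l}; \quad \gamma^2_{N_l} = \frac{1}{N_l - 1}\sum_{i=1}^{N_l}(y_{i,l} - \theta_l)^2; \quad t_{N_l} = N_l\theta_l; \quad l = 1, \ldots, L,$$

$$\theta = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{N_l} y_{i,l},$$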

where $y_{i,l}$ is the value of $Y$ on unit $u_{i,l}$ in the stratum population $P^{N_l}$ and $N = N_1 + \cdots + N_L$. In this population, we wish to draw inference on the parameter $\theta$. To construct a post-stratified pps sample from this stratified population, we select a post-stratified pps sample with sample size $n_l$ and set size $M_l$ from each stratum population. We combine these samples to form the post-stratified pps stratified sample (str), $(Y_{i,l}, \pi_{i,l}, R_{i,l})$; $i = 1, \ldots, n_l$; $l = 1, \ldots, L$, where $Y_{i,l}$ is the value of $Y$ on unit $u_{i,l}$, $\pi_{i,l}$ is the selection probability of the unit $u_{i,l}$ and $R_{i,l}$ is the rank of the unit $u_{i,l}$ in the comparison set of size $M_l$ from the stratum population $l$. Using this stratified sample, we construct an estimator for the population mean $\theta$,

$$\bar{Y}_{str} = \sum_{l=1}^{L}\frac{N_l}{N}\bar{Y}_{pp,n_l} = \sum_{l=1}^{L}\frac{N_l}{N}\sum_{r=1}^{d_l}\frac{I_{r,l}J_{r,l}}{N_l d_{n_l}}\sum_{i=1}^{n_l}\frac{Z_{i,r,l}}{\pi_{i,l}}; \quad n = n_1 + \cdots + n_L,$$

$$J_{r,l} = \begin{cases} 1/n_{r,l} & \text{if } n_{r,l} > 0 \\ 0 & \text{otherwise,}\end{cases}$$

where $n_{r,l}$ is the number of observations in ranking group $r$ of stratum $l$, $Z_{i,r,l} = Y_{i,l}I(R_{i,l} \in D_{r,l})$, $D_{r,l} = \{(r-1)H_l + 1, \ldots, (r-1)H_l + H_l\}$, $I_{r,l} = I(n_{r,l} > 0)$, $d_{n_l} = \sum_{r=1}^{d_l} I(n_{r,l} > 0)$, $H_l = M_l/d_l$, and $d_l$ is the number of ranking groups for stratum $l$; $l = 1, \ldots, L$. We use the notation $(Z_{i,r,l}, \pi_{i,l})$, $i = 1, \ldots, n_l$; $r = 1, \ldots, d_l$; $l = 1, \ldots, L$, to denote the post-stratified pps sample from a stratified population.
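Once each stratum has its own post-stratified pps estimate, combining them is a weighted sum with weights $N_l/N$. The short sketch below also combines stratum variance estimates with squared weights and scales both to the population total; the function name and the numbers in the example are made up.

```python
def stratified_estimate(stratum_sizes, stratum_means, stratum_variances):
    """Combine per-stratum estimates into (mean, variance of mean, total, variance of total)."""
    N = sum(stratum_sizes)
    mean = sum(Nl / N * m for Nl, m in zip(stratum_sizes, stratum_means))
    variance = sum((Nl / N) ** 2 * v for Nl, v in zip(stratum_sizes, stratum_variances))
    return mean, variance, N * mean, N**2 * variance

# Two hypothetical strata with sizes 100 and 300.
print(stratified_estimate([100, 300], [10.0, 20.0], [4.0, 1.0]))
# (17.5, 0.8125, 7000.0, 130000.0)
```

The squared weights reflect that the stratum samples are selected independently, so the variances of the weighted stratum estimators add.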

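Theorem 4. Let $(Z_{i,r,l}, \pi_{i,l})$; $i = 1, \ldots, n_l$; $r = 1, \ldots, d_l$; $l = 1, \ldots, L$, be a post-stratified pps sample from a stratified population. The estimator $\bar{Y}_{str}$ is unbiased for the population mean $\theta$ and its variance $\sigma^2_{str} = \mathrm{Var}(\bar{Y}_{str})$ is given by

$$\sigma^2_{str} = \sum_{l=1}^{L}\frac{N_l^2}{N^2}\sigma^2_{pp,n_l},$$

$$\sigma^2_{pp,n_l} = \frac{d_l}{N_l^2(d_l - 1)}\mathrm{Var}\left(\frac{I_{1,l}}{d_{n_l}}\right)\sum_{r=1}^{d_l}(\bar{\mu}_{r,l} - N_l\theta_l)^2 + E\left(\frac{I_{1,l}J_{1,l}}{d^2_{n_l}}\right)\frac{1}{N_l^2 H_l}\sum_{r=1}^{d_l}\sum_{h \in D_{r,l}}\left\{(\mu_{[h:M_l]} - \bar{\mu}_{r,l})^2 + \sigma^2_{[h:M_l]}\right\},$$

where $\bar{\mu}_{r,l} = \sum_{h \in D_{r,l}}\mu_{[h:M_l]}/H_l$.

An unbiased estimator for the population total can be constructed from $T_{str} = N\bar{Y}_{str}$. The variance of $T_{str}$ follows from Theorem 4, $\mathrm{Var}(T_{str}) = N^2\sigma^2_{str}$.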

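Corollary 1. Let $n_0 = \min(n_1, \ldots, n_L)$ and $\lambda_l = \lim_{n_l\to\infty} n_l/n > 0$ as $n_0$ goes to infinity. The asymptotic variance of $\sqrt{n}(\bar{Y}_{str} - \theta) = \sum_{l=1}^{L}\frac{N_l}{N}\sqrt{\frac{n}{n_l}}\left[\sqrt{n_l}(\bar{Y}_{pp,n_l} - \theta_l)\right]$ is given by

$$\sigma^2_\lambda = \sum_{l=1}^{L}\frac{1}{N^2 d_l H_l\lambda_l}\sum_{r=1}^{d_l}(\sigma^2_{r,l} + \tau^2_{r,l}),$$

where $\sigma^2_{r,l} = \sum_{h \in D_{r,l}}\sigma^2_{[h:M_l]}$, $\tau^2_{r,l} = \sum_{h \in D_{r,l}}(\mu_{[h:M_l]} - \bar{\mu}_{r,l})^2$ and $\bar{\mu}_{r,l} = \sum_{h \in D_{r,l}}\mu_{[h:M_l]}/H_l$.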

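A conditionally unbiased estimator for $\sigma^2_{str}$ can be established from Theorem 3, given that there is at least one set $D_{r,l}$ having at least two observations in each stratum sample,

$$\hat{\sigma}^2_{str} = \sum_{l=1}^{L}\frac{N_l^2}{N^2}\hat{\sigma}^2_{pp,n_l},$$

$$\hat{\sigma}^2_{pp,n_l} = \frac{U_{1,l}}{2}\left\{E\left(\frac{I^2_{1,l}J_{1,l}}{d^2_{n_l}}\right) - \mathrm{Var}\left(\frac{I_{1,l}}{d_{n_l}}\right)\right\} + \frac{U_{2,l}}{2(d_l - 1)}\,\frac{\mathrm{Var}(I_{1,l}/d_{n_l})}{E\left(I_{1,l}I_{2,l}/d^2_{n_l}\right)},$$

where $U_{1,l}$ and $U_{2,l}$ are the expressions $U_1$ and $U_2$ in Eqs. (6) and (7) for stratum $l$, respectively. An approximate $(1-\alpha)\times 100\%$ confidence interval for the population mean $\theta$ can be constructed from

$$\bar{Y}_{str} \pm t_{df,\alpha/2}\,\hat{\sigma}_{str},$$

where $df = \sum_{l=1}^{L} n_l - \sum_{l=1}^{L} d_{n_l}$.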

A post-stratified pps sample from a stratified population consists of $L$ different post-stratified probability-proportional-to-size samples, one from each stratum population. The stratum populations usually have different means and variances. For a fixed sample size $n$, $n = n_1 + \cdots + n_L$, the information content of the stratified sample depends on the stratum sample sizes $n_l$; $l = 1, \ldots, L$. For a finite sample size $n$, it is a challenge to investigate the relationship between the stratum sample sizes and the information content of the sample. To ease the computation, we look at four different sample size allocations for large sample sizes.

The equal allocation procedure selects an equal number of observations from each stratum population, $n_l = n/L$, $l = 1, \ldots, L$. Under this allocation scheme, the asymptotic variance of $\sqrt{n}(\bar{Y}_{str} - \theta)$ reduces to

$$\sigma^2_\lambda(E) = \sum_{l=1}^{L}\frac{L}{N^2 d_l H_l}\sum_{r=1}^{d_l}(\sigma^2_{r,l} + \tau^2_{r,l}),$$

where $E$ in $\sigma^2_\lambda(E)$ denotes the equal allocation.

In certain cases, it may be reasonable to select sample sizes proportional to the stratum population sizes, $n_l = n(N_l/N)$. Under proportional (P) allocation, the asymptotic variance of the estimator reduces to

$$\sigma^2_\lambda(P) = \sum_{l=1}^{L}\frac{1}{N d_l H_l N_l}\sum_{r=1}^{d_l}(\sigma^2_{r,l} + \tau^2_{r,l}).$$

Optimal (Neyman) allocation minimizes the variance of the estimator with respect to the sample sizes $n_l$ subject to the constraint that the sum of the stratum sample sizes equals $n$. Using a Lagrange multiplier, we can show that the Neyman (N) allocation sample sizes are given by

$$n_m = n\,\frac{\sqrt{\sum_{r=1}^{d_m}(\sigma^2_{r,m} + \tau^2_{r,m})}\Big/\sqrt{d_m H_m}}{\sum_{l=1}^{L}\sqrt{\sum_{r=1}^{d_l}(\sigma^2_{r,l} + \tau^2_{r,l})}\Big/\sqrt{d_l H_l}}; \quad m = 1, \ldots, L.$$

Under Neyman allocation, the asymptotic variance simplifies to

$$\sigma^2_\lambda(N) = \frac{\left\{\sum_{l=1}^{L}\sqrt{\sum_{r=1}^{d_l}(\sigma^2_{r,l} + \tau^2_{r,l})/(d_l H_l)}\right\}^2}{N^2}.$$

Sampling cost is also a limiting factor in sample size determination when there is a

constraint in the budget. In this case, it is desirable to minimize the variance with respect to

stratum sample sizes for a given cost function and a budget. A simple cost function for this

setting can be constructed as

CT=C0+



l=1

(cl+rl)nl,(8)

where CTis the total cost, C0is overhead cost, clis the cost of measuring a single unit

from stratum population land rlis the cost of obtaining the rank of a measured unit in a

706 O. Ozturk

comparison set in stratum l. For the setting where post-stratiﬁed probability-proportional-

to-size sampling is appropriate, it is reasonable to assume that rlis relatively small since the

values of X-variable are available for all population units. Under the cost model in Eq. (8),

the asymptotic variance of the estimator is minimized for

nm=ndm

r=1(σ 2

r,m+τ2

r,m)/(Mm(cm+rm))

L

l=1dl

r=1(σ 2

r,l+τ2

r,l)/(Ml(cl+rl)) ;m=1,...,L.

For the cost function CT, the variance of the optimal estimator simpliﬁes to

σ2

λ(C)=L

l=1dl

r=1(σ 2

r,l+τ2

r,l)/(Ml(cl+rl))2

N2.
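As a rough illustration of the cost-constrained allocation, the sketch below computes stratum shares proportional to $N_l \sqrt{A_l^2 / (M_l (c_l + r_l))}$, with $A_l^2 = \sum_r (\sigma^2_{r,l} + \tau^2_{r,l})$ treated as known. It additionally assumes that the total sample size is whatever exhausts the budget in the linear cost model; that step, the function name and all numerical inputs are assumptions made for the example, not part of the paper.

```python
import math

def cost_constrained_sizes(N_l, A2_l, M_l, unit_cost_l, total_cost, overhead):
    """Real-valued stratum sample sizes under C_T = C_0 + sum_l (c_l + r_l) n_l.

    unit_cost_l[l] stands for c_l + r_l, the cost of measuring and ranking
    one unit in stratum l.
    """
    # Allocation weights N_l * sqrt(A_l^2 / (M_l (c_l + r_l))).
    w = [Nl * math.sqrt(A2 / (M * c))
         for Nl, A2, M, c in zip(N_l, A2_l, M_l, unit_cost_l)]
    total_w = sum(w)
    shares = [wl / total_w for wl in w]
    # Assumption: choose the total n so that the whole budget is spent.
    cost_per_unit_of_n = sum(lam * c for lam, c in zip(shares, unit_cost_l))
    n = (total_cost - overhead) / cost_per_unit_of_n
    return [n * lam for lam in shares]

# Hypothetical inputs: the first stratum is four times as costly to sample.
unit_cost_l = [4.0, 1.0, 1.0]
sizes = cost_constrained_sizes(
    N_l=[100, 300, 600], A2_l=[400.0, 100.0, 25.0], M_l=[5, 5, 5],
    unit_cost_l=unit_cost_l, total_cost=500.0, overhead=50.0)
```

When all unit costs are equal, the shares reduce to those of the Neyman allocation, which is a convenient sanity check for any implementation.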

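The equal and proportional allocations are relatively easy to implement. The difference between the variances of the estimators under equal and proportional allocations can be written as follows:

$$\sigma^2_{\lambda}(E) - \sigma^2_{\lambda}(P) = \sum_{l=1}^{L} \frac{L N_l (N_l - \bar{N}) A_l^2}{N^2 d_l H_l}; \quad A_l^2 = \sum_{r=1}^{d_l} (\sigma^2_{r,l} + \tau^2_{r,l}); \quad \bar{N} = \sum_{l=1}^{L} N_l / L.$$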

It is reasonable to assume that $A_l^2$ is an increasing function of the population variance of stratum $l$, $\tau_l^2$. We then expect that the difference between the variances of equal and proportional allocation will be positive when large stratum populations (large $N_l$) have large variances (large $\tau_l^2$ or large $A_l^2$). In this case, the proportional allocation procedure samples more data from a stratum population having large population size and variance to reduce the contribution of variation from this stratum sample to the estimator. We note that choosing between the equal and proportional allocations does not require point estimates of the population variances. It only requires knowing whether the larger populations have larger variances. This may be less restrictive than knowing the point estimates of the population variances.
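The sign argument can be made explicit with a short derivation. It is sketched here under the assumption that the asymptotic variance has the generic form $\sum_l N_l^2 A_l^2 / (N^2 \lambda_l d_l H_l)$ with $\lambda_l = n_l/n$, so that $\lambda_l = 1/L$ under equal allocation and $\lambda_l = N_l/N$ under proportional allocation:

```latex
\begin{aligned}
\sigma^2_{\lambda}(E) - \sigma^2_{\lambda}(P)
  &= \sum_{l=1}^{L} \frac{N_l^2 A_l^2}{N^2 d_l H_l}
     \left( L - \frac{N}{N_l} \right) \\
  &= \sum_{l=1}^{L} \frac{L N_l A_l^2}{N^2 d_l H_l}
     \left( N_l - \bar{N} \right),
  \qquad \bar{N} = N/L .
\end{aligned}
```

Each term is positive for strata larger than the average size $\bar{N}$ and negative for smaller ones, so the sum is positive when the large strata also carry the large values of $A_l^2 / (d_l H_l)$.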

The Neyman allocation is optimal. Hence, it yields smaller variance than both equal and proportional allocations. On the other hand, the computation of the stratum sample sizes requires that $A_l^2$ be known prior to construction of the sample. For the setting where $M_l = d_l H_l \equiv M$ for all stratum samples and the stratum population variances are known from previous studies (or may be estimated from pilot studies), the Neyman allocation can be approximated from

$$n_m = n\,\frac{N_m A_m}{\sum_{l=1}^{L} N_l A_l} \approx n\,\frac{N_m \hat{\tau}_m}{\sum_{l=1}^{L} N_l \hat{\tau}_l}; \quad m = 1, \ldots, L,$$

where $\hat{\tau}_l^2$ is the estimate of the variance of the stratum population $l$.


5. EFFICIENCY COMPARISON OF THE ESTIMATORS

In this section, we provide empirical evidence about the efficiency of the proposed estimators using several populations where probability-proportional-to-size sampling would be a natural choice. In these populations, the values of the Y-variable are proportional to the values of the X-variable. A small percentage of the population units have extreme values in both the Y- and X-variables with different proportionality constants. The units that produce extreme values usually behave differently from the other units in the population. They have larger variance, and the slope of the regression fit of Y on X would be larger than the slope of the regression fit of Y on X for the remaining population units. For example, if we sample farms to estimate the crop production (Y), the farm population can be divided into two parts: small/normal-size farms in acres (X) and mega farms that have extremely large X-values. The percentage of mega farms would be small, but they may have larger variance in the Y- and X-variables and the regression fit of Y on X may have a larger slope. For our empirical investigation, we generate this type of population structure using the model below.

I. For a fixed population size $N$, generate the size variable X from an exponential distribution with mean 100 and order these $N$ random numbers from smallest to largest, $x_{(1)} < \cdots < x_{(N)}$, where $x_{(i)}$ is the $i$th smallest of the x-values.

II. Let $N^*$ be the largest integer such that $N^* \le N\omega$. Generate the Y-values from

$$y_{[i]} = \begin{cases} 15\, x_{(i)} + \tau \sqrt{x_{(i)}}\, \epsilon_i, & i = 1, \ldots, N^*, \\ 45\, x_{(i)} + 2\tau \sqrt{x_{(i)}}\, \epsilon_i, & i = N^* + 1, \ldots, N, \end{cases} \qquad (9)$$

where $\epsilon_i$ is generated from a normal distribution with mean zero and variance 1 and $y_{[i]}$ is the value of the Y-variable that corresponds to the value of $x_{(i)}$.
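Steps I and II can be sketched in a few lines. The version below is a minimal illustration in Python using only the standard library; the study itself used R, so the random number stream, and therefore the generated values and the resulting correlation, will differ from those reported in the tables. The function names are hypothetical.

```python
import math
import random

def generate_population(N, omega, tau, seed=1):
    """Generate (x, y) following steps I and II of the population model."""
    rng = random.Random(seed)
    # Step I: exponential sizes with mean 100 (rate 1/100), sorted ascending.
    x = sorted(rng.expovariate(1.0 / 100.0) for _ in range(N))
    # Step II: N* is the largest integer not exceeding N * omega.
    n_star = math.floor(N * omega)
    y = []
    for i, xi in enumerate(x, start=1):
        eps = rng.gauss(0.0, 1.0)  # standard normal error
        if i <= n_star:
            y.append(15.0 * xi + tau * math.sqrt(xi) * eps)
        else:
            # The largest units get a steeper slope and twice the noise scale.
            y.append(45.0 * xi + 2.0 * tau * math.sqrt(xi) * eps)
    return x, y

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x, y = generate_population(N=2000, omega=0.7, tau=50.0, seed=1)
rho = correlation(x, y)
```

Increasing `tau` inflates the noise term and lowers `rho`, which is how the different correlation structures in the simulation are obtained.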

In model (9), the quantity $1 - \omega$ is the proportion of population units that produce extreme measurements on the Y- and X-variables. It corresponds to the proportion of mega farms in the crop production example. In the simulation study, we used the R software (R Core Team 2018) to generate the population values. We set the random generator seed to 1 (set.seed(1)) so that the same population values can be recreated. Using model (9), populations of size N = 2000 are generated with ω = 0.7, 0.8, 0.90, 0.95 and τ = 50, 200, 400, 500, 800 to establish different correlation structures between the Y- and X-variables. The sample size is n = 30 in the first phase of the simulation study. The number of replications in the Rao–Blackwell estimator is taken to be Q = 50. For each population, we used the pairs of set size and number of groups (M, d) as

(M, d) = (4,4), (5,5), (10,2), (10,5), (30,3), (60,3).

For the populations generated by the model in Eq. (9), a ratio estimator in double sampling could be an alternative estimator for the population mean or total using the size variable X as an auxiliary variable. The double sampling first selects nM units from the population and measures only the X-variable. In the second stage, it selects a subsample of size n from the nM units and measures both the Y- and X-variables. The ratio estimator is then given by

$$\bar{Y}_R = \frac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} X_i}\; \frac{1}{nM} \sum_{j=1}^{nM} X_j.$$

This estimator is a biased estimator, and its approximate mean square error (MSE) is given in Section 14.1 in Thompson (2002). We compare the Rao–Blackwell estimator ($\bar{Y}_{RB,n}$) with the probability-proportional-to-size ($\bar{Y}_{pps,n}$), post-stratified probability-proportional-to-size ($\bar{Y}_{pp,n}$) and ratio ($\bar{Y}_R$) estimators. The efficiencies of these estimators are defined as follows:

$$RE_1 = \frac{\mathrm{Var}(\bar{Y}_{pps,n})}{\mathrm{Var}(\bar{Y}_{RB,n})}, \quad RE_2 = \frac{\mathrm{Var}(\bar{Y}_{pp,n})}{\mathrm{Var}(\bar{Y}_{RB,n})}, \quad RE_3 = \frac{\mathrm{MSE}(\bar{Y}_R)}{\mathrm{Var}(\bar{Y}_{RB,n})}.$$
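To make the double-sampling competitor and the numerator of $RE_3$ concrete, the sketch below draws one double sample, computes the ratio estimate, and approximates its MSE by Monte Carlo. It assumes simple random sampling without replacement in both phases, which the text does not state explicitly, and it uses a small noiseless toy population; the function names and all values are hypothetical.

```python
import random

def double_sampling_ratio_estimate(x, y, n, M, rng):
    """One ratio estimate of the population mean of y from a double sample."""
    # Phase 1: select n*M units and record only the auxiliary variable x.
    first_phase = rng.sample(range(len(x)), n * M)
    # Phase 2: subsample n of those units and record both x and y.
    second_phase = rng.sample(first_phase, n)
    ratio = sum(y[i] for i in second_phase) / sum(x[i] for i in second_phase)
    first_phase_mean_x = sum(x[i] for i in first_phase) / (n * M)
    return ratio * first_phase_mean_x

def mean_squared_error(estimates, target):
    """Monte Carlo MSE of replicated estimates around the true value."""
    return sum((e - target) ** 2 for e in estimates) / len(estimates)

# Toy population in which y is exactly proportional to x.
pop_x = [float(i) for i in range(1, 201)]
pop_y = [3.0 * xi for xi in pop_x]
true_mean = sum(pop_y) / len(pop_y)

rng = random.Random(1)
replicates = [double_sampling_ratio_estimate(pop_x, pop_y, n=5, M=4, rng=rng)
              for _ in range(500)]
mse_ratio = mean_squared_error(replicates, true_mean)
```

Dividing `mse_ratio` by the Monte Carlo variance of a competing estimator computed on the same population gives a quantity of the same kind as $RE_3$.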

Tables 2 and 3 present the efficiency values $RE_1$, $RE_2$ and $RE_3$ for the finite populations generated by the model in Eq. (9). In all these simulation settings, the Rao–Blackwell estimator ($\bar{Y}_{RB,n}$) is always better than the ratio estimator ($\bar{Y}_R$), that is, $RE_3 > 1$. The efficiency is higher for larger values of ρ and decreases with lower values of ρ. The smallest efficiency gain is $RE_3 = 1.102$ in Table 3 when M = 60, d = 3, τ = 800, ω = 0.95 and ρ = 0.377.

The efficiency of the Rao–Blackwell estimator with respect to the probability-proportional-to-size (pps) estimator depends on the correlation coefficient ρ and the proportion of population units with extreme X- and Y-values ($1 - \omega$). For large values of ρ (≥ 0.5), the Rao–Blackwell estimator is superior to the pps estimator ($RE_1 \ge 1$) in most settings. If the correlation coefficient is less than 0.5, the Rao–Blackwell estimator is slightly less efficient than the pps estimator. If the population has a smaller number of units having extreme X- and Y-measurements, the Rao–Blackwell estimators tend to have higher efficiencies than a pps estimator for the same correlation coefficient ρ. For example, in Table 2 when ω = 0.7 and ρ = 0.966 and 0.856, the $RE_1$ values are usually higher than the $RE_1$ values when ω = 0.8 and ρ = 0.949 and 0.854. These numerical values indicate that the Rao–Blackwell and post-stratified probability-proportional-to-size (pp) sample estimators can handle extreme observations better than their competitor, the ratio estimator. As expected, the Rao–Blackwell estimator is always superior to the pp sample estimator ($RE_2 > 1$).

Tables 4 and 5 present efficiency results for the larger sample size (n = 120), different set sizes (M = 5, 10, 60, 2000) and different numbers of groups (d) when ω = 0.7 and 0.9. The efficiency values slightly increase with sample size for the same values of M, d, ρ and ω. For example, in Table 2 when ω = 0.7 and ρ = 0.966, the $RE_1$ values are 2.193 and 4.188 for M = 5, d = 5 and M = 60, d = 3, respectively. The $RE_1$ values in Table 4 for the same simulation settings are 2.353 and 4.449. Efficiencies in Tables 4 and 5 also depend on the selection of the set size M and the number of groups d. The larger set sizes suggest better efficiency for large ρ. For example, the value of $RE_1$ is the largest (5.964) when ρ = 0.966, M = 2000 and d = 20 in Table 4. On the other hand, the value of $RE_1$ is the largest (5.054) when ρ = 0.907, M = 2000 and d = 10 in Table 5. This suggests that the selection of M and d depends on the within-set ranking quality through the correlation coefficient between the X- and Y-variables. This relationship is further investigated in Figs. 1 and 2.

Figures 1 and 2 present the plots of $RE_3$ values with respect to d for different set sizes M in nine panels for ω = 0.7 and ω = 0.90, respectively. The displays in each row fix


Table 2. Efficiency comparison of the proposed estimators with probability-proportional-to-size and ratio estimators.

ω τ ρ M d RE1 RE2 RE3 τ ρ M d RE1 RE2 RE3

0.70 50 0.966 4 4 1.993 1.447 6.703 400 0.657 10 5 0.999 1.154 1.281

50 0.966 5 5 2.193 1.564 6.779 400 0.657 30 3 1.036 1.049 1.366

50 0.966 10 2 2.081 1.277 4.433 400 0.657 60 3 1.037 1.039 1.395

50 0.966 10 5 2.610 1.579 5.832 500 0.574 4 4 1.005 1.110 1.382

50 0.966 30 3 3.563 1.466 5.920 500 0.574 5 5 1.008 1.156 1.404

50 0.966 60 3 4.188 1.469 6.523 500 0.574 10 2 1.021 1.024 1.366

200 0.856 4 4 1.163 1.163 2.214 500 0.574 10 5 0.977 1.149 1.221

200 0.856 5 5 1.179 1.209 2.181 500 0.574 30 3 1.008 1.044 1.314

200 0.856 10 2 1.187 1.063 1.834 500 0.574 60 3 1.007 1.035 1.344

200 0.856 10 5 1.168 1.198 1.747 800 0.401 4 4 0.984 1.103 1.274

200 0.856 30 3 1.256 1.085 1.761 800 0.401 5 5 0.985 1.150 1.304

200 0.856 60 3 1.275 1.071 1.784 800 0.401 10 2 0.999 1.020 1.306

400 0.657 4 4 1.024 1.116 1.480 800 0.401 10 5 0.952 1.142 1.155

400 0.657 5 5 1.028 1.161 1.495 800 0.401 30 3 0.976 1.039 1.257

400 0.657 10 2 1.041 1.029 1.420 800 0.401 60 3 0.972 1.030 1.286

0.80 50 0.949 4 4 2.279 1.567 8.075 400 0.680 10 5 1.048 1.154 1.484

50 0.949 5 5 2.442 1.602 7.192 400 0.680 30 3 1.052 1.038 1.407

50 0.949 10 2 2.961 1.533 7.318 400 0.680 60 3 1.040 1.032 1.359

50 0.949 10 5 3.071 1.585 7.580 500 0.605 4 4 1.021 1.116 1.484

50 0.949 30 3 2.851 1.284 5.524 500 0.605 5 5 1.026 1.178 1.459

50 0.949 60 3 2.625 1.220 5.173 500 0.605 10 2 1.039 1.036 1.402

200 0.854 4 4 1.231 1.163 2.485 500 0.605 10 5 1.015 1.121 1.342

200 0.854 5 5 1.229 1.238 2.286 500 0.605 30 3 0.999 1.040 1.327

200 0.854 10 2 1.303 1.099 2.259 500 0.605 60 3 1.013 1.033 1.327

200 0.854 10 5 1.290 1.223 2.201 800 0.445 4 4 1.013 1.117 1.335

200 0.854 30 3 1.259 1.083 1.916 800 0.445 5 5 0.999 1.162 1.311

200 0.854 60 3 1.258 1.069 1.905 800 0.445 10 2 1.010 1.019 1.219

400 0.680 4 4 1.053 1.159 1.639 800 0.445 10 5 0.968 1.159 1.246

400 0.680 5 5 1.051 1.167 1.553 800 0.445 30 3 0.975 1.046 1.240

400 0.680 10 2 1.080 1.041 1.510 800 0.445 60 3 0.967 1.030 1.202

The finite population is constructed from the model in Eq. (9) with ω = 0.70 and 0.80, sample size n = 30, the population size N = 2000, and the correlation coefficient, ρ, between the X- and Y-variables. The efficiencies are $RE_1 = \mathrm{Var}(\bar{Y}_{pps,n})/\mathrm{Var}(\bar{Y}_{RB,n})$, $RE_2 = \mathrm{Var}(\bar{Y}_{pp,n})/\mathrm{Var}(\bar{Y}_{RB,n})$ and $RE_3 = \mathrm{MSE}(\bar{Y}_R)/\mathrm{Var}(\bar{Y}_{RB,n})$. Variances and mean square error (MSE) are computed from 5000 simulation replications.

the ρ and ω and change the sample size from n = 30 to n = 120. The displays in each column fix the sample size and ω and vary the correlation coefficient. The plots in Fig. 1 suggest that for a large sample size (n = 120) and a large correlation coefficient between the X- and Y-variables (ρ > 0.90), we can select M as large as possible with d between 6 and 15. On the other hand, for the moderately large sample size n = 60, M = 300 with the number of groups d between 6 and 10 seems to be slightly better than M = 2000. For the lower correlation coefficients (ρ < 0.90), the selection of M does not make a big difference; all efficiency curves are quite close to each other. In this case, the number of groups d should not be larger than 5 for any M.

Similar results also hold in Fig. 2. The only difference is that the efficiency values $RE_3$ are much higher. This indicates that the proposed pp and Rao–Blackwell estimators can


Table 3. Efficiency comparison of the proposed estimators with probability-proportional-to-size and ratio estimators.

ω τ ρ M d RE1 RE2 RE3 τ ρ M d RE1 RE2 RE3

0.90 50 0.907 4 4 2.038 1.443 7.825 400 0.629 10 5 1.033 1.129 1.520

50 0.907 5 5 2.210 1.561 7.854 400 0.629 30 3 1.034 1.045 1.447

50 0.907 10 2 1.946 1.279 6.143 400 0.629 60 3 1.038 1.032 1.411

50 0.907 10 5 2.692 1.588 8.307 500 0.553 4 4 1.029 1.125 1.352

50 0.907 30 3 3.723 1.577 9.571 500 0.553 5 5 1.014 1.204 1.413

50 0.907 60 3 4.449 1.474 11.050 500 0.553 10 2 1.024 1.026 1.336

200 0.808 4 4 1.190 1.180 2.480 500 0.553 10 5 1.002 1.148 1.259

200 0.808 5 5 1.224 1.241 2.478 500 0.553 30 3 1.010 1.039 1.298

200 0.808 10 2 1.196 1.068 2.158 500 0.553 60 3 1.010 1.035 1.299

200 0.808 10 5 1.220 1.209 2.286 800 0.393 4 4 0.993 1.119 1.198

200 0.808 30 3 1.298 1.092 2.207 800 0.393 5 5 0.987 1.150 1.223

200 0.808 60 3 1.279 1.074 2.301 800 0.393 10 2 0.996 1.031 1.246

400 0.629 4 4 1.036 1.128 1.569 800 0.393 10 5 0.986 1.134 1.131

400 0.629 5 5 1.044 1.150 1.463 800 0.393 30 3 0.986 1.034 1.150

400 0.629 10 2 1.050 1.039 1.497 800 0.393 60 3 0.986 1.022 1.159

0.95 50 0.870 4 4 1.710 1.368 7.306 400 0.592 10 5 1.001 1.123 1.366

50 0.870 5 5 1.822 1.436 7.334 400 0.592 30 3 1.012 1.048 1.392

50 0.870 10 2 1.354 1.125 4.700 400 0.592 60 3 1.013 1.034 1.262

50 0.870 10 5 2.200 1.523 7.887 500 0.521 4 4 1.007 1.110 1.335

50 0.870 30 3 1.927 1.189 5.863 500 0.521 5 5 1.006 1.134 1.255

50 0.870 60 3 1.811 1.125 5.421 500 0.521 10 2 0.993 1.017 1.281

200 0.766 4 4 1.151 1.196 2.444 500 0.521 10 5 0.999 1.140 1.295

200 0.766 5 5 1.145 1.201 2.244 500 0.521 30 3 0.983 1.032 1.241

200 0.766 10 2 1.079 1.040 1.944 500 0.521 60 3 0.981 1.031 1.213

200 0.766 10 5 1.153 1.180 2.127 800 0.377 4 4 0.996 1.132 1.208

200 0.766 30 3 1.160 1.052 1.930 800 0.377 5 5 0.983 1.134 1.152

200 0.766 60 3 1.125 1.054 1.927 800 0.377 10 2 1.003 1.021 1.140

400 0.592 4 4 1.014 1.129 1.431 800 0.377 10 5 0.973 1.145 1.178

400 0.592 5 5 1.009 1.183 1.420 800 0.377 30 3 0.956 1.019 1.149

400 0.592 10 2 1.021 1.030 1.377 800 0.377 60 3 0.952 1.034 1.102

The finite population is constructed from the model in Eq. (9) with ω = 0.90 and 0.95, sample size n = 30, the population size N = 2000, and the correlation coefficient, ρ, between the X- and Y-variables. The efficiencies are $RE_1 = \mathrm{Var}(\bar{Y}_{pps,n})/\mathrm{Var}(\bar{Y}_{RB,n})$, $RE_2 = \mathrm{Var}(\bar{Y}_{pp,n})/\mathrm{Var}(\bar{Y}_{RB,n})$ and $RE_3 = \mathrm{MSE}(\bar{Y}_R)/\mathrm{Var}(\bar{Y}_{RB,n})$. Variances and mean square error (MSE) are computed from 5000 simulation replications.

be better alternatives than a ratio estimator for the population producing extreme X- and

Y-values for the smaller values of 1 −ω.

6. APPLICATION TO APPLE PRODUCTION DATA

In this section, we apply the proposed sampling designs to apple production data without stratification. Since the efficiency of the point estimators is discussed in the previous section, we investigate the properties of the confidence intervals.

We performed a simulation study to compare the efficiency of the confidence intervals of the population mean based on the Rao–Blackwell ($\bar{Y}_{RB,n}$), pp ($\bar{Y}_{pp,n}$) and pps ($\bar{Y}_{pps,n}$) estimators. The simulation study considered the set and group size combinations (M, d) = (6,6), (7,7), (8,8), (9,9), (10,10), (30,3), (30,5), (300,3), (300,5), (300,10),


Table 4. Efficiency comparison of the proposed estimators with probability-proportional-to-size and ratio estimators.

τ ρ M d RE1 RE2 RE3 τ ρ M d RE1 RE2 RE3

50 0.966 5 5 2.353 1.456 7.398 400 0.657 60 10 1.048 1.062 1.339

50 0.966 10 10 2.760 1.408 5.784 400 0.657 2000 2 1.048 1.002 1.267

50 0.966 60 2 2.047 1.114 3.123 400 0.657 2000 4 1.027 1.006 1.250

50 0.966 60 3 4.449 1.380 6.940 400 0.657 2000 5 1.050 1.014 1.297

50 0.966 60 4 3.325 1.226 4.719 400 0.657 2000 10 1.027 1.014 1.381

50 0.966 60 5 3.950 1.302 6.496 400 0.657 2000 20 0.985 1.088 1.285

50 0.966 60 10 4.015 1.328 7.143 500 0.574 5 5 1.026 1.014 1.530

50 0.966 2000 2 1.808 1.029 2.445 500 0.574 10 10 1.020 1.082 1.246

50 0.966 2000 4 3.002 1.035 4.227 500 0.574 60 2 1.032 0.999 1.406

50 0.966 2000 5 3.331 1.028 5.063 500 0.574 60 3 1.044 1.017 1.296

50 0.966 2000 10 4.185 1.077 5.386 500 0.574 60 4 1.042 1.019 1.312

50 0.966 2000 20 5.964 1.258 8.858 500 0.574 60 5 1.038 1.022 1.469

200 0.856 5 5 1.208 1.126 2.160 500 0.574 60 10 1.017 1.055 1.464

200 0.856 10 10 1.237 1.101 2.034 500 0.574 2000 2 1.018 0.999 1.303

200 0.856 60 2 1.195 1.022 1.722 500 0.574 2000 4 1.035 1.002 1.309

200 0.856 60 3 1.300 1.029 1.763 500 0.574 2000 5 1.027 1.009 1.349

200 0.856 60 4 1.215 1.029 1.542 500 0.574 2000 10 1.030 1.012 1.395

200 0.856 60 5 1.285 1.042 1.904 500 0.574 2000 20 0.923 1.090 1.123

200 0.856 60 10 1.322 1.079 1.701 800 0.401 5 5 1.021 1.009 1.234

200 0.856 2000 2 1.153 1.002 1.295 800 0.401 10 10 1.001 1.054 1.362

200 0.856 2000 4 1.281 1.009 1.709 800 0.401 60 2 1.008 1.000 1.394

200 0.856 2000 5 1.251 1.013 1.608 800 0.401 60 3 1.025 1.004 1.262

200 0.856 2000 10 1.289 1.005 1.501 800 0.401 60 4 0.998 1.008 1.340

200 0.856 2000 20 1.160 1.090 1.571 800 0.401 60 5 0.998 1.017 1.275

400 0.657 5 5 1.058 1.038 1.533 800 0.401 60 10 1.007 1.059 1.390

400 0.657 10 10 1.047 1.074 1.492 800 0.401 2000 2 0.996 1.000 1.279

400 0.657 60 2 1.064 1.013 1.329 800 0.401 2000 4 1.011 1.001 1.433

400 0.657 60 3 1.069 1.001 1.526 800 0.401 2000 5 0.966 1.004 1.232

400 0.657 60 4 1.057 1.005 1.318 800 0.401 2000 10 0.937 1.015 1.167

400 0.657 60 5 1.073 1.031 1.440 800 0.401 2000 20 0.894 1.052 1.083

The finite population is constructed from the model in Eq. (9) with ω = 0.70, sample size n = 120, the population size N = 2000, and the correlation coefficient, ρ, between the X- and Y-variables. The efficiencies are $RE_1 = \mathrm{Var}(\bar{Y}_{pps,n})/\mathrm{Var}(\bar{Y}_{RB,n})$, $RE_2 = \mathrm{Var}(\bar{Y}_{pp,n})/\mathrm{Var}(\bar{Y}_{RB,n})$ and $RE_3 = \mathrm{MSE}(\bar{Y}_R)/\mathrm{Var}(\bar{Y}_{RB,n})$. Variances and mean square error (MSE) are computed from 5000 simulation replications.

(600,3), (600,5), (600,10). We considered two different sample sizes, n = 60, 120. The simulation size is taken to be 1000. The Rao–Blackwell estimator is computed with fifty replications, Q = 50. The efficiencies of the confidence intervals are defined as the ratios of the average squared lengths

$$RE_4 = \frac{\sum_{i=1}^{1000} L^2_{pp,i}}{\sum_{i=1}^{1000} L^2_{RB,i}}, \quad RE_5 = \frac{\sum_{i=1}^{1000} L^2_{pps,i}}{\sum_{i=1}^{1000} L^2_{RB,i}},$$

where $L_{pp,i}$, $L_{RB,i}$ and $L_{pps,i}$ are the lengths of the confidence intervals in the $i$th replication based on the point estimators $\bar{Y}_{pp,n}$, $\tilde{Y}_{RB,n}$ and $\bar{Y}_{pps,n}$, respectively, and are given by

$$L_{pp,i} = 2\, t_{n-d_n}\, \hat{\sigma}_{pp,n,i}, \quad L_{RB,i} = 2\, t_{n-1}\, \hat{\sigma}_{RB,n,i}, \quad L_{pps,i} = 2\, t_{n-d_n}\, \hat{\sigma}_{pps,n,i}.$$
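The interval-length efficiencies and the empirical coverage are simple summaries of the simulated intervals. The sketch below shows one way to compute them from stored interval endpoints; the helper names and the toy intervals are hypothetical, and the construction of the intervals themselves (t quantiles and variance estimates) is assumed to have been done elsewhere.

```python
def length_efficiency(intervals_num, intervals_den):
    """Ratio of summed squared interval lengths, numerator over denominator."""
    num = sum((hi - lo) ** 2 for lo, hi in intervals_num)
    den = sum((hi - lo) ** 2 for lo, hi in intervals_den)
    return num / den

def coverage(intervals, theta):
    """Fraction of intervals (lo, hi) that contain the true value theta."""
    hits = sum(1 for lo, hi in intervals if lo <= theta <= hi)
    return hits / len(intervals)

# Toy intervals from two replications; the true mean is taken to be 10.
pps_intervals = [(7.0, 13.0), (9.0, 15.0)]    # each of length 6
rb_intervals = [(8.0, 12.0), (10.5, 14.5)]    # each of length 4

re5 = length_efficiency(pps_intervals, rb_intervals)  # (36 + 36) / (16 + 16)
cov_rb = coverage(rb_intervals, theta=10.0)           # only the first covers 10
```

Because both sums run over the same number of replications, the ratio of sums equals the ratio of average squared lengths.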


Table 5. Efficiency comparison of the proposed estimators with probability-proportional-to-size and ratio estimators.

τ ρ M d RE1 RE2 RE3 τ ρ M d RE1 RE2 RE3

50 0.907 5 5 2.152 1.395 7.805 400 0.629 60 10 1.095 1.063 1.504

50 0.907 10 10 2.711 1.453 9.264 400 0.629 2000 2 1.037 1.000 1.402

50 0.907 60 2 1.748 1.103 5.312 400 0.629 2000 4 1.072 0.999 1.374

50 0.907 60 3 4.899 1.440 14.821 400 0.629 2000 5 1.092 1.016 1.484

50 0.907 60 4 3.233 1.247 9.029 400 0.629 2000 10 1.069 1.013 1.288

50 0.907 60 5 3.545 1.242 9.746 400 0.629 2000 20 1.029 1.061 1.342

50 0.907 60 10 4.363 1.282 11.177 500 0.553 5 5 1.031 1.027 1.495

50 0.907 2000 2 1.693 1.010 4.899 500 0.553 10 10 1.049 1.090 1.454

50 0.907 2000 4 2.767 1.037 6.999 500 0.553 60 2 1.027 1.005 1.363

50 0.907 2000 5 3.228 1.024 7.918 500 0.553 60 3 1.064 1.004 1.423

50 0.907 2000 10 5.054 1.109 14.088 500 0.553 60 4 1.043 1.018 1.305

50 0.907 2000 20 4.738 1.178 13.567 500 0.553 60 5 1.043 1.031 1.454

200 0.808 5 5 1.234 1.065 2.813 500 0.553 60 10 1.031 1.048 1.373

200 0.808 10 10 1.213 1.111 2.250 500 0.553 2000 2 1.046 1.001 1.338

200 0.808 60 2 1.189 1.021 1.986 500 0.553 2000 4 1.018 1.004 1.280

200 0.808 60 3 1.422 1.058 2.350 500 0.553 2000 5 1.027 1.005 1.243

200 0.808 60 4 1.311 1.042 2.266 500 0.553 2000 10 1.013 1.014 1.214

200 0.808 60 5 1.278 1.049 2.403 500 0.553 2000 20 0.925 1.053 1.132

200 0.808 60 10 1.362 1.103 2.462 800 0.393 5 5 1.025 1.007 1.223

200 0.808 2000 2 1.148 1.004 1.856 800 0.393 10 10 1.004 1.072 1.158

200 0.808 2000 4 1.247 1.003 2.024 800 0.393 60 2 1.004 1.002 1.168

200 0.808 2000 5 1.333 1.006 2.135 800 0.393 60 3 1.017 1.013 1.150

200 0.808 2000 10 1.280 1.017 2.161 800 0.393 60 4 1.002 1.004 1.135

200 0.808 2000 20 1.357 1.102 2.325 800 0.393 60 5 1.006 1.039 1.090

400 0.629 5 5 1.070 1.017 1.537 800 0.393 60 10 0.965 1.061 1.099

400 0.629 10 10 1.065 1.101 1.381 800 0.393 2000 2 1.003 1.004 1.349

400 0.629 60 2 1.058 1.008 1.429 800 0.393 2000 4 0.997 1.005 1.270

400 0.629 60 3 1.051 1.021 1.554 800 0.393 2000 5 0.978 1.002 1.139

400 0.629 60 4 1.060 1.010 1.610 800 0.393 2000 10 0.970 1.029 1.081

400 0.629 60 5 1.073 1.040 1.363 800 0.393 2000 20 0.909 1.099 1.248

The finite population is constructed from the model in Eq. (9) with ω = 0.90, sample size n = 120, the population size N = 2000, and the correlation coefficient, ρ, between the X- and Y-variables. The efficiencies are $RE_1 = \mathrm{Var}(\bar{Y}_{pps,n})/\mathrm{Var}(\bar{Y}_{RB,n})$, $RE_2 = \mathrm{Var}(\bar{Y}_{pp,n})/\mathrm{Var}(\bar{Y}_{RB,n})$ and $RE_3 = \mathrm{MSE}(\bar{Y}_R)/\mathrm{Var}(\bar{Y}_{RB,n})$. Variances and mean square error (MSE) are computed from 5000 simulation replications.

The notations $\hat{\sigma}_{pp,n,i}$, $\hat{\sigma}_{RB,n,i}$ and $\hat{\sigma}_{pps,n,i}$ are used to denote the variance estimates of the estimators in the $i$th replication of the simulation study. Values of $RE_4$ and $RE_5$ greater than 1 indicate that the confidence intervals in the denominators are, on average, shorter than the ones in the numerators. Table 6 presents the efficiencies and the coverage probabilities of the 95% confidence intervals.

It is clear that the values of $RE_5$ are all greater than 1. Hence, the confidence intervals based on the Rao–Blackwell estimator are shorter than the intervals constructed based on the pps estimator. The efficiency values slightly increase with the sample size (n = 120). Since the correlation coefficient (ρ = 0.916) between the X- and Y-variables is relatively high, the larger set sizes M tend to produce shorter intervals based on the Rao–Blackwell estimator.


[Figure 1: nine panels plotting $RE_3$ (vertical axis) against the number of groups d (horizontal axis), with one curve for each set size M = 30, 60, 300, 2000. Rows: ρ = 0.966, 0.856, 0.657; columns: n = 30, 60, 120; ω = 0.7 in all panels.]

Figure 1. Efficiency plots ($RE_3 = \mathrm{MSE}(\bar{Y}_{R,n})/\mathrm{Var}(\bar{Y}_{RB,n})$) of the Rao–Blackwell estimator with respect to the ratio estimator in double sampling for different values of M, d and n when ω = 0.7.

The efficiency ($RE_4 > 1$) of the jackknife confidence interval based on the Rao–Blackwell estimator is higher when the set size M is less than or equal to 10. For large values of M, the $RE_4$ values are around 1, indicating that the jackknife confidence intervals are as good as or slightly less efficient than the pp-based confidence intervals.

The coverage probabilities for all three confidence intervals are slightly lower than the nominal coverage probability 0.95 when n = 60. Since the confidence intervals are constructed based on a normal approximation, this may be due to the effect of the sample size. Table 6 shows that the coverage probabilities are quite close to 0.95 when n = 120.

We performed another simulation study to investigate the efficiency of stratified pp estimators under equal, proportional and Neyman allocations. In this part of the simulation, pp samples are constructed from the stratified apple production data in Table 1. As we observe from the table, there is a large variation between stratum populations. Hence, the use of a stratified pp sample would be appropriate. To determine the sample sizes in the Neyman allocation, we used the population standard deviations in Table 1. The simulation study


[Figure 2: nine panels plotting $RE_3$ (vertical axis) against the number of groups d (horizontal axis), with one curve for each set size M = 30, 60, 300, 2000. Rows: ρ = 0.907, 0.808, 0.629; columns: n = 30, 60, 120; ω = 0.9 in all panels.]

Figure 2. Efficiency plots ($RE_3 = \mathrm{MSE}(\bar{Y}_{R,n})/\mathrm{Var}(\bar{Y}_{RB,n})$) of the Rao–Blackwell estimator with respect to the ratio estimator in double sampling for different values of M, d, ρ and n when ω = 0.90.

considered two designs, the stratified pps and stratified pp sampling designs, with sample sizes n = 140, 210, 280. For the stratified pp sampling design, we computed two estimators, $\bar{Y}_{str}$ and the Rao–Blackwell estimator $\bar{Y}_{str,RB}$,

$$\bar{Y}_{str,RB} = \sum_{l=1}^{L} \frac{N_l}{N}\, \bar{Y}_{RB,n_l},$$

where $\bar{Y}_{RB,n_l}$ is the Rao–Blackwell estimator from stratum population $l$. The variance of $\bar{Y}_{str,RB}$ is denoted by $\sigma^2_{\lambda,RB}(E)$, $\sigma^2_{\lambda,RB}(P)$ and $\sigma^2_{\lambda,RB}(N)$ for equal, proportional and Neyman allocations, respectively. To approximate the Rao–Blackwell estimator, the number of replications, Q, is selected to be 10 and 50.

For comparison purposes, we also considered the estimator of θ based on a stratified pps sample


Table 6. The efficiencies and coverage probabilities (Cov) of the confidence intervals.

n M d RE4 RE5 Cov(RB) Cov(pp) Cov(pps)

60 6 6 1.172 1.333 0.937 0.938 0.931

60 7 7 1.200 1.344 0.926 0.927 0.933

60 8 8 1.220 1.348 0.919 0.919 0.931

60 9 3 1.108 1.363 0.920 0.923 0.936

60 9 9 1.243 1.366 0.935 0.927 0.934

60 10 10 1.268 1.353 0.927 0.925 0.937

60 30 3 1.062 1.396 0.941 0.935 0.944

60 30 5 1.092 1.401 0.930 0.931 0.935

60 300 3 0.998 1.410 0.932 0.929 0.931

60 300 5 0.994 1.353 0.941 0.934 0.942

60 300 10 1.006 1.283 0.906 0.905 0.937

60 600 3 0.987 1.419 0.931 0.932 0.937

60 600 5 0.991 1.362 0.928 0.928 0.936

60 600 10 0.943 1.204 0.930 0.922 0.937

120 6 6 1.125 1.338 0.949 0.948 0.947

120 7 7 1.125 1.359 0.933 0.949 0.938

120 8 8 1.134 1.371 0.963 0.955 0.960

120 9 3 1.093 1.362 0.943 0.945 0.950

120 9 9 1.147 1.385 0.956 0.953 0.963

120 10 10 1.162 1.388 0.945 0.945 0.940

120 30 3 1.057 1.405 0.940 0.944 0.936

120 30 5 1.077 1.449 0.933 0.934 0.935

120 300 3 1.013 1.475 0.936 0.939 0.938

120 300 5 1.008 1.423 0.942 0.938 0.941

120 300 10 1.017 1.443 0.934 0.932 0.931

120 600 3 1.004 1.471 0.949 0.950 0.961

120 600 5 1.006 1.439 0.941 0.935 0.945

120 600 10 1.006 1.453 0.939 0.940 0.951

The RE4is the ratio of the average squared lengths of the conﬁdence intervals based on post-stratiﬁed probability-

proportional-to-size (pp) and Rao–Blackwell (RB) estimators. The RE5is the ratio of the average squared lengths

of the conﬁdence intervals based on probability-proportional-to-size (pps) and RB estimators.

$$\breve{Y}_{str} = \sum_{l=1}^{L} \frac{N_l}{N}\, \bar{Y}_{pps,n_l},$$

where $\bar{Y}_{pps,n_l}$ is the pps estimator in Eq. (1) from the stratum population $l$. The variance of $\sqrt{n}(\breve{Y}_{str} - \theta)$, similar to the ones in stratified pp sampling, can be computed for equal, proportional and Neyman allocations:

˘σ2

λ(E)=



l=1

L˘

N2;˘σ2

λ(P)=



l=1

,˘σ2

λ(N)=L



l=1

N2

where ˘

l=1

nN

k=1πkyk

Nlπk−θl2

Stratified samples for both the pps and pp designs are constructed using equal, proportional and Neyman allocations. The simulation size is taken to be 50,000. Table 7 presents the relative efficiencies of the stratified pps and pp estimators with respect to the Rao–Blackwell estimators.


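Table 7. Relative efficiencies of the estimators and coverage probabilities of the confidence interval of the parameter θ for equal (E), proportional (P) and Neyman (N) allocation procedures.

Columns, in order: n; Q; stratified pps: $\breve{\sigma}^2_{\lambda}(E)/\sigma^2_{\lambda,RB}(E)$, $\breve{\sigma}^2_{\lambda}(P)/\sigma^2_{\lambda,RB}(P)$, $\breve{\sigma}^2_{\lambda}(N)/\sigma^2_{\lambda,RB}(N)$; stratified pp: $\sigma^2_{\lambda}(E)/\sigma^2_{\lambda,RB}(E)$, $\sigma^2_{\lambda}(P)/\sigma^2_{\lambda,RB}(P)$, $\sigma^2_{\lambda}(N)/\sigma^2_{\lambda,RB}(N)$; Rao–Blackwell: $\sigma^2_{\lambda,RB}(E)/\sigma^2_{\lambda,RB}(N)$, $\sigma^2_{\lambda,RB}(P)/\sigma^2_{\lambda,RB}(N)$; stratified pp coverage: Cov(E), Cov(P), Cov(N).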

140 10 1.742 1.741 1.549 1.356 1.351 1.179 1.351 1.553 0.936 0.927 0.942

140 50 1.864 1.790 1.577 1.372 1.378 1.185 1.331 1.525 0.937 0.928 0.941

210 10 1.907 1.913 1.709 1.269 1.341 1.166 1.418 1.500 0.942 0.937 0.946

210 50 1.932 1.901 1.653 1.330 1.322 1.205 1.319 1.452 0.939 0.938 0.938

280 10 1.967 1.899 1.731 1.266 1.237 1.209 1.328 1.514 0.942 0.938 0.945

280 50 2.060 1.928 1.683 1.278 1.284 1.239 1.256 1.412 0.947 0.938 0.938


It is clear that the Rao–Blackwell estimator for each allocation procedure provides a substantial improvement over the stratified pps and pp estimators. As expected, the efficiency increases with the sample size n, but increasing the number of replications Q in the Rao–Blackwell estimator from 10 to 50 does not make a significant improvement in the efficiency. The Neyman allocation dominates the other allocation procedures, as expected. For this population, equal allocation has higher efficiency than proportional allocation. This is consistent with the expression for $\sigma^2_{\lambda}(E) - \sigma^2_{\lambda}(P)$. This difference could be negative for populations in which smaller stratum populations have larger variances. A close inspection of the apple production data in Table 1 indicates that the largest stratum population variance belongs to the second smallest stratum. Hence, for this population, equal allocation is better than proportional allocation. The coverage probabilities of the confidence interval of θ based on the stratified pp estimator are relatively close to the nominal value of 0.95.

7. CONCLUDING REMARKS

In many survey sampling studies, in addition to the variable of interest, the population units have a known auxiliary variable. This auxiliary variable is often proportional to the variable under study. If the population has strong heterogeneity among its members, such as extremely large values for some population units, the pps sample would provide an estimator for the population mean with smaller variance than a simple random sample estimator of the same size. In a pps sample, sample units are selected with selection probabilities proportional to the size of the auxiliary variable. Since the auxiliary variable is highly correlated with the variable of interest, it also provides information about the relative position of the units in a comparison set with respect to the variable of interest. In this paper, we used this position information to construct a post-stratified pps sample. The new sample creates post-strata among the sample units of a pps sample. Hence, the estimators of the population mean have a smaller variance than the estimator from a pps sample of the same size. The post-stratification of the pps sample is performed by conditioning on the comparison sets. We use the Rao–Blackwell theorem to improve the post-stratified pps sample estimator. The new sampling design is naturally extended to stratified populations. The efficiency of the estimator of the population mean is empirically evaluated in a stratified population.

[Received February 2019. Accepted July 2019. Published Online July 2019.]

REFERENCES

Al-Saleh, M. F. and Samawi, H. (2007). A note on Inclusion Probability in Ranked Set Sampling for finite population. Test, 16, 198–209.

Dastbaravarde, A., Arghami, N.R., Sarmad, M., (2016). Some theoretical results concerning non parametric esti-

mation by using a judgement poststratiﬁcation sample. Communications in Statistics - Theory and Methods,

45, 2181–2203.

Deshpande, J.V., Frey, J., Ozturk, O. (2006) Nonparametric ranked set-sampling conﬁdence intervals for a ﬁnite

population. Environmental and Ecological Statistics, 13, 25–40.

718 O. Ozturk

Frey, J. (2011). A note on ranked-set sampling using a covariate. Journal of Statistical Planning and Inference,

141, 809–816.

— (2012). Constrained nonparametric estimation of the mean and the CDF using ranked-set sampling with a

covariate. Annals of the Institute of Statistical Mathematics, 64, 439–456.

Frey, J. and Feeman, T.G. (2012). An improved mean estimator for judgement post-stratiﬁcation. Computational

Statistics and Data Analysis, 56, 418–426.

— (2013). Variance estimation using judgement post- stratiﬁcation. Annals of the Institute of Statistical Mathe-

matics, 65, 551–569.

Frey, J. and Ozturk, O. (2011). Constrained estimation using judgement post-stratiﬁcation. Annals of the Institute

of Statistical Mathematics, 63, 769–789.

Gokpinar, F. and Ozdemir, Y.A. (2010). Generalization of inclusion probabilities in ranked set sampling. Hacettepe

Journal of Mathematics and Statistics, 39, 89–95.

Kadilar, C. and Cingi, H. (2003). Ratio estimators in stratiﬁed random sampling. Biometrical Journal, 45, 218–225.

MacEachern, S. N., Stasny, E. A., and Wolfe, D. A. (2004) Judgment post- stratiﬁcation with imprecise rankings.

Biometrics, 60, 207–215.

Ozdemir, Y.A. and Gokpinar,F. (2008). A new formula for inclusion probabilities in median ranked set sampling.

Communications in Statistics - Theory and Methods, 37, 2022–2033.

— (2007). A generalized formula for inclusion probabilities in ranked set sampling. Hacettepe Journal of Mathe-

matics and Statistics, 36, 89–99.

Ozturk, O. (2014a). Estimation of population mean and total in ﬁnite population setting using multiple auxiliary

variables. Journal of Agricultural, Biological and Environmental Statistics, 19, 161–184.

— (2014b). Statistical inference for population quantiles and variance in judgment post-stratiﬁed samples, Com-

putational Statistics and Data Analysis, 77, 188–205.

— (2016a). Estimation of a ﬁnite population mean and total using population ranks of sample units. Journal of

Agricultural, Biological and Environmental Statistics, 21, 181–202.

— (2016b). Statistical inference based on judgment post-stratiﬁed samples in ﬁnite population. Survey Methodol-

ogy, 42, 239–262.

Ozturk., O. and Bayramoglu Kavlak, K. (2018). Model based inference using ranked set samples. Survey Method-

ology, 44, 1–16.

— (2019). Statistical inference using stratiﬁed ranked set samples from ﬁnite populations, Chapter 12, pages,

157–170. Ranked Set Sampling: 65 Years Improving the Accuracy in Data Gathering edited by Bouza and

Al-Omari, Elsevier, San Diego, USA.

Ozturk, O. and Jafari Jozani, M. (2013). Inclusion Probabilities in Partially Rank Ordered Set Sampling. Compu-

tational Statistics and Data Analysis, 69, 122–132.

Patil, G.P., Sinha, A.K., and Taillie, C. (1995). Finite population corrections for ranked set sampling. Annals of the

Institute of Statistical Mathematics, 47, 621–636.

R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical

Computing, Vienna, Austria. URL https://www.R-project.org/.

Thompson, S.K. (2002). Sampling, 2nd edition, Wiley, New York.

Wang, X., Stokes, L., Lim, J., and Chen, M. (2006). Concomitants of multivariate order statistics with application

to judgment post-stratiﬁcation. Journal of the American Statistical Association, 101, 1693–1704

Wang, X., Lim, J., Stokes, S.L. (2008). A nonparametric mean estimator for judgement post-stratiﬁed data.

Biometrics, 64, 355–363.
