Sampling with Probability Proportional to Size

The first stage of most household survey sampling designs consists of selecting a sample of Primary Sampling Units (PSUs) with Probability Proportional to Size (PPS).  This procedure performs that selection, using as its sampling frame a Stata file composed of all PSUs in a stratum.

The procedure takes four arguments:

run PPS sizevar samplevar nunits randseed


 sizevar  is the name of the variable that contains the measure of size (typically, the number of households, the number of dwellings or the population of the PSU).  It must be a non-negative variable.
 samplevar  is the name of the variable where the PSUs selected in the sample will be flagged. It cannot be an already existing variable.
 nunits  is the number of PSUs to be selected in the sample.  It must be a positive integer.
 randseed  is a random number between 0 and 1.  Although Stata could in principle generate this number automatically, the procedure requires the user to choose it externally and supply it as a numeric argument.
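
The constraints on the four arguments listed above could be checked along the following lines. This is only a sketch: the program name pps_checkargs, the error messages and the return codes are assumptions made for illustration, not the checks actually contained in the do file, and "between 0 and 1" is read here as including both endpoints.

```stata
* Hypothetical helper (not part of the original do file): validates the four
* arguments against the constraints described in the list above.
capture program drop pps_checkargs
program define pps_checkargs
    args sizevar samplevar nunits randseed

    * sizevar must be an existing numeric variable
    confirm numeric variable `sizevar'
    * samplevar must not exist yet
    confirm new variable `samplevar'
    * nunits must be a positive integer
    confirm integer number `nunits'
    if `nunits' <= 0 {
        display as error "nunits must be a positive integer"
        exit 198
    }
    * randseed must be a number between 0 and 1
    confirm number `randseed'
    if `randseed' < 0 | `randseed' > 1 {
        display as error "randseed must lie between 0 and 1"
        exit 198
    }
    * sizevar must not contain negative values
    quietly count if `sizevar' < 0
    if r(N) > 0 {
        display as error "`sizevar' contains negative values"
        exit 459
    }
end
```

A call such as pps_checkargs nhouseholds insample 25 0.4217 (variable names hypothetical) returns silently when all four arguments are acceptable and stops with an error otherwise.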

Description of the Selection Algorithm

Most of the code is devoted to checking that the four arguments received are correct and other administrative chores.  The core of the selection algorithm is in the following five lines:

 generate CumuSize = `1' if _n == 1
 replace  CumuSize = `1' + CumuSize[_n-1]  if _n > 1
 generate CumulSSS = CumuSize * `3' / CumuSize[_N] + `4'
 generate `2' = int(CumulSSS)  if _n == 1
 replace  `2' = int(CumulSSS) - int(CumulSSS[_n-1])  if _n > 1

The first two lines create an auxiliary variable (CumuSize) with the cumulated size of all PSUs, up to and including the current PSU.  The measure of size is the variable specified in the first argument.

The third line creates another auxiliary variable (CumulSSS) with the cumulated size, scaled to run from 0 to the number of PSUs to select in the sample (the third argument), and shifted by the random seed (the fourth argument).

The last two lines create the output variable specified in the second argument.  The sample is defined by the PSUs where the integer part of CumulSSS changes.  The output variable will have zeroes in the PSUs not selected in the sample and positive numbers (generally 'ones') in the selected PSUs.
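
The steps described above can be traced on a small made-up frame. The sketch below is an illustration rather than the do file itself: the six PSUs, the variable names psu, size and selected, and the values 3 and 0.37 standing in for the third and fourth arguments are all assumptions. It also uses Stata's running-sum function sum() in place of the generate/replace pair for the cumulated size, which gives the same result.

```stata
* Toy sampling frame: six hypothetical PSUs with their measure of size
clear
input str1 psu size
"A" 120
"B"  80
"C" 300
"D"  50
"E" 250
"F" 200
end

* Stand-ins for the third and fourth arguments (assumed values)
local nunits = 3
local randseed = 0.37

* Cumulated size up to and including the current PSU
generate double CumuSize = sum(size)
* Rescale to run from 0 to nunits and shift by the random start
generate double CumulSSS = CumuSize * `nunits' / CumuSize[_N] + `randseed'
* Flag the PSUs where the integer part of CumulSSS changes
generate byte selected = int(CumulSSS) if _n == 1
replace selected = int(CumulSSS) - int(CumulSSS[_n-1]) if _n > 1

list psu size CumuSize CumulSSS selected, noobs
```

With these values CumulSSS takes the values 0.73, 0.97, 1.87, 2.02, 2.77 and 3.37, so its integer part changes at PSUs C, D and F, which are the three PSUs flagged with a one.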

Technical notes

[1] The sample can be endowed with implicit stratification by simply sorting the Stata file by the relevant criteria prior to running this procedure.

[2] The output variable will typically contain only zeroes and ones.  However, if the size distribution of the PSUs is very skewed, some of the larger ones (those whose size exceeds the sampling interval, that is, the total size divided by nunits) may be selected two or more times, and will be flagged by numbers larger than one in the output variable.

[3] The procedure assumes that the Stata file is composed of all PSUs in a stratum, with no PSUs from other strata.  It cannot be asked to operate on a subset of the dataset by means of Stata's "if" clause, but this effect can easily be achieved by simply creating a size variable with zeroes for all PSUs not belonging to the stratum.
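
The three notes can be illustrated together on a made-up frame. Everything in the sketch below is hypothetical: the seven PSUs, the variables psu_id, stratum, region and nhouseholds, and the choice of 4 PSUs with a random start of 0.5. The selection lines are written out directly with these values instead of calling the do file, so that the example runs on its own.

```stata
* Hypothetical frame holding PSUs from two strata
clear
input psu_id stratum region nhouseholds
1 1 2 100
2 2 1 400
3 1 1 150
4 1 2 600
5 2 2 250
6 1 1  50
7 1 1 100
end

* Note [3]: restrict selection to stratum 1 by giving all other PSUs size zero
generate size_s1 = cond(stratum == 1, nhouseholds, 0)

* Note [1]: implicit stratification by region; psu_id breaks ties so that
* the sort order is reproducible
sort region psu_id

* Selection steps with 4 PSUs and a random start of 0.5 (assumed values)
generate double CumuSize = sum(size_s1)
generate double CumulSSS = CumuSize * 4 / CumuSize[_N] + 0.5
generate byte hits = int(CumulSSS) if _n == 1
replace hits = int(CumulSSS) - int(CumulSSS[_n-1]) if _n > 1

* Note [2]: PSU 4 (size 600) exceeds the interval 1000 / 4 = 250
list psu_id stratum region size_s1 hits, noobs
```

In this frame the stratum 2 PSUs (2 and 5) receive zero, PSUs 3 and 1 are flagged once, and PSU 4 is flagged with a two, for a total of four selections: one in region 1 and three in region 2, roughly in line with the regions' shares of stratum size (300 and 700 out of 1000).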

© 2016 The World Bank Group, All Rights Reserved. Legal