Selection & Sampling

When you want to gather information about a group of people, such as “the number of people in ABC town who drive to work every day,” it would be nice to simply observe ALL the people in ABC town, right? You would have exactly the right answer.

Regrettably, given limits on staff, budget, time, and other resources, it is usually not feasible to observe or question the entire population you are interested in.

This is where probability sampling comes into play. We will discuss the following:

Probability Sampling: Simple Random and Stratified Sampling

Fortunately, you do not have to collect information for every member of your population. There is a much more efficient and cost-effective way to gather your information. This method is called ‘probability sampling.’ If designed and implemented well, probability sampling will yield accurate results about your population.

A ‘probability sample’ is:

A subgroup of the entire population chosen using random selection.

Many times, the term ‘random selection’ is taken to mean ‘haphazard selection.’ However, ‘random selection’ is:

A technical term that means each eligible subject has a specific and known probability of being included in your sample.

There are several types of probability sampling, such as:

Simple random sampling
Stratified sampling
Cluster sampling
Multistage sampling

Simple Random Sampling

When each eligible subject has the same probability of being selected for inclusion in your sample, it is called ‘simple random sampling.’

For example, suppose a school administrator over 4 schools wishes to find out students’ opinions about food served in the school cafeterias. (S)he has a complete list of all students in the schools and decides to randomly select 150 students from the list.

In this example, each student throughout the 4 schools has an equal probability of selection to be given the survey; therefore, it is a simple random sample.

As the name implies, selecting a simple random sample is, well… simple!

Here are the steps:

Assign each member of your population a numerical label.
Use statistical software or a random digit table to select numerical labels at random.

Example:
A small catering business serves 9 reception centers. The owner wants to interview a sample of 4 clients in detail to find ways to improve services to his/her clients. To avoid bias, the owner chooses a simple random sample of size 4.

Step 1: Each reception center is assigned a numerical label 1-9.

Darlene’s Wedding Center (1)
Magic Moments Reception Hall (2)
Rustic Realm Weddings (3)
Romance Gardens (4)
Classic Weddings (5)
Old Time Chapel (6)
Lovers Lane Weddings (7)
Accents-Modern Weddings (8)
Century Falls Reception Center (9)

Step 2: The owner decides to use a statistical software program to generate 4 numerical labels between 1 and 9 at random. The software returns the following numbers:

5, 8, 6, 4

Therefore, the simple random sample to be interviewed in detail will be:

Classic Weddings (5)
Accents-Modern Weddings (8)
Old Time Chapel (6)
Romance Gardens (4)
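
The two steps above can be sketched in Python. This is only a minimal illustration, not the software the owner used: the function name simple_random_sample and its optional seed parameter are assumptions introduced here, and the labels drawn will generally differ from the 5, 8, 6, 4 shown above.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)  # pass a seed only if the draw must be repeatable
    return rng.sample(population, k)

# Step 1: assign each reception center a numerical label 1-9.
centers = {
    1: "Darlene’s Wedding Center",
    2: "Magic Moments Reception Hall",
    3: "Rustic Realm Weddings",
    4: "Romance Gardens",
    5: "Classic Weddings",
    6: "Old Time Chapel",
    7: "Lovers Lane Weddings",
    8: "Accents-Modern Weddings",
    9: "Century Falls Reception Center",
}

# Step 2: select 4 labels at random, without replacement.
labels = simple_random_sample(list(centers), k=4)
for label in labels:
    print(f"{centers[label]} ({label})")
```

Sampling without replacement matters here: drawing each label independently could pick the same client twice.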

Stratified Sampling

Sometimes subpopulations within your entire population vary considerably. In this case, it is advantageous to divide your population into subpopulations called “strata” and then perform simple random sampling within each stratum. This is stratified sampling.

The primary advantage of stratified sampling over simple random sampling is that it improves the accuracy of your estimates, provided you select a relevant stratification variable.

In general, the size of the sample in each stratum is taken in proportion to the size of the stratum.

Example:
Imagine you would like to interview schools that contract with different vendors to bring food to their cafeteria. We would expect opinions about cafeteria food to vary widely from school to school. Therefore, it makes sense to create school strata to sample from. Suppose the schools are as follows:

School 1: 1050 students
School 2: 565 students
School 3: 1554 students
School 4: 306 students

Total students: 1050 + 565 + 1554 + 306 = 3475 students

The administrator wishes to take a sample of 150 students.

Step 1: The first step is to find the total number of students (3475 above) and calculate the proportion of students in each stratum.

School 1: 1050 / 3475 ≈ 0.30
School 2: 565 / 3475 ≈ 0.16
School 3: 1554 / 3475 ≈ 0.45
School 4: 306 / 3475 ≈ 0.09

Step 2: Next, to select a sample in proportion to the size of each stratum (in this case school), the following number of students should be randomly selected:

School 1: 150 x 0.30 = 45
School 2: 150 x 0.16 = 24
School 3: 150 x 0.45 = 67.5, taken as 67
School 4: 150 x 0.09 = 13.5, taken as 14

The two halves are rounded in opposite directions so that the four counts total 150.

This tells us that our sample of 150 students should consist of:

45 students randomly selected from School 1
24 students randomly selected from School 2
67 students randomly selected from School 3
14 students randomly selected from School 4
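
The proportional allocation can also be computed directly. The sketch below is one possible approach, not the method used in the example above: the function name proportional_allocation is an assumption, and it works from the unrounded proportions, handing any students lost to rounding down to the strata with the largest fractional remainders. For that reason it allocates 45, 25, 67, and 13 students, which differs by one student in Schools 2 and 4 from the hand calculation that rounds each proportion to two decimals first. Either allocation is close to proportional.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    # Whole-number part of each stratum's exact share: size / total * sample_size.
    allocation = {
        name: size * sample_size // total
        for name, size in stratum_sizes.items()
    }
    # Rounding down can leave a few students unassigned; give them to the
    # strata whose exact shares had the largest fractional remainders.
    leftover = sample_size - sum(allocation.values())
    by_remainder = sorted(
        stratum_sizes,
        key=lambda name: stratum_sizes[name] * sample_size % total,
        reverse=True,
    )
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation

schools = {"School 1": 1050, "School 2": 565, "School 3": 1554, "School 4": 306}
allocation = proportional_allocation(schools, 150)
print(allocation)
```

Once the counts are fixed, a simple random sample of that size is drawn separately within each school.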

For more information about simple random sampling, stratified sampling, or other sampling methods, please consult a sample design textbook.

Calculating Sample Size


A wise man once said, “Never begin data collection without calculating the necessary sample size!” Ok, so perhaps a wise man did not say that, but a wise user of statistics will always have a plan to succeed before data collection begins.

Plan to Succeed

A big part of planning to succeed is figuring out how many observations you will need to meet the objectives of your project. Taking observations costs time and money, so we want to make sure we get just the right amount to make inferences about our outcomes of interest.

Before Consulting a Statistician

Sample size calculation is a more complex topic than can be covered in-depth here, but there are several key items you should start thinking about before you consult with a statistician or other researcher familiar with sample size calculations.

Possible Outcomes

First, if you have worked with data that is similar to the data you are going to be gathering, or you have researched similar work done by others, write down what you think the biggest and smallest possible outcomes could be.

For example, if you are working with height data, you know it is not possible for someone to be 4 inches tall. You also know it is not possible for people to be 20 feet tall. If you are familiar with human heights, you will have a rather good idea what the tallest and shortest possible values could be. This range is used to estimate the variability of your data.

Significance Level

Second, did you know that when you take a sample there is a chance of concluding there is a difference between your subgroups of interest, when in fact, your population does not have a true difference between the subgroups?

Since you are observing a sample, and not the entire population, sometimes you will get the wrong answer simply due to chance. However, you can choose the probability of this occurrence. Do you want there to be a 10% chance of making this kind of error, or 1% chance? This is called the significance level. Keep in mind, however, the smaller you make the chance of making this kind of error, the larger your sample size will have to be.

Power

Third, did you know you can conclude there is not a difference in your subgroups of interest when in actuality there is a difference between the subgroups? Again, since you are observing a sample, and not the entire population, sometimes you will get the wrong answer simply due to chance. And again, you can choose the probability of this occurrence. Do you want there to be a 20% chance of making this kind of error, or 5% chance? The power is the probability of avoiding this error, that is, of detecting a difference that really exists, so a 20% chance of error corresponds to 80% power. Once more, the smaller the chance of this error (the higher the power), the larger your sample size will have to be.

A Meaningful Size

Finally, you need to think about the size of a difference between your subgroups of interest that is meaningful. Let’s use a nutritional supplement example.

What if the actual difference in weight gain between taking the supplement (intervention) and eating the same number of calories (control) is 0.02 lbs? Is that meaningful? What if the actual difference is 20 lbs?

You need to think about the size of a difference between your groups which is meaningful to be able to detect it in your project. Keep in mind the smaller the difference you would like to be able to detect, the larger your sample size is going to need to be.
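
To see how variability, significance level, power, and the meaningful difference combine, here is a minimal sketch using a common textbook approximation for comparing the means of two equal-sized groups: n per group ≈ 2 × ((z for significance + z for power) × σ / δ)². It assumes roughly normal data, a two-sided test, and a common standard deviation σ in both groups. The function name and the example inputs (σ = 10 lbs, δ = 5 lbs) are hypothetical, and a statistician may well recommend a different calculation for your project.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_means(sigma, delta, alpha=0.05, power=0.80):
    """Approximate subjects needed PER GROUP to detect a difference of delta.

    sigma: assumed standard deviation of the outcome (same in both groups)
    delta: smallest difference between group means that is meaningful
    alpha: significance level (chance of declaring a difference that is not real)
    power: chance of detecting a difference of size delta when it is real
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

# Hypothetical inputs: weight gain varies with sigma = 10 lbs and a
# 5 lb difference between supplement and control is considered meaningful.
n_per_group = sample_size_two_means(sigma=10, delta=5)
print(n_per_group)

# Halving the meaningful difference roughly quadruples the required sample.
print(sample_size_two_means(sigma=10, delta=2.5))
```

The formula makes each trade-off described above visible: a smaller significance level, a higher power, more variability, or a smaller meaningful difference all increase the required sample size.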

Margin of Error

Along these same lines, sometimes data is collected about a single population to make estimates about a mean value for the population. In this case, you will need to think about how close the estimate from your sample should be to the actual population value. This is called the margin of error.

Since you are taking a sample, there is a degree of variability in the estimate you will get. You have probably seen political polls that estimate the percent in favor of a particular candidate. The reports will usually give the estimate with a margin of error.

For example, a newspaper recently reported one candidate for mayor was favored with 63% ± 3% of the vote. The ± 3% means the polling agency is confident (commonly at the 95% level) that the actual percent is between 60% and 66%, based on their sample data.

I bet you guessed it, but keep in mind, the smaller your margin of error, the larger your sample size.
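
For a poll like the one above, the link between margin of error and sample size can be sketched with the usual normal-approximation formula for a proportion, n ≈ z² × p × (1 − p) / E². This is an illustration under stated assumptions, not a statement of how the polling agency worked: it assumes a simple random sample, a 95% confidence level by default, and the conservative choice p = 0.5. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(margin, confidence=0.95, p=0.5):
    """Approximate sample size so the estimate is within +/- margin.

    margin: desired margin of error as a proportion (0.03 means 3 points)
    p: guessed true proportion; 0.5 gives the largest (safest) sample size
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

# A smaller margin of error requires a larger sample.
for margin in (0.05, 0.03, 0.01):
    print(f"margin ±{margin:.0%}: n = {sample_size_for_proportion(margin)}")
```

Because the margin is squared in the denominator, cutting the margin of error in half requires about four times as many respondents.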

Random Selection

It is common to read magazine polls that go something like this:
“In our last issue, we asked readers to respond to the following question: ‘Do you think your mechanic is honest?’ Based on this poll of our readers, 79% of Americans do not trust their car mechanic.”

There are a couple of reasons why polls of this sort are NOT RELIABLE:
First, this type of survey requires the voluntary response of the readers; those that feel strongly about an issue will voluntarily respond more frequently than those who do not, resulting in very biased results.

Second, this sample is not representative of Americans because this sample is based on a group of people that are similar in a particular way (common interest in reading the magazine).

Think of readers of “Fish and Game” or “Science.” Do you think they have different proportions of male and female readers than the United States population? Do you think they are more or less likely to live in urban or rural areas than the proportions in the United States population?

These are extremely specific groups of people that have characteristics that may not be representative of the United States population as a whole. Therefore, the sample results are biased and not representative of the population the magazine is making a conclusion about (Americans).
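
The first problem, voluntary response, can be illustrated with a small calculation. All the numbers below are hypothetical and chosen only for illustration: suppose half of all readers distrust their mechanic, but distrustful readers are much more likely to send in a response. The function name is also an assumption.

```python
def share_among_respondents(true_share, respond_if_distrust, respond_if_trust):
    """Expected share of 'distrust' answers among readers who choose to respond."""
    distrust_responses = true_share * respond_if_distrust
    trust_responses = (1 - true_share) * respond_if_trust
    return distrust_responses / (distrust_responses + trust_responses)

# Hypothetical inputs: 50% of readers truly distrust their mechanic;
# 30% of them respond, versus only 8% of readers who trust their mechanic.
poll_result = share_among_respondents(
    0.50, respond_if_distrust=0.30, respond_if_trust=0.08
)
print(f"True share: 50%, share seen in the poll: {poll_result:.0%}")
```

Even though opinion is evenly split in this made-up readership, the poll reports a large majority, purely because of who chose to respond. When both groups respond at the same rate, the poll share equals the true share.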

Importance of Random Selection

Randomly selecting the members of a sample is important because it helps prevent bias in your results. With random selection, impersonal chance chooses the sample, rather than the individual performing the poll (the sampler) choosing the participants, or the respondents selecting themselves as in the voluntary response poll mentioned above. If a sampler does not use random selection, the sample can favor certain groups without the sampler even realizing it.

Difficulties and Cautions

In the real world, a truly random sample is difficult to achieve, but you can come close. One of the most difficult steps is obtaining a complete list of every member in the population you want to sample from.

Many times, the telephone book for a city is used. The problem is that the telephone book excludes those who do not have a telephone, those who have unlisted numbers, and, more recently, those who use a cell phone instead of a land line for all their calls.

A second barrier to purely random samples is that some people in the population are difficult or impossible to locate. For example, people who work unusual hours or who travel a lot may be selected for the sample but not be available when you attempt to contact them.

Control Group

Intervention projects or experiments should always compare treatments or interventions rather than try to assess effectiveness in isolation.

Semantics or Science

What would you think if I told you 54% of patients that received a gastric freezing treatment for ulcers improved in condition? Sounds like an impressive treatment.

BUT, what if I also told you 56% of patients with ulcers that received a dummy treatment also improved? The freezing treatment does not seem so impressive anymore.

What Is a Control Group

The group that received the dummy treatment is called a control group, because it allows us to control for the effects of variables that have an influence on the response. In this example, by using a control group, we can see that patients seem to get better about 55% of the time regardless of whether or not they are treated with gastric freezing.

If we had looked at this treatment in isolation, without a control group to compare it to, we would have mistakenly concluded that gastric freezing is an effective treatment.

How to Select Your Treatment and Control Groups

Participants in an intervention or experiment should be randomly assigned to either the treatment or control group. This can be accomplished in a number of ways. Commonly, people use random digit tables or statistical software to randomly assign subjects to the groups. Because random assignment leaves to chance the decision of which group an individual participates in, it keeps the researcher’s preferences from influencing how the groups are formed, and it tends to balance unmeasured factors that affect your outcome between the intervention and control groups.

Example:

Suppose a researcher wishes to assess a nutritional supplement’s claim to help weight lifters gain weight faster than simply consuming the same number of calories.

To test this claim, the researcher decides (s)he will recruit 30 subjects and divide them into 2 groups. The treatment group will receive the nutritional supplement; the control group will receive food with an equal number of calories as the supplement.

To randomize the subjects into the two groups, the researcher assigns a number from 1 to 30 to each subject and then uses software to return 15 distinct random numbers between 1 and 30. The software returns the following numbers:

01 07 05 02 13 10 27 19 06 23 25 09 12 11 04

All the subjects assigned these numbers will be in the treatment group receiving the nutritional supplement. All other subjects will be in the control group.
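
The random assignment in this example can be sketched as follows. This is a minimal illustration rather than the researcher’s actual software: the function name assign_groups and its optional seed parameter are assumptions, and the numbers it produces will generally differ from those listed above.

```python
import random

def assign_groups(subject_ids, seed=None):
    """Randomly split subjects into two equal-sized groups."""
    rng = random.Random(seed)  # pass a seed only if the split must be repeatable
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)  # every ordering of the subjects is equally likely
    half = len(shuffled) // 2
    treatment = sorted(shuffled[:half])
    control = sorted(shuffled[half:])
    return treatment, control

# Subjects are labeled 1-30; 15 go to each group.
treatment, control = assign_groups(range(1, 31))
print("Treatment (supplement):", treatment)
print("Control (equal calories):", control)
```

Shuffling the whole list and cutting it in half guarantees that every subject lands in exactly one group and that the two groups are the same size.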