AP · Random Sampling and Data Collection · 14 min read · Updated 2026-05-10

Random Sampling and Data Collection — AP Statistics Study Guide

For: Students preparing for the AP Statistics exam.

Covers: All key sampling methods including simple random, stratified, cluster, systematic, and convenience sampling, plus common biases (selection, response, nonresponse) and the distinction between samples and censuses for data collection.

You should already know: Basic difference between populations and samples, basic probability of random events, descriptive statistics for summarizing data.

A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board / Cambridge / IB papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official mark schemes for grading conventions.

1. What Is Random Sampling and Data Collection?

Random sampling is the process of using a chance mechanism to select a subset of observational units from a larger, defined population of interest, with the core goal of producing a representative subset that supports valid statistical inferences about the whole population. By contrast, a census collects data from every unit in the population, which is rarely feasible for large populations because of cost, time constraints, or destructive testing (e.g., testing battery life destroys the product). According to the AP Statistics Course and Exam Description (CED), the full Collecting Data unit accounts for 12-15% of the overall AP exam score, with random sampling and data collection making up roughly half of that unit. This topic appears on both the multiple-choice (MCQ) and free-response (FRQ) sections of the exam: you can expect 2-3 standalone multiple-choice questions on sampling methods and bias, and often a short part of a multi-part FRQ that asks you to compare methods or identify sources of error. Standard notation defines $N$ as the total number of units in the population and $n$ as the number of units selected for the sample. Synonyms for random sampling include probability sampling and random selection.

2. Common Probability Sampling Methods

Probability sampling methods are sampling methods where every unit in the population has a known, non-zero probability of being selected into the sample. This is required to avoid systematic bias and allow valid statistical inference about the population. The four most common probability sampling methods tested on AP are:

Simple Random Sampling (SRS): Every possible sample of size $n$ has an equal chance of being selected. This is typically implemented by assigning a unique number to every population unit, then using a random number generator or random number table to select $n$ unique numbers.
Stratified Random Sampling: The population is divided into non-overlapping groups called strata, where units within a stratum are similar on a variable expected to affect the response of interest. An SRS is then taken from each stratum. This method reduces sampling error by ensuring representation from every key subgroup.
Cluster Sampling: The population is divided into non-overlapping groups called clusters, each of which is ideally representative of the whole population. Some clusters are randomly selected, then all units within the selected clusters are sampled. This is used for cost and logistical efficiency when populations are geographically spread out.
Systematic Random Sampling: Every $k$th unit is selected from a list of the population, where $k = N / n$ rounded to the nearest integer. A random starting point between 1 and $k$ is selected first, then every $k$th unit after that is added to the sample. This is simpler than SRS when a sequential list of the population exists.
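The four methods above can be sketched in Python using only the standard library's random module. This is a minimal illustration, not an official AP procedure: the function names, the roster of 1,200 ID numbers, the four equal-sized grades, and the 40 homerooms of 30 students are all assumptions made up for the sketch.

```python
import random

def simple_random_sample(units, n, rng):
    """SRS: every possible group of n distinct units is equally likely."""
    return rng.sample(units, n)

def stratified_sample(strata, sizes, rng):
    """Take a separate SRS of the requested size from every stratum."""
    return {name: rng.sample(members, sizes[name])
            for name, members in strata.items()}

def cluster_sample(clusters, num_clusters, rng):
    """Randomly pick whole clusters, then keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    return [unit for name in chosen for unit in clusters[name]]

def systematic_sample(units, n, rng):
    """Random start among the first k units, then every k-th unit after it."""
    k = round(len(units) / n)
    start = rng.randrange(k)  # 0-based index, i.e. list position 1..k
    return units[start::k]

rng = random.Random(2026)          # fixed seed so the run is repeatable
students = list(range(1, 1201))    # hypothetical ID numbers 1..1200

# Hypothetical strata: four grades of 300 students each.
grades = {g: students[i * 300:(i + 1) * 300]
          for i, g in enumerate(["9", "10", "11", "12"])}
# Hypothetical clusters: 40 homerooms of 30 students each.
homerooms = {h: students[h * 30:(h + 1) * 30] for h in range(40)}

srs = simple_random_sample(students, 100, rng)
stratified = stratified_sample(grades, {g: 25 for g in grades}, rng)
clustered = cluster_sample(homerooms, 3, rng)
systematic = systematic_sample(students, 100, rng)
```

Note the structural difference the code makes visible: the stratified function draws from every group, while the cluster function draws whole groups and ignores the rest. When $N / n$ is not a whole number, rounding $k$ can leave the systematic sample with slightly more or fewer than $n$ units.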

Worked Example

A high school principal wants to sample 100 students from the school's 1200 total students to assess student satisfaction with the cafeteria. The principal wants to ensure that freshmen, sophomores, juniors, and seniors are all proportionally represented in the sample. There are 300 freshmen, 320 sophomores, 290 juniors, and 290 seniors. What sampling method should the principal use, and how would they implement it?

Step 1: Match the method to the goal. The principal needs guaranteed representation from each grade level, which is a variable that likely affects cafeteria satisfaction. This calls for stratified random sampling.
Step 2: Define the strata. The four grade levels are the four non-overlapping strata: every student belongs to exactly one grade.
Step 3: Calculate proportional sample sizes per stratum. The sampling fraction is $n / N = 100/1200 = 1/12$. We calculate: Freshmen = $300 \times (1/12) = 25$, Sophomores = $320 \times (1/12) \approx 26.7$ (rounded to 27), Juniors = $290 \times (1/12) \approx 24.2$ (rounded to 24), Seniors = $290 \times (1/12) \approx 24.2$ (rounded to 24). The rounded counts add to $25 + 27 + 24 + 24 = 100$, matching the desired sample size.
Step 4: Implement sampling. Assign each student within each grade a unique number, then use a random number generator to select the calculated number of students from each grade to participate.
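The allocation arithmetic in Step 3 can be checked with a few lines of Python. This is a sketch only: the function name proportional_allocation is made up for illustration, and simple rounding happens to sum to 100 for these grade sizes but is not guaranteed to do so for other populations.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate n sample slots to strata in proportion to their sizes."""
    N = sum(stratum_sizes.values())
    # Plain rounding; for other inputs the rounded counts may not sum to n
    # and would need a manual adjustment.
    return {name: round(n * size / N) for name, size in stratum_sizes.items()}

grades = {"Freshmen": 300, "Sophomores": 320, "Juniors": 290, "Seniors": 290}
allocation = proportional_allocation(grades, 100)

print(allocation)
# {'Freshmen': 25, 'Sophomores': 27, 'Juniors': 24, 'Seniors': 24}
print(sum(allocation.values()))  # 100
```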

Exam tip: If an AP question asks which method is most appropriate, always match the method to the stated goal: if the goal is to ensure subgroup representation, it is stratified; if the goal is cost/logistical efficiency with representative groups, it is cluster. Don’t confuse the two.

3. Non-Probability Sampling and Common Sampling Biases

Non-probability sampling methods do not assign known non-zero selection probabilities to all population units, so they almost always produce biased results. The most common non-probability method is convenience sampling, which selects easily accessible units. Bias is a systematic error in sampling that causes sample results to consistently differ from the true population value in a specific direction, unlike random sampling error (unavoidable random variation between samples that decreases with larger sample size). AP expects you to identify three core types of bias:

Selection (Undercoverage) Bias: Occurs when some groups in the population are systematically excluded from the sampling frame (the list of units available for selection), so they have no chance of being selected.
Nonresponse Bias: Occurs when selected units refuse to participate or cannot be contacted, and nonrespondents differ systematically from respondents on the variable of interest.
Response Bias: Occurs when participants give inaccurate responses, usually due to social desirability bias, leading question wording, or recall error.

Worked Example

A campus radio station wants to survey students about whether they support a 10% increase in student activity fees to fund station upgrades. The station posts a link to the poll on its website, asking listeners to click to vote. Identify what type(s) of bias are most likely present, and explain their effect.

Step 1: First, selection bias is present: only students who visit the station's website, who are mostly listeners, will see the poll, so students who never listen are effectively excluded from the sampling frame. Non-listeners are likely to be less supportive of the fee increase, so excluding them removes likely opponents from the sample.
Step 2: Voluntary response bias is also present. Because participants select themselves rather than being chosen at random, students who care strongly about the issue (most often supporters of the station who want upgrades) are much more likely to take the time to vote than neutral or opposing students.
Step 3: The net effect is that the poll will systematically overestimate the level of support for the fee increase compared to the true value for the entire student population.

Exam tip: AP FRQs require you to explain bias in context, not just name it. Always add one sentence explaining whether the sample estimate will be too high or too low relative to the true population value to earn full credit.

4. Key Comparisons of Sampling Methods

AP Statistics frequently asks students to distinguish between similar sampling methods, most commonly stratified vs. cluster sampling, which are often confused because both divide the population into non-overlapping groups. The core difference lies in how groups are constructed and sampled:

For stratified sampling: Strata are constructed so units within a stratum are similar on the variable of interest, and units between strata are different. You sample from every stratum. The purpose is to reduce sampling variability and guarantee subgroup representation.
For cluster sampling: Clusters are constructed so each cluster is representative of the whole population (heterogeneous within clusters, similar across clusters). You only sample from randomly selected clusters, not all clusters. The purpose is to reduce logistical cost and effort.

Another common comparison is SRS vs. systematic sampling: systematic sampling is simpler to implement, but will produce bias if there is a repeating periodic pattern in the population list that aligns with the sampling interval $k$.
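The periodic-pattern problem can be seen in a small Python sketch. The data here are invented for illustration: a hypothetical ledger of 700 consecutive days in which every seventh entry is 200 and all others are 100. The sketch computes the sample mean for every possible starting point, once with an interval that matches the weekly cycle and once with an interval that does not.

```python
# Hypothetical sales ledger: 700 consecutive days, with one day per week
# at 200 and all other days at 100. Values are invented for illustration.
sales = [200 if day % 7 == 5 else 100 for day in range(700)]
true_mean = sum(sales) / len(sales)

def systematic_means(values, k):
    """Mean of the systematic sample for every possible starting point."""
    return [sum(values[start::k]) / len(values[start::k])
            for start in range(k)]

aligned = systematic_means(sales, 7)     # interval matches the weekly cycle
unaligned = systematic_means(sales, 10)  # interval does not match the cycle

print(f"true mean: {true_mean:.1f}")
print("k = 7 :", sorted(set(aligned)))
print("k = 10:", sorted({round(m, 1) for m in unaligned}))
```

With $k = 7$ every sample contains the same day of the week, so each estimate is either 100 or 200 and never near the true mean of about 114.3. With $k = 10$ each sample cycles through all seven weekdays equally, and every starting point recovers the true mean.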

Worked Example

A researcher wants to estimate the average weight of apples produced in a 100-acre orchard. The orchard is divided into 100 1-acre plots, each of which has a mix of all apple varieties grown on the farm. The researcher does not have time to visit every plot, so they randomly select 10 plots and weigh all apples on those 10 plots. Is this stratified or cluster sampling? Justify your answer.

Step 1: Check how groups are constructed: each 1-acre plot has a mix of all apple varieties, so each plot is representative of the whole orchard (diverse within plots, similar across plots).
Step 2: Check how sampling is done: the researcher only selects 10 of the 100 plots, and does not sample from the other 90 plots. No sampling occurs in the majority of groups.
Step 3: Match to definitions: this matches cluster sampling, where you sample all units from selected representative clusters. If this were stratified sampling, strata would be the apple varieties, and you would sample from every variety to ensure representation.
Step 4: The method is chosen for logistical convenience, which aligns with the core purpose of cluster sampling.

Exam tip: To tell stratified and cluster apart on the exam, ask one simple question: Do we sample from every group? If yes: stratified. If no, and we only sample selected groups: cluster. This rule of thumb settles the large majority of AP questions; when in doubt, also check whether the groups are similar or diverse internally.

5. Common Pitfalls (and how to avoid them)

Wrong move: Confusing stratified sampling and cluster sampling by mislabeling cluster sampling as stratified. Why: Both divide the population into groups, so students mix up names and purposes. Correct move: Always run the "do we sample from every group?" check and confirm whether groups are similar or diverse internally before labeling.
Wrong move: Claiming that a large sample size eliminates bias. Why: Students assume "big sample = good" so any error goes away. Correct move: Remember that increasing sample size only reduces random sampling error, it does not fix systematic bias; a large biased sample is still biased.
Wrong move: Identifying bias but not explaining the direction of the error in context. Why: Students memorize the name of the bias but forget that AP requires context for full credit. Correct move: After naming the bias, add one sentence explaining whether the estimate will be too high or too low relative to the true population value.
Wrong move: Calling a voluntary response online poll a simple random sample. Why: Students think any poll that reaches many participants counts as an SRS, but in voluntary response, participants select themselves, so no chance mechanism controls who ends up in the sample. Correct move: Recognize that a voluntary response poll is a non-probability sample built on self-selection, so it is not an SRS or any other random sampling method.
Wrong move: Confusing bias with random sampling error. Why: Students think any error in sampling is bias. Correct move: Remember that bias is a systematic error in one direction, while sampling error is random variation between samples that is always present, even in well-designed random samples.
Wrong move: Assuming systematic sampling is always biased. Why: Students remember the caveat about periodic patterns and assume it is never valid. Correct move: State that systematic sampling is a valid probability sampling method as long as there is no repeating pattern in the population list that aligns with the sampling interval $k$.
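The pitfall about large samples can be illustrated with a short simulation. This is a sketch under invented numbers: a hypothetical campus of 10,000 students where 3,000 listen to the radio station (80% of them support the fee) and 7,000 do not listen (20% of them support it). It compares a small SRS of the whole campus against a much larger sample drawn only from listeners.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical campus (all counts are invented): 1 = supports the fee.
listeners = [1] * 2400 + [0] * 600        # 3,000 listeners, 80% support
non_listeners = [1] * 1400 + [0] * 5600   # 7,000 non-listeners, 20% support
population = listeners + non_listeners
true_support = sum(population) / len(population)  # 0.38

def proportion(sample):
    """Share of sampled students who support the fee."""
    return sum(sample) / len(sample)

small_srs = proportion(rng.sample(population, 100))      # unbiased, noisy
large_biased = proportion(rng.sample(listeners, 2000))   # precise, biased

print(f"true support:              {true_support:.2f}")
print(f"SRS of 100 students:       {small_srs:.2f}")
print(f"2,000 listeners only:      {large_biased:.2f}")
```

The large listener-only sample lands close to 0.80 no matter how many students it includes, because it is estimating support among listeners, not among all students. The small SRS wobbles from run to run but is centered on the true value.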

6. Practice Questions (AP Statistics Style)

Question 1 (Multiple Choice)

A city planner wants to estimate the average monthly household water use in a city with neighborhoods ranging from low-income to high-income. High-income households consistently use more water on average than low-income households. The planner wants to ensure the sample accurately represents all income groups. Which of the following sampling methods is most appropriate? A) Simple random sampling of all households in the city B) Stratified random sampling with income groups as strata C) Cluster sampling with neighborhoods as clusters D) Systematic sampling of every 100th household from the city's utility list

Worked Solution: The goal of the sampling is to guarantee representation of all income groups, since water use varies systematically by income. Simple random sampling (A) can leave out one or more income groups by random chance, leading to higher sampling error. Cluster sampling with neighborhoods as clusters (C) is used for logistical convenience, not to ensure subgroup representation. Systematic sampling (D) does nothing to ensure all income groups are represented. Only stratified random sampling with income strata requires sampling from each income group, guaranteeing representation and reducing sampling error. The correct answer is B.
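The claim that stratifying by income reduces sampling error can be checked by simulation. The sketch below uses made-up data: a hypothetical city of 500 low-income and 500 high-income households with invented water-use values. It draws many samples of 20 households each way and compares how much the sample means vary.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical city: monthly water use in thousands of gallons (invented).
low = [rng.uniform(8, 12) for _ in range(500)]     # low-income households
high = [rng.uniform(18, 22) for _ in range(500)]   # high-income households
population = low + high

def srs_mean(n):
    """Mean of one simple random sample of n households."""
    return statistics.mean(rng.sample(population, n))

def stratified_mean(n):
    """Mean of one stratified sample; the strata are equal in size,
    so half the sample comes from each income group."""
    half = n // 2
    return statistics.mean(rng.sample(low, half) + rng.sample(high, half))

srs_means = [srs_mean(20) for _ in range(2000)]
strat_means = [stratified_mean(20) for _ in range(2000)]

srs_spread = statistics.stdev(srs_means)
strat_spread = statistics.stdev(strat_means)
print(f"SD of SRS estimates:        {srs_spread:.2f}")
print(f"SD of stratified estimates: {strat_spread:.2f}")
```

Both methods are unbiased here, but the stratified estimates cluster much more tightly around the true mean because no sample can over- or under-represent an income group by chance.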

Question 2 (Free Response)

A movie theater wants to survey its customers to find out what percentage of customers want the theater to add more vegan snack options. The manager decides to survey the first 50 customers who exit a 7pm showing of a popular action movie on a Saturday night. (a) What type of sampling method is the manager using? (b) Identify one type of bias that is present in this sample, and explain how this bias will affect the results. (c) Suggest a better random sampling method that the theater could use to get a representative sample of all its customers in a typical week.

Worked Solution: (a) The manager is using convenience sampling: they select easily accessible customers (the first 50 exiting one showing) rather than using a random method to select from all customers who visit the theater in a week. (b) A clear source of bias is selection (undercoverage) bias. The sample only includes customers who attended this specific movie and time, so customers who go to other movies (e.g., children’s films, indie dramas) at other times (weekday matinees) are excluded from the sampling frame. Audiences for action movies likely have a lower share of vegan customers than the theater’s full customer base, so the survey will underestimate the true percentage of all customers who want more vegan snack options. (c) A better method is stratified random sampling. Stratify the theater’s opening times by four groups: weekday matinee, weekday evening, weekend matinee, weekend evening. Assign every customer who visits the theater in the week a unique number, then randomly sample the desired number of customers from each stratum proportional to the number of customers that visit each time slot. This ensures all customer groups are represented.

Question 3 (Application / Real-World Style)

A quality control engineer works at a factory that produces 5000 car tires per day across 10 separate assembly lines. Each assembly line produces the same mix of tire models and operates the same production process all day. The engineer has time to test 100 tires total for defects. Recommend a sampling method for this scenario, explain how to implement it, and explain why it is the best choice.

Worked Solution: This scenario calls for cluster sampling. First, define each assembly line as a cluster: since each line produces the same mix of tires and uses the same process, each cluster is representative of the entire factory’s production. Next, randomly select 2 of the 10 assembly lines, then test all tires produced by those two lines in the last hour of production. At 5000 tires per day across 10 lines, each line produces 500 tires per day; assuming a 10-hour production day, that is about 50 tires per line per hour, so two lines give about 100 tires, matching the sample size the engineer can test. This method is the best choice because it is logistically far simpler to pull all tires from a couple of lines for testing than to collect individual tires from all 10 lines, and it still produces a representative sample. In context, this gives the engineer a valid sample of about 100 tires to estimate the overall defect rate within their time constraint.

7. Quick Reference Cheatsheet

Category	Formula / Definition	Notes
Population/sample size notation	$N =$ population size, $n =$ sample size	Standard notation used on all AP problems
Systematic sampling interval	$k = \operatorname{round}\left(\frac{N}{n}\right)$	Randomly select a starting point between 1 and $k$, then select every $k$th unit
Proportional stratified sample size per stratum	$n_i = n \times \frac{N_i}{N}$	$N_i$ = size of stratum $i$; allocation matches population stratum proportions
Simple Random Sampling (SRS)	All samples of size $n$ are equally likely to be selected	Gold standard for random sampling, implemented with random number generators
Stratified Random Sampling	Sample from every stratum; strata are similar internally	Used to guarantee subgroup representation, reduces sampling variability
Cluster Sampling	Sample all units from randomly selected clusters; clusters are representative internally	Used for logistical/cost efficiency; no need to sample from every group
Effect of increasing sample size	Reduces random sampling error	Does NOT fix systematic bias
Core bias types	1. Selection (undercoverage): some groups excluded 2. Nonresponse: nonparticipants differ systematically 3. Response: inaccurate answers from participants	Always explain direction of bias in context for FRQ credit

8. What's Next

Random sampling is the foundational prerequisite for designing studies and making valid statistical inferences about populations. Immediately after this topic in the AP Statistics syllabus, you will learn to distinguish between observational studies and experiments, and how to design randomized controlled experiments to test causal claims. Without understanding how random selection works and how to identify bias in data collection, you cannot properly evaluate the conclusions of any study or experiment, a skill that is regularly tested on multi-part FRQs. Across the rest of the course, random sampling is the basis for sampling distributions of sample means and proportions, which are the foundation for confidence intervals and hypothesis testing: if your sample is biased, any inference you make from it will be fundamentally flawed.
