How to Experiment Well — AP Statistics Study Guide
For: students preparing for the AP Statistics exam.
Covers: core principles of experimental design (control, randomization, replication, blocking), sources of confounding, common experimental designs (completely randomized, block, matched pairs), and scope of inference for experimental studies.
You should already know: difference between an observational study and an experiment, basics of random sampling, how to interpret basic summary statistics for response data.
A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board / Cambridge / IB papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official mark schemes for grading conventions.
1. What Is How to Experiment Well?
"How to Experiment Well" refers to the set of statistical principles for designing studies that can reliably isolate the effect of an explanatory variable (called a factor) on a response variable, and rule out alternative explanations from other variables. Unlike observational studies, where researchers only observe existing groups, experiments impose different treatments on experimental units (called subjects if the units are human) to measure the response. According to the AP Statistics CED, this topic makes up ~40% of the Unit 3 (Collecting Data) exam weight, and accounts for 4-6% of the total AP exam score. It appears in both multiple choice (MCQ) and free response (FRQ) sections: most exams include 2-3 MCQ questions and one multi-part FRQ question on this topic, often asking for design descriptions or scope of inference. The goal of a well-designed experiment is to enable valid causal inference, which cannot be drawn from observational studies. Standard notation: treatments are denoted T1, T2, ..., Tk; the response variable is y; and the experimental units are indexed by i = 1, ..., n.
2. The Three Core Principles of Experimental Design
Any valid experiment relies on three non-negotiable principles: control, randomization, and replication. Each principle addresses a different source of bias or variability that can distort results:
- Control: Control means accounting for the effects of lurking variables (unmeasured variables that could affect the response) by holding conditions constant across groups and including a comparison group. The comparison group (called the control group) receives either no treatment, a placebo (an inert treatment that mimics the experimental treatment to account for the placebo effect), or the existing standard treatment. Control helps prevent confounding, where the effect of a lurking variable is mixed up with the effect of the treatment of interest.
- Randomization: Randomization is the random assignment of experimental units to treatment groups. Randomization ensures that both known and unknown lurking variables are balanced across all treatment groups, on average. This means any difference in average response between groups is far more likely to be caused by the treatment than by pre-existing differences between groups.
- Replication: Replication means assigning each treatment to multiple independent experimental units, rather than just one unit per treatment. More replication reduces sampling variability, making it easier to detect a real treatment effect if one exists. Replication does not mean repeating the entire experiment (that is separate confirmation); it refers to multiple units per treatment within a single experiment.
Worked Example
A researcher wants to test whether a new over-the-counter sleep aid reduces the time it takes to fall asleep compared to a current popular brand. She recruits 60 adult volunteers who report occasional insomnia. Describe how she would implement all three core principles in this experiment.
- Control: She will include a control comparison, testing the new sleep aid against the existing popular brand (the standard treatment). This allows her to compare the effect of the new aid to the existing product, controlling for the fact that any sleep aid will have some effect due to the placebo effect.
- Randomization: She will label each volunteer 1 to 60, then use a random number generator to select 30 unique numbers. Volunteers with these numbers are assigned to the new sleep aid treatment, and the remaining 30 are assigned to the existing brand. Every volunteer is equally likely to be in either group, which balances out pre-existing differences (like baseline insomnia severity) across groups.
- Replication: She assigns 30 volunteers to each treatment, rather than testing each treatment on only one person. Multiple volunteers per treatment allow her to account for natural variability in time to fall asleep between people, reducing the chance that an observed difference is due to random chance rather than the treatment itself.
Exam tip: On AP FRQs asking you to describe randomization, you must explicitly mention how you assign units to treatments (e.g., "label units, use a random number generator to select groups") — vague statements like "randomly split into groups" almost always lose points.
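The label-and-select procedure from the worked example can be sketched in a few lines of Python (purely illustrative; Python is not required on the AP exam, and the function name is our own):

```python
import random

def completely_randomized_assignment(n_units, seed=None):
    """Split units labeled 1..n_units into two equal treatment groups at random."""
    rng = random.Random(seed)                  # seeded generator so the split is reproducible
    labels = list(range(1, n_units + 1))       # label each volunteer 1, 2, ..., n
    chosen = set(rng.sample(labels, n_units // 2))  # draw half the labels without replacement
    new_aid = sorted(chosen)                   # these volunteers get the new sleep aid
    current_brand = [u for u in labels if u not in chosen]  # the rest get the current brand
    return new_aid, current_brand

new_aid, current_brand = completely_randomized_assignment(60, seed=2024)
print(len(new_aid), len(current_brand))  # 30 30
```

Every volunteer is equally likely to land in either group, which is exactly the balancing property randomization is meant to provide.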
3. Common Experimental Designs
After applying the three core principles, experiments are structured into one of three common designs, depending on whether researchers know of any additional sources of variability that need to be accounted for:
- Completely Randomized Design (CRD): This is the simplest design, where all experimental units are randomly assigned to treatment groups with no pre-grouping. This design is used when there are no known systematic differences between units that would affect the response.
- Randomized Block Design (RBD): If researchers know of a nuisance variable (a variable that is not the treatment of interest but will affect the response), they group units into blocks where all units in a block are similar with respect to the nuisance variable. Random assignment of treatments is then done within each block. Blocking reduces variability from the nuisance variable, making it easier to detect a real treatment effect.
- Matched Pairs Design: This is a special case of randomized block design where each block has exactly two units that are matched on similar characteristics (e.g., identical twins, two plots of land with similar soil). One unit in each block gets treatment 1, the other gets treatment 2. Alternatively, each unit can get both treatments, in random order (called repeated measures matched pairs), and each unit acts as its own block.
Worked Example
A researcher wants to test three different doses of a new allergy medication: 10mg, 20mg, and 30mg. She knows that allergy symptoms are much more severe for people with pet allergies than for people with only seasonal allergies. Should she use a completely randomized design or a randomized block design? Describe the appropriate design.
- Since baseline allergy type (pet vs seasonal) is a known nuisance variable that affects symptom severity, a randomized block design is the appropriate choice, not completely randomized. Blocking will remove the variability from allergy type from the treatment comparison, making it easier to detect a dose effect.
- First, form two blocks: Block 1 contains all participants with pet allergies, Block 2 contains all participants with only seasonal allergies.
- Within each block, randomly assign each participant to one of the three dose groups, so each dose gets an equal number of participants in each block.
- After four weeks of treatment, measure average allergy symptom severity for each dose group, and compare results across doses, accounting for block differences.
Exam tip: Always remember: blocking is done before random assignment, and you randomize within blocks. Reversing this order (randomizing first, then blocking) is a common point deduction on AP exams.
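The block-first, randomize-within order from the exam tip can be sketched in Python (an illustrative helper of our own; the block and dose labels match the allergy example, with an assumed 12 participants per block):

```python
import random

def randomize_within_blocks(blocks, treatments, seed=None):
    """Blocking first, then random assignment of treatments within each block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = list(units)        # copy so the original block list is untouched
        rng.shuffle(shuffled)
        # deal shuffled units to treatments round-robin so group sizes stay balanced
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block_name, treatments[i % len(treatments)])
    return assignment

blocks = {
    "pet": [f"P{i}" for i in range(1, 13)],        # 12 participants with pet allergies
    "seasonal": [f"S{i}" for i in range(1, 13)],   # 12 with only seasonal allergies
}
assignment = randomize_within_blocks(blocks, ["10mg", "20mg", "30mg"], seed=7)
```

Each (block, dose) cell ends up with 4 participants, so the dose comparison is balanced within each allergy type.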
4. Scope of Inference
A key skill tested on the AP exam is identifying what type of conclusions are valid from an experiment, based on how it was designed. There are two separate questions to answer for any study:
- Can we conclude that the treatment caused the difference in response? Causal conclusions are only allowed if treatments were randomly assigned to experimental units. Random assignment balances out lurking variables, so any difference in response can be attributed to the treatment. Without random assignment, you can only conclude association, not causation.
- Can we generalize the results to a larger population of interest? Generalization is only allowed if the experimental units were randomly sampled from the broader population of interest. If researchers use a convenience sample (like volunteer college students from a psychology class), you cannot generalize the results beyond that group.

A reminder on confounding: confounding occurs when the effect of the treatment is mixed up with the effect of a lurking variable. Even a study labeled an experiment can suffer confounding if the design is poor, and confounding blocks causal conclusions.
Worked Example
A college professor wants to test whether standing during lectures improves student test scores. He teaches two sections of introductory statistics: one section meets at 8am and the other at 1pm. He has the 8am section stand during lectures, and the 1pm section sit. He finds that the 8am section scores 7% higher on the final exam than the 1pm section. Can he conclude that standing caused the higher score? Name a possible confounding variable, and state what inference is valid.
- No, the professor cannot conclude that standing caused the higher score, because there was no random assignment of treatments: sections were assigned to treatments based on their meeting time, not randomization.
- A possible confounding variable is the selection of students who sign up for 8am vs 1pm classes: students who choose 8am classes are often more motivated and academically prepared than students who choose 1pm classes, so motivation is confounded with standing. We cannot tell if the higher score comes from standing or from pre-existing differences in student motivation.
- Because the professor used a convenience sample of his own students at his college, he also cannot generalize these results to all college students. The only valid inference is that there is an association between standing in lectures and higher test scores in this specific group.
Exam tip: When AP questions ask "what conclusion can you draw", always answer both parts: whether you can conclude causation, and whether you can generalize. Most students forget one part and lose points.
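The two scope-of-inference questions amount to a two-by-two lookup, which can be captured in a tiny Python helper (our own illustration, not official AP terminology):

```python
def scope_of_inference(random_assignment: bool, random_sample: bool) -> str:
    """Map a study's design features to the strongest valid conclusion."""
    causal = "causation" if random_assignment else "association only"
    scope = ("the population sampled from" if random_sample
             else "the units in the study only")
    return f"{causal}; generalizes to {scope}"

# The standing-lectures study: no random assignment, convenience sample
print(scope_of_inference(False, False))
# A study with random assignment but a convenience sample of units
print(scope_of_inference(True, False))
```

Answering both halves every time is exactly the habit the exam tip above recommends.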
5. Common Pitfalls (and how to avoid them)
- Wrong move: Stating that "random sampling of units is required for a randomized experiment". Why: Students confuse random sampling (for generalization) with random assignment (for causation, the core requirement of an experiment). Correct move: Explicitly separate the two: random assignment of treatments is required for a good experiment; random sampling of units is only required if you want to generalize results to a larger population.
- Wrong move: Defining the control group as "the group that gets no treatment". Why: Students memorize a simplified definition and miss that control groups often get a standard existing treatment or a placebo, not no treatment at all. Correct move: Define the control group as "the comparison group for treatments; it may get a placebo, no treatment, or the existing standard treatment".
- Wrong move: Confusing blocking with an additional treatment variable. Why: Students think blocks are just another factor to test, when blocks are nuisance variables we group by to reduce unwanted variability, not variables of interest. Correct move: If you are not testing the effect of the variable, and you are grouping to reduce variability, it is a blocking variable, not a treatment factor.
- Wrong move: Claiming replication means repeating the entire experiment multiple times. Why: Students confuse post-experiment confirmation with replication within the original experiment. Correct move: Remember that replication in experimental design means having multiple independent experimental units assigned to each treatment within your experiment.
- Wrong move: Forgetting to randomize treatment order in a repeated measures matched pairs design. Why: Students think matching removes the need for randomization, but order effects can confound results. Correct move: When each unit gets both treatments, randomly assign half the units to get treatment A first then B, and half to get B first then A to control for order effects.
- Wrong move: Claiming causal conclusions from any experiment that uses convenience assignment of treatments. Why: Students assume any study called an experiment can support causation, but only random assignment enables causal inference. Correct move: Before claiming causation, always verify that treatments were randomly assigned to units; if not, no causal conclusion is allowed.
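The order-randomization fix in the matched pairs pitfall above can be sketched as follows (illustrative Python; the function name and "A"/"B" treatment labels are our own):

```python
import random

def randomize_treatment_order(units, seed=None):
    """Give half the units treatment A first, the other half B first."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    order = {}
    for i, unit in enumerate(shuffled):
        # first half of the shuffled list: A then B; second half: B then A
        order[unit] = ("A", "B") if i < half else ("B", "A")
    return order

orders = randomize_treatment_order(range(1, 21), seed=3)  # 20 subjects
```

If an order effect exists (say, everyone responds better to whichever treatment comes second), it now averages out across the two order groups instead of confounding the A-vs-B comparison.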
6. Practice Questions (AP Statistics Style)
Question 1 (Multiple Choice)
A botanist wants to study the effect of increased carbon dioxide levels on plant growth. She has 60 tomato seeds, and randomly assigns 30 seeds to grow in a high CO2 environment and 30 to grow in normal CO2 levels. She uses a convenience sample of tomato seeds from her university's greenhouse. What is the scope of inference for this study? A) We can conclude that increased CO2 causes increased growth, and can generalize the results to all tomato plants. B) We can conclude that increased CO2 causes increased growth, but cannot generalize to all tomato plants. C) We cannot conclude that increased CO2 causes increased growth, but can generalize to all tomato plants. D) We cannot conclude that increased CO2 causes increased growth, and cannot generalize to all tomato plants.
Worked Solution: First, recall the two rules for scope of inference: causation requires random assignment of treatments, and generalization requires random sampling from the population of interest. In this study, the botanist randomly assigned the CO2 treatments to seeds, so random assignment is present, which allows us to conclude causation. The seeds are a convenience sample from a university greenhouse, not a random sample from the entire population of all tomato plants, so we cannot generalize. This matches option B. Correct answer: B.
Question 2 (Free Response)
A bakery wants to test whether a new type of yeast makes bread rise higher than their current yeast. They have 24 identical loaf pans to bake bread in, and want to compare the two yeasts. (a) Describe how to implement a completely randomized design for this experiment. (b) Explain why the bakery might instead choose a randomized block design if they bake bread in three different ovens, each of which can bake 8 loaves at a time. Describe the blocked design. (c) The bakery finds that the new yeast gives a 15% higher average rise. A food blogger writes "this new yeast works better for all home bakers". What mistake did the blogger make, if the bakery used the design from part (a) with their professional bakery equipment?
Worked Solution: (a) 1. Label each of the 24 loaf pans with a unique number from 1 to 24. 2. Use a random number generator to select 12 unique numbers; the loaves baked in these pans are mixed with the new yeast. 3. The remaining 12 loaves are mixed with the current yeast. 4. Bake all loaves under identical conditions, measure the height of each loaf, and compare the average height between the two groups. (b) Oven temperature varies between ovens, which is a known nuisance variable that affects how much bread rises. A randomized block design (blocking by oven) controls for this variability; note that with 8 loaves per oven this is a block design rather than a true matched pairs design, which would require exactly two units per block. To implement: each oven is a block that bakes 8 loaves, so within each oven, randomly assign 4 loaves to the new yeast and 4 loaves to the current yeast. This ensures that the effect of oven temperature is balanced across both yeast groups, reducing variability. (c) The study used professional bakery equipment, and the loaves were a convenience sample from this bakery, not a random sample representative of bread baked by home bakers. Therefore, the blogger cannot generalize the results to all home bakers, so the claim is unsupported. We can only conclude causation for the loaves in this study, not for all home baking.
Question 3 (Application / Real-World Style)
A bicycle manufacturer wants to test whether a new carbon fiber frame reduces the weight of a road bicycle compared to their existing aluminum frame. The manufacturer has 10 frames of each material, and knows that frame size (small vs large) affects weight. There are 5 small aluminum frames, 5 large aluminum frames, 5 small carbon frames, and 5 large carbon frames. Name the best experimental design for this context, justify your choice, and describe how to implement it to test for the effect of frame material on weight.
Worked Solution: The best design is a randomized block design, blocking by frame size. Frame size is a known nuisance variable that directly affects weight: large frames are heavier than small frames regardless of material. Blocking by size removes the variability from frame size from the comparison of materials, making it easier to detect a weight difference between carbon and aluminum. To implement: 1. Form two blocks: Block 1 = all small frames (5 aluminum, 5 carbon), Block 2 = all large frames (5 aluminum, 5 carbon). 2. Weigh the frames in random order within each block, so that any drift in the measurement process does not systematically favor one material. 3. Compare the average weight of carbon vs aluminum frames within each block, then combine the results to get an overall comparison of frame material, accounting for size. In context: this design removes the confounding effect of frame size, so any difference in average weight can be attributed to the frame material, not the mix of sizes.
7. Quick Reference Cheatsheet
| Category | Formula | Notes |
|---|---|---|
| Core Principle: Control | No formula | Provides a comparison group for treatments; controls for lurking variables and placebo effects. |
| Core Principle: Randomization | No formula | Random assignment of units to treatments; balances lurking variables across groups; required for causal inference. |
| Core Principle: Replication | No formula | Multiple units per treatment; reduces sampling variability to make it easier to detect treatment effects. |
| Completely Randomized Design | No formula | All units randomly assigned to treatments; used when no known nuisance variability between units. |
| Randomized Block Design | No formula | Units grouped into blocks by known nuisance variable; randomize within blocks; reduces unwanted variability. |
| Matched Pairs Design | No formula | Special case of RBD with 2 units per block, or one unit getting both treatments; used for paired similar units. |
| Causal Inference | No formula | Allowed if and only if treatments are randomly assigned to units. |
| Generalization to Population | No formula | Allowed if and only if units are randomly sampled from the population of interest. |
8. What's Next
This chapter gives you the foundation for all statistical inference about causation, the core goal of most applied statistical studies. Next you will apply these design principles to inference for experiments, specifically hypothesis tests for differences in response between treatment groups, covered in Units 5, 6, and 7 of the AP Statistics CED. Without mastering the principles of good experimental design, you will not be able to correctly identify whether a causal conclusion is appropriate, a commonly tested skill on both MCQ and FRQ questions for inference. This topic also connects to the broader skill of critiquing statistical studies: you will need to spot confounding and poor design in any real-world data analysis.
Related topics:
- Observational Studies vs Experiments
- Inference for Two Means
- Experimental Design Critique