Exploring Two-Variable Data — AP Statistics Stats Study Guide
For: students preparing for the AP Statistics exam.
Covers: Two-way tables and conditional distributions, scatterplots and correlation, least-squares regression lines, residuals and influential points, and cautions about extrapolation per the AP Statistics Course and Exam Description.
You should already know: Algebra 2, basic probability intuition.
A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board papers and may differ in wording, numerical values, or context. Use them to practice the technique; cross-check with official College Board scoring guidelines for grading conventions.
1. What Is Exploring Two-Variable Data?
Exploring two-variable data is the practice of analyzing relationships between two paired variables (either categorical or quantitative) to identify patterns, measure association strength, and build predictive models. This topic makes up 12-15% of your AP Statistics exam score, and forms the foundation for all inferential work with paired data later in the syllabus. Unlike single-variable analysis, which focuses on distribution features like center and spread, two-variable analysis prioritizes identifying connections between variables, and distinguishing between correlation and causal relationships.
2. Two-way tables and conditional distributions
Two-way tables (also called contingency tables) organize frequency data for two categorical variables: one variable defines the table rows, the other defines the columns, and each cell holds the count of observations that fall into both corresponding categories.
Key Definitions
- Joint frequency: The count of observations in a single cell of the table, representing membership in both a row and column category.
- Marginal distribution: The total frequency of each category for one variable, ignoring the other variable, found in the "total" row or column of the table.
- Conditional distribution: The frequency of one variable given a fixed value of the other variable, calculated by dividing the joint frequency of the relevant cell by the marginal frequency of the condition. If conditional distributions are identical across all categories of the conditioning variable, the two variables are independent; if not, they are associated.
Worked Example
A survey of 200 high school students tracks gender (male/female) and part-time job status (yes/no):
| | Has Part-Time Job | No Part-Time Job | Total |
|---|---|---|---|
| Male | 42 | 58 | 100 |
| Female | 51 | 49 | 100 |
| Total | 93 | 107 | 200 |

The conditional distribution of having a part-time job given the student is female is 51% with a job, 49% without. For male students, it is 42% with a job, 58% without. Since these conditional distributions differ, gender and part-time job status are associated in this sample.
Exam tip: Examiners often ask you to calculate conditional percentages, so always explicitly state the condition first to avoid mixing up which variable you are conditioning on.
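The conditional-distribution arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch: the dictionary simply hard-codes the counts from the worked-example table, and all names (`table`, `conditional_distribution`) are ours.

```python
# Two-way table from the worked example: rows = gender, columns = job status.
table = {
    "Male":   {"job": 42, "no_job": 58},
    "Female": {"job": 51, "no_job": 49},
}

def conditional_distribution(row):
    """Distribution of job status given a fixed gender (row) category."""
    total = sum(table[row].values())  # marginal frequency of the condition
    return {status: count / total for status, count in table[row].items()}

female = conditional_distribution("Female")  # 51% with a job, 49% without
male = conditional_distribution("Male")      # 42% with a job, 58% without

# Differing conditional distributions => the variables are associated.
print(female["job"], male["job"])
```

Dividing by the row total (the marginal frequency of the condition) rather than the grand total is exactly the "state the condition first" habit the exam tip describes.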
3. Scatterplots and correlation
Scatterplots are used to visualize relationships between two quantitative variables: the explanatory (independent) variable is plotted on the x-axis, and the response (dependent) variable is plotted on the y-axis. When describing a scatterplot, always note four features:
- Direction: Positive (as x increases, y increases), negative (as x increases, y decreases), or no association
- Form: Linear, non-linear (curved), or no clear form
- Strength: How closely points follow the observed form
- Unusual features: Outliers, clusters, or gaps in the data
Correlation Coefficient
The correlation coefficient $r$ measures the strength and direction of linear association between two quantitative variables. It is calculated as:

$$r = \frac{1}{n-1}\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

where $n$ is the sample size, $\bar{x}$ and $s_x$ are the mean and standard deviation of the x-variable, and $\bar{y}$ and $s_y$ are the mean and standard deviation of the y-variable.
Key properties of $r$:
- $-1 \le r \le 1$, with $|r|$ closer to 1 indicating a stronger linear association
- $r$ has no units, and is not affected by changes to the units of x or y
- $r$ is not resistant to outliers: a single extreme point can drastically change its value
Worked Example
For data on hours studied (x: 1, 2, 3, 4, 5) and test score (y: 60, 65, 75, 82, 90), we calculate $\bar{x} = 3$, $s_x \approx 1.58$, $\bar{y} = 74.4$, $s_y \approx 12.22$. Plugging into the formula gives $r \approx 0.996$, indicating a very strong positive linear association between hours studied and test score.
Common trap: $r$ only measures linear association, so a value near 0 does not mean there is no relationship between variables — it could mean there is a strong non-linear relationship (e.g., a U-shape).
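As a sanity check on the hand calculation, the defining formula for $r$ can be coded directly. A minimal sketch using only the Python standard library (the function name `correlation` is ours):

```python
from statistics import mean, stdev

hours  = [1, 2, 3, 4, 5]
scores = [60, 65, 75, 82, 90]

def correlation(xs, ys):
    """Sample correlation: r = (1/(n-1)) * sum of paired z-scores."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - xbar) / sx * (y - ybar) / sy
               for x, y in zip(xs, ys)) / (n - 1)

r = correlation(hours, scores)
print(round(r, 3))  # 0.996
```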
4. Least-squares regression line
A regression line models the linear relationship between x and y, and is used to predict values of the response variable for given values of the explanatory variable. The least-squares regression line (LSRL) is the line that minimizes the sum of the squared vertical distances between observed y-values and predicted y-values (these distances are called residuals, covered in the next section).
The LSRL has the form $\hat{y} = a + bx$, where:
- $\hat{y}$ = predicted value of the response variable
- $b$ = slope of the line, calculated as $b = r\dfrac{s_y}{s_x}$
- $a$ = y-intercept, calculated as $a = \bar{y} - b\bar{x}$
Interpretations
- Slope: For each 1-unit increase in the explanatory variable x, the predicted value of y changes by $b$ units, on average.
- Y-intercept: The predicted value of y when x=0. This is only practically meaningful if x=0 is a plausible value within or very close to the range of observed x-values.
Worked Example
Using the hours studied and test score data from the previous section, $r \approx 0.996$, $s_x \approx 1.58$, $s_y \approx 12.22$, $\bar{x} = 3$, $\bar{y} = 74.4$, so:

$$b = r\frac{s_y}{s_x} \approx 0.996 \times \frac{12.22}{1.58} \approx 7.7, \qquad a = \bar{y} - b\bar{x} = 74.4 - 7.7(3) = 51.3$$

The LSRL is $\hat{y} = 51.3 + 7.7x$. The slope means each additional hour of study predicts a 7.7-point increase in test score, on average. The intercept means the predicted test score for 0 hours of study is 51.3, which is plausible in this context.
5. Residuals and influential points
A residual is the difference between an observed y-value and the predicted y-value from the LSRL, calculated as $\text{residual} = y - \hat{y}$. The sum of all residuals for a LSRL is always 0, by definition.
Residual Plots
A residual plot graphs residuals on the y-axis against either the explanatory variable x or the predicted value $\hat{y}$ on the x-axis. It is used to verify if a linear model is appropriate for the data:
- If the residual plot shows random, evenly distributed scatter around the horizontal line $\text{residual} = 0$, the linear model is appropriate
- If the residual plot shows a curved pattern, U-shape, or fanning (increasing/decreasing spread of residuals as x increases), the linear model is not a good fit for the data.
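To see the sum-of-residuals property concretely, here is a minimal sketch for the study-time data. The coefficients 7.7 and 51.3 are the exact least-squares values for this dataset (a full residual plot would just scatter these values against x):

```python
hours  = [1, 2, 3, 4, 5]
scores = [60, 65, 75, 82, 90]

# Least-squares line for this data: y-hat = 51.3 + 7.7x
predicted = [51.3 + 7.7 * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

rounded = [round(res, 1) for res in residuals]
print(rounded)                       # [1.0, -1.7, 0.6, -0.1, 0.2]
print(abs(sum(residuals)) < 1e-9)    # True: LSRL residuals always sum to 0
```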
Influential Points
An influential point is an observation that drastically changes the slope, intercept, or correlation coefficient of the LSRL if it is removed from the dataset. There are two types of unusual points that are often influential:
- Outliers: Points with an unusually large residual (far from the LSRL in the y-direction)
- High-leverage points: Points with an unusually large or small x-value, far from the mean of x
A point that is both an outlier and high-leverage is almost always influential.
Worked Example
If we add a point to our study time dataset: x=10 hours studied, y=60 test score, this point has a residual of $60 - (51.3 + 7.7 \times 10) = -68.3$ relative to the original LSRL, making it an outlier. It also has high leverage, as x=10 is far from the mean x of 3. Refitting with this point included drags the slope of the LSRL from 7.7 down to about $-0.14$; removing it restores the original slope, so it is highly influential.
6. Cautions about extrapolation
Extrapolation is the practice of using a LSRL to predict y-values for x-values that fall far outside the range of x-values used to calculate the line. Extrapolation is almost always unreliable, because we have no evidence that the linear relationship between x and y holds outside the observed range of x.
Worked Example
Our study time dataset has x-values ranging from 1 to 5 hours. If we use the LSRL to predict the test score for a student who studied 20 hours, we get $\hat{y} = 51.3 + 7.7(20) = 205.3$, which is impossible because test scores are capped at 100. The linear relationship between study time and test score does not hold for very high values of x, as students will eventually hit the maximum score, or suffer burnout from excessive studying.
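The arithmetic behind that prediction is a one-liner (a sketch using the least-squares coefficients for the study-time data):

```python
# LSRL for the study-time worked example: y-hat = 51.3 + 7.7x (x in hours)
slope, intercept = 7.7, 51.3

# 20 hours is far outside the observed x range of 1-5: extrapolation.
prediction = intercept + slope * 20
print(round(prediction, 1))  # 205.3 -- impossible, since scores cap at 100
```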
Another common example: Predicting the height of a 30-year-old using a LSRL built from height data for children aged 2-12. The linear growth rate of childhood stops after adolescence, so the extrapolated prediction will be drastically too high.
Exam tip: Examiners frequently ask you to critique a prediction made with a regression line. First check if the x-value falls inside the original range of x-data; if not, always label the prediction as extrapolation and note that it is unreliable.
7. Common Pitfalls (and how to avoid them)
- Wrong move: Calculating or interpreting the correlation coefficient for categorical variables, or for quantitative variables with a non-linear relationship. Why students do it: They forget that $r$ only measures linear association between two quantitative variables. Correct move: First confirm both variables are quantitative, and check the scatterplot for linear form before using $r$.
- Wrong move: Mixing up explanatory and response variables when calculating conditional probabilities or regression slopes. Why: They fail to label variables clearly at the start of analysis. Correct move: Always explicitly identify which variable is explanatory (x) and which is response (y) first, and state the condition clearly for conditional probabilities (e.g., "pass rate given female" instead of just "pass rate").
- Wrong move: Interpreting the y-intercept of a regression line when x=0 is impossible or far outside the observed data range. Why: They memorize the interpretation without considering context. Correct move: Only interpret the intercept if x=0 is a plausible value; otherwise, note that it has no practical meaning in the context of the problem.
- Wrong move: Assuming a strong correlation implies a causal relationship between variables. Why: Strong linear association feels like a cause-effect link. Correct move: Always mention that confounding variables could explain the association, and note that only controlled randomized experiments can prove causation.
- Wrong move: Treating extrapolated predictions as valid. Why: They assume the linear relationship between variables holds indefinitely. Correct move: Always check the x-value of the prediction against the original data range; if it falls far outside, label the prediction as unreliable extrapolation.
8. Practice Questions (AP Statistics Style)
Question 1
A school surveys 150 students to study the association between after-school sports participation (yes/no) and honor roll status (yes/no). The two-way table below shows the results:
| | Honor Roll | No Honor Roll | Total |
|---|---|---|---|
| Sports | 45 | 35 | 80 |
| No Sports | 30 | 40 | 70 |
| Total | 75 | 75 | 150 |

a) Calculate the conditional distribution of honor roll status for students who play sports.
b) Are sports participation and honor roll status independent? Justify your answer.
Solution 1
a) The conditional distribution is calculated relative to the total number of sports participants (80). $\frac{45}{80} = 56.25\%$, and $\frac{35}{80} = 43.75\%$. The conditional distribution is 56.25% honor roll, 43.75% no honor roll for sports participants. b) The variables are not independent. For independence, the conditional distribution of honor roll status should be the same for sports and non-sports participants. $\frac{30}{70} \approx 42.9\%$, which is different from 56.25%, so the variables are associated.
Question 2
A researcher collects data on 12 adults measuring age (x, in years) and resting heart rate (y, in beats per minute). Summary statistics are: , , , , . a) Calculate the least-squares regression line for predicting resting heart rate from age. b) Interpret the slope of the line in context. c) What is the predicted resting heart rate for a 50-year-old adult?
Solution 2
a) First calculate the slope: . Then calculate the intercept: . The LSRL is . b) The slope of 0.6 means that for each additional year of age, predicted resting heart rate increases by 0.6 beats per minute, on average. c) For x=50: beats per minute.
Question 3
A student uses data from 10 cars, with age (x, in years, range 1 to 10 years) and value (y, in USD, range 25000) to calculate the regression line . The residual plot shows random scatter around 0, with no obvious patterns. a) A car dealer uses this line to predict the value of a 20-year-old car. Is this prediction reliable? Justify your answer. b) A 3-year-old car in the dataset has an observed value of $21000. Calculate the residual for this car.
Solution 3
a) This prediction is not reliable. The original data used to build the line has x-values (car age) ranging from 1 to 10 years. A 20-year-old car is far outside this range, so this is extrapolation. There is no evidence that the linear relationship between age and value holds for cars older than 10 years (for example, vintage cars may increase in value after a certain age). b) First calculate the predicted value for x=3 from the regression line: $20,400. Residual = observed y − predicted y = 21,000 − 20,400 = $600. This means the car is worth $600 more than the model predicts for a 3-year-old car.
9. Quick Reference Cheatsheet
| Concept | Formula/Rule | Key Notes |
|---|---|---|
| Conditional Probability (Two-way Tables) | $P(A \mid B) = \frac{\text{Joint frequency of A and B}}{\text{Marginal frequency of B}}$ | State the condition explicitly before calculating |
| Correlation Coefficient | $r = \frac{1}{n-1}\sum\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$ | $-1 \le r \le 1$; only measures linear association between quantitative variables; not resistant to outliers |
| Least Squares Regression Line | $\hat{y} = a + bx$, with $b = r\frac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$ | Slope: change in predicted y per 1-unit x increase; intercept only meaningful if x=0 is a plausible value |
| Residual | $y - \hat{y}$ | Sum of residuals = 0; random scatter in residual plot confirms linear model is appropriate |
| Extrapolation | Predicting y for x outside the observed x range | Almost always unreliable; no evidence the linear relationship holds outside the data range |
10. What's Next
This topic forms the foundation for all inferential statistics with two variables later in the AP Statistics syllabus. You will use the correlation and regression skills you learned here to conduct hypothesis tests for the significance of a regression slope, calculate confidence intervals for slopes, and analyze relationships in experimental and observational studies in Unit 9. Understanding the difference between correlation and causation is also critical for evaluating the validity of statistical claims in the investigative task section of the AP exam, which counts for 25% of your free-response score.
If you struggle with any of the concepts in this guide, from calculating conditional distributions to interpreting residual plots, you can get personalized help from Ollie at any time by visiting Ollie. You can also practice more AP Statistics two-variable data questions, review full past papers, and access targeted feedback to make sure you are fully prepared for exam day.