AP · Exploring Two-Variable Data · 16 min read · Updated 2026-05-10

Exploring Two-Variable Data — AP Statistics Unit Overview

For: Students preparing for the AP Statistics exam.

Covers: All 8 core sub-topics of AP Statistics Unit 2 (Exploring Two-Variable Data): representing bivariate quantitative data, correlation, linear regression, least squares, residuals, outliers, transformations for linearity, and categorical association.

You should already know: How to calculate and interpret one-variable descriptive statistics (center, spread, shape). How to graph and interpret linear equations in slope-intercept form. How to distinguish between quantitative and categorical variables.

A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board exam questions and may differ in wording, numerical values, or context. Use them to practice the technique; cross-check with official scoring guidelines for grading conventions.


1. Why This Unit Matters

Up to this point in AP Statistics, you have analyzed patterns in a single variable at a time, but almost all real-world statistical questions ask how one variable relates to another. Does weekly study time correlate with final exam scores? Does annual carbon output correlate with global average temperature? Is customer age associated with preference for different product lines? This unit builds your complete toolset to answer questions like these, and it is the foundation for all inference on two variables later in the course. Per the College Board AP Statistics Course and Exam Description (CED), this unit makes up roughly 5-7% of the multiple-choice section, and it appears on both multiple-choice (MCQ) and free-response (FRQ) sections, very often as the first FRQ of the exam that tests your ability to interpret bivariate analysis in context. This unit also introduces the critical statistical habit of plotting your data before calculating numerical summaries, a practice you will apply in every unit that follows.

2. Unit Concept Map

The 8 sub-topics of this unit build sequentially from visual exploration to numerical summary to model construction to validation to extension, each step building on the output of the previous one. Any analysis of two-variable data always starts with a visual, which is why Representing Two-Variable Quantitative Data is the first step: here you learn to construct and interpret scatterplots, which separate the explanatory variable (plotted on the x-axis, used to predict) and response variable (plotted on the y-axis, the outcome being predicted), and let you spot broad patterns: direction (positive/negative), form (linear/non-linear), strength, and outliers before any calculations.

Once you have a visual sense of a potential linear relationship, Correlation quantifies that pattern, turning a qualitative description ("strong positive") into a numerical value between -1 and 1 that measures the strength and direction of only linear association. Next, Linear Regression Models formalizes the idea of using a line to predict values of the response variable from the explanatory variable, defining the roles of slope and intercept in context. Least Squares Regression then provides the standard mathematical method to find the "best" line: it calculates the unique slope and intercept that minimize the sum of squared vertical distances (residuals) between the observed response values and the line.
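To make the least squares step concrete, here is a minimal sketch (using made-up data and NumPy; none of these numbers come from any exam problem) showing that the summary-statistic formulas for slope and intercept agree with a direct least squares fit:

```python
# Minimal least squares sketch on made-up data (not from any exam problem).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = np.corrcoef(x, y)[0, 1]                  # correlation coefficient
slope = r * y.std(ddof=1) / x.std(ddof=1)    # b1 = r * (s_y / s_x)
intercept = y.mean() - slope * x.mean()      # line passes through (x-bar, y-bar)

# Cross-check against NumPy's own least squares fit:
b1, b0 = np.polyfit(x, y, deg=1)
print(round(slope, 4), round(b1, 4))         # the two slopes match
```

The point of the cross-check is that minimizing the sum of squared residuals and applying the b1 = r(s_y/s_x) formula are the same calculation.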

After fitting the model, you need to check how well it works: the Residuals sub-topic teaches you to construct residual plots and interpret their patterns to check for linearity, constant variance, and outliers. If your residual plot shows that an observation is extreme, Regression Outliers and Influential Points teaches you to distinguish between outliers (points with large residuals) and influential points (points that change the model dramatically if removed), and how to handle each. If residuals confirm the original relationship is not linear, Transforming to Achieve Linearity gives you tools to re-express x or y (using logarithms, powers, roots) to create a linear relationship that you can then model with regression. Finally, to extend your analysis beyond two quantitative variables, Introduction to Categorical Association adapts the core question ("are these two variables associated?") to two categorical variables, using two-way tables and marginal/conditional distributions to answer that question.
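The categorical-association step can be sketched numerically. Below, a toy two-way table (invented counts, echoing the customer-age example earlier) is turned into conditional distributions; if the conditional rows differ noticeably, the variables are associated:

```python
# Toy two-way table of counts (invented data): rows = age group,
# columns = product preference.
import numpy as np

table = np.array([[30, 20],    # under 30:  prefers A, prefers B
                  [15, 35]])   # 30 & up:   prefers A, prefers B

marginal = table.sum(axis=1) / table.sum()              # row marginal distribution
conditional = table / table.sum(axis=1, keepdims=True)  # distribution of preference
                                                        # within each age group
print(conditional)  # rows differ (0.6 vs 0.3 prefer A), suggesting association
```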

3. A Guided Tour of a Typical Exam Problem

We will walk through a full exam-style problem to show how the core sub-topics connect in practice, demonstrating how each builds on the previous step:

Problem context: A marine biologist measures body mass (x, in grams) and maximum sustained swim speed (y, in cm/s) for 12 species of reef fish. She wants to model how swim speed changes with body mass.

  1. First stop: Representing Two-Variable Quantitative Data: The problem opens with a scatterplot of the data and asks you to describe the relationship. You observe a general increasing, curved relationship with moderate spread and one extreme point for a very large fish. This is the mandatory first step of any bivariate analysis: you always describe what you see visually before any calculations.
  2. Second stop: Correlation: The next part gives the correlation of the raw data. You interpret it as a strong positive linear association between body mass and swim speed, but note that the curved pattern visible in the scatterplot means the correlation alone misrepresents the form of the underlying relationship. This builds on your visual observation by adding a numerical summary.
  3. Third stop: Least Squares Regression + Residuals: The problem gives the least squares regression line for the raw data and shows a residual plot with a clear U-shaped pattern. You connect these two skills: the U-shaped residual pattern confirms the relationship is non-linear, so a linear model on the original data is not appropriate. This sequence (fit the model, then check the fit with residuals) is the core of every regression analysis.
  4. Fourth stop: Transforming to Achieve Linearity: The problem gives the residual plot after a natural log transformation of x, which shows no obvious pattern. You conclude the transformed model is appropriate and can be used for prediction.

This single problem touches 5 of the 8 unit sub-topics in sequence, showing how each step depends on the previous one. Exam tip: On any multi-part bivariate FRQ, always carry your conclusion from earlier parts forward. If you conclude the original linear model is inappropriate, never use that original model for prediction in a later part.
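The transformation step from the tour can be sketched with simulated data (the shape below is assumed, not the biologist's actual measurements): a response that grows with log(x) looks curved against x but linear against ln(x), and the correlation improves accordingly:

```python
# Simulated curved data (assumed shape, not real fish measurements).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(5, 500, 30)
y = 12 * np.log(x) + rng.normal(0, 1, size=30)   # linear in ln(x), curved in x

r_raw = np.corrcoef(x, y)[0, 1]          # correlation on the raw scale
r_log = np.corrcoef(np.log(x), y)[0, 1]  # correlation after transforming x
print(round(r_raw, 3), round(r_log, 3))  # r_log is closer to 1 than r_raw
```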

4. Cross-Cutting Common Pitfalls

  • Wrong move: Mixing up the explanatory and response variable when calculating regression slope. Why: Students know correlation is symmetric (swap x and y, r stays the same), so they incorrectly assume slope is also symmetric. Slope depends entirely on which variable is the response. Correct move: Always label x = explanatory (the predictor) and y = response (the outcome being predicted) before doing any calculations, and double-check variable roles against the problem context.
  • Wrong move: Interpreting a strong correlation or significant regression slope as proof of causation. Why: Most bivariate data is observational, and students often transfer causal thinking from controlled lab experiments to observational data. Correct move: Unless the problem explicitly states the data comes from a randomized experiment, always explicitly state that correlation does not imply causation, and note that confounding variables could explain the observed association.
  • Wrong move: Assuming a high correlation coefficient means the relationship between two variables is linear. Why: Students learn that r measures strength of association, so they forget it only measures strength of linear association. A strongly curved relationship can still have a high r. Correct move: Always check both the correlation coefficient and a residual plot (or original scatterplot) to confirm the relationship is linear before using a linear regression model.
  • Wrong move: Calling any extreme point in regression an influential point. Why: Students confuse outliers (points with large residuals) with influential points (points that change the model parameters when removed). An extreme point that follows the trend of the other data is usually not influential. Correct move: A point is only influential if removing it meaningfully changes the slope or intercept of the regression line. Always check for a meaningful change in the model before labeling a point as influential.
  • Wrong move: Ignoring a clear curved pattern in a residual plot and using the linear model anyway. Why: Students want to use the linear model provided in the problem, even when the data contradicts the model assumptions. Correct move: Any non-random pattern (U-shape, fanning, curvature) in a residual plot means linear model assumptions are violated, and you must explicitly state the linear model is not appropriate.
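The third pitfall above (high r does not imply linear) is easy to demonstrate numerically. In this toy sketch, a noiseless quadratic still produces a correlation well above 0.9, while its residuals from a linear fit trace the telltale U-shape:

```python
# Toy demonstration: strongly curved data with a high correlation.
import numpy as np

x = np.linspace(1, 10, 20)
y = x ** 2                       # perfectly curved, no noise at all

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))               # well above 0.9 despite the curvature

# Residuals from the best-fit line are positive at both ends and negative
# in the middle -- a U-shape, flagging the non-linearity r missed.
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)
print(residuals[0] > 0, residuals[-1] > 0, residuals.min() < 0)
```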

5. Quick Check: When Do I Use Which Sub-Topic?

For each question below, identify which sub-topic you would use to answer it:

  1. I have two quantitative variables, and I want to see the overall pattern of their relationship before calculating any numbers.
  2. I want to measure how strong the linear relationship between my two quantitative variables is.
  3. I have fit a linear model and I want to check if the model is appropriate for my data.
  4. My residual plot shows a clear curved pattern, so my linear model doesn’t fit. What do I do next?
  5. I have two categorical variables, and I want to know if they are associated with each other.
Answers:
  1. Representing Two-Variable Quantitative Data (use a scatterplot)
  2. Correlation (calculate r)
  3. Residuals (analyze the residual plot for patterns)
  4. Transforming to Achieve Linearity (re-express x or y to get a linear relationship)
  5. Introduction to Categorical Association (calculate conditional distributions from a two-way table)

6. Practice Questions (AP Statistics Style)

Question 1 (Multiple Choice)

A researcher studies the relationship between minutes of exercise per week (x, quantitative) and resting systolic blood pressure (y, quantitative) for 50 adults. The scatterplot shows a negative, moderately linear relationship, with one point that is far to the right of all other points (very high exercise) and has a small residual. Removing this point from the data will most likely result in which of the following?

  A) The correlation becomes weaker, and the slope of the regression line changes substantially
  B) The correlation becomes stronger, and the slope of the regression line changes substantially
  C) The correlation does not change much, and the slope of the regression line changes substantially
  D) The correlation does not change much, and the slope of the regression line does not change much

Worked Solution: First, recall the definition of an influential point: a point that substantially changes the regression slope or correlation when removed. This point is extreme in the x-direction but has a small residual, meaning it falls close to the regression line formed by the other points and follows the existing linear trend. Because it aligns with the trend, removing it will not meaningfully change the correlation or the slope of the regression line. Correct answer: D.
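A numeric sketch of the idea in Question 1 (all values invented): the last point is far to the right in x but sits on the trend of the others, so dropping it barely moves the fitted slope:

```python
# Invented data: a high-leverage point (x = 60) that follows the trend.
import numpy as np

x = np.array([10.0, 12, 15, 18, 20, 22, 25, 60])
noise = np.array([0.4, -0.3, 0.2, -0.5, 0.1, 0.3, -0.2, 0.1])
y = -0.5 * x + 120 + noise        # last point has only a small residual

slope_all, _ = np.polyfit(x, y, deg=1)
slope_without, _ = np.polyfit(x[:-1], y[:-1], deg=1)
print(round(slope_all, 3), round(slope_without, 3))  # nearly identical slopes
```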


Question 2 (Free Response)

A café owner records the average daily temperature (x, in °F) and the number of iced lattes sold per day (y) over 25 randomly selected days. Summary statistics provided are the sample means (x̄, ȳ), the sample standard deviations (s_x, s_y), and the correlation r. (a) Calculate the slope of the least-squares regression line for predicting number of iced lattes from average temperature. Show your work. (b) Interpret the slope you calculated in context. (c) The residual plot for this model has no obvious pattern. What does that tell you about the fit of the linear model?

Worked Solution: (a) The formula for the least squares slope is b1 = r(s_y/s_x). Substituting the given summary statistics gives a slope of 1.02 iced lattes per °F. (b) In context: For every 1°F increase in average daily temperature, the predicted number of iced lattes sold per day increases by 1.02, on average. (c) A residual plot with no obvious pattern indicates that a linear model is appropriate for describing the relationship between average temperature and number of iced lattes sold.
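Part (a)'s arithmetic can be sketched with illustrative summary statistics (the values below are assumed for demonstration, not the café problem's actual data):

```python
# Illustrative (assumed) summary statistics, not the actual café data.
x_bar, y_bar = 75.0, 80.0   # mean temperature (°F), mean lattes sold
s_x, s_y = 8.0, 10.0        # sample standard deviations
r = 0.816                   # correlation

slope = r * s_y / s_x                # b1 = r * (s_y / s_x)
intercept = y_bar - slope * x_bar    # b0 = y-bar - b1 * x-bar
print(round(slope, 2), round(intercept, 2))  # 1.02 3.5
```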


Question 3 (Application / Real-World Style)

A plant biologist measures the 10-year height of pine trees (y, in meters) for different amounts of fertilizer applied (x, in kg per tree). She finds the relationship between x and y is curved, so she applies a square-root transformation to y, obtaining a linear relationship with a fitted regression equation that predicts √y from x. Predict the 10-year height of a pine tree that received 3 kg of fertilizer. Give your answer with units, and interpret the prediction in context.

Worked Solution: First substitute x = 3 into the transformed regression equation to obtain the predicted value of √y. Then square that predicted value to reverse the transformation and get the predicted height in meters. In context: The predicted 10-year height of a pine tree that received 3 kg of fertilizer is approximately 13 meters.
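The back-transformation step can be sketched with a hypothetical fitted equation (the coefficients below are assumed; the problem's actual equation is not reproduced here):

```python
# Hypothetical transformed fit: sqrt(y-hat) = 1.0 + 0.87 * x (assumed values).
def predict_height_m(x_kg, b0=1.0, b1=0.87):
    sqrt_y_hat = b0 + b1 * x_kg   # the model predicts sqrt(height)
    return sqrt_y_hat ** 2        # square to undo the transformation

print(round(predict_height_m(3), 1))  # ~13.0 meters for these assumed values
```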

7. Quick Reference Cheatsheet

  • Correlation coefficient: r = (1/(n-1)) Σ [(x_i - x̄)/s_x][(y_i - ȳ)/s_y]. Symmetric (swap x/y, r unchanged); only measures linear association.
  • Least squares slope: b1 = r(s_y/s_x). Units = (y units)/(x units); changes if you swap x and y.
  • Least squares intercept: b0 = ȳ - b1·x̄. The regression line always passes through (x̄, ȳ).
  • Predicted response: ŷ = b0 + b1·x. A predicted value, not the observed actual value.
  • Residual: residual = y - ŷ. Prediction error; no pattern in residuals = good linear fit.
  • Sum of squared residuals: Σ(y_i - ŷ_i)². Least squares regression minimizes this value.
  • Exponential model (log transformed): ln(ŷ) = a + bx, equivalently ŷ = e^a · (e^b)^x. Used for curved exponential relationships.
  • Conditional proportion (categorical association): P(A given B) = (count in both A and B) / (count in B). Compare conditional proportions across groups to check for association.
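Two cheatsheet facts (the regression line passes through (x̄, ȳ), and least squares residuals sum to zero) can be verified numerically on toy data:

```python
# Toy numerical check of two cheatsheet facts (made-up data).
import numpy as np

x = np.array([2.0, 4, 5, 7, 9])
y = np.array([3.0, 5, 6, 9, 11])

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

print(bool(np.isclose(b0 + b1 * x.mean(), y.mean())))  # True: line hits (x-bar, y-bar)
print(bool(np.isclose(residuals.sum(), 0.0)))          # True: residuals sum to zero
```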

8. What's Next (Sub-Topic Links)

This unit is the foundation for all inference on two-variable data, which comes later in the AP Statistics course, specifically inference for regression slope and chi-square tests for association. Without mastering the skills in this unit — including identifying when a linear model is appropriate, calculating regression coefficients, and correctly describing association — you will not be able to correctly interpret or conduct inference for bivariate data later in the course. This unit also reinforces the core statistical thinking habit of plotting data before calculating, which is critical for every topic that follows.

