Residuals — AP Statistics Study Guide

For: Students preparing for the AP Statistics exam.

Covers: Definition of a residual, residual calculation formula, constructing and interpreting residual plots, assessing linear model fit, standard deviation of residuals, and using residuals to identify unusual patterns in two-variable data.

You should already know: Least-squares regression line (LSRL) equation for bivariate data. How to calculate predicted values from a regression model. Basic properties of scatterplots for two-variable data.

A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board / Cambridge / IB papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official mark schemes for grading conventions.

1. What Is a Residual?

A residual is the difference between the observed value of the response variable and the value predicted by a regression model. Residuals are the main tool for assessing how well a linear model fits bivariate data. Per the AP Statistics Course and Exam Description (CED), this topic makes up 10-15% of Unit 2 (Exploring Two-Variable Data) and appears in both the multiple-choice (MCQ) and free-response (FRQ) sections of the exam, almost always including at least one interpretation or justification question on the first FRQ.

Notation conventions used on the AP exam label the residual as $e$, the observed response as $y$, and the predicted response as $\hat{y}$. Residuals are sometimes called "prediction errors" in textbooks, but the AP exam consistently uses the term residual. A key property frequently tested is that the sum of all residuals for a least-squares regression line is always zero, because the LSRL balances positive and negative errors. This gives you a built-in check to confirm your calculations are correct on any problem.
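The zero-sum property can be checked numerically. The following is a minimal Python sketch: the data values and the helper name `fit_lsrl` are invented for illustration, and the slope and intercept come from the standard least-squares formulas rather than from anything specific to this guide.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical data, made up for this illustration only
hours = [1, 2, 3, 4, 5]
scores = [68, 70, 75, 74, 80]

a, b = fit_lsrl(hours, scores)

# Residual = observed minus predicted, one per data point
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]
residual_sum = sum(residuals)

# The sum is zero up to floating-point rounding
print(f"a = {a:.1f}, b = {b:.1f}, sum of residuals = {residual_sum:.6f}")
```

If a hand-calculated set of LSRL residuals does not sum to (approximately) zero, at least one predicted value or subtraction is likely wrong.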

2. Calculating and Interpreting Individual Residuals

To find the residual for a single data point, we use a simple formula that follows directly from the definition of a residual as the error in the model's prediction. The formula is:

$e = y - \hat{y}$

where $y$ is the observed response value (measured in the original data) and $\hat{y}$ is the predicted response value, calculated by plugging the observed explanatory variable $x$ into the LSRL equation $\hat{y} = a + bx$.

Intuition: We want to measure how wrong the model was, so we subtract what the model predicted from what we actually observed. A positive residual means the model underpredicted the response (observed value is higher than predicted), and a negative residual means the model overpredicted the response (observed value is lower than predicted). On the AP exam, you will almost always be asked to both calculate and interpret the residual in context, so both steps are required for full credit.

Worked Example

A tutor studies the relationship between hours of tutoring per month ($x$) and student final exam score out of 100 ($y$) and finds the LSRL: $\hat{y} = 65 + 2.1x$. One student attended 6 hours of tutoring and scored a 74. Calculate and interpret the residual for this student.

Identify known values: $x = 6$, observed $y = 74$, LSRL $\hat{y} = 65 + 2.1x$.
Calculate predicted score: $\hat{y} = 65 + 2.1(6) = 65 + 12.6 = 77.6$.
Apply residual formula: $e = y - \hat{y} = 74 - 77.6 = -3.6$.
Interpret: The negative residual means the model overpredicted this student's final exam score by 3.6 points.
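The same steps can be sketched in Python. This is only an illustrative check of the arithmetic above; the function names `predict` and `residual` are hypothetical helpers, not part of any AP resource.

```python
def predict(x, intercept, slope):
    """Predicted response from the line y_hat = intercept + slope * x."""
    return intercept + slope * x

def residual(observed_y, predicted_y):
    """Residual = observed minus predicted."""
    return observed_y - predicted_y

# Values from the tutoring example: LSRL y_hat = 65 + 2.1x, x = 6, y = 74
y_hat = predict(6, intercept=65, slope=2.1)
e = residual(74, y_hat)

# The sign of the residual gives the direction of the prediction error
if e < 0:
    direction = "overpredicted"
elif e > 0:
    direction = "underpredicted"
else:
    direction = "exactly predicted"

print(f"predicted = {y_hat:.1f}, residual = {e:.1f}")
print(f"The model {direction} the score by {abs(e):.1f} points.")
```

Note that the code subtracts in the order observed minus predicted, which is what makes a negative result correspond to overprediction.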

Exam tip: When interpreting a residual, you must name whether the model overpredicted or underpredicted and include the magnitude and context to earn full credit. Stating just "the residual is -3.6" will not earn the point.

3. Interpreting Residual Plots to Assess Linear Model Fit

A residual plot is a scatterplot with residuals $e$ on the y-axis and either the explanatory variable $x$ or the predicted value $\hat{y}$ on the x-axis. We use residual plots to check if a linear model is appropriate for the data, because subtle patterns that are hard to see on the original $y$ vs $x$ scatterplot become very clear in a residual plot.

The rule for assessment is simple: if there is no clear, systematic pattern in the residual plot, a linear model is appropriate. If there is a clear systematic pattern, a linear model is not appropriate. Common problematic patterns include: (1) curved patterns (U-shape or inverted U-shape), which indicate the true relationship between $x$ and $y$ is non-linear; (2) fanning/funnel patterns, where the spread of residuals increases or decreases as $x$ increases, which indicate non-constant variance (heteroscedasticity).

Worked Example

Three residual plots for three different linear models have the following patterns. For each, state if a linear model is appropriate:
(a) Residuals are randomly scattered between -3 and 3 across the entire range of $x$, with no visible trend.
(b) Residuals start negative for small $x$, increase to positive for middle $x$, then decrease back to negative for large $x$, forming a clear inverted U-shape.
(c) Residuals are tightly clustered between -1 and 1 for small $x$, and spread out between -6 and 6 for large $x$.

(a): No clear systematic pattern means the linear model is appropriate. Random scatter confirms that the linear model correctly captures the relationship between $x$ and $y$, with no systematic prediction error.
(b): The clear curved inverted U pattern means the true relationship is non-linear. A linear model is not appropriate.
(c): The clear fanning pattern (increasing spread of residuals as $x$ increases) indicates non-constant variance, so the linear model is not appropriate.
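To see how a curved pattern arises, here is a small Python sketch that fits a least-squares line to data that are deliberately non-linear and lists the residuals in order of $x$. The data ($y = x^2$) and the helper `fit_lsrl` are assumptions made for illustration; this is not a substitute for reading an actual residual plot.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical curved data: the true relationship is y = x^2, not a line
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

a, b = fit_lsrl(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Listing residuals against x plays the role of a text-only residual plot
for x, e in zip(xs, residuals):
    print(f"x = {x}: residual = {e:+.1f}")
```

Here the residuals are positive at both ends of the $x$ range and negative in the middle, a U-shape. That is a systematic pattern, so a linear model would not be appropriate for these data even though the residuals still sum to zero.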

Exam tip: A common AP MCQ trick shows a seemingly linear original $y$ vs $x$ scatterplot but a residual plot with a clear pattern. Always rely on the residual plot, not the original scatterplot, to assess model fit.

4. Standard Deviation of Residuals

The standard deviation of residuals (written $s$ or $s_{e}$) is a numerical measure of the typical size of the residuals: it tells you roughly how far, on average, observed values are from the regression line. It complements the graphical assessment from residual plots by giving a quantitative measure of model fit: smaller $s$ means predictions are typically closer to observed values, so the model fits better.

The formula for the standard deviation of residuals is:

$s = \sqrt{\frac{\sum e^{2}}{n - 2}} = \sqrt{\frac{\sum (y - \hat{y})^{2}}{n - 2}}$

where $n$ is the number of observations. We divide by $n - 2$, the degrees of freedom for simple linear regression, because two parameters (the intercept and the slope) are estimated from the data. We square the residuals to eliminate negative signs (since the sum of raw residuals is always zero, the average raw residual is useless), then take the square root to return to the original units of the response variable.

Worked Example

A data set of 6 observations has residuals: $-0.8, 0.5, 1.1, -0.6, -0.2, 0.0$. Calculate the standard deviation of residuals $s$.

Confirm $n = 6$, so degrees of freedom $n - 2 = 4$.
Square each residual: $(-0.8)^{2} = 0.64$, $0.5^{2} = 0.25$, $1.1^{2} = 1.21$, $(-0.6)^{2} = 0.36$, $(-0.2)^{2} = 0.04$, $0.0^{2} = 0$.
Sum the squared residuals: $\sum e^{2} = 0.64 + 0.25 + 1.21 + 0.36 + 0.04 + 0 = 2.5$.
Plug into the formula: $s = \sqrt{\frac{2.5}{4}} = \sqrt{0.625} \approx 0.79$.
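The calculation can be double-checked with a short Python sketch. The function name `residual_sd` is a hypothetical helper introduced here; it simply squares the residuals, divides by $n - 2$, and takes the square root.

```python
from math import sqrt

def residual_sd(residuals):
    """Standard deviation of residuals: sqrt(sum of e^2 / (n - 2))."""
    n = len(residuals)
    if n <= 2:
        raise ValueError("need more than 2 residuals to divide by n - 2")
    sum_sq = sum(e ** 2 for e in residuals)   # sum of squared residuals
    return sqrt(sum_sq / (n - 2))

# Residuals from the worked example above
residuals = [-0.8, 0.5, 1.1, -0.6, -0.2, 0.0]
s = residual_sd(residuals)

print(round(s, 2))  # prints 0.79
```

Forgetting the square root would give 0.625, which is in squared units of the response variable and is not the value the exam asks for.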

Exam tip: When interpreting $s$ on the AP exam, always mention that it is the typical deviation of observed values from the regression line, and include the original units of the response variable to earn full credit.

5. Common Pitfalls (and how to avoid them)

Wrong move: Calculates residual as $\hat{y} - y$ instead of $y - \hat{y}$, leading to the wrong sign. Why: Students confuse the order of subtraction when writing "prediction error". Correct move: Memorize "Observed minus Expected (Predicted)" to always get the order right.
Wrong move: Claims a linear model is appropriate based only on the original $y$ vs $x$ scatterplot, ignoring the residual plot. Why: Subtle non-linear patterns are often invisible on the original scatterplot but clear in residuals. Correct move: Always use the residual plot to assess model appropriateness, regardless of the original scatterplot.
Wrong move: Interprets $s$ as the "average of the raw residuals". Why: Students forget the sum of raw residuals is always zero for LSRL. Correct move: Remember $s$ measures the average distance of observed values from the regression line, not the average of raw residuals.
Wrong move: Claims any small fluctuation in a residual plot means a linear model is inappropriate. Why: Students confuse random sampling variation with a systematic pattern. Correct move: Only label a pattern as problematic if it is clear and systematic across the entire range of $x$.
Wrong move: When calculating $s$, divides by $n$ instead of $n - 2$. Why: Students confuse population standard deviation with regression residual standard deviation. Correct move: For all AP Statistics problems, divide by $n - 2$ when calculating $s$ for a regression model.
Wrong move: Claims any non-zero residual means the model is a poor fit. Why: Students think residuals should be zero for a good model. Correct move: Natural variation produces non-zero residuals for all models; we only care about systematic patterns across all residuals, not individual non-zero residuals.

6. Practice Questions (AP Statistics Style)

Question 1 (Multiple Choice)

A café owner studies the relationship between average daily temperature ($^{\circ}$F, $x$) and daily iced coffee sales (dollars, $y$). He calculates the LSRL: $\hat{y} = 45 + 2.8x$. On a day with an average temperature of $75^{\circ}$F, observed sales were \$245. What is the residual for this day, and what does it mean?
A) Residual = -10; the model overpredicted sales by \$10
B) Residual = -10; the model underpredicted sales by \$10
C) Residual = 10; the model overpredicted sales by \$10
D) Residual = 10; the model underpredicted sales by \$10

Worked Solution: First calculate predicted sales for $x = 75$: $\hat{y} = 45 + 2.8(75) = 45 + 210 = 255$. Next apply the residual formula: $e = y - \hat{y} = 245 - 255 = -10$. A negative residual means the observed value is less than the predicted value, so the model overpredicted sales by \$10. Correct answer: A.

Question 2 (Free Response)

A real estate agent studies the relationship between house size (in square feet, $x$) and selling price (in thousands of dollars, $y$) for 15 houses for sale in a neighborhood. She fits a linear regression model and finds the residual plot has a clear pattern: residuals become more spread out as house size increases.
(a) An 1,800-square-foot house has an observed selling price of 420 thousand dollars and a predicted selling price of 405 thousand dollars. Calculate the residual for this house.
(b) Is a linear model appropriate for this data? Justify your answer.
(c) The standard deviation of residuals is $s = 12.8$. Interpret this value in context.

Worked Solution:
(a) Use the residual formula $e = y - \hat{y} = 420 - 405 = 15$. The residual is 15 thousand dollars, or \$15,000.
(b) A linear model is not appropriate. The residual plot shows a clear fanning pattern, where the spread of residuals increases as house size increases. This systematic pattern indicates non-constant variance, which violates the assumptions for a linear model.
(c) A standard deviation of residuals $s = 12.8$ means that, on average, observed selling prices of houses deviate from the linear regression model's prediction by approximately 12.8 thousand dollars, or \$12,800.

Question 3 (Application / Real-World Style)

An agricultural scientist studies the relationship between annual rainfall (in inches, $x$) and corn yield (in bushels per acre, $y$) for 7 test plots. She calculates the LSRL $\hat{y} = 25 + 5.2x$, and the residuals for the 7 plots are: $-1.2, 0.8, 0.3, -0.5, 0.1, 0.4, 0.1$. Calculate the standard deviation of residuals for this model, and interpret the result in context.

Worked Solution: We have $n = 7$, so degrees of freedom $n - 2 = 5$. Square each residual: $(-1.2)^{2} = 1.44$, $0.8^{2} = 0.64$, $0.3^{2} = 0.09$, $(-0.5)^{2} = 0.25$, $0.1^{2} = 0.01$, $0.4^{2} = 0.16$, $0.1^{2} = 0.01$. Sum of squared residuals: $\sum e^{2} = 1.44 + 0.64 + 0.09 + 0.25 + 0.01 + 0.16 + 0.01 = 2.6$. Plug into the formula: $s = \sqrt{\frac{2.6}{5}} = \sqrt{0.52} \approx 0.72$ bushels per acre. Interpretation: On average, observed corn yields deviate from the linear regression model's prediction by approximately 0.72 bushels per acre.

7. Quick Reference Cheatsheet

Category	Formula	Notes
Individual Residual	$e = y - \hat{y}$	Observed minus Predicted. Positive $e$ = underprediction; negative $e$ = overprediction.
Predicted Response	$\hat{y} = a + bx$	Plug in explanatory $x$; $a$ = LSRL intercept, $b$ = LSRL slope.
Sum of Residuals (LSRL)	$\sum e = 0$	Always true for least-squares lines. Use to check calculation accuracy.
Standard Deviation of Residuals	$s = \sqrt{\frac{\sum e^{2}}{n - 2}}$	Measures average distance of observed values from the regression line. Smaller $s$ = better fit.
Linear Model Appropriateness	No clear pattern = appropriate	Random scatter confirms linear model is appropriate.
Curved Residual Pattern	Pattern = inappropriate	Curved pattern means true relationship is non-linear.
Fanning Residual Pattern	Pattern = inappropriate	Changing spread means non-constant variance, linear model not appropriate.
Potential Outlier	$	e

8. What's Next

Residuals are the foundational prerequisite for all regression work in AP Statistics. Immediately after this topic, you will learn how to use residuals to identify outliers and influential points in regression, and how these points impact the slope, intercept, and correlation of your model. Without correctly calculating and interpreting residuals, you cannot correctly assess model fit or the impact of unusual points. Across the rest of the course, residuals are critical for checking the assumptions required for inference for regression, which is a major topic on the AP exam. All inference for regression relies on checking that residuals are independent, normally distributed, and have constant variance—skills you build in this chapter.
