AP · Linear Regression Models · 14 min read · Updated 2026-05-10

Linear Regression Models — AP Statistics Study Guide

For: AP Statistics candidates sitting AP Statistics.

Covers: Population and sample linear regression models, the least squares regression line (LSRL), slope/intercept interpretation, residual calculation, residual analysis for model fit, and the coefficient of determination for simple two-variable linear regression.

You should already know: Scatterplot construction and correlation interpretation for two-variable data, how to calculate summary statistics (mean, standard deviation) for a sample, basic linear equation algebra.

A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board / Cambridge / IB papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official mark schemes for grading conventions.

1. What Is Linear Regression Models?

Linear regression models are statistical models that describe the linear relationship between an explanatory (independent) variable $x$ and a response (dependent) variable $y$ . In AP Statistics, we almost always use sample data to estimate the true underlying linear relationship that exists in the full population, so we distinguish between the population model (with unknown parameters $β_{0}, β_{1}$ and random error term $ε$ ) and the estimated sample model (with calculated statistics $b_{0}, b_{1}$ ). The most common method for fitting a linear model to sample data is least squares regression, which produces the line that minimizes the sum of squared vertical distances between observed data points and the line. This topic makes up approximately 5-7% of the total AP Statistics exam weight (as part of Unit 2: Exploring Two-Variable Data, which is 10-15% total weight), and it appears in both multiple-choice (MCQ) and free-response (FRQ) sections, often as an early FRQ testing contextual interpretation.

2. The Least Squares Regression Line (LSRL)

The least squares regression line (LSRL) is the straight line that best fits a set of two-variable sample data by minimizing the sum of squared residuals. A residual is defined as the vertical difference between the observed response value and the predicted response value for any $x$ : $e_{i} = y_{i} - \overset{y}{^}_{i}$ , where $\overset{y}{^}_{i}$ is the predicted $y$ for observation $i$ . The goal of least squares is to minimize $\sum e_{i}^{2} = \sum (y_{i} - \overset{y}{^}_{i})^{2}$ .

From calculus and algebra, the optimal slope and intercept of the LSRL have simple closed-form formulas that use common summary statistics: $b_{1} = r \cdot \frac{s _{y}}{s _{x}}$ $b_{0} = \overset{y}{ˉ} - b_{1} \overset{x}{ˉ}$ where $r$ is the correlation coefficient between $x$ and $y$ , $s_{y}$ is the sample standard deviation of $y$ , $s_{x}$ is the sample standard deviation of $x$ , $\overset{y}{ˉ}$ is the sample mean of $y$ , and $\overset{x}{ˉ}$ is the sample mean of $x$ . A key property of the LSRL is that it always passes through the point $(\overset{x}{ˉ}, \overset{y}{ˉ})$ , which we can use to check our calculations. It is critical to remember that the LSRL is defined for predicting $y$ from $x$ : swapping $x$ and $y$ will produce an entirely different line.

Worked Example

A student researcher collects data on hours spent playing video games per week ( $x$ ) and GPA (y, on a 4.0 scale) for 20 college students. Summary statistics are: $\overset{x}{ˉ} = 8.2$ hours, $s_{x} = 4.1$ hours, $\overset{y}{ˉ} = 3.1$ , $s_{y} = 0.6$ , $r = - 0.58$ . Calculate the equation of the LSRL for predicting GPA from weekly video game play.

Calculate the slope $b_{1}$ using the formula: $b_{1} = r \cdot \frac{s _{y}}{s _{x}} = - 0.58 \cdot \frac{0.6}{4.1} \approx - 0.58 \cdot 0.146 \approx - 0.085$ .
Calculate the intercept $b_{0}$ using the fact that the LSRL passes through $(\overset{x}{ˉ}, \overset{y}{ˉ})$ : $b_{0} = \overset{y}{ˉ} - b_{1} \overset{x}{ˉ} = 3.1 - (- 0.085) (8.2) \approx 3.1 + 0.697 = 3.797$ .
Write the final equation with defined variables: $\overset{y}{^} = 3.80 - 0.085 x$ , where $\overset{y}{^}$ is predicted GPA and $x$ is weekly video game play in hours.
Check that $(\overset{x}{ˉ}, \overset{y}{ˉ})$ satisfies the equation: $3.80 - 0.085 (8.2) \approx 3.1 = \overset{y}{ˉ}$ , so calculations are correct.

Exam tip: Always round your slope and intercept to 2-3 significant digits matching the input data; over-rounding early leads to calculation errors, and too many digits wastes time on the exam.

3. Interpreting Slope and Intercept in Context

One of the most frequently tested skills on the AP Statistics exam is correctly interpreting the slope and intercept of a linear regression model in context. Unlike pure math problems, AP requires interpretation tied directly to the scenario, not just a generic description.

The slope $b_{1}$ is the predicted average change in the response variable $y$ for a 1-unit increase in the explanatory variable $x$ . It always has units of (units of y) per (unit of x). The intercept $b_{0}$ is the predicted average value of $y$ when $x = 0$ . The intercept only has a practical, meaningful interpretation if $x = 0$ is a plausible possible value in the context of the problem. If $x = 0$ is impossible or far outside the range of observed data, the intercept is only a mathematical anchor for the line and has no practical meaning.

Full credit for interpretation questions on the AP exam always requires two key phrases: "predicted" (or "estimated") and "on average", because regression models predict the average $y$ for a given $x$ , not an exact $y$ for any single individual.

Worked Example

Using the LSRL from the previous example: $\overset{y}{^} = 3.80 - 0.085 x$ , where $x$ is weekly video game play in hours and $\overset{y}{^}$ is predicted college GPA. Interpret the slope and intercept in context, and state if the intercept is practically meaningful.

Interpret the slope: The slope of -0.085 means that for each additional 1 hour of weekly video game play, the predicted average college GPA decreases by 0.085 points (on the 4.0 scale).
Interpret the intercept: The intercept of 3.80 means that for a student who plays 0 hours of video games per week, the predicted average GPA is 3.80.
Check for meaningfulness: 0 hours of weekly video game play is a plausible value for a college student, so the intercept has a practical interpretation in this context. If the explanatory variable was "height of adult men" instead, $x = 0$ would be impossible, and the intercept would have no practical meaning.

Exam tip: If you are asked to compare slopes of two models, a steeper slope (larger absolute value) always means a larger predicted change in $y$ per 1-unit change in $x$ , regardless of the sign.

4. Residual Analysis and Coefficient of Determination

After fitting a linear regression model, we need to check if a linear model is actually appropriate for the data, and measure how much variation in $y$ the model explains. This is done with residual plots and the coefficient of determination ( $R^{2}$ ).

A residual plot graphs residuals on the $y$ -axis against the explanatory variable $x$ on the $x$ -axis. For a linear model to be appropriate, residuals should be randomly scattered around the horizontal line at 0 with no clear pattern. A curved pattern means the true relationship between $x$ and $y$ is non-linear, so a linear model is a poor fit. A fan-shaped pattern (residuals getting wider or narrower as $x$ increases) means non-constant error variance, which violates regression assumptions.

The coefficient of determination $R^{2}$ (equal to $r^{2}$ for simple linear regression) measures the proportion of variation in the response variable $y$ that is explained by the linear relationship with $x$ . It ranges from 0 (no linear explanation) to 1 (all variation explained), or 0% to 100% when expressed as a percentage. Higher $R^{2}$ means a stronger linear relationship.

Worked Example

A botanist fits a linear regression model to data on tree age (x, years) and tree height (y, meters) for trees aged 1 to 50 years. Her residual plot shows residuals that are negative for young trees, positive for middle-aged trees, and negative again for old trees, forming a clear hump shape. $r = 0.82$ between age and height. What does the residual plot tell you about model fit? Calculate and interpret $R^{2}$ .

The clear curved hump pattern in the residual plot indicates that a linear model is not appropriate for this relationship. This matches what we know about tree growth: trees grow quickly when young, level off when mature, so the relationship is curved, not linear.
Calculate $R^{2}$ : $R^{2} = r^{2} = (0.82)^{2} = 0.6724 = 67.24%$ .
Interpret $R^{2}$ : Approximately 67% of the variation in tree height is explained by the linear relationship with tree age. Even though the linear relationship is strong, the curved pattern means a non-linear model would fit better.

Exam tip: Residual plots only check if a linear model is appropriate, not how strong the relationship is. A weak linear relationship can still have a random residual pattern, meaning a linear model is appropriate but not very predictive.

5. Common Pitfalls (and how to avoid them)

Wrong move: Swapping $s_{y}$ and $s_{x}$ when calculating slope, using $b_{1} = r (s_{x} / s_{y})$ instead of $r (s_{y} / s_{x})$ . Why: Correlation is symmetric, but regression is not, and students often mix up which variable is which. Correct move: Always remember slope is change in predicted y per change in x, so the standard deviation of y goes on top: $b_{1} = r (s_{y} / s_{x})$ .
Wrong move: Interpreting slope as an exact change for every individual, e.g. "each extra hour of video games lowers your GPA by 0.085 points". Why: Students forget regression predicts the average response, not an exact outcome for every person. Correct move: Always include "predicted" and "on average" in any slope or intercept interpretation to earn full credit.
Wrong move: Claiming that a strong regression relationship (high $R^{2}$ , large slope) proves that changes in x cause changes in y. Why: Students confuse association (what regression measures) with causation, which requires a randomized experiment. Correct move: Always explicitly state that causation can only be concluded if the data comes from a randomized experiment; otherwise, we only have evidence of an association.
Wrong move: Interpreting the intercept as meaningful even when $x = 0$ is impossible, e.g. interpreting an intercept for a regression of weight on height of adults. Why: Students think every intercept needs an interpretation by default. Correct move: If $x = 0$ is not a plausible value in context, explicitly state that the intercept has no practical interpretation.
Wrong move: Extrapolating predictions far outside the range of observed x values. Why: Students assume the linear relationship holds everywhere, which is almost never true. Correct move: Always check if the x value you are predicting for is within the range of the original data; if it is far outside, note that the prediction is unreliable.
Wrong move: Stating that a random residual pattern means the relationship is strong. Why: Students confuse model adequacy (linearity) with strength of relationship. Correct move: Random residuals only mean the linear model is appropriate; strength is measured by $r$ or $R^{2}$ .

6. Practice Questions (AP Statistics Style)

Question 1 (Multiple Choice)

A property analyst builds a model to predict home price (y, in thousands of dollars) from the size of the home (x, in square feet). She calculates the following summary statistics: $\overset{x}{ˉ} = 1800$ sq ft, $s_{x} = 400$ sq ft, $\overset{y}{ˉ} = 450$ thousand dollars, $s_{y} = 120$ thousand dollars, $r = 0.75$ . What is the correct equation for the least squares regression line?

A) $\overset{y}{^} = 225 + 0.25 x$ B) $\overset{y}{^} = 450 + 0.75 x$ C) $\overset{y}{^} = 112.5 + 0.225 x$ D) $\overset{y}{^} = - 225 + 0.225 x$

Worked Solution: First calculate the slope: $b_{1} = r (s_{y} / s_{x}) = 0.75 * (120/400) = 0.75 * 0.3 = 0.225$ . This eliminates options A and B, which have incorrect slopes. Next calculate the intercept: $b_{0} = \overset{y}{ˉ} - b_{1} \overset{x}{ˉ} = 450 - (0.225 * 1800) = 450 - 405 = - 225$ . The correct equation is $\overset{y}{^} = - 225 + 0.225 x$ . Correct answer: D.

Question 2 (Free Response)

A coffee shop owner studies the relationship between daily high temperature (x, in °F) and the number of hot coffee drinks sold (y) per day. She collects 30 days of data, with summary statistics: $\overset{x}{ˉ} = 70° F$ , $s_{x} = 15° F$ , $\overset{y}{ˉ} = 120$ drinks, $s_{y} = 25$ drinks, $r = - 0.72$ . (a) Calculate the equation of the least squares regression line for predicting number of hot drinks from high temperature. Show all work. (b) Interpret the slope of your regression line in context. (c) On a day with a high temperature of 90°F, the shop sold 82 hot drinks. Calculate and interpret the residual for this day.

Worked Solution: (a) Slope calculation: $b_{1} = r \frac{s _{y}}{s _{x}} = - 0.72 \cdot \frac{25}{15} = - 0.72 \cdot 1.667 \approx - 1.2$ . Intercept calculation: $b_{0} = \overset{y}{ˉ} - b_{1} \overset{x}{ˉ} = 120 - (- 1.2) (70) = 120 + 84 = 204$ . Final equation: $\overset{y}{^} = 204 - 1.2 x$ , where $\overset{y}{^}$ is predicted number of hot drinks sold and $x$ is daily high temperature in °F. (b) The slope of -1.2 means that for each 1°F increase in daily high temperature, the predicted average number of hot coffee drinks sold decreases by 1.2 drinks per day. (c) Predicted number of drinks for 90°F: $\overset{y}{^} = 204 - 1.2 (90) = 204 - 108 = 96$ . Residual: $e = y - \overset{y}{^} = 82 - 96 = - 14$ . Interpretation: The coffee shop sold 14 fewer hot drinks than the linear regression model predicted for a day with a high temperature of 90°F.

Question 3 (Application / Real-World Style)

A hydrologist studies the relationship between annual rainfall (x, in inches) and annual stream flow (y, in cubic feet per second average) for a small river. She collects 10 years of data, with rainfall ranging from 25 inches to 55 inches. The resulting LSRL is $\overset{y}{^} = 120 + 8.2 x$ , $R^{2} = 0.89$ , and the residual plot shows random scatter around zero with no clear pattern. Is a linear model appropriate for this relationship? Interpret $R^{2}$ in context, and predict the average annual stream flow for a year with 40 inches of rainfall.

Worked Solution: The residual plot shows no clear pattern and random scatter around zero, so a linear model is appropriate for this relationship. $R^{2} = 0.89$ means that 89% of the variation in annual average stream flow for this river is explained by the linear relationship with annual rainfall. Predicting for 40 inches of rainfall: $\overset{y}{^} = 120 + 8.2 (40) = 120 + 328 = 448$ cubic feet per second. In context, the predicted average annual stream flow for this river in a year with 40 inches of rainfall is 448 cubic feet per second.

7. Quick Reference Cheatsheet

Category	Formula	Notes
Population Linear Model	$y = β_{0} + β_{1} x + ε$	True model for the full population; $ε$ = random error term
Estimated LSRL	$\overset{y}{^} = b_{0} + b_{1} x$	Estimated from sample data; $\overset{y}{^}$ = predicted response
LSRL Slope	$b_{1} = r \cdot \frac{s _{y}}{s _{x}}$	For predicting $y$ from $x$ ; swapping variables changes slope
LSRL Intercept	$b_{0} = \overset{y}{ˉ} - b_{1} \overset{x}{ˉ}$	LSRL always passes through $(\overset{x}{ˉ}, \overset{y}{ˉ})$
Residual	$e_{i} = y_{i} - \overset{y}{^}_{i}$	Observed $y$ minus predicted $y$
Coefficient of Determination	$R^{2} = r^{2}$	Proportion of variation in $y$ explained by the linear model; 0 ≤ R² ≤ 1
Slope Interpretation	N/A	Predicted average change in $y$ for a 1-unit increase in $x$
Intercept Interpretation	N/A	Predicted average $y$ when $x = 0$ ; only meaningful if $x = 0$ is plausible

8. What's Next

This chapter lays the foundational groundwork for all regression topics across the AP Statistics curriculum. Immediately after mastering simple linear regression models, you will move to studying more advanced model diagnostics, including outlier and influential point detection, which builds directly on the residual analysis you learned here. Later in the course, you will study inference for regression, which relies entirely on the structure and interpretation of the linear regression model you learned in this chapter. Without mastering this core topic, more complex topics like multiple regression and slope inference will be very difficult to master, as all inference for regression builds on these foundational skills.

← Back to topic

Stuck on a specific question?
Snap a photo or paste your problem — Ollie (our AI tutor) walks through it step-by-step with diagrams.
Try Ollie free →

Linear Regression Models — AP Statistics Study Guide

1. What Is Linear Regression Models?

2. The Least Squares Regression Line (LSRL)

Worked Example

3. Interpreting Slope and Intercept in Context

Worked Example

4. Residual Analysis and Coefficient of Determination

Worked Example

5. Common Pitfalls (and how to avoid them)

6. Practice Questions (AP Statistics Style)

Question 1 (Multiple Choice)

Question 2 (Free Response)

Question 3 (Application / Real-World Style)

7. Quick Reference Cheatsheet

8. What's Next

More study guides