AP · Least Squares Regression · 14 min read · Updated 2026-05-10

Least Squares Regression — AP Statistics Study Guide

For: AP Statistics candidates sitting AP Statistics.

Covers: The least squares criterion, slope and intercept formulas for the least squares regression line (LSRL), residual calculation, sum of squared residuals, coefficient interpretation, and connecting sum of squared errors to model fit for linear bivariate data.

You should already know: How to create and interpret scatterplots for bivariate data, how to calculate and interpret the correlation coefficient $r$ , how to work with linear equations in slope-intercept form.

A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board / Cambridge / IB papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official mark schemes for grading conventions.

1. What Is Least Squares Regression?

Least squares regression (abbreviated LSR, with the resulting model called the least squares regression line, LSRL) is the standard objective method for fitting a straight line to bivariate quantitative data. According to the AP Statistics Course and Exam Description (CED), this topic makes up approximately 2-3% of total exam weight, and appears in both multiple-choice (MCQ) and free-response (FRQ) sections, almost always as part of a larger two-variable data question.

The core goal of LSR is to find the "best fitting" line for predicting values of a response variable $y$ from an explanatory variable $x$ . Unlike a line fit by eye, least squares uses a formal, replicable criterion to define "best": it minimizes the sum of the squared vertical distances (called residuals) between observed $y$ -values and $y$ -values predicted by the line. This method is preferred because it has favorable statistical properties, is easy to compute from summary statistics, and is the foundation for all further regression analysis in AP Statistics. On the exam, you will be expected to calculate LSRL coefficients, interpret them in context, calculate residuals, and explain the least squares criterion.

2. Least Squares Criterion and LSRL Coefficient Formulas

For any linear model predicting $y$ from $x$ , we write the line as $\overset{y}{^} = a + b x$ , where $\overset{y}{^}$ is the predicted value of the response variable, $b$ is the slope, and $a$ is the y-intercept. A residual $e_{i}$ is the vertical difference between the observed $y_{i}$ and predicted $\overset{y}{^}_{i}$ for the $i$ -th data point, defined as $e_{i} = y_{i} - \overset{y}{^}_{i}$ .

The least squares criterion states that the best fitting line is the one that minimizes the sum of squared residuals (called SSE, or sum of squared errors): $S S E = i = 1 \sum n e_{i}^{2} = i = 1 \sum n (y_{i} - \overset{y}{^}_{i})^{2} = i = 1 \sum n (y_{i} - a - b x_{i})^{2}$

Using calculus to minimize this expression gives the closed-form formulas for the LSRL coefficients. The slope $b$ is calculated as: $b = r \frac{s _{y}}{s _{x}}$ where $r$ is the correlation between $x$ and $y$ , $s_{y}$ is the standard deviation of $y$ , and $s_{x}$ is the standard deviation of $x$ . The slope scales with correlation: if $r = 0$ , slope is 0, meaning no linear relationship. The slope is also adjusted for the units of $x$ and $y$ , so it has units of (y units per x unit), as expected.

A key property of the LSRL is that it always passes through the point of the means $(\overset{x}{ˉ}, \overset{y}{ˉ})$ , which we use to calculate the intercept $a$ : $a = \overset{y}{ˉ} - b \overset{x}{ˉ}$

Worked Example

For a study of weekly hours of exercise vs. resting heart rate, the following summary statistics are given:

$\overset{x}{ˉ} = 3.5$ hours/week, $s_{x} = 2.1$
$\overset{y}{ˉ} = 68$ beats per minute (bpm), $s_{y} = 5.2$
Correlation $r = - 0.68$

Find the equation of the least squares regression line for predicting resting heart rate from weekly exercise hours.

Label variables: $x$ = weekly exercise (explanatory), $y$ = resting heart rate (response). Calculate slope first.
Slope calculation: $b = r \frac{s _{y}}{s _{x}} = - 0.68 \times \frac{5.2}{2.1} \approx - 0.68 \times 2.476 \approx - 1.68$
Intercept calculation: $a = \overset{y}{ˉ} - b \overset{x}{ˉ} = 68 - (- 1.68) (3.5) = 68 + 5.88 = 73.88$
Final LSRL equation: $\overset{y}{^} = 73.88 - 1.68 x$

Exam tip: Always confirm the sign of your slope matches the sign of the correlation. A negative correlation should always give a negative slope, and a positive correlation gives a positive slope; this is a quick check to catch calculation errors.

3. Residual Calculation and Interpretation

Residuals measure the error of our LSRL predictions: they tell us how far off the line is for each observed data point. As noted earlier, a residual is defined as $e_{i} = y_{i} - \overset{y}{^}_{i}$ : observed $y$ minus predicted $y$ . A positive residual means the observed $y$ is higher than the line predicted (the line underpredicts $y$ ), while a negative residual means the observed $y$ is lower than predicted (the line overpredicts $y$ ).

A key property of least squares residuals is that their sum is always zero: $\sum e_{i} = 0$ , because the LSRL is centered on the point of the means. SSE (sum of squared residuals) is the total squared prediction error, so lower SSE means a better fitting linear model, and higher SSE means a worse fitting model. On the AP exam, calculating and interpreting a residual from a given LSRL is a very common exam question, appearing in both MCQ and short FRQ parts.

Worked Example

Using the LSRL from the previous example ( $\overset{y}{^} = 73.88 - 1.68 x$ , where $x$ = weekly exercise hours, $\overset{y}{^}$ = predicted resting heart rate in bpm), find and interpret the residual for a person who exercises 5 hours per week and has a resting heart rate of 64 bpm.

Calculate the predicted resting heart rate for $x = 5$ : $\overset{y}{^} = 73.88 - 1.68 (5) = 73.88 - 8.4 = 65.48$ bpm
Calculate residual: $e = y - \overset{y}{^} = 64 - 65.48 = - 1.48$ bpm
Interpretation: This person's resting heart rate is 1.48 bpm lower than the least squares regression line predicted based on their weekly exercise time.

Exam tip: If you are asked to plot a residual, the x-coordinate is the x-value of the original observation, and the y-coordinate is the residual, not the original $y$ -value.

4. Interpreting LSRL Slope and Intercept

AP exam questions almost always require you to interpret the slope and intercept of a LSRL in context, and this is a common place for students to lose points for incomplete or incorrect wording. The interpretation rules are strict on the exam, so it is important to use the correct phrasing.

The slope $b$ is the predicted average change in the response variable $y$ for a 1-unit increase in the explanatory variable $x$ . It is critical to note that this describes an average trend across all observations, not a guaranteed change for every individual, and that it only describes association, not causation, unless the data comes from a randomized experiment.

The intercept $a$ is the predicted average value of $y$ when $x = 0$ . The intercept is only practically meaningful if $x = 0$ is a plausible value within the range of your data. If $x = 0$ is outside the range of observed $x$ -values, the intercept is just a mathematical anchor for the line and has no practical interpretation.

Worked Example

A LSRL for predicting the height of a pine seedling (in cm) from the amount of water it receives per week (in mL) is $\overset{y}{^} = 2.1 + 0.12 x$ . Interpret the slope and intercept of this line in context, and state whether the intercept is meaningful.

Slope interpretation: For each additional 1 mL of water given per week, the predicted average height of a pine seedling after one month increases by 0.12 cm.
Intercept interpretation: The predicted average height of a pine seedling that receives 0 mL of water per week is 2.1 cm.
Since 0 mL of water per week is a plausible treatment (it means no water added), the intercept is meaningful in this context. If the study only included treatments from 10 mL to 50 mL of water, 0 mL would be outside the range of data, and the intercept would not be practically meaningful.

Exam tip: Always include units of measurement for both x and y in your interpretation, and always use the phrase "predicted average" to avoid incorrect claims about individual changes or causation.

5. Common Pitfalls (and how to avoid them)

Wrong move: Calculating residual as $\overset{y}{^}_{i} - y_{i}$ instead of $y_{i} - \overset{y}{^}_{i}$ . Why: Students mix up the order of terms because they write predicted $y$ first in the regression equation. Correct move: Memorize the phrase "residual equals observed minus predicted" and repeat it every time you calculate a residual.
Wrong move: Using $r \frac{s _{x}}{s _{y}}$ for slope instead of $r \frac{s _{y}}{s _{x}}$ when calculating the LSRL. Why: Students swap the order of standard deviations because they forget which variable is which. Correct move: Label explanatory $x$ and response $y$ before starting calculations, then remember slope = r times (sd of response over sd of explanatory).
Wrong move: Interpreting slope as a guaranteed change for any individual, e.g. "one more mL of water will make the seedling 0.12 cm taller". Why: Students forget the LSRL models the average trend, not individual outcomes. Correct move: Always include the words "predicted average" when describing the change in y for a 1-unit increase in x.
Wrong move: Interpreting the intercept as meaningful even when x=0 is not a plausible value in context. Why: Students think all coefficients require a practical interpretation. Correct move: Always check if x=0 is a reasonable, possible value before interpreting the intercept; if not, explicitly state that the intercept has no practical interpretation.
Wrong move: Writing the LSRL as $y = a + b x$ instead of $\overset{y}{^} = a + b x$ . Why: Students confuse observed response values with predicted response values. Correct move: Always use hat notation on $y$ in the LSRL equation to indicate it represents predicted values.
Wrong move: Using causal language when interpreting regression coefficients from an observational study. Why: Students confuse association (measured by regression) with causation, which only can be inferred from randomized experiments. Correct move: Use only "associated with" or "predicted" when describing the relationship from observational data.

6. Practice Questions (AP Statistics Style)

Question 1 (Multiple Choice)

A researcher studies the relationship between daily high temperature (x, in degrees Celsius) and the number of cups of hot chocolate sold (y) at a mountain cafe. Summary statistics are: $\overset{x}{ˉ} = 1 0^{\circ} C$ , $s_{x} = 5$ , $\overset{y}{ˉ} = 45$ cups, $s_{y} = 12$ , $r = - 0.75$ . What is the correct equation of the LSRL for predicting number of hot chocolates sold from daily temperature? A) $\overset{y}{^} = 63 - 1.8 x$ B) $\overset{y}{^} = 45 - 0.31 x$ C) $\overset{y}{^} = - 1.8 + 63 x$ D) $\overset{y}{^} = 27 - 0.31 x$

Worked Solution: First, calculate the slope using the LSRL formula: $b = r \frac{s _{y}}{s _{x}} = - 0.75 \times \frac{12}{5} = - 0.75 \times 2.4 = - 1.8$ . This eliminates options B, C, and D, which have slopes of 0.31, 63, and 0.31 respectively. Next, calculate the intercept: $a = \overset{y}{ˉ} - b \overset{x}{ˉ} = 45 - (- 1.8) (10) = 45 + 18 = 63$ . The correct equation is $\overset{y}{^} = 63 - 1.8 x$ . The correct answer is A.

Question 2 (Free Response)

A sociologist studies the relationship between median household income (x, in thousands of dollars) and average life expectancy (y, in years) for 50 zip codes in a large US state. Summary statistics are: $\overset{x}{ˉ} = 65$ , $s_{x} = 18$ , $\overset{y}{ˉ} = 78$ , $s_{y} = 4.5$ , $r = 0.72$ . (a) Calculate the equation of the least squares regression line for predicting life expectancy from median income. Show all work. (b) Interpret the slope of your regression line in context. (c) One zip code has a median income of $80,000 and an average life expectancy of 79.2 years. Calculate and interpret the residual for this zip code.

Worked Solution: (a) First calculate slope: $b = r \frac{s _{y}}{s _{x}} = 0.72 \times \frac{4.5}{18} = 0.72 \times 0.25 = 0.18$ . Then calculate intercept: $a = \overset{y}{ˉ} - b \overset{x}{ˉ} = 78 - (0.18) (65) = 78 - 11.7 = 66.3$ . The LSRL is $\overset{y}{^} = 66.3 + 0.18 x$ , where $\overset{y}{^}$ is predicted average life expectancy in years, and x is median household income in thousands of dollars. (b) For each additional $1000 in median household income in a zip code, the predicted average life expectancy increases by 0.18 years. (c) Predicted life expectancy for x = 80: $\overset{y}{^} = 66.3 + 0.18 (80) = 66.3 + 14.4 = 80.7$ years. Residual: $e = y - \overset{y}{^} = 79.2 - 80.7 = - 1.5$ years. Interpretation: The average life expectancy for this zip code is 1.5 years lower than the LSRL predicted based on its median household income.

Question 3 (Application / Real-World Style)

A bakery owner collects data on the number of ounces of sugar used per batch of cookies (x) and the customer rating of the batch (y, on a 1-10 scale). The LSRL is calculated as $\overset{y}{^} = 2.1 + 0.8 x$ , and the sum of the residuals for the 8 batches tested is 0. The sum of the squared residuals is 3.2. What is the sum of squared errors for this model, and what does this value mean in context?

Worked Solution: By definition, the sum of squared errors (SSE) is equal to the sum of squared residuals, so SSE = 3.2 for this model. SSE measures the total amount of variability in customer cookie ratings that is not explained by the linear relationship with amount of sugar per batch. A total SSE of 3.2 means that the total squared deviation between observed customer ratings and ratings predicted by the LSRL is 3.2 rating points squared; this is a relatively small SSE for 8 observations, indicating the linear model fits the data reasonably well.

7. Quick Reference Cheatsheet

Category	Formula / Rule	Notes
Residual	$e_{i} = y_{i} - \overset{y}{^}_{i}$	Positive = underprediction, negative = overprediction; always observed minus predicted
LSRL Slope	$b = r \frac{s _{y}}{s _{x}}$	$y$ = response, $x$ = explanatory; sign of slope matches sign of $r$
LSRL Intercept	$a = \overset{y}{ˉ} - b \overset{x}{ˉ}$	LSRL always passes through the point of the means $(\overset{x}{ˉ}, \overset{y}{ˉ})$
Least Squares Criterion	Minimize $S S E = \sum e_{i}^{2}$	Defines the "best fit" line for linear regression
Slope Interpretation	Predicted average change in $y$ for 1-unit increase in $x$	Only use causal language for randomized experiments
Intercept Interpretation	Predicted average $y$ when $x = 0$	Only meaningful if $x = 0$ is a plausible value in context
Sum of Residuals	$\sum e_{i} = 0$	Always true for any least squares regression line

8. What's Next

Least squares regression is the foundation for all further work with linear regression in AP Statistics, and it is a required prerequisite for every remaining topic in Unit 2: Exploring Two-Variable Data. Immediately after mastering LSR, you will analyze residual plots to check the conditions for linear regression, and calculate the coefficient of determination $R^{2}$ to quantify how much variability in the response variable is explained by the linear model. Without understanding how LSRL coefficients and SSE work, you will not be able to correctly interpret regression diagnostics or complete full regression analysis on FRQs. Beyond Unit 2, least squares regression is the basis for inference for regression slope in Unit 5, which makes up a significant portion of the AP exam.

Follow-on topics to study next: Residual Analysis Coefficient of Determination Inference for Regression Slope

← Back to topic

Stuck on a specific question?
Snap a photo or paste your problem — Ollie (our AI tutor) walks through it step-by-step with diagrams.
Try Ollie free →

Least Squares Regression — AP Statistics Study Guide

1. What Is Least Squares Regression?

2. Least Squares Criterion and LSRL Coefficient Formulas

Worked Example

3. Residual Calculation and Interpretation

Worked Example

4. Interpreting LSRL Slope and Intercept

Worked Example

5. Common Pitfalls (and how to avoid them)

6. Practice Questions (AP Statistics Style)

Question 1 (Multiple Choice)

Question 2 (Free Response)

Question 3 (Application / Real-World Style)

7. Quick Reference Cheatsheet

8. What's Next

More study guides