Regression Outliers and Influential Points — AP Statistics Study Guide
For: students preparing for the AP Statistics exam.
Covers: Identification of regression outliers, high-leverage points, and influential points, leverage calculation, Cook's Distance for assessing influence, and interpretation of their impact on least-squares regression lines and summary statistics, aligned to CED Unit 2 learning objectives.
You should already know: How to calculate and interpret a least-squares regression line. How to calculate and interpret residuals from regression. How to interpret the correlation coefficient $r$.
A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board / Cambridge / IB papers and may differ in wording, numerical values, or context. Use them to practice the technique; cross-check with official mark schemes for grading conventions.
1. What Are Regression Outliers and Influential Points?
In AP Statistics, this subtopic falls within Unit 2: Exploring Two-Variable Data, which contributes 10-15% of the total AP exam score, with regression outliers and influential points making up roughly 1-3% of total exam points. This topic appears in both multiple-choice (MCQ) and free-response (FRQ) sections of the exam. When fitting a least-squares regression line (LSRL) to bivariate data, not all observations affect the line equally. Unusual points that deviate from the overall pattern of the data can change the LSRL’s slope, intercept, correlation, and predictive power in meaningful ways, so distinguishing between different types of unusual points is a core skill for regression analysis. A common misconception is that all unusual points are the same: in fact, we separate unusual points into three distinct categories: regression outliers, high-leverage points, and influential points, each with different properties and impacts on regression results.
2. Regression Outliers
A regression outlier is an observation that deviates substantially from the overall pattern of the data in the vertical (response, $y$) direction. In other words, the point has a large residual, meaning its observed $y$-value is very different from the $y$-value predicted by the LSRL fit to the full dataset. The residual for observation $i$ is defined as $e_i = y_i - \hat{y}_i$, where $y_i$ is the observed $y$-value and $\hat{y}_i$ is the predicted $y$-value from the LSRL. By the empirical rule, for roughly normally distributed residuals approximately 95% of all residuals will fall within 2 standard deviations of 0. The standard rule of thumb for identifying a regression outlier is that its absolute residual is larger than twice the standard deviation of the residuals ($s_e$): $|e_i| > 2s_e$. Regression outliers do not need to have extreme $x$-values, and most regression outliers that fall within the range of the other $x$-values have very little impact on the LSRL, so they are rarely influential.
Worked Example
We have bivariate data for 5 observations. The LSRL fit to these data has a standard deviation of residuals of $s_e = 1.08$. Identify the regression outlier.
- Calculate the residual for each observation, $e_i = y_i - \hat{y}_i$: the observed $y$-value minus the $y$-value predicted by the LSRL.
- Calculate the outlier cutoff: $2s_e = 2(1.08) = 2.16$.
- Compare all absolute residuals to the cutoff: only one observation has $|e_i| > 2.16$.
- Conclusion: that observation is the regression outlier.
Exam tip: Never identify an outlier just by how far it is from the origin. Always check residual size to confirm a point is a regression outlier, regardless of its position on the scatterplot.
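The residual check above can be sketched in a few lines of Python. The dataset and function names below are hypothetical illustrations (not the worked example's data); the flagging rule is the $|e_i| > 2s_e$ rule from this section.

```python
# Sketch: flag regression outliers with the |e_i| > 2*s_e rule.
# Hypothetical data: y = 2x + 1 exactly, with one point bumped far off the line.

def lsrl(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, ybar - slope * xbar

def flag_outliers(xs, ys):
    """Return (residuals, s_e, indices of points with |residual| > 2*s_e)."""
    n = len(xs)
    slope, intercept = lsrl(xs, ys)
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s_e = (sum(e ** 2 for e in resid) / (n - 2)) ** 0.5  # sqrt(SSE / (n - 2))
    flagged = [i for i, e in enumerate(resid) if abs(e) > 2 * s_e]
    return resid, s_e, flagged

xs = list(range(1, 11))
ys = [2 * x + 1 for x in xs]
ys[4] = 25.0                      # the point at x = 5 lies far above the pattern
resid, s_e, flagged = flag_outliers(xs, ys)
print(flagged)                    # only the bumped point is flagged
```

One caution worth noting: a single extreme point inflates $s_e$ itself, so in a very small dataset a genuine outlier can slip under the $2s_e$ cutoff; pair the rule with a look at the residual plot.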
3. High-Leverage Points
A high-leverage point is an observation that is extreme in the horizontal (explanatory, $x$) direction, meaning its $x$-value falls far outside the range of the other $x$-values in the dataset. Leverage quantifies how far an observation's $x$-value is from the mean of all $x$-values. The leverage of observation $i$ is $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$, where $n$ is the total number of observations and $\bar{x}$ is the mean of the $x$-values. For simple linear regression (one predictor), the rule of thumb for high leverage is $h_i > \frac{4}{n}$, which comes from the general rule $h_i > \frac{2(k+1)}{n}$ for $k$ predictors (here $k = 1$). Intuitively, high-leverage points have more "pull" on the LSRL because the LSRL minimizes the sum of squared errors: a point far out in the $x$-direction can drag the line toward itself much more easily than a point in the middle of the $x$-range. A high-leverage point is not automatically an outlier or influential, however. If a high-leverage point follows the pattern of the rest of the data, it will not change the LSRL much.
Worked Example
We have 6 observations, one of which has an $x$-value far outside the range of the other five. Determine whether that point is high leverage.
- Calculate the mean of the $x$-values, $\bar{x}$.
- Calculate the sum of squared deviations of the $x$-values, $\sum_{j}(x_j - \bar{x})^2$.
- Calculate the leverage of the suspect point: $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j}(x_j - \bar{x})^2}$.
- Calculate the high-leverage cutoff for $n = 6$: $\frac{4}{n} = \frac{4}{6} \approx 0.667$.
- Compare: the point's leverage exceeds $0.667$, so the point is high leverage.
Exam tip: Always check the $x$-value range to identify high leverage. A point can have a very small residual (not an outlier) and still be high leverage.
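Because leverage depends on the $x$-values alone, it can be checked before ever looking at $y$. A minimal sketch, using hypothetical $x$-values and the $h_i > 4/n$ cutoff from this section:

```python
# Sketch: compute leverage for each observation and flag high-leverage points.
# The x-values are hypothetical; x = 10 sits far outside the range of the rest.

def leverages(xs):
    """h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2 for each observation."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [1 / n + (x - xbar) ** 2 / sxx for x in xs]

xs = [1, 2, 3, 4, 10]
h = leverages(xs)
cutoff = 4 / len(xs)              # 2(k+1)/n with k = 1 predictor
high = [i for i, hi in enumerate(h) if hi > cutoff]
print(high)                       # only the point at x = 10 exceeds the cutoff
```

A useful sanity check: in simple linear regression the leverages always sum to 2 ($k + 1$ parameters), so each point's "fair share" is $2/n$ and the cutoff flags points carrying more than twice that share.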
4. Influential Points
An influential point is an observation that, when removed from the dataset, causes a substantial change to one or more key regression parameters: slope, intercept, correlation coefficient, or coefficient of determination. A point that is both a regression outlier and high leverage is almost always influential, but we can formally measure influence with Cook's Distance, which combines residual size and leverage into a single metric. Cook's Distance for observation $i$ is $D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_i}{(1 - h_i)^2}$, where $p$ is the number of regression parameters ($p = 2$ for simple linear regression: intercept + slope), $e_i$ is the residual, $MSE$ is the mean squared error of the regression, and $h_i$ is the leverage. The standard rule of thumb is that a point is influential if $D_i > 1$; a more conservative cutoff of $D_i > 4/n$ is often used for small datasets. Intuitively, Cook's D captures how much all predicted $\hat{y}$-values would change if the $i$-th point were removed, so it directly measures the point's overall impact on the regression.
Worked Example
Using the 6-point dataset from the previous section, suppose the high-leverage point's residual $e_i$, its leverage $h_i$, and the regression's mean squared error ($MSE$) have already been calculated. Use Cook's Distance to confirm whether this point is influential.
- Plug the values into the Cook's D formula, with $p = 2$ for simple linear regression: $D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_i}{(1 - h_i)^2}$
- Calculate the first term, $\frac{e_i^2}{p \cdot MSE}$, which scales the squared residual by the model's error variance.
- Calculate the second term, $\frac{h_i}{(1 - h_i)^2}$, which grows rapidly as the leverage approaches 1.
- Multiply the two terms to obtain $D_i$.
- Compare to the cutoff of 1: $D_i > 1$, so the point is influential. This matches the fact that removing the point changes the slope from 0.5 to 2, a fourfold change.
Exam tip: AP graders require you to link influence to a change in regression parameters. Always state how much the slope/intercept changes when you remove the point to justify your conclusion that it is influential.
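The whole Cook's Distance pipeline can be sketched end to end. The data and function name below are hypothetical illustrations (not the worked example's values); the formula is the one given above, with $p = 2$.

```python
# Sketch: Cook's Distance D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2
# for every point in a simple linear regression (p = 2 parameters).

def cooks_distances(xs, ys):
    n, p = len(xs), 2
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    intercept = ybar - slope * xbar
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in resid) / (n - p)          # SSE / (n - p)
    h = [1 / n + (x - xbar) ** 2 / sxx for x in xs]     # leverage
    return [(e ** 2 / (p * mse)) * (hi / (1 - hi) ** 2)
            for e, hi in zip(resid, h)]

# Hypothetical data: y = 2x for five points, plus one far-out point that is
# both high leverage (x = 15) and a regression outlier (y well off the trend).
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 4, 6, 8, 10, 5]
d = cooks_distances(xs, ys)
print([round(di, 2) for di in d])  # only the last point has D_i > 1
```

Both ingredients multiply: the far-out point's large residual and its leverage near 1 push $D_i$ well past the cutoff, while a point extreme in only one sense usually stays below it.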
5. Common Pitfalls (and how to avoid them)
- Wrong move: Calling any point far from the origin an influential point. Why: Students confuse distance from the origin with influence, mixing up x and y extremes. Correct move: Check if removing the point substantially changes the regression slope/intercept; only then label it influential.
- Wrong move: Assuming all high-leverage points are influential. Why: Students think any extreme x is automatically influential, but high-leverage points that follow the pattern of the other points do not change the slope. Correct move: After identifying high leverage, check if the point follows the pattern of the rest of the data to confirm influence.
- Wrong move: Assuming all regression outliers are influential. Why: Students mix up vertical outliers with influential points; an outlier in y that is in the middle of the x-range rarely has meaningful pull on the LSRL. Correct move: Always check leverage of an outlier to see if it can meaningfully change the regression line.
- Wrong move: Automatically removing any influential point from the dataset. Why: Students think unusual points are mistakes and must be removed, but influential points can be valid data that reveal important patterns. Correct move: Only remove an influential point if it is confirmed to be a measurement error or data entry error; otherwise, report regression results with and without the point included.
- Wrong move: Identifying an outlier by eye from a scatterplot without checking residual size. Why: What looks like an outlier visually can be within the $2s_e$ range for residuals, especially with large datasets. Correct move: Always calculate residuals and compare to $2s_e$ to confirm a regression outlier.
- Wrong move: Confusing correlation coefficient change with slope change when assessing influence. Why: Students assume any change in r means a matching change in slope, which is not always true. Correct move: Explicitly check the change in slope, the key parameter of interest in regression, when describing the impact of an influential point.
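The "report results with and without the point" advice above is easy to automate: refit the line with the suspect point removed and compare slopes. A sketch with hypothetical data:

```python
# Sketch: quantify influence by refitting the LSRL without the suspect point.
# Hypothetical data: five points on y = 2x exactly, plus one far-out point.

def lsrl(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return slope, ybar - slope * xbar

xs = [1, 2, 3, 4, 5, 15]
ys = [2, 4, 6, 8, 10, 5]

slope_all, _ = lsrl(xs, ys)            # fit using every point
slope_wo, _ = lsrl(xs[:-1], ys[:-1])   # fit with the suspect point removed
print(round(slope_all, 3), round(slope_wo, 3))
```

The large gap between the two slopes is the evidence of influence; whether the far-out point then stays in the dataset is a data-quality question (was it an error?), not a statistical one.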
6. Practice Questions (AP Statistics Style)
Question 1 (Multiple Choice)
A researcher fits a least-squares regression line to study the relationship between hours studied (x) and test score (y) for 18 introductory statistics students. All students except one studied between 2 and 6 hours for the test. One student studied 0 hours and scored a 20, which falls exactly on the regression line fit to the other 17 students. Which of the following correctly describes this point?
A) It is a high-leverage, influential regression outlier
B) It is a high-leverage point that is not an outlier and not influential
C) It is a regression outlier that is not high leverage and not influential
D) It is neither high leverage, an outlier, nor influential
Worked Solution: First, the point has an x-value (0 hours) far outside the range of all other x-values (2 to 6 hours), so it is high leverage. This eliminates options C and D. Next, the point follows the pattern of the other students, so its residual is very small, meaning it is not a regression outlier. Because it aligns with the existing pattern, removing it will not meaningfully change the slope or intercept of the regression line, so it is not influential. The correct answer is B.
Question 2 (Free Response)
A marine biologist studies the relationship between the length of an adult great white shark (x, in feet) and its weight (y, in pounds) for 7 adult sharks. Six of the sharks are between 15 and 18 feet long; the seventh is 12 feet long, and its weight deviates substantially from the pattern of the others. (a) Classify the 12-foot shark's point as an outlier, high leverage, and/or influential. Justify your classification. (b) Removing the 12-foot shark from the dataset changes the slope of the LSRL by 50 pounds per foot and the intercept by 900 pounds. What does this comparison tell you about the influence of the point? (c) Under what conditions is it statistically justified for the biologist to remove the point from the analysis?
Worked Solution: (a) The x-value of 12 feet is far below the range of the other x-values (15 to 18 feet), so the point is high leverage. Its absolute residual under the full LSRL is more than twice the standard deviation of the residuals ($|e_i| > 2s_e$), so it is also a regression outlier. Since it is both high leverage and a regression outlier, it is influential. (b) The slope increases by 50 units (a 25% change) and the intercept changes by 900 units (an 82% change) when the point is removed. These large, substantive changes to the regression parameters confirm the point is highly influential. (c) Removal is only justified if the point is confirmed to be an error: for example, if the length or weight was mismeasured or recorded incorrectly. If the point is a valid measurement of a smaller adult shark, it should be retained, and the biologist should note that the relationship between length and weight differs for smaller sharks.
Question 3 (Application / Real-World Style)
A local coffee shop owner records the relationship between daily average temperature (x, °F) and the number of hot coffees sold (y) for 25 non-holiday business days. All days have temperatures between 45°F and 85°F, except one day with an unexpected snowstorm that had an average temperature of 32°F. On the snowstorm day, the shop sold 310 hot coffees. The LSRL fit to all 25 days is , with a correlation coefficient of . After removing the snowstorm day, the LSRL becomes , with . Is the snowstorm day an influential point? Justify your answer, and interpret its impact on the regression results.
Worked Solution: First, the temperature of 32°F is far outside the range of all other daily temperatures, so the point is high leverage. Removing the point changes the slope from -3.2 to -2.5 (a 22% change) and changes the correlation from -0.62 to -0.48, which is a substantial change to the strength of the observed relationship. Because removing the point causes large changes to key regression parameters, the snowstorm day is influential. In context, the influential snowstorm day makes the negative relationship between temperature and hot coffee sales look much stronger than it is on typical non-snowstorm days.
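The percent changes cited in the solution are simple relative-change calculations. A quick check, using the slope and correlation values given in the question:

```python
# Verify the relative changes quoted in the solution (values from the question).
slope_all, slope_wo = -3.2, -2.5      # slope with and without the snowstorm day
r_all, r_wo = -0.62, -0.48            # correlation with and without

pct_slope = abs(slope_wo - slope_all) / abs(slope_all) * 100
pct_r = abs(r_wo - r_all) / abs(r_all) * 100
print(round(pct_slope, 1), round(pct_r, 1))
```

The slope change works out to roughly 22%, matching the figure in the solution, and the correlation weakens by a comparable relative amount.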
7. Quick Reference Cheatsheet
| Category | Formula / Definition | Notes |
|---|---|---|
| Regression residual | $e_i = y_i - \hat{y}_i$ | Measures vertical deviation from the LSRL; used to identify regression outliers. |
| Regression outlier rule | $\lvert e_i \rvert > 2s_e$ | Absolute residual more than twice the standard deviation of the residuals. |
| Leverage | $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$ | Measures how extreme the x-value is relative to other observations. |
| High-leverage cutoff (simple linear) | $h_i > \frac{4}{n}$ | For multiple regression, use $h_i > \frac{2(k+1)}{n}$ where $k$ = number of predictors. |
| Cook's Distance | $D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_i}{(1-h_i)^2}$ | Combines residual and leverage to measure overall influence on regression. |
| Influential point cutoff | $D_i > 1$ | Use $D_i > 4/n$ as a more conservative cutoff for small datasets. |
| Regression Outlier | Unusual in y-direction | Large residual, x within range of other x-values; usually not influential. |
| High-Leverage Point | Unusual in x-direction | Extreme x outside range of other x-values; not automatically influential. |
| Influential Point | Causes substantial change to regression parameters | Almost always influential if both outlier and high leverage; confirm by checking change after removal. |
| Justification for Removing an Influential Point | Only if measurement/data entry error | Valid influential points should not be removed; report results with and without inclusion. |
8. What's Next
This topic is a critical prerequisite for the remaining topics in Unit 2 and for all regression-based inference in Unit 5. Unusual points can completely change the slope of a regression line and invalidate the results of inference for slope, so failing to correctly identify influential points leads to incorrect conclusions about the relationship between variables. This skill also extends to multiple regression later in the course, where influential points can confound relationships between multiple predictors. Next you will apply the skills of identifying unusual points to check the conditions for regression inference, which cannot be done correctly without recognizing how influential points distort linearity and other model assumptions.