Correlation — AP Statistics Study Guide
For: Candidates sitting the AP Statistics exam.
Covers: Pearson product-moment correlation coefficient ($r$), its z-score and deviation formulas, hand calculation, properties of $r$, interpretation of strength and direction of linear association, and common exam pitfalls including correlation vs causation.
You should already know: Scatterplots and description of two-variable association, z-scores for standardization, calculation of means and sample standard deviations for quantitative variables.
A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board / Cambridge / IB papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official mark schemes for grading conventions.
1. What Is Correlation?
Correlation measures the strength and direction of the linear relationship between two quantitative variables. It is a core topic in Unit 2: Exploring Two-Variable Data, which accounts for 5-7% of the total AP Statistics exam weight per the official College Board CED. Correlation appears in both multiple-choice (MCQ) and free-response (FRQ) sections of the exam: it is often tested as a standalone MCQ on properties or interpretation, or as a foundational component of longer FRQs on linear regression.
The most widely used correlation measure tested on the AP exam is the Pearson product-moment correlation coefficient, denoted $r$ for a sample and $\rho$ (rho) for a full population. The CED almost exclusively focuses on sample $r$ for this topic, so we will center our discussion on that. Unlike regression slope, $r$ has no units, so it is unaffected by unit changes or rescaling of either variable, and it always lies between -1 and 1 inclusive. A positive $r$ indicates positive linear association (as $x$ increases, $y$ tends to increase), while a negative $r$ indicates negative linear association. A value of $r = 0$ means no linear association between the two variables.
2. Calculating the Correlation Coefficient
There are two equivalent common forms of the correlation formula, both of which you may need to use on the AP exam. The z-score form gives clear intuition: correlation is essentially the average of the products of the standardized z-scores for the two variables (the sum is divided by $n - 1$ rather than $n$). The formula for sample $r$ is:

$$r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}\, z_{y_i}$$

where $n$ is the number of observations, $z_{x_i}$ is the z-score of the $i$-th $x$-value, and $z_{y_i}$ is the z-score of the $i$-th $y$-value. The equivalent deviation form, which is easier for hand calculation, is:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$

where $\bar{x}, \bar{y}$ are sample means, and $s_x, s_y$ are sample standard deviations. The numerator captures how $x$ and $y$ deviate from their means in the same direction: if both are above or both below their means, the product is positive, pulling $r$ up; if one is above and the other below, the product is negative, pulling $r$ down.
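The link between the two forms is a one-line substitution. As a short sketch, assuming the usual definitions $z_{x_i} = (x_i - \bar{x})/s_x$ and $z_{y_i} = (y_i - \bar{y})/s_y$ with sample standard deviations $s_x$ and $s_y$:

```latex
\begin{aligned}
r &= \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}\, z_{y_i} \\
  &= \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) \\
  &= \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}
\end{aligned}
```

Because $s_x$ and $s_y$ do not depend on $i$, they factor out of the sum, which is why the deviation form needs only one division at the end.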
Worked Example
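A tutor collects 4 pairs of data on number of practice problems completed (x) and quiz score out of 10 (y): $(1, 3), (2, 5), (3, 6), (4, 8)$. Calculate the sample correlation coefficient $r$.
- Calculate sample means: $\bar{x} = 2.5$, $\bar{y} = 5.5$
- Calculate the sum of cross-products of deviations from the mean:

| $x$ | $y$ | $x - \bar{x}$ | $y - \bar{y}$ | Product |
|---|---|---|---|---|
| 1 | 3 | -1.5 | -2.5 | 3.75 |
| 2 | 5 | -0.5 | -0.5 | 0.25 |
| 3 | 6 | 0.5 | 0.5 | 0.25 |
| 4 | 8 | 1.5 | 2.5 | 3.75 |

Sum of products $= 8$
- Calculate sample standard deviations: $s_x = \sqrt{5/3} \approx 1.291$, $s_y = \sqrt{13/3} \approx 2.082$
- Plug into the formula: $r = \dfrac{8}{(4-1)(1.291)(2.082)} = \dfrac{8}{\sqrt{65}} \approx 0.992$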
Exam tip: On FRQs requiring hand calculation, AP readers accept both simplified decimal (accurate to 2+ decimal places) and unsimplified fraction forms of $r$, so you do not need to do extra work to simplify if you are comfortable leaving it in fraction form.
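To check a hand calculation like the one above, both forms of the formula can be coded directly. The following is a minimal sketch using only Python's standard library; the function names `r_zscore_form` and `r_deviation_form` are illustrative choices, and the data are the four (problems, score) pairs from the worked example.

```python
from statistics import mean, stdev

def r_zscore_form(xs, ys):
    """Sample correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

def r_deviation_form(xs, ys):
    """Sample correlation from the sum of cross-products of deviations."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))

problems = [1, 2, 3, 4]   # x: practice problems completed
scores = [3, 5, 6, 8]     # y: quiz score out of 10

r_z = r_zscore_form(problems, scores)
r_dev = r_deviation_form(problems, scores)
print(round(r_z, 3), round(r_dev, 3))  # 0.992 0.992
```

Both functions agree up to floating-point rounding, which mirrors the algebraic equivalence of the two forms.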
3. Key Properties of the Correlation Coefficient
Most MCQ questions on correlation test knowledge of core properties of $r$, so understanding these is critical for earning full points. The key properties tested on the AP exam are:
- Bounds: $r$ is always between -1 and 1 ($-1 \le r \le 1$). $r = 1$ is a perfect positive linear relationship, $r = -1$ is a perfect negative linear relationship, and $r = 0$ means no linear association.
- No units: $r$ is unaffected by linear transformations (adding a constant or multiplying by a positive constant) of one or both variables. Changing units from inches to centimeters or pounds to kilograms will never change the value of $r$.
- Symmetry: The correlation of $x$ with $y$ is identical to the correlation of $y$ with $x$. Swapping the two variables does not change $r$, unlike the slope of a regression line.
- Linear only: $r$ only measures linear association. It can be close to 0 even if there is a strong non-linear relationship between the two variables.
- Sensitivity to outliers: $r$ is very sensitive to extreme outliers. A single outlier can drastically shift $r$ toward or away from 0.
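Three of these properties (symmetry, linear only, sensitivity to outliers) are easy to see numerically. The sketch below uses made-up illustrative data: the four pairs from the Section 2 worked example, a symmetric parabola, and one hypothetical outlier point $(10, 0)$ appended to the worked-example data. The helper `pearson_r` is an illustrative name, not a library function.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation via the deviation form."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))

x = [1, 2, 3, 4]
y = [3, 5, 6, 8]

# Symmetry: swapping the roles of the variables leaves r unchanged.
r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)

# Linear only: y = x^2 is a perfect (non-linear) relationship, yet r is 0
# because the x-values are symmetric about their mean.
u = [-2, -1, 0, 1, 2]
v = [t ** 2 for t in u]
r_curve = pearson_r(u, v)

# Outlier sensitivity: one hypothetical extreme point (10, 0) is enough
# to flip a strong positive r into a negative one.
r_outlier = pearson_r(x + [10], y + [0])

print(round(r_xy, 3), round(r_yx, 3))  # 0.992 0.992
print(round(r_curve, 3))               # 0.0
print(round(r_outlier, 2))             # -0.58
```

The outlier case is the reason a scatterplot should always be inspected before a value of $r$ is trusted.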
Worked Example
The correlation between distance commuted to work (in miles) and monthly gas spending (in dollars) is 0.81 for a sample of workers. If distance commuted is converted to kilometers (1 mile ≈ 1.61 km), what is the new correlation?
- Recall that unit conversion is a linear transformation that multiplies all $x$-values by a positive constant.
- For any linear transformation multiplying by a positive constant $c$, the mean and standard deviation are both multiplied by $c$, so the z-scores of the transformed values remain identical to the original z-scores: $\dfrac{c\,x_i - c\,\bar{x}}{c\,s_x} = \dfrac{x_i - \bar{x}}{s_x} = z_{x_i}$.
- Since $r$ is the average product of z-scores, $r$ does not change. The new correlation is still 0.81.
Exam tip: If an MCQ asks how a unit conversion or adding a constant to all values affects $r$, the answer is always that $r$ does not change; this is one of the most frequently tested properties of correlation.
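The invariance argument can be checked numerically. This is a small sketch with hypothetical commute data (the values in `miles` and `gas_dollars` are invented for illustration and are not the sample from the example); `pearson_r` is an illustrative helper.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation via the deviation form."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))

# Hypothetical data, for illustration only.
miles = [2.0, 5.0, 8.0, 12.0, 20.0]
gas_dollars = [30.0, 55.0, 70.0, 110.0, 160.0]

r_miles = pearson_r(miles, gas_dollars)

# Unit conversion: multiply every x-value by a positive constant.
km = [1.61 * m for m in miles]
r_km = pearson_r(km, gas_dollars)

# Adding a constant to every x-value.
shifted = [m + 5 for m in miles]
r_shifted = pearson_r(shifted, gas_dollars)

# Multiplying by a negative constant keeps the magnitude but flips the sign.
negated = [-m for m in miles]
r_negated = pearson_r(negated, gas_dollars)

print(abs(r_km - r_miles) < 1e-9)       # True
print(abs(r_shifted - r_miles) < 1e-9)  # True
print(abs(r_negated + r_miles) < 1e-9)  # True
```

Comparisons use a small tolerance because floating-point arithmetic can change the last few digits even when the values are mathematically equal.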
4. Interpreting Correlation in Context
AP FRQs almost always require you to interpret the value of $r$ in the context of the problem, and the grading rubric has strict requirements for full credit. To earn full points, your interpretation must include three core elements: (1) the direction of the relationship (positive/negative), (2) the strength of the linear relationship (strong/moderate/weak, matched to the magnitude of $r$), (3) context that names both variables. A common convention accepted on the exam labels $|r|$ close to 1 as strong, mid-range $|r|$ as moderate, and $|r|$ close to 0 as weak. You must also explicitly state that $r$ measures linear association to avoid losing points for a vague interpretation.
Worked Example
A study of 30 coffee shops finds a correlation of -0.42 between distance from the nearest downtown subway stop and daily number of customers. Interpret this correlation in context.
- Identify direction: $r$ is negative, so as distance from the subway increases, daily customers tend to decrease.
- Identify strength: $|r| = 0.42$, which indicates a moderate linear relationship.
- Combine into a full contextual interpretation: "There is a moderate negative linear relationship between distance from the nearest subway stop and daily customer count for these coffee shops: coffee shops located farther from a subway tend to have fewer daily customers."
- This interpretation includes all required elements: direction, strength, explicit linear reference, and context, so it would earn full credit.
Exam tip: Always include the word "linear" in your interpretation and explicitly name both variables; generic interpretations like "there is a moderate negative correlation" will lose points even if the direction and strength are correct.
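The checklist (direction, strength, the word "linear", both variables named) can be sketched as a small template function. Everything here is illustrative: `describe_correlation` is a hypothetical helper, and the cutoffs 0.3 and 0.7 are assumed values chosen for this sketch, not official AP thresholds.

```python
def describe_correlation(r, x_name, y_name, weak_below=0.3, strong_from=0.7):
    """Build a contextual interpretation of r.

    The cutoffs weak_below and strong_from are ASSUMED for illustration;
    they are not an official grading convention.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return f"There is no linear relationship between {x_name} and {y_name}."

    size = abs(r)
    if size < weak_below:
        strength = "weak"
    elif size < strong_from:
        strength = "moderate"
    else:
        strength = "strong"

    direction = "positive" if r > 0 else "negative"
    return (
        f"There is a {strength} {direction} linear relationship "
        f"between {x_name} and {y_name}."
    )

sentence = describe_correlation(
    -0.42,
    "distance from the nearest subway stop",
    "daily customer count",
)
print(sentence)
# There is a moderate negative linear relationship between distance from the nearest subway stop and daily customer count.
```

On the exam the sentence should still be followed by a plain-language statement of what the direction means for the two variables, as in the worked example.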
5. Common Pitfalls (and how to avoid them)
- Wrong move: Interpreting $r = 0$ to mean there is no relationship of any kind between the two variables. Why: Students confuse "no linear association" with "no association at all". Strong non-linear relationships can still have $r \approx 0$. Correct move: Always state that $r = 0$ means there is no linear relationship between the variables, not no relationship at all.
- Wrong move: Claiming correlation proves causation, e.g., "a positive correlation between number of firefighters at a fire and damage means firefighters cause damage". Why: Students forget that correlation from observational data can be explained by lurking third variables. Correct move: Always state that correlation alone does not provide evidence of a causal relationship between two variables.
- Wrong move: Calculating or interpreting correlation between two categorical variables, e.g., correlation between gender and major. Why: Students confuse correlation with measures of association for categorical data. Correlation is only defined for quantitative variables. Correct move: Confirm both variables are quantitative before using or interpreting correlation; if one or both are categorical, correlation is meaningless here.
- Wrong move: Claiming $r$ changes when you swap the explanatory and response variables. Why: Students confuse correlation with regression slope, which does change when you swap variables. Correct move: Remember $r$ is symmetric: swapping $x$ and $y$ leaves $r$ exactly unchanged.
- Wrong move: Stating a correlation of 0.8 is twice as strong as a correlation of 0.4. Why: Correlation is not a linear scale of strength; the distance from 0 to 0.4 is not equivalent to the distance from 0.4 to 0.8. Correct move: Only describe strength relative to how close $r$ is to -1, 0, or 1; never describe strength as a ratio of correlation values.
- Wrong move: Ignoring outliers when interpreting $r$, assuming the calculated $r$ represents the relationship of all observations. Why: Students forget that $r$ is highly sensitive to extreme outliers, which can drastically change its value. Correct move: Always check the scatterplot for outliers before interpreting $r$, and note if an outlier is influencing the correlation value.
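The firefighter pitfall can be illustrated with a tiny constructed data set. In this sketch all numbers are hypothetical: a lurking variable `fire_size` is used to generate both the number of firefighters and the damage, and neither of those two variables is computed from the other.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation via the deviation form."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))

# Hypothetical lurking variable: size of each fire on an arbitrary 1-8 scale.
fire_size = [1, 2, 3, 4, 5, 6, 7, 8]

# Both observed variables depend on fire size plus small, unrelated wiggles.
ff_wiggle = [1, -1, 1, -1, 1, -1, 1, -1]
dmg_wiggle = [-3, 3, 3, -3, -3, 3, 3, -3]
firefighters = [2 * s + w for s, w in zip(fire_size, ff_wiggle)]
damage = [10 * s + w for s, w in zip(fire_size, dmg_wiggle)]

r_ff_damage = pearson_r(firefighters, damage)
print(round(r_ff_damage, 3))  # 0.968
```

The correlation between firefighters and damage is strong even though, by construction, the only thing linking them is fire size, which is exactly the situation a lurking variable creates in observational data.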
6. Practice Questions (AP Statistics Style)
Question 1 (Multiple Choice)
Which of the following statements about the correlation coefficient $r$ is true? A) The correlation between tree diameter and tree height will be negative, because larger trees have larger diameters. B) If the correlation between two variables is 0, there is no relationship between them. C) The correlation of $x$ with $y$ equals the correlation of $y$ with $x$. D) A correlation of 0.6 indicates a stronger relationship than a correlation of -0.7.
Worked Solution: Evaluate each option: Option A is incorrect because larger diameter should be associated with larger height, so the correlation would be positive, not negative. Option B is incorrect because $r = 0$ means no linear relationship, not no relationship of any kind. Option C is correct: correlation is symmetric, so swapping $x$ and $y$ does not change $r$. Option D is incorrect because strength depends on the absolute value of $r$; $|-0.7| > |0.6|$, so -0.7 indicates a stronger relationship. The correct answer is C.
Question 2 (Free Response)
A small bakery collects 5 pairs of data on average daily temperature ($x$, in °F) and number of ice cream cones sold per day ($y$): . (a) Calculate the sample correlation coefficient $r$ for these data. (b) Interpret your calculated $r$ in the context of this problem. (c) How would $r$ change if we added 5 degrees to every temperature measurement to account for a measurement error? Justify your answer.
Worked Solution: (a) Calculate the means $\bar{x}$ and $\bar{y}$, the sum of cross-products of deviations $\sum (x_i - \bar{x})(y_i - \bar{y})$, and the sample standard deviations $s_x$ and $s_y$, then substitute into the deviation form, which gives $r \approx 0.996$. (b) There is a very strong positive linear relationship between average daily temperature and number of ice cream cones sold at this bakery: on days with higher temperatures, the bakery tends to sell more ice cream cones. (c) $r$ will remain unchanged at 0.996. Adding a constant to all values of $x$ is a linear transformation that shifts all values, including the mean, by the same amount, so deviations from the mean and z-scores remain unchanged. This means the correlation does not change.
Question 3 (Application / Real-World Style)
A public health researcher finds a correlation of 0.65 between annual alcohol consumption and annual healthcare spending for a sample of 1000 adults. A policy analyst argues that this correlation proves alcohol consumption causes higher healthcare costs, so policymakers should tax alcohol to reduce healthcare spending. Is the analyst's argument supported by the correlation? Explain what the correlation actually tells us.
Worked Solution: The analyst's argument that correlation proves causation is not supported by the available data. The correlation of 0.65 tells us that there is a moderately strong positive linear relationship between annual alcohol consumption and annual healthcare spending: adults who consume more alcohol per year tend to have higher annual healthcare spending. However, this correlation could be explained by lurking variables such as age, smoking status, or income that affect both alcohol consumption and healthcare spending. Since this is an observational study, the correlation alone does not prove that alcohol consumption causes higher healthcare costs, so the analyst's argument is not valid.
7. Quick Reference Cheatsheet
| Category | Formula | Notes |
|---|---|---|
| Correlation (z-score form) | $r = \frac{1}{n-1}\sum z_{x_i} z_{y_i}$ | For sample correlation; $z_{x_i}, z_{y_i}$ are standard scores |
| Correlation (deviation form) | $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$ | Easier for hand calculation on FRQs |
| Bounds of $r$ | $-1 \le r \le 1$ | Any value outside this range indicates a calculation error |
| Symmetry | $r_{xy} = r_{yx}$ | Swapping variables does not change $r$, unlike regression slope |
| Linear transformation effect | $\operatorname{corr}(a + bx,\ c + dy) = \pm r$ | Positive $b, d$ leave $r$ unchanged; one negative flips the sign |
| What $r$ measures | Strength and direction of linear association | Does not measure non-linear association |
| Causation rule | Correlation $\neq$ Causation | Only randomized experiments can prove causation |
| Sensitivity | $r$ is sensitive to outliers | One extreme outlier can drastically change $r$ |
8. What's Next
Correlation is the foundational prerequisite for linear regression, the next major topic in Unit 2: Exploring Two-Variable Data. Together with the two standard deviations, the correlation coefficient determines the slope of the least-squares regression line ($b = r \cdot s_y / s_x$), and the square of $r$ (the coefficient of determination, $r^2$) measures the proportion of variation in the response variable that is explained by the linear model. Without understanding the properties and interpretation of $r$, you will not be able to correctly interpret regression output or answer FRQ questions about model fit, which make up a large portion of the AP exam. Correlation also plays a key role in later topics such as inference for regression, where you test whether a population correlation is significantly different from zero.
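As a preview of that connection, the sketch below reuses the four data pairs from the Section 2 worked example and checks the standard identity that the least-squares slope equals $r \cdot s_y / s_x$. The variable names are illustrative.

```python
from statistics import mean, stdev

x = [1, 2, 3, 4]   # practice problems completed
y = [3, 5, 6, 8]   # quiz score out of 10

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Sum of cross-products of deviations (the numerator of r).
cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
r = cross / ((n - 1) * s_x * s_y)

# Slope obtained from r, and the slope computed directly from deviations.
slope_from_r = r * s_y / s_x
slope_direct = cross / sum((xi - x_bar) ** 2 for xi in x)

# Coefficient of determination.
r_squared = r ** 2

print(round(slope_from_r, 3), round(slope_direct, 3))  # 1.6 1.6
print(round(r_squared, 3))                             # 0.985
```

Here roughly 98.5% of the variation in quiz score is accounted for by the linear relationship with practice problems completed, which matches the very strong $r$ found earlier.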