Describing Distributions of a Quantitative Variable — AP Statistics Study Guide
For: Students preparing for the AP Statistics exam.
Covers: The SOCS (Shape, Outliers, Center, Spread) framework, shape classification for distributions, the 1.5×IQR outlier rule, appropriate selection of center/spread measures, and comparing two quantitative distributions.
You should already know: Difference between quantitative and categorical variables, how to construct histograms and boxplots for one-variable data, how to calculate basic summary statistics.
A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board / Cambridge / IB papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official mark schemes for grading conventions.
1. What Is Describing Distributions of a Quantitative Variable?
A distribution of a quantitative variable describes what values the variable takes and how frequently it takes those values. Describing distributions is the foundational exploratory data analysis skill for AP Statistics. It belongs to Unit 1 (Exploring One-Variable Data), which the official AP Statistics CED weights at approximately 15-23% of the exam. This topic appears on both the multiple-choice (MCQ) and free-response (FRQ) sections of the exam: MCQ questions test your ability to identify correct descriptions, compare distributions, and apply rules for outliers, while the FRQ section almost always includes an early question requiring a full contextual description or comparison of distributions. Standard notation conventions used on the exam are: $\bar{x}$ for sample mean, $\mu$ for population mean, $s$ for sample standard deviation, $\sigma$ for population standard deviation, $M$ for median, and $\text{IQR}$ for interquartile range. You may also see the question phrased as "summarize the distribution" or "compare the distributions," which are just synonyms for the skill tested here.
2. The SOCS Description Framework
The SOCS mnemonic is the standard structure AP exam graders expect for full credit on any distribution description question, and it organizes all key features of a distribution into four required components.
- S = Shape: The overall pattern of the data, classified by number of peaks (unimodal, bimodal, uniform) and symmetry/skewness. Symmetric distributions are mirrored around the center, with mean ≈ median. Right-skewed distributions have a long tail extending toward higher values, with mean > median. Left-skewed distributions have a long tail extending toward lower values, with mean < median. Uniform distributions are flat, with all values equally frequent, and bimodal distributions have two distinct peaks (often indicating two separate groups in the data).
- O = Outliers: Any individual values that fall far outside the overall pattern of the data, visible on graphs or confirmed via calculation.
- C = Center: A typical value of the distribution. Use the median for skewed data or data with outliers, and the mean for symmetric data without outliers.
- S = Spread: A measure of how much the data varies. Use IQR for skewed data or data with outliers, and standard deviation for symmetric data without outliers. All descriptions must be tied to the context of the problem, not just generic labels.
Worked Example
Problem: A small coffee shop records the number of online orders it receives each hour on a weekday. A histogram of the data shows most hours have between 2 and 8 orders, with a long tail extending toward 15 orders. Two hours with 14 and 15 orders stand far apart from the main cluster. The mean number of orders is 6.2, and the median is 5. Describe this distribution.
- Shape: The distribution of hourly online orders is right-skewed, matching the long tail extending toward higher values.
- Outliers: There are two clear high outliers at 14 and 15 orders, separated from the main cluster of data.
- Center: Because the distribution is skewed, the typical number of hourly online orders is 5 (the median).
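- Spread: Most hours have between 2 and 8 orders, but the high values stretch the distribution up to 15 orders per hour, so there is moderate variability in hourly order volume across the day. If the quartiles were given, the IQR would be the preferred measure of spread for this skewed distribution.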
Exam tip: Always address all four SOCS components, and tie every label to the context of the variable you are describing. AP graders deduct a full point for generic descriptions that do not reference the study context.
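The numerical parts of a SOCS description (outliers, center, spread) can be checked with a short script. The sketch below is illustrative only: the worked example gives no raw data, so `hourly_orders` is a made-up list chosen to be consistent with the stated mean (6.2) and median (5), and `quartiles` and `socs_numbers` are hypothetical helper names. Quartiles are computed as medians of the lower and upper halves, which is one common classroom convention; other software may use a different quartile method. Shape still has to be judged from a graph.

```python
from statistics import mean, median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When the number of values is odd, the middle value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])

def socs_numbers(values):
    """Collect the numbers needed for the O, C, and S parts of SOCS."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in sorted(values) if v < low_fence or v > high_fence]
    return {
        "mean": mean(values),
        "median": median(values),
        "iqr": iqr,
        "fences": (low_fence, high_fence),
        "outliers": outliers,
    }

# Hypothetical hourly order counts (not from the original problem).
hourly_orders = [2, 3, 3, 4, 5, 5, 5, 6, 14, 15]
summary = socs_numbers(hourly_orders)

print(summary["mean"], summary["median"])  # mean above median supports right skew
print(summary["iqr"], summary["fences"])
print(summary["outliers"])                 # [14, 15]
```

For this made-up data the mean sits above the median and both 14 and 15 fall beyond the upper fence, so the median and IQR would be the measures to report in context.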
3. Identifying Outliers with the 1.5×IQR Rule
While outliers are often visible on histograms or boxplots, the AP exam frequently requires you to confirm outliers using the 1.5×IQR rule, the outlier rule you will use most often on the exam. First, the interquartile range (IQR) is the range of the middle 50% of sorted data, calculated as $\text{IQR} = Q_3 - Q_1$, where $Q_1$ is the 25th percentile (first quartile) and $Q_3$ is the 75th percentile (third quartile). The rule states that any value less than $Q_1 - 1.5 \times \text{IQR}$ or greater than $Q_3 + 1.5 \times \text{IQR}$ is classified as an outlier. This rule is preferred because it uses quartiles, which are resistant to extreme values, unlike the mean and standard deviation. The bounds sit 1.5 IQRs beyond the quartiles, so only values that fall far outside the middle half of the data are flagged; on a boxplot that marks outliers, the whiskers stop at the most extreme values inside these bounds.
Worked Example
Problem: The five-number summary for the total points scored per game by a college basketball team is: 42, 58, 65, 72, 88. Is the highest scoring game of 88 points an outlier per the 1.5×IQR rule?
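- Identify $Q_1$ and $Q_3$ from the ordered five-number summary (Min, $Q_1$, Median, $Q_3$, Max): $Q_1 = 58$, $Q_3 = 72$.
- Calculate IQR: $\text{IQR} = 72 - 58 = 14$ points.
- Calculate outlier bounds: Lower bound = $58 - 1.5(14) = 37$ points. Upper bound = $72 + 1.5(14) = 93$ points.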
- Compare the maximum value to the upper bound: 88 < 93, and 42 > 37, so 88 points is not an outlier by this rule. There are no outliers in this dataset.
Exam tip: Always explicitly compare the value in question to both the lower and upper bounds on FRQ questions. You will lose partial credit if you only calculate IQR and do not show the comparison step.
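The 1.5×IQR check from the basketball example can be sketched in a few lines of Python. The function names `iqr_fences` and `is_outlier` are hypothetical, chosen for illustration; the numbers are the five-number summary given in the problem.

```python
def iqr_fences(q1, q3):
    """Return the (lower, upper) outlier bounds from the 1.5*IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(value, q1, q3):
    """True if value lies strictly below the lower bound or above the upper bound."""
    low, high = iqr_fences(q1, q3)
    return value < low or value > high

# Five-number summary from the worked example: Min, Q1, Median, Q3, Max.
minimum, q1, med, q3, maximum = 42, 58, 65, 72, 88

print(iqr_fences(q1, q3))           # (37.0, 93.0)
print(is_outlier(maximum, q1, q3))  # False
print(is_outlier(minimum, q1, q3))  # False
```

Checking only the minimum and maximum is enough to decide whether a dataset has any outliers at all, since every other value lies between them.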
4. Comparing Two Distributions of a Quantitative Variable
A very common AP exam question asks you to compare two distributions of the same quantitative variable (e.g., test scores for two classes, wait times at two restaurants) using side-by-side boxplots, parallel histograms, or summary statistics. The same SOCS framework applies here, but the key difference from describing a single distribution is that you must make explicit comparative statements for each component, not just describe each distribution separately. For example, instead of writing "Distribution A has median 10, Distribution B has median 15," you need to write "Distribution B has a higher typical value than Distribution A, with a median 5 points higher than Distribution A." Always use comparative language: higher, lower, more spread out, less variable, has more outliers, etc. You should also match your choice of center/spread measures to the shape of the distributions: compare medians and IQRs for skewed distributions, and means and standard deviations for symmetric distributions.
Worked Example
Problem: Side-by-side boxplots compare the distribution of hourly wages for entry-level workers in two industries: Industry A and Industry B. Industry A has a median wage of $15/hr, IQR of $4/hr, and no outliers. Industry B has a median wage of $18/hr, IQR of $6/hr, and one low outlier at $10/hr. Compare the two distributions.
- Shape: Both distributions are roughly symmetric, as their boxplots are balanced around the median with no extended tails.
- Center: Industry B has a higher median hourly wage ($18/hr vs $15/hr), so the typical entry-level wage is $3 higher in Industry B than Industry A.
- Spread: Industry B has a larger IQR ($6/hr vs $4/hr), so entry-level wages are more variable in Industry B than Industry A.
- Outliers: Industry B has one low outlier at $10/hr, while Industry A has no outliers.
Exam tip: You will lose a full point on comparison FRQs if you only describe each distribution individually without explicit comparative statements. Practice adding the comparative sentence for every SOCS component.
5. Common Pitfalls (and how to avoid them)
- Wrong move: Stating that a left-skewed distribution has a mean greater than the median, reversing the direction of skew. Why: Students confuse the name of the skew (named for the tail) with where the mean is pulled. Correct move: Memorize "the mean is pulled toward the tail": right skew = tail right = mean > median; left skew = tail left = mean < median.
- Wrong move: Only describing shape when asked to describe a distribution, forgetting to mention outliers, center, and spread. Why: Students focus on the most obvious feature and skip other required SOCS components. Correct move: Run through the S-O-C-S mnemonic in order, checking off each component before turning in your answer.
- Wrong move: Mixing up the order of the five-number summary, using the median as Q1 or Q3 when calculating outliers. Why: Students often forget the standard ordering of the five-number summary. Correct move: Write "Min, Q1, Median, Q3, Max" at the top of your work for any outlier calculation problem.
- Wrong move: Using mean and standard deviation to describe a strongly skewed distribution with outliers. Why: Students default to more familiar mean/SD, but they are not resistant to extreme values. Correct move: Always use median and IQR for skewed data or data with outliers; reserve mean/SD for symmetric data without outliers.
- Wrong move: Labeling a bimodal distribution as skewed because it has two peaks. Why: Students confuse multiple peaks with skewness, which describes the direction of the tail. Correct move: Always state the number of peaks first: if you see two clear peaks, explicitly label the distribution as bimodal.
- Wrong move: When comparing two distributions, only describing each distribution separately with no direct comparison. Why: Students think describing both is enough, but the question asks for a comparison. Correct move: For every SOCS component, add a sentence that directly compares the two groups (e.g., "X has a higher median than Y").
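Several of these pitfalls come down to resistance: the median and IQR barely react to an extreme value, while the mean and standard deviation do. A minimal sketch of that idea, using made-up numbers (both lists and the helper `iqr` are hypothetical, and quartiles are taken as medians of the lower and upper halves):

```python
from statistics import mean, median, stdev

def iqr(values):
    """IQR with Q1 and Q3 taken as medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[-half:]) - median(data[:half])

# Hypothetical symmetric data, then the same data with the largest value made extreme.
base = [10, 12, 13, 14, 15, 16, 17, 18, 20]
with_extreme = base[:-1] + [60]

for label, data in (("base", base), ("with extreme value", with_extreme)):
    print(label)
    print("  mean:", round(mean(data), 2), " median:", median(data))
    print("  SD:", round(stdev(data), 2), " IQR:", iqr(data))
```

The median stays at 15 and the IQR stays at 5 in both lists, while the mean rises from 15 to about 19.44 and the standard deviation grows, which is why the resistant pair is preferred for skewed data or data with outliers.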
6. Practice Questions (AP Statistics Style)
Question 1 (Multiple Choice)
An ecologist measures the diameter of 75 mature oak trees in a state park. The distribution of tree diameters is unimodal, strongly right-skewed, with three high outliers. Which pair of measures is most appropriate to describe the center and spread of this distribution? A) Mean and standard deviation B) Median and IQR C) Mean and IQR D) Median and standard deviation
Worked Solution: Right-skewed distributions with outliers require resistant measures of center and spread, because mean and standard deviation are distorted by extreme values in the tail. The median is resistant to outliers, and the IQR is also resistant, so this pair is the most appropriate. Pairs that mix resistant and non-resistant measures (C and D) are inconsistent, and A is appropriate only for symmetric data without outliers. The correct answer is B.
Question 2 (Free Response)
A restaurant records the wait time (in minutes) for customers on Friday and Saturday nights. Summary statistics are below:
| Statistic | Friday | Saturday |
|---|---|---|
| Minimum | 5 | 2 |
| Q1 | 12 | 10 |
| Median | 18 | 22 |
| Q3 | 24 | 28 |
| Maximum | 35 | 45 |
(a) Calculate the IQR of wait times for Saturday, and check if the maximum wait time of 45 minutes is an outlier per the 1.5×IQR rule. (b) The mean wait time for Friday is 17.8 minutes. Describe the shape of the Friday wait time distribution, justifying your answer with the relationship between mean and median. (c) Compare the distribution of wait times between Friday and Saturday nights.
Worked Solution: (a) IQR for Saturday is $28 - 10 = 18$ minutes. The upper outlier bound is $28 + 1.5(18) = 55$ minutes. The lower bound is $10 - 1.5(18) = -17$ minutes. 45 < 55, so 45 minutes is not an outlier. (b) For Friday, the mean is 17.8 minutes and the median is 18 minutes. The mean is slightly less than the median, so the mean is pulled toward the left tail. This means the distribution of Friday wait times is slightly left-skewed. (c) Shape: Friday is slightly left-skewed, while Saturday is slightly right-skewed (the right tail extends far from Q3). Center: Saturday has a higher median wait time (22 minutes vs 18 minutes), so typical wait times are 4 minutes longer on Saturday than Friday. Spread: Saturday has a larger IQR (18 minutes vs 12 minutes) and larger overall range, so wait times are more variable on Saturday. Outliers: Neither distribution has outliers per the 1.5×IQR rule.
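The arithmetic behind this comparison can be double-checked from the summary table alone. The sketch below is one possible way to organize it; the dictionary layout and the helper name `describe` are assumptions made for illustration, and the numbers are taken from the table in the question.

```python
# Five-number summaries from the table (wait times in minutes).
friday = {"min": 5, "q1": 12, "median": 18, "q3": 24, "max": 35}
saturday = {"min": 2, "q1": 10, "median": 22, "q3": 28, "max": 45}

def describe(summary):
    """Return the IQR, the 1.5*IQR bounds, and whether any value must be an outlier."""
    iqr = summary["q3"] - summary["q1"]
    low = summary["q1"] - 1.5 * iqr
    high = summary["q3"] + 1.5 * iqr
    # If neither the minimum nor the maximum is beyond a bound, no value is.
    has_outlier = summary["min"] < low or summary["max"] > high
    return {"iqr": iqr, "fences": (low, high), "has_outlier": has_outlier}

fri, sat = describe(friday), describe(saturday)
median_gap = saturday["median"] - friday["median"]

print("Friday:", fri)      # IQR 12, bounds (-6.0, 42.0), no outliers
print("Saturday:", sat)    # IQR 18, bounds (-17.0, 55.0), no outliers
print("Saturday median is", median_gap, "minutes higher")  # 4
```

The code only supplies the numbers; the written answer still needs comparative sentences in context for each SOCS component.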
Question 3 (Application / Real-World Style)
A real estate agent records the sale price of 100 houses in a suburban neighborhood. The distribution of sale prices has a long tail extending toward very high prices, with most prices below $400,000. The mean sale price is $385,000, and the median sale price is $320,000. A seller claims the "typical" house in the neighborhood sells for $385,000. Is this claim accurate? Justify your answer, and give a more accurate typical value if needed.
Worked Solution: The distribution of house prices is right-skewed, confirmed by the long right tail and the mean being much larger than the median. For right-skewed distributions, the mean is pulled toward the extreme high values in the tail, so it overestimates the typical house price. The median is resistant to extreme values, so it is a better measure of center for skewed data. The claim that the typical house sells for $385,000 is not accurate, because that figure is the mean. A more accurate typical value is $320,000, the median.
7. Quick Reference Cheatsheet
| Category | Formula / Mnemonic | Notes |
|---|---|---|
| SOCS Description Framework | S=Shape, O=Outliers, C=Center, S=Spread | Required for all distribution description questions on the AP exam |
| Interquartile Range (IQR) | $\text{IQR} = Q_3 - Q_1$ | Resistant to outliers; use for skewed distributions |
| 1.5×IQR Lower Outlier Bound | $Q_1 - 1.5 \times \text{IQR}$ | Any value < lower bound is an outlier |
| 1.5×IQR Upper Outlier Bound | $Q_3 + 1.5 \times \text{IQR}$ | Any value > upper bound is an outlier |
| Appropriate Center: Symmetric, no outliers | $\bar{x}$ (sample), $\mu$ (population) | Uses all data; not resistant to outliers |
| Appropriate Center: Skewed / with outliers | Median | Resistant to outliers; measures typical value |
| Appropriate Spread: Symmetric, no outliers | $s$ (sample), $\sigma$ (population) | Measures typical deviation from the mean; not resistant |
| Appropriate Spread: Skewed / with outliers | IQR | Resistant to outliers; preferred for non-symmetric data |
| Skew Direction Rule | N/A | Mean is always pulled toward the tail. Right skew = mean > median; Left skew = mean < median |
8. What's Next
This chapter is the foundational skill for all data analysis in AP Statistics, and it is a prerequisite for nearly every topic that comes after. Immediately next, you will learn to calculate and interpret measures of position (percentiles, z-scores) that allow you to compare individual values across different distributions, then move on to exploring two-variable data for linear regression. Without mastering the SOCS framework and how to choose appropriate measures of center and spread, you will struggle to compare groups on FRQ questions throughout the course, and to interpret the results of inference procedures later on. This topic feeds into the core AP Statistics theme of exploratory data analysis, which makes up nearly half the exam.
Follow-on topics:
- Measures of Position for One-Variable Data
- Comparing Distributions of Categorical Variables
- Exploring Two-Variable Quantitative Data
- Inference for Comparing Two Means