Summary Statistics for Quantitative Data — AP Statistics Study Guide
For: Students preparing for the AP Statistics exam.
Covers: Measures of center (mean, median, mode), measures of spread (range, interquartile range, variance, standard deviation), five-number summary, 1.5×IQR outlier detection, and the distinction between resistant vs non-resistant statistics for one-variable quantitative data.
You should already know: Basic distinction between quantitative and categorical data. How to sort a dataset of quantitative values. Basic summation notation.
A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board / Cambridge / IB papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official mark schemes for grading conventions.
1. What Is Summary Statistics for Quantitative Data?
Summary statistics are numerical values that condense and describe key features of a distribution of one-variable quantitative data, rather than displaying the full raw dataset or graph. According to the AP Statistics Course and Exam Description (CED), this topic is part of Unit 1: Exploring One-Variable Data, which accounts for 15-20% of the total AP exam score, with this specific topic making up 4-6% of total exam points. This topic is tested on both multiple-choice questions (MCQ) and free-response questions (FRQ): MCQ typically test calculation, identification of resistant measures, and formula recall, while FRQ test interpretation, outlier detection, and selection of appropriate statistics for comparing distributions. Standard notation distinguishes population parameters (true values for an entire population, written with Greek letters: $\mu$ for population mean, $\sigma$ for population standard deviation) from sample statistics (values calculated from sample data, written with Roman letters: $\bar{x}$ for sample mean, $s$ for sample standard deviation). Summary statistics are the core descriptive tool for quantitative data, and they are a prerequisite for all inferential statistics later in the course.
2. Measures of Center
Measures of center are summary values that describe the typical, or central, value of a quantitative distribution. The three most commonly used measures on the AP exam are mode, median, and mean. The mode is simply the most frequently occurring value in a dataset; it is rarely used as a primary measure of center for continuous data but can be useful for identifying bimodal or multimodal distributions. The median is the middle value of a sorted dataset: for a dataset of size $n$, if $n$ is odd, it is the $\frac{n+1}{2}$th sorted value; if $n$ is even, it is the average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th sorted values. The mean is the arithmetic average of all values, with formulas for sample and population mean: $\bar{x} = \frac{\sum x_i}{n}$ and $\mu = \frac{\sum x_i}{N}$. Intuitively, the mean acts as the balancing point of a distribution, like the fulcrum of a seesaw, while the median is the 50th percentile, splitting the data into two equal halves. The key distinction between the two is resistance: a resistant measure is not affected by extreme values or outliers. The median is resistant, while the mean is not. For symmetric distributions with no outliers, the mean is preferred; for skewed distributions or distributions with outliers, the median is preferred.
Worked Example
The number of minutes 8 students waited for lunch in the school cafeteria line are: 2, 3, 5, 7, 8, 10, 10, 105. Calculate the mode, median, and mean, and state which is the best measure of center for this distribution.
- Sort the dataset: it is already sorted in ascending order, with $n = 8$.
- The mode is the most frequent value: 10 occurs twice, all other values occur once, so the mode is 10 minutes.
- For $n = 8$ (even), the median is the average of the 4th and 5th sorted values: $\frac{7 + 8}{2} = 7.5$ minutes.
- Calculate the mean: the sum of all values is $2 + 3 + 5 + 7 + 8 + 10 + 10 + 105 = 150$, so $\bar{x} = \frac{150}{8} = 18.75$ minutes.
- The distribution is strongly right-skewed by the outlier 105 (an extremely long wait caused by a broken register). The median is resistant to this outlier, so it is the best measure of center.
Exam tip: On FRQs asking you to choose between mean and median, always justify your choice by referencing skewness or outliers. "Median is better" is not enough for full credit; you must link your choice to the property of resistance.
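The worked example above can be checked with a short Python sketch. It uses only the standard-library `statistics` module; the variable names are our own, and the "drop the outlier" comparison is an illustrative extra rather than part of the AP method.

```python
from statistics import mean, median, multimode

waits = [2, 3, 5, 7, 8, 10, 10, 105]  # cafeteria wait times in minutes

print(multimode(waits))  # [10]
print(median(waits))     # 7.5
print(mean(waits))       # 18.75

# Resistance check: drop the extreme value 105 and see how far each measure moves.
trimmed = [w for w in waits if w != 105]
print(median(trimmed))          # 7
print(round(mean(trimmed), 2))  # 6.43
```

Removing one value shifts the median by only half a minute but pulls the mean down by more than 12 minutes, which is what "the median is resistant, the mean is not" looks like in practice.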
3. Measures of Spread
Measures of spread (also called measures of variability) describe how spread out the values of a quantitative distribution are around the center. The four common measures tested on the AP exam are range, interquartile range (IQR), variance, and standard deviation. Range is the simplest measure: $\text{range} = \text{maximum} - \text{minimum}$. It is easy to calculate but non-resistant, meaning it is heavily affected by outliers. The IQR is the spread of the middle 50% of the data, calculated as $\text{IQR} = Q_3 - Q_1$, where $Q_1$ is the 25th percentile (first quartile) and $Q_3$ is the 75th percentile (third quartile). The IQR is resistant to outliers, so it pairs with the median as the preferred measure of spread for skewed data. Variance is the average of the squared deviations from the mean. For populations and samples, the formulas are: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ and $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$. The $n - 1$ correction in the sample variance (called Bessel's correction) reduces bias in the estimate of the population variance. Standard deviation is the square root of variance, which converts the value back to the original units of the dataset, making it easier to interpret. Intuitively, standard deviation is roughly the typical distance of observations from the mean. It is non-resistant, so it pairs with the mean for symmetric distributions with no outliers.
Worked Example
For the cafeteria wait time dataset from the previous example: [2, 3, 5, 7, 8, 10, 10, 105], calculate the range, IQR, and sample standard deviation, and state the appropriate measure of spread for this distribution.
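- Range $= 105 - 2 = 103$ minutes.
- Split the sorted dataset into lower half [2, 3, 5, 7] and upper half [8, 10, 10, 105]. $Q_1$ is the median of the lower half: $Q_1 = \frac{3 + 5}{2} = 4$. $Q_3$ is the median of the upper half: $Q_3 = \frac{10 + 10}{2} = 10$. So $\text{IQR} = 10 - 4 = 6$ minutes.
- We already know $\bar{x} = 18.75$. The sum of squared deviations from the mean is $\sum (x_i - \bar{x})^2 = 8563.5$. Sample variance $s^2 = \frac{8563.5}{7} \approx 1223.36$, so sample standard deviation $s = \sqrt{1223.36} \approx 34.98$ minutes.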
- This distribution is skewed with an outlier, so the resistant IQR is the appropriate measure of spread.
Exam tip: AP MCQs almost always include an option that uses $n$ instead of $n - 1$ for sample standard deviation. Always double-check whether the problem states the data is a sample (requires $n - 1$) or the full population (requires $N$).
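The spread calculations for the cafeteria data can be reproduced step by step. This is a minimal sketch with our own variable names: quartiles are taken as medians of the lower and upper halves, as in the worked example, and `statistics.stdev` (which divides by n - 1) is used only as a cross-check.

```python
from math import sqrt
from statistics import median, stdev

waits = sorted([2, 3, 5, 7, 8, 10, 10, 105])  # minutes
n = len(waits)

data_range = waits[-1] - waits[0]

# n is even here, so the halves are simply the first and last n // 2 values.
lower, upper = waits[: n // 2], waits[n // 2:]
q1, q3 = median(lower), median(upper)
iqr = q3 - q1

# Sample standard deviation from the definition, dividing by n - 1.
x_bar = sum(waits) / n
ss = sum((x - x_bar) ** 2 for x in waits)  # sum of squared deviations
s = sqrt(ss / (n - 1))

print(data_range, iqr)             # 103 6.0
print(ss, round(s, 2))             # 8563.5 34.98
print(abs(s - stdev(waits)) < 1e-9)  # True: matches the library's n - 1 version
```

The single value 105 inflates the range and the standard deviation far more than the IQR, which is why the IQR is the safer choice here.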
4. Five-Number Summary and Outlier Detection
The five-number summary is a set of summary values that captures the minimum, spread of the lower half, center, spread of the upper half, and maximum of a distribution. It consists of five values: minimum, $Q_1$, median, $Q_3$, maximum. The five-number summary is the basis for boxplots (a common graph for comparing distributions) and for the standard AP exam method of outlier detection: the 1.5×IQR rule. The 1.5×IQR rule defines two outlier fences: any value less than $Q_1 - 1.5 \times \text{IQR}$ or greater than $Q_3 + 1.5 \times \text{IQR}$ is classified as a potential outlier. This rule is designed to flag unusual values that would otherwise distort summary statistics; for a normal distribution, about 99.3% of values fall within these fences, so any value outside is rare enough to flag. The 1.5×IQR rule is the standard outlier detection method to use on the AP exam unless another method is explicitly specified.
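The share of a normal distribution that lies inside the 1.5×IQR fences can be checked numerically. The sketch below assumes a standard normal model and uses Python's built-in `statistics.NormalDist`; this check is our own addition and is not something the AP exam asks for.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)  # theoretical quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability that a normal observation lands between the two fences.
inside = z.cdf(upper_fence) - z.cdf(lower_fence)
print(round(upper_fence, 3))  # 2.698
print(round(inside, 4))       # 0.993
```

In other words, for normal data the fences sit about 2.7 standard deviations from the mean, so fewer than 1 value in 100 would be flagged.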
Worked Example
A hiker records the elevation (in hundreds of feet) of 10 campsites along a trail: [12, 14, 15, 16, 18, 19, 21, 22, 23, 34]. Use the 1.5×IQR rule to identify any outliers in this dataset.
- Sort the dataset: it is already sorted, $n = 10$. Minimum = 12, Maximum = 34.
- Median is the average of the 5th and 6th values: $\frac{18 + 19}{2} = 18.5$. Split the data into lower half [12, 14, 15, 16, 18] and upper half [19, 21, 22, 23, 34].
- $Q_1$ is the median of the lower half = 15, $Q_3$ is the median of the upper half = 22. $\text{IQR} = 22 - 15 = 7$.
- Calculate outlier fences: Lower fence = $15 - 1.5(7) = 4.5$, Upper fence = $22 + 1.5(7) = 32.5$.
- Compare all values: 34 > 32.5, so 34 (3400 feet) is a potential outlier by the 1.5×IQR rule.
Exam tip: You must explicitly show the calculation of the outlier fences and compare the candidate value to the fence to get full credit on FRQs. Saying "34 is an outlier because it is far from the other values" is not sufficient.
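The procedure in the worked example can be written as two small helper functions. This is a sketch under the quartile convention used in this guide (medians of the two halves, with the overall median left out of both halves when n is odd); the function names are hypothetical, and the code assumes at least two data values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + (n % 2):]  # skips the median itself when n is odd
    return data[0], median(lower), median(data), median(upper), data[-1]

def outliers_by_fences(values):
    """Return ((lower fence, upper fence), values outside the fences)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (lo, hi), [x for x in values if x < lo or x > hi]

elevations = [12, 14, 15, 16, 18, 19, 21, 22, 23, 34]  # hundreds of feet
print(five_number_summary(elevations))  # (12, 15, 18.5, 22, 34)
print(outliers_by_fences(elevations))   # ((4.5, 32.5), [34])
```

Note that the comparison is strict: a value exactly equal to a fence is not flagged.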
5. Common Pitfalls (and how to avoid them)
- Wrong move: Using $n$ instead of $n - 1$ in the denominator for sample variance and standard deviation. Why: Students confuse population and sample formulas, and forget Bessel's correction is required for unbiased sample statistics. Correct move: Always check if the problem labels the data as a sample from a larger population; if yes, use $n - 1$, and only use $N$ if it is the entire population.
- Wrong move: Forgetting to sort the data before calculating median, quartiles, or IQR. Why: Students rush and use the unsorted order given in the problem statement. Correct move: Always explicitly sort the data in ascending order before calculating any position-based summary statistic.
- Wrong move: Choosing mean and standard deviation for a skewed distribution with outliers. Why: Students default to more familiar mean/SD without checking distribution shape. Correct move: Always pair median with IQR for skewed data or data with outliers, and mean with standard deviation for roughly symmetric data without outliers.
- Wrong move: Including the median in the lower and upper halves when calculating quartiles for odd $n$. Why: Multiple methods exist for quartiles, but AP uses the exclude-median method, and including it gives the wrong IQR. Correct move: After finding the median, split the dataset into lower and upper halves that never include the median when $n$ is odd.
- Wrong move: Classifying a value as an outlier just because it is the maximum or minimum. Why: Students assume any extreme value is automatically an outlier. Correct move: Always explicitly apply the 1.5×IQR rule to calculate the fences and confirm the value is outside the fences.
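To see how much the quartile pitfall above can matter, the sketch below computes the IQR both ways for a small made-up dataset with odd n. The function name, the `include_median` flag, and the data are all hypothetical and chosen only for illustration.

```python
from statistics import median

def iqr_by_halves(values, include_median):
    """IQR from medians of the two halves, with or without the overall median."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1 and include_median:
        # Pitfall version: the middle value is placed in both halves.
        lower, upper = data[: half + 1], data[half:]
    else:
        # Exclude-median version: the middle value belongs to neither half.
        lower, upper = data[:half], data[half + (n % 2):]
    return median(upper) - median(lower)

sample = [1, 2, 3, 4, 5, 6, 7]  # made-up data, n = 7

print(iqr_by_halves(sample, include_median=False))  # 4
print(iqr_by_halves(sample, include_median=True))   # 3.0
```

A different IQR means different fences, so the choice of method can change which values get flagged as outliers.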
6. Practice Questions (AP Statistics Style)
Question 1 (Multiple Choice)
A statistics student takes a random sample of 10 full-service gas stations in her state and records the price of regular unleaded gasoline (in dollars per gallon): 3.29, 3.59, 3.49, 3.69, 3.39, 3.49, 4.19, 3.39, 3.49, 3.59. The sample standard deviation of gas prices is closest to which of the following? A) 0.08 B) 0.24 C) 0.25 D) 0.06
Worked Solution: This is a random sample, so we use Bessel's correction ($n - 1$ in the denominator). First, calculate the sample mean: the sum of all 10 prices is 35.6, so $\bar{x} = \frac{35.6}{10} = 3.56$. Next, the sum of squared deviations from the mean is 0.561. Sample variance is $s^2 = \frac{0.561}{9} \approx 0.0623$, so sample standard deviation is $s = \sqrt{0.0623} \approx 0.25$. The correct answer is C.
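The difference between the two denominators is easy to see in code. This sketch uses the standard-library functions `statistics.stdev` (divides by n - 1) and `statistics.pstdev` (divides by n); the variable names are our own.

```python
from statistics import pstdev, stdev

prices = [3.29, 3.59, 3.49, 3.69, 3.39, 3.49, 4.19, 3.39, 3.49, 3.59]

s = stdev(prices)                   # sample formula, divides by n - 1
wrong_denominator = pstdev(prices)  # population formula, divides by n

print(round(s, 2))                  # 0.25
print(round(wrong_denominator, 2))  # 0.24
```

Dividing by n instead of n - 1 lands on option B, which is the kind of distractor the exam tip in Section 3 warns about.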
Question 2 (Free Response)
A bookstore records the number of books purchased per transaction one Saturday, for a random sample of 15 transactions: [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 9, 12]. (a) Calculate the five-number summary for this dataset. (b) Use the 1.5×IQR rule to identify any outliers in this dataset. (c) Would you use the mean or the median to describe the center of this distribution? Justify your answer.
Worked Solution: (a) The data is already sorted, $n = 15$. Minimum $= 1$, maximum $= 12$. The median is the 8th value = 4. The lower half (first 7 values, excluding the median) has median $Q_1 = 2$. The upper half (last 7 values, excluding the median) has median $Q_3 = 6$. Five-number summary: minimum = 1, $Q_1 = 2$, median = 4, $Q_3 = 6$, maximum = 12. (b) $\text{IQR} = 6 - 2 = 4$. Outlier fences: Lower = $2 - 1.5(4) = -4$, Upper = $6 + 1.5(4) = 12$. All values fall between -4 and 12 (the maximum, 12, equals the upper fence but does not exceed it), so there are no outliers by the 1.5×IQR rule. (c) This distribution is right-skewed, with most transactions between 1 and 5 books and a long tail extending to 12 books. The median is resistant to the skewness, so the median is the appropriate measure of center.
Question 3 (Application / Real-World Style)
A marine biologist measures the length (in cm) of 12 random juvenile green sea turtles off the coast of Florida: [28, 31, 33, 35, 36, 37, 38, 40, 41, 42, 44, 47]. The distribution of lengths is roughly symmetric with no outliers. Calculate and interpret the mean and sample standard deviation of turtle lengths in context.
Worked Solution: $n = 12$, the sum of all lengths is 452, so $\bar{x} = \frac{452}{12} \approx 37.67$ cm. The sum of squared deviations from the mean is approximately 332.67. Sample variance $s^2 = \frac{332.67}{11} \approx 30.24$, so $s = \sqrt{30.24} \approx 5.5$ cm. Interpretation: The average length of juvenile green sea turtles in this sample is 37.7 cm, and individual turtle lengths typically vary from this average by about 5.5 cm.
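A quick way to double-check a hand calculation like this one is the computational shortcut $\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$. The shortcut is not part of the solution above; the sketch below applies it to the turtle data and cross-checks the result against `statistics.stdev`. Variable names are our own.

```python
from math import sqrt
from statistics import stdev

lengths = [28, 31, 33, 35, 36, 37, 38, 40, 41, 42, 44, 47]  # cm
n = len(lengths)
total = sum(lengths)
x_bar = total / n

# Shortcut: sum of squared deviations = sum(x^2) - (sum x)^2 / n
ss = sum(x * x for x in lengths) - total ** 2 / n
s = sqrt(ss / (n - 1))  # sample standard deviation, n - 1 in the denominator

print(total, round(x_bar, 2))
print(round(ss, 2), round(s, 2))
print(abs(s - stdev(lengths)) < 1e-9)  # agreement with the library function
```

The printed values can be compared line by line with the hand calculation; a mismatch usually points to an arithmetic slip in the sum of squared deviations.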
7. Quick Reference Cheatsheet
| Category | Formula | Notes |
|---|---|---|
| Sample Mean | $\bar{x} = \frac{\sum x_i}{n}$ | Non-resistant, use for symmetric distributions |
| Population Mean | $\mu = \frac{\sum x_i}{N}$ | Use when data is the entire population of interest |
| Median | Middle value of sorted data | Resistant, use for skewed distributions/outliers |
| Interquartile Range | $\text{IQR} = Q_3 - Q_1$ | Resistant measure of spread, pairs with median |
| Population Variance | $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ | Units are squared original data units |
| Sample Variance | $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ | Uses Bessel's correction for unbiased estimates |
| Sample Standard Deviation | $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ | Same units as original data, pairs with mean |
| Five-Number Summary | Minimum, $Q_1$, median, $Q_3$, maximum | Basis for boxplots and outlier detection |
| 1.5×IQR Outlier Fences | Lower: $Q_1 - 1.5 \times \text{IQR}$, Upper: $Q_3 + 1.5 \times \text{IQR}$ | Values outside fences are classified as outliers |
8. What's Next
This chapter is the foundation for all future work with quantitative data in AP Statistics. Immediately after this topic, you will learn to display quantitative data with graphs like boxplots, histograms, and dotplots, which rely on summary statistics to interpret and compare distributions. You will also use these summary values to compare two or more distributions, a common FRQ task on the AP exam. Later in the course, summary statistics from samples are the basis for statistical inference: you will use sample means and sample standard deviations to construct confidence intervals and run hypothesis tests for population means, which makes up a large portion of the AP exam’s inference weight. Without mastering the calculation, interpretation, and selection of the correct summary statistics, all subsequent inference work will be built on a shaky foundation.