Exploring One-Variable Data: AP Statistics Study Guide
For: Candidates sitting the AP Statistics exam.
Covers: categorical vs quantitative variables, frequency tables and core data visualizations, measures of center and spread, outlier identification rules, and z-scores/percentiles as required for AP Statistics Unit 1.
You should already know: Algebra 2, basic probability intuition.
A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official College Board mark schemes for grading conventions.
1. What Is Exploring One-Variable Data?
Exploring one-variable data is the process of describing, visualizing, and summarizing datasets that measure a single characteristic across a group of individuals, and it is the foundational skill for all statistical analysis in the AP Statistics curriculum. It appears as Unit 1 in the AP Statistics CED, accounting for 15-23% of your multiple-choice score, and forms the basis for all inferential work later in the course. When working with a single variable, you will always answer four core analysis questions: What type of variable are you measuring? What is the shape of the data distribution? Where is the center of the distribution? How much variability is present in the data?
2. Categorical vs quantitative variables
All statistical analysis begins with classifying your variable correctly, as this determines every subsequent step of your work. First, define an individual as the person, object, or event you are measuring, and a variable as any characteristic of that individual that varies across the group.
- Categorical variables: Place individuals into distinct groups or categories, where arithmetic operations on the values have no meaningful interpretation. Examples include eye color, pass/fail exam status, zip code, and sports jersey number (even though these are numeric, adding two zip codes yields no useful information).
- Quantitative variables: Take numerical values that represent counts or measurements, where arithmetic (addition, averaging) produces meaningful results. These are split into two subtypes:
  - Discrete quantitative variables: Countable values that take only whole numbers (e.g., number of siblings, number of AP courses taken)
  - Continuous quantitative variables: Can take any value within a range, measured rather than counted (e.g., height in centimeters, time to run a 5K, weighted GPA)
Worked Example
Classify each variable collected from a group of college freshmen:
- Number of meals eaten on campus per week: Quantitative discrete, you count whole meals with no fractional values.
- Dormitory building name: Categorical, groups students into residential buildings with no meaningful numeric value.
- Monthly spending on textbooks in USD: Quantitative continuous, can take any dollar value (e.g., $187.42).
Exam tip: Examiners regularly trick students with numeric categorical variables like jersey numbers or student ID numbers. If averaging the values produces a meaningless result, the variable is categorical, not quantitative.
3. Frequency tables, dotplots, stemplots, histograms
Once you classify your variable, you will visualize its distribution to identify shape, clusters, gaps, and extreme values. The four most common visualizations for one-variable data are:
- Frequency tables: List each value (or group of values for quantitative data) alongside its count (absolute frequency) and relative frequency (count divided by total sample size, often converted to a percentage). For categorical variables, each row corresponds to a category; for quantitative variables, values are grouped into equal-width bins.
- Dotplots: Plot each data point as a dot above its value on a number line. They are ideal for small datasets (<50 observations) because they retain all original data values, making it easy to see clusters and gaps.
- Stemplots (stem-and-leaf plots): Split each data point into a stem (leading digit(s)) and leaf (trailing digit). Like dotplots, they retain all original data, but are easier to sort and read for slightly larger datasets (50-200 observations). For example, a test score of 87 would appear as 8 | 7, where 8 is the tens-digit stem and 7 is the units-digit leaf.
- Histograms: Use adjacent bars to show the frequency of values in equal-width bins for quantitative variables. The x-axis shows the bin ranges, and the y-axis shows absolute or relative frequency. Histograms are ideal for large datasets (>200 observations) because they clearly show the shape of the distribution (symmetric, skewed left, skewed right, uniform, unimodal, bimodal).
Exam tip: A common 1-point loss on FRQs is mixing up histograms and bar charts. Bar charts are for categorical variables, with gaps between bars to show separate categories. Histograms are for quantitative variables, with no gaps between bars because the bins represent a continuous numerical range.
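The frequency table and stemplot described above can be sketched in a few lines of Python. This is a minimal illustration, not an exam technique: the payment-method list is made-up data, and the helper names `frequency_table` and `stemplot` are our own.

```python
from collections import Counter

# Hypothetical data, made up for illustration only.
payment_methods = ["cash", "credit", "mobile", "credit",
                   "credit", "cash", "mobile", "credit"]
test_scores = [62, 67, 71, 73, 76, 79, 82, 85, 88, 94]


def frequency_table(values):
    """Return rows of (value, count, relative frequency), most common first."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total)
            for value, count in counts.most_common()]


def stemplot(values):
    """Return stemplot lines with the tens digit as stem and units digit as leaf."""
    stems = {}
    for value in sorted(values):
        stems.setdefault(value // 10, []).append(value % 10)
    lines = []
    # Walk every stem in the range so empty stems (gaps) stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


for value, count, rel in frequency_table(payment_methods):
    print(f"{value:<8}{count:>3}{rel:>8.1%}")

print("\n".join(stemplot(test_scores)))
```

The stemplot keeps every original value readable, which is the property that sets it apart from a histogram.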
4. Center and spread — mean, median, SD, IQR
After visualizing the distribution, you will summarize it with measures of center (typical value) and spread (variability around the center).
Measures of Center
- Mean: The arithmetic average of the dataset, denoted $\bar{x}$ for samples and $\mu$ for populations. The formula for the sample mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the sample size and $x_i$ are the individual observations.
- Median: The middle value of the sorted dataset, equal to the 50th percentile. If $n$ is odd, it is the $\frac{n+1}{2}$th value; if $n$ is even, it is the average of the two middle values.
Measures of Spread
- Standard Deviation (SD): Measures the typical distance of the observations from the mean, denoted $s$ for samples and $\sigma$ for populations. The formula for the sample standard deviation uses $n - 1$ in the denominator to correct for bias in sample data: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
- Interquartile Range (IQR): Measures the spread of the middle 50% of the dataset, calculated as $\text{IQR} = Q_3 - Q_1$, where $Q_1$ is the 25th percentile (median of the lower half of sorted data) and $Q_3$ is the 75th percentile (median of the upper half of sorted data).
Worked Example
For the sorted set of 10 test scores: 62, 67, 71, 73, 76, 79, 82, 85, 88, 94
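- Mean = $\frac{62 + 67 + 71 + 73 + 76 + 79 + 82 + 85 + 88 + 94}{10} = \frac{777}{10} = 77.7$
- Median = average of 5th and 6th values = $\frac{76 + 79}{2} = 77.5$
- Q1 = median of lower 5 values = 71, Q3 = median of upper 5 values = 85, so IQR = $85 - 71 = 14$
- Standard deviation = $\sqrt{\frac{\sum (x_i - 77.7)^2}{10 - 1}} = \sqrt{\frac{876.1}{9}} \approx 9.87$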
Exam tip: For skewed distributions or distributions with outliers, use median and IQR, as they are not pulled by extreme values. For symmetric distributions with no outliers, use mean and SD, as they use all data points and are more precise.
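The worked example above can be checked with Python's standard `statistics` module. This is a sketch under one assumption worth flagging: the `quartiles` helper is our own and uses the median-of-halves rule described in this guide, because library quantile functions often interpolate differently and can return other values for Q1 and Q3.

```python
import statistics

scores = [62, 67, 71, 73, 76, 79, 82, 85, 88, 94]


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data.

    When n is odd the middle value is left out of both halves.
    Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return statistics.median(lower), statistics.median(upper)


mean = statistics.mean(scores)
median = statistics.median(scores)
q1, q3 = quartiles(scores)
iqr = q3 - q1
sd = statistics.stdev(scores)  # sample SD: divides by n - 1

print(f"mean={mean}, median={median}")  # mean=77.7, median=77.5
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")   # Q1=71, Q3=85, IQR=14
print(f"SD={sd:.2f}")                   # SD=9.87
```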
5. Outliers and resistance
An outlier is a data point that falls far outside the overall pattern of the distribution. A statistic is resistant if its value is not heavily affected by extreme outliers. The median and IQR are resistant, while the mean and SD are not — this is why median/IQR are preferred for skewed data.
The standard method for identifying outliers in AP Statistics is the 1.5*IQR rule:
- Calculate the lower fence: $Q_1 - 1.5 \times \text{IQR}$
- Calculate the upper fence: $Q_3 + 1.5 \times \text{IQR}$
- Any observation less than the lower fence or greater than the upper fence is classified as an outlier.
Worked Example
Using the test score data from Section 4: Q1=71, Q3=85, IQR=14
- Lower fence = $71 - 1.5(14) = 50$
- Upper fence = $85 + 1.5(14) = 106$
A test score of 48 would be below the lower fence of 50, so it is an outlier. If we add this outlier to the dataset, the median only shifts from 77.5 to 76, while the mean drops from 77.7 to 75, clearly demonstrating that the median is resistant and the mean is not.
Exam tip: You will lose points on FRQs if you identify outliers by eye alone. Always explicitly calculate the 1.5*IQR fences and show that the outlier falls outside the fence range to get full credit.
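The fence calculation and the resistance comparison from the worked example can be reproduced as follows. This is a minimal sketch: the helper names are our own, the quartiles use the median-of-halves rule from Section 4, and, as in the worked example, the fences are computed from the ten original scores before the extra value of 48 is tested against them.

```python
import statistics

scores = [62, 67, 71, 73, 76, 79, 82, 85, 88, 94]


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return (statistics.median(ordered[:half]),
            statistics.median(ordered[len(ordered) - half:]))


def iqr_fences(values):
    """Return (lower fence, upper fence) under the 1.5*IQR rule."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(x, lower, upper):
    """True when x falls strictly outside the fences."""
    return x < lower or x > upper


lower, upper = iqr_fences(scores)
print(lower, upper)                  # 50.0 106.0
print(is_outlier(48, lower, upper))  # True

# Resistance check: add the outlier and compare how far each statistic moves.
with_outlier = scores + [48]
print(statistics.median(scores), statistics.median(with_outlier))  # 77.5 76
print(statistics.mean(scores), statistics.mean(with_outlier))      # 77.7 75
```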
6. Z-scores and percentiles
These two metrics let you compare individual observations to the rest of the distribution, even across datasets with different units or scales.
- Percentile: The kth percentile is the value where k% of observations in the dataset are less than or equal to that value. For example, if your SAT score is at the 90th percentile, 90% of test takers scored equal to or lower than you.
- Z-score: Measures how many standard deviations an observation is from the mean of the distribution. The formula is: $z = \frac{x - \mu}{\sigma}$. A positive z-score means the observation is above the mean, a negative z-score means it is below the mean, and a z-score of 0 means it is exactly equal to the mean.
Worked Example
The average height of 12th grade boys is 69 inches, with a standard deviation of 2.8 inches. A 12th grade boy is 74 inches tall:
- His z-score = $\frac{74 - 69}{2.8} \approx 1.79$, so he is 1.79 standard deviations above the mean height for his group.
- If a boy has a z-score of -0.6, his height = $69 + (-0.6)(2.8) = 67.32$ inches.
Exam tip: Z-scores are the standard way to compare observations from distributions with different centers and spreads. For example, you can use z-scores to compare a student's performance on a math test (mean 72, SD 6) to their performance on an English test (mean 80, SD 4), even though the tests use different scoring scales.
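The z-score arithmetic from the height example, and the "less than or equal to" percentile definition, can be sketched as below. The function names are our own, and the list of scores is the Section 4 data reused purely to illustrate the percentile calculation.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def value_from_z(z, mean, sd):
    """Invert the z-score formula to recover the original value."""
    return mean + z * sd


def percentile_rank(values, x):
    """Percent of observations that are less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)


print(round(z_score(74, 69, 2.8), 2))         # 1.79
print(round(value_from_z(-0.6, 69, 2.8), 2))  # 67.32

scores = [62, 67, 71, 73, 76, 79, 82, 85, 88, 94]
print(percentile_rank(scores, 85))            # 80.0
```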
7. Common Pitfalls (and how to avoid them)
- Wrong move: Treating numeric categorical variables (zip codes, jersey numbers, student IDs) as quantitative. Why: Students assume any numeric value is quantitative. Correct move: Ask "does averaging these values produce a meaningful result?" If no, the variable is categorical.
- Wrong move: Using mean and SD to describe skewed distributions. Why: Students default to the mean because it is the most familiar measure of center. Correct move: Check the distribution shape first; use median and IQR for skewed data or data with outliers.
- Wrong move: Drawing gaps between bars on histograms, or omitting gaps on bar charts for categorical data. Why: Students mix up the two visualization types. Correct move: Histograms = no gaps (quantitative bins are continuous), bar charts = gaps (categories are separate groups).
- Wrong move: Identifying outliers by eye instead of using the 1.5*IQR rule. Why: Students assume "looking far away" is sufficient justification. Correct move: Always calculate the upper and lower fences explicitly and show the outlier falls outside the range for FRQ credit.
- Wrong move: Dividing by $n$ instead of $n - 1$ when calculating sample standard deviation. Why: Students forget the bias correction for sample data. Correct move: Use $n - 1$ for all sample SD calculations; only use $n$ if you are explicitly told you have data for the entire population.
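The last pitfall is easy to see numerically. In Python's standard `statistics` module, `stdev` divides by n - 1 (sample) and `pstdev` divides by n (population); the sketch below applies both to the Section 4 test scores to show how the choice of denominator changes the result.

```python
import statistics

scores = [62, 67, 71, 73, 76, 79, 82, 85, 88, 94]

sample_sd = statistics.stdev(scores)       # denominator n - 1
population_sd = statistics.pstdev(scores)  # denominator n

print(f"sample SD     = {sample_sd:.2f}")      # 9.87
print(f"population SD = {population_sd:.2f}")  # 9.36
```

The sample version is always the larger of the two, and the gap shrinks as the sample size grows.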
8. Practice Questions (AP Statistics Style)
Question 1
A coffee shop collects data on 80 customers one Saturday. For each customer, they record: 1) payment method (cash, credit, mobile pay), 2) number of drinks ordered, 3) wait time in minutes, 4) satisfaction rating (poor, fair, good, excellent). Classify each variable as categorical or quantitative, and if quantitative, state if it is discrete or continuous.
Solution
- Payment method: Categorical, groups customers into distinct payment types with no meaningful numeric value.
- Number of drinks ordered: Quantitative discrete, counted as whole numbers with no fractional values.
- Wait time in minutes: Quantitative continuous, can take any value (e.g., 3.72 minutes).
- Satisfaction rating: Categorical, groups customers into distinct satisfaction tiers even if you assign numeric values to the tiers.
Question 2
The distribution of annual household income in a small town has Q1 = …, Q3 = $92,000, median = $67,000, mean = $79,000, SD = $31,000. a) Calculate the IQR. b) Use the 1.5*IQR rule to determine if a household income of $165,000 is an outlier. c) Is the distribution likely symmetric, left skewed, or right skewed? Justify your answer.
Solution
a) IQR = Q3 - Q1 = $92,000 - Q1.
b) Upper fence = Q3 + 1.5*IQR = $92,000 + 1.5*IQR. $165,000 is greater than the upper fence, so this income is an outlier.
c) The distribution is right skewed. The mean ($79,000) is greater than the median ($67,000), which occurs when high extreme values (very high incomes) pull the mean upward, a defining feature of right skew.
Question 3
The average score on a biology midterm is 71, with a standard deviation of 6.2. The average score on a chemistry midterm is 68, with a standard deviation of 7.5. A student scores 81 on the biology midterm and 80 on the chemistry midterm. On which exam did the student perform better relative to their class? Show your work.
Solution
Calculate z-scores for both exams:
- Biology z-score: $\frac{81 - 71}{6.2} \approx 1.61$
- Chemistry z-score: $\frac{80 - 68}{7.5} = 1.60$
The student's z-score is slightly higher for biology, so they performed better relative to their biology class than their chemistry class.
9. Quick Reference Cheatsheet
Key Definitions
| Term | Definition |
|---|---|
| Categorical variable | Groups individuals, no meaningful arithmetic on values |
| Quantitative variable | Numeric counts/measurements, arithmetic produces meaningful results |
| Resistant statistic | Not heavily affected by extreme outliers (median, IQR are resistant; mean, SD are not) |
| Outlier | Value < $Q_1 - 1.5 \times \text{IQR}$ or > $Q_3 + 1.5 \times \text{IQR}$ |
| kth percentile | k% of observations are ≤ this value |
Key Formulas
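- Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
- IQR: $\text{IQR} = Q_3 - Q_1$
- Z-score: $z = \frac{x - \bar{x}}{s}$ (sample) / $z = \frac{x - \mu}{\sigma}$ (population)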
Distribution Shape Rules
- Symmetric: Mean ≈ Median, use mean/SD for summary
- Right skewed: Mean > Median, use median/IQR for summary
- Left skewed: Mean < Median, use median/IQR for summary
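The mean-versus-median comparison in these rules can be turned into a rough check. This is only a heuristic sketch: the function name `suggest_shape`, the `tolerance` cutoff of 0.1 standard deviations, and the income figures are all our own assumptions, and a plot of the data remains the primary way to judge shape.

```python
import statistics


def suggest_shape(values, tolerance=0.1):
    """Heuristic shape label from the mean-median gap, measured in SD units.

    The tolerance is an arbitrary cutoff chosen for this sketch.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return "roughly symmetric"
    gap = (mean - median) / sd
    if gap > tolerance:
        return "right skewed"   # mean pulled above the median
    if gap < -tolerance:
        return "left skewed"    # mean pulled below the median
    return "roughly symmetric"


# Section 4 test scores: mean 77.7 and median 77.5 are nearly equal.
scores = [62, 67, 71, 73, 76, 79, 82, 85, 88, 94]
# Hypothetical incomes in thousands of dollars, with one very high value.
incomes = [30, 35, 38, 40, 42, 45, 48, 52, 60, 150]

print(suggest_shape(scores))   # roughly symmetric
print(suggest_shape(incomes))  # right skewed
```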
10. What's Next
Mastering one-variable data is the foundation for every subsequent unit in the AP Statistics curriculum. Next, you will extend these skills to exploring two-variable data, where you will describe relationships between pairs of variables, calculate correlation coefficients, and build linear regression models to make predictions. Later, the summary statistics you learned in this guide will be the basis for all inferential work, including confidence intervals and significance tests for population means and proportions, which account for a large share of the AP exam.