Introduction to Categorical Association — AP Statistics Study Guide
For: Candidates sitting the AP Statistics exam.
Covers: Two-way frequency tables, joint/marginal/conditional relative frequency calculation, detecting categorical association vs independence, proportional comparison of conditional distributions, and introductory Simpson’s paradox analysis for contingency tables.
You should already know: Categorical vs quantitative variable distinction, one-variable frequency distribution calculation, basic proportional reasoning.
A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board / Cambridge / IB papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official mark schemes for grading conventions.
1. What Is Categorical Association?
This topic introduces how to analyze and describe relationships between two categorical variables. It is part of AP Statistics Unit 2: Exploring Two-Variable Data, which accounts for 5-7% of the total AP exam score. It appears in both multiple-choice (MCQ) and free-response (FRQ) sections, most commonly as a set of MCQs or as the first two parts of a multi-part FRQ.
Categorical association means that the distribution of one categorical variable changes depending on the value of the other categorical variable; if no association exists, the variables are said to be independent. Standard notation uses a two-way (contingency) table with rows for the explanatory variable and columns for the response variable, where $n_{ij}$ is the observed count in row $i$, column $j$, $n_{i\cdot}$ is the row total for row $i$, $n_{\cdot j}$ is the column total for column $j$, and $n$ is the total sample size. Synonyms include contingency table analysis and two-way categorical association. Unlike quantitative association (which relies on correlation and linear regression), categorical association depends on comparing conditional proportions rather than linear trends.
2. Two-Way Tables and Frequency Types
A two-way (contingency) table organizes counts of observations for all combinations of levels of two categorical variables. For association analysis, three core frequency types are used, each with a corresponding relative frequency (proportion) that adjusts for sample size:
- Marginal frequency: The total count of observations for a single level of one variable, found in the margins of the table. Marginal relative frequency is calculated as the marginal frequency divided by the total sample size $n$, and it describes the proportion of the entire sample that falls into that level of the variable.
- Joint frequency: The count of observations that fall into a specific combination of levels of both variables (a single cell in the table). Joint relative frequency is joint frequency divided by $n$, and it describes the proportion of the entire sample that has that specific combination of outcomes.
- Conditional frequency: The count of observations for a level of one variable, restricted to (conditioned on) a specific level of the other variable. Conditional relative frequency is conditional frequency divided by the total of the conditioning group, not the overall sample size $n$. Comparing conditional relative distributions of the response variable across levels of the explanatory variable is the core method for detecting association.
Worked Example
A survey of 200 high school students asks two questions: do they play a varsity sport (Yes/No), and do they have an after-school job (Yes/No). The observed counts are shown below:
| | After-school job: Yes | After-school job: No | Row Total |
|---|---|---|---|
| Varsity: Yes | 32 | 48 | 80 |
| Varsity: No | 60 | 60 | 120 |
| Column Total | 92 | 108 | 200 |
Calculate: (a) the joint relative frequency of varsity athletes with no after-school job, (b) the marginal relative frequency of students with after-school jobs, (c) the conditional relative frequency of having an after-school job among varsity athletes.
- Recall the definition for each frequency type to confirm the correct denominator: joint relative frequency uses the overall $n$, marginal uses the overall $n$, and conditional uses the conditioning group total.
- Solve (a): The cell for varsity athletes with no after-school job has a count of 48. Joint relative frequency = $\frac{48}{200} = 0.24$.
- Solve (b): The marginal total for students with after-school jobs is 92. Marginal relative frequency = $\frac{92}{200} = 0.46$.
- Solve (c): We are conditioning on being a varsity athlete, so the denominator is the row total for varsity athletes (80), not the overall $n = 200$. The count of varsity athletes with after-school jobs is 32. Conditional relative frequency = $\frac{32}{80} = 0.40$.
Exam tip: Always circle the condition mentioned in the question before calculating. If the question uses phrases like "given that", "among", or "conditional on", the denominator is the total of the specified condition group, not the overall sample size.
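The three calculations in this worked example can be reproduced with a short Python sketch. The dictionary layout and variable names below are illustrative choices, not part of the AP course materials; only the counts come from the table above.

```python
# Counts from the worked example: rows = varsity sport, columns = after-school job.
counts = {
    "Varsity: Yes": {"Job: Yes": 32, "Job: No": 48},
    "Varsity: No": {"Job: Yes": 60, "Job: No": 60},
}

# Grand total of all observations in the table.
n = sum(sum(cells.values()) for cells in counts.values())

# Marginal totals for each row and each column.
row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
col_totals = {}
for cells in counts.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count

# (a) Joint: one cell divided by the grand total.
joint = counts["Varsity: Yes"]["Job: No"] / n
# (b) Marginal: one column total divided by the grand total.
marginal = col_totals["Job: Yes"] / n
# (c) Conditional: one cell divided by the total of its conditioning row.
conditional = counts["Varsity: Yes"]["Job: Yes"] / row_totals["Varsity: Yes"]

print(f"joint = {joint:.2f}, marginal = {marginal:.2f}, conditional = {conditional:.2f}")
# prints: joint = 0.24, marginal = 0.46, conditional = 0.40
```

The only thing that changes between the three results is the denominator, which is why identifying the condition first matters.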
3. Detecting Categorical Association vs Independence
Two categorical variables are independent (no association) if the conditional distribution of the response variable is identical across all levels of the explanatory variable. In other words, knowing the value of the explanatory variable gives you no additional information about the value of the response variable.
With real sample data, conditional distributions are almost never perfectly identical, so we assess association by the magnitude of the difference between conditional proportions. As a rough rule of thumb in this guide for 2x2 tables (the most common case on the AP exam), a difference of less than 10 percentage points is treated as small, meaning there is little to no evidence of association, while a difference of 10 percentage points or more is treated as large enough to conclude there is an association in the sample. A common mistake is to compare raw counts instead of conditional proportions; raw counts are misleading when group sizes differ, because a larger group will tend to have more outcomes in every category even if the proportions are identical.
Worked Example
A researcher studies whether pet ownership (cat/dog/no pet) is associated with living arrangement (house/apartment). The conditional relative distribution of living arrangement, given pet type, is shown below:
| | House | Apartment | Total |
|---|---|---|---|
| Cat owner | 0.65 | 0.35 | 1.00 |
| Dog owner | 0.72 | 0.28 | 1.00 |
| No pet | 0.68 | 0.32 | 1.00 |
Is there evidence of an association between pet ownership and living arrangement in this sample? Justify your answer.
- To assess association, we compare the conditional distributions of living arrangement across the three pet ownership groups.
- Calculate the range of conditional proportions for each level of living arrangement to measure how much they vary: For House, proportions range from 0.65 (cat owners) to 0.72 (dog owners), a difference of 0.07 (7 percentage points). For Apartment, proportions range from 0.28 to 0.35, also a 7 percentage point difference.
- A difference of only 7 percentage points across groups is small, meaning knowing a person's pet ownership gives almost no information about their living arrangement.
- Conclusion: There is no meaningful evidence of an association between the two variables in this sample.
Exam tip: On FRQ questions asking about association, you must reference the magnitude of the difference in conditional proportions in context to earn full credit—never just state "yes" or "no" without numerical evidence.
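The comparison in the pet ownership example can be sketched in Python as follows. The helper name `spread_by_response` is a hypothetical name chosen for illustration; the proportions are taken from the table above.

```python
# Conditional distributions of living arrangement given pet type (from the table above).
cond = {
    "Cat owner": {"House": 0.65, "Apartment": 0.35},
    "Dog owner": {"House": 0.72, "Apartment": 0.28},
    "No pet": {"House": 0.68, "Apartment": 0.32},
}

def spread_by_response(cond_dists):
    """For each response level, return max minus min conditional proportion across groups."""
    levels = next(iter(cond_dists.values())).keys()
    spreads = {}
    for level in levels:
        values = [dist[level] for dist in cond_dists.values()]
        spreads[level] = max(values) - min(values)
    return spreads

spreads = spread_by_response(cond)
for level, gap in spreads.items():
    print(f"{level}: spread of {gap * 100:.0f} percentage points")
# prints:
# House: spread of 7 percentage points
# Apartment: spread of 7 percentage points
```

The code only reports the size of the gap; deciding whether that gap is small or large in context is still a judgment you must state in words on the exam.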
4. Introduction to Simpson's Paradox
Simpson's paradox is a phenomenon where the direction of association between two categorical variables is reversed when the data is split into subgroups by a third confounding variable. This occurs when the confounding variable is unevenly distributed across the levels of the original explanatory variable. For example, an aggregated analysis might show that a higher proportion of male applicants are accepted to a university than female applicants, but when split by department (the confounder), most departments actually have a small acceptance advantage for women. The paradox occurs in this case because women are more likely to apply to competitive, low-acceptance-rate departments, while men are more likely to apply to higher-acceptance-rate departments, so the aggregated result is distorted by the confounding variable.
Simpson's paradox highlights the importance of checking for lurking confounding variables when analyzing categorical association, as these variables can completely reverse the observed relationship between two variables of interest.
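One way to see why a reversal is possible is to write each group's aggregated rate as a weighted average of its subgroup rates. The symbols below are notation introduced here for illustration and are not from the AP formula sheet: $p_{g,k}$ is the rate for group $g$ within subgroup $k$, $n_{g,k}$ is the number of observations from group $g$ in subgroup $k$, and $w_{g,k}$ is the share of group $g$'s observations that fall in subgroup $k$.

$$
p_g^{\text{overall}} = \sum_{k} w_{g,k}\, p_{g,k}, \qquad w_{g,k} = \frac{n_{g,k}}{\sum_{k'} n_{g,k'}}
$$

Because the weights $w_{g,k}$ can differ sharply between groups, a group with the lower rate in every subgroup can still have the higher aggregated rate if most of its observations sit in a subgroup where rates are high for everyone.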
Worked Example
Two pitchers on a college baseball team record their hit rates over a season. The table below shows hits and batters faced against right-handed and left-handed batters:
| Pitcher | Hits vs Right | Total Right Batters | Hits vs Left | Total Left Batters |
|---|---|---|---|---|
| A | 1 | 10 | 36 | 90 |
| B | 18 | 90 | 5 | 10 |
Explain how this is an example of Simpson's paradox, and identify the confounding variable.
- Calculate the conditional hit rate for each pitcher within each subgroup:
- Pitcher A: $\frac{1}{10} = 0.10$ (10%) vs right-handed batters, $\frac{36}{90} = 0.40$ (40%) vs left-handed batters.
- Pitcher B: $\frac{18}{90} = 0.20$ (20%) vs right-handed batters, $\frac{5}{10} = 0.50$ (50%) vs left-handed batters.
- Compare within subgroups: Pitcher A has a lower (better) hit rate than Pitcher B against both right-handed and left-handed batters.
- Calculate the overall (aggregated) hit rate for each pitcher:
- Pitcher A overall: $\frac{1 + 36}{10 + 90} = \frac{37}{100} = 0.37$ (37%).
- Pitcher B overall: $\frac{18 + 5}{90 + 10} = \frac{23}{100} = 0.23$ (23%).
- The direction of the association is reversed: when aggregated, Pitcher B has a lower (better) overall hit rate than Pitcher A, even though A performs better against both subgroups of batters. This reversal matches the definition of Simpson's paradox.
- The confounding variable is batter handedness: Pitcher A faces far more left-handed batters (who have higher overall hit rates than right-handed batters) than Pitcher B, who faces mostly right-handed batters.
Exam tip: When asked to explain Simpson's paradox on the exam, you must explicitly state the reversal of the association direction and explain the uneven distribution of the confounding variable to earn full credit.
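The reversal in the pitcher example can be checked with a minimal Python sketch. The function names `rate` and `overall_rate` are illustrative; the hit and batter counts come from the table above.

```python
# (hits, batters faced) for each pitcher, split by batter handedness (from the table above).
data = {
    "A": {"Right": (1, 10), "Left": (36, 90)},
    "B": {"Right": (18, 90), "Left": (5, 10)},
}

def rate(hits, faced):
    """Hit rate within one subgroup."""
    return hits / faced

def overall_rate(splits):
    """Aggregated hit rate: total hits divided by total batters faced."""
    total_hits = sum(hits for hits, _ in splits.values())
    total_faced = sum(faced for _, faced in splits.values())
    return total_hits / total_faced

subgroups = ["Right", "Left"]

# Lower hit rate is better for a pitcher.
a_better_in_every_subgroup = all(
    rate(*data["A"][s]) < rate(*data["B"][s]) for s in subgroups
)
a_better_overall = overall_rate(data["A"]) < overall_rate(data["B"])
reversal = a_better_in_every_subgroup and not a_better_overall

for pitcher, splits in data.items():
    parts = ", ".join(f"{s}: {rate(*splits[s]):.0%}" for s in subgroups)
    print(f"Pitcher {pitcher}: {parts}, overall: {overall_rate(splits):.0%}")
print("Reversal (Simpson's paradox):", reversal)
# prints:
# Pitcher A: Right: 10%, Left: 40%, overall: 37%
# Pitcher B: Right: 20%, Left: 50%, overall: 23%
# Reversal (Simpson's paradox): True
```

Pitcher A's overall rate sits close to the 40% left-handed rate because 90 of the 100 batters A faced were left-handed, while Pitcher B's sits close to the 20% right-handed rate for the mirror-image reason.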
5. Common Pitfalls (and how to avoid them)
- Wrong move: Calculating conditional relative frequency with the overall sample size as the denominator instead of the condition group total. Why: Students confuse joint and conditional frequency because they forget that "given" or "among" means we restrict the sample to the condition group. Correct move: Circle the condition mentioned in the question every time, then write down the total of that condition as your denominator before doing any calculation.
- Wrong move: Concluding association from a 7 percentage point difference in conditional proportions from a sample. Why: Students think any difference means association, not accounting for random sample variation. Correct move: Always report the magnitude of the difference in context, and only conclude association for differences that are 10 percentage points or larger for 2x2 tables.
- Wrong move: Concluding no association because marginal frequencies are similar across groups. Why: Students confuse marginal and conditional distributions; association is about conditional distributions, not marginal. Correct move: Always calculate and compare conditional distributions of the response variable across explanatory variable levels when assessing association.
- Wrong move: Calculating joint relative frequency as cell count divided by row or column total. Why: Students mix up the definitions of joint vs conditional relative frequency. Correct move: Remember: joint = out of all observations, conditional = out of the condition group.
- Wrong move: Claiming Simpson's paradox always proves the aggregated analysis is wrong. Why: Students assume splitting by any variable gives the "true" result, but the variable may not be a confounding lurking variable. Correct move: When asked about Simpson's paradox, only note the association reversal and explain that the uneven distribution of the third variable causes the reversal.
6. Practice Questions (AP Statistics Style)
Question 1 (Multiple Choice)
A survey of 150 college students is conducted to study the association between on-campus vs off-campus residence and owning a car. The table below gives the counts:
| | Own Car: Yes | Own Car: No | Total |
|---|---|---|---|
| On-campus | 25 | 60 | 85 |
| Off-campus | 45 | 20 | 65 |
| Total | 70 | 80 | 150 |
What is the conditional relative frequency of living off-campus, given that a student does not own a car? A) $\frac{20}{150} \approx 0.13$ B) $\frac{20}{65} \approx 0.31$ C) $\frac{20}{80} = 0.25$ D) $\frac{80}{150} \approx 0.53$
Worked Solution: First, identify the condition: "given that a student does not own a car", so our denominator is the total number of students who do not own a car, which is 80. We want the count of off-campus students among these, which is 20. The conditional relative frequency is $\frac{20}{80} = 0.25$. Option A gives the joint relative frequency ($\frac{20}{150}$), Option B uses the wrong denominator (the off-campus total, 65), and Option D is the marginal relative frequency of not owning a car ($\frac{80}{150}$). The correct answer is C.
Question 2 (Free Response)
A public health researcher studies the association between daily soda consumption (less than 1 per week / 1+ per week) and self-reported sleep quality (good / poor). The sample of 300 adults gives the following counts:
| | Good Sleep | Poor Sleep | Total |
|---|---|---|---|
| <1 soda per week | 90 | 60 | 150 |
| 1+ sodas per week | 65 | 85 | 150 |
| Total | 155 | 145 | 300 |
(a) Calculate the conditional relative frequency of poor sleep for each soda consumption group. (b) Based on your calculations from (a), is there evidence of an association between soda consumption and sleep quality in this sample? Justify your answer. (c) Explain why comparing the raw counts of poor sleep (60 vs 85) would not be a valid way to assess association here.
Worked Solution: (a) For the <1 soda per week group: conditional relative frequency = $\frac{60}{150} = 0.40$ (40%). For the 1+ sodas per week group: conditional relative frequency = $\frac{85}{150} \approx 0.57$ (57%). (b) The difference in conditional relative frequencies is $0.57 - 0.40 = 0.17$, or 17 percentage points. This is a large difference: people who drink 1+ sodas per week are 17 percentage points more likely to report poor sleep than those who drink less than 1 per week. This large difference means there is clear evidence of an association between soda consumption and sleep quality in this sample. (c) Raw counts do not adjust for different group sizes. Even if the proportion of people with poor sleep were identical in both groups, a larger group would have a higher raw count of poor sleep. In this case, both groups are the same size, but in general, comparing relative frequencies (proportions) is the only valid way to compare groups of different sizes.
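The point in part (c) can be illustrated with a short Python sketch. The first dictionary uses the counts from this question; the second uses HYPOTHETICAL counts, invented here purely for illustration, in which two groups of different sizes have the same proportion of poor sleepers.

```python
# Counts from Question 2: poor-sleep count and group total for each soda group.
question_groups = {
    "<1 soda per week": {"poor": 60, "total": 150},
    "1+ sodas per week": {"poor": 85, "total": 150},
}

# HYPOTHETICAL counts (not from the question): unequal group sizes,
# identical proportion reporting poor sleep.
hypothetical_groups = {
    "Group X": {"poor": 40, "total": 100},
    "Group Y": {"poor": 120, "total": 300},
}

def poor_sleep_proportions(groups):
    """Conditional relative frequency of poor sleep within each group."""
    return {name: g["poor"] / g["total"] for name, g in groups.items()}

question_props = poor_sleep_proportions(question_groups)
hypothetical_props = poor_sleep_proportions(hypothetical_groups)

for name, p in question_props.items():
    print(f"{name}: {p:.0%} poor sleep")
for name, g in hypothetical_groups.items():
    print(f"{name}: raw count {g['poor']}, proportion {hypothetical_props[name]:.0%}")
# prints:
# <1 soda per week: 40% poor sleep
# 1+ sodas per week: 57% poor sleep
# Group X: raw count 40, proportion 40%
# Group Y: raw count 120, proportion 40%
```

In the hypothetical data the raw counts differ by a factor of three even though the proportions match, so a comparison of counts alone would suggest an association that is not there.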
Question 3 (Application / Real-World Style)
A wildlife biologist studies the association between forest fire severity (low / high) and the presence of invasive grass species (present / absent) in 500 randomly selected plots in a national park. When aggregated across all elevation levels, about 40% of plots with invasive grass had high-severity fires, compared to about 33% of plots without invasive grass. When split by elevation (low elevation < 1000m / high elevation ≥ 1000m), the results are reversed: 22% of low-elevation plots with invasive grass had high-severity fires vs 28% of low-elevation plots without invasive grass, and 45% of high-elevation plots with invasive grass had high-severity fires vs 51% of high-elevation plots without invasive grass. 78% of invasive grass plots are at high elevation, while only 22% of plots without invasive grass are at high elevation. What is this an example of, what is the confounding variable, and what caused the reversal of the association?
Worked Solution: This is an example of Simpson's paradox. The confounding variable is elevation, which is correlated with both invasive grass presence and fire severity: high elevation plots naturally have a higher rate of high-severity fires than low elevation plots. Invasive grass plots are heavily concentrated in high elevation (78% of invasive plots are high elevation, compared to only 22% of non-invasive plots). Because invasive grass is clustered in the high-risk high elevation group, the aggregated analysis incorrectly suggests invasive grass is associated with higher fire severity, even though within each elevation group, invasive grass is actually associated with lower high-severity fire risk. The uneven distribution of elevation across invasive grass groups caused the reversal.
7. Quick Reference Cheatsheet
| Category | Formula | Notes |
|---|---|---|
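| Joint Relative Frequency | $\frac{n_{ij}}{n}$ (cell count ÷ grand total) | Proportion of all observations in a specific cell; describes combined outcomes |
| Marginal Relative Frequency | $\frac{n_{i\cdot}}{n}$ or $\frac{n_{\cdot j}}{n}$ (row or column total ÷ grand total) | Proportion of all observations in a single level of one variable |
| Conditional Relative Frequency (condition on row) | $\frac{n_{ij}}{n_{i\cdot}}$ (cell count ÷ row total) | Denominator is always the total of the conditioning group |
| Conditional Relative Frequency (condition on column) | $\frac{n_{ij}}{n_{\cdot j}}$ (cell count ÷ column total) | Adjust denominator to match the column condition |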
| Rule for Categorical Association | Compare conditional distributions across groups | Differences ≥ 10 percentage points (2x2 tables) = evidence of association |
| Independence (No Association) | Conditional distributions are identical across groups | Knowing one variable gives no information about the other |
| Simpson's Paradox | Association direction reverses when splitting by a third variable | Caused by uneven distribution of the confounding variable |
8. What's Next
This chapter is the foundational prerequisite for all further analysis of two categorical variables in AP Statistics. Later in the course, you will learn how to conduct chi-square tests for association and homogeneity, which use the two-way tables and frequency calculations you mastered here to test whether an observed association in a sample is statistically significant (not just due to random chance). Without correctly calculating conditional frequencies and understanding what association means for categorical variables, you cannot correctly set up or interpret chi-square tests, which make up 2-5% of the total AP exam score. This topic also builds understanding of confounding and lurking variables that applies to all study design and inference topics across the course.
Follow-on topics: Chi-Square Tests for Association; Confounding and Lurking Variables; Two-Variable Quantitative Association