Inference for Categorical Data: Chi-Square — AP Statistics Unit Overview

For: Candidates sitting the AP Statistics exam.

Covers: This unit overview covers the chi-square distribution, chi-square test for goodness of fit, test for homogeneity, test for independence, and skills for selecting the correct inference procedure for categorical data per the AP Statistics CED.

You should already know: The core significance test and inference framework; how to summarize categorical data in one-way and two-way tables; the concept of degrees of freedom for sampling distributions.

A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board / Cambridge / IB papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official mark schemes for grading conventions.

1. Why This Unit Matters

This unit extends inference for categorical data, a type of data common in social science, public health, marketing, and biological research, beyond the one- and two-proportion procedures of earlier units. Before this unit, the inference you learned focused on quantitative data (means) or on one or two proportions; chi-square tests extend inference to categorical variables with any number of levels, and answer new research questions like "does this distribution match our claim?" and "are these two categorical variables related?"

This unit reinforces the core significance test framework that unifies all of AP Statistics: stating hypotheses, checking conditions, calculating a test statistic, finding a p-value, and drawing a conclusion in context. Per the AP Statistics CED, this unit is weighted at 2-5% of the multiple-choice section, and chi-square procedures also appear in free-response questions, including procedure-selection questions.

2. Unit Concept Map

The five subtopics in this unit build sequentially, starting with foundational concepts and moving to applied tests and skill mastery:

Introducing Chi-Square: Lays the shared foundation for all chi-square tests. You will learn the shape and properties of the chi-square distribution (right-skewed, always non-negative, shape determined by degrees of freedom), and the general chi-square test statistic formula that is used for all three tests.
Chi-Square Test for Goodness of Fit: The simplest application of the chi-square framework, for one categorical variable. You will practice the full inference workflow, testing whether the distribution of a single categorical variable matches a claimed hypothesized distribution.
Chi-Square Test for Homogeneity: Builds on goodness-of-fit to work with two-way tables, comparing the distribution of one categorical variable across multiple independent populations or groups.
Chi-Square Test for Independence: Another two-way table test, this time for testing association between two categorical variables measured on a single sample. It uses the same test statistic calculation as homogeneity but has a different study design and research question.
Skills Focus: Selecting an Inference Procedure: Ties all three chi-square tests together with all other inference procedures you have learned, teaching you to match the research question and study design to the correct test — a core AP exam skill.

3. Guided Tour of a Unit-Style Exam Problem

We work through a single exam-style problem to show how multiple core subtopics connect in sequence to solve it:

A student researcher at a large university collects a random sample of 220 undergraduate students. They ask each student two questions: their favorite snack type (sweet, salty, savory) and whether they prefer to study in the library, at home, or a coffee shop. The researcher wants to know if favorite snack type is associated with study location preference.

Step-by-Step Connection to Unit Subtopics

First, apply the Skills Focus: Selecting an Inference Procedure subtopic: Characterize the data and research question: we have one random sample of students, two categorical variables measured on each individual, and we want to test for association. This rules out goodness-of-fit (only one variable, testing a claimed distribution) and homogeneity (multiple independent groups, one variable). We correctly select a chi-square test for independence.
Next, apply core concepts from the Introducing Chi-Square subtopic: All chi-square tests share the same core test statistic: $\chi^{2} = \sum_{\text{all cells}} \frac{(\text{Observed} - \text{Expected})^{2}}{\text{Expected}}$. The chi-square distribution is right-skewed, and we only use the upper tail to calculate p-values (large test statistics support the alternative hypothesis). For this problem, degrees of freedom are calculated as $(\text{rows} - 1)(\text{columns} - 1) = (3 - 1)(3 - 1) = 4$, so we use a chi-square distribution with 4 df to find our p-value.
Finally, execute the test using the Chi-Square Test for Independence subtopic: Follow the full inference framework: state hypotheses ($H_{0}$: no association between snack type and study location preference in the population of undergraduates; $H_{a}$: there is an association), check conditions (random sample; 10% condition for independence; all expected counts at least 1 and fewer than 20% of expected counts below 5), calculate the test statistic, find the p-value, and conclude in context.
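The three steps above can be sketched in code. The problem statement gives no cell counts, so the 3x3 table below is hypothetical (invented so that it totals 220), and the function names are illustrative rather than part of any library. The p-value helper uses a closed form for the chi-square upper tail that holds only when the degrees of freedom are even, which is the case here (df = 4); other df values would need a chi-square table or statistical software.

```python
import math

# Hypothetical counts (NOT from the study): rows = snack type
# (sweet, salty, savory), columns = study location (library, home, coffee shop).
observed = [
    [30, 25, 20],
    [20, 30, 25],
    [25, 20, 25],
]


def expected_counts(table):
    """Expected count per cell: (row total)(column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]


def chi_square_statistic(obs, exp):
    """Sum of (O - E)^2 / E over all cells of the table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(obs, exp)
        for o, e in zip(obs_row, exp_row)
    )


def upper_tail_even_df(x, df):
    """Upper-tail chi-square area; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    return math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )


expected = expected_counts(observed)
statistic = chi_square_statistic(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # (rows - 1)(columns - 1)
p_value = upper_tail_even_df(statistic, df)

print(f"chi-square = {statistic:.3f}, df = {df}, p-value = {p_value:.3f}")
```

With these made-up counts the p-value is well above 0.05, so the conclusion in context would be that the sample does not provide convincing evidence of an association between snack type and study location preference.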

4. Cross-Cutting Common Pitfalls

These are traps that appear across multiple subtopics in this unit, rooted in common confusions:

Wrong move: Confusing chi-square test for independence and chi-square test for homogeneity, using the wrong one for the study design. Why: Both tests use two-way tables, the same test statistic calculation, and the same degrees of freedom formula, so students often assume they are interchangeable. Correct move: Always start by asking how data was collected: one sample with two variables per observation = independence; multiple independent samples from pre-defined groups with one variable per observation = homogeneity.
Wrong move: Using $df = n - 1$ for two-way chi-square tests, instead of the table-based formula. Why: Students memorize $df = n - 1$ from t-procedures and $df = k - 1$ from goodness-of-fit tests, then incorrectly generalize a "something minus one" rule to all chi-square scenarios. Correct move: For goodness-of-fit only, use $df = k - 1$ ($k$ = number of categories); for all two-way chi-square tests, always use $df = (r - 1)(c - 1)$.
Wrong move: Running multiple two-proportion z-tests instead of a single chi-square test when comparing proportions across three or more groups. Why: Students are more comfortable with proportion tests they learned earlier, and do not recognize that multiple testing inflates the Type I error rate. Correct move: Any comparison of a categorical variable's distribution across three or more groups should use a single chi-square test for homogeneity, not multiple z-tests.
Wrong move: Concluding that a significant chi-square test for independence proves causation between two variables. Why: Students regularly confuse statistically significant association with causal relationship, especially when the problem context suggests a plausible causal link. Correct move: Only conclude causation from a chi-square test if the study used random assignment to groups; for observational studies, only conclude there is a statistically significant association.
Wrong move: Skipping the expected count condition check, or claiming the condition is met simply because all expected counts are at least 1. Why: Students mix up rules of thumb for the expected count condition across different inference procedures. Correct move: For any chi-square test, require two criteria: all expected counts are at least 1, and fewer than 20% of expected counts are less than 5. The AP Statistics course framework states the stricter condition that all expected counts be greater than 5, so on the exam it is safest to verify that version. If the condition fails, combine small categories to increase expected counts before proceeding.
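The two-part expected count rule from the last pitfall can be written as a small check. This is a minimal sketch: the function name and the three lists of expected counts are hypothetical, chosen only to show one passing case and the two ways the check can fail.

```python
def expected_counts_condition(expected):
    """Check the two-part rule on a flat list of expected counts.

    Returns (ok, smallest, share_below_5), where ok is True only if
    every expected count is at least 1 AND fewer than 20% are below 5.
    """
    smallest = min(expected)
    share_below_5 = sum(1 for e in expected if e < 5) / len(expected)
    ok = smallest >= 1 and share_below_5 < 0.20
    return ok, smallest, share_below_5


# Hypothetical expected counts, for illustration only.
mostly_large = [12.0, 9.5, 4.2, 8.3, 15.0, 11.0]           # 1 of 6 below 5
too_many_small = [12.0, 3.5, 4.2, 8.3, 15.0, 11.0]         # 2 of 6 below 5
tiny_cell = [12.0, 9.5, 0.8, 8.3, 15.0, 11.0, 10.0, 9.0]   # one cell below 1

for name, counts in [
    ("mostly_large", mostly_large),
    ("too_many_small", too_many_small),
    ("tiny_cell", tiny_cell),
]:
    ok, smallest, share = expected_counts_condition(counts)
    print(f"{name}: ok={ok}, smallest={smallest}, share below 5={share:.0%}")
```

For a two-way table, flatten the grid of expected counts into one list before calling the function.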

5. Quick Check: When To Use Which Chi-Square Test

Test your understanding of when to apply each subtopic's procedure by matching each scenario to the correct chi-square test:

Scenarios

A bakery wants to test if the distribution of daily customer arrivals is the same across all seven days of the week, as they claim in their staffing model. They count arrivals each day for 100 random business days.
A sociologist studies whether the distribution of voting intent (Democrat, Republican, Third Party, Undecided) differs across five different age groups, sampling 100 registered voters from each age group.
A medical researcher studies whether smoking status (current, former, never) is associated with asthma status (has asthma, does not have asthma) in a random sample of 500 adults.

Answers

Chi-Square Test for Goodness of Fit: One categorical variable (day of week) tested against a claimed uniform distribution.
Chi-Square Test for Homogeneity: Five independent samples (one from each age group), comparing the distribution of one categorical variable (voting intent) across groups.
Chi-Square Test for Independence: One random sample of adults, two categorical variables measured per individual, testing for association.
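The reasoning behind these answers follows a simple decision rule based on study design. The sketch below encodes it, assuming the design can be summarized by two numbers: how many independent samples were drawn and how many categorical variables were recorded per individual. The function name and this two-number summary are illustrative simplifications, not an official classification scheme.

```python
def choose_chi_square_test(num_samples, num_variables):
    """Pick a chi-square test from a simplified description of the design.

    num_samples:   number of independent samples or groups
    num_variables: number of categorical variables recorded per individual
    """
    if num_samples == 1 and num_variables == 1:
        # One variable compared with a claimed distribution.
        return "goodness of fit"
    if num_samples >= 2 and num_variables == 1:
        # Same variable compared across independent groups.
        return "homogeneity"
    if num_samples == 1 and num_variables == 2:
        # Two variables measured on each individual in one sample.
        return "independence"
    raise ValueError("design not covered by the three chi-square tests")


# The three scenarios above, in order.
bakery = choose_chi_square_test(num_samples=1, num_variables=1)
voters = choose_chi_square_test(num_samples=5, num_variables=1)
asthma = choose_chi_square_test(num_samples=1, num_variables=2)
print(bakery, voters, asthma, sep=" | ")
```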

6. Quick Reference Cheatsheet

Category	Formula / Rule	Notes
General Chi-Square Test Statistic	$\chi^{2} = \sum \frac{(O - E)^{2}}{E}$, where $O$ = observed count and $E$ = expected count	Used for all three chi-square tests; all values are non-negative, larger $\chi^{2}$ gives stronger evidence against $H_{0}$
Degrees of Freedom: Goodness of Fit	$df = k - 1$, where $k$ = number of categories of the variable	Only for one-variable tests of a claimed distribution
Degrees of Freedom: Two-Way Tests	$df = (r - 1)(c - 1)$, where $r$ = number of rows and $c$ = number of columns	Same calculation for both homogeneity and independence; depends only on table size, not sample size
Expected Count for Two-Way Tables	$E = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$	Used for both homogeneity and independence; expected counts for goodness-of-fit come from the null distribution
All Chi-Square Conditions	1. Random sample/random assignment 2. Independent observations (10% condition if sampling without replacement) 3. All $E \geq 1$ , <20% of $E < 5$	Same conditions for all three chi-square tests
Goodness of Fit Hypotheses	$H_{0}$ : The variable's distribution matches the claimed distribution $H_{a}$ : The variable's distribution differs from the claimed distribution	All chi-square tests are right-tailed
Homogeneity Hypotheses	$H_{0}$ : The distribution of the variable is the same across all groups $H_{a}$ : At least one group has a different distribution	Used for comparing distributions across multiple independent groups
Independence Hypotheses	$H_{0}$ : Two variables are independent (no association) in the population $H_{a}$ : Two variables are dependent (there is an association) in the population	Used for testing association between two variables from one sample
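The goodness-of-fit rows of the table can be illustrated with a short calculation. The claimed proportions and observed counts below are hypothetical. With three categories, df = 2, and for that specific case the upper-tail area of the chi-square distribution has the closed form $e^{-\chi^{2}/2}$; for other df values you would use a chi-square table or statistical software instead.

```python
import math

# Hypothetical claimed distribution and observed counts (illustration only).
claimed_proportions = [0.5, 0.3, 0.2]
observed = [90, 70, 40]

n = sum(observed)
# Goodness-of-fit expected counts come from the null (claimed) distribution.
expected = [p * n for p in claimed_proportions]

# General chi-square statistic: sum of (O - E)^2 / E over all categories.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# df = k - 1, where k is the number of categories.
df = len(observed) - 1

# Closed form for the upper-tail area, valid only because df == 2 here.
p_value = math.exp(-statistic / 2)

print(f"chi-square = {statistic:.3f}, df = {df}, p-value = {p_value:.3f}")
```

Because the test is right-tailed, only a large statistic counts as evidence against $H_{0}$; here the p-value is above 0.05, so these made-up counts would not give convincing evidence that the distribution differs from the claim.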

7. What's Next & Sub-Topic Links

This unit is the capstone of inference for categorical data in the AP Statistics curriculum, building on the core inference framework you learned in earlier units on one- and two-sample inference for proportions and means. The same significance test structure carries into the next unit in the AP Statistics CED, Inference for Quantitative Data: Slopes, where you will apply it to linear regression parameters. This unit also directly targets the AP exam's skill of selecting the correct inference procedure, a skill tested in both multiple-choice and free-response sections that makes up roughly 10% of the total exam score. After completing this unit and the slopes unit, you will have covered all major inference procedures required for the AP Statistics exam.