IBO · ibo-math-ai-hl · IB Math: Applications & Interpretation HL · Statistics and Probability (AI HL) · 16 min read · Updated 2026-05-06

Statistics and Probability (AI HL): IB Math AI HL Study Guide

For: IB Math AI HL candidates sitting IB Math: Applications & Interpretation HL.

Covers: Sampling techniques and bias, discrete/continuous probability distributions, HL hypothesis testing (t-test, chi-squared), linear/non-linear regression, and HL confidence intervals, with exam-style worked examples and mark-saving tips.

You should already know: IGCSE / pre-DP math; comfort with applied problems and tech.

A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the IB Math AI HL style for educational use. They are not reproductions of past IBO papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official IBO mark schemes for grading conventions.


1. What Is Statistics and Probability (AI HL)?

This syllabus component focuses on collecting, analyzing, and interpreting real-world data, and quantifying uncertainty for evidence-based decision-making — a core priority for the applied-focused AI HL curriculum. It makes up 30–40% of your final exam mark across both Paper 1 (calculator allowed) and Paper 2 (calculator required, tech-heavy problems). Common related terms you will see in exam contexts include biostatistics, inferential statistics, and probabilistic modeling, often tied to real-world fields like biology, business, and environmental science.

2. Sampling techniques and bias

Sampling is the process of selecting a subset of individuals (a sample) from a larger target population to estimate population parameters, since measuring the entire population is almost always too costly or logistically impossible.

Key sampling techniques for AI HL:

  1. Simple Random Sampling (SRS): Every member of the population has equal probability of being selected, e.g., using a random number generator to select participant IDs.
  2. Stratified Sampling: The population is split into mutually exclusive groups (strata) with shared characteristics (e.g., age group, income bracket), then SRS is conducted within each stratum proportional to stratum size.
  3. Cluster Sampling: The population is split into logistically convenient clusters (e.g., school classes, neighborhood blocks), random clusters are selected, and all members of selected clusters are sampled.
  4. Convenience Sampling: Selecting easily accessible individuals (e.g., surveying your classmates for a school-wide poll), which carries very high bias risk.

Bias is defined as systematic error in sample selection that leads to unrepresentative results. Common types include selection bias (sample does not reflect the population), response bias (respondents give inaccurate answers), and non-response bias (a large share of selected participants do not reply).

Worked Example: A school with 800 students (300 Year 1, 300 Year 2, 200 Year 3) wants to survey 80 students about cafeteria satisfaction. What sampling technique should they use, and how many Year 3 students should be included? Solution: Stratified sampling is optimal to capture variation in preferences across year groups. The share of Year 3 students is $\frac{200}{800} = 0.25$, so $0.25 \times 80 = 20$ Year 3 students should be included. Exam Tip: Examiners always require justification for sampling technique choices, so link your choice explicitly to reducing bias for the given context.
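The proportional allocation in this worked example can be sketched in a few lines of Python. This is a minimal illustration rather than an exam method: the student ID ranges and the helper name `stratified_sample` are assumptions made up for the sketch, and the fixed seed only makes the run repeatable.

```python
import random

def stratified_sample(strata, sample_size, seed=0):
    """Proportional stratified sampling: a simple random sample in each stratum.

    `strata` maps a stratum name to the list of its members.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Allocate the sample in proportion to the stratum's share of the population.
        quota = round(sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, quota)
    return sample

# Hypothetical student IDs for the 800-student school in the worked example.
strata = {
    "Year 1": list(range(0, 300)),
    "Year 2": list(range(300, 600)),
    "Year 3": list(range(600, 800)),
}
sample = stratified_sample(strata, sample_size=80)
allocation = {name: len(chosen) for name, chosen in sample.items()}
print(allocation)  # {'Year 1': 30, 'Year 2': 30, 'Year 3': 20}
```

Here every quota is a whole number. When the proportions do not divide evenly, the rounded quotas may not add up to the target sample size and need a manual adjustment.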

3. Discrete and continuous distributions

A probability distribution describes the probability of all possible outcomes of a random variable. Discrete random variables take countable, distinct values (e.g., number of customers per hour), while continuous random variables take uncountable values across a range (e.g., height, time to complete a test).

Key discrete distributions:

  1. Binomial Distribution $X \sim B(n, p)$: Number of successes in $n$ independent Bernoulli trials with success probability $p$. Use case: Number of defective products in a batch of 100 with 5% defect rate.
  2. Poisson Distribution $X \sim \text{Po}(\lambda)$: Number of events in a fixed time/space interval with constant average rate $\lambda$. Use case: Number of calls received by a call center per minute, average 4 calls per minute.

Key continuous distributions:

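  1. Normal Distribution $X \sim N(\mu, \sigma^2)$: Symmetric bell curve defined by mean $\mu$ and variance $\sigma^2$. Follows the 68-95-99.7 rule: 68% of data within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, 99.7% within $\mu \pm 3\sigma$.
  2. Uniform Distribution $X \sim U(a, b)$: Equal probability across the interval $[a, b]$.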

Worked Example: The number of rainy days per month in a city follows a Poisson distribution with mean $\lambda$. Find the probability of exactly 4 rainy days next month. Solution: $X \sim \text{Po}(\lambda)$, so $P(X = 4) = \frac{e^{-\lambda}\lambda^{4}}{4!}$, which you can verify directly with your GDC for full marks. Exam Tip: Always state the distribution and parameters first when solving distribution questions, as this earns method marks even if your final calculation is wrong.
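As a cross-check on GDC output, the binomial and Poisson probabilities and the 68-95-99.7 rule can be reproduced with Python's standard library. This is a sketch under stated assumptions: the helper names `binomial_pmf` and `poisson_pmf` are made up for illustration, and the parameters are the two use cases listed under the discrete distributions (100 products with a 5% defect rate, and an average of 4 calls per minute).

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Binomial use case: exactly 5 defective products out of 100 at a 5% defect rate.
p_five_defects = binomial_pmf(5, n=100, p=0.05)

# Poisson use case: exactly 4 calls in a minute when the average is 4 per minute.
p_four_calls = poisson_pmf(4, lam=4)

# Normal: share of values within 1, 2 and 3 standard deviations of the mean.
z = NormalDist()  # standard normal, mean 0 and standard deviation 1
within = [z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)]

print(round(p_five_defects, 4), round(p_four_calls, 4))
print([round(w, 4) for w in within])
```

The two probabilities come out at roughly 0.180 and 0.195, and the three normal shares at roughly 0.6827, 0.9545 and 0.9973, which is where the 68-95-99.7 rule gets its name.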

4. Hypothesis testing — t-test, chi-squared (HL)

Hypothesis testing is an inferential method to test whether a claim about a population parameter is supported by sample data, using a pre-defined significance level (usually 5% for IB exams).

t-test

Used to test the mean of a normally distributed population when the population standard deviation is unknown (the most common case in AI HL exams). Steps for a one-sample t-test:

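  1. State null hypothesis $H_0: \mu = \mu_0$ (the claimed mean) and alternative hypothesis $H_1$ (one-tailed if $\mu < \mu_0$ or $\mu > \mu_0$, two-tailed if $\mu \neq \mu_0$).
  2. Calculate sample mean $\bar{x}$, sample standard deviation $s$, degrees of freedom $\nu = n - 1$.
  3. Calculate t-statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
  4. Compare the p-value of the t-statistic to the significance level $\alpha$: if $p < \alpha$, reject $H_0$; else, fail to reject $H_0$.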

Chi-squared tests

  1. Goodness of fit test: Tests if observed frequency data matches a hypothesized distribution (e.g., if a die is fair). Test statistic: $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ = observed frequency, $E_i$ = expected frequency, and degrees of freedom $\nu = k - 1$ for $k$ categories.
  2. Test for independence: Tests if two categorical variables are associated (e.g., if smoking status is linked to lung cancer diagnosis). Degrees of freedom: $\nu = (r - 1)(c - 1)$, where $r$ = number of rows, $c$ = number of columns in the contingency table.
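Both chi-squared tests rely on the same statistic, so a short Python sketch may help make the arithmetic concrete. Everything numerical here is an assumption for illustration: the die counts and the 2×2 table are invented, and the 5% critical values (11.070 for 5 degrees of freedom, 3.841 for 1) are read from a standard chi-squared table rather than computed.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories or cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Goodness of fit: hypothetical counts from 60 rolls of a die.
observed = [8, 12, 9, 11, 10, 10]
expected = [sum(observed) / 6] * 6      # a fair die gives 10 per face
gof_valid = all(e >= 5 for e in expected)  # expected frequencies must be at least 5
gof_chi2 = chi_squared_statistic(observed, expected)
gof_df = len(observed) - 1
gof_reject = gof_chi2 > 11.070          # 5% critical value for 5 degrees of freedom

# Independence: hypothetical 2x2 contingency table.
table = [[30, 20], [20, 30]]
exp_table = expected_frequencies(table)
flat_obs = [cell for row in table for cell in row]
flat_exp = [cell for row in exp_table for cell in row]
ind_chi2 = chi_squared_statistic(flat_obs, flat_exp)
ind_df = (len(table) - 1) * (len(table[0]) - 1)
ind_reject = ind_chi2 > 3.841           # 5% critical value for 1 degree of freedom

print(gof_chi2, gof_df, gof_reject)
print(ind_chi2, ind_df, ind_reject)
```

With these made-up numbers the die data give a statistic of about 1, well below the critical value, so the fair-die hypothesis is not rejected. The table gives about 4, just above 3.841, so independence is rejected at 5%.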

Worked Example: A coffee shop claims their average latte temperature is 65°C. A customer measures 10 lattes, finds mean 62°C, sample standard deviation 3°C. Test at 5% significance if the average temperature is lower than claimed. Solution: $H_0: \mu = 65$, $H_1: \mu < 65$ (one-tailed, $\alpha = 0.05$). $\nu = 10 - 1 = 9$, t-statistic = $\frac{62 - 65}{3 / \sqrt{10}} \approx -3.16$, p-value ≈ 0.0058 < 0.05, so reject $H_0$. There is sufficient evidence at 5% significance that the average temperature is lower than claimed. Exam Note: For chi-squared tests, combine categories if any expected frequency is less than 5, or examiners will deduct marks for invalid results.
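The latte test can be reproduced outside a GDC with the sketch below. It is a standard-library approximation, not how a calculator computes the result: the helpers `t_pdf`, `t_cdf` and `one_sample_t` are illustrative names, and the p-value comes from integrating the t-distribution density numerically with Simpson's rule.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] and the symmetry of the density."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def one_sample_t(mean, mu0, s, n):
    """Return the t-statistic and degrees of freedom for a one-sample t-test."""
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Worked example: claimed mean 65, sample mean 62, s = 3, n = 10.
t_stat, df = one_sample_t(mean=62, mu0=65, s=3, n=10)
p_value = t_cdf(t_stat, df)  # lower-tailed test, H1: mu < 65

print(round(t_stat, 3), df, round(p_value, 4))
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

For a two-tailed test, double the smaller tail area. For an upper-tailed test, use `1 - t_cdf(t_stat, df)`.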

5. Linear and non-linear regression

Regression models the relationship between a dependent response variable $y$ and one or more independent explanatory variables $x$, to make predictions.

Linear regression

Models linear relationships of the form $y = ax + b$, where $a$ = slope, $b$ = y-intercept, calculated via the least squares method that minimizes the sum of squared residuals (residual = observed $y$ - predicted $y$). The Pearson correlation coefficient $r$ measures linear association strength, ranging from -1 (perfect negative linear) to 1 (perfect positive linear). The coefficient of determination $R^2$ measures the share of variance in $y$ explained by the model, ranging from 0 to 1.
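To see what the GDC does behind its regression menu, the least squares slope `a`, intercept `b` and correlation coefficient `r` can be computed by hand as in the sketch below. The data points are hypothetical, chosen only to illustrate the calculation, and the function name `linear_regression` is an assumption of this sketch.

```python
import math

def linear_regression(xs, ys):
    """Least squares fit of y = a*x + b; returns (a, b, r)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = s_xy / s_xx                    # slope
    b = y_bar - a * x_bar              # y-intercept
    r = s_xy / math.sqrt(s_xx * s_yy)  # Pearson correlation coefficient
    return a, b, r

# Hypothetical data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, r = linear_regression(xs, ys)
r_squared = r ** 2  # for simple linear regression, R^2 is the square of r
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

print(round(a, 3), round(b, 3), round(r, 4), round(r_squared, 4))
print([round(e, 3) for e in residuals])
```

Plotting the residuals against `xs` is a quick way to spot curvature that a high correlation coefficient can hide.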

Non-linear regression

For non-linear relationships, transform variables to create a linear relationship, or use your GDC to fit non-linear models (exponential, quadratic, power law):

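  • Exponential relationship $y = ae^{bx}$: Take natural log of both sides: $\ln y = \ln a + bx$, so plot $\ln y$ vs $x$ for a linear relationship.
  • Power law relationship $y = kx^n$: Take log base 10 of both sides: $\log_{10} y = \log_{10} k + n \log_{10} x$, so plot $\log_{10} y$ vs $\log_{10} x$ for a linear relationship.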
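The two linearizing transforms can be checked numerically with the sketch below. It is an illustration under assumptions: the data are generated from known curves (an exponential with parameters 3 and 0.5, and a power law with parameters 2 and 1.5) so that the recovered parameters can be compared with the true ones, and the function names are invented for this example.

```python
import math

def fit_line(xs, ys):
    """Least squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

def fit_exponential(xs, ys):
    """Fit y = a * e^(b x) by regressing ln(y) on x; returns (a, b)."""
    slope, intercept = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept), slope

def fit_power(xs, ys):
    """Fit y = k * x^n by regressing log10(y) on log10(x); returns (k, n)."""
    slope, intercept = fit_line(
        [math.log10(x) for x in xs], [math.log10(y) for y in ys]
    )
    return 10 ** intercept, slope

# Hypothetical data generated exactly from the two model curves.
xs = [1, 2, 3, 4]
exp_ys = [3 * math.exp(0.5 * x) for x in xs]
pow_ys = [2 * x ** 1.5 for x in xs]

a, b = fit_exponential(xs, exp_ys)
k, n = fit_power(xs, pow_ys)
print(round(a, 6), round(b, 6))  # recovers approximately 3 and 0.5
print(round(k, 6), round(n, 6))  # recovers approximately 2 and 1.5
```

In both cases the slope of the transformed line is the exponent-side parameter, and the intercept has to be undone with `exp` or `10 **` to recover the multiplier.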

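Worked Example: Bacteria population growth follows the exponential model $P = ae^{bt}$, where $P$ = population, $t$ = time in hours. The population is measured at three times, starting at $t = 0$. Find $a$ and $b$ and predict the population at a later time. Solution: $a = P(0)$ (initial population at $t = 0$). Linear regression of $\ln P$ vs $t$ gives slope $b$. Substituting the required $t$ into $P = ae^{bt}$ gives the predicted population. Exam Tip: Extrapolating predictions outside the range of your input data is always unreliable, and examiners often ask you to comment on this limitation.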

6. Confidence intervals (HL)

A confidence interval (CI) is a range of values constructed from sample data that is likely to contain the true unknown population parameter, with a specified confidence level (usually 95% for IB exams). For a population mean $\mu$ when $\sigma$ is unknown (the standard case for AI HL), use the t-distribution to calculate the CI: $\bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}}$, where $t^*$ is the critical t-value for your chosen confidence level and $\nu = n - 1$.

Worked Example: A researcher measures the weight of 15 adult golden retrievers, finds mean 32 kg, sample standard deviation 2.8 kg. Calculate the 95% CI for the true average weight of the breed. Solution: $\nu = 14$, $t^* = 2.145$ for 95% confidence. Standard error = $\frac{2.8}{\sqrt{15}} \approx 0.723$, margin of error = $2.145 \times 0.723 \approx 1.55$. CI = $32 \pm 1.55$ kg. We can be 95% confident the true average weight falls between 30.4 kg and 33.6 kg. Exam Note: Never interpret a CI as "there is a 95% chance the true mean is in the interval". The true mean is a fixed value, so the confidence level refers to the reliability of the sampling method, not the specific interval. This is a very common mark-losing mistake.
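The interval arithmetic in this worked example is easy to check with a few lines of Python. This is a minimal sketch: the function name `t_confidence_interval` is made up for illustration, and the critical value 2.145 is supplied by hand (read from a t-table for 14 degrees of freedom) rather than computed.

```python
import math

def t_confidence_interval(mean, s, n, t_critical):
    """Return (lower, upper) for mean +/- t* * s / sqrt(n)."""
    standard_error = s / math.sqrt(n)
    margin = t_critical * standard_error
    return mean - margin, mean + margin

# t* = 2.145 is the two-tailed 95% critical value for 14 degrees of freedom,
# taken from a t-table; it is an input to this sketch, not something it derives.
lower, upper = t_confidence_interval(mean=32, s=2.8, n=15, t_critical=2.145)
print(round(lower, 2), round(upper, 2))  # 30.45 33.55
```

A larger sample shrinks the standard error and a lower confidence level shrinks the critical value, so either change narrows the interval.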

7. Common Pitfalls (and how to avoid them)

  • Wrong move: Using a z-test instead of a t-test for mean inference when population standard deviation is unknown. Why students do it: They confuse z and t test use cases. Correct move: Only use z-tests if you are explicitly given the population standard deviation $\sigma$.
  • Wrong move: Forgetting to combine categories in chi-squared tests when $E_i < 5$. Why students do it: They skip checking expected frequencies first. Correct move: Always calculate expected frequencies first, combine adjacent categories if any $E_i < 5$, and adjust degrees of freedom accordingly.
  • Wrong move: Interpreting a 95% CI as "95% of population data falls in this interval". Why students do it: They confuse parameter CIs with prediction intervals for individual observations. Correct move: Explicitly state that the CI estimates the true population parameter (e.g., mean), not the range of individual values.
  • Wrong move: Using linear regression for clearly non-linear data and claiming a high $r$ means good fit. Why students do it: They only check the correlation coefficient, not residual plots. Correct move: Plot data first, use residual plots to check for non-linearity, and apply appropriate transformations for non-linear relationships.
  • Wrong move: Claiming a convenience sample is representative of the full population. Why students do it: They prioritize ease of sampling over representativeness. Correct move: Justify sampling techniques by linking them to relevant population strata, and explicitly state bias risks if convenience sampling is the only logistical option.

8. Practice Questions (IB Math AI HL Style)

Question 1

A town council wants to survey 200 residents about plans to build a new park. The town has 12000 residents: 20% under 18, 60% 18–65, 20% over 65. a) Name the most appropriate sampling technique to ensure representative age group results. (1 mark) b) Calculate how many residents over 65 should be included in the sample. (2 marks) c) Give one example of response bias that could affect results. (1 mark)

Solution

a) Stratified random sampling (1 mark, must include "random"). b) Share of over 65 residents = 0.2, so $0.2 \times 200 = 40$ residents (1 mark for calculation, 1 mark for final answer). c) Example: The survey asks if residents support higher taxes to fund the park, and respondents underreport their support to avoid appearing willing to pay more (1 mark for any contextually valid response bias example).

Question 2

A gym claims members who join their 8-week weight loss program lose an average of 7kg. A consumer group tests this with 12 random participants, finds average weight loss of 5.8kg, sample standard deviation 2.1kg. a) State null and alternative hypotheses for a two-tailed test of the claim. (2 marks) b) Calculate the p-value and state the conclusion at 5% significance level. (4 marks)

Solution

a) $H_0: \mu = 7$, $H_1: \mu \neq 7$ (1 mark per correct hypothesis). b) $\nu = 11$, t-statistic = $\frac{5.8 - 7}{2.1 / \sqrt{12}} \approx -1.98$, two-tailed p-value ≈ 0.073 (2 marks for correct t-statistic and p-value). 0.073 > 0.05, so fail to reject $H_0$ (1 mark for comparison). There is insufficient evidence at 5% significance to reject the gym's claim (1 mark for contextually appropriate conclusion).

Question 3

The relationship between weekly ads run ($x$) and weekly sales ($y$, in \$1000s) follows the power law $y = kx^n$. Linear regression of $\log_{10} y$ vs $\log_{10} x$ gives a slope of 0.62 and y-intercept of 0.38. a) Find the values of $k$ and $n$. (2 marks) b) Predict weekly sales if 15 ads are run per week. (2 marks)

Solution

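a) $n = 0.62$ (slope, 1 mark), $\log_{10} k = 0.38$ so $k = 10^{0.38} \approx 2.40$ (1 mark). b) $y = 2.40 \times 15^{0.62} \approx 12.86$, so weekly sales ≈ \$12,860 (1 mark for substitution, 1 mark for final answer in dollars, accept values between \$12,600 and \$13,100).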

9. Quick Reference Cheatsheet

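| Category | Key Formulas & Rules |
| --- | --- |
| Sampling & Bias | Techniques: SRS, stratified, cluster, convenience; Bias types: selection, response, non-response |
| Probability Distributions | Binomial: $X \sim B(n, p)$, $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$; Poisson: $P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}$; Normal: 68-95-99.7 rule; Uniform: $X \sim U(a, b)$ |
| Hypothesis Testing | t-statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, $\nu = n - 1$; Chi-squared: $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$; Reject $H_0$ if $p < \alpha$ |
| Regression | Linear: $y = ax + b$, $r$, $R^2$; Exponential transform: $\ln y = \ln a + bx$; Power law transform: $\log_{10} y = \log_{10} k + n \log_{10} x$ |
| Confidence Intervals | 95% CI for mean (σ unknown): $\bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}}$, $\nu = n - 1$ |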

10. What's Next

This Statistics and Probability topic is one of the most heavily weighted in the IB Math AI HL syllabus, and it connects directly to all applied problem-solving sections of your exams, including the internal assessment (IA). The majority of AI HL IA projects use statistical methods to analyze real-world data, so mastering the techniques in this guide will give you a strong foundation for scoring highly on your IA, as well as on Paper 2's long-response tech-enabled questions. The inferential statistics skills you learned here also translate directly to university courses in data science, biology, economics, and engineering, if you choose to pursue those fields.

If you have questions about any of the concepts in this guide, from calculating a t-test p-value on your GDC to choosing the right sampling technique for your IA, you can ask Ollie for personalized explanations and extra practice problems at any time on the homepage. We also recommend that you practice with official past IBO papers to get familiar with the exact question structure and mark scheme conventions for your exam.
