A-Level · cie-9709 · Paper 5 (Probability & Statistics 1) · Representation of Data · 18 min read · Updated 2026-05-06

Representation of Data — A-Level Mathematics Stats Study Guide

For: A-Level Mathematics candidates sitting Paper 5 (Probability & Statistics 1).

Covers: Frequency tables, histograms, cumulative frequency, stem-and-leaf plots, box plots, measures of center and spread, coding for simplified calculations, and statistical comparison of distributions.

You should already know: Basic probability, summation, integration (Pure 1 calculus).

A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the A-Level Mathematics style for educational use. They are not reproductions of past Cambridge International examination papers and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official Cambridge mark schemes for grading conventions.


1. What Is Representation of Data?

Representation of data is the set of descriptive statistical techniques used to organize, visualize, and summarize raw quantitative or qualitative data, so you can draw meaningful conclusions without analyzing every individual observation. It is often referred to as descriptive statistics, and forms the foundational first topic for A-Level Mathematics Paper 5: it is tested in 5–10 mark questions in every exam series, and underpins all later topics including probability distributions, hypothesis testing, and correlation.

2. Frequency tables, histograms, cumulative frequency

These three tools are used to organize and visualize grouped large datasets:

  • Frequency tables: Group raw data into class intervals, and record the number of observations (frequency) in each interval. For continuous data, always use class boundaries (the true end points of each interval, adjusted to eliminate gaps between discrete classes) rather than stated class limits for calculations. For example, for the class "1–10 marks" (discrete), the class boundaries are 0.5 and 10.5.
  • Histograms: Visual representations of continuous grouped data, where the area of each bar equals the frequency of the class. The y-axis plots frequency density, calculated as: $\text{frequency density} = \dfrac{\text{frequency}}{\text{class width}}$. Unlike bar charts, histograms have no gaps between adjacent bars for continuous data. Examiners frequently test histograms with unequal class widths, so never use raw frequency on the y-axis for these questions.
  • Cumulative frequency: The running total of frequencies up to the upper class boundary of each interval. When plotted against upper class boundaries, it forms an S-shaped ogive (cumulative frequency curve), used to estimate medians, quartiles, and percentiles.

Worked example

For 40 students' test marks grouped into classes 0–20, 21–40, 41–60, 61–80, 81–100, with frequencies 2, 7, 15, 12, 4:

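  • Class widths (from class boundaries −0.5, 20.5, 40.5, 60.5, 80.5, 100.5): 21, 20, 20, 20, 20
  • Frequency densities: $2/21 \approx 0.095$, $7/20 = 0.35$, $15/20 = 0.75$, $12/20 = 0.6$, $4/20 = 0.2$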
  • Cumulative frequencies: 2, 9, 24, 36, 40
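
The arithmetic in this worked example can be checked with a short Python sketch. It assumes the lowest class boundary is −0.5, which is what the stated width of 21 for the 0–20 class implies; the variable names are illustrative only.

```python
from itertools import accumulate

# Class boundaries for the discrete classes 0-20, 21-40, ..., 81-100.
# Assumption: each class limit is shifted by 0.5 to close the gaps.
boundaries = [-0.5, 20.5, 40.5, 60.5, 80.5, 100.5]
frequencies = [2, 7, 15, 12, 4]

# Class width = upper boundary - lower boundary.
widths = [upper - lower for lower, upper in zip(boundaries, boundaries[1:])]

# Frequency density = frequency / class width.
densities = [f / w for f, w in zip(frequencies, widths)]

# Cumulative frequency = running total, plotted against upper boundaries.
cumulative = list(accumulate(frequencies))

for upper, w, d, cf in zip(boundaries[1:], widths, densities, cumulative):
    print(f"up to {upper}: width {w}, density {d:.3f}, cumulative {cf}")
```

Multiplying each density by its class width returns the original frequency, which is a quick way to confirm that bar area equals frequency.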

3. Stem-and-leaf and box plots

These are visual tools for summarizing smaller datasets (<50 observations for stem-and-leaf) and presenting summary statistics for any dataset:

  • Stem-and-leaf plots: Split each data point into a stem (first one or two digits) and leaf (last digit), so all original data is retained. Always include a key (e.g. 4 | 2 means 42 marks) when drawing these for exams, as unlabelled plots lose marks. They make it easy to sort data and identify modes or outliers.
  • Box plots (box-and-whisker diagrams): Summarize 5 key values: minimum observation, lower quartile ($Q_1$, 25th percentile), median ($Q_2$, 50th percentile), upper quartile ($Q_3$, 75th percentile), maximum observation. Outliers (values more than $1.5 \times \text{IQR}$ above $Q_3$ or below $Q_1$) are plotted as separate points outside the whiskers.

Worked example

From the test score cumulative frequency curve above, estimated values are: min=12, $Q_1 = 42$, median=57, $Q_3 = 74$, max=98. The box plot will have whiskers extending from 12 to 98, a box spanning 42 to 74, and a vertical line inside the box at 57 to mark the median.
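
As a rough check on whether this box plot needs any separate outlier points, the sketch below applies the usual 1.5 × IQR rule to the five values quoted above. The variable names are illustrative, and the rule is applied only to the minimum and maximum because the individual observations are not available.

```python
# Five-number summary from the worked example.
minimum, q1, median, q3, maximum = 12, 42, 57, 74, 98

iqr = q3 - q1                      # spread of the middle 50%

# Fences: anything beyond these would be plotted as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# If the extremes sit inside the fences, no observation can be an outlier.
has_outliers = minimum < lower_fence or maximum > upper_fence

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("Outliers present:", has_outliers)
```

Here the fences are −6 and 122, so both whiskers can run all the way to the minimum and maximum.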

4. Mean, median, mode, range, IQR, standard deviation

These are quantitative measures of center (typical value) and spread (variability) of a dataset:

Measures of center

  • Mode: The most frequently occurring value. For grouped data, the modal class is the class with the highest frequency density.
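  • Median: The middle value of sorted data. For $n$ ungrouped observations, it is the $\frac{n+1}{2}$th value if $n$ is odd, or the average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th values if $n$ is even. For grouped data, it is estimated from the ogive or via linear interpolation.
  • Mean: The arithmetic average, calculated as: $\bar{x} = \dfrac{\sum x}{n}$ for ungrouped data, or $\bar{x} = \dfrac{\sum fx}{\sum f}$ for grouped data.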

Measures of spread

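  • Range: $\text{maximum} - \text{minimum}$. Simple to calculate but heavily affected by outliers.
  • Interquartile Range (IQR): $Q_3 - Q_1$. Measures the spread of the middle 50% of data, and is not affected by outliers.
  • Standard deviation: A measure of the typical distance of observations from the mean (the square root of the mean squared deviation), calculated as: $\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}} = \sqrt{\dfrac{\sum x^2}{n} - \bar{x}^2}$. Variance is the square of standard deviation, $\sigma^2$.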

Worked example

For ungrouped data :

  • Mode = 20, median = 20, mean =
  • Range = , , , IQR =
  • Standard deviation: , so
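
The measures in this section can also be computed with Python's standard `statistics` module. The sketch below uses a small hypothetical dataset (not the one in the worked example) and the divide-by-n form of variance. Quartile conventions differ between textbooks and software, so the `(n + 1)p` position rule used here is one choice among several.

```python
import statistics

# Hypothetical ungrouped data, chosen only for illustration.
data = [3, 5, 5, 6, 8, 9, 13]
n = len(data)

# Measures of center.
mean = sum(data) / n                      # sum of x divided by n
median = statistics.median(data)          # middle value of the sorted data
mode = statistics.mode(data)              # most frequent value

# Measures of spread.
data_range = max(data) - min(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Divide-by-n variance: sum(x^2)/n - mean^2, then square root for the SD.
variance = sum(x * x for x in data) / n - mean ** 2
sd = variance ** 0.5

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, IQR={iqr}, sd={sd:.3f}")
```

`statistics.pstdev(data)` gives the same standard deviation, whereas `statistics.stdev` divides by n − 1 and will not match the formula above.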

5. Coding to simplify calculations

Coding is a time-saving technique where you transform large raw data values $x$ into smaller coded values $y$ using the linear transformation: $y = \dfrac{x - a}{b}$, where $a$ (a value close to the raw mean) and $b$ (a factor that simplifies raw values to integers) are constants you can choose, or that are given in the question. The key rules for coded data are:

  1. Adding or subtracting a constant $a$ shifts all values equally, so it only affects the mean, not the spread.
  2. Multiplying or dividing by $b$ scales all values, so it affects both the mean and standard deviation.

Calculations from coded data

  • Coded mean: $\bar{y} = \dfrac{\bar{x} - a}{b}$, so rearranged to get raw mean: $\bar{x} = a + b\bar{y}$
  • Coded standard deviation: $\sigma_y = \dfrac{\sigma_x}{b}$, so raw standard deviation: $\sigma_x = b\sigma_y$
  • Raw variance: $\sigma_x^2 = b^2\sigma_y^2$

Worked example

Raw data values: . Use code :

  • Coded values:
  • Coded mean: , so raw mean =
  • Coded standard deviation: , so raw
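
The decoding rules can be verified numerically. The sketch below assumes a code of the form y = (x − a) / b with b > 0, and uses hypothetical raw values and constants (a = 1000, b = 10) that are not taken from the worked example.

```python
import statistics

# Hypothetical raw data and coding constants, for illustration only.
raw = [1010, 1020, 1030, 1050, 1090]
a, b = 1000, 10

# Code each value: y = (x - a) / b.
coded = [(x - a) / b for x in raw]

# Summary statistics of the coded values (divide-by-n standard deviation).
coded_mean = statistics.fmean(coded)
coded_sd = statistics.pstdev(coded)

# Decode: the mean is shifted and scaled, the SD is only scaled.
raw_mean = a + b * coded_mean
raw_sd = b * coded_sd

print(f"coded mean={coded_mean}, coded sd={coded_sd:.3f}")
print(f"raw mean={raw_mean}, raw sd={raw_sd:.3f}")
```

Computing the mean and standard deviation directly from `raw` gives the same results, which confirms that the constant `a` never enters the standard deviation.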

6. Comparing distributions — center, spread, shape

When comparing two datasets (e.g. test scores for boys vs girls, crop yields for two fertilizers), you must comment on three features to get full marks, always linking to the context of the question:

  1. Center: Compare the median or mean. For example: "The median reaction time for Group A is 0.2 seconds faster than for Group B, so participants given the drug reacted faster on average."
  2. Spread: Compare IQR, standard deviation, or range. For example: "The IQR for scores in Paper 2 is 8 marks lower than for Paper 1, so student performance on Paper 2 was more consistent."
  3. Shape: Comment on skewness:
  • Positive skew: Long tail to the right, mean > median > mode (e.g. income distributions, with a small number of very high earners)
  • Negative skew: Long tail to the left, mode > median > mean (e.g. test scores for an easy exam, with most students scoring highly)
  • Symmetric: No skew, mean ≈ median (e.g. normal distribution of heights)

7. Common Pitfalls (and how to avoid them)

  • Wrong move: Using raw frequency instead of frequency density on the y-axis of a histogram, especially for unequal class widths. Why students do it: They confuse histograms with bar charts, and forget that bar area equals frequency for histograms. Correct move: Always calculate frequency density first, and verify that the area of each bar equals the class frequency before plotting.
  • Wrong move: Using class limits instead of class boundaries for cumulative frequency and histogram calculations. Why students do it: They fail to adjust discrete class limits to continuous boundaries. Correct move: Add/subtract 0.5 from discrete class limits to get boundaries, or use given boundary values for continuous data.
  • Wrong move: Adding the constant $a$ to coded standard deviation to get the raw value. Why students do it: They mix up mean and standard deviation rules for coded data. Correct move: Remember shifting all values by a constant does not change spread, so for the code $y = \frac{x - a}{b}$ the raw standard deviation is $\sigma_x = b\sigma_y$, with no $a$ term.
  • Wrong move: Failing to link statistics to context when comparing distributions. Why students do it: They rush through the question and only state numerical values. Correct move: Always add a context-specific comment, e.g. "The higher mean yield for Fertilizer X means it produces more crop on average" instead of just "Mean of X is higher".
  • Wrong move: Calculating quartiles for grouped data with the position rules for ungrouped data, without interpolation. Why students do it: They use simplified small-dataset rules for large grouped datasets. Correct move: Estimate quartiles from the cumulative frequency curve at 25% and 75% of total cumulative frequency, or use linear interpolation.
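
Linear interpolation for grouped quartiles can be sketched in a few lines of Python. The function name `interpolate_quantile` is hypothetical, and the data are the class boundaries and frequencies from Question 1 in the practice section below. The sketch assumes observations are spread evenly within each class, so a cumulative frequency curve drawn by hand may give slightly different estimates.

```python
def interpolate_quantile(boundaries, frequencies, p):
    """Estimate the value below which a proportion p of grouped data lies."""
    total = sum(frequencies)
    target = p * total                # cumulative frequency to reach
    cumulative = 0
    for lower, upper, f in zip(boundaries, boundaries[1:], frequencies):
        if f > 0 and cumulative + f >= target:
            # Move through the class in proportion to the frequency needed.
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    return boundaries[-1]


# Question 1 data: 1 <= t < 3, 3 <= t < 6, 6 <= t < 10, 10 <= t < 15.
boundaries = [1, 3, 6, 10, 15]
frequencies = [8, 18, 22, 12]

q1 = interpolate_quantile(boundaries, frequencies, 0.25)
median = interpolate_quantile(boundaries, frequencies, 0.50)
q3 = interpolate_quantile(boundaries, frequencies, 0.75)

print(f"Q1 = {q1:.2f}, median = {median:.2f}, Q3 = {q3:.2f}")
```

For these data the estimates are roughly 4.17, 6.73 and 9.45 minutes, located at cumulative frequencies 15, 30 and 45 out of 60.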

8. Practice Questions (A-Level Mathematics Paper 5 Style)

Question 1

The time taken for 60 students to complete a math puzzle is recorded below:

| Time (t minutes) | 1 ≤ t < 3 | 3 ≤ t < 6 | 6 ≤ t < 10 | 10 ≤ t < 15 |
| --- | --- | --- | --- | --- |
| Frequency | 8 | 18 | 22 | 12 |

(a) Calculate the frequency density for each class. [2 marks]
(b) Estimate the mean time taken. [2 marks]
(c) Estimate the standard deviation of the time taken. [3 marks]

Worked solution

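(a) Class widths: 2, 3, 4, 5. Frequency densities: $8/2 = 4$, $18/3 = 6$, $22/4 = 5.5$, $12/5 = 2.4$.
(b) Class midpoints: 2, 4.5, 8, 12.5. $\sum fx = 8(2) + 18(4.5) + 22(8) + 12(12.5) = 423$. Mean = $423/60 = 7.05$ minutes.
(c) $\sum fx^2 = 8(2^2) + 18(4.5^2) + 22(8^2) + 12(12.5^2) = 3679.5$. Variance = $3679.5/60 - 7.05^2 = 11.6225$. Standard deviation = $\sqrt{11.6225} \approx 3.41$ minutes.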
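
The estimates for Question 1 can be reproduced with a short Python sketch that treats every observation in a class as sitting at the class midpoint. The variable names are illustrative only.

```python
# Class boundaries and frequencies from the Question 1 table.
boundaries = [1, 3, 6, 10, 15]
frequencies = [8, 18, 22, 12]

# Midpoint of each class stands in for every observation in that class.
midpoints = [(lower + upper) / 2 for lower, upper in zip(boundaries, boundaries[1:])]

n = sum(frequencies)                                            # total frequency
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))     # sum of f*x
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))  # sum of f*x^2

mean = sum_fx / n
variance = sum_fx2 / n - mean ** 2
sd = variance ** 0.5

print(f"sum fx = {sum_fx}, sum fx^2 = {sum_fx2}")
print(f"mean = {mean:.2f}, variance = {variance:.4f}, sd = {sd:.2f}")
```

Because the midpoints are only stand-ins for the real observations, these values are estimates rather than exact statistics.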


Question 2

Coded values are calculated from raw temperature values using . Summary statistics for coded values: , , . (a) Calculate the mean raw temperature. [2 marks] (b) Calculate the variance of the raw temperature values. [3 marks]

Worked solution

(a) Coded mean . Raw mean: . (b) Coded variance . Raw variance: (3 s.f.).


Question 3

Box plots for monthly salaries at Company A and Company B give the following values:

  • Company A: min=1200, $Q_1$=1800, median=2400, $Q_3$=3200, max=5000
  • Company B: min=1500, $Q_1$=2100, median=2800, $Q_3$=3500, max=4100

Compare the two distributions, commenting on center, spread, and shape. [4 marks]

Worked solution

  • Center: The median salary at Company B is $400 higher than at Company A, so employees at B earn more on average. [1 mark]
  • Spread: The IQR for Company A is 3200 − 1800 = 1400, the same as 3500 − 2100 = 1400 for B, so the spread of the middle 50% of salaries is identical. The range for A is 5000 − 1200 = 3800, compared with 4100 − 1500 = 2600 for B, as A has more extreme high and low salaries. [2 marks]
  • Shape: Company A's distribution is positively skewed, as the upper whisker is much longer than the lower whisker, with a tail of high salaries. Company B's distribution is approximately symmetric, as the median is halfway between $Q_1$ and $Q_3$, and the whiskers are the same length. [1 mark]
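
The figures behind this comparison can be tabulated with a small Python sketch. The dictionary layout and the `describe` helper are illustrative names, not part of the question.

```python
# Five-number summaries from the question.
summaries = {
    "A": {"min": 1200, "q1": 1800, "median": 2400, "q3": 3200, "max": 5000},
    "B": {"min": 1500, "q1": 2100, "median": 2800, "q3": 3500, "max": 4100},
}


def describe(s):
    """Return the spread and shape indicators read off a box plot."""
    return {
        "iqr": s["q3"] - s["q1"],
        "range": s["max"] - s["min"],
        "lower_whisker": s["q1"] - s["min"],
        "upper_whisker": s["max"] - s["q3"],
    }


results = {name: describe(s) for name, s in summaries.items()}

for name, r in results.items():
    print(name, r)
```

Company A's upper whisker (1800) is three times its lower whisker (600), which supports the positive skew comment, while Company B's whiskers are both 600.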

9. Quick Reference Cheatsheet

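| Category | Formula/Rule |
| --- | --- |
| Histogram | Frequency density = $\dfrac{\text{frequency}}{\text{class width}}$, Bar area = Frequency |
| Mean | Ungrouped: $\bar{x} = \dfrac{\sum x}{n}$, Grouped: $\bar{x} = \dfrac{\sum fx}{\sum f}$ |
| Standard Deviation | Ungrouped: $\sigma = \sqrt{\dfrac{\sum x^2}{n} - \bar{x}^2}$, Grouped: $\sigma = \sqrt{\dfrac{\sum fx^2}{\sum f} - \bar{x}^2}$ |
| IQR | $Q_3 - Q_1$, Outliers = values $< Q_1 - 1.5 \times \text{IQR}$ or $> Q_3 + 1.5 \times \text{IQR}$ |
| Coding ($y = \frac{x - a}{b}$) | Raw mean: $\bar{x} = a + b\bar{y}$, Raw SD: $\sigma_x = b\sigma_y$, Raw variance: $\sigma_x^2 = b^2\sigma_y^2$ |
| Skewness | Positive: Mean > Median > Mode (long right tail), Negative: Mode > Median > Mean (long left tail), Symmetric: Mean ≈ Median |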

10. What's Next

This topic is the foundation for all remaining content in A-Level Mathematics Paper 5. You will use measures of center and spread to calculate probabilities for normal distributions later in the syllabus, and box plots and summary statistics to interpret results of hypothesis tests for population means. Histogram and cumulative frequency skills will also be reused when working with probability density functions, where the area under the curve equals probability, just like the area of histogram bars equals frequency.

Make sure to test your knowledge with official past A-Level Mathematics Paper 5 papers to get used to exam wording and marking schemes before your test.

Aligned with the Cambridge International AS & A Level Mathematics 9709 syllabus. OwlsAi is not affiliated with Cambridge Assessment International Education.
