Transforming to Achieve Linearity — AP Statistics Study Guide

For: AP Statistics candidates.

Covers: Exponential and power function non-linear models, log transformations (natural and common log), linear regression on transformed data, back-transformation for predictions, and residual analysis for model selection.

You should already know: Simple linear regression least squares estimation, how to interpret residual plots for linearity, basic properties of logarithms.

A note on the practice questions: All worked questions in the "Practice Questions" section below are original problems written by us in the AP Statistics style for educational use. They are not reproductions of past College Board exam questions and may differ in wording, numerical values, or context. Use them to practise the technique; cross-check with official scoring guidelines for grading conventions.

1. What Is Transforming to Achieve Linearity?

Many real-world two-variable relationships (such as bacterial growth, radioactive decay, planetary orbital periods, and stopping distance) follow non-linear patterns, so a least-squares regression line fitted to the raw data will systematically over- and under-predict, leaving a curved pattern in the residuals. Transforming to achieve linearity is the process of applying a mathematical re-expression (usually a logarithmic or power transformation) to one or both variables to convert a curved relationship into a linear one, allowing us to use existing simple linear regression tools to fit and analyze the model.

This topic is part of Unit 2: Exploring Two-Variable Data in the AP Statistics Course and Exam Description (CED), a unit the CED weights at 5-7% of the multiple-choice section. It appears in both multiple-choice questions (testing transformation identification and back-transformation) and free-response questions (requiring model selection, fitting, and prediction). Synonyms for this technique include linearization of non-linear models and data re-expression.

2. Log Transformation for Exponential Models

Exponential models describe relationships where the response variable changes by a constant percentage rate for each 1-unit increase in the explanatory variable. The general form of an exponential model is:

$y = a b^{x}$

where $y$ is the response variable, $x$ is the explanatory variable, $a > 0$ is the initial value of $y$ when $x = 0$, and $b > 0$ is the constant growth/decay factor ($b > 1$ for growth, $0 < b < 1$ for decay). This relationship is always curved on the original $x$-$y$ scale, so we linearize it by taking the logarithm of both sides:

$\ln y = \ln (a b^{x}) = \ln a + x \ln b$

This is a linear equation! If we let $y' = \ln y$, $A = \ln a$, and $B = \ln b$, we get the standard linear form:

$y' = A + B x$

We fit a least-squares regression line to the transformed $(x, \ln y)$ data to get estimates of $A$ and $B$, then back-transform to recover $a = e^{A}$ and $b = e^{B}$ for the original exponential model. Residual plots on the transformed data confirm whether linearization worked: random scatter around zero means the model is appropriate.
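The fit-then-back-transform procedure can be sketched in a few lines of Python. This is a minimal illustration, not a required AP technique: the data are hypothetical, generated without noise from $y = 5(1.3)^{x}$, and the helper name `least_squares` is our own.

```python
import math

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical, noise-free data from y = 5 * 1.3^x (an assumption for illustration).
xs = [0, 1, 2, 3, 4, 5]
ys = [5 * 1.3 ** x for x in xs]

# Transform only the response, then fit a line to (x, ln y).
log_ys = [math.log(y) for y in ys]
A, B = least_squares(xs, log_ys)

# Back-transform the intercept and slope to the original scale.
a = math.exp(A)
b = math.exp(B)
print(f"ln y = {A:.4f} + {B:.4f} x")
print(f"y = {a:.4f} * {b:.4f}^x")
```

Because the data contain no noise, the recovered $a$ and $b$ should match 5 and 1.3 up to floating-point rounding; with real data they would only be estimates.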

Worked Example

A microbiologist studying bacterial growth obtains the following regression output after transforming the response variable (number of bacteria $y$) to linearize the relationship with time $x$ (hours): $\ln y = 2.3 + 0.41 x$. Write the exponential model for $y$ as a function of $x$.

Recall the linearized form of an exponential model: $\ln y = \ln a + x \ln b$, where $A = \ln a$ and $B = \ln b$.
Match coefficients from the given regression: $A = 2.3$, $B = 0.41$.
Back-transform to solve for $a$ and $b$: $a = e^{2.3} \approx 9.974$, $b = e^{0.41} \approx 1.507$.
Write the final exponential model for the original data: $y = 9.97 (1.51)^{x}$.

Exam tip: If the problem uses base-10 (common) logs instead of natural logs, back-transform with base 10, not $e$. If $\log_{10} y = A + B x$, then $y = 10^{A} (10^{B})^{x}$. Always match the log base when exponentiating.
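To see how much the log base matters, the short sketch below back-transforms the worked example's coefficients (2.3 and 0.41) twice: once with base $e$, which is what the example used, and once as if they had come from a base-10 fit. The function name `exponential_from_log_fit` is a hypothetical helper of ours.

```python
import math

def exponential_from_log_fit(intercept, slope, base=math.e):
    """Back-transform log_base(y) = intercept + slope * x into (a, b) for y = a * b^x."""
    return base ** intercept, base ** slope

# Coefficients from the worked example, which used natural logs.
a, b = exponential_from_log_fit(2.3, 0.41)
print(round(a, 2), round(b, 2))      # 9.97 1.51

# The same numbers read as a base-10 fit describe a very different model.
a10, b10 = exponential_from_log_fit(2.3, 0.41, base=10)
print(round(a10, 1), round(b10, 2))  # 199.5 2.57
```

Using the wrong base here would change the initial value from about 10 to about 200, so it is worth checking the base before exponentiating.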

3. Power Transformations for Power Function Models

Power function models describe relationships where the response variable is proportional to a fixed power of the explanatory variable. Common examples include Kepler's third law of planetary motion, stopping distance vs speed, and area vs side length. The general form of a power model is:

$y = a x^{p}$

where $a > 0$ is a constant of proportionality and $p$ is the power the explanatory variable is raised to. Like exponential models, power models are curved on the original scale, but linearization requires transforming both variables. Take the logarithm of both sides to get:

$\ln y = \ln (a x^{p}) = \ln a + p \ln x$

Let $y' = \ln y$ and $x' = \ln x$, so this becomes the linear equation:

$y' = \ln a + p x'$

We fit least-squares regression to the transformed $(\ln x, \ln y)$ data, get an intercept $A = \ln a$ and slope equal to the power $p$, then back-transform $a = e^{A}$ to get the original power model $y = a x^{p}$.
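The same idea can be sketched for a power model, this time transforming both variables. The data below are hypothetical and noise-free, generated from $y = 2x^{1.5}$, and `least_squares` is again our own helper rather than a library function.

```python
import math

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical, noise-free data from y = 2 * x^1.5 (an assumption for illustration).
xs = [1, 2, 4, 8, 16]
ys = [2 * x ** 1.5 for x in xs]

# Transform BOTH variables, then fit a line to (ln x, ln y).
log_xs = [math.log(x) for x in xs]
log_ys = [math.log(y) for y in ys]
A, p = least_squares(log_xs, log_ys)

# Only the intercept is back-transformed; the slope is already the power p.
a = math.exp(A)
print(f"ln y = {A:.4f} + {p:.4f} ln x")
print(f"y = {a:.4f} * x^{p:.4f}")
```

Note that the slope needs no back-transformation, which is the main practical difference from the exponential case.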

Worked Example

A civil engineer measures the stopping distance $y$ (meters) of a car moving at speed $x$ (km/h) and fits a linear regression to transformed $(\ln x, \ln y)$ data, resulting in $\ln y = -1.1 + 2.05 \ln x$. Write the power model for $y$ and interpret the slope of the transformed model.

Recall the linearized form of a power model is $\ln y = \ln a + p \ln x$, where $p$ is the power in the original model.
Match coefficients: $\ln a = -1.1$, $p = 2.05$. Back-transform to get $a = e^{-1.1} \approx 0.3329$.
Write the original power model: $y = 0.33 x^{2.05}$.
Interpret the slope: A 1-unit increase in $\ln x$ corresponds to a 2.05-unit increase in predicted $\ln y$. On the original scale, a 1% increase in speed $x$ corresponds to approximately a 2.05% increase in predicted stopping distance $y$. The approximation loosens for larger changes: a 10% increase in speed multiplies predicted stopping distance by $1.10^{2.05} \approx 1.216$, an increase of about 21.6% rather than $2.05 \times 10\% = 20.5\%$.
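Reading the slope as a percent change is an approximation that is best for small changes in $x$. The sketch below compares the exact percent change implied by $y = a x^{p}$ with the rule of thumb $p \times$ (percent change in $x$), using the example's $p = 2.05$. The function name is our own.

```python
def percent_change_in_y(p, percent_change_in_x):
    """Exact percent change in y = a * x^p when x changes by the given percent."""
    factor = (1 + percent_change_in_x / 100) ** p
    return (factor - 1) * 100

p = 2.05  # slope from the stopping-distance example
for pct in (1, 10, 50):
    exact = percent_change_in_y(p, pct)
    rule_of_thumb = p * pct
    print(f"{pct}% change in x: exact {exact:.2f}%, rule of thumb {rule_of_thumb:.2f}%")
```

The constant $a$ cancels out of the ratio, which is why the function does not need it. The gap between the two columns widens as the change in $x$ grows.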

Exam tip: The most common MCQ error is mixing up transformations: exponential models only need $y$ transformed, power models need both $x$ and $y$ transformed. Memorize the derivation, not just the rule, to avoid this mistake.

4. Residual Analysis and Model Selection

When you start with a curved scatterplot of $y$ vs $x$ , you will often test multiple transformations to find one that produces a linear relationship. The primary tool for selecting the appropriate transformation is residual analysis: after fitting a linear regression to the transformed data, you plot the residuals from the transformed regression against the explanatory variable $x$ . If the residual plot has no systematic curved pattern (residuals are randomly scattered around zero), the transformation successfully linearized the relationship, and the model is appropriate. If there is still a visible curve, you need to test a different transformation.

AP Statistics almost always asks for justification of model selection, which requires referencing the residual plot pattern. Higher $R^{2}$ on a transformed scale is not a valid justification on its own, because $R^{2}$ values are not comparable across different transformation scales.
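As a rough illustration of how residuals separate candidate transformations, the sketch below fits three candidate models to hypothetical noise-free data from a power model, $y = 3x^{2}$, and prints the sign of each residual in order of $x$. Sign strings are only a crude stand-in for looking at a residual plot, and the helper `residuals` is our own; with real, noisy data the successful transformation would show random scatter rather than residuals of exactly zero.

```python
import math

def residuals(xs, ys):
    """Residuals (observed minus fitted) from the least-squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical, noise-free data from y = 3 * x^2 (an assumption for illustration).
xs = [1, 2, 3, 4, 5, 6]
ys = [3 * x ** 2 for x in xs]
ln_x = [math.log(x) for x in xs]
ln_y = [math.log(y) for y in ys]

candidates = {
    "y vs x": residuals(xs, ys),
    "ln y vs x": residuals(xs, ln_y),
    "ln y vs ln x": residuals(ln_x, ln_y),
}

for name, res in candidates.items():
    # Treat anything within floating-point noise of zero as "0".
    signs = "".join("+" if r > 1e-9 else "-" if r < -1e-9 else "0" for r in res)
    print(f"{name:13s} residual signs in x order: {signs}")
```

A run of one sign, then the other, then the first again is the numerical footprint of a curved residual plot; only the log-log fit leaves nothing systematic behind.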

Worked Example

A materials scientist testing the relationship between object volume $x$ (cm³) and mass $y$ (g) produces three residual plots after testing different transformations: (1) Residuals for untransformed $y$ vs $x$: clear U-shaped curve; (2) Residuals for $\ln y$ vs $\ln x$: randomly scattered around zero with no pattern; (3) Residuals for $\ln y$ vs $x$: clear upward curved trend. Which transformation is appropriate, and what model does this correspond to?

The goal of transformation is to achieve linearity, which is confirmed by a residual plot with no systematic pattern.
Eliminate the untransformed model and the $\ln y$ vs $x$ model, because both residual plots have clear curved patterns that show the relationship is still non-linear.
The $\ln y$ vs $\ln x$ transformation produces random residual scatter, so it is the appropriate choice.
A linear relationship between $\ln y$ and $\ln x$ corresponds to a power function model $y = a x^{p}$ for the original data.

Exam tip: Always explicitly reference the residual pattern in your justification: saying "no curved pattern so the model is appropriate" will get you full credit, while just saying "the model fits better" will not.

5. Common Pitfalls (and how to avoid them)

Wrong move: Confusing transformations for exponential vs power models, transforming $x$ for exponential or only $y$ for power. Why: Students memorize "log works for linearization" and forget which variable to transform based on model form. Correct move: Always write the original model first, derive the linear form by taking logs of both sides, then identify which variables need transformation.
Wrong move: Mixing up log bases during back-transformation, using $e$ for base-10 logs or 10 for natural logs. Why: Students assume all log transformations use natural log and don't check the problem's given transformation. Correct move: Explicitly note the log base before back-transforming, and match the base when exponentiating.
Wrong move: Reporting the predicted value of the transformed $y'$ as the prediction for original $y$, forgetting to back-transform. Why: Students stop after calculating the prediction from the transformed regression and don't circle back to the question's request. Correct move: Always check if the question asks for a prediction on the original variable scale, and back-transform if required.
Wrong move: Selecting a model with higher $R^{2}$ over a model with random residuals, just because $R^{2}$ is larger. Why: Students are used to using $R^{2}$ for model comparison, but it is not comparable across different transformation scales. Correct move: Always use the residual pattern first to select a linearizing transformation, and only use $R^{2}$ to compare models that both produce random residuals on the same transformed scale.
Wrong move: Interpreting the slope of the transformed model directly on the original scale. For example, saying "a 1-unit increase in $x$ gives a 0.32 increase in $y$" when the slope is for $\ln y$. Why: Students forget the response variable was transformed, so the slope is on the transformed scale. Correct move: Always explicitly state if you are interpreting the slope on the transformed scale, or back-transform the interpretation to the original variable scale.
Wrong move: Plotting residuals from transformed regression against the original response $y$ instead of the explanatory variable to check for pattern. Why: Students mix up which variable to use for residual plots. Correct move: Always plot residuals against the explanatory variable $x$ to check for remaining non-linear pattern.

6. Practice Questions (AP Statistics Style)

Question 1 (Multiple Choice)

A researcher studies the relationship between the number of years a business has operated $x$ and the business's total annual profit $y$ (in thousands of dollars). The relationship follows a power model $y = a x^{p}$. Which of the following transformations will linearize this relationship?
A) $\ln y$ vs $\ln x$
B) $\ln y$ vs $x$
C) $y$ vs $\ln x$
D) $y$ vs $x$

Worked Solution: To linearize a power model $y = a x^{p}$, take the natural logarithm of both sides to get $\ln y = \ln a + p \ln x$, which is linear in $\ln y$ and $\ln x$. Option B is the transformation for an exponential model, not a power model. Options C and D do not produce a linear relationship for a power model. The correct answer is A.

Question 2 (Free Response)

Ecologists studying tree growth in a regenerating forest collect data on tree height $y$ (meters) and years since planting $x$ , and test two models:

Model 1 (linear regression of $y$ on $x$): Residual plot has a clear upward curved pattern.
Model 2 (linear regression of $\ln y$ on $x$): Residual plot is randomly scattered around zero, with regression equation $\ln y = 0.5 + 0.11 x$.

(a) Which model is more appropriate for this relationship? Justify your answer.
(b) Write the equation of the non-linear model for predicted tree height $y$ in terms of $x$.
(c) Predict the height of a tree that has been growing for 25 years. Round your answer to one decimal place.

Worked Solution:
(a) Model 2 is more appropriate. A successful transformation to achieve linearity produces a residual plot with no systematic curved pattern. Model 1 has a clear curved residual pattern, meaning the relationship between $y$ and $x$ is non-linear, while Model 2 has random scatter, so it successfully linearized the relationship.
(b) For the linearized exponential model $\ln y = A + B x$, we have $A = 0.5 = \ln a$ and $B = 0.11 = \ln b$. Back-transforming gives $a = e^{0.5} \approx 1.6487$ and $b = e^{0.11} \approx 1.1163$. The final model is $y = 1.65 (1.12)^{x}$.
(c) Substitute $x = 25$ into the fitted equation on the log scale, then back-transform: $\ln y = 0.5 + 0.11(25) = 3.25$, so $y = e^{3.25} \approx 25.8$ meters. Do not predict from the rounded equation: $1.65 (1.12)^{25} \approx 28.1$ overstates the prediction because the rounding error in $b$ is compounded 25 times. The predicted height of a tree that has been growing for 25 years is 25.8 meters.
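Part (c) is sensitive to rounding because $b$ is raised to the 25th power. The quick check below compares a prediction made from the unrounded log-scale equation with one made from the two-decimal coefficients of the reported model; the variable names are our own.

```python
import math

A, B = 0.5, 0.11   # intercept and slope of ln y = 0.5 + 0.11 x
x = 25

# Predict on the log scale first, then back-transform once.
direct = math.exp(A + B * x)

# Round a and b to two decimals first, as in the reported model equation.
a_rounded = round(math.exp(A), 2)   # 1.65
b_rounded = round(math.exp(B), 2)   # 1.12
from_rounded = a_rounded * b_rounded ** x

print(f"unrounded coefficients: {direct:.1f} m")
print(f"rounded coefficients:   {from_rounded:.1f} m")
```

The two results differ by more than 2 meters (about 25.8 versus about 28.1), so carry full precision through the calculation and round only the final answer.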

Question 3 (Application / Real-World Style)

Biologists study the relationship between the body mass $x$ (in kilograms) of a mammal and its average resting heart rate $y$ (in beats per minute). The relationship is known to follow a power model $y = a x^{p}$. A regression on transformed data gives $\ln y = 4.8 - 0.25 \ln x$, for a mass range of 0.01 kg (shrew) to 4000 kg (elephant). Write the power model for $y$, then predict the resting heart rate of a 64 kg human. Round your answer to the nearest whole number, and interpret your prediction in context.

Worked Solution: For the linearized power model $\ln y = \ln a + p \ln x$, we have $\ln a = 4.8$ and $p = -0.25$. Back-transforming gives $a = e^{4.8} \approx 121.51$, so the power model is $y = 121.51 x^{-0.25}$. Substitute $x = 64$: $64^{-0.25} = (2^{6})^{-1/4} = 2^{-1.5} = \frac{1}{2\sqrt{2}} \approx 0.3536$. Multiply by 121.51 to get $y \approx 121.51 \times 0.3536 \approx 43$ beats per minute. In context, the model predicts that a 64 kg human has a resting heart rate of approximately 43 beats per minute. This is below the commonly cited 60 to 100 beats per minute for resting adults, which is unsurprising: the model describes the overall trend across mammals, not humans specifically.
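The prediction can be checked with a short sketch that works on the log scale and back-transforms at the end. The function name and the choice to raise an error outside the stated mass range (0.01 kg to 4000 kg) are our own assumptions, added to flag extrapolation.

```python
import math

def predict_heart_rate(mass_kg, ln_a=4.8, p=-0.25, mass_range=(0.01, 4000)):
    """Predict resting heart rate (bpm) from ln y = ln_a + p * ln x."""
    low, high = mass_range
    if not low <= mass_kg <= high:
        # Outside the data range the fitted model has not been checked.
        raise ValueError("mass is outside the range of the data (extrapolation)")
    return math.exp(ln_a + p * math.log(mass_kg))

print(round(predict_heart_rate(64)))  # 43
```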

7. Quick Reference Cheatsheet

Category	Formula	Notes
Exponential Model (original)	$y = a b^{x}$	Used for constant percent growth/decay; $a > 0, b > 0$
Linearized Exponential	$\ln y = \ln a + (\ln b) x$	Only transform response variable $y$; match log base for back-transformation
Power Model (original)	$y = a x^{p}$	Used for proportional scaling relationships; $a > 0$
Linearized Power	$\ln y = \ln a + p \ln x$	Transform both response $y$ and explanatory $x$
Back-transformation (Natural Log)	$y = e^{\ln y}$	For predictions on original scale after natural log transformation
Back-transformation (Common Log)	$y = 10^{\log_{10} y}$	For predictions after base-10 log transformation
Model Fit Check	Residual plot with no systematic curved pattern	Always confirm fit with residuals, not just $R^{2}$
Slope Approximation (Power Model)	1% change in $x$ → ~ $p$ % change in $y$	Works for small percentage changes in $x$

8. What's Next

Transforming to achieve linearity is the foundation for working with non-linear relationships in AP Statistics. Within Unit 2 it sits alongside outliers and influential points as part of analyzing departures from linearity. Later in the course it feeds directly into inference for regression, where you will conduct hypothesis tests and build confidence intervals for slopes, including slopes of transformed linear models. Beyond AP Statistics, multiple regression extends these linearization ideas to curvilinear relationships with multiple explanatory variables, and it is much harder to interpret without first mastering how to re-express non-linear data to fit linear models.
