Linear Regression | OttoLearn

An NBA analytics team wants to predict a player’s points per game from their weekly practice hours. They scatter-plot the data, see a linear trend, and draw a line through it — but which line? There are infinitely many straight lines that could pass through that cloud of points. The regression line is the unique line that fits the data best in a mathematically precise sense: it minimizes the total squared prediction error. Its slope tells the analyst something specific — “for each additional hour of practice per week, we predict ___ more points per game, on average” — and that number depends on more than just the direction of the association.

In REG-1 you learned how to measure the strength and direction of a linear relationship using the Pearson correlation $r$. Now you will use $r$, together with the standard deviations of both variables, to compute the exact equation of that best-fitting line, interpret what its slope and intercept mean in context, and assess how well any individual observation fits the prediction.

By the end of this lesson, you will be able to:

Compute the slope and intercept of the least-squares regression line.
Verify your equation by confirming the line passes through $(\bar{x}, \bar{y})$.
Interpret the slope and intercept in plain language, using “on average” and respecting the observed data range.
Compute and interpret residuals as the gap between actual and predicted values.
Explain why the regression of $y$ on $x$ is not the same as the regression of $x$ on $y$.

What you need coming in — and why it matters today:

Pearson correlation $r$ (REG-1): $r$ captures the direction and strength of the linear relationship. Today’s slope formula uses $r$ directly: its sign determines whether $b$ is positive or negative.
Standard deviations $s_x$ and $s_y$ (REG-1, DS-4): These measure the spread of each variable. The ratio $s_y / s_x$ scales the slope to the units of $y$ relative to $x$. Know which is which.
Means $\bar{x}$ and $\bar{y}$ (DS-3): The intercept formula requires both means. The point $(\bar{x}, \bar{y})$ is guaranteed to lie on the regression line, so you will use it to verify your arithmetic.
Scatter plot reading (REG-1, DS-2): Before computing anything, check the scatter plot: is the relationship approximately linear? Are there obvious outliers? Regression is only valid for a linear pattern without extreme influential points.

Quick check — can you recall these from REG-1?

What does the Pearson correlation coefficient $r$ measure?

I can state the range of $r$: from −1 (perfect negative linear) to +1 (perfect positive linear)
I know that $s_x$ is the sample standard deviation of $x$ and $s_y$ is the sample standard deviation of $y$
I can compute $\bar{x}$ and $\bar{y}$ from a dataset or read them from given summary statistics
I can read a scatter plot and assess whether a linear model is appropriate

Success Factor:

What changes in this lesson: In REG-1 you summarized the relationship with a single number, $r$. Here you build a full equation that lets you make predictions and measure how far each individual data point falls from the line. REG-3 will extend this to formally test whether the slope is statistically significant, and the framework you build here carries directly into that lesson.

Retrieval Warm-up — from earlier lessons

A quality-control engineer measures the thickness (mm) of ceramic tiles from a production run. She computes $s = 0.34$ mm. A colleague claims this means “most tiles are within 0.34 mm of the mean.” Which response is most accurate?

A marine biologist finds between sea surface temperature (°C) and coral bleaching percentage across 25 reef sites. A journalist headlines the story: “Warmer seas destroy coral.” Which statistical critique is most precise?

How this section is organized: Ten concepts build in natural calculation order. Understand C1 before computing b (C2), and b before computing a (C3). The interpretation concepts (C5, C6) apply once you have the equation. C7–C10 round out the picture with residuals, conditions, and a subtle asymmetry property.

C1–C4: What the regression line is and how to compute it (criterion, slope, intercept, point-of-means check)
C5–C6: What the slope and intercept mean in plain language
C7–C8: Residuals and making predictions
C9–C10: Conditions and a key asymmetry property

C1 — The Least-Squares Criterion

Imagine drawing any straight line through a scatter plot. For each data point, measure the vertical gap between the actual value and the value the line predicts — that gap is called a residual. The least-squares line is the unique line that makes the sum of all those squared residuals as small as possible.

The Least-Squares Criterion

The least-squares regression line is the line that minimizes the sum of squared errors (SSE):

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value and $\hat{y}_i$ is the value predicted by the line. No other straight line produces a smaller SSE for the same data.

We write $\hat{y}$ (read “y-hat”) for a predicted value to distinguish it from the observed $y$.

Why square the residuals? Squaring does two things: it makes all gaps positive (so a point 3 units above the line and one 3 units below don’t cancel), and it penalizes large deviations more heavily than small ones (a 6-unit gap contributes 36, not 6, to the total). The result is a line that avoids large misses even at the cost of slightly worse small fits.
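To make the cancellation argument concrete, here is a minimal Python sketch. The five data points are hypothetical (they are not from this lesson), and the helper names `raw_sum` and `sse` are illustrative. It compares a flat line at the mean of the y-values with the least-squares line for these points, $\hat{y} = 2.2 + 0.6x$. The raw residuals sum to zero for both lines, so only the squared version tells them apart.

```python
# Hypothetical data for illustration only (not taken from the lesson).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def residuals(xs, ys, a, b):
    """Vertical gaps y - y_hat for the line y_hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def raw_sum(xs, ys, a, b):
    """Sum of unsquared residuals: positive and negative gaps cancel."""
    return sum(residuals(xs, ys, a, b))


def sse(xs, ys, a, b):
    """Sum of squared residuals: every gap counts, large gaps count more."""
    return sum(e ** 2 for e in residuals(xs, ys, a, b))


flat_line = (4.0, 0.0)   # y_hat = 4 (the mean of ys); ignores x entirely
best_line = (2.2, 0.6)   # least-squares line for these five points

for name, (a, b) in [("flat", flat_line), ("least-squares", best_line)]:
    print(name, "raw sum:", round(raw_sum(xs, ys, a, b), 6),
          "SSE:", round(sse(xs, ys, a, b), 6))
```

Because the raw sum cannot distinguish a poor line from a good one, it is unusable as a fitting criterion. The squared sum can.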

[Interactive figure: candidate lines drawn over the same scatter plot, each with its SSE. The least-squares line (slope 4.26, intercept 51.94) achieves the minimum SSE of 55.66; no other choice of slope and intercept does better. Toggles: show squared residuals, show point of means (x̄, ȳ).]

C2 — Slope Formula

Once we accept the least-squares criterion, calculus shows that the slope must be:

Slope of the Least-Squares Regression Line

$$b = r \cdot \frac{s_y}{s_x}$$

where $r$ is the Pearson correlation coefficient, $s_y$ is the sample standard deviation of $y$, and $s_x$ is the sample standard deviation of $x$.

The slope $b$ has the same sign as $r$. Its magnitude is the correlation scaled by the ratio of spreads: how much $y$ typically changes (in $y$-units) for each unit change in $x$.
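For readers who want to see where the slope formula comes from, the calculus step can be sketched as follows. This is the standard derivation for the line $\hat{y} = a + bx$. The shorthand $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum (x_i - \bar{x})^2$ is introduced here for convenience and is not notation from the lesson.

$$
\begin{aligned}
\text{SSE}(a, b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial\,\text{SSE}}{\partial a} &= -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0 \;\Rightarrow\; a = \bar{y} - b\,\bar{x} \\
\frac{\partial\,\text{SSE}}{\partial b} &= -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0 \;\Rightarrow\; b = \frac{S_{xy}}{S_{xx}} \\
b &= \frac{S_{xy}}{S_{xx}} = \frac{r\,(n-1)\,s_x s_y}{(n-1)\,s_x^2} = r \cdot \frac{s_y}{s_x}
\end{aligned}
$$

The last line uses $r = S_{xy} / \big((n-1)\,s_x s_y\big)$ and $S_{xx} = (n-1)\,s_x^2$.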

Mini-example: Suppose , , .

For each 1-unit increase in $x$, the predicted $y$ increases by 2.00 units, on average.

The inverted ratio error (very common): The formula is $b = r \cdot s_y / s_x$, with $s_y$ in the numerator. Writing $b = r \cdot s_x / s_y$ inverts the ratio and gives a completely different (wrong) answer. A memory aid: $b$ is the slope of $y$ on $x$, so the spread of the response variable ($s_y$) goes on top.
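A minimal sketch of the slope calculation, with the inverted-ratio mistake shown alongside for contrast. The function name `slope_from_summary` and the values r = 0.80, s_y = 5.0, s_x = 2.0 are illustrative assumptions, not the numbers from the lesson’s mini-example.

```python
def slope_from_summary(r, s_y, s_x):
    """Least-squares slope b = r * (s_y / s_x), with s_y in the numerator."""
    if s_x <= 0 or s_y < 0:
        raise ValueError("standard deviations must satisfy s_x > 0 and s_y >= 0")
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    return r * s_y / s_x


# Illustrative summary statistics (assumed values).
r, s_y, s_x = 0.80, 5.0, 2.0

b_correct = slope_from_summary(r, s_y, s_x)   # spread of y on top
b_inverted = r * s_x / s_y                    # the common mistake

print(f"correct slope:  {b_correct:.2f}")
print(f"inverted ratio: {b_inverted:.2f}")
```

The two results differ by a factor of $(s_y / s_x)^2$, so the error is rarely small.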

C3 — Intercept Formula

The intercept is computed after the slope, using both means:

Intercept of the Least-Squares Regression Line

$$a = \bar{y} - b\,\bar{x}$$

where $b$ is the slope computed in C2, $\bar{x}$ is the mean of $x$, and $\bar{y}$ is the mean of $y$.

The intercept positions the line vertically so that it passes exactly through the point of means $(\bar{x}, \bar{y})$.

Mini-example (continued): From C2, . Suppose and .

The regression line is .

C4 — The Line Always Passes Through $(\bar{x}, \bar{y})$

This is a mathematical guarantee, not a coincidence:

The Point of Means Is Always on the Line

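Substituting $x = \bar{x}$ into $\hat{y} = a + bx$, with $a = \bar{y} - b\,\bar{x}$:

$$\hat{y} = a + b\,\bar{x} = (\bar{y} - b\,\bar{x}) + b\,\bar{x} = \bar{y}$$

So $\hat{y} = \bar{y}$ when $x = \bar{x}$. The point $(\bar{x}, \bar{y})$ always lies on the least-squares line.

Practical use: After computing $b$ and $a$, plug in $x = \bar{x}$ and verify you get $\hat{y} = \bar{y}$. If not, you made an arithmetic error.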

Verification (continued): ✓
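The intercept formula and the point-of-means check can be combined in a short sketch. The helper names and the summary values (b = 2.0, x̄ = 5.0, ȳ = 70.0) are assumptions chosen for illustration, not the lesson’s numbers.

```python
import math


def intercept(b, x_bar, y_bar):
    """Intercept a = y_bar - b * x_bar of the line y_hat = a + b*x."""
    return y_bar - b * x_bar


def passes_through_means(a, b, x_bar, y_bar):
    """True when plugging x_bar into y_hat = a + b*x returns y_bar."""
    return math.isclose(a + b * x_bar, y_bar, abs_tol=1e-9)


# Assumed summary statistics for illustration.
b, x_bar, y_bar = 2.0, 5.0, 70.0
a = intercept(b, x_bar, y_bar)

print("intercept:", a)
print("check passes:", passes_through_means(a, b, x_bar, y_bar))
# An intercept with an arithmetic slip fails the same check.
print("check with a slipped intercept:", passes_through_means(a + 1.5, b, x_bar, y_bar))
```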

C5 — Interpreting the Slope

Slope Interpretation Template

“For each 1-unit increase in [x variable], the predicted [y variable] [increases / decreases] by [y units], on average.”

The phrase “on average” is mandatory — the slope describes the average change across all individuals in the data, not the guaranteed change for any specific individual.

Example: For a fitted line with slope $b = 2.00$, where $x$ = study hours and $y$ = exam score:

“For each additional hour of study, the predicted exam score increases by 2.00 points, on average.”

Writing “If a student studies 1 more hour, their score increases by 2.00 points” drops the phrase “on average.” Regression predicts the average outcome for students with that study time — it does not guarantee any individual’s result. The missing “on average” is a marked error on every assignment in this module.

The slope is a constant rate: each additional unit of $x$ adds $b$ to the predicted $y$. The triangle below shows this rise-over-run directly on the best-fit line:

Run = +1 hour, rise = +4.26 points. Slope b = rise ÷ run = 4.26 points per hour: for each additional hour, predicted exam score rises by 4.26 points on average.

C6 — Interpreting the Intercept

Intercept Interpretation

The intercept $a$ is the predicted value of $y$ when $x = 0$. It is only contextually meaningful when $x = 0$ is plausible given the observed data range.

When $x = 0$ falls far outside the observed data range, $a$ is a mathematical anchor that positions the line; do not interpret it as a realistic prediction.

Two contrasting cases:

Case 1: $x$ = temperature (°C) and $y$ = hot beverage sales. At $x = 0$ °C (freezing point), predicting sales is realistic, so $a$ is meaningful.
Case 2: $x$ = age (years, observed range 20–65) and $y$ = reaction time (ms). At $x = 0$ (a newborn), the predicted reaction time of 177.4 ms has no meaning, so $a = 177.4$ is a mathematical anchor only.

Always check whether $x = 0$ is within or near the observed range before interpreting the intercept. Using the intercept as a meaningful prediction when $x = 0$ is far outside the data is called extrapolation, the same logical error as predicting outside the data range in any direction. When in doubt, state “the intercept is a mathematical anchor; $x = 0$ is not within the observed data range.”

C7 — Residuals

Residual

The residual for observation $i$ is:

$$e_i = y_i - \hat{y}_i$$

Positive residual ($e_i > 0$): the actual value is above the line, so the model underpredicted.
Negative residual ($e_i < 0$): the actual value is below the line, so the model overpredicted.
Zero residual: the point lies exactly on the line.

For any least-squares line, the sum of all residuals equals zero: $\sum e_i = 0$.

Mini-example: Line: . Observed: , .

— this point is 2 units above the line.
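As a sketch of how residual signs and the zero-sum property look in practice, the code below fits a least-squares line to five hypothetical points (not lesson data) and lists each residual. The slope is computed as $\sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$, which is algebraically the same as $r \cdot s_y / s_x$. The function names are illustrative.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def classify(e, tol=1e-9):
    """Describe a residual by its sign."""
    if e > tol:
        return "above line (underpredicted)"
    if e < -tol:
        return "below line (overpredicted)"
    return "on the line"


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, y, e in zip(xs, ys, residuals):
    print(f"x={x}, y={y}, residual={e:+.2f}: {classify(e)}")
print(f"sum of residuals (zero up to rounding): {sum(residuals):.10f}")
```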

C8 — Making Predictions with $\hat{y}$

To predict the response for a specific $x$ value, substitute it into the equation:

Prediction

Given $\hat{y} = a + bx$, the predicted value for a specific $x_0$ is:

$$\hat{y}_0 = a + b\,x_0$$

Always state that the result is a predicted value, not a guaranteed outcome. For $x_0$ within the observed range of $x$, this is called interpolation and is generally reliable. For $x_0$ outside that range, this is extrapolation; use with caution.

[Interactive figure: prediction slider for x (hours studied). At x = 10 h the predicted ŷ ≈ 94.5 is an extrapolation: 10 h lies outside the observed range (1–8 h), and the line keeps climbing even where there is no evidence that the linear trend holds.]
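A minimal sketch of a prediction helper that also reports whether the input is interpolation or extrapolation. The slope 4.26, intercept 51.94, and observed range of 1–8 hours are the values shown in the lesson’s interactive figures; the function name `predict` and its return format are assumptions.

```python
def predict(x, a, b, x_min, x_max):
    """Return (y_hat, kind) for the line y_hat = a + b*x.

    kind is "interpolation" when x lies inside the observed range
    [x_min, x_max] and "extrapolation" otherwise.
    """
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind


# Values shown in the lesson's interactive example (hours studied vs. score).
A, B = 51.94, 4.26
X_MIN, X_MAX = 1, 8

for hours in (5, 10):
    y_hat, kind = predict(hours, A, B, X_MIN, X_MAX)
    print(f"x = {hours} h: predicted score {y_hat:.1f} ({kind})")
```

Returning the label together with the number keeps the caution attached to the prediction, so an extrapolated value is harder to report as if it were reliable.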

C9 — Conditions for Simple Linear Regression

At the CEGEP level, focus on three conditions to check before using a regression equation:

Three conditions for simple linear regression:

Both variables are quantitative. Regression requires numerical $x$ and $y$. For categorical variables, use different methods (see REG-4).
The relationship is approximately linear. Check the scatter plot. A curved pattern means linear regression will systematically mis-predict.
No extreme influential outliers. A single influential point (far from the rest in $x$) can drastically shift the slope. Inspect the scatter plot for outliers before trusting the equation.

Independence of observations (no repeated measures on the same subject) is also assumed but usually clear from context at this level.

C10 — Regression of $y$ on $x$ ≠ Regression of $x$ on $y$

Asymmetry of Regression

Swapping the roles of $x$ and $y$ produces a genuinely different regression line.

Slope of $y$ on $x$: $b_{y \mid x} = r \cdot \dfrac{s_y}{s_x}$
Slope of $x$ on $y$: $b_{x \mid y} = r \cdot \dfrac{s_x}{s_y}$

In general, $b_{x \mid y} \neq 1 / b_{y \mid x}$. Only when $|r| = 1$ (perfect linear association) are the two lines identical.

Reason: the regression of $y$ on $x$ minimizes vertical errors (deviations in $y$); the regression of $x$ on $y$ minimizes horizontal errors (deviations in $x$). These are different optimization problems.

If you want to predict $x$ from $y$, you must compute a new regression equation with $x$ as the response and $y$ as the predictor; you cannot simply solve the original equation for $x$. Using the original equation in reverse would apply the wrong optimization criterion and give systematically biased predictions.
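The asymmetry can be checked numerically. In this sketch the summary values r = 0.5, s_x = 2.0, s_y = 4.0 are assumed for illustration, and the function name `both_slopes` is hypothetical. It contrasts the properly refitted slope of x on y with the slope obtained by algebraically inverting the y-on-x line.

```python
def both_slopes(r, s_x, s_y):
    """Return (slope of y on x, slope of x on y) from summary statistics."""
    b_y_on_x = r * s_y / s_x
    b_x_on_y = r * s_x / s_y
    return b_y_on_x, b_x_on_y


# Assumed summary statistics for illustration.
r, s_x, s_y = 0.5, 2.0, 4.0
b_y_on_x, b_x_on_y = both_slopes(r, s_x, s_y)

# Solving y_hat = a + b*x for x would give a slope of 1 / b_y_on_x,
# which is not the least-squares slope for predicting x from y.
inverted = 1 / b_y_on_x

print(f"slope of y on x:                   {b_y_on_x:.2f}")
print(f"slope of x on y (refitted):        {b_x_on_y:.2f}")
print(f"slope from inverting the y-on-x line: {inverted:.2f}")
print(f"product of the two regression slopes: {b_y_on_x * b_x_on_y:.2f}")
print(f"r squared:                            {r ** 2:.2f}")
```

The product of the two regression slopes equals $r^2$, so they are reciprocals of each other only when $r^2 = 1$.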

Example 1 — Fully Worked: Study Hours and Exam Scores

A statistics instructor records study hours ($x$) and exam scores ($y$) for a small group. The summary statistics are: , , , , .

Step 1: Compute the slope.

I notice the formula is $b = r \cdot s_y / s_x$, and I keep $s_y$ in the numerator (not $s_x$).

Step 2: Compute the intercept.

I use the slope just computed and both means.

Step 3: Write the regression equation.

$$\hat{y} = 56.90 + 3.70x$$

Step 4: Verify via the point of means.

Step 5: Interpret.

For each additional hour of study, the predicted exam score increases by 3.70 points, on average.

The intercept 56.90 is the predicted score for 0 hours of study. Since $x = 0$ is at the boundary of the realistic range (some students may study 0 hours), it has borderline contextual meaning.

Step 6: Residual for a specific observation.

A student studied 4 hours and scored 73. What is the residual?

$$\hat{y} = 56.90 + 3.70(4) = 71.70$$

$$e = y - \hat{y} = 73 - 71.70 = 1.30$$

This student scored 1.30 points above what the model predicted — their actual score is above the line.

Step 7: Computation table (for transparency).

If we had the raw data, a table with columns $x_i$, $y_i$, $\hat{y}_i$, $e_i$, $e_i^2$ would confirm that $\sum e_i = 0$ and make SSE concrete.

Example 2 — Partially Scaffolded: Temperature and Hot Beverage Sales

A café manager believes temperature ($x$, °C) is negatively associated with hot beverage sales ($y$, units). Summary statistics: , , , , .

Compute $b$ and $a$, write the equation, and interpret the slope and intercept.

Step 1: $b = r \cdot \dfrac{s_y}{s_x} = -2.16$

The negative sign makes sense: as temperature rises, fewer hot beverages are sold.

Step 2: $a = \bar{y} - b\,\bar{x} = 123.2$

Equation: $\hat{y} = 123.2 - 2.16x$

Verify: ✓

Slope interpretation: For each additional °C of temperature, predicted hot beverage sales decrease by 2.16 units, on average.

Intercept interpretation: At 0°C (freezing point), the model predicts 123.2 units sold. Since 0°C is plausible winter weather, this interpretation is contextually meaningful.

Example 3 — Prediction and Residual: Regression Equation Given

The regression equation for daily exercise minutes ($x$) and resting heart rate ($y$, bpm) is .

(a) Predict the resting heart rate of a person who exercises 45 minutes per day.

(b) That person’s actual resting heart rate is 68 bpm. Compute the residual and state whether the actual value is above or below the line.


(a) bpm

(b)

The residual is positive, so this person’s actual resting heart rate (68 bpm) is above the line — the model underpredicted. This person’s heart rate is higher than expected for someone who exercises 45 minutes daily.

Example 4 — Which Is the Regression Line? (Application Twist)

Two analysts both fit a line to the same dataset where and .

Analyst A’s line:
Analyst B’s line:

Without access to the raw data, how can you determine which line is the least-squares regression line?


Analyst A: ✓

Analyst B:

Both lines pass through $(\bar{x}, \bar{y})$! This means the point-of-means check alone cannot distinguish them: you would need the raw data to compute SSE for both lines and identify the true minimum. The $(\bar{x}, \bar{y})$ condition is necessary but not sufficient for identifying the regression line.
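To see why the point-of-means check is necessary but not sufficient, the sketch below builds several lines that all pass through the point of means and compares their SSE. The data and candidate slopes are hypothetical; they are not the analysts’ lines from this example.

```python
# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)


def intercept_through_means(b):
    """Intercept that forces a line with slope b through (x_bar, y_bar)."""
    return y_bar - b * x_bar


def sse(a, b):
    """Sum of squared residuals for y_hat = a + b*x on the data above."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Every candidate passes the point-of-means check; only the SSE differs.
results = {}
for b in (0.2, 0.6, 1.0):
    a = intercept_through_means(b)
    results[b] = sse(a, b)
    print(f"slope {b:.1f}, intercept {a:.1f}, SSE {results[b]:.2f}")

best_slope = min(results, key=results.get)
print(f"smallest SSE among these candidates: slope {best_slope}")
```

Infinitely many lines pivot around the point of means. The least-squares line is the one among them with the smallest SSE, which is why the check catches arithmetic slips but cannot confirm a line on its own.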

Problem 1 — Computing the Regression Equation

Problem 2 — Interpreting Slope and Intercept

Problem 3 — Residual Scenarios

Problem 4 — Parameterized Regression Prediction

Problem 1 — Full Chain: Slope, Intercept, Prediction, Residual

Problem 2 — Regression Interpretation and Prediction

Problem 3 — Find the Error

Problem 4 — Residual Classification

Problem 5 — Multi-Step Synthesis

Retrieval — Point of Means

Retrieval — Regression Conditions

Mixed Review — Retrieval from Earlier Lessons

These problems draw on concepts from earlier in the course. Attempting them without re-reading prior lessons is the point — retrieval practice strengthens long-term memory more than re-reading.

Review Problem 1 — Sample Standard Deviation (DS-4)

A botanist measures the stem heights (cm) of five seedlings one week after germination: 12, 15, 11, 14, 13.

(a) Compute $\bar{x}$.

(b) Compute the sample standard deviation $s$ using $n - 1$.

(c) Explain why statisticians divide by $n - 1$ rather than $n$ when computing a sample standard deviation.


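(a) $\bar{x} = \dfrac{12 + 15 + 11 + 14 + 13}{5} = \dfrac{65}{5} = 13$ cm

(b) Deviations from the mean:

$x_i$	$x_i - \bar{x}$	$(x_i - \bar{x})^2$
12	−1	1
15	+2	4
11	−2	4
14	+1	1
13	0	0
Sum		10

$$s = \sqrt{\frac{10}{5 - 1}} = \sqrt{2.5} \approx 1.58 \text{ cm}$$

(c) Dividing by $n - 1$ rather than $n$ corrects for the fact that we used the sample mean $\bar{x}$, not the true population mean $\mu$, to compute deviations. Because $\bar{x}$ is calculated from the same data, the deviations around $\bar{x}$ are slightly smaller than they would be around $\mu$. Dividing by $n - 1$ inflates the estimate just enough to make $s^2$ an unbiased estimator of the population variance $\sigma^2$. This correction is called Bessel’s correction.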
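The effect of Bessel’s correction can be checked with Python’s standard `statistics` module, where `stdev` divides by $n - 1$ and `pstdev` divides by $n$. The data are the five seedling heights listed in the solution table for this problem; the variable names are illustrative.

```python
import statistics

heights_cm = [12, 15, 11, 14, 13]

mean = statistics.mean(heights_cm)
s_sample = statistics.stdev(heights_cm)        # divides by n - 1
s_population = statistics.pstdev(heights_cm)   # divides by n

print(f"mean = {mean} cm")
print(f"standard deviation with n - 1: {s_sample:.2f} cm")
print(f"standard deviation with n:     {s_population:.2f} cm")
```

The $n - 1$ version is always the larger of the two, and the gap is most noticeable in small samples like this one.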

Review Problem 2 — Interpreting $r$ and $r^2$ (REG-1)

A sports scientist reports: “In a study of 30 elite sprinters, I found $r = 0.65$ between weekly training volume (km) and 100 m race time (seconds).”

(a) Interpret the direction and strength of the correlation in context.

(b) Compute $r^2$ and interpret it. Does training volume explain the majority of variance in race time?

(c) A coach concludes: “More training causes faster sprint times.” Identify the flaw and name one specific confounding variable.


(a) $r = 0.65$ indicates a moderate positive linear association between training volume and 100 m race time. As weekly training volume increases, race time tends to increase as well, meaning more training is associated with slower times. This may seem counterintuitive but could reflect that high-volume training is associated with overtraining or injury, or simply that the athletes logging the most km are those working on endurance, not sprint speed.

(b) $r^2 = (0.65)^2 = 0.4225 \approx 42.3\%$

Training volume explains approximately 42.3% of the variability in 100 m race time across these athletes. This means 57.7% of the variance in race times is explained by other factors. Training volume accounts for a substantial but not dominant portion of the variability — it does not explain the majority.

(c) The flaw is inferring causation from correlation. Correlation shows that training volume and race time are statistically associated, but it does not establish that one causes the other.

One specific confounding variable: athlete age. Older athletes may have accumulated more total training volume over their careers while simultaneously slowing down due to age-related physiological decline. Age would drive both variables simultaneously, creating the appearance of a positive $r$ even if more training per se does not cause slower times.

Question 1 — Interpret the Slope

Proposed retrieval — Least-Squares Criterion

Question 2 — Apply: Study Hours Regression

The fixed study-hours equation from the guided practice is . The fresh generator below changes the context and numbers on each attempt so prediction is produced rather than recognized.

Part B: That student actually scores 74. What is the residual, and what does its sign tell you?

Question 3 — Error Analysis

Proposed retrieval — Intercept Calculation

Proposed retrieval — Intercept Meaning

Boss Fight — Scenario A

Boss Fight — Scenario B

Ready for more? These go beyond the lesson objectives.

Challenge 1 — Transfer Scenario A

Challenge 2 — Transfer Scenario B

Challenge 3 — Transfer Scenario C

Full worked solutions for all problems in this lesson (Sections 5–9) are available on the dedicated solutions page. Solutions include every computation step, formula derivation, and interpretation note.


Quick-Reference Formulas

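Formula	Purpose
$b = r \cdot \dfrac{s_y}{s_x}$	Slope of the least-squares regression line ($s_y$ in numerator)
$a = \bar{y} - b\,\bar{x}$	Intercept, computed after $b$
$\hat{y} = a + bx$	Predicted value of $y$ for a given $x$
$e_i = y_i - \hat{y}_i$	Residual (actual − predicted); positive = above line
$\sum e_i = 0$	Sum of residuals is always zero for a least-squares line
$(\bar{x}, \bar{y})$ always on line	Use to verify arithmetic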

Key Interpretation Rules

Slope: “For each 1-unit increase in $x$, the predicted $y$ [increases/decreases] by [$|b|$ units], on average.” The phrase “on average” is mandatory.
Intercept: Only interpret $a$ when $x = 0$ is within or near the observed data range. Otherwise, state it is a mathematical anchor.
Residual sign: Positive → actual above line (underpredicted); Negative → actual below line (overpredicted).
Asymmetry: Regression of $y$ on $x$ ≠ regression of $x$ on $y$. Swapping roles requires computing a new equation.

Common Pitfalls

Pitfall	What goes wrong	Correction
P1 — Inverted ratio	Writing $b = r \cdot s_x / s_y$	Always $s_y$ in numerator: $b = r \cdot s_y / s_x$
P2 — Swapping x and y	Using the same equation in reverse	Compute a new equation with roles swapped
P3 — Missing “on average”	Slope stated as a guarantee for individuals	Add “on average” to every slope interpretation
P4 — Intercept out of context	Interpreting $a$ when $x = 0$ is outside data range	Check data range; call it a “mathematical anchor” if $x = 0$ is not plausible
