An NBA analytics team wants to predict a player’s points per game from their weekly practice hours. They scatter-plot the data, see a linear trend, and draw a line through it — but which line? There are infinitely many straight lines that could pass through that cloud of points. The regression line is the unique line that fits the data best in a mathematically precise sense: it minimizes the total squared prediction error. Its slope tells the analyst something specific — “for each additional hour of practice per week, we predict ___ more points per game, on average” — and that number depends on more than just the direction of the association.
In REG-1 you learned how to measure the strength and direction of a linear relationship using the Pearson correlation $r$. Now you will use $r$, together with the standard deviations of both variables, to compute the exact equation of that best-fitting line, interpret what its slope and intercept mean in context, and assess how well any individual observation fits the prediction.
By the end of this lesson, you will be able to:
Compute the slope and intercept of the least-squares regression line.
Verify your equation by confirming the line passes through $(\bar{x}, \bar{y})$.
Interpret the slope and intercept in plain language, using “on average” and respecting the observed data range.
Compute and interpret residuals as the gap between actual and predicted values.
Explain why the regression of $y$ on $x$ is not the same as the regression of $x$ on $y$.
Section 2: Prerequisites
What you need coming in — and why it matters today:
Pearson correlation $r$ (REG-1): $r$ captures the direction and strength of the linear relationship. Today’s slope formula uses $r$ directly; its sign determines whether $b$ is positive or negative.
Standard deviations $s_x$ and $s_y$ (REG-1, DS-4): These measure the spread of each variable. The ratio $s_y / s_x$ scales the slope to the units of $y$ relative to $x$. Know which is which.
Means $\bar{x}$ and $\bar{y}$ (DS-3): The intercept formula requires both means. The point $(\bar{x}, \bar{y})$ is guaranteed to lie on the regression line, and you will use this to verify your arithmetic.
Scatter plot reading (REG-1, DS-2): Before computing anything, check the scatter plot: is the relationship approximately linear? Are there obvious outliers? Regression is only valid for a linear pattern without extreme influential points.
Quick check — can you recall these from REG-1?
What does the Pearson correlation coefficient measure?
Success Factor:
What changes in this lesson: In REG-1 you summarized the relationship with a single number $r$. Here you build a full equation that lets you make predictions and measure how far each individual data point falls from the line. REG-3 will extend this to formally test whether the slope is statistically significant; the framework you build here carries directly into that lesson.
Retrieval Warm-up — from earlier lessons
A quality-control engineer measures the thickness (mm) of ceramic tiles from a production run. She computes $s = 0.34$ mm. A colleague claims this means “most tiles are within 0.34 mm of the mean.” Which response is most accurate?
A marine biologist finds between sea surface temperature (°C) and coral bleaching percentage across 25 reef sites. A journalist headlines the story: “Warmer seas destroy coral.” Which statistical critique is most precise?
Section 3: Core Concepts
How this section is organized: Ten concepts build in natural calculation order. Understand C1 before computing b (C2), and b before computing a (C3). The interpretation concepts (C5, C6) apply once you have the equation. C7–C10 round out the picture with residuals, conditions, and a subtle asymmetry property.
C1–C4: What the regression line is and how to compute it (criterion, slope, intercept, point-of-means check)
C5–C6: What the slope and intercept mean in plain language
C7–C8: Residuals and making predictions
C9–C10: Conditions and a key asymmetry property
C1 — The Least-Squares Criterion
Imagine drawing any straight line through a scatter plot. For each data point, measure the vertical gap between the actual value and the value the line predicts — that gap is called a residual. The least-squares line is the unique line that makes the sum of all those squared residuals as small as possible.
The Least-Squares Criterion
The least-squares regression line is the line that minimizes the sum of squared errors (SSE):
$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $y_i$ is the actual value and $\hat{y}_i$ is the value predicted by the line. No other straight line produces a smaller SSE for the same data.
We write $\hat{y}$ (read “y-hat”) for a predicted value to distinguish it from the observed $y$.
Why square the residuals? Squaring does two things: it makes all gaps positive (so a point 3 units above the line and one 3 units below don’t cancel), and it penalizes large deviations more heavily than small ones (a 6-unit gap contributes 36, not 6, to the total). The result is a line that avoids large misses even at the cost of slightly worse small fits.
Interactive visualization: three candidate lines drawn through the same scatter plot, each shown with its SSE. Only the least-squares line attains the minimum, SSE = 55.66; no other straight line achieves a lower value for these data.
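The criterion can be checked numerically. The following is a minimal sketch on a small made-up dataset; the points and the two rival lines are illustrative assumptions, not the data behind the visualization. It computes SSE for the least-squares line and for two other candidates.

```python
# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def sse(a, b, xs, ys):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Return (a, b) for the least-squares line using sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


a_ls, b_ls = least_squares(xs, ys)

# Two rival lines (arbitrary choices) to compare against the fitted one.
candidates = {
    "least squares": (a_ls, b_ls),
    "steeper line": (1.0, 1.0),
    "flat line at the mean of y": (sum(ys) / len(ys), 0.0),
}
for name, (a, b) in candidates.items():
    print(f"{name}: SSE = {sse(a, b, xs, ys):.2f}")
```

Trying other intercept and slope pairs in `candidates` should never produce an SSE below the least-squares value for the same points.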
C2 — Slope Formula
Once we accept the least-squares criterion, calculus shows that the slope must be:
Slope of the Least-Squares Regression Line
$$b = r \cdot \frac{s_y}{s_x}$$
where $r$ is the Pearson correlation coefficient, $s_y$ is the sample standard deviation of $y$, and $s_x$ is the sample standard deviation of $x$.
The slope has the same sign as $r$. Its magnitude is the correlation scaled by the ratio of spreads: how much $y$ typically changes (in $y$-units) for each unit change in $x$.
Mini-example: Suppose the given values of $r$, $s_x$, and $s_y$ yield $b = r \cdot s_y / s_x = 2.00$.
For each 1-unit increase in $x$, the predicted $y$ increases by 2.00 units, on average.
The inverted ratio error (very common): The formula is $b = r \cdot s_y / s_x$, with $s_y$ in the numerator. Writing $b = r \cdot s_x / s_y$ inverts the ratio and gives a completely different (wrong) answer. A memory aid: $b$ is the slope of $y$ on $x$, so the spread of the response variable ($s_y$) goes on top.
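A short sketch of the slope formula on made-up paired data (the numbers are assumptions for illustration only). It computes $r$, $s_x$, and $s_y$ from the raw values, then contrasts the correct ratio with the inverted one.

```python
import statistics

# Hypothetical paired data, for illustration only.
xs = [2, 4, 6, 8, 10]
ys = [5, 9, 10, 15, 16]

n = len(xs)
x_bar = statistics.mean(xs)
y_bar = statistics.mean(ys)
s_x = statistics.stdev(xs)  # sample SD (divides by n - 1)
s_y = statistics.stdev(ys)

# Pearson r from its definition.
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x           # correct: spread of the response on top
b_inverted = r * s_x / s_y  # the common mistake

print(f"r = {r:.4f}")
print(f"b = r * s_y / s_x = {b:.4f}")
print(f"inverted ratio gives {b_inverted:.4f}")
```

The two results differ whenever $s_x \neq s_y$, which is why the check on which standard deviation sits in the numerator matters.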
C3 — Intercept Formula
The intercept is computed after the slope, using both means:
Intercept of the Least-Squares Regression Line
$$a = \bar{y} - b\,\bar{x}$$
where $b$ is the slope computed in C2, $\bar{y}$ is the mean of $y$, and $\bar{x}$ is the mean of $x$.
The intercept positions the line vertically so that it passes exactly through the point of means $(\bar{x}, \bar{y})$.
Mini-example (continued): From C2, . Suppose and .
The regression line is .
C4 — The Line Always Passes Through $(\bar{x}, \bar{y})$
This is a mathematical guarantee, not a coincidence:
The Point of Means Is Always on the Line
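Substituting $a = \bar{y} - b\bar{x}$ into $\hat{y} = a + bx$:
$$\hat{y} = (\bar{y} - b\bar{x}) + bx = \bar{y} + b(x - \bar{x})$$
So $\hat{y} = \bar{y}$ when $x = \bar{x}$. The point $(\bar{x}, \bar{y})$ always lies on the least-squares line.
Practical use: After computing $b$ and $a$, plug in $x = \bar{x}$ and verify you get $\hat{y} = \bar{y}$. If not, you made an arithmetic error.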
Verification (continued): ✓
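The intercept formula and the point-of-means check can be sketched as two small helpers. The slope $b = 2.00$ echoes the mini-example; the two means used here are made-up values, and the function names are hypothetical.

```python
import math


def intercept(y_bar, b, x_bar):
    """Intercept of the least-squares line: a = y_bar - b * x_bar."""
    return y_bar - b * x_bar


def passes_through_means(a, b, x_bar, y_bar, tol=1e-9):
    """Check that plugging x_bar into y-hat = a + b*x returns y_bar."""
    return math.isclose(a + b * x_bar, y_bar, abs_tol=tol)


# Assumed values for illustration: b = 2.00 as in the mini-example,
# x_bar and y_bar invented for this sketch.
b, x_bar, y_bar = 2.00, 5.0, 60.0
a = intercept(y_bar, b, x_bar)
print(f"a = {a:.2f}")
print("passes through the point of means:", passes_through_means(a, b, x_bar, y_bar))

# An arithmetic slip in the intercept makes the check fail.
print("with a wrong intercept:", passes_through_means(a + 1.5, b, x_bar, y_bar))
```

When working with rounded values of $a$ and $b$, a looser tolerance (for example `tol=0.05`) is a reasonable choice, since rounding alone can move the check by a few hundredths.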
C5 — Interpreting the Slope
Slope Interpretation Template
”For each 1-unit increase in [x variable], the predicted [y variable] [increases / decreases] by [y units], on average.”
The phrase “on average” is mandatory — the slope describes the average change across all individuals in the data, not the guaranteed change for any specific individual.
Example: For a fitted line with slope $b = 2.00$, where $x$ = study hours and $y$ = exam score:
“For each additional hour of study, the predicted exam score increases by 2.00 points, on average.”
Writing “If a student studies 1 more hour, their score increases by 2.00 points” drops the phrase “on average.” Regression predicts the average outcome for students with that study time — it does not guarantee any individual’s result. The missing “on average” is a marked error on every assignment in this module.
C6 — Interpreting the Intercept
Intercept Interpretation
The intercept $a$ is the predicted value of $y$ when $x = 0$. It is only contextually meaningful when $x = 0$ is plausible given the observed data range.
When $x = 0$ falls far outside the observed data range, $a$ is a mathematical anchor that positions the line; do not interpret it as a realistic prediction.
Two contrasting cases:
Meaningful intercept: $x$ = temperature (°C) and $y$ = hot beverage sales. At $x = 0$°C (freezing point), predicting sales is realistic, so $a$ is meaningful.
Anchor only: $x$ = age (years, observed range 20–65) and $y$ = reaction time (ms), with intercept $a = 177.4$. At $x = 0$ (a newborn), the predicted reaction time of 177.4 ms has no meaning; $a$ is a mathematical anchor only.
Always check whether $x = 0$ is within or near the observed range before interpreting the intercept. Using the intercept as a meaningful prediction when $x = 0$ is far outside the data is called extrapolation, the same logical error as predicting outside the data range in any direction. When in doubt, state “the intercept is a mathematical anchor; $x = 0$ is not within the observed data range.”
C7 — Residuals
Residual
The residual for observation $i$ is:
$$e_i = y_i - \hat{y}_i$$
Positive residual ($e_i > 0$): the actual value is above the line, so the model underpredicted.
Negative residual ($e_i < 0$): the actual value is below the line, so the model overpredicted.
Zero residual: the point lies exactly on the line.
For any least-squares line, the sum of all residuals equals zero: $\sum e_i = 0$.
Mini-example: Line: . Observed: , .
$e = y - \hat{y} = 2$: this point is 2 units above the line.
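The sign rules and the zero-sum property can be seen on a small made-up dataset (an assumption for illustration). The sketch fits the least-squares line, lists each residual with its interpretation, and prints the size of the residual sum.

```python
# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 8, 10]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept from sums of deviations.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, y, e in zip(xs, ys, residuals):
    if abs(e) < 1e-9:
        verdict = "on the line"
    elif e > 0:
        verdict = "above the line (model underpredicted)"
    else:
        verdict = "below the line (model overpredicted)"
    print(f"x = {x}, y = {y}, residual = {e:+.2f}: {verdict}")

# Floating-point arithmetic may leave a value that is tiny rather than exactly 0.
print(f"|sum of residuals| = {abs(sum(residuals)):.10f}")
```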
C8 — Making Predictions with $\hat{y} = a + bx$
To predict the response for a specific $x$ value, substitute into the equation:
Prediction
Given $\hat{y} = a + bx$, the predicted value for a specific $x_0$ is:
$$\hat{y}_0 = a + b\,x_0$$
Always state that the result is a predicted value, not a guaranteed outcome. For $x_0$ within the observed range of $x$, this is called interpolation and is generally reliable. For $x_0$ outside that range, this is extrapolation; use it with caution.
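A minimal sketch of prediction with a range check. It borrows the factory equation from Boss Fight Path B ($\hat{y} = 6.62 + 4.56x$, observed $x$ from 1 to 7 weeks); the helper name `predict` and its return format are assumptions made for this illustration.

```python
def predict(a, b, x0, x_min, x_max):
    """Return (y-hat, label) where label flags interpolation vs. extrapolation."""
    y_hat = a + b * x0
    label = "interpolation" if x_min <= x0 <= x_max else "extrapolation"
    return y_hat, label


# Intercept, slope, and observed range taken from the Path B description.
A, B = 6.62, 4.56
X_MIN, X_MAX = 1, 7

for weeks in (4, 10):
    y_hat, label = predict(A, B, weeks, X_MIN, X_MAX)
    print(f"x = {weeks}: predicted {y_hat:.2f} units ({label})")
```

Returning the label alongside the number keeps the caution attached to the prediction instead of leaving it to memory.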
C9 — Conditions for Simple Linear Regression
At the CEGEP level, focus on three conditions to check before using a regression equation:
Three conditions for simple linear regression:
Both variables are quantitative. Regression requires numerical $x$ and $y$. For categorical variables, use different methods (see REG-4).
The relationship is approximately linear. Check the scatter plot. A curved pattern means linear regression will systematically mis-predict.
No extreme influential outliers. A single influential point (far from the rest in $x$) can drastically shift the slope. Inspect the scatter plot for outliers before trusting the equation.
Independence of observations (no repeated measures on the same subject) is also assumed but usually clear from context at this level.
C10 — Regression of $y$ on $x$ ≠ Regression of $x$ on $y$
Asymmetry of Regression
Swapping the roles of $x$ and $y$ produces a genuinely different regression line.
Slope of $y$ on $x$: $b_{y|x} = r \cdot \dfrac{s_y}{s_x}$
Slope of $x$ on $y$: $b_{x|y} = r \cdot \dfrac{s_x}{s_y}$
In general, $b_{y|x} \neq 1 / b_{x|y}$, because $b_{y|x} \cdot b_{x|y} = r^2$. Only when $r = \pm 1$ (perfect linear association) are the two lines identical.
Reason: the regression of $y$ on $x$ minimizes vertical errors (deviations in $y$); the regression of $x$ on $y$ minimizes horizontal errors (deviations in $x$). These are different optimization problems.
If you want to predict $x$ from $y$, you must compute a new regression equation with $x$ as the response and $y$ as the predictor; you cannot simply solve the original equation for $x$. Using the original equation in reverse would apply the wrong optimization criterion and give systematically biased predictions.
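The asymmetry is easy to see numerically. This sketch uses made-up data (an assumption for illustration) and a hypothetical helper `slope(pred, resp)` to fit both directions, then compares the slope of $x$ on $y$ with the slope obtained by algebraically inverting the first line.

```python
# Hypothetical paired data, for illustration only.
xs = [2, 4, 6, 8, 10]
ys = [5, 9, 10, 15, 16]


def slope(pred, resp):
    """Least-squares slope for predicting resp from pred."""
    n = len(pred)
    p_bar = sum(pred) / n
    q_bar = sum(resp) / n
    s_pq = sum((p - p_bar) * (q - q_bar) for p, q in zip(pred, resp))
    s_pp = sum((p - p_bar) ** 2 for p in pred)
    return s_pq / s_pp


b_y_on_x = slope(xs, ys)       # predicts y from x
b_x_on_y = slope(ys, xs)       # predicts x from y (a separate fit)
naive_inverse = 1 / b_y_on_x   # slope from solving the first line for x

r_squared = b_y_on_x * b_x_on_y  # product of the two slopes equals r^2

print(f"slope of y on x:           {b_y_on_x:.4f}")
print(f"slope of x on y:           {b_x_on_y:.4f}")
print(f"1 / (slope of y on x):     {naive_inverse:.4f}")
print(f"product of slopes (r^2):   {r_squared:.4f}")
```

The gap between the second and third printed values shrinks as $r^2$ approaches 1 and disappears only for a perfect linear association.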
Section 4: Worked Examples
Example 1 — Fully Worked: Study Hours and Exam Scores
A statistics instructor records study hours () and exam scores () for a small group. The summary statistics are: , , , , .
Step 1: Compute the slope.
I notice the formula is $b = r \cdot s_y / s_x$, and I keep $s_y$ in the numerator (not $s_x$).
Step 2: Compute the intercept.
I use the slope just computed and both means.
Step 3: Write the regression equation.
$\hat{y} = 56.90 + 3.70x$
Step 4: Verify via the point of means.
Step 5: Interpret.
For each additional hour of study, the predicted exam score increases by 3.70 points, on average.
The intercept 56.90 is the predicted score for 0 hours of study. Since $x = 0$ is at the boundary of the realistic range (some students may study 0 hours), it has borderline contextual meaning.
Step 6: Residual for a specific observation.
A student studied 4 hours and scored 73. What is the residual?
$\hat{y} = 56.90 + 3.70(4) = 71.70$, so $e = y - \hat{y} = 73 - 71.70 = 1.30$.
This student scored 1.30 points above what the model predicted; their actual score is above the line.
Step 7: Computation table (for transparency).
If we had the raw data, a table with columns $x$, $y$, $\hat{y}$, $e = y - \hat{y}$, and $e^2$ would confirm that $\sum e = 0$ and make SSE concrete.
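The prediction and residual steps of this example can be sketched in a few lines. The intercept 56.90 and slope 3.70 come from the example; the three (hours, score) pairs are the observations used for this equation elsewhere in the lesson. The function names are hypothetical.

```python
# Intercept and slope from Example 1.
A, B = 56.90, 3.70


def predict(hours):
    """Predicted exam score for a given number of study hours."""
    return A + B * hours


def residual(hours, score):
    """Actual minus predicted: positive means above the line."""
    return score - predict(hours)


# (hours, actual score) pairs that appear in this lesson for the same line.
observations = [(4, 73), (5, 78), (6, 74)]

for hours, score in observations:
    e = residual(hours, score)
    side = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"x = {hours}: y-hat = {predict(hours):.2f}, e = {e:+.2f} ({side} the line)")

# x = 4: y-hat = 71.70, e = +1.30 (above the line)
# x = 5: y-hat = 75.40, e = +2.60 (above the line)
# x = 6: y-hat = 79.10, e = -5.10 (below the line)
```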
Example 2 — Partially Scaffolded: Temperature and Hot Beverage Sales
A café manager believes temperature (, °C) is negatively associated with hot beverage sales (, units). Summary statistics: , , , , .
Compute $b$ and $a$, write the equation, and interpret the slope and intercept.
Pause here. Before reading the solution:
What sign should $b$ have? Why?
Is $x = 0$°C plausible for this context?
Write down your predictions, then check.
Show Solution
Step 1:
The negative sign makes sense: as temperature rises, fewer hot beverages are sold.
Step 2:
Equation: $\hat{y} = 123.2 - 2.16x$
Verify: ✓
Slope interpretation: For each additional °C of temperature, predicted hot beverage sales decrease by 2.16 units, on average.
Intercept interpretation: At 0°C (freezing point), the model predicts 123.2 units sold. Since 0°C is plausible winter weather, this interpretation is contextually meaningful.
Example 3 — Prediction and Residual: Regression Equation Given
The regression equation for daily exercise minutes () and resting heart rate (, bpm) is .
(a) Predict the resting heart rate of a person who exercises 45 minutes per day.
(b) That person’s actual resting heart rate is 68 bpm. Compute the residual and state whether the actual value is above or below the line.
Show Solution
(a) $\hat{y} = 64.05$ bpm
(b) $e = y - \hat{y} = 68 - 64.05 = 3.95$ bpm
The residual is positive, so this person’s actual resting heart rate (68 bpm) is above the line — the model underpredicted. This person’s heart rate is higher than expected for someone who exercises 45 minutes daily.
Example 4 — Which Is the Regression Line? (Application Twist)
Two analysts both fit a line to the same dataset where and .
Analyst A’s line:
Analyst B’s line:
Without access to the raw data, how can you determine which line is the least-squares regression line?
Show Verification
Analyst A: ✓
Analyst B:
Both lines pass through $(\bar{x}, \bar{y})$! This means the point-of-means check alone cannot distinguish them: you would need the raw data to compute SSE for both lines and identify the true minimum. Passing through $(\bar{x}, \bar{y})$ is necessary but not sufficient for being the regression line.
Section 5: Guided Practice
Problem 1 — Computing $b$, $a$, and Identifying the Point of Means
Study hours (x) vs. Exam score (y):, , , ,
Part A: Compute the slope $b$.
Part B: Using your slope $b$, compute the intercept $a$.
Part C: Which ordered pair must lie on this regression line?
Temperature (x, °C) vs. Hot beverage sales (y, units):, , , ,
Part A: Compute the slope $b$.
Part B: Using your slope $b$, compute the intercept $a$.
Part C: Which ordered pair must lie on this regression line?
Age (x, years) vs. Reaction time (y, ms):, , , ,
Part A: Compute the slope $b$.
Part B: Using your slope $b$, compute the intercept $a$.
Part C: Which ordered pair must lie on this regression line?
Problem 2 — Interpreting Slope and Intercept
Equation: $\hat{y} = 56.90 + 3.70x$, where $x$ = study hours and $y$ = exam score.
Part A: Select the correct interpretation of the slope.
Part B: Is the intercept interpretation contextually meaningful? ($x$ = 0 hours of study)
Equation: $\hat{y} = 123.2 - 2.16x$, where $x$ = temperature (°C) and $y$ = hot beverage sales (units).
Part A: Select the correct interpretation of the slope.
Part B: Is the intercept interpretation contextually meaningful? ($x$ = 0°C)
Equation: where = daily exercise (min) and = resting heart rate (bpm).
Part A: Select the correct interpretation of the slope.
Part B: Is the intercept interpretation contextually meaningful? ($x$ = 0 minutes of exercise)
Equation: $\hat{y} = 1.60 + 0.45x$, where $x$ = fertilizer applied (g) and $y$ = tomato yield (kg).
Part A: Select the correct interpretation of the slope.
Part B: Is the intercept interpretation contextually meaningful? ($x$ = 0 g of fertilizer)
Equation: $\hat{y} = 177.4 + 1.69x$, where $x$ = age (years, observed range 20–65) and $y$ = reaction time (ms).
Part A: Select the correct interpretation of the slope.
Part B: Is the intercept interpretation contextually meaningful? ($x$ = 0 years, a newborn)
Problem 3 — Residual Scenarios (Non-regenerable)
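For each scenario, compute the predicted value $\hat{y}$, then the residual $e = y - \hat{y}$. Select the correct residual and identify what its sign means.
Scenario 1: $\hat{y} = 56.90 + 3.70x$. A student studies 4 hours and scores 73.
$\hat{y} = 56.90 + 3.70(4) = 71.70$, so $e = 73 - 71.70 = 1.30$. A positive residual means this student’s actual score is above the line (the model underpredicted).
Scenario 2: $\hat{y} = 123.2 - 2.16x$. At 25°C, actual sales were 68 units.
$\hat{y} = 123.2 - 2.16(25) = 69.20$, so $e = 68 - 69.20 = -1.20$. A negative residual means actual sales fell below the line (the model overpredicted).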
Scenario 3:. A person exercises 60 minutes daily and has a resting HR of 56 bpm.
A tiny negative residual means this person’s heart rate is just barely below the line.
Problem 4 — Parameterized Generator
Section 6: Independent Practice
Problem 1 — Full Chain: Slope, Intercept, Prediction, Residual
Study hours (x) vs. Exam score (y):, , , , . Observed: a student studied 5 hours and scored 78.
(a) What is $b$?
(b) Using $b$, what is $a$?
(c) What is $\hat{y}$ for $x = 5$ hours?
(d) The actual score was 78. What does the sign of the residual tell you?
Show Full Solution
Equation: $\hat{y} = 56.90 + 3.70x$
Verification: ✓
$\hat{y} = 56.90 + 3.70(5) = 75.40$; $e = 78 - 75.40 = 2.60$ → the student scored 2.60 points above the line (above predicted).
Temperature (x, °C) vs. Hot beverage sales (y, units):, , , , . Observed: at 28°C, sales were 62 units.
(a) What is $b$?
(b) Using $b$, what is $a$?
(c) What is $\hat{y}$ for $x = 28$°C?
(d) Actual sales were 62 units. What does the sign of the residual tell you?
Show Full Solution
Equation: $\hat{y} = 123.2 - 2.16x$
Verification: ✓
$\hat{y} = 123.2 - 2.16(28) = 62.72$; $e = 62 - 62.72 = -0.72$ → actual sales fell 0.72 units below the predicted value (below the line).
Daily exercise (x, min) vs. Resting heart rate (y, bpm):, , , , . Observed: a person exercises 45 min/day and has resting HR 68 bpm.
(a)–(d) as above.
Show Full Solution
Equation:
Verification: ✓
$e = 68 - 64.05 = 3.95$ → this person’s resting heart rate is 3.95 bpm above the line (the model underpredicted; their heart rate is higher than expected for their exercise level).
Fertilizer applied (x, g) vs. Tomato yield (y, kg):, , , , . Observed: a plot with 12 g fertilizer yields 6.8 kg.
Show Full Solution
Equation: $\hat{y} = 1.60 + 0.45x$
Verification: ✓
$\hat{y} = 1.60 + 0.45(12) = 7.00$; $e = 6.8 - 7.00 = -0.20$ → this plot yielded 0.20 kg less than predicted (below the line; the model overpredicted).
Age (x, years) vs. Reaction time (y, ms):, , , , . Observed: a 55-year-old has reaction time 290 ms.
Show Full Solution
Equation: $\hat{y} = 177.4 + 1.69x$
Verification: ✓
$\hat{y} = 177.4 + 1.69(55) = 270.35$; $e = 290 - 270.35 = 19.65$ → this person’s reaction time is 19.65 ms above the predicted value (above the line; their reaction time is slower than expected for their age).
Problem 2 — Regression Interpretation Generator
Problem 3 — Find the Error
Equation: $\hat{y} = 56.90 + 3.70x$ (study hours → exam score). A researcher reports: “If a student studies 1 more hour, their score will increase by 3.70 points.”
What is the error in this statement?
Show Full Analysis
The error (Pitfall P3): The statement drops the phrase “on average.” The slope 3.70 describes the average predicted change in exam score for students with one additional hour of study — it does not guarantee that any specific student’s score will increase by exactly 3.70 points. An individual student might score higher or lower than the line predicts.
Corrected statement: “For each additional hour of study, the predicted exam score increases by 3.70 points, on average.”
Dataset: temperature → hot beverage sales, , , . A researcher computes .
What is the error in this calculation?
Show Full Analysis
The error (Pitfall P1): The researcher wrote $b = r \cdot s_x / s_y$ instead of $b = r \cdot s_y / s_x$. The response variable’s standard deviation ($s_y$) always goes in the numerator.
Correct calculation: $b = r \cdot \dfrac{s_y}{s_x}$
The researcher’s answer is the slope of $x$ on $y$ (not $y$ on $x$): a completely different line with a different meaning.
Equation: $\hat{y} = 177.4 + 1.69x$ (age 20–65 years → reaction time, ms). A researcher states: “At birth (age 0), the predicted reaction time is 177.4 ms.”
What is the error in this statement?
Show Full Analysis
The error (Pitfall P4): The researcher is extrapolating far beyond the observed range. The model was built on data from adults aged 20–65. Using it to predict the reaction time of a newborn (age 0) is a classic extrapolation error — the linear relationship observed in adults need not hold for infants.
Correct statement: “The intercept 177.4 ms is a mathematical anchor that positions the line. Since $x = 0$ (a newborn) is far outside the observed data range of 20–65 years, this value should not be interpreted as a meaningful prediction.”
Equation: (exercise minutes → heart rate). A researcher wants to predict daily exercise minutes from resting heart rate, so she uses = heart rate and = exercise minutes with the same equation.
What is the error in this approach?
Show Full Analysis
The error (Pitfall P2): The regression of $y$ on $x$ minimizes vertical errors (errors in predicting $y$). The regression of $x$ on $y$ minimizes horizontal errors (errors in predicting $x$). These are different optimization problems that produce different equations.
To predict exercise minutes from heart rate, the researcher must compute a new regression with heart rate as the predictor ($x$) and exercise minutes as the response ($y$). In the original labelling, the slope of the new equation would be $r \cdot s_x / s_y$, not $1/b$.
Equation: $\hat{y} = 1.60 + 0.45x$ (fertilizer g → tomato yield kg). A researcher states: “The intercept 1.60 means that fertilizer explains 1.60 kg of yield.”
What is the error in this statement?
Show Full Analysis
The error: The researcher has confused the intercept with a measure of the fertilizer’s effect. The intercept is the predicted tomato yield when $x = 0$ (no fertilizer is applied). It is the baseline prediction, not the amount of yield attributable to fertilizer.
The amount of yield attributable to a unit of fertilizer is described by the slope ($b = 0.45$): each additional gram of fertilizer is associated with a predicted increase of 0.45 kg in yield, on average.
Corrected statement: “The intercept 1.60 is the predicted tomato yield when no fertilizer is applied ($x = 0$). It is a baseline prediction, not a measure of the fertilizer’s effect.”
Problem 4 — Residual Generator
Problem 5 — Multi-Step Synthesis: Physical Therapy Rehabilitation
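A physical therapy clinic records weeks of rehabilitation ($x$) and mobility score improvement ($y$, scale 0–100) for 10 patients:
$x$ (weeks) | 2 | 3 | 4 | 4 | 5 | 6 | 6 | 7 | 8 | 9
$y$ (improvement) | 12 | 18 | 22 | 25 | 30 | 34 | 38 | 42 | 46 | 53
Pre-computed sums: $n = 10$, $\sum x = 54$, $\sum y = 320$, $\sum xy = 1987$, $\sum x^2 = 336$, $\sum y^2 = 11766$.
(a) Using the pre-computed sums, compute $r$ using the computational formula. (The formula is $r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$.)
(b) Compute $b$ and $a$. Write the regression equation. Use $s_x$ and $s_y$ (or use the computational approach directly from your $r$).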
(c) Interpret the slope in the context of this rehabilitation study. Use the required phrase “on average.”
(d) Is the intercept ($x = 0$ weeks) contextually meaningful here? Explain your reasoning.
(e) Compute the residual for the patient at weeks, . What does the sign tell you?
(f) A clinician wants to predict mobility improvement for a patient completing 10 weeks of rehabilitation. Calculate $\hat{y}$ and note any caution about this prediction.
Show Full Solution
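(a) Computing $r$:
Numerator: $n\sum xy - \sum x \sum y = 10(1987) - (54)(320) = 19870 - 17280 = 2590$
Left bracket: $n\sum x^2 - (\sum x)^2 = 10(336) - 54^2 = 3360 - 2916 = 444$
Right bracket: $n\sum y^2 - (\sum y)^2 = 10(11766) - 320^2 = 117660 - 102400 = 15260$
$r = \dfrac{2590}{\sqrt{444 \times 15260}} \approx 0.995$
Very strong positive linear association.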
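The computational formula for $r$ can be checked directly from the ten data pairs in the table. This is a minimal sketch; the variable names are chosen for illustration.

```python
import math

# Data from the rehabilitation table above.
weeks = [2, 3, 4, 4, 5, 6, 6, 7, 8, 9]
improvement = [12, 18, 22, 25, 30, 34, 38, 42, 46, 53]

n = len(weeks)
sum_x = sum(weeks)
sum_y = sum(improvement)
sum_xy = sum(x * y for x, y in zip(weeks, improvement))
sum_x2 = sum(x * x for x in weeks)
sum_y2 = sum(y * y for y in improvement)

# The three pieces of the computational formula.
numerator = n * sum_xy - sum_x * sum_y
left_bracket = n * sum_x2 - sum_x ** 2
right_bracket = n * sum_y2 - sum_y ** 2

r = numerator / math.sqrt(left_bracket * right_bracket)

print(f"numerator = {numerator}")
print(f"left bracket = {left_bracket}, right bracket = {right_bracket}")
print(f"r = {r:.3f}")
```

Keeping the three pieces as separate variables mirrors the hand calculation, so an arithmetic slip can be traced to a single bracket.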
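(b) Computing $b$ and $a$:
$b = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \dfrac{2590}{444} \approx 5.83$ (equivalently $r \cdot s_y / s_x$ with $s_x \approx 2.22$ and $s_y \approx 13.02$)
$a = \bar{y} - b\bar{x} = 32 - 5.8333(5.4) = 0.50$
Equation: $\hat{y} = 0.50 + 5.83x$
Verification: $0.50 + 5.8333(5.4) = 32.00 = \bar{y}$ ✓
(c) Slope interpretation:
“Each additional week of rehabilitation is associated with a predicted mobility improvement of approximately 5.83 points, on average.”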
(d) Intercept meaningfulness:
$x = 0$ represents a patient with zero weeks of rehabilitation, which is conceptually a patient who received no treatment and therefore had no mobility improvement opportunity. The predicted value $a$ is close to zero, which is actually intuitive (near-zero improvement with no therapy). However, $x = 0$ is at the edge of the observed data range (minimum $x = 2$ weeks), so interpretations should be made cautiously. The near-zero intercept is consistent with the model but technically extrapolates slightly outside the observed range.
(e) Residual for patient at , :
The negative residual means this patient improved slightly less than predicted by the model — their actual score falls below the regression line. The model overpredicted for this patient.
(f) Prediction for $x = 10$ weeks:
Substitute $x = 10$ into the equation from (b) to obtain $\hat{y}$ in points.
Caution: The maximum observed $x$ in the dataset is 9 weeks. Predicting at $x = 10$ weeks is mild extrapolation beyond the observed data range. The linear trend may not continue beyond 9 weeks (e.g., mobility improvement could plateau). Use this prediction with caution and note that it assumes the linear relationship extends to 10 weeks.
Mixed Review — Retrieval from Earlier Lessons
These problems draw on concepts from earlier in the course. Attempting them without re-reading prior lessons is the point — retrieval practice strengthens long-term memory more than re-reading.
Review Problem 1 — Sample Standard Deviation (DS-4)
A botanist measures the stem heights (cm) of five seedlings one week after germination: 12, 15, 11, 14, 13.
(a) Compute $\bar{x}$.
(b) Compute the sample standard deviation using $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$.
(c) Explain why statisticians divide by $n - 1$ rather than $n$ when computing a sample standard deviation.
Show Solution
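(a) $\bar{x} = \dfrac{12 + 15 + 11 + 14 + 13}{5} = \dfrac{65}{5} = 13$ cm
(b) Deviations from the mean:
$x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$
12 | −1 | 1
15 | +2 | 4
11 | −2 | 4
14 | +1 | 1
13 | 0 | 0
Sum | | 10
$s = \sqrt{\dfrac{10}{5 - 1}} = \sqrt{2.5} \approx 1.58$ cm
(c) Dividing by $n - 1$ rather than $n$ corrects for the fact that we used the sample mean $\bar{x}$, not the true population mean $\mu$, to compute deviations. Because $\bar{x}$ is calculated from the same data, the deviations around $\bar{x}$ are slightly smaller than they would be around $\mu$. Dividing by $n - 1$ inflates the estimate just enough to make $s^2$ an unbiased estimator of the population variance $\sigma^2$. This correction is called Bessel’s correction.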
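The effect of the divisor can be seen on the five seedling heights from the solution table. This sketch computes the standard deviation by hand with both divisors and compares the results with Python's `statistics.stdev` (divides by $n - 1$) and `statistics.pstdev` (divides by $n$).

```python
import math
import statistics

# Stem heights (cm) from the deviations table above.
heights = [12, 15, 11, 14, 13]

n = len(heights)
mean = sum(heights) / n
squared_devs = [(h - mean) ** 2 for h in heights]

s_sample = math.sqrt(sum(squared_devs) / (n - 1))  # Bessel's correction
s_divide_by_n = math.sqrt(sum(squared_devs) / n)   # no correction

print(f"mean = {mean:.1f} cm, sum of squared deviations = {sum(squared_devs):.0f}")
print(f"dividing by n - 1: s = {s_sample:.2f} cm")
print(f"dividing by n:     s = {s_divide_by_n:.2f} cm")

# The standard library agrees with both hand calculations.
print(math.isclose(s_sample, statistics.stdev(heights)))
print(math.isclose(s_divide_by_n, statistics.pstdev(heights)))
```

With only five observations the two divisors give visibly different answers; the gap narrows as $n$ grows.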
Review Problem 2 — Interpreting $r$ and $r^2$ (REG-1)
A sports scientist reports: “In a study of 30 elite sprinters, I found $r = 0.65$ between weekly training volume (km) and 100 m race time (seconds).”
(a) Interpret the direction and strength of the correlation in context.
(b) Compute $r^2$ and interpret it. Does training volume explain the majority of variance in race time?
(c) A coach concludes: “More training causes faster sprint times.” Identify the flaw and name one specific confounding variable.
Show Solution
(a) $r = 0.65$ indicates a moderate positive linear association between training volume and 100 m race time. As weekly training volume increases, race time tends to increase as well, meaning more training is associated with slower times. This may seem counterintuitive but could reflect that high-volume training is associated with overtraining or injury, or simply that the athletes logging the most km are those working on endurance, not sprint speed.
(b) $r^2 = (0.65)^2 = 0.4225 \approx 42.3\%$
Training volume explains approximately 42.3% of the variability in 100 m race time across these athletes. This means 57.7% of the variance in race times is explained by other factors. Training volume accounts for a substantial but not dominant portion of the variability — it does not explain the majority.
(c) The flaw is inferring causation from correlation. Correlation shows that training volume and race time are statistically associated, but it does not establish that one causes the other.
One specific confounding variable: athlete age. Older athletes may have accumulated more total training volume over their careers while simultaneously slowing down due to age-related physiological decline. Age would drive both variables simultaneously, creating the appearance of a positive $r$ even if more training per se does not cause slower times.
Section 7: Mastery Check
Question 1 — Feynman Test
In your own words, explain what the slope of a regression line tells you — and what phrase must always appear in a correct interpretation. Write as if explaining to a classmate who has never taken statistics. Aim for 200–500 characters.
Model Answer
The slope tells you how much the predicted response variable ($y$) changes for each 1-unit increase in the predictor variable ($x$). The critical phrase that must always appear is “on average.”
Why “on average”? The regression line predicts the mean response for all individuals with a given $x$ value; it does not guarantee what any specific individual will do. A student who studies one more hour might score 3.70 points higher, or lower, or exactly as predicted. The slope only pins down the average across many such students.
What the slope does NOT tell you:
It is NOT a causal relationship (more study hours do not necessarily cause higher scores — there may be confounders).
It is NOT the change for any specific individual — only the average predicted change.
It is NOT valid for $x$ values far outside the data range (extrapolation).
The sign of $b$ tells you the direction: positive means the predicted $y$ increases with $x$; negative means it decreases.
Question 2 — Apply: Study Hours Regression
Using the regression equation from GP1 Variant 0, $\hat{y} = 56.90 + 3.70x$ (study hours → exam score):
Part A: A student studies 6 hours. What is the predicted exam score?
Part B: That student actually scores 74. What is the residual, and what does its sign tell you?
Show Full Solution
Part A: $\hat{y} = 56.90 + 3.70(6) = 79.10$. Part B: $e = 74 - 79.10 = -5.10$.
The negative residual means this student scored 5.10 points below what the model predicted for someone who studied 6 hours. Their actual exam score fell below the regression line.
This is normal — individual observations scatter around the line. The negative residual just means this particular student underperformed relative to the average trend for 6-hour studiers.
Question 3 — Error Analysis
Flawed statistical report:
A researcher has data on fertilizer (g) and tomato yield (kg). She computes , , , , . She then writes:
“The slope is $b = r \cdot s_x / s_y = 1.25$. So for each additional gram of fertilizer, tomato yield increases by 1.25 kg.”
Identify the errors in this report.
Show Full Analysis
Error 1: Inverted ratio (P1). The correct formula is $b = r \cdot s_y / s_x$, with $s_y$ in the numerator. The researcher wrote $b = r \cdot s_x / s_y$. This is the slope of $x$ on $y$, not $y$ on $x$.
Correct slope: $b = r \cdot \dfrac{s_y}{s_x} = 0.45$
Error 2: Missing “on average” (P3). The slope interpretation requires the phrase “on average.” Regression predicts the average yield for a population of plants with a given fertilizer amount; individual plants will scatter around this prediction.
Corrected report: “The slope is $b = r \cdot s_y / s_x = 0.45$. For each additional gram of fertilizer applied, the predicted tomato yield increases by 0.45 kg, on average.”
Self-Assessment
How confident do you feel about computing and interpreting the regression line?
Scale: from “Still confused” to “Ready for the Boss Fight”.
Section 8: Boss Fight
Choose your path. Both require full regression reasoning from start to finish.
🔢 Path A: The Calculator
A production supervisor wants to model units produced per worker from weeks of on-the-job training. You have raw data: compute r, then b and a, then make a prediction and evaluate a residual.
📊 Path B: The Interpreter
A regression equation has been handed to you. Interpret every piece of it, make a prediction, compute a residual, and reason about what would happen if an extreme outlier were added to the dataset.
🔢 Path A: The Calculator
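A factory tracks weeks of on-the-job training ($x$) and units produced per shift ($y$) for 7 new workers:
$x$ (weeks) | 1 | 2 | 3 | 4 | 5 | 6 | 7
$y$ (units) | 10 | 16 | 22 | 24 | 30 | 32 | 40
Pre-computed sums: $\sum x = 28$, $\sum y = 174$, $\sum xy = 826$, $\sum x^2 = 140$, $\sum y^2 = 4940$.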
Task 1. Check the three conditions for simple linear regression. Would you proceed?
Show Guidance for Task 1
Both variables quantitative: Weeks ($x$) and units ($y$) are both quantitative. ✓
Approximately linear relationship: With only 7 points, we would check the scatter plot. The data increases consistently from 10 to 40 units — a linear trend is plausible. ✓
No extreme influential outliers: No single point appears drastically out of line with the trend. ✓
Proceed with regression.
Task 2. Using the computational formula, compute $r$. Then compute $b$ (using $s_x$ and $s_y$) and $a$.
Show Guidance for Task 2
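Numerator: $n\sum xy - \sum x \sum y = 7(826) - (28)(174) = 5782 - 4872 = 910$
Left bracket: $n\sum x^2 - (\sum x)^2 = 7(140) - 28^2 = 980 - 784 = 196$
Right bracket: $n\sum y^2 - (\sum y)^2 = 7(4940) - 174^2 = 34580 - 30276 = 4304$
$r = \dfrac{910}{\sqrt{196 \times 4304}} \approx 0.991$; with $s_x \approx 2.16$ and $s_y \approx 10.12$, $b = r \cdot s_y / s_x \approx 4.64$ and $a = \bar{y} - b\bar{x} = 24.86 - 4.64(4) \approx 6.29$.
Equation: $\hat{y} = 6.29 + 4.64x$
Verification: $6.29 + 4.64(4) = 24.85 \approx \bar{y} = 24.86$ (difference due to rounding) ✓
Task 3. Predict units produced for a worker with 5 weeks of training. Then compute the residual for the worker with $x = 5$, $y = 30$.
Show Guidance for Task 3
$\hat{y} = 6.29 + 4.64(5) \approx 29.5$ units; $e = 30 - 29.5 = 0.5$
This worker produced about 0.5 units more than the regression model predicted for someone with 5 weeks of training; their actual output is above the line.
Task 4. Interpret the slope and intercept. Is the intercept contextually meaningful for this dataset?
Show Guidance for Task 4
Slope: For each additional week of on-the-job training, the predicted units produced per shift increases by approximately 4.64 units, on average.
Intercept: The intercept $a \approx 6.29$ is the predicted units produced for a worker with $x = 0$ weeks of training (a brand-new, untrained worker). Since 0 weeks is a realistic starting point and is at the boundary of the observed data range, the intercept has borderline contextual meaning: it suggests that untrained workers produce roughly 6 units per shift.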
Reflection: In one or two sentences, explain what the regression equation tells the factory supervisor about the value of training — and what it does not tell them.
📊 Path B: The Interpreter
You are given the following regression equation for a factory dataset: $\hat{y} = 6.62 + 4.56x$, where $x$ = weeks of training and $y$ = units produced per shift (observed range: $x$ from 1 to 7 weeks, $\bar{x} = 4$, $\bar{y} = 24.86$).
Task 1. Interpret the slope and the intercept. For the intercept, explicitly state whether it is contextually meaningful.
Show Guidance for Task 1
Slope: For each additional week of on-the-job training, the predicted number of units produced per shift increases by 4.56 units, on average.
Intercept: The intercept 6.62 is the predicted units produced for a brand-new worker with zero weeks of training. Since $x = 0$ lies just below the observed range (the minimum observed value is 1 week), this is a mild extrapolation and the intercept has only borderline contextual meaning. A reasonable interpretation: workers with no formal training are predicted to produce approximately 6–7 units per shift.
Task 2. A supervisor wants to predict output for a worker completing 10 weeks of training. Compute $\hat{y}$ and explain any concern.
Show Guidance for Task 2
$\hat{y} = 6.62 + 4.56(10) = 52.22$ units
Concern: The observed data only go up to $x = 7$ weeks. Predicting at $x = 10$ is extrapolation beyond the observed range. The linear relationship may not hold there, since productivity gains from training often level off at higher experience levels. Use this prediction with caution and flag it as an extrapolation.
Task 3. A worker with 4 weeks of training produces 28 units. Compute the residual and interpret what it means.
Show Guidance for Task 3
$\hat{y} = 6.62 + 4.56(4) = 24.86$, so $e = y - \hat{y} = 28 - 24.86 = 3.14$.
This worker produced 3.14 units more than the model predicted for someone with 4 weeks of training. Their actual output is above the regression line; they outperformed the average trend.
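The prediction, the extrapolation check, and the residual from Tasks 2 and 3 can be sketched in a few lines. This assumes the slope 4.56 and intercept 6.62 interpreted above and the observed range of 1 to 7 weeks; the function names (`predict`, `is_extrapolation`, `residual`) are hypothetical helpers introduced only for illustration.

```python
# Path B regression equation: y_hat = 6.62 + 4.56 * x
B0, B1 = 6.62, 4.56
X_MIN, X_MAX = 1, 7  # observed range of weeks of training


def predict(x):
    """Predicted units per shift for x weeks of training."""
    return B0 + B1 * x


def is_extrapolation(x):
    """True when x falls outside the observed range of the predictor."""
    return x < X_MIN or x > X_MAX


def residual(x, y_actual):
    """Actual minus predicted; positive means the point lies above the line."""
    return y_actual - predict(x)


# Task 2: prediction at 10 weeks, flagged as outside the data
print(round(predict(10), 2), is_extrapolation(10))

# Task 3: worker with 4 weeks of training who produced 28 units
print(round(residual(4, 28), 2))
```

Keeping the range check next to the prediction function makes it harder to report an extrapolated value without noticing that it is one.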
Task 4. Suppose an exceptional worker with an unusually high number of training weeks produces an output far above the others. Reason through what would happen to the regression line if this outlier were added to the dataset. Would $b_1$ increase, decrease, or stay about the same? Would the equation still pass through $(\bar{x}, \bar{y})$?
Show Guidance for Task 4
This point has a high $x$ and an extreme $y$, so it would be a high-leverage, influential point. Its inclusion would:
Increase $b_1$: The outlier pulls the high-$x$ end of the line upward, increasing the slope. The value of $b_1 = r \cdot s_y / s_x$ would change because the outlier alters $r$, $s_x$, and $s_y$; $s_y$ in particular would increase.
Change $b_0$: The new mean of $y$ would be higher, which shifts $b_0 = \bar{y} - b_1\bar{x}$ accordingly.
The line still passes through $(\bar{x}, \bar{y})$: This is a mathematical guarantee for any least-squares line, regardless of what data are in the dataset. The new $(\bar{x}, \bar{y})$ would shift, but the new equation would still pass through it.
Influential outliers can drastically change the regression equation, which is why checking the scatter plot before trusting the equation is part of the conditions check (C9).
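The effect of an influential point can be seen by refitting the line with and without it. The sketch below is only an illustration under stated assumptions: it reuses the seven Path A data points as a stand-in dataset, and the added outlier (10 weeks, 80 units) is a hypothetical value chosen for the demonstration, not the one from the task. The helper `least_squares` is likewise an illustrative name.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


# Stand-in dataset (the Path A factory data)
weeks = [1, 2, 3, 4, 5, 6, 7]
units = [10, 16, 22, 24, 30, 32, 40]

# Hypothetical outlier, assumed for illustration: high x, extreme y
weeks_out = weeks + [10]
units_out = units + [80]

b0, b1 = least_squares(weeks, units)
b0_out, b1_out = least_squares(weeks_out, units_out)

# The refitted line still passes through the new point of means
x_bar_out = sum(weeks_out) / len(weeks_out)
y_bar_out = sum(units_out) / len(units_out)
gap = (b0_out + b1_out * x_bar_out) - y_bar_out

print(f"without outlier: y_hat = {b0:.2f} + {b1:.2f}x")
print(f"with outlier:    y_hat = {b0_out:.2f} + {b1_out:.2f}x")
print(f"gap at the new point of means: {gap:.1e}")
```

A single added point changes both coefficients noticeably, while the gap at the point of means stays at zero up to floating-point error.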
Reflection: In two sentences, explain why the slope interpretation requires “on average” — and what would be wrong with saying “a worker trained for 1 more week will produce 4.56 more units.”
Section 9: Challenge Problems
Ready for more? These go beyond the lesson objectives.
Challenge 1 — Regression Asymmetry
For a dataset with , , , , :
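(a) Compute the slope of $y$ on $x$.
(b) Compute the slope of $x$ on $y$ (i.e., treating $y$ as the predictor and $x$ as the response, with formula $r \cdot \frac{s_x}{s_y}$).
(c) Is the slope from (b) equal to the reciprocal of the slope from (a)? Show your work.
(d) Under what condition would the two slopes be exact reciprocals? Explain using the formula.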
Show Solution
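(a) Slope of $y$ on $x$: $r \cdot \frac{s_y}{s_x}$.
(b) Slope of $x$ on $y$: $r \cdot \frac{s_x}{s_y}$.
(c) The reciprocal of the slope in (a) is $\frac{s_x}{r \cdot s_y}$, which differs from the slope in (b) whenever $r^2 \neq 1$. So no, the two slopes are not reciprocals.
Check: the product of the two slopes equals $r^2$. This is always true: $\left(r \cdot \frac{s_y}{s_x}\right)\left(r \cdot \frac{s_x}{s_y}\right) = r^2$.
(d) The two slopes are exact reciprocals only when $r \cdot \frac{s_x}{s_y} = \frac{s_x}{r \cdot s_y}$, i.e., when $r^2 = 1$, i.e., when $r = \pm 1$. Only in a perfect linear association do the two regression lines coincide, so that the regression of $y$ on $x$ and the regression of $x$ on $y$ produce inverse slopes. For any $|r| < 1$, the two lines are genuinely different.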
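The asymmetry can be checked numerically. The sketch below uses hypothetical summary statistics ($r = 0.8$, $s_x = 2$, $s_y = 5$), assumed purely for illustration and not taken from the challenge, and the function names are illustrative as well.

```python
import math


def slope_y_on_x(r, s_x, s_y):
    """Slope when x is the predictor and y is the response."""
    return r * s_y / s_x


def slope_x_on_y(r, s_x, s_y):
    """Slope when y is the predictor and x is the response."""
    return r * s_x / s_y


# Hypothetical summary statistics, assumed for illustration only
r, s_x, s_y = 0.8, 2.0, 5.0

b_yx = slope_y_on_x(r, s_x, s_y)
b_xy = slope_x_on_y(r, s_x, s_y)

print("slope of y on x:", b_yx)
print("slope of x on y:", b_xy)
print("reciprocal of the y-on-x slope:", 1 / b_yx)
print("product equals r squared:", math.isclose(b_yx * b_xy, r ** 2))

# With perfect correlation the two slopes become exact reciprocals
perfect = slope_x_on_y(1.0, s_x, s_y)
print("reciprocals when r = 1:", math.isclose(perfect, 1 / slope_y_on_x(1.0, s_x, s_y)))
```

Swapping predictor and response is therefore not the same as solving the original equation for $x$, except in the perfectly linear case.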
Challenge 2 — Sensitivity of $b_1$ to $r$
Using the given $s_x$ and $s_y$ (with $s_y / s_x = 2.5$), fill in the table:
| $r$ | $b_1$ | Notes |
|---|---|---|
| 0.4 | ? | |
| 0.8 | ? | Does $b_1$ double when $r$ doubles from 0.4 to 0.8? |
| −0.4 | ? | |
| 1.0 | ? | |
(a) What is the slope when $r = 0$? What does this line look like?
(b) What is the slope when $r = 1$? What does this mean geometrically?
Show Solution
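$s_y / s_x = 2.5$ for all rows, so $b_1 = 2.5r$.
| $r$ | $b_1$ |
|---|---|
| 0.4 | 1.0 |
| 0.8 | 2.0 |
| −0.4 | −1.0 |
| 1.0 | 2.5 |
Does $b_1$ double when $r$ doubles from 0.4 to 0.8? Yes, because $b_1$ is linear in $r$. Doubling $r$ doubles $b_1$ (the ratio $s_y / s_x$ is constant).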
(a) When $r = 0$: $b_1 = 0 \cdot \frac{s_y}{s_x} = 0$. The regression line is horizontal: $\hat{y} = \bar{y}$ for all $x$. The best prediction for any $x$ is simply the mean of $y$; knowing $x$ provides no useful linear information.
(b) When $r = 1$: $b_1 = 1 \cdot \frac{s_y}{s_x} = 2.5$. All data points lie exactly on the line (perfect positive linear association). The slope equals exactly the ratio of standard deviations.
Takeaway: $r$ controls the “tilt” of the regression line. As $r$ increases from 0 to 1, the line tilts from flat to its maximum slope $s_y / s_x$.
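The table can be regenerated with a short loop. This sketch assumes only the ratio $s_y / s_x = 2.5$, which is read off the solution table (the slope is 2.5 when $r = 1$); the names `RATIO` and `slope` are illustrative.

```python
RATIO = 2.5  # s_y / s_x, inferred from the solution table (assumption)


def slope(r, ratio=RATIO):
    """Least-squares slope b1 = r * (s_y / s_x)."""
    return r * ratio


# Rows from the challenge table, plus r = 0 for part (a)
for r in (0.4, 0.8, -0.4, 1.0, 0.0):
    print(f"r = {r:+.1f}  ->  b1 = {slope(r):+.2f}")
```

Because the ratio of standard deviations is fixed, the slope scales in direct proportion to $r$ and takes its sign from $r$.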
Challenge 3 — Regression to the Mean (Optional Stretch)
A professor gives two exams. The correlation between scores is . Both exams have the same mean () and standard deviation (). A student scores 90 on Exam 1 — two standard deviations above the mean.
(a) Compute $b_1$ and $b_0$ for predicting Exam 2 score from Exam 1 score.
(b) Predict the Exam 2 score for a student who scored 90 on Exam 1.
(c) The predicted score is below 90, closer to the mean. Explain why this happens mathematically, using the slope formula.
(d) A coach observes that athletes who perform exceptionally in week 1 tend to do worse in week 2, and concludes that hard training in week 1 “wears them out.” Is this a valid conclusion? What is the name of the statistical phenomenon at play?
Show Solution
(a) $b_1 = r \cdot \frac{s_y}{s_x} = r$ (since $s_x = s_y$), and $b_0 = \bar{y} - b_1\bar{x}$.
(b) $\hat{y} = b_0 + b_1(90) = 82$
The predicted Exam 2 score is 82, not 90; it has moved closer to the mean.
(c) When $s_x = s_y$, the slope equals $r$, and $|r| < 1$ (for imperfect association). The predicted deviation from the mean is $r$ times the observed deviation. For a student who scored 2 SDs above the mean on Exam 1:
$\hat{y} - \bar{y} = r(x - \bar{x}) = r \cdot 2s_x$
So the predicted score is $\bar{y} + 2r s_y = 82$. The factor of $r$ “shrinks” the deviation toward the mean. This shrinkage happens mathematically because imperfect correlation means part of the extreme score was due to luck (random error), and luck does not repeat.
(d) No, this is not a valid causal conclusion. The phenomenon is called regression to the mean; mistaking it for a causal effect is the regression fallacy. Extreme performances partly reflect random variation. Even if the coach did nothing differently, the top performers of week 1 would be expected, on average, to perform closer to average in week 2. Attributing the “decline” to training fatigue confuses a statistical artifact with a causal mechanism. This fallacy has real consequences: it underlies the incorrect belief that praise leads to worse performance and that punishment leads to better performance.
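Regression to the mean can be reproduced in a simulation with no causal mechanism at all. The sketch below is a toy model under stated assumptions: each score is a stable ability plus independent luck, and every number in it (10,000 students, ability mean 70, the two standard deviations, the top-5% cutoff) is hypothetical and not taken from the challenge.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical model: score = stable ability + independent luck
N = 10_000
ability = [random.gauss(70, 8) for _ in range(N)]
exam1 = [a + random.gauss(0, 6) for a in ability]
exam2 = [a + random.gauss(0, 6) for a in ability]

# Select the students in roughly the top 5% on Exam 1
cutoff = sorted(exam1)[int(0.95 * N)]
top = [i for i in range(N) if exam1[i] >= cutoff]

top_mean_1 = statistics.mean([exam1[i] for i in top])
top_mean_2 = statistics.mean([exam2[i] for i in top])
overall_mean_2 = statistics.mean(exam2)

print(f"top group, Exam 1 mean: {top_mean_1:.1f}")
print(f"top group, Exam 2 mean: {top_mean_2:.1f}")
print(f"all students, Exam 2 mean: {overall_mean_2:.1f}")
```

Nothing happens to these simulated students between exams, yet the top group's second average falls back toward the overall mean while staying above it, which is the pattern the coach misread as fatigue.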
Section 10: Solutions Reference
Full worked solutions for all problems in this lesson (Sections 5–9) are available on the dedicated solutions page. Solutions include every computation step, formula derivation, and interpretation note.