In April 2020, the Georgia Department of Public Health posted a bar chart showing COVID-19 cases
by county over 15 days. The dates weren’t in chronological order on the x-axis; they were
sorted so the bars would appear to go down, giving the visual impression that cases were
falling when they weren’t. Public outcry forced the department to repost the chart with the dates
in correct order. The bars now visibly climbed.
Same data. Two very different stories.
This is what visualization literacy is really about: not just making a graph, but understanding
what a graph communicates — and what it can hide. Every design choice (what goes on each axis,
where the axis starts, how wide the bars are, whether to use a pie chart or a histogram) changes
the story the viewer perceives. Make the wrong choice by accident and you mislead your audience.
Make it deliberately and you deceive them.
By the end of this lesson you’ll have the tools to both create honest, informative
graphs and read suspicious ones critically.
After this lesson, you will be able to:
Build a frequency distribution table showing absolute, relative, and cumulative frequencies
Select the correct graph type for a given variable — histogram for quantitative data, bar chart for qualitative data
Construct and interpret histograms, bar charts, stem-and-leaf plots, scatter plots, and time-series graphs
Identify the techniques used to make graphs misleading — truncated axes, inconsistent scales, distorted areas — and explain why they deceive
These skills matter well beyond this course. Every field — medicine, economics, journalism,
engineering, social science — presents data graphically. Being able to read and evaluate those
graphs is a form of literacy as important as reading text.
Section 2: Prerequisites
Building accurate visual representations requires a clear understanding of the data types you identified in DS-1.
From DS-1: Qualitative vs. Quantitative. Qualitative data are labels/categories; quantitative data are numerical measurements. (Histograms only apply to quantitative data.)
From DS-1: Discrete vs. Continuous. Discrete values are countable (1, 2, 3); continuous values can take any value in an interval (like 1.75 kg).
Frequency vs. Relative Frequency: Frequency is the raw count; relative frequency is the proportion (count / total). Both are used to scale the vertical axis of your charts.
Sorting and Classing: Grouping individual data points into “bins” (e.g., ages 10–19, 20–29) is the first step in creating any distribution plot.
Retrieval Checkpoint
A dataset contains the heights (in cm) of 50 students. You want to visualize the distribution of these heights using a chart. Which category does this data belong to?
Success Factor:
Visual Guardrail:
If you aren’t sure whether to use a Bar Chart or a Histogram, check the data type. Bar charts are for categories (qualitative); histograms are for numerical ranges (quantitative). Using the wrong one is a “Level 1” error in data visualization.
Retrieval Warm-up — from DS-1
A researcher surveys 500 people and records the following for each participant: (i) their preferred streaming platform (Netflix, Crave, Disney+, Other), and (ii) their monthly screen time in hours. Which of the following correctly classifies these two variables?
A study reports that the mean resting heart rate for a sample of 80 marathon runners is beats per minute, while the population mean for all adults is bpm. Which statement uses these symbols correctly?
Section 3: Core Concepts
C1 — Frequency Distributions
Before drawing any graph, we organize raw data into a frequency distribution — a
table that tallies how often each value (or range of values) appears in the dataset. It’s the
scaffolding behind almost every graph you’ll encounter.
Frequency Distribution
A frequency distribution is a table that summarizes data by recording:
Class (or value): the category or interval
Absolute frequency (f): the raw count of observations in that class
Relative frequency: the proportion of the total in that class, computed as f / n, where n is the total number of observations
Cumulative frequency (cf): the running total of absolute frequencies up to and including that class
Cumulative relative frequency: the running total of relative frequencies; equivalently, the proportion of observations at or below that class
Here’s what a frequency distribution looks like for a dataset of 20 exam scores:
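| Class (Score) | Absolute Freq. | Relative Freq. | Cumulative Freq. | Cumul. Rel. Freq. |
|---|---|---|---|---|
| 50–59 | 2 | 0.10 | 2 | 0.10 |
| 60–69 | 5 | 0.25 | 7 | 0.35 |
| 70–79 | 8 | 0.40 | 15 | 0.75 |
| 80–89 | 4 | 0.20 | 19 | 0.95 |
| 90–99 | 1 | 0.05 | 20 | 1.00 |
| Total | 20 | 1.00 | - | - |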
Read it like this: 8 students scored in the 70–79 range (absolute frequency).
That’s 40% of the class (relative frequency). And 75% of students scored 79 or below
(cumulative relative frequency at the 70–79 class).
Cumulative relative frequency ≠ relative frequency. The relative frequency
for the 70–79 class is 0.40 (40% scored in that range). The cumulative relative frequency
is 0.75 (75% scored 79 or below — including everyone in earlier classes too). These are
completely different quantities. The cumulative column always ends at 1.00 (100%).
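As a rough illustration, the four columns of the exam-score table can be derived from the absolute frequencies alone. The sketch below is one minimal way to do it in Python; the variable names are illustrative and not part of the lesson.

```python
from itertools import accumulate

# Class labels and absolute frequencies from the exam-score table.
classes = ["50-59", "60-69", "70-79", "80-89", "90-99"]
freqs = [2, 5, 8, 4, 1]

n = sum(freqs)                                 # total number of observations
rel_freqs = [f / n for f in freqs]             # relative frequency = f / n
cum_freqs = list(accumulate(freqs))            # running total of counts
cum_rel_freqs = [cf / n for cf in cum_freqs]   # running total as a proportion

for label, f, rf, cf, crf in zip(classes, freqs, rel_freqs,
                                 cum_freqs, cum_rel_freqs):
    print(f"{label:>6} {f:>3} {rf:>6.2f} {cf:>3} {crf:>6.2f}")
```

Dividing each cumulative count by n, rather than adding up rounded relative frequencies, keeps the last entry at exactly 1.00.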
Figure 1b: What “cumulative” means. Left: the five classes as separate bars — ordinary frequency. Right: the same five classes stacked into one column (colours match). Each segment adds its class’s count to the running total. The label at each boundary is cf. Reading off the right column: 15 of the 20 students scored 79 or below (cf = 15 at the top of the 70–79 segment), and all 20 scored 99 or below (cf = 20 at the very top).
C2 — Histograms (for Quantitative Data)
A histogram turns a frequency distribution into a picture. Each class becomes a bar; the bar’s
height represents frequency. But one feature distinguishes histograms from all other bar-based
graphs: the bars touch.
Histogram
A histogram is a graph of a frequency distribution for quantitative (numerical) data. Each bar represents one class interval:
Horizontal axis (x): the numerical values — class boundaries are marked at the edges of each bar
Vertical axis (y): frequency (absolute or relative)
Bars touch — because the x-axis is a continuous numerical scale; there are no gaps between classes
Each bar’s area is proportional to its frequency (when class widths are equal, height alone is sufficient)
Here is the histogram for the exam score data above. Notice how the bars share their edges — the
right edge of the 60–69 bar is the same as the left edge of the 70–79 bar.
Figure 1: Histogram of 20 exam scores (class width = 10). Bars touch — the x-axis is continuous.
Histogram shapes tell a story. Where does the data cluster? Are there a few unusually high or low values that extend one side? Practise describing what you see: “most values fall between X and Y, with a few as high as Z.” You will give these patterns formal names and measures in DS-5 (Position and Distribution Shape).
Histogram intervals are not discrete categories. The 70–79 bar represents
every score from 70 up to (but not including) 80. There’s no “70 category” and “79 category” —
the x-axis is a continuous number line divided into intervals. Reading the bars as isolated
discrete bins is a misinterpretation; the bars represent ranges on a continuum.
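To make the "70 up to but not including 80" idea concrete, here is a small Python sketch that sorts scores into half-open intervals. The raw scores are hypothetical: the lesson gives only the class counts, so these values were chosen simply to reproduce them.

```python
# Hypothetical raw scores, chosen only so the counts match the table.
scores = [52, 58, 61, 63, 65, 67, 69, 70, 71, 73, 74, 75,
          76, 78, 79, 82, 84, 86, 88, 93]

low, width, num_classes = 50, 10, 5   # classes start at 50, each 10 wide

counts = [0] * num_classes
for s in scores:
    # Half-open intervals [50, 60), [60, 70), ...: a score of 69.9 would
    # land in the second class and a score of 70.0 in the third.
    index = int((s - low) // width)
    counts[index] += 1

for i, c in enumerate(counts):
    left = low + i * width
    print(f"[{left}, {left + width}): {'#' * c} ({c})")
```

Every value on the number line from 50 up to 100 belongs to exactly one interval, which is why the bars share edges.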
C3 — Bar Charts (for Qualitative Data)
When the data is categorical rather than numerical, we use a bar chart. Bar charts look similar
to histograms at a glance, but there’s one crucial difference that isn’t merely cosmetic —
it reflects the nature of the data itself.
Bar Chart
A bar chart displays the frequency or relative frequency of each category in a qualitative variable. Key features:
Horizontal axis (x): category labels
Vertical axis (y): frequency or relative frequency
Bars do NOT touch — there is a gap between bars because categories are distinct (there is no value “between” Bakery and Dairy)
Bar order is arbitrary for nominal data; use the natural order for ordinal data
A bar chart sorted in descending order of frequency — most common category first — is called a Pareto chart, widely used in quality control to quickly identify the largest sources of defects
Figure 2: Bar chart of favourite sports (qualitative nominal data). Bars have gaps — the categories are distinct.
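A Pareto ordering is just a sort of the categories by frequency. The sketch below uses hypothetical survey responses (not the data behind Figure 2) and prints a text bar chart with the most common category first.

```python
from collections import Counter

# Hypothetical survey responses, for illustration only.
responses = ["Soccer", "Hockey", "Soccer", "Basketball", "Swimming",
             "Soccer", "Hockey", "Basketball", "Soccer", "Hockey"]

counts = Counter(responses)

# most_common() returns (category, count) pairs sorted by count, descending.
pareto_order = counts.most_common()

for category, count in pareto_order:
    print(f"{category:<12} {'#' * count} ({count})")
```

For ordinal data you would skip the sort and keep the categories in their natural order instead.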
The Key Difference — One Rule to Remember
Histogram bars touch. Bar chart bars don’t.
This isn’t an aesthetic choice. In a histogram, the x-axis is a continuous number line —
scores of 69.9 and 70.0 are adjacent, so the bars share an edge. In a bar chart, the x-axis
shows distinct categories — “Soccer” and “Basketball” are not adjacent on any meaningful
scale, so there’s a gap to reflect that separation.
Use a histogram for quantitative data. Use a bar chart for qualitative data. Using a histogram for categorical data is always wrong.
C4 — Other Graph Types
Histograms and bar charts cover most of what you’ll build in this course. But four other graph types appear regularly in statistics — each exists because a particular question about data cannot be answered with a bar. Here’s what each one is for.
Pie Chart
A pie chart divides a circle into sectors, where each sector’s angle is
proportional to its relative frequency (sector angle = relative frequency × 360°). Best used when:
Data is qualitative (usually nominal)
There are 5 or fewer categories (more segments make the chart unreadable)
Part-to-whole relationships are the main message
Limitation: Humans are poor at estimating angles. When categories are
similar in size, a bar chart is usually clearer. Many statisticians prefer bar charts to
pie charts in nearly every situation.
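The sector angles follow directly from the relative frequencies: each angle is the relative frequency times 360°. A minimal sketch, using hypothetical category counts rather than the data behind the figures:

```python
# Hypothetical category counts, for illustration only.
counts = {"Soccer": 12, "Hockey": 9, "Basketball": 6, "Swimming": 3}

n = sum(counts.values())

# Sector angle = relative frequency x 360 degrees.
angles = {category: (f / n) * 360 for category, f in counts.items()}

for category, angle in angles.items():
    print(f"{category:<12} {angle:6.1f} degrees")
```

The angles always add up to 360°, just as the relative frequencies add up to 1.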
Pie chart
Bar chart — same data
Figure 2b: Same data, two graph types. The pie chart encodes frequency as sector angle — difficult to compare when sectors are similar in size (try judging “Basketball vs. Swimming” precisely). The bar chart encodes frequency as bar height — immediately comparable. For most purposes, prefer the bar chart.
Stem-and-Leaf Plot
A stem-and-leaf plot displays quantitative data by splitting each value into
a stem (leading digit(s)) and a leaf (trailing digit). It looks like a
sideways histogram but preserves the individual data values:
Best for small datasets (n ≤ 50) where you want to see the distribution and keep
the raw data visible. Each row is a class; leaves within a row are individual observations.
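As an illustration, a stem-and-leaf plot for two-digit values can be produced with a few lines of Python. The data values below are hypothetical, and the tens-digit stem with a units-digit leaf is only one common convention.

```python
from collections import defaultdict

# Hypothetical two-digit data values, for illustration only.
values = [52, 58, 61, 63, 65, 67, 69, 70, 71, 73, 74, 75,
          76, 78, 79, 82, 84, 86, 88, 93]

stems = defaultdict(list)
for v in sorted(values):
    stem, leaf = divmod(v, 10)   # e.g. 73 -> stem 7, leaf 3
    stems[stem].append(leaf)

# Print every stem from smallest to largest, including empty ones,
# so that gaps in the data stay visible.
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```

For these values the row for stem 7 reads `7 | 0 1 3 4 5 6 8 9`: eight observations in the seventies, each still individually readable.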
Scatter Plot
A scatter plot displays pairs of quantitative measurements as points on a
two-dimensional graph. Each point represents one observation; its x-coordinate is one
variable and its y-coordinate is another.
Use a scatter plot when you want to explore the relationship between two
quantitative variables — does one tend to increase as the other does? Is there a linear
pattern? Are there outliers? You’ll use scatter plots extensively in
REG-1 (Correlation Analysis).
Figure 2c: Scatter plot of study hours versus exam score (n = 15). Each dot is one student. The upward trend — more hours, higher scores — is visible as a whole-cloud pattern, not a rule for any individual point. This is what “relationship between two quantitative variables” looks like. You will quantify this trend precisely in REG-1.
Time-Series Plot
A time-series plot (line graph) shows how a quantitative variable changes
over time. Time goes on the x-axis; the variable of interest on the y-axis. Points are
connected by lines to emphasize the change from one time period to the next.
Use a time-series plot when the x-variable specifically represents time (days, months,
years) and the trend over time is the story you want to tell.
Figure 2d: Time-series plot of monthly average temperature over one year. The connected line emphasizes the trend — rise through summer, fall through winter. Points are connected because the x-axis is time: consecutive months are adjacent, so tracking the change from one to the next is meaningful. Compare this to a bar chart of the same data: a bar chart would show the same heights but lose the sense of continuous progression.
C5 — Choosing the Right Graph
Every graph-choice decision starts with the same question: What type of variable is this?
Use the decision chain below.
Graph Selection Decision Chain
One variable?
Qualitative: bar chart (always); pie chart (only if ≤5 categories and part-to-whole is the focus)
Quantitative: histogram (distribution shape); stem-and-leaf (small n, want raw values)
Two variables?
Both quantitative: scatter plot; if one is time, use a time-series plot
One qualitative, one quantitative: side-by-side bar charts or grouped histograms (covered in later courses)
Using a histogram for qualitative data is always wrong. A histogram’s
x-axis is a continuous number line — it implies that values between bars are possible.
For qualitative data (e.g., “Bakery,” “Dairy”), no such “between” exists. Use a bar chart.
Similarly, using a bar chart for quantitative data with many distinct values (like a
weight distribution) will produce a cluttered mess where a histogram would be clean and clear.
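The decision chain can be written out as a small function. This is only a sketch of the chain as stated in this section; the function name, its parameters, and the returned labels are hypothetical.

```python
def choose_graph(var_types, n_categories=None, x_is_time=False):
    """Suggest graph types by following the decision chain in this section.

    var_types: tuple of "qualitative" / "quantitative", one per variable.
    This helper and its parameter names are hypothetical.
    """
    kinds = sorted(var_types)

    if kinds == ["qualitative"]:
        options = ["bar chart"]
        if n_categories is not None and n_categories <= 5:
            options.append("pie chart (if part-to-whole is the focus)")
        return options

    if kinds == ["quantitative"]:
        return ["histogram", "stem-and-leaf plot (small n)"]

    if kinds == ["quantitative", "quantitative"]:
        return ["time-series plot"] if x_is_time else ["scatter plot"]

    if kinds == ["qualitative", "quantitative"]:
        return ["side-by-side bar charts or grouped histograms"]

    raise ValueError("this sketch does not cover that combination")


print(choose_graph(("qualitative",), n_categories=5))
print(choose_graph(("quantitative", "quantitative"), x_is_time=True))
```

The function never returns a histogram for a qualitative variable, which mirrors the rule stated above.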
C6 — Graph Misrepresentation
Not all misleading graphs are accidents. Knowing the common techniques helps you spot them —
and avoid creating them yourself.
Truncated Y-Axis
A truncated y-axis starts above zero, making small differences look large.
The bars or line still reflect the true data values, but the visual impression is distorted.
When it’s a problem: bar charts and histograms should (almost) always start
at 0. With the y-axis starting at 0, the difference between a 97% bar and a 94% bar is barely
visible, which is an accurate picture of a small gap. But when the axis starts at 93%, that 3-point difference looks enormous.
When it’s acceptable: time-series plots often legitimately start above zero
(e.g., tracking temperature variations around 15°C — starting at 0 would waste most of the chart).
The key is whether the starting point is disclosed and whether the visual impression matches
the magnitude of the difference.
Here is the same fictional data — approval ratings over four quarters — displayed two ways.
The only difference between the two graphs is where the y-axis starts.
✓ Honest: y-axis starts at 0
The change looks small — as it should.
✗ Misleading: y-axis starts at 93
Looks like a collapse — same 3-point drop.
Figure 3: Same data, two y-axes. The left graph starts at 0%; the right graph starts at 93%.
The 3-point decline from Q1 to Q4 looks negligible on the left and catastrophic on the right.
This is the truncated y-axis trick.
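One way to quantify the distortion is to compare the heights of two bars as they are actually drawn. The sketch below uses the 97% and 94% values from the bar example earlier in this subsection; the helper name is hypothetical.

```python
def drawn_height(value, axis_start):
    """Height of a bar as drawn: distance from the axis start to the value."""
    return value - axis_start


high, low = 97, 94   # the two values being compared (in %)

for axis_start in (0, 93):
    ratio = drawn_height(high, axis_start) / drawn_height(low, axis_start)
    print(f"axis starts at {axis_start}: taller bar looks {ratio:.2f}x the shorter")
```

With the axis at 0 the taller bar is drawn about 1.03 times as high as the shorter one; with the axis at 93 it is drawn 4 times as high, although the data have not changed.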
Other Common Misrepresentation Techniques
Inconsistent class widths in histograms: If classes are different widths, the bar height alone is misleading — a wider class captures more observations. The correct display uses frequency density (frequency ÷ class width) on the y-axis, ensuring area rather than height encodes frequency.
3D effects and distorted areas: 3D pie charts tilt and expand the slice nearest the viewer, making it look larger than its true proportion. Pictograms (where a doubled icon also doubles in width, quadrupling area) distort relative sizes.
Selective date ranges in time-series: Choosing a start date that captures only the upward part of a trend creates the impression of consistent growth when the full history shows volatility.
Omitting the sample size n: “67% of customers prefer our brand!” — based on a survey of 3 customers — is technically correct but meaningless without context.
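The frequency-density correction for unequal class widths is easy to check numerically. In this sketch the classes and counts are hypothetical; the point is only that two classes with the same frequency get different bar heights when their widths differ.

```python
# Hypothetical classes of unequal width: (lower bound, upper bound, frequency).
classes = [(0, 10, 8), (10, 20, 12), (20, 50, 12)]

# Frequency density = frequency / class width.
densities = [freq / (upper - lower) for lower, upper, freq in classes]

for (lower, upper, freq), density in zip(classes, densities):
    area = density * (upper - lower)   # bar area recovers the frequency
    print(f"[{lower}, {upper}): f = {freq}, density = {density:.2f}, area = {area:.0f}")
```

The last two classes both contain 12 observations, but the wider one is drawn one third as tall, so its area rather than its height matches the count.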
Quick checklist for evaluating any graph you encounter:
Does the y-axis start at 0? If not, why not — and does the distortion matter?
Are all bars/slices drawn to a consistent scale?
Is the graph type appropriate for the variable type?
What is n? Is the sample large enough to support the claim?
Is the time range shown cherry-picked?
Section 4: Worked Examples
Let’s walk through four examples — each exercises a different core concept, and the scaffolding
gradually fades so you’re doing more of the thinking by Example 4.
Example 1 — Building a Frequency Distribution Table (C1)
Scenario: A quality-control inspector records the number of defective items
found in each of 20 production batches:
Organize this into a frequency distribution with 5 classes of width 2 (classes: 1–2, 3–4, 5–6, 7–8, 9–10).
Compute the absolute frequency, relative frequency, cumulative frequency, and cumulative relative frequency.
Step 1: Tally each class.
Go through the data values one by one and mark which class each belongs to:
Step 2: Compute relative frequency. Divide each count by n = 20.
Step 3: Compute cumulative frequencies. Running totals from top to bottom.
Result:
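| Class | Absolute Freq. | Relative Freq. | Cumulative Freq. | Cumul. Rel. Freq. |
|---|---|---|---|---|
| 1–2 | 2 | 0.10 | 2 | 0.10 |
| 3–4 | 6 | 0.30 | 8 | 0.40 |
| 5–6 | 7 | 0.35 | 15 | 0.75 |
| 7–8 | 4 | 0.20 | 19 | 0.95 |
| 9–10 | 1 | 0.05 | 20 | 1.00 |
| Total | 20 | 1.00 | - | - |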
Interpretation: 75% of batches had 6 or fewer defects (cumulative relative frequency 0.75 at the
5–6 class). Only 1 batch had 9–10 defects (5% of all batches).
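Sanity checks: the absolute frequency column must sum to n = 20.
The relative frequency column must sum to 1.00. The final row of the cumulative frequency column must equal n = 20.
The final row of the cumulative relative frequency column must equal 1.00. Run these checks every time.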
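The four sanity checks for Example 1 can also be run mechanically. A minimal Python sketch, using the absolute frequencies from the result table (the variable names are illustrative):

```python
from itertools import accumulate
from math import isclose

# Absolute frequencies from the Example 1 result table.
freqs = [2, 6, 7, 4, 1]
n = 20

rel_freqs = [f / n for f in freqs]
cum_freqs = list(accumulate(freqs))
cum_rel_freqs = [cf / n for cf in cum_freqs]

# The four checks: column totals and final cumulative rows.
assert sum(freqs) == n
assert isclose(sum(rel_freqs), 1.0)
assert cum_freqs[-1] == n
assert isclose(cum_rel_freqs[-1], 1.0)

print("all checks pass:", cum_freqs, cum_rel_freqs)
```

`isclose` is used for the proportions because sums of decimal fractions can differ from 1.0 by a tiny rounding error in floating-point arithmetic.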
Example 2 — Constructing a Histogram (C2)
Scenario: Using the frequency table from Example 1, sketch the histogram.
Before I show the steps, take a moment to predict: where do you expect the tallest bar to be? Will the bars be roughly equal on both sides of that peak, or will one side drop off more steeply?
Step 1: Draw and label the axes.
Horizontal axis: continuous scale from 0.5 to 10.5; mark class boundaries at 0.5, 2.5, 4.5, 6.5, 8.5, 10.5 (each boundary sits halfway between adjacent class limits, so neighbouring bars share an edge)
Vertical axis: frequency; mark from 0 to 8 (just above the maximum frequency of 7)
Step 2: Draw bars for each class, touching at the class boundaries.
Figure 4: Histogram of defective items per batch (n = 20). The distribution peaks at the 5–6 class; bar heights decrease on both sides of the peak.
Shape: The peak is at the 5–6 class (f = 7). From the peak, bars decrease on both sides — dropping to f = 4 in the 7–8 class and f = 1 in the 9–10 class on the right, and to f = 6 in the 3–4 class and f = 2 in the 1–2 class on the left. The distribution is approximately balanced around its centre. In DS-5 you will give shapes like this formal names and measures; for now, practise describing what you observe: which class is tallest, and how do the bars change on each side of the peak.
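The class boundaries in Step 1 come from the class limits: each boundary lies halfway between one class's upper limit and the next class's lower limit. The sketch below computes them and prints a rough text version of the histogram; it is an illustration, not the lesson's plotting method.

```python
# Class limits from Example 1 (each class holds whole-number defect counts).
limits = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
freqs = [2, 6, 7, 4, 1]

# Each boundary sits halfway between one class's upper limit and the next
# class's lower limit; the two outer boundaries extend by the same half unit.
gap = limits[1][0] - limits[0][1]            # 3 - 2 = 1
boundaries = [limits[0][0] - gap / 2]
boundaries += [upper + gap / 2 for _, upper in limits]

print(boundaries)

# A text sketch of the histogram: one row per class, bar length = frequency.
for (lower, upper), f in zip(limits, freqs):
    print(f"{lower:>2}-{upper:<2} | {'#' * f} {f}")
```

The computed boundaries are 0.5, 2.5, 4.5, 6.5, 8.5 and 10.5, matching the marks on the horizontal axis.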
Example 3 — Choosing the Right Graph (C3, C5)
For each scenario, which graph is most appropriate? Think through the decision chain
(variable type → graph choice) before revealing the answer.
Scenario A: A researcher records the preferred music genre of 200 university students (Pop, Rock, Hip-hop, Classical, Other).
Show answer for Scenario A
Answer: Bar chart. Music genre is a qualitative nominal variable —
categories with no natural order. A bar chart with one bar per genre and frequency
on the y-axis is correct. A pie chart would also be acceptable here (5 categories
fits the pie chart guideline), but a bar chart makes relative sizes easier to compare.
Not a histogram — genre is not a number; there is no meaningful “between Pop and Rock.”
Scenario B: A nurse records the resting heart rate (beats per minute) of 50 patients.
Show answer for Scenario B
Answer: Histogram (or stem-and-leaf). Heart rate is
quantitative continuous. A histogram groups the values into class intervals
(e.g., 60–69, 70–79 bpm) and shows the distribution shape. For a small dataset
(n = 50), a stem-and-leaf plot would also work and preserves individual values.
Not a bar chart — the x-axis is a continuous number line.
Scenario C: An economist tracks the unemployment rate each month for 5 years.
Show answer for Scenario C
Answer: Time-series plot (line graph). The x-variable is time
(months), and showing the trend over time is the point. Connecting the points with
a line emphasizes the change from month to month. A histogram would destroy the
temporal ordering of the data — it would show the distribution of rates but not
how rates evolved.
Example 4 — Spot the Misleading Graph (C6)
This example tests whether you can recognize the misrepresentation techniques from C6 in
a new context — not just describe them, but identify them when they appear.
Graph A: A bar chart shows monthly sales for two competing stores. Store A’s bar
is drawn as a dollar-sign icon 3 cm tall. Store B’s bar is a dollar-sign icon 6 cm tall —
twice as tall and twice as wide, representing $2 million vs. $1 million in sales.
Identify the misrepresentation in Graph A
Misrepresentation: Distorted area in a pictogram. Store B has twice
the sales of Store A, so the icon is drawn twice as tall — but it was also made twice
as wide! Area = height × width. A 2× taller and 2× wider icon has 4× the area
of Store A’s icon, so Store B looks four times as large as Store A when its sales are only twice as large.
Fix: Use bars of equal width or use a simple bar chart without pictogram icons.
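The area argument can be checked with a few lines of arithmetic. In this sketch the icon heights come from Graph A; the icon widths are an assumption made only to have concrete numbers, since any icon scaled in both directions gives the same ratio.

```python
# Icon heights from Graph A. The widths are assumed (a square icon) purely
# for illustration; what matters is that width scales along with height.
height_a, height_b = 3.0, 6.0        # cm
width_a, width_b = 3.0, 6.0          # cm (assumed)

sales_ratio = 2_000_000 / 1_000_000                         # what the data say
area_ratio = (height_b * width_b) / (height_a * width_a)    # what the eye sees

print(f"sales ratio: {sales_ratio:.0f}x, icon area ratio: {area_ratio:.0f}x")
```

Keeping the width fixed and scaling only the height would make the area ratio equal to the sales ratio.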
Graph B: A line graph shows a company’s stock price over 6 months. The y-axis
starts at $82 and ends at $90. The line climbs steeply from $83 to $89. The headline reads:
“Stock price surges — up 7.2% in six months!”
Identify the misrepresentation in Graph B
Misrepresentation: Truncated y-axis creating visual exaggeration.
Starting the y-axis at $82 rather than $0 makes a 7.2% rise look like a near-vertical
climb. On an axis from $0 to $90, the same data would appear as a very shallow
upward slope.
Note: For a time-series plot of a stock, starting at $0 is sometimes
impractical (the line would be nearly flat and squeezed into the top of the chart). The honest fix is
to clearly label the axis start and avoid claiming the visual magnitude
represents the magnitude of the change. The headline “surges” is the misleading part
— 7.2% over 6 months is a moderate increase, not a surge.
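Both the headline figure and the visual exaggeration in Graph B can be reproduced from the numbers given. This is a sketch; the helper name is hypothetical.

```python
start_price, end_price = 83.0, 89.0   # values from Graph B

pct_change = (end_price - start_price) / start_price * 100
print(f"percentage change: {pct_change:.1f}%")


def share_of_axis(axis_min, axis_max):
    """Fraction of the y-axis height covered by the rise from start to end."""
    return (end_price - start_price) / (axis_max - axis_min)


print(f"axis $82-$90: rise spans {share_of_axis(82, 90):.0%} of the height")
print(f"axis $0-$90:  rise spans {share_of_axis(0, 90):.0%} of the height")
```

The same $6 rise fills three quarters of the truncated axis but only about 7% of an axis that starts at $0.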
Section 5: Guided Practice
Time to try it yourself — with support. Each problem below gives you immediate feedback.
If you get something wrong, read the rationale to understand why before moving on.
Problem 1 — Completing a Frequency Distribution Table (C1)
A survey of 40 commuters recorded how many minutes their morning commute took. The frequency table is partially completed:
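| Class (minutes) | Absolute Freq. | Relative Freq. | Cumulative Freq. | Cumul. Rel. Freq. |
|---|---|---|---|---|
| 0–14 | 6 | 0.15 | 6 | 0.15 |
| 15–29 | 12 | 0.30 | 18 | 0.45 |
| 30–44 | 14 | ? | ? | ? |
| 45–59 | 6 | 0.15 | 38 | 0.95 |
| 60–74 | 2 | 0.05 | 40 | 1.00 |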
Question 1a: What is the relative frequency for the 30–44 minute class?
Question 1b: What is the cumulative frequency for the 30–44 minute class?
Question 1c: What percentage of commuters take 45 minutes or more?
Problem 2 — Selecting the Correct Graph Type (C3, C5)
Apply the decision chain: identify the variable type, then select the graph. Click “Try a similar problem” to practice with a different scenario.
A city planner surveys 300 residents about their primary mode of transportation to work (Car, Bus, Bike, Walk, Work from home).
What is the most appropriate graph for displaying these results?
A nurse records the body temperature (°C) of each of 80 patients at admission.
What is the most appropriate graph for displaying the distribution of temperatures?
An online review platform collects star ratings (1 star, 2 stars, 3 stars, 4 stars, 5 stars) from 500 diners for a restaurant.
What is the most appropriate graph for displaying the distribution of ratings?
A meteorologist records the total monthly rainfall (mm) in Montréal over 36 consecutive months.
What is the most appropriate graph for showing how rainfall changed over time?
A sociologist records the number of siblings each student in a class has (0, 1, 2, 3, 4, or 5+).
What is the most appropriate graph for displaying the distribution?
Problem 3 — Reading and Interpreting a Frequency Table (C1)
The table shows the distribution of time (in minutes) it takes 50 students to complete a quiz.
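| Class (min) | Absolute Freq. | Relative Freq. | Cumulative Freq. | Cumul. Rel. Freq. |
|---|---|---|---|---|
| 10–14 | 4 | 0.08 | 4 | 0.08 |
| 15–19 | 11 | 0.22 | 15 | 0.30 |
| 20–24 | 18 | 0.36 | 33 | 0.66 |
| 25–29 | 12 | 0.24 | 45 | 0.90 |
| 30–34 | 5 | 0.10 | 50 | 1.00 |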
How many students finished the quiz in the 20–24 minute class?
What proportion of students took 25 minutes or more to finish?
How many students finished the quiz in under 20 minutes?
What percentage of students finished between 15 and 24 minutes (inclusive)?
Which class has the highest frequency (the modal class)?
Problem 4 — Identify the Misleading Feature (C6)
A supermarket chain releases a bar chart comparing the annual revenue of its three store formats.
Here is a description of the chart:
Y-axis labeled “Revenue (millions)” — starts at $180M, ends at $220M
The Flagship bar is more than 2× as tall as the Express bar on the chart
What is the primary misleading feature of this chart?
Show full explanation
The actual difference between Flagship ($215M) and Express ($196M) is $19M, so Flagship’s
revenue is only about 9.7% higher than Express’s. On a y-axis running from $0 to $220M, the Flagship bar
would be 97.7% of the chart height, the Express bar 89.1%: visually almost identical.
By starting at $180M, the chart compresses the data range to $40M. Now Express ($196M)
appears at (196 − 180) / 40 = 40% of the axis height, while Flagship ($215M)
appears at (215 − 180) / 40 = 87.5%. This makes the Flagship bar look more than
twice as tall as Express, a gross visual distortion of the actual 9.7% difference.
Before moving to Independent Practice, check your confidence on these key skills:
Section 6: Independent Practice
No hints here — these problems are yours to work through. Use scratch paper as needed.
Show the solution when you’re ready to check your work.
Interleaving tip: These problems mix concepts intentionally. Don’t expect
every problem to test the same skill as the one before — that’s by design. Research shows
interleaved practice builds stronger long-term memory than blocked practice. If you get stuck, the concept tag in each problem header — e.g., (C2) — tells you exactly which subsection of Section 3 to revisit.
Problem 1 — Build a Complete Frequency Table (C1)
A new dataset is generated each time you click “Generate new problem.” Build the full frequency table — absolute, relative, cumulative, and cumulative relative — using 5 equal-width classes.
Problem 2 — Interpreting a Histogram (C2)
The histogram below shows the distribution of daily steps walked by 60 office workers over one month.
Figure 5: Histogram of daily step counts for 60 office workers.
Before reading the questions, study the histogram for ten seconds. Make two quick predictions: (1) Which class has the highest frequency? (2) Is most of the data concentrated toward lower step counts, higher step counts, or roughly in the middle? Hold your answers, then continue.
Answer the following questions without looking at the solution:
a. What is the class width?
b. Approximately how many workers walked between 6,000 and 8,000 steps per day?
c. What percentage of workers walked fewer than 6,000 steps per day?
d. Describe the shape of the distribution in your own words: where does it peak, and how do the bars change as step count increases or decreases from the peak?
Show Solution
(a) Class width: Each class spans 2,000 steps (2,000–3,999, 4,000–5,999, etc.). Class width = 2,000.
(b) Workers in 6,000–7,999 steps: Reading the bar — the frequency is approximately
22. (The bar extends to the 20 gridline and slightly above it.)
(c) Percentage walking fewer than 6,000 steps: Add the first two bars:
approximately 5 + 12 = 17 workers.
Relative frequency: 17 / 60 ≈ 0.283, or about 28.3%.
(d) Shape: The distribution peaks at the 6,000–7,999 class (the tallest bar, f ≈ 22). From the peak, bars decrease on both sides: the 8,000–9,999 class (f ≈ 15) and 10,000–11,999 class (f ≈ 6) step down to the right, while the 4,000–5,999 class (f ≈ 12) and 2,000–3,999 class (f ≈ 5) step down to the left. The right side (higher step counts) retains more workers than the left side (lower step counts), so the distribution sits somewhat higher in the step-count range — not quite balanced around the centre.
Problem 3 — Select the Best Graph (C5)
Each scenario below describes a dataset. Choose the most appropriate graph. Justify your choice by thinking about the variable type(s) involved.
A cardiologist measures both systolic blood pressure (mmHg) and age (years) for 120 patients. She wants to see whether blood pressure tends to increase with age.
Show Solution
Scatter plot. Both variables are quantitative continuous, and the
goal is to explore the relationship between them. In a scatter plot,
age would go on the x-axis (the explanatory variable) and blood pressure on
the y-axis (the response variable). Each patient is one point. The pattern of
points reveals whether there is an upward trend.
An HR department surveys 300 employees about their job satisfaction level:
Very Dissatisfied / Dissatisfied / Neutral / Satisfied / Very Satisfied. They want to
display how satisfaction is distributed across the company.
Show Solution
Bar chart (with categories in natural ordinal order). Job satisfaction
is a qualitative ordinal variable — ordered categories, not numbers. A bar chart
with one bar per satisfaction level (in order from Very Dissatisfied to Very
Satisfied) shows the distribution clearly. The bars should have gaps because
the categories are distinct.
A pie chart would technically work (5 categories), but it makes similar-sized
categories harder to compare. A bar chart is the stronger choice here.
A small café tracks its weekly profit ($) over 52 weeks. The owner
wants to identify which months were most profitable and whether there are seasonal patterns.
Show Solution
Time-series plot. The x-axis is time (weeks 1 through 52),
and the goal is to see how profit changes over time. Connecting the
weekly data points with a line emphasizes the trend and makes seasonal patterns
visible. A histogram would show how weekly profits are distributed (useful) but
would lose all temporal information about when profits were high or low.
An e-commerce company records the geographic region of each purchase
(North America, Europe, Asia-Pacific, Latin America, Other) over one quarter.
They want to show what share of total sales came from each region.
Show Solution
Bar chart or pie chart. Geographic region is a qualitative
nominal variable with 5 categories. Either works here — the question says
“share” (part-to-whole), which is where pie charts shine. With only 5
categories, a pie chart is readable. However, if the slices are similar in
size, a bar chart will make comparisons clearer. Both answers are acceptable;
the key is not using a histogram (the x-axis has no continuous numerical scale).
A food manufacturer quality-checks the weight of cereal boxes (g) from
a production line. They test 200 boxes and want to see how the weights are distributed
and whether any fall outside the acceptable range.
Show Solution
Histogram. Box weight is quantitative continuous. A histogram
groups weights into class intervals and shows the shape of the distribution —
the manufacturer can see whether weights cluster tightly around the target or
whether there is a systematic pattern of over- or under-fill. A
stem-and-leaf plot also suits quantitative data, but n = 200 is too large for it to stay
readable. Histogram is the standard choice.
Problem 4 — Two Variables, One Graph (C4, C5)
A public health researcher collects two measurements for each of 85 participants in a study:
Hours of sleep per night (continuous, 4.5 to 9.5 hours)
Score on a cognitive performance test (continuous, 0–100)
She asks: “Does more sleep correlate with better cognitive performance?”
Answer the following in writing (or mentally):
a. What graph should she use? Name it and explain why, using the variable types as your justification.
b. Which variable goes on which axis, and why?
c. What pattern in the graph would suggest that more sleep is associated with better performance?
Show Solution
(a) Graph: Scatter plot. Both variables are quantitative continuous.
When both variables are quantitative and the goal is to explore the relationship
between them, a scatter plot is the correct choice. Each participant is one point on
the plot; the x-coordinate is hours of sleep and the y-coordinate is the test score.
(b) Axes: Hours of sleep (the explanatory variable — the factor we
think might influence performance) goes on the x-axis. Cognitive performance score
(the response variable — the outcome we’re measuring) goes on the y-axis. Convention:
the variable you think explains variation in the other goes on x.
(c) Pattern suggesting positive association: If more sleep correlates
with better performance, the points should form an upward pattern from left to right —
participants with fewer hours of sleep (x small) tend to have lower scores (y small),
and those with more sleep (x large) tend to have higher scores (y large). This is
called a positive association and usually appears as points trending upward
from the lower-left to the upper-right of the scatter plot.
Problem 5 — Critique a Misleading Graph (C6)
A political party releases the following graph to support its campaign message “Crime has plummeted under our leadership.”
Figure 6: “Crime Rate Over Six Years” — published by a political party. Data: Year 1: 958, Year 2: 955, Year 3: 952, Year 4: 950, Year 5: 947, Year 6: 943 incidents per 100,000.
Answer the following:
a. Identify the misleading technique used in this graph.
b. Compute the true percentage decrease in crime rate from Year 1 to Year 6.
c. Describe how the graph should be redrawn to represent the data honestly.
Show Solution
(a) Misleading technique: Truncated y-axis. The y-axis starts at 940
instead of 0 (or at least a much lower value). This narrows the axis range so
that the decline from 958 to 943, which spans only 15 units, takes up nearly the
entire height of the chart. The line appears to drop almost vertically, suggesting a
massive collapse in crime.
(b) True percentage decrease:
$\frac{958 - 943}{958} \times 100\% = \frac{15}{958} \times 100\% \approx 1.57\%$
Crime fell by about 1.6% over six years, a modest decline rather than a plunge.
(c) Honest redraw: Start the y-axis at 0 (or clearly indicate a
broken axis if starting mid-range, with a zigzag break symbol). The line would then
appear nearly flat over the 6 years, accurately reflecting the modest change. Include
the actual data values next to the points and note the y-axis scale explicitly.
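The arithmetic behind this critique can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original figure: the function names are hypothetical, and the top of the axis (960) is an assumption, since the problem only states that the axis starts at 940.

```python
# Crime rates per 100,000 from Figure 6 (Year 1 through Year 6).
rates = [958, 955, 952, 950, 947, 943]


def percent_change(start, end):
    """Percentage change from start to end (negative means a decrease)."""
    return (end - start) / start * 100


def share_of_axis(start, end, axis_min, axis_max):
    """Fraction of the y-axis height covered by the change from start to end."""
    return abs(end - start) / (axis_max - axis_min)


decrease = -percent_change(rates[0], rates[-1])

# Assumption: the truncated axis runs from 940 to 960 and the honest one
# from 0 to 960. Only the 940 starting point is given in the problem.
truncated = share_of_axis(rates[0], rates[-1], 940, 960)
honest = share_of_axis(rates[0], rates[-1], 0, 960)

print(f"True decrease: {decrease:.2f}%")
print(f"Share of axis height, truncated axis: {truncated:.0%}")
print(f"Share of axis height, zero-based axis: {honest:.1%}")
```

Under these assumptions the same 15-unit drop fills three quarters of the truncated axis but under 2% of the zero-based one, which is why the redrawn line looks nearly flat.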
Mixed Review — Retrieval from Earlier Lessons
These problems draw on concepts from DS-1. Attempting them without re-reading that lesson is the point — retrieval practice strengthens long-term memory more than re-reading.
Review Problem 1 — Variable Classification and Graph Selection
A school nurse records the following for each of 120 students during a health check: blood type (A, B, AB, O), height (cm), and a self-reported pain level on a 5-point scale (None / Mild / Moderate / Severe / Very Severe).
A classmate says: “I’ll use a histogram for blood type because there are four categories — that’s enough to make a frequency distribution.” Explain what is wrong with this reasoning, identify the correct graph type for blood type, and justify your choice using the variable’s measurement level.
Show Solution
Blood type is qualitative nominal — the four types (A, B, AB, O) are category labels with no natural numeric ordering and no meaningful arithmetic. A histogram is designed for quantitative data, where the horizontal axis is a continuous number line and bar widths represent class intervals (e.g., 160–169 cm). Blood type values cannot be placed on a number line, so the concept of a “class interval” makes no sense here.
The correct graph is a bar chart: one bar per blood type, with frequency (or relative frequency) on the vertical axis, and gaps between bars to signal that the horizontal axis is not a continuous scale.
Pain level is a subtler case: it is qualitative ordinal (the categories have a natural order from None to Very Severe, but the gaps between levels are not guaranteed to be equal). A bar chart with bars in the natural order is again the correct choice — not a histogram, because the horizontal axis is still a set of labelled categories, not a numeric measurement scale.
Review Problem 2 — Sampling Method and Bias
A university wants to estimate the proportion of its 12,000 students who use the campus food bank. The student affairs office proposes the following approach: place a link to an online survey on the campus homepage for one week and count responses.
Identify the sampling method being used, name at least two sources of bias this method introduces, and explain which direction each bias is likely to push the estimated proportion (upward or downward). Then suggest a better sampling method and explain why it would reduce bias.
Show Solution
Sampling method: Voluntary response sampling (also called self-selection). Only students who notice the link and choose to respond are included — the sample is not randomly selected from the population.
Source of bias 1 — Voluntary response bias: Students who use the food bank regularly have a stronger personal stake in the question and are more likely to click and respond. This pushes the estimated proportion upward, overstating food-bank use.
Source of bias 2 — Undercoverage bias: Students without reliable internet access, those who rarely visit the homepage (e.g., off-campus or part-time students), and students who feel embarrassed disclosing food insecurity will be systematically underrepresented. The first two groups push the estimate in an uncertain direction; the third group — students who do use the food bank but don’t self-report — pushes the estimate downward.
Better method: Simple random sampling (SRS). Randomly select, say, 600 student ID numbers from the registrar’s complete list and send each selected student a private, confidential survey invitation. SRS gives every student an equal chance of selection, which removes self-selection; because the registrar’s list covers all students, it also removes undercoverage. Some selected students may still not respond, so follow-up reminders help. The confidential framing reduces social-desirability pressure as well.
Section 7: Mastery Check
No hints. No scaffolds. These questions test whether you can recall and apply what you’ve
learned without support — the clearest signal that you’ve actually internalized it.
Attempt each question fully before revealing the answer. Peeking early
short-circuits the retrieval practice that makes this section effective. Even an imperfect
attempt trains your memory more than reading the solution directly.
Question 1 — The Feynman Test (C3)
Imagine a classmate missed the lesson on histograms and bar charts. They’ve been told
that “a histogram is just a bar chart without gaps” and don’t understand why that matters.
Explain — in your own words, as if to that classmate — why you cannot use a histogram
for categorical data, and what the touching bars actually represent. Don’t just state the rule;
explain the reasoning behind it.
Show model answer
A histogram’s x-axis is a continuous number line. The bars touch because the
classes are adjacent intervals on that line — there is no gap between “40–49” and
“50–59” because numbers don’t suddenly stop at 49 and jump to 50. Every value on the
number line belongs to exactly one bar, and the bars cover the line with no holes.
For categorical data (like colour, gender, or city), there is no number line. “Red,”
“Blue,” and “Green” are not adjacent on any scale — there is nothing “between” red and
blue. Drawing the bars touching would imply that some value exists between “Red” and
“Blue,” which is nonsense. The gap in a bar chart signals: these are distinct categories
with no in-between.
So the touching vs. gap distinction isn’t cosmetic — it tells the viewer whether the
x-axis is a continuous scale (histogram) or a set of unrelated categories (bar chart).
Question 2 — Apply It (C5)
A sports analytics team collects data on professional soccer players. For each player, they record:
Position (Goalkeeper, Defender, Midfielder, Forward)
Distance run per game (km, continuous)
Goals scored per season (count, discrete)
Part A: Which graph would best display the distribution of distance run per game across all players?
Part B: Which graph would best show the relationship between distance run and goals scored across all players?
Part C: Which graph would best display the number of players in each position?
Question 3 — Find the Error (C6, C2)
A student creates a histogram to display the following frequency table for the
height (cm) of 30 plants:
| Height class (cm) | Frequency (f) |
|---|---|
| 10–19 | 4 |
| 20–29 | 9 |
| 30–49 | 11 |
| 50–59 | 6 |
The student draws four bars — all the same width — with heights proportional to the frequencies
4, 9, 11, and 6. They claim the third bar (30–49) is the most frequent class because it is
the tallest.
Identify and explain the error.
Show full explanation
The 30–49 class has a width of 20 cm, while the 10–19, 20–29, and 50–59 classes each
have a width of 10 cm. The 30–49 class is twice as wide as the others.
In a histogram, a bar’s area — not its height — represents frequency.
When class widths are equal, height and area are proportional, so height works fine.
But when class widths differ, you must plot frequency density
(= frequency ÷ class width) on the y-axis, not raw frequency:
For the 30–49 class: frequency density = 11 ÷ 20 = 0.55 per cm.
For the 20–29 class: frequency density = 9 ÷ 10 = 0.90 per cm.
On a frequency density histogram, the 20–29 bar would be taller than the 30–49 bar,
correctly reflecting that plants are more densely concentrated in the 20–29 cm range.
The student’s equal-width bars with raw frequencies made the wider class look
disproportionately dominant.
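As a rough check on the frequency density calculation, the sketch below computes it for all four plant-height classes. The helper names are hypothetical. Class width is counted as upper limit minus lower limit plus 1, matching the convention used here (10–19 has width 10, 30–49 has width 20).

```python
# Frequency table from Question 3: (lower limit, upper limit, frequency).
classes = [(10, 19, 4), (20, 29, 9), (30, 49, 11), (50, 59, 6)]


def class_width(lower, upper):
    """Width of a class with integer limits, e.g. 10-19 covers 10 values."""
    return upper - lower + 1


def frequency_density(lower, upper, freq):
    """Frequency per unit of class width (here: plants per cm)."""
    return freq / class_width(lower, upper)


densities = {
    f"{lo}-{hi}": frequency_density(lo, hi, f) for lo, hi, f in classes
}

for label, density in densities.items():
    print(f"{label} cm: {density:.2f} per cm")

# The class with the tallest bar on a frequency density histogram.
tallest = max(densities, key=densities.get)
print("Tallest bar:", tallest)
```

On this scale the 20–29 class has the tallest bar, and even the 50–59 class (0.60 per cm) edges ahead of 30–49 (0.55 per cm).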
Self-Assessment
How confident are you with the material in this lesson?
Scale: from “Not confident, I need to review” to “Very confident, I’ve got this.”
Section 8: Boss Fight
You’ve reached the Boss Fight — a substantial challenge that asks you to bring everything together.
Two paths, equal in difficulty, different in approach. Choose the one that fits how you like to think.
🔬 The Analyst
You have a real dataset. Build a complete graphical summary from scratch — frequency table, histogram, and interpretation
Best for: students who like working with numbers and computing things step by step
🏗️ The Architect
A company’s quarterly report contains three graphs with design flaws. Identify the flaws, explain why they mislead, and propose corrections.
Best for: students who like critical thinking, design, and finding what’s wrong with someone else’s work
🔬 Path A: The Analyst
A school board collected data on the number of books read by each of 25 students over a
summer reading program. The raw data is:
3, 5, 4, 3, 5, 4, 8, 7, 6, 8, 7, 6, 8, 7, 6, 9, 11, 10, 9, 11, 10, 9, 12, 14, 13
Use 4 equal-width classes starting at 2 (classes: 2–4, 5–7, 8–10, 11–14, noting the
last class is slightly wider). Wait — there’s a problem here. What issue do you notice
about the class widths I proposed? Fix it before building the table.
Hint: What’s wrong with the proposed classes?
The classes 2–4, 5–7, and 8–10 each span 3 values, but 11–14 spans 4 values, so the
class widths are unequal. Keeping a width of 3 and starting at 2 gives 2–4, 5–7,
8–10, 11–13, which leaves out the maximum value, 14. The cleanest solution is to
start at the minimum instead: with min = 3 and max = 14, four classes of width 3
cover the data exactly: 3–5, 6–8, 9–11, 12–14.
Build the complete frequency table (f, f_r, cf, cf_r) using classes 3–5, 6–8, 9–11, 12–14.
Show Solution — Part 1
Step 1: Tally the data into classes.
3–5: values 3, 5, 4, 3, 5, 4 → f = 6
6–8: values 8, 7, 6, 8, 7, 6, 8, 7, 6 → f = 9
9–11: values 9, 11, 10, 9, 11, 10, 9 → f = 7
12–14: values 12, 14, 13 → f = 3
Check: 6 + 9 + 7 + 3 = 25 ✓
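| Class | f | f_r | cf | cf_r |
|---|---|---|---|---|
| 3–5 | 6 | 0.24 | 6 | 0.24 |
| 6–8 | 9 | 0.36 | 15 | 0.60 |
| 9–11 | 7 | 0.28 | 22 | 0.88 |
| 12–14 | 3 | 0.12 | 25 | 1.00 |
| Total | 25 | 1.00 | - | - |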
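The tally-and-accumulate procedure can also be checked programmatically. The sketch below is one possible way to do it, not the lesson's prescribed method: the function name `frequency_table` is hypothetical, and the values are the 25 entries listed in the Step 1 tally.

```python
# The 25 values as tallied in Step 1, grouped here only for readability.
books = [
    3, 5, 4, 3, 5, 4,
    8, 7, 6, 8, 7, 6, 8, 7, 6,
    9, 11, 10, 9, 11, 10, 9,
    12, 14, 13,
]
classes = [(3, 5), (6, 8), (9, 11), (12, 14)]


def frequency_table(values, class_limits):
    """Return one row per class with f, f_r, cf, and cf_r."""
    n = len(values)
    rows = []
    cumulative = 0
    for lower, upper in class_limits:
        f = sum(lower <= v <= upper for v in values)  # absolute frequency
        cumulative += f                               # running total
        rows.append({
            "class": f"{lower}-{upper}",
            "f": f,
            "f_r": f / n,
            "cf": cumulative,
            "cf_r": cumulative / n,
        })
    return rows


table = frequency_table(books, classes)

print(f"{'Class':<7}{'f':>4}{'f_r':>7}{'cf':>5}{'cf_r':>7}")
for row in table:
    print(f"{row['class']:<7}{row['f']:>4}{row['f_r']:>7.2f}"
          f"{row['cf']:>5}{row['cf_r']:>7.2f}")
```

The last row's cf should equal n and its cf_r should equal 1.00. That is a quick sanity check worth running on any frequency table.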
Part 2 — Describe the Histogram
Based on your frequency table, describe what the histogram would look like without
drawing it. Address:
a. Which class is the modal class (most frequent)?
b. Describe the overall shape: where does the distribution peak, and how do the bars change on each side of the peak?
c. What percentage of students read fewer than 9 books?
Show Solution — Part 2
(a) Modal class: 6–8 books, with f = 9 (the highest frequency).
(b) Shape: The distribution peaks at the 6–8 class (f = 9, the highest frequency). From the peak, bars decrease in both directions — dropping to f = 7 in the 9–11 class and f = 3 in the 12–14 class on the right, and to f = 6 in the 3–5 class on the left. The right side drops off more gradually than the left, with a thin tail extending to 12–14 books. Most students cluster in the lower-to-middle range, with fewer reading large numbers of books.
(c) Students reading fewer than 9 books: “Fewer than 9” means
the 3–5 and 6–8 classes. The cumulative frequency
at the 6–8 class is $cf = 15$, so $cf_r = 15/25 = 0.60$: 60% of students read fewer than 9 books.
Part 3 — Interpretation
The school board’s goal was to encourage reading. Based on your analysis, write 2–3
sentences summarizing what the data tells the board — what does the distribution suggest
about how students engaged with the program?
Show model interpretation
The modal number of books read was in the 6–8 range, and 60% of students read 8
or fewer books over the summer. Most students engaged at a moderate level, while
a smaller group of highly motivated readers reached 12–14 books. The board might
consider whether the program successfully motivated the middle group (6–8 books)
to go further, or whether a different incentive structure could shift the peak
toward higher counts.
Reflection: What was the most challenging part of this analysis?
Was it the frequency table arithmetic, the shape description, or the interpretation?
What would you do differently on the next dataset?
🏗️ Path B: The Architect
You’ve been hired as a data visualization consultant. A mid-sized retail company has just
released its annual report, and their marketing team created three graphs to highlight key
metrics. Unfortunately, each graph has a design flaw. Your job: identify each flaw, explain
why it misleads, and propose a corrected version.
Graph 1 — Holiday Season Revenue
A bar chart shows quarterly revenue for the past year. The bars are:
Q1: $41.2M — bar is 5.5 cm tall
Q2: $39.8M — bar is 4.2 cm tall
Q3: $40.5M — bar is 4.8 cm tall
Q4: $58.7M — bar is 24.0 cm tall (using a 3D dollar-sign icon, twice as wide as the others)
The headline reads: “Q4 Holiday Sales — Our Best Quarter by Far!”
Identify the flaw in Graph 1
Two compounding flaws:
1. Truncated y-axis. The bars’ heights don’t start at 0;
they’re scaled only across the range ~$39M to ~$59M. This exaggerates the
Q4 spike relative to Q1–Q3.
2. Distorted pictogram area. The Q4 icon is about 4.4× as tall as the Q1 bar
(24.0 cm vs. 5.5 cm) AND twice as wide, so its area is nearly 9× as large, while the
actual increase over Q1 is only 42% (58.7 / 41.2 ≈ 1.42). The visual impression of
“dominance” is grossly inflated.
Fix: Use a simple bar chart (equal-width bars) with the y-axis
starting at $0. The Q4 bar would still be visibly taller, since it genuinely is the
best quarter, but by a proportionate amount rather than a many-fold visual exaggeration.
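To put numbers on the distortion, the sketch below compares the Q4-to-Q1 ratio in the data with the same ratio in the drawing. It takes the stated bar dimensions at face value and treats the dollar-sign icon as a plain rectangle, which is a simplifying assumption. The dictionary layout and names are hypothetical.

```python
# Graph 1 as described: revenue in $M, bar height in cm, relative bar width.
bars = {
    "Q1": {"revenue": 41.2, "height_cm": 5.5, "width": 1.0},
    "Q4": {"revenue": 58.7, "height_cm": 24.0, "width": 2.0},  # twice as wide
}


def q4_to_q1(key):
    """Ratio of the Q4 value to the Q1 value for the given attribute."""
    return bars["Q4"][key] / bars["Q1"][key]


revenue_ratio = q4_to_q1("revenue")                   # what the data says
height_ratio = q4_to_q1("height_cm")                  # what the bar heights show
area_ratio = height_ratio * q4_to_q1("width")         # what the eye compares

print(f"Revenue ratio: {revenue_ratio:.2f}x")
print(f"Height ratio:  {height_ratio:.2f}x")
print(f"Area ratio:    {area_ratio:.2f}x")
```

The height ratio alone is already several times the revenue ratio, and the doubled width pushes the area ratio higher still.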
Graph 2 — Customer Satisfaction Distribution
A histogram with 5 classes displays customer satisfaction survey scores (0–100):
Class 0–20: f = 8, bar width = 1 cm
Class 21–40: f = 12, bar width = 1 cm
Class 41–80: f = 30, bar width = 2 cm
Class 81–90: f = 25, bar width = 0.5 cm
Class 91–100: f = 15, bar width = 0.5 cm
The team claims “most customers are highly satisfied” pointing to the 41–80 bar being the tallest.
Identify the flaw in Graph 2
Flaw: Unequal class widths with height (not frequency density) on the y-axis.
The 41–80 class spans 40 points, while 81–90 spans only 10 points. The 41–80 bar has
f = 30, but the 81–90 bar has f = 25 in a much smaller range. If you compute
frequency density:
41–80: 30 ÷ 40 = 0.75 per point
81–90: 25 ÷ 10 = 2.5 per point
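The 81–90 class is far more densely packed with customers. The tallest bar does not
support the team’s claim: it covers the mid-range scores 41–80 and is tallest only
because that class is four times as wide as 81–90. High scores are in fact the most
densely concentrated, but the original histogram hides this because it doesn’t account
for class width.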
Fix: Redesign with equal class widths (e.g., 0–19, 20–39, 40–59,
60–79, 80–100) or use frequency density on the y-axis. The story changes from
“middle range is most common” to “high satisfaction is most densely concentrated.”
Graph 3 — Market Share Pie Chart
A 3D tilted pie chart shows market share across 7 product categories. The nearest slice
(Electronics, 18%) appears to take up roughly 30% of the visual area due to the 3D tilt.
The far slices (Furniture 17%, Clothing 16%) appear tiny. The chart has no data labels —
only a legend.
Identify the flaw in Graph 3
Two flaws:
1. 3D tilt distorts area. The perspective projection inflates the
slices closest to the viewer and compresses those in the back. This makes the
Electronics slice look dominant when it is only 1% larger than Furniture and 2%
larger than Clothing. A 3D pie chart almost always misrepresents proportions.
2. Too many slices without labels. Seven slices sharing a legend
(not label-per-slice) forces viewers to cross-reference colours, making accurate
reading difficult. With 7 categories, a bar chart would be clearer and more honest.
Fix: Replace with a flat 2D bar chart ordered by market share
(highest to lowest), with percentage labels on each bar. Eliminates both the 3D
distortion and the legend confusion.
Reflection: Of the three graphs, which flaw do you think is the most
common in real-world published reports? Which is the easiest to accidentally create
without intending to mislead? How would you communicate these issues to the marketing
team without sounding accusatory?
Section 9: Challenge Problems
Optional stretch material. These problems go beyond the lesson objectives.
They’re here if you’re curious, ambitious, or just enjoy a harder challenge. None of the
material below is required for DS-3 — but C1 (the ogive) will reappear in DS-5.
Challenge 1 — The Ogive (C1 + extension)
An ogive (pronounced “OH-jive”) is a graph of the cumulative frequency
or cumulative relative frequency. Instead of bars, it connects points with a smooth
curve — the x-value of each point is the upper class boundary and the y-value is the
cumulative frequency up to that boundary.
Use the frequency table from Example 1 (Section 4) to build a cumulative relative frequency ogive:
| Class | Upper boundary | Cumulative relative frequency (cf_r) |
|---|---|---|
| 1–2 | 2.5 | 0.10 |
| 3–4 | 4.5 | 0.40 |
| 5–6 | 6.5 | 0.75 |
| 7–8 | 8.5 | 0.95 |
| 9–10 | 10.5 | 1.00 |
The ogive starts at (0.5, 0.00) — the lower boundary of the first class with cumulative
frequency 0 — and ends at (10.5, 1.00).
Question: Using the ogive, estimate the value below which approximately 50%
of the observations fall (the median). Draw a horizontal line at $cf_r = 0.50$,
find where it intersects the ogive, and read off the x-value.
Show Solution
Between the 3–4 class ($cf_r = 0.40$) and the 5–6 class ($cf_r = 0.75$),
the cumulative relative frequency passes through 0.50. Linear interpolation between
the two points (4.5, 0.40) and (6.5, 0.75):
$x = 4.5 + \frac{0.50 - 0.40}{0.75 - 0.40} \times (6.5 - 4.5) = 4.5 + \frac{0.10}{0.35} \times 2 \approx 5.07$
So approximately 50% of batches had 5 or fewer defects; the estimated median is
about 5.07 defects. You’ll formalize percentile estimation like this in DS-5.
Preview: The ogive is the graphical tool for reading percentiles
directly. The 25th percentile (Q1) corresponds to $cf_r = 0.25$, the 75th (Q3)
to $cf_r = 0.75$, and the 50th (Q2 = median) to $cf_r = 0.50$. You’ll use
this in DS-5 when studying position in a distribution.
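Reading a value off the ogive amounts to linear interpolation between neighbouring points, which can be sketched as follows. The function name `value_at` is hypothetical, and the points are the ogive coordinates listed above, including the starting point (0.5, 0.00). The sketch assumes the ogive is drawn with straight segments between points.

```python
# Ogive points: (upper class boundary, cumulative relative frequency),
# starting from the lower boundary of the first class.
ogive = [
    (0.5, 0.00), (2.5, 0.10), (4.5, 0.40),
    (6.5, 0.75), (8.5, 0.95), (10.5, 1.00),
]


def value_at(target, points):
    """x-value where the ogive reaches `target`, by linear interpolation."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= target <= y1:
            return x0 + (target - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("target must lie between 0 and 1")


median = value_at(0.50, ogive)
q1 = value_at(0.25, ogive)
q3 = value_at(0.75, ogive)

print(f"Q1 = {q1:.2f}, median = {median:.2f}, Q3 = {q3:.2f}")
```

The median comes out at about 5.07, matching the hand calculation. Q3 lands exactly on the boundary 6.5 because 0.75 is itself one of the tabulated points.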
Challenge 2 — Does Bin Width Matter? (C2)
Here are two histograms of the same dataset: ages at first employment for 40 recent
graduates, ranging from 18 to 35 years. The only difference is the number of classes used.
Histogram A: 3 classes (width = 6)
Shape appears: roughly symmetric peak at 24–29
Histogram B: 6 classes (width = 3)
Shape appears: peak at 24–26, longer tail toward higher ages
Answer the following:
a. Both histograms display the same 40 data values. Why do they appear to tell different stories about the shape of the distribution?
b. Which histogram would you trust more, and why?
c. What would happen to the histogram if you used 18 classes (width = 1 year each)?
Show Solution
(a) The bin width controls how much detail is visible. With 3 wide
classes, the 24–29 class lumps together two patterns that Histogram B separates: a high
cluster around 24–26 and a secondary peak around 27–29. Wide bins smooth out the
distribution and can make an uneven concentration look balanced. Narrow bins reveal
the true pattern of where values cluster.
(b) Histogram B (6 classes) is generally more trustworthy here, as it
reveals more detail about where values concentrate. However, there’s a tradeoff: too
many bins and each bar represents only 1–2 observations, making random variation
(noise) look like a pattern. The choice of bin width requires judgment about the sample
size and the question being asked. A common rule of thumb (Sturges’ rule) is
$k = 1 + \log_2 n$; for n = 40 this gives $k = 1 + \log_2 40 \approx 6.3$,
suggesting that 6–7 bins are appropriate.
(c) With 18 classes (each 1 year wide), n = 40 gives an average of
about 2 observations per bar. The histogram would be jagged and noisy — some bars
would be empty, and the overall pattern would be obscured by sampling variability. Too
few bins hide detail; too many bins create false patterns. The right bin count
balances signal and noise.
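The trade-off can be made concrete with a short calculation. This is a sketch only: `sturges_bins` is a hypothetical helper implementing the common form of Sturges' rule, k = 1 + log2(n), and the average count per bar is simply n divided by the number of bins.

```python
import math


def sturges_bins(n):
    """Sturges' rule of thumb: suggested number of classes, k = 1 + log2(n)."""
    return 1 + math.log2(n)


n = 40  # graduates in the dataset
k = sturges_bins(n)
print(f"Sturges' rule for n = {n}: k = {k:.2f}")

# Average observations per bar for the three bin counts discussed above.
per_bar = {bins: n / bins for bins in (3, 6, 18)}
for bins, avg in per_bar.items():
    print(f"{bins:>2} bins: about {avg:.1f} observations per bar")
```

With 18 bins the average falls to roughly 2 observations per bar, which is why that histogram looks jagged. With 3 bins each bar averages more than 13, which smooths away the detail.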
Challenge 3 — The Double Y-Axis Debate (C6)
A common graph type in business and journalism is the dual y-axis plot (also
called a secondary axis chart). It overlays two different variables on the same graph, with one
y-axis on the left and a different y-axis on the right.
Example: A financial news article shows monthly ice cream sales (in thousands
of units, left y-axis) and monthly drowning rates per 100,000 (right y-axis) on the same graph.
The two lines move almost perfectly together. The article implies this suggests a causal link.
a. Why is this graph potentially misleading, independent of the correlation–causation issue?
b. Under what conditions is a dual y-axis graph a legitimate and useful tool?
c. What is the underlying statistical error in claiming ice cream sales cause drowning rates?
Show Solution
(a) Why the graph is misleading by design: The scales on the two
y-axes are chosen independently by the designer. By rescaling either axis, you can
make the two lines align perfectly (suggesting correlation) or diverge completely
(suggesting no relationship) — with the same data. The visual impression of
correlation is entirely a function of the axis scaling choices, not the data. This
makes dual y-axis charts inherently subjective and easy to manipulate.
(b) When dual y-axes are legitimate: They can be useful when two
variables are measured in genuinely different units and both are relevant to the same
story (e.g., overlaying temperature in °C and precipitation in mm on a climate plot,
where both axes are clearly labelled and the reader understands they cannot be
compared directly). The key conditions: clearly labelled axes, no implication of
a direct comparison between the two y-scales, and no manipulation of scale to
create a false impression of alignment.
(c) Correlation ≠ causation — the confounding variable: Both ice
cream sales and drowning rates are driven by a third variable: summer heat.
Hot weather increases both ice cream consumption and the number of people swimming
(which creates more opportunities to drown). This is a classic confounding
variable (sometimes called a lurking variable). When a third variable
causes both variables to change together, a strong correlation can appear even if
the two variables have no direct causal link. You’ll study this formally in REG-1
(Correlation Analysis).
Section 10: Solutions Reference
Complete, step-by-step solutions for all problems in Sections 5–9 are available on the solutions page. Solutions include worked arithmetic, common mistakes to watch for, and interpretation guidance.
If you’re stuck: Re-read the relevant Core Concept in Section 3, then find the Worked Example that maps to that concept (e.g., Example 1 maps to Concept 1). The solutions page shows the reasoning behind every step, not just the final answer.
Quick-Reference Formulas
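Class Width (for frequency distributions): $\text{class width} = \frac{\text{maximum value} - \text{minimum value}}{\text{number of classes}}$ (Always round UP to a convenient number)
Relative Frequency: $f_r = \frac{f}{n}$, where $f$ is the class frequency and $n$ is the total number of observations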
| Graph Type | Best Used For | Key Feature |
|---|---|---|
| Histogram | Quantitative, continuous data | Bars touch (shows continuity) |
| Bar Chart | Qualitative, categorical data | Bars do not touch |
| Pareto Chart | Categorical, finding largest factor | Bars sorted descending by frequency |
| Pie Chart | Parts of a whole (relative freq) | Angles proportional to frequency |
| Common Misleading Features | Why it’s a problem |
|---|---|
| Y-axis not starting at 0 | Exaggerates small differences (mostly an issue for bar charts) |
| 3D effects / perspective | Distorts areas and makes values hard to read |
| Pictograms without proportional area | Changing 1D height usually changes 2D area, overstating differences |