DS-2: Data Visualization

Module 1 · Descriptive Statistics

Section 1: Introduction

In April 2020, the Georgia Department of Public Health posted a bar chart showing COVID-19 cases by county over 15 days. The dates weren’t in chronological order on the x-axis: they were sorted so the bars would appear to go down, giving the visual impression that cases were falling when they weren’t. Public outcry forced the department to repost the chart with the dates in correct order. The bars now visibly climbed.

Same data. Two very different stories.

This is what visualization literacy is really about: not just making a graph, but understanding what a graph communicates — and what it can hide. Every design choice (what goes on each axis, where the axis starts, how wide the bars are, whether to use a pie chart or a histogram) changes the story the viewer perceives. Make the wrong choice by accident and you mislead your audience. Make it deliberately and you deceive them.

By the end of this lesson you’ll have the tools to both create honest, informative graphs and read suspicious ones critically.

After this lesson, you will be able to:

  • Build a frequency distribution table showing absolute, relative, and cumulative frequencies
  • Select the correct graph type for a given variable — histogram for quantitative data, bar chart for qualitative data
  • Construct and interpret histograms, bar charts, stem-and-leaf plots, scatter plots, and time-series graphs
  • Identify the techniques used to make graphs misleading — truncated axes, inconsistent scales, distorted areas — and explain why they deceive

These skills matter well beyond this course. Every field — medicine, economics, journalism, engineering, social science — presents data graphically. Being able to read and evaluate those graphs is a form of literacy as important as reading text.

Section 2: Prerequisites

Building accurate visual representations requires a clear understanding of the data types you identified in DS-1.

  • From DS-1: Qualitative vs. Quantitative. Qualitative data are labels/categories; quantitative data are numerical measurements. (Histograms only apply to quantitative data.)
  • From DS-1: Discrete vs. Continuous. Discrete values are countable (1, 2, 3); continuous values can take any value in an interval (like 1.75 kg).
  • Frequency vs. Relative Frequency: Frequency is the raw count; relative frequency is the proportion (count / total). Both are used to scale the vertical axis of your charts.
  • Sorting and Classing: Grouping individual data points into “bins” (e.g., ages 10–19, 20–29) is the first step in creating any distribution plot.

Retrieval Checkpoint

A dataset contains the heights (in cm) of 50 students. You want to visualize the distribution of these heights using a chart. Which category does this data belong to?

Success Factor:

Visual Guardrail:

  • If you aren’t sure whether to use a Bar Chart or a Histogram, check the data type. Bar charts are for categories (qualitative); histograms are for numerical ranges (quantitative). Using the wrong one is a “Level 1” error in data visualization.

Retrieval Warm-up — from DS-1

A researcher surveys 500 people and records the following for each participant: (i) their preferred streaming platform (Netflix, Crave, Disney+, Other), and (ii) their monthly screen time in hours. Which of the following correctly classifies these two variables?

A study reports that the mean resting heart rate for a sample of 80 marathon runners is beats per minute, while the population mean for all adults is bpm. Which statement uses these symbols correctly?

Section 3: Core Concepts

C1 — Frequency Distributions

Before drawing any graph, we organize raw data into a frequency distribution — a table that tallies how often each value (or range of values) appears in the dataset. It’s the scaffolding behind almost every graph you’ll encounter.

Frequency Distribution

A frequency distribution is a table that summarizes data by recording:

  • Class (or value): the category or interval
  • Absolute frequency (f): the raw count of observations in that class
  • Relative frequency: the proportion of the total in that class, computed as f / n, where n is the total number of observations
  • Cumulative frequency (cf): the running total of absolute frequencies up to and including that class
  • Cumulative relative frequency: the running total of relative frequencies, computed as cf / n; equivalently, the proportion of observations at or below that class

Here’s what a frequency distribution looks like for a dataset of 20 exam scores:

Class (Score) | Absolute Freq. | Relative Freq. | Cumulative Freq. | Cumul. Rel. Freq.
50–59         | 2              | 0.10           | 2                | 0.10
60–69         | 5              | 0.25           | 7                | 0.35
70–79         | 8              | 0.40           | 15               | 0.75
80–89         | 4              | 0.20           | 19               | 0.95
90–99         | 1              | 0.05           | 20               | 1.00
Total         | 20             | 1.00           |                  |

Read it like this: 8 students scored in the 70–79 range (absolute frequency). That’s 40% of the class (relative frequency). And 75% of students scored 79 or below (cumulative relative frequency at the 70–79 class).

Cumulative relative frequency ≠ relative frequency. The relative frequency for the 70–79 class is 0.40 (40% scored in that range). The cumulative relative frequency is 0.75 (75% scored 79 or below — including everyone in earlier classes too). These are completely different quantities. The cumulative column always ends at 1.00 (100%).
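
The column arithmetic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the course material: the helper name `frequency_table` and its dictionary keys are made up for this sketch, and only the class labels and counts come from the exam-score table above.

```python
from itertools import accumulate

def frequency_table(classes, counts):
    """Return one row per class with f, rf, cf and crf (hypothetical helper)."""
    n = sum(counts)                      # total number of observations
    running = list(accumulate(counts))   # running totals = cumulative frequencies
    rows = []
    for label, f, cf in zip(classes, counts, running):
        rows.append({
            "class": label,
            "f": f,          # absolute frequency
            "rf": f / n,     # relative frequency
            "cf": cf,        # cumulative frequency
            "crf": cf / n,   # cumulative relative frequency
        })
    return rows

# Class labels and counts copied from the exam-score table
exam_classes = ["50-59", "60-69", "70-79", "80-89", "90-99"]
exam_counts = [2, 5, 8, 4, 1]
exam_rows = frequency_table(exam_classes, exam_counts)

for row in exam_rows:
    print(f'{row["class"]:>6} {row["f"]:>3} {row["rf"]:.2f} {row["cf"]:>3} {row["crf"]:.2f}')
```

Computing the cumulative relative frequency as cf / n, rather than adding up rounded relative frequencies, keeps rounding error from building up down the column.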

[Figure 1b. Left panel: “Frequency per class” (x-axis: Score, classes 50–59 to 90–99; y-axis: f). Right panel: “Running total (cf)” (stacked column with boundaries labelled cf = 2, 7, 15, 19, 20).]

Figure 1b: What “cumulative” means. Left: the five classes as separate bars — ordinary frequency. Right: the same five classes stacked into one column (colours match). Each segment adds its class’s count to the running total. The label at each boundary is cf. Reading off the right column: 15 of the 20 students scored 79 or below (cf = 15 at the top of the 70–79 segment), and all 20 scored 99 or below (cf = 20 at the very top).


C2 — Histograms (for Quantitative Data)

A histogram turns a frequency distribution into a picture. Each class becomes a bar; the bar’s height represents frequency. But one feature distinguishes histograms from all other bar-based graphs: the bars touch.

Histogram

A histogram is a graph of a frequency distribution for quantitative (numerical) data. Each bar represents one class interval:

  • Horizontal axis (x): the numerical values — class boundaries are marked at the edges of each bar
  • Vertical axis (y): frequency (absolute or relative)
  • Bars touch — because the x-axis is a continuous numerical scale; there are no gaps between classes
  • Each bar’s area is proportional to its frequency (when class widths are equal, height alone is sufficient)

Here is the histogram for the exam score data above. Notice how the bars share their edges — the right edge of the 60–69 bar is the same as the left edge of the 70–79 bar.

[Figure 1. Histogram; x-axis: Score (50 to 100); y-axis: Frequency.]
Figure 1: Histogram of 20 exam scores (class width = 10). Bars touch — the x-axis is continuous.

Histogram shapes tell a story. Where does the data cluster? Are there a few unusually high or low values that extend one side? Practise describing what you see: “most values fall between X and Y, with a few as high as Z.” You will give these patterns formal names and measures in DS-5 (Position and Distribution Shape).

Histogram intervals are not discrete categories. The 70–79 bar represents every score from 70 up to (but not including) 80. There’s no “70 category” and “79 category” — the x-axis is a continuous number line divided into intervals. Reading the bars as isolated discrete bins is a misinterpretation; the bars represent ranges on a continuum.
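
One way to make the interval rule concrete is to compute which class a given score falls into. The sketch below assumes the classes from Figure 1 (lowest boundary 50, class width 10) and treats each class as a half-open interval that includes its lower boundary and excludes its upper one. The function names `class_index` and `class_label` are hypothetical.

```python
import math

def class_index(value, lower=50, width=10):
    """Index k of the half-open class [lower + k*width, lower + (k+1)*width)."""
    return math.floor((value - lower) / width)

def class_label(value, lower=50, width=10):
    """Human-readable interval for the class that contains value."""
    start = lower + class_index(value, lower, width) * width
    return f"[{start}, {start + width})"

# Scores just below and exactly on a class boundary
for score in (69.9, 70, 79.9, 80):
    print(score, "->", class_label(score))
```

A score of 79.9 lands in the same class as 70, while 80 starts the next class, which is why neighbouring bars share an edge.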


C3 — Bar Charts (for Qualitative Data)

When the data is categorical rather than numerical, we use a bar chart. Bar charts look similar to histograms at a glance, but there’s one crucial difference that isn’t merely cosmetic — it reflects the nature of the data itself.

Bar Chart

A bar chart displays the frequency or relative frequency of each category in a qualitative variable. Key features:

  • Horizontal axis (x): category labels
  • Vertical axis (y): frequency or relative frequency
  • Bars do NOT touch — there is a gap between bars because categories are distinct (there is no value “between” Bakery and Dairy)
  • Bar order is arbitrary for nominal data; use the natural order for ordinal data
  • A bar chart sorted in descending order of frequency — most common category first — is called a Pareto chart, widely used in quality control to quickly identify the largest sources of defects

[Figure 2. Bar chart; x-axis: Favourite Sport (Soccer, Basketball, Swimming, Tennis, Other); y-axis: Frequency.]
Figure 2: Bar chart of favourite sports (qualitative nominal data). Bars have gaps — the categories are distinct.
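
The Pareto ordering mentioned in the list above amounts to sorting categories by frequency, largest first. Here is a minimal sketch; the defect categories and counts are made up purely for illustration and do not come from this lesson.

```python
# Made-up defect counts, for illustration only
defect_counts = {"Scratch": 12, "Dent": 31, "Misprint": 7, "Loose cap": 19}

# Sort categories by frequency, most common first (Pareto order)
pareto_order = sorted(defect_counts.items(), key=lambda item: item[1], reverse=True)

# Text rendering: one row per category, bar length equal to the count
for category, count in pareto_order:
    print(f"{category:<10} {'#' * count} {count}")
```

Because the categories are nominal, reordering them changes nothing about the data; it only puts the largest sources first.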

The Key Difference — One Rule to Remember

Histogram bars touch. Bar chart bars don’t.

This isn’t an aesthetic choice. In a histogram, the x-axis is a continuous number line — scores of 69.9 and 70.0 are adjacent, so the bars share an edge. In a bar chart, the x-axis shows distinct categories — “Soccer” and “Basketball” are not adjacent on any meaningful scale, so there’s a gap to reflect that separation.

Use a histogram for quantitative data. Use a bar chart for qualitative data. Using a histogram for categorical data is always wrong.


C4 — Other Graph Types

Histograms and bar charts cover most of what you’ll build in this course. But four other graph types appear regularly in statistics — each exists because a particular question about data cannot be answered with a bar. Here’s what each one is for.

Pie Chart

A pie chart divides a circle into sectors, where each sector’s angle is proportional to its relative frequency (angle = relative frequency × 360°). Best used when:

  • Data is qualitative (usually nominal)
  • There are 5 or fewer categories (more segments make the chart unreadable)
  • Part-to-whole relationships are the main message

Limitation: Humans are poor at estimating angles. When categories are similar in size, a bar chart is usually clearer. Many statisticians prefer bar charts to pie charts in almost every situation.

[Figure 2b panels. Left: pie chart of Favourite Sport (n = 40): Soccer 40%, Basketball 30%, Swimming 20%, Tennis 10%. Right: bar chart of the same data; x-axis: Favourite Sport; y-axis: Count (ticks at 4, 8, 12, 16).]

Figure 2b: Same data, two graph types. The pie chart encodes frequency as sector angle — difficult to compare when sectors are similar in size (try judging “Basketball vs. Swimming” precisely). The bar chart encodes frequency as bar height — immediately comparable. For most purposes, prefer the bar chart.
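
To see how a pie chart encodes these numbers, the sector angles can be computed directly. In this sketch the counts (16, 12, 8, 4) are derived from the percentages and n = 40 shown in Figure 2b; the variable names are illustrative.

```python
# Counts implied by Figure 2b: 40%, 30%, 20% and 10% of n = 40
sport_counts = {"Soccer": 16, "Basketball": 12, "Swimming": 8, "Tennis": 4}
n = sum(sport_counts.values())

# Sector angle = relative frequency x 360 degrees
sector_angles = {sport: count * 360 / n for sport, count in sport_counts.items()}

for sport, angle in sector_angles.items():
    print(f"{sport:<10} {angle:5.1f} degrees")
```

Basketball and Swimming differ by 36 degrees of arc, which is harder to judge by eye than the difference between two bar heights.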

Stem-and-Leaf Plot

A stem-and-leaf plot displays quantitative data by splitting each value into a stem (leading digit(s)) and a leaf (trailing digit). It looks like a sideways histogram but preserves the individual data values:

Stem | Leaf
  5  | 3 7
  6  | 1 4 4 8 9
  7  | 0 2 3 5 5 7 8
  8  | 1 6 9
  9  | 4

Best for small datasets (n ≤ 50) where you want to see the distribution and keep the raw data visible. Each row is a class; leaves within a row are individual observations.
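
The split into stems and leaves can be sketched as integer division by 10. The code below rebuilds the display from the 18 values shown in the plot above; the function name `stem_and_leaf` is hypothetical, and the sketch assumes two-digit whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem) and units digit (leaf)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 64 -> stem 6, leaf 4
        plot[stem].append(leaf)
    return dict(plot)

# The 18 values read off the stem-and-leaf plot above
scores = [53, 57, 61, 64, 64, 68, 69, 70, 72, 73, 75, 75, 77, 78, 81, 86, 89, 94]
plot = stem_and_leaf(scores)

print("Stem | Leaf")
for stem in sorted(plot):
    print(f"  {stem}  | {' '.join(str(leaf) for leaf in plot[stem])}")
```

The number of leaves in each row is that class’s absolute frequency, so the plot doubles as a frequency table.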

Scatter Plot

A scatter plot displays pairs of quantitative measurements as points on a two-dimensional graph. Each point represents one observation; its x-coordinate is one variable and its y-coordinate is another.

Use a scatter plot when you want to explore the relationship between two quantitative variables — does one tend to increase as the other does? Is there a linear pattern? Are there outliers? You’ll use scatter plots extensively in REG-1 (Correlation Analysis).

[Figure 2c. Scatter plot; x-axis: Study Hours (0 to 10); y-axis: Exam Score (40 to 100).]

Figure 2c: Scatter plot of study hours versus exam score (n = 15). Each dot is one student. The upward trend — more hours, higher scores — is visible as a whole-cloud pattern, not a rule for any individual point. This is what “relationship between two quantitative variables” looks like. You will quantify this trend precisely in REG-1.

Time-Series Plot

A time-series plot (line graph) shows how a quantitative variable changes over time. Time goes on the x-axis; the variable of interest on the y-axis. Points are connected by lines to emphasize the change from one time period to the next.

Use a time-series plot when the x-variable specifically represents time (days, months, years) and the trend over time is the story you want to tell.

[Figure 2d. Line graph; x-axis: Month (Jan to Dec); y-axis: Temp. (°C); annotation: Jul peak (25°C).]

Figure 2d: Time-series plot of monthly average temperature over one year. The connected line emphasizes the trend — rise through summer, fall through winter. Points are connected because the x-axis is time: consecutive months are adjacent, so tracking the change from one to the next is meaningful. Compare this to a bar chart of the same data: a bar chart would show the same heights but lose the sense of continuous progression.


C5 — Choosing the Right Graph

Every graph-choice decision starts with the same question: What type of variable is this? Use the decision chain below.

Graph Selection Decision Chain

  1. One variable?
    • Qualitative: bar chart (always); pie chart (only if ≤5 categories and part-to-whole is the focus)
    • Quantitative: histogram (distribution shape); stem-and-leaf (small n, want raw values)
  2. Two variables?
    • Both quantitative: scatter plot; if one is time, use a time-series plot
    • One qualitative, one quantitative: side-by-side bar charts or grouped histograms (covered in later courses)
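
The decision chain above can be written out as a small function. This is only a sketch of the chain as stated here: the function name `choose_graph` and its parameters are hypothetical, and combinations the chain does not mention (such as two qualitative variables) are rejected rather than guessed at.

```python
def choose_graph(var_types, n_categories=None, part_to_whole=False,
                 small_n=False, one_is_time=False):
    """Suggest graph types for one or two variables, following the decision chain.

    var_types: list of "qualitative" / "quantitative" strings (length 1 or 2).
    """
    kinds = sorted(var_types)
    if kinds == ["qualitative"]:
        options = ["bar chart"]
        # Pie chart only for part-to-whole messages with at most 5 categories
        if part_to_whole and n_categories is not None and n_categories <= 5:
            options.append("pie chart")
        return options
    if kinds == ["quantitative"]:
        return ["histogram", "stem-and-leaf plot"] if small_n else ["histogram"]
    if kinds == ["quantitative", "quantitative"]:
        return ["time-series plot"] if one_is_time else ["scatter plot"]
    if kinds == ["qualitative", "quantitative"]:
        return ["side-by-side bar charts", "grouped histograms"]
    raise ValueError("combination not covered by the decision chain")

print(choose_graph(["qualitative"], n_categories=5, part_to_whole=True))
print(choose_graph(["quantitative", "quantitative"], one_is_time=True))
```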

Using a histogram for qualitative data is always wrong. A histogram’s x-axis is a continuous number line — it implies that values between bars are possible. For qualitative data (e.g., “Bakery,” “Dairy”), no such “between” exists. Use a bar chart. Similarly, using a bar chart for quantitative data with many distinct values (like a weight distribution) will produce a cluttered mess where a histogram would be clean and clear.


C6 — Graph Misrepresentation

Not all misleading graphs are accidents. Knowing the common techniques helps you spot them — and avoid creating them yourself.

Truncated Y-Axis

A truncated y-axis starts above zero, making small differences look large. The bars or line still reflect the true data values, but the visual impression is distorted.

When it’s a problem: bar charts and histograms should (almost) always start at 0. With the y-axis starting at 0, the gap between a 97% bar and a 94% bar is barely visible, which is appropriate for a 3-point difference. But when the axis starts at 93%, that same 3-point difference looks enormous.

When it’s acceptable: time-series plots often legitimately start above zero (e.g., tracking temperature variations around 15°C — starting at 0 would waste most of the chart). The key is whether the starting point is disclosed and whether the visual impression matches the magnitude of the difference.

Here is the same fictional data — approval ratings over four quarters — displayed two ways. The only difference between the two graphs is where the y-axis starts.

✓ Honest: y-axis starts at 0

[Figure 3, left panel: Approval Rate (%) for Q1 to Q4; y-axis from 0 to 100.]

The change looks small — as it should.

✗ Misleading: y-axis starts at 93

[Figure 3, right panel: Approval Rate (%) for Q1 to Q4; y-axis starting at 93.]

Looks like a collapse — same 3-point drop.

Figure 3: Same data, two y-axes. The left graph starts at 0%; the right graph starts at 93%. The 3-point decline from Q1 to Q4 looks negligible on the left and catastrophic on the right. This is the truncated y-axis trick.
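
The size of the distortion can be quantified by comparing drawn bar heights rather than data values. The sketch below uses the 97% and 94% values from the truncated-axis discussion above; the function name `drawn_height_ratio` is hypothetical.

```python
def drawn_height_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of drawn bar heights when the y-axis begins at axis_start."""
    return (value_a - axis_start) / (value_b - axis_start)

honest = drawn_height_ratio(97, 94, axis_start=0)      # about 1.03
truncated = drawn_height_ratio(97, 94, axis_start=93)  # 4.0

print(f"axis from 0:  taller bar is {honest:.2f}x the shorter one")
print(f"axis from 93: taller bar is {truncated:.2f}x the shorter one")
```

The data ratio is about 1.03 in both cases; only the drawn ratio changes, from roughly 1.03 to 4.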

Other Common Misrepresentation Techniques

  • Inconsistent class widths in histograms: If classes are different widths, the bar height alone is misleading — a wider class captures more observations. The correct display uses frequency density (frequency ÷ class width) on the y-axis, ensuring area rather than height encodes frequency.
  • 3D effects and distorted areas: 3D pie charts tilt and expand the slice nearest the viewer, making it look larger than its true proportion. Pictograms (where a doubled icon also doubles in width, quadrupling area) distort relative sizes.
  • Selective date ranges in time-series: Choosing a start date that captures only the upward part of a trend creates the impression of consistent growth when the full history shows volatility.
  • Omitting the sample size n: “67% of customers prefer our brand!” — based on a survey of 3 customers — is technically correct but meaningless without context.
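
The frequency-density correction for unequal class widths can be sketched as follows. The three classes and their counts are made up for illustration; only the formula (frequency ÷ class width) comes from the list above.

```python
# Made-up classes with unequal widths: (lower bound, upper bound, frequency)
unequal_classes = [(0, 10, 8), (10, 20, 12), (20, 50, 15)]

densities = []
for lower, upper, freq in unequal_classes:
    width = upper - lower
    density = freq / width   # frequency density = frequency / class width
    densities.append(density)
    print(f"{lower:>2}-{upper:<2}  f = {freq:>2}  width = {width:>2}  density = {density:.2f}")
```

In this example the widest class has the largest count (15) but the lowest density (0.50), so drawing its bar 15 units tall would overstate it. With density on the y-axis, each bar’s area (density × width) equals its frequency.
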
Quick checklist for evaluating any graph you encounter:
  • Does the y-axis start at 0? If not, why not — and does the distortion matter?
  • Are all bars/slices drawn to a consistent scale?
  • Is the graph type appropriate for the variable type?
  • What is n? Is the sample large enough to support the claim?
  • Is the time range shown cherry-picked?

Section 4: Worked Examples

Let’s walk through four examples — each exercises a different core concept, and the scaffolding gradually fades so you’re doing more of the thinking by Example 4.

Example 1 — Building a Frequency Distribution Table (C1)

Scenario: A quality-control inspector records the number of defective items found in each of 20 production batches:

3, 7, 2, 5, 8, 4, 6, 3, 5, 9, 1, 4, 6, 7, 5, 6, 8, 3, 6, 4

Organize this into a frequency distribution with 5 classes of width 2 (classes: 1–2, 3–4, 5–6, 7–8, 9–10). Compute the absolute frequency, relative frequency, cumulative frequency, and cumulative relative frequency.

Step 1: Tally each class.

Go through the data values one by one and mark which class each belongs to. The resulting counts are the absolute frequencies in the table below.

Step 2: Compute relative frequency. Divide each count by n = 20.

Step 3: Compute cumulative frequencies. Running totals from top to bottom.

Result:

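Class | Absolute Freq. | Relative Freq. | Cumulative Freq. | Cumul. Rel. Freq.
1–2   | 2              | 0.10           | 2                | 0.10
3–4   | 6              | 0.30           | 8                | 0.40
5–6   | 7              | 0.35           | 15               | 0.75
7–8   | 4              | 0.20           | 19               | 0.95
9–10  | 1              | 0.05           | 20               | 1.00
Total | 20             | 1.00           |                  |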

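Interpretation: 75% of batches had 6 or fewer defects (the cumulative relative frequency is 0.75 at the 5–6 class). Only 1 batch had 9–10 defects (5% of all batches).

Sanity checks: the absolute frequency column must sum to n. The relative frequency column must sum to 1.00. The final row of the cumulative frequency column must equal n. The final row of the cumulative relative frequency column must equal 1.00. Run these checks every time.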
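
The four sanity checks are easy to automate. The sketch below runs them on the columns of the result table above; the function name `check_frequency_table` and the small tolerance used for the decimal columns are assumptions of this example.

```python
def check_frequency_table(n, f, rf, cf, crf, tol=1e-9):
    """Return a list of failed sanity checks (an empty list means all passed)."""
    failures = []
    if sum(f) != n:
        failures.append("absolute frequencies do not sum to n")
    if abs(sum(rf) - 1.0) > tol:
        failures.append("relative frequencies do not sum to 1.00")
    if cf[-1] != n:
        failures.append("final cumulative frequency is not n")
    if abs(crf[-1] - 1.0) > tol:
        failures.append("final cumulative relative frequency is not 1.00")
    return failures

# Columns copied from the result table above (n = 20 batches)
problems = check_frequency_table(
    n=20,
    f=[2, 6, 7, 4, 1],
    rf=[0.10, 0.30, 0.35, 0.20, 0.05],
    cf=[2, 8, 15, 19, 20],
    crf=[0.10, 0.40, 0.75, 0.95, 1.00],
)
print(problems or "all checks passed")
```

These checks catch arithmetic slips within the table. They cannot tell you whether a value was tallied into the wrong class, so recounting the raw data is still worthwhile.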


Example 2 — Constructing a Histogram (C2)

Scenario: Using the frequency table from Example 1, sketch the histogram.

Before I show the steps, take a moment to predict: where do you expect the tallest bar to be? Will the bars be roughly equal on both sides of that peak, or will one side drop off more steeply?

Step 1: Draw and label the axes.

Step 2: Draw bars for each class, touching at the class boundaries.

[Figure 4. Histogram; x-axis: Defective Items per Batch (class boundaries at 0.5, 2.5, 4.5, 6.5, 8.5, 10.5); y-axis: Frequency.]
Figure 4: Histogram of defective items per batch (n = 20). The distribution peaks at the 5–6 class; bar heights decrease on both sides of the peak.

Shape: The peak is at the 5–6 class (f = 7). From the peak, bars decrease on both sides — dropping to f = 4 in the 7–8 class and f = 1 in the 9–10 class on the right, and to f = 6 in the 3–4 class and f = 2 in the 1–2 class on the left. The distribution is approximately balanced around its centre. In DS-5 you will give shapes like this formal names and measures; for now, practise describing what you observe: which class is tallest, and how do the bars change on each side of the peak.


Example 3 — Choosing the Right Graph (C3, C5)

For each scenario, which graph is most appropriate? Think through the decision chain (variable type → graph choice) before revealing the answer.

Scenario A: A researcher records the preferred music genre of 200 university students (Pop, Rock, Hip-hop, Classical, Other).

Show answer for Scenario A

Answer: Bar chart. Music genre is a qualitative nominal variable — categories with no natural order. A bar chart with one bar per genre and frequency on the y-axis is correct. A pie chart would also be acceptable here (5 categories fits the pie chart guideline), but a bar chart makes relative sizes easier to compare. Not a histogram — genre is not a number; there is no meaningful “between Pop and Rock.”

Scenario B: A nurse records the resting heart rate (beats per minute) of 50 patients.

Show answer for Scenario B

Answer: Histogram (or stem-and-leaf). Heart rate is quantitative continuous. A histogram groups the values into class intervals (e.g., 60–69, 70–79 bpm) and shows the distribution shape. For a small dataset (n = 50), a stem-and-leaf plot would also work and preserves individual values. Not a bar chart — the x-axis is a continuous number line.

Scenario C: An economist tracks the unemployment rate each month for 5 years.

Show answer for Scenario C

Answer: Time-series plot (line graph). The x-variable is time (months), and showing the trend over time is the point. Connecting the points with a line emphasizes the change from month to month. A histogram would destroy the temporal ordering of the data — it would show the distribution of rates but not how rates evolved.


Example 4 — Spot the Misleading Graph (C6)

This example tests whether you can recognize the misrepresentation techniques from C6 in a new context — not just describe them, but identify them when they appear.

Graph A: A bar chart shows monthly sales for two competing stores. Store A’s bar is drawn as a dollar-sign icon 3 cm tall. Store B’s bar is a dollar-sign icon 6 cm tall — twice as tall and twice as wide, representing $2 million vs. $1 million in sales.

Identify the misrepresentation in Graph A

Misrepresentation: Distorted area in a pictogram. Store B has twice the sales of Store A, so the icon is drawn twice as tall, but it was also made twice as wide. Area = height × width, so an icon that is 2× taller and 2× wider has 4× the area of Store A’s icon. Store B therefore looks four times as large as Store A when its sales are only twice as large.

Fix: Use bars of equal width or use a simple bar chart without pictogram icons.

Graph B: A line graph shows a company’s stock price over 6 months. The y-axis starts at $82 and ends at $90. The line climbs steeply from $83 to $89. The headline reads: “Stock price surges — up 7.2% in six months!”

Identify the misrepresentation in Graph B

Misrepresentation: Truncated y-axis creating visual exaggeration. Starting the y-axis at $82 rather than $0 makes a 7.2% rise look like a near-vertical climb. On an axis from $0 to $90, the same data would appear as a very shallow upward slope.

Note: For a time-series plot of a stock price, starting at $0 is sometimes impractical (the line would be compressed into a narrow, nearly flat band). The honest fix is to clearly label where the axis starts and to avoid implying that the visual steepness represents the size of the change. The headline’s “surges” is the misleading part: 7.2% over 6 months is a moderate increase, not a surge.
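
The numbers behind Graph B can be worked out explicitly. The first line checks the headline figure; the second compares how much of the vertical axis the same $6 rise occupies on the published axis ($82 to $90) and on an axis starting at $0. The “share of axis height” measure is an illustrative way to express the distortion, not a standard named statistic.

```latex
\[
\text{percentage change} = \frac{89 - 83}{83} \approx 0.072 = 7.2\%
\]
\[
\text{share of axis height (axis from 82 to 90)} = \frac{89 - 83}{90 - 82} = 0.75,
\qquad
\text{share of axis height (axis from 0 to 90)} = \frac{89 - 83}{90 - 0} \approx 0.067
\]
```

The same rise fills 75% of the truncated axis but under 7% of the full one, which is why the line looks so steep.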

Section 5: Guided Practice

Time to try it yourself — with support. Each problem below gives you immediate feedback. If you get something wrong, read the rationale to understand why before moving on.

Problem 1 — Completing a Frequency Distribution Table (C1)

A survey of 40 commuters recorded how many minutes their morning commute took. The frequency table is partially completed:

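Class (minutes) | Absolute Freq. | Relative Freq. | Cumulative Freq. | Cumul. Rel. Freq.
0–14            | 6              | 0.15           | 6                | 0.15
15–29           | 12             | 0.30           | 18               | 0.45
30–44           | 14             | ?              | ?                | ?
45–59           | 6              | 0.15           | 38               | 0.95
60–74           | 2              | 0.05           | 40               | 1.00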

Question 1a: What is the relative frequency for the 30–44 minute class?

Question 1b: What is the cumulative frequency for the 30–44 minute class?

Question 1c: What percentage of commuters take 45 minutes or more?


Problem 2 — Selecting the Correct Graph Type (C3, C5)

Apply the decision chain: identify the variable type, then select the graph. Click “Try a similar problem” to practice with a different scenario.

A city planner surveys 300 residents about their primary mode of transportation to work (Car, Bus, Bike, Walk, Work from home).

What is the most appropriate graph for displaying these results?

A nurse records the body temperature (°C) of each of 80 patients at admission.

What is the most appropriate graph for displaying the distribution of temperatures?

An online review platform collects star ratings (1 star, 2 stars, 3 stars, 4 stars, 5 stars) from 500 diners for a restaurant.

What is the most appropriate graph for displaying the distribution of ratings?

A meteorologist records the total monthly rainfall (mm) in Montréal over 36 consecutive months.

What is the most appropriate graph for showing how rainfall changed over time?

A sociologist records the number of siblings each student in a class has (0, 1, 2, 3, 4, or 5+).

What is the most appropriate graph for displaying the distribution?


Problem 3 — Reading and Interpreting a Frequency Table (C1)

The table shows the distribution of time (in minutes) it takes 50 students to complete a quiz.

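Class (min) | Absolute Freq. | Relative Freq. | Cumulative Freq. | Cumul. Rel. Freq.
10–14       | 4              | 0.08           | 4                | 0.08
15–19       | 11             | 0.22           | 15               | 0.30
20–24       | 18             | 0.36           | 33               | 0.66
25–29       | 12             | 0.24           | 45               | 0.90
30–34       | 5              | 0.10           | 50               | 1.00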

How many students finished the quiz in the 20–24 minute class?

What proportion of students took 25 minutes or more to finish?

How many students finished the quiz in under 20 minutes?

What percentage of students finished between 15 and 24 minutes (inclusive)?

Which class has the highest frequency (the modal class)?

Problem 4 — Identify the Misleading Feature (C6)

A supermarket chain releases a bar chart comparing the annual revenue of its three store formats. Here is a description of the chart:

What is the primary misleading feature of this chart?

Show full explanation

The actual difference between Flagship ($215M) and Express ($196M) is $19M, only about 8.8% of the Flagship revenue. On a y-axis running from $0 to $220M, the Flagship bar would be 97.7% of the chart height and the Express bar 89.1%: visually almost identical.

By starting at $180M, the chart compresses the axis range to $40M (from $180M to $220M). Now Express ($196M) appears at (196 − 180) / 40 = 40% of the axis height, while Flagship ($215M) appears at (215 − 180) / 40 = 87.5%. This makes the Flagship bar look more than twice as tall as the Express bar, a gross visual distortion of the actual 8.8% difference.

Section 6: Independent Practice

No hints here — these problems are yours to work through. Use scratch paper as needed. Show the solution when you’re ready to check your work.

Interleaving tip: These problems mix concepts intentionally. Don’t expect every problem to test the same skill as the one before — that’s by design. Research shows interleaved practice builds stronger long-term memory than blocked practice. If you get stuck, the concept tag in each problem header — e.g., (C2) — tells you exactly which subsection of Section 3 to revisit.

Problem 1 — Build a Complete Frequency Table (C1)

A new dataset is generated each time you click “Generate new problem.” Build the full frequency table — absolute, relative, cumulative, and cumulative relative — using 5 equal-width classes.


Problem 2 — Interpreting a Histogram (C2)

The histogram below shows the distribution of daily steps walked by 60 office workers over one month.

[Figure 5. Histogram; x-axis: Daily Steps (0 to 12,000); y-axis: Frequency (gridlines at 10 and 20).]
Figure 5: Histogram of daily step counts for 60 office workers.

Before reading the questions, study the histogram for ten seconds. Make two quick predictions: (1) Which class has the highest frequency? (2) Is most of the data concentrated toward lower step counts, higher step counts, or roughly in the middle? Hold your answers, then continue.

Answer the following questions without looking at the solution:

a. What is the class width?
b. Approximately how many workers walked between 6,000 and 8,000 steps per day?
c. What percentage of workers walked fewer than 6,000 steps per day?
d. Describe the shape of the distribution in your own words: where does it peak, and how do the bars change as step count increases or decreases from the peak?

Show Solution

(a) Class width: Each class spans 2,000 steps (2,000–3,999, 4,000–5,999, etc.). Class width = 2,000.

(b) Workers in 6,000–7,999 steps: Reading the bar — the frequency is approximately 22. (The bar extends to the 20 gridline and slightly above it.)

(c) Percentage walking fewer than 6,000 steps: Add the first two bars: approximately 5 + 12 = 17 workers. Relative frequency: 17 / 60 ≈ 0.283, or about 28.3%.

(d) Shape: The distribution peaks at the 6,000–7,999 class (the tallest bar, f ≈ 22). From the peak, bars decrease on both sides: the 8,000–9,999 class (f ≈ 15) and 10,000–11,999 class (f ≈ 6) step down to the right, while the 4,000–5,999 class (f ≈ 12) and 2,000–3,999 class (f ≈ 5) step down to the left. The right side (higher step counts) retains more workers than the left side (lower step counts), so the distribution sits somewhat higher in the step-count range — not quite balanced around the centre.


Problem 3 — Select the Best Graph (C5)

Each scenario below describes a dataset. Choose the most appropriate graph. Justify your choice by thinking about the variable type(s) involved.

A cardiologist measures both systolic blood pressure (mmHg) and age (years) for 120 patients. She wants to see whether blood pressure tends to increase with age.

Show Solution

Scatter plot. Both variables are quantitative continuous, and the goal is to explore the relationship between them. In a scatter plot, age would go on the x-axis (the explanatory variable) and blood pressure on the y-axis (the response variable). Each patient is one point. The pattern of points reveals whether there is an upward trend.

An HR department surveys 300 employees about their job satisfaction level: Very Dissatisfied / Dissatisfied / Neutral / Satisfied / Very Satisfied. They want to display how satisfaction is distributed across the company.

Show Solution

Bar chart (with categories in natural ordinal order). Job satisfaction is a qualitative ordinal variable — ordered categories, not numbers. A bar chart with one bar per satisfaction level (in order from Very Dissatisfied to Very Satisfied) shows the distribution clearly. The bars should have gaps because the categories are distinct.

A pie chart would technically work (5 categories), but it makes similar-sized categories harder to compare. A bar chart is the stronger choice here.

A small café tracks its weekly profit ($) over 52 weeks. The owner wants to identify which months were most profitable and whether there are seasonal patterns.

Show Solution

Time-series plot. The x-axis is time (weeks 1 through 52), and the goal is to see how profit changes over time. Connecting the weekly data points with a line emphasizes the trend and makes seasonal patterns visible. A histogram would show how weekly profits are distributed (useful) but would lose all temporal information about when profits were high or low.

An e-commerce company records the geographic region of each purchase (North America, Europe, Asia-Pacific, Latin America, Other) over one quarter. They want to show what share of total sales came from each region.

Show Solution

Bar chart or pie chart. Geographic region is a qualitative nominal variable with 5 categories. Either works here — the question says “share” (part-to-whole), which is where pie charts shine. With only 5 categories, a pie chart is readable. However, if the slices are similar in size, a bar chart will make comparisons clearer. Both answers are acceptable; the key is not using a histogram (the x-axis has no continuous numerical scale).

A food manufacturer quality-checks the weight of cereal boxes (g) from a production line. They test 200 boxes and want to see how the weights are distributed and whether any fall outside the acceptable range.

Show Solution

Histogram. Box weight is quantitative continuous. A histogram groups weights into class intervals and shows the shape of the distribution — the manufacturer can see whether weights cluster tightly around the target or whether there is a systematic pattern of over- or under-fill. A stem-and-leaf plot would also work, but n = 200 is too large for a readable stem-and-leaf. Histogram is the standard choice.


Problem 4 — Two Variables, One Graph (C4, C5)

A public health researcher collects two measurements for each of 85 participants in a study:

  • Hours of sleep
  • Score on a cognitive performance test

She asks: “Does more sleep correlate with better cognitive performance?”

Answer the following in writing (or mentally):

a. What graph should she use? Name it and explain why, using the variable types as your justification.
b. Which variable goes on which axis, and why?
c. What pattern in the graph would suggest that more sleep is associated with better performance?

Show Solution

(a) Graph: Scatter plot. Both variables are quantitative continuous. When both variables are quantitative and the goal is to explore the relationship between them, a scatter plot is the correct choice. Each participant is one point on the plot; the x-coordinate is hours of sleep and the y-coordinate is the test score.

(b) Axes: Hours of sleep (the explanatory variable — the factor we think might influence performance) goes on the x-axis. Cognitive performance score (the response variable — the outcome we’re measuring) goes on the y-axis. Convention: the variable you think explains variation in the other goes on x.

(c) Pattern suggesting positive association: If more sleep correlates with better performance, the points should form an upward pattern from left to right — participants with fewer hours of sleep (x small) tend to have lower scores (y small), and those with more sleep (x large) tend to have higher scores (y large). This is called a positive association and usually appears as points trending upward from the lower-left to the upper-right of the scatter plot.


Problem 5 — Critique a Misleading Graph (C6)

A political party releases the following graph to support its campaign message “Crime has plummeted under our leadership.”

[Line graph. Y-axis: Crime Rate (incidents per 100,000), with ticks at 940, 950, 960. X-axis: Year 1 to Year 6.]
Figure 6: “Crime Rate Over Six Years” — published by a political party. Data: Year 1: 958, Year 2: 955, Year 3: 952, Year 4: 950, Year 5: 947, Year 6: 943 incidents per 100,000.

Answer the following:

a. Identify the misleading technique used in this graph.
b. Compute the true percentage decrease in crime rate from Year 1 to Year 6.
c. Describe how the graph should be redrawn to represent the data honestly.

Show Solution

(a) Misleading technique: Truncated y-axis. The y-axis starts at 940 instead of 0 (or at least a much lower value). This compresses the visual scale so that the decline from 958 to 943 — which spans only 15 units — takes up nearly the entire height of the chart. The line appears to drop almost vertically, suggesting a massive collapse in crime.

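(b) True percentage decrease:

$$\frac{958 - 943}{958} \times 100\% = \frac{15}{958} \times 100\% \approx 1.57\%$$

Crime fell by about 1.6% over six years: a modest decline, not a plunge.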

(c) Honest redraw: Start the y-axis at 0 (or clearly indicate a broken axis if starting mid-range, with a zigzag break symbol). The line would then appear nearly flat over the 6 years, accurately reflecting the modest change. Include the actual data values next to the points and note the y-axis scale explicitly.
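To see how much work the axis choice is doing, the sketch below compares the true relative decrease with the share of the vertical axis that the drop occupies under two axis ranges. The function names are illustrative, and the top of the axis is assumed to be 960 (the highest tick shown in the figure).

```python
# Data from Figure 6 (incidents per 100,000)
rates = [958, 955, 952, 950, 947, 943]


def percent_decrease(start, end):
    """Relative decrease from start to end, in percent."""
    return (start - end) / start * 100


def visual_drop(start, end, axis_min, axis_max):
    """Share of the plotted axis height that the drop occupies."""
    return (start - end) / (axis_max - axis_min)


true_drop = percent_decrease(rates[0], rates[-1])
truncated = visual_drop(rates[0], rates[-1], 940, 960)  # axis as published (assumed top: 960)
honest = visual_drop(rates[0], rates[-1], 0, 960)       # axis starting at 0

print(f"True decrease: {true_drop:.2f}%")
print(f"Drop as share of axis height (940 to 960): {truncated:.0%}")
print(f"Drop as share of axis height (0 to 960): {honest:.1%}")
```

Under these assumptions the same 15-unit decline fills three quarters of the truncated axis but under 2% of an axis that starts at 0, which is why the redrawn line looks nearly flat.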


Mixed Review — Retrieval from Earlier Lessons

These problems draw on concepts from DS-1. Attempting them without re-reading that lesson is the point — retrieval practice strengthens long-term memory more than re-reading.

Review Problem 1 — Variable Classification and Graph Selection

A school nurse records the following for each of 120 students during a health check: blood type (A, B, AB, O), height (cm), and a self-reported pain level on a 5-point scale (None / Mild / Moderate / Severe / Very Severe).

A classmate says: “I’ll use a histogram for blood type because there are four categories — that’s enough to make a frequency distribution.” Explain what is wrong with this reasoning, identify the correct graph type for blood type, and justify your choice using the variable’s measurement level.

Show Solution

Blood type is qualitative nominal — the four types (A, B, AB, O) are category labels with no natural numeric ordering and no meaningful arithmetic. A histogram is designed for quantitative data, where the horizontal axis is a continuous number line and bar widths represent class intervals (e.g., 160–169 cm). Blood type values cannot be placed on a number line, so the concept of a “class interval” makes no sense here.

The correct graph is a bar chart: one bar per blood type, with frequency (or relative frequency) on the vertical axis, and gaps between bars to signal that the horizontal axis is not a continuous scale.

Pain level is a subtler case: it is qualitative ordinal (the categories have a natural order from None to Very Severe, but the gaps between levels are not guaranteed to be equal). A bar chart with bars in the natural order is again the correct choice — not a histogram, because the horizontal axis is still a set of labelled categories, not a numeric measurement scale.


Review Problem 2 — Sampling Method and Bias

A university wants to estimate the proportion of its 12,000 students who use the campus food bank. The student affairs office proposes the following approach: place a link to an online survey on the campus homepage for one week and count responses.

Identify the sampling method being used, name at least two sources of bias this method introduces, and explain which direction each bias is likely to push the estimated proportion (upward or downward). Then suggest a better sampling method and explain why it would reduce bias.

Show Solution

Sampling method: Voluntary response sampling (also called self-selection). Only students who notice the link and choose to respond are included — the sample is not randomly selected from the population.

Source of bias 1 — Voluntary response bias: Students who use the food bank regularly have a stronger personal stake in the question and are more likely to click and respond. This pushes the estimated proportion upward, overstating food-bank use.

Source of bias 2 — Undercoverage bias: Students without reliable internet access, those who rarely visit the homepage (e.g., off-campus or part-time students), and students who feel embarrassed disclosing food insecurity will be systematically underrepresented. The first two groups push the estimate in an uncertain direction; the third group — students who do use the food bank but don’t self-report — pushes the estimate downward.

Better method: Simple random sampling (SRS). Randomly select, say, 600 student ID numbers from the registrar’s complete list and send each selected student a private, confidential survey invitation. SRS gives every student an equal chance of selection, which removes self-selection into the sample and, because the registrar’s list covers all 12,000 students, undercoverage as well. Invited students can still decline to respond, so the confidential framing matters: it reduces social-desirability pressure and encourages honest replies.
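As a rough illustration of what "randomly select 600 student ID numbers" could look like in practice, here is a minimal sketch using Python's standard library. The ID scheme (students numbered 1 to 12,000) and the seed value are assumptions made for the example, not details from the problem.

```python
import random

POPULATION_SIZE = 12_000  # students on the registrar's list
SAMPLE_SIZE = 600         # sample size suggested in the solution

# Hypothetical ID scheme: students numbered 1..12000.
student_ids = range(1, POPULATION_SIZE + 1)

# A fixed seed (arbitrary value) makes the draw reproducible and auditable.
rng = random.Random(2024)

# random.sample draws without replacement, so no student is picked twice.
sample = rng.sample(student_ids, SAMPLE_SIZE)

print(f"Selected {len(sample)} students, {len(set(sample))} of them distinct")
print(f"Chance of selection per student: {SAMPLE_SIZE / POPULATION_SIZE:.1%}")
```

Every ID has the same 5% chance of being drawn, which is the defining property of a simple random sample.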

Section 7: Mastery Check

No hints. No scaffolds. These questions test whether you can recall and apply what you’ve learned without support — the clearest signal that you’ve actually internalized it.

Attempt each question fully before revealing the answer. Peeking early short-circuits the retrieval practice that makes this section effective. Even an imperfect attempt trains your memory more than reading the solution directly.

Question 1 — The Feynman Test (C3)

Imagine a classmate missed the lesson on histograms and bar charts. They’ve been told that “a histogram is just a bar chart without gaps” and don’t understand why that matters.

Explain — in your own words, as if to that classmate — why you cannot use a histogram for categorical data, and what the touching bars actually represent. Don’t just state the rule; explain the reasoning behind it.

Show model answer

A histogram’s x-axis is a continuous number line. The bars touch because the classes are adjacent intervals on that line — there is no gap between “40–49” and “50–59” because numbers don’t suddenly stop at 49 and jump to 50. Every value on the number line belongs to exactly one bar, and the bars cover the line with no holes.

For categorical data (like colour, gender, or city), there is no number line. “Red,” “Blue,” and “Green” are not adjacent on any scale — there is nothing “between” red and blue. Drawing the bars touching would imply that some value exists between “Red” and “Blue,” which is nonsense. The gap in a bar chart signals: these are distinct categories with no in-between.

So the touching vs. gap distinction isn’t cosmetic — it tells the viewer whether the x-axis is a continuous scale (histogram) or a set of unrelated categories (bar chart).


Question 2 — Apply It (C5)

A sports analytics team collects data on professional soccer players. For each player, they record:

  • Distance run per game
  • Goals scored
  • Position

Part A: Which graph would best display the distribution of distance run per game across all players?

Part B: Which graph would best show the relationship between distance run and goals scored across all players?

Part C: Which graph would best display the number of players in each position?


Question 3 — Find the Error (C6, C2)

A student creates a histogram to display the following frequency table for the height (cm) of 30 plants:

| Height class (cm) | Frequency |
|---|---|
| 10–19 | 4 |
| 20–29 | 9 |
| 30–49 | 11 |
| 50–59 | 6 |

The student draws four bars — all the same width — with heights proportional to the frequencies 4, 9, 11, and 6. They claim the third bar (30–49) is the most frequent class because it is the tallest.

Identify and explain the error.

Show full explanation

The 30–49 class has a width of 20 cm, while the 10–19, 20–29, and 50–59 classes each have a width of 10 cm. The 30–49 class is twice as wide as the others.

In a histogram, a bar’s area — not its height — represents frequency. When class widths are equal, height and area are proportional, so height works fine. But when class widths differ, you must plot frequency density (= frequency ÷ class width) on the y-axis, not raw frequency:

For the 30–49 class: frequency density = 11 ÷ 20 = 0.55 per cm. For the 20–29 class: frequency density = 9 ÷ 10 = 0.90 per cm.

On a frequency density histogram, the 20–29 bar would be taller than the 30–49 bar, correctly reflecting that plants are more densely concentrated in the 20–29 cm range. The student’s equal-width bars with raw frequencies made the wider class look disproportionately dominant.
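The frequency density check can be sketched in a few lines of Python. The helper name `class_width` is illustrative; it counts a class such as 10–19 as 10 cm wide, matching the widths used in the explanation above.

```python
# (lower limit, upper limit, frequency) for each height class, in cm
classes = [(10, 19, 4), (20, 29, 9), (30, 49, 11), (50, 59, 6)]


def class_width(lower, upper):
    """Width of a class with integer limits, e.g. 10-19 counts as 10 cm."""
    return upper - lower + 1


# Frequency density = frequency / class width
densities = {
    f"{lo}-{hi}": freq / class_width(lo, hi)
    for lo, hi, freq in classes
}

for label, density in densities.items():
    print(f"{label}: {density:.2f} plants per cm")

tallest = max(densities, key=densities.get)
print("Tallest bar on a frequency density histogram:", tallest)
```

Running the same calculation on all four classes, not just the two compared above, confirms that 20–29 has the highest density.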


Self-Assessment

How confident are you with the material in this lesson?

Scale: from “Not confident (I need to review)” to “Very confident (I’ve got this)”

Section 8: Boss Fight

You’ve reached the Boss Fight — a substantial challenge that asks you to bring everything together. Two paths, equal in difficulty, different in approach. Choose the one that fits how you like to think.

🔬 The Analyst

You have a real dataset. Build a complete graphical summary from scratch: frequency table, histogram, and interpretation.

Best for: students who like working with numbers and computing things step by step

🏗️ The Architect

A company’s annual report contains three graphs with design flaws. Identify the flaws, explain why they mislead, and propose corrections.

Best for: students who like critical thinking, design, and finding what’s wrong with someone else’s work

🔬 Path A: The Analyst

A school board collected data on the number of books read by each of 25 students over a summer reading program. The raw data is:

8, 3, 12, 5, 9, 7, 11, 4, 6, 10, 8, 14, 3, 7, 9, 6, 13, 5, 8, 11, 7, 4, 10, 6, 9

Part 1 — Build the Frequency Distribution

Use 4 equal-width classes starting at 2 (classes: 2–4, 5–7, 8–10, 11–14, noting the last class is slightly wider). Wait — there’s a problem here. What issue do you notice about the class widths I proposed? Fix it before building the table.

Hint: What’s wrong with the proposed classes?

The classes 2–4, 5–7, 8–10 each span 3 values, but 11–14 spans 4 values, so the class widths are unequal. Keeping width 3 and starting at 2 would give 2–4, 5–7, 8–10, 11–13, which leaves out the maximum value, 14. The cleanest solution is to start at the minimum: with min = 3 and max = 14 the data span 12 values, which split evenly into 4 classes of width 3: 3–5, 6–8, 9–11, 12–14.

Build the complete frequency table (f, f_r, cf, cf_r) using classes 3–5, 6–8, 9–11, 12–14.

Show Solution — Part 1

Step 1: Tally the data into classes.

  • 3–5: values 3, 5, 4, 3, 5, 4 → f = 6
  • 6–8: values 8, 7, 6, 8, 7, 6, 8, 7, 6 → f = 9
  • 9–11: values 9, 11, 10, 9, 11, 10, 9 → f = 7
  • 12–14: values 12, 14, 13 → f = 3

Check: 6 + 9 + 7 + 3 = 25 ✓

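| Class | f | f_r | cf | cf_r |
|---|---|---|---|---|
| 3–5 | 6 | 0.24 | 6 | 0.24 |
| 6–8 | 9 | 0.36 | 15 | 0.60 |
| 9–11 | 7 | 0.28 | 22 | 0.88 |
| 12–14 | 3 | 0.12 | 25 | 1.00 |
| Total | 25 | 1.00 | | |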
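If you want to check your tally by machine, the table can be rebuilt from the raw data with a short script. This is one possible sketch; the variable names are illustrative.

```python
# Raw data: books read by each of the 25 students
books = [8, 3, 12, 5, 9, 7, 11, 4, 6, 10, 8, 14, 3,
         7, 9, 6, 13, 5, 8, 11, 7, 4, 10, 6, 9]
classes = [(3, 5), (6, 8), (9, 11), (12, 14)]  # inclusive class limits

n = len(books)
rows = []
cf = 0  # running cumulative frequency
for lo, hi in classes:
    f = sum(lo <= x <= hi for x in books)  # absolute frequency
    cf += f
    rows.append((f"{lo}-{hi}", f, f / n, cf, cf / n))

print(f"{'Class':<7}{'f':>4}{'f_r':>7}{'cf':>5}{'cf_r':>7}")
for label, f, f_r, cum_f, cum_f_r in rows:
    print(f"{label:<7}{f:>4}{f_r:>7.2f}{cum_f:>5}{cum_f_r:>7.2f}")
```

The last cumulative frequency should equal n, and the last cumulative relative frequency should equal 1.00; both are quick sanity checks on any frequency table.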

Part 2 — Describe the Histogram

Based on your frequency table, describe what the histogram would look like without drawing it. Address:

a. Which class is the modal class (most frequent)?
b. Describe the overall shape: where does the distribution peak, and how do the bars change on each side of the peak?
c. What percentage of students read fewer than 9 books?

Show Solution — Part 2

(a) Modal class: 6–8 books, with f = 9 (the highest frequency).

(b) Shape: The distribution peaks at the 6–8 class (f = 9, the highest frequency). From the peak, bars decrease in both directions — dropping to f = 7 in the 9–11 class and f = 3 in the 12–14 class on the right, and to f = 6 in the 3–5 class on the left. The right side drops off more gradually than the left, with a thin tail extending to 12–14 books. Most students cluster in the lower-to-middle range, with fewer reading large numbers of books.

(c) Students reading fewer than 9 books: “Fewer than 9” means the 3–5 and 6–8 classes. The cumulative frequency (cf) at the 6–8 class is 15, so $cf_r = 15/25 = 0.60$, or 60% of students.


Part 3 — Interpretation

The school board’s goal was to encourage reading. Based on your analysis, write 2–3 sentences summarizing what the data tells the board — what does the distribution suggest about how students engaged with the program?

Show model interpretation

The modal number of books read was in the 6–8 range, and 60% of students read 8 or fewer books over the summer. Most students engaged at a moderate level, while a smaller group of highly motivated readers reached 12–14 books. The board might consider whether the program successfully motivated the middle group (6–8 books) to go further, or whether a different incentive structure could shift the peak toward higher counts.

Reflection: What was the most challenging part of this analysis? Was it the frequency table arithmetic, the shape description, or the interpretation? What would you do differently on the next dataset?

🏗️ Path B: The Architect

You’ve been hired as a data visualization consultant. A mid-sized retail company has just released its annual report, and their marketing team created three graphs to highlight key metrics. Unfortunately, each graph has a design flaw. Your job: identify each flaw, explain why it misleads, and propose a corrected version.

Graph 1 — Holiday Season Revenue

A bar chart shows quarterly revenue for the past year. The bars are:

  • Q1: $41.2M — bar is 5.5 cm tall
  • Q2: $39.8M — bar is 4.2 cm tall
  • Q3: $40.5M — bar is 4.8 cm tall
  • Q4: $58.7M — bar is 24.0 cm tall (using a 3D dollar-sign icon, twice as wide as the others)

The headline reads: “Q4 Holiday Sales — Our Best Quarter by Far!”

Identify the flaw in Graph 1

Two compounding flaws:

1. Truncated y-axis. The bars’ heights don’t start at 0 — they’re scaled only across the range ~$39M to ~$59M. This exaggerates the Q4 spike relative to Q1–Q3.

2. Distorted pictogram area. The Q4 icon is twice as wide AND taller, making its area roughly 4× larger instead of the actual 42% increase ($58.7M / $41.2M ≈ 1.42, not 4×). The visual impression of “dominance” is grossly inflated.

Fix: Use a simple bar chart (equal-width bars) with the y-axis starting at $0. The Q4 bar would still be visibly taller — it genuinely is the best quarter — but by a proportionate amount, not an eye-catching 4× visual lie.


Graph 2 — Customer Satisfaction Distribution

A histogram with 5 classes displays customer satisfaction survey scores (0–100):

  • Class 0–20: f = 8, bar width = 1 cm
  • Class 21–40: f = 12, bar width = 1 cm
  • Class 41–80: f = 30, bar width = 2 cm
  • Class 81–90: f = 25, bar width = 0.5 cm
  • Class 91–100: f = 15, bar width = 0.5 cm

The team claims “most customers are highly satisfied” pointing to the 41–80 bar being the tallest.

Identify the flaw in Graph 2

Flaw: Unequal class widths with raw frequency (not frequency density) on the y-axis.

The 41–80 class spans 40 points, while 81–90 spans only 10 points. The 41–80 bar has f = 30, but the 81–90 bar has f = 25 in a much smaller range. If you compute frequency density:

  • 41–80: 30 ÷ 40 = 0.75 per point
  • 81–90: 25 ÷ 10 = 2.5 per point

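The 81–90 class is far more densely packed with customers than the 41–80 class. High scores are in fact where customers concentrate most densely (40 of the 90 respondents, about 44%, scored above 80), so the team’s claim is closer to the truth than their own chart shows. The original histogram hides this because it doesn’t account for class width.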

Fix: Redesign with equal class widths (e.g., 0–19, 20–39, 40–59, 60–79, 80–100) or use frequency density on the y-axis. The story changes from “middle range is most common” to “high satisfaction is most densely concentrated.”


Graph 3 — Market Share Pie Chart

A 3D tilted pie chart shows market share across 7 product categories. The nearest slice (Electronics, 18%) appears to take up roughly 30% of the visual area due to the 3D tilt. The far slices (Furniture 17%, Clothing 16%) appear tiny. The chart has no data labels — only a legend.

Identify the flaw in Graph 3

Two flaws:

1. 3D tilt distorts area. The perspective projection inflates the slices closest to the viewer and compresses those in the back. This makes the Electronics slice look dominant when it is only 1 percentage point larger than Furniture and 2 percentage points larger than Clothing. A 3D pie chart almost always misrepresents proportions.

2. Too many slices without labels. Seven slices sharing a legend (not label-per-slice) forces viewers to cross-reference colours, making accurate reading difficult. With 7 categories, a bar chart would be clearer and more honest.

Fix: Replace with a flat 2D bar chart ordered by market share (highest to lowest), with percentage labels on each bar. This eliminates both the 3D distortion and the legend confusion.

Reflection: Of the three graphs, which flaw do you think is the most common in real-world published reports? Which is the easiest to accidentally create without intending to mislead? How would you communicate these issues to the marketing team without sounding accusatory?

Section 9: Challenge Problems

Optional stretch material. These problems go beyond the lesson objectives. They’re here if you’re curious, ambitious, or just enjoy a harder challenge. None of the material below is required for DS-3 — but C1 (the ogive) will reappear in DS-5.

Challenge 1 — The Ogive (C1 + extension)

An ogive (pronounced “OH-jive”) is a graph of the cumulative frequency or cumulative relative frequency. Instead of bars, it connects points with a smooth curve — the x-value of each point is the upper class boundary and the y-value is the cumulative frequency up to that boundary.

Use the frequency table from Example 1 (Section 4) to build a cumulative relative frequency ogive:

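| Class | Upper boundary | Cumulative relative frequency (cf_r) |
|---|---|---|
| 1–2 | 2.5 | 0.10 |
| 3–4 | 4.5 | 0.40 |
| 5–6 | 6.5 | 0.75 |
| 7–8 | 8.5 | 0.95 |
| 9–10 | 10.5 | 1.00 |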

The ogive starts at (0.5, 0.00) — the lower boundary of the first class with cumulative frequency 0 — and ends at (10.5, 1.00).

Question: Using the ogive, estimate the value below which approximately 50% of the observations fall (the median). Draw a horizontal line at $cf_r = 0.50$, find where it intersects the ogive, and read off the x-value.

Show Solution

Between the 3–4 class ($cf_r = 0.40$) and the 5–6 class ($cf_r = 0.75$), the cumulative relative frequency passes through 0.50. Linear interpolation between the two points (4.5, 0.40) and (6.5, 0.75):

$$x = 4.5 + \frac{0.50 - 0.40}{0.75 - 0.40} \times (6.5 - 4.5) = 4.5 + \frac{0.10}{0.35} \times 2 \approx 5.07$$

So approximately 50% of batches had 5 or fewer defects; the estimated median is about 5.07 defects. You’ll formalize percentile estimation like this in DS-5.

Preview: The ogive is the graphical tool for reading percentiles directly. The 25th percentile (Q1) corresponds to $cf_r = 0.25$, the 75th (Q3) to $cf_r = 0.75$, and the 50th (Q2 = median) to $cf_r = 0.50$. You’ll use this in DS-5 when studying position in a distribution.
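Reading a value off the ogive amounts to linear interpolation between two neighbouring points. The sketch below does this for the table in this challenge; the function name `value_at` is illustrative, and it assumes the ogive is drawn with straight segments between the class boundaries.

```python
# Ogive points: (class boundary, cumulative relative frequency)
ogive = [(0.5, 0.00), (2.5, 0.10), (4.5, 0.40),
         (6.5, 0.75), (8.5, 0.95), (10.5, 1.00)]


def value_at(target, points):
    """Linearly interpolate the x-value where the ogive reaches `target`."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= target <= y1 and y1 > y0:
            return x0 + (target - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("target must lie between 0 and 1")


median = value_at(0.50, ogive)
print(f"Estimated median: {median:.2f}")
```

Calling the same function with 0.25 or 0.75 would give the corresponding quartile estimates.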


Challenge 2 — Does Bin Width Matter? (C2)

Here are two histograms of the same dataset: ages at first employment for 40 recent graduates, ranging from 18 to 35 years. The only difference is the number of classes used.

Histogram A: 3 classes (width = 6)

[Histogram. X-axis: Age (years), with ticks at 18, 24, 30, 36.]

Shape appears: roughly symmetric peak at 24–29

Histogram B: 6 classes (width = 3)

[Histogram. X-axis: Age (years), with ticks at 18, 24, 30, 36.]

Shape appears: peak at 24–26, longer tail toward higher ages

Answer the following:

a. Both histograms display the same 40 data values. Why do they appear to tell different stories about the shape of the distribution?
b. Which histogram would you trust more, and why?
c. What would happen to the histogram if you used 18 classes (width = 1 year each)?

Show Solution

(a) The bin width controls how much detail is visible. With 3 wide classes, the 24–29 class lumps together two patterns that Histogram B separates: a high cluster around 24–26 and a secondary peak around 27–29. Wide bins smooth out the distribution and can make an uneven concentration look balanced. Narrow bins reveal the true pattern of where values cluster.

(b) Histogram B (6 classes) is generally more trustworthy here, as it reveals more detail about where values concentrate. However, there’s a tradeoff: too many bins and each bar represents only 1–2 observations, making random variation (noise) look like a pattern. The choice of bin width requires judgment about the sample size and the question being asked. A common rule of thumb (Sturges’ rule) is $k = 1 + \log_2 n$, which gives $k \approx 6.3$ for n = 40, suggesting 6–7 bins is appropriate.

(c) With 18 classes (each 1 year wide), n = 40 gives an average of about 2 observations per bar. The histogram would be jagged and noisy — some bars would be empty, and the overall pattern would be obscured by sampling variability. Too few bins hides detail; too many bins creates false patterns. The right bin count balances signal and noise.
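Sturges' rule, in its usual form k = 1 + log2(n), is easy to evaluate directly. The sketch below applies it to n = 40 and also shows the average number of observations per bar for the three bin counts discussed in this challenge. The function name is illustrative.

```python
import math


def sturges_bins(n):
    """Suggested number of classes under Sturges' rule: k = 1 + log2(n)."""
    return 1 + math.log2(n)


n = 40
k = sturges_bins(n)
print(f"Sturges' rule for n = {n}: k = {k:.2f}")

# Average observations per bar for each bin count in the challenge
for bins in (3, 6, 18):
    print(f"{bins:>2} classes: about {n / bins:.1f} observations per bar")
```

The rule is only a starting point; the right bin count still depends on the data and the question being asked.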


Challenge 3 — The Double Y-Axis Debate (C6)

A common graph type in business and journalism is the dual y-axis plot (also called a secondary axis chart). It overlays two different variables on the same graph, with one y-axis on the left and a different y-axis on the right.

Example: A financial news article shows monthly ice cream sales (in thousands of units, left y-axis) and monthly drowning rates per 100,000 (right y-axis) on the same graph. The two lines move almost perfectly together. The article implies this suggests a causal link.

a. Why is this graph potentially misleading, independent of the correlation–causation issue?
b. Under what conditions is a dual y-axis graph a legitimate and useful tool?
c. What is the underlying statistical error in claiming ice cream sales cause drowning rates?

Show Solution

(a) Why the graph is misleading by design: The scales on the two y-axes are chosen independently by the designer. By rescaling either axis, you can make the two lines align perfectly (suggesting correlation) or diverge completely (suggesting no relationship) — with the same data. The visual impression of correlation is entirely a function of the axis scaling choices, not the data. This makes dual y-axis charts inherently subjective and easy to manipulate.

(b) When dual y-axes are legitimate: They can be useful when two variables are measured in genuinely different units and both are relevant to the same story (e.g., overlaying temperature in °C and precipitation in mm on a climate plot, where both axes are clearly labelled and the reader understands they cannot be compared directly). The key conditions: clearly labelled axes, no implication of a direct comparison between the two y-scales, and no manipulation of scale to create a false impression of alignment.

(c) Correlation ≠ causation — the confounding variable: Both ice cream sales and drowning rates are driven by a third variable: summer heat. Hot weather increases both ice cream consumption and the number of people swimming (which creates more opportunities to drown). This is a classic confounding variable (sometimes called a lurking variable). When a third variable causes both variables to change together, a strong correlation can appear even if the two variables have no direct causal link. You’ll study this formally in REG-1 (Correlation Analysis).
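The point in part (a) can be made concrete with a small calculation. On a dual-axis chart, each value is drawn at a height determined by its own axis limits, so the designer's choice of limits decides whether the two lines overlap. All numbers below are hypothetical, invented purely for illustration.

```python
def plotted_height(values, axis_min, axis_max):
    """Fraction of the axis height at which each value is drawn."""
    return [(v - axis_min) / (axis_max - axis_min) for v in values]


# Hypothetical data, not taken from any real source.
sales = [20, 35, 50]   # left axis: ice cream sales (thousands of units)
drownings = [2, 3, 4]  # right axis: drownings per 100,000

left = plotted_height(sales, 20, 50)
right_tight = plotted_height(drownings, 2, 4)  # right axis hugs the data
right_wide = plotted_height(drownings, 0, 40)  # right axis starts at 0, wide range

print("Sales line heights:           ", left)
print("Drownings, tight right axis:  ", right_tight)
print("Drownings, wide right axis:   ", right_wide)
```

With the tight axis the two lines sit exactly on top of each other; with the wide axis the drowning line is almost flat near the bottom of the chart. The data are identical in both cases, and only the axis limits changed.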

Section 10: Solutions Reference

Complete, step-by-step solutions for all problems in Sections 5–9 are available on the solutions page. Solutions include worked arithmetic, common mistakes to watch for, and interpretation guidance.


If you’re stuck: Re-read the relevant Core Concept in Section 3, then find the Worked Example that maps to that concept (e.g., Example 1 maps to Concept 1). The solutions page shows the reasoning behind every step, not just the final answer.

Quick-Reference Formulas

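Class Width (for frequency distributions): $\text{class width} = \dfrac{\text{maximum} - \text{minimum}}{\text{number of classes}}$ (Always round UP to a convenient number)

Relative Frequency: $f_r = \dfrac{f}{n}$, where $f$ is the class frequency and $n$ is the total number of observations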

| Graph Type | Best Used For | Key Feature |
|---|---|---|
| Histogram | Quantitative, continuous data | Bars touch (shows continuity) |
| Bar Chart | Qualitative, categorical data | Bars do not touch |
| Pareto Chart | Categorical, finding largest factor | Bars sorted descending by frequency |
| Pie Chart | Parts of a whole (relative freq) | Angles proportional to frequency |

| Common Misleading Features | Why it’s a problem |
|---|---|
| Y-axis not starting at 0 | Exaggerates small differences (mostly an issue for bar charts) |
| 3D effects / perspective | Distorts areas and makes values hard to read |
| Pictograms without proportional area | Changing 1D height usually changes 2D area, overstating differences |