The Normal model & sampling variation

Notes and in-class exercises

Notes

Download the .qmd file for this activity here.

Learning goals

  • Recognize the difference between a population parameter and a sample estimate.
  • Review the Normal probability model, a tool we’ll need to turn information in our sample data into inferences about the broader population.
  • Explore the ideas of randomness, sampling distributions, and standard error through a class experiment. (We’ll define these more formally in the next class.)

Readings and videos

Please watch the following videos and complete the reading before class.

Exercises: Review

Let’s practice some Normal model concepts.

First, run the following chunk, which defines a shaded_normal() function specialized to this activity alone:

library(dplyr)
library(ggplot2)

shaded_normal <- function(mean, sd, a = NULL, b = NULL){
  min_x <- mean - 4*sd
  max_x <- mean + 4*sd
  a <- max(a, min_x)
  b <- min(b, max_x)
  ggplot() + 
    scale_x_continuous(limits = c(min_x, max_x), breaks = c(mean - sd*(0:3), mean + sd*(1:3))) +
    stat_function(fun = dnorm, args = list(mean = mean, sd = sd)) + 
    stat_function(geom = "area", fun = dnorm, args = list(mean = mean, sd = sd), xlim = c(a, b), fill = "blue") + 
    labs(y = "density") + 
    theme_minimal()
}

Exercise 1: Using the Normal model

Suppose that the speeds of cars on a highway, in miles per hour, can be reasonably represented by the Normal model with a mean of 55mph and a standard deviation of 5mph from car to car:

\[ X \sim N(55, 5^2) \]

shaded_normal(mean = 55, sd = 5)
  1. Provide the (approximate) range of the middle 68% of speeds, and shade in the corresponding region on your Normal curve. NOTE: a is the lower end of the range and b is the upper end.
shaded_normal(mean = 55, sd = 5, a = ___, b = ___)
  2. Use the 68-95-99.7 rule to estimate the probability that a car’s speed exceeds 60mph.
# Visualize
shaded_normal(mean = 55, sd = 5, a = 60)
  3. Which of the following is the correct range for the probability that a car’s speed exceeds 67mph? Explain your reasoning.
  • less than 0.0015
  • between 0.0015 and 0.025
  • between 0.025 and 0.16
  • greater than 0.16
# Visualize
shaded_normal(mean = 55, sd = 5, a = 67)

Exercise 2: Z-scores

Underlying all of our calculations above is the question of how many standard deviations a value X falls from the mean.

This distance is called a Z-score and can be calculated as follows:

\[ \text{Z-score} = \frac{X - \text{mean}}{\text{sd}} \]

For example (from Exercise 1), if I’m traveling 40 miles an hour, my Z-score is -3. That is, my speed is 3 standard deviations below the average speed:

(40 - 55) / 5
  1. Consider 2 other drivers, both of whom are speeding. Who do you think is speeding more, relative to the distribution of speeds in their area?

    • Driver A is traveling at 60mph on the highway where speeds are N(55, 5^2) and the speed limit is 55mph.
    • Driver B is traveling at 36mph on a residential road where speeds are N(30, 3^2) and the speed limit is 30mph.
  2. Calculate the Z-scores for Drivers A and B.

# Driver A


# Driver B
  3. Now, based on the Z-scores, who is speeding more? NOTE: The plots below might provide some insight.
# Driver A
shaded_normal(mean = 55, sd = 5) + 
  geom_vline(xintercept = 60)

# Driver B
shaded_normal(mean = 30, sd = 3) + 
  geom_vline(xintercept = 36)  

Exercises: Sampling variation

The fandango dataset in the fivethirtyeight package includes information about the ratings of 146 films released in 2014-2015 across the review sites Fandango, IMDb, RottenTomatoes, and MetaCritic.

Our goal will be to model the relationship between the RottenTomatoes “Tomatometer” rating (an aggregate of official film critic scores) and the RottenTomatoes “User Rating,” which averages the ratings of anyone who uploaded their rating to the site.
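Note: the code in this activity assumes a fandango data frame with columns film, criticscore_rt, and userscore_rt. If you need to build it yourself, one possible prep step from the package data is sketched below; the renaming is an assumption based on how the variables are used later, not necessarily how the course data were prepared.

library(fivethirtyeight)
library(dplyr)

# A sketch of one possible (assumed) prep step: keep each film's title and its
# two Rotten Tomatoes scores, renamed to match the variable names used below
fandango <- fivethirtyeight::fandango %>% 
  select(film, 
         criticscore_rt = rottentomatoes, 
         userscore_rt = rottentomatoes_user)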

Exercise 3: Parameter vs estimate

It’s important to note that this dataset is by no means a comprehensive collection of films and their review scores: it does not contain every film that was released in 2014-2015, nor any films released outside of that date range. The review scores are also frozen in time: all of these films have almost certainly accumulated additional reviews since the data were first collected.

However, our stated goal is to make inferences about the overarching relationship between critic reviews and user reviews for all films (relatedly, we may want to use our model to make predictions about how user reviews are affected by critic reviews for films that may not even exist yet!). Can we actually make these inferences/predictions about a potentially infinite collection of films when all we have is a fairly limited subset of these?

Populations and Samples

This question points to two of the most important concepts in the field of statistical inference: populations and samples. Statisticians have many different ways of defining a population (depending on the questions they are asking), but for the purposes of this exercise, we can think of the population as the set of all possible films and all possible review scores that have been or could be catalogued on RottenTomatoes.

Our dataset of 146 films is considered a sample of this population. A sample is simply a subset of observations taken from that population.

Sampling Criteria

When we take a sample of data from a population, there is always some set of criteria used to determine how a sample is taken. This could be as simple as “we randomly selected 1% of all films catalogued on RottenTomatoes as of 4/1/2025”, or a more complex set of specific criteria (for this dataset, the sample was taken by selecting all films that had tickets for sale on Fandango on 8/24/2015, then further filtering to include films that have a Rotten Tomatoes rating, a RT User rating, a Metacritic score, a Metacritic User score, an IMDb score, and at least 30 fan reviews on Fandango.)

“True” parameters versus estimates

In order to conduct statistical inference using linear regression, we must assume that there is some true, underlying, fixed intercept and slope, \(\beta_0\) and \(\beta_1\), that describe the true linear relationship in the overall population that we’re interested in.

If we are modeling the relationship between UserScore and CriticScore on RottenTomatoes, the “true” underlying model we assume is:

\[ E[UserScore \mid CriticScore] = \beta_0 + \beta_1 CriticScore \]

However, the “true” values of \(\beta_0\) and \(\beta_1\) are typically impossible to know, because knowing them requires access to our entire population of interest (in this case, the review scores for every film that has been or will be released). When we fit a regression model using the sample that we do have, we are actually obtaining estimates of those true population parameters (note the notation change of putting a \(\hat{ }\) on top of the betas, to indicate that these are estimates):

\[ E[UserScore \mid CriticScore] = \hat{\beta}_0 + \hat{\beta}_1 CriticScore \]

where our estimates are given by our model as \(\hat{\beta}_0 = 32.3\) and \(\hat{\beta}_1 = 0.52\).
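These estimates come from fitting the model to all 146 films. A minimal sketch of that fit, assuming the fandango data frame described above:

# Fit the regression to the full 146-film sample and extract the estimates
full_model <- lm(userscore_rt ~ criticscore_rt, data = fandango)
coef(full_model)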

For the sake of this activity, let’s assume that these estimates are identical to the true population parameters.

🚩🚩🚩 HOWEVER, be very careful not to make this assumption in other models you encounter. For this dataset, recall the specific sampling criteria that were used: these 146 films likely aren’t representative of the full population of films we’re interested in. This means that the estimates we obtained probably don’t match the true population parameters; they may or may not be close, but we don’t know for certain! 🚩🚩🚩

Below, we’ll simulate how parameter estimates are impacted by taking different samples. You’ll each take a random sample of 10 films in the dataset, and we’ll see if we can recover the presumed population parameters (i.e., the coefficient estimates we obtained from our model using all 146 films that were initially sampled).

First, fill in your intuition below:

  • Do you think every student will get the same set of 10 films?

  • Do you think that your coefficient estimates will be the same as your neighbors’?

Exercise 4: Random sampling

  1. Use the sample_n() function to take a random sample of 2 films.
# Try running the following chunk A FEW TIMES
sample_n(fandango, size = 2, replace = FALSE)

Reflect:

  • How do your results compare to your neighbors’?

  • What is the role of size = 2? HINT: Remember you can look at function documentation by running ?sample_n in the console!

  2. Now, “set the seed” to 155 and re-try your sampling.
# Try running the following FULL chunk A FEW TIMES
set.seed(155)
sample_n(fandango, size = 2, replace = FALSE)

What changed?

Exercise 5: Take your own sample

The underlying random number generator determines which random sample we happen to get. If we call set.seed() with some positive integer before taking a random sample, we’ll get the same results every time.

This reproducibility is important:

  • we get the same results every time we render our qmd

  • we can share our work with others & ensure they get our same answers

  • it wouldn’t be great if you submitted your work to, say, a journal and weren’t able to back up / confirm / reproduce your results!
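For example, here’s a quick check (assuming the fandango data are loaded) that setting the same seed before each draw reproduces the exact same sample:

# Same seed, same sample: the two draws below are identical
set.seed(155)
first_draw <- sample_n(fandango, size = 2)

set.seed(155)
second_draw <- sample_n(fandango, size = 2)

identical(first_draw, second_draw)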

Follow the chunks below to obtain and use your own unique sample.

# DON'T SKIP THIS STEP! 
# Set the random number seed to the first six digits of your own phone number (just the numbers)
set.seed()

# Take a sample of 10 films
my_sample <- sample_n(fandango, size = 10, replace = TRUE)
my_sample                       
# Plot the relationship of UserScore with CriticScore among your sample
my_sample %>% 
  ggplot(aes(y = userscore_rt, x = criticscore_rt)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
# Model the relationship among your sample
my_model <- lm(userscore_rt ~ criticscore_rt, data = my_sample)
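# Extract just the estimates (first column of the coefficient table): intercept & slope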
coef(summary(my_model))[,1]

REPORT YOUR WORK

Log your intercept and slope sample estimates in this survey.

Exercise 6: Sampling variation

Recall that we are assuming the population parameters are equal to the estimates we obtained from the model we fit using the initial sample of 146 films:

\[ E[UserScore \mid CriticScore] = 32.3 + 0.52 CriticScore \]

Let’s explore how our sample estimates of these parameters varied from student to student:

# Import the experiment results
# NOTE: You *may* need to run `install.packages("gsheet")` first, in order for this code chunk to run!
library(gsheet)
results <- gsheet2tbl('https://docs.google.com/spreadsheets/d/11OT1VnLTTJasp5BHSKulgJiCbSLiutv8mKDOfvvXZSo/edit?usp=sharing')

Plot each student’s sample estimate of the model line (gray). How do these compare to the assumed population model (red)?

fandango %>% 
  ggplot(aes(y = userscore_rt, x = criticscore_rt)) +
  geom_abline(data = results, aes(intercept = sample_intercept, slope = sample_slope, linetype=section), color = "gray") + 
  geom_smooth(method = "lm", color = "red", se = FALSE)

Exercise 7: Sample intercepts

Let’s focus on just the sample estimates of the intercept parameter:

results %>% 
  ggplot(aes(x = sample_intercept)) + 
  geom_density() + 
  geom_vline(xintercept = 32.3, color = "red")

Comment on the shape, center, and spread of these sample estimates and how they relate to the (assumed) population intercept (red line).

Exercise 8: Slopes

Suppose we were to construct a density plot of the sample estimates of the criticscore_rt coefficient (i.e. the slopes).

  1. Intuitively, what shape do you think this plot will have?

  2. Intuitively, around what value do you think this plot will be centered?

  3. Check your intuition:

results %>% 
  ggplot(aes(x = sample_slope)) + 
  geom_density() + 
  geom_vline(xintercept = 0.52, color = "red")
  4. Thinking back to the 68-95-99.7 rule, visually approximate the standard deviation among the sample slopes.

Exercise 9: Standard error

You’ve likely observed that the typical or mean slope estimate is roughly equal to the (assumed) population slope parameter of 0.52:

results %>% 
  summarize(mean(sample_slope))

Thus the standard deviation of the slope estimates measures how far we might expect an estimate to fall from the (assumed) population slope parameter.

That is, it measures the typical or standard error in our sample estimates:

results %>% 
  summarize(sd(sample_slope))
  1. Recall your sample estimate of the slope. How far is it from the population slope, 0.52?
  2. How many standard errors does your estimate fall from the population slope? That is, what’s your Z-score?
  3. Reflecting upon your Z-score, do you think your sample estimate was one of the “lucky” ones, or one of the “unlucky” ones?

Solutions

Exercise 1: Using the Normal model

  1. .
shaded_normal(mean = 55, sd = 5, a = 50, b = 60)

  2. 16% (i.e., 32%/2)

  3. between 0.0015 and 0.025
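As a check on these rule-of-thumb answers, the exact Normal probabilities can be computed with pnorm():

# Exact probability that a speed exceeds 60mph (roughly 0.16)
pnorm(60, mean = 55, sd = 5, lower.tail = FALSE)

# Exact probability that a speed exceeds 67mph
# (roughly 0.008, which is indeed between 0.0015 and 0.025)
pnorm(67, mean = 55, sd = 5, lower.tail = FALSE)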

Exercise 2: Z-scores

  1. intuition

  2. .

# Driver A
(60 - 55) / 5
[1] 1
# Driver B
(36 - 30) / 3
[1] 2
  3. Driver B; their speed is 2 standard deviations above the mean (the speed limit), compared to only 1 for Driver A.

Exercise 3: Parameter vs estimate

intuition

Exercise 4: Random sampling

# Observe that the 2 films change every time & differ from your neighbors' samples
sample_n(fandango, size = 2, replace = FALSE)
# A tibble: 2 × 3
  film      userscore_rt criticscore_rt
  <chr>            <int>          <int>
1 Mortdecai           30             12
2 Dope                86             87
# Observe that the 2 films are the same every time & are the same as your neighbors' samples
set.seed(155)
sample_n(fandango, size = 2, replace = FALSE)
# A tibble: 2 × 3
  film            userscore_rt criticscore_rt
  <chr>                  <int>          <int>
1 Unbroken                  70             51
2 Project Almanac           46             34

Exercise 5: Take your own sample

Will vary from student to student.

Exercise 6: Sampling variation

The sample estimates vary around the population model:

# Import the experiment results
library(gsheet)
results <- gsheet2tbl('https://docs.google.com/spreadsheets/d/11OT1VnLTTJasp5BHSKulgJiCbSLiutv8mKDOfvvXZSo/edit?usp=sharing')

fandango %>% 
  ggplot(aes(y = userscore_rt, x = criticscore_rt)) +
  geom_abline(data = results, aes(intercept = sample_intercept, slope = sample_slope, linetype=section), color = "gray") + 
  geom_smooth(method = "lm", color = "red", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

Exercise 7: Sample intercepts

The intercepts are roughly normal, centered around the intercept of the larger sample (32.3), and range from roughly 15.816 to 50.161:

results %>% 
  ggplot(aes(x = sample_intercept)) + 
  geom_density() + 
  geom_vline(xintercept = 32.3, color = "red")
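The center and spread described above can also be summarized numerically from the same results data:

results %>% 
  summarize(mean(sample_intercept), sd(sample_intercept), 
            min(sample_intercept), max(sample_intercept))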

Exercise 8: Slopes

  1. intuition
  2. intuition
  3. Check your intuition:
results %>% 
  ggplot(aes(x = sample_slope)) + 
  geom_density() + 
  geom_vline(xintercept = 0.52, color = "red")

  4. Will vary, but should be roughly 0.14.

Exercise 9: Standard error

  1. For example, suppose my estimate were 0.7:
0.7 - 0.52
[1] 0.18
  2. For example, suppose my estimate were 0.7. Then my Z-score is (0.7 - 0.52) / 0.14 ≈ 1.29.

  3. This is somewhat subjective. But we’ll learn that if your estimate falls within 2 standard errors of the actual slope, i.e. your Z-score is between -2 and 2, you’re pretty “lucky”.