Simple linear regression: formalizing concepts

Notes and in-class exercises

Notes

Download the .qmd template file for this activity here.

Learning goals

By the end of this lesson, you should be able to:

  • Differentiate between a response / outcome variable and a predictor / explanatory variable
  • Write a model formula for a simple linear regression model with a quantitative predictor
  • Write R code to fit a linear regression model
  • Interpret the intercept and slope coefficients in a simple linear regression model with a quantitative predictor
  • Compute expected / predicted / fitted values and residuals from a linear regression model formula
  • Interpret predicted values and residuals in the context of the data
  • Explain the connection between residuals and the least squares criterion

Readings and videos

Choose either the reading or the videos to go through before class.

File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.

Exercises

Context: Today we’ll explore data from Capital Bikeshare, a company in Washington DC. Our main goal will be to explore daily ridership among registered users, as measured by riders_registered. Read in the data below.

# Load packages and import data
library(readr)
library(ggplot2)
library(dplyr)

bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv")

Exercise 1: Get to know the data

Create a new code chunk to look at the first few rows of the data and learn how much data (in terms of cases and variables) we have.

  1. What does a case represent?

  2. How many and what kinds of variables do we have?

  3. Thinking about the who, what, when, where, why, and how of this data, which of the 5W’s + H seem most relevant to our investigations? Explain your thoughts.

Exercise 2: Get to know the outcome/response variable

Let’s get acquainted with the riders_registered variable.

  • Construct an appropriate plot to visualize the distribution of this variable, and compute appropriate numerical summaries.

  • Write a good paragraph interpreting the plot and numerical summaries.

Exercise 3: Explore the relationship between ridership and temperature

We’d like to understand how daily ridership among registered users relates with the temperature that it feels like that day (temp_feel).

  1. What type of plot would be appropriate to visualize this relationship? Sketch and describe what you expect this plot to look like.

  2. Create an appropriate plot using ggplot(). How does the plot compare to what you predicted?

  3. Add the following two lines after your plot to add a linear (blue) and curved (red) smoothing line. What do you notice? Is a simple linear regression model appropriate for this data?

# Add a red straight line of best fit and a blue curve of best fit
YOUR_PLOT +
    geom_smooth(method = "lm", color = "red", se = FALSE) +
    geom_smooth(color = "blue", se = FALSE)

Exercise 4: Filtering our data

The relationship between registered riders and temperature looks linear below 80 degrees. We can use the filter() function from the dplyr package to subset our cases. (We’ll learn techniques soon for handling this nonlinear relationship.)

If we wanted to only keep cases where registered ridership was greater than 2000, we would use the following code:

# The %>% is called a "pipe" and feeds what comes before it
# into what comes after (bikes data is "fed into" the filter() function)
NEW_DATASET_NAME <- bikes %>% 
    filter(riders_registered > 2000)

Adapt the example above to create a new dataset called bikes_sub that only keeps cases where the felt temperature is less than 80 degrees.

Exercise 5: Model fitting and coefficient interpretation

Let’s fit a simple linear regression model and examine the results. Step through code chunk slowly, and make note of new code.

# Construct and save the model as bike_mod
# What's the purpose of "riders_registered ~ temp_feel"?
# What's the purpose of "data = bikes_sub"?
bike_mod <- lm(riders_registered ~ temp_feel, data = bikes_sub)
# A long summary of the model stored in bike_mod
summary(bike_mod)
# A simplified model summary
coef(summary(bike_mod))
  1. Using the model summary output, complete the following model formula:
    E[riders_registered | temp_feel] = ___ + ___ * temp_feel

  2. Interpret the intercept in terms of the data context. Make sure to use non-causal language, include units, and talk about averages rather than individual cases. Is the intercept meaningful in this situation?

  3. Interpret the slope in terms of the data context. Make sure to use non-causal language, include units, and talk about averages rather than individual cases.

Exercise 6: Predictions and residuals

On August 17, 2012, the temp_feel was 53.816 degrees and there were 5665 riders. We can get data for this day using the filter() and select() dplyr functions. Note, but don’t worry about the syntax – we haven’t learned this yet:

bikes_sub %>% 
    filter(date == "2012-08-17") %>% 
    select(riders_registered, temp_feel) 
  1. Peak back at the scatterplot. Identify which point corresponds to August 17, 2012. Is it close to the trend? Were there more riders than expected or fewer than expected?

  2. Use your model formula from the previous exercise to predict the ridership on August 17, 2012 from the temperature on that day. (That is, where do days with this temperature fall on the model trend line? How many registered riders would we expect on a 53.816 degree day?)

  3. Check your part b calculation using the predict() function. Take careful note of the syntax – there’s a lot going on!

# What is the purpose of newdata = ___???
predict(bike_mod, newdata = data.frame(temp_feel = 53.816))
  1. Calculate the residual or prediction error. How far does the observed ridership fall from the model prediction?

    residual = observed y - predicted y = ???

  2. Are positive residuals above or below the trend line? When we have positive residuals, does the model over- or under-estimate ridership? Repeat these questions for negative residuals.

  3. For an 85 degree day, how many registered riders would we expect? Do you think it’s a good idea to make this prediction? (Revisit the visualization and filtering we did in Exercises 3 and 4.)

Exercise 7: Changing temperature units (CHALLENGE)

Suppose we had measured temperature in degrees Celsius rather than degrees Fahrenheit. How do you think our intercept and slope estimates, and their coefficient interpretations, would change?

Reflection

Statistics is a particular kind of language and collection of tools for channeling curiosity to improve our world.

Review the learning objectives at the top of this file and the flow of today’s activity. How do the concepts we practiced today facilitate curiosity?

Render your work

  • Click the “Render” button in the menu bar for this pane (blue arrow pointing right). This will create an HTML file containing all of the directions, code, and responses from this activity. A preview of the HTML will appear in the browser.
  • Scroll through and inspect the document to check that your work translated to the HTML format correctly.
  • Close the browser tab.
  • Go to the “Background Jobs” pane in RStudio and click the Stop button to end the rendering process.
  • Navigate to your “Activities” subfolder within your “STAT155” folder and locate the HTML file. You can open it again in your browser to double check.

Additional Practice

Exercise 8: Ridership and windspeed

Let’s pull together everything that you’ve practiced in the preceding exercises to investigate the relationship between riders_registered and windspeed. Go back to using the bikes dataset (instead of bikes_sub) because we no longer need to only keep days less than 80 degrees.

# Construct and interpret a visualization of this relationship
# Include a representation of the relationship trend


# Use lm to construct a model of riders_registered vs windspeed
# Save this as bike_mod2


# Get a short summary of this model
  1. Summarize your observations from the visualizations.

  2. Write out a formula for the model trend.

  3. Interpret both the intercept and the windspeed coefficient. (Note: What does a negative slope indicate?)

  4. Use this model to predict the ridership on August 17, 2012 and calculate the corresponding residual. (Note: You’ll first need to find the windspeed on this date!)

Exercise 9: Data drills (filter, select, summarize)

This exercise is designed to help you keep building your dplyr skills. These skills are important to data cleaning and digging, which in turn is important to really making meaning of our data. We’ll work with a simpler set of 10 data points:

new_bikes <- bikes %>% 
    select(date, temp_feel, humidity, riders_registered, day_of_week) %>% 
    head(10)

Verb 1: summarize

Thus far, in the dplyr grammar you’ve seen 3 verbs or action words: summarize(), select(), filter(). Try out the following code and then summarize the point of the summarize() function:

new_bikes %>% 
    summarize(mean(temp_feel), mean(humidity))

Verb 2: select

Try out the following code and then summarize the point of the select() function:

new_bikes %>%
    select(date, temp_feel)
new_bikes %>% 
    select(-date, -temp_feel)

Verb 3: filter

Try out the following code and then summarize the point of the filter() function:

new_bikes %>% 
    filter(riders_registered > 850)
new_bikes %>% 
    filter(day_of_week == "Sat")
new_bikes %>% 
    filter(riders_registered > 850, day_of_week == "Sat")

Exercise 10: Your turn

Use dplyr verbs to complete each task below.

# Keep only information about the humidity and day of week

# Keep only information about the humidity and day of week using a different approach

# Keep only information for Sundays

# Keep only information for Sundays with temperatures below 50

# Calculate the maximum and minimum temperatures







Solutions

Exercise 1: Get to know the data

dim(bikes)
[1] 731  15
head(bikes)
# A tibble: 6 × 15
  date       season  year month day_of_week weekend holiday temp_actual
  <date>     <chr>  <dbl> <chr> <chr>       <lgl>   <chr>         <dbl>
1 2011-01-01 winter  2011 Jan   Sat         TRUE    no             57.4
2 2011-01-02 winter  2011 Jan   Sun         TRUE    no             58.8
3 2011-01-03 winter  2011 Jan   Mon         FALSE   no             46.5
4 2011-01-04 winter  2011 Jan   Tue         FALSE   no             46.8
5 2011-01-05 winter  2011 Jan   Wed         FALSE   no             48.7
6 2011-01-06 winter  2011 Jan   Thu         FALSE   no             47.1
# ℹ 7 more variables: temp_feel <dbl>, humidity <dbl>, windspeed <dbl>,
#   weather_cat <chr>, riders_casual <dbl>, riders_registered <dbl>,
#   riders_total <dbl>
  1. A case represents a day of the year.
  2. We have 15 variables broadly concerning weather, day of week information, whether the day is a holiday.
  3. Lots of answers are reasonable here! When and where seem to be particularly relevant because this is for a rideshare based in Washington DC with data from 2011-2012. Ridership likely changes a lot from city to city and over time.

Exercise 2: Get to know the outcome/response variable

The distribution of the riders_registered variable looks fairly symmetric. On average there are about 3600 registered riders per day (mean = 3656, median = 3662). On any given day, the number of registered riders is about 1560 from the mean. There seem to be a small number of low outliers (minimum ridership was 20).

ggplot(bikes, aes(x = riders_registered)) +
    geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(bikes, aes(y = riders_registered)) +
    geom_boxplot()

summary(bikes$riders_registered)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     20    2497    3662    3656    4776    6946 
bikes %>% 
    summarize(sd(riders_registered))
# A tibble: 1 × 1
  `sd(riders_registered)`
                    <dbl>
1                   1560.

Exercise 3: Explore the relationship between ridership and temperature

We’d like to understand how daily ridership among registered users relates with the temperature that it feels like that day (temp_feel).

  1. Scatterplot (outcome and predictor are both quantitative)
ggplot(bikes, aes(x = temp_feel, y = riders_registered)) +
    geom_point()

  1. If we only displayed the red line of best fit on the plot, we might miss the slight downward trend at the highest temperatures that we can see more clearly with the blue curve of best fit. A linear model is not appropriate if fit to the whole range of the data, but there does seem to be a linear relationship between ridership and temperature below 80 degrees Fahrenheit.
# Add a red straight line of best fit and a blue curve of best fit
ggplot(bikes, aes(x = temp_feel, y = riders_registered)) +
    geom_point() +
    geom_smooth(method = "lm", color = "red", se = FALSE) +
    geom_smooth(color = "blue", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Exercise 4: Filtering our data

# The %>% is called a "pipe" and feeds what comes before it
# into what comes after (bikes data is "fed into" the filter() function)
bikes_sub <- bikes %>% 
    filter(temp_feel < 80)

Exercise 5: Model fitting and coefficient interpretation

Let’s fit a simple linear regression model and examine the results. Step through code chunk slowly, and make note of new code.

# Construct and save the model as bike_mod
# What's the purpose of "riders_registered ~ temp_feel"?
# What's the purpose of "data = bikes_sub"?
bike_mod <- lm(riders_registered ~ temp_feel, data = bikes_sub)
# A long summary of the model stored in bike_mod
summary(bike_mod)

Call:
lm(formula = riders_registered ~ temp_feel, data = bikes_sub)

Residuals:
    Min      1Q  Median      3Q     Max 
-3681.8  -928.3   -98.6   904.9  3496.7 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2486.412    421.379  -5.901 7.37e-09 ***
temp_feel      86.493      6.464  13.380  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1267 on 428 degrees of freedom
Multiple R-squared:  0.2949,    Adjusted R-squared:  0.2933 
F-statistic:   179 on 1 and 428 DF,  p-value: < 2.2e-16
# A simplified model summary
coef(summary(bike_mod))
               Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -2486.41180 421.379174 -5.900652 7.368345e-09
temp_feel      86.49251   6.464247 13.380135 2.349753e-34
  1. E[riders_registered | temp_feel] = -2486.41180 + 86.49251 * temp_feel

  2. Intercept interpretation: On days that feel like 0 degrees Fahrenheit, we can expect an average of -2486.41180 riders—a negative number of riders doesn’t make sense! This results because of extrapolation—0 degrees is so far below the minimum temperature in the data. We only have information on the relationship between ridership and temperature in the ~40-100 degree range and have no idea what that relationship looks like outside that range.

  3. Slope interpretation: Every 1 degree increase in feeling temperature is associated with an average of about 86 more riders.

Exercise 6: Predictions and residuals

On August 17, 2012, the temp_feel was 53.816 degrees and there were 5665 riders. We can get data for this day using the filter() and select() dplyr functions. Note, but don’t worry about the syntax – we haven’t learned this yet:

bikes_sub %>% 
    filter(date == "2012-08-17") %>% 
    select(riders_registered, temp_feel) 
# A tibble: 1 × 2
  riders_registered temp_feel
              <dbl>     <dbl>
1              5665      53.8
  1. More riders than expected – the point is far above the trend line

  2. -2486.41180 + 86.49251 * 53.816 = 2168.269

  3. We get the same result with predict():

# What is the purpose of newdata = ___???
predict(bike_mod, newdata = data.frame(temp_feel = 53.816))
       1 
2168.269 
  1. residual = 5665 - 2168.269 = 3496.731. On August 17, 2012, there were 3496.731 more riders than would be expected from our model.

    • Positive residuals are above the trend line—we under-estimate ridership.
    • Negative residuals are below the trend line—we over-estimate ridership.
  2. On an 85 degree day, we would predict 4865.452 riders. Even though we can compute this prediction, it’s not a good idea because of extrapolation–the data that we used to fit our model was filtered to days less than 80 degrees.

-2486.41180 + 86.49251 * 85
[1] 4865.452
predict(bike_mod, newdata = data.frame(temp_feel = 85))
       1 
4865.451 

Exercise 7: Changing temperature units (CHALLENGE)

If we had measured temperature in degrees Celsius rather than degrees Fahrenheit, both the intercept and slope should change. The intercept would now represent 0 degrees Celsius (32 degrees Fahrenheit) and a one unit change in temperature is now 1 degree Celsius (1.8 degrees Fahrenheit).

Exercise 8: Ridership and windspeed

Let’s pull together everything that you’ve practiced in the preceding exercises to investigate the relationship between riders_registered and windspeed. Go back to using the bikes dataset (instead of bikes_sub) because we no longer need to only keep days less than 80 degrees.

# Construct and interpret a visualization of this relationship
# Include a representation of the relationship trend
ggplot(bikes, aes(x = windspeed, y = riders_registered)) + 
    geom_point() + 
    geom_smooth(method = "lm", color = "red", se = FALSE) +
    geom_smooth(color = "blue", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Use lm to construct a model of riders_registered vs windspeed
# Save this as bike_mod2
bike_mod2 <- lm(riders_registered ~ windspeed, data = bikes)

# Get a short summary of this model
coef(summary(bike_mod2))
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 4490.09761  149.65992 30.002005 2.023179e-129
windspeed    -65.34145   10.86299 -6.015053  2.844453e-09
  1. There’s a weak, negative relationship – ridership tends to be smaller on windier days.

  2. E[riders_registered | windspeed] = 4490.09761 - 65.34145 windspeed

    • Intercept: On days with no wind, we’d expect around 4490 riders. (0 is a little below the minimum of the observed data, but not by much! So extrapolation in interpreting the intercept isn’t a huge concern.)
    • Slope: Every 1mph increase in windspeed is associated with a ridership decrease of 65 riders on average.
  3. See the code below to predict ridership on August 17, 2012 and calculate the corresponding residual. Note that this residual is smaller than the residual from the temperature model (that residual was 3496.731). This indicates that August 17 was more of an outlier in ridership given the temperature than the windspeed.

bikes %>% 
    filter(date == "2012-08-17") %>% 
    select(riders_registered, windspeed)
# A tibble: 1 × 2
  riders_registered windspeed
              <dbl>     <dbl>
1              5665      15.5
# prediction
4490.09761 - 65.34145 * 15.50072
[1] 3477.258
# residual 
5665 - 3477.258
[1] 2187.742

Exercise 9: Data drills (filter, select, summarize)

This exercise is designed to help you keep building your dplyr skills. These skills are important to data cleaning and digging, which in turn is important to really making meaning of our data. We’ll work with a simpler set of 10 data points:

new_bikes <- bikes %>% 
    select(date, temp_feel, humidity, riders_registered, day_of_week) %>% 
    head(10)

Verb 1: summarize

summarize() calculates numerical summaries of variables (columns).

new_bikes %>% 
    summarize(mean(temp_feel), mean(humidity))
# A tibble: 1 × 2
  `mean(temp_feel)` `mean(humidity)`
              <dbl>            <dbl>
1              52.0            0.544

Verb 2: select

select() selects variables (columns).

new_bikes %>%
    select(date, temp_feel)
# A tibble: 10 × 2
   date       temp_feel
   <date>         <dbl>
 1 2011-01-01      64.7
 2 2011-01-02      63.8
 3 2011-01-03      49.0
 4 2011-01-04      51.1
 5 2011-01-05      52.6
 6 2011-01-06      53.0
 7 2011-01-07      50.8
 8 2011-01-08      46.6
 9 2011-01-09      42.5
10 2011-01-10      45.6
new_bikes %>% 
    select(-date, -temp_feel)
# A tibble: 10 × 3
   humidity riders_registered day_of_week
      <dbl>             <dbl> <chr>      
 1    0.806               654 Sat        
 2    0.696               670 Sun        
 3    0.437              1229 Mon        
 4    0.590              1454 Tue        
 5    0.437              1518 Wed        
 6    0.518              1518 Thu        
 7    0.499              1362 Fri        
 8    0.536               891 Sat        
 9    0.434               768 Sun        
10    0.483              1280 Mon        

Verb 3: filter

filter() keeps only days (rows) that meet the given condition(s).

new_bikes %>% 
    filter(riders_registered > 850)
# A tibble: 7 × 5
  date       temp_feel humidity riders_registered day_of_week
  <date>         <dbl>    <dbl>             <dbl> <chr>      
1 2011-01-03      49.0    0.437              1229 Mon        
2 2011-01-04      51.1    0.590              1454 Tue        
3 2011-01-05      52.6    0.437              1518 Wed        
4 2011-01-06      53.0    0.518              1518 Thu        
5 2011-01-07      50.8    0.499              1362 Fri        
6 2011-01-08      46.6    0.536               891 Sat        
7 2011-01-10      45.6    0.483              1280 Mon        
new_bikes %>% 
    filter(day_of_week == "Sat")
# A tibble: 2 × 5
  date       temp_feel humidity riders_registered day_of_week
  <date>         <dbl>    <dbl>             <dbl> <chr>      
1 2011-01-01      64.7    0.806               654 Sat        
2 2011-01-08      46.6    0.536               891 Sat        
new_bikes %>% 
    filter(riders_registered > 850, day_of_week == "Sat")
# A tibble: 1 × 5
  date       temp_feel humidity riders_registered day_of_week
  <date>         <dbl>    <dbl>             <dbl> <chr>      
1 2011-01-08      46.6    0.536               891 Sat        

Exercise 10: Your turn

Use dplyr verbs to complete each task below.

# Keep only information about the humidity and day of week
new_bikes %>% 
    select(humidity, day_of_week)
# A tibble: 10 × 2
   humidity day_of_week
      <dbl> <chr>      
 1    0.806 Sat        
 2    0.696 Sun        
 3    0.437 Mon        
 4    0.590 Tue        
 5    0.437 Wed        
 6    0.518 Thu        
 7    0.499 Fri        
 8    0.536 Sat        
 9    0.434 Sun        
10    0.483 Mon        
# Keep only information about the humidity and day of week using a different approach
new_bikes %>% 
    select(-date, -temp_feel, -riders_registered)
# A tibble: 10 × 2
   humidity day_of_week
      <dbl> <chr>      
 1    0.806 Sat        
 2    0.696 Sun        
 3    0.437 Mon        
 4    0.590 Tue        
 5    0.437 Wed        
 6    0.518 Thu        
 7    0.499 Fri        
 8    0.536 Sat        
 9    0.434 Sun        
10    0.483 Mon        
# Keep only information for Sundays
new_bikes %>% 
    filter(day_of_week == "Sun")
# A tibble: 2 × 5
  date       temp_feel humidity riders_registered day_of_week
  <date>         <dbl>    <dbl>             <dbl> <chr>      
1 2011-01-02      63.8    0.696               670 Sun        
2 2011-01-09      42.5    0.434               768 Sun        
# Keep only information for Sundays with temperatures below 50
new_bikes %>% 
    filter(day_of_week == "Sun", temp_feel < 50)
# A tibble: 1 × 5
  date       temp_feel humidity riders_registered day_of_week
  <date>         <dbl>    <dbl>             <dbl> <chr>      
1 2011-01-09      42.5    0.434               768 Sun        
# Calculate the maximum and minimum temperatures
new_bikes %>% 
    summarize(min(temp_feel), max(temp_feel))
# A tibble: 1 × 2
  `min(temp_feel)` `max(temp_feel)`
             <dbl>            <dbl>
1             42.5             64.7