# Load packages and data
library(readr)
library(ggplot2)
library(dplyr)
cps <- read_csv("https://mac-stat.github.io/data/cps_2018.csv")
# Get data on just the management and transportation sectors
cps_sub <- cps %>%
  filter(industry %in% c("management", "transportation"))
Multiple linear regression: Interaction
Notes and in-class exercises
Notes
Download the .qmd file for this activity here.
Learning goals
By the end of Part 1 of this lesson, you should be able to:
- Describe when it would be useful to include an interaction term in a model
- Write a model formula for an interaction model
- Interpret the coefficients in an interaction model in the data context
By the end of Part 2 of this lesson, you should be able to:
- Visualize interactions between categorical and quantitative predictors using scatterplots and side-by-side boxplots
- Critically think through whether an interaction term makes sense, or should be included in a multiple linear regression model
- Write a model formula for a multiple linear regression model with an interaction term between two quantitative predictors, two categorical predictors, or one quantitative and one categorical predictor
- Interpret the intercept and slope coefficients in a multiple linear regression model with an interaction term
Readings and videos
Choose either the reading or the videos to go through before class.
- Reading: Section 3.9.3 in the STAT 155 Notes
- Video:
File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.
Class exploration
Guiding question: What job sectors have the highest return on education?
We’ll use data from the 2018 Current Population Survey to explore. The codebook for this data is available here. For now we’ll focus on individuals who have jobs in the management or transportation sectors to simplify our explorations.
It would be great to know the true effect of years of education on wages. Let’s start by looking at the relationship between these two variables.
ggplot(cps_sub, aes(x = education, y = wage)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
There is a positive correlation between years of education and wages, with a fair bit of spread about the line of best fit. We can fit a simple linear regression model to obtain the intercept and slope of that line.
wage_mod_1 <- lm(wage ~ education, data = cps_sub)
coef(summary(wage_mod_1))
Note that our intercept estimate is negative, which doesn’t make sense! People with 0 years of education still earn wages! Inspecting the plot, we likely could have better accounted for this with a nonlinear transformation of the education variable, but we will leave this issue aside for now.
Let’s focus on this question: does this simple linear regression model help us understand the true effect of years of education?
We’ll want to consider confounding variables in order to better answer that question. One possible confounder is industry. Draw a causal diagram showing how industry, years of education, and wage relate, and explain what unfair comparisons result from using a simple linear regression model.
Let’s adjust for industry by fitting a multiple linear regression model.
wage_mod_2 <- lm(wage ~ education + industry, data = cps_sub)
coef(summary(wage_mod_2))
Interpret the education and industrytransportation coefficients in the context of the data. (Remember to include units.) How does the relationship between years of education and wages change after adjusting for industry?
Hold on! We sped ahead too quickly. It’s important to visualize our data thoroughly first. Let’s add industry to our original scatterplot. What do you notice about the lines of best fit for these two industries?
ggplot(cps_sub, aes(x = education, y = wage, color = industry)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
It would be nice to be able to capture the different trends for the two industries. Let’s see if our original multiple linear regression model (wage_mod_2) was able to do so.
# Visualize the relationships from the fitted model
ggplot(cps_sub, aes(y = wage, x = education, color = industry)) +
geom_line(aes(y = wage_mod_2$fitted.values))
What do you notice about what our model produces? Based on your coefficient interpretations from earlier, is this behavior what you would have expected? How do you think our multiple linear regression model is limited? How might we try to fix this?
In our causal diagram, both years of education and industry affect wages, and one way to capture this is with our model in wage_mod_2:
\(E[\text{wage} \mid \text{education}, \text{industry}] = \beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation}\)
Some other ways to capture how wages are affected by years of education and industry could look like this:
- \(\beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation} + \beta_3 \text{education}^2\)
- \(\beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation} + \beta_3 \log(\text{education})\)
- \(\beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation} + \beta_3 \text{education}*\text{industrytransportation}\)
That last type of model is called an interaction model. A general interaction model formula looks like this:
\(E[Y \mid X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1*X_2\)
The outcome \(Y\) depends on \(X_1\) and \(X_2\) with the usual multiple linear regression part: \(\beta_1 X_1 + \beta_2 X_2\). But it also includes an interaction term \(\beta_3 X_1*X_2\).
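To see what the interaction term adds, it can help to plug in values for one of the predictors. As a sketch, assume \(X_2\) is a 0/1 indicator (as industrytransportation is in our data). The algebra below only rearranges the general formula; it is not output from a fitted model.

```latex
\begin{aligned}
X_2 = 0:\quad E[Y \mid X_1, X_2 = 0] &= \beta_0 + \beta_1 X_1 \\
X_2 = 1:\quad E[Y \mid X_1, X_2 = 1] &= \beta_0 + \beta_1 X_1 + \beta_2 + \beta_3 X_1 \\
&= (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1
\end{aligned}
```

Under this assumption the two groups get lines with different intercepts and different slopes. Without the \(\beta_3\) term, both lines would be forced to share the slope \(\beta_1\).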
Let’s fit an interaction model for our cps_sub data and explore what relationships our model estimates.
# Fit an interaction model
# NEW SYNTAX: Note the * instead of +
# (This overwrites the earlier wage_mod_2)
wage_mod_2 <- lm(wage ~ education * industry, data = cps_sub)
# Visualize the relationships from the interaction model
ggplot(cps_sub, aes(y = wage, x = education, color = industry)) +
geom_line(aes(y = wage_mod_2$fitted.values))
How does our new interaction model compare to our previous one?
This is a more complex new model! Let’s explore what is going on mathematically by examining the overall model formula and how we can use it to get model formulas for each industry.
# View coefficient estimates
coef(summary(wage_mod_2))
Model formulas:
E[wage | education, industry] = -65590.606 + 8678.274 education + 90232.230 transportation - 7580.228 education * transportation
Broken down by industry:
Management:
E[wage | education, industry = management] = -65590.606 + 8678.274 education
Transportation:
E[wage | education, industry = transportation] = -65590.606 + 8678.274 education + 90232.230 - 7580.228 education
= (-65590.606 + 90232.230) + (8678.274 - 7580.228) education
= 24641.624 + 1098.046 education
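The arithmetic above can be double-checked in R. This is a minimal sketch that hard-codes the four coefficient estimates from the model formula above. The names b0 to b3, mgmt_line, trans_line, and expected_wage() are hypothetical helpers introduced only for this illustration, and the choice of 16 years of education is arbitrary.

```r
# Coefficient estimates copied from the model formula above
b0 <- -65590.606  # (Intercept)
b1 <- 8678.274    # education
b2 <- 90232.230   # industrytransportation
b3 <- -7580.228   # education:industrytransportation

# Intercept and slope of each industry's line
mgmt_line  <- c(intercept = b0,      slope = b1)
trans_line <- c(intercept = b0 + b2, slope = b1 + b3)

# Hypothetical helper: expected wage at a given number of years of education
expected_wage <- function(line, education) {
  line[["intercept"]] + line[["slope"]] * education
}

# Expected wages at 16 years of education in each industry
expected_wage(mgmt_line, 16)
expected_wage(trans_line, 16)
```

With the fitted model object available, predict() with a small newdata data frame should give the same values up to rounding of the coefficients.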
Question 1: The intercept coefficient, -65590.606, corresponds to what property of the lines?
- management intercept
- transportation intercept
- how the transportation intercept compares to the management intercept
Thus, how can we interpret this coefficient in the context of the wage analysis?
Question 2: The transportation coefficient, 90232.230, corresponds to what property of the lines?
- management intercept
- transportation intercept
- how the transportation intercept compares to the management intercept
Thus, how can we interpret this coefficient in the context of the wage analysis?
Question 3: The education coefficient, 8678.274, corresponds to what property of the lines?
- management slope
- transportation slope
- how the transportation slope compares to the management slope
Thus, how can we interpret this coefficient in the context of the wage analysis?
Question 4: The interaction coefficient, -7580.228, corresponds to what property of the lines?
- management slope
- transportation slope
- how the transportation slope compares to the management slope
Thus, how can we interpret this coefficient in the context of the wage analysis?
Part 1: Exercises
Exercise 1: Wages across all industries
The plot below illustrates the relationship between wage and education for all of the industries in our cps dataset.
# Plot
ggplot(cps, aes(y = wage, x = education, color = industry)) +
geom_smooth(method = "lm", se = FALSE)
What about this plot indicates that it would be a good idea to fit an interaction model?
What industry will R use as the reference category?
(Challenge!) Before fitting the model in R, write down what you think the model formula will look like.
Fit a model that includes an interaction term between education and industry.
# Fit an interaction model called wage_model
# Display summarized model output
In what industry do wages increase the most per additional year of education? What is this increase?
Similarly, in what industry do wages increase the least per additional year of education? What is this increase?
Exercise 2: Thinking beyond
Do you think there are other variables (which may or may not be in our cps data) that have an interaction with industry in affecting wages? If you were to fit an interaction model, what results might you expect to find?
Reflection
Through the exercises above, we developed ideas about when to fit interaction models and how to interpret results. Describe what makes sense and what is still unclear about this topic.
Part 2: Exercises
Context: We’ll explore data on incumbency and campaign spending, revisit the bikes data we’ve looked at previously in this course, and explore data on characteristics of used cars. Read in the data below.
# Load packages and import data
library(ggplot2)
library(dplyr)
library(readr)
library(stringr)
library(tidyr)
bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv")
# A little bit of data wrangling code - let's not focus on this for now
campaigns <- read_csv("https://mac-stat.github.io/data/campaign_spending.csv") %>%
  dplyr::select(wholename, district, votes, incumbent, spending) %>%
  mutate(spending = spending / 1000) %>%
  filter(!is.na(spending))
# A little bit of data wrangling code - let's not focus on this for now
cars <- read_csv("https://mac-stat.github.io/data/used_cars.csv") %>%
  mutate(milage = milage %>% str_replace(",","") %>% str_replace(" mi.","") %>% as.numeric(),
         price = price %>% str_replace(",","") %>% str_replace("\\$","") %>% as.numeric(),
         age = 2025 - model_year) # 2025 so that yr. 2024 cars are one year old
For the first several exercises, we’ll consider the following research questions:
What role does campaign spending play in elections?
- Do candidates that spend more money tend to get more votes?
- How might this depend upon whether a candidate is an incumbent (they are running for RE-election) or a challenger (they are challenging the incumbent)?
We’ll use data collected by Benoit and Marsh (2008) on the campaign spending of 464 candidates in the 2002 Irish Dail elections (Ireland’s version of the U.S. House of Representatives) to explore these questions. The units of spending are 1,000 Euros.
Exercise 1: Translating scientific questions into statistical questions
- Look at the variables we have access to in the cleaned version of the data we read into R, and consider our first research question. How might we translate this question into a statistical one, that we could answer using the data we have available?
There is no one right answer to this! Brainstorm with your group.
head(campaigns)
- Question 2 (a) is a bit more specific than Question 1. Translate this question into a statistical one that can be answered using a simple linear regression model. Write out the model statement in \(E[Y | X] = ...\) notation that would answer this question, and note which regression coefficient you would interpret to provide you with an answer.
\[ E[\_\_\_ | \_\_\_] = ... \]
- Question 2 (b) is also specific, and builds on Question 2 (a). Translate this question into a statistical one that can be answered using a multiple linear regression model. Write out the model statement in \(E[Y | X] = ...\) notation that would answer this question, and note which regression coefficient you would interpret to provide you with an answer.
\[ E[\_\_\_ | \_\_\_] = ... \]
Exercise 2: Visualizing Interaction
- Write R code to visualize the relationship between campaign spending and number of votes a candidate received. Include an aesthetic to distinguish this relationship between incumbents and challengers. Do not include lines of best fit from any statistical model on your plot at this point!
# Visualization
Based on your visualization from part (a), what are your answers to research questions 2 (a) and 2 (b)? Write your answer in 2-3 sentences, describing general trends you notice, suitable for a general audience.
Add lines of best fit from a statistical model that includes an interaction term between incumbent status and spending to your plot from part (a), using geom_smooth. Based on your updated plot, do you think including an interaction between incumbent status and spending in a multiple linear regression model would be meaningful in this context? Why or why not?
# Visualization with lines of best fit
Exercise 3: Fitting and interpreting models with interaction terms
- Fit the regression model you wrote out in Exercise 1 (c). Report (do not interpret yet!) the regression coefficients below.
# Model with interaction term
(Intercept):
incumbentYes:
spending:
incumbentYes:spending:
- Using the coefficient estimates from part (a), write out two separate model statements, one for incumbents and one for challengers. Combine terms (using algebra) when you can! Hint: remember the indicator variables video!
- For incumbents:
\[ E[votes | spending] = \]
- For challengers:
\[ E[votes | spending] = \]
Interpret the coefficient for incumbent in your interaction model, in context. Make sure to use non-causal language, include units, and talk about averages rather than individual cases. Is this coefficient scientifically meaningful?
When interpreting an interaction coefficient where one of the variables interacting is quantitative and one is categorical, it is often convenient to do so in separate sentences: interpret the slope for each category separately!
Interpret the coefficient for the interaction term in your model, in context. Make sure to use non-causal language, include units, and talk about averages rather than individual cases.
- Based on your interpretation in part (d), and the visualization you made including lines of best fit, do you think that including an interaction term for incumbent status and spending is meaningful, when predicting number of votes? Explain why or why not.
Exercise 4: Interactions between two categorical variables
Let’s return to our data on bike ridership. Suppose we are interested in the relationship between daily ridership (our response variable) and whether a user is a casual or registered rider and whether the day falls on a weekend. First, we need to create a binary variable indicating whether a user is a casual or registered rider.
# Creating user variable, don't worry about syntax!
new_bikes <- bikes %>%
  dplyr::select(riders_casual, riders_registered, weekend, temp_actual) %>%
  pivot_longer(cols = riders_casual:riders_registered, names_to = "user",
               names_prefix = "riders_", values_to = "rides") %>%
  mutate(weekend = factor(weekend))
- For each of our three relevant variables, weekend, user, and rides, classify them as quantitative or categorical.
- weekend:
- user:
- rides:
- Make an appropriate visualization to explore the relationship between these three variables.
# Visualization
Is the relationship between ridership and weekend status the same for both registered and casual users? Explain why or why not, referencing the visualization you made in part (b).
To reflect what you observed in your visualization, fit a multiple linear regression model with an interaction term between weekend and user in our model of ridership.
# Multiple linear regression model
- Interpret the interaction term from your model, in context. Make sure to use non-causal language, include units, and talk about averages rather than individual cases. Just as in Exercise 3, you may find it useful to first write out multiple model statements for different categories defined by one of your categorical variables, and proceed from there!
Exercise 5: Interactions between two quantitative variables
Here we’ll explore the relationship between price, milage, and age of a used car. Below is a scatterplot of mileage vs. price, colored by age:
cars %>%
  ggplot(aes(x = milage, y = price, col = age)) +
  geom_point(alpha = 0.5) + # make the points less opaque
  scale_color_viridis_c(option = "H") + # a fun, colorblind-friendly palette!
  theme_classic() # removes the gray background and grid
It’s a little difficult to tell what exactly is going on here. In particular, does the relationship between mileage and price vary with age of a used car? Let’s try adding some fitted lines for cars of different ages.
# Ignore where the numbers in geom_abline() came from for now... we'll get there
cars %>%
  ggplot(aes(x = milage, y = price, col = age)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c(option = "H") +
  theme_classic() +
  geom_abline(slope = -6.558e-01 + 2.431e-02, intercept = 9.096e+04 - 2.665e+03, col = "black") +
  geom_abline(slope = -6.558e-01 + 10 * 2.431e-02, intercept = 9.096e+04 - 10 * 2.665e+03, col = "blue") +
  geom_abline(slope = -6.558e-01 + 30 * 2.431e-02, intercept = 9.096e+04 - 30 * 2.665e+03, col = "green") +
  ggtitle("Black: Age = 1yr, Blue: Age = 10yr, Green: Age = 30yr")
- Challenge question: Based on the fitted lines in the plot above, anticipate what the signs (positive or negative) of the coefficients in the following interaction model will be:
\[ E[price | age, milage] = \beta_0 + \beta_1 milage + \beta_2 age + \beta_3 milage:age \]
- \(\beta_0\): Put your response here…
- \(\beta_1\): Put your response here…
- \(\beta_2\): Put your response here…
- \(\beta_3\): Put your response here…
- Fit a multiple linear regression model with an interaction term between milage and age in our model of used car price.
# Multiple linear regression model
# ... now do you see where the numbers in geom_abline() came from?
As before, we could choose distinct ages, and interpret the relationship between mileage and price for each of those groups separately. However, since age is quantitative and not categorical, this doesn’t quite give us the whole picture. Instead, we want to know how the relationship between mileage and price changes for each additional year old a car is. This is what the interaction coefficient estimates, when the interaction term is between two quantitative variables!
- Interpret the interaction term, in context. Make sure to use non-causal language, include units, and talk about averages rather than individual cases.
Reflection
Through the exercises above, you practiced visualizing, fitting, and interpreting multiple linear regression models with interaction terms between combinations of categorical and quantitative variables. Think about how the fitted lines looked in situations where you think there was a meaningful interaction taking place. How do you think the fitted lines would look if there was no meaningful interaction present? Explain your reasoning.
Part 1: Solutions
Exercise 1: Wages across all industries
The plot below illustrates the relationship between wage and education for all of the industries in our cps dataset.
# Plot
ggplot(cps, aes(y = wage, x = education, color = industry)) +
geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
The industry-specific lines all have different slopes.
ag (first in alphabetical order)
This is a challenge! Compare your prediction to what you see when fitting the model in part d.
# Fit an interaction model called wage_model
wage_model <- lm(wage ~ education * industry, data = cps)
# Display summarized model output
coef(summary(wage_model))
                                              Estimate Std. Error     t value     Pr(>|t|)
(Intercept)                                31475.87521  22370.504  1.40702574 1.594509e-01
education                                     61.95396   2039.257  0.03038065 9.757641e-01
industryconstruction                      -14427.01189  25740.953 -0.56046923 5.751720e-01
industryinstallation_production           -33208.72359  25346.017 -1.31021469 1.901533e-01
industrymanagement                        -97066.48097  23305.235 -4.16500759 3.139604e-05
industryservice                           -55462.76415  23229.134 -2.38763810 1.697551e-02
industrytransportation                     -6834.25066  27495.549 -0.24855844 8.037075e-01
education:industryconstruction              2295.51232   2297.659  0.99906577 3.177870e-01
education:industryinstallation_production   3759.05906   2244.792  1.67456904 9.405013e-02
education:industrymanagement                8616.31984   2080.190  4.14208220 3.470009e-05
education:industryservice                   4384.72036   2092.523  2.09542317 3.615851e-02
education:industrytransportation            1036.09210   2409.093  0.43007562 6.671499e-01
In the management industry, wages increase the most per year of education. The increase is 61.95396 + 8616.31984 = $8678.274 per year. That is, every additional year of education is associated with an average increase of $8678.27 in yearly wages in the management industry.
In the agriculture industry, wages increase the least per year of education. The increase is $61.95 per year. That is, every additional year of education is associated with an average increase of $61.95 in yearly wages in the ag industry.
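The same comparison can be made for every industry at once. This is a minimal sketch that hard-codes the education and education:industry estimates from the coefficient table above; education_slope, slope_offsets, and industry_slopes are hypothetical names used only for this illustration. Each industry's slope is the reference (ag) slope plus that industry's interaction coefficient.

```r
# Estimates copied from the coefficient table above
education_slope <- 61.95396  # slope for the reference industry (ag)

# Interaction coefficients: how each industry's slope differs from ag's
slope_offsets <- c(
  ag = 0,
  construction = 2295.51232,
  installation_production = 3759.05906,
  management = 8616.31984,
  service = 4384.72036,
  transportation = 1036.09210
)

# Industry-specific slopes (dollars of wage per additional year of education)
industry_slopes <- education_slope + slope_offsets

# Largest to smallest, then the industries with the steepest and flattest lines
sort(industry_slopes, decreasing = TRUE)
names(which.max(industry_slopes))
names(which.min(industry_slopes))
```

Note that several of these interaction estimates have large standard errors in the table, so the ordering of the middle industries should be read with caution.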
Exercise 2: Thinking beyond
If a variable x has an interaction with the industry variable in affecting wages, then the relationship between x and wages must be different by industry. We might suspect that this could be the case for hours worked per week. We can make a plot to verify that this is actually the case:
ggplot(cps, aes(y = wage, x = hours, color = industry)) +
geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Part 2: Solutions
Exercise 1: Translating scientific questions into statistical questions
From this question, the only clear variable that should be involved in our analysis/exploration is spending. We could begin by providing numerical and visual summaries of campaign spending. We could also look at whether spending varies by district, number of votes, or incumbency status. This would give us a broad idea of how campaign spending may vary across the variables we have access to in our data.
We can estimate the average associated increase in number of votes per additional 1,000 Euros spent, via a simple linear regression model. The model statement that allows us to answer this question is given by
\[ E[votes | spending] = \beta_0 + \beta_1 spending \]
The regression coefficient we would interpret to answer this question is the coefficient for spending, which in this case is \(\beta_1\).
- We are interested in how the association between average number of votes and campaign spending varies by incumbency status. The model statement that allows us to answer this question is given by
\[ E[votes | spending, incumbent] = \beta_0 + \beta_1 spending + \beta_2 incumbent + \beta_3 spending:incumbent \]
(note that the order in which you put spending and incumbent status does not matter!)
The regression coefficient we would interpret to answer this question is the interaction coefficient, which in this case is \(\beta_3\).
Exercise 2: Visualizing Interaction
# Visualization
campaigns %>%
  ggplot(aes(spending, votes, col = incumbent)) +
  geom_point()
In general, the more a candidate spends on their campaign, the more votes they receive. Incumbents appear to spend more than challengers on their campaigns, typically. The impact of spending on votes appears to be greater for challengers than for incumbents, in that more spending may lead to even more votes for challengers, than it would for incumbents.
I think including an interaction term between incumbent status and spending would be meaningful, since the relationship between spending and votes does seem to vary by incumbent status. In particular, note that the lines on the visualization are not parallel. Parallel lines imply that there is no interaction present, so the further the lines are from parallel, the more intense (in some sense) the interaction term.
# Visualization with lines of best fit
campaigns %>%
  ggplot(aes(spending, votes, col = incumbent)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Exercise 3: Fitting and interpreting models with interaction terms
# Model with interaction term
lm(data = campaigns, votes ~ spending*incumbent)
Call:
lm(formula = votes ~ spending * incumbent, data = campaigns)
Coefficients:
(Intercept) spending incumbentYes
690.5 209.7 4813.9
spending:incumbentYes
-125.9
(Intercept): 690.5
incumbentYes: 4813.9
spending: 209.7
incumbentYes:spending: -125.9
- For incumbents:
\[ E[votes | spending] = 690.5 + 4813.9 + 209.7 * spending - 125.9 * spending = 5504.4 + 83.8 * spending \]
- For challengers:
\[ E[votes | spending] = 690.5 + 209.7 * spending \]
On average, we expect the difference in number of votes between incumbents and challengers to be 4813.9, for campaigns where no money is spent. This is likely not a scientifically meaningful estimate since there are very few campaigns where no money is spent. However, such campaigns do exist, so I would say this one could be meaningful in certain contexts, if not broadly!
On average, we expect an increase in spending by 1,000 euros to be associated with an increase in number of votes by 210, for challengers. On average, we expect an increase in spending by 1,000 euros to be associated with an increase in number of votes by 84, for incumbents.
I think the interaction term is meaningful when predicting number of votes, since 84 and 210 are relatively different numbers! The interaction term gives us the additional information that spending has less of an effect on number of votes for incumbents than it does for challengers, which is particularly meaningful if you are a campaign manager!
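One way to read the interaction coefficient directly is to look at the expected gap between incumbents and challengers at the same spending level. The block below is a sketch built only from the coefficient estimates reported above (with spending in units of 1,000 euros); the group labels are written out for readability.

```latex
\begin{aligned}
E[votes \mid spending, \text{incumbent}] - E[votes \mid spending, \text{challenger}]
&= 4813.9 - 125.9 \cdot spending \\
\text{slope for incumbents} - \text{slope for challengers} &= -125.9
\end{aligned}
```

So the expected advantage of incumbents shrinks by about 126 votes for each additional 1,000 euros spent. Algebraically, the two fitted lines would cross near spending = 4813.9 / 125.9, or about 38.2 (thousand euros); whether that spending level falls within the range of the observed data should be checked before reading anything into it.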
Exercise 4: Interactions between two categorical variables
# Creating user variable, don't worry about syntax!
new_bikes <- bikes %>%
  dplyr::select(riders_casual, riders_registered, weekend, temp_actual) %>%
  pivot_longer(cols = riders_casual:riders_registered, names_to = "user",
               names_prefix = "riders_", values_to = "rides") %>%
  mutate(weekend = factor(weekend))
- weekend: categorical (binary)
- user: categorical (binary)
- rides: quantitative
# Visualization
new_bikes %>%
  ggplot(aes(y = rides, user, fill = weekend)) +
  geom_boxplot()
The relationship between ridership and weekend status does not appear to be the same for registered and casual users. Specifically, median ridership among casual users is higher on weekends, whereas the opposite is true for registered users.
# Multiple linear regression model
lm(data = new_bikes, rides ~ user * weekend)
Call:
lm(formula = rides ~ user * weekend, data = new_bikes)
Coefficients:
(Intercept) userregistered
625.0 3300.5
weekendTRUE userregistered:weekendTRUE
776.7 -1714.4
- On average, we expect there to be 777 more rides on weekends compared to non-weekends, for casual riders. On average, we expect there to be 938 (776.7 - 1714.4, rounded) fewer rides on weekends compared to non-weekends, for registered riders.
Note: There are lots of ways you could correctly interpret the interaction term here! You could do it in one sentence, you could do it in four (one for each unique group defined by the two categorical variables), or you could compare casual and registered riders for weekends, and then separately for non-weekends! All are valid options.
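For the four-sentence version, it can help to write out the expected daily rides for each of the four groups. This is a sketch that uses only the coefficient estimates printed in the model output above; each group's value adds up the coefficients whose indicators equal 1 for that group.

```latex
\begin{aligned}
\text{casual, non-weekend:} \quad & 625.0 \\
\text{casual, weekend:} \quad & 625.0 + 776.7 = 1401.7 \\
\text{registered, non-weekend:} \quad & 625.0 + 3300.5 = 3925.5 \\
\text{registered, weekend:} \quad & 625.0 + 3300.5 + 776.7 - 1714.4 = 2987.8
\end{aligned}
```

The weekend difference is 1401.7 - 625.0 = 776.7 for casual riders and 2987.8 - 3925.5 = -937.7 for registered riders. The interaction coefficient, -1714.4, is the gap between those two differences.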
Exercise 5: Interactions between two quantitative variables
Here we’ll explore the relationship between price, milage, and age of a used car. Below is a scatterplot of mileage vs. price, colored by age:
cars %>%
  ggplot(aes(x = milage, y = price, col = age)) +
  geom_point(alpha = 0.5) + # make the points less opaque
  scale_color_viridis_c(option = "H") + # a fun, colorblind-friendly palette!
  theme_classic() # removes the gray background and grid
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).
It’s a little difficult to tell what exactly is going on here. In particular, does the relationship between mileage and price vary with age of a used car? Let’s try adding some fitted lines for cars of different ages.
# Ignore where the numbers in geom_abline() came from for now... we'll get there
cars %>%
  ggplot(aes(x = milage, y = price, col = age)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c(option = "H") +
  theme_classic() +
  geom_abline(slope = -6.558e-01 + 2.431e-02, intercept = 9.096e+04 - 2.665e+03, col = "black") +
  geom_abline(slope = -6.558e-01 + 10 * 2.431e-02, intercept = 9.096e+04 - 10 * 2.665e+03, col = "blue") +
  geom_abline(slope = -6.558e-01 + 30 * 2.431e-02, intercept = 9.096e+04 - 30 * 2.665e+03, col = "green") +
  ggtitle("Black: Age = 1yr, Blue: Age = 10yr, Green: Age = 30yr")
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).
\[ E[price | age, milage] = \beta_0 + \beta_1 milage + \beta_2 age + \beta_3 milage:age \]
- \(\beta_0\): positive, since the intercept is the average price for a car with zero miles that is brand new.
- \(\beta_1\): negative, since the more miles a new car has, the cheaper it should be
- \(\beta_2\): negative, since the intercept of the lines seems to decrease with age (black -> blue -> green)
- \(\beta_3\): positive, since the slope of the lines seems to increase with age (black -> blue -> green)
# Multiple linear regression model
lm(data = cars, price ~ milage * age)
Call:
lm(formula = price ~ milage * age, data = cars)
Coefficients:
(Intercept) milage age milage:age
9.096e+04 -6.558e-01 -2.665e+03 2.431e-02
# ... now do you see where the numbers in geom_abline() came from?
As before, we could choose distinct ages, and interpret the relationship between mileage and price for each of those groups separately. However, since age is quantitative and not categorical, this doesn’t quite give us the whole picture. Instead, we want to know how the relationship between mileage and price changes for each additional year old a car is. This is what the interaction coefficient estimates, when the interaction term is between two quantitative variables!
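- On average, for each additional year of age, the expected change in price associated with one additional mile of mileage increases by $0.0243. Since the mileage slope is negative for a new car (about -$0.66 per mile), this means the negative relationship between mileage and price gets weaker as cars get older.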
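To connect this interpretation back to the three lines drawn with geom_abline(), the fitted model can be rearranged so that both the intercept and the mileage slope are written as functions of age. This is a sketch using the rounded coefficient estimates printed in the model output above.

```latex
\begin{aligned}
E[price \mid age, milage] &= (90960 - 2665 \cdot age) + (-0.6558 + 0.02431 \cdot age) \cdot milage \\
age = 1:\quad & \text{intercept } 88295, \quad \text{slope } -0.6315 \\
age = 10:\quad & \text{intercept } 64310, \quad \text{slope } -0.4127 \\
age = 30:\quad & \text{intercept } 11010, \quad \text{slope } 0.0735
\end{aligned}
```

Each additional year of age moves the mileage slope up by 0.02431, which is exactly the interaction coefficient. The slightly positive slope at age 30 likely reflects extrapolation to ages where the data are sparse, so it should not be taken to mean that mileage raises the price of very old cars.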