Multiple linear regression: model building (part 1)
Notes and in-class exercises
Notes
Download the .qmd file for this activity here.
Learning goals
By the end of this lesson, you should be able to:
Distinguish between descriptive, predictive, and causal research questions
Iterate on your group’s research question to make it more precise and answerable
Choose appropriate model(s) for addressing your group’s research question
Readings and videos
Please watch the following video before class.
File organization: If you would like to take notes in this document, download the template and save it in the “Activities” subfolder of your “STAT155” folder. You are more than welcome to take notes in a separate Google Doc shared with your project group if you’d find that more useful!
Steps
Step 1: Review
Take a look at the first project checkpoint (your statistical analysis plan) that your group submitted. As part of this checkpoint, you should have come up with a research question. Record the research question you came up with: we’ll iterate on this question throughout the activity!
Record your research question here
Answer the following questions as a group (some of this may already be in your statistical analysis plan!):
- Who would be interested in the answer to this question?
- What variables do you need in a dataset to address this question?
- What data summaries (not models) would help you answer this question, and why?
- What plots (not models) would help you address your research question, and why?
Step 2: Descriptive Research Questions
Descriptive research questions seek to better understand the relationships between variables, without interest in causality. In practice, nearly every research question is ultimately motivated by an interest in causality, but practical constraints (such as unmeasured confounding) lead us to ask descriptive questions instead.
If we’re only interested in associations (not causality), we don’t need to adjust for potential confounding variables in our model.
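To see what adjusting for a confounder actually changes, here is a minimal sketch in Python (standard library only). Everything in it is made up for illustration: the toy data, the variable names `x`, `y`, and `z`, and the helper `slope`. Here `z` is a hypothetical binary confounder that raises both `x` and `y`, and stratifying on `z` stands in for the more usual approach of adding `z` to the regression model.

```python
# Toy data (made up for illustration): z is a hypothetical binary confounder.
# Within each level of z, y increases by exactly 1 for each 1-unit increase in x.
data = [
    # (x, y, z)
    (0, 0, 0), (1, 1, 0), (2, 2, 0),
    (3, 9, 1), (4, 10, 1), (5, 11, 1),
]


def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Unadjusted: ignore z entirely (a purely descriptive summary of x and y).
unadjusted = slope([(x, y) for x, y, _ in data])

# Adjusted by stratifying: compute the slope separately within each level of z.
within_z0 = slope([(x, y) for x, y, z in data if z == 0])
within_z1 = slope([(x, y) for x, y, z in data if z == 1])

print(f"unadjusted slope: {unadjusted:.2f}")
print(f"slope within z = 0: {within_z0:.2f}")
print(f"slope within z = 1: {within_z1:.2f}")
```

In this toy example the unadjusted slope (about 2.54) is much steeper than the slope within either level of `z` (exactly 1). For a descriptive question, the unadjusted number is a perfectly good summary of the association; it only becomes misleading if we read it as the effect of `x` on `y`.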
For your group’s chosen research question, write a model statement that would address a descriptive version of your research question below:
Model statement for a descriptive question here
Example
When I was in college, I did a summer research project in collaboration with a statistics professor and the local (Northfield, MN) school district. The school district was interested in whether a professional development program they implemented (Professional Learning Communities) had an effect on student achievement, as measured by standardized test scores. Unfortunately, since every staff member was assigned to a PLC (and not randomly), we could not make any causal conclusions about this relationship.
What we ended up finding was that while student test scores did improve, there did not appear to be an association between this improvement and the goals set by a given PLC. To estimate this association we fit a multiple linear regression model (wow! just like you!):
\[ E[\text{Change in test score} \mid \text{PLC goal}, \text{Demographic Factors}] = \dots \]
We were also interested in whether the relationship between PLC goal and change in test score differed by whether a student received free or reduced-price lunch (FRP). The staff at the schools had hypothesized that the PLCs may have the greatest impact on students from financially disadvantaged backgrounds, and we used FRP status as a proxy for income status. To test this, we fit a multiple linear regression model with an interaction term:
\[ E[\text{Change in test score} \mid \text{PLC goal}, \text{FRP}, \text{Demographic Factors}] = \beta_0 + \beta_1 \text{PLC goal} + \beta_2 \text{FRP} + \beta_3 (\text{PLC goal} \times \text{FRP}) + \dots \]
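To see what the interaction coefficient means, it can help to write the model out separately for each group. This is only a sketch: it assumes FRP is an indicator coded 1 for students receiving free or reduced-price lunch and 0 otherwise (the coding is an assumption, not something stated above), and it treats PLC goal as a single term, as in the model statement.
\[ \text{FRP} = 0: \quad E[\text{Change in test score} \mid \dots] = \beta_0 + \beta_1 \text{PLC goal} + \dots \]
\[ \text{FRP} = 1: \quad E[\text{Change in test score} \mid \dots] = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \text{PLC goal} + \dots \]
Under that coding, \(\beta_3\) is the difference between the two groups in the PLC goal coefficient, so asking whether the relationship differs by FRP status amounts to asking whether \(\beta_3 = 0\).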
It’s actually kind of wild that I still have this, but here’s the poster I presented at the end of my summer research project!
Step 3: Predictive Research Questions
Predictive research questions seek to determine if (and how well) we can predict outcomes for new / future events, using the information we already have. We’ve seen a bit of prediction in this course when we talked about fitted values!
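As a minimal sketch of what "how well can we predict" might look like, the Python snippet below (standard library only) fits a simple linear regression to some made-up data and then checks its predictions against made-up new observations it never saw. The numbers and the helper names `fit_line` and `rmse` are hypothetical, and root mean squared error is just one of several reasonable ways to summarize prediction error.

```python
import math

# Made-up data for illustration: (x, y) pairs the model is fit on ...
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
# ... and "new" observations that were not used to fit the model.
new = [(5, 10.1), (6, 11.9)]


def fit_line(points):
    """Return the least-squares (intercept, slope) for y on x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def rmse(points, b0, b1):
    """Root mean squared error of the line's predictions on the given points."""
    squared_errors = [(y - (b0 + b1 * x)) ** 2 for x, y in points]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


b0, b1 = fit_line(train)
predictions = [b0 + b1 * x for x, _ in new]  # fitted values for the new x's

print("intercept, slope:", round(b0, 2), round(b1, 2))
print("predictions for new data:", [round(p, 2) for p in predictions])
print("RMSE on new data:", round(rmse(new, b0, b1), 3))
```

The key design choice is that the error is computed on observations that were not used to fit the line, since a predictive question is about new or future cases rather than the data we already have.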
Example
Almost every research project I have ever worked on has been predictive in nature, since my research focuses on estimating under-5 mortality rates in low- and middle-income countries (LMICs). What I mostly work on is not the relationship between multiple variables (think: is some factor associated with higher/lower mortality), but rather trying to estimate the probability of dying under the age of 5, using the data we have available to us.
In high-income countries (such as the United States), we have relatively strong vital registration programs, where births and deaths are recorded almost exactly for the vast majority of the population. In LMICs, we instead rely on nationally representative survey data to provide us with information about when people are born and when they die. This means that (1) we don’t have exact dates of births and deaths for all people, and (2) the dates we do have aren’t always accurate, because survey respondents misremember when people were born and when they died. This makes estimating child mortality complicated!
The statistical models we use are more complicated than the ones we learn in Stat 155, but one way to get a sense of what we do is through logistic regression, which we’ll get to shortly. Logistic regression is one way to fit statistical models to data with binary outcomes. Death is (obviously) a binary outcome (did you die during this time period, yes or no?). We then fit a model that looks something like this:
\[ \text{logit}(E[\text{Death} \mid \text{Stuff}]) = \beta_0 + \beta_1 \text{State} + \beta_2 \text{Time Period} + \beta_3 \text{Other Stuff} \]
Since we are only interested in the fitted values (our predictions of the outcome), we can include almost whatever we want in “Other Stuff”!
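Because the model is written on the logit (log-odds) scale, where \(\text{logit}(p) = \log(p / (1 - p))\), getting a fitted value means converting the right-hand side back to a probability. The Python sketch below (standard library only) shows that conversion. The coefficient values and the helper name `inv_logit` are made up purely for illustration and are not estimates from any real data.

```python
import math


def inv_logit(eta):
    """Convert a log-odds value (the linear predictor) to a probability."""
    return 1 / (1 + math.exp(-eta))


# Hypothetical coefficients, NOT real estimates: an intercept and the change in
# log-odds for a later time period (coded 0 = earlier, 1 = later).
beta_0 = -3.0
beta_time = -0.5

for time_period in (0, 1):
    eta = beta_0 + beta_time * time_period  # linear predictor (log-odds)
    prob = inv_logit(eta)                   # fitted probability of death
    print(f"time period {time_period}: log-odds {eta:.1f}, probability {prob:.3f}")
```

Whatever goes into the linear predictor, the fitted value that comes out is always a probability between 0 and 1, which is the quantity a predictive question like this one cares about.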
With your groups, discuss the following:
- Is your research question predictive, or inferential? Inferential questions seek to understand the relationships between variables.
- If your question were predictive, who would be interested/invested in the results from your project? How could the results from your project be used in practice?
- Are there any variables that are not available to you in your data that you would include in your predictive model if you could? Why or why not?
Step 4: Causal Research Questions
Causal research questions are ultimately what most inferential statistics is interested in, regardless of whether or not we end up being able to make causal conclusions. From the videos for today, you learned about different types of variables, and whether or not they should be included or excluded from a model, depending on your causal research question.
With your groups, make a causal diagram (DAG) on the whiteboard for your research question. Consider including all variables you wish you had access to, even if they aren’t available in your data (this will help you later when talking about limitations of your analysis in your final paper), but certainly include relevant variables that are available in your data.
For each variable in your DAG that is available in your dataset, determine whether it should be included or excluded from your model. Use this to update your descriptive model statement from Step 2.
Model statement for a causal question here
Now look back at your DAG, and note if any of the variables that are not available in your data are potential confounders. If so, record them here (this means you likely won’t be able to draw causal conclusions):
List of “unmeasured” confounding variables here
Step 5: Reflection
Today was all about iterating on a research question, and using those questions to guide the way we explore data and fit statistical models. How confident do you feel in distinguishing between descriptive, predictive, and causal research questions? How confident do you feel in knowing which components of a model matter more or less, in each specific case? What might help you feel more confident?
Response: Put your response here.
Render your work
- Click the “Render” button in the menu bar for this pane (blue arrow pointing right). This will create an HTML file containing all of the directions, code, and responses from this activity. A preview of the HTML will appear in the browser.
- Scroll through and inspect the document to check that your work translated to the HTML format correctly.
- Close the browser tab.
- Go to the “Background Jobs” pane in RStudio and click the Stop button to end the rendering process.
- Navigate to your “Activities” subfolder within your “STAT155” folder and locate the HTML file. You can open it again in your browser to double check.