# Load package
library(tidyverse)
# Read in the course evaluation data
<- read_csv("https://mac-stat.github.io/data/dear_abby.csv") abby
Univariate visualization and summaries
Notes and in-class exercises
Notes
Download the .qmd template file for this activity here.
Learning goals
By the end of this lesson, you should be able to:
- Describe what a case (or unit of analysis) represents in a dataset.
- Describe what a variable represents in a dataset.
- Identify whether a variable is categorical or quantitative and what summarizations and visualizations are appropriate for that variable
- Write R code to read in data and to summarize and visualize a single variable at a time.
- Interpret key features of barplots, boxplots, histograms, and density plots
- Describe information about the distribution of a quantitative variable using the concepts of shape, center, spread, and outliers
- Relate summary statistics of data to the concepts of shape, center, spread, and outliers
Readings and videos
Choose either the reading or the videos to go through before class.
- Reading: Sections 2.1-2.4, 2.6 in the STAT 155 Notes
- Videos:
File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.
Exercises
Guiding question: What anxieties have been on Americans’ minds over the decades?
Context: Dear Abby is America’s longest running advice column. Started in 1956 by Pauline Phillips under the pseudonym Abigail van Buren, the column continues to this day under the stewardship of her daughter Jeanne. Each column features one or more letters to Abby from anonymous individuals, all signed with a pseudonym. Abby’s response follows each letter.
In 2018, the data journalism site The Pudding published a visual article called 30 Years of American Anxieties in which the authors explored themes in Dear Abby letters from 1985 to 2017. (We only have the questions, not Abby’s responses.) The codebook is available here.
Exercise 1: Get curious
- Hypothesize with each other: what themes do you think might come up often in Dear Abby letters?
- After brainstorming, take a quick glance at the original article from The Pudding to see what themes they explored.
- Go to the very end of the Pudding article to the section titled “Data and Method”. In thinking about the who, what, when, where, why, and how of data context, what concerns/limitations surface with regards to using this data to learn about Americans’ concerns over the decades?
Exercise 2: Importing and getting to know the data
First, in the Console pane of RStudio, run the following command to install some necessary packages (you will need to do this any time you are installing a new package):
install.packages("tidyverse")
Now, in the Quarto pane, run the following code chunk to load the package and load a dataset (you can either click the green arrow in the top right of the code chunk, put your cursor in the code chunk and hit Ctrl+Alt+C [on Windows/Linux] or Command+Option+C [on Mac]).
If it runs successfully, you should see the following output appear in the Console pane:
> # Load package
> library(tidyverse)
>
> # Read in the course evaluation data
> abby <- read_csv("https://mac-stat.github.io/data/dear_abby.csv")
Rows: 20034 Columns: 11
── Column specification ────────────────────────────
Delimiter: ","
chr (4): day, url, title, question_only
dbl (7): year, month, letterId, afinn_overall, a...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Click on the Environment tab (generally in the upper right hand pane in RStudio). Then click the
abby
line. Theabby
data will pop up as a separate pane (like viewing a spreadsheet) – check it out.In this tidy dataset, what is the unit of observation? That is, what is represented in each row of the dataset?
What term do we use for the columns of the dataset?
Try out each function below. Identify what each function tells you about the
abby
data and note this in the???
:
# ??? [what do both numbers mean?]
dim(abby)
# ???
nrow(abby)
# ???
ncol(abby)
# ???
head(abby)
# ???
names(abby)
- [OPTIONAL] If you’re not sure how exactly to use a function, you can pull up a built-in help page with information about the arguments a function takes (i.e., what goes inside the parentheses), and the output it produces. To do this, click inside the Console pane, and enter
?function_name
. For example, to pull up a help page for thedim()
function, we can type?dim
and hit Enter. Try pulling up the help page for theread_csv()
function we used to load the dataset.
Exercise 3: Preparing to summarize and visualize the data
In the next exercises, we will be exploring themes in the Dear Abby questions and the overall “mood” or sentiment of the questions. Before continuing, read the codebook for this dataset for some context about sentiment analysis, which gives us a measure of the mood/sentiment of a text.
What sentiment variables do we have in the dataset? Are they quantitative or categorical?
If we were able to create a
theme
variable that took values like “friendship”, “marriage”, and “relationships”, wouldtheme
be quantitative or categorical?What visualizations are appropriate for looking at the distribution of a single quantitative variable? What about a single categorical variable?
Exercise 4: Exploring themes in the letters
The dplyr
package provides many useful functions for managing data (like creating new variables, summarizing information). The stringr
package provides tools for working with strings (text). We’ll use these packages to search for words in the questions in order to (roughly) identify themes/subjects.
The code below searches for words related to mothers, fathers, marriage, and money and combines them into a single theme
variable.
- Inside
mutate()
the linemoms = ifelse(str_detect(question_only, "mother|mama|mom"), "mom", "no mom")
creates a new variable calledmoms
. If any of the text “mother”, “mama”, or “mom” (which covers “mommy”) is found, then the variable takes the value “mom”. Otherwise, the variable takes the value “no mom”. - The
dads
,marriage
, andmoney
variables are created similarly. - The
themes = str_c(moms, dads, marriage, money, sep = "|")
line takes the 4 created variables and combines the text of those variables separated with a |. For example, one value of thethemes
variable is “mom|no_dad|no_marriage|no_money” (which contains words about moms but not dads, marriage, or money).
library(dplyr)
library(stringr)
<- abby %>%
abby mutate(
moms = ifelse(str_detect(question_only, "mother|mama|mom"), "mom", "no mom"),
dads = ifelse(str_detect(question_only, "father|papa|dad"), "dad", "no dad"),
marriage = ifelse(str_detect(question_only, "marriage|marry|married"), "marriage", "no marriage"),
money = ifelse(str_detect(question_only, "money|finance"), "money", "no money"),
themes = str_c(moms, dads, marriage, money, sep = "|")
)
- Modify the code above however you wish to replace themes (e.g., replace “moms” with something else) or add new themes to search for. If you want to add a new subject to search for, copy and paste a line for an existing subject above the
themes
line, and modify the code like this:- If your subject is captured by multiple words:
YOUR_SUBJECT = ifelse(str_detect(question_only, "WORD1|WORD2|ETC"), "SUBJECT", "NO SUBJECT"),
- If your subject is captured by a single word:
YOUR_SUBJECT = ifelse(str_detect(question_only, "WORD"), "SUBJECT", "NO SUBJECT"),
- Try to have no more than 6 subjects—otherwise we’ll have too many themes, which will complicate exploration.
- If your subject is captured by multiple words:
- The code below makes a barplot of the
themes
variable using theggplot2
visualization package. Before making the plot, make note of what you expect the plot might look like. (This might be hard–just do your best!) Then compare to what you observe when you run the code chunk to make the plot. (Clearly defining your expectations first is good scientific practice to avoid confirmation bias.)
# Load package
library(ggplot2)
# barplot
ggplot(abby, aes(x = themes)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
- We can follow up on the barplot with a simple numerical summary. Whereas the
ggplot2
package is great for visualizations,dplyr
is great for numerical summaries. The code below constructs a table of the number of questions with each theme. Make sure that these numerical summaries match up with what you saw in the barplot.
# Construct a table of counts
%>%
abby count(themes)
- Before proceeding, let’s break down the plotting code above. Run each chunk to see how the two lines of code above build up the plot in “layers”. Add comments (on the lines starting with
#
) to document what you notice.
# ???
ggplot(abby, aes(x = themes))
# ???
ggplot(abby, aes(x = themes)) +
geom_bar()
# ???
ggplot(abby, aes(x = themes)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
# ???
ggplot(abby, aes(x = themes)) +
geom_bar() +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Exercise 5: Exploring sentiment
We’ll look at the distribution of the afinn_overall
sentiment variable and associated summary statistics.
- The code below creates a boxplot of this variable. In the comment, make note of how this code is simliar to the code for the barplot above. As in the previous exercise, before running the code chunk to create the plot, make note of what you expect the boxplot to look like.
# ???
ggplot(abby, aes(x = afinn_overall)) +
geom_boxplot()
- Challenge: Using the code for the barplot and boxplot as a guide, try to make a histogram and a density plot of the overall average ratings.
- What information is given by the tallest bar of the histogram?
- How would you describe the shape of the distribution?
# Histogram
# Density plot
- We can compute summary statistics (numerical summaries) for a quantitative variable using the
summary()
function or with thesummarize()
function from thedplyr
package. (1st Qu.
and3rd Qu.
stand for first and third quartile.) After inspecting these summaries, look back to your boxplot, histogram, and density plot. Which plots show which summaries most clearly?
# Summary statistics
# Using summary() - convenient for computing many summaries in one command
# Does not show the standard deviation
summary(abby$afinn_overall)
# Using summarize() from dplyr
# Note that we use %>% to pipe the data into the summarize() function
# We need to use na.rm = TRUE because there are missing values (NAs)
%>%
abby summarize(mean(afinn_overall, na.rm = TRUE), median(afinn_overall, na.rm = TRUE), sd(afinn_overall, na.rm = TRUE))
- Write a good paragraph describing the information in the histogram (or density plot) by discussing shape, center, spread, and outliers. Incorporate the numerical summaries from part c.
Exercise 6: Box plots vs. histograms vs. density plots
We took 3 different approaches to plotting the quantitative average course variable above. They all have pros and cons.
- What is one pro about the boxplot in comparison to the histogram and density plot?
- What is one con about the boxplot in comparison to the histogram and density plots?
- In this example, which plot do you prefer and why?
Exercise 7: Explore outliers
Given that Dear Abby column is an advice column, it seems natural that the sentiment of the questions would lean more negative. What’s going on with the questions that have particularly positive sentiments?
We can use the filter()
function in the dplyr
package to look at the . Based on the plots of afinn_overall
that you made in Exercise 5, pick a threshold for the afinn_overall
variable—we’ll say that questions with an overall sentiment score above this threshold are high outliers. Fill in this number where it says YOUR_THRESHOLD
below.
%>%
abby filter(afinn_overall > YOUR_THRESHOLD) %>%
pull(question_only)
What do you notice? Why might these questions have such high sentiment scores?
Exercise 8: Returning to our context, looking ahead
In this activity, we explored data on Dear Abby question, with a focus on exploring a single variable at a time.
- In big picture terms, what have we learned about Dear Abby questions?
- What further curiosities do you have about the data?
Exercise 9: Different ways to think about data visualization
In working with and visualizing data, it’s important to keep in mind what a data point represents. It can reflect the experience of a real person. It might reflect the sentiment in a piece of art. It might reflect history. We’ve taken one very narrow and technical approach to data visualization. Check out the following examples, and write some notes about anything you find interesting.
Exercise 10: Rendering your work
Save this file, and then click the “Render” button in the menu bar for this pane (blue arrow pointing right). This will create an HTML file containing all of the directions, code, and responses from this activity. A preview of the HTML will appear in the browser.
- Scroll through and inspect the document to see how your work was translated into this HTML format. Neat!
- Close the browser tab.
- Go to the “Background Jobs” pane in RStudio and click the Stop button to end the rendering process.
- Navigate to your “Activities” subfolder within your “STAT155” folder and locate the HTML file. You can open it again in your browser to double check.
Reflection
Go to the top of this file and review the learning objectives for this lesson. Which objectives do you have a good handle on, are at least familiar with, or are struggling with? What feels challenging right now? What are some wins from the day?
Response: Put your response here.
Additional Practice
If you have time and want additional practice, try out the following exercises.
Exercise 11: Read in and get to know the weather data
Daily weather data are available for 3 locations in Perth, Australia.
- View the codebook here.
- Complete the code below to read in the data.
# Replace the ??? with your own name for the weather data
# Replace the ___ with the correct function
<- ___("https://mac-stat.github.io/data/weather_3_locations.csv") ???
Exercise 12: Exploring the data structure
Check out the basic features of the weather data.
# Examine the first six cases
# Find the dimensions of the data
What does a case represent in this data?
Exercise 13: Exploring rainfall
The raintoday
variable contains information about rainfall.
- Is this variable quantitative or categorical?
- Create an appropriate visualization, and compute appropriate numerical summaries.
- What do you learn about rainfall in Perth?
# Visualization
# Numerical summaries
Exercise 14: Exploring temperature
The maxtemp
variable contains information on the daily high temperature.
- Is this variable quantitative or categorical?
- Create an appropriate visualization, and compute appropriate numerical summaries.
- What do you learn about high temperatures in Perth?
# Visualization
# Numerical summaries
Exercise 15: Customizing! (CHALLENGE)
Though you will naturally absorb some RStudio code throughout the semester, being an effective statistical thinker and “programmer” does not require that we memorize all code. That would be impossible! In contrast, using the foundation you built today, do some digging online to learn how to customize your visualizations.
- For the histogram below, add a title and more meaningful axis labels. Specifically, title the plot “Distribution of max temperatures in Perth”, change the x-axis label to “Maximum temperature” and y-axis label to “Number of days”. HINT: Do a Google search for something like “add axis labels ggplot”.
# Add a title and axis labels
ggplot(weather, aes(x = maxtemp)) +
geom_histogram()
- Adjust the code below in order to color the bars green. NOTE: Color can be an effective tool, but here it is simply gratuitous.
# Make the bars green
ggplot(weather, aes(x = raintoday)) +
geom_bar()
Check out the
ggplot2
cheat sheet. Try making some of the other kinds of univariate plots outlined there.What else would you like to change about your plot? Try it!
Solutions
Exercise 1: Get curious
- Results of brainstorming themes will vary
- From the “Data and Method” section at the end of the Pudding article, we see this paragraph:
The writers of these questions likely skew roughly 2/3 female (according to Pauline Phillips, who mentions the demographics of responses to a survey she disseminated in 1987), and consequently, their interests are overrepresented; we’ve been unable to find other demographic data surrounding their origins. There is, doubtless, a level of editorializing here: only a fraction of the questions that people have written in have seen publication, because agony aunts (the writers of advice columns) must selectively filter what gets published. Nevertheless, the concerns of the day seem to be represented, such as the HIV/AIDS crisis in the 1980s. Additionally, we believe that the large sample of questions in our corpus (20,000+) that have appeared over recent decades gives a sufficient directional sense of broad trends.
- Writers of the questions are predominately female. The 2/3 proportion was estimated in 1987, so it would be useful to understand shifts in demographics over time.
- What questions were chosen to be answered on the column? Likely a small fraction of what got submitted. What themes tended to get cut out?
Exercise 2: Importing and getting to know the data
Note how clicking the
abby
data causes both a popup pane and the commandView(abby)
to appear in the Console. In fact, theView()
function is the underlying command that opens a dataset pane. (View()
should always be entered in the Console and NOT your Quarto document.)Each row / case corresponds to a single question.
Columns = variables
Try out each function below. Identify what each function tells you about the
abby
data and note this in the???
:
# First number = number of rows / cases
# Second number = number of columns / variables
dim(abby)
[1] 20034 16
# Number of rows (cases)
nrow(abby)
[1] 20034
# Number of columns (variables)
ncol(abby)
[1] 16
# View first few rows of the dataset (6 rows, by default)
head(abby)
# A tibble: 6 × 16
year month day url title letterId question_only afinn_overall afinn_pos
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 1985 1 01 proque… WOMA… 1 "i have been… -30 5
2 1985 1 01 proque… WOMA… 1 "this is for… -30 5
3 1985 1 02 proque… LAME… 1 "our 16-year… 1 3
4 1985 1 03 proque… 'NOR… 1 "i was a hap… -3 7
5 1985 1 04 proque… IT'S… 1 "you be the … 13 31
6 1985 1 04 proque… IT'S… 1 "a further w… 13 31
# ℹ 7 more variables: afinn_neg <dbl>, bing_pos <dbl>, moms <chr>, dads <chr>,
# marriage <chr>, money <chr>, themes <chr>
# Get all column (variable) names
names(abby)
[1] "year" "month" "day" "url"
[5] "title" "letterId" "question_only" "afinn_overall"
[9] "afinn_pos" "afinn_neg" "bing_pos" "moms"
[13] "dads" "marriage" "money" "themes"
- We can display the first 10 rows with
head(abby, n = 10)
.
Exercise 3: Preparing to summarize and visualize the data
The sentiment variables are
afinn_overall
,afinn_pos
,afinn_neg
, andbing_pos
, and they are quantitative. Theafinn
variables don’t have units but we can still get a sense of the scale by remembering that each word gets a score between -5 and 5. Thebing_pos
variable doesn’t have units because it’s a fraction, but we know that it ranges from 0 to 1.theme
would be categorical.Appropriate visualizations:
- single quantitative variable: boxplot, histogram, density plot
- single categorical variable: barplot
Exercise 4: Exploring themes in the letters
Code will vary
Expectations about the plot will vary
ggplot(abby, aes(x = themes)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
- Counts in the table below match the barplot
# Construct a table of counts
%>%
abby count(themes)
# A tibble: 16 × 2
themes n
<chr> <int>
1 mom|dad|marriage|money 67
2 mom|dad|marriage|no money 567
3 mom|dad|no marriage|money 109
4 mom|dad|no marriage|no money 906
5 mom|no dad|marriage|money 121
6 mom|no dad|marriage|no money 839
7 mom|no dad|no marriage|money 293
8 mom|no dad|no marriage|no money 2462
9 no mom|dad|marriage|money 41
10 no mom|dad|marriage|no money 350
11 no mom|dad|no marriage|money 96
12 no mom|dad|no marriage|no money 760
13 no mom|no dad|marriage|money 360
14 no mom|no dad|marriage|no money 2967
15 no mom|no dad|no marriage|money 865
16 no mom|no dad|no marriage|no money 9231
- What do the plot layers do?
# Just sets up the "canvas" of the plot with axis labels
ggplot(abby, aes(x = themes))
# Adds the bars
ggplot(abby, aes(x = themes)) +
geom_bar()
# Rotates the x axis labels
ggplot(abby, aes(x = themes)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
# Changes the visual theme of the plot with a white background and removes gridlines
ggplot(abby, aes(x = themes)) +
geom_bar() +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Exercise 5: Exploring course overall ratings
Now we’ll look at the distribution of the avg_rating
variable and associated summary statistics.
- We might expect the mean of this variable is less than zero given that more negative words might be appear in questions on an advice column.
- The code has a similar structure to the barplot in that there is an initial
ggplot()
layer which sets the canvas, then a+
to add a layer, then the final layergeom_boxplot()
(likegeom_bar()
) which tells R what type of plot to make.
ggplot(abby, aes(x = afinn_overall)) +
geom_boxplot()
Warning: Removed 490 rows containing non-finite outside the scale range
(`stat_boxplot()`).
- We replace
geom_boxplot()
withgeom_histogram()
andgeom_density()
.- The tallest bar of the histogram indicates that over 7500 questions had an overall
afinn
sentiment score between around -8 to 0.(The -8 to 0 comes from eyeballing where the tallest bar is placed on the x-axis, and the height of this bar indicates how many cases fall into that bin.) - The shape of the distribution: roughly symmetric
- The tallest bar of the histogram indicates that over 7500 questions had an overall
# Histogram
ggplot(abby, aes(x = afinn_overall)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 490 rows containing non-finite outside the scale range
(`stat_bin()`).
# Density plot
ggplot(abby, aes(x = afinn_overall)) +
geom_density()
Warning: Removed 490 rows containing non-finite outside the scale range
(`stat_density()`).
- Boxplot shows min, max, median, 1st and 3rd quartile easily. (It shows median, 1st and 3rd quartile directly as lines)
- Histogram and density plot show min and max but the mean and median aren’t shown directly–we have to roughly guess based on the peak of the distribution
# Summary statistics
summary(abby$afinn_overall)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-140.000 -6.000 -1.000 -1.401 3.000 100.000 490
%>%
abby summarize(mean(afinn_overall, na.rm = TRUE), median(afinn_overall, na.rm = TRUE), sd(afinn_overall, na.rm = TRUE))
# A tibble: 1 × 3
mean(afinn_overall, na.rm = TR…¹ median(afinn_overall…² sd(afinn_overall, na…³
<dbl> <dbl> <dbl>
1 -1.40 -1 11.1
# ℹ abbreviated names: ¹`mean(afinn_overall, na.rm = TRUE)`,
# ²`median(afinn_overall, na.rm = TRUE)`, ³`sd(afinn_overall, na.rm = TRUE)`
- The distribution of sentiment scores is roughly symmetric with a mean of -1.4 and a similar median of -1. The median and mean are quite similar because the distribution is fairly symmetric. The standard deviation of the sentiment scores is about 11.08 which tells us how much variation there is from the center of the distribution. 11.08 is somewhat high given the IQR of -6 to 3 (which is a span of 9 units).
Exercise 6: Box plots vs. histograms vs. density plots
- Boxplots very clearly show key summary statistics like median, 1st and 3rd quartile
- Boxplots can oversimplify by not showing the shape of the distribution.
Exercise 7: Explore outliers
There are some positive words in the questions that seem to pull up the sentiment score a lot despite the negative overall tone. From this we can see the limitations of a basic sentiment analysis in which the sentiment of each word is considered in isolation.
%>%
abby filter(afinn_overall > 50) %>%
pull(question_only) %>%
head() # Just to look at first 6
[1] "i am a 36-year-old college dropout whose lifelong ambition was to be a physician. i have a very good job selling pharmaceutical supplies, but my heart is still in the practice of medicine. i do volunteer work at the local hospital on my time off, and people tell me i would have made a wonderful doctor.\nif i go back to college and get my degree, then go to medical school, do my internship and finally get into the actual practice of medicine, it will take me seven years! but, abby, in seven years i will be 43 years old. what do you think?\nunfulfilled in philly"
[2] "we have an only child--a grown daughter we dearly love--and when we pass on, we want to leave her our entire estate, which is considerable.\nthe thing that troubles us is this: our daughter is married to a very unworthy character. for years he has taken advantage of her sweet, forgiving, generous nature because he knows she worships him. we are sure that whatever we leave our daughter will be spent on this dirty dog.\nhow can we prevent this from happening?\nbewildered"
[3] "both of our sons have been married for about 15 years. their wives were of normal weight when they married our sons, but one daughter-in- law weighs about 300 pounds and the other weighs about 225. their ages are 35 and 37. both our sons are good-looking, and neither is fat.\nour daughters-in-law seem to have no pride in their appearance, which upsets everyone in the family, except themselves. they are fat, they know it and they don't care! when they first began to put on weight, they tried various diets, pills, doctors, etc., but they both gave up and decided to \"accept\" themselves as they are.\nthey wear the wrong kind of clothes (shorts and blue jeans) without any apologies.\nour problem (my husband's and mine) is how do we cope with this? we are ashamed to be around them. our sons have accepted the situation, but we seem unable to.\nperhaps we need more help than the girls. any suggestions?\nupset in florida"
[4] "the letter from \"concerned mom,\" who was trying to teach her 5-year-old not to accept gifts from strangers, prompts this letter.\na gentleman friend of mine recently stood in line behind a mother and her young daughter at a bank. the child remarked on the visor he was wearing, as it had the name of a popular pizza imprinted on it.\nmy friend, who is the public relations director for this pizza firm, wanted the child to have the visor but, instead of giving it to the child, he handed the visor to her mother and said to the child: \"i'm giving this to your mother to give to you, because she's probably told you never to accept gifts from a stranger. you won't ever do that, will you?\"\nwhat a thoughtful way to be friendly while reinforcing a message mothers cannot stress enough.\nsue in wichita, kan."
[5] "in january, i sent an original manuscript as a gift to woody allen. i had hand-bound the pages, and decorated the binding with baroque pearls and amethyst. i enclosed my name, address and telephone number. i had hoped that woody would send me a note or call me, or at the very least, instruct his secretary to do so.\nto date, i haven't received even an acknowledgment that my gift was received. is it unrealistic of me to expect a thank-you from a famous person?\ndisappointed in california."
[6] "will you please, please discourage high school and college graduates from sending graduation invitations to every distant relative they and their parents ever heard of? we all know that sending \"invitations\" to people we hardly know is a flagrant, shameless bid for a gift. and if, in a moment of weakness, one does send a gift, a barrage of birth announcements and invitations to weddings, showers and more graduations is sure to follow.\ni am a 75-year-old widow, living on social security and very little else. i just received a high school graduation invitation from the granddaughter of a third cousin whom i have not seen in so long i wouldn't even recognize her. (i have never even met her granddaughter.)\ni have many relatives in this town, but i never hear from them unless they are celebrating something that requires a gift. i have no car, yet they \"invite\" me to every imaginable event, knowing full well i can't possibly attend. this is just shameless begging.\ni am not cheap. i just sent a generous graduation gift to a neighbor girl who used to stop by every day to bring in my mail and newspaper and ask if i needed any errands run.\ndon't suggest that i send \"a nice card\" to the relatives who send me invitations to events they know i can't attend. we both know a card is not what these spongers want.\nsick of them in iowa city"
Exercise 8: Returning to our context, looking ahead
- Answers will vary
Exercise 11: Read in and get to know the weather data
<- read_csv("https://raw.githubusercontent.com/Mac-STAT/data/main/weather_3_locations.csv") weather
Rows: 2367 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): location, windgustdir, winddir9am, winddir3pm, raintoday, raintom...
dbl (17): mintemp, maxtemp, rainfall, evaporation, sunshine, windgustspeed,...
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Exercise 12: Exploring the data structure
Check out the basic features of the weather data.
# Examine the first six cases
head(weather)
# A tibble: 6 × 24
date location mintemp maxtemp rainfall evaporation sunshine windgustdir
<date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 2020-01-01 Wollongo… 17.1 23.1 0 NA NA SSW
2 2020-01-02 Wollongo… 17.7 24.2 0 NA NA SSW
3 2020-01-03 Wollongo… 19.7 26.8 0 NA NA NE
4 2020-01-04 Wollongo… 20.4 35.5 0 NA NA SSW
5 2020-01-05 Wollongo… 19.8 21.4 0 NA NA SSW
6 2020-01-06 Wollongo… 18.3 22.9 0 NA NA NE
# ℹ 16 more variables: windgustspeed <dbl>, winddir9am <chr>, winddir3pm <chr>,
# windspeed9am <dbl>, windspeed3pm <dbl>, humidity9am <dbl>,
# humidity3pm <dbl>, pressure9am <dbl>, pressure3pm <dbl>, cloud9am <dbl>,
# cloud3pm <dbl>, temp9am <dbl>, temp3pm <dbl>, raintoday <chr>,
# risk_mm <dbl>, raintomorrow <chr>
# Find the dimensions of the data
dim(weather)
[1] 2367 24
A case represents a day of the year in a particular area (Hobart, Uluru, Wollongong as seen by the location
variable).
Exercise 13: Exploring rainfall
The raintoday
variable contains information about rainfall.
raintoday
is categorical (No, Yes)- It is more common to have no rain.
# Visualization
ggplot(weather, aes(x = raintoday)) +
geom_bar()
# Numerical summaries
%>%
weather count(raintoday)
# A tibble: 3 × 2
raintoday n
<chr> <int>
1 No 1864
2 Yes 446
3 <NA> 57
Exercise 14: Exploring temperature
The maxtemp
variable contains information on the daily high temperature.
maxtemp
is quantitative- The typical max temperature is around 23 degrees Celsius (with an average of 23.62 and a median of 22 degrees). The max temperatures ranged from 8.6 to 45.4 degrees. Finally, on the typical day, the max temp falls about 7.8 degrees from the mean. There are multiple modes in the distribution of max temperature—this likely reflects the different cities in the dataset.
# Visualization
ggplot(weather, aes(x = maxtemp)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 34 rows containing non-finite outside the scale range
(`stat_bin()`).
# Numerical summaries
summary(weather$maxtemp)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
8.60 18.10 22.00 23.62 27.40 45.40 34
# There are missing values (NAs) in this variable, so we add
# the na.rm = TRUE argument
%>%
weather summarize(sd(maxtemp, na.rm = TRUE))
# A tibble: 1 × 1
`sd(maxtemp, na.rm = TRUE)`
<dbl>
1 7.80
Exercise 15: Customizing! (CHALLENGE)
ggplot(weather, aes(x = maxtemp)) +
geom_histogram() +
labs(x = "Maximum temperature", y = "Number of days", title = "Distribution of max temperatures in Perth")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 34 rows containing non-finite outside the scale range
(`stat_bin()`).
# Make the bars green
ggplot(weather, aes(x = raintoday)) +
geom_bar(fill = "green")