Review & Practice

Day 1, Session 1

1 / 49

We're going to start off with some review of what we've learned for the past several weeks.

Go to Conf Session 1 - Review

2 / 49

For this session, we'll be doing our work on the Academy website (conf.posit.academy).

Go ahead and open up the milestone for Session 1.

🚀 Rapid fire review

3 / 49

What we'll do:

I'm going to give you a prompt with a short tidyverse task, and you'll work to recreate it.

You'll do your work in the Quarto document you have open so you have a record of your code. You can create as many new code chunks as you like in here.

Review challenge

  • Working together with your groupmates is encouraged.

  • After 1-2 minutes, we'll go over the answer together. And then move on to the next question.

  • Everyone gets a small prize at the end! 🎉

4 / 49

Working together is okay (and encouraged)

We'll go over the answer once I see that most people have finished.

There are 20 prompts -- we may get through all of them (or not). But there will be a prize for everyone at the end.

I should also note that the prompts are a bit of a mishmash of difficulty. The main idea is that you exercise some tidyverse recall and get warmed up.

Done

Help

5 / 49

You'll use the sticky system to signal that you're done or that you need help.

outbreaks

6 / 49

We'll be working with data about foodborne and waterborne disease outbreaks spread by contact with environmental sources or infected people or animals. I'll refer to this data set as outbreaks. It comes from the CDC's National Outbreak Reporting System (NORS).

First things first, let's take a look at what's in this data set.

So here is your first prompt:

Your Turn 1

Read in the data and explore it. Can you recreate output that looks like this?

## Rows: 57,649
## Columns: 21
## $ year <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009,…
## $ month <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,…
## $ state <chr> "Minnesota", "Minnesota", "Minnesota", "M…
## $ primary_mode <chr> "Person-to-person", "Food", "Person-to-pe…
## $ etiology <chr> "Norovirus Genogroup II", "Norovirus", "N…
## $ serotype_or_genotype <chr> "unknown", NA, NA, NA, NA, NA, NA, NA, NA…
## $ etiology_status <chr> "Confirmed", "Suspected", "Suspected", "C…
## $ setting <chr> "Hotel/motel", "Restaurant - Sit-down din…
## $ illnesses <dbl> 21, 2, 50, 24, 16, 5, 3, 21, 7, 5, 22, 16…
## $ hospitalizations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0,…
## $ info_on_hospitalizations <dbl> 19, 2, 0, 24, 8, 5, 3, 21, 7, 5, 1, 16, 1…
## $ deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ info_on_deaths <dbl> 19, 2, 50, 24, 16, 5, 3, 21, 7, 5, 22, 16…
## $ food_vehicle <chr> NA, NA, NA, NA, NA, NA, NA, "cookies", "s…
## $ food_contaminated_ingredient <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ifsac_category <chr> NA, NA, NA, NA, NA, NA, NA, "Multiple", "…
## $ water_exposure <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ water_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ animal_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ animal_type_specify <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ water_status <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
7 / 49

1 minute

Solution 1

library(tidyverse)
outbreaks <- read_csv("data/outbreaks.csv")
glimpse(outbreaks)
## Rows: 57,649
## Columns: 21
## $ year <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009,…
## $ month <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,…
## $ state <chr> "Minnesota", "Minnesota", "Minnesota", "M…
## $ primary_mode <chr> "Person-to-person", "Food", "Person-to-pe…
## $ etiology <chr> "Norovirus Genogroup II", "Norovirus", "N…
## $ serotype_or_genotype <chr> "unknown", NA, NA, NA, NA, NA, NA, NA, NA…
## $ etiology_status <chr> "Confirmed", "Suspected", "Suspected", "C…
## $ setting <chr> "Hotel/motel", "Restaurant - Sit-down din…
## $ illnesses <dbl> 21, 2, 50, 24, 16, 5, 3, 21, 7, 5, 22, 16…
## $ hospitalizations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0,…
## $ info_on_hospitalizations <dbl> 19, 2, 0, 24, 8, 5, 3, 21, 7, 5, 1, 16, 1…
## $ deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ info_on_deaths <dbl> 19, 2, 50, 24, 16, 5, 3, 21, 7, 5, 22, 16…
## $ food_vehicle <chr> NA, NA, NA, NA, NA, NA, NA, "cookies", "s…
## $ food_contaminated_ingredient <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ifsac_category <chr> NA, NA, NA, NA, NA, NA, NA, "Multiple", "…
## $ water_exposure <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ water_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ animal_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ animal_type_specify <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ water_status <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
8 / 49

Your Turn 2

What is the earliest year on record in this data set? The latest?

Answer the questions by creating two tables that are sorted.

9 / 49

Solution 2

outbreaks %>%
  arrange(year)

Columns: year, month, state, primary_mode, etiology, serotype_or_genotype, etiology_status, setting, illnesses, hospitalizations, info_on_hospitalizations, deaths, info_on_deaths, food_vehicle, food_contaminated_ingredient, ifsac_category, water_exposure, water_type, animal_type, animal_type_specify, water_status
Rows (empty cells omitted):
1971 | 11 | Alaska | Water | Shigella sonnei | Confirmed | Mobile Home Park | 89 | 0 | Drinking water | Community | Reviewed
1971 | 6 | Alabama | Water | Selenium | Confirmed | Unknown | 3 | 0 | Drinking water | Individual/Private | Reviewed
1971 | 6 | Arkansas | Water | Hepatitis A | Confirmed | Store/Shop | 98 | 0 | Drinking water | Other | Reviewed
1971 | 2 | California | Water | Copper | Confirmed | Restaurant/Cafeteria | 2 | 0 | Drinking water | Community | Reviewed
1971 | 7 | California | Water | Unknown | Suspected | Community/Municipality | 3500 | 0 | Drinking water | Community | Reviewed
1971 | 12 | California | Water | Unknown | Suspected | Hotel/Motel/Lodge/Inn | 84 | 0 | Drinking water | Other | Reviewed
1971 | 7 | Kentucky | Water | Unknown | Suspected | Park - State Park | 68 | 0 | Drinking water | Other | Reviewed
1971 | 6 | Missouri | Water | Unknown | Suspected | Subdivision/Neighborhood | 2 | 0 | Drinking water | Community | Reviewed
1-8 of 100 rows (Total: 57,649)

outbreaks %>%
  arrange(desc(year))

Columns: year, month, state, primary_mode, etiology, serotype_or_genotype, etiology_status, setting, illnesses, hospitalizations, info_on_hospitalizations, deaths, info_on_deaths, food_vehicle, food_contaminated_ingredient, ifsac_category, water_exposure, water_type, animal_type, animal_type_specify, water_status
Rows (empty cells omitted):
2020 | 1 | Mississippi | Person-to-person | Long-term care/nursing home/assisted living facility | 18 | 0 | 18 | 0 | 18
2020 | 1 | Wisconsin | Person-to-person | Norovirus Genogroup II | GII.P16 - GII.4 Sydney | Confirmed | Other healthcare facility | 6 | 0 | 6 | 0 | 6
2020 | 1 | Nebraska | Person-to-person | Norovirus Genogroup II | GII.P16 - GII.4 Sydney | Confirmed | Long-term care/nursing home/assisted living facility | 20 | 1 | 20 | 0 | 20
2020 | 1 | New York | Food | Clostridium perfringens | Suspected | Caterer (food prepared off-site from where served) | 7 | 0 | 7 | 0 | 7 | curry chicken; pelau dish; curry chick peas w/ potatoes | chicken; peas; meat; chick peas; potatoes | Multiple
2020 | 1 | New Mexico | Person-to-person | Norovirus Genogroup II | unknown | Confirmed | Long-term care/nursing home/assisted living facility | 18 | 0 | 18 | 0 | 18
2020 | 1 | Minnesota | Person-to-person | Norovirus unknown | Suspected | Long-term care/nursing home/assisted living facility | 95 | 1 | 95 | 2 | 95
2020 | 1 | Minnesota | Person-to-person | Norovirus unknown; Clostridium difficile | Suspected; Suspected | Long-term care/nursing home/assisted living facility | 34 | 4 | 34 | 1 | 34
2020 | 1 | Minnesota | Person-to-person | Norovirus unknown | Suspected | Long-term care/nursing home/assisted living facility | 4 | 0 | 0
1-8 of 100 rows (Total: 57,649)
10 / 49

Your Turn 3

The state variable appears to include US locations -- is the data limited to the 50 states?

Produce a table that displays all the unique values of this variable.

11 / 49

Solution 3

outbreaks %>%
  distinct(state)

state
Minnesota
Pennsylvania
Alaska
Alabama
Illinois
Iowa
Montana
Louisiana
1-8 of 58 rows
12 / 49

Your Turn 4

What are the different etiologies that have been recorded in this data set, and how often do they appear in the data? Display your results in a table like the one below.

etiology | n
Acanthamoeba unknown | 1
Adenovirus | 3
Adenovirus B | 2
Adenovirus F; Escherichia coli, Enteroaggregative | 1
Adenovirus F; Escherichia coli, Enteropathogenic | 1
Adenovirus F; Escherichia coli, Enteropathogenic; Clostridium difficile | 1
Adenovirus other; Adenovirus | 1
Adenovirus unknown | 1
1-8 of 100 rows (Total: 679)
13 / 49

Solution 4

outbreaks %>%
  count(etiology)

etiology | n
Acanthamoeba unknown | 1
Adenovirus | 3
Adenovirus B | 2
Adenovirus F; Escherichia coli, Enteroaggregative | 1
Adenovirus F; Escherichia coli, Enteropathogenic | 1
Adenovirus F; Escherichia coli, Enteropathogenic; Clostridium difficile | 1
Adenovirus other; Adenovirus | 1
Adenovirus unknown | 1
1-8 of 100 rows (Total: 679)
14 / 49

Your Turn 5

Let's turn our attention to only observations where the primary mode of infection is Food. Using that criterion, can you reproduce the table below?

Note: ifsac_category refers to food category (IFSAC = Interagency Food Safety Analytics Collaboration, part of the CDC)

ifsac_category | n
Beef | 723
Chicken | 571
Crustaceans | 111
Dairy | 356
Eggs | 197
Fish | 1019
Fruits | 268
Fungi | 45
1-8 of 28 rows
15 / 49

Solution 5

outbreaks %>%
  filter(primary_mode == "Food") %>%
  count(ifsac_category)

ifsac_category | n
Beef | 723
Chicken | 571
Crustaceans | 111
Dairy | 356
Eggs | 197
Fish | 1019
Fruits | 268
Fungi | 45
1-8 of 28 rows
16 / 49

Your Turn 6

Explore the relationship between month and the number of illnesses with a series of boxplots like the one to the right. Can you recreate it?

Hint: Try using the group aesthetic to tell R that you'd like to create a boxplot for each unique value of month.

17 / 49

Check documentation for use of the argument that allows you to create a new boxplot for each grouping.

Solution 6

outbreaks %>%
  ggplot(aes(x = month, y = illnesses)) +
  geom_boxplot(aes(group = month))

The outlier here makes it hard to see what's going on. Let's modify this.

18 / 49

Your Turn 7

The outlier we just observed represents the outbreak associated with the highest number of illnesses. Which observation is this?

  1. Display only this observation in a table.

  2. Then, recreate the boxplots without this observation.

Here is the code we just used to make the boxplots in the last exercise:

outbreaks %>%
  ggplot(aes(x = month, y = illnesses)) +
  geom_boxplot(aes(group = month))
19 / 49

Solution 7

outbreaks %>%
  filter(illnesses == max(illnesses))

Columns: year, month, state, primary_mode, etiology, serotype_or_genotype, etiology_status, setting, illnesses, hospitalizations, info_on_hospitalizations, deaths, info_on_deaths, food_vehicle, food_contaminated_ingredient, ifsac_category, water_exposure, water_type, animal_type, animal_type_specify, water_status
Rows (empty cells omitted):
1993 | 3 | Wisconsin | Water | Cryptosporidium parvum | Confirmed | Community/Municipality | 403000 | 50 | Drinking water | Community | Reviewed

outbreaks %>%
  filter(illnesses != max(illnesses)) %>%
  ggplot(aes(x = month, y = illnesses)) +
  geom_boxplot(aes(group = month))
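
If you'd like to check the two filters away from the full data, here is a minimal sketch on a made-up table. The `toy` tibble and its values are hypothetical and only stand in for outbreaks.

```r
library(dplyr)

# Hypothetical toy data, not the real outbreaks file
toy <- tibble(
  month     = c(1, 1, 2, 3),
  illnesses = c(5, 12, 403000, 7)
)

# Keep only the row with the largest number of illnesses
worst <- toy %>%
  filter(illnesses == max(illnesses))

# Keep every row except that one
rest <- toy %>%
  filter(illnesses != max(illnesses))
```

One caveat: max() returns NA if the column contains any missing values, so on a column with NAs you would need max(illnesses, na.rm = TRUE) for these filters to keep anything.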

20 / 49

Your Turn 8

Which outbreaks led to the highest number of illnesses?

Display these observations in a table from greatest to fewest illnesses:

Columns: year, month, state, primary_mode, etiology, serotype_or_genotype, etiology_status, setting, illnesses, hospitalizations, info_on_hospitalizations, deaths, info_on_deaths, food_vehicle, food_contaminated_ingredient, ifsac_category, water_exposure, water_type, animal_type, animal_type_specify, water_status
Rows (empty cells omitted):
1993 | 3 | Wisconsin | Water | Cryptosporidium parvum | Confirmed | Community/Municipality | 403000 | 50 | Drinking water | Community | Reviewed
1987 | 1 | Georgia | Water | Cryptosporidium parvum | Confirmed | Community/Municipality | 13000 | 0 | Drinking water | Community | Reviewed
1983 | 6 | Pennsylvania | Water | Unknown | Suspected | Camp/Cabin Setting | 11400 | 2 | 0 | Drinking water | Other | Reviewed
1991 | 8 | Puerto Rico | Water | Unknown | Suspected | Community/Municipality | 9847 | 12 | 0 | Drinking water | Community | Reviewed
1980 | 6 | Texas | Water | Unknown | Suspected | Community/Municipality | 8000 | 0 | Drinking water | Community | Reviewed
2007 | 5 | Utah | Water | Cryptosporidium unknown | Confirmed | Community/Municipality | 5697 | 0 | 0 | Recreational water -- treated | Pool - Swimming Pool | Reviewed
1995 | 7 | Georgia | Water | Cryptosporidium parvum; Giardia duodenalis; Unknown; Unknown | Confirmed; Confirmed; Suspected; Suspected | Park - Waterpark | 5449 | 2 | 0 | Recreational water -- treated | Pool - Swimming Pool; Other; Pool - Wave Pool | Reviewed
1978 | 3 | Colorado | Water | Giardia duodenalis | Confirmed | Community/Municipality | 5000 | 0 | Drinking water | Community | Reviewed
1-8 of 100 rows (Total: 57,649)
21 / 49

Solution 8

outbreaks %>%
  arrange(desc(illnesses))

Columns: year, month, state, primary_mode, etiology, serotype_or_genotype, etiology_status, setting, illnesses, hospitalizations, info_on_hospitalizations, deaths, info_on_deaths, food_vehicle, food_contaminated_ingredient, ifsac_category, water_exposure, water_type, animal_type, animal_type_specify, water_status
Rows (empty cells omitted):
1993 | 3 | Wisconsin | Water | Cryptosporidium parvum | Confirmed | Community/Municipality | 403000 | 50 | Drinking water | Community | Reviewed
1987 | 1 | Georgia | Water | Cryptosporidium parvum | Confirmed | Community/Municipality | 13000 | 0 | Drinking water | Community | Reviewed
1983 | 6 | Pennsylvania | Water | Unknown | Suspected | Camp/Cabin Setting | 11400 | 2 | 0 | Drinking water | Other | Reviewed
1991 | 8 | Puerto Rico | Water | Unknown | Suspected | Community/Municipality | 9847 | 12 | 0 | Drinking water | Community | Reviewed
1980 | 6 | Texas | Water | Unknown | Suspected | Community/Municipality | 8000 | 0 | Drinking water | Community | Reviewed
2007 | 5 | Utah | Water | Cryptosporidium unknown | Confirmed | Community/Municipality | 5697 | 0 | 0 | Recreational water -- treated | Pool - Swimming Pool | Reviewed
1995 | 7 | Georgia | Water | Cryptosporidium parvum; Giardia duodenalis; Unknown; Unknown | Confirmed; Confirmed; Suspected; Suspected | Park - Waterpark | 5449 | 2 | 0 | Recreational water -- treated | Pool - Swimming Pool; Other; Pool - Wave Pool | Reviewed
1978 | 3 | Colorado | Water | Giardia duodenalis | Confirmed | Community/Municipality | 5000 | 0 | Drinking water | Community | Reviewed
1-8 of 100 rows (Total: 57,649)
22 / 49

Your Turn 9

How many different primary modes of illness are in this data set? How often does each appear in the data?

Order the rows so that the most common primary mode appears at the top.

23 / 49

Solution 9

outbreaks %>%
  count(primary_mode, sort = TRUE)

or...

outbreaks %>%
  count(primary_mode) %>%
  arrange(desc(n))

primary_mode | n
Person-to-person | 26561
Food | 23105
Indeterminate/Other/Unknown | 4647
Water | 2713
Animal Contact | 512
Environmental contamination other than food/water | 111
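
To convince yourself that the two versions of Solution 9 give the same table, here is a small sketch on made-up data. The `toy` tibble is hypothetical, and the names `a` and `b` are only for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for outbreaks
toy <- tibble(
  primary_mode = c("Food", "Water", "Food", "Person-to-person", "Food", "Water")
)

# Version 1: let count() do the sorting
a <- toy %>%
  count(primary_mode, sort = TRUE)

# Version 2: count first, then sort by n in descending order
b <- toy %>%
  count(primary_mode) %>%
  arrange(desc(n))

# Compare the two results
all.equal(a, b)
```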
24 / 49

Looks like the most common primary mode for outbreaks in this particular data set is person-to-person.

Your Turn 10

Can you recreate this visualization of the primary modes of infection?

Hint: Look up the help page for coord_flip().

25 / 49

Solution 10

outbreaks %>%
  ggplot(aes(x = primary_mode)) +
  geom_bar() +
  coord_flip()

26 / 49

Notice coord_flip()

If your first inclination was to use count() then you might have come up with an answer like...

Solution 10 - alternative

outbreaks %>%
  count(primary_mode, name = "count") %>%
  ggplot(aes(x = primary_mode, y = count)) +
  geom_col() +
  coord_flip()

27 / 49

This version works too, but it's a little less efficient code-wise.

Your Turn 11

Let's take a closer look at outbreaks for which the primary mode of infection is Animal Contact.

How many Animal Contact outbreaks has each individual state had? Recreate the table below.

state | n
Ohio | 53
Minnesota | 42
Idaho | 24
Colorado | 19
Pennsylvania | 17
Utah | 17
Wisconsin | 17
Illinois | 15
1-8 of 43 rows

Hint: You will need to remove observations where state is equal to "Multistate".

28 / 49

Solution 11

outbreaks %>%
  filter(
    primary_mode == "Animal Contact",
    state != "Multistate"
  ) %>%
  count(state, sort = TRUE)

state | n
Ohio | 53
Minnesota | 42
Idaho | 24
Colorado | 19
Pennsylvania | 17
Utah | 17
Wisconsin | 17
Illinois | 15
1-8 of 43 rows
29 / 49

Your Turn 12

Now that we know the top 3 states are Ohio, Minnesota, and Idaho, recreate this table:

  • Has only the Animal Contact outbreaks for those locations

  • Column names with "animal" appear after state

  • Drop the primary_mode column

Columns: year, month, state, animal_type, animal_type_specify, etiology, serotype_or_genotype, etiology_status, setting, illnesses, hospitalizations, info_on_hospitalizations, deaths, info_on_deaths, food_vehicle, food_contaminated_ingredient, ifsac_category, water_exposure, water_type, water_status
Rows (empty cells omitted):
2009 | 1 | Ohio | Cattle | Cattle, calf | Cryptosporidium unknown | Confirmed | 4 | 1 | 4 | 0 | 4
2009 | 4 | Ohio | Pet fish | Fish | Salmonella enterica | Paratyphi B | Confirmed | 2 | 2 | 2 | 0 | 2
2009 | 5 | Ohio | Cattle | Cattle, calf | Cryptosporidium unknown | Confirmed | 3 | 0 | 0 | 0 | 3
2009 | 5 | Ohio | Sheep or goats | Goat, kid | Cryptosporidium unknown | Confirmed | 3 | 2 | 3 | 0 | 3
2009 | 5 | Idaho | Cat or kitten | Cat, kitten | Campylobacter jejuni | Suspected | 2 | 0 | 2 | 0 | 2
2009 | 8 | Minnesota | Cattle | Cattle, calf | Cryptosporidium parvum | Confirmed | 4 | 0 | 4 | 0 | 4
2009 | 8 | Minnesota | Cattle | Cattle, calf | Escherichia coli, Shiga toxin-producing | O157:H7 | Confirmed | 3 | 2 | 3 | 0 | 3
2010 | 3 | Ohio | Cattle | Cattle, calf | Cryptosporidium unknown | Confirmed | 2 | 0 | 2 | 0 | 2
1-8 of 100 rows (Total: 119)
30 / 49

Solution 12

outbreaks %>%
  filter(
    primary_mode == "Animal Contact",
    state %in% c("Ohio", "Minnesota", "Idaho")
  ) %>%
  relocate(
    contains("animal"),
    .after = state
  ) %>%
  select(-primary_mode)

Columns: year, month, state, animal_type, animal_type_specify, etiology, serotype_or_genotype, etiology_status, setting, illnesses, hospitalizations, info_on_hospitalizations, deaths, info_on_deaths, food_vehicle, food_contaminated_ingredient, ifsac_category, water_exposure, water_type, water_status
Rows (empty cells omitted):
2009 | 1 | Ohio | Cattle | Cattle, calf | Cryptosporidium unknown | Confirmed | 4 | 1 | 4 | 0 | 4
2009 | 4 | Ohio | Pet fish | Fish | Salmonella enterica | Paratyphi B | Confirmed | 2 | 2 | 2 | 0 | 2
2009 | 5 | Ohio | Cattle | Cattle, calf | Cryptosporidium unknown | Confirmed | 3 | 0 | 0 | 0 | 3
2009 | 5 | Ohio | Sheep or goats | Goat, kid | Cryptosporidium unknown | Confirmed | 3 | 2 | 3 | 0 | 3
2009 | 5 | Idaho | Cat or kitten | Cat, kitten | Campylobacter jejuni | Suspected | 2 | 0 | 2 | 0 | 2
2009 | 8 | Minnesota | Cattle | Cattle, calf | Cryptosporidium parvum | Confirmed | 4 | 0 | 4 | 0 | 4
2009 | 8 | Minnesota | Cattle | Cattle, calf | Escherichia coli, Shiga toxin-producing | O157:H7 | Confirmed | 3 | 2 | 3 | 0 | 3
2010 | 3 | Ohio | Cattle | Cattle, calf | Cryptosporidium unknown | Confirmed | 2 | 0 | 2 | 0 | 2
1-8 of 100 rows (Total: 119)
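
The column reordering in Solution 12 is easier to see on a one-row table. This is a minimal sketch, assuming a hypothetical `toy` tibble that borrows a few column names from outbreaks.

```r
library(dplyr)

# Hypothetical one-row table with a subset of the outbreaks columns
toy <- tibble(
  year                = 2009,
  state               = "Ohio",
  primary_mode        = "Animal Contact",
  illnesses           = 4,
  animal_type         = "Cattle",
  animal_type_specify = "Cattle, calf"
)

out <- toy %>%
  # Move every column whose name contains "animal" to just after state
  relocate(contains("animal"), .after = state) %>%
  # Then drop primary_mode
  select(-primary_mode)

names(out)
```

Note that contains() matches on column names, not on the values inside the columns, which is why primary_mode stays put until select() drops it.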
31 / 49

Your Turn 13

Visualize the relationship between year and number of illnesses with a smooth line plot:

  • Before you plot, exclude the outlier we previously identified (the outbreak with the max number of illnesses).

  • Color each line by its primary mode.

32 / 49

Solution 13

outbreaks %>%
  filter(illnesses != max(illnesses)) %>%
  ggplot(aes(y = illnesses, x = year, color = primary_mode)) +
  geom_smooth(se = FALSE)

33 / 49

Your Turn 14

Visualize the number of illnesses in Washington DC, grouped by primary mode of infection using a "violin plot". Hint: You may need to investigate geom_violin().

(Don't worry if your plot dimensions look different than the plot below.)

## Warning: Groups with fewer than two data points have been dropped.

34 / 49

Solution 14

outbreaks %>%
  filter(state == "Washington DC") %>%
  ggplot(aes(x = primary_mode, y = illnesses)) +
  geom_violin(aes(color = primary_mode)) +
  geom_point(alpha = 0.3) # any alpha value < 1 is OK

35 / 49

Your Turn 15

Let's focus only on Washington DC and foodborne illnesses. Show these outbreaks in a table from most recent to least recent.

Hint: Which two columns contain date information?

Columns: year, month, state, primary_mode, etiology, serotype_or_genotype, etiology_status, setting, illnesses, hospitalizations, info_on_hospitalizations, deaths, info_on_deaths, food_vehicle, food_contaminated_ingredient, ifsac_category, water_exposure, water_type, animal_type, animal_type_specify, water_status
Rows (empty cells omitted):
2019 | 6 | Washington DC | Food | Cyclospora cayetanensis | Confirmed | Restaurant - "Fast-food"(drive up service or pay at counter); Restaurant - Sit-down dining | 3 | 0 | 3 | 0 | 3 | basil mayo | Multiple
2019 | 5 | Washington DC | Food | Norovirus Genogroup II | unknown - unknown | Suspected | Private home/residence | 4 | 0 | 4 | 0 | 4
2019 | 3 | Washington DC | Food | Restaurant - Sit-down dining | 16 | 0 | 14 | 0 | 16
2018 | 9 | Washington DC | Food | Restaurant - Buffet | 4 | 0 | 4 | 0 | 4
2018 | 3 | Washington DC | Food | Norovirus Genogroup II | unknown - unknown | Confirmed | Restaurant - Sit-down dining | 6 | 0 | 5 | 0 | 5
2018 | 3 | Washington DC | Food | Norovirus Genogroup II | Suspected | Restaurant - Sit-down dining | 12 | 0 | 12 | 0 | 12
2018 | 3 | Washington DC | Food | Norovirus unknown | Suspected | Restaurant - Sit-down dining | 3 | 0 | 3 | 0 | 3
2017 | 12 | Washington DC | Food | Norovirus Genogroup II | unknown - unknown | Confirmed | Caterer (food prepared off-site from where served) | 52 | 1 | 51 | 0 | 51
1-8 of 44 rows
36 / 49

Solution 15

outbreaks %>%
  filter(
    state == "Washington DC" &
      primary_mode == "Food"
  ) %>%
  arrange(desc(year), desc(month))

Columns: year, month, state, primary_mode, etiology, serotype_or_genotype, etiology_status, setting, illnesses, hospitalizations, info_on_hospitalizations, deaths, info_on_deaths, food_vehicle, food_contaminated_ingredient, ifsac_category, water_exposure, water_type, animal_type, animal_type_specify, water_status
Rows (empty cells omitted):
2019 | 6 | Washington DC | Food | Cyclospora cayetanensis | Confirmed | Restaurant - "Fast-food"(drive up service or pay at counter); Restaurant - Sit-down dining | 3 | 0 | 3 | 0 | 3 | basil mayo | Multiple
2019 | 5 | Washington DC | Food | Norovirus Genogroup II | unknown - unknown | Suspected | Private home/residence | 4 | 0 | 4 | 0 | 4
2019 | 3 | Washington DC | Food | Restaurant - Sit-down dining | 16 | 0 | 14 | 0 | 16
2018 | 9 | Washington DC | Food | Restaurant - Buffet | 4 | 0 | 4 | 0 | 4
2018 | 3 | Washington DC | Food | Norovirus Genogroup II | unknown - unknown | Confirmed | Restaurant - Sit-down dining | 6 | 0 | 5 | 0 | 5
2018 | 3 | Washington DC | Food | Norovirus Genogroup II | Suspected | Restaurant - Sit-down dining | 12 | 0 | 12 | 0 | 12
2018 | 3 | Washington DC | Food | Norovirus unknown | Suspected | Restaurant - Sit-down dining | 3 | 0 | 3 | 0 | 3
2017 | 12 | Washington DC | Food | Norovirus Genogroup II | unknown - unknown | Confirmed | Caterer (food prepared off-site from where served) | 52 | 1 | 51 | 0 | 51
1-8 of 44 rows
37 / 49

Your Turn 16

Continue to focus only on Washington DC and foodborne illnesses.

Use tidyverse functions to display the unique values of food_vehicle as a vector, as shown below:

## [1] NA
## [2] "deviled eggs"
## [3] "sandwich, tuna salad; sandwich, chicken salad"
## [4] "sandwich, chicken; pork, roasted"
## [5] "other cheese, pasteurized; honeydew melon; potato, fried"
## [6] "sandwich, chicken"
## [7] "strawberries; watermelon"
## [8] "pasta salad"
## [9] "sandwich, other specialty"
## [10] "quesadilla, chicken"
## [11] "tuna, unspecified"
## [12] "ice"
## [13] "onion, tart; mixed vegetables, unspecified; green salad"
## [14] "water"
## [15] "pastry, paris-brest; tomato, basil and mozzarella salad"
## [16] "chicken, other; potato, mashed; caesar salad"
## [17] "chicken, grilled; green salad"
## [18] "fish"
## [19] "tomato, unspecified; avocado, unspecified; guacamole, unspecified; cilantro, unspecified"
## [20] "pasta-based salads unspecified"
## [21] "fish, amberjack"
## [22] "basil mayo"
38 / 49

Solution 16

outbreaks %>%
  filter(
    state == "Washington DC",
    primary_mode == "Food"
  ) %>%
  distinct(food_vehicle) %>%
  pull()
## [1] NA
## [2] "deviled eggs"
## [3] "sandwich, tuna salad; sandwich, chicken salad"
## [4] "sandwich, chicken; pork, roasted"
## [5] "other cheese, pasteurized; honeydew melon; potato, fried"
## [6] "sandwich, chicken"
## [7] "strawberries; watermelon"
## [8] "pasta salad"
## [9] "sandwich, other specialty"
## [10] "quesadilla, chicken"
## [11] "tuna, unspecified"
## [12] "ice"
## [13] "onion, tart; mixed vegetables, unspecified; green salad"
## [14] "water"
## [15] "pastry, paris-brest; tomato, basil and mozzarella salad"
## [16] "chicken, other; potato, mashed; caesar salad"
## [17] "chicken, grilled; green salad"
## [18] "fish"
## [19] "tomato, unspecified; avocado, unspecified; guacamole, unspecified; cilantro, unspecified"
## [20] "pasta-based salads unspecified"
## [21] "fish, amberjack"
## [22] "basil mayo"
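
Solution 15 joins its two filter conditions with & while Solution 16 separates them with a comma. Here is a minimal sketch, on a hypothetical `toy` tibble, suggesting the two spellings select the same rows, followed by the distinct() and pull() step.

```r
library(dplyr)

# Hypothetical toy data standing in for outbreaks
toy <- tibble(
  state        = c("Washington DC", "Washington DC", "Ohio", "Washington DC"),
  primary_mode = c("Food", "Food", "Food", "Water"),
  food_vehicle = c(NA, "ice", "fish", "water")
)

# Conditions separated by a comma are combined with "and"
with_comma <- toy %>%
  filter(state == "Washington DC", primary_mode == "Food")

# The same conditions written with an explicit &
with_amp <- toy %>%
  filter(state == "Washington DC" & primary_mode == "Food")

# distinct() keeps one row per unique value; pull() turns the
# single remaining column into a plain vector
vehicles <- with_comma %>%
  distinct(food_vehicle) %>%
  pull()
```

With no arguments, pull() extracts the last column, which is fine here because distinct(food_vehicle) leaves only one.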
39 / 49

Your Turn 17

Find foodborne outbreaks that resulted in at least 1 death.

Display the results in a table that contains only the columns year, primary_mode, state, and deaths.

year | primary_mode | state | deaths
2009 | Food | Michigan | 1
2009 | Food | West Virginia | 2
2010 | Food | Illinois | 1
2010 | Food | Louisiana | 3
2010 | Food | New York | 1
2010 | Food | Colorado | 1
2011 | Food | Rhode Island | 2
2009 | Food | Multistate | 1
1-8 of 100 rows (Total: 265)
40 / 49

Solution 17

outbreaks %>%
  filter(
    primary_mode == "Food",
    deaths >= 1
  ) %>%
  select(year, primary_mode, state, deaths)

year | primary_mode | state | deaths
2009 | Food | Michigan | 1
2009 | Food | West Virginia | 2
2010 | Food | Illinois | 1
2010 | Food | Louisiana | 3
2010 | Food | New York | 1
2010 | Food | Colorado | 1
2011 | Food | Rhode Island | 2
2009 | Food | Multistate | 1
1-8 of 100 rows (Total: 265)
41 / 49

Your Turn 18

Calculate a new variable for each outbreak: fatality_rate (the number of deaths / the number of illnesses, multiplied by 100).

Then visualize the distribution of the fatality rate with a histogram.

42 / 49

Solution 18

outbreaks %>%
  mutate(
    fatality_rate = deaths / illnesses * 100
  ) %>%
  ggplot(aes(x = fatality_rate)) +
  geom_histogram()
## Warning: Removed 7154 rows containing non-finite values (`stat_bin()`).

43 / 49

Your Turn 19

  • Remove outbreaks that have a fatality rate of zero.

  • What is the distribution of the fatality rate for the remaining outbreaks as shown by a histogram?

44 / 49

Solution 19

outbreaks %>%
  mutate(
    fatality_rate = deaths / illnesses * 100
  ) %>%
  filter(fatality_rate > 0) %>%
  ggplot(aes(x = fatality_rate)) +
  geom_histogram()
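
The warning in Solution 18 says some rows had non-finite fatality rates. A likely reason, stated here as an assumption about the data, is missing values in deaths or illnesses. The sketch below uses a hypothetical `toy` tibble to show how such values arise and why the filter in Solution 19 also removes them.

```r
library(dplyr)

# Hypothetical toy data: one missing death count and one 0 / 0 case
toy <- tibble(
  deaths    = c(0, 1, NA, 0),
  illnesses = c(10, 20, 5, 0)
)

rates <- toy %>%
  mutate(fatality_rate = deaths / illnesses * 100)

# NA / 5 gives NA and 0 / 0 gives NaN; neither can be placed in a histogram bin
rates$fatality_rate

# filter() keeps only rows where the condition is TRUE, so rows where
# the comparison evaluates to NA are dropped along with the zeros
kept <- rates %>%
  filter(fatality_rate > 0)
```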

45 / 49

Your Turn 20 - last one!

Find Washington DC's worst outbreak as defined by its hospitalization_rate (the number of hospitalizations / the number of illnesses, multiplied by 100).

  • Subset the data so you display observations whose hospitalization rate is at least 1.6 times DC's mean hospitalization rate.

  • Your answer should be a tibble with the highest hospitalization rates shown at the top.

  • Any ties should be broken by outbreaks that happened more recently.

46 / 49

Solution 20

outbreaks %>%
  filter(state == "Washington DC") %>%
  mutate(hospitalization_rate = hospitalizations / illnesses * 100) %>%
  filter(hospitalization_rate >= 1.60 * mean(hospitalization_rate, na.rm = TRUE)) %>%
  arrange(desc(hospitalization_rate), desc(year))

Columns: year, month, state, primary_mode, etiology, serotype_or_genotype, etiology_status, setting, illnesses, hospitalizations, info_on_hospitalizations, deaths, info_on_deaths, food_vehicle, food_contaminated_ingredient, ifsac_category, water_exposure, water_type, animal_type, animal_type_specify, water_status, hospitalization_rate
Rows (empty cells omitted):
2019 | 3 | Washington DC | Environmental contamination other than food/water | Norovirus Genogroup II | unknown - unknown | Confirmed | Long-term care/nursing home/assisted living facility | 2 | 2 | 2 | 0 | 2 | 100
2015 | 3 | Washington DC | Environmental contamination other than food/water | Norovirus unknown | Confirmed | Hospital | 5 | 5 | 5 | 0 | 0 | 100
2014 | 2 | Washington DC | Indeterminate/Other/Unknown | Norovirus Genogroup II | unknown | Suspected | Long-term care/nursing home/assisted living facility | 35 | 9 | 9 | 0 | 0 | 25.7142857142857
2009 | 2 | Washington DC | Indeterminate/Other/Unknown | 75 | 12 | 75 | 0 | 75 | 16
2010 | 12 | Washington DC | Person-to-person | Norovirus unknown | Suspected | 21 | 3 | 21 | 0 | 21 | 14.2857142857143
2019 | 7 | Washington DC | Indeterminate/Other/Unknown | Norovirus unknown | Suspected | School/college/university | 16 | 2 | 16 | 0 | 16 | 12.5
2014 | 2 | Washington DC | Indeterminate/Other/Unknown | Norovirus Genogroup II | unknown | Confirmed | Long-term care/nursing home/assisted living facility | 19 | 2 | 2 | 0 | 0 | 10.5263157894737
2002 | 10 | Washington DC | Food | Salmonella enterica; Escherichia coli, Enteropathogenic | Confirmed; Confirmed | Caterer (food prepared off-site from where served) | 29 | 3 | 0 | 10.3448275862069
1-8 of 12 rows
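
The tie-breaking in Solution 20 comes from passing a second column to arrange(). Here is a minimal sketch on a hypothetical `toy` tibble with two outbreaks tied at a rate of 100.

```r
library(dplyr)

# Hypothetical toy data with a tie on hospitalization_rate
toy <- tibble(
  year                 = c(2015, 2019, 2014),
  hospitalization_rate = c(100, 100, 25)
)

# Sort by rate first; within equal rates, the more recent year comes first
sorted <- toy %>%
  arrange(desc(hospitalization_rate), desc(year))

sorted$year
```

The second argument only matters where the first one ties, so the 2014 row stays last even though it is not the oldest-to-newest position it would get from sorting on year alone.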
47 / 49

Explore

  • Now, pair up with at least one other groupmate and do something new with the outbreaks data (10+ minutes).

  • Then share with your group.

48 / 49

The end 🚀

49 / 49
