Joins: A fishy dinner


Before you start…

Jump back into the project you created for the previous sessions exercises and then create a new R script. Save it as soon as you make it and give it a good name like ex_1_day_5.R or jamie.R and you’ll be ready to go!

The story so far

After careful study and pain-staking note collection we finally have the data that we have been waiting for…THE PENGUINS’ FAVORITE FOOD!

Now we can make delicious seafood puff pastries that penguins will love because it will have all their favorites!

The only problem is that the island data and the food data are in different data files…so in order for us to figure out where we should take the completed pastries we have to make the data play nice with each other.

Let’s start first with getting the data, that seems like a pretty good place to start!

Load the tidyverse and palmerpenguins packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)

data(penguins)

We need to read in the new data and it even includes the penguin names! We can customize their orders! This is excellent news, maybe we can even mark each pie with their initial so they’ll know who’s is who’s…

penguin_names <- read_csv("https://tidy-mn.github.io/R-camp-penguins/data/penguin_names.csv")
## Rows: 344 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): studyName, Species, Sex, Favorite food, Name
## dbl (2): Sample Number, Swim speed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now that we’ve read the data in, we need to know what columns it contains so that we know how to join the data frames together! We’re going to just take a quick peek at the data and then pull the names from our new data and see which ones match to our old data.

glimpse(penguin_names)
## Rows: 344
## Columns: 7
## $ studyName       <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708",…
## $ `Sample Number` <dbl> 1, 3, 6, 7, 9, 11, 13, 15, 17, 19, 222, 23, 25, 29, 32…
## $ Species         <chr> "Gentoo penguin (Pygoscelis papua)", "Gentoo penguin (…
## $ Sex             <chr> "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEM…
## $ `Favorite food` <chr> "shrimp", "shrimp", "krill", "krill", "krill", "krill"…
## $ `Swim speed`    <dbl> 19, 20, 18, 18, 18, 19, 19, 17, 19, 20, 20, 19, 18, 17…
## $ Name            <chr> "Qua", "Zoe", "Cam", "Rhi", "Zee", "Nag", "Lia", "Jam"…
names(penguin_names)
## [1] "studyName"     "Sample Number" "Species"       "Sex"          
## [5] "Favorite food" "Swim speed"    "Name"
names(penguins_raw)
##  [1] "studyName"           "Sample Number"       "Species"            
##  [4] "Region"              "Island"              "Stage"              
##  [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
## [10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
## [13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
## [16] "Delta 13 C (o/oo)"   "Comments"


Fantastic! It looks like the column names all match, but if you reallllly want to be sure you can always use the function intersect to see where the names line up!

intersect(names(penguins_raw), names(penguin_names))
## [1] "studyName"     "Sample Number" "Species"       "Sex"

Now it’s time to join up! Join functions are smart, so if your columns have the same names it will automatically figure out which ones to join on so you don’t have to specify. If they do have different column names,all you have to do is pass the function an argument called by where you give it the names of the columns to join by.

named_penguins <- left_join(penguins_raw, penguin_names)
## Joining with `by = join_by(studyName, `Sample Number`, Species, Sex)`

Annnnd just like that, it figured out what to join! Here’s how it specified the join_by statement: by = join_by(studyName,Sample Number, Species, Sex)


So we did a left join, meaning that we expect to keep all the records from the first dataframe we pass it and only those in the second that match.

Let’s filter our new dataset to see if we have all the data that we need!

filter(named_penguins, is.na(Name))
## # A tibble: 4 × 20
##   studyName `Sample Number` Species          Region Island Stage `Individual ID`
##   <chr>               <dbl> <chr>            <chr>  <chr>  <chr> <chr>          
## 1 PAL0708                22 Adelie Penguin … Anvers Biscoe Adul… N11A2          
## 2 PAL0910               133 Adelie Penguin … Anvers Dream  Adul… N76A1          
## 3 PAL0708                22 Gentoo penguin … Anvers Biscoe Adul… N41A2          
## 4 PAL0708                22 Chinstrap pengu… Anvers Dream  Adul… N73A2          
## # ℹ 13 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
## #   `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## #   `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
## #   `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>,
## #   `Favorite food` <chr>, `Swim speed` <dbl>, Name <chr>

Uh-oh. We have some penguins that don’t have matches…which seems odd because both datasets have the same number of rows…that must mean that something happened to the data!

Let’s use anti-join to figure out which rows from penguin_names didn’t join to penguins_raw!

anti_join(penguin_names, penguins_raw)
## Joining with `by = join_by(studyName, `Sample Number`, Species, Sex)`
## # A tibble: 4 × 7
##   studyName `Sample Number` Species     Sex   `Favorite food` `Swim speed` Name 
##   <chr>               <dbl> <chr>       <chr> <chr>                  <dbl> <chr>
## 1 PAL0708               222 Gentoo pen… FEMA… krill                     20 Gam  
## 2 PAL0910               153 Adelie Pen… FEMA… fish                       5 Bet  
## 3 PAL0708               222 Adelie Pen… MALE  fish                       4 Abe  
## 4 PAL0708               222 Chinstrap … MALE  krill                     15 Ace

Looks like someone made a typo…do you think we can fix it?

First things first, let’s check and make sure that those values don’t exist in our penguins_raw data - it would be bad if we fixed data that wasn’t actually wrong!

filter(penguins_raw, `Sample Number` %in% c(222, 153))
## # A tibble: 0 × 17
## # ℹ 17 variables: studyName <chr>, Sample Number <dbl>, Species <chr>,
## #   Region <chr>, Island <chr>, Stage <chr>, Individual ID <chr>,
## #   Clutch Completion <chr>, Date Egg <date>, Culmen Length (mm) <dbl>,
## #   Culmen Depth (mm) <dbl>, Flipper Length (mm) <dbl>, Body Mass (g) <dbl>,
## #   Sex <chr>, Delta 15 N (o/oo) <dbl>, Delta 13 C (o/oo) <dbl>, Comments <chr>

Whew, okay, so those numbers don’t exist in the original data, so looks like we just need to update the values to the correct numbers! We should also check to make sure it worked!

Fixing typos

Use the values from the filtered penguins_raw to correct the sample numbers in penguin_names so they match and then check to make sure all values join.


penguin_names <- penguin_names %>% 
  mutate(`Sample Number` = case_when(`Sample Number` == ______,
                                     ________ == _______,
                                     TRUE ~ ________))
  
______(penguins_raw, penguin_names) # should have 0 rows if this works!
  
penguin_names <- penguin_names %>% 
  mutate(`Sample Number` = case_when(`Sample Number` == 222 ~ 22,
                                     `Sample Number` == 153 ~ 133,
                                     TRUE ~ `Sample Number`))

anti_join(penguin_names, penguins_raw)
  

Yay! Let’s try this again…also let’s add clean_names after we join because those names are hard to keep spelling correctly.

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
named_penguins <- left_join(penguins_raw, penguin_names) %>% 
                  clean_names()
## Joining with `by = join_by(studyName, `Sample Number`, Species, Sex)`

Now, let’s make a plot to see what the penguins favorite food by island is! Check the solution for a way to bring everything we’ve learned together into one beautiful code chunk and for exciting new color developments!

Foodie plot

Feel free to make whatever plot you think is best: we want to know how many penguins like each food by island. Use labels, summaries, or other methods to convey this information and have some fun!


# We want the columns in decending order by count!
ggplot(named_penguins %>% 
         group_by(______, ______) %>% 
         summarise(count = _____), 
       aes(x=_____, y = ____, fill = reorder(favorite_food, -count))) +
  geom_col(position = ___________) +
  labs(title = "Number of penguins by their favorite food and island",
       x = "Island home",
       y = "Number of penguins we need to serve!")+
  geom_text(aes(label = count, y = _____), 
            position = position_dodge2(width = 0.9, preserve = "single")) +
  scale_fill_xxxx(name = _______) +
  theme_xxxx()
  
ggplot(named_penguins %>% 
         group_by(island, favorite_food) %>% 
         summarise(count = n()), 
       aes(x=island, y = count, fill = reorder(favorite_food, -count))) +
  geom_col(position = position_dodge2(preserve = "single")) +
  labs(title = "Number of penguins by their favorite food and island",
       x = "Island home",
       y = "Number of penguins we need to serve!")+
  geom_text(aes(label = count, y = count + 5), 
            position = position_dodge2(width = 0.9, preserve = "single")) +
  scale_fill_viridis_d(name = "Favorite Food") +
  theme_minimal()
  

We might just be showing off now…but look how pretty it looks!

Anyway, krill seem to be the winner by a wide margin, so let’s get baking!