Day 2



1 Data wrangling

WRANGLE that DATA

Cartoon of a small fuzzy monster with a cowboy hat and lasso, riding another larger fuzzy monster and lassoing a group of angry / unruly looking creatures labeled data.

Artwork by @allison_horst

Your data toolbox

Function Job
filter() Keep rows that pass your tests
select() Select columns to keep or drop
arrange() Sort a table based on a column’s values
mutate() Add new columns or update existing columns

Monsters are back


Filter with a single condition / test

Filter with multiple conditions / tests


Keep columns

Drop columns


Arrange from lowest to highest

Arrange from highest to lowest


Update an existing column

Add a new column


# Create monsters
monsters <- data.frame(id = 1:5,
                       color = c("pink", "green",  "pink", "green", "green"),
                       age = c(8,2,7,3,1),
                       hat = c(T,T,F,T,F) )

# FILTER ----

## Use a single condition/test
filter(monsters, color == "pink")

## Use multiple conditions/tests
filter(monsters, color == "pink", age > 7)


# SELECT ----

## Keep only 2 columns
select(monsters, id, age)

## Drop columns with a '-' in front
select(monsters, -hat)


# ARRANGE ----

## From low to high
arrange(monsters, age)

## Descend from high to low
arrange(monsters, desc(age))


# MUTATE ----

## Update an existing column
mutate(monsters, age = age + 2)

## Create a new column
mutate(monsters, is_baby = age < 3)

Our first order

To: Bot Bakery
Subject: Big treat order
Attached: crew_food_prefs.xlsx

Hi friend,

Here’s the table of my crew’s goody preferences. 😋

Guess which crewmate is me!

Get the data

We can download the attached Excel file HERE

After you’ve downloaded the file, move the file from your Downloads/ folder to a data/ folder in your project’s main directory:

  1. Download the Excel file
  2. Create a “data” folder in your project directory
  • You can do this the usual way with Windows Explorer, or use the built-in New Folder button in the RStudio Files pane.
  1. Move the Excel file from your Downloads/ folder to the new data/ folder.

Excel in R

Have an EXCEL file? The readxl package has the function read_excel() for reading Excel data into R.

Let’s install and load readxl:

# Run the install line only one time to get the package
install.packages("readxl")

# Run this line every time you open R to load the package from your library
library(readxl)

Now we can read in our Excel file:

crew_df <- read_excel("data/crew_food_prefs.xlsx")

Ope! What happened to our column names / headers?


crew_df <- read_excel("data/crew_food_prefs.xlsx", skip = 1)


Having troubles loading the data? Here is an identical CSV version to load so you can carry on.

crew_df <- read_csv("https://tidy-mn.github.io/R-camp-penguins/data/crew_food_prefs.csv")


An Excellent alternative

Glimpse the data

head(crew_df)

glimpse(crew_df)

summary(crew_df)

filter() to rows with missing values or NA’s

Use filter() to view the rows where nut_allergy is missing or is NA.

filter(crew_df, is.na(nut_allergy))

filter() out the test data

There appears to be test data included in the table. Yuk! Let’s filter it out so it doesn’t influence our results.

crew_df <- filter(crew_df, id != "test")

select() what we need

Let’s create a separate smaller favorites dataframe with only the id and favorite_goody columns.

favorites_df <- select(crew_df, id, favorite_goody)


Let’s create a separate allergy table with only the id and all of the columns that contain the word “allergy”.

allergy_df <- select(crew_df, id, contains("allergy"))


There’s also some columns in here we don’t need. Let’s drop the country people are from since that won’t help us at the moment.

crew_df <- select(crew_df, -country)


Finally, for easier reference, use select() to make the id and days_on_ship columns the first two columns in the table. Everything else can be in the same order.

crew_df <- select(crew_df, id, days_on_ship, everything())

arrange() the crew

Let’s use arrange() to sort the crew members so the ones that will be on the ship the longest are on top. They’ll be the ones munching on are goodies the longest.

crew_df <- arrange(crew_df, days_on_ship)

Ope! What happened?

Descending order (biggest on top)

crew_df <- arrange(crew_df, desc(days_on_ship))

Alternative: put a - in front of the column name to do the opposite

crew_df <- arrange(crew_df, -days_on_ship)

mutate() new columns

Let’s get an idea of the total meals each crew mate will be eating our baked goods. For that, we’ll need to take their days_on_ship and multiply it by their meals_per_day.

# Use the * to multiply and the / to divide
crew_df <- mutate(crew_df, total_trip_meals = days_on_ship * meals_per_day)

2 Forks in the road: if_else + case_when

if_else( )

How to survive

if_else(path_is_scary, 
        "Turn back!", 
        "Keep going.")

Monster bakers

Hat check

If the monster is wearing a hat (the hat column is TRUE), then they are definitely a baker. Otherwise they are not a baker.

The if_else() statement

if_else(hat, "Baker", "Not a baker")

Assign the result to a new column named “job”

job = if_else(hat, "Baker", "Not a baker")

Use mutate() to add the column to the monster data

mutate(monsters, 
       job = if_else(hat, "Baker", "Not a baker"))

Baby check: Age comparison

The if_else() statement

if_else(age < 3, "Baby", "Not a baby")

Assign the result to a new column named “life_stage”

life_stage = if_else(age < 3, "Baby", "Not a baby")

Use mutate() to add the column to the monster data

mutate(monsters, 
       life_stage = if_else(age < 3, "Baby", "Not a baby"))


What if we have more than 2 categories we want to assign things to? Say we wanted to sort the monsters into babies, kids, and adults?


case_when( )

What to wear

The footwear we choose to wear depends on our activity. If we are running we may choose sneakers. If we are swimming we may choose flippers. And in this case, when we aren’t running or swimming we will be wearing slippers. The final TRUE option in case_when will be used as the default choice when none of the other options above it are true.

footwear = case_when(
                     run_from_monsters  ~  "Sneakers",
                     swim_from_monsters  ~  "Flippers",
                     TRUE  ~  "Slippers"
                    )

Back to monster babies

This if_else() statement

if_else(age < 3, "Baby", "Not a baby")

Turns into this case_when() statement

case_when(age < 3 ~ "Baby", 
          TRUE ~ "Not a baby")


Now we can add more life stages

Add a “Kid” option

case_when(age < 3 ~ "Baby", 
          age < 6 ~ "Kid",
          TRUE ~ "Not a baby")

Set the default option to “Adult”

case_when(age < 3 ~ "Baby", 
          age < 6 ~ "Kid",
          TRUE ~ "Adult")

Use mutate() to add the column to the monster data

mutate(monsters, 
       life_stage = case_when(age < 3 ~ "Baby", 
                              age < 6 ~ "Kid",
                              TRUE ~ "Adult")

The pipe: “Take this and then…”

Patty cake

patty_cake %>% 
  roll_it() %>% 
  prick_it() %>% 
  mark_it(letter = "B")

Read as:

Take a patty cake (and then)
roll it (and then)
prick it (and then)
mark it - with the letter “B”.


%>% with our toolbox

Without the pipe

monsters <- filter(monsters, color == "pink")

monsters <- select(monsters, -hat)

monsters <- mutate(monsters, is_baby = age < 3)

With the pipe

monsters <- monsters %>%
                      filter(color == "pink") %>%
                      select(-hat) %>%
                      mutate(is_baby = age < 3)

Short trip, Long trip


if_else()

library(tidyverse)

days_on_ship <- 5

if_else(days_on_ship <= 7, "Short trip")
# ERROR: What did we forget?

if_else(days_on_ship <= 7, "Short trip", "Long trip")

time_onboard = if_else(days_on_ship <= 7, "Short trip", "Long trip")


case_when()

case_when(days_on_ship <= 7 ~ "Short trip",
          days_on_ship <= 14 ~ "Long trip",
          TRUE ~ "A very long trip")

# Run again with a new value
days_on_ship <- 15

case_when(days_on_ship <= 7 ~ "Short trip",
          days_on_ship <= 14 ~ "Long trip",
          TRUE ~ "A very long trip")


Add columns to the crew_df data

Regions

Let’s assign some countries to North America. We can use the %in% operator to check if a value is in a list. Below, we’ll check if a given country is in a list of some of the countries in North America. If it is in the list, we’ll give the region a value of "North America". But if it isn’t in the list, we’ll label the region as "Other".

crew_df <- read_csv("https://tidy-mn.github.io/R-camp-penguins/data/crew_food_prefs.csv")
  
# The if_else statement
country <- "Brazil"
  
if_else(country %in% c("United States", "Canada", "Mexico"),
        "North America",
        "Other")

# Use mutate to add the region column
crew_df <- crew_df %>%
           mutate(region = if_else(country %in% c("United States", "Canada", "Mexico"),
                                   "North America",
                                   "Other"))

Cocoa powder shortage

Uh-oh. Our cocoa is running low. We need to identify the goodies that include cocoa powder. Similar to the previous example, let’s use %in% to check if a person’s favorite_goody is in a list of chocolate goodies and flag them as needing cocoa.

# The if_else statement 
favorite_goody <- "brownies"

if_else(favorite_goody %in% c("brownies", "chocolate croissant"),
        "Needs cocoa",
        "No cocoa")

# Use mutate to add the column cocoa_check
crew_df <- crew_df %>%
           mutate(cocoa_check = if_else(favorite_goody %in% c("brownies", 
                                                              "chocolate croissant"),
                                        "Needs cocoa",
                                        "No cocoa"))

3 Simmer it down: Summarize the data

Your EXPANDED data toolbox

Function Job
filter() Keep rows that pass your tests
select() Select columns to keep or drop
arrange() Sort a table based on a column’s values
mutate() Add new columns or update existing columns
summarize() Get summary statistics for an entire table
group_by() Group data for group specific summaries

A MONSTER summary

Time to simmer things down. Let’s use summarize() to find a few summary statistics for our data.


Filter with a single condition / test

Filter with multiple conditions / tests


Keep columns

Drop columns


Arrange from lowest to highest

Arrange from highest to lowest


Update an existing column

Add a new column


Average age


Average age by monster color type

Average age and monster count by color


# Create monsters
monsters <- data.frame(id = 1:5,
                       color = c("pink", "green",  "pink", "green", "green"),
                       age = c(8,2,7,3,1),
                       hat = c(T,T,F,T,F) )

# FILTER ----

## Use a single condition/test
monsters %>% filter(color == "pink")

## Use multiple conditions/tests
monsters %>% filter(color == "pink", age > 7)


# SELECT ----

## Keep only 2 columns
monsters %>% select(id, age)

## Drop columns with a '-' in front
monsters %>% select(-hat)


# ARRANGE ----

## From low to high
monsters %>% arrange(age)

## Descend from high to low
monsters %>% arrange(desc(age))


# MUTATE ----

## Update an existing column
monsters %>% mutate(age = age + 2)

## Create a new column
monsters %>% mutate(is_baby = age < 3)


# SUMMARIZE ----
monsters %>% summarize(age_avg = mean(age))


# GROUP_BY ----

## Create a single summary statistic
monsters %>% 
  group_by(color) %>%
  summarize(age_avg = mean(age))

## Create multiple summary stats
monsters %>% 
  group_by(color) %>%
  summarize(age_avg = mean(age),
            n_monsters = n())

Crew examples

Load the data

library(tidyverse)

crew_df <- read_csv("https://tidy-mn.github.io/R-camp-penguins/data/crew_food_prefs.csv")

# Drop the test row
crew_df <- crew_df %>% filter(id != "test")


summarize()

# Average days on the ship
crew_df %>% summarize(days_avg = mean(days_on_ship))
## # A tibble: 1 × 1
##   days_avg
##      <dbl>
## 1     19.7


group_by() %>% summarize()

# Average days on ship grouped by nut allergy column
crew_df %>%
  group_by(nut_allergy) %>%
  summarize(days_avg = mean(days_on_ship))
## # A tibble: 2 × 2
##   nut_allergy days_avg
##   <lgl>          <dbl>
## 1 FALSE           20  
## 2 TRUE            19.7


mutate() %>% group_by() %>% summarize()

Which goody will be most in demand on the ship?

# Add a total meals column
## And then group by goody type and sum the total meals
crew_df %>%
  mutate(total_meals = meals_per_day * days_on_ship) %>%
  group_by(favorite_goody) %>%
  summarize(goody_total_meals = sum(total_meals))

# Arrange total goody meals in descending order
crew_df %>%
  mutate(total_meals = meals_per_day * days_on_ship) %>%
  group_by(favorite_goody) %>%
  summarize(goody_total_meals = sum(total_meals)) %>%
  arrange(desc(goody_total_meals))

Bonus

How many crew are from each country? How many crew voted for each goody for their favorite?

# Crew count for each country
crew_df %>% 
  group_by(country) %>% 
  summarize(n_crew = n())

# Most popular goody
crew_df %>% 
  group_by(favorite_goody) %>% 
  summarize(n_favorites = n())


Alternative: count()

# Crew count for each country
count(crew_df, country)

crew_df %>% count(country)

# Most popular goody
crew_df %>% count(favorite_goody)