Before you start…
Jump back into the project you created for the previous sessions exercises and then create a new R script. Save it as soon as you make it and give it a good name like ex_1_day_2.R
or cameron.R
and you’ll be ready to go!
The story so far
So remember that big order we had? For those researchers who were doing work in Antarctica studying penguins? Well it turns out their analytics computer broke down and they might not able to complete their work! They have some data and were planning to go out again to gather more information but now they might not be able to!
So we might have told them that we could…well…we might have said that we were an…analytic bot. They were so excited that they hired us on the spot! But now we’re in a sticky situation because we really need to prove to them that we have what it takes to do the work.
Let’s get practicing on using our data tools!
Summarize the data on penguin populations
The goal of this exercise is to create summary datasets that answer specific questions about the penguin populations. We will use the penguins
dataset in the palmerpenguins
package. Please ask for help if you get stuck or would like a hint.
Load the penguins
data
Since the dataset is included in the package itself, we don’t need to read in a csv from the web or from our computer folders. Many R packages include datasets so that you can test the functions included in them. Here is a website with packages that include 1000s of datasets! Once you have loaded your R package that contains a dataset, you can pull that dataset into your environment by giving the dataset an object name. If you are ever wondering if a package includes a practice data set, just run this code to get a list of the datasets data(package='palmerpenguins')
Compare flipper lengths
The group_by
and summarize
functions allow us to compare values by categories. Today, we are very interested in how long the flippers are for each species. Remember, you can create a new summary dataset by assigning the summarized data an object name, i.e. data_summary <-
, or you can avoid the name and have your summary information printed into the console. Now that we have our penguin data, lets create a dataset that summarizes the flipper lengths of the penguin populations by species. We gotta make sure the following calculations are made for each penguin species: minimum, maximum, and mean flipper lengths.
- Which penguin species has the shortest flippers?
- Is there a flipper length range that includes all three penguin species?
Flipper summary
First use group_by() to group the data by species, AND THEN use summarize() to add a new column called min_flipper_length. Ignore the NA values with the argument na.rm = TRUE
.
# Remember to account for NA values in your summary function penguins %>% group_by( _______ ) %>% summarize(min_flipper_length = ________ )
penguins %>% group_by(species) %>% summarize(min_flipper_length = min(flipper_length_mm, na.rm = TRUE))
Compare weights
Now we are interested in how much penguins weigh by species AND island. We can do this using the group_by
function on more than one variable. Here’s our task: Create a dataset that summarizes the weights of the penguin populations by species and island. Calculate the minimum, maximum, and mean weights for each group.
- Which penguin species is the heaviest?
- On which island are Adelie penguins the heaviest?
Weight summary
First use group_by() to group the data by species and island, AND THEN use summarize() to add new columns for the min, mean, and max of body_mass_g. Ignore the NA values with the argument na.rm = TRUE
.
# Remember to account for NA values in your summary functions penguins %>% group_by(_______ , _______ ) %>% summarize(min_weight = min( _______ , na.rm = TRUE), mean_weight = _______ , max_weight = _______ )
penguins %>% group_by(species, island) %>% summarize(min_weight = min(body_mass_g, na.rm = TRUE), mean_weight = mean(body_mass_g, na.rm = TRUE), max_weight = max(body_mass_g, na.rm = TRUE))
Compare penguin bills
Penguin bills are fascinating! We want to know everything about them, and we want even more categorically resolved information this time! Sometimes you want summaries based on very specific groups, such as sample results by day of the week and age group. Again, group_by
lets this happen as long as your data allow it.
In this final exercise, let’s create a dataset that summarizes the bill lengths and bill depths of the penguin populations by species, sex, and island. Calculate the mean bill lengths and bill depths of each group.
- In general, do male and female bills tend to be the same length and depth? If not, how do they differ?
- Do the groups with the longest bills also have the deepest bills? Do the groups with the shortest bills also have the shallowest bills?
- From the table, would you assume bill size is most associated with species, sex, or island?
Bill summary
First use group_by() to group the data by species, island, and sex, AND THEN use summarize() to add new columns for the mean bill depth and bill length. Ignore the NA values with the argument na.rm = TRUE
.
Store the result in a new table named bill_summary
.
# Remember to account for NA values in your summary functions bill_summary <- penguins %>% group_by(_______ , _______ , _______ ) %>% summarize(mean_bill_depth = _______ , mean_bill_length = _______ )
bill_summary <- penguins %>% group_by(species, island, sex) %>% summarize(mean_bill_depth = mean(bill_depth_mm, na.rm = TRUE), mean_bill_length = mean(bill_length_mm, na.rm = TRUE))
Great work! Let’s keep at it while we start our trip to…ANTARTICA! HERE WE COME PENGUINS!