Good morning!
We like R.
We aren’t computer scientists – and that’s okay!
We make lots of mistakes. They will be funny. Laugh with us!
1 What’s R?
Let’s launch ourselves into the unknown and make a candy plot. With a little copy-pasting we can make an informative chart of everyone’s favorite candy.
Since we all know you have the best taste in candy, let’s make sure your favorite candy wins the very official BEST candy award.
Instructions
- Copy all the code below. Highlight all of the lines, or click the Copy Code button at the top right of the grey box below.
#------------ Install packages ----------------#
if(!require("hrbrthemes")) {
install.packages("ggplot2")
install.packages("readr")
install.packages("hrbrthemes")
}
#-------------- Load packages -----------------#
library(ggplot2)
library(readr)
library(hrbrthemes)
#----------------- Candy data -------------------#
survey <- read_csv('candy, rating
"Snickers", 14
"Almond Joy", 40
"Hersheys Kisses", 16
"Nerds", 5
"Sour Patch Kids", 8')
#----- A bar plot w/ dark Halloween theme -------#
ggplot(survey,
aes(x = candy, y = rating)) +
geom_col(aes(fill = candy), show.legend = FALSE) +
labs(title = "Candy Champions",
subtitle = "Very official candy ratings",
caption = "Data from MN survey 2050") +
coord_flip() +
#scale_fill_viridis_d() +
theme_ft_rc(base_size = 18)
- Open RStudio
- Select File > New File > R Script. You will see a code editor window open.
- Or click the paper icon with the green plus at the top left.
- Paste the copied code into the upper left hand window. This is your code editor.
- Highlight all of the code and press CTRL + ENTER on the keyboard.
This chart should appear in the lower right of RStudio.
- Change the name of a candy to something even better.
- Re-run the code.
- Try increasing the number next to the new candy.
Explore!
- Add another candy and rating to the data.
- Add your name to the subtitle.
- Delete the hashtag in front of the scale_fill_viridis_d() line.
- What happens when you re-run the code?
- Change the show.legend = value to TRUE
- What happens?
Yoda says
- Okay it is! You don’t need to memorize everything.
- Absorb what’s possible and look up the details when you need them.
- You are free to break things. Create errors. Make the computer angry. Learn through mistakes. R will forgive you.
- Cheat if you’re stuck.
- There’s no test. Share with your neighbors. Copy others.
Greetings
Let’s introduce ourselves and the data we love. Chat with your partner and get to know some things about them.
Chit-chat ideas
- Your name or Star Wars alias
- Types of data you have
- Who you share it with
- Something you want to get from the workshop
- The funniest part of your data
- The most repetitive part? Hint: Maybe this is something you can automate with R
Go Team
We’re going to need to work together to help Rey get off the dusty planet Jakku. Use each other as a resource. Share ideas, share code, collaborate. Puns and bad jokes are encouraged.
Here’s one:
Which Jedi is best at opening PDF files?
A: Adobe Wan Kenobi
2 RStudio Tour
1. Source / Code Editor
Write your scripts and comments here. This is our home and where we spend most of our time. Move your cursor to a line of code and then press [Ctrl] + [Enter] to run the code. The tabs at the top show other scripts and data you have open.
3. Environment / Workspace
This pane shows the data you have loaded, as well as the variables and objects you have created. The History tab contains the code you have run during your current session. Note the broom icon below the Connections tab. This cleans shop and clears all of the objects from your workspace.
2. R Console
This is where code is run by the computer. It shows the code that is running and the messages returned by that code. You can input code directly into the console and run it, but it won’t be saved for later. We encourage running code from a script in the code editor.
You may see some scary warnings and errors appear here after running code. Warnings are helpful messages that let you know the results may not be exactly what you expected. Errors, on the other hand, mean the code was unable to run. Usually this means there was a typo, or we forgot to run an important step earlier in our script.
4. Plots and files
These tabs allow you to view and open files in your current directory, view plots and other visual objects like maps, view your installed packages, and access the help window.
The Files tab is especially handy for finding files you want and clicking them to open. You can also click the Gear and select “Show Folder in a New Window” to open your project folder.
Customize RStudio
Let’s add a little style.
Change fonts and colors
- Go to Tools on the top navigation bar
- Choose Global Options...
- Choose Appearance with the paint bucket
- Increase the overall Zoom to 125%
- Increase the Editor Font size
- Pick an Editor theme you like
- The default is Textmate
Why R?
R Community
- Around the world
- Help!
- Post questions and help requests on Teams
- R Cheatsheets
- Troubleshooting - See Get Help!
When we use R
- To connect to databases
- To read data from all kinds of formats
- To document our work and share methods
- To create reports, dashboards, and presentations that are easy to update
How is R different?
R vs. Excel
- R can handle much larger data sets
- R is more flexible than Excel
- R analyses are more reproducible
- Excel is more widely used
More comparisons
R vs. Tableau
- Tableau is primarily a data visualization software.
- R Shiny is more flexible.
- Tableau’s drag-and-drop interface makes it faster and easier for creating simple visualizations, but is not easily reproducible.
- R is 100% text-based so you can track changes over time. #VersionControl
R vs. SQL
- SQL is the language of databases.
- SQL queries are sent to a database server and processed there before sending you the data. This may be needed for very large data sets that don’t fit on your computer.
- R can use SQL queries to pull data from databases / data lakes.
- R has the dbplyr package, which converts R code to an SQL query.
- R can read data from almost anywhere (databases, flat files, web pages)
R vs. Python
- Python is a general-purpose programming language popular for doing internet things.
- R is more specifically focused on visualizations and statistical analysis.
- Compared to Python’s pandas, R’s Tidyverse is more intuitive and easier to use.
- Historically Python has been used with GIS software like ArcGIS, but spatial analysis with R has been growing.
R vs. SAS
- R is open-source, while SAS requires a license.
- Anyone can create and add updates to R packages at any time.
- New features for SAS only become available when the SAS team makes it so.
Example R project
Here’s an example R analysis project from start to finish using Ozone monitoring data.
EXAMPLE ANALYSIS
Imagine we just received 3 years’ worth of ozone monitoring data to summarize. Fun!
Below is an example workflow we might follow in R.
- Create a new project
- Read the data
- Simplify columns
- Plot the data
- Clean the data
- View the data closer
- Summarize the data
- Save the results
- Share with friends
0. Start a new project
We’ll name this project: "2019_Ozone"
1. Read the data
library(readr)
# Read a file from the web
air_data <- read_csv("https://itep-r.netlify.com/data/OZONE_samples_demo.csv")
SITE | Date | OZONE | TEMP_F |
---|---|---|---|
27-137-7554 | 2017-09-15 | 9 | 56.0 |
27-137-7554 | 2017-05-01 | 3 | 44.6 |
27-137-7554 | 2017-06-18 | 8 | 65.6 |
27-137-7001 | 2016-10-10 | 7 | 45.2 |
27-137-7001 | 2018-10-31 | 5 | 36.2 |
2. Simplify column names
3. Plot the data
4. Clean the data
5. View the data closer
6. Summarize the data
library(dplyr)

# Average ozone and temperature for each site and year
air_data <- air_data %>%
  group_by(site, year) %>%
  summarize(avg_ozone = mean(ozone) %>% round(2),
            avg_temp = mean(temp_f) %>% round(2))
site | year | avg_ozone | avg_temp |
---|---|---|---|
27-137-7001 | 2016 | 11.01 | 60.74 |
27-137-7001 | 2017 | 11.26 | 60.66 |
27-137-7001 | 2018 | 11.54 | 60.59 |
27-137-7554 | 2016 | 12.23 | 61.23 |
27-137-7554 | 2017 | 11.81 | 60.98 |
27-137-7554 | 2018 | 12.87 | 61.02 |
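The code for steps 2 through 5 is not shown here, so the summary above relies on columns (site, year, ozone, temp_f) that have to come from somewhere. Here is a minimal sketch of one way they might be produced, using only the five sample rows printed in step 1. The lowercase names and the year column are assumptions about what the skipped steps do, not the workshop's actual code, and with only five rows the averages will not match the table above.

```r
library(dplyr)

# The five sample rows shown in step 1, typed in by hand
air_data <- data.frame(
  SITE   = c("27-137-7554", "27-137-7554", "27-137-7554",
             "27-137-7001", "27-137-7001"),
  Date   = as.Date(c("2017-09-15", "2017-05-01", "2017-06-18",
                     "2016-10-10", "2018-10-31")),
  OZONE  = c(9, 3, 8, 7, 5),
  TEMP_F = c(56.0, 44.6, 65.6, 45.2, 36.2)
)

# Assumed step 2: switch the column names to lowercase
names(air_data) <- tolower(names(air_data))

# Assumed helper column: pull the year out of each date
air_data <- air_data %>%
  mutate(year = as.integer(format(date, "%Y")))

# Step 6: average ozone and temperature for each site and year
air_summary <- air_data %>%
  group_by(site, year) %>%
  summarize(avg_ozone = round(mean(ozone), 2),
            avg_temp  = round(mean(temp_f), 2),
            .groups = "drop")

air_summary
```

Saving the result under a new name (air_summary) keeps the original rows around in case you want to summarize them a different way later.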
You’re on the TEAM!
Rey needs help putting the finishing touches on her ship.
Let’s go! Data awaits us…
While BB8 tracks down data from his scrapyard friends, we’ll prepare for some garbage data.
Let’s get to know R and RStudio.
3 New mission, New R project
Start a new project for your mission.
Step 1: Start a new project
- In RStudio, select File from the top menu bar
- Choose New Project…
- Choose New Directory
- Choose New Project
- Enter a project name such as "Rcamp"
- Select Browse… and choose a folder where you normally perform your work.
- Click Create Project
Step 2: Create a new R script
- File > New File > R Script
- Click the floppy disk save icon
- Give it a name: jakku1.R or day1.R will work well
4 Names and things
You can assign values to new objects using the “left arrow” <-.
- It’s typed with a less-than sign < followed by a hyphen -.
- It’s more officially known as the assignment operator.
Let’s make a droid!
Try adding the code below to your R script to create an object called droid.
Assignment operator
To run a line of code in your script, click the cursor anywhere on that line and press CTRL+ENTER.
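Here is a minimal sketch of what creating the droid object might look like. The value "bb8" is an assumption taken from the console output described later in this section.

```r
# Create an object called droid that holds the text "bb8"
droid <- "bb8"

# Type the object's name to print its value in the console
droid
```

The arrow points from the value to the name, so you can read the first line as "bb8 goes into droid".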
The droid wants a friend. Let’s create a Wookie named Chewbacca.
Break some things
Error!
Without quotes, R looks for an object called Chewbacca, and then lets you know that it couldn’t find one. Let’s try again but add quotation marks around "Chewbacca".
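The two attempts might look like the sketch below. Wrapping the first attempt in try() is our own addition so that the script keeps going after the error is printed. In the workshop you can simply run the line on its own and watch it fail.

```r
# Without quotes: R looks for an object named Chewbacca and reports an error.
# try() prints the error but lets the rest of the script keep running.
try(wookie <- Chewbacca)

# With quotes: R stores the text itself
wookie <- "Chewbacca"
wookie
```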
Colors decoded
Blue shows the exact code you ran.
Black is the result of the code. The [1] in front marks the position of the first item printed on that line. Here the object holds one item, and its value is bb8.
Red shows Errors & Warnings. Errors mean something went so wrong it prevented your code from running. Warnings, on the other hand, are usually okay. They tend to inform you that the result may not be exactly what you expected, such as a value not being added to a plot because it was NA.
Copy objects
# To copy an object, assign it to a new name
wookie2 <- wookie
# Or overwrite an object with new "text"
wookie <- "Tarfful"
wookie
# Wookie2 stays the same
wookie2
Numbers!
wookie_salary <- 500
# Let's give Chewbacca a big raise $$$
wookie_salary <- 500 * 2
# Print new salary
wookie_salary
# We can also use the object to multiply
wookie_salary * 2
# To save the change we assign it back to itself
wookie_salary <- wookie_salary * 2
Drop and remove data
You can drop objects with the remove function rm(). Try it out on some of your wookies.
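A small sketch of dropping an object and bringing it back. The helper name still_there is made up for this example, and exists() is a base R function that reports whether an object with that name can be found.

```r
wookie  <- "Chewbacca"
wookie2 <- wookie

# Drop one object; the copy is untouched
rm(wookie)

# Check whether each name can still be found
still_there <- exists("wookie")
still_there
exists("wookie2")

# Getting it back is just a matter of re-running the line that made it
wookie <- "Chewbacca"
wookie
```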
Explore!
How can we get the wookie object back?
Hint: You are allowed to re-run your code. You can even highlight everything and re-run ALL of the code.
Deleting things is okay
Don’t worry about deleting data or objects in R. You can always recreate them! When R loads your data it copies the contents and then cuts off any connection to the original data. So your original data remain safe and won’t suffer any accidental changes.
By saving your analysis in an R script you can always re-run the code to get any of your results back. It’s common and good practice to re-run your entire R script during your analysis.
What’s a good name?
Everything has a name in R and you can name things almost anything you like, even TOP_SECRET_shhhhhh... or data_McData_face.
Sadly, there are a few restrictions. R doesn’t like names to include spaces or special characters found in math equations, like +, -, *, \, /, =, !, or ).
Explore!
Try running some of these examples. Find new ways to create errors. The more broken the better! Half of learning R is finding what doesn’t work.
# What happens when you add these to your R script?
n wookies <- 5
n*wookies <- 5
n_wookies <- 5
n.wookies <- 5
wookies! <- "Everyone"
Names with numbers
# Names cannot begin with a number
1st_wookie <- "Chewbacca"
88b <- "droid"
# But they can contain numbers
wookie1 <- "Chewbacca"
bb8 <- "droid"
# What if you have 10,000 wookies?
n_wookies <- 10,000 # Error
n_wookies <- 10000
- Try to create a new error or warning you haven’t seen yet.
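If you want to check a batch of names without triggering errors one by one, base R's make.names() can help. It rewrites any text into a valid name, so a name that comes back unchanged was already valid. This is a side trick of ours rather than part of the exercise, and the candidates vector is just a sample of the names used above.

```r
# Names to test, taken from the examples above
candidates <- c("n wookies", "n*wookies", "n_wookies", "n.wookies",
                "wookies!", "1st_wookie", "wookie1")

# A name is valid if make.names() has nothing to fix
is_valid <- make.names(candidates) == candidates

data.frame(name = candidates, valid = is_valid)
```

Only n_wookies, n.wookies, and wookie1 come back as valid. The others contain a space, a math symbol, or a leading number.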
Collect multiple items
We can put multiple values inside c() to make a vector of items. It’s like a chain of items, where each additional item is connected by a comma. The c stands for concatenate or to collect.
Let’s use c() to create a few vectors of names.
# Create a character vector and name it starwars_characters
starwars_characters <- c("Luke", "Leia", "Han Solo")
# Print starwars_characters
starwars_characters
## [1] "Luke" "Leia" "Han Solo"
# Create a numeric vector and name it starwars_ages
starwars_ages <- c(19,19,25)
# Print the ages to the console
starwars_ages
## [1] 19 19 25
Note
Take a look at the new additions to your Environment pane located on the top right.
This window shows all of the objects we’ve created so far and the types of data they contain. It’s a great first look to see if our script ran successfully. You can click the broom icon to sweep everything out and start with a clean slate.
Make a table
A table in R is known as a data frame. We can think of it as a group of columns, where each column is made from a vector. Data frames in R have columns of data that are all the same length.
Let’s make a data frame with two columns to hold the character names and their ages.
# Create table with 2 columns: characters & ages
starwars_df <- data.frame(characters = starwars_characters,
ages = starwars_ages)
# Print the starwars_df data frame to the console
starwars_df
## characters ages
## 1 Luke 19
## 2 Leia 19
## 3 Han Solo 25
Explore!
Add a 3rd column that lists their fathers’ names: c("Darth", "Darth", "Unknown")
starwars_df <- data.frame(characters = starwars_characters, ages = starwars_ages, fathers = __________________)
starwars_df <- data.frame(characters = starwars_characters, ages = starwars_ages, fathers = c("Darth", "Darth", "Unknown"))
Show all values in $column_name
Use the $ sign after the name of your table to see the values in one of your columns.
## [1] 19 19 25
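The output above is what you would get from a call like starwars_df$ages. Here is a short self-contained sketch that rebuilds the table and pulls out each column. The mean() call at the end is our own addition to show that a column behaves like any other vector.

```r
# Rebuild the table from the earlier example
starwars_df <- data.frame(characters = c("Luke", "Leia", "Han Solo"),
                          ages       = c(19, 19, 25))

# Show all the values in one column
starwars_df$ages
starwars_df$characters

# A column is just a vector, so you can pass it to other functions
mean(starwars_df$ages)
```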
Notes and comments
The italic and orange lines in the scripts above are called comments. You can add notes to your scripts with the # to make it easier for others and yourself to understand what is happening and why.
Every line that starts with a # is ignored and won’t be run as R code.
You can also use comments to add warnings or instructions for others using your code.
Pop Quiz!
Which object name is valid? (Hint: You can test them in R.)
my starwars fandom
my_wookies55
5wookies
my-wookie
Wookies!!!
Show solution
my_wookies55
Yes!! The FORCE is strong with you!
5 Read data
To help find the junk we need for our ship we’re going to conduct a scrap audit. The first step of a good audit is reading in data to determine where all the scrap is coming from.
Here’s a small dataset showing the scrap economy on Jakku. It was salvaged from a crash site, but the transfer was incomplete.
origin | destination | item | amount | price_d |
---|---|---|---|---|
Outskirts | Raiders | Bulkhead | 332 | 300 |
Niima Outpost | Trade caravan | Hull panels | 1120 | 286 |
Cratertown | Plutt | Hyperdrives | 45 | 45 |
Tro—- | Ta—- | So—* | 1 | 10—- |
This looks like it could be useful. Now, if only we had some more data to work with…
New Message
Incoming… BB8
BB8: Beep boop Beep.
BB8: I intercepted a large scrapper data set from droid 4P-L of Junk Boss Plutt.
Receiving data now…
scrap_records.csv
item,origin,destination,amount,units,price_per_pound
Flight recorder,Outskirts,Niima Outpost,887,Tons,590.93
Proximity sensor,Outskirts,Raiders,7081,Tons,1229.03
Aural sensor,Tuanul,Raiders,707,Tons,145.27
Electromagnetic filter,Tuanul,Niima Outpost,107,Tons,188.2
...
You: Yikes! This looks like a dense mess! What can I do with this?
CSV to the rescue
The main data format used in R is the CSV (comma-separated values). A CSV is a simple text file that can be opened in R and most other data tools, including Excel. It looks squished together as plain text, but that’s okay! When opened in R, the text becomes a familiar looking table with columns and rows.
Before we launch ahead, let’s add a package to R that will help us read CSV files.
6 Add packages 📦
What is an R package?
A package is a small add-on for R. It’s like an app for your phone. Packages add capabilities like statistical functions, mapping powers, and special charts to R. In order to use a new package we first need to install it. Let’s try it!
The readr package helps import data into R in different formats. It helps you out by cleaning the data of extra white space and formatting tricky date formats automatically.
Add a package to your library
- Open RStudio
- Type install.packages("readr") in the lower left console
- Press Enter
- Wait two seconds
- Open the Packages tab in the lower right window of RStudio to see the packages in your library
- Use the search bar to find the readr package
Your installed packages are stored in your R library. The Packages tab on the right shows all of the packages installed in your library. When you want to use one of them, you load it in R.
Loading a package is like opening an App on your phone. To load a package we use the library() function. You will need to load the package every time you “turn on” RStudio.
Once you load it, the package will stay loaded until you close (shut down) RStudio.
Note
The 2 steps to using a package in R
install.packages("package-name")
library(package-name)
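Since installing only needs to happen once but loading happens every session, it can be handy to check whether a package is already in your library. Below is a minimal sketch. The helper is_installed() is a made-up name for this example, built on base R's requireNamespace().

```r
# Hypothetical helper: TRUE if a package is already in your library
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so this one is always available
is_installed("stats")

# Step 1 is only needed once per computer; step 2 is needed every session
if (is_installed("readr")) {
  library(readr)
} else {
  message("readr is missing. Run install.packages(\"readr\") first.")
}
```

The check never installs anything by itself, so it is safe to leave at the top of a script.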
Read the data
Let’s load the readr package to use the read_csv() function.
#install.packages("readr")
library(readr)
read_csv("https://mn-r.netlify.com/data/starwars_scrap_jakku.csv")
## # A tibble: 1,132 × 7
## receipt_date item origin desti…¹ amount units price…²
## <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 4/1/2013 Flight recorder Outsk… Niima … 887 Tons 591.
## 2 4/2/2013 Proximity sensor Outsk… Raiders 7081 Tons 1229.
## 3 4/3/2013 Vitus-Series Attitude Thrus… Reest… Raiders 4901 Tons 226.
## 4 4/4/2013 Aural sensor Tuanul Raiders 707 Tons 145.
## 5 4/5/2013 Electromagnetic discharge f… Tuanul Niima … 107 Tons 188.
## 6 4/6/2013 Proximity sensor Tuanul Trade … 32109 Tons 1229.
## 7 4/7/2013 Hyperdrive motivator Tuanul Trade … 862 Tons 1485.
## 8 4/8/2013 Landing jet Reest… Niima … 13944 Tons 1497.
## 9 4/9/2013 Electromagnetic discharge f… Crate… Raiders 7788 Tons 188.
## 10 4/10/2013 Sublight engine Outsk… Niima … 10642 Tons 7211.
## # … with 1,122 more rows, and abbreviated variable names ¹​destination,
## # ²​price_per_pound
## # ℹ Use `print(n = ...)` to see more rows
Name the data
Where did the data go after you read it into R? When we want to work with the data in R, we need to give it a name with the assignment operator: <-.
# Read in scrap data and set name to "scrap"
scrap <- read_csv("https://mn-r.netlify.com/data/starwars_scrap_jakku.csv")
# Type the name of the table to view it in the console
scrap
## # A tibble: 1,132 × 7
## receipt_date item origin desti…¹ amount units price…²
## <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 4/1/2013 Flight recorder Outsk… Niima … 887 Tons 591.
## 2 4/2/2013 Proximity sensor Outsk… Raiders 7081 Tons 1229.
## 3 4/3/2013 Vitus-Series Attitude Thrus… Reest… Raiders 4901 Tons 226.
## 4 4/4/2013 Aural sensor Tuanul Raiders 707 Tons 145.
## 5 4/5/2013 Electromagnetic discharge f… Tuanul Niima … 107 Tons 188.
## 6 4/6/2013 Proximity sensor Tuanul Trade … 32109 Tons 1229.
## 7 4/7/2013 Hyperdrive motivator Tuanul Trade … 862 Tons 1485.
## 8 4/8/2013 Landing jet Reest… Niima … 13944 Tons 1497.
## 9 4/9/2013 Electromagnetic discharge f… Crate… Raiders 7788 Tons 188.
## 10 4/10/2013 Sublight engine Outsk… Niima … 10642 Tons 7211.
## # … with 1,122 more rows, and abbreviated variable names ¹​destination,
## # ²​price_per_pound
## # ℹ Use `print(n = ...)` to see more rows
Yoda says
Notice the row of <chr> letter abbreviations under the column names? These describe the data type of each column.
- <chr> stands for character vector or a string of characters. Examples: “apple”, “apple5”, “5 red apples”
- <int> stands for integer. Examples: 5, 34, 1071
- <dbl> stands for double. Examples: 5.000, 3.4E-6, 10.710
We’ll see more data types later on, such as dates and logical (TRUE/FALSE).
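You can ask R directly what type a value is. The sketch below uses base R's typeof() and class() on a few of the examples above. The L suffix is how you tell R that a number should be stored as an integer.

```r
typeof("5 red apples")        # "character"
typeof(34L)                   # "integer"
typeof(10.710)                # "double"
typeof(5)                     # "double" (plain numbers are doubles unless you add L)
typeof(TRUE)                  # "logical"
class(as.Date("2013-04-01"))  # "Date"
```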
Pop Quiz!
1. What data type is the destination column?
letters
character
TRUE/FALSE
numbers
Show solution
character
Woop! You got this.
2. What package does read_csv() come from?
dinosaur
get_data
readr
dplyr
Show solution
readr
Great job! You are Jedi worthy!
3. How would you load the package junkfinder?
junkfinder()
library(junkfinder)
load(junkfinder)
gogo_gadget(junkfinder)
Show solution
library(junkfinder)
Excellent! Keep the streak going.
Function options
Functions have options, also called arguments, that change how they behave. Let’s look at a few of the arguments for the function read_csv().
Skip a row
Sometimes you may want to ignore the first row in your data file, especially an EPA file that includes a disclaimer on the first row. Yes EPA, we’re looking at you. Please stop.
Let’s open the help window with ?read_csv and try to find an argument that can help us.
There’s a lot of them! But the skip argument looks like it could be helpful. Take a look at the description near the bottom. The default is skip = 0, which reads every line, but we can skip the first line by writing skip = 1. Let’s try.
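Here is a small sketch of skip = 1 in action. To keep it self-contained, the "file" is a short piece of text wrapped in I(), which is how readr (version 2.0 or later) recognizes literal data. The disclaimer line is made up for this example, and the two data rows are borrowed from the scrap records above.

```r
library(readr)

# A tiny pretend file: one disclaimer line, then the real header and data
raw_text <- "DISCLAIMER: provisional data\nitem,amount\nFlight recorder,887\nAural sensor,707\n"

# skip = 1 ignores the disclaimer, so the second line becomes the column names
with_skip <- read_csv(I(raw_text), skip = 1, show_col_types = FALSE)

names(with_skip)
with_skip
```

Try changing skip = 1 to skip = 0 to see how the disclaimer line scrambles the column names.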
Limit the number of rows
Some types of data have last rows that show the total or report “END OF DATA”. In these cases we may want read_csv to ignore the last row of data, or for large data sets we may only want to pull in 100 lines of data to see what we’re working with.
Let’s look through the help window to find an argument that can help us. Type ?read_csv and scroll down.
The n_max argument looks like it could be helpful. The default is n_max = Inf, which means it will read every line, but we can limit the lines we read to only one hundred by using n_max = 100.
# Read in only the first 100 rows of data
# (no skip here: this file starts with its column names)
small_data <- read_csv("https://mn-r.netlify.com/data/starwars_scrap_jakku.csv",
                       n_max = 100)
# Remove the data
rm(small_data)
See function options
To see all of a function’s arguments
- Type its name in the console followed by an opening parenthesis.
- For example, type read_csv(
- Press TAB on the keyboard.
- Explore the drop-down menu of the available arguments.
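If you would rather see the arguments as plain text, base R's formals() lists them along with their defaults. This is an alternative to the TAB menu, sketched here for read_csv(), and it assumes readr is installed.

```r
library(readr)

# List the names of every argument read_csv() accepts
names(formals(read_csv))

# Look up the default value of a single argument
formals(read_csv)$skip
formals(read_csv)$n_max
```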
Pop Quiz!
Which of these is a valid function call?
fly(2 "ships")
find, "lightsaber", "Yoda"
build(1000, "droids")
wait(until Empire leaves)
Show solution
build(1000, "droids")
Correct! You are ready to audit a Junk dealer.
Create a CSV from Excel
Step 1: Open your Excel file.
Step 2: Save as CSV
- Go to File
- Save As
- Browse to your project folder
- Save as type: CSV (Comma Delimited) (*.csv)
- Any of the CSV options will work
- Click Yes
- Close Excel (Click “Don’t Save” as much as you need to. Excel loves long goodbyes.)
Step 3: Return to RStudio and open your project. Look at your Files tab in the lower right window. Click on the CSV file you saved and choose View File.
Success!
7 ggplot2
Plot the data, Plot the data, Plot the data
In data analysis you’ll want to look at your data early and often. For that we use a new package called ggplot2!
Install the package by running install.packages("ggplot2") in the console.
Note
You can also install packages from the Packages tab in the lower right window of RStudio.
A column plot
Here’s a simple chart showing the total amount of scrap sold from each origin location.
ggplot(scrap, aes(y = amount, x = origin)) +
geom_col() +
labs(title = "Which origin sold the most scrap?") +
theme_gray()
## Warning: Removed 910 rows containing missing values (position_stack).
Well, well, well, it looks like there is an All category we should look into more. Either there is a town hogging all the scrap or the data needs some cleaning.
Explore!
Try changing theme_gray() to theme_dark(). What changes on the chart? What stays the same?
Try another theme: theme_classic() or theme_minimal(), or delete the entire line and the + above to see what the default settings are.
You can view all available theme options at ggplot2 themes.
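One way to compare themes quickly is to save the plot under a name and then add a different theme each time. The sketch below uses a few rows from the candy survey at the start of the workshop so it runs without downloading anything. The object name base_plot is our own choice.

```r
library(ggplot2)

# A few rows from the candy survey at the top of the page
survey <- data.frame(candy  = c("Snickers", "Almond Joy", "Nerds"),
                     rating = c(14, 40, 5))

# Save the plot without a theme
base_plot <- ggplot(survey, aes(x = candy, y = rating)) +
  geom_col() +
  labs(title = "Candy Champions")

# Add a different theme each time to compare them
base_plot + theme_dark()
base_plot + theme_classic()
base_plot + theme_minimal()
```

Each line draws a new chart in the Plots tab. Use the arrow buttons there to flip back and forth between them.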
Got questions
Lost in an ERROR message? Is something behaving strangely and want to know why? Want ideas for a new chart?
Go to Help! for troubleshooting, cheatsheets, and other learning resources.
Key terms
Term | Definition |
---|---|
package | An add-on for R that contains new functions that someone created to help you. It’s like an App for R. |
library | The name of the folder that stores all your packages, and the function used to load a package. |
function | Functions perform an operation on your data and return a result. The function sum() takes a series of values and returns the sum for you. |
argument | Arguments are options or inputs that you pass to a function to change how it behaves. The argument skip = 1 tells the read_csv() function to ignore the first row when reading in a data file. To see the default values for a function you can type ?read_csv in the console. |