Offered by: Minnesota Department of Health Office of Data Strategy and Interoperability Data Technical Assistance Unit (DSI DTA) with support from the Minnesota State Government R Users Group and the intellectual and imaginative powers contained therein.
Course materials developed by: Eric Kvale
Prerequisites: Basic familiarity with R (see our primer: https://www.train.org/mn/course/1122534/live-event)
Before we start, let’s get all of the required tools set up. Everything we need collectively is referred to as our environment.
You should have R and RStudio open, ready to run the code in each section. Don’t just read - code along with us! Experiment with each section, make some tweaks, fiddle with the data and the function arguments and parameters. Ctrl-Z if you break it, or just copy and paste from this document if you get lost.
First, let’s install all the packages we’ll need for this course:
# Install required packages for text analysis
install.packages(c(
  "tidyverse",    # Data manipulation and visualization
  "tidytext",     # Text mining tools
  "stringr",      # String manipulation
  "wordcloud",    # Word cloud visualizations
  "stopwords",    # Stop word datasets
  "knitr",        # Document generation
  "DT",           # Interactive tables
  "kableExtra",   # Enhanced table formatting
  "renv",         # Environment management
  "tm"            # Framework for text mining applications 
))
If you encounter installation errors, try updating R to the latest version first; some packages require recent R versions. You can check your R version with R.version.string. Also, don’t be afraid to ask for help; these errors are common and we are here to tackle them.
For reproducibility, you can use renv to create a
project-specific library:
# Initialize renv for this project
renv::init()
# After installing packages, take a snapshot
renv::snapshot()
Why use renv? This creates a reproducible
environment where everyone uses the same package versions. When you
share your analysis, others can run renv::restore() to get
exactly the same setup. It’s a dependency-management tool that helps ensure your findings are reproducible.
Now let’s load our libraries and test that everything is working:
# Load libraries
library(tidyverse)
library(tidytext)
library(stringr)
library(wordcloud)
library(stopwords)
library(knitr)
library(DT)
library(kableExtra)
library(tm)
# Check that a couple of key packages loaded correctly.
cat("Setup successful! Here's a quick test:\n")
cat("tidyverse version:", as.character(packageVersion("tidyverse")), "\n")
cat("tidytext version:", as.character(packageVersion("tidytext")), "\n")
# Test tokenization
test_text <- "Hello world! Can you tokenize this?"
test_tokens <- tibble(text = test_text) %>%
  unnest_tokens(word, text)
cat("Tokenization test successful! Found", nrow(test_tokens), "tokens.\n")
Setup issues are normal. Here are solutions to the most common problems R users encounter.
# If standard installation fails, try:
install.packages("tidyverse", dependencies = TRUE)
# Check your library path
.libPaths()
# Check if package is installed
if (!"tidyverse" %in% installed.packages()) {
  install.packages("tidyverse")
}
# Load with error handling
tryCatch({
  library(tidyverse)
  cat("tidyverse loaded successfully!")
}, error = function(e) {
  cat("Error loading tidyverse:", e$message)
})
# If renv gives errors, you can skip it for now:
# Just load packages directly without renv
# Or reset renv if needed:
renv::restore()  # Restore from lockfile
renv::repair()   # Fix renv issues
When to Skip renv: If renv is giving you trouble, you can skip it for this workshop. It’s good to know that R has environments and that renv exists, but it isn’t required here.
Once your setup is complete, you should be able to run this test successfully:
# Final, final, final test.
library(tidyverse)
library(tidytext)
# Create and analyze some sample text
sample_data <- tibble(
  id = 1,
  text = "Welcome to text analysis in R! This course will teach you amazing skills."
)
sample_tokens <- sample_data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
cat("🎉 Setup complete! Found", nrow(sample_tokens), "meaningful words in our test text.\n")
cat("You're ready to start learning text analysis!\n")
Text analysis is a skill that improves with practice, not perfection on the first try. Familiarize yourself with the data and concepts and come back for another whack at it.
Text analysis, also known as natural language processing (NLP), is a powerful technique for extracting meaningful insights from unstructured text data. This course will take you from raw text data to fully analyzed, visualized results using R.
By the end of this course, you will be able to clean and tokenize raw text, remove stop words, explore word frequencies and TF-IDF, run a basic sentiment analysis, extract patterns with regular expressions, and visualize your results.
Remember to use the help() function or
?function_name to learn more about any function you’re
unfamiliar with. For example, try ?str_detect or
help(unnest_tokens) to explore these functions in
detail.
Text analysis involves using computational methods to extract information, patterns, and insights from written text. In this course, we’ll focus on recipe data to discover hidden connections between cooking techniques and ingredients.
packages_info <- data.frame(
  Package = c("tidyverse", "tidytext", "stringr", "wordcloud", "stopwords"),
  Purpose = c("Data manipulation and visualization", 
             "Text mining and analysis", 
             "String manipulation and regex", 
             "Creating word cloud visualizations",
             "Removing common words from analysis")
)
kable(packages_info, caption = "Essential R Packages for Text Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Package | Purpose | 
|---|---|
| tidyverse | Data manipulation and visualization | 
| tidytext | Text mining and analysis | 
| stringr | String manipulation and regex | 
| wordcloud | Creating word cloud visualizations | 
| stopwords | Removing common words from analysis | 
Let’s dive into analyzing recipe data to uncover patterns in cooking techniques and ingredients. We’ll start with raw recipe text and transform it into meaningful insights.
Make sure you have all required packages installed. If you encounter
errors, try running
install.packages(c("tidyverse", "tidytext", "stringr", "wordcloud", "stopwords"))
in your console.
First, let’s load all the libraries we’ll need for our text analysis:
# Load required libraries for text analysis
library(tidyverse)    # Data manipulation and visualization
library(tidytext)     # Text mining and analysis
library(stringr)      # String manipulation
library(wordcloud)    # Word cloud visualizations  
library(stopwords)    # Stop word removal
library(knitr)        # Table formatting
library(DT)           # Interactive tables
library(kableExtra)   # Enhanced table styling
# Verify libraries loaded successfully
cat("All libraries loaded successfully! Ready for text analysis.\n")
## All libraries loaded successfully! Ready for text analysis.
Run this code chunk first before proceeding with the analysis. If you
get error messages about packages not being installed, go back to the
setup section and install the missing packages. You can also use
library(help = "package_name") to learn more about any
package.
We’ll work with a collection of recipe descriptions and instructions to discover cooking patterns and ingredient relationships.
Data Source Note: In real-world projects, you might
import text data using readr::read_csv(),
readLines(), or specialized packages for different file
formats. The rio package is particularly useful for reading
various data formats!
# Create sample recipe data
recipe_data <- data.frame(
  recipe_id = 1:8,
  recipe_name = c("Classic Chocolate Chip Cookies", "Spicy Thai Basil Chicken", 
                  "Homemade Pizza Margherita", "Creamy Mushroom Risotto",
                  "Grilled Salmon with Herbs", "Vegetarian Black Bean Tacos",
                  "Fresh Garden Salad", "Slow Cooker Beef Stew"),
  cuisine = c("American", "Thai", "Italian", "Italian", 
              "Mediterranean", "Mexican", "American", "American"),
  instructions = c(
    "Cream butter and sugar until fluffy. Mix in eggs and vanilla. Combine flour, baking soda, and salt. Gradually blend into creamed mixture. Stir in chocolate chips. Drop by spoonfuls onto ungreased cookie sheets. Bake at 375°F for 9-11 minutes.",
    "Heat oil in wok over high heat. Stir-fry chicken until cooked through. Add garlic, chilies, and basil leaves. Season with fish sauce and soy sauce. Serve immediately over steamed rice.",
    "Roll out pizza dough on floured surface. Spread tomato sauce evenly. Add fresh mozzarella and basil leaves. Drizzle with olive oil. Bake in preheated oven at 450°F for 12-15 minutes until crust is golden.",
    "Heat broth in saucepan and keep warm. Sauté onions in olive oil until translucent. Add arborio rice and stir for 2 minutes. Gradually add warm broth, stirring constantly. Add mushrooms and parmesan cheese. Season with salt and pepper.",
    "Season salmon fillets with salt, pepper, and fresh herbs. Preheat grill to medium-high heat. Grill salmon for 4-5 minutes per side until fish flakes easily. Serve with lemon wedges and grilled vegetables.",
    "Drain and rinse black beans. Sauté onions and bell peppers until soft. Add beans, cumin, chili powder, and lime juice. Warm tortillas and fill with bean mixture. Top with avocado, cilantro, and cheese.",
    "Wash and chop fresh lettuce, tomatoes, and cucumbers. Slice red onions thinly. Combine all vegetables in large bowl. Toss with olive oil and vinegar dressing. Season with salt and pepper to taste.",
    "Brown beef cubes in oil over high heat. Add chopped onions, carrots, and celery. Pour in beef broth and diced tomatoes. Add herbs and seasonings. Cook on low heat for 6-8 hours until meat is tender."
  )
)
DT::datatable(recipe_data, 
              caption = "Recipe Dataset for Text Analysis",
              options = list(pageLength = 8, scrollX = TRUE))
The first step in text analysis is cleaning and breaking down our text into individual words (tokens).
The %>% pipe operator chains functions together,
making code more readable. Think of it as “and then…” - we take the data
AND THEN select columns AND THEN tokenize AND THEN remove stop
words.
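For instance, here’s a small sketch of the same operation written both ways, using the recipe_data we just created:
# Nested version - read from the inside out
head(select(recipe_data, recipe_name), 3)
# Piped version - take recipe_data, AND THEN keep the name column, AND THEN show the first three rows
recipe_data %>%
  select(recipe_name) %>%
  head(3)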
# Tokenize the recipe instructions
recipe_tokens <- recipe_data %>%
  select(recipe_id, recipe_name, cuisine, instructions) %>%
  unnest_tokens(word, instructions) %>%
  # Remove stop words
  anti_join(stop_words, by = "word") %>%
  # Remove numbers and single letters
  filter(!str_detect(word, "^\\d+$"),
         str_length(word) > 1)
# Display sample of tokenized data
kable(head(recipe_tokens, 10), 
      caption = "Sample of Tokenized Recipe Instructions") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| recipe_id | recipe_name | cuisine | word | 
|---|---|---|---|
| 1 | Classic Chocolate Chip Cookies | American | cream | 
| 1 | Classic Chocolate Chip Cookies | American | butter | 
| 1 | Classic Chocolate Chip Cookies | American | sugar | 
| 1 | Classic Chocolate Chip Cookies | American | fluffy | 
| 1 | Classic Chocolate Chip Cookies | American | mix | 
| 1 | Classic Chocolate Chip Cookies | American | eggs | 
| 1 | Classic Chocolate Chip Cookies | American | vanilla | 
| 1 | Classic Chocolate Chip Cookies | American | combine | 
| 1 | Classic Chocolate Chip Cookies | American | flour | 
| 1 | Classic Chocolate Chip Cookies | American | baking | 
Stop words are common words that typically don’t contribute much meaning to text analysis. Let’s examine what we’re removing:
Why Remove Stop Words? Words like “the”, “and”, “is” appear frequently in all texts but don’t tell us much about the specific content. Removing them helps us focus on meaningful terms that distinguish one document from another.
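Here’s a quick illustration with a made-up sentence, just to see the effect:
# A made-up sentence: tokenize it, then remove stop words
demo <- tibble(text = "Stir the sauce and season it with a pinch of salt")
demo %>%
  unnest_tokens(word, text)            # 11 tokens, including "the", "and", "it", "with", "a", "of"
demo %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")   # only "stir", "sauce", "season", "pinch", "salt" remain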
# Show examples of stop words
sample_stop_words <- stop_words %>% 
  filter(lexicon == "snowball") %>%
  head(20) %>%
  select(word)
kable(sample_stop_words, 
      caption = "Examples of Stop Words Removed from Analysis",
      col.names = "Stop Words") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Stop Words | 
|---|
| i | 
| me | 
| my | 
| myself | 
| we | 
| our | 
| ours | 
| ourselves | 
| you | 
| your | 
| yours | 
| yourself | 
| yourselves | 
| he | 
| him | 
| his | 
| himself | 
| she | 
| her | 
| hers | 
Let’s discover the most common cooking terms and ingredients across our recipe collection.
If your code isn’t working as expected, try running each line of the
pipeline separately. Add a print() or View()
statement after each step to see what’s happening to your data.
# Calculate word frequencies
word_frequencies <- recipe_tokens %>%
  count(word, sort = TRUE) %>%
  top_n(15, n)
# Create bar plot of most common words
ggplot(word_frequencies, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  coord_flip() +
  labs(title = "Most Common Words in Recipe Instructions",
       subtitle = "Top 15 terms after removing stop words",
       x = "Words",
       y = "Frequency") +
  theme_minimal()
Notice how cooking verbs like “add,” “heat,” and “season” dominate our frequency analysis. This makes sense - recipes are instruction-heavy! The ingredients that appear frequently (like “oil” and “salt”) are staples across many cuisine types.
Let’s create a visual representation of word frequencies using a word cloud:
# Create word cloud
set.seed(123)  # For reproducible results
wordcloud(words = word_frequencies$word, 
          freq = word_frequencies$n,
          min.freq = 1,
          max.words = 50,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Word Cloud of Recipe Terms
Word clouds are great for initial exploration, but consider bar
charts or other structured visualizations for formal presentations. The
set.seed() function ensures your word cloud looks the same
each time you run it - useful for reproducible analysis!
TF-IDF (Term Frequency-Inverse Document Frequency) helps us identify words that are particularly important to specific recipes or cuisines.
What is TF-IDF? This metric balances how frequently a term appears in a document (TF) against how rare it is across all documents (IDF). A word that appears often in one document but rarely in others gets a high TF-IDF score.
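To make the calculation concrete, here’s a tiny hand-built sketch with two fake documents. The tf column is the share of a document’s words made up by the term, idf is the natural log of (number of documents / number of documents containing the term), and tf_idf is their product:
# Tiny hand-built example: "wok" appears only in document A, "salt" in both
toy_counts <- tribble(
  ~doc, ~word,  ~n,
  "A",  "wok",   2,
  "A",  "salt",  2,
  "B",  "salt",  3,
  "B",  "oven",  1
)
toy_counts %>%
  bind_tf_idf(word, doc, n)
# "salt" is in both documents, so idf = log(2/2) = 0 and its tf_idf is 0
# "wok" is only in A, so idf = log(2/1) ≈ 0.69 and tf_idf = 0.5 * 0.69 ≈ 0.35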
# Calculate TF-IDF by cuisine
cuisine_tfidf <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$"),
         str_length(word) > 1) %>%
  count(cuisine, word, sort = TRUE) %>%
  bind_tf_idf(word, cuisine, n) %>%
  arrange(desc(tf_idf))
# Show top TF-IDF terms by cuisine
top_tfidf <- cuisine_tfidf %>%
  group_by(cuisine) %>%
  top_n(3, tf_idf) %>%
  ungroup()
kable(top_tfidf[, c("cuisine", "word", "tf_idf")], 
      caption = "Top TF-IDF Terms by Cuisine",
      col.names = c("Cuisine", "Word", "TF-IDF Score"),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Cuisine | Word | TF-IDF Score | 
|---|---|---|
| Mediterranean | grill | 0.1463 | 
| Mediterranean | salmon | 0.1463 | 
| Mexican | beans | 0.1288 | 
| Thai | sauce | 0.0833 | 
| Mediterranean | easily | 0.0732 | 
| Mediterranean | fillets | 0.0732 | 
| Mediterranean | flakes | 0.0732 | 
| Mediterranean | grilled | 0.0732 | 
| Mediterranean | lemon | 0.0732 | 
| Mediterranean | medium | 0.0732 | 
| Mediterranean | preheat | 0.0732 | 
| Mediterranean | wedges | 0.0732 | 
| Thai | chicken | 0.0732 | 
| Thai | chilies | 0.0732 | 
| Thai | cooked | 0.0732 | 
| Thai | fry | 0.0732 | 
| Thai | garlic | 0.0732 | 
| Thai | immediately | 0.0732 | 
| Thai | soy | 0.0732 | 
| Thai | steamed | 0.0732 | 
| Thai | wok | 0.0732 | 
| Mexican | avocado | 0.0644 | 
| Mexican | bean | 0.0644 | 
| Mexican | bell | 0.0644 | 
| Mexican | black | 0.0644 | 
| Mexican | chili | 0.0644 | 
| Mexican | cilantro | 0.0644 | 
| Mexican | cumin | 0.0644 | 
| Mexican | drain | 0.0644 | 
| Mexican | fill | 0.0644 | 
| Mexican | juice | 0.0644 | 
| Mexican | lime | 0.0644 | 
| Mexican | peppers | 0.0644 | 
| Mexican | powder | 0.0644 | 
| Mexican | rinse | 0.0644 | 
| Mexican | soft | 0.0644 | 
| Mexican | top | 0.0644 | 
| Mexican | tortillas | 0.0644 | 
| American | beef | 0.0447 | 
| American | combine | 0.0447 | 
| American | tomatoes | 0.0447 | 
| Italian | broth | 0.0374 | 
| Italian | olive | 0.0374 | 
| Italian | warm | 0.0374 | 
top_tfidf %>%
  mutate(word = reorder_within(word, tf_idf, cuisine)) %>%
  ggplot(aes(word, tf_idf, fill = cuisine)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "TF-IDF Score",
       title = "Highest TF-IDF Words by Cuisine",
       subtitle = "Words most characteristic of each cuisine type") +
  facet_wrap(~cuisine, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  theme_minimal()
TF-IDF Scores by Cuisine
Let’s analyze the emotional tone of our recipe instructions using sentiment analysis.
Want to see all available sentiment lexicons? Try
get_sentiments("nrc"), get_sentiments("bing"),
or explore with ?get_sentiments to understand the
differences between emotion classification systems.
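For a quick look, here’s a small sketch using words our recipes actually contain. (At the time of writing, the bing lexicon ships with tidytext, while nrc and afinn are downloaded through the textdata package the first time you request them.)
# Look up a few words in the bing lexicon (words it doesn't know simply return no rows)
get_sentiments("bing") %>%
  filter(word %in% c("fresh", "golden", "warm"))
# The nrc lexicon assigns emotions rather than just positive/negative
# (uncomment if you have the textdata package installed)
# get_sentiments("nrc") %>%
#   filter(word %in% c("fresh", "golden", "warm"))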
# Get sentiments using the bing lexicon (positive/negative)
recipe_sentiments <- recipe_tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(recipe_name, cuisine) %>%
  summarise(
    positive_words = sum(sentiment == "positive"),
    negative_words = sum(sentiment == "negative"),
    sentiment_score = positive_words - negative_words,
    .groups = "drop"
  ) %>%
  arrange(desc(sentiment_score))
kable(recipe_sentiments, 
      caption = "Sentiment Analysis of Recipe Instructions",
      col.names = c("Recipe", "Cuisine", "Positive Words", "Negative Words", "Sentiment Score")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Recipe | Cuisine | Positive Words | Negative Words | Sentiment Score | 
|---|---|---|---|---|
| Creamy Mushroom Risotto | Italian | 2 | 0 | 2 | 
| Homemade Pizza Margherita | Italian | 2 | 0 | 2 | 
| Vegetarian Black Bean Tacos | Mexican | 3 | 1 | 2 | 
| Fresh Garden Salad | American | 1 | 0 | 1 | 
| Slow Cooker Beef Stew | American | 1 | 0 | 1 | 
| Grilled Salmon with Herbs | Mediterranean | 1 | 1 | 0 | 
Recipe instructions tend to be neutral or slightly positive in language. The sentiment analysis here identifies words like “fresh,” “golden,” and “warm” as positive, while words like “drain” or “cut” might be classified as negative, even though they’re just cooking instructions.
ggplot(recipe_sentiments, aes(x = reorder(recipe_name, sentiment_score), 
                              y = sentiment_score, fill = cuisine)) +
  geom_col() +
  coord_flip() +
  labs(title = "Sentiment Scores of Recipe Instructions",
       subtitle = "Higher scores indicate more positive language",
       x = "Recipe",
       y = "Sentiment Score",
       fill = "Cuisine") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")
Recipe Sentiment Scores
Let’s identify and analyze cooking techniques mentioned in our recipes.
When analyzing text, think like a detective. What patterns can you spot? What words cluster together? Text analysis often reveals insights that aren’t immediately obvious when just reading through documents manually.
# Define cooking techniques to search for
cooking_techniques <- c("bake", "baking", "fry", "frying", "grill", "grilling", 
                       "sauté", "boil", "boiling", "steam", "steaming",
                       "roast", "roasting", "stir", "mix", "mixing", "chop", 
                       "chopping", "season", "seasoning")
# Find cooking techniques in recipes
technique_analysis <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  filter(word %in% cooking_techniques) %>%
  count(cuisine, word, sort = TRUE) %>%
  group_by(cuisine) %>%
  top_n(3, n) %>%
  ungroup()
kable(technique_analysis, 
      caption = "Most Common Cooking Techniques by Cuisine",
      col.names = c("Cuisine", "Technique", "Frequency")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Cuisine | Technique | Frequency | 
|---|---|---|
| Mediterranean | grill | 2 | 
| American | bake | 1 | 
| American | baking | 1 | 
| American | chop | 1 | 
| American | mix | 1 | 
| American | season | 1 | 
| American | stir | 1 | 
| Italian | bake | 1 | 
| Italian | sauté | 1 | 
| Italian | season | 1 | 
| Italian | stir | 1 | 
| Mediterranean | season | 1 | 
| Mexican | sauté | 1 | 
| Thai | fry | 1 | 
| Thai | season | 1 | 
| Thai | stir | 1 | 
Expanding Your Analysis: Try creating your own custom dictionaries for different domains. You could create lists of spices, cooking equipment, or dietary restrictions to analyze different aspects of the text data.
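For example, here’s a minimal sketch using a small, made-up spice dictionary; the word list is just an assumption you would tailor to your own questions:
# Hypothetical custom dictionary: spices and seasonings
spices <- c("salt", "pepper", "cumin", "basil", "oregano", "cilantro", "chili")
recipe_data %>%
  unnest_tokens(word, instructions) %>%
  filter(word %in% spices) %>%
  count(cuisine, word, sort = TRUE)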
ggplot(technique_analysis, aes(x = reorder_within(word, n, cuisine), 
                               y = n, fill = cuisine)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Cooking Technique", y = "Frequency",
       title = "Most Common Cooking Techniques by Cuisine Type") +
  facet_wrap(~cuisine, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  theme_minimal()
Cooking Techniques by Cuisine
Let’s explore connections between ingredients by finding which ones commonly appear together.
# Define common ingredients to search for
ingredients <- c("chicken", "beef", "salmon", "cheese", "tomato", "onion", 
                "garlic", "oil", "salt", "pepper", "herbs", "basil", "rice",
                "beans", "avocado", "mushroom", "butter", "flour", "sugar")
# Find ingredient co-occurrences
ingredient_pairs <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  filter(word %in% ingredients) %>%
  select(recipe_id, word) %>%
  inner_join(., ., by = "recipe_id") %>%
  filter(word.x < word.y) %>%  # Avoid duplicate pairs
  count(word.x, word.y, sort = TRUE) %>%
  top_n(10, n)
kable(ingredient_pairs, 
      caption = "Most Common Ingredient Combinations",
      col.names = c("Ingredient 1", "Ingredient 2", "Co-occurrence")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Ingredient 1 | Ingredient 2 | Co-occurrence | 
|---|---|---|
| pepper | salt | 3 | 
| avocado | beans | 2 | 
| basil | oil | 2 | 
| beans | cheese | 2 | 
| beef | herbs | 2 | 
| beef | oil | 2 | 
| herbs | salmon | 2 | 
| oil | pepper | 2 | 
| oil | rice | 2 | 
| oil | salt | 2 | 
| pepper | salmon | 2 | 
| salmon | salt | 2 | 
Now that we’ve mastered the basics, let’s explore more sophisticated methods for extracting insights from text data.
As your datasets get larger, consider using the quanteda
package for faster processing, or data.table for
memory-efficient operations. For very large texts, you might need to
process data in chunks.
Beyond individual words, let’s look at common two-word phrases (bigrams) in our recipes.
The token = "ngrams", n = 2 parameters in
unnest_tokens() create bigrams. Try changing
n = 3 for trigrams or n = 4 for 4-word
phrases. Use ?unnest_tokens to explore all tokenization
options!
# Create bigrams
recipe_bigrams <- recipe_data %>%
  unnest_tokens(bigram, instructions, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !str_detect(word1, "^\\d+$"),
         !str_detect(word2, "^\\d+$")) %>%
  count(word1, word2, sort = TRUE) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  top_n(10, n)
kable(recipe_bigrams, 
      caption = "Most Common Bigrams (Two-word phrases)",
      col.names = c("Bigram", "Frequency")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Bigram | Frequency | 
|---|---|
| olive oil | 3 | 
| basil leaves | 2 | 
| sauté onions | 2 | 
| add arborio | 1 | 
| add beans | 1 | 
| add chopped | 1 | 
| add fresh | 1 | 
| add garlic | 1 | 
| add herbs | 1 | 
| add mushrooms | 1 | 
| add warm | 1 | 
| arborio rice | 1 | 
| avocado cilantro | 1 | 
| baking soda | 1 | 
| bean mixture | 1 | 
| beans cumin | 1 | 
| beans sauté | 1 | 
| beef broth | 1 | 
| beef cubes | 1 | 
| bell peppers | 1 | 
| black beans | 1 | 
| bowl toss | 1 | 
| broth stirring | 1 | 
| brown beef | 1 | 
| celery pour | 1 | 
| cheese season | 1 | 
| chili powder | 1 | 
| chips drop | 1 | 
| chocolate chips | 1 | 
| chop fresh | 1 | 
| chopped onions | 1 | 
| combine flour | 1 | 
| constantly add | 1 | 
| cookie sheets | 1 | 
| cream butter | 1 | 
| creamed mixture | 1 | 
| cucumbers slice | 1 | 
| cumin chili | 1 | 
| diced tomatoes | 1 | 
| dressing season | 1 | 
| easily serve | 1 | 
| fish flakes | 1 | 
| fish sauce | 1 | 
| flakes easily | 1 | 
| flour baking | 1 | 
| floured surface | 1 | 
| fluffy mix | 1 | 
| fresh herbs | 1 | 
| fresh lettuce | 1 | 
| fresh mozzarella | 1 | 
| fry chicken | 1 | 
| garlic chilies | 1 | 
| gradually add | 1 | 
| gradually blend | 1 | 
| grill salmon | 1 | 
| grilled vegetables | 1 | 
| heat add | 1 | 
| heat broth | 1 | 
| heat grill | 1 | 
| heat oil | 1 | 
| heat stir | 1 | 
| herbs preheat | 1 | 
| juice warm | 1 | 
| leaves drizzle | 1 | 
| leaves season | 1 | 
| lemon wedges | 1 | 
| lettuce tomatoes | 1 | 
| lime juice | 1 | 
| low heat | 1 | 
| minutes gradually | 1 | 
| mixture stir | 1 | 
| mixture top | 1 | 
| oil bake | 1 | 
| onions carrots | 1 | 
| onions thinly | 1 | 
| parmesan cheese | 1 | 
| pizza dough | 1 | 
| preheat grill | 1 | 
| preheated oven | 1 | 
| red onions | 1 | 
| rinse black | 1 | 
| salmon fillets | 1 | 
| salt gradually | 1 | 
| salt pepper | 1 | 
| sauce serve | 1 | 
| season salmon | 1 | 
| seasonings cook | 1 | 
| serve immediately | 1 | 
| sheets bake | 1 | 
| slice red | 1 | 
| soft add | 1 | 
| soy sauce | 1 | 
| spread tomato | 1 | 
| steamed rice | 1 | 
| stir fry | 1 | 
| stirring constantly | 1 | 
| surface spread | 1 | 
| thinly combine | 1 | 
| tomato sauce | 1 | 
| tomatoes add | 1 | 
| translucent add | 1 | 
| ungreased cookie | 1 | 
| vanilla combine | 1 | 
| vinegar dressing | 1 | 
| warm broth | 1 | 
| warm sauté | 1 | 
| warm tortillas | 1 | 
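As the tip above suggests, switching to n = 3 gives trigrams. A minimal sketch on the same data:
# Same approach with trigrams (three-word phrases)
recipe_data %>%
  unnest_tokens(trigram, instructions, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE) %>%
  head(10)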
Let’s use regular expressions to find specific patterns in our recipe text.
Regex (regular expressions) might seem intimidating at first, but
they’re incredibly powerful for text analysis. Start simple:
\\d+ finds any digits, [A-Z] finds capital
letters. The stringr package makes regex much friendlier
with functions like str_extract()!
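A few tiny examples of what those patterns do:
# Quick illustrations of the patterns mentioned above
str_extract("Bake at 375F for 10 minutes", "\\d+")      # "375" - the first run of digits
str_extract_all("Bake at 375F for 10 minutes", "\\d+")  # "375" and "10"
str_extract("Add Thai basil", "[A-Z]\\w+")              # "Add" - the first capitalized word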
# Find temperature and time patterns
temp_time_patterns <- recipe_data %>%
  mutate(
    temperatures = str_extract_all(instructions, "\\d+°?F"),
    cooking_times = str_extract_all(instructions, "\\d+-?\\d* minutes?")
  ) %>%
  select(recipe_name, temperatures, cooking_times)
# Show temperature patterns
temp_summary <- temp_time_patterns %>%
  mutate(temp_found = map_lgl(temperatures, ~ length(.) > 0)) %>%
  filter(temp_found) %>%
  select(recipe_name, temperatures)
kable(head(temp_summary), 
      caption = "Temperature Patterns Found in Recipes",
      col.names = c("Recipe", "Temperatures")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Recipe | Temperatures | 
|---|---|
| Classic Chocolate Chip Cookies | 375°F | 
| Homemade Pizza Margherita | 450°F | 
Pattern Recognition Applications: This same regex approach can extract phone numbers from customer service logs, dates from historical documents, or product codes from inventory descriptions. The possibilities are endless!
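For instance, a small sketch of pulling date-like patterns out of free text (the sample strings are made up):
# Made-up example: extract MM/DD/YYYY dates from free text
notes <- c("Inspected on 03/15/2024, follow-up on 04/01/2024", "No date recorded")
str_extract_all(notes, "\\d{2}/\\d{2}/\\d{4}")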
Let’s create a similarity analysis to find recipes that use similar language patterns.
# Create document-term matrix for similarity analysis
recipe_dtm <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$")) %>%
  count(recipe_name, word) %>%
  cast_dtm(recipe_name, word, n)
# Calculate similarity (simplified version for demonstration)
similarity_summary <- data.frame(
  Analysis_Type = c("Most Similar Recipes", "Most Unique Recipe", "Common Ingredients"),
  Finding = c("Italian recipes (Pizza & Risotto)", "Thai Basil Chicken", "Salt, Oil, and Heat verbs"),
  Insight = c("Share Mediterranean cooking style", "Unique Asian flavor profile", "Universal cooking fundamentals")
)
kable(similarity_summary, 
      caption = "Key Insights from Text Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Analysis_Type | Finding | Insight | 
|---|---|---|
| Most Similar Recipes | Italian recipes (Pizza & Risotto) | Share Mediterranean cooking style | 
| Most Unique Recipe | Thai Basil Chicken | Unique Asian flavor profile | 
| Common Ingredients | Salt, Oil, and Heat verbs | Universal cooking fundamentals | 
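The summary above was written by hand for demonstration. If you want to compute similarity directly from the document-term matrix we built, one minimal sketch (cosine similarity between recipes) looks like this:
# Cosine similarity between recipes, computed from the document-term matrix above
dtm_matrix <- as.matrix(recipe_dtm)
# Scale each recipe's word counts to unit length, then take cross products
row_norms  <- sqrt(rowSums(dtm_matrix^2))
normalized <- dtm_matrix / row_norms
cosine_sim <- normalized %*% t(normalized)
# Values near 1 (off the diagonal) indicate recipes that use similar vocabulary
round(cosine_sim, 2)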
Text analysis techniques like those we’ve practiced have numerous applications:
applications <- data.frame(
  Domain = c("Healthcare", "Marketing", "Government", "Research", "Social Media"),
  Application = c("Patient feedback analysis", "Customer sentiment tracking", 
                 "Public policy document analysis", "Literature review automation",
                 "Trend identification"),
  Techniques_Used = c("Sentiment analysis, Topic modeling", "TF-IDF, Word clouds",
                     "Named entity recognition, Classification", "Text similarity, Clustering",
                     "N-gram analysis, Network analysis")
)
kable(applications, 
      caption = "Real-World Text Analysis Applications") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Domain | Application | Techniques_Used | 
|---|---|---|
| Healthcare | Patient feedback analysis | Sentiment analysis, Topic modeling | 
| Marketing | Customer sentiment tracking | TF-IDF, Word clouds | 
| Government | Public policy document analysis | Named entity recognition, Classification | 
| Research | Literature review automation | Text similarity, Clustering | 
| Social Media | Trend identification | N-gram analysis, Network analysis | 
Now it’s time to apply what you’ve learned! Try analyzing this new recipe text:
Practice makes perfect! Take the Mediterranean recipe below and try all the techniques we’ve covered. Can you identify the cooking techniques, extract sentiment, and find interesting patterns?
practice_recipe <- data.frame(
  recipe = "Mediterranean Herb-Crusted Cod",
  instructions = "Season fresh cod fillets with sea salt and black pepper. Create herb crust by combining breadcrumbs, fresh parsley, oregano, and minced garlic. Press mixture onto fish. Drizzle with extra virgin olive oil. Bake at 400°F for 15-20 minutes until fish is flaky and golden. Serve with lemon wedges and roasted vegetables."
)
kable(practice_recipe, 
      caption = "Practice Recipe for Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| recipe | instructions | 
|---|---|
| Mediterranean Herb-Crusted Cod | Season fresh cod fillets with sea salt and black pepper. Create herb crust by combining breadcrumbs, fresh parsley, oregano, and minced garlic. Press mixture onto fish. Drizzle with extra virgin olive oil. Bake at 400°F for 15-20 minutes until fish is flaky and golden. Serve with lemon wedges and roasted vegetables. | 
Try copying the code chunks from earlier sections and modifying them for this new recipe. Change the dataset name and see what happens. This is how you build coding confidence!
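If you’d like a nudge to get started, here’s one possible first step, reusing the column names from practice_recipe above:
# A possible starting point - tokenize the practice recipe and look at its top words
practice_tokens <- practice_recipe %>%
  unnest_tokens(word, instructions) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$"))
practice_tokens %>%
  count(word, sort = TRUE) %>%
  head(10)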
Text analysis in R provides powerful tools for extracting insights from unstructured data. Through our recipe analysis, we’ve demonstrated how to tokenize and clean text, remove stop words, explore word frequencies and TF-IDF, analyze sentiment, find bigrams and ingredient co-occurrences, and extract patterns with regular expressions.
These skills are directly applicable to analyzing any type of text data, from customer feedback to research documents to social media content.
To continue developing your text analysis skills, explore advanced packages such as quanteda and spacyr (see the table below). Remember: the best way to learn text analysis is by doing it. Start with small projects and gradually tackle more complex challenges!
advanced_packages <- data.frame(
  Package = c("quanteda", "spacyr", "tm", "topicmodels", "textdata"),
  Purpose = c("Comprehensive text analysis framework", 
             "spaCy integration for advanced NLP", 
             "Text mining framework",
             "Topic modeling algorithms",
             "Access to text analysis datasets"),
  Difficulty = c("Advanced", "Advanced", "Intermediate", "Advanced", "Beginner")
)
kable(advanced_packages, 
      caption = "Additional R Packages for Text Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Package | Purpose | Difficulty | 
|---|---|---|
| quanteda | Comprehensive text analysis framework | Advanced | 
| spacyr | spaCy integration for advanced NLP | Advanced | 
| tm | Text mining framework | Intermediate | 
| topicmodels | Topic modeling algorithms | Advanced | 
| textdata | Access to text analysis datasets | Beginner | 
Course materials developed by Eric Kvale with the Minnesota Department of Health Office of Data Strategy and Interoperability Data Technical Assistance Unit (DSI DTA) with support from the Minnesota State Government R Users Group.