Offered by: Minnesota Department of Health Office of Data Strategy and Interoperability Data Technical Assistance Unit (DSI DTA) with support from the Minnesota State Government R Users Group and the intellectual and imaginative powers contained therein.
Course materials developed by: Eric Kvale
Prerequisites: Basic familiarity with R (see our primer: https://www.train.org/mn/course/1122534/live-event)
Before we start, let’s get all of the required tools set up. Everything we need collectively is referred to as our environment.
You should have R and RStudio open, ready to run the code in each section. Don’t just read - code along with us! Experiment with each section, make some tweaks, fiddle with the data and the function arguments and parameters. Ctrl-Z if you break it, or just copy and paste from this document if you get lost.
First, let’s install all the packages we’ll need for this course:
# Install required packages for text analysis
install.packages(c(
  "tidyverse",    # Data manipulation and visualization
  "tidytext",     # Text mining tools
  "stringr",      # String manipulation
  "wordcloud",    # Word cloud visualizations
  "stopwords",    # Stop word datasets
  "knitr",        # Document generation
  "DT",           # Interactive tables
  "kableExtra",   # Enhanced table formatting
  "renv",         # Environment management
  "tm"            # Framework for text mining applications 
))
If you encounter installation errors, try updating R to the latest version first; some packages require recent R versions. You can check your R version with R.version.string. Also, don’t be afraid to ask for help; these errors are common and we are here to tackle them.
For reproducibility, you can use renv to create a
project-specific library:
# Initialize renv for this project
renv::init()
# After installing packages, take a snapshot
renv::snapshot()
Why use renv? This creates a reproducible
environment where everyone uses the same package versions. When you
share your analysis, others can run renv::restore() to get
exactly the same setup. It’s a dependency-management tool that helps ensure your findings are reproducible.
Now let’s load our libraries and test that everything is working:
# Load libraries
library(tidyverse)
library(tidytext)
library(stringr)
library(wordcloud)
library(stopwords)
library(knitr)
library(DT)
library(kableExtra)
library(tm)
# Check that a couple of key packages loaded correctly.
cat("Setup successful! Here's a quick test:\n")
cat("tidyverse version:", as.character(packageVersion("tidyverse")), "\n")
cat("tidytext version:", as.character(packageVersion("tidytext")), "\n")
# Test tokenization
test_text <- "Hello world! Can you tokenize this?"
test_tokens <- tibble(text = test_text) %>%
  unnest_tokens(word, text)
cat("Tokenization test successful! Found", nrow(test_tokens), "tokens.\n")
Setup issues are normal. Here are solutions to the most common problems R users encounter.
# If standard installation fails, try:
install.packages("tidyverse", dependencies = TRUE)
# Check your library path
.libPaths()
# Check if package is installed
if (!"tidyverse" %in% installed.packages()) {
  install.packages("tidyverse")
}
# Load with error handling
tryCatch({
  library(tidyverse)
  cat("tidyverse loaded successfully!")
}, error = function(e) {
  cat("Error loading tidyverse:", e$message)
})
# If renv gives errors, you can skip it for now:
# Just load packages directly without renv
# Or reset renv if needed:
renv::restore()  # Restore from lockfile
renv::repair()   # Fix renv issues
When to Skip renv: If renv is giving you trouble, you can skip it for this workshop. It’s good to know that R has environments and that renv exists, but it isn’t required here.
Once your setup is complete, you should be able to run this test successfully:
# Final, final, final test.
library(tidyverse)
library(tidytext)
# Create and analyze some sample text
sample_data <- tibble(
  id = 1,
  text = "Welcome to text analysis in R! This course will teach you amazing skills."
)
sample_tokens <- sample_data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
cat("🎉 Setup complete! Found", nrow(sample_tokens), "meaningful words in our test text.\n")
cat("You're ready to start learning text analysis!\n")
Text analysis is a skill that improves with practice, not perfection on the first try. Familiarize yourself with the data and concepts and come back for another whack at it.
Text analysis, also known as natural language processing (NLP), is a powerful technique for extracting meaningful insights from unstructured text data. This course will take you from raw text data to fully analyzed, visualized results using R.
By the end of this course, you will be able to clean and tokenize raw text, remove stop words, explore word frequencies and TF-IDF, run a basic sentiment analysis, extract patterns with regular expressions, and visualize your results.
Remember to use the help() function or
?function_name to learn more about any function you’re
unfamiliar with. For example, try ?str_detect or
help(unnest_tokens) to explore these functions in
detail.
Text analysis involves using computational methods to extract information, patterns, and insights from written text. In this course, we’ll focus on recipe data to discover hidden connections between cooking techniques and ingredients.
packages_info <- data.frame(
  Package = c("tidyverse", "tidytext", "stringr", "wordcloud", "stopwords"),
  Purpose = c("Data manipulation and visualization", 
             "Text mining and analysis", 
             "String manipulation and regex", 
             "Creating word cloud visualizations",
             "Removing common words from analysis")
)
kable(packages_info, caption = "Essential R Packages for Text Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Package | Purpose | 
|---|---|
| tidyverse | Data manipulation and visualization | 
| tidytext | Text mining and analysis | 
| stringr | String manipulation and regex | 
| wordcloud | Creating word cloud visualizations | 
| stopwords | Removing common words from analysis | 
Let’s dive into analyzing recipe data to uncover patterns in cooking techniques and ingredients. We’ll start with raw recipe text and transform it into meaningful insights.
Make sure you have all required packages installed. If you encounter
errors, try running
install.packages(c("tidyverse", "tidytext", "stringr", "wordcloud", "stopwords"))
in your console.
First, let’s load all the libraries we’ll need for our text analysis:
# Load required libraries for text analysis
library(tidyverse)    # Data manipulation and visualization
library(tidytext)     # Text mining and analysis
library(stringr)      # String manipulation
library(wordcloud)    # Word cloud visualizations  
library(stopwords)    # Stop word removal
library(knitr)        # Table formatting
library(DT)           # Interactive tables
library(kableExtra)   # Enhanced table styling
# Verify libraries loaded successfully
cat("All libraries loaded successfully! Ready for text analysis.\n")
## All libraries loaded successfully! Ready for text analysis.
Run this code chunk first before proceeding with the analysis. If you
get error messages about packages not being installed, go back to the
setup section and install the missing packages. You can also use
library(help = "package_name") to learn more about any
package.
We’ll work with a collection of recipe descriptions and instructions to discover cooking patterns and ingredient relationships.
Data Source Note: In real-world projects, you might
import text data using readr::read_csv(),
readLines(), or specialized packages for different file
formats. The rio package is particularly useful for reading
various data formats!
# Create sample recipe data
recipe_data <- data.frame(
  recipe_id = 1:8,
  recipe_name = c("Classic Chocolate Chip Cookies", "Spicy Thai Basil Chicken", 
                  "Homemade Pizza Margherita", "Creamy Mushroom Risotto",
                  "Grilled Salmon with Herbs", "Vegetarian Black Bean Tacos",
                  "Fresh Garden Salad", "Slow Cooker Beef Stew"),
  cuisine = c("American", "Thai", "Italian", "Italian", 
              "Mediterranean", "Mexican", "American", "American"),
  instructions = c(
    "Cream butter and sugar until fluffy. Mix in eggs and vanilla. Combine flour, baking soda, and salt. Gradually blend into creamed mixture. Stir in chocolate chips. Drop by spoonfuls onto ungreased cookie sheets. Bake at 375°F for 9-11 minutes.",
    "Heat oil in wok over high heat. Stir-fry chicken until cooked through. Add garlic, chilies, and basil leaves. Season with fish sauce and soy sauce. Serve immediately over steamed rice.",
    "Roll out pizza dough on floured surface. Spread tomato sauce evenly. Add fresh mozzarella and basil leaves. Drizzle with olive oil. Bake in preheated oven at 450°F for 12-15 minutes until crust is golden.",
    "Heat broth in saucepan and keep warm. Sauté onions in olive oil until translucent. Add arborio rice and stir for 2 minutes. Gradually add warm broth, stirring constantly. Add mushrooms and parmesan cheese. Season with salt and pepper.",
    "Season salmon fillets with salt, pepper, and fresh herbs. Preheat grill to medium-high heat. Grill salmon for 4-5 minutes per side until fish flakes easily. Serve with lemon wedges and grilled vegetables.",
    "Drain and rinse black beans. Sauté onions and bell peppers until soft. Add beans, cumin, chili powder, and lime juice. Warm tortillas and fill with bean mixture. Top with avocado, cilantro, and cheese.",
    "Wash and chop fresh lettuce, tomatoes, and cucumbers. Slice red onions thinly. Combine all vegetables in large bowl. Toss with olive oil and vinegar dressing. Season with salt and pepper to taste.",
    "Brown beef cubes in oil over high heat. Add chopped onions, carrots, and celery. Pour in beef broth and diced tomatoes. Add herbs and seasonings. Cook on low heat for 6-8 hours until meat is tender."
  )
)
DT::datatable(recipe_data, 
              caption = "Recipe Dataset for Text Analysis",
              options = list(pageLength = 8, scrollX = TRUE))
The first step in text analysis is cleaning and breaking down our text into individual words (tokens).
The %>% pipe operator chains functions together,
making code more readable. Think of it as “and then…” - we take the data
AND THEN select columns AND THEN tokenize AND THEN remove stop
words.
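For instance, here’s a small sketch of the same operation written both ways, using the recipe_data we just created:
# Nested version - read from the inside out
head(select(recipe_data, recipe_name), 3)
# Piped version - take recipe_data, AND THEN keep the name column, AND THEN show the first three rows
recipe_data %>%
  select(recipe_name) %>%
  head(3)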
# Tokenize the recipe instructions
recipe_tokens <- recipe_data %>%
  select(recipe_id, recipe_name, cuisine, instructions) %>%
  unnest_tokens(word, instructions) %>%
  # Remove stop words
  anti_join(stop_words, by = "word") %>%
  # Remove numbers and single letters
  filter(!str_detect(word, "^\\d+$"),
         str_length(word) > 1)
# Display sample of tokenized data
kable(head(recipe_tokens, 10), 
      caption = "Sample of Tokenized Recipe Instructions") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| recipe_id | recipe_name | cuisine | word | 
|---|---|---|---|
| 1 | Classic Chocolate Chip Cookies | American | cream | 
| 1 | Classic Chocolate Chip Cookies | American | butter | 
| 1 | Classic Chocolate Chip Cookies | American | sugar | 
| 1 | Classic Chocolate Chip Cookies | American | fluffy | 
| 1 | Classic Chocolate Chip Cookies | American | mix | 
| 1 | Classic Chocolate Chip Cookies | American | eggs | 
| 1 | Classic Chocolate Chip Cookies | American | vanilla | 
| 1 | Classic Chocolate Chip Cookies | American | combine | 
| 1 | Classic Chocolate Chip Cookies | American | flour | 
| 1 | Classic Chocolate Chip Cookies | American | baking | 
Stop words are common words that typically don’t contribute much meaning to text analysis. Let’s examine what we’re removing:
Why Remove Stop Words? Words like “the”, “and”, “is” appear frequently in all texts but don’t tell us much about the specific content. Removing them helps us focus on meaningful terms that distinguish one document from another.
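Here’s a quick illustration with a made-up sentence, just to see the effect:
# A made-up sentence: tokenize it, then remove stop words
demo <- tibble(text = "Stir the sauce and season it with a pinch of salt")
demo %>%
  unnest_tokens(word, text)            # 11 tokens, including "the", "and", "it", "with", "a", "of"
demo %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")   # only "stir", "sauce", "season", "pinch", "salt" remain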
# Show examples of stop words
sample_stop_words <- stop_words %>% 
  filter(lexicon == "snowball") %>%
  head(20) %>%
  select(word)
kable(sample_stop_words, 
      caption = "Examples of Stop Words Removed from Analysis",
      col.names = "Stop Words") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Stop Words | 
|---|
| i | 
| me | 
| my | 
| myself | 
| we | 
| our | 
| ours | 
| ourselves | 
| you | 
| your | 
| yours | 
| yourself | 
| yourselves | 
| he | 
| him | 
| his | 
| himself | 
| she | 
| her | 
| hers | 
Let’s discover the most common cooking terms and ingredients across our recipe collection.
If your code isn’t working as expected, try running each line of the
pipeline separately. Add a print() or View()
statement after each step to see what’s happening to your data.
# Calculate word frequencies
word_frequencies <- recipe_tokens %>%
  count(word, sort = TRUE) %>%
  top_n(15, n)
# Create bar plot of most common words
ggplot(word_frequencies, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  coord_flip() +
  labs(title = "Most Common Words in Recipe Instructions",
       subtitle = "Top 15 terms after removing stop words",
       x = "Words",
       y = "Frequency") +
  theme_minimal()
Notice how cooking verbs like “add,” “heat,” and “season” dominate our frequency analysis. This makes sense - recipes are instruction-heavy! The ingredients that appear frequently (like “oil” and “salt”) are staples across many cuisine types.
Let’s create a visual representation of word frequencies using a word cloud:
# Create word cloud
set.seed(123)  # For reproducible results
wordcloud(words = word_frequencies$word, 
          freq = word_frequencies$n,
          min.freq = 1,
          max.words = 50,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Word Cloud of Recipe Terms
Word clouds are great for initial exploration, but consider bar
charts or other structured visualizations for formal presentations. The
set.seed() function ensures your word cloud looks the same
each time you run it - useful for reproducible analysis!
TF-IDF (Term Frequency-Inverse Document Frequency) helps us identify words that are particularly important to specific recipes or cuisines.
What is TF-IDF? This metric balances how frequently a term appears in a document (TF) against how rare it is across all documents (IDF). A word that appears often in one document but rarely in others gets a high TF-IDF score.
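To make the calculation concrete, here’s a tiny hand-built sketch with two fake documents. The tf column is the share of a document’s words made up by the term, idf is the natural log of (number of documents / number of documents containing the term), and tf_idf is their product:
# Tiny hand-built example: "wok" appears only in document A, "salt" in both
toy_counts <- tribble(
  ~doc, ~word,  ~n,
  "A",  "wok",   2,
  "A",  "salt",  2,
  "B",  "salt",  3,
  "B",  "oven",  1
)
toy_counts %>%
  bind_tf_idf(word, doc, n)
# "salt" is in both documents, so idf = log(2/2) = 0 and its tf_idf is 0
# "wok" is only in A, so idf = log(2/1) ≈ 0.69 and tf_idf = 0.5 * 0.69 ≈ 0.35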
# Calculate TF-IDF by cuisine
cuisine_tfidf <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$"),
         str_length(word) > 1) %>%
  count(cuisine, word, sort = TRUE) %>%
  bind_tf_idf(word, cuisine, n) %>%
  arrange(desc(tf_idf))
# Show top TF-IDF terms by cuisine
top_tfidf <- cuisine_tfidf %>%
  group_by(cuisine) %>%
  top_n(3, tf_idf) %>%
  ungroup()
kable(top_tfidf[, c("cuisine", "word", "tf_idf")], 
      caption = "Top TF-IDF Terms by Cuisine",
      col.names = c("Cuisine", "Word", "TF-IDF Score"),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Cuisine | Word | TF-IDF Score | 
|---|---|---|
| Mediterranean | grill | 0.1463 | 
| Mediterranean | salmon | 0.1463 | 
| Mexican | beans | 0.1288 | 
| Thai | sauce | 0.0833 | 
| Mediterranean | easily | 0.0732 | 
| Mediterranean | fillets | 0.0732 | 
| Mediterranean | flakes | 0.0732 | 
| Mediterranean | grilled | 0.0732 | 
| Mediterranean | lemon | 0.0732 | 
| Mediterranean | medium | 0.0732 | 
| Mediterranean | preheat | 0.0732 | 
| Mediterranean | wedges | 0.0732 | 
| Thai | chicken | 0.0732 | 
| Thai | chilies | 0.0732 | 
| Thai | cooked | 0.0732 | 
| Thai | fry | 0.0732 | 
| Thai | garlic | 0.0732 | 
| Thai | immediately | 0.0732 | 
| Thai | soy | 0.0732 | 
| Thai | steamed | 0.0732 | 
| Thai | wok | 0.0732 | 
| Mexican | avocado | 0.0644 | 
| Mexican | bean | 0.0644 | 
| Mexican | bell | 0.0644 | 
| Mexican | black | 0.0644 | 
| Mexican | chili | 0.0644 | 
| Mexican | cilantro | 0.0644 | 
| Mexican | cumin | 0.0644 | 
| Mexican | drain | 0.0644 | 
| Mexican | fill | 0.0644 | 
| Mexican | juice | 0.0644 | 
| Mexican | lime | 0.0644 | 
| Mexican | peppers | 0.0644 | 
| Mexican | powder | 0.0644 | 
| Mexican | rinse | 0.0644 | 
| Mexican | soft | 0.0644 | 
| Mexican | top | 0.0644 | 
| Mexican | tortillas | 0.0644 | 
| American | beef | 0.0447 | 
| American | combine | 0.0447 | 
| American | tomatoes | 0.0447 | 
| Italian | broth | 0.0374 | 
| Italian | olive | 0.0374 | 
| Italian | warm | 0.0374 | 
top_tfidf %>%
  mutate(word = reorder_within(word, tf_idf, cuisine)) %>%
  ggplot(aes(word, tf_idf, fill = cuisine)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "TF-IDF Score",
       title = "Highest TF-IDF Words by Cuisine",
       subtitle = "Words most characteristic of each cuisine type") +
  facet_wrap(~cuisine, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  theme_minimal()
TF-IDF Scores by Cuisine
Let’s analyze the emotional tone of our recipe instructions using sentiment analysis.
Want to see all available sentiment lexicons? Try
get_sentiments("nrc"), get_sentiments("bing"),
or explore with ?get_sentiments to understand the
differences between emotion classification systems.
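For a quick look, here’s a small sketch using words our recipes actually contain. (At the time of writing, the bing lexicon ships with tidytext, while nrc and afinn are downloaded through the textdata package the first time you request them.)
# Look up a few words in the bing lexicon (words it doesn't know simply return no rows)
get_sentiments("bing") %>%
  filter(word %in% c("fresh", "golden", "warm"))
# The nrc lexicon assigns emotions rather than just positive/negative
# (uncomment if you have the textdata package installed)
# get_sentiments("nrc") %>%
#   filter(word %in% c("fresh", "golden", "warm"))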
# Get sentiments using the bing lexicon (positive/negative)
recipe_sentiments <- recipe_tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(recipe_name, cuisine) %>%
  summarise(
    positive_words = sum(sentiment == "positive"),
    negative_words = sum(sentiment == "negative"),
    sentiment_score = positive_words - negative_words,
    .groups = "drop"
  ) %>%
  arrange(desc(sentiment_score))
kable(recipe_sentiments, 
      caption = "Sentiment Analysis of Recipe Instructions",
      col.names = c("Recipe", "Cuisine", "Positive Words", "Negative Words", "Sentiment Score")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Recipe | Cuisine | Positive Words | Negative Words | Sentiment Score | 
|---|---|---|---|---|
| Creamy Mushroom Risotto | Italian | 2 | 0 | 2 | 
| Homemade Pizza Margherita | Italian | 2 | 0 | 2 | 
| Vegetarian Black Bean Tacos | Mexican | 3 | 1 | 2 | 
| Fresh Garden Salad | American | 1 | 0 | 1 | 
| Slow Cooker Beef Stew | American | 1 | 0 | 1 | 
| Grilled Salmon with Herbs | Mediterranean | 1 | 1 | 0 | 
Recipe instructions tend to be neutral or slightly positive in language. The sentiment analysis here identifies words like “fresh,” “golden,” and “warm” as positive, while words like “drain” or “cut” might be classified as negative, even though they’re just cooking instructions.
ggplot(recipe_sentiments, aes(x = reorder(recipe_name, sentiment_score), 
                              y = sentiment_score, fill = cuisine)) +
  geom_col() +
  coord_flip() +
  labs(title = "Sentiment Scores of Recipe Instructions",
       subtitle = "Higher scores indicate more positive language",
       x = "Recipe",
       y = "Sentiment Score",
       fill = "Cuisine") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")
Recipe Sentiment Scores
Let’s identify and analyze cooking techniques mentioned in our recipes.
When analyzing text, think like a detective. What patterns can you spot? What words cluster together? Text analysis often reveals insights that aren’t immediately obvious when just reading through documents manually.
# Define cooking techniques to search for
cooking_techniques <- c("bake", "baking", "fry", "frying", "grill", "grilling", 
                       "sauté", "boil", "boiling", "steam", "steaming",
                       "roast", "roasting", "stir", "mix", "mixing", "chop", 
                       "chopping", "season", "seasoning")
# Find cooking techniques in recipes
technique_analysis <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  filter(word %in% cooking_techniques) %>%
  count(cuisine, word, sort = TRUE) %>%
  group_by(cuisine) %>%
  top_n(3, n) %>%
  ungroup()
kable(technique_analysis, 
      caption = "Most Common Cooking Techniques by Cuisine",
      col.names = c("Cuisine", "Technique", "Frequency")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Cuisine | Technique | Frequency | 
|---|---|---|
| Mediterranean | grill | 2 | 
| American | bake | 1 | 
| American | baking | 1 | 
| American | chop | 1 | 
| American | mix | 1 | 
| American | season | 1 | 
| American | stir | 1 | 
| Italian | bake | 1 | 
| Italian | sauté | 1 | 
| Italian | season | 1 | 
| Italian | stir | 1 | 
| Mediterranean | season | 1 | 
| Mexican | sauté | 1 | 
| Thai | fry | 1 | 
| Thai | season | 1 | 
| Thai | stir | 1 | 
Expanding Your Analysis: Try creating your own custom dictionaries for different domains. You could create lists of spices, cooking equipment, or dietary restrictions to analyze different aspects of the text data.
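For example, here’s a minimal sketch using a small, made-up spice dictionary; the word list is just an assumption you would tailor to your own questions:
# Hypothetical custom dictionary: spices and seasonings
spices <- c("salt", "pepper", "cumin", "basil", "oregano", "cilantro", "chili")
recipe_data %>%
  unnest_tokens(word, instructions) %>%
  filter(word %in% spices) %>%
  count(cuisine, word, sort = TRUE)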
ggplot(technique_analysis, aes(x = reorder_within(word, n, cuisine), 
                               y = n, fill = cuisine)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Cooking Technique", y = "Frequency",
       title = "Most Common Cooking Techniques by Cuisine Type") +
  facet_wrap(~cuisine, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  theme_minimal()
Cooking Techniques by Cuisine
Let’s explore connections between ingredients by finding which ones commonly appear together.
# Define common ingredients to search for
ingredients <- c("chicken", "beef", "salmon", "cheese", "tomato", "onion", 
                "garlic", "oil", "salt", "pepper", "herbs", "basil", "rice",
                "beans", "avocado", "mushroom", "butter", "flour", "sugar")
# Find ingredient co-occurrences
ingredient_pairs <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  filter(word %in% ingredients) %>%
  select(recipe_id, word) %>%
  inner_join(., ., by = "recipe_id") %>%
  filter(word.x < word.y) %>%  # Avoid duplicate pairs
  count(word.x, word.y, sort = TRUE) %>%
  top_n(10, n)
kable(ingredient_pairs, 
      caption = "Most Common Ingredient Combinations",
      col.names = c("Ingredient 1", "Ingredient 2", "Co-occurrence")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Ingredient 1 | Ingredient 2 | Co-occurrence | 
|---|---|---|
| pepper | salt | 3 | 
| avocado | beans | 2 | 
| basil | oil | 2 | 
| beans | cheese | 2 | 
| beef | herbs | 2 | 
| beef | oil | 2 | 
| herbs | salmon | 2 | 
| oil | pepper | 2 | 
| oil | rice | 2 | 
| oil | salt | 2 | 
| pepper | salmon | 2 | 
| salmon | salt | 2 | 
Now that we’ve mastered the basics, let’s explore more sophisticated methods for extracting insights from text data.
As your datasets get larger, consider using the quanteda
package for faster processing, or data.table for
memory-efficient operations. For very large texts, you might need to
process data in chunks.
Beyond individual words, let’s look at common two-word phrases (bigrams) in our recipes.
The token = "ngrams", n = 2 parameters in
unnest_tokens() create bigrams. Try changing
n = 3 for trigrams or n = 4 for 4-word
phrases. Use ?unnest_tokens to explore all tokenization
options!
# Create bigrams
recipe_bigrams <- recipe_data %>%
  unnest_tokens(bigram, instructions, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !str_detect(word1, "^\\d+$"),
         !str_detect(word2, "^\\d+$")) %>%
  count(word1, word2, sort = TRUE) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  top_n(10, n)
kable(recipe_bigrams, 
      caption = "Most Common Bigrams (Two-word phrases)",
      col.names = c("Bigram", "Frequency")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Bigram | Frequency | 
|---|---|
| olive oil | 3 | 
| basil leaves | 2 | 
| sauté onions | 2 | 
| add arborio | 1 | 
| add beans | 1 | 
| add chopped | 1 | 
| add fresh | 1 | 
| add garlic | 1 | 
| add herbs | 1 | 
| add mushrooms | 1 | 
| add warm | 1 | 
| arborio rice | 1 | 
| avocado cilantro | 1 | 
| baking soda | 1 | 
| bean mixture | 1 | 
| beans cumin | 1 | 
| beans sauté | 1 | 
| beef broth | 1 | 
| beef cubes | 1 | 
| bell peppers | 1 | 
| black beans | 1 | 
| bowl toss | 1 | 
| broth stirring | 1 | 
| brown beef | 1 | 
| celery pour | 1 | 
| cheese season | 1 | 
| chili powder | 1 | 
| chips drop | 1 | 
| chocolate chips | 1 | 
| chop fresh | 1 | 
| chopped onions | 1 | 
| combine flour | 1 | 
| constantly add | 1 | 
| cookie sheets | 1 | 
| cream butter | 1 | 
| creamed mixture | 1 | 
| cucumbers slice | 1 | 
| cumin chili | 1 | 
| diced tomatoes | 1 | 
| dressing season | 1 | 
| easily serve | 1 | 
| fish flakes | 1 | 
| fish sauce | 1 | 
| flakes easily | 1 | 
| flour baking | 1 | 
| floured surface | 1 | 
| fluffy mix | 1 | 
| fresh herbs | 1 | 
| fresh lettuce | 1 | 
| fresh mozzarella | 1 | 
| fry chicken | 1 | 
| garlic chilies | 1 | 
| gradually add | 1 | 
| gradually blend | 1 | 
| grill salmon | 1 | 
| grilled vegetables | 1 | 
| heat add | 1 | 
| heat broth | 1 | 
| heat grill | 1 | 
| heat oil | 1 | 
| heat stir | 1 | 
| herbs preheat | 1 | 
| juice warm | 1 | 
| leaves drizzle | 1 | 
| leaves season | 1 | 
| lemon wedges | 1 | 
| lettuce tomatoes | 1 | 
| lime juice | 1 | 
| low heat | 1 | 
| minutes gradually | 1 | 
| mixture stir | 1 | 
| mixture top | 1 | 
| oil bake | 1 | 
| onions carrots | 1 | 
| onions thinly | 1 | 
| parmesan cheese | 1 | 
| pizza dough | 1 | 
| preheat grill | 1 | 
| preheated oven | 1 | 
| red onions | 1 | 
| rinse black | 1 | 
| salmon fillets | 1 | 
| salt gradually | 1 | 
| salt pepper | 1 | 
| sauce serve | 1 | 
| season salmon | 1 | 
| seasonings cook | 1 | 
| serve immediately | 1 | 
| sheets bake | 1 | 
| slice red | 1 | 
| soft add | 1 | 
| soy sauce | 1 | 
| spread tomato | 1 | 
| steamed rice | 1 | 
| stir fry | 1 | 
| stirring constantly | 1 | 
| surface spread | 1 | 
| thinly combine | 1 | 
| tomato sauce | 1 | 
| tomatoes add | 1 | 
| translucent add | 1 | 
| ungreased cookie | 1 | 
| vanilla combine | 1 | 
| vinegar dressing | 1 | 
| warm broth | 1 | 
| warm sauté | 1 | 
| warm tortillas | 1 | 
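As the tip above suggests, switching to n = 3 gives trigrams. A minimal sketch on the same data:
# Same approach with trigrams (three-word phrases)
recipe_data %>%
  unnest_tokens(trigram, instructions, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE) %>%
  head(10)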
Let’s use regular expressions to find specific patterns in our recipe text.
Regex (regular expressions) might seem intimidating at first, but
they’re incredibly powerful for text analysis. Start simple:
\\d+ finds any digits, [A-Z] finds capital
letters. The stringr package makes regex much friendlier
with functions like str_extract()!
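A few tiny examples of what those patterns do:
# Quick illustrations of the patterns mentioned above
str_extract("Bake at 375F for 10 minutes", "\\d+")      # "375" - the first run of digits
str_extract_all("Bake at 375F for 10 minutes", "\\d+")  # "375" and "10"
str_extract("Add Thai basil", "[A-Z]\\w+")              # "Add" - the first capitalized word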
# Find temperature and time patterns
temp_time_patterns <- recipe_data %>%
  mutate(
    temperatures = str_extract_all(instructions, "\\d+°?F"),
    cooking_times = str_extract_all(instructions, "\\d+-?\\d* minutes?")
  ) %>%
  select(recipe_name, temperatures, cooking_times)
# Show temperature patterns
temp_summary <- temp_time_patterns %>%
  mutate(temp_found = map_lgl(temperatures, ~ length(.) > 0)) %>%
  filter(temp_found) %>%
  select(recipe_name, temperatures)
kable(head(temp_summary), 
      caption = "Temperature Patterns Found in Recipes",
      col.names = c("Recipe", "Temperatures")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Recipe | Temperatures | 
|---|---|
| Classic Chocolate Chip Cookies | 375°F | 
| Homemade Pizza Margherita | 450°F | 
Pattern Recognition Applications: This same regex approach can extract phone numbers from customer service logs, dates from historical documents, or product codes from inventory descriptions. The possibilities are endless!
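For instance, a small sketch of pulling date-like patterns out of free text (the sample strings are made up):
# Made-up example: extract MM/DD/YYYY dates from free text
notes <- c("Inspected on 03/15/2024, follow-up on 04/01/2024", "No date recorded")
str_extract_all(notes, "\\d{2}/\\d{2}/\\d{4}")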
Let’s create a similarity analysis to find recipes that use similar language patterns.
# Create document-term matrix for similarity analysis
recipe_dtm <- recipe_data %>%
  unnest_tokens(word, instructions) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$")) %>%
  count(recipe_name, word) %>%
  cast_dtm(recipe_name, word, n)
# Calculate similarity (simplified version for demonstration)
similarity_summary <- data.frame(
  Analysis_Type = c("Most Similar Recipes", "Most Unique Recipe", "Common Ingredients"),
  Finding = c("Italian recipes (Pizza & Risotto)", "Thai Basil Chicken", "Salt, Oil, and Heat verbs"),
  Insight = c("Share Mediterranean cooking style", "Unique Asian flavor profile", "Universal cooking fundamentals")
)
kable(similarity_summary, 
      caption = "Key Insights from Text Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Analysis_Type | Finding | Insight | 
|---|---|---|
| Most Similar Recipes | Italian recipes (Pizza & Risotto) | Share Mediterranean cooking style | 
| Most Unique Recipe | Thai Basil Chicken | Unique Asian flavor profile | 
| Common Ingredients | Salt, Oil, and Heat verbs | Universal cooking fundamentals | 
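The summary above was written by hand for demonstration. If you want to compute similarity directly from the document-term matrix we built, one minimal sketch (cosine similarity between recipes) looks like this:
# Cosine similarity between recipes, computed from the document-term matrix above
dtm_matrix <- as.matrix(recipe_dtm)
# Scale each recipe's word counts to unit length, then take cross products
row_norms  <- sqrt(rowSums(dtm_matrix^2))
normalized <- dtm_matrix / row_norms
cosine_sim <- normalized %*% t(normalized)
# Values near 1 (off the diagonal) indicate recipes that use similar vocabulary
round(cosine_sim, 2)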
Text analysis techniques like those we’ve practiced have numerous applications:
applications <- data.frame(
  Domain = c("Healthcare", "Marketing", "Government", "Research", "Social Media"),
  Application = c("Patient feedback analysis", "Customer sentiment tracking", 
                 "Public policy document analysis", "Literature review automation",
                 "Trend identification"),
  Techniques_Used = c("Sentiment analysis, Topic modeling", "TF-IDF, Word clouds",
                     "Named entity recognition, Classification", "Text similarity, Clustering",
                     "N-gram analysis, Network analysis")
)
kable(applications, 
      caption = "Real-World Text Analysis Applications") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Domain | Application | Techniques_Used | 
|---|---|---|
| Healthcare | Patient feedback analysis | Sentiment analysis, Topic modeling | 
| Marketing | Customer sentiment tracking | TF-IDF, Word clouds | 
| Government | Public policy document analysis | Named entity recognition, Classification | 
| Research | Literature review automation | Text similarity, Clustering | 
| Social Media | Trend identification | N-gram analysis, Network analysis | 
Now it’s time to apply what you’ve learned! Try analyzing this new recipe text:
Practice makes perfect! Take the Mediterranean recipe below and try all the techniques we’ve covered. Can you identify the cooking techniques, extract sentiment, and find interesting patterns?
practice_recipe <- data.frame(
  recipe = "Mediterranean Herb-Crusted Cod",
  instructions = "Season fresh cod fillets with sea salt and black pepper. Create herb crust by combining breadcrumbs, fresh parsley, oregano, and minced garlic. Press mixture onto fish. Drizzle with extra virgin olive oil. Bake at 400°F for 15-20 minutes until fish is flaky and golden. Serve with lemon wedges and roasted vegetables."
)
kable(practice_recipe, 
      caption = "Practice Recipe for Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| recipe | instructions | 
|---|---|
| Mediterranean Herb-Crusted Cod | Season fresh cod fillets with sea salt and black pepper. Create herb crust by combining breadcrumbs, fresh parsley, oregano, and minced garlic. Press mixture onto fish. Drizzle with extra virgin olive oil. Bake at 400°F for 15-20 minutes until fish is flaky and golden. Serve with lemon wedges and roasted vegetables. | 
Try copying the code chunks from earlier sections and modifying them for this new recipe. Change the dataset name and see what happens. This is how you build coding confidence!
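If you’d like a nudge to get started, here’s one possible first step, reusing the column names from practice_recipe above:
# A possible starting point - tokenize the practice recipe and look at its top words
practice_tokens <- practice_recipe %>%
  unnest_tokens(word, instructions) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$"))
practice_tokens %>%
  count(word, sort = TRUE) %>%
  head(10)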
Text analysis in R provides powerful tools for extracting insights from unstructured data. Through our recipe analysis, we’ve demonstrated how to tokenize and clean text, remove stop words, explore word frequencies and TF-IDF, analyze sentiment, find bigrams and ingredient co-occurrences, and extract patterns with regular expressions.
These skills are directly applicable to analyzing any type of text data, from customer feedback to research documents to social media content.
To continue developing your text analysis skills, explore advanced packages such as quanteda and spacyr (see the table below). Remember: the best way to learn text analysis is by doing it. Start with small projects and gradually tackle more complex challenges!
advanced_packages <- data.frame(
  Package = c("quanteda", "spacyr", "tm", "topicmodels", "textdata"),
  Purpose = c("Comprehensive text analysis framework", 
             "spaCy integration for advanced NLP", 
             "Text mining framework",
             "Topic modeling algorithms",
             "Access to text analysis datasets"),
  Difficulty = c("Advanced", "Advanced", "Intermediate", "Advanced", "Beginner")
)
kable(advanced_packages, 
      caption = "Additional R Packages for Text Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Package | Purpose | Difficulty | 
|---|---|---|
| quanteda | Comprehensive text analysis framework | Advanced | 
| spacyr | spaCy integration for advanced NLP | Advanced | 
| tm | Text mining framework | Intermediate | 
| topicmodels | Topic modeling algorithms | Advanced | 
| textdata | Access to text analysis datasets | Beginner | 
Course materials developed by Eric Kvale with the Minnesota Department of Health Office of Data Strategy and Interoperability Data Technical Assistance Unit (DSI DTA) with support from the Minnesota State Government R Users Group.