2 — Count words
Load kids TV data
Read in a table of kids TV shows on Netflix.
library(tidyverse)
tv_shows <- read_csv('https://tidy-mn.github.io/qualitative-guide/posts/data/kids_netflix_shows.csv')
Count totals in each group or category
library(tidyverse)
type_count <- tv_shows %>%
  count(type) # type can be any column in the data
type_count
# A tibble: 2 × 2
type n
<chr> <int>
1 Movie 532
2 TV Show 414
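To make clearer what count() does, here is a minimal sketch on a small made-up table. The toy_shows data and its values are assumptions for illustration, not part of the Netflix data. It shows that count() gives the same totals as grouping the rows and summarising them by hand.

```r
library(dplyr)

# Hypothetical toy data standing in for tv_shows
toy_shows <- tibble(
  title = c("A", "B", "C", "D", "E"),
  type  = c("Movie", "TV Show", "Movie", "Movie", "TV Show")
)

# count() collapses each group into one row with its total in column n
toy_count <- toy_shows %>%
  count(type)

# The same totals written out with group_by() and summarise()
toy_count_long <- toy_shows %>%
  group_by(type) %>%
  summarise(n = n())

toy_count
```

Either form works; count() is simply the shorter way to write it.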
Count totals by release year for each type
year_type_count <- tv_shows %>%
  count(release_year, type) # Can add multiple column names
year_type_count %>% head()
# A tibble: 6 × 3
release_year type n
<dbl> <chr> <int>
1 1954 Movie 1
2 1968 Movie 1
3 1971 Movie 1
4 1973 Movie 1
5 1977 Movie 1
6 1978 Movie 1
Count type totals in wide format table
library(janitor)
tv_tabyl <- tv_shows %>%
  filter(release_year > 2016) %>%
  tabyl(release_year, type) %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 0) %>%
  adorn_ns() %>%
  adorn_title()
tv_tabyl
type
release_year Movie TV Show
2017 48% (48) 52% (52)
2018 49% (63) 51% (65)
2019 56% (72) 44% (56)
2020 52% (79) 48% (73)
2021 20% (1) 80% (4)
Total 51% (263) 49% (250)
Rank occurrence of words
tokens = words or phrases
Top 10 words in the genre column.
library(tidytext)
genre_count <- tv_shows %>%
  unnest_tokens(word, genre) %>%
  count(word, sort = TRUE)
genre_count %>% head(10)
# A tibble: 10 × 2
word n
<chr> <int>
1 tv 650
2 movies 602
3 children 532
4 family 532
5 kids 414
6 comedies 342
7 dramas 86
8 shows 61
9 action 43
10 adventure 43
Additional token options
Rather than counting every single word, we may be interested in counting how often words occur together. To do this we use unnest_ngrams() and set the n argument to 2. If an <NA> appears in the word column of the counts, it indicates shows that had fewer than two words in the genre column.
genre_count <- tv_shows %>%
  unnest_ngrams(word, genre, n = 2) %>%
  count(word, sort = TRUE)
genre_count %>% head(10)
# A tibble: 10 × 2
word n
<chr> <int>
1 children family 532
2 family movies 532
3 kids tv 414
4 movies comedies 225
5 tv tv 140
6 tv comedies 117
7 movies dramas 64
8 tv shows 61
9 action adventure 43
10 fi fantasy 32
The pairs children family and family movies tied as the most frequent, each occurring 532 times.
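The <NA> behavior described earlier is easier to see on a tiny example. This is a sketch only: toy_genres and its two rows are made up for illustration. The second row has a single word, so no two-word pair can be formed from it.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: the second genre has only one word
toy_genres <- tibble(
  title = c("Show A", "Show B"),
  genre = c("Action Adventure Comedy", "Comedies")
)

# Split each genre into overlapping two-word pairs (bigrams)
toy_bigrams <- toy_genres %>%
  unnest_ngrams(word, genre, n = 2)

# Rows with too few words come back as NA and can be dropped before counting
toy_bigrams_clean <- toy_bigrams %>%
  filter(!is.na(word))

toy_bigrams
```

Whether to drop the NA rows depends on the question. Keeping them shows how many entries were too short to form a pair.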
In this data set multiple genres are separated by a comma, so we can treat each phrase before and after a comma as a single genre or token. To do this we use unnest_regex() and set the pattern argument to ", ". This will split the text wherever a comma followed by a space occurs. Now the genres will be counted as they were intended in the data.
genre_count <- tv_shows %>%
  unnest_regex(word, genre, pattern = ", ") %>%
  count(word, sort = TRUE)
genre_count %>% head(10)
# A tibble: 10 × 2
word n
<chr> <int>
1 children & family movies 532
2 kids' tv 414
3 comedies 225
4 tv comedies 117
5 dramas 75
6 british tv shows 27
7 music & musicals 27
8 korean tv shows 23
9 tv action & adventure 22
10 action & adventure 21
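The ", " pattern assumes every comma is followed by exactly one space. If a data set were less tidy, a slightly looser pattern could help. The sketch below uses made-up rows (toy_genres is hypothetical) and the pattern ",\\s*", which is our own variation and not part of the example above: it splits on a comma followed by any amount of whitespace, including none.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data with inconsistent spacing after the commas
toy_genres <- tibble(
  title = c("Show A", "Show B"),
  genre = c("Kids' TV,TV Comedies", "Comedies,  Dramas")
)

# Split on a comma followed by zero or more whitespace characters
toy_tokens <- toy_genres %>%
  unnest_regex(word, genre, pattern = ",\\s*")

toy_tokens
```

As in the genre counts above, the tokens come back in lower case because unnest_regex() lowercases text by default.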
Stop words
stop words = common words or phrases such as the, of, and to.
When comparing survey responses and narratives, some of the most common words are often the articles, such as the, a, and an, that don’t offer much in terms of signaling the intent or theme of the text. These filler words are commonly referred to as stop words.
A list of stop words is included in the tidytext package. Let’s load the words and store them in a variable called exclude. Here are the first few, but take a look at the full list to get a better sense of what may be considered a stop word.
exclude <- stop_words$word
exclude %>% head()
[1] "a" "a's" "able" "about" "above" "according"
Excluding unwanted words
For this example we will focus on the description column. This column contains some free text describing the content of the show.
First, let’s start by counting the occurrence of all words in the descriptions.
word_count <- tv_shows %>%
  unnest_tokens(word, description) %>%
  count(word, sort = TRUE)
word_count %>% head(15)
# A tibble: 15 × 2
word n
<chr> <int>
1 the 1048
2 a 1020
3 and 850
4 to 804
5 of 509
6 in 377
7 his 343
8 with 272
9 her 233
10 their 224
11 when 207
12 on 198
13 an 184
14 for 179
15 from 168
Just as we expected. Lots of stop words.
Let’s exclude the stop words from the list with the filter() function. This should give us a much more informative list of words.
word_count <- tv_shows %>%
  unnest_tokens(word, description) %>%
  filter(!word %in% exclude) %>%
  count(word, sort = TRUE)
word_count %>% head(15)
# A tibble: 15 × 2
word n
<chr> <int>
1 friends 136
2 world 92
3 save 78
4 family 64
5 life 64
6 series 62
7 adventures 60
8 evil 60
9 fun 56
10 adventure 55
11 home 51
12 school 51
13 team 43
14 christmas 42
15 city 40
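The filter() step above is one way to drop stop words. Another common pattern, swapped in here as an alternative rather than the approach used in this guide, is anti_join() against the stop_words table itself. The sketch below uses two made-up descriptions (toy_desc is hypothetical) and checks that both approaches give the same counts.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy descriptions standing in for tv_shows$description
toy_desc <- tibble(
  title = c("Show A", "Show B"),
  description = c("A team of friends save the world",
                  "The friends are on an adventure")
)

toy_words <- toy_desc %>%
  unnest_tokens(word, description)

# Approach used in this guide: filter against a vector of stop words
toy_filter <- toy_words %>%
  filter(!word %in% stop_words$word) %>%
  count(word, sort = TRUE)

# Alternative: keep only rows with no match in the stop_words table
toy_join <- toy_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

toy_join
```

The filter() version is convenient when the exclusion list is a plain vector that we plan to edit, as in the next section.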
Additional stop word options
For a given data set there may be additional stop words that provide little insight into the text. For example, the words “tv” and “series” are not very informative if each row in your data is about a TV show.
exclude <- c(stop_words$word,
             "tv",
             "series",
             "movie",
             "documentary")
word_count <- tv_shows %>%
  unnest_tokens(word, description) %>%
  filter(!word %in% exclude) %>%
  count(word, sort = TRUE)
word_count %>% head(15)
# A tibble: 15 × 2
word n
<chr> <int>
1 friends 136
2 world 92
3 save 78
4 family 64
5 life 64
6 adventures 60
7 evil 60
8 fun 56
9 adventure 55
10 home 51
11 school 51
12 team 43
13 christmas 42
14 city 40
15 animated 39
Similarly, a word that is included in the stop word list may be informative for a particular data set. When this is the case, we want to keep the word by removing it from the exclusion list. For example, we may want to know the number of show descriptions that reference one, two or three characters or objects.
Here’s how we can keep the words “one”, “two” and “three”.
exclude <- stop_words$word
keep <- c("one", "two", "three")
exclude <- exclude[!exclude %in% keep]
word_count <- tv_shows %>%
  unnest_tokens(word, description) %>%
  filter(!word %in% exclude) %>%
  count(word, sort = TRUE)
word_count %>% head(15)
# A tibble: 15 × 2
word n
<chr> <int>
1 friends 136
2 world 92
3 save 78
4 family 64
5 life 64
6 series 62
7 adventures 60
8 evil 60
9 fun 56
10 adventure 55
11 home 51
12 school 51
13 two 44
14 team 43
15 christmas 42
It appears "two" was the most common number referenced in the show descriptions.
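The step that removes the words we want to keep from the exclusion list can also be written with base R's setdiff(). This is a small sketch on a made-up vector (exclude_all is hypothetical and much shorter than the real stop word list), shown only to make the logic of that step easy to check by eye.

```r
# Hypothetical, shortened stand-in for stop_words$word
exclude_all <- c("a", "the", "one", "two", "of", "three")

# Words we want to remain in the text
keep <- c("one", "two", "three")

# Subsetting approach used in this guide
exclude_subset <- exclude_all[!exclude_all %in% keep]

# Equivalent for this vector: setdiff() drops every element found in keep
exclude_setdiff <- setdiff(exclude_all, keep)

exclude_setdiff
```

Note that setdiff() also removes duplicate entries. That makes no difference when the result is only used with %in%.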