2 — Count words

count
words
terms
tokens
Published

January 2, 2024

Modified

January 8, 2024

Load kids TV data

Read in a table of kids TV shows on Netflix.

library(tidyverse)

tv_shows <- read_csv('https://tidy-mn.github.io/qualitative-guide/posts/data/kids_netflix_shows.csv')

Count totals in each group or category

library(tidyverse)

type_count <- tv_shows %>%
              count(type) # type can be any column in the data

type_count
# A tibble: 2 × 2
  type        n
  <chr>   <int>
1 Movie     532
2 TV Show   414
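As a rough illustration of what count() does, the sketch below compares it with the longer group_by() and summarise() form. The toy_shows table is a hypothetical stand-in with made-up rows, not the Netflix data.

```r
library(dplyr)

# Hypothetical toy data standing in for tv_shows
toy_shows <- tibble(
  title = c("A", "B", "C", "D", "E"),
  type  = c("Movie", "TV Show", "Movie", "Movie", "TV Show")
)

# Shorthand: one row per type, with the row count in column n
with_count <- toy_shows %>%
  count(type)

# The same totals written out step by step
with_summarise <- toy_shows %>%
  group_by(type) %>%
  summarise(n = n())

with_count
with_summarise
```

Both tables should hold the same totals, so count() is usually the more convenient choice.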

Count totals by release year for each type

year_type_count <- tv_shows %>%
                   count(release_year, type) # Can add multiple column names

year_type_count %>% head()
# A tibble: 6 × 3
  release_year type      n
         <dbl> <chr> <int>
1         1954 Movie     1
2         1968 Movie     1
3         1971 Movie     1
4         1973 Movie     1
5         1977 Movie     1
6         1978 Movie     1

Count type totals in wide format table

library(janitor)

tv_tabyl <- tv_shows %>%
            filter(release_year > 2016) %>%
            tabyl(release_year, type) %>%
            adorn_totals("row") %>%
            adorn_percentages("row") %>%
            adorn_pct_formatting(digits = 0) %>%
            adorn_ns() %>%
            adorn_title()
            

tv_tabyl  
                   type          
 release_year     Movie   TV Show
         2017 48%  (48) 52%  (52)
         2018 49%  (63) 51%  (65)
         2019 56%  (72) 44%  (56)
         2020 52%  (79) 48%  (73)
         2021 20%   (1) 80%   (4)
        Total 51% (263) 49% (250)
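If only the raw counts are needed in wide format, a similar table can be sketched with count() and tidyr's pivot_wider(). This is an alternative to the janitor approach above, not a replacement for it, since it adds no totals or percentages. The toy_shows rows are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for tv_shows
toy_shows <- tibble(
  release_year = c(2017, 2017, 2018, 2018, 2018),
  type         = c("Movie", "TV Show", "Movie", "Movie", "TV Show")
)

# Count in long format, then spread the types into columns
wide_count <- toy_shows %>%
  count(release_year, type) %>%
  pivot_wider(names_from = type, values_from = n, values_fill = 0)

wide_count
```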

Rank occurrence of words

tokens = words or phrases

Top 10 words in the genre column.

library(tidytext)

genre_count <- tv_shows %>%
               unnest_tokens(word, genre) %>%
               count(word, sort = TRUE)

genre_count %>% head(10)
# A tibble: 10 × 2
   word          n
   <chr>     <int>
 1 tv          650
 2 movies      602
 3 children    532
 4 family      532
 5 kids        414
 6 comedies    342
 7 dramas       86
 8 shows        61
 9 action       43
10 adventure    43
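The counts above contain no capital letters, commas, or "&" symbols because unnest_tokens() lowercases the text and drops punctuation by default. The sketch below shows this on a small made-up table; toy_shows and its rows are assumptions for illustration only.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for tv_shows
toy_shows <- tibble(
  title = c("Show A", "Show B"),
  genre = c("Kids' TV, TV Comedies", "Children & Family Movies")
)

# One row per word: lowercased, with punctuation removed
toy_words <- toy_shows %>%
  unnest_tokens(word, genre) %>%
  count(word, sort = TRUE)

toy_words
```

Because the commas are dropped, "Kids' TV" and "TV Comedies" are no longer distinguishable as separate genres at this stage; each contributes its individual words.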

Additional token options

Rather than counting every single word, we may be interested in counting how often words occur together. To do this we use unnest_ngrams() and set the n argument to 2, which splits the text into overlapping pairs of consecutive words. If an <NA> appears in the word column of the results, it indicates shows that had fewer than two words in the genre column.

genre_count <- tv_shows %>%
               unnest_ngrams(word, genre, n = 2) %>%
               count(word, sort = TRUE) 

genre_count %>% head(10)
# A tibble: 10 × 2
   word                 n
   <chr>            <int>
 1 children family    532
 2 family movies      532
 3 kids tv            414
 4 movies comedies    225
 5 tv tv              140
 6 tv comedies        117
 7 movies dramas       64
 8 tv shows            61
 9 action adventure    43
10 fi fantasy          32

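The pairs "children family" and "family movies" occurred together most often, 532 times each. Both come from the genre "Children & Family Movies", since the "&" is dropped when the text is split into words.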
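When a genre has fewer than two words, there is no pair to count. A minimal sketch of handling that case is shown below, using a made-up toy_shows table. Depending on the installed package versions, the short row may come back as NA or may be dropped, so filtering on is.na() is a safe extra step either way.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: the first genre has only one word
toy_shows <- tibble(
  title = c("Show A", "Show B"),
  genre = c("Comedies", "Action Adventure")
)

# Split into word pairs
toy_bigrams <- toy_shows %>%
  unnest_ngrams(word, genre, n = 2)

# Drop rows where no pair could be formed, then count
clean_bigrams <- toy_bigrams %>%
  filter(!is.na(word)) %>%
  count(word, sort = TRUE)

clean_bigrams
```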

In this data set multiple genres are separated by a comma, so we can treat each phrase before and after a comma as a single genre or token. To do this we use unnest_regex() and set the pattern argument to ", ". This will split the text wherever the sequence of a comma followed by a space occurs. Now the genres will be counted as they were intended in the data.

genre_count <- tv_shows %>%
               unnest_regex(word, genre, pattern = ", ") %>%
               count(word, sort = TRUE) 

genre_count %>% head(10)
# A tibble: 10 × 2
   word                         n
   <chr>                    <int>
 1 children & family movies   532
 2 kids' tv                   414
 3 comedies                   225
 4 tv comedies                117
 5 dramas                      75
 6 british tv shows            27
 7 music & musicals            27
 8 korean tv shows             23
 9 tv action & adventure       22
10 action & adventure          21
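The genres above are shown in lowercase because the unnest functions lowercase text by default. If the original capitalization matters, one option is to set the to_lower argument to FALSE. The sketch below assumes a small made-up toy_shows table.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for tv_shows
toy_shows <- tibble(
  title = c("Show A", "Show B", "Show C"),
  genre = c("Kids' TV, TV Comedies",
            "Children & Family Movies, Comedies",
            "Kids' TV")
)

# Split on comma + space and keep the original capitalization
toy_genres <- toy_shows %>%
  unnest_regex(word, genre, pattern = ", ", to_lower = FALSE) %>%
  count(word, sort = TRUE)

toy_genres
```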

Stop words

stop words = common words or phrases such as “the”, “of”, and “to”.

In survey responses and narratives, the most common words are often articles such as “the”, “a”, and “an”, which say little about the intent or theme of the text. These filler words are commonly referred to as stop words.

A list of stop words is included in the tidytext package. Let’s load the words and store them in a variable called exclude. Here are the first few as an example, but go ahead and take a look at the full list to get a better understanding of what may be considered a stop word.

exclude <- stop_words$word

exclude %>% head()
[1] "a"         "a's"       "able"      "about"     "above"     "according"
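The stop_words table also has a lexicon column, because it combines several published word lists. As a sketch of how to inspect those lists and work with a single one, the code below counts the words per lexicon and pulls out only the "snowball" list. The name snowball_words is just an illustrative variable, and which lexicon suits a project is a judgment call.

```r
library(dplyr)
library(tidytext)

# How many stop words does each lexicon contribute?
stop_words %>%
  count(lexicon)

# Use a single lexicon instead of all of them combined
snowball_words <- stop_words %>%
  filter(lexicon == "snowball") %>%
  pull(word)

snowball_words %>% head()
```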

Excluding unwanted words

For this example we will focus on the description column. This column contains some free text describing the content of the show.

First, let’s start by counting the occurrence of all words in the descriptions.

word_count <- tv_shows %>%
              unnest_tokens(word, description) %>%
              count(word, sort = TRUE) 

word_count %>% head(15)
# A tibble: 15 × 2
   word      n
   <chr> <int>
 1 the    1048
 2 a      1020
 3 and     850
 4 to      804
 5 of      509
 6 in      377
 7 his     343
 8 with    272
 9 her     233
10 their   224
11 when    207
12 on      198
13 an      184
14 for     179
15 from    168

Just as we expected. Lots of stop words.


Let’s exclude the stop words from the list with the filter() function. This should give us a much more informative list of words.

word_count <- tv_shows %>%
              unnest_tokens(word, description) %>%
              filter(!word %in% exclude) %>%
              count(word, sort = TRUE) 

word_count %>% head(15)
# A tibble: 15 × 2
   word           n
   <chr>      <int>
 1 friends      136
 2 world         92
 3 save          78
 4 family        64
 5 life          64
 6 series        62
 7 adventures    60
 8 evil          60
 9 fun           56
10 adventure     55
11 home          51
12 school        51
13 team          43
14 christmas     42
15 city          40
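Another common way to drop stop words is anti_join(), which keeps only the rows that have no match in the stop_words table. The sketch below uses a made-up toy_shows table to show the idea; it should give the same kind of result as the filter() approach.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for tv_shows
toy_shows <- tibble(
  title       = c("Show A", "Show B"),
  description = c("A team of friends saves the world.",
                  "The friends go on an adventure.")
)

# Keep only the words that do not appear in stop_words
toy_count <- toy_shows %>%
  unnest_tokens(word, description) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

toy_count
```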

Additional stop word options

For a given data set there may be additional stop words that provide little insight into the text. For example, the words “tv” and “series” are not very informative if each row in your data is about a TV show.

exclude <- c(stop_words$word,
             "tv",
             "series",
             "movie",
             "documentary")

word_count <- tv_shows %>%
              unnest_tokens(word, description) %>%
              filter(!word %in% exclude) %>%
              count(word, sort = TRUE) 

word_count %>% head(15)
# A tibble: 15 × 2
   word           n
   <chr>      <int>
 1 friends      136
 2 world         92
 3 save          78
 4 family        64
 5 life          64
 6 adventures    60
 7 evil          60
 8 fun           56
 9 adventure     55
10 home          51
11 school        51
12 team          43
13 christmas     42
14 city          40
15 animated      39

Similarly, a word that is included in the stop word list may be informative for a particular data set. When this is the case, we want to keep the word by removing it from the exclusion list. For example, we may want to know the number of show descriptions that reference one, two or three characters or objects.

Here’s how we can keep the words “one”, “two” and “three”.

exclude <- stop_words$word

keep <- c("one", "two", "three")

exclude <- exclude[!exclude %in% keep]

word_count <- tv_shows %>%
              unnest_tokens(word, description) %>%
              filter(!word %in% exclude) %>%
              count(word, sort = TRUE) 

word_count %>% head(15)
# A tibble: 15 × 2
   word           n
   <chr>      <int>
 1 friends      136
 2 world         92
 3 save          78
 4 family        64
 5 life          64
 6 series        62
 7 adventures    60
 8 evil          60
 9 fun           56
10 adventure     55
11 home          51
12 school        51
13 two           44
14 team          43
15 christmas     42

It appears "two" was the most common number referenced in the show descriptions.
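Adding and removing words from the exclusion list can also be written with the base R set functions union() and setdiff(). This is a sketch of an equivalent approach, where keep and extra are illustrative vectors. Note that both functions also drop duplicate words, which makes no difference when the list is used with %in%.

```r
library(tidytext)

keep  <- c("one", "two", "three")   # words to keep in the results
extra <- c("tv", "series")          # additional words to exclude

# Start from the built-in list, remove the words we want to keep
exclude <- setdiff(stop_words$word, keep)

# Then add our own data-specific stop words
exclude <- union(exclude, extra)

head(exclude)
```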