Text Mining in R: Word Frequency Analysis with tidytext

Learn to tokenize text, count word frequencies, and visualize the results with the tidytext package. · Difficulty: Beginner · +15 XP

Text Mining Overview

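Text mining extracts useful information from unstructured text. tidytext follows tidyverse principles: it converts text into a tidy format with one token (here, one word) per row, so the usual dplyr and ggplot2 tools can be applied to it directly.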
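
To make the one-token-per-row layout concrete, the following base-R sketch builds it by hand for two short sentences. It is only a rough approximation of what a tokenizer does (lowercase, then split on whitespace); the names `sentences`, `words_by_line`, and `tidy_words` are illustrative and not part of tidytext.

```r
# Base-R illustration of the "one token per row" layout (no packages needed).
sentences <- c("I love R", "R is great")

# Lowercase each sentence and split it on whitespace.
words_by_line <- strsplit(tolower(sentences), "\\s+")

# Stack the pieces into a long data frame:
# one row per word, tagged with the line it came from.
tidy_words <- data.frame(
  line = rep(seq_along(words_by_line), lengths(words_by_line)),
  word = unlist(words_by_line),
  stringsAsFactors = FALSE
)
print(tidy_words)
```

Because each word sits in its own row, counting, filtering, and joining become ordinary data-frame operations.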

# Load packages: tidytext for tokenizing, dplyr for data manipulation, ggplot2 for plotting
library(tidytext)
library(dplyr)
library(ggplot2)

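# A small example corpus: one sentence per row.
# stringsAsFactors = FALSE keeps the text as character on R versions before 4.0.
text_df <- data.frame(
  line = 1:3,
  text = c('I love R programming', 'R is great for data analysis', 'I enjoy learning R'),
  stringsAsFactors = FALSE
)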

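# Tokenize: split the text column into one word per row.
# By default unnest_tokens() lowercases the words and strips punctuation.
tokens <- text_df %>%
  unnest_tokens(word, text)
print(tokens)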

# Count how often each word occurs, most frequent first
word_counts <- tokens %>%
  count(word, sort = TRUE)
print(word_counts)

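# Remove stop words using tidytext's built-in stop_words data frame.
# Note: the list may also drop very short tokens such as "r"; inspect tokens_clean to check.
data(stop_words)
tokens_clean <- tokens %>%
  anti_join(stop_words, by = 'word')
word_counts_clean <- tokens_clean %>%
  count(word, sort = TRUE)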

# Draw a word cloud; word size reflects frequency
library(wordcloud)
wordcloud(words = word_counts_clean$word, freq = word_counts_clean$n, min.freq = 1)
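
The lesson loads ggplot2, but the snippets above only draw a word cloud. A bar chart is another common way to show word frequencies, and the sketch below is one possible version. It rebuilds the example data so that it runs on its own; the plot object name `p`, the use of `reorder()`, and the axis label are choices made for this illustration, not something the lesson prescribes.

```r
library(tidytext)
library(dplyr)
library(ggplot2)

# Rebuild the lesson's example corpus so this block is self-contained.
text_df <- data.frame(
  line = 1:3,
  text = c("I love R programming", "R is great for data analysis", "I enjoy learning R"),
  stringsAsFactors = FALSE
)

# Tokenize, drop stop words, and count the remaining words.
word_counts_clean <- text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Bar chart of word frequencies; reorder() sorts the bars by count,
# and coord_flip() turns them horizontal so the words are easy to read.
p <- word_counts_clean %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count")
print(p)
```

With a corpus this small most counts will be tied, so the chart becomes more informative on longer texts.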

Exercise hint: using the text of Pride and Prejudice from the janeaustenr package, find the 10 most frequent words after removing stop words.
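
One possible way to approach the exercise is sketched below. It assumes the `prideprejudice` character vector exported by janeaustenr (one element per line of the novel) as input; the names `pp_df` and `top10` are illustrative, and other solutions are equally valid.

```r
library(janeaustenr)
library(tidytext)
library(dplyr)

# Wrap the novel's lines in a one-column data frame.
pp_df <- data.frame(text = prideprejudice, stringsAsFactors = FALSE)

# Tokenize, remove stop words, count, and keep the 10 most frequent words.
top10 <- pp_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  head(10)
print(top10)
```

The pipeline is the same as in the small example: only the input text changes, which is the main benefit of keeping text in a tidy format.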