class: center, middle, inverse, title-slide

# Magic of Emotion
## Sentiment Analysis Based on Tidy Data
### Nanyu Zhou
### University of Shanghai for Science and Technology
### 2020/07/16

---

# Self-Intro

### Nanyu Zhou

University of Shanghai for Science and Technology, **Communication**

Geek, passionate about information technology. Writer and poet, photographer, *Intangible Cultural Heritage Inheritor*, beginner in computational communication. Former Assistant Director at SMG and intern in **The Paper** Data Journalism Dept. Writer for NetEase and IT Home.

Expert in:

- Content Production
- Graphic Design
- Web and Client-Side Development
- Data Journalism and Data Viz

---

# Content

## Part I A Brief Introduction

## Part II Practice and Demo

## Part III Extensive Resources

---
class: inverse, center, middle
background-image: url(https://cdn.wardzhou.art/img/smiley-2979107_1920.jpg)
background-size: cover;

# Part I
### A Brief Introduction

---

# A Brief Introduction

**Definition**

**Sentiment analysis** (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

--

**Application**

- Quantitative Research
- Public Opinion Analysis
- Custom Client Applications
- Assessing the Effect of Communication
- Advertisement and Marketing
- Political Campaigns
- Medical Field
- ......

---
class: inverse, center, middle
background-image: url(https://cdn.wardzhou.art/img/work-731198_1920.jpg)
background-size: cover;

# Part II
### Practice and Demo

---

# General Steps



---

# Tools & Libs

```r
library(rtweet)
library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)
library(echarts4r)
```

- `rtweet` Provides access to Twitter's official API. Read more on my [blog](https://wardzhou.art/2020/06/07/Rtweet).
- `dplyr`, `tidyr`, `tidytext` These three packages are used to format and clean data.
- `ggplot2`, `echarts4r` Data visualization

---

# Get Your Data!

```r
q <- "flight safety"
result <- search_tweets(q, n = 3200, type = "recent",
                        include_rts = TRUE, geocode = NULL,
                        max_id = NULL, parse = TRUE, token = NULL)
attributes(result)
```

```
## $names
##  [1] "user_id"              "status_id"           
##  [3] "created_at"           "screen_name"         
##  [5] "text"                 "source"              
##  [7] "display_text_width"   "reply_to_status_id"  
##  [9] "reply_to_user_id"     "reply_to_screen_name"
## [11] "is_quote"             "is_retweet"          
## [13] "favorite_count"       "retweet_count"       
## [15] "quote_count"          "reply_count"         
## [17] "hashtags"             "symbols"             
## [19] "urls_url"             "urls_t.co"           
##
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
```

---

# Extract Text

```r
# Extract the text column and tokenize it into single words
text_df <- tibble(text = result$text)
text_df <- text_df %>%
  unnest_tokens(word, text)
text_df
```

```
## # A tibble: 113,928 x 1
##    word    
##    <chr>   
##  1 american
##  2 airlines
##  3 reaches 
##  4 out     
##  5 to      
##  6 remind  
##  7 sen     
##  8 ted     
##  9 cruz    
## 10 about   
## # ... with 113,918 more rows
```

---

# Stop Words

```r
data("stop_words")
text_df <- text_df %>%
  anti_join(stop_words)
text_df
```

```
## # A tibble: 63,319 x 1
##    word    
##    <chr>   
##  1 american
##  2 airlines
##  3 reaches 
##  4 remind  
##  5 sen     
##  6 ted     
##  7 cruz    
##  8 mask    
##  9 amid    
## 10 covid   
## # ... with 63,309 more rows
```

---

# Custom Stop Words

```r
stopwd <- data.frame(word = c('t.co', 'https', 'http'))
text_df <- text_df %>%
  anti_join(stopwd)
text_df
```

```
## # A tibble: 58,629 x 1
##    word    
##    <chr>   
##  1 american
##  2 airlines
##  3 reaches 
##  4 remind  
##  5 sen     
##  6 ted     
##  7 cruz    
##  8 mask    
##  9 amid    
## 10 covid   
## # ... with 58,619 more rows
```

---

# Clean it up!

```r
# Remove non-English words
text_df <- text_df[which(!grepl("[^\x01-\x7F]+", text_df$word)), ]
# Remove numbers
text_df1 <- gsub("[[:digit:]]*", "", text_df$word)
# Remove short words
text_df1 <- text_df1[!nchar(text_df1) < 2]
text_df1
```

```
##     [1] "american"  "airlines" 
##     [3] "reaches"   "remind"   
##     [5] "sen"       "ted"      
##     [7] "cruz"      "mask"     
##     [9] "amid"      "covid"    
##    [11] "pandemic"  "ktyfpa"   
##    [13] "usatoday"  "tourism"  
##    [15] "qblxrlvhx" "american" 
##    [17] "airlines"  "reaches"  
##    [19] "remind"    "sen"      
## ......
## [56157] "xeeiaiqxp" "wsj"
```

---

# Word Freq

.pull-left[
```r
data.frame(word = text_df1) %>%
  count(word, sort = TRUE) %>%
  filter(n > 100) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
```
]

.pull-right[
<!-- -->
]

---

# WordCloud
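
The rendered cloud isn't preserved in the source. A minimal sketch of how it could be drawn with `echarts4r`'s `e_cloud()` (the frequency threshold and `sizeRange` below are illustrative choices, not the original settings):

```r
# Sketch: build a word cloud from the cleaned tokens with echarts4r.
# filter(n > 50) and sizeRange are illustrative, not the original values.
data.frame(word = text_df1) %>%
  count(word, sort = TRUE) %>%
  filter(n > 50) %>%
  e_charts() %>%
  e_cloud(word, n, shape = "circle", sizeRange = c(6, 60)) %>%
  e_title("Most Frequent Words")
```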
---
class: inverse, center, middle
background-image: url(https://cdn.wardzhou.art/img/laugh-3673836_1920.jpg)
background-size: cover;

# Sentiment Analysis
### Jump in!

---

# Universal Dictionaries

- AFINN (Finn Årup Nielsen)
- Bing (Bing Liu and coworkers)
- NRC (Saif Mohammad and Peter Turney)

All three are dictionaries based on single words.

.pull-left[
```r
get_sentiments('afinn')
```

```
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
## # ... with 2,467 more rows
```
]

.pull-right[
```r
get_sentiments('bing')
```

```
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
## # ... with 6,776 more rows
```
]

---

# Universal Dictionaries

.pull-left[
```r
get_sentiments('nrc')
```

```
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
```
]

.pull-right[
```r
# Joy words contained in NRC
nrcjoy <- get_sentiments('nrc') %>%
  filter(sentiment == 'joy')
nrcjoy
```

```
## # A tibble: 689 x 2
##    word          sentiment
##    <chr>         <chr>    
##  1 absolution    joy      
##  2 abundance     joy      
##  3 abundant      joy      
##  4 accolade      joy      
##  5 accompaniment joy      
##  6 accomplish    joy      
##  7 accomplished  joy      
##  8 achieve       joy      
##  9 achievement   joy      
## 10 acrobat       joy      
## # ... with 679 more rows
```
]

---

# Analyze Tweets with Bing

```r
df <- data.frame(word = text_df1)
df <- df %>%
  inner_join(get_sentiments('bing')) %>%
  count(word, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
```

---

# Comparison
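
The original comparison chart isn't preserved here. A minimal sketch, assuming it contrasted the words that contribute most to each polarity (the `top_n(10, n)` cutoff is an illustrative choice):

```r
# Sketch: top contributors to positive and negative sentiment.
data.frame(word = text_df1) %>%
  inner_join(get_sentiments('bing')) %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  coord_flip() +
  labs(x = NULL, y = "Contribution to sentiment")
```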
---

# Distribution
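
The distribution plot isn't preserved either. A minimal sketch, assuming it showed how the net scores in `df` (from the previous slide) are spread:

```r
# Sketch: histogram of net sentiment scores (positive - negative) per word.
df %>%
  ggplot(aes(sentiment)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Net sentiment (positive - negative)", y = "Number of words")
```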
---

# WordCloud
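
A minimal sketch of a sentiment comparison cloud. The `reshape2` and `wordcloud` packages are an assumption on my part; the deck does not load them elsewhere:

```r
library(reshape2)   # assumed: for acast()
library(wordcloud)  # assumed: for comparison.cloud()

# Sketch: cast word counts into a term matrix and contrast the polarities.
data.frame(word = text_df1) %>%
  inner_join(get_sentiments('bing')) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)
```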
---
class: inverse, center, middle

# Everything seems to be alright. However......

---

# However...

Single-word entries can be of great help. However, analyzing other units of text gets tricky.

### "I am not having a good day."

This is a negative sentence, but the word-by-word approach breaks down here: "good" scores positive on its own, and a single-word dictionary cannot see the negation. Why not meet some new friends XD?

---

# New Friends

Three R packages that help analyze complete sentences:

- coreNLP (Arnold and Tilton, 2016)
- cleanNLP (Arnold, 2016)
- sentimentr (Rinker, 2017)

```r
library(sentimentr)
library(gutenbergr)

# Download Jane Eyre (Gutenberg ID 1260) and split it by chapter
tidy_books <- gutenberg_download(c(1260))
Jane <- tidy_books %>%
  unnest_tokens(sentence, text, token = "regex",
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

# Score the text sentence by sentence with sentimentr
mytext <- Jane$sentence
mytext <- get_sentences(mytext)
JaneMustHappy <- sentiment(mytext)
```

---
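
# New Friends: A Quick Check

A minimal sketch (not from the original deck) of how `sentimentr` handles the example sentence from a few slides back:

```r
library(sentimentr)

# "good" alone is positive, but sentimentr treats "not" as a valence
# shifter, so the whole sentence receives a negative score.
sentiment(get_sentences("I am not having a good day."))
```

---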