Following my last post, Extracting Data from Twitter, we are now ready to play with the tweets we grabbed from Twitter.
The first thing we will do is authenticate our connection with Twitter.
# load required packages
library(twitteR)
library(plyr)
library(stringr)
# load the saved OAuth credential and authenticate
load("twitCred.RData")
registerTwitterOAuth(twitCred)
Now we need a way to score how 'good' or 'bad' each tweet is. This is actually an interesting problem. I will be using a solution adapted from Jeffrey Breen.
# load files containing positive & negative words
positive.words = scan("positive-words.txt",
                      what = 'character',
                      comment.char = ';')
negative.words = scan("negative-words.txt",
                      what = 'character',
                      comment.char = ';')
# dependent function
tryTolower = function(x) {
  # create missing value; this is where the returned value will be
  y = NA
  # attempt the conversion, capturing any error
  try_error = tryCatch(tolower(x), error = function(e) e)
  # if not an error, keep the lowercased result
  if (!inherits(try_error, "error"))
    y = tolower(x)
  return(y)
}
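Why bother with the wrapper? tolower() throws an error when it hits a string with an invalid multi-byte sequence (common in scraped tweets), which would abort the whole scoring run; tryTolower() returns NA for that one string instead. A quick check, with the function repeated here so the snippet stands alone (the example strings are my own):

```r
# tryTolower as defined above
tryTolower = function(x) {
  y = NA
  try_error = tryCatch(tolower(x), error = function(e) e)
  if (!inherits(try_error, "error"))
    y = tolower(x)
  return(y)
}

# a well-formed string is simply lowercased
tryTolower("Miami HEAT")   # "miami heat"
# a string with a stray non-UTF-8 byte yields NA in a UTF-8 locale
# instead of stopping the run with an error (locale-dependent)
tryTolower("caf\xe9")
```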
# sentiment score function
score.sentiment = function(sentences, pos.words, neg.words, .progress = 'none') {
  # We got a vector of sentences.
  # plyr will handle a list or a vector as an "l" for us.
  # We want a simple array ("a") of scores back, so we use
  # "l" + "a" + "ply" = "laply":
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # strip non-ASCII characters, then convert to lower case
    sentence = iconv(sentence, 'UTF-8', 'ASCII')
    sentence = sapply(sentence, tryTolower)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA;
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # conveniently, TRUE/FALSE is treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress = .progress)
  scores.df = data.frame(score = scores, text = sentences)
  return(scores.df)
}
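The heart of the function is the match/sum arithmetic, which is easy to verify on its own. A minimal sketch with toy dictionaries standing in for the positive-words.txt and negative-words.txt lists (the words here are invented for illustration):

```r
# toy stand-ins for the positive/negative word files
pos.words = c("good", "great", "win")
neg.words = c("bad", "awful", "lose")

# an already-cleaned, already-split sentence
words = c("what", "a", "great", "great", "win", "bad", "luck")

# match() gives the dictionary position or NA; !is.na() turns that into TRUE/FALSE
pos.matches = !is.na(match(words, pos.words))
neg.matches = !is.na(match(words, neg.words))

# TRUE/FALSE sum as 1/0: three positive hits minus one negative hit
sum(pos.matches) - sum(neg.matches)   # 2
```

Note that repeated words count each time they appear, so "great great win" contributes +3.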
# function to grab tweets
grab.tweet <- function(search = "#rstats", n = 10) {
  # check search parameter
  if (!is.character(search)) {
    warning(sprintf("Search argument %s is not character.", search))
  }
  # check n parameter
  if (!is.numeric(n)) {
    warning(sprintf("Number %s is not numeric.", n))
  }
  # get tweets
  tweets <- searchTwitter(search, n = n,
                          cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
  # put in data frame
  tweets <- twListToDF(tweets)
  # write to a temporary file and read back in to weed out erroneous characters
  write.csv(tweets, file = "tmp_tweet.csv", row.names = FALSE)
  tweets <- read.csv(file = "tmp_tweet.csv")
  # score tweets
  tweet.scores <- score.sentiment(tweets$text, positive.words, negative.words)
  # merge scores back to the original data (on the shared "text" column)
  tweet.scores <- merge(tweets, tweet.scores)
  return(tweet.scores)
}
Now we can load tweets and analyze the sentiment score.
# get 500 latest tweets containing "#heat"
tweets <- grab.tweet(search="#heat", n=500)
# examine data frame with tweets and score
head(tweets)
# tabulate scores
table(tweets$score)
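What table() gives back is a count of tweets at each distinct score. A sketch with a made-up score vector (your actual counts will depend on the tweets returned):

```r
# made-up sentiment scores for illustration
scores <- c(-1, 0, 0, 1, 1, 1, 2)
table(scores)
# scores
# -1  0  1  2
#  1  2  3  1
```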
# plot scores and output as png file
require(lattice)
png(file = "heat_bar.png", width = 1000, height = 750)
barchart(as.factor(tweets$score),
         horizontal = FALSE,
         col = "red",
         scales = list(x = list(cex = 1.5), y = list(cex = 1.5)),
         xlab = list("Score", cex = 2),
         ylab = list("Frequency", cex = 2),
         main = list("Miami Heat Tweet Sentiment", cex = 4))
dev.off()
This was run right after the Miami Heat won game 5 of the 2013 NBA Playoffs against the Indiana Pacers in the Eastern Conference finals.