Case Study: State of the Union Twitter Sentiment

During the 2013 State of the Union address, I thought it would be an interesting experiment to collect tweets containing the designated hashtag ‘#SOTU’.

I wrote an R script that collected the 1000 latest tweets containing ‘#SOTU’ every 30 seconds. While it ran, I took notes on the times at which President Barack Obama spoke on each issue. Combining these two datasets, I was able to put together the plot below of average tweet sentiment over time, with tweet volume underneath.
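The collection script itself is not reproduced in this post, but a minimal sketch of the loop might look like the following. It leans on the grab.tweet() helper defined in the ‘Twitter Sentiment Analysis’ post below; the iteration count, the grabbed timestamp column, and the checkpoint file name are all hypothetical.

# hypothetical sketch of the collection loop, not the exact script that was run
sotu <- data.frame()
for (i in 1:120) {                        # roughly an hour of collection
    batch <- grab.tweet(search="#SOTU", n=1000)  # 1000 latest '#SOTU' tweets
    batch$grabbed <- Sys.time()           # timestamp the batch
    sotu <- rbind(sotu, batch)
    save(sotu, file="sotu_tweets.RData")  # checkpoint after every batch
    Sys.sleep(30)                         # wait 30 seconds between pulls
}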

[Figure lineplot_volume: average tweet sentiment over time, with tweet volume below]
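A rough sketch of the aggregation behind a plot like this, assuming the hypothetical sotu data frame from the loop above (score comes from score.sentiment() via grab.tweet(), and grabbed is the hypothetical batch timestamp):

# bin tweets into 30-second windows, then plot average sentiment and volume
window    <- cut(sotu$grabbed, breaks="30 sec")
avg.score <- tapply(sotu$score, window, mean)
volume    <- tapply(sotu$score, window, length)

par(mfrow=c(2, 1))   # sentiment on top, volume below; x-axis is window index
plot(avg.score, type="l", xlab="30-second window", ylab="Average Sentiment")
plot(volume,    type="h", xlab="30-second window", ylab="Tweet Volume")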

From this information, we can see which issues had positive and negative reactions.

Positive Reaction:

  • Health Care
  • Education
  • Immigration
  • Minimum Wage
  • Cyber Threat

Negative Reaction:

  • Green Technology
  • Afghanistan
  • Gun Control

To learn how to query the data and quantify sentiment, check out my other posts:
Extracting Data From Twitter
Scoring Sentiment of Tweets

Twitter Sentiment Analysis

Following my last post, Extracting Data from Twitter, we are now ready to play with the tweets we grabbed from Twitter.

The first thing we will do is authenticate our connection with Twitter.

load("twitCred.RData")
registerTwitterOAuth(twitCred)

Now, we need a way to score how ‘good’ or ‘bad’ each tweet is. This is an interesting problem in its own right. I will be using a solution adapted from one developed by Jeffrey Breen, which scores each tweet against lists of positive and negative opinion words (the positive-words.txt and negative-words.txt files below come from Hu and Liu's opinion lexicon).

# load files containing positive & negative words
positive.words = scan("positive-words.txt",
					  what='character',
					  comment.char=';')
negative.words = scan("negative-words.txt",
					  what='character',
					  comment.char=';')

# dependent function
tryTolower = function(x){

	# create missing value
	# this is where the returned value will be
	y = NA

	# tryCatch error
	try_error = tryCatch(tolower(x), error=function(e) e)

	# if not an error
	if (!inherits(try_error, "error"))
	y = tolower(x)

	return(y)
}
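A quick check of the behavior; the second call assumes a UTF-8 locale, where the stray Latin-1 byte makes tolower() throw an error:

tryTolower("HeLLo World")   # "hello world"
tryTolower("fa\xE7ade")     # NA -- tolower() errors on the invalid byte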

# sentiment score function
# (uses laply from the plyr package and str_split from stringr)
require(plyr)
require(stringr)

score.sentiment = function(sentences, pos.words, neg.words, .progress='none'){

    # We got a vector of sentences.
	# plyr will handle a list or a vector as an "l" for us.
    # We want a simple array ("a") of scores back, so we use
    # "l" + "a" + "ply" = "laply":
    scores = laply(sentences, function(sentence, pos.words, neg.words){

        # Clean Up Sentences With R's Regex-driven Global Substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)

        # Convert to Lower Case:
        # (strip non-ASCII characters first, since they can make tolower() fail)
        sentence = iconv(sentence, 'UTF-8', 'ASCII', sub='')
        sentence = sapply(sentence, tryTolower)

        # Split into Words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')

        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)

        # Compare Our Words to the Dictionaries of Positive & Negative Terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)

        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)

        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)

        return(score)
    }, pos.words, neg.words, .progress=.progress )

	scores.df = data.frame(score=scores, text=sentences)
	return(scores.df)
}
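As a quick sanity check, the scorer can be run on a couple of hand-written sentences; this assumes words like 'love', 'great', 'awful', and 'terrible' appear in the loaded word lists (they do in Hu and Liu's lexicon):

test.sentences <- c("I love this speech, great job",
                    "what an awful, terrible idea")
score.sentiment(test.sentences, positive.words, negative.words)
# expected: the first sentence scores positive, the second negative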

# function to grab tweets
grab.tweet <- function(search="#rstats", n=10) {

	# check search parameter
	if (!is.character(search)) {
		warning(sprintf("Search argument %s is not character.",search))
	}

	# check n parameter
	if (!is.numeric(n)) {
		warning(sprintf("Number %s is not numeric.",n))
	}

	# get tweets
	tweets <- searchTwitter(search, n=n, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

	# put in data frame
	tweets <- twListToDF(tweets)

	# output to temporary file and read back in to weed out erroneous characters
	write.csv(tweets, file="tmp_tweet.csv", row.names=FALSE)
	tweets <- read.csv(file="tmp_tweet.csv")

	# score tweets
	tweet.scores <- score.sentiment(tweets$text, positive.words, negative.words)

	# merge back to original data
	tweet.scores <- merge(tweets,tweet.scores)
	return(tweet.scores)
}

Now we can load tweets and analyze the sentiment score.

# get 500 latest tweets containing "#heat"
tweets <- grab.tweet(search="#heat", n=500)

# examine data frame with tweets and score
head(tweets)

# tabulate scores
table(tweets$score)

# plot scores and output as png file
require(lattice)
png(file="heat_bar.png", width=1000, height=750)
barchart(as.factor(tweets$score),
		 horizontal=FALSE,
		 col="red",
		 scales=list(x=list(cex=1.5), y=list(cex=1.5)),
		 xlab=list("Score",cex=2),
		 ylab=list("Frequency",cex=2),
		 main=list("Miami Heat Tweet Sentiment",cex=4))
dev.off()

This was run right after the Miami Heat won Game 5 of the 2013 Eastern Conference Finals against the Indiana Pacers.

[Figure heat_bar: bar chart of Miami Heat tweet sentiment score frequencies]

Extracting Data from Twitter

Data mining has become an interesting hobby for me; being able to collect real-time data opens up a lot of analysis possibilities. Using R, I will show you how to query tweets from Twitter.

Step 1:
Log in to Twitter Developers. Once you are in, create a new application and click ‘Create access token’. Then open the application's settings, change the Application Type to ‘Read, Write and Access Direct Messages’, and enable ‘Allow this application to be used to sign in with Twitter’.

Step 2:
Start up R and run this script with your Consumer key and Consumer secret.

require("twitteR")

Key      <- "<< Consumer key >>"   # Consumer key
Secret   <- "<< Consumer secret >>" # Consumer secret
twitCred <- OAuthFactory$new(consumerKey=Key,
							 consumerSecret=Secret,
							 requestURL="https://api.twitter.com/oauth/request_token",
							 accessURL="https://api.twitter.com/oauth/access_token", 
							 authURL="https://api.twitter.com/oauth/authorize")
twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

R will spit out a URL to visit. Copy and paste it into your browser, click to authorize the app, and copy the PIN. Go back to your R console, paste in the PIN, and press Enter.

Step 3:
Save your credentials as an .RData file so you can bypass the handshake later.

save(twitCred,file="twitCred.RData")

This will save a file called “twitCred.RData” to your working directory.

Step 4:
Now every time you want to extract tweets from Twitter, all you need to do is load your credentials and authenticate.

load("twitCred.RData")
registerTwitterOAuth(twitCred)

Step 5:
Now we are ready to extract tweets from Twitter.

rstats <- searchTwitter('#rstats', n=100, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
rstatsDF <- twListToDF(rstats)
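To verify that the pull worked, take a quick look at the resulting data frame:

# one row per tweet: the tweet text plus metadata columns
str(rstatsDF)
head(rstatsDF$text)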