Following my last post, Extracting Data from Twitter, we are now ready to play with the tweets we grabbed from Twitter.
The first thing we will do is authenticate our connection with Twitter.
# load required packages
library(twitteR)
library(plyr)
library(stringr)
# load the saved OAuth credential and authenticate
load("twitCred.RData")
registerTwitterOAuth(twitCred)
Now we need a way to score how 'good' or 'bad' each tweet is. This is actually an interesting problem. I will be using a solution adapted from Jeffrey Breen.
# load files containing positive & negative words
positive.words = scan("positive-words.txt",
                      what = 'character',
                      comment.char = ';')
negative.words = scan("negative-words.txt",
                      what = 'character',
                      comment.char = ';')
# dependent function
tryTolower = function(x) {
  # create missing value; this is where the returned value will be
  y = NA
  # attempt the conversion, capturing any error
  try_error = tryCatch(tolower(x), error = function(e) e)
  # if not an error, keep the lowercased result
  if (!inherits(try_error, "error"))
    y = tolower(x)
  return(y)
}
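Why bother with the wrapper? tolower() throws an error when it hits a string with an invalid multi-byte sequence (common in scraped tweets), which would abort the whole scoring run; tryTolower() returns NA for that one string instead. A quick check, with the function repeated here so the snippet stands alone (the example strings are my own):

```r
# tryTolower as defined above
tryTolower = function(x) {
  y = NA
  try_error = tryCatch(tolower(x), error = function(e) e)
  if (!inherits(try_error, "error"))
    y = tolower(x)
  return(y)
}

# a well-formed string is simply lowercased
tryTolower("Miami HEAT")   # "miami heat"
# a string with a stray non-UTF-8 byte yields NA in a UTF-8 locale
# instead of stopping the run with an error (locale-dependent)
tryTolower("caf\xe9")
```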
# sentiment score function
score.sentiment = function(sentences, pos.words, neg.words, .progress = 'none') {
  # We got a vector of sentences.
  # plyr will handle a list or a vector as an "l" for us.
  # We want a simple array ("a") of scores back, so we use
  # "l" + "a" + "ply" = "laply":
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # strip non-ASCII characters, then convert to lower case
    sentence = iconv(sentence, 'UTF-8', 'ASCII')
    sentence = sapply(sentence, tryTolower)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA;
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # conveniently, TRUE/FALSE is treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress = .progress)
  scores.df = data.frame(score = scores, text = sentences)
  return(scores.df)
}
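The heart of the function is the match/sum arithmetic, which is easy to verify on its own. A minimal sketch with toy dictionaries standing in for the positive-words.txt and negative-words.txt lists (the words here are invented for illustration):

```r
# toy stand-ins for the positive/negative word files
pos.words = c("good", "great", "win")
neg.words = c("bad", "awful", "lose")

# an already-cleaned, already-split sentence
words = c("what", "a", "great", "great", "win", "bad", "luck")

# match() gives the dictionary position or NA; !is.na() turns that into TRUE/FALSE
pos.matches = !is.na(match(words, pos.words))
neg.matches = !is.na(match(words, neg.words))

# TRUE/FALSE sum as 1/0: three positive hits minus one negative hit
sum(pos.matches) - sum(neg.matches)   # 2
```

Note that repeated words count each time they appear, so "great great win" contributes +3.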
# function to grab tweets
grab.tweet <- function(search = "#rstats", n = 10) {
  # check search parameter
  if (!is.character(search)) {
    warning(sprintf("Search argument %s is not character.", search))
  }
  # check n parameter
  if (!is.numeric(n)) {
    warning(sprintf("Number %s is not numeric.", n))
  }
  # get tweets
  tweets <- searchTwitter(search, n = n,
                          cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
  # put in data frame
  tweets <- twListToDF(tweets)
  # write to a temporary file and read back in to weed out erroneous characters
  write.csv(tweets, file = "tmp_tweet.csv", row.names = FALSE)
  tweets <- read.csv(file = "tmp_tweet.csv")
  # score tweets
  tweet.scores <- score.sentiment(tweets$text, positive.words, negative.words)
  # merge scores back to the original data (on the shared "text" column)
  tweet.scores <- merge(tweets, tweet.scores)
  return(tweet.scores)
}
Now we can load tweets and analyze the sentiment score.
# get 500 latest tweets containing "#heat"
tweets <- grab.tweet(search="#heat", n=500)
# examine data frame with tweets and score
head(tweets)
# tabulate scores
table(tweets$score)
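What table() gives back is a count of tweets at each distinct score. A sketch with a made-up score vector (your actual counts will depend on the tweets returned):

```r
# made-up sentiment scores for illustration
scores <- c(-1, 0, 0, 1, 1, 1, 2)
table(scores)
# scores
# -1  0  1  2
#  1  2  3  1
```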
# plot scores and output as png file
require(lattice)
png(file = "heat_bar.png", width = 1000, height = 750)
barchart(as.factor(tweets$score),
         horizontal = FALSE,
         col = "red",
         scales = list(x = list(cex = 1.5), y = list(cex = 1.5)),
         xlab = list("Score", cex = 2),
         ylab = list("Frequency", cex = 2),
         main = list("Miami Heat Tweet Sentiment", cex = 4))
dev.off()
This was run right after the Miami Heat won game 5 of the 2013 NBA Playoffs against the Indiana Pacers in the Eastern Conference finals.