Playing with Twitter Data in R

Twitter Analysis with R

The goal is to analyze tweets with R, and what better place to start than the keywords "Trump" and "Putin"?
The first step is to create an application on Twitter. It is very simple: we only need a name and a description, and it takes just a few minutes. Once this is done, we need the following information to connect:
  • Consumer key
  • Consumer secret
  • Access Token
  • Access Secret
The credentials we obtained when we created our application on Twitter let R talk to Twitter through OAuth. We pass them to the function setup_twitter_oauth, which allows "twitteR" to get information through our application.
#load library
library(ROAuth)
library(twitteR)
# Parameters configuration
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
# use TRUE rather than T, since T is an ordinary variable that can be reassigned
options(httr_oauth_cache = TRUE)
consumer_key <- 'xxxxxxxxxxxxxxxxxx'
consumer_secret <- 'xxxxxxxxxxxxxxxxxxx'
access_token <- 'xxxxxxxxxxxxxxxxxx'
access_secret <- 'xxxxxxxxxxxxxxxxxxx'
# twitteR authentication
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
# streamR authentication
credentials_file <- "my_oauth.Rdata"
if (file.exists(credentials_file)) {
  load(credentials_file)
} else {
  cred <- OAuthFactory$new(consumerKey = consumer_key, consumerSecret = consumer_secret,
                           requestURL = reqURL, accessURL = accessURL, authURL = authURL)
  cred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
  save(cred, file = credentials_file)
}
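To keep the four keys out of the script itself, they can be read from environment variables. This is a minimal sketch and not part of the original setup: the helper read_twitter_credentials and the TWITTER_* variable names are hypothetical, and they would have to be set beforehand (for example in ~/.Renviron).

```r
# Hypothetical helper: read the four Twitter credentials from environment
# variables instead of hard-coding them. The variable names are assumptions.
read_twitter_credentials <- function() {
  keys <- c(
    consumer_key    = "TWITTER_CONSUMER_KEY",
    consumer_secret = "TWITTER_CONSUMER_SECRET",
    access_token    = "TWITTER_ACCESS_TOKEN",
    access_secret   = "TWITTER_ACCESS_SECRET"
  )
  # unset = NA lets us tell a missing variable apart from a set one
  values <- Sys.getenv(keys, unset = NA)
  names(values) <- names(keys)
  missing <- names(values)[is.na(values) | values == ""]
  if (length(missing) > 0) {
    stop("Missing Twitter credentials: ", paste(missing, collapse = ", "))
  }
  as.list(values)
}

# Usage (requires the variables to be set and the twitteR package loaded):
# creds <- read_twitter_credentials()
# setup_twitter_oauth(creds$consumer_key, creds$consumer_secret,
#                     creds$access_token, creds$access_secret)
```

Failing early with a list of the missing names makes a misconfigured environment easier to spot than an OAuth error later on.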

Once we have set up Twitter in R and created the authentication credentials, the next step is to load the streamR library, capture the tweets as JSON, and parse them into a data frame named df. We also count how many tweets we got for each keyword.

# load packages
library(streamR)
# connect to the Twitter stream and get messages for 60 seconds
filterStream("tweets.json", track = c("Trump", "Putin"), timeout = 60, oauth = cred)
# parse tweets
df <- parseTweets("tweets.json", simplify = TRUE)
# compute some measures
show(paste("Number of tweets with #Trump:", length(grep("Trump", df$text, ignore.case = TRUE))))
show(paste("Number of tweets with #Putin:", length(grep("Putin", df$text, ignore.case = TRUE))))
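The counting step can be tried without a Twitter connection. The sketch below uses a small invented vector in place of df$text (the strings are made up for illustration) and counts matches for each keyword in one pass.

```r
# Toy stand-in for df$text; these strings are invented for illustration.
texts <- c(
  "Trump meets Putin at the summit",
  "New statement from TRUMP today",
  "putin answers questions",
  "Nothing to see here"
)
keywords <- c("Trump", "Putin")

# For each keyword, count the tweets whose text contains it (case-insensitive).
counts <- sapply(keywords, function(k) sum(grepl(k, texts, ignore.case = TRUE)))
print(counts)

# A tweet that mentions both keywords is counted once under each of them.
both <- sum(grepl("Trump", texts, ignore.case = TRUE) &
            grepl("Putin", texts, ignore.case = TRUE))
print(both)
```

Because one tweet can match both keywords, the per-keyword counts may add up to more than the number of tweets collected.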
Let's start with some analysis.


What emotions are generated?
To identify opinions, we perform a sentiment analysis, applying language processing and text analysis after a cleaning step.
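#text cleaning
library(stringr)  # str_replace_all
mytxt <- df$text
# replace non-printable characters with a space
mytxt <- str_replace_all(mytxt, "[^[:graph:]]", " ")
# drop characters that cannot be converted to ASCII (e.g. emoji) instead of getting NA
mytxt <- iconv(mytxt, 'UTF-8', 'ASCII', sub = '')
mytxt <- gsub('(RT|via)((?:\\b\\W*@\\w+)+)', '', mytxt)
mytxt <- gsub('@\\w+', '', mytxt)
# punctuation is stripped first, so a URL collapses to "httpstco..." and the http pattern removes it
mytxt <- gsub('[[:punct:]]', '', mytxt)
mytxt <- gsub('[[:digit:]]', '', mytxt)
mytxt <- gsub('http\\w+', '', mytxt)
# collapse runs of whitespace into a single space so words are not glued together
mytxt <- gsub('[ \t]{2,}', ' ', mytxt)
mytxt <- gsub('^\\s+|\\s+$', '', mytxt)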
mytxt = gsub('í ¼í·ºí ¼í·', '', mytxt)
#get sentiment
library(syuzhet)  # get_nrc_sentiment
library(dplyr)    # mutate, group_by, summarise
library(ggplot2)
sent <- get_nrc_sentiment(mytxt)
tweets <- cbind(text = df$text, sent)
#common emotions in the tweets (columns 2 to 9 hold the eight NRC emotions)
sentimentTotal <- data.frame(colSums(tweets[,c(2:9)]))
names(sentimentTotal) <- "count"
sentimentTotal <- cbind("sentiment" = rownames(sentimentTotal), sentimentTotal)
rownames(sentimentTotal) <- NULL
#plot total sentiment
ggplot(data = sentimentTotal, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count tweets") + ggtitle("Sentiment Score for All Tweets")
#working with the date from JSON
#this is a way to transform JSON date from Mon May 29 17:21:58 +0000 2017 to 2017-05-29 17:21:58
format.str <- "%a %b %d %H:%M:%S %z %Y"
df$date <- as.POSIXct(strptime(df[,"created_at"], format.str, tz = "GMT"), tz = "GMT")
#copy the parsed date next to the sentiment scores so that we can group by it
tweets$date <- df$date
#sentiment by date
grupo <- mutate(tweets, tweet = ifelse(positive > 0, "positive", ifelse(negative > 0, "negative", "neutral")))
by.tweet <- group_by(grupo, tweet, date)
by.tweet <- summarise(by.tweet, number = n())
ggplot(by.tweet, aes(date, number)) + geom_line(aes(group = tweet, color = tweet), size = 2)
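The cleaning step at the start of this block can also be wrapped in a helper and checked against a sample string before running it on real tweets. The sketch below is a simplified base R variant, not the exact pipeline used here: it removes URLs before punctuation and replaces unwanted characters with spaces. The function clean_tweet and the sample tweet are hypothetical.

```r
# Hypothetical helper: a simplified, base-R-only variant of the cleaning step.
# It assumes ASCII input; non-ASCII letters would be replaced by spaces.
clean_tweet <- function(x) {
  x <- gsub("http\\S+", " ", x, perl = TRUE)                  # URLs, while "://" is intact
  x <- gsub("\\b(RT|via)\\s+(?=@)", "", x, perl = TRUE)       # retweet markers before a mention
  x <- gsub("@\\w+", " ", x, perl = TRUE)                     # mentions
  x <- gsub("[^[:alpha:][:space:]]", " ", x, perl = TRUE)     # punctuation and digits
  x <- gsub("\\s+", " ", x, perl = TRUE)                      # collapse whitespace
  trimws(x)
}

# Invented sample tweet for illustration.
sample_tweet <- "RT @user1: Trump & Putin meet!! https://t.co/abc123 #news 2017"
cleaned <- clean_tweet(sample_tweet)
print(cleaned)
# prints "Trump Putin meet news"
```

Replacing with a space instead of an empty string keeps neighbouring words apart, and the final whitespace collapse tidies up the result.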

We can also get the sentiment over a range of time.
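One way to restrict the sentiment to a time window is to filter on the parsed date before counting. This is only a sketch on invented data: the toy data frame stands in for the tweets table (with date, positive and negative columns), and the window limits start and end are arbitrary.

```r
# Toy stand-in for the tweets table; dates and scores are invented.
tweets <- data.frame(
  date = as.POSIXct(c("2017-05-29 17:21:58", "2017-05-29 17:22:10",
                      "2017-05-29 17:25:40", "2017-05-29 17:31:05"), tz = "GMT"),
  positive = c(1, 0, 0, 2),
  negative = c(0, 1, 0, 1)
)

# Same labelling rule as in the analysis: positive wins, then negative, else neutral.
label_sentiment <- function(positive, negative) {
  ifelse(positive > 0, "positive", ifelse(negative > 0, "negative", "neutral"))
}
tweets$tweet <- label_sentiment(tweets$positive, tweets$negative)

# Keep only tweets inside the half-open window [start, end).
start <- as.POSIXct("2017-05-29 17:22:00", tz = "GMT")
end   <- as.POSIXct("2017-05-29 17:30:00", tz = "GMT")
in_window <- tweets[tweets$date >= start & tweets$date < end, ]

window_counts <- table(in_window$tweet)
print(window_counts)
```

Using the same time zone for the window limits as for the parsed dates avoids off-by-hours surprises.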
What sources do people use, and when do they tweet?
Figure: Source
Figure: Volume of tweets by minute
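The code behind these two charts is not shown, so the following is only a sketch of how such counts could be computed. It assumes that the parsed data frame has a source column holding an HTML link whose text is the client name, and a date column already converted to POSIXct. The toy data is invented.

```r
# Toy stand-in for the parsed tweets; the values are invented and the HTML
# format of the source column is an assumption.
df <- data.frame(
  source = c(
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
  ),
  date = as.POSIXct(c("2017-05-29 17:21:58", "2017-05-29 17:21:59",
                      "2017-05-29 17:22:03"), tz = "GMT"),
  stringsAsFactors = FALSE
)

# Strip the HTML tags to keep only the client name, then count tweets per client.
df$client <- gsub("<[^>]+>", "", df$source)
by_source <- sort(table(df$client), decreasing = TRUE)
print(by_source)

# Truncate each timestamp to the minute and count tweets per minute.
df$minute <- format(df$date, "%Y-%m-%d %H:%M")
by_minute <- table(df$minute)
print(by_minute)
```

Either table can be turned into a data frame with as.data.frame and passed to ggplot to draw a bar chart or a line chart.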
R is a powerful data mining tool. It lets us approach case studies from different statistical angles, extract the relevant information, clean out unnecessary records and characters, and prepare the data for study.
