3. Methods and Experimental Design
Our approach is to analyse the sentiments using machine learning classifiers and feature extractors. The machine learning classifiers are Naive Bayes, Maximum Entropy and Support Vector Machines (SVM). The feature extractors are unigrams and unigrams with weighted positive and negative keywords. We build a framework that treats classifiers and feature extractors as two distinct components. This framework allows us to easily try out different combinations of classifiers and feature extractors.
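As a minimal sketch of such a framework (not the authors' implementation), the Python below keeps feature extractors and classifiers behind two small interfaces and evaluates every combination. The keyword lists, the keyword weight of 2.0, the toy tweets and all function names are assumptions made for illustration. Only Naive Bayes is implemented here; a Maximum Entropy or SVM classifier exposing the same fit/predict methods could be added to the same registry.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Hypothetical keyword lists and weight, for illustration only.
POSITIVE = {"love", "great"}
NEGATIVE = {"hate", "awful"}

def unigrams(text):
    """Feature extractor: every token gets the same weight."""
    return {tok: 1.0 for tok in text.lower().split()}

def weighted_unigrams(text, keyword_weight=2.0):
    """Feature extractor: unigrams, with sentiment keywords up-weighted."""
    return {
        tok: keyword_weight if tok in POSITIVE | NEGATIVE else 1.0
        for tok in text.lower().split()
    }

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over weighted features."""

    def fit(self, feature_dicts, labels):
        self.class_counts = Counter(labels)
        self.totals = defaultdict(Counter)  # label -> feature -> summed weight
        self.vocab = set()
        for feats, label in zip(feature_dicts, labels):
            self.totals[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            denom = sum(self.totals[label].values()) + len(self.vocab)
            score = math.log(count / n_docs)
            for tok, weight in feats.items():
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += weight * math.log((self.totals[label][tok] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

def evaluate(extractor, classifier_factory, train, test):
    """Train one extractor/classifier pair and return its test accuracy."""
    clf = classifier_factory().fit(
        [extractor(text) for text, _ in train], [label for _, label in train]
    )
    hits = sum(clf.predict(extractor(text)) == label for text, label in test)
    return hits / len(test)

# Registries: the two components are chosen independently of each other.
EXTRACTORS = {"unigrams": unigrams, "weighted_unigrams": weighted_unigrams}
CLASSIFIERS = {"naive_bayes": NaiveBayes}

# Toy data, invented for this sketch.
train = [
    ("i love this phone", "pos"),
    ("great day today", "pos"),
    ("i hate mondays", "neg"),
    ("awful service again", "neg"),
]
test = [("love this great day", "pos"), ("hate this awful service", "neg")]

results = {
    (e, c): evaluate(EXTRACTORS[e], CLASSIFIERS[c], train, test)
    for e, c in product(EXTRACTORS, CLASSIFIERS)
}
for combo, accuracy in results.items():
    print(combo, accuracy)
```

Adding an extractor or a classifier then means adding one entry to a registry, without touching the other component.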
3.1 Emoticons
Since the training process makes use of emoticons as noisy labels, it is crucial to discuss the role they play in classification. We stripped the emoticons out of the training data. If we leave
Repeated letters. Tweets contain very casual language. For example, if you search “hungry” with an arbitrary number of u's in the middle (e.g. huuuungry, huuuuuuungry, huuuuuuuuuungry) on Twitter, there will most likely be a nonempty result set. We use preprocessing so that any letter occurring more than two times in a row is replaced with two occurrences. In the samples above, these words would all be converted into the token “huungry”.
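The repeated-letter rule can be sketched with a single regular expression. This is an illustrative sketch rather than the authors' code; the function name is hypothetical and the pattern assumes ASCII letters only. Under the rule as stated, all three sample spellings collapse to the same token with two u's, "huungry".

```python
import re

# A letter followed by two or more copies of itself (three or more in a row).
_REPEATED_LETTER = re.compile(r"([A-Za-z])\1{2,}")

def collapse_repeats(text):
    """Replace any run of three or more identical letters with two."""
    return _REPEATED_LETTER.sub(r"\1\1", text)

for word in ["huuuungry", "huuuuuuungry", "huuuuuuuuuungry", "hungry", "good"]:
    print(word, "->", collapse_repeats(word))
```

Words that already contain a legitimate double letter, such as "good", are left unchanged, which is why the rule keeps two occurrences rather than one.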
Table 2 shows the effect of these feature reductions. These reductions shrink the feature set down to 8.74% of its original size.

Table 2. Effect of Feature Reduction
Feature Reduction Steps | # of Features | Percentage of Original
None | 277354 | 100.00%
URL / Username / Repeated Letters | 102354 | 36.90%
Stop Words ("a", "is", "the") | 24309 | 8.74%
Final | 24309 | 8.74%
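To illustrate why these steps shrink the vocabulary, the sketch below applies them to two toy tweets and counts distinct unigrams before and after. It is a rough sketch under stated assumptions: the placeholder tokens URL and USERNAME, the regular expressions and the example tweets are all hypothetical, and the stop-word list contains only the three examples named in Table 2.

```python
import re

STOP_WORDS = {"a", "is", "the"}  # only the examples named in Table 2

def reduce_tweet(text):
    """Return the tokens of a tweet after the feature reduction steps."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "URL", text)  # assumed placeholder token
    text = re.sub(r"@\w+", "USERNAME", text)               # assumed placeholder token
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)         # repeated letters
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def vocabulary(tweets, reduced):
    """Collect the distinct unigrams, with or without feature reduction."""
    vocab = set()
    for tweet in tweets:
        vocab.update(reduce_tweet(tweet) if reduced else tweet.lower().split())
    return vocab

# Toy tweets, invented for this sketch.
tweets = [
    "@alice the food is sooooo good http://example.com/a",
    "@bob the food is soooo good http://example.com/b",
]
before = vocabulary(tweets, reduced=False)
after = vocabulary(tweets, reduced=True)
print(len(before), len(after), f"{100 * len(after) / len(before):.2f}%")
```

Distinct usernames, URLs and spelling variants each map to a single token, so they stop contributing separate features.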
3.3 Feature Vector
After preprocessing the training set, which consists of 9666 positive tweets, 9666 negative tweets and 2271 neutral tweets, we compute the feature vector as follows:
Unigrams. As shown in Table 2, preprocessing leaves 24309 features, all of which are unigrams, and each feature has equal weight.
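One way to picture an equal-weight unigram feature vector is as a presence vector over the vocabulary, as in the sketch below. Encoding "equal weight" as 0/1 presence is an assumption of this sketch, and the function names and toy tokens are hypothetical.

```python
def build_index(tokenized_tweets):
    """Map each distinct unigram to a fixed position in the feature vector."""
    vocab = sorted({tok for tokens in tokenized_tweets for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def to_feature_vector(tokens, index):
    """Mark each known unigram with the same weight (1); unknown tokens are skipped."""
    vector = [0] * len(index)
    for tok in tokens:
        if tok in index:
            vector[index[tok]] = 1
    return vector

# Toy preprocessed tweets, invented for this sketch.
training = [["food", "soo", "good"], ["bad", "food"]]
index = build_index(training)
print(index)
print(to_feature_vector(["good", "food", "unseen"], index))
```

With the real training set, the index would have 24309 entries, one per unigram remaining after preprocessing.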
4. Results
The unigram feature vector is the simplest way to retrieve features from a tweet. The machine learning algorithms achieve only average performance with this feature vector. One reason for the average performance might be the smaller training
Proclaimed as the hottest company since Google and Facebook, Twitter introduced a revolutionary micro-blogging service in 2006 that allowed users to share short messages of 140 characters (“tweets”) with friends and strangers who subscribe to follow their communication flow (so-called “followers”) in order to find out what is happening right now anywhere on the globe.
Social media is one of the most viral channels for spreading news and information about anything in the world. If the company uses better strategies and makes a few investments in advertising itself on social media, it will have a broad way to grab customers’ attention. Using the data analysis tools available on the market, the company can also perform social media analysis to identify what kinds of products, yarn or fabric are grabbing
Posts tweeted on the platform can be predicted using machine learning techniques. In this context, such techniques predict a tweet from its content, the tweeter and, especially, what has been retweeted. These factors are instrumental in developing a detailed, well-analyzed strategy for acquiring information through Twitter. Notably, the popularity of a user does not depend on the number of followers or the count of tweets. Instead, the count of retweets and the number of users who took part in retweeting indicate popularity and how quickly information will propagate through the network. The factor that limits the propagation of information on Twitter is the character limit, which is only 140. As such, there is a need for a predefined terse message that will enhance the spread of information. There is also a need to authenticate information on Twitter so as to hamper rumors and
A statistical validation process is performed to check whether tweets unrelated to movies are included in the results of the parsing process. The validation takes place in step 1.2 of the methodology. Sample tweets are selected with a margin of error of 5% and a confidence level of 95% to determine the proper sample size for the population (the raw data). In the validation, 89.22 percent of the tweets are confirmed as related to movies. The entire results of the validation process in the case study are illustrated in Table
As in our study, LDA topics have improved the accuracy of finding keywords for different topics. In this work we examine the social aspects of food tweeting behavior, and provide some support for social affinity that is not local in a geographic sense. There have been several recent studies that probe the viability of public health surveillance by measuring relevant textual signals in social media. Prier, K. W., Smith, M. S., Giraud-Carrier, C. and Hanson, C. L. [5] examine all words people use in online reviews, and draw insights on correlating terms and concepts that may not seem immediately relevant to the hygiene status of restaurants. The work draws on the rich body of research that studies online reviews for sentiment analysis, based on a few research papers.
Within our society, the internet has become the norm and is always at our fingertips. For presenting your ideas or sharing your thoughts on things around you, Twitter is the go-to app. When you post something on Twitter to announce it to your followers and friends, you are tweeting. Tweeting lets you connect, express your feelings and thoughts, release information, and much more. We are so attached to our phones and Twitter that we are more invested in others’ lives than our own. And tweeting usually leads to complaining, making us think our lives are miserable and sad. Since we are so invested in Twitter and social media, we are willing to give out information about ourselves to complete strangers, which can sometimes be a bit too much. However, Twitter has a character limit of 140. To what extent does tweeting, which consists of only 140 characters, affect how we communicate and behave? Socializing on Twitter allows individuals to be whoever they want to be. Based on their online identity, these individuals are able to express their feelings and reveal certain things about themselves while leaving out others. Being online, behind a screen, allows us to create a new identity, and what we say or tweet is usually tailored to match what our audience wants to hear or would agree with. The limit of 140 characters has shifted writing from grammatical sentences to slang and abbreviations. To shorten what we want to convey to our audience, we use abbreviations so that there are fewer characters. This sometimes changes the meaning and gives off a more unfriendly tone. To communicate with others, we also tend to use slang, which makes us sound cool and trendy. Twitter, as an outlet for all these changes, has affected our communication online and offline.
Since social media is a big part of young people’s lives, how we communicate online with our friends and our audience carries over into the real world as well. For example, the slang for “going to” is “gonna,” and without realizing it, that is what we say to each other and sometimes write in our English essays as well. Whatever
According to Wikipedia, social media mining is the process of representing, analyzing and extracting actionable patterns from social media data. The extensive use of social media such as Facebook, Twitter, Google Plus, Instagram and LinkedIn has been generating massive amounts of user-generated data. The world’s social networks contain enormous amounts of customer detail that help in understanding human behavior and conducting social science research. To mine social data successfully, social network analysis is the most important task. A number of visualization tools are available to analyze these social networks. Many corporations have even set the goal of tapping into social media data in order to develop their business. Through social media, it is easy to find a lot of information about a celebrity and her whereabouts, but finding real-time information such as energy consumption is very difficult. Social media mining also faces the dilemma of finding useful information when there is little or no data about those we are interested in knowing more about. In this paper, we discuss how social media is mined, how analysis is performed on social networks and ways to overcome hurdles in social media mining.
Yang Peng, Melody Moh and Teng-Sheng Moh, in “Efficient Adverse Drug Event Extraction using Twitter Sentiment Analysis”, proposed a simple, efficient pipeline for retrieving adverse drug events (ADEs). Any selected drug should have been on the market for more than ten years. Following this rule, a sufficient number of tweets exists for any selected drug. Drug-related classification is done on preprocessed data. Sentiment Analysis. 5 times
The current technological age of social media has led to various problems in writing and receiving emails and texts. The biggest problem is failing to get any part of the message from a text or email; understanding the message is the greatest difficulty. This can be attributed to receiving incomprehensible and poorly arranged words and messages. The use of slang in writing and receiving texts is a menace. Slang terms such as SMH (shaking my head), among others, make communication informal and understood only by a certain group of people (Heather & Graves, 2012).
For the smallest set, containing only ten samples, 19 of the 23 possible feature selection algorithms completed processing (four feature selection algorithms could not be completed due to the 10-fold cross-validation used). For those 19 feature selection algorithms, 585 classification models were generated (a few of the ARFF files were empty for the lower feature thresholds due to the small number of samples). The 50-sample dataset completed 20 of the 23 possible feature selection algorithms, thereby generating 665 classification models. When using 100 samples, 20 of the 23 possible feature selection algorithms were completed and subsequently used to generate 665 classification models. The 200-sample dataset provided 20 of the 23 possible
Feature selection is a method for reducing the number of dimensions of a dataset by removing irrelevant and redundant attributes. Given a set of attributes F and a target class C, the goal of feature selection is to find a minimum subset of F that yields the highest accuracy (for C) in the classification task. Although
4) Then, the accuracy of the redesigned model in identifying frailty was calculated. 5) This procedure was repeated until the last feature had been removed and the model redesigned. 6) All accuracy values were compared. 7) If the accuracy value associated with an excluded feature was the lowest, that feature was eliminated. 8) After that, steps 1) to 7) were repeated without the eliminated feature until only one feature remained. 9) Finally, the number of features was selected based on the performance of the models evaluated during recursive feature elimination, in order to select the best model for identifying frailty.
Probabilistic topic modeling provides computational methods for analyzing large text data. Today, streaming text mining plays an important role in real-time social media mining. The Latent Dirichlet Allocation (LDA) model was developed a decade ago to aid discovery of the hidden thematic structure in large archives of documents. It is acknowledged by many researchers as the most popular approach for building topic models. In this study, we discuss topic modeling and, more specifically, LDA. We identify speed as one of the major limitations of applying LDA to streaming big text data analytics. The main aim of this study is to enhance the inference speed of LDA and thereby develop a new inference method and algorithm. Given the characteristics of this specific research problem, the proposed research will follow the experimental model. We will investigate causal relationships using a test
An enterprise may analyze sentiment about products, services, competitors and reputation. On Twitter, people post real-time messages about their opinions on a variety of topics and express sentiments about products they use in daily life.
In this paper we have presented a comparative study of the most commonly used algorithms for sentiment analysis. Classification is a vital task in any system that performs sentiment analysis. We present a study of four algorithms: 1. Naïve Bayes, 2. Max Entropy, 3. Boosted Trees and 4. Random Forest. We showcase the basic theory behind the algorithms, when they are generally used and their pros and cons. These algorithms were selected because of their extensive use in various sentiment analysis tasks. Sentiment analysis of reviews is a very common application, the