Essay On Machine Learning Classifiers And Feature Extractors

3. Methods and Experimental design
Our approach is to analyse sentiment using machine learning classifiers and feature extractors. The classifiers are Naive Bayes, Maximum Entropy and Support Vector Machines (SVM). The feature extractors are unigrams and unigrams with weighted positive and negative keywords. We build a framework that treats classifiers and feature extractors as two distinct components, which makes it easy to try out different combinations of the two.
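The separation of classifiers from feature extractors can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the framework used in this work: the keyword sets, the `boost` weight, the whitespace tokenizer and the names `unigrams`, `weighted_unigrams`, `NaiveBayes` and `run` are all hypothetical, and a small add-one-smoothed Naive Bayes stands in for any of the three classifiers.

```python
import math
from collections import Counter, defaultdict

# Hypothetical keyword lists, for illustration only.
POSITIVE = {"good", "love", "great"}
NEGATIVE = {"bad", "hate", "awful"}


def unigrams(tokens):
    """Feature extractor: every distinct token is a feature with weight 1."""
    return {tok: 1 for tok in tokens}


def weighted_unigrams(tokens, boost=2):
    """Feature extractor: unigrams, with sentiment keywords given an
    assumed larger weight (`boost`)."""
    return {
        tok: boost if tok in POSITIVE or tok in NEGATIVE else 1
        for tok in tokens
    }


class NaiveBayes:
    """Classifier: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, feature_dicts, labels):
        self.label_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(feature_dicts, labels):
            for name, weight in feats.items():
                self.feature_counts[label][name] += weight
                self.vocab.add(name)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for name, weight in feats.items():
                if name in self.vocab:  # ignore features never seen in training
                    smoothed = self.feature_counts[label][name] + 1
                    score += weight * math.log(smoothed / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def run(extractor, classifier, train, tweets):
    """Combine any extractor with any classifier exposing fit/predict."""
    X = [extractor(text.lower().split()) for text, _ in train]
    y = [label for _, label in train]
    classifier.fit(X, y)
    return [classifier.predict(extractor(t.lower().split())) for t in tweets]
```

Because `run` only relies on the extractor being a function from tokens to a feature dictionary and the classifier offering `fit` and `predict`, swapping either component is a one-argument change, e.g. `run(weighted_unigrams, NaiveBayes(), train, tweets)`.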

3.1 Emoticons

Since the training process uses emoticons as noisy labels, it is crucial to discuss the role they play in classification. We stripped the emoticons out of the training data. If we leave …

Repeated letters: Tweets contain very casual language. For example, if you search Twitter for “hungry” with an arbitrary number of u's in the middle (e.g. huuuungry, huuuuuuungry, huuuuuuuuuungry), there will most likely be a nonempty result set. We preprocess the text so that any letter occurring more than two times in a row is replaced with two occurrences. In the samples above, each of these words would be converted into the token “huungry”.
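The repeated-letter rule can be expressed as a single regular expression. The sketch below is one possible implementation, not necessarily the one used here; the function name `squeeze_repeats` is hypothetical, and restricting the rule to ASCII letters is an assumption.

```python
import re

# A letter followed by two or more copies of itself (three or more in a row).
_REPEATS = re.compile(r"([A-Za-z])\1{2,}")


def squeeze_repeats(text):
    """Replace any run of three or more identical letters with exactly two."""
    return _REPEATS.sub(r"\1\1", text)


print(squeeze_repeats("huuuungry"))        # huungry
print(squeeze_repeats("huuuuuuuuuungry"))  # huungry
print(squeeze_repeats("hungry"))           # hungry
```

Note that the elongated forms collapse to a single shared token with two u's, which is kept distinct from the correctly spelled word.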
Table 2 shows the effect of these feature reductions. These reductions shrink the feature set down to 8.74% of its original size.

Table 2. Effect of Feature Reduction

Feature Reduction Steps              # of Features   Percentage of Original
None                                 277354          100.00%
URL / Username / Repeated Letters    102354          36.90%
Stop Words ('a', 'is', 'the')        24309           8.74%
Final                                24309           8.74%

3.3 Feature Vector
After preprocessing the training set, which consists of 9666 positive tweets, 9666 negative tweets and 2271 neutral tweets, we compute the feature vector as follows:
Unigrams: As shown in Table 2, preprocessing leaves 24309 features, all of which are unigrams, and every feature has equal weight.
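As a minimal sketch of an equally weighted unigram feature vector, the code below maps each vocabulary token to one position and sets that position to 1 when the token occurs in a tweet. Treating "equal weight" as binary presence (rather than a count) is an assumption, and the names `build_vocabulary` and `to_vector` are hypothetical.

```python
def build_vocabulary(tokenized_tweets):
    """Assign each distinct token in the training set a fixed vector index."""
    vocab = sorted({tok for tweet in tokenized_tweets for tok in tweet})
    return {tok: i for i, tok in enumerate(vocab)}


def to_vector(tokens, index):
    """Binary unigram vector: 1 if the token occurs, 0 otherwise.

    Tokens that are not in the training vocabulary are ignored.
    """
    vec = [0] * len(index)
    for tok in tokens:
        if tok in index:
            vec[index[tok]] = 1  # same weight for every feature
    return vec


training = [["so", "huungry"], ["so", "happy"]]
index = build_vocabulary(training)
print(index)                                # {'happy': 0, 'huungry': 1, 'so': 2}
print(to_vector(["so", "so", "sad"], index))  # [0, 0, 1]
```

With the full training set, the vocabulary would hold the 24309 unigrams left after preprocessing, so each tweet becomes a sparse vector of that length.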

4. Results
The unigram feature vector is the simplest way to retrieve features from a tweet. The machine learning algorithms achieve only average performance with this feature vector. One of the reasons for the average performance might be the smaller training …
