Sentiment Analysis of IMDB Movie Reviews

Problem Statement: In this assignment, we have to classify IMDB movie reviews as positive or negative based on their sentiment, using different classification models.

Import necessary libraries

In [1]:
#Load the libraries
import numpy as np
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
from bs4 import BeautifulSoup
import re
from nltk.tokenize.toktok import ToktokTokenizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import os
print(os.listdir("../input"))
import warnings
warnings.filterwarnings('ignore')

['IMDB Dataset.csv']

Reading the dataset

In [3]:
#importing the training data
imdb_data = pd.read_csv('../input/IMDB Dataset.csv')
imdb_data.head()

Out[3]:
   review                                              sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
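The import cell above pulls in NLTK's English stopword list and tokenizers. As a setup aside (an assumption about the environment, not a step shown in the notebook), these corpora need a one-time download if they are not already installed:

import nltk

# one-time downloads for the resources used below (illustrative setup step)
nltk.download('stopwords')   # needed for nltk.corpus.stopwords
nltk.download('punkt')       # needed for word_tokenize / sent_tokenize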
Exploratory data analysis

In [7]:
#Checking the data description
imdb_data.describe(include='all')

Out[7]:
        review                                              sentiment
count   50000                                               50000
unique  49582                                               2
top     Loved today's show!!! It was a variety and not...  positive
freq    5                                                   25000

In [6]:
# checking the info of the dataset
imdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
review       50000 non-null object
sentiment    50000 non-null object
dtypes: object(2)
memory usage: 781.3+ KB

Sentiment count

In [8]:
#checking the distribution of labels in our target column
imdb_data['sentiment'].value_counts()

Out[8]:
positive    25000
negative    25000
Name: sentiment, dtype: int64

From the above results, we can make the following observations:
1. There are 50,000 reviews in total in the dataset.
2. There are no null values present in the dataset.
3. The dataset is not biased: the "sentiment" column (the target feature) contains an equal number of positive and negative reviews.

Text Preprocessing

Performing text preprocessing to tokenize the reviews and clean the dataset before applying machine learning models to it.

Removing HTML strips and noisy text

In [12]:
#Tokenization of text
tokenizer = ToktokTokenizer()
#Setting English stopwords
stopword_list = nltk.corpus.stopwords.words('english')

In [13]:
#Removing the html strips
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

#Removing the square brackets
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

#Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text

#Apply function on review column
imdb_data['review'] = imdb_data['review'].apply(denoise_text)
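As a quick illustration (an editorial example, not part of the original notebook), the denoising step mainly drops the <br /> tags that appear in the raw reviews and strips any bracketed text, reusing denoise_text from the cell above:

# illustrative check of denoise_text defined above
sample = 'A wonderful little production. <br /><br />The acting is great. [spoiler removed]'
print(denoise_text(sample))
# A wonderful little production. The acting is great.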
Removing special characters

In [14]:
#Define function for removing special characters
def remove_special_characters(text, remove_digits=True):
    pattern = r'[^a-zA-Z0-9\s]'  # keep letters, digits and whitespace only
    text = re.sub(pattern, '', text)
    return text

#Apply function on review column
imdb_data['review'] = imdb_data['review'].apply(remove_special_characters)

Text stemming

In [15]:
#Stemming the text
def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

#Apply function on review column
imdb_data['review'] = imdb_data['review'].apply(simple_stemmer)

Removing stopwords

In [16]:
#set stopwords to english
stop = set(stopwords.words('english'))
print(stop)

#removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

#Apply function on review column
imdb_data['review'] = imdb_data['review'].apply(remove_stopwords)

{'down', 'you', 'their', 'these', 'no', "didn't", 'where', 'more', 'about', 'so', 'both', 'himself', 'who', 'some', "shan't", 'before', 'why', 'me', 'this', "hadn't", 'o', 'up', 'further', 'own', 'was', 'yours', 'which', 'those', 'the', 'weren', "you'll", 'it', 'by', "you're", 'but', 'few', 'wasn', 'is', 'were', 'shouldn', "needn't", "you'd", 'too', 're', 'be', "won't", "should've", 'only', 'than', 'y', 'aren', 'under', 'am', 's', 'if', 'shan', 'ours', 'should', 'because', 'didn', 'between', 'won', 'each', 'there', 'ain', 'for', 'wouldn', 'most', 'of', 'll', 'are', "weren't", 'in', 'did', 'to', 'had', "aren't", 'at', "haven't", 'itself', 't', 'any', 'on', 'above', "isn't", 'will', 'isn', 'they', 'with', 'now', 'until', 'over', 'into', "that'll", 'having', 'does', "it's", 'not', 'what', 'while', 'do', 'needn', 'other', 'whom', 'out', 'hadn', "doesn't", 'from', 'such', 'mightn', 'against', 'ourselves', "wasn't", 'hers', 'myself', "shouldn't", 'herself', 'his', 'an', 'during', 'ma', 'below', "don't", "you've", 'as', 'can', "couldn't", 'again', 'he', 'been', 'm', 'she', 'yourself', 'off', "wouldn't", "she's", 'my', 'then', 'how', 'nor', 'doesn', 've', 'a', 'its', 'after', 'or', 'hasn', 'that', 'your', 'and', 'don', 'we', 'them', 'once', 'being', 'doing', 'has', 'theirs', 'yourselves', "mightn't", 'couldn', 'i', 'same', "hasn't", 'him', 'very', 'haven', "mustn't", 'just', 'when', 'themselves', 'd', 'her', 'here', 'mustn', 'our', 'through', 'have', 'all'}
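Before looking at the normalized reviews, here is a small illustrative example (an editorial addition, reusing simple_stemmer and remove_stopwords from the cells above) showing why the cleaned text reads oddly: the Porter stemmer truncates words to crude stems, and the stopword filter then drops common function words.

# illustrative only: combined effect of stemming and stopword removal
sample = 'This was a wonderful little production and the acting was very good'
stemmed = simple_stemmer(sample)
print(stemmed)
# thi wa a wonder littl product and the act wa veri good
print(remove_stopwords(stemmed))
# thi wa wonder littl product act wa veri good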
Normalized train reviews

In [18]:
#normalized train reviews
norm_train_reviews = imdb_data.review[:40000]
# checking one of the rows
norm_train_reviews[1000]

Out[18]:
'thi movi aw cant even bother write review thi garbag say one bore film ive ever seen and act veri bad boy play main charact realli annoy got express hi face movi want slap basic 80 movi slow motion shot skateboard weird music utter shtappar ive got write least 10 line text submit thi comment ill use line say lead charact ha got one face want slapmeh give upthi movi suck'

Normalized test reviews

In [19]:
#Normalized test reviews
norm_test_reviews = imdb_data.review[40000:]
norm_test_reviews[46000]

Out[19]:
'surviv christma surprisingli funni movi especi consid bad public wa first releas ben affleck funni obnoxi millionair pay famili occupi hi childhood home hi famili christma drive famili crazi overindulg christma cheer ben affleck fan past though like daredevil paycheck well cast thi role also like christina appleg daughter famili cant stand affleck charact first sure see thi movi go dont care ignor critic say rent thi movi becaus funnier lot christma movi'

In [29]:
# setting the display width of the column to max
pd.set_option('display.max_colwidth', 1000)

In [30]:
norm_train_reviews.head()
Out[30]:
0    one review ha mention watch 1 Oz episod youll hook right thi exactli happen meth first thing struck Oz wa brutal unflinch scene violenc set right word GO trust thi show faint heart timid thi show pull punch regard drug sex violenc hardcor classic use wordit call OZ nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda Em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show wouldnt dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doesnt mess around first episod ever saw struck nasti wa surreal couldnt say wa readi watch develop tast Oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away well manner middl class inmat turn prison bitch due lack street skill prison ...
1    wonder littl product film techniqu veri unassum veri oldtimebbc fashion give comfort sometim discomfort sens realism entir piec actor extrem well chosen michael sheen onli ha got polari ha voic pat truli see seamless edit guid refer william diari entri onli well worth watch terrificli written perform piec master product one great master comedi hi life realism realli come home littl thing fantasi guard rather use tradit dream techniqu remain solid disappear play knowledg sens particularli scene concern orton halliwel set particularli flat halliwel mural decor everi surfac terribl well done
2    thought thi wa wonder way spend time hot summer weekend sit air condit theater watch lightheart comedi plot simplist dialogu witti charact likabl even well bread suspect serial killer may disappoint realiz thi match point 2 risk addict thought wa proof woodi allen still fulli control style mani us grown lovethi wa Id laugh one woodi comedi year dare say decad ive never impress scarlet johanson thi manag tone sexi imag jump right averag spirit young womanthi may crown jewel hi career wa wittier devil wear prada interest superman great comedi go see friend
3    basic famili littl boy jake think zombi hi closet hi parent fight timethi movi slower soap opera suddenli jake decid becom rambo kill zombieok first go make film must decid thriller drama drama movi watchabl parent divorc argu like real life jake hi closet total ruin film expect see boogeyman similar movi instead watch drama meaningless thriller spots3 10 well play parent descent dialog shot jake ignor
4    petter mattei love time money visual stun film watch Mr mattei offer us vivid portrait human relat thi movi seem tell us money power success peopl differ situat encount thi variat arthur schnitzler play theme director transfer action present time new york differ charact meet connect one connect one way anoth next person one seem know previou point contact stylishli film ha sophist luxuri look taken see peopl live world live habitatth onli thing one get soul pictur differ stage loneli one inhabit big citi exactli best place human relat find sincer fulfil one discern case peopl encounterth act good Mr mattei direct steve buscemi rosario dawson carol kane michael imperioli adrian grenier rest talent cast make charact come alivew wish Mr mattei good luck await anxious hi next work
Name: review, dtype: object
Bag of words model

It is used to convert the text documents into numerical vectors of term counts (a bag of words).

In [31]:
#Count vectorizer for bag of words
cv = CountVectorizer(min_df=0, max_df=1, binary=False, ngram_range=(1, 3))
#transformed train reviews
cv_train_reviews = cv.fit_transform(norm_train_reviews)
#transformed test reviews
cv_test_reviews = cv.transform(norm_test_reviews)
print('BOW_cv_train:', cv_train_reviews.shape)
print('BOW_cv_test:', cv_test_reviews.shape)
#vocab=cv.get_feature_names() - to get feature names

BOW_cv_train: (40000, 6209089)
BOW_cv_test: (10000, 6209089)
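To make the representation concrete, here is a tiny illustrative example (an editorial addition, not from the notebook) of what CountVectorizer with ngram_range=(1, 3) produces on a two-document toy corpus:

from sklearn.feature_extraction.text import CountVectorizer

# toy corpus: each column of the matrix counts one unigram/bigram/trigram
docs = ['good movie', 'bad movie']
toy_cv = CountVectorizer(ngram_range=(1, 3))
X = toy_cv.fit_transform(docs)
print(toy_cv.get_feature_names())   # newer scikit-learn versions: get_feature_names_out()
# ['bad', 'bad movie', 'good', 'good movie', 'movie']
print(X.toarray())
# [[0 0 1 1 1]
#  [1 1 0 0 1]]

With 40,000 reviews and n-grams up to length 3, the same idea yields the very wide 6,209,089-column sparse matrix shown above.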
Term Frequency-Inverse Document Frequency model (TFIDF)

It is used to convert the text documents into a matrix of TF-IDF features.

In [32]:
#Tfidf vectorizer
tv = TfidfVectorizer(min_df=0, max_df=1, use_idf=True, ngram_range=(1, 3))
#transformed train reviews
tv_train_reviews = tv.fit_transform(norm_train_reviews)
#transformed test reviews
tv_test_reviews = tv.transform(norm_test_reviews)
print('Tfidf_train:', tv_train_reviews.shape)
print('Tfidf_test:', tv_test_reviews.shape)

Tfidf_train: (40000, 6209089)
Tfidf_test: (10000, 6209089)

Labeling the sentiment text

In [33]:
#labeling the sentiment data
lb = LabelBinarizer()
#transformed sentiment data
sentiment_data = lb.fit_transform(imdb_data['sentiment'])
print(sentiment_data.shape)

(50000, 1)

Split the sentiment data

In [34]:
#Splitting the sentiment data
train_sentiments = sentiment_data[:40000]
test_sentiments = sentiment_data[40000:]
print(train_sentiments)
print(test_sentiments)

[[1]
 [1]
 [1]
 ...
 [1]
 [0]
 [0]]
[[0]
 [0]
 [0]
 ...
 [0]
 [0]
 [0]]

Modelling the dataset

Let us build a logistic regression model for both the bag of words and TFIDF features.
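One small detail before fitting (an editorial aside): LabelBinarizer returns the labels as a column vector of shape (n, 1), whereas scikit-learn classifiers expect a 1-D array of shape (n,), which normally triggers a column-vector conversion warning; the notebook silences warnings instead. A minimal way to flatten the labels, if preferred:

# assumption: flattening avoids scikit-learn's column-vector warning;
# the notebook keeps the (n, 1) arrays and suppresses warnings instead
train_y = train_sentiments.ravel()   # shape (40000,)
test_y = test_sentiments.ravel()     # shape (10000,)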
In [35]:
#training the model
lr = LogisticRegression(penalty='l2', max_iter=500, C=1, random_state=42)
#Fitting the model for Bag of words
lr_bow = lr.fit(cv_train_reviews, train_sentiments)
print(lr_bow)
#Fitting the model for tfidf features
lr_tfidf = lr.fit(tv_train_reviews, train_sentiments)
print(lr_tfidf)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=42, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=42, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Logistic regression model performance on the test dataset

In [36]:
#Predicting the model for bag of words
lr_bow_predict = lr.predict(cv_test_reviews)
print(lr_bow_predict)
##Predicting the model for tfidf features
lr_tfidf_predict = lr.predict(tv_test_reviews)
print(lr_tfidf_predict)

[0 0 0 ... 0 1 1]
[0 0 0 ... 0 1 1]

Accuracy of the model

In [37]:
#Accuracy score for bag of words
lr_bow_score = accuracy_score(test_sentiments, lr_bow_predict)
print("lr_bow_score :", lr_bow_score)
#Accuracy score for tfidf features
lr_tfidf_score = accuracy_score(test_sentiments, lr_tfidf_predict)
print("lr_tfidf_score :", lr_tfidf_score)

lr_bow_score : 0.7512
lr_tfidf_score : 0.75

Printing the classification report to evaluate the performance of the models

In [38]:
#Classification report for bag of words
lr_bow_report = classification_report(test_sentiments, lr_bow_predict, target_names=['Positive', 'Negative'])
print(lr_bow_report)
#Classification report for tfidf features
lr_tfidf_report = classification_report(test_sentiments, lr_tfidf_predict, target_names=['Positive', 'Negative'])
print(lr_tfidf_report)

              precision    recall  f1-score   support

    Positive       0.75      0.75      0.75      4993
    Negative       0.75      0.75      0.75      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000

              precision    recall  f1-score   support

    Positive       0.74      0.77      0.75      4993
    Negative       0.76      0.73      0.75      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000
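A note on the training cell In [35] above: fit() retrains the estimator in place and returns the same object, so lr_bow and lr_tfidf both end up referring to whichever model was fitted last. A sketch that keeps one estimator per feature set, with the same hyperparameters (an editorial illustration, not the run recorded above):

# two independent estimators, same settings as In [35] (illustrative)
lr_bow_model = LogisticRegression(penalty='l2', max_iter=500, C=1, random_state=42)
lr_tfidf_model = LogisticRegression(penalty='l2', max_iter=500, C=1, random_state=42)

lr_bow_model.fit(cv_train_reviews, train_sentiments.ravel())
lr_tfidf_model.fit(tv_train_reviews, train_sentiments.ravel())

print('bow accuracy :', accuracy_score(test_sentiments, lr_bow_model.predict(cv_test_reviews)))
print('tfidf accuracy :', accuracy_score(test_sentiments, lr_tfidf_model.predict(tv_test_reviews)))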
Confusion matrix

In [39]:
#confusion matrix for bag of words
cm_bow = confusion_matrix(test_sentiments, lr_bow_predict, labels=[1, 0])
print(cm_bow)
#confusion matrix for tfidf features
cm_tfidf = confusion_matrix(test_sentiments, lr_tfidf_predict, labels=[1, 0])
print(cm_tfidf)

[[3768 1239]
 [1249 3744]]
[[3663 1344]
 [1156 3837]]

Stochastic gradient descent or Linear support vector machines for bag of words and tfidf features

In [40]:
#training the linear svm
svm = SGDClassifier(loss='hinge', max_iter=500, random_state=42)
#fitting the svm for bag of words
svm_bow = svm.fit(cv_train_reviews, train_sentiments)
print(svm_bow)
#fitting the svm for tfidf features
svm_tfidf = svm.fit(tv_train_reviews, train_sentiments)
print(svm_tfidf)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=500, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=42, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=500, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=42, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)
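SGDClassifier with loss='hinge' optimizes a linear-SVM objective by stochastic gradient descent. For comparison (an editorial aside, not part of the notebook), scikit-learn's batch linear SVM is LinearSVC, which could be trained on the same features:

from sklearn.svm import LinearSVC

# assumption: a batch linear SVM on the same bag-of-words features (illustrative)
lsvc = LinearSVC(C=1.0, max_iter=2000, random_state=42)
lsvc.fit(cv_train_reviews, train_sentiments.ravel())
print('LinearSVC bow accuracy :', accuracy_score(test_sentiments, lsvc.predict(cv_test_reviews)))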
Model performance on test data

In [41]:
#Predicting the model for bag of words
svm_bow_predict = svm.predict(cv_test_reviews)
print(svm_bow_predict)
#Predicting the model for tfidf features
svm_tfidf_predict = svm.predict(tv_test_reviews)
print(svm_tfidf_predict)

[1 1 0 ... 1 1 1]
[1 1 1 ... 1 1 1]

Accuracy of the model

In [42]:
#Accuracy score for bag of words
svm_bow_score = accuracy_score(test_sentiments, svm_bow_predict)
print("svm_bow_score :", svm_bow_score)
#Accuracy score for tfidf features
svm_tfidf_score = accuracy_score(test_sentiments, svm_tfidf_predict)
print("svm_tfidf_score :", svm_tfidf_score)

svm_bow_score : 0.5829
svm_tfidf_score : 0.5112

Print the classification report

In [43]:
#Classification report for bag of words
svm_bow_report = classification_report(test_sentiments, svm_bow_predict, target_names=['Positive', 'Negative'])
print(svm_bow_report)
#Classification report for tfidf features
svm_tfidf_report = classification_report(test_sentiments, svm_tfidf_predict, target_names=['Positive', 'Negative'])
print(svm_tfidf_report)

              precision    recall  f1-score   support

    Positive       0.94      0.18      0.30      4993
    Negative       0.55      0.99      0.70      5007

    accuracy                           0.58     10000
   macro avg       0.74      0.58      0.50     10000
weighted avg       0.74      0.58      0.50     10000

              precision    recall  f1-score   support

    Positive       1.00      0.02      0.04      4993
    Negative       0.51      1.00      0.67      5007

    accuracy                           0.51     10000
   macro avg       0.75      0.51      0.36     10000
weighted avg       0.75      0.51      0.36     10000

Plot the confusion matrix

In [44]:
#confusion matrix for bag of words
cm_bow = confusion_matrix(test_sentiments, svm_bow_predict, labels=[1, 0])
print(cm_bow)
#confusion matrix for tfidf features
cm_tfidf = confusion_matrix(test_sentiments, svm_tfidf_predict, labels=[1, 0])
print(cm_tfidf)

[[4948   59]
 [4112  881]]
[[5007    0]
 [4888  105]]
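The heading above says "plot", but the cell only prints the raw arrays. A minimal plotting sketch with matplotlib (an editorial illustration, assuming matplotlib is available; not part of the original notebook):

import matplotlib.pyplot as plt

def plot_cm(cm, title):
    # cm follows the labels=[1, 0] order used above; label 1 is 'positive'
    # because LabelBinarizer assigns classes alphabetically (negative=0, positive=1)
    fig, ax = plt.subplots()
    im = ax.imshow(cm, cmap='Blues')
    ax.set_xticks([0, 1]); ax.set_xticklabels(['positive', 'negative'])
    ax.set_yticks([0, 1]); ax.set_yticklabels(['positive', 'negative'])
    ax.set_xlabel('Predicted'); ax.set_ylabel('Actual')
    ax.set_title(title)
    for i in range(2):
        for j in range(2):
            ax.text(j, i, cm[i, j], ha='center', va='center')
    fig.colorbar(im)
    plt.show()

plot_cm(cm_bow, 'Linear SVM - bag of words')
plot_cm(cm_tfidf, 'Linear SVM - tfidf')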
Conclusion:

From the above results we can observe that the logistic regression model performs better than the linear support vector machine: the weighted F1-score of the logistic regression model is 0.75, while the weighted F1-score of the SVM model is 0.50. We could still improve the accuracy of the models with further preprocessing of the data and by using deep neural network models such as RNN, LSTM or GRU.
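As a pointer in that direction, a small LSTM classifier over the normalized reviews might look like the sketch below (an editorial illustration, assuming TensorFlow/Keras is available and reusing norm_train_reviews and train_sentiments from above; the hyperparameters are placeholders and nothing here was run in the notebook):

import tensorflow as tf
from tensorflow.keras import layers

# assumed vocabulary size and sequence length (placeholders)
max_words, max_len = 20000, 200

texts = norm_train_reviews.to_numpy()
labels = train_sentiments          # 0 = negative, 1 = positive (LabelBinarizer order)

# map each review string to a padded sequence of word indices
vectorize = layers.TextVectorization(max_tokens=max_words, output_sequence_length=max_len)
vectorize.adapt(texts)

model = tf.keras.Sequential([
    vectorize,
    layers.Embedding(max_words, 128),
    layers.LSTM(64),
    layers.Dense(1, activation='sigmoid'),   # probability that a review is positive
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(texts, labels, validation_split=0.1, epochs=3, batch_size=64)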