HW1 Solutions
CS 6320, Computer Science
University of Texas at Dallas
Q1 Language Modeling (20 Points)

Suppose we have a training corpus consisting of two sentences:

the cat sat in the hat on the mat
the dog sat on the log

Our fixed vocabulary is $V = \{\text{cat, dog, fish, hat, in, log, mat, on, sat, the}\}$.

Q1.1 Smoothing --- Discounting and Katz Backoff (5 Points)

If we train a bigram Katz backoff model on this corpus, using $\beta = 0.75$ and no end token, what is $p_{\text{katz}}(\text{sat} \mid \text{dog})$?

Given: $\beta = 0.75$. Find: $p_{\text{katz}}(\text{sat} \mid \text{dog})$.

Solution: Let $A(v) = \{w : c(v, w) > 0\}$ and $B(v) = \{w : c(v, w) = 0\}$. For $w \in A(v)$,

$$p_{\text{katz}}(w \mid v) = \frac{c^*(v, w)}{c(v)}, \qquad \text{where } c^*(v, w) = c(v, w) - \beta.$$

Here $c(\text{dog}, \text{sat}) = 1 > 0$, so $\text{sat} \in A(\text{dog})$, and

$$c^*(\text{dog}, \text{sat}) = c(\text{dog}, \text{sat}) - \beta = 1 - 0.75 = 0.25, \qquad c(\text{dog}) = 1,$$

$$p_{\text{katz}}(\text{sat} \mid \text{dog}) = \frac{c^*(\text{dog}, \text{sat})}{c(\text{dog})} = \frac{0.25}{1} = \frac{1}{4}.$$

What is $p_{\text{katz}}(\text{sat} \mid \text{fish})$? Note that "fish," despite not appearing in the training set, is part of the vocabulary $V$. Show your work.

Solution: For $w \in B(v)$,

$$p_{\text{katz}}(w \mid v) = \alpha(v) \cdot \frac{p_{\text{MLE}}(w)}{\sum_{w' \in B(v)} p_{\text{MLE}}(w')}, \qquad \text{where } \alpha(v) = 1 - \sum_{w \in A(v)} \frac{c^*(v, w)}{c(v)}.$$

Since "fish" does not occur in the training corpus but does belong to the vocabulary $V$, we have $c(\text{fish}, w) = 0$ for every $w$, hence $A(\text{fish}) = \emptyset$ and $B(\text{fish}) = V$. Therefore

$$\alpha(\text{fish}) = 1 - \sum_{w \in A(\text{fish})} \frac{c^*(\text{fish}, w)}{c(\text{fish})} = 1 - 0 = 1.$$

With $N = 15$ total tokens in the training corpus and $c(\text{sat}) = 2$,

$$p_{\text{MLE}}(\text{sat}) = \frac{c(\text{sat})}{N} = \frac{2}{15},$$

and because $B(\text{fish}) = V$, the normalizer $\sum_{w' \in B(\text{fish})} p_{\text{MLE}}(w') = 1$. Therefore

$$p_{\text{katz}}(\text{sat} \mid \text{fish}) = 1 \cdot \frac{2/15}{1} = \frac{2}{15}.$$
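Below is a minimal Python sketch of the Katz backoff computation above. It is illustrative only: the function and variable names (`p_katz`, `p_mle`, and so on) are our own, and it assumes the absolute-discounting formulation used in the solution.

```python
from collections import Counter

# Illustrative sketch of the Katz backoff bigram model from Q1.1.
# Names and structure are our own, not from the assignment.
corpus = [
    "the cat sat in the hat on the mat".split(),
    "the dog sat on the log".split(),
]
V = {"cat", "dog", "fish", "hat", "in", "log", "mat", "on", "sat", "the"}
beta = 0.75

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
N = sum(unigram.values())  # 15 tokens; no end token

def p_mle(w):
    return unigram[w] / N

def p_katz(w, v):
    A = {u for u in V if bigram[(v, u)] > 0}  # seen continuations of v
    if w in A:
        # discounted bigram estimate for seen pairs
        return (bigram[(v, w)] - beta) / unigram[v]
    # leftover mass alpha(v), redistributed over unseen continuations B(v)
    alpha = 1.0 - sum((bigram[(v, u)] - beta) / unigram[v] for u in A)
    b_mass = sum(p_mle(u) for u in V if u not in A)
    return alpha * p_mle(w) / b_mass

print(p_katz("sat", "dog"))   # 0.25 = 1/4
print(p_katz("sat", "fish"))  # 0.1333... = 2/15
```

Note that when the history is an unseen word such as "fish," the set $A(v)$ is empty, so the code never divides by $c(\text{fish}) = 0$ and the full mass $\alpha = 1$ falls through to the unigram distribution, matching the hand computation.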
Q1.2 Smoothing --- Linear Interpolation (5 Points)

If we use linear interpolation between a bigram model and a unigram model, using $\lambda_1 = \lambda_2 = 0.5$ and no end token, what is $p_{\text{inter}}(\text{dog} \mid \text{the})$?

Given: $\lambda_1 = \frac{1}{2}$ and $\lambda_2 = \frac{1}{2}$. Find: $p_{\text{inter}}(\text{dog} \mid \text{the})$.

Solution: The interpolated model is

$$p_{\text{inter}}(w_i \mid w_{i-1}) = \lambda_1 \, p_{\text{MLE}}(w_i \mid w_{i-1}) + \lambda_2 \, p_{\text{MLE}}(w_i),$$

so

$$p_{\text{inter}}(\text{dog} \mid \text{the}) = \lambda_1 \, p_{\text{MLE}}(\text{dog} \mid \text{the}) + \lambda_2 \, p_{\text{MLE}}(\text{dog}).$$

With $N = 15$ total tokens in the training corpus:

$$p_{\text{MLE}}(\text{dog} \mid \text{the}) = \frac{c(\text{the}, \text{dog})}{c(\text{the})} = \frac{1}{5}, \qquad p_{\text{MLE}}(\text{dog}) = \frac{c(\text{dog})}{N} = \frac{1}{15}.$$

So

$$p_{\text{inter}}(\text{dog} \mid \text{the}) = \frac{1}{2} \cdot \frac{1}{5} + \frac{1}{2} \cdot \frac{1}{15} = \frac{1}{10} + \frac{1}{30} = \frac{2}{15}.$$

What is $p_{\text{inter}}(\text{dog} \mid \text{log})$? Show your work.

$$p_{\text{MLE}}(\text{dog} \mid \text{log}) = \frac{c(\text{log}, \text{dog})}{c(\text{log})} = \frac{0}{1} = 0, \qquad p_{\text{MLE}}(\text{dog}) = \frac{c(\text{dog})}{N} = \frac{1}{15},$$

$$p_{\text{inter}}(\text{dog} \mid \text{log}) = \frac{1}{2} \cdot 0 + \frac{1}{2} \cdot \frac{1}{15} = \frac{1}{30}.$$
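A corresponding sketch for the interpolated model, reusing the `unigram`, `bigram`, and `p_mle` definitions from the Katz sketch above; the name `p_inter` is our own.

```python
# Illustrative sketch of the linearly interpolated model from Q1.2.
lam1, lam2 = 0.5, 0.5

def p_inter(w, v):
    # Bigram MLE term; falls back to 0 if the history v was never seen.
    p_bigram = bigram[(v, w)] / unigram[v] if unigram[v] > 0 else 0.0
    return lam1 * p_bigram + lam2 * p_mle(w)

print(p_inter("dog", "the"))  # 2/15 ≈ 0.1333
print(p_inter("dog", "log"))  # 1/30 ≈ 0.0333
```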
Q1.3 Perplexity (5 Points)

What is the maximum possible value that the perplexity score can take? What is the minimum possible value it can take? Explain your reasoning and give an example of a training corpus and two test corpora, one that achieves the maximum possible perplexity score and one that achieves the minimum possible perplexity score. (You can do this with a single short sentence for each corpus.)

Perplexity is $PP = 2^{H}$, where $H = -\frac{1}{N} \log_2 p(S)$ is the average negative log-probability the model assigns to the test corpus $S$ of $N$ tokens. The maximum possible value of the perplexity score is $\infty$, attained when $p(S) = 0$, i.e., the model assigns the test corpus zero probability. The minimum possible value is 1, attained when $p(S) = 1$, i.e., the model predicts every word with certainty.

Example: train an unsmoothed MLE model on the one-sentence corpus "the the the". The test corpus "the the" achieves the minimum: the model assigns $p(\text{the}) = 1$, so $p(S) = 1$ and $PP = 1$. The test corpus "the cat" achieves the maximum: "cat" never occurs in training, so $p(S) = 0$ and $PP = \infty$. The same effect appears at scale: a model trained on Shakespearean plays and tested on teenagers' text messages full of unseen slang and abbreviations will assign the test text (near-)zero probability, driving perplexity toward infinity, while a model tested on text it can predict word-for-word approaches the minimum of 1.
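To make the boundary cases concrete, here is a small sketch of corpus-level perplexity under the unigram MLE model, reusing `p_mle` from the first sketch; treat it as illustrative, not as the assignment's required implementation.

```python
import math

# Illustrative perplexity of a test corpus under the unigram MLE model.
def perplexity(test_tokens):
    log2_sum = 0.0
    for w in test_tokens:
        p = p_mle(w)
        if p == 0.0:
            return math.inf  # one unseen word drives perplexity to infinity
        log2_sum += math.log2(p)
    return 2 ** (-log2_sum / len(test_tokens))

print(perplexity("the cat sat".split()))   # finite
print(perplexity("the fish sat".split()))  # inf: "fish" is unseen in training
```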
Q1.4 Applications (5 Points)

Authorship identification is an important task in NLP. Can you think of a way to use language models to determine who wrote an unknown piece of text? Explain your idea and how it would work (you don't need to implement it). You must use language modeling to receive credit! Other approaches do not count.

A language model (LM) with a lower perplexity (PP) on a piece of text is better at anticipating that text's word order. Authorship identification can be built on exactly this idea. Such an engine would work as follows (a sketch appears after the list):

1. Data collection and pre-processing: gather a training corpus of known writings for each candidate author.
2. Model training: build one language model per author, e.g., an n-gram model trained on that author's corpus.
3. Author identification: score the text with unknown authorship under each author's previously trained LM and compute its perplexity. The LM with the lowest perplexity on this test corpus best predicts it, so its author is the most likely writer of the unknown piece.
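A hedged sketch of this pipeline, using unigram models with add-one smoothing so that no test word receives probability zero; the author corpora and all names here are invented for illustration.

```python
import math
from collections import Counter

# Illustrative authorship identification via per-author language models:
# attribute the unknown text to the author whose LM yields lowest perplexity.
def train_unigram(tokens, vocab):
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)  # add-one (Laplace) smoothing
    return {w: (counts[w] + 1) / total for w in vocab}

def perplexity_under(model, tokens):
    log2_sum = sum(math.log2(model[w]) for w in tokens)
    return 2 ** (-log2_sum / len(tokens))

author_corpora = {  # toy stand-ins for each author's collected writings
    "author_A": "the cat sat in the hat on the mat".split(),
    "author_B": "the dog sat on the log".split(),
}
vocab = {w for toks in author_corpora.values() for w in toks}
models = {a: train_unigram(toks, vocab) for a, toks in author_corpora.items()}

unknown = "the dog sat".split()
best = min(models, key=lambda a: perplexity_under(models[a], unknown))
print(best)  # the author whose LM is least "surprised" by the unknown text
```

In practice one would use higher-order n-gram (or neural) LMs and proper held-out evaluation, but the decision rule stays the same: lowest perplexity wins.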
Q2 Sentiment Analysis & Classification (15 Points)

Q2.1 Naive Bayes (10 Points)

We have a training corpus consisting of three sentences and their labels:

The cat sat in the hat, 1
The dog sat on the log, 1
The fish sat in the dish, 0

Suppose we train a Naive Bayes classifier on this corpus, using maximum likelihood estimation and unigram count features without any smoothing. What are the values of the parameters $p(c)$ and $p(f \mid c)$ for all classes $c$ and features $f$? You can simply list the parameters and their values; no need to show the arithmetic. You can skip parameters with value 0, and you can leave your answers as fractions.

Prior probabilities:

$$p(c = 0) = \frac{1}{3}, \qquad p(c = 1) = \frac{2}{3}$$

Conditional probabilities (class 1 contains 12 tokens, class 0 contains 6):

$$p(\text{the} \mid c = 1) = \frac{4}{12} = \frac{1}{3}, \qquad p(\text{sat} \mid c = 1) = \frac{2}{12} = \frac{1}{6},$$

$$p(\text{cat} \mid c = 1) = p(\text{in} \mid c = 1) = p(\text{hat} \mid c = 1) = p(\text{dog} \mid c = 1) = p(\text{on} \mid c = 1) = p(\text{log} \mid c = 1) = \frac{1}{12},$$

$$p(\text{the} \mid c = 0) = \frac{2}{6} = \frac{1}{3}, \qquad p(\text{fish} \mid c = 0) = p(\text{sat} \mid c = 0) = p(\text{in} \mid c = 0) = p(\text{dish} \mid c = 0) = \frac{1}{6}.$$
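A minimal sketch of this parameter estimation, assuming the corpus and unsmoothed MLE setup stated in the question; the data layout and names are our own.

```python
from collections import Counter
from fractions import Fraction

# Illustrative MLE Naive Bayes parameter estimation for Q2.1,
# with unigram count features and no smoothing.
data = [
    ("the cat sat in the hat".split(), 1),
    ("the dog sat on the log".split(), 1),
    ("the fish sat in the dish".split(), 0),
]

label_counts = Counter(label for _, label in data)
token_counts = {c: Counter() for c in label_counts}
for tokens, label in data:
    token_counts[label].update(tokens)

# p(c): fraction of documents with each label
priors = {c: Fraction(n, len(data)) for c, n in label_counts.items()}
# p(f | c): token count within class c, normalized by class token total
likelihoods = {
    c: {w: Fraction(n, sum(counts.values())) for w, n in counts.items()}
    for c, counts in token_counts.items()
}

print(priors)                  # {1: Fraction(2, 3), 0: Fraction(1, 3)}
print(likelihoods[1]["the"])   # 1/3
print(likelihoods[0]["fish"])  # 1/6
```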