
Analysis Of SVM And CRF Based Tagger


When the percentage of unknown words (shown in Table 2) is taken into consideration, the SVM-based tagger achieves the highest accuracy when the testing data contains more unknown words (E2 and E4), whereas the CRF-based tagger achieves the highest accuracy when the testing data contains fewer unknown words (E1, E3 and E5). This suggests that SVM is more robust to unknown words. Moreover, when testing is done on a different domain (E4 and E5), the SVM-based tagger achieves the highest accuracy in E4 and the CRF-based tagger in E5. This indicates that the SVM- and CRF-based taggers are more robust to domain adaptation. Therefore, based on the results of our experiments, no single tagger can be said to perform best for the Sinhala language.

The same observation is made for all other experiments as well. Therefore, we cannot confidently draw a conclusion about the best ensemble tagger setup. However, any ensemble tagger outperforms any individual tagger. Next, the experimental results confirm that accuracy decreases when the training and testing phases use corpora from different domains. For example, for the individual SVM-based tagger, the best accuracy of 88.24% is achieved when training and testing both use a combination of Official Documents and News (E3). But when the tagger is trained on News and tested on Official Documents (E5), the accuracy is 82.01%, a drop of 6.23 percentage points. However, we should consider the properties of the training and testing corpora (percentage of unknown words, corpus size) in these two experiments before drawing a general conclusion. The percentages of unknown words differ between E3 and E5: E5 has 10% unknown words in its testing corpus, while E3 has only 5%. For a fairer comparison, we can compare E5 with E2, in which training and testing both use the News corpus and the testing corpus has 11% unknown words. E2 obtained a tagging accuracy of 88.14%, making a decrement of
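The comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the SVM accuracies and unknown-word percentages quoted in the text; the names `svm_accuracy`, `unknown_words` and `accuracy_drop` are illustrative assumptions and do not come from the original experiments.

```python
# Accuracies (%) quoted in the text for the individual SVM-based tagger.
svm_accuracy = {"E2": 88.14, "E3": 88.24, "E5": 82.01}

# Unknown-word percentages in the testing corpus, as quoted in the text.
unknown_words = {"E2": 11, "E3": 5, "E5": 10}


def accuracy_drop(baseline, target):
    """Return the accuracy drop, in percentage points, from baseline to target."""
    return round(svm_accuracy[baseline] - svm_accuracy[target], 2)


# Compare the cross-domain experiment (E5) against each same-domain baseline.
for baseline in ("E3", "E2"):
    drop = accuracy_drop(baseline, "E5")
    gap = abs(unknown_words[baseline] - unknown_words["E5"])
    print(f"{baseline} -> E5: drop {drop:.2f} points, unknown-word gap {gap} points")
```

E2 is the closer baseline because its unknown-word percentage differs from that of E5 by only one point, so less of the accuracy drop can be attributed to unknown words rather than to the change of domain.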
