Using Different Types of Stemmer

1392 Words Jan 25th, 2018 6 Pages
In our work, document preprocessing involves removing punctuation marks, numbers, and words written in another language, then normalizing the documents by replacing the letters "أ", "إ", and "آ" with "ا", the letters "ء" and "ؤ" with "ا", and the letter "ى" with "ا". Finally, stop words are removed; these are words that can be found in any text, such as prepositions and pronouns. The remaining words are returned and are referred to as keywords or features. The number of these features is usually large for large documents, so some filtering can be applied to reduce their number and remove redundant features.
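The preprocessing steps described above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the function name `preprocess`, and the choice to treat any token containing a character outside the basic Arabic letter range as "another language" are all assumptions, and the sketch expects undiacritized text.

```python
import re
import unicodedata

# Letter normalization exactly as stated in the text: every variant maps to bare alef.
NORMALIZATION = str.maketrans({
    "أ": "ا", "إ": "ا", "آ": "ا",
    "ء": "ا", "ؤ": "ا",
    "ى": "ا",
})

# Hypothetical stop-word list (the source does not give its own list).
# It is normalized with the same table so it matches normalized tokens.
RAW_STOP_WORDS = {"في", "من", "على", "إلى", "هو", "هي"}
STOP_WORDS = {w.translate(NORMALIZATION) for w in RAW_STOP_WORDS}

# A token counts as Arabic here if it uses only code points U+0621..U+064A (assumption).
ARABIC_WORD = re.compile(r"^[\u0621-\u064A]+$")


def preprocess(text):
    """Return the list of keyword tokens that survive all preprocessing steps."""
    # 1. Replace punctuation (P*), symbols (S*) and numbers (N*) with spaces.
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in "PSN" else ch for ch in text
    )
    # 2. Drop words written in another script.
    tokens = [t for t in cleaned.split() if ARABIC_WORD.match(t)]
    # 3. Normalize the letter variants.
    tokens = [t.translate(NORMALIZATION) for t in tokens]
    # 4. Remove stop words.
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("ذهب أحمد إلى المدرسة في 2018, hello!")` drops the number, the punctuation, the Latin word, and the two stop words, leaving three normalized Arabic tokens.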

Features Extraction
Text can be categorized using two types of features, external and internal. External features are not related to the content of the text, such as author name, publication date, author gender, and so on. Internal features reflect the text content and are mostly linguistic features, such as lexical items and grammatical categories [www]. In our work, words were treated as features on three levels: the bag-of-words form, the word stem (in which the suffix and prefix are removed), and the word root. For all of these features, we need to extract and generate the frequency list of the dataset features (single words) and save it in a training file.
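As a rough sketch of how such a frequency list might be built and saved, the Python below counts features at the bag-of-words level and at a simple stem level. The affix lists, the function names, and the tab-separated file layout are hypothetical choices for illustration; the source does not specify its stemmer or file format. Root extraction is not shown, but a root extractor could be passed in as the `feature` function in the same way.

```python
from collections import Counter

# Hypothetical affix lists for illustration only; a real light stemmer uses longer lists.
PREFIXES = ("ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ة")


def light_stem(word):
    """Strip at most one prefix and one suffix, keeping at least three letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def frequency_list(documents, feature=lambda w: w):
    """Count features over tokenized documents; the default feature is the word itself."""
    counts = Counter()
    for tokens in documents:
        counts.update(feature(t) for t in tokens)
    return counts


def save_frequency_list(counts, path):
    """Write one 'feature<TAB>count' line per feature, most frequent first."""
    with open(path, "w", encoding="utf-8") as f:
        for term, n in counts.most_common():
            f.write(f"{term}\t{n}\n")
```

Calling `frequency_list(docs)` gives bag-of-words counts, while `frequency_list(docs, light_stem)` merges surface forms that share a stem, which is one way the stem level reduces the number of distinct features.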

Feature Selection
The output of the feature extraction step is…