Essay On Vector

5.4 Compiling Semantic Relationships with Document Vector

Adding hypernyms, hyponyms, and synonyms to the document term vector reveals subjects that are hidden in the context. Documents that share more concepts and ideas therefore have a greater chance of being clustered in the same group. For example, a document about "موسیقی" (music) may share ideas with a more abstract subject such as "هنر" (art). Once an appropriate sense has been chosen for an ambiguous term, the next step is to add the inclusion (ISA) and synonymy relationships of that sense to the vector. To do this, the radius of the hypernyms (parents) and hyponyms (children) around the winning synset (sense) must be chosen. The evaluation results for inclusion of concepts with more
[…]
Note that hyper_syni contains two terms, tc and td (Fig. 2). The term td is new to the document, so it is handled in the same way as the insertion of tj. The term tc already occurs in d1 with a frequency of 3, so its new frequency in d1 is 5 (3 + 2).
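The frequency update described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `merge_related_terms` and the `weight` parameter are hypothetical, and the increment of 2 is taken from the "3 + 2" example in the text. The excerpt does not show how that increment is derived, and applying the same increment to the new term td is also an assumption.

```python
from collections import Counter


def merge_related_terms(doc_vector, related_terms, weight):
    """Return a copy of doc_vector with each related term's count raised by weight.

    Terms already in the vector are incremented; terms that are missing
    are inserted with a count equal to weight.
    """
    updated = Counter(doc_vector)
    for term in related_terms:
        updated[term] += weight
    return updated


# d1 already contains tc with frequency 3, as in the text.
d1 = Counter({"tc": 3})

# hyper_syni holds the two terms named in the text.
hyper_syn_i = ["tc", "td"]

# Assumption: every related term contributes the same increment (2).
d1_updated = merge_related_terms(d1, hyper_syn_i, weight=2)

print(d1_updated["tc"])  # 5, matching the 3 + 2 example
print(d1_updated["td"])  # 2, under the equal-increment assumption
```

Returning a new `Counter` rather than mutating the input keeps the original term vector available for comparison with the enriched one.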


The experiments in this research use the Persian corpus Hamshahri, which was collected in the database laboratory of the University of Tehran. This corpus is a standard collection that was used at CLEF to evaluate Persian information retrieval systems, and its formal version also meets the standards of TREC (Text Retrieval Conference). The first version includes more than 166,000 documents collected from the website archive of the Hamshahri newspaper. They represent the general texts that native Persian speakers speak and write on a daily basis. Although the corpus contains a rather large number of newsgroups, only several of them have more than 1000 documents. The newsgroups used in the evaluations are as follows: economic, politics, sport, social, literature and art, urban, scientific and cultural, miscellaneous, world news and happenings.

Accuracy (AC) and Normalized Mutual Information (NMI) are used as measures of clustering quality. The AC calculation establishes a one-to-one correspondence between the resulting clusters and the input classes, so a function is needed to map the clusters to the input labels. The value of AC ranges from 0 to 1, and a higher value indicates higher clustering quality. AC is used in some
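The one-to-one mapping behind AC can be sketched as follows. The excerpt does not say which mapping function is used, so this example assumes the simplest option: try every one-to-one assignment of clusters to class labels and keep the one with the most matches. The function name `clustering_accuracy` and the sample labels are hypothetical. Because brute force grows factorially with the number of clusters, it is practical only for small cases; an assignment solver such as the Hungarian algorithm is the usual choice for larger ones.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Fraction of documents whose cluster maps to their true class
    under the best one-to-one cluster-to-class mapping."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one document is required")

    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("a one-to-one mapping needs at least as many classes as clusters")

    best_hits = 0
    # Each permutation assigns a distinct class to every cluster.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(
            1 for truth, cluster in zip(true_labels, cluster_ids)
            if mapping[cluster] == truth
        )
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Hypothetical toy data: six documents from three newsgroups.
truth = ["sport", "sport", "sport", "politics", "politics", "economic"]
found = [0, 0, 1, 1, 1, 2]

ac = clustering_accuracy(truth, found)
print(round(ac, 3))  # 0.833: five of the six documents match under the best mapping
```

Note that the cluster identifiers themselves carry no meaning: relabeling the clusters leaves AC unchanged, which is why the mapping step is required before counting matches.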