Document Similarity Measure Using Selection of Present-Absent Feature Approach

926 Words | Nov 18, 2016 | 4 Pages
1. INTRODUCTION

Text document processing plays a key role in data mining and in web search for information retrieval. The most commonly used model in text processing is the bag-of-words model [5]. In this model, each document is represented as a vector in which each element holds the value of the corresponding feature in the document. A feature value can be the term frequency, i.e., the number of occurrences of a term in the document. Alternatively, it can be the relative term frequency, defined as the ratio between the term frequency and the total number of occurrences of all the terms in the document set. The dimensionality of a document is often large and the resulting vector is sparse, i.e., most of the feature values in the vector are zero. Such high dimensionality and sparsity pose a challenge for similarity measurement, which is a very important operation in text processing algorithms.

Several measures have been proposed for computing the similarity between two document vectors. The Kullback-Leibler divergence [3] is a non-symmetric measure of the difference between the probability distributions associated with the two vectors. Euclidean distance [5] is a well-known similarity metric taken from Euclidean geometry. Manhattan distance [11], which is very similar to Euclidean distance and is also known as the taxicab metric, is another
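
As a rough illustration of the bag-of-words representation and the three measures named in the introduction, the following Python sketch builds term-frequency vectors for two toy documents and compares them. It is not the method proposed in this paper. The function names, the whitespace tokenizer, and the toy documents are hypothetical. The small smoothing constant `eps` is also an assumption, added so that the Kullback-Leibler divergence stays finite when a term is absent from one document, and each vector is normalized by its own total so that it forms a probability distribution.

```python
import math
from collections import Counter


def term_frequency_vectors(docs):
    """Build bag-of-words count vectors over a shared, sorted vocabulary."""
    # Assumption: lowercase whitespace tokenization is enough for this toy example.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def to_distribution(vector, eps=1e-9):
    """Turn a count vector into a probability distribution.

    Assumption: eps is added to every entry so that no probability is
    exactly zero; the value 1e-9 is an arbitrary illustrative choice.
    """
    smoothed = [v + eps for v in vector]
    total = sum(smoothed)
    return [v / total for v in smoothed]


def euclidean(a, b):
    """Euclidean (straight-line) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two hypothetical documents; most entries are zero once the vocabulary grows.
vocab, (d1, d2) = term_frequency_vectors(
    ["data mining text", "text mining web search"]
)
p, q = to_distribution(d1), to_distribution(d2)

print("vocabulary:", vocab)
print("euclidean:", euclidean(d1, d2))
print("manhattan:", manhattan(d1, d2))
print("KL(p || q):", kl_divergence(p, q))
print("KL(q || p):", kl_divergence(q, p))
```

Running the sketch should show two different values for `KL(p || q)` and `KL(q || p)`, which is the non-symmetry mentioned above, whereas the Euclidean and Manhattan distances give the same value regardless of argument order.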