Measures of document similarity
Abstract
In the literature there are various measures of document similarity used for different purposes. These include: Levenshtein
or edit distance (similarity between sequences); BLEU, an n-gram-based measure of how well a translated document compares
to a gold standard; and vector-space models from information retrieval (or word sense disambiguation), in which documents are
converted to vectors and compared using measures such as cosine distance or techniques such as latent semantic analysis. The aim of this
project is to compare these and other measures on different types of document, in order to answer questions such as:
which measures work best on short, medium and long documents, and what is a good mixture of semantic, lexical and syntactic
similarity?
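
As a rough illustration of how two of the measures named above differ, the following Python sketch computes a character-level Levenshtein distance and a bag-of-words cosine similarity. It is only a minimal sketch under stated assumptions: the function names are hypothetical, tokenisation is plain lowercased whitespace splitting, and the vectors hold raw term counts with no weighting such as tf-idf. The corresponding cosine distance would be one minus the similarity returned here.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between raw term-count vectors of two documents.
    Assumption: tokens are lowercased, whitespace-separated words."""
    va = Counter(doc_a.lower().split())
    vb = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)
```

For example, levenshtein("kitten", "sitting") gives 3. The edit distance is sensitive to character order and document length, whereas the cosine measure ignores word order entirely, which is one reason the two may rank document pairs differently on short and long texts.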