University of Oxford, Department of Computer Science

Measures of document similarity

Supervisor

Suitable for

Abstract

The literature offers various measures of document similarity, each used for a different purpose. These include: Levenshtein (edit) distance, which measures similarity between sequences; BLEU, an n-gram-based measure of how well a translated document compares to a gold standard; and vector-space models from information retrieval (or word sense disambiguation), in which documents are converted to vectors and compared using measures such as cosine distance or techniques such as latent semantic analysis. The aim of this project is to compare these and other measures on different types of document, in order to answer questions such as: which measures work best on short, medium and long documents, and what is a good mixture of semantic, lexical and syntactic similarity.
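
As a rough illustration of two of the measures named above, the following minimal sketch computes Levenshtein distance over characters and cosine similarity over bag-of-words vectors. The function names, the whitespace tokenisation and the lowercasing are assumptions made for this example, not part of the project description; a real comparison would likely use proper tokenisation and term weighting.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two sequences; each insertion,
    deletion or substitution costs 1 (an assumed cost model)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between raw term-count vectors.
    Tokenisation is naive whitespace splitting (an assumption)."""
    va = Counter(doc_a.lower().split())
    vb = Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)
```

Here `levenshtein("kitten", "sitting")` evaluates to 3, and cosine distance can be taken as `1 - cosine_similarity(x, y)`. The two measures respond to different properties: edit distance is sensitive to character order, while the bag-of-words cosine ignores word order entirely, which is one reason their behaviour may differ across short and long documents.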