Information Retrieval: 2008-2009
Overview
Information Retrieval (IR), for the purposes of this course, is the study of the indexing, processing, and querying of textual data. The growing importance of the Web means that IR has acquired added significance in recent years. The course will also look at how models of language similar to those used in IR can be applied to the problem of Machine Translation (MT), which is becoming increasingly important as more and more non-English text appears on the Web.
The aim of the course is to provide an introduction to the basic principles and techniques used in IR; to demonstrate how statistical models of language can be used to solve the document retrieval problem; to consider specific IR applications such as cross-language retrieval; and to show how statistical models of language can be used to develop Machine Translation systems.
- to gain an understanding of the basic concepts and techniques in Information Retrieval;
- to understand how statistical models of text can be used to solve problems in IR, with a focus on how the vector-space model and the language model can be applied to the document retrieval problem;
- to understand how the user can be involved in the document retrieval process, through the use of relevance feedback;
- to understand how statistical models of text can be used for other IR applications, for example clustering;
- to appreciate the difficulties in carrying out document retrieval on the Web, and how the hyperlink structure can facilitate accurate retrieval;
- to appreciate the importance of data structures such as an index in allowing efficient access to the information in large bodies of text;
- to gain experience of building a document retrieval system, through the practical sessions, including the implementation of a relevance feedback system;
- to understand how statistical models of language can be applied to the Machine Translation problem.
- Basics of information retrieval
- Text representation and processing
- Retrieval models (Boolean, vector space, language model)
- Relevance feedback - real feedback, pseudo-relevance feedback
- Document and concept clustering - hierarchical clustering, k-means
- Web retrieval - PageRank, difficulties of Web retrieval
- Cross-language retrieval - queries in one language, documents in another
- Distributional and semantic similarity - automatic thesaurus construction
- Language models for MT
- Estimation from parallel texts
- Decoding (finding the most probable translation)
- Course textbook: Introduction to Information Retrieval, by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze (online at http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html)
- Modern Information Retrieval (1999), by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
- Readings in Information Retrieval (1997), edited by Karen Spärck Jones and Peter Willett
- Managing Gigabytes: Compressing and Indexing Documents and Images (1999), by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
- Information Retrieval (1979), by C. J. van Rijsbergen (online at http://www.dcs.gla.ac.uk/Keith/Preface.html)