Information Retrieval: 2011-2012
Overview
Modern internet search engines form the primary interface through which most users interact with the World Wide Web. The dramatic increase in recent years in the amount of data available on the Web means that automatic methods of Information Retrieval (IR) have acquired greater significance. For the purposes of this course, IR is the study of the indexing, processing, storage and querying of textual data. The aim of the course is to provide an introduction to the core principles and techniques used in IR, and to demonstrate how statistical models of language can be used to solve the document indexing and retrieval problems. In addition, we will look at the issues involved in indexing the entire Web and at the creative solutions to this problem currently deployed by large-scale online search providers.
On completion of the course students will be expected to:
- gain an understanding of the basic concepts and techniques in Information Retrieval;
- understand how statistical models of text can be used to solve problems in IR, with a focus on how the vector-space model and the language model are implemented and applied to document retrieval problems;
- understand how statistical models of text can be used for other IR applications, for example clustering and news aggregation;
- appreciate the importance of data structures, such as an index, to allow efficient access to the information in large bodies of text;
- understand common text compression algorithms and their role in the efficient building and storage of inverted indices;
- have experience of building a document retrieval system, through the practical sessions, including the implementation of a relevance feedback mechanism;
- understand the issues involved in providing an IR service on a web scale, including distributed index construction and user modeling for recommendation engines.
Prior knowledge of elementary linear algebra would be helpful but is not required for this course. The practical side of this course has a relatively in-depth programming component. Students will build a vector-space-based information retrieval system from scratch using a programming language of their choice. Students should be familiar with object-oriented programming, simple data structures such as hash maps, and text processing.
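As a rough indication of the kind of program the practical sessions build towards, the following is a minimal sketch of vector-space retrieval over an inverted index held in hash maps. It is an illustration under stated assumptions, not the course's reference solution: the function names (`tokenize`, `build_index`, `rank`), the whitespace tokenizer, and the particular TF-IDF weighting (raw term frequency times the log of inverse document frequency) are all choices made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (assumed tokenizer; no stemming)."""
    return text.lower().split()


def build_index(docs):
    """Build an inverted index mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def rank(query, docs, index):
    """Return (doc_id, score) pairs sorted by decreasing score.

    The score is the TF-IDF dot product divided by the document vector
    length. The query length is omitted because it is the same for every
    document and so does not change the ranking.
    """
    n = len(docs)
    idf = {term: math.log(n / len(postings)) for term, postings in index.items()}

    # Squared length of each document vector under TF-IDF weighting.
    norms = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * idf[term]) ** 2

    # Accumulate dot products by walking only the postings of query terms.
    scores = defaultdict(float)
    for term, q_tf in Counter(tokenize(query)).items():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += (q_tf * idf[term]) * (tf * idf[term])

    ranked = [(d, s / math.sqrt(norms[d])) for d, s in scores.items() if norms[d] > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "information retrieval with inverted indices",
]
index = build_index(docs)
results = rank("cat mat", docs, index)
```

With these three toy documents, the query "cat mat" ranks document 0 ahead of document 1, and document 2 is not returned because it shares no terms with the query. Scoring walks the postings lists of the query terms only, which is the main reason an inverted index is preferred to scanning every document.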
Syllabus
- Text representation and processing, retrieval models: Boolean and vector space models, TF-IDF, evaluation in information retrieval;
- Index construction and compression: inverted index, memory-based and sort-based inversion with and without compression, text compression and coding;
- Language models: n-gram language models, the noisy channel model, probabilistic information retrieval;
- Querying: text normalization and stemming, lexicon representations, Boolean and ranked query retrieval, relevance feedback;
- Document clustering and text classification;
- Lexical semantics and dimensionality reduction: distributional semantics, Latent Semantic Analysis;
- Information retrieval on the Web: PageRank, MapReduce, collaborative filtering, web recommendation engines;
- Introduction to other information retrieval systems and challenges: Multimedia Information Retrieval, Mathematical Expression Retrieval, Information Extraction, Document Summarization.
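To give a flavour of the index compression topic listed above, here is a minimal sketch of one common approach: storing the gaps between successive document IDs in a postings list and encoding each gap with a variable-byte code. The function names are hypothetical, and the byte layout (seven payload bits per byte, with the high bit set on the final byte of each number) is one conventional choice assumed for this example; other layouts are equally valid.

```python
def vb_encode_number(n):
    """Encode one non-negative integer; the high bit marks its final byte."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128  # set the high bit on the last byte only
    return bytes(chunks)


def encode_postings(doc_ids):
    """Encode a strictly increasing list of document IDs as variable-byte gaps."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data):
    """Invert encode_postings: rebuild the document IDs from the gap bytes."""
    doc_ids = []
    gap = 0
    previous = 0
    for byte in data:
        if byte < 128:
            # Not the final byte of this number: keep accumulating.
            gap = 128 * gap + byte
        else:
            # Final byte: strip the high bit, finish the gap, emit the ID.
            gap = 128 * gap + (byte - 128)
            previous += gap
            doc_ids.append(previous)
            gap = 0
    return doc_ids


postings = [824, 829, 215406]
encoded = encode_postings(postings)
decoded = decode_postings(encoded)
```

Here the three document IDs become the gaps 824, 5 and 214577, which occupy six bytes in total, compared with twelve bytes if each ID were stored as a fixed 32-bit integer. Small gaps, which dominate the postings lists of frequent terms, cost a single byte each.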
Reading list
- Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval, 2008. http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
- Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, second edition, 1999.
- Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce, 2010.
- Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval, 1999.
- Karen Sparck Jones and Peter Willett (editors). Readings in Information Retrieval, 1997.
- C. J. van Rijsbergen. Information Retrieval, 1979. http://www.dcs.gla.ac.uk/Keith/Preface.html