Information Retrieval: 2012-2013

Lecturer

Vasile Palade

Degrees

Schedule C1 — Computer Science

Schedule C1 — Mathematics and Computer Science

Schedule C — MSc in Advanced Computer Science

Term

Michaelmas Term 2012 (20 lectures)

Overview

Modern internet search engines form the primary interface for most users interacting with the World Wide Web. The dramatic increase in recent years in the amount of data available on the Web means that automatic methods of Information Retrieval (IR) have acquired greater significance. For the purpose of this course, IR will mainly mean the study of the indexing, processing, storage and querying of textual data. The aim of the course is to provide an introduction to the core principles and techniques used in IR, and to demonstrate how statistical models of language can be used to solve document indexing and retrieval problems. In addition, we will look at the issues involved in indexing the entire Web and the creative solutions to this problem currently deployed by large-scale online search providers.

Learning outcomes

On completion of the course students will be expected to:

gain an understanding of the basic concepts and techniques in Information Retrieval;
understand how statistical models of text can be used to solve problems in IR, with a focus on how the vector-space model and language models are implemented and applied to document retrieval problems;
understand how statistical models of text can be used for other IR applications, for example clustering and news aggregation;
appreciate the importance of data structures, such as an index, to allow efficient access to the information in large bodies of text;
understand common text compression algorithms and their role in the efficient building and storage of inverted indices;
have experience of building a document retrieval system, through the practical sessions, including the implementation of a relevance feedback mechanism;
understand the issues involved in providing an IR service on a web scale, including distributed index construction and user modeling for recommendation engines.

Prerequisites

Prior knowledge of elementary linear algebra would be helpful but is not required for this course. The practical side of this course has a relatively in-depth programming component. Students will build a vector-space-based information retrieval system from scratch using a programming language of their choice. Students should be familiar with object-oriented programming, simple data structures such as hash maps, and text processing.
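As a rough indication of the kind of programming involved, the following Python sketch builds a small inverted index from hash maps and ranks documents by TF-IDF weight. It is illustrative only and is not part of the course materials: the function names (tokenize, build_index, rank) and the toy documents are hypothetical, and the scoring omits the document length normalization that a full vector space model would normally apply.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (no normalization or stemming)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to a hash map of {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, num_docs):
    """Order documents by the summed TF-IDF weight of the query terms.

    Assumption: raw term frequency times log(N / df), with no length
    normalization. Documents matching no query term are not returned.
    """
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy collection, used only to exercise the functions above.
docs = {
    "d1": "information retrieval with inverted index",
    "d2": "web search engines crawl the web",
    "d3": "index compression for web search",
}
index = build_index(docs)
results = rank("web crawl", index, len(docs))
print(results)
```

Here the index is a hash map from terms to postings, so a query only touches the documents that contain at least one of its terms rather than scanning the whole collection.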

Synopsis

Text representation and processing, retrieval models: Boolean and Vector Space Models, TF-IDF, document similarity measures, evaluation in information retrieval, term processing (tokenization, normalization, stemming), lexicon representations;
Querying: processing wild-card queries, spelling corrections, Boolean and ranked query retrieval, relevance feedback and query expansion;
Index construction and compression: inverted index, sort-based and memory-based inversion, distributed indexing (MapReduce), dynamic indexing, index compression and coding (Gamma, VB, Golomb codes);
Probabilistic IR: the noisy channel model, language models, smoothing techniques;
Document clustering and text classification;
Lexical semantics and dimensionality reduction: Latent Semantic Analysis, semantic similarity, distributional semantics;
Information retrieval on the Web: web crawling and indexing, PageRank, HITS, collaborative filtering, web usage mining and recommendation engines;
Overview of other IR topics and challenges: Multimedia Information Retrieval, Mathematical Formulae Retrieval, Information Extraction, Document Summarization.

Syllabus

Boolean Model and Vector Space Model, evaluation in information retrieval, text representation and processing, relevance feedback and query expansion, index construction and compression, language models and smoothing techniques, document clustering, text classification, dimensionality reduction and semantic similarity, IR on the Web (PageRank, HITS), web usage mining, other IR topics and challenges.

Reading list

Primary Texts

Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval, 2008. http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition, 1999.

Secondary Texts

Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce, 2010.
Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval, 1999.
Karen Spärck Jones and Peter Willett (editors). Readings in Information Retrieval, 1997.
C. J. van Rijsbergen. Information Retrieval, 1979. http://www.dcs.gla.ac.uk/Keith/Preface.html

Taking our courses

This form is not to be used by students studying for a degree in the Department of Computer Science, or by Visiting Students who are registered for Computer Science courses.

Other matriculated University of Oxford students who are interested in taking this or other courses in the Department of Computer Science must complete this online form by 17.00 on Friday of 0th week of the term in which the course is taught. Late requests, and requests sent by email, will not be considered. All requests must be approved by the relevant Computer Science departmental committee and can only be submitted using this form.