Information Retrieval: 2010-2011

Lecturer

Degrees

Schedule C1: Computer Science

Schedule C1: Mathematics and Computer Science

Schedule C: MSc in Computer Science

Term

Overview

The dramatic increase in recent years in the amount of data available on the Web means that automatic methods of Information Retrieval (IR) have acquired greater significance. Furthermore, this data exists in multiple forms (text, image, video, etc.), and it is becoming increasingly important that IR techniques can perform search and retrieval operations across these distinct formats. For the purposes of this course, IR is the study of the indexing, processing, and querying of both textual and image data.

The aim of the course is to provide an introduction to the basic principles and techniques used in IR; to demonstrate how statistical models of language can be used to solve the document retrieval problem; to explore a range of image processing techniques used in IR; and to show how combined models for language and image processing can enhance document retrieval.

Learning outcomes

  • to gain an understanding of the basic concepts and techniques in Information Retrieval;
  • to understand how statistical models of text can be used to solve problems in IR, with a focus on how the vector-space model and the language model can be applied to the document retrieval problem;
  • to understand how statistical models of text can be used for other IR applications, for example clustering;
  • to appreciate the importance of data structures such as an index in allowing efficient access to the information in large bodies of text;
  • to gain experience of building a document retrieval system through the practical sessions, including the implementation of a relevance feedback system;
  • to gain an understanding of the basic operations of image processing that support IR;
  • to understand how image processing techniques for object recognition and motion detection can be used in solving the IR problem for image data;
  • to appreciate how combined models of language and image processing can enhance document retrieval.
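
As an illustration of the index mentioned in the outcomes above, the following is a minimal sketch of an inverted index with Boolean AND retrieval. It is not taken from the course materials: the function names, the whitespace tokenizer, and the three toy documents are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (assumed tokenizer; a real
    system would also handle punctuation, stop words, and stemming)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def boolean_and(index, query):
    """Return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Toy collection, invented for illustration.
docs = {
    1: "information retrieval with statistical models",
    2: "image retrieval and object recognition",
    3: "statistical models of language",
}
index = build_index(docs)
print(boolean_and(index, "statistical models"))  # [1, 3]
print(boolean_and(index, "retrieval"))           # [1, 2]
```

The point of the structure is that a query touches only the postings lists of its own terms, rather than scanning every document in the collection.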

Prerequisites

Prior knowledge of elementary linear algebra would be helpful but is not required for this course.

The practical portion of this course has a relatively in-depth programming component. Students will build a vector-space-based information retrieval system from scratch using a programming language of their choice. Students should be familiar with object-oriented programming, simple data structures such as hash maps, and text processing.
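
To give a sense of the scale of such a system, here is a minimal sketch of vector-space ranking built on hash maps (Python dictionaries). It assumes raw term frequency multiplied by log(N/df) as the weighting and cosine similarity for scoring; the practical itself may specify a different weighting scheme, and all names and documents below are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (assumed tokenizer)."""
    return text.lower().split()


def build_vectors(docs):
    """Return (idf, vectors), where vectors maps doc id -> {term: tf-idf}."""
    term_counts = {d: Counter(tokenize(t)) for d, t in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    n = len(docs)
    idf = {term: math.log(n / df) for term, df in doc_freq.items()}
    vectors = {
        d: {term: tf * idf[term] for term, tf in counts.items()}
        for d, counts in term_counts.items()
    }
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, idf, vectors):
    """Return ids of documents with a non-zero score, best match first."""
    counts = Counter(tokenize(query))
    q_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), d) for d, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for score, d in scored if score > 0.0]


# Toy collection, invented for illustration.
docs = {
    1: "information retrieval with statistical models",
    2: "image retrieval and object recognition",
    3: "statistical models of language",
}
idf, vectors = build_vectors(docs)
print(rank("image retrieval", idf, vectors))  # [2, 1]
```

Document 2 ranks first because it matches both query terms, and "image" is rarer in the collection than "retrieval", so it carries a higher weight.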

Synopsis

Information retrieval (Text Processing)

Text representation and processing

Retrieval models (Boolean, vector space, language model)

Indexing

Evaluation

Relevance feedback - real feedback, pseudo-relevance feedback

Document and concept clustering - hierarchical clustering, k-means

Web retrieval - PageRank, difficulties of Web retrieval

Document clustering
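
The relevance feedback topic listed above is commonly introduced through the Rocchio update, which moves the query vector towards documents judged relevant and away from those judged non-relevant. The sketch below assumes that formulation; the synopsis does not say which method the course uses. The parameter defaults (alpha=1.0, beta=0.75, gamma=0.15) are conventional textbook values, and the vectors are invented for the example.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector.

    All vectors are sparse {term: weight} dicts. The new weight of a term is
    alpha * query weight + beta * mean weight over relevant documents
    - gamma * mean weight over non-relevant documents.
    """
    terms = set(query)
    for vec in relevant + non_relevant:
        terms.update(vec)

    updated = {}
    for term in terms:
        weight = alpha * query.get(term, 0.0)
        if relevant:
            weight += beta * sum(v.get(term, 0.0) for v in relevant) / len(relevant)
        if non_relevant:
            weight -= gamma * sum(v.get(term, 0.0) for v in non_relevant) / len(non_relevant)
        if weight > 0.0:  # negative weights are dropped, a common convention
            updated[term] = weight
    return updated


# Hypothetical query and judged documents.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "habitat": 2.0}]
non_relevant = [{"jaguar": 1.0, "car": 2.0}]

new_query = rocchio(query, relevant, non_relevant)
print(sorted(new_query))  # ['cat', 'habitat', 'jaguar']
```

With real feedback the relevant and non-relevant lists come from user judgements. With pseudo-relevance feedback the top-ranked documents of an initial search are simply assumed to be relevant and passed in as `relevant`, usually with an empty `non_relevant` list.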

Information Retrieval (Image Processing)

Operations on images

Motion detection

Object recognition

Automatic image annotation and retrieval

Combined models of language and image processing
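
For the image-processing topics above, one of the simplest motion detection techniques is frame differencing: two consecutive greyscale frames are subtracted pixel by pixel and the result is thresholded. The sketch below assumes that technique, which the synopsis does not name. Frames are represented as nested lists of intensities in the range 0-255, and the threshold of 25 is an arbitrary illustrative value.

```python
def motion_mask(prev_frame, next_frame, threshold=25):
    """Mark with 1 each pixel whose absolute intensity change between
    the two frames exceeds the threshold, and with 0 every other pixel."""
    return [
        [1 if abs(b - a) > threshold else 0 for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]


def motion_fraction(mask):
    """Fraction of pixels flagged as changed (0.0 for an empty mask)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0


# Two hypothetical 3x3 frames: the centre pixel changes sharply, while the
# pixel below it changes only slightly (treated as noise).
prev_frame = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
next_frame = [[10, 10, 10], [10, 200, 10], [10, 12, 10]]

mask = motion_mask(prev_frame, next_frame)
print(mask)  # [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

The threshold separates genuine change from sensor noise. A retrieval system could store a summary such as `motion_fraction(mask)` per video segment as one indexable feature.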

Reading list

Course Textbooks

  • Introduction to Information Retrieval (2008), by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
  • Computer Vision: A Modern Approach (2003), by D. Forsyth and J. Ponce, ISBN 0-13-085198-1

Additional reading

  • Modern Information Retrieval (1999), by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
  • Readings in Information Retrieval (1997), edited by Karen Spärck Jones and Peter Willett
  • Managing Gigabytes: Compressing and Indexing Documents and Images (1999), by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
  • Information Retrieval (1979), by C. J. van Rijsbergen (online at http://www.dcs.gla.ac.uk/Keith/Preface.html)
  • Machine Vision (1995), by R. Jain, R. Kasturi, and B. Schunck, McGraw Hill, ISBN 0-07-032018-7
  • Feature Extraction and Image Processing (2002), by M. Nixon and A. Aguado, ISBN 0-7506-5078-8