Unsupervised Visual Learning using Segment-Masks

Supervisor

Suitable for

MSc in Advanced Computer Science
Mathematics and Computer Science, Part C
Computer Science and Philosophy, Part C
Computer Science, Part C

Abstract

Masked language modelling is a widely used method for pretraining large language models (LLMs): words in the text are 'masked' and the model is trained to predict the hidden words. This approach has also been adapted to computer vision, notably in Masked Autoencoders (MAE) [1], which build on Vision Transformers (ViTs): patches of an image are dropped, and the model is tasked with reconstructing the missing regions. In the vision setting, however, this process isn't ideal, because image patches don't correspond directly to words in text. Unlike words, which represent complete concepts, a single patch can contain parts of several different objects. This project will investigate a novel approach to vision pretraining: using segmentation masks instead of patches, so that the masked regions align with object boundaries. A key aspect of the project will be comparing the effectiveness of segmentation-based pretraining against the traditional patch-based method on downstream tasks such as semantic segmentation and depth estimation.

[1] He et al., "Masked Autoencoders Are Scalable Vision Learners". https://arxiv.org/pdf/2111.06377.pdf
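
To make the comparison concrete, below is a minimal sketch of the two masking strategies: MAE-style random patch masking versus masking whole segments. The use of PyTorch, the 75% mask ratio, and helper names such as segment_map are illustrative assumptions, not part of the project specification; in practice the segment map would come from a segmentation model rather than being random.

import torch

def patch_mask(image, patch_size=16, mask_ratio=0.75):
    # MAE-style masking: drop a random subset of square patches.
    c, h, w = image.shape
    ph, pw = h // patch_size, w // patch_size
    num_patches = ph * pw
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = torch.zeros(num_patches, dtype=torch.bool)
    visible[torch.randperm(num_patches)[:num_keep]] = True
    # Expand the per-patch visibility mask back to pixel resolution.
    mask = (visible.view(ph, pw)
                   .repeat_interleave(patch_size, 0)
                   .repeat_interleave(patch_size, 1))
    return image * mask  # masked-out patches are zeroed

def segment_mask(image, segment_map, mask_ratio=0.75):
    # Segment-based masking: drop a random subset of whole segments, so
    # each hidden region corresponds to (part of) an object rather than
    # an arbitrary square patch.
    segment_ids = segment_map.unique()
    num_drop = int(len(segment_ids) * mask_ratio)
    dropped = segment_ids[torch.randperm(len(segment_ids))[:num_drop]]
    mask = ~torch.isin(segment_map, dropped)
    return image * mask

# Toy usage: a random image and a stand-in 4-segment map (hypothetical;
# a real segment_map would be produced by a segmentation model).
image = torch.rand(3, 224, 224)
segment_map = torch.randint(0, 4, (224, 224))
masked_by_patch = patch_mask(image)
masked_by_segment = segment_mask(image, segment_map)

The key difference is only in how the visible/hidden mask is constructed: the patch variant hides axis-aligned squares regardless of image content, while the segment variant hides regions that follow object boundaries, which is the property the project sets out to evaluate.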

Pre-requisites: Suitable for those who have taken a course in machine learning