Good Wrappers are Lazy Wrappers: Avoiding Page Rendering
Supervisor:
Suitable for: MSc in Computer Science
Abstract
The wealth of information on the web exceeds that of any other human-created source by orders of magnitude. Harnessing that data, however, requires increasingly automated methods that can collect relevant data from many websites and filter it, e.g., for further human consumption. Automating such tasks increasingly requires rendering all web pages involved, as scripted interfaces only work if the page is properly rendered and wrappers increasingly use visual features for robust extraction. Page rendering, however, comes at a significant cost: in OXPath, for example, it dominates the cost of evaluation by a wide margin. In this project we therefore investigate how to automatically detect when a page accessed by an (OXPath) wrapper needs to be rendered. We aim to do so by combining static analysis of the expression with statistics gathered from previous runs.

This project is particularly suited for students interested in browser technologies and their use in data extraction. Familiarity with JavaScript and the DOM API would be helpful. Since OXPath is implemented in Java and the method developed will be evaluated by implementing it as an extension to OXPath, extensive knowledge of Java is needed.
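To make the intended combination of static analysis and run statistics more concrete, the following is a minimal sketch of one possible decision rule, not the method the project will develop. Everything in it is an assumption made for illustration: the class `RenderDecider`, its methods, the threshold, and the crude textual check (which assumes that actions appear in braces and visual features through a `style::` axis) are hypothetical and not part of the OXPath code base.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of OXPath: decides whether a page should be rendered.
class RenderDecider {
    // host -> {runs observed, runs in which rendering changed the extracted data}
    private final Map<String, int[]> history = new HashMap<>();
    private final double threshold;

    RenderDecider(double threshold) {
        this.threshold = threshold;
    }

    // Crude stand-in for static analysis: a textual check for actions
    // (assumed to be written in braces) and visual features (assumed to
    // use a "style::" axis). A real analysis would inspect the parsed expression.
    static boolean usesRenderingFeatures(String expression) {
        return expression.contains("{") || expression.contains("style::");
    }

    // Records the outcome of one earlier run against the given host.
    void record(String host, boolean renderingMattered) {
        int[] counts = history.computeIfAbsent(host, h -> new int[2]);
        counts[0]++;
        if (renderingMattered) {
            counts[1]++;
        }
    }

    boolean shouldRender(String expression, String host) {
        if (usesRenderingFeatures(expression)) {
            return true; // the expression itself requires a rendered page
        }
        int[] counts = history.get(host);
        if (counts == null) {
            return true; // no evidence yet: render to be safe
        }
        // Render only if rendering mattered in a large enough share of past runs.
        return (double) counts[1] / counts[0] > threshold;
    }
}
```

The sketch errs on the side of rendering: a page is loaded without rendering only when the expression needs no rendered features and earlier runs suggest that rendering rarely changes the result.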
