Searching for datasets: discovering, organizing, and understanding structured data
- 14:00 17th May 2016 (Trinity Term 2016)
The world generates reams of structured data every day: the data that we allow companies to collect about our everyday lives, the data that engineers produce as they analyze this collected data, and the data that scientists collect to verify their next hypothesis, to name just a few examples. This data comes in many forms, formats, and sizes. Often, we want to find and, maybe, reuse the data that others have produced. Search companies today do a good job of organizing and searching unstructured data, such as HTML pages. However, organizing, understanding, searching, and reusing heterogeneous structured data is largely an open research problem. How do we collect the metadata, and how do we understand and capture the semantics of the data? How do we index and rank the datasets in search results? How do we keep our metadata up to date and scale it to use cases where we need to catalog billions of datasets, with billions that are added daily?
In this talk, I will discuss two projects that tackle different aspects of this problem. The first, Google Datasearch, is a project to rethink how we organize structured datasets at scale within an enterprise (Google), in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them. I will discuss the technical challenges that we had to overcome in order to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose the metadata to users. The second project, Science Search, also aims at extracting and analyzing metadata for diverse datasets. However, for Science Search, we look at the datasets on the open Web, focusing on repositories of public scientific data. I will discuss the challenges that are similar in these two seemingly different setups and also the unique challenges in handling diverse science data.