Efficient processing of nested collections in Apache Spark

Supervisor

Suitable for

MSc in Computer Science
Mathematics and Computer Science, Part C
Computer Science and Philosophy, Part C
Computer Science, Part C

Abstract

Apache Spark is today one of the most popular frameworks for large-scale data analysis. Spark offers a functional (Scala-like) API for processing data collections that are distributed over a cluster of machines. Its declarative approach, domain-specific libraries (e.g., for machine learning and graph processing), and high performance have enabled its wide adoption in industry.

Although Spark can transform collections of arbitrary types, it can exhibit severe performance problems when processing nested data formats such as JSON and XML. In particular, distributed processing of datasets whose nested collections have skewed cardinalities (e.g., one extremely large inner collection alongside many small ones) leads to an uneven distribution of work across the machines. In such cases, developers typically have to undergo a painful process of manual query rewriting to avoid load imbalance caused by large inner collections in their workloads. This project aims to extend the Spark API with new functionality that automatically transforms user queries to avoid data skew. The project is a great opportunity for students to understand how Apache Spark works under the hood and to contribute to an open-source project.