This webpage contains early-release videos of talks for the ARRAY 2021 workshop at PLDI 2021, taking place on Monday 21st June 2021.

For each talk there are two videos: one of the speaker, and one of the slides. The intention is that you play the two videos simultaneously. (They are being post-processed by SlidesLive for the conference website: the final versions may be slightly edited, static slides will be replaced by high-resolution images, and the two videos will be properly synchronized. But we are providing early access to these drafts to allow workshop participants to watch in advance, so there is more time for discussion at the workshop itself.)

To participate in the discussion around these talks, please register for the workshop. In case of any hardship, there is a $10 discounted registration fee, which provides access to all of PLDI. As well as the ARRAY workshop, there are talks about APL and Co-arrays at the Fourth HOPL Conference.

For more details, see the workshop webpage. In particular, you will find the short abstract for the keynote, extended abstracts for the three short talks, and (eventually) links to the published versions of the four full papers.

Abstract:My goal in life is to make sequential software rare. Every program should be a parallel program. This is the only way for software to realize the full benefit of modern systems (from CPU to GPU to clusters and onward to the cloud). I’ve tried many approaches over the years with some success, but I have not come even close to achieving my goal.My next attempt at achieving my goal is to focus on the fundamental data structures of programming. If we exposed them through a high-level API, we could hide the complexity of parallel and distributed computing and parallel programming would become easy. And of the data structures “out there” none is more important than the array.

There are many steps needed to realize this vision with arrays. I will report on two we’ve “completed” so far: GraphBLAS and TileDB. GraphBLAS is an API for expressing Graph algorithms in terms of sparse arrays. TileDB is a storage engine for sparse arrays optimized for use with database applications. I will describe these two projects and how they fit in to my big-picture goal of making software easier to write and routinely parallel.

Bio:Tim Mattson is a parallel programmer obsessed with every variety of science (Ph.D. Chemistry, UCSC, 1985). He is a senior principal engineer in Intel’s parallel computing lab. Tim has been with Intel since 1993 and has worked with brilliant people on great projects including: (1) the first TFLOP computer (ASCI Red), (2) MPI, OpenMP and OpenCL, (3) two different research processors (Intel’s TFLOP chip and the 48 core SCC), (4) Data management systems (Polystore systems and Array-based storage engines), and (5) the GraphBLAS API for expressing graph algorithms as sparse linear algebra. Tim has over 150 publications including five books on different aspects of parallel computing, the latest (Published November 2019) titled The OpenMP Common Core: Making OpenMP Simple Again.

Abstract:We present a type system for expressing size constraints on array-typed function parameters in an ML-style type system. The goal is to detect shape mismatches at compile-time, while being simpler than full dependent types. The main restrictions is that the only terms that can occur in types are array sizes, and syntactically they must be variables or constants. For those programs where this is not sufficient, we support a form of existential types, with the type system automatically managing the requisite book-keeping. We formalise a large subset of the type system in a small core language, which we prove sound. We also present an integration of the type system in the high-performance parallel functional language Futhark, and show on a collection of 44 representative programs that the restrictions in the type system are not too problematic in practice.

Abstract:Multi-dimensional array manipulation constitutes a core component of numerous numerical methods, e.g. finite difference solvers of Partial Differential Equations (PDEs). The efficiency of such computations is tightly connected to traversing array data in a hardware-friendly way. The Mathematics of Arrays (MoA) allows reasoning about array computations at a high level and enables systematic transformations of array-based programs. We have previously shown that stencil computations reduce to MoA’s Denotational Normal Form (DNF). Here we bring to light MoA’s Operational Normal Forms (ONFs) that allow for adapting array computations to hardware characteristics. ONF transformations start from the DNF. Alongside the ONF transformations, we extend MoA with rewriting rules for padding. These new rules allow both a simplification of array indexing and a systematic approach to introducing halos to PDE solvers. Experiments on various architectures confirm the flexibility of the approach.

Abstract:Most implementations of machine learning algorithms are based on special-purpose frameworks such as TensorFlow or PyTorch. While these frameworks are convenient to use, they introduce multi-million lines of code dependency that one has to trust, understand and potentially modify. As an alternative, this paper investigates a direct implementation of a state of the art Convolutional Neural Network (CNN) in an array language. While our implementation requires 150 lines of code to define the special-purpose operators needed for CNNs, which are readily provided through frameworks such as TensorFlow and PyTorch, our implementation outperforms these frameworks by factors 2 and 3 on a fixed set of hardware — a 64-core GPU-accelerated machine. The resulting specification is written in a rank-polymorphic data-parallel style, and it can be immediately leveraged by optimising compilers. Indeed, array languages make neural networks fast.

Abstract:This paper reports on the acceleration of a standard, lattice-based numerical algorithm that is widely used in finance for pricing a class of fixed-income vanilla derivatives. We start with a high-level algorithmic specification, exhibiting irregular-nested parallelism, which is challenging to map efficiently to GPU hardware. From it we systematically derive and optimize two CUDA implementations, which utilize only the outer or all levels of parallelism, respectively. A detailed evaluation demonstrates (i) the high impact of the proposed optimizations, (ii) the complementary strength and weaknesses of the two GPU versions, and that (iii) they are on average 2.4x faster than our well-tuned CPU-parallel implementation (OpenMP+AVX2) running on 104 hardware threads, and by 3-to-4 order of magnitude faster than an OpenMP-parallel implementation using the popular QuantLib library.

Abstract:The goal of this paper is to demonstrate performance enhancements of the high performance dense linear algebra matrix-matrix multiply DGEMM kernel, widely implemented by vendors in the basic linear algebra subroutine BLAS library. The mathematics of arrays (MoA) paradigm due to Mullin (1988) results in contiguous memory accesses in combination with Church-Rosser complete language constructs optimized for target processor architectures. Our performance studies demonstrate that the MoA implementation of DGEMM combined with optimal cache-blocking strategies results in at least a 25% performance gain on both Intel Xeon Skylake and IBM Power-9 processors over the vendor supplied Intel MKL and IBM ESSL basic linear algebra libraries. Results are presented for the NREL Eagle and ORNL Summit supercomputers.

Abstract:DynaSOAr is a dynamic object allocator for GPGPU that enables object-oriented programming with an efficient structure-of-arrays (SOA) memory layout. One of the limitations in DynaSOAr is its poor support for nested objects. When a class has a field of another class, the fields of the inner class are allocated in an arrays-of-structure layout. This paper proposes a technique that translates nested class definitions into flat ones by inlining inner classes into top-level classes. We implemented this technique as a Sanajeh domain-specific language that translates Python class definitions into C++ classes using DynaSOAr. Our preliminary evaluation showed that Sanajeh executes a parallel benchmark program with nested objects at almost the same speed as the one with manually flatten classes.

Abstract:Choosing an appropriate data layout can have a significant impact on the performance of any application. In this presentation we outline the importance of having a flexible way of modifying the data layout to enable more aggressive optimizations. First, we discuss how changing the data layout for complex numbers may enable more efficient SIMD vectorization for Fourier transforms and complex matrix-matrix multiplication. Then, we focus on how changing the data distribution helps in scaling three dimensional Fourier transforms on thousands of compute nodes. Finally, we look at how modifying the data layout enables memory bound operations like Fourier transforms and block sparse computations to be merged to improve data locality and reduce data movement. Using these three examples, we emphasize the need for methods and languages that allow data layouts and data layout changes to be easily expressed in order to achieve higher performance gains.