
F90toCUDA: Compiling FORTRAN 90 Programs to CUDA

Supervisor

Jeyan Thiyagalingam

Abstract

General-purpose computing on graphics processing units has proved to be a viable and cost-effective approach to obtaining substantial performance improvements for scientific applications, especially compared to traditional CPUs. However, obtaining significant improvements requires carefully crafted programs that make efficient use of the architectural peculiarities of these devices. With a rather diverse set of architectures in the community, specialising every application for every system individually is difficult. Although programming frameworks such as CUDA and OpenCL do exist, they operate at a relatively low level of abstraction, which makes efficient use of these devices a challenging task. In this project, we propose a higher-level approach: we suggest that high-level, data-parallel Fortran 90 programs can be auto-parallelised to utilise such graphics cards effectively.


1 Introduction
In recent years, the processing capability of graphics processing units (GPUs) has improved so significantly that
they are used to accelerate both scientific kernels and real-world computational problems [3, 6]. Among others, two main
features of these architectures render them attractive: large numbers of cores are readily available to end-users, and their
cost per MFLOP is very low compared to large-scale supercomputing systems. Their peak performance figures have already
exceeded those of multi-core CPUs while being available at a fraction of the cost of dedicated systems. Support from
industry has improved this setting further: CUDA (Compute Unified Device Architecture)
from nVidia [4], the Stream SDK from AMD-ATI [1] and the directive-based approach from PGI are three major GPU
programming models. The recent initiative to develop a unified programming model, OpenCL [2], confirms this interest further.
Despite these advancements, obtaining a significant performance improvement using these devices is still a challenge.
Firstly, significant speed-ups are not always possible simply by re-writing existing applications or by tagging loops with
directives: the programmer is responsible for identifying the parallelism. Secondly, the programming model is not oblivious
to the underlying architecture. Knowledge of the architecture and its details is fundamental to writing better programs
for these devices. For instance, in CUDA, data is not managed automatically as in CPU-based cache systems. Instead,
the programmer is responsible for moving the appropriate data to the appropriate memory. An incorrect choice of
memory type may lead to dismal performance.


Instead of hand-tuning programs for performance, the process should be automated so that the techniques can be mapped
to different architectures as they evolve. This project therefore proposes to auto-parallelise high-level, data-parallel programs
written in Fortran 90 to CUDA. We will not be parallelising entire programs; instead, we will selectively identify
portions of programs which may potentially benefit from parallelisation, and only those regions will be parallelised. We will
be using the ROSE compiler framework [5] to perform this parallelisation. The long-term vision of this project
is to:
1. Derive a robust compilation scheme which maps data-parallel operations of Fortran 90 into equivalent CUDA code;
2. Analyse, identify and evolve a set of transformations to be performed on Fortran programs;
3. Build a cost model for justifying these transformations;
4. Use the ROSE compiler framework to perform partial and selective source-to-source program transformation, extending
the framework where it lacks a feature (for example, CUDA-specific constructs);
5. Evaluate the effectiveness of the approach using a set of benchmarks and non-trivial scientific applications.


2 Selective Source-to-Source Transformation
As highlighted above, transforming entire Fortran programs is not the aim of this project, and it may not even be feasible. Instead,
only regions which are computationally intensive and exhibit a certain amount of parallelism will be mapped to CUDA.
Our experience in compiler transformations suggests that DO loops and chained array expressions may expose a certain
amount of parallelism, making them ideal targets for transformation. These identified regions will be
mapped to CUDA using the ROSE compiler framework. The loops inside the original programs are then replaced with a
call to the generated CUDA function, which will be evaluated on the device. This also means that we may have to
transfer any necessary data to and from the device as demanded by the computation. The ROSE compiler framework will
help us automate the entire approach. In particular:
• We will use the Fortran front end inside ROSE to read and translate Fortran programs into the Abstract
Syntax Tree (AST) format.
• This AST will be manipulated to auto-parallelise the regions which we expect could benefit from parallelisation. We
will use the existing analysis APIs inside the ROSE framework to achieve this.
• The final AST will be used to generate the Fortran and CUDA sources. The ROSE framework provides a
number of backends for code generation. Since we have two different source forms to generate, we will use
the Fortran and C backends to cover the Fortran and CUDA sources, respectively. The C backend needs to be extended
slightly to cover the CUDA form.
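
To make the intended mapping concrete, the following toy sketch translates the restricted DO-loop shape discussed above into CUDA kernel text. It is purely illustrative and works on raw source text in Python; the actual project would perform the same mapping on ROSE's AST, and names such as generated_kernel are our own invention, not part of any scheme.

```python
import re

def do_loop_to_cuda(src):
    """Translate one restricted Fortran DO-loop shape into CUDA kernel text.

    Only the elementwise pattern
        DO i = 1, n
          a(i) = <expression over arrays indexed by i>
        END DO
    is recognised; everything else is rejected.  A real translator would
    perform this mapping on the ROSE AST rather than on raw source text.
    """
    m = re.search(r"DO\s+(\w+)\s*=\s*1\s*,\s*(\w+)\s*\n"
                  r"\s*(\w+\(\w+\)\s*=\s*.+?)\s*\n"
                  r"\s*END\s+DO", src, re.IGNORECASE)
    if m is None:
        raise ValueError("unsupported loop shape")
    ivar, bound, stmt = m.groups()
    # Every identifier subscripted by the loop variable is treated as an array.
    arrays = sorted(set(re.findall(r"(\w+)\(%s\)" % ivar, stmt)))
    # Fortran subscripts are 1-based, CUDA threads 0-based: keep the 1-based
    # loop variable and subtract 1 in each subscript.
    body = re.sub(r"(\w+)\(%s\)" % ivar, r"\1[%s - 1]" % ivar, stmt)
    params = ", ".join("float *%s" % a for a in arrays)
    return ("__global__ void generated_kernel(%s, int %s) {\n"
            "  int %s = blockIdx.x * blockDim.x + threadIdx.x + 1;\n"
            "  if (%s <= %s) %s;\n"
            "}\n" % (params, bound, ivar, ivar, bound, body))
```

Applied to DO i = 1, n / a(i) = b(i) + c(i) / END DO, this yields a guarded elementwise kernel; the host-side call, grid configuration and data transfers would have to be generated alongside it, which is precisely what the replacement function call in the original program must encapsulate.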


3 Components of the Project
The project consists of well-balanced research and development parts. The research component expects the candidate
to follow up existing work aligned with the motivation outlined in this proposal. In some cases, it may be necessary to
extend this work or to develop novel ideas and algorithms.
4 Workload and Work Schedule
This document has outlined the long-term vision of the project. It is neither necessary nor possible to address all the issues outlined
above; the key aim is to demonstrate the feasibility and benefits of the approach. There are therefore several avenues
along which the project can be accomplished, included as different schedules below.
It is possible for a candidate to choose one or two transformations and implement them without any cost model, or to develop
the cost model alone. Where more than one candidate opts for this project, or where this project is awarded as a
group project, the workload can be shared.
The following are some of the schedules we consider most sensible. We will be happy to accept any extended schedule
suggested by the candidate, for example one combining Schedules 2 and 3.


Schedule 1

1. Read key papers on the ROSE compiler framework, the CUDA architecture and program transformation
techniques.
2. Understand the existing compilation scheme for mapping Fortran programs to CUDA and refine the scheme if necessary.
3. Study the key set of transformations. For example, it may be worth starting off with:
• DO loops and
• Chained array expressions
4. Use the compilation scheme to perform the transformations manually.
5. Benchmark all manually mapped transformations.
6. Analyse, identify and evolve a set of transformations to be performed on Fortran programs.
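
To illustrate why chained array expressions are attractive transformation targets, the following first-order sketch (our own back-of-the-envelope count, not a measurement) compares the global-memory words moved by an elementwise chain such as a = b + c * d when each operation is evaluated separately versus when the chain is fused into a single kernel:

```python
def chained_traffic(num_inputs, num_ops, n):
    """First-order global-memory traffic (in words) for an elementwise
    chained array expression over arrays of length n.

    Unfused: each operation reads two operands and writes a temporary,
             so every intermediate makes a round trip through memory.
    Fused:   each input array is read once and the result written once.
    Caching effects are deliberately ignored."""
    unfused = num_ops * 3 * n        # 2 reads + 1 write per operation
    fused = (num_inputs + 1) * n     # each input once, plus one final write
    return unfused, fused

# a = b + c * d: three input arrays, two operations.
unfused, fused = chained_traffic(3, 2, 1_000_000)
```

Under this crude model the fused form moves 4 million words instead of 6 million for a = b + c * d, and the gap widens with longer chains, which is what makes fusion a candidate transformation worth benchmarking in step 5.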


Schedule 2

1. Read key papers on the ROSE compiler framework, the CUDA architecture, program transformation
techniques, Fortran expressions and cost models.
2. Understand the CUDA mappings by studying the existing compilation scheme for mapping Fortran programs to CUDA.
3. Focus on a key set of transformations and evaluate the Fortran and CUDA mappings, with the sole aim of assessing any gains.
For example, it may be worth starting off with:
• DO loops and
• Chained array expressions
For example, while some chained expressions remain attractive to execute on CUDA devices, others
may not be worthwhile at all because of their transfer-to-computation ratio.
4. Use a set of benchmark kernels to demonstrate the applicability of the cost model.
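
The transfer-to-computation trade-off above can be captured by even a very crude cost model. The sketch below uses invented, order-of-magnitude rates (the GFLOP/s and PCIe figures are assumptions for illustration, not measurements of any particular device) to decide whether offloading an expression pays off:

```python
def offload_pays_off(n, flops_per_element, words_moved,
                     cpu_gflops=10.0, gpu_gflops=100.0, pcie_gb_per_s=4.0):
    """Crude cost model: offload only if GPU compute time plus PCIe
    transfer time beats the CPU time.  All rates are illustrative
    assumptions, not measurements."""
    cpu_time = n * flops_per_element / (cpu_gflops * 1e9)
    gpu_time = n * flops_per_element / (gpu_gflops * 1e9)
    transfer_time = words_moved * 4 / (pcie_gb_per_s * 1e9)  # 4-byte words
    return gpu_time + transfer_time < cpu_time

# a = b + c over a million elements: 1 flop each, 3 million words moved.
# The transfer dominates, so the expression should stay on the CPU.
cheap = offload_pays_off(10**6, 1, 3 * 10**6)       # False
# A compute-heavy kernel over the same data amortises the transfer.
heavy = offload_pays_off(10**6, 500, 3 * 10**6)     # True
```

A real cost model would be calibrated against the benchmark kernels of step 4 rather than against assumed rates, but even this sketch reproduces the qualitative point made above: low-arithmetic-intensity chained expressions are dominated by the transfer cost.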


Schedule 3
1. Read key papers on the ROSE compiler framework, the CUDA architecture, program transformation
techniques and Fortran expressions.
2. Understand the CUDA mappings by studying the existing compilation scheme for mapping Fortran code to CUDA.
3. Familiarise yourself with the ROSE compiler framework. This may involve reading manuals and papers based on the ROSE compiler
framework, exploring the source code, and writing example programs and transformations for the ROSE framework.
4. Augment the C backend of the ROSE framework to address the CUDA specifics. Since the CUDA programming model
relies on an extended, C-like language, this phase can make use of the C backend inside ROSE.
5. Map a set of pre-determined Fortran constructs to CUDA using this backend. Since the ROSE framework includes a
Fortran front end, the Fortran AST can be obtained rather easily. However, the AST may need annotations, or colouring,
to make use of the CUDA backend.
6. Use a set of benchmark kernels to demonstrate the applicability of the transformation engine.
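
The annotation (colouring) mentioned in step 5 can be sketched as follows. The Node class and the rule "colour every DO-loop subtree for CUDA" are simplifying assumptions of ours; in the actual project the criteria would come from ROSE's analyses and the colours would drive the choice of backend during unparsing.

```python
class Node:
    """Minimal stand-in for a ROSE AST node."""
    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = list(children)
        self.colour = None       # filled in by the colouring pass

def colour_for_backend(node, inherited="fortran"):
    """Colour each subtree with the backend that should unparse it.
    Toy rule: DO-loop subtrees go to the CUDA backend; everything
    else inherits its parent's colour (Fortran at the root)."""
    node.colour = "cuda" if node.kind == "do_loop" else inherited
    for child in node.children:
        colour_for_backend(child, node.colour)

# A toy program: one scalar assignment, then one offloadable loop.
loop = Node("do_loop", [Node("assign")])
program = Node("program", [Node("assign"), loop])
colour_for_backend(program)
```

After the pass, the unparser would emit Fortran for "fortran"-coloured nodes and hand "cuda"-coloured subtrees to the extended C backend of step 4.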


5 Candidate
This project is aimed at the Masters level. A strong commitment from the student, the desire to complete the project, substantial
C/C++ skills and knowledge of graphs are expected. Knowledge of Fortran will be an added advantage but is not
necessary.


6 Contact
Jeyan Thiyagalingam and Anne Trefethen, OeRC
Email: jeyarajan.thiyagalingam@oerc.ox.ac.uk


References
[1] AMD Inc. Stream SDK. http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx.
[2] Khronos Group. OpenCL Specification 1.0. http://www.khronos.org/opencl/.
[3] Y. C. Luo and R. Duraiswami. Canny edge detection on NVIDIA CUDA. In Computer Vision on GPU, pages 1–8, 2008.
[4] nVidia Corporation. CUDA Programming Guide 2.0.
[5] D. J. Quinlan. ROSE: Compiler support for object-oriented frameworks. Parallel Processing Letters, 10(2/3):215–226,
2000.
[6] A. J. R. Ruiz and L. M. Ortega. Geometric algorithms on CUDA. In GRAPP, pages 107–112, 2008.