PLUTO - An automatic parallelizer and locality optimizer for multicores
PLUTO is an automatic parallelization tool based on the polyhedral model. The polyhedral model for compiler optimization provides an abstraction for performing high-level transformations, such as loop-nest optimization and parallelization, on affine loop nests. Pluto transforms C programs from source to source for coarse-grained parallelism and data locality simultaneously. The core transformation framework works mainly by finding affine transformations for efficient tiling and fusion, but is not limited to those. The scheduling algorithm used by Pluto was published in the CC 2008 paper listed below.
OpenMP parallel code for multicores can be generated automatically from sequential C program sections. Outer, inner, or pipelined parallelization is achieved purely with OpenMP pragmas, in addition to register tiling and making the code amenable to auto-vectorization. An experimental evaluation and comparison with previous techniques can be found in the PLDI 2008 paper listed below. Though the tool is fully automatic (C to OpenMP C), a number of options are provided (both on the command line and through meta files) to tune aspects such as tile sizes, unroll factors, and the outer loop fusion structure. CLooG-ISL is used for code generation.
A beta release can be downloaded below. A version with support for generating CUDA code is also available. The git version is the active development version.
- Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model
Uday Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan.
International Conference on Compiler Construction (ETAPS CC), Apr 2008, Budapest, Hungary.
- A Practical Automatic Polyhedral Parallelizer and Locality Optimizer
Uday Bondhugula, A. Hartono, J. Ramanujam, P. Sadayappan.
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Jun 2008, Tucson, Arizona.
The Pluto scheduling algorithm, with extensions and further improvements, is also being used in some production compilers including IBM XL and RSTREAM from Reservoir Labs.
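As a rough illustration of the kind of source-to-source rewrite described above, the sketch below places a sequential matrix-multiply loop nest, delimited by `#pragma scop` and `#pragma endscop` (the markers that identify the region for Pluto to transform), next to a hand-written tiled and OpenMP-parallel version. The tiled routine is not actual Pluto output. The function names, the problem size `N`, and the tile size `TS` are assumptions chosen for this example.

```c
#define N  64   /* problem size: an assumption for this sketch */
#define TS 16   /* tile size: an assumption; must divide N in this sketch */

static double A[N][N], B[N][N];
static double C_ref[N][N], C_tiled[N][N];

/* Fill the inputs with small integers so that all sums are exact in double. */
static void init_inputs(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)((i + j) % 7);
            B[i][j] = (double)((i * 3 + j) % 5);
            C_ref[i][j] = 0.0;
            C_tiled[i][j] = 0.0;
        }
}

/* Sequential input: the scop pragmas delimit the affine loop nest. */
static void matmul_original(void)
{
    int i, j, k;
#pragma scop
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C_ref[i][j] += A[i][k] * B[k][j];
#pragma endscop
}

/* Hand-written tiled version. The tile loops (it, jt, kt) enumerate
 * TS x TS x TS blocks. Different values of it write disjoint rows of
 * C_tiled, so the outermost tile loop can run in parallel. */
static void matmul_tiled(void)
{
    int it, jt, kt, i, j, k;
#pragma omp parallel for private(jt, kt, i, j, k)
    for (it = 0; it < N; it += TS)
        for (jt = 0; jt < N; jt += TS)
            for (kt = 0; kt < N; kt += TS)
                for (i = it; i < it + TS; i++)
                    for (k = kt; k < kt + TS; k++)
                        for (j = jt; j < jt + TS; j++)
                            C_tiled[i][j] += A[i][k] * B[k][j];
}

/* Returns 1 if both versions produced identical results, 0 otherwise. */
static int results_match(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (C_ref[i][j] != C_tiled[i][j])
                return 0;
    return 1;
}
```

Calling `init_inputs()`, then both multiply routines, then `results_match()` from a `main` function compares the two versions. An OpenMP flag (for example `-fopenmp` with GCC) is needed for the parallel pragma to take effect; without it the code still runs sequentially.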
Please post any questions or comments related to installation, usage, or development (bug reports, patches, and requests for new features) to pluto-development on Google Groups.
All libraries that Pluto depends on (PipLib, PolyLib, and
CLooG) are included and are built automatically.
So nothing needs to be installed separately.
Pluto with support for diamond tiling is available in its git repository here.
(BETA) (this version is no longer maintained)
See examples_ccuda/ and README_CCUDA/ in the top-level directory.
- tar zxvf pluto-0.9.0.tar.gz
- cd pluto-0.9.0/
- ./configure [--enable-debug]
Public repository: http://repo.or.cz/w/pluto.git
$ git clone git://repo.or.cz/pluto.git
$ cd pluto/
$ git submodule init
$ git submodule update
$ ./bootstrap.sh
$ ./configure [--enable-debug]
$ make
$ ./polycc test/seidel.c --tile --parallel
Number of variables: 3
Number of parameters: 2
Maximum domain dimensionality: 3
Number of loops: 3
Number of poly deps: 27
(PLUTO) Affine transformations (<var coeff's> <const>)
T(S1): (t, t+i, 2t+i+j)
3 4
1 0 0 0
1 1 0 0
2 1 1 0
t0 --> fwd_dep loop (band 0)
t1 --> fwd_dep loop (band 0)
t2 --> fwd_dep loop (band 0)
[Pluto] Outermost tilable band: t0--t2
[Pluto] After tiling:
t0 --> serial tLoop (band 0)
t1 --> parallel tLoop (band 0)
t2 --> fwd_dep tLoop (band 0)
t3 --> fwd_dep loop (band 0)
t4 --> fwd_dep loop (band 0)
t5 --> fwd_dep loop (band 0)
[PLUTO] using CLooG -f/-l options: 4 6
Output written to ./seidel.par.c
$ icc -O3 -openmp -DTIME seidel.par.c -o par -lm
seidel.par.c(43): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
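The transformation T(S1): (t, t+i, 2t+i+j) reported above makes the three loops a tilable band because every dependence has non-negative components along each of the three hyperplanes. The sketch below checks this numerically. The nine distance vectors are an assumption: they were derived by hand for a standard in-place 9-point Gauss-Seidel update over (t, i, j) and are not read from test/seidel.c or from Pluto's dependence analysis. The function and array names are likewise hypothetical.

```c
#define NDEPS 9

/* Assumed dependence distance vectors (t, i, j) for an in-place 9-point
 * Gauss-Seidel stencil; hand-derived for this sketch. */
static const int deps[NDEPS][3] = {
    {0, 0, 1}, {0, 1, -1}, {0, 1, 0}, {0, 1, 1},
    {1, -1, -1}, {1, -1, 0}, {1, -1, 1}, {1, 0, -1}, {1, 0, 0}
};

/* Rows are the hyperplanes t, t+i, 2t+i+j from the output above. */
static const int T_pluto[3][3] = {
    {1, 0, 0},
    {1, 1, 0},
    {2, 1, 1}
};

/* The original loop order (t, i, j), for comparison. */
static const int T_identity[3][3] = {
    {1, 0, 0},
    {0, 1, 0},
    {0, 0, 1}
};

/* Returns 1 if every hyperplane (row of T) has a non-negative dot product
 * with every distance vector, i.e. the rows form a permutable band that
 * can be tiled rectangularly; returns 0 otherwise. */
static int all_components_nonnegative(const int T[3][3])
{
    for (int d = 0; d < NDEPS; d++)
        for (int r = 0; r < 3; r++) {
            int dot = 0;
            for (int c = 0; c < 3; c++)
                dot += T[r][c] * deps[d][c];
            if (dot < 0)
                return 0;
        }
    return 1;
}
```

Under these assumed distance vectors, `all_components_nonnegative(T_pluto)` returns 1, while `all_components_nonnegative(T_identity)` returns 0 because vectors such as (0, 1, -1) have a negative component in the original loop order. This is why the nest is skewed before it is tiled.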
The results summarized below were obtained with Pluto git version 114e419014c6f14b3f193726e951b935ad120466 (25/05/2011) on an
Intel Core i7 870 (2.93 GHz) with 4 GB DDR3-1333 RAM, running 32-bit Linux 2.6.35, with ICC 12.0.3. Examples, problem sizes, etc.
are in the examples/ directory (or see the examples/ tree in git). Actual running times (in seconds) are here.
All results below (old) were obtained on an Intel Core2 Quad Q6600 running Linux x86-64 (2.6.18), with ICC 10.1 used to compile both the original and the transformed/parallelized codes. These results are from pluto-0.0.1.
Imperfect Jacobi stencil
ATLAS was tuned with GCC 4.1. kSelMM perf numbers were approximated using the timing report produced after tuning ATLAS ('make time'). kSelMM, kGenMM perf on 2, 3 cores was interpolated (linearly) from perf on 1 and 4 cores.
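The linear interpolation mentioned above can be sketched as follows. The function name is hypothetical, and the performance unit is whatever the measurements at 1 and 4 cores use; no actual measurements from the evaluation are reproduced here.

```c
/* Linearly interpolate performance at `cores` (between 1 and 4) from
 * measured values at 1 core and at 4 cores. */
static double interpolate_perf(double perf_1core, double perf_4core, int cores)
{
    return perf_1core + (cores - 1) * (perf_4core - perf_1core) / 3.0;
}
```

With this form, the 2-core estimate lies one third of the way from the 1-core value to the 4-core value, and the 3-core estimate lies two thirds of the way.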