PLUTO - An automatic parallelizer and locality optimizer for affine loop nests
PLUTO is an automatic parallelization tool based on the polyhedral model. The polyhedral model for compiler optimization provides an abstraction to perform high-level transformations such as loop-nest optimization and parallelization on affine loop nests. Pluto transforms C programs from source to source for coarse-grained parallelism and data locality simultaneously. The core transformation framework mainly works by finding affine transformations for efficient tiling. The scheduling algorithm used by Pluto is described in the CC 2008 paper listed below.
OpenMP parallel code for multicores can be generated automatically from sequential C program sections. Outer (communication-free), inner, or pipelined parallelization is achieved purely with OpenMP parallel for pragmas; the code is also optimized for locality and made amenable to auto-vectorization. An experimental evaluation and comparison with previous techniques can be found in the PLDI 2008 paper listed below.
Though the tool is fully automatic (C to OpenMP C), a number of options are provided (both on the command line and through meta files) to tune aspects like tile sizes, unroll factors, and outer loop fusion structure. CLooG-ISL is used for code generation.
A beta release can be downloaded below. Unless one wants to compare purely with the latest release of Pluto, the 'pet' branch of the git version is the recommended one: it supports the more robust Pet frontend, which is based on LLVM/Clang.
- Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model
[PDF | BibTeX ]
Uday Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan.
International Conference on Compiler Construction (ETAPS CC), Apr 2008, Budapest, Hungary.
- A Practical Automatic Polyhedral Parallelizer and Locality Optimizer
Uday Bondhugula, A. Hartono, J. Ramanujam, P. Sadayappan.
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Jun 2008, Tucson, Arizona.
The Pluto scheduling algorithm, with extensions and further improvements, is also being used in some production compilers including IBM XL and RSTREAM from Reservoir Labs.
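As a rough illustration of the source-to-source idea described at the top of this page, the sketch below pairs a sequential affine loop nest with a hand-written tiled, OpenMP-parallel version of the same computation. It is not Pluto's actual output: the function names, the problem size `N`, the tile size `TILE`, and the checking helper are all assumptions made for this example. The `#pragma scop` / `#pragma endscop` markers show how the region to be transformed is typically delimited for polycc.

```c
#include <assert.h>

#define N 64
#define TILE 16 /* hypothetical tile size; N must be a multiple of it */

/* Sequential input: an affine loop nest, delimited for the parallelizer. */
void matmul_seq(double C[N][N], double A[N][N], double B[N][N])
{
    int i, j, k;
#pragma scop
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
#pragma endscop
}

/* Hand-written stand-in for tiled, outer-parallel code (not generated by Pluto). */
void matmul_tiled_par(double C[N][N], double A[N][N], double B[N][N])
{
    int it, jt, kt, i, j, k;
    assert(N % TILE == 0);
    /* Each value of 'it' touches a disjoint block of rows of C, so the
       outermost tile loop can run in parallel without communication. */
#pragma omp parallel for private(jt, kt, i, j, k)
    for (it = 0; it < N; it += TILE)
        for (jt = 0; jt < N; jt += TILE)
            for (kt = 0; kt < N; kt += TILE)
                for (i = it; i < it + TILE; i++)
                    for (j = jt; j < jt + TILE; j++)
                        for (k = kt; k < kt + TILE; k++)
                            C[i][j] += A[i][k] * B[k][j];
}

/* Returns 1 if both versions produce identical results on a small test input. */
int tiled_matches_sequential(void)
{
    static double A[N][N], B[N][N], C1[N][N], C2[N][N];
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            A[i][j] = (double)((i + 2 * j) % 7);
            B[i][j] = (double)((3 * i + j) % 5);
            C1[i][j] = 0.0;
            C2[i][j] = 0.0;
        }
    matmul_seq(C1, A, B);
    matmul_tiled_par(C2, A, B);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            if (C1[i][j] != C2[i][j])
                return 0;
    return 1;
}
```

Tiling improves reuse of the blocks of A, B, and C that fit in cache, while the parallel pragma on the outermost tile loop gives coarse-grained parallelism. Without an OpenMP flag the pragma is ignored and the tiled version simply runs sequentially.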
Please post any questions or comments related to installation, usage, or development (bug reports, patches, and requests for new features) at pluto-development on Google Groups.
All libraries that Pluto depends on (PipLib, PolyLib, and CLooG) are included and are built automatically, so nothing needs to be installed separately.
Users who intend to experiment with Pluto by adding new functionality or modifying it in some way are strongly advised to use the development git version (below) instead. The git version also includes tags that correspond to the releases above.
- tar zxvf pluto-0.11.3.tar.gz
- cd pluto-0.11.3/
- ./configure [--enable-debug]
- make -j4
$ ./polycc test/seidel.c --tile --parallel
Number of variables: 3
Number of parameters: 2
Maximum domain dimensionality: 3
Number of loops: 3
Number of poly deps: 27
(PLUTO) Affine transformations (<var coeff's> <const>)
T(S1): (t, t+i, 2t+i+j)
3 4
1 0 0 0
1 1 0 0
2 1 1 0
t0 --> fwd_dep loop (band 0)
t1 --> fwd_dep loop (band 0)
t2 --> fwd_dep loop (band 0)
[Pluto] Outermost tilable band: t0--t2
[Pluto] After tiling:
t0 --> serial tLoop (band 0)
t1 --> parallel tLoop (band 0)
t2 --> fwd_dep tLoop (band 0)
t3 --> fwd_dep loop (band 0)
t4 --> fwd_dep loop (band 0)
t5 --> fwd_dep loop (band 0)
[PLUTO] using CLooG -f/-l options: 4 6
Output written to ./seidel.par.c
icc -O3 -openmp -DTIME seidel.par.c -o par -lm
seidel.par.c(43): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
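The reported transformation T(S1) = (t, t+i, 2t+i+j) can be checked by hand. A band of loops is tilable when every dependence has a non-negative component along every loop of the band. The sketch below applies the three rows of the matrix printed above (without the constant column) to a set of dependence distance vectors. The nine vectors are an assumption: they are what one would expect if test/seidel.c is the usual in-place 9-point Gauss-Seidel update over (t, i, j); the function and array names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define NDEPS 9

/* Rows of T(S1) from the output above, variable coefficients only. */
static const int T[3][3] = {
    {1, 0, 0}, /* t          */
    {1, 1, 0}, /* t + i      */
    {2, 1, 1}, /* 2t + i + j */
};

/* The original loop order (t, i, j), for comparison. */
static const int IDENTITY[3][3] = {
    {1, 0, 0},
    {0, 1, 0},
    {0, 0, 1},
};

/* Assumed dependence distances (dt, di, dj) of an in-place 9-point stencil. */
static const int DEPS[NDEPS][3] = {
    {0, 0, 1},  {0, 1, -1}, {0, 1, 0},
    {0, 1, 1},  {1, -1, -1}, {1, -1, 0},
    {1, -1, 1}, {1, 0, -1}, {1, 0, 0},
};

/* Component of distance vector d along one row of the schedule M. */
int transformed_component(const int M[3][3], int row, const int d[3])
{
    assert(row >= 0 && row < 3);
    return M[row][0] * d[0] + M[row][1] * d[1] + M[row][2] * d[2];
}

/* True if every dependence is non-negative along every row of M,
   i.e. the three loops form a permutable (tilable) band. */
bool is_permutable(const int M[3][3])
{
    for (int k = 0; k < NDEPS; k++)
        for (int row = 0; row < 3; row++)
            if (transformed_component(M, row, DEPS[k]) < 0)
                return false;
    return true;
}
```

Under these assumptions `is_permutable(T)` holds, which is consistent with all three loops being reported as fwd_dep loops in one band, while `is_permutable(IDENTITY)` fails because a distance such as (0, 1, -1) is negative along j. That is why the original loop order cannot be tiled rectangularly without the skewing.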
Public repository: http://repo.or.cz/w/pluto.git
For all stencil optimization experiments (with diamond tiling), it is strongly recommended that one use the pet branch. The master branch does not include support to use Pet as a frontend, and can be used for quick installation and experimentation (since one will not need to install Clang/LLVM that Pet relies on). Pluto's pet branch can be obtained as described below.
$ git clone git://repo.or.cz/pluto.git -b pet
$ cd pluto/
$ git submodule init
$ git submodule update
$ ./apply_patches.sh
$ ./autogen.sh
$ ./configure [--enable-debug]
$ make -j4
$ cd examples/heat-2d/
$ make orig orig_par tiled par lbpar
$ ./orig
Number of points = 2560000
|Number of timesteps = 1000
|Time taken = 5993.42300ms
|MFLOPS = 4271.348777
|sum: 1.270583e+09
|rms(A) = 1269788995.82
|sum(rep(A)) = 30514486
$ OMP_NUM_THREADS=4 ./orig_par
Number of points = 2560000
|Number of timesteps = 1000
|Time taken = 5991.34300ms
|MFLOPS = 4272.831651
|sum: 1.270583e+09
|rms(A) = 1269788995.82
|sum(rep(A)) = 30514486
$ ./tiled
Number of points = 2560000
|Number of timesteps = 1000
|Time taken = 3360.58500ms
|MFLOPS = 7617.721319
|sum: 1.270583e+09
|rms(A) = 1269788995.82
|sum(rep(A)) = 30514486
$ OMP_NUM_THREADS=4 ./par
Number of points = 2560000
|Number of timesteps = 1000
|Time taken = 1500.13100ms
|MFLOPS = 17065.176308
|sum: 1.270583e+09
|rms(A) = 1269788995.82
|sum(rep(A)) = 30514486
$ OMP_NUM_THREADS=4 ./lbpar 2> out_lbpar4
Number of points = 2560000
|Number of timesteps = 1000
|Time taken = 1285.32500ms
|MFLOPS = 19917.141579
|sum: 1.270583e+09
|rms(A) = 1269788995.82
|sum(rep(A)) = 30514486
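To read the heat-2d timings above as speedups, one can divide the baseline time by each variant's time. The small sketch below does that arithmetic; the constants are the "Time taken" values copied from the runs above, while the function and constant names are made up for this example.

```c
#include <assert.h>

/* "Time taken" values from the runs above, in milliseconds. */
static const double ORIG_MS     = 5993.423; /* ./orig                     */
static const double ORIG_PAR_MS = 5991.343; /* ./orig_par, 4 threads      */
static const double TILED_MS    = 3360.585; /* ./tiled                    */
static const double PAR_MS      = 1500.131; /* ./par, 4 threads           */
static const double LBPAR_MS    = 1285.325; /* ./lbpar, 4 threads         */

/* Speedup of a variant relative to a baseline, from wall-clock times. */
double speedup(double baseline_ms, double variant_ms)
{
    assert(baseline_ms > 0.0 && variant_ms > 0.0);
    return baseline_ms / variant_ms;
}
```

With these numbers, `speedup(ORIG_MS, TILED_MS)` is about 1.8, `speedup(ORIG_MS, PAR_MS)` about 4.0, and `speedup(ORIG_MS, LBPAR_MS)` about 4.7, whereas `speedup(ORIG_MS, ORIG_PAR_MS)` stays at roughly 1.0.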
Distributed-memory (MPI) Code Generation Support
Support for generating MPI code can be found in the 'distmem' branch of the development git version (repository info above). Once it is checked out, see README.distmem. Distributed-memory code generation support is not included in any of the releases. The techniques precisely determine the communication sets for a tile, pack them into contiguous buffers, send/receive, and unpack; they are described in the following two papers.
- Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures [PDF, tool, slides]
ACM/IEEE Supercomputing (SC '13), Nov 2013, Denver, USA.
- Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory [PDF, tool]
Roshan Dathathri, Chandan G, Thejas Ramashekar, Uday Bondhugula
International Conference on Parallel Architectures and Compilation Techniques (PACT 2013), Sep 2013, Edinburgh, UK.
To use this, check out the 'distmem' branch from git and see README.distmem.
$ git clone git://repo.or.cz/pluto.git -b distmem
$ git submodule init
$ git submodule update
$ ./apply_patches.sh
$ ./autogen.sh
$ ./configure; make -j4
$ cd examples/heat-3d
$ ../../polycc heat-3d.c --distmem --mpiomp --commopt_fop --tile --isldep --lastwriter --cloogsh --commreport -o heat-3d.distopt_fop.c
$ mpicc -cc=icc -D__MPI -O3 -fp-model precise -ansi-alias -ipo -openmp -DTIME heat_3d_np.distopt_fop.c sigma_heat_3d_np.distopt_fop.c pi_heat_3d_np.distopt_fop.c \
  ../../polyrt/polyrt.c -o distopt_fop -I ../../polyrt -lm
$ mpirun_rsh -np 16 -hostfile ../hosts MV2_ENABLE_AFFINITY=0 OMP_NUM_THREADS=4 ./distopt_fop
-------
SUMMARY
Write-out time spent in master node: 0.000263 s
Maximum computation time spent across all nodes: 65.769280 s
Maximum communication time spent across all nodes: 0.001323 s
Maximum packing time spent across all nodes: 0.001055 s
Maximum unpacking time spent across all nodes: 0.000229 s
Maximum total time spent across all nodes: 65.771958 s
-------
time = 66.327028s
time = 66.270790s
time = 66.169645s
time = 66.233985s
time = 66.186279s
time = 66.257692s
time = 66.275415s
time = 66.198354s
time = 66.226861s
time = 66.221464s
time = 66.285732s
time = 66.188053s
time = 66.198346s
time = 66.206508s
time = 66.136301s
time = 66.235197s
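The pack, send/receive, and unpack steps mentioned for the distributed-memory support can be pictured with a minimal sketch. It is not the code Pluto generates: the grid shape, the function names, and the use of `memcpy` as a stand-in for an MPI send/receive pair are all assumptions chosen so that the example runs without an MPI installation. It packs a non-contiguous boundary column of one tile into a contiguous buffer, "transfers" it, and unpacks it into the ghost column of a neighbouring tile.

```c
#include <assert.h>
#include <string.h>

#define ROWS 4
#define COLS 6 /* column 0 is treated as a ghost column */

/* Gather column 'col' of a row-major tile into a contiguous buffer. */
void pack_column(double grid[ROWS][COLS], int col, double *buf)
{
    assert(col >= 0 && col < COLS);
    for (int r = 0; r < ROWS; r++)
        buf[r] = grid[r][col];
}

/* Scatter a contiguous buffer back into column 'col' of a tile. */
void unpack_column(double grid[ROWS][COLS], int col, const double *buf)
{
    assert(col >= 0 && col < COLS);
    for (int r = 0; r < ROWS; r++)
        grid[r][col] = buf[r];
}

/* Returns 1 if the right tile's ghost column ends up equal to the
   left tile's last column after pack -> transfer -> unpack. */
int halo_exchange_ok(void)
{
    double left[ROWS][COLS], right[ROWS][COLS];
    double sendbuf[ROWS], recvbuf[ROWS];

    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            left[r][c] = 10.0 * r + c;
            right[r][c] = -1.0;
        }

    pack_column(left, COLS - 1, sendbuf);
    /* Stand-in for the send on one process and the receive on another. */
    memcpy(recvbuf, sendbuf, sizeof sendbuf);
    unpack_column(right, 0, recvbuf);

    for (int r = 0; r < ROWS; r++)
        if (right[r][0] != left[r][COLS - 1])
            return 0;
    return 1;
}
```

Packing matters because a column of a row-major array is strided in memory; copying exactly the values a neighbour needs into one contiguous buffer lets them travel in a single message.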
The results summarized below were obtained with Pluto git version 114e419014c6f14b3f193726e951b935ad120466 (25/05/2011) on an Intel Core i7 870 (2.93 GHz) with 4 GB DDR3-1333 RAM, running 32-bit Linux 2.6.35 with ICC 12.0.3. Examples, problem sizes, etc. are in the examples/ dir (or see the git examples/ tree). Actual running times (in seconds) are here.