PLUTO - An automatic parallelizer and locality optimizer for affine loop nests

PLUTO is an automatic parallelization tool based on the polyhedral model. The polyhedral model for compiler optimization provides an abstraction for performing high-level transformations, such as loop-nest optimization and parallelization, on affine loop nests. Pluto transforms C programs from source to source, optimizing for coarse-grained parallelism and data locality simultaneously. The core transformation framework mainly works by finding affine transformations for efficient tiling. The scheduling algorithm used by Pluto has been published in [1]. OpenMP parallel code for multicores can be automatically generated from sequential C program sections. Outer (communication-free), inner, or pipelined parallelization is achieved purely with OpenMP parallel for pragmas; the code is also optimized for locality and made amenable to auto-vectorization. An experimental evaluation and comparison with previous techniques can be found in [2]. Though the tool is fully automatic (C to OpenMP C), a number of options are provided (both on the command line and through meta files) to tune aspects like tile sizes, unroll factors, and outer loop fusion structure. CLooG-ISL is used for code generation.
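
To give a feel for the kind of code this produces, the following is a hand-written sketch, not actual Pluto output, of a matrix-multiplication loop nest before and after tiling, with an OpenMP parallel for pragma on the outermost tile loop. The function names, the TILE constant, and the choice of kernel are assumptions made for illustration; Pluto derives its own loop structure and tile sizes.

```c
#include <assert.h>

#define TILE 16 /* hypothetical tile size, chosen only for this sketch */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Original sequential loop nest: C += A * B for n x n row-major matrices. */
void matmul_seq(int n, const double *A, const double *B, double *C) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Hand-tiled version. The three outer loops walk over tiles, the three
   inner loops walk inside one tile. Distinct values of `it` write disjoint
   rows of C, so the outermost tile loop can run in parallel. Without
   OpenMP enabled, the pragma is ignored and the code runs sequentially. */
void matmul_tiled_par(int n, const double *A, const double *B, double *C) {
    assert(n >= 0);
    #pragma omp parallel for
    for (int it = 0; it < n; it += TILE)
        for (int jt = 0; jt < n; jt += TILE)
            for (int kt = 0; kt < n; kt += TILE)
                for (int i = it; i < min_int(it + TILE, n); i++)
                    for (int k = kt; k < min_int(kt + TILE, n); k++)
                        for (int j = jt; j < min_int(jt + TILE, n); j++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Both functions accumulate the products for a given C element in the same order of k, so they should produce identical results; only the traversal order across elements, and hence cache reuse, differs.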

A beta release can be downloaded below. Unless the goal is purely to compare against the latest release of Pluto, the git version is the recommended one. The Pluto scheduling algorithm, with extensions and further improvements, is also used in some production compilers, including IBM XL and RSTREAM from Reservoir Labs.

To cite Pluto:

[1] Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model [PDF | BibTeX]
Uday Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan.
International Conference on Compiler Construction (ETAPS CC), Apr 2008, Budapest, Hungary.

[2] A Practical Automatic Polyhedral Parallelizer and Locality Optimizer [PDF | BibTeX]
Uday Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan.
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Jun 2008, Tucson, Arizona.

Download

Development Version

Pluto is actively developed and is now available under the MIT license. Current development happens primarily at the Multicore Computing Lab, Indian Institute of Science. Users who intend to experiment with Pluto by adding new functionality or modifying it in some way are strongly advised to use the development version, which is on GitHub. The git version also includes tags that correspond to releases.

Git repository: https://github.com/bondhugula/pluto

Git master branch (for all stencil experiments)

For all stencil optimization experiments (with diamond tiling, which is enabled with --partlbtile --parallel), it is strongly recommended to use the git master branch and to compile Pluto-generated code with the Intel C compiler (version 14.0 or higher), especially for time-iterated stencils. The git version includes support for 'pet' as a frontend. This is necessary to allow stencils to be written compactly (without extra copies), which gives the best performance when optimizing with Pluto.

$ git clone https://github.com/bondhugula/pluto.git
$ cd pluto/
$ git submodule init
$ git submodule update
$ ./autogen.sh
$ ./configure [--enable-debug]
$ make -j4
$ cd examples/heat-2d/
$ cat heat-2d.c
...
    for (t = 0; t < T; t++) {
      for (i = 1; i < N+1; i++) {
        for (j = 1; j < N+1; j++) {
          A[(t+1)%2][i][j] =   0.125 * (A[t%2][i+1][j] - 2.0 * A[t%2][i][j] + A[t%2][i-1][j])
                               + 0.125 * (A[t%2][i][j+1] - 2.0 * A[t%2][i][j] + A[t%2][i][j-1])
                               + A[t%2][i][j];
        }
      }
    }
...
$ make orig orig_par tiled par lbpar
$ ./orig
Number of points = 2560000    |Number of timesteps = 1000    |Time taken = 5993.42300ms    |MFLOPS =  4271.348777    |sum: 1.270583e+09    |rms(A) = 1269788995.82    |sum(rep(A)) = 30514486
$ OMP_NUM_THREADS=4 ./orig_par
Number of points = 2560000    |Number of timesteps = 1000    |Time taken = 5991.34300ms    |MFLOPS =  4272.831651    |sum: 1.270583e+09    |rms(A) = 1269788995.82    |sum(rep(A)) = 30514486
$ ./tiled
Number of points = 2560000    |Number of timesteps = 1000    |Time taken = 3360.58500ms    |MFLOPS =  7617.721319    |sum: 1.270583e+09    |rms(A) = 1269788995.82    |sum(rep(A)) = 30514486
$ OMP_NUM_THREADS=4 ./par
Number of points = 2560000    |Number of timesteps = 1000    |Time taken = 1500.13100ms    |MFLOPS =  17065.176308    |sum: 1.270583e+09    |rms(A) = 1269788995.82    |sum(rep(A)) = 30514486
$ OMP_NUM_THREADS=4 ./lbpar 2> out_lbpar4
Number of points = 2560000    |Number of timesteps = 1000    |Time taken = 1285.32500ms    |MFLOPS =  19917.141579    |sum: 1.270583e+09    |rms(A) = 1269788995.82    |sum(rep(A)) = 30514486
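
The compact form mentioned above (no copy-back between time steps) can be sketched as a small self-contained routine around the heat-2d kernel. The grid size N = 8 and the helper names heat2d_fill, heat2d_run, and heat2d_at are assumptions for illustration and are not part of examples/heat-2d; only the loop body is taken from the listing above.

```c
#include <assert.h>

#define N 8 /* hypothetical small grid; the runs above use 2560000 points */

/* Two time planes, each with a one-cell boundary on every side. Indexing
   the plane with t % 2 removes the need to copy results back each step. */
static double A[2][N + 2][N + 2];

/* Set every cell of both planes, including the fixed boundary, to v. */
void heat2d_fill(double v) {
    for (int p = 0; p < 2; p++)
        for (int i = 0; i < N + 2; i++)
            for (int j = 0; j < N + 2; j++)
                A[p][i][j] = v;
}

/* Run T time steps; returns the index of the plane holding the result. */
int heat2d_run(int T) {
    assert(T >= 0);
    for (int t = 0; t < T; t++)
        for (int i = 1; i < N + 1; i++)
            for (int j = 1; j < N + 1; j++)
                A[(t + 1) % 2][i][j] =
                      0.125 * (A[t % 2][i + 1][j] - 2.0 * A[t % 2][i][j] + A[t % 2][i - 1][j])
                    + 0.125 * (A[t % 2][i][j + 1] - 2.0 * A[t % 2][i][j] + A[t % 2][i][j - 1])
                    + A[t % 2][i][j];
    return T % 2;
}

/* Read one cell of the given plane. */
double heat2d_at(int plane, int i, int j) { return A[plane][i][j]; }
```

A quick sanity check for such a kernel is that a uniform field stays uniform: after heat2d_fill(3.0), every cell of the plane returned by heat2d_run should still hold 3.0, because both second-difference terms vanish.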

Releases

Pluto 0.13.0 (BETA)

All libraries that Pluto depends on (PipLib, PolyLib, CLooG, ISL) are included and are built automatically. So nothing needs to be installed separately.
Previous releases

Quick Install

tar zxvf pluto-<version>.tar.gz
cd pluto-<version>/
./configure [--enable-debug]
make -j4

$ ./polycc test/seidel.c --tile --parallel

[pluto] Number of statements: 1
[pluto] Total number of loops: 3
[pluto] Number of deps: 19
[pluto] Maximum domain dimensionality: 3
[pluto] Number of parameters: 2
[pluto] Affine transformations

T(S1): (t, t+i, 2t+i+j)
loop types (loop, loop, loop)

t1 --> fwd_dep  loop   (band 0)no-ujam
t2 --> fwd_dep  loop   (band 0)no-ujam
t3 --> fwd_dep  loop   (band 0)no-ujam

[Pluto] After tiling:
T(S1): (zT3, zT4, zT5, t, t+i, 2t+i+j)
loop types (loop, loop, loop, loop, loop, loop)

t1 --> fwd_dep  loop   (band 0)no-ujam
t2 --> fwd_dep  loop   (band 0)no-ujam
t3 --> fwd_dep  loop   (band 0)no-ujam
t4 --> fwd_dep  loop   (band 0)no-ujam
t5 --> parallel loop   (band 0)no-ujam
t6 --> parallel loop   (band 0)no-ujam

[Pluto] After tile scheduling:
T(S1): (zT3+zT4, zT4, zT5, t, t+i, 2t+i+j)
loop types (loop, loop, loop, loop, loop, loop)

[pluto] using statement-wise -fs/-ls options: S1(4,6),
[pluto] Marking t2 parallel
[pluto] Output written to seidel.pluto.c

[pluto] SCoP extraction + dependence analysis time: 0.068200s
[pluto] Auto-transformation time: 0.005411s (constraint solving: 0.000000s)
[pluto] Code generation time: 0.030124s
[pluto] Other/Misc time: 0.102829s
[pluto] Total time: 0.206564s

$ icc -O3 -xHost -ansi-alias -ipo -fp-model precise -DTIME -openmp seidel.pluto.c
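
To see why the transformation T(S1) = (t, t+i, 2t+i+j) reported above enables tiling, one can check that it maps every dependence distance vector to a vector with no negative component. The sketch below does this for nine flow-dependence distances derived by hand under the assumption that test/seidel.c is an in-place 9-point Gauss-Seidel sweep; these vectors are not printed by Pluto, which reports 19 dependences in total, and the names seidel_deps, pluto_T, identity_T, and permits_tiling are illustrative.

```c
#include <assert.h>

#define NDEPS 9

/* Assumed flow-dependence distance vectors (t, i, j) for an in-place
   9-point Gauss-Seidel sweep; derived by hand, not taken from Pluto. */
const int seidel_deps[NDEPS][3] = {
    {0, 1, 1}, {0, 1, 0}, {0, 1, -1}, {0, 0, 1},
    {1, 0, 0}, {1, 0, -1}, {1, -1, 1}, {1, -1, 0}, {1, -1, -1}
};

/* T(S1) = (t, t+i, 2t+i+j) from the log above, one row per output dimension. */
const int pluto_T[3][3] = { {1, 0, 0}, {1, 1, 0}, {2, 1, 1} };

/* The original loop order (t, i, j), for comparison. */
const int identity_T[3][3] = { {1, 0, 0}, {0, 1, 0}, {0, 0, 1} };

/* Returns 1 if every transformed distance vector has only non-negative
   components, the condition for rectangular tiling of the whole band;
   returns 0 as soon as one negative component is found. */
int permits_tiling(const int T[3][3]) {
    assert(T != 0);
    for (int d = 0; d < NDEPS; d++)
        for (int r = 0; r < 3; r++) {
            int c = 0;
            for (int k = 0; k < 3; k++)
                c += T[r][k] * seidel_deps[d][k];
            if (c < 0)
                return 0;
        }
    return 1;
}
```

Under these assumptions, permits_tiling(pluto_T) returns 1, while permits_tiling(identity_T) returns 0 because distances such as (0, 1, -1) have a negative component in the original loop order.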

Distributed-memory (MPI) Code Generation Support

Support for generating MPI code can be found in the 'distmem' branch of the development git version (repository info above). Once it is checked out, see README.distmem. Distributed-memory code generation support is not included in any of the releases. The techniques precisely determine the communication sets for a tile, pack them into contiguous buffers, send/receive them, and unpack them; they are described in the following two papers.

Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures [PDF, tool, slides]
Uday Bondhugula
ACM/IEEE Supercomputing (SC '13), Nov 2013, Denver, USA.
Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory [PDF, tool]
Roshan Dathathri, Chandan G, Thejas Ramashekar, Uday Bondhugula
International Conference on Parallel Architectures and Compilation Techniques (PACT 2013), Sep 2013, Edinburgh, UK.

To build the 'distmem' branch and run an example:

$ git clone https://github.com/bondhugula/pluto.git -b distmem
$ cd pluto/
$ git submodule init
$ git submodule update
$ ./autogen.sh
$ ./configure; make -j4
$ cd examples/heat-3d
$ ../../polycc heat-3d.c --distmem --mpiomp --commopt_fop --tile  --isldep --lastwriter --cloogsh --timereport  -o heat-3d.distopt_fop.c
$ mpicc -cc=icc -D__MPI -O3 -fp-model precise -ansi-alias -ipo  -openmp -DTIME heat_3d_np.distopt_fop.c sigma_heat_3d_np.distopt_fop.c pi_heat_3d_np.distopt_fop.c\
        ../../polyrt/polyrt.c -o distopt_fop -I ../../polyrt -lm
$ mpirun_rsh -np 16 -hostfile ../hosts MV2_ENABLE_AFFINITY=0 OMP_NUM_THREADS=4  ./distopt_fop
-------
SUMMARY
Write-out time spent in master node: 0.000263 s
Maximum computation time spent across all nodes: 65.769280 s
Maximum communication time spent across all nodes: 0.001323 s
Maximum packing time spent across all nodes: 0.001055 s
Maximum unpacking time spent across all nodes: 0.000229 s
Maximum total time spent across all nodes: 65.771958 s
-------
time = 66.327028s
time = 66.270790s
time = 66.169645s
time = 66.233985s
time = 66.186279s
time = 66.257692s
time = 66.275415s
time = 66.198354s
time = 66.226861s
time = 66.221464s
time = 66.285732s
time = 66.188053s
time = 66.198346s
time = 66.206508s
time = 66.136301s
time = 66.235197s
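
The pack and unpack steps mentioned in this section can be illustrated with a minimal sketch. It copies one strided column of a row-major grid into a contiguous buffer and scatters it back; in real MPI code the buffer would be handed to a send/receive pair in between. The function names and the choice of a whole column as the communication set are assumptions for illustration; the generated code determines communication sets per tile from the dependences.

```c
#include <assert.h>
#include <stddef.h>

/* Pack: gather column `col` of a rows x cols row-major grid (stride of
   cols elements) into a contiguous buffer of length rows. */
void pack_column(size_t rows, size_t cols, const double *grid,
                 size_t col, double *buf) {
    assert(col < cols);
    for (size_t i = 0; i < rows; i++)
        buf[i] = grid[i * cols + col];
}

/* Unpack: scatter a contiguous buffer of length rows into column `col`
   of the receiving grid. */
void unpack_column(size_t rows, size_t cols, double *grid,
                   size_t col, const double *buf) {
    assert(col < cols);
    for (size_t i = 0; i < rows; i++)
        grid[i * cols + col] = buf[i];
}
```

Packing into a contiguous buffer lets each neighbour exchange be a single message rather than one message per element.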

LLVM/Polly with libpluto

Download this script and run it to automatically obtain and build LLVM/Polly with libpluto support. Here is an example of how to enable the Pluto scheduler.

$ clang -O3 -mllvm -polly -mllvm -polly-optimizer=pluto [-mllvm -polly-pluto-tile]
                            [-mllvm -polly-pluto-intratileopt] [-mllvm -polly-pluto-debug]
                            [-mllvm -polly-pluto-moredebug] [-mllvm -polly-pluto-silent=<false|true>]
                            [-mllvm -polly-pluto-parallel] [-mllvm -polly-pluto-fusion=<smart|min|max>]
                            [-mllvm -polly-pluto-identity] [-mllvm -polly-pluto-unroll]
                            [-mllvm -polly-pluto-rar] [-mllvm -polly-pluto-l2tile]
                            [-mllvm -polly-pluto-pollyunroll] [-mllvm -polly-pluto-isldep]
                            [-mllvm -polly-pluto-isldepcom] [-mllvm -polly-pluto-islsolve]
                            [-mllvm -polly-pluto-lastwriter] <file_to_compile>

Pluto+

Pluto+'s modeling of transformations allows transformation coefficients to be negative (unlike Pluto); it thus also models compositions of transformations that involve loop reversal and loop skewing by negative factors. The newly modeled transformations are required, for example, to enable tiling for stencils defined on grids with periodic boundary conditions. Pluto+ is available in the pluto+ branch. To use this:

$ git clone https://github.com/bondhugula/pluto.git -b pluto+ pluto+
$ cd pluto+/
$ git submodule init
$ git submodule update
$ ./apply_patches.sh
$ ./autogen.sh
$ ./configure; make -j4
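
As a small illustration of a transformation with a negative coefficient, the sketch below applies loop reversal, that is the schedule phi(i) = -i, to a loop that reads a periodic (wrap-around) neighbour. The function names are hypothetical, and the example is hand-written rather than produced by Pluto+. Reversal is legal here only because the input and output arrays are distinct, so there is no loop-carried dependence; the example does not attempt to show the full treatment of periodic stencils.

```c
#include <assert.h>

/* Original order: iteration i runs at time phi(i) = i.
   in[(i + 1) % n] is the periodic neighbour of element i. */
void periodic_pair_sum(int n, const double *in, double *out) {
    assert(n > 0);
    for (int i = 0; i < n; i++)
        out[i] = in[i] + in[(i + 1) % n];
}

/* Reversed order: schedule phi(i) = -i. The loop scans the schedule
   value p from -(n-1) up to 0 and recovers the original index as
   i = -p, so iterations execute from i = n-1 down to i = 0. */
void periodic_pair_sum_reversed(int n, const double *in, double *out) {
    assert(n > 0);
    for (int p = -(n - 1); p <= 0; p++) {
        int i = -p;
        out[i] = in[i] + in[(i + 1) % n];
    }
}
```

Both functions fill out with the same values; only the execution order of the iterations differs.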

Performance

The results summarized below were obtained with Pluto git version 114e419014c6f14b3f193726e951b935ad120466 (25/05/2011) on an Intel Core i7 870 (2.93 GHz) with 4 GB DDR3-1333 RAM, running 32-bit Linux 2.6.35, with ICC 12.0.3. Examples, problem sizes, etc. are in the examples/ directory (or see the examples/ tree in git). Actual running times (in seconds) are here.

Contributors

List of people who have contributed to Pluto.

Aravind Acharya
Vinayaka Bandishti
Uday Bondhugula (original author, maintainer)
Roshan Dathathri
Chandan G
Anoop JS
Taj Khan
Arvind M
Sven Verdoolaege

Contact

Please post any questions or comments related to installation, usage, or development (bug reports, patches, and requests for new features) on Pluto's GitHub page or to pluto-development@googlegroups.com.

Please send feedback on this web page to udayb@iisc.ac.in

Last updated on Tue Mar 28 00:39:03 IST 2017