PLUTO - An automatic parallelizer and locality optimizer for multicores

PLUTO is an automatic parallelization tool based on the polyhedral model. The polyhedral model for compiler optimization provides an abstraction to perform high-level transformations such as loop-nest optimization and parallelization on affine loop nests. Pluto transforms C programs from source to source for coarse-grained parallelism and data locality simultaneously. The core transformation framework works mainly by finding affine transformations for efficient tiling and fusion, but is not limited to those. The scheduling algorithm used by Pluto has been published in [1]. OpenMP parallel code for multicores can be generated automatically from sequential C program sections. Outer, inner, or pipelined parallelization is achieved purely with OpenMP pragmas; Pluto also performs register tiling and makes code amenable to auto-vectorization. An experimental evaluation and comparison with previous techniques can be found in [2]. Though the tool is fully automatic (C to OpenMP C), a number of options are provided (both on the command line and through meta files) to tune aspects like tile sizes, unroll factors, and outer loop fusion structure. CLooG-ISL is used for code generation.

A beta release can be downloaded below. A version with support for generating CUDA code is also available. The git version is the active development version.

  1. Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model [PDF | BibTeX ]
    Uday Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan.
    International Conference on Compiler Construction (ETAPS CC), Apr 2008, Budapest, Hungary.

  2. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer [PDF| BibTeX]
    Uday Bondhugula, A. Hartono, J. Ramanujam, P. Sadayappan.
    ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Jun 2008, Tucson, Arizona.

The Pluto scheduling algorithm, with extensions and further improvements, is also being used in some production compilers including IBM XL and RSTREAM from Reservoir Labs.

Mailing list

Please post any questions or comments related to installation, usage, or development (bug reports, patches, and requests for new features) to the pluto-development Google Group.

Download (release)

Pluto 0.11.3 (BETA), README, ChangeLog (Feb 8, 2015)

All libraries that Pluto depends on (PipLib, PolyLib, and CLooG) are included and built automatically, so nothing needs to be installed separately.

Previous releases

CUDA version: Pluto 0.6.2-CUDA (BETA) (no longer maintained)
See examples_ccuda/ and README_CCUDA/ in top-level directory.
Try ppcg for a more advanced and active polyhedral code generator for GPUs.

Quick Install

  1. tar zxvf pluto-0.11.3.tar.gz
  2. cd pluto-0.11.3/
  3. ./configure [--enable-debug]
  4. make -j4
$ ./polycc test/seidel.c --tile --parallel

Number of variables: 3
Number of parameters: 2
Maximum domain dimensionality: 3
Number of loops: 3
Number of poly deps: 27

(PLUTO) Affine transformations (<var coeff's> <const>)

T(S1): (t, t+i, 2t+i+j)
3 4
 1  0  0  0
 1  1  0  0
 2  1  1  0

t0 --> fwd_dep  loop   (band 0)
t1 --> fwd_dep  loop   (band 0)
t2 --> fwd_dep  loop   (band 0)

[Pluto] Outermost tilable band: t0--t2
[Pluto] After tiling:
t0 --> serial   tLoop  (band 0)
t1 --> parallel tLoop  (band 0)
t2 --> fwd_dep  tLoop  (band 0)
t3 --> fwd_dep  loop   (band 0)
t4 --> fwd_dep  loop   (band 0)
t5 --> fwd_dep  loop   (band 0)

[PLUTO] using CLooG -f/-l options: 4 6
Output written to ./seidel.par.c

$ icc -O3 -openmp -DTIME seidel.par.c -o par -lm
seidel.par.c(43): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
Figure: Gauss-Seidel

Development Version

Pluto is being developed actively. Its current development primarily happens at the Multicore computing lab, Indian Institute of Science.

Public repository:

$ git clone git://
$ cd pluto/
$ git submodule init 
$ git submodule update
$ ./
$ ./configure [--enable-debug]
$ make -j4

Diamond Tiling

Pluto with support for diamond tiling can be found in the pet branch of its git repository. The commands below clone and build that branch and run the heat-2d example:

$ git clone git:// -b pet
$ cd pluto/
$ git submodule init 
$ git submodule update
$ ./
$ ./
$ ./configure [--enable-debug]
$ make -j4
$ cd examples/heat-2d/
$ make orig orig_par tiled par lbpar
$ ./orig
Number of points = 2560000    |Number of timesteps = 1000    |Time taken = 5993.42300ms    |MFLOPS =  4271.348777    |sum: 1.270583e+09    |rms(A) = 1269788995.82    |sum(rep(A)) = 30514486
$ OMP_NUM_THREADS=4 ./orig_par
Number of points = 2560000    |Number of timesteps = 1000    |Time taken = 5991.34300ms    |MFLOPS =  4272.831651    |sum: 1.270583e+09    |rms(A) = 1269788995.82    |sum(rep(A)) = 30514486
Number of points = 2560000    |Number of timesteps = 1000    |Time taken = 3360.58500ms    |MFLOPS =  7617.721319    |sum: 1.270583e+09    |rms(A) = 1269788995.82    |sum(rep(A)) = 30514486
Number of points = 2560000    |Number of timesteps = 1000    |Time taken = 1500.13100ms    |MFLOPS =  17065.176308    |sum: 1.270583e+09    |rms(A) = 1269788995.82    |sum(rep(A)) = 30514486
$ OMP_NUM_THREADS=4 ./lbpar 2> out_lbpar4
Number of points = 2560000    |Number of timesteps = 1000    |Time taken = 1285.32500ms    |MFLOPS =  19917.141579    |sum: 1.270583e+09    |rms(A) = 1269788995.82    |sum(rep(A)) = 30514486 
Figure: Heat with diamond tiling

Distributed-memory (MPI) Code Generation Support

Support for generating MPI code can be found in the 'distmem' branch of the development git version (repository info above). Once checked out, see README.distmem. Distmem code generation support is not included in any of the releases. The techniques precisely determine the communication sets for a tile, pack them into contiguous buffers, send/receive them, and unpack them; they are described in the following two papers.

  1. Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures [PDF, tool, slides]
    Uday Bondhugula
    ACM/IEEE Supercomputing (SC '13), Nov 2013, Denver, USA.

  2. Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory [PDF, tool ]
    Roshan Dathathri, Chandan G, Thejas Ramashekar, Uday Bondhugula
    International Conference on Parallel Architectures and Compilation Techniques (PACT 2013), Sep 2013, Edinburgh, UK.

To use this, check out the 'distmem' branch from git and see README.distmem.

$ git clone git:// -b distmem
$ git submodule init 
$ git submodule update
$ cd cloog-isl/
$ git am ../patches/0001-loop-execution-time-reporting.patch
$ cd ..
$ ./; ./configure; make -j4
$ cd examples/heat-3d
$ ../../polycc heat-3d.c --distmem --mpiomp --commopt_fop --tile  --isldep --lastwriter --cloogsh --commreport  -o heat-3d.distopt_fop.c
$ mpicc -cc=icc -D__MPI -O3 -fp-model precise -ansi-alias -ipo  -openmp -DTIME heat_3d_np.distopt_fop.c sigma_heat_3d_np.distopt_fop.c pi_heat_3d_np.distopt_fop.c\
        ../../polyrt/polyrt.c -o distopt_fop -I ../../polyrt -lm
$ mpirun_rsh -np 16 -hostfile ../hosts MV2_ENABLE_AFFINITY=0 OMP_NUM_THREADS=4  ./distopt_fop 
Write-out time spent in master node: 0.000263 s
Maximum computation time spent across all nodes: 65.769280 s
Maximum communication time spent across all nodes: 0.001323 s
Maximum packing time spent across all nodes: 0.001055 s
Maximum unpacking time spent across all nodes: 0.000229 s
Maximum total time spent across all nodes: 65.771958 s
time = 66.327028s
time = 66.270790s
time = 66.169645s
time = 66.233985s
time = 66.186279s
time = 66.257692s
time = 66.275415s
time = 66.198354s
time = 66.226861s
time = 66.221464s
time = 66.285732s
time = 66.188053s
time = 66.198346s
time = 66.206508s
time = 66.136301s
time = 66.235197s

Index Set Splitting Support

Support for index set splitting to allow tiling of periodic stencils can be found in the pet branch of the Pluto git repository. The commands below clone and build that branch:

$ git clone git:// -b pet
$ cd pluto/
$ git submodule init 
$ git submodule update
$ ./
$ ./
$ ./configure [--enable-debug]
$ make -j4

A description of the technique used can be found in the paper below.
Tiling and Optimizing Time-Iterated Computations over Periodic Domains
Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, Nicolas Vasilache
IEEE International Conference on Parallel Architectures and Compilation Techniques (PACT 2014), Aug 2014.


The results summarized below were obtained with Pluto git version 114e419014c6f14b3f193726e951b935ad120466 (25/05/2011) on an Intel Core i7 870 (2.93 GHz) with 4 GB DDR3-1333 RAM, running 32-bit Linux 2.6.35, with ICC 12.0.3. Examples, problem sizes, etc. are in the examples/ directory (or see the git examples/ tree). Actual running times (in seconds) are here.

Pluto results summary


All results below (old) were obtained on an Intel Core2 Quad Q6600 running Linux x86-64 (2.6.18), with ICC 10.1 used to compile the original and the transformed/parallelized codes. These were from pluto-0.0.1 and should not be used or referred to; they remain here only as a very rough reference.

Imperfect Jacobi stencil

Figures: Jacobi (single core), Jacobi (multiple cores)

2-d FDTD


LU decomposition


GEMVER, Doitgen

Figures: GEMVER, Doitgen



ATLAS was tuned with GCC 4.1. kSelMM performance numbers were approximated using the timing report produced after tuning ATLAS ('make time'). kSelMM and kGenMM performance on 2 and 3 cores was linearly interpolated from performance on 1 and 4 cores.


Please email or post on the pluto-development Google Group.