## Code Transformations

**Introduction:** If there exist k independent solutions to the time-partition constraints of a loop nest, then it is possible to transform the loop nest to have k outermost fully permutable loops, which can be transformed to create k—1 degrees of pipelining, or to create k — 1 inner parallelizable loops. Furthermore, we can apply blocking to fully permutable loops to improve data locality of uniprocessors as well as reducing synchronization among processors in a parallel execution.

**Exploiting Fully Permutable Loops: **We can create a loop nest with k outermost fully permutable loops easily fromk independent solutions to the time-partition constraints. We can do so bysimply making the kth solution the kih row of the new transform. Once theaffine transform is created, Algorithm 11.45 can be used to generate the code.

**Example:** The solutions found in Example 11.58 for our SOR example were

Making the first solution the first row and the second solution the second row, we get the transform

' 1 0 "

1 1

which yields the code in Fig. 11.51(a).

Making the second solution the first row instead, we get the transform

" 1 1 '

1 0

which yields the code in Fig. 11.51(c).

It is easy to see that such transforms produce a legal sequential program. The first row partitions the entire iteration space according to the first solution. The timing constraints guarantee that such a decomposition does not violate any data dependences. Then, we partition the iterations in each of the outermost loop according to the second solution. Again this must be legal because we are dealing with just subsets of the original iteration space. The same goes for the rest of the rows in the matrix. Since we can order the solutions arbitrarily, the loops are fully permutable.

**Exploiting Pipelining: **We can easily transform a loop with k outermost fully permutable loops into a code with k — 1 degrees of pipeline parallelism.

**Wavefr ont ing: **It is also easy to generate k-1 inner parallelizable loops from a loop with k outermost fully permutable loops. Although pipelining is preferable, we include this information here for completeness.

We partition the computation of a loop with k outermost fully permutable loops using a new index variable i': where i' is defined to be some combination of all the indices in the k permutable loop nest. For example, i' = i± ... ik is one such combination.

We create an outermost sequential loop that iterates through the i' partitions in increasing order; the computation nested within each partition is ordered as before. The first k - 1 loops within each partition are guaranteed to be parallelizable. Intuitively, if given a two-dimensional iteration space, this transforms groups iterations along 135° diagonals as an execution of the outermost loop. This strategy guarantees that iterations within each iteration of the outermost loop have no data dependence.

**Blocking: **A k-deep, fully permutable loop nest can be blocked in ^-dimensions. Insteadof assigning the iterations to processors based on the value of the outer or innerloop indexes, we can aggregate blocks of iterations into one unit. Blocking isuseful for enhancing data locality as well as for minimizing the overhead ofpipelining.

Suppose we have a two-dimensional fully permutable loop nest, as in Fig. 11.55(a), and we wish to break the computation into b x b blocks. The execution order of the blocked code is shown in Fig. 11.56, and the equivalent code is in Fig. 11.55(b).

If we assign each block to one processor, then all the passing of data from one iteration to another that is within a block requires no inter-processor communication.

Alternatively, we can coarsen the granularity of pipelining by assigning a column of blocks to one processor. Notice that each processor synchronizes with its predecessors and successors only at block boundaries. Thus, another advantage of blocking is that programs only need to communicate data accessed at the boundaries of the block with their neighbor blocks. Values that are interior to a block are managed by only one processor.