## A Simple Code-Generation Algorithm

**Introduction**: Algorithm (Finding a highest-ranked synchronization-free affine partition for a program) generates affine partitions that split computations into independent partitions. Partitions can be assigned arbitrarily to different processors, since they are independent of one another. A processor may be assigned more than one partition and can interleave the execution of its partitions, as long as operations within each partition, which normally have data dependences, are executed sequentially.

It is relatively easy to generate a correct program given an affine partition. We first introduce Algorithm 11.45, a simple approach to generating code for a single processor that executes each of the independent partitions sequentially. Such code optimizes temporal locality, since array accesses that have several uses are very close in time. Moreover, the code easily can be turned into an SPMD program that executes each partition on a different processor. The code generated is, unfortunately, inefficient; we shall next discuss optimizations to make the code execute efficiently.

The essential idea is as follows. We are given bounds for the index variables of a loop nest, and we have determined, in Algorithm (Finding a highest-ranked synchronization-free affine partition for a program), a partition for the accesses of a particular statement s. suppose we wish to generate sequential code that performs the action of each processor sequentially. We create an outermost loop that iterates through the processor IDs. That is, each iteration of this loop performs the operations assigned to a particular processor ID. The original program is inserted as the loop body of this loop; in addition, a test is added to guard each operation in the code to ensure that each processor only executes the operations assigned to it. In this way, we guarantee that the processor executes all the instructions assigned to it, and does so in the original sequential order.

**Algorithm:** Generating code that executes partitions of a program sequentially.

**INPUT:** A program P with affine array accesses. Each statement s in the program has associated bounds of the form B s i bs > 0, where i is the vector of loop indexes for the loop nest in which statement s appears. Also associated with statement s is a partition C 5 i cs = p where p is an m-dimensional vector of variables representing a processor ID; m is the maximum, over all statements in program P, of the rank of the partition for that statement.

**OUTPUT:** A program equivalent to P but iterating over the processor space rather than over loop indexes.

**METHOD**: Do the following:

1. For each statement, use Fourier-Motzkin elimination to project out all the loop index variables from the bounds.

2. Use Algorithm 11.13 to determine bounds on the partition ID's.

3. Generate loops, one for each of the m dimensions of processor space. Let p = [pi,P2,-- - ,Pm] be the vector of variables for these loops; that is, there is one variable for each dimension of the processor space. Each loop variable pi ranges over the union of the partition spaces for all statements in the program P.

Note that the union of the partition spaces is not necessarily convex. To keep the algorithm simple, instead of enumerating only those partitions that have a nonempty computation to perform, set the lower bound of each pi to the minimum of all the lower bounds imposed by all statements and the upper bound of each pi to the maximum of all the upper bounds imposed by all statements. Some values of p may thereby have no operations.

The code to be executed by each partition is the original sequential program. However, every statement is guarded by a predicate so that only those operations belonging to the partition are executed.