Introduction: To require no communication, each pair of operations that share a data dependence must be assigned to the same processor. We refer to these constraints as "space-partition constraints." Any mapping that satisfies these constraints creates partitions that are independent of one another. Note that such constraints can be satisfied by putting all the operations in a single partition. Unfortunately, that "solution" does not yield any parallelism. Our goal is to create as many independent partitions as possible while satisfying the space-partition constraints; that is, operations are not placed on the same processor unless it is necessary.
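As a concrete illustration (the loop nest and array below are invented for this sketch), consider a nest in which every dependence stays within one row. Assigning row i to processor i then satisfies all space-partition constraints and yields N independent partitions, with no communication:

```python
# Hypothetical loop nest (invented for illustration): iteration (i, j)
# reads the value written by (i, j-1), so every dependence stays
# within a single row:
#
#   for (i = 0; i < N; i++)
#     for (j = 1; j < N; j++)
#       A[i][j] = A[i][j-1] + 1;

N = 4

def proc(i, j):
    # Affine space partition C*(i,j)^T + c with C = [1 0], c = 0:
    # row i is mapped to processor i.
    return i

# Every data-dependent pair of dynamic iterations: (i, j-1) -> (i, j).
deps = [((i, j - 1), (i, j)) for i in range(N) for j in range(1, N)]

# The mapping is communication-free: dependent iterations always land
# on the same processor, giving N independent partitions.
assert all(proc(*src) == proc(*dst) for src, dst in deps)
```

By contrast, the column-wise mapping `proc(i, j) = j` would split dependent iterations (i, j-1) and (i, j) across processors and so would violate the constraints.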
When we restrict ourselves to affine partitions, instead of maximizing the number of independent units, we may maximize the degree (number of dimensions) of parallelism. It is sometimes possible to create more independent units if we can use piecewise affine partitions. A piecewise affine partition divides instances of a single access into different sets and allows a different affine partition for each set. However, we shall not consider such an option here.
Formally, an affine partition of a program is synchronization free if and only if for every two (not necessarily distinct) accesses sharing a dependence, ⟨F₁, f₁, B₁, b₁⟩ in statement s₁ nested in d₁ loops, and ⟨F₂, f₂, B₂, b₂⟩ in statement s₂ nested in d₂ loops, the partitions (C₁, c₁) and (C₂, c₂) for statements s₁ and s₂, respectively, satisfy the following space-partition constraints:
For all i₁ in Z^d₁ and i₂ in Z^d₂ such that
a) B₁i₁ + b₁ ≥ 0,
b) B₂i₂ + b₂ ≥ 0, and
c) F₁i₁ + f₁ = F₂i₂ + f₂,
it is the case that C₁i₁ + c₁ = C₂i₂ + c₂.
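For small, bounded loops these constraints can be checked directly by enumeration. The Python sketch below (the loop, array, and all values of F, f, B, b, C, c are invented for illustration) tests whether a candidate affine partition is synchronization free:

```python
from itertools import product

# Brute-force check of the space-partition constraints for a
# hypothetical one-dimensional example:
#
#   for (i = 0; i <= 9; i++)  X[i] = X[i-1] + 1;
#
# Write access X[i]:   F1 = [1], f1 = 0
# Read  access X[i-1]: F2 = [1], f2 = -1
# The ranges below encode the bound constraints B*i + b >= 0, i.e.
# 0 <= i <= 9 for both index vectors.

def sync_free(C, c):
    """Both accesses are in the same statement, so they share one
    partition (C, c).  Enumerate all in-bounds pairs (i1, i2); whenever
    F1*i1 + f1 == F2*i2 + f2 (the same element of X is touched), the
    constraint requires C*i1 + c == C*i2 + c."""
    for i1, i2 in product(range(10), repeat=2):
        if 1 * i1 + 0 == 1 * i2 - 1:        # condition (c): same element
            if C * i1 + c != C * i2 + c:    # partition constraint violated
                return False
    return True

# Mapping iteration i to processor i breaks the constraints, because
# iterations i and i+1 touch the same element of X ...
assert not sync_free(C=1, c=0)
# ... so only the rank-0 partition (one processor) is synchronization free.
assert sync_free(C=0, c=0)
```

In this example the constraint C·i₁ = C·(i₁ + 1) forces C = 0, so no nontrivial synchronization-free affine partition exists; the algorithms discussed later derive such conclusions symbolically rather than by enumeration.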
The goal of the parallelization algorithm is to find, for each statement, the partition with the highest rank that satisfies these constraints. Shown in Fig. 11.25 is a diagram illustrating the essence of the space-partition constraints. Suppose there are two static accesses in two loop nests with index vectors i₁ and i₂. These accesses are dependent in the sense that they access at least one array element in common, and at least one of them is a write. The figure shows particular dynamic accesses in the two loops that happen to access the same array element, according to the affine access functions F₁i₁ + f₁ and F₂i₂ + f₂. Synchronization is necessary unless the affine partitions for the two static accesses, C₁i₁ + c₁ and C₂i₂ + c₂, assign the dynamic accesses to the same processor.
If we choose an affine partition whose rank is the maximum of the ranks of all statements, we get the maximum possible parallelism. However, under this partitioning some processors may be idle at times, while other processors are executing statements whose affine partitions have a smaller rank. This situation may be acceptable if the time taken to execute those statements is relatively short. Otherwise, we can choose an affine partition whose rank is smaller than the maximum possible, as long as that rank is greater than 0.
We show a small program designed to illustrate the power of the technique. Real applications are usually much simpler than this, but may have boundary conditions resembling some of the issues shown here. We shall use this example throughout this chapter to illustrate that programs with affine accesses have relatively simple space-partition constraints, that these constraints can be solved using standard linear algebra techniques, and that the desired SPMD program can be generated mechanically from the affine partitions.