Eliminating Empty Iterations
Introduction: We now discuss the first of the two transformations necessary to generate efficient SPMD code. The naive code executed by each processor cycles through all the iterations in the original program and picks out the operations that it is supposed to execute. If the code has k degrees of parallelism, the effect is that each processor performs k orders of magnitude more work than it needs to. The purpose of the first transformation is to tighten the bounds of the loops to eliminate all the empty iterations.
We begin by considering the statements in the program one at a time. The iteration space of a statement to be executed by each partition is the statement's original iteration space augmented with the constraint imposed by the affine partition. We can generate tight bounds for each statement by applying Algorithm (Computing bounds for a given order of variables) to the new iteration space; the new index vector is like the original sequential index vector, with processor IDs added as the outermost indexes. Recall that the algorithm generates tight bounds for each index in terms of the surrounding loop indexes.
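The bound-tightening step can be sketched concretely. Below is a minimal Fourier-Motzkin-style projection in Python; the constraint encoding and the helper names `eliminate` and `numeric_bounds` are our own illustration, not the book's algorithm verbatim. For a statement with iteration space 0 ≤ i, j ≤ 9 and the hypothetical affine partition p = i, projecting out the inner index j and then bounding i in terms of p collapses the i loop to the single value i = p:

```python
from fractions import Fraction

# A constraint {var: coeff, ..., '1': const} encodes sum(coeff*var) + const >= 0.

def eliminate(var, constraints):
    """Fourier-Motzkin elimination: project `var` out of the system."""
    lowers = [c for c in constraints if c.get(var, 0) > 0]
    uppers = [c for c in constraints if c.get(var, 0) < 0]
    out = [c for c in constraints if c.get(var, 0) == 0]
    for lo in lowers:
        for up in uppers:
            a, b = lo[var], up[var]              # a > 0, b < 0
            comb = {}
            # Scale so the coefficients of `var` cancel, then add.
            for cons, scale in ((lo, -b), (up, a)):
                for v, coef in cons.items():
                    if v != var:
                        comb[v] = comb.get(v, 0) + scale * coef
            out.append(comb)
    return out

def numeric_bounds(var, constraints, env):
    """Tight bounds on `var`, given values `env` for the outer variables."""
    lo = hi = None
    for c in constraints:
        a = c.get(var, 0)
        if a == 0:
            continue
        rest = c.get('1', 0) + sum(coef * env[v] for v, coef in c.items()
                                   if v not in (var, '1'))
        if a > 0:                                # var >= -rest / a
            cand = Fraction(-rest, a)
            lo = cand if lo is None else max(lo, cand)
        else:                                    # var <= rest / -a
            cand = Fraction(rest, -a)
            hi = cand if hi is None else min(hi, cand)
    return lo, hi

# Statement with iteration space 0 <= i, j <= 9 and affine partition p = i.
space = [
    {'i': 1},              # i >= 0
    {'i': -1, '1': 9},     # i <= 9
    {'j': 1},              # j >= 0
    {'j': -1, '1': 9},     # j <= 9
    {'p': 1, 'i': -1},     # p >= i
    {'i': 1, 'p': -1},     # i >= p
]

# Index order (p, i, j): project out j, then bound i in terms of p.
projected = eliminate('j', space)
assert numeric_bounds('i', projected, {'p': 3}) == (3, 3)  # i loop collapses to i = p
```

With processor IDs ordered outermost, the same projection, applied once per index from innermost to outermost, yields the tight loop bounds for each statement.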
After finding the iteration spaces of the different statements, we combine them, loop by loop, making the bounds the union of those for each statement. Some loops end up having a single iteration; we can eliminate such a loop and simply set its loop index to the value of that one iteration.
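To see the effect of bound tightening, the following sketch (our own illustrative code, with a hypothetical partition p = i) compares a guarded SPMD loop that sweeps the full iteration space with the tightened version, in which the i loop has a single iteration and is eliminated by setting its index to p:

```python
n = 8

def naive_spmd(p):
    # Naive SPMD body: sweep the entire iteration space, guarding every
    # iteration with the affine-partition constraint i == p.
    ops = []
    for i in range(n):
        for j in range(n):
            if i == p:
                ops.append((i, j))
    return ops

def tightened_spmd(p):
    # After bound tightening, the i loop has exactly one iteration, so it is
    # eliminated and its index is simply set to the value p.
    i = p
    return [(i, j) for j in range(n)]

# Same operations, but a factor of n fewer loop iterations per processor.
assert all(naive_spmd(p) == tightened_spmd(p) for p in range(n))
```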
Eliminating Tests from Innermost Loops: The second transformation is to remove conditional tests from the inner loops. As seen from the examples above, conditional tests remain if the iteration spaces of statements in the loop intersect but do not coincide completely. To avoid the need for conditional tests, we split the iteration space into subspaces, each of which executes the same set of statements. This optimization requires code to be duplicated and should only be used to remove conditionals in the inner loops.
To split an iteration space to reduce tests in inner loops, we apply the following steps repeatedly until we remove all the tests in the inner loops:
1. Select a loop that consists of statements with different bounds.
2. Split the loop using a condition such that at least one statement is excluded from at least one of the resulting components. We choose the condition from among the boundaries of the overlapping polyhedra: a condition is useful if some statement has all of its iterations on only one side of it.
3. Generate code for each of these iteration spaces separately.
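The steps above can be sketched on a one-dimensional example (our own illustration): two statements whose iteration spaces overlap on [40, 60) share a loop over [0, 100). Splitting at the boundaries i = 40 and i = 60 yields three loops, none of which needs a test:

```python
def with_tests():
    # One loop; guards select which statements run on each iteration.
    log = []
    for i in range(100):
        if i < 60:            # S1's iteration space: 0 <= i < 60
            log.append(('S1', i))
        if i >= 40:           # S2's iteration space: 40 <= i < 100
            log.append(('S2', i))
    return log

def split():
    # Iteration space split at the polyhedron boundaries i = 40 and i = 60;
    # each resulting loop executes a fixed set of statements, test-free.
    log = []
    for i in range(0, 40):    # only S1 executes here
        log.append(('S1', i))
    for i in range(40, 60):   # both S1 and S2 execute
        log.append(('S1', i))
        log.append(('S2', i))
    for i in range(60, 100):  # only S2 executes
        log.append(('S2', i))
    return log

assert with_tests() == split()
```

The code duplication is visible: the bodies of S1 and S2 each appear twice, which is why the transformation is reserved for inner loops.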
Source-Code Transforms: We have seen how we can derive from simple affine partitions for each statement programs that are significantly different from the original source. It is not apparent from the examples seen so far how affine partitions correlate with changes at the source level. This section shows that we can reason about source code changes relatively easily by breaking down affine partitions into a series of primitive transforms.
Seven Primitive Affine Transforms: Every affine partition can be expressed as a series of primitive affine transforms, each of which corresponds to a simple change at the source level. There are seven kinds of primitive transforms:
We draw the data dependences for the code before and after the transforms. From the data dependence diagrams, we see that each primitive corresponds to a simple geometric transform and induces a relatively simple code transform. The seven primitives are:
- Fusion. The fusion transform is characterized by mapping multiple loop indexes in the original program to the same loop index. The new loop fuses statements from different loops.
- Fission. Fission is the inverse of fusion. It maps the same loop index for different statements to different loop indexes in the transformed code. This splits the original loop into multiple loops.
- Re-indexing. Re-indexing shifts the dynamic executions of a statement by a constant number of iterations. The affine transform has a constant term.
- Scaling. Consecutive iterations in the source program are spaced apart by a constant factor in the transformed program. The affine transform has a positive, nonunit coefficient.
- Reversal. Execute iterations in a loop in reverse order. Reversal is characterized by having -1 as a coefficient.
- Permutation. Permute the inner and outer loops. The affine transform consists of permuted rows of the identity matrix.
- Skewing. Iterate through the iteration space in the loops at an angle. The affine transform is a unimodular matrix with 1's on the diagonal.
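The primitives can be viewed as remappings of iteration instances. The sketch below (our own illustration, over a small loop nest) expresses fission, reversal, permutation, and skewing as index maps; each transform reorders the dynamic instances but executes exactly the same set of them:

```python
n = 5
original = [(i, j) for i in range(n) for j in range(n)]

# Fission: one loop containing S1 and S2 becomes two loops,
# reordering the statement instances.
fused = [s for i in range(n) for s in (('S1', i), ('S2', i))]
fissioned = [('S1', i) for i in range(n)] + [('S2', i) for i in range(n)]
assert sorted(fused) == sorted(fissioned)      # same instances, new order

# Reversal of the inner loop: j -> n - 1 - j (coefficient -1).
reversed_inner = [(i, n - 1 - j) for i in range(n) for j in range(n)]
assert sorted(reversed_inner) == sorted(original)

# Permutation: interchange the two loops (permuted rows of the identity).
permuted = [(i, j) for j in range(n) for i in range(n)]
assert sorted(permuted) == sorted(original)

# Skewing: new inner index j' = i + j, so j' runs from i to i + n - 1
# (unimodular transform with 1's on the diagonal).
skewed = [(i, jp - i) for i in range(n) for jp in range(i, i + n)]
assert sorted(skewed) == sorted(original)
```

Whether each reordering is legal depends, of course, on the data dependences of the particular program.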
A Geometric Interpretation of Parallelization: The affine transforms shown in all but the fission example are derived by applying the synchronization-free affine partition algorithm to the respective source codes. (We shall discuss how fission can parallelize code with synchronization in the next section.) In each of the examples, the generated code has an (outermost) parallelizable loop whose iterations can be assigned to different processors, and no synchronization is necessary.
These examples illustrate that there is a simple geometric interpretation of how parallelization works. Dependence edges always point from an earlier instance to a later instance. Thus, dependences between statements not nested in any common loop follow the lexical order; dependences between statements nested in the same loop follow the lexicographic order. Geometrically, the dependences of a two-dimensional loop nest always point within the range [0°, 180°): the angle of a dependence is at least 0° and strictly less than 180°.
The affine transforms change the ordering of iterations such that all the dependences are found only between operations nested within the same iteration of the outermost loop. In other words, no dependence edges cross the boundaries between iterations of the outermost loop. We can parallelize simple source codes by drawing their dependences and finding such transforms geometrically.
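This criterion is easy to check in code. In the sketch below (our own illustration), every dependence of a[i][j] on a[i][j-1] stays within one iteration of the outer i loop, so the outer iterations may run in any order, and hence on different processors, without synchronization:

```python
n, m = 6, 8

def run(outer_order):
    # a[i][j] depends only on a[i][j-1]: every dependence edge stays
    # within a single iteration of the outer i loop.
    a = [[0] * m for _ in range(n)]
    for i in outer_order:
        for j in range(1, m):
            a[i][j] = a[i][j - 1] + i + 1
    return a

# Because no dependence crosses an outer-loop boundary, any ordering of
# the outer iterations (hence any assignment of them to processors)
# produces the same result.
assert run(reversed(range(n))) == run(range(n))
```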