Synchronization Between Parallel Loops
Introduction: Most programs have no parallelism if we do not allow processors to perform any synchronization. But adding even a small constant number of synchronization operations to a program can expose more parallelism. We shall first discuss parallelism made possible by a constant number of synchronizations in this section and the general case, where we embed synchronization operations in loops, in the next.
A Constant Number of Synchronizations: Programs with no synchronization-free parallelism may contain a sequence ofloops, some of which are parallelizable if they are considered independently. Wecan parallelize such loops by introducing synchronization barriers before andafter their execution.
for ( i = 1; i < n; i )
for (j = 0; j < n; j )
X [ i , j ] = f ( X [ i , j l X [ i - l , j ] ) ;
for (i = 0; i < n; i )
for (j = 1; j < n;
X [ i , j ] = g ( X [ i , j ] X [ i , j - 1 ] ) ;
Figure 11.38: Two sequential loop nests
Example: In Fig. 11.38 is a program representative of an ADI (Alternating Direction Implicit) integration algorithm. There is no synchronization free parallelism. Dependences in the first loop nest require that each processor works on a column of array X; however, dependences in the second loop nest require that each processor works on a row of array X. For there to be no communication, the entire array has to reside on the same processor, hence there is no parallelism. We observe, however, that both loops are independently parallelizable.
One way to parallelize the code is to have different processors work on different columns of the array in the first loop, synchronize and wait for all processors to finish, and then operate on the individual rows. In this way, all the computation in the algorithm can be parallelized with the introduction of just one synchronization operation. However, we note that while only one synchronization is performed, this parallelization requires almost all the data in matrix X to be transferred between processors. It is possible to reduce the amount of communication by introducing more synchronization.
It may appear that this approach is applicable only to programs consisting of a sequence of loop nests. However, we can create additional opportunities for the optimization through code transforms. We can apply loop fission to decompose loops in the original program into several smaller loops, which can then be parallelized individually by separating them with barriers.
Program-Dependence Graphs: To find all the parallelism made possible by a constant number of synchronizations,we can apply fission to the original program greedily. Break up loopsinto as many separate loops as possible, and then parallelize each loop independently.To expose all the opportunities for loop fission, we use the abstraction of aprogram-dependence graph (PDG). A program dependence graph of a program is a graph whose nodes are the assignment statements of the program and whose edges capture the data dependences, and the directions of the data dependence, between statements. An edge from statement si to statement s2 exists whenever some dynamic instance of s\ shares a data dependence with a later dynamic instance of s2.
To construct the PDG for a program, we first find the data dependences between every pair of (not necessarily distinct) static accesses in every pair of (not necessarily distinct) statements. Suppose we determine that there is dependence between access T\ in statement s\ and access T2 in statement s2. Recall that an instance of a statement is specified by an index vector i = [hih, . . ,im] where ik is the loop index of the kth outermost loop in which the statement is embedded.
1. If there exists a data-dependent pair of instances, ii of si and i2 of s2, and ii is executed before i2 in the original program, written ii ^ S l S 2 l2, then there is an edge from s\ to s2.
2. Similarly, if there exists a data-dependent pair of instances, ii of si and i 2 of s2, and i 2 - < s i s 2 ii> then there is an edge from s2 to s\. Note that it is possible for data dependence between two statements si and s2 to generate both an edge from si to s2 and an edge from s2 back to s\. In the special case where statements si and s2 are not distinct, ii - < S l s 2 i 2 if and only if ii -< i2 (ii is lexicographically less than i 2 ) . In the general case, si and s2 may be different statements, possibly belonging to different loop nests.