## Finding Synchronization-Free Parallelism

**Introduction:** Having developed the theory of affine array accesses, their reuse of data, and the dependences among them, we shall now begin to apply this theory to parallelization and optimization of real programs, it is important that we find parallelism while minimizing communication among processors. Let us start by studying the problem of parallelizing an application without allowing any communication or synchronization between processors at all. This constraint may appear to be a purely academic exercise; In fact, many such programs exist in real life, and the algorithm for solving this problem is useful in its own right. In addition, the concepts used to solve this problem can be extended to handle synchronization and communication.

**An Introductory Example: **Shown in Fig. 11.23 is an excerpt of a C translation (with Fortran-style array accesses retained for clarity) from a 5000-line FORTRAN multi grid algorithm to solve three-dimensional Euler equations. The program spends most its time in a small number of routines like the one shown in the figure. It is typical of many numerical programs. These often consist of numerous for-loops, with different nesting levels, and they have many array accesses, all of which are affine expressions of surrounding loop indexes. To keep the example short, we have elided lines from the original program with similar characteristics.

The code of Fig. 11.23 operates on the scalar variable T and a number of different arrays with different dimensions. Let us first examine the use of variable T. Because each iteration in the loop uses the same variable T, we cannot execute the iterations in parallel. However, T is used only as a way to hold a common subexpression used twice in the same iteration. In the first two of the three loop nests in Fig. 11.23, each iteration of the innermost loop writes a value into T and uses the value immediately after twice, in the same iteration. We can eliminate the dependences by replacing each use of T by the right-hand-side expression in the previous assignment of T, without changing the semantics of the program. Or, we can replace the scalar T by an array. We then have each iteration use its own array element T[j, i].

With this modification, the computation of an array element in each assignment statement depends only on other array elements with the same values for the last two components (j and i, respectively). We can thus group all operations that operate on the ( j , i ) th element of all arrays into one computation unit, and execute them in the original sequential order. This modification produces ( j l — 1) x ( i l — 1) units of computation that are all independent of one another. Notice that second and third nests in the source program involve a third loop, with index k. However, because there is no dependence between dynamic accesses with the same values for j and «, we can safely perform the loops on k inside the loops on j and i — that is, within a computation unit.

Knowing that these computation units are independent enables a number of legal transforms on this code. For example, instead of executing the code as originally written, a uniprocessor can perform the same computation by executing the units of independent operation one unit at a time. The resulting code, shown in Fig. 11.24, has improved temporal locality, because results produced are consumed immediately.

The independent units of computation can also be assigned to different processors and executed in parallel, without requiring any synchronization or communication. Since there are ( j l — 1) x ( i l — 1) independent units of computation, we can utilize at most ( j l - 1) x ( i l - 1) processors. By organizing the processors as if they were in a 2-dimensional array, with ID's where 2 < j < jl and 2 < i < i l , the SPMD program to be executed by each processor is simply the body in the inner loop in Fig. 11.24.

The above example illustrates the basic approach to finding synchronization free parallelism. We first split the computation into as many independent units as possible. This partitioning exposes the scheduling choices available. We then assign computation units to the processors, depending on the number of processors we have. Finally, we generate an SPMD program that is executed on each processor.