## Parallelism With Minimum Synchronization

**Introduction:** We have described three powerful parallelization algorithms in the last three sections: Algorithm 11.43 (finding a highest-ranked synchronization-free affine partition for a program) finds all parallelism requiring no synchronization, Algorithm 11.54 finds all parallelism requiring only a constant number of synchronizations, and Algorithm 11.59 finds all the pipelinable parallelism requiring O(n) synchronizations, where n is the number of iterations in the outermost loop. As a first approximation, our goal is to parallelize as much of the computation as possible, while introducing as little synchronization as necessary.

Algorithm 11.64, below, finds all the degrees of parallelism in a program, starting with the coarsest granularity of parallelism. In practice, to parallelize a program for a multiprocessor, we do not need to exploit all the levels of parallelism, just the outermost possible ones, until all the computation is parallelized and all the processors are fully utilized.

**Algorithm 11.64**: Find all the degrees of parallelism in a program, with all the parallelism being as coarse-grained as possible.

**INPUT:** A program to be parallelized.

**OUTPUT**: A parallelized version of the same program.

**METHOD**: Do the following:

1. Find the maximum degree of parallelism requiring no synchronization: Apply Algorithm 11.43 to the program.

2. Find the maximum degree of parallelism that requires O(1) synchronizations: Apply Algorithm 11.54 to each of the space partitions found in step 1. (If no synchronization-free parallelism is found, the whole computation is left in one partition.)

3. Find the maximum degree of parallelism that requires O(n) synchronizations: Apply Algorithm 11.59 to each of the partitions found in step 2 to find pipelined parallelism. Then apply Algorithm 11.54 to each of the partitions assigned to each processor, or to the body of the sequential loop if no pipelining is found.

4. Find the maximum degree of parallelism with successively greater degrees of synchronization: Recursively apply step 3 to the computation belonging to each of the space partitions generated by the previous step.
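The first two steps can be made concrete with a small sketch. The following hypothetical Python example (not from the text; the loop bodies are invented for illustration) shows two parallel phases: the first phase is synchronization-free across processors, and a single barrier, an O(1) number of synchronizations, separates it from a second phase that reads a neighboring processor's result.

```python
import threading

N = 4                # number of processors / space partitions (assumed)
A = [0] * N
B = [0] * N
barrier = threading.Barrier(N)

def worker(p):
    # Step 1: synchronization-free parallelism -- each processor p
    # computes A[p] independently of all others.
    A[p] = p * p
    # Step 2: O(1) synchronizations -- one barrier between the phases.
    barrier.wait()
    # The second phase may read a value another processor wrote
    # before the barrier.
    B[p] = A[p] + (A[p - 1] if p > 0 else 0)

threads = [threading.Thread(target=worker, args=(p,)) for p in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(B)  # -> [0, 1, 5, 13]
```

Without the barrier, the read of `A[p - 1]` in the second phase would race with its write in the first; the single barrier is the minimum synchronization this dependence pattern needs.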

**Considering Communication Cost**

Step 2 of Algorithm 11.64 parallelizes each strongly connected component independently if no synchronization-free parallelism is found. However, it may be possible to parallelize a number of the components without synchronization and communication. One solution is to greedily find synchronization-free parallelism among subsets of the program dependence graph that share the most data.
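The greedy strategy can be sketched as follows. This is a hypothetical illustration: the `sync_free` predicate stands in for running Algorithm 11.43 on a candidate group, and the `shared` weights (amount of data shared between two components) are assumed to be given.

```python
def greedy_group(components, shared, sync_free):
    """Greedily merge strongly connected components that share the most
    data, as long as each merged group still admits a
    synchronization-free partition.

    components: iterable of component ids.
    shared: dict mapping frozenset({a, b}) to the amount of data that
        components a and b share (assumed precomputed).
    sync_free: predicate standing in for Algorithm 11.43 -- True if the
        group can be partitioned without synchronization.
    """
    groups = [frozenset([c]) for c in components]
    while True:
        # All candidate merges, heaviest shared-data weight first.
        candidates = sorted(
            ((sum(shared.get(frozenset((a, b)), 0)
                  for a in groups[i] for b in groups[j]), i, j)
             for i in range(len(groups))
             for j in range(i + 1, len(groups))),
            reverse=True)
        for weight, i, j in candidates:
            if weight > 0 and sync_free(groups[i] | groups[j]):
                merged = groups[i] | groups[j]
                groups = [g for k, g in enumerate(groups)
                          if k not in (i, j)]
                groups.append(merged)
                break
        else:
            return groups  # no profitable, sync-free merge remains

# Example: components 1 and 2 share much data, 2 and 3 share little,
# and (we assume) no group of all three is synchronization-free.
shared = {frozenset((1, 2)): 10, frozenset((2, 3)): 1}
print(greedy_group([1, 2, 3], shared, lambda g: len(g) <= 2))
```

The greedy order matters: merging the heaviest-sharing pair first keeps the most communication inside a synchronization-free group.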

If communication is necessary between strongly connected components, we note that some communication is more expensive than others. For example, the cost of transposing a matrix is significantly higher than just having to communicate between neighboring processors. Suppose $s_1$ and $s_2$ are statements in two separate strongly connected components accessing the same data in iterations $\mathbf{i}_1$ and $\mathbf{i}_2$, respectively. If we cannot find partition mappings $(C_1, c_1)$ and $(C_2, c_2)$ for statements $s_1$ and $s_2$, respectively, such that

$$C_1\mathbf{i}_1 + c_1 - (C_2\mathbf{i}_2 + c_2) = 0,$$

we instead try to satisfy the constraint

$$\left|C_1\mathbf{i}_1 + c_1 - (C_2\mathbf{i}_2 + c_2)\right| \le \delta,$$

where $\delta$ is a small constant.
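A concrete (hypothetical) instance may help: suppose $s_1$ writes $A[i]$ and $s_2$ reads $A[i-1]$, each in a singly nested loop, and we map iteration $i$ of either statement to processor $i$, i.e., $C_1 = C_2 = 1$ and $c_1 = c_2 = 0$. The two statements touch the same element exactly when $i_1 = i_2 - 1$, so

```latex
C_1\mathbf{i}_1 + c_1 - (C_2\mathbf{i}_2 + c_2) \;=\; i_1 - i_2 \;=\; -1 .
```

Exact equality to zero is unachievable with these mappings, but the relaxed constraint holds with $\delta = 1$: every processor communicates only with its immediate neighbor.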

**Trading Communication for Synchronization**

Sometimes we would rather perform more synchronization to minimize communication. Thus, if we cannot parallelize a program with just neighborhood communication among strongly connected components, we should attempt to pipeline the computation instead of parallelizing each component independently.
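The trade-off can be sketched as follows. In this hypothetical Python example (the statement bodies are invented for illustration), two strongly connected components, `s1` and `s2`, have a dependence from `s1` to `s2` in every iteration. Run independently, each component would finish before communicating its whole result array; pipelined, the `s2` worker consumes `s1`'s values as they are produced, so each value moves once, to a neighbor, at the cost of per-iteration synchronization (the queue operations).

```python
import queue
import threading

N = 5
q = queue.Queue()   # carries s1's per-iteration results to s2
out = []

def s1():
    for i in range(N):
        q.put(i * i)              # produce A[i], one value at a time

def s2():
    for i in range(N):
        out.append(q.get() + 1)   # consume A[i] as soon as it is ready

t1 = threading.Thread(target=s1)
t2 = threading.Thread(target=s2)
t1.start(); t2.start()
t1.join(); t2.join()
print(out)  # -> [1, 2, 5, 10, 17]
```

Each `put`/`get` pair is one synchronization, so the pipeline performs O(n) synchronizations in exchange for strictly local, per-value communication.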