Introduction: Loops are the main target for parallelization, especially in applications using arrays. Long-running applications tend to have large arrays, which lead to loops with many iterations, one for each element in the array. It is not uncommon to find loops whose iterations are independent of one another; we can divide the large number of iterations of such loops among the processors.
If the amount of work performed in each iteration is roughly the same, simply dividing the iterations evenly across processors will achieve maximum parallelism. Example 11.1 is an extremely simple example showing how we can take advantage of loop-level parallelism.
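A minimal sketch of such a loop, assuming vectors X, Y, and Z each of length n, might look like:

    for (i = 0; i < n; i++) {
        Z[i] = X[i] - Y[i];    /* difference of corresponding elements */
        Z[i] = Z[i] * Z[i];    /* square it, leaving the result in Z[i] */
    }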
This example computes the squares of the differences between corresponding elements of vectors X and Y and stores them in Z. The loop is parallelizable because each iteration accesses a different set of data. We can execute the loop on a computer with M processors by giving each processor a unique ID p = 0, 1, ..., M - 1 and having each processor execute the same code.
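A minimal sketch of that code follows; the block size b = ceil(n/M) and the bound hi are illustrative names, not part of the original example:

    int b = (n + M - 1) / M;                     /* ceil(n/M) iterations per processor */
    int hi = (p + 1) * b < n ? (p + 1) * b : n;  /* stay within the arrays */
    for (int i = p * b; i < hi; i++) {
        Z[i] = X[i] - Y[i];
        Z[i] = Z[i] * Z[i];
    }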
The parallel code shown in Example 11.1 is an SPMD (Single Program Multiple Data) program. The same code is executed by all processors, but it is parameterized by an identifier unique to each processor, so different processors can take different actions. Typically, one processor, known as the master, executes the serial part of the computation. Upon reaching a parallelized section of the code, the master wakes up all the slave processors, and all the processors execute the parallelized region together. At the end of each parallelized region, all the processors participate in a barrier synchronization: any operation executed before a processor enters the barrier is guaranteed to be completed before any other processor is allowed to leave the barrier and execute operations that come after it.
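One possible realization of this pattern, sketched with POSIX threads; the thread count M, the array size N, and the barrier placement are assumptions made for illustration:

    #include <pthread.h>

    #define M 4                      /* number of processors (threads), assumed */
    #define N 1000                   /* array size, assumed */

    double X[N], Y[N], Z[N];
    pthread_barrier_t barrier;

    /* SPMD worker: every thread executes this same function,
       parameterized by its unique id p. */
    void *worker(void *arg) {
        int p = *(int *)arg;
        int b = (N + M - 1) / M;                    /* ceil(N/M) iterations each */
        int hi = (p + 1) * b < N ? (p + 1) * b : N;
        for (int i = p * b; i < hi; i++) {
            Z[i] = X[i] - Y[i];
            Z[i] = Z[i] * Z[i];
        }
        /* Barrier: no thread proceeds past this point until every
           thread has finished its share of the loop. */
        pthread_barrier_wait(&barrier);
        return NULL;
    }

    int main(void) {
        pthread_t t[M];
        int ids[M];
        pthread_barrier_init(&barrier, NULL, M);
        for (int p = 0; p < M; p++) {
            ids[p] = p;
            pthread_create(&t[p], NULL, worker, &ids[p]);
        }
        for (int p = 0; p < M; p++)
            pthread_join(t[p], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }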
If we parallelize only small loops like the one in Example 11.1, the resulting code is likely to have low parallelism coverage and relatively fine-grained parallelism. We prefer to parallelize the outermost loops in a program, as that yields the coarsest granularity of parallelism. Consider, for example, a two-dimensional FFT that operates on an n x n data set. Such a program performs n FFTs on the rows of the data, then another n FFTs on the columns. It is preferable to assign each of the n independent FFTs to a separate processor, rather than to have several processors collaborate on one FFT. The code is easier to write, the parallelism coverage of the algorithm is 100%, and the code has good data locality, since it requires no communication at all while computing an FFT.
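A sketch of the row phase under this strategy; fft_row is a hypothetical stand-in for an in-place one-dimensional FFT, and N and M are assumed constants:

    #define N 1024    /* the data set is N x N; the size is assumed */
    #define M 8       /* number of processors, assumed */

    static double A[N][N];

    /* Hypothetical stand-in for an in-place 1D FFT on one row; the
       point here is the partitioning, not the transform itself. */
    static void fft_row(double *row, int n) { (void)row; (void)n; }

    /* SPMD code for processor p: each processor transforms a
       contiguous block of whole rows, so every FFT is computed
       entirely by one processor, with no communication. */
    void row_phase(int p) {
        int b = (N + M - 1) / M;                     /* ceil(N/M) rows each */
        int hi = (p + 1) * b < N ? (p + 1) * b : N;
        for (int r = p * b; r < hi; r++)
            fft_row(A[r], N);
    }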
Many applications do not have large outermost loops that are parallelizable. The execution time of these applications, however, is often dominated by time-consuming kernels, which may have hundreds of lines of code consisting of loops at different nesting levels. It is sometimes possible to take a kernel, reorganize its computation, and partition it into mostly independent units by focusing on its locality.
Data Locality: There are two somewhat different notions of data locality that need to be considered when parallelizing programs. Temporal locality occurs when the same data is used several times within a short time period. Spatial locality occurs when data elements located near each other are used within a short period of time. An important form of spatial locality occurs when all the elements that appear on one cache line are used together: as soon as one element from a cache line is needed, all the elements on the same line are brought into the cache and will probably still be there if they are used soon. The effect of this spatial locality is that cache misses are minimized, yielding a significant speedup of the program.
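To make the cache-line effect concrete, here is a sketch contrasting two traversals of a large two-dimensional array (N is assumed large enough that the array does not fit in cache):

    #define N 2048
    static double A[N][N];

    /* Good spatial locality: C stores A in row-major order, so this
       walk touches consecutive memory locations and uses every element
       of each cache line that is brought in. */
    double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += A[i][j];
        return s;
    }

    /* Poor spatial locality: consecutive accesses are N doubles apart,
       so each cache line brought in contributes only one element before
       it is evicted, causing far more cache misses. */
    double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += A[i][j];
        return s;
    }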
Kernels can often be written in many semantically equivalent ways, but with widely varying data locality and performance. Example 11.2 shows an alternative way of expressing the computation in Example 11.1.
Example 11.2: Like Example 11.1, the following also finds the squares of the differences between elements in vectors X and Y.
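A sketch of this two-loop formulation:

    for (i = 0; i < n; i++)
        Z[i] = X[i] - Y[i];     /* first loop: the differences */
    for (i = 0; i < n; i++)
        Z[i] = Z[i] * Z[i];     /* second loop: the squares */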
The first loop finds the differences; the second finds the squares. Code like this appears often in real programs, because it is how one optimizes a program for vector machines, supercomputers with instructions that perform simple arithmetic operations on entire vectors at a time. Note that the bodies of the two loops here are fused into one in Example 11.1.
Given that the two programs perform the same computation, which performs better? The fused loop in Example 11.1 has better performance because it has better data locality. Each difference is squared immediately, as soon as it is produced; in fact, we can hold the difference in a register, square it, and write the result just once into the memory location Z[i]. In contrast, the code in Example 11.2 fetches Z[i] once and writes it twice. Moreover, if the size of the array is larger than the cache, Z[i] needs to be refetched from memory the second time it is used in this example. Thus, the two-loop code can run significantly slower.
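For instance, the fused loop can be rewritten so that the difference never leaves a register; a sketch of this single-write version:

    for (i = 0; i < n; i++) {
        double d = X[i] - Y[i];   /* difference held in a register */
        Z[i] = d * d;             /* Z[i] is written exactly once */
    }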