Minimisation of DFAs
Introduction: We shall introduce many of the techniques used by parallel compilers in an extended example. In this section, we explore the familiar matrix-multiplication algorithm to show that it is nontrivial to optimize even a simple and easily parallelizable program. We shall see how rewriting the code can improve data locality; that is, processors are able to do their work with far less communication (with global memory or with other processors, depending on the architecture) than if the straightforward program is chosen. We shall also discuss how cognizance of the existence of cache lines that hold several consecutive data elements can improve the running time of programs such as matrix multiplication.
The Matrix-Multiplication Algorithm: In Fig. 11.5 we see a typical matrix-multiplication program.2 It takes two nxn matrices, X and F, and produces their product in a third n x n matrix Z. Recall that % — the element of matrix Z in row i and column j — must become YJl=i ^-ihXkj- The code of Fig. 11.5 generates n2 results, each of which is an inner product between one row and one column of the two matrix operands. Clearly, the calculations of each of the elements of Z are independent and can be executed in parallel.
The larger n is, the more times the algorithm touches each element. That is, there are 3n2 locations among the three matrices, but the algorithm performs n3 operations, each of which multiplies an element of X by an element of Y and adds the product to an element of Z. Thus, the algorithm is computation intensive and memory accesses should not, in principle, constitute a bottleneck.
Serial Execution of the Matrix Multiplication: Let us first consider how this program behaves when run sequentially on auniprocessor. The innermost loop reads and writes the same element of Z, anduses a row of X and a column of Y. Z[i,j] can easily be stored in a registerand requires no memory accesses. Assume, without loss of generality, that thematrix is laid out in row-major order, and that c is the number of array elementsin a cache line.
Figure 11.6 suggests the access pattern as we execute one iteration of the outer loop of Fig. 11.5. In particular, the picture shows the first iteration, with i = 0. Each time we move from one element of the first row of X to the next, we visit each element in a single column of Y. We see in Fig. 11.6 the assumed organization of the matrices into cache lines. That is, each small rectangle represents a cache line holding four array elements (i.e., c = 4 and n = 12 in the picture).
Accessing X puts little burden on the cache. One row of X is spread among only n/c cache lines. Assuming these all fit in the cache, only n/c cache misses occur for a fixed value of index i, and the total number of misses for all of X is n2/c, the minimum possible (we assume n is divisible by c, for convenience). However, while using one row of X, the matrix-multiplication algorithm accesses all the elements of Y, column by column. That is, when j = 0, the inner loop brings to the cache the entire first column of Y. Notice that the elements of that column are stored among n different cache lines. If the cache is big enough (or n small enough) to hold n cache lines, and no other uses of the cache force some of these cache lines to be expelled, then the column for j = 0 will still be in the cache when we need the second column of Y. In that case, there will not be another n cache misses reading Y, until j = c, at which time we need to bring into the cache an entirely different set of cache lines for Y. Thus, to complete the first iteration of the outer loop (with i = 0) requires between n 2 / c and n2 cache misses, depending on whether columns of cache lines can survive from one iteration of the second loop to the next.
Moreover, as we complete the outer loop, for i = 1,2, and so on, we may have many additional cache misses as we read Y, or none at all. If the cache is big enough that all n 2 / c cache lines holding Y can reside together in the cache, then we need no more cache misses. The total number of cache misses is thus 2n2 / c , half for X and half for Y. However, if the cache can hold one column of Y but not all of Y, then we need to bring all of Y into cache again, each time we perform an iteration of the outer loop. That is, the number of cache misses is n 2 / c n 3 / c ; the first term is for X and the second is for Y. Worst, if we cannot even hold one column of Y in the cache, then we have n2 cache misses per iteration of the outer loop and a total of n 2 / c n3 cache misses.
Row-by-Row Parallelization: Now, let us consider how we could use some number of processors, say p processors,to speed up the execution of Fig. 11.5. An obvious approach to parallelizingmatrix multiplication is to assign different rows of Z to different processors. Aprocessor is responsible for n/p consecutive rows (we assume n is divisible byp, for convenience). With this division of labor, each processor needs to accessn/p rows of matrices X and Z, but the entire Y matrix. One processor willcompute n2 jp elements of Z, performing n3/p multiply-and-add operations todo so.
While the computation time thus decreases in proportion to p, the communication cost actually rises in proportion to p. That is, each of p processors has to read n2/p elements of X, but all n2 elements of Y. The total number of cache lines that must be delivered to the caches of the p processors is at last n 2 / c p n 2 / c ; the two terms are for delivering X and copies of Y, respectively. As p approaches n, the computation time becomes 0 (n 2) while the communication cost is 0 (n 3). That is, the bus on which data is moved between memory and the processors' caches becomes the bottleneck. Thus, with the proposed data layout, using a large number of processors to share the computation can actually slow down the computation, rather than speed it up.