Introduction: From array access functions we derive two kinds of information useful for locality optimization and parallelization:
1. Data reuse: for locality optimization, we wish to identify sets of iterations that access the same data or the same cache line.
2. Data dependence: for correctness of parallelization and locality loop transformations, we wish to identify all the data dependences in the code. Recall that two (not necessarily distinct) accesses have data dependence if instances of the accesses may refer to the same memory location, and at least one of them is a write.
In many cases, wherever we identify iterations that reuse the same data, there are data dependences between them; and wherever there is a data dependence, the same data is obviously reused. For example, in matrix multiplication, the same element in the output array is written O(n) times. The write operations must be executed in the original execution order; there is reuse because we can allocate the same element to a register. However, not all reuse can be exploited in locality optimizations; here is an example illustrating this issue.
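As a concrete illustration (a sketch, not taken from the text), the inner-product form of matrix multiplication can keep the accumulating output element in a scalar, exploiting the O(n) writes to the same location:

```c
#define N 4  /* illustrative size */

/* Sketch: Z = X * Y.  Each Z[i][j] would be written N times in a naive
   formulation; accumulating in the scalar `sum` keeps those O(n) reuses
   in a register and issues a single store per element. */
void matmul(double Z[N][N], double X[N][N], double Y[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;            /* register-allocated accumulator */
            for (int k = 0; k < N; k++)
                sum += X[i][k] * Y[k][j];
            Z[i][j] = sum;               /* one store instead of N */
        }
}
```

Note that this rewrite is legal only because the N stores to Z[i][j] must occur in their original order, which is exactly the dependence the register accumulation respects.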
Example: Consider the following loop:
for (i = 0; i < n; i++)
    Z[7i+3] = Z[3i+5];
We observe that the loop writes to a different location at each iteration, so there are no reuses of, or dependences among, the different write operations. The loop reads locations 5, 8, 11, 14, 17, ..., and writes locations 3, 10, 17, 24, ....
The reads and writes access some of the same elements: 17, 38, 59, and so on. That is, the integers of the form 17 + 21j, for j = 0, 1, 2, ..., are exactly those integers that can be written both as 7i1 + 3 and as 3i2 + 5 for some integers i1 and i2. However, this reuse occurs rarely and cannot be exploited easily, if at all. Data dependence differs from reuse analysis in that one of the accesses sharing a data dependence must be a write access. More importantly, data dependence analysis needs to be both correct and precise: it must find all dependences for correctness, and it should not report spurious dependences, because they can cause unnecessary serialization.
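A quick brute-force check (an illustrative sketch, not part of the text) confirms that the written locations 7i1 + 3 and the read locations 3i2 + 5 coincide exactly at the locations 17 + 21j:

```c
/* Sketch: characterize the locations touched by Z[7i+3] = Z[3i+5]
   for i = 0, 1, 2, ....  A location is written iff it has the form
   7*i1 + 3, and read iff it has the form 3*i2 + 5. */
int is_written(int loc) { return loc >= 3 && (loc - 3) % 7 == 0; }
int is_read(int loc)    { return loc >= 5 && (loc - 5) % 3 == 0; }

/* By the Chinese remainder theorem (7 and 3 are coprime, lcm = 21),
   a location is both read and written iff it equals 17 + 21*j. */
int is_common(int loc)  { return loc >= 17 && (loc - 17) % 21 == 0; }
```

Since only one location in every 21 is shared between the read and write streams, the reuse is too sparse to be worth exploiting.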
With data reuse, we only need to find where most of the exploitable reuses are. This problem is much simpler, so we take up this topic here in this section and tackle data dependences in the next. We simplify reuse analysis by ignoring loop bounds, because they seldom change the shape of the reuse. Much of the reuse exploitable by affine partitioning resides among instances of the same array access, or among accesses that share the same coefficient matrix (what we have typically called F in the affine index function). As shown above, access patterns like 7i + 3 and 3i + 5 have no reuse of interest.
Types of Reuse: We start with Example 11.20 to illustrate the different kinds of data reuse. In the following, we need to distinguish an access as an instruction in a program, e.g., x = Z[i,j], from the execution of that instruction many times as we execute the loop nest. For emphasis, we refer to the statement itself as a static access, while the various iterations of the statement as we execute its loop nest are called dynamic accesses.
Reuses can be classified as self versus group. If iterations reusing the same data come from the same static access, we refer to the reuse as self reuse; if they come from different accesses, we refer to it as group reuse. The reuse is temporal if the same exact location is referenced; it is spatial if the same cache line is referenced.
Example: Consider the following loop nest:
float Z[n];
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        Z[j+1] = (Z[j] + Z[j+1] + Z[j+2]) / 3;
Accesses Z[j], Z[j+1], and Z[j+2] each have self-spatial reuse, because consecutive iterations of the same access refer to contiguous array elements, and contiguous elements are very likely to reside on the same cache line. They also each have self-temporal reuse, since the exact same elements are used over and over again in each iteration of the outer loop. Finally, since they all share the same coefficient matrix, there is group reuse, both temporal and spatial, between the different accesses. Although there are 4n² accesses in this code, if the reuse can be exploited we need to bring only about n/c cache lines into the cache, where c is the number of words in a cache line. We drop a factor of n due to self-temporal reuse, a factor of c due to spatial locality, and finally a factor of 4 due to group reuse.
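The n/c estimate can be made concrete with a small sketch (assuming, hypothetically, a cache line of c = 8 floats and a bound n = 64; neither figure is from the text) that marks every cache line touched by the loop nest's four accesses:

```c
#define N 64   /* assumed loop bound n */
#define C 8    /* assumed words (floats) per cache line */

/* Sketch: mark each cache line touched by the stencil's four accesses
   (one write, three reads) over elements Z[0..N+1], then count the
   distinct lines.  Despite 4*N*N dynamic accesses, only about N/C
   lines are ever needed. */
int count_lines(void) {
    char touched[(N + 2) / C + 1] = {0};  /* one flag per line */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            touched[(j + 1) / C] = 1;     /* write Z[j+1] */
            touched[j / C] = 1;           /* read  Z[j]   */
            touched[(j + 1) / C] = 1;     /* read  Z[j+1] */
            touched[(j + 2) / C] = 1;     /* read  Z[j+2] */
        }
    int lines = 0;
    for (unsigned k = 0; k < sizeof touched; k++)
        lines += touched[k];
    return lines;
}
```

With these assumed parameters the count is 9, i.e., N/C = 8 lines plus one for the boundary element Z[N+1], against 4·N² = 16384 dynamic accesses.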
In the following, we show how we can use linear algebra to extract the reuse information from affine array accesses. We are interested in not just finding how much potential savings there are, but also which iterations are reusing the data so that we can try to move them close together to exploit the reuse.