Instruction-Level Parallelism
Introduction: Every modern high-performance processor can execute several operations in a single clock cycle. The "billion-dollar question" is: how fast can a program run on a processor with instruction-level parallelism? The answer depends on:
1. The potential parallelism in the program.
2. The available parallelism on the processor.
3. Our ability to extract parallelism from the original sequential program.
4. Our ability to find the best parallel schedule given scheduling constraints.
If all the operations in a program are highly dependent upon one another, then no amount of hardware or parallelization techniques can make the program run fast in parallel. There has been a lot of research on understanding the limits of parallelization. Typical nonnumeric applications have much inherent dependence. For example, these programs have many data-dependent branches that make it hard even to predict which instructions are to be executed, let alone decide which operations can be executed in parallel. Therefore, work in this area has focused on relaxing the scheduling constraints, including the introduction of new architectural features, rather than the scheduling techniques themselves.
Numeric applications, such as scientific computing and signal processing, tend to have more parallelism. These applications deal with large aggregate data structures; operations on distinct elements of a structure are often independent of one another and can be executed in parallel. Additional hardware resources can take advantage of such parallelism and are provided in high-performance, general-purpose machines and digital signal processors. These programs tend to have simple control structures and regular data-access patterns, and static techniques have been developed to extract the available parallelism from them. Code scheduling for such applications is interesting and significant, as they offer a large number of independent operations to be mapped onto a large number of resources.
Both parallelism extraction and scheduling for parallel execution can be performed either statically in software or dynamically in hardware. In fact, even machines with hardware scheduling can be aided by software scheduling. This chapter starts by explaining the fundamental issues in using instruction-level parallelism, which are the same regardless of whether the parallelism is managed by software or hardware. We then motivate the basic data-dependence analyses needed for the extraction of parallelism.
Finally, we present the basic ideas in code scheduling. We describe a technique for scheduling basic blocks, a method for handling highly data-dependent control flow found in general-purpose programs, and finally a technique called software pipelining that is used primarily for scheduling numeric programs.
Processor Architectures: When we think of instruction-level parallelism, we usually imagine a processor issuing several operations in a single clock cycle. In fact, it is possible for a machine to issue just one operation per clock and yet achieve instruction-level parallelism using the concept of pipelining. In the following, we shall first explain pipelining and then discuss multiple-instruction issue.
Instruction Pipelines and Branch Delays: Practically every processor, be it a high-performance supercomputer or a standard machine, uses an instruction pipeline. With an instruction pipeline, a new instruction can be fetched every clock while preceding instructions are still going through the pipeline. Shown in Fig. 10.1 is a simple 5-stage instruction pipeline: it first fetches the instruction (IF), decodes it (ID), executes the operation (EX), accesses the memory (MEM), and writes back the result (WB).
The figure shows how instructions i, i + 1, i + 2, i + 3, and i + 4 can execute at the same time. Each row corresponds to a clock tick, each column to an instruction, and each entry specifies the stage that instruction occupies at that clock tick. If the result from an instruction is available by the time the succeeding instruction needs the data, the processor can issue an instruction every clock. Branch instructions are especially problematic because until they are fetched, decoded, and executed, the processor does not know which instruction will execute next. Many processors speculatively fetch and decode the immediately succeeding instructions in case a branch is not taken. When a branch is found to be taken, the instruction pipeline is emptied and the branch target is fetched.
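The stage-by-stage overlap described above can be sketched in a few lines of code. The snippet below is an illustrative model only, not a simulation of any particular processor: it computes, for each clock tick, which stage of the 5-stage pipeline each of five instructions occupies, reproducing the structure of the diagram in Fig. 10.1.

```python
# Sketch of the classic 5-stage pipeline: instruction i enters IF at clock i,
# so at any clock up to five instructions are in flight, one per stage.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(num_instructions):
    """Return, per clock tick, the stage each in-flight instruction occupies."""
    total_clocks = num_instructions + len(STAGES) - 1
    rows = []
    for clock in range(total_clocks):
        row = {}
        for i in range(num_instructions):
            stage_index = clock - i  # instruction i is delayed i clocks
            if 0 <= stage_index < len(STAGES):
                row[f"i+{i}" if i else "i"] = STAGES[stage_index]
        rows.append(row)
    return rows

for clock, row in enumerate(pipeline_diagram(5), start=1):
    print(clock, row)
```

At clock 5 all five instructions are active at once, each in a different stage, which is exactly the steady-state behavior that lets the processor issue one instruction per clock.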
Thus, taken branches introduce a delay in the fetch of the branch target and introduce "hiccups" in the instruction pipeline. Advanced processors use hardware to predict the outcomes of branches based on their execution history and to prefetch from the predicted target locations. Branch delays are nonetheless observed if branches are mispredicted.
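One common history-based scheme the hardware can use is a two-bit saturating counter per branch: two consecutive surprises are needed before the prediction flips, so a loop branch that is taken almost every iteration is not derailed by the single not-taken exit. The sketch below is a simplified illustration of this idea, not the predictor of any specific processor; the branch address and outcome sequence are made up for the example.

```python
# A two-bit saturating counter per branch: states 0-1 predict not-taken,
# states 2-3 predict taken. Updates saturate at the ends of the range.
class TwoBitPredictor:
    def __init__(self):
        self.counters = {}  # branch address -> counter state 0..3

    def predict(self, branch_addr):
        # Unseen branches start in a weak state (1, weakly not-taken).
        return self.counters.get(branch_addr, 1) >= 2  # True = predict taken

    def update(self, branch_addr, taken):
        state = self.counters.get(branch_addr, 1)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.counters[branch_addr] = state

# A loop-like branch: taken, taken, falls through once, then taken again.
p = TwoBitPredictor()
mispredictions = 0
for taken in [True, True, False, True]:
    if p.predict(0x400) != taken:
        mispredictions += 1
    p.update(0x400, taken)
print(mispredictions)
```

Note that after the single not-taken outcome the counter drops only from 3 to 2, so the final taken branch is still predicted correctly; each misprediction is where the pipeline "hiccup" of the text would occur.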
Pipelined Execution: Some instructions take several clocks to execute. One common example is the memory-load operation. Even when a memory access hits in the cache, it usually takes several clocks for the cache to return the data. We say that the execution of an instruction is pipelined if succeeding instructions not dependent on the result are allowed to proceed. Thus, even if a processor can issue only one operation per clock, several operations might be in their execution stages at the same time. If the deepest execution pipeline has n stages, potentially n operations can be "in flight" at the same time. Note that not all instructions are fully pipelined. While floating-point adds and multiplies often are fully pipelined, floating-point divides, being more complex and less frequently executed, often are not.
Most general-purpose processors dynamically detect dependences between consecutive instructions and automatically stall the execution of instructions if their operands are not available. Some processors, especially those embedded in hand-held devices, leave the dependence checking to the software in order to keep the hardware simple and power consumption low. In this case, the compiler is responsible for inserting "no-op" instructions in the code if necessary to assure that the results are available when needed.
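The compiler-side no-op insertion just described can be sketched as follows. This is a minimal illustration under assumed conditions: a machine that issues one instruction per clock with no hardware stalling, and made-up operation latencies (e.g., a 2-clock load); real latencies and instruction formats vary by machine.

```python
# Compiler-side no-op insertion for a machine with no hardware interlocks.
# Latencies are illustrative, not those of any real processor.
LATENCY = {"LD": 2, "ADD": 1, "MUL": 3}

def insert_nops(instructions):
    """instructions: list of (opcode, dest, sources).
    Insert NOPs so every source value is available when its consumer issues."""
    ready_at = {}            # register -> first clock its value is available
    scheduled, clock = [], 0
    for op, dest, srcs in instructions:
        # Issue no earlier than the clock at which all sources are ready.
        issue = max([ready_at.get(r, 0) for r in srcs] + [clock])
        scheduled += [("NOP", None, ())] * (issue - clock)  # fill the gap
        scheduled.append((op, dest, srcs))
        ready_at[dest] = issue + LATENCY[op]
        clock = issue + 1
    return scheduled

# An ADD that uses the result of a 2-clock load needs one NOP between them.
code = [("LD", "r1", ("r2",)), ("ADD", "r3", ("r1",))]
print([op for op, _, _ in insert_nops(code)])
```

On a machine that stalls in hardware, the same gap would instead appear as a dynamic pipeline stall; the no-op version simply makes that delay explicit in the instruction stream, trading code size for hardware simplicity and lower power.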