Branch : Computer Science and Engineering
Subject : Compiler design
Unit : Code Generation

Introduction to Instruction-Level Parallelism


Introduction: Every modern high-performance processor can execute several operations in a single clock cycle. The "billion-dollar question" is: how fast can a program run on a processor with instruction-level parallelism? The answer depends on:

1. The potential parallelism in the program.

2. The available parallelism on the processor.

3. Our ability to extract parallelism from the original sequential program.

4. Our ability to find the best parallel schedule given scheduling constraints.

If all the operations in a program are highly dependent upon one another, then no amount of hardware or parallelization techniques can make the program run fast in parallel. There has been a lot of research on understanding the limits of parallelization. Typical nonnumeric applications have much inherent dependence. For example, these programs have many data-dependent branches that make it hard even to predict which instructions are to be executed, let alone decide which operations can be executed in parallel. Therefore, work in this area has focused on relaxing the scheduling constraints, including the introduction of new architectural features, rather than the scheduling techniques themselves.

Numeric applications, such as scientific computing and signal processing, tend to have more parallelism. These applications deal with large aggregate data structures; operations on distinct elements of the structure are often independent of one another and can be executed in parallel. Additional hardware resources are provided in high-performance, general-purpose machines and digital signal processors to take advantage of such parallelism. These programs tend to have simple control structures and regular data-access patterns, and static techniques have been developed to extract the available parallelism from them. Code scheduling for such applications is interesting and significant, as they offer a large number of independent operations to be mapped onto a large number of resources.

Both parallelism extraction and scheduling for parallel execution can be performed either statically in software or dynamically in hardware. In fact, even machines with hardware scheduling can be aided by software scheduling. This chapter starts by explaining the fundamental issues in using instruction-level parallelism, which are the same regardless of whether the parallelism is managed by software or hardware. We then motivate the basic data-dependence analyses needed for the extraction of parallelism.

Finally, we present the basic ideas in code scheduling. We describe a technique for scheduling basic blocks, a method for handling highly data-dependent control flow found in general-purpose programs, and finally a technique called software pipelining that is used primarily for scheduling numeric programs.

Processor Architectures: When we think of instruction-level parallelism, we usually imagine a processor issuing several operations in a single clock cycle. In fact, it is possible for a machine to issue just one operation per clock and yet achieve instruction-level parallelism using the concept of pipelining. In the following, we shall first explain pipelining, then discuss multiple-instruction issue.

Instruction Pipelines and Branch Delays: Practically every processor, be it a high-performance supercomputer or a standard machine, uses an instruction pipeline. With an instruction pipeline, a new instruction can be fetched every clock while preceding instructions are still going through the pipeline. Shown in Fig. 10.1 is a simple 5-stage instruction pipeline: it first fetches the instruction (IF), decodes it (ID), executes the operation (EX), accesses the memory (MEM), and writes back the result (WB).

The figure shows how instructions i, i + 1, i + 2, i + 3, and i + 4 can execute at the same time. Each row corresponds to a clock tick, and each column in the figure specifies the stage each instruction occupies at each clock tick. If the result from an instruction is available by the time the succeeding instruction needs the data, the processor can issue an instruction every clock. Branch instructions are especially problematic because until they are fetched, decoded, and executed, the processor does not know which instruction will execute next. Many processors speculatively fetch and decode the immediately succeeding instructions in case a branch is not taken. When a branch is found to be taken, the instruction pipeline is emptied and the branch target is fetched.

Thus, taken branches introduce a delay in the fetch of the branch target and introduce "hiccups" in the instruction pipeline. Advanced processors use hardware to predict the outcomes of branches based on their execution history and to prefetch from the predicted target locations. Branch delays are nonetheless observed if branches are mispredicted.
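
The occupancy pattern described above can be made concrete with a few lines of code. The following Python fragment is only a sketch, not something from the text: the stage list, the assumption that a taken branch resolves in the EX stage, and the decision to simply squash the wrong-path instructions (rather than model fetching of the branch target) are all ours. It prints which stage each instruction occupies at every clock tick.

# Sketch of the 5-stage pipeline discussed above (illustrative only).
# Prints the stage each instruction occupies at every clock tick and
# squashes the speculatively fetched instructions once a taken branch
# resolves in EX.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(num_instructions, taken_branch=None):
    """Return (clock, {instruction index: stage}) rows for an ideal pipeline.
    If taken_branch is the index of a branch that turns out to be taken,
    the instructions fetched after it are squashed once it finishes EX."""
    rows = []
    squashed = set()
    for clock in range(num_instructions + len(STAGES) - 1):
        occupancy = {}
        for i in range(num_instructions):
            stage = clock - i              # instruction i enters IF at clock i
            if 0 <= stage < len(STAGES) and i not in squashed:
                occupancy[i] = STAGES[stage]
        # Once the taken branch leaves EX, empty the pipeline behind it.
        if taken_branch is not None and clock - taken_branch == 2:
            squashed.update(range(taken_branch + 1, num_instructions))
        rows.append((clock + 1, occupancy))
    return rows

for clock, occupancy in pipeline_diagram(5, taken_branch=1):
    labels = ["i" if k == 0 else f"i+{k}" for k in range(5)]
    row = "  ".join(f"{lab}:{occupancy.get(k, '--')}" for k, lab in enumerate(labels))
    print(f"clock {clock}: {row}")

With no taken branch, the output reproduces the staircase pattern of Fig. 10.1; with the branch at i + 1 taken, the rows after its EX stage show the pipeline "hiccup" while the wrong-path instructions disappear.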

Pipelined Execution: Some instructions take several clocks to execute. One common example is the memory-load operation. Even when a memory access hits in the cache, it usually takes several clocks for the cache to return the data. We say that the execution of an instruction is pipelined if succeeding instructions not dependent on the result are allowed to proceed. Thus, even if a processor can issue only one operation per clock, several operations might be in their execution stages at the same time. If the deepest execution pipeline has n stages, potentially n operations can be "in flight" at the same time. Note that not all instructions are fully pipelined. While floating-point adds and multiplies often are fully pipelined, floating-point divides, being more complex and less frequently executed, often are not.

Most general-purpose processors dynamically detect dependences between consecutive instructions and automatically stall the execution of instructions if their operands are not available. Some processors, especially those embedded in hand-held devices, leave the dependence checking to the software in order to keep the hardware simple and power consumption low. In this case, the compiler is responsible for inserting "no-op" instructions in the code if necessary to assure that the results are available when needed.
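
The kind of dependence checking and no-op insertion described above can be sketched briefly. In the Python fragment below, the tuple-based instruction format, the latency values, and the helper name insert_nops are hypothetical, chosen only for illustration; a real compiler back end works on its own intermediate representation and machine model.

# Sketch of compiler-inserted no-ops for a machine without hardware interlocks.
# The opcodes, latencies, and instruction encoding are illustrative assumptions.

LATENCY = {"LD": 2, "ADD": 1, "MUL": 3}   # clocks until a result is usable

def insert_nops(instructions):
    """instructions: list of (opcode, dest, sources). Returns a new list with
    ("NOP", None, ()) entries added so that every source operand is ready when
    its instruction issues, assuming strictly in-order, one-issue execution."""
    ready_at = {}        # register -> first clock at which its value is ready
    scheduled = []
    clock = 0
    for op, dest, srcs in instructions:
        # Emit no-ops until all source operands are available.
        while any(ready_at.get(r, 0) > clock for r in srcs):
            scheduled.append(("NOP", None, ()))
            clock += 1
        scheduled.append((op, dest, srcs))
        if dest is not None:
            ready_at[dest] = clock + LATENCY[op]
        clock += 1
    return scheduled

# Example: the load's result is needed by the very next instruction,
# so one NOP is inserted between them.
code = [("LD", "r1", ("r2",)), ("ADD", "r3", ("r1", "r4"))]
for instr in insert_nops(code):
    print(instr)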

 
