Optimization of DFA-Based Pattern Matchers
Introduction: In this section we present three algorithms that have been used to implement and optimize pattern matchers constructed from regular expressions.
1. The first algorithm is useful in a Lex compiler, because it constructs a DFA directly from a regular expression, without constructing an intermediate NFA. The resulting DFA also may have fewer states than the DFA constructed via an NFA.
2. The second algorithm minimizes the number of states of any DFA, by combining states that have the same future behavior. The algorithm itself is quite efficient, running in time 0(n log n), where n is the number of states of the DFA.
3. The third algorithm produces more compact representations of transition tables than the standard, two-dimensional table.
Important States of an NFA: To begin our discussion of how to go directly from a regular expression to a DFA, we must first dissect the NFA construction of Algorithm 3.23 and consider the roles played by various states. We call a state of an NFA important if it has a non-e out-transition. Notice that the subset construction (Algorithm 3.20) uses only the important states in a set T when it computes e-closure (move (T, a)), the set of states reachable from T on input a. That is, the set of states move(s, a) is nonempty only if state s is important. During the subset construction, two sets of NFA states can be identified (treated as if they were the same set) if they:
1. Have the same important states, and
2. Either both have accepting states or neither does.
When the NFA is constructed from a regular expression, we can say more about the important states. The only important states are those introduced as initial states in the basis part for a particular symbol position in the regular expression. That is, each important state corresponds to a particular operand in the regular expression.
The constructed NFA has only one accepting state, but this state, having no out-transitions, is not an important state. By concatenating a unique right end marker # to a regular expression r, we give the accepting state for r a transition on #, making it an important state of the NFA for ( r ) # . In other words, by using the augmented regular expression ( r ) # , we can forget about accepting states as the subset construction proceeds; when the construction is complete, any state with a transition on # must be an accepting state.
The important states of the NFA correspond directly to the positions in the regular expression that hold symbols of the alphabet. It is useful, as we shall see, to present the regular expression by its syntax tree, where the leaves correspond to operands and the interior nodes correspond to operators. An interior node is called a cat-node, or-node, or star-node if it is labeled by the concatenation operator (dot), union operator |, or star operator *, respectively.
Example:Figure 3.56 shows the syntax tree for the regular expression of our running example. Cat-nodes are represented by circles.
Leaves in a syntax tree are labeled by e or by an alphabet symbol. To each leaf not labeled e, we attach a unique integer. We refer to this integer as the position of the leaf and also as a position of its symbol. Note that a symbol can have several positions; for instance, a has positions 1 and 3 in Fig. 3.56. The positions in the syntax tree correspond to the important states of the constructed NFA.
E x a m p l e:Figure 3.57 shows the NFA for the same regular expression as Fig. 3.56, with the important states numbered and other states represented by letters. The numbered states in the NFA and the positions in the syntax tree correspond in a way we shall soon see.