**Branch:** Computer Science and Engineering

**Subject:** Compiler Design

## Efficiency of String-Processing Algorithms

**Introduction**: We observed that Algorithm 3.18 processes a string x in time O(|x|), while in Section 3.7.3 we concluded that we could simulate an NFA in time proportional to the product of |x| and the size of the NFA's transition graph. Obviously, it is faster to have a DFA to simulate than an NFA, so we might wonder whether it ever makes sense to simulate an NFA.

One issue that may favor an NFA is that the subset construction can, in the worst case, exponentiate the number of states. In principle, the number of DFA states does not influence the running time of Algorithm 3.18, but should the number of states become so large that the transition table does not fit in main memory, the true running time would have to include disk I/O and would therefore rise noticeably.

**Example**: Consider the family of languages described by regular expressions of the form L_n = (a|b)*a(a|b)^(n-1); that is, each language L_n consists of strings of a's and b's such that the nth character to the left of the right end is a. An (n+1)-state NFA is easy to construct. It stays in its initial state under any input, but also has the option, on input a, of going to state 1. From state 1, it goes to state 2 on any input, and so on, until in state n it accepts. Figure 3.47 suggests this NFA.

However, any DFA for the language L_n must have at least 2^n states. We shall not prove this fact, but the idea is that if two strings of length n can get the DFA to the same state, then we can exploit the last position where the strings differ (and therefore one must have a, the other b) to continue the strings identically, until they are the same in the last n-1 positions. The DFA will then be in a state where it must both accept and not accept. Fortunately, as we mentioned, it is rare for lexical analysis to involve patterns of this type, and we do not expect to encounter DFA's with outlandish numbers of states in practice.
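The blow-up can be observed directly by running the subset construction on the NFA just described. The sketch below is a hypothetical Python rendering (the function names are my own, not from the book): it builds the (n+1)-state NFA for L_n as a dictionary from (state, symbol) to successor sets and counts the reachable DFA states.

```python
from collections import deque

def nfa_for_Ln(n):
    """Transitions of the (n+1)-state NFA for L_n = (a|b)*a(a|b)^(n-1):
    state 0 loops on a and b, guesses the distinguished a by also moving
    to state 1, and states 1..n-1 advance on any input; state n accepts."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        for c in "ab":
            delta[(i, c)] = {i + 1}
    return delta

def count_dfa_states(delta, start=0):
    """Subset construction: explore every reachable set of NFA states."""
    start_set = frozenset([start])
    seen = {start_set}
    work = deque([start_set])
    while work:
        T = work.popleft()
        for c in "ab":
            U = frozenset(t for q in T for t in delta.get((q, c), ()))
            if U and U not in seen:
                seen.add(U)
                work.append(U)
    return len(seen)

for n in range(1, 6):
    print(n, count_dfa_states(nfa_for_Ln(n)))   # 2, 4, 8, 16, 32
```

The count doubles with each increment of n, matching the 2^n lower bound claimed above.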

However, lexical-analyzer generators and other string-processing systems often start with a regular expression. We are faced with a choice of converting the regular expression to an NFA or DFA. The additional cost of going to a DFA is thus the cost of executing Algorithm 3.23 on the NFA (one could go directly from a regular expression to a DFA, but the work is essentially the same). If the string-processor is one that will be executed many times, as is the case for lexical analysis, then any cost of converting to a DFA is worthwhile. However, in other string-processing applications, such as grep, where the user specifies one regular expression and one or several files to be searched for the pattern of that expression, it may be more efficient to skip the step of constructing a DFA, and simulate the NFA directly.

Let us consider the cost of converting a regular expression r to an NFA by Algorithm 3.23. A key step is constructing the parse tree for r. In Chapter 4 we shall see several methods that are capable of constructing this parse tree in linear time, that is, in time O(|r|), where |r| stands for the size of r: the sum of the number of operators and operands in r. It is also easy to check that each of the basis and inductive constructions of Algorithm 3.23 takes constant time, so the entire time spent by the conversion to an NFA is O(|r|).
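The constant-cost claim can be made concrete. The sketch below is a hypothetical Python rendering, not the book's pseudocode (its concatenation rule uses an explicit epsilon edge rather than merging states): each basis or inductive rule allocates at most two new states and appends at most four transitions to a shared edge list, so over the whole parse tree the work is O(|r|).

```python
import itertools

class Thompson:
    """Each constructor returns an (start, accept) pair and does O(1) work."""
    def __init__(self):
        self._ids = itertools.count()
        self.edges = []   # (from_state, label_or_None, to_state); None = epsilon

    def _new(self):
        return next(self._ids)

    def symbol(self, a):
        # basis rule: one edge labeled a between two new states
        s, t = self._new(), self._new()
        self.edges.append((s, a, t))
        return s, t

    def union(self, n1, n2):
        # r1|r2: two new states and four new epsilon edges
        (s1, t1), (s2, t2) = n1, n2
        s, t = self._new(), self._new()
        self.edges += [(s, None, s1), (s, None, s2), (t1, None, t), (t2, None, t)]
        return s, t

    def concat(self, n1, n2):
        # r1r2: a single new epsilon edge joins the two pieces
        (s1, t1), (s2, t2) = n1, n2
        self.edges.append((t1, None, s2))
        return s1, t2

    def star(self, n1):
        # r1*: two new states and four new epsilon edges
        s1, t1 = n1
        s, t = self._new(), self._new()
        self.edges += [(s, None, s1), (t1, None, t),
                       (t1, None, s1), (s, None, t)]
        return s, t

# Build the NFA for (a|b)*a, the n = 1 case of the example above
b = Thompson()
nfa = b.concat(b.star(b.union(b.symbol("a"), b.symbol("b"))), b.symbol("a"))
```

Since every rule contributes O(1) states and edges, an expression with |r| operators and operands yields an NFA whose size is linear in |r|, as claimed.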

Moreover, as we observed in Section 3.7.4, the NFA we construct has at most |r| states and at most 2|r| transitions. That is, in terms of the analysis in Section 3.7.3, we have n ≤ |r| and m ≤ 2|r|. Thus, simulating this NFA on an input string x takes time O(|r| × |x|). This time dominates the time taken by the NFA construction, which is O(|r|), and therefore we conclude that it is possible to take a regular expression r and string x, and tell whether x is in L(r) in time O(|r| × |x|).
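The set-based simulation behind this bound is short to write down. A minimal sketch, assuming the NFA is given as a dictionary from (state, symbol) to successor sets (the function name is my own): one set of live states is updated once per input character, and each update touches every live state and its outgoing transitions once, i.e., O(n + m) per character.

```python
def simulate_nfa(delta, start, accepting, x):
    """Return True iff the NFA accepts x; O(n + m) work per character."""
    S = {start}
    for c in x:
        S = {t for q in S for t in delta.get((q, c), ())}
    return bool(S & accepting)

# NFA for L_2: accept iff the second character from the right end is a
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {2}, (1, "b"): {2}}
print(simulate_nfa(delta, 0, {2}, "bab"))   # True
print(simulate_nfa(delta, 0, {2}, "abb"))   # False
```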

The time taken by the subset construction is highly dependent on the number of states the resulting DFA has. To begin, notice that in the subset construction of Fig. 3.32, the key step, the construction of a set of states U from a set of states T and an input symbol a, is very much like the construction of a new set of states from the old set of states in the NFA simulation of Algorithm 3.22. We already concluded that, properly implemented, this step takes time at most proportional to the number of states and transitions of the NFA.

Suppose we start with a regular expression r and convert it to an NFA. This NFA has at most |r| states and at most 2|r| transitions. Moreover, there are at most |r| input symbols. Thus, for every DFA state constructed, we must construct at most |r| new states, and each one takes at most O(|r| + 2|r|) time. The time to construct a DFA of s states is thus O(|r|^2 s).

In the common case, where s is about |r|, the subset construction takes time O(|r|^3). However, in the worst case, as in Example 3.25, this time is O(|r|^2 2^{|r|}). Figure 3.48 summarizes the options when one is given a regular expression r and wants to produce a recognizer that will tell whether one or more strings x are in L(r).

| Automaton | Initial | Per string |
|---|---|---|
| NFA | O(\|r\|) | O(\|r\| × \|x\|) |
| DFA, typical case | O(\|r\|^3) | O(\|x\|) |
| DFA, worst case | O(\|r\|^2 2^{\|r\|}) | O(\|x\|) |

Figure 3.48: Initial cost and per-string cost of various methods of recognizing the language of a regular expression

If the per-string cost dominates, as it does when we build a lexical analyzer, we clearly prefer the DFA. However, in commands like grep, where we run the automaton on only one string, we generally prefer the NFA. It is not until |x| approaches |r|^3 that we would even think about converting to a DFA.

There is, however, a mixed strategy that is about as good as the better of the NFA and the DFA strategy for each expression r and string x. Start off simulating the NFA, but remember the sets of NFA states (i.e., the DFA states) and their transitions, as we compute them. Before processing the current set of NFA states and the current input symbol, check to see whether we have already computed this transition, and use the information if so.
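A sketch of this lazy, caching strategy, again assuming the NFA is a dictionary from (state, symbol) to successor sets (the function name is hypothetical): each (state-set, symbol) transition is computed at most once, as in the NFA simulation, and every later occurrence is a single dictionary lookup, as in a DFA.

```python
def match_lazily(delta, start, accepting, x):
    """Simulate the NFA, memoizing each (state-set, symbol) transition.
    Only DFA states actually reached on x are ever constructed."""
    cache = {}
    S = frozenset([start])
    for c in x:
        if (S, c) not in cache:
            cache[(S, c)] = frozenset(t for q in S for t in delta.get((q, c), ()))
        S = cache[(S, c)]
    return bool(S & accepting)

# NFA for L_2 (second character from the right end must be a)
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "a"): {2}, (1, "b"): {2}}
print(match_lazily(delta, 0, {2}, "ababab"))   # True
```

On long inputs the loop quickly degenerates into pure cache hits, which is why the strategy tracks the better of the two pure approaches: the per-transition cost is NFA-like only the first time each transition is needed, and DFA-like thereafter.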