Regular Definitions and Extensions
Introduction: Giving names to regular expressions is referred a Regular definition, we may wish to give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols. If £ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
di -> n
d2 -» r2
dn ^ rn
where:
1. Each di is a new symbol, not in E and not the same as any other of the cTs, and
2. Each T{ is a regular expression over the alphabet E U {d\,d2,.. . , By restricting to E and the previously defined GTS, we avoid recursive definitions, and we can construct a regular expression over E alone, for each r$. We do so by first replacing uses of d\ in r2 (which cannot use any of the d's except for d\), then replacing uses of d\ and d2 in r-$ by r\ and (the substituted) r2, and so on. Finally, in rn we replace each di, for i — 1,2,... ,n — 1, by the substituted version of r$, each of which has only symbols of E.
Extensions of Regular Expressions: Since Kleene introduced regular expressions with the basic operators for union, concatenation, and Kleene closure in the 1950s, many extensions have been added to regular expressions to enhance their ability to specify string patterns. Here we mention a few notational extensions that were first incorporated into Unix utilities such as Lex that are particularly useful in the specification lexical analyzers. The references to this chapter contain a discussion of some regular expression variants in use today.
1. One or more instances. The unary, postfix operator represents the positive closure of a regular expression and its language. That is, if r is a regular expression, then ( r ) denotes the language (L(r)) . The operator has the same precedence and associativity as the operator *. Two useful algebraic laws, r* = r \e and r = rr* = r*r relate the Kleene closure and positive closure.
2. Zero or one instance. The unary postfix operator ? means "zero or one occurrence." That is, r? is equivalent to r|e, or put another way, L(r?) = L(r) U {e}. The ? operator has the same precedence and associativity as * and .
3. Character classes. A regular expression aifal • • • \an, where the a^s are each symbols of the alphabet, can be replaced by the shorthand [aia,2 • • - an]. More importantly, when 0 1 , 0 2 , . . . , a n f ° r m a logical sequence, e.g., consecutive uppercase letters, lowercase letters, or digits, we can replace them by o i - a n , that is, just the first and last separated by a hyphen. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for a | b | . - - | z .