Regular expressions
Introduction: The set of all integer constants or the set of all variable names are sets of strings, where the individual letters are taken from a particular alphabet. Such a set of strings is called a language. For integers, the alphabet consists of the digits 0-9 and for variable names the alphabet contains both letters and digits (and perhaps a few other characters, such as underscore).
Given an alphabet, we will describe sets of strings by regular expressions, an algebraic notation that is compact and easy for humans to use and understand. The idea is that regular expressions that describe simple sets of strings can be combined to form regular expressions that describe more complex sets of strings.
When talking about regular expressions, we will use the letters (r, s and t) in italics to denote unspecified regular expressions. When letters stand for themselves (i.e., in regular expressions that describe strings using these letters) we will use typewriter font, e.g., a or b. Hence, when
The regular expressions are built recursively out of smaller regular expressions, using the rules described below. Each regular expression r denotes a language L(r), which is also defined recursively from the languages denoted by r ' s sub expressions. Here are the rules that define the regular expressions over some alphabet £ and the languages that those expressions denote.
Figure 2.1 shows the constructions used to build regular expressions and the languages they describe:
- A single letter describes the language that has the one-letter string consisting of that letter as its only element.
- The symbol ε (the Greek letter epsilon) describes the language that consists solely of the empty string. Note that this is not the empty set of strings (see exercise 2.10).
- s|t (pronounced “s or t”) describes the union of the languages described by s and t.
- st (pronounced “s t”) describes the concatenation of the languages L(s) and L(t), i.e., the sets of strings obtained by taking a string from L(s) and putting this in front of a string from L(t). For example, if L(s) is {“a”, “b”} and L(t) is {“c”, “d”}, then L(st) is the set {‘ac”, “ad”, “bc”, “bd”}.
- The language for s* (pronounced “s star”) is described recursively: It consists of the empty string plus whatever can be obtained by concatenating a string from L(s) to a string from L(s*). This is equivalent to saying that L(s*) consists of strings that can be obtained by concatenating zero or more (possibly different) strings from L(s). If, for example, L(s) is {“a”, “b“} then L(s*) is {““, “a“, “b”, “aa”, “ab”, “ba“,“bb“,“aaa“, . . . }, i.e., any string (including the empty) that consists entirely of as and bs.
BASIS: There are two rules that form the basis:
- e is a regular expression, and L(e) is {e}, that is, the language whose sole member is the empty string.
- If a is a symbol in E, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position. Note that by convention, we use italics for symbols, and boldface for their corresponding regular expression
There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
- (r)|(s) is a regular expression denoting the language L(r) U L(s).
- (r)(s) is a regular expression denoting the language L(r)L(s).
- (r)* is a regular expression denoting (L(r))*.
- (r) is a regular expression denoting L(r). This last rule says that we can add additional pairs of parentheses around expressions without changing the language they denote.
As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop certain pairs of parentheses if we adopt the conventions that:
- The unary operator * has highest precedence and is left associative.
- Concatenation has second highest precedence and is left associative.
A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s denote the same regular set, we say theyare equivalent and write r = s. For instance, (a|b) = (b|a). There are a numberof algebraic laws for regular expressions; each law asserts that expressions oftwo different forms are equivalent. Figure 3.7 shows some of the algebraic lawsthat hold for arbitrary regular expressions r, s, and t.