Short hands
Introduction: The large number of different digits makes this expression rather verbose. It gets even worse when we get to variable names, where we must enumerate all alphabetic letters (in both upper and lower case). Hence, we introduce a short hand for sets of letters. Sequences of letters within square brackets represent the set of these letters. For example, we use [ab01] as a shorthand for a|b|0|1. Additionally, we can use interval notation to abbreviate [0123456789] to [0-9]. We can combine several intervals within one bracket and for example write [a-zA-Z] to denote all alphabetic letters in both lower and upper case.
(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*
When using intervals, we must be aware of the ordering for the symbols involved. For the digits and letters used above, there is usually no confusion. However, if we write, e.g., [0-z] it is not immediately clear what is meant. When using such notation in lexer generators, standard ASCII or ISO 8859-1 character sets are usually used, with the hereby implied ordering of symbols. To avoid confusion, we will use the interval notation only for intervals of digits or alphabetic letters.
Getting back to the example of integer constants above, we can now write this much shorter as [0-9][0-9]*. Since s* denotes zero or more occurrences of s, we needed to write the set of digits twice to describe that one or more digits are allowed. Such non-zero repetition is quite common, so we introduce another shorthand, s , to denote one or more occurrences of s. With this notation, we can abbreviate our description of integers to [0-9] . On a similar note, it is common that we can have zero or one occurrence of something (e.g., an optional sign to a number). Hence we introduce the shorthand s? for s|ε. and ? bind with the same precedence as *.
We must stress that these short hands are just that. They do not add anything to the set of languages we can describe; they just make it possible to describe a language more compactly. In the case of s , it can even make an exponential difference: If is nested n deep, recursive expansion of s to ss* yields 2n -1 occurrences of * in the expanded regular expression.
Examples: We have already seen how we can describe non-negative integer constants using regular expressions. Here are a few examples of other typical programming language elements:
Ø Keywords. A keyword like if is described by a regular expression that looks exactly like that keyword, e.g., the regular expression if (which is the concatenation of the two regular expressions i and f).
Ø Variable names. In the programming language C, a variable name consists of letters, digits and the underscore symbol and it must begin with a letter or underscore. This can be described by the regular expression
[a-zA-Z_][a-zA-Z_0-9]*
Ø Integers. An integer constant is an optional sign followed by a non-empty sequence of digits: [ -]?[0-9] . In some languages, the sign is a separate symbol and not part of the constant itself. This will allow whitespace between the sign and the number, which is not possible with the above.
Ø Floats. A floating-point constant can have an optional sign. After this, the mantissa part is described as a sequence of digits followed by a decimal point and then another sequence of digits. Either one (but not both) of the digit sequences can be empty. Finally, there is an optional exponent part, which is the letter e (in upper or lower case) followed by an (optionally signed) integer constant. If there is an exponent part to the constant, the mantissa part can be written as an integer constant (i.e., without the decimal point).
Ø String constants. A string constant starts with a quotation mark followed by a sequence of symbols and finally another quotation mark. There are usually some restrictions on the symbols allowed between the quotation marks. For example, line-feed characters or quotes are typically not allowed, though these may be represented by special “escape” sequences of other characters, such as "\n\n" for a string containing two line-feeds. As a (much simplified) example, we can by the following regular expression describe string constants where the allowed symbols are alphanumeric characters and sequences consisting of the backslash symbol followed by a letter (where each such pair is intended to represent a non-alphanumeric symbol):
"([a-zA-Z0-9]|n[a-zA-Z])*"