The Lexical-Analyzer Generator Lex
Introduction: In this section, we introduce a tool called Lex, or in a more recent implementation Flex, that allows one to specify a lexical analyzer by writing regular expressions to describe the patterns for tokens. The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram. The mechanics of how this translation from regular expressions to transition diagrams occurs is the subject of the next sections; here, we only learn the Lex language.
Use of Lex: Figure 3.22 suggests how Lex is used. An input file, which we call lex.l, is written in the Lex language and describes the lexical analyzer to be generated. The Lex compiler transforms lex.l to a C program, in a file that is always named lex.yy.c. The latter file is compiled by the C compiler into a file called a.out, as always. The C-compiler output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens. The normal use of the compiled C program, referred to as a.out in Fig. 3.22, is as a subroutine of the parser. It is a C function that returns an integer, which is a code for one of the possible token names. The attribute value, whether it be another numeric code, a pointer to the symbol table, or nothing, is placed in a global variable yylval, which is shared between the lexical analyzer and parser, thereby making it simple to return both the name and an attribute value of a token.
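To make this calling convention concrete, below is a minimal sketch of a driver that plays the role of the parser. It is illustrative, not part of Lex itself: it assumes the generated lex.yy.c is compiled and linked with it (with traditional Lex, typically lex lex.l followed by cc lex.yy.c driver.c -ll, where -ll links the Lex library) and that yylval holds an integer-valued attribute.

    /* driver.c: a stand-in for the parser (an illustrative sketch) */
    #include <stdio.h>

    int yylex(void);    /* the analyzer generated in lex.yy.c */
    extern int yylval;  /* attribute value set by the analyzer's actions */

    int main(void)
    {
        int token;
        /* by convention, yylex() returns 0 at end of input */
        while ((token = yylex()) != 0)
            printf("token code %d, attribute %d\n", token, yylval);
        return 0;
    }

A real parser would instead call yylex() once each time it needs the next token, consulting yylval whenever that token carries an attribute.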
Structure of Lex Programs: A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions, in the style of Section 3.3.4.
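For example, a declarations section might contain manifest constants for the token codes and regular definitions such as the following; the names and numeric codes here are illustrative, and in practice the token codes (and the declaration of yylval) would ordinarily be supplied by the parser:

    %{
        /* manifest constants standing for token names */
        #define IF     256
        #define ID     257
        #define NUMBER 258
        int yylval;    /* attribute value shared with the parser */
    %}
    /* regular definitions */
    delim    [ \t\n]
    ws       {delim}+
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

Anything enclosed between %{ and %} is copied verbatim into lex.yy.c. A regular definition gives a name to an expression; the translation rules may then use that name by enclosing it in braces, as in {id}.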
The translation rules each have the form
Pi { Ai }
Each pattern Pi is a regular expression, which may use the regular definitions of the declarations section. Each action Ai is a fragment of code, typically written in C, although many variants of Lex using other languages have been created. The third section holds whatever additional functions are used in the actions. Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.
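Continuing the declarations sketched above, the translation rules and auxiliary functions of a small, illustrative Lex program might read as follows; installID and installNum are hypothetical helpers, not Lex built-ins:

    %%
    {ws}      { /* no action and no return: whitespace is discarded */ }
    if        { return IF; }
    {id}      { yylval = installID(); return ID; }
    {number}  { yylval = installNum(); return NUMBER; }
    %%

    int installID() {
        /* a real analyzer would enter the lexeme, available as the
           string yytext of length yyleng, into the symbol table and
           return a code for its entry */
        return 0;
    }

    int installNum() {
        /* similar, but for a table of numeric constants */
        return 0;
    }

Within an action, Lex makes the matched lexeme available through the pointer yytext and its length through the variable yyleng.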
The lexical analyzer created by Lex behaves in concert with the parser as follows. When called by the parser, the lexical analyzer begins reading its remaining input, one character at a time, until it finds the longest prefix of the input that matches one of the patterns Pi. It then executes the associated action Ai. Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace or comments), then the lexical analyzer proceeds to find additional lexemes, until one of the corresponding actions causes a return to the parser. The lexical analyzer returns a single value, the token name, to the parser, but uses the shared integer variable yylval to pass additional information about the lexeme found, if needed.
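Two conventions govern which lexeme is chosen, and both can be seen in the rules sketched above. First, Lex always prefers the longest matching prefix: on input ifx, the pattern {id} matches the three-character prefix ifx while if matches only two, so ifx is reported as an identifier rather than as the keyword if followed by x. Second, when two patterns match the same longest prefix, the rule listed first in the Lex program wins: on input if followed by a delimiter, both if and {id} match the two-character lexeme, and it is reported as the keyword because that rule appears earlier.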