session 4 (lexical and syntax analysis)

Languages

A language is a potentially infinite set of strings (sometimes called sentences), each a sequence of symbols from a given alphabet. The tokens of each sentence are ordered according to some structure.

A grammar is a set of rules that describe a language. Grammars assign structure to a sentence.

An automaton is an algorithm that can recognise (accept) all sentences of a language and reject those which do not belong to it.

Lexical and syntactic analysis can be simplified to a machine that takes in some program code and returns syntax errors, parse trees and data structures. We can think of the process as description transformation: we take some source description, apply a transformation technique and end up with a target description – a mapping between two equivalent languages, where the target is machine executable.

Compilers are an important part of computer science, even if you never write a compiler, as the concepts are used regularly in interpreters, intelligent editors, source code debugging, natural language processing (e.g., for AI) and for things such as XML.

Throughout the years, progress has been made in the field of programming to bring higher and higher levels of abstraction, moving from machine code, to assembler, to high-level languages and now to object orientation, reusable languages and virtual machines.

Translation Process

LSA only deals with the front-end of the compiler, next year’s module CGO deals with the back-end. The front-end of a compiler only analyses the program, it does not produce code.

From source code, lexical analysis produces tokens, the words in a language, which are then parsed to produce a syntax tree, checking that the tokens conform to the rules of the language. Semantic analysis is then performed on the syntax tree to produce an annotated tree. In addition to this, a literal table, which contains information on the strings and constants used in the program, and a symbol table, which stores information on the identifiers occurring in the program (e.g., variable names, constant names, procedure names, etc), are produced by the various stages of the process. An error handler also exists to catch any errors generated by any stage of the process (e.g., a syntax error caused by a poorly formed line). The syntax tree forms an intermediate representation of the code structure, and has links to the symbol table.

From the annotated tree, intermediate code generation produces intermediate code (e.g., that suitable for a virtual machine, or pseudo-assembler), and then the final code generation stage produces target code whilst also referring to the literal and symbol table and having another error handler. Optimisation is then applied to the target code. The target code could be assembly and then passed to an assembler, or it could be direct to machine code.

We can consider the front-end as a two stage process, lexical analysis and syntactic analysis.

Lexical Analysis

Lexical analysis is the extraction of individual words or lexemes from an input stream of symbols and passing corresponding tokens back to the parser.

If we consider a statement in a programming language, we need to be able to recognise the small syntactic units (tokens) and pass this information to the parser. We also need to store the various attributes in the symbol or literal tables for later use, e.g., if we have a variable, the tokeniser would generate the token var and then associate the name of the variable with it in the symbol table – in this case, the variable name is the lexeme.

Other roles of the lexical analyser include the removal of whitespace and comments and handling compiler directives (i.e., as a preprocessor).

The tokenisation process takes input and then passes this input through a keyword recogniser, an identifier recogniser, a numeric constant recogniser and a string constant recogniser, with disambiguating rules deciding which recogniser's output is used.

These rules may include “reserved words”, which cannot be used as identifiers (common examples include begin, end, void, etc), so if a string can be either a keyword or an identifier, it is taken to be a keyword. Another common rule is that of maximal munch: if a string can be interpreted as either a single token or a sequence of tokens, the former interpretation is generally assumed.
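
As a rough illustration of these two rules, here is a minimal sketch (not from the notes; the token classes and keyword list are invented for the example) that uses Python regular expressions, with reserved words taking priority over identifiers and longer operators matched before their prefixes:

```python
import re

# Hypothetical token definitions; longer operators appear before their prefixes
# so that maximal munch applies (e.g. ">=" is one token, not ">" then "=").
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r">=|<=|:=|[+\-*/<>=]"),
    ("SKIP",   r"\s+"),
]
KEYWORDS = {"begin", "end", "if", "then", "while"}   # reserved words

def tokenise(text):
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC)
    for match in re.finditer(pattern, text):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue                      # whitespace is discarded by the lexer
        if kind == "IDENT" and lexeme in KEYWORDS:
            kind = "KEYWORD"              # reserved words win over identifiers
        yield (kind, lexeme)

print(list(tokenise("if x1 >= 42 then y := y + 1")))
```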

The lexical analysis process starts with a definition of what it means to be a token in the language with regular expressions or grammars, then this is translated to an abstract computational model for recognising tokens (a non-deterministic finite state automaton), which is then translated to an implementable model for recognising the defined tokens (a deterministic finite state automaton) to which optimisations can be made (a minimised DFA).

Syntactic Analysis

Syntactic analysis, or parsing, is needed to determine if the series of tokens given are appropriate in a language – that is, whether or not the sentence has the right shape/form. However, not all syntactically valid sentences are meaningful; further semantic analysis has to be applied for this. For syntactic analysis, context-free grammars and the associated parsing techniques are powerful enough to be used.

In syntactic analysis, parse trees are used to show the structure of the sentence, but they often contain redundant information due to implicit definitions (e.g., an assignment always has an assignment operator in it, so we can imply that), so syntax trees, which are compact representations are used instead. Trees are recursive structures, which complement CFGs nicely, as these are also recursive (unlike regular expressions).

There are many techniques for parsing algorithms (vs FSA-centred lexical analysis), and the two main classes of algorithm are top-down and bottom-up parsing.

For information about context-free grammars, see MCS.

Context-free grammars can be represented using Backus-Naur Form (BNF). BNF uses three classes of symbols: non-terminal symbols (phrases) enclosed by angle brackets <>, terminal symbols (tokens) that stand for themselves, and the metasymbol ::=, read as “is defined to be”.
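
For example, a grammar for simple arithmetic expressions might be written in BNF as (an illustrative grammar, not one taken from the notes):

<exp> ::= <exp> + <exp> | <exp> * <exp> | ( <exp> ) | number

As written, this grammar is ambiguous – a point the discussion of parse trees and ambiguity below comes back to.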

As derivations are ambiguous, a more abstract structure is needed. Parse trees generalise derivations and provide structural information needed by the later stages of compilation.

Parse Trees

A parse tree over a grammar G is a labelled tree with the root node labelled with the start symbol (S), internal nodes labelled with non-terminals and leaf nodes labelled with terminals or ε. If an internal node is labelled with a non-terminal A and has n children with labels X1, …, Xn (terminals or non-terminals), then there is a grammar rule of the form A → X1…Xn. Parse trees may also have optional node numbers.

[parse tree diagram]

The above parse tree corresponds to a leftmost derivation.

Traversing the tree can be done in three different ways. In preorder traversal, you visit the root and then do a preorder traversal of each of the children. In in-order traversal, an in-order traversal is done of the left subtree, the root is visited, and then an in-order traversal is done of the remaining subtrees. Finally, in postorder traversal, a postorder traversal is done of each of the children and then the root is visited.
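
A minimal sketch of the three traversals for a general tree follows; the node representation is an assumption made for the example, not something defined in the notes.

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def preorder(node):
    yield node.label                              # visit the root first
    for child in node.children:
        yield from preorder(child)

def inorder(node):
    if node.children:
        yield from inorder(node.children[0])      # left subtree first
    yield node.label                              # then the root
    for child in node.children[1:]:               # then the remaining subtrees
        yield from inorder(child)

def postorder(node):
    for child in node.children:
        yield from postorder(child)
    yield node.label                              # visit the root last

# Example: the expression 3 + 4 as a '+' node with two leaf children.
tree = Node("+", [Node("3"), Node("4")])
print(list(preorder(tree)), list(inorder(tree)), list(postorder(tree)))
# ['+', '3', '4'] ['3', '+', '4'] ['3', '4', '+']
```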

Syntax Trees

Parse trees are often converted into a simplified form known as a syntax tree that eliminates wasteful information from the parse tree.

[syntax tree diagram]

At this stage, treatment of errors is more difficult than in the scanner (tokeniser), as the scanner may pass problems on to the parser (an error token). Error recovery typically isolates the error and continues parsing, and repair can be possible in simple cases. Generating meaningful error messages is important, however this can be difficult as the actual error may have occurred well before the current input token.

Grammars are ambiguous if there exists a sentence with two different parse trees – this is particularly common in arithmetic expressions – does 3 + 4 × 5 equal 23 (the natural interpretation, that is 3 + (4 × 5)) or 35 ((3 + 4) × 5)? Ambiguity can be inessential, where the parse trees both give the same syntax tree (typical for associative operations), but a disambiguation rule is still required. However, there is no computable function to remove ambiguity from a grammar – it has to be done by hand, and the ambiguity problem itself is undecidable. Solutions to this include changing the grammar of the language to remove any ambiguity, or introducing disambiguation rules.

A common disambiguation rule is precedence, where a precedence cascade is used to group operators in to different precedence levels.
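
For instance, the ambiguous expression grammar shown earlier can be rewritten as a cascade of three levels so that × binds more tightly than + (an illustrative version, not one from the notes):

exp → exp + term | term
term → term × factor | factor
factor → ( exp ) | number

With this grammar, 3 + 4 × 5 can only be parsed as 3 + (4 × 5).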

BNF does have limitations, as there is no notation for repetition and optionality, nor is there any nesting of alternatives – auxiliary non-terminals are used instead, or a rule is expanded. Each rule also has no explicit terminator.

Extended BNF (EBNF) was developed to work around these restrictions. In EBNF, terminals are in double quotes “”, and non-terminals are written without <>. Rules have the form non-terminal = sequence or non-terminal = sequence | … | sequence, and a period . is used as the terminator for each rule. {p} is used for 0 or more occurrences of p, [p] stands for 0 or 1 occurrences of p, and (p|q|r) stands for exactly one of p, q, or r. However, the notation for repetition does not fully specify the parse tree (left or right associativity?).
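
As a small illustration (again hypothetical), the exp rule from the precedence cascade above could be written in EBNF as exp = term { "+" term } . – but the repetition does not say whether the additions associate to the left or to the right.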

Syntax diagrams can be used for graphical representations of EBNF rules. Non-terminals are represented in a square/rectangular box, and terminals in round/oval boxes. Sequences and choices are represented by arrows, and the whole diagram is labelled with the left-hand side non-terminal. e.g., for factor → (exp) | “number” .

[syntax diagram]

Semantic Analysis

Semantic analysis is needed to check aspects that are not related to the syntactic form, or that are not easily determined during parsing, e.g., type correctness of expressions and declaration prior to use.

Code Generation and Optimisation

Code generation and optimisation generally transform the annotated syntax tree into some form of intermediate code, typically three-address code or code for a virtual machine (e.g., p-code). Some level of optimisation can be applied here. This intermediate code can then be transformed into instructions for the target machine and optimised further.

Optimisation is really code improvement, e.g., constant folding (e.g., replacing x = 4*4 with x = 16), and also optimisation at the target level (e.g., multiplications by powers of 2 replaced by left-shift instructions).
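
A minimal sketch of constant folding over a small tuple-based syntax tree follows; the tree representation is invented for the example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(node):
    """Replace constant subexpressions ('op', left, right) by their value."""
    if isinstance(node, tuple):
        op, left, right = node
        left, right = fold(left), fold(right)
        if isinstance(left, int) and isinstance(right, int):
            return OPS[op](left, right)   # both operands known at compile time
        return (op, left, right)
    return node                           # a leaf: a constant or a variable name

print(fold(("+", "x", ("*", 4, 4))))      # ('+', 'x', 16)
```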

Bootstrapping

The idea behind bootstrapping is to write a compiler for a language in that same language (or possibly an older version or a restricted subset of it). A “quick and dirty” compiler for that restricted subset or older version is then written, which runs on some machine.

This new version of the compiler incorporates the latest features of the language and generates highly efficient code, and all changes to the language are represented in the new compiler, so this compiler is used to generate the “real” compiler for target machines.

To bootstrap, we use the quick and dirty compiler to compile the new compiler, and then recompile the new compiler with itself to generate an efficient version.

With bootstrapping, improvements to the source code can be bootstrapped by applying the 2-step process. Porting the compiler to a new host computer only requires changes to the backend of the compiler, so the new compiler becomes the “quick and dirty” compiler for new versions of the language.

If we use the main compiler for porting, this provides a mechanism to quickly generate new compilers for all target machines, and minimises the need to maintain/debug multiple compilers.

Top-Down Parsing

Top down parsing can be broken down into two classes: backtracking parsers, which try to apply a grammar rule and backtrack if it fails; and predictive parsers, which try to predict the next nonterminal in the input using one or more tokens lookahead. Backtracking parsers can handle a larger class of grammars, but predictive parsers are faster. We look at predictive parsers in this module.

In this module, we look at two classes of predictive parsers: recursive-descent parsers, which are quite versatile and appropriate for a hand-written parser, and were the first type of parser to be developed; and, LL(1) parsing – left-to-right, leftmost derivation, 1 symbol lookahead, a type of parser which is no longer in use. It does demonstrate parsing using a stack, and helps to formalise the problems with recursive descent. There is a more general class of LL(k) parsers.

Recursive Descent

The basic idea here is that each nonterminal is recognised by a procedure. The choice between production alternatives corresponds to a case or if statement. Strings of terminals and nonterminals within a choice correspond to matching the input and making calls to other procedures. Recursive descent has a one token lookahead, from which the choice of appropriate matching procedure is made.

The code for recursive descent follows the EBNF form of the grammar rule; however, the procedures must not use left recursion, which would cause a procedure to call itself in an infinite loop without consuming any input.
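
As a small sketch of how a procedure mirrors the EBNF form of a rule, consider the hypothetical rule exp = term { "+" term } . with terms kept as plain numbers (the grammar, class and method names are assumptions for the example, not code from the module):

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]        # end marker
        self.pos = 0                        # one token of lookahead

    def look(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.look() == expected:
            self.pos += 1
        else:
            raise SyntaxError(f"expected {expected!r}, got {self.look()!r}")

    def exp(self):                          # exp = term { "+" term } .
        value = self.term()
        while self.look() == "+":           # the EBNF repetition becomes a loop
            self.match("+")
            value += self.term()
        return value

    def term(self):                         # term = "number" (kept trivial here)
        value = int(self.look())
        self.match(self.look())
        return value

print(Parser(["3", "+", "4", "+", "5"]).exp())   # 12
```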

There are problems with recursive descent, such as converting BNF rules to EBNF, ambiguity in choice, and empty productions (we must look further, at the tokens that can appear after the current non-terminal). We could also consider adding some form of early error detection, so time is not wasted deriving non-terminals if the lookahead token is not a legal symbol.

Therefore, we can improve recursive descent by defining a set First(α) as the set of tokens that can legally begin α, and a set Follow(A) as the set of tokens that can legally come after the nonterminal A. We can then use First(α) and First(β) to decide between A → α and A → β, and then use Follow(A) to decide whether or not A → ε can be applied.

LL(1) Parsing

LL(1) parsing is top-down parsing using a stack as the memory. At the beginning, the start symbol is put onto the stack, and then two basic actions are available: Generate, which replaces a nonterminal A at the top of the stack by string α using the grammar rule A → α; and Match, which matches a token on the top of the stack with the next input token (and, in case of success, pop both). The appropriate action is selected using a parsing table.

With the LL(1) parsing table, when a token is at the top of the stack, there is either a successful match or an error (no ambiguity). When there is a nonterminal A at the top, a lookahead is used to choose a production to replace A.

The LL(1) parsing table is created by starting with an empty table, and then for all table entries, the production choice A → α is added to the table entry M[A,a] if there is a derivation α ├* aβ where a is a token, or there are derivations α ├* ε and S$ ├* βAaγ, where S is the start symbol and a is a token (including $). Any entries that remain empty represent errors.
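
To make the two actions and the table concrete, here is a minimal sketch for a hypothetical grammar of balanced parentheses, S → ( S ) S | ε, whose table has been written out by hand following the rule above:

```python
# Table entries map (nonterminal, lookahead) to the chosen right-hand side.
TABLE = {
    ("S", "("): ["(", "S", ")", "S"],   # S -> ( S ) S, since ( begins the RHS
    ("S", ")"): [],                     # S -> ε, since ) is in Follow(S)
    ("S", "$"): [],                     # S -> ε, since $ is in Follow(S)
}

def parse(tokens):
    tokens = list(tokens) + ["$"]       # append the end marker
    stack = ["$", "S"]                  # start symbol on top of the end marker
    pos = 0
    while stack:
        top, look = stack[-1], tokens[pos]
        if top == look:                 # Match: pop the terminal, consume input
            stack.pop()
            pos += 1
        elif (top, look) in TABLE:      # Generate: replace A by the chosen RHS
            stack.pop()
            stack.extend(reversed(TABLE[(top, look)]))
        else:
            return False                # an empty table entry is a syntax error
    return pos == len(tokens)           # all input (including $) consumed

print(parse(["(", "(", ")", ")", "(", ")"]))   # True
print(parse(["(", ")", ")"]))                  # False
```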

A grammar is an LL(1) grammar if the associated LL(1) parsing table has at most one production in each table entry. An LL(1) grammar is never ambiguous, so if a grammar is ambiguous, disambiguating rules can be used in simple cases. Some consequences of this are that for rules of the form A → α | β, α and β cannot derive strings beginning with the same token a, and at most one of α or β can derive ε.

With LL(1) parsing, the list of generated actions matches the steps in a leftmost derivation. Parse tree nodes can be constructed as terminals or nonterminals are pushed onto the stack; if the parser is not to be used only as a recogniser, then the stack should contain pointers to tree nodes. LL(1) parsing also has to be able to generate a syntax tree.

In LL(1) parsing, left recursion and choice are still problematic in similar ways to that of recursive descent parsing. EBNF will not help, so it’s easier to keep the grammar as BNF. Instead, two techniques can be used: left recursion removal and left factoring. This is not a foolproof solution, but it usually helps and it can be automated.

In the case of a simple immediate left recursion of the form A → Aα | β, where α and β are strings and β does not begin with A, then we can rewrite this rule into two grammar rules: A → βA′, A′ → αA′ | ε. The first rule generates β and then generates {α} using right recursion.
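
For instance, the left-recursive rule exp → exp + term | term from the precedence cascade earlier becomes exp → term exp′ with exp′ → + term exp′ | ε.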

With indirect left recursion of the form A → Bβ1 | …, B → Aα1 | …, a cycle (a derivation of the form A ├ α ├* A) can occur. The removal algorithm can only work if the grammar has no cycles or ε-productions.

Left recursion removal does not change the language, but it does change the grammar and parse trees, so we now have the new problem of keeping the correct associativity in the syntax tree. If you use the parse tree directly to get the syntax tree, then values need to be passed from parent to child.

If two or more grammar choices share a common prefix string (e.g., A → αβ | αγ), then left factoring can be used. We can factor out the longest possible α and split it into two rules: A → αA′, A′ → β | γ.
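
A common textbook instance (not specific to these notes) is the if-statement: if-stmt → if ( exp ) statement | if ( exp ) statement else statement can be left factored into if-stmt → if ( exp ) statement else-part and else-part → else statement | ε.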

To construct a syntax tree in LL(1) parsing, an extra stack is needed to manipulate the syntax tree nodes. Additional symbols are needed in the grammar rules to provide synchronisation between the two stacks, and the usual way of doing this is using the hash (#). In the example of a grammar representing arithmetic operations, the # tells the parser that the second argument has just been seen, so it should be added to the first one.

First and Follow Sets

To construct a First set, there are rules we can follow:

  • If x is a terminal, then First(x) = {x} (and First(ε) = {ε})
  • For a nonterminal A with the production rules A → β1 | … | βn, First(A) = First(β1) ∪ … ∪ First(βn)
  • For a rule of the form A → X1X2…Xn, First(A) = (First(X1) ∪ … ∪ First(Xk)) – {ε}, where ε ∈ First(Xj) for j = 1..(k – 1) and ε ∉ First(Xk), i.e., where Xk is the first non-nullable symbol.

Therefore, the First(w) set consists of the set of terminals that begin strings derived from w (and also contains ε if w can generate ε).
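
A minimal sketch of these rules as a fixed-point computation follows, for a hypothetical (already left-factored) expression grammar; the grammar and its representation are assumptions made for the example.

```python
EPS = "ε"

# Grammar: nonterminal -> list of alternatives, each a list of symbols.
grammar = {
    "exp":    [["term", "exp'"]],
    "exp'":   [["+", "term", "exp'"], [EPS]],
    "term":   [["factor", "term'"]],
    "term'":  [["*", "factor", "term'"], [EPS]],
    "factor": [["(", "exp", ")"], ["number"]],
}

def first_sets(grammar):
    first = {A: set() for A in grammar}

    def first_of_seq(symbols):
        """First set of a sequence X1...Xn, using the current estimates."""
        result = set()
        for X in symbols:
            fx = first[X] if X in grammar else {X}   # terminal: First(X) = {X}
            result |= fx - {EPS}
            if EPS not in fx:
                return result                        # first non-nullable symbol reached
        result.add(EPS)                              # every Xi can derive ε
        return result

    changed = True
    while changed:                                   # iterate to a fixed point
        changed = False
        for A, alternatives in grammar.items():
            for alt in alternatives:
                new = first_of_seq([s for s in alt if s != EPS])
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first

print(first_sets(grammar))
# e.g. First(exp) = {'(', 'number'} and First(exp') = {'+', 'ε'}
```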

To construct Follow sets, we can say that if A is a non-terminal, then Follow(A) is the set of all terminals that can follow A in any sentential form (derived from S).

  • If A is the start symbol, then the end marker $ ∈ Follow(A)
  • For a production rule of the form B → αAβ, everything in First(β) is placed in Follow(A) except ε.
  • If there is a production B → αA, or a production B → αAβ where First(β) contains ε, then Follow(A) contains Follow(B).
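
For the hypothetical expression grammar used in the First-set sketch above (taking exp as the start symbol), these rules give Follow(exp) = Follow(exp′) = { ), $ }, Follow(term) = Follow(term′) = { +, ), $ } and Follow(factor) = { *, +, ), $ }.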

A final issue to consider is how to deal with errors in the parsing stage. In recursive descent parsers, a panic mode exists where each procedure declares a set of synchronising tokens, and when confused, input tokens are skipped (scanning ahead) until one of the synchronising tokens is seen, at which point parsing can continue. In LL(1) parsers, sets of synchronising tokens are kept in an additional stack or built directly into the parsing table.
