How compilers work

Meticulous Transformer

Compilers translate source code into executable programs and libraries. Inside modern compiler suites, a multistage process analyzes the source code, points out errors, generates intermediate code and tables, rearranges a large amount of data, and adapts the code to the target processor.

Below the surface, the compiler is a black box that handles complex processes requiring a good knowledge of automata theory and formal languages. Given the importance of compilers, it is not surprising that compiler construction is a standard part of the computer science curriculum. If you have never been to a college-level lecture on compiler theory – or if you went to the lecture but need a refresher course – this article summarizes the basics. (See the box titled "Teaching Material for Compiler Construction.")

Teaching Material for Compiler Construction

When students of computer science need to learn what holds the programming language world together, the best bet is to attend a lecture on compiler construction. Mathematics professor Peter Köhler held a lecture on compilers in the 2016 summer term at the University of Giessen [1], Germany. With his permission, Linux Magazine worked through his notes and summarized the most important points for this article.

In simple terms, a compiler goes through three steps: It parses the source code, analyzes it, and synthesizes the finished program (Figure 1).

Figure 1: Rough structure of a compiler: parse code, analyze it, and create an executable program.

In the first step, the compiler reads the source code character by character and tries to identify keywords, variable names, numbers, strings, and all other components – including comments. A component called the scanner takes care of this lexical analysis.

In the second phase, another component checks the syntax and semantics of the program. For example, if a programmer uses a variable that has not yet been declared, or if a semicolon at the end of an expression is missing, the compiler will discover the problem in this phase.

Modern compilers distinguish between syntactic and semantic analysis. Syntax is all about structure, whereas semantics is focused on meaning. The component of the compiler called the parser tries to detect the syntax elements. If this phase is successful, the parser calls the semantic routine.

This delineation makes it possible to optimize tasks and analyze the source code more efficiently.

However, syntactic and semantic analysis turns out to be a non-trivial task, as a simple example shows: The compiler for a dumb calculator needs to parse a text file containing an arithmetic expression. The calculator can only add and multiply numbers, with multiplication taking priority over addition. Where needed, parentheses specify the order of evaluation. For example, the following is allowed,

(1+3)*(4+5)

but this is not:

(1+)*4

The compiler must understand which of the two expressions is valid and which is not.

For the compiler to identify the second expression as faulty, the compiler builder first tells the compiler how correct source code is structured. This information is conveyed through rule sets, and the compiler checks whether the source code follows the rules they define. In the example, a single number is obviously a valid expression. And if x and y are valid expressions, then x + y, x * y, and (x) are also valid expressions.

The rules can also be written in shorthand. The Backus-Naur Form (BNF) is a very popular notation (see Listing 1): The program (program) consists of an expression (expr). This expression, in turn, is either a term or a term followed by the character + and another expression. The last rule is necessary because numbers can consist of several digits (digits).

Listing 1

Backus-Naur Form

program ::= expr
expr    ::= term | term + expr
term    ::= factor | factor * term
factor  ::= digits | ( expr )
digits  ::= digit | digit digits

These rules, known as production or derivation rules, form the grammar of the source language. Like English grammar, this grammar provides the rules from which the programming language is built. Abbreviations, such as term or factor, are known as symbols. All symbols that appear on the left-hand side of a rule are referred to as non-terminals; all others are terminals. The terminals here include digit, +, *, ), and (.

A single programming language may be described by several grammars; the choice depends on the language and the requirements. The grammar used in the preceding example allows arbitrarily long expressions and programs because some of its rules have a recursive structure. It is a context-free grammar (or type 2 grammar). A grammar is considered regular if all of its rules follow one of the two forms:

U ::= t
U ::= Vt

In this case, t is a terminal character; U and V are non-terminal characters.
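For example, the following two rules form a regular grammar that generates numbers of arbitrary length (an illustrative example, with digit again acting as a terminal):

digits ::= digit
digits ::= digits digit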

Top-Down Parser

To the delight of developers, a grammar can be used to create a compiler quite elegantly: For all non-terminals, the developer creates a function that takes care of the corresponding evaluation. If necessary, the function calls on other functions to help. For example, the function expr(), which belongs to expr, has a structure like the one briefly outlined in Listing 2.

According to the second rule in Listing 1, a correct expression such as 1+2 always starts with a term. The expr() function in Listing 2 therefore first calls term(), which has to be implemented separately and evaluates the term. In this example, term() returns 1.

Listing 2

expr()

int expr() {
        int value = term();          /* an expression always starts with a term */
        if (File.ReadNextChar() == '+')
                value += expr();     /* a + means another expression follows */
        return value;
}

In the second step, expr() looks at the next character in the source code. If that character is a +, the second rule of Listing 1 requires another expression to follow, which is why expr() calls itself. In the example, the recursive call returns 2, which expr() adds to the previously computed intermediate result. Having arrived at the end of a valid expression, expr() then returns the calculated value.

At this point, a mature compiler could generate a suitable instruction in machine code. The developer generates functions according to the same principle for all other non-terminals. If you then apply program() to the source code file, you automatically get the translated program – in the example, this would be the calculated number. Because processing starts with program, computer scientists also call this symbol the start symbol.
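The remaining non-terminals can be handled in the same way. The following sketch, written in the same pseudocode style as Listing 2, shows what term() and factor() might look like for the grammar in Listing 1; the digits() helper is assumed, and, like Listing 2, the sketch glosses over pushing back characters that turn out not to belong to the current rule:

int term() {
        int value = factor();        /* a term always starts with a factor */
        if (File.ReadNextChar() == '*')
                value *= term();     /* a * means another term follows */
        return value;
}

int factor() {
        if (File.ReadNextChar() == '(') {   /* a parenthesized expression */
                int value = expr();
                File.ReadNextChar();        /* consume the closing ) */
                return value;
        }
        return digits();             /* otherwise read a number */
}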

The calls to the individual functions can be displayed as a tree in this scenario: the syntax tree.

Bottom-Up Parsers

In the process described so far, the parser simply proceeds from the start symbol (in the example program) and then tries to find appropriate grammar rules for each character that has been read. Such parsers are known as top-down parsers.

Other parsers, known as bottom-up parsers, proceed according to the opposite scheme: They try to replace parts of the source code with the left-hand sides of grammar rules (reduction) until only the start symbol remains. In practice, such bottom-up parsers usually start with an empty stack, onto which they push every character they read along with all previously detected non-terminals. After each new character is read, they check whether the topmost elements on the stack match the right-hand side of a grammar rule. If so, the parser removes those elements from the stack and pushes the non-terminal from the left-hand side of the matching rule in their place. The whole process runs until only the start symbol is left and the end of the source code has been reached. If that doesn't work out, the source code probably contains an error. A bottom-up parser that works according to this pattern is also known as a shift-reduce parser.
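For the grammar in Listing 1, a shift-reduce pass over the input 1+2 might run as follows (a simplified, illustrative trace – a real parser needs lookahead to decide when to shift and when to reduce):

Stack           Input    Action
(empty)         1+2      shift the 1
1               +2       reduce: digits, then factor, then term
term            +2       shift the +
term +          2        shift the 2
term + 2        (empty)  reduce: digits, factor, term, then expr
term + expr     (empty)  reduce the top three elements to expr
expr            (empty)  reduce to program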

The programming of the scanner and parser is a straightforward consequence of the grammar of the language. This process can also be automated: Tools such as Flex [2] or Bison [3] automatically generate the source code for matching scanners and parsers from the grammar (Figure 2).

Figure 2: Bison is a well-known parser generator for the Linux environment (from the Bison tutorial by Lan Gao, http://alumni.cs.ucr.edu/~lgao/teaching/bison.html).

Whenever the parser needs more information, it requests it from the scanner. In the example in Listing 2, the expr() function would thus not read the next character itself; instead, File.ReadNextChar() simply consults the scanner.

Scanner Internals: Machines

The source code of most programming languages consists of numbers, variable names, keywords, operators, and separators (delimiters). Names and keywords are composed of several letters and characters.

A finite-state machine helps the scanner detect int, &&, and all other components of the language. Imagine it like a ticket vending machine: The customer inserts 20 cents, and the machine changes its status to "20 cents paid." If the customer then inserts a dollar, the machine changes its status to "120 cents paid." This continues until the machine reaches the "price paid in full" status. Similarly, you can build a machine that accepts characters from the source code and switches to a different status for each character read: If the scanner has read an i, the machine changes to the state "i read"; if an n follows, it changes to the state "n read." If a t follows, the machine has finally read the whole keyword int and changes its status to "int recognized."

Such a machine has an initial status and at least one final status. The machine for detecting int has the initial "no character read yet" status and the two final statuses "int recognized" and "not the keyword int." When the machine ends up in the "int recognized" status, computer scientists say it has accepted the word int.

You can also draw the machine as a diagram in which boxes (nodes) represent the individual states (Figure 3). The start and end states are highlighted accordingly. Arrows (edges) indicate the transitions from one status to another; a transition only takes place if the machine has read one of the characters written on the arrow.

Figure 3: The machine changes status as soon as it encounters a character, digit, or symbol.

A regular grammar can be used to construct not only a parser, but also a finite-state machine that accepts the language of the grammar: First, create a state for each non-terminal, plus a start state and an end state. The machine then makes a transition from state s to state n when it reads a character t, provided there is a grammar rule of the form n ::= st. Incidentally, this process creates a bottom-up parser for the grammar.

In a simple scenario, you could read the next character and check, with if queries or a switch construction, which character it is and finally note, in the form of a variable, the new state to which the machine has changed.
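The following C sketch shows this switch-based approach for a machine that recognizes the keyword int; the state names and the step() function are illustrative assumptions, not part of any particular scanner:

/* One status per partially read keyword, plus two final statuses. */
enum state { START, GOT_I, GOT_IN, GOT_INT, NOT_INT };

/* Compute the next state from the current state and the character read. */
enum state step(enum state s, char c) {
        switch (s) {
        case START:  return (c == 'i') ? GOT_I   : NOT_INT;
        case GOT_I:  return (c == 'n') ? GOT_IN  : NOT_INT;
        case GOT_IN: return (c == 't') ? GOT_INT : NOT_INT;
        default:     return s;   /* the final states never change */
        }
}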

Once the scanner has identified the next element in the source code (such as the keyword int), it replaces the element with a symbol and returns it to the parser. The aforementioned Flex tool also relies on this finite-state machine strategy.

Symbol Table

To map the source code compactly in memory, the scanner replaces each recognized keyword, each variable name, and all other elements with a symbol. A rather long variable name, for example, could be replaced by two numbers: The first number indicates that the element is a variable name; the second specifies the row of a table, the symbol table, that stores the name the developer actually used. During syntax and semantic analysis, additional information, such as the variable's type, is added to the symbol table.

To find an entry in the symbol table as quickly as possible, many compilers run the variable name through a hash function that can be calculated quickly. This step spits out a number, which the compiler then uses as an index in the symbol table. If the compiler encounters the variable name again later on, it simply calculates the hash value and immediately gets the location in the symbol table with all important information about the variable – such as its type. The compiler does not have to search through the entire table.
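A minimal sketch of such a hash lookup in C follows; the multiplier, the table size, and the function name are illustrative choices and not taken from any particular compiler:

#define TABLE_SIZE 211   /* a prime number of slots helps spread the entries */

/* Map a variable name to an index in the symbol table. */
unsigned int hash(const char *name) {
        unsigned int h = 0;
        while (*name)
                h = h * 31 + (unsigned char)*name++;   /* mix in each character */
        return h % TABLE_SIZE;
}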

Both the scanner and the semantic routine generate a number of tables in addition to the symbol table. Among other things, these tables contain the nesting structure for the loops and the loop variables. The compiler repeatedly accesses the information in the tables, even later on.

Interpreter, Assembler, and Translator

Unlike the compiler, an interpreter reads the source code and executes it directly; thus, no object code is generated. Classic interpreters analyze and execute the commands in the source code one after another.

Modern interpreters convert the complete source code into a special optimized internal representation. The interpreter then executes this intermediate or byte code much faster. Sometimes a just-in-time compiler translates the internal representation into machine language, which further increases the execution speed. Java uses this procedure.

An assembler is a special form of compiler that translates a program written in assembly language into machine language. Since assembly language is usually just a symbolic representation of the machine instructions, the two languages are closely related. The generic term translator usually covers all three (i.e., compilers, assemblers, and interpreters).

Internal Representation of the Source Code

The scanner passes the symbols it has determined to the parser. The syntax and semantic analysis ends with an internal representation of the source code. What this representation looks like depends on the compiler and the language. The program could be present as a syntax tree or in Polish notation. Many compilers also use quadruples; the statement A = B + A, for example, would become:

+,      B,      A,      T1
=,      T1,             A

T1 is a temporary variable created by the compiler.
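A longer statement shows how such temporary variables chain together. For A = B + C * D, a compiler following the same scheme might generate (an illustrative example):

*,      C,      D,      T1
+,      B,      T1,     T2
=,      T2,             A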

So far, the compiler has only analyzed the source program. For this reason, experts also refer to this first phase as the analysis phase.

Generating Code

In the next step, another component optimizes the internal representation. As a rule, the compiler optimizes for run time and assigns memory locations to the variables. In the preceding example, the compiler would try to eliminate the temporary variable T1.

In the last phase, the compiler finally generates the executable machine code. Generally, programmers call this object code or simply code. Under Linux, it is usually either a (dynamic) library or the executable program.

Some compilers also produce assembler code, which is then converted into machine language by a downstream assembler. The compiler could generate the following code from A = A + B:

lda a   ; load a in the accumulator
add b   ; add b to the accumulator
sto a   ; store accumulator to a

Control structures such as if, while, and for can usually be mapped to the jump instructions of the processor. The compiler may first replace a complex loop, such as for, with an equivalent (longer) while loop.
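As a sketch, the loop while (a < b) { a = a + 1; } could be mapped to jumps in the same hypothetical assembly language as above; the sub, jge, and jmp instructions, and the memory location one holding the constant 1, are assumptions:

loop:   lda a           ; load a into the accumulator
        sub b           ; compute a - b in the accumulator
        jge done        ; leave the loop as soon as a >= b
        lda a           ; loop body: a = a + 1
        add one         ; add the constant 1
        sto a           ; store the incremented value back to a
        jmp loop        ; jump back to the loop test
done:   ...             ; execution continues after the loop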

Because the compiler knows the processor's instruction set and the information from the tables, it can also make the code more compact. The stack is used for function calls: Before a function starts, the generated code pushes the function's arguments and the return address onto the stack. The processor then executes the function. Finally, the stack has to be cleaned up again; current processors provide special instructions to support the compiler in this task.
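In the same hypothetical assembly style, a call to a function f with a single argument might look like the following sketch; the psh, cal, and pop instructions are assumptions, and real calling conventions vary by processor:

        psh a           ; push the argument onto the stack
        cal f           ; push the return address and jump to f
        pop tmp         ; on return, discard the argument (tmp is a scratch location)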

Conclusion

This article offered a brief look at how compilers parse and prepare source code. Keep in mind that not all compilers adhere to the approach described in this article. For example, one-pass compilers do without an internal representation of the code and generate the object code in a single pass. Also, the steps presented in this article sometimes do not occur in distinct phases but are, instead, interwoven.

Infos

  1. Compiler construction lecture notes (in German): http://www.staff.uni-giessen.de/~gc1079/
  2. Flex: https://github.com/westes/flex
  3. Bison: https://www.gnu.org/software/bison/