Compiler is the translator between human readable high level language and the computer readable low level languages, it translate the a program from a source language into a target language. Why do we need compiler? Because for human beings, programming in a machine language, such as assembly is highly inefficient and time consuming.

Different compiler does the translation differently, for example, Java complier compile Java into a JVM byte code that stored in the class file, while a C compiler compile C language into an executable file. However, they all follow similar process and are designed based on the similar principles. Understanding how compiler works is an important steps for us to understand how programming languages are design, and how are they compiled into executable code.

A compiler usually has three major components: a front-end component, a middle component, and of course, a back-end component. We will start from the front-end component, which mainly including the scanner, the parser, and the semantic analysis modules.

  • Scanner

A scanner is also called a tokenizer. As indicated by the name, it reads the source file and generates the tex into a stream of known objects, called token. And the token will be send to the Parser for parsing. There are different elements on the coding text, say variables, operators, operands, strings, the lexical analysis is to identify their types.

In the lexical analysis, the parser reads one character at a time, remove the white space and comments from the text, then form the <token-type, value> tuple. For example, <type: Operator, value: +>, <type: Var, value: numberOfDays>. In this phase, parser will check whether the token is legal string. The definition of legal string is usually based on regular expression, which we will discuss in the future.

  • Parser

A parser translates the code to rules of grammar, it build the representation of the code. The following is a simple set of grammar rules:

expression = atom   | list
atom       = number | symbol    
number     = [+-]?['0'-'9']+
symbol     = ['A'-'Z']['A'-'Z''0'-'9'].*
list       = '(', expression*, ')'

When the parser receives the token stream from the scanner, it tries to match the token with the rules on the grammar set. If it is a number, then checking if the number is valid, if it is an expression, checking whether it is an atom expression or a list expression. The low level grammars can be combined and composed into high level grammars.

  • Syntax and Semantic Analysis

The parser will do two types of analysis: syntax analysis, semantic analysis. The difference is subtle: the syntax analysis checks whether or not the sentence is valid, while while semantic refers to the meaning of the sentence. For example:

int x;
print("%d", x)

This example is syntax valid, but semantic invalid. In semantic check, the parser understands the semantic by triggering a set of semantic actions. For example, compiler will maintain a symbol tables to track the scope and status of the variable. When the parser first see the variable, it enter the values and scope into symbol table, when it sees the variable in use, it will check if it has been defined and in the correct scope, otherwise it will throw error. The variable saving and checking is one of a semantic actions. There are more of such actions.