C Compilation#

Compiling a C program involves several steps, called translation phases. Preprocessing is one such phase. To better understand what can and cannot be done with the preprocessor, we’ll have to learn a bit about what each phase does and how the phases fit together.

Phases of translation#

The C Standard defines 8 phases of translation, where the output of each phase is the input of the next:

1. Interpret the encoding of the source files (e.g. replace \r\n with \n).
2. Delete each \ that is immediately followed by a newline, together with that newline, so that the two lines are joined.
3. Tokenizing: group characters that belong together, assigning a “type” to each group. (More on that later.)
4. Preprocessing:
   1. The preprocessor is executed.
   2. Each file introduced with the #include directive goes through phases 1 through 4, recursively.
   3. At the end of this phase, all preprocessor directives are removed from the source.
5. Escape sequences in character and string literals are interpreted (e.g. the two adjacent characters \ n are replaced by a single byte with the value 10).
6. Adjacent string literals are concatenated.
7. Compilation: the tokens are syntactically and semantically analyzed and translated as a translation unit.
8. Linking: translation units and library components needed to satisfy external references are collected into a program image, which contains the information needed for execution in its execution environment (the OS).

Source: cppreference
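Phases 2, 4, 5 and 6 can be observed from inside a program. The following is a minimal sketch, assuming an ASCII-based execution character set; the names GREETING, greeting_length, newline_literal_size and newline_value are made up for this illustration.

```c
#include <string.h>

/* Phase 2: the backslash-newline pair below is deleted, so the
   directive is seen as one logical line. */
#define GREETING "Hello, " \
                 "world"

/* Phase 4 replaces GREETING with its two string literals.
   Phase 6 then concatenates them into "Hello, world". */
size_t greeting_length(void)
{
    return strlen(GREETING);
}

/* Phase 5: the two source characters \ and n become one character,
   so the literal holds that character plus the terminating null. */
size_t newline_literal_size(void)
{
    return sizeof("\n");
}

/* The value of that single character (10 on ASCII-based systems). */
int newline_value(void)
{
    return "\n"[0];
}
```

Under these assumptions, greeting_length() returns 12, newline_literal_size() returns 2, and newline_value() returns 10.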

Tokenizing#

To best understand what kind of input phase 4 works with, we must detail what phase 3 does. (How it does so is outside the scope of this book[1].)

A tokenizer can be viewed as a black box that

takes as input a sequence of characters
outputs a series of tokens, where a token is a group of characters with an assigned type.

This particular tokenizer recognizes and can emit the following types of tokens:

Type	Description	Examples
inclusion	the name of a file whose content shall be pasted	`<math.h>` `"libft.h"`
identifier	a keyword or name (type, variable, function, …)	`size_t` `strlen` `while`
preprocessing number	integer and floating constants	`42` `1.61` `3.E-12`
operator or punctuator		`+` `{` `<<=`
string or character literal		`Nice name` `A`
space
newline

Note

When a comment is encountered, a single space token is emitted in its place.

If adjacent newlines are encountered, a single one may be emitted.

It’s easier to understand with an example:

Source file

#include <stdio.h> // printf

/* What did you expect ? */
int main()
{
    printf("Hello world\n");
}

Phase 3 input

# i n c l u d e < s t d i o . h > \n \n / * W…

A stream of characters

Phase 3 output

<stdio.h> int main ( ) { printf ( Hello world\n ) ; }

A stream of tokens
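The rule that a comment leaves a single space behind has visible consequences. Here is a small sketch with made-up names (answer, add), in which each comment is the only thing separating two tokens:

```c
/* In phase 3 the comment below is replaced by one space, so the
   declaration reads `int answer`, not `intanswer`. */
int/* comment */answer = 42;

/* Without the space left by the comment, `++` would be read as a
   single increment operator; with it, the expression is a + (+b). */
int add(int a, int b)
{
    return a +/* comment */+ b;
}
```

Both definitions compile only because the tokenizer emits a space where each comment was: answer is a valid int variable, and add(1, 2) evaluates to 3.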

Important

The characters " and ' are never emitted as tokens. Instead, their presence in the source code affects the type of the token that will be emitted:

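Input	Output	Type of the affected token
`print(my_name)`	print ( my_name )	my_name is an identifier
`print("my_name")`	print ( my_name )	my_name is a string literal
`x = a;`	x = a ;	a is an identifier
`x = 'a';`	x = a ;	a is a character literal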
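To make the black-box description more concrete, here is a toy tokenizer. It is only a sketch of the model used in this chapter, not how a real compiler works: the names toy_tokenize, struct token and the TOK_ constants are hypothetical, every punctuator is assumed to be one character long, string and character literals share one type, and comments and inclusion tokens are not handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types, loosely following the table above. */
enum token_type {
    TOK_IDENTIFIER,
    TOK_NUMBER,
    TOK_PUNCTUATOR,
    TOK_LITERAL,
    TOK_SPACE,
    TOK_NEWLINE
};

struct token {
    enum token_type type;
    const char *start; /* first character of the token text */
    size_t length;     /* number of characters in the token text */
};

/* Splits src into at most max tokens and returns how many were written. */
size_t toy_tokenize(const char *src, struct token *out, size_t max)
{
    const char *p = src;
    size_t count = 0;

    while (*p != '\0' && count < max) {
        struct token t;
        unsigned char c = (unsigned char)*p;

        t.start = p;
        if (c == '\n') {
            t.type = TOK_NEWLINE;
            p++;
        } else if (isspace(c)) {
            /* A run of blanks becomes a single space token. */
            t.type = TOK_SPACE;
            while (*p != '\n' && isspace((unsigned char)*p))
                p++;
        } else if (isalpha(c) || c == '_') {
            t.type = TOK_IDENTIFIER;
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
        } else if (isdigit(c)) {
            /* Preprocessing number: digits, letters, dots, and a sign
               directly after an exponent marker, as in 3.E-12. */
            t.type = TOK_NUMBER;
            while (isalnum((unsigned char)*p) || *p == '.' ||
                   ((*p == '+' || *p == '-') &&
                    (p[-1] == 'e' || p[-1] == 'E')))
                p++;
        } else if (c == '"' || c == '\'') {
            /* The quotes select the token type but are not part of
               the token text. */
            char quote = *p++;
            t.type = TOK_LITERAL;
            t.start = p;
            while (*p != '\0' && *p != quote) {
                if (*p == '\\' && p[1] != '\0')
                    p++; /* keep an escaped character inside the literal */
                p++;
            }
            t.length = (size_t)(p - t.start);
            if (*p == quote)
                p++; /* step over the closing quote */
            out[count++] = t;
            continue;
        } else {
            t.type = TOK_PUNCTUATOR;
            p++;
        }
        t.length = (size_t)(p - t.start);
        out[count++] = t;
    }
    return count;
}
```

Given the inputs `x = a;` and `x = 'a';`, this sketch produces six tokens with the same text in both cases (x, space, =, space, a, ;). Only the type of the fifth token differs: TOK_IDENTIFIER for the first input and TOK_LITERAL for the second.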
