C Compilation
The compilation of a C program consists of several steps, called translation phases. Preprocessing is one of these phases, and to better understand what can and cannot be done with the preprocessor, we’ll have to learn a bit about what each phase does and how the phases fit together.
Phases of translation
The C Standard defines 8 phases of translation, where the output of each phase is the input of the next:

1. Interpret the encoding of the source files (e.g. replace `\r\n` with `\n`).
2. Delete newlines preceded by `\`.
3. Tokenizing: group characters that belong together, assigning a “type” to each group. (More on that later.)
4. Preprocessing:
   - The preprocessor is executed.
   - Each file introduced with the `#include` directive goes through phases 1 through 4, recursively.
   - At the end of this phase, all preprocessor directives are removed from the source.
5. Escape sequences in string literals are interpreted (e.g. the two adjacent characters `\` and `n` are replaced by a single byte with the value 10).
6. Adjacent string literals are concatenated.
7. Compilation: the tokens are syntactically and semantically analyzed and translated as a translation unit.
8. Linking: translation units and library components needed to satisfy external references are collected into a program image which contains information needed for execution in its execution environment (the OS).
Source: cppreference
Terminology
The word “preprocessing” can be used more generally to refer to all the steps before compilation (phases 1 to 6).
In the GNU toolchain, the `cpp` (“The C Preprocessor”) program handles phases 1 to 4.
It is possible to ask the compiler to stop after specific phases: with GCC, for example, `-E` stops after preprocessing and `-c` stops before linking.
Tokenizing
To best understand what kind of input phase 4 works with, we must detail what phase 3 does. (How it does it, however, is outside the scope of this book[1].)
A tokenizer can be viewed as a black box that:

- takes as input a sequence of characters;
- outputs a series of tokens, where a token is a group of characters with an assigned type.
This particular tokenizer recognizes and can emit the following types of tokens:
| Type | Description | Examples |
|---|---|---|
| inclusion | the name of a file whose content shall be pasted | `<math.h>` `"libft.h"` |
| identifier | a keyword or name (type, variable, function, …) | `size_t` `strlen` `while` |
| preprocessing number | integer and floating constants | `42` `1.61` `3.E-12` |
| operator or punctuator | | `+` `{` `<<=` |
| string or character literal | | `Nice name` `A` |
| space | | |
| newline | | |
Note
When a comment is encountered, a single space token is emitted in its place.
If adjacent newlines are encountered, a single one may be emitted.
It’s easier to understand with an example:
Source file
#include <stdio.h> // printf

/* What did you expect ? */
int main()
{
    printf("Hello world\n");
}
Phase 3 input
# i n c l u d e < s t d i o . h > \n \n / * W …
Phase 3 output
`<stdio.h>` `int` `main` `(` `)` `{` `printf` `(` `Hello world\n` `)` `;` `}`
Important
The characters `"` and `'` are never emitted as tokens; instead, their presence in the source code affects the type of the token that will be emitted:
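| Input | Output |
|---|---|
| `print(my_name)` | `print` `(` `my_name` `)`, where `my_name` is an identifier |
| `print("my_name")` | `print` `(` `my_name` `)`, where `my_name` is a string literal |
| `x = a;` | `x` `=` `a` `;`, where `a` is an identifier |
| `x = 'a';` | `x` `=` `a` `;`, where `a` is a character literal |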