C Compilation


Compiling a C program involves several steps, called translation phases. Preprocessing is one of these phases. To understand what can and cannot be done with the preprocessor, we first need to learn a bit about what each phase does and how the phases fit together.

Phases of translation

The C Standard defines 8 phases of translation, where the output of each phase is the input of the next:

  1. Interpret the encoding of the source files (e.g. replace \r\n with \n)

  2. Delete newlines preceded by \

  3. Tokenizing: group characters that belong together, assigning a “type” to each group (more on that later).

  4. Preprocessing:

    1. The preprocessor is executed

    2. Each file introduced with the #include directive goes through phases 1 through 4, recursively.

    3. At the end of this phase, all preprocessor directives are removed from the source.

  5. Escape sequences in string literals are interpreted (e.g. the two adjacent characters \ n are replaced by a single byte with the value 10)

  6. Adjacent string literals are concatenated

  7. Compilation: the tokens are syntactically and semantically analyzed and translated as a translation unit.

  8. Linking: Translation units and library components needed to satisfy external references are collected into a program image which contains information needed for execution in its execution environment (the OS).

Source: cppreference
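Phases 2, 5 and 6 are easy to observe from inside a program. The sketch below is only an illustration: the names GREETING, greeting and newline_literal_length are made up for this example and do not come from any library.

```c
#include <string.h>

/* Phase 2: the backslash-newline pair is deleted, so the preprocessor
   later sees this definition as one single logical line. */
#define GREETING "Hello, " \
                 "world"

/* Phase 6: GREETING expands to two adjacent string literals, and "\n"
   is a third one; all three are concatenated into a single literal. */
const char *greeting(void)
{
    return GREETING "\n";
}

/* Phase 5: the two source characters \ and n are replaced by one
   newline character, so the literal holds 1 character, not 2. */
size_t newline_literal_length(void)
{
    return strlen("\n");
}
```

Calling greeting() yields the same string as the single literal "Hello, world\n", and newline_literal_length() returns 1.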

Terminology

The word “preprocessing” can be used more generally to refer to all the steps before compilation (phases 1 to 6).

In the GNU toolchain, the cpp (“The C Preprocessor”) program handles phases 1 to 4.

It is possible to ask the compiler to run only specific phases. In the table below, phases 1 to 6 together form "preprocessing" in the general sense, phase 7 is compilation and phase 8 is linking:

| Command | Phases covered |
|---|---|
| cc source.c -o executable | 1 to 8 |
| cc -c source.c -o compiled.o | 1 to 7 |
| cc -E source.c -o processed.i | 1 to 4 |
| cpp source.c -o processed.i | 1 to 4 |
| cc -c processed.i -o compiled.o | 5 to 7 |
| cc processed.i -o executable | 5 to 8 |
| cc compiled.o -o executable | 8 |
| ld compiled.o -o executable | 8 |

Tokenizing

To best understand what kind of input phase 4 works with, we must detail what phase 3 does. (How it does it, however, is outside the scope of this book[1].)

A tokenizer can be viewed as a black box that

  • takes as input a sequence of characters

  • outputs a series of tokens, where a token is a group of characters with an assigned type.

This particular tokenizer recognizes and can emit the following types of tokens:

| Type | Description | Examples |
|---|---|---|
| inclusion | the name of a file whose content shall be pasted | <math.h>, "libft.h" |
| identifier | a keyword or name (type, variable, function, …) | size_t, strlen, while |
| preprocessing number | integer and floating constants | 42, 1.61, 3.E-12 |
| operator or punctuator | | + { <<= |
| string or character literal | | Nice name, A |
| space | | |
| newline | | |
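To make the black-box view more concrete, here is a deliberately simplified tokenizer sketch. It is not how a real compiler implements phase 3: the names token_type, token and next_token are hypothetical, the inclusion type and comments are left out, operators are emitted one character at a time, and numbers such as 3.E-12 are not handled in full.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types, mirroring the table (inclusion omitted). */
enum token_type {
    TOK_IDENTIFIER, TOK_NUMBER, TOK_PUNCTUATOR,
    TOK_LITERAL,    /* string or character literal */
    TOK_SPACE, TOK_NEWLINE, TOK_END
};

struct token {
    enum token_type type;
    const char *start;  /* first character of the token's text */
    size_t length;      /* number of characters in the text */
};

/* Reads one token starting at *cursor and advances *cursor past it. */
struct token next_token(const char **cursor)
{
    const char *p = *cursor;
    struct token tok = { TOK_END, p, 0 };

    if (*p == '\0')
        return tok;

    if (*p == '\n') {
        tok.type = TOK_NEWLINE;
        p++;
    } else if (*p == ' ' || *p == '\t') {
        tok.type = TOK_SPACE;
        while (*p == ' ' || *p == '\t')
            p++;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        tok.type = TOK_IDENTIFIER;
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
    } else if (isdigit((unsigned char)*p)) {
        tok.type = TOK_NUMBER;  /* simplified: digits, letters and dots */
        while (isalnum((unsigned char)*p) || *p == '.')
            p++;
    } else if (*p == '"' || *p == '\'') {
        char quote = *p++;
        tok.type = TOK_LITERAL;
        tok.start = p;          /* the quotes are not part of the text */
        while (*p != '\0' && *p != quote) {
            if (*p == '\\' && p[1] != '\0')
                p++;            /* keep an escaped character such as \" */
            p++;
        }
        tok.length = (size_t)(p - tok.start);
        if (*p == quote)
            p++;                /* skip the closing quote */
        *cursor = p;
        return tok;
    } else {
        tok.type = TOK_PUNCTUATOR;  /* simplified: a single character */
        p++;
    }

    tok.length = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}
```

Calling next_token repeatedly on a cursor into the string x = 'a'; yields an identifier, a space, a punctuator, a space, a literal whose text is a, a punctuator, and finally TOK_END.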

Note

When a comment is encountered, a single space is emitted.

If adjacent newlines are encountered, a single one may be emitted.
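The comment rule has a visible effect on how the surrounding characters are grouped. The short sketch below, with made-up names value and sum_with_comment, shows two cases where a comment separates tokens exactly as a space would.

```c
/* The comment separates `int` from `value`, just as a space would. */
int/* comment */value = 7;

int sum_with_comment(int a, int b)
{
    /* The empty comment keeps the two `+` characters in separate tokens,
       so this is read as a + (+b) and not as the `++` operator. */
    return a +/**/+ b;
}
```

With these definitions, value holds 7 and sum_with_comment(1, 2) returns 3.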

It’s easier to understand with an example:

Source file

#include <stdio.h> // printf

/* What did you expect ? */
int main()
{
    printf("Hello world\n");
}

Phase 3 input

# i n c l u d e < s t d i o . h > \n \n / * W

Phase 3 output

<stdio.h> int main ( ) { printf ( Hello world\n ) ; }

Important

The characters " and ' are never emitted as tokens of their own. Instead, their presence in the source code determines the type of the token that is emitted:

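| Input | Output | Resulting type |
|---|---|---|
| print(my_name) | print ( my_name ) | my_name is an identifier |
| print("my_name") | print ( my_name ) | my_name is a string literal |
| x = a; | x = a ; | a is an identifier |
| x = 'a'; | x = a ; | a is a character literal |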
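One practical consequence is that the preprocessor, which works on tokens, never replaces a macro name found inside a string literal. The sketch below assumes a hypothetical macro called my_name; the function names are likewise invented for the illustration.

```c
/* Hypothetical macro: replaces the identifier token my_name. */
#define my_name "Ada"

/* Here my_name is an identifier token, so phase 4 replaces it. */
const char *from_identifier(void)
{
    return my_name;
}

/* Here my_name is the text of a string literal token, which the
   preprocessor leaves untouched. */
const char *from_string(void)
{
    return "my_name";
}
```

from_identifier() returns the string Ada, while from_string() returns the string my_name unchanged.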