HACKING

Overview

Execution Environments

Build, Host, Target

Run-time environments: ABIs and Toolchains

The term "run-time environment" is not well-defined. Some vendors use
"run-time library" to mean the standard library they provide with
their compiler products. Others use it to mean a smaller subset of
code that complements the standard (or "system") libraries.

An ABI (Application Binary Interface) describes a conventional
low-level interface between an application program and the execution
environment. This includes (among other things) storage allocation
conventions, calling conventions, and binary file format.

A toolchain is a set of software elements … A toolchain includes a
compiler, a linker, and a library, along with other utilities for
working with binary files (ar, nm, etc.). It also includes the
startup/shutdown code that is attached to executables. A given
execution environment may support multiple toolchains. For example,
on Linux the standard toolchain is the GNU toolchain: gcc, glibc, and
GNU binutils. But you can use other compilers and libraries as well;
for example, gcc with newlib or dietlibc, or the Intel C compiler with
glibc. Even some of the hoary old compilers for Windows are
available: Watcom is now Open Watcom, with Linux support on the way.

An important element of the run-time environment that is not part of
the toolchain is the program /loader/. The loader is provided by the
operating system. Its job is to load the executable binary image into
memory and then transfer control to the entry point. For this to
work, there must be a convention regarding the structure of the binary
image and the name or offset of the entry point. The loader may also
perform more sophisticated operations conventionally associated with
the linker, such as relocation and symbol resolution.
More on this later.
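
On ELF systems you can glimpse the loader's job from user space: the
POSIX dlopen()/dlsym() calls load an object image into memory and
resolve a symbol by name, the same operations the loader performs for
an executable. A minimal sketch (the choice of libm.so.6 and cos is
just for illustration; link with -ldl on older glibc):

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        /* load a binary image into memory */
        void *h = dlopen("libm.so.6", RTLD_NOW);
        if (!h) {
            fprintf(stderr, "%s\n", dlerror());
            return 1;
        }

        /* resolve a symbol by name and transfer control to it */
        double (*cosine)(double) = (double (*)(double))dlsym(h, "cos");
        if (cosine)
            printf("cos(0) = %f\n", cosine(0.0));

        dlclose(h);
        return 0;
    }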

A particular toolchain establishes a /de facto/ ABI. Elements of one
toolchain may not work with another toolchain. For example, it
probably won't do to use malloc from one library and free from
another. It may not even work to use malloc and free from the same
library if they were compiled with different compilers, especially if
you're using a language other than C. In the case of C, much (but not
all) of the ABI is specified by the language and standard library
definitions.
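
One common way to stay safe is to keep allocation and deallocation on
the same side of a library boundary. A sketch (the mylib_* names are
invented for the example):

    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    /* Imagine these live in mylib.so, possibly built with a
     * different compiler or C library than the application. */
    char *mylib_strdup(const char *s)
    {
        char *p = malloc(strlen(s) + 1);  /* mylib's allocator */
        if (p)
            strcpy(p, s);
        return p;
    }

    void mylib_free(char *p)
    {
        free(p);                          /* mylib's free */
    }

    int main(void)
    {
        char *s = mylib_strdup("hello");
        printf("%s\n", s);
        mylib_free(s);  /* correct: release through the allocator that
                           did the malloc; calling the application's own
                           free(s) assumes both sides share one heap,
                           which the ABI may not guarantee */
        return 0;
    }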

Compilation: Concepts and Terminology

Lexis

Syntax

Tokens, Symbols, and Attributes

The literature on compilers often uses the term token when discussing lexical analysis and parsing. Usage of this term is not always consistent, however; distinctions between token, token type, token class, and symbol are often glossed over, which can lead to some confusion. So here's an attempt at clarity:

lexeme
a run of text in the source that is recognized by the lexer as a member of a lexical class; e.g. a declaration like "int foo;" is recognized as three lexemes, "int", "foo", and ";"
token
usually this means a part of the source text that is recognized by the lexer as a lexical element. E.g. each keyword is recognized as a token, as is each identifier. In fact the basic idea is that the source text - a sequence of characters - is analyzed by the lexical analyzer and turned into a stream of tokens that is passed to the parser. That way the parser doesn't have to deal with the internal structure of the lexical elements. Token in this sense is often used interchangeably with the term symbol.
terminal symbol
in grammar specifications a terminal symbol cannot be decomposed into component elements; for example, ID is a terminal symbol in most languages; it denotes the set of lexemes recognized as an ID
token type
tokens are only recognized because they match a syntactic production, and are thus members of the class defined by that production. In other words, each syntactic (lexical) production effectively defines a token class and matches a set of lexemes in a source text. Many token classes have only a single member, e.g. '<', but that member may occur any number of times in program text; i.e. each occurrence of '<' is represented as a token in the input stream.
token type code
compilers usually encode the token type of each token in the input stream by defining a token type code using a #define or an enumeration constant.
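
For instance, token type codes might be defined like this (a
hypothetical sketch; the names are invented, though tcc really does
use TOK_CINT for integer constants):

    /* one code per token class */
    enum tok_type {
        TOK_EOF,
        TOK_ID,     /* identifiers: foo, main, ...      */
        TOK_KW_INT, /* the keyword "int"                */
        TOK_LT,     /* '<' - a single-member class      */
        TOK_CINT,   /* integer constants: 17, 0x11, ... */
        TOK_SEMI    /* ';'                              */
    };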

Attributes of tokens

text
the text of the recognized lexeme; e.g. in "int foo;", the lexeme "foo" is recognized as a token of type TOK_ID, with a text attribute of "foo".
token value
the (C) value encoded by the parsed token; e.g. 17 might be recognized as TOK_LITERALINT (tcc uses TOK_CINT), whose text attribute is "17" and whose value attribute is 0x11.
other attributes
compilers commonly use token attributes to record debug information, such as the line number on which the token occurs.

Hypothetical examples:

    char *s = "abc":  "abc" =>
        tok.typecode = TOK_STR
        tok.text     = "abc"
        tok.value    = "abc"

A compiler isn't necessarily going to record all this info.
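
A hypothetical token record collecting the attributes above (invented
for illustration; not tcc's actual layout):

    struct token {
        int typecode;        /* one of the TOK_* codes           */
        const char *text;    /* lexeme as written: "abc", "17"   */
        union {
            long i;          /* TOK_CINT: 17 -> 0x11             */
            const char *s;   /* TOK_STR: decoded string value    */
        } value;
        int line;            /* for debug info / diagnostics     */
    };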

Parsing Stacks

Parsers use data structures to keep track of state during a parse:
environments, local variables, globals, and so on. …discussion of the
stacks, hash tables, etc. that tcc uses to go here…

Symbol Tables

Compiler internal symbol table

The compiler maintains table(s) of symbols seen and referenced…

Output Format (Linker) symbol table

Output formats (ELF, COFF, PE/COFF, Mach-O) provide for a symbol table
which will be used by a linker to resolve symbol references and for
relocation… Not to be confused with the compiler's internal symbol
table, although the latter will be used to write the former.

…brief overview of tinycc's strategy, details to go below…

Instruction Selection, Register Allocation, etc.

Output Formats

Linking

Relocation

Symbol Resolution

Tiny cc Implementation Design

The standard model for a compiler involves a phased approach:

  • Lexer (Scanner)
  • Parser
  • Concrete Syntax Tree
  • Abstract Syntax Tree
  • Intermediate Representation
  • Backend
      • Instruction Selection
      • Optimization

Toolchains may implement these tasks using separate software
components. For example, one component may parse the input file and
produce an abstract syntax tree; a second component may take the AST
as input and produce an Intermediate Representation - a kind of
abstract assembly language. A backend component would take the IR as
input and produce machine code in an output format.

Tiny cc takes a different approach. The design priorities of tcc are
speed and size, so it eschews the phased approach in favor of
incremental compilation. Instead of multiple phases (and therefore
multiple processing passes), it tries to parse the source code and
emit machine code as soon as possible. This means it can process
source code much faster than a compiler that takes the traditional
approach. However, one consequence of this approach is that the code
it produces is not optimized. With tcc you get very fast compilation,
but the code it produces will be slower and fatter than code produced
by optimizing compilers. This is because optimization strategies
often involve examining the entire program, or at least large chunks
of it. Tiny cc does try to do some optimizations, but only those that
are compatible with its approach. This doesn't mean the code produced
by tcc is horribly slow, only that it is not as fast as fully
optimized code.
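
To make the contrast concrete, here is a toy single-pass compiler (a
sketch invented for this document, not tcc's actual code): it parses
digit '+' digit expressions and prints i386-flavored instructions the
moment each construct is recognized, with no tree or IR in between.

    #include <stdio.h>
    #include <ctype.h>

    static const char *src;   /* cursor into the source text */

    static void parse_operand(void)
    {
        if (isdigit((unsigned char)*src))
            /* emit code immediately - nothing is saved in a tree */
            printf("  mov eax, %d\n", *src++ - '0');
    }

    static void parse_expr(void)
    {
        parse_operand();
        while (*src == '+') {
            src++;
            printf("  push eax\n");      /* save left operand  */
            parse_operand();             /* right operand -> eax */
            printf("  pop ecx\n");
            printf("  add eax, ecx\n");
        }
    }

    int main(void)
    {
        src = "1+2+3";
        parse_expr();   /* code appears as parsing proceeds */
        return 0;
    }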

Build System

NOTE: the build system is undergoing a major rewrite. Below is one possibility:

Tiny cc supports several host and target environments. The makefiles
and the config.h header use macros to indicate which combination
should be built. Some examples:

  BLD_CC             the compiler to use to compile tcc itself
  HOST_ARCH_I386     tcc will run on the i386
  HOST_ABI_RT_GLIBC  tcc will use the glibc library
  TARGET_ARCH_C67    tcc should emit machine code for the TMS320C67x
                     Digital Signal Processor
  TARGET_ABI_OF_ELF  tcc should write ELF files

See INSTALL for details.
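
One possible shape for this in code (hypothetical; the file names
follow tcc's <arch>-gen.c pattern, but the macro logic here is just a
sketch of the proposal above):

    /* config.h - selected by the build system */
    #define HOST_ARCH_I386     1
    #define HOST_ABI_RT_GLIBC  1
    #define TARGET_ARCH_C67    1
    #define TARGET_ABI_OF_ELF  1

    /* tcc.c - pull in exactly one code generator */
    #if defined(TARGET_ARCH_I386)
    #  include "i386-gen.c"
    #elif defined(TARGET_ARCH_C67)
    #  include "c67-gen.c"
    #else
    #  error "no target architecture selected"
    #endif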

The Code

Organization of the Source Tree

Naming Conventions

Currently the naming conventions are ad-hoc. Proposed conventions:

  • typedef'd names terminate with suffix _t
  • functions are named <type>_<operation>; for example,
    buf_parse_number or sym_allocate
  • types should be typedef'd as specifically as possible; e.g. instead
    of "int t", use "tok_t t"

Data Types, Structures, and Values

Currently the data types and structures are a little hard to decipher,
mainly because the naming conventions are not clear. Here's a list of
the most important types, a suggested name, and a description:

Processing Model

Backends

Currently supported architectures:

  • i386
  • ARM
  • C67 (TMS320C67x Floating Point Digital Signal Processor from Texas Instruments)

Currently supported output formats:

  • ELF
  • COFF (C67 architecture)
  • PE (with ELF, not COFF!) - Windows

Instruction Selection

Output Formats
