Table of Contents
Overview
Execution Environments
Build, Host, Target
Run-time environments: ABIs and Toolchains
The term "run-time environment" is not well-defined. Some vendors use
"run-time library" to mean the standard library they provide with
their compiler products. Others use it to mean a smaller subset of
code that complements the standard (or "system") libraries.
An ABI (Application Binary Interface) describes a conventional
low-level interface between an application program and the execution
environment. This includes (among other things) storage allocation
conventions, calling conventions, and binary file format.
A toolchain is a set of software elements … A toolchain includes a
compiler, a linker, and a library, along with other utilities for
working with binary files (ar, nm, etc.). It also includes the
startup/shutdown code that is attached to executables. A given
execution environment may support multiple toolchains. For example,
on Linux the standard toolchain is the GNU toolchain: gcc, glibc, and
GNU binutils. But you can use other compilers and libraries as well;
for example, gcc with newlib or dietlibc, or the Intel C compiler with
glibc. Even some of the hoary old compilers for Windows are
available: Watcom is now Open Watcom, with Linux support on the way.
An important element of the run-time environment that is not part of
the toolchain is the program /loader/. The loader is provided by the
operating system. Its job is to load the executable binary image into
memory and then transfer control to the entry point. For this to
work, there must be a convention regarding the structure of the binary
image and the name or offset of the entry point. The loader may also
perform more sophisticated operations conventionally associated with
the linker, such as relocation and symbol resolution.
More on this later.
A particular toolchain establishes a /de facto/ ABI. Elements of one
toolchain may not work with elements of another. For example, it
probably won't do to use malloc from one library and free from
another. It may not even work to use malloc and free from the same
library if they were compiled with different compilers, especially if
you're using a language other than C. In the case of C, much (but not
all) of the ABI is specified by the language and standard library
definitions.
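To make the malloc/free point concrete, here is a contrived sketch of two
hypothetical allocators with incompatible bookkeeping. (A real cross-library
mix-up is not detected this politely; it typically corrupts the heap
silently.)

```c
#include <stdlib.h>

/* Each allocator prefixes its blocks with its own magic word.
   free_a refuses blocks it did not allocate; a real free() has no
   such check and would trample the other library's bookkeeping. */
#define MAGIC_A 0xA110CA7Eu
#define MAGIC_B 0xB10CB10Cu

void *alloc_a(size_t n)
{
    unsigned *p = malloc(sizeof(unsigned) + n);
    p[0] = MAGIC_A;
    return p + 1;
}

void *alloc_b(size_t n)
{
    unsigned *p = malloc(sizeof(unsigned) + n);
    p[0] = MAGIC_B;
    return p + 1;
}

/* Free a block from alloc_a; returns 0 (and does nothing) for a foreign block. */
int free_a(void *ptr)
{
    unsigned *p = (unsigned *)ptr - 1;
    if (p[0] != MAGIC_A)
        return 0;
    free(p);
    return 1;
}
```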
Compilation: Concepts and Terminology
Lexis
Syntax
Tokens, Symbols, and Attributes
The literature on compilers often uses the term token when discussing lexical analysis and parsing. Usage of this term is not always consistent, however; distinctions between token, token type, token class, and symbol are often glossed over, which can lead to some confusion. So here's an attempt at clarity:
- lexeme
- a run of text in the source that is recognized by the lexer as a member of a lexical class; e.g. a declaration like "int foo;" is recognized as three lexemes, "int", "foo", and ";"
- token
- usually this means a part of the source text that is recognized by the lexer as a lexical element. E.g. each keyword is recognized as a token, as is each identifier. In fact the basic idea is that the source text - a sequence of characters - is analyzed by the lexical analyzer and turned into a stream of tokens that is passed to the parser. That way the parser doesn't have to deal with the internal structure of the lexical elements. Token in this sense is often used interchangeably with the term symbol.
- terminal symbol
- in grammar specifications a terminal symbol cannot be decomposed into component elements; for example, ID is a terminal symbol in most languages; it denotes the set of lexemes recognized as an ID
- token type
- tokens are only recognized because they match a syntactic production, and are thus members of the class defined by that production. In other words, each syntactic (lexical) production effectively defines a token class, and matches a set of lexemes in a source text. Many token classes have only a single member, e.g. '<', but that member may occur any number of times in program text; i.e. each occurrence of '<' is represented as a token in the input stream.
- token type code
- compilers usually encode the token type of each token in the input stream by defining a token type code using a #define or an enumeration constant.
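For instance, token type codes might be defined with an enumeration. The names below are illustrative, not tcc's actual set:

```c
/* Hypothetical token type codes -- illustrative names, not tcc's. */
enum tok_type {
    TOK_EOF,
    TOK_KW_INT,   /* keyword "int" */
    TOK_ID,       /* identifiers such as foo */
    TOK_SEMI,     /* ';' */
    TOK_LT        /* '<' */
};

/* Classify a single-character lexeme, returning TOK_EOF if unrecognized. */
enum tok_type classify_char(char c)
{
    switch (c) {
    case ';': return TOK_SEMI;
    case '<': return TOK_LT;
    default:  return TOK_EOF;
    }
}
```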
Attributes of tokens
- text
- the text of the recognized lexeme; e.g. in "int foo;", the lexeme "foo" is recognized as a token of type TOK_ID, with a text attribute of "foo".
- token value
- the (C) value encoded by the parsed token; e.g. 17 might be recognized as TOK_LITERALINT (tcc uses TOK_CINT), whose text attribute is "17" and whose value attribute is 0x11.
- other attributes
- compilers commonly use token attributes to record debugging information, such as the line number on which the token occurs.
Hypothetical examples:
char *s = "abc"; "abc" =>
tok.typecode=TOK_STR
tok.text="abc"
tok.value="abc"
A compiler isn't necessarily going to record all this info.
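The attributes above could be gathered into a single record; a minimal sketch, with hypothetical field names:

```c
/* Hypothetical token record -- field names are illustrative, not tcc's. */
struct token {
    int         typecode;  /* e.g. TOK_CINT */
    const char *text;      /* the recognized lexeme, e.g. "17" */
    long        value;     /* decoded value for literals, e.g. 0x11 */
    int         line;      /* source line, for diagnostics */
};

#define TOK_CINT 1  /* illustrative type code for integer literals */

/* Build the token for a decimal integer literal. */
struct token make_int_token(const char *text, long value, int line)
{
    struct token t = { TOK_CINT, text, value, line };
    return t;
}
```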
Parsing Stacks
Parsers use data structures to keep track of stuff during a parse.
E.g. environments, local vars, globals, etc. etc. Discussion of use
of stacks, hash tables, etc. - the stuff tcc uses.
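As a rough illustration, a parser-side value stack might look like this fixed-capacity sketch (not tcc's actual data structures):

```c
/* Fixed-capacity stack sketch -- capacity checks omitted for brevity. */
#define STACK_MAX 64

struct vstack {
    int vals[STACK_MAX];
    int top;              /* number of values currently on the stack */
};

void vstack_push(struct vstack *s, int v) { s->vals[s->top++] = v; }
int  vstack_pop(struct vstack *s)         { return s->vals[--s->top]; }
```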
Symbol Tables
Compiler internal symbol table
The compiler maintains table(s) of symbols seen and referenced…
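A compiler-internal table can be as simple as a small hash table. A minimal open-addressing sketch (the structures and names are invented here, not tcc's):

```c
#include <string.h>

/* Tiny open-addressing symbol table; capacity checks omitted. */
#define SYMTAB_SIZE 32

struct sym { const char *name; int value; };
static struct sym symtab[SYMTAB_SIZE];

static unsigned sym_hash(const char *name)
{
    unsigned h = 0;
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % SYMTAB_SIZE;
}

/* Insert or update a binding; linear probing on collision. */
void sym_put(const char *name, int value)
{
    unsigned i = sym_hash(name);
    while (symtab[i].name && strcmp(symtab[i].name, name) != 0)
        i = (i + 1) % SYMTAB_SIZE;
    symtab[i].name = name;
    symtab[i].value = value;
}

/* Return the value bound to name, or -1 if absent. */
int sym_get(const char *name)
{
    unsigned i = sym_hash(name);
    while (symtab[i].name) {
        if (strcmp(symtab[i].name, name) == 0)
            return symtab[i].value;
        i = (i + 1) % SYMTAB_SIZE;
    }
    return -1;
}
```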
Output Format (Linker) symbol table
Output formats (ELF, COFF, PE/COFF, Mach-O) provide for a symbol table
which will be used by a linker to resolve symbol refs and for
relocation…. Not to be confused with the compiler's internal symbol
table, although the latter will be used to write the former.
…brief overview of tinycc's strategy, details to go below…
Instruction Selection, Register Allocation, etc.
Output Formats
Linking
Relocation
Symbol Resolution
Tiny cc Implementation Design
The standard model for a compiler involves a phased approach:
- Lexer (Scanner)
- Parser
- Concrete Syntax Tree
- Abstract Syntax Tree
- Intermediate Representation
- Backend
  - Instruction Selection
  - Optimization
Toolchains may implement these tasks using separate software
components. For example, one component may parse the input file and
produce an abstract syntax tree; a second component may take the AST
as input and produce an Intermediate Representation - a kind of
abstract assembly language. A backend component would take the IR as
input and produce machine code in an output format.
Tiny cc takes a different approach. The design priorities of tcc are
speed and size, so it eschews the phased approach in favor of
incremental compilation. Instead of multiple phases (and therefore
processing passes), it tries to parse the source code and emit machine
code as soon as possible. This means it can process source code much
faster than a compiler that takes the traditional approach. However,
one consequence of this approach is that the code it produces is not
optimized. With tcc you get very fast compilation, but the code it
produces will be slower and fatter than code produced by optimizing
compilers. This is because optimization strategies often involve
examining the entire program, or at least large chunks of it. Tiny cc
does try to do some optimizations, but only those that are compatible
with its approach. This doesn't mean the code produced by tcc is
horribly slow, only that it is not as fast as fully optimized code.
Build System
NOTE: the build system is undergoing a major rewrite. Below is one possibility:
Tiny cc supports several host and target environments. The makefiles
and the config.h header use macros to indicate which combination
should be built. Some examples:
BLD_CC the compiler to use to compile tcc itself
HOST_ARCH_I386 tcc will run on the I386
HOST_ABI_RT_GLIBC tcc will use the glibc library
TARGET_ARCH_C67 tcc should emit machine code for the TMS320C67x Digital Signal Processor
TARGET_ABI_OF_ELF tcc should write ELF files
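Under this scheme, a config.h for a build hosted on i386/glibc and targeting the C67 DSP with ELF output might contain something like the following (a hypothetical fragment, since the build system is still in flux):

```c
/* Hypothetical config.h fragment: i386/glibc host,
   TMS320C67x target, ELF output. */
#define HOST_ARCH_I386      1
#define HOST_ABI_RT_GLIBC   1
#define TARGET_ARCH_C67     1
#define TARGET_ABI_OF_ELF   1
```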
See INSTALL for details.
The Code
Organization of the Source Tree
Naming Conventions
Currently the naming conventions are ad hoc. Proposed conventions:
- typedef'd names terminate with suffix _t
- functions are named <type>_<operation>; for example,
buf_parse_number, or sym_allocate
- types should be typedef'd as specifically as possible; e.g. instead
of "int t", use "tok_t t"
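A sketch of what the proposed conventions look like in practice (the names are invented for illustration):

```c
/* Illustrative only: a typedef'd type with the _t suffix and a
   function named <type>_<operation>. */
typedef struct buf {
    const char *data;  /* source text */
    int         pos;   /* current read position */
} buf_t;

/* buf_parse_number: the "parse_number" operation on type "buf".
   Reads a decimal literal starting at the current position. */
long buf_parse_number(buf_t *b)
{
    long n = 0;
    while (b->data[b->pos] >= '0' && b->data[b->pos] <= '9')
        n = n * 10 + (b->data[b->pos++] - '0');
    return n;
}
```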
Data Types, Structures, and Values
Currently the data types and structures are a little hard to decipher,
mainly because the naming conventions are not clear. Here's a list of
the most important types, a suggested name, and a description:
Processing Model
Backends
Currently supported architectures:
- i386
- ARM
- C67 (TMS320C67x Floating Point Digital Signal Processor from Texas Instruments)
Currently supported output formats:
- ELF
- COFF (C67 architecture)
- PE (with ELF, not COFF!) - Windows