Tokens

Current Name      Proposed Name   Meaning
============      =============   =======

struct TokenSym   struct tok_t    models a parsed token; implemented as a
                                  linked-list node; contains a token-class
                                  indicator and the token label (string);
                                  for the token classes define, label,
                                  struct, and identifier it contains a
                                  pointer to the associated ABI symbol.

A TokenSym is a kind of symbol. gSymbol (grammar symbol) might be a better name; the important point is that the compiler internals use it to manage tokens/symbols at compile time, in contrast to the linker symbol, which the compiler constructs for use by the linker. For linker symbols tcc uses output-format-dependent types; for ELF32, linker symbols are of type Elf32_Sym. Then there is the type Sym, a linked-list node recording a token code, a register, an index into the linker symtable, and a C type. Each TokenSym contains pointers to Syms. (Does each Sym belong to a TokenSym?)
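For concreteness, here is roughly what the struct looks like in tcc.h (a paraphrase from the tcc source; field names and layout may differ slightly between versions):

typedef struct TokenSym {
    struct TokenSym *hash_next;     /* next TokenSym in the same hash_ident bucket */
    struct Sym *sym_define;         /* direct pointer to the #define symbol, if any */
    struct Sym *sym_label;          /* direct pointer to the label symbol, if any */
    struct Sym *sym_struct;         /* direct pointer to the struct/union/enum tag, if any */
    struct Sym *sym_identifier;     /* direct pointer to the identifier symbol, if any */
    int tok;                        /* token code */
    int len;                        /* length of str */
    char str[1];                    /* token label, allocated inline past the struct */
} TokenSym;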

TokenSym is a little misnamed; is it a token or a symbol? Well,
it does point to symbols, but it is basically a representation of a
parsed token, so we rename it "tok_t". That is, a C type named
"tok_t" that represents a parsed token; not to be confused with
"tokt_t", which is the C type of token-class (token-type) codes. A
tok_t value contains a tokt_t field indicating the token type of
the token. For example, "int", when parsed, becomes a tok_t whose
tokt_t field contains the TOK_INT token-type code.

A TokenSym also records token attributes (text, value?).

TODO:
typedef int tokt_t; // number representing a token type
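Putting the proposed names side by side (a sketch of the proposal only; neither typedef exists in the tcc sources):

typedef int tokt_t;               /* token-type (class) code, e.g. TOK_INT */
typedef struct TokenSym tok_t;    /* a parsed token: label, token code, symbol ptrs */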

* It's crucial to distinguish between tokens and token types,
because it's easy to confuse them. A source code text is composed
of lexical elements (terminals) arranged in a syntactic structure.
Both lexical elements and higher-level phrase structures
(e.g. "declaration") are recognizable as members of grammatical
classes. For example, "identifier" and "less-than" are lexical
classes; "declaration", "and-expression" are syntactic classes.

The parts of a source text, once parsed, are represented as
/tokens/, and each token is a member of a /token class/; the
latter corresponds to the grammatical class in which it fits. For
example, the string "int" may occur many times in a source text.
Each such occurrence is parsed and represented internally as a
token of class TOK_INT (see tcctok.h). TOK_INT, in turn, is a C
value of type tokt_t.
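As a concrete (hedged) illustration: in tcc's parser the code of the most recently read token lives in the global int tok, and next() fetches the following one, so a token-class test looks roughly like this (simplified, not the verbatim source):

if (tok == TOK_INT) {
    /* the keyword "int" was just read; every occurrence of "int" in the
       source produces a token whose class code is TOK_INT */
    next();
}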

More on tokens and token types:

* Token classes are defined in tcctok.h and in tcc.h. The former
contains token-type codes for keywords, such as TOK_INT,
TOK_SWITCH, and TOK_STRUCT; as explained below, these end up as
enum constants taking values above 255. The latter contains
codes like TOK_INC, TOK_CINT, and TOK_STR; these are #defined to
hex values below 256.

The token definitions in tcctok.h are pulled in by #include at
several places for different purposes; each site defines the macro
DEF to produce the form it needs. tcctok.h looks like:

DEF(TOK_INT, "int")
DEF(TOK_VOID, "void")

and tcc.h says:

/* all identificators and strings have token above that */
#define TOK_IDENT 256

enum tcc_token {
    TOK_LAST = TOK_IDENT - 1,
#define DEF(id, str) id,
#include "tcctok.h"
#undef DEF
};

so we get e.g.

enum tcc_token {
    TOK_LAST = 255,
    TOK_INT,
    TOK_VOID,
    …
};

and thus the keyword tokens are members of an enum and take values
above 255. Since enum members behave just like ints, we can use
them more or less the way we would use them if they were #defined.

[[Optimization: would it make sense to use #define instead of an
enum? Probably not for performance reasons: the tcc_token enum
declares no storage, and enumeration constants are compile-time
integer constants, so using one involves no load from a memory
location in the generated machine code either way. Note that the
name tcc_token is not used anywhere. Also, doesn't using a member
of the enum (an "enumeration constant") as an int in e.g. a
comparison require an implicit cast? A: no. They're both ints.]]
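A small illustrative example of that last point: enumeration constants have type int and are integer constant expressions, so they can be compared and used as case labels directly, which is exactly how a parser treats its token codes (DEMO_* below are stand-ins, not real tcc names):

enum tcc_token_demo { DEMO_LAST = 255, DEMO_INT, DEMO_VOID };

static int is_type_keyword(int tok)
{
    switch (tok) {              /* enumeration constants are valid case labels */
    case DEMO_INT:
    case DEMO_VOID:
        return 1;               /* no cast needed anywhere */
    default:
        return 0;
    }
}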

The same file tcctok.h is used to generate an array of keyword
strings:

static const char tcc_keywords[] =
#define DEF(id, str) str "\0"
#include "tcctok.h"
#undef DEF
;

which results in:

static const char tcc_keywords[] =
"int" "\0" "void" "\0" …
;

At startup, this string array is used to initialize table_ident, a
growable array of pointers to TokenSym structs (the variable
itself has type TokenSym**). Related to this table is hash_ident,
an array of TokenSym* preallocated with TOK_HASH_SIZE (8192)
entries. Also related: i = tok_ident - TOK_IDENT;
table_ident[i] = ts; where TOK_IDENT = 256, "all identificators
and strings have token above that", and tok_ident is (only)
incremented whenever tok_alloc_new is called and allocates a new
TokenSym. I.e. the top entry of table_ident corresponds to the
highest token code allocated so far. Elsewhere we see e.g.
table_ident[v - TOK_IDENT], where v is an int holding a TOK_<FOO>
value. This makes sense for identifiers and strings, since their
codes are always > TOK_IDENT, and so table_ident only tracks
identifiers and strings. table_ident is often indexed by an int v
in this way. Whew.

But table_ident is initialized from the table of keyword strings;
so why call it a table of /identifiers/? I guess the idea is that
any lexical element of more than two chars is treated as an
identifier?

Recall that tokens are added at the top of the table (should we
call it a stack?), and tok_ident is both the token code of the
first available slot past the top element and, minus TOK_IDENT,
its index into table_ident. At startup, the keywords are added
starting at token code TOK_IDENT (256); once they've been added,
tok_ident is the code of the next available slot.
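The startup loop looks roughly like this (paraphrased from tcc's preprocessor-initialization code; variable names and details vary between tcc versions):

const char *p = tcc_keywords, *r;

tok_ident = TOK_IDENT;           /* 256: first token code to hand out */
while (*p) {
    r = p;
    while (*r++ != '\0')         /* scan to the end of this "\0"-terminated keyword */
        ;
    tok_alloc(p, r - p - 1);     /* allocate a TokenSym; its code is tok_ident++ */
    p = r;                       /* step past the '\0' to the next keyword */
}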

Furthermore, the value of tok_ident at insertion time becomes the
tokencode of the newly inserted token. IOW, the token stored at
table_ident[i] has tokencode i + TOK_IDENT. This follows from the
init logic: tok_ident is set to TOK_IDENT, which is 256, and then
the keywords from tcctok.h are inserted one by one. The first of
those is "int", so it lands in table_ident[0] with its tokencode
set to 256; furthermore, in the tcc_token enum, TOK_INT is
installed with value 256, so the two agree.

So the keywords occupy the first n slots of table_ident (call
it "lexis"), followed by the other stuff from tcctok.h
(preprocessor directives, etc.).

True identifiers are parsed in next_nomacro1, where they are
treated just like the keywords were treated at init time (search
for parse_ident_fast:). I.e.

- you have a target string
- you hash it to a bucket number
- you obtain the hash_ident bucket
- the bucket points to the head of a TokenSym list
- you search the linked list in the bucket:
  - follow the ts->hash_next links
  - compare the target string with ts->str
- if no match anywhere in the chain, call tok_alloc_new

"tok_alloc_new" is redundant; should be called ??

NB: a TokenSym is essentially a linked-list node, with some token attributes.

NB: each identifier gets its own token code; contrast the classic approach, which might use e.g. TOK_ID as the token code for all identifiers.

TokenSym records can be found by

  • using a string to search the hash table (hash_ident)
  • using a token code to index table_ident

I.e. hash_ident's buckets point to the heads of TokenSym hash chains, while table_ident's entries point directly to individual TokenSyms. The routines keep the two structures in sync.
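Both paths in miniature (a hedged sketch; the shapes are taken from the tcc source, not quoted verbatim; v, str, and len are assumed inputs):

/* 1. By token code: direct index into table_ident (v is a code >= TOK_IDENT). */
TokenSym *ts_by_code = table_ident[v - TOK_IDENT];

/* 2. By string: hash into hash_ident and walk the chain, allocating on a miss. */
TokenSym *ts_by_name = tok_alloc(str, len);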

What is the relation of table_ident and hash_ident? tok_alloc_new
allocates TokenSyms; it takes a TokenSym** argument (pts), which
has been set to point either to an element of hash_ident (empty
bucket) or to the hash_next field of the last TokenSym in that
bucket's chain. tok_alloc_new allocates the new TokenSym, pushes a
pointer to it on top of table_ident, and stores the same pointer
through pts, which either fills the bucket or appends to its
chain. See tok_alloc_new and tok_alloc. So the hash_next pointer
in each TokenSym is what strings the members of a bucket together:
the hash entries are buckets pointing to chains of TokenSyms.
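For reference, tok_alloc_new looks roughly like this (paraphrased and abridged from tccpp.c; the allocation calls differ between tcc versions):

static TokenSym *tok_alloc_new(TokenSym **pts, const char *str, int len)
{
    TokenSym *ts, **ptable;
    int i;

    i = tok_ident - TOK_IDENT;
    if ((i % TOK_ALLOC_INCR) == 0) {       /* grow table_ident in chunks */
        ptable = tcc_realloc(table_ident,
                             (i + TOK_ALLOC_INCR) * sizeof(TokenSym *));
        table_ident = ptable;
    }

    ts = tcc_malloc(sizeof(TokenSym) + len);   /* str[] is allocated inline */
    table_ident[i] = ts;                   /* push onto the "top" of table_ident */
    ts->tok = tok_ident++;                 /* token code == position in token space */
    ts->sym_define = NULL;
    ts->sym_label = NULL;
    ts->sym_struct = NULL;
    ts->sym_identifier = NULL;
    ts->len = len;
    ts->hash_next = NULL;
    memcpy(ts->str, str, len);
    ts->str[len] = '\0';
    *pts = ts;                             /* link into the hash structure: pts is
                                              either &hash_ident[h] or the previous
                                              TokenSym's &hash_next */
    return ts;
}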

Now, these danged TokenSyms use pointers to track C stuff that has
a /tag/: defines, labels, structs, and identifiers. So searching
for e.g. a struct means table_ident[v - TOK_IDENT]->sym_struct,
where v is an int, presumably an identifier's token code. I.e.
each identifier gets a unique code so we can index directly.

Use of hash_ident. It is only used to search TokenSyms by string
value (the /token label/?); the TokenSyms in a bucket are linked
via hash_next. Recall that the entries of hash_ident are set to
point to a newly allocated TokenSym, and the entry chosen is
determined by a hash of the string. So to find the TokenSym for an
identifier or string, we first calculate the hash bucket id h from
the target string, then:

pts = &hash_ident[h]; /* pts should be ppts */
for(;;) {
    ts = *pts;
    if (!ts)
        break;
    if (ts->len == len && !memcmp(ts->str, str, len))
        return ts;
    pts = &(ts->hash_next);
}

How do TokenSyms get added to the chain, other than as the first
element in the bucket? hash_next is never assigned by name; it is
set indirectly: the search loop above leaves pts pointing at the
hash_next field of the last TokenSym in the chain (or at the
bucket slot itself when the chain is empty), and tok_alloc_new
then stores the new TokenSym through it (*pts = ts), appending it
to the chain.
