I’ve written a bunch of tokenizers and lexers at this point. Some of them were good, and others were quite bad. While exploring the realm of compilers, I decided to write a programming language just for the fun of it — and to learn more about this magic.
The first step in creating a language is to lex the input. This process is fairly simple. The goal is to take a text input and subdivide it into pieces that the parser can work with.
For some basic syntax parsing, the lexing stage can be completely omitted. You can jump straight to the parsing stage, or even combine the parsing and interpreting phases. But it’s still great to have a lexer — more thoughts on that later.
Let’s take, for example, the following input:
let a: s32 = 34 + 35;
Without any code yet, we can imagine the output that we expect from this input:
[ KEYWORD(let)
, IDENT(a)
, SYMBOL(:)
, IDENT(s32)
, SYMBOL(=)
, NUMBER(34)
, OPERATOR(+)
, NUMBER(35)
, SYMBOL(;) ]
If that is what you have in mind, you’re right — because that is exactly what lexing is about: transforming the input into pieces for the parser to use. Or you may be confused at this point. Why do this if it is the same thing as the input? Shouldn’t it be fine to just use the input as is?
Those are totally valid considerations, but this is where I stop you. Have you noticed that there is extra information attached to each chunk of input? KEYWORD, IDENT, SYMBOL, NUMBER, and OPERATOR are what I call qualifiers. They give a flavor to the data and provide useful information to the parser.
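To make the idea concrete, here is a minimal lexer sketch in Rust. It is my own illustration, not code from any particular implementation: the token variants mirror the qualifiers above, and only `let` is treated as a keyword to keep things short.

```rust
// A minimal lexer sketch. The variants mirror the qualifiers above:
// KEYWORD, IDENT, SYMBOL, NUMBER, OPERATOR.
#[derive(Debug, PartialEq)]
enum Token {
    Keyword(String),
    Ident(String),
    Symbol(char),
    Number(i64),
    Operator(char),
}

fn lex(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            // Whitespace carries no information; skip it.
            c if c.is_whitespace() => {
                chars.next();
            }
            // Digits accumulate into a NUMBER token.
            c if c.is_ascii_digit() => {
                let mut num = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_ascii_digit() {
                        num.push(d);
                        chars.next();
                    } else {
                        break;
                    }
                }
                tokens.push(Token::Number(num.parse().unwrap()));
            }
            // Letters accumulate into a word, then get qualified:
            // known keywords become KEYWORD, everything else IDENT.
            c if c.is_alphabetic() || c == '_' => {
                let mut word = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_alphanumeric() || d == '_' {
                        word.push(d);
                        chars.next();
                    } else {
                        break;
                    }
                }
                if word == "let" {
                    tokens.push(Token::Keyword(word));
                } else {
                    tokens.push(Token::Ident(word));
                }
            }
            // Arithmetic characters become OPERATOR tokens.
            '+' | '-' | '*' | '/' => {
                tokens.push(Token::Operator(c));
                chars.next();
            }
            // Anything else is a bare SYMBOL.
            _ => {
                tokens.push(Token::Symbol(c));
                chars.next();
            }
        }
    }
    tokens
}
```

Running `lex("let a: s32 = 34 + 35;")` yields exactly the nine tokens listed above, in the same order.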
Those are the basic tokens a lexer returns. But I wouldn’t be writing a blog about the matter if it were only this.
Lexers are simple to write, and their output is equally basic. That simplicity is exactly what started to bother me while I was writing the parser.
Consider the following:
printf("%d", add(pow(2, 6), b, c, d, 34))
[
IDENT(printf)
, SYMBOL(()
, STRING("%d")
, SYMBOL(,)
, IDENT(add)
, SYMBOL(()
, IDENT(pow)
, SYMBOL(()
, NUMBER(2)
, SYMBOL(,)
, NUMBER(6)
, SYMBOL())
, SYMBOL(,)
, IDENT(b)
, SYMBOL(,)
, IDENT(c)
, SYMBOL(,)
, IDENT(d)
, SYMBOL(,)
, NUMBER(34)
, SYMBOL())
, SYMBOL())
]
With tokens like these, the parsing stage has to recognize a function call, and every nested call inside it, all on its own. Wouldn't it be easier to hand this task to the lexer and make our parser simpler?
Teaching the lexer to recognize a function call gives it more power and makes error reporting simpler. For example, unbalanced parentheses can be rejected during lexing, instead of producing garbage tokens that the parser only chokes on later.
Funcall(String, Vec<Token>)
This new Funcall token allows us to simplify the representation as follows:
[
FUNCALL("printf", [
STRING("%d"),
FUNCALL("add", [
FUNCALL("pow", [
NUMBER(2),
NUMBER(6)
]),
IDENT(b),
IDENT(c),
IDENT(d),
NUMBER(34)
])
])
]
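One way to get there, sketched in Rust with helper names of my own choosing: lex the flat tokens first, then fold every `IDENT (` ... `)` sequence into a nested Funcall token. A lexer could also build these in a single pass; the two-step version is just easier to show. Note how unbalanced parentheses make the fold return None, so the error surfaces before the parser ever runs.

```rust
// Sketch: a token type extended with Funcall, plus a pass that folds
// a flat token stream into nested calls. Names are illustrative only.
#[derive(Debug, PartialEq, Clone)]
enum Token {
    Ident(String),
    Symbol(char),
    Number(i64),
    Str(String), // string literals such as "%d"
    Funcall(String, Vec<Token>),
}

fn fold_calls(flat: &[Token]) -> Option<Vec<Token>> {
    let mut i = 0;
    fold_until(flat, &mut i, false)
}

// Walks the flat stream; `in_call` is true while inside an argument list.
fn fold_until(flat: &[Token], i: &mut usize, in_call: bool) -> Option<Vec<Token>> {
    let mut out = Vec::new();
    while *i < flat.len() {
        match &flat[*i] {
            // A closing paren ends the current argument list.
            Token::Symbol(')') if in_call => {
                *i += 1;
                return Some(out);
            }
            // A stray ')' at top level means unbalanced parentheses.
            Token::Symbol(')') => return None,
            // IDENT followed by '(' starts a call; recurse for its arguments.
            Token::Ident(name) if matches!(flat.get(*i + 1), Some(Token::Symbol('('))) => {
                let name = name.clone();
                *i += 2; // skip the identifier and the '('
                let args = fold_until(flat, i, true)?;
                out.push(Token::Funcall(name, args));
            }
            // Commas merely separate arguments inside a call.
            Token::Symbol(',') if in_call => *i += 1,
            // Everything else passes through unchanged.
            t => {
                out.push(t.clone());
                *i += 1;
            }
        }
    }
    // Running out of input inside a call is also unbalanced parentheses.
    if in_call { None } else { Some(out) }
}
```

Feeding this pass the flat tokens of `add(pow(2, 6), b)` yields a single `Funcall("add", ...)` whose first argument is itself a `Funcall("pow", ...)`, mirroring the nested listing above.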
We know the context in which we are using this lexer. Why not just take advantage of it?
The magic can be improved further by adding special constructs into the mix, like macro expansions and many other niceties.
That’s the beauty of lexer magic: small tweaks can make a big difference in how we design and understand languages.