queried my ten-year-old daughter. I could see she was having difficulty with lexical analysis of my project’s name. After a brief but lengthy explanation, it was obvious that we should have a mascot for this project. She and I collaborated to come up with a suitable animal. Here is the result:
From now on, the RPL prompt will say evalrus:001>.
For lexical analysis I decided not to take the easy regex way out. I want to learn how to do this myself, so I’m going to tokenize the input String with my own state machine. Here are the building blocks I’ve arrived at after a couple of attempts.
The tokenize() function takes a String of source code and returns Result<Vec<Token>, ParseError>, where Token is:
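A minimal sketch of what Token and SourcePos might look like; the names and variants here are illustrative guesses, not necessarily what’s in the tagged code:

```rust
// Illustrative sketch only: each Token carries the position it was
// found at, so later stages can report errors against the source.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct SourcePos {
    line: usize, // 1-based line number
    col: usize,  // character position within the line
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Token {
    OpenBracket(SourcePos),
    CloseBracket(SourcePos),
    Symbol(String, SourcePos),
}

fn main() {
    let tok = Token::Symbol("car".to_string(), SourcePos { line: 1, col: 2 });
    // The position travels with the token.
    if let Token::Symbol(name, pos) = &tok {
        assert_eq!(name.as_str(), "car");
        assert_eq!(pos.line, 1);
    }
    println!("{:?}", tok);
}
```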
SourcePos represents the token’s line number and character position within that line, and I expect we’ll be propagating these positions throughout so we can report errors helpfully.
The inner state machine loop looks like the code section below. I’ve very sensibly decided that tab characters are not going to be valid indentation, so this is an opportunity to use the ParseError type.
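As a rough sketch, that single-character handling could take a shape like this; the names and exact character set are my assumptions, not the tagged code:

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Open,
    Close,
}

#[derive(Debug, PartialEq)]
struct ParseError {
    line: usize,
    col: usize,
    msg: String,
}

// Handle one character: emit a single-character token, skip whitespace,
// or reject tabs with a ParseError carrying the position.
fn scan_char(c: char, line: usize, col: usize) -> Result<Option<Token>, ParseError> {
    match c {
        '(' => Ok(Some(Token::Open)),
        ')' => Ok(Some(Token::Close)),
        ' ' | '\n' => Ok(None), // whitespace only separates tokens
        '\t' => Err(ParseError {
            line,
            col,
            msg: "tab characters are not valid indentation".to_string(),
        }),
        _ => Ok(None), // symbol characters are accumulated elsewhere
    }
}

fn main() {
    assert_eq!(scan_char('(', 1, 1), Ok(Some(Token::Open)));
    assert_eq!(scan_char(' ', 1, 2), Ok(None));
    assert!(scan_char('\t', 2, 1).is_err());
    println!("ok");
}
```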
The above code handles single character tokens (brackets, whitespace) and the single invalid tab-character case.
Multi-character tokens must be consumed one character at a time until a terminating character is reached. Terminating characters are any that cannot be part of a symbol (in this case, brackets and whitespace).
For this I have a quick and simple closure that returns true if the given character is in a list of characters that indicate the end of the symbol:
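One way that closure might look, with the exact terminating set being my guess:

```rust
fn main() {
    // True when c ends the current symbol: brackets and whitespace.
    let terminating = ['(', ')', ' ', '\t', '\n'];
    let is_terminating = |c: char| terminating.contains(&c);

    assert!(is_terminating('('));
    assert!(is_terminating(' '));
    assert!(!is_terminating('x'));
    println!("ok");
}
```

(The slice method contains is roughly the Rust spelling of Python’s in here.)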
I’m not so great at recognizing the optimal Rustacious Iterator use patterns, so perhaps there’s a more concise way of expressing that? In Python I’d just say if c in terminating: ...
I’ve pulled out the match branch for symbols to highlight it separately below:
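For illustration, here is a sketch of consuming a symbol until a terminating character using a peekable iterator; the names and structure are mine, not the tagged code:

```rust
// Accumulate a symbol starting at `first`, peeking ahead so the
// terminating character is left for the main loop to handle.
fn read_symbol<I>(first: char, chars: &mut std::iter::Peekable<I>) -> String
where
    I: Iterator<Item = char>,
{
    let is_terminating = |c: char| "() \t\n".contains(c);
    let mut sym = String::new();
    sym.push(first);
    while let Some(&c) = chars.peek() {
        if is_terminating(c) {
            break; // leave the terminator unconsumed
        }
        sym.push(c);
        chars.next();
    }
    sym
}

fn main() {
    let mut it = "ar x".chars().peekable();
    let sym = read_symbol('c', &mut it);
    assert_eq!(sym, "car");
    // The space that ended the symbol is still in the iterator.
    assert_eq!(it.next(), Some(' '));
    println!("{}", sym);
}
```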
I’m not totally happy with that code; it isn’t as pretty as it could be, and the loop and break conditions could surely be refactored. Suggestions welcome!
The final state of the code at this point is tagged here.
This is the first code update for which there are #[test]s, and I expect to continue writing tests from now on.
The next step is parsing the tokens into a syntax tree. Since this is a Lisp, it’s going to simply consist of nested cons cells and symbols for now. In future stages we’ll add more types such as strings and numbers, adding syntax for them too.
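As a hedged preview, those nested cons cells and symbols could be modeled along these lines (illustrative only; the real representation may differ):

```rust
// A classic Lisp-style tree: a value is a symbol, the empty list,
// or a cons cell pairing a head with the rest of the list.
#[derive(Debug, PartialEq, Clone)]
enum Expr {
    Symbol(String),
    Nil,
    Cons(Box<Expr>, Box<Expr>),
}

fn main() {
    // (car x) represented as Cons(Symbol("car"), Cons(Symbol("x"), Nil))
    let list = Expr::Cons(
        Box::new(Expr::Symbol("car".to_string())),
        Box::new(Expr::Cons(
            Box::new(Expr::Symbol("x".to_string())),
            Box::new(Expr::Nil),
        )),
    );
    if let Expr::Cons(head, _) = &list {
        assert_eq!(**head, Expr::Symbol("car".to_string()));
    }
    println!("{:?}", list);
}
```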
Thanks to Bob Nystrom’s excellent new series on interpreters, which I scanned (haha) to help find a better code structure for the lexing state machine than my first attempt.