Parsing into an AST

Well this is turning out deeper than I anticipated. In glorious personal tradition, I went into the parsing problem keyboard blazing and quickly ran into gotchas. I’ve rewritten and refactored my parser twice, reinventing the old problems that standard parsing techniques have been proven to solve before I was born.

Many many years ago I took a compilers course based on The Dragon Book. Perhaps we covered lexing and parsing… I have no recollection of any of it because, while Roger Bailey was a delight, he taught us the course using Haskell without teaching us Haskell and I don’t think I really understood anything.

I’ve spent the past two weeks learning by failing, reading articles, failing to completely understand the articles, failing some more and finally beginning to understand something of parsing. While I may not be a fast learner, in my defence I am a thorough learner.

Stumbling in the dark

My very first hastily scribbled attempt didn’t have any kind of lookahead and I found that I couldn’t accurately determine if an s-expression had correctly terminated or not.

I also hated the code. A very clumsy state machine, full of holes no doubt.

Behold and scorn!

    // loop state variables
    let mut first_token = true;
    let mut after_dot = false;
    let mut expect_closeparen = false;
    let mut expect_list = true;

    loop {
        match token {
            // Open parenthesis
            Some(Token { token: OpenParen, pos }) => {
                if expect_closeparen {
                    return Err(ParseError::new(
                        pos, String::from("expected close-paren")));
                }

                if first_token {
                    tail.set(expression(mem, tokens)?);
                    first_token = false;
                } else if after_dot {
                    tail.dot(expression(mem, tokens)?);
                    expect_closeparen = true;
                } else {
                    let expr = expression(mem, tokens)?;
                    tail = tail.append(mem, expr);
                }
            },
            // and so on for each TokenType...

Blech.

The greying dawn

My second, more thoughtful, attempt was after I kind of began to understand recursive descent parsing. The parser sort of worked but I totally failed to implement lookahead correctly so the result was no better than the first attempt. In the commit above I tried to correct my peek code and came face to face with the wagging finger of the borrow checker.

error[E0502]: cannot borrow `*tokens` as mutable because it is also borrowed as immutable
   --> src/parser.rs:146:43
    |
144 |         match tokens.peek() {
    |               ------ immutable borrow occurs here
145 |             &Some(Token { token: OpenParen, pos: _ }) => {
146 |                 list.push(parse_list(mem, tokens)?, mem);
    |                                           ^^^^^^ mutable borrow occurs here
...
174 |         }
    |         - immutable borrow ends here

Bother. Here I am at time of writing, staring at this error and rethinking my iteration data structures. I like the overall code pattern a whole lot better and I’m pretty sure I’m on the right track to a partially recursive slightly descending parser. That’s encouraging.

The problem is the TokenStream struct which has the wrong structure and lifetimes to solve this problem.

struct TokenStream<I: Iterator<Item = Token>> {
    tokens: I,
    peek: Option<Token>
}

The TokenStream::peek() function can’t return a reference to a Token that outlives more than zero recursive descents into the parser because the peek value gets overwritten by the value from next() from the tokens iterator.

Sunrise

The fanfare to Also sprach Zarathustra is playing in my head as I scroll through the documentation for the Peekable struct. This is the tool that will transform me from compiler-error-generating-keyboard-monkey into Rust-wielding spacefaring modern man.

Replacing TokenStream with Peekable does it. The tests pass, the RPL rpls.

That twice quoted section of code now reads like this:

    let mut list = PairList::open();

    loop {
        match tokens.peek() {
            Some(&&Token { token: OpenParen, pos: _ }) => {
                tokens.next();
                list.push(parse_list(mem, tokens)?, mem);
            },

I’ve moved the Pair and Value types into a separate file called… types.rs. Also in this update, I implemented fmt::Display for Value to print more better.

Here is the source tree in it’s new state.

Up next…

If you looked at the code you’d have noticed that the Symbol enum variant does not carry the name of the symbol in it yet. That’s because I haven’t figured out how I want to represent symbols internally yet. There are questions that need answering.

Symbols have names but names are inefficient representations. Addresses are better! How do I want to map names to addresses and back again? Is this the same as an environment or does it at least overlap in functionality somewhat? Where should the symbol name strings live? In the Arena or in a separate structure?

Tune in next time to find out zero or more answers!

Ack

Here’s a list of articles I have found to be the most helpful in this stage:

Note

In addition to mandatory attendance of a full time job and the usual dinner, children to bed, collapse routine, my spare time is thinning out even more as I’m coaching football (“soccer”) for a team of eight-year-olds. Nevertheless, I will maintain my current level of mediocre productivity!