Compilers, Parsers and Lexers

in #programming6 years ago

Over the last few years, I've made some forays into compiler writing. The project I'm working on now includes a simple compiler, so it's on my mind a lot right now. Things are much easier now that I have some experience with what doesn't work.

What Doesn't Work (Very Well)

First off, compiler generators. They sort-of, kind-of work, but as soon as you try to do anything interesting, they break. This is because they force you to use the paradigm that the compiler generator uses. Trying to deviate from this is more work than you save from using the generator in the first place. Also, trying to directly take raw text into the parser stage is far too complex and leads to spaghetti code in the parser. Writing a C++ class for each each type of abstract syntax tree (AST) node leads to a large amount of code, in which it is quite easy to lose track of what is going on and is prone to bugs.

What Does Work (For Me)

Recursive Descent Parsers, with a Lexer frontend, and a generic Abstract Syntax Tree seems to work very well for me. Using a lexer as the first stage moves a large amount of complexity from the parser to the lexer, allows for things like token look-ahead, and the parser code is much easier to read because of this.

ASTNode* Parser::parse_real_literal()
{
     if( lex.t(0).type == Lexer::Real ) {
          ASTNode* num = new ASTNode();
          num->type = ASTNode::RealLiteral;
          num->value = lex.t(0).text;
          lex.eat(1);
          return num;
     }   
     return nullptr;
}

The lexer looks like a mess, but that is more because state machines always look like a mess rather than than something inherent in lexers. This is one area that would make some sense to use a code generator for, because it would let you hide the state machine complexity behind something slightly less complex (regular expressions).

bool Lexer::queue_token()
{
    bool do_return = false;
    while( !do_return && next_token.type != Lexer::EndOfFile ) {
        bool fetched = do_fetch;
        read_char(&ch);
        if( end_of_file ) {
            next_token.type = Lexer::EndOfFile;
            next_token.text = "";
            tokenize();
            return true;
        }
        if( global_flags.debug ) {
        printf( "State: %s (%d), ch: '%c' (%d), buffer: '%s'\n", Lexer::state_str(state), (int)state, isgraph(ch) ? ch : '.', (int)ch, buffer ); }

        if( ch == '\n' ) {
            next_token.line_number += 1;
            next_token.column = 1;
        } else if( fetched ) {
            next_token.column += 1;
        }
        switch( state ) {
        case state_e::Ground:
            if( isalpha(ch) || ch == '_' ) {
                state = state_e::Keyword;
            } else if( isdigit(ch) ) {
                state = state_e::Number;
            } else if( isspace(ch) ) {
                pos = 0;
                buffer[0] = 0;
                state = state_e::Ground;
                /* snip */

Using a generic AST Node class onto which any structure can be mapped works considerably better than hand-coding each class. Debug code can be shared between all node types, and there is no need for a class hierarchy. The code emitter can then use a switch() on the node type.

class ASTNode
{
    public:
        enum type_e
        /* snip */
        enum flags_e
        /* snip */

        type_e type;
        std::string value;
        std::deque< ASTNode* > children;
        std::set< flags_e > flags;

        void debug_dump( int i = 0)
        {
            printf( "%.*s|-%s: %s\n", i, INDENT, type_str(type), value.c_str() );
            for( const flags_e& flag : flags ) {
                printf( "%.*s|-Flag: %s\n", i, INDENT, flags_str(flag) );
            }
            if( children.size() > 0 ) {
                printf("%.*s|-+-Children:\n", i, INDENT );
                for( ASTNode* child : children ) {
                    if( child ) {
                        child->debug_dump(i+2);
                    } else {
                        printf("%.*s| |- nullptr\n" );
                    }
                }
            }
        }
        /* snip */
Sort:  

Hello! Your post has been resteemed and upvoted by @ilovecoding because we love coding! Keep up good work! Consider upvoting this comment to support the @ilovecoding and increase your future rewards! ^_^ Steem On!

Reply !stop to disable the comment. Thanks!

Congratulations @teknomunk! You have completed the following achievement on Steemit and have been rewarded with new badge(s) :

Award for the number of upvotes

Click on the badge to view your Board of Honor.
If you no longer want to receive notifications, reply to this comment with the word STOP

To support your work, I also upvoted your post!

Do you like SteemitBoard's project? Then Vote for its witness and get one more award!

Congratulations @teknomunk! You have completed the following achievement on the Steem blockchain and have been rewarded with new badge(s) :

Award for the number of upvotes

Click on the badge to view your Board of Honor.
If you no longer want to receive notifications, reply to this comment with the word STOP

Support SteemitBoard's project! Vote for its witness and get one more award!