I am writing a parser for a fairly complicated language in C++. The
Parser class is given a list of tokens and it builds the AST. Though only a part of the parser is completed, the
Parser.cpp file is already more than 1.5k lines and the class has around 25 functions. So, I plan to break the large
Parser class into smaller classes such that I can have separate classes for parsing different language constructs.
For example, I wish to have an ExprParser class that parses expressions and a TypeParser class that parses types. This seems much cleaner. The problem is that the parsing functions need access to shared state, which includes the position of the current token and several parsing helper functions. In C#, partial classes make it possible to split the implementation of a single class across several files. Is there a specific design pattern or recommended way to do this in C++?
Create a Scanner or Tokenizer class that takes the input data (the text to be parsed) and holds the position of the current token or similar state. It can also provide some shared helper functions. Then pass a reference (or a shared pointer) to the Scanner object to each of your individual xyzParser objects, so they all access the same scanner. The “scanner” is responsible only for accessing the data through basic tokenizing functions; the individual parsers are responsible for the actual parsing logic.
This works most easily as long as your scanner does not need to know which individual parsers exist. If the scanner does need to know this, you might consider resolving the cyclic dependency by introducing abstract “interface” base classes, or by implementing some kind of callback or event mechanism through which the scanner can notify any observers.
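A minimal sketch of this arrangement, just to make the idea concrete. Everything here is made up for illustration: the Token struct, the Scanner member names (peek, next, accept), and the toy grammar "Type name = a + b" are assumptions, not part of your code base.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type; a real lexer would also carry a kind and a source location.
struct Token {
    std::string text;
};

// Owns the token list and the current position: the shared parsing state.
class Scanner {
public:
    explicit Scanner(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}

    bool atEnd() const { return pos_ >= tokens_.size(); }

    // Look at the current token without consuming it.
    const Token& peek() const {
        assert(!atEnd());
        return tokens_[pos_];
    }

    // Consume and return the current token.
    const Token& next() {
        assert(!atEnd());
        return tokens_[pos_++];
    }

    // Consume the current token only if its text matches.
    bool accept(const std::string& text) {
        if (!atEnd() && tokens_[pos_].text == text) {
            ++pos_;
            return true;
        }
        return false;
    }

    std::size_t position() const { return pos_; }

private:
    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};

// Parses a type; in this sketch a type is a single identifier token.
class TypeParser {
public:
    explicit TypeParser(Scanner& s) : scanner_(s) {}
    std::string parseType() { return scanner_.next().text; }

private:
    Scanner& scanner_;
};

// Parses "a + b + c" into a flat operand list (a stand-in for a real AST).
class ExprParser {
public:
    explicit ExprParser(Scanner& s) : scanner_(s) {}
    std::vector<std::string> parseSum() {
        std::vector<std::string> operands{scanner_.next().text};
        while (scanner_.accept("+")) operands.push_back(scanner_.next().text);
        return operands;
    }

private:
    Scanner& scanner_;
};

// Parses "Type name = expr" and returns the number of operands in expr.
// Both sub-parsers advance the same Scanner, so no position is passed around.
std::size_t parseDeclaration(Scanner& scanner) {
    TypeParser types(scanner);
    ExprParser exprs(scanner);
    types.parseType();    // the declared type
    scanner.next();       // the variable name
    scanner.accept("=");
    return exprs.parseSum().size();
}
```

The sub-parsers hold a plain reference, so the Scanner has to outlive them; a shared pointer would lift that restriction at the cost of some ceremony. Note that the Scanner knows nothing about ExprParser or TypeParser, which is what keeps the dependency one-directional.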
State design pattern perhaps? It is pretty much straightforward inheritance, with the parent (abstract) class containing a reference to the current “state” object, i.e. the parser.
The pattern coupled with delegates, extension methods, etc. should give plenty of flexibility.
Be wary of breaking apart a class arbitrarily. These smaller classes also need OO integrity. I am not referring to partial classes here.
I particularly like this clean, simple demo video
Quite likely, implementing your grammar as multiple interdependent parsers is only going to make your code more complicated. The data flow will become less obvious, and you will duplicate some behaviour. It is OK if a class is large.
However, many languages can easily be split into different levels, and handling these separately could be sensible. For example:
- You could extract tokenization from the main parser. C has a separate tokenization and preprocessor phase.
- You could do some post-processing in a separate phase that builds the final AST. This is particularly sensible if your parser also checks the semantics, e.g. resolving symbol definitions or doing type checks. Those should be separate from parsing.
- If your language has a strong statement–expression dichotomy, you could have separate parsers for each, with the statement parser calling into the expression parser as needed. Markdown is an example of a language with a line-based grammar (indentation) over a block-level grammar (paragraphs, headings, lists) over an inline grammar (emphasis, links). Some parsers use a simple recursive descent approach for statement-level syntax such as control flow constructs or top-level definitions, but switch to an LR algorithm for expressions to properly handle precedence and associativity.
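The statement/expression split from the last point could look roughly like the sketch below. To keep it short, the expression level uses precedence climbing rather than an LR algorithm, and it evaluates integer expressions directly instead of building AST nodes. The TokenCursor type, the toy "print expr ;" grammar, and all names are assumptions made up for this example, and malformed input is not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical shared cursor over pre-split tokens.
struct TokenCursor {
    std::vector<std::string> tokens;
    std::size_t pos = 0;

    bool atEnd() const { return pos >= tokens.size(); }
    const std::string& peek() const { return tokens[pos]; }
    std::string next() {
        assert(!atEnd());
        return tokens[pos++];
    }
    bool accept(const std::string& t) {
        if (!atEnd() && tokens[pos] == t) {
            ++pos;
            return true;
        }
        return false;
    }
};

// Expression level: precedence climbing over + and * on integer literals.
class ExprParser {
public:
    explicit ExprParser(TokenCursor& c) : cur_(c) {}

    int parse(int minPrec = 1) {
        int lhs = std::stoi(cur_.next());  // primary: integer literal only
        while (!cur_.atEnd()) {
            int prec = precedence(cur_.peek());
            if (prec < minPrec) break;     // also stops at non-operators (prec 0)
            std::string op = cur_.next();
            int rhs = parse(prec + 1);     // prec + 1 makes operators left-associative
            lhs = (op == "+") ? lhs + rhs : lhs * rhs;
        }
        return lhs;
    }

private:
    static int precedence(const std::string& op) {
        if (op == "+") return 1;
        if (op == "*") return 2;
        return 0;  // not a binary operator
    }

    TokenCursor& cur_;
};

// Statement level: plain recursive descent that delegates to ExprParser.
class StmtParser {
public:
    explicit StmtParser(TokenCursor& c) : cur_(c), exprs_(c) {}

    // program := { "print" expr ";" }
    // Returns the value of each printed expression, in order.
    std::vector<int> parseProgram() {
        std::vector<int> printed;
        while (cur_.accept("print")) {
            printed.push_back(exprs_.parse());
            cur_.accept(";");
        }
        return printed;
    }

private:
    TokenCursor& cur_;
    ExprParser exprs_;
};
```

For the token sequence `print 1 + 2 * 3 ; print 4 * 5 + 6 ;`, parseProgram() yields 7 and 26. The statement parser never looks at operators, and the expression parser never looks at keywords, so each can be changed without touching the other.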
I have found it rather advantageous to extract low-level parsing operations into a separate class: handling the input buffer, checking lookaheads, extracting tokens, and handling errors are all better done by a custom class than by relying on the facilities provided by the language (in particular, std::istream is unsuitable for most problems). If you are using a parsing algorithm other than Recursive Descent, you should also handle these operations in a separate class.
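As a rough idea of what such a low-level class might contain, here is a small sketch. The class name SourceReader, its member functions, and the very crude token rule (a run of alphanumerics, or one other character) are all assumptions chosen for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical low-level reader: owns the input buffer, the lookahead,
// line tracking, and error collection.
class SourceReader {
public:
    explicit SourceReader(std::string text) : text_(std::move(text)) {}

    bool atEnd() const { return pos_ >= text_.size(); }

    // Look `offset` characters ahead; returns '\0' past the end of input.
    char peek(std::size_t offset = 0) const {
        return pos_ + offset < text_.size() ? text_[pos_ + offset] : '\0';
    }

    // Consume one character, keeping the line counter up to date.
    char advance() {
        char c = peek();
        if (!atEnd()) {
            ++pos_;
            if (c == '\n') ++line_;
        }
        return c;
    }

    void skipWhitespace() {
        while (std::isspace(static_cast<unsigned char>(peek()))) advance();
    }

    // Extract the next token: a run of alphanumerics, or a single other character.
    // An empty string signals end of input.
    std::string nextToken() {
        skipWhitespace();
        std::string tok;
        while (std::isalnum(static_cast<unsigned char>(peek()))) tok += advance();
        if (tok.empty() && !atEnd()) tok += advance();
        return tok;
    }

    // Record an error tagged with the current line instead of aborting the parse.
    void error(const std::string& message) {
        errors_.push_back("line " + std::to_string(line_) + ": " + message);
    }

    const std::vector<std::string>& errors() const { return errors_; }

private:
    std::string text_;
    std::size_t pos_ = 0;
    std::size_t line_ = 1;
    std::vector<std::string> errors_;
};
```

Because the whole input sits in one buffer, arbitrary lookahead and backtracking reduce to index arithmetic, and every error message can carry a position without the parsing code tracking it.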