In fear of some readers being more pious than others, I'd like to clear some things up.
In other words, stop reading at the first big caption.
In summary, Rusky's trying to get me to use a token tree parser by starting work on one as a proof of concept. I assume he will continue to develop it from time to time, until it actually competes with the one I wrote as a competent proof of concept or until I'm all done. Whichever comes first.
I met it with some pretty harsh skepticism. But it's not because I feel token trees are bad.
A tree is a very useful tool. It will allow you to easily skip over large chunks of data to weed out only what you need. They are used in probably more than 90% of modern parsers. I'm not looking up an actual statistic, sorry math fans.
However, I don't feel they're appropriate for the formatter-type parser.
The entire point of the formatter parser is to add semicolons and weed out variable names by declaration. It doesn't really call for a tree, in my opinion.
Here's a sample code:
a=0b=1
That code is valid GML. So now, how do we deal with it?
Those who are weak of heart should stop reading now.
Token tree method:The code is read into memory in a structure called a tree. A representation is depicted below:
=
├─a
└─0
=
├─b
└─1
To save size, a symbol table is used. It's basically a list of variable names/other things that appear in the code. The table makes comparison of strings from the time of its completion onward operate in constant time by comparing them as an index, not byte for byte. (Means it's efficient)
The code is then rebuilt with all proper punctuation.
My method:The code undergoes a (somewhat) organized series of stream-like modification.
First, the code is reread into its current buffer free of comments and excess whitespace. This won't affect the code above.
Next, a second block of memory is allocated the size of the first. The first is named code, the second synt.
They look like this:
a=0b=1
n=0n=0
Next, the blocks are reallocated at twice the size for all upcoming changes. A pass is made that keeps a look out for patterns of tokens. You'll notice above in the syntax string that variable names have been replaced with n, while numbers have been replaced with 0. This is so only one check has to be made to tell what any given character in the code is a part of.
Notice the "0n". A pass is made to look out for things like that.
More complicated codeTrees are usually, or at least very often, done by using a parser generator of sort, which takes a grammar file that can be easily edited. It organizes the process by giving what I've seen to be a regex-like interface for manipulating what patterns of tokens the parser looks for.
My method's basically a hard-coded version of that. I specify what tokens require a semicolon between them in a separate function, instead of using a generator that reads that info from a text file.
In the code
if 0a=10else b=10
the token tree would do just what it did above, using if as a root of another sublevel, then putting the next = token inside it. It would then look up the rules for what's inside if to see what it puts next.
Mine does this, instead: semis->push(')','s'); write_symbol('(');
Which does nothing but insert a left parenthesis and tell it, next time it needs to insert a semicolon, use ')' instead. This is still done in that pass.
(If I were using only find and replace like I said I could, I'd replace "if(.*);" with "if\($0\)" )Rusky said he didn't add for loops to his proof of concept. Wouldn't make much sense for me to do so, because all I had to do to implement for() was push two semicolons and an end parenthesis onto the stack, like I did above.
Partitioning my parserRusky's other complaint is that my parser is divided into sections. One's responsible for syntax check, one's responsible for C definitions, one's responsible for adding semicolons/other GM-C replacements. I don't really see what there's to criticize about that, but here's a nice reason to do so...
The formatter parser has more passes and gets more attention than both the syntax check and C parsers combined. However... They're what Rusky will call a disaster, and what I'll tell you are faster than greased lightning.
Instead of lexing, they keep a couple variables handy (integers, actually) that I like to refer to as registers. (I may in fact devote a register to each of them, as they are so few.) They let the CFile parser identify what it's going through now. Each C++ key word has an ID. (not dissimilar to the symbol table of the token tree).
In addition to this ID, a second variable is kept indicating mode. I keep track of these mentally, because that's how I goddamn roll. (Ask score_under). That's the part that really makes Rusky cringe; I rarely comment, and I don't document, because I just memorize all the values, or review my thinking to get back to them.
This ID, however, is the difference between "using" and "using namespace", just for example.
( Typedef is handled differently. Typedef's value is 128 or some crap, so I can just OR whatever I have by it to indicate that I'm typedef'ing it. (C and C++ will let you totally whore typedef) )The most complicated thing to deal with is what may require a stack, and that's pointers to arrays to pointers....of functions.
Now, here's the kicker: I use a tree to store scope data. Le gasp, right? Each variable is stored the same way. An integer can have members if I really want to. And I can tell you just from using them there how beautiful a system they can be. But knowing that, I still say a pair of aligned strings is the way to go with a formatter.
Benefits/Downfalls- Tree offers O(1) string comparison. But the parser doesn't need that.
- Tree does all punctuation in one pass. But it's liable to leave out, say, entire else statements. And not understand unary unless begged.
- String requires multiple passes, but offers some find-replace methodology. To find a declaration, for example, I just search "d".
- Tree is more comprehensive. I'm not sure if that's good or bad. I bring up unary again... Luis's parser had a fit on it. And Rusky's won't treat ++ as adding a unary-positive real value, or as incrementing the previous. (But he'll tell you how easy it is to add if you ask him)
- Being component based means that my parser's syntax check only needs to do what's necessary for syntax checking... It never touches the C parser. The formatter needs to know the members of declared classes, so it calls the C parser to get that. Syntax check can just add the name as a type.
Okay, this is the single longest thing I've ever typed up. If you did any more than just skim it, I feel sorry for you.